## Common Fields

This script investigates common fields in the loans and listings data from Prosper. There are more than 500 fields, but many are frequently empty because they do not apply to every application. Here, I identify 88 listing fields that are populated in at least 99% of the data records (the threshold used in the code below).

The idea is that models built on these fields will be useful for evaluating most loan listings. There may be better fields for evaluating special cases, which we can investigate elsewhere.

In [3]:
import pandas as pd
from db_conn import DBConn

In [4]:
# set up the connection to the database
con = DBConn(<db creds>)    # placeholder: fill in credentials before running
con = con.connect()

In [7]:
# load the merged training table, excluding rows where term = 12
data = pd.read_sql('select * from merged_train where term<>12', con)

In [37]:
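# non-null count per column, most complete columns first
cnts = data.count().sort_values(ascending=False)
# keep columns that are populated in at least 99% of the rows
common_fields = cnts[cnts >= len(data)*.99]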

In [38]:
len(common_fields)

106
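The thresholding step can be tried out on a small made-up table. The sketch below wraps the same `count()` comparison in a hypothetical helper, `fields_present_at_least`, so the cutoff can be varied. The toy column `rare_field` and all values are illustrative assumptions, not Prosper data.

```python
import numpy as np
import pandas as pd


def fields_present_at_least(df, min_fraction):
    """Return non-null counts for columns populated in at least min_fraction of rows.

    Hypothetical helper: it mirrors the count-and-compare step used above.
    """
    counts = df.count().sort_values(ascending=False)  # count() ignores NaN/None
    return counts[counts >= len(df) * min_fraction]


# Toy stand-in for the merged table (values are made up).
toy = pd.DataFrame({
    "loan_number": [1, 2, 3, 4],
    "borrower_rate": [0.10, 0.12, np.nan, 0.09],
    "rare_field": [None, None, None, "abc"],
})

# With a 75% cutoff, rare_field (1 of 4 rows populated) is dropped.
kept = fields_present_at_least(toy, 0.75)
print(kept)
```

Lowering or raising `min_fraction` trades coverage against the number of fields available to a model, which is the choice the 99% cutoff makes here.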

In [39]:
pd.options.display.max_rows = 200

In [40]:
common_fields    # populated in at least 99% of rows (threshold set above)

loan_number                      232261
service_fees_paid                232261
simple_return                    232261
payment                          232261
next_payment_due_amount          232261
loan_status_description          232261
debt_sale_proceeds_received      232261
late_fees_paid                   232261
prosper_fees_paid                232261
interest_paid                    232261
principal_paid                   232261
loan_status                      232261
principal_balance                232261
origination_date                 232261
age_in_months                    232261
term                             232261
prosper_rating                   232261
borrower_rate                    232261
amount_borrowed                  232261
analysis_class                   232261
days_past_due                    232261
next_payment_due_date            231073
income_range_description         231070
effective_yield                  231070
borrower_apr                     231070


In [28]:
# get the column names of the listings table (one row is enough to read them)
listings_fields = pd.read_sql('select * from listings limit 1', con).columns

In [29]:
common_listing_fields = common_fields[common_fields.index.intersection(listings_fields)]
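As a minimal illustration of the filtering step above, the sketch below applies `Index.intersection` to a made-up counts Series and a made-up list of listing columns. The names and numbers are assumptions chosen for the example only.

```python
import pandas as pd

# Toy non-null counts, indexed by field name (values are made up).
counts = pd.Series({"loan_number": 4, "borrower_rate": 4, "prosper_rating": 3})

# Toy column index standing in for the listings table's columns.
listing_columns = pd.Index(["prosper_rating", "borrower_rate", "listing_status"])

# Field names that appear in both indexes.
shared = counts.index.intersection(listing_columns)

# Selecting by that index keeps only the counts for listing fields.
listing_counts = counts[shared]
print(listing_counts)
```

Fields that exist only on the loan side (`loan_number` here) fall out, and listing columns that did not pass the completeness threshold (`listing_status` here) never enter the result.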

In [41]:
common_listing_fields.sort_values(ascending=False)

prosper_rating                   232261
borrower_rate                    232261
last_updated_date                231070
effective_yield                  231070
prior_prosper_loans              231070
prior_prosper_loans_active       231070
employment_status_description    231070
dti_wprosper_loan                231070
income_verifiable                231070
stated_monthly_income            231070
income_range_description         231070
income_range                     231070
listing_category_id              231070
listing_monthly_payment          231070
borrower_apr                     231070
lender_yield                     231070
investment_type_description      231070
estimated_loss_rate              231070
estimated_return                 231070
partial_funding_indicator        231070
percent_funded                   231070
amount_remaining                 231070
amount_funded                    231070
listing_status_reason            231070
listing_status                   231070


In [42]:
len(common_listing_fields)

88

In [33]:
common_listing_fields.to_pickle('common_listing_fields.pkl')
common_listing_fields.to_csv('common_listing_fields.csv')
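One way a later script might reuse the saved fields is to reload the pickle and use its index to select columns. The sketch below assumes that workflow and round-trips a toy Series through a temporary directory rather than touching the real files. The DataFrame contents and `rare_field` are made up.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for common_listing_fields (counts are made up).
fields = pd.Series({"prosper_rating": 4, "borrower_rate": 4})

# Write and reload the pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "common_listing_fields.pkl")
    fields.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Toy listings data containing one field that is not in the saved list.
toy = pd.DataFrame({
    "prosper_rating": ["A", "B"],
    "borrower_rate": [0.10, 0.12],
    "rare_field": [None, 1.0],
})

# The reloaded index holds the field names, so it can select columns directly.
model_frame = toy[reloaded.index]
print(model_frame.columns.tolist())
```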