<a href="https://colab.research.google.com/github/cesar-ca/Algorithms-and-Simulation/blob/master/CS156_Assignment_2_Lending_Club.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 2

#### Lending Club Data

Lending Club has published anonymized datasets of rejected and approved loan applications.

Lending Club is a platform for crowdfunding loans: investors browse the profiles of people applying for loans and decide whether or not to help fund them.

In this assignment, the aim is to build a model that predicts the largest loan amount that will be successfully funded for any given individual. The model can then be used to advise the applicants on how much they could apply for.

#### Data Cleaning

We load the reject data and the approved data and combine them into one dataset.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Using reject data and approved data

# To use when file is in the drive folder
# reject_data = pd.read_csv('content/drive/rejected_2007_to_2018Q4.csv')
# accept_data = pd.read_csv('content/drive/accepted_2007_to_2018Q4.csv')

# To use when the files are in the working directory
reject_data = pd.read_csv('rejected_2007_to_2018Q4.csv')
accept_data = pd.read_csv('accepted_2007_to_2018Q4.csv')



There is no need to use all the available data, especially for the accept data. A sensible choice is to keep only the columns that can be found in both the reject and the accept data.**

Those are the following:
- Amount Requested
- Loan Title
- Debt-To-Income Ratio
- State
- Employment Length

Other columns found in both datasets, such as Zip Code and Policy Code, are left out: Zip Code carries regional information that is already covered by State, and Policy Code reveals little about the loan itself.

** This was determined after going over the data, i.e., loading the `csv` files with pandas and inspecting the dataframes with methods such as `df.head()`.
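Since only a handful of columns are kept, one option is to ask pandas to parse just those columns at load time. The sketch below uses a tiny invented CSV in place of the real reject file (the rows and the name `toy_csv` are assumptions for illustration); `usecols` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Tiny stand-in for the rejected-loans CSV (rows invented for illustration)
toy_csv = io.StringIO(
    "Amount Requested,Application Date,Loan Title,Debt-To-Income Ratio,"
    "Zip Code,State,Employment Length,Policy Code\n"
    "1000.0,2007-05-26,Wedding,10%,481xx,NM,4 years,0\n"
    "6000.0,2007-05-27,Car,38.64%,021xx,MA,< 1 year,0\n"
)

shared_columns = ['Amount Requested', 'Loan Title', 'Debt-To-Income Ratio',
                  'State', 'Employment Length']

# usecols makes pandas parse only the listed columns, which lowers memory use
reject_small = pd.read_csv(toy_csv, usecols=shared_columns)
print(reject_small.columns.tolist())
```

With the real files, the accept data would need its own list (`loan_amnt`, `title`, `dti`, `addr_state`, `emp_length`), since the two files name the same fields differently.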


In [3]:
# The approach to this problem entails predicting accept/reject status, so
# there is also an additional column to indicate this.
# 'Application Date' is selected only as a placeholder: it is renamed to
# 'Status' and then overwritten with the label 0 (rejected).
# .copy() avoids pandas' SettingWithCopyWarning when modifying the selection.
reject_data = reject_data[['Amount Requested','Application Date','Loan Title','Debt-To-Income Ratio',
                           'State','Employment Length']].copy()

reject_data.rename(columns={'Application Date': 'Status', 'Loan Title': 'Title'}, inplace=True)

reject_data['Status'] = 0

# We do similar preprocessing for both the accept and the reject data,
# to keep our data as similar as possible.
# 'funded_amnt' plays the same placeholder role and is overwritten with 1 (accepted).
accept_data = accept_data[['loan_amnt','funded_amnt','title','dti',
                           'addr_state','emp_length']].copy()

accept_data.rename(columns={'loan_amnt': 'Amount Requested', 'funded_amnt': 'Status',
                            'title': 'Title', 'dti': 'Debt-To-Income Ratio',
                            'addr_state': 'State','emp_length': 'Employment Length'}, inplace=True)

accept_data['Status'] = 1

In [4]:
reject_data

Unnamed: 0,Amount Requested,Status,Title,Debt-To-Income Ratio,State,Employment Length
0,1000.0,0,Wedding Covered but No Honeymoon,10%,NM,4 years
1,1000.0,0,Consolidating Debt,10%,MA,< 1 year
2,11000.0,0,Want to consolidate my debt,10%,MD,1 year
3,6000.0,0,waksman,38.64%,MA,< 1 year
4,1500.0,0,mdrigo,9.43%,MD,< 1 year
...,...,...,...,...,...,...
23261333,10000.0,0,debt_consolidation,36.37%,AR,< 1 year
23261334,7000.0,0,credit_card,44.57%,NY,< 1 year
23261335,20000.0,0,home_improvement,-1%,MN,< 1 year
23261336,2000.0,0,other,21.24%,FL,< 1 year


In [5]:
accept_data

Unnamed: 0,Amount Requested,Status,Title,Debt-To-Income Ratio,State,Employment Length
0,3600.0,1,Debt consolidation,5.91,PA,10+ years
1,24700.0,1,Business,16.06,SD,10+ years
2,20000.0,1,,10.78,IL,10+ years
3,35000.0,1,Debt consolidation,17.06,NJ,10+ years
4,10400.0,1,Major purchase,25.37,PA,3 years
...,...,...,...,...,...,...
2081845,11000.0,1,Debt consolidation,18.79,FL,
2081846,10000.0,1,Debt consolidation,15.82,NV,
2081847,35000.0,1,Credit card refinancing,13.02,GA,4 years
2081848,8000.0,1,Debt consolidation,22.06,NJ,10+ years


The combined `loan_data` built in the next cell contains all available data points from both datasets.

In [6]:
# Having done some pre-processing on the noisy data, we combine the reject and accept data
loan_data = pd.concat([reject_data, accept_data])
loan_data

Unnamed: 0,Amount Requested,Status,Title,Debt-To-Income Ratio,State,Employment Length
0,1000.0,0,Wedding Covered but No Honeymoon,10%,NM,4 years
1,1000.0,0,Consolidating Debt,10%,MA,< 1 year
2,11000.0,0,Want to consolidate my debt,10%,MD,1 year
3,6000.0,0,waksman,38.64%,MA,< 1 year
4,1500.0,0,mdrigo,9.43%,MD,< 1 year
...,...,...,...,...,...,...
2081845,11000.0,1,Debt consolidation,18.79,FL,
2081846,10000.0,1,Debt consolidation,15.82,NV,
2081847,35000.0,1,Credit card refinancing,13.02,GA,4 years
2081848,8000.0,1,Debt consolidation,22.06,NJ,10+ years


In [7]:
loan_data.info(verbose=True, show_counts=True)  # null_counts was renamed to show_counts in pandas 1.2

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25343188 entries, 0 to 2081849
Data columns (total 6 columns):
 #   Column                Non-Null Count     Dtype  
---  ------                --------------     -----  
 0   Amount Requested      25343159 non-null  float64
 1   Status                25343188 non-null  int64  
 2   Title                 25319058 non-null  object 
 3   Debt-To-Income Ratio  25341615 non-null  object 
 4   State                 25343136 non-null  object 
 5   Employment Length     24462534 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.3+ GB


In [8]:
# Dropping datapoints with NaN values
loan_data.dropna(inplace=True)
loan_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24441533 entries, 0 to 2081848
Data columns (total 6 columns):
 #   Column                Non-Null Count     Dtype  
---  ------                --------------     -----  
 0   Amount Requested      24441533 non-null  float64
 1   Status                24441533 non-null  int64  
 2   Title                 24441533 non-null  object 
 3   Debt-To-Income Ratio  24441533 non-null  object 
 4   State                 24441533 non-null  object 
 5   Employment Length     24441533 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.3+ GB


Debt-to-income ratio and employment length are encoded inconsistently across the dataset: the former appears both as a percentage string (e.g., 98.5%) and as a plain number (e.g., 0.47), and the latter as text such as \< 1 year or 10+ years. So, we convert both into strictly numeric data.

In [9]:
# Convert debt-to-income ratio and employment length to numeric data types.
# Note: (\d+) keeps only the first run of digits, so "38.64%" becomes 38.0,
# "-1%" becomes 1.0, and "< 1 year" becomes 1.
loan_data['Debt-To-Income Ratio'] = loan_data['Debt-To-Income Ratio'].astype(str).str.extract(r'(\d+)').astype(float)
loan_data['Employment Length'] = loan_data['Employment Length'].str.extract(r'(\d+)').astype(int)
loan_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24441533 entries, 0 to 2081848
Data columns (total 6 columns):
 #   Column                Non-Null Count     Dtype  
---  ------                --------------     -----  
 0   Amount Requested      24441533 non-null  float64
 1   Status                24441533 non-null  int64  
 2   Title                 24441533 non-null  object 
 3   Debt-To-Income Ratio  24441533 non-null  float64
 4   State                 24441533 non-null  object 
 5   Employment Length     24441533 non-null  int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 1.3+ GB
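The pattern `(\d+)` used in the conversion cell keeps only the first run of digits, so decimals and minus signs are lost and "< 1 year" ends up equal to "1 year". As a possible alternative (not what this notebook ran, so the outputs shown here would differ), the sketch below keeps the sign and decimal part and maps "< 1 year" to 0. The sample values and the choice of 0 are assumptions for illustration.

```python
import pandas as pd

dti_raw = pd.Series(['10%', '38.64%', '-1%', '5.91'])
emp_raw = pd.Series(['< 1 year', '1 year', '4 years', '10+ years'])

# Keep an optional minus sign and the decimal part of the ratio
dti = dti_raw.str.extract(r'(-?\d+(?:\.\d+)?)', expand=False).astype(float)

# Treat "< 1 year" as 0 so that it differs from "1 year" (a modelling assumption)
emp = (emp_raw.str.replace('< 1 year', '0 years', regex=False)
              .str.extract(r'(\d+)', expand=False)
              .astype(int))

print(dti.tolist())
print(emp.tolist())
```

Whether -1 should be kept as a value or treated as a missing-data marker is a separate decision that this sketch does not make.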


In [10]:
# Looking at the categorical variables
loan_data.describe(include = ['O'])

Unnamed: 0,Title,State
count,24441533,24441533
unique,121824,51
top,Debt consolidation,CA
freq,6540782,2912739


In [11]:
# Get value counts of loan title column
loan_titles_value_counts = loan_data['Title'].value_counts()
print(loan_titles_value_counts)

Debt consolidation              6540782
debt_consolidation              4390529
Other                           2511526
Credit card refinancing         2485708
other                           1443324
                                 ...   
Creation Station Fund                 1
rhmyers                               1
Red Sky                               1
Financing for Summer Program          1
saved from debt                       1
Name: Title, Length: 121824, dtype: int64


In [12]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated


In [13]:
rare_loan_titles = loan_data['Title'].isin(loan_titles_value_counts.index[loan_titles_value_counts <= 500])

# Relabel loan titles that occur 500 times or fewer as "Other"
loan_data.loc[rare_loan_titles, 'Title'] = "Other"

print(loan_data['Title'].value_counts())
print(len(loan_data['Title'].unique()))

Debt consolidation           6540782
debt_consolidation           4390529
Other                        2744853
Credit card refinancing      2485708
other                        1443324
credit_card                  1001087
Car financing                687693 
home_improvement             515182 
Home improvement             494310 
Major purchase               473940 
Home buying                  465131 
car                          390991 
Medical expenses             353765 
major_purchase               347088 
Moving and relocation        287901 
moving                       286665 
Business Loan                265325 
medical                      260592 
small_business               248815 
Business                     195186 
house                        135516 
vacation                     128841 
Vacation                     124027 
renewable_energy             23199  
Green loan                   20650  
Debt Consolidation           20510  
wedding                      18130  
 

In [14]:
# The different loan titles that appear can fit in broader categories that are represented
# in the dictionary below, and then mapped into the dataset

loan_titles_dct = {'Business':'Business',
                   'Business Line Of Credit':'Business',
                   'Business Loan':'Business',
                   'Small Business Loan':'Business',
                   'small_business':'Business',
                   'car':'Car',
                   'Car financing':'Car',
                   'Car Loan':'Car',
                   'consolidate':'Consolidation',
                   'Consolidate':'Consolidation',
                   'consolidation':'Consolidation',
                   'Consolidation Loan':'Consolidation',
                   'consolidation loan':'Consolidation',
                   'credit card':'Credit Card',
                   'credit card consolidation':'Credit Card',
                   'Credit Card Consolidation':'Credit Card',
                   'Credit card consolidation':'Credit Card',
                   'Credit Card Loan':'Credit Card',
                   'Credit Card Payoff':'Credit Card',
                   'credit card payoff':'Credit Card',
                   'Credit Card Refinance':'Credit Card',
                   'credit card refinance':'Credit Card',
                   'Credit card refinancing':'Credit Card',
                   'credit cards':'Credit Card',
                   'Credit Cards':'Credit Card',
                   'Credit Consolidation':'Credit Card',
                   'credit_card':'Credit Card',
                   'Payoff':'Credit Card',
                   'payoff':'Credit Card',
                   'Refinance':'Credit Card',
                   'Debt':'Debt Consolidation',
                   'debt':'Debt Consolidation',
                   'debt consolidation':'Debt Consolidation',
                   'Debt consolidation':'Debt Consolidation',
                   'DEBT CONSOLIDATION':'Debt Consolidation',
                   'Debt Consolidation Loan':'Debt Consolidation',
                   'debt consolidation loan':'Debt Consolidation',
                   'Debt Free':'Debt Consolidation',
                   'Debt Loan':'Debt Consolidation',
                   'debt_consolidation':'Debt Consolidation',
                   'Home buying':'Home',
                   'home improvement':'Home',
                   'Home Improvement':'Home',
                   'Home improvement':'Home',
                   'Home Improvement Loan':'Home',
                   'home_improvement':'Home',
                   'house':'Home',
                   'Major purchase':'Major Purchase',
                   'major_purchase':'Major Purchase',
                   'medical':'Medical Expenses',
                   'Medical expenses':'Medical Expenses',
                   'moving':'Relocation',
                   'Moving and relocation':'Relocation',
                   'other':'Other',
                   'Other':'Other',
                   'personal':'Personal Loan',
                   'Personal':'Personal Loan',
                   'personal loan':'Personal Loan',
                   'Personal loan':'Personal Loan',
                   'my loan':'Personal Loan',
                   'My Loan':'Personal Loan',
                   'Freedom':'Personal Loan',
                   'freedom':'Personal Loan',
                   'loan':'Personal Loan',
                   'Loan':'Personal Loan',
                   'Vacation':'Personal Loan',
                   'vacation':'Personal Loan',
                   'renewable_energy':'Environment',
                   'Green loan':'Environment',
                   'Student Loan':'Student Loan',
                   'educational':'Student Loan',
                   'wedding':'Wedding Loan',
                   'Wedding':'Wedding Loan',
                   'Wedding Loan': 'Wedding Loan'
                   }

loan_data['Title'] = loan_data['Title'].map(loan_titles_dct) 

print(loan_data['Title'].value_counts())

Debt Consolidation    10947416
Other                 4188177 
Credit Card           3501861 
Home                  1614690 
Car                   1079316 
Major Purchase        821028  
Business              713046  
Medical Expenses      614357  
Relocation            574566  
Personal Loan         264499  
Environment           43849   
Wedding Loan          19453   
Consolidation         8654    
Student Loan          2817    
Name: Title, dtype: int64
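Two things are worth noting about the mapping step. First, `.map()` leaves any title that is not a key of the dictionary as `NaN`; the category counts printed above sum to about 48,000 fewer rows than the 24,441,533 in `loan_data`. Second, many dictionary keys differ only in case or in underscores. A minimal sketch of one way to handle both, assuming a normalized (lowercase, spaces) key set and an "Other" fallback, neither of which is what the notebook ran:

```python
import pandas as pd

titles = pd.Series(['Debt consolidation', 'debt_consolidation', 'DEBT CONSOLIDATION',
                    'credit_card', 'waksman'])

# One key per category variant, written in normalized form (lowercase, spaces)
normalized_map = {'debt consolidation': 'Debt Consolidation',
                  'credit card': 'Credit Card'}

normalized = (titles.str.lower()
                    .str.replace('_', ' ', regex=False)
                    .str.strip())

# Titles missing from the mapping become NaN after .map(); here they fall back to "Other"
categories = normalized.map(normalized_map).fillna('Other')
print(categories.tolist())
```

The same idea would let the long dictionary shrink considerably, at the cost of deciding explicitly what happens to unmatched titles.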


In [15]:
# Importing libraries for relevant pre-processing (not used)
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [16]:
# One-hot encoding for categorical variables with pandas dummies
loan_title = pd.get_dummies(loan_data.Title)
loan_state = pd.get_dummies(loan_data.State)

In [17]:
# Checking one-hot encoding worked for categorical value for loan titles
loan_title.head()

Unnamed: 0,Business,Car,Consolidation,Credit Card,Debt Consolidation,Environment,Home,Major Purchase,Medical Expenses,Other,Personal Loan,Relocation,Student Loan,Wedding Loan
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [18]:
# Checking one-hot encoding worked for categorical values for states
loan_state.head()

Unnamed: 0,AK,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
# The full data now contains all relevant variables including one-hot encoded categorical variables
full_data = pd.concat([loan_data, loan_title, loan_state], axis = 1)

In [20]:
full_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24441533 entries, 0 to 2081848
Data columns (total 71 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Status                int64  
 2   Title                 object 
 3   Debt-To-Income Ratio  float64
 4   State                 object 
 5   Employment Length     int64  
 6   Business              uint8  
 7   Car                   uint8  
 8   Consolidation         uint8  
 9   Credit Card           uint8  
 10  Debt Consolidation    uint8  
 11  Environment           uint8  
 12  Home                  uint8  
 13  Major Purchase        uint8  
 14  Medical Expenses      uint8  
 15  Other                 uint8  
 16  Personal Loan         uint8  
 17  Relocation            uint8  
 18  Student Loan          uint8  
 19  Wedding Loan          uint8  
 20  AK                    uint8  
 21  AL                    uint8  
 22  AR                    uint8  
 23  AZ    

In [21]:
# The following code was used to balance the number of rejections and approvals
# in the dataset. However, it stopped working, so the balanced dataset was
# stored in a csv file, which is the one used to run the machine learning model.
# Note that full_data_rejections and full_data_approvals are not defined in the
# cells shown above.

#sample_for_model = 80000

#reject_sample = full_data_rejections.sample(n = sample_for_model)
#accept_sample = full_data_approvals.sample(n = sample_for_model)

#data_for_model = pd.concat([reject_sample, accept_sample])
#len(data_for_model)
#data_for_model.to_csv('data_for_model.csv')
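For reference, a balanced sample of this kind can be drawn by splitting on `Status` and sampling the same number of rows from each class. The sketch below is an assumption about how that could look, not a reconstruction of the code that produced `data_for_model.csv`; it uses a toy frame (`toy_full_data`) in place of `full_data`.

```python
import pandas as pd

# Toy stand-in for full_data: 6 rejections (Status 0) and 3 approvals (Status 1)
toy_full_data = pd.DataFrame({
    'Amount Requested': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
    'Status': [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

sample_per_class = 3  # the commented-out notebook code used 80000

rejections = toy_full_data[toy_full_data['Status'] == 0]
approvals = toy_full_data[toy_full_data['Status'] == 1]

# random_state makes the draw reproducible
balanced = pd.concat([rejections.sample(n=sample_per_class, random_state=0),
                      approvals.sample(n=sample_per_class, random_state=0)])

print(balanced['Status'].value_counts().to_dict())
```

Saving with `to_csv(..., index=False)` would also avoid the extra `Unnamed: 0` column that shows up when the file is read back.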

### Modeling & Evaluation

In this section, the aim is to make decisions about the model's hyperparameters. From the machine learning library `sklearn` we can use the `LogisticRegression` estimator as well as `LogisticRegressionCV`, which selects the regularization strength by cross-validation. The idea is to achieve better performance on unseen data, so we keep separate training and test data and avoid leakage between them.

In [22]:
data_model = pd.read_csv('data_for_model.csv')

In [23]:
data_model.head()

Unnamed: 0.1,Unnamed: 0,Amount Requested,Status,Title,Debt-To-Income Ratio,State,Employment Length,Business,Car,Credit Card,Debt Consolidation,Environment,Home,Major Purchase,Medical Expenses,Other,Personal Loan,Relocation,Student Loan,Wedding Loan,AK,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,393933,25000.0,0,Home,7.0,NY,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,793896,25000.0,0,Debt Consolidation,44.0,OH,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,813364,2000.0,0,Debt Consolidation,80.0,CA,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,988795,15000.0,0,Debt Consolidation,64.0,CA,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1507704,1200.0,0,Debt Consolidation,0.0,MI,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [24]:
data_model['Unnamed: 0'].head()

0    393933 
1    793896 
2    813364 
3    988795 
4    1507704
Name: Unnamed: 0, dtype: int64

In [25]:
from sklearn.model_selection import train_test_split

# Dataset has independent variables (X) and dependent variable (y) used to model and predict
X = data_model.drop(columns=['Unnamed: 0','Status','Title','State'])
y = data_model['Status']

# Split into training (60%) and test (40%) sets, stratified by class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify = y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(96000, 67) (64000, 67) (96000,) (64000,)


In [26]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

# Difference between models is the penalty assigned
regression_lasso = LogisticRegressionCV(Cs=10, solver='liblinear', penalty='l1').fit(X_train, y_train)
regression_ridge = LogisticRegressionCV(Cs=10, solver='liblinear', penalty='l2').fit(X_train, y_train)



In [27]:
lasso_model_summary = [str(regression_lasso), regression_lasso.Cs_,
                       regression_lasso.C_, regression_lasso.scores_[1], 
                       np.mean(regression_lasso.scores_[1]), 
                       classification_report(y_train, regression_lasso.predict(X_train))]

for i in lasso_model_summary:
  print(f'{i}\n')

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l1', random_state=None, refit=True, scoring=None,
                     solver='liblinear', tol=0.0001, verbose=0)

[1.00000000e-04 7.74263683e-04 5.99484250e-03 4.64158883e-02
 3.59381366e-01 2.78255940e+00 2.15443469e+01 1.66810054e+02
 1.29154967e+03 1.00000000e+04]

[21.5443469]

[[0.70916667 0.77838542 0.8103125  0.81489583 0.81614583 0.81723958
  0.81703125 0.8171875  0.81640625 0.815625  ]
 [0.70291667 0.77526042 0.80651042 0.81151042 0.81296875 0.81255208
  0.81338542 0.8125     0.81328125 0.81328125]
 [0.70286458 0.76739583 0.79770833 0.8021875  0.803125   0.80473958
  0.80494792 0.8028125  0.80479167 0.80473958]
 [0.70151042 0.76994792 0.79994792 0.80473958 0.80536458 0.80552083
  0.80536458 0.80546875 0.80541667 0.80526042]
 [0.692395

In [28]:
ridge_model_summary = [str(regression_ridge), regression_ridge.Cs_,
                       regression_ridge.C_, regression_ridge.scores_[1], 
                       np.mean(regression_ridge.scores_[1]), 
                       classification_report(y_train, regression_ridge.predict(X_train))]

for i in ridge_model_summary:
  print(f'{i}\n')                    

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='liblinear', tol=0.0001, verbose=0)

[1.00000000e-04 7.74263683e-04 5.99484250e-03 4.64158883e-02
 3.59381366e-01 2.78255940e+00 2.15443469e+01 1.66810054e+02
 1.29154967e+03 1.00000000e+04]

[1291.54966501]

[[0.77546875 0.78473958 0.81       0.81541667 0.81578125 0.81588542
  0.81588542 0.81588542 0.81614583 0.81604167]
 [0.77010417 0.788125   0.80041667 0.81098958 0.81140625 0.81291667
  0.81145833 0.81192708 0.81197917 0.81291667]
 [0.76458333 0.77947917 0.79692708 0.80119792 0.80140625 0.79911458
  0.8015625  0.79911458 0.80161458 0.7990625 ]
 [0.764375   0.7821875  0.79036458 0.80375    0.80473958 0.8040625
  0.80473958 0.80473958 0.80463542 0.79885417]
 [0.7622

In [29]:
y_pred1 = regression_lasso.predict(X_test)
print(f"Accuracy score: {regression_lasso.score(X_test,y_test)}\n")
print(f"Classification report: \n{classification_report(y_test, y_pred1)}")

Accuracy score: 0.807796875

Classification report: 
              precision    recall  f1-score   support

           0       0.78      0.86      0.82     32000
           1       0.85      0.75      0.80     32000

    accuracy                           0.81     64000
   macro avg       0.81      0.81      0.81     64000
weighted avg       0.81      0.81      0.81     64000



In [30]:
y_pred2 = regression_ridge.predict(X_test)
print(f"Accuracy score: {regression_ridge.score(X_test, y_test)}\n")
print(f"Classification report: \n{classification_report(y_test, y_pred2)}")

Accuracy score: 0.799078125

Classification report: 
              precision    recall  f1-score   support

           0       0.77      0.86      0.81     32000
           1       0.84      0.74      0.79     32000

    accuracy                           0.80     64000
   macro avg       0.80      0.80      0.80     64000
weighted avg       0.80      0.80      0.80     64000



### Predicting loan applicant approval or rejection probability

Using the two models above, logistic regression with an l1 penalty (lasso) and with an l2 penalty (ridge), the aim is to estimate whether a requested loan amount will be successfully funded for a given individual. The predicted funding probability depends, to varying degrees, on the independent variables: amount requested, debt-to-income ratio, employment length, state, and loan title.

The models can then be used to advise applicants, such as the hypothetical ones below, based on these predictors.

In [31]:
loan_applicant1 = {'Amount Requested':10000,'Debt-To-Income Ratio':10,'Employment Length':8,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant1 = pd.DataFrame(loan_applicant1, index = [0])

applicant1_pred1 = regression_lasso.predict_proba(applicant1)
print(applicant1_pred1[0][1])

applicant1_pred2 = regression_ridge.predict_proba(applicant1)
print(applicant1_pred2[0][1])

0.9380791633019927
0.9302185335023914


In [32]:
loan_applicant2 = {'Amount Requested':80000,'Debt-To-Income Ratio':10,'Employment Length':8,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant2 = pd.DataFrame(loan_applicant2, index = [0])

applicant2_pred1 = regression_lasso.predict_proba(applicant2)
print(applicant2_pred1[0][1])

applicant2_pred2 = regression_ridge.predict_proba(applicant2)
print(applicant2_pred2[0][1])

0.9572034230005594
0.9523568902208699


For loan applicants 1 and 2, the only difference is the amount requested (10,000 versus 80,000). Despite this large increase, the predicted approval probability barely changes: both models put it above 90% in each case, and it is in fact slightly higher for the larger request.
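Since the stated aim is the largest amount likely to be funded, one way to use a classifier for that is to scan a grid of candidate amounts and keep the largest one whose predicted approval probability stays above a chosen threshold. The sketch below is only an illustration: `largest_fundable_amount`, the threshold of 0.8, and `toy_probability` (a made-up curve that falls as the amount grows) are all assumptions. With the notebook's models, the probability function would wrap `predict_proba` on a one-row frame.

```python
import math

def largest_fundable_amount(approval_probability, amounts, threshold=0.8):
    """Return the largest amount whose predicted probability meets the threshold, or None."""
    fundable = [a for a in amounts if approval_probability(a) >= threshold]
    return max(fundable) if fundable else None

def toy_probability(amount):
    # Hypothetical stand-in for a fitted model: probability falls as the amount grows
    return 1.0 / (1.0 + math.exp((amount - 30000) / 5000))

candidate_amounts = range(1000, 50001, 1000)
best = largest_fundable_amount(toy_probability, candidate_amounts, threshold=0.8)
print(best)
```

Note that the two fitted models above assigned a slightly higher probability to the 80,000 request than to the 10,000 one, so for them such a scan would simply return the top of the grid.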

In [33]:
loan_applicant3 = {'Amount Requested':20000,'Debt-To-Income Ratio':10,'Employment Length':2,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant3 = pd.DataFrame(loan_applicant3, index = [0])

applicant3_pred1 = regression_lasso.predict_proba(applicant3)
print(applicant3_pred1[0][1])

applicant3_pred2 = regression_ridge.predict_proba(applicant3)
print(applicant3_pred2[0][1])

0.5083943629856708
0.4821230639594911


In [34]:
loan_applicant4 = {'Amount Requested':20000,'Debt-To-Income Ratio':10,'Employment Length':8,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant4 = pd.DataFrame(loan_applicant4, index = [0])

applicant4_pred1 = regression_lasso.predict_proba(applicant4)
print(applicant4_pred1[0][1])

applicant4_pred2 = regression_ridge.predict_proba(applicant4)
print(applicant4_pred2[0][1])

0.9412341986434318
0.9338832780821806


For loan applicants 3 and 4, the only difference is the length of employment: 2 years versus 8 years. This 6-year difference raises the predicted probability of being funded substantially. Someone with only 2 years has about a 50% chance of getting funded, while someone with 8 years has a chance of about 94%.

In [35]:
loan_applicant5 = {'Amount Requested':10000,'Debt-To-Income Ratio':10,'Employment Length':6,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant5 = pd.DataFrame(loan_applicant5, index = [0])

applicant5_pred1 = regression_lasso.predict_proba(applicant5)
print(applicant5_pred1[0][1])

applicant5_pred2 = regression_ridge.predict_proba(applicant5)
print(applicant5_pred2[0][1])

0.8587112637612161
0.8433755902849562


In [36]:
loan_applicant6 = {'Amount Requested':10000,'Debt-To-Income Ratio':40,'Employment Length':6,
                  'Business':0,'Car':0,'Credit Card':0,'Debt Consolidation':1,
                  'Environment':0,'Home':0,'Major Purchase':0,'Medical Expenses':0,
                  'Other':0,'Personal Loan':0,'Relocation':0,'Student Loan':0,'Wedding Loan':0,
                  'AK':0,'AL':0, 'AR':0, 'AZ':0,'CA':1,'CO':0,'CT':0,'DC':0,'DE':0,
                  'FL':0,'GA':0,'HI':0,'IA':0,'ID':0,'IL':0,'IN':0,'KS':0,'KY':0,'LA':0,'MA':0,'MD':0,
                  'ME':0,'MI':0,'MN':0,'MO':0,'MS':0,'MT':0,'NC':0,'ND':0,'NE':0,'NH':0,
                  'NJ':0,'NM':0,'NV':0,'NY':0,'OH':0,'OK':0,'OR':0,'PA':0,'RI':0,'SC':0,
                  'SD':0,'TN':0,'TX':0,'UT':0,'VA':0,'VT':0,'WA':0,'WI':0,'WV':0,'WY':0
                  }

applicant6 = pd.DataFrame(loan_applicant6, index = [0])

applicant6_pred1 = regression_lasso.predict_proba(applicant6)
print(applicant6_pred1[0][1])

applicant6_pred2 = regression_ridge.predict_proba(applicant6)
print(applicant6_pred2[0][1])

0.7577948331410492
0.7459627142274593


For applicants 5 and 6, the only difference is the debt-to-income ratio, which describes how much of the applicant's income goes toward paying off debt. Someone with a lower ratio of 10 has about an 85% chance of being funded, while someone with a higher ratio of 40 has about a 75% chance.
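Because logistic regression is linear in the log-odds, each of the three pairwise comparisons above implies a slope: the change in log-odds between two applicants who differ in one feature, divided by the change in that feature. The sketch below does this arithmetic with the lasso probabilities printed above. These are implied values only; the notebook does not print the fitted coefficients, so nothing here is read from `coef_`.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Lasso probabilities copied from the outputs above
p_amount_10k, p_amount_80k = 0.9380791633019927, 0.9572034230005594  # applicants 1, 2
p_emp_2y, p_emp_8y = 0.5083943629856708, 0.9412341986434318          # applicants 3, 4
p_dti_10, p_dti_40 = 0.8587112637612161, 0.7577948331410492          # applicants 5, 6

# Change in log-odds per unit of the feature that differs within each pair
per_dollar = (logit(p_amount_80k) - logit(p_amount_10k)) / (80000 - 10000)
per_year = (logit(p_emp_8y) - logit(p_emp_2y)) / (8 - 2)
per_dti_point = (logit(p_dti_40) - logit(p_dti_10)) / (40 - 10)

print(f'per dollar requested: {per_dollar:.2e}')
print(f'per year employed:    {per_year:.3f}')
print(f'per DTI point:        {per_dti_point:.4f}')
```

The slopes are on very different scales (dollars, years, percentage points), so they are easier to compare over the ranges used in the examples: 6 extra years of employment moves the log-odds far more than 30 extra points of debt-to-income ratio or 70,000 extra dollars requested.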

The two Lending Club models, which differ only in their penalty (lasso versus ridge), achieved very similar accuracy, precision, and recall on the test set, with accuracies of about 0.81 and 0.80 respectively. As a further improvement, the models could be modified to increase these metrics, which would make them better suited to the loan funding problem explored here.
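As a quick check on how the reported metrics relate, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the two test-set reports above; the helper name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the test-set reports above
lasso = {0: (0.78, 0.86), 1: (0.85, 0.75)}
ridge = {0: (0.77, 0.86), 1: (0.84, 0.74)}

lasso_f1 = {label: round(f1(p, r), 2) for label, (p, r) in lasso.items()}
ridge_f1 = {label: round(f1(p, r), 2) for label, (p, r) in ridge.items()}
print(lasso_f1, ridge_f1)
```

The recomputed values agree with the f1-score column of both reports, and they show the same pattern in each model: rejections are recalled more often than approvals.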

To varying degrees, the different independent variables predict the probability that an individual's loan will be funded. Employment length and debt-to-income ratio seem to have the most effect on loans being funded.