# Project description

Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/company/about-us?).

Each borrower completes a comprehensive application, providing their financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using historical data and its own data science process to assign an interest rate to the borrower. The interest rate is the percentage the borrower has to pay back on top of the requested loan amount. You can read more about the interest rate that Lending Club assigns [here](https://www.lendingclub.com/loans/personal-loans/rates-fees). Lending Club also tries to verify the information the borrower provides, but it can't verify all of it (usually for regulatory reasons).

A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a [grade](https://www.lendingclub.com/investing/investor-education/interest-rates-and-fees) according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the [origination](https://help.lendingclub.com/hc/en-us/articles/214463677) fee that Lending Club charges.

The borrower makes monthly payments back to Lending Club over either 36 months or 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off before they start seeing a return. If a loan is fully paid off on time, the investors make a return that corresponds to the interest the borrower paid in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.
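
To make the payment arithmetic concrete, here is a minimal sketch of a fixed monthly installment. It assumes the standard amortization (annuity) formula; the source does not state how Lending Club computes installments, and the function name and the loan figures are hypothetical.

```python
def monthly_installment(principal: float, annual_rate_pct: float, term_months: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (standard annuity formula)."""
    monthly_rate = annual_rate_pct / 100 / 12
    if monthly_rate == 0:
        # No interest: the principal is simply split evenly across the term.
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


# Hypothetical loan: 5,000 borrowed at 10.65% per year over 36 months.
principal, rate_pct, term = 5000, 10.65, 36
payment = monthly_installment(principal, rate_pct, term)
total_repaid = payment * term

print(f"monthly payment: {payment:.2f}")
print(f"total repaid:    {total_repaid:.2f}")
print(f"total interest:  {total_repaid - principal:.2f}")
```

The difference between the total repaid and the principal is the interest that gets passed on to the investors when the loan is paid off on schedule.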

While Lending Club has to be extremely savvy and rigorous with its credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even with a weak financial history, they can make more money through the larger additional amount the borrower has to pay.
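
The trade-off between interest and risk can be sketched with a deliberately simplified expected-value calculation. This is an illustration only: it assumes a single repayment at the end and treats a default as a fixed loss of the invested amount, and all rates and probabilities below are hypothetical, not taken from the dataset.

```python
def expected_return(rate: float, p_repaid: float, loss_given_default: float = 1.0) -> float:
    """Expected return per unit invested: earn `rate` if repaid, lose a fraction otherwise."""
    return p_repaid * rate - (1 - p_repaid) * loss_given_default


def break_even_probability(rate: float, loss_given_default: float = 1.0) -> float:
    """Repayment probability at which the expected return is exactly zero."""
    return loss_given_default / (rate + loss_given_default)


# Hypothetical loans: a safe low-interest loan and a risky high-interest loan.
low = expected_return(rate=0.06, p_repaid=0.97)
high = expected_return(rate=0.20, p_repaid=0.85)

print(f"low-interest loan:  {low:.4f}")
print(f"high-interest loan: {high:.4f}")
print(f"break-even repayment probability at 20%: {break_even_probability(0.20):.3f}")
```

Under these assumptions a higher rate only pays off if the repayment probability stays above the break-even point, which is why predicting repayment well matters.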

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this project, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.


Lending Club periodically releases data for all of the approved and declined loan applications on their [website](https://www.lendingclub.com/investing/peer-to-peer).

In this project, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. You'll find the dataset in 'data/loans_2007.csv'.

You'll also find a [data dictionary](data/LCDataDictionary.xlsx) (in XLSX format) which describes the different columns.


# Problem Statement

We would like to build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not.

# Instructions

1. Read and explore the dataset.
2. Perform data cleaning tasks that are useful to model our problem.
3. Define what features we want to use and which column represents the target column we want to predict. 
4. Perform necessary data preparation to start training machine learning models.
5. Build a model:

    a. Make predictions about whether or not a loan will be paid off on time.

    b. Our objective is to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money. Select an error metric that will help us figure out when our model is performing well, and when it's performing poorly.
6. Evaluate your model.
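
For step 5b, one possible choice of error metric is the pair of true positive rate and false positive rate. The sketch below assumes the label 1 means "paid off" and 0 means "not paid off"; the helper names and the toy labels are hypothetical. For a conservative investor, a false positive (funding a loan that is not paid off) is the costly mistake.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating 1 as 'paid off' and 0 as 'not paid off'."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # funded a loan that was paid off
        elif predicted == 1 and actual == 0:
            fp += 1  # funded a loan that was not paid off (loses money)
        elif predicted == 0 and actual == 0:
            tn += 1  # skipped a loan that was not paid off
        else:
            fn += 1  # skipped a loan that would have been paid off
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Return (accuracy, true positive rate, false positive rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return accuracy, tpr, fpr


# Toy labels: 8 loans paid off, 2 not; the "model" funds every loan.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_fund = [1] * len(y_true)

accuracy, tpr, fpr = rates(y_true, always_fund)
print(f"accuracy={accuracy:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

In this toy case, funding everything looks accurate but never rejects a bad loan, so accuracy alone can hide exactly the errors that cost the investor money.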

In [122]:
# 1. Reading and exploring the dataset
import pandas as pd
import numpy as np

df = pd.read_csv('data/loans_2007.csv')
df.info()  # info() prints the summary itself and returns None, so no print() is needed
df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          42538 non-null  object 
 1   member_id                   42535 non-null  float64
 2   loan_amnt                   42535 non-null  float64
 3   funded_amnt                 42535 non-null  float64
 4   funded_amnt_inv             42535 non-null  float64
 5   term                        42535 non-null  object 
 6   int_rate                    42535 non-null  object 
 7   installment                 42535 non-null  float64
 8   grade                       42535 non-null  object 
 9   sub_grade                   42535 non-null  object 
 10  emp_title                   39909 non-null  object 
 11  emp_length                  41423 non-null  object 
 12  home_ownership              42535 non-null  object 
 13  annual_inc                  425



Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')
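
As a complement to `df.info()`, a compact way to see which columns need cleaning is a per-column summary of missing values. The sketch below uses a small hypothetical frame (`toy`) as a stand-in for `df`; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loans DataFrame.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, np.nan],
    "emp_length": ["10+ years", None, "< 1 year"],
    "grade": ["B", "C", "C"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(summary)
```

Running the same three statements on `df` instead of `toy` would list every column with its missing-value count in one table.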

In [123]:
# 2 & 3. Data cleaning and feature selection
from sklearn.preprocessing import LabelEncoder

# Keep the candidate features plus the target. .copy() makes df_new an independent
# frame, so the in-place edits below do not trigger SettingWithCopyWarning.
df_new = df[['member_id', 'loan_amnt', 'term', 'grade', 'emp_length', 'home_ownership', 'annual_inc',
             'issue_d', 'loan_status', 'purpose',
             'addr_state', 'delinq_2yrs',
             'inq_last_6mths', 'open_acc',
             'total_pymnt', 'total_rec_int',
             'total_rec_late_fee', 'acc_now_delinq', 'tax_liens']].copy()
# NOTE: total_pymnt, total_rec_int and total_rec_late_fee record payments received
# after the loan was issued, so an investor would not know them when choosing a loan.
# They likely leak the outcome and inflate the scores reported below.

# Drop rows with no issue date, then the column itself (it is only used for this filter).
df_new.dropna(subset=['issue_d'], inplace=True)
df_new.drop(columns=['issue_d'], inplace=True)

# Columns that still contain missing values at this point (checked with
# df_new[col].isna().values.any()): emp_length, annual_inc, delinq_2yrs,
# inq_last_6mths, open_acc, acc_now_delinq, tax_liens.
print(len(df_new))
df_new.dropna(inplace=True)
print(len(df_new))

# Drop 'Current' and 'Default' loans.
df_new = df_new[~df_new['loan_status'].isin(['Current', 'Default'])].copy()

# Collapse the remaining statuses into 'Fully Paid' / 'Charged Off'.
# regex=False matches the text literally, so '.' and '(' need no escaping.
status_replacements = {
    'Does not meet the credit policy. Status:Fully Paid': 'Fully Paid',
    'Does not meet the credit policy. Status:Charged Off': 'Charged Off',
    'In Grace Period': 'Charged Off',
    'Late (31-120 days)': 'Charged Off',
    'Late (16-30 days)': 'Charged Off',
}
for old, new in status_replacements.items():
    df_new['loan_status'] = df_new['loan_status'].str.replace(old, new, regex=False)
print(df_new['loan_status'].unique())

# Label-encode the categorical columns. LabelEncoder numbers classes in sorted
# order, so 'Charged Off' becomes 0 and 'Fully Paid' becomes 1.
list_to_encode = ['term', 'grade', 'emp_length', 'home_ownership', 'loan_status', 'purpose',
                  'addr_state']
labelencoder = LabelEncoder()
df_new.set_index('member_id', inplace=True)
for col in list_to_encode:
    df_new[col] = labelencoder.fit_transform(df_new[col].astype('string'))
df_new.head()



42535
41318




['Fully Paid' 'Charged Off']


| member_id | loan_amnt | term | grade | emp_length | home_ownership | annual_inc | loan_status | purpose | addr_state | delinq_2yrs | inq_last_6mths | open_acc | total_pymnt | total_rec_int | total_rec_late_fee | acc_now_delinq | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1296599.0 | 5000.0 | 0 | 1 | 1 | 4 | 24000.0 | 1 | 1 | 3 | 0.0 | 1.0 | 3.0 | 5863.155187 | 863.16 | 0.0 | 0.0 | 0.0 |
| 1314167.0 | 2500.0 | 1 | 2 | 10 | 4 | 30000.0 | 0 | 0 | 10 | 0.0 | 5.0 | 3.0 | 1008.71 | 435.17 | 0.0 | 0.0 | 0.0 |
| 1313524.0 | 2400.0 | 0 | 2 | 1 | 4 | 12252.0 | 1 | 11 | 14 | 0.0 | 2.0 | 2.0 | 3005.666844 | 605.67 | 0.0 | 0.0 | 0.0 |
| 1277178.0 | 10000.0 | 0 | 2 | 1 | 4 | 49200.0 | 1 | 9 | 4 | 0.0 | 1.0 | 10.0 | 12231.89 | 2214.92 | 16.97 | 0.0 | 0.0 |
| 1311441.0 | 5000.0 | 0 | 0 | 3 | 4 | 36000.0 | 1 | 13 | 3 | 0.0 | 3.0 | 9.0 | 5632.21 | 632.21 | 0.0 | 0.0 | 0.0 |
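
An alternative way to build the target column is an explicit dictionary, which keeps the meaning of 0 and 1 visible in the code instead of relying on the encoder's ordering. The sketch below follows the same grouping of statuses as the cell above; the name `STATUS_TO_TARGET` and the toy frame are hypothetical.

```python
import pandas as pd

# 1 = paid off, 0 = not paid off. Statuses missing from the map (such as
# 'Current') become NaN and are dropped.
STATUS_TO_TARGET = {
    "Fully Paid": 1,
    "Charged Off": 0,
    "Does not meet the credit policy. Status:Fully Paid": 1,
    "Does not meet the credit policy. Status:Charged Off": 0,
    "In Grace Period": 0,
    "Late (31-120 days)": 0,
    "Late (16-30 days)": 0,
}

# Hypothetical stand-in for the loans DataFrame.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0, 7000.0],
    "loan_status": [
        "Fully Paid",
        "Charged Off",
        "Current",
        "Does not meet the credit policy. Status:Fully Paid",
        "Late (31-120 days)",
    ],
})

target = toy["loan_status"].map(STATUS_TO_TARGET)
labelled = (
    toy.assign(loan_status=target)       # assign() returns a new frame
       .dropna(subset=["loan_status"])   # drop loans without a final outcome
       .astype({"loan_status": int})
)
print(labelled)
```

Because `assign` and `dropna` return new frames, this version also avoids modifying a slice of the original data in place.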


In [124]:
# 4. Performing necessary data preparation to start training machine learning models.
from sklearn.model_selection import train_test_split

X = df_new.drop(columns=['loan_status'])  # feature matrix
y = df_new[['loan_status']].copy()        # target, kept as a one-column DataFrame for the evaluation cell
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
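
If the two classes turn out to be imbalanced, a stratified split keeps the class proportions the same in the training and test sets. The sketch below uses synthetic data with a made-up 85/15 class ratio; only the `stratify` argument differs from the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 85% of them labelled 1 (hypothetical ratio).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 850 + [0] * 150)

# stratify=y_demo preserves the class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```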

In [125]:
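# 5. LogisticRegression classifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
# ravel() passes the target as a 1-D array, which is the shape scikit-learn expects.
clf.fit(X_train, y_train.values.ravel())
yhat = clf.predict(X_test)   # predictions on the held-out test set
clf.score(X_test, y_test)    # mean accuracy on the test set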

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9569253620497586
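
The convergence warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, assuming a `StandardScaler` in front of the classifier, is shown below on synthetic data; the feature names, the data and the `max_iter` value are hypothetical, and the scores it produces say nothing about the loans dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (a dollar amount and a small count).
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(60000, 20000, n)
open_accounts = rng.integers(1, 20, n).astype(float)
X_demo = np.column_stack([income, open_accounts])
y_demo = (income / 20000 + open_accounts / 5 + rng.normal(0, 1, n) > 5).astype(int)

# Standardize each feature, then fit the classifier with a larger iteration budget.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, max_iter=1000),
)
model.fit(X_demo, y_demo)

n_iter = model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
print("training accuracy:", model.score(X_demo, y_demo))
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on new rows.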

In [131]:
# 6. Evaluating our model
from sklearn.metrics import accuracy_score, jaccard_score, precision_score

# Work on a renamed copy so that y itself is left untouched.
y_comparison = y.rename(columns={'loan_status': 'real_values'}).copy()
y_real = y_comparison['real_values']
# NOTE: predictions are made on the full dataset (training and test rows together),
# so these scores are not a clean held-out estimate.
y_pred = clf.predict(X)
y_comparison['predicted_values'] = y_pred
print(y_comparison)
print("Accuracy score is : {}.".format(accuracy_score(y_real, y_pred)))
print("Jaccard score is : {}.".format(jaccard_score(y_real, y_pred)))
print("Precision score is : {}.".format(precision_score(y_real, y_pred)))

           real_values  predicted_values
member_id                               
1296599.0            1                 1
1314167.0            0                 0
1313524.0            1                 1
1277178.0            1                 1
1311441.0            1                 1
...                ...               ...
114120.0             1                 1
113446.0             1                 1
113199.0             0                 0
112799.0             1                 1
78530.0              0                 0

[40393 rows x 2 columns]
Accuracy score is : 0.9583343648651994.
Jaccard score is : 0.9526129068588806.
Precision score is : 0.9626415523814943.


