Lending Club is a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return. Each borrower fills out a comprehensive application, providing their financial history, the reason for the loan, and so on. Each borrower's credit score is then evaluated using historical data, and an interest rate is assigned to the borrower. In this task, we'll focus on data for approved loans from 2007 to 2011.

In [2]:
import pandas as pd
loans_2007 = pd.read_csv('loans_2007.csv')
print(loans_2007.head())

        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
0  1077501  1296599.0     5000.0       5000.0           4975.0   36 months   
1  1077430  1314167.0     2500.0       2500.0           2500.0   60 months   
2  1077175  1313524.0     2400.0       2400.0           2400.0   36 months   
3  1076863  1277178.0    10000.0      10000.0          10000.0   36 months   
4  1075358  1311748.0     3000.0       3000.0           3000.0   60 months   

  int_rate  installment grade sub_grade    ...    last_pymnt_amnt  \
0   10.65%       162.87     B        B2    ...             171.62   
1   15.27%        59.83     C        C4    ...             119.66   
2   15.96%        84.33     C        C5    ...             649.91   
3   13.49%       339.31     C        C1    ...             357.48   
4   12.69%        67.79     B        B5    ...              67.79   

  last_credit_pull_d collections_12_mths_ex_med  policy_code application_type  \
0           Jun-2016               

The DataFrame contains many columns, and exploring them all at once is cumbersome. After careful study of each column, the following columns need to be dropped, either because they leak data from the future (which would lead to overfitting), because they are redundant, or because they carry no useful information for the model:
-  id: a randomly generated field used by Lending Club for unique identification purposes only.
-  member_id: also a randomly generated field used by Lending Club for unique identification purposes only.
-  funded_amnt: leaks data from the future (after the loan has started to be funded).
-  funded_amnt_inv: also leaks data from the future (after the loan has started to be funded).
-  grade: redundant with the interest rate column (int_rate).
-  sub_grade: also redundant with the interest rate column (int_rate).
-  emp_title: requires other data and a lot of processing to potentially be useful.
-  issue_d: leaks data from the future (after the loan has been completely funded).
-  zip_code: redundant with the addr_state column, since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in).
-  out_prncp: leaks data from the future (after the loan has started to be paid off).
-  out_prncp_inv: also leaks data from the future (after the loan has started to be paid off).
-  total_pymnt: also leaks data from the future (after the loan has started to be paid off).
-  total_pymnt_inv: also leaks data from the future (after the loan has started to be paid off).
-  total_rec_prncp: also leaks data from the future (after the loan has started to be paid off).
-  total_rec_int: also leaks data from the future (after the loan has started to be paid off).
-  total_rec_late_fee: also leaks data from the future (after the loan has started to be paid off).
-  recoveries: also leaks data from the future (after the loan has started to be paid off).
-  collection_recovery_fee: also leaks data from the future (after the loan has started to be paid off).
-  last_pymnt_d: also leaks data from the future (after the loan has started to be paid off).
-  last_pymnt_amnt: also leaks data from the future (after the loan has started to be paid off).

In [7]:
# dropping the columns from the dataset
loans_2007.drop(columns = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade','emp_title', 'issue_d',  
                          'zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 
                          'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 
                           'last_pymnt_amnt'], inplace = True)
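
As a side note, the same 20 columns can be grouped by the reason they are dropped, which keeps the rationale next to the code. The sketch below is only an illustration: the `DROP_REASONS` dictionary, the `drop_listed_columns` helper, and the toy DataFrame are hypothetical names and made-up values, not part of the original notebook. It returns a new DataFrame instead of modifying in place, and uses `errors="ignore"` so that re-running the cell does not fail on columns that are already gone.

```python
import pandas as pd

# Hypothetical grouping of the columns listed above by the reason for dropping them.
DROP_REASONS = {
    "identifier": ["id", "member_id"],
    "redundant": ["grade", "sub_grade", "zip_code"],
    "needs_processing": ["emp_title"],
    "leaks_future": [
        "funded_amnt", "funded_amnt_inv", "issue_d", "out_prncp",
        "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
        "total_rec_int", "total_rec_late_fee", "recoveries",
        "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt",
    ],
}


def drop_listed_columns(df, reasons=DROP_REASONS):
    """Return a copy of df without the listed columns; absent columns are skipped."""
    to_drop = [col for cols in reasons.values() for col in cols]
    return df.drop(columns=to_drop, errors="ignore")


# Tiny made-up frame standing in for loans_2007 (values are illustrative only).
toy = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [5000.0, 2500.0],
    "grade": ["B", "C"],
    "total_pymnt": [100.0, 200.0],
})
cleaned = drop_listed_columns(toy)
print(list(cleaned.columns))
```

Calling `drop_listed_columns(cleaned)` a second time simply returns the same columns again, since missing labels are ignored.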

In [8]:
print(loans_2007.head())

   loan_amnt        term int_rate  installment emp_length home_ownership  \
0     5000.0   36 months   10.65%       162.87  10+ years           RENT   
1     2500.0   60 months   15.27%        59.83   < 1 year           RENT   
2     2400.0   36 months   15.96%        84.33  10+ years           RENT   
3    10000.0   36 months   13.49%       339.31  10+ years           RENT   
4     3000.0   60 months   12.69%        67.79     1 year           RENT   

   annual_inc verification_status  loan_status pymnt_plan    ...      \
0     24000.0            Verified   Fully Paid          n    ...       
1     30000.0     Source Verified  Charged Off          n    ...       
2     12252.0        Not Verified   Fully Paid          n    ...       
3     49200.0     Source Verified   Fully Paid          n    ...       
4     80000.0     Source Verified      Current          n    ...       

  initial_list_status last_credit_pull_d collections_12_mths_ex_med  \
0                   f           Jun-201

The number of columns was reduced from 52 to 32. For this task, our target column will be the **loan_status** column, since it's the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower.

In [9]:
print(loans_2007['loan_status'].value_counts())

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64


From the investor's perspective, we're interested in predicting which loans will be paid off on time and which ones won't be. Only the **Fully Paid** and **Charged Off** values describe the final outcome of a loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time.
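
To get a feel for how many rows have one of these two final outcomes, the counts printed above can be summed directly. This is a small sketch that hard-codes the `value_counts()` output shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the loan_status value_counts() output above.
status_counts = pd.Series({
    "Fully Paid": 33136,
    "Charged Off": 5634,
    "Does not meet the credit policy. Status:Fully Paid": 1988,
    "Current": 961,
    "Does not meet the credit policy. Status:Charged Off": 761,
    "Late (31-120 days)": 24,
    "In Grace Period": 20,
    "Late (16-30 days)": 8,
    "Default": 3,
})

final_statuses = ["Fully Paid", "Charged Off"]
kept = int(status_counts[final_statuses].sum())  # rows with a final outcome
total = int(status_counts.sum())                 # all rows with a status

print(kept, total - kept)  # 38770 rows with a final outcome, 3765 without
print(round(status_counts["Charged Off"] / kept, 3))  # share of Charged Off: 0.145
```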

We will treat the problem as a binary classification one. Let's remove all the loans whose status is neither **Fully Paid** nor **Charged Off**, and then transform the **Fully Paid** values to 1 for the positive case and the **Charged Off** values to 0 for the negative case.

In [10]:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]

mapping_dict = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0
    }
}
loans_2007 = loans_2007.replace(mapping_dict)
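
An equivalent way to express the filter and the mapping is with `Series.isin` and `Series.map`, which touches only the target column. The sketch below runs on a made-up three-row frame rather than the real data; `toy` and `status_to_label` are illustrative names.

```python
import pandas as pd

# Made-up rows standing in for loans_2007.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 3000.0],
    "loan_status": ["Fully Paid", "Charged Off", "Current"],
})

status_to_label = {"Fully Paid": 1, "Charged Off": 0}

# Keep only rows whose status is one of the two final outcomes.
filtered = toy[toy["loan_status"].isin(list(status_to_label))].copy()

# Convert the remaining statuses to 1 (positive) and 0 (negative).
filtered["loan_status"] = filtered["loan_status"].map(status_to_label)
print(filtered)
```

Taking a `.copy()` after filtering avoids assigning into a view of the original frame.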

Now, let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application.

In [12]:
drop_columns = []
for c in loans_2007.columns:
    non_null = loans_2007[c].dropna()
    unique_non_null = non_null.unique()
    num_true_unique = len(unique_non_null)
    if num_true_unique == 1:
        drop_columns.append(c)
        
loans_2007.drop(columns=drop_columns, inplace = True)
# print the names of the dropped columns
print(drop_columns)

[]


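With this step, we were able to remove 9 more columns, since each contained only one unique value. Note that the empty list printed above is what this cell produces when it is re-run after the columns have already been dropped in place.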
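
The same check can be written more compactly with `DataFrame.nunique`, which ignores missing values by default. This is a sketch on a made-up frame; `policy_flag` is a hypothetical column included only to show that nulls are not counted as a distinct value.

```python
import pandas as pd

# Made-up frame: two columns hold a single non-null value, one does not.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "pymnt_plan": ["n", "n", "n"],
    "policy_flag": [1.0, None, 1.0],  # hypothetical column with a missing value
})

# Number of distinct non-null values per column (dropna=True is the default).
unique_counts = toy.nunique()

# Columns with exactly one distinct non-null value add no information.
single_value_cols = unique_counts[unique_counts == 1].index.tolist()

# Return a new frame so the source frame is left unchanged.
reduced = toy.drop(columns=single_value_cols)
print(single_value_cols)
print(list(reduced.columns))
```

Because `toy` itself is never modified, re-running this snippet reports the same dropped columns each time.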