<span style='font-size: 40px;'> Lending Club - ML Modeling with LightGBM </span>



<span style='font-size: 15px;'> Lending Club is an online peer-to-peer lending platform that allows small-dollar investors to fund small unsecured personal loans to borrowers. The interest rate charged on each loan is based on the grades and subgrades assigned by Lending Club. The method of calculating these grades is not transparent. The purpose of this project is to determine which variables are most strongly correlated with a loan being paid off by the end of its term, through a combination of statistical and visual analysis, and to train a machine learning model that matches or surpasses the results found in the dataset.  </span>

In [1]:
import textwrap

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MaxNLocator
from IPython.display import display
from tabulate import tabulate

import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (train_test_split, cross_validate, KFold,
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report, make_scorer,
                             roc_curve, auc)

<span style="font-size:30px;"> Correlation Analysis </span>

There is still an unwieldy number of columns remaining. What's more, many are difficult to decipher, which makes it hard to guess how each correlates with the target. We use statistical analysis to see which columns correspond to the repayment rate and which are irrelevant or redundant and can be discarded. This is done by calculating Pearson and Spearman correlations. The former measures the strength of the linear relationship between two variables; the latter measures the monotonic relationship, that is, whether one variable consistently rises or falls with the other, regardless of the slope or shape of the relationship.
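To make the distinction concrete, here is a minimal sketch on made-up data (the series `x` and `y` are illustrative and not taken from the loan data). It computes Spearman as the Pearson correlation of the ranks, which is how the coefficient is defined and should agree with `method='spearman'`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y always rises with x, but exponentially rather than linearly
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2)

# Pearson looks for a straight-line relationship
pearson = x.corr(y)

# Spearman is Pearson applied to the ranks, so only the ordering matters
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because `y` never decreases as `x` grows, the rank correlation is perfect, while Pearson is pulled down by the curvature.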


We begin by converting the non-numeric columns into numeric values.

In [18]:
corr_df = mod_df_2.copy()

In [19]:
# Turning categories that have strings in them into purely numeric values.
# Raw strings (r'...') keep '\d' from being read as an invalid escape sequence.
corr_df['emp_length'] = corr_df['emp_length'].str.extract(r'(\d+)').astype(float)
corr_df['term'] = corr_df['term'].str.extract(r'(\d+)').astype(float)

# Loan status is still stored with the object dtype after the earlier conversion. The model needs it as a numeric value.
corr_df['loan_status'] = corr_df['loan_status'].astype(int)

# Turning dates into numeric values
corr_df['earliest_cr_line'] = pd.to_datetime(corr_df['earliest_cr_line']).astype('int64')//10**9 # Convert the timestamp to Unix timestamp (seconds since January 1, 1970)
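As a quick illustration of what the digit extraction does, here is a small sketch on made-up values in the style of the `emp_length` and `term` columns (the sample strings are assumptions about the raw format, not rows from the dataset).

```python
import pandas as pd

# Hypothetical raw values resembling the employment-length and term columns
emp_length = pd.Series(['10+ years', '< 1 year', '3 years', None])
term = pd.Series([' 36 months', ' 60 months'])

# Keep the first run of digits in each string and convert it to a float
emp_years = emp_length.str.extract(r'(\d+)', expand=False).astype(float)
term_months = term.str.extract(r'(\d+)', expand=False).astype(float)

print(emp_years.tolist())
print(term_months.tolist())
```

One consequence worth keeping in mind: a value such as '< 1 year' is mapped to 1, the same as '1 year', and missing values stay missing.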




In [20]:
categorical_columns = corr_df.select_dtypes(include=['object']).columns
print(categorical_columns)

Index(['grade', 'sub_grade', 'home_ownership', 'verification_status',
       'purpose', 'addr_state', 'application_type', 'disbursement_method'],
      dtype='object')


Label and one-hot encoding

In [21]:
# Designating ordinal_cols to be converted to numeric. These are just the grade and sub_grade columns.
ordinal_cols = list(categorical_columns[0:2]) 

corr_df_2 = corr_df.copy()

# Label encoding for ordinal values: each category becomes an integer code, starting at 0, in alphabetical order
label_encoder = LabelEncoder()

for col in ordinal_cols:
    corr_df_2[col] = label_encoder.fit_transform(corr_df_2[col])

In [22]:
# Designating non-ordinal (nominal) columns.
binary_cols = list(categorical_columns[2:])

# One-hot encoding for nominal values: each category becomes its own indicator column
corr_df_2 = pd.get_dummies(corr_df_2, columns=binary_cols)
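The difference between the two encodings can be seen on a toy frame. This is a sketch only: the frame `toy` and the list `grade_order` are made up, and for the ordinal column it uses an explicit ordered `pd.Categorical` rather than `LabelEncoder`. For grades A to G the two give the same codes, since alphabetical order matches grade order.

```python
import pandas as pd

# Hypothetical miniature of the loan data
toy = pd.DataFrame({
    'grade': ['B', 'A', 'C', 'A'],
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
})

# Ordinal column: one integer code per grade, with the order stated explicitly
grade_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
toy['grade_code'] = pd.Categorical(toy['grade'], categories=grade_order, ordered=True).codes

# Nominal column: one indicator column per category, with no implied order
encoded = pd.get_dummies(toy, columns=['home_ownership'])

print(encoded)
```

The codes start at 0, so grade A maps to 0 and grade B to 1. Each row has exactly one of the `home_ownership_*` indicators set.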

Due to the one-hot encoding, there are now ~50 columns designating state of residence alone. Let's analyze those separately by creating a subset dataframe.

In [23]:
# Filtering by state and creating a df out of it
states = corr_df_2.filter(like='addr_state')
states_df = pd.merge(states, corr_df_2['loan_status'], left_index=True, right_index=True)

# Applying correlation analysis to each column
state_corr = states_df.iloc[:, 0:-1].apply(lambda x: x.corr(states_df['loan_status'])) # Applies corr function to each column in the states df.
state_corr = state_corr.sort_values().reset_index()

The table below indicates that state of residence has no significant impact on the payback rate. Even the 0.017 for Colorado is negligible.

In [47]:
# Formatting tables showing the 5 states most negatively and the 5 most positively correlated with the target (loan status)
state_corr.columns = ['State','Correlation']
state_corr['State'] = state_corr['State'].replace('addr_state_','', regex=True)
state_corr['Correlation'] = round(state_corr['Correlation'],3)

# Convert DataFrame to a list of dictionaries
neg_corr_data = state_corr.head(5).to_dict('records')
pos_corr_data = state_corr.tail(5).to_dict('records')

# Print the most negatively correlated states
neg_corr_table = tabulate(neg_corr_data, headers='keys', tablefmt='pretty')
print(f'Top negatively correlated states with final Loan Status: \n{neg_corr_table}\n')

# Print the most positively correlated states
pos_corr_table = tabulate(pos_corr_data, headers='keys', tablefmt='pretty')
print(f'Top positively correlated states with final Loan Status: \n{pos_corr_table}')

Top negatively correlated states with final Loan Status: 
+-------+-------------+
| State | Correlation |
+-------+-------------+
|  NY   |   -0.015    |
|  MS   |   -0.011    |
|  FL   |   -0.011    |
|  AL   |    -0.01    |
|  AR   |   -0.009    |
+-------+-------------+

Top positively correlated states with final Loan Status: 
+-------+-------------+
| State | Correlation |
+-------+-------------+
|  NH   |    0.009    |
|  SC   |    0.01     |
|  OR   |    0.015    |
|  WA   |    0.016    |
|  CO   |    0.017    |
+-------+-------------+


Let's continue with correlation analysis of the other categorical columns

In [24]:
cols_wo_states = [col for col in corr_df_2.columns if col not in states] # I want to limit my next analysis to the non-state columns.

In [25]:
# Retrieving just the categorical columns
cat_cols = cols_wo_states[-27:]

# Merging the categorical columns with the loan status column to create a new df
cat_df = corr_df_2[cat_cols+['loan_status']]

# Calculating the correlation in each column
cat_corr = cat_df.iloc[:, 0:-1].apply(lambda col: col.corr(cat_df['loan_status']))

# Sorting by correlation
cat_corr = cat_corr.sort_values()

# Renaming the columns
cat_corr1 = cat_corr.reset_index()
cat_corr1.columns = ['Category','Correlation']

The resulting correlation table was a bit hard to work with because it has so many rows. Below I define a function that groups the table by the original column name each category had before one-hot encoding.

In [50]:
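cat_dict = {}
# Initialising with None keeps 'Group' as an object column, so the string labels assigned below fit its dtype.
cat_corr1['Group'] = None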

def grouping(categories, dataframe, column):
    for general_cat in categories:
        for specific_cat in dataframe[column]:
            if general_cat in specific_cat:
                dataframe.loc[dataframe[column]==specific_cat,'Group'] = general_cat
     
    cat_grouping = dataframe.groupby('Group')

    for gen_cat, cat_df in cat_grouping:
        display(f"Group: {gen_cat}")
        display(cat_df)
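The grouping step can also be written without the nested loops. The sketch below is an alternative under stated assumptions, not the function used in this notebook: `source_column` is a hypothetical helper, and the toy rows reuse a few category names and coefficients from the output further down.

```python
import pandas as pd

def source_column(dummy_name, source_columns):
    """Return the original column a one-hot column came from, or None if there is no match."""
    matches = [c for c in source_columns if dummy_name.startswith(c + '_')]
    # Prefer the longest prefix in case one source name is a prefix of another
    return max(matches, key=len) if matches else None

toy_corr = pd.DataFrame({
    'Category': ['home_ownership_RENT', 'home_ownership_MORTGAGE', 'purpose_moving'],
    'Correlation': [-0.065987, 0.068069, -0.007217],
})
toy_sources = ['home_ownership', 'purpose', 'verification_status']

# Label every one-hot column with the column it was derived from
toy_corr['Group'] = toy_corr['Category'].map(lambda name: source_column(name, toy_sources))

for group_name, group_df in toy_corr.groupby('Group'):
    print(f"Group: {group_name}")
    print(group_df)
```

Matching on the prefix plus an underscore, rather than on any substring, avoids mislabeling a column whose name merely contains another column's name.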

Based on the results below, most of the categorical columns seem insignificant. However, for home ownership, renting is negatively correlated with repayment (-0.066) and having a mortgage is positively correlated (0.068). These top-line results aren't surprising because you'd expect mortgage holders to be higher on the socio-economic ladder. However, those with a status of "own", which I take to mean owning a home outright, have a slight negative correlation.

The purpose of the borrowing doesn't seem to have a large effect either. Debt consolidation is the most negatively correlated and credit card the most positively correlated, but both only very weakly, and I'm reluctant to speculate about what difference between those two purposes would explain the gap.

Lastly, and most surprisingly, there is verification status. If an individual's income has not been verified, they are likelier to have repaid their loan! I wonder if unverified incomes are on average higher. I will look into this in later visualizations.

In [37]:
grouping(binary_cols, cat_corr1, 'Category')



'Group: application_type'

Unnamed: 0,Category,Correlation,Group
5,application_type_Joint App,-0.016108,application_type
23,application_type_Individual,0.016108,application_type


'Group: disbursement_method'

Unnamed: 0,Category,Correlation,Group
14,disbursement_method_DirectPay,-0.000169,disbursement_method
16,disbursement_method_Cash,0.000169,disbursement_method


'Group: home_ownership'

Unnamed: 0,Category,Correlation,Group
0,home_ownership_RENT,-0.065987,home_ownership
8,home_ownership_OWN,-0.005632,home_ownership
13,home_ownership_OTHER,-0.000262,home_ownership
15,home_ownership_ANY,0.000145,home_ownership
17,home_ownership_NONE,0.000766,home_ownership
25,home_ownership_MORTGAGE,0.068069,home_ownership


'Group: purpose'

Unnamed: 0,Category,Correlation,Group
2,purpose_debt_consolidation,-0.034545,purpose
3,purpose_small_business,-0.026734,purpose
6,purpose_moving,-0.007217,purpose
7,purpose_other,-0.006846,purpose
9,purpose_medical,-0.005028,purpose
10,purpose_house,-0.003544,purpose
11,purpose_renewable_energy,-0.002465,purpose
12,purpose_educational,-0.000366,purpose
18,purpose_vacation,0.001606,purpose
19,purpose_major_purchase,0.00516,purpose


'Group: verification_status'

Unnamed: 0,Category,Correlation,Group
1,verification_status_Verified,-0.065173,verification_status
4,verification_status_Source Verified,-0.019482,verification_status
26,verification_status_Not Verified,0.086338,verification_status


<span style="font-size:20px;"> I move on to analysis of the numerical columns</span>

In [27]:
# Retrieving the numerical columns only
numeric_cols = corr_df_2.columns[0:57]
numeric_df = corr_df_2[numeric_cols]

Below I define a function that prints the most strongly correlated features for a given method, Pearson or Spearman in my case.


In [28]:
def corr_table(method, head, tail):
    # Applying the correlation function to each column and sorting by correlation
    num_corr = numeric_df.apply(lambda col: col.corr(numeric_df['loan_status'], method=method))
    num_corr = num_corr.sort_values()

    # Renaming columns, formatting the Correlation numbers, and dropping the target's correlation with itself
    num_corr1 = num_corr.reset_index()
    num_corr1.columns = ['Feature','Correlation']
    num_corr1['Correlation'] = round(num_corr1['Correlation'],3)
    num_corr1 = num_corr1.drop(num_corr1[num_corr1['Feature']=='loan_status'].index)

    # Taking the `head` most negatively correlated features, retrieving the definition of each column name, and wrapping it for readability.
    # .copy() makes an independent frame, which avoids pandas' SettingWithCopyWarning on the assignments below.
    num_neg_corr = num_corr1.head(head).copy()
    num_neg_corr['Description'] = num_neg_corr['Feature'].apply(lambda x: columns_dict.get(x))
    num_neg_corr['Description'] = num_neg_corr['Description'].apply(lambda x: textwrap.fill(str(x), 84))

    # Taking the `tail` most positively correlated features, retrieving the definition of each column name, and wrapping it for readability.
    num_pos_corr = num_corr1.tail(tail).copy()
    num_pos_corr['Description'] = num_pos_corr['Feature'].apply(lambda x: columns_dict.get(x))
    num_pos_corr['Description'] = num_pos_corr['Description'].apply(lambda x: textwrap.fill(str(x), 80))

    # Rearranging order of columns
    num_neg_corr = num_neg_corr.loc[:,['Feature','Description','Correlation']]
    num_pos_corr = num_pos_corr.loc[:,['Feature','Description','Correlation']]

    # Print top negative numeric features
    neg_corr_table = tabulate(num_neg_corr, headers='keys', tablefmt='pretty', showindex='never')
    print(f'Top negatively correlated numeric features:\n{neg_corr_table}\n')

    # Print top positive numeric features
    pos_corr_table = tabulate(num_pos_corr, headers='keys', tablefmt='pretty', showindex='never')
    print(f'Top positively correlated numeric features:\n{pos_corr_table}')


Below I take the top 8 negatively correlated features and the top 5 positively correlated features and pull their definition from the columns_dict I previously created.

Based on the results below, LC's grades and the interest rate on the loan are inversely correlated with the repayment rate. The direction of the correlation is logical. When I converted the grades to numerical values with LabelEncoder, A-graded loans became 0, B became 1, C became 2, and so on, while repayment in the dependent variable is indicated by 1. So the larger the encoded grade (that is, the worse the grade), the lower the chances of repayment. In addition, the interest rate charged is based on the grade and is a close proxy for it: the lower the chances of repayment, the higher the rate. Nonetheless, none of these correlations is strong, at around -0.26 to -0.27. Strictly speaking, the grade, sub-grade, and interest rate columns, which are all correlated with each other anyway, cannot be used as features for our model because we are trying to determine the criteria that go into LC's grading methodology, but it is interesting to analyze their existing performance.

Other than the above-mentioned columns, the features with a correlation of 0.1 or above in absolute value are term (-0.176), FICO score (0.131), and number of trade accounts opened in the last 24 months (-0.1). Among the features outside LC's own grading, term therefore has the strongest relationship with repayment. It is intuitive that term is negatively correlated with repayment: a borrower has more opportunities to default the longer the term, and longer-term fixed-income instruments are as a rule riskier, all else remaining constant. It is also logical that FICO is among the features most strongly correlated with repayment, as FICO is itself a score compiled by agencies that specialize in credit risk assessment. Finally, the number of trade accounts opened may be negatively correlated with repayment because it indicates someone who is in a lot of debt or is risk-prone. Trade accounts are a variety of credit/debt accounts, such as mortgages, credit cards, and collection accounts.

It's a little surprising that columns like num_tl_90g_dpd_24m, which is the number of accounts 90 days past due over the past 2 years, and pub_rec_bankruptcies don't crack the top 5. However, again, these correlations are weak. It looks like a model will need to combine several features for any chance of besting LC's grading system. It may also be the case that these relationships are not linear. For example, the repayment rate may fall at an exponential rate as the number of accounts opened increases. For that reason, I will run the same analysis using the Spearman correlation.

In [40]:
corr_table('pearson', 8, 5)


Top negatively correlated numeric features:
+----------------------+--------------------------------------------------------------------------------------+-------------+
|       Feature        |                                     Description                                      | Correlation |
+----------------------+--------------------------------------------------------------------------------------+-------------+
|      sub_grade       |                              LC assigned loan subgrade                               |   -0.267    |
|        grade         |                                LC assigned loan grade                                |   -0.261    |
|       int_rate       |                              Interest Rate on the loan                               |   -0.259    |
|         term         | The number of payments on the loan. Values are in months and can be either 36 or 60. |   -0.176    |
| acc_open_past_24mths |                      Number of trades opened in p


Spearman measures the directional relationship between two variables, ignoring the slope of the relationship. Unlike Pearson, it does not consider whether the relationship is linear, only whether it moves consistently in a positive or negative direction. It is also more robust to outliers. Given the variety of the data, I believe it is preferable to base the analysis on this more general measure of correlation.

The results below are broadly similar to the Pearson analysis. The coefficients for sub-grade, grade, and interest rate are marginally smaller in magnitude and the coefficient for term is unchanged. One thing to note is that DTI moved up to become one of the top correlated variables, at -0.109, still weak but stronger than most other variables. This is logical. DTI is a very common and important measure in lending: it is the ratio of the borrower's debt payments, excluding mortgage payments and the requested loan itself, to the borrower's income. I would expect DTI to be a good predictor.

In [44]:
corr_table('spearman', 8, 5)

Top negatively correlated numeric features:
+----------------------+--------------------------------------------------------------------------------------+-------------+
|       Feature        |                                     Description                                      | Correlation |
+----------------------+--------------------------------------------------------------------------------------+-------------+
|      sub_grade       |                              LC assigned loan subgrade                               |   -0.262    |
|        grade         |                                LC assigned loan grade                                |   -0.257    |
|       int_rate       |                              Interest Rate on the loan                               |   -0.254    |
|         term         | The number of payments on the loan. Values are in months and can be either 36 or 60. |   -0.176    |
|         dti          |   A ratio calculated using the borrower’s total m


<span style="font-size:30px;"> Multicollinearity </span>

Before we drop weakly correlated columns, we should look at multicollinearity, the correlation of the features with each other. As we still have a very high number of features, this will allow us to drop more of them. Highly correlated features can distort a model: keeping several features that carry redundant information overweights that information when making predictions.

In [49]:

# We want the correlation of each feature with every other feature, excluding the target, 'loan_status'. Therefore, I drop it in the new multi_col_df.
multi_col_df = corr_df_2.copy()
multi_col_df = multi_col_df.drop(columns='loan_status')

# Converting bools to numeric values. Assigning whole columns (rather than through .loc) lets the dtype actually change.
bools = multi_col_df.select_dtypes(include='bool').columns
num_multi_col_df = multi_col_df.copy()
num_multi_col_df[bools] = num_multi_col_df[bools].astype('int')

# Dropping rows with missing values
num_multi_col_df = num_multi_col_df.dropna(axis=0)


In [50]:
# This runs the correlation of every feature with every other feature using Spearman
spearman_multi_col = num_multi_col_df.corr(method='spearman')
# We convert to absolute values. It's irrelevant for this task whether the correlations are positive or negative.
abs_cor = spearman_multi_col.abs()

# The correlation matrix is symmetric, so every pair appears twice. Keeping only the strictly lower triangle
# (k=-1 also zeroes the diagonal) leaves each pair once; the zeroed entries are filtered out below.
abs_cor.loc[:,:] = np.tril(abs_cor , k=-1) 

# Multicollinearity dataframe: one row per (level_0, level_1) pair, strongest correlations first
final_mc_df = abs_cor.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)

# Filtering the multicollinearity dataframe by the pairs with correlation above 0.70 and grouping by the first feature of each pair
mc_grouping = final_mc_df[final_mc_df['correlation']>.7].groupby('level_0')
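The same pair extraction can be checked on a small synthetic frame. This is a sketch on made-up data (`toy`, with `b` built to be nearly collinear with `a`). It masks the upper triangle with `where` instead of overwriting it with zeros, so the unwanted entries become NaN and can be dropped.

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical): b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

abs_corr = toy.corr(method='spearman').abs()

# True only strictly below the diagonal, so each pair is kept exactly once
mask = np.tril(np.ones(abs_corr.shape, dtype=bool), k=-1)
pairs = abs_corr.where(mask).stack().dropna().rename('correlation').reset_index()

# Pairs above the 0.70 threshold used in this notebook
high = pairs[pairs['correlation'] > 0.7]
print(high)
```

Only the (`b`, `a`) pair should clear the threshold here, since `c` is unrelated to the other two columns.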

Here I define a function to display the multicollinearity data accompanied by the Spearman coefficient vs the target calculated earlier.

In [57]:
def combined_correlation_analysis(method):
    num_corr1 = corr_to_Y(method)
    groupby_result = correlation_grouping(num_corr1)
    return groupby_result

def corr_to_Y(method):
    num_corr = numeric_df.apply(lambda col: col.corr(numeric_df['loan_status'], method=method))
    num_corr = num_corr.sort_values()

    # Renaming columns and formatting the Correlation numbers
    num_corr1 = num_corr.reset_index()
    num_corr1.columns = ['Feature', 'Correlation']
    num_corr1['Correlation'] = round(num_corr1['Correlation'], 3)
    num_corr1 = num_corr1.drop(num_corr1[num_corr1['Feature'] == 'loan_status'].index)

    return num_corr1

def correlation_grouping(num_corr1):
    df_results = []
    for level_0, df in mc_grouping:
        zero_corr_with_Y = num_corr1.loc[num_corr1['Feature'] == level_0, 'Correlation'].to_string(index=False)
        if zero_corr_with_Y == 'Series([], )':
            df['0_corr_with_Y'] = cat_corr1.loc[cat_corr1['Category'] == level_0, 'Correlation'].to_string(index=False)
        else:
            df['0_corr_with_Y'] = zero_corr_with_Y

        level_1_df = df['level_1'].reset_index()
        results_for_level_1 = []

        for level_1 in level_1_df['level_1']:
            one_corr_with_Y = num_corr1.loc[num_corr1['Feature'] == level_1, 'Correlation'].to_string(index=False)
            if one_corr_with_Y == 'Series([], )':
                result_for_level_1 = cat_corr1.loc[cat_corr1['Category'] == level_1, 'Correlation'].to_string(index=False)
            else:
                result_for_level_1 = one_corr_with_Y

            results_for_level_1.append(result_for_level_1)

        df['1_corr_with_Y'] = results_for_level_1
        df_results.append(df.copy())

    return pd.concat(df_results)
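The functions above look up each feature's correlation with the target by converting to strings and comparing against the text of an empty Series. A lookup Series is a less fragile way to do the same join. The sketch below is an alternative under stated assumptions, not the code used in this notebook: `attach_target_corr` is a hypothetical helper, and the toy values are copied from the tables in this section. In the notebook, the two inputs could be built with `num_corr1.set_index('Feature')['Correlation']` and `cat_corr1.set_index('Category')['Correlation']`.

```python
import pandas as pd

def attach_target_corr(pairs, numeric_corr, categorical_corr):
    """Add each feature's correlation with the target to a table of feature pairs."""
    # One lookup for all features; numeric values win if a name appears in both
    lookup = pd.concat([numeric_corr, categorical_corr])
    lookup = lookup[~lookup.index.duplicated(keep='first')]

    out = pairs.copy()
    out['0_corr_with_Y'] = out['level_0'].map(lookup)
    out['1_corr_with_Y'] = out['level_1'].map(lookup)
    return out

toy_pairs = pd.DataFrame({
    'level_0': ['bc_util', 'home_ownership_RENT'],
    'level_1': ['revol_util', 'home_ownership_MORTGAGE'],
    'correlation': [0.843806, 0.811913],
})
toy_numeric = pd.Series({'bc_util': -0.067, 'revol_util': -0.059})
toy_categorical = pd.Series({'home_ownership_RENT': -0.065987,
                             'home_ownership_MORTGAGE': 0.068069})

toy_result = attach_target_corr(toy_pairs, toy_numeric, toy_categorical)
print(toy_result)
```

Features missing from both inputs come back as NaN rather than as a string, and the correlation columns stay numeric, so they can be sorted or filtered afterwards.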



Based on analysis of the below table, I dropped one of the two variables highly correlated to each other (above 0.70), but also considered how strongly each is correlated to the target. It wasn't an exact science as to which one of the two variables to drop and some of the weaker relationships in the table were kept.

In [44]:
# Call the combined function with the desired correlation method
multicorr_table = combined_correlation_analysis(method='spearman')

# Display the final result
display(multicorr_table)

Unnamed: 0,level_0,level_1,correlation,0_corr_with_Y,1_corr_with_Y
17684,application_type_Joint App,application_type_Individual,1.0,-0.016108,0.016108
3367,bc_util,revol_util,0.843806,-0.067,-0.059
3374,bc_util,bc_open_to_buy,0.750516,-0.067,0.089
3638,delinq_amnt,acc_now_delinq,0.75957,-0.005,-0.004
17954,disbursement_method_DirectPay,disbursement_method_Cash,1.0,-0.000169,0.000169
134,funded_amnt,loan_amnt,1.0,-0.071,-0.07
673,grade,int_rate,0.957418,-0.257,-0.254
8231,home_ownership_RENT,home_ownership_MORTGAGE,0.811913,-0.065987,0.068069
537,installment,funded_amnt,0.967812,-0.059,-0.071
536,installment,loan_amnt,0.967812,-0.059,-0.07


In [36]:
# Converting bools into integers for ease of use in ML model. Assigning whole columns (rather than through .loc) lets the dtype actually change.
bools = corr_df_2.select_dtypes(include='bool').columns
model_df = corr_df_2.copy()
model_df[bools] = model_df[bools].astype('int')

# Dropping rows with missing values
model_df = model_df.dropna(axis=0)

# These are the features strongly correlated to each other based on the above table.
corr_features_to_drop = ['application_type_Joint App','revol_util', 'bc_open_to_buy', 'acc_now_delinq','disbursement_method_DirectPay', 'loan_amnt','home_ownership_MORTGAGE','installment','earliest_cr_line',
             'mo_sin_rcnt_rev_tl_op','mths_since_recent_inq','num_actv_bc_tl','num_bc_sats','num_rev_accts','open_acc','percent_bc_gt_75','pub_rec_bankruptcies','total_rev_hi_lim',
             'total_il_high_credit_limit','num_sats','num_rev_tl_bal_gt_0']
             
# These three columns come from LC's own grading, which is what we are trying to replicate, so they can't be used as independent variables (features).
excessive_features  = ['grade','sub_grade','int_rate'] 

model_df = model_df.drop(columns=corr_features_to_drop)
model_df = model_df.drop(columns=excessive_features)
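The choice of which feature to drop from each pair was made by hand above. For comparison, one mechanical rule is to drop whichever member of a pair is more weakly correlated with the target. The sketch below illustrates that rule only and is not the procedure followed in this notebook: `choose_drops` is a hypothetical helper, and the toy inputs reuse a few rows from the multicollinearity table.

```python
import pandas as pd

def choose_drops(pairs, target_corr, threshold=0.7):
    """From each highly correlated pair, drop the feature with the weaker |correlation| to the target."""
    dropped = set()
    for row in pairs.sort_values('correlation', ascending=False).itertuples(index=False):
        if row.correlation <= threshold:
            continue
        # If one member is already dropped, the pair no longer conflicts
        if row.level_0 in dropped or row.level_1 in dropped:
            continue
        strength_0 = abs(target_corr.get(row.level_0, 0.0))
        strength_1 = abs(target_corr.get(row.level_1, 0.0))
        dropped.add(row.level_0 if strength_0 < strength_1 else row.level_1)
    return sorted(dropped)

toy_pairs = pd.DataFrame({
    'level_0': ['bc_util', 'funded_amnt', 'installment', 'installment'],
    'level_1': ['revol_util', 'loan_amnt', 'funded_amnt', 'loan_amnt'],
    'correlation': [0.843806, 1.0, 0.967812, 0.967812],
})
toy_target_corr = {'bc_util': -0.067, 'revol_util': -0.059, 'funded_amnt': -0.071,
                   'loan_amnt': -0.07, 'installment': -0.059}

print(choose_drops(toy_pairs, toy_target_corr))
```

On these rows the rule happens to agree with the manual choices, since `revol_util`, `loan_amnt`, and `installment` all appear in `corr_features_to_drop`. On the full table it may disagree, because the manual selection also weighed interpretability.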

After dropping the highly correlated features, we revisit the correlations of the remaining independent variables with the target. We keep the 24 features with the largest absolute Spearman coefficients, excluding 'verification_status_Verified', which is redundant because 'verification_status_Not Verified' already makes it into the top 5. This is a conservative cut. We will narrow the features down further as we progress in our ML model.

In [58]:
def abs_corr_table(method, number_strongest):
    # Applying the correlation function to each column and sorting by absolute correlation, strongest first
    num_corr = model_df.apply(lambda col: col.corr(model_df['loan_status'], method=method))
    num_corr = abs(num_corr).sort_values(ascending=False)

    # Renaming columns, formatting the correlation numbers, and dropping the target's correlation with itself
    num_corr1 = num_corr.reset_index()
    num_corr1.columns = ['Feature','Correlation']
    num_corr1['Correlation'] = round(num_corr1['Correlation'],3)
    num_corr1 = num_corr1.drop(num_corr1[num_corr1['Feature']=='loan_status'].index)

    # Taking the strongest correlated features and pulling their dictionary definitions, if available.
    # .copy() makes an independent frame, which avoids pandas' SettingWithCopyWarning on the assignments below.
    num_strong_corr = num_corr1.head(number_strongest).copy()
    num_strong_corr['Description'] = num_strong_corr['Feature'].apply(lambda x: columns_dict.get(x))
    num_strong_corr['Description'] = num_strong_corr['Description'].apply(lambda x: textwrap.fill(str(x), 84))

    # Rearranging order of columns
    num_strong_corr = num_strong_corr[['Feature','Description','Correlation']]

    return num_strong_corr

In [48]:
# Displaying top 30 Spearman coefficients (abs numbers)
strong_corr = abs_corr_table('spearman', 30)
display(strong_corr)




Unnamed: 0,Feature,Description,Correlation
1,term,The number of payments on the loan. Values are...,0.182
2,fico_range_low,The lower boundary range the borrower’s FICO a...,0.126
3,dti,A ratio calculated using the borrower’s total ...,0.108
4,acc_open_past_24mths,Number of trades opened in past 24 months.,0.098
5,verification_status_Not Verified,,0.087
6,mort_acc,Number of mortgage accounts.,0.087
7,num_tl_op_past_12m,Number of accounts opened in past 12 months,0.082
8,tot_hi_cred_lim,Total high credit/credit limit,0.081
9,funded_amnt,The total amount committed to that loan at tha...,0.078
10,home_ownership_RENT,,0.074


In [None]:
# Removing 'verification_status_Verified' from the list because  'verification_status_Not Verified' already exists and is sufficient
top_corr_list = list(strong_corr['Feature'])
top_corr_list.remove('verification_status_Verified')
selected_features = top_corr_list[0:24]

Preparing the data for the ML models by scaling and splitting the data.

In [38]:
# Select the data for the 24 selected features
X= model_df[selected_features]

# Scale the data to be values from 0 to 1
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Setting the y to the value we want to predict, the final status of each loan.
y = model_df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
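Note that the scaler above is fitted on all rows before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `_demo` names and the random arrays are assumptions standing in for `model_df`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and the binary target (hypothetical)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows play no part in fitting the scaler
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Learn each feature's min and max from the training rows only, then apply to both sets
scaler_demo = MinMaxScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

With this ordering the training set spans exactly 0 to 1, while test values may fall slightly outside that range, which is expected for unseen data.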