In [1]:

"""
2. Lending Club loan default predictor

Loan default occurs when a borrower fails to pay back a debt according
to the initial arrangement. In the case of most consumer loans, this means
that successive payments have been missed over the course of weeks or months. 
Fortunately, lenders and loan servicers usually allow a grace period before penalizing 
the borrower after missing one payment. The period between missing a loan payment 
and having the loan default is known as delinquency. The delinquency period gives 
the debtor time to avoid default by contacting their loan servicer or making up 
missed payments.

Defaulting on a loan will cause a substantial and lasting drop in the debtor's credit 
score, as well as extremely high interest rates on any future loan. For loans 
secured with collateral, defaulting will likely result in the pledged asset being 
seized by the bank. The most popular types of consumer loans that are backed by collateral 
are mortgages, auto loans and secured personal loans. For unsecured debts like credit cards 
and student loans, the consequences of default vary in severity according to the type
of loan. In the most extreme cases, debt collection agencies can garnish wages to
pay back the outstanding debt.

Loans are among the most important products in banking. 
Banks try to find effective business strategies to persuade customers 
to apply for their loans. However, some customers do not repay as agreed after their 
applications are approved.

Data link (~400 MB file, so please be patient):
https://drive.google.com/file/d/1e5gfIjJGJ7o7u27sW8c8gLRm8qmd0HJt/view?usp=sharing
Reference: Lending_club_loan_default_predictor | Kaggle

Follow these steps to create a credit default predictor ML model:
● Problem Statement Analysis 
The objective is to use the loan data to build a model that can predict the loan status
(whether a loan will be fully paid or charged off) and to interpret the results to find 
the factors that affect the outcome. We want to develop a model that is accurate and interpretable.

● Reading Dataset and First look 
In this project, I will work with the publicly available Lending Club data 
from Kaggle (see the data link above).
● Column Information

● Looking for More Information

● Exploratory Data Analysis (EDA)

● Numerical Features Analysis

● Categorical Data Analysis 
The dataset has numerical and categorical features as well as missing values. 
Because there are so many features (150+), I will start the data 
engineering by dropping unnecessary features and then deal with 
the remaining features in more detail.

● Data PreProcessing

● Making the Baseline Model 
Because features are measured in different units, they cover different ranges. 
These ranges significantly affect ML models such as SVM and KNN, which rely on 
distance measures between observations, so scaling helps them learn. 
While some methods, such as Linear Regression and Random Forest, do not actually 
require feature scaling, it is still good practice to take this step when we are 
comparing multiple algorithms.

● Support Vector Machine
● Baseline models
● Choosing the models
● XG Boost Classifier
● Conclusion
● Recommendation
"""

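The feature scaling step described in the outline above can be sketched as follows. This is only a minimal illustration on made-up data, not the notebook's actual preprocessing: the toy feature matrix, the 75/25 split, and the choice of `MinMaxScaler` (which is already imported in the next cell) are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features on very different scales (not the Lending Club data):
# column 0 mimics a loan amount, column 1 mimics an interest rate in percent.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1_000, 35_000, 200), rng.uniform(5, 25, 200)])
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits,
# so that no information from the test set leaks into the scaling parameters.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

After scaling, both training columns span the same [0, 1] range, so a distance-based model such as KNN no longer lets the loan amount dominate the interest rate.
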
In [18]:
# Imports
# Import necessary libraries
!pip install rawdata
import os
import re
import pandas as pd
from pandas import Series, DataFrame
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import gzip
import chardet
import io
from IPython.display import display_html
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier,ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, cross_validate
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, make_scorer
from sklearn import metrics
from copy import copy
from functools import partial
from collections.abc import Iterable, Sequence, Mapping
from numbers import Number
from datetime import datetime
from distutils.version import LooseVersion
from numpy.core import numerictypes

from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
import time
import warnings

# Ignore deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

sns.set_style('whitegrid')
%matplotlib inline

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [19]:
# Load the data into a dataframe
# Note: nrows=5 reads only the first five rows; remove it to load the full file
df = pd.read_csv("loan.csv", nrows=5)
print(df.sample(5))
df.info(verbose=True)  # info() prints its summary directly and returns None

# Looking for more information
print('Loan Status:')
print(df['loan_status'].value_counts())
print('Term:')
print(df['term'].value_counts(dropna=False))
print('Loan Amount:')
print(df['loan_amnt'].value_counts())
print('State:')
print(df['addr_state'].describe())
print('Title:')
print(df['title'].value_counts().head(10))
df.sample(5)

        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
4  1075358    1311748       3000         3000             3000   60 months   
3  1076863    1277178      10000        10000            10000   36 months   
0  1077501    1296599       5000         5000             4975   36 months   
2  1077175    1313524       2400         2400             2400   36 months   
1  1077430    1314167       2500         2500             2500   60 months   

  int_rate  installment grade sub_grade  ... num_tl_90g_dpd_24m  \
4   12.69%        67.79     B        B5  ...                NaN   
3   13.49%       339.31     C        C1  ...                NaN   
0   10.65%       162.87     B        B2  ...                NaN   
2   15.96%        84.33     C        C5  ...                NaN   
1   15.27%        59.83     C        C4  ...                NaN   

  num_tl_op_past_12m pct_tl_nvr_dlq  percent_bc_gt_75 pub_rec_bankruptcies  \
4                NaN            NaN               

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,1077501,1296599,5000,5000,4975,36 months,10.65%,162.87,B,B2,...,,,,,0,0,,,,
4,1075358,1311748,3000,3000,3000,60 months,12.69%,67.79,B,B5,...,,,,,0,0,,,,
1,1077430,1314167,2500,2500,2500,60 months,15.27%,59.83,C,C4,...,,,,,0,0,,,,
2,1077175,1313524,2400,2400,2400,36 months,15.96%,84.33,C,C5,...,,,,,0,0,,,,
3,1076863,1277178,10000,10000,10000,36 months,13.49%,339.31,C,C1,...,,,,,0,0,,,,
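
The sample above shows `int_rate` stored as text such as `10.65%` and `term` as text such as `36 months`, so both need converting before they can be used as numerical features. The following is a minimal sketch on a toy frame that copies those visible values; the new column names `int_rate_pct` and `term_months` are hypothetical.

```python
import pandas as pd

# Toy frame copying values visible in the sample output above.
sample = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", "15.96%", "13.49%", "12.69%"],
    "term": ["36 months", "60 months", "36 months", "36 months", "60 months"],
})

# Strip the percent sign and convert the rate to float.
sample["int_rate_pct"] = sample["int_rate"].str.rstrip("%").astype(float)

# Extract the number of months and convert it to integer.
# The regex tolerates leading spaces in the raw strings.
sample["term_months"] = (
    sample["term"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(sample.dtypes)
```
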


In [20]:
# ● Exploratory Data Analysis (EDA)
# Define a function to summarize the features with missing values:
# count, % of total values, and datatype
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)
    # Datatype of each column
    mis_val_type = df.dtypes
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, mis_val_type], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values', 2: 'type'})

    # Keep only columns with missing values, sorted by percentage missing (descending)
    mis_val_table_ren_columns = (
        mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 1] != 0]
        .sort_values('% of Total Values', ascending=False)
        .round(1))

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [21]:
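print(sorted(df.columns))  # column names in alphabetical order
#print(len(drop_list))
print(len(df.columns))
print(df['id'].sample(5))  # call sample(); printing the bare method shows no data
print(df.describe())       # likewise, describe must be called
print(df['term'].sample(5))

missing_values_table(df)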

['acc_now_delinq', 'acc_open_past_24mths', 'addr_state', 'all_util', 'annual_inc', 'annual_inc_joint', 'application_type', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'collection_recovery_fee', 'collections_12_mths_ex_med', 'delinq_2yrs', 'delinq_amnt', 'desc', 'dti', 'dti_joint', 'earliest_cr_line', 'emp_length', 'emp_title', 'funded_amnt', 'funded_amnt_inv', 'grade', 'home_ownership', 'id', 'il_util', 'initial_list_status', 'inq_fi', 'inq_last_12m', 'inq_last_6mths', 'installment', 'int_rate', 'issue_d', 'last_credit_pull_d', 'last_pymnt_amnt', 'last_pymnt_d', 'loan_amnt', 'loan_status', 'max_bal_bc', 'member_id', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_last_delinq', 'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'next_pymnt_d', 'num_accts_ever

Unnamed: 0,Missing Values,% of Total Values,type
bc_util,5,100.0,float64
num_il_tl,5,100.0,float64
mo_sin_rcnt_rev_tl_op,5,100.0,float64
mo_sin_rcnt_tl,5,100.0,float64
mort_acc,5,100.0,float64
mths_since_recent_bc,5,100.0,float64
mths_since_recent_bc_dlq,5,100.0,float64
mths_since_recent_inq,5,100.0,float64
mths_since_recent_revol_delinq,5,100.0,float64
num_accts_ever_120_pd,5,100.0,float64
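
Columns that are almost entirely missing, like those listed above, are natural candidates for the "dropping unnecessary features" step. A minimal sketch of that idea on a toy frame follows; the 50% cutoff (`max_missing`) is an assumed value, not one taken from this notebook, and `drop_list` simply reuses the name from the commented-out line above.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully missing column, one partly missing, one complete.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
    "bc_util": [np.nan] * 5,                  # 100% missing
    "emp_title": ["a", None, "c", "d", "e"],  # 20% missing
})

max_missing = 0.5  # assumed cutoff: drop columns with more than 50% missing

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isnull().mean()
drop_list = sorted(missing_share[missing_share > max_missing].index)

toy_reduced = toy.drop(columns=drop_list)
print(drop_list)
print(toy_reduced.columns.tolist())
```
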


In [22]:
# make general plots to examine each feature
def plot_var(col_name, full_name, continuous):
    """
    Visualize a variable with/without faceting on the loan status.
    - col_name is the variable name in the dataframe
    - full_name is the full variable name
    - continuous is True for continuous variables
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=False, figsize=(15,3))
    # plot1: count distribution of the variable
    if continuous:
        # histplot replaces the deprecated distplot (seaborn >= 0.11)
        sns.histplot(df.loc[df[col_name].notnull(), col_name], kde=False, ax=ax1)
    else:
        # pass the column by keyword; recent seaborn versions reject positional data
        sns.countplot(x=col_name, data=df, order=sorted(df[col_name].unique()), color='#5975A4', saturation=1, ax=ax1)
    ax1.set_xlabel(full_name)
    ax1.set_ylabel('Count')
    ax1.set_title(full_name)

    # plot2: box plot (continuous) or charged-off rate bar plot (categorical) by loan_status
    if continuous:
        sns.boxplot(x=col_name, y='loan_status', data=df, ax=ax2)
        ax2.set_ylabel('')
        ax2.set_title(full_name + ' by Loan Status')
    else:
        Charged_Off_rates = df.groupby(col_name)['loan_status'].value_counts(normalize=True).loc[:, 'Charged Off']
        sns.barplot(x=Charged_Off_rates.index, y=Charged_Off_rates.values, color='#5975A4', saturation=1, ax=ax2)
        ax2.set_ylabel('Fraction of Loans Charged Off')
        ax2.set_title('Charged Off Rate by ' + full_name)
        ax2.set_xlabel(full_name)

    # plot3: kde plot (continuous) or count plot (categorical) grouped by loan_status
    if continuous:
        # 'height' replaces the old 'size' argument; 'fill' replaces 'shade'
        facet = sns.FacetGrid(df, hue='loan_status', height=3, aspect=4)
        facet.map(sns.kdeplot, col_name, fill=True)
        #facet.set(xlim=(df[col_name].min(), df[col_name].max()))
        facet.add_legend()
    else:
        fig = plt.figure(figsize=(12,3))
        sns.countplot(x=col_name, hue='loan_status', data=df, order=sorted(df[col_name].unique()))
     
    plt.tight_layout()
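
The charged-off rate per category that the categorical branch of `plot_var` plots can also be computed on its own, which makes it easy to inspect the numbers behind the bar chart. The sketch below uses a made-up toy frame, and it takes a slightly different route from the function above: it averages a boolean indicator instead of using `value_counts(normalize=True)`.

```python
import pandas as pd

# Toy frame with a categorical feature and the loan status (made-up values).
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off", "Charged Off"],
})

# The mean of a True/False indicator is the fraction of True values,
# i.e. the fraction of loans in each grade that were charged off.
charged_off_rates = (
    toy["loan_status"].eq("Charged Off").groupby(toy["grade"]).mean()
)
print(charged_off_rates)
```

With this variant, a category that has no charged-off loans still appears, with a rate of 0.
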


    

In [8]:
#Feature Selection II: (Linear) correlation between (numerical) features
#5.1 Convert target variable to 0/1 indicator

def create_target(df):
    """
    Creates target variable by converting loan_status to 0/1 indicator
    """
    df['target'] = np.where(df['loan_status'] == 'Charged Off', 1, 0)
    df.drop('loan_status', axis=1, inplace=True)
    return df
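
Note that `create_target` maps every status other than 'Charged Off' to 0, including loans that are still running. Since the problem statement is about fully paid versus charged off loans, one option is to keep only those two statuses before building the target. This is a sketch of that assumption on a toy frame, not a step taken from the notebook; the status 'Current' is used here only as an example of an unfinished loan.

```python
import numpy as np
import pandas as pd

# Toy frame with one loan that has not reached a final status yet.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000],
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
})

# Keep only loans with a final outcome (assumed set of final statuses).
final_statuses = ["Fully Paid", "Charged Off"]
closed = toy[toy["loan_status"].isin(final_statuses)].copy()

# Same 0/1 encoding as create_target: 1 = Charged Off, 0 = Fully Paid.
closed["target"] = np.where(closed["loan_status"] == "Charged Off", 1, 0)
closed = closed.drop(columns="loan_status")
print(closed)
```
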

In [23]:
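#5.2 K-S test
from scipy import stats  # provides ks_2samp


def ks_test(df, col, target):
    """
    Performs the two-sample Kolmogorov-Smirnov test on a numerical feature,
    comparing its distribution for target == 0 against target == 1
    """
    # Drop missing values so they do not distort the test
    ks = stats.ks_2samp(df.loc[df[target] == 0, col].dropna(),
                        df.loc[df[target] == 1, col].dropna())
    return ks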
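
To show how the K-S result can be read, here is a small sketch on synthetic data. The three samples are made up (they loosely mimic interest rates) and do not come from the Lending Club file; the point is only that a feature whose distribution differs between the two groups gives a large statistic and a small p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic feature values for the two target groups (assumed, for illustration).
paid = rng.normal(10.0, 2.0, 500)      # target == 0
charged = rng.normal(14.0, 2.0, 500)   # target == 1, shifted distribution
same = rng.normal(10.0, 2.0, 500)      # drawn from the same distribution as 'paid'

# ks_2samp returns the K-S statistic (largest gap between the two
# empirical distribution functions) and the corresponding p-value.
shifted = stats.ks_2samp(paid, charged)
similar = stats.ks_2samp(paid, same)

print(shifted.statistic, shifted.pvalue)
print(similar.statistic, similar.pvalue)
```
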

In [24]:
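#5.3 Pearson Correlation

def corr_features(df, target, threshold):
    """
    Returns the numerical features whose absolute Pearson correlation
    with the target exceeds the threshold
    """
    # Restrict to numeric columns; the target itself is excluded from the result
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    corr = corr.sort_values(ascending=False)
    corr_feats = corr[corr.abs() > threshold].index.tolist()
    return corr_feats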
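
The correlation filter can be tried out on synthetic data as sketched below. The columns `signal`, `noise`, and `grade` and the threshold of 0.3 are all made up for illustration, and the selection is written inline, restricted to numeric columns, rather than calling the function above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, n)

toy = pd.DataFrame({
    "target": target,
    "signal": target + rng.normal(0, 0.5, n),  # built to correlate with target
    "noise": rng.normal(0, 1, n),              # independent of target
    "grade": rng.choice(list("ABC"), n),       # non-numeric, so it is skipped
})

# Pearson correlation of every numeric column with the target,
# leaving out the target's correlation with itself.
corr = toy.select_dtypes(include="number").corr()["target"].drop("target")
selected = corr[corr.abs() > 0.3].index.tolist()

print(corr)
print(selected)
```
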

In [25]:
#6.1 Dummy variables


def create_dummies(df, cat_cols):
    """
    Creates dummy variables for categorical features
    """
    for col in cat_cols:
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df, dummies], axis=1)
    df.drop(cat_cols, axis=1, inplace=True)
    return df
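
The same encoding can be sketched in a single call by passing the `columns` argument to `pd.get_dummies`. The toy frame below is made up; with `drop_first=True`, the first level of each categorical column is dropped to avoid redundant dummy columns.

```python
import pandas as pd

# Toy frame with one numerical and two categorical columns (made-up values).
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "grade": ["B", "C", "C"],
    "term": ["36 months", "60 months", "36 months"],
})

# Encode both categorical columns at once; the original columns are
# replaced by their dummy columns, prefixed with the column name.
encoded = pd.get_dummies(toy, columns=["grade", "term"], drop_first=True)
print(encoded.columns.tolist())
```
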