# Credit Risk Classification


* [**Introduction**](#introduction)
* [**Importing libraries and data**](#import)
* [**Data Preprocessing**](#preprocessing)
* [**Exploratory Data Analysis**](#exploratory_data_analysis)
* [**Feature Engineering and Selection**](#feature_engineering)
* [**Machine Learning**](#model)
* [**Conclusions**](#conclusions)

<a id='introduction'></a>

## Introduction

## The business question

### <span style="color:blue"> Credit Risk Classification helps in understanding what factors are most responsible for credit defaults.</span>

**The goal:** Identify the features that contribute to loan default or repayment.

<ins>**How does this help the UON ML 2022 class?**</ins>
* Gauge the technical skills required to prepare data and build classification ML models.
* Understand the factors that influence the probability of credit default.

## Solution

* **Assumptions:**
  - Credit risk can be described by the features in the dataset.

* Using the data provided, Group F builds a classifier for credit risk based on the features provided.

* Three classifiers are compared: k-nearest neighbours, a decision tree and Gaussian naive Bayes.
* Precision, recall and F1-score are used as evaluation metrics.

<a id='import'></a>

## Importing libraries and data

### Libraries and settings

In [186]:
import warnings
warnings.simplefilter('ignore')

# %matplotlib widget
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
import seaborn as sns
sns.set(font_scale = 2)


import plotly.offline as py  # offline plotting interface for plotly
py.init_notebook_mode(connected=True)  # render plotly figures inside the notebook
import plotly.graph_objs as go  # figure and trace classes (plotly's counterpart to "plt")
import plotly.tools as tls  # helper utilities such as make_subplots

# Internal ipython tool for setting figure size
from IPython.core.pylabtools import figsize

# Counter for tallying feature values
from collections import Counter 
import pandas as pd

import numpy as np
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 350)

data = pd.read_csv('../data/credit_risk.csv')

In [187]:
data.head()

Unnamed: 0,ID,person_age,person_income,person_home_ownership,person_emp_length,...,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,30786,41,40000,RENT,3.0,...,7.49,0,0.23,N,13
1,29460,44,28000,OWN,0.0,...,8.94,0,0.13,N,12
2,7059,22,56000,RENT,0.0,...,11.36,0,0.13,N,2
3,5377,24,45000,MORTGAGE,2.0,...,7.29,0,0.16,N,4
4,27170,28,55000,RENT,3.0,...,17.06,0,0.27,Y,5


In [188]:
data.info() # 8 numerical features (plus the ID column), 4 categorical features

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24435 entries, 0 to 24434
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          24435 non-null  int64  
 1   person_age                  24435 non-null  int64  
 2   person_income               24435 non-null  int64  
 3   person_home_ownership       24435 non-null  object 
 4   person_emp_length           23778 non-null  float64
 5   loan_intent                 24435 non-null  object 
 6   loan_grade                  24435 non-null  object 
 7   loan_amnt                   24435 non-null  int64  
 8   loan_int_rate               22113 non-null  float64
 9   loan_status                 24435 non-null  int64  
 10  loan_percent_income         24435 non-null  float64
 11  cb_person_default_on_file   24435 non-null  object 
 12  cb_person_cred_hist_length  24435 non-null  int64  
dtypes: float64(3), int64(6), object(4)

In [189]:
data.nunique() # ID is unique for all 24,435 rows

ID                            24435
person_age                       55
person_income                  3585
person_home_ownership             4
person_emp_length                33
loan_intent                       6
loan_grade                        7
loan_amnt                       704
loan_int_rate                   339
loan_status                       2
loan_percent_income              76
cb_person_default_on_file         2
cb_person_cred_hist_length       29
dtype: int64

<a id='preprocessing'></a>
## Data Preprocessing

In [190]:
# Typecasting of some columns
data['loan_percent_income']=data['loan_percent_income'].astype('float')
data['loan_amnt']=data['loan_amnt'].astype('float')
data['person_income']=data['person_income'].astype('float')
data['loan_int_rate']=data['loan_int_rate'].astype('float')
data['loan_intent']=data['loan_intent'].astype(str)
data['loan_grade']=data['loan_grade'].astype(str)

In [191]:
# Creating categories for Age
data['age_type'] = data['person_age'].apply(lambda row: "Under_18" if (row<18)
                                     else "Young_adult" if (row>=18) & (row<35)
                                    else "Adult" if (row>=35) & (row < 65)
                                      else "Retired"
                                     )
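
The same age bands can also be produced without a row-wise lambda. The following is a minimal sketch using `pd.cut`, assuming the same cut-offs as above (18, 35 and 65). The `ages` series and the `age_bands` name are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative ages (hypothetical values, not taken from the dataset)
ages = pd.Series([17, 18, 34, 35, 64, 65, 144], name="person_age")

# Left-closed bins reproduce the lambda: [18, 35) -> Young_adult, [35, 65) -> Adult
age_bands = pd.cut(
    ages,
    bins=[-np.inf, 18, 35, 65, np.inf],
    right=False,
    labels=["Under_18", "Young_adult", "Adult", "Retired"],
)
print(age_bands.tolist())
```

The result has a categorical dtype; `age_bands.astype(str)` converts it to plain strings if that is preferred.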

<a id='exploratory_data_analysis'></a>
## Exploratory Data Analysis

In [192]:
data.describe().T # Some columns have missing values

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,24435.0,16282.282873,9415.656115,1.0,8133.5,16296.0,24444.5,32580.0
person_age,24435.0,27.735543,6.333269,20.0,23.0,26.0,30.0,144.0
person_income,24435.0,65937.042112,65405.40636,4080.0,38400.0,55000.0,78735.0,6000000.0
person_emp_length,23778.0,4.778072,4.091264,0.0,2.0,4.0,7.0,123.0
loan_amnt,24435.0,9558.179865,6324.925845,500.0,5000.0,8000.0,12100.0,35000.0
loan_int_rate,22113.0,11.006377,3.241424,5.42,7.9,10.99,13.47,23.22
loan_status,24435.0,0.216984,0.4122,0.0,0.0,0.0,0.0,1.0
loan_percent_income,24435.0,0.169988,0.106785,0.0,0.09,0.15,0.23,0.83
cb_person_cred_hist_length,24435.0,5.807858,4.06661,2.0,3.0,4.0,8.0,30.0


In [193]:
#Checking for missing values

# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("The dataset has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
    
missing_values_table(data)

The dataset has 14 columns.
There are 2 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
loan_int_rate,2322,9.5
person_emp_length,657,2.7


<ins>**Observations**:</ins>

- There are a total of 24,435 records in the dataset. 
- Two columns have missing values: 'loan_int_rate' and 'person_emp_length'. We will remove records with missing values. 
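
The missing-value summary and the row removal can be tried on a small example first. This is a sketch on a toy frame with made-up values; only the column names are borrowed from the dataset, and `toy`, `missing_pct` and `toy_complete` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (hypothetical values)
toy = pd.DataFrame({
    "loan_int_rate": [7.49, np.nan, 11.36, np.nan],
    "person_emp_length": [3.0, 0.0, np.nan, 2.0],
    "loan_amnt": [9200, 3500, 7000, 7200],
})

# Share of missing values per column, in percent, largest first
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)

# Drop every row that has at least one missing value
toy_complete = toy.dropna()
print(len(toy), "->", len(toy_complete), "rows")
```

Because `dropna()` removes a row when any column is missing, the number of rows lost can exceed the count for any single column.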

In [194]:
data = data.dropna()
missing_values_table(data)

The dataset has 14 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values


### Default vs Non-Default

In [195]:
# First plot
trace0 = go.Bar(
            x = data[data['cb_person_default_on_file']== 'N']['cb_person_default_on_file'].value_counts().index.values,
            y = data[data['cb_person_default_on_file']== 'N']['cb_person_default_on_file'].value_counts().values,
            name='Good'
    )

# Second Plot
trace1 = go.Bar(
            x = data[data['cb_person_default_on_file']== 'Y']['cb_person_default_on_file'].value_counts().index.values,
            y = data[data['cb_person_default_on_file']== 'Y']['cb_person_default_on_file'].value_counts().values,
            name='Bad'
    )

data_viz = [trace0, trace1]

layout = go.Layout(
    yaxis=dict(
        title='Count'
    ),
    xaxis=dict(
        title='Default vs Non-Default'
    ),
    title='Target variable distribution'
)

fig = go.Figure(data=data_viz, layout=layout)

py.iplot(fig, filename='grouped-bar')

#### Credit History Length vs (Non) Default Distribution

In [196]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['cb_person_cred_hist_length'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['cb_person_cred_hist_length'].values.tolist()
df_cred_hist_length = data['cb_person_cred_hist_length'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Overall Credit History Length"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Credit History Length Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Overall Credit History Length', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Employment Length vs (Non) Default Distribution

In [197]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['person_emp_length'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['person_emp_length'].values.tolist()
df_cred_hist_length = data['person_emp_length'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Overall Employment Length"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Employment Length Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Overall Employment Length', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Age vs (Non) Default Distribution

In [198]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['age_type'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['age_type'].values.tolist()
df_cred_hist_length = data['age_type'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Age Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Age type Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Age Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Ownership vs Risk Distribution

In [199]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['person_home_ownership'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['person_home_ownership'].values.tolist()
df_cred_hist_length = data['person_home_ownership'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Home Ownership Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Home Ownership Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Home Ownership Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Loan Intent vs Default Distribution

In [200]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['loan_intent'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['loan_intent'].values.tolist()
df_cred_hist_length = data['loan_intent'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Loan Intent Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Loan Intent Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Loan Intent Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Loan Grade vs Default Distribution

In [201]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['loan_grade'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['loan_grade'].values.tolist()
df_cred_hist_length = data['loan_grade'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Loan Grade Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Loan Grade Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Loan Grade Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

#### Loan Status vs Default Distribution

In [202]:
df_good = data.loc[data['cb_person_default_on_file'] == 'N']['loan_status'].values.tolist()
df_bad = data.loc[data['cb_person_default_on_file'] == 'Y']['loan_status'].values.tolist()
df_cred_hist_length = data['loan_status'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad"
)
#Third plot
trace2 = go.Histogram(
    x=df_cred_hist_length,
    histnorm='probability',
    name="Loan Status Distribution"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'Loan Status Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Loan Status Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

In [203]:
no_default = data[data['cb_person_default_on_file']== 'N'].value_counts().sum()
yes_default = data[data['cb_person_default_on_file']== 'Y'].value_counts().sum()

print('No default: ', no_default)
print('Yes default: ', yes_default)

No default:  17679
Yes default:  3828


In [204]:
data_copy = data.copy()

# Map default and no default to 1 and 0
data_copy['credit_risk'] = data_copy['cb_person_default_on_file'].map( {'Y': 1, 'N': 0} ).astype(int)
data_copy.head()

Unnamed: 0,ID,person_age,person_income,person_home_ownership,person_emp_length,...,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,age_type,credit_risk
0,30786,41,40000.0,RENT,3.0,...,0.23,N,13,Adult,0
1,29460,44,28000.0,OWN,0.0,...,0.13,N,12,Adult,0
2,7059,22,56000.0,RENT,0.0,...,0.13,N,2,Young_adult,0
3,5377,24,45000.0,MORTGAGE,2.0,...,0.16,N,4,Young_adult,0
4,27170,28,55000.0,RENT,3.0,...,0.27,Y,5,Young_adult,1


In [205]:
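data_copy['credit_risk'] = data_copy['credit_risk'].astype('int64')
# Restrict to numeric columns; newer pandas versions raise an error when corr() meets text columns
data_copy.select_dtypes('number').corr()['credit_risk'].sort_values()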

person_emp_length            -0.033787
person_income                -0.003599
ID                            0.003959
cb_person_cred_hist_length    0.010688
person_age                    0.011661
loan_percent_income           0.038117
loan_amnt                     0.043223
loan_status                   0.180116
loan_int_rate                 0.502205
credit_risk                   1.000000
Name: credit_risk, dtype: float64

In [209]:
# Select the numeric columns
numeric_subset = data_copy.select_dtypes(['float64', 'int64'])

# Optional: add square-root transforms of the numeric columns to account
# for non-linear relationships. The transform is currently disabled,
# so this loop leaves numeric_subset unchanged.
for col in numeric_subset.columns:
    
    # Skip the credit risk and ID columns
    if col == 'credit_risk' or col == 'ID':
        continue
    # numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])

# Select the categorical columns we are interested in
categorical_subset = data_copy[['person_home_ownership', 'loan_intent', 'loan_grade']]

# One hot encode the categorical columns
categorical_subset = pd.get_dummies(categorical_subset, drop_first=True)

# Join the two dataframes (numerical and one hot encoded categorical columns) using concat
# Make sure to use axis = 1 to perform a column bind
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
# print(features.head())

# Drop data without risk (sanity step)
# features = features.dropna(subset = ['credit_risk'])

# Find correlations with credit risk
correlations = features.corr()['credit_risk'].sort_values()
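
The one-hot step above passes `drop_first=True`, which removes the first level of every categorical column and treats it as the baseline. That is why the feature set has no `person_home_ownership_MORTGAGE` indicator, for instance. A minimal sketch on a toy column (the rows are made up):

```python
import pandas as pd

# Toy categorical column (hypothetical rows)
toy = pd.DataFrame({"loan_grade": ["A", "B", "C", "A"]})

full = pd.get_dummies(toy, drop_first=False)
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # one indicator per level
print(list(reduced.columns))  # the first level (A) becomes the baseline
```

A row that belongs to the baseline level is encoded as all zeros in the reduced frame, so no information is lost.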

In [207]:
correlations.head(25)

loan_grade_B                  -0.318180
person_emp_length             -0.033787
loan_intent_EDUCATION         -0.011917
loan_intent_VENTURE           -0.003768
person_income                 -0.003599
loan_intent_PERSONAL          -0.000134
person_home_ownership_OWN      0.000367
ID                             0.003959
loan_intent_MEDICAL            0.005873
cb_person_cred_hist_length     0.010688
loan_intent_HOMEIMPROVEMENT    0.010868
person_age                     0.011661
person_home_ownership_OTHER    0.014176
loan_percent_income            0.038117
loan_amnt                      0.043223
loan_grade_G                   0.047205
person_home_ownership_RENT     0.053818
loan_grade_F                   0.061402
loan_grade_E                   0.138296
loan_status                    0.180116
loan_grade_D                   0.322123
loan_grade_C                   0.422663
loan_int_rate                  0.502205
credit_risk                    1.000000
Name: credit_risk, dtype: float64

<ins>**Observations**:</ins>

- The features vary widely across records, both in range and in distribution.
- The dataset is imbalanced: 17,679 records fall in the good (no default) class against 3,828 in the bad (default) class.
- Categorical variables appear to influence default rates strongly, so they are included in the model.
- Apart from loan_int_rate (0.50) and loan_status (0.18), the numerical variables show only weak correlation with credit risk.
- Among the categorical features, loan grade has the clearest effect on credit risk: grades C (0.42) and D (0.32) correlate positively and grade B (-0.32) negatively.
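
To put the class imbalance in numbers, the short calculation below uses the class counts printed earlier (17,679 and 3,828). It also derives the accuracy of a trivial model that always predicts "no default", which is a useful reference point when reading accuracy figures later on. The variable names are illustrative.

```python
# Class counts reported above for cb_person_default_on_file
no_default = 17679
yes_default = 3828
total = no_default + yes_default

# Share of each class in the cleaned dataset
share_no_default = no_default / total
share_default = yes_default / total

# Accuracy of a trivial model that always predicts "no default"
baseline_accuracy = share_no_default

print(f"no default: {share_no_default:.3f}, default: {share_default:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```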

<a id='feature_engineering'></a>
## Feature Engineering and Selection

In [167]:
# Remove highly correlated features
def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than or equal to the threshold. Removing collinear features can
        help a model to generalize and improves the interpretability of the model.
        
    Inputs: 
        x: dataframe of features that also holds the 'credit_risk' and 'ID' columns
        threshold: features with absolute correlations at or above this value are removed
    
    Output: 
        dataframe that contains only the non-highly-collinear features
    '''
    
    # Set the target aside so that it is never dropped as a collinear feature
    y = x['credit_risk']
    x = x.drop(columns = ['credit_risk', 'ID'])
    
    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)
            
            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns = drops)
    
    
    # Add the credit risk target back to the data
    x['credit_risk'] = y
               
    return x

In [168]:
# Remove the collinear features above a specified correlation coefficient
features = remove_collinear_features(features, 0.9);
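
The collinearity check can also be written in vectorized form. The sketch below is an alternative, not the function used above: `collinear_columns` is a hypothetical helper that inspects every pair of columns through the upper triangle of the correlation matrix, so it may flag pairs that the nested loop does not visit.

```python
import numpy as np
import pandas as pd

def collinear_columns(df, threshold):
    """Return columns whose absolute correlation with an earlier column
    is at or above the threshold (illustrative helper, not from the notebook)."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Toy data: b is an exact multiple of a, c is only weakly related (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
to_drop = collinear_columns(toy, threshold=0.9)
print(to_drop)
```

Dropping the returned columns with `toy.drop(columns=to_drop)` keeps the first member of each correlated pair.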

<a id='model'></a>
## Machine Learning

In [169]:
# Splitting data into training and testing
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score

# Import LabelEncoder
from sklearn import preprocessing

# Metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, fbeta_score #To evaluate our model

from sklearn.model_selection import GridSearchCV

# Imbalanced dataset
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# ML Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [170]:
features.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,...,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,credit_risk
0,41,40000.0,3.0,9200.0,7.49,...,0,0,0,0,0
1,44,28000.0,0.0,3500.0,8.94,...,0,0,0,0,0
2,22,56000.0,0.0,7000.0,11.36,...,0,0,0,0,0
3,24,45000.0,2.0,7200.0,7.29,...,0,0,0,0,0
4,28,55000.0,3.0,15000.0,17.06,...,0,1,0,0,1


In [171]:
features.columns

Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length', 'person_home_ownership_OTHER', 'person_home_ownership_OWN', 'person_home_ownership_RENT', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL',
       'loan_intent_VENTURE', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E', 'loan_grade_F', 'loan_grade_G', 'credit_risk'],
      dtype='object')

In [172]:
X = features.drop('credit_risk', axis=1)
y = features['credit_risk']

In [173]:
# For handling imbalanced data
# Inverse class weights: each class is weighted by the size of the other class
weights = {0:features[features['credit_risk']== 1].value_counts().sum(), 1:features[features['credit_risk']== 0].value_counts().sum()}
print(weights)

print('Ratio of default to no default: ', weights[0]/weights[1])

{0: 3828, 1: 17679}
Ratio of default to no default:  0.21652808416765654


In [174]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(15054, 22)
(6453, 22)
(15054,)
(6453,)
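
With an imbalanced target, a plain random split can leave the two partitions with slightly different class ratios. One possible variation, which the notebook does not use, is to pass `stratify` to `train_test_split`. The sketch below runs on toy arrays; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives (hypothetical data)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# Share of positives in each partition
print(y_tr.mean(), y_te.mean())
```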


In [230]:
test_set = pd.read_csv('../data/test_set.csv')

test_set = test_set.dropna()

def prep_external_test_data(df):
    """Build the model feature matrix for an external data set.

    Mirrors the feature preparation used for training: numeric columns
    plus one-hot encoded categorical columns, without the ID column.
    """
    # Select the numeric columns of the frame that was passed in
    numeric_subset = df.select_dtypes(['float', 'int'])

    # One-hot encode the same categorical columns as in training
    categorical_subset = df[['person_home_ownership', 'loan_intent', 'loan_grade']]
    categorical_subset = pd.get_dummies(categorical_subset, drop_first=True)

    # Column-bind both subsets and drop the identifier
    test_features = pd.concat([numeric_subset, categorical_subset], axis = 1)
    test_features = test_features.drop('ID', axis=1)
    
    return test_features


In [231]:
prep_external_test_data(test_set).columns

Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length', 'person_home_ownership_OTHER', 'person_home_ownership_OWN', 'person_home_ownership_RENT', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL',
       'loan_intent_VENTURE', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E', 'loan_grade_F', 'loan_grade_G'],
      dtype='object')

In [232]:
X_train.columns

Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length', 'person_home_ownership_OTHER', 'person_home_ownership_OWN', 'person_home_ownership_RENT', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL',
       'loan_intent_VENTURE', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E', 'loan_grade_F', 'loan_grade_G'],
      dtype='object')
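
The two column listings above match, but that is not guaranteed in general: if a category never occurs in an external file, `get_dummies` produces fewer columns than the model was trained on. A hedged sketch of one way to guard against this uses `reindex`; the frames and the shortened column list below are made up.

```python
import pandas as pd

# Column order seen at training time (shortened, hypothetical subset)
train_columns = ["loan_amnt", "loan_grade_B", "loan_grade_C"]

# External rows in which grade C never occurs
external = pd.DataFrame({"loan_amnt": [5000, 12000], "loan_grade": ["A", "B"]})
external_features = pd.get_dummies(external, drop_first=True)

# Add any missing indicator columns as zeros and enforce the training order
aligned = external_features.reindex(columns=train_columns, fill_value=0)
print(list(aligned.columns))
```

In the notebook this would correspond to reindexing the prepared test features against `X_train.columns` before calling `predict`.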

In [236]:
# prepare models
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))

for name, model in models:
    print(model)
    
    # define imbalanced pipeline
    steps = [('over', RandomOverSampler(sampling_strategy='not majority')), ('under', RandomUnderSampler(sampling_strategy='majority')), ('model', model)]
    pipeline = Pipeline(steps=steps)
    
    # predict on internal test set
    model = pipeline.fit(X_train, y_train)
    model_pred = model.predict(X_test)
    print(classification_report(y_test, model_pred), '\n')
    
    model_pred = model.predict(prep_external_test_data(test_set))    
    pd.DataFrame({'ID': test_set['ID'], 'credit_risk': model_pred}).to_csv('../data/predictions/'+ name + '_credit_risk_submission.csv', index=False)
    
    
    

KNeighborsClassifier()
              precision    recall  f1-score   support

           0       0.86      0.65      0.74      5331
           1       0.23      0.51      0.32      1122

    accuracy                           0.62      6453
   macro avg       0.55      0.58      0.53      6453
weighted avg       0.75      0.62      0.66      6453
 

DecisionTreeClassifier()
              precision    recall  f1-score   support

           0       0.89      0.89      0.89      5331
           1       0.48      0.47      0.48      1122

    accuracy                           0.82      6453
   macro avg       0.69      0.68      0.69      6453
weighted avg       0.82      0.82      0.82      6453
 

GaussianNB()
              precision    recall  f1-score   support

           0       0.98      0.71      0.82      5331
           1       0.41      0.95      0.57      1122

    accuracy                           0.75      6453
   macro avg       0.70      0.83      0.70      6453
weighted 
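
The pipeline above relies on `RandomOverSampler` from imbalanced-learn to balance the classes before fitting. To illustrate the idea only, the sketch below duplicates minority rows at random with NumPy until every class has as many rows as the largest one. `random_oversample` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample each class with replacement up to the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the extra rows needed for this class (none for the majority class)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

# Toy data: 8 rows of class 0 and 2 rows of class 1 (hypothetical)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X_toy, y_toy)
print(np.bincount(y_res))
```

Inside an imbalanced-learn `Pipeline`, resampling is applied only while fitting, so the test data keep their original class ratio.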

### Predict on the external test set

In [53]:
# prepare models
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))

for name, model in models:    
    # define imbalanced pipeline
    steps = [('over', RandomOverSampler(sampling_strategy='not majority')), ('under', RandomUnderSampler(sampling_strategy='majority')), ('model', model)]
    pipeline = Pipeline(steps=steps)

    # Fit the whole pipeline first; calling predict on the bare, unfitted model raises NotFittedError
    pipeline.fit(X_train, y_train)
    model_pred = pipeline.predict(prep_external_test_data(test_set))
    
    # Use an f-string so that {name} is substituted into the file name
    pd.DataFrame({'ID': test_set['ID'], 'credit_risk': model_pred}).to_csv(f'../data/predictions/{name}_credit_risk_submission.csv', index=False)


In [35]:
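# knn is assumed to be a default KNeighborsClassifier; the cell that defined it is not shown
knn = KNeighborsClassifier()
steps = [('over', RandomOverSampler(sampling_strategy='not majority')), ('model', knn)]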
pipeline = Pipeline(steps=steps)

kfold = RepeatedStratifiedKFold(n_splits=12, n_repeats=3, random_state=1)
cv_results = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='f1_micro')
print(np.mean(cv_results))

0.6107791587504315
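
A note on the `f1_micro` scoring used above: for single-label classification, the micro-averaged F1 score is equal to plain accuracy, so the cross-validation figure reads as a mean accuracy over the 36 folds (12 splits, 3 repeats). The sketch below checks this equality on made-up labels.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_hat = [0, 1, 0, 1, 0, 0, 1, 0]

micro_f1 = f1_score(y_true, y_hat, average="micro")
accuracy = accuracy_score(y_true, y_hat)
print(micro_f1, accuracy)
```

If the default class is the main concern, a class-specific score such as `scoring='f1'` may be more informative than accuracy on imbalanced data.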


In [36]:
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

# Predictions and Evaluations
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred_knn))
 
print(classification_report(y_test, y_pred_knn))

[[5064  267]
 [1032   90]]
              precision    recall  f1-score   support

           0       0.83      0.95      0.89      5331
           1       0.25      0.08      0.12      1122

    accuracy                           0.80      6453
   macro avg       0.54      0.52      0.50      6453
weighted avg       0.73      0.80      0.75      6453
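
The figures for class 1 in the report can be traced back to the confusion matrix printed with it. The short calculation below uses those four counts, with rows taken as true classes and columns as predictions, which is the layout `confusion_matrix` returns.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 5064, 267
fn, tp = 1032, 90

precision = tp / (tp + fp)  # share of predicted defaults that are real defaults
recall = tp / (tp + fn)     # share of real defaults that were detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The accuracy of 0.80 sits close to the share of the majority class, while recall for defaults is only 0.08, so accuracy alone overstates how well this model identifies defaults.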

