# Loan Delinquency Prediction

### Problem Statement: 

#### Loan default prediction is one of the most critical problems faced by financial institutions, because defaults directly affect their profitability. In recent years, the volume of non-performing loans has increased sharply, which jeopardizes the growth of these institutions.

#### To maintain a healthy portfolio, banks put stringent monitoring and evaluation measures in place to ensure timely repayment of loans by borrowers. Despite these measures, a major proportion of loans become delinquent. Delinquency occurs when a borrower misses a payment on a loan.

#### Given mortgage details, borrower details, and payment details, our objective is to identify the delinquency status of each loan for the next month, given its delinquency status over the previous 12 months (expressed in number of months).

### Solution: 
#### The notebook trains several machine learning models (K-Nearest Neighbours, Logistic Regression, Random Forest, Naive Bayes, XGBoost, AdaBoost, Gradient Boosting, and Decision Tree) on the training data to predict the delinquency status of loans for the next month, given the delinquency status for the previous 12 months. The Random Forest model achieves the best F1-score of these models, so we use a Random Forest classifier with hyperparameters tuned by grid search to identify the delinquency status. Finally, we write the predicted values to a CSV file.
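
As a rough illustration of the approach described above, the following sketch tunes a Random Forest with `GridSearchCV` and ranks candidates by F1. It runs on synthetic data from `make_classification`, not on the loan data, and the parameter grid is a hypothetical, deliberately tiny one; the grid actually used in this notebook is not shown in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption: ~10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Hypothetical grid, kept small so the search finishes quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # rank candidates by F1 on the positive class
    cv=3,
)
search.fit(X_tr, y_tr)

# Score the refitted best model on the held-out split.
best_forest = search.best_estimator_
test_f1 = f1_score(y_te, best_forest.predict(X_te))
print(search.best_params_, round(test_f1, 3))
```

Scoring by F1 rather than accuracy matters here because the positive (delinquent) class is rare, so accuracy alone would reward a model that never predicts delinquency.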

## Importing the Libraries

In [137]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # required by the sns.set_* calls in the visualisation setup
#import missingno as msno
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix,f1_score
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

#### Setting Basic Configuration for Visualisation

%matplotlib inline

plt.rcParams['figure.figsize'] = [20.0, 7.0]
plt.rcParams.update({'font.size': 22,})

sns.set_palette('viridis')
sns.set_style('white')
sns.set_context('talk', font_scale=0.8)

### Loading the DataSet

In [138]:
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_test_new = pd.read_csv('test.csv')  # second copy of the test set

### Getting Information About the DataSet

In [139]:
data_train.columns
data_test.columns

Index(['loan_id', 'source', 'financial_institution', 'interest_rate',
       'unpaid_principal_bal', 'loan_term', 'origination_date',
       'first_payment_date', 'loan_to_value', 'number_of_borrowers',
       'debt_to_income_ratio', 'borrower_credit_score', 'loan_purpose',
       'insurance_percent', 'co-borrower_credit_score', 'insurance_type'],
      dtype='object')

In [140]:
#train_data=data.drop('loan_id','financial_institution','origination_date','first_payment_date',)
print(data_train.m13.value_counts())


0    115422
1       636
Name: m13, dtype: int64
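
The counts above show a heavily imbalanced target. The short calculation below, using only the two numbers printed by `value_counts()`, makes the imbalance concrete. The negative-to-positive ratio is included because it is a commonly suggested starting value for XGBoost's `scale_pos_weight`; treating it that way here is a suggestion, not something this notebook does.

```python
# Class counts copied from the value_counts() output above.
n_negative = 115422  # m13 == 0 (not delinquent)
n_positive = 636     # m13 == 1 (delinquent)
n_total = n_negative + n_positive

# Share of delinquent loans in the training data.
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts "not delinquent".
majority_accuracy = n_negative / n_total

# Negative-to-positive ratio (a common heuristic for scale_pos_weight).
neg_pos_ratio = n_negative / n_positive

print(f"positive rate:     {positive_rate:.4%}")
print(f"majority accuracy: {majority_accuracy:.4%}")
print(f"neg/pos ratio:     {neg_pos_ratio:.1f}")
```

Because a constant "no delinquency" prediction is already correct for more than 99% of rows, accuracy says little about model quality on this data, which is why the F1-score is used for comparison.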


In [141]:
data_train = data_train.drop(['loan_id', 'financial_institution', 'origination_date',
                              'first_payment_date', 'source', 'loan_purpose'], axis=1)  # Dropping irrelevant columns

In [142]:
columns_data = data_train.columns  # Storing the column names of the training dataset

In [143]:
data_train.info()     #Info. about the training data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116058 entries, 0 to 116057
Data columns (total 11 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   interest_rate             116058 non-null  float64
 1   unpaid_principal_bal      116058 non-null  int64  
 2   loan_term                 116058 non-null  int64  
 3   loan_to_value             116058 non-null  int64  
 4   number_of_borrowers       116058 non-null  int64  
 5   debt_to_income_ratio      116058 non-null  int64  
 6   borrower_credit_score     116058 non-null  int64  
 7   insurance_percent         116058 non-null  int64  
 8   co-borrower_credit_score  116058 non-null  int64  
 9   insurance_type            116058 non-null  int64  
 10  m13                       116058 non-null  int64  
dtypes: float64(1), int64(10)
memory usage: 9.7 MB


## Data Pre-Processing

#### Encoding the Categorical Data

In [144]:
data_train=data_train[['interest_rate', 'unpaid_principal_bal', 'loan_term', 'loan_to_value', 'number_of_borrowers', 'debt_to_income_ratio', 'borrower_credit_score', 'insurance_percent', 'co-borrower_credit_score', 'insurance_type','m13']]
columns_data=data_train.columns
print(data_train.info())
print(data_train.describe())
print(columns_data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116058 entries, 0 to 116057
Data columns (total 11 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   interest_rate             116058 non-null  float64
 1   unpaid_principal_bal      116058 non-null  int64  
 2   loan_term                 116058 non-null  int64  
 3   loan_to_value             116058 non-null  int64  
 4   number_of_borrowers       116058 non-null  int64  
 5   debt_to_income_ratio      116058 non-null  int64  
 6   borrower_credit_score     116058 non-null  int64  
 7   insurance_percent         116058 non-null  int64  
 8   co-borrower_credit_score  116058 non-null  int64  
 9   insurance_type            116058 non-null  int64  
 10  m13                       116058 non-null  int64  
dtypes: float64(1), int64(10)
memory usage: 9.7 MB
None
       interest_rate  unpaid_principal_bal      loan_term  loan_to_value  \
count  116058.000000         

#### Feature Scaling

In [145]:
np_data = data_train.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(np_data)
data_scaled = pd.DataFrame(x_scaled)
df=data_scaled

#df
#data_scaled #print the scaled_data
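
The cell above fits the scaler on every column of `data_train`, including the target `m13`, so the fitted scaler cannot be applied directly to test data that lacks that column. A minimal sketch of an alternative, on hypothetical toy values, is to fit `MinMaxScaler` on the feature columns only and reuse the fitted object for new rows:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices (hypothetical values); the target column is kept out.
X_train_raw = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_new_raw = np.array([[2.0, 20.0]])

feature_scaler = MinMaxScaler()
# Learn the per-column minimum and maximum from the training features only.
X_train_scaled = feature_scaler.fit_transform(X_train_raw)
# Reuse the same minimum and maximum for unseen rows.
X_new_scaled = feature_scaler.transform(X_new_raw)

print(X_train_scaled)
print(X_new_scaled)
```

Each value is mapped to `(x - min) / (max - min)` using the training minimum and maximum, so new rows land on the same scale the model was trained on.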

#### X-Y Split

In [146]:
#df

In [147]:
y = df.iloc[:, -1:]   # Dependent variable (m13), kept as a one-column DataFrame
#df=df.drop([df.columns[-1]],  axis='columns')
X = df.iloc[:, 0:-1]  # Independent variables (all remaining columns)
#y
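
Slicing with `-1:` returns a one-column DataFrame, whereas indexing with `-1` returns a Series. scikit-learn style estimators expect a 1-D target and emit a `column_or_1d` conversion warning when given a column vector. The sketch below, on a tiny hypothetical frame, shows the difference in shape:

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); the last column plays the role of m13.
toy = pd.DataFrame({
    "interest_rate": [0.44, 0.58, 0.22],
    "loan_term": [1.0, 1.0, 0.4],
    "m13": [0.0, 1.0, 0.0],
})

y_frame = toy.iloc[:, -1:]     # slice   -> DataFrame with one column
y_series = toy.iloc[:, -1]     # integer -> 1-D Series
X_features = toy.iloc[:, :-1]  # every column except the last

print(y_frame.shape, y_series.shape, X_features.shape)
```

Passing the Series form (or `y_frame.values.ravel()`) to `fit` is one way to avoid the warning.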

In [148]:
model_columns = list(columns_data)
model_columns.remove('m13')
print(model_columns)
X.columns=model_columns


['interest_rate', 'unpaid_principal_bal', 'loan_term', 'loan_to_value', 'number_of_borrowers', 'debt_to_income_ratio', 'borrower_credit_score', 'insurance_percent', 'co-borrower_credit_score', 'insurance_type']


In [149]:
X.head(5)

| | interest_rate | unpaid_principal_bal | loan_term | loan_to_value | number_of_borrowers | debt_to_income_ratio | borrower_credit_score | insurance_percent | co-borrower_credit_score | insurance_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.444444 | 0.170732 | 1.0 | 0.978022 | 0.0 | 0.333333 | 0.82619 | 0.75 | 0.0 | 0.0 |
| 1 | 0.583333 | 0.111859 | 1.0 | 0.725275 | 0.0 | 0.68254 | 0.829762 | 0.0 | 0.0 | 0.0 |
| 2 | 0.222222 | 0.29857 | 0.4 | 0.472527 | 0.0 | 0.507937 | 0.928571 | 0.0 | 0.0 | 0.0 |
| 3 | 0.555556 | 0.104289 | 1.0 | 0.43956 | 1.0 | 0.68254 | 0.753571 | 0.0 | 0.763158 | 0.0 |
| 4 | 0.555556 | 0.095038 | 1.0 | 0.813187 | 0.0 | 0.666667 | 0.810714 | 0.0 | 0.0 | 0.0 |


In [150]:
model_columns

['interest_rate',
 'unpaid_principal_bal',
 'loan_term',
 'loan_to_value',
 'number_of_borrowers',
 'debt_to_income_ratio',
 'borrower_credit_score',
 'insurance_percent',
 'co-borrower_credit_score',
 'insurance_type']

#### Test Train Split

In [151]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
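
With so few delinquent loans, a plain random split can leave the two partitions with noticeably different positive rates. One option, sketched below on hypothetical toy data, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both partitions. This is a suggested variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, of which 20 (2%) are positive.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Positives are divided in the same 75/25 proportion as the rows.
print(y_tr.sum(), y_te.sum())
```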

In [152]:
X_train.describe()

| | interest_rate | unpaid_principal_bal | loan_term | loan_to_value | number_of_borrowers | debt_to_income_ratio | borrower_credit_score | insurance_percent | co-borrower_credit_score | insurance_type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 | 87043.0 |
| mean | 0.359752 | 0.166038 | 0.774007 | 0.675199 | 0.59327 | 0.472236 | 0.916595 | 0.070061 | 0.549903 | 0.003205 |
| std | 0.1026 | 0.096538 | 0.299299 | 0.190163 | 0.491227 | 0.154477 | 0.049999 | 0.203064 | 0.456871 | 0.056525 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.277778 | 0.091674 | 0.4 | 0.56044 | 0.0 | 0.349206 | 0.894048 | 0.0 | 0.0 | 0.0 |
| 50% | 0.361111 | 0.144659 | 1.0 | 0.725275 | 1.0 | 0.47619 | 0.929762 | 0.0 | 0.885167 | 0.0 |
| 75% | 0.416667 | 0.224558 | 1.0 | 0.813187 | 1.0 | 0.603175 | 0.95119 | 0.0 | 0.946172 | 0.0 |
| max | 0.972222 | 1.0 | 1.0 | 1.0 | 1.0 | 0.952381 | 1.0 | 1.0 | 1.0 | 1.0 |


## Training the Model(s)

#### XgBoost

In [153]:
classifier_xgboost = xgb.XGBClassifier(silent=False, 
                      scale_pos_weight=1,
                      learning_rate=0.01,  
                      colsample_bytree = 0.4,
                      subsample = 0.8,
                      objective='binary:logistic', 
                      n_estimators=100, 
                      reg_alpha = 0.3,
                      max_depth=4, 
                      gamma=1)
classifier_xgboost.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, gamma=1,
              learning_rate=0.01, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.3, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=False, subsample=0.8, verbosity=1)
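
The notebook imports `confusion_matrix` and `f1_score` for evaluating fitted classifiers. The sketch below shows how the two relate, using hand-written label vectors (hypothetical values, not predictions from the model above) so that the arithmetic can be checked by eye.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical ground truth and predictions for ten loans.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# F1 is the harmonic mean of precision and recall for the positive class.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(cm)
print(round(f1, 4), round(f1_manual, 4))
```

For a fitted model, `y_pred` would come from a call such as `classifier_xgboost.predict(X_test)` and `y_true` from `y_test`.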

In [154]:
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
joblib.dump(classifier_xgboost,'classification_model.pkl')
print('Model Dumped!!!!!!')

Model Dumped!!!!!!


In [156]:
loaded_model = joblib.load('classification_model.pkl')  # reload to check that the pickle is readable

joblib.dump(model_columns, 'model_columns.pkl')
print("Models columns dumped!")

Models columns dumped!
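
Saving the column list alongside the model lets serving code rebuild inputs in the training order. The round trip can be sketched as follows, assuming the standalone `joblib` package is installed; a temporary directory is used so that no files are left behind.

```python
import os
import tempfile

import joblib

# Feature names in training order, copied from the model_columns output above.
model_columns = [
    "interest_rate", "unpaid_principal_bal", "loan_term", "loan_to_value",
    "number_of_borrowers", "debt_to_income_ratio", "borrower_credit_score",
    "insurance_percent", "co-borrower_credit_score", "insurance_type",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_columns.pkl")
    joblib.dump(model_columns, path)       # serialise the list to disk
    restored_columns = joblib.load(path)   # read it back into memory

print(restored_columns == model_columns)
```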


In [157]:
classifier_xgboost.predict([0,0,0,0,0,0,0,0])

ValueError: feature_names mismatch: ['interest_rate', 'unpaid_principal_bal', 'loan_term', 'loan_to_value', 'number_of_borrowers', 'debt_to_income_ratio', 'borrower_credit_score', 'insurance_percent', 'co-borrower_credit_score', 'insurance_type'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7']
expected debt_to_income_ratio, co-borrower_credit_score, number_of_borrowers, borrower_credit_score, unpaid_principal_bal, insurance_type, loan_to_value, loan_term, insurance_percent, interest_rate in input data
training data did not have the following fields: f3, f4, f6, f0, f1, f7, f2, f5
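
The error above arises because the model was trained on a DataFrame with ten named columns, while the call passed a flat list of eight unnamed values. A minimal sketch of building a correctly shaped single-row input follows. The applicant values are hypothetical, and filling missing features with `0.0` is an assumption made only for illustration.

```python
import pandas as pd

# Feature names in training order, copied from the model_columns output above.
model_columns = [
    "interest_rate", "unpaid_principal_bal", "loan_term", "loan_to_value",
    "number_of_borrowers", "debt_to_income_ratio", "borrower_credit_score",
    "insurance_percent", "co-borrower_credit_score", "insurance_type",
]

# Hypothetical applicant, already on the scaled 0-1 range; two features are absent.
applicant = {
    "interest_rate": 0.44, "unpaid_principal_bal": 0.17, "loan_term": 1.0,
    "loan_to_value": 0.98, "number_of_borrowers": 0.0,
    "debt_to_income_ratio": 0.33, "borrower_credit_score": 0.83,
    "insurance_percent": 0.75,
}

# One row, columns in training order, absent features filled with 0.0 (assumption).
row = pd.DataFrame([applicant]).reindex(columns=model_columns, fill_value=0.0)

print(row.shape)
print(list(row.columns) == model_columns)
```

A frame built this way has the names and column order the fitted model expects, so it could be passed to `classifier_xgboost.predict(row)`.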

In [130]:
type(X_test)

pandas.core.frame.DataFrame

In [132]:
X_test

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 44444 | 0.222222 | 0.228764 | 0.4 | 0.439560 | 1.0 | 0.523810 | 0.940476 | 0.0 | 0.915072 | 0.0 |
| 34145 | 0.386667 | 0.313709 | 1.0 | 0.813187 | 0.0 | 0.365079 | 0.963095 | 0.0 | 0.000000 | 0.0 |
| 72049 | 0.305556 | 0.216148 | 1.0 | 0.351648 | 1.0 | 0.666667 | 0.833333 | 0.0 | 0.913876 | 0.0 |
| 73877 | 0.166667 | 0.020185 | 0.4 | 0.362637 | 0.0 | 0.619048 | 0.933333 | 0.0 | 0.000000 | 0.0 |
| 80026 | 0.500000 | 0.037847 | 1.0 | 0.417582 | 0.0 | 0.523810 | 0.967857 | 0.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 64065 | 0.166667 | 0.247267 | 0.4 | 0.439560 | 1.0 | 0.650794 | 0.961905 | 0.0 | 0.966507 | 0.0 |
| 95623 | 0.500000 | 0.066442 | 1.0 | 0.758242 | 0.0 | 0.682540 | 0.965476 | 0.0 | 0.000000 | 0.0 |
| 60085 | 0.555556 | 0.055509 | 1.0 | 0.758242 | 0.0 | 0.619048 | 0.877381 | 0.0 | 0.000000 | 0.0 |
| 104257 | 0.500000 | 0.096720 | 1.0 | 0.813187 | 1.0 | 0.619048 | 0.817857 | 0.0 | 0.802632 | 0.0 |
