
#**Project Title : Predicting whether a customer will default on his/her credit card**

##**Problem Description**

**This project aims to predict credit card payment defaults among customers in Taiwan. From a risk-management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. We can use the K-S chart to evaluate how well the model separates customers who will default on their credit card payments from those who will not.**
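
The K-S idea mentioned above can be illustrated with a minimal sketch. It assumes the K-S statistic is taken as the largest gap between the cumulative score distributions of defaulters and non-defaulters; the function name `ks_statistic` and the eight example scores are hypothetical and do not come from the dataset.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of
    defaulters (label 1) and non-defaulters (label 0)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)          # sorted unique scores
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    # Empirical CDF of each class evaluated at every observed score
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / len(neg)
    return float(np.max(np.abs(cdf_neg - cdf_pos)))

# Hypothetical predicted default probabilities for eight customers
demo_labels = [0, 0, 0, 0, 1, 1, 1, 1]
demo_scores = [0.10, 0.20, 0.30, 0.60, 0.40, 0.70, 0.80, 0.90]
ks_demo = ks_statistic(demo_labels, demo_scores)
print(f"K-S statistic: {ks_demo:.2f}")
```

A value close to 1 means the predicted probabilities separate the two groups almost perfectly, while a value close to 0 means the two score distributions largely overlap.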

##**Data Description**

###**Attribute Information:**

**This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:**


*   X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
*   X2: Gender (1 = male; 2 = female).

*   X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

*   X4: Marital status (1 = married; 2 = single; 3 = others).

*   X5: Age (year).

*   X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

*   X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.


*   X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.



#**Objective**

The objective of our project is to predict which customers might default in the upcoming months. Before going any further, let's take a quick look at what is actually meant by credit card default.


*   We are all aware of what a credit card is. It is a type of payment card in which charges are made against a line of credit instead of the account holder's cash deposits. When someone uses a credit card to make a purchase, that person's account accrues a balance that must be paid off each month.

*   Credit card default happens when you have become severely delinquent on your credit card payments. Missing credit card payments once or twice does not count as a default. A payment default occurs when you fail to pay the Minimum Amount Due on the credit card for a few consecutive months.



In [None]:
#importing important library
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,roc_auc_score,recall_score,f1_score,precision_score,confusion_matrix,roc_curve,auc
from sklearn.svm import SVC

In [None]:
df = pd.read_excel('/content/drive/MyDrive/Almabetter course/capston project/ML 5 Credit-Card Default prediction/default of credit card clients.xls',header=1)
df

In [None]:
df.info()

What we know about the dataset:

We have records for 30,000 customers. Below is a description of each feature.


*   ID: ID of each client

*   LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

*   SEX: Gender (1 = male, 2 = female)
*   EDUCATION: (1 = graduate school, 2 = university, 3 = high school, 0,4,5,6 = others)


*   MARRIAGE: Marital status (0 = others, 1 = married, 2 = single, 3 = others)
*   AGE: Age in years

*   Scale for PAY_0 to PAY_6 : (-2 = No consumption, -1 = paid in full, 0 = use of revolving credit (paid minimum only), 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)

*   PAY_0: Repayment status in September, 2005 (scale same as above)

*   PAY_2: Repayment status in August, 2005 (scale same as above)
*   PAY_3: Repayment status in July, 2005 (scale same as above)

*   PAY_4: Repayment status in June, 2005 (scale same as above)

*   PAY_5: Repayment status in May, 2005 (scale same as above)

*   PAY_6: Repayment status in April, 2005 (scale same as above)
*   BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

*   BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

*   BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

*   BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

*   BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

*   BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

*   PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

*   PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

*   PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

*   PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

*   PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
*   PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

*   default.payment.next.month: Default payment (1=yes, 0=no)


In our dataset we have each customer's credit card payment history for the past 6 months, on the basis of which we have to predict whether the customer will default.

So let's begin.

First, we will check whether we have any null values.


In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.tail()

#**Exploratory Data Analysis**

##**Dependent Variable**

In [None]:
df.rename(columns={'default payment next month':'IsDefaulter'},inplace=True)

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x = 'IsDefaulter',data=df)

In [None]:
df['IsDefaulter'].value_counts()

As the graph above shows, the two classes are not in proportion, so we have an imbalanced dataset.

##**Independent Variable:**
###**Categorical Features**
We have a few categorical features in our dataset. Let's check how they are related to our target class.

**SEX**

*   1-Male
*   2-Female



In [None]:
df['SEX'].value_counts()

**Education**

1 = graduate school; 2 = university; 3 = high school; 4 = others

In [None]:
df['EDUCATION'].value_counts()

As we can see, the dataset also contains the values 5, 6 and 0, for which we have no description, so we can merge them into 4 (Others).

In [None]:
fil = (df['EDUCATION']==5)|(df['EDUCATION']==6)|(df['EDUCATION']==0)
df.loc[fil,'EDUCATION']=4
df['EDUCATION'].value_counts()

###**Marriage**

1 = married; 2 = single; 3 = others

In [None]:
df['MARRIAGE'].value_counts()

We have a few records with the value 0, which is not defined, so I am adding them to the Others category.

In [None]:
fil = df['MARRIAGE']==0
df.loc[fil,'MARRIAGE']=3
df['MARRIAGE'].value_counts()

###**Plotting our categorical features**

In [None]:
categorical_features = ['SEX','EDUCATION','MARRIAGE']

In [None]:
df_cat = df[categorical_features].copy()  # copy so the new column is not written to a slice of df
df_cat['Defaulter'] = df['IsDefaulter']

In [None]:
df_cat.replace({'SEX':{1:'MALE',2:'FEMALE'},'EDUCATION':{1:'graduate school',2:'university',3:'high school',4:'others'},'MARRIAGE':{1:'married',2:'single',3:'others'}},inplace=True)

In [None]:
for col in categorical_features:
  fig,axes = plt.subplots(ncols=2,figsize=(12,8))
  # Left: share of each category; right: defaulter counts per category
  df_cat[col].value_counts().plot(kind='pie',ax=axes[0],autopct = '%0.1f%%')
  sns.countplot(x=col,hue='Defaulter',data=df_cat,ax=axes[1])

**Below are few observations for categorical features:**


*   There are more female credit card holders, so females make up a higher proportion of the defaulters.

*   Defaulters include a higher proportion of more educated people (graduate school and university).
*   Defaulters include a higher proportion of singles.



###**Limit Balance**

In [None]:
df['LIMIT_BAL'].max()

In [None]:
df['LIMIT_BAL'].min()

In [None]:
df['LIMIT_BAL'].describe()

In [None]:
sns.barplot(x='IsDefaulter',y='LIMIT_BAL',data=df)

In [None]:
plt.plot(df['LIMIT_BAL'])

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='IsDefaulter',y='LIMIT_BAL',data=df)

In [None]:
#renaming columns 

df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)
df.head()

**AGE**

Plotting the number of credit card holders at each age, irrespective of gender.

In [None]:
df['AGE'].value_counts()

In [None]:
df['AGE'] = df['AGE'].astype(int)

In [None]:
fig,axes = plt.subplots(ncols=2,figsize=(18,12))
# Name the columns explicitly so the code does not depend on the pandas version
age_df = df['AGE'].value_counts().rename_axis('age').reset_index(name='count')
df['AGE'].value_counts().plot(kind='pie',autopct='%0.1f%%',ax=axes[0])
sns.barplot(x='age',y='count',data=age_df,ax=axes[1],orient='v')

In [None]:
df.groupby('IsDefaulter')['AGE'].mean()

In [None]:
plt.figure(figsize=(15,9))
sns.boxplot(x='IsDefaulter',y='AGE',data=df)

##**Bill Amount**

In [None]:
bill_amnt_df = df[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']]

In [None]:
sns.pairplot(data = bill_amnt_df)

###**History payment status**

In [None]:
pay_col = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']
for col in pay_col:
  plt.figure(figsize=(10,5))
  sns.countplot(x = col, hue = 'IsDefaulter', data = df)

###**Paid Amount**

In [None]:
pay_amnt_df = df[['PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR', 'IsDefaulter']]

In [None]:
# sns.pairplot(data = pay_amnt_df, hue='IsDefaulter')

##**As we saw earlier, the dataset is imbalanced. To remedy this, we use SMOTE (Synthetic Minority Oversampling Technique)**

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(df.iloc[:,0:-1], df['IsDefaulter'])

print('Original dataset shape', len(df))
print('Resampled dataset shape', len(y_smote))
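
To give an intuition for what SMOTE does, here is a simplified sketch of its core idea: each synthetic row is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration only, not imblearn's implementation; the function `smote_sketch`, its parameters and the four example points are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)     # randomly chosen minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # position along the connecting segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Four hypothetical minority-class points with two features each
X_min_demo = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
synthetic_demo = smote_sketch(X_min_demo, n_new=6, k=2)
print(synthetic_demo.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new rows stay inside the region already occupied by the minority class.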

In [None]:
x_smote

In [None]:
df.columns

In [None]:
columns = list(df.columns)

In [None]:
columns.pop()

In [None]:
balance_df = pd.DataFrame(x_smote, columns=columns)

In [None]:
balance_df['IsDefaulter'] = y_smote

In [None]:
balance_df.head()

In [None]:
sns.countplot(x= 'IsDefaulter', data = balance_df)

In [None]:
balance_df[balance_df['IsDefaulter']==1]

##**Feature Engineering**

In [None]:
df_fr = balance_df.copy()

In [None]:
df_fr['Payement_Value'] = df_fr['PAY_SEPT'] + df_fr['PAY_AUG'] + df_fr['PAY_JUL'] + df_fr['PAY_JUN'] + df_fr['PAY_MAY'] + df_fr['PAY_APR']

In [None]:
df_fr.groupby('IsDefaulter')['Payement_Value'].mean()

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data = df_fr, x = 'IsDefaulter', y = 'Payement_Value' )

In [None]:
# Total billed minus total paid across all six months (April to September)
df_fr['Dues'] = (df_fr['BILL_AMT_APR']+df_fr['BILL_AMT_MAY']+df_fr['BILL_AMT_JUN']+df_fr['BILL_AMT_JUL']+df_fr['BILL_AMT_AUG']+df_fr['BILL_AMT_SEPT'])-(df_fr['PAY_AMT_APR']+df_fr['PAY_AMT_MAY']+df_fr['PAY_AMT_JUN']+df_fr['PAY_AMT_JUL']+df_fr['PAY_AMT_AUG']+df_fr['PAY_AMT_SEPT'])
df_fr.groupby('IsDefaulter')['Dues'].mean()

In [None]:
df_fr['EDUCATION'].unique()

In [None]:
df_fr['EDUCATION']=np.where(df_fr['EDUCATION'] == 6, 4, df_fr['EDUCATION'])
df_fr['EDUCATION']=np.where(df_fr['EDUCATION'] == 0, 4, df_fr['EDUCATION'])

In [None]:

df_fr['MARRIAGE'].unique()

In [None]:
df_fr['MARRIAGE']=np.where(df_fr['MARRIAGE'] == 0, 3, df_fr['MARRIAGE'])

In [None]:
df_fr.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)
df_fr.head()

##**One Hot Encoding**

In [None]:
df_fr = pd.get_dummies(df_fr,columns=['EDUCATION','MARRIAGE'])

In [None]:
df_fr.head()

In [None]:
df_fr.drop(['EDUCATION_others','MARRIAGE_others'],axis = 1, inplace = True)

In [None]:
df_fr = pd.get_dummies(df_fr, columns = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR'], drop_first = True )
df_fr.head()

In [None]:
# LABEL ENCODING FOR SEX
encoders_nums = {
                 "SEX":{"FEMALE": 0, "MALE": 1}
}
df_fr = df_fr.replace(encoders_nums)

In [None]:
df_fr.head()

In [None]:
df_fr.drop('ID',axis = 1, inplace = True)

In [None]:
df_fr.to_csv('/content/drive/MyDrive/Almabetter course/capston project/ML 5 Credit-Card Default prediction/Final_data.csv')
df_fr

In [None]:
# df_fr.drop(['Unnamed: 0'],axis = 1, inplace = True)

##**Implementing Logistic Regression**
Logistic Regression is one of the simplest algorithms. It estimates the relationship between one binary dependent variable and the independent variables, computing the probability that an event occurs. The regularization parameter C controls the trade-off between increasing complexity (overfitting) and keeping the model simple (underfitting). For large values of C, the strength of regularization is reduced and the model becomes more complex, which increases the risk of overfitting the data.
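
The effect of C can be seen in a small sketch. It assumes synthetic data from `make_classification` as a stand-in (not the credit card dataset) and uses the norm of the fitted coefficients as a rough proxy for model complexity; the names `X_demo`, `y_demo` and `coef_norms` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Default L2 penalty; a smaller C means stronger regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

With stronger regularization (small C) the coefficients are pulled towards zero, which is why the grid search over C is worthwhile.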

In [None]:
df_log_reg = df_fr.copy()

In [None]:
df_log_reg.head()

In [None]:
X = df_log_reg.drop(['IsDefaulter','Payement_Value','Dues'],axis=1)
y = df_log_reg['IsDefaulter']

In [None]:
columns = X.columns

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify = y)

In [None]:
param_grid = {'penalty':['l1','l2'],'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

In [None]:
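# liblinear supports both the 'l1' and 'l2' penalties in param_grid (the default lbfgs solver does not support 'l1')
grid_lr_clf = GridSearchCV(LogisticRegression(solver='liblinear'),param_grid,scoring='accuracy',n_jobs=-1,verbose=3,cv=3)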
grid_lr_clf.fit(X_train,y_train)

In [None]:
optimized_clf = grid_lr_clf.best_estimator_

In [None]:
grid_lr_clf.best_params_

In [None]:
grid_lr_clf.best_score_

In [None]:
#predicted probability
train_preds = optimized_clf.predict_proba(X_train)[:,1]
test_preds = optimized_clf.predict_proba(X_test)[:,1]

In [None]:
#get the predicted classes
train_class_preds = optimized_clf.predict(X_train)
test_class_preds = optimized_clf.predict(X_test)

In [None]:
#get the accuracy score
train_accuracy_lr = accuracy_score(train_class_preds,y_train)
test_accuracy_lr = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_lr)
print("The accuracy on test data is ", test_accuracy_lr)

In [None]:
# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_lr = accuracy_score(y_test, test_class_preds)
test_precision_score_lr = precision_score(y_test, test_class_preds)
test_recall_score_lr = recall_score(y_test, test_class_preds)
test_f1_score_lr = f1_score(y_test, test_class_preds)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_lr = roc_auc_score(y_test, test_preds)

print("The accuracy on test data is ", test_accuracy_lr)
print("The precision on test data is ", test_precision_score_lr)
print("The recall on test data is ", test_recall_score_lr)
print("The f1 on test data is ", test_f1_score_lr)
print("The roc_score on test data is ", test_roc_score_lr)

In [None]:
# get the confusion matrix for the train set
labels = ['Not Defaulter','Defaulter']
cm = confusion_matrix(y_train,train_class_preds)
print(cm)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax)  # fmt='d' shows the counts as integers

#labels, title and ticks

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
feature_importance = pd.DataFrame({'feature':columns,'Importances':np.abs(optimized_clf.coef_).ravel()})

In [None]:
feature_importance = feature_importance.sort_values(by='Importances',ascending=False)[:10]

In [None]:
plt.bar(height=feature_importance['Importances'],x=feature_importance['feature'])
plt.xticks(rotation=70)
plt.title("Feature importances via coefficients")
plt.show()

In [None]:
y_preds_proba_lr = optimized_clf.predict_proba(X_test)[::,1]

In [None]:
y_pred_proba = y_preds_proba_lr
fpr, tpr, _ = roc_curve(y_test,  y_pred_proba)
auc_lr = roc_auc_score(y_test, y_pred_proba)  # separate name so the imported auc function is not overwritten
plt.plot(fpr,tpr,label="Logistic Regression, auc="+str(auc_lr))
plt.legend(loc=4)
plt.show()

We have implemented logistic regression and are getting an F1 score of approximately 73%. Since we have an imbalanced dataset, the F1 score is a better metric than accuracy. Let's go ahead with other models and see if they can yield better results.
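
To make the difference between accuracy and the F1 score concrete, here is a minimal sketch that computes precision, recall and F1 from the confusion-matrix counts. The helper `precision_recall_f1` and the ten example labels are hypothetical and only serve as an illustration.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaulters correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-defaulters wrongly flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaulters that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 defaulters among 10 customers
demo_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

demo_accuracy = float(np.mean(np.array(demo_true) == np.array(demo_pred)))
demo_precision, demo_recall, demo_f1 = precision_recall_f1(demo_true, demo_pred)
print(f"accuracy={demo_accuracy:.3f} precision={demo_precision:.3f} "
      f"recall={demo_recall:.3f} f1={demo_f1:.3f}")
```

In this toy case the accuracy is 0.7 even though half of the defaulters are missed, while the F1 score (about 0.57) reflects the weak recall.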

##**Implementing SVC**

In [None]:
# defining parameter range 
param_grid = {'C': [0.1, 1, 10, 100],   
              'kernel': ['rbf']} 

In [None]:

X = df_fr.drop(['IsDefaulter','Payement_Value','Dues'],axis=1)
y = df_fr['IsDefaulter']

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify = y)

In [None]:
grid_clf = GridSearchCV(SVC(probability=True), param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
grid_clf.fit(X_train, y_train)

In [None]:
optimal_SVC_clf = grid_clf.best_estimator_

In [None]:
grid_clf.best_params_

In [None]:
grid_clf.best_score_

In [None]:
# Get the predicted classes
train_class_preds_svc = optimal_SVC_clf.predict(X_train)
test_class_preds_svc = optimal_SVC_clf.predict(X_test)

In [None]:
# get the accuracy score
train_accuracy_svc = accuracy_score(train_class_preds_svc,y_train)
test_accuracy_svc = accuracy_score(test_class_preds_svc,y_test)

print("The accuracy on train data is ", train_accuracy_svc)
print("The accuracy on test data is ", test_accuracy_svc)

In [None]:
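# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_SVC = accuracy_score(y_test, test_class_preds_svc)
test_precision_score_SVC = precision_score(y_test, test_class_preds_svc)
test_recall_score_SVC = recall_score(y_test, test_class_preds_svc)
test_f1_score_SVC = f1_score(y_test, test_class_preds_svc)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_SVC = roc_auc_score(y_test, optimal_SVC_clf.predict_proba(X_test)[:, 1])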

print("The accuracy on test data is ", test_accuracy_SVC)
print("The precision on test data is ", test_precision_score_SVC)
print("The recall on test data is ", test_recall_score_SVC)
print("The f1 on test data is ", test_f1_score_SVC)
print("The roc_score on test data is ", test_roc_score_SVC)


We can see from the above results that we are getting around 80% train accuracy and 78% test accuracy, which is not bad. However, the F1 score is approximately 76%, so there may be room for improvement.

In [None]:
# Get the confusion matrix for the train set

labels = ['Not Defaulter', 'Defaulter']
cm = confusion_matrix(y_train, train_class_preds_svc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' to show integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
y_pred_proba_SVC = optimal_SVC_clf.predict_proba(X_test)[::,1]

In [None]:
# ROC AUC CURVE
fpr, tpr, _ = roc_curve(y_test,  y_pred_proba_SVC)
auc_svc = roc_auc_score(y_test, y_pred_proba_SVC)  # separate name so the imported auc function is not overwritten
plt.plot(fpr,tpr,label="SVC, auc="+str(auc_svc))
plt.legend(loc=4)
plt.show()

In [None]:
import joblib

In [None]:
# Save the fitted scikit-learn model; the f-string inserts model_save_name into the path
model_save_name = 'SVC_optimized_classifier.pt'
path = f"/content/drive/MyDrive/Almabetter course/capston project/ML 5 Credit-Card Default prediction/{model_save_name}"
joblib.dump(optimal_SVC_clf,path)

In [None]:
# Load the saved model back from the same path
model_save_name = 'SVC_optimized_classifier.pt'
path = f"/content/drive/MyDrive/Almabetter course/capston project/ML 5 Credit-Card Default prediction/{model_save_name}"
optimal_SVC_clf = joblib.load(path)

In [None]:
optimal_SVC_clf

In [None]:
# Get the predicted classes
train_class_preds = optimal_SVC_clf.predict(X_train)
test_class_preds = optimal_SVC_clf.predict(X_test)

#**Implementing Decision Tree**

Decision Tree is another very popular algorithm for classification problems because it is easy to interpret and understand. An internal node represents a feature, a branch represents a decision rule, and each leaf node represents an outcome. One advantage of decision trees is that they require less data preprocessing, e.g., there is no need to normalize features. However, decision trees can easily overfit noisy data and can produce biased results when the dataset is imbalanced.
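
The node, branch and leaf structure described above can be inspected directly. The following sketch assumes synthetic data from `make_classification` and hypothetical feature names (`f0` to `f3`); it fits a shallow tree and prints its decision rules with scikit-learn's `export_text`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_tree_demo, y_tree_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A shallow tree keeps the printed rules short and readable
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_tree_demo, y_tree_demo)

# Each indented line is a decision rule; lines with "class:" are leaf nodes
demo_rules = export_text(demo_tree, feature_names=demo_feature_names)
print(demo_rules)
```

Limiting `max_depth` as done here is also one simple way to reduce the overfitting tendency mentioned above.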

In [None]:
param_grid = {'max_depth': [20,30,50,100], 'min_samples_split':[0.1,0.2,0.4]}

In [None]:
from sklearn.tree import DecisionTreeClassifier
X = df_fr.drop(['IsDefaulter','Payement_Value','Dues'],axis=1)
y = df_fr['IsDefaulter']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify = y)

In [None]:
grid_DTC_clf = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
grid_DTC_clf.fit(X_train, y_train)

In [None]:
grid_DTC_clf.best_score_

In [None]:
optimal_DTC_clf = grid_DTC_clf.best_estimator_

In [None]:
# Get the predicted classes
train_class_preds = optimal_DTC_clf.predict(X_train)
test_class_preds = optimal_DTC_clf.predict(X_test)

In [None]:
grid_DTC_clf.best_params_

In [None]:
# Get the accuracy scores
train_accuracy_DTC = accuracy_score(train_class_preds,y_train)
test_accuracy_DTC = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_DTC)
print("The accuracy on test data is ", test_accuracy_DTC)

##**Implementing RandomForest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
X = df_fr.drop(['IsDefaulter','Payement_Value','Dues'],axis=1)
y = df_fr['IsDefaulter']

In [None]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train,y_train)

In [None]:
# Get the predicted classes
train_class_preds = rf_clf.predict(X_train)
test_class_preds = rf_clf.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy_rf = accuracy_score(train_class_preds,y_train)
test_accuracy_rf = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_rf)
print("The accuracy on test data is ", test_accuracy_rf)

In [None]:
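# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_rf = accuracy_score(y_test, test_class_preds)
test_precision_score_rf = precision_score(y_test, test_class_preds)
test_recall_score_rf = recall_score(y_test, test_class_preds)
test_f1_score_rf = f1_score(y_test, test_class_preds)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_rf = roc_auc_score(y_test, rf_clf.predict_proba(X_test)[:, 1])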

print("The accuracy on test data is ", test_accuracy_rf)
print("The precision on test data is ", test_precision_score_rf)
print("The recall on test data is ", test_recall_score_rf)
print("The f1 on test data is ", test_f1_score_rf)
print("The roc_score on test data is ", test_roc_score_rf)

We can see from the above results that we are getting around 99% train accuracy and 83% test accuracy, which indicates that the model is overfitting. However, our F1 score is around 82%, which is not bad.

In [None]:
param_grid = {'n_estimators': [100,150,200], 'max_depth': [10,20,30]}

In [None]:
grid_rf_clf = GridSearchCV(RandomForestClassifier(), param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
grid_rf_clf.fit(X_train, y_train)

In [None]:
grid_rf_clf.best_score_

In [None]:
grid_rf_clf.best_params_

In [None]:
optimal_rf_clf = grid_rf_clf.best_estimator_

In [None]:
# Get the predicted classes
train_class_preds = optimal_rf_clf.predict(X_train)
test_class_preds = optimal_rf_clf.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy_rf = accuracy_score(train_class_preds,y_train)
test_accuracy_rf = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_rf)
print("The accuracy on test data is ", test_accuracy_rf)

In [None]:
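# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_rf = accuracy_score(y_test, test_class_preds)
test_precision_score_rf = precision_score(y_test, test_class_preds)
test_recall_score_rf = recall_score(y_test, test_class_preds)
test_f1_score_rf = f1_score(y_test, test_class_preds)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_rf = roc_auc_score(y_test, optimal_rf_clf.predict_proba(X_test)[:, 1])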

print("The accuracy on test data is ", test_accuracy_rf)
print("The precision on test data is ", test_precision_score_rf)
print("The recall on test data is ", test_recall_score_rf)
print("The f1 on test data is ", test_f1_score_rf)
print("The roc_score on test data is ", test_roc_score_rf)

In [None]:
len(optimal_rf_clf.feature_importances_)

In [None]:
# Feature Importance
feature_importances_rf = pd.DataFrame(optimal_rf_clf.feature_importances_,
                                   index = columns,
                                    columns=['importance_rf']).sort_values('importance_rf',
                                                                        ascending=False)[:10]
                                    
plt.subplots(figsize=(17,6))
plt.title("Feature importances")
plt.bar(feature_importances_rf.index, feature_importances_rf['importance_rf'],
        color="g",  align="center")
plt.xticks(feature_importances_rf.index, rotation = 85)
#plt.xlim([-1, X.shape[1]])
plt.show()

In [None]:
# Get the predicted classes
train_class_preds = optimal_rf_clf.predict(X_train)
test_class_preds = optimal_rf_clf.predict(X_test)

In [None]:
y_preds_proba_rf = optimal_rf_clf.predict_proba(X_test)[::,1]

In [None]:
from sklearn import metrics

In [None]:
y_pred_proba = y_preds_proba_rf
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc_rf = metrics.roc_auc_score(y_test, y_pred_proba)  # separate name so the imported auc function is not overwritten
plt.plot(fpr,tpr,label="Random Forest, auc="+str(auc_rf))
plt.legend(loc=4)
plt.show()

#**Implementing XGBoost**

In [None]:
# import xgboost
import xgboost as xgb

#**Applying XGBoost**

In [None]:

#The data is stored in a DMatrix object 
#label is used to define our outcome variable
dtrain=xgb.DMatrix(X_train,label=y_train)
dtest=xgb.DMatrix(X_test)

In [None]:
#setting parameters for xgboost
# 'eta' is an alias of 'learning_rate', so only one of them is set; 'verbosity':0 silences logging
parameters={'max_depth':7, 'verbosity':0, 'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}

In [None]:
#training our model 
num_round=50
from datetime import datetime 
start = datetime.now() 
xg=xgb.train(parameters,dtrain,num_round) 
stop = datetime.now()

In [None]:
#Execution time of the model 
execution_time_xgb = stop-start 
execution_time_xgb

In [None]:
#now predicting our model on train set 
train_class_preds_probs=xg.predict(dtrain) 
#now predicting our model on test set 
test_class_preds_probs =xg.predict(dtest) 

In [None]:
len(train_class_preds_probs)

In [None]:
# Convert the predicted probabilities to class labels with a 0.5 threshold
train_class_preds = (train_class_preds_probs >= 0.5).astype(int)
test_class_preds = (test_class_preds_probs >= 0.5).astype(int)

In [None]:
test_class_preds_probs[:20]

In [None]:
test_class_preds[:20]

In [None]:
# Get the accuracy scores
train_accuracy_xgb = accuracy_score(train_class_preds,y_train)
test_accuracy_xgb = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_xgb)
print("The accuracy on test data is ", test_accuracy_xgb)

In [None]:
# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_xgb = accuracy_score(y_test, test_class_preds)
test_precision_xgb = precision_score(y_test, test_class_preds)
test_recall_score_xgb = recall_score(y_test, test_class_preds)
test_f1_score_xgb = f1_score(y_test, test_class_preds)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_xgb = roc_auc_score(y_test, test_class_preds_probs)

print("The accuracy on test data is ", test_accuracy_xgb)
print("The precision on test data is ", test_precision_xgb)
print("The recall on test data is ", test_recall_score_xgb)
print("The f1 on test data is ", test_f1_score_xgb)
print("The roc_score on test data is ", test_roc_score_xgb)

#**Hyperparameter Tuning**

In [None]:
from xgboost import  XGBClassifier

In [None]:
X = df_fr.drop(['IsDefaulter','Payement_Value','Dues'],axis=1)
y = df_fr['IsDefaulter']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify = y)

In [None]:
param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', n_jobs=4, scale_pos_weight=1, random_state=27), 
 param_grid = param_test1, scoring='accuracy',n_jobs=-1, cv=3, verbose = 2)
gsearch1.fit(X_train, y_train)

In [None]:
gsearch1.best_score_

In [None]:
optimal_xgb = gsearch1.best_estimator_

In [None]:
# Get the predicted classes
train_class_preds = optimal_xgb.predict(X_train)
test_class_preds = optimal_xgb.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy_xgb_tuned = accuracy_score(train_class_preds,y_train)
test_accuracy_xgb_tuned = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_xgb_tuned)
print("The accuracy on test data is ", test_accuracy_xgb_tuned)

In [None]:
# scikit-learn metrics take the true labels first, then the predictions
test_accuracy_xgb_tuned = accuracy_score(y_test, test_class_preds)
test_precision_xgb_tuned = precision_score(y_test, test_class_preds)
test_recall_score_xgb_tuned = recall_score(y_test, test_class_preds)
test_f1_score_xgb_tuned = f1_score(y_test, test_class_preds)
# ROC AUC is computed from predicted probabilities rather than class labels
test_roc_score_xgb_tuned = roc_auc_score(y_test, optimal_xgb.predict_proba(X_test)[:, 1])

print("The accuracy on test data is ", test_accuracy_xgb_tuned)
print("The precision on test data is ", test_precision_xgb_tuned)
print("The recall on test data is ", test_recall_score_xgb_tuned)
print("The f1 on test data is ", test_f1_score_xgb_tuned)
print("The roc_score on test data is ", test_roc_score_xgb_tuned)

In [None]:
pd.DataFrame(optimal_xgb.feature_importances_,
                                   index = columns,
                                    columns=['importance_xgb']).sort_values('importance_xgb',
                                                                        ascending=False)[:10]

In [None]:
# Feature Importance
feature_importances_xgb = pd.DataFrame(optimal_xgb.feature_importances_,
                                   index = columns,
                                    columns=['importance_xgb']).sort_values('importance_xgb',
                                                                        ascending=False)[:10]
                                    
plt.subplots(figsize=(17,6))
plt.title("Feature importances")
plt.bar(feature_importances_xgb.index, feature_importances_xgb['importance_xgb'],
        color="b",  align="center")
plt.xticks(feature_importances_xgb.index, rotation = 85)
#plt.xlim([-1, X.shape[1]])
plt.show()


In [None]:
y_preds_proba_xgb = optimal_xgb.predict_proba(X_test)[::,1]

In [None]:
y_pred_proba = y_preds_proba_xgb
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc_xgb = metrics.roc_auc_score(y_test, y_pred_proba)  # separate name so the imported auc function is not overwritten
plt.plot(fpr,tpr,label="XGBoost, auc="+str(auc_xgb))
plt.legend(loc=4)
plt.show()

#**Evaluating the models**

In [None]:
classifiers = ['Logistic Regression', 'SVC', 'Random Forest CLf', 'Xgboost Clf']
train_accuracy = [train_accuracy_lr, train_accuracy_svc, train_accuracy_rf, train_accuracy_xgb_tuned]
test_accuracy = [test_accuracy_lr, test_accuracy_SVC, test_accuracy_rf, test_accuracy_xgb_tuned]
# Plural names keep the imported precision_score, recall_score and f1_score functions usable
precision_scores = [test_precision_score_lr, test_precision_score_SVC, test_precision_score_rf, test_precision_xgb_tuned]
recall_scores = [test_recall_score_lr, test_recall_score_SVC, test_recall_score_rf, test_recall_score_xgb_tuned]
f1_scores = [test_f1_score_lr, test_f1_score_SVC, test_f1_score_rf, test_f1_score_xgb_tuned]

In [None]:
pd.DataFrame({'Classifier':classifiers, 'Train Accuracy': train_accuracy, 'Test Accuracy': test_accuracy, 'Precision Score': precision_scores, 'Recall Score': recall_scores, 'F1 Score': f1_scores })

#**Plotting ROC AUC for all the models**

In [None]:
classifiers_proba = [(optimized_clf, y_preds_proba_lr), 
               (optimal_rf_clf, y_preds_proba_rf), 
               (optimal_xgb, y_preds_proba_xgb),
               (optimal_SVC_clf,y_pred_proba_SVC)]

# Compute the ROC curve and AUC for each fitted model and collect the results
rows = []
for clf, proba in classifiers_proba:
    fpr, tpr, _ = roc_curve(y_test, proba)
    rows.append({'classifiers': clf.__class__.__name__,
                 'fpr': fpr,
                 'tpr': tpr,
                 'auc': roc_auc_score(y_test, proba)})

# Build the result table from the collected rows (DataFrame.append is not available in pandas 2.x)
result_table = pd.DataFrame(rows, columns=['classifiers', 'fpr', 'tpr', 'auc'])

# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)

In [None]:
result_table
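
A table of `fpr`/`tpr` arrays like the one above can be turned into a single comparison plot. The sketch below is self-contained and uses two hypothetical score vectors (`model_a`, `model_b`) on synthetic labels rather than the notebook's fitted models, so the numbers it produces say nothing about the real classifiers.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_roc_demo = rng.integers(0, 2, size=400)  # synthetic 0/1 labels

# Two hypothetical score vectors standing in for model probabilities
roc_demo_scores = {
    "model_a": y_roc_demo * 0.6 + rng.random(400) * 0.8,
    "model_b": y_roc_demo * 0.3 + rng.random(400) * 0.9,
}

fig, ax = plt.subplots(figsize=(8, 6))
roc_demo_auc = {}
for name, scores in roc_demo_scores.items():
    fpr_demo, tpr_demo, _ = roc_curve(y_roc_demo, scores)
    roc_demo_auc[name] = roc_auc_score(y_roc_demo, scores)
    ax.plot(fpr_demo, tpr_demo, label=f"{name}, AUC={roc_demo_auc[name]:.3f}")

# Diagonal reference line: a classifier that guesses at random
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC curves")
ax.legend(loc="lower right")
plt.show()
```

The same loop could iterate over the rows of `result_table` instead, plotting each stored `fpr` against its `tpr` with the classifier name as the label.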