## **MS4610 - Introduction to Data Analytics - Final Course Project**

__Problem Statement:__ <br>
     The project predicts whether a loan given by an organization will default or not, using a training dataset of features such as the customer's age, income, expenses and occupation type, along with several metrics calculated by the organization.

__Data Description:__ <br>

| Variable | Description | 
|------|------|
| ID | Unique Loan Identifier | 
| Loan Type | A/ B | 
| Occupation Type | Customer's occupation (X/Y/Z) | 
| Income | Annual income of customer | 
| Expense | Annual expense of customer | 
| Age | 0 if the customer's age is below 50 / 1 if above 50 | 
| Score 1 | Customer Metric | 
| Score 2 | Customer Metric | 
| Score 3 | Customer Metric | 
| Score 4 | Customer Metric | 
| Score 5 | Customer Metric | 
| Label | 0 for non-default/ 1 for default type | 



## A. Data Importing:

In [None]:
############## Importing Libraries ##################
import pandas as pd 
import matplotlib.pyplot as py
import seaborn as sns

I imported the data by uploading the three given CSV files (train_x.csv, train_y.csv and test_x.csv) to Kaggle as a single dataset named loan-dataset. The total input file size is 9.7 MB.

In [None]:
########## reading training dataset ###############
train_data = pd.read_csv('../input/loan-dataset/train_x.csv')
train_data.info()

########## reading training labels ################
train_label_data=pd.read_csv('../input/loan-dataset/train_y.csv')
train_label_data.info()

train_data = train_data.merge(train_label_data,on = 'ID')

In [None]:
train_data.head(20)

## B. Data Visualization:

In [None]:
###### leaving out unlabelled rows ###########
new_train_data = train_data[train_data.Label.notnull()]  
new_train_data.head(7)

###### calculating percentage of data dropped #######
length = len(train_data.ID)
length_new = len(new_train_data.ID)
per = (1 - length_new/length) *100
print("Percent of data dropped is ",per)


In [None]:
###### checking for class imbalance in the labels of new_train_data ######
sns.countplot(x = "Label",data = new_train_data)

l = len(new_train_data.Label)
s = new_train_data.Label.sum()
print(s)
percent = (s/l)*100
print("Percentage of labels which are default is",percent)


In [None]:
interested_columns=['Loan type', 'Occupation type','Age' ]
for col in interested_columns:
    categorical_bin = pd.crosstab(new_train_data[col],new_train_data['Label'])
    categorical_bin.div(categorical_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
    py.xlabel(f'{col}')
    P = py.ylabel('Percentage')

In [None]:
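########### Pearson correlation coefficient ###########
# Restrict to numeric columns: string columns such as 'Loan type' cannot be correlated
corr = new_train_data.select_dtypes(include='number').corr(method = 'pearson')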
f, ax = py.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(10, 275, as_cmap=True)

sns.heatmap(corr, cmap=cmap, square=True,
            linewidths=0.5, cbar_kws={"shrink": 0.5}, ax=ax)

In [None]:
########### Box plot of numerical columns ###########
numerical_columns= ['Expense','Income', 'Score1','Score2','Score3','Score4', 'Score5']


fig,axes = py.subplots(3,3,figsize=(20,14))
for idx,cat_col in enumerate(numerical_columns):
     row,col = idx//3,idx%3
     sns.boxplot(y=cat_col,data=train_data,x='Label',ax=axes[row,col])

print(train_data[numerical_columns].describe())
py.subplots_adjust(hspace=0.5)



In [None]:
########### Plotting all the pairs of numerical columns ###########
interested_columns = ['Expense','Income', 'Score1','Score2','Score3','Score4','Score5','Label']
sns.pairplot(new_train_data[interested_columns][:5000],hue='Label')

__Observations:__<br>
1. The Label column is imbalanced.
2. Score5 and Expense are highly correlated.

__Solutions:__<br>
1. Use SMOTE to balance the dataset.
2. Because there are only a few features, the correlated features are kept rather than removed.
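
The correlation mentioned in the observations is the Pearson coefficient that `corr(method = 'pearson')` reports. A minimal sketch of that arithmetic in plain Python follows. The lists `expense` and `score5` are made-up toy values (an assumption for illustration, not rows from the dataset), chosen so that one is an exact linear function of the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squares (denominator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: score5 is an exact linear function of expense
expense = [10.0, 20.0, 30.0, 40.0]
score5 = [2 * e + 5 for e in expense]

r = pearson(expense, score5)
print(round(r, 6))
```

A coefficient close to +1 or -1 means the two columns carry largely the same linear information, which is why such pairs are candidates for removal when there are many features.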

## C. Data Pre-processing:

In [None]:
########## defining numerical and categorical columns ###########
categorical_columns=['ID','Loan type', 'Occupation type','Age' ]
numerical_columns= ['ID','Expense','Income', 'Score1','Score2','Score3','Score4', 'Score5']

In [None]:
X = new_train_data.drop(columns='Label')
y = new_train_data['Label']

############# Mode filling for categorical columns ################
from sklearn.impute import SimpleImputer
imp= SimpleImputer(strategy = 'most_frequent')
X_categorical = imp.fit_transform(X[categorical_columns])
X_categorical = pd.DataFrame(X_categorical,columns=categorical_columns)

############# Mean filling for numerical columns ##################
imp= SimpleImputer(strategy = 'mean')
X_numerical = imp.fit_transform(X[numerical_columns])
X_numerical = pd.DataFrame(X_numerical,columns=numerical_columns)

In [None]:
X = X_numerical.merge(X_categorical,on="ID")
X = X.drop(columns = 'ID')

############# Encoding Categorical features #################
X = pd.get_dummies(X,drop_first=True)


In [None]:
X.info()

In [None]:
####### Using SMOTE for making the data set balanced #######
from imblearn.over_sampling import SMOTE
smk = SMOTE(random_state=0)
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases
X_new,y_new = smk.fit_resample(X,y)
len(y_new)

####### checking whether dataset became balanced or not ######
l = len(y_new)
s = y_new.sum()
print(s)
percent = (s/l)*100
print("Percentage of labels which are default after balancing the data set is",percent)

In [None]:
X_new.info()

In [None]:
########### Plotting all the pairs of numerical columns after SMOTE ###########
interested_columns = ['Expense','Income', 'Score1','Score2','Score3','Score4','Score5','Label']
smote_df = pd.concat([X_new, y_new], axis=1)
smote_df = smote_df.sample(frac=1).reset_index(drop=True)
sns.pairplot(smote_df[interested_columns][:5000],hue='Label')

After applying SMOTE, fifty percent of the rows are labelled as default, so the dataset is now balanced.
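
SMOTE balances the classes by creating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. The function name `smote_like_sample` and the toy points are hypothetical, and this is a simplification, not the imbalanced-learn implementation.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point interpolated between a random minority
    point and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by distance to the chosen base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Move a random fraction of the way from the base point to the neighbour
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical 2-D minority-class points (not taken from the loan data)
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Because every synthetic row is a convex combination of two real minority rows, it always lies inside the range already covered by the minority class.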

## D. Model Training:

### Model 1 - Logistic Regression

 Logistic regression is sensitive to the scale of its input features, so the numerical columns have to be standardized first.
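
Standardization replaces each value by its z-score: subtract the column mean, then divide by the column standard deviation. A minimal sketch in plain Python follows. The `income` values are hypothetical, and the population standard deviation is used because that is what `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Z-score a list: subtract the mean, divide by the population std dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical incomes, on a far larger scale than a 0-1 score would be
income = [20000.0, 40000.0, 60000.0, 80000.0]
income_z = standardize(income)
print(income_z)
```

After the transform the column has mean 0 and standard deviation 1, so no feature dominates the regularized fit purely because of its units.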

### Standardizing Numerical columns in Dataset

In [None]:
numerical_columns= ['Expense','Income', 'Score1','Score2','Score3','Score4', 'Score5']
categorical_columns =['Loan type_B','Occupation type_Y','Occupation type_Z','Age_1.0']

################## Standardizing values only for numerical columns ################
from sklearn.preprocessing import StandardScaler

# Keep the column names and the index of X_new so the categorical columns align below
X_standard = pd.DataFrame(StandardScaler().fit_transform(X_new[numerical_columns]),
                          columns=numerical_columns, index=X_new.index)

################## combining categorical columns ################
X_standard[categorical_columns]=X_new[categorical_columns]
X_standard.info()
X_standard.head()

In [None]:
########### Splitting data into Training and Test Data for the standardized data ###########
from sklearn.model_selection import train_test_split
X_train_encoded,X_test_encoded,y_train,y_test = train_test_split(X_standard,y_new,test_size=0.2,random_state=42)

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import cross_val_predict

train_accuracies = []
train_f1_scores = []
test_accuracies = []
test_f1_scores = []
thresholds = []

# Fit once: the fitted model does not depend on the decision threshold
logreg_clf = LogisticRegression(solver='liblinear')
logreg_clf.fit(X_train_encoded,y_train)

# Using different threshold values and finding the accuracy of the model
# (np.arange excludes the end point, so the sweep covers 0.1 to 0.8)
for thresh in np.arange(0.1,0.9,0.1):
    
    y_pred_train_thresh = logreg_clf.predict_proba(X_train_encoded)[:,1]
    y_pred_train = (y_pred_train_thresh > thresh).astype(int)

    train_acc = accuracy_score(y_train,y_pred_train)
    train_f1 = f1_score(y_train,y_pred_train)
    
    y_pred_test_thresh = logreg_clf.predict_proba(X_test_encoded)[:,1]
    y_pred_test = (y_pred_test_thresh > thresh).astype(int) 
    
    test_acc = accuracy_score(y_test,y_pred_test)
    test_f1 = f1_score(y_test,y_pred_test)
    
    train_accuracies.append(train_acc)
    train_f1_scores.append(train_f1)
    test_accuracies.append(test_acc)
    test_f1_scores.append(test_f1)
    thresholds.append(thresh)

In [None]:
Threshold_logreg = {"Training Accuracy": train_accuracies, "Test Accuracy": test_accuracies, "Training F1": train_f1_scores, "Test F1":test_f1_scores, "Decision Threshold": thresholds }
Threshold_logreg_df = pd.DataFrame.from_dict(Threshold_logreg)

plot_df = Threshold_logreg_df.melt('Decision Threshold',var_name='Metrics',value_name="Values")
fig,ax = py.subplots(figsize=(15,5))
sns.pointplot(x="Decision Threshold", y="Values",hue="Metrics", data=plot_df,ax=ax)

##### Using a threshold value of 0.45
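
To make the effect of the decision threshold concrete, the sketch below applies two thresholds to the same predicted probabilities and recomputes the F1 score by hand. The labels and probabilities are hypothetical toy values, and `f1_at_threshold` is an illustrative helper, not part of scikit-learn.

```python
def f1_at_threshold(y_true, proba, threshold):
    """Label a row 1 when its probability exceeds the threshold, then compute F1."""
    y_pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and predicted default probabilities
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.10, 0.48, 0.47, 0.80, 0.30, 0.60]

for threshold in (0.45, 0.50):
    print(threshold, round(f1_at_threshold(y_true, proba, threshold), 3))
```

Lowering the threshold turns borderline rows into predicted defaults, which raises recall at the cost of precision. The F1 score summarizes that trade-off in a single number.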

In [None]:
# using a threshold of 0.45 for Log regression:

logreg_clf = LogisticRegression(solver='liblinear')
logreg_clf.fit(X_train_encoded,y_train)
    
y_pred_train_thresh = logreg_clf.predict_proba(X_train_encoded)[:,1]
y_pred_train = (y_pred_train_thresh > 0.45).astype(int)

train_acc = accuracy_score(y_train,y_pred_train)
train_f1 = f1_score(y_train,y_pred_train)
    
y_pred_test_thresh = logreg_clf.predict_proba(X_test_encoded)[:,1]
y_pred_test = (y_pred_test_thresh >0.45).astype(int) 
    
test_acc = accuracy_score(y_test,y_pred_test)
test_f1 = f1_score(y_test,y_pred_test)


### Results for Logistic Regression:

In [None]:
################# Training Data Results ######################
print("Training acc. is :", train_acc)
print("Training f1 :",train_f1)
pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True)


In [None]:
################## Test Data Results #################
print("Test acc. is :", test_acc)
print("Test f1 :",test_f1)
pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True)

### ROC curve for Logistic Regression:
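
The area under a ROC curve can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it directly from that definition on hypothetical toy scores. The helper `roc_auc` is illustrative and uses a simple pairwise count rather than the curve-based computation a plotting library would use.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    a tie between a positive and a negative score counts as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any particular decision threshold.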


In [None]:
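# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator (1.0+) replaces it
from sklearn.metrics import RocCurveDisplay
ax=py.gca()
rfc=RocCurveDisplay.from_estimator(logreg_clf,X_test_encoded,y_test,ax=ax,alpha=0.8)
py.show()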

In [None]:
coeff_matrix = logreg_clf.coef_
print(coeff_matrix)

### Model 2 - Decision Tree Classifier

Decision tree and random forest classifiers are not sensitive to the scale of the features, so the train/test split can be applied to the dataset without standardization (scaling).
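
The reason scaling does not matter for trees is that a split only depends on the ordering of a feature's values, and a positive rescaling keeps that ordering. The sketch below illustrates this with a one-level tree (a stump) in plain Python. It is a simplification: the split criterion here is the misclassification count, whereas scikit-learn's trees use Gini impurity by default, and the data and the helper `best_stump_partition` are hypothetical.

```python
def best_stump_partition(feature, labels):
    """Find the threshold split on one feature with the fewest misclassifications
    and return the set of row indices sent to the left branch."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        # Only split between distinct feature values
        if feature[order[cut - 1]] == feature[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        # Each side predicts its majority class; count the rows that disagree
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

# Hypothetical feature values and binary labels
income = [12.0, 35.0, 18.0, 50.0, 41.0, 22.0]
label = [0, 1, 0, 1, 1, 0]

# Standardization-like rescaling: a positive affine transform preserves the ordering
income_scaled = [(v - 30.0) / 15.0 for v in income]

same_split = best_stump_partition(income, label) == best_stump_partition(income_scaled, label)
print(same_split)
```

The rows sent to each branch are identical before and after rescaling. Only the numeric value of the threshold changes.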

In [None]:
########### Splitting data into Training and Test Data ###########
from sklearn.model_selection import train_test_split
X_train_encoded,X_test_encoded,y_train,y_test = train_test_split(X_new,y_new,test_size=0.2,random_state=42)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,f1_score


tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train_encoded,y_train)
y_pred = tree_clf.predict(X_train_encoded)
print("Training Data Set Accuracy: ", accuracy_score(y_train,y_pred))
print("Training Data F1 Score ", f1_score(y_train,y_pred))

print("Validation Mean F1 Score: ",cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='f1_macro').mean())
print("Validation Mean Accuracy: ",cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='accuracy').mean())

In [None]:
y_pred = tree_clf.predict(X_test_encoded)
print("Test Data Set Accuracy: ", accuracy_score(y_test,y_pred))
print("Test Data F1 Score ", f1_score(y_test,y_pred))


 Comparing the training and test accuracy of the model shows that the trained model has overfitted.
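
The validation scores printed above come from 5-fold cross-validation: the training rows are divided into five folds, and each fold in turn is held out while the model is fitted on the other four. A minimal sketch of how such folds partition the row indices is given below. It uses plain contiguous folds as a simplifying assumption (for classifiers, scikit-learn stratifies the folds by label), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_rows % k) folds get one extra row
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A large gap between the score on the rows used for fitting and the mean score on the held-out folds is the usual symptom of overfitting.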

In [None]:
########## tuning the depth parameter ##########
training_accuracy = []
val_accuracy = []
training_f1 = []
val_f1 = []
tree_depths = []
test_accuracy = []
test_val_accuracy =[]
test_val_f1 = []
test_f1 =[]

for depth in range(1,20):
    tree_clf = DecisionTreeClassifier(max_depth=depth)
    tree_clf.fit(X_train_encoded,y_train)
    y_training_pred = tree_clf.predict(X_train_encoded)

    training_acc = accuracy_score(y_train,y_training_pred)
    train_f1 = f1_score(y_train,y_training_pred)
    val_mean_f1 = cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='f1_macro').mean()
    val_mean_accuracy = cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='accuracy').mean()
    
    # Evaluate the same fitted tree on the held-out split
    y_test_pred = tree_clf.predict(X_test_encoded)

    test_acc = accuracy_score(y_test,y_test_pred)
    test_f1_value = f1_score(y_test,y_test_pred)
    # Cross-validation on the held-out split alone (refits the tree on folds of that split)
    test_val_mean_f1 = cross_val_score(tree_clf,X_test_encoded,y_test,cv=5,scoring='f1_macro').mean()
    test_val_mean_accuracy = cross_val_score(tree_clf,X_test_encoded,y_test,cv=5,scoring='accuracy').mean()

    training_accuracy.append(training_acc)
    val_accuracy.append(val_mean_accuracy)
    training_f1.append(train_f1)
    val_f1.append(val_mean_f1)
    tree_depths.append(depth)

    test_accuracy.append(test_acc)
    test_val_accuracy.append(test_val_mean_accuracy)
    test_f1.append(test_f1_value)
    test_val_f1.append(test_val_mean_f1)
    

Tuning_Max_depth = {"Training Accuracy": training_accuracy, "Validation Accuracy": val_accuracy, "Training F1": training_f1, "Validation F1":val_f1, "Max_Depth": tree_depths ,"Test_val_f1":test_val_f1 , "Test_val_acc":test_val_accuracy , "Test_acc":test_accuracy , "Test_f1":test_f1 }
Tuning_Max_depth_df = pd.DataFrame.from_dict(Tuning_Max_depth)

plot_df = Tuning_Max_depth_df.melt('Max_Depth',var_name='Metrics',value_name="Values")
fig,ax = py.subplots(figsize=(15,5))
sns.pointplot(x="Max_Depth", y="Values",hue="Metrics", data=plot_df,ax=ax)


In [None]:
Tuning_Max_depth = {"Training Accuracy": training_accuracy, "Validation Accuracy": val_accuracy, "Training F1": training_f1, "Validation F1":val_f1, "Max_Depth": tree_depths }
Tuning_Max_depth_df = pd.DataFrame.from_dict(Tuning_Max_depth)

plot_df = Tuning_Max_depth_df.melt('Max_Depth',var_name='Metrics',value_name="Values")
fig,ax = py.subplots(figsize=(15,5))
sns.pointplot(x="Max_Depth", y="Values",hue="Metrics", data=plot_df,ax=ax)

In [None]:
Tuning_Max_depth = {  "Max_Depth": tree_depths , "Test_acc":test_accuracy ,"Test_val_acc":test_val_accuracy,"Test_f1":test_f1 ,"Test_val_f1":test_val_f1  }
Tuning_Max_depth_df = pd.DataFrame.from_dict(Tuning_Max_depth)

plot_df = Tuning_Max_depth_df.melt('Max_Depth',var_name='Metrics',value_name="Values")
fig,ax = py.subplots(figsize=(15,5))
sns.pointplot(x="Max_Depth", y="Values",hue="Metrics", data=plot_df,ax=ax)

###### From the plotted graph, depth = 8 seems to be a reasonable value, balancing computation time limits against the risk of overfitting.

In [None]:
# depth = 8

tree_clf = DecisionTreeClassifier(max_depth =8)
tree_clf.fit(X_train_encoded,y_train)
y_pred = tree_clf.predict(X_train_encoded)
print("Training Data Set Accuracy: ", accuracy_score(y_train,y_pred))
print("Training Data F1 Score ", f1_score(y_train,y_pred))

print("Validation Mean F1 Score: ",cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='f1_macro').mean())
print("Validation Mean Accuracy: ",cross_val_score(tree_clf,X_train_encoded,y_train,cv=5,scoring='accuracy').mean())

In [None]:
pd.crosstab(y_train, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
# testing the decision tree classifier of depth 8
y_pred = tree_clf.predict(X_test_encoded)
print("Test Data Set Accuracy: ", accuracy_score(y_test,y_pred))
print("Test Data F1 Score ", f1_score(y_test,y_pred))

print("Validation Test Mean F1 Score: ",cross_val_score(tree_clf,X_test_encoded,y_test,cv=5,scoring='f1_macro').mean())
print("Validation Test Mean Accuracy: ",cross_val_score(tree_clf,X_test_encoded,y_test,cv=5,scoring='accuracy').mean())

pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

#### ROC curve for Decision Tree classifier

In [None]:
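# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator (1.0+) replaces it
from sklearn.metrics import RocCurveDisplay
ax=py.gca()
rfc=RocCurveDisplay.from_estimator(tree_clf,X_test_encoded,y_test,ax=ax,alpha=0.8)
py.show()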

### Model 3 - Random Forest

No standardization (scaling) is done here, as stated before.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100,max_depth=14,min_samples_leaf = 10, random_state = 42)
rf_clf.fit(X_train_encoded,y_train)
y_pred = rf_clf.predict(X_train_encoded)
print("Train F1 Score ", f1_score(y_train,y_pred))
print("Train Accuracy ", accuracy_score(y_train,y_pred))



In [None]:

pd.crosstab(y_train, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
########## evaluating random forest model for test dataset ##########
y_pred = rf_clf.predict(X_test_encoded)
print("Test Accuracy: ",accuracy_score(y_test,y_pred))
print("Test F1 Score: ",f1_score(y_test,y_pred))
print("Confusion Matrix on Test Data")
pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

### ROC curve for Random Forest Classifier


In [None]:
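# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator (1.0+) replaces it
from sklearn.metrics import RocCurveDisplay
ax=py.gca()
rfc=RocCurveDisplay.from_estimator(rf_clf,X_test_encoded,y_test,ax=ax,alpha=0.8)
py.show()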

### Conclusion:
The **random forest** model gives the highest accuracy of the three models on both the training and test splits.

## E. Test Set Prediction:


In [None]:
######### importing test data set for evaluation ################
X_test_evaluation = pd.read_csv('../input/loan-dataset/test_x.csv')

In [None]:
X_test_evaluation.info()


There is no missing data in the test dataset, so no imputation is required, but dummy variables still have to be added for the categorical features.
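
One point to watch when encoding a separate test file is that the dummy columns must match those created for the training data, even if a category happens to be absent from the test rows. The sketch below shows the idea in plain Python by encoding against a fixed category list. The helper `one_hot_drop_first` and the test rows are hypothetical; only the X/Y/Z occupation codes come from the data description.

```python
def one_hot_drop_first(values, prefix, categories):
    """One-hot encode `values` against a fixed, sorted category list,
    dropping the first category so it becomes the implicit baseline."""
    kept = sorted(categories)[1:]
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

# Categories as seen in the training data (the X/Y/Z occupation codes)
train_categories = ["X", "Y", "Z"]

# Hypothetical test rows; note that "Z" never appears in them
test_occupation = ["Y", "X", "Y"]
encoded = one_hot_drop_first(test_occupation, "Occupation type", train_categories)
print(encoded)
```

Because the category list is fixed in advance, the `Occupation type_Z` column is still produced (filled with zeros), so the encoded test rows have the same columns as the training rows.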

In [None]:
########### getting dummy values for categorical columns #########
X_test_evaluation_new = X_test_evaluation.drop(columns="ID_Test")
X_test_evaluation_new= pd.get_dummies(X_test_evaluation_new,drop_first=True)
X_test_evaluation_new.info()

In [None]:
X_test_evaluation_new.head(10)

In [None]:
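## Putting the columns of the prediction test dataset in the same order as the training dataset.
## 'Age' is renamed to 'Age_1.0' so that the feature names match those seen by the model during fitting.
X_test_evaluation_new = X_test_evaluation_new.rename(columns={'Age': 'Age_1.0'})

X_test_evaluation_new = X_test_evaluation_new[['Expense','Income','Score1','Score2','Score3','Score4','Score5','Loan type_B','Occupation type_Y','Occupation type_Z','Age_1.0']]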

X_test_evaluation_new.info()

In [None]:
X_test_evaluation_new.head()

In [None]:
######### predicting outputs using Random Forest Classifier ############
pred_y_new =rf_clf.predict(X_test_evaluation_new)




In [None]:
########## creating the ID of Loan Test data in a separate dataframe #######
ID_column =pd.DataFrame(X_test_evaluation["ID_Test"])

############ creating the final required pred_y file with ID_Test and Label_Test as columns#######
pred_y = ID_column.copy()
pred_y["Label_Test"]= pred_y_new



In [None]:
######### getting output file in csv format ###########
# index=False keeps only the ID_Test and Label_Test columns in the output file
pred_y.to_csv('pred_y.csv', index=False)

## F. Additional Understanding:

### Principal Component Analysis for Standardized Dataset

In [None]:
#################### PCA analysis for capturing 99 percent of variance in dataset ##############
from sklearn.decomposition import PCA
pca = PCA(0.99)  ########## capturing 99 percent variance in dataset

pr_comp=pca.fit_transform(X_standard)
pr_df= pd.DataFrame([])
pr_df = pd.DataFrame(data = pr_comp,columns = ['Principal_Comp_1','Principal_Comp_2','Principal_Comp_3','Principal_Comp_4','Principal_Comp_5','Principal_Comp_6','Principal_Comp_7','Principal_Comp_8','Principal_Comp_9'])
pr_df.info()

In [None]:
############ finding percentage of information captured by each principal components

principal_components =['Principal_Comp_1','Principal_Comp_2','Principal_Comp_3','Principal_Comp_4','Principal_Comp_5','Principal_Comp_6','Principal_Comp_7','Principal_Comp_8','Principal_Comp_9']
principal_information_percent = pd.DataFrame([])
principal_information_percent = pd.DataFrame(principal_components)
principal_information_percent['percent variation captured'] = pd.DataFrame(data = pca.explained_variance_ratio_)

principal_information_percent

Principal Component Analysis needs 9 components to capture 99 percent of the variation in the dataset. The original dataset has 11 features, so only 2 dimensions are removed.
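
The number of components follows from the cumulative explained-variance ratio: components are added in decreasing order of variance until the running total exceeds the requested fraction. The sketch below shows that rule in plain Python. The ratios are hypothetical placeholder values, not the output of the cells above, and `components_needed` is an illustrative helper.

```python
def components_needed(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative
    explained-variance ratio exceeds `target`."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative > target:
            return count
    return len(explained_variance_ratio)

# Hypothetical ratios, sorted in decreasing order as PCA reports them
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.025, 0.015]

print(components_needed(ratios, 0.95))
print(components_needed(ratios, 0.99))
```

A stricter target needs more components because the trailing components each add only a small share of variance.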

In [None]:
###Finding the amount of variance explained by each principal component###
print(pca.explained_variance_)

In [None]:
###Directions of Principal Axes##
print(pca.components_)

In [None]:
#################### PCA analysis for capturing 95 percent of variance in dataset ##############
pca2 = PCA(0.95)  ########## capturing 95 percent variance in dataset

pr_comp2=pca2.fit_transform(X_standard)
pr_df2 = pd.DataFrame([])
pr_df2 = pd.DataFrame(data = pr_comp2,columns = ['New Principal_Comp_1','New Principal_Comp_2','New Principal_Comp_3','New Principal_Comp_4','New Principal_Comp_5','New Principal_Comp_6'])
pr_df2.info()

For 95 percent variance retention, 6 principal components are needed, against 11 features in the original dataset, so PCA removes nearly half of the dimensions.

In [None]:
############ finding percentage of information captured by each principal components

principal_components_2 =['New Principal_Comp_1','New Principal_Comp_2','New Principal_Comp_3','New Principal_Comp_4','New Principal_Comp_5','New Principal_Comp_6']
principal_information_percent_2 = pd.DataFrame([])
principal_information_percent_2 = pd.DataFrame(principal_components_2)
principal_information_percent_2['percent variation captured'] = pd.DataFrame(data = pca2.explained_variance_ratio_)

principal_information_percent_2

Comparing this with the 99% variance retention case, we see that the last three principal components of the 99% case are dropped in the 95% case because they capture very little variance.

### Feature Selection

In [None]:
##Mutual information (MI) Criterion##
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_standard, y_new,random_state = 42)
print(mi)
print("Mean value of MI = ", np.mean(mi))
print("Standard deviation of MI is =", np.std(mi))

Observing the mean and standard deviation, we select 0.045 as a threshold. Five features have mutual information below this threshold: the last four (Loan type_B, Occupation type_Y, Occupation type_Z and Age_1.0) and Score3. So, these features are less important under the MI criterion.
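
Mutual information measures how much knowing a feature reduces uncertainty about the label; it is zero when the two are independent. The sketch below computes it for discrete variables from empirical frequencies in plain Python. This simple plug-in estimate is an assumption made for illustration: for continuous features, `mutual_info_classif` relies on a nearest-neighbour based estimator instead. The toy sequences are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences,
    estimated from their empirical joint and marginal frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

label = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # determines the label exactly
uninformative = ["a", "b", "a", "b"]  # independent of the label

mi_informative = mutual_information(informative, label)
mi_uninformative = mutual_information(uninformative, label)
print(mi_informative, mi_uninformative)
```

A feature that fixes the label reaches the maximum for a balanced binary label, ln 2, while an independent feature scores zero. Features scoring near zero are the ones the threshold filters out.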

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
tree_model = ExtraTreesClassifier(random_state = 42)
tree_model.fit(X_standard, y_new)
importance_list = tree_model.feature_importances_
print(importance_list)
print("Mean value of importance = ", np.mean(importance_list))
print("Standard deviation of importance is =", np.std(importance_list))


Observing the mean and standard deviation, we select 0.08 as a threshold. The last four features, i.e. Loan type_B, Occupation type_Y, Occupation type_Z and Age_1.0, have importance values below the threshold. So, these features are less important under the tree importance criterion.

Taking the intersection of the two feature importance criteria above, the last four features, i.e. Loan type_B, Occupation type_Y, Occupation type_Z and Age_1.0, are the less important ones. In the original dataset given in the problem, these correspond to the loan type, occupation type and age of the customer. So, we can say that these features are less important than the others we have here.
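
The intersection step can be sketched as a set operation: collect the features below each threshold and keep those that appear in both sets. In the sketch below the feature names and scores are placeholders for illustration only (they are not the notebook's outputs); the two thresholds, 0.045 and 0.08, are the ones quoted in the text.

```python
def below_threshold(scores, threshold):
    """Names of the features whose score falls below the threshold."""
    return {name for name, value in scores.items() if value < threshold}

# Placeholder scores for illustration only; they are NOT the notebook's outputs
mi_scores = {"f1": 0.09, "f2": 0.03, "f3": 0.01, "f4": 0.12}
tree_scores = {"f1": 0.15, "f2": 0.10, "f3": 0.02, "f4": 0.20}

# Thresholds quoted in the text: 0.045 for MI, 0.08 for tree importance
weak_in_both = below_threshold(mi_scores, 0.045) & below_threshold(tree_scores, 0.08)
print(sorted(weak_in_both))
```

Requiring a feature to be weak under both criteria is the more conservative choice: a feature flagged by only one method (like `f2` here) is kept.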

-----------------------------------------##THANKYOU##-------------------------------------------