## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

## Loading the dataset
Here I load the train and test parts of the data. The model is fitted on the train part, and the test part is kept for testing, so that we can find out how well the model performs.

I also put the train and test frames into a list so that I can apply the same change to both of them in one loop whenever both need it.

In [None]:
train_df = pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')
test_df = pd.read_csv('../input/loan-prediction-problem-dataset/test_Y3wMUE5_7gLdaTN.csv')

df = [train_df, test_df]
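A minimal sketch of why the list trick works, using two made-up toy frames rather than the real files: the list holds references to the original frames, so assigning a column inside the loop changes the originals too. The names `toy_train`, `toy_test` and `frames` are only for illustration.

```python
import pandas as pd

# Two tiny stand-ins for train_df and test_df (made-up values).
toy_train = pd.DataFrame({"Gender": ["Male", None]})
toy_test = pd.DataFrame({"Gender": [None, "Female"]})

# The list stores references to the same objects, not copies.
frames = [toy_train, toy_test]

for frame in frames:
    # Column assignment on the frame itself is visible through the original names.
    frame["Gender"] = frame["Gender"].fillna("Male")

print(toy_train["Gender"].tolist())
print(toy_test["Gender"].tolist())
```

Assigning back with `frame[...] = ...` is the safer pattern here; calling `fillna(..., inplace=True)` on a column pulled out of the frame may not update the frame in recent pandas versions.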

In [None]:
train_df.info()

In [None]:
train_df.head()

## Let's have a look at our features
    
    Categorical features:-
        numeric:-  
            Credit_History
            Dependents
        non-numeric:-
            Gender
            Married
            Education
            Self_Employed
            Property_Area
            Loan_Status(Target variable)
    Numeric features:-
        ApplicantIncome
        CoapplicantIncome
        LoanAmount
        Loan_Amount_Term

In [None]:
sns.heatmap(train_df.isnull(), cbar=False, yticklabels=False)

There are not many null values in any feature, so I will fill them in rather than drop any feature. As a rule of thumb, if a feature has a large share of null values relative to its length, it is better to drop it, because it adds little useful information to the dataset. I also have to drop Loan_ID, as it is just an identifier and has no effect on the target variable.
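The heatmap gives a visual impression; the same check can be done numerically. The sketch below uses a made-up toy frame, and the 0.5 cutoff (`DROP_THRESHOLD`) is an assumption chosen for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, np.nan, np.nan],
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: columns missing more than half of their values are drop candidates.
DROP_THRESHOLD = 0.5
to_drop = missing_frac[missing_frac > DROP_THRESHOLD].index.tolist()

print(missing_frac)
print("Drop candidates:", to_drop)
```

Columns below the cutoff are the ones worth imputing, which is what the rest of this notebook does.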

## Feature Engineering
Let's go feature by feature. Data scientists usually split the work into stages such as **Feature Engineering, Feature Selection, EDA, Model Training etc**. In general, the first three stages are used to look into the data and understand it before making predictions, and doing them well is what leads to good accuracy.

Personally, I like to go feature by feature: I select a feature, look it over, and make changes if needed. This gives me more command over each feature and helps me understand it better.

**General-->** take the whole dataset and run each stage on it (feature engineering, feature selection, data analysis).

**Me-->** take one feature, do all the necessary steps on that feature, and repeat this for every feature.

In [None]:
# Loan_Status feature --- target variable
train_df['Loan_Status'] = train_df.Loan_Status.map({'Y': 1, 'N': 0}).astype(int)

In [None]:
# Gender feature
train_df.Gender.value_counts()

In [None]:
train_df.Gender.isnull().sum()

In [None]:
test_df.Gender.isnull().sum()

In [None]:
train_df[['Gender', 'Loan_Status']].groupby('Gender', as_index=False).mean()

In [None]:
grid = sns.FacetGrid(train_df, col='Loan_Status')
grid.map(plt.hist, 'Gender')

Here the 'Male' category shows the higher mean Loan_Status, and 'Male' is also the mode of the Gender feature, so I decided to fill the NaN values with the 'Male' category.

In [None]:
for dataset in df:
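    # Assign back to the column; inplace fillna on an extracted column may not update the frame
    dataset['Gender'] = dataset['Gender'].fillna('Male')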

In [None]:
train_df.Gender.isnull().sum()

In [None]:
# Changing Gender feature into numeric so that our model works properly, kind of label encoding
for dataset in df:
    dataset['Gender'] = dataset['Gender'].map({'Male': 1, 'Female': 0}).astype(int)

In [None]:
# Married Feature
train_df.Married.value_counts()

In [None]:
train_df.Married.isnull().sum()

In [None]:
for dataset in df:
    dataset['Married'] = dataset.Married.fillna(dataset.Married.mode()[0])

In [None]:
train_df[['Married', 'Loan_Status']].groupby('Married', as_index=False).mean()

In [None]:
sns.set(style='whitegrid')
grid = sns.FacetGrid(train_df, col='Loan_Status')
grid.map(plt.hist, 'Married') 

Wow! I didn't know this before: a married applicant is more likely to get a loan.

In [None]:
grid = sns.FacetGrid(train_df, row='Education', height=2.8, aspect=1.6)
grid.map(sns.barplot, 'Married', 'Loan_Status', 'Gender', ci=None, palette='deep')
grid.add_legend()

So a woman who is not a graduate and is married is more likely to get a loan. In fact, even an unmarried woman is still more likely to get a loan than a man. (Remember: 0 is female, 1 is male.)

In [None]:
for dataset in df:
    dataset['Married'] = dataset['Married'].map({'Yes': 1, 'No': 0}).astype(int)

In [None]:
# Dependents feature
train_df.Dependents.value_counts()

In [None]:
train_df.Dependents.isnull().sum()

In [None]:
grid = sns.FacetGrid(train_df, row='Gender')
grid.map(sns.barplot, 'Dependents', 'Loan_Status', palette='deep', ci=None)
grid.add_legend()

For men, the mean Loan_Status stays roughly between 0.6 and 0.8 whatever the number of dependents, but for women with 3 or more dependents it drops sharply. This feature may not affect the target variable directly, since the mean Loan_Status neither strictly increases nor strictly decreases as the number of dependents grows. It may instead interact with other features, such as Gender in this case.

I could also derive another feature from this one, named 'Dependent_3+', which indicates whether a person has 3 or more dependents, but I'll do that later if the accuracy is not good enough.
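A minimal sketch of what such a flag could look like, on made-up values. The column name `Dependents_3plus` is a hypothetical, Python-friendly spelling of 'Dependent_3+', and the flag has to be computed before '3+' is replaced by '3'.

```python
import pandas as pd

# Toy column with the raw categories, including a missing value (made-up data).
toy = pd.DataFrame({"Dependents": ["0", "1", "3+", None, "2"]})

# 1 where the raw category is '3+', otherwise 0.
# A missing value compares unequal to '3+', so it is flagged as 0.
toy["Dependents_3plus"] = (toy["Dependents"] == "3+").astype(int)

print(toy)
```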

In [None]:
for dataset in df:
    dataset['Dependents'] = dataset['Dependents'].fillna(train_df.Dependents.mode()[0])
    dataset['Dependents'] = dataset['Dependents'].replace('3+', '3')
    dataset['Dependents'] = dataset.Dependents.astype(int)

In [None]:
train_df.head()

In [None]:
# Education: I check the value counts every time to catch errors or typos in a categorical feature.
train_df.Education.value_counts()

In [None]:
train_df[['Education', 'Loan_Status']].groupby('Education', as_index=False).mean()

In [None]:
train_df.Education.isnull().sum()

In [None]:
test_df.Education.isnull().sum()

In [None]:
for dataset in df:
    dataset['Education'] = dataset['Education'].map({'Graduate': 1, 'Not Graduate': 0}).astype(int)

In [None]:
# Self_Employed
train_df.Self_Employed.value_counts()

In [None]:
train_df.Self_Employed.isnull().sum()

In [None]:
train_df[['Self_Employed', 'Loan_Status']].groupby('Self_Employed', as_index=False).mean()

In [None]:
for dataset in df:
    dataset['Self_Employed'] = dataset['Self_Employed'].fillna(dataset['Self_Employed'].mode()[0])
    dataset['Self_Employed'] = dataset['Self_Employed'].map({'No': 0, 'Yes': 1}).astype(int)

In [None]:
# Credit_History
train_df.Credit_History.value_counts()

In [None]:
train_df.Credit_History.isnull().sum()

In [None]:
# gender, married, credit history, loan status
# gender, education, credit history, loan status
# gender, self employed, credit history, loan status

In [None]:
grid = sns.FacetGrid(train_df, row='Married', aspect=1.5)
grid.map(sns.barplot, 'Credit_History', 'Loan_Status', 'Gender', palette='deep', ci=None)
grid.add_legend()

A married person with a credit history of 1 is more likely to get a loan than an unmarried one. Overall, though, anyone with a credit history of 1 has a higher chance of getting a loan.

In [None]:
grid = sns.FacetGrid(train_df, row='Education', aspect=1.5)
grid.map(sns.barplot, 'Credit_History', 'Loan_Status', 'Gender', palette='deep', ci=None)
grid.add_legend()

In [None]:
grid = sns.FacetGrid(train_df, row='Self_Employed', aspect=1.5)
grid.map(sns.barplot, 'Credit_History', 'Loan_Status', 'Gender', palette='deep', ci=None)
grid.add_legend()

In [None]:
for dataset in df:
    dataset['Credit_History'] = dataset['Credit_History'].fillna(dataset['Credit_History'].mode()[0]).astype(int)

In [None]:
# Property_Area
train_df.Property_Area.value_counts()

In [None]:
train_df.Property_Area.isnull().sum()

In [None]:
train_df[['Property_Area', 'Loan_Status']].groupby('Property_Area', as_index=False).mean().sort_values(by='Loan_Status', ascending=False)

In [None]:
grid = sns.FacetGrid(train_df, row='Married', aspect=1.5)
grid.map(sns.barplot, 'Property_Area', 'Loan_Status', 'Gender', palette='deep', ci=None)
grid.add_legend()

A person from a semiurban area has a higher chance of getting a loan.

In [None]:
for dataset in df:
    dataset['Property_Area'] = dataset['Property_Area'].map({'Rural': 0, 'Urban': 1, 'Semiurban': 2}).astype(int)

In [None]:
train_df.head()

In [None]:
# ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term.

In [None]:
train_df.describe()

In [None]:
sns.set(style='darkgrid')
sns.boxplot(train_df.ApplicantIncome)

In [None]:
sns.set(style='darkgrid')
sns.boxplot(train_df.CoapplicantIncome)

In [None]:
sns.set(style='darkgrid')
sns.boxplot(train_df.LoanAmount)

In [None]:
sns.set(style='darkgrid')
sns.boxplot(train_df.Loan_Amount_Term)

In [None]:
train_df['ApplicantIncome'] = train_df['ApplicantIncome'].astype(int)

In [None]:
train_df['ApplicantIncomeBand'] = pd.cut(train_df['ApplicantIncome'], 4)
train_df[['ApplicantIncomeBand', 'Loan_Status']].groupby('ApplicantIncomeBand', as_index=False).mean().sort_values(by='ApplicantIncomeBand', ascending=True)

In [None]:
train_df['CoapplicantIncome'] = train_df['CoapplicantIncome'].astype(int)

In [None]:
train_df['CoapplicantIncomeBand'] = pd.cut(train_df['CoapplicantIncome'], 3)
train_df[['CoapplicantIncomeBand', 'Loan_Status']].groupby('CoapplicantIncomeBand', as_index=False).mean().sort_values(by='CoapplicantIncomeBand', ascending=True)

In [None]:
for dataset in df:
    dataset['LoanAmount'] = dataset['LoanAmount'].fillna(dataset['LoanAmount'].mean())

In [None]:
train_df['LoanAmountBand'] = pd.cut(train_df['LoanAmount'], 4)
train_df[['LoanAmountBand', 'Loan_Status']].groupby('LoanAmountBand', as_index=False).mean().sort_values(by='LoanAmountBand', ascending=True)

In [None]:
for dataset in df:
    dataset['Loan_Amount_Term'] = dataset['Loan_Amount_Term'].fillna(dataset['Loan_Amount_Term'].mean())

In [None]:
train_df['Loan_Amount_TermBand'] = pd.cut(train_df['Loan_Amount_Term'], 3)
train_df[['Loan_Amount_TermBand', 'Loan_Status']].groupby('Loan_Amount_TermBand', as_index=False).mean().sort_values(by='Loan_Amount_TermBand', ascending=True)

In [None]:
train_df.head()

Based on the band tables created above, I map each numeric feature to ordinal groups using the band edges.

In [None]:
for dataset in df:
    dataset.loc[dataset['ApplicantIncome'] <= 20362.5, 'ApplicantIncome'] = 0
    dataset.loc[(dataset['ApplicantIncome'] > 20362.5) & (dataset['ApplicantIncome'] <= 40575.0), 'ApplicantIncome'] = 1
    dataset.loc[(dataset['ApplicantIncome'] > 40575.0) & (dataset['ApplicantIncome'] <= 60787.5), 'ApplicantIncome'] = 2
    dataset.loc[(dataset['ApplicantIncome'] > 60787.5), 'ApplicantIncome'] = 3

In [None]:
for dataset in df:
    dataset.loc[dataset['CoapplicantIncome'] <= 13889.0, 'CoapplicantIncome'] = 0
    dataset.loc[(dataset['CoapplicantIncome'] > 13889.0) & (dataset['CoapplicantIncome'] <= 27778.0), 'CoapplicantIncome'] = 1
    dataset.loc[(dataset['CoapplicantIncome'] > 27778.0), 'CoapplicantIncome'] = 2

In [None]:
for dataset in df:
    dataset.loc[dataset['LoanAmount'] <= 181.75, 'LoanAmount'] = 0
    dataset.loc[(dataset['LoanAmount'] > 181.75) & (dataset['LoanAmount'] <= 354.5), 'LoanAmount'] = 1
    dataset.loc[(dataset['LoanAmount'] > 354.5) & (dataset['LoanAmount'] <= 527.25), 'LoanAmount'] = 2
    dataset.loc[(dataset['LoanAmount'] > 527.25), 'LoanAmount'] = 3
    dataset['LoanAmount'] = dataset['LoanAmount'].astype(int)

In [None]:
for dataset in df:
    dataset.loc[dataset['Loan_Amount_Term'] <= 168.0, 'Loan_Amount_Term'] = 0
    dataset.loc[(dataset['Loan_Amount_Term'] > 168.0) & (dataset['Loan_Amount_Term'] <= 324.0), 'Loan_Amount_Term'] = 1
    dataset.loc[(dataset['Loan_Amount_Term'] > 324.0), 'Loan_Amount_Term'] = 2
    dataset['Loan_Amount_Term'] = dataset['Loan_Amount_Term'].astype(int)
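The chains of `.loc` assignments above can also be written as a single `pd.cut` call with explicit edges. This is only a sketch on made-up values: it reuses the LoanAmount cut points from the cell above, and the open-ended outer edges are my assumption so that values outside the training range still fall into a bin.

```python
import numpy as np
import pandas as pd

# Toy LoanAmount values, including two that sit exactly on a cut point.
toy = pd.DataFrame({"LoanAmount": [9.0, 181.75, 200.0, 354.5, 400.0, 700.0]})

# Same inner cut points as the LoanAmount cell; outer edges are open-ended.
edges = [-np.inf, 181.75, 354.5, 527.25, np.inf]

# right=True (the default) gives bins of the form (low, high],
# which matches the `<=` comparisons used in the .loc version.
# labels=False returns the integer bin index instead of an interval.
toy["LoanAmountBin"] = pd.cut(toy["LoanAmount"], bins=edges, labels=False).astype(int)

print(toy)
```

Writing the result to a new column keeps the raw values available, whereas the `.loc` version overwrites them.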

In [None]:
train_df.head()

In [None]:
train_df.drop('ApplicantIncomeBand', inplace=True, axis=1)
train_df.drop('CoapplicantIncomeBand', inplace=True, axis=1)
train_df.drop('LoanAmountBand', inplace=True, axis=1)
train_df.drop('Loan_Amount_TermBand', inplace=True, axis=1)

In [None]:
for dataset in df:
    dataset.drop('Loan_ID', axis=1, inplace=True)

In [None]:
X = train_df.drop('Loan_Status', axis=1)
y = train_df['Loan_Status']

In [None]:
data_corr = pd.concat([X, y], axis=1)
corr = data_corr.corr()
plt.figure(figsize=(11,7))
sns.heatmap(corr, annot=True)

## Model Training
To check accuracy I am using the k-fold cross-validation score. It splits the data into as many train and test parts as the 'cv' parameter says, scores the model on each split, and then `.mean()` gives the mean of all the scores. To know more about cross-validation, check out https://machinelearningmastery.com/k-fold-cross-validation/
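To make the idea concrete, here is a small sketch that runs the folds by hand and compares the result with `cross_val_score`. It uses synthetic data from `make_classification` instead of the loan data, and names such as `X_demo` and `fold_scores` are only for illustration. For a classifier with an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold` as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (not the loan data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Manual loop: fit on nine folds, score accuracy on the held-out fold.
fold_scores = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = float(np.mean(fold_scores))

# The helper does the same splitting, fitting and scoring in one call.
helper_mean = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()

print(manual_mean, helper_mean)
```

Because `cross_val_score` fits its own copies of the model on each fold, a separate `fit(X, y)` call beforehand is not needed to get the score.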

In [None]:
LogReg_classifier = LogisticRegression()
LogReg_classifier.fit(X,y)

In [None]:
LogReg_acc = cross_val_score(LogReg_classifier, X, y, cv=10, scoring='accuracy').mean()
LogReg_acc

In [None]:
SVM_classifier = SVC()
SVM_classifier.fit(X,y)

In [None]:
SVM_acc = cross_val_score(SVM_classifier, X, y, cv=10, scoring='accuracy').mean()
SVM_acc

In [None]:
Knn_classifier = KNeighborsClassifier()
Knn_classifier.fit(X,y)

In [None]:
Knn_acc = cross_val_score(Knn_classifier, X, y, cv=10, scoring='accuracy').mean()
Knn_acc

In [None]:
Tree_classifier = DecisionTreeClassifier()
Tree_classifier.fit(X,y)

In [None]:
Tree_acc = cross_val_score(Tree_classifier, X, y, cv=10, scoring='accuracy').mean()
Tree_acc

In [None]:
Ran_classifier = RandomForestClassifier(n_estimators=100)
Ran_classifier.fit(X, y)

In [None]:
Ran_acc = cross_val_score(Ran_classifier, X, y, cv=10, scoring='accuracy').mean()
Ran_acc

In [None]:
XGB_classifier = XGBClassifier()
XGB_classifier.fit(X,y)

In [None]:
XGB_acc = cross_val_score(XGB_classifier, X, y, cv=10, scoring='accuracy').mean()
XGB_acc

In [None]:
acc_dict = {'Logistic Regression': round(LogReg_acc, 2),
            'Support Vector Classifier': round(SVM_acc, 2),
            'K-nearest Neighbor': round(Knn_acc, 2),
            'Decision Tree': round(Tree_acc, 2),
            'Random Forest': round(Ran_acc, 2),
            'XGB': round(XGB_acc, 2)
            }
print('Accuracy Scores:-')
acc_dict
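The same comparison can be written as one loop over a dictionary of models, which avoids repeating a cell per classifier. This is a sketch on synthetic data with only three of the models, so the scores it prints say nothing about the loan dataset; `models`, `scores` and `best` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (not the loan data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean 10-fold accuracy per model, rounded as in acc_dict.
scores = {
    name: round(cross_val_score(clf, X_demo, y_demo, cv=10, scoring="accuracy").mean(), 2)
    for name, clf in models.items()
}

# Name of the model with the highest mean accuracy.
best = max(scores, key=scores.get)
print(scores)
print("Best:", best)
```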

The best models for predicting the output are logistic regression and SVM. If you have any suggestions for me, please let me know, and if you like my notebook, please upvote it. 😊😊