LOAN
In finance, a loan is the lending of money by one or more individuals, organizations, or other entities to other individuals, organizations, etc. The recipient (i.e., the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed.

The document evidencing the debt (e.g., a promissory note) will normally specify, among other things, the principal amount of money borrowed, the interest rate the lender is charging, and the date of repayment. A loan entails the reallocation of the subject asset(s) for a period of time, between the lender and the borrower.

The interest provides an incentive for the lender to engage in the loan. In a legal loan, each of these obligations and restrictions is enforced by contract, which can also place the borrower under additional restrictions known as loan covenants. Although this article focuses on monetary loans, in practice, any material object might be lent.

Acting as a provider of loans is one of the main activities of financial institutions such as banks and credit card companies. For other institutions, issuing of debt contracts such as bonds is a typical source of funding.



DATA DESCRIPTION
Problem Statement: This dataset includes details of applicants who have applied for a loan, such as their credit history, loan amount, income, and number of dependents.

Independent Variables:

Loan_ID

Gender

Married

Dependents

Education

Self_Employed

ApplicantIncome

CoapplicantIncome

LoanAmount

Loan_Amount_Term

Credit_History

Property_Area

Dependent Variable (Target Variable):

Loan_Status
You have to build a model that can predict whether the loan of the applicant will be approved or not on the basis of the details provided in the dataset.

In [None]:
##The Dataset

In [None]:
import pandas as pd
import numpy as np

path = 'https://raw.githubusercontent.com/dsrscientist/DSData/master/loan_prediction.csv'
df = pd.read_csv(path)
df


In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.columns

##Details About the Columns
1. Loan_ID

This column shows the applicant's unique loan application ID.

2. Gender

This column shows the applicant's gender.

3. Married

This column shows the applicant's marital status.

4. Dependents

This column shows how many dependents the applicant has.

5. Education

This column shows the applicant's education level.

6. Self_Employed

This column shows whether the applicant is self-employed.

7. ApplicantIncome

This column shows the applicant's income.

8. CoapplicantIncome

This column shows the co-applicant's income.

9. LoanAmount

This column shows the loan amount the applicant requested.

10. Loan_Amount_Term

This column shows the term of the loan.

11. Credit_History

This column shows the applicant's past loan credit history.

12. Property_Area

This column shows the area in which the applicant's property is located.

13. Loan_Status

This column shows whether the applicant's loan was approved.

In [None]:
#Data Exploration
cat_cols=df.select_dtypes([object])

for col in cat_cols.columns:
    print(col)
    print(df[col].value_counts())
    print('************************************')

In [None]:
df['CoapplicantIncome'].value_counts()

In [None]:
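# include_lowest=True keeps co-applicant incomes of exactly 0 in the 'Low' bin instead of turning them into NaN
df['CoapplicantIncome'] = pd.cut(df['CoapplicantIncome'], bins=[0, 1000, 3000, 42000], labels=['Low', 'Average', 'High'], include_lowest=True)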


In [None]:
df['CoapplicantIncome'].value_counts()
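
The binning cells in this notebook rely on `pd.cut`, which builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. The sketch below shows the difference on a few made-up incomes (hypothetical values, not rows from the dataset), using the same edges as the CoapplicantIncome cell.

```python
import pandas as pd

# Toy incomes (hypothetical values, not taken from the dataset)
incomes = pd.Series([0, 500, 1000, 2000, 41667])
bins = [0, 1000, 3000, 42000]
labels = ['Low', 'Average', 'High']

# Default: intervals are (0, 1000], (1000, 3000], (3000, 42000], so 0 matches no bin
default_bins = pd.cut(incomes, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1000]
closed_bins = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'income': incomes,
                    'default': default_bins,
                    'include_lowest': closed_bins}))
```

Any value left as NaN by the default behaviour would later be treated as missing and filled in by the imputation step.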

In [None]:
df['ApplicantIncome'].value_counts()

In [None]:
df['ApplicantIncome'] = pd.cut(df['ApplicantIncome'], bins = [0,2500,4000,6000,81000], labels = ['Low','Average','High', 'Very high'])
df['ApplicantIncome'].value_counts()

In [None]:
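# ApplicantIncome and CoapplicantIncome were binned into categories above, so sum the raw numeric columns from the source file instead
df['Total_Income'] = pd.read_csv(path, usecols=['ApplicantIncome', 'CoapplicantIncome']).sum(axis=1)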


In [None]:
df['Total_Income'].value_counts()

In [None]:
df['Total_Income'] = pd.cut(df['Total_Income'], bins = [0,2500,4000,6000,81000], labels = ['Low','Average','High', 'Very high'])
df['Total_Income'].value_counts()

In [None]:
df['LoanAmount'].value_counts()

In [None]:
df['LoanAmount'] = pd.cut(df['LoanAmount'], bins = [0,100,200,700], labels = ['Low','Average','High'])
df['LoanAmount'].value_counts()

In [None]:
df['Loan_Amount_Term'].value_counts()


In [None]:
df['Loan_Amount_Term'] = pd.cut(df['Loan_Amount_Term'], bins = [0,100,200,700], labels = ['Low','Average','High'])
df['Loan_Amount_Term'].value_counts()

In [None]:
##Checking null values
df.isnull().sum()

In [None]:
#checking of the data types of the dataset
df.dtypes

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())
plt.title("Null Values")
plt.show()

In [None]:
df.describe(include=['O'])

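Missing values are imputed according to the column type:

For integer columns, we use the mean.

For object (categorical) columns, we use the mode.

For float columns, we use the median.

Every column with missing values here is categorical, binary, or has been binned into categories, so the mode is applied throughout.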
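
A type-based imputation rule of this kind can be sketched as a small helper. This is only an illustration under simplifying assumptions: the function name `impute_simple` and the toy frame are hypothetical, and the sketch uses the median for every numeric column (a NumPy integer column cannot hold NaN, so a numeric column with gaps arrives as float in pandas) and the mode for everything else.

```python
import pandas as pd

def impute_simple(frame):
    """Fill numeric columns with the median and all other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame (hypothetical values) with one gap per column
demo = pd.DataFrame({'a': [1.0, None, 3.0, 100.0],
                     'b': ['x', None, 'x', 'y']})
filled = impute_simple(demo)
print(filled)
```

The median is less sensitive than the mean to a single extreme value such as the 100.0 above, which is the usual reason for preferring it on skewed columns.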

In [None]:
# Fill missing values with each column's mode (most frequent value).
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
cols_with_missing = ['Gender', 'Married', 'Dependents', 'CoapplicantIncome', 'Self_Employed',
                     'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
for col in cols_with_missing:
    df[col] = df[col].fillna(df[col].mode()[0])
df.isnull().sum()

In [None]:
df.shape

In [None]:
##Transformation for balancing
df.info()

In [None]:
df.dtypes

In [None]:
df.describe(include=['O'])

In [None]:
from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()

list1=['Gender','Married','Dependents','Education','Self_Employed','Property_Area','ApplicantIncome','Loan_Status','CoapplicantIncome','LoanAmount','Total_Income','Credit_History','Loan_Amount_Term']
for val in list1:
  df[val]=le.fit_transform(df[val].astype(str))

In [None]:
df.head()
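
One thing to keep in mind when reading the encoded values: `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the codes for binned labels such as 'Low' and 'High' do not follow their natural ranking. The sketch below shows the codes it assigns to the four income labels, next to an explicit ordered mapping as a hypothetical alternative (the names `alphabetical_codes` and `ordered_codes` are illustrative only).

```python
from sklearn.preprocessing import LabelEncoder

levels = ['Low', 'Average', 'High', 'Very high']

# LabelEncoder sorts the classes, so codes follow alphabetical order
encoder = LabelEncoder().fit(levels)
alphabetical_codes = dict(zip(levels, encoder.transform(levels).tolist()))

# Hypothetical alternative: an explicit mapping that keeps the ranking
ordered_codes = {level: rank for rank, level in enumerate(levels)}

print(alphabetical_codes)
print(ordered_codes)
```

Tree-based models are largely indifferent to this, but linear models and correlation coefficients treat the codes as magnitudes, so the ordering can matter there.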

In [None]:
##Exploratory Data Analysis
plt.figure(figsize =(12,6));
sns.countplot(x = 'Loan_Status', data = df);
plt.xlabel("Loan_Status",fontsize = 12);
plt.ylabel("Frequency",fontsize = 12);
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Gender And Loan_Status")
sns.barplot(x = "Gender", y = "Loan_Status", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Married And Loan_Status")
sns.barplot(x = "Married", y = "Loan_Status", data = df)
plt.show()

We can see that married applicants apply for loans more often and are approved at a higher rate than unmarried applicants.
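
The bar plots in this section use seaborn's `barplot`, which draws the mean of `y` for each `x` category by default. With Loan_Status encoded as 0/1, each bar height is therefore the share of approved loans in that group. A minimal sketch of the same number computed directly, on a made-up frame (hypothetical values, not the dataset):

```python
import pandas as pd

# Toy data: 1 = married / approved, 0 = not married / rejected
toy = pd.DataFrame({
    'Married':     [1, 1, 1, 1, 0, 0, 0, 0],
    'Loan_Status': [1, 1, 1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group = approval rate per group
approval_rate = toy.groupby('Married')['Loan_Status'].mean()
print(approval_rate)
```

Printing the group sizes alongside (for example with `value_counts()`) helps separate "applies more often" from "is approved more often".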

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Self_Employed And Loan_Status")
sns.barplot(x = "Self_Employed", y = "Loan_Status", data = df)
plt.show()

We can see that most loan applicants are not self-employed, and the approval rate is also lower for them.



In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Credit_History And Loan_Status")
sns.barplot(x = "Credit_History", y = "Loan_Status", data = df)
plt.show()

In [None]:
plt.figure(1)
plt.subplot(221)
df['Gender'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Gender')

plt.subplot(222)
df['Married'].value_counts(normalize=True).plot.bar(title= 'Married')

plt.subplot(223)
df['Self_Employed'].value_counts(normalize=True).plot.bar(title= 'Self_Employed')

plt.subplot(224)
df['Credit_History'].value_counts(normalize=True).plot.bar(title= 'Credit_History')

plt.show()


In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Dependents And Loan_Status")
sns.barplot(x = "Dependents", y = "Loan_Status", data = df)
plt.show()


In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Property_Area And Loan_Status")
sns.barplot(x = "Property_Area", y = "Loan_Status", data = df)
plt.show()

In [None]:
plt.figure(1)
plt.subplot(131)
df['Dependents'].value_counts(normalize=True).plot.bar(figsize=(24,6), title= 'Dependents')

plt.subplot(132)
df['Education'].value_counts(normalize=True).plot.bar(title= 'Education')

plt.subplot(133)
df['Property_Area'].value_counts(normalize=True).plot.bar(title= 'Property_Area')

plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between ApplicantIncome And Loan_Status")
sns.barplot(x = "ApplicantIncome", y = "Loan_Status", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between CoapplicantIncome And Loan_Status")
sns.barplot(x = "CoapplicantIncome", y = "Loan_Status", data = df)
plt.show()


In [None]:
plt.figure(1)
plt.subplot(121)
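# distplot is deprecated in seaborn; histplot with kde=True draws the same histogram plus density curve
sns.histplot(df['ApplicantIncome'], kde=True)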

plt.subplot(122)
df['ApplicantIncome'].plot.box(figsize=(16,5))

plt.show()

In [None]:
df['Education'].value_counts().plot.bar()

In [None]:
df.hist(figsize=(15,30),edgecolor='red',layout=(5,3),bins=15,legend=True)
plt.show()

In [None]:
sns.pairplot(df)

In [None]:
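##Correlation
# numeric_only=True skips the text column Loan_ID, which has not been dropped yet
df.corr(numeric_only=True)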

In [None]:
# Correlation with the target column, Loan_Status

df.corr(numeric_only=True)['Loan_Status'].sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, linecolor="black", fmt='.2f')


In [None]:
# Dropping the irrelevant columns..

df.drop(columns=["Loan_ID"], inplace=True)

In [None]:
##Descriptive Statistics
df.describe()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(round(df.describe()[1:].transpose(),2), annot=True, linewidths=0.5,linecolor="black", fmt='f')


In [None]:
df.info()

In [None]:
##Checking Data To Remove Skewness
my_column1 = df.pop('Loan_Status')
df.insert(12,'Loan_Status', my_column1) 


df.head()


In [None]:
df.iloc[:,:-1].skew()

In [None]:
# We can see that the data is a little skewed.

# Using the Yeo-Johnson method to reduce skewness

from sklearn.preprocessing import power_transform

features = df.columns[:-1]  # every column except the target, Loan_Status (last column)
df[features] = power_transform(df[features], method='yeo-johnson')
df[features].skew()
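
To see what the Yeo-Johnson step does in isolation, the sketch below applies `power_transform` to a synthetic right-skewed sample and compares the skewness before and after. The data is generated for illustration (a log-normal sample with a fixed seed), not taken from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import power_transform

# Synthetic right-skewed sample (hypothetical data)
rng = np.random.default_rng(0)
skewed = pd.DataFrame({'value': rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# power_transform fits one lambda per column and, by default, also standardizes the result
transformed = pd.DataFrame(
    power_transform(skewed, method='yeo-johnson'),
    columns=skewed.columns,
)

skew_before = skewed['value'].skew()
skew_after = transformed['value'].skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

Because the default output is already standardized to zero mean and unit variance, a later `StandardScaler` pass changes these columns very little.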

In [None]:
# Outliers Checking
import warnings
warnings.filterwarnings('ignore')
df.plot(kind='box',subplots=True, layout=(3,5), figsize=[20,8])


# Outlier-removal options: IQR proximity rule or Z-score technique (the Z-score technique is used below)

In [None]:
from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
z.shape

In [None]:
threshold = 3
print(np.where(z > threshold))

In [None]:
len(np.where(z>3)[0])

In [None]:
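# Preview of the frame without the flagged rows; the result is not assigned, so df itself is unchanged
df.drop([68, 242, 262, 313, 495, 497, 546, 575, 585], axis=0)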


In [None]:
df=df[(z<3).all(axis=1)]
df.shape
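
The Z-score filter used here can be written out by hand to make the rule explicit: each value is centred on its column mean, divided by the column standard deviation, and a row is kept only if every column stays below the threshold. The sketch below uses a made-up single-column frame with one extreme value; `ddof=0` matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): 20 ordinary amounts and one extreme one
toy = pd.DataFrame({'amount': [9.0, 10.0, 11.0, 10.0] * 5 + [1000.0]})

# Absolute z-score per cell, using the population standard deviation (ddof=0)
z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

threshold = 3
kept = toy[(z < threshold).all(axis=1)]  # keep rows where every column is within the threshold

print(toy.shape, "->", kept.shape)
```

Note that extreme values inflate the mean and standard deviation they are measured against, so on small samples a Z-score rule can miss outliers that an IQR-based rule would flag.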

In [None]:
#Feature Engineering ( VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
df.corr()

In [None]:
plt.figure(figsize=(25,22))
sns.heatmap(df.corr(),linewidths=.1,vmin=-1, vmax=1, fmt='.2g', annot = True, linecolor="black",annot_kws={'size':15},cmap="YlGnBu")
plt.yticks(rotation=0)

In [None]:
x=df.drop('Loan_Status',axis=1)
y=df['Loan_Status']

In [None]:
x

In [None]:
y

In [None]:
def vif_calc():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x.values,i) for i in range(x.shape[1])]
  vif["features"]=x.columns
  print(vif)
    

In [None]:
vif_calc()
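
For reference, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with plain NumPy on synthetic data; the helper name `vif_by_regression` is hypothetical, and unlike the statsmodels function, which regresses on exactly the columns it is given, this version always adds an intercept.

```python
import numpy as np

def vif_by_regression(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: b is nearly a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)

vifs = vif_by_regression(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity; here the two near-duplicate columns land far above that range while the independent one stays close to 1.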

In [None]:
#Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
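# Keep the original row index so x stays aligned with y after the outlier rows were removed
x=pd.DataFrame(sc.fit_transform(x), columns=x.columns, index=x.index)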
x

##MODELLING
Building a classification model, as the target column has only two outputs.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['Loan_Status'].value_counts())  
plt.figure(figsize=(5,5))
sns.countplot(x=df['Loan_Status'])
plt.show()

In [None]:
#OverSampling
from imblearn.over_sampling import SMOTE
sm = SMOTE()
x, y = sm.fit_resample(x,y)
y.value_counts()

In [None]:
# Getting the best random state
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

maxAccu=0
maxRS=0

for i in range(1,200):
    x_train,x_test, y_train, y_test=train_test_split(x,y,test_size=.30, random_state=i)
    rfc=RandomForestClassifier()
    rfc.fit(x_train,y_train)
    pred=rfc.predict(x_test)
    acc=accuracy_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Best accuracy is ",maxAccu*100," on Random_state ",maxRS)


In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.30,random_state=maxRS)


In [None]:
#Logistic Regression
# Checking Accuracy for Logistic Regression
log = LogisticRegression()
log.fit(x_train,y_train)

#Prediction
predlog = log.predict(x_test)

print(accuracy_score(y_test, predlog)*100)
print(confusion_matrix(y_test, predlog))
print(classification_report(y_test,predlog))

In [None]:
# Plotting Confusion_Matrix
cm = confusion_matrix(y_test,predlog)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for LogisticRegression')
plt.show()
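
The figures in the classification report can be read straight off a confusion matrix laid out this way (rows are true labels, columns are predicted labels). A minimal sketch on made-up labels, with 1 standing for an approved loan:

```python
import numpy as np

# Toy labels (hypothetical): 1 = approved, 0 = rejected
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # rejected, predicted rejected
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # rejected, predicted approved
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # approved, predicted rejected
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # approved, predicted approved

cm = np.array([[tn, fp],
               [fn, tp]])  # same layout as sklearn's confusion_matrix

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted approvals, how many were correct
recall = tp / (tp + fn)     # of the true approvals, how many were found

print(cm)
print(accuracy, precision, recall)
```

For loan approval, a false positive (approving a loan that should be rejected) and a false negative usually carry different costs, so precision and recall are worth reading alongside accuracy.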

In [None]:
##Random Forest Classifier
# Checking accuracy for Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

# Prediction
predrf = rf.predict(x_test)

print(accuracy_score(y_test, predrf)*100)
print(confusion_matrix(y_test, predrf))
print(classification_report(y_test,predrf))

In [None]:
# Let's plot the confusion matrix for RandomForestClassifier
cm = confusion_matrix(y_test,predrf)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=0.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for RandomForestClassifier')
plt.show()

In [None]:
##Decision Tree Classifier
# Checking Accuracy for Decision Tree Classifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

#Prediction
preddtc = dtc.predict(x_test)

print(accuracy_score(y_test, preddtc)*100)
print(confusion_matrix(y_test, preddtc))
print(classification_report(y_test,preddtc))

In [None]:
# Let's plot the confusion matrix for DTC
cm = confusion_matrix(y_test,preddtc)

x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]

f , ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Decision Tree Classifier')
plt.show()

In [None]:
#Cross Validation Score
#cv score for Logistic Regression
print(cross_val_score(log,x,y,cv=5).mean()*100)

# cv score for Random Forest Classifier
print(cross_val_score(rf,x,y,cv=5).mean()*100)

# cv score for Decision Tree Classifier
print(cross_val_score(dtc,x,y,cv=5).mean()*100)





In [None]:
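#HyperParameter Tuning for the model with best score
#Random Forest Classifier

parameters = {'criterion':['gini'],
             'max_features':['sqrt'],    # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
             'n_estimators':[100,200],   # must be at least 1; 100 (the library default) replaces the invalid 0
             'max_depth':[2,3,4,5,6,8]}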

In [None]:
GCV=GridSearchCV(RandomForestClassifier(),parameters,cv=5)
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_

In [None]:
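##Plotting ROC and compare AUC for the final model
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
Loan = GCV.best_estimator_  # final model: the tuned Random Forest refit by GridSearchCV
RocCurveDisplay.from_estimator(Loan, x_test, y_test)
plt.title("ROC AUC Plot")
plt.show()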
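
The numbers behind a ROC plot can also be obtained without drawing anything: `roc_curve` returns the false-positive and true-positive rates at each score threshold, and `roc_auc_score` gives the area under that curve. A small sketch on four made-up scores (hypothetical values, not model output from this notebook):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted scores (hypothetical)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold on the predicted score
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc_value = roc_auc_score(y_true, scores)

print(fpr, tpr)
print(auc_value)
```

For a fitted classifier, the scores would typically come from `predict_proba(x_test)[:, 1]`.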

Conclusion:
The accuracy score for Loan_Status is 92 %

In [None]:
# Saving the model
import joblib
Loan = GCV.best_estimator_  # tuned Random Forest from the grid search
joblib.dump(Loan,"Census_Income.pkl")
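
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows the round trip on a tiny stand-in model; the toy data, the `LogisticRegression` estimator and the file name `toy_model.pkl` are assumptions for illustration, not the model trained in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (hypothetical data)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

New applicants have to pass through the same binning, encoding, power transform and scaling as the training data before being given to the loaded model, so those fitted steps need to be saved or reproduced as well.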