# Problem Description

A company wants to automate its loan eligibility process in real time, based on the details customers provide when filling in an online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target those customers. A partial data set has been provided.


# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.metrics import accuracy_score 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings('ignore')

# Importing data 

In [None]:
train = pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')
test = pd.read_csv('../input/loan-prediction-problem-dataset/test_Y3wMUE5_7gLdaTN.csv')

In [None]:
print (train.shape, test.shape)

# EDA & Data-Preprocessing

### Viewing the train dataset

In [None]:
train.head() 

In [None]:
train.info() 

### Viewing the number of null values in each feature of the train dataset

In [None]:
train.isnull().sum()

### Viewing the test dataset

In [None]:
test.head()

In [None]:
test.info()

### Viewing the number of null values in each feature of the test dataset

In [None]:
test.isnull().sum()

### Counting the frequency of categories for each categorical feature

In [None]:
# Identify the categorical (object-typed) columns, excluding the ID column
categorical_columns = [col for col in train.columns
                       if train[col].dtype == 'object' and col != 'Loan_ID']

# Print the frequency of each category in the train dataset
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(train[col].value_counts())

#### Plotting No. of Males vs No. of Females

In [None]:
sns.countplot(x=train['Gender'])

#### Approval of loans between Males & Females

In [None]:
pd.crosstab(train.Gender, train.Loan_Status, margins = True)

> Male applicants greatly outnumber female applicants.

### Filling NaN values of Gender and converting the categorical variable (Male, Female) to a numerical variable (1, 0)

In [None]:
# mode() returns a Series, so take its first element to get a scalar fill value
train.Gender = train.Gender.fillna(train.Gender.mode()[0])
test.Gender = test.Gender.fillna(test.Gender.mode()[0])

sex = pd.get_dummies(train['Gender'] , drop_first = True )
train.drop(['Gender'], axis = 1 , inplace =True)
train = pd.concat([train , sex ] , axis = 1)

sex = pd.get_dummies(test['Gender'] , drop_first = True )
test.drop(['Gender'], axis = 1 , inplace =True)
test = pd.concat([test , sex ] , axis = 1)
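
When imputing with the mode, it is worth checking what is actually passed to `fillna`: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on index labels. The sketch below illustrates the difference on a toy column; the data are hypothetical and not taken from the loan dataset.

```python
import pandas as pd

# Toy column with two missing entries (hypothetical data, not the loan dataset).
s = pd.Series(["Male", None, "Female", "Male", None])

# mode() returns a Series (here a single row labelled 0), not a scalar.
mode_series = s.mode()

# Passing that Series to fillna aligns on index labels, so only a gap at
# label 0 could be filled; the gaps at labels 1 and 4 remain.
aligned = s.fillna(mode_series)

# Taking the first element gives a scalar that fills every gap.
filled = s.fillna(mode_series[0])

print(aligned.isna().sum(), filled.isna().sum())
```

Indexing the mode with `[0]` picks the first of possibly several tied modes, which is a simplifying assumption.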

### Plotting No. of Dependents (Size of family) in each household

In [None]:
plt.figure(figsize=(6,6))
labels = ['0' , '1', '2' , '3+']
explode = (0.05, 0, 0, 0)
# Take the slice sizes from the data instead of hard-coding them
size = train.Dependents.value_counts().reindex(labels)

plt.pie(size, explode=explode, labels=labels,
        autopct='%1.1f%%', shadow = True, startangle = 90)
plt.axis('equal')
plt.show()

In [None]:
train.Dependents.value_counts()

### Approval of loans between different sizes of families

In [None]:
pd.crosstab(train.Dependents , train.Loan_Status, margins = True)

> Applicants with the highest number of dependents are the fewest, whereas applicants with no dependents are the most numerous.

### Filling NaN values of Dependents and converting the categorical values (0, 1, 2, 3+) to numerical values (0, 1, 2, 3)

In [None]:
train.Dependents = train.Dependents.fillna("0")
test.Dependents = test.Dependents.fillna("0")

rpl = {'0':'0', '1':'1', '2':'2', '3+':'3'}

train.Dependents = train.Dependents.replace(rpl).astype(int)
test.Dependents = test.Dependents.replace(rpl).astype(int)

### Plotting No. of people with vs without Credit History

In [None]:
sns.countplot(x=train['Credit_History'])

### Approval of loans for people who have a Credit History and those who don't

In [None]:
pd.crosstab(train.Credit_History , train.Loan_Status, margins = True)

### Filling NaN values of Credit History with the mode

In [None]:
train.Credit_History = train.Credit_History.fillna(train.Credit_History.mode()[0])
test.Credit_History  = test.Credit_History.fillna(test.Credit_History.mode()[0])

### Plotting No. of people who are Self-employed vs who aren't

In [None]:
sns.countplot(x=train['Self_Employed'])

### Approval of loans between people who are Self-employed & who aren't

In [None]:
pd.crosstab(train.Self_Employed , train.Loan_Status,margins = True)

### Filling NaN values and converting the categorical variable (Yes, No) to a numerical variable (1, 0)

In [None]:
# mode() returns a Series, so take its first element to get a scalar fill value
train.Self_Employed = train.Self_Employed.fillna(train.Self_Employed.mode()[0])
test.Self_Employed = test.Self_Employed.fillna(test.Self_Employed.mode()[0])

self_Employed = pd.get_dummies(train['Self_Employed'] ,prefix = 'employed' ,drop_first = True )
train.drop(['Self_Employed'], axis = 1 , inplace =True)
train = pd.concat([train , self_Employed ] , axis = 1)

self_Employed = pd.get_dummies(test['Self_Employed'] , prefix = 'employed' ,drop_first = True )
test.drop(['Self_Employed'], axis = 1 , inplace =True)
test = pd.concat([test , self_Employed ] , axis = 1)

### Plotting No. of Married people vs Unmarried people

In [None]:
sns.countplot(x=train.Married)

### Approval of loans between Married and Unmarried people

In [None]:
pd.crosstab(train.Married , train.Loan_Status,margins = True)

### Filling NaN values and converting the categorical variable (Yes, No) to a numerical variable (1, 0)

In [None]:
# mode() returns a Series, so take its first element to get a scalar fill value
train.Married = train.Married.fillna(train.Married.mode()[0])
test.Married = test.Married.fillna(test.Married.mode()[0])

married = pd.get_dummies(train['Married'] , prefix = 'married',drop_first = True )
train.drop(['Married'], axis = 1 , inplace =True)
train = pd.concat([train , married ] , axis = 1)

married = pd.get_dummies(test['Married'] , prefix = 'married', drop_first = True )
test.drop(['Married'], axis = 1 , inplace =True)
test = pd.concat([test , married ] , axis = 1)

### Dropping Loan Amount Term and filling NaN values of Loan Amount with the mean

In [None]:
train.drop(['Loan_Amount_Term'], axis = 1 , inplace =True)
test.drop(['Loan_Amount_Term'], axis = 1 , inplace =True)

train.LoanAmount = train.LoanAmount.fillna(train.LoanAmount.mean()).astype(int)
test.LoanAmount = test.LoanAmount.fillna(test.LoanAmount.mean()).astype(int)
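
The cell above fills each dataset with its own mean. An alternative, sketched here as an assumption rather than as part of this notebook's pipeline, is to learn the fill value on the training data only and reuse it for the test data, so both sets are imputed consistently. The toy frames `train_toy` and `test_toy` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the LoanAmount column of each dataset.
train_toy = pd.DataFrame({"LoanAmount": [100.0, 200.0, np.nan, 300.0]})
test_toy = pd.DataFrame({"LoanAmount": [np.nan, 150.0]})

# Learn the mean on the training column only.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_toy[["LoanAmount"]])

# Reuse the learned training mean to fill the gaps in the test column.
test_filled = imputer.transform(test_toy[["LoanAmount"]])

print(imputer.statistics_)
```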

In [None]:
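# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(x=train['LoanAmount'], kde=True)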

> We observe no outliers in the continuous variable Loan Amount

### Plotting Graduates vs Non-Graduates

In [None]:
sns.countplot(x=train.Education)

### Converting categorical variables to numerical variables

In [None]:
train['Education'] = train['Education'].map( {'Graduate': 0, 'Not Graduate': 1} ).astype(int)
test['Education'] = test['Education'].map( {'Graduate': 0, 'Not Graduate': 1} ).astype(int)

### Property Area

In [None]:
sns.countplot(x=train.Property_Area)

### Converting categorical variables to numerical variables

In [None]:
train['Property_Area'] = train['Property_Area'].map( {'Urban': 0, 'Semiurban': 1 ,'Rural': 2  } ).astype(int)

test.Property_Area = test.Property_Area.fillna(test.Property_Area.mode()[0])
test['Property_Area'] = test['Property_Area'].map( {'Urban': 0, 'Semiurban': 1 ,'Rural': 2  } ).astype(int)


### Plotting Co-Applicant income and Applicant income

In [None]:
sns.histplot(x=train['ApplicantIncome'], kde=True)

In [None]:
sns.histplot(x=train['CoapplicantIncome'], kde=True)

### Target Variable: Loan Status (converting the categorical variable to a numerical variable)

In [None]:
train['Loan_Status'] = train['Loan_Status'].map( {'N': 0, 'Y': 1 } ).astype(int)

### Dropping the ID column

In [None]:
train.drop(['Loan_ID'], axis = 1 , inplace =True)

### Viewing the Datasets

In [None]:
train.head()

In [None]:
test.head()

# Visualizing the Correlations and Relations

### Plot between LoanAmount, Applicant Income, Employment and Gender

*How do the loans taken by men and by women compare?<br> Were the employed applicants greater in number among those taking loans?<br> What is the distribution of Loan Amount and Income?*

In [None]:
# `height` replaces the deprecated `size` argument; two colours cover the two hue levels
g = sns.lmplot(x='ApplicantIncome', y='LoanAmount', data=train, col='employed_Yes', hue='Male',
               palette=["Red", "Blue"], aspect=1.2, height=6)
g.set(ylim=(0, 800))
# Relation between male or female applicants' income, loan taken and self-employment.

### The above graph tells us:
- Male applicants take larger loan amounts than female applicants.
- Males are more numerous in the "NOT self-employed" category.
- The larger loan amounts still fall within the income range 0 to 20000.
- The majority of applicants are NOT self-employed.
- The highest loan amount, about 700, was taken by a female applicant who is NOT self-employed.
- The majority of loan amounts are about 0-200, with incomes in the range 0-20000.
- The fitted lines show that the loan amount increases with income. For women the slope is almost the same in both panels; for men it is slightly lower in the self-employed category than in the non-self-employed one.


### Boxplots for the relation between Property Area, Loan Amount and Education

- Property_Area: 
    - `Urban      :0`
    - `Semiurban  :1`
    - `Rural      :2`

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x="Property_Area", y="LoanAmount", hue="Education",data=train, palette="coolwarm")

### The above boxplot shows that:
- In the Urban area, non-graduates take slightly larger loans than graduates.
- In the Rural and Semiurban areas, graduates take larger loans than non-graduates.
- The higher loan values are mostly from the Urban area.
- The Semiurban and Rural areas each have one unusual loan amount close to zero.


### Relation between Credit History and Loan status.

In [None]:
train.Credit_History.value_counts()

In [None]:
lc = pd.crosstab(train['Credit_History'], train['Loan_Status'])
lc.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)

- The Credit History vs Loan Status plot indicates:
    - Applicants with a good credit history have a better chance of getting a loan.
    - With a better credit history, the loan amount granted was greater too.
    - But many were not given a loan in the range 0-100.
    - Applicants with a poor credit history were handled in the range 0-100 only.

In [None]:
plt.figure(figsize=(9,6))
sns.heatmap(train.drop('Loan_Status',axis=1).corr(), vmax=0.6, square=True, annot=True)

# Modelling

This is a **classification** problem, as the data and visualisations show: the target, Loan Status, takes one of two values.

In [None]:
X = train.drop('Loan_Status' , axis = 1 )
y = train['Loan_Status']

X_train ,X_test , y_train , y_test = train_test_split(X , y , test_size = 0.3 , random_state =102)
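
A single 70/30 split gives one accuracy estimate that depends on `random_state`. As a complement, a stratified k-fold cross-validation yields several estimates whose spread can be inspected. The sketch below runs on synthetic data from `make_classification` (a hypothetical stand-in for `X` and `y`), and `max_iter=1000` is an illustrative setting, not one used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y, for illustration only.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=102)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=102)

# One accuracy value per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```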

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train , y_train)
pred_l = logmodel.predict(X_test)
acc_l = accuracy_score(y_test , pred_l)*100
acc_l

### Random Forest

In [None]:

random_forest = RandomForestClassifier(n_estimators= 100)
random_forest.fit(X_train, y_train)
pred_rf = random_forest.predict(X_test)
acc_rf = accuracy_score(y_test , pred_rf)*100
acc_rf

### K-Nearest Neighbors

In [None]:

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)
acc_knn = accuracy_score(y_test , pred_knn)*100
acc_knn

### Naive Bayes

In [None]:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
pred_gb = gaussian.predict(X_test)
acc_gb = accuracy_score(y_test , pred_gb)*100
acc_gb

### SVM

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
pred_svm = svc.predict(X_test)
acc_svm = accuracy_score(y_test , pred_svm)*100
acc_svm
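
KNN and SVM both depend on distances between samples, so a feature on a large scale (such as income) can dominate 0/1 indicator columns. A minimal sketch of standardising the features inside a pipeline follows. It uses synthetic data with one deliberately inflated feature, and it is an optional variation rather than a step this notebook performs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic an income-like scale.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=102)

# The scaler is fitted on the training part only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

preds = scaled_svc.predict(X_te)
acc_scaled = scaled_svc.score(X_te, y_te) * 100
print(acc_scaled)
```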

### Gradient Boosting Classifier

In [None]:
gbk = GradientBoostingClassifier()
gbk.fit(X_train, y_train)
pred_gbc = gbk.predict(X_test)
acc_gbc = accuracy_score(y_test , pred_gbc)*100
acc_gbc

In [None]:
## Arranging the Accuracy results
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'K-Nearest Neighbors',
              'Naive Bayes', 'SVM', 'Gradient Boosting Classifier'],
    'Score': [acc_l , acc_rf , acc_knn , acc_gb ,acc_svm ,acc_gbc ]})
models.sort_values(by='Score', ascending=False)

## The highest classification accuracy is shown by Logistic Regression => 83.78%
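
Accuracy alone does not show which class the errors fall on. A confusion matrix separates wrongly approved from wrongly rejected cases. The sketch below uses small hand-made label arrays (hypothetical, not the notebook's predictions) with 1 meaning approved and 0 meaning rejected.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration: 1 = approved, 0 = rejected.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)

# Share of truly rejected cases that the model also rejected.
rejected_recall = cm[0, 0] / cm[0].sum()

print(cm)
print(acc, rejected_recall)
```

In this toy case the accuracy is 75%, yet only one of the three true rejections is caught, which the single accuracy figure hides.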

### Let us Check the feature importance

In [None]:
importances = pd.DataFrame({'Features':X_train.columns,'Importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('Importance',ascending=False).set_index('Features')
importances.head(11) 

In [None]:
importances.plot.bar()

> Credit History has the maximum importance and Employment has the least!
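
The importances above are the random forest's impurity-based scores, computed from the training data. One way to cross-check such a ranking is permutation importance on held-out data, which measures how much the score drops when a feature is shuffled. The sketch below is an assumption-laden illustration on synthetic data, not a result for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=102)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out part and record the accuracy drop.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```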

# Conclusion

Loan status is most closely related to features such as Credit History, Applicant Income, Loan Amount, family status (Dependents) and Property Area, which are the factors loan providers generally consider. These factors are therefore used to decide whether or not to approve a loan. The analysis shows how the features relate to one another in past decisions, and a model trained on those decisions can predict the class of unseen data.

# Submission
## Finally, we predict on the unseen dataset using the Logistic Regression and Random Forest models (**Ensemble Learning**)

In [None]:
df_test = test.drop(['Loan_ID'], axis = 1)

In [None]:
df_test.head()

In [None]:
p_log = logmodel.predict(df_test)

In [None]:
p_rf = random_forest.predict(df_test)

In [None]:
# Approve (1) only when both models predict approval, i.e. the two predictions sum to 2
predict_combine = ((p_log + p_rf) >= 2).astype('int')
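
The rule above approves a loan only when both models predict 1, so either model can veto an approval. A common alternative, which this notebook does not use, is soft voting, where the predicted class probabilities of the models are averaged. A minimal sketch with scikit-learn's `VotingClassifier` on synthetic data follows; the data and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=102)

# Soft voting averages the class probabilities of the two models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)

ensemble_pred = ensemble.predict(X_te)
print(ensemble.score(X_te, y_te))
```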

In [None]:
submission = pd.DataFrame({
        "Loan_ID": test["Loan_ID"],
        "Loan_Status": predict_combine
    })

submission.to_csv("results.csv", encoding='utf-8', index=False)
