This is a classification problem: predict whether a loan application should be approved.

Workflow:
    
    1. Importing libraries
    2. Loading data
    3. Summarising data
    4. Filling missing values if any, for both categorical and numerical
       * Categorical: Mode
       * Numerical: Median/Mean/Bfill
    5. Exploratory data analysis
       * Data visualisation
       * Normalisation of data if any outliers found
    6. Conversion of object data type to numbers
    7. Correlation between different variables. (Drop any independent variable that is not correlated with the dependent variable.)
    8. Evaluating different models on different metrics (cross-validated accuracy, precision, recall,
       F1 score, ROC curve/AUC, confusion matrix)
       * Random Forest Classifier
       * Extra Trees Classifier
       * SVC
       * Logistic Regression
       * K-Neighbors Classifier
       * Decision Tree Classifier
    9. Tuning the hyperparameters of the best models.
    10. Feature importance

##### Importing initial libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')

In [None]:
df.head().T

##### Statistical Overview

In [None]:
df.describe()

##### Data summary

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
#  Check whether there are any duplicate rows.

df.duplicated().any()

In [None]:
#  Drop the Loan_ID column: it is only an identifier and is not needed for now.

df = df.drop("Loan_ID", axis=1)
df.head().T

In [None]:
# Split the columns into categorical (object dtype) and numerical.

cat_data = []
num_data = []

# The loop variable is named `dtype` so that it does not shadow the built-in `type`.
for index, dtype in enumerate(df.dtypes):
    if dtype == "object":
        cat_data.append(df.iloc[:, index])
    else:
        num_data.append(df.iloc[:, index])

In [None]:
cat_data = pd.DataFrame(cat_data).transpose()
cat_data.head()

In [None]:
num_data = pd.DataFrame(num_data).transpose()
num_data.head()

##### A closer look at the categorical data

In [None]:
cat_data.isna().sum()

In [None]:
# Filling missing values with mode.

cat_data = cat_data.apply(lambda x: x.fillna(x.value_counts().index[0]))
cat_data.isna().sum()
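
The mode fill above can be checked on a small example. This is a minimal sketch on a hypothetical toy frame (`toy_cat` is not part of the loan data). It uses `Series.mode()`, which should give the same result as `value_counts().index[0]` when a column has a single most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_cat = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Married": ["Yes", np.nan, "Yes", "No"],
})

# Fill each column's gaps with that column's most frequent value.
toy_filled = toy_cat.apply(lambda col: col.fillna(col.mode().iloc[0]))
print(toy_filled)
```

When two values tie for most frequent, `mode()` returns both in sorted order, so `.iloc[0]` picks the first alphabetically.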

In [None]:
#  Filling missing values in the numerical data.
#  Firstly, let's check the stats.
num_data.describe()

In [None]:
num_data.isna().sum()

In [None]:
#  LoanAmount's mean and median differ noticeably, which suggests a skewed distribution, so fill its missing values with the median.

num_data.LoanAmount = num_data.LoanAmount.fillna(num_data["LoanAmount"].median())
num_data.LoanAmount.isna().sum()

In [None]:
# Fill the remaining missing values with the next valid value in the respective column (backward fill).

num_data.Loan_Amount_Term = num_data.Loan_Amount_Term.bfill()
num_data.Credit_History = num_data.Credit_History.bfill()

In [None]:
# Recheck whether any missing values remain.
num_data.isna().sum()
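
One caveat with backward fill is worth checking: a missing value in the last row has no later value to copy, so it stays missing. The sketch below shows this on a hypothetical series (`term` is made up for illustration) and one possible remedy, a forward fill applied afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose last value is missing.
term = pd.Series([360.0, np.nan, 180.0, np.nan])

back_filled = term.bfill()          # the trailing NaN has no later value to copy
both_filled = term.bfill().ffill()  # a forward fill closes any remaining gap

print(back_filled.isna().sum(), both_filled.isna().sum())
```

If the recheck above reports zero missing values, the extra forward fill is not needed for this dataset.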

##### Exploratory Data Analysis

* Distribution plots for numerical data

In [None]:
sns.histplot(num_data.ApplicantIncome, kde=True);  # distplot is deprecated in recent seaborn releases

In [None]:
sns.histplot(num_data.CoapplicantIncome, kde=True);

In [None]:
sns.histplot(num_data.LoanAmount, kde=True);

In [None]:
#  The distributions above are right-skewed rather than Gaussian, so apply a log transform to reduce the skew and dampen the effect of outliers.

num_data.ApplicantIncome = np.log(num_data.ApplicantIncome)
num_data.CoapplicantIncome = np.log(num_data.CoapplicantIncome + 1)  # Add 1 because some values are zero and log(0) is -infinity.
num_data.LoanAmount = np.log(num_data.LoanAmount)
num_data.Loan_Amount_Term = np.log(num_data.Loan_Amount_Term)
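
The effect of the log transform can be quantified with the sample skewness instead of judging it from plots alone. This is a small sketch on hypothetical income figures (`income` is invented for illustration). It uses `np.log1p`, which computes log(1 + x) and is therefore equivalent to the `+ 1` adjustment used above.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes with one very large value.
income = pd.Series([150, 1500, 2000, 2500, 3000, 4000, 81000], dtype=float)

log_income = np.log1p(income)  # log(1 + x), also defined at x = 0

# Skewness close to 0 indicates a roughly symmetric distribution.
print(round(income.skew(), 2), round(log_income.skew(), 2))
```

The same comparison could be run on `num_data` before and after the transformation.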

In [None]:
#  Check the plots again.

sns.histplot(num_data.ApplicantIncome, kde=True);

In [None]:
sns.histplot(num_data.CoapplicantIncome, kde=True);

In [None]:
sns.histplot(num_data.LoanAmount, kde=True);

* Categorical data visualisation

In [None]:
#  Married v/s Loan status
sns.countplot(x="Married", hue="Loan_Status", data=cat_data);

#  Married applicants appear more likely to have their loan approved.

In [None]:
#  Gender v/s Loan status
sns.countplot(x="Gender", hue="Loan_Status", data=cat_data);

#  Most male applicants had their loans approved.

In [None]:
#  Dependents v/s Loan status
sns.countplot(x="Dependents", hue="Loan_Status", data=cat_data);

#  Applicants with no dependents appear more likely to have their loan approved.

In [None]:
#  Education v/s Loan status
sns.countplot(x="Education", hue="Loan_Status", data=cat_data);

#  Graduates appear to be preferred over non-graduates.


In [None]:
#  Concatenating the updated categorical and numerical data

data = pd.concat([num_data, cat_data], axis=1)
data.head().T

##### Converting categorical data to numbers.

In [None]:
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()

cols = ['Gender', 'Married', 'Education', 
        'Self_Employed', 'Property_Area', 
        'Loan_Status', 'Dependents']
for col in cols:
    data[col] = LE.fit_transform(data[col])

data.head().T
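
Because a single `LabelEncoder` is refitted for every column in the loop above, only the mapping for the last column is retained afterwards. The sketch below shows how a mapping can be read from `classes_` right after fitting, using a hypothetical `status` series as a stand-in for `Loan_Status`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column, for illustration only.
status = pd.Series(["Y", "N", "Y", "Y"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(status)

# classes_ is sorted, so the label at position i in classes_ is encoded as i.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Here "N" maps to 0 and "Y" to 1, so class 1 is the approved class when reading precision and recall later.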

###### Visualising Correlation

In [None]:
corr = data.corr()
plt.figure(figsize=(15, 12))
sns.heatmap(corr, annot=True);

##### Interpretation:
    * The dependent variable Loan_Status is only weakly correlated with Loan_Amount_Term and Self_Employed,
    so both are dropped before modelling.
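
Reading the weakest correlations off a heatmap can be error-prone, so a ranked list may help. The following is a sketch on synthetic data (the columns `signal`, `noise` and `target` are hypothetical). With the notebook's frame, the equivalent expression would start from `data.corr()["Loan_Status"]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: `signal` drives the target, `noise` does not.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.1 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({"signal": signal, "noise": noise, "target": target})

# Absolute correlation of every feature with the target, strongest first.
ranking = toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

Pearson correlation only measures linear association, so a weakly correlated feature can still be useful to a non-linear model such as a random forest.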
    

###### Modelling


In [None]:
#  Splitting into x and y

x = data.drop(["Loan_Amount_Term", "Self_Employed", "Loan_Status"], axis=1)
y = data["Loan_Status"]

In [None]:
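#  Splitting into train and test
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so that the split and the scores below are reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=45)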
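
When the classes are imbalanced, a plain random split can give the train and test sets slightly different class ratios. A possible refinement, sketched here on hypothetical labels (`X_toy` and `y_toy` are made up), is to pass `stratify` so that both parts keep the overall ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 approvals, 30 rejections.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=45
)

# Share of approvals in each part; both should be close to 0.7.
print(y_tr.mean(), y_te.mean())
```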

In [None]:
#  Put models in a dictionary

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
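# plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# RocCurveDisplay and ConfusionMatrixDisplay (scikit-learn >= 1.0) replace them.
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay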



models = {'RandomForestClassifier': RandomForestClassifier(),
          'ExtraTreesClassifier': ExtraTreesClassifier(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'LogisticRegression': LogisticRegression(),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC()
         }

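def Fit_Score(model, x_train, y_train, x_test, y_test, x, y):
    """Fit every estimator in `model` (a dict of name -> estimator) and return
    its test-set accuracy and its 5-fold cross-validated accuracy."""
    np.random.seed(45)
    model_scores = {}
    # Iterate over the dictionary that was passed in, not the global `models`.
    for name, clf in model.items():
        clf.fit(x_train, y_train)
        model_scores[name] = {"Accuracy": clf.score(x_test, y_test),
                              "cv_acc": np.mean(cross_val_score(clf, x, y, cv=5, scoring="accuracy"))}
    return model_scores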

In [None]:
Scores = Fit_Score(model=models, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, x=x, y=y)
pd.DataFrame(Scores.values(), Scores.keys())

##### Improving the models.
* Further tuning of the hyperparameters.

* 1. Random Forest Classifier

a. Random Search CV

In [None]:
grid = {"n_estimators": np.arange(10, 1000, 50),
       "max_depth": [3, 10],
       "min_samples_split": np.arange(2, 20, 2),
       "min_samples_leaf": np.arange(1, 20, 2)}
#  Tuning

np.random.seed(45)
rs_clf = RandomizedSearchCV(RandomForestClassifier(n_jobs=1), 
                            param_distributions=grid, 
                            cv=5, n_iter=15, verbose=True, refit=True)

rs_clf.fit(x_train, y_train)


In [None]:
rs_clf.best_params_

In [None]:
rs_clf.score(x_test, y_test)

b. Grid Search CV

In [None]:
grid2 = {"n_estimators": [960],
       "max_depth": [3, 10],
       "min_samples_split": [12, 14],
       "min_samples_leaf": [13, 12]}
#  Tuning
np.random.seed(45)

gs_clf = GridSearchCV(RandomForestClassifier(n_jobs=1), 
                            param_grid=grid2, 
                            cv=5, verbose=True, refit=True)

gs_clf.fit(x_train, y_train)


In [None]:
gs_clf.best_params_

In [None]:
gs_clf.score(x_test, y_test)

* 2. Logistic Regression

In [None]:
#  Try a range of regularisation strengths (C) to see whether the score improves.

grid3 = {"C": np.logspace(-4, 4, 20),
        "solver": ["liblinear"]}

        
#  Tuning
np.random.seed(45)
rs_lr = RandomizedSearchCV(LogisticRegression(n_jobs=1), 
                            param_distributions=grid3, 
                            cv=5, n_iter=15, verbose=True, refit=True)

rs_lr.fit(x_train, y_train)

In [None]:
rs_lr.best_params_

In [None]:
rs_lr.score(x_test, y_test)
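
For each search above, two numbers are worth comparing: `best_score_`, the mean cross-validated score of the best parameter set on the training data, and `score(x_test, y_test)`, the score on held-out data. The sketch below illustrates this on synthetic data from `make_classification`; the grid and all names are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the loan features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=45)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=45
)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": np.logspace(-2, 2, 5)},
    cv=5,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV accuracy:", search.best_score_)
print("held-out accuracy:", search.score(X_te, y_te))
```

A held-out score far below `best_score_` would hint that the search has overfitted the training folds.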

### From the tuning above, the best fit reaches an accuracy of about 81%.

#### Further study with different metrics
Let's take logistic regression as the preferred model.

In [None]:
def Analytics(model, x_train, y_train, x_test, y_test, x, y):
    """Cross-validate the tuned models on accuracy, precision, recall and F1.

    Note: the `model` argument is not used; the tuned estimators are defined below.
    """
    models = {'RandomForestClassifier': RandomForestClassifier(max_depth=3, min_samples_leaf=13, 
                                                               min_samples_split=12, n_estimators=960),
              'ExtraTreesClassifier': ExtraTreesClassifier(),
              'DecisionTreeClassifier': DecisionTreeClassifier(),
              'LogisticRegression': LogisticRegression(solver='liblinear', C=545.5594781168514),
              'KNeighborsClassifier': KNeighborsClassifier(),
              'SVC': SVC()
              }
    model_scores = {}

    np.random.seed(45)
    for name, clf in models.items():
        # cross_val_score clones and refits the estimator on every fold,
        # so no separate fit on the training split is needed here.
        model_scores[name] = {"cv_acc": np.mean(cross_val_score(clf, x, y, cv=5, scoring="accuracy")),
                              "cv_prec": np.mean(cross_val_score(clf, x, y, cv=5, scoring="precision")),
                              "cv_recall": np.mean(cross_val_score(clf, x, y, cv=5, scoring="recall")),
                              "cv_f1": np.mean(cross_val_score(clf, x, y, cv=5, scoring="f1"))
                              }

    return model_scores
    

In [None]:
final_scores = Analytics(models, x_train, y_train, x_test, y_test, x, y)
scores = pd.DataFrame(final_scores.values(), final_scores.keys())
scores
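
The function above runs a separate 5-fold cross-validation for each metric. One possible alternative is `cross_validate`, which accepts several scorers and evaluates them on the same folds in a single pass. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the loan features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=45)

metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    LogisticRegression(solver="liblinear"),
    X_demo,
    y_demo,
    cv=5,
    scoring=metrics,
)

# Average each metric over the five folds.
summary = {name: cv_results[f"test_{name}"].mean() for name in metrics}
print(summary)
```

This fits each model five times instead of twenty, which matters most for the 960-tree random forest.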

#### Since our sample is imbalanced, precision and recall also play an important role in addition to accuracy.
#### Best models:
* Logistic Regression/Random Forest Classifier/SVC (good accuracy)
* Extra Trees Classifier (better precision)
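
To put the accuracy figures in context, it helps to know how well a model would score by always predicting the majority class. The sketch below computes that baseline on a hypothetical encoded target. The 69/31 split is an illustrative assumption, not a figure taken from the dataset; in the notebook the same expression could be applied to `y`.

```python
import pandas as pd

# Hypothetical encoded target: 1 = approved, 0 = rejected.
y_demo = pd.Series([1] * 69 + [0] * 31)

class_share = y_demo.value_counts(normalize=True)

# Accuracy obtained by always predicting the majority class.
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```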

Let's explore these further.

Logistic Regression


In [None]:
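from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

LR = LogisticRegression(solver='liblinear', C=545.5594781168514)

LR.fit(x_train, y_train)
# The Display classes replace plot_roc_curve / plot_confusion_matrix, which were removed in scikit-learn 1.2.
ROC_Curve = RocCurveDisplay.from_estimator(LR, x_test, y_test);
Conf_Matrix = ConfusionMatrixDisplay.from_estimator(LR, x_test, y_test);
ROC_Curve, Conf_Matrix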

Extra Trees Classifier

In [None]:
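from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

ET = ExtraTreesClassifier()
np.random.seed(45)
ET.fit(x_train, y_train)
# The Display classes replace plot_roc_curve / plot_confusion_matrix, which were removed in scikit-learn 1.2.
ROC_Curve = RocCurveDisplay.from_estimator(ET, x_test, y_test);
Conf_Matrix = ConfusionMatrixDisplay.from_estimator(ET, x_test, y_test);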

#### Feature Importance

In [None]:
LR.coef_

In [None]:
feature_dict = dict(zip(x.columns, list(LR.coef_[0])))
features = pd.DataFrame(feature_dict, index=[0])
features.plot.bar(figsize=(10, 6));
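
Logistic regression coefficients are only comparable across features when the features share a scale. As a hedged sketch, not the method used above, the example below standardises synthetic features inside a pipeline and ranks the coefficients by absolute size. The feature names `f0` to `f4` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=45
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=feature_names)

# Order the signed coefficients from largest to smallest magnitude.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of a coefficient shows the direction of the effect on the log-odds of class 1, and its magnitude shows the strength per standard deviation of the feature.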

### Conclusion

##### Models with the best scores:

* RandomForestClassifier/LogisticRegression:
  * cv_acc: 0.803
  * cv_prec: 0.792
  * cv_recall: 0.97
  * cv_f1: 0.870

We could also standardise the data and tune the other models to get better scores.
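
Standardisation matters most for distance-based models such as K-Neighbors and SVC. The sketch below, on synthetic data with one feature deliberately inflated, compares a K-Neighbors classifier with and without a `StandardScaler`. Placing the scaler inside a pipeline means it is refitted on the training part of every fold, so no information leaks from the validation part. All data and names here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=45)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the rest

raw_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Mean 5-fold accuracy without and with standardisation.
raw_acc = cross_val_score(raw_knn, X_demo, y_demo, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean()
print(raw_acc, scaled_acc)
```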

##### Suggestions for better models are welcome.