# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- Build models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Also benchmark the best accuracy you could get for every classification algorithm listed above

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|
| DT | | |
| KNN | | |
| LR | | |
| SVM | | |
| RF | | |
| Any other | | |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|
| | | |



### Submission
- Submit the notebook with all code executed and the outputs saved
- Submit a document containing the two tables above

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import  train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
# from sklearn.utils._testing import ignore_warnings
# from sklearn.exceptions import FitFailedWarning
import warnings
warnings.filterwarnings('ignore')

# Load the training data; drop rows with missing values and the ID column
df = pd.read_csv('loan_train.csv')
df = df.dropna().drop(columns='Loan_ID').reset_index(drop=True)
# Encode the target: 'Y' -> 1, 'N' -> 0
df.Loan_Status = df.Loan_Status.map({'Y': 1, 'N': 0})

# Loan_Status is the last column; everything before it is a feature
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Two encoders: the first drops one level per feature, the second keeps all levels
oneHotEncoder1 = OneHotEncoder(drop='first')
oneHotEncoder2 = OneHotEncoder()

# Split the categorical columns into two groups, one per encoder
encode_columns1 = X.select_dtypes(include='object').columns.drop(['Dependents', 'Property_Area', 'Education']).tolist()
encode_columns2 = X[['Dependents', 'Property_Area', 'Education']]

encode_columns1 = pd.DataFrame(oneHotEncoder1.fit_transform(X[encode_columns1]).toarray(), columns=oneHotEncoder1.get_feature_names_out())
# The 'Education_Not Graduate' indicator is dropped by hand after encoding
encode_columns2 = pd.DataFrame(oneHotEncoder2.fit_transform(encode_columns2).toarray(), columns=oneHotEncoder2.get_feature_names_out()).drop(columns='Education_Not Graduate')

# Standardize the numeric columns (the scaler is fitted on all rows, before the split)
numeric_columns = X.select_dtypes(include='number').columns
scale_column = pd.DataFrame(StandardScaler().fit_transform(X[numeric_columns]), columns=numeric_columns)
X = pd.concat([encode_columns1, encode_columns2, scale_column], axis=1)
# X = PCA(n_components = 8).fit(X).transform(X)
# Stratified 65/35 split; no random_state is set, so results vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, stratify=y, shuffle=True)

# X_train, y_train = SMOTE(sampling_strategy={0: 120,1:y_train.value_counts()[1]}).fit_resample(X_train,y_train)
# y_train.value_counts()
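
The preprocessing in this cell fits the encoders and the scaler on all rows before `train_test_split`, so statistics from the test split influence the transformation. A minimal sketch of one way to avoid that is shown here: wrap the scaler and the classifier in a `Pipeline` so that each cross-validation fold refits the scaler on its own training part. The synthetic data from `make_classification`, the `_demo` names, and the single-parameter grid are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: numeric features only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, stratify=y_demo, random_state=0
)

# The scaler lives inside the pipeline, so it is fitted on training folds only
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

# score() uses the refitted best pipeline and the scoring given above
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

For the loan data, the one-hot encoders could be placed in the same pipeline through a `ColumnTransformer`; that part is left out here to keep the sketch short.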

In [2]:
estimators = [
    ('DT',DecisionTreeClassifier()),
    ('KNN',KNeighborsClassifier()),
    ('LR',LogisticRegression()),
    ('SVM',SVC()),
    ('RF',RandomForestClassifier())
]

params = {
    'DT':{
        'max_depth': [None, 3, 5, 10, 20],
        'min_samples_split': [2, 5, 10],
        'splitter': ['best','random'],
        'criterion': ['gini','entropy','log_loss'],
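        # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; for classifiers it was equivalent to 'sqrt'
        'max_features': ['sqrt', 'log2']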
    },
    'KNN':{
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
        'p':[1, 1.25, 1.5, 1.75, 2],
        'algorithm':['auto','ball_tree','kd_tree','brute']
    },
    'LR':{
        'C': [0.1, 1, 10],
        'penalty': ['l1','l2','elasticnet'],
        'solver':[ 'saga', 'newton-cholesky', 'newton-cg', 'liblinear', 'lbfgs'],
        'max_iter':[50, 100, 150, 200],
        'l1_ratio':[0,0.25,0.5,0.75,1]
    },
    'SVM':{
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf','sigmoid'],
        'gamma':['scale','auto']
    },
    'RF':{
        'n_estimators': [50,100, 200],
        'max_depth': [None, 3, 5,10, 20]
    }
}

best_models = {}
for name, estimator in estimators:
    # Note: scoring='precision' makes GridSearchCV select hyperparameters by
    # mean cross-validated precision, not by accuracy
    gridSearchCV = GridSearchCV(estimator, params[name], scoring='precision', cv=8, n_jobs=-1)
    gridSearchCV.fit(X_train, y_train)
    best_models[name] = [
        gridSearchCV.best_estimator_,
        gridSearchCV.best_score_,                            # mean CV precision (validation folds of the training set)
        gridSearchCV.best_estimator_.score(X_test, y_test),  # accuracy on the held-out split
        gridSearchCV.best_params_,
    ]
# One row per algorithm
performance = pd.DataFrame(best_models).T
performance.columns = ['Estimator', 'Train_Accuracy', 'Test_Accuracy', 'Best_params']
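
The LR grid in this cell crosses every penalty with every solver, but `LogisticRegression` accepts only some of those pairs, and `l1_ratio` matters only for the elastic-net penalty. Unsupported combinations fail to fit and are scored as NaN, which `warnings.filterwarnings('ignore')` hides. A minimal sketch of a grid restricted to supported pairs follows. The pairing reflects the solver/penalty compatibility documented for scikit-learn 1.2 through 1.7, which is an assumption about the installed version; `lr_param_grid` and `count_candidates` are hypothetical names.

```python
from math import prod

C_VALUES = [0.1, 1, 10]
MAX_ITER = [50, 100, 150, 200]

# A list of dicts: GridSearchCV expands each dict separately,
# so incompatible penalty/solver pairs are never generated.
lr_param_grid = [
    {   # every solver in the original grid supports l2
        'penalty': ['l2'],
        'solver': ['lbfgs', 'newton-cg', 'newton-cholesky', 'liblinear', 'saga'],
        'C': C_VALUES, 'max_iter': MAX_ITER,
    },
    {   # l1 is supported by liblinear and saga only
        'penalty': ['l1'],
        'solver': ['liblinear', 'saga'],
        'C': C_VALUES, 'max_iter': MAX_ITER,
    },
    {   # elasticnet is supported by saga only and is the only user of l1_ratio
        'penalty': ['elasticnet'],
        'solver': ['saga'],
        'l1_ratio': [0, 0.25, 0.5, 0.75, 1],
        'C': C_VALUES, 'max_iter': MAX_ITER,
    },
]

def count_candidates(grid):
    """Number of parameter combinations a list-of-dicts grid expands to."""
    return sum(prod(len(values) for values in sub.values()) for sub in grid)

n_candidates = count_candidates(lr_param_grid)
print(n_candidates)
```

The list can be passed to `GridSearchCV` in place of `params['LR']`. It also shrinks the search: the single-dict grid expands to 3 × 3 × 5 × 4 × 5 = 900 candidates, most of them invalid or redundant.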



### Table 1
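- Index: estimator name
- Train_Accuracy: best mean cross-validated score on the training set. Because GridSearchCV is run with scoring='precision', this value is precision, $\frac{TP}{TP+FP}$, not accuracy
- Test_Accuracy: accuracy, $\frac{TP+TN}{TP+TN+FP+FN}$, on the held-out test split
- Best_params: best hyperparameters found by GridSearchCV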
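
To make the accuracy formula concrete, the sketch below evaluates it next to precision and recall, with class 1 as the positive class. The confusion-matrix counts are an assumption: they are inferred from the rounded decision-tree report printed by cell In [5] (supports of 116 and 52) and from the DT test accuracy of 0.708333, not read from an actual confusion matrix.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that are predicted positive."""
    return tp / (tp + fn)

# Counts inferred from the DT report (assumption), positive class = 1
tp, tn, fp, fn = 88, 31, 21, 28

print(f"accuracy  = {accuracy(tp, tn, fp, fn):.6f}")
print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```

Precision ignores true negatives entirely, so a model chosen by `scoring='precision'` is not necessarily the one with the highest accuracy.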


In [3]:
performance

| | Estimator | Train_Accuracy | Test_Accuracy | Best_params |
|---|---|---|---|---|
| DT | DecisionTreeClassifier(criterion='log_loss', m... | 0.815114 | 0.708333 | {'criterion': 'log_loss', 'max_depth': 20, 'ma... |
| KNN | KNeighborsClassifier(n_neighbors=3, p=1.25, we... | 0.798328 | 0.767857 | {'algorithm': 'auto', 'n_neighbors': 3, 'p': 1... |
| LR | LogisticRegression(C=0.1, l1_ratio=0, max_iter... | 0.78992 | 0.821429 | {'C': 0.1, 'l1_ratio': 0, 'max_iter': 50, 'pen... |
| SVM | SVC(C=10, kernel='sigmoid') | 0.80562 | 0.702381 | {'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'} |
| RF | (DecisionTreeClassifier(max_depth=20, max_feat... | 0.796072 | 0.803571 | {'max_depth': 20, 'n_estimators': 100} |


### Table 2
- Displays the best estimator, selected by Test_Accuracy
- If the selection were based on precision and recall instead, RF should be the best

In [4]:
performance[performance.Test_Accuracy == performance.Test_Accuracy.max()]

| | Estimator | Train_Accuracy | Test_Accuracy | Best_params |
|---|---|---|---|---|
| LR | LogisticRegression(C=0.1, l1_ratio=0, max_iter... | 0.78992 | 0.821429 | {'C': 0.1, 'l1_ratio': 0, 'max_iter': 50, 'pen... |


### Remarks
- Outliers exist, mostly among samples with Loan_Status = 0
    - It is unclear which criterion should be used to remove them
- Imbalanced dataset $\rightarrow$ insufficient learning on Loan_Status = 0 $\rightarrow$ low precision and recall for class 0
    - Applying SMOTE without removing the outliers adds noise
- Feature-vs-feature plots overlap heavily, so it is difficult to observe two separate clusters
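
One alternative to SMOTE for the imbalance noted above, which this notebook does not use, is class weighting. The sketch below computes the weights that scikit-learn's `class_weight='balanced'` option assigns, `n_samples / (n_classes * class_count)`, using the class counts printed by cell In [5] (332 and 148). `balanced_class_weights` is a hypothetical helper written for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts taken from the value_counts() output: 332 ones, 148 zeros
labels = [1] * 332 + [0] * 148
weights = balanced_class_weights(labels)
print(weights)
```

The minority class (0) gets the larger weight, so misclassifying a rejected application costs more during training. Whether this helps here is untested. It is only an option for estimators that accept `class_weight`, such as the decision tree, logistic regression, SVM, and random forest classifiers; `KNeighborsClassifier` does not.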

In [5]:
from sklearn.metrics import classification_report
print(pd.DataFrame(y).value_counts())
for i in performance.Estimator:
    print(i,'\n',classification_report(y_test, i.predict(X_test)),'\n')

Loan_Status
1              332
0              148
Name: count, dtype: int64
DecisionTreeClassifier(criterion='log_loss', max_depth=20, max_features='sqrt',
                       min_samples_split=10, splitter='random') 
               precision    recall  f1-score   support

           0       0.53      0.60      0.56        52
           1       0.81      0.76      0.78       116

    accuracy                           0.71       168
   macro avg       0.67      0.68      0.67       168
weighted avg       0.72      0.71      0.71       168
 

KNeighborsClassifier(n_neighbors=3, p=1.25, weights='distance') 
               precision    recall  f1-score   support

           0       0.64      0.56      0.60        52
           1       0.81      0.86      0.84       116

    accuracy                           0.77       168
   macro avg       0.73      0.71      0.72       168
weighted avg       0.76      0.77      0.76       168
 

LogisticRegression(C=0.1, l1_ratio=0, max_iter=50, sol

In [6]:
# import seaborn as sns
# sns.pairplot(pd.concat([X_train,y_train],axis=1), hue =  'Loan_Status');
