# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for the loan.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have provided a dataset for identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- Build models
- Create Parameter Grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Give the best accuracy
- Also, benchmark the best accuracy that you could get for every classification algorithm asked above

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm-wise best model with best hyperparameters)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
|           |          |                 |



### Submission
- Submit the notebook with all cells run and their outputs saved
- Submit a document containing the two tables above

In [1]:
# Importing the necessary libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import VotingClassifier

In [2]:
# Importing the training dataset

data = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv")
data.head(5)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


# Data Pre-processing

In [3]:
# Showing data type

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [4]:
# Dropping Loan_ID

data = data.drop(columns=["Loan_ID"])

In [5]:
# Getting Categorical columns

cat_cols = []
for column in data.columns:
    if data[column].dtype=="object":
        cat_cols.append(column)

In [6]:
# Checking the possible values for each feature

for column in data.columns:
    print(f"{column}: {data[column].unique()}")

Gender: ['Male' 'Female' nan]
Married: ['No' 'Yes' nan]
Dependents: ['0' '1' '2' '3+' nan]
Education: ['Graduate' 'Not Graduate']
Self_Employed: ['No' 'Yes' nan]
ApplicantIncome: [ 5849  4583  3000  2583  6000  5417  2333  3036  4006 12841  3200  2500
  3073  1853  1299  4950  3596  3510  4887  2600  7660  5955  3365  3717
  9560  2799  4226  1442  3750  4166  3167  4692  3500 12500  2275  1828
  3667  3748  3600  1800  2400  3941  4695  3410  5649  5821  2645  4000
  1928  3086  4230  4616 11500  2708  2132  3366  8080  3357  3029  2609
  4945  5726 10750  7100  4300  3208  1875  4755  5266  1000  3333  3846
  2395  1378  3988  2366  8566  5695  2958  6250  3273  4133  3620  6782
  2484  1977  4188  1759  4288  4843 13650  4652  3816  3052 11417  7333
  3800  2071  5316  2929  3572  7451  5050 14583  2214  5568 10408  5667
  2137  2957  3692 23803  3865 10513  6080 20166  2014  2718  3459  4895
  3316 14999  4200  5042  6950  2698 11757  2330 14866  1538 10000  4860
  6277  2577  9166

In [7]:
# Checking null values

data.isna().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

## Label Encoding

In [8]:
# Helper Function for Label Encoding

def label_encode_columns(df, columns, le):
    for col in columns:
        # fit the encoder on non-null values
        le.fit(df[col][df[col].notnull()])
        
        # transform the column and preserve the NaN values
        df[col] = df[col].map(lambda x: x if pd.isna(x) else le.transform([x])[0])
        
    return df
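
The helper above calls `le.transform` once per cell, which works but scales poorly on larger frames. A minimal vectorized sketch follows, assuming pandas categorical codes are an acceptable substitute (categories are sorted, which matches the lexical order `LabelEncoder` assigns to string labels). The function name `encode_preserving_nan` and the `demo` series are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def encode_preserving_nan(series: pd.Series) -> pd.Series:
    """Map each distinct non-null value to an integer code, keeping NaN as NaN."""
    # Categories are sorted, so codes follow the same lexical order
    # that LabelEncoder assigns to string labels.
    codes = series.astype("category").cat.codes.astype("float64")
    # pandas marks missing entries with code -1; turn them back into NaN.
    return codes.where(codes >= 0, np.nan)


# Hypothetical sample column with one missing value.
demo = pd.Series(["Male", "Female", np.nan, "Male"])
encoded = encode_preserving_nan(demo)
print(encoded.tolist())
```

Keeping the missing entries as NaN matters here because the imputation step that follows needs to see them.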

In [9]:
# Label Encoding

le = LabelEncoder()
df = label_encode_columns(data, cat_cols, le)
y = df["Loan_Status"]
df = df.drop(columns=["Loan_Status"])
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1.0,0.0,0.0,0,0.0,5849,0.0,,360.0,1.0,2
1,1.0,1.0,1.0,0,0.0,4583,1508.0,128.0,360.0,1.0,0
2,1.0,1.0,0.0,0,1.0,3000,0.0,66.0,360.0,1.0,2
3,1.0,1.0,0.0,1,0.0,2583,2358.0,120.0,360.0,1.0,2
4,1.0,0.0,0.0,0,0.0,6000,0.0,141.0,360.0,1.0,2
...,...,...,...,...,...,...,...,...,...,...,...
609,0.0,0.0,0.0,0,0.0,2900,0.0,71.0,360.0,1.0,0
610,1.0,1.0,3.0,0,0.0,4106,0.0,40.0,180.0,1.0,0
611,1.0,1.0,1.0,0,0.0,8072,240.0,253.0,360.0,1.0,2
612,1.0,1.0,2.0,0,0.0,7583,0.0,187.0,360.0,1.0,2


## Iterative Imputer

In [10]:
# Applying Iterative Imputer

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(df)
df_transform = imputer.transform(df)


In [11]:
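# Scaling the imputed features to the [0, 1] range
# (df becomes a NumPy array here, so X gets integer column labels)

df = MinMaxScaler().fit_transform(df_transform)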
X = pd.DataFrame(data=df)
feature_name = list(X.columns)
num_feats = len(feature_name)

## Feature Selection

### Pearson Correlation

In [12]:
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature

In [13]:
cor_support, cor_feature = cor_selector(X, y, num_feats)

### Chi Square

In [14]:
def chi_squared_selector(X, y, num_feats):
    bestfeatures = SelectKBest(score_func=chi2, k=num_feats)
    bestfeatures.fit(X,y)
    chi_support=bestfeatures.get_support()
    chi_feature=X.loc[:,chi_support].columns.tolist()
    return chi_support, chi_feature

In [15]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)

### Recursive Feature Elimination

In [16]:
def rfe_selector(X, y, num_feats):
    rfe_selector= RFE(estimator=LogisticRegression(),n_features_to_select=num_feats,step=10,verbose=5)
    rfe_selector.fit(X, y)
    rfe_support=rfe_selector.get_support()
    rfe_feature=X.loc[:, rfe_support].columns.tolist()
    return rfe_support, rfe_feature

In [17]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)

### Logistic Regression (Embedded)

In [18]:
def embedded_log_reg_selector(X, y, num_feats):
    lr_selector=SelectFromModel(LogisticRegression(random_state=0),max_features=num_feats)
    lr_selector.fit(X, y)
    lr_support=lr_selector.get_support()
    lr_feature=X.loc[:, lr_support].columns.tolist()
    return lr_support, lr_feature

In [19]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)

### Random Forest

In [20]:
def embedded_rf_selector(X, y, num_feats):
    rf_selector=SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),max_features=num_feats)
    
    rf_selector.fit(X,y)
    rf_support=rf_selector.get_support()
    rf_feature=X.loc[:,rf_support].columns.tolist()
    
    return rf_support, rf_feature

In [21]:
embeded_rf_support, embeded_rf_feature = embedded_rf_selector(X, y, num_feats)

### Light GBM

In [22]:
def embedded_lgbm_selector(X, y, num_feats):
    lgbc=LGBMClassifier(n_estimators=500,learning_rate=.05,num_leaves=32,colsample_bytree=.2,
                       reg_alpha=3,reg_lambda=1,min_split_gain=.01,min_child_weight=40, random_state=0)
    lgb_selector=SelectFromModel(lgbc,max_features=num_feats)
    lgb_selector.fit(X,y)
    lgb_support=lgb_selector.get_support()
    lgb_feature=X.loc[:,lgb_support].columns.tolist()
    return lgb_support, lgb_feature

In [23]:
embeded_lgbm_support, embeded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)

### Feature Selection Summary

In [24]:
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistic':embedded_lr_support,
                                    'Random Forest':embeded_rf_support, 'LightGBM':embeded_lgbm_support})
# count how many selectors chose each feature
feature_selection_df['Total'] = np.sum(feature_selection_df.iloc[:,1:], axis=1)
# display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistic,Random Forest,LightGBM,Total
1,9,True,True,True,True,True,True,6
2,7,True,True,True,True,True,True,6
3,6,True,True,True,True,True,True,6
4,5,True,True,True,False,True,True,5
5,10,True,True,True,False,False,True,4
6,8,True,True,True,False,False,True,4
7,4,True,True,True,False,False,True,4
8,3,True,True,True,False,False,True,4
9,2,True,True,True,False,False,True,4
10,1,True,True,True,False,False,True,4
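
In the summary above, the Pearson, Chi-2 and RFE columns are `True` for every listed feature, which is consistent with `num_feats` being set to the total number of columns: a selector asked to keep all features cannot discard any. The sketch below shows how a smaller budget changes the picture. It uses synthetic data and a hypothetical `k = 3`; neither comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# Synthetic, non-negative features (chi2 requires values >= 0).
X_demo = pd.DataFrame(rng.random((200, 6)), columns=[f"f{i}" for i in range(6)])
# Make the target depend on f0 so that at least one feature is informative.
y_demo = (X_demo["f0"] > 0.5).astype(int)

k = 3  # hypothetical budget, smaller than the number of columns
selector = SelectKBest(score_func=chi2, k=k).fit(X_demo, y_demo)
support = selector.get_support()
selected = X_demo.columns[support].tolist()
print(selected)
```

With a budget below the column count, the `Total` column would start to separate features that many selectors agree on from those that only a few keep.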


# Classification

In [25]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, shuffle=True)

## Logistic Regression

In [40]:
# Defining and Training the Logistic Regression Model
lr_model = LogisticRegression(random_state=0)
lr_model.fit(X_train, y_train)

# Testing the Logistic Regression Model
lr_predictions = lr_model.predict(X_test)
lr_score = lr_model.score(X_test, y_test)
print("Accuracy: ", lr_score)

# Confusion Matrix
lr_cm = confusion_matrix(y_test,lr_predictions)
print("Confusion Matrix:\n", lr_cm)

Accuracy:  0.8292682926829268
Confusion Matrix:
 [[14 19]
 [ 2 88]]
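
Accuracy alone hides how the errors are distributed. As a worked example, the sketch below derives precision, recall and specificity from the confusion matrix printed above. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes, and on `LabelEncoder` having mapped `N` to 0 and `Y` to 1.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[14, 19],
               [2, 88]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # of predicted approvals, how many were correct
recall = tp / (tp + fn)         # of actual approvals, how many were found
specificity = tn / (tn + fp)    # of actual rejections, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

The low specificity (14 of 33 rejections recognized) suggests the model leans heavily toward predicting approval.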


### Logistic Regression with GridSearchCV

In [41]:
# Hyperparameters to tune with GridSearch
param_grid = {'C': [1,5, 20,100,500],
              'tol': [0.001, 0.0001, 0.00001], 
              'class_weight': ['balanced', None],
              'fit_intercept': [True, False]}

# Training the Model
lr_model = LogisticRegression(random_state=0)
lr_gs = GridSearchCV(lr_model,
                           param_grid,
                           scoring='accuracy',
                           verbose=1,
                           n_jobs=-1)

# Applying the Model
lr_gs.fit(X_train, y_train)
lr_gs_predictions = lr_gs.predict(X_test)
lr_gs_score = lr_gs.score(X_test, y_test)

# Best hyperparameter values and the resulting accuracy score
print("Best parameters:", lr_gs.best_params_)
print("Accuracy:", lr_gs_score)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best parameters: {'C': 1, 'class_weight': None, 'fit_intercept': True, 'tol': 0.001}
Accuracy: 0.8292682926829268
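
The accuracy printed above comes from scoring the refit best estimator on `X_test`. `GridSearchCV` also keeps the cross-validated mean accuracy that was used to pick the winner, in `best_score_` and `cv_results_`. The sketch below shows how the two can be read side by side. It runs on synthetic data from `make_classification` with a small hypothetical grid, so its numbers are unrelated to the loan results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the loan data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1, 10]},
                      scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_             # mean accuracy across the CV folds
holdout_accuracy = search.score(X_te, y_te)  # accuracy on rows the search never saw

# One row per candidate, ordered by cross-validated rank.
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
print(cv_accuracy, holdout_accuracy)
```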


## Decision Tree

In [42]:
#Defining and Training the Decision Tree Model
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

#Node count and Maximum depth of the tree
print(f'Decision tree has {tree.tree_.node_count} nodes with maximum depth {tree.tree_.max_depth}.')

#Make probability predictions
train_probs = tree.predict_proba(X_train)[:,1]
probs = tree.predict_proba(X_test)[:,1]

train_predictions = tree.predict(X_train)
predictions = tree.predict(X_test)

#Print ROC AUC scores to assess Decision Tree performance
print(f'Train ROC AUC Score: {roc_auc_score(y_train, train_probs)}')
print(f'Test ROC AUC  Score: {roc_auc_score(y_test, probs)}')
print(f'Baseline ROC AUC: {roc_auc_score(y_test, [1 for _ in range(len(y_test))])}')


# Calculate the accuracy score
y_pred = tree.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {dt_accuracy}")

#Decision Tree Confusion Matrix
dt_cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", dt_cm)

Decision tree has 197 nodes with maximum depth 17.
Train ROC AUC Score: 1.0
Test ROC AUC  Score: 0.6601010101010101
Baseline ROC AUC: 0.5
Accuracy: 0.6991869918699187
Confusion Matrix:
 [[19 14]
 [23 67]]
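
The baseline ROC AUC of 0.5 above comes from scoring every row identically. A comparable accuracy baseline is to always predict the majority class. The sketch below computes both from the class counts implied by the confusion matrix rows (33 negatives, 90 positives). For brevity the dummy model is fitted on the hold-out labels themselves, which is a simplification rather than how a baseline would normally be trained.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Class counts taken from the confusion matrix rows above: 33 negatives, 90 positives.
y_holdout = np.array([0] * 33 + [1] * 90)
X_placeholder = np.zeros((len(y_holdout), 1))  # features are ignored by the dummy model

majority = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_holdout)
majority_accuracy = majority.score(X_placeholder, y_holdout)

# A constant score cannot rank any positive above any negative, so AUC is 0.5.
constant_auc = roc_auc_score(y_holdout, np.ones(len(y_holdout)))

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"constant-score ROC AUC: {constant_auc}")
```

Always predicting approval would score about 0.732 on this split, so the untuned tree's 0.699 is below that simple baseline.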


### Decision Tree with GridSearchCV

In [43]:
# Hyperparameters to tune with GridSearch
param_dict = {'criterion':['gini', 'entropy'],
              'max_depth':range(1,10),
              'min_samples_split':range(2,10),
              'min_samples_leaf':range(1,5)}

# Training the Model
dt = DecisionTreeClassifier(random_state=0)
dt_gs = GridSearchCV(dt,
                       param_grid=param_dict,
                       scoring='accuracy',
                       verbose=1,
                       n_jobs=-1)

# Applying the Model
dt_gs.fit(X_train, y_train)
dt_gs_predictions = dt_gs.predict(X_test)
dt_gs_accuracy = accuracy_score(y_test, dt_gs_predictions)

#Print out the best parameters by using GridSearchCV and the best score of the model
print("Best parameters:", dt_gs.best_params_)
print("Accuracy:", dt_gs_accuracy)

Fitting 5 folds for each of 576 candidates, totalling 2880 fits
Best parameters: {'criterion': 'entropy', 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy: 0.8292682926829268


## Random Forest

In [44]:
#Defining and Training the Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0,
                                  max_features='sqrt',
                                  n_jobs=-1,
                                  verbose=1)
rf_model.fit(X_train, y_train)

#Node count and Maximum depth of the tree
n_nodes = []
max_depths = []

for ind_tree in rf_model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

#Make Random Forest predictions
train_rf_predictions = rf_model.predict(X_train)
train_rf_probs = rf_model.predict_proba(X_train)[:, 1]

rf_predictions = rf_model.predict(X_test)
rf_probs = rf_model.predict_proba(X_test)[:, 1]

#Print ROC AUC scores
print(f'Train ROC AUC Score: {roc_auc_score(y_train, train_rf_probs)}')
print(f'Test ROC AUC  Score: {roc_auc_score(y_test, rf_probs)}')
print(f'Baseline ROC AUC: {roc_auc_score(y_test, [1 for _ in range(len(y_test))])}')

# Calculate the accuracy score
y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {rf_accuracy}")

#Random Forest Confusion Matrix
rf_cm = confusion_matrix(y_test, rf_predictions)
print("Confusion Matrix:\n", rf_cm)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.0s finished

Average number of nodes 203
Average maximum depth 17
Train ROC AUC Score: 0.9999999999999999
Test ROC AUC  Score: 0.7585858585858586
Baseline ROC AUC: 0.5
Accuracy: 0.7804878048780488
Confusion Matrix:
 [[15 18]
 [ 9 81]]



### Random Forest with GridSearchCV

In [45]:
# Hyperparameters to tune with GridSearch
param_grid = {'n_estimators': [10, 50, 100],
              'max_depth': [None, 10, 20],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}

#Train and Fit the Model
rf = RandomForestClassifier(random_state=0)
rf_gs = GridSearchCV(rf,
                       param_grid,
                       scoring="accuracy",
                       verbose=1,
                       n_jobs=-1)
rf_gs.fit(X_train, y_train)
rf_gs_pred = rf_gs.predict(X_test)
rf_gs_accuracy = accuracy_score(y_test, rf_gs_pred)

# Best hyperparameter values and the resulting accuracy score
print(f"Best parameters: {rf_gs.best_params_}")
print(f"Accuracy: {rf_gs_accuracy}")

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.8048780487804879


## KNN

In [90]:
# Defining and Training the K-Nearest Neighbors Model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

# Testing the K-Nearest Neighbors Model
knn_predictions = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_predictions)
print("Accuracy: ", knn_accuracy)

# Confusion Matrix
knn_cm = confusion_matrix(y_test, knn_predictions)
print("Confusion Matrix:\n", knn_cm)

Accuracy:  0.7886178861788617
Confusion Matrix:
 [[14 19]
 [ 7 83]]


### KNN with GridSearchCV

In [91]:
# Hyperparameters to tune with GridSearch
param_grid = {'weights': ['uniform', 'distance'],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'leaf_size': [30, 35]}

# Training the Model
knn_model = KNeighborsClassifier()
knn_gs = GridSearchCV(knn_model,
                           param_grid,
                           scoring='accuracy',
                           verbose=1,
                           n_jobs=-1)

# Applying the Model
knn_gs.fit(X_train, y_train)
knn_gs_predictions = knn_gs.predict(X_test)
knn_gs_accuracy = accuracy_score(y_test, knn_gs_predictions)

# Best hyperparameter values and the resulting accuracy score
print("Best parameters:", knn_gs.best_params_)
print("Accuracy:", knn_gs_accuracy)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameters: {'algorithm': 'auto', 'leaf_size': 30, 'weights': 'uniform'}
Accuracy: 0.7886178861788617
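
The grid above returned the default settings and the same accuracy as the untuned model. `algorithm` and `leaf_size` mainly control how neighbors are searched rather than which class is predicted, so they leave little room for improvement. A sketch of a grid over the number of neighbors and the distance metric follows. The value ranges are hypothetical choices and the data is synthetic, so the printed result says nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the loan data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

knn_grid = {"n_neighbors": [3, 5, 7, 9, 11, 15],
            "weights": ["uniform", "distance"],
            "p": [1, 2]}   # 1 = Manhattan distance, 2 = Euclidean distance

knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, scoring="accuracy", cv=5)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_, knn_search.best_score_)
```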


## SVM

In [51]:
# Defining and Training the Support Vector Machine Model
svm_model = SVC(random_state=0)
svm_model.fit(X_train, y_train)

# Testing the Support Vector Machine Model
svm_predictions = svm_model.predict(X_test)
svm_score = svm_model.score(X_test, y_test)
print("Accuracy: ", svm_score)

svm_cm = confusion_matrix(y_test,svm_predictions)
print("Confusion Matrix:\n", svm_cm)

Accuracy:  0.8292682926829268
Confusion Matrix:
 [[14 19]
 [ 2 88]]


### SVM with GridSearchCV

In [52]:
# Hyperparameters to tune with GridSearch
param_grid = {'C': [1,5, 20,100,500],
              'gamma': [.001, .01, .1,1, 'scale', 'auto'],
              'kernel': ['linear', 'rbf'],
              'shrinking': [True, False]}

# Training the Model
svm_model = SVC(random_state=0)
svm_gs = GridSearchCV(svm_model,
                           param_grid,
                           scoring='accuracy',
                           verbose=1,
                           n_jobs=-1)

# Applying the Model
svm_gs.fit(X_train, y_train)
svm_gs_predictions=svm_gs.predict(X_test)
svm_gs_accuracy = accuracy_score(y_test, svm_gs_predictions)

# Best hyperparameter values and the resulting accuracy score
print("Best parameters:", svm_gs.best_params_)
print("Accuracy:", svm_gs_accuracy)

Fitting 5 folds for each of 120 candidates, totalling 600 fits
Best parameters: {'C': 1, 'gamma': 0.001, 'kernel': 'linear', 'shrinking': True}
Accuracy: 0.8292682926829268
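
The two tables requested in the task can be assembled programmatically once each grid search has run. The sketch below shows one possible way to do that with a loop over estimator and grid pairs. The `candidates` dictionary, its small grids and the synthetic data are assumptions made for illustration; only three of the algorithms are included to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical estimator/grid pairs, one per algorithm.
candidates = {
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [1, 3, 5]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    gs = GridSearchCV(estimator, grid, scoring="accuracy", cv=5).fit(X_tr, y_tr)
    rows.append({"Algorithm": name,
                 "Accuracy": gs.score(X_te, y_te),
                 "Hyperparameters": gs.best_params_})

table_1 = pd.DataFrame(rows)                                          # one row per algorithm
table_2 = table_1.sort_values("Accuracy", ascending=False).head(1)   # best overall
print(table_1)
print(table_2)
```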


# Ensemble

In [53]:
# Initialize dictionary of all the models

models = {'lr': lr_gs,
          'knn': knn_gs,
          'dt': dt_gs,
          'svm': svm_gs,
          'rf': rf_gs
         }

## Blending

In [54]:
# Fit the individual models and make base predictions

base_model_train_predictions = []
for algo, model in models.items():
    # Fit on the training dataset
    model.fit(X_train, y_train)
    # Predict on the hold-out dataset
    yhat = model.predict(X_test)
    # Store predictions for meta-model's use
    yhat = yhat.reshape(len(yhat), 1)
    base_model_train_predictions.append(yhat)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Fitting 5 folds for each of 576 candidates, totalling 2880 fits
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Fitting 5 folds for each of 81 candidates, totalling 405 fits


### Training

In [55]:
# Reshaping the base model output

base_model_train_predictions = np.hstack(base_model_train_predictions)

In [56]:
# Defining the META model

blender = LogisticRegression(random_state=0)

In [57]:
# Fitting the blender on the base models' hold-out predictions

blender.fit(base_model_train_predictions, y_test)
blender_predictions = blender.predict(base_model_train_predictions)

In [61]:
# Accuracy of the blender on the same hold-out predictions it was fitted on (an optimistic estimate)

accuracy_score(y_test, blender_predictions)

0.8292682926829268
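
Because the blender above is fitted and scored on the same hold-out predictions, the 0.829 figure is not an independent estimate. One common remedy is to use three disjoint splits: one for the base models, one for the blender, and one for the final evaluation. The sketch below illustrates that arrangement on synthetic data. The split sizes, the three base models and the helper name `base_predictions` are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
# Three disjoint parts: base-model training, blender training, final evaluation.
X_tmp, X_eval, y_tmp, y_eval = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_base, X_blend, y_base, y_blend = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

base_models = [LogisticRegression(max_iter=1000),
               KNeighborsClassifier(),
               DecisionTreeClassifier(max_depth=3, random_state=0)]
for model in base_models:
    model.fit(X_base, y_base)


def base_predictions(X):
    """Stack each base model's predictions as one column per model."""
    return np.column_stack([model.predict(X) for model in base_models])


meta = LogisticRegression()
meta.fit(base_predictions(X_blend), y_blend)   # the blender learns on the blend split
# The evaluation split was seen by neither the base models nor the blender.
blend_accuracy = accuracy_score(y_eval, meta.predict(base_predictions(X_eval)))
print(blend_accuracy)
```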

# Testing

In [62]:
# Loading the test dataset

test = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv")
test

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
362,LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777,113.0,360.0,1.0,Urban
363,LP002975,Male,Yes,0,Graduate,No,4158,709,115.0,360.0,1.0,Urban
364,LP002980,Male,No,0,Graduate,No,3250,1993,126.0,360.0,,Semiurban
365,LP002986,Male,Yes,0,Graduate,No,5000,2393,158.0,360.0,1.0,Rural


## Pre-processing

In [63]:
# Dropping Loan_ID

test.drop(columns=["Loan_ID"], inplace=True)

In [64]:
# Checking for missing values

test.isna().sum()

Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [65]:
# Label Encoding

le = LabelEncoder()
test = label_encode_columns(test, cat_cols[:-1], le)
test

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1.0,1,0.0,0,0.0,5720,0,110.0,360.0,1.0,2
1,1.0,1,1.0,0,0.0,3076,1500,126.0,360.0,1.0,2
2,1.0,1,2.0,0,0.0,5000,1800,208.0,360.0,1.0,2
3,1.0,1,2.0,0,0.0,2340,2546,100.0,360.0,,2
4,1.0,0,0.0,1,0.0,3276,0,78.0,360.0,1.0,2
...,...,...,...,...,...,...,...,...,...,...,...
362,1.0,1,3.0,1,1.0,4009,1777,113.0,360.0,1.0,2
363,1.0,1,0.0,0,0.0,4158,709,115.0,360.0,1.0,2
364,1.0,0,0.0,0,0.0,3250,1993,126.0,360.0,,1
365,1.0,1,0.0,0,0.0,5000,2393,158.0,360.0,1.0,0


In [66]:
# Applying Iterative Imputer

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(test)
test_transform = imputer.transform(test)
test_transform

array([[  1.        ,   1.        ,   0.        , ..., 360.        ,
          1.        ,   2.        ],
       [  1.        ,   1.        ,   1.        , ..., 360.        ,
          1.        ,   2.        ],
       [  1.        ,   1.        ,   2.        , ..., 360.        ,
          1.        ,   2.        ],
       ...,
       [  1.        ,   0.        ,   0.        , ..., 360.        ,
          0.80549133,   1.        ],
       [  1.        ,   1.        ,   0.        , ..., 360.        ,
          1.        ,   0.        ],
       [  1.        ,   0.        ,   0.        , ..., 180.        ,
          1.        ,   0.        ]])

In [67]:
test = MinMaxScaler().fit_transform(test_transform)
test = pd.DataFrame(data=test)
test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1.0,1.0,0.000000,0.0,0.0,0.078865,0.000000,0.157088,0.746835,1.000000,1.0
1,1.0,1.0,0.333333,0.0,0.0,0.042411,0.062500,0.187739,0.746835,1.000000,1.0
2,1.0,1.0,0.666667,0.0,0.0,0.068938,0.075000,0.344828,0.746835,1.000000,1.0
3,1.0,1.0,0.666667,0.0,0.0,0.032263,0.106083,0.137931,0.746835,0.798226,1.0
4,1.0,0.0,0.000000,1.0,0.0,0.045168,0.000000,0.095785,0.746835,1.000000,1.0
...,...,...,...,...,...,...,...,...,...,...,...
362,1.0,1.0,1.000000,1.0,1.0,0.055274,0.074042,0.162835,0.746835,1.000000,1.0
363,1.0,1.0,0.000000,0.0,0.0,0.057329,0.029542,0.166667,0.746835,1.000000,1.0
364,1.0,0.0,0.000000,0.0,0.0,0.044810,0.083042,0.187739,0.746835,0.805491,0.5
365,1.0,1.0,0.000000,0.0,0.0,0.068938,0.099708,0.249042,0.746835,1.000000,0.0
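
The cells above fit a new `IterativeImputer` and a new `MinMaxScaler` on the test file, so the test features are imputed and scaled with different statistics than the ones the models were trained on. One way to keep the two consistent is to wrap the preprocessing and the model in a single `Pipeline`, fit it on the training data, and reuse it for prediction. The sketch below illustrates this on synthetic arrays with artificially removed values. The names and the choice of `LogisticRegression` as the final step are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and test feature matrices.
X_train_demo = rng.normal(size=(200, 4))
y_train_demo = (X_train_demo[:, 0] > 0).astype(int)
X_test_demo = rng.normal(size=(50, 4))
# Knock out roughly 10% of the entries to mimic missing values.
X_train_demo[rng.random(X_train_demo.shape) < 0.1] = np.nan
X_test_demo[rng.random(X_test_demo.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("impute", IterativeImputer(max_iter=10, random_state=0)),
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train_demo, y_train_demo)          # imputer and scaler learn from training data only
test_predictions = pipe.predict(X_test_demo)  # the same fitted steps are applied to the test rows
print(test_predictions[:10])
```

A pipeline like this can also be passed to `GridSearchCV` directly, in which case the preprocessing is refitted inside each cross-validation fold.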


## Inference

In [85]:
# Collecting base model predictions on the test dataset for the blender

base_model_train_predictions = []
for algo, model in models.items():
    # Predict on the test dataset
    yhat = model.predict(test)
    # Store predictions for meta-model's use
    yhat = yhat.reshape(len(yhat), 1)
    base_model_train_predictions.append(yhat)

In [86]:
# Reshaping the base predictions

base_model_train_predictions = np.hstack(base_model_train_predictions)

In [88]:
# Using base predictions to obtain final predictions

blender_predictions = blender.predict(base_model_train_predictions)
blender_predictions

array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
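
The predictions above are encoded integers. Since `LabelEncoder` sorts its classes, `Loan_Status` was encoded as `N` = 0 and `Y` = 1, and the original labels can be recovered with `inverse_transform`. The sketch below uses a short hypothetical array rather than the full prediction vector.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Refit an encoder on the two known Loan_Status labels (sorted: N -> 0, Y -> 1).
status_encoder = LabelEncoder().fit(["N", "Y"])

encoded_predictions = np.array([1, 1, 0, 1])  # hypothetical sample of model outputs
decoded = status_encoder.inverse_transform(encoded_predictions)
print(decoded)
```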