# Model Prototyping
<p style="font-size:20px">
Now that <b>Feature_Engineered.csv</b> contains clean data with predictive power and has already been tested with a RandomForestClassifier, I am going to train multiple models and compare their performance.
</p>

## Importing Libraries

In [1]:
# Data Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Basic Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Ensemble Models
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Boosting
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import f1_score, classification_report, confusion_matrix, roc_auc_score, precision_recall_curve

# Preprocessing
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Model Interpretation
import shap

In [5]:
# List of (name, estimator) pairs to compare, all with default hyperparameters
models = [
  ("Logistic Regression", LogisticRegression()),
  ("Decision Tree", DecisionTreeClassifier()),
  ("SVM", SVC(probability=True)),
  ("K Nearest Neighbour", KNeighborsClassifier()),
  ("Random Forest Classifier", RandomForestClassifier()),
  ("XGBoost", XGBClassifier()),
  ("LightGBM", LGBMClassifier())
]

In [25]:
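# Evaluation function
def evaluate_model(models, X_train, y_train, X_test, y_test):
  """Fit each model on SMOTE-resampled training data and return test-set F1 scores."""
  # Models that are sensitive to feature scale get standardized inputs
  scale_sensitive = ["Logistic Regression", "SVM", "K Nearest Neighbour"]
  results = []
  scaler = StandardScaler()
  smote = SMOTE()
  for name, model in models:
    print(f"\n Name of Model: {name}")
    if name in scale_sensitive:
      # Fit the scaler on the training split only, then reuse it on the test split
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
      # Oversample the minority class in the training data only
      X_res, y_res = smote.fit_resample(X_train_scaled, y_train)
      model.fit(X_res, y_res)
      pred = model.predict(X_test_scaled)
    else:
      # Tree-based models are trained on the unscaled features
      X_res, y_res = smote.fit_resample(X_train, y_train)
      model.fit(X_res, y_res)
      pred = model.predict(X_test)

    # F1 is computed on the untouched (not resampled) test split
    f1 = f1_score(y_test, pred)
    results.append((name, f1))

  # One row per model, best F1 first
  return pd.DataFrame(results, columns=["Model", "F1 Score"]).sort_values(by="F1 Score", ascending=False)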

## Reading the File

In [26]:
jee = pd.read_csv("../Data/Feature_Engineered.csv", delimiter=",")
jee.head()

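| | peer_focused_mh | PSxIA | daily_study_hours | location_type | Income vs Admission | family_income | parental_support | dropout |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 5.4 | 2 | 0 | 0 | 1.62 | 1 |
| 1 | 1 | 1.65 | 5.5 | 2 | 1 | 1 | 1.65 | 0 |
| 2 | 3 | 0.0 | 7.0 | 1 | 0 | 0 | 2.7 | 1 |
| 3 | 3 | 2.49 | 2.1 | 1 | 3 | 0 | 0.83 | 0 |
| 4 | 4 | 9.16 | 6.3 | 1 | 4 | 1 | 2.29 | 0 |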


## Creating X_train, X_test, y_train, y_test

In [27]:
X = jee.drop("dropout", axis=1)
Y = jee['dropout']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
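
The split above passes `stratify=Y` so that both splits keep the same proportion of dropouts. A small sketch of that effect on hypothetical labels follows; the 80/20 class ratio and the names `X_demo`, `y_demo`, `y_tr`, and `y_te` are assumptions made up for the example, not values from `Feature_Engineered.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 800 negatives, 200 positives (assumption)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate is the same in the train and test splits
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Running the same check on `y_train` and `y_test` is a quick way to confirm that the test split reflects the class balance of the full data.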

In [28]:
evaluate_model(models, X_train, y_train, X_test, y_test)


 Name of Model: Logistic Regression

 Name of Model: Decision Tree

 Name of Model: SVM

 Name of Model: K Nearest Neighbour

 Name of Model: Random Forest Classifier

 Name of Model: XGBoost

 Name of Model: LightGBM
[LightGBM] [Info] Number of positive: 3171, number of negative: 3171
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000168 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 6342, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


| | Model | F1 Score |
|---|---|---|
| 4 | Random Forest Classifier | 0.990244 |
| 5 | XGBoost | 0.987835 |
| 6 | LightGBM | 0.985437 |
| 1 | Decision Tree | 0.978313 |
| 2 | SVM | 0.953052 |
| 3 | K Nearest Neighbour | 0.948113 |
| 0 | Logistic Regression | 0.821577 |
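
The table above ranks models by a single F1 score. `classification_report`, `confusion_matrix`, and `roc_auc_score` are already imported and could give a fuller picture of a top-ranked model. The block below is a sketch on synthetic data only: `make_classification`, the names `X_demo`, `y_demo`, `clf`, `cm`, and `auc`, and the fixed `random_state` are assumptions, and its numbers say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
# Probability of the positive class, needed for ROC AUC
proba = clf.predict_proba(X_te)[:, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
auc = roc_auc_score(y_te, proba)

print(cm)
print(classification_report(y_te, pred))
print(f"ROC AUC: {auc:.3f}")
```

The confusion matrix shows whether errors are mostly missed positives or false alarms, which a single F1 value cannot distinguish.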
