# All Campuses & Years - Select From Model

We use our scaled data to quickly check which model fits the complete dataset best.

**Index**

1. [Environment](#Environment)
2. [Transform Data](#TransformData)
3. [Select From Model](#SelectFromModel)
4. [Conclusion](#Conclusion)

## Environment

#### Import libraries

In [72]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import seaborn as sns

from imblearn.under_sampling import TomekLinks

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from imblearn.pipeline import make_pipeline
from imblearn.pipeline import Pipeline  # needed by run_multiple_models below

#### Import data

In [73]:
apps = pd.read_csv("apps_allYears_clean_selCols_addCols.csv")

In [74]:
apps.head()

Unnamed: 0,Bootcamp Course,Bootcamp Format,Bootcamp Year,Campus - Timezone,Drop,Drop Reason,how did you hear about us?,Person Account: Gender,Stage Duration,Paid Deposit,Scholarship,Discount,Time Conversion - days,Time between Created Date and Start Date - days,Discount(%),Creater Month,Creater Quarter
0,WD,FT,2018,MEX,0,Not specified,Social Media,Male,1125.0,0,0,0,110,133,0.0,9,3
1,UX,FT,2018,AMS,0,Not specified,google,Male,875.0,0,0,0,39,97,0.0,7,3
2,UX,FT,2018,BCN,0,Not specified,google,Male,952.0,0,0,0,143,227,0.0,1,1
3,WD,FT,2020,MIA,0,Not specified,Social Media,Female,147.0,0,0,0,0,34,0.0,7,3
4,WD,PT,2018,MIA,0,Not specified,other,Male,1125.0,0,0,1,388,421,0.09,9,3


In [75]:
apps.dtypes

Bootcamp Course                                     object
Bootcamp Format                                     object
Bootcamp Year                                        int64
Campus - Timezone                                   object
Drop                                                 int64
Drop Reason                                         object
how did you hear about us?                          object
Person Account: Gender                              object
Stage Duration                                     float64
Paid Deposit                                         int64
Scholarship                                          int64
Discount                                             int64
Time Conversion - days                               int64
Time between Created Date and Start Date - days      int64
Discount(%)                                        float64
Creater Month                                        int64
Creater Quarter                                      int64

## TransformData

In [76]:
features_to_encode = apps.columns[apps.dtypes==object].tolist()
col_transformer_e = make_column_transformer((OneHotEncoder(handle_unknown='ignore'), 
                                             features_to_encode),remainder="passthrough")
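As a rough illustration of what this column transformer does, the sketch below applies the same recipe to a tiny made-up frame. The frame `toy`, its values, and the `toy_*` names are assumptions for the example only; the column names merely mimic the real data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame (values are made up for illustration).
# The text columns are cast to object so the dtype test below matches
# on every pandas version.
toy = pd.DataFrame({
    "Bootcamp Course": ["WD", "UX", "WD"],
    "Bootcamp Format": ["FT", "FT", "PT"],
    "Discount": [0, 0, 1],
}).astype({"Bootcamp Course": object, "Bootcamp Format": object})

# Same selection rule as in the notebook: one-hot encode every object column.
toy_features_to_encode = toy.columns[toy.dtypes == object].tolist()

toy_transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), toy_features_to_encode),
    remainder="passthrough",  # numeric columns are kept unchanged
)
encoded = toy_transformer.fit_transform(toy)

# A category never seen during fit ("DA") is encoded as all zeros
# instead of raising an error, thanks to handle_unknown="ignore".
unseen = pd.DataFrame({
    "Bootcamp Course": ["DA"],
    "Bootcamp Format": ["FT"],
    "Discount": [0],
}).astype({"Bootcamp Course": object, "Bootcamp Format": object})
unseen_encoded = toy_transformer.transform(unseen)

print(encoded.shape)          # 2 course + 2 format columns + 1 passthrough
print(unseen_encoded.sum())   # only the "FT" indicator is set
```

Depending on how sparse the result is, the transformer may return either a SciPy sparse matrix or a dense array, so downstream steps should not assume one or the other.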

In [77]:
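from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a sparse matrix to a dense array for estimators that need one."""

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing to learn.
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray; todense() returns np.matrix, which
        # newer scikit-learn versions no longer accept. Input that is
        # already dense passes through untouched.
        return X.toarray() if hasattr(X, "toarray") else X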
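The dense step matters because some of the classifiers used later, such as `GaussianNB`, do not accept sparse input. The sketch below shows the same idea with a plain function wrapped in `FunctionTransformer`; the helper `to_dense` and the toy data are assumptions for the example, not part of the notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: return an ndarray for sparse or dense input."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Made-up sparse feature matrix with a binary target.
X_sparse = sparse.csr_matrix(
    np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
)
y_toy = np.array([0, 1, 0, 1])

# The densifying step sits between the (sparse) features and the classifier.
dense_pipe = make_pipeline(FunctionTransformer(to_dense), GaussianNB())
dense_pipe.fit(X_sparse, y_toy)
preds = dense_pipe.predict(X_sparse)
print(preds.shape)
```

Densifying a wide one-hot matrix can use a lot of memory, so it is worth applying this step only to the models that require it.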

## SelectFromModel

In [78]:
def run_multiple_models(X,y):
    
    dfs = []
    models = [
        ('LR_model', LogisticRegression(max_iter=5000)),
        ('SVC_model', svm.SVC(max_iter=5000)),
        ('KNN_model', KNeighborsClassifier(n_neighbors=5)),
        ('LDA_model', LinearDiscriminantAnalysis()),
        ('GNB_model', GaussianNB()),
        ('RFC_model', RandomForestClassifier()),
        ('MLPC_model', MLPClassifier(max_iter=5000))
        ]
    
 
    results = []
    
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    target_names = ['Not Paid Deposit', 'Paid Deposit']

    
    for name, model in models:
        
        pipe = Pipeline([
            ('encoder',col_transformer_e),
            ('to_dense', DenseTransformer()),
            ('classifier', model)
        ])
        
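        # Stratified k-fold cross-validation, scored on accuracy only.
        cv = cross_val_score(pipe, X, y, cv=StratifiedKFold(), scoring = "accuracy")
        # Refit on the full data; the report printed below is therefore in-sample.
        clf = pipe.fit(X, y)
        y_pred = clf.predict(X)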

        
        print(name)
        print(classification_report(y, y_pred, target_names=target_names))
        
        results.append(cv)
        names.append(name)
        
        final_df = pd.DataFrame(cv)
        final_df['model'] = name
        final_df.rename(columns={0:'Cross Validaton Score'},inplace=True)
        dfs.append(final_df)
        
    # Concatenate once, after every model has been scored.
    final = pd.concat(dfs, ignore_index=True)

    return final
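The cells shown here do not include the definition of `X` and `y`. A minimal sketch of one plausible split follows, assuming `Paid Deposit` is the target (as the `target_names` in the report suggest). The frame `apps_toy` and its values are made up; in the notebook the same two lines would be applied to `apps`.

```python
import pandas as pd

# Hypothetical miniature of `apps`; values are invented for illustration.
apps_toy = pd.DataFrame({
    "Bootcamp Course": ["WD", "UX", "UX", "WD"],
    "Paid Deposit": [0, 0, 1, 1],
    "Discount": [0, 0, 0, 1],
})

# Assumption: 'Paid Deposit' is the column being predicted.
target_col = "Paid Deposit"
X_toy = apps_toy.drop(columns=[target_col])  # every other column is a feature
y_toy = apps_toy[target_col]

print(X_toy.columns.tolist())
print(y_toy.value_counts().to_dict())
```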

In [79]:
run_multiple_models(X,y)

LR_model
                  precision    recall  f1-score   support

Not Paid Deposit       0.86      0.96      0.91     23556
    Paid Deposit       0.62      0.27      0.38      5167

        accuracy                           0.84     28723
       macro avg       0.74      0.62      0.64     28723
    weighted avg       0.82      0.84      0.81     28723





SVC_model
                  precision    recall  f1-score   support

Not Paid Deposit       0.79      0.06      0.11     23556
    Paid Deposit       0.18      0.93      0.30      5167

        accuracy                           0.21     28723
       macro avg       0.48      0.49      0.20     28723
    weighted avg       0.68      0.21      0.14     28723

KNN_model
                  precision    recall  f1-score   support

Not Paid Deposit       0.90      0.97      0.93     23556
    Paid Deposit       0.76      0.48      0.59      5167

        accuracy                           0.88     28723
       macro avg       0.83      0.73      0.76     28723
    weighted avg       0.87      0.88      0.87     28723

LDA_model
                  precision    recall  f1-score   support

Not Paid Deposit       0.86      0.96      0.91     23556
    Paid Deposit       0.62      0.32      0.42      5167

        accuracy                           0.84     28723
       macro avg       0.74      0

Unnamed: 0,Cross Validaton Score,model
0,0.831506,LR_model
1,0.815492,LR_model
2,0.834813,LR_model
3,0.84227,LR_model
4,0.851323,LR_model
5,0.823151,SVC_model
6,0.821062,SVC_model
7,0.821062,SVC_model
8,0.820508,SVC_model
9,0.820857,SVC_model
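With one row per fold, the models are easier to compare once the scores are aggregated. The sketch below does this for the ten rows visible in the output above (the remaining models are not shown there); the names `scores` and `summary` are assumptions for the example.

```python
import pandas as pd

# Fold scores copied from the rows shown in the output above.
scores = pd.DataFrame({
    "Cross Validaton Score": [
        0.831506, 0.815492, 0.834813, 0.84227, 0.851323,
        0.823151, 0.821062, 0.821062, 0.820508, 0.820857,
    ],
    "model": ["LR_model"] * 5 + ["SVC_model"] * 5,
})

# Mean and spread of the cross-validation accuracy per model, best first.
summary = (
    scores.groupby("model")["Cross Validaton Score"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Applied to the full frame returned by `run_multiple_models`, the same `groupby` would rank all seven models at once.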


## Conclusion

RFC_model, followed by KNN_model and MLPC_model, are the models that work best with our data.

**In this project we will use recall as the evaluation metric, because in our case the cost of a false negative (failing to identify the people who will not enrol at Ironhack) is high.**
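Since the comparison above was cross-validated on accuracy, a recall-based check could look like the sketch below. It runs on synthetic, imbalanced data from `make_classification`, not on the Ironhack dataset, and every variable name in it is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data with roughly an 80/20 class split.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate accepts several scorers at once; "recall" scores the
# positive class (label 1) for a binary target.
cv_scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_syn,
    y_syn,
    cv=cv,
    scoring=["accuracy", "recall"],
)

mean_accuracy = cv_scores["test_accuracy"].mean()
mean_recall = cv_scores["test_recall"].mean()
print(f"accuracy={mean_accuracy:.3f}  recall={mean_recall:.3f}")
```

On imbalanced data the two numbers can differ considerably, which is why the choice of metric should match the cost described above.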

**We'll work with RFC and tune it to better predict Ironhack enrolments. RFC will also let us see which variables lead to a given result.**
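One common way to see which variables a random forest relies on is its `feature_importances_` attribute. The sketch below uses synthetic data with placeholder column names (`feature_0` to `feature_4`), so it only illustrates the mechanics, not results for this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 carry signal.
X_arr, y_imp = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]  # placeholder names
X_imp = pd.DataFrame(X_arr, columns=feature_names)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_imp, y_imp)

# Impurity-based importances; they are normalised to sum to 1.
importances = pd.Series(
    rfc.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

When the forest sits at the end of a pipeline with one-hot encoding, the importances refer to the encoded columns, so the feature names have to come from the fitted transformer rather than from the raw frame.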

The **Random Forest** is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. The picture below illustrates the classifier's trees:

<img src=https://images.deepai.org/glossary-terms/756367e26f1049d3989001c440109fa2/random.png>
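The "prediction by committee" described above can be inspected directly in scikit-learn, where a fitted forest exposes its individual trees in `estimators_`. The following sketch, on synthetic data with assumed variable names, checks that the forest's class probabilities are the average of its trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_rf, y_rf = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # feature randomness: a random subset of features per split
    random_state=0,
).fit(X_rf, y_rf)

# Each element of estimators_ is a fitted decision tree.
tree_probas = np.stack([tree.predict_proba(X_rf) for tree in forest.estimators_])

# The forest averages the per-tree class probabilities.
committee_proba = tree_probas.mean(axis=0)
matches = np.allclose(committee_proba, forest.predict_proba(X_rf))
print(len(forest.estimators_), matches)
```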