# 01 Baseline Models

## Imports

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
import pickle

## Read in the training and test data

In [2]:
with open('../../02_Data/02_Processed_Data/X_train.pkl', 'rb') as f:
    X_train = pickle.load(f)

with open('../../02_Data/02_Processed_Data/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)

with open('../../02_Data/02_Processed_Data/X_test.pkl', 'rb') as f:
    X_test = pickle.load(f)

with open('../../02_Data/02_Processed_Data/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

## Naive Logistic Regression

In [6]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
print('Train:', lr.score(X_train,y_train))
print('Test:', lr.score(X_test,y_test))

Train: 0.5129437869822485
Test: 0.4823529411764706


As expected, the untuned logistic regression performs poorly: test accuracy (0.48) falls below the 50/50 baseline, and train accuracy (0.51) is barely above it.
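The baseline that these scores are compared against can be computed rather than assumed. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The arrays `X_demo` and `y_demo` are hypothetical random stand-ins so that the cell runs on its own; in this notebook they would be replaced by `X_train`/`y_train` for fitting and `X_test`/`y_test` for scoring.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in data (NOT the project's data): random features
# and random binary labels, used only so this cell is self-contained.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

# A majority-class baseline ignores the features entirely and always
# predicts the most frequent label seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

# The same number computed by hand: the share of the largest class.
majority_share = np.bincount(y_demo).max() / len(y_demo)

print('Baseline accuracy:', baseline_acc)
print('Majority class share:', majority_share)
```

If the classes are balanced, this should come out close to 0.5, which is the threshold a useful model needs to clear.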

## Random Forest Classifier

In [5]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print('Train:', rf.score(X_train,y_train))
print('Test:',rf.score(X_test,y_test))

Train: 0.9818786982248521
Test: 0.5764705882352941


The untuned random forest classifier does considerably better, with test accuracy (0.58) at least beating the 50/50 baseline. However, the model is badly overfit: train accuracy is 0.98, while test accuracy is only 0.58. This is not surprising. With no maximum depth set, each tree can grow as deep as it needs to classify the training data almost perfectly, which causes significant overfitting.
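The effect of tree depth on the train/test gap can be illustrated with a small experiment. This is only a sketch on synthetic, deliberately noisy data from `make_classification`, not the project's dataset, and the value `max_depth=3` is an arbitrary illustrative choice rather than a tuned setting. The names `X_demo`, `y_demo`, `rf_demo`, and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the project's data). flip_y randomly
# reassigns a fraction of labels, so there is noise to overfit to.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=4,
    flip_y=0.3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit one forest with unlimited depth (the default) and one with
# shallow trees, then record (train accuracy, test accuracy) for each.
scores = {}
for depth in (None, 3):
    rf_demo = RandomForestClassifier(max_depth=depth, random_state=42)
    rf_demo.fit(X_tr, y_tr)
    scores[depth] = (rf_demo.score(X_tr, y_tr), rf_demo.score(X_te, y_te))
    print('max_depth =', depth,
          '| Train:', scores[depth][0],
          '| Test:', scores[depth][1])
```

On noisy data like this, the unlimited-depth forest would be expected to score much higher on the training set than the shallow one. Whether limiting depth also improves test accuracy on the real data is something only tuning on that data can show.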

## AdaBoost Classifier

In [7]:
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train, y_train)
print('Train:', ada.score(X_train,y_train))
print('Test:',ada.score(X_test,y_test))

Train: 0.6852810650887574
Test: 0.5529411764705883


## Gradient Boosting Classifier

In [8]:
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
print('Train:', gb.score(X_train,y_train))
print('Test:',gb.score(X_test,y_test))

Train: 0.8923816568047337
Test: 0.5647058823529412
