## Models to Categorize Storm Types
Code written by Drew Dyson and edited by Julia Taussig

This notebook builds models that classify storms by status. We were curious whether storm data alone could predict a storm's category in cases where NOAA has not yet categorized the storm.

Importing libraries:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split,cross_val_predict,cross_val_score,GridSearchCV
from sklearn.linear_model import LassoCV, LogisticRegressionCV, RidgeCV, BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD

## Reading in data and removing unnecessary columns

In [2]:
hurdat     = pd.read_csv('./data/hurdat_population.csv')
fatalities = pd.read_csv('./data/fatalities_final.csv')

### Hurdat

In [74]:
hurdat.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Date,Time,Event,Status,Latitude,Longitude,Maximum Wind,...,High Wind NW,Year,Month,Day,lat,lng,county_name,state_name,population,Population
0,0,AL011950,ABLE,19500812.0,0.0,,TS,17.1,55.5,35.0,...,,1950,8,12,17.0,-55.1,St. John,Virgin Islands,,0
1,1,AL011950,ABLE,19500812.0,600.0,,TS,17.7,56.3,40.0,...,,1950,8,12,17.1,-56.0,St. John,Virgin Islands,,0
2,2,AL011950,ABLE,19500812.0,1200.0,,TS,18.2,57.4,45.0,...,,1950,8,12,18.0,-57.1,St. John,Virgin Islands,,0
3,3,AL011950,ABLE,19500812.0,1800.0,,TS,19.0,58.6,50.0,...,,1950,8,12,19.0,-58.1,St. John,Virgin Islands,,0
4,4,AL011950,ABLE,19500813.0,0.0,,TS,20.0,60.0,50.0,...,,1950,8,13,20.0,-60.0,St. John,Virgin Islands,,0


In [75]:
hurdat.drop(['Unnamed: 0', 'population'], axis = 1, inplace = True)

In [76]:
hurdat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23729 entries, 0 to 23728
Data columns (total 30 columns):
ID                  23729 non-null object
Name                23729 non-null object
Date                23729 non-null float64
Time                23729 non-null float64
Event               574 non-null object
Status              23729 non-null object
Latitude            23729 non-null float64
Longitude           23729 non-null float64
Maximum Wind        23729 non-null float64
Minimum Pressure    18281 non-null float64
Low Wind NE         7045 non-null float64
Low Wind SE         7045 non-null float64
Low Wind SW         7045 non-null float64
Low Wind NW         7045 non-null float64
Moderate Wind NE    7045 non-null float64
Moderate Wind SE    7045 non-null float64
Moderate Wind SW    7045 non-null float64
Moderate Wind NW    7045 non-null float64
High Wind NE        7045 non-null float64
High Wind SE        7045 non-null float64
High Wind SW        7045 non-null float64
High 

In [77]:
hurdat_w_wind = hurdat.dropna(subset = ['Low Wind NE'])

In [78]:
hurdat_w_wind.head()


Unnamed: 0,ID,Name,Date,Time,Event,Status,Latitude,Longitude,Maximum Wind,Minimum Pressure,...,High Wind SW,High Wind NW,Year,Month,Day,lat,lng,county_name,state_name,Population
16604,AL012004,ALEX,20040731.0,1800.0,,TD,30.3,78.3,25.0,1010.0,...,0.0,0.0,2004,7,31,30.1,-78.0,Brevard,Florida,544439
16605,AL012004,ALEX,20040801.0,0.0,,TD,31.0,78.8,25.0,1009.0,...,0.0,0.0,2004,8,1,31.0,-78.1,Charleston,South Carolina,357492
16606,AL012004,ALEX,20040801.0,600.0,,TD,31.5,79.0,25.0,1009.0,...,0.0,0.0,2004,8,1,31.1,-79.0,Charleston,South Carolina,357492
16607,AL012004,ALEX,20040801.0,1200.0,,TD,31.6,79.1,30.0,1009.0,...,0.0,0.0,2004,8,1,31.1,-79.0,Charleston,South Carolina,357492
16608,AL012004,ALEX,20040801.0,1800.0,,TS,31.6,79.2,35.0,1009.0,...,0.0,0.0,2004,8,1,31.1,-79.0,Charleston,South Carolina,357492


In [79]:
hurdat_w_wind.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7045 entries, 16604 to 23728
Data columns (total 30 columns):
ID                  7045 non-null object
Name                7045 non-null object
Date                7045 non-null float64
Time                7045 non-null float64
Event               175 non-null object
Status              7045 non-null object
Latitude            7045 non-null float64
Longitude           7045 non-null float64
Maximum Wind        7045 non-null float64
Minimum Pressure    7045 non-null float64
Low Wind NE         7045 non-null float64
Low Wind SE         7045 non-null float64
Low Wind SW         7045 non-null float64
Low Wind NW         7045 non-null float64
Moderate Wind NE    7045 non-null float64
Moderate Wind SE    7045 non-null float64
Moderate Wind SW    7045 non-null float64
Moderate Wind NW    7045 non-null float64
High Wind NE        7045 non-null float64
High Wind SE        7045 non-null float64
High Wind SW        7045 non-null float64
High Wind N

In [80]:
hurdat_minpres = hurdat.dropna(subset = ['Minimum Pressure'])

In [81]:
hurdat_minpres.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18281 entries, 7 to 23728
Data columns (total 30 columns):
ID                  18281 non-null object
Name                18281 non-null object
Date                18281 non-null float64
Time                18281 non-null float64
Event               515 non-null object
Status              18281 non-null object
Latitude            18281 non-null float64
Longitude           18281 non-null float64
Maximum Wind        18281 non-null float64
Minimum Pressure    18281 non-null float64
Low Wind NE         7045 non-null float64
Low Wind SE         7045 non-null float64
Low Wind SW         7045 non-null float64
Low Wind NW         7045 non-null float64
Moderate Wind NE    7045 non-null float64
Moderate Wind SE    7045 non-null float64
Moderate Wind SW    7045 non-null float64
Moderate Wind NW    7045 non-null float64
High Wind NE        7045 non-null float64
High Wind SE        7045 non-null float64
High Wind SW        7045 non-null float64
High 

In [82]:
hurdat_landfall = hurdat[hurdat['Event'] == 'L']

In [83]:
hurdat_landfall.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 511 entries, 39 to 23706
Data columns (total 30 columns):
ID                  511 non-null object
Name                511 non-null object
Date                511 non-null float64
Time                511 non-null float64
Event               511 non-null object
Status              511 non-null object
Latitude            511 non-null float64
Longitude           511 non-null float64
Maximum Wind        511 non-null float64
Minimum Pressure    452 non-null float64
Low Wind NE         150 non-null float64
Low Wind SE         150 non-null float64
Low Wind SW         150 non-null float64
Low Wind NW         150 non-null float64
Moderate Wind NE    150 non-null float64
Moderate Wind SE    150 non-null float64
Moderate Wind SW    150 non-null float64
Moderate Wind NW    150 non-null float64
High Wind NE        150 non-null float64
High Wind SE        150 non-null float64
High Wind SW        150 non-null float64
High Wind NW        150 non-null fl

### Fatalities

In [84]:
fatalities.head()

Unnamed: 0.1,Unnamed: 0,Hurricane ID,Hurricane Name,Hurricane Year,Hurricane Month,_name,state_name,Number of Fatalities,lat,lng,population
0,0,AL071969,BLANCHE,1969,8,,,0,,,0
1,1,AL081969,DEBBIE,1969,8,,,0,,,0
2,2,AL091969,CAMILLE,1969,8,Nelson,Virginia,153,37.9,-78.9,15018
3,3,AL091969,CAMILLE,1969,8,Harrison,Mississippi,24,39.4,-80.3,190928
4,4,AL091969,CAMILLE,1969,8,,Alabama,60,,,0


In [85]:
fatalities.drop('Unnamed: 0', axis = 1, inplace = True)

In [86]:
fatalities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604 entries, 0 to 603
Data columns (total 10 columns):
Hurricane ID            604 non-null object
Hurricane Name          604 non-null object
Hurricane Year          604 non-null int64
Hurricane Month         604 non-null int64
_name                   395 non-null object
state_name              410 non-null object
Number of Fatalities    566 non-null object
lat                     391 non-null float64
lng                     391 non-null float64
population              604 non-null int64
dtypes: float64(2), int64(3), object(5)
memory usage: 47.3+ KB


In [87]:
fatalities[fatalities['Number of Fatalities'].isnull()]

Unnamed: 0,Hurricane ID,Hurricane Name,Hurricane Year,Hurricane Month,_name,state_name,Number of Fatalities,lat,lng,population
271,AL091998,IVAN,1998,9,,,,,,0
272,AL101998,JEANNE,1998,9,,,,,,0
273,AL111998,KARL,1998,9,,,,,,0
274,AL121998,LISA,1998,10,,,,,,0
365,AL072004,GASTON,2004,8,,,,,,0
427,AL132007,LORENZO,2007,9,,,,,,0
428,AL162007,NOEL,2007,11,,,,,,0
442,AL152008,OMAR,2008,10,,,,,,0
443,AL172008,PALOMA,2008,11,,,,,,0
446,AL072009,FRED,2009,9,,,,,,0


In [88]:
fatalities.dropna(inplace = True)

In [89]:
fatalities.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 391 entries, 2 to 602
Data columns (total 10 columns):
Hurricane ID            391 non-null object
Hurricane Name          391 non-null object
Hurricane Year          391 non-null int64
Hurricane Month         391 non-null int64
_name                   391 non-null object
state_name              391 non-null object
Number of Fatalities    391 non-null object
lat                     391 non-null float64
lng                     391 non-null float64
population              391 non-null int64
dtypes: float64(2), int64(3), object(5)
memory usage: 33.6+ KB


In [90]:
fatalities['Number of Fatalities'] = fatalities['Number of Fatalities'].astype(int)
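
`Number of Fatalities` loads as an `object` column, so `astype(int)` only succeeds once every remaining value parses cleanly. A minimal sketch of a more forgiving conversion follows; the small frame is made up for illustration and is not taken from `fatalities_final.csv`, and treating a missing count as zero is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the fatalities table; values are illustrative only.
demo = pd.DataFrame({
    'Hurricane ID': ['AL01', 'AL01', 'AL02', 'AL03'],
    'Number of Fatalities': ['153', '24', None, 'unknown'],
})

# errors='coerce' turns anything unparseable into NaN instead of raising.
counts = pd.to_numeric(demo['Number of Fatalities'], errors='coerce')

# Decide explicitly what a missing count means; here it is treated as zero.
demo['Number of Fatalities'] = counts.fillna(0).astype(int)
print(demo)
```

Whether zero is the right fill value depends on how the source data encodes "no report"; dropping those rows, as the notebook does, is the more conservative choice.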

## Adding Total Fatalities


In [91]:
fatalities.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 391 entries, 2 to 602
Data columns (total 10 columns):
Hurricane ID            391 non-null object
Hurricane Name          391 non-null object
Hurricane Year          391 non-null int64
Hurricane Month         391 non-null int64
_name                   391 non-null object
state_name              391 non-null object
Number of Fatalities    391 non-null int64
lat                     391 non-null float64
lng                     391 non-null float64
population              391 non-null int64
dtypes: float64(2), int64(4), object(4)
memory usage: 33.6+ KB


In [92]:
fatalities.reset_index(drop = True, inplace = True)

In [93]:
fatalities['Number of Fatalities'].fillna(0, inplace = True)

In [94]:
fatalities[fatalities['Hurricane ID'] == 'AL101989']['Number of Fatalities'].astype(int).sum()

4

In [95]:
# h_ids = []

# for h_id in fatalities['Hurricane ID']:
#     h_ids.append(h_id)
# #     if h_id not in h_ids:
# #         total = fatalities[fatalities['Hurricane ID'] == h_id]['Number of Fatalities'].astype(int).sum()
# #         h_ids.append(h_id)
        
    

In [96]:
#Trying to add fatalities by hurricane ID

# id_fatal = {(h_id: total = fatalities[fatalities['Hurricane ID'] == h_id]['Number of Fatalities'].astype(int).sum()) for h_id in fatalities['Hurricane ID']}



In [97]:
#Finding sum of fatalities of hurricane based on hurricane ID and placing in list that will go in
#fatalities_total column:

# Code from Julia 

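list_fatalities_sums = []
for hurr_i in list(fatalities['Hurricane ID']):
    # .unique() adds each distinct count once per storm, so repeated counts are not double-added:
    # the CELIA rows shown below (8, 8, 7) give 15, not 23. Use .sum() alone if every row is a separate count.
    hurr_i_sum_fat = fatalities[fatalities['Hurricane ID']==hurr_i]['Number of Fatalities'].unique().sum()
    list_fatalities_sums.append(hurr_i_sum_fat)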

In [98]:
fatalities['Total Fatalities'] = list_fatalities_sums
fatalities.head()

Unnamed: 0,Hurricane ID,Hurricane Name,Hurricane Year,Hurricane Month,_name,state_name,Number of Fatalities,lat,lng,population,Total Fatalities
0,AL091969,CAMILLE,1969,8,Nelson,Virginia,153,37.9,-78.9,15018,177
1,AL091969,CAMILLE,1969,8,Harrison,Mississippi,24,39.4,-80.3,190928,177
2,AL041970,CELIA,1970,8,Escambia,Florida,8,30.4,-87.3,299366,15
3,AL041970,CELIA,1970,8,Nueces,Texas,8,27.7,-97.4,343225,15
4,AL041970,CELIA,1970,8,Jim Wells,Texas,7,27.5,-98.1,41206,15
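
The per-storm total can also be computed without an explicit loop. The sketch below uses the CAMILLE and CELIA rows from the output above and shows both variants side by side: a plain sum over all rows of a storm, and the sum of distinct counts that the loop produces. The column names `sum_all_rows` and `sum_unique` are ours, chosen for illustration.

```python
import pandas as pd

# Rows copied from the head() output above (CAMILLE and CELIA only).
demo = pd.DataFrame({
    'Hurricane ID': ['AL091969', 'AL091969', 'AL041970', 'AL041970', 'AL041970'],
    'Number of Fatalities': [153, 24, 8, 8, 7],
})

by_storm = demo.groupby('Hurricane ID')['Number of Fatalities']

# Plain per-storm sum, broadcast back to every row of that storm.
demo['sum_all_rows'] = by_storm.transform('sum')

# Same result as the loop: each distinct count is added once per storm.
demo['sum_unique'] = by_storm.transform(lambda s: s.drop_duplicates().sum())

print(demo)
```

Which variant is correct depends on whether two rows with the same count are duplicate reports or separate counties that happen to match; the data alone does not say.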


In [99]:
list_fatalities_hurdat_sums = []
for hurr_i in list(hurdat['ID']):  # hurdat stores the storm identifier in 'ID', not 'Hurricane ID'
    hurr_i_sum_fat = fatalities[fatalities['Hurricane ID']==hurr_i]['Number of Fatalities'].unique().sum()
    list_fatalities_hurdat_sums.append(hurr_i_sum_fat)

In [None]:
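# Per-storm fatality totals keyed by Hurricane ID (plain sum over each storm's rows).
id_fatal = fatalities.groupby('Hurricane ID')['Number of Fatalities'].sum().to_dict()
id_fatal['AL091969']

In [None]:
def fatal_hurdat(h_id):
    # Storms with no fatality record get 0.
    return id_fatal.get(h_id, 0)

# hurdat stores the storm identifier in the 'ID' column.
hurdat['Total Fatal'] = hurdat['ID'].apply(fatal_hurdat)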
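
Attaching a per-storm total to every track point is a lookup by storm ID, which `Series.map` handles directly. This is a sketch on two hypothetical miniature tables (`fatal_demo`, `hurdat_demo`), assuming that a storm with no fatality record should get 0.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
fatal_demo = pd.DataFrame({
    'Hurricane ID': ['AL091969', 'AL091969', 'AL041970'],
    'Number of Fatalities': [153, 24, 8],
})
hurdat_demo = pd.DataFrame({'ID': ['AL011950', 'AL091969', 'AL041970', 'AL091969']})

# One total per storm, indexed by Hurricane ID.
totals = fatal_demo.groupby('Hurricane ID')['Number of Fatalities'].sum()

# map() looks each ID up in the totals; IDs with no fatality record become NaN, then 0.
hurdat_demo['Total Fatal'] = hurdat_demo['ID'].map(totals).fillna(0).astype(int)
print(hurdat_demo)
```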

## Exploring the Data

In [None]:
plt.figure(figsize = (10,10))
sns.heatmap(fatalities.corr(), 
            annot= True, cmap= 'viridis');

# Interesting: lat and Hurricane Year are slightly positively correlated.
# Other than that, there are no significant correlations.

In [None]:
plt.figure(figsize = (10,10))
sns.heatmap(fatalities.corr()[['Number of Fatalities']].sort_values('Number of Fatalities'), 
            annot= True, cmap= 'viridis');


In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(hurdat.corr(), 
            annot= True, cmap= 'viridis');

In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(hurdat_landfall.corr(), 
            annot= True, cmap= 'viridis');

In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(hurdat_minpres.corr(), 
            annot= True, cmap= 'viridis');

In [None]:
plt.figure(figsize = (12,12))
sns.heatmap(hurdat_w_wind.corr(), 
            annot= True, cmap= 'viridis');
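
A full heatmap of 20+ columns is hard to read. When the question is how features relate to one column, a sorted correlation list can be easier to scan. The sketch below uses synthetic data (the wind/pressure relationship is invented for illustration) and selects numeric columns explicitly, which also avoids errors from text columns in recent pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
wind = rng.uniform(20, 150, n)
demo = pd.DataFrame({
    'Maximum Wind': wind,
    # Synthetic pressure that falls as wind rises, plus noise (illustrative only).
    'Minimum Pressure': 1015 - 0.6 * wind + rng.normal(0, 5, n),
    'Month': rng.integers(6, 12, n),
    'Name': ['DEMO'] * n,  # non-numeric column, excluded below
})

# Keep numeric columns only, then rank every feature by its correlation with one column.
corr_with_wind = (
    demo.select_dtypes('number')
        .corr()['Maximum Wind']
        .drop('Maximum Wind')
        .sort_values()
)
print(corr_with_wind)
```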

## Model for Hurricane Status Classification

In [100]:
hurdat_w_wind['Status'].value_counts()

TS    2479
HU    1565
LO     984
TD     917
EX     846
SS      92
DB      81
WV      58
SD      23
Name: Status, dtype: int64

In [104]:
pd.set_option('display.max_rows', 100000)
hurdat_w_wind.head(1)

Unnamed: 0,ID,Name,Date,Time,Event,Status,Latitude,Longitude,Maximum Wind,Minimum Pressure,...,High Wind SW,High Wind NW,Year,Month,Day,lat,lng,county_name,state_name,Population
16604,AL012004,ALEX,20040731.0,1800.0,,TD,30.3,78.3,25.0,1010.0,...,0.0,0.0,2004,7,31,30.1,-78.0,Brevard,Florida,544439


In [105]:
X = hurdat_w_wind.drop(['ID', 'Name', 'Date', 'Event', 'Status', 'Latitude', 'Longitude', 'county_name', 'state_name', 'Population', 'Year', 'Day'], axis = 1)
y = hurdat_w_wind['Status']

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)
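
With classes as unbalanced as `Status` (2479 `TS` rows against 23 `SD` rows), `stratify = y` matters: it keeps each class's share roughly equal in the training and test splits. A small check on synthetic labels (the counts are illustrative, not the real ones):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same kind of imbalance as the Status column.
y_demo = pd.Series(['TS'] * 500 + ['HU'] * 300 + ['SD'] * 20)
X_demo = pd.DataFrame({'feature': np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

# Class shares in the full data, the training split, and the test split.
shares = pd.DataFrame({
    'all': y_demo.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
})
print(shares.round(3))
```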

In [114]:
results = pd.DataFrame(columns= ['model','train', 'test'])
results

Unnamed: 0,model,train,test


## KNN Model Hurricane Status

In [108]:
knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [109]:
knn.score(X_train, y_train)

0.8578459208782888

In [110]:
knn.score(X_test, y_test)

0.7758229284903518

In [111]:
knn_train_results = knn.score(X_train, y_train)
knn_train_results

0.8578459208782888

In [112]:
knn_test_results = knn.score(X_test, y_test)
knn_test_results

0.7758229284903518

In [115]:
results = results.append({'model':'KNN', 'train' :knn_train_results, 'test': knn_test_results}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
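
KNN is distance-based, so columns with large ranges (pressure near 1000, wind radii in the hundreds) dominate columns such as `Month`. The notebook imports `Pipeline` but fits KNN on raw features; the sketch below shows how a scaling step could be added. It runs on synthetic data with one inflated column, so it says nothing about whether scaling would help on the storm data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the storm features; one column is blown up to mimic mixed units.
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=4,
                                     random_state=42)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

# The scaler is fit on the training data only, then reused when the pipeline scores new data.
knn_scaled = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
knn_scaled.fit(X_tr, y_tr)

knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

print('raw   :', knn_raw.score(X_te, y_te))
print('scaled:', knn_scaled.score(X_te, y_te))
```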


## Logistic Regression Hurricane Status

In [116]:
logreg = LogisticRegressionCV(cv= 5)

In [117]:
logreg.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0)

In [118]:
logreg_train_results = logreg.score(X_train, y_train)
logreg_train_results

0.8374029907249668

In [119]:
logreg_test_results = logreg.score(X_test, y_test)
logreg_test_results

0.8251986379114642

In [120]:
results = results.append({'model': 'LogReg', 'train': logreg_train_results, 'test': logreg_test_results}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
1,LogReg,0.837403,0.825199


## Naive Bayes Model Hurricane Status

In [123]:
nb = MultinomialNB()

nb.fit(abs(X_train), y_train)

nb_train_results = nb.score(abs(X_train), y_train)
nb_train_results

0.48741245504448233

In [124]:
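# X_test is scored without abs(), unlike the training data, so negative columns
# (e.g. lng) differ between fitting and scoring.
nb_test_results = nb.score(X_test, y_test)
nb_test_results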

0.4551645856980704

In [125]:
results = results.append({'model': 'NaivBays', 'train': nb_train_results, 'test': nb_test_results}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
1,LogReg,0.837403,0.825199
2,NaivBays,0.487412,0.455165
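
`MultinomialNB` expects non-negative inputs, which is why the cell above wraps the training data in `abs()`. One alternative is to rescale every column into [0, 1] inside a pipeline, so that training and test data always receive the same transformation. This is a sketch on synthetic data and a swapped-in technique, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Synthetic features that include negative values, like the lng column.
X_demo = rng.normal(loc=-50, scale=20, size=(400, 3))
y_demo = (X_demo[:, 0] > -50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

# MinMaxScaler maps each training column to [0, 1]; clip=True keeps unseen test
# values inside that range, so MultinomialNB never sees a negative input.
nb_pipe = Pipeline([
    ('scale', MinMaxScaler(clip=True)),
    ('nb', MultinomialNB()),
])
nb_pipe.fit(X_tr, y_tr)
print(nb_pipe.score(X_tr, y_tr), nb_pipe.score(X_te, y_te))
```

Multinomial Naive Bayes is designed for count-like features; for continuous measurements such as wind speed and pressure, `GaussianNB` may be a more natural fit.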


## Decision Tree Hurricane Status

In [126]:
dtc = DecisionTreeClassifier()


dtc.fit(X_train, y_train)


dtc_train_results = dtc.score(X_train, y_train)
dtc_test_results = dtc.score(X_test, y_test)

print(dtc_train_results)
print(dtc_test_results)
# overfit, with high variance

0.9998107136096914
0.8825198637911464
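
The gap between a near-perfect training score and a lower test score is the usual sign of an unpruned tree memorising its training data. The sketch below, on synthetic data with some label noise, compares an unlimited tree with a depth-limited one and adds a cross-validated score as a less optimistic estimate. The depth of 3 is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=800, n_features=10, n_informative=5,
                                     flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

scores = {}
for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds of the training data.
    cv_mean = cross_val_score(tree, X_tr, y_tr, cv=5).mean()
    tree.fit(X_tr, y_tr)
    scores[depth] = {
        'train': tree.score(X_tr, y_tr),
        'cv': cv_mean,
        'test': tree.score(X_te, y_te),
    }
    print(depth, scores[depth])
```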


In [127]:
results = results.append({'model': 'DecTreeClas', 'train': dtc_train_results, 'test': dtc_test_results}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
1,LogReg,0.837403,0.825199
2,NaivBays,0.487412,0.455165
3,DecTreeClas,0.999811,0.88252


## Bagging Classifier Hurricane Status

In [128]:
bag = BaggingClassifier()

bag.fit(X_train, y_train)

bag_train_results = bag.score(X_train, y_train)
bag_test_results  = bag.score(X_test, y_test)

print(bag_train_results)
print(bag_test_results)

0.9941321219004353
0.8944381384790011


In [129]:
results = results.append({'model': 'BagClas', 'train': bag_train_results, 'test': bag_test_results}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
1,LogReg,0.837403,0.825199
2,NaivBays,0.487412,0.455165
3,DecTreeClas,0.999811,0.88252
4,BagClas,0.994132,0.894438


## Random Forest Hurricane Status

In [130]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

rf_train_result = rf.score(X_train, y_train)
rf_test_result  = rf.score(X_test, y_test)

print(rf_train_result)
print(rf_test_result)

0.9950785538519781
0.8842224744608399
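
Overall accuracy can hide poor performance on rare classes such as `SD` or `WV`. `classification_report` is imported at the top of the notebook but never used; the sketch below shows what it reports, on a synthetic three-class problem with one rare class (class weights are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class problem with one rare class.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=6,
                                     n_classes=3, weights=[0.6, 0.35, 0.05],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Precision, recall and F1 for every class, not just overall accuracy.
report = classification_report(y_te, rf_demo.predict(X_te), zero_division=0)
print(report)
```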




In [131]:
results = results.append({'model': 'RandFor', 'train': rf_train_result, 'test': rf_test_result}, ignore_index= True)
results

Unnamed: 0,model,train,test
0,KNN,0.857846,0.775823
1,LogReg,0.837403,0.825199
2,NaivBays,0.487412,0.455165
3,DecTreeClas,0.999811,0.88252
4,BagClas,0.994132,0.894438
5,RandFor,0.995079,0.884222
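
`DataFrame.append`, used above to grow the results table, was deprecated in pandas 1.4 and removed in pandas 2.0. A version-independent pattern is to collect one dict per model and build the frame once. The helper name `record` is hypothetical; the scores are copied from the table above.

```python
import pandas as pd

rows = []  # one dict per model


def record(model_name, train_score, test_score):
    """Store one model's scores (hypothetical helper, for illustration)."""
    rows.append({'model': model_name, 'train': train_score, 'test': test_score})


# Scores copied from the table above.
record('KNN', 0.857846, 0.775823)
record('LogReg', 0.837403, 0.825199)
record('RandFor', 0.995079, 0.884222)

results_demo = pd.DataFrame(rows).sort_values('test', ascending=False)
print(results_demo)
```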


## Grid Search Hurricane Status

In [132]:
grid_results = pd.DataFrame(columns= ['model', 'train_grid', 'test_grid'])
grid_results

Unnamed: 0,model,train_grid,test_grid


## Grid Search for KNN

In [133]:
gs_knn = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [5, 15, 25], 'weights': ['uniform', 'distance']}
                      , cv=5, return_train_score=True)

gs_knn.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': [5, 15, 25], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [134]:
gs_knn.best_params_

{'n_neighbors': 5, 'weights': 'distance'}

In [135]:
grid_train_knn = gs_knn.score(X_train, y_train)
grid_test_knn  = gs_knn.score(X_test, y_test)

print(grid_train_knn)
print(grid_test_knn)

0.9998107136096914
0.7905788876276958


In [136]:
grid_results = grid_results.append({'model': 'KNN', 'train_grid': grid_train_knn, 'test_grid': grid_test_knn}, ignore_index= True)
grid_results

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
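
The jump in KNN's training score to about 1.0 is probably an effect of `weights='distance'` and not a sign of a better model: every training point is its own nearest neighbour at distance zero, so it votes for its own label. The cross-validated `best_score_` is the fairer number to compare. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     flip_y=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

gs_demo = GridSearchCV(KNeighborsClassifier(),
                       {'n_neighbors': [5, 15], 'weights': ['distance']},
                       cv=5)
gs_demo.fit(X_tr, y_tr)

# Each training point sits at distance 0 from itself, so distance weighting
# reproduces the training labels exactly.
print('train score  :', gs_demo.score(X_tr, y_tr))
# Mean accuracy on the held-out CV folds for the best parameter setting.
print('best CV score:', gs_demo.best_score_)
print('test score   :', gs_demo.score(X_te, y_te))
```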


## Grid Search for Logreg

In [137]:
gs_logreg = GridSearchCV(LogisticRegressionCV(), {'cv': [3, 5, 10], 'Cs': [10, 1]}
                      , cv=5, return_train_score=True);

gs_logreg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegressionCV(Cs=10, class_weight=None, cv='warn', dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='warn', n_jobs=None, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cv': [3, 5, 10], 'Cs': [10, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [138]:
gs_logreg.best_params_

{'Cs': 10, 'cv': 10}
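
The search above tunes `LogisticRegressionCV`'s own `cv` and `Cs` settings, so cross-validation runs inside cross-validation. A simpler arrangement is to grid-search the regularisation strength `C` of a plain `LogisticRegression`. The sketch below assumes standardised inputs and an illustrative grid, and runs on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Search the regularisation strength directly; the grid values are illustrative.
c_grid = np.logspace(-3, 2, 6)
gs_demo = GridSearchCV(pipe, {'logreg__C': c_grid}, cv=5)
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_, round(gs_demo.best_score_, 3))
```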

In [139]:
logreg_grid_train   = gs_logreg.score(X_train, y_train)
logreg_grid_test    = gs_logreg.score(X_test, y_test)

print(logreg_grid_train)
print(logreg_grid_test)

0.8366458451637327
0.8263337116912599


In [140]:
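# Caution: 'test_grid' is filled from logreg_test_results (the untuned model's score),
# not logreg_grid_test, so the table shows 0.825199 where the grid search scored 0.826334.
grid_results = grid_results.append({'model': 'LogReg', 'train_grid': logreg_grid_train, 'test_grid': logreg_test_results}, ignore_index= True)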
grid_results

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
1,LogReg,0.836646,0.825199


## Grid Search for Naive Bayes

In [142]:
gs_navbay = GridSearchCV(MultinomialNB(), {'alpha': [1.0, 0.5, 0.1], 'fit_prior': [True, False]}
                      , cv=5, return_train_score=True);

gs_navbay.fit(abs(X_train), y_train)

navbay_grid_train   = gs_navbay.score(abs(X_train), y_train)
navbay_grid_test    = gs_navbay.score(abs(X_test), y_test)

print(navbay_grid_train)
print(navbay_grid_test)

0.48741245504448233
0.4880817253121453


In [143]:
grid_results = grid_results.append({'model': 'NaivBays', 'train_grid': navbay_grid_train, 'test_grid': navbay_grid_test}, ignore_index= True)
grid_results

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
1,LogReg,0.836646,0.825199
2,NaivBays,0.487412,0.488082


## Grid Search for Decision Tree Classification

In [144]:
gs_dtc = GridSearchCV(DecisionTreeClassifier(), {'criterion': ['gini', 'entropy'], 
                                                    'max_depth': [None, 5, 10], 'min_samples_split': [2, 5],
                                                   'max_features': [1.0, 0.8, 0.5] },
                         cv=5, return_train_score=True);

gs_dtc.fit(X_train, y_train)

dtc_grid_train   = gs_dtc.score(X_train, y_train)
dtc_grid_test    = gs_dtc.score(X_test, y_test)

print(dtc_grid_train)
print(dtc_grid_test)

0.9998107136096914
0.880249716231555


In [145]:
grid_results = grid_results.append({'model': 'DecTreeClas', 'train_grid': dtc_grid_train, 'test_grid': dtc_grid_test}, ignore_index= True)
grid_results

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
1,LogReg,0.836646,0.825199
2,NaivBays,0.487412,0.488082
3,DecTreeClas,0.999811,0.88025


## Grid Search Bag Classifier

In [146]:
gs_bag = GridSearchCV(BaggingClassifier(), {'n_estimators':  [5, 10, 15], 
                                                    'max_samples': [1.0, 0.8, 0.5], 'bootstrap_features': [True, False],
                                                   'max_features': [1.0, 0.8, 0.5] },
                         cv=5, return_train_score=True);

gs_bag.fit(X_train, y_train)

bag_grid_train   = gs_bag.score(X_train, y_train)
bag_grid_test    = gs_bag.score(X_test, y_test)

print(bag_grid_train)
print(bag_grid_test)

0.9967821313647549
0.9001135073779796


In [147]:
grid_results = grid_results.append({'model': 'BagClas', 'train_grid': bag_grid_train, 'test_grid': bag_grid_test}, ignore_index= True)
grid_results

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
1,LogReg,0.836646,0.825199
2,NaivBays,0.487412,0.488082
3,DecTreeClas,0.999811,0.88025
4,BagClas,0.996782,0.900114


## Random Forest Grid Search

In [149]:
gs_rf = GridSearchCV(RandomForestClassifier(), {'n_estimators':  [5, 10, 15], 'criterion': ['gini', 'entropy'], 
                                                'max_depth': [None, 5, 10], 'min_samples_split': [2, 5],
                                                'max_features': [1.0, 0.8, 0.5] },
                         cv=5, return_train_score=True);

gs_rf.fit(X_train, y_train)

rf_grid_train   = gs_rf.score(X_train, y_train)
rf_grid_test    = gs_rf.score(X_test, y_test)

print(rf_grid_train)
print(rf_grid_test)

0.9965928449744463
0.9035187287173666


In [150]:
grid_results = grid_results.append({'model': 'RandFor', 'train_grid': rf_grid_train, 'test_grid': rf_grid_test}, ignore_index= True)
grid_results.sort_values('test_grid')

Unnamed: 0,model,train_grid,test_grid
0,KNN,0.999811,0.790579
1,LogReg,0.836646,0.825199
2,NaivBays,0.487412,0.488082
3,DecTreeClas,0.999811,0.88025
4,BagClas,0.996782,0.900114
5,RandFor,0.996593,0.903519


### Compare the models to find the top two. The top two models will then be applied to an SVD-transformed data frame

In [152]:
final_results = pd.concat([results, grid_results] )
final_results.sort_values('test_grid', ascending = False)


Unnamed: 0,model,test,test_grid,train,train_grid
5,RandFor,,0.903519,,0.996593
4,BagClas,,0.900114,,0.996782
3,DecTreeClas,,0.88025,,0.999811
1,LogReg,,0.825199,,0.836646
0,KNN,,0.790579,,0.999811
2,NaivBays,,0.488082,,0.487412
0,KNN,0.775823,,0.857846,
1,LogReg,0.825199,,0.837403,
2,NaivBays,0.455165,,0.487412,
3,DecTreeClas,0.88252,,0.999811,
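
Because `pd.concat` stacks the two tables, every model appears twice with half of its columns empty. Merging on `model` puts the default and tuned scores on the same row. A sketch using test scores copied from the tables above (three models only, for brevity); the `gain` column is our addition.

```python
import pandas as pd

# Test scores copied from the two tables above.
results_demo = pd.DataFrame({
    'model': ['KNN', 'LogReg', 'DecTreeClas'],
    'test': [0.775823, 0.825199, 0.88252],
})
grid_demo = pd.DataFrame({
    'model': ['KNN', 'LogReg', 'DecTreeClas'],
    'test_grid': [0.790579, 0.825199, 0.88025],
})

# One row per model, default and tuned scores side by side.
compare = results_demo.merge(grid_demo, on='model')
compare['gain'] = compare['test_grid'] - compare['test']
print(compare.sort_values('test_grid', ascending=False))
```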


In [159]:
# SVD
# this section is informed by Sam Stacks lecture and the sklearn documentation

svd = TruncatedSVD()  # default n_components=2

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_train_svd = svd.fit_transform(X_train)
X_test_svd  = svd.transform(X_test)

randfor_svd = RandomForestClassifier(criterion = 'entropy', max_depth = None, max_features = 1.0,
                                     min_samples_split = 2, n_estimators = 15)

randfor_svd.fit(X_train_svd, y_train)

print(randfor_svd.score(X_train_svd, y_train))
print(randfor_svd.score(X_test_svd, y_test))

0.9808820745788378
0.5323496027241771
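
`TruncatedSVD()` keeps only 2 components by default, so the 18 input features are compressed to 2 before the forest sees them. That may be part of why the test score falls to 0.53. One way to choose a component count is to look at how much variance each setting retains; the sketch below does this on a synthetic 18-column matrix, so the printed numbers do not describe the storm data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# Synthetic 18-feature matrix (same width as X); values are illustrative only.
X_demo = rng.normal(size=(300, 18))

kept_by_k = {}
for k in [2, 8, 17]:
    svd_demo = TruncatedSVD(n_components=k, random_state=42)
    svd_demo.fit(X_demo)
    # Share of total variance retained by the first k components.
    kept_by_k[k] = svd_demo.explained_variance_ratio_.sum()
    print(f'n_components={k:>2}: variance kept = {kept_by_k[k]:.2f}')
```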


In [161]:
svd2 = TruncatedSVD(random_state = 42)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, stratify=y, random_state=42)
# Fit the SVD on this cell's own training split, then apply it to the matching test split.
X_train2_svd = svd2.fit_transform(X_train2)
X_test2_svd  = svd2.transform(X_test2)

logreg_svd = LogisticRegressionCV()
logreg_svd.fit(X_train2_svd, y_train2)

logreg_svd_train = logreg_svd.score(X_train2_svd, y_train2)
logreg_svd_test = logreg_svd.score(X_test2_svd, y_test2)

print(logreg_svd_train)
print(logreg_svd_test)

0.40034071550255534
0.3944381384790011



ValueError: n_components must be < n_features; got 100 >= 18
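
The `ValueError` above comes from requesting 100 components when `X` has only 18 columns. A minimal guard, sketched on a synthetic 18-column matrix, is to cap the request strictly below the feature count; the variable `requested` is a hypothetical name.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 18))  # synthetic, 18 columns like X

requested = 100
# Cap the request strictly below the number of features.
n_components = min(requested, X_demo.shape[1] - 1)

svd_demo = TruncatedSVD(n_components=n_components, random_state=42)
X_demo_svd = svd_demo.fit_transform(X_demo)
print(X_demo_svd.shape)
```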