Grid search attempt using count vectorized text. 

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Reading in Data

In [3]:
tr = pd.read_csv("data/train.csv")

Trying to do predictions based only on text. Maybe later look at keyword, location, and id

In [4]:
tr.drop(columns=['keyword', 'location', 'id'], inplace=True)


In [5]:
import nltk
nltk.download('punkt')

from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Function for lowercasing and tokenizing text, dropping stopwords and punctuation, and stemming

In [6]:
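# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def transform(text):
    # Lowercase and split into tokens
    tokens = nltk.word_tokenize(text.lower())
    # Keep alphanumeric tokens only (this also drops punctuation tokens)
    tokens = [t for t in tokens if t.isalnum()]
    # Remove English stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # Stem each remaining token and rejoin into one string
    return " ".join(ps.stem(t) for t in tokens)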

Main data set I am working on

In [7]:
tr['transform_text'] = tr['text'].apply(transform)

In [8]:
tr

Unnamed: 0,text,target,transform_text
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason earthquak may allah forgiv us
1,Forest fire near La Ronge Sask. Canada,1,forest fire near la rong sask canada
2,All residents asked to 'shelter in place' are ...,1,resid ask place notifi offic evacu shelter pla...
3,"13,000 people receive #wildfires evacuation or...",1,peopl receiv wildfir evacu order california
4,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo rubi alaska smoke wildfir pour ...
...,...,...,...
7608,Two giant cranes holding a bridge collapse int...,1,two giant crane hold bridg collaps nearbi home...
7609,@aria_ahrary @TheTawniest The out of control w...,1,thetawniest control wild fire california even ...
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,utc 5km volcano hawaii http
7611,Police investigating after an e-bike collided ...,1,polic investig collid car littl portug rider s...


Dropping the original text for the altered text

In [9]:
tr.drop(columns=['text'], inplace = True)

In [10]:
tr

Unnamed: 0,target,transform_text
0,1,deed reason earthquak may allah forgiv us
1,1,forest fire near la rong sask canada
2,1,resid ask place notifi offic evacu shelter pla...
3,1,peopl receiv wildfir evacu order california
4,1,got sent photo rubi alaska smoke wildfir pour ...
...,...,...
7608,1,two giant crane hold bridg collaps nearbi home...
7609,1,thetawniest control wild fire california even ...
7610,1,utc 5km volcano hawaii http
7611,1,polic investig collid car littl portug rider s...


Setting up for modeling

In [11]:
X = tr["transform_text"]

In [12]:
y = tr["target"]

It makes more sense to compare models on averaged cross-validation scores than on a single train/test split

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics
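
As a rough sketch of what averaging cross-validation scores looks like, the snippet below runs `cross_val_score` on a tiny made-up corpus. The texts, labels, and the names `toy_pipe`, `fold_scores`, and `mean_score` are placeholders assumed for illustration, not the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = disaster, 0 = not a disaster
toy_texts = [
    "forest fire near town", "earthquak shake citi", "flood evacu order",
    "love new song", "great day beach", "watch movi tonight",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([("Vectorizer", CountVectorizer()),
                     ("classifier", MultinomialNB())])

# One accuracy score per fold; the mean is the figure compared across models
fold_scores = cross_val_score(toy_pipe, toy_texts, toy_labels, cv=3, scoring="accuracy")
mean_score = fold_scores.mean()
print(fold_scores, mean_score)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of each fold, so the held-out fold does not leak into the vocabulary.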

Vectorizing the text

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv = CountVectorizer()
tf = TfidfVectorizer()
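
To make the difference between the two vectorizers concrete, here is a small sketch on two made-up stemmed documents. The `docs` list and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["forest fire near", "fire fire smoke"]  # hypothetical stemmed tweets

count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Columns follow the alphabetical vocabulary: fire, forest, near, smoke
count_rows = count_matrix.toarray().tolist()

# With TfidfVectorizer's default norm="l2", each row has unit length
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(count_rows, row_norms)
```

CountVectorizer keeps raw term counts, while TfidfVectorizer reweights them and normalizes each row, so the two feed differently scaled features to the classifier.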

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [18]:
import pandas as pd
import nltk
import regex as re
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

Setting up the TF-IDF pipelines, one per classifier.

In [60]:
mnb_tf = Pipeline([('Vectorizer',  TfidfVectorizer(lowercase=False)),
               ('mnb', MultinomialNB())])

lr_tf = Pipeline([('Vectorizer', TfidfVectorizer(lowercase=False)),
               ('LogisticReg', LogisticRegression(max_iter=200, random_state=1))])

dtc_tf = Pipeline([('Vectorizer', TfidfVectorizer(lowercase=False)),
               ('DecisionTree', DecisionTreeClassifier(random_state=1))])

rf_tf = Pipeline([('Vectorizer', TfidfVectorizer(lowercase=False)),
               ('RandomFor', RandomForestClassifier(random_state=1))]) 

gbc_tf = Pipeline([('Vectorizer', TfidfVectorizer(lowercase=False)),
               ('gradientboosting', GradientBoostingClassifier(random_state=1))])

svc_tf = Pipeline([('Vectorizer', TfidfVectorizer(lowercase=False)),
                ('SupportVec', SVC(random_state=1))])


In [61]:
models2 = [('MultiNomBa', mnb_tf),
          ('LogisticReg', lr_tf),
          ('DecTreeClass', dtc_tf),           
          ('RandomFor', rf_tf),
          ('GradBoost', gbc_tf),
          ('SupportVec', svc_tf)]


In [62]:
num_mba=0
num_lreg=1
num_dtc=2
num_rfc=3
num_gbc=4
num_svc=5


In [24]:
tuned_params = {}

In [50]:

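def gridsearch_count(params, name, models, num):
    """Grid search one classifier on count-vectorized text and record its best parameters."""
    for model, grid in params.items():
        print(model, 'Grid Search:')
        print(model)
        # models[num] is a (name, pipeline) tuple; [1] picks the pipeline and the
        # second [1] picks its classifier step, reused here behind a CountVectorizer
        pipe = Pipeline(steps=[('Vectorizer', CountVectorizer(lowercase=False)),
                                ('classifier', models[num][1][1])]) 
        print(pipe["Vectorizer"])
        # grid is a one-element list, so grid[0] is the parameter dict
        gridsearch = GridSearchCV(estimator=pipe, param_grid=grid[0], scoring='accuracy', cv=5)
        gridsearch.fit(X, y)
        print("Scoring method: Accuracy")
        print(f'Avg of cross validation scores: {gridsearch.cv_results_["mean_test_score"]}')
        print(f'Best cross validation score: {gridsearch.best_score_ :.2%}')
        print(f'Optimal parameters: {gridsearch.best_params_}')
        # Store the winner in the notebook-level tuned_params dict
        tuned_params[name] = gridsearch.best_params_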
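
The double index `models[num][1][1]` is easy to misread, so here is a minimal sketch of what it does, using a hypothetical one-entry list `demo_models` in place of `models2`. It relies on scikit-learn pipelines returning the estimator of a step when indexed with an integer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for one (name, pipeline) entry of models2
demo_models = [("MultiNomBa", Pipeline([("Vectorizer", TfidfVectorizer(lowercase=False)),
                                        ("mnb", MultinomialNB())]))]

tfidf_pipe = demo_models[0][1]   # the Pipeline object
classifier = tfidf_pipe[1]       # integer index returns the estimator of step 1

# Reuse the same classifier behind a CountVectorizer
count_pipe = Pipeline(steps=[("Vectorizer", CountVectorizer(lowercase=False)),
                             ("classifier", classifier)])
print(type(classifier).__name__, list(count_pipe.named_steps))
```

Naming the second step `classifier` is what makes the `classifier__...` prefixes in the parameter grids resolve, whatever the step was called in the original TF-IDF pipeline.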

Logistic Regression

In [52]:
params_lr_cv1 = {'LogisticReg': [{
    "classifier__penalty":["l1", "l2", "elasticnet"],
    'classifier__max_iter':[100, 200],
    'classifier__C':[0.001, 0.1, 1],
    'classifier__solver':['lbfgs', 'saga'],
    'classifier__fit_intercept':[True, False]

}]}

gridsearch_count(params_lr_cv1, name="LogisticReg", models=models2, num=num_lreg)

LogisticReg Grid Search:
LogisticReg
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [       nan 0.5703402  0.57809341 0.57730515        nan        nan
        nan 0.5703402  0.57809341 0.57783069        nan        nan
        nan 0.5703402  0.63090156 0.63090156        nan        nan
        nan 0.5703402  0.63090156 0.63090156        nan        nan
        nan 0.6458765  0.70826935 0.70708635        nan        nan
        nan 0.64548245 0.70826935 0.70866279        nan        nan
        nan 0.65572807 0.67963482 0.67963482        nan        nan
        nan 0.65572807 0.67963482 0.67963482        nan        nan
        nan 0.6588775  0.68751582 0.68685922        nan        nan
        nan 0.65388588 0.68751582 0.68607105        nan        nan
        nan 0.63457955 0.64876622 0.65178813        nan        nan
        nan 0.63392174 0.64876622 0.64902912        nan        nan]
Best cross validation score: 70.87%
Optimal parameters: {'classifier
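
The `nan` scores follow a regular pattern: in each group of six, the failing entries line up with `l1` + `lbfgs`, `elasticnet` + `lbfgs`, and `elasticnet` + `saga`. That is consistent with `lbfgs` not supporting `l1` or `elasticnet` penalties, and with `elasticnet` needing an `l1_ratio`. One possible way to avoid the failed fits, sketched under those assumptions, is a list of grids that only pairs compatible options. The names `shared` and `params_lr_valid` and the `l1_ratio` value of 0.5 are illustrative, not part of the original search.

```python
from sklearn.model_selection import ParameterGrid

# Settings common to every penalty/solver pairing
shared = {
    "classifier__C": [0.001, 0.1, 1],
    "classifier__max_iter": [100, 200],
    "classifier__fit_intercept": [True, False],
}

# One dict per compatible penalty/solver family
params_lr_valid = [
    {**shared, "classifier__penalty": ["l2"], "classifier__solver": ["lbfgs", "saga"]},
    {**shared, "classifier__penalty": ["l1"], "classifier__solver": ["saga"]},
    {**shared, "classifier__penalty": ["elasticnet"], "classifier__solver": ["saga"],
     "classifier__l1_ratio": [0.5]},
]

# Number of candidate settings the search would evaluate
n_candidates = len(ParameterGrid(params_lr_valid))
print(n_candidates)
```

`GridSearchCV` accepts such a list directly as `param_grid`, so every candidate it evaluates is a combination the solver can actually fit.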

Decision tree

In [53]:
params_dtc1 = {'DecisionTree': [{
    'classifier__criterion':['gini', 'entropy'],
    'classifier__max_depth':[1, 3, 5, 10, 15, 25],
    'classifier__min_samples_split':[2, 5, 6, 8],
    'classifier__ccp_alpha':[0.0, 0.01, 0.1]
}]}

gridsearch_count(params_dtc1, name='DecisionTree', models=models2, num=num_dtc)


DecisionTree Grid Search:
DecisionTree
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [0.618816   0.618816   0.618816   0.618816   0.62354437 0.62354437
 0.62354437 0.62354437 0.61185355 0.61185355 0.61185355 0.61185355
 0.60252579 0.60344512 0.6033138  0.60305107 0.59963684 0.59819224
 0.59884884 0.5993742  0.61369185 0.6131664  0.61447977 0.61277236
 0.618816   0.618816   0.618816   0.618816   0.62354437 0.62354437
 0.62354437 0.62354437 0.61027771 0.61027771 0.61027771 0.61027771
 0.59490623 0.59411822 0.59411822 0.59411822 0.59451365 0.5956963
 0.59556498 0.59490821 0.61080006 0.60869894 0.60961827 0.61001205
 0.618816   0.618816   0.618816   0.618816   0.618816   0.618816
 0.618816   0.618816   0.618816   0.618816   0.618816   0.618816
 0.618816   0.618816   0.618816   0.618816   0.618816   0.618816
 0.618816   0.618816   0.618816   0.618816   0.618816   0.618816
 0.618816   0.618816   0.618816   0.618816   0.61947277 0.61947277
 0.619472

Random Forest 

In [65]:
params_rf1 = {'RandomForest': [{
    "classifier__n_estimators": [50,100, 150, 200, 250],
    'classifier__criterion':['gini', 'entropy'],
    "classifier__min_samples_leaf": [2, 4, 6],
    # Only one max_depth entry: a repeated dict key silently overrides the earlier one
    'classifier__max_depth':[5, 10, 15, 20],
#    "classifier__min_weight_fraction_leaf": [0.1, 0.3]
}]}

gridsearch_count(params_rf1, name='RandomForest', models=models2, num=num_rfc)


RandomForest Grid Search:
RandomForest
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [0.58163793 0.57651412 0.57704018 0.57796002 0.57375581 0.58150669
 0.57664544 0.57704018 0.57769721 0.57349299 0.580718   0.57664544
 0.57664605 0.57769712 0.57533268 0.61001059 0.60764528 0.60567376
 0.60567384 0.60725115 0.61250601 0.60738203 0.6058049  0.60606789
 0.60764528 0.61040325 0.60764476 0.60843233 0.6067244  0.60895856
 0.63299733 0.63115963 0.63431001 0.63181467 0.63299733 0.63509913
 0.63221096 0.63352235 0.63142114 0.63168361 0.63536238 0.62892598
 0.6314208  0.63260354 0.63247205 0.65637992 0.64836847 0.648367
 0.64902533 0.64863137 0.655987   0.64849944 0.65033775 0.65125768
 0.65033801 0.65808898 0.64928874 0.65243981 0.6509946  0.64915527
 0.5811129  0.57533207 0.5773036  0.57730351 0.57520136 0.58124422
 0.57625131 0.57704078 0.57651525 0.57520136 0.57993008 0.5762514
 0.57782862 0.5775658  0.57625261 0.61185027 0.60698816 0.60580577
 0.

Multinomial Naive Bayes

In [67]:
params_nb_cv1 = {'MultinomialNB': [{
    'classifier__alpha':[.001, .01, .05, .1, .2, .4, .6, .8, 1],

}]}

gridsearch_count(params_nb_cv1, name="MultinomialNB", models=models2, num=num_mba)

MultinomialNB Grid Search:
MultinomialNB
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [0.67358894 0.67477151 0.68015812 0.68554464 0.68856534 0.69382072
 0.69487197 0.69789284 0.69999517]
Best cross validation score: 70.00%
Optimal parameters: {'classifier__alpha': 1}


Gradient Boost Classifier

In [68]:
params_gbc1 = {'GradBoostClassifier': [{
    'classifier__learning_rate':[.001, .01],
    'classifier__n_estimators':[100, 200],
    'classifier__max_depth':[5, 10]
}]}

gridsearch_count(params_gbc1, name='GradBoostClassifier1', models=models2, num=num_gbc)

GradBoostClassifier Grid Search:
GradBoostClassifier
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [0.5703402  0.59542719 0.5703402  0.59333091 0.62603902 0.63943795
 0.62380641 0.62525274]
Best cross validation score: 63.94%
Optimal parameters: {'classifier__learning_rate': 0.01, 'classifier__max_depth': 5, 'classifier__n_estimators': 200}


Support Vector Classifier

In [69]:
params_svc1 = {'SVC': [{
    'classifier__C':[1],
    'classifier__kernel':['linear'],
    'classifier__gamma':['scale'],
}]}

gridsearch_count(params_svc1, name='SVC1', models=models2, num=num_svc)

SVC Grid Search:
SVC
CountVectorizer(lowercase=False)
Scoring method: Accuracy
Avg of cross validation scores: [0.65244231]
Best cross validation score: 65.24%
Optimal parameters: {'classifier__C': 1, 'classifier__gamma': 'scale', 'classifier__kernel': 'linear'}


In [70]:
tuned_params

{'LogisticReg': {'classifier__C': 0.1,
  'classifier__fit_intercept': True,
  'classifier__max_iter': 200,
  'classifier__penalty': 'l2',
  'classifier__solver': 'saga'},
 'DecisionTree': {'classifier__ccp_alpha': 0.0,
  'classifier__criterion': 'gini',
  'classifier__max_depth': 3,
  'classifier__min_samples_split': 2},
 'RandomForest': {'classifier__criterion': 'gini',
  'classifier__max_depth': 20,
  'classifier__min_samples_leaf': 6,
  'classifier__n_estimators': 50},
 'MultinomialNB': {'classifier__alpha': 1},
 'GradBoostClassifier1': {'classifier__learning_rate': 0.01,
  'classifier__max_depth': 5,
  'classifier__n_estimators': 200},
 'SVC1': {'classifier__C': 1,
  'classifier__gamma': 'scale',
  'classifier__kernel': 'linear'}}

The grid search scores are significantly lower than the default-parameter scores from the exploratory models notebook, so something is wrong with these grid searches. For random forest the cause is probably that the default model allowed much deeper trees than the max depths searched here, but the other results don't make sense. One possible cause is that the cross-validation folds were not shuffled; the scores might improve with shuffled data. This can be done by passing a shuffling splitter as cv in the grid search, or by shuffling the rows before fitting.
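
A minimal sketch of the shuffling idea, assuming it is acceptable to swap the integer `cv=5` for an explicit splitter: with an integer, scikit-learn uses stratified folds for classifiers without shuffling, whereas `StratifiedKFold(shuffle=True)` randomizes fold membership. The toy corpus and the names `shuffled_cv` and `search` are hypothetical; whether this closes the gap to the default scores is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus sorted by label, standing in for X and y
toy_texts = ["forest fire near town", "earthquak shake citi", "flood evacu order",
             "wildfir smoke evacu", "love new song", "great day beach",
             "watch movi tonight", "coffe friend morn"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline(steps=[("Vectorizer", CountVectorizer(lowercase=False)),
                       ("classifier", MultinomialNB())])

# shuffle=True randomizes which rows land in each fold; random_state fixes the split
shuffled_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

search = GridSearchCV(estimator=pipe,
                      param_grid={"classifier__alpha": [0.1, 1]},
                      scoring="accuracy", cv=shuffled_cv)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

In `gridsearch_count` the same change would amount to replacing `cv=5` with a splitter like `shuffled_cv` using `n_splits=5`.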