# Building Models

## Abstract

This notebook sets up a workflow for building, tuning, and testing models.  The CSV is read in first; helper functions then create a collection of Pipelines, each pairing a vectorizer with a classifier, and a parameter grid is set up for each one.  An important choice here is recall as the scoring metric.  I originally tried accuracy, but with a baseline of $\approx 95\%$ it was not very useful for comparing models.  The justification for recall is that relevant content should be brought to attention: recall is the fraction of truly relevant items the model flags, so the only way to raise it is to reduce the number of False Negatives the model outputs.

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, recall_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
import copy

### Reading Files

Originally I used only the blizzard dataset to make the whole process faster: fitting this many models to a dataset orders of magnitude larger would severely increase the run time, and realistically the best parameters for most pieces should be very similar.  The next notebook covers training the model on _all_ the CSVs.

In [2]:
blizzard_df = pd.read_csv("./datasets/blizzard.csv",index_col=1)
blizzard_df.head()

| Unnamed: 0 | y_label | headline | pub_date | snippet | web_url |
|---|---|---|---|---|---|
| 0 | 0 | Blizzard Clobbers Plains and Midwest After Bla... | 2019-04-11T11:22:21+0000 | A powerful blizzard slammed the U.S. Plains an... | https://www.nytimes.com/reuters/2019/04/11/us/... |
| 1 | 0 | Spring Snow Falls in New England, U.S. Midwest... | 2019-04-09T01:04:40+0000 | New England states enjoying the first signs of... | https://www.nytimes.com/reuters/2019/04/08/us/... |
| 2 | 1 | Blizzard Barrels Through U.S. Great Plains, Th... | 2019-04-10T06:16:34+0000 | A "bomb cyclone" blizzard swept out of the Roc... | https://www.nytimes.com/reuters/2019/04/10/us/... |
| 3 | 1 | Blizzard Forces Closure of Some U.S. Grain Pro... | 2019-04-11T16:41:16+0000 | A second "bomb cyclone" blizzard hitting the U... | https://www.nytimes.com/reuters/2019/04/11/us/... |
| 4 | 0 | No, Winter Isn’t Over. Hitting the Plains: A F... | 2019-03-14T19:16:18+0000 | Rain, melting snow and frozen ground lifted ri... | https://www.nytimes.com/2019/03/14/us/bomb-cyc... |


### Train Test Split

In [3]:
X = blizzard_df["headline"]
y = blizzard_df["y_label"]

In [4]:
X_train,X_test,y_train,y_test = train_test_split(X,
                                                 y,
                                                 stratify=y,
                                                 random_state=42)

In [5]:
class CustomTextGet(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass #do nothing to initialize
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        return X["headline"]
    def fit_transform(self,X,y=None):
        return self.transform(X,y)

#This class definition was given as an example in SKLearn's docs here: https://scikit-learn.org/stable/modules/feature_extraction.html
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

### Pipelines

In this section we construct the Pipelines and their `param_grids`.  Note that each classifier is branched into two pipelines, one per vectorizer (`FeatureUnion` is imported but is not needed for this).

In [6]:
def pipe_constructor(classifiers):
    pipes = []
    for name,Classifier in classifiers:

        pipes.append(Pipeline([
            ("cvec",CountVectorizer(tokenizer=LemmaTokenizer())),
            (name,Classifier())
        ]))

        pipes.append(Pipeline([
            ("tvec",TfidfVectorizer(tokenizer=LemmaTokenizer())),
            (name,Classifier())
        ]))
    return pipes

This cell controls which models I will be testing.  From this list, the function defined above automatically generates every combination of vectorizer and classifier and returns them as a list of pipes.  This list would be sufficient to grid search over, but I did unpack them into individual names just to show the combinations a little more explicitly.
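To make the ordering of that list concrete, here is a minimal sketch that reproduces only the step names, not the estimators, so it runs without scikit-learn.  The variable names `classifier_names`, `vectorizer_names`, and `pipe_steps` are mine and do not appear in the notebook's cells.

```python
from itertools import product

# Step names copied from the notebook; the classes themselves are left out
# so this sketch runs without scikit-learn.
classifier_names = ["logclf", "nb", "gboost", "ada", "bag"]
vectorizer_names = ["cvec", "tvec"]

# pipe_constructor loops over the classifiers and, for each one, appends the
# cvec pipeline first and the tvec pipeline second, so the classifier is the
# outer loop and the vectorizer the inner one.
pipe_steps = [(vec, clf) for clf, vec in product(classifier_names, vectorizer_names)]

for position, steps in enumerate(pipe_steps):
    print(position, steps)
```

The order matters later on, because the pipes are matched to their parameter grids purely by position.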

In [7]:
classifiers = [("logclf",LogisticRegression),
               ("nb",MultinomialNB),
               ("gboost",GradientBoostingClassifier),
               ("ada",AdaBoostClassifier),
               ("bag",BaggingClassifier)
              ]

In [8]:
pipes = pipe_constructor(classifiers)

#### Param Grids

In [9]:
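# stopwords.words() already returns a list; newer scikit-learn versions expect stop_words as a list, not a set
stopWords = stopwords.words('english')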

cvec_grid = {"cvec__stop_words" : [stopWords],
             "cvec__ngram_range": [(1,1),(1,2),(1,3)],
             "cvec__min_df" : [1,5,10,15],
             "cvec__max_df" : [.9,.95,1.0],
             "cvec__max_features" : [1000,2000,None]}
tvec_grid = {"tvec__stop_words" : [stopWords],
             "tvec__ngram_range": [(1,1),(1,2),(1,3)],
             "tvec__min_df" : [1,5,10,15],
             "tvec__max_df" : [.9,.95,1.0],
             "tvec__max_features" : [1000,2000,None]}
# knn_grid = {"knn__n_neighbors":[5]}
log_grid = {"logclf__C" : np.linspace(0.1,1.0,5),
            "logclf__solver": ["liblinear","lbfgs"],
            "logclf__class_weight" : ["balanced"]}
nb_grid = {"nb__alpha": np.linspace(0.15,0.35,20)}
# dt_grid = {"dt__max_depth" : [None,200,300],
#           "dt__max_features" : ["auto",None]}
# gboost_grid = {"gboost__max_depth" : [2,3,4]}
# ada_grid = {"ada__learning_rate" : np.linspace(0.1,1,5)}

This for loop merges each classifier grid with both vectorizer grids.  `dict.update()` modifies a dictionary in place, so each merge works on a `copy.copy()` of the vectorizer grid to leave the original untouched.

In [10]:
param_grids = []
cl_grids = [log_grid,nb_grid]
for cl_grid in cl_grids:
    new_cvec = copy.copy(cvec_grid)
    new_cvec.update(cl_grid)
    new_tvec = copy.copy(tvec_grid)
    new_tvec.update(cl_grid)
    param_grids.append(new_cvec)
    param_grids.append(new_tvec)
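As a sanity check on the merge, the sketch below rebuilds one vectorizer grid and both classifier grids with placeholder values of the same lengths, then counts the candidates each merged grid would produce.  The helpers `merge` and `n_candidates` are hypothetical names for illustration, and the stop-word entry is a stand-in string rather than the real NLTK set.

```python
import copy
from math import prod

# Stand-in grids: only the number of options per key matters here, so the
# values are placeholders with the same lengths as the notebook's lists.
cvec_grid = {
    "cvec__stop_words": ["english-stop-words"],
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__min_df": [1, 5, 10, 15],
    "cvec__max_df": [0.9, 0.95, 1.0],
    "cvec__max_features": [1000, 2000, None],
}
log_grid = {
    "logclf__C": [0.1, 0.325, 0.55, 0.775, 1.0],
    "logclf__solver": ["liblinear", "lbfgs"],
    "logclf__class_weight": ["balanced"],
}
nb_grid = {"nb__alpha": list(range(20))}  # 20 placeholder values

def merge(vec_grid, clf_grid):
    """Shallow-copy the vectorizer grid, then add the classifier keys to the copy."""
    merged = copy.copy(vec_grid)
    merged.update(clf_grid)
    return merged

def n_candidates(grid):
    """A grid search tries one candidate per combination of options."""
    return prod(len(options) for options in grid.values())

log_cvec = merge(cvec_grid, log_grid)
nb_cvec = merge(cvec_grid, nb_grid)

print(len(cvec_grid))          # 5: the original grid gained no classifier keys
print(n_candidates(log_cvec))  # 1080
print(n_candidates(nb_cvec))   # 2160
```

A shallow copy is enough because `update()` only adds new keys; the option lists themselves are never modified.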

## Gridsearch

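Now, by using `zip()` to pair each pipe with its corresponding `param_grid`, we can grid search a large number of models at once and then take a look at their scores.  Note that `zip()` stops at the shorter list, so only the first four pipes (Logistic Regression and Naïve Bayes, each with both vectorizers) are searched; the boosting and bagging pipes have no grid yet.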
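Because the pairing is purely positional, a quick guard before a long search can catch a pipe matched with the wrong grid.  This is a sketch on step names only; `pipe_steps` and `grid_prefixes` are hypothetical stand-ins for the real `pipes` and `param_grids`.

```python
# Step names in the order pipe_constructor emits them (10 pipelines).
pipe_steps = [(vec, clf)
              for clf in ["logclf", "nb", "gboost", "ada", "bag"]
              for vec in ["cvec", "tvec"]]

# Key prefixes of the four merged grids, in the order the merge loop builds them.
grid_prefixes = [{"cvec", "logclf"}, {"tvec", "logclf"},
                 {"cvec", "nb"}, {"tvec", "nb"}]

# zip() pairs items by position and stops at the shorter input.
pairs = list(zip(pipe_steps, grid_prefixes))

for steps, prefixes in pairs:
    # Every grid key prefix must name a step in the paired pipeline.
    assert prefixes <= set(steps), (steps, prefixes)
    print(steps, "<->", sorted(prefixes))

# Pipelines that never reach GridSearchCV because no grid is left for them.
skipped = pipe_steps[len(pairs):]
print("skipped:", skipped)
```

With real objects, the same check could compare the prefix of each `param_grid` key (the part before `__`) against the pipeline's step names.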

In [11]:
grids = []
for pipe,param_grid in zip(pipes,param_grids):
    grid = GridSearchCV(pipe,
                        param_grid=param_grid,
                        #RECALL is important here, because of the imbalanced classes.  f1_score presents another viable path for optimization.
                        scoring="recall",
                        cv=3,
                        n_jobs=-1,verbose=2)
    grid.fit(X_train,y_train)
    grids.append(grid.best_estimator_)

Fitting 3 folds for each of 1080 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   27.0s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   50.5s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 2576 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed:  6.0min finished


Fitting 3 folds for each of 1080 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   37.6s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 2576 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed:  5.4min finished


Fitting 3 folds for each of 2160 candidates, totalling 6480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   35.7s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 2576 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 3265 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 4885 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 5816 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 6480 out of 6480 | elapsed: 11.4min finished


Fitting 3 folds for each of 2160 candidates, totalling 6480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   43.2s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 2576 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 3265 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 4885 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done 5816 tasks      | elapsed: 11.3min
[Parallel(n_jobs=-1)]: Done 6480 out of 6480 | elapsed: 12.4min finished


### Model Evaluation

#### Baseline Model

In [12]:
blizzard_df.y_label.value_counts(normalize=True)

0    0.954
1    0.046
Name: y_label, dtype: float64
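The class balance above is why accuracy is a weak yardstick here.  As a minimal sketch, assume a hypothetical set of 1,000 labels with the same 95.4% / 4.6% split (the real row count is not shown in this notebook) and a "model" that always predicts the majority class.  The metrics are hand-rolled so the example runs without scikit-learn.

```python
# Hypothetical labels mirroring the 95.4% / 4.6% split shown above.
y_true = [0] * 954 + [1] * 46

# A "model" that always predicts the majority class (0).
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): the share of real positives the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print(f"accuracy: {baseline_accuracy:.3f}")  # 0.954
print(f"recall:   {baseline_recall:.3f}")    # 0.000
```

A model that finds nothing still scores 95.4% accuracy, while its recall of zero makes the failure obvious.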


In [13]:
len(grids)

4

#### Model Scores

Here is a formatted output of the model scores on the training data and the testing data.  The testing score is the most important, because it approximates performance on novel, unseen data.  The training score is included because the gap between the two helps expose bias and variance in the models at a glance.

##### Helper Function Definitions

In [14]:
def formatted_scores(model, details, X_train,X_test,y_train,y_test):
    #get scores
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    train_score = recall_score(y_train,y_pred_train)
    test_score = recall_score(y_test,y_pred_test)
    #output
    print(f"{details['name']} with {details['processor']}'s Recall Scores:")
    print("=====================================================")
    print(f"Performance on training data: {train_score*100}%")
    print(f"Performance on testing data: {test_score*100}%")
    print("=====================================================")
    print("")

In [15]:
#hardcoded lookup dictionary for formatting purposes
lookup = {
    "knn" : "k-Nearest Neighbors",
    "logclf" : "Logistic Regression",
    "nb" : "Multinomial Naïve Bayes",
    "svc" : "Support Vector Classifier",
    "dt" : "Decision Tree Classifier",
    "gboost" : "Gradient Boosting Classifier",
    "bag" : "Bagging Classifier",
    "ada" : "AdaBoost Classifier",
    "cvec" : "Count Vectorizer",
    "tvec" : "TfidfVectorizer"
}

In [34]:
detail_dicts = []
for model in grids:
    details = {}
    details["name"] = lookup[model.steps[1][0]]
    details["processor"] = lookup[model.steps[0][0]]
    detail_dicts.append(details)

In [36]:
detail_dicts

[{'name': 'Logistic Regression', 'processor': 'Count Vectorizer'},
 {'name': 'Logistic Regression', 'processor': 'TfidfVectorizer'},
 {'name': 'Multinomial Naïve Bayes', 'processor': 'Count Vectorizer'},
 {'name': 'Multinomial Naïve Bayes', 'processor': 'TfidfVectorizer'}]

##### Scores

In [17]:
for model, details in zip(grids,detail_dicts):
    formatted_scores(model,details,X_train,X_test,y_train,y_test)

Logistic Regression with Count Vectorizer's Recall Scores:
Performance on training data: 88.57142857142857%
Performance on testing data: 81.81818181818183%

Logistic Regression with TfidfVectorizer's Recall Scores:
Performance on training data: 82.85714285714286%
Performance on testing data: 72.72727272727273%

Multinomial Naïve Bayes with Count Vectorizer's Recall Scores:
Performance on training data: 94.28571428571428%
Performance on testing data: 63.63636363636363%

Multinomial Naïve Bayes with TfidfVectorizer's Recall Scores:
Performance on training data: 25.71428571428571%
Performance on testing data: 18.181818181818183%
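One way to read these percentages is to convert them back into counts.  In each pair printed above, the first value is the training recall and the second is computed from `test_score`.  The sketch below recovers the simplest fraction behind each value; treating the denominators as the number of positive headlines in each split is my inference, since any multiple of them would give the same percentages.  `scores` and `as_fraction` are hypothetical names.

```python
from fractions import Fraction

# Recall values printed above, as proportions: (training, testing) per model.
scores = {
    "logclf + cvec": (0.8857142857142857, 0.8181818181818183),
    "logclf + tvec": (0.8285714285714286, 0.7272727272727273),
    "nb + cvec": (0.9428571428571428, 0.6363636363636363),
    "nb + tvec": (0.2571428571428571, 0.18181818181818183),
}

def as_fraction(value, max_denominator=100):
    """Closest fraction to `value` whose denominator is at most `max_denominator`."""
    return Fraction(value).limit_denominator(max_denominator)

for name, (train, test) in scores.items():
    gap = train - test  # a large train/test gap hints at overfitting
    print(f"{name}: train {as_fraction(train)}, test {as_fraction(test)}, gap {gap:.3f}")
```

Under that reading, the test scores rest on roughly a dozen positive headlines, so a single extra hit or miss moves test recall by about nine percentage points.  Naïve Bayes with the Count Vectorizer shows the widest gap between training and testing, which suggests it overfits the most.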



### Permanent Output

This section just serializes the details into a file on the hard drive to make it easier to reconstruct later.

In [None]:
models = []
for g in grids:
    # Unpack the tuples
    name = g.steps[1][0]
    model = g.steps[1][1]
    
    #get the parameters for the most optimized version from the gridsearch
    model_dict = model.get_params()
    model_dict["name"] = name
    
    # get the vectorizer's parameters to form a sub-dictionary
    vect_dict = g.steps[0][1].get_params()
    
    #remove pieces that can't be serialized and aren't necessary for reconstruction
    del vect_dict["tokenizer"]
    del vect_dict["dtype"]
    
    #turn the stop words into a serializable data structure
    vect_dict["stop_words"] = list(vect_dict["stop_words"])
    
    #attach this as a sub-dictionary
    model_dict["vect_details"] = vect_dict
    
    models.append(model_dict)

#### JSON Output

In [45]:
import json

In [46]:
with open("./datasets/opt_model_params.json","w") as fp:
    json.dump(models,fp)
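One caveat when reading this file back: JSON has no set or tuple type, so some values change type on the round trip.  The sketch below uses a hypothetical slice of vectorizer parameters, with illustrative values rather than the ones the grid search selected, to show what needs converting in each direction.

```python
import json

# Hypothetical slice of a vectorizer's parameters (illustrative values only).
vect_params = {
    "ngram_range": (1, 2),
    "min_df": 1,
    "max_features": None,
    "stop_words": {"the", "a", "of"},
}

# A set is not JSON serializable, so json.dumps raises TypeError on it.
try:
    json.dumps(vect_params)
    set_failed = False
except TypeError:
    set_failed = True

# Convert the set to a list first (sorted here only for a stable order).
serializable = dict(vect_params, stop_words=sorted(vect_params["stop_words"]))
round_tripped = json.loads(json.dumps(serializable))

# The tuple comes back as a list and None survives as null -> None.
print(round_tripped["ngram_range"])  # [1, 2]

# Restore the tuple before passing the parameters back to a vectorizer.
restored = dict(round_tripped, ngram_range=tuple(round_tripped["ngram_range"]))
print(restored["ngram_range"])  # (1, 2)
```

The reconstruction notebook would need the same tuple conversion for `ngram_range`, and it has to supply the tokenizer again, since that was deleted before serializing.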

#### CSV output

In theory this is redundant with `random_state` set, but if the split changes between library versions, saving the exact splits should help retain reproducibility.

In [110]:
X_train.to_csv('./datasets/blizzard_X_train.csv',index=False)
X_test.to_csv('./datasets/blizzard_X_test.csv',index=False)
y_train.to_csv('./datasets/blizzard_y_train.csv',index=False)
y_test.to_csv('./datasets/blizzard_y_test.csv',index=False)

