# Training the Final Model

## Abstract

This notebook applies the parameter research from the previous notebook: it loads the dictionary of optimized parameters, reconstructs the model from them, and fits it to the training split of the dataset.  At the end, the model is pickled so that it can be used in the Flask app.

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.pipeline import Pipeline
import pickle
import json

## Read Files

In [2]:
combo_df = pd.read_csv("./datasets/disaster_combo.csv")

In [3]:
combo_df.shape

(5002, 5)

In [4]:
combo_df.head()

Unnamed: 0,y_label,headline,pub_date,snippet,web_url
0,0.0,Hurricanes' Svechnikov in Concussion Protocol,2019-04-16T16:55:33+0000,Carolina Hurricanes rookie forward Andrei Svec...,https://www.nytimes.com/reuters/2019/04/16/spo...
1,0.0,‘It’s Making Us Less Prepared’: Shutdown Slows...,2019-01-18T10:00:09+0000,The partial government shutdown has kept storm...,https://www.nytimes.com/2019/01/18/us/governme...
2,0.0,Housing Vouchers Ending for Hurricane Michael ...,2019-04-11T18:05:57+0000,Hundreds of residents in the county hardest hi...,https://www.nytimes.com/aponline/2019/04/11/us...
3,0.0,An Action Plan to Reduce Hurricane Havoc,2018-10-12T20:08:55+0000,A reader calls for upgrading our built environ...,https://www.nytimes.com/2018/10/12/opinion/let...
4,0.0,Capitals Survive Surge From Hurricanes to Win ...,2019-04-12T02:25:00+0000,Nicklas Backstrom and Alex Ovechkin came out f...,https://www.nytimes.com/aponline/2019/04/11/sp...


### Data Cleaning

In [5]:
combo_df.isna().sum()

y_label      1
headline     2
pub_date     2
snippet     19
web_url      2
dtype: int64

In [6]:
combo_df.dropna(subset=["headline","y_label"], inplace=True)

## Train Test Split

In [7]:
X = combo_df["headline"]
y = combo_df["y_label"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

### Loading in parameters

In [9]:
with open("./datasets/opt_model_params.json") as file:
    model_dict = json.load(file)

In [10]:
log_params = model_dict[0]

In [11]:
#Extract vectorizer params
vect_params = log_params.pop("vect_details")

In [12]:
del log_params["name"]
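
The layout of `opt_model_params.json` is not shown in this notebook. The cells above imply that it holds a list of records, each carrying a `name`, a nested `vect_details` dictionary, and the remaining keys as classifier arguments. A minimal sketch of the same split, assuming that layout and using made-up values, written so that the loaded record is not mutated (`split_params` is a hypothetical helper, not part of the original notebook):

```python
import json

# Hypothetical record; the real file's keys and values may differ.
raw = json.loads("""
[
  {
    "name": "logreg",
    "penalty": "l2",
    "solver": "liblinear",
    "vect_details": {"max_df": 0.9, "min_df": 5, "max_features": 1000}
  }
]
""")


def split_params(record):
    """Return (vectorizer kwargs, classifier kwargs) without mutating record."""
    vect_params = dict(record.get("vect_details", {}))
    clf_params = {
        key: value
        for key, value in record.items()
        if key not in ("name", "vect_details")
    }
    return vect_params, clf_params


vect_params, log_params = split_params(raw[0])
print(vect_params)
print(log_params)
```

Because `pop` and `del` change `model_dict[0]` in place, rerunning those two cells without reloading the file raises a `KeyError`. A copying helper like this one sidesteps that.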

## Build Model

In [13]:
#Unpack the loaded dictionaries to reconstruct the tuned vectorizer and classifier
model = Pipeline([
    ("cvec", CountVectorizer(**vect_params)),
    ("logclf", LogisticRegression(**log_params))
])

Fit this reconstructed model to the training split; the test split is held out to check performance below.

In [14]:
model.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=1000, min_df=5,
        ngram_range=[1, 1], preprocessor=None,
        stop_words=['these', '...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

### Performance

In [15]:
preds = model.predict(X_test)

In [16]:
recall_score(y_test,preds)

0.7710843373493976

Barely any performance is lost here (recall of about 0.77 on the test split), so the parameters stay as they are. Further optimization is always possible.
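
For reference, recall is the share of actual positives that the model recovers, TP / (TP + FN). A small sketch of that calculation on made-up labels (not the notebook's data), assuming `1.0` marks the positive class:

```python
def recall(y_true, y_pred, positive=1.0):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


# Toy labels for illustration only.
y_true = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0]
print(recall(y_true, y_pred))  # 3 of the 4 positives are found: 0.75
```

The false positive in the last position does not affect the score, which is why recall is usually read alongside precision.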

## Output

Pickle the model for use in the Flask app.  An alternative format would be worth finding, because unpickling a file that has been tampered with can execute arbitrary code.  At the very least, hash validation of the file before loading is worth looking into.

In [18]:
with open("../modelpkl/model.pkl","wb") as file:
    pickle.dump(model,file)
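
The hash validation mentioned above could take roughly the following shape: record a SHA-256 digest when the model is written, and refuse to unpickle if the file no longer matches it. This is a sketch, not what the Flask app currently does; `sha256_of` and `load_verified` are hypothetical helpers, and a stand-in dictionary replaces the fitted pipeline.

```python
import hashlib
import hmac
import pickle
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path, expected_digest):
    """Unpickle only if the file's digest matches the recorded one."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError("model file digest mismatch; refusing to unpickle")
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demo with a stand-in object instead of the fitted pipeline.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump({"stand_in": "model"}, fh)

    recorded = sha256_of(model_path)  # store this somewhere the app trusts
    restored = load_verified(model_path, recorded)

    # Tampering with the file changes the digest and blocks loading.
    with open(model_path, "ab") as fh:
        fh.write(b"tampered")
    try:
        load_verified(model_path, recorded)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(restored, tamper_detected)
```

A digest check only detects changes made after the digest was recorded, and it is only as trustworthy as the place where the digest is kept, so it narrows the risk rather than removing it.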