# Base Models

This processed data is not uploaded to the Github repo, as some of the files are too large. Run notebook 2 in order to produce the same files.

## Baseline Models

Even though I ran this type of model analysis with the sample data, I want to run it again now that I've created bigrams. I'm also no longer using SVM, as it took way too long to run even on the smaller dataset. Once again, logistic regression is the best performing model. Also as expected, bigrams improved the accuracy of the models.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

In [9]:
def get_model_metrics(X_train, y_train, X_test, y_test, model, model_name, data_name):
    
    model.fit(X_train, y_train)
    y_train_hat = model.predict(X_train)
    y_test_hat = model.predict(X_test)
    
    acc_train = accuracy_score(y_train, y_train_hat)
    pre_train = precision_score(y_train, y_train_hat)
    rec_train = recall_score(y_train, y_train_hat)
    f1_train = f1_score(y_train, y_train_hat, average='macro')
    
    acc_test = accuracy_score(y_test, y_test_hat)
    pre_test = precision_score(y_test, y_test_hat)
    rec_test = recall_score(y_test, y_test_hat)
    f1_test = f1_score(y_test, y_test_hat, average='macro')
    
    metrics = {'Model': model_name,
               'Processing': data_name,
               'Test Accuracy': acc_test,
               'Test Precision': pre_test,
               'Test Recall': rec_test,
               'Test F1': f1_test,
               'Train Accuracy': acc_train,
               'Train Precision': pre_train,
               'Train Recall': rec_train,
               'Train F1': f1_train}
    
    return metrics

In [3]:
"""
datasets = [('TF-IDF', 'tf'),
        ('TF-IDF with Bigrams', 'bigram'),
        ('Document Embeddings', 'embed')]
"""
datasets = [('TF-IDF with Bigrams', 'bigram')]
models = [('Logistic Regression', LogisticRegression(solver='saga')),
          ('Multinomial Naive Bayes', MultinomialNB()),
          ('Random Forest', RandomForestClassifier())]
metrics = []

In [4]:
y_train = pd.read_pickle('../data/processed/y_train.pkl.gz')['voted_up'].to_numpy()
y_test = pd.read_pickle('../data/processed/y_test.pkl.gz')['voted_up'].to_numpy()

for data_name, file in datasets:
    X_train = pd.read_pickle(f'../data/processed/X_{file}_train.pkl.gz').to_numpy()
    X_test = pd.read_pickle(f'../data/processed/X_{file}_test.pkl.gz').to_numpy()
    for model_name, model in models:
        print(model_name, data_name)
        metrics.append(get_model_metrics(X_train, y_train, X_test, y_test, model, model_name, data_name))

metrics.append(get_model_metrics(X_train, y_train, X_test, y_test, DummyClassifier()))

Logistic Regression TF-IDF with Bigrams
Multinomial Naive Bayes TF-IDF with Bigrams
Random Forest TF-IDF with Bigrams


TypeError: get_model_metrics() missing 2 required positional arguments: 'model_name' and 'data_name'

In [5]:
metrics_df = pd.DataFrame(metrics)
metrics_df.sort_values(by='Test Accuracy', ascending=False)

Unnamed: 0,Model,Processing,Test Accuracy,Test Precision,Test Recall,Train Accuracy,Train Precision,Train Recall
0,Logistic Regression,TF-IDF with Bigrams,0.841855,0.864672,0.928968,0.843095,0.865643,0.929798
2,Random Forest,TF-IDF with Bigrams,0.822884,0.825431,0.960816,0.987273,0.985017,0.997777
1,Multinomial Naive Bayes,TF-IDF with Bigrams,0.761805,0.756041,0.995139,0.762755,0.756878,0.995376


# Gridsearch

Here I performed a gridsearch on the random forest and logistic regression models using just the bigram data, as it performed the best. Naive Bayes models do not have any hyperparameters to tune, and so there is no grid search to perform on it.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
y_train = pd.read_pickle('../data/processed/y_train.pkl.gz')['voted_up'].to_numpy()
X_train = pd.read_pickle('../data/processed/X_bigram_train.pkl.gz').to_numpy()

In [3]:
param_grid_lr = {'C': [0.1, 1, 10],
                 'class_weight': ['balanced', None],
                 'solver': ['saga']}
gs_lr = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid_lr, scoring='f1_macro', cv=3, verbose=5)
gs_lr.fit(X_train, y_train)
gs_lr.best_params_

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END ......C=0.1, class_weight=balanced, solver=saga; total time= 4.5min
[CV 2/5] END ......C=0.1, class_weight=balanced, solver=saga; total time= 4.0min
[CV 3/5] END ......C=0.1, class_weight=balanced, solver=saga; total time= 4.2min
[CV 4/5] END ......C=0.1, class_weight=balanced, solver=saga; total time= 3.9min
[CV 5/5] END ......C=0.1, class_weight=balanced, solver=saga; total time= 4.7min
[CV 1/5] END ..........C=0.1, class_weight=None, solver=saga; total time= 3.8min
[CV 2/5] END ..........C=0.1, class_weight=None, solver=saga; total time= 3.6min
[CV 3/5] END ..........C=0.1, class_weight=None, solver=saga; total time= 3.9min
[CV 4/5] END ..........C=0.1, class_weight=None, solver=saga; total time= 3.8min
[CV 5/5] END ..........C=0.1, class_weight=None, solver=saga; total time= 3.7min
[CV 1/5] END ........C=1, class_weight=balanced, solver=saga; total time= 4.0min
[CV 2/5] END ........C=1, class_weight=balanced, 

{'C': 10, 'class_weight': None, 'solver': 'saga'}

In [None]:
param_grid_rf = {'n_estimators': [100, 500],
                 'max_features': ['auto', 150],
                 'class_weight': ['balanced', None]}
gs_rf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid_rf, scoring='f1_macro', cv=3, verbose=5)
gs_rf.fit(X_train, y_train)
gs_rf.best_params_

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3] END class_weight=balanced, max_features=auto, n_estimators=100; total time=33.3min


## Final Models

After comparing the best tuned models, logistic regression still has the best accuracy. I had expected random forest to improve performance more with tuning, but it seems not to be the case here.

In [5]:
y_train = pd.read_pickle('../data/processed/y_train.pkl.gz')['voted_up'].to_numpy()
y_test = pd.read_pickle('../data/processed/y_test.pkl.gz')['voted_up'].to_numpy()
X_train = pd.read_pickle('../data/processed/X_bigram_train.pkl.gz').to_numpy()
X_test = pd.read_pickle('../data/processed/X_bigram_test.pkl.gz').to_numpy()

In [10]:
lr_final = LogisticRegression(C=10, solver='saga')
metrics = get_model_metrics(X_train, y_train, X_test, y_test, lr_final, 'Logistic Regression', 'TF-IDF with Bigrams')
print(metrics)

{'Model': 'Logistic Regression', 'Processing': 'TF-IDF with Bigrams', 'Test Accuracy': 0.8419409987277354, 'Test Precision': 0.865303327471933, 'Test Recall': 0.9281795511221945, 'Test F1': 0.7850249727016845, 'Train Accuracy': 0.8431205218571709, 'Train Precision': 0.8662267959126413, 'Train Recall': 0.9289829863564711, 'Train F1': 0.7862350301545274}


In [11]:
metrics

{'Model': 'Logistic Regression',
 'Processing': 'TF-IDF with Bigrams',
 'Test Accuracy': 0.8419409987277354,
 'Test Precision': 0.865303327471933,
 'Test Recall': 0.9281795511221945,
 'Test F1': 0.7850249727016845,
 'Train Accuracy': 0.8431205218571709,
 'Train Precision': 0.8662267959126413,
 'Train Recall': 0.9289829863564711,
 'Train F1': 0.7862350301545274}

In [4]:
lr_final = LogisticRegression(C=10, solver='saga')
nb_final = MultinomialNB()
rf_final = RandomForestClassifier(max_features=150)

final_metrics = []
print('starting Logistic Regression model')
final_metrics.append(get_model_metrics(X_train, y_train, X_test, y_test, lr_final, 'Logistic Regression', 'TF-IDF with Bigrams'))
print('starting Naive Bayes model')
final_metrics.append(get_model_metrics(X_train, y_train, X_test, y_test, nb_final, 'Multinomial Naive Bayes', 'TF-IDF with Bigrams'))
print('starting Random Forest model')
final_metrics.append(get_model_metrics(X_train, y_train, X_test, y_test, rf_final, 'Random Forest', 'TF-IDF with Bigrams'))
print('completed models')

final_metrics_df = pd.DataFrame(final_metrics)
final_metrics_df.sort_values(by='Test Accuracy', ascending=False)

starting Logistic Regression model
starting Naive Bayes model
starting Random Forest model
completed models


Unnamed: 0,Model,Processing,Test Accuracy,Test Precision,Test Recall,Train Accuracy,Train Precision,Train Recall
0,Logistic Regression,TF-IDF with Bigrams,0.91409,0.93401,0.961287,0.947295,0.956907,0.978674
1,Multinomial Naive Bayes,TF-IDF with Bigrams,0.871614,0.869202,0.989558,0.877557,0.874733,0.989815
2,Random Forest,TF-IDF with Bigrams,0.866963,0.864709,0.989727,0.998957,0.99894,0.999767


## Save Model

Even though I will also be creating a neural network model, I still want to save this best logistic regression model. I can try to use it as a backup in case the neural network model is too big to upload to heroku.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import pickle

In [3]:
y_train = pd.read_pickle('../data/processed/y_train.pkl.gz')['voted_up'].to_numpy()
y_test = pd.read_pickle('../data/processed/y_test.pkl.gz')['voted_up'].to_numpy()
X_train = pd.read_pickle('../data/processed/X_bigram_train.pkl.gz').to_numpy()
X_test = pd.read_pickle('../data/processed/X_bigram_test.pkl.gz').to_numpy()

In [4]:
model = LogisticRegression(C=10, solver='saga')
model.fit(X_train, y_train)

LogisticRegression(C=10, solver='saga')

In [6]:
pickle.dump(model, open('../final_model/sklearn-logreg/model.pk', 'wb'))