# Baseline

This notebook runs a baseline model against which the transformer-based language models can be compared. It performs four steps: normalization, vectorization, training, and prediction. The helper functions come from `baseline.py`, which closely follows the code used to create the baseline for NoReC.

Inspired by [this post](https://medium.com/analytics-vidhya/sentiment-analysis-on-amazon-reviews-using-tf-idf-approach-c5ab4c36e7a1). 
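
The four steps can be sketched as follows. `baseline.py` is not shown here, so this is only an illustration under assumptions: TF-IDF is assumed as the vectorizer (based on the linked post), and the `normalize` function and the toy data are hypothetical. The classifier settings are the ones used for logistic regression later in this notebook.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def normalize(text):
    """Lowercase the text and collapse each run of non-letter characters into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


# Hypothetical toy data; the notebook itself reads emails from CSV files.
train_texts = [
    "Great product, works well!",
    "Terrible, broke after one day.",
    "Really great value",
    "Awful and terrible quality",
]
train_labels = [1, 0, 1, 0]

# Steps 1-3: normalization (via the preprocessor hook), TF-IDF vectorization, training.
model = make_pipeline(
    TfidfVectorizer(preprocessor=normalize),
    LogisticRegression(C=1, penalty="l2", solver="lbfgs"),
)
model.fit(train_texts, train_labels)

# Step 4: prediction on unseen text.
predictions = model.predict(["great value", "terrible awful"])
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training texts only and is then reused unchanged at prediction time.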

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from baseline import find_best_params, test_best_models
import pandas as pd

In [None]:
train = pd.read_csv('./emails_train_balanced.csv')
test = pd.read_csv('./emails_test.csv')

# The test set stores its labels as the strings 'LABEL_0' and 'LABEL_1', unlike the training set,
# so convert them to the integers 0 and 1.
test.loc[test['label'] == 'LABEL_0', 'label'] = 0
test.loc[test['label'] == 'LABEL_1', 'label'] = 1

In [None]:
# To ensure that the baseline models compared against the BERT models are of decent quality,
# a grid search is performed to find a good set of parameters.
find_best_params(train, test)
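
The internals of `find_best_params` are not shown in this notebook. As a rough illustration of what such a search can look like, here is a minimal sketch using scikit-learn's `GridSearchCV` over a TF-IDF plus SVM pipeline. The toy data, the candidate values, and the choice of macro F1 as the score are all assumptions made for the example, not the settings used in `baseline.py`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for the email training set.
texts = [
    "free prize now", "meeting at noon", "win free cash", "lunch meeting today",
    "claim your prize", "project status update", "cash prize winner", "status meeting notes",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC()),
])

# Illustrative candidate values; the step name ("clf") prefixes each parameter name.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__gamma": [0.1, 1],
}

# Every parameter combination is scored with 2-fold cross-validation on the training data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

best_params = search.best_params_  # dict holding the best "clf__C" and "clf__gamma"
```

In this sketch the parameters are chosen by cross-validation on the training data alone, which leaves the test set untouched for the final comparison.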

In [None]:
# Use the best parameters found above to make predictions. 
test_best_models(train, test, 
                 [SVC(C=1, gamma=1), 
                  LogisticRegression(C=1, penalty='l2', solver='lbfgs'), 
                  KNeighborsClassifier(leaf_size=5, n_neighbors=20, weights='distance')
                 ])