# H2 - PROGRAMMING EXERCISES 

## 1. Binary Classification on Text Data. 

In this problem, you will implement several machine learning techniques from the class to perform classification on text data. Throughout the problem, we will be working on the NLP with Disaster Tweets Kaggle competition, where the task is to predict whether or not a tweet is about a real disaster.

### (a) Download the data. 
Download the training and test data from Kaggle, and answer the following questions: (1) how many training and test data points are there? and (2) what percentage of the training tweets are of real disasters, and what percentage is not? Note that the meaning of each column is explained in the data description on Kaggle.

<font color='blue'> Data Analysis

The train dataset contains 7613 datapoints and the test set 3263, i.e. a train-test split of roughly 70-30.
The train dataset contains 3271 real disaster tweets (datapoints where target = 1), which account for 43% of the dataset, while the remaining 4342 (57%) are labelled as not being about a real disaster.
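
The percentages can be checked by hand from the four counts printed by the loading cell. A minimal sketch of that arithmetic, which hard-codes the counts instead of reading the CSV files:

```python
# Counts taken from the printed output of the loading cell.
n_train, n_test = 7613, 3263
n_real, n_not_real = 3271, 4342

pct_real = 100 * n_real / n_train
pct_not_real = 100 * n_not_real / n_train
train_share = 100 * n_train / (n_train + n_test)

print(f"real: {pct_real:.1f}%  not real: {pct_not_real:.1f}%")
print(f"train share of all tweets: {train_share:.1f}%")
```

The two classes are reasonably balanced, which matters later when reading F1 scores next to accuracy.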

In [295]:
# imports .
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# load train and test datasets.
raw_train = pd.read_csv('./data/train.csv')
raw_test = pd.read_csv('./data/test.csv')

print("Train dataset: ", len(raw_train))
print("Test dataset: ", len(raw_test))
print("Real Disasters in Train dataset: ", len(raw_train[raw_train["target"] == 1.0]))
print("Non-Real Disasters in Train dataset: ", len(raw_train[raw_train["target"] == 0.0]))

Train dataset:  7613
Test dataset:  3263
Real Disasters in Train dataset:  3271
Non-Real Disasters in Train dataset:  4342


### (b) Split the training data. 
Since we do not know the correct values of labels in the test data, we will split the training data from Kaggle into a training set and a development set (a development set is a held out subset of the labeled data that we set aside in order to fine-tune models, before evaluating the best model(s) on the test data). Randomly choose 70% of the data points in the training data as the training set, and the remaining 30% of the data as the
development set. Throughout the rest of this problem we will keep these two sets fixed. The
idea is that we will train different models on the training set, and compare their performance
on the development set, in order to decide what to submit to Kaggle.
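
Keeping the two sets fixed requires the random draw to be repeatable. The sketch below shows one way to get a seeded 70/30 split of row indices using only the standard library; the function name `split_indices` and the seed value are illustrative assumptions, not part of the assignment. With scikit-learn, passing `random_state` to `train_test_split` serves the same purpose.

```python
import math
import random

def split_indices(n_rows, dev_fraction=0.3, seed=0):
    """Return (train_idx, dev_idx) as disjoint, shuffled index lists."""
    rng = random.Random(seed)          # fixed seed -> same split on every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_dev = math.ceil(dev_fraction * n_rows)
    return indices[n_dev:], indices[:n_dev]

train_idx, dev_idx = split_indices(7613)
print(len(train_idx), len(dev_idx))   # 5329 2284
```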

In [296]:
from sklearn.model_selection import train_test_split
df_train, df_val = train_test_split(raw_train, test_size = 0.3)

### (c) Preprocess the data.
Since the data consists of tweets, they may contain significant amounts
of noise and unprocessed content. You may or may not want to do one or all of the following.
Explain the reasons for each of your decisions (why or why not).
- Convert all the words to lowercase.
- Lemmatize all the words (i.e., convert every word to its root so that all of “running,” “run,” and “runs” are converted to “run” and all of “good,” “well,” “better,” and “best” are converted to “good”; this is easily done using nltk.stem).
- Strip punctuation.
- Strip the stop words, e.g., “the”, “and”, “or”.
- Strip @ and urls. (It’s Twitter.)
- Something else? Tell us about it

<font color = "blue"> Pre-Processing Steps

In this section we heavily pre-processed the data in the "text" column to remove gibberish, urls, hashtags, etc. and leave clean lemmatized English words, to maximize the efficiency and speed of the bag of words model.

The steps we took to pre-process the data are the following:
- Make the whole text lowercase: This is needed to standardize the text and match words in future steps in a case-insensitive manner.
- Remove, using a regex, any '@', '#' or 'http(s)://' marker together with the characters that follow it, hence removing any mention, hashtag or url included in the text.
- Use a regex to keep only alphabetic characters, removing numbers, dates and other non-alphabetic characters.
- Use WordNetLemmatizer.lemmatize to standardize words. We decided to apply the lemmatizer twice to leverage both modes 'v' and 'n', which we noticed worked better to standardize verbs and plurals respectively (e.g. 'v' works for flooding -> flood, while 'n' works for years -> year).
- Finally, filter out stop words (e.g. the, so), since they do not convey any useful information, and then filter out words not present in the "corpus.words" dictionary provided by nltk. We found that this method eliminates very few meaningful words, but helps a lot in filtering out the gibberish included in the tweets. Lastly, we manually exclude the word "target", as it would create a new column named "target", which clashes with the name of the label column.
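
The regex-based steps in the list above can be tried in isolation. The following is a minimal sketch that uses only the standard library: it leaves out lemmatization and the nltk dictionary filter, and the three-word stop list is a hypothetical stand-in for the nltk one.

```python
import re

# Hypothetical miniature stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "or"}

def clean_tweet(text):
    text = str(text).lower()
    # Drop @mentions, #hashtags and URLs (the marker plus everything up to the next space).
    text = re.sub(r"(?:@|#|https?://)\S+", "", text)
    # Turn every remaining non-letter into a space, then split on whitespace.
    tokens = re.sub(r"[^a-z]", " ", text).split()
    # Keep words longer than two characters that are not stop words.
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

print(clean_tweet("The fire and smoke near @user http://t.co/abc #wildfire 2015!"))
# ['fire', 'smoke', 'near']
```

Note that the first pattern discards the hashtag text itself ("wildfire" in this example), not only the '#' character, so any information carried by hashtags is lost.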

In [297]:
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.tokenize import word_tokenize

Lem = WordNetLemmatizer()
stop_words = stopwords.words("english")
english_words = set(words.words())

def preprocessing_lambda(text):
    # Make everything lowercase.
    out_text = str(text).lower()
    # Remove mentions, hashtags and urls.
    out_text = re.sub(r"(?:\@|\#|https?\://)\S+", "", out_text)
    # Replace every non-alphabetic character with a space.
    out_text = re.sub(r'[^a-zA-Z]', ' ', out_text)

    # Tokenize the sentence.
    token_out = word_tokenize(out_text)

    # Lemmatize words using both settings from nltk as:
    # 'v' works for bombing -> bomb
    # 'n' works for years -> year
    # NOTE: the result is stored in out_text, but the filter below reads
    # token_out, so the returned text is built from the unlemmatized tokens.
    out_text = " ".join([Lem.lemmatize(Lem.lemmatize(w, 'v'), 'n') for w in token_out])

    # Drop stopwords, words missing from the nltk English word list,
    # words of one or two characters, and the literal word "target".
    token_out = [w for w in token_out
                 if w not in stop_words and w in english_words
                 and len(w) > 2 and w != "target"]

    return " ".join(token_out)

In [298]:
df_tot = pd.concat([df_train, df_val, raw_test])
df_tot["text"] = df_tot["text"].apply(preprocessing_lambda)


In [299]:
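# sort_index() puts the labeled rows back in their original Kaggle order, so the
# two slices below split by position (first 70% / last 30%), not by the random draw.
df_train_val = df_tot.iloc[:len(df_train) + len(df_val), :].sort_index().reset_index()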
df_train = df_train_val.iloc[:len(df_train), :].reset_index()
df_val = df_train_val.iloc[len(df_train):, :].reset_index()
df_test = df_tot.iloc[len(df_train) + len(df_val):, :].sort_index().reset_index()


### (d) Bag of words model
The next task is to extract features in order to represent each tweet using
the binary “bag of words” model, as discussed in lectures. The idea is to build a vocabulary
of the words appearing in the dataset, and then to represent each tweet by a feature vector
x whose length is the same as the size of the vocabulary, where xi = 1 if the i’th vocabulary
word appears in that tweet, and xi = 0 otherwise. In order to build the vocabulary, you should
choose some threshold M, and only include words that appear in at least M different tweets;
this is important both to avoid run-time and memory issues, and to avoid noisy/unreliable
features that can hurt learning. Decide on an appropriate threshold M, and discuss how you
made this decision. Then, build the bag of words feature vectors for both the training and
development sets, and report the total number of features in these vectors.
In order to construct these features, we suggest using the CountVectorizer class in sklearn. A
couple of notes on using this function: (1) you should set the option “binary=True” in order to ensure that the feature vectors are binary; and (2) you can use the option “min_df=M” in order to only include in the vocabulary words that appear in at least M different tweets. Finally,
make sure you fit CountVectorizer only once on your training set and use the same instance
to process both your training and development sets (don’t refit it on your development set a
second time).

__Important__: at this point you should only be constructing feature vectors for each data point
using the text in the “text” column. You should ignore the “keyword” and “location” columns
for now.

<font color="blue"> Bag Of Words Tuning

In the bag of words model we opted for M = 0.001 (keep words that appear in at least 0.1% of the tweets).
We tuned this hyperparameter by looking at both the model results on the training and validation sets, and the computational time.

The score on the validation set kept increasing as we lowered M and so increased the number of grams considered, which indicates that the model was gaining performance and not overfitting. We stopped at M = 0.001 since we noticed that reducing it further started to increase the training time without providing significant improvements.
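
To make the role of the threshold concrete, here is a minimal pure-Python sketch of a binary bag of words with a document-frequency cutoff. It assumes the convention that an integer threshold is an absolute number of tweets and a float threshold is a fraction of the tweets, which is how we understand `min_df` in CountVectorizer; the function names are illustrative. Under that reading, with roughly 5,300 training tweets, M = 0.001 keeps words that appear in at least 6 tweets.

```python
from collections import Counter

def build_vocabulary(docs, min_df):
    """Keep words whose document frequency reaches the threshold."""
    n_docs = len(docs)
    # Float -> fraction of documents, int -> absolute document count (assumed convention).
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))   # count each word once per document
    return sorted(w for w, c in doc_freq.items() if c >= threshold)

def binary_vector(doc, vocab):
    """x_i = 1 if the i-th vocabulary word occurs in the document, else 0."""
    present = set(doc.split())
    return [1 if w in present else 0 for w in vocab]

docs = ["fire storm", "fire flood", "calm day"]
vocab = build_vocabulary(docs, 2)
print(vocab)                                  # ['fire']
print(binary_vector("fire fire storm", vocab))  # [1]
```

The vocabulary is built from the training documents only; development and test tweets are then encoded with `binary_vector` against that fixed vocabulary.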

In [300]:
from sklearn.feature_extraction.text import CountVectorizer

M = 0.001

count_vect = CountVectorizer(min_df = M, binary = True) 
count_matrix = count_vect.fit_transform(df_train['text'])
count_array = count_matrix.toarray()

grams_identified = count_vect.get_feature_names_out()
train_vect_features = pd.DataFrame(data=count_array, columns = grams_identified)
df_train_vect = pd.concat([df_train, train_vect_features], axis = 1)
print("Number of grams found: ", len(grams_identified))


Number of grams found:  1116


In [301]:
count_matrix = count_vect.transform(df_val['text'])
count_array = count_matrix.toarray()

val_vect_features = pd.DataFrame(data = count_array, columns = grams_identified)

df_val_vect = pd.concat([df_val, val_vect_features], axis = 1)

In [302]:
count_matrix = count_vect.transform(df_test['text'])
count_array = count_matrix.toarray()

test_vect_features = pd.DataFrame(data = count_array, columns = grams_identified)

df_test_vect = pd.concat([df_test, test_vect_features], axis = 1)

### (e) Logistic regression
In this question, we will be training logistic regression models using
bag of words feature vectors obtained in part (d). We will use the F1-score as the evaluation
metric.

We use the F1-score because it gives a more comprehensive view of classifier performance than
accuracy. For more information on this metric see F1-score.
We ask you to train the following classifiers. We suggest using the LogisticRegression
implementation in sklearn.
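
For reference, the F1-score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand for 0/1 labels; the function name `f1_score_binary` is illustrative, and returning 0.0 when there are no true positives is a convention chosen here.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0   # convention used in this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two true positives, one false positive, one false negative -> F1 = 2/3.
print(f1_score_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))

# Always predicting 0 is right on 4 of 5 examples here, yet its F1 is 0.
print(f1_score_binary([0, 0, 0, 0, 1], [0, 0, 0, 0, 0]))
```

The second call illustrates why F1 is more informative than accuracy when the positive class is what matters.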

#### i. Train a logistic regression model without regularization terms
You will notice that the
default sklearn logistic regression utilizes L2 regularization. You can turn off L2 regularization by changing the penalty parameter. Report the F1 score in your training and
in your development set. Comment on whether you observe any issues with overfitting
or underfitting.

<font color='blue'> Overfitting/Underfitting Concerns


At first glance, the model obtains good results on the training set but does not generalize well to the validation set (train F1 0.825 vs. validation F1 0.610), which hints at some degree of overfitting.

However, if the model were actually overfitting, increasing the max_iter parameter to achieve better convergence should result in even poorer performance on the validation set. We tried increasing max_iter up to 1500 and the validation performance improved, so we conclude that the model is not actually overfitting.

In [303]:
X_train = df_train_vect.iloc[:, len(df_train.columns):]
Y_train = df_train_vect["target"]
X_val = df_val_vect.iloc[:, len(df_val.columns):]
Y_val = df_val_vect["target"]
X_test = df_test_vect.iloc[:, len(df_test.columns):]

In [310]:
from sklearn.linear_model import LogisticRegression

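# penalty = "none" was deprecated in scikit-learn 1.2 and later removed;
# newer releases expect penalty = None instead.
logistic_regr_none = LogisticRegression(penalty = "none", max_iter = 1000)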
logistic_regr_none.fit(X_train, Y_train)

In [311]:
from sklearn.metrics import f1_score

train_preds = logistic_regr_none.predict(X_train)
train_F1_none = f1_score(Y_train, train_preds)

In [312]:
val_preds = logistic_regr_none.predict(X_val)
val_F1_none = f1_score(Y_val, val_preds)
print("Train F1 None: ", train_F1_none, "\tVal F1 None: ", val_F1_none)

Train F1 None:  0.8252516010978957 	Val F1 None:  0.6097186700767263


#### ii. Train a logistic regression model with L1 regularization
Sklearn provides some good
examples for implementation. Report the performance on both the training and the
development sets.

In [313]:
coeffs = [10, 1.0, 0.1]
for c in coeffs:
    logistic_regr_l1 = LogisticRegression(C = c, penalty = "l1", solver = "liblinear")
    logistic_regr_l1.fit(X_train, Y_train)
    
    train_preds = logistic_regr_l1.predict(X_train)
    val_preds = logistic_regr_l1.predict(X_val)   
    
    train_F1_l1 = f1_score(Y_train, train_preds)
    val_F1_l1 = f1_score(Y_val, val_preds)

    print("c: ", c, "\tTrain F1 L1: ", train_F1_l1, "\tVal F1 L1: ", val_F1_l1)

c:  10 	Train F1 L1:  0.8233117483811286 	Val F1 L1:  0.6070861977789529
c:  1.0 	Train F1 L1:  0.7830052808449351 	Val F1 L1:  0.5758477096966093
c:  0.1 	Train F1 L1:  0.5261282660332541 	Val F1 L1:  0.3


#### iii. Similarly, train a logistic regression model with L2 regularization 
Report the performance on the training and the development sets.

In [314]:
logistic_regr_l2 = LogisticRegression()
logistic_regr_l2.fit(X_train, Y_train)

In [315]:
train_preds = logistic_regr_l2.predict(X_train)
train_F1_l2 = f1_score(Y_train, train_preds)

val_preds = logistic_regr_l2.predict(X_val)
val_F1_l2 = f1_score(Y_val, val_preds)

print("Train F1 L2: ", train_F1_l2, "\tVal F1 L2: ", val_F1_l2)

Train F1 L2:  0.7926944971537002 	Val F1 L2:  0.6017391304347827


#### iv. Which one of the three classifiers performed the best on your training and development set? 
Did you observe any overfitting and did regularization help reduce it? Support your answers with the classifier performance you got.


<font color='blue'> Best Classifier

For L1 regularization, we tried three models with C = 10, 1 and 0.1 respectively (in sklearn, C is the inverse of the regularization strength, so a smaller C means stronger regularization).
C = 10 clearly outperforms the other two on the development set (F1 of 0.607 vs. 0.576 and 0.300), hence it is the L1 model considered in further analysis.

The non-regularized model very slightly outperforms the L1 and L2 regularized models on the development set (0.610 vs. 0.607 and 0.602).
The three models perform very similarly, which hints that overfitting is not the limiting factor, hence L1 and L2 regularization do not bring any improvement.

#### v. Inspect the weight vector of the classifier with L1 regularization (in other words, look at the θ you got after training)
You can access the weight vector of the trained model using the coef_ attribute of a LogisticRegression instance. What are the most important words for deciding whether a tweet is about a real disaster or not?

<font color='blue'> L1 regularization top words

It looks like the 10 most important words found by L1 all describe objects or events strongly connected to a disastrous scenario: debris, atomic, spill, train, outbreak, fire, derailment, storm, accident and mass.

This shows both that the model is correct in identifying insightful words in the "disasters dictionary" as well as that the pre-processing did a great job at getting rid of all the non-important words that the model could have mistakenly considered as important features.
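
The list above contains only the largest positive weights, i.e. the words that push a prediction towards "real disaster". Words with large negative weights push the other way, and L1 regularization tends to set many weights to exactly zero. A minimal sketch of ranking both ends of a weight vector follows; the words and weights are made up for illustration and are not taken from the trained model.

```python
def top_weights(words, weights, n=3):
    """Return the n most positive and the n most negative (word, weight) pairs."""
    ranked = sorted(zip(weights, words), reverse=True)   # descending by weight
    most_positive = [(w, c) for c, w in ranked[:n]]
    most_negative = [(w, c) for c, w in ranked[::-1][:n]]
    return most_positive, most_negative

# Made-up words and weights, for illustration only.
words = ["alpha", "bravo", "charlie", "delta", "echo"]
weights = [0.9, -1.2, 0.0, 0.4, -0.3]

pos, neg = top_weights(words, weights, n=2)
print(pos)   # [('alpha', 0.9), ('delta', 0.4)]
print(neg)   # [('bravo', -1.2), ('echo', -0.3)]
```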

In [316]:
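# logistic_regr_l1 refers to the last model fitted in the loop above (C = 0.1).
l1_coeffs = np.array(logistic_regr_l1.coef_[0])

# Indices of the n largest (most positive) coefficients, sorted in descending order.
n = 10
idx = np.argpartition(l1_coeffs, -n)[-n:]
indices = idx[np.argsort((-l1_coeffs)[idx])]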

for c, w in zip(l1_coeffs[indices], X_train.columns[indices]):
    print("Word: ", w, " " * (15 - len(w)), "L1 Coefficient: ", round(c, 4))

Word:  debris           L1 Coefficient:  1.6307
Word:  atomic           L1 Coefficient:  1.4876
Word:  spill            L1 Coefficient:  1.4407
Word:  train            L1 Coefficient:  1.1221
Word:  outbreak         L1 Coefficient:  1.0948
Word:  fire             L1 Coefficient:  1.0553
Word:  derailment       L1 Coefficient:  1.0479
Word:  storm            L1 Coefficient:  0.8163
Word:  accident         L1 Coefficient:  0.7453
Word:  mass             L1 Coefficient:  0.729


### (f) Bernoulli Naive Bayes
Implement a Bernoulli Naive Bayes classifier to predict the probability of whether each tweet is about a real disaster. Train this classifier on the training set, and report its F1-score on the development set.

__Important__: For this question you should implement the classifier yourself similar to what
was shown in class, without using any existing machine learning libraries such as sklearn.
You may only use basic libraries such as numpy.



In [317]:
class myBernoulliNB(object):
    def __init__(self, alpha = 1.0):
        self.alpha = alpha

    def fit(self, X, Y):
        n, d  = X.shape

        K = 2 # Binary classes = 2

        self.psis = np.zeros([K, d])
        self.phis = np.zeros([K])

        for k in range(K):
            X_l = X[Y == k]
            
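            # Laplace-smoothed estimate of P(word present | class k).
            # Note the denominator: np.sum(X_l) is the total number of word occurrences
            # in class k, whereas the usual Bernoulli estimate divides by the number of
            # class-k tweets, X_l.shape[0] + 2 * alpha.
            self.psis[k] = (np.sum(X_l, axis = 0) + self.alpha) / (np.sum(X_l) + 2 * self.alpha)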
            self.phis[k] = X_l.shape[0] / float(n)
        
        self.psis = np.reshape(self.psis, (K, 1, d))
        self.psis = self.psis.clip(1e-14, 1-1e-14)

        return self


    def predict(self, X):
        n, d = X.shape
        X = np.reshape(X, (1, n, d))
 
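        # One log-prior per class, shaped (K, 1) so that it broadcasts over the n tweets.
        # K is local to fit(), so the shape is inferred here instead.
        logpy = np.log(self.phis).reshape([-1, 1])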
        logpxy = X * np.log(self.psis) + (1 - X) * np.log(1 - self.psis)
        logpyx = logpxy.sum(axis = 2) + logpy
        return logpyx.argmax(axis = 0).flatten()

As you work on this problem, you may find that some words in the vocabulary occur in the
development set but are not in the training set. As a result, the standard Naive Bayes model
learns to assign them an occurrence probability of zero, which becomes problematic when
we observe this "zero probability" event on our development set.
The solution to this problem is a form of regularization called Laplace smoothing or additive
smoothing. The idea is to use "pseudo-counts", i.e. to increment the number of times we
have seen each word or document by some number of "virtual" occurrences α. Thus, the
Naive Bayes model will behave as if every word or document has been seen at least α times.
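
A minimal pure-Python sketch of this idea for the Bernoulli model is shown below, assuming the textbook estimate in which the probability that word j is present in class k is (number of class-k tweets containing word j + α) / (number of class-k tweets + 2α). The function names are illustrative, and this is a sketch under those assumptions rather than the implementation used in the cells of this notebook.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: list of 0/1 feature lists; y: list of 0/1 labels."""
    d = len(X[0])
    priors, psis = [], []
    for k in (0, 1):
        rows = [x for x, label in zip(X, y) if label == k]
        n_k = len(rows)
        priors.append(n_k / len(X))
        # Smoothed fraction of class-k tweets that contain word j.
        psis.append([(sum(r[j] for r in rows) + alpha) / (n_k + 2 * alpha)
                     for j in range(d)])
    return priors, psis

def predict_bernoulli_nb(x, priors, psis):
    """Return the class with the larger log-posterior (up to a constant)."""
    scores = []
    for k in (0, 1):
        s = math.log(priors[k])
        for xj, p in zip(x, psis[k]):
            s += math.log(p) if xj else math.log(1.0 - p)
        scores.append(s)
    return scores.index(max(scores))

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
priors, psis = fit_bernoulli_nb(X, y, alpha=1.0)
print(psis)                                        # [[0.25, 0.5], [0.75, 0.5]]
print(predict_bernoulli_nb([1, 0], priors, psis))  # 1
```

With `alpha=0` the first word gets probability exactly 0 in class 0, since no class-0 tweet contains it; taking the logarithm of that value is the "zero probability" problem that smoothing avoids.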

In [318]:
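# alpha = 1 applies Laplace (add-one) smoothing, despite the variable name.
BNB_nosmoothing = myBernoulliNB(alpha = 1)

# The class works on numpy arrays, so convert the pandas objects first.
x_train_numpy = X_train.to_numpy()
y_train_numpy = Y_train.to_numpy()
BNB_nosmoothing.fit(x_train_numpy, y_train_numpy)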
bnb_nosmoothing_train_preds = BNB_nosmoothing.predict(x_train_numpy)
bnb_nosmoothing_val_preds = BNB_nosmoothing.predict(X_val.to_numpy())


F1_BNB_train = f1_score(Y_train, bnb_nosmoothing_train_preds)
F1_BNB_val = f1_score(Y_val, bnb_nosmoothing_val_preds)
print("Train F1 BernoulliNB: ", F1_BNB_train, "\tVal F1 BernoulliNB: ", F1_BNB_val)

Train F1 BernoulliNB:  0.7655061102144339 	Val F1 BernoulliNB:  0.6568421052631579


### (g) Model comparison
You just implemented a generative classifier and a discriminative classifier. Reflect on the following:
- Which model performed the best in predicting whether a tweet is of a real disaster or not? Include your performance metric in your response. Comment on the pros and cons of using generative vs discriminative models.
- Think about the assumptions that Naive Bayes makes. How are the assumptions different from logistic regressions? Discuss whether it’s valid and efficient to use Bernoulli Naive Bayes classifier for natural language texts.

<font color='blue'> Comparing Models

The BernoulliNB model generalizes best to the validation set (F1 = 0.657), while logistic regression with L1, L2 and no regularization obtains very similar F1 scores (about 0.60 to 0.61).

Generative models like Bernoulli Naive Bayes are less computationally expensive (we could see in this exercise that Logistic Regression starts to be slow for large "max_iter" values) and generally do better when there is a limited amount of data. Discriminative algorithms like Logistic Regression, instead, are more robust against outliers.

A key difference between the two models is that Naive Bayes assumes the features are conditionally independent given the class, while logistic regression makes no such assumption.
This difference seems to be the reason why BernoulliNB performs better on the development set (i.e. generalizes better to unseen datapoints) but achieves a lower score on the training set (0.766 vs. roughly 0.79 to 0.83).

In comparing the models, we must also note that the heavy pre-processing approach we opted for has a great impact on the general performance of the models, which would be significantly higher with fewer pre-processing steps. However, for the sake of approaching the NLP task in the most logical way, we felt that the reduction in data size produced by the extra pre-processing steps is more valuable in a real prediction scenario than brute forcing bag-of-words on more raw tweets.


### (h) N-gram model
The N-gram model is similar to the bag of words model, but instead of using
individual words we use N-grams, which are contiguous sequences of words. For example,
using N = 2, we would say that the text “Alice fell down the rabbit hole” consists of the sequence of 2-grams: ["Alice fell", "fell down", "down the", "the rabbit", "rabbit hole"], and the
following sequence of 1-grams: ["Alice", "fell", "down", "the", "rabbit", "hole"]. All eleven
of these symbols may be included in the vocabulary, and the feature vector x is defined according to xi = 1 if the i’th vocabulary symbol occurs in the tweet, and xi = 0 otherwise.
Using N = 2, construct feature representations of the tweets in the training and development tweets. Again, you should choose a threshold M, and only include symbols in the vocabulary
that occur in at least M different tweets in the training set. Discuss how you chose the threshold M, and report the total number of 1-grams and 2-grams in your vocabulary. In addition,
take 10 2-grams from your vocabulary, and print them out.
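
The "Alice" example can be reproduced with a few lines of plain Python. This sketch splits on whitespace only and keeps the original casing; CountVectorizer applies its own tokenization and lowercasing by default, so its symbols would differ slightly.

```python
def ngrams(text, n):
    """Contiguous word sequences of length n, joined by a single space."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Alice fell down the rabbit hole"
one_grams = ngrams(sentence, 1)
two_grams = ngrams(sentence, 2)

print(two_grams)
# ['Alice fell', 'fell down', 'down the', 'the rabbit', 'rabbit hole']
print(len(one_grams) + len(two_grams))   # 11
```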

Then, implement a logistic regression and a Bernoulli classifier to train on 2-grams. You may
reuse the code in (e) and (f). You may also choose to use or not use a regularization term,
depending on what you got from (e). Report your results on training and development set.
Do these results differ significantly from those using the bag of words model? Discuss what
this implies about the task.

Again, we suggest using CountVectorizer to construct these features. In order to include both
1-gram and 2-gram features, you can set ngram_range=(1,2). Note also that in this case,
since there are probably many different 2-grams in the dataset, it is especially important to carefully set min_df in order to avoid run-time and memory issues.

<font color='blue'> 2-grams Only Considerations


Using only 2-grams results in a very underfitted model.
First, we had to drop the value of M significantly, to account for the fact that 2-grams are generally less likely to appear in multiple tweets.
After tuning M to generate a number of grams similar to what we got for 1-grams, we trained the model and observed the results. The F1-score on the validation set is very low (about 0.13). That was expected, as 2-grams found in the train set are less likely to appear in the validation set too, so most features probably do not convey much information.

Furthermore, we heavily preprocessed the data. A 2-gram model trained on less processed data (for example, without filtering out stopwords) would likely perform better, since a stop word and a disaster term could combine to create insightful 2-grams.

In [207]:
M = 0.0004
twogram_vect = CountVectorizer(min_df = M, binary = True, ngram_range = (2, 2)) 
count_matrix = twogram_vect.fit_transform(df_train['text'])
count_array = count_matrix.toarray()

grams_identified = twogram_vect.get_feature_names_out()
train_twogram_vect_features = pd.DataFrame(data=count_array, columns = grams_identified)

df_train_twogram_vect = pd.concat([df_train, train_twogram_vect_features], axis = 1)

In [208]:
count_matrix = twogram_vect.transform(df_val['text'])
count_array = count_matrix.toarray()

val_twogram_vect_features = pd.DataFrame(data = count_array, columns = twogram_vect.get_feature_names_out())

df_val_twogram_vect = pd.concat([df_val, val_twogram_vect_features], axis = 1)

In [209]:
twograms = list(filter(lambda x : " " in x, grams_identified))
print("two grams count over total grams count: ", len(twograms), "/", len(grams_identified))

two grams count over total grams count:  946 / 946


In [210]:
print(twograms[:10])

['abandoned aircraft', 'access top', 'accident man', 'accident property', 'active exploit', 'added video', 'affected fatal', 'ago today', 'air accident', 'air ambulance']


In [211]:
X_train = df_train_twogram_vect.iloc[:, len(df_train.columns):]
Y_train = df_train_twogram_vect["target"]
X_val = df_val_twogram_vect.iloc[:, len(df_val.columns):]
Y_val = df_val_twogram_vect["target"]

In [214]:
logistic_regr_l2 = LogisticRegression(solver = "liblinear")
logistic_regr_l2.fit(X_train, Y_train)

In [215]:
train_preds = logistic_regr_l2.predict(X_train)
train_F1_l2 = f1_score(Y_train, train_preds)

val_preds = logistic_regr_l2.predict(X_val)
val_F1_l2 = f1_score(Y_val, val_preds)

print("Train F1 L2: ", train_F1_l2, "\tVal F1 L2: ", val_F1_l2)

Train F1 L2:  0.5606617647058824 	Val F1 L2:  0.12730627306273065


<font color='blue'> Combined 1-grams and 2-grams Considerations

We now used both 1-grams and 2-grams (i.e. set the range to (1,2)).
This allows the model to train on insightful 1grams as well as common 2-grams. 
This improves on the 1-gram L2-regularized LogisticRegression model very slightly (validation F1 of 0.606 vs. 0.602).

In [216]:
M = 0.001
twogram_vect = CountVectorizer(min_df = M, binary = True, ngram_range = (1, 2)) 
count_matrix = twogram_vect.fit_transform(df_train['text'])
count_array = count_matrix.toarray()

grams_identified = twogram_vect.get_feature_names_out()
train_twogram_vect_features = pd.DataFrame(data=count_array, columns = grams_identified)

df_train_twogram_vect = pd.concat([df_train, train_twogram_vect_features], axis = 1)

count_matrix = twogram_vect.transform(df_val['text'])
count_array = count_matrix.toarray()

val_twogram_vect_features = pd.DataFrame(data = count_array, columns = twogram_vect.get_feature_names_out())

df_val_twogram_vect = pd.concat([df_val, val_twogram_vect_features], axis = 1)

twograms = list(filter(lambda x : " " in x, grams_identified))
print("two grams count over total grams count: ", len(twograms), "/", len(grams_identified))

two grams count over total grams count:  275 / 1391


In [217]:
print(twograms[:10])

['abandoned aircraft', 'added video', 'affected fatal', 'air accident', 'air ambulance', 'aircraft debris', 'airplane accident', 'airplane debris', 'ambulance helicopter', 'amid crisis']


In [218]:
X_train = df_train_twogram_vect.iloc[:, len(df_train.columns):]
Y_train = df_train_twogram_vect["target"]
X_val = df_val_twogram_vect.iloc[:, len(df_val.columns):]
Y_val = df_val_twogram_vect["target"]

logistic_regr_l2 = LogisticRegression(solver = "liblinear")
logistic_regr_l2.fit(X_train, Y_train)

In [219]:
train_preds = logistic_regr_l2.predict(X_train)
train_F1_l2 = f1_score(Y_train, train_preds)

val_preds = logistic_regr_l2.predict(X_val)
val_F1_l2 = f1_score(Y_val, val_preds)

print("Train F1 L2: ", train_F1_l2, "\tVal F1 L2: ", val_F1_l2)

Train F1 L2:  0.801909307875895 	Val F1 L2:  0.6062717770034843


### (i) Determine performance with the test set 
Re-build your feature vectors and re-train your preferred classifier (either bag of word or n-gram using either logistic regression or Bernoulli naive bayes) using the entire Kaggle training data (i.e. using all of the data in both the training and development sets). Then, test it on the Kaggle test data. Submit your results to Kaggle,
and report the resulting F1-score on the test data, as reported by Kaggle. Was this lower or
higher than you expected? Discuss why it might be lower or higher than your expectation.

<font color='blue'> Results on test set


#### Kaggle Score: 0.72785

The results obtained are better than what we expected given the performance of the model on the validation set. 

We think that might be because the heavy pre-processing that we opted for might have resulted in some important information being lost. By running the bag of words on unprocessed data, we noticed words like "hiroshima" and "california" being quite common and informative. However these words were lost in the pre-processing due to the "english vocabulary" filter. That might have caused the validation score to drop.

In [262]:
BNB_nosmoothing = myBernoulliNB(alpha = 1)
BNB_nosmoothing.fit(x_train_numpy, y_train_numpy)

test_preds = BNB_nosmoothing.predict(X_test.to_numpy())
# test_preds = logistic_regr_l2.predict(X_test)

In [263]:
pred_df = pd.DataFrame(columns = ['id', 'target'])
pred_df["id"] = df_tot.iloc[len(df_train) + len(df_val):, 0].sort_index()
pred_df["target"] = test_preds.astype("int")

pred_df.to_csv('./results/test_predictions.csv', sep=',', index = False)