<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex1/ex1_nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML4NLP1
## Starting Point for Exercise 1, part II

This notebook is supposed to serve as a starting point and/or inspiration when starting exercise 1, part II.

One of the goals of this exercise is to make you acquainted with **skorch**. You will probably need to consult the [documentation](https://skorch.readthedocs.io/en/stable/).

# Installing skorch and loading libraries

In [1]:
import subprocess

# Installation on Google Colab
try:
    import google.colab
    subprocess.run(['python', '-m', 'pip', 'install', 'skorch'])
except ImportError:
    pass

In [2]:
!pip install skorch
!pip install gdown



In [3]:
import torch
from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

import pandas as pd
import numpy as np
import csv
import re
import string
from collections import defaultdict

# Set seed for reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Training a classifier and making predictions

In [4]:
# Download dataset
!gdown 1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs # x_train
!gdown 1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6 # x_test
!gdown 1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl # y_train
!gdown 1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X # y_test

Downloading...
From: https://drive.google.com/uc?id=1QP6YuwdKFNUPpvhOaAcvv2Pcp4JMbIRs
To: /kaggle/working/x_train.txt
100%|███████████████████████████████████████| 64.1M/64.1M [00:00<00:00, 206MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QVo7PZAdiZKzifK8kwhEr_umosiDCUx6
To: /kaggle/working/x_test.txt
100%|███████████████████████████████████████| 65.2M/65.2M [00:00<00:00, 231MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QbBeKcmG2ZyAEFB3AKGTgSWQ1YEMn2jl
To: /kaggle/working/y_train.txt
100%|█████████████████████████████████████████| 480k/480k [00:00<00:00, 109MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QaZj6bI7_78ymnN8IpSk4gVvg-C9fA6X
To: /kaggle/working/y_test.txt
100%|█████████████████████████████████████████| 480k/480k [00:00<00:00, 111MB/s]


In [5]:
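# Read one sample per line; the files contain many scripts, so be explicit about UTF-8
with open('x_train.txt', encoding='utf-8') as f:
    x_train = f.read().splitlines()
with open('y_train.txt', encoding='utf-8') as f:
    y_train = f.read().splitlines()
with open('x_test.txt', encoding='utf-8') as f:
    x_test = f.read().splitlines()
with open('y_test.txt', encoding='utf-8') as f:
    y_test = f.read().splitlines()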

In [6]:
# Combine x_train and y_train into one dataframe
train_df = pd.DataFrame({'text': x_train, 'label': y_train})
# Write train_df to csv with tab as separator
train_df.to_csv('train_df.csv', index=False, sep='\t')
# Combine x_test and y_test into one dataframe
test_df = pd.DataFrame({'text': x_test, 'label': y_test})
# Inspect the first 5 items in the train split
train_df.head()

|   | text | label |
|---|------|-------|
| 0 | Klement Gottwaldi surnukeha palsameeriti ning ... | est |
| 1 | Sebes, Joseph; Pereira Thomas (1961) (på eng).... | swe |
| 2 | भारतीय स्वातन्त्र्य आन्दोलन राष्ट्रीय एवम क्षे... | mai |
| 3 | Après lo cort periòde d'establiment a Basilèa,... | oci |
| 4 | ถนนเจริญกรุง (อักษรโรมัน: Thanon Charoen Krung... | tha |


### Data preparation

Prepare your dataset for this experiment using the same method as you did in part 1.

Get a subset of the train/test data that includes 20 languages. Include English, German, Dutch, Danish, Swedish, Norwegian, and Japanese, plus 13 additional languages of your choice based on the items in the list of labels.

Don't forget to encode your labels using the adjusted code snippet from part 1!


In [7]:
# TODO: Create your train/test subsets of languages
# Note, make sure these are the same as what you used in Part 1!

from sklearn.model_selection import train_test_split

language_filter = ['eng','deu','nld','dan','swe','nob','jpn', #basics
                   'fra', 'spa', 'rus', 'por', 'ita', 'kor', 'ara', 'zho', 'hin', 'tam', 'tha', 'vie', 'fin' #additionals
                   ]
# Filter x and y based on the language filter
filtered_x = [text for text,label in zip(x_train + x_test,y_train + y_test) if label in language_filter]
filtered_y = [label for label in y_train + y_test if label in language_filter]

# Split the train/test data into 8:2
x_train,x_test,y_train,y_test = train_test_split(filtered_x,filtered_y,test_size = 0.2,random_state=42)

#display
print(x_train[:5])
print(y_train[:5])

['銀行券は帝国国庫及びドイツ帝国銀行(Reichsbank)から発行され、帝国のいくつかの構成国の銀行からも発行された。帝国国庫発行の帝国紙幣(Reichskassenschein)は5、10、20、50マルクが発行された一方、ドイツ帝国銀行券(Reichsbanknote)は20、50、100、1000マルクが発行された。1914年以降に発行されたこれらの銀行券はパピエルマルクと呼ばれる。', 'في عام 2007، كرئيس أساقفة و كاردينال بوينس آيرس، قدم بيرجوليو النسخة النهائية من البيان المشترك الصادر عن أساقفة أمريكا اللاتينية المسمى "وثيقة أباريسيدا" بعد إقراره من قبل البابا بندكت السادس عشر. نصت الوثيقة على ضرورة الامتثال و قبول تعاليم الكنيسة ضد "جرائم نكراء" مثل الإجهاض والقتل الرحيم: "نأمل أن المشرعين ورؤساء الحكومات، والعاملين في مجال الصحة، سيدركون كرامة الحياة الإنسانية وأهمية العائلة في شعوبنا، و سيدافعون عن حمايتها من جرائم نكراء مثل الإجهاض والقتل الرحيم، وهذه هي مسؤوليتهم. ونحن نلزم أنفسنا "تماسك إفخارستي"، بما معناه، يجب أن نكون واعين بأن الناس لا يستطيعون الحصول على القربان المقدس وفي الوقت نفسه هم يعملون ضد الوصايا، ولا سيما عندما يوافقون على الإجهاض والقتل الرحيم، وغيرها من الجرائم الخطيرة ضد الحياة والعائلة، وهو ينطبق بشكل خاص على مسؤولية المشرعين والحكام، والعاملين

In [8]:
# TODO: Use your adjusted code from part 1 to encode the labels again
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder().fit(y_train)
y_train, y_test = label_encoder.transform(y_train), label_encoder.transform(y_test)
print(label_encoder.classes_)
print(y_train)
print(y_test)

['ara' 'dan' 'deu' 'eng' 'fin' 'fra' 'hin' 'ita' 'jpn' 'kor' 'nld' 'nob'
 'por' 'rus' 'spa' 'swe' 'tam' 'tha' 'vie' 'zho']
[ 8  0  4 ... 19 18  1]
[ 7 15 12 ...  1  1  0]


### Feature Extraction

In [9]:
# First, we extract some simple features as input for the neural network
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=100, binary=True)
X = vectorizer.fit_transform(x_train)

In [10]:
# We need to change the datatype to make it play nice with pytorch
X = X.astype(np.float32)
y = y_train.astype(np.int64)

In the following, we define a vanilla neural network with two hidden layers. The output layer should have as many outputs as there are classes. In addition, it should have a nonlinearity function.

In [11]:
# TODO: In the following, you can find a small (almost) working example of a neural network.
# Unfortunately, again, the cat messed up some of the code. Please fix the code such that it is executable. (Hint: the input and output sizes look a bit weird...)

class ClassifierModule(nn.Module):
    def __init__(
        self,
        num_units=200,
        nonlin=F.relu,
        num_classes=20,
        input_size=100,
    ):
        super().__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        # input -> hidden layer 1 -> hidden layer 2 -> one logit per class
        self.dense0 = nn.Linear(input_size, num_units)
        self.dense1 = nn.Linear(num_units, 50)
        self.output = nn.Linear(50, num_classes)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.nonlin(self.dense1(X))
        # Return raw logits of shape (batch, num_classes);
        # nn.CrossEntropyLoss applies log-softmax internally.
        return self.output(X)


In [12]:
# Initialise the neural net classifier.
net = NeuralNetClassifier(
    ClassifierModule(
        input_size=X.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this to train with CPU
)

In [13]:
# Train the classifier
net.fit(X, y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.7376       0.2303        2.4827  1.8264
      2        2.1024       0.4122        1.7188  1.6383
      3        1.5182       0.5319        1.3270  1.6527
      4        1.2503       0.6306        1.1424  1.6323
      5        1.0864       0.6506        1.0249  1.6339
      6        0.9856       0.6606        0.9579  1.6523
      7        0.9264       0.6700        0.9192  1.7078
      8        0.8891       0.6753        0.8941  1.7495
      9        0.8632       0.6787        0.8764  1.6792
     10        0.8434       0.6819        0.8627  1.6645
     11        0.8274       0.68

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=100, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=50, bias=True)
    (output): Linear(in_features=50, out_features=20, bias=True)
  ),
)

In [14]:
X_test = vectorizer.transform(x_test)
X_test = X_test.astype(np.float32)
y_test_np = np.array(y_test, dtype=np.int64)

y_pred = net.predict(X_test)
test_accuracy = np.mean(y_pred == y_test_np)
print(f"Test Accuracy: {test_accuracy}")


Test Accuracy: 0.682


### Experimenting with a better count vectorizer

In [15]:
vectorizer_updated = CountVectorizer(analyzer='char', ngram_range=(2, 2), max_features=5000, binary=True)
X_cv_updated = vectorizer_updated.fit_transform(x_train)
X_cv_updated = X_cv_updated.astype(np.float32)
y_cv_updted = y_train.astype(np.int64)

net_cv_updated = NeuralNetClassifier(
    ClassifierModule(
        input_size=X_cv_updated.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this to train with CPU
)

In [16]:
net_cv_updated.fit(X_cv_updated, y_cv_updted)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.3122       0.6966        1.2376  1.8738
      2        0.7029       0.8925        0.3870  1.8367
      3        0.2740       0.9647        0.2131  1.9036
      4        0.1582       0.9731        0.1500  1.8623
      5        0.1123       0.9766        0.1245  1.8618
      6        0.0886       0.9781        0.1110  1.8986
      7        0.0732       0.9788        0.1026  1.8417
      8        0.0620       0.9794        0.0969  1.8381
      9        0.0534       0.9803        0.0927  1.8774
     10        0.0464       0.9809        0.0898  1.9043
     11        0.0406       0.98

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=5000, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=50, bias=True)
    (output): Linear(in_features=50, out_features=20, bias=True)
  ),
)

In [17]:
X_cv_updated_test = vectorizer_updated.transform(x_test)
X_cv_updated_test = X_cv_updated_test.astype(np.float32)
y_test_np = np.array(y_test, dtype=np.int64)

y_cv_updated_pred = net_cv_updated.predict(X_cv_updated_test)
test_cv_updated_accuracy = np.mean(y_cv_updated_pred == y_test_np)
print(f"Test Accuracy: {test_cv_updated_accuracy}")

Test Accuracy: 0.98075


### Experimenting with TF-IDF vectorizer instead of count vectorizer

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TF-IDF vectorizer for better feature representation
vectorizer_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2,4), max_features=5000, use_idf=True) # Increased ngram range and max_features
X_tfidf = vectorizer_tfidf.fit_transform(x_train)
X_tfidf = X_tfidf.astype(np.float32)
y_tfidf = y_train.astype(np.int64)


# Initialise the neural net classifier.
net_tfid = NeuralNetClassifier(
    ClassifierModule(
        input_size=X_tfidf.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this to train with CPU
)

# Train the classifier
net_tfid.fit(X_tfidf, y_tfidf)


  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.9951       0.0872        2.9903  1.8490
      2        2.9857       0.1500        2.9791  1.9064
      3        2.9710       0.4909        2.9594  1.9040
      4        2.9404       0.6366        2.9112  1.8626
      5        2.8476       0.4706        2.7438  1.8867
      6        2.5506       0.4234        2.3325  1.8923
      7        2.1478       0.7450        1.9551  1.8532
      8        1.7604       0.7822        1.5475  1.8669
      9        1.3485       0.8159        1.1456  1.8581
     10        0.9953       0.8512        0.8554  1.9163
     11        0.7498       0.9025

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=ClassifierModule(
    (dense0): Linear(in_features=5000, out_features=200, bias=True)
    (dense1): Linear(in_features=200, out_features=50, bias=True)
    (output): Linear(in_features=50, out_features=20, bias=True)
  ),
)

In [19]:
X_test_tfidf = vectorizer_tfidf.transform(x_test)
X_test_tfidf = X_test_tfidf.astype(np.float32)
y_test_np = np.array(y_test, dtype=np.int64)

y_pred_tfidf = net_tfid.predict(X_test_tfidf)
test_accuracy_tfidf = np.mean(y_pred_tfidf == y_test_np)
print(f"Test Accuracy with TF-IDF: {test_accuracy_tfidf}")


Test Accuracy with TF-IDF: 0.96575


In [20]:
from sklearn.model_selection import GridSearchCV
from skorch.callbacks import EarlyStopping

# Define the parameter grid for GridSearchCV
param_grid = {
    'module__num_units': [100, 200, 300],
    'module__nonlin': [F.relu], #, F.tanh],
    'module__input_size': [X_tfidf.shape[1]],
    'lr': [0.01, 0.1],
    'max_epochs': [20, 30],
    'callbacks': [[('EarlyStopping', EarlyStopping(patience=patience))] for patience in [5]]
}

net_tfidf_gs = NeuralNetClassifier(
    ClassifierModule(
        input_size=X_tfidf.shape[1],
        num_units=200,
        num_classes=len(label_encoder.classes_),
        nonlin=F.relu,
    ),
    max_epochs=20,
    criterion=nn.CrossEntropyLoss(),
    lr=0.1,
    device='cuda',  # comment this to train with CPU
)

In [21]:
# Create GridSearchCV object
gs = GridSearchCV(net_tfidf_gs, param_grid, refit=True, cv=3, scoring='accuracy')

# Fit the GridSearchCV object
gs.fit(X_tfidf, y_tfidf)

# Print the best parameters and score
print("Best parameters:", gs.best_params_)
print("Best score:", gs.best_score_)

# You can now use the best estimator to make predictions
best_model = gs.best_estimator_

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        2.9994       0.0487        2.9990  1.2689
      2        2.9986       0.0487        2.9982  1.2490
      3        2.9979       0.0487        2.9975  1.2642
      4        2.9972       0.0487        2.9968  1.2499
      5        2.9965       0.0487        2.9962  1.2368
      6        2.9958       0.0487        2.9955  1.2814
      7        2.9952       0.0487        2.9949  1.2331
      8        2.9946       0.0487        2.9942  1.2916
      9        2.9939       0.0487        2.9936  1.2477
     10        2.9933       0.0487        2.9929  1.2509
     11        2.9926       0.0487        2.9922  1.2408
     12        2.9919       0.0487

In [25]:
# Predict and get accuracy using best model

y_pred_best = best_model.predict(X_test_tfidf)
test_accuracy_best = np.mean(y_pred_best == y_test_np)
print(f"Test Accuracy with Best Model: {test_accuracy_best}")

Test Accuracy with Best Model: 0.97175


Note, you can also use `GridSearchCV` with `skorch`, but be aware that training a neural network takes much more time.
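To make the cost concrete, the number of training runs a grid search triggers can be counted before starting it. The following is a minimal sketch using the grid from the cell above; the string `"relu"` stands in for `F.relu`, the single-option `callbacks` entry is left out, and the variable names are illustrative.

```python
from itertools import product

# Grid copied from the param_grid cell above
# (callbacks omitted because it has only one option).
grid = {
    "module__num_units": [100, 200, 300],
    "module__nonlin": ["relu"],
    "module__input_size": [5000],
    "lr": [0.01, 0.1],
    "max_epochs": [20, 30],
}
cv_folds = 3  # cv=3 in GridSearchCV

# Every combination of one value per hyperparameter
combinations = list(product(*grid.values()))
n_combinations = len(combinations)

# Each combination is trained once per fold, plus one final refit (refit=True)
n_fits = n_combinations * cv_folds + 1

print(n_combinations, n_fits)
```

Each of these fits trains the network for up to 20 or 30 epochs, which is why adding even one more value to a hyperparameter list noticeably increases the total runtime.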

Play around with 5 different sets of hyperparameters. For example, consider some of the following:

- layer sizes
- activation functions
- regularizers
- early stopping
- vectorizer parameters

Report your best hyperparameter combination. \\
📝❓ What is the effect of your modifications on validation performance? Discuss potential reasons.

### Best Parameters: 
* Learning Rate: 0.1
* Max Epochs: 30
* Module Input Size: 5000
* Activation: ReLU
* Number of Units: 200

### Effect of the hyperparameters on training

The hyperparameters have a significant effect on training. Some observations:
* Raising the learning rate from 0.01 to 0.1 improves validation accuracy massively, from ~30% to about 97%. The most likely reason is that with plain SGD (skorch's default optimizer) a rate of 0.01 is too small for the network to converge within 20 to 30 epochs. The truncated grid-search log above, presumably a run with the smaller rate, fits this picture: the loss barely moves from about 3.0 during the first 12 epochs. The larger rate lets the model converge within the same epoch budget.
* More units in a layer do not necessarily mean better accuracy. Accuracy increased when going from 100 to 200 units but dropped when going from 200 to 300. This suggests that 200 units give the model enough capacity for this task, while 300 units may start to overfit. The difference could also be within run-to-run noise.
* ReLU outperformed tanh (tested independently, not part of the grid search). One possible reason is that ReLU does not saturate for positive inputs and is therefore less prone to vanishing gradients than tanh.
* Increasing `max_features` of the count vectorizer from 100 to 5000 raised test accuracy from ~68% to ~98%. With only 100 bigrams, many languages share most of their features; 5000 features let the model pick up far more language-specific character patterns.
* Early stopping (patience 5) never triggered. skorch's `EarlyStopping` monitors the validation loss by default, so the validation loss must have improved at least once every five epochs until `max_epochs` was reached. Training for more epochs might therefore still help.
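As a reference point for reading the training logs, it helps to know what an untrained 20-class model looks like: a uniform prediction has a cross-entropy of ln(20), and, assuming roughly balanced classes, an accuracy of 1/20. A minimal sketch of that arithmetic (variable names are illustrative):

```python
import math

num_classes = 20  # languages in our subset

# Cross-entropy of a uniform prediction over K classes: -ln(1/K) = ln(K)
chance_loss = math.log(num_classes)

# Expected accuracy of random guessing, assuming balanced classes
chance_accuracy = 1 / num_classes

print(f"chance loss: {chance_loss:.4f}, chance accuracy: {chance_accuracy:.2f}")
```

A run whose loss stays near this value and whose validation accuracy stays near 0.05, as in the grid-search log above, has barely moved from its initial state.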



☝ Note, during model development, if you run into the infamous CUDA out-of-memory (OOM) error, try clearing the GPU memory either with `torch.cuda.empty_cache()` or restarting the runtime.


---

📝❓ Write your lab report here addressing all questions in the notebook

# Lab Report

## Introduction

In this lab, we explored the use of neural networks for language classification using the `skorch` library. We experimented with different vectorizers and hyperparameters to improve the model's performance. The dataset consisted of text data in various languages, and the goal was to classify the text into one of the 20 languages.

## Data Preparation

We started by preparing the dataset, which involved:
- Downloading the dataset.
- Combining the training and testing data into dataframes for inspection.
- Merging the original train and test sets and filtering them to the 20 selected languages.
- Re-splitting the filtered data into training and test sets (80:20).
- Encoding the labels using `LabelEncoder`.
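The filtering and encoding steps can be sketched in plain Python. This is an illustration on made-up toy data, not the notebook's code: the texts, labels, and variable names are hypothetical, and the integer mapping only mimics the sorted class order that `LabelEncoder` produces.

```python
# Toy data standing in for the real text and label lists (hypothetical examples)
texts = ["hello world", "hallo welt", "hej världen", "bonjour le monde"]
labels = ["eng", "deu", "swe", "fra"]
language_filter = {"eng", "deu", "swe"}

# Filter text/label pairs together so the two lists cannot drift out of sync
pairs = [(t, l) for t, l in zip(texts, labels) if l in language_filter]
filtered_x = [t for t, _ in pairs]
filtered_y = [l for _, l in pairs]

# Sorted unique labels -> integer ids, the same ordering LabelEncoder uses
classes = sorted(set(filtered_y))
class_to_id = {c: i for i, c in enumerate(classes)}
encoded_y = [class_to_id[l] for l in filtered_y]

print(classes)
print(encoded_y)
```

Filtering the pairs in one pass is a small design choice: it avoids running the same condition over two separate lists and keeps each text attached to its label.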

## Feature Extraction

We experimented with different feature extraction techniques:
- **Count Vectorizer**: Extracted binary character-level bigram features, limited to either 100 or 5000 features.
- **TF-IDF Vectorizer**: Extracted character-level n-grams (2 to 4) with a maximum of 5000 features.
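To illustrate what these two feature types contain, here is a simplified pure-Python sketch on a toy corpus. It is not scikit-learn's implementation: it leaves out `max_features` selection and the L2 normalisation that `TfidfVectorizer` applies by default, and the function and variable names are made up for this example. The idf formula is the smoothed variant that scikit-learn documents as its default.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=2):
    """Return all character n-grams of length n_min..n_max (lowercased)."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

docs = ["the cat", "the hat", "der hut"]  # toy corpus, not the real data

# Document frequency: in how many documents does each bigram occur?
doc_freq = Counter(g for d in docs for g in set(char_ngrams(d)))

# Binary count features: 1 if the bigram occurs in the document, else 0
vocab = sorted(doc_freq)
binary = [[int(g in set(char_ngrams(d))) for g in vocab] for d in docs]

# Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
n_docs = len(docs)
idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}

print(doc_freq["th"], round(idf["th"], 3), round(idf["ut"], 3))
```

The bigram `th` occurs in two of the three toy documents and therefore gets a lower idf weight than `ut`, which occurs in only one. Binary features treat both the same whenever they are present.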

## Neural Network Architecture

We did not try to improve the vanilla neural network provided in the code template (other than altering the number of units in the hidden layer). This showed how a simple MLP is capable of outperforming ML techniques introduced in part 1 of the assignment. 
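For a sense of scale, the number of trainable parameters follows directly from the layer sizes printed in the model summaries above (input size, 200, 50, 20). A small sketch, with a helper function whose name is made up for this example:

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

small = mlp_parameter_count([100, 200, 50, 20])    # max_features=100
large = mlp_parameter_count([5000, 200, 50, 20])   # max_features=5000

print(small, large)
```

Almost all parameters of the larger model sit in the first layer, so the vectorizer's `max_features` setting dominates the model size.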

## Experiments and Results

### Initial Experiments

1. **Count Vectorizer with 100 Features**:
    - Achieved a test accuracy of ~68%.

2. **Count Vectorizer with 5000 Features**:
    - Improved test accuracy to ~98%.

3. **TF-IDF Vectorizer**:
    - Achieved a test accuracy of ~97%.

### Why choose `TF-IDF Vectorizer` over `Count Vectorizer` for our grid search?
- For our language classification task, the Count Vectorizer reached a slightly higher test accuracy (~98%) than TF-IDF (~97%). This is plausible for language identification, where the mere presence of specific character n-grams (we used `binary=True`) is often more indicative of the language than their weighting across documents. Note that the two setups also differ in n-gram range (bigrams vs. 2- to 4-grams), so this is not a pure count vs. TF-IDF comparison.
- Although the Count Vectorizer performed marginally better, we used TF-IDF in our grid search to investigate its potential more thoroughly. We wanted to make sure we were not overlooking any advantage TF-IDF might offer in capturing subtle distinctions, especially between languages with similar character distributions.

### Hyperparameter Tuning

We used `GridSearchCV` to find the best hyperparameters. The best parameters were:
- Learning Rate: 0.1
- Max Epochs: 30
- Module Input Size: 5000
- Activation: ReLU
- Number of Units: 200

The best model achieved a test accuracy of ~97.1%.

### Observations Summary

- **Learning Rate**: Increasing the learning rate from 0.01 to 0.1 resulted in a significant improvement in validation accuracy, most likely because the smaller rate does not converge within the epoch budget.
- **Number of Units**: 200 units performed best; 300 units scored slightly lower, possibly due to overfitting.
- **Activation Function**: ReLU outperformed tanh (tested outside the grid search).
- **Vectorizer Features**: Increasing the maximum features for the count vectorizer from 100 to 5000 raised test accuracy from ~68% to ~98%.
- **Early Stopping**: Never triggered, meaning the validation loss was still improving when training ended, so more epochs could be beneficial.
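The early-stopping behaviour can be illustrated with a simplified patience rule. This sketch is not skorch's implementation (its `EarlyStopping` callback monitors `valid_loss` by default and additionally applies an improvement threshold, which is ignored here); the function name is made up, the first loss sequence is rounded from the first training log above, and the second one is invented.

```python
def stopping_epoch(valid_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: stop once the validation loss has failed to reach
    a new minimum for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Rounded validation losses from the first training log: still improving
improving = [2.48, 1.72, 1.33, 1.14, 1.02, 0.96, 0.92, 0.89, 0.88, 0.86]
# Invented sequence that plateaus after epoch 3
plateau = [1.0, 0.9, 0.8, 0.81, 0.82, 0.80, 0.83, 0.85, 0.9]

print(stopping_epoch(improving), stopping_epoch(plateau))
```

As long as the loss keeps setting new minima, as in our runs, the rule never fires and training continues until `max_epochs`.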

## Conclusion

The experiments demonstrated the importance of hyperparameter tuning and feature extraction in improving the performance of neural networks for language classification. The best model achieved a test accuracy of ~97.1%, highlighting the effectiveness of the chosen hyperparameters and vectorizer settings.
