# NLP - Lab 03

### Let's beat the results we obtained with naive Bayes using the FastText library

---

Authors :

eliott.bouhana\
victor.simonin\
alexandre.lemonnier\
sarah.gutierez

----

## The dataset

In [1]:
import numpy as np
from datasets import load_dataset
dataset = load_dataset("imdb")

Reusing dataset imdb (/home/leme/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
100%|██████████| 3/3 [00:00<00:00, 41.79it/s]


### How many splits does the dataset have?

In [2]:
from datasets import get_dataset_split_names

get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

The dataset is composed of 3 splits: `train`, `test` and `unsupervised`

----------------

## FastText setup

In [3]:
import fasttext
import pandas as pd

train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

print("Number of values in train dataset: ", len(train))
print("Number of negative (0) and positive (1) sentences in train supervised split:\n{0}".format(train['label'].value_counts()))
print("--------------------------------")
print("Number of values in test dataset: ", len(test))
print("Number of negative (0) and positive (1) sentences in test supervised split:\n{0}".format(test['label'].value_counts()))

Number of values in train dataset:  25000
Number of negative (0) and positive (1) sentences in train supervised split:
0    12500
1    12500
Name: label, dtype: int64
--------------------------------
Number of values in test dataset:  25000
Number of negative (0) and positive (1) sentences in test supervised split:
0    12500
1    12500
Name: label, dtype: int64


In [4]:
train.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


### Preprocessing

Before training, we remove the punctuation in our text samples and lowercase them.

In [5]:
from string import punctuation
from typing import List

def preprocessor(x_list: List[str]) -> List[str]:
    """
    Lowercase each string of a list and remove its punctuation.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    return [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]

train['text'] = preprocessor(train['text'])
test['text'] = preprocessor(test['text'])
train.head()

| | text | label |
|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 |
| 1 | i am curious yellow is a risible and pretentio... | 0 |
| 2 | if only to avoid making this type of film in t... | 0 |
| 3 | this film was probably inspired by godards mas... | 0 |
| 4 | oh brotherafter hearing about this ridiculous ... | 0 |
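One side effect is visible in the output above: deleting punctuation glues some words together (`curiousyellow`, `brotherafter`). The sketch below compares the deletion used in `preprocessor` with a possible alternative that maps punctuation to spaces. The alternative and the sample string are assumptions for illustration; they are not what the notebook ran.

```python
from string import punctuation

# Translation table that deletes punctuation (same approach as `preprocessor`).
delete_table = str.maketrans("", "", punctuation)
# Hypothetical alternative: replace every punctuation character with a space.
space_table = str.maketrans(punctuation, " " * len(punctuation))

# Made-up sample combining two patterns seen in the reviews above.
sample = "I rented I AM CURIOUS-YELLOW, brother...after"

deleted = sample.lower().translate(delete_table)
# Collapse the runs of spaces created by the replacement.
spaced = " ".join(sample.lower().translate(space_table).split())

print(deleted)  # words around "-" and "..." are merged
print(spaced)   # words stay separated
```

Whether separated tokens actually help the classifier would have to be checked by retraining; this only shows the difference in tokenisation.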


### Shuffling

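The first rows of the training split shown above are all negative, which suggests the samples are grouped by label. To avoid feeding the model all the reviews of one class before the other, we shuffle our data before training.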

In [6]:
def shuffle_dataset(dataset: pd.DataFrame) -> pd.DataFrame:
    """
    Shuffle pandas dataset 
    
    Args:
        dataset: pandas Dataframe
    
    Returns:
       The shuffled dataset 
    """
    return dataset.sample(frac=1).reset_index(drop=True)

train = shuffle_dataset(train)
test = shuffle_dataset(test)

train.head()

| | text | label |
|---|---|---|
| 0 | good horror movies from france are quite rare ... | 1 |
| 1 | whoever says pokemon is stupid can die this mo... | 1 |
| 2 | believe it or dont i have my very own dvd copy... | 0 |
| 3 | im sorry but this is just awful i have told pe... | 0 |
| 4 | i watched this movie also and altho it is very... | 0 |


### Turn the dataset into a dataset compatible with FastText

To use `fasttext`, each row of our dataset must be converted to the following format:

`__label__<label> <corresponding_text>`

Each formatted line is written to a file, which FastText then reads for training.

In [7]:
def dataset_to_fasttext_file(df: pd.DataFrame, filename: str) -> None:
    """
    Write a dataframe to a file in the fasttext format.
    
    Args:
        df: pandas Dataframe with 'label' and 'text' columns
        filename: path of the file to write
    
    Returns:
        None
    """
    # Map numeric labels to names once.
    # Note: this updates the 'label' column of the caller's dataframe in place.
    if 0 in df['label'].values:
        df['label'] = df['label'].map({0: "negative", 1: "positive"})
    df = df.astype('str')
    with open(filename, 'w') as file:
        for _, row in df.iterrows():
            # One sample per line: '__label__<label> <text>'
            line = '__label__' + row['label'] + ' ' + row['text'] + '\n'
            file.write(line)
    
dataset_to_fasttext_file(train, "train.fasttext")
dataset_to_fasttext_file(test, "test.fasttext")
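
To make the line format concrete, here is a minimal sketch that builds one line and parses it back. The helpers `to_fasttext_line` and `from_fasttext_line` are hypothetical names introduced for illustration; they are not part of `fasttext` or of the notebook. The whitespace collapsing is an added precaution, on the assumption that a sample must stay on a single line.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"

def to_fasttext_line(label: str, text: str) -> str:
    """Format one sample as '__label__<label> <text>' on a single line."""
    # Collapse all whitespace (including newlines) so the sample stays on one line.
    return LABEL_PREFIX + label + " " + " ".join(text.split())

def from_fasttext_line(line: str) -> Tuple[str, str]:
    """Split a formatted line back into (label, text)."""
    prefixed_label, _, text = line.strip().partition(" ")
    return prefixed_label[len(LABEL_PREFIX):], text

line = to_fasttext_line("positive", "good horror\nmovies from france")
print(line)
print(from_fasttext_line(line))
```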

## FastText classifier

Let's train a simple classifier with our formatted text file:

In [8]:
model = fasttext.train_supervised(input="train.fasttext")

Read 5M words
Number of words:  121891
Number of labels: 2
Progress: 100.0% words/sec/thread: 2383320 lr:  0.000000 avg.loss:  0.367719 ETA:   0h 0m 0s


In [9]:
test_prediction = model.test("test.fasttext")
print('Number of samples:', test_prediction[0])
print('Accuracy:', test_prediction[1])

Number of samples: 25000
Accuracy: 0.87608
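
The tuple returned by `model.test` holds the number of samples, then the precision at 1 and the recall at 1. When every sample carries exactly one label, as in this dataset, both values coincide with plain accuracy, which is why the second entry can be reported as accuracy. The sketch below illustrates that relationship on made-up labels; `precision_recall_at_1` is a hypothetical helper, not a `fasttext` function.

```python
def precision_recall_at_1(gold, predicted):
    """
    gold: list of sets of true labels (one set per sample)
    predicted: list with the single top-ranked label for each sample
    """
    # A prediction is a hit when it belongs to the sample's true labels.
    hits = sum(1 for labels, guess in zip(gold, predicted) if guess in labels)
    precision = hits / len(predicted)                  # hits per prediction made
    recall = hits / sum(len(labels) for labels in gold)  # hits per true label
    return precision, recall

# Made-up data: exactly one true label per sample.
gold = [{"positive"}, {"negative"}, {"positive"}, {"negative"}]
predicted = ["positive", "positive", "positive", "negative"]

precision, recall = precision_recall_at_1(gold, predicted)
accuracy = sum(guess in labels for labels, guess in zip(gold, predicted)) / len(gold)
print(precision, recall, accuracy)
```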


### Hyperparameters search

To find the best parameters for our classifier, we have to split our training data and create a validation dataset.

In [10]:
validation_size = int(len(train) * 0.3)
validation = train.iloc[:validation_size, :]
train = train.iloc[validation_size:, :]

validation.head()

| | text | label |
|---|---|---|
| 0 | good horror movies from france are quite rare ... | positive |
| 1 | whoever says pokemon is stupid can die this mo... | positive |
| 2 | believe it or dont i have my very own dvd copy... | negative |
| 3 | im sorry but this is just awful i have told pe... | negative |
| 4 | i watched this movie also and altho it is very... | negative |


Let's format our validation dataset to a fasttext file:

In [11]:
dataset_to_fasttext_file(validation, "validation.fasttext")
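
One point to watch: `train.fasttext` was written before the split, so it still contains the rows that now form the validation set, and the validation score may be optimistic. A minimal sketch of a split that writes both files from disjoint rows follows. It works on plain lists of already formatted lines, and `split_and_write` and the toy samples are assumptions for illustration only.

```python
import os
import tempfile

def split_and_write(lines, validation_ratio, train_path, validation_path):
    """Write the first share of `lines` to the validation file and the rest to the training file."""
    validation_size = int(len(lines) * validation_ratio)
    validation_lines = lines[:validation_size]
    train_lines = lines[validation_size:]
    with open(validation_path, "w") as file:
        file.writelines(line + "\n" for line in validation_lines)
    with open(train_path, "w") as file:
        file.writelines(line + "\n" for line in train_lines)
    return train_lines, validation_lines

# Toy, made-up samples standing in for the (already shuffled) training rows.
toy_lines = [
    ("__label__positive" if i % 2 else "__label__negative") + " sample %d" % i
    for i in range(10)
]

out_dir = tempfile.mkdtemp()
train_lines, validation_lines = split_and_write(
    toy_lines,
    0.3,
    os.path.join(out_dir, "train.fasttext"),
    os.path.join(out_dir, "validation.fasttext"),
)
print(len(train_lines), len(validation_lines))
```

In the notebook itself, the equivalent step would be to call `dataset_to_fasttext_file(train, "train.fasttext")` again after the split.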

Now, let's train and search for the best hyperparameters with the validation dataset:

In [12]:
model_hyper = fasttext.train_supervised(input="train.fasttext", autotuneValidationFile="validation.fasttext")

Progress:   2.7% Trials:    2 Best score:  0.904267 ETA:   0h 4m51s
Aborting autotune...
Progress:   2.8% Trials:    2 Best score:  0.904267 ETA:   0h 4m51s
Training again with best arguments
Read 5M words
Number of words:  121891
Number of labels: 2
Progress: 100.0% words/sec/thread: 3249684 lr:  0.000000 avg.loss:  0.381792 ETA:   0h 0m 0s


In [None]:
test_hyper_prediction = model_hyper.test("test.fasttext")
print('Number of samples:', test_hyper_prediction[0])
print('Accuracy:', test_hyper_prediction[1])

Number of samples: 25000
Accuracy: 0.89824


### What are the differences between the two models?

Let's compare the accuracy of our two models on the test dataset:

In [None]:
print("Simple model on test dataset accuracy:", test_prediction[1])
print("Hyper model on test dataset accuracy:", test_hyper_prediction[1])
print("Accuracy difference : ", abs(test_prediction[1] - test_hyper_prediction[1]))

Simple model on test dataset accuracy: 0.87592
Hyper model on test dataset accuracy: 0.89824
Accuracy difference :  0.022320000000000007


We can observe that our second model is more accurate in its predictions than our first model.
The hyperparameter search worked as expected and we got better results.

Now, let's observe the differences between the hyper parameters of our two models:

In [None]:
print("Simple model learning rate:", model.lr)
print("Hyper model learning rate:", model_hyper.lr)

Simple model learning rate: 0.1
Hyper model learning rate: 0.08499425639667486


In [None]:
print("Simple model epoch:", model.epoch)
print("Hyper model epoch:", model_hyper.epoch)

Simple model epoch: 5
Hyper model epoch: 100


In [None]:
print("Simple model loss function:", model.loss)
print("Hyper model loss function:", model_hyper.loss)

Simple model loss function: loss_name.softmax
Hyper model loss function: loss_name.softmax


In [None]:
print("Simple model size of word vectors:", model.dim)
print("Hyper model size of word vectors:", model_hyper.dim)

Simple model size of word vectors: 100
Hyper model size of word vectors: 92


In [None]:
print("Simple model number of buckets:", model.bucket)
print("Hyper model number of buckets:", model_hyper.bucket)

Simple model number of buckets: 0
Hyper model number of buckets: 4110692


First, the *learning rate* differs: the search selected a value of about **0.085** for our second model, slightly lower than the default **0.1** of the first one. This parameter has a strong influence on the resulting accuracy.

Another difference is the *number of epochs*: **5** against **100**, meaning that our `model_hyper` made many more passes over the training data than the simple one.

However, the two models use the same *loss function* (softmax), another important hyperparameter in their training.

Finally, the *size of word vectors* (**100** against **92**) and the *number of buckets* (**0** against **4110692**) are not the same. Since `model_hyper` scores better on the test set, this combination of values appears to suit the task better, although we cannot attribute the gain to any single parameter.
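
The comparison above can also be done in one pass instead of one cell per parameter. The sketch below uses stand-in objects holding the values printed above, so it runs without trained models. `compare_hyperparameters` is a hypothetical helper, and with the real objects one would pass `model` and `model_hyper` instead.

```python
from types import SimpleNamespace

# Stand-ins carrying the values reported in the outputs above.
simple = SimpleNamespace(lr=0.1, epoch=5, loss="softmax", dim=100, bucket=0)
tuned = SimpleNamespace(
    lr=0.08499425639667486, epoch=100, loss="softmax", dim=92, bucket=4110692
)

def compare_hyperparameters(first, second, names):
    """Return {name: (first_value, second_value)} for the hyperparameters that differ."""
    differences = {}
    for name in names:
        a, b = getattr(first, name), getattr(second, name)
        if a != b:
            differences[name] = (a, b)
    return differences

names = ["lr", "epoch", "loss", "dim", "bucket"]
differences = compare_hyperparameters(simple, tuned, names)
for name, (a, b) in differences.items():
    print(f"{name}: simple={a} tuned={b}")
```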

### Identify prediction failure

Using the tuned model, let's take 2 wrongly classified examples from the test set, and try explaining why the model failed:

In [None]:
# Read the first 40 lines of test.fasttext and split each one into (label, text)
samples = []
with open("test.fasttext", "r") as file:
    for _ in range(40):
        label, text = file.readline().strip().split(" ", 1)
        samples.append((label, text))

# Predict a label for each text; the gold label is split off so that
# only the review text is passed to the model
predicted_labels, _ = model_hyper.predict([text for _, text in samples])

# Print two examples from our lines and predictions:
# the first one predicted positive instead of negative,
# and the first one predicted negative instead of positive
seen = set()
for (label, text), predicted in zip(samples, predicted_labels):
    if predicted[0] != label and predicted[0] not in seen:
        print("Expected:\n" + label)
        print("Got:\n" + predicted[0])
        print("Text:\n" + text)
        print()
        seen.add(predicted[0])
    if len(seen) == 2:
        break

Expected:
__label__negative
Got:
__label__positive
Text:
the brave one is about a new york radio show host named erica bain jodie foster her life is a dream living in the city she grew up in and loves she has her great fiancé david naveen andrews whom she is planning to marry but one night while erica and david are out walking their dog they are attacked and mugged by a group of degenerates leaving david dead erica recovers but is heartbroken and traumatized later on and can barely cope with real life anymore she buys a gun off a guy on the streets for protection but one day shes shopping in a store and a man comes in and shoots the clerk dead it is then that erica shoots and kills this man and she becomes a vigilante killing anyone who tries to threaten or harm her or any others at the same time detective mercer terrence howard is tracking down this elusive unknown killer and in the process becomes friends with erica erica begins to regain her sanity as she kills these violent people 

Here we took two examples of mistakes that our model makes:
- one where the predicted label was positive instead of negative
- one where the predicted label was negative instead of positive

The main reason for those mistakes is probably that these inputs contain words that the model associates with the other label.

Indeed, in the first input, we can find keywords that are likely to be associated with 'positive':
- loves
- good
- her great fiancé
- excellent
- brillant
- good acting
- ...

Moreover, this text is fairly long. Perhaps the longer a text is, the more likely it is to be classified as positive, but we have not verified this.

In the second input, we can find keywords that are likely to be associated with 'negative':
- I don't like
- very poor execution
- heartbreaking
- drama
- ...
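
The keyword explanation can be checked roughly by counting cue words in a misclassified text. The sketch below does this on an excerpt of the first review printed above. The cue list is hand-picked and hypothetical, and it says nothing about what the model actually learned, so it only gives a first impression.

```python
from collections import Counter

def cue_word_counts(text, cue_words):
    """Count how often each cue word occurs among the whitespace-separated tokens of `text`."""
    counts = Counter(token for token in text.split() if token in cue_words)
    return dict(counts)

# Hypothetical cue list, chosen by hand for illustration only.
positive_cues = {"loves", "great", "dream", "friends"}

# Excerpt of the misclassified review printed above.
excerpt = (
    "her life is a dream living in the city she grew up in and loves "
    "she has her great fiancé david naveen andrews whom she is planning to marry"
)

counts = cue_word_counts(excerpt, positive_cues)
print(counts)
# Token count, as a starting point for looking at the text-length hypothesis.
print("tokens:", len(excerpt.split()))
```

Running the same count over correctly classified negative reviews would show whether these words are really more frequent in the failures.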
  