# NLP - Lab 03

### Let's beat the results we obtained with naive Bayes using the FastText library

---

Authors:

eliott.bouhana\
victor.simonin\
alexandre.lemonnier\
sarah.gutierez

----

## The dataset

In [13]:
import numpy as np
from datasets import load_dataset
dataset = load_dataset("imdb")

Reusing dataset imdb (/home/leme/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
100%|██████████| 3/3 [00:00<00:00, 21.33it/s]


### How many splits does the dataset have?

In [14]:
from datasets import get_dataset_split_names

get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

The dataset is composed of 3 splits: `train`, `test` and `unsupervised`.

----------------

## FastText setup

In [15]:
import fasttext
import pandas as pd

train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

print("Number of values in train dataset: ", len(train))
print("Number of negative (0) and positive (1) sentences in train supervised split:\n{0}".format(train['label'].value_counts()))
print("--------------------------------")
print("Number of values in test dataset: ", len(test))
print("Number of negative (0) and positive (1) sentences in test supervised split:\n{0}".format(test['label'].value_counts()))

Number of values in train dataset:  25000
Number of negative (0) and positive (1) sentences in train supervised split:
0    12500
1    12500
Name: label, dtype: int64
--------------------------------
Number of values in test dataset:  25000
Number of negative (0) and positive (1) sentences in test supervised split:
0    12500
1    12500
Name: label, dtype: int64


In [16]:
train.head()

|   | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


### Preprocessing

Before training, we remove the punctuation in our text samples and lowercase them.

In [17]:
from string import punctuation
from typing import List

def preprocessor(x_list: List[str]) -> List[str]:
    """
    Lowercase each string and remove its punctuation.

    Args:
        x_list: List of strings

    Returns:
        List of preprocessed strings.
    """
    # Build the translation table once instead of once per string
    table = str.maketrans("", "", punctuation)
    return [x.lower().translate(table) for x in x_list]

train['text'] = preprocessor(train['text'])
test['text'] = preprocessor(test['text'])
train.head()

|   | text | label |
|---|---|---|
| 0 | i rented i am curiousyellow from my video stor... | 0 |
| 1 | i am curious yellow is a risible and pretentio... | 0 |
| 2 | if only to avoid making this type of film in t... | 0 |
| 3 | this film was probably inspired by godards mas... | 0 |
| 4 | oh brotherafter hearing about this ridiculous ... | 0 |
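
One side effect visible in the output above is that words separated only by punctuation end up glued together (`curiousyellow`, `brotherafter`). The sketch below shows a possible variant that replaces punctuation with a space instead of deleting it. It is not the preprocessor used in the rest of this lab, and the name `preprocessor_keep_boundaries` is hypothetical.

```python
import re
from string import punctuation
from typing import List

# Map every punctuation character to a space (assumption: ASCII punctuation only)
_PUNCT_TO_SPACE = str.maketrans(punctuation, " " * len(punctuation))

def preprocessor_keep_boundaries(x_list: List[str]) -> List[str]:
    """Lowercase, turn punctuation into spaces, then collapse repeated spaces."""
    return [
        re.sub(r"\s+", " ", x.lower().translate(_PUNCT_TO_SPACE)).strip()
        for x in x_list
    ]

print(preprocessor_keep_boundaries(["Oh, brother...after hearing", "CURIOUS-YELLOW"]))
```

Whether this variant actually improves accuracy would have to be measured; it only keeps word boundaries intact.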


### Shuffling

The first rows shown above are all negative, which suggests that the split is ordered by label. To avoid biasing the model toward one class, we shuffle our data before training.

In [18]:
def shuffle_dataset(dataset: pd.DataFrame) -> pd.DataFrame:
    """
    Shuffle the rows of a pandas DataFrame.

    Args:
        dataset: pandas DataFrame

    Returns:
        The shuffled dataset, with a fresh index.
    """
    return dataset.sample(frac=1).reset_index(drop=True)

train = shuffle_dataset(train)
test = shuffle_dataset(test)

train.head()

|   | text | label |
|---|---|---|
| 0 | in the third entry of the phantasm series mike... | 0 |
| 1 | i only watched the first 30 minutes of this an... | 0 |
| 2 | now this is what a family movie should be ther... | 1 |
| 3 | this show has a great storyline its very belie... | 1 |
| 4 | a year or so ago i was watching the tv news wh... | 1 |


### Turn the dataset into a dataset compatible with FastText

To be able to use `fasttext`, we have to transform each row of our dataset into the following format:

`__label__<label> <corresponding_text>`

Each formatted row is then written as one line of a file, which FastText reads for training.
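
As a small illustration of this format, the sketch below builds one line from a label and a text, then parses it back. The helper names `to_fasttext_line` and `parse_fasttext_line` are hypothetical and are not part of the `fasttext` library. The sketch also assumes the text may contain line breaks, which it flattens because each example has to fit on a single line.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"  # label prefix used throughout this lab

def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: '__label__<label> <text>'."""
    # Collapse all whitespace (including newlines) so the example stays on one line
    flat_text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {flat_text}"

def parse_fasttext_line(line: str) -> Tuple[str, str]:
    """Split a line back into (label, text)."""
    first_token, _, text = line.strip().partition(" ")
    if not first_token.startswith(LABEL_PREFIX):
        raise ValueError("line does not start with a label")
    return first_token[len(LABEL_PREFIX):], text

line = to_fasttext_line("negative", "oh brother\nafter hearing about this")
print(line)
print(parse_fasttext_line(line))
```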

In [19]:
def dataset_to_fasttext_file(df: pd.DataFrame, filename: str) -> None:
    """
    Write a dataframe to a file in the FastText supervised format.

    Note: integer labels are mapped to "negative"/"positive" in place,
    so the dataframe passed in is modified.

    Args:
        df: pandas DataFrame with 'text' and 'label' columns
        filename: path of the file to write

    Returns:
        None
    """
    # Map integer labels to names only if this has not been done already
    if 0 in df['label'].values:
        df['label'] = df['label'].map({0: "negative", 1: "positive"})
    df = df.astype('str')
    with open(filename, 'w') as file:
        for _, row in df.iterrows():
            # One example per line: "__label__<label> <text>"
            file.write('__label__' + row['label'] + ' ' + row['text'] + '\n')

dataset_to_fasttext_file(train, "train.fasttext")
dataset_to_fasttext_file(test, "test.fasttext")

## FastText classifier

Let's train a simple classifier with our formatted text file:

In [20]:
model = fasttext.train_supervised(input="train.fasttext")

Read 5M words
Number of words:  121891
Number of labels: 2
Progress: 100.0% words/sec/thread: 1060949 lr:  0.000000 avg.loss:  0.406357 ETA:   0h 0m 0s


In [21]:
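# model.test returns (number of samples, precision@1, recall@1);
# with exactly one label per review, precision@1 is the accuracy
test_prediction = model.test("test.fasttext")
print('Number of samples:', test_prediction[0])
print('Accuracy:', test_prediction[1])

Number of samples: 25000
Accuracy: 0.87624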
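
`model.test` returns a tuple `(N, precision@1, recall@1)`. When every example carries exactly one label and we predict exactly one label, both values reduce to plain accuracy, which is why `test_prediction[1]` is read as the accuracy here. The pure-Python sketch below illustrates this on made-up labels; `precision_recall_at_1` is a hypothetical helper, not a fastText function.

```python
from typing import List, Tuple

def precision_recall_at_1(expected: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Precision and recall when there is one true and one predicted label per example."""
    correct = sum(e == p for e, p in zip(expected, predicted))
    precision = correct / len(predicted)  # correct / number of predicted labels
    recall = correct / len(expected)      # correct / number of true labels
    return precision, recall

expected = ["positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
print(precision_recall_at_1(expected, predicted))  # (0.75, 0.75)
```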


### Hyperparameters search

To find the best parameters for our classifier, we have to split our training data and create a validation dataset.

In [22]:
validation_size = int(len(train) * 0.3)
validation = train.iloc[:validation_size, :]
train_hyper = train.iloc[validation_size:, :]

validation.head()

|   | text | label |
|---|---|---|
| 0 | in the third entry of the phantasm series mike... | negative |
| 1 | i only watched the first 30 minutes of this an... | negative |
| 2 | now this is what a family movie should be ther... | positive |
| 3 | this show has a great storyline its very belie... | positive |
| 4 | a year or so ago i was watching the tv news wh... | positive |
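
Because the training data was shuffled, taking the first 30% of the rows gives a validation set whose class balance is only approximately 50/50. If an exactly proportional split is wanted, a stratified split is one option. The sketch below is a pure-Python illustration on `(text, label)` pairs and is not the method used in this lab; `stratified_split` is a hypothetical helper, and it assumes the rows have already been shuffled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (text, label)

def stratified_split(rows: List[Row], validation_fraction: float) -> Tuple[List[Row], List[Row]]:
    """Put the same fraction of each label into the validation set."""
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    validation, training = [], []
    for label_rows in by_label.values():
        cut = int(len(label_rows) * validation_fraction)
        validation.extend(label_rows[:cut])
        training.extend(label_rows[cut:])
    return validation, training

# Toy data: 10 positive and 10 negative rows
rows = [(f"review {i}", "positive" if i % 2 else "negative") for i in range(20)]
validation, training = stratified_split(rows, 0.3)
print(len(validation), len(training))
```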


Let's write the validation set and the reduced training set to FastText files:

In [23]:
dataset_to_fasttext_file(validation, "validation.fasttext")
dataset_to_fasttext_file(train_hyper, "train_hyper.fasttext")

Now, let's train and search for the best hyperparameters with the validation dataset:

In [24]:
model_hyper = fasttext.train_supervised(input="train_hyper.fasttext", autotuneValidationFile="validation.fasttext")

Progress: 100.0% Trials:    7 Best score:  0.898267 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  100524
Number of labels: 2
Progress: 100.0% words/sec/thread: 1077692 lr:  0.000000 avg.loss:  0.054391 ETA:   0h 0m 0s
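
Autotune explores hyperparameters within a time budget, so the selected values (and the score) can change from one run to the next. The sketch below only assembles keyword arguments for `fasttext.train_supervised`. To our knowledge, `autotuneDuration` (in seconds) and `autotuneMetric` are options of the fastText Python bindings, and the defaults shown (300 seconds, `"f1"`) are assumed to match the library; the helper `autotune_kwargs` itself is hypothetical.

```python
from typing import Any, Dict

def autotune_kwargs(train_file: str, validation_file: str,
                    duration_s: int = 300, metric: str = "f1") -> Dict[str, Any]:
    """Collect the arguments of an autotuned training run in one place."""
    return {
        "input": train_file,
        "autotuneValidationFile": validation_file,
        "autotuneDuration": duration_s,  # time budget in seconds (assumed default: 300)
        "autotuneMetric": metric,        # metric optimised on the validation file
    }

kwargs = autotune_kwargs("train_hyper.fasttext", "validation.fasttext", duration_s=600)
print(kwargs)
# With fasttext installed and the files written:
# model_hyper = fasttext.train_supervised(**kwargs)
```

A longer budget lets autotune try more configurations, at the cost of a longer search.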


In [25]:
test_hyper_prediction = model_hyper.test("test.fasttext")
print('Number of samples:', test_hyper_prediction[0])
print('Accuracy:', test_hyper_prediction[1])

Number of samples: 25000
Accuracy: 0.89272


### What are the differences between the two models?

Let's compare the accuracy of our two models on the test dataset:

In [26]:
print("Simple model on test dataset accuracy:", test_prediction[1])
print("Hyper model on test dataset accuracy:", test_hyper_prediction[1])
print("Accuracy difference : ", abs(test_prediction[1] - test_hyper_prediction[1]))

Simple model on test dataset accuracy: 0.87624
Hyper model on test dataset accuracy: 0.89272
Accuracy difference :  0.01647999999999994


We can observe that our second model is more accurate on the test set than our first model (0.89272 against 0.87624).
The hyperparameter search worked as expected and we got better results.

Now, let's observe the differences between the hyperparameters of our two models:

In [27]:
print("Simple model learning rate:", model.lr)
print("Hyper model learning rate:", model_hyper.lr)

Simple model learning rate: 0.1
Hyper model learning rate: 0.07736590952706915


In [28]:
print("Simple model epoch:", model.epoch)
print("Hyper model epoch:", model_hyper.epoch)

Simple model epoch: 5
Hyper model epoch: 100


In [29]:
print("Simple model loss function:", model.loss)
print("Hyper model loss function:", model_hyper.loss)

Simple model loss function: loss_name.softmax
Hyper model loss function: loss_name.softmax


In [30]:
print("Simple model size of word vectors:", model.dim)
print("Hyper model size of word vectors:", model_hyper.dim)

Simple model size of word vectors: 100
Hyper model size of word vectors: 82


In [31]:
print("Simple model number of bucket:", model.bucket)
print("Hyper model number of bucket:", model_hyper.bucket)

Simple model number of bucket: 0
Hyper model number of bucket: 3908064


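First, the *learning rate* differs: autotune selected about **0.077** instead of the default **0.1**. A learning rate is not more or less "accurate" in itself; it sets the size of each update step, and its best value depends on the other hyperparameters, in particular the number of epochs.

Another difference is the *number of epochs*: **5** against **100**, meaning that `model_hyper` made many more passes over its training data than the simple model. This is consistent with the much lower final training loss reported in the logs (0.054 against 0.406).

However, the two models use the same *loss function* (softmax), so this hyperparameter does not explain the gap.

Finally, the *size of word vectors* (100 against 82) and the *number of buckets* (0 against 3908064) are not the same. A bucket count of 0 is what we would expect if the default model only uses word unigrams, since there is then nothing to hash; the non-zero value suggests that the tuned model also uses word n-grams or subwords, although we did not print `wordNgrams`, `minn` or `maxn` to confirm it. Since autotune changed several hyperparameters at once, we cannot attribute the gain to any single one; we can only say that the tuned combination works better on this dataset, even though `model_hyper` was trained on only 70% of the training data.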
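
Instead of printing each hyperparameter in its own cell, the comparison can be done in one pass. The sketch below uses stand-in objects holding the values printed above, so that it runs without a trained model; `diff_hyperparameters` is a hypothetical helper. With the real models, `model` and `model_hyper` would be passed instead, and further names such as `wordNgrams` could be added, assuming the bindings expose them as attributes.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Tuple

def diff_hyperparameters(model_a: Any, model_b: Any,
                         names: Iterable[str]) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (value_a, value_b)} for the hyperparameters that differ."""
    differences = {}
    for name in names:
        value_a, value_b = getattr(model_a, name), getattr(model_b, name)
        if value_a != value_b:
            differences[name] = (value_a, value_b)
    return differences

# Stand-ins for the two trained models, filled with the values printed above
simple = SimpleNamespace(lr=0.1, epoch=5, loss="softmax", dim=100, bucket=0)
tuned = SimpleNamespace(lr=0.07736590952706915, epoch=100, loss="softmax",
                        dim=82, bucket=3908064)

names = ["lr", "epoch", "loss", "dim", "bucket"]
for name, (a, b) in diff_hyperparameters(simple, tuned, names).items():
    print(f"{name}: {a} -> {b}")
```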

### Identify prediction failure

Using the tuned model, let's take 2 wrongly classified examples from the test set and try to explain why the model failed:

In [32]:
# Read the first lines of test.fasttext, already in FastText format
n_lines = 40
fasttext_lines = []
with open("test.fasttext", "r") as file:
    for _ in range(n_lines):
        fasttext_lines.append(file.readline().strip())

# Separate the expected label (first token) from the review text
expected_labels, texts = [], []
for line in fasttext_lines:
    label, _, text = line.partition(' ')
    expected_labels.append(label)
    texts.append(text)

# Predict a label for each review with the tuned model
test_predictions = model_hyper.predict(texts)

# Print one example of each kind of mistake:
# one positive instead of negative, and one negative instead of positive
p_n = False
n_p = False
for expected, predicted, text in zip(expected_labels, test_predictions[0], texts):
    got = predicted[0]
    if got == "__label__positive" and expected == "__label__negative" and not p_n:
        print("Expected:\n" + expected)
        print("Got:\n" + got)
        print("Text:\n" + text)
        print()
        p_n = True
    if got == "__label__negative" and expected == "__label__positive" and not n_p:
        print("Expected:\n" + expected)
        print("Got:\n" + got)
        print("Text:\n" + text)
        print()
        n_p = True
    if p_n and n_p:
        break

Expected:
__label__positive
Got:
__label__negative
Text:
see no evil is the first film from wwe films yes wwe word wrestling entertainment pro wrestling of course being that its a wwe film a wrestler has to star in it the wrestler being glenn jacobs aka kane which is not really important as if you didnt know kane or what wwe stood for you would never know it had anything to do with the wild word of wrestling as the movie has nothing to do with wrestling see no evil is gross out horror film it has some moments were the some people may jump but for the most part its just saying hey look how gross we can get not that there is anything wrong with that jacob goodnight played by kane is sort of a jason type character his mother tortured him as a kid with strict understatement christian beliefs and has warped his mind now hes a big scary chopping killing machine 90 of the movie takes place in an abandon hotel where jacob stalks six teenagers surprised and a handful of adults i could explain w

Here we took two examples of mistakes made by our model:
- one where the predicted label was positive instead of negative
- one where the predicted label was negative instead of positive

These mistakes probably happen because the inputs contain words that the model associates with the opposite label. FastText represents a text by averaging the vectors of its words and n-grams, so strongly polarized words can outweigh the overall meaning of a review.

Indeed, in the first input, we can find keywords that are likely to be associated with 'positive':
- loves
- good
- her great fiancé
- excellent
- brilliant
- good acting
- ...

Moreover, this text is fairly long. One hypothesis is that the longer a text is, the more likely it is to be classified as positive, but we did not verify this.
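
The length hypothesis could be checked rather than guessed. A minimal sketch, assuming we have the texts and the labels predicted for them as two parallel lists (the toy data below is invented, and `mean_length_by_prediction` is a hypothetical helper):

```python
from collections import defaultdict
from typing import Dict, List

def mean_length_by_prediction(texts: List[str], predicted: List[str]) -> Dict[str, float]:
    """Average number of words per text, grouped by predicted label."""
    lengths = defaultdict(list)
    for text, label in zip(texts, predicted):
        lengths[label].append(len(text.split()))
    return {label: sum(values) / len(values) for label, values in lengths.items()}

# Invented toy data standing in for test texts and model predictions
texts = ["great movie", "a long and rather boring film", "i loved every minute of it"]
predicted = ["__label__positive", "__label__negative", "__label__positive"]
print(mean_length_by_prediction(texts, predicted))
```

On the real data, a clearly higher average length for texts predicted positive would support the hypothesis; similar averages would argue against it.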

In the second input, we can find keywords that are likely to be associated with 'negative':
- I don't like
- very poor execution
- heartbreaking
- drama
- ...
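
To make this keyword argument a little more concrete, the sketch below counts how many words of a text belong to a small hand-written list of cue words. The cue lists are illustrative assumptions based on the keywords above, not something extracted from the model, and `count_cues` is a hypothetical helper.

```python
from typing import Dict, Set

# Hand-picked cue words (assumption: chosen by hand for illustration)
POSITIVE_CUES: Set[str] = {"loves", "good", "great", "excellent", "brilliant"}
NEGATIVE_CUES: Set[str] = {"poor", "heartbreaking", "drama"}

def count_cues(text: str) -> Dict[str, int]:
    """Count the words of a text that appear in each cue list."""
    words = text.lower().split()
    return {
        "positive": sum(word in POSITIVE_CUES for word in words),
        "negative": sum(word in NEGATIVE_CUES for word in words),
    }

print(count_cues("good acting and an excellent cast but very poor execution"))
```

A negative review that scores higher on positive cues than on negative ones is a plausible candidate for the kind of mistake described here.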
  