# FastText

Use the [FastText library](https://fasttext.cc/docs/en/support.html) to train and test a classifier.

Go through the following steps.
1. (2 points) Turn the dataset into a dataset compatible with FastText (see the _Tips on using FastText_ section a bit lower).
   * For pretreatment, only apply lower casing and punctuation removal.
2. (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.
3. (2 points) Use the [hyperparameters search functionality](https://fasttext.cc/docs/en/autotune.html) of FastText and repeat step 2.
   * To do so, you'll need to [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) your training set into a training and a validation set.
   * Let the model search for 5 minutes (it's the default search time).
   * Don't forget to shuffle (and stratify) your splits. The dataset has its entries ordered by label (0s first, then 1s). Feeding the classifier one class and then the other can hurt its performance.
4. (1 point) Look at the differences between the default model and the attributes found with hyperparameters search. How do the two models differ?
   * Only refer to the attributes you think are interesting.
   * See the _Tips on using FastText_ (just below) for help.
5. (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.
6. (Bonus point) Why is it likely that the attributes `minn` and `maxn` are at 0 after a hyperparameter search on our data?
   * Hint: on what language are we working?

### Tips on using FastText

FastText is not documented in much detail, so you might run into a few problems. The following tips can be useful.

#### Training file format

Training a FastText classifier takes a text file as input. Every line corresponds to a sample and must have the following format:
```
__label__<your_label> <corresponding text>
```
For example, in our case a line should look like this.
```
__label__positive you know robin williams god bless him is constantly...
```
Also, the data are ordered by label (in the `imdb` split loaded below, `negative` first and then `positive`). To avoid biasing the model toward the class it sees last, **shuffle your data before training**.
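
As a minimal sketch of this format, the snippet below builds one line per sample and writes them to a file in shuffled order. The function names (`to_fasttext_line`, `write_fasttext_file`) and the integer-to-name mapping are assumptions made for illustration; they are not part of the FastText API.

```python
import random
from pathlib import Path

# Hypothetical mapping from integer labels to label names (an assumption).
LABEL_NAMES = {0: "negative", 1: "positive"}


def to_fasttext_line(text: str, label: int) -> str:
    """Build one `__label__<name> <text>` line; line breaks inside the text are flattened."""
    flat_text = " ".join(text.split())
    return f"__label__{LABEL_NAMES[label]} {flat_text}"


def write_fasttext_file(samples: list[tuple[str, int]], path: str, seed: int = 42) -> None:
    """Shuffle the samples with a fixed seed and write one sample per line."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    lines = [to_fasttext_line(text, label) for text, label in shuffled]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


print(to_fasttext_line("great\nmovie", 1))
```

Flattening line breaks matters because FastText reads one sample per line, so a stray newline inside a review would split it into two samples.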

#### Attributes

You can check a model's attributes as they are listed on the [cheatsheet](https://fasttext.cc/docs/en/options.html). Also, if you have a well-configured IDE or use Jupyter Lab, tab is your friend.

#### Random seed

To my knowledge, there is no way to set the random seed for FastText. It uses C++ code in the back, so using `random.seed()` won't help. For every other model you will use in these projects, please set the random seed to make your results reproducible.
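
For those other models, a small seed helper is usually enough. The sketch below seeds Python's `random` module, and NumPy only if it is installed. The helper name `set_seed` is an assumption, and other libraries may need their own seeding call.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the Python and (if available) NumPy global random generators."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency

        np.random.seed(seed)
    except ImportError:
        pass


set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # the same seed gives the same sequence
```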

#### Data split

Do not use the test set for hyperparameters search. Extract a validation set from the training data for that purpose. The test set is only made for comparing final models (see [data leakage](https://en.wikipedia.org/wiki/Leakage_%28machine_learning%29)).

In [1]:
import fasttext
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

import os
import sys

parent_dir = os.path.dirname(os.getcwd())
sys.path.append(str(parent_dir))

from scripts.sentiment_analysis import fast_text_utils



## Data loading

Load the dataset as a `pandas` DataFrame.

In [2]:
dataset = load_dataset('imdb')
df: pd.DataFrame = dataset['train'].to_pandas()

# Display the first 5 rows of the training dataset
df.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


## Dataset conversion to the FastText format

In [3]:
# Apply pretreatment to the text
df['text'] = df['text'].apply(fast_text_utils.preprocess_text)

# Convert the data to FastText format
fast_text_utils.to_fast_text_format(df=df, label_column_name='label', texts_column_name='text')

# Keep only the text column
# train_df = train_df[['text']]

# Display some rows of the newly formatted data
df.head()

| | text | label |
|---|---|---|
| 0 | __label__negative i rented i am curiousyellow ... | 0 |
| 1 | __label__negative i am curious yellow is a ris... | 0 |
| 2 | __label__negative if only to avoid making this... | 0 |
| 3 | __label__negative this film was probably inspi... | 0 |
| 4 | __label__negative oh brotherafter hearing abou... | 0 |
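
The helper `fast_text_utils.preprocess_text` is not shown in this notebook. A minimal sketch of the two allowed steps (lower casing and punctuation removal) could look like the following. It assumes that punctuation is deleted rather than replaced by a space, which matches tokens such as `curiousyellow` in the output above, and it only handles ASCII punctuation from `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation (hypothetical re-implementation)."""
    return text.lower().translate(_PUNCTUATION_TABLE)


print(preprocess_text("I rented I AM CURIOUS-YELLOW"))
print(preprocess_text("Oh, brother...after hearing"))
```

Deleting punctuation without inserting a space glues some words together (`brother...after` becomes `brotherafter`). Replacing punctuation with a space instead would be a different design choice.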


Split the data into train, validation and test sets with `scikit-learn`. `train_test_split` shuffles the rows by default, and we fix `random_state` to keep the results consistent across runs.

In [4]:
train_df, test_df = train_test_split(df[['text']], test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
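
The split above shuffles but does not stratify. With scikit-learn, stratification is requested through the `stratify` argument of `train_test_split`. To show what that guarantees, here is a plain-Python sketch (not scikit-learn's implementation) that splits each class separately so that both parts keep the same label ratio. `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels: list[int], test_size: float, seed: int = 42) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices) with the same label ratio in both parts."""
    rng = random.Random(seed)
    indices_by_label: dict[int, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_indices: list[int] = []
    test_indices: list[int] = []
    for label_indices in indices_by_label.values():
        # Shuffle within the class, then cut off the test share of that class.
        rng.shuffle(label_indices)
        n_test = round(len(label_indices) * test_size)
        test_indices.extend(label_indices[:n_test])
        train_indices.extend(label_indices[n_test:])

    # Shuffle again so that the classes are interleaved in each part.
    rng.shuffle(train_indices)
    rng.shuffle(test_indices)
    return train_indices, test_indices


# Toy labels ordered by class, like the dataset: 0s first, then 1s.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positive examples in the test part")
```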

Define output paths

In [5]:
TRAINING_PATH = './output/train.txt'
VALIDATION_PATH = './output/val.txt'
TEST_PATH = './output/test.txt'

Save the different datasets to their respective paths

In [6]:
# Create the output directory if it doesn't exist
os.makedirs('./output', exist_ok=True)

# Write the data to the output files
fast_text_utils.save_to_file(df=train_df, file_name=TRAINING_PATH)
fast_text_utils.save_to_file(df=val_df, file_name=VALIDATION_PATH)
fast_text_utils.save_to_file(df=test_df, file_name=TEST_PATH)

## FastText classifier

Train a `FastText` classifier with the default parameters.

In [7]:
ft_classifier_default = fasttext.train_supervised(TRAINING_PATH)

Read 3M words
Number of words:  95575
Number of labels: 2
Progress: 100.0% words/sec/thread: 2833902 lr:  0.000000 avg.loss:  0.475094 ETA:   0h 0m 0s


Evaluate the classifier on the testing dataset.

In [8]:
results = ft_classifier_default.test(TEST_PATH)
fast_text_utils.display_results(results)

Number of examples: 5000
Precision: 85.34%
Recall: 0.8534
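
`model.test` returns the number of examples together with precision and recall at `k=1`. When every example has exactly one gold label and the model outputs one prediction, both values reduce to accuracy, which is why the two numbers above match. A small sketch of that computation follows; the function name and the toy labels are illustrative assumptions.

```python
def precision_recall_at_1(gold: list[set[str]], predicted: list[str]) -> tuple[float, float]:
    """Compute precision@1 and recall@1 from gold label sets and one prediction per example."""
    hits = sum(1 for labels, guess in zip(gold, predicted) if guess in labels)
    n_predictions = len(predicted)  # one prediction per example
    n_gold_labels = sum(len(labels) for labels in gold)
    return hits / n_predictions, hits / n_gold_labels


# Toy data: a single gold label per example, as in our binary task.
gold = [{"negative"}, {"positive"}, {"positive"}, {"negative"}]
predicted = ["negative", "positive", "negative", "negative"]

precision, recall = precision_recall_at_1(gold, predicted)
accuracy = sum(labels == {guess} for labels, guess in zip(gold, predicted)) / len(gold)
print(precision, recall, accuracy)
```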


## Hyperparameter search

Train a new `FastText` classifier, this time also passing the validation file so that autotune can search for hyperparameters. We also specify the search duration (5 minutes).

In [9]:
ft_classifier_tuned = fasttext.train_supervised(input=TRAINING_PATH, autotuneValidationFile=VALIDATION_PATH, autotuneDuration=5 * 60)

Progress: 100.0% Trials:   30 Best score:  0.894250 ETA:   0h 0m 0s
Training again with best arguments
Read 3M words
Number of words:  95575
Number of labels: 2
Progress: 100.0% words/sec/thread:  760747 lr:  0.000000 avg.loss:  0.073967 ETA:   0h 0m 0s


Evaluate the newly autotuned model.

In [10]:
results = ft_classifier_tuned.test(TEST_PATH)
fast_text_utils.display_results(results)

Number of examples: 5000
Precision: 89.56%
Recall: 0.8956


The `tuned` model scores about 4 percentage points higher than the `default` model in both precision and recall (89.56% vs 85.34%).

## Models comparison

In [11]:
# Define a selection of relevant hyperparameters to observe
attributes_list: list[str] = ['dim', 'ws', 'epoch', 'lr', 'wordNgrams', 'loss', 'lrUpdateRate', 'bucket']

# Display the default model attributes
print('-- Default model attributes -- ')
fast_text_utils.display_model_attributes(model=ft_classifier_default, parameters=attributes_list)

print()

# Display the tuned model attributes
print('-- Tuned model attributes --')
fast_text_utils.display_model_attributes(model=ft_classifier_tuned, parameters=attributes_list)

-- Default model attributes -- 
dim: 100
ws: 5
epoch: 5
lr: 0.1
wordNgrams: 1
loss: loss_name.softmax
lrUpdateRate: 100
bucket: 0

-- Tuned model attributes --
dim: 10
ws: 5
epoch: 77
lr: 0.5467089873546624
wordNgrams: 5
loss: loss_name.softmax
lrUpdateRate: 100
bucket: 1223063
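
To read the two listings side by side, one can keep only the attributes whose values differ. The sketch below hard-codes the values printed above as plain dictionaries (with the loss written as a plain string); `diff_attributes` is a hypothetical helper and not part of `fast_text_utils`.

```python
def diff_attributes(default: dict, tuned: dict) -> dict:
    """Return {name: (default_value, tuned_value)} for the attributes that differ."""
    return {
        name: (default[name], tuned[name])
        for name in default
        if default[name] != tuned[name]
    }


# Values copied from the output above.
default_attrs = {"dim": 100, "ws": 5, "epoch": 5, "lr": 0.1, "wordNgrams": 1,
                 "loss": "softmax", "lrUpdateRate": 100, "bucket": 0}
tuned_attrs = {"dim": 10, "ws": 5, "epoch": 77, "lr": 0.5467089873546624, "wordNgrams": 5,
               "loss": "softmax", "lrUpdateRate": 100, "bucket": 1223063}

for name, (before, after) in diff_attributes(default_attrs, tuned_attrs).items():
    print(f"{name}: {before} -> {after}")
```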


First, we can observe that the context window (`ws`), the loss and the learning rate update rate (`lrUpdateRate`) are identical in the two models.

Then, we can compare the hyperparameters that differ between the two models:
* `dim` (size of word vectors) - The tuning reduced the dimensionality of the vectors by a factor of 10 (100 vs 10). The tuned model therefore has far fewer parameters per word, and we can assume that this makes it less prone to overfitting.

* `epoch` (number of epochs) - The number of epochs (passes over the training data) is much higher in the tuned model (5 vs 77). Training longer usually improves the fit to the training data, at the cost of a higher risk of overfitting. The tuning process limits this risk by keeping the value that scored best on the validation set.

* `lr` (learning rate) - The tuning process multiplied the learning rate by about 5.5 (0.1 vs ~0.55). This goes in the same direction as the previous point: the tuned model takes larger steps for many more epochs, which suggests that the default model was undertrained.

* `wordNgrams` (max length of word ngram) - The maximum word n-gram length increased from 1 to 5. The model can now take the local context of a word into account (for example, "not good" as a unit), which unigrams alone cannot capture. We can assume that this helps it generalize better on sentiment data.

* `bucket` (number of buckets) - The default model reports 0 buckets and the tuned model 1223063. Buckets are the hash slots in which n-gram features are stored. The default model uses only unigrams, so it has nothing to hash, whereas the tuned model needs buckets for its word n-grams. The non-zero value is thus probably a consequence of the model now taking the context of a word into account.
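
To make the link between `wordNgrams` and `bucket` concrete, here is a simplified sketch of the hashing trick: every word n-gram is mapped to one of `bucket` slots, so the model needs no explicit n-gram vocabulary. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and not FastText's actual hash function.

```python
import zlib


def word_ngrams(tokens: list[str], max_n: int) -> list[str]:
    """List all word n-grams of length 2..max_n (unigrams live in the regular vocabulary)."""
    ngrams = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


def bucket_index(ngram: str, bucket: int) -> int:
    """Map an n-gram to a slot in a table of `bucket` rows (stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket


tokens = "this film is not good".split()

# With max_n=1 there are no n-grams beyond unigrams, so no buckets are needed.
print(word_ngrams(tokens, max_n=1))

# With max_n=2 every bigram gets a slot among the buckets.
for ngram in word_ngrams(tokens, max_n=2):
    print(ngram, bucket_index(ngram, bucket=1223063))
```

Two different n-grams can land in the same slot. A larger `bucket` value reduces such collisions at the cost of a bigger model.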

## Misclassified examples

In [12]:
examples: list[tuple] = []

# Select two examples that were misclassified
for i in range(df.shape[0]):
    example = df['text'].iloc[i]
    ground_truth = '__label__' + ('positive' if df['label'].iloc[i] else 'negative')
    prediction = ft_classifier_tuned.predict(example)

    if prediction[0][0] != ground_truth:
        examples.append((example, prediction[0][0], prediction[1][0], ground_truth))
    
    if len(examples) == 2:
        break

# Display the examples
for example in examples:
    print()
    print('Example: ', example[0])
    print('Prediction: ', example[1])
    print(f'Confidence: {example[2] * 100:.2f}%')
    print('Ground truth: ', example[3])


Example:  __label__negative i did not like the idea of the female turtle at all since 1987 we knew the tmnt to be four brothers with their teacher splinter and their enemies and each one of the four brothers are named after the great artists name like leonardo  michelangleo raphel and donatello so venus here doesnt have any meaning or playing any important part and i believe that the old tmnt series was much more better than that new one which contains venus as a female turtle will not add any action to the story we like the story of the tmnt we knew in 1987 to have new enemies in every part is a good point to have some action but to have a female turtle is a very weak point to have some action we wish to see more new of tmnt series but just as the same characters we knew in 1987 without that female turtle
Prediction:  __label__positive
Confidence: 65.09%
Ground truth:  __label__negative

Example:  __label__negative i am not so much like love sick as i image finally the film express s

* First example

The text is mostly descriptive and does not clearly express a sentiment. We could assume that the classification relied on the movie synopsis (which makes up most of the text) rather than on the review part.

* Second example

This example has been classified as `positive`, whereas it is actually `negative`. The text is a detailed description that contains many positive words. The lack of context consideration is probably at stake here, because a human reader would base the sentiment on adjacent words.
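
The selection loop above scans the full `df`, which also contains training rows, and passes each text with its `__label__` prefix. A hedged sketch of collecting errors from test lines only is given below. It accepts any `predict` callable that returns `(labels, probabilities)`, in the way `model.predict` is used above. The helper names and the toy predictor are assumptions for illustration.

```python
from typing import Callable, Sequence


def split_label(line: str) -> tuple[str, str]:
    """Split a `__label__<name> <text>` line into (label, text)."""
    label, _, text = line.partition(" ")
    return label, text


def find_misclassified(lines: Sequence[str], predict: Callable, limit: int = 2) -> list[tuple]:
    """Return up to `limit` tuples (text, predicted_label, confidence, gold_label)."""
    errors = []
    for line in lines:
        gold_label, text = split_label(line)
        labels, probabilities = predict(text)
        if labels[0] != gold_label:
            errors.append((text, labels[0], float(probabilities[0]), gold_label))
        if len(errors) == limit:
            break
    return errors


def toy_predict(text: str):
    """Toy stand-in for a trained model: predicts positive when 'good' appears."""
    label = "__label__positive" if "good" in text else "__label__negative"
    return (label,), (0.9,)


test_lines = [
    "__label__negative not good at all",
    "__label__positive a good film",
    "__label__positive loved it",
]
for text, predicted_label, confidence, gold_label in find_misclassified(test_lines, toy_predict):
    print(f"{text!r}: predicted {predicted_label} ({confidence:.0%}), gold {gold_label}")
```

With the notebook's objects, the same function could be called on the lines read from `TEST_PATH`, with `ft_classifier_tuned.predict` as the `predict` argument.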

## Bonus question

### Why is it likely that the attributes minn and maxn are at 0 after a hyperparameter search on our data?

In [13]:
parameters: list[str] = ['minn', 'maxn']

# Display the default model attributes
print('-- Default model attributes -- ')
fast_text_utils.display_model_attributes(model=ft_classifier_default, parameters=parameters)
# Display the tuned model attributes
print('-- Tuned model attributes --')
fast_text_utils.display_model_attributes(model=ft_classifier_tuned, parameters=parameters)

-- Default model attributes -- 
minn: 0
maxn: 0
-- Tuned model attributes --
minn: 2
maxn: 5


We are processing English texts here. To take the context of a word into consideration, we would therefore rather rely on the `wordNgrams` hyperparameter. The `minn` and `maxn` hyperparameters control character n-grams (subword information), which are less relevant for a language like English, where words have few inflected forms.

However, we observe the opposite situation here: `minn` and `maxn` are set to 2 and 5 after hyperparameter tuning. A possible explanation is that character n-grams help with tokens glued together by punctuation removal (such as `curiousyellow` or `brotherafter` in the preprocessed data) and with misspellings. The search is also time-limited, so the configuration it returns is the best one it tried, not necessarily the only good one.
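
For reference, `minn` and `maxn` bound the length of the character n-grams extracted from each word, after the word is wrapped in `<` and `>` boundary markers. The sketch below is a simplified illustration of that idea; it ignores details of the real implementation such as byte-level handling of UTF-8, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word: str, minn: int, maxn: int) -> list[str]:
    """Return the character n-grams of length minn..maxn of the word wrapped in < and >."""
    if minn <= 0 or maxn <= 0:
        return []  # minn = maxn = 0 disables subword features
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
print(char_ngrams("where", 0, 0))
```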