# FastText

Use the [FastText library](https://fasttext.cc/docs/en/support.html) to train and test a classifier.

Go through the following steps.
1. (2 points) Convert the dataset to a format compatible with FastText (see the _Tips on using FastText_ section a bit lower).
   * For preprocessing, only apply lower casing and punctuation removal.
2. (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.
3. (2 points) Use the [hyperparameter search functionality](https://fasttext.cc/docs/en/autotune.html) of FastText and repeat step 2.
   * To do so, you'll need to [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) your training set into a training and a validation set.
   * Let the model search for 5 minutes (the default search time).
   * Don't forget to shuffle (and stratify) your splits. The dataset has its entries ordered by label (0s first, then 1s). Feeding the classifier one class and then the other can hurt its performance.
4. (1 point) Compare the attributes of the default model with those found by the hyperparameter search. How do the two models differ?
   * Only refer to the attributes you think are interesting.
   * See the _Tips on using FastText_ (just below) for help.
5. (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.
6. (Bonus point) Why is it likely that the attributes `minn` and `maxn` are at 0 after a hyperparameter search on our data?
   * Hint: what language are we working with?

### Tips on using FastText

FastText is not documented in much detail, so you might run into a few problems. The following tips can be useful.

#### Training file format

Training a FastText classifier takes a text file as input. Every line corresponds to a sample and must have the following format:
```
__label__<your_label> <corresponding text>
```
For example, in our case a line should look like this.
```
__label__positive you know robin williams god bless him is constantly...
```
Also, the data are ordered by label (`negative` first, then `positive`). To avoid biasing the model toward the class it sees last, **shuffle your data before training**.
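
As a rough illustration of the format above, the sketch below builds one FastText line from a label and a raw review, applying only lower casing and punctuation removal. The function name `to_fasttext_line` is hypothetical, and stripping the ASCII characters in `string.punctuation` is only one assumed reading of "punctuation removal".

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_fasttext_line(label: str, text: str) -> str:
    """Return one line in the `__label__<label> <text>` format (hypothetical helper)."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    # Collapse whitespace (including newlines) so each sample stays on one line.
    cleaned = " ".join(cleaned.split())
    return f"__label__{label} {cleaned}"


print(to_fasttext_line("positive", "You know, Robin Williams, God bless him!"))
# __label__positive you know robin williams god bless him
```

Collapsing whitespace matters because FastText reads one sample per line, so a stray newline inside a review would otherwise split it into two samples.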

#### Attributes

You can check a model's attributes as they are listed on the [cheatsheet](https://fasttext.cc/docs/en/options.html). Also, if you have a well-configured IDE or use Jupyter Lab, tab completion is your friend.
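
For step 4, one possible way to compare two models is to read the same attribute names from both and keep only those that differ. The sketch below assumes the trained model objects expose the cheatsheet options as plain attributes. The helper `diff_attributes` is hypothetical, and the demo uses stand-in objects with made-up values rather than real models.

```python
from types import SimpleNamespace

# Option names taken from the FastText options cheatsheet.
OPTION_NAMES = ["lr", "dim", "ws", "epoch", "minCount", "minn", "maxn",
                "wordNgrams", "loss", "bucket"]


def diff_attributes(model_a, model_b, names=OPTION_NAMES):
    """Map each attribute that differs to its (model_a, model_b) values."""
    diffs = {}
    for name in names:
        a = getattr(model_a, name, None)  # None when the attribute is missing
        b = getattr(model_b, name, None)
        if a != b:
            diffs[name] = (a, b)
    return diffs


# Stand-in objects with made-up values; real models would be passed instead.
default_like = SimpleNamespace(lr=0.1, epoch=5, wordNgrams=1)
tuned_like = SimpleNamespace(lr=0.1, epoch=20, wordNgrams=2)
print(diff_attributes(default_like, tuned_like))
# {'epoch': (5, 20), 'wordNgrams': (1, 2)}
```

To use this on real models, the default and the tuned classifier need to be kept in two separate variables.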

#### Random seed

To my knowledge, there is no way to set the random seed for FastText. It runs C++ code under the hood, so calling Python's `random.seed()` won't help. For every other model you use in these projects, please set the random seed to make your results reproducible.
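
For those other models, a minimal seeding sketch could look like the following. The helper name `set_seed` is hypothetical, and NumPy is treated as optional here. scikit-learn functions and estimators usually take a `random_state` argument instead of relying on a global seed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed Python's RNG and, when it is installed, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional in this sketch.
    np.random.seed(seed)


set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True: the same seed gives the same sequence
```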

#### Data split

Do not use the test set for hyperparameter search. Extract a validation set from the training data for that purpose. The test set is reserved for comparing final models (see [data leakage](https://en.wikipedia.org/wiki/Leakage_%28machine_learning%29)).

In [1]:
import fasttext
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

import os
import sys

parent_dir = os.path.dirname(os.getcwd())
sys.path.append(str(parent_dir))

from scripts.sentiment_analysis import fast_text_utils


## Data loading

Load the dataset as a `pandas` DataFrame.

In [2]:
dataset = load_dataset('imdb')
train_df = dataset['train'].to_pandas()

# Display the first 5 rows of the training dataset
train_df.head()


| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


## Dataset conversion to the FastText format

In [3]:
# Preprocess the text (lower casing and punctuation removal)
train_df['text'] = train_df['text'].apply(fast_text_utils.preprocess_text)

# Convert the data to FastText format
fast_text_utils.to_fast_text_format(df=train_df, label_column_name='label', texts_column_name='text')

# Keep only the text column
train_df = train_df[['text']]

# Display some rows of the newly formatted data
train_df.head()

| | text |
|---|---|
| 0 | __label__negative i rented i am curiousyellow ... |
| 1 | __label__negative i am curious yellow is a ris... |
| 2 | __label__negative if only to avoid making this... |
| 3 | __label__negative this film was probably inspi... |
| 4 | __label__negative oh brotherafter hearing abou... |


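Split the data into train, validation and test sets with `scikit-learn`, which shuffles the rows by default. We fix `random_state` so that the shuffled splits are reproducible from one run to the next. The two successive 80/20 splits yield 64% train, 16% validation and 20% test.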

In [4]:
train_df, test_df = train_test_split(train_df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
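
The split in this cell shuffles but does not stratify. Since the label is already embedded in each line at this point, one option is to stratify on the `__label__` prefix. The sketch below is a hand-rolled, standard-library alternative to passing `stratify=` to `train_test_split`. The function name `stratified_split` and its rounding rule are assumptions, not part of the notebook.

```python
import random
from collections import defaultdict


def stratified_split(lines, test_size=0.2, seed=42):
    """Split FastText lines so each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for line in lines:
        # The label is the first whitespace-separated token of the line.
        by_label[line.split(maxsplit=1)[0]].append(line)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


lines = [f"__label__negative review {i}" for i in range(10)]
lines += [f"__label__positive review {i}" for i in range(10)]
train, test = stratified_split(lines)
print(len(train), len(test))  # 16 4
```

The final shuffle is there because the per-label loop would otherwise leave each part ordered by class, which is the situation the assignment warns against.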

Define the output paths.

In [5]:
TRAINING_PATH = './output/train.txt'
VALIDATION_PATH = './output/val.txt'
TEST_PATH = './output/test.txt'

Save each dataset to its respective path.

In [6]:
# Create the output directory if it doesn't exist
os.makedirs('./output', exist_ok=True)

# Write the data to the output files
fast_text_utils.save_to_file(df=train_df, file_name=TRAINING_PATH)
fast_text_utils.save_to_file(df=val_df, file_name=VALIDATION_PATH)
fast_text_utils.save_to_file(df=test_df, file_name=TEST_PATH)
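
The `save_to_file` helper is not shown in this notebook. As an assumption about what such a step needs to do, the sketch below writes one sample per line in UTF-8, which is the layout FastText reads. The function `write_lines` is hypothetical and is demonstrated in a temporary directory.

```python
import os
import tempfile


def write_lines(lines, path):
    """Write one sample per line, UTF-8 encoded (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Demo in a temporary directory so nothing in ./output is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.txt")
    write_lines(["__label__negative bad film", "__label__positive great film"], demo_path)
    with open(demo_path, encoding="utf-8") as handle:
        written = handle.read().splitlines()

print(written)
# ['__label__negative bad film', '__label__positive great film']
```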

## FastText classifier

Train a `FastText` classifier with the default parameters.

In [7]:
ft_classifier = fasttext.train_supervised(TRAINING_PATH)

Read 3M words
Number of words:  95575
Number of labels: 2
Progress: 100.0% words/sec/thread: 1841751 lr:  0.000000 avg.loss:  0.466556 ETA:   0h 0m 0s


Evaluate the classifier on the testing dataset.

In [8]:
results = ft_classifier.test(TEST_PATH)
fast_text_utils.display_results(results)

Number of examples: 5000
Precision: 0.8564
Recall: 0.8564
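
Step 2 asks for accuracy, while the output of this cell reports precision and recall. Assuming these are the precision at 1 and recall at 1 returned by `test`, and given that every example has exactly one gold label and one predicted label, both values reduce to the fraction of correct predictions. That is why the two numbers match and can be read as accuracy. The sketch below checks this on made-up labels; the helper names are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def precision_recall_at_1(gold, predicted):
    """Precision and recall at k=1 with one gold label per example."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    n_predicted = len(predicted)  # one prediction per example
    n_gold = len(gold)            # one gold label per example
    return correct / n_predicted, correct / n_gold


gold = ["positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
print(accuracy(gold, predicted))               # 0.75
print(precision_recall_at_1(gold, predicted))  # (0.75, 0.75)
```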


## Hyperparameter search

Train a new `FastText` classifier, this time also passing the validation file so that FastText can search for hyperparameters. We set the search duration to 5 minutes, which is also the default.

In [9]:
ft_classifier = fasttext.train_supervised(input=TRAINING_PATH, autotuneValidationFile=VALIDATION_PATH, autotuneDuration=5 * 60)

Progress: 100.0% Trials:   18 Best score:  0.894250 ETA:   0h 0m 0s
Training again with best arguments
Read 3M words
Number of words:  95575
Number of labels: 2
Progress: 100.0% words/sec/thread:  611089 lr:  0.000000 avg.loss:  0.084821 ETA:   0h 0m 0s


Evaluate the newly autotuned model.

In [10]:
results = ft_classifier.test(TEST_PATH)
fast_text_utils.display_results(results)

Number of examples: 5000
Precision: 0.8942
Recall: 0.8942
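
For step 5, the wrongly classified test examples have to be collected first. The sketch below shows one way to do this over lines in the FastText format. The helper `find_misclassified` is hypothetical, and `toy_predict` is a made-up rule standing in for the tuned model so that the example runs on its own.

```python
def find_misclassified(lines, predict_label, limit=2):
    """Return up to `limit` (text, gold, predicted) triples the model got wrong."""
    errors = []
    for line in lines:
        # Each line is "<label> <text>"; split on the first whitespace only.
        gold, text = line.split(maxsplit=1)
        predicted = predict_label(text)
        if predicted != gold:
            errors.append((text, gold, predicted))
            if len(errors) == limit:
                break
    return errors


# Toy predictor standing in for a trained model (made-up rule).
def toy_predict(text):
    return "__label__positive" if "good" in text else "__label__negative"


test_lines = [
    "__label__positive a good film",
    "__label__negative not good at all",
    "__label__positive subtle and moving",
]
for text, gold, predicted in find_misclassified(test_lines, toy_predict):
    print(f"gold={gold} predicted={predicted} text={text}")
```

With a FastText model, `predict_label` could be something like `lambda text: ft_classifier.predict(text)[0][0]`, assuming `predict` returns the tuple of labels first when given a single string without a trailing newline.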
