## FastText

We set the random seed to make our results reproducible.

In [None]:
import random

random.seed(10)

First, we import the libraries we need to load the data.

In [None]:
from datasets import load_dataset, concatenate_datasets
import pandas as pd
from typing import List, Tuple

We download the dataset from Hugging Face. This time we use the `split` argument of `load_dataset` to load the train and test sets directly.

In [None]:
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a. Subsequent calls will reuse this data.


Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


Now that we have our data, we convert it to DataFrames to make it easier to manipulate. At the same time, we convert the integer labels: 0 becomes `negative` and 1 becomes `positive`.

In [None]:
from typing import List, Tuple

def create_dataframe(data: List[Tuple[int, str]], columns: List[str]) -> pd.DataFrame:
    """ Convert our (label, text) pairs into a DataFrame and map the int labels to strings """

    rtn = pd.DataFrame(data, columns=columns)
    rtn = rtn.replace(0, "negative")
    rtn = rtn.replace(1, "positive")
    return rtn

train_df = create_dataframe(list(zip(dataset_train['label'], dataset_train['text'])), ['Label', 'Text'])
test_df = create_dataframe(list(zip(dataset_test['label'], dataset_test['text'])), ['Label', 'Text'])

Preprocessing is key to getting good results with fastText.

In [None]:
# Import packages for stemming
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

In [None]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's start by removing HTML tags.

In [None]:
# Remove html tags
from bs4 import BeautifulSoup
train_df['Text'] = train_df['Text'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())
test_df['Text'] = test_df['Text'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())
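
To illustrate what tag removal does to a review, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. The `TextExtractor` class, the `strip_tags` helper and the sample review are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags and ignore the tags themselves."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str, separator: str = " ") -> str:
    """Return the text content of `html`, joining text fragments with `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(part.strip() for part in parser.parts if part.strip())


# Hypothetical review containing the kind of line-break tags found in web text
review = "Great film.<br /><br />Would watch again."
cleaned = strip_tags(review)
print(cleaned)
```

Joining the fragments with a space keeps the words on either side of a tag from being glued together. `BeautifulSoup.get_text` accepts a `separator` argument for the same purpose.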

Let's continue with the word tokenization.

In [None]:
# Tokenization
train_df['Text'] = train_df['Text'].apply(lambda x: " ".join(word_tokenize(x)))
test_df['Text'] = test_df['Text'].apply(lambda x: " ".join(word_tokenize(x)))

Now let's stem every token made up entirely of word characters (letters, digits and underscores). Each kept token is lowercased and reduced to its stem; all other tokens, such as punctuation, are dropped.

In [None]:
# Stemming
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

def stemming(text):
    return [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
train_df['Text'] = train_df['Text'].apply(lambda x: " ".join(stemming(x)))
test_df['Text'] = test_df['Text'].apply(lambda x: " ".join(stemming(x)))
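
The filter applied before stemming can be tried on its own. The sketch below reuses the `^\w+$` pattern from the cell above on a hypothetical token list, to show which tokens survive. It leaves out the stemmer so that it runs with the standard library only.

```python
import re

# Same pattern as above: a token is kept only if it consists entirely
# of word characters (letters, digits, underscore)
re_word = re.compile(r"^\w+$")

# Hypothetical tokens, similar to what a word tokenizer might emit
tokens = ["the", "movie", "was", "n't", "great", ",", "10/10", "!", "really"]

kept = [tok for tok in tokens if re_word.match(tok)]
dropped = [tok for tok in tokens if not re_word.match(tok)]

print(kept)
print(dropped)
```

Besides punctuation, the pattern also drops tokens that mix word characters with other symbols, such as `n't` or `10/10`.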

In [None]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Let's lemmatize all the tokens we can find.

In [None]:
# Lemmatization
import spacy

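# Only the lemmas are needed, so the other pipeline components are disabled.
# This cell ran with en_core_web_sm 2.2.5 (spaCy 2.x). On spaCy 3.x, remove
# "tagger" and "lemmatizer" from this list, otherwise token.lemma_ comes back empty.
nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser", "textcat", "lemmatizer"])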

def lemmatization(text):
    return [token.lemma_ for token in nlp(text.lower()) if re_word.match(token.text)]

train_df['Text'] = train_df['Text'].apply(lambda x: " ".join(lemmatization(x)))
test_df['Text'] = test_df['Text'].apply(lambda x: " ".join(lemmatization(x)))

We format the data for fastText: each line starts with the `__label__` prefix followed by the label, then the text.

In [None]:
def transform_row(row: pd.Series) -> str:
    """ Format each row of dataframe in correct way to train fastText model """
    return "__label__" + row['Label'] + " " + row['Text']
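
As a quick illustration of the resulting line format, the sketch below applies the same concatenation to two hypothetical rows, using plain dictionaries instead of DataFrame rows. The `format_line` helper and the sample texts are made up for the example.

```python
def format_line(label: str, text: str) -> str:
    """Build one training line: the label prefix, the label, then the text."""
    return "__label__" + label + " " + text


# Hypothetical preprocessed rows
rows = [
    {"Label": "positive", "Text": "great movi love it"},
    {"Label": "negative", "Text": "bore and too long"},
]

lines = [format_line(row["Label"], row["Text"]) for row in rows]
for line in lines:
    print(line)
```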

The data are sorted by label, so all the reviews of one class come before all the reviews of the other. To keep the model from being biased toward the class it sees last, we need to shuffle our data.

In [None]:
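# DataFrame.sample draws from NumPy's random state, not Python's `random`,
# so we pass the seed explicitly to keep the shuffle reproducible
train_df = train_df.sample(frac=1, random_state=10)
test_df = test_df.sample(frac=1, random_state=10)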
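
The effect of a seeded shuffle can be sketched with the standard library alone. Here `random.Random` stands in for `DataFrame.sample`, and the label list and the `shuffled` helper are hypothetical.

```python
import random
from typing import List

# Hypothetical labels sorted by class, as in an unshuffled dataset
labels = ["negative"] * 5 + ["positive"] * 5


def shuffled(items: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


first = shuffled(labels, seed=10)
second = shuffled(labels, seed=10)

print(first)
print(first == second)  # the same seed gives the same order
```

The shuffle only changes the order of the examples: the class counts stay the same, and reusing the seed reproduces the same order.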

In [None]:
def write_file(dataframe: pd.DataFrame, test: bool = False) -> None:
    """ Write the content of the dataframe to the train or test file, one formatted row per line """
    path = "test_file.txt" if test else "train_file.txt"
    # The context manager closes the file even if a write fails
    with open(path, "w", encoding="utf-8") as f:
        for _, row in dataframe.iterrows():
            f.write(transform_row(row) + "\n")

In [None]:
write_file(train_df)
write_file(test_df, test = True)

In [None]:
import fasttext
model = fasttext.train_supervised(input="train_file.txt", wordNgrams=2, lr=0.5, epoch=25)

In [None]:
model.test("test_file.txt")

(25000, 0.8914, 0.8914)

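`model.test` returns the number of examples, the precision at 1 and the recall at 1. Since each review has exactly one label, both scores equal the accuracy, here 0.8914. This is the best accuracy we have obtained, ahead of the naive Bayes and logistic regression models.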
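
The tuple printed by `model.test` holds the number of examples, the precision at 1 and the recall at 1. The sketch below recomputes these quantities on made-up predictions to show why the two scores coincide when every example has exactly one label. The `precision_recall_at_1` helper and the data are hypothetical and do not call fastText.

```python
from typing import List, Tuple


def precision_recall_at_1(gold: List[List[str]], top1: List[str]) -> Tuple[int, float, float]:
    """Return (number of examples, precision@1, recall@1)."""
    n_examples = len(gold)
    # A prediction is correct if it is one of the gold labels of its example
    correct = sum(1 for labels, pred in zip(gold, top1) if pred in labels)
    n_predictions = len(top1)  # one prediction per example at k=1
    n_gold_labels = sum(len(labels) for labels in gold)
    return n_examples, correct / n_predictions, correct / n_gold_labels


# Made-up gold labels (one per example) and top-1 predictions
gold = [["positive"], ["negative"], ["negative"], ["positive"]]
top1 = ["positive", "negative", "positive", "positive"]

result = precision_recall_at_1(gold, top1)
print(result)  # (4, 0.75, 0.75)
```

With one gold label and one prediction per example, both denominators equal the number of examples, so precision, recall and accuracy are the same number.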