## fastText Tutorial 

## Replication of original work

## Text classification
Text classification is a core problem in many applications, such as spam detection, sentiment analysis or smart replies. In this tutorial, we describe how to build a text classifier with the fastText tool.

## What is text classification?
The goal of text classification is to assign documents (such as emails, posts, text messages or product reviews) to one or multiple categories. Such categories can be review scores, spam vs. non-spam, or the language in which the document was typed. Nowadays, the dominant approach to building such classifiers is machine learning, that is, learning classification rules from examples. To build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).

As an example, we build a classifier that automatically assigns Stack Exchange questions about cooking to one of several possible tags, such as pot, bowl or baking.

!wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip

In [1]:
import pandas as pd
import numpy as np
import wget
import os

In [2]:
#! pip install fasttext
import fasttext
#help(fasttext.FastText)

train_supervised() returns a model object, on which we can then call test and predict. Calling it is what trains the text classifier.

## Getting and Preparing the data

As mentioned in the introduction, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier that automatically recognizes the topic of a Stack Exchange question about cooking. Let's download examples of questions from the cooking section of Stack Exchange, and their associated tags:

In [3]:
!wget --no-check-certificate https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt


--2021-11-04 23:22:36--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: 'cooking.stackexchange.tar.gz.3'


In [4]:
!head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces


Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start with the __label__ prefix, which is how fastText tells labels apart from words. The model is then trained to predict the labels given the words in the document.
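
To make the file format concrete, here is a minimal sketch that separates the labels from the text of one line. The helper name `split_labels_and_text` is hypothetical and is not part of fastText, which does its own parsing internally; the sketch only assumes the default `__label__` prefix described above.

```python
LABEL_PREFIX = "__label__"  # default label prefix, as seen in the data file


def split_labels_and_text(line):
    """Split one training line into (labels, text).

    Hypothetical helper for illustration only: tokens starting with the
    prefix are treated as labels, everything else as words of the document.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = split_labels_and_text(
    "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
)
print(labels)
print(text)
```

A helper like this can be handy for checking how many labels each question carries before training.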

Before training our first classifier, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data.

In [5]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt


Our full dataset contains 15404 examples. Let's split it into a training set of 12404 examples and a validation set of 3000 examples:

In [6]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
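
If the head and tail commands are not available (for instance on Windows), the same split can be sketched in plain Python. The function names below are hypothetical; the sketch assumes the file fits in memory and keeps the original line order, so the last 3000 lines form the validation set exactly as above.

```python
def split_train_valid(lines, n_valid):
    """Return (train, valid), where valid holds the last n_valid lines."""
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    return lines[:-n_valid], lines[-n_valid:]


def write_split(src, train_path, valid_path, n_valid):
    """Read src and write the two splits; returns the size of each split."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    train, valid = split_train_valid(lines, n_valid)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(valid)
    return len(train), len(valid)


# Example call (commented out so the sketch does not touch the disk):
# write_split("cooking.stackexchange.txt", "cooking.train", "cooking.valid", 3000)
```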

## Our First Classifier

In [7]:
model = fasttext.train_supervised(input="cooking.train")

The input argument indicates the file containing the training examples. We can now use the model variable to access information on the trained model.

We can also call save_model to save it as a file and load it later with the load_model function.

In [8]:
model.save_model("model_cooking.bin")

In [9]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.06506271]))

The predicted tag is baking, which fits this question well. Let us now try a second example:

In [10]:
model.predict("Why not put knives in the dishwasher?")

(('__label__baking',), array([0.07080489]))

The label predicted by the model is baking, which is not relevant. Somehow, the model seems to fail on simple examples.
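
The value returned by predict is a pair (labels, probabilities), as in the two outputs above. As a small convenience, the sketch below turns such a pair into a list of readable (tag, probability) tuples. `tidy_prediction` is a hypothetical helper, not a fastText function, and a plain list stands in for the NumPy array so that the snippet runs on its own.

```python
def tidy_prediction(prediction, prefix="__label__"):
    """Convert a (labels, probabilities) pair into [(tag, probability), ...]."""
    labels, probs = prediction
    tidy = []
    for label, prob in zip(labels, probs):
        # Strip the label prefix only when it is actually present.
        tag = label[len(prefix):] if label.startswith(prefix) else label
        tidy.append((tag, round(float(prob), 4)))
    return tidy


# Values copied from the first prediction above.
example = (("__label__baking",), [0.06506271])
print(tidy_prediction(example))
```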

To get a better sense of its quality, let's test it on the validation data by running:

In [11]:
model.test("cooking.valid")

(3000, 0.127, 0.05492287732449185)

The output is the number of samples (here 3000), the precision at one (0.127) and the recall at one (0.0549).

We can also compute the precision at five and recall at five by passing k=5. In the run below, the precision at five is 0.066 and the recall at five is 0.143:

In [12]:
model.test("cooking.valid", k=5)

(3000, 0.06613333333333334, 0.14300129739080294)

### Precision and Recall

Precision is the fraction of predicted labels that are correct. Recall is the fraction of the real labels that the model managed to predict. Let's take an example to make this clearer:
<br>
<em>Why not put knives in the dishwasher?</em>
<br>
On Stack Exchange, this question is labelled with three tags: <code>equipment</code>, <code>cleaning</code> and <code>knives</code>. The top five labels predicted by the model can be obtained with:

In [13]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__baking',
  '__label__food-safety',
  '__label__bread',
  '__label__equipment',
  '__label__substitutions'),
 array([0.07080489, 0.06346922, 0.03655997, 0.03551925, 0.0352584 ]))

The top five labels predicted by the model are baking, food-safety, bread, equipment and substitutions.
Thus, one out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33.
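
The arithmetic for this single question can be checked with a few lines of Python. This is only an illustrative sketch for one example; `precision_recall` is a hypothetical helper and not the code fastText uses to evaluate a whole file.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one example, given predicted and real tags."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


predicted = ["baking", "food-safety", "bread", "equipment", "substitutions"]
actual = ["equipment", "cleaning", "knives"]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```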

## Making the model better

The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance, by changing the default parameters.

### Preprocessing data

Looking at the data, we observe that some words contain uppercase letters or punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing. A crude normalization can be obtained using command line tools such as sed and tr:

In [14]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
!head -n 12404 cooking.preprocessed.txt > cooking.train
!tail -n 3000 cooking.preprocessed.txt > cooking.valid
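
For readers who prefer to stay in Python, the same crude normalization can be approximated with the re module. This is a sketch under the assumption that the punctuation set is the one listed in the sed expression (the backslash is left out); `normalize` is a hypothetical helper name.

```python
import re

# Punctuation characters that get surrounded by spaces, as in the sed command.
_PUNCT = re.compile(r"([.!?,'/()])")


def normalize(line):
    """Put spaces around punctuation, then lowercase the whole line."""
    return _PUNCT.sub(r" \1 ", line).lower()


print(normalize("__label__equipment What's the purpose of a bread box?"))
```

Labels such as `__label__food-safety` pass through unchanged, because neither underscores nor hyphens are in the punctuation set.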

Let's train a new model on the pre-processed data:

In [15]:
model = fasttext.train_supervised(input="cooking.train")
model.test("cooking.valid")

(3000, 0.14966666666666667, 0.06472538561337754)

We observe that thanks to the pre-processing, the vocabulary is smaller (from 14k words to 9k). The precision at one also starts to go up, from 12.7% to 15.0% in this run, a gain of about 2 percentage points.

### More epochs and larger learning rate

By default, fastText sees each training example only five times during training, which is quite few given that our training set has only 12k examples. The number of times each example is seen (also known as the number of epochs) can be increased using the epoch argument (the -epoch option of the command-line tool):

In [16]:
model = fasttext.train_supervised(input="cooking.train", epoch=25)

Let's test our model

In [17]:
model.test("cooking.valid")

(3000, 0.5176666666666667, 0.22387199077410985)

This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0.
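
To illustrate what the learning rate controls, here is a toy gradient step on a list of weights. It is a deliberately simplified sketch with made-up numbers, not fastText's actual optimizer: the weights move against the gradient by an amount proportional to lr, so lr=0 leaves them untouched.

```python
def sgd_step(weights, gradient, lr):
    """One plain gradient-descent update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, gradient)]


weights = [0.5, -0.2]   # hypothetical model parameters
gradient = [0.1, -0.4]  # hypothetical gradient for one example

print(sgd_step(weights, gradient, lr=0.0))  # no change: the model learns nothing
print(sgd_step(weights, gradient, lr=0.1))  # small step
print(sgd_step(weights, gradient, lr=1.0))  # large step
```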

In [18]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0)
model.test("cooking.valid")

(3000, 0.5826666666666667, 0.25198212483782617)

Even better! Let's try both together:

In [19]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25)

In [20]:
model.test("cooking.valid")

(3000, 0.587, 0.2538561337754072)

Let us now add a few more features to improve our performance even further!

### Using word n-grams

Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.
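
As a rough picture of what word n-grams are, the sketch below lists the unigrams and bigrams of a short question. `word_ngrams` is a hypothetical helper for illustration only; fastText builds its n-gram features internally and hashes them into a fixed number of buckets rather than storing them as strings.

```python
def word_ngrams(tokens, n):
    """Return every sequence of 1 to n consecutive tokens, joined by spaces."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


print(word_ngrams("fan bake vs bake".split(), 2))
```

With bigrams, "bake vs" and "vs bake" become distinct features, which is how some word-order information reaches the model.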

In [21]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)

In [22]:
model.test("cooking.valid")

(3000, 0.6076666666666667, 0.26279371486233244)

With a few steps, we were able to go from a precision at one of 12.7% to 60.8%. Important steps included:
<br>
<ul>
    <li>preprocessing the data</li>
    <li>changing the number of epochs (using the option -epoch, standard range [5 - 50])</li>
    <li>changing the learning rate (using the option -lr, standard range [0.1 - 1.0])</li>
    <li>using word n-grams (using the option -wordNgrams, standard range [1 - 5])</li>
</ul>

## Hierarchical softmax

Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax instead of the regular softmax. This can be done with the loss='hs' argument (the option -loss hs on the command line):

In [23]:
%%time
model = fasttext.train_supervised(input="cooking.train", lr=1.0,
                    epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')

Wall time: 804 ms


Training should now take less than a second.

The hierarchical softmax is a loss function that approximates the softmax with a much faster computation. It does this by using a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (e.g. a sigmoid) that is trained and predicts whether we should go to the left or to the right. The probability of an output is then given by the product of the probabilities of the intermediate nodes along the path from the root to the output leaf. In fastText, we use a Huffman tree, so that the lookup time is faster for more frequent outputs and thus the average lookup time for the output is optimal.
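
The two ideas in this paragraph can be sketched in a few lines of Python. This is a toy illustration, not fastText's implementation: the label counts and node scores are made up, and `huffman_depths` and `path_probability` are hypothetical helper names. The first function shows that frequent labels end up closer to the root of a Huffman tree; the second multiplies the left/right probabilities along a path.

```python
import heapq
import itertools
import math


def huffman_depths(frequencies):
    """Return {label: depth of its leaf} in a Huffman tree built from counts."""
    counter = itertools.count()  # tie-breaker so the heap never compares lists
    heap = [(count, next(counter), [label]) for label, count in frequencies.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in frequencies}
    while len(heap) > 1:
        count1, _, labels1 = heapq.heappop(heap)
        count2, _, labels2 = heapq.heappop(heap)
        merged = labels1 + labels2
        for label in merged:  # every merge pushes these leaves one level down
            depths[label] += 1
        heapq.heappush(heap, (count1 + count2, next(counter), merged))
    return depths


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def path_probability(node_scores, go_right):
    """Product of the binary decision probabilities along one root-to-leaf path."""
    prob = 1.0
    for score, right in zip(node_scores, go_right):
        p_right = sigmoid(score)
        prob *= p_right if right else 1.0 - p_right
    return prob


# Hypothetical label counts: the most frequent label gets the shortest path.
depths = huffman_depths({"baking": 5, "bread": 2, "knives": 1, "pot": 1})
print(depths)

# Two undecided nodes (score 0) each contribute a factor of 0.5.
print(path_probability([0.0, 0.0], [True, False]))
```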

## Multi-label classification

When we want to assign a document to multiple labels, we can still use the softmax loss and play with the parameters for prediction, namely the number of labels to predict and the threshold for the predicted probability. However, playing with these arguments can be tricky and unintuitive, since the probabilities must sum to 1.

A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with -loss one-vs-all or -loss ova on the command line, or with loss='ova' in Python:

In [24]:
model1 = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

It is a good idea to decrease the learning rate compared to other loss functions.
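
To see why independent binary classifiers are easier to threshold than a softmax, compare the two on the same scores. The numbers below are made up for illustration, and the functions are a plain-Python sketch rather than fastText internals.

```python
import math


def softmax(scores):
    """Softmax probabilities, which always sum to 1 across labels."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent per-label probability, as in a one-vs-all setup."""
    return 1.0 / (1.0 + math.exp(-x))


scores = [3.0, 2.5, -1.0]  # hypothetical scores for three labels
soft = softmax(scores)
ova = [sigmoid(s) for s in scores]

print([round(p, 3) for p in soft])
print([round(p, 3) for p in ova])
```

With the softmax, the two strong labels compete for the same probability mass, so at most one of them can reach 0.5. With independent sigmoids, both can clear the threshold.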

Now let's have a look at our predictions. We want as many predictions as possible (argument k=-1) and we want only labels with a probability higher than or equal to 0.5:

In [25]:
model1.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__bread',
  '__label__equipment',
  '__label__bananas'),
 array([1.00001001, 0.98718882, 0.93629503, 0.91731268]))
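
The effect of the two arguments can be mimicked on a plain list of (label, probability) pairs. The sketch below only mirrors the behaviour described in the text (k=-1 means no limit on the number of labels, threshold is a minimum probability); `select_labels` is a hypothetical helper with made-up inputs, not fastText's own code.

```python
def select_labels(labels, probs, k=-1, threshold=0.0):
    """Keep labels with probability >= threshold, best first, at most k of them."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    kept = [(label, prob) for label, prob in ranked if prob >= threshold]
    return kept if k == -1 else kept[:k]


labels = ["baking", "bread", "equipment", "knives"]  # hypothetical labels
probs = [0.98, 0.93, 0.61, 0.04]                     # hypothetical probabilities

print(select_labels(labels, probs, k=-1, threshold=0.5))
print(select_labels(labels, probs, k=1, threshold=0.5))
```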

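We can also evaluate our results with the test function. Note that with k=-1 and the default threshold of 0.0, every label counts as predicted, which is why the run below shows a recall of 1.0 and a very low precision; passing a threshold (for example threshold=0.5) gives a more meaningful evaluation: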

In [26]:
model1.test("cooking.valid", k=-1)

(3000, 0.003146031746031746, 1.0)

## Conclusion
In this tutorial, we gave a brief overview of how to use fastText to train powerful text classifiers, and looked at some of the most important options to tune.