# Getting and preparing the data

As mentioned in the introduction, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier to automatically recognize the topic of a Stack Exchange question about cooking. Let's download examples of questions from the cooking section of Stack Exchange, and their associated tags:

> wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz  
> head cooking.stackexchange.txt  
> head -n 12404 cooking.stackexchange.txt > cooking.train  
> tail -n 3000 cooking.stackexchange.txt > cooking.valid
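
The same split can be mirrored in plain Python. The sketch below assumes the usual fastText supervised format, in which each line starts with one or more labels carrying the `__label__` prefix, followed by the question text. The helper names `parse_line` and `split_train_valid` and the sample line are illustrative only and are not part of fastText.

```python
LABEL_PREFIX = "__label__"  # default label prefix expected by fastText


def parse_line(line):
    """Split one line of the data file into (labels, text)."""
    tokens = line.split()
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, " ".join(words)


def split_train_valid(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    train = lines[:n_train]
    valid = lines[-n_valid:] if n_valid > 0 else []
    return train, valid


# Illustrative line in the assumed format (two labels, then the question).
sample = "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
labels, text = parse_line(sample)
```

With the file read into a list of lines, `split_train_valid(lines, 12404, 3000)` would reproduce the `head`/`tail` split used above.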

# Our first classifier

We are now ready to train our first classifier:

In [1]:
import fasttext
model = fasttext.train_supervised(input="../corpus/cooking.train")

We can also call the save_model function to save the model to a file, and load it later with the load_model function.

In [2]:
model.save_model("model_cooking.bin")

In [19]:
model_new = fasttext.load_model("model_cooking.bin")
model_new.predict("Which baking dish is best to bake a banana bread ?")



(('__label__baking',), array([1.00001001]))

Now we can test our classifier on a question:

In [20]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([1.00001001]))

To get a better sense of its quality, let's test it on the validation data. The output is the number of samples (here 3000), the precision at one (0.176) and the recall at one (0.0763):

In [7]:
model.test("../corpus/cooking.valid")

(3000, 0.17633333333333334, 0.07625774830618423)

We can also compute the precision at five and recall at five with:

In [8]:
model.test("../corpus/cooking.valid", k=5)

(3000, 0.07193333333333334, 0.15554274181923022)
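
To make the two metrics concrete, here is a minimal sketch of precision at k and recall at k. It assumes micro-averaging over all examples (correct predictions divided by the total number of predictions, respectively by the total number of gold labels), which is our understanding of what the test output reports. The function name and the toy data are hypothetical.

```python
def precision_recall_at_k(predictions, gold, k):
    """Micro-averaged precision and recall over the top-k predicted labels.

    predictions: list of ranked label lists (best label first), one per example
    gold:        list of true label lists, one per example
    """
    n_correct = n_predicted = n_gold = 0
    for ranked, true_labels in zip(predictions, gold):
        top_k = ranked[:k]
        n_correct += len(set(top_k) & set(true_labels))
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data: two questions, each with two gold labels.
predictions = [
    ["baking", "bread", "equipment"],
    ["food-safety", "chicken", "baking"],
]
gold = [["baking", "bread"], ["equipment", "knives"]]

p1, r1 = precision_recall_at_k(predictions, gold, k=1)  # 0.5, 0.25
p3, r3 = precision_recall_at_k(predictions, gold, k=3)  # 1/3, 0.5
```

On this toy data, predicting more labels lowers the precision and raises the recall, which is the same pattern as in the two test outputs above.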

The top five labels predicted by the model can be obtained with:

In [9]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__food-safety',
  '__label__baking',
  '__label__equipment',
  '__label__substitutions',
  '__label__chicken'),
 array([0.100132  , 0.05670684, 0.04090057, 0.03371016, 0.02893838]))

# Making the model better

The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance, by changing the default parameters.

### preprocessing the data  
Looking at the data, we observe that some words contain uppercase letters or punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing. A crude normalization can be obtained using command line tools such as sed and tr:

> cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt  
> head -n 12404 cooking.preprocessed.txt > cooking.train  
> tail -n 3000 cooking.preprocessed.txt > cooking.valid
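
For readers who prefer to stay in Python, the normalization can be approximated with the `re` module. This is a sketch intended to cover the same punctuation characters as the sed expression above (`. ! ? , ' / ( )`). Small differences in edge cases, such as backslashes or locale-specific uppercase letters, are possible, and the function name `normalize` is our own.

```python
import re

# Punctuation characters to surround with spaces: . ! ? , ' / ( )
PUNCTUATION = re.compile(r"([.!?,'/()])")


def normalize(line):
    """Pad punctuation with spaces, then lowercase the whole line."""
    return PUNCTUATION.sub(r" \1 ", line).lower()


example = normalize("Why not put knives in the dishwasher?")
# example == "why not put knives in the dishwasher ? "
```

Labels are left intact as long as they are already lowercase and contain none of the characters listed above.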

In [10]:
import fasttext

model = fasttext.train_supervised(input="../corpus/cooking.train")
model.test("../corpus/cooking.valid")

(3000, 0.172, 0.07438373936860314)

### more epochs and larger learning rate  
By default, fastText sees each training example only five times during training, which is pretty small given that our training set only has 12k training examples. The number of times each example is seen (also known as the number of epochs) can be increased using the -epoch option (the epoch argument in Python).

Training for more epochs makes the model much better. Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus does not learn anything. Good values of the learning rate are in the range 0.1 - 1.0. Here we change both parameters at once:

In [12]:
model = fasttext.train_supervised(input="../corpus/cooking.train", lr=1.0, epoch=35)
model.test("../corpus/cooking.valid")

(3000, 0.584, 0.2525587429724665)
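
The role of the learning rate can be illustrated with a single stochastic gradient step on a toy one-parameter logistic regression. This is only a sketch of the general idea, not fastText's actual optimizer, and `sgd_step` is a hypothetical helper.

```python
import math


def sgd_step(weight, x, y, lr):
    """One SGD step for a 1-D logistic regression on the example (x, y)."""
    prediction = 1.0 / (1.0 + math.exp(-weight * x))
    gradient = (prediction - y) * x  # gradient of the log loss w.r.t. weight
    return weight - lr * gradient


# Starting from weight 0.0 on the example x=1.0, y=1.0:
unchanged = sgd_step(0.0, 1.0, 1.0, lr=0.0)   # the model does not move
small_step = sgd_step(0.0, 1.0, 1.0, lr=0.1)  # small update
large_step = sgd_step(0.0, 1.0, 1.0, lr=1.0)  # ten times larger update
```

With a learning rate of 0 the weight stays where it is, and the size of the update grows in proportion to the learning rate.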

### word n-grams  

Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.

In [14]:
model = fasttext.train_supervised(input="../corpus/cooking.train", lr=1.0, epoch=25, wordNgrams=2)
model.test("../corpus/cooking.valid")

(3000, 0.593, 0.2564509153812887)

With a few steps, we were able to go from a precision at one of 17.6% to 59.3%. Important steps included:  

> preprocessing the data ;  
> changing the number of epochs (using the option -epoch, standard range [5 - 50]) ;  
> changing the learning rate (using the option -lr, standard range [0.1 - 1.0]) ;  
> using word n-grams (using the option -wordNgrams, standard range [1 - 5]).  

# Advanced readers: What is a Bigram?

A 'unigram' refers to a single indivisible unit, or token, usually used as an input to a model. For example, a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words.

Similarly, we denote by 'bigram' the concatenation of 2 consecutive tokens or words. More generally, we talk about an n-gram to refer to the concatenation of any n consecutive tokens.

For example, in the sentence 'Last donut of the night', the unigrams are 'Last', 'donut', 'of', 'the' and 'night'. The bigrams are: 'Last donut', 'donut of', 'of the' and 'the night'.
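
The definition translates directly into a few lines of Python. The helper `word_ngrams` below is a hypothetical illustration of the concept and says nothing about how fastText stores n-grams internally.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Last donut of the night".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
# bigrams == ['Last donut', 'donut of', 'of the', 'the night']
```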

Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams.

Let us illustrate this with a simple exercise: given the following bigrams, try to reconstruct the original sentence: 'all out', 'I am', 'of bubblegum', 'out of' and 'am all'. Note that it is common to refer to a word as a unigram.

# Scaling things up

Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax instead of the regular softmax. This can be done with the option -loss hs (loss='hs' in Python):

In [13]:
model = fasttext.train_supervised(input="../corpus/cooking.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')
model.test("../corpus/cooking.valid")

(3000, 0.5893333333333334, 0.2548652155110278)

# Advanced readers: hierarchical softmax
The hierarchical softmax is a loss function that approximates the softmax with a much faster computation.

The idea is to build a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (e.g. sigmoid) that is trained, and predicts whether we should go to the left or to the right. The probability of an output unit is then given by the product of the probabilities of the intermediate nodes along the path from the root to the leaf of that output unit.
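
As a minimal sketch of this product of decisions, consider a hypothetical balanced tree over four labels. The tree shape, the node names and the node scores are all made up for illustration; in a real model the scores would come from the trained parameters and the input text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical tree: "root" splits into "left" and "right",
# "left" holds baking/bread, "right" holds equipment/chicken.
# Each leaf is described by its path: a list of (node, go_left) decisions.
PATHS = {
    "baking": [("root", True), ("left", True)],
    "bread": [("root", True), ("left", False)],
    "equipment": [("root", False), ("right", True)],
    "chicken": [("root", False), ("right", False)],
}


def leaf_probability(path, node_scores):
    """Multiply the left/right probabilities along the path to a leaf."""
    prob = 1.0
    for node, go_left in path:
        p_left = sigmoid(node_scores[node])
        prob *= p_left if go_left else 1.0 - p_left
    return prob


node_scores = {"root": 1.2, "left": 0.3, "right": -0.8}  # made-up scores
probs = {label: leaf_probability(path, node_scores) for label, path in PATHS.items()}
total = sum(probs.values())  # equals 1 up to floating-point rounding
```

Computing one label's probability only touches the nodes on its path (two here) instead of all labels, which is where the speed-up comes from.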

For a detailed explanation, you can have a look at this video: https://www.youtube.com/watch?v=B95LTf2rVWM

In fastText, we use a Huffman tree, so that the lookup time is faster for more frequent outputs and thus the average lookup time for the output is optimal.
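
The effect of a Huffman tree on path lengths can be sketched with the standard library. The code below builds a Huffman tree over hypothetical label counts and reports the depth of each leaf. It illustrates the general Huffman construction and is not taken from the fastText source.

```python
import heapq
from itertools import count


def huffman_depths(label_counts):
    """Return the depth of each label's leaf in a Huffman tree."""
    tiebreak = count()  # unique counter so that heap entries never compare lists
    heap = [(freq, next(tiebreak), [label]) for label, freq in label_counts.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in label_counts}
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        freq_a, _, labels_a = heapq.heappop(heap)
        freq_b, _, labels_b = heapq.heappop(heap)
        merged = labels_a + labels_b
        for label in merged:
            depths[label] += 1  # every leaf below the new parent gets one level deeper
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), merged))
    return depths


# Made-up label frequencies, for illustration only.
depths = huffman_depths({"baking": 50, "bread": 30, "chicken": 15, "sushi": 5})
# depths == {'baking': 1, 'bread': 2, 'chicken': 3, 'sushi': 3}
```

The most frequent label sits one step from the root, while rare labels are pushed deeper, so the average path length weighted by frequency stays small.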

# Multi-label classification
When we want to assign a document to multiple labels, we can still use the softmax loss and play with the parameters for prediction, namely the number of labels to predict and the threshold for the predicted probability. However, playing with these arguments can be tricky and unintuitive, since the probabilities must sum to 1.

A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with -loss one-vs-all or -loss ova.

In [14]:
model = fasttext.train_supervised(input="../corpus/cooking.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

It is a good idea to decrease the learning rate compared to other loss functions.
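
To see why independent binary classifiers are easier to threshold than a softmax, compare the two on the same made-up scores. This is a toy sketch of the two output functions and does not reproduce fastText's internals.

```python
import math


def softmax(scores):
    """Probabilities that compete with each other and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def independent_sigmoids(scores):
    """One independent probability per label, as in one-vs-all."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [3.0, 2.5, -1.0]  # hypothetical scores for three labels

softmax_probs = softmax(scores)
ova_probs = independent_sigmoids(scores)

above_softmax = sum(p >= 0.5 for p in softmax_probs)  # 1 label passes
above_ova = sum(p >= 0.5 for p in ova_probs)          # 2 labels pass
```

With the softmax, two strongly supported labels have to share the probability mass, so at most one of them can reach 0.5. With independent sigmoids, each label is judged on its own and both can pass the same threshold.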

Now let's have a look at our predictions. We want as many predictions as possible (argument k=-1), and we only want labels with a probability higher than or equal to 0.5:

In [15]:
model.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__bread',
  '__label__bananas',
  '__label__equipment'),
 array([1.00001001, 0.99427974, 0.92193186, 0.89030427]))

We can also evaluate our results with the test function:

In [16]:
model.test("../corpus/cooking.valid", k=-1)

(3000, 0.003146031746031746, 1.0)