# Text Classification with fastText

This quick tutorial introduces text classification with the [fastText](https://fasttext.cc/) library and walks through the full pipeline, from the beginning (obtaining the dataset and preparing the train/valid split) to the end (predicting labels for unseen input data).

## The Cooking StackExchange tags dataset

We'll use a dataset of roughly 15 thousand questions asked on [Cooking StackExchange](https://cooking.stackexchange.com/), each with one or more tags assigned. The dataset is already available in the fastText format: a text file where each line contains one text document to be classified. Each line starts with one or more `__label__` tags, which denote the "ground truth" labels for that particular document.


`__label__<X> __label__<Y> ... <Text>`


For example:

`__label__chocolate American equivalent for British chocolate terms`
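
To make the format concrete, here is a minimal Python sketch that splits one such line into its labels and its text. The helper name `parse_line` is ours, not part of fastText, and the sketch assumes the default `__label__` prefix.

```python
LABEL_PREFIX = "__label__"  # assumed: the default label prefix


def parse_line(line):
    """Split one fastText-format line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Strip the prefix to get the bare tag name.
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = parse_line(
    "__label__chocolate American equivalent for British chocolate terms"
)
print(labels)  # ['chocolate']
print(text)    # American equivalent for British chocolate terms
```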

--------------------------

In the next few cells we'll download the dataset, take a closer look at what the data looks like (using the [`head`](https://linux.101hacks.com/unix/head/) command) and gather some further statistics about it (using the [`wc`](https://www.tecmint.com/wc-command-examples/) command).


In [0]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

In [0]:
!head cooking.stackexchange.txt

In [0]:
!wc cooking.stackexchange.txt 

We've got roughly 15k samples in our dataset. Let's split it into a training set of roughly 12k samples and a validation set of 3k samples.

In [0]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
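
Note that taking the first 12404 and the last 3000 lines only yields disjoint sets if the file has at least 12404 + 3000 lines. The sketch below shows the same split in Python with that check made explicit. The function name `split_lines` and the optional seeded shuffle are our additions for illustration; the cell above does not shuffle.

```python
import random


def split_lines(lines, n_train, n_valid, seed=None):
    """Return (train, valid): the first n_train and the last n_valid lines."""
    lines = list(lines)
    if n_train + n_valid > len(lines):
        raise ValueError("train and validation sets would overlap")
    if seed is not None:
        # Optional (not done by head/tail): shuffle reproducibly first.
        random.Random(seed).shuffle(lines)
    train = lines[:n_train]
    valid = lines[len(lines) - n_valid:]
    return train, valid


# Toy data standing in for the lines of cooking.stackexchange.txt.
toy = ["doc%d" % i for i in range(10)]
train, valid = split_lines(toy, n_train=7, n_valid=3)
print(len(train), len(valid), set(train) & set(valid))
```

For the real file, the lines could be read with `open("cooking.stackexchange.txt").read().splitlines()` and passed in with `n_train=12404, n_valid=3000`.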

## Installation of fastText

Installing fastText is relatively easy on any Unix-like system -- running the following cell should be enough to build the `fasttext` binary, which is all we need in this tutorial.

In [0]:
!wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
!unzip v0.2.0.zip
%cd fastText-0.2.0
!make
!cp fasttext ../
%cd ..

## Training and testing a fastText model

The actual model fastText implements is rather simple, as we can see in the image below. The objective it minimizes in training is the negative log-likelihood over the $N$ training documents:

$$ - \frac{1}{N} \sum_{n=1}^{N} y_n \log(f(BAx_n)) $$

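where
- $N$ is the number of training documents
- $x_n$ is the normalized bag-of-features (bag-of-words) representation of the $n$-th document
- $y_n$ is the one-hot encoded label of the $n$-th document
- $A$ is the word embedding matrix
- $B$ is the linear projection from the averaged word embeddings to the output classes
- $f$ is the `softmax` non-linearity function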

You can find more details on the model in the introductory paper: [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).
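
As a rough illustration of the formula, the following pure-Python sketch computes $f(BAx_n)$ and the per-document loss for a toy example. It treats $x_n$ as the average of the one-hot vectors of the words in a document, so $Ax_n$ becomes the mean of the corresponding embedding rows. The matrices and word ids are made-up toy values, `A` is stored with one row per word (the transpose of the layout in the formula), and none of this reflects fastText's actual implementation details such as hashing or hierarchical softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word_ids, A, B):
    """Return class probabilities f(B A x) for one document.

    A holds one embedding row per vocabulary word, B one weight row per class.
    """
    dim = len(A[0])
    # A x: averaging embedding rows equals multiplying by the normalized bag of words.
    hidden = [sum(A[w][d] for w in word_ids) / len(word_ids) for d in range(dim)]
    # B (A x): one score per class.
    scores = [sum(row[d] * hidden[d] for d in range(dim)) for row in B]
    return softmax(scores)


def nll(probs, true_class):
    """Negative log-likelihood of the true class for one document."""
    return -math.log(probs[true_class])


# Toy values: 3 vocabulary words, embedding dimension 2, 2 classes.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
probs = forward([0, 2], A, B)
print(probs, nll(probs, true_class=0))
```

Averaging these per-document losses over all $N$ training documents gives the objective above.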


![The actual model architecture of fastText classification](https://cdn-images-1.medium.com/max/800/1*AgrrRZ9DpUVb3srTWs0gzA.png)


In the following cell we run the `supervised` command which trains a fastText model using the data in `./cooking.train` and saves the model to `./cooking_model1`.

In [0]:
!./fasttext supervised -input ./cooking.train -output ./cooking_model1

Now let's see how the model does on the validation set.

In [0]:
!./fasttext test cooking_model1.bin ./cooking.valid

Looking at the results, they are not exactly stellar. The `P@1` metric represents the [precision](https://en.wikipedia.org/wiki/Precision_and_recall#Precision) of the single top-ranked predicted label, while `R@1` represents the [recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall) achieved with that same single prediction. Both values leave a lot to be desired.
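
To make the two metrics less abstract, here is a small sketch of how precision and recall at `k` can be computed. It assumes the counts are pooled over all documents (correct predictions divided by all predictions made, and by all true labels, respectively). The function name and the toy data are ours.

```python
def precision_recall_at_k(predictions, gold, k):
    """predictions: ranked label lists, best first; gold: sets of true labels."""
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


# Hypothetical predictions for two documents.
predictions = [["baking", "bread", "oven"], ["sauce", "pasta", "eggs"]]
gold = [{"baking", "bread"}, {"pasta"}]

print(precision_recall_at_k(predictions, gold, k=1))
print(precision_recall_at_k(predictions, gold, k=2))  # (0.75, 1.0)
```

With `k=1` a document that has two true tags can never reach full recall, which is one reason `R@1` tends to be low on multi-label data like this.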

Let's see which options fastText allows us to set and whether we can get it to perform better.

In [0]:
!./fasttext supervised

There are a couple of interesting options we'll dive a bit deeper into:

#### Character ngrams (`minn` and `maxn`)

One of the interesting things fastText is capable of doing is incorporating character-level information when preparing word vectors. You can find all the gory details in the [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) paper, but the basic idea is as follows:

Given the word `banana` and $n=3$, fastText would generate the following ngrams:

- `<ba`
- `ban`
- `ana`
- `nan`
- `ana`
- `na>`

where the `<` and `>` represent the beginning and end of the word, respectively. That is quite useful because if we also had the word *ban* as part of the vocabulary, it would be represented as `<ban>` which makes it distinguishable from `ban` we extracted from banana.
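
The ngram extraction described above can be sketched in a few lines of Python. This is only an illustration of the idea: the helper `char_ngrams` is ours, and fastText itself additionally hashes the ngrams into a fixed number of buckets rather than storing them as strings.

```python
def char_ngrams(word, minn, maxn):
    """Return all character ngrams of length minn..maxn of '<word>'."""
    wrapped = "<" + word + ">"  # mark the beginning and end of the word
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


print(char_ngrams("banana", 3, 3))
# ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>']

# Trigrams an unseen word shares with a known one:
shared = set(char_ngrams("banana", 3, 3)) & set(char_ngrams("bananoid", 3, 3))
print(sorted(shared))
# ['<ba', 'ana', 'ban', 'nan']
```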

Note that we are still talking about a bag-of-words model, and thus only the presence of a respective ngram matters. Still, thanks to this nice setup we are able to model prefixes and suffixes pretty much by default. That is of huge practical value: even if we encounter, say, the word `bananoid`, which was not present in the training data, the character ngrams let us assign it at least some representation, rather than calling it an unknown word and replacing its occurrences with `UNK`, which is what the standard approach would be.

In fastText the length of ngrams can be set via the `-minn` and `-maxn` flags, which control the minimum and maximum length of the character ngrams fastText considers. In supervised mode both default to 0, which turns this feature off.

Let's see if our `bananoid` example would actually work by saving the word vectors fastText produces during training and finding out which words are the closest neighbors of `bananoid` in the learned vector space.

In [0]:
!./fasttext supervised -minn 3 -maxn 5 -input ./cooking.train -output ./cooking_model1 -saveOutput 1

In [0]:
!echo "bananoid" | ./fasttext nn ./cooking_model1.bin

#### Word ngrams

Similarly to character ngrams, fastText can also generate ngrams from the words in a document. This is controlled by the `-wordNgrams` flag, which is set to 1 by default, meaning only unigrams (single words) are considered. When we set it to, say, 2, the sentence `smash all potatoes` would be represented as

- `<smash>`
- `<all>`
- `<potatoes>`
- `<smash all>`
- `<all potatoes>`
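
The same enumeration can be sketched in Python. The helper `word_ngrams` is ours and returns plain strings for readability; fastText hashes word ngrams into buckets instead of keeping them as text.

```python
def word_ngrams(tokens, max_n):
    """Return all word ngrams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("smash all potatoes".split(), 2))
# ['smash', 'all', 'potatoes', 'smash all', 'all potatoes']
```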

-----------------

Using these and some of the other available options, let us train a new version of the model and see how it performs.


In [0]:
!./fasttext supervised -minCount 2 -wordNgrams 3 -minn 3 -maxn 8 -lr 0.7 -dim 100 -epoch 25 -input ./cooking.train -output cooking_model2

In [0]:
!./fasttext test cooking_model2.bin ./cooking.valid 1

Looks a bit better, right?

Note that the command above outputs precision/recall for just the top 1 prediction. In many cases, however, we may be more interested in knowing whether the "true" labels can be found among, say, the top 5 predictions, especially since many of the documents have more than one tag assigned.

We can easily compute precision/recall in this way by executing

In [0]:
!./fasttext test cooking_model2.bin ./cooking.valid 5

Looking at just the summary statistics is not really that much fun -- that usually comes from trying the model out on some real-world data. We can easily do that with fastText by running something like the command in the following cell:

In [0]:
!echo "Does it make sense to cook smashed potatoes?" | ./fasttext predict-prob ./cooking_model2.bin -

Alternatively we can also ask for more than just the most probable label:

In [0]:
!echo "Does it make sense to cook smashed potatoes?" | ./fasttext predict-prob ./cooking_model2.bin - 5

Or ask for as many predictions as possible (`-1`) but only taking into account those that have probability higher than `0.02`:

In [0]:
!echo "Does it make sense to cook smashed potatoes?" | ./fasttext predict-prob ./cooking_model2.bin - -1 0.02
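
The interplay between the number of predictions and the probability threshold can be sketched as follows. The function `select_predictions` and the probabilities are hypothetical and only illustrate the selection logic. Whether a label sitting exactly at the threshold is kept is an assumption here.

```python
def select_predictions(probs, k=1, threshold=0.0):
    """Return up to k (label, probability) pairs, best first; k=-1 means no limit."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Assumption: labels with probability equal to the threshold are kept.
    kept = [(label, p) for label, p in ranked if p >= threshold]
    return kept if k == -1 else kept[:k]


# Made-up probabilities, not actual model output.
toy_probs = {"potatoes": 0.6, "mashed": 0.25, "baking": 0.1, "salt": 0.01}

print(select_predictions(toy_probs))                        # top 1 only
print(select_predictions(toy_probs, k=-1, threshold=0.02))  # everything above 0.02
```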

## Your tasks

1. See if you can improve the model further -- try to optimize both Precision and Recall at 3 predictions
2. Try to see if some pre-processing (lowercasing, removing stop words, punctuation, ...) would be helpful here (the `bananoid` example does really suggest so). Note that fastText splits tokens on whitespace it finds in the input data, so it is not uncommon to find out that it learned word vectors for words like `banana?` among others. If you are looking for an industry-grade tokenizer, I strongly recommend [BlingFire](https://github.com/Microsoft/BlingFire). 
3. See if you can use some of the same ideas on a different [Amazon Sentiment Analysis dataset](https://storage.googleapis.com/amazonreviews/train.ft.txt.bz2) and get the testing precision/recall over 0.9!
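
As a possible starting point for the pre-processing task, here is a minimal sketch that lowercases the text and separates common punctuation from words while leaving the `__label__` tokens untouched. The function name `normalize` and the chosen set of punctuation characters are our assumptions. Whether this actually helps is for you to find out.

```python
import re


def normalize(line):
    """Lowercase the text of a fastText-format line and split off punctuation."""
    tokens = []
    for token in line.split():
        if token.startswith("__label__"):
            tokens.append(token)  # keep labels exactly as they are
        else:
            # Put spaces around punctuation so 'banana?' becomes 'banana' and '?'.
            spaced = re.sub(r"([.!?,'/()])", r" \1 ", token.lower())
            tokens.extend(spaced.split())
    return " ".join(tokens)


print(normalize("__label__fruit Is this a Banana?"))
# __label__fruit is this a banana ?
```

The same function would need to be applied to both the training and the validation file, and to any text passed to the model at prediction time.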

**Bonus**: the choice of checking both Precision and Recall at 3 or 5 is rather arbitrary. Analyze the training data and find out what number would really make sense, based on the number of labels the considered documents usually have.

For a more in-depth walkthrough of fastText's internals please refer to [FastText: Under the Hood](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3).