# Predicting Cooking StackExchange tags with fastText

We start with a dataset of roughly 15,000 questions asked on [Cooking StackExchange](https://cooking.stackexchange.com/), stored in the fastText format. Each line holds one or more labels followed by the question text:

`__label__<X> __label__<Y> ... <Text>`


For example:

`__label__chocolate American equivalent for British chocolate terms`
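To make the format concrete, here is a minimal Python sketch that splits one such line into its labels and its text. The helper name `parse_fasttext_line` is hypothetical (it is not part of fastText), and the sketch assumes the default `__label__` prefix and whitespace-separated tokens.

```python
def parse_fasttext_line(line, label_prefix="__label__"):
    """Split one fastText-formatted line into (labels, text).

    Hypothetical helper for illustration; assumes whitespace-separated
    tokens and that every token starting with the prefix is a label.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(label_prefix):
            labels.append(token[len(label_prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = parse_fasttext_line(
    "__label__chocolate American equivalent for British chocolate terms"
)
print(labels)  # ['chocolate']
print(text)    # American equivalent for British chocolate terms
```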


In [0]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

In [0]:
!head cooking.stackexchange.txt

In [0]:
!wc cooking.stackexchange.txt 

We've got roughly 15k samples in our dataset. Let's split it into a training set of roughly 12k samples and a held-out test set of 3k samples (saved as `cooking.valid`).

In [0]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
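The `head`/`tail` split only yields disjoint sets when the file has at least 12404 + 3000 lines. As a sketch of the same idea in Python, the hypothetical helper below takes the first `n_train` lines and the last `n_valid` lines and refuses to proceed if the two slices would overlap. It is shown on a toy list; in the notebook you would pass the lines read from `cooking.stackexchange.txt`.

```python
def split_head_tail(lines, n_train, n_valid):
    """Return (first n_train lines, last n_valid lines).

    Hypothetical helper mirroring `head -n` / `tail -n`; raises if the
    two slices would share lines.
    """
    if n_train + n_valid > len(lines):
        raise ValueError("train and validation slices would overlap")
    # Index from the front to avoid the lines[-0:] pitfall when n_valid == 0.
    return lines[:n_train], lines[len(lines) - n_valid:]


toy_lines = [f"__label__tag{i} question {i}" for i in range(10)]
train, valid = split_head_tail(toy_lines, n_train=7, n_valid=3)
print(len(train), len(valid))  # 7 3
```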

## Installation of fastText

In [0]:
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
!unzip v0.1.0.zip
%cd fastText-0.1.0
!make
!cp fasttext ../
%cd ..

## Training and testing a fastText model

![The actual model architecture of fastText classification](https://cdn-images-1.medium.com/max/800/1*AgrrRZ9DpUVb3srTWs0gzA.png)

In [0]:
!./fasttext supervised -input ./cooking.train -output cooking_model1

In [0]:
!./fasttext test cooking_model1.bin ./cooking.valid

The results are not very impressive.

Let's see which options fastText lets us set, and whether we can get it to perform better.

In [0]:
!./fasttext supervised

In [0]:
!./fasttext supervised -minCount 2 -wordNgrams 3 -minn 3 -maxn 8 -lr 0.7 -dim 100 -epoch 25 -input ./cooking.train -output cooking_model2

In [0]:
!./fasttext test cooking_model2.bin ./cooking.valid 1

Looks a bit better, right?

Note that the command above reports precision and recall for only the top 1 prediction. In many cases, however, we may be more interested in whether the "true" labels appear among, say, the top 5 predictions, especially since many questions have more than one tag assigned.

We can compute precision/recall this way by executing:

In [0]:
!./fasttext test cooking_model2.bin ./cooking.valid 5
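To see how precision and recall change with the number of predictions, the following sketch computes both at a given `k` on toy data. It assumes the usual aggregate definition (correct predictions divided by all predictions made, and by all true labels, summed over examples); the function name and the toy labels are made up for illustration and are not fastText output.

```python
def precision_recall_at_k(gold, predicted, k):
    """Aggregate precision and recall over examples, using the top-k predictions.

    gold:      list of sets of true labels, one set per example
    predicted: list of ranked label lists (best first), one per example
    """
    n_correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += len(set(top_k) & set(true_labels))
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data (not taken from the cooking dataset).
gold = [{"baking", "bread"}, {"chocolate"}]
predicted = [["baking", "oven", "bread"], ["sauce", "chocolate", "baking"]]

print(precision_recall_at_k(gold, predicted, 1))  # (0.5, 0.3333333333333333)
print(precision_recall_at_k(gold, predicted, 3))  # (0.5, 1.0)
```

With more predictions per example, recall can only stay the same or rise, while precision is capped by the number of true labels relative to `k`.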

Your tasks:

1. See if you can improve the model further -- try to optimize both precision and recall at 3 predictions
2. Check whether some pre-processing (lowercasing, removing stop words, punctuation, ...) helps here
3. See if you can apply the same ideas to a different dataset, the [Amazon Sentiment Analysis dataset](https://storage.googleapis.com/amazonreviews/train.ft.txt.bz2), and get the test precision/recall over 0.9!
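As one possible starting point for the pre-processing task, here is a small Python sketch that lowercases the text, replaces punctuation with spaces, and drops stop words while leaving the `__label__` tokens untouched. The function name and the stop-word list are assumptions chosen for illustration; whether any of these steps actually helps is something to check against the test set.

```python
import string

LABEL_PREFIX = "__label__"
# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation})


def preprocess_line(line, stop_words=frozenset()):
    """Normalize the text part of a fastText line; labels are kept as-is.

    Hypothetical helper: lowercases, strips ASCII punctuation, and removes
    any word found in `stop_words`.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token)
        else:
            words.append(token)
    text = " ".join(words).lower().translate(_PUNCT_TABLE)
    kept = [w for w in text.split() if w not in stop_words]
    return " ".join(labels + kept)


# Illustrative stop-word list, not a recommended one.
example_stop_words = {"how", "do", "i"}
print(preprocess_line("__label__baking How do I bake bread?", example_stop_words))
# __label__baking bake bread
```

The same function could be applied line by line to `cooking.train` and `cooking.valid` before retraining, so that both files are normalized identically.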