# <center>Text Classification with fastText</center>

This project introduces text classification with the fastText library. We first obtain the dataset, then split it into training and validation sets, train and tune a classifier, and finally predict labels for unseen input.

In [1]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os
os.chdir('/content/gdrive/My Drive/AppDataSci')

## <center>The Cooking StackExchange dataset</center>

We'll use a dataset of about 15,000 questions asked on [Cooking StackExchange](https://cooking.stackexchange.com/), each with one or more tags assigned. The dataset is already in the fastText format: a text file where each line contains one text document to be classified. Each line starts with one or more `__label__` tags, which denote the "ground truth" labels for that document.
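
To make the format concrete, here is a minimal Python sketch that splits one such line into its labels and its text. The helper name `parse_fasttext_line` is hypothetical and not part of fastText; it only assumes the `__label__` prefix described above.

```python
LABEL_PREFIX = "__label__"  # label prefix used in the dataset

def parse_fasttext_line(line):
    """Split a fastText-format line into (labels, text).

    Hypothetical helper for illustration; not part of the fastText library.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)

labels, text = parse_fasttext_line(
    "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
)
print(labels)
print(text)
```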

In the next few cells we'll download the dataset, look at its first few lines with the [`head`](https://linux.101hacks.com/unix/head/) command, and collect some basic statistics with the [`wc`](https://www.tecmint.com/wc-command-examples/) command.

In [3]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

--2021-11-01 07:19:28--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2021-11-01 07:19:28 (2.06 MB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt


In [4]:
!head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces


In [5]:
!wc cooking.stackexchange.txt 


  15404  169582 1401900 cooking.stackexchange.txt


We've got roughly 15k samples in our dataset. Let's split it into a training set of the first 12,404 samples and a validation set of the last 3,000 samples.

In [6]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
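
The same split can be done in Python, which also makes it easy to check that the two parts do not overlap. This is a sketch only: `split_head_tail` is a hypothetical helper, and the commented usage assumes the dataset file sits in the working directory.

```python
def split_head_tail(lines, n_train, n_valid):
    """Return (first n_train lines, last n_valid lines), like `head -n` and `tail -n`.

    Raises ValueError if the two slices would overlap.
    """
    if n_train + n_valid > len(lines):
        raise ValueError("train and validation slices would overlap")
    return lines[:n_train], lines[len(lines) - n_valid:]

# Toy demonstration with ten numbered "documents".
toy = [f"doc{i}\n" for i in range(10)]
train, valid = split_head_tail(toy, n_train=7, n_valid=3)
print(len(train), len(valid))

# Assumed usage on the real file:
# with open("cooking.stackexchange.txt", encoding="utf-8") as f:
#     train, valid = split_head_tail(f.readlines(), 12404, 3000)
```

Because 12,404 + 3,000 equals the 15,404 lines reported by `wc`, every sample lands in exactly one of the two sets.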

### <center>Installation of fastText</center>

Installing fastText is relatively easy on any Unix-like system: running the following cell should be enough to build the `fasttext` binary, which is all we need in this tutorial.

Note that fastText also has [Python bindings](https://pypi.org/project/fasttext/) which allow you to use it directly from Python code.
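
As a rough sketch of what that looks like, the function below trains and evaluates a classifier through the Python package. It assumes the package has been installed (for example with `pip install fasttext`) and that the training and validation files exist. The wrapper `train_and_evaluate` is a hypothetical name, while `train_supervised`, `test`, and `predict` are calls provided by the bindings.

```python
def train_and_evaluate(train_path, valid_path, **hyperparams):
    """Train a supervised fastText model and score it on a validation file.

    Hypothetical wrapper; requires the third-party `fasttext` package.
    """
    import fasttext  # imported lazily so the function can be defined without it

    model = fasttext.train_supervised(input=train_path, **hyperparams)
    # test() returns (number of samples, precision at 1, recall at 1)
    n_samples, precision, recall = model.test(valid_path)
    return model, n_samples, precision, recall

# Assumed usage, mirroring the command-line workflow of this notebook:
# model, n, p, r = train_and_evaluate("cooking.train", "cooking.valid", epoch=25)
# print(model.predict("Which baking dish is best to bake a banana bread ?"))
```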

In [7]:
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!make
!cp fasttext ../
%cd ..

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.22 MiB | 8.90 MiB/s, done.
Resolving deltas: 100% (2417/2417), done.
Checking out files: 100% (526/526), done.
/content/gdrive/My Drive/AppDataSci/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG 

In the following cell we run the `supervised` command, which trains a fastText model on the data in `./cooking.train` and saves it under the prefix `./cooking_model1`.

The `-input` command-line option names the file containing the training examples, while the `-output` option gives the path prefix for the saved model. At the end of training, the file `cooking_model1.bin`, containing the trained classifier, is created in the current directory.

In [8]:
!./fastText/fasttext supervised -input ./cooking.train -output ./cooking_model1

Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:    8053 lr:  0.000000 avg.loss: 10.184074 ETA:   0h 0m 0s


In [9]:
!echo "Which baking dish is best to bake a banana bread ?" | ./fastText/fasttext predict cooking_model1.bin -

__label__baking


Now let's see how the model does on the validation set.

In [10]:
!./fastText/fasttext test cooking_model1.bin ./cooking.valid

N	3000
P@1	0.14
R@1	0.0607
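
The output reports the number of validation samples (`N`), the precision at one (`P@1`), and the recall at one (`R@1`). As commonly defined, precision at one is the fraction of top predictions that are among a sample's true labels, and recall at one is the number of correct top predictions divided by the total number of true labels. The sketch below computes both on toy data; the function name and the data are made up for illustration, and this is not fastText's own implementation.

```python
def precision_recall_at_k(predictions, gold_labels, k=1):
    """Compute (P@k, R@k) over a set of samples.

    predictions: list of ranked label lists, best first
    gold_labels: list of sets of true labels, one per sample
    """
    correct = 0      # predicted labels that are actually true
    n_predicted = 0  # total labels predicted (at most k per sample)
    n_gold = 0       # total true labels
    for ranked, gold in zip(predictions, gold_labels):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_gold += len(gold)
    return correct / n_predicted, correct / n_gold

# Toy data (invented for illustration)
gold = [{"baking", "bread"}, {"sauce"}, {"knife-skills", "dicing"}]
pred = [["baking"], ["cheese"], ["dicing"]]
p_at_1, r_at_1 = precision_recall_at_k(pred, gold, k=1)
print(p_at_1, r_at_1)
```

Since many questions carry several tags, a model that predicts a single label per question cannot recover all of them, which is why `R@1` sits well below `P@1`.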


Looking at the data, we observe that some words contain uppercase letters or punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing: put spaces around punctuation and lowercase everything.

In [11]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
!head -n 12404 cooking.preprocessed.txt > cooking.train
!tail -n 3000 cooking.preprocessed.txt > cooking.valid
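
For readers less familiar with `sed` and `tr`, here is an approximately equivalent Python sketch: it pads the punctuation characters listed in the `sed` pattern with spaces and lowercases the line. The function name `preprocess` is hypothetical, and edge cases (such as how the shell and `sed` treat the backslash in the pattern) may differ.

```python
import re

# Punctuation characters taken from the sed pattern
PUNCTUATION = re.compile(r"([.!?,'/()])")

def preprocess(line):
    """Surround punctuation with spaces, then lowercase (sketch of the sed | tr pipeline)."""
    return PUNCTUATION.sub(r" \1 ", line).lower()

print(preprocess("__label__bread What's the purpose of a bread box?"))
```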

Let's train a new model on the pre-processed data:

In [12]:
!./fastText/fasttext supervised -input cooking.train -output cooking_model2

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    9001 lr:  0.000000 avg.loss: 10.103974 ETA:   0h 0m 0s


In [13]:
!./fastText/fasttext test cooking_model2.bin cooking.valid

N	3000
P@1	0.171
R@1	0.074


### <center>More epochs and larger learning rate</center>

By default, fastText sees each training example only five times during training, which is few given that our training set has only about 12k examples. The number of times each example is seen (also known as the number of epochs) can be increased with the `-epoch` option:

In [14]:
!./fastText/fasttext supervised -input cooking.train -output cooking_model3 -epoch 25

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    9095 lr:  0.000000 avg.loss:  7.240567 ETA:   0h 0m 0s


Let's test the new model:

In [15]:
!./fastText/fasttext test cooking_model3.bin cooking.valid

N	3000
P@1	0.513
R@1	0.222


The result is much better: P@1 rises from 0.171 to 0.513 and R@1 from 0.074 to 0.222.

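### <center>Word n-grams</center>

Finally, we raise the learning rate to 1.0 with `-lr` and add word bigrams as features with `-wordNgrams 2`, so that the model captures some local word order: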

In [None]:
!./fastText/fasttext supervised -input cooking.train -output cooking_model4 -lr 1.0 -epoch 25 -wordNgrams 2

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    9509 lr:  0.000000 avg.loss:  3.210398 ETA:   0h 0m 0s
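
To illustrate what `-wordNgrams 2` adds, the sketch below lists the word bigrams of a tokenized sentence. It is only an illustration with a hypothetical helper name; it does not reproduce how fastText represents n-gram features internally.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "fan bake vs bake".split()
print(word_ngrams(tokens, 2))
```

With single words alone, the order of the tokens is lost; bigrams such as `('fan', 'bake')` retain some of it.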


In [None]:
!./fastText/fasttext test cooking_model4.bin cooking.valid

N	3000
P@1	0.611
R@1	0.264
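
To compare the four runs side by side, the validation scores printed in this notebook can be collected in a small table. The numbers below are copied from the outputs above; the code itself is just a convenience sketch.

```python
# (model, setting, P@1, R@1) as reported by `fasttext test` in this notebook
results = [
    ("cooking_model1", "raw text, defaults", 0.14, 0.0607),
    ("cooking_model2", "pre-processed, defaults", 0.171, 0.074),
    ("cooking_model3", "-epoch 25", 0.513, 0.222),
    ("cooking_model4", "-lr 1.0 -epoch 25 -wordNgrams 2", 0.611, 0.264),
]

print(f"{'model':<16}{'setting':<34}{'P@1':>7}{'R@1':>8}")
for name, setting, p1, r1 in results:
    print(f"{name:<16}{setting:<34}{p1:>7.3f}{r1:>8.3f}")

# Pick the run with the highest precision at one
best = max(results, key=lambda row: row[2])
print("best P@1:", best[0])
```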
