## fastText Tutorial notebook

This notebook is a walkthrough of fastText, Facebook's open-source library for text classification and word representation learning. The library can be applied to a wide range of text tasks. Here we show how to use it for supervised text classification and how to tune it to be faster and more accurate.

The tools used are this notebook and a command-line terminal. Linux is recommended for the terminal; I used zsh alongside this notebook. Some shell commands might not work inside the notebook, so I ran those in zsh; any terminal you prefer will do.

!wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip

In [6]:
!pip install fasttext

Defaulting to user installation because normal site-packages is not writeable
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 308 kB/s
Collecting pybind11>=2.2
  Using cached pybind11-2.8.0-py2.py3-none-any.whl (207 kB)
Using legacy 'setup.py install' for fasttext, since package 'wheel' is not installed.
Installing collected packages: pybind11, fasttext
    Running setup.py install for fasttext ... done
Successfully installed fasttext-0.9.2 pybind11-2.8.0


In [14]:
import pandas as pd
import numpy as np

In [8]:
import fasttext
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

`train_supervised()` is the function used most in this notebook: it trains a text classifier and returns a model object, on which we then call `test` and `predict`.

## Getting the data and preparing it

In [10]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

--2021-10-06 15:53:24--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
SSL_INIT
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2021-10-06 15:53:27 (260 KB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt
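
The cells below read `cooking.train` and `cooking.valid`, but the archive only contains `cooking.stackexchange.txt`. Here is a minimal sketch of one way to create the two files, assuming the last 3000 lines are held out for validation (this matches the 3000 examples reported by `model.test` later on). The helper name `split_tail` is hypothetical and not part of fastText:

```python
import os


def split_tail(source, train_path, valid_path, n_valid):
    """Write the last n_valid lines of source to valid_path, the rest to train_path."""
    with open(source, encoding="utf-8") as f:
        lines = f.readlines()
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    cut = len(lines) - n_valid
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:cut])
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(lines[cut:])
    return cut, n_valid  # number of training and validation lines


# Only runs if the extracted data file is present in the working directory.
if os.path.exists("cooking.stackexchange.txt"):
    print(split_tail("cooking.stackexchange.txt", "cooking.train", "cooking.valid", 3000))
```

Holding out the tail of the file is the simplest choice; shuffling before splitting would be an alternative if the file is ordered in some meaningful way.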


In [15]:
model = fasttext.train_supervised(input="cooking.train")

Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:   75204 lr:  0.000000 avg.loss: 10.067896 ETA:   0h 0m 0s


In [16]:
model.save_model("model_cooking.bin")

In [17]:
model.predict("Which baking dish is best to bake a taco?")

(('__label__baking',), array([0.08457895]))

In [18]:
model.predict("What do we call the process of submerging veggies or fruits quickly in boiling water?")

(('__label__baking',), array([0.02988265]))

In [20]:
model.test("cooking.valid")

(3000, 0.11766666666666667, 0.05088655038200952)

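The output is the number of examples, the precision and the recall. With a single predicted label per example, precision at one (P@1) is about 0.118 and recall at one (R@1) is about 0.051 on the 3000 validation examples.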

In [21]:
model.test("cooking.valid", k=5)

(3000, 0.0672, 0.14530776992936428)
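
The tuples returned by `model.test` are easier to compare when labelled. The sketch below formats the two results shown above; `format_test_result` is a hypothetical helper, not part of fastText:

```python
def format_test_result(result, k=1):
    """Format the (n_examples, precision, recall) tuple returned by model.test."""
    n_examples, precision, recall = result
    return f"N={n_examples}  P@{k}={precision:.3f}  R@{k}={recall:.3f}"


print(format_test_result((3000, 0.11766666666666667, 0.05088655038200952), k=1))
print(format_test_result((3000, 0.0672, 0.14530776992936428), k=5))
```

Asking for five labels instead of one lowers precision (more guesses are wrong) and raises recall (more of the true labels are covered).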

### Precision and Recall

Precision is the fraction of predicted labels that are correct. Recall is the fraction of the true labels that were successfully predicted. Let us take an example:
<br>
<em>Why not put knives in the dishwasher</em>
<br>
On Stack Exchange this question is labelled with three tags: <code>equipment, cleaning and knives</code>. The top five labels predicted by the model are:

In [22]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__baking',
  '__label__food-safety',
  '__label__bread',
  '__label__substitutions',
  '__label__equipment'),
 array([0.08093037, 0.06498709, 0.03857679, 0.03446391, 0.03077549]))

The labels we get are <code>baking, food-safety, bread, substitutions and equipment</code>. Only one of the five predicted labels, <code>equipment</code>, is correct, so the precision is 1/5 = 0.20. Out of the three true labels, only one was predicted, so the recall is 1/3 ≈ 0.33.
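
The calculation for this example can be checked with a few lines of Python. This is a minimal sketch; `precision_recall` is a hypothetical helper, not a fastText function:

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for one example from predicted and true labels."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


predicted = ["baking", "food-safety", "bread", "substitutions", "equipment"]
actual = ["equipment", "cleaning", "knives"]
p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

For this question it gives a precision of 0.20 and a recall of 0.33, since `equipment` is the only label in both lists.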

## So how do we make the model better?

### Preprocessing data

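A first step is to lowercase the text and normalize punctuation marks. A crude normalization can be obtained with command-line tools such as <code>sed</code> and <code>tr</code>. The model below is trained on the normalized data, which is why the vocabulary shrinks from 14543 to 8952 words.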
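
For readers who prefer to stay in Python, here is a rough sketch of a comparable normalization. It is an assumption about what the shell step does, not a copy of it: it pads a small, hand-picked set of punctuation marks with spaces (rather than deleting them) and lowercases everything. The names `normalize` and `normalize_file` are hypothetical:

```python
import re

# Assumed punctuation set; adjust it to the data at hand.
_PUNCT = re.compile(r"([.!?,'/()])")


def normalize(line):
    """Surround common punctuation marks with spaces and lowercase the text."""
    return _PUNCT.sub(r" \1 ", line).lower()


def normalize_file(source, target):
    """Apply normalize() to every line of source and write the result to target."""
    with open(source, encoding="utf-8") as src, open(target, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize(line))


print(normalize("Why not put knives in the dishwasher?"))
```

Padding punctuation with spaces turns `dishwasher?` and `dishwasher` into the same token, which is where most of the vocabulary reduction comes from.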

In [23]:
model_preprocssed = fasttext.train_supervised(input="cooking.train")

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   81834 lr:  0.000000 avg.loss:  9.896623 ETA:   0h 0m 0s


In [24]:
model.test("cooking.valid")

(3000, 0.11933333333333333, 0.05160732305030993)

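Precision at one rises slightly, from 0.118 to 0.119. Note that the cell above calls `model.test`, so it scores the classifier trained before preprocessing; calling `model_preprocssed.test("cooking.valid")` would score the retrained one. Next, let us try increasing the number of epochs and the learning rate (`lr`).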

In [30]:
model_hyperparameter_tuned_epochs = fasttext.train_supervised(input="cooking.train", epoch=50)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   79863 lr:  0.000000 avg.loss:  5.073856 ETA:   0h 0m 0s


In [27]:
model_hyperparameter_tuned_epochs.test("cooking.valid")

(3000, 0.5656666666666667, 0.2446302436211619)

Quite strong! Precision at one jumps from about 0.12 to about 0.57 just by training for 50 epochs.

In [28]:
model_hyperparameter_tuned_lr = fasttext.train_supervised(input="cooking.train", lr=1.0)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   81884 lr:  0.000000 avg.loss:  6.625004 ETA:   0h 0m 0s


In [29]:
model_hyperparameter_tuned_lr.test("cooking.valid")

(3000, 0.562, 0.24304454375090095)

In [32]:
model_hyperparameter_tuned = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=50)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   72156 lr:  0.000000 avg.loss:  4.688759 ETA:   0h 0m 0s


In [33]:
model_hyperparameter_tuned.test("cooking.valid")

(3000, 0.5893333333333334, 0.2548652155110278)

### Using word n-grams

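Using word bigrams in addition to unigrams can improve a model. This matters most for classification problems where word order is important, such as sentiment analysis. A unigram is a single word; a bigram is a pair of consecutive words.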
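
To make the idea concrete, the sketch below lists the unigrams and bigrams of a short sentence. It only illustrates what word n-grams are, not how fastText stores them internally; `word_ngrams` is a hypothetical helper:

```python
def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Last donut of the night"
print(word_ngrams(sentence, 1))  # unigrams: one tuple per word
print(word_ngrams(sentence, 2))  # bigrams: pairs of consecutive words
```

The bigrams keep local word order (`("Last", "donut")` differs from `("donut", "Last")`), which is information a bag of unigrams throws away.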

In [35]:
model_bigrams = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:   78878 lr:  0.000000 avg.loss:  3.073013 ETA:   0h 0m 0s


In [36]:
model_bigrams.test("cooking.valid")

(3000, 0.599, 0.2590456969871702)

Precision at one has now reached 59.9%, thanks to the following steps:
<br>
<ul>
    <li> preprocessing the data </li>
    <li> changing the number of epochs (typical range 5 to 50) </li>
    <li> changing the learning rate (typical range 0.1 to 1.0) </li>
    <li> using word n-grams (typical range 1 to 5) </li>
</ul>

## Hierarchical softmax

A potential solution for faster training is to use hierarchical softmax instead of the regular softmax. This is enabled with the option <code>-loss hs</code> on the command line, or <code>loss='hs'</code> in Python:

In [37]:
model_hsoftmax = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs') 

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread: 1475070 lr:  0.000000 avg.loss:  2.269904 ETA:   0h 0m 0s


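This has been so fast! Throughput went from about 79,000 to about 1,475,000 words/sec/thread. Note that this cell also sets `dim=50` and `bucket=200000`, so the loss function is not the only change.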

In [38]:
model_hsoftmax.test("cooking.valid")

(3000, 0.585, 0.25299120657344676)

Hierarchical softmax is a loss function that approximates the softmax with a much faster computation. It uses a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (a sigmoid) that is trained to predict whether we should go left or right. The probability of an output is the product of the probabilities of the intermediate nodes along the path from the root to the output leaf. fastText uses a Huffman tree, so lookup is faster for more frequent outputs and the average lookup time is optimal.
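
The description above can be sketched in plain Python. This is an illustration under stated assumptions, not fastText's implementation: the label counts and node scores are made up, and the helper names are hypothetical. It builds a Huffman tree over label counts, then computes each label's probability as the product of left/right sigmoid decisions along its root-to-leaf path:

```python
import heapq
import itertools
import math


def huffman_paths(counts):
    """Return {label: [(node_id, go_right), ...]} listing decisions from root to leaf."""
    tie = itertools.count()  # unique tie-breaker so the heap never compares dicts
    heap = [(c, next(tie), {label: []}) for label, c in counts.items()]
    heapq.heapify(heap)
    node_ids = itertools.count()
    while len(heap) > 1:
        c_left, _, left = heapq.heappop(heap)    # two least frequent subtrees
        c_right, _, right = heapq.heappop(heap)
        nid = next(node_ids)
        merged = {lab: [(nid, False)] + path for lab, path in left.items()}
        merged.update({lab: [(nid, True)] + path for lab, path in right.items()})
        heapq.heappush(heap, (c_left + c_right, next(tie), merged))
    return heap[0][2]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaf_probability(path, node_scores):
    """Multiply the probabilities of the decisions taken along the path."""
    prob = 1.0
    for nid, go_right in path:
        p_right = sigmoid(node_scores[nid])
        prob *= p_right if go_right else 1.0 - p_right
    return prob


counts = {"baking": 50, "equipment": 20, "bread": 20, "knives": 10}  # made-up counts
paths = huffman_paths(counts)
node_scores = {0: 0.3, 1: -1.2, 2: 0.8}  # made-up scores, one per internal node
probs = {lab: leaf_probability(path, node_scores) for lab, path in paths.items()}

for lab in counts:
    print(lab, "path length:", len(paths[lab]), "probability:", round(probs[lab], 4))
print("sum of probabilities:", round(sum(probs.values()), 6))
```

The most frequent label ends up closest to the root, so it needs the fewest sigmoid evaluations, and the leaf probabilities still sum to 1 because every internal node splits its probability mass between its two children.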

### Multi-label classification

When we want to assign several labels to a document, we can still use the softmax loss and tune the prediction parameters, namely the number of labels to predict and the probability threshold. Tuning these arguments can be unintuitive because the softmax probabilities must sum to 1. A better way to handle multiple labels is to train an independent binary classifier for each label, using <code>-loss one-vs-all</code> or <code>-loss ova</code> (<code>loss='ova'</code> in Python).
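
The difference can be seen on a toy example. The sketch below uses made-up scores for four labels and compares a softmax over all labels with independent sigmoids, one per label; it is an illustration only and does not call fastText:

```python
import math


def softmax(scores):
    """Numerically stable softmax: the outputs are positive and sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


scores = [4.0, 3.5, 3.0, -2.0]  # made-up scores; three labels are clearly relevant
soft = softmax(scores)
ova = [sigmoid(s) for s in scores]

above_soft = [p for p in soft if p >= 0.5]
above_ova = [p for p in ova if p >= 0.5]
print("softmax:", [round(p, 3) for p in soft], "-> labels above 0.5:", len(above_soft))
print("one-vs-all:", [round(p, 3) for p in ova], "-> labels above 0.5:", len(above_ova))
```

With softmax the three relevant labels compete for the same probability mass, so a fixed threshold such as 0.5 can keep at most one of them. With independent sigmoids each label is scored on its own and all three pass the threshold.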

In [39]:
model_multi = fasttext.train_supervised(input="cooking.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:  128578 lr:  0.000000 avg.loss:  4.948205 ETA:   0h 0m 0s


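For the predictions, we ask for as many labels as possible (`k=-1`) and keep only those with a probability of at least 0.5: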

In [41]:
model_multi.predict("Which baking dish is the best to bake a banana bread ?", k=-1, threshold = 0.5)

(('__label__baking',
  '__label__equipment',
  '__label__bread',
  '__label__bananas'),
 array([1.00001001, 0.9994216 , 0.99288857, 0.97772384]))

In [42]:
model_multi.test("cooking.valid", k=-1)

(3000, 0.003146031746031746, 1.0)

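A recall of 1.0 looks impressive, but it is not a sign of a good model here. With `k=-1` and no threshold, every one of the 735 labels is returned for every example, so all true labels are covered while precision drops to about 0.003. Passing a threshold to `test` as well, as was done for `predict` above, should give a more meaningful evaluation.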