## fastText Tutorial notebook

This notebook is an introduction to fastText, Facebook's open-source machine learning library for text classification and word representation. The library can be applied to a wide range of text tasks. In this notebook we show how it can be used for supervised learning and how it can be tuned to be faster and more accurate.

The tools used are this notebook and a command-line terminal. Linux is recommended for the terminal; I used zsh alongside this notebook. Some shell commands do not work well inside the notebook, so I ran those in zsh. Any terminal you prefer will do.

!wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_files

In [2]:
import wget
wget.download('https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz')

100% [............................................................................] 457609 / 457609

'cooking.stackexchange.tar (19).gz'

In [3]:
import fasttext
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

train_supervised() is the function used most in this notebook. It trains a text classifier and returns a model object, on which we then call test and predict.

## Getting the data and preparing it

In [4]:
model = fasttext.train_supervised(input="cooking.txt")

In [5]:
model.save_model("model_cooking.bin")

In [6]:
model.predict("Which baking dish is best to bake a taco?")

(('__label__baking',), array([0.05322305]))

In [7]:
model.predict("What do we call the process of submerging veggies or fruits quickly in boiling water?")

(('__label__baking',), array([0.02938409]))

In [8]:
model.test("cooking1.txt")

(7802, 0.09241220199948731, 0.040171606864274574)

The output is the number of examples, the precision at 1 and the recall at 1. Here the precision is about 0.092 and the recall is about 0.040.

In [9]:
model.test("cooking1.txt", k=5)

(7802, 0.06549602665983081, 0.14235569422776911)

### Precision and Recall

Precision is the fraction of the predicted labels that are correct. Recall is the fraction of the real labels that were predicted. Take this example:
<br>
<em>Why not put knives in the dishwasher?</em>
<br>
On Stack Exchange this question is labelled with three tags: <code>equipment, cleaning and knives</code>. The model's top five predictions for it are:

In [10]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__baking',
  '__label__food-safety',
  '__label__equipment',
  '__label__substitutions',
  '__label__bread'),
 array([0.07447263, 0.06482264, 0.0424542 , 0.0371525 , 0.03523888]))

The labels we got are <code>baking, food-safety, equipment, substitutions and bread</code>. Only one of the five, <code>equipment</code>, is among the real tags, so the precision is 1/5 = 0.20. Of the three real labels, only one was predicted, so the recall is 1/3 ≈ 0.33.
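The same count can be checked by hand. The sketch below is only an illustration of the definitions above, applied to the prediction output for the dishwasher question; the helper name `precision_recall` is made up for this example and is not part of fastText.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for one example, given predicted and real labels."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)  # labels that are both predicted and real
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# Top-5 predictions from the cell above and the real Stack Exchange tags.
predicted = ["baking", "food-safety", "equipment", "substitutions", "bread"]
actual = ["equipment", "cleaning", "knives"]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The values that `model.test` reports are the same quantities computed over the whole validation file rather than one question.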

## So how do we make the model better?

### Preprocessing data

We could first lowercase the text and clean up the punctuation marks. A crude normalization can be obtained with command-line tools such as <code>sed</code> and <code>tr</code>.
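For readers who prefer to stay inside the notebook, here is a rough Python equivalent of that kind of normalization. It is a sketch under assumptions: the punctuation set and the function name `normalize` are my own choices, and it pads punctuation with spaces so that each mark becomes its own token (removing the marks instead would be a one-line change).

```python
import re

# Assumed punctuation set; adjust it to the data at hand.
_PUNCT = re.compile(r"([.!?,'/()])")

def normalize(line):
    """Lowercase a line and put spaces around common punctuation marks."""
    line = _PUNCT.sub(r" \1 ", line)
    line = line.lower()
    return " ".join(line.split())  # collapse runs of whitespace

sample = "__label__equipment Why not put knives in the Dishwasher?"
print(normalize(sample))
```

Applying this to every line of the training and validation files, and training on the result, is what lets the effect of preprocessing be measured. Note that lowercasing also applies to the label prefix of each line, which is harmless as long as the labels are already lowercase.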

In [11]:
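# Note: input is still the raw cooking.txt; point it at a normalized file to measure the effect of preprocessing.
model_preprocessed = fasttext.train_supervised(input="cooking.txt")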

In [12]:
model.test("cooking1.txt")

(7802, 0.09241220199948731, 0.040171606864274574)

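The scores are unchanged from the first run, because this cell evaluates the original <code>model</code> and the training file has not been normalized. Now let us try increasing the number of epochs and the learning rate.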

In [13]:
model_hyperparameter_tuned_epochs = fasttext.train_supervised(input="cooking.txt", epoch=50)

In [14]:
model_hyperparameter_tuned_epochs.test("cooking1.txt")

(7802, 0.43424762881312484, 0.18876755070202808)

Quite a jump: with 50 epochs the precision at 1 goes from about 0.09 to about 0.43.

In [15]:
model_hyperparameter_tuned_lr = fasttext.train_supervised(input="cooking.txt", lr=1.0)

In [16]:
model_hyperparameter_tuned_lr.test("cooking1.txt")

(7802, 0.43027428864393746, 0.1870403387564074)

In [17]:
model_hyperparameter_tuned = fasttext.train_supervised(input="cooking.txt", lr=1.0, epoch=50)

In [18]:
model_hyperparameter_tuned.test("cooking1.txt")

(7802, 0.5314022045629326, 0.2310006685981725)

### Using word n-grams

Using word bigrams in addition to unigrams can improve a model. This matters most for tasks where word order is important, such as sentiment analysis.
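To make the idea concrete, the snippet below lists the word bigrams of a short sentence. It only illustrates what a bigram is; it is not how fastText builds or stores its features internally, and the function name `word_ngrams` is made up for this example.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in text, split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("this dish is not good", 2)
print(bigrams)
```

A unigram model sees "not" and "good" as separate features, while the bigram "not good" keeps the negation attached to the word it modifies.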

In [19]:
model_bigrams = fasttext.train_supervised(input="cooking.txt", lr=1.0, epoch=25, wordNgrams=2)

In [20]:
model_bigrams.test("cooking1.txt")

(7802, 0.48320943347859524, 0.21005125919322487)

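With bigrams and 25 epochs the precision at 1 is about 48.3%. The best run above (lr=1.0, epoch=50) reached about 53.1%, up from about 9.2% at the start. The main ways to improve the model are:
<br>
<ul>
    <li> preprocessing the data </li>
    <li> changing the number of epochs, in the range 5 to 50 </li>
    <li> changing the learning rate, in the range 0.1 to 1.0 </li>
    <li> using word n-grams, in the range 1 to 5 </li>
</ul>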

## Hierarchical softmax

A potential solution for faster training is to use hierarchical softmax instead of the regular softmax. On the command line this is the option <code>-loss hs</code>; in Python it is the argument <code>loss='hs'</code>:

In [21]:
model_hsoftmax = fasttext.train_supervised(input="cooking.txt", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs') 

Training was much faster this time!

In [22]:
model_hsoftmax.test("cooking1.txt")

(7802, 0.4791079210458857, 0.20826833073322934)

The hierarchical softmax is a loss function that approximates the softmax with a much faster computation. It does this by using a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (for example a sigmoid) that is trained and predicts whether we should go left or right. The probability of an output is the product of the probabilities of the intermediate nodes along the path from the root to the output leaf. fastText uses a Huffman tree, so the lookup is faster for more frequent outputs and the average lookup time is optimal.
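The path-product idea can be worked through on a toy tree. The sketch below is an illustration only: the tree is balanced rather than a Huffman tree, the node scores are made-up numbers, and none of the names come from fastText.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy binary tree over four labels. sigmoid(score) is the probability of
# going left at a node, 1 - sigmoid(score) of going right. Scores are made up.
node_scores = {"root": 0.8, "left": -0.3, "right": 1.2}

# Path from the root to each leaf, as (node, direction) pairs.
paths = {
    "baking": [("root", "L"), ("left", "L")],
    "bread": [("root", "L"), ("left", "R")],
    "equipment": [("root", "R"), ("right", "L")],
    "knives": [("root", "R"), ("right", "R")],
}

def leaf_probability(label):
    """Multiply the branch probabilities along the path to the label's leaf."""
    prob = 1.0
    for node, direction in paths[label]:
        p_left = sigmoid(node_scores[node])
        prob *= p_left if direction == "L" else 1.0 - p_left
    return prob

probs = {label: leaf_probability(label) for label in paths}
total = sum(probs.values())
print(probs, total)
```

The leaf probabilities sum to 1 without any explicit normalization, and each label needs only two sigmoid evaluations here. With a Huffman tree, frequent labels would sit closer to the root and need even fewer steps.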

### Multi-label classification

When we would like to assign a document to multiple labels, we can still use the softmax loss and play with the prediction parameters, namely the number of labels to predict and the probability threshold. Playing with these arguments can be unintuitive because the probabilities need to sum to 1. A better way to handle multiple labels is to use independent binary classifiers for each label, with <code>-loss one-vs-all</code> or <code>-loss ova</code>.
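A small numeric comparison shows why the sum-to-1 constraint gets in the way. This is a sketch with made-up raw scores and labels, not output from the trained model: softmax forces the labels to share one unit of probability, while independent sigmoids score each label on its own.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

labels = ["baking", "bread", "bananas", "knives"]
scores = [3.0, 2.8, 2.5, -4.0]  # made-up raw scores: three labels fit well

softmax_probs = softmax(scores)
ova_probs = [sigmoid(s) for s in scores]

threshold = 0.5
softmax_hits = [l for l, p in zip(labels, softmax_probs) if p >= threshold]
ova_hits = [l for l, p in zip(labels, ova_probs) if p >= threshold]
print(softmax_hits, ova_hits)
```

With softmax, three equally plausible labels each end up below 0.5, so a threshold of 0.5 returns nothing. With one-vs-all, each of the three clears the threshold independently.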

In [23]:
model_multi = fasttext.train_supervised(input="cooking.txt", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

The predictions, asking for as many labels as possible (<code>k=-1</code>) with a probability of at least 0.5:

In [24]:
model_multi.predict("Which baking dish is the best to bake a banana bread ?", k=-1, threshold = 0.5)

(('__label__baking', '__label__bread', '__label__bananas'),
 array([1.00001001, 1.00001001, 0.92841882]))

In [25]:
model_multi.test("cooking1.txt", k=-1)

(7802, 0.003240050402388698, 1.0)

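A recall of 1.0 looks impressive, but it comes with a precision of about 0.003. With <code>k=-1</code> and no threshold, the model effectively returns every label for every example, so all the real labels are covered. Passing a threshold, as in the prediction above, gives a more meaningful trade-off.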