In [None]:
%%html
<marquee style='width: 100%; color: red;'><b><li style="font-size:75px;">Fast Text</li></b></marquee>

## Reading the data

In [None]:
# Let's import libraries
import pandas as pd
import numpy as np
import os
from termcolor import colored
import warnings

In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Path to the data folder on the mounted Drive; change this to your own path
data_dir = '/content/drive/MyDrive/IMDB_Data'

In [None]:
# Read the pickled train and test feature frames from the data folder
df_train = pd.read_pickle(os.path.join(data_dir, 'train_features.pkl'))
df_test  = pd.read_pickle(os.path.join(data_dir, 'test_features.pkl'))

## Loading Model

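- fastText is Facebook's open-source library for learning word embeddings and training text classifiers.

- The pretrained word vectors that Facebook distributes were trained using CBOW with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The classifier in this notebook does not load them; it learns its own embeddings from the training file.

- The documentation for these vectors can be found [here](https://fasttext.cc/docs/en/crawl-vectors.html)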
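
To make the n-gram setting mentioned above concrete, here is a minimal sketch of how a word can be split into character n-grams after it is wrapped in the `<` and `>` boundary markers that fastText adds to each word. The function name `char_ngrams` is hypothetical and is not part of the fastText API; it only illustrates the idea.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word` after adding boundary markers.

    Hypothetical helper for illustration; not part of the fastText library.
    """
    wrapped = f"<{word}>"
    # Slide a window of width n over the wrapped word.
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("where", n=3))
print(char_ngrams("movie", n=5))
```

Words shorter than the window simply produce no n-grams of that length in this sketch.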

In [None]:
# Download the fastText v0.1.0 source code
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip

--2020-12-02 20:47:01--  https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2020-12-02 20:47:01--  https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.1.0.zip’

v0.1.0.zip              [ <=>                ]  92.06K  --.-KB/s    in 0.01s   

2020-12-02 20:47:01 (6.20 MB/s) - ‘v0.1.0.zip’ saved [94267]



In [None]:
# Unzip the files
!unzip v0.1.0.zip

Archive:  v0.1.0.zip
431c9e2a9b5149369cc60fb9f5beba58dcf8ca17
   creating: fastText-0.1.0/
  inflating: fastText-0.1.0/.gitignore  
  inflating: fastText-0.1.0/CONTRIBUTING.md  
  inflating: fastText-0.1.0/LICENSE  
  inflating: fastText-0.1.0/Makefile  
  inflating: fastText-0.1.0/PATENTS  
  inflating: fastText-0.1.0/README.md  
  inflating: fastText-0.1.0/classification-example.sh  
  inflating: fastText-0.1.0/classification-results.sh  
  inflating: fastText-0.1.0/eval.py  
  inflating: fastText-0.1.0/get-wikimedia.sh  
  inflating: fastText-0.1.0/pretrained-vectors.md  
  inflating: fastText-0.1.0/quantization-example.sh  
  inflating: fastText-0.1.0/quantization-results.sh  
   creating: fastText-0.1.0/src/
  inflating: fastText-0.1.0/src/args.cc  
  inflating: fastText-0.1.0/src/args.h  
  inflating: fastText-0.1.0/src/dictionary.cc  
  inflating: fastText-0.1.0/src/dictionary.h  
  inflating: fastText-0.1.0/src/fasttext.cc  
  inflating: fastText-0.1.0/src/fasttext.h  
  inflat

In [None]:
# Change into the fastText source directory and build the binary
os.chdir('fastText-0.1.0')
!make

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext


In [None]:
!./fasttext

usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies



## Data Preparation

In [None]:
# Keep only the label and text columns; .copy() avoids pandas' SettingWithCopyWarning on the later assignments
df_train = df_train[['Label', 'Text']].copy()
df_test  = df_test[['Label', 'Text']].copy()

<div class="alert alert-block alert-danger">
<b>Note:</b> The input format is specified in the fastText documentation followed here: one example per line, with the label prefixed by <code>__label__</code> and followed by the text
</div>
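
As a rough illustration of that input format, the sketch below builds a single training line from a label and a review. The helper name `to_fasttext_line` and the sample review are made up for this example.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line.

    Hypothetical helper for illustration.
    """
    # fastText reads one example per line, so collapse internal newlines
    # and repeated whitespace into single spaces.
    clean_text = " ".join(str(text).split())
    return f"__label__{label} {clean_text}"


example_line = to_fasttext_line("positive", "A wonderful little film.\nLoved it.")
print(example_line)
```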

In [None]:
# Map the numeric labels to sentiment names
df_train['Label'] = df_train['Label'].map({0: 'negative', 1: 'positive'})
df_test['Label']  = df_test['Label'].map({0: 'negative', 1: 'positive'})

In [None]:
# Add the '__label__' prefix that fastText expects in front of each label
df_train['Label'] = ['__label__'+ str(s) for s in df_train['Label']]
df_test['Label']  = ['__label__'+ str(s) for s in df_test['Label']]

<div class="alert alert-block alert-danger">
<b>Note:</b> fastText reads its input from a file, so we write the data to text files and pass their paths to the model
</div>

In [None]:
# Writing the data into proper format
df_train.to_csv(r'/content/drive/MyDrive/IMDB_Data/fast_text_train.txt', index=False, sep=' ', header=False)
df_test.to_csv(r'/content/drive/MyDrive/IMDB_Data/fast_text_test.txt',   index=False, sep=' ', header=False)
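
With `sep=' '`, pandas applies its default minimal CSV quoting, so any text that contains a space is wrapped in double quotes in the output file. If unquoted lines are preferred, one possible alternative is to write the lines with plain Python. This is only a sketch: the helper name `write_fasttext_file`, the toy rows, and the temporary path are assumptions for illustration.

```python
import os
import tempfile


def write_fasttext_file(rows, path):
    """Write one '<label> <text>' line per (label, text) pair, without CSV quoting.

    Hypothetical helper; assumes each label already carries the '__label__' prefix.
    With the frames above it could be called as
    write_fasttext_file(zip(df_train['Label'], df_train['Text']), some_path).
    """
    with open(path, "w", encoding="utf-8") as f:
        for label, text in rows:
            # Collapse newlines and repeated whitespace so each example stays on one line.
            clean_text = " ".join(str(text).split())
            f.write(f"{label} {clean_text}\n")


# Tiny made-up rows, only to demonstrate the output format.
demo_rows = [
    ("__label__positive", 'Great "fun" movie'),
    ("__label__negative", "Dull and\ntoo long"),
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_fasttext.txt")
write_fasttext_file(demo_rows, demo_path)

with open(demo_path, encoding="utf-8") as f:
    demo_lines = f.read().splitlines()
print(demo_lines)
```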

## Model training

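We train a supervised classifier on the training file for 50 epochs with a learning rate of 0.01. The `-output model_fast` argument makes fastText save the trained model in the current directory as **"model_fast.bin"**, with the learned word vectors in **"model_fast.vec"**.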

In [None]:
%%time
!./fasttext supervised -input '/content/drive/MyDrive/IMDB_Data/fast_text_train.txt' -output model_fast -epoch 50 -lr 0.01

Read 2M words
Number of words:  77346
Number of labels: 2
Progress: 100.0%  words/sec/thread: 3520076  lr: 0.000000  loss: 0.304932  eta: 0h0m 
CPU times: user 299 ms, sys: 112 ms, total: 410 ms
Wall time: 26.9 s
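
The training command above can also be assembled in Python, which is convenient when trying several values of `-epoch` or `-lr`. The sketch below only builds and prints the command; the function name `build_supervised_command` is hypothetical, and the flags are the ones already used in the cell above.

```python
import shlex


def build_supervised_command(train_path, output_prefix, epoch=50, lr=0.01,
                             binary="./fasttext"):
    """Return the argument list for a `fasttext supervised` run.

    Hypothetical helper; uses only the flags from the training cell above.
    """
    return [
        binary, "supervised",
        "-input", train_path,
        "-output", output_prefix,
        "-epoch", str(epoch),
        "-lr", str(lr),
    ]


cmd = build_supervised_command(
    "/content/drive/MyDrive/IMDB_Data/fast_text_train.txt", "model_fast"
)
print(shlex.join(cmd))
# To actually run it from the fastText-0.1.0 directory:
# import subprocess; subprocess.run(cmd, check=True)
```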


In [None]:
# Let's see the files in this directory
# You can locate "model_fast.bin" in the directory
!ls

args.o			   Makefile		    quantization-results.sh
classification-example.sh  matrix.o		    README.md
classification-results.sh  model_fast.bin	    src
CONTRIBUTING.md		   model_fast.vec	    tutorials
dictionary.o		   model.o		    utils.o
eval.py			   PATENTS		    vector.o
fasttext		   pretrained-vectors.md    wikifil.pl
fasttext.o		   productquantizer.o	    word-vector-example.sh
get-wikimedia.sh	   qmatrix.o
LICENSE			   quantization-example.sh


In [None]:
# Evaluate the trained model on the test file
!./fasttext test model_fast.bin '/content/drive/MyDrive/IMDB_Data/fast_text_test.txt'

N	25000
P@1	0.87
R@1	0.87
Number of examples: 25000
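
The `P@1` and `R@1` values in the output above are precision and recall computed on the single top-ranked label per example. The sketch below shows, on made-up labels, why the two numbers coincide when every example has exactly one true label. The function name `precision_recall_at_1` is hypothetical and this is not fastText's own implementation.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Compute P@1 and R@1 for one true label and one prediction per example.

    Hypothetical helper for illustration only.
    """
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    # P@1: correct predictions divided by the number of predictions made.
    precision = correct / len(predicted_labels)
    # R@1: correct predictions divided by the number of true labels.
    recall = correct / len(true_labels)
    return precision, recall


# Made-up toy labels, not taken from the IMDB data.
toy_true = ["positive", "negative", "positive", "negative"]
toy_pred = ["positive", "negative", "negative", "negative"]
toy_p, toy_r = precision_recall_at_1(toy_true, toy_pred)
print(toy_p, toy_r)
```

With one label and one prediction per example, both denominators equal the number of examples, so both values reduce to plain accuracy.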


## Model Summary

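- The test command reports precision at one (P@1) and recall at one (R@1), not an F1 score; both are 0.87 on the 25,000 test reviews. Because each review has exactly one label (positive or negative), both values equal the classification accuracy.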

- The model reaches a decent accuracy with very little training time (about 27 seconds of wall time) and modest computational requirements.

- Next, let's look at Google's BERT model in this [notebook](https://colab.research.google.com/drive/1XcnZsRLV1x-bV7L4w2-RTzHbl_3OdmK4?usp=sharing)