# Word Embedding with FastText

FastText is a library for NLP tasks such as word embedding and text classification, developed by Facebook. It supports both the CBOW and skip-gram models for creating vector representations of words. Unlike Word2Vec implementations, which suffer from the out-of-vocabulary (OOV) problem, FastText represents each word as a combination of its character n-grams (subwords), so it can build a vector for a word that does not appear in the training corpus. However, FastText takes more time to train than a Word2Vec model.
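To make the subword idea concrete, here is a minimal sketch that lists the character n-grams of a word after adding the `<` and `>` boundary markers. The helper name `char_ngrams` is illustrative and not part of gensim; its defaults mirror gensim's `min_n=3` and `max_n=6`. The real library also hashes n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams

print(char_ngrams("fox", min_n=3, max_n=3))  # ['<fo', 'fox', 'ox>']
```

Two words that share many of these n-grams end up with related vectors, which is what lets the model handle unseen words.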

Import libraries

In [1]:
import gensim
import pandas as pd
from nltk.tokenize import word_tokenize

import warnings

In [2]:
warnings.filterwarnings('ignore')

Given the following tokenized text

In [3]:
text = [['The','quick','brown','fox','jumped','over','the','lazy','dog']]

In [4]:
text

[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]

Create a FastText model (`sg=1` selects skip-gram; `min_n=1` sets the shortest character n-gram to length 1)

In [5]:
model=gensim.models.FastText(text, size=100, window=2, min_count=1, workers=4,sg=1,min_n=1)

In [6]:
print(model)

FastText(vocab=9, size=100, alpha=0.025)
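The call above uses gensim 3.x argument names. In gensim 4.0 and later, `size` was renamed to `vector_size`, and `model.wv.vocab` was replaced by `model.wv.key_to_index`. A minimal sketch of the equivalent call, assuming gensim 4 or later; the wrapper `train_fasttext` and its `dim` argument are hypothetical names used only for illustration.

```python
def train_fasttext(sentences, dim=100):
    """Train a skip-gram FastText model using gensim >= 4.0 argument names."""
    # Imported inside the function so the snippet only needs gensim when called.
    from gensim.models import FastText

    return FastText(
        sentences=sentences,
        vector_size=dim,  # called `size` in gensim 3.x
        window=2,
        min_count=1,
        workers=4,
        sg=1,             # 1 = skip-gram, 0 = CBOW
        min_n=1,
    )

# Usage (requires gensim >= 4.0):
# model = train_fasttext(text)
# len(model.wv.key_to_index)  # replaces len(model.wv.vocab)
```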


Show vocabulary

In [7]:
model.wv.vocab

{'The': <gensim.models.keyedvectors.Vocab at 0x2839d57d6d8>,
 'quick': <gensim.models.keyedvectors.Vocab at 0x2839d57d518>,
 'brown': <gensim.models.keyedvectors.Vocab at 0x2839d57d550>,
 'fox': <gensim.models.keyedvectors.Vocab at 0x2839d57d588>,
 'jumped': <gensim.models.keyedvectors.Vocab at 0x2839d57d5c0>,
 'over': <gensim.models.keyedvectors.Vocab at 0x2839d57d5f8>,
 'the': <gensim.models.keyedvectors.Vocab at 0x2839d57d630>,
 'lazy': <gensim.models.keyedvectors.Vocab at 0x2839d57d6a0>,
 'dog': <gensim.models.keyedvectors.Vocab at 0x2839d57db38>}

Embedding size

In [8]:
model.wv.vector_size

100

Number of words in the model

In [9]:
len(model.wv.vocab)

9

Get the vector for a word

In [10]:
model.wv['dog']

array([-1.4176186e-04, -8.9067698e-04,  3.8891474e-03,  2.5695378e-03,
       -1.7134650e-03,  1.7198741e-03,  2.2002682e-04, -5.0427200e-04,
        4.3807672e-03, -9.5034909e-04, -2.9343623e-04, -1.8652430e-03,
        2.1255324e-03, -1.8820608e-03, -5.6179083e-04,  5.3679738e-05,
       -6.4589398e-04, -2.8342693e-03,  2.0778158e-03,  7.9447945e-04,
       -1.8359096e-03, -1.0908170e-03, -5.8449712e-04,  1.3403230e-03,
        4.0305706e-04,  2.4212194e-03, -3.4421741e-03, -3.1474323e-04,
       -1.0102881e-04, -1.5179611e-04, -9.5073803e-04, -1.6175084e-04,
       -1.4757197e-03, -7.8404188e-04, -1.7160275e-04, -4.7489727e-04,
       -3.7479012e-03,  5.1566778e-04, -2.1151400e-05,  1.2291519e-03,
       -1.1457974e-03,  2.4619172e-04, -2.1266822e-04, -1.2059588e-03,
        2.2159133e-03,  2.8339566e-03, -7.1660604e-04, -1.5639720e-03,
       -1.2781285e-04,  1.3118693e-04,  3.3181368e-03,  9.9504809e-04,
       -2.4623496e-03, -8.7316334e-04, -8.0615532e-04,  2.3007947e-03,
      

Get similar words

In [11]:
model.wv.most_similar('dog',topn=5)

[('fox', 0.07695597410202026),
 ('quick', 0.06328225880861282),
 ('over', 0.04278339445590973),
 ('jumped', 0.0007894337177276611),
 ('the', -0.07917296886444092)]

Get the words most similar to 'dog' and 'jumped' taken together (the list is treated as a set of positive examples)

In [12]:
model.wv.most_similar(['dog','jumped'],topn=5)

[('fox', 0.08860601484775543),
 ('quick', 0.06956937909126282),
 ('the', 0.052219927310943604),
 ('lazy', -0.02684996835887432),
 ('The', -0.04585377126932144)]
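When a list is passed, the neighbours are not looked up for each word separately. Roughly, the unit-normalized vectors of the listed words are averaged, and the rest of the vocabulary is ranked by cosine similarity to that mean. The sketch below reproduces the idea on made-up 2-dimensional vectors; `toy_vectors`, `unit` and `most_similar_to_mean` are hypothetical names, not gensim functions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_to_mean(vectors, positive, topn=5):
    """Rank words by cosine similarity to the mean of the `positive` vectors."""
    dim = len(next(iter(vectors.values())))
    units = [unit(vectors[w]) for w in positive]
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [
        (word, sum(a * b for a, b in zip(unit(vec), mean)))
        for word, vec in vectors.items()
        if word not in positive  # query words are excluded from the result
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0],
    "jumped": [0.0, 1.0],
    "fox": [1.0, 1.0],
    "lazy": [-1.0, 0.0],
}
print(most_similar_to_mean(toy_vectors, ["dog", "jumped"], topn=2))
```

Here 'fox' ranks first because its direction lies exactly between 'dog' and 'jumped'.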

Find the word that does not match the others

In [13]:
model.wv.doesnt_match(['dog','cat'])

'cat'
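`doesnt_match` returns the word that fits least with the rest of the list. Roughly, it averages the unit-normalized vectors of all listed words and returns the one with the lowest cosine similarity to that mean. With only two words, as in the cell above, the result carries little meaning, and 'cat' is itself out of vocabulary here. A minimal sketch on made-up vectors; `odd_one_out` and `toy_vectors` are hypothetical names.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the mean of all words."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(words) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors, for illustration only.
toy_vectors = {
    "dog": [1.0, 0.1],
    "fox": [0.9, 0.2],
    "cat": [1.0, 0.0],
    "over": [0.0, 1.0],
}
print(odd_one_out(toy_vectors, ["dog", "fox", "cat", "over"]))  # over
```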

Get the cosine similarity between two words

In [14]:
model.wv.similarity('dog','fox')

0.076955974

In [15]:
model.wv.similarity('dog','cat')

-0.03192065
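The value returned by `similarity` is the cosine of the angle between the two word vectors: the dot product divided by the product of the vector lengths. A minimal pure-Python sketch; `cosine_similarity` is an illustrative helper, not the gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Scores close to zero, like those in the outputs above, mean the vectors are nearly orthogonal, which is unsurprising for a model trained on a single nine-word sentence.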

Tackling the out-of-vocabulary (OOV) word problem

The word Elephant doesn't occur in our training corpus, but the model can approximate its vector.

In [16]:
model.wv['Elephant']

array([ 2.1511593e-03,  3.8176755e-04,  1.4032318e-03, -3.5268543e-04,
        1.1208318e-03,  8.0412319e-05, -3.5473134e-04,  8.4583741e-04,
        8.7807031e-04,  4.6867627e-04, -3.9382291e-04,  9.5001975e-04,
       -9.8662346e-04, -1.1218158e-03, -1.4597934e-03, -1.2757437e-03,
        6.7832938e-04,  2.8072970e-04, -1.0720185e-03, -3.0953035e-04,
       -1.1760935e-03, -1.1959377e-03,  4.5601607e-04, -2.4174016e-04,
        4.3237748e-04, -1.2577324e-03,  1.5623691e-03, -5.0138333e-04,
        4.5538074e-04, -6.8542932e-04,  9.5980830e-04,  1.2120079e-03,
        1.0422956e-03, -6.5210270e-04, -5.0693809e-04, -6.1966921e-04,
        4.8274716e-04,  1.3960593e-03, -7.3872402e-04, -1.5125867e-03,
       -2.1353830e-04,  1.3970184e-03,  1.7122459e-03, -5.9309979e-05,
       -3.0222876e-04,  5.5010732e-05,  2.0579759e-04,  6.8118877e-04,
       -5.6362519e-04, -7.3218491e-04,  6.6722365e-05,  6.3822250e-04,
        4.7766938e-04, -6.1266997e-04, -1.0063048e-06,  1.0841808e-05,
      

In [17]:
model.wv.most_similar('Elephant',topn=5)

[('fox', 0.2621423006057739),
 ('quick', 0.15453022718429565),
 ('over', 0.07030598819255829),
 ('jumped', 0.06264208257198334),
 ('brown', 0.06171374022960663)]

In [18]:
model.wv.most_similar('jum',topn=5)

[('jumped', 0.5389863848686218),
 ('dog', 0.1440437287092209),
 ('fox', 0.042089514434337616),
 ('quick', 0.034181419759988785),
 ('lazy', 0.01689363829791546)]
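The high score for 'jumped' is consistent with how FastText handles unseen strings: the query is split into character n-grams, and the n-grams it shares with training words pull its vector toward them. The sketch below counts shared n-grams between 'jum' and each corpus word, using `min_n=1` as in the model above and assuming gensim's default `max_n=6`. The helper names are hypothetical, and details such as n-gram hashing are ignored.

```python
def char_ngrams(word, min_n=1, max_n=6):
    """Set of character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }

def shared_ngram_counts(query, vocabulary, min_n=1, max_n=6):
    """Count the n-grams each vocabulary word has in common with `query`."""
    query_grams = char_ngrams(query, min_n, max_n)
    return {
        word: len(query_grams & char_ngrams(word, min_n, max_n))
        for word in vocabulary
    }

vocabulary = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
counts = shared_ngram_counts("jum", vocabulary)
print(max(counts, key=counts.get))  # jumped
```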

##### Creating embeddings from DataFrame data

Load data

In [19]:
df=pd.read_csv('datasets/Sentiment Analysis on Movie Reviews/train.tsv',sep='\t')

In [20]:
df.head()

|   | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


Tokenize the Phrase into words

In [21]:
df['token']=df['Phrase'].apply(word_tokenize)

Create FastText model

In [22]:
fasttext_model=gensim.models.FastText(df['token'].tolist(), size=100, window=10, min_count=1, workers=4,min_n=3)

In [23]:
fasttext_model

<gensim.models.fasttext.FastText at 0x283a3b5b518>

Number of distinct tokens in the model (this depends on the vocabulary of the tokenized phrases)

In [24]:
len(fasttext_model.wv.vocab)

We can now perform other analyses on the embeddings (similarity, dissimilarity, etc.) as above.