
FastText is an extension of Word2Vec proposed by Facebook in 2016. Instead of feeding whole words into the neural network, FastText breaks each word into several character n-grams (sub-words). For instance, the tri-grams of the word apple are app, ppl, and ple (ignoring the word-boundary markers at the start and end of the word). The embedding vector for apple is the sum of the vectors of these n-grams. After training the neural network, we have embeddings for all the n-grams that occur in the training dataset. Rare words can now be represented more reliably, since it is highly likely that some of their n-grams also appear in other words. I will show how to use FastText with Gensim in the following section.
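Before turning to Gensim, the sub-word idea can be sketched in plain Python. This is only an illustration under simplifying assumptions: it uses a single n-gram length, ignores word-boundary markers as in the apple example, and the helper names `char_ngrams` and `word_vector` and the two-dimensional `toy_vectors` table are hypothetical, not part of FastText or Gensim.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, without boundary markers."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_vector(word, ngram_vectors, n=3):
    """Sum the vectors of the word's n-grams found in a (hypothetical) lookup table."""
    known = [g for g in char_ngrams(word, n) if g in ngram_vectors]
    if not known:
        raise KeyError(f"no known n-grams for {word!r}")
    dim = len(ngram_vectors[known[0]])
    total = [0.0] * dim
    for gram in known:
        for k, value in enumerate(ngram_vectors[gram]):
            total[k] += value
    return total


# Made-up 2-dimensional n-gram vectors, for illustration only.
toy_vectors = {"app": [1.0, 0.0], "ppl": [0.0, 2.0], "ple": [0.5, 0.5]}

apple_ngrams = char_ngrams("apple")                # ['app', 'ppl', 'ple']
apple_vector = word_vector("apple", toy_vectors)   # sum of all three n-gram vectors
apply_vector = word_vector("apply", toy_vectors)   # built from the shared 'app' and 'ppl'
```

The word "apply" never appears in the toy table as a whole, yet it still gets a vector because it shares the n-grams app and ppl with "apple". This is the mechanism that lets rare words borrow information from more frequent ones.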

In [None]:
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Path to the training corpus bundled with Gensim's test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)

<gensim.models.fasttext.FastText object at 0x000001F00162C5B0>


In [None]:
wv = model.wv
print(wv)

<gensim.models.fasttext.FastTextKeyedVectors object at 0x000001F00162C520>


In [None]:
wv['india']

array([-0.13601464,  0.1206445 , -0.17324091, -0.06203524,  0.04282829,
        0.23890977,  0.19053896,  0.3213919 ,  0.16475925, -0.15950854,
        0.02349657, -0.10488129, -0.1430523 ,  0.33074853, -0.2623605 ,
       -0.3597735 ,  0.1152797 , -0.16147825, -0.27947775, -0.3532005 ,
       -0.30688426, -0.03713719, -0.29282302, -0.08848781, -0.12943256,
       -0.21343252, -0.45209476, -0.07147336, -0.21573687,  0.1766408 ,
       -0.21581466,  0.19174771,  0.5406469 , -0.17505644,  0.1229412 ,
        0.25960675,  0.25127518, -0.07169745, -0.24575746, -0.2214999 ,
        0.30793127, -0.27963513,  0.02619917, -0.2733715 , -0.34414735,
       -0.18905082, -0.05138936,  0.07613254,  0.24654427,  0.00366264,
        0.22768787, -0.2826577 ,  0.1922712 , -0.2627508 , -0.12381885,
       -0.11427651, -0.10306711, -0.08921213,  0.03112538, -0.23383443,
       -0.21684554, -0.2888015 , -0.11493521,  0.22562303, -0.08188816,
        0.4543199 ,  0.03838707,  0.04640825,  0.27662313,  0.16

In [None]:
sim_score = model.wv.similarity('pakistan', 'india')
sim_score

0.9999638
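The score returned by `similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation in plain Python follows, using made-up toy vectors rather than the trained embeddings. The helper `cosine_similarity` is illustrative and is not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, for illustration only.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
```

A score close to 1, like the one above for 'pakistan' and 'india', means the two vectors point in nearly the same direction, while a score near 0 means they are close to orthogonal.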