__Installation__
```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .
```

__References__

* [RaRe tutorial notebook used here](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb)
* [RaRe Technologies (makers of gensim): notebooks directory](https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks)
* [Intro to gensim](https://radimrehurek.com/gensim/intro.html)
* [Quora: a succinct explanation of word2vec vs. fastText](https://www.quora.com/What-is-the-main-difference-between-word2vec-and-fastText)
* [Facebook fastText repo](https://github.com/facebookresearch/fastText/#requirements)

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 5)

In [2]:
pd.options.display.max_rows

5

In [4]:
! FILE='/opt/conda/lib/python3.7/site-packages/gensim/test/test_data/lee_background.cor'; \
sed -n 1p ${FILE} | head -c 500

Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South

In [5]:
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

# Path to the training corpus bundled with gensim's test data
corpus_file = datapath('lee_background.cor')

model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)

# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)

print(model_gensim)

FastText(vocab=1762, size=100, alpha=0.025)


In [None]:
# saving a model trained via Gensim's fastText implementation
model_gensim.save('saved_model_gensim')
loaded_model = FT_gensim.load('saved_model_gensim')
print(loaded_model)

In [6]:
print('night' in model_gensim.wv.vocab)
print('nights' in model_gensim.wv.vocab)

True
False


In [7]:
# Look up vectors through .wv; indexing the model directly is deprecated
print(model_gensim.wv['night'])

[ 3.02392226e-02 -6.21297956e-01  5.88215828e-01 -2.50432882e-02
 -9.53542069e-02  3.79108250e-01  4.72763889e-02  2.19921142e-01
 -2.07267731e-01  3.76425415e-01 -3.64657551e-01  3.64133894e-01
 -2.80809999e-01 -2.39424944e-01 -3.28515828e-01 -1.12393454e-01
 -1.60692468e-01 -6.61114812e-01  4.17221010e-01  1.87709436e-01
 -4.95419390e-02 -2.66229749e-01  1.06365077e-01 -6.04175746e-01
 -2.38002241e-01 -6.59016669e-02 -4.53367025e-01  1.94207594e-01
  1.27956688e-01 -1.76235005e-01 -2.91133523e-02 -5.58853030e-01
 -2.86314636e-01  4.03854728e-01  1.01799774e+00 -7.43042454e-02
 -4.69729155e-01 -3.42207521e-01 -3.85300934e-01  8.14097747e-02
 -7.37527788e-01 -6.83626980e-02 -4.45275664e-01  2.06199825e-01
  2.04998702e-01 -1.00169554e-01  3.41434509e-01 -1.94936663e-01
 -5.22109449e-01  5.25017083e-01 -2.29641050e-01  7.12306947e-02
 -3.45147341e-01 -3.47071946e-01 -2.47467477e-02  6.84074461e-02
 -4.51964110e-01 -1.42093543e-02  2.82373708e-02 -1.47072628e-01
  1.96731046e-01 -5.05950


In [8]:
# 'nights' is out of vocabulary, so its vector is built from character n-grams
print(model_gensim.wv['nights'])

[ 2.73815766e-02 -5.76783538e-01  5.47451675e-01 -2.03676801e-02
 -8.96775723e-02  3.50036055e-01  4.45766784e-02  2.04973102e-01
 -1.93165079e-01  3.52447063e-01 -3.36603612e-01  3.35392237e-01
 -2.61315823e-01 -2.22529158e-01 -3.05473834e-01 -1.03741474e-01
 -1.50353074e-01 -6.15058064e-01  3.88727307e-01  1.75114125e-01
 -4.63238880e-02 -2.46018752e-01  9.82075706e-02 -5.60712218e-01
 -2.20386311e-01 -6.00378811e-02 -4.22511190e-01  1.81272671e-01
  1.18900359e-01 -1.61747918e-01 -2.87383758e-02 -5.18806577e-01
 -2.67347306e-01  3.76360059e-01  9.45483804e-01 -6.91659898e-02
 -4.37582463e-01 -3.17368418e-01 -3.56905162e-01  7.71079808e-02
 -6.83531284e-01 -6.28698766e-02 -4.13264990e-01  1.91327035e-01
  1.90386489e-01 -9.24587920e-02  3.16662818e-01 -1.78655520e-01
 -4.86538321e-01  4.86373633e-01 -2.12037891e-01  6.56346604e-02
 -3.20957690e-01 -3.19698840e-01 -2.36785486e-02  6.35699928e-02
 -4.19455618e-01 -1.29103912e-02  2.45727468e-02 -1.35962263e-01
  1.83441773e-01 -4.69187
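
The cells above show that 'nights' is missing from the vocabulary yet still gets a vector close to the one for 'night'. fastText builds vectors for out-of-vocabulary words from character n-grams, so words that share many n-grams end up with similar vectors. The following is a minimal sketch of the n-gram extraction step, assuming the 3 to 6 character range that gensim uses by default. `char_ngrams` is a hypothetical helper written for illustration, not gensim's internal function, and it leaves out the hashing of n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


night = set(char_ngrams("night"))
nights = set(char_ngrams("nights"))
shared = night & nights

print(sorted(shared))
print(len(shared), "of", len(nights), "n-grams of 'nights' also occur in 'night'")
```

Most of the n-grams of 'nights' are already known from 'night', which is consistent with the two printed vectors being nearly proportional to each other.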


In [10]:
print(model_gensim.corpus_total_words)
model_gensim.cum_table

59890

In [14]:
model_gensim.estimate_memory()

{'vocab': 881000,
 'syn0_vocab': 704800,
 'syn1neg': 704800,
 'syn0_ngrams': 6774400,
 'buckets_word': 346496,
 'total': 9411496}
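
The figures in this report can be cross-checked with simple arithmetic. The sketch below assumes float32 weights (4 bytes each) and roughly 500 bytes of bookkeeping per vocabulary entry. Both assumptions are inferred from the numbers printed above, not taken from the gensim documentation. The `buckets_word` value is copied from the report as-is.

```python
# Values taken from the outputs above.
vocab_size = 1762        # FastText(vocab=1762, ...)
vector_size = 100        # size=100 passed to the constructor
bytes_per_float = 4      # assumption: float32 weights

vector_bytes = vector_size * bytes_per_float

# Assumption inferred from the report: about 500 bytes per vocabulary entry.
vocab_bytes = vocab_size * 500
syn0_vocab_bytes = vocab_size * vector_bytes   # one input vector per word
syn1neg_bytes = vocab_size * vector_bytes      # one output vector per word

# The size of the n-gram matrix implies how many n-gram rows it holds.
syn0_ngrams_bytes = 6774400
ngram_rows = syn0_ngrams_bytes // vector_bytes

buckets_word_bytes = 346496  # copied from the report, not derived here

total = (vocab_bytes + syn0_vocab_bytes + syn1neg_bytes
         + syn0_ngrams_bytes + buckets_word_bytes)
print(vocab_bytes, syn0_vocab_bytes, ngram_rows, total)
```

Under these assumptions the n-gram matrix accounts for most of the memory, which is typical for fastText models with small vocabularies.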

In [None]:
# Tests whether the word is present in the trained vocabulary
print("word" in model_gensim.wv.vocab)
# Tests whether a vector can be produced for the word (including via n-grams)
print("word" in model_gensim.wv)

In [17]:
model_gensim.wv.most_similar(['night'])


[('fight', 0.9999890923500061),
 ('night.', 0.9999887347221375),
 ('light', 0.9999877214431763),
 ('fighter', 0.9999872446060181),
 ('might', 0.9999871850013733),
 ('overnight', 0.9999868869781494),
 ('fighters', 0.9999866485595703),
 ('fighting', 0.9999857544898987),
 ('eight', 0.9999848008155823),
 ('night,', 0.999984622001648)]
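
The second value in each pair is a cosine similarity between word vectors. As a minimal sketch of that measure, the snippet below computes it in plain Python. `cosine_similarity` is an illustrative helper, not a gensim function, and the two short vectors are only the first four components of the 'night' and 'nights' vectors printed earlier, so the result is not the model's actual similarity score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First four components of the printed vectors (truncated for illustration).
night_head = [3.02392226e-02, -6.21297956e-01, 5.88215828e-01, -2.50432882e-02]
nights_head = [2.73815766e-02, -5.76783538e-01, 5.47451675e-01, -2.03676801e-02]

print(cosine_similarity(night_head, nights_head))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Scores this close to 1.0 for nearly every neighbor suggest the vectors have not separated much, which is plausible for a corpus of only about 60,000 words.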

In [22]:
# Note: 'night, owl' is passed as a single out-of-vocabulary string, not as two words
model_gensim.wv.similar_by_word('night, owl')


[('night,', 0.9991105198860168),
 ('firm', 0.9990448355674744),
 ('built', 0.9990408420562744),
 ('summit', 0.9990309476852417),
 ('bomb', 0.9990290403366089),
 ('guilty', 0.9990273118019104),
 ('against', 0.9990243911743164),
 ('Gillespie', 0.9990242719650269),
 ('hours,', 0.9990205764770508),
 ('scored', 0.9990161657333374)]

In [27]:
model_gensim.wv.similar_by_vector([0.9991105198860168,0.9990448355674744])


[('1999', 0.9979867935180664),
 ('area,', 0.9978424906730652),
 ('walk', 0.9978353977203369),
 ("he's", 0.9978049397468567),
 ('create', 0.9978022575378418),
 ('Hobart', 0.9977971315383911),
 ('Home', 0.9977955222129822),
 ('growth', 0.9977902173995972),
 ('Emergency', 0.9977893829345703),
 ('Musharraf', 0.9977880120277405)]

In [None]:
model_gensim.wv.similarity("night", "nights")

In [None]:
model_gensim.wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])

In [None]:
model_gensim.wv.doesnt_match("breakfast cereal dinner lunch".split())

In [None]:
model_gensim.wv.most_similar(positive=['baghdad', 'england'], negative=['london'])

In [28]:
from nltk.corpus import stopwords

ModuleNotFoundError: No module named 'nltk'

In [None]:
# Word Mover's Distance
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Remove stopwords from both sentences.
# Requires nltk and its stopwords corpus: nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name avoids shadowing the module
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]

# Compute WMD (gensim's wmdistance needs the pyemd package).
distance = model_gensim.wv.wmdistance(sentence_obama, sentence_president)
distance
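
Since nltk was not installed in this environment, the stopword filtering step can also be tried without it. The sketch below uses a tiny hand-written stopword set, `FALLBACK_STOPWORDS`, which is a hypothetical stand-in chosen for these two sentences and far smaller than nltk's English list. It only covers the preprocessing; the distance itself still has to come from the trained model.

```python
# Hypothetical stand-in for nltk's English stopword list (illustration only).
FALLBACK_STOPWORDS = {"to", "the", "in"}


def remove_stopwords(tokens, stop_words=FALLBACK_STOPWORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [w for w in tokens if w not in stop_words]


sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

filtered_obama = remove_stopwords(sentence_obama)
filtered_president = remove_stopwords(sentence_president)

print(filtered_obama)
print(filtered_president)
```

After filtering, the two sentences have no words in common, which is the case Word Mover's Distance is meant to handle: it compares the sentences through the distances between their word vectors instead of through exact word overlap.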