# FastText Example
This notebook shows how to load pretrained FastText word embedding models and use them.

In this example, we load the freely available FastText word embeddings described at https://fasttext.cc/docs/en/english-vectors.html; more precisely, we use the English aligned word vectors from https://fasttext.cc/docs/en/aligned-vectors.html.

We will go through a few examples of how to load the word embeddings and use them to embed single words and whole texts.

## Prerequisites

- The `gensim` and `numpy` libraries installed (both are included in the project's `environment.yml`)
- A downloaded FastText model


## Using Gensim

In [18]:
import numpy as np
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation

## Load the Model

In [19]:
# change the path so that it points to your downloaded FastText model
embeddings = KeyedVectors.load_word2vec_format('./models/wiki.en.align.vec')
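
The `.vec` file loaded above is a plain-text file in the word2vec text format: a header line with the vocabulary size and the vector dimension, followed by one line per word with its vector components. The following is a minimal sketch of that layout, not gensim's actual loader; `read_vec` and the toy data are hypothetical names introduced only for illustration.

```python
import io


def read_vec(handle, limit=None):
    """Parse word2vec text format into (dimension, {word: [floats]}).

    Illustrative only: it keeps everything in plain Python lists and
    does none of the optimizations a real loader would apply.
    """
    # header line: "<number of words> <vector dimension>"
    _, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # each data line: "<word> <v1> <v2> ... <vdim>"
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError("unexpected number of components for %r" % word)
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# a toy stand-in for a real .vec file (values are made up)
sample = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nvirus -0.5 0.6\n")
dim, vectors = read_vec(sample, limit=2)
```

`KeyedVectors.load_word2vec_format` also accepts a `limit` argument that reads only the first entries of the file, which can help when the full model does not fit comfortably in memory.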

In [21]:
embeddings.vocab

{',': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7820>,
 '.': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7880>,
 'the': <gensim.models.keyedvectors.Vocab at 0x7f501c2b78b0>,
 '</s>': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7940>,
 'of': <gensim.models.keyedvectors.Vocab at 0x7f501c2b79d0>,
 '-': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7a00>,
 'in': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7a60>,
 'and': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7ac0>,
 "'": <gensim.models.keyedvectors.Vocab at 0x7f501c2b7b20>,
 ')': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7b50>,
 '(': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7bb0>,
 'to': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7be0>,
 'a': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7c40>,
 'is': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7190>,
 'was': <gensim.models.keyedvectors.Vocab at 0x7f501c2b7cd0>,
 'on': <gensim.models.keyedvectors.Vocab at 0x7f50b036ac70>,
 's': <gensim.models.keyed
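
The `vocab` attribute used above belongs to the gensim 3.x API; in gensim 4.0 and later the same information is exposed as `key_to_index`. A minimal sketch of a version-tolerant accessor follows. The helper name `vocabulary` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real `KeyedVectors` instances so that the snippet runs without a model.

```python
from types import SimpleNamespace


def vocabulary(kv):
    """Return the word mapping of a KeyedVectors-like object."""
    # gensim >= 4.0 exposes `key_to_index`; older releases expose `vocab`
    mapping = getattr(kv, "key_to_index", None)
    if mapping is None:
        mapping = kv.vocab
    return mapping


# stand-ins for the two API generations (not real gensim objects)
old_style = SimpleNamespace(vocab={"the": object(), "of": object()})
new_style = SimpleNamespace(key_to_index={"the": 0, "of": 1})

first_words = [list(vocabulary(kv))[:2] for kv in (old_style, new_style)]
```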

### Get the Embedding of a Single Word

In [22]:
token = 'virus'
# look up the vector only if the token is in the vocabulary
word_embedding = embeddings[token] if token in embeddings else None
word_embedding

array([-0.0183, -0.0171, -0.0872,  0.0269, -0.0022, -0.0114,  0.0272,
       -0.0671, -0.0316, -0.0858, -0.0063,  0.1165, -0.0545, -0.0345,
        0.0262, -0.0776, -0.0007, -0.1314,  0.0129,  0.1624,  0.0755,
        0.0753,  0.0181, -0.0047, -0.0995,  0.0483, -0.0634, -0.0538,
       -0.0179,  0.0231, -0.014 ,  0.0583, -0.1262,  0.0442, -0.0179,
        0.0154,  0.0921, -0.0875,  0.0032,  0.0933,  0.0475, -0.0978,
        0.065 ,  0.0324, -0.0798,  0.0542,  0.0256, -0.0194,  0.0971,
       -0.0214,  0.0259, -0.0146, -0.0758, -0.0902, -0.0465, -0.0258,
       -0.0317, -0.0273, -0.0081,  0.0518, -0.0026,  0.0456,  0.0097,
        0.0015, -0.0999,  0.0505, -0.0891,  0.0319, -0.0566,  0.0738,
        0.0365, -0.0183,  0.0815, -0.0351, -0.0533,  0.0073, -0.009 ,
        0.1135, -0.0366, -0.0265, -0.0163, -0.1047,  0.0187,  0.0123,
        0.0503, -0.0222,  0.0469,  0.0596, -0.0135, -0.0218,  0.0091,
        0.0266,  0.0751,  0.0568,  0.0509,  0.0275,  0.0433,  0.0409,
       -0.0054,  0.0
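
A single word vector is mostly useful in comparison with other vectors, and cosine similarity is a common choice for that. The sketch below computes it in plain Python; `cosine_similarity` is a hypothetical helper and the vectors are made-up toy values, not entries from the FastText model. gensim's `KeyedVectors` also provides a `similarity(w1, w2)` method for comparing two words directly.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # a zero vector has no direction; treat it as dissimilar
        return 0.0
    return dot / (norm_u * norm_v)


# toy vectors: the first two point in the same direction, the third does not
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```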

### Define the Stopwords

In [23]:
stopwords = ["the", "a"]

### Define The Tokenizer

In [24]:
def tokenize(text):
    """Tokenizes the provided text
    Args:
        text (str): The text to be tokenized
    Returns:
        list(str): A list of lowercase tokens from the text without the stopwords.
    """

    # make everything lowercase and strip punctuation
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_punctuation]
    tokens = preprocess_string(text, CUSTOM_FILTERS)

    # filter out all stopwords
    filtered_tokens = [w for w in tokens if w not in stopwords]

    # return the filtered tokens
    return filtered_tokens
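
To make the tokenizer's behaviour concrete, here is a rough standard-library approximation of the same steps: lowercase the text, replace punctuation with spaces, split on whitespace, and drop the stopwords. The name `tokenize_plain` is hypothetical, and the sketch assumes that gensim's `strip_punctuation` replaces runs of ASCII punctuation with a space; it is not a drop-in replacement for the function above.

```python
import re
import string

STOPWORDS = {"the", "a"}
# one or more ASCII punctuation characters in a row
PUNCTUATION = re.compile("[%s]+" % re.escape(string.punctuation))


def tokenize_plain(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punctuation = PUNCTUATION.sub(" ", lowered)
    return [t for t in no_punctuation.split() if t not in stopwords]


tokens = tokenize_plain("Today is a lovely day!")
# tokens == ["today", "is", "lovely", "day"]
```

A set is used for the stopwords here because membership tests on a set stay fast as the stopword list grows.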

### Define the Text Embedding Function

In [25]:
def text_embedding(text):
    """Create the text embedding by averaging the embeddings of its tokens
    Args:
        text (str): The text to be embedded
    Returns:
        numpy.ndarray: The text embedding; a zero vector if the text is None
            or none of its tokens are in the vocabulary.
    """

    # prepare the embedding placeholder
    embedding = np.zeros(embeddings.vector_size, dtype=np.float32)

    if text is None:
        # return the default (zero) embedding
        return embedding

    # get the text tokens without the stopwords
    tokens = tokenize(text)
    # sum the embeddings of the known tokens and count them
    count = 0
    for token in tokens:
        # skip tokens that are not in the vocabulary
        if token in embeddings:
            embedding += embeddings[token]
            count += 1

    if count == 0:
        # no known tokens: return the zero embedding
        return embedding

    # average the embedding over the known tokens
    return embedding / count
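
The averaging step can be checked by hand on a toy vocabulary. The sketch below repeats the same logic in plain Python with a two-dimensional, made-up lookup table; `average_embedding` and `toy_vectors` are hypothetical names, and the lookup is passed in explicitly so the snippet runs without a loaded model.

```python
def average_embedding(tokens, lookup, size):
    """Average the vectors of the tokens found in `lookup`."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        vector = lookup.get(token)
        if vector is None:
            # out-of-vocabulary tokens do not contribute to the average
            continue
        total = [t + v for t, v in zip(total, vector)]
        count += 1
    if count == 0:
        # nothing was found: return the zero vector
        return total
    return [t / count for t in total]


# made-up vectors; "today" is deliberately missing from the table
toy_vectors = {"lovely": [1.0, 0.0], "day": [0.0, 1.0]}
result = average_embedding(["today", "lovely", "day"], toy_vectors, 2)
# result == [0.5, 0.5]
```

Note that unknown tokens are skipped rather than counted, so the average is taken only over the words the model actually knows.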

## Create a Text Embedding

In [26]:
text = "Today is a lovely day"
text_embedding(text)

array([ 0.002925  ,  0.01125   ,  0.017025  ,  0.02935   , -0.016275  ,
        0.0339    ,  0.019525  , -0.01185   ,  0.028375  ,  0.045775  ,
        0.04865   ,  0.0384    ,  0.015025  , -0.0064    ,  0.004725  ,
       -0.055825  ,  0.014275  , -0.054475  ,  0.050575  ,  0.047525  ,
       -0.035425  ,  0.051475  , -0.04705   , -0.0454    , -0.00995   ,
       -0.009575  , -0.012475  ,  0.020025  , -0.024325  , -0.000225  ,
       -0.0182    ,  0.056025  , -0.10462499,  0.043075  , -0.0209    ,
       -0.038625  , -0.00705   , -0.0218    , -0.008775  , -0.0271    ,
        0.025675  , -0.008325  , -0.01185   ,  0.015775  ,  0.01235   ,
        0.0228    ,  0.054875  ,  0.02035   , -0.0293    , -0.032     ,
        0.00255   , -0.020125  ,  0.0467    , -0.02485   ,  0.022275  ,
        0.0453    ,  0.0073    ,  0.074075  ,  0.006125  ,  0.081475  ,
        0.04985   ,  0.013325  , -0.0054    , -0.046025  , -0.03455   ,
       -0.05655   , -0.040225  ,  0.012525  ,  0.0232    ,  0.01