# FastText Example
This notebook shows how to load pretrained FastText word embeddings and use them.

In this example, we load FastText word embeddings, which are freely available at https://fasttext.cc/docs/en/english-vectors.html. More precisely, we use the English aligned word embeddings from https://fasttext.cc/docs/en/aligned-vectors.html.

We will go through a couple of examples of how to load and use the word embeddings.

## Prerequisites

- The `gensim` and `numpy` libraries installed (both are included in the project's `environment.yml`)
- A downloaded FastText model


## Using Gensim

In [18]:
import numpy as np
from gensim.models import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation

## Load the Model

In [None]:
# change the location of the fasttext model
embeddings = KeyedVectors.load_word2vec_format('./models/wiki.en.align.vec')
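
For orientation, the `.vec` file is plain text in the word2vec format: a header line with the vocabulary size and the vector dimension, followed by one line per word with its space-separated values. The sketch below parses that layout from an in-memory toy sample. The helper `parse_vec` and the sample values are made up for illustration and are not part of gensim, which does this (and handles edge cases) for you.

```python
import io
import numpy as np

def parse_vec(handle):
    """Parse word2vec text format into a {word: vector} dict (illustrative helper)."""
    # header line: "<vocab_size> <vector_size>"
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        values = [float(x) for x in parts[1:]]
        vectors[parts[0]] = np.array(values, dtype=np.float32)
    return vectors, vocab_size, vector_size

# toy sample, not real FastText values
sample = io.StringIO("2 3\nvirus 0.1 0.2 0.3\nday 0.4 0.5 0.6\n")
toy_vectors, n_words, dim = parse_vec(sample)
```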

In [None]:
# list the vocabulary (`vocab` exists in gensim < 4.0; gensim 4.x uses `key_to_index`)
embeddings.vocab

Get the embedding of a single word

In [None]:
token = 'virus'
# membership test on the KeyedVectors object itself; None if the token is out of vocabulary
word_embedding = embeddings[token] if token in embeddings else None
word_embedding

### Define the Stopwords

In [21]:
stopwords = ["the", "a"]

### Define the Tokenizer

In [22]:
def tokenize(text):
    """Tokenizes the provided text
    Args:
        text (str): The text to be tokenized
    Returns:
        list(str): The lowercased tokens from the text, without punctuation and stopwords.
    """

    # make everything lowercase and strip punctuation
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_punctuation]
    tokens = preprocess_string(text, CUSTOM_FILTERS)

    # filter out all stopwords
    filtered_tokens = [w for w in tokens if w not in stopwords]

    # return the filtered tokens
    return filtered_tokens
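
To see what the tokenizer produces without loading gensim, here is a rough standard-library approximation. It assumes that `strip_punctuation` replaces ASCII punctuation with spaces and that tokens are then split on whitespace. `simple_tokenize` is a hypothetical stand-in for illustration, not the notebook's `tokenize` function.

```python
import re
import string

stopwords = ["the", "a"]

# one or more ASCII punctuation characters
PUNCTUATION = re.compile("[%s]+" % re.escape(string.punctuation))

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    cleaned = PUNCTUATION.sub(" ", text.lower())
    return [w for w in cleaned.split() if w not in stopwords]

print(simple_tokenize("Today is a lovely day!"))
# → ['today', 'is', 'lovely', 'day']
```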

### Define the Text Embedding Function

In [23]:
def text_embedding(text):
    """Create the text embedding
    Args:
        text (str): The text to be embedded
    Returns:
        list(float): The average of the token embeddings, or a list of
            zeros if the text is None or contains no known tokens.
    """

    # prepare the embedding placeholder
    embedding = np.zeros(embeddings.vector_size, dtype=np.float32)

    if text is None:
        # return the default embedding in a vanilla python object
        return embedding.tolist()

    # get the text tokens
    tokens = tokenize(text)
    # iterate through the tokens and count those found in the vocabulary
    count = 0
    for token in tokens:
        # sum the embeddings of all known tokens
        if token in embeddings:
            embedding += embeddings[token]
            count += 1

    if count == 0:
        # return the zero embedding as a list
        return embedding.tolist()

    # average the embedding
    embedding = embedding / count

    # return the embedding in a vanilla python object
    return embedding.tolist()

## Create a Text Embedding

In [None]:
text = "Today is a lovely day"
text_embedding(text)
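
Averaged embeddings like this are commonly compared with cosine similarity. The sketch below shows that step with NumPy on toy 3-dimensional vectors. `toy_embeddings`, `average_embedding`, and `cosine_similarity` are illustrative names, and the values are invented rather than taken from the FastText model.

```python
import numpy as np

# toy vectors for illustration only, not real FastText values
toy_embeddings = {
    "lovely": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "day": np.array([0.0, 1.0, 0.0], dtype=np.float32),
    "night": np.array([0.0, 0.0, 1.0], dtype=np.float32),
}

def average_embedding(tokens, vectors, size=3):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size, dtype=np.float32)
    return np.mean(known, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm > 0 else 0.0

first = average_embedding(["lovely", "day"], toy_embeddings)
second = average_embedding(["lovely", "night"], toy_embeddings)
similarity = cosine_similarity(first, second)
```

With the real model loaded, the same `cosine_similarity` helper could be applied to the results of two `text_embedding(...)` calls.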