# FastText Word Embeddings  

checked 27.02.2024 G.Paaß




## Description of FastText

[FastText](https://fasttext.cc/) is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. It is implemented in C++. The library was developed by [Facebook Research](https://research.fb.com/fasttext/).



In [None]:
!pip install fasttext

In [None]:
import os
import sys
import pandas as pd
import scipy.spatial.distance  # explicit submodule import for the cosine distance
import fasttext

## Data from Wikipedia
We use `text8`, an excerpt from the first $10^9$ bytes of the English Wikipedia dump of Mar. 3, 2006. It may be downloaded from [here](http://mattmahoney.net/dc/textdata.html).

In [None]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
!mv text8 text8.txt

In [None]:
!ls

In [None]:
text8File = 'text8.txt'
print(text8File)
# read the whole corpus and split it into whitespace-separated words
with open(text8File, "r") as file:
    words = file.read().split()
print('Read wikipedia data. Data size (number of words)', len(words))
# show the first 1000 words as one string
print("First words:\n" + " ".join(words[:1000]))

In [None]:
len(words)

## Training Word Vectors with FastText

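Enter `FT_wrapper.train?` to get the arguments of the FastText call:

FT_wrapper.train(
    ft_path,
    corpus_file,
    output_file=None,
    model='cbow',
    size=100,
    alpha=0.025,
    window=5,
    min_count=5,
    word_ngrams=1,
    loss='ns',
    sample=0.001,
    negative=5,
    iter=5,
    min_n=3,
    max_n=6,
    sorted_vocab=1,
    threads=12,
)

The signature above uses the argument names of the `FT_wrapper` interface (`size`, `alpha`, `window`, `iter`, ...). The `fasttext.train_unsupervised` call used in this notebook takes the option names listed in the table below. For word vector learning, fastText has the following **options**: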

|option     |           meaning |
|----------------|----------------------------------|
| input             | training file path (required) |
|   model     | unsupervised fasttext model {cbow, skipgram} [skipgram]|
|    lr                | learning rate [0.05]|
|    dim               | size of word vectors [100]|
|    ws                | size of the context window [5]|
|    epoch             | number of epochs [5]|
|    minCount          | minimal number of word occurrences [5]|
|    minn              | min length of char ngram [3]|
|    maxn              | max length of char ngram [6]|
|    neg               | number of negatives sampled [5]|
|    wordNgrams        | max length of word ngram [1]|
|    loss              | loss function {ns, hs, softmax, ova} [ns]|
|    bucket            | number of buckets [2000000]|
|    thread            | number of threads [number of cpus]|
|    lrUpdateRate      | change the rate of updates for the learning rate [100]|
|    t                 | sampling threshold [0.0001]|
|    verbose           | verbose [2]|
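
To make the `minCount` option concrete, the following sketch filters a token list down to the words that occur often enough. The helper `build_vocab` and the toy corpus are hypothetical names introduced for illustration; the library's own vocabulary construction does more than this.

```python
from collections import Counter

def build_vocab(tokens, min_count=5):
    """Return the words that occur at least `min_count` times, with their counts."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus (hypothetical); with the notebook data one could pass `words` instead.
toy_tokens = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(toy_tokens, min_count=2)
print(vocab)
```

Only `the` and `cat` survive here, because every other word occurs just once.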

### Training without n-grams
Learning word vectors on this data can now be achieved with a single command. <br/>Progress is printed on the console from which Jupyter Notebook was started.

(Wall time for the skipgram run: 16min 40s.)

In [None]:
%%time
# train the model without n-grams
model = fasttext.train_unsupervised(text8File,         # input text
                                    model='skipgram',  # 'cbow', 'skipgram'.
                                    dim=100,           # embedding length
                                    maxn=0,            # max. n-gram length; 0 turns subword n-grams off
                                    ws=5,              # window size
                                    thread=10,         # number of threads
                                    epoch=5)           # number of epochs

print(model)

In [None]:
model.save_model('model0')   # save model

In [None]:
model0=fasttext.load_model('model0')

### Effect of Training Parameters

So far, we have run fastText with mostly default parameters, but depending on the data these may not be optimal. The following gives an introduction to some of the key parameters for word vectors.

The most important parameters of the model are its dimension and the range of subword sizes.
* The dimension (`dim`) controls the **size of the vectors**. Larger vectors can capture more information but require more data to be learned, and if they are too large they are harder and slower to train. The default is 100 dimensions; values in the 100-300 range are also popular.
* The **subwords** are all the substrings of a word whose length lies between the minimum size (`minn`) and the maximum size (`maxn`). By default, all subwords between 3 and 6 characters are used, but other ranges may be more appropriate for other languages.
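
The subword extraction can be sketched in a few lines. This is a simplified illustration, not the library's code: the function name `char_ngrams` is hypothetical, and wrapping the word in `<` and `>` boundary markers follows the description in the fastText paper.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of `word` with lengths minn..maxn.

    The word is first wrapped in '<' and '>' boundary markers
    (assumption taken from the fastText paper, simplified here).
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", minn=3, maxn=3))
print(char_ngrams("where", minn=3, maxn=0))  # empty range: no subwords at all
```

With `maxn=0` the length range is empty, which mirrors the "training without n-grams" setting used in this notebook.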

Depending on the quantity of data you have, you may want to change the training parameters.
* The **epoch** parameter controls how many times the training loops over your data. By default, it loops over the dataset 5 times. If your dataset is extremely large, you may want to loop over it less often.
* Another important parameter is the **learning rate** (`lr`). The higher the learning rate, the faster the model converges to a solution, but at the risk of overfitting to the dataset. The default value of 0.05 is a good compromise. If you want to experiment with it, we suggest staying in the range [0.01, 1].
* Finally, fastText is multi-threaded; the command-line tool uses 12 threads by default. If you have fewer CPU cores (say 4), you can set the number of threads with the **thread** option.
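
As a sketch of how such settings could be collected before training, the snippet below merges overrides into the defaults from the options table. The names `DEFAULTS`, `ALLOWED` and `training_params` are hypothetical helpers introduced here; only the option names themselves come from the table. The actual training call is left commented out because it needs the library and the corpus file.

```python
# Defaults copied from the options table above.
DEFAULTS = {"model": "skipgram", "dim": 100, "ws": 5, "epoch": 5,
            "lr": 0.05, "minn": 3, "maxn": 6}
# `thread` has no fixed default in the table, so it is only allowed as an override.
ALLOWED = set(DEFAULTS) | {"thread"}

def training_params(**overrides):
    """Merge overrides into the defaults, rejecting unknown option names."""
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

params = training_params(dim=300, minn=2, maxn=5, epoch=1, thread=4)
print(params)

# With fasttext installed and text8.txt present, training would then be:
# import fasttext
# model = fasttext.train_unsupervised("text8.txt", **params)
```

Rejecting unknown names early catches slips such as writing `threads` or `size` instead of `thread` or `dim`.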

### Printing Word Vectors

In [None]:
print('night\n', model['night'])
print('nights\n', model['nights'])
# scipy's cosine() returns the cosine *distance*, so similarity = 1 - distance
print('cosine similarity "night" "nights" = ',
      1 - scipy.spatial.distance.cosine(model['night'], model['nights']))

### Nearest neighbor queries

A simple way to check the quality of a word vector is to look at its nearest neighbors. This gives an intuition of the type of semantic information the vectors are able to capture.

This can be achieved with the `get_nearest_neighbors` method. For example, we can query the 15 nearest neighbors of a word as follows:

In [None]:
pd.DataFrame(model.get_nearest_neighbors('nights',k=15))

In [None]:
pd.DataFrame(model.get_nearest_neighbors('proton',k=15))

In [None]:
pd.DataFrame(model.get_nearest_neighbors('bank',k=15))

In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors, so we can apply simple similarity measures to them. In particular, we use the **cosine** of the angle between two vectors. This similarity is computed for all words in the vocabulary, and the `k` most similar words are shown (`k=15` in the queries above). The query word itself trivially has a similarity of 1 and is therefore not informative as a neighbor.
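
The ranking described above can be illustrated with a brute-force sketch in plain Python. The two-dimensional vectors are hand-made toy values, not trained embeddings, and `cosine_similarity` and `nearest_neighbors` are hypothetical helper names; excluding the query word from the result is a choice made in this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k=10):
    """Rank all words except `query` by cosine similarity to it."""
    scores = [(cosine_similarity(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scores, reverse=True)[:k]

# Hand-made toy vectors (hypothetical values for illustration only).
toy_vectors = {
    "night":  [1.0, 0.0],
    "nights": [0.9, 0.1],
    "day":    [0.6, 0.8],
    "proton": [0.0, 1.0],
}
print(nearest_neighbors("night", toy_vectors, k=2))
```

The result is a list of `(score, word)` pairs sorted by decreasing similarity.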

### Word Analogies

In a similar spirit, one can play around with word analogies. For example, we can see whether our model can guess what is to France what Berlin is to Germany.

This can be done with the `get_analogies` method. It takes a word triplet (like king man woman) and computes
$$ \begin{aligned}
\mathit{diff} &= emb(king) - emb(man) \\
\mathit{res} &= \mathit{diff} + emb(woman)
\end{aligned} $$
Subsequently, the words whose embeddings are closest to $\mathit{res}$ are selected.
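
The analogy computation can be sketched on hand-made toy vectors. The vector values and the helper names `cosine_similarity` and `analogy` are assumptions made for illustration; skipping the three query words when ranking is likewise a choice of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors, k=1):
    """Rank words by similarity to emb(a) - emb(b) + emb(c), skipping a, b, c."""
    res = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scores = [(cosine_similarity(res, vec), word)
              for word, vec in vectors.items() if word not in (a, b, c)]
    return sorted(scores, reverse=True)[:k]

# Toy vectors with two hand-made axes: (royalty, maleness).
toy_vectors = {
    "king":  [1.0, 1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
print(analogy("king", "man", "woman", toy_vectors))
```

On these toy vectors the top-ranked word is `queen`; with trained embeddings the result depends on the corpus and the training parameters.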


In [None]:
model.get_analogies('king', 'man', 'woman', k=10)

In [None]:
model.get_analogies('doctor', 'man', 'woman', k=10)

In [None]:
model.get_analogies('berlin', 'germany', 'france', k=10)