# FastText

Unlike Word2Vec, which works at the word level, FastText tries to capture the morphological information of words.

>*"[...] we propose a new approach **based on the skipgram model, where each word is represented as a bag of character n-grams**. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. [...]"* <br>(Bojanowski et al., Enriching Word Vectors with Subword Information, https://arxiv.org/pdf/1607.04606.pdf)

In this way, a word is represented by its character n-grams.

The n-gram sizes must be set as hyperparameters:
- min_n: minimum value of _n_ to consider
- max_n: maximum value of _n_ to consider
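
As a minimal sketch of how `min_n` and `max_n` shape the decomposition, the helper below wraps a word in the boundary symbols `<` and `>` used in the paper and lists every character n-gram in the range. The name `char_ngrams` is hypothetical (it is not a gensim function), and the sketch ignores both the extra whole-word sequence and the hashing that real implementations apply.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, shortest first.

    Hypothetical helper for illustration only; not part of gensim.
    """
    wrapped = f"<{word}>"  # boundary symbols mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


grams = char_ngrams("procesado", min_n=3, max_n=4)
print(grams[:3], grams[-3:])
```

A word shorter than `min_n` still yields n-grams thanks to the boundary symbols: `char_ngrams("a", 3, 6)` contains the single entry `"<a>"`.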

Example:
>*"Me gusta el procesado del lenguaje natural"* ("I like natural language processing")
>* *Skip-gram* preprocessing example with a context window of 2 words
>
>$w_{target} =$ "procesado" &emsp;$w_{context} =$ ["gusta", "el", "del", "lenguaje"]
>
>     ("procesado", "gusta")
>
> Decomposition into n-grams with min_n=3 and max_n=4:
>
>"procesado" = ["$<$pr", "pro", ..., "ado", "do$>$", "$<$pro", "proc", ..., "sado", "ado$>$"]
>
>* The similarity score is then: <br><br>
>&emsp;$\boxed{s(w_{target}, w_{context}) = \sum_{g \in G_{w_{target}}}z_{g}^T v_{w_{context}}}$, where $G_{w_{target}}\subset\{g_{1}, ..., g_{G}\}$
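
The example above can be sketched in plain Python. The block below generates the (target, context) pairs for a window of 2 words and evaluates the scoring function with toy, randomly initialised vectors. All names (`skipgram_pairs`, `char_ngrams`, `score`, `z`, `v`) are hypothetical and the vectors are random placeholders, so the printed score has no meaning; a trained model would learn these values.

```python
import random


def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within `window` words on each side."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(target, context, z, v):
    """s(target, context) = sum over n-grams g of z_g . v_context."""
    return sum(dot(z[g], v[context]) for g in char_ngrams(target))


tokens = "me gusta el procesado del lenguaje natural".split()
pairs = [p for p in skipgram_pairs(tokens) if p[0] == "procesado"]
print(pairs)

rng = random.Random(0)
dim = 5
# Toy random vectors: z holds n-gram vectors, v holds context-word vectors.
z = {g: [rng.uniform(-1, 1) for _ in range(dim)] for g in char_ngrams("procesado")}
v = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in tokens}
print(score("procesado", "gusta", z, v))
```

Because the dot product is linear, summing the n-gram vectors first and then taking a single dot product with the context vector gives the same score, which is why a word can be treated as the sum of its n-gram vectors.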

## Most similar words

In [1]:
def print_sim_words(word, model1, model2):
    """Print the words most similar to `word` for two models, side by side."""
    query = "Most similar to {}".format(word)
    print(query)
    print("-" * len(query))
    pairs = zip(model1.wv.most_similar(word), model2.wv.most_similar(word))
    for (word1, score1), (word2, score2) in pairs:
        # Pad each "word:" label to 21 characters so the scores line up.
        print("{:<21}{:.3f}{}{:<21}{:.3f}".format(word1 + ":", score1,
                                                  " " * 10,
                                                  word2 + ":", score2))
    print("\n")

## Import the libraries

In [2]:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases, Phraser

## Reading the data

In [3]:
# `unzip` is a system utility, not a Python package, so `pip install unzip` is not needed.
# The archive must already be in the working directory.
!unzip df_clean_simpsons.csv.zip


In [5]:
import pandas as pd
df_clean = pd.read_csv('./df_clean_simpsons.csv')

In [6]:

sent = [row.split() for row in df_clean['clean']]
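
A small sketch of what this whitespace tokenization produces, using made-up rows in place of the real `clean` column. The `isinstance` guard is an added precaution that is not in the original cell: pandas reads empty CSV cells as `NaN` floats, which have no `.split()` method.

```python
# Made-up rows standing in for df_clean['clean']; NaN mimics an empty CSV cell.
rows = ["homer eat donut", float("nan"), "bart ride skateboard"]

# Split each row on whitespace, skipping anything that is not a string.
sent_demo = [row.split() for row in rows if isinstance(row, str)]
print(sent_demo)
```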

## Hyperparameters

In [7]:
sg_params = {
    'sg': 1,
    'vector_size': 300,
    'min_count': 5,
    'window': 5,
    'hs': 0,
    'negative': 20,
    'workers': 4,
    'min_n': 3,
    'max_n': 6
}



## Initialize the FastText object

In [8]:
help(FastText)

Help on class FastText in module gensim.models.fasttext:

class FastText(gensim.models.word2vec.Word2Vec)
 |  FastText(sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=(), max_final_vocab=None, shrink_windows=True)
 |  
 |  Method resolution order:
 |      FastText
 |      gensim.models.word2vec.Word2Vec
 |      gensim.utils.SaveLoad
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in functi

In [9]:
# Skip Gram
ft_sg = FastText(**sg_params)



## Build the vocabulary

In [10]:
# Skip Gram
ft_sg.build_vocab(sent)



In [11]:
print('Vocabulary of {} words'.format(len(ft_sg.wv.key_to_index)))


Vocabulary of 6130 words


## Train the embedding weights

In [12]:
# Skip Gram


ft_sg.train(sent, total_examples=len(sent), epochs=20)


(5024696, 6116860)

## Save the model

In [13]:
ft_sg.save('./w2v_model_fast.pkl')


## Some results

In [14]:
ft_sg.wv.most_similar(positive=["homer"])

[('knockahomer', 0.6621387600898743),
 ('homey', 0.6509182453155518),
 ('astronomer', 0.5661692023277283),
 ('customer', 0.5487726330757141),
 ('homemade', 0.5278177261352539),
 ('mer', 0.5139923691749573),
 ('carrier', 0.5022217631340027),
 ('home', 0.5019287467002869),
 ('somewhat', 0.5005836486816406),
 ('margarita', 0.49495843052864075)]

In [15]:
ft_sg.wv.most_similar(positive=["marge"])

[('sarge', 0.7090577483177185),
 ('margarita', 0.7005890011787415),
 ('margie', 0.6558294892311096),
 ('marjorie', 0.5643143653869629),
 ('large', 0.5106120705604553),
 ('married', 0.5083991289138794),
 ('urge', 0.5059846043586731),
 ('marriage', 0.4967114329338074),
 ('argue', 0.4876605272293091),
 ('march', 0.4851870536804199)]

In [16]:
ft_sg.wv.most_similar(positive=["bart"])

[('barty', 0.6337318420410156),
 ('bartron', 0.5803127288818359),
 ('barf', 0.5655282139778137),
 ('bartholomew', 0.5616724491119385),
 ('dart', 0.5523513555526733),
 ('baryshnikov', 0.5475201606750488),
 ('fart', 0.5405943393707275),
 ('barbara', 0.5387760400772095),
 ('gypsy', 0.522591769695282),
 ('art', 0.5124298334121704)]

In [17]:
ft_sg.wv.similarity('maggie', 'baby')

0.40071344

In [18]:
ft_sg.wv.similarity('bart', 'nelson')

0.34023723
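
The values returned by `similarity` are cosine similarities between the two word vectors. As a minimal sketch of that measure, the hypothetical helper below computes it for plain Python lists; the toy 2-dimensional vectors are made up and unrelated to the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The score ranges from -1 (opposite directions) to 1 (same direction), so values around 0.3 to 0.4 like the ones above indicate a moderate relationship.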

In [19]:
ft_sg.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])

'milhouse'

In [20]:
ft_sg.wv.doesnt_match(['homer', 'patty', 'selma'])

'homer'
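
A rough sketch of the idea behind `doesnt_match`, assuming it picks the word whose unit-normalised vector is least similar to the mean of the group (this is an assumption about gensim's behaviour, not a copy of its code). The helper `odd_one_out` and the toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the group mean."""
    units = {k: unit(v) for k, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Cosine similarity with the mean reduces to a dot product of unit vectors.
    return min(units, key=lambda k: sum(a * b for a, b in zip(units[k], mean)))


# Toy vectors: "a" and "b" point roughly the same way, "c" does not.
toy = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [-0.2, 1.0]}
print(odd_one_out(toy))
```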

## Out-of-Vocabulary (OOV) Words

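FastText creates so many n-grams during training that it is unlikely (though not impossible) that a word cannot be built as a bag of known n-grams. The model can therefore return a vector even for a word that is not in its vocabulary, as the cells below show for `asereje`.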

In [21]:
'asereje' in ft_sg.wv.key_to_index

False

In [22]:
ft_sg.wv.most_similar('asereje')

[('eraser', 0.6586302518844604),
 ('taser', 0.657060444355011),
 ('lu', 0.6406767964363098),
 ('eliza', 0.6382820010185242),
 ('shredded', 0.6241763234138489),
 ('laser', 0.6193310022354126),
 ('cease', 0.6184722185134888),
 ('whereabouts', 0.6139325499534607),
 ('buddhist', 0.6138971447944641),
 ('huzzah', 0.6137781143188477)]

In [23]:
ft_sg.wv['asereje'].shape

(300,)
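
A minimal sketch of how a vector for an out-of-vocabulary word can be assembled, assuming each character n-gram is hashed into a fixed table of bucket vectors and the word vector is their average. Everything here is illustrative: the table is random, `BUCKETS` is a toy size (the model above uses the default `bucket=2000000`), and CRC32 is only a stand-in for the hash function a real implementation uses.

```python
import random
import zlib

DIM = 4
BUCKETS = 50  # toy size for illustration

rng = random.Random(0)
# Toy bucket table; in a trained model these rows are learned.
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def bucket_of(ngram):
    # Stand-in hash (CRC32), chosen only because it is in the standard library.
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS


def oov_vector(word):
    """Average the bucket vectors of the word's character n-grams."""
    rows = [bucket_vectors[bucket_of(g)] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


vec = oov_vector("asereje")
print(len(vec))

# n-grams shared with an in-vocabulary word hint at why it ranks as a neighbour.
shared = set(char_ngrams("asereje")) & set(char_ngrams("taser"))
print(sorted(shared))
```

Shared n-grams such as `ase` and `ser` are a plausible reason why `eraser`, `taser` and `laser` appear among the nearest neighbours of `asereje`: the similarity reflects spelling, not meaning.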