# Word Embedding tutorial

## Load data
**Warning :**   
The data set used in the present tutorial to train embeddings contains only 50 lines (emails). This is not sufficient to obtain meaningful results.  

Feel free to replace the data set with your own data (at least 10,000 documents are recommended); you should observe a significant improvement in the results. The quality of an embedding can be assessed, for example, by calling the `most_similar` method on a word and checking that the returned words are coherent.

In [1]:
import pandas as pd

df_emails_clean = pd.read_csv('./data/emails_preprocessed.csv', encoding='utf-8', sep=';')
# Artificially increase df size by duplication
df_emails_clean = pd.concat([df_emails_clean] * 100, ignore_index=True) 
df_emails_clean = df_emails_clean[['clean_body']]
df_emails_clean = df_emails_clean.astype(str)

In [2]:
df_emails_clean.clean_body[1]

'je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant lassurance.'

## The Embedding class

Word embeddings are dense vector representations of words in a space whose dimension is much lower than the vocabulary size. One advantage of word embeddings is therefore a reduced computational cost.

There are several methods to train word embeddings. Melusine provides high-level functions to train word embeddings with different methods and benchmark the results. The types of word embeddings available in the **Embedding** class are:
- `lsa_docterm` : Apply a Singular Value Decomposition (SVD) on the DocTerm matrix
- `lsa_tfidf` : Apply a Singular Value Decomposition (SVD) on the TfIdf matrix
- `word2vec_sg` : Train a Word2Vec model using the Skip-Gram method (Warning : time consuming!)
- `word2vec_cbow` : Train a Word2Vec model using the Continuous Bag-Of-Words method.

The Melusine **Embedding** class can be used to benchmark word embeddings training methods in a straightforward manner.
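To give an intuition for the two LSA methods listed above, here is a minimal sketch that builds a document-term count matrix and reduces it with a truncated SVD. It uses NumPy directly on a tiny, made-up corpus; the variable names are illustrative and this is not necessarily how Melusine implements `lsa_docterm` internally.

```python
import numpy as np

# Toy corpus, made up for illustration only
docs = [
    "carte grise nouvelle immatriculation",
    "nouvelle immatriculation assurance",
    "numero telephone assurance",
    "numero telephone contrat",
]

# Build the vocabulary and the document-term count matrix (documents x terms)
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {word: i for i, word in enumerate(vocab)}
doc_term = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokenized):
    for tok in doc:
        doc_term[row, index[tok]] += 1

# Truncated SVD: keep only the n_dimension strongest components
n_dimension = 2
u, s, vt = np.linalg.svd(doc_term, full_matrices=False)

# One n_dimension-sized vector per vocabulary word (terms x n_dimension)
word_vectors = vt[:n_dimension].T * s[:n_dimension]

print(word_vectors.shape)
```

Replacing the raw counts with TF-IDF weights before the SVD gives the idea behind `lsa_tfidf`.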

In [3]:
from melusine.nlp_tools.embedding import Embedding



In [5]:
min_count = 2
n_dimension = 50

embedding_lsa_docterm = Embedding(input_column='clean_body',
                                  vector_size=n_dimension,
                                  min_count=min_count,
                                  method='lsa_docterm')

embedding_lsa_tfidf = Embedding(input_column='clean_body',
                                vector_size=n_dimension,
                                min_count=min_count,
                                method='lsa_tfidf')

embedding_word2vec_sg = Embedding(input_column='clean_body',
                                  vector_size=n_dimension,
                                  min_count=min_count,
                                  method='word2vec_sg')

embedding_word2vec_cbow = Embedding(input_column='clean_body',
                                    vector_size=n_dimension,
                                    min_count=min_count,
                                    method='word2vec_cbow')

embeddings_list = [
    embedding_lsa_docterm,
    embedding_lsa_tfidf,
    embedding_word2vec_sg,
    embedding_word2vec_cbow,
]

In [6]:
for embedding in embeddings_list:
    embedding.train(df_emails_clean)

20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:27 - melusine.nlp_tools.embedding - INFO - Done.
20/0

## Changing the train parameters

Several parameters can be tuned to optimize the training of word embeddings. The most widely used parameters can be specified directly when the **Embedding** object is instantiated:  
* `vector_size` : number of dimensions of the embedding
* `min_count` : minimum number of occurrences of a word for it to be included in the embedding vocabulary

Other training parameters can be specified in the **Embedding** attribute `train_params`, which is a dictionary of parameters. Keep in mind that some training parameters are specific to a given embedding training method.
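The effect of `min_count` can be illustrated with a few lines of plain Python. The sketch below assumes whitespace tokenization and a made-up corpus. It only mirrors the idea of dropping rare words and is not Melusine's or Gensim's actual vocabulary-building code.

```python
from collections import Counter

# Toy corpus, made up for illustration only
docs = [
    "numero telephone assurance",
    "numero telephone contrat",
    "nouvelle assurance",
]

min_count = 2

# Count every token over the whole corpus
counts = Counter(tok for doc in docs for tok in doc.split())

# Keep only the words that occur at least `min_count` times
vocabulary = sorted(word for word, n in counts.items() if n >= min_count)

print(vocabulary)  # ['assurance', 'numero', 'telephone']
```

Words that fall below the threshold ("contrat" and "nouvelle" here) get no vector, so looking them up in the trained embedding fails.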

In [8]:
embedding_word2vec_cbow = Embedding(input_column='clean_body',
                                    vector_size=100,
                                    min_count=3,
                                    method='word2vec_cbow')

In [9]:
print("Train parameters for a Word2Vec CBOW embedding:")
embedding_word2vec_cbow.train_params

Train parameters for a Word2Vec CBOW embedding:


{'vector_size': 100,
 'alpha': 0.025,
 'min_count': 3,
 'max_vocab_size': None,
 'sample': 0.001,
 'seed': 42,
 'workers': 40,
 'min_alpha': 0.0001,
 'negative': 5,
 'hs': 0,
 'ns_exponent': 0.75,
 'cbow_mean': 1,
 'epochs': 15,
 'null_word': 0,
 'trim_rule': None,
 'sorted_vocab': 1,
 'batch_words': 10000,
 'compute_loss': False,
 'callbacks': (),
 'max_final_vocab': None,
 'sg': 0,
 'window': 5}

In [10]:
# Change a training parameter before training the embedding
embedding_word2vec_cbow.train_params["window"] = 3
embedding_word2vec_cbow.train(df_emails_clean)

20/05 04:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Start training for embedding
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Done.
20/05 04:28 - melusine.nlp_tools.embedding - INFO - Done.
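To see what the `window` parameter changed above controls, the following sketch enumerates the (center word, context word) pairs that fall within a given window. The helper `context_pairs` is hypothetical and written for this illustration. Real Word2Vec training adds refinements that are omitted here, such as the subsampling of frequent words controlled by `sample`.

```python
def context_pairs(tokens, window):
    """Return every (center, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "je vous prie de trouver la carte grise".split()

pairs_w1 = context_pairs(tokens, window=1)
pairs_w3 = context_pairs(tokens, window=3)

print(len(pairs_w1), len(pairs_w3))  # 14 36
```

A smaller window yields fewer, more local pairs, which tends to emphasize syntactic proximity; a larger window captures broader topical context at a higher training cost.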


## The Gensim Word2VecKeyedVectors object

Regardless of the selected method (lsa_docterm, word2vec, etc.), the trained embedding is converted to a Gensim **Word2VecKeyedVectors** object and stored in the embedding attribute of the **Embedding** class (`Embedding.embedding`). This is very convenient as it enables the use of all the **Word2VecKeyedVectors** functions. Depending on the installed versions, the attribute may instead hold a Gensim `Word2Vec` model, in which case the keyed vectors are reached through its `wv` attribute. Examples of such functions are:
* `similarity` : Compute the cosine similarity between two words 
* `most_similar` : Compute the words most similar to the input word 
* See more methods in the Gensim documentation

Warning : The Word2VecKeyedVectors object was originally developed for Word2Vec word embeddings. Some functions, such as the "accuracy" function, are therefore specific to Word2Vec embeddings and should not be used if the embedding was trained with one of the LSA methods (`lsa_docterm` or `lsa_tfidf`).
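As a reminder of what `similarity` and `most_similar` compute, here is a small self-contained sketch of cosine similarity and nearest-neighbour ranking. The three-dimensional vectors are invented for the example and the helper functions are hypothetical; Gensim's own implementation is vectorized and more elaborate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by decreasing cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors, for illustration only
vectors = {
    "telephone": [0.9, 0.1, 0.0],
    "numero": [0.8, 0.2, 0.1],
    "manifestation": [0.0, 0.1, 0.9],
}

print(most_similar("numero", vectors))
```

With these made-up vectors, "telephone" ranks first for "numero" because the two vectors point in nearly the same direction.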

In [11]:
print("Cosine similarity between 'telephone' and 'numero'")
# `embedding.embedding` holds a Gensim Word2Vec model here,
# so the keyed vectors are accessed through its `wv` attribute
embedding.embedding.wv.similarity("telephone", "numero")

Cosine similarity between 'telephone' and 'numero'

In [12]:
print("Cosine similarity between 'telephone' and 'manifestation'")
# `embedding.embedding` holds a Gensim Word2Vec model here,
# so the keyed vectors are accessed through its `wv` attribute
embedding.embedding.wv.similarity("telephone", "manifestation")

Cosine similarity between 'telephone' and 'manifestation'

The word "telephone" is closer to the word "numero" than to the word "manifestation".  

In [None]:
# Keyed vectors of a Gensim Word2Vec model are accessed through `wv`
embedding.embedding.wv.most_similar("numero")

In the present example, "telephone" appears as the word most similar to "numero", which is an intuitive outcome. However, due to the very limited amount of training data, some of the returned words are not very relevant (e.g. "joins").