## Word2VecTrainer tutorial
Word embeddings are dense vector representations of words in a vector space with far fewer dimensions than a one-hot encoding of the vocabulary. One advantage of this compactness is a lower computational cost. The Melusine `Word2VecTrainer` class trains a **Word2Vec** model. The trained embedding is then used in the Models subpackage to train a neural network that classifies emails.
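
To make the cost argument concrete, here is a minimal sketch comparing a one-hot encoding with a dense embedding. The vocabulary and the 3-dimensional vectors are made up for illustration; they are not produced by Melusine or by Word2Vec.

```python
# Toy vocabulary (hypothetical, for illustration only)
vocabulary = ["vehicule", "voiture", "assurance", "carte", "immatriculation"]


def one_hot(word, vocabulary):
    """Return a one-hot vector: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


# Hand-written dense vectors standing in for learned embeddings (assumption)
dense_vectors = {
    "vehicule": [0.9, 0.1, 0.0],
    "voiture": [0.8, 0.2, 0.1],
    "assurance": [0.1, 0.9, 0.2],
    "carte": [0.2, 0.3, 0.9],
    "immatriculation": [0.3, 0.2, 0.8],
}

one_hot_size = len(one_hot("vehicule", vocabulary))
dense_size = len(dense_vectors["vehicule"])

# The one-hot size grows with the vocabulary; the dense size stays fixed.
print(f"one-hot dimensions: {one_hot_size}, dense dimensions: {dense_size}")
```

With a realistic vocabulary of tens of thousands of words, the one-hot vector would have that many dimensions, whereas the size of a dense vector is chosen once and does not depend on the vocabulary.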

## Load data
**Warning:**  
The data set used in this tutorial to train embeddings contains only 50 emails. This is not sufficient to obtain meaningful results.

Feel free to replace the data set with your own data (at least 10,000 documents are recommended); you should observe a significant improvement in the results. The quality of an embedding can be assessed, for example, by calling the `most_similar` method on a word and checking that the returned words are semantically coherent.

In [3]:
import pandas as pd
from melusine import load_email_data

df_emails_preprocessed = load_email_data(type="preprocessed")
# Artificially increase df size by duplication
df_emails_preprocessed = pd.concat([df_emails_preprocessed] * 100, ignore_index=True) 

In [4]:
df_emails_preprocessed.clean_body[1]

'je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant lassurance.'

## The Word2VecTrainer class

The arguments of a `Word2VecTrainer` object are:
- **input_column:** the name of the column containing the input text data.
- **tokens_column:** the name of the column containing the tokens  
  (created by the `Word2VecTrainer` class if necessary).
- **tokenizer:** the Tokenizer object used to split the text into tokens.
- **kwargs:** parameters for the Word2Vec model training (see the Gensim Word2Vec documentation).
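
As a rough illustration of what a tokens column holds and of what a keyword argument such as Gensim's `min_count` controls, the sketch below splits text with a simple regular expression and drops rare tokens. The `simple_tokenize` helper and the sample texts are hypothetical; Melusine's own Tokenizer and Gensim's vocabulary building are more elaborate.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Hypothetical stand-in for a tokenizer: lowercase, then extract word runs."""
    return re.findall(r"\w+", text.lower())


# Made-up sample texts (not taken from the Melusine data set)
texts = [
    "je vous informe que la carte grise est faite",
    "je vous prie de trouver la carte grise",
    "je vous demande la carte",
]

# Equivalent of a tokens column: one list of tokens per document
tokens = [simple_tokenize(text) for text in texts]

# Count token frequencies over the whole corpus
frequencies = Counter(token for doc in tokens for token in doc)

# Keep only tokens seen at least `min_count` times, mirroring the idea
# behind Gensim's `min_count` parameter
min_count = 3
vocabulary = sorted(t for t, n in frequencies.items() if n >= min_count)
print(vocabulary)
```

One side effect of duplicating a small data set is that every token count is multiplied, so more words pass a `min_count` threshold, although no new contexts are added.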

In [5]:
from melusine.nlp_tools.embedding import Word2VecTrainer



## Train the word embeddings model

In [8]:
# Instantiate the trainer
embedding_trainer = Word2VecTrainer(
    input_column='clean_body',
    workers=4,
    min_count=3
)

In [9]:
# Train the word embeddings model
embedding_trainer.train(df_emails_preprocessed)

In [10]:
# Test the trained embedding
embedding = embedding_trainer.embedding
embedding.most_similar("vehicule")

[('etait', 0.7717979550361633),
 ('resilier', 0.7692007422447205),
 ('stationnement', 0.7630469799041748),
 ('vente', 0.7250186204910278),
 ('presente', 0.7154752612113953),
 ('cession', 0.7047790288925171),
 ('certificat', 0.6921747326850891),
 ('droite', 0.6782360076904297),
 ('pare-choc', 0.6758974194526672),
 ('recevoir', 0.6740345358848572)]
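
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that kind of ranking in plain Python on hand-written vectors. The `toy_vectors` values and the `most_similar_toy` helper are hypothetical and only illustrate the idea; they are not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-written vectors (assumption, not trained embeddings)
toy_vectors = {
    "vehicule": [0.9, 0.1, 0.0],
    "voiture": [0.8, 0.2, 0.1],
    "stationnement": [0.6, 0.1, 0.5],
    "assurance": [0.1, 0.9, 0.2],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbour, score in most_similar_toy("vehicule", toy_vectors):
    print(f"{neighbour}: {score:.3f}")
```

On a well-trained embedding, the top-ranked neighbours of a word should be semantically related to it. With the small data set used here, some of the neighbours listed above are not obviously related to "vehicule", which is consistent with the warning in the data loading section.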

## Saving embeddings

In [12]:
embedding.save('./data/my_embedding')

## Loading embeddings

In [13]:
from gensim.models import KeyedVectors

In [14]:
embedding = KeyedVectors.load('./data/my_embedding')