In [1]:
import joblib
import pandas as pd
from tempfile import TemporaryDirectory

# NLP Tools tutorial

The **nlp_tools** subpackage offers classic NLP tools, implemented as classes, to preprocess text that has already been cleaned:
- a **Tokenizer class**: splits a sentence-like string into a list of sub-strings (tokens).
- a **Phraser class**: transforms common multi-word expressions into single elements (*new york* becomes *new_york*).
- an **Embedding class**: represents words in a lower-dimensional vector space.

## Load data

In [2]:
from melusine.data.data_loader import load_email_data
df_emails = load_email_data(type="preprocessed")

## The Tokenizer class
The Tokenizer class splits a sentence-like string into a list of sub-strings (tokens).  

A Tokenizer object takes only two arguments: **input_column** and **output_column**.

In [3]:
from melusine.nlp_tools.tokenizer import Tokenizer
tokenizer = Tokenizer(input_column="clean_body", output_column="tokens")

Use the **fit_transform** method on a dataframe to create a new **tokens** column.

In [4]:
df_emails = tokenizer.fit_transform(df_emails)
print("Base text")
print(df_emails.clean_body[1])
print("\nTokenized text")
print(df_emails.tokens[1])

Base text
je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant lassurance.

Tokenized text
['informe', 'nouvelle', 'immatriculation', 'enfin', 'faite', 'prie', 'trouver', 'donc', 'carte', 'grise', 'ainsi', 'nouvelle', 'immatriculation', 'demanderai', 'faire', 'les', 'changements', 'necessaires', 'concernant', 'lassurance']
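
The output above shows that punctuation is stripped and that very common words (*je*, *vous*, *que*, *la*, *est*) are dropped. A minimal sketch of that behaviour, using a regular expression and a stopword set, could look like the following. It is not Melusine's implementation: `simple_tokenize` and `STOPWORDS` are hypothetical names, and the stopword set is chosen only to mimic the example.

```python
import re

# Hypothetical stopword set, for illustration only.
# Melusine uses its own list, which is not reproduced here.
STOPWORDS = {"je", "vous", "que", "la", "est", "de"}


def simple_tokenize(text, stopwords=STOPWORDS):
    """Split a lowercase string into word tokens and drop stopwords."""
    tokens = re.findall(r"\w+", text)  # keeps runs of letters/digits, drops punctuation
    return [tok for tok in tokens if tok not in stopwords]


print(simple_tokenize("je vous informe que la nouvelle immatriculation est enfin faite."))
# ['informe', 'nouvelle', 'immatriculation', 'enfin', 'faite']
```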


#### Load / Save a tokenizer

In [None]:
with TemporaryDirectory() as tmpdir:
    path = f"{tmpdir}/tokenizer.pkl"
    _ = joblib.dump(tokenizer, path, compress=True)
    tokenizer_reload = joblib.load(path)

In [None]:
df_emails = tokenizer_reload.fit_transform(df_emails)
print(df_emails.tokens[1])

## The Phraser class

The Phraser class transforms common multi-word expressions into single elements: for example *new york* becomes *new_york*.

The arguments of a Phraser object are:
- **input_column :** the name of the dataframe column used as input to train the Phraser.
- **output_column :** the name of the dataframe column that receives the phrased tokens.
- **common_terms :** list of stopwords to be ignored. The default list is defined in the *conf.json* file.
- **threshold :** score threshold used to select collocations.
- **min_count :** minimum number of occurrences for a word to be selected as part of a collocation.
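
To get a feel for how **threshold** and **min_count** interact, the sketch below scores adjacent word pairs with the collocation formula of Mikolov et al., which is the formula behind the default scorer of gensim's `Phrases`. This is a simplified illustration under the assumption that the Phraser relies on a score of this kind; `find_collocations` and the toy sentences are hypothetical.

```python
from collections import Counter


def find_collocations(sentences, threshold, min_count):
    """Return the bigrams whose collocation score exceeds `threshold`."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    selected = {}
    for (a, b), n_ab in bigrams.items():
        # score = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            selected[(a, b)] = score
    return selected


# Toy corpus (made up for this example)
sentences = [
    ["bulletin", "salaire", "janvier"],
    ["bulletin", "salaire", "fevrier"],
    ["bulletin", "salaire", "mars"],
    ["carte", "grise"],
]
print(find_collocations(sentences, threshold=1.0, min_count=1))
```

Only the pair (*bulletin*, *salaire*) is kept here: it occurs three times, whereas every other pair occurs once and scores zero after `min_count` is subtracted. Raising `threshold` or `min_count` makes the selection stricter.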

In [None]:
from melusine.nlp_tools.phraser import Phraser

phraser = Phraser(
    input_column='tokens',
    output_column='phrased_tokens',
    threshold=5,
    min_count=2
)

#### Training a phraser

The input dataframe must contain a column with clean text: **a sentence-like string with only lowercase letters and no accents**.

In [None]:
_ = phraser.fit(df_emails)
df_emails = phraser.transform(df_emails)
print(df_emails['phrased_tokens'].iloc[3])

Expected result: you should see phrased tokens such as  
"bulletin" + "salaire" = "bulletin_salaire"

#### Load/Save a phraser

In [None]:
with TemporaryDirectory() as tmpdir:
    path = f"{tmpdir}/phraser.pkl"
    _ = joblib.dump(phraser, path, compress=True)
    phraser_reload = joblib.load(path)
    
print(phraser_reload.transform(df_emails)['phrased_tokens'].iloc[3])

## The Embedding class

Word embeddings are abstract representations of words in a vector space whose dimension is much lower than the size of the vocabulary. One advantage of this compact representation is a lower computational cost. The Melusine Embedding class uses a **Word2Vec** model. The trained Embedding object will be used in the Models subpackage to train a neural network to classify emails.

The arguments of an Embedding object are:
- **tokens_column :** the name of the column used as input for the training (this is the keyword used in the example below).
- **workers :** the number of cores used for computation. Default: 40.
- **seed :** seed for the embedding model.
- **iter :** number of iterations for the training.
- **size :** dimension of the embeddings.
- **window :** for a Word2Vec model, the maximum distance between the current word and the words used as its context.
- **min_count :** minimum number of occurrences for a word to be taken into account.
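
The effect of **window** is easiest to see on a toy sentence. The sketch below lists the (target, context) pairs that a Word2Vec-style model with a symmetric window would consider. It is a simplified illustration: `context_pairs` is a hypothetical helper, not part of Melusine, and real implementations add refinements that are not shown here.

```python
def context_pairs(tokens, window):
    """Return the (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["carte", "grise", "nouvelle", "immatriculation"]
print(context_pairs(tokens, window=1))
# [('carte', 'grise'), ('grise', 'carte'), ('grise', 'nouvelle'),
#  ('nouvelle', 'grise'), ('nouvelle', 'immatriculation'), ('immatriculation', 'nouvelle')]
```

A larger window produces more pairs per word and tends to capture broader, more topical relationships, at a higher training cost.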

In [None]:
from melusine.nlp_tools.embedding import Embedding

embedding = Embedding(
    tokens_column='tokens',
    size=300,
    workers=4,
    min_count=3
)

#### Training embeddings

In [None]:
embedding.train(df_emails)
embedding.embedding.most_similar("vehicule")

Warning: the embedding is trained on a very small dataset, so the results shown here are not meaningful.
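
For intuition about what the `most_similar` call returns: with a single query word, gensim's `most_similar` ranks the other words by the cosine similarity of their vectors. The sketch below reproduces that ranking on made-up 3-dimensional vectors; `toy_vectors`, `cosine` and `most_similar_words` are hypothetical and only serve the illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, not the output of a trained model
toy_vectors = {
    "vehicule": [1.0, 0.0, 0.0],
    "voiture": [0.9, 0.1, 0.0],
    "salaire": [0.0, 1.0, 0.0],
}


def most_similar_words(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


print(most_similar_words("vehicule", toy_vectors))
```

Here *voiture* ranks first because its vector points in almost the same direction as *vehicule*, while *salaire* is orthogonal to it and gets a similarity of 0.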