# NLP Tools tutorial

The **nlp_tools** subpackage provides classic NLP tools, implemented as classes, to preprocess text that has already been cleaned:
- a **Phraser class**: transforms common multi-word expressions into single elements (*new york* becomes *new_york*).
- a **Tokenizer class**: splits a sentence-like string into a list of sub-strings (tokens).
- an **Embedding class**: represents words in a lower-dimensional vector space.

## The Phraser class

The Phraser class transforms common multi-word expressions into single elements: for example *new york* becomes *new_york*.

The arguments of a Phraser object are:
- **input_column:** the name of the dataframe column used as input to train the Phraser.
- **common_terms:** list of stopwords to be ignored. The default list is defined in the *conf.json* file.
- **threshold:** minimum score a candidate expression must reach to be kept as a collocation (a higher value yields fewer phrases).
- **min_count:** minimum number of occurrences for a candidate expression to be considered as a collocation.
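
To make the roles of `threshold` and `min_count` more concrete, here is a minimal sketch of collocation scoring. It assumes a scoring rule of the form `(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b))`, which is the default in gensim's `Phrases`; the tutorial does not state which formula the Melusine Phraser uses, so treat this as an illustration only. The helper name `score_bigrams` and the toy sentences are hypothetical.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Return the bigrams whose collocation score exceeds `threshold`.

    Illustrative only: the scoring rule is an assumption, not Melusine's code.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    selected = {}
    for (first, second), pair_count in bigrams.items():
        # Pairs seen fewer than `min_count` times are never candidates.
        if pair_count < min_count:
            continue
        score = (pair_count - min_count) * vocab_size / (unigrams[first] * unigrams[second])
        if score > threshold:
            selected[(first, second)] = score
    return selected


sentences = [
    "je vis a new york".split(),
    "new york est grand".split(),
    "il aime new york".split(),
    "je vis ici".split(),
]
collocations = score_bigrams(sentences, min_count=1, threshold=1.0)
print(sorted(collocations))
```

Raising either `threshold` or `min_count` in this sketch reduces the number of selected pairs, which matches the intent of the two arguments described above.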

In [2]:
from melusine.nlp_tools.phraser import Phraser

phraser = Phraser(input_column='clean_body',
                  threshold=10,
                  min_count=10)

#### Training a phraser

The input dataframe must contain a column of clean text: **a sentence-like string with only lowercase letters and no accents**.
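
The expected input format can be illustrated with a small normalization helper. This is a sketch using only the standard library; `to_clean_text` is a hypothetical name and is not Melusine's own cleaning pipeline, which may apply further steps.

```python
import unicodedata


def to_clean_text(text):
    """Lowercase `text` and strip accents (illustrative helper, not part of Melusine)."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Dropping the combining marks leaves the unaccented base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned = to_clean_text("Les changements nécessaires concernant l'Assurance")
print(cleaned)
```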

In [3]:
import pandas as pd

df_emails_clean = pd.read_csv('./data/emails_preprocessed.csv', encoding='utf-8', sep=';')
df_emails_clean = df_emails_clean[['clean_body']]
df_emails_clean = df_emails_clean.astype(str)

In [4]:
df_emails_clean.clean_body[1]

'je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant l assurance.'

In [5]:
phraser.train(df_emails_clean)

#### Saving a phraser

In [6]:
phraser.save('./data/phraser.pickle')

#### Loading a phraser

In [7]:
phraser = Phraser().load('./data/phraser.pickle')

#### Applying a phraser

The main method of a Phraser object is its *train* method. To apply a specific phraser, pass it as an argument to one of the following functions:
- **phraser_on_body:** applies the phraser to the *clean_body* column of a dataframe
- **phraser_on_header:** applies the phraser to the *clean_header* column of a dataframe

The **phraser_on_body** and **phraser_on_header** functions are applied on rows of dataframes.

In [8]:
from melusine.nlp_tools.phraser import phraser_on_body

row = df_emails_clean.loc[1,:]

phraser_on_body(row, phraser)

'je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant l assurance.'
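
The output above is identical to the input, which suggests that no learned expression occurs in this email. To show what the transformation looks like when an expression is known, here is a minimal sketch; `apply_phrases` and the phrase list are hypothetical and do not reproduce the behaviour of `phraser_on_body` exactly.

```python
import re


def apply_phrases(text, phrases):
    """Join each known multi-word expression with underscores (illustrative)."""
    # Longer expressions are handled first so that they are not split by shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
        text = re.sub(pattern, "_".join(words), text)
    return text


row = {"clean_body": "je vous prie de trouver la carte grise"}
row["clean_body"] = apply_phrases(row["clean_body"], ["carte grise"])
print(row["clean_body"])
```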

Because the **phraser_on_body** and **phraser_on_header** functions operate on single rows, they have to be passed as arguments to a **TransformerScheduler object** in order to be applied to a whole dataframe.

In [9]:
from melusine.utils.transformer_scheduler import TransformerScheduler

PhraserTransformer = TransformerScheduler(
    functions_scheduler=[
        (phraser_on_body, (phraser,), ['clean_body'])
    ]
)

In [10]:
df_emails_clean = PhraserTransformer.fit_transform(df_emails_clean)
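
Conceptually, each entry of `functions_scheduler` is a `(function, arguments, output columns)` tuple that is applied to every row. The sketch below mimics that idea with plain dictionaries instead of a dataframe; `run_schedule` and `join_terms` are hypothetical names, and the real `TransformerScheduler` may work differently internally.

```python
def run_schedule(rows, functions_scheduler):
    """Apply each (function, args, columns) entry to every row, in order."""
    for function, args, columns in functions_scheduler:
        for row in rows:
            result = function(row, *args)
            if len(columns) == 1:
                # A single output column receives the return value as-is.
                row[columns[0]] = result
            else:
                # Several output columns receive one returned value each.
                for column, value in zip(columns, result):
                    row[column] = value
    return rows


def join_terms(row, phrases):
    """Toy row-wise function: join known expressions with underscores."""
    text = row["clean_body"]
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


rows = [{"clean_body": "la carte grise est prete"}, {"clean_body": "bonjour"}]
rows = run_schedule(rows, [(join_terms, (["carte grise"],), ["clean_body"])])
print(rows)
```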

## The Tokenizer class

The Tokenizer class splits a sentence-like string into a list of sub-strings (tokens). 

The arguments of a Tokenizer object are:
- **input_column:** the text column to tokenize.
- **stopwords:** the list of words to remove from the list of tokens. Defaults to the list defined in the *conf.json* file.
- **stop_removal:** whether to remove stopwords. Defaults to False.
- **n_jobs:** the number of cores used for computation. Defaults to 20.
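
The interaction between `stopwords` and `stop_removal` can be sketched as follows. This is a simplified stand-in, assuming a word-character regular expression for splitting; the actual Melusine tokenizer may use a different pattern, and the stopword set here is a hypothetical example rather than the default from *conf.json*.

```python
import re


def tokenize(text, stopwords=(), stop_removal=False):
    """Split `text` into word tokens, optionally dropping stopwords (illustrative)."""
    # \w+ keeps underscores, so a phrase such as new_york stays a single token.
    tokens = re.findall(r"\w+", text)
    if stop_removal:
        tokens = [token for token in tokens if token not in stopwords]
    return tokens


stopwords = {"je", "vous", "que", "la", "est"}
text = "je vous informe que la nouvelle immatriculation est enfin faite."
tokens = tokenize(text, stopwords, stop_removal=True)
print(tokens)
```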

In [11]:
from melusine.nlp_tools.tokenizer import Tokenizer

tokenizer = Tokenizer(input_column='clean_body',
                      stop_removal=True,
                      n_jobs=20)

#### Applying a Tokenizer

Use the **fit_transform** method on a dataframe to create a new **tokens** column.

In [12]:
df_emails_clean = tokenizer.fit_transform(df_emails_clean)

In [13]:
df_emails_clean.clean_body[1]

'je vous informe que la nouvelle immatriculation est enfin faite. je vous prie de trouver donc la carte grise ainsi que la nouvelle immatriculation. je vous demanderai de faire les changements necessaires concernant l assurance.'

In [14]:
df_emails_clean.tokens[1]

['informe',
 'nouvelle',
 'immatriculation',
 'enfin',
 'faite',
 'prie',
 'trouver',
 'donc',
 'carte',
 'grise',
 'ainsi',
 'nouvelle',
 'immatriculation',
 'demanderai',
 'faire',
 'les',
 'changements',
 'necessaires',
 'concernant',
 'assurance']

#### Saving a Tokenizer

In [15]:
import joblib
_ = joblib.dump(tokenizer, "./data/tokenizer.pickle", compress=True)

#### Loading a Tokenizer 

In [16]:
tokenizer = joblib.load("./data/tokenizer.pickle")

## The Embedding class

Word embeddings are abstract representations of words in a lower-dimensional vector space. Because each word is described by a short dense vector rather than a vocabulary-sized one, embeddings reduce computational cost. The Melusine Embedding class uses a **Word2Vec** model. The trained Embedding object is then used in the Models subpackage to train a neural network that classifies emails.

The arguments of an Embedding object are:
- **input_column:** the name of the column used as input for the training.
- **workers:** the number of cores used for computation. Defaults to 40.
- **seed:** seed for the embedding model.
- **iter:** number of iterations for the training.
- **size:** dimension of the embeddings.
- **window:** size of the context window, i.e. the maximum distance between a word and the surrounding words used as its context in Word2Vec.
- **min_count:** minimum number of occurrences for a word to be taken into account.
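
In Word2Vec models, the `window` parameter is the maximum distance between a word and the context words used to predict it. The following sketch lists the (word, context) pairs that a given window produces; `context_pairs` is a hypothetical helper written for illustration and is not part of Melusine or of any Word2Vec library.

```python
def context_pairs(tokens, window=2):
    """List the (center, context) pairs found within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(["nouvelle", "carte", "grise", "recue"], window=1)
print(pairs)
```

With `window=1` only direct neighbours are paired; a larger window adds more distant words to each context.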

In [17]:
from melusine.nlp_tools.embedding import Embedding

embedding = Embedding(input_column='clean_body',
                      size=300,
                      workers=4,
                      min_count=3)

#### Training embeddings

In [18]:
embedding.train(df_emails_clean)

#### Saving embeddings

In [19]:
embedding.save('./data/embedding.pickle')

#### Loading embeddings

In [20]:
embedding = Embedding().load('./data/embedding.pickle')

### Different types of embeddings

The types of embedding available in the **Embedding** class are:
- `lsa_docterm`: applies a Singular Value Decomposition (SVD) to the DocTerm matrix.
- `lsa_tfidf`: applies a Singular Value Decomposition (SVD) to the TfIdf matrix.
- `word2vec_sg`: trains a Word2Vec model using the Skip-Gram method (warning: time-consuming!).
- `word2vec_ns`: trains a Word2Vec model using the Negative-Sampling method.
- `word2vec_cbow`: trains a Word2Vec model using the Continuous Bag-Of-Words method.
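
The two `lsa_*` methods start from a document-term matrix, either raw counts or TF-IDF weights. The sketch below builds both matrices in plain Python to show what the SVD is applied to. It uses the textbook weighting `tf * log(N / df)`, which is an assumption: the weighting used inside Melusine may differ (for example, a smoothed variant). The decomposition step itself is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def docterm_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    vocab = sorted({token for doc in docs for token in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tfidf_matrix(counts):
    """Weight raw counts by log(N / df), the textbook inverse document frequency."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / freq) for freq in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


docs = [["carte", "grise"], ["carte", "assurance"]]
vocab, counts = docterm_matrix(docs)
tfidf = tfidf_matrix(counts)
print(vocab)
print(counts)
```

In this toy corpus the term *carte* appears in every document, so its TF-IDF weight is zero, while the terms that distinguish the documents keep a positive weight.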

In [21]:
embedding = Embedding(input_column='clean_body',
                      size=300,
                      min_count=3,
                      method='lsa_tfidf')
embedding.train(df_emails_clean)

### Specify a tokens column instead of a text column

There are two ways to provide the text input to the **Embedding** class:
- `input_column`: provide a raw text column. The Embedding class tokenizes it and creates a tokens generator, which then yields the tokens used to train the model.
- `tokens_column`: provide a column containing lists of tokens. The Embedding class uses this list of token lists directly to train the embedding model.
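
The difference between the two input modes is essentially lazy versus precomputed tokenization. The sketch below illustrates it with a whitespace split standing in for the real tokenizer; `tokens_generator` is a hypothetical name, and the way the Embedding class builds its generator is not shown in this tutorial.

```python
def tokens_generator(texts):
    """Yield one token list per text, tokenizing only when requested."""
    for text in texts:
        yield text.split()


texts = ["je vous informe", "la carte grise"]

# Raw text input: tokens are produced lazily, one document at a time.
lazy_tokens = tokens_generator(texts)
first_document = next(lazy_tokens)

# Tokens input: the full list of token lists already exists in memory.
eager_tokens = [text.split() for text in texts]
print(first_document, eager_tokens)
```

A generator keeps memory usage low on large corpora, while a precomputed tokens column avoids tokenizing the same text again for each new model.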

In [22]:
embedding = Embedding(tokens_column='tokens',
                      size=300,
                      min_count=3,
                     )
embedding.train(df_emails_clean)