# Melusine Tokenizer Tutorial

The Melusine tokenizer is a key component of the Melusine package.  
It handles the tokenization pipeline in the broad sense, including:  
* Text normalization
* Lowercasing
* General text flagging
* Tokenization (in the strict sense: splitting text into tokens)
* Stopword removal
* (Efficient) Name flagging

## Load email data

In [1]:
from melusine import load_email_data
import pandas as pd

In [2]:
df_emails_preprocessed = load_email_data(type="preprocessed")

## The Tokenizer class

The Tokenizer class splits a sentence-like string into a list of sub-strings (tokens). 

The arguments of a Tokenizer object are:
* tokenizer_regex: (str) Regex used to split the text into tokens
* normalization: (str or None) Type of Unicode normalization to apply to the text  
  (Possible values are None, 'NFC', 'NFKC', 'NFD', and 'NFKD')
* lowercase: (bool) If True, lowercase the text
* stopwords: (list) List of words to be removed
* remove_stopwords: (bool) If True, stopword removal is enabled
* flag_dict: (dict) Flagging dict with regexes as keys and replacement text as values
* collocations_dict: (dict) Dict of expressions to be grouped into a single unit of meaning
* names: (list) List of names to be flagged
* name_flag: (str) Replacement value for names
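
To make the role of the normalization, lowercase, tokenizer_regex and stopwords arguments more concrete, here is a minimal standard-library sketch of such a pipeline. It is not Melusine's implementation: the function name sketch_tokenize, the order of the steps and the accent-stripping step are assumptions made for illustration only.

```python
import re
import unicodedata


def sketch_tokenize(text, tokenizer_regex=r"\w+(?:[\?\-\"_]\w+)*",
                    normalization="NFKD", lowercase=True, stopwords=()):
    """Hypothetical helper illustrating the arguments; not part of Melusine."""
    if normalization:
        # Unicode normalization ('NFC', 'NFKC', 'NFD' or 'NFKD')
        text = unicodedata.normalize(normalization, text)
        # Assumption: drop combining marks, so that decomposed forms
        # (NFD / NFKD) lose their accents ("é" becomes "e")
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if lowercase:
        text = text.lower()
    # Split the text into tokens with the tokenization regex
    tokens = re.findall(tokenizer_regex, text)
    # Remove stopwords
    return [token for token in tokens if token not in stopwords]


tokens = sketch_tokenize("Le numéro suivant", stopwords={"le"})
print(tokens)  # ['numero', 'suivant']
```

With normalization="NFC" the accent stays attached to its letter, so the sketch would keep 'numéro' unchanged.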

In [3]:
from melusine.nlp_tools.tokenizer import WordLevelTokenizer

tokenizer = WordLevelTokenizer()

### Using a Tokenizer

Use the tokenize method to split text into tokens.

In [4]:
text = "Bonjour Mireille, peux-tu m'appeller au numéro suivant : 0612345678?"
tokenizer.tokenize(text)

['bonjour',
 'flag_name_',
 'peux-tu',
 'appeller',
 'numero',
 'suivant',
 'flag_phone_']

In [5]:
df_emails_preprocessed["tokens"] = df_emails_preprocessed["clean_body"].apply(tokenizer.tokenize)
df_emails_preprocessed["tokens"].head()

0    [client, chez, pouvez, etablir, devis, fils, s...
1    [informe, nouvelle, immatriculation, enfin, fa...
2    [suite, a, conversation, telephonique, flag_da...
3    [fais, suite, a, mail, envoye, bulletin, salai...
4    [voici, ci, joint, bulletin, salaire, comme, d...
Name: tokens, dtype: object

### Saving a Tokenizer
The tokenizer is saved to a human-readable JSON config file.  
The file can be inspected and the tokenizer configuration can be modified easily.

In [6]:
tokenizer.save("./data/my_tokenizer")

### Loading a Tokenizer 
Load a tokenizer from a JSON file.

In [7]:
tokenizer_reloaded = WordLevelTokenizer.load("./data/my_tokenizer")

### Custom tokenization regex
The default tokenizer does not split the text on the "-" character.  
Let's customize the tokenizer for this.

In [8]:
print(f"Default tokenization regex : {tokenizer_reloaded.tokenizer_regex}")
tokenizer_reloaded.tokenize("voulez-vous")

Default tokenization regex : \w+(?:[\?\-"_]\w+)*


['voulez-vous']

In [9]:
custom_tokenizer_regex = r"\w+(?:[\?\"_]\w+)*"
custom_tokenizer = WordLevelTokenizer(tokenizer_regex=custom_tokenizer_regex, stopwords=None)
custom_tokenizer.tokenize("voulez-vous")

['voulez', 'vous']
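
The effect of removing "-" from the character class can be checked with Python's re module alone. This is only a sketch: it assumes the tokenizer applies its regex in a way comparable to re.findall, which this tutorial does not confirm.

```python
import re

# Same patterns as above; the quote is escaped here only because the
# Python string is delimited by double quotes
default_regex = r"\w+(?:[\?\-\"_]\w+)*"
custom_regex = r"\w+(?:[\?\"_]\w+)*"

# "-" is an allowed joiner in the default regex, so the word stays whole
default_tokens = re.findall(default_regex, "voulez-vous")
# Without "-" in the character class, the match stops at the hyphen
custom_tokens = re.findall(custom_regex, "voulez-vous")

print(default_tokens)  # ['voulez-vous']
print(custom_tokens)   # ['voulez', 'vous']
```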

### Custom flags
Flagging consists of replacing expressions with a standard value.  
Typical examples of flagging are:  
* Replace phone numbers with a flag
* Replace email addresses with a flag
* etc.

With Melusine, you can easily define custom flags.

In [10]:
custom_flag_dict = {
    "chats?": "flag_animal",
    "chiens?": "flag_animal",
}

In [11]:
custom_tokenizer = WordLevelTokenizer(flag_dict=custom_flag_dict)
custom_tokenizer.tokenize("Mon chat n'apprécie pas nos chiens")

['flag_animal', 'apprecie', 'pas', 'flag_animal']
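
Conceptually, a flag dict maps each regex to its replacement text. The sketch below mimics this with re.sub. The helper apply_flags is hypothetical, and how Melusine actually applies the flags (order, word boundaries, case handling) is not described in this tutorial.

```python
import re

custom_flag_dict = {
    "chats?": "flag_animal",
    "chiens?": "flag_animal",
}


def apply_flags(text, flag_dict):
    """Hypothetical helper: replace every regex match by its flag."""
    for pattern, replacement in flag_dict.items():
        text = re.sub(pattern, replacement, text)
    return text


flagged = apply_flags("mon chat n'apprecie pas nos chiens", custom_flag_dict)
print(flagged)  # mon flag_animal n'apprecie pas nos flag_animal

# With plain re.sub, a pattern without word boundaries also matches
# inside longer words such as "achat"
unbounded = apply_flags("achat", {"chats?": "flag_animal"})
bounded = apply_flags("achat", {r"\bchats?\b": "flag_animal"})
print(unbounded)  # aflag_animal
print(bounded)    # achat
```

When writing custom flag regexes, adding \b word boundaries is a cheap way to rule out such partial matches, whether or not the library already guards against them.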

### Custom names
Searching through a list with thousands of names can be slow.  
To optimize performance, Melusine uses the Flashtext library to flag names.  
To use a custom name list, change the names parameter.

In [12]:
custom_tokenizer = WordLevelTokenizer(names=["daenerys", "tyrion"])
custom_tokenizer.tokenize("Daenerys rencontre Tyrion dans un chateau")

['flag_name_', 'rencontre', 'flag_name_', 'chateau']
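
To illustrate why the lookup strategy matters, the sketch below flags names with a Python set, where each membership test takes roughly constant time regardless of the number of names. This is a simplified stand-in, not Flashtext and not Melusine's code. The helper flag_names is hypothetical.

```python
def flag_names(tokens, names, name_flag="flag_name_"):
    """Hypothetical helper: replace tokens found in the name list by a flag."""
    # Build the set once; each lookup is then independent of the list size
    name_set = {name.lower() for name in names}
    return [name_flag if token.lower() in name_set else token for token in tokens]


tokens = ["daenerys", "rencontre", "tyrion", "dans", "un", "chateau"]
flagged_tokens = flag_names(tokens, names=["daenerys", "tyrion"])
print(flagged_tokens)
# ['flag_name_', 'rencontre', 'flag_name_', 'dans', 'un', 'chateau']
```

Unlike the tokenizer output above, this sketch does not remove stopwords, so 'dans' and 'un' remain in the result.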

### Collocations
The performance of NLP models may be improved by grouping together words that form a single unit of meaning.  
Examples:  
* new york -> new_york
* rendez vous -> rendez_vous

To use custom collocations, change the collocations_dict parameter.

In [13]:
custom_collocations_dict = {
    "rdv": "rendez_vous",
    "rendez[ -]+vous": "rendez_vous",
}
custom_tokenizer = WordLevelTokenizer(collocations_dict=custom_collocations_dict, stopwords=None)
custom_tokenizer.tokenize("rdv rendez vous rendez       vous rendez-vous")

['rendez_vous', 'rendez_vous', 'rendez_vous', 'rendez_vous']
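
The grouping itself can be reproduced with regex substitution. The sketch below assumes that each key of the collocations dict is applied as a regex over the raw text before splitting. The helper apply_collocations is hypothetical and is not Melusine's implementation.

```python
import re

custom_collocations_dict = {
    "rdv": "rendez_vous",
    "rendez[ -]+vous": "rendez_vous",
}


def apply_collocations(text, collocations_dict):
    """Hypothetical helper: rewrite each matching expression as one unit."""
    for pattern, replacement in collocations_dict.items():
        text = re.sub(pattern, replacement, text)
    return text


grouped = apply_collocations(
    "rdv rendez vous rendez       vous rendez-vous", custom_collocations_dict
)
print(grouped.split())
# ['rendez_vous', 'rendez_vous', 'rendez_vous', 'rendez_vous']
```

The character class [ -]+ accepts one or more spaces or hyphens, which is why the variants with several spaces and with a hyphen all collapse to the same token. The underscore in the replacement is not in that class, so already grouped text is left alone.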