# Full pipeline (detailed)

This notebook walks through the full pipeline in detail, including the preprocessing, summarization and classification steps.

## Loading the dataset as a pandas DataFrame

Melusine operates on pandas DataFrames by applying functions to certain columns to produce new columns, so the initial columns have to follow a strict naming convention.

The basic requirement to use Melusine is to have an input e-mail DataFrame with the following columns :
- body : Body of an email (single message or conversation history)
- header : Header of an email
- date : Reception date of an email
- from : Email address of the sender
- to (optional): Email address of the recipient
- label (optional): Label of the email for a classification task (examples: Business, Spam, Finance or Family)

Each row corresponds to a unique email.
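Before running the pipeline, it can help to check that the required columns are present. The helper below is a minimal sketch and not part of Melusine: `missing_columns`, `REQUIRED_COLUMNS` and `OPTIONAL_COLUMNS` are hypothetical names that simply restate the list above.

```python
# Column names taken from the list above; the helper itself is hypothetical.
REQUIRED_COLUMNS = ("body", "header", "date", "from")
OPTIONAL_COLUMNS = ("to", "label")


def missing_columns(columns):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Any iterable of names works, including the `.columns` of a DataFrame.
missing = missing_columns(["body", "header", "from", "label"])
print(missing)
```

The same call can be made on the loaded data with `missing_columns(df_emails.columns)`.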

In [None]:
from melusine.data.data_loader import load_email_data

df_emails = load_email_data()

In [None]:
df_emails.columns

In [None]:
print('Body :')
print(df_emails.body[5])
print('\n')
print('Header :')
print(df_emails.header[5])
print('Date :')
print(df_emails.date[5])
print('From :')
print(df_emails.loc[5,"from"])
print('To :')
print(df_emails.to[5])
print('Label :')
print(df_emails.label[5])

## Pipeline to manage transfers and replies

A single email can contain several replies or transfers in its body.

In this pipeline the functions applied are :
- **check_mail_begin_by_transfer :** returns True if an email is a direct transfer, else False.
- **update_info_for_transfer_mail :** updates the columns body, header, date, from and to if the email is a direct transfer.
- **add_boolean_answer :** returns True if an email is an answer, else False.
- **add_boolean_transfer :** returns True if an email is transferred, else False.

This pipeline will create the following new columns :
- **is_begin_by_transfer (boolean) :** indicates whether the email is a direct transfer, meaning the person who transferred a previous email did not write anything of their own. If so, the body, header, date, from and to columns are updated with the information of the transferred email.
- **is_answer (boolean) :** indicates whether the body contains replies to previous emails.
- **is_transfer (boolean) :** indicates whether the body contains transferred emails (not necessarily a direct transfer).
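The pipeline cells in this notebook describe each step as a `(function, arguments, output columns)` tuple. The sketch below shows one way such a schedule could be applied to a single email stored as a dict. It is an illustration only, not Melusine's implementation: `run_schedule` and the two toy detectors are hypothetical, and treating `None` output columns as "the function returns the updated row" is an assumption.

```python
def run_schedule(row, schedule):
    """Apply each (function, args, output_columns) step to one email row."""
    for function, args, columns in schedule:
        result = function(row, *(args or ()))
        if columns is None:
            row = result  # assumption: the function returned an updated row
        elif len(columns) == 1:
            row[columns[0]] = result
        else:
            row.update(zip(columns, result))
    return row


# Hypothetical toy detectors, much cruder than Melusine's own functions.
def starts_with_forward(row):
    return row["body"].lstrip().lower().startswith("forwarded message")


def header_has_prefix(row, prefix):
    return row["header"].lower().startswith(prefix)


email = {"header": "RE: Your invoice", "body": "Thanks for the update."}
email = run_schedule(email, [
    (starts_with_forward, None, ["is_begin_by_transfer"]),
    (header_has_prefix, ("re:",), ["is_answer"]),
    (header_has_prefix, ("fwd:",), ["is_transfer"]),
])
print(email)
```

Each step only reads existing keys and writes new ones, which is why steps can be chained in a fixed order.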

#### An example of a direct transfer

In [None]:
print(df_emails.loc[0,'header'])
print(df_emails.loc[0,'date'])
print(df_emails.loc[0,'from'])
print(df_emails.loc[0,'to'])
print(df_emails.loc[0,'body'])

#### The pipeline 

In [None]:
from melusine.utils.transformer_scheduler import TransformerScheduler

from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer

In [None]:
ManageTransferReplyTransformer = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)

In [None]:
df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)

In [None]:
df_emails.columns

#### The directly transferred email after it has been updated

In [None]:
print(df_emails.loc[0,'is_begin_by_transfer'])
print(df_emails.loc[0,'header'])
print(df_emails.loc[0,'date'])
print(df_emails.loc[0,'from'])
print(df_emails.loc[0,'to'])
print(df_emails.loc[0,'body'])

#### Headers of emails containing replies

In [None]:
test = df_emails[df_emails['is_answer']==True]
test.header

#### Headers of emails containing transfers

In [None]:
test = df_emails[df_emails['is_transfer']==True]
test.header

## Email segmenting pipeline

Each email will be segmented according to :
- the different messages
- the metadata, the header and the text of each message
- the type of metadata (date, from, to)
- the different parts of each text (hello, greetings, footer, etc.)

In this pipeline the functions applied are :
- **build_historic :** segments the different messages of the body and returns a list of dictionaries, one per message. Each dictionary has a key 'meta' to access the metadata and a key 'text' to access the text of the body.
- **structure_email :** splits each message of the historic into parts, tags them (tags: Hello, Body, Greetings, etc.) and segments each part of the metadata (date, from, to). The result is returned as a list of dictionaries, one per message. Each dictionary has a key 'meta' to access the metadata (itself a dictionary with keys 'date', 'from' and 'to') and a key 'text' to access the text of the body (itself a dictionary with keys 'header' and 'structured_text').

This pipeline creates the following new columns :
- **structured_historic :** the list of dictionaries returned by the **build_historic** function.
- **structured_body :** the list of dictionaries returned by the **structure_email** function.
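To make the "one dictionary per message" layout concrete, here is a minimal sketch that splits a conversation on a transition line. The pattern `On <date>, <sender> wrote:` and the function `split_messages` are hypothetical; the patterns Melusine actually uses are different, and using `None` as the metadata of the most recent message is a choice made for this example.

```python
import re

# Hypothetical transition line; not the pattern Melusine uses.
TRANSITION = re.compile(
    r"^On (?P<date>.+?), (?P<sender>\S+) wrote:$", re.MULTILINE
)


def split_messages(body):
    """Split a conversation body into one {'meta', 'text'} dict per message."""
    messages = []
    meta = None  # the most recent message has no transition line above it
    start = 0
    for match in TRANSITION.finditer(body):
        messages.append({"meta": meta, "text": body[start:match.start()].strip()})
        meta = match.group(0)
        start = match.end()
    messages.append({"meta": meta, "text": body[start:].strip()})
    return messages


body = (
    "Thanks, that works.\n"
    "On Mon 4 June 2018, alice@example.com wrote:\n"
    "Can you send the file?"
)
historic = split_messages(body)
print(historic)
```

A second pass over each `'meta'` string could then pull out the date and sender, which is the kind of finer structure described for `structured_body`.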

In [None]:
from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email

In [None]:
SegmentingTransformer = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

In [None]:
df_emails = SegmentingTransformer.fit_transform(df_emails)

In [None]:
df_emails.columns

In [None]:
print(df_emails.body[2])

In [None]:
df_emails.structured_historic[2]

In [None]:
df_emails.structured_body[2]

## Extraction and cleaning of the body of the last message

Once each email is segmented, the body of the last message is extracted and cleaned.

In this pipeline the functions applied are :
- **extract_last_body :** returns the body of the last message of the email.
- **clean_body :** returns the body of the last message of the email after cleaning.

This pipeline returns the following columns : 
- **last_body :** the body of the last message of the email returned by the **extract_last_body** function.
- **clean_body :** the cleaned body of the last message of the email returned by the **clean_body** function.
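As a rough picture of what "cleaning" a body can involve, the sketch below strips accents, lowercases the text and collapses whitespace. These three steps are assumptions chosen for illustration; `toy_clean` is a hypothetical function and the rules applied by Melusine's `clean_body` may differ.

```python
import re
import unicodedata


def toy_clean(text):
    """Strip accents, lowercase, and collapse runs of whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = toy_clean("Bonjour,\n\nJe  vous écris à propos du CONTRAT.")
print(cleaned)
```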

In [None]:
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body

In [None]:
LastBodyHeaderCleaningTransformer = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body'])
    ]
)

In [None]:
df_emails = LastBodyHeaderCleaningTransformer.fit_transform(df_emails)

In [None]:
df_emails.columns

In [None]:
print(df_emails.body[2])

In [None]:
print(df_emails.last_body[2])

In [None]:
print(df_emails.clean_body[2])

## Applying a phraser

A phraser can be applied to the body. However, it first needs to be trained.
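The idea behind a phraser is to merge words that frequently appear together into a single token. The sketch below is a simplified, count-based version written for illustration: `learn_bigrams` and `apply_bigrams` are hypothetical helpers, and the scoring used by Melusine's `Phraser` is not reproduced here.

```python
from collections import Counter


def learn_bigrams(sentences, min_count=2):
    """Return the adjacent word pairs seen at least `min_count` times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_bigrams(tokens, bigrams):
    """Join each learned pair into one token, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


corpus = [
    ["please", "update", "my", "bank", "account"],
    ["the", "bank", "account", "is", "closed"],
]
bigrams = learn_bigrams(corpus)
phrased = apply_bigrams(["my", "bank", "account", "number"], bigrams)
print(phrased)
```

This also shows why training comes first: the set of pairs to merge has to be learned from the corpus before it can be applied to any body.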

In [None]:
from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body

#### Training a phraser

In [None]:
phraser = Phraser()

In [None]:
phraser.train(df_emails)

#### Applying a phraser

The **phraser_on_body** function applies a phraser on the clean_body of an email.

In [None]:
PhraserTransformer = TransformerScheduler(
    functions_scheduler=[
        (phraser_on_body, (phraser,), ['clean_body'])
    ]
)

In [None]:
df_emails = PhraserTransformer.fit_transform(df_emails)

## Applying a tokenizer

In [None]:
from melusine.nlp_tools.tokenizer import Tokenizer

In [None]:
tokenizer = Tokenizer(input_column="clean_body")

In [None]:
df_emails = tokenizer.fit_transform(df_emails)

In [None]:
df_emails.columns

In [None]:
print(df_emails.clean_body[2])

In [None]:
print(df_emails.tokens[2])

## Metadata preprocessing

The metadata have to be extracted before being dummified.

This pipeline extracts the following metadata :
- **extension :** from the "from" column.
- **dayofweek :** from the date.
- **hour :** from the date.
- **min :** from the date.
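The sketch below illustrates the two stages on a single email: extracting the four features, then dummifying (one-hot encoding) some of them. Several points are assumptions made for the example: `extension` is taken to be everything after the `@`, the date is assumed to be in ISO format, and the `feature__value` naming of the dummy columns is hypothetical. `extract_meta` and `dummify` are not Melusine functions.

```python
from datetime import datetime


def extract_meta(sender, date_string, date_format="%Y-%m-%d %H:%M:%S"):
    """Extract extension, dayofweek (Monday=0), hour and min for one email."""
    parsed = datetime.strptime(date_string, date_format)
    return {
        "extension": sender.rsplit("@", 1)[-1],  # assumption: part after '@'
        "dayofweek": parsed.weekday(),
        "hour": parsed.hour,
        "min": parsed.minute,
    }


def dummify(meta, categories):
    """One-hot encode: one 0/1 entry per (feature, value) pair."""
    return {
        f"{feature}__{value}": int(meta[feature] == value)
        for feature, values in categories.items()
        for value in values
    }


meta = extract_meta("alice@example.com", "2018-05-24 19:37:00")
dummies = dummify(
    meta,
    {"extension": ["example.com", "other.org"], "dayofweek": [3, 4]},
)
print(meta)
print(dummies)
```

The list of known values has to be fixed when the encoder is fitted, which is why the metadata pipeline is fitted once and then reused for new emails.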

In [None]:
from sklearn.pipeline import Pipeline
from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import Dummifier

In [None]:
# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('Dummifier', Dummifier())
])

In [None]:
df_meta = MetadataPipeline.fit_transform(df_emails)

In [None]:
df_meta.columns

In [None]:
df_meta.head()

## Keywords extraction

Once a tokens column exists, keywords can be extracted.
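One common way to pick keywords from tokens is tf-idf weighting: a token scores high when it is frequent in one email but rare across the corpus. The sketch below illustrates that idea on made-up data. `top_keywords` is a hypothetical helper, and this notebook does not state how `KeywordsGenerator` scores tokens, so the sketch should not be read as its implementation.

```python
import math
from collections import Counter


def top_keywords(docs, n_max_keywords=2):
    """Return the highest tf-idf tokens of each document."""
    n_docs = len(docs)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(token for tokens in docs for token in set(tokens))
    results = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Sort by decreasing score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:n_max_keywords])
    return results


docs = [
    ["invoice", "payment", "invoice", "hello"],
    ["contract", "hello", "signature"],
    ["hello", "payment", "delay"],
]
keywords = top_keywords(docs)
print(keywords)
```

A token such as `hello`, present in every document, gets a score of zero and is never selected.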

In [None]:
from melusine.summarizer.keywords_generator import KeywordsGenerator

In [None]:
keywords_generator = KeywordsGenerator(n_max_keywords=4)

In [None]:
df_emails = keywords_generator.fit_transform(df_emails)

In [None]:
df_emails.clean_body[23]

In [None]:
df_emails.tokens[23]

In [None]:
df_emails.keywords[23]

## Classification with neural networks

Melusine provides a NeuralModel class to train, save, load and run predictions with any Keras-based neural network.
Predefined architectures of RNN and CNN models using the cleaned body and the metadata of the emails are also offered.

#### Embeddings training

Embeddings have to be pretrained on the dataset so that they can be passed as an argument to the neural networks.

In [None]:
from melusine.nlp_tools.embedding import Embedding

In [None]:
pretrained_embedding = Embedding(input_column='clean_body',
                                 workers=1,
                                 min_count=5)

In [None]:
pretrained_embedding.train(df_emails) 

#### Preparing X and y
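The labels are text, so they are converted to integers before training. scikit-learn's `LabelEncoder` numbers the distinct labels in sorted order; the plain-Python sketch below mimics that mapping on made-up labels. `fit_labels` is a hypothetical helper written for this illustration.

```python
def fit_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


labels = ["Spam", "Business", "Family", "Spam"]
classes, to_index = fit_labels(labels)

encoded = [to_index[label] for label in labels]   # text -> integer
decoded = [classes[index] for index in encoded]   # integer -> text
print(classes, encoded, decoded)
```

The decoding step is the counterpart of `inverse_transform`, which turns predicted integers back into readable labels.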

In [None]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [None]:
X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
X.columns

In [None]:
X.head()

In [None]:
y

#### Training and prediction with a CNN

In [None]:
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

In [None]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_body",
                       meta_input_list=['extension', 'dayofweek','hour', 'min'],
                       n_epochs=10)

In [None]:
nn_model.fit(X,y)

In [None]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res

#### Using a dict instead of a DataFrame as input (performance optimization)

In an industrialized context, a trained model might be fed input data one email at a time.  
In this case, creating a pandas DataFrame with a single row is overkill, and a significant performance gain can be obtained by using a dict instead of a DataFrame.  

Melusine is developed to ensure dict compatibility, as shown in the code below.

In [None]:
import copy

# ============== Test dict compatibility ==============
dict_emails = df_emails.to_dict(orient='records')[0]
dict_meta = MetadataPipeline.transform(dict_emails)
dict_keywords = keywords_generator.transform(dict_emails)

dict_input = copy.deepcopy(dict_meta)
dict_input['clean_body'] = dict_emails['clean_body']

dict_result = nn_model.predict(dict_input)