# Full pipeline (detailed)

This notebook walks through the full pipeline in detail: the preprocessing steps, the summarization steps and the classification steps.

## Loading the dataset as a pandas DataFrame

Melusine operates on pandas DataFrames by applying functions to certain columns to produce new columns, so the initial columns have to follow a strict naming convention.

The basic requirement to use Melusine is an input email DataFrame with the following columns:
- body : Body of an email (single message or conversation history)
- header : Header of an email
- date : Reception date of an email
- from : Email address of the sender
- to (optional): Email address of the recipient
- label (optional): Label of the email for a classification task (examples: Business, Spam, Finance or Family)

Each row corresponds to a unique email.
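
Before running the pipeline, it can help to confirm that the required columns are present. The helper below is a minimal sketch: `missing_required_columns` is a hypothetical name introduced for this example and is not part of Melusine. It only encodes the column list given above.

```python
# Column names taken from the requirements listed above.
REQUIRED_COLUMNS = ("body", "header", "date", "from")
OPTIONAL_COLUMNS = ("to", "label")


def missing_required_columns(columns):
    """Return the required column names absent from `columns`, in a stable order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# With a DataFrame, pass `df_emails.columns`; a plain list works the same way.
print(missing_required_columns(["body", "header", "date", "from", "to", "label"]))
print(missing_required_columns(["body", "date", "label"]))
```

An empty list means the DataFrame satisfies the naming requirement; otherwise the list names the columns to add or rename.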

In [1]:
from melusine.data.data_loader import load_email_data

df_emails = load_email_data()

In [2]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'label'], dtype='object')

In [3]:
print('Body :')
print(df_emails.body[5])
print('\n')
print('Header :')
print(df_emails.header[5])
print('Date :')
print(df_emails.date[5])
print('From :')
print(df_emails.loc[5,"from"])
print('To :')
print(df_emails.to[5])
print('Label :')
print(df_emails.label[5])

Body :
 Madame, Monsieur, 
 
 Je vous avais contactés car j'avais pour 
 projet d'agrandir ma maison. J'avais reçu un devis pour lequel je n'avais 
 pas donné suite, les travaux n'étant pas encore réalisés. 
  
 Le projet a maintenant été porté à son terme et je voudrais donc revoir 
 votre offre si possible. 
  
 Je désire garder le même type de contrat. 
 Je suis à votre disposition pour tout renseignement complémentaires. 
  
 Sincères salutations 
 Monsieur Dupont 
  


Header :
Modification et extension de ma maison
Date :
jeudi 31 mai 2018 10 h 28 CEST
From :
Monsieur Dupont <monsieurdupont@extensiona.com>
To :
demandes@societeimaginaire.fr
Label :
habitation


## Pipeline to manage transfers and replies

A single email can contain several replies or transfers in its body.

The functions applied in this pipeline are:
- **check_mail_begin_by_transfer :** returns True if an email is a direct transfer, else False.
- **update_info_for_transfer_mail :** updates the columns body, header, date, from and to if the email is a direct transfer.
- **add_boolean_answer :** returns True if an email is an answer, else False.
- **add_boolean_transfer :** returns True if an email is transferred, else False.

This pipeline creates the following new columns:
- **is_begin_by_transfer (boolean) :** indicates whether the email is a direct transfer, meaning that the person who transferred a previous email did not write anything of their own. In that case, the body, header, date, from and to columns are updated with the information of the transferred email.
- **is_answer (boolean) :** indicates whether the body contains replies from previous emails.
- **is_transfer (boolean) :** indicates whether the body contains transferred emails (not necessarily a direct transfer).
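
To make the three flags concrete, here is a simplified sketch of how such checks could be written with regular expressions. It is not Melusine's implementation: the function names and the patterns (header prefixes `Re:`, `Tr :`, `Fwd:` and a `Transféré par` marker line) are assumptions chosen from the headers and bodies seen in this dataset.

```python
import re

# Hypothetical patterns, chosen from the examples in this notebook.
ANSWER_PREFIX = re.compile(r"^\s*re\s*:", re.IGNORECASE)
TRANSFER_PREFIX = re.compile(r"^\s*(tr|fwd|fw)\s*:", re.IGNORECASE)
# Line that opens a forwarded block, e.g. "----- Transféré par ... -----".
TRANSFER_LINE = re.compile(r"^-+\s*transféré par\b", re.IGNORECASE)


def header_is_answer(header):
    """True if the header starts with a reply prefix."""
    return bool(ANSWER_PREFIX.match(header))


def header_is_transfer(header):
    """True if the header starts with a transfer prefix."""
    return bool(TRANSFER_PREFIX.match(header))


def body_begins_by_transfer(body):
    """True if the first non-empty line of the body opens a forwarded block."""
    for line in body.splitlines():
        stripped = line.strip()
        if stripped:
            return bool(TRANSFER_LINE.match(stripped))
    return False


print(header_is_answer("Re: Virement"))
print(header_is_transfer("Tr : Re: Interrogations"))
print(body_begins_by_transfer(" \n ----- Transféré par Conseiller le 24/05/2018 11:49 ----- \n De : ..."))
```

In this sketch a direct transfer is simply a body whose first non-empty line is the forwarding marker, which matches the definition of `is_begin_by_transfer` given above.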

#### An example of a direct transfer

In [4]:
print(df_emails.loc[0,'header'])
print(df_emails.loc[0,'date'])
print(df_emails.loc[0,'from'])
print(df_emails.loc[0,'to'])
print(df_emails.loc[0,'body'])

Tr : Devis habitation
jeudi 24 mai 2018 11 h 49 CEST
conseiller1@societeimaginaire.fr
demandes@societeimaginaire.fr
 
  
  
  
 ----- Transféré par Conseiller le 24/05/2018 11:49 ----- 
  
 De :	Dupont <monsieurdupont@extensiona.com> 
 A :	conseiller@Societeimaginaire.fr 
 Cc :	Societe@www.Societe.fr 
 Date :	24/05/2018 11:36 
 Objet :	Devis habitation 
  
  
  
 Bonjour 
 Je suis client chez vous 
 Pouvez vous m établir un devis pour mon fils qui souhaite 
 louer l’appartement suivant : 
 25 rue du rueimaginaire 77000 
 Merci 
 Envoyé de mon iPhone


#### The pipeline 

In [5]:
from melusine.utils.transformer_scheduler import TransformerScheduler

from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer

In [6]:
ManageTransferReplyTransformer = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)
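
Each entry of `functions_scheduler` above is a tuple of a function, its extra arguments and the columns that receive the result. The sketch below illustrates that idea on plain dicts standing in for DataFrame rows. It is an illustration only, under the assumption that a `None` column list means the function returns the whole updated row; `apply_schedule`, `strip_header` and `starts_with_re` are hypothetical names, not Melusine functions.

```python
def apply_schedule(rows, schedule):
    """Apply each (function, args, output_keys) step to every row, in order.

    `rows` is a list of dicts standing in for DataFrame rows. When
    `output_keys` is None, the function is expected to return the updated row.
    """
    for function, args, output_keys in schedule:
        args = args or ()
        for index, row in enumerate(rows):
            result = function(row, *args)
            if output_keys is None:
                rows[index] = result
            elif len(output_keys) == 1:
                row[output_keys[0]] = result
            else:
                row.update(zip(output_keys, result))
    return rows


def strip_header(row):
    """Return a copy of the row with a trimmed header."""
    return {**row, "header": row["header"].strip()}


def starts_with_re(row):
    """Return True if the header starts with a reply prefix."""
    return row["header"].lower().startswith("re:")


rows = [{"header": " Re: Virement "}, {"header": "Devis habitation"}]
schedule = [
    (strip_header, None, None),               # updates existing keys
    (starts_with_re, None, ["is_answer"]),    # creates a new key
]
print(apply_schedule(rows, schedule))
```

The order of the steps matters: the second function sees the header already trimmed by the first one, just as later steps of the pipeline rely on columns created by earlier ones.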

In [7]:
df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)

In [8]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'label', 'is_begin_by_transfer',
       'is_answer', 'is_transfer'],
      dtype='object')

#### A directly transferred email after it has been updated

In [9]:
print(df_emails.loc[0,'is_begin_by_transfer'])
print(df_emails.loc[0,'header'])
print(df_emails.loc[0,'date'])
print(df_emails.loc[0,'from'])
print(df_emails.loc[0,'to'])
print(df_emails.loc[0,'body'])

True
Devis habitation
24/05/2018 11:36
Dupont <monsieurdupont@extensiona.com>
conseiller@Societeimaginaire.fr
 
  
  
  
 Bonjour 
 Je suis client chez vous 
 Pouvez vous m établir un devis pour mon fils qui souhaite 
 louer l’appartement suivant : 
 25 rue du rueimaginaire 77000 
 Merci 
 Envoyé de mon iPhone


#### Headers of emails containing replies

In [10]:
test = df_emails[df_emails['is_answer']==True]
test.header

2     Re: Envoi d'un document de la Société Imaginaire
3           Re: Votre adhésion à la Société Imaginaire
8                                         Re: Virement
10                                         Re: Demande
14        Re:  Correspondance de La Societe Imaginaire
19                  Re: Suppression assurance logement
27        Re: Confirmation de votre assurance véhicule
Name: header, dtype: object

#### Headers of emails containing transfers

In [11]:
test = df_emails[df_emails['is_transfer']==True]
test.header

12                      Tr : Re: Vos documents demandés
21                          Fwd: Changement de vehicule
25                              Tr : Re: Interrogations
31    Tr : résiliation couverture véhicule suite ces...
39        Tr : Message de votre conseillère personnelle
Name: header, dtype: object

## Email segmenting pipeline

Each email is segmented according to:
- the different messages
- the metadata, the header and the text of each message
- the type of metadata (date, from, to)
- the different parts of each text (hello, greetings, footer, etc.)

The functions applied in this pipeline are:
- **build_historic :** segments the different messages of the body and returns a list of dictionaries, one per message. Each dictionary has a key 'meta' to access the metadata and a key 'text' to access the text of the body.
- **structure_email :** splits each message of the historic into parts, tags them (tags: HELLO, BODY, GREETINGS, etc.) and segments each part of the metadata (date, from, to). The result is returned as a list of dictionaries, one per message. Each dictionary has a key 'meta' to access the metadata (itself a dictionary with keys 'date', 'from' and 'to') and a key 'structured_text' to access the text (itself a dictionary with keys 'header' and 'text', where 'text' is the list of tagged parts).

This pipeline creates the following new columns:
- **structured_historic :** the list of dictionaries returned by the **build_historic** function.
- **structured_body :** the list of dictionaries returned by the **structure_email** function.
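
The first segmentation step can be pictured as splitting the body on the line that introduces a quoted reply. The sketch below assumes a single hypothetical marker of the form `Le ... a écrit :`; Melusine's own `build_historic` is not reproduced here and may recognise other markers.

```python
import re

# Hypothetical marker for a quoted reply, e.g.
# "Le mar. 22 mai 2018 à 10:20, <someone@example.com> a écrit :"
REPLY_MARKER = re.compile(r"^[ \t]*Le .+ a écrit[ \t\xa0]*:[ \t]*$", re.MULTILINE)


def split_messages(body):
    """Split a body into messages, most recent first.

    Each message is a dict with the marker line under 'meta'
    and the message content under 'text'.
    """
    messages = []
    meta = ""
    start = 0
    for match in REPLY_MARKER.finditer(body):
        messages.append({"meta": meta, "text": body[start:match.start()]})
        meta = match.group(0).strip()
        start = match.end()
    messages.append({"meta": meta, "text": body[start:]})
    return messages


body = (
    "Bonjour,\n"
    "Merci bonne journée\n"
    "Le mar. 22 mai 2018 à 10:20, <conseiller@example.com> a écrit :\n"
    "Bonjour.\n"
    "Cordialement."
)
for message in split_messages(body):
    print(repr(message["meta"]), repr(message["text"]))
```

The most recent message has an empty 'meta' because nothing introduces it, which is the same shape as the 'meta' and 'text' keys described for **build_historic**.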

In [12]:
from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email

In [13]:
SegmentingTransformer = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

In [14]:
df_emails = SegmentingTransformer.fit_transform(df_emails)

In [15]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'label', 'is_begin_by_transfer',
       'is_answer', 'is_transfer', 'structured_historic', 'structured_body'],
      dtype='object')

In [16]:
print(df_emails.body[2])

 
  
  
 Bonjours, 
  
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la 
 somme que je vous dois afin d'être en régularisation . 
  
 Merci bonne journée 
  
 Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit : 
 Bonjour. 
  
 Merci de bien vouloir prendre connaissance du document ci-joint : 
 1 - Relevé d'identité postal (contrats) 
  
 Cordialement. 
  
 La Mututelle Imaginaire 
  
 La visualisation des fichiers PDF nécessite Adobe Reader. 
  


In [17]:
df_emails.structured_historic[2]

[{'text': " \n  \n  \n Bonjours, \n  \n Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la \n somme que je vous dois afin d'être en régularisation . \n  \n Merci bonne journée",
  'meta': ''},
 {'text': " \n Bonjour. \n  \n Merci de bien vouloir prendre connaissance du document ci-joint : \n 1 - Relevé d'identité postal (contrats) \n  \n Cordialement. \n  \n La Mututelle Imaginaire \n  \n La visualisation des fichiers PDF nécessite Adobe Reader. \n  ",
  'meta': ' \n  \n Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit\xa0:'}]

In [18]:
df_emails.structured_body[2]

[{'meta': {'date': None, 'from': None, 'to': None},
  'structured_text': {'header': None,
   'text': [{'part': ' Bonjours, ', 'tags': 'HELLO'},
    {'part': "    Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la   somme que je vous dois afin d'être en régularisation . \n  \n ",
     'tags': 'BODY'},
    {'part': 'Merci bonne journée', 'tags': 'GREETINGS'}]}},
 {'meta': {'date': ' mar. 22 mai 2018 à 10:20',
   'from': '  <conseiller@Societeimaginaire.fr> ',
   'to': None},
  'structured_text': {'header': None,
   'text': [{'part': ' Bonjour. \n  \n ', 'tags': 'HELLO'},
    {'part': "Merci de bien vouloir prendre connaissance du document ci-joint :   1 - Relevé d'identité postal (contrats)    ",
     'tags': 'BODY'},
    {'part': ' Cordialement. ', 'tags': 'GREETINGS'},
    {'part': '        La Mututelle Imaginaire    ', 'tags': 'BODY'},
    {'part': ' La visualisation des fichiers PDF nécessite Adobe Reader. \n',
     'tags': 'FOOTER'}]}}]

## Extraction and cleaning of the body of the last message

Once each email is segmented, the body of the last message is extracted and cleaned.

The functions applied in this pipeline are:
- **extract_last_body :** returns the body of the last message of the email.
- **clean_body :** returns the body of the last message of the email after cleaning.

This pipeline creates the following new columns:
- **last_body :** the body of the last message of the email, returned by the **extract_last_body** function.
- **clean_body :** the cleaned body of the last message of the email, returned by the **clean_body** function.
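
As a rough illustration of these two steps, the sketch below keeps the parts tagged BODY in the most recent message, then lowercases the text, strips accents and collapses whitespace. Both function names are hypothetical, and the cleaning is deliberately simpler than Melusine's `clean_body`, which also replaces some entities with flags.

```python
import re
import unicodedata


def extract_last_body_sketch(structured_body):
    """Join the parts tagged BODY in the first (most recent) message."""
    parts = structured_body[0]["structured_text"]["text"]
    return " ".join(part["part"] for part in parts if part["tags"] == "BODY")


def clean_text_sketch(text):
    """Lowercase, strip accents and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    return re.sub(r"\s+", " ", without_accents).strip()


structured_body = [
    {
        "meta": {"date": None, "from": None, "to": None},
        "structured_text": {
            "header": None,
            "text": [
                {"part": " Bonjours, ", "tags": "HELLO"},
                {"part": "  Suite à notre conversation téléphonique \n ", "tags": "BODY"},
                {"part": "Merci bonne journée", "tags": "GREETINGS"},
            ],
        },
    }
]
last_body = extract_last_body_sketch(structured_body)
print(clean_text_sketch(last_body))
```

Greetings and sign-offs are dropped because only BODY parts are kept, which is why the cleaned text is much shorter than the original email.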

In [19]:
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body

In [20]:
LastBodyHeaderCleaningTransformer = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body'])
    ]
)

In [21]:
df_emails = LastBodyHeaderCleaningTransformer.fit_transform(df_emails)

In [22]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'label', 'is_begin_by_transfer',
       'is_answer', 'is_transfer', 'structured_historic', 'structured_body',
       'last_body', 'clean_body'],
      dtype='object')

In [23]:
print(df_emails.body[2])

 
  
  
 Bonjours, 
  
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la 
 somme que je vous dois afin d'être en régularisation . 
  
 Merci bonne journée 
  
 Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit : 
 Bonjour. 
  
 Merci de bien vouloir prendre connaissance du document ci-joint : 
 1 - Relevé d'identité postal (contrats) 
  
 Cordialement. 
  
 La Mututelle Imaginaire 
  
 La visualisation des fichiers PDF nécessite Adobe Reader. 
  


In [24]:
print(df_emails.last_body[2])

    Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la   somme que je vous dois afin d'être en régularisation . 
  
  


In [25]:
print(df_emails.clean_body[2])

suite a notre conversation telephonique de  flag_date_  , pourriez vous me dire la somme que je vous dois afin d'etre en regularisation .


## Applying a phraser

A phraser can be applied to the body, but it first needs to be trained.
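
Assuming the phraser plays its usual role, it learns which adjacent words occur together often and merges them into a single token. The sketch below shows that idea with a plain count threshold. The function names and the threshold are assumptions made for this example; real phrasers typically use a more elaborate score, and Melusine's `Phraser` is not reproduced here.

```python
from collections import Counter


def train_phrases(token_lists, min_count=2):
    """Return the set of adjacent word pairs seen at least `min_count` times."""
    pair_counts = Counter(
        pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
    )
    return {pair for pair, count in pair_counts.items() if count >= min_count}


def apply_phrases(tokens, phrases):
    """Merge each learned pair into one token joined by an underscore."""
    merged = []
    index = 0
    while index < len(tokens):
        pair = tuple(tokens[index:index + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            index += 2
        else:
            merged.append(tokens[index])
            index += 1
    return merged


corpus = [
    ["carte", "grise", "du", "vehicule"],
    ["envoi", "carte", "grise"],
    ["nouveau", "vehicule"],
]
phrases = train_phrases(corpus)
print(apply_phrases(["ma", "carte", "grise", "svp"], phrases))
```

Training has to come first because the set of pairs to merge is learned from the corpus itself.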

In [26]:
from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body

#### Training a phraser

In [27]:
phraser = Phraser()

In [28]:
phraser.train(df_emails)

#### Applying a phraser

The **phraser_on_body** function applies a trained phraser to the clean_body column of an email.

In [29]:
PhraserTransformer = TransformerScheduler(
    functions_scheduler=[
        (phraser_on_body, (phraser,), ['clean_body'])
    ]
)

In [30]:
df_emails = PhraserTransformer.fit_transform(df_emails)

## Applying a tokenizer

In [31]:
from melusine.nlp_tools.tokenizer import Tokenizer

In [32]:
tokenizer = Tokenizer(input_column="clean_body")

In [33]:
df_emails = tokenizer.fit_transform(df_emails)

In [34]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'label', 'is_begin_by_transfer',
       'is_answer', 'is_transfer', 'structured_historic', 'structured_body',
       'last_body', 'clean_body', 'tokens'],
      dtype='object')

In [35]:
print(df_emails.clean_body[2])

suite a notre conversation telephonique de  flag_date_  , pourriez vous me dire la somme que je vous dois afin d'etre en regularisation .


In [36]:
print(df_emails.tokens[2])

['suite', 'a', 'conversation', 'telephonique', 'flag_date_', 'pourriez', 'dire', 'somme', 'dois', 'afin', 'etre', 'regularisation']
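
The token list above is consistent with splitting the text into words and dropping stop words. The sketch below reproduces that behaviour on the same sentence, under the assumption of a tiny hand-picked stop-word list; `tokenize_sketch` and `STOP_WORDS` are hypothetical and do not reflect the list or the rules used by Melusine's `Tokenizer`.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this example.
STOP_WORDS = {"notre", "de", "vous", "me", "la", "que", "je", "en", "d"}


def tokenize_sketch(text):
    """Split on runs of word characters and drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


text = (
    "suite a notre conversation telephonique de  flag_date_  , "
    "pourriez vous me dire la somme que je vous dois afin d'etre "
    "en regularisation ."
)
print(tokenize_sketch(text))
```

Because `\w` also matches the underscore, a flag such as `flag_date_` survives as a single token.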


## Metadata preprocessing

The metadata have to be extracted before being dummified (one-hot encoded).

This pipeline extracts the following metadata:
- **extension :** from the "from" column.
- **dayofweek :** from the date.
- **hour :** from the date.
- **min :** from the date.
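
The two stages, extraction then dummification, can be sketched with the standard library as follows. This is an illustration only: `extract_metadata` and `dummify` are hypothetical helpers, the "extension" is assumed to be the domain name that follows the `@`, and the date is assumed to be already parsed into a `datetime`. Melusine encodes the extension as an integer category, so its column names differ from the ones produced here.

```python
import re
from datetime import datetime


def extract_metadata(sender, received):
    """Extract the sender's domain name and the date parts of one email."""
    match = re.search(r"@([\w-]+)", sender)
    return {
        "extension": match.group(1) if match else None,
        "dayofweek": received.weekday(),  # Monday is 0
        "hour": received.hour,
        "min": received.minute,
    }


def dummify(rows):
    """One-hot encode every (name, value) pair found in the rows."""
    columns = sorted({f"{key}__{value}" for row in rows for key, value in row.items()})
    encoded_rows = []
    for row in rows:
        encoded = dict.fromkeys(columns, 0)
        for key, value in row.items():
            encoded[f"{key}__{value}"] = 1
        encoded_rows.append(encoded)
    return encoded_rows


rows = [
    extract_metadata(
        "Monsieur Dupont <monsieurdupont@extensiona.com>",
        datetime(2018, 5, 31, 10, 28),
    ),
    extract_metadata(
        "conseiller@societeimaginaire.fr",
        datetime(2018, 5, 24, 11, 36),
    ),
]
for encoded in dummify(rows):
    print(encoded)
```

Every distinct value becomes its own 0/1 column, which explains why the resulting DataFrame has columns such as `hour__10` or `min__28` only for the values actually observed in the data.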

In [37]:
from sklearn.pipeline import Pipeline
from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import Dummifier

In [38]:
# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('Dummifier', Dummifier())
])

In [39]:
df_meta = MetadataPipeline.fit_transform(df_emails)

In [40]:
df_meta.columns

Index(['extension__0', 'extension__1', 'extension__2', 'extension__3',
       'extension__4', 'extension__5', 'extension__6', 'extension__7',
       'extension__8', 'extension__9', 'dayofweek__0', 'dayofweek__1',
       'dayofweek__2', 'dayofweek__3', 'dayofweek__4', 'dayofweek__5',
       'dayofweek__6', 'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11',
       'hour__12', 'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18',
       'hour__19', 'hour__20', 'hour__22', 'min__2', 'min__3', 'min__4',
       'min__6', 'min__7', 'min__9', 'min__10', 'min__11', 'min__12',
       'min__15', 'min__16', 'min__19', 'min__22', 'min__28', 'min__30',
       'min__32', 'min__33', 'min__36', 'min__37', 'min__38', 'min__39',
       'min__40', 'min__44', 'min__45', 'min__49', 'min__52', 'min__54',
       'min__56', 'min__58'],
      dtype='object')

In [41]:
df_meta.head()

Unnamed: 0,extension__0,extension__1,extension__2,extension__3,extension__4,extension__5,extension__6,extension__7,extension__8,extension__9,...,min__38,min__39,min__40,min__44,min__45,min__49,min__52,min__54,min__56,min__58
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Keywords extraction

Once a tokens column exists, keywords can be extracted.
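
The scoring used by `KeywordsGenerator` is not described in this notebook. A common way to pick keywords from tokens is tf-idf, which favours words that are frequent in one email but rare across the corpus. The sketch below assumes that approach; `top_keywords` is a hypothetical function and expects `tokens` to be one of the documents of `corpus`.

```python
import math
from collections import Counter


def top_keywords(tokens, corpus, n_max_keywords=4):
    """Return the tokens with the highest tf-idf score, best first."""
    n_docs = len(corpus)
    # Number of documents in which each token appears.
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    term_freq = Counter(tokens)
    scores = {
        token: term_freq[token] * math.log(n_docs / doc_freq[token])
        for token in term_freq
    }
    # Sort by decreasing score, then alphabetically to break ties.
    ranked = sorted(scores, key=lambda token: (-scores[token], token))
    return ranked[:n_max_keywords]


corpus = [
    ["devis", "habitation", "maison", "devis"],
    ["resiliation", "contrat", "vehicule"],
    ["devis", "vehicule", "contrat"],
]
print(top_keywords(corpus[0], corpus, n_max_keywords=2))
```

Here "devis" appears twice in the first document but also in another one, so the rarer words "habitation" and "maison" rank higher.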

In [42]:
from melusine.summarizer.keywords_generator import KeywordsGenerator

In [43]:
keywords_generator = KeywordsGenerator(n_max_keywords=4)

In [44]:
df_emails = keywords_generator.fit_transform(df_emails)

In [45]:
df_emails.clean_body[23]

'veuillez recevoir le certificat de cession de mon vehicule afin que vous puissiez effectuer la resiliation de mon contrat. je reviendrai vers vous afin dassurer mon nouveau vehicule bientot.'

In [46]:
df_emails.tokens[23]

['veuillez',
 'recevoir',
 'certificat',
 'cession',
 'vehicule',
 'afin',
 'puissiez',
 'effectuer',
 'resiliation',
 'contrat',
 'reviendrai',
 'vers',
 'afin',
 'dassurer',
 'nouveau',
 'vehicule',
 'bientot']

In [47]:
df_emails.keywords[23]

['veuillez', 'vehicule', 'afin', 'nouveau']

## Classification with neural networks

Melusine provides a NeuralModel class to train, save, load and run predictions with any Keras-based neural network.
Predefined RNN and CNN architectures that use the cleaned body and the metadata of the emails are also provided.

#### Embeddings training

Embeddings have to be pretrained on the dataset before being passed as an argument to the neural networks.

In [48]:
from melusine.nlp_tools.embedding import Embedding

In [49]:
pretrained_embedding = Embedding(input_column='clean_body',
                                 workers=1,
                                 min_count=5)

In [50]:
pretrained_embedding.train(df_emails) 

27/05 04:07 - melusine.nlp_tools.embedding - INFO - Start training for embedding
27/05 04:07 - melusine.nlp_tools.embedding - INFO - Done.


#### Preparing X and y

In [51]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [52]:
X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [53]:
X.columns

Index(['clean_body', 'extension__0', 'extension__1', 'extension__2',
       'extension__3', 'extension__4', 'extension__5', 'extension__6',
       'extension__7', 'extension__8', 'extension__9', 'dayofweek__0',
       'dayofweek__1', 'dayofweek__2', 'dayofweek__3', 'dayofweek__4',
       'dayofweek__5', 'dayofweek__6', 'hour__6', 'hour__8', 'hour__9',
       'hour__10', 'hour__11', 'hour__12', 'hour__14', 'hour__15', 'hour__16',
       'hour__17', 'hour__18', 'hour__19', 'hour__20', 'hour__22', 'min__2',
       'min__3', 'min__4', 'min__6', 'min__7', 'min__9', 'min__10', 'min__11',
       'min__12', 'min__15', 'min__16', 'min__19', 'min__22', 'min__28',
       'min__30', 'min__32', 'min__33', 'min__36', 'min__37', 'min__38',
       'min__39', 'min__40', 'min__44', 'min__45', 'min__49', 'min__52',
       'min__54', 'min__56', 'min__58'],
      dtype='object')

In [54]:
X.head()

Unnamed: 0,clean_body,extension__0,extension__1,extension__2,extension__3,extension__4,extension__5,extension__6,extension__7,extension__8,...,min__38,min__39,min__40,min__44,min__45,min__49,min__52,min__54,min__56,min__58
0,je suis client chez vous pouvez vous m etablir...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,je vous informe que la nouvelle immatriculatio...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,suite a notre conversation telephonique de fl...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,je fais suite a votre mail. j'ai envoye mon bu...,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,voici ci joint mon bulletin de salaire comme d...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
y

array([ 4, 10,  3,  0,  0,  4,  7, 10,  1, 10,  2,  5, 10, 10,  4,  7,  7,
       10,  0,  9,  4, 10,  4,  7, 10, 10,  6,  7,  3,  8, 10, 10, 10,  4,
        7,  3,  5,  4,  4, 10])
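
The integers in `y` come from label encoding: each distinct label is mapped to its position in the sorted list of labels, and the mapping can be reversed after prediction. The sketch below shows that mapping with plain Python. `fit_label_encoding` is a hypothetical helper, and "autre" is a made-up label used only for this example.

```python
def fit_label_encoding(labels):
    """Return the sorted classes and a label-to-index mapping."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


labels = ["vehicule", "habitation", "vehicule", "autre"]
classes, to_index = fit_label_encoding(labels)

encoded = [to_index[label] for label in labels]
decoded = [classes[index] for index in encoded]

print(classes)
print(encoded)
print(decoded)
```

Keeping the fitted encoder around matters because it is the only way to turn predicted integers back into readable labels.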

#### Training and prediction with a CNN

In [56]:
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

In [57]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_body",
                       meta_input_list=['extension', 'dayofweek','hour', 'min'],
                       n_epochs=10)

In [58]:
nn_model.fit(X,y)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [59]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res

array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)

#### Using a dict instead of a DataFrame as input (performance optimization)

In a production context, a trained model might be fed input data one email at a time.
In that case, creating a pandas DataFrame with a single row is overkill, and a large performance gain can be obtained by using a dict instead of a DataFrame.


Melusine is developed to ensure dict compatibility, as shown in the code below.

In [60]:
import copy

# ============== Test dict compatibility ==============
dict_emails = df_emails.to_dict(orient='records')[0]
dict_meta = MetadataPipeline.transform(dict_emails)
dict_keywords = keywords_generator.transform(dict_emails)

dict_input = copy.deepcopy(dict_meta)
dict_input['clean_body'] = dict_emails['clean_body']

dict_result = nn_model.predict(dict_input)
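
One way to picture this dict compatibility is a row-level function that only relies on key access, so it accepts a dict as readily as a DataFrame row. The sketch below is an illustration under that assumption; `sender_domain` is a hypothetical function and is not taken from Melusine.

```python
def sender_domain(row):
    """Return the domain of the sender.

    Only `row["from"]` is used, so any row-like object that supports
    key access (a dict, or a pandas Series) can be passed in.
    """
    address = row["from"]
    return address.rsplit("@", 1)[-1].rstrip(">").strip().lower()


email = {
    "from": "Monsieur Dupont <monsieurdupont@extensiona.com>",
    "body": "Bonjour",
}
print(sender_domain(email))
```

Functions written this way avoid building a one-row DataFrame for each incoming email.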