# Full pipeline (quick)

This notebook walks through the full pipeline, including the preprocessing, summarization, and classification steps.

## Loading the dataset as a pandas DataFrame

Melusine operates on pandas DataFrames by applying functions to certain columns to produce new columns, so the initial columns must follow a strict naming convention.

The basic requirement for using Melusine is an input email DataFrame with the following columns:
- body: Body of an email (single message or conversation history)
- header: Header of an email
- date: Reception date of an email
- from: Email address of the sender
- to (optional): Email address of the recipient
- attachment (optional): List of filenames attached to the email
- label (optional): Label of the email for a classification task (examples: Business, Spam, Finance or Family)

Each row corresponds to a unique email.
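As a quick sanity check before running the pipeline, the column requirement can be verified with plain pandas. The sketch below is only an illustration: `missing_required_columns` is a hypothetical helper (not part of Melusine), and the date string format is an arbitrary choice for the example.

```python
import pandas as pd

# Column names taken from the list above; the helper itself is hypothetical.
REQUIRED_COLUMNS = ["body", "header", "date", "from"]
OPTIONAL_COLUMNS = ["to", "attachment", "label"]


def missing_required_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df, in list order."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# One row per email; the values are illustrative.
emails = pd.DataFrame(
    {
        "body": ["Bonjour, veuillez trouver ci-joint la carte grise."],
        "header": ["Immatriculation voiture"],
        "date": ["2018-05-25 06:21:00"],
        "from": ["conseiller1@societeimaginaire.fr"],
        "to": ["demandes@societeimaginaire.fr"],
        "attachment": [["pj.pdf"]],
        "label": ["vehicule"],
    }
)

print(missing_required_columns(emails))  # []
print(missing_required_columns(emails.drop(columns=["date"])))  # ['date']
```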

In [1]:
from melusine.data.data_loader import load_email_data
import ast

df_emails = load_email_data()
df_emails['attachment'] = df_emails['attachment'].apply(ast.literal_eval)

In [2]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label'],
      dtype='object')

In [3]:
print('Body :')
print(df_emails.body[1])
print('\n')
print('Header :')
print(df_emails.header[1])
print('Date :')
print(df_emails.date[1])
print('From :')
print(df_emails.loc[1,"from"])
print('To :')
print(df_emails.to[1])
print('Attachment :')
print(df_emails.attachment[1])
print('Label :')
print(df_emails.label[1])

Body :
 
  
  
  
 ----- Transféré par Conseiller le 25/05/2018 08:20 ----- 
  
 De :	Dupont <monsieurdupont@extensiona.com> 
 A :	conseiller@Societeimaginaire.fr 
 Date :	24/05/2018 19:37 
 Objet :	Immatriculation voiture 
  
  
  
 Bonsoir madame, 
  
 Je vous informe que la nouvelle immatriculation est enfin 
 faite. Je vous remercie bien pour votre patience. 
 Je vous prie de trouver donc la carte grise ainsi que la 
 nouvelle immatriculation. Je vous demanderai de faire les changements 
 nécessaires concernant l’assurance. 
 Je vous remercie encore pour tout. 
 Cordialement, 
 Monsieur Dupont (See attached file: pj.pdf)


Header :
Tr : Immatriculation voiture
Date :
vendredi 25 mai 2018 06 h 21 CEST
From :
conseiller1@societeimaginaire.fr
To :
demandes@societeimaginaire.fr
Attachment :
['pj.pdf']
Label :
vehicule


## Text preprocessing pipeline

This pipeline will:
- Update the columns of the DataFrame if an email is transferred.
- Segment the different messages of an email and tag their parts (hello, body, greetings, footer, etc.).
- Extract the body of the last message of the email.
- Clean the body of the last message of the email.
- Apply the phraser to the cleaned body (this step is commented out in the code of this notebook).
- Tokenize the cleaned body (once the phraser has been applied).

The pipeline adds new columns at each step, the most important being:
- **clean_body:** the body (without hello, greetings, signature, footers, etc.) of the last message of an email, after cleaning and application of the phraser. This column will be used to train the embeddings and the neural networks.
- **tokens:** clean_body after tokenization. This column will be used for keyword extraction.
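The pattern behind these steps (apply a function to each row, store the result in a new column, pass the enriched DataFrame to the next step) can be sketched with plain pandas. This is not Melusine's implementation: `apply_steps`, `first_segment_toy`, and `clean_toy` are hypothetical stand-ins, and the steps are simplified to `(function, output_column)` pairs.

```python
import pandas as pd


def first_segment_toy(row: pd.Series) -> str:
    """Toy stand-in: keep the text before the first '-----' marker."""
    return row["body"].split("-----")[0].strip()


def clean_toy(row: pd.Series) -> str:
    """Toy stand-in: lowercase the last body and collapse whitespace."""
    return " ".join(row["last_body"].lower().split())


def apply_steps(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Apply each (function, output_column) pair row by row, in order."""
    df = df.copy()
    for func, output_column in steps:
        df[output_column] = df.apply(func, axis=1)
    return df


emails = pd.DataFrame(
    {"body": ["Bonjour,\n  Merci  Beaucoup ----- Transféré par X -----"]}
)
result = apply_steps(
    emails,
    [(first_segment_toy, "last_body"), (clean_toy, "clean_body")],
)
print(result["clean_body"][0])  # bonjour, merci beaucoup
```

Because each step only reads existing columns and writes a new one, the order of the steps matters: `clean_toy` relies on the `last_body` column created by the step before it.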

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from melusine.utils.multiprocessing import apply_by_multiprocessing
from melusine.utils.transformer_scheduler import TransformerScheduler

from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer

from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body
from melusine.prepare_email.cleaning import clean_header

from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body
from melusine.nlp_tools.phraser import phraser_on_header

from melusine.nlp_tools.tokenizer import Tokenizer

In [5]:
# Transformer object to manage transfers and replies
ManageTransferReply = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)

# Transformer object to segment the different messages in the email, parse their metadata and
# tag the different parts of the messages
Segmenting = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

# Transformer object to extract the body of the last message of the email and clean it as 
# well as the header
LastBodyHeaderCleaning = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body']),
        (clean_header, None, ['clean_header'])
    ]
)

# Transformer object to apply the phraser on the texts
# phraser = Phraser().load('./data/phraser.pickle')
# PhraserTransformer = TransformerScheduler(
#     functions_scheduler=[
#         (phraser_on_body, (phraser,), ['clean_body']),
#         (phraser_on_header, (phraser,), ['clean_header'])
#     ]
# )

# Tokenizer object
tokenizer = Tokenizer(input_column="clean_body")

# Full preprocessing pipeline
PreprocessingPipeline = Pipeline([
    ('ManageTransferReply', ManageTransferReply),
    ('Segmenting', Segmenting),
    ('LastBodyHeaderCleaning', LastBodyHeaderCleaning),
    # ('PhraserTransformer', PhraserTransformer),
    ('tokenizer', tokenizer)
])

In [6]:
df_emails = PreprocessingPipeline.fit_transform(df_emails)

In [7]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer',
       'structured_historic', 'structured_body', 'last_body', 'clean_body',
       'clean_header', 'tokens'],
      dtype='object')

## Metadata preprocessing pipeline

Metadata must be extracted before being dummified (one-hot encoded).

This pipeline extracts the following metadata:
- **extension:** from the "from" column.
- **dayofweek:** from the date.
- **hour:** from the date.
- **min:** from the date.
- **attachment_type:** from the attachment column.
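To make the extract-then-dummify idea concrete, here is a minimal pandas sketch. It is an approximation under stated assumptions, not Melusine's code: the extension is taken to be the domain after `@`, the dates are already parsed timestamps, and the dummy columns are named after raw values rather than the integer codes that appear in the notebook's output.

```python
import pandas as pd

# Illustrative data; dates are given in ISO format for simplicity.
emails = pd.DataFrame(
    {
        "from": ["dupont@extensiona.com", "martin@extensionb.fr"],
        "date": pd.to_datetime(["2018-05-25 06:21", "2018-05-28 19:37"]),
        "attachment": [["pj.pdf"], ["pic.jpg"]],
    }
)

meta = pd.DataFrame(index=emails.index)
# Domain part of the sender address (assumed meaning of "extension").
meta["extension"] = emails["from"].str.split("@").str[-1]
# Monday is 0 and Sunday is 6 in pandas.
meta["dayofweek"] = emails["date"].dt.dayofweek
meta["hour"] = emails["date"].dt.hour
meta["min"] = emails["date"].dt.minute
# File extension of the first attachment, or "none" when the list is empty.
meta["attachment_type"] = emails["attachment"].apply(
    lambda names: names[0].rsplit(".", 1)[-1] if names else "none"
)

# One-hot encode every metadata column ("dummification").
dummies = pd.get_dummies(meta.astype(str), prefix_sep="__")
print(sorted(dummies.columns))
```

Casting to `str` before calling `pd.get_dummies` makes the numeric columns (day, hour, minute) categorical, so each distinct value gets its own indicator column.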

In [8]:
from sklearn.pipeline import Pipeline
from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import MetaAttachmentType
from melusine.prepare_email.metadata_engineering import Dummifier

In [9]:
# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('MetaAttachmentType',MetaAttachmentType()),
    ('Dummifier', Dummifier())
])

In [10]:
df_meta = MetadataPipeline.fit_transform(df_emails)

In [11]:
df_meta.columns

Index(['extension__0', 'extension__1', 'extension__2', 'extension__3',
       'extension__4', 'extension__5', 'extension__6', 'extension__7',
       'extension__8', 'extension__9', 'dayofweek__0', 'dayofweek__1',
       'dayofweek__2', 'dayofweek__3', 'dayofweek__4', 'dayofweek__5',
       'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11', 'hour__12',
       'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18', 'hour__19',
       'hour__20', 'hour__22', 'min__2', 'min__3', 'min__4', 'min__6',
       'min__7', 'min__9', 'min__10', 'min__11', 'min__12', 'min__15',
       'min__16', 'min__19', 'min__22', 'min__28', 'min__30', 'min__32',
       'min__33', 'min__36', 'min__37', 'min__38', 'min__39', 'min__40',
       'min__44', 'min__45', 'min__49', 'min__52', 'min__54', 'min__56',
       'min__58', 'attachment_type__0', 'attachment_type__1',
       'attachment_type__2', 'attachment_type__3', 'attachment_type__4',
       'attachment_type__5', 'attachment_type__6'],
      dtype='object')

In [12]:
df_meta.head()

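| | extension__0 | extension__1 | extension__2 | extension__3 | extension__4 | extension__5 | extension__6 | extension__7 | extension__8 | extension__9 | ... | min__54 | min__56 | min__58 | attachment_type__0 | attachment_type__1 | attachment_type__2 | attachment_type__3 | attachment_type__4 | attachment_type__5 | attachment_type__6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |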


## Keywords extraction

Once a tokens column exists, keywords can be extracted with the KeywordsGenerator class:

In [13]:
from melusine.summarizer.keywords_generator import KeywordsGenerator

In [14]:
keywords_generator = KeywordsGenerator(n_max_keywords=4)

In [15]:
df_emails = keywords_generator.fit_transform(df_emails)

In [16]:
print(df_emails.body[23])

 
  
  
  
 Bonjour , 
  
 Veuillez recevoir le certificat de cession de mon véhicule afin que vous 
 puissiez effectuer la résiliation de mon contrat. 
 Je reviendrai vers vous afin d’assurer mon nouveau véhicule bientôt. 
  
 Bien à vous , 
  
 Mr DUPONT 
  
  
  
 (Embedded image moved to file: pic.jpg) 
  
  
 Envoyé de mon iPad


In [17]:
df_emails.clean_body[23]

'veuillez recevoir le certificat de cession de mon vehicule afin que vous puissiez effectuer la resiliation de mon contrat. je reviendrai vers vous afin dassurer mon nouveau vehicule bientot.'

In [18]:
df_emails.tokens[23]

['veuillez',
 'recevoir',
 'certificat',
 'cession',
 'vehicule',
 'afin',
 'puissiez',
 'effectuer',
 'resiliation',
 'contrat',
 'reviendrai',
 'vers',
 'afin',
 'dassurer',
 'nouveau',
 'vehicule',
 'bientot']

In [19]:
df_emails.keywords[23]

['veuillez', 'vehicule', 'afin', 'nouveau']
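The notebook does not show how keywords are scored. As one plausible way to rank tokens, the sketch below uses a simple TF-IDF score in pure Python. It is an assumption for illustration only and will not reproduce the KeywordsGenerator output shown above; `top_keywords` is a hypothetical function.

```python
import math
from collections import Counter


def top_keywords(token_lists, n_max_keywords=4):
    """Rank each document's tokens by a simple TF-IDF score."""
    n_docs = len(token_lists)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    keywords = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        scores = {
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda token: (-scores[token], token))
        keywords.append(ranked[:n_max_keywords])
    return keywords


# Illustrative token lists, not the notebook's data.
docs = [
    ["certificat", "cession", "vehicule", "vehicule"],
    ["contrat", "vehicule", "resiliation"],
    ["bulletin", "salaire", "contrat"],
]
print(top_keywords(docs, n_max_keywords=2)[0])  # ['certificat', 'cession']
```

In this toy corpus, "vehicule" appears twice in the first document but also occurs in another document, so the rarer tokens "certificat" and "cession" outrank it.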

## Classification with neural networks

Melusine provides a NeuralModel class to train, save, load, and run predictions with any Keras-based neural network.
It also provides predefined RNN and CNN architectures that use the cleaned body and the metadata of the emails.

#### Embeddings training

Embeddings must be pretrained on the dataset before they can be passed as an argument to the neural networks.

In [20]:
from melusine.nlp_tools.embedding import Embedding

In [21]:
pretrained_embedding = Embedding(input_column='clean_body',
                                 workers=1,
                                 min_count=5)

In [22]:
pretrained_embedding.train(df_emails) 
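The `min_count=5` argument deserves attention on a small dataset. Assuming it has the usual word-embedding meaning (tokens whose total corpus frequency is below the threshold are dropped from the vocabulary), its effect can be previewed with a few lines of pure Python. `vocabulary_with_min_count` is a hypothetical helper, and the assumption about how Melusine's Embedding interprets `min_count` is not confirmed by this notebook.

```python
from collections import Counter


def vocabulary_with_min_count(token_lists, min_count):
    """Keep tokens whose total corpus frequency is at least min_count."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, count in counts.items() if count >= min_count}


# Illustrative corpus, not the notebook's data.
corpus = [
    ["vehicule", "contrat"],
    ["vehicule", "assurance"],
    ["vehicule", "contrat"],
]
print(sorted(vocabulary_with_min_count(corpus, min_count=2)))  # ['contrat', 'vehicule']
```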

#### X and y preparation

In [23]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [24]:
X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [25]:
X.columns

Index(['clean_body', 'extension__0', 'extension__1', 'extension__2',
       'extension__3', 'extension__4', 'extension__5', 'extension__6',
       'extension__7', 'extension__8', 'extension__9', 'dayofweek__0',
       'dayofweek__1', 'dayofweek__2', 'dayofweek__3', 'dayofweek__4',
       'dayofweek__5', 'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11',
       'hour__12', 'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18',
       'hour__19', 'hour__20', 'hour__22', 'min__2', 'min__3', 'min__4',
       'min__6', 'min__7', 'min__9', 'min__10', 'min__11', 'min__12',
       'min__15', 'min__16', 'min__19', 'min__22', 'min__28', 'min__30',
       'min__32', 'min__33', 'min__36', 'min__37', 'min__38', 'min__39',
       'min__40', 'min__44', 'min__45', 'min__49', 'min__52', 'min__54',
       'min__56', 'min__58', 'attachment_type__0', 'attachment_type__1',
       'attachment_type__2', 'attachment_type__3', 'attachment_type__4',
       'attachment_type__5', 'attachment_type__6'],
      dtype='object')

In [26]:
X.head()

| | clean_body | extension__0 | extension__1 | extension__2 | extension__3 | extension__4 | extension__5 | extension__6 | extension__7 | extension__8 | ... | min__54 | min__56 | min__58 | attachment_type__0 | attachment_type__1 | attachment_type__2 | attachment_type__3 | attachment_type__4 | attachment_type__5 | attachment_type__6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | je suis client chez vous pouvez vous m etablir... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | je vous informe que la nouvelle immatriculatio... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | suite a notre conversation telephonique de mar... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | je fais suite a votre mail. j'ai envoye mon bu... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | voici ci joint mon bulletin de salaire comme d... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


In [27]:
y

array([ 4, 10,  3,  0,  0,  4,  7, 10,  1, 10,  2,  5, 10, 10,  4,  7,  7,
       10,  0,  9,  4, 10,  4,  7, 10, 10,  6,  7,  3,  8, 10, 10, 10,  4,
        7,  3,  5,  4,  4, 10])
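The integer codes in `y` come from scikit-learn's `LabelEncoder`, which sorts the unique labels and uses each label's position in that sorted list as its code. The short example below shows the round trip on made-up labels; apart from "vehicule", the label names are hypothetical and not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration.
labels = ["vehicule", "habitation", "vehicule", "bilan", "habitation"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the code is the index into it.
print(list(encoder.classes_))  # ['bilan', 'habitation', 'vehicule']
print(encoded.tolist())  # [2, 1, 2, 0, 1]
print(encoder.inverse_transform([0, 2]).tolist())  # ['bilan', 'vehicule']
```

Keeping the fitted encoder around matters: the same object is needed later to map predicted integer codes back to label names with `inverse_transform`.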

#### Training and predictions with a CNN

In [28]:
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

In [29]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_body",
                       meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
                       n_epochs=10)

In [30]:
nn_model.fit(X,y)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [31]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res

array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)
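Every prediction above is 'vehicule', and the predictions are made on the same rows used for training. A simple way to judge such output is to compare the model's accuracy with that of a baseline that always predicts the most frequent label. The sketch below uses hypothetical labels, not the notebook's data; `accuracy` and `majority_baseline_accuracy` are illustrative helpers.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels, not the notebook's data.
y_true = ["vehicule", "habitation", "vehicule", "bilan"]
y_pred = ["vehicule", "vehicule", "vehicule", "vehicule"]

print(accuracy(y_true, y_pred))  # 0.5
print(majority_baseline_accuracy(y_true))  # 0.5
```

When the two numbers are equal, as here, the classifier has learned nothing beyond the class distribution. Evaluating on held-out emails rather than the training rows gives a more honest estimate.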