# Prepare_email tutorial

The Melusine **prepare_email subpackage** groups several preprocessing subpackages. Each of them provides preprocessing functions that are meant to be applied in a particular order.

**The functions are all designed to be applied on rows of dataframes. They should be wrapped in a TransformerScheduler object before being integrated into an execution Pipeline.**

In [48]:
from melusine.data.data_loader import load_email_data
df_emails = load_email_data()

In [49]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label'],
      dtype='object')

## Manage_transfer_reply subpackage

The manage_transfer_reply subpackage provides several functions to preprocess the transfers and replies contained in the body of an email. All of these functions are designed to be applied on rows of dataframes.

### check_mail_begin_by_transfer function

**check_mail_begin_by_transfer** returns True if the *body* starts with a match of the 'begin_transfer' regex, and False otherwise.
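To make the idea concrete, here is a minimal standard-library sketch of such a check. The `BEGIN_TRANSFER` pattern and the `starts_with_transfer` helper are hypothetical simplifications written for this example; the actual 'begin_transfer' regex used by Melusine may differ. Rows are represented as plain dictionaries rather than dataframe rows.

```python
import re

# Hypothetical, simplified pattern: optional leading whitespace followed by a
# "----- Transféré par ..." banner. This is NOT Melusine's actual regex.
BEGIN_TRANSFER = re.compile(r"^\s*-+\s*Transféré par\b", re.IGNORECASE)


def starts_with_transfer(row):
    """Return True if row['body'] begins with a transfer banner."""
    return BEGIN_TRANSFER.search(row["body"]) is not None


forwarded = {"body": "\n \n ----- Transféré par Conseiller le 24/05/2018 11:49 -----\n Bonjour"}
direct = {"body": "Madame, Monsieur,\n Je vous avais contactés"}

print(starts_with_transfer(forwarded))  # True
print(starts_with_transfer(direct))     # False
```

Because the pattern is anchored at the start of the string, a banner that appears later in the body does not count as a direct transfer.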

In [50]:
from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer

In [51]:
row_with_direct_transfer = df_emails.loc[0,:]
print(row_with_direct_transfer.body)
print('\n')
print(check_mail_begin_by_transfer(row_with_direct_transfer))

 
  
  
  
 ----- Transféré par Conseiller le 24/05/2018 11:49 ----- 
  
 De :	Dupont <monsieurdupont@extensiona.com> 
 A :	conseiller@Societeimaginaire.fr 
 Cc :	Societe@www.Societe.fr 
 Date :	24/05/2018 11:36 
 Objet :	Devis habitation 
  
  
  
 Bonjour 
 Je suis client chez vous 
 Pouvez vous m établir un devis pour mon fils qui souhaite 
 louer l’appartement suivant : 
 25 rue du rueimaginaire 77000 
 Merci 
 Envoyé de mon iPhone


True


In [52]:
row_without_direct_transfer = df_emails.loc[5,:]
print(row_without_direct_transfer.body)
print('\n')
print(check_mail_begin_by_transfer(row_without_direct_transfer))

 Madame, Monsieur, 
 
 Je vous avais contactés car j'avais pour 
 projet d'agrandir ma maison. J'avais reçu un devis pour lequel je n'avais 
 pas donné suite, les travaux n'étant pas encore réalisés. 
  
 Le projet a maintenant été porté à son terme et je voudrais donc revoir 
 votre offre si possible. 
  
 Je désire garder le même type de contrat. 
 Je suis à votre disposition pour tout renseignement complémentaires. 
  
 Sincères salutations 
 Monsieur Dupont 
  


False


### update_info_for_transfer_mail function

**update_info_for_transfer_mail** extracts and updates information from an email when the value of the **is_begin_by_transfer** column, as returned by the **check_mail_begin_by_transfer** function, is True.

The information is extracted from the **body** column to update the following columns:
- **header**
- **from**
- **to**
- **date**

The **body** column is then cleaned of the extracted information.
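The extraction step can be illustrated with a small standard-library sketch. Everything below is an assumption made for illustration: the `FIELD` and `BLOCK` patterns, the `COLUMNS` mapping and the `update_from_transfer` helper are hypothetical and much simpler than what a real implementation has to handle. Rows are plain dictionaries.

```python
import re

# Hypothetical pattern for one "Label : value" line of a forwarded-mail block.
FIELD = re.compile(r"^\s*(De|A|Date|Objet)\s*:\s*(.+?)\s*$", re.MULTILINE)
# Hypothetical pattern for the whole block: from the banner to the "Objet" line.
BLOCK = re.compile(
    r"^\s*-+\s*Transféré par.*?^\s*Objet\s*:.*?$", re.MULTILINE | re.DOTALL
)
# Mapping from the French labels to the column names used in this tutorial.
COLUMNS = {"De": "from", "A": "to", "Date": "date", "Objet": "header"}


def update_from_transfer(row):
    """Return a copy of row whose metadata comes from the forwarded block."""
    if not row.get("is_begin_by_transfer"):
        return dict(row)
    block = BLOCK.search(row["body"])
    if block is None:
        return dict(row)
    updated = dict(row)
    for label, value in FIELD.findall(block.group(0)):
        updated[COLUMNS[label]] = value
    # Keep only what follows the forwarded block.
    updated["body"] = row["body"][block.end():]
    return updated


row = {
    "body": (
        "\n ----- Transféré par Conseiller le 24/05/2018 11:49 ----- \n"
        " \n"
        " De :\tDupont <monsieurdupont@extensiona.com> \n"
        " A :\tconseiller@Societeimaginaire.fr \n"
        " Date :\t24/05/2018 11:36 \n"
        " Objet :\tDevis habitation \n"
        " \n"
        " Bonjour \n"
    ),
    "header": "Tr : Devis habitation",
    "from": "conseiller1@societeimaginaire.fr",
    "to": "demandes@societeimaginaire.fr",
    "date": "jeudi 24 mai 2018 11 h 49 CEST",
    "is_begin_by_transfer": True,
}
updated = update_from_transfer(row)
print(updated["header"], "|", updated["from"], "|", updated["to"], "|", updated["date"])
```

The sketch returns a new dictionary instead of modifying its input, which makes it easy to compare the row before and after the update.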

In [53]:
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail

In [54]:
row_with_direct_transfer = df_emails.loc[0,:].copy()
print(row_with_direct_transfer.body)
print('\n')
print(row_with_direct_transfer.header)
print(row_with_direct_transfer.date)
print(row_with_direct_transfer['from'])
print(row_with_direct_transfer.to)

 
  
  
  
 ----- Transféré par Conseiller le 24/05/2018 11:49 ----- 
  
 De :	Dupont <monsieurdupont@extensiona.com> 
 A :	conseiller@Societeimaginaire.fr 
 Cc :	Societe@www.Societe.fr 
 Date :	24/05/2018 11:36 
 Objet :	Devis habitation 
  
  
  
 Bonjour 
 Je suis client chez vous 
 Pouvez vous m établir un devis pour mon fils qui souhaite 
 louer l’appartement suivant : 
 25 rue du rueimaginaire 77000 
 Merci 
 Envoyé de mon iPhone


Tr : Devis habitation
jeudi 24 mai 2018 11 h 49 CEST
conseiller1@societeimaginaire.fr
demandes@societeimaginaire.fr


In [55]:
row_with_direct_transfer['is_begin_by_transfer'] = check_mail_begin_by_transfer(row_with_direct_transfer)
row_with_direct_transfer['is_begin_by_transfer']

True

In [56]:
row_with_direct_transfer = update_info_for_transfer_mail(row_with_direct_transfer)

In [57]:
print(row_with_direct_transfer.body)
print('\n')
print(row_with_direct_transfer.header)
print(row_with_direct_transfer.date)
print(row_with_direct_transfer['from'])
print(row_with_direct_transfer.to)

 
  
  
  
 Bonjour 
 Je suis client chez vous 
 Pouvez vous m établir un devis pour mon fils qui souhaite 
 louer l’appartement suivant : 
 25 rue du rueimaginaire 77000 
 Merci 
 Envoyé de mon iPhone


Devis habitation
24/05/2018 11:36
Dupont <monsieurdupont@extensiona.com>
conseiller@Societeimaginaire.fr


### add_boolean_answer function

The **add_boolean_answer** function returns True if the **header** column indicates that the email is a reply, and False otherwise.

In [58]:
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer

In [59]:
row_with_answer = df_emails.loc[2,:]
row_with_answer.header

"Re: Envoi d'un document de la Société Imaginaire"

In [60]:
add_boolean_answer(row_with_answer)

True

In [61]:
df_emails['is_answer'] = df_emails.apply(add_boolean_answer, axis=1)
df_emails[['is_answer','body']]

| | is_answer | body |
|---|---|---|
| 0 | False | \n \n \n \n ----- Transféré par Conseiller... |
| 1 | False | \n \n \n \n ----- Transféré par Conseiller... |
| 2 | True | \n \n \n Bonjours, \n \n Suite a notre con... |
| 3 | True | \n \n \n \n \n Bonjour, \n \n \n Je fai... |
| 4 | False | \n \n \n Bonjour, \n Voici ci joint mon bul... |
| 5 | False | Madame, Monsieur, \n \n Je vous avais contact... |
| 6 | False | \n \n \n \n \n \n \n ----- Transféré pa... |
| 7 | False | \n \n \n \n \n Bonjour, \n \n \n \n Je... |
| 8 | True | \n \n \n Bonjour, \n \n Voici la copie du ... |
| 9 | False | \n \n \n \n \n \n \n \n BONJOUR \n \n... |


### add_boolean_transfer function

The **add_boolean_transfer** function returns True if the **header** column indicates that the email is a transfer, and False otherwise.

In [62]:
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer

In [63]:
row_with_transfer = df_emails.loc[6,:]
row_with_transfer.header

"Tr : Assurance d'un nouveau logement"

In [64]:
add_boolean_transfer(row_with_transfer)

True
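The two header checks described above can be sketched with prefix patterns. The patterns and helper names below are hypothetical and chosen only for this example; the library may recognise more prefixes than the ones listed here.

```python
import re

# Hypothetical prefixes; Melusine's own patterns may cover more variants.
ANSWER_PREFIX = re.compile(r"^\s*re\s*:", re.IGNORECASE)
TRANSFER_PREFIX = re.compile(r"^\s*(tr|fwd?)\s*:", re.IGNORECASE)


def is_answer(row):
    """Return True if the header starts with a reply prefix such as 'Re:'."""
    return bool(ANSWER_PREFIX.match(row["header"]))


def is_transfer(row):
    """Return True if the header starts with a transfer prefix such as 'Tr :'."""
    return bool(TRANSFER_PREFIX.match(row["header"]))


reply = {"header": "Re: Envoi d'un document de la Société Imaginaire"}
transfer = {"header": "Tr : Assurance d'un nouveau logement"}
plain = {"header": "Bulletin de salaire"}

for row in (reply, transfer, plain):
    print(row["header"], is_answer(row), is_transfer(row))
```

Requiring a colon after the prefix prevents headers such as "Remboursement" from being flagged as replies.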

### manage_transfer_reply transformer

The functions of the manage_transfer_reply subpackage can be wrapped in a TransformerScheduler object to be applied directly on a dataframe:

In [65]:
from melusine.utils.transformer_scheduler import TransformerScheduler

ManageTransferReplyTransformer = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)

In [66]:
df_emails = load_email_data()

In [67]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label'],
      dtype='object')

In [68]:
df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)

In [69]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer'],
      dtype='object')
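The scheduling pattern used above, a list of `(function, arguments, output columns)` tuples applied in order, can be sketched without any dependency. `RowScheduler`, `strip_body` and `word_count` are hypothetical names written for this example, and the handling of `None` output columns (the function returns the whole updated row) is an assumption based on how **update_info_for_transfer_mail** is used in this tutorial. This is a toy stand-in, not the TransformerScheduler implementation.

```python
class RowScheduler:
    """Toy stand-in for a scheduler of row functions (not Melusine's class)."""

    def __init__(self, functions_scheduler):
        self.functions_scheduler = functions_scheduler

    def transform(self, rows):
        # Work on copies so the caller's rows are left untouched.
        rows = [dict(row) for row in rows]
        for func, args, columns in self.functions_scheduler:
            args = args or ()
            for i, row in enumerate(rows):
                result = func(row, *args)
                if columns is None:
                    # Assumption: the function returned a whole updated row.
                    rows[i] = result
                elif len(columns) == 1:
                    row[columns[0]] = result
                else:
                    row.update(zip(columns, result))
        return rows


def strip_body(row):
    """Return a copy of the row with surrounding whitespace removed from body."""
    updated = dict(row)
    updated["body"] = row["body"].strip()
    return updated


def word_count(row):
    """Return the number of whitespace-separated words in body."""
    return len(row["body"].split())


scheduler = RowScheduler([
    (strip_body, None, None),
    (word_count, None, ["n_words"]),
])
rows = scheduler.transform([{"body": "  Bonjour, merci  "}, {"body": "Devis habitation svp"}])
print(rows)
```

Each function sees the columns produced by the functions scheduled before it, which is why the order of the tuples matters.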

## Build_historic and mail_segmenting subpackages

### build_historic function

The **build_historic subpackage** provides a **build_historic** function that segments the **body** column of an email into its individual messages.

It returns a list of dictionaries, one dictionary per message in inverse chronological order (the first dictionary corresponds to the last message while the last dictionary corresponds to the first message). Each dictionary has two keys:

    {'text': raw text without metadata,
     'meta': metadata
     }

**build_historic** is designed to be applied on rows of dataframes.
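As a rough illustration of this segmentation, the sketch below splits a body on a transition line of the form "Le ..., ... a écrit :". The `TRANSITION` pattern and the `split_history` helper are hypothetical; a real implementation has to recognise many more transition formats.

```python
import re

# Hypothetical transition pattern: a French "Le <date>, <sender> a écrit :" line.
TRANSITION = re.compile(r"^[ \t]*Le\b[^\n]*a écrit\s*:[ \t]*$", re.MULTILINE)


def split_history(body):
    """Split body into [{'text': ..., 'meta': ...}], most recent message first."""
    messages = []
    meta = ""  # the most recent message has no transition line above it
    start = 0
    for match in TRANSITION.finditer(body):
        messages.append({"text": body[start:match.start()], "meta": meta})
        meta = match.group(0).strip()
        start = match.end()
    messages.append({"text": body[start:], "meta": meta})
    return messages


body = (
    " Bonjours, \n"
    " Suite a notre conversation téléphonique de Mardi \n"
    " Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit : \n"
    " Bonjour. \n"
    " Merci de bien vouloir prendre connaissance du document ci-joint \n"
)
history = split_history(body)
for message in history:
    print(repr(message["meta"]))
```

Each transition line becomes the metadata of the message that follows it, so the first dictionary always has empty metadata.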

In [70]:
row = df_emails.loc[2,:].copy()
print(row.body)

 
  
  
 Bonjours, 
  
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la 
 somme que je vous dois afin d'être en régularisation . 
  
 Merci bonne journée 
  
 Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit : 
 Bonjour. 
  
 Merci de bien vouloir prendre connaissance du document ci-joint : 
 1 - Relevé d'identité postal (contrats) 
  
 Cordialement. 
  
 La Mututelle Imaginaire 
  
 La visualisation des fichiers PDF nécessite Adobe Reader. 
  


In [71]:
from melusine.prepare_email.build_historic import build_historic

row['structured_historic'] = build_historic(row)

There is no metadata for the last message.

In [72]:
print('Text of last message :')
print(row['structured_historic'][0]['text'])
print('\n')
print('Metadata of last message :')
print(row['structured_historic'][0]['meta'])
print('\n')
print('Text of first message :')
print(row['structured_historic'][1]['text'])
print('\n')
print('Metadata of first message :')
print(row['structured_historic'][1]['meta'])

Text of last message :
 
  
  
 Bonjours, 
  
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la 
 somme que je vous dois afin d'être en régularisation . 
  
 Merci bonne journée


Metadata of last message :



Text of first message :
 
 Bonjour. 
  
 Merci de bien vouloir prendre connaissance du document ci-joint : 
 1 - Relevé d'identité postal (contrats) 
  
 Cordialement. 
  
 La Mututelle Imaginaire 
  
 La visualisation des fichiers PDF nécessite Adobe Reader. 
  


Metadata of first message :
 
  
 Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit :


### structure_email function

The **mail_segmenting subpackage** provides a **structure_email** function that further segments the messages stored in the **structured_historic** column, which should contain the result of a previous call to **build_historic**:
- meta: the date, from and to components of the metadata are extracted.
- text: the header is separated from the text, and the different parts of the text are segmented and tagged (hello, body, greetings, signature, footer, etc.).

It returns a list of dictionaries, one dictionary per message in inverse chronological order (the first dictionary corresponds to the last message while the last dictionary corresponds to the first message). Each dictionary has two keys:

    {'structured_text': {'header': header of the message,
                         'text': [{'part': first part of the message,
                                   'tags': tag of the first part of the message
                                  },
                                  ...,
                                  {'part': last part of the message,
                                   'tags': tag of the last part of the message
                                  }
                                 ]
                        },
     'meta': {'date': date of the message,
              'from': email address of the author of the message,
              'to': email address of the recipient of the message
             }
    }

**structure_email** is designed to be applied on rows of dataframes.
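The shape of this output can be illustrated with a deliberately naive sketch. The `META` pattern, the `TAG_RULES` list and the helper names are hypothetical, and treating every non-empty line as one part is an assumption made to keep the example short; the real segmentation and tagging rules are richer.

```python
import re

# Hypothetical patterns, far simpler than what a real segmenter needs.
META = re.compile(r"Le\s+(?P<date>.+?),\s*(?P<sender><[^>]+>)\s*a écrit")
TAG_RULES = [
    ("HELLO", re.compile(r"^bonjour", re.IGNORECASE)),
    ("GREETINGS", re.compile(r"^(cordialement|sincères salutations)", re.IGNORECASE)),
]


def tag_part(part):
    """Return the first matching tag, or BODY when no rule applies."""
    for tag, pattern in TAG_RULES:
        if pattern.match(part):
            return tag
    return "BODY"


def structure_message(message):
    """Turn one {'text', 'meta'} dictionary into the nested structure above."""
    match = META.search(message["meta"])
    meta = {
        "date": match.group("date") if match else None,
        "from": match.group("sender") if match else None,
        "to": None,  # the transition line used here carries no recipient
    }
    # Assumption: every non-empty line is one part.
    parts = [line.strip() for line in message["text"].splitlines() if line.strip()]
    text = [{"part": part, "tags": tag_part(part)} for part in parts]
    return {"meta": meta, "structured_text": {"header": None, "text": text}}


message = {
    "text": " Bonjour. \n \n Merci de bien vouloir prendre connaissance du document ci-joint. \n \n Cordialement. \n",
    "meta": "Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit :",
}
structured = structure_message(message)
print(structured["meta"])
for part in structured["structured_text"]["text"]:
    print(part["tags"], ":", part["part"])
```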

In [73]:
from melusine.prepare_email.mail_segmenting import structure_email

row['structured_body'] = structure_email(row)

In [74]:
print('Date of last message :')
print(row['structured_body'][0]['meta']['date'])
print('From of last message :')
print(row['structured_body'][0]['meta']['from'])
print('To of last message :')
print(row['structured_body'][0]['meta']['to'])
print('\n')
print('Header of last message :')
print(row['structured_body'][0]['structured_text']['header'])
print('\n')
print('Segmented text of last message :')
for parts in row['structured_body'][0]['structured_text']['text']:
    print(parts['tags']+" :")
    print(parts['part'])
print('\n')
print('----------------------------------------------------------------------')
print('\n')
print('Date of first message :')
print(row['structured_body'][1]['meta']['date'])
print('From of first message :')
print(row['structured_body'][1]['meta']['from'])
print('To of first message :')
print(row['structured_body'][1]['meta']['to'])
print('\n')
print('Header of first message :')
print(row['structured_body'][1]['structured_text']['header'])
print('\n')
print('Segmented text of first message :')
for parts in row['structured_body'][1]['structured_text']['text']:
    print(parts['tags']+" :")
    print(parts['part'])

Date of last message :
None
From of last message :
None
To of last message :
None


Header of last message :
None


Segmented text of last message :
HELLO :
Bonjours,
BODY :
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la somme que je vous dois afin d'être en régularisation .
GREETINGS :
Merci bonne journée


----------------------------------------------------------------------


Date of first message :
 mar. 22 mai 2018 à 10:20
From of first message :
  <conseiller@Societeimaginaire.fr> 
To of first message :
None


Header of first message :
None


Segmented text of first message :
HELLO :
Bonjour.
BODY :
Merci de bien vouloir prendre connaissance du document ci-joint : 1 - Relevé d'identité postal (contrats) 
GREETINGS :
Cordialement.
BODY :
La Mututelle Imaginaire 
FOOTER :
La visualisation des fichiers PDF nécessite Adobe Reader.


### segmenting transformer

The **build_historic** and **structure_email** functions can be wrapped in a TransformerScheduler object to be applied directly on a dataframe:

In [75]:
SegmentingTransformer = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

In [76]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer'],
      dtype='object')

In [77]:
df_emails = SegmentingTransformer.fit_transform(df_emails)

In [78]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer',
       'structured_historic', 'structured_body'],
      dtype='object')

## Body_header_extraction and cleaning subpackages

### extract_last_body function

The **body_header_extraction subpackage** provides an **extract_last_body** function that extracts, from the **structured_body column** of a row, the parts of the last message that have been tagged as *body*.
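In terms of the nested structure, this amounts to filtering the parts of the first dictionary by tag. The `last_body` helper below is a hypothetical sketch; joining the selected parts with a single space is an assumption.

```python
def last_body(row):
    """Join the BODY-tagged parts of the most recent message of a row."""
    last_message = row["structured_body"][0]  # most recent message comes first
    parts = last_message["structured_text"]["text"]
    return " ".join(p["part"].strip() for p in parts if p["tags"] == "BODY")


row_example = {
    "structured_body": [
        {
            "meta": {"date": None, "from": None, "to": None},
            "structured_text": {
                "header": None,
                "text": [
                    {"part": "Bonjours,", "tags": "HELLO"},
                    {"part": " Suite a notre conversation téléphonique de Mardi", "tags": "BODY"},
                    {"part": "Merci bonne journée", "tags": "GREETINGS"},
                ],
            },
        }
    ]
}
print(last_body(row_example))
```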

In [79]:
print(row.body)

 
  
  
 Bonjours, 
  
 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la 
 somme que je vous dois afin d'être en régularisation . 
  
 Merci bonne journée 
  
 Le mar. 22 mai 2018 à 10:20,  <conseiller@Societeimaginaire.fr> a écrit : 
 Bonjour. 
  
 Merci de bien vouloir prendre connaissance du document ci-joint : 
 1 - Relevé d'identité postal (contrats) 
  
 Cordialement. 
  
 La Mututelle Imaginaire 
  
 La visualisation des fichiers PDF nécessite Adobe Reader. 
  


In [80]:
for parts in row.structured_body[0]['structured_text']['text']:
    if parts['tags']=='BODY':
        print(parts['part'])

 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la somme que je vous dois afin d'être en régularisation .


In [81]:
from melusine.prepare_email.body_header_extraction import extract_last_body

row['last_body'] = extract_last_body(row)

In [82]:
print(row['last_body'])

 Suite a notre conversation téléphonique de Mardi , pourriez vous me dire la somme que je vous dois afin d'être en régularisation . 


### cleaning subpackage

The **cleaning subpackage** provides two functions to be applied on rows of dataframes:
- **clean_body**: cleans the *last_body* column.
- **clean_header**: cleans the *header* column.
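A few typical cleaning steps (accent removal, lowercasing, whitespace normalisation, removal of a reply or transfer prefix in the header) can be sketched with the standard library. The helper names are hypothetical and the set of steps is an assumption; in particular, this sketch does not replace dates or other entities with flags the way the library output shown in this tutorial does.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks after Unicode decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_clean(text):
    """Strip accents, lowercase, and collapse runs of whitespace."""
    text = strip_accents(text).lower()
    return re.sub(r"\s+", " ", text).strip()


def simple_clean_header(header):
    """Drop one leading reply/transfer prefix (hypothetical list), then clean."""
    header = re.sub(r"^\s*(re|tr|fwd?)\s*:\s*", "", header, flags=re.IGNORECASE)
    return simple_clean(header)


print(simple_clean(" Suite a notre conversation   téléphonique "))
print(simple_clean_header("Re: Envoi d'un document de la Société Imaginaire"))
```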

In [83]:
from melusine.prepare_email.cleaning import clean_body

clean_body(row)

"suite a notre conversation telephonique de  flag_date_  , pourriez vous me dire la somme que je vous dois afin d'etre en regularisation ."

In [84]:
from melusine.prepare_email.cleaning import clean_header

clean_header(row)

"envoi d'un document de la societe imaginaire"

### LastBodyHeaderCleaning transformer

The **extract_last_body**, **clean_body** and **clean_header** functions can be wrapped in a TransformerScheduler object to be applied directly on a dataframe:

In [85]:
LastBodyHeaderCleaning = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body']),
        (clean_header, None, ['clean_header'])
    ]
)

In [86]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer',
       'structured_historic', 'structured_body'],
      dtype='object')

In [87]:
df_emails = LastBodyHeaderCleaning.fit_transform(df_emails)

In [88]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer',
       'structured_historic', 'structured_body', 'last_body', 'clean_body',
       'clean_header'],
      dtype='object')

## Full prepare_email pipeline

In [89]:
df_emails = load_email_data()

In [90]:
df_emails.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label'],
      dtype='object')

In [91]:
from sklearn.pipeline import Pipeline

# Transformer object to manage transfers and replies
ManageTransferReply = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)

# Transformer object to segment the different messages in the email, parse their metadata and
# tag the different part of the messages
Segmenting = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

# Transformer object to extract the body of the last message of the email and clean it as 
# well as the header
LastBodyHeaderCleaning = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body']),
        (clean_header, None, ['clean_header'])
    ]
)

# Full prepare_email pipeline
PrepareEmailPipeline = Pipeline([
    ('ManageTransferReply', ManageTransferReply),
    ('Segmenting', Segmenting),
    ('LastBodyHeaderCleaning', LastBodyHeaderCleaning)
])

In [92]:
df_emails = PrepareEmailPipeline.fit_transform(df_emails)

In [93]:
df_emails.head()

| | body | header | date | from | to | attachment | sexe | age | label | is_begin_by_transfer | is_answer | is_transfer | structured_historic | structured_body | last_body | clean_body | clean_header |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n \n \n \n Bonjour \n Je suis client chez... | Devis habitation | 24/05/2018 11:36 | Dupont <monsieurdupont@extensiona.com> | conseiller@Societeimaginaire.fr | [] | F | 35 | habitation | True | False | False | [{'text': ' Bonjour Je suis clien... | [{'meta': {'date': None, 'from': None, 'to': N... | Je suis client chez vous Pouvez vous m établir... | je suis client chez vous pouvez vous m etablir... | devis habitation |
| 1 | \n \n \n \n Bonsoir madame, \n \n Je vous... | Immatriculation voiture | 24/05/2018 19:37 | Dupont <monsieurdupont@extensiona.com> | conseiller@Societeimaginaire.fr | ["pj.pdf"] | M | 32 | vehicule | True | False | False | [{'text': ' Bonsoir madame, Je... | [{'meta': {'date': None, 'from': None, 'to': N... | Je vous informe que la nouvelle immatriculati... | je vous informe que la nouvelle immatriculatio... | immatriculation voiture |
| 2 | \n \n \n Bonjours, \n \n Suite a notre con... | Re: Envoi d'un document de la Société Imaginaire | vendredi 25 mai 2018 06 h 45 CEST | Monsieur Dupont <monsieurdupont@extensiona.com> | demandes@societeimaginaire.fr | [] | M | 66 | compte | False | True | False | [{'text': ' Bonjours, Suite a not... | [{'meta': {'date': None, 'from': None, 'to': N... | Suite a notre conversation téléphonique de Ma... | suite a notre conversation telephonique de fl... | envoi d'un document de la societe imaginaire |
| 3 | \n \n \n \n \n Bonjour, \n \n \n Je fai... | Re: Votre adhésion à la Société Imaginaire | vendredi 25 mai 2018 10 h 15 CEST | Monsieur Dupont <monsieurdupont@extensiond.com> | demandes@societeimaginaire.fr | ["fichedepaie.png"] | M | 50 | adhesion | False | True | False | [{'text': ' Bonjour, Je ... | [{'meta': {'date': None, 'from': None, 'to': N... | Je fais suite à votre mail. J'ai envoyé mon... | je fais suite a votre mail. j'ai envoye mon bu... | votre adhesion a la societe imaginaire |
| 4 | \n \n \n Bonjour, \n Voici ci joint mon bul... | Bulletin de salaire | vendredi 25 mai 2018 17 h 30 CEST | Monsieur Dupont <monsieurdupont@extensiona.com> | demandes@societeimaginaire.fr | ["pj.pdf"] | M | 15 | adhesion | False | False | False | [{'text': ' Bonjour, Voici ci joint ... | [{'meta': {'date': None, 'from': None, 'to': N... | Voici ci joint mon bulletin de salaire comme d... | voici ci joint mon bulletin de salaire comme d... | bulletin de salaire |
