# Metadata preprocessing tutorial

The Melusine **prepare_email.metadata_engineering subpackage** provides classes to preprocess the metadata:
- **MetaExtension:** a transformer which creates an 'extension' feature extracted from the metadata with a regex. It extracts the extensions of email addresses.
- **MetaDate:** a transformer which creates new features from dates, such as `hour`, `min` and `dayofweek`.
- **MetaAttachmentType:** a transformer which creates an 'attachment type' feature extracted from the metadata with a regex. It extracts the extensions of attached files.
- **Dummifier:** a transformer which dummifies categorical features.

All the classes have **fit_transform** methods.

### Input dataframe

- To use a **MetaExtension** transformer, the dataframe requires a **from** column.
- To use a **MetaDate** transformer, the dataframe requires a **date** column.
- To use a **MetaAttachmentType** transformer, the dataframe requires an **attachment** column with the list of attached files.

In [1]:
from melusine.data.data_loader import load_email_data
import ast

df_emails = load_email_data(type="preprocessed")

In [2]:
df_emails['from'].head(2)

0    Dupont <monsieurdupont@extensiona.com>
1    Dupont <monsieurdupont@extensiona.com>
Name: from, dtype: object

In [3]:
df_emails['date'].head(2)

0    24/05/2018 11:36
1    24/05/2018 19:37
Name: date, dtype: object

In [4]:
df_emails['attachment'].head(2)

0            []
1    ["pj.pdf"]
Name: attachment, dtype: object

### MetaExtension transformer

A **MetaExtension transformer** creates an *extension* feature extracted from the metadata with a regex. It extracts the extensions of email addresses.

In [5]:
from melusine.prepare_email.metadata_engineering import MetaExtension

meta_extension = MetaExtension()

In [6]:
df_emails = meta_extension.fit_transform(df_emails)

In [7]:
df_emails["extension"].head(5)

0    1
1    1
2    1
3    4
4    1
Name: extension, dtype: int64
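
The idea behind the *extension* feature can be sketched with a plain regular expression. This is only an illustration, not Melusine's implementation: the pattern and the helper name `extract_extension` are assumptions. The integer values shown above suggest that the transformer also encodes the extracted value as a category code, which this sketch does not attempt.

```python
import re

# Hypothetical pattern: capture what follows '@' up to '>' or whitespace.
EXTENSION_PATTERN = re.compile(r"@([^\s>]+)")

def extract_extension(sender):
    """Return the part of the address after '@', or None if no address is found."""
    match = EXTENSION_PATTERN.search(sender)
    return match.group(1).lower() if match else None

print(extract_extension("Dupont <monsieurdupont@extensiona.com>"))  # extensiona.com
```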

### MetaDate transformer

A **MetaDate transformer** parses the **date** column and creates new features from it: `hour`, `min` and `dayofweek`.

In [8]:
from melusine.prepare_email.metadata_engineering import MetaDate

meta_date = MetaDate()

In [9]:
df_emails = meta_date.fit_transform(df_emails)

In [10]:
df_emails.date[0]

Timestamp('2018-05-24 11:36:00')

In [11]:
df_emails.hour[0]

11

In [12]:
df_emails.dayofweek[0]

3
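
For readers who want to see what these features correspond to, the following sketch derives them with plain pandas. It is an approximation under assumptions, not the MetaDate source code: the date format string is inferred from the sample values shown earlier.

```python
import pandas as pd

# Two sample values in the same day/month/year format as the 'date' column above.
df = pd.DataFrame({"date": ["24/05/2018 11:36", "24/05/2018 19:37"]})

# Parse the strings explicitly so that day and month are not swapped.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M")

# Derive the calendar features from the parsed timestamps.
df["hour"] = df["date"].dt.hour
df["min"] = df["date"].dt.minute
df["dayofweek"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df.loc[0, ["hour", "min", "dayofweek"]].tolist())  # [11, 36, 3]
```

The value 3 for `dayofweek` corresponds to a Thursday, which matches the output above for 24 May 2018.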

### MetaAttachmentType transformer

A **MetaAttachmentType transformer** creates an *attachment_type* feature extracted from a list of attachment names. It extracts the extensions of the attached files.

In [13]:
from melusine.prepare_email.metadata_engineering import MetaAttachmentType

meta_pj = MetaAttachmentType()

In [14]:
df_emails = meta_pj.fit_transform(df_emails)

In [15]:
df_emails.attachment_type.head(2)

0    [1]
1    [0]
Name: attachment_type, dtype: object
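
The extraction step can be pictured with the standard library alone. The helper below is hypothetical (`attachment_extensions` is not a Melusine function) and returns raw extension strings, whereas the transformer output above contains integer codes. It also assumes that a cell may hold a string such as `'["pj.pdf"]'` rather than a Python list.

```python
import ast
import os

def attachment_extensions(raw):
    """Return the lower-cased file extensions found in a list of attachment names."""
    # Assumption: the cell may be a string representation of a list.
    names = ast.literal_eval(raw) if isinstance(raw, str) else raw
    # splitext("pj.pdf") gives ("pj", ".pdf"); drop the leading dot.
    return [os.path.splitext(name)[1].lstrip(".").lower() for name in names]

print(attachment_extensions('["pj.pdf"]'))  # ['pdf']
print(attachment_extensions("[]"))          # []
```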

### Dummifier transformer

A **Dummifier transformer** dummifies categorical features, i.e. it creates one binary indicator column per category.

Its arguments are:
- **columns_to_dummify**: a list of the metadata columns to dummify.

In [16]:
from melusine.prepare_email.metadata_engineering import Dummifier
dummifier = Dummifier(columns_to_dummify=['extension','attachment_type', 'dayofweek', 'hour', 'min'])

In [17]:
df_meta = dummifier.fit_transform(df_emails)

In [18]:
df_meta.columns

Index(['extension__0', 'extension__1', 'extension__2', 'extension__3',
       'extension__4', 'extension__5', 'extension__6', 'extension__7',
       'extension__8', 'extension__9', 'dayofweek__0', 'dayofweek__1',
       'dayofweek__2', 'dayofweek__3', 'dayofweek__4', 'dayofweek__5',
       'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11', 'hour__12',
       'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18', 'hour__19',
       'hour__20', 'hour__22', 'min__2', 'min__3', 'min__4', 'min__6',
       'min__7', 'min__9', 'min__10', 'min__11', 'min__12', 'min__15',
       'min__16', 'min__19', 'min__22', 'min__28', 'min__30', 'min__32',
       'min__33', 'min__36', 'min__37', 'min__38', 'min__39', 'min__40',
       'min__44', 'min__45', 'min__49', 'min__52', 'min__54', 'min__56',
       'min__58', 'attachment_type__0', 'attachment_type__1'],
      dtype='object')

In [19]:
df_meta.head()

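| | extension__0 | extension__1 | extension__2 | extension__3 | extension__4 | extension__5 | extension__6 | extension__7 | extension__8 | extension__9 | ... | min__40 | min__44 | min__45 | min__49 | min__52 | min__54 | min__56 | min__58 | attachment_type__0 | attachment_type__1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |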
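
The `<column>__<value>` naming seen in these columns can be reproduced with `pandas.get_dummies`. The sketch below is an approximation of the dummification step, not the Dummifier source code, and the sample values are invented for illustration. A list-valued column such as `attachment_type` is handled here by exploding it, dummifying, and summing the indicators back per original row, which is one possible approach among others.

```python
import pandas as pd

df = pd.DataFrame({
    "hour": [11, 19, 11],
    "attachment_type": [[1], [0], [0, 1]],  # list-valued, hypothetical values
})

# Scalar column: one indicator column per distinct value, named '<column>__<value>'.
hour_dummies = pd.get_dummies(df["hour"], prefix="hour", prefix_sep="__", dtype=int)

# List-valued column: one row per list element, dummify, then sum per original row.
exploded = df["attachment_type"].explode()
type_dummies = (
    pd.get_dummies(exploded, prefix="attachment_type", prefix_sep="__", dtype=int)
    .groupby(level=0)
    .sum()
)

df_dummies = pd.concat([hour_dummies, type_dummies], axis=1)
print(df_dummies.columns.tolist())
# ['hour__11', 'hour__19', 'attachment_type__0', 'attachment_type__1']
```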


### Combine meta features with the emails DataFrame

In [20]:
import pandas as pd
df_full = pd.concat([df_emails,df_meta],axis=1)

### Custom metadata transformer

A custom transformer can be implemented to extract metadata from a column:

```python
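from sklearn.base import BaseEstimator, TransformerMixin

class MetaDataCustom(BaseEstimator, TransformerMixin):
    """Transformer which creates custom metadata.

    Compatible with scikit-learn API.
    """

    def __init__(self):
        """Declare the transformer arguments here (none in this template)."""

    def fit(self, X, y=None):
        """Fit method: nothing to learn, so return the transformer itself."""
        return self

    def transform(self, X):
        """Transform method: add a 'custom_metadata' column to X."""
        # 'column' is a placeholder for the name of the source column.
        X['custom_metadata'] = X['column'].apply(self.get_metadata)
        return X

    @staticmethod
    def get_metadata(value):
        """Placeholder: replace the body with your own extraction logic."""
        return value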
```

The name of the output column can then be passed to a Dummifier transformer through **columns_to_dummify**:

```python
dummifier = Dummifier(columns_to_dummify=['custom_metadata'])
```
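
As a concrete illustration of the template, here is a small, purely hypothetical transformer that counts the attachments of each email. The class name `MetaAttachmentCount` and its `column` argument are assumptions made for this example, and it assumes the column holds real Python lists rather than their string representation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MetaAttachmentCount(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: counts the attachments of each email."""

    def __init__(self, column="attachment"):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        X["custom_metadata"] = X[self.column].apply(len)
        return X

df = pd.DataFrame({"attachment": [[], ["pj.pdf"], ["a.png", "b.pdf"]]})
df = MetaAttachmentCount().fit_transform(df)
print(df["custom_metadata"].tolist())  # [0, 1, 2]
```

Copying the DataFrame in `transform` is a design choice of this sketch: it avoids modifying the input in place, at the cost of extra memory.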