# Metadata preprocessing tutorial

Melusine's **prepare_email.metadata_engineering** subpackage provides classes to preprocess the metadata:
- **MetaExtension:** a transformer which creates an 'extension' feature extracted with a regex from the metadata. It extracts the extensions of email addresses.
- **MetaDate:** a transformer which creates new features from dates, such as hour, minute and dayofweek.
- **MetaAttachmentType:** a transformer which creates an 'attachment type' feature extracted with a regex from the metadata. It extracts the extensions of attached files.
- **Dummifier:** a transformer which dummifies categorical features.

All the classes have **fit_transform** methods.

### Input dataframe

- To use a **MetaExtension** transformer: the dataframe requires a **from** column
- To use a **MetaDate** transformer: the dataframe requires a **date** column
- To use a **MetaAttachmentType** transformer: the dataframe requires an **attachment** column with the list of attached files
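
As a rough illustration of the requirements above, here is a minimal sketch of a dataframe with the three expected columns. The name `df_toy` and all values are made up for the example, and the date strings are placeholders (the date format expected by MetaDate is not specified here). It assumes the attachment lists are stored as strings, which is why `ast.literal_eval` is applied.

```python
import ast

import pandas as pd

# Hypothetical toy data: the real columns come from load_email_data().
df_toy = pd.DataFrame({
    "from": ["alice@example.com", "bob@example.org"],
    # Placeholder date strings (assumption: the real format may differ).
    "date": ["2018-05-28 10:27:00", "2018-05-29 17:05:00"],
    # Assumption: attachment lists are stored as their string representation.
    "attachment": ["['report.pdf', 'photo.jpg']", "[]"],
})

# Turn each string such as "['a.pdf']" into a real Python list.
df_toy["attachment"] = df_toy["attachment"].apply(ast.literal_eval)
```

After this step, each cell of the **attachment** column holds a Python list, possibly empty.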

In [None]:
from melusine.data.data_loader import load_email_data
import ast

df_emails = load_email_data()
df_emails = df_emails[['from','date', 'attachment']]

In [None]:
df_emails['from']

In [None]:
df_emails['date']

In [None]:
df_emails['attachment'] = df_emails['attachment'].apply(ast.literal_eval)
df_emails['attachment']

### MetaExtension transformer

A **MetaExtension transformer** creates an *extension* feature extracted with a regex from the metadata. It extracts the extensions of email addresses found in the **from** column.

In [None]:
from melusine.prepare_email.metadata_engineering import MetaExtension

meta_extension = MetaExtension()

In [None]:
df_emails = meta_extension.fit_transform(df_emails)

In [None]:
df_emails.extension
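
To give an idea of what this kind of extraction involves, the sketch below reproduces it with plain pandas. It is not Melusine's implementation: it assumes that "extension" means the domain name found right after `@`, and the regex and the sample addresses are illustrative only.

```python
import pandas as pd

# Made-up sender addresses; the last one contains no "@".
emails = pd.Series(["alice@example.com", "bob@mail.example.org", "no-address"])

# Assumption: the extension is the text between "@" and the first dot.
# Rows without a match become NaN.
extension = emails.str.extract(r"@([^.@\s]+)", expand=False)
```

Rows where the pattern does not match are left as missing values, so they would need a default value before being dummified.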

### MetaDate transformer

A **MetaDate transformer** creates new features from the **date** column: *hour*, *min* (minute) and *dayofweek*.

In [None]:
from melusine.prepare_email.metadata_engineering import MetaDate

meta_date = MetaDate()

In [None]:
df_emails = meta_date.fit_transform(df_emails)

In [None]:
df_emails.date[0]

In [None]:
df_emails.hour[0]

In [None]:
df_emails.loc[0,'min']

In [None]:
df_emails.dayofweek[0]
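
The same three features can be sketched with pandas datetime accessors. This is only an approximation of the idea, not MetaDate itself: the input strings are made up, and the example assumes they are already in a format that `pd.to_datetime` can parse.

```python
import pandas as pd

# Made-up timestamps in ISO format (assumption: real data may need parsing).
dates = pd.to_datetime(pd.Series(["2018-05-28 10:27:00", "2018-06-02 17:05:00"]))

features = pd.DataFrame({
    "hour": dates.dt.hour,
    "min": dates.dt.minute,
    "dayofweek": dates.dt.dayofweek,  # pandas convention: Monday=0, Sunday=6
})
```

The column names mirror the ones used in this tutorial (`hour`, `min`, `dayofweek`).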

### MetaAttachmentType transformer

A **MetaAttachmentType transformer** creates an *attachment_type* feature extracted from a list of attachment names. It extracts the extensions of the attached files.

In [None]:
from melusine.prepare_email.metadata_engineering import MetaAttachmentType

meta_pj = MetaAttachmentType()

In [None]:
df_emails = meta_pj.fit_transform(df_emails)

In [None]:
df_emails.attachment_type
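
For intuition, a comparable extraction can be written with the standard library. The helper `attachment_types` below is hypothetical and is not part of Melusine; lowercasing the extensions and dropping names without an extension are assumptions made for the example.

```python
import os

import pandas as pd

# Made-up attachment lists: two files, a file without extension, no files.
attachments = pd.Series([["report.pdf", "photo.JPG"], ["notes"], []])


def attachment_types(names):
    """Return the lowercase extension of each file name, without the dot."""
    exts = [os.path.splitext(name)[1].lstrip(".").lower() for name in names]
    # Assumption: names without an extension are simply dropped.
    return [ext for ext in exts if ext]


attachment_type = attachments.apply(attachment_types)
```

Each row ends up with a list of extensions, so the feature is multi-valued rather than a single category.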

### Dummifier transformer

A **Dummifier transformer** dummifies categorical features, i.e. it encodes each category as its own indicator column.

Its arguments are:
- **columns_to_dummify**: a list of the metadata columns to dummify.

In [None]:
from melusine.prepare_email.metadata_engineering import Dummifier
dummifier = Dummifier(columns_to_dummify=['extension','attachment_type', 'dayofweek', 'hour', 'min'])

In [None]:
df_meta = dummifier.fit_transform(df_emails)

In [None]:
df_meta.columns

In [None]:
df_meta.head()
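
Dummification is the same idea as one-hot encoding. The sketch below shows it with `pd.get_dummies` on a made-up dataframe; it stands in for the Dummifier only for illustration, and the resulting column names may differ from those produced by Melusine.

```python
import pandas as pd

# Made-up metadata with one text feature and one numeric feature.
df = pd.DataFrame({
    "extension": ["gmail", "yahoo", "gmail"],
    "hour": [10, 17, 10],
})

# One indicator column is created per (column, value) pair.
dummies = pd.get_dummies(df, columns=["extension", "hour"])
```

Here `dummies` has four columns: `extension_gmail`, `extension_yahoo`, `hour_10` and `hour_17`.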

In [None]:
df_meta.to_csv('./data/metadata.csv', index=False, encoding='utf-8', sep=';')

### Custom metadata transformer

A custom transformer can be implemented to extract metadata from a column:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class MetaDataCustom(BaseEstimator, TransformerMixin):
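    """Transformer which creates custom metadata.

    Compatible with scikit-learn API.
    """

    def __init__(self):
        """Store the arguments of the transformer here."""

    def fit(self, X, y=None):
        """Fit method: nothing to learn, return the transformer itself."""
        return self

    def get_metadata(self, value):
        """Compute the metadata for one cell; replace with your own logic."""
        raise NotImplementedError

    def transform(self, X):
        """Add a 'custom_metadata' column computed from X['column']."""
        X['custom_metadata'] = X['column'].apply(self.get_metadata)
        return X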
```

The name of the output column can then be given as an argument to a Dummifier transformer:

```python
dummifier = Dummifier(columns_to_dummify=['custom_metadata'])
```
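
As a concrete instance of the template above, here is a minimal sketch of a custom transformer. The class name `MetaHeaderLength`, the `header` column, the `threshold` parameter and the sample data are all hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MetaHeaderLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: labels a text column as 'short' or 'long'."""

    def __init__(self, column="header", threshold=20):
        self.column = column        # assumed name of the input column
        self.threshold = threshold  # assumed length cut-off, in characters

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def get_metadata(self, text):
        # Categorical output, so it can be dummified afterwards.
        return "long" if len(str(text)) > self.threshold else "short"

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's dataframe
        X["custom_metadata"] = X[self.column].apply(self.get_metadata)
        return X


df = pd.DataFrame({"header": ["Hi", "Request for a copy of my insurance contract"]})
df = MetaHeaderLength().fit_transform(df)
```

The resulting `custom_metadata` column contains `short` for the first row and `long` for the second, and can be passed to a Dummifier through `columns_to_dummify`.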