# Metadata preprocessing tutorial

Melusine's **prepare_email.metadata_engineering subpackage** provides classes to preprocess email metadata:
- **MetaExtension:** a transformer which creates an 'extension' feature by applying a regex to the metadata. It extracts the extension of the sender's email address.
- **MetaDate:** a transformer which creates new features from dates: hour, min and dayofweek.
- **Dummifier:** a transformer which dummifies (one-hot encodes) categorical features.

All the classes have **fit_transform** methods.

### Input dataframe

- To use a **MetaExtension** transformer, the dataframe requires a **from** column.
- To use a **MetaDate** transformer, the dataframe requires a **date** column.

In [1]:
from melusine.data.data_loader import load_email_data

df_emails = load_email_data()
df_emails = df_emails[['from','date']]

In [2]:
df_emails['from']

0                    conseiller1@societeimaginaire.fr
1                    conseiller1@societeimaginaire.fr
2     Monsieur Dupont <monsieurdupont@extensiona.com>
3     Monsieur Dupont <monsieurdupont@extensiond.com>
4     Monsieur Dupont <monsieurdupont@extensiona.com>
5     Monsieur Dupont <monsieurdupont@extensiona.com>
6       Conseiller <conseiller1@societeimaginaire.fr>
7     Monsieur Dupont <monsieurdupont@extensiona.com>
8     Monsieur Dupont <monsieurdupont@extensione.com>
9     Monsieur Dupont <monsieurdupont@extensionb.com>
10                    conseiller@societeimaginaire.fr
11                      monsieurdupont@extensionf.net
12                    conseiller@societeimaginaire.fr
13    Monsieur Dupont <monsieurdupont@extensiona.com>
14                    conseiller@societeimaginaire.fr
15    Monsieur Dupont <monsieurdupont@extensionc.com>
16    Monsieur Dupont <monsieurdupont@extensionb.com>
17    Monsieur Dupont <monsieurdupont@extensiona.com>
18    Monsieur Dupont <monsi

In [3]:
df_emails['date']

0        jeudi 24 mai 2018 11 h 49 CEST
1     vendredi 25 mai 2018 06 h 21 CEST
2     vendredi 25 mai 2018 06 h 45 CEST
3     vendredi 25 mai 2018 10 h 15 CEST
4     vendredi 25 mai 2018 17 h 30 CEST
5        jeudi 31 mai 2018 10 h 28 CEST
6        jeudi 31 mai 2018 12 h 24 CEST
7        jeudi 31 mai 2018 14 h 02 CEST
8        jeudi 31 mai 2018 17 h 10 CEST
9        jeudi 31 mai 2018 08 h 54 CEST
10       jeudi 31 mai 2018 12 h 00 CEST
11       jeudi 31 mai 2018 12 h 44 CEST
12       lundi 4 juin 2018 09 h 56 CEST
13       lundi 4 juin 2018 14 h 09 CEST
14       lundi 4 juin 2018 09 h 20 CEST
15       lundi 4 juin 2018 10 h 22 CEST
16       lundi 4 juin 2018 15 h 39 CEST
17       lundi 4 juin 2018 15 h 49 CEST
18       lundi 4 juin 2018 18 h 04 CEST
19       lundi 4 juin 2018 20 h 45 CEST
20       lundi 4 juin 2018 22 h 28 CEST
21       lundi 4 juin 2018 10 h 29 CEST
22       lundi 4 juin 2018 10 h 38 CEST
23       lundi 4 juin 2018 11 h 19 CEST
24       lundi 4 juin 2018 10 h 58 CEST


### MetaExtension transformer

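A **MetaExtension transformer** creates an *extension* feature by applying a regex to the **from** column: it extracts the extension of each sender's email address. As the output below shows, each distinct extension is encoded as an integer.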
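To make the idea concrete, here is a minimal plain-Python sketch of regex extraction followed by integer encoding. It is not Melusine's implementation: the pattern, the helper names `get_extension` and `encode_labels`, and the choice to capture the domain name between `@` and the first dot are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: capture the domain name between "@" and the first dot.
EXTENSION_PATTERN = re.compile(r"@([^.@>\s]+)\.")


def get_extension(sender):
    """Return the domain name found in a 'from' value, or '' if none is found."""
    match = EXTENSION_PATTERN.search(sender)
    return match.group(1).lower() if match else ""


def encode_labels(values):
    """Map each distinct value to an integer code, in alphabetical order."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


senders = [
    "conseiller1@societeimaginaire.fr",
    "Monsieur Dupont <monsieurdupont@extensiona.com>",
    "Monsieur Dupont <monsieurdupont@extensiond.com>",
    "monsieurdupont@extensionf.net",
]
extensions = [get_extension(sender) for sender in senders]
extension_codes = encode_labels(extensions)
```

Because the codes depend on the set of values seen, the same extension can receive a different integer on a different dataset, which is one reason to fit such an encoder once and reuse it.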

In [4]:
from melusine.prepare_email.metadata_engineering import MetaExtension

meta_extension = MetaExtension()

In [5]:
df_emails = meta_extension.fit_transform(df_emails)

In [6]:
df_emails.extension

0     8
1     8
2     0
3     3
4     0
5     0
6     8
7     0
8     4
9     1
10    8
11    5
12    8
13    0
14    8
15    2
16    1
17    0
18    0
19    6
20    6
21    8
22    8
23    6
24    0
25    8
26    8
27    8
28    0
29    0
30    8
31    8
32    8
33    1
34    6
35    7
36    0
37    0
38    1
39    8
Name: extension, dtype: int64

### MetaDate transformer

A **MetaDate transformer** parses the **date** column into timestamps and creates new features from them: **hour**, **min** (the minute) and **dayofweek** (0 for Monday, so a Thursday gives 3).
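The sketch below shows, in plain Python, how a date string in the format used by this dataset could be parsed and split into the three features. It is an illustration only: `FRENCH_MONTHS`, `parse_french_date` and `date_features` are hypothetical names, and dropping the time zone token is an assumption based on the timezone-free timestamp shown in the output further down.

```python
from datetime import datetime

# French month names mapped to month numbers, so parsing does not depend
# on the system locale.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(text):
    """Parse strings such as 'jeudi 24 mai 2018 11 h 49 CEST'."""
    # Tokens: weekday name, day, month name, year, hour, 'h', minute, time zone.
    _weekday, day, month, year, hour, _h, minute, *_rest = text.split()
    # The time zone token is ignored in this sketch.
    return datetime(
        int(year), FRENCH_MONTHS[month.lower()], int(day), int(hour), int(minute)
    )


def date_features(moment):
    """Return the hour, minute and day of week (Monday is 0) of a datetime."""
    return {
        "hour": moment.hour,
        "min": moment.minute,
        "dayofweek": moment.weekday(),
    }


parsed = parse_french_date("jeudi 24 mai 2018 11 h 49 CEST")
features = date_features(parsed)
```

For the first email of the dataset this yields hour 11, minute 49 and day of week 3, the same values the notebook cells below display.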

In [7]:
from melusine.prepare_email.metadata_engineering import MetaDate

meta_date = MetaDate()

In [8]:
df_emails = meta_date.fit_transform(df_emails)

In [9]:
df_emails.date[0]

Timestamp('2018-05-24 11:49:00')

In [10]:
df_emails.hour[0]

11

In [11]:
df_emails.loc[0,'min']

49

In [12]:
df_emails.dayofweek[0]

3

### Dummifier transformer

A **Dummifier transformer** dummifies categorical features: each distinct value of a column becomes its own 0/1 column.

Its arguments are:
- **columns_to_dummify**: a list of the metadata columns to dummify.
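As a rough illustration of what dummifying means, the toy class below one-hot encodes a list of dicts in plain Python. `SimpleDummifier` is a hypothetical stand-in, not Melusine's `Dummifier`; in particular, the way it treats values unseen during `fit` (all dummy columns set to 0) is an assumption, not documented behavior.

```python
class SimpleDummifier:
    """Toy one-hot encoder over lists of dicts (hypothetical, for illustration)."""

    def __init__(self, columns_to_dummify):
        self.columns_to_dummify = columns_to_dummify
        self.dummy_columns_ = []

    def fit(self, rows):
        # Remember every (column, value) pair seen during fit, values sorted.
        self.dummy_columns_ = [
            (column, value)
            for column in self.columns_to_dummify
            for value in sorted({row[column] for row in rows})
        ]
        return self

    def transform(self, rows):
        # One output column per pair learned in fit, named '<column>__<value>'.
        return [
            {
                f"{column}__{value}": int(row[column] == value)
                for column, value in self.dummy_columns_
            }
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


rows = [
    {"extension": 8, "hour": 11},
    {"extension": 0, "hour": 6},
    {"extension": 8, "hour": 6},
]
dummifier_sketch = SimpleDummifier(columns_to_dummify=["extension", "hour"])
encoded = dummifier_sketch.fit_transform(rows)
# A value never seen during fit (extension 3) activates no dummy column.
unseen = dummifier_sketch.transform([{"extension": 3, "hour": 11}])
```

Storing the columns at `fit` time keeps the output layout stable between training and new data. With a pandas dataframe, `pandas.get_dummies` with `prefix_sep='__'` produces column names in the same `<column>__<value>` style.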

In [13]:
from melusine.prepare_email.metadata_engineering import Dummifier

dummifier = Dummifier(columns_to_dummify=['extension', 'dayofweek', 'hour', 'min'])

In [14]:
df_meta = dummifier.fit_transform(df_emails)

In [15]:
df_meta.columns

Index(['extension__0', 'extension__1', 'extension__2', 'extension__3',
       'extension__4', 'extension__5', 'extension__6', 'extension__7',
       'extension__8', 'dayofweek__0', 'dayofweek__1', 'dayofweek__3',
       'dayofweek__4', 'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11',
       'hour__12', 'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18',
       'hour__20', 'hour__22', 'min__0', 'min__2', 'min__4', 'min__6',
       'min__9', 'min__10', 'min__11', 'min__12', 'min__15', 'min__16',
       'min__19', 'min__20', 'min__21', 'min__22', 'min__24', 'min__28',
       'min__29', 'min__30', 'min__32', 'min__33', 'min__37', 'min__38',
       'min__39', 'min__40', 'min__44', 'min__45', 'min__49', 'min__54',
       'min__56', 'min__58'],
      dtype='object')

In [16]:
df_meta.head()

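| | extension__0 | extension__1 | extension__2 | extension__3 | extension__4 | extension__5 | extension__6 | extension__7 | extension__8 | dayofweek__0 | ... | min__37 | min__38 | min__39 | min__40 | min__44 | min__45 | min__49 | min__54 | min__56 | min__58 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |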


### Custom metadata transformer

A custom transformer can be implemented to extract metadata from a column:

```python
from sklearn.base import BaseEstimator, TransformerMixin

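class MetaDataCustom(BaseEstimator, TransformerMixin):
    """Transformer which creates custom metadata.

    Compatible with the scikit-learn API.
    """

    def __init__(self):
        """Store any arguments the transformer needs here."""

    def fit(self, X, y=None):
        """Nothing to learn: return the transformer unchanged."""
        return self

    def transform(self, X):
        """Add a 'custom_metadata' column computed from X['column']."""
        # 'column' is a placeholder: use the name of your input column.
        X['custom_metadata'] = X['column'].apply(self.get_metadata)
        return X

    @staticmethod
    def get_metadata(value):
        """Placeholder: replace with your own extraction logic."""
        return value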
```

The name of the output column can then be given as an argument to a Dummifier transformer:

```python
dummifier = Dummifier(columns_to_dummify=['custom_metadata'])
```
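The custom transformer applies a function to each value of the input column. As a purely hypothetical example of such a function, the sketch below flags whether a **from** value carries a display name in front of the address; the name `has_display_name` and the rule it implements are assumptions chosen for illustration.

```python
def has_display_name(sender):
    """Return 1 if a 'from' value looks like 'Name <address>', else 0."""
    sender = sender.strip()
    # A display name is assumed when the address is wrapped in angle brackets
    # and some text precedes the opening bracket.
    return int(
        sender.endswith(">") and "<" in sender and not sender.startswith("<")
    )


with_name = has_display_name("Monsieur Dupont <monsieurdupont@extensiona.com>")
without_name = has_display_name("conseiller1@societeimaginaire.fr")
```

Inside a custom transformer this would be applied with `X['custom_metadata'] = X['from'].apply(has_display_name)`, producing a binary column that can be used directly or passed to a Dummifier.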