# Models subpackage tutorial

The NeuralModel class is a generic class used to manage neural networks implemented with Keras. It provides methods to train, save and load a network, and to use it for classification.

Melusine provides two built-in Keras models, cnn_model and rnn_model, based on the models used in-house at Maif. However, users are free to implement neural networks tailored to their needs.

## The dataset

The NeuralModel class can take as input either :
- a text input : a cleaned text, usually the cleaned body or the concatenation of the cleaned body and the cleaned header.
- a text input and a metadata input : the metadata input has to be dummified (one-hot encoded).

#### Text input 

In [1]:
import ast
import pandas as pd

df_emails_preprocessed = pd.read_csv('./data/emails_preprocessed.csv', encoding='utf-8', sep=';')
df_emails_preprocessed['clean_header'] = df_emails_preprocessed['clean_header'].astype(str)
df_emails_preprocessed['clean_body'] = df_emails_preprocessed['clean_body'].astype(str)
df_emails_preprocessed['attachment'] = df_emails_preprocessed['attachment'].apply(ast.literal_eval)

In [2]:
df_emails_preprocessed.columns

Index(['body', 'header', 'date', 'from', 'to', 'attachment', 'sexe', 'age',
       'label', 'is_begin_by_transfer', 'is_answer', 'is_transfer',
       'structured_historic', 'structured_body', 'last_body', 'clean_body',
       'clean_header', 'tokens'],
      dtype='object')

The new clean_text column is the concatenation of the clean_header column and the clean_body column :

In [3]:
df_emails_preprocessed['clean_text'] = df_emails_preprocessed['clean_header'] + " " + df_emails_preprocessed['clean_body']



In [4]:
df_emails_preprocessed.clean_text[0]

'devis habitation je suis client chez vous pouvez vous m etablir un devis pour mon fils qui souhaite louer lappartement suivant : 25 rue du rueimaginaire  flag_cp_ '
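Note that `astype(str)` turns a missing value into the literal string `'nan'`, which then ends up in the concatenated text. A minimal sketch of one way to avoid this, assuming missing headers or bodies should simply be treated as empty; the toy DataFrame below is hypothetical and not part of the tutorial data:

```python
import pandas as pd

# Toy frame with a missing header (hypothetical data, for illustration only).
df = pd.DataFrame({
    "clean_header": ["devis habitation", None],
    "clean_body": ["je suis client chez vous", "bonjour"],
})

# Replace missing values with empty strings before concatenating,
# then strip the separator left over when one side is empty.
df["clean_text"] = (
    df["clean_header"].fillna("") + " " + df["clean_body"].fillna("")
).str.strip()

print(df["clean_text"].tolist())
```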

#### Metadata input

By default the metadata used are :
- the email address extension : gmail, outlook, wanadoo, etc.
- the day of the week on which the email was sent
- the hour at which the email was sent
- the minute at which the email was sent
- the attachment types : pdf, png, etc.
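As an illustration of what "dummified" means here, the sketch below one-hot encodes two raw metadata columns with pandas. It is only a sketch under assumptions: `df_raw_meta` and its values are hypothetical, and Melusine's own metadata pipeline may encode categories as integers first, which would explain column names such as `extension__0` in the file loaded below.

```python
import pandas as pd

# Hypothetical raw metadata, one row per email (values made up for illustration).
df_raw_meta = pd.DataFrame({
    "extension": ["gmail", "outlook", "gmail"],
    "dayofweek": [0, 3, 0],
})

# One-hot encode each column; prefix_sep="__" mimics the naming used in this tutorial.
df_dummies = pd.get_dummies(
    df_raw_meta, columns=["extension", "dayofweek"], prefix_sep="__", dtype=int
)

print(list(df_dummies.columns))
```

Each original column is replaced by one 0/1 column per distinct value, so every row has exactly one 1 per metadata group.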

In [5]:
df_meta = pd.read_csv('./data/metadata.csv', encoding='utf-8', sep=';')

In [6]:
df_meta.columns

Index(['extension__0', 'extension__1', 'extension__2', 'extension__3',
       'extension__4', 'extension__5', 'extension__6', 'extension__7',
       'extension__8', 'dayofweek__0', 'dayofweek__1', 'dayofweek__3',
       'dayofweek__4', 'hour__6', 'hour__8', 'hour__9', 'hour__10', 'hour__11',
       'hour__12', 'hour__14', 'hour__15', 'hour__16', 'hour__17', 'hour__18',
       'hour__20', 'hour__22', 'min__0', 'min__2', 'min__4', 'min__6',
       'min__9', 'min__10', 'min__11', 'min__12', 'min__15', 'min__16',
       'min__19', 'min__20', 'min__21', 'min__22', 'min__24', 'min__28',
       'min__29', 'min__30', 'min__32', 'min__33', 'min__37', 'min__38',
       'min__39', 'min__40', 'min__44', 'min__45', 'min__49', 'min__54',
       'min__56', 'min__58', 'attachment_type__0', 'attachment_type__1',
       'attachment_type__2', 'attachment_type__3', 'attachment_type__4',
       'attachment_type__5', 'attachment_type__6'],
      dtype='object')

In [7]:
df_meta.head()

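| | extension__0 | extension__1 | extension__2 | extension__3 | extension__4 | extension__5 | extension__6 | extension__7 | extension__8 | dayofweek__0 | ... | min__54 | min__56 | min__58 | attachment_type__0 | attachment_type__1 | attachment_type__2 | attachment_type__3 | attachment_type__4 | attachment_type__5 | attachment_type__6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |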


#### Defining X and y

X is a pandas DataFrame made of a clean_text column, used for the text input, and of the columns containing the dummified metadata.

In [8]:
X = pd.concat([df_emails_preprocessed['clean_text'],df_meta],axis=1)
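Because `pd.concat(..., axis=1)` aligns rows on the index, the text and metadata frames need to share the same index, otherwise rows filled with NaN appear. A small sanity check could look like the sketch below; the toy `text` and `meta` objects are hypothetical stand-ins for `df_emails_preprocessed['clean_text']` and `df_meta`:

```python
import pandas as pd

# Toy stand-ins for the text column and the metadata frame (hypothetical values).
text = pd.Series(["devis habitation", "sinistre vehicule"], name="clean_text")
meta = pd.DataFrame({"extension__0": [1, 0], "dayofweek__0": [0, 1]})

# pd.concat(axis=1) aligns rows on the index, so both inputs should share it.
assert text.index.equals(meta.index), "text and metadata indexes differ"

X_toy = pd.concat([text, meta], axis=1)

# A misaligned index would have introduced extra rows and NaN values.
assert len(X_toy) == len(text)
assert not X_toy.isna().any().any()

print(X_toy.shape)
```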

y is a numpy array containing the encoded labels :

In [9]:
from sklearn.preprocessing import LabelEncoder
y = df_emails_preprocessed['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
y

array([ 4, 10,  3,  0,  0,  4,  7, 10,  1, 10,  2,  5, 10, 10,  4,  7,  7,
       10,  0,  9,  4, 10,  4,  7, 10, 10,  6,  7,  3,  8, 10, 10, 10,  4,
        7,  3,  5,  4,  4, 10])
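To see how the integer codes relate to the original labels, the sketch below runs `LabelEncoder` on a few toy labels (the label list is made up for illustration). The fitted `classes_` attribute is sorted, and each code is the position of the label in `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, for illustration only.
labels = ["vehicule", "habitation", "vehicule", "resiliation"]

le_toy = LabelEncoder()
y_toy = le_toy.fit_transform(labels)

# classes_ holds the sorted distinct labels; codes index into it.
print(list(le_toy.classes_))
print(list(y_toy))

# inverse_transform maps the integer codes back to the original labels.
decoded = list(le_toy.inverse_transform(y_toy))
print(decoded)
```

Keeping the fitted encoder around (here `le`) is what makes it possible to decode the model predictions later in this tutorial.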

## The NeuralModel class

In [11]:
from melusine.models.train import NeuralModel

NeuralModel wraps a Keras network and provides methods to train it, save it, load it and use it for classification.

Its arguments are :
- **architecture_function :** a function returning a Model instance from Keras.
- **pretrained_embedding :** the pretrained embedding matrix as a numpy array.
- **text_input_column :** the name of the column that will provide the text input, by default clean_text.
- **meta_input_list :** the list of the names of the columns containing the metadata. If it is an empty list or None, the model is used without metadata. Default value, ['extension', 'dayofweek', 'hour', 'min'].
- **vocab_size :** the size of the vocabulary for the neural network model. Default value, 25000.
- **seq_size :** the maximum size of the input for the neural network model. Default value, 100.
- **loss :** the loss function for training. Default value, 'categorical_crossentropy'.
- **batch_size :** the size of batches for the training of the neural network model. Default value, 4096.
- **n_epochs :** the number of epochs for the training of the neural network model. Default value, 15.
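To make the roles of **vocab_size** and **seq_size** more concrete, here is a minimal pure-Python sketch of the general idea: keep only the most frequent tokens, map them to integer ids, then truncate or pad each sequence to a fixed length. This is not Melusine's implementation; the helper names `build_vocabulary` and `encode_tokens`, the use of id 0 for padding and unknown tokens, and the padding side are all assumptions of this sketch.

```python
from collections import Counter


def build_vocabulary(token_lists, vocab_size):
    """Keep the vocab_size most frequent tokens and give each an id >= 1."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    most_common = counts.most_common(vocab_size)
    return {token: i for i, (token, _) in enumerate(most_common, start=1)}


def encode_tokens(tokens, vocabulary, seq_size):
    """Map tokens to ids, then truncate or right-pad with 0 to seq_size."""
    # Id 0 stands for both padding and out-of-vocabulary tokens in this sketch.
    ids = [vocabulary.get(token, 0) for token in tokens][:seq_size]
    return ids + [0] * (seq_size - len(ids))


# Toy corpus: "devis" appears 3 times, "habitation" twice, "client" once.
corpus = [["devis", "habitation", "devis"], ["devis", "habitation", "client"]]
vocabulary = build_vocabulary(corpus, vocab_size=2)

print(vocabulary)  # {'devis': 1, 'habitation': 2}
print(encode_tokens(["devis", "client", "habitation"], vocabulary, seq_size=5))  # [1, 0, 2, 0, 0]
print(encode_tokens(["devis", "client", "habitation"], vocabulary, seq_size=2))  # [1, 0]
```

A larger vocab_size keeps rarer words at the cost of a bigger embedding layer, while seq_size bounds how much of each email the network actually sees.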

#### architecture_function

In [12]:
from melusine.models.neural_architectures import cnn_model, rnn_model

**architecture_function** is a function returning a Model instance from Keras.
Melusine provides two built-in neural networks : **cnn_model** and **rnn_model** based on the models used in-house at Maif.

#### pretrained_embedding

The embedding has to be trained on the user's dataset.

In [13]:
from melusine.nlp_tools.embedding import Embedding

In [17]:
pretrained_embedding = Embedding().load('./data/embedding.pickle') 

### NeuralModel used with text and metadata input

This neural network model will use the **clean_text** column for the text input and the dummified **extension**, **attachment_type**, **dayofweek**, **hour** and **min** columns as metadata input :

In [18]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_text",
                       meta_input_list=['extension','attachment_type', 'dayofweek', 'hour', 'min'],
                       n_epochs=10)

#### Training the neural network

During training, logs are saved in a "train" directory located in the data directory. TensorBoard can be used to follow the training, either :
- with "tensorboard --logdir data" from your terminal
- directly from a notebook with the "%load_ext tensorboard" and "%tensorboard --logdir data" magic commands (see https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks)

In [19]:
nn_model.fit(X,y,tensorboard_log_dir="./data")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


![title](../docs/_static/tensorboard.png)

#### Saving the neural network

The **save_nn_model** method saves :
- the Keras model as a JSON file
- the weights as an h5 file

In [20]:
nn_model.save_nn_model("./data/nn_model")

Once **save_nn_model** has been called, the NeuralModel object can be saved as a pickle file :

In [21]:
import joblib
_ = joblib.dump(nn_model,"./data/nn_model.pickle",compress=True)

#### Loading the neural network

The NeuralModel saved as a pickle file has to be loaded first : 

In [22]:
nn_model = joblib.load("./data/nn_model.pickle")

Then the Keras model and its weights can be loaded :

In [23]:
nn_model.load_nn_model("./data/nn_model")

#### Making predictions 

In [24]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)

In [25]:
y_res

array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'habitation', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)
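In the output above, almost every email is assigned to the same class, which is not surprising for a model trained on only 40 examples. A quick way to spot this kind of behaviour is to look at the accuracy together with the distribution of predicted labels. The sketch below uses toy label lists (hypothetical values) rather than the tutorial data:

```python
from collections import Counter

# Toy true and predicted labels (hypothetical values, for illustration only).
y_true = ["vehicule", "habitation", "vehicule", "resiliation"]
y_pred = ["vehicule", "vehicule", "vehicule", "vehicule"]

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of predictions falling in the most frequently predicted class.
pred_counts = Counter(y_pred)
majority_share = pred_counts.most_common(1)[0][1] / len(y_pred)

print(f"accuracy: {accuracy:.2f}")
print(f"share of most predicted class: {majority_share:.2f}")
```

A majority share close to 1 suggests that the model has collapsed onto a single class, whatever the accuracy value is.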

### NeuralModel used with only text input

In [26]:
# Work on an explicit copy, since fit adds a "tokens" column to X
X = df_emails_preprocessed[['clean_text']].copy()

In [27]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_text",
                       meta_input_list=None,
                       n_epochs=10)

In [28]:
nn_model.fit(X,y)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [29]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res

array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)