# Models subpackage tutorial

The NeuralModel class is a generic class for managing neural networks implemented with Keras. It provides methods to save, load and train a network, and to use it for classification.

Melusine provides two built-in Keras models, cnn_model and rnn_model, based on the models used in-house at Maif. Users are nonetheless free to implement neural networks tailored to their needs.

## The dataset

The NeuralModel class can take as input either:
- a text input: a cleaned text, usually the cleaned body or the concatenation of the cleaned header and the cleaned body.
- a text input and a metadata input: the metadata input has to be dummified (one-hot encoded).
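To make "dummified" concrete, here is a minimal sketch in plain Python of what one-hot encoding a categorical metadata column amounts to. The helper `dummify` and the `extension__gmail` style column names are hypothetical and only serve the illustration; in practice a library function such as `pandas.get_dummies` does the same job.

```python
def dummify(values, prefix):
    """One-hot encode a list of categorical values.

    Returns (column_names, rows), where each row holds one 0/1 flag per category.
    """
    categories = sorted(set(values))
    # Hypothetical naming scheme: "<prefix>__<category>"
    columns = [f"{prefix}__{category}" for category in categories]
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return columns, rows


extensions = ["gmail", "outlook", "gmail", "wanadoo"]
columns, rows = dummify(extensions, prefix="extension")

# Print each email as a mapping from dummy column to flag
for row in rows:
    print(dict(zip(columns, row)))
```

Each email ends up with exactly one flag set per categorical field, which is the form a neural network can consume directly.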

#### Text input 

In [1]:
import ast
import pandas as pd

from melusine.data.data_loader import load_email_data
df_emails_full = load_email_data(type="full")

The new clean_text column is the concatenation of the clean_header column and the clean_body column:

In [2]:
df_emails_full['clean_text'] = df_emails_full['clean_header'] + " " + df_emails_full['clean_body']

In [3]:
df_emails_full.clean_text[0]

'devis habitation je suis client chez vous pouvez vous m etablir un devis pour mon fils qui souhaite louer lappartement suivant : 25 rue du rueimaginaire 77000'

#### Metadata input

By default, the following metadata are used:
- the extension: gmail, outlook, wanadoo, etc.
- the day of the week on which the email was sent
- the hour at which the email was sent
- the minute at which the email was sent
- the attachment types: pdf, png, etc.
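The fields listed above can be derived from raw email attributes. The following is a rough sketch using only the standard library; it is not Melusine's own extraction code. The function `extract_metadata`, its arguments and the sample values are assumptions made for the illustration, and only the output keys mirror the column names used in this tutorial.

```python
from datetime import datetime
from pathlib import PurePath


def extract_metadata(sender, sent_at, attachments):
    """Derive the five metadata fields from a sender address, a date and file names."""
    domain = sender.rsplit("@", 1)[-1]  # e.g. "gmail.com"
    attachment_types = {
        PurePath(name).suffix.lstrip(".").lower()
        for name in attachments
        if PurePath(name).suffix  # skip files without an extension
    }
    return {
        "extension": domain.split(".")[0].lower(),
        "dayofweek": sent_at.weekday(),  # Python convention: Monday is 0
        "hour": sent_at.hour,
        "min": sent_at.minute,
        "attachment_type": sorted(attachment_types),
    }


# Hypothetical email attributes
meta = extract_metadata(
    sender="jean.dupont@gmail.com",
    sent_at=datetime(2021, 9, 17, 14, 50),
    attachments=["devis.PDF", "photo.png"],
)
print(meta)
```

These raw values still have to be dummified before they are passed to the model.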

#### Defining X and y

In [4]:
X = df_emails_full.drop("label", axis=1)

y is a NumPy array containing the encoded labels:

In [5]:
from sklearn.preprocessing import LabelEncoder
y = df_emails_full['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [6]:
y

array([ 4, 10,  3,  0,  0,  4,  7, 10,  1, 10,  2,  5, 10, 10,  4,  7,  7,
       10,  0,  9,  4, 10,  4,  7, 10, 10,  6,  7,  3,  8, 10, 10, 10,  4,
        7,  3,  5,  4,  4, 10])
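For readers unfamiliar with `LabelEncoder`, the encoding boils down to sorting the distinct labels and replacing each label by its position in that sorted list. The sketch below reproduces this idea in plain Python; apart from 'vehicule', the label names are invented for the example.

```python
# Hypothetical labels; only "vehicule" appears in this tutorial's outputs
labels = ["vehicule", "habitation", "vehicule", "resiliation"]

# Distinct labels in sorted order, as LabelEncoder stores them in classes_
classes = sorted(set(labels))
to_index = {label: index for index, label in enumerate(classes)}

# Equivalent of fit_transform: labels -> integers
encoded = [to_index[label] for label in labels]

# Equivalent of inverse_transform: integers -> labels
decoded = [classes[index] for index in encoded]

print(classes)
print(encoded)
print(decoded)
```

Keeping the fitted encoder (`le` in this tutorial) around matters: without it, the integers returned by the model cannot be mapped back to label names.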

## The NeuralModel class

In [7]:
from melusine.models.train import NeuralModel

Its arguments are:
- **architecture_function:** a function returning a Keras Model instance.
- **pretrained_embedding:** the pretrained embedding matrix as a numpy array.
- **text_input_column:** the name of the column providing the text input. Default value, 'clean_text'.
- **meta_input_list:** the list of the names of the columns containing the metadata. If it is an empty list or None, the model is used without metadata. Default value, ['extension', 'dayofweek', 'hour', 'min'].
- **vocab_size:** the size of the vocabulary for the neural network model. Default value, 25000.
- **seq_size:** the maximum size of the input for the neural network model. Default value, 100.
- **loss:** the loss function for training. Default value, 'categorical_crossentropy'.
- **batch_size:** the size of the batches for the training of the neural network model. Default value, 4096.
- **n_epochs:** the number of epochs for the training of the neural network model. Default value, 15.
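The roles of vocab_size and seq_size can be illustrated with a small sketch. This is not Melusine's implementation: the helpers `build_vocab` and `to_sequence`, the reserved indices for padding and unknown tokens, the whitespace tokenization and the choice to pad on the right are all assumptions made for the example.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices for padding and out-of-vocabulary tokens


def build_vocab(texts, vocab_size):
    """Map the most frequent tokens to integer indices, keeping at most vocab_size entries."""
    counts = Counter(token for text in texts for token in text.split())
    kept = [token for token, _ in counts.most_common(vocab_size - 2)]
    return {token: index + 2 for index, token in enumerate(kept)}


def to_sequence(text, vocab, seq_size):
    """Convert a text to exactly seq_size indices by truncating or padding."""
    indices = [vocab.get(token, UNK) for token in text.split()][:seq_size]
    return indices + [PAD] * (seq_size - len(indices))


texts = ["devis habitation devis", "devis habitation vehicule"]
vocab = build_vocab(texts, vocab_size=4)  # two reserved slots + two real tokens

short_sequence = to_sequence("devis vehicule", vocab, seq_size=4)
long_sequence = to_sequence("devis habitation devis devis devis", vocab, seq_size=3)
print(vocab)
print(short_sequence)
print(long_sequence)
```

Under these assumptions, a larger vocab_size reduces the number of tokens mapped to the unknown index, and a larger seq_size reduces truncation at the cost of bigger inputs.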

#### architecture_function

In [8]:
from melusine.models.neural_architectures import cnn_model, rnn_model

**architecture_function** is a function returning a Keras Model instance.
Melusine provides two built-in neural networks, **cnn_model** and **rnn_model**, based on the models used in-house at Maif.

#### pretrained_embedding

The embedding has to be trained on the user's dataset.

In [9]:
from melusine.nlp_tools.embedding import Word2VecTrainer

In [10]:
# Instantiate the trainer
embedding_trainer = Word2VecTrainer(
    input_column='clean_body',
    workers=4,
    min_count=3
)

# Train the word embeddings model
embedding_trainer.train(df_emails_full)

pretrained_embedding = embedding_trainer.embedding

### NeuralModel used with text and metadata input

This neural network model will use the **clean_text** column for the text input and the dummified **extension**, **attachment_type**, **dayofweek**, **hour** and **min** columns as metadata input:

In [11]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_text",
                       meta_input_list=['extension','attachment_type', 'dayofweek', 'hour', 'min'],
                       n_epochs=10)

#### Training the neural network

During training, logs are saved in a "train" folder inside the data directory. Use TensorBoard to follow the training, either:
- with "tensorboard --logdir data" from your terminal
- directly from a notebook with the "%load_ext tensorboard" and "%tensorboard --logdir data" magic commands (see https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks)

In [12]:
nn_model.fit(X,y,tensorboard_log_dir="./data")



Epoch 1/10
Epoch 2/10
Epoch 3/10


Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


![title](../docs/_static/tensorboard.png)

#### Saving the neural network

The **save_nn_model** method saves:
- the Keras model as a JSON file
- the weights as an h5 file

In [13]:
nn_model.save_nn_model("./data/nn_model")

Once **save_nn_model** has been called, the NeuralModel object can be saved as a pickle file:

In [14]:
import joblib
_ = joblib.dump(nn_model, "./data/nn_model.pickle", compress=True)

#### Loading the neural network

The NeuralModel saved as a pickle file has to be loaded first:

In [15]:
nn_model = joblib.load("./data/nn_model.pickle")

Then the Keras model and its weights can be loaded:

In [16]:
nn_model.load_nn_model("./data/nn_model")

#### Making predictions 

In [17]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)

In [18]:
y_res

array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)
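Every prediction in the output above is the same class, so it is worth comparing the model against a trivial baseline that always predicts the most frequent label. The sketch below does this in plain Python with the encoded labels printed by cell In [6]; `majority_baseline` and `accuracy` are hypothetical helpers, not part of Melusine.

```python
from collections import Counter

# Encoded labels copied from the output of cell In [6]
y_true = [4, 10, 3, 0, 0, 4, 7, 10, 1, 10, 2, 5, 10, 10, 4, 7, 7,
          10, 0, 9, 4, 10, 4, 7, 10, 10, 6, 7, 3, 8, 10, 10, 10, 4,
          7, 3, 5, 4, 4, 10]


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def accuracy(expected, predicted):
    """Fraction of positions where the prediction matches the expected label."""
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)


majority_label, baseline_accuracy = majority_baseline(y_true)
print(majority_label, baseline_accuracy)

# A constant predictor scores exactly the baseline accuracy
constant_predictions = [majority_label] * len(y_true)
print(accuracy(y_true, constant_predictions))
```

If the accuracy of the trained network on held-out emails is no better than this baseline, the model has probably collapsed to a single class, which can happen with a training set as small as this 40-email sample.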

### NeuralModel used with only text input

In [19]:
# .copy() avoids pandas' SettingWithCopyWarning when the model adds columns to X
X = df_emails_full[['clean_text']].copy()

In [20]:
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_text",
                       meta_input_list=None,
                       n_epochs=10)

In [21]:
nn_model.fit(X,y)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res


array(['vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule',
       'vehicule', 'vehicule', 'vehicule', 'vehicule', 'vehicule'],
      dtype=object)