# Comparison of the various model architectures tutorial

The `NeuralModel` class offers several standard neural architectures for text classification. In particular, it includes the well-known `bert` architecture, which usually brings a significant improvement on NLP tasks.

* `RNN`, or Recurrent Neural Networks, are widely used for text processing.
* `CNN`, or Convolutional Neural Networks, are usually used for image classification tasks but give excellent results in our case, comparable to those of the `RNN`.
* `Attentive` neural networks emerged recently for text processing and show extremely promising results. We therefore decided to include such models in Melusine and ease their use for email processing. We propose both an original attention-based classifier and a wrapper for a standard `Bert classifier`.

All our architectures follow the same general pattern: the email and header texts are embedded using one of the encoders listed above. The resulting text vector is then concatenated with a vector built from the email metadata (hour, email domain, email attachment, ...).
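
The concatenation step can be illustrated with a minimal sketch. It assumes that categorical metadata is one-hot encoded, which is an assumption made here for illustration; the `one_hot` helper, the vocabularies and all numeric values are hypothetical and do not come from Melusine.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

# Hypothetical output of a text encoder (RNN, CNN, attention or BERT).
text_vector = [0.12, -0.40, 0.88, 0.05]

# Hypothetical metadata for one email: sender domain and hour of reception.
extensions = ["gmail", "yahoo", "other"]
hours = list(range(24))
meta_vector = one_hot("yahoo", extensions) + one_hot(14, hours)

# The classification layers see both sources of information side by side.
classifier_input = text_vector + meta_vector
print(len(classifier_input))  # 4 + 3 + 24 = 31
```

Whatever encoder produces `text_vector`, the layers after the concatenation stay the same, which is what makes the architectures interchangeable.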

In this tutorial we compare the different models' characteristics such as inference & training time, precision and architecture.

## Dataset preparation

The dataset preparation is detailed in **tutorial 07: models**.

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# The GPU id to use, usually either "0" or "1", or "" to use the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [2]:
# load inputs
import numpy as np
import pandas as pd
from copy import deepcopy
from sklearn.preprocessing import LabelEncoder
from melusine import load_email_data

# Text input 
df_emails_preprocessed = load_email_data(type="full")
df_emails_preprocessed['clean_text'] = df_emails_preprocessed['clean_header'] + " " + df_emails_preprocessed['clean_body']

# Dataset
X = df_emails_preprocessed.copy()
y = df_emails_preprocessed['label']
le = LabelEncoder()
y = le.fit_transform(y)

In [3]:
from melusine.nlp_tools.embedding import Word2VecTrainer

# Instantiate the trainer
embedding_trainer = Word2VecTrainer(
    input_column='clean_body',
    workers=1,
    min_count=2
)

# Train the word embeddings model
embedding_trainer.train(df_emails_preprocessed)

pretrained_embedding = embedding_trainer.embedding



## Models

In [4]:
import tensorflow.keras.backend as K
import time
from melusine.models.train import NeuralModel
from melusine.models.neural_architectures import cnn_model, rnn_model, transformers_model, bert_model

from sklearn.metrics import accuracy_score, classification_report

In [5]:
#!pip install psutil
import psutil

In [6]:
def get_available_memory():
    return psutil.virtual_memory()._asdict()['available']
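
Because `psutil.virtual_memory()` reports the memory available on the whole machine, the difference between two readings also reflects what other processes allocate or release in the meantime, so it can come out negative. As a complementary and purely illustrative measurement, the sketch below uses the standard-library `tracemalloc` module to record the peak memory allocated through Python's allocator while a function runs. The helper name `measure_peak_mb` is hypothetical, and memory allocated by native code outside Python's allocator is not counted, so this is not a replacement for the system-wide measurement.

```python
import tracemalloc

def measure_peak_mb(fn, *args, **kwargs):
    """Run fn and return (result, peak traced memory in MB) for this process."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, round(peak / 1e6, 1)

# Toy workload: a list of one million references.
result, peak_mb = measure_peak_mb(lambda: [0] * 1_000_000)
print('Peak traced memory: {} MB'.format(peak_mb))
```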

### RNN

`RNN` are traditionally used with textual data as they are specifically designed to handle sequentially structured data. Inputs are processed one step at a time by a recurrent cell, generally an `LSTM` or `GRU` cell. At each step, the current input and the hidden state from the previous step are used to compute the next hidden state. The proposed architecture is a 2-layer bidirectional `GRU` network. The last hidden state of the network is used as the final sentence embedding.
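
To make the recurrence concrete, here is a minimal pure-Python sketch of a single bidirectional `GRU` layer. It is not the Melusine implementation: biases are omitted, weights are random, only one layer is shown, and every name (`gru_step`, `rnn_encode`, `init_params`) is hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_step(x, h, params):
    """One GRU step: mix the current input x with the previous hidden state h."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(add(matvec(Wz, x), matvec(Uz, h)))   # update gate
    r = sigmoid(add(matvec(Wr, x), matvec(Ur, h)))   # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(v) for v in add(matvec(Wh, x), matvec(Uh, reset_h))]
    # Keep part of the old state and blend in the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, candidate)]

def init_params(input_dim, hidden_dim, rng):
    # Even positions are input-to-hidden matrices, odd ones hidden-to-hidden.
    return [[[rng.uniform(-0.5, 0.5)
              for _ in range(input_dim if i % 2 == 0 else hidden_dim)]
             for _ in range(hidden_dim)]
            for i in range(6)]

def rnn_encode(sequence, params, hidden_dim):
    """Read the sequence step by step and return the last hidden state."""
    h = [0.0] * hidden_dim
    for x in sequence:
        h = gru_step(x, h, params)
    return h

rng = random.Random(0)
input_dim, hidden_dim = 3, 4
forward_params = init_params(input_dim, hidden_dim, rng)
backward_params = init_params(input_dim, hidden_dim, rng)
tokens = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(5)]

# Bidirectional: read left-to-right and right-to-left, then concatenate.
sentence_embedding = (rnn_encode(tokens, forward_params, hidden_dim)
                      + rnn_encode(tokens[::-1], backward_params, hidden_dim))
print(len(sentence_embedding))  # 2 * hidden_dim = 8
```

Each step depends on the previous one, so the sequence cannot be processed in parallel.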

<img src="./images/rnn-model.png" style="width:500px">

In [7]:
memory_start = get_available_memory()
RNN_model = NeuralModel(architecture_function=rnn_model,
                        pretrained_embedding=pretrained_embedding,
                        text_input_column="clean_text",
                        meta_input_list=['extension', 'dayofweek', 'hour', 'min','attachment_type'],
                        n_epochs=1)
training_start = time.time()
RNN_model.fit(df_emails_preprocessed, y)
training_end = time.time()
RNN_memory = memory_start - get_available_memory()
RNN_memory = round(RNN_memory / 1e9 * 1024 , 1)
print('RNN is using {} Mb memory (RAM).'.format(str(RNN_memory)))

2021-09-17 11:00:42.646284: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-17 11:00:43.796998: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


RNN is using -54.6 Mb memory (RAM).


In [8]:
RNN_N_trainable_parameters = int(np.sum([K.count_params(p) for p in RNN_model.model.trainable_weights]))
RNN_N_non_trainable_parameters = int(np.sum([K.count_params(p) for p in RNN_model.model.non_trainable_weights]))
RNN_training_time = round(training_end - training_start, 2)
RNN_N_trainable_parameters, RNN_N_non_trainable_parameters, RNN_training_time

(304531, 0, 9.64)

In [9]:
y_res = []
inference_start = time.time()
for i in range(X.shape[0]):
    X_copy = deepcopy(X.loc[i:i, :])
    y_res.append(RNN_model.predict(X_copy))
y_res = le.inverse_transform(np.ravel(y_res))
inference_end = time.time()
RNN_inference_time = round((inference_end - inference_start)/len(y_res) * 1000, 2)
RNN_accuracy = accuracy_score(y_res, le.inverse_transform(y))
RNN_cls_report = classification_report(le.inverse_transform(y), y_res, output_dict=True, zero_division=0)
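
The timing loop above calls `predict` on one email at a time and averages the wall-clock time. A minimal sketch of the same measurement as a reusable helper is given here; `mean_latency_ms` and `dummy_predict` are hypothetical names introduced for illustration, and the stand-in predictor replaces a trained model so that the snippet runs on its own.

```python
import time

def mean_latency_ms(predict_fn, samples):
    """Call predict_fn on each sample; return (predictions, mean ms per call)."""
    predictions = []
    start = time.perf_counter()
    for sample in samples:
        predictions.append(predict_fn(sample))
    elapsed = time.perf_counter() - start
    return predictions, round(elapsed / len(samples) * 1000, 2)

# Stand-in predictor used only to make the sketch runnable.
def dummy_predict(sample):
    return len(sample) % 2

predictions, latency = mean_latency_ms(dummy_predict, ["bonjour", "salut!", "merci"])
print(predictions, latency)
```

Measuring one sample per call reflects single-email latency; batching several emails per call would typically give a lower per-sample figure.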

### CNN

`CNN` use multiple filters to detect patterns in the data. These filters are stacked across the hidden layers to build more complex patterns and structures, so the last layer should capture a global and generic representation of the data. In our architecture, we use a `CNN` with two hidden layers of 200 filters each. The last hidden states are aggregated using a max pooling operation.
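
The filter and max pooling mechanics can be sketched in a few lines of pure Python. This is a simplified, single-layer illustration with hand-written toy filters, not the Melusine architecture; `conv1d` and `cnn_encode` are hypothetical names.

```python
def conv1d(sequence, kernel):
    """Slide one filter over the token sequence; each window yields one activation."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(
            w * x
            for kernel_row, token in zip(kernel, window)
            for w, x in zip(kernel_row, token)
        )
        outputs.append(max(0.0, activation))  # ReLU non-linearity
    return outputs

def cnn_encode(sequence, filters):
    """Apply every filter, then keep each filter's strongest response (max pooling)."""
    return [max(conv1d(sequence, kernel)) for kernel in filters]

# Four tokens with 2-dimensional embeddings (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Two filters spanning two consecutive tokens each.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
print(cnn_encode(tokens, filters))  # [2.0, 0.0]
```

The output has one value per filter regardless of the email length, which is why max pooling yields a fixed-size text vector.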

<img src="./images/cnn-model.png" style="width:500px">

In [10]:
memory_start = get_available_memory()
CNN_model = NeuralModel(architecture_function=cnn_model,
                        pretrained_embedding=pretrained_embedding,
                        text_input_column="clean_text",
                        meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                        n_epochs=1)
training_start = time.time()
CNN_model.fit(X, y)
training_end = time.time()
CNN_memory = memory_start - get_available_memory()
CNN_memory = round(CNN_memory / 1e9 * 1024 , 1)
print('CNN is using {} Mb memory (RAM).'.format(str(CNN_memory)))

CNN is using 55.6 Mb memory (RAM).


In [11]:
CNN_N_trainable_parameters = int(np.sum([K.count_params(p) for p in CNN_model.model.trainable_weights]))
CNN_N_non_trainable_parameters = int(np.sum([K.count_params(p) for p in CNN_model.model.non_trainable_weights]))
CNN_training_time =  round(training_end - training_start, 2)
CNN_N_trainable_parameters, CNN_N_non_trainable_parameters, CNN_training_time

(352241, 400, 1.74)

In [12]:
y_res = []
inference_start = time.time()
for i in range(X.shape[0]):
    X_copy = deepcopy(X.loc[i:i, :])
    y_res.append(CNN_model.predict(X_copy))
y_res = le.inverse_transform(np.ravel(y_res))
inference_end = time.time()
CNN_inference_time = round((inference_end - inference_start)/len(y_res) * 1000, 2)
CNN_accuracy = accuracy_score(y_res, le.inverse_transform(y))
CNN_cls_report = classification_report(le.inverse_transform(y), y_res, output_dict=True, zero_division=0)

### Transformer

#### Multi-head Attention Classifier

Attention-based neural networks are fairly new in the NLP community, but their results are extremely promising. They rely on the self-attention operation, which computes each hidden state as a weighted sum of the inputs. Like the multiple filters in the `CNN` architecture, multi-branch attention aggregates multiple attention operations to capture various properties of the input. Such operations are easily performed on GPU infrastructure. We propose an architecture inspired by the RNN and CNN architectures introduced above, with a two-layer multi-branch attention module followed by a max pooling operation.
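
The weighted sum at the heart of self-attention can be sketched as follows. This is a simplified single-head version of scaled dot-product attention without learned projections, so it only illustrates the mechanism; `self_attention` and the toy token vectors are hypothetical. A multi-head module would run several such operations with different learned projections and concatenate the results.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Each output is a weighted sum of all tokens; weights come from dot products."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens))
                        for d in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states, weights = self_attention(tokens)

# Max pooling over positions gives a fixed-size sentence vector.
sentence_vector = [max(state[d] for state in hidden_states) for d in range(2)]
print(sentence_vector)
```

Unlike the recurrent case, every position is computed independently of the others, which is why the operation parallelizes well.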

<img src="./images/transformer-model.png" style="width:500px">

In [13]:
memory_start = get_available_memory()
Transformer_model = NeuralModel(architecture_function=transformers_model,
                                pretrained_embedding=pretrained_embedding,
                                text_input_column="clean_text",
                                meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                                n_epochs=1)
training_start = time.time()
Transformer_model.fit(X, y)
training_end = time.time()
Transformer_memory = memory_start - get_available_memory()
Transformer_memory = round(Transformer_memory / 1e9 * 1024 , 1)
print('Transformer is using {} Mb memory (RAM).'.format(str(Transformer_memory)))

Transformer is using 240.7 Mb memory (RAM).


In [14]:
Transformer_N_trainable_parameters = int(np.sum([K.count_params(p) for p in Transformer_model.model.trainable_weights]))
Transformer_N_non_trainable_parameters = int(np.sum([K.count_params(p) for p in Transformer_model.model.non_trainable_weights]))
Transformer_training_time =  round(training_end - training_start, 2)
Transformer_N_trainable_parameters, Transformer_N_non_trainable_parameters, Transformer_training_time

(237041, 11300, 4.16)

In [15]:
y_res = []
inference_start = time.time()
for i in range(X.shape[0]):
    X_copy = deepcopy(X.loc[i:i, :])
    y_res.append(Transformer_model.predict(X_copy))
y_res = le.inverse_transform(np.ravel(y_res))
inference_end = time.time()
Transformer_inference_time = round((inference_end - inference_start)/len(y_res) * 1000, 2)
Transformer_accuracy = accuracy_score(y_res, le.inverse_transform(y))
Transformer_cls_report = classification_report(le.inverse_transform(y), y_res, output_dict=True, zero_division=0)

#### Bert Model

We also propose a wrapper for the popular pre-trained `bert` architecture. The `bert` architecture encodes every token of a sentence with a contextualized embedding: each word's embedding depends on all the words in the sentence. However, our classifier model only uses the embedding of the first token, usually called the **classification token**.
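
As a minimal illustration of that selection, the sketch below picks the first vector out of a list of per-token embeddings. The tokens and numeric values are hypothetical and hand-written; a real encoder would return one high-dimensional vector per token.

```python
# Hypothetical contextual embeddings, one vector per token.
tokens = ["<s>", "bonjour", "madame", "</s>"]
contextual_embeddings = [
    [0.3, -0.1, 0.7],   # "<s>": the classification token
    [0.5, 0.2, -0.4],
    [-0.6, 0.9, 0.1],
    [0.0, 0.1, 0.2],
]

# Only the vector at position 0 is passed on to the classification layers.
cls_vector = contextual_embeddings[0]
print(tokens[0], cls_vector)
```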

<img src="./images/bert-model.png" style="width:500px">


Bert tokenizers and models can be downloaded here: https://huggingface.co/transformers/pretrained_models.html

Only Camembert and Flaubert are currently available in Melusine.

##### Camembert

In [16]:
memory_start = get_available_memory()
CamemBert_model = NeuralModel(architecture_function=bert_model,
                              pretrained_embedding=None,
                              text_input_column="clean_text",
                              meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                              n_epochs=1,
                              bert_tokenizer='jplu/tf-camembert-base',
                              bert_model='jplu/tf-camembert-base')
training_start = time.time()
CamemBert_model.fit(X, y)
training_end = time.time()
CamemBert_memory = memory_start - get_available_memory()
CamemBert_memory = round(CamemBert_memory / 1e9 * 1024 , 1)
print('CamemBert is using {} Mb memory (RAM).'.format(str(CamemBert_memory)))

Some layers from the model checkpoint at jplu/tf-camembert-base were not used when initializing TFCamembertModel: ['lm_head']
- This IS expected if you are initializing TFCamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFCamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFCamembertModel were initialized from the model checkpoint at jplu/tf-camembert-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCamembertModel for predictions without further training.


CamemBert is using -2077.9 Mb memory (RAM).


In [17]:
import transformers
transformers.__version__

'3.4.0'

In [18]:
CamemBert_N_trainable_parameters = int(np.sum([K.count_params(p) for p in CamemBert_model.model.trainable_weights]))
CamemBert_N_non_trainable_parameters = int(np.sum([K.count_params(p) for p in CamemBert_model.model.non_trainable_weights]))
CamemBert_training_time =  round(training_end - training_start, 2)
CamemBert_N_trainable_parameters, CamemBert_N_non_trainable_parameters, CamemBert_training_time

(110845443, 0, 240.56)

In [19]:
y_res = []
inference_start = time.time()
for i in range(X.shape[0]):
    X_copy = deepcopy(X.loc[i:i, :])
    y_res.append(CamemBert_model.predict(X_copy))
y_res = le.inverse_transform(np.ravel(y_res))
inference_end = time.time()
CamemBert_inference_time = round((inference_end - inference_start)/len(y_res) * 1000, 2)
CamemBert_accuracy = accuracy_score(y_res, le.inverse_transform(y))
CamemBert_cls_report = classification_report(le.inverse_transform(y), y_res, output_dict=True, zero_division=0)

##### Flaubert

In [20]:
memory_start = get_available_memory()
FlauBert_model = NeuralModel(architecture_function=bert_model,
                             pretrained_embedding=None,
                             text_input_column="clean_text",
                             meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                             n_epochs=1,
                             bert_tokenizer='jplu/tf-flaubert-base-cased',
                             bert_model='jplu/tf-flaubert-base-cased')
training_start = time.time()
FlauBert_model.fit(X, y)
training_end = time.time()
FlauBert_memory = memory_start - get_available_memory()
FlauBert_memory = round(FlauBert_memory / 1e9 * 1024 , 1)
print('FlauBert is using {} Mb memory (RAM).'.format(str(FlauBert_memory)))

2021-09-17 11:08:58.811708: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Some layers from the model checkpoint at jplu/tf-flaubert-base-cased were not used when initializing TFFlaubertModel: ['pred_layer_._proj']
- This IS expected if you are initializing TFFlaubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFFlaubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFFlaubertModel were initialized from the model checkpoint at jplu/tf-flaubert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFFlaub

FlauBert is using -352.3 Mb memory (RAM).


In [21]:
FlauBert_N_trainable_parameters = int(np.sum([K.count_params(p) for p in FlauBert_model.model.trainable_weights]))
FlauBert_N_non_trainable_parameters = int(np.sum([K.count_params(p) for p in FlauBert_model.model.non_trainable_weights]))
FlauBert_training_time =  round(training_end - training_start, 2)
FlauBert_N_trainable_parameters, FlauBert_N_non_trainable_parameters, FlauBert_training_time

(138456579, 0, 281.71)

In [22]:
y_res = []
inference_start = time.time()
for i in range(X.shape[0]):
    X_copy = deepcopy(X.loc[i:i, :])
    y_res.append(FlauBert_model.predict(X_copy))
y_res = le.inverse_transform(np.ravel(y_res))
inference_end = time.time()
FlauBert_inference_time = round((inference_end - inference_start)/len(y_res) * 1000, 2)
FlauBert_accuracy = accuracy_score(y_res, le.inverse_transform(y))
FlauBert_cls_report = classification_report(le.inverse_transform(y), y_res, output_dict=True, zero_division=0)

## Comparison of the model characteristics

In [23]:
parameters_list = ["Number of Parameters", "Memory Usage (Mb)", "Training time (s./epoch)",
                   "Inference time (ms./sample)", "Accuracy (%)"]
models_names = ['RNN', 'CNN', 'Transformers', 'CamemBert', 'FlauBert']

N_trainable_parameters = [RNN_N_trainable_parameters, CNN_N_trainable_parameters,
                          Transformer_N_trainable_parameters, CamemBert_N_trainable_parameters, FlauBert_N_trainable_parameters]

memory_usage = [RNN_memory, CNN_memory,
                Transformer_memory, CamemBert_memory, FlauBert_memory]

training_time = [RNN_training_time, CNN_training_time,
                 Transformer_training_time, CamemBert_training_time, FlauBert_training_time]

inference_time = [RNN_inference_time, CNN_inference_time,
                  Transformer_inference_time, CamemBert_inference_time, FlauBert_inference_time]

accuracy = [RNN_accuracy, CNN_accuracy,
            Transformer_accuracy, CamemBert_accuracy, FlauBert_accuracy]

data = [N_trainable_parameters, memory_usage, training_time, inference_time, accuracy]

In [24]:
def format_table(data, columns_header, row_header):
    row_format = "{:>15}|" * (len(row_header))
    row_format = "{:>30}|" + row_format
    space = ['']*len(row_header)
    space_format = "{:->15}+" * (len(row_header))
    space_format = "{:->30}+" + space_format
    print(row_format.format("", *row_header))
    for col, row in zip(columns_header, data):
        print(space_format.format("", *space))
        print(row_format.format(col, *row))

In [25]:
format_table(data, parameters_list, models_names)

                              |            RNN|            CNN|   Transformers|      CamemBert|       FlauBert|
------------------------------+---------------+---------------+---------------+---------------+---------------+
          Number of Parameters|         304531|         352241|         237041|      110845443|      138456579|
------------------------------+---------------+---------------+---------------+---------------+---------------+
             Memory Usage (Mb)|          -54.6|           55.6|          240.7|        -2077.9|         -352.3|
------------------------------+---------------+---------------+---------------+---------------+---------------+
      Training time (s./epoch)|           9.64|           1.74|           4.16|         240.56|         281.71|
------------------------------+---------------+---------------+---------------+---------------+---------------+
   Inference time (ms./sample)|          89.43|          52.21|          66.34|         359.39|         

⚠️⚠️ **The metrics above are computed on a very small sample and may therefore be misleading. Please use your own dataset to compare the models and their performance extensively.** ⚠️⚠️

## Classification Report

Please install plotly to plot the following graphs:

``!pip install --upgrade plotly``

In [28]:
import plotly
import plotly.graph_objs as go
import plotly.express as px

In [29]:
def cls_report_2_df(cls_report, model_name):
    cls_report_df = pd.DataFrame.from_dict(cls_report)
    cls_report_df.drop(['accuracy', 'macro avg', 'weighted avg'], axis=1, inplace=True)
    cls_report_df = cls_report_df.T
    cls_report_df['class'] = cls_report_df.index
    cls_report_df['model'] = model_name
    return cls_report_df

In [30]:
# The reports must be listed in the same order as the model names
models = ['RNN', 'CNN', 'Transformers', 'CamemBert', 'FlauBert']
cls_reports = [RNN_cls_report, CNN_cls_report, Transformer_cls_report, CamemBert_cls_report, FlauBert_cls_report]
ALL_cls_report_df = [cls_report_2_df(c, m) for (c, m) in zip(cls_reports, models)] 
ALL_cls_report_df = pd.concat(ALL_cls_report_df)

In [32]:
fig = px.scatter(ALL_cls_report_df, 
                 x="precision", 
                 y="recall", 
                 size='support', 
                 color='model',
                 hover_data=['class'],
                 title="Precision and Recall given classes and models")
fig.show()

⚠️⚠️ **The metrics above are computed on a very small sample and may therefore be misleading. Please use your own dataset to compare the models and their performance extensively.** ⚠️⚠️