The goal of this Notebook is to develop a systematic approach for testing various classifiers.<br>
We will learn how to build a training and testing pipeline with various configurations. Then we will
evaluate these models with simple classification metrics.<br>
We will use:
* the Pipeline, LabelEncoder, and dataset split features of sklearn
* some ML classifiers and a first deep learning model
* the TFIDF embeddings, re-used for the moment because document embeddings are not the focus here.

In [2]:
from config import RAW_DATA_PATH
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

import os

Load the dataframe and tokenized documents

In [3]:
df = pd.read_csv('data/preprocessing/document_tokens_labelled.csv', sep=',')

In [4]:
df

Unnamed: 0,document_tokens,label
0,commission|accès|document|administratif|examin...,défavorable
1,commission|accès|document|administratif|examin...,défavorable
2,commission|accès|document|administratif|examin...,défavorable
3,commission|accès|document|administratif|examin...,
4,commission|accès|document|administratif|examin...,
...,...,...
48741,monsieur|x|saisir|commission|accès|document|ad...,favorable
48742,monsieur|x|saisir|commission|accès|document|ad...,sans objet
48743,maître|x|x|saisir|commission|accès|document|ad...,sans objet
48744,Monsieur|x|x|saisir|commission|accès|document|...,sans objet


Remove unlabelled documents.

In [5]:
df.dropna(inplace=True)

In [6]:
len(df)

40454

In [7]:
df.label.value_counts()

favorable      26940
sans objet      9849
défavorable     3665
Name: label, dtype: int64

# Pipeline

Create a pipeline that is fitted on Train only and then applied to Test.

## Split into Train and Test
Do it upfront in order to avoid data leakage. The split should also be stratified by label class, because the classes are imbalanced.

In [36]:
df_train, df_test, y_train, y_test = train_test_split(
    df.document_tokens, df.label, test_size=0.30,
)
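The call above does not pass `stratify`, so class proportions in the two splits are only approximately equal. A minimal sketch of a stratified split, on toy labels that stand in for `df.label` (the toy data, the variable names and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for df.label (hypothetical data).
labels = np.array(["favorable"] * 70 + ["sans objet"] * 20 + ["défavorable"] * 10)
documents = np.array([f"doc {i}" for i in range(len(labels))])

docs_train, docs_test, lab_train, lab_test = train_test_split(
    documents, labels,
    test_size=0.30,
    stratify=labels,   # keep the class proportions identical in both splits
    random_state=0,    # fixed seed so the split is reproducible
)

# Class counts in the test split mirror the 70/20/10 proportions.
test_classes, test_counts = np.unique(lab_test, return_counts=True)
test_distribution = dict(zip(test_classes.tolist(), test_counts.tolist()))
```

In the notebook this would amount to adding `stratify=df.label` to the existing call; the class counts printed further down would then change slightly.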

## Encode the labels
It is important to fit the label encoder on the training labels only (again, to prevent leakage).


In [37]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

label_encoder.fit(y_train)

y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)

In [38]:
label_encoder.classes_

array(['défavorable', 'favorable', 'sans objet'], dtype=object)

In [39]:
unique, counts = np.unique(y_train, return_counts=True)

In [68]:
dict(zip(unique, counts))

{0: 2529, 1: 18851, 2: 6937}

In [41]:
MOST_POPULAR_CLASS = 1

In [71]:
CLASS_INDEX = dict(enumerate(label_encoder.classes_))
CLASS_INDEX

{0: 'défavorable', 1: 'favorable', 2: 'sans objet'}
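A short sketch of what fitting the encoder on training labels only implies: classes are indexed in sorted order, `inverse_transform` maps indices back to names, and a label never seen during `fit` is rejected. The toy labels and the `"inconnu"` value are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["favorable", "défavorable", "sans objet", "favorable"])

# Classes are sorted alphabetically: défavorable=0, favorable=1, sans objet=2.
encoded = encoder.transform(["sans objet", "favorable"])
decoded = encoder.inverse_transform(encoded)

# A label absent from the training data cannot be encoded.
try:
    encoder.transform(["inconnu"])
    unseen_label_rejected = False
except ValueError:
    unseen_label_rejected = True
```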

## Document vectorizer

Encode the document into vectors.

### TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    analyzer='word',
    max_df=10000,  # ignore tokens that appear in more than 10000 documents of the collection
    max_features=10000,   # capping on vocabulary size
    sublinear_tf=True,
)
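To see what `sublinear_tf=True` does, the sketch below isolates the term-frequency part by switching off IDF weighting and normalisation (these two settings and the toy corpus are assumptions made only for the illustration). A raw count `tf` is replaced by `1 + log(tf)`, which dampens tokens repeated many times in one document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["document document document document accès", "commission accès"]

# use_idf=False and norm=None leave only the term-frequency component.
raw = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=False)
damped = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)

raw_tf = raw.fit_transform(corpus).toarray()
damped_tf = damped.fit_transform(corpus).toarray()

# Column of the token "document", which occurs 4 times in the first text.
column = raw.vocabulary_["document"]
raw_value = raw_tf[0, column]        # plain count
damped_value = damped_tf[0, column]  # 1 + log(count)
```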

### GloVe

In [217]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [214]:
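class GloVeVectorizer():
    "Placeholder for a GloVe-based document vectorizer (not implemented yet)."
    
    def fit(self, X, y=None):
        # nothing is learned yet; return self to follow the sklearn convention
        return self
    
    def transform(self, X, y=None):
        raise NotImplementedError("GloVe vectorization is not implemented yet")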

## Define the models

### Choose a baseline naive model

What is the simplest model you could think of? Think of a model that always predicts the most popular class
from the train data.

In [43]:
class NaiveClassifier():
    "Always predict the most popular class"
    
    # we know from label_encoder that the most popular class, 'favorable', takes index 1.
    most_popular_class = MOST_POPULAR_CLASS
    
    def fit(self, X, y=None, **fit_params):
        # nothing to learn; return self to follow the sklearn estimator convention
        return self
    
    def predict(self, X):
        # one prediction per row of X, always the same class index
        return np.full(X.shape[0], self.most_popular_class)
    

In [44]:
naive_classifier = NaiveClassifier()
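For comparison, sklearn ships a `DummyClassifier` whose `most_frequent` strategy learns the majority class during `fit` instead of relying on a hard-coded index. This is an alternative to the hand-written class above, not what the notebook uses; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy encoded labels where class 1 is the majority (hypothetical data).
y_toy = np.array([1, 1, 1, 0, 2, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# Every new sample receives the majority class seen during fit.
dummy_predictions = dummy.predict(np.zeros((4, 1)))
```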

### Choose a baseline ML model

Now choose the simplest ML classifier.

In [45]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()

### Non-parametric ML

In [46]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=150)

### Deep learning classifier

In [205]:
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import metrics
from tensorflow.keras.models import Sequential
import tensorflow as tf

class DeepLearningClassifier():
    """
    My first deep learning classifier.
    Takes as input a document-term matrix weighted by TFIDF.
    The model learns a mapping of the input space to a dense space of size 100.
    Then it outputs three probabilities, one for each class (hence a softmax layer of size 3).
    """
    
    EPOCHS = 5
    # assumes the TFIDF vocabulary reaches its max_features cap of 10000
    INPUT_DIMENSION = 10000
    model = None
    
    def build_model(self):
        self.model = Sequential()
        self.model.add(layers.Dense(100, input_dim=self.INPUT_DIMENSION, activation='relu'))
        self.model.add(layers.Dense(3, activation='softmax'))
        self.model.compile(
            loss=losses.CategoricalCrossentropy(),
            optimizer='adam',
            metrics=[metrics.CategoricalCrossentropy()],  # metrics are passed as a list
        )
    
    def fit(self, X, y=None, **fit_params):
        # rebuild the network on every fit so that repeated calls start from scratch
        self.build_model()
        
        X_tensor = tf.constant(X.toarray())  # sparse TFIDF matrix -> dense tensor
        y_onehot = tf.keras.utils.to_categorical(y, 3)  # integer labels -> one-hot rows
        
        self.history = self.model.fit(
            x=X_tensor,
            y=y_onehot,
            epochs=self.EPOCHS,
        )
        return self
    
    def predict(self, X):
        X_tensor = tf.constant(X.toarray())
        # pick the class with the highest predicted probability
        return self.model.predict(X_tensor).argmax(axis=1)
    

In [206]:
deep_learning_classifier = DeepLearningClassifier()
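The forward pass described in the docstring (dense ReLU layer of size 100, then a softmax over 3 classes) can be sketched in plain NumPy. The weights below are random rather than trained and the input has 10 features instead of 10000, so this only illustrates the shapes and the arithmetic, not the Keras model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# n_features is 10000 in the notebook; 10 keeps the sketch small.
n_documents, n_features, n_hidden, n_classes = 4, 10, 100, 3

X = rng.random((n_documents, n_features))  # stand-in for TF-IDF rows
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

hidden = np.maximum(0.0, X @ W1 + b1)  # Dense(100, activation='relu')
logits = hidden @ W2 + b2              # Dense(3) before the softmax

# Softmax, shifted by the row maximum for numerical stability.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probabilities = exp / exp.sum(axis=1, keepdims=True)

# Same decision rule as predict(): the most probable class per document.
predicted_classes = probabilities.argmax(axis=1)
```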

## Assemble pipeline components

In [47]:
pipeline_naive_classifier = Pipeline(steps=[
    ('tfidf', tfidf),
    ('naive_classifier', naive_classifier),
])

In [48]:
pipeline_naive_classifier

In [49]:
pipeline_baseline_ml = Pipeline(steps=[
    ('tfidf', tfidf),
    ('naive_bayes', naive_bayes),
])

In [50]:
pipeline_baseline_ml

In [51]:
pipeline_random_forest = Pipeline(steps=[
    ('tfidf', tfidf),
    ('random_forest', random_forest),
])

In [207]:
pipeline_deep_learning_baseline = Pipeline(steps=[
    ('tfidf', tfidf),
    ('deep_learning_classifier', deep_learning_classifier),
])

In [208]:
pipeline_deep_learning_baseline
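All the pipelines above hold the same `tfidf` object, so fitting any one of them fits the vectorizer seen by the others. If independent vectorizers are preferred, one option is to give each pipeline its own unfitted copy with `sklearn.base.clone`. A minimal sketch on toy data (the pipeline names and documents are assumptions for illustration):

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

template_tfidf = TfidfVectorizer(sublinear_tf=True)

# clone() returns a new, unfitted estimator with the same parameters.
pipeline_a = Pipeline(steps=[("tfidf", clone(template_tfidf)), ("naive_bayes", MultinomialNB())])
pipeline_b = Pipeline(steps=[("tfidf", clone(template_tfidf)), ("naive_bayes", MultinomialNB())])

pipeline_a.fit(["accès|document", "refus|secret"], [1, 0])

# Only the vectorizer inside pipeline_a has learned a vocabulary.
a_is_fitted = hasattr(pipeline_a.named_steps["tfidf"], "vocabulary_")
b_is_fitted = hasattr(pipeline_b.named_steps["tfidf"], "vocabulary_")
```

With cloned vectorizers, every pipeline has to be fitted before it can predict, including the naive one.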

In [51]:
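# Work in progress: not runnable yet. `glove_vectorizer` is not defined
# and the final estimator of this pipeline has not been chosen.
# pipeline_deep_learning_glove = Pipeline(steps=[
#     ('glove_vectorizer', glove_vectorizer),
#     ('deep_learning_glove', ),
# ])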

## Train the whole pipeline
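Remark: the naive classifier has nothing to learn, so its pipeline is not fitted here. It relies on the shared `tfidf` instance, which is fitted when the other pipelines are trained.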

In [52]:
pipeline_baseline_ml.fit(df_train, y_train)

In [53]:
pipeline_random_forest.fit(df_train, y_train)

In [209]:
pipeline_deep_learning_baseline.fit(df_train, y_train)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Serve the model

In [210]:
prediction_models = dict(naive=None, baseline_ml=None, random_forest=None, deep_learning=None)

In [211]:
prediction_models['naive'] = pipeline_naive_classifier.predict(df_test)

prediction_models['baseline_ml'] = pipeline_baseline_ml.predict(df_test)

prediction_models['random_forest'] = pipeline_random_forest.predict(df_test)

prediction_models['deep_learning'] = pipeline_deep_learning_baseline.predict(df_test)




# Results

In [212]:
from sklearn.metrics import classification_report

In [213]:
for k,v in prediction_models.items():
    print('------------------------------------------------------------')
    print(k)
    print(classification_report(y_test, v))

------------------------------------------------------------
naive
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1136
           1       0.67      1.00      0.80      8089
           2       0.00      0.00      0.00      2912

    accuracy                           0.67     12137
   macro avg       0.22      0.33      0.27     12137
weighted avg       0.44      0.67      0.53     12137

------------------------------------------------------------
baseline_ml
              precision    recall  f1-score   support

           0       0.75      0.22      0.34      1136
           1       0.83      0.94      0.88      8089
           2       0.85      0.75      0.80      2912

    accuracy                           0.83     12137
   macro avg       0.81      0.64      0.67     12137
weighted avg       0.83      0.83      0.81     12137

------------------------------------------------------------
random_forest
              precision
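The naive model never predicts classes 0 and 2, so their precision is undefined; sklearn reports it as 0.00 and emits a warning. A minimal sketch of a more readable report, passing the class names and an explicit `zero_division` value (the toy predictions are assumptions; `zero_division` requires a reasonably recent sklearn version):

```python
from sklearn.metrics import classification_report

# Toy ground truth and a naive prediction that always outputs class 1.
y_true_toy = [0, 1, 1, 1, 2, 1]
y_pred_toy = [1, 1, 1, 1, 1, 1]

report = classification_report(
    y_true_toy, y_pred_toy,
    target_names=["défavorable", "favorable", "sans objet"],  # names instead of 0/1/2
    zero_division=0,   # report 0 without warning when a class is never predicted
    output_dict=True,  # dictionary form, convenient for comparing models
)

naive_precision_defavorable = report["défavorable"]["precision"]
naive_recall_favorable = report["favorable"]["recall"]
```

In the notebook, `target_names=label_encoder.classes_` would give the same row labels.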



In [88]:
def make_df_results(y_true, y_pred, documents):
    "Package results into a dataframe for better visualisation"
    result = pd.DataFrame({'true_class': y_true, 'predicted_class': y_pred, 'document': documents})
    result['correct'] = result['true_class'] == result['predicted_class']
    result['true_class'] = result['true_class'].map(CLASS_INDEX)
    result['predicted_class'] = result['predicted_class'].map(CLASS_INDEX)
    return result


df_result = make_df_results(y_true=y_test, y_pred=prediction_models['random_forest'], documents=df_test.values)

In [89]:
df_result.head()

Unnamed: 0,true_class,predicted_class,document,correct
0,défavorable,favorable,Monsieur|x|saisir|commission|accès|document|ad...,False
1,favorable,favorable,monsieur|x|x|saisir|commission|accès|document|...,True
2,sans objet,sans objet,maître|x|conseil|Monsieur|x|saisir|commission|...,True
3,favorable,favorable,monsieur|x|saisir|commission|accès|document|ad...,True
4,défavorable,défavorable,commission|accès|document|administratif|examin...,True


In [96]:
df_result[~df_result.correct & (df_result.true_class == 'défavorable')]

Unnamed: 0,true_class,predicted_class,document,correct
0,défavorable,favorable,Monsieur|x|saisir|commission|accès|document|ad...,False
10,défavorable,favorable,Monsieur|x|monsieur|x|Monsieur|x|Monsieur|x|sa...,False
11,défavorable,favorable,Monsieur|x|saisir|commission|accès|document|ad...,False
20,défavorable,favorable,maître|x|x|saisir|commission|accès|document|ad...,False
114,défavorable,favorable,maître|t.|c.|conseil|Monsieur|b.|saisir|commis...,False
...,...,...,...,...
12004,défavorable,favorable,Monsieur|xx|xxx|saisir|commission|accès|docume...,False
12010,défavorable,favorable,maître|x.|conseil|Monsieur|monsieur|x.|saisir|...,False
12064,défavorable,favorable,commission|examiner|séance|décembre|demande|co...,False
12082,défavorable,favorable,maître|x|v.|conseil|boulangerie|cardinet|saisi...,False


Let's analyze which words have led to the wrong category.

# Notes

### 20/12/22
Baseline ML and Random Forest have pretty much the same accuracy!

We are probably reaching a complexity ceiling. We won't improve much with fancier models (e.g. deep learning).

Before that, we should fine-tune the feature space. Maybe use a better word representation?

To go further:
* fine-tune the TfIdf embeddings
* use pre-trained word vectors, e.g. GloVe


### 31/12/22
We have largely improved the results by enabling sublinear TF scaling in the TFIDF vectorizer.<br>
Now the RF is much better than the Naive Bayes.

Let's try a deep learning model with dense layers. Then we will use a deep learning model with pre-trained
embeddings (GloVe) as input.


### 01/01/23
I have built a first DL model with a simple dense layer. Then I added it to the sklearn Pipeline structure.
The model is quite comparable to the Random Forest. Overall, recall is better but precision is lower.

Next steps:
* New Notebook dedicated to more advanced deep learning
    * Better embedding space: vectorize words with pre-trained embeddings such as GloVe, then learn a document
    embedding space with the Doc2Vec framework.
    * cross-validation, printing loss/accuracy over the epochs, ...
    * trying different model architectures.
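
The cross-validation step could look like the sketch below. Because the vectorizer sits inside the pipeline, it is refitted on the training part of every fold, which keeps the evaluation free of leakage. The toy documents, the number of folds and the `f1_macro` scoring are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, trivially separable corpus (hypothetical data).
toy_documents = ["accès|document|favorable"] * 6 + ["refus|commission|secret"] * 6
toy_labels = [1] * 6 + [0] * 6

cv_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("naive_bayes", MultinomialNB()),
])

# Stratified folds keep the class balance in every train/validation split.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(cv_pipeline, toy_documents, toy_labels, cv=folds, scoring="f1_macro")
```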