<a href="https://colab.research.google.com/github/Adrian-Muino/DMML2022_Geneva/blob/main/Code/4.DMML_2022_Geneva_Bert%26Tensor_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#A. Introduction


## Group project for Data Mining & Machine Learning course, at HEC UNIL 2022 (Geneva Group)

This notebook is the final step of our work for the Kaggle competition
[Detecting the difficulty level of French texts](https://www.kaggle.com/competitions/detecting-french-texts-difficulty-level-2022)

In this notebook we implement multi-class text classification using [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) and [TensorFlow](https://www.tensorflow.org/?hl=fr).

BERT stands for _"Bidirectional Encoder Representations from Transformers"_ and is a machine learning model based on transformers. We use TensorFlow Hub to load a pretrained BERT-based encoder and wrap it in a [Keras](https://keras.io/api/models/model/) model, following the implementation described by [Nicolo Cosimo](https://towardsdatascience.com/multi-label-text-classification-using-bert-and-tensorflow-d2e88d8f488d).

Other BERT models, such as [CamemBERT](https://camembert-model.fr/) or [FlauBERT](https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/flaubert#transformers.FlaubertForSequenceClassification), could also be used, but because of time constraints we could not implement all three.

**However, implementing this Keras BERT model improved our submission accuracy from 0.515 to 0.55583, our best result in the competition.**






##Table of contents

>[A. Introduction](#scrollTo=8aOmV2F2AH8C)

>>[Group project for Data Mining & Machine Learning course, at HEC UNIL 2022 (Geneva Group)](#scrollTo=UCujSHKcAMlg)

>>[Table of contents](#scrollTo=HY0vZvLCzCEr)

>[B. Prerequisites](#scrollTo=EkKHqlA4AOTp)

>>[Installations](#scrollTo=zqIpXYX7ATz5)

>>[Imports](#scrollTo=MUgW7YQHAa2f)

>[C. Environment set up & exploratory data analysis](#scrollTo=qi-7AUVpB_Z8)

>[D. Bert & Tensor Model](#scrollTo=IAsmRfQlDRxf)

>[E. Submission](#scrollTo=19wukEl_wocC)



#B. Prerequisites

##Installations

In [None]:
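# Install the packages used in this notebook
!pip install sentence-transformers
!python -m spacy download fr_core_news_sm
!python -m spacy download fr_core_news_md
# Note: the `spacy link` command was removed in spaCy v3, so no `fr` shortcut is created;
# load the models by their full names instead, e.g. spacy.load("fr_core_news_sm")
!pip install tensorflow_hub
!pip install tensorflow_text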


##Imports

In [None]:
# Download the helper functions shared across our notebooks from our GitHub repository
import requests
url = 'https://raw.githubusercontent.com/Adrian-Muino/DMML2022_Geneva/main/Code/dmml_2022_geneva_functions.py'

r = requests.get(url)
r.raise_for_status()  # stop here if the download failed

# Save the module locally so it can be imported in the next cell
with open('dmml_2022_geneva_functions.py', 'w', encoding='utf-8') as f:
    f.write(r.text)

In [None]:
# All the other imports
import string
import re

from dmml_2022_geneva_functions import *
import pandas as pd

import spacy
from spacy import displacy

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
nltk.download('punkt')

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

from keras import backend as K

from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, RidgeClassifier, Perceptron
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils.multiclass import unique_labels
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay

#C. Environment set up & exploratory data analysis

In [None]:
# load the data from our github repository
training_data = 'https://raw.githubusercontent.com/Adrian-Muino/DMML2022_Geneva/main/Data/training_data.csv'
unlabelled_data = 'https://raw.githubusercontent.com/Adrian-Muino/DMML2022_Geneva/main/Data/unlabelled_test_data.csv'

df = df_train = pd.read_csv(training_data)
df_unlabeled = df_test = pd.read_csv(unlabelled_data)

#D. Bert & Tensor Model

In [None]:
# Number of sentences for each category and % relative to the total.
plt.style.use('ggplot')

num_classes = len(df["difficulty"].value_counts())

colors = plt.cm.Dark2(np.linspace(0, 1, num_classes))
iter_color = iter(colors)

df['difficulty'].value_counts().plot.barh(title="Sentences by Difficulty Level", 
                                                 ylabel="Level",
                                                 color=colors,
                                                 figsize=(9,9))

for i, v in enumerate(df['difficulty'].value_counts()):
  c = next(iter_color)
  plt.text(v, i,
           " "+str(v)+", "+str(round(v*100/df.shape[0],2))+"%", 
           color=c, 
           va='center', 
           fontweight='bold')

In [None]:
# map difficulty levels (A1 to C2) to integer labels
df['Level'] = df['difficulty'].map({'A1': 0,
                                    'A2': 1,
                                    'B1': 2,
                                    'B2': 3,
                                    'C1': 4,
                                    'C2': 5,})

# drop unused column
df = df.drop(["difficulty"], axis=1)

df.head()

In [None]:
# one-hot encode the integer labels so they match the softmax output of the model
y = tf.keras.utils.to_categorical(df["Level"].values, num_classes=num_classes)

x_train, x_test, y_train, y_test = train_test_split(df['sentence'], y, test_size=0.001)
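
For readers unfamiliar with `to_categorical`, the one-hot encoding applied above can be sketched in plain Python. This is a minimal illustration rather than the Keras implementation: `one_hot` and `decode` are hypothetical helper names, and the label order simply follows the A1 to C2 mapping used in this notebook.

```python
# Illustrative sketch only: mimics the effect of tf.keras.utils.to_categorical
# on a single integer label. `one_hot` and `decode` are hypothetical helpers.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def decode(vector):
    """Recover the integer label as the index of the largest entry."""
    return max(range(len(vector)), key=vector.__getitem__)


encoded = one_hot(LEVELS.index("B1"), len(LEVELS))
print(encoded)                  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(LEVELS[decode(encoded)])  # B1
```

The same index-of-the-maximum step is what turns a row of predicted probabilities back into a difficulty level at submission time.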

Because the model is based on the BERT transformer architecture, it produces a pooled_output (an embedding of the entire input sequence) of shape [batch size, 768], as the following example shows:

In [None]:

preprocessor = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base/1")

# get_embeddings is imported from our helper module (dmml_2022_geneva_functions)
get_embeddings([
    "Les coûts kilométriques réels peuvent diverger sensiblement des valeurs moyennes en fonction du moyen de transport utilisé, du taux d'occupation ou du taux de remplissage, de l'infrastructure utilisée, de la topographie des lignes, du flux de trafic, etc."]
)

We now define a model as the preprocessor and encoder layers followed by a dropout and a dense layer with a softmax activation function and an output space dimensionality equal to the number of classes we want to predict:

In [None]:
i = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
x = preprocessor(i)
x = encoder(x)
x = tf.keras.layers.Dropout(0.2, name="dropout")(x['pooled_output'])
x = tf.keras.layers.Dense(num_classes, activation='softmax', name="output")(x)

model = tf.keras.Model(i, x)
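
To make the classification head more concrete, here is a minimal plain-Python sketch of what a dense layer with softmax activation computes for one 768-dimensional pooled vector. It is an illustration under stated assumptions, not the Keras code: the weights are random placeholders rather than trained values, `dense_softmax` is a hypothetical name, and dropout is left out because it is only active during training.

```python
import math
import random

# Illustrative sketch only: a dense layer followed by softmax in plain Python.
# Sizes mirror the notebook (768-dim pooled output, 6 classes); the weights
# are random placeholders, not trained values.
EMBED_DIM = 768
NUM_CLASSES = 6


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(vector, weights, biases):
    """Compute one logit per class (dot product plus bias), then softmax."""
    logits = [
        sum(v * w for v, w in zip(vector, column)) + b
        for column, b in zip(weights, biases)
    ]
    return softmax(logits)


rng = random.Random(0)
pooled = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]  # stand-in for pooled_output
# One weight column per output class.
weights = [[rng.gauss(0, 0.05) for _ in range(EMBED_DIM)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

probs = dense_softmax(pooled, weights, biases)
print(len(probs), round(sum(probs), 6))  # 6 1.0
```

Each sentence therefore ends up with six probabilities that sum to 1, one per difficulty level.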

Once the model's structure is defined, we can compile and fit it. We train for up to 20 epochs and use the EarlyStopping callback to monitor the validation loss during training. If the validation loss does not improve for 3 consecutive epochs (patience = 3), training stops and the weights from the epoch with the lowest validation loss are restored (restore_best_weights = True):

In [None]:
n_epochs = 20

METRICS = [
      tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
      balanced_recall,
      balanced_precision,
      balanced_f1_score
]

earlystop_callback = tf.keras.callbacks.EarlyStopping(monitor = "val_loss", 
                                                      patience = 3,
                                                      restore_best_weights = True)

model.compile(optimizer = "adam",
              loss = "categorical_crossentropy",
              metrics = METRICS)



model_fit = model.fit(x_train, 
                      y_train, 
                      epochs = n_epochs,
                      validation_data = (x_test, y_test),
                      callbacks = [earlystop_callback])
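
The stopping rule can be replayed on a list of validation losses to see which epoch would be kept. The sketch below is a simplified illustration of the patience logic, not the Keras source: `simulate_early_stopping` is a hypothetical helper and the loss values are made up.

```python
# Illustrative sketch only: the patience rule replayed on a list of
# per-epoch validation losses. `simulate_early_stopping` is a hypothetical
# helper, not part of Keras.
def simulate_early_stopping(val_losses, patience=3):
    """Return (last_epoch_run, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # training stops; weights from best_epoch are restored
    return last_epoch, best_epoch


# Made-up losses: the best value occurs at epoch 1, then three epochs
# pass without improvement, so training stops at epoch 4.
losses = [1.20, 0.95, 0.97, 0.96, 0.99, 0.90]
print(simulate_early_stopping(losses))  # (4, 1)
```

Note that the final value of 0.90 is never reached in this example, which is the trade-off of a small patience: training time is saved, but a later improvement can be missed.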

#E. Submission

In [None]:
# predict_class is imported from our helper module (dmml_2022_geneva_functions)
predict_class(df_unlabeled["sentence"],model)

In [None]:
predictions_to_submit = pd.DataFrame(predict_class(df_unlabeled["sentence"],model))
predictions_to_submit.columns = ['difficulty']

In [None]:
predictions_to_submit

In [None]:
# map integer labels back to difficulty levels
predictions_to_submit['difficulty'] = predictions_to_submit['difficulty'].map({
    0: 'A1',
    1: 'A2',
    2: 'B1',
    3: 'B2',
    4: 'C1',
    5: 'C2',
})
predictions_to_submit = predictions_to_submit.rename_axis("id")
predictions_to_submit

In [None]:
predictions_to_submit.to_csv("Geneva_predictions_BertTensor3.csv")
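
`predict_class` comes from our helper module and its body is not shown in this notebook. As an assumption, it picks the class with the highest predicted probability for each sentence. The sketch below walks through that step, the label mapping, and the `id,difficulty` CSV layout on made-up probabilities; `to_submission_csv` is a hypothetical name used only for this illustration.

```python
import csv
import io

# Illustrative sketch only: argmax over class probabilities, mapped to
# difficulty levels and written as CSV text. The argmax step is an assumption
# about what the imported predict_class helper does.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def to_submission_csv(probabilities):
    """Turn rows of class probabilities into 'id,difficulty' CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "difficulty"])
    for row_id, probs in enumerate(probabilities):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        writer.writerow([row_id, LEVELS[predicted]])
    return buffer.getvalue()


# Made-up probabilities for two sentences, for illustration only.
fake_probs = [
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.50, 0.10],
]
print(to_submission_csv(fake_probs))
```

The resulting text has one header row followed by one row per sentence, matching the file produced by `to_csv` after `rename_axis("id")`.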