# [Detecting the difficulty level of French texts](https://www.kaggle.com/c/detecting-the-difficulty-level-of-french-texts/overview/evaluation)
## Model improvement
---
In this notebook, we try different methods to improve the accuracy of our models.

In [6]:
import pandas as pd
import spacy
from spacy import displacy
import string
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from spacy.lang.fr.stop_words import STOP_WORDS  # French stop words (the texts are in French)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier, Perceptron
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
from sklearn.utils.multiclass import unique_labels
from sklearn.preprocessing import MinMaxScaler, StandardScaler
np.random.seed(0)  # seed NumPy's global random number generator


def evaluate(y_true, pred):
    """
    Calculate the model's performance metrics.
    Since this is a multi-class classification, we take the weighted average
    of the metrics that are calculated for each class.

    """

    report = {
      'accuracy':accuracy_score(y_true, pred),
      'recall':recall_score(y_true, pred, average='weighted'),
      'precision':precision_score(y_true, pred, average='weighted'),
      'f1_score':f1_score(y_true, pred, average='weighted')
    }

    return report


def plot_confusion_matrix(y_true, pred, model):
    """
    A function to plot the model's confusion matrix.
    """

    # Use the y_true argument rather than the global y_test
    cf_matrix = confusion_matrix(y_true, pred)

    disp = ConfusionMatrixDisplay(confusion_matrix=cf_matrix,
                                  display_labels=model.classes_)

    disp.plot()


sp = spacy.load('fr_core_news_md')

# Import stop words from spaCy's French language data
stop_words = spacy.lang.fr.stop_words.STOP_WORDS
# Import punctuation characters
punctuations = string.punctuation
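
The evaluate function above relies on average='weighted'. As a minimal sketch of what that option computes, the example below recomputes the weighted precision by hand: each class's precision is weighted by its support (the number of true samples of that class). The toy labels are hypothetical and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels (hypothetical, for illustration only)
y_true_toy = ["A1", "A1", "A1", "A2", "A2", "B1"]
y_pred_toy = ["A1", "A1", "A2", "A2", "B1", "B1"]
labels = ["A1", "A2", "B1"]

# Per-class precision, in the order given by `labels`
per_class = precision_score(y_true_toy, y_pred_toy, labels=labels, average=None)

# Support = number of true samples of each class
support = np.array([y_true_toy.count(label) for label in labels])

# Weighted average computed by hand
manual_weighted = float(np.sum(per_class * support) / support.sum())

# Same quantity computed directly by scikit-learn
sklearn_weighted = precision_score(y_true_toy, y_pred_toy, average="weighted")

print(per_class)
print(manual_weighted, sklearn_weighted)
```

Frequent classes therefore count more than rare ones in the reported precision, recall and F1 score.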

In [7]:
df = pd.read_csv("https://raw.githubusercontent.com/LaCrazyTomato/Group-Project-DM-ML-2021/main/data/training_data.csv")

df.head()

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


In [8]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

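# Define cleaning function
# Note: word_tokenize defaults to English, and PorterStemmer and WordNetLemmatizer
# are English tools, so French words are only roughly normalized
# (e.g. 'français' becomes 'françai' in the output below).
def nltk_tokenizer(doc):
    
    # Lowercase
    doc = doc.lower()
    
    # Tokenize and remove white spaces (strip)
    doc = word_tokenize(doc)
    doc = [word.strip() for word in doc]
    
    # Stem each token
    stemmer = PorterStemmer()
    doc = [stemmer.stem(word) for word in doc]
    
    # Lemmatize each stemmed token
    lemma = WordNetLemmatizer()
    doc = [lemma.lemmatize(word) for word in doc]
    
    return doc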


print(nltk_tokenizer(df.loc[2, 'sentence']))

['le', 'test', 'de', 'niveau', 'en', 'françai', 'est', 'sur', 'le', 'site', 'internet', 'de', "l'école", '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Alex\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 1. PCA
We will first try PCA to reduce the dimensionality of the feature space our models are trained on.

In [9]:
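# Vectorizer with optimal parameters found previously
# Note: with analyzer='char', scikit-learn does not use the `tokenizer` argument
# (it only applies when analyzer='word'), so nltk_tokenizer has no effect here.
vectorizer = TfidfVectorizer(tokenizer=nltk_tokenizer,
                             ngram_range=(1, 6),
                             analyzer='char',
                             min_df=2,
                             max_df=0.7,
                             norm='l2')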
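
The vectorizer above combines a custom tokenizer with analyzer='char'. As a small sketch of how these two options interact in scikit-learn, the example below fits two character-level vectorizers, one with a tokenizer and one without, and checks that they learn the same vocabulary. The marker_tokenizer function and the two toy sentences are hypothetical and only serve the illustration.

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer


def marker_tokenizer(doc):
    # Hypothetical tokenizer whose output would be easy to spot if it were used
    return ["TOKENIZER_WAS_CALLED"]


docs = ["le chat dort", "le chien mange"]

with warnings.catch_warnings():
    # scikit-learn may warn that `tokenizer` is unused when analyzer != 'word'
    warnings.simplefilter("ignore")
    with_tokenizer = TfidfVectorizer(tokenizer=marker_tokenizer,
                                     analyzer="char",
                                     ngram_range=(1, 2)).fit(docs)

without_tokenizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit(docs)

# Both vectorizers learn exactly the same character n-grams
same_vocabulary = with_tokenizer.vocabulary_ == without_tokenizer.vocabulary_

# Character n-grams (lengths 1 and 2) extracted from a short string
char_ngrams = without_tokenizer.build_analyzer()("le chat")

print(same_vocabulary)
print(char_ngrams)
```

In other words, the features used in this notebook are raw character n-grams of the sentences; the stemming and lemmatization defined in nltk_tokenizer would only come into play with analyzer='word'.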


In [10]:
X = df['sentence']
y = df['difficulty']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=0, 
                                                    stratify=y)

# We need to vectorize the features before applying PCA to them
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()


# Then, we scale the data
scaler = MinMaxScaler()
X_train_vec = scaler.fit_transform(X_train_vec)
X_test_vec = scaler.transform(X_test_vec)


print(X_test_vec.shape)

(960, 111784)


In [11]:
from sklearn.decomposition import PCA


# Define PCA (we want 95% of explained variance)
pca = PCA(n_components=0.95)

# Fit PCA on the training set, then apply the same projection to the test set
X_train_vec_pca = pca.fit_transform(X_train_vec)
X_test_vec_pca = pca.transform(X_test_vec)

print('Shape after PCA: ', X_train_vec_pca.shape)
print('Number of components: ', pca.n_components_)
print('Explained variance ratio: ', sum(pca.explained_variance_ratio_))

Shape after PCA:  (3840, 2978)
Number of components:  2978
Explained variance ratio:  0.9500066963357994
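
As a minimal sketch of how a fractional n_components is resolved, the example below fits a full PCA on synthetic data, finds the smallest number of components whose cumulative explained variance exceeds 95 %, and compares it with what PCA(n_components=0.95) selects. The synthetic matrix and the variable names are assumptions made for the illustration; they are unrelated to the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): 200 samples, 50 features with decaying scale
X_toy = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

# Full PCA: one explained-variance ratio per component, in decreasing order
full_pca = PCA().fit(X_toy)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance exceeds 0.95
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn pick the number of components from the same threshold
pca_95 = PCA(n_components=0.95).fit(X_toy)

print(k_manual, pca_95.n_components_)
print(cumulative[k_manual - 1])
```

This is why the explained variance ratio printed above lands just over 0.95: components are added one at a time until the threshold is crossed.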


## 1. Logistic Regression
### 1.1 Reminder: accuracy from previous step -> 50.31 %

In [12]:
lr_model = LogisticRegression(max_iter=10_000,
                              penalty='l2',
                              solver='lbfgs')


# We don't need a pipeline anymore since we have already applied the vectorizer
lr_model.fit(X_train_vec_pca, y_train)

lr_model.score(X_test_vec_pca, y_test)

0.5041666666666667

PCA barely changes the result: 50.42 % against 50.31 % in the previous step. Let's try with the scaled values only (no PCA).

In [13]:
lr_model.fit(X_train_vec, y_train)

lr_model.score(X_test_vec, y_test)

0.5125

Slightly better without PCA: 51.25 %.

## 2. Random Forest
### 2.1 Reminder: accuracy from previous approach -> 46.15 %
With PCA:

In [14]:
randomForest_model = RandomForestClassifier(max_depth=40,
                                            n_estimators=80,
                                            criterion='gini',
                                            max_features='sqrt')


# We don't need a pipeline anymore since we have already applied the vectorizer
randomForest_model.fit(X_train_vec_pca, y_train)

randomForest_model.score(X_test_vec_pca, y_test)

0.215625

With scaler only:

In [15]:
randomForest_model.fit(X_train_vec, y_train)

randomForest_model.score(X_test_vec, y_test)

0.425

Neither scaling (42.50 %) nor PCA (21.56 %) improves on the 46.15 % obtained previously, so we keep only the parameters found in the previous step.

## 3. Ridge classifier
### 3.1 Reminder: accuracy from previous approach -> 50.42 %
With PCA:

In [16]:
ridge_model = RidgeClassifier(random_state=0,
                              max_iter=10_000,
                              alpha=1.2,
                              solver='auto')


# We don't need a pipeline anymore since we have already applied the vectorizer
ridge_model.fit(X_train_vec_pca, y_train)

ridge_model.score(X_test_vec_pca, y_test)

0.48333333333333334

With scaler only:

In [17]:
ridge_model.fit(X_train_vec, y_train)

ridge_model.score(X_test_vec, y_test)

0.45729166666666665

Without PCA or scaler:

In [18]:
ridge_pipe = Pipeline([('vectorizer', vectorizer),                 
                 ('classifier', ridge_model)])


ridge_pipe.fit(X_train, y_train)

ridge_pipe.score(X_test, y_test)

0.5041666666666667

## 4. Perceptron classifier
### 4.1 Reminder: accuracy from previous approach -> 46.77 %
With PCA:

In [19]:
perceptron_model = Perceptron()


perceptron_model.fit(X_train_vec_pca, y_train)

perceptron_model.score(X_test_vec_pca, y_test)

0.475

With scaling only:

In [20]:
perceptron_model.fit(X_train_vec, y_train)

perceptron_model.score(X_test_vec, y_test)

0.459375

With PCA, accuracy rises slightly, from 46.77 % to 47.50 %.

# 2. Ensemble method
Another way to improve our prediction accuracy is to combine several models into an ensemble. The StackingClassifier class from sklearn makes this straightforward: the predictions of the base models become the input features of a final estimator.
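
To make the stacking idea more concrete, here is a minimal sketch on synthetic data. It fits a StackingClassifier with three base models and then inspects the matrix that the final estimator receives. The synthetic dataset, the names toy_estimators and toy_stack, and the choice of cv=3 are assumptions made for the illustration; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier

# Synthetic 3-class data (hypothetical stand-in for the vectorized sentences)
X_toy, y_toy = make_classification(n_samples=300,
                                   n_features=20,
                                   n_informative=6,
                                   n_classes=3,
                                   random_state=0)

toy_estimators = [
    ("perceptron", Perceptron(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1_000)),
    ("ridge", RidgeClassifier()),
]

# The final estimator is trained on cross-validated predictions of the base models
toy_stack = StackingClassifier(estimators=toy_estimators,
                               final_estimator=LogisticRegression(max_iter=1_000),
                               cv=3)
toy_stack.fit(X_toy, y_toy)

# Each base model contributes one column per class (probabilities or decision scores)
meta_features = toy_stack.transform(X_toy)

print(meta_features.shape)
print(toy_stack.score(X_toy, y_toy))
```

With three base models and three classes, the final estimator sees nine features per sample instead of the original twenty.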

In [21]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)


# We didn't find how to combine a scaler with a vectorizer in a pipeline. Therefore, we will only use the vectorizer in each pipeline.

## Logistic Regression pipe
lr_model = LogisticRegression(max_iter=10_000,
                          penalty='l2',
                          solver='lbfgs')

lr_pipe = Pipeline([
                    ('vectorizer', vectorizer),
                    ("classifier", lr_model)
                   ])


## Random Forest pipe
randomForest_model = RandomForestClassifier(max_depth=40,
                                            n_estimators=80,
                                           criterion='gini',
                                           max_features='sqrt')

randomForest_pipe = Pipeline([
                    ('vectorizer', vectorizer),
                    ("classifier", randomForest_model)
                   ])


## Ridge classifier pipe
ridge_model = RidgeClassifier(random_state=0, 
                        max_iter=10_000, 
                        alpha=1.2,
                        solver='auto')

ridge_pipe = Pipeline([
                    ('vectorizer', vectorizer),
                    ("classifier", ridge_model)
                   ])

## Perceptron classifier pipe
perceptron_model = Perceptron()

perceptron_pipe = Pipeline([
                    ('vectorizer', vectorizer),
                    ("classifier", perceptron_model)
                   ])
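
The comment in the cell above mentions that combining a scaler with the vectorizer in a pipeline was not straightforward. One likely reason is that MinMaxScaler rejects the sparse matrix produced by TfidfVectorizer. The sketch below shows one possible workaround, which swaps in MaxAbsScaler because it accepts sparse input; for non-negative TF-IDF columns whose minimum is 0, both scalers give the same values. The toy sentences, the labels and the name scaled_pipe are hypothetical, and this variant was not evaluated in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Toy sentences and labels (hypothetical, for illustration only)
toy_sentences = [
    "le chat dort",
    "le chien mange",
    "la situation économique demeure préoccupante",
    "les conséquences économiques restent incertaines",
]
toy_labels = ["A1", "A1", "C1", "C1"]

scaled_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("scaler", MaxAbsScaler()),  # accepts sparse input, unlike MinMaxScaler
    ("classifier", LogisticRegression(max_iter=1_000)),
])
scaled_pipe.fit(toy_sentences, toy_labels)

# Output of the first two steps: still sparse, each column scaled to a maximum of 1
scaled = scaled_pipe[:-1].transform(toy_sentences)

print(type(scaled).__name__, scaled.shape, scaled.max())
print(scaled_pipe.predict(["le chat mange"]))
```

Because the scaler sits inside the pipeline, it is fitted on the training sentences only, and such a pipeline could be passed to StackingClassifier like the other ones.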


In [22]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeCV
from sklearn import set_config
set_config(display="diagram")

estimators = [
    ("Random Forest", randomForest_pipe),
    ("Logistic Regression", lr_pipe),
    ('Ridge Classifier', ridge_pipe)
]

stacking_classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegressionCV(max_iter=10_000))

display(stacking_classifier)

print("Accuracy:")
stacking_classifier.fit(X_train, y_train).score(X_test, y_test)

Accuracy:


0.5541666666666667

On the Kaggle unlabeled set, this model achieved a score of 52.08 %.

In [23]:
estimators = [
    ("Perceptron Classifier", perceptron_pipe),
    ("Logistic Regression", lr_pipe),
    ('Ridge Classifier', ridge_pipe)
]

stacking_classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegressionCV(max_iter=10_000))

display(stacking_classifier)

print("Accuracy:")
stacking_classifier.fit(X_train, y_train).score(X_test, y_test)

Accuracy:


0.55625

Even better with the Perceptron instead of the Random Forest. On the Kaggle unlabeled set, this model achieved a score of approximately 53 %.

## Can we go any further?

After thinking about how to improve the model further, we concluded that the bottleneck was probably the vectorizer. The 4,800 sentences of the training data are not enough to cover the whole French vocabulary, so when the model has to classify a sentence containing a word it has never seen, it has little information to rely on and may misclassify it.
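
A small sketch of this vocabulary problem, under the assumption of a tiny hypothetical corpus: a word-level TF-IDF vectorizer produces an all-zero vector for a sentence made only of unseen words, while a character-level one still finds some overlapping n-grams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny training corpus (hypothetical)
train_sentences = ["le chat dort", "le chien mange"]

# Sentence made only of words that are absent from the training corpus
new_sentence = ["quelle aventure extraordinaire"]

word_vec = TfidfVectorizer(analyzer="word").fit(train_sentences)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit(train_sentences)

# Number of non-zero features the classifier would actually receive
word_nonzero = word_vec.transform(new_sentence).nnz
char_nonzero = char_vec.transform(new_sentence).nnz

print(word_nonzero, char_nonzero)
```

The character n-grams used in this notebook soften the problem, but they only capture surface patterns of the training sentences, which is part of the motivation for moving to a pre-trained model.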

For this reason, we started looking for a pre-trained word embedding model. Mr. Vlachos had mentioned in class a model called BERT that we could try. Following this advice, we found CamemBERT, a model that builds on the same approach as BERT but is trained on French text.

In the next notebook, we show how we implemented this model and reached 57 % accuracy on the unlabeled data.


![test](https://miro.medium.com/max/1200/1*E9NixJnfi8aVGU3ofiqQ8Q.png)