<a href="https://colab.research.google.com/github/JW20221/DMML2022_Ouchy/blob/main/French_Level_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# French Level Detector
### Table of Contents
#### 1. Project Introduction

#### 2. Text Preparation
* 2.1 Install useful libraries
* 2.2 Load Training Data
* 2.3 Tokenization

#### 3. Text Representation
* 3.1 Bag of Words (BOW)
* 3.2 TF-IDF Representation

#### 4. Text Classification
* 4.1 Logistic regression
* 4.2 KNN
* 4.3 Decision Tree
* 4.4 Random Forest

#### Unlabelled Text Data prediction
## 1. Project Introduction

## 2. Text Preparation

### 2.1 Install useful libraries
[spaCy](https://spacy.io/) is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently.

We install the library and its French-language model.

In [None]:
# Install and update spaCy
!pip install -U spacy

# Download the French language model (the same model that is loaded later with spacy.load)
!python -m spacy download fr_core_news_sm


In [8]:
# Import required packages
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

# Import additional packages
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
from spacy.lang.fr.stop_words import STOP_WORDS
from spacy.lang.fr import French
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

### 2.2 Load Training Data

In [10]:
# Load Training data
url = "https://raw.githubusercontent.com/JW20221/DMML2022_Ouchy/main/Data/training_data.csv"
df = pd.read_csv(url)
df.sample(10)

Unnamed: 0,id,sentence,difficulty
2076,2076,"La dynamique de la vie de ces couples, ponctué...",C2
3721,3721,"Or une telle vie, guidée par des désirs multip...",C2
1279,1279,La règle du frein à l'endettement devrait trai...,C1
2819,2819,Monsieur a été le premier à animer une émissio...,B1
3030,3030,Giscard se laisse filmer durant ses vacances à...,C1
1056,1056,Une des sources majeures des gaz à effet de se...,C1
1900,1900,Mais trop de gens sont encore désarmés face au...,C2
2319,2319,"Cependant, un recours en urgence a suspendu la...",C1
3938,3938,"Néanmoins, l'industrie photovoltaïque peut enc...",C2
3413,3413,"Ceux-là, ils sont pas en visioconférence à plu...",B1


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4800 entries, 0 to 4799
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          4800 non-null   int64 
 1   sentence    4800 non-null   object
 2   difficulty  4800 non-null   object
dtypes: int64(1), object(2)
memory usage: 112.6+ KB


In [13]:
# Base rate: check whether the data is balanced.
df.difficulty.value_counts()

A1    813
C2    807
C1    798
B1    795
A2    795
B2    792
Name: difficulty, dtype: int64
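
The counts above can be turned into a base rate, which is the accuracy a classifier would reach by always predicting the most frequent class. The snippet below is a minimal sketch that copies the counts from the output by hand instead of reading them from `df`; the names `counts`, `majority_label` and `base_rate` are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {"A1": 813, "C2": 807, "C1": 798, "B1": 795, "A2": 795, "B2": 792}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
base_rate = counts[majority_label] / total

print(f"Total sentences: {total}")
print(f"Majority class: {majority_label}")
print(f"Base rate: {base_rate:.4f}")
```

With six classes of nearly equal size, the base rate is close to 1/6, so any model accuracy should be judged against that figure.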

### 2.3 Tokenization

**Tokenization** is the process of breaking a text into pieces called tokens. A token is an individual unit of a sentence, such as a word or a punctuation mark, that carries some semantic value. spaCy's tokenizer takes Unicode text as input and outputs a sequence of token objects. spaCy also tokenizes a text automatically whenever a document is created with the language model.
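
To make the idea of a token concrete, here is a rough sketch that splits a French sentence with a regular expression. This is not spaCy's tokenizer, which applies language-specific rules and returns token objects with linguistic attributes; the pattern and the function name `naive_tokenize` are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for a tokenizer. A token is either an elided
# word (l', d', j'), a plain word, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+['’]|\w+|[^\w\s]")

def naive_tokenize(sentence):
    """Return the list of token strings found in the sentence."""
    return TOKEN_PATTERN.findall(sentence)

tokens = naive_tokenize("La règle du frein à l'endettement est claire.")
print(tokens)
```

The sketch only returns strings. The `spacy_tokenizer` function defined in this section relies on the spaCy model instead, because lemmatization needs the linguistic annotations attached to each token.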

In [16]:
# Create a list of punctuation marks
#punctuations = string.punctuation

#punctuations

In [17]:
# Create a list of stopwords
#stop_words = spacy.lang.fr.stop_words.STOP_WORDS

#list(stop_words)[:20]

In [20]:
# Load French language model
sp = spacy.load('fr_core_news_sm')

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
    ## alternative way
    # mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    #mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens

## 3. Text Representation
We now show how to transform a text into a usable input for text classification. Here the texts are the sentences of the training data.

### 3.1 Bag of Words (BOW)
We use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class of scikit-learn.
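
As a minimal sketch of what a bag-of-words representation contains, the snippet below builds the count vectors by hand with the standard library. The toy corpus is hypothetical, and whitespace splitting stands in for a real tokenizer; `CountVectorizer` produces the same kind of document-term matrix, but with its own tokenization and a sparse representation.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the training data.
corpus = [
    "le chat dort",
    "le chien dort",
    "le chat mange le poisson",
]

# Whitespace splitting stands in for a real tokenizer here.
tokenized = [doc.split() for doc in corpus]

# The vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Each document becomes a vector of counts, one slot per vocabulary entry.
bow = []
for doc in tokenized:
    doc_counts = Counter(doc)
    bow.append([doc_counts[tok] for tok in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Word order is lost in this representation: only the number of occurrences of each vocabulary entry is kept.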

### 3.2 TF-IDF Representation


Recall that:

- **term frequency tf** = count(word, document) / len(document)
- **inverse document frequency idf** = log( len(collection) / count(document_containing_term, collection) )
- **tf-idf** = tf * idf

The goal of using tf-idf instead of the raw frequency of a token in a document is to scale down the impact of tokens that occur very frequently across the corpus and are therefore less informative than tokens that occur in only a small fraction of the documents (see [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)).

Note that the IDF value of a word is the same for every document, because it depends only on the collection as a whole: the total number of documents and the number of documents that contain the word. The TF value of a word, on the other hand, differs from document to document.
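
The definitions above can be checked on a small example. The sketch below implements them literally with the standard library on a hypothetical toy corpus; the function names `tf`, `idf` and `tf_idf` are illustrative. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each document vector by default, so its values will differ from the ones computed here.

```python
import math

# Hypothetical toy corpus, tokenized by whitespace for simplicity.
corpus = [
    "le chat dort",
    "le chien dort",
    "le chat mange le poisson",
]
tokenized = [doc.split() for doc in corpus]

def tf(word, doc):
    """Share of the document's tokens that are equal to the word."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

last_doc = tokenized[2]
for word in ("le", "chat", "poisson"):
    print(word, round(tf_idf(word, last_doc, tokenized), 4))
```

The word "le" appears in every document, so its IDF is log(1) = 0 and its tf-idf vanishes even though it is the most frequent token of the last sentence. The word "poisson" appears in a single document and receives the highest weight.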

In [24]:
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

## 4. Text Classification

### 4.1 Logistic regression

In [None]:
# Select features
X = df['sentence'] # the features we want to analyze
ylabels = df['difficulty'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=0, stratify=ylabels)

X_train

In [27]:
y_train

183     C2
90      A2
1128    C2
2336    B1
4398    A1
        ..
3983    A2
1870    A1
394     B2
3244    B2
411     C2
Name: difficulty, Length: 3840, dtype: object

In [32]:
# Define classifier
classifier_1 = LogisticRegression()

# Create pipeline
## The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
pipe_1 = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier_1)])

# Fit model on training set
pipe_1.fit(X_train, y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function spacy_tokenizer at 0x7fb7f25fb940>)),
                ('classifier', LogisticRegression())])

In [29]:
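# Evaluate the model
def evaluate(true, pred):
    # The task has six classes, so precision, recall and F1 need an explicit
    # averaging strategy; macro averaging weights every class equally.
    precision = precision_score(true, pred, average="macro")
    recall = recall_score(true, pred, average="macro")
    f1 = f1_score(true, pred, average="macro")
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")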

In [33]:
# Predictions
y_pred = pipe_1.predict(X_test)

# Evaluation - test set
#evaluate(y_test, y_pred)

In [36]:
# Evaluation - test set
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")

CONFUSION MATRIX:
[[116  22  13   5   2   5]
 [ 37  64  33  12   4   9]
 [ 18  46  54  20   8  13]
 [  6   2  21  72  26  31]
 [  5   4   9  37  53  52]
 [  7   5  15   9  29  96]]
ACCURACY SCORE:
0.4740
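
The confusion matrix above holds enough information to recompute the accuracy and to derive per-class scores. The sketch below copies the matrix by hand and assumes the usual scikit-learn layout: rows are true labels, columns are predicted labels, and the classes appear in sorted order (A1 to C2). The variable names are illustrative.

```python
# Confusion matrix copied from the logistic regression output above.
# Assumption: rows are true labels, columns are predicted labels,
# both in sorted label order.
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
cm = [
    [116, 22, 13, 5, 2, 5],
    [37, 64, 33, 12, 4, 9],
    [18, 46, 54, 20, 8, 13],
    [6, 2, 21, 72, 26, 31],
    [5, 4, 9, 37, 53, 52],
    [7, 5, 15, 9, 29, 96],
]

# Accuracy: correct predictions (the diagonal) over all predictions.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall per class: diagonal entry divided by its row total.
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
# Precision per class: diagonal entry divided by its column total.
precision = {
    lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)
}

print(f"Accuracy: {accuracy:.4f}")
for lab in labels:
    print(f"{lab}: precision={precision[lab]:.4f} recall={recall[lab]:.4f}")
```

Under these assumptions the recomputed accuracy matches the reported score, and the recall values suggest that the extreme levels (A1 and C2) are recovered more reliably than the intermediate ones. Most errors fall in cells next to the diagonal, that is, between neighbouring levels.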


### 4.2 KNN

### 4.3 Decision Tree

### 4.4 Random Forest

In [31]:
# Use random forest
from sklearn.ensemble import RandomForestClassifier

# Define vectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

# Define classifier
classifier_4 = RandomForestClassifier(n_estimators=50)

# Create pipeline
pipe_4 = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier_4)])

# Generate Model on training set
pipe_4.fit(X_train, y_train)

# Predictions
y_pred = pipe_4.predict(X_test)

# Evaluation - test set
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")

CONFUSION MATRIX:
[[126  18  13   5   1   0]
 [ 65  53  27   5   5   4]
 [ 39  40  45  21   9   5]
 [ 16  14  17  56  35  20]
 [ 14   8  15  39  50  34]
 [ 11  15  18  29  28  60]]
ACCURACY SCORE:
0.4062


## Unlabelled Text Data prediction

In [57]:
# Load Unlabelled Test Data
url_test = "https://raw.githubusercontent.com/JW20221/DMML2022_Ouchy/main/Data/unlabelled_test_data.csv"
df_unlabelled_test = pd.read_csv(url_test)
df_unlabelled_test.head(10)

Unnamed: 0,id,sentence
0,0,Nous dûmes nous excuser des propos que nous eû...
1,1,Vous ne pouvez pas savoir le plaisir que j'ai ...
2,2,"Et, paradoxalement, boire froid n'est pas la b..."
3,3,"Ce n'est pas étonnant, car c'est une saison my..."
4,4,"Le corps de Golo lui-même, d'une essence aussi..."
5,5,"Elle jeta un cri, un petit cri, voulut se dres..."
6,6,"Madame, Monsieur, Votre fils Léo arrive tous l..."
7,7,Comment tu as trouvé le repas de ce midi
8,8,Mais la racine du mal est bel est bien cette f...
9,9,"Je ne peux pas vous laisser dire cela, Madame."


In [58]:
# Select features
X_unlabelled_test = df_unlabelled_test['sentence'] # the features we want to analyze

In [59]:
# Predictions for unlabelled test data
y_pred_unlabelled_test = pipe_1.predict(X_unlabelled_test)
y_pred_unlabelled_test

array(['C2', 'B1', 'B1', ..., 'C2', 'A1', 'B2'], dtype=object)

In [60]:
# Add column "difficulty"
df_unlabelled_test['difficulty'] = y_pred_unlabelled_test

df_unlabelled_test

Unnamed: 0,id,sentence,difficulty
0,0,Nous dûmes nous excuser des propos que nous eû...,C2
1,1,Vous ne pouvez pas savoir le plaisir que j'ai ...,B1
2,2,"Et, paradoxalement, boire froid n'est pas la b...",B1
3,3,"Ce n'est pas étonnant, car c'est une saison my...",B1
4,4,"Le corps de Golo lui-même, d'une essence aussi...",C2
...,...,...,...
1195,1195,C'est un phénomène qui trouve une accélération...,B1
1196,1196,Je vais parler au serveur et voir si on peut d...,A2
1197,1197,Il n'était pas comme tant de gens qui par pare...,C2
1198,1198,Ils deviennent dangereux pour notre économie.,A1


In [62]:
#Drop column "sentence"
df_labelled_test = df_unlabelled_test.drop(columns = ['sentence'])
df_labelled_test


Unnamed: 0,id,difficulty
0,0,C2
1,1,B1
2,2,B1
3,3,B1
4,4,C2
...,...,...
1195,1195,B1
1196,1196,A2
1197,1197,C2
1198,1198,A1


In [65]:
# Write DataFrame to a CSV file
df_labelled_test.to_csv(r'C:\Users\Jingmin\Desktop.csv', index=False)