# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts Using Logistic Regression Classification

Import the main packages

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")

Read the training data

In [29]:
df = pd.read_csv('training_data.csv')
df.head()

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


Read the unlabelled test data, on which to make predictions

In [30]:
df_pred = pd.read_csv('unlabelled_test_data.csv')
df_pred.head()

| | id | sentence |
|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


The submission format is the following:

In [31]:
df_example_submission = pd.read_csv('sample_submission.csv')
df_example_submission.head()

| | id | difficulty |
|---|---|---|
| 0 | 0 | A1 |
| 1 | 1 | A1 |
| 2 | 2 | A1 |
| 3 | 3 | A1 |
| 4 | 4 | A1 |


### Check the value of the baseline

Compute a baseline accuracy as a point of comparison: a classifier is only useful if it beats always predicting the most frequent class.

In [32]:
np.random.seed(0)

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

In [34]:
# Define x and y, and the respective training data and test data
x = df['sentence']
y = df['difficulty']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [35]:
# Baseline - using dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

dummy.fit(None, y_train)
baseline = dummy.score(None, y_test)

print('The value of the baseline in our data is',baseline.round(4))

The value of the baseline in our data is 0.1677


In [36]:
# Baseline - identifying the most frequent difficulty
df.difficulty.value_counts()

A1    813
C2    807
C1    798
B1    795
A2    795
B2    792
Name: difficulty, dtype: int64

In [37]:
# Share of the most frequent class over the whole dataset
baseline2 = df.difficulty.value_counts(normalize=True).max()

print('The value of the baseline in our data is', round(baseline2, 4))

The value of the baseline in our data is 0.1694
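
The two baseline values differ slightly, presumably because the dummy classifier is scored on the 20% test split while the second figure is computed on the whole dataset. As a minimal sketch on made-up labels (the `toy_*` arrays are assumptions, not the project data), the `most_frequent` strategy simply yields the share of the training majority class among the test labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, invented for illustration only
toy_y_train = np.array(["A1", "A1", "A1", "B1", "B1", "C1"])
toy_y_test = np.array(["A1", "B1", "B1", "C1"])

# The dummy classifier ignores the features, so zeros are enough
toy_x_train = np.zeros((len(toy_y_train), 1))
toy_x_test = np.zeros((len(toy_y_test), 1))

toy_dummy = DummyClassifier(strategy="most_frequent").fit(toy_x_train, toy_y_train)
dummy_score = toy_dummy.score(toy_x_test, toy_y_test)

# Same number by hand: how often the training majority class occurs in the test labels
labels, counts = np.unique(toy_y_train, return_counts=True)
majority = labels[np.argmax(counts)]
manual_score = float(np.mean(toy_y_test == majority))

print(majority, dummy_score, manual_score)
```

Both scores agree because the dummy classifier predicts the training majority class for every test sentence.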


### Logistic Regression (without data cleaning)

Train a simple logistic regression model using a TF-IDF vectorizer.

To classify text, we first need a vectorizer that turns sentences into numeric features, along with a few additional packages. When installing spaCy, make sure to download the model for the language being classified, here French.

In [None]:
# Install and update spaCy
!pip install -U spacy
!python -m spacy download fr_core_news_sm

# Import necessary packages
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
import string
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.fr import French
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Define the vectorizer and the classification method that will be used. Once those are defined, combine them in a pipeline and fit it on the training set.

In [47]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))
lr = LogisticRegression(solver='lbfgs', max_iter=10000, random_state=0)

pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', lr)])

pipe.fit(x_train, y_train)

y_pred = pipe.predict(x_test)

To analyse our results, we compute the accuracy, precision, recall and F1 score on the test set, and print the confusion matrix.

In [40]:
def evaluate(true, pred):
    global precision,recall,f1
    precision = precision_score(true, pred, average='weighted')
    recall = recall_score(true, pred, average='weighted')
    f1 = f1_score(true, pred, average='weighted')
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [41]:
evaluate(y_test,y_pred)

CONFUSION MATRIX:
[[89 41 22  6  2  1]
 [50 59 32 10  9  4]
 [14 37 63 18 11 17]
 [10  2 14 59 34 25]
 [ 4  4 10 38 67 50]
 [ 7  8  5 25 32 81]]
ACCURACY SCORE:
0.4354
CLASSIFICATION REPORT:
	Precision: 0.4340
	Recall: 0.4354
	F1_Score: 0.4337
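
The reported recall equals the accuracy because a support-weighted average of per-class recalls reduces to the overall share of correct predictions. The sketch below recomputes both from the confusion matrix printed above. The class order `A1` to `C2` is an assumption based on scikit-learn sorting the labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [89, 41, 22,  6,  2,  1],
    [50, 59, 32, 10,  9,  4],
    [14, 37, 63, 18, 11, 17],
    [10,  2, 14, 59, 34, 25],
    [ 4,  4, 10, 38, 67, 50],
    [ 7,  8,  5, 25, 32, 81],
])
# Assumed label order (sorted class labels)
class_names = ["A1", "A2", "B1", "B2", "C1", "C2"]

support = cm.sum(axis=1)                  # number of true samples per class
recall_per_class = np.diag(cm) / support  # correct predictions / true samples
accuracy_from_cm = np.trace(cm) / cm.sum()
weighted_recall = np.sum(recall_per_class * support) / support.sum()

for name, rec in zip(class_names, recall_per_class):
    print(f"{name}: recall = {rec:.4f}")
print(f"accuracy = {accuracy_from_cm:.4f}, weighted recall = {weighted_recall:.4f}")
```

Most of the off-diagonal mass sits next to the diagonal, which suggests that errors are mainly confusions between adjacent difficulty levels.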


We can now look at sentences that were misclassified. To do so, we compare the labels predicted by the classifier with the true labels of the test set.

In [42]:
y_pred_df = pd.DataFrame(data=y_pred)
y_pred_df.columns = ['difficulty']
y_test_df = pd.DataFrame(data=y_test).reset_index().drop(['index'],axis=1)
x_test_df = pd.DataFrame(data=x_test).reset_index().drop(['index'],axis=1)

In [43]:
df_check = (y_pred_df != y_test_df)

In [48]:
# Print the first few misclassified sentences
n_examples = 4
wrong_sentences = x_test_df.loc[df_check['difficulty'], 'sentence']
for sentence in wrong_sentences.head(n_examples):
  print('Example of a wrongly classified text:', sentence)

Example of a wrongly classified text: C'est en décembre 1967, après bien des invectives au Parlement, que sa loi relative à la régulation des naissances, dite loi Neuwirth est votée : elle autorise la vente exclusive des contraceptifs en pharmacie sur ordonnance médicale, avec autorisation parentale pour les mineures
Example of a wrongly classified text: Giscard va pourtant réussir à transformer ce revers en tremplin
Example of a wrongly classified text: Un choix difficile mais important : le public français écoute souvent les professionnels de Cannes pour choisir le film qu'il va aller voir au cinéma.
Example of a wrongly classified text: Le débat porte plutôt sur l'utilité d'une telle mesure.
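
Seeing the true and predicted labels next to each sentence can make such errors easier to interpret. A minimal sketch, using a toy DataFrame whose sentences and labels are invented stand-ins for `x_test`, `y_test` and `y_pred`:

```python
import pandas as pd

# Toy stand-ins for the test sentences, true labels and predictions (illustration only)
toy_errors = pd.DataFrame({
    "sentence": ["Bonjour.", "Il fait beau aujourd'hui.", "Nous dûmes partir."],
    "true": ["A1", "A2", "C2"],
    "predicted": ["A1", "B1", "C1"],
})

# Keep only the rows where the prediction differs from the true label
toy_misclassified = toy_errors[toy_errors["true"] != toy_errors["predicted"]]
print(toy_misclassified)

# Count how often each (true, predicted) pair occurs among the errors
toy_pair_counts = toy_misclassified.groupby(["true", "predicted"]).size()
print(toy_pair_counts)
```

The same pattern should apply to the real test set once the sentences, true labels and predictions are placed in one DataFrame with a shared index.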


Now we can generate predictions on `unlabelled_test_data.csv`, ensuring that they match the format of `sample_submission.csv` in order to submit them.

In [50]:
x_pred = df_pred['sentence']

y_prediction_lr = pipe.predict(x_pred)

In [51]:
df_pred_lr = df_pred.copy()  # copy so that df_pred itself is left unchanged
df_pred_lr['difficulty'] = y_prediction_lr
df_submission_lr = df_pred_lr.drop(['sentence'], axis=1)
df_submission_lr.to_csv('subimissionlr.csv',index=False)
df_submission_lr.head()

| | id | difficulty |
|---|---|---|
| 0 | 0 | C2 |
| 1 | 1 | B1 |
| 2 | 2 | A1 |
| 3 | 3 | B1 |
| 4 | 4 | C2 |
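
Before submitting, it may be worth checking that the output has the same columns as the sample submission. The helper `same_format` below is hypothetical (it is not part of the original notebook), and the toy frames stand in for `df_submission_lr` and `df_example_submission`:

```python
import pandas as pd

def same_format(submission, example):
    """Hypothetical helper: True if both frames share columns and no label is missing."""
    same_columns = list(submission.columns) == list(example.columns)
    return same_columns and not submission["difficulty"].isna().any()

# Toy frames, invented for illustration only
toy_example = pd.DataFrame({"id": [0, 1], "difficulty": ["A1", "A1"]})
toy_submission = pd.DataFrame({"id": [0, 1], "difficulty": ["C2", "B1"]})

print(same_format(toy_submission, toy_example))
```

In the notebook, the equivalent call would be `same_format(df_submission_lr, df_example_submission)`.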
