# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts Using K-Nearest Neightbors.

Import the main packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")

Read the training data.

In [2]:
df = pd.read_csv('training_data.csv')
df.head()

Unnamed: 0,id,sentence,difficulty
0,0,Les coûts kilométriques réels peuvent diverger...,C1
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1
2,2,Le test de niveau en français est sur le site ...,A1
3,3,Est-ce que ton mari est aussi de Boston?,A1
4,4,"Dans les écoles de commerce, dans les couloirs...",B1


Read the unlabelled test data, on which to make predictions.

In [3]:
df_pred = pd.read_csv('unlabelled_test_data.csv')
df_pred.head()

Unnamed: 0,id,sentence
0,0,Nous dûmes nous excuser des propos que nous eû...
1,1,Vous ne pouvez pas savoir le plaisir que j'ai ...
2,2,"Et, paradoxalement, boire froid n'est pas la b..."
3,3,"Ce n'est pas étonnant, car c'est une saison my..."
4,4,"Le corps de Golo lui-même, d'une essence aussi..."


The submission format is the following:

In [4]:
df_example_submission = pd.read_csv('sample_submission.csv')
df_example_submission.head()

Unnamed: 0,id,difficulty
0,0,A1
1,1,A1
2,2,A1
3,3,A1
4,4,A1


### Check the value of the baseline

Check the baseline to have a better understanding of the precision of the classification.

In [5]:
np.random.seed = 0

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

In [7]:
# Define x and y, and the respective training data and test data
x = df['sentence']
y = df['difficulty']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [9]:
# Baseline - using dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

dummy.fit(None, y_train)
baseline = dummy.score(None, y_test)

print('The value of the baseline in our data is',baseline.round(4))

The value of the baseline in our data is 0.1677


In [10]:
# Baseline - identifying the most frequent difficulty
df.difficulty.value_counts()

A1    813
C2    807
C1    798
B1    795
A2    795
B2    792
Name: difficulty, dtype: int64

In [11]:
baseline2 = round(df.difficulty.value_counts()['A1']/len(df),4)

print('The value of the baseline in our data is',baseline2.round(5))

The value of the baseline in our data is 0.1694


#### KNN (without data cleaning)

Train a KNN classification model using a Tfidf vectoriser.

In order to classify text, we need to use a vectorizer and also import certain packages. It is also important to note that while installing the packages, you must import the correct language you will be classifying.

In [None]:
# Install and update spaCy
!pip install -U spacy
!python -m spacy download fr

# Import necessary packages
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
import string
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.fr import French
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Define the vectorizer and the classification method that will be used, once those are defined, create a pipeline to classify the training set.

In [14]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))
knn = KNeighborsClassifier()

pipe = Pipeline([('vectorizer',tfidf),
                 ('classifier', knn)])
pipe.fit(x_train, y_train)

y_pred = pipe.predict(x_test)

In order to analyse our results, we calculated the accuracy, precision, recall and F1 score on the test set, besides plotting the confusion matrix.

In [15]:
def evaluate(true, pred):
    global precision,recall,f1
    precision = precision_score(true, pred, average='weighted')
    recall = recall_score(true, pred, average='weighted')
    f1 = f1_score(true, pred, average='weighted')
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [16]:
evaluate(y_test,y_pred)

CONFUSION MATRIX:
[[ 34   4   3   1 119   0]
 [ 12   2   2   0 148   0]
 [  6   2   0   0 152   0]
 [  1   1   0   1 141   0]
 [  1   0   0   0 171   1]
 [  0   1   0   0 157   0]]
ACCURACY SCORE:
0.2167
CLASSIFICATION REPORT:
	Precision: 0.2495
	Recall: 0.2167
	F1_Score: 0.1171


In order to improve the classification, we tuned the hyper parameters for `n_neighbors`,   `p` and `weights` using `GridSearchCV()`.

In [19]:
from sklearn.model_selection import GridSearchCV

k_range = list(range(1,31,2))
parameters = { 'classifier__n_neighbors' : k_range,
               'classifier__p' : (1,2),
               'classifier__weights' : ['uniform','distance']
              }

gs = GridSearchCV(pipe, parameters, scoring='accuracy', return_train_score = False, verbose=1)
grid_search = gs.fit(x_train,y_train)
best_params = grid_search.best_params_

k = best_params['classifier__n_neighbors']
p = best_params['classifier__p']
w = best_params['classifier__weights']

print('By tuning the hyper parameters, we find that the best parameters for our KNN classification are:','\n',
      'n_neighbors:',k,'\n',
      'p:',p,'\n',
      'weights:',w)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
By tuning the hyper parameters, we find that the best parameters for our KNN classification are: 
 n_neighbors: 29 
 p: 2 
 weights: distance


With the new parameters, we run the classification once more, to try to get better results.

In [20]:
knn_gs = KNeighborsClassifier(n_neighbors=k, p=p, weights=w)

pipekg = Pipeline([('vectorizer',tfidf),
                 ('classifier', knn_gs)])
pipekg.fit(x_train, y_train)

y_predKNN_GS = pipekg.predict(x_test)

To check the results, we evaluate the new predictions.

In [21]:
evaluate(y_test,y_predKNN_GS)

CONFUSION MATRIX:
[[ 88  31  17   2  20   3]
 [ 74  35  18   3  26   8]
 [ 43  29  33  10  40   5]
 [ 17   4   9  23  76  15]
 [  6   5   9  15 115  23]
 [ 12   7   4  11  69  55]]
ACCURACY SCORE:
0.3635
CLASSIFICATION REPORT:
	Precision: 0.3733
	Recall: 0.3635
	F1_Score: 0.3419


Now we can generate predictions on the `unlabelled_test_data.csv`, insuring that they match the format of the `unlabelled_test_data.csv` in order to submit them.

In [22]:
x_pred = df_pred['sentence']

y_prediction_knn = pipe.predict(x_pred)

In [23]:
df_pred_knn = df_pred
df_pred_knn['difficulty'] = y_prediction_knn
df_submission_knn = df_pred_knn.drop(['sentence'], axis=1)
df_submission_knn.to_csv('subimissionknn.csv',index=False)
df_submission_knn.head()

Unnamed: 0,id,difficulty
0,0,C2
1,1,C1
2,2,C1
3,3,A1
4,4,C1
