# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts

### Step by step guidelines

The following step-by-step guidelines will help you get started with your project for the Data Mining and Machine Learning class.
To test what you learned in the class, we will hold a competition. You will create a classifier that predicts the difficulty level of a French text (A1, ..., C2). The team with the highest rank will get some goodies in the last class (some souvenirs from tech companies: Amazon, LinkedIn, etc.).

**2 people per team**

Choose a team here:
https://moodle.unil.ch/mod/choicegroup/view.php?id=1305831


#### 1. 📂 Create a public GitHub repository for your team using this naming convention `DMML2022_[your_team_name]` with the following structure:
- data (folder) 
- code (folder) 
- documentation (folder)
- a readme file (.md): *mention team name, participants, brief description of the project, approach, summary of results table and link to the explanatory video (see below).*

All team members should contribute to the GitHub repository.

#### 2. 🇰 Join the competition on Kaggle using the invitation link we sent on Slack.

Under the Team tab, save your team name (`UNIL_your_team_name`) and make sure your team members join in as well. You can merge your user account with your teammates in order to create a team.

#### 3. 📓 Read the data into your colab notebook. There should be one code notebook per team, but all team members can participate and contribute code. 

You can either use the Kaggle API directly with your Kaggle credentials (as explained below; this is **entirely optional**), or download the data from Kaggle and upload it to your team's GitHub repository under the data subfolder.

#### 4. 💎 Train your models and upload the code under your team's GitHub repo. Set the `random_state=0`.
- baseline
- logistic regression with TF-IDF vectoriser (simple, no data cleaning)
- KNN & hyperparameter optimisation (simple, no data cleaning)
- Decision Tree classifier & hyperparameter optimisation (simple, no data cleaning)
- Random Forests classifier (simple, no data cleaning)
- another technique or combination of techniques of your choice

BE CREATIVE! You can use whatever method you want, in order to climb the leaderboard. The only rule is that it must be your own work. Given that, you can use all the online resources you want. 

#### 5. 🎥 Create a YouTube video (10-15 minutes) of your solution and embed it in your notebook. Explain the algorithms used and the evaluation of your solutions. *Select* projects will also be presented live by the group during the last class.


### Submission details (one per team)

1. Download a ZIP file of your team's repository and submit it on Moodle here. IMPORTANT: in the submission comment, insert a link to the repository on GitHub.
https://moodle.unil.ch/mod/assign/view.php?id=1305833



### Grading (one per team)
- 20% Kaggle Rank
- 50% code quality (using classes, splitting into proper files, documentation, etc.)
- 15% GitHub quality (include link to video, table with progress over time, organization of code, images, etc.)
- 15% video quality (good sound, good slides, interesting presentation).

## Some further details for points 3 and 4 above.


### 3. Read data into your notebook with the Kaggle API (optional but useful). 

You can also download the data from Kaggle and put it in the data folder of your team's repo.

In [1]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# install Kaggle
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### IMPORTANT
Log into your Kaggle account, go to Account > API > Create new API token. You will obtain a kaggle.json file. Save it in your Google Drive (not in a folder, in your general drive).

In [3]:
!mkdir -p ~/.kaggle

In [4]:
#read in your Kaggle credentials from Google Drive
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
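
The Kaggle command-line tool may warn when the token file is readable by other users. As a minimal sketch, the copy step can also be done from Python while restricting the file to its owner. The helper name `install_kaggle_token` and its `target_dir` parameter are hypothetical and only serve this illustration.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source, target_dir=None):
    """Copy a Kaggle API token into place and make it owner-only (hypothetical helper)."""
    target_dir = Path(target_dir) if target_dir else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return target


# In Colab, with Drive mounted as in the cells above (path taken from this notebook):
# install_kaggle_token("/content/drive/MyDrive/kaggle.json")
```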


In [5]:
!mkdir -p data


In [6]:
# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2022

Downloading detecting-french-texts-difficulty-level-2022.zip to /content
  0% 0.00/303k [00:00<?, ?B/s]
100% 303k/303k [00:00<00:00, 106MB/s]


In [7]:
!unzip -o "detecting-french-texts-difficulty-level-2022.zip" -d data

Archive:  detecting-french-texts-difficulty-level-2022.zip
  inflating: data/sample_submission.csv  
  inflating: data/training_data.csv  
  inflating: data/unlabelled_test_data.csv  


In [8]:
# read in your training data
import pandas as pd
import numpy as np

df = pd.read_csv('/content/data/training_data.csv')

In [9]:
df.head()

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


Have a look at the data on which to make predictions.

In [10]:
df_pred = pd.read_csv('/content/data/unlabelled_test_data.csv')
df_pred.head()

| | id | sentence |
|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


And this is the format for your submissions.

In [11]:
df_example_submission = pd.read_csv('/content/data/sample_submission.csv')
df_example_submission.head()

| | id | difficulty |
|---|---|---|
| 0 | 0 | A1 |
| 1 | 1 | A1 |
| 2 | 2 | A1 |
| 3 | 3 | A1 |
| 4 | 4 | A1 |


### 4. Train your models

Set your X and y variables.
Set `random_state=0`.
Split the data into a train and test set using the following parameters: `train_test_split(X, y, test_size=0.2, random_state=0)`.



#### 4.1. Baseline
What is the baseline for this classification problem?

In [12]:
np.random.seed(0)  # np.random.seed is a function; assigning to it would overwrite it
from sklearn.model_selection import train_test_split

In [13]:
X = df.sentence
y = df.difficulty
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [14]:
base_rate = np.max(df.difficulty.value_counts()/df.difficulty.shape[0])

print("Base rate:", base_rate)

df.difficulty.value_counts()

Base rate: 0.169375


A1    813
C2    807
C1    798
B1    795
A2    795
B2    792
Name: difficulty, dtype: int64
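
The base rate above is computed on the full training file. As a hedged sketch, the same idea can be measured on a held-out split with scikit-learn's `DummyClassifier`, which always predicts the most frequent class it saw during training. The toy DataFrame below is an assumption standing in for `df`; only its column names mirror the competition data.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the competition data (assumption: same column names as df).
toy = pd.DataFrame({
    "sentence": [f"phrase {i}" for i in range(60)],
    "difficulty": ["A1", "A2", "B1", "B2", "C1", "C2"] * 10,
})

X_toy, y_toy = toy[["sentence"]], toy["difficulty"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The features are ignored: the model only memorises the majority class of y_tr.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print("Baseline accuracy:", round(baseline_accuracy, 4))
```

Any real model should beat this number on the same split; with six nearly balanced classes it sits close to one sixth.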

#### 4.1.1 Import libraries

In [15]:
!pip install -U spacy
!python -m spacy download fr_core_news_sm
import numpy as np
import pandas as pd 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import string
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from spacy.lang.fr.stop_words import STOP_WORDS  # French stop words, since the texts are in French
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

sp = spacy.load('fr_core_news_sm')

Collecting fr-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.4.0/fr_core_news_sm-3.4.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')




#### 4.1.2 Tokenisation

In [16]:
punctuations = string.punctuation
stop_words = spacy.lang.fr.stop_words.STOP_WORDS

def spacy_tokenizer(sentence):
    mytokens = sp(sentence)
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    return mytokens
### Version with data cleaning:
# tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
### Version without data cleaning:
tfidf_vector = TfidfVectorizer()


#### 4.2. Logistic Regression (without data cleaning)

Train a simple logistic regression model using a TF-IDF vectoriser.

In [17]:
classifier = LogisticRegression()
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])
pipe.fit(X_train, y_train)


Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', LogisticRegression())])

Calculate accuracy, precision, recall and F1 score on the test set.

In [18]:
avg = 'weighted'  # options: None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted'

def evaluate(true, pred, name):
    """Print the metrics and store them in a global DataFrame named dfx_<name>."""
    accuracy = accuracy_score(true, pred)
    precision = precision_score(true, pred, average=avg)
    recall = recall_score(true, pred, average=avg)
    f1 = f1_score(true, pred, average=avg)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy:.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")
    # round() works whether the metric is a Python float or a NumPy scalar
    dfx = pd.DataFrame([round(accuracy, 5), round(precision, 5), round(recall, 5), round(f1, 5)])
    globals()[f'dfx_{name}'] = dfx  # the summary cell reads dfx_LR, dfx_knn, dfx_dt, dfx_rfc

y_pred = pipe.predict(X_test)
evaluate(y_test, y_pred, "LR")


CONFUSION MATRIX:
[[93 31 21 10  4  2]
 [54 60 30  6  6  8]
 [12 38 64 17  9 20]
 [ 6  6 15 66 27 24]
 [ 4  4 10 37 73 45]
 [ 7  8  8 19 24 92]]
ACCURACY SCORE:
0.4667
CLASSIFICATION REPORT:
	Precision: 0.4656
	Recall: 0.4667
	F1_Score: 0.4640
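
In the report above, recall and accuracy are both 0.4667. That is expected with `average='weighted'`: weighting each class's recall by its number of true samples and averaging gives the overall share of correct predictions. The sketch below checks this directly on the confusion matrix printed above (rows are true levels, columns are predicted levels, in the order A1 to C2).

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above.
cm = np.array([
    [93, 31, 21, 10,  4,  2],
    [54, 60, 30,  6,  6,  8],
    [12, 38, 64, 17,  9, 20],
    [ 6,  6, 15, 66, 27, 24],
    [ 4,  4, 10, 37, 73, 45],
    [ 7,  8,  8, 19, 24, 92],
])

support = cm.sum(axis=1)                  # true samples per class
per_class_recall = np.diag(cm) / support  # correct predictions / true samples
accuracy = np.trace(cm) / cm.sum()        # 448 correct out of 960
weighted_recall = np.sum(per_class_recall * support) / support.sum()

print("Per-class recall:", np.round(per_class_recall, 3))
print("Accuracy:", round(float(accuracy), 4))
print("Weighted recall:", round(float(weighted_recall), 4))
```

The per-class recalls are the more informative part: they show which levels the model confuses most, which the single weighted figure hides.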


Have a look at the confusion matrix and identify a few examples of sentences that are not well classified.

In [19]:
errordf = pd.DataFrame({ 'Label': y_test, 'Prediction': y_pred, 'Sentences': X_test})
errordf = errordf[errordf['Label'] != errordf['Prediction']]
print(errordf)

     Label Prediction                                          Sentences
2255    C1         C2  C'est en décembre 1967, après bien des invecti...
608     C1         B2  Giscard va pourtant réussir à transformer ce r...
2856    A2         B1  Un choix difficile mais important : le public ...
1889    B1         C1  Le débat porte plutôt sur l'utilité d'une tell...
2358    A2         B1  Il faut du temps et du courage pour soigner to...
...    ...        ...                                                ...
3959    A1         B1                                    J'écris un peu.
4595    A2         B2  Tous les prix sont affichés, mais si besoin, j...
891     C1         B2  Très présente dans l'alimentation antillaise, ...
1005    C1         B1  On réinvente le dimanche dans une perspective ...
1940    C1         B2  Pour les femmes surtout, nuancent Régine Lemoi...

[512 rows x 3 columns]
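
Because the levels are ordered, it can help to look at how far off each error is, not only whether it is wrong. The following is a small sketch on five of the misclassified rows printed above; `toy_errors` stands in for `errordf`, and the `Distance` column is an addition made for this illustration.

```python
import pandas as pd

levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
level_rank = {level: rank for rank, level in enumerate(levels)}

# Five rows taken from the error table above (assumption: same column names as errordf).
toy_errors = pd.DataFrame({
    "Label":      ["C1", "C1", "A2", "B1", "A1"],
    "Prediction": ["C2", "B2", "B1", "C1", "B1"],
})

# Signed distance in levels: positive means the model predicted a harder level.
toy_errors["Distance"] = (
    toy_errors["Prediction"].map(level_rank) - toy_errors["Label"].map(level_rank)
)

# How many errors are off by one level, two levels, and so on.
distance_counts = toy_errors["Distance"].abs().value_counts().sort_index()
print(distance_counts)
```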


Generate your first predictions on `unlabelled_test_data.csv`. Make sure your predictions match the format of `sample_submission.csv`.

In [20]:
print("Example submission:")
print(df_example_submission.head())
y_pred = pipe.predict(df_pred.sentence)
y_pred = pd.DataFrame({ 'id': df_pred.index, 'difficulty': y_pred})
print("Predictions:")
print(y_pred.head())


Example submission:
   id difficulty
0   0         A1
1   1         A1
2   2         A1
3   3         A1
4   4         A1
Predictions:
   id difficulty
0   0         C2
1   1         A2
2   2         A1
3   3         A1
4   4         C2
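
The predictions still need to be written to a file before they can be uploaded. A minimal sketch, assuming a two-column layout like the example submission; the file name `submission.csv` and the toy values are placeholders.

```python
import pandas as pd

# Toy predictions standing in for the DataFrame built from pipe.predict (assumption).
submission = pd.DataFrame({"id": [0, 1, 2], "difficulty": ["C2", "A2", "A1"]})

# index=False keeps the file to the two expected columns, id and difficulty.
submission.to_csv("submission.csv", index=False)  # hypothetical file name

# Read the file back to confirm its layout before uploading.
check = pd.read_csv("submission.csv")
print(check.columns.tolist())
```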


#### 4.3. KNN (without data cleaning)

Train a KNN classification model using a TF-IDF vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [21]:
classifier = KNeighborsClassifier()
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
evaluate(y_test, y_pred, "knn")

CONFUSION MATRIX:
[[121  28   8   1   1   2]
 [ 98  51  12   1   1   1]
 [ 81  39  33   3   1   3]
 [ 49  30  19  29   3  14]
 [ 48  36  29  15  29  16]
 [ 37  29  17  23   9  43]]
ACCURACY SCORE:
0.3187
CLASSIFICATION REPORT:
	Precision: 0.4030
	Recall: 0.3187
	F1_Score: 0.3022


Try to improve it by tuning the hyperparameters (`n_neighbors`, `p`, `weights`).

In [22]:
param_grid = [{'n_neighbors':np.arange(1, 100),
        'p':np.arange(1,3),
        'weights':['uniform','distance']
       }]

grid = GridSearchCV(estimator=classifier, param_grid=param_grid, cv=10, verbose=True)

pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', grid)])

grid_search = pipe.fit(X_train, y_train)

Fitting 10 folds for each of 396 candidates, totalling 3960 fits


In [23]:
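print("Hyperparameters:", grid.best_params_)
# Note: best_score_ is the mean cross-validated accuracy of the best candidate, not a training-set score
print("Train Score:", round(grid.best_score_, 4))
print("Test Score:", round(pipe.score(X_test, y_test), 4))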

Hyperparameters: {'n_neighbors': 4, 'p': 2, 'weights': 'distance'}
Train Score: 0.3549
Test Score: 0.3677
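
In the cells above, the vectoriser is fitted once on all of `X_train` before the grid search runs its folds. An alternative, sketched here on a toy corpus, passes the whole pipeline to `GridSearchCV` so that the vectoriser is refitted inside each fold; parameters are then addressed as `<step name>__<parameter>`. The texts, labels and the small grid are assumptions chosen to keep the example fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus standing in for X_train / y_train (assumption).
texts = ["je suis ici", "tu es la", "il est ici", "elle est la",
         "nous analysons les resultats", "vous analysez les donnees",
         "ils analysent les resultats", "elles analysent les donnees"]
labels = ["A1", "A1", "A1", "A1", "B2", "B2", "B2", "B2"]

knn_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("classifier", KNeighborsClassifier())])

# Step name + double underscore addresses a parameter of that pipeline step.
param_grid = {"classifier__n_neighbors": [1, 3],
              "classifier__weights": ["uniform", "distance"]}

search = GridSearchCV(knn_pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
print(round(search.best_score_, 4))
```

With this layout, `search.predict` and `search.score` apply the refitted best pipeline directly to raw sentences.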


#### 4.4. Decision Tree Classifier (without data cleaning)

Train a Decision Tree classifier, using a TF-IDF vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [24]:
classifier = DecisionTreeClassifier()
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
evaluate(y_test, y_pred, "dt")

CONFUSION MATRIX:
[[81 42 21 10  3  4]
 [49 61 31 16  4  3]
 [29 37 39 18 21 16]
 [ 6 20 28 45 29 16]
 [11 19 33 36 40 34]
 [12 10 32 35 31 38]]
ACCURACY SCORE:
0.3167
CLASSIFICATION REPORT:
	Precision: 0.3176
	Recall: 0.3167
	F1_Score: 0.3135


Try to improve it by tuning the hyperparameters (`max_depth`, the depth of the decision tree).

In [25]:
grid = {'max_depth':np.arange(1,7)}

tree_cv = GridSearchCV(classifier, grid, cv=10)

pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', tree_cv)])

tree_cv_GS=pipe.fit(X_train, y_train)

In [26]:
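print("Hyperparameters:", tree_cv.best_params_)
# Note: best_score_ is the mean cross-validated accuracy of the best candidate, not a training-set score
print("Train Score:", round(tree_cv.best_score_, 4))
print("Test Score:", round(pipe.score(X_test, y_test), 4))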

Hyperparameters: {'max_depth': 6}
Train Score: 0.3052
Test Score: 0.2979


#### 4.5. Random Forest Classifier (without data cleaning)

Try a Random Forest Classifier, using a TF-IDF vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [27]:
classifier = RandomForestClassifier()
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
evaluate(y_test, y_pred, "rfc")

CONFUSION MATRIX:
[[124  19  10   5   2   1]
 [ 77  58  22   5   2   0]
 [ 38  46  40  20   8   8]
 [ 16  11  14  67  16  20]
 [ 15  12  18  53  45  30]
 [ 19  11   8  35  21  64]]
ACCURACY SCORE:
0.4146
CLASSIFICATION REPORT:
	Precision: 0.4208
	Recall: 0.4146
	F1_Score: 0.4000
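
Tree-based models draw random numbers while training, so the guideline to set `random_state=0` matters for them: without it, scores can shift slightly from run to run. A brief sketch on synthetic data (an assumption, standing in for the TF-IDF features) shows that fixing the seed makes two separate fits agree exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the vectorised sentences (assumption).
X_syn, y_syn = make_classification(n_samples=200, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)


def fit_predict(seed):
    """Train a small forest with the given seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X_syn, y_syn).predict_proba(X_syn)


same_seed_matches = np.array_equal(fit_predict(0), fit_predict(0))
print("Identical with the same seed:", same_seed_matches)
```

The same argument applies to `DecisionTreeClassifier(random_state=0)`.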


#### 4.7. Show a summary of your results

In [28]:
t2 = pd.concat([dfx_LR,dfx_knn, dfx_dt, dfx_rfc, pd.DataFrame(['0.74333', '-', '-', '-'])],axis=1)
t2.columns = ['Logistic Regression', 'kNN', 'Decision Tree ', 'Random Forests', 'Our Model']
t2 = t2.rename(index={0 : 'Accuracy', 1 : 'Precision', 2 : 'Recall', 3 : 'F1-score', })
t2

| | Logistic Regression | kNN | Decision Tree | Random Forests | Our Model |
|---|---|---|---|---|---|
| Accuracy | 0.46667 | 0.31875 | 0.31667 | 0.41458 | 0.74333 |
| Precision | 0.46556 | 0.40304 | 0.31757 | 0.42082 | - |
| Recall | 0.46667 | 0.31875 | 0.31667 | 0.41458 | - |
| F1-score | 0.464 | 0.30217 | 0.31348 | 0.39999 | - |
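
As an alternative to storing each model's scores in global variables, the summary can be assembled from a plain dictionary of predictions. This is only a sketch: `metric_row`, the model names and the toy labels are hypothetical and do not reproduce the numbers above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def metric_row(y_true, y_pred):
    """Return accuracy and weighted precision, recall, F1 as a dict (hypothetical helper)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {"Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision, "Recall": recall, "F1-score": f1}


# Toy labels and predictions standing in for y_test and each model's output (assumption).
y_true = ["A1", "A2", "B1", "B1", "C2", "C2"]
predictions = {
    "Model A": ["A1", "A2", "B1", "A2", "C2", "C2"],
    "Model B": ["A1", "A1", "B1", "B1", "C2", "A1"],
}

# One column per model, one row per metric.
summary = pd.DataFrame({name: metric_row(y_true, pred)
                        for name, pred in predictions.items()}).round(5)
print(summary)
```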
