# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts

### Step by step guidelines

The following step-by-step guidelines will help you get started with your project for the Data Mining and Machine Learning class.
To test what you learned in the class, we will hold a competition. You will create a classifier that predicts the difficulty level of a French text (A1, ..., C2). The team with the highest rank will get some goodies in the last class (souvenirs from tech companies: Amazon, LinkedIn, etc.).

**2 people per team**

Choose a team here:
https://moodle.unil.ch/mod/choicegroup/view.php?id=1305831


#### 1. 📂 Create a public GitHub repository for your team using this naming convention `DMML2022_[your_team_name]` with the following structure:
- data (folder) 
- code (folder) 
- documentation (folder)
- a readme file (.md): *mention team name, participants, brief description of the project, approach, summary of results table and link to the explanatory video (see below).*

All team members should contribute to the GitHub repository.

#### 2. 🇰 Join the competition on Kaggle using the invitation link we sent on Slack.

Under the Team tab, save your team name (`UNIL_your_team_name`) and make sure your team members join in as well. You can merge your user account with your teammates in order to create a team.

#### 3. 📓 Read the data into your Colab notebook. There should be one code notebook per team, but all team members can participate and contribute code.

You can either use the Kaggle API directly with your Kaggle credentials (as explained below; this is **entirely optional**), or download the data from Kaggle and upload it to your team's GitHub repository under the data subfolder.

#### 4. 💎 Train your models and upload the code under your team's GitHub repo. Set the `random_state=0`.
- baseline
- logistic regression with Tfidf vectoriser (simple, no data cleaning)
- KNN & hyperparameter optimisation (simple, no data cleaning)
- Decision Tree classifier & hyperparameter optimisation (simple, no data cleaning)
- Random Forests classifier (simple, no data cleaning)
- another technique or combination of techniques of your choice

BE CREATIVE! You can use whatever method you want, in order to climb the leaderboard. The only rule is that it must be your own work. Given that, you can use all the online resources you want. 

#### 5. 🎥 Create a YouTube video (10-15 minutes) of your solution and embed it in your notebook. Explain the algorithms used and the evaluation of your solutions. *Select* projects will also be presented live by the group during the last class.


### Submission details (one per team)

1. Download a ZIP file of your team's repository and submit it on Moodle here. IMPORTANT: in the submission comment, insert a link to the repository on GitHub.
https://moodle.unil.ch/mod/assign/view.php?id=1305833



### Grading (one per team)
- 20% Kaggle Rank
- 50% code quality (using classes, splitting into proper files, documentation, etc)
- 15% GitHub quality (include link to video, table with progress over time, organization of code, images, etc)
- 15% video quality (good sound, good slides, interesting presentation).

## Some further details for points 3 and 4 above.

### 3. Read data into your notebook with the Kaggle API (optional but useful). 

You can also download the data from Kaggle and put it in the data folder of your team's repo.

In [132]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [133]:
# install Kaggle
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### IMPORTANT
Log into your Kaggle account, go to Account > API > Create new API token. You will obtain a kaggle.json file. Save it at the top level of your Google Drive (My Drive), not inside a subfolder, so that the path used below matches.

In [134]:
# -p: do not fail if the directory already exists, so the cell can be re-run
!mkdir -p ~/.kaggle


In [135]:
#read in your Kaggle credentials from Google Drive
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
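
The copy step above can also be done from Python. The sketch below uses a hypothetical helper, `install_kaggle_token` (not part of the Kaggle package), that copies a token file into a target folder and restricts it to owner-only access, since the Kaggle command-line tool warns when the token is readable by other users. The demonstration runs on a throwaway file in a temporary folder; the real paths in Colab are an assumption based on the cells above.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(source: Path, kaggle_dir: Path) -> Path:
    """Copy a kaggle.json token into kaggle_dir and make it owner-only (hypothetical helper)."""
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source, target)
    os.chmod(target, 0o600)  # read/write for the owner only
    return target


# Demonstration with a fake token in a temporary folder.
# In Colab you would pass /content/drive/MyDrive/kaggle.json and ~/.kaggle instead.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "demo", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed.name, oct(mode))
```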


In [136]:
# -p: do not fail if the directory already exists
!mkdir -p data




In [138]:
# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2022

detecting-french-texts-difficulty-level-2022.zip: Skipping, found more recently modified local copy (use --force to force download)


In [139]:
# -o overwrites existing files without prompting, so the cell can be re-run
!unzip -o "detecting-french-texts-difficulty-level-2022.zip" -d data

Archive:  detecting-french-texts-difficulty-level-2022.zip

In [140]:
# read in your training data
import pandas as pd
import numpy as np

df = pd.read_csv('/content/data/training_data.csv')

In [141]:
df.head()

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


In [142]:
# value counts of language level
df['difficulty'].value_counts()

A1    813
C2    807
C1    798
B1    795
A2    795
B2    792
Name: difficulty, dtype: int64

Have a look at the data on which to make predictions.

In [143]:
df_pred = pd.read_csv('/content/data/unlabelled_test_data.csv')
df_pred.head()

| | id | sentence |
|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


And this is the format for your submissions.

In [144]:
df_example_submission = pd.read_csv('/content/data/sample_submission.csv')
df_example_submission.head()

| | id | difficulty |
|---|---|---|
| 0 | 0 | A1 |
| 1 | 1 | A1 |
| 2 | 2 | A1 |
| 3 | 3 | A1 |
| 4 | 4 | A1 |


### 4. Train your models

- Set your X and y variables.
- Set `random_state=0`.
- Split the data into a train and test set using the following parameters: `train_test_split(X, y, test_size=0.2, random_state=0)`.

#### 4.1. Baseline
What is the baseline for this classification problem?

In [145]:
# df_pred.insert(2, 'predicted difficulty', value = 'A1')
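
One common reading of "baseline" is the accuracy of always predicting the most frequent class. The sketch below computes it from the class counts printed by `value_counts()` earlier in this notebook; it is only one possible baseline, and scikit-learn's `DummyClassifier(strategy="most_frequent")` would be an alternative way to obtain the same kind of reference score on a train/test split.

```python
# Class counts copied from the value_counts() output of the training data.
class_counts = {"A1": 813, "C2": 807, "C1": 798, "B1": 795, "A2": 795, "B2": 792}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy obtained by predicting the majority class for every sentence.
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy on the training data: {baseline_accuracy:.4f}")
```

Because the six classes are almost perfectly balanced, this baseline is close to 1/6, so any model should be judged against roughly 17% accuracy.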




In [147]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
from sklearn import preprocessing

In [148]:
# encode the difficulty labels (A1, ..., C2) as integers 0..5, working on a copy of the dataframe
df1 = df.copy()

le = preprocessing.LabelEncoder()
le.fit(df1.difficulty)
df1['difficulty_encoded'] = le.transform(df1.difficulty)



In [149]:
df1['difficulty'] = df1['difficulty_encoded']
print(df1)

        id                                           sentence  difficulty  \
0        0  Les coûts kilométriques réels peuvent diverger...           4   
1        1  Le bleu, c'est ma couleur préférée mais je n'a...           0   
2        2  Le test de niveau en français est sur le site ...           0   
3        3           Est-ce que ton mari est aussi de Boston?           0   
4        4  Dans les écoles de commerce, dans les couloirs...           2   
...    ...                                                ...         ...   
4795  4795  C'est pourquoi, il décida de remplacer les hab...           3   
4796  4796  Il avait une de ces pâleurs splendides qui don...           4   
4797  4797  Et le premier samedi de chaque mois, venez ren...           1   
4798  4798  Les coûts liés à la journalisation n'étant pas...           5   
4799  4799  Sur le sable, la mer haletait de toute la resp...           5   

      difficulty_encoded  
0                      4  
1                    

In [150]:
np.random.seed(0)  # call the function; assigning to it would overwrite np.random.seed
# select features
X = df1['sentence']
y = df1['difficulty']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


In [151]:
# your code here (you can use as many lines of code as you like)
X_train

70                                Comment t'appelles-tu ?
4347    Voilà qui serait en effet de nature à simplifi...
1122    Les pèlerins partagèrent alors cette célébrati...
4570                          Qu'est-ce que vous faites ?
34      En voici un des moins obscurs : "Plus nous dev...
                              ...                        
1033    Les micro-changements apportés par ce type d'u...
3264    J'allais à la poste quand j'ai croisé ma cousi...
1653    Au cours des années 1970 et 1980, plusieurs gr...
2607    Stop : tout d'abord, figurez-vous que les vrai...
2732    "On s'est alors dit que le terrain commençait ...
Name: sentence, Length: 3840, dtype: object

In [152]:
y_train

70      0
4347    3
1122    4
4570    0
34      5
       ..
1033    3
3264    1
1653    4
2607    3
2732    2
Name: difficulty, Length: 3840, dtype: int64

#### 4.2. Logistic Regression (without data cleaning)

Train a simple logistic regression model using a Tfidf vectoriser.

In [153]:
# Define classifier
classifier = LogisticRegression()
tfidfVectorizer = TfidfVectorizer()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidfVectorizer),
                 ('classifier', classifier)])
                 


In [154]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', LogisticRegression())])

Calculate accuracy, precision, recall and F1 score on the test set.

In [155]:
# your code here
# Evaluate the model
from sklearn.metrics import classification_report
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred, digits=5))

              precision    recall  f1-score   support

           0    0.52841   0.57764   0.55193       161
           1    0.40816   0.36585   0.38585       164
           2    0.43243   0.40000   0.41558       160
           3    0.42581   0.45833   0.44147       144
           4    0.51049   0.42197   0.46203       173
           5    0.48168   0.58228   0.52722       158

    accuracy                        0.46667       960
   macro avg    0.46450   0.46768   0.46401       960
weighted avg    0.46556   0.46667   0.46400       960
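
The individual metrics in the report can also be computed separately. The following is a minimal sketch on invented toy labels (not the competition data), using macro averaging, which gives each class the same weight and corresponds to the "macro avg" row above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy encoded labels, for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 2, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_hat)

# average="macro": compute each metric per class, then take the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```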



In [156]:
df_pred['difficulty'] = pipe.predict(df_pred['sentence'])


In [157]:
df_pred

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... | 5 |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... | 1 |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... | 0 |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... | 0 |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... | 5 |
| ... | ... | ... | ... |
| 1195 | 1195 | C'est un phénomène qui trouve une accélération... | 5 |
| 1196 | 1196 | Je vais parler au serveur et voir si on peut d... | 1 |
| 1197 | 1197 | Il n'était pas comme tant de gens qui par pare... | 5 |
| 1198 | 1198 | Ils deviennent dangereux pour notre économie. | 4 |


In [158]:
df_pred_final = df_pred.copy()  # copy, so that df_pred keeps its 'sentence' column
df_pred_final.pop('sentence')

0       Nous dûmes nous excuser des propos que nous eû...
1       Vous ne pouvez pas savoir le plaisir que j'ai ...
2       Et, paradoxalement, boire froid n'est pas la b...
3       Ce n'est pas étonnant, car c'est une saison my...
4       Le corps de Golo lui-même, d'une essence aussi...
                              ...                        
1195    C'est un phénomène qui trouve une accélération...
1196    Je vais parler au serveur et voir si on peut d...
1197    Il n'était pas comme tant de gens qui par pare...
1198        Ils deviennent dangereux pour notre économie.
1199    Son succès a généré beaucoup de réactions néga...
Name: sentence, Length: 1200, dtype: object

In [159]:
df_pred_final

| | id | difficulty |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 1 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 5 |
| ... | ... | ... |
| 1195 | 1195 | 5 |
| 1196 | 1196 | 1 |
| 1197 | 1197 | 5 |
| 1198 | 1198 | 4 |


In [160]:
# map the integer predictions back to the level names (A1, ..., C2) with the fitted encoder
df_pred_final['difficulty'] = le.inverse_transform(df_pred_final['difficulty'])


In [161]:
df_pred_final

| | id | difficulty |
|---|---|---|
| 0 | 0 | C2 |
| 1 | 1 | A2 |
| 2 | 2 | A1 |
| 3 | 3 | A1 |
| 4 | 4 | C2 |
| ... | ... | ... |
| 1195 | 1195 | C2 |
| 1196 | 1196 | A2 |
| 1197 | 1197 | C2 |
| 1198 | 1198 | C1 |


In [162]:
df_pred_final.to_csv('UNIL_SBB_submission.csv', index=False)

Have a look at the confusion matrix and identify a few examples of sentences that are not well classified.

In [None]:
# your code here
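
As a starting point, here is a minimal sketch of how a confusion matrix and the misclassified sentences could be inspected. The sentences, labels and predictions are invented for illustration; in the notebook you would use `X_test`, `y_test` and the predictions of your own pipeline instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

levels = ["A1", "B1", "C1"]

# Invented examples: true level and a hypothetical model prediction.
examples = pd.DataFrame({
    "sentence": [
        "Je m'appelle Marie.",
        "Hier, nous avons visité un musée intéressant.",
        "Quoique la conjoncture demeurât incertaine, ils persévérèrent.",
        "Il eût fallu que nous comprissions les enjeux.",
        "Je pense que ce film est trop long.",
        "Tu as un chat ?",
    ],
    "true": ["A1", "B1", "C1", "C1", "B1", "A1"],
    "predicted": ["A1", "C1", "C1", "B1", "B1", "A1"],
})

# Rows are true levels, columns are predicted levels.
cm = confusion_matrix(examples["true"], examples["predicted"], labels=levels)
cm_table = pd.DataFrame(
    cm,
    index=[f"true {lvl}" for lvl in levels],
    columns=[f"pred {lvl}" for lvl in levels],
)
print(cm_table)

# Keep only the rows where the prediction differs from the true level.
misclassified = examples[examples["true"] != examples["predicted"]]
print(misclassified)
```

Off-diagonal cells show which levels get confused with each other; reading the corresponding sentences often suggests new features to try.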

Generate your first predictions on `unlabelled_test_data.csv`. Make sure your predictions match the format of `sample_submission.csv`.
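
The sample submission shown earlier has an `id` column and a `difficulty` column containing level names, so integer predictions need to be decoded before saving. A minimal sketch, assuming the labels were encoded with a `LabelEncoder` as in this notebook; the predictions and the output file name are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
encoder = LabelEncoder().fit(levels)  # classes are sorted, so A1 -> 0, ..., C2 -> 5

# Pretend these integers came from pipe.predict(...) on three test sentences.
encoded_predictions = [5, 1, 0]

submission = pd.DataFrame({
    "id": range(len(encoded_predictions)),
    "difficulty": encoder.inverse_transform(encoded_predictions),
})
print(submission)

# Placeholder file name; index=False keeps only the id and difficulty columns.
# submission.to_csv("submission.csv", index=False)
```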

#### 4.3. KNN (without data cleaning)

Train a KNN classification model using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here

Try to improve it by tuning the hyperparameters (`n_neighbors`, `p`, `weights`).

In [None]:
# your code here
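
One possible way to tune these three hyperparameters is a grid search over a pipeline. The sketch below runs on a tiny invented corpus with two classes so that it executes quickly; the grid values and `cv=2` are assumptions chosen for the toy data, not recommended settings for the competition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny invented corpus: 0 = easier sentence, 1 = harder sentence.
sentences = [
    "Je suis content.",
    "Tu as un chat.",
    "Il mange une pomme.",
    "Nous aimons le café.",
    "Quoique la conjoncture demeurât incertaine, ils persévérèrent.",
    "Il eût fallu que nous comprissions les enjeux sous-jacents.",
    "Nonobstant ces réticences, l'assemblée entérina la proposition.",
    "La démarche épistémologique requiert une rigueur méthodologique.",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

knn_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "classifier__n_neighbors": [1, 3],
    "classifier__p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
    "classifier__weights": ["uniform", "distance"],
}

search = GridSearchCV(knn_pipe, param_grid, cv=2, scoring="accuracy")
search.fit(sentences, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data you would fit the search on `X_train` and `y_train` only, then evaluate `search.best_estimator_` once on the held-out test set.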

#### 4.4. Decision Tree Classifier (without data cleaning)

Train a Decision Tree classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here

Try to improve it by tuning the hyperparameters (`max_depth`, the depth of the decision tree).

In [None]:
# your code here
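
A simple way to explore `max_depth` is to loop over candidate depths and compare cross-validated accuracy. This is a sketch on an invented two-class corpus; the candidate depths and `cv=2` are assumptions that only make sense for such a small toy set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny invented corpus: 0 = easier sentence, 1 = harder sentence.
sentences = [
    "Je suis content.",
    "Tu as un chat.",
    "Il mange une pomme.",
    "Nous aimons le café.",
    "Quoique la conjoncture demeurât incertaine, ils persévérèrent.",
    "Il eût fallu que nous comprissions les enjeux sous-jacents.",
    "Nonobstant ces réticences, l'assemblée entérina la proposition.",
    "La démarche épistémologique requiert une rigueur méthodologique.",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

depth_scores = {}
for depth in [1, 2, 5, None]:  # None lets the tree grow until its leaves are pure
    tree_pipe = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", DecisionTreeClassifier(max_depth=depth, random_state=0)),
    ])
    # Mean accuracy over the cross-validation folds for this depth.
    scores = cross_val_score(tree_pipe, sentences, labels, cv=2, scoring="accuracy")
    depth_scores[depth] = scores.mean()

best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print("best max_depth:", best_depth)
```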

#### 4.5. Random Forest Classifier (without data cleaning)

Try a Random Forest Classifier, using a Tfidf vectoriser. Show the accuracy, precision, recall and F1 score on the test set.

In [None]:
# your code here
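
The pipeline pattern used for logistic regression carries over directly to a random forest. Below is a minimal sketch on an invented two-class corpus; `n_estimators=50` and `test_size=0.25` are assumptions picked for the toy data, whereas the project itself asks for `test_size=0.2`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny invented corpus: 0 = easier sentence, 1 = harder sentence.
sentences = [
    "Je suis content.",
    "Tu as un chat.",
    "Il mange une pomme.",
    "Nous aimons le café.",
    "Quoique la conjoncture demeurât incertaine, ils persévérèrent.",
    "Il eût fallu que nous comprissions les enjeux sous-jacents.",
    "Nonobstant ces réticences, l'assemblée entérina la proposition.",
    "La démarche épistémologique requiert une rigueur méthodologique.",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify keeps both classes represented in the small test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    sentences, labels, test_size=0.25, random_state=0, stratify=labels
)

rf_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
rf_pipe.fit(X_tr, y_tr)
y_hat = rf_pipe.predict(X_te)

print(classification_report(y_te, y_hat, zero_division=0))
```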

#### 4.6. Any other technique, including data cleaning if necessary

Try to improve accuracy by training a better model using the techniques seen in class, or combinations of them.

As usual, show the accuracy, precision, recall and f1 score on the test set.

In [None]:
# your code here
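
If you want to experiment with data cleaning, one option is a small normalisation function. The sketch below uses a hypothetical helper, `clean_sentence`, that lowercases the text and replaces punctuation with spaces. Whether this helps is an empirical question: punctuation and capitalisation may themselves carry information about difficulty, so compare scores with and without it.

```python
import string
import unicodedata


def clean_sentence(sentence: str) -> str:
    """Lowercase, normalise Unicode, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence).lower()
    # French typographic punctuation is not in string.punctuation, so add it explicitly.
    punctuation = string.punctuation + "«»…’"
    text = text.translate(str.maketrans({ch: " " for ch in punctuation}))
    return " ".join(text.split())


cleaned = clean_sentence("Est-ce que ton mari est aussi de Boston?")
print(cleaned)
```

Such a function can be passed to the vectoriser with `TfidfVectorizer(preprocessor=clean_sentence)`, so that the cleaning stays inside the pipeline.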

#### 4.7. Show a summary of your results
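
One way to build the summary is to collect the test-set predictions of every model and compute the same metrics for each. The sketch below uses a hypothetical helper, `summarise`, with invented labels and placeholder model names; replace them with `y_test` and the predictions of your own models.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarise(y_true, predictions_by_model):
    """Return a table with one column per model and one row per metric (macro-averaged)."""
    columns = {}
    for name, y_hat in predictions_by_model.items():
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_hat, average="macro", zero_division=0
        )
        columns[name] = {
            "accuracy": accuracy_score(y_true, y_hat),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return pd.DataFrame(columns).round(4)


# Invented labels and predictions, for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
toy_predictions = {
    "model_a": [0, 2, 2, 1, 1, 0],
    "model_b": [0, 1, 2, 2, 1, 1],
}

summary = summarise(y_true, toy_predictions)
print(summary)
```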