<a href="https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/master/Project/Project_guidelines_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Project

## Detecting Difficulty Level of French Texts

### Step by step guidelines

The following step-by-step guidelines will help you get started with your project for the Data Mining and Machine Learning class.
To test what you learned in the class, we will hold a competition. You will create a classifier that predicts the level of a text in French (A1, ..., C2). The team with the highest rank will get some goodies in the last class (souvenirs from tech companies: Amazon, LinkedIn, etc.).

**2 people per team**

Choose a team here:
https://moodle.unil.ch/mod/choicegroup/view.php?id=1542704

You have decided to form a startup called “LingoRank” with your university friend and become a millionaire. You have until the end of the semester to create a proof of concept for your investors. Your startup will revolutionize the way people learn and get better at a foreign language.

### THE IDEA
You have noticed that to improve one’s skills in a new foreign language, it is important to read texts in that language. These texts have to be at the reader’s language level. However, it is difficult to find texts that match someone’s level (`A1` to `C2`). You have decided to build a model for English speakers that predicts the difficulty of a written French text. This can then be used, e.g., in a recommendation system, to recommend texts (e.g., recent news articles) that are appropriate for someone’s language level. If someone is at A1 level in French, it is inappropriate to present a B2-level text, as they won’t be able to understand it. Ideally, a text should consist mostly of known words, with a few unknown ones so that the reader can improve.

### 🗄 DATA
You can find the training data and the unlabeled test data in the Data tab.

### 🚀 SUBMISSION
As you build your model and train it on the training data, you can generate predictions for the (unlabelled) test data. Make sure that your submission file has the same format as the `sample_submission.csv` file in the Data tab. Once you are confident in your model and satisfied with the prediction accuracy on your own held-out test data, you can generate predictions for the actual test data and submit them to the competition.
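
The "own test data" mentioned above can be obtained by holding out part of the labelled training set. The following is a minimal sketch, assuming scikit-learn is available; the toy DataFrame only stands in for `training_data.csv`, and the 25% split size and the random seed are arbitrary choices, not course requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Toy stand-in for training_data.csv (columns: id, sentence, difficulty).
df = pd.DataFrame({
    "id": range(24),
    "sentence": [f"phrase {i}" for i in range(24)],
    "difficulty": LEVELS * 4,
})

# Hold out 25% of the labelled rows as a private validation set.
# Stratifying keeps the proportion of each level the same in both parts.
train_df, valid_df = train_test_split(
    df, test_size=0.25, stratify=df["difficulty"], random_state=42
)

print(len(train_df), "training rows,", len(valid_df), "validation rows")
```

Evaluating on `valid_df` before submitting helps you avoid spending the limited daily submissions on models that do not generalize.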

As soon as you make a submission you can see the prediction accuracy and your ranking on the leaderboard. Note that you can only make 5 submissions per day. To know more about the competition rules, check out the rules tab.

### 🚚  DELIVERABLES

-    **GitHub**: A project GitHub page. There, report the following table *without doing any cleaning on the data*. Do hyper-parameter optimization to find the best solution. Your code should justify your results.

|  | Logistic regression | kNN | Decision Tree | Random Forests | Any other technique |
| --- | --- | --- | --- | --- | --- |
| Precision |  |  |  |  |  |
| Recall |  |  |  |  |  |
| F1-score |  |  |  |  |  |
| Accuracy |  |  |  |  |  |



-    Which is the best model?
-    Show the confusion matrix.
-    Show examples of some erroneous predictions. Can you understand where the error is coming from?
-    Do some more analysis to better understand how your model behaves.
-    Have a position in the leaderboard of this competition
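
One way to fill a column of the table and answer the questions above is sketched below. It is only an illustration under stated assumptions: the guidelines do not prescribe a text representation, so the TF-IDF vectorizer, the `C` grid, and the toy sentences (a stand-in for `training_data.csv`) are all choices made for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for training_data.csv: 10 near-duplicate sentences per level.
toy = {
    "A1": "le chat mange",
    "B1": "je pense que nous devrions partir demain",
    "C1": "nonobstant ces considérations il conviendrait de nuancer",
}
rows = [(f"{text} {i}", level) for level, text in toy.items() for i in range(10)]
df = pd.DataFrame(rows, columns=["sentence", "difficulty"])
labels = sorted(df["difficulty"].unique())

X_train, X_valid, y_train, y_valid = train_test_split(
    df["sentence"], df["difficulty"],
    test_size=0.3, stratify=df["difficulty"], random_state=0,
)

# Raw text goes straight into the vectorizer: no cleaning step.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyper-parameter optimization: cross-validated search over the regularization strength.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
y_pred = search.predict(X_valid)

# Precision, recall, F1-score per level, plus overall accuracy.
report = classification_report(y_valid, y_pred, labels=labels, zero_division=0)
print("best parameters:", search.best_params_)
print(report)

# Rows are true levels, columns are predicted levels.
cm = confusion_matrix(y_valid, y_pred, labels=labels)
print(cm)

# Misclassified sentences, for manual error analysis.
errors = pd.DataFrame({"sentence": X_valid, "true": y_valid, "predicted": y_pred})
errors = errors[errors["true"] != errors["predicted"]]
print(errors)
```

Swapping `LogisticRegression` for `KNeighborsClassifier`, `DecisionTreeClassifier`, or `RandomForestClassifier` (with a parameter grid suited to each) would produce the remaining columns in the same way.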

Then try to improve this solution and climb up the leaderboard ladder! Expected score results:

> - 0.46 -> Hm... you could have tried harder.
> - 0.47 - 0.52 -> Not bad, but you could have done better.
> - 0.56 - 0.57 -> You did your work.
> - 0.7 -> You used text embeddings, very good.
> - 0.78+ -> WOW!

- **User interface/application**: Think about how to use the model in an application. Create a UI with Streamlit. You are free to design and build whichever application you like with your text difficulty model.


-    **Video**: Create a YouTube video of your solution and embed it in your notebook. Imagine you are giving a presentation or a tutorial. The video should explain:
    -    The problem, your algorithm, how you determine the difficulty.
    -    An evaluation of your solution (*accuracy, precision, recall, F1-score, etc.*).
    -    A demo of your solution (the UI you implemented).

Upload the video to **YouTube** (set it as *unlisted* if you don't wish it to be publicly visible) and put the link to the video in the README of your team's GitHub repository.

### Tips:

Things you should consider trying:
- data cleaning (maybe not so important in this dataset, but try!)
- data augmentation (sentence features, e.g., length, [cognates](https://www.fluentin3months.com/french-cognates/), POS, etc.)
- use [text embeddings](https://huggingface.co/blog/getting-started-with-embeddings) (BERT, RoBERTa, etc.). That's your best bet for boosting prediction accuracy. First try [static embeddings](https://www.kaggle.com/code/matleonard/word-vectors/notebook). Then take a pretrained embedding model and fine-tune it ([transfer learning](https://spacy.io/usage/embeddings-transformers)) on our labelled data. This should give you (ideally!) the best results.
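
The feature and embedding tips above could be combined roughly as follows. This is a sketch, not a recommended recipe: the three surface features are illustrative, the helper names are hypothetical, and the `sentence-transformers` package and the model name `paraphrase-multilingual-MiniLM-L12-v2` are assumptions that the guidelines do not mention. The `embed` helper is defined but not called here because it downloads model weights on first use.

```python
import numpy as np
import pandas as pd


def sentence_features(sentences: pd.Series) -> pd.DataFrame:
    """Compute simple surface features for each sentence."""
    words = sentences.str.split()
    return pd.DataFrame({
        "n_chars": sentences.str.len(),
        "n_words": words.str.len(),
        "avg_word_len": words.apply(
            lambda ws: float(np.mean([len(w) for w in ws])) if ws else 0.0
        ),
    }, index=sentences.index)


def embed(sentences, model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode sentences with a pretrained model (assumed choice, needs network access)."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode(list(sentences))


def combine(features: pd.DataFrame, embeddings: np.ndarray) -> np.ndarray:
    """Stack hand-crafted features and embeddings column-wise into one matrix."""
    return np.hstack([features.to_numpy(dtype=float), embeddings])


sentences = pd.Series([
    "Le chat dort.",
    "Nonobstant ces considérations, il conviendrait de nuancer.",
])
feats = sentence_features(sentences)
print(feats)
```

The matrix returned by `combine(feats, embed(sentences))` can be passed to any scikit-learn classifier. Fine-tuning the embedding model itself is a separate, heavier step that this sketch does not cover.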

More ideas for those that wish to go the extra mile:
- Make an application that searches and ranks French YouTube videos using the captions.
- Create an interpretable model that highlights the difficulty of the words/phrases.

### 👩‍💻 LOGISTICS AND DEADLINE

First of all, create an account on Kaggle (if you don't have one already). When you enter the competition page, you can merge your user account with your teammate's under the Team tab in order to create a *team*. When choosing your team name, please follow this guideline:

-    Students participating from UNIL use `UNIL_<your team name>`. Students from EPFL use `EPFL_<your team name>`.

Your team name will be shown on the leaderboard and you can compare your score with other teams as you submit your solution. Make sure to mention your team name in your notebook.

### 📱 CONTACT
-    Stergios
-    Ludovic

## Some further details for points 3 and 4 above.

### 3. Read data into your notebook with the Kaggle API (optional but useful).

You can also download the data from Kaggle and put it in the data folder of your team's repo.

In [13]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [14]:
# install Kaggle
! pip install kaggle



Log into your Kaggle account, go to Account > API > Create new API token. You will obtain a `kaggle.json` file, which you save on your Google Drive directly in My Drive.

In [15]:
# create the Kaggle config directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle


In [16]:
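# read in your Kaggle credentials from Google Drive
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
# restrict permissions so the Kaggle CLI does not warn that the key is readable by other users
!chmod 600 ~/.kaggle/kaggle.json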


In [17]:
# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2023

# unzip the archive into the current working directory
from zipfile import ZipFile
with ZipFile('detecting-french-texts-difficulty-level-2023.zip', 'r') as zf:  # avoid shadowing the built-in zip()
    zf.extractall(path=".")

detecting-french-texts-difficulty-level-2023.zip: Skipping, found more recently modified local copy (use --force to force download)


In [18]:
# read in your training data
import pandas as pd
import numpy as np

df = pd.read_csv('training_data.csv')

In [19]:
df.head()

|  | id | sentence | difficulty |
| --- | --- | --- | --- |
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


Have a look at the data on which to make predictions.

In [20]:
df_pred = pd.read_csv('unlabelled_test_data.csv')
df_pred.head()

|  | id | sentence |
| --- | --- | --- |
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


And this is the format for your submissions.

In [21]:
df_example_submission = pd.read_csv('sample_submission.csv')
df_example_submission.head()

|  | id | difficulty |
| --- | --- | --- |
| 0 | 0 | A1 |
| 1 | 1 | A1 |
| 2 | 2 | A1 |
| 3 | 3 | A1 |
| 4 | 4 | A1 |


### How to submit a file with predictions

For this example, we submit a file in which every prediction is A1.

In [30]:
# load the unlabelled test set
to_predict = pd.read_csv('unlabelled_test_data.csv')

# dummy model: predict "A1" for every sentence
to_predict['difficulty'] = "A1"

# keep only id (as the index) and difficulty, matching sample_submission.csv
predictions = to_predict.drop(columns=['sentence']).set_index('id')

# the index is written to the file as the id column
predictions.to_csv('submission.csv')

In [32]:
predictions.head(2)

| id | difficulty |
| --- | --- |
| 0 | A1 |
| 1 | A1 |
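
Since submissions are limited, it may be worth checking the file format locally before uploading. The helper below is a hypothetical sketch: `check_submission` is not part of the Kaggle tooling, and the expected columns (`id`, `difficulty`) and label set are inferred from the `sample_submission.csv` preview and the A1 to C2 scale described above. The toy DataFrames stand in for the real files.

```python
import pandas as pd

LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def check_submission(submission: pd.DataFrame, test: pd.DataFrame) -> None:
    """Raise ValueError if the submission does not match the assumed format."""
    if list(submission.columns) != ["id", "difficulty"]:
        raise ValueError(f"unexpected columns: {list(submission.columns)}")
    if sorted(submission["id"]) != sorted(test["id"]):
        raise ValueError("ids do not match the test set one-to-one")
    unknown = set(submission["difficulty"]) - LEVELS
    if unknown:
        raise ValueError(f"unknown difficulty labels: {sorted(unknown)}")


# Toy stand-ins for unlabelled_test_data.csv and a submission read back from disk.
test = pd.DataFrame({"id": [0, 1, 2], "sentence": ["a", "b", "c"]})
good = pd.DataFrame({"id": [0, 1, 2], "difficulty": ["A1", "B2", "C1"]})

check_submission(good, test)  # passes silently when the format is as assumed
```

With the real files, the call would look like `check_submission(pd.read_csv('submission.csv'), pd.read_csv('unlabelled_test_data.csv'))`.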


In [34]:
! kaggle competitions submit -c detecting-french-texts-difficulty-level-2023 -f submission.csv -m "Sample submission"

100% 8.30k/8.30k [00:00<00:00, 12.9kB/s]
Successfully submitted to Detecting the difficulty level of French texts