### Quick Intro

In this notebook, we implement a model created by Madhuri Mamandal (https://github.com/madhurimamandal/Text-classification-into-difficulty-levels). We adapt the code to our setting and look for the best features for classifying a series of French sentences by difficulty level.



First, let's load the dataset we are going to work on, along with the unlabelled data we need for the submission part. :)

In [None]:
# reading in the data via the Kaggle API

# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# install Kaggle
! pip install kaggle

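# read in your Kaggle credentials from Google Drive
# (~/.kaggle must exist first; without it, cp fails as in the recorded output below)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/Coding_Challenge/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json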

# download the dataset from the competition page
! kaggle competitions download -c detecting-french-texts-difficulty-level-2022

Mounted at /content/drive
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
cp: cannot create regular file '/root/.kaggle/kaggle.json': No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.8/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.8/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('drive/MyDrive/Coding_Challenge/training_data.csv')
df_pred = pd.read_csv('drive/MyDrive/Coding_Challenge/unlabelled_test_data.csv')

In [None]:
from collections import Counter

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

### First step: Feature extraction

In [None]:
df.sentence.iloc[0]

"Les coûts kilométriques réels peuvent diverger sensiblement des valeurs moyennes en fonction du moyen de transport utilisé, du taux d'occupation ou du taux de remplissage, de l'infrastructure utilisée, de la topographie des lignes, du flux de trafic, etc."

In [None]:
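#Preprocessing
import re
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

# build the stopword set once instead of once per word
FRENCH_STOPWORDS = set(stopwords.words('french'))

def preprocessing(text1):
    # lowercase, then replace everything except letters (accented ones included) with a space
    text1 = re.sub('[^a-zà-öø-ÿœ]', ' ', text1.lower())
    return [word for word in text1.split() if word not in FRENCH_STOPWORDS]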

#Feature extraction

def avg_word_length(sentence):
    # Mean number of alphabetic characters per whitespace-separated token
    words = sentence.split()
    if not words:
        return 0.0  # avoid dividing by zero on an empty sentence
    total_length = sum(sum(char.isalpha() for char in word) for word in words)
    return total_length / len(words)

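def dif_words(text):
    # Counter over a string tallies characters, so for a raw sentence this
    # returns the number of distinct characters (pass a list of tokens to count words)
    frequency = Counter(text)
    return len(frequency)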

dif_words(df.sentence.iloc[0])

30
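
Note that `dif_words` receives the raw sentence string, so `Counter` tallies distinct characters rather than words, which is how the first sentence yields 30. If a word-level count is wanted, a minimal sketch could look like the following. `count_unique_words` is a hypothetical helper, not part of the original code, and its tokenization rule (lowercase, runs of letters including accented ones) is an assumption.

```python
import re

def count_unique_words(sentence):
    # Hypothetical helper: lowercase the sentence, then extract runs of
    # letters (\w minus digits and underscore), so accented letters are kept.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return len(set(tokens))

example = "Le chat et le chien"
print(count_unique_words(example))  # number of distinct words
print(len(set(example)))            # number of distinct characters, as dif_words counts
```

Whether the character-level or the word-level count is the more useful feature is an empirical question; both can be added as separate columns and compared.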

In [None]:
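def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels. Only unaccented vowels
    # are recognised and a trailing "e" is treated as silent, so counts for
    # French words are approximate.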
    word = word.lower()
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count

def avg_syllables(sentence):
    # Mean heuristic syllable count per whitespace-separated token
    words = sentence.split()
    total_syllables = sum(count_syllables(word) for word in words)
    return total_syllables / len(words)
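
The heuristic above only recognises unaccented `aeiouy` as vowels, so accented French vowels such as `é` or `û` do not start a syllable. As a rough sketch of an accent-aware alternative, one could count maximal runs of vowels. `count_vowel_groups_fr` and its vowel list are assumptions for illustration, and the result is still an approximation, not a true French syllabifier.

```python
import re

# Assumed vowel inventory: plain vowels plus common French accented forms.
FRENCH_VOWELS = "aeiouyàâäæéèêëîïôöœùûüÿ"
VOWEL_RUN = re.compile(f"[{FRENCH_VOWELS}]+")

def count_vowel_groups_fr(word):
    # Each maximal run of vowels is treated as one syllable nucleus;
    # a word without any vowel still counts as one syllable.
    return max(1, len(VOWEL_RUN.findall(word.lower())))

print(count_vowel_groups_fr("kilométriques"))
```

Swapping this in would change the values of the `average syllables` column, so any results computed with the original heuristic would need to be regenerated.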



### Second step: Dataframe creation with the features

In [None]:
av_w_l=df.sentence.apply(avg_word_length)
av_w_l
df['average word length']=av_w_l
df

Unnamed: 0,id,sentence,difficulty,average word length
0,0,Les coûts kilométriques réels peuvent diverger...,C1,5.526316
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1,3.916667
2,2,Le test de niveau en français est sur le site ...,A1,4.000000
3,3,Est-ce que ton mari est aussi de Boston?,A1,3.875000
4,4,"Dans les écoles de commerce, dans les couloirs...",B1,4.794118
...,...,...,...,...
4795,4795,"C'est pourquoi, il décida de remplacer les hab...",B2,5.230769
4796,4796,Il avait une de ces pâleurs splendides qui don...,C1,4.619048
4797,4797,"Et le premier samedi de chaque mois, venez ren...",A2,4.642857
4798,4798,Les coûts liés à la journalisation n'étant pas...,C2,5.937500


In [None]:
dif_w=df.sentence.apply(dif_words)
df['number of unique words']=dif_w
df

Unnamed: 0,id,sentence,difficulty,average word length,number of unique words
0,0,Les coûts kilométriques réels peuvent diverger...,C1,5.526316,30
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1,3.916667,23
2,2,Le test de niveau en français est sur le site ...,A1,4.000000,21
3,3,Est-ce que ton mari est aussi de Boston?,A1,3.875000,18
4,4,"Dans les écoles de commerce, dans les couloirs...",B1,4.794118,32
...,...,...,...,...,...
4795,4795,"C'est pourquoi, il décida de remplacer les hab...",B2,5.230769,27
4796,4796,Il avait une de ces pâleurs splendides qui don...,C1,4.619048,26
4797,4797,"Et le premier samedi de chaque mois, venez ren...",A2,4.642857,22
4798,4798,Les coûts liés à la journalisation n'étant pas...,C2,5.937500,31


In [None]:
av_sy=df.sentence.apply(avg_syllables)
df['average syllables']=av_sy
df

Unnamed: 0,id,sentence,difficulty,average word length,number of unique words,average syllables
0,0,Les coûts kilométriques réels peuvent diverger...,C1,5.526316,30,1.815789
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1,3.916667,23,1.083333
2,2,Le test de niveau en français est sur le site ...,A1,4.000000,21,1.384615
3,3,Est-ce que ton mari est aussi de Boston?,A1,3.875000,18,1.375000
4,4,"Dans les écoles de commerce, dans les couloirs...",B1,4.794118,32,1.558824
...,...,...,...,...,...,...
4795,4795,"C'est pourquoi, il décida de remplacer les hab...",B2,5.230769,27,1.769231
4796,4796,Il avait une de ces pâleurs splendides qui don...,C1,4.619048,26,1.476190
4797,4797,"Et le premier samedi de chaque mois, venez ren...",A2,4.642857,22,1.571429
4798,4798,Les coûts liés à la journalisation n'étant pas...,C2,5.937500,31,1.750000


### Third step: Train our different models


1) We import the libraries we need and create our independent and dependent variables

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC  
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Work on the dataframe built above
Dataframe_Final = df

# Splitting into features and classes
X = Dataframe_Final.drop(columns=['id', 'difficulty'])
y = Dataframe_Final['difficulty']

# Splitting into training and test sets
# (shuffle=False keeps the original row order, so random_state has no effect)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.21, random_state=0, shuffle=False)
 

X_train

Unnamed: 0,sentence,average word length,number of unique words,average syllables
0,Les coûts kilométriques réels peuvent diverger...,5.526316,30,1.815789
1,"Le bleu, c'est ma couleur préférée mais je n'a...",3.916667,23,1.083333
2,Le test de niveau en français est sur le site ...,4.000000,21,1.384615
3,Est-ce que ton mari est aussi de Boston?,3.875000,18,1.375000
4,"Dans les écoles de commerce, dans les couloirs...",4.794118,32,1.558824
...,...,...,...,...
3787,"Le dimanche, nous aimons nous promener en fami...",4.473684,22,1.368421
3788,Si quelque petit champ de cinquante pas de lar...,5.216216,29,1.540541
3789,"Une heure plus tard, il donne à la tribu une c...",3.875000,25,1.187500
3790,"Si vous pouviez rencontrez un artiste, qui est...",5.416667,20,1.750000


In [None]:
y_train

0       C1
1       A1
2       A1
3       A1
4       B1
        ..
3787    A1
3788    C1
3789    A2
3790    B1
3791    C1
Name: difficulty, Length: 3792, dtype: object
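
With `shuffle=False`, the split simply takes the first 79% of the rows for training, and `random_state` plays no role. If the rows happen to be ordered in some way, the class mix can differ between the two parts. A hedged sketch of a shuffled, stratified split on toy data follows; the toy frame and the 0.25 test fraction are illustrative assumptions, not values from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real training frame (illustrative only).
toy = pd.DataFrame({
    "sentence": [f"phrase {i}" for i in range(12)],
    "difficulty": ["A1"] * 4 + ["B1"] * 4 + ["C1"] * 4,
})

X_toy = toy[["sentence"]]
y_toy = toy["difficulty"]

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(y_te.value_counts().to_dict())
```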

2) We define the function that computes the main metrics we need to compare our models

In [None]:
# function to calculate metrics of the models
def models_metrics(true, pred):
    precision = precision_score(true, pred, average='weighted')
    recall = recall_score(true, pred, average='weighted')
    f1 = f1_score(true, pred, average='weighted')
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")
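
To make the `average='weighted'` setting concrete: each class's score is weighted by its number of true instances. The sketch below checks this for the F1 score on made-up labels, which are purely illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels, for illustration only.
y_true = ["A1", "A1", "A1", "B1", "B1", "C1"]
y_hat = ["A1", "A1", "B1", "B1", "C1", "C1"]

labels = ["A1", "B1", "C1"]
per_class = f1_score(y_true, y_hat, average=None, labels=labels)
support = np.array([y_true.count(label) for label in labels])

# Support-weighted mean of the per-class F1 scores.
manual = float(np.sum(per_class * support) / support.sum())
weighted = f1_score(y_true, y_hat, average="weighted")

print(manual, weighted)
```

Because frequent classes dominate a weighted average, it can hide poor performance on rare difficulty levels; `average='macro'` treats all classes equally and is worth reporting alongside it.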

3) We then vectorize our sentences

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,1))

In [None]:
features = tfidf.fit_transform(X_train['sentence'])
X_train_vectorized = pd.DataFrame(
    features.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns=tfidf.get_feature_names_out()
)
X_train_vectorized



Unnamed: 0,000,02h00,03h00,10,100,1000,10000,105,11,110,...,évènements,événement,événements,êtes,être,êtres,êut,île,ôta,ôter
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3787,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3790,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we have our new dataframe with the features, we can train our model to see how accurate it is.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

In [None]:
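# Define the model; the vectorizer is applied inside the pipeline
LR = LogisticRegression()


# Create a pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', LR)])

# Fit model on training set
# (pass only the text column: given the whole DataFrame, the vectorizer iterates
# over the column names, which likely caused the ValueError recorded below)
pipe.fit(X_train['sentence'], y_train)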

ValueError: ignored

In [None]:
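# predict on test split (text column only; the 4-element output recorded below
# likely came from passing the whole DataFrame, one prediction per column name)
y_pred = pipe.predict(X_test['sentence'])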
y_pred

array(['A1', 'A1', 'C1', 'A1'], dtype=object)

In [None]:
models_metrics(y_test,y_pred)

ValueError: ignored
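
`TfidfVectorizer` expects an iterable of strings, so handing it a whole DataFrame makes it iterate over the column names. That would explain the four predictions and the `ValueError` outputs recorded above. One possible way to feed both the text and the hand-crafted numeric features to the classifier is a `ColumnTransformer`. The sketch below uses a tiny made-up frame, and the choice of `StandardScaler` for the numeric column is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame with the same kind of columns as X_train.
toy = pd.DataFrame({
    "sentence": [
        "Le chat dort.",
        "Il fait beau.",
        "Les coûts kilométriques divergent sensiblement.",
        "La topographie des lignes influence le trafic.",
    ],
    "average word length": [3.3, 3.3, 8.5, 5.4],
})
toy_y = ["A1", "A1", "C1", "C1"]

preprocess = ColumnTransformer([
    # A single column name (a string, not a list) passes a 1-D series of texts.
    ("text", TfidfVectorizer(), "sentence"),
    # A list of column names passes a 2-D block of numeric features.
    ("numeric", StandardScaler(), ["average word length"]),
])

toy_pipe = Pipeline([
    ("features", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_pipe.fit(toy, toy_y)
toy_pred = toy_pipe.predict(toy)
print(len(toy_pred))
```

With this layout, `fit` and `predict` both take the full feature frame, and the prediction array has one entry per row, which is what `models_metrics` needs.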