# DATAMINING PROJECT
***By Baptiste Danichert & Ahmed Abdel Aziz***

**Last update**: 13.12.22

**Description:** To improve in a foreign language, it is important to read texts in that language, and those texts should match the reader's level. However, it is difficult to find texts that are close to a given knowledge level (A1 to C2). This project builds a model for English speakers that predicts the difficulty of a written French text. Such a model can then be used, for example, in a recommendation system that suggests texts, such as recent news articles, suited to a reader's level. Presenting a B2-level text to a reader at A1 level is inappropriate, because the reader will not be able to understand it. Ideally, a text should contain mostly known words plus a few unknown ones, so that the reader can improve.



# 1. Loading the training data

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/BapDanSI/DataMiningProject/main/data/training_data.csv')
df_pred = pd.read_csv('https://raw.githubusercontent.com/BapDanSI/DataMiningProject/main/data/unlabelled_test_data.csv')

In [4]:
df_pred.head()

| | id | sentence |
|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


In [5]:
df.head()

| | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


# 2. Dataframe analysis


In [6]:
df.shape

(4800, 3)

-> 4800 rows and 2 columns (excluding first column "id")

In [7]:
df.isnull().sum()

id            0
sentence      0
difficulty    0
dtype: int64

-> no NAs

In [8]:
df.duplicated(subset="sentence").value_counts()

False    4800
dtype: int64

-> no duplicated sentences in the data

# 3. Baseline

In [9]:
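# Seed NumPy's global random number generator (call the function; assigning to it would overwrite it)
np.random.seed(0)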

In [10]:
base_rate = max(df.value_counts('difficulty'))/df.shape[0]
print('Base rate:', round(base_rate,4))

Base rate: 0.1694
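
As a cross-check on the base rate, the same majority-class baseline can be obtained from scikit-learn's `DummyClassifier`. The sketch below is a minimal illustration on made-up labels (`X_toy` and `y_toy` are hypothetical, not the project data); with the real data one would pass `X` and `y` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for the real `difficulty` column.
y_toy = np.array(["A1", "A1", "A1", "B1", "B2", "C1"])
# The dummy model ignores the features, so a placeholder matrix is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# Accuracy of the majority-class prediction = share of the most frequent class.
dummy_accuracy = dummy.score(X_toy, y_toy)
print("Majority-class accuracy:", round(dummy_accuracy, 4))
```

Any model that does not clearly beat this accuracy has learned nothing useful from the sentences.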


# 4. Classification Algorithms

The dependent variable (y) is the `difficulty` column and the input (X) is the `sentence` column.
<br>We split the data into an 80% training set and a 20% test set.


In [11]:
y = df['difficulty']
X = df['sentence']
X_pred = df_pred['sentence']

### i. Logistic Regression
We use the following parameters for `LogisticRegressionCV()`:

* 5-fold cross-validation (`cv=5`)
* at most 1000 iterations (`max_iter=1000`)
* random state set to 0 (`random_state=0`)

In [12]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [13]:
tfidf = TfidfVectorizer(ngram_range=(1, 1))

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=0)
pl = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])
pl.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier',
                 LogisticRegressionCV(cv=5, max_iter=1000, random_state=0))])

In [15]:
def evaluate(true, pred):
    """Print the confusion matrix, accuracy, and macro-averaged precision, recall and F1."""
    # 'macro' averages the per-class scores with equal weight for each difficulty level
    precision = precision_score(true, pred, average='macro')
    recall = recall_score(true, pred, average='macro')
    f1 = f1_score(true, pred, average='macro')
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [16]:
y_pred = pl.predict(X_test)
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[91 34 20  9  4  3]
 [50 62 34  3  5 10]
 [12 35 72 14  7 20]
 [ 6  6 18 67 24 23]
 [ 3  5 13 40 65 47]
 [ 7  6  9 16 23 97]]
ACCURACY SCORE:
0.4729
CLASSIFICATION REPORT:
	Precision: 0.4723
	Recall: 0.4747
	F1_Score: 0.4703
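
The accuracy and macro-averaged scores printed above can be recomputed by hand from the confusion matrix, which also shows how many errors fall on a neighbouring level. The sketch below uses plain Python on the matrix copied from the output above. It assumes rows are true labels and columns are predicted labels, ordered A1 to C2 (scikit-learn's convention with sorted labels). The `within_one` measure is an illustrative addition, not a metric used elsewhere in this notebook.

```python
# Confusion matrix copied from the logistic regression output above.
# Assumed layout: rows = true level, columns = predicted level, order A1..C2.
cm = [
    [91, 34, 20,  9,  4,  3],
    [50, 62, 34,  3,  5, 10],
    [12, 35, 72, 14,  7, 20],
    [ 6,  6, 18, 67, 24, 23],
    [ 3,  5, 13, 40, 65, 47],
    [ 7,  6,  9, 16, 23, 97],
]
n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified sentences (the diagonal) over all sentences.
accuracy = sum(cm[i][i] for i in range(n)) / total

# Macro averaging: compute the score per class, then take the unweighted mean.
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
macro_recall = sum(recalls) / n
macro_precision = sum(precisions) / n

# Share of predictions that are at most one level away from the true level.
within_one = sum(
    cm[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1
) / total

print(f"accuracy={accuracy:.4f} precision={macro_precision:.4f} recall={macro_recall:.4f}")
print(f"within one level={within_one:.4f}")
```

For this matrix, about 80% of the test sentences are predicted within one level of the true label, so most errors are confusions between adjacent levels.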


In [17]:
predict = pl.predict(X_pred)

In [18]:
submission= pd.DataFrame()
submission['id']= df_pred.index
submission['difficulty'] = predict

In [19]:
submission.to_csv("submission.csv", index=False)
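
The lines that build and save a submission are repeated for every model in this notebook. One possible way to factor them out is a small helper. `make_submission`, the demo values and the demo file name are hypothetical, introduced here only for illustration.

```python
import os
import tempfile

import pandas as pd


def make_submission(ids, predictions, path):
    """Write an id/difficulty CSV in the layout used above and return the frame."""
    submission = pd.DataFrame({"id": list(ids), "difficulty": list(predictions)})
    submission.to_csv(path, index=False)
    return submission


# Demo with made-up predictions, written to a temporary directory.
demo_path = os.path.join(tempfile.gettempdir(), "submission_demo.csv")
demo = make_submission([0, 1, 2], ["A1", "B2", "C1"], demo_path)
print(demo)
```

In the notebook this would be called as, for example, `make_submission(df_pred.index, predict, "submission.csv")`.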

### ii. k-Nearest Neighbours

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

grid = {'n_neighbors':np.arange(1,100),
        'p':np.arange(1,3),
        'weights':['uniform','distance']}

knn = KNeighborsClassifier()
classifier_knn = GridSearchCV(knn, grid, cv=5)

pl_knn = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier_knn)])

pl_knn.fit(X_train, y_train)

print("Hyperparameters:", classifier_knn.best_params_)

Hyperparameters: {'n_neighbors': 63, 'p': 2, 'weights': 'distance'}
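
In the cell above, the grid search is the last step of the pipeline, so the TF-IDF vectorizer is fitted on all of `X_train` before the cross-validation folds are created. An alternative arrangement, sketched below under the assumption that a cleaner validation estimate is wanted, wraps the whole pipeline in `GridSearchCV` so the vectorizer is refitted inside each fold. The toy sentences, labels and the reduced grid are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six easy and six hard sentences.
texts = [
    "Je mange une pomme.", "Il a un chat.", "Elle aime le chocolat.",
    "Nous avons une maison.", "Tu es mon ami.", "Le chien est petit.",
    "Les coûts réels peuvent diverger sensiblement des estimations.",
    "Il convient néanmoins de nuancer cette affirmation.",
    "La conjoncture économique demeure particulièrement incertaine.",
    "Cette hypothèse mérite une analyse approfondie.",
    "Les répercussions sociétales demeurent difficilement quantifiables.",
    "Le législateur entend encadrer strictement ces pratiques.",
]
labels = ["A1"] * 6 + ["C1"] * 6

pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("classifier", KNeighborsClassifier())])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"classifier__n_neighbors": [1, 3],
              "classifier__weights": ["uniform", "distance"]}

# The whole pipeline is cloned and fitted per fold, vectorizer included.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

print("Hyperparameters:", search.best_params_)
```

After fitting, `search.predict(...)` uses the pipeline refitted on the full training data with the best parameters, so it can replace `pl_knn.predict(...)` directly.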


In [21]:
y_knn_predict = pl_knn.predict(X_test)
evaluate(y_test, y_knn_predict)

CONFUSION MATRIX:
[[105  36  13   0   1   6]
 [ 83  59  19   0   0   3]
 [ 60  48  37   3   1  11]
 [ 18  24  22  32  12  36]
 [ 14  19  29  22  20  69]
 [ 18  20  10  10   7  93]]
ACCURACY SCORE:
0.3604
CLASSIFICATION REPORT:
	Precision: 0.3859
	Recall: 0.3616
	F1_Score: 0.3361


In [22]:
knn_predict = pl_knn.predict(X_pred)
submission_knn= pd.DataFrame()
submission_knn['id']= df_pred.index
submission_knn['difficulty'] = knn_predict
submission_knn.to_csv("submissionknn.csv", index=False)

### iii. Decision Trees

In [23]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
pl_dtc = Pipeline([('vectorizer', tfidf),
                 ('classifier', dtc)])
pl_dtc.fit(X_train, y_train)
y_pred_dtc = pl_dtc.predict(X_test)

evaluate(y_test, y_pred_dtc)

CONFUSION MATRIX:
[[80 41 23  9  2  6]
 [44 52 39 18  4  7]
 [29 42 34 20 20 15]
 [ 7 20 37 37 23 20]
 [11 15 35 36 38 38]
 [10 11 32 33 32 40]]
ACCURACY SCORE:
0.2927
CLASSIFICATION REPORT:
	Precision: 0.2963
	Recall: 0.2927
	F1_Score: 0.2915


In [24]:
dtc_predict = pl_dtc.predict(X_pred)
submission_dtc= pd.DataFrame()
submission_dtc['id']= df_pred.index
submission_dtc['difficulty'] = dtc_predict
submission_dtc.to_csv("submissiondtc.csv", index=False)

### iv. Random Forest Model

In [25]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
pl_rfc = Pipeline([('vectorizer', tfidf),
                 ('classifier', rfc)])

pl_rfc.fit(X_train, y_train)

y_pred_rfc = pl_rfc.predict(X_test)

evaluate(y_test, y_pred_rfc)

CONFUSION MATRIX:
[[118  22  11  10   0   0]
 [ 75  57  23   4   2   3]
 [ 35  40  48  24   9   4]
 [ 16  20  10  62  22  14]
 [ 21  11  21  49  45  26]
 [ 19  12  13  32  26  56]]
ACCURACY SCORE:
0.4021
CLASSIFICATION REPORT:
	Precision: 0.4112
	Recall: 0.4043
	F1_Score: 0.3919


In [26]:
rfc_predict = pl_rfc.predict(X_pred)
submission_rfc= pd.DataFrame()
submission_rfc['id']= df_pred.index
submission_rfc['difficulty'] = rfc_predict
submission_rfc.to_csv("submissionrfc.csv", index=False)

# 5. Comparing the models

In [27]:
# Build the comparison table from the test-set predictions made above,
# so that it always matches the metrics printed for each model.
predictions = {'Logistic Reg': y_pred, 'KNN': y_knn_predict,
               'Tree': y_pred_dtc, 'Random Forest': y_pred_rfc}

df_comparison = pd.DataFrame([
    {'Model_Name': name,
     'Base_Rate': round(base_rate, 4),
     'Precision': round(precision_score(y_test, pred, average='macro'), 4),
     'Recall': round(recall_score(y_test, pred, average='macro'), 4),
     'F1-Score': round(f1_score(y_test, pred, average='macro'), 4),
     'Accuracy': round(accuracy_score(y_test, pred), 4)}
    for name, pred in predictions.items()])

df_comparison

| | Model_Name | Base_Rate | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|---|
| 0 | Logistic Reg | 0.1694 | 0.4723 | 0.4747 | 0.4703 | 0.4729 |
| 1 | KNN | 0.1694 | 0.3859 | 0.3616 | 0.3361 | 0.3604 |
| 2 | Tree | 0.1694 | 0.2963 | 0.2927 | 0.2915 | 0.2927 |
| 3 | Random Forest | 0.1694 | 0.4112 | 0.4043 | 0.3919 | 0.4021 |
