# DATAMINING PROJECT V1.8
***From Baptiste Danichert & Ahmed Balibala***

**Last update**: getting a better score

**Description:** You have noticed that, to improve one’s skills in a new foreign language, it is important to read texts in that language. These texts have to match the reader’s language level. However, it is difficult to find texts that are close to someone’s level (A1 to C2). You have decided to build a model for English speakers that predicts the difficulty of a text written in French. This can then be used, e.g., in a recommendation system to suggest texts, such as recent news articles, that are appropriate for someone’s language level. If someone is at A1 level in French, it is inappropriate to present a text at B2 level, as they won’t be able to understand it. Ideally, a text should consist mostly of known words, with a few unknown words so that the reader can improve.



# 1. Loading the training data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/BapDanSI/DataMiningProject/main/data/training_data.csv')
df_pred = pd.read_csv('https://raw.githubusercontent.com/BapDanSI/DataMiningProject/main/data/unlabelled_test_data.csv')

In [None]:
df_pred.head()

|  | id | sentence |
|---|---|---|
| 0 | 0 | Nous dûmes nous excuser des propos que nous eû... |
| 1 | 1 | Vous ne pouvez pas savoir le plaisir que j'ai ... |
| 2 | 2 | Et, paradoxalement, boire froid n'est pas la b... |
| 3 | 3 | Ce n'est pas étonnant, car c'est une saison my... |
| 4 | 4 | Le corps de Golo lui-même, d'une essence aussi... |


In [None]:
df.head()

|  | id | sentence | difficulty |
|---|---|---|---|
| 0 | 0 | Les coûts kilométriques réels peuvent diverger... | C1 |
| 1 | 1 | Le bleu, c'est ma couleur préférée mais je n'a... | A1 |
| 2 | 2 | Le test de niveau en français est sur le site ... | A1 |
| 3 | 3 | Est-ce que ton mari est aussi de Boston? | A1 |
| 4 | 4 | Dans les écoles de commerce, dans les couloirs... | B1 |


# 2. Dataframe analysis


In [None]:
df.shape

(4800, 3)

-> 4800 rows and 2 columns (excluding first column "id")

In [None]:
df.isnull().sum()

id            0
sentence      0
difficulty    0
dtype: int64

-> no NAs

In [None]:
df.duplicated(subset="sentence").value_counts()

False    4800
dtype: int64

-> no duplicate sentences in the data

# 3. Baseline

In [None]:
# Call the seeding function; assigning to it (np.random.seed = 0) would overwrite it instead
np.random.seed(0)

In [None]:
base_rate = max(df.value_counts('difficulty'))/df.shape[0]
print('Base rate:', round(base_rate,4))

Base rate: 0.1694
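
The base rate is the accuracy obtained by always predicting the most frequent class, so any useful model has to beat it. A minimal sketch of that idea, using a made-up list of labels (`toy_labels` is illustrative and not taken from the project's data):

```python
from collections import Counter

# Hypothetical labels for illustration only (not taken from training_data.csv)
toy_labels = ["A1", "A2", "A1", "B1", "C2", "A1", "B2", "C1"]

# Most frequent class and how often it occurs
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]

# Accuracy of a "classifier" that always answers with the majority class
always_majority = [majority_class] * len(toy_labels)
baseline_accuracy = sum(p == t for p, t in zip(always_majority, toy_labels)) / len(toy_labels)

print(majority_class, round(baseline_accuracy, 4))
```

With six roughly balanced classes the base rate stays close to 1/6, which is consistent with the 0.1694 reported above.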


# 4. Tokenizer function

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install -U wordstats
!python -m spacy download fr_core_news_sm
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stopwords
from wordstats import Word, common_words
from collections import Counter
import string
import nltk
import spacy
nlp = spacy.load("fr_core_news_sm")

In [None]:
df['n_diff'] = df['difficulty'].replace(['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], [1,2,3,4,5,6])

In [None]:
punctuations = string.punctuation
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
corpus = df["sentence"]

In [None]:
from gensim.utils import simple_preprocess
processed_corpus = []
for doc in corpus:
  processed_corpus.append(simple_preprocess(doc))
  
processed_corpus

[['les',
  'coûts',
  'kilométriques',
  'réels',
  'peuvent',
  'diverger',
  'sensiblement',
  'des',
  'valeurs',
  'moyennes',
  'en',
  'fonction',
  'du',
  'moyen',
  'de',
  'transport',
  'utilisé',
  'du',
  'taux',
  'occupation',
  'ou',
  'du',
  'taux',
  'de',
  'remplissage',
  'de',
  'infrastructure',
  'utilisée',
  'de',
  'la',
  'topographie',
  'des',
  'lignes',
  'du',
  'flux',
  'de',
  'trafic',
  'etc'],
 ['le',
  'bleu',
  'est',
  'ma',
  'couleur',
  'préférée',
  'mais',
  'je',
  'aime',
  'pas',
  'le',
  'vert'],
 ['le',
  'test',
  'de',
  'niveau',
  'en',
  'français',
  'est',
  'sur',
  'le',
  'site',
  'internet',
  'de',
  'école'],
 ['est', 'ce', 'que', 'ton', 'mari', 'est', 'aussi', 'de', 'boston'],
 ['dans',
  'les',
  'écoles',
  'de',
  'commerce',
  'dans',
  'les',
  'couloirs',
  'de',
  'places',
  'financières',
  'il',
  'arrive',
  'aujourd',
  'hui',
  'de',
  'croiser',
  'de',
  'jeunes',
  'adultes',
  'de',
  'ou',
  'ans',
 

In [None]:
# A dictionary
from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(14314 unique tokens: ['coûts', 'de', 'des', 'diverger', 'du']...)


In [None]:
dictionary.token2id

{'coûts': 0,
 'de': 1,
 'des': 2,
 'diverger': 3,
 'du': 4,
 'en': 5,
 'etc': 6,
 'flux': 7,
 'fonction': 8,
 'infrastructure': 9,
 'kilométriques': 10,
 'la': 11,
 'les': 12,
 'lignes': 13,
 'moyen': 14,
 'moyennes': 15,
 'occupation': 16,
 'ou': 17,
 'peuvent': 18,
 'remplissage': 19,
 'réels': 20,
 'sensiblement': 21,
 'taux': 22,
 'topographie': 23,
 'trafic': 24,
 'transport': 25,
 'utilisé': 26,
 'utilisée': 27,
 'valeurs': 28,
 'aime': 29,
 'bleu': 30,
 'couleur': 31,
 'est': 32,
 'je': 33,
 'le': 34,
 'ma': 35,
 'mais': 36,
 'pas': 37,
 'préférée': 38,
 'vert': 39,
 'français': 40,
 'internet': 41,
 'niveau': 42,
 'site': 43,
 'sur': 44,
 'test': 45,
 'école': 46,
 'aussi': 47,
 'boston': 48,
 'ce': 49,
 'mari': 50,
 'que': 51,
 'ton': 52,
 'adultes': 53,
 'années': 54,
 'ans': 55,
 'arrive': 56,
 'aujourd': 57,
 'commerce': 58,
 'couloirs': 59,
 'croiser': 60,
 'dans': 61,
 'financières': 62,
 'hui': 63,
 'hôtes': 64,
 'il': 65,
 'jeunes': 66,
 'maison': 67,
 'ouvrir': 68,
 'p

# Bag of words

In [None]:
# Tokens in document
def get_tokens(sentence):
  doc_tokens = []
  for token in nlp(sentence):
      if (token.is_punct == False) and (token.is_space == False):
        doc_tokens.append(token.lower_)
  return doc_tokens

In [None]:
# List of unique words in corpus (dictionary)
def vocabulary(corpus):
  # Declare output
  word_list = []
  # Loop documents - lower each word and add it to the output
  for document in corpus:
    spacy_doc = nlp(document)
    for token in spacy_doc:
      if token.lower_ not in word_list and (token.is_punct == False) and (token.is_space == False):
        word_list.append(token.lower_)
  # Return output
  return word_list
    
vocabulary(corpus)

['les',
 'coûts',
 'kilométriques',
 'réels',
 'peuvent',
 'diverger',
 'sensiblement',
 'des',
 'valeurs',
 'moyennes',
 'en',
 'fonction',
 'du',
 'moyen',
 'de',
 'transport',
 'utilisé',
 'taux',
 "d'",
 'occupation',
 'ou',
 'remplissage',
 "l'",
 'infrastructure',
 'utilisée',
 'la',
 'topographie',
 'lignes',
 'flux',
 'trafic',
 'etc.',
 'le',
 'bleu',
 "c'",
 'est',
 'ma',
 'couleur',
 'préférée',
 'mais',
 'je',
 "n'",
 'aime',
 'pas',
 'vert',
 'test',
 'niveau',
 'français',
 'sur',
 'site',
 'internet',
 'école',
 '-ce',
 'que',
 'ton',
 'mari',
 'aussi',
 'boston',
 'dans',
 'écoles',
 'commerce',
 'couloirs',
 'places',
 'financières',
 'il',
 'arrive',
 "aujourd'hui",
 'croiser',
 'jeunes',
 'adultes',
 '20',
 '25',
 'ans',
 'qui',
 'prévoient',
 'ouvrir',
 'une',
 'maison',
 'hôtes',
 'quinzaine',
 'années',
 'voilà',
 'autre',
 'histoire',
 "j'",
 'ai',
 'beaucoup',
 'aimée',
 'médecins',
 'disent',
 'souvent',
 "qu'",
 'on',
 'doit',
 'boire',
 'un',
 'verre',
 'vin'

In [None]:
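# Bag of Words
# Build the vocabulary once: calling vocabulary(corpus) inside bow() would
# re-parse the whole corpus with spaCy for every single sentence
vocab = vocabulary(corpus)

def bow(sentence):
  # Get tokens
  doc_tokens = get_tokens(sentence)
  # Initialization: one zero count per vocabulary word
  bag = {token: 0 for token in vocab}
  # Add 1 for each occurrence of a token in the document
  for token in doc_tokens:
    bag[token] += 1
  # Return
  return bag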
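
The same counts can be obtained more compactly with `collections.Counter`. The sketch below is only an illustration under stated assumptions: `simple_tokens` is a hypothetical whitespace tokenizer standing in for the spaCy-based `get_tokens`, and the two sentences are made up.

```python
from collections import Counter

def simple_tokens(sentence):
    # Hypothetical stand-in for get_tokens(): lowercase and split on whitespace
    return sentence.lower().split()

def bow_counter(sentence, vocab):
    # Count the tokens of this sentence, then report a count for every vocabulary word
    counts = Counter(simple_tokens(sentence))
    return {word: counts[word] for word in vocab}  # a Counter returns 0 for absent words

toy_corpus = ["le chat dort", "le chien et le chat"]

# Vocabulary in order of first appearance, built once for the whole corpus
toy_vocab = list(dict.fromkeys(tok for doc in toy_corpus for tok in simple_tokens(doc)))

bags = [bow_counter(doc, toy_vocab) for doc in toy_corpus]
print(toy_vocab)
print(bags)
```

Because the vocabulary is passed in as an argument, it is computed once and reused for every sentence.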



In [None]:
# Dataframe - all documents in corpus
bag_of_words = []
for doc in corpus:
  bag = bow(doc)
  bag_of_words.append(bag)
  
pd.DataFrame(bag_of_words)

In [None]:
# Using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(corpus).todense()
bag_of_words


In [None]:
vectorizer.vocabulary_

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bag_of_words = pd.DataFrame(bag_of_words, columns=vectorizer.get_feature_names_out())
bag_of_words

In [None]:
# Term frequency (TF)
def tf(sentence):
  # Get tokens
  tokens = get_tokens(sentence)
  # Initialization
  term_freq = {}
  for token in tokens:
    term_freq[token] = 0
  # Increment
  for token in tokens:
    term_freq[token] += 1/len(tokens)
  # Return
  return term_freq
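
Term frequency alone treats every word as equally informative. TF-IDF additionally down-weights words that occur in many documents. The sketch below uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`. It works on made-up, pre-tokenized documents, uses the relative term frequency defined above rather than raw counts, and skips the final L2 normalisation, so the numbers are not expected to match `TfidfVectorizer` output.

```python
import math

# Made-up, already tokenized documents (illustration only)
toy_docs = [["le", "chat", "dort"], ["le", "chien", "dort"], ["le", "soleil"]]

def term_freq(tokens):
    # Relative frequency of each token within one document
    return {tok: tokens.count(tok) / len(tokens) for tok in set(tokens)}

def smoothed_idf(docs):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the number of documents containing the term
    n = len(docs)
    vocab = {tok for doc in docs for tok in doc}
    return {tok: math.log((1 + n) / (1 + sum(tok in doc for doc in docs))) + 1 for tok in vocab}

idf = smoothed_idf(toy_docs)

# TF-IDF weight of every token in every document
tfidf = [{tok: tf_value * idf[tok] for tok, tf_value in term_freq(doc).items()} for doc in toy_docs]
print(tfidf[0])
```

A word such as "le", which appears in every toy document, receives the lowest IDF, while a word that appears in a single document receives the highest.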



In [None]:
stop_words=spacy.lang.fr.stop_words.STOP_WORDS

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]


    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

# 5. Improve text preparation

In [None]:
# Create list of configs
def configs():

    models = list()
    
    # Define config lists
    ngram_range = [(1,1), (1,2), (1, 3), (2, 2), (2, 3), (3, 3)]
    min_df = [1]
    max_df = [1.0]
    analyzer=['word', 'char']
    
    # Create config instances
    for n in ngram_range:
        for i in min_df:
            for j in max_df:
              for a in analyzer:
                    cfg = [n, i, j, a]
                    models.append(cfg)
    return models

configs = configs()
configs[:10]

[[(1, 1), 1, 1.0, 'word'],
 [(1, 1), 1, 1.0, 'char'],
 [(1, 2), 1, 1.0, 'word'],
 [(1, 2), 1, 1.0, 'char'],
 [(1, 3), 1, 1.0, 'word'],
 [(1, 3), 1, 1.0, 'char'],
 [(2, 2), 1, 1.0, 'word'],
 [(2, 2), 1, 1.0, 'char'],
 [(2, 3), 1, 1.0, 'word'],
 [(2, 3), 1, 1.0, 'char']]

In [None]:

# Define list for result
result = []

for config in configs:

    # Redefine vectorizer
    tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer, 
                                   ngram_range=config[0],
                                   min_df=config[1], max_df=config[2], analyzer=config[3])
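
As written, the loop only rebuilds the vectorizer, so once it finishes `tfidf_vector` holds the last configuration in the list. Note also that scikit-learn applies a custom `tokenizer` only when `analyzer='word'`. A minimal sketch of scoring every configuration inside the loop, assuming a tiny made-up corpus, the default tokenizer instead of `spacy_tokenizer`, and a plain `LogisticRegression` as a stand-in classifier (all `toy_*` names are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up sentences and labels, for illustration only
toy_sentences = ["le chat dort", "le chien mange", "je suis ici", "tu es la",
                 "nonobstant ces considérations liminaires", "quoique nous eussions préféré",
                 "il va bien", "elle a faim",
                 "la conjoncture demeure incertaine", "les prérogatives institutionnelles"] * 3
toy_labels = ["easy", "easy", "easy", "easy", "hard", "hard",
              "easy", "easy", "hard", "hard"] * 3

X_tr, X_te, y_tr, y_te = train_test_split(toy_sentences, toy_labels, test_size=0.2,
                                          random_state=0, stratify=toy_labels)

toy_configs = [[(1, 1), 1, 1.0, "word"], [(1, 2), 1, 1.0, "word"], [(2, 3), 1, 1.0, "char"]]
toy_result = []

for cfg in toy_configs:
    # Build, fit and score one pipeline per configuration inside the loop
    pipe = Pipeline([
        ("vectorizer", TfidfVectorizer(ngram_range=cfg[0], min_df=cfg[1],
                                       max_df=cfg[2], analyzer=cfg[3])),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_tr, y_tr)
    toy_result.append([cfg, pipe.score(X_te, y_te)])  # score() returns accuracy

# Configuration with the highest held-out accuracy
best_cfg, best_acc = max(toy_result, key=lambda item: item[1])
print(best_cfg, round(best_acc, 4))
```

Keeping the fit and the scoring inside the loop body is what lets the result list compare configurations instead of holding a single entry.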

# 6. Classification Algorithms

The dependent variable (y) is the column named "difficulty".
<br>We split the data into an 85% training set and a 15% test set (`test_size=0.15`).


In [None]:
y = df['difficulty']
X = df["sentence"]
X_pred = df_pred["sentence"]

### i. Logistic Regression.
We use the following parameters for the LogisticRegressionCV():

* cross-validation to 5 folds
* maximum iterations to 1000
* random state to 0
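
A minimal sketch of how the three settings listed above map onto `LogisticRegressionCV` arguments (`cv`, `max_iter`, `random_state`). The data come from `make_classification` and are purely synthetic, so this is an illustration rather than the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic 3-class data, for illustration only
X_toy, y_toy = make_classification(n_samples=150, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

clf = LogisticRegressionCV(cv=5,            # 5-fold cross-validation
                           max_iter=1000,   # maximum number of solver iterations
                           random_state=0)  # fixed seed (only used by some solvers)
clf.fit(X_toy, y_toy)

print(clf.C_)  # regularization strength(s) selected by cross-validation
```

Raising `max_iter` above the default mainly helps the solver converge on high-dimensional inputs such as TF-IDF matrices.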

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [None]:
def evaluate(true, pred):
    precision = precision_score(true, pred,average='macro')
    recall = recall_score(true, pred,average='macro')
    f1 = f1_score(true, pred,average='macro')
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [None]:
# Call the seeding function; assigning to it (np.random.seed = 0) would overwrite it instead
np.random.seed(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=50)

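# Pass the settings listed in the section introduction explicitly
classifier = LogisticRegressionCV(cv=5, max_iter=1000, random_state=0)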

pl = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

pl.fit(X_train, y_train)

y_pred = pl.predict(X_test)

print("CONFIG: ", config)
evaluate(y_test, y_pred)
print("-----------------------")

result.append([config, accuracy_score(y_test, y_pred)])

CONFIG:  [(3, 3), 1, 1.0, 'char']
CONFUSION MATRIX:
[[87 29  7  0  1  0]
 [27 58 31  4  3  1]
 [19 35 47  6  6  4]
 [ 4  4 11 52 29 28]
 [ 1  3  7 22 53 33]
 [ 4  1  8 19 19 57]]
ACCURACY SCORE:
0.4917
CLASSIFICATION REPORT:
	Precision: 0.4880
	Recall: 0.4917
	F1_Score: 0.4880
-----------------------
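
`average='macro'` computes each metric per class and then takes the unweighted mean over the six classes. As a cross-check, the sketch below recomputes the scores from the confusion matrix printed above (rows are true levels, columns are predicted levels, in the sorted order A1 to C2). The variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = [[87, 29, 7, 0, 1, 0],
      [27, 58, 31, 4, 3, 1],
      [19, 35, 47, 6, 6, 4],
      [4, 4, 11, 52, 29, 28],
      [1, 3, 7, 22, 53, 33],
      [4, 1, 8, 19, 19, 57]]

n_classes = len(cm)
row_sums = [sum(row) for row in cm]                               # true samples per class
col_sums = [sum(row[j] for row in cm) for j in range(n_classes)]  # predictions per class

# Per-class precision, recall and F1, then their unweighted (macro) means
precisions = [cm[k][k] / col_sums[k] for k in range(n_classes)]
recalls = [cm[k][k] / row_sums[k] for k in range(n_classes)]
f1s = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

macro_precision = sum(precisions) / n_classes
macro_recall = sum(recalls) / n_classes
macro_f1 = sum(f1s) / n_classes
accuracy = sum(cm[k][k] for k in range(n_classes)) / sum(row_sums)

print(f"{accuracy:.4f} {macro_precision:.4f} {macro_recall:.4f} {macro_f1:.4f}")
```

The matrix sums to 720 sentences, i.e. 15% of the 4800 training rows, which matches `test_size=0.15`.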


In [None]:
predict = pl.predict(X_pred)

In [None]:
submission= pd.DataFrame()
submission['id']= df_pred.index
submission['difficulty'] = predict

In [None]:
submission.to_csv("submission.csv", index=False)

### ii. k-Nearest Neighbours

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

grid = {'n_neighbors':np.arange(1,100),
        'p':np.arange(1,3),
        'weights':['uniform','distance']}

knn = KNeighborsClassifier()
classifier_knn = GridSearchCV(knn, grid, cv=5)

pl_knn = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier_knn)])

pl_knn.fit(X_train, y_train)

print("Hyperparameters:", classifier_knn.best_params_)

Hyperparameters: {'n_neighbors': 4, 'p': 2, 'weights': 'distance'}


In [None]:
y_knn_predict = pl_knn.predict(X_test)
evaluate(y_test, y_knn_predict)

CONFUSION MATRIX:
[[110  32  14   0   1   4]
 [ 64  68  18   6   0   8]
 [ 45  34  44  10   7  20]
 [ 13  13  10  45  13  50]
 [ 16   6  11  20  41  79]
 [  9  12   6  11  16 104]]
ACCURACY SCORE:
0.4292
CLASSIFICATION REPORT:
	Precision: 0.4458
	Recall: 0.4301
	F1_Score: 0.4123


In [None]:
knn_predict = pl_knn.predict(X_pred)
submission_knn= pd.DataFrame()
submission_knn['id']= df_pred.index
submission_knn['difficulty'] = knn_predict
submission_knn.to_csv("submissionknn.csv", index=False)

### iii. Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
pl_dtc = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', dtc)])
pl_dtc.fit(X_train, y_train)
y_pred_dtc = pl_dtc.predict(X_test)

evaluate(y_test, y_pred_dtc)

CONFUSION MATRIX:
[[67 25 15  7  1  9]
 [28 26 43 13  7  7]
 [23 24 30 11 13 16]
 [ 7 12 20 35 28 26]
 [ 1 10 16 34 32 26]
 [ 2  6 15 28 20 37]]
ACCURACY SCORE:
0.3153
CLASSIFICATION REPORT:
	Precision: 0.3146
	Recall: 0.3152
	F1_Score: 0.3138


In [None]:
dtc_predict = pl_dtc.predict(X_pred)
submission_dtc= pd.DataFrame()
submission_dtc['id']= df_pred.index
submission_dtc['difficulty'] = dtc_predict
submission_dtc.to_csv("submissiondtc.csv", index=False)

### iv. Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
pl_rfc = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', rfc)])

pl_rfc.fit(X_train, y_train)

y_pred_rfc = pl_rfc.predict(X_test)

evaluate(y_test, y_pred_rfc)

CONFUSION MATRIX:
[[103  15   4   2   0   0]
 [ 39  50  27   4   3   1]
 [ 19  36  35  11  11   5]
 [  2   8  17  42  38  21]
 [  4   5   6  20  47  37]
 [  2   2  10   7  30  57]]
ACCURACY SCORE:
0.4639
CLASSIFICATION REPORT:
	Precision: 0.4530
	Recall: 0.4640
	F1_Score: 0.4522


In [None]:
rfc_predict = pl_rfc.predict(X_pred)
submission_rfc= pd.DataFrame()
submission_rfc['id']= df_pred.index
submission_rfc['difficulty'] = rfc_predict
submission_rfc.to_csv("submissionrfc.csv", index=False)

### v. Support Vector Classifier

In [None]:
from sklearn.svm import SVC

svc = SVC()
pl_svc = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', svc)])

pl_svc.fit(X_train, y_train)

y_pred_svc = pl_svc.predict(X_test)

evaluate(y_test, y_pred_svc)

CONFUSION MATRIX:
[[87 29  6  0  2  0]
 [30 55 36  1  2  0]
 [12 31 57  8  6  3]
 [ 3  1 14 63 29 18]
 [ 1  4  8 23 50 33]
 [ 5  0 10 17 17 59]]
ACCURACY SCORE:
0.5153
CLASSIFICATION REPORT:
	Precision: 0.5134
	Recall: 0.5152
	F1_Score: 0.5130


In [None]:
svc_predict = pl_svc.predict(X_pred)
submission_svc= pd.DataFrame()
submission_svc['id']= df_pred.index
submission_svc['difficulty'] = svc_predict
submission_svc.to_csv("submissionsvc.csv", index=False)

### vi. AdaBoostClassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

abc = AdaBoostClassifier()
pl_abc = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', abc)])

pl_abc.fit(X_train, y_train)

y_pred_abc = pl_abc.predict(X_test)

evaluate(y_test, y_pred_abc)

CONFUSION MATRIX:
[[71 36  6  8  2  1]
 [37 46 25  7  6  3]
 [24 25 31 20  9  8]
 [ 6 11 15 35 40 21]
 [ 2  6 10 15 44 42]
 [ 3  4  7 15 34 45]]
ACCURACY SCORE:
0.3778
CLASSIFICATION REPORT:
	Precision: 0.3728
	Recall: 0.3781
	F1_Score: 0.3732


### vii. MLPClassifier

In [None]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier()
pl_mlpc = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', mlpc)])

pl_mlpc.fit(X_train, y_train)

y_pred_mlpc = pl_mlpc.predict(X_test)

evaluate(y_test, y_pred_mlpc)

CONFUSION MATRIX:
[[80 34  9  0  1  0]
 [29 59 26  5  3  2]
 [19 31 46  6  6  9]
 [ 4  5 16 47 28 28]
 [ 1  1 10 22 57 28]
 [ 1  2 12 22 20 51]]
ACCURACY SCORE:
0.4722
CLASSIFICATION REPORT:
	Precision: 0.4699
	Recall: 0.4721
	F1_Score: 0.4697


In [None]:
mlpc_predict = pl_mlpc.predict(X_pred)
submission_mlpc= pd.DataFrame()
submission_mlpc['id']= df_pred.index
submission_mlpc['difficulty'] = mlpc_predict
submission_mlpc.to_csv("submissionmlpc.csv", index=False)

# 7. Comparing the models

In [None]:
import pandas as pd

# Scores copied from the evaluation outputs above, stored as floats so the table can be sorted
df_comparison = pd.DataFrame({
    'Model_Name': ['Logistic Reg', 'KNN', 'Tree', 'Random Forest', 'Support Vector', 'AdaBoost', 'MLPC'],
    'Base_Rate': [0.1694] * 7,
    'Precision': [0.4880, 0.4458, 0.3146, 0.4530, 0.5134, 0.3728, 0.4699],
    'Recall': [0.4917, 0.4301, 0.3152, 0.4640, 0.5152, 0.3781, 0.4721],
    'F1-Score': [0.4880, 0.4123, 0.3138, 0.4522, 0.5130, 0.3732, 0.4697],
    'Accuracy': [0.4917, 0.4292, 0.3153, 0.4639, 0.5153, 0.3778, 0.4722],
})

df_comparison