WELCOME TO THE COLAB NOTEBOOK OF UNIL_TESLA GROUP!


Steps:


1) Take the training set, split it 80-20, train models on the train set and test it on the test set. But, then, when we train the model to submit on kaggle we can use all the dataset to train,

2) Calculate the baseline and all 4 precisions (of 4 models), then vary the parameters to optimize i

3) Then, use a model (or combinaison of many) after doing some data cleaning (e.g. removing stop words, lemmatization, tokenization, etc...) and compute the precision.

## 1. Import required packages

In [142]:
# Import required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string # to get punctuations = string.punctuation

from numpy.ma.core import append

%matplotlib inline
sns.set_style("whitegrid")

!pip install -U spacy
!python3 -m spacy download fr_core_news_sm

import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Collecting fr-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.2.0/fr_core_news_sm-3.2.0-py3-none-any.whl (17.4 MB)
[K     |████████████████████████████████| 17.4 MB 296 kB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('fr_core_news_sm')


## 2. Define and split the data

In [143]:
test_data = pd.read_csv('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/data/unlabelled_test_data.csv')
train_data = pd.read_csv('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/data/training_data.csv')
sample_submission = ('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/sample_submission.csv')

In [144]:
np.random.seed = 0
X = train_data['sentence'] # 'sentence' is the feature to have as in input
y = train_data['difficulty'] # 'difficulty' is our target, the output.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## 3. Train our models & make predictions

### 3.0. Important functions

In [145]:
# Let's create a function that will evaluate each of our different models we'll create
def evaluate(true, pred):
      precision = precision_score(true, pred, pos_label='positive', average='macro')
      recall = recall_score(true, pred, pos_label='positive', average='macro')
      f1 = f1_score(true, pred, pos_label='positive', average='macro')
      print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
      print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
      print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

# Create tokenizer function for LR
def spacy_tokenizer(text):
  mytokens = sp(text)
  return mytokens

### 3.1. Baseline

In [146]:
# Compute the baseline : 0.1694
# We calculate the most frequent item in y and divide it by the total number of observations.
# We round it to 4 digits and take the maximum.
# Let stock it in baseline variable.
baseline = max(round(y.value_counts()/len(train_data), 4))
baseline

0.1694

In [147]:
# Note: It is equivalent to do it like that (and can you guess why? :)):
round(max(y.value_counts())/len(train_data), 4)
# If you don't see the difference, we just switched max<->round

0.1694

### 3.2. Logistic regression (without data cleaning)

In [148]:
# Instanciation of the tool
# Loading french language
sp = spacy.load("fr_core_news_sm")

In [149]:
LR = LogisticRegression()
#tfidf_vector_word = TfidfVectorizer(tokenizer=spacy_tokenizer, analyzer='word')
tfidf_vector_char = TfidfVectorizer(tokenizer=spacy_tokenizer, analyzer='char')

# Create pipeline
pipe_LR = Pipeline([('vectorizer', tfidf_vector_char),
                 ('classifier', LR)])

# Fit model on training set
pipe_LR.fit(X_train, y_train)

# Predictions
y_pred = pipe_LR.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[84 31 17 15  5  9]
 [42 60 35 10  6 11]
 [16 35 62 18 12 17]
 [ 6  6 11 42 42 37]
 [ 5  3 13 45 43 64]
 [ 5  4  8 27 34 80]]
ACCURACY SCORE:
0.3865
CLASSIFICATION REPORT:
	Precision: 0.3875
	Recall: 0.3869
	F1_Score: 0.3843


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [150]:
#save 
LR_accuracy = accuracy_score(y_test, y_pred)
LR_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
LR_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
LR_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')



### 3.3. KNN (without data cleaning)

In [151]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [152]:
parameters = {'n_neighbors':np.arange(1,5), 
               'p':np.arange(1,5), 
               'weights':('uniform','distance')
               }
knn = KNeighborsClassifier()
knn1 = GridSearchCV(knn, parameters, cv=6)
pipe_knn1 = Pipeline([('vectorizer', tfidf_vector_char), 
                    ('classifier', knn1)])
pipe_knn1.fit(X_train, y_train)
n = knn1.best_params_['n_neighbors']
p = knn1.best_params_['p']
w = knn1.best_params_['weights']

y_pred = pipe_knn1.predict(X_test)
evaluate(y_test, y_pred)

#save 
KNN_accuracy = accuracy_score(y_test, y_pred)
KNN_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
KNN_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
KNN_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')

96 fits failed out of a total of 192.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
96 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/neighbors/_classification.py", line 198, in fit
    return self._fit(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/neighbors/_base.py", line 510, in _fit
    "Metric can also be a callable function." % (self.effective_metric_)
ValueError: Metric 'minkowski' not valid for sparse input. Use sorted(sklearn.neighbors.VALID_METRICS_SPARSE['brute']) to 

CONFUSION MATRIX:
[[78 36 30  8  5  4]
 [32 56 36  9 10 21]
 [13 24 44 24 26 29]
 [ 6  7 14 20 50 47]
 [ 3  3 15 32 49 71]
 [ 4 10 11 18 47 68]]
ACCURACY SCORE:
0.3281
CLASSIFICATION REPORT:
	Precision: 0.3340
	Recall: 0.3256
	F1_Score: 0.3255




### 3.4. Decision Tree Classifier (without data cleaning)

In [153]:
# Classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
DTC = DecisionTreeClassifier()
parameters = {'max_depth':np.arange(1,8)}
DTC1 = GridSearchCV(DTC, parameters, cv=6)

# Define vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)

# Create pipeline
pipe_DTC = Pipeline([('vectorizer', tfidf_vector_char),
                 ('classifier', DTC1)])

# Fit model on training set
pipe_DTC.fit(X_train, y_train)
d = DTC1.best_params_

# Predictions
y_pred = pipe_DTC.predict(X_test)

In [154]:
# Evaluate the model
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[84 40 23 12  1  1]
 [40 51 53 13  3  4]
 [14 57 62 14  7  6]
 [ 6 13 25 37 53 10]
 [ 7 11 33 29 72 21]
 [11 12 19 16 57 43]]
ACCURACY SCORE:
0.3635
CLASSIFICATION REPORT:
	Precision: 0.3781
	Recall: 0.3609
	F1_Score: 0.3617




In [155]:
#save 
DTC_accuracy = accuracy_score(y_test, y_pred)
DTC_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
DTC_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
DTC_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')



### 3.5. Random Forest Classifier (without data cleaning)

In [156]:
from sklearn.ensemble import RandomForestClassifier

# Define vectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

# Define classifier
RF = RandomForestClassifier()

# Create pipeline
pipe_RFC = Pipeline([('vectorizer', tfidf_vector_char),
                 ('classifier', RF)])

# Fit model on training set
pipe_RFC.fit(X_train, y_train)

# Predictions
y_pred = pipe_RFC.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[99 33 19  8  0  2]
 [49 63 38  7  1  6]
 [14 43 60 25  9  9]
 [ 6 10 16 51 36 25]
 [ 6  1 18 47 54 47]
 [ 5  7 13 31 35 67]]
ACCURACY SCORE:
0.4104
CLASSIFICATION REPORT:
	Precision: 0.4086
	Recall: 0.4107
	F1_Score: 0.4081




In [157]:
#save 
RFC_accuracy = accuracy_score(y_test, y_pred)
RFC_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
RFC_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
RFC_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')



### 3.6. Any other technique (with data cleaning) 
Using the algorithm with the best result, we clean the data with SpaCy NLP library.

Defining our spacy_tokenizer

In [158]:
# logistic regression
punctuations = string.punctuation
stop_words = spacy.lang.fr.stop_words.STOP_WORDS
def spacy_tokenizer(sentence):
      # Create token object, which is used to create documents with linguistic annotations.
      mytokens = sp(sentence)

      # Lemmatize each token and convert each token into lowercase
      mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
      ## alternative way
      # mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

      # Remove stop words and punctuation
      mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

      # Return preprocessed list of tokens
      return mytokens

#### 1st try (LR with data cleaning, lemma and token)

In [159]:
import spacy

In [160]:
nlp = spacy.load('fr_core_news_sm', disable=['parser', 'ner'])

In [161]:
# Lemmatization on X_train observations

def space(comment):
    doc = nlp(comment)
    return " ".join([token.lemma_ for token in doc])
X_train= X_train.apply(space)
X_train.head(3)

70                              comment t' appelle - tu ?
4347    voilà qui être en effet de nature à simplifier...
1122    le pèlerin partager alors ce célébration avec ...
Name: sentence, dtype: object

In [162]:
LR = LogisticRegression()
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)# Create pipeline
pipeLR2 = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', LR)])

# Fit model on training set
pipeLR2.fit(X_train, y_train)

# Predictions
y_pred = pipeLR2.predict(X_test)
evaluate(y_pred, y_test)

CONFUSION MATRIX:
[[67 44 26 10 12  9]
 [42 54 27 11  8  4]
 [23 28 48 10 10 10]
 [11 17 20 63 34 24]
 [10 10 15 27 75 36]
 [ 8 11 24 23 34 75]]
ACCURACY SCORE:
0.3979
CLASSIFICATION REPORT:
	Precision: 0.3985
	Recall: 0.3959
	F1_Score: 0.3957




In [163]:
#save 
LR2_accuracy = accuracy_score(y_test, y_pred)
LR2_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
LR2_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
LR2_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')



#### 2nd try (LR with Doc2Vec)

In [164]:
test_data = pd.read_csv('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/data/unlabelled_test_data.csv')
train_data = pd.read_csv('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/data/training_data.csv')
sample_submission = ('https://raw.githubusercontent.com/Broly-brolik/DMML2021_Tesla/main/sample_submission.csv')

In [165]:

from gensim.models.doc2vec import TaggedDocument
sample_tagged = train_data.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['sentence']), tags=[r.difficulty]), axis=1)
print(sample_tagged.head(3))

0    ([coût, kilométrique, réel, pouvoir, diverger,...
1        ([bleu, couleur, préférer, aime, vert], [A1])
2    ([test, niveau, français, site, internet, écol...
dtype: object


In [166]:
sample_tagged.values[0]

TaggedDocument(words=['coût', 'kilométrique', 'réel', 'pouvoir', 'diverger', 'sensiblement', 'valeur', 'moyen', 'fonction', 'moyen', 'transport', 'utiliser', 'taux', 'occupation', 'taux', 'remplissage', 'infrastructure', 'utiliser', 'topographie', 'ligne', 'flux', 'trafic', 'etc.'], tags=['C1'])

In [167]:
# Train test split - same split as before
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=0)
train_tagged

70                                      ([appelle], [A1])
4347    ([nature, simplifier, sensiblement, débat, con...
1122    ([pèlerin, partager, célébration, voisin, indi...
4570                                 ([-ce, faite], [A1])
34      ([obscur, devenir, maître, biosphère, devenir,...
                              ...                        
1033    ([micro-changement, apporter, type, union, cap...
3264        ([aller, poste, croiser, cousin, mari], [A2])
1653    ([cours, année, 1970, 1980, groupe, environnem...
2607    ([figurer, vrai, père, aucun, envie, appeler],...
2732    ([terrain, commencer, favorable, informatique,...
Length: 3840, dtype: object

In [168]:
test_tagged

2255    ([décembre, 1967, bien, invective, parlement, ...
608     ([giscard, aller, pourtant, réussir, transform...
2856    ([choix, difficile, important, public, françai...
1889              ([débat, porte, utilité, mesure], [B1])
1519                                 ([aller, vie], [A1])
                              ...                        
3553    ([engouffrer, sentier, ruer, arbre, tirer, bra...
4595    ([prix, afficher, besoin, mettre, liste, prix,...
891     ([présent, alimentation, antillaise, morue, in...
1005    ([réinvente, dimanche, perspective, laïque], [...
1940    ([femme, nuancent, régine, lemoine, darthois, ...
Length: 960, dtype: object

In [169]:
# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [170]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
model_dbow.build_vocab([x for x in sample_tagged.values])

In [171]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [172]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=100)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

In [173]:
X_train[:3]

(array([-0.23166588,  0.23998977,  0.03163745, -0.17800882, -0.04882336,
        -0.18489705, -0.24608955, -0.1396688 , -0.07768258,  0.08845659,
        -0.20469974, -0.11909915,  0.0353523 , -0.3403728 ,  0.07838056,
        -0.2818986 , -0.17967646,  0.17552437, -0.04133378, -0.1492834 ,
         0.17134136, -0.18431388,  0.21888009, -0.07860448, -0.06091329,
         0.12362346, -0.02192856, -0.15423015, -0.02916885,  0.02396392],
       dtype=float32),
 array([-0.73345625,  0.61690694,  0.10789692, -0.3845667 , -0.24653459,
        -0.67119104, -0.6720537 , -0.38287687, -0.3587298 ,  0.48083696,
        -0.5635869 , -0.29771614, -0.05702979, -1.1506287 ,  0.41859108,
        -0.94260013, -0.7198455 ,  0.4509291 , -0.2591503 , -0.57143694,
         0.3617331 , -0.84339654,  0.6316292 , -0.39781174, -0.00738906,
         0.36913425, -0.2482781 , -0.6240237 ,  0.07065703,  0.15128417],
       dtype=float32),
 array([-0.94373584,  0.8321359 ,  0.12407004, -0.51239866, -0.3667969 ,
   

In [174]:
# Fit model on training set - same algorithm as before
LR3 = LogisticRegression(max_iter=1000, solver='lbfgs')
LR3.fit(X_train, y_train)

# Predictions
y_pred = LR3.predict(X_test)

# Evaluate model

evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[102  33  20   4   2   0]
 [ 62  53  35   7   3   4]
 [ 18  26  73  22   5  16]
 [  2   6  22  58  27  29]
 [  3   3  18  37  72  40]
 [  4   2  12  29  36  75]]
ACCURACY SCORE:
0.4510
CLASSIFICATION REPORT:
	Precision: 0.4490
	Recall: 0.4511
	F1_Score: 0.4471




In [175]:
#save 
LR3_accuracy = accuracy_score(y_test, y_pred)
LR3_precision = precision_score(y_test, y_pred, pos_label='positive', average='macro')
LR3_recall = recall_score(y_test, y_pred, pos_label='positive', average='macro')
LR3_f1 = f1_score(y_test, y_pred, pos_label='positive', average='macro')



This is our best model.
We can now train  our model with all data from train_data and predict for test_set.
Actually, the prediction on Kaggle was better when this model was not trained on all data but only with the train splitted set from train_data.
So this part is not recommended to be executed.

##### Train the model on the whole train_data set

In [176]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

# model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
# model_dbow.build_vocab([x for x in sample_tagged.values])
# model_dbow.train(sample_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs) # train on sample_tagged, i.e. all the set

In [177]:
sample_tagged_test = test_data.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r["sentence"]), tags=r["id"]), axis=1)

In [178]:
# Vectorize the X_test, which is our sentences from test_data
sents = sample_tagged_test.values
X_test = np.asarray([model_dbow.infer_vector(doc.words, steps=100) for doc in sents])

In [179]:
X_test.shape

(1200, 30)

In [180]:
test_pred = LR3.predict(X_test)

In [181]:
# sample_tagged

#### Bonus

Some functions to clean the data (but was done by space_tokenizer)

In [182]:
# All in lower case
# X_train = X_train.apply(lambda x: " ".join(x.lower() for x in x.split()))
# Remove duplicates
# X_train = X_train.drop_duplicates
# Removing stopwords on X_train observations
# X_train = X_train.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
# Removing punctuations on X_train observations
# X_train = X_train.str.replace('[^\w\s]','')

## 4. Summary of the results

In [183]:
 # make a frame with the summaries
summarises = pd.DataFrame({'Name': ['Logistic Regression', 'KNN', 'Decision Tree Classifier', 'Random Forest Classifier', 'LR2', 'LR3'],
                          'Accuracy': [LR_accuracy, KNN_accuracy, DTC_accuracy, RFC_accuracy, LR2_accuracy, LR3_accuracy],
                          'Precision': [LR_precision, KNN_precision, DTC_precision, RFC_precision, LR2_precision, LR3_precision],
                          'Recall': [LR_recall, KNN_recall, DTC_recall, RFC_recall, LR2_recall, LR3_recall],
                          'f1': [LR_f1, KNN_f1, DTC_f1, RFC_f1, LR2_f1, LR3_f1]
                          })


summarises

Unnamed: 0,Name,Accuracy,Precision,Recall,f1
0,Logistic Regression,0.386458,0.387544,0.386941,0.384255
1,KNN,0.328125,0.334029,0.325574,0.325542
2,Decision Tree Classifier,0.363542,0.378131,0.360916,0.361749
3,Random Forest Classifier,0.410417,0.408577,0.410735,0.408088
4,LR2,0.397917,0.395941,0.398521,0.395732
5,LR3,0.451042,0.448963,0.451101,0.447058


## 5. Predict test set with our best model, and export it in a .csv file

#### 5.1. Predict 

In [184]:
# Fit it with our best model (e.g. pipeLR, pipeknn, pipeDTC, etc.)
# pipeX.fit(X, y) # we can now train it on the whole train set
# test_pred = pipeX.predict(test_data['sentence'])

As in our case the best model was LR3, and we did not want to train the model on all the set, we can just  go the section where there is the model, run it in order to have the test_pred variable filled.

#### 5.2. Create the dataframe, convert it and download it.

In [185]:
# create the dataframe as asked
submission = pd.DataFrame(data=test_pred, columns=['difficulty']) #creating the dataframe with the prediction values.
submission['id'] = submission.index #adding the column 'id'
submission

Unnamed: 0,difficulty,id
0,C1,0
1,B1,1
2,A2,2
3,B2,3
4,C2,4
...,...,...
1195,B1,1195
1196,A2,1196
1197,C2,1197
1198,B2,1198


In [186]:
#export the dataframe as a .csv file
submission.to_csv('submission.csv', index=False) #deleting the index column
from google.colab import files
files.download('submission.csv') # downloading the .csv file

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Conclusion

We submitted the prediction with LR3 model.

## Youtube video

In [188]:
%%HTML
<iframe width="560" src="https://youtube.com/embed/zIJN3GRUI3s"></iframe>

If you have issues for watching the video here, you can directly go to this link: https://youtu.be/zIJN3GRUI3s