<table>
<tr>
<td width=15%><img src="../../img/UGA.png"></img></td>
<td><center><h1>Project n°3</h1></center></td>
<td width=15%><a href="https://team.inria.fr/tripop/team-members/" style="font-size: 16px; font-weight: bold">Florian Vincent</a> </td>
</tr>
</table>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

In [2]:
%load_ext autoreload
%autoreload 2

# Learning text classification

This project is heavily inspired by [Jigsaw's *Toxic Comments Classification* challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview) on Kaggle.
To discourage copy-pasting code from other sources, it guides you towards specific tools to test and use.

## Overview of the project

Take a look at the zipped CSV data files after unzipping them (`for name in *.zip; do unzip "$name"; done`).
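
If no POSIX shell is at hand, the same extraction can be sketched in Python with the standard `zipfile` module. The helper name `unzip_all` and the assumption that the archives sit in the notebook's directory are illustrative, not part of the project statement.

```python
from pathlib import Path
import zipfile


def unzip_all(directory="."):
    """Extract every .zip archive found in `directory`, next to the archive."""
    extracted = []
    for archive in sorted(Path(directory).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
            extracted.extend(zf.namelist())
    return extracted  # names of the files that were written
```

Calling `unzip_all()` from the directory holding the archives should leave the CSV files ready for `pd.read_csv`.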

Every comment in the train set carries zero or more labels from `{"toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"}`.
You will need to train several kinds of models to identify those comments, and you will test them against the test dataset.

## Study the data

Representing textual data in an algebraic format (i.e. vectors and matrices) is not easy, but fortunately it was briefly covered earlier in the lectures.

**Implement a word-vectorizer relying on simple counting for the textual data**

In [3]:
# Load the data
data_train = pd.read_csv("train.csv")
data_train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
# Check for missing values
data_train.isna().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

* # Data cleaning

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from nltk.stem import WordNetLemmatizer

In [6]:
# Replace every non-letter character (punctuation, digits, line breaks) with a space
data_train["comment_clean"] = data_train["comment_text"].apply(lambda x : re.sub("[^a-zA-Z]", ' ', x))

In [7]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,D aww He matches this background colour I m s...
2,"Hey man, I'm really not trying to edit war. It...",Hey man I m really not trying to edit war It...
3,"""\nMore\nI can't make any real suggestions on ...",More I can t make any real suggestions on im...
4,"You, sir, are my hero. Any chance you remember...",You sir are my hero Any chance you remember...
5,"""\n\nCongratulations from me as well, use the ...",Congratulations from me as well use the to...


In [8]:
# Convert to lowercase
data_train["comment_clean"] = data_train["comment_clean"].str.lower()

In [9]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,explanation why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,d aww he matches this background colour i m s...
2,"Hey man, I'm really not trying to edit war. It...",hey man i m really not trying to edit war it...
3,"""\nMore\nI can't make any real suggestions on ...",more i can t make any real suggestions on im...
4,"You, sir, are my hero. Any chance you remember...",you sir are my hero any chance you remember...
5,"""\n\nCongratulations from me as well, use the ...",congratulations from me as well use the to...


In [10]:
# Tokenization (split into individual words)
data_train["comment_clean"] = data_train["comment_clean"].apply(word_tokenize)

In [11]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,"[explanation, why, the, edits, made, under, my..."
1,D'aww! He matches this background colour I'm s...,"[d, aww, he, matches, this, background, colour..."
2,"Hey man, I'm really not trying to edit war. It...","[hey, man, i, m, really, not, trying, to, edit..."
3,"""\nMore\nI can't make any real suggestions on ...","[more, i, can, t, make, any, real, suggestions..."
4,"You, sir, are my hero. Any chance you remember...","[you, sir, are, my, hero, any, chance, you, re..."
5,"""\n\nCongratulations from me as well, use the ...","[congratulations, from, me, as, well, use, the..."


In [12]:
# Remove stopwords (common function words that carry little meaning)
stop_words = set(stopwords.words("english"))

data_train["comment_clean"] = data_train["comment_clean"].apply(lambda x: [word for word in x if word not in stop_words])

In [13]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,"[explanation, edits, made, username, hardcore,..."
1,D'aww! He matches this background colour I'm s...,"[aww, matches, background, colour, seemingly, ..."
2,"Hey man, I'm really not trying to edit war. It...","[hey, man, really, trying, edit, war, guy, con..."
3,"""\nMore\nI can't make any real suggestions on ...","[make, real, suggestions, improvement, wondere..."
4,"You, sir, are my hero. Any chance you remember...","[sir, hero, chance, remember, page]"
5,"""\n\nCongratulations from me as well, use the ...","[congratulations, well, use, tools, well, talk]"


In [14]:
# Lemmatization
lemmatizer = WordNetLemmatizer()

data_train["comment_clean"] = data_train["comment_clean"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [15]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,"[explanation, edits, made, username, hardcore,..."
1,D'aww! He matches this background colour I'm s...,"[aww, match, background, colour, seemingly, st..."
2,"Hey man, I'm really not trying to edit war. It...","[hey, man, really, trying, edit, war, guy, con..."
3,"""\nMore\nI can't make any real suggestions on ...","[make, real, suggestion, improvement, wondered..."
4,"You, sir, are my hero. Any chance you remember...","[sir, hero, chance, remember, page]"
5,"""\n\nCongratulations from me as well, use the ...","[congratulation, well, use, tool, well, talk]"


In [16]:
# Join the token lists back into strings
data_train["comment_clean"] = data_train["comment_clean"].apply(lambda x: " ".join(x))

In [17]:
data_train[["comment_text", "comment_clean"]].head(6)

Unnamed: 0,comment_text,comment_clean
0,Explanation\nWhy the edits made under my usern...,explanation edits made username hardcore metal...
1,D'aww! He matches this background colour I'm s...,aww match background colour seemingly stuck th...
2,"Hey man, I'm really not trying to edit war. It...",hey man really trying edit war guy constantly ...
3,"""\nMore\nI can't make any real suggestions on ...",make real suggestion improvement wondered sect...
4,"You, sir, are my hero. Any chance you remember...",sir hero chance remember page
5,"""\n\nCongratulations from me as well, use the ...",congratulation well use tool well talk


In [None]:
def nettoyage(df):
    # Replace every non-letter character (punctuation, digits, line breaks) with a space
    df["comment_clean"] = df["comment_text"].apply(lambda x: re.sub("[^a-zA-Z]", ' ', x))

    # Convert to lowercase
    df["comment_clean"] = df["comment_clean"].str.lower()

    # Tokenization (split into individual words)
    df["comment_clean"] = df["comment_clean"].apply(word_tokenize)

    # Remove stopwords (common function words that carry little meaning)
    stop_words = set(stopwords.words("english"))
    df["comment_clean"] = df["comment_clean"].apply(lambda x: [word for word in x if word not in stop_words])

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    df["comment_clean"] = df["comment_clean"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

    # Join the token lists back into strings
    df["comment_clean"] = df["comment_clean"].apply(lambda x: " ".join(x))

    return df

# Note: the text column to process must be named "comment_text".
# The function adds a "comment_clean" column to df in place and returns df.

* # Vectorization

In [18]:
# Bag-of-words model (CountVectorizer)
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform(data_train["comment_clean"])
data_cv.shape

(159571, 158769)
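
To make explicit what the counting step computes, here is a minimal hand-written bag-of-words on two cleaned comments taken from the table above. The helper names `fit_vocabulary` and `count_vectorize` are hypothetical, and the sketch splits on whitespace and returns a dense array, whereas `CountVectorizer` applies its own token pattern and returns a sparse matrix (a dense matrix of shape (159571, 158769) would not fit in memory).

```python
from collections import Counter

import numpy as np


def fit_vocabulary(documents):
    """Map each distinct whitespace-separated token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vectorize(documents, vocabulary):
    """Return a dense (n_documents, n_tokens) matrix of raw token counts."""
    matrix = np.zeros((len(documents), len(vocabulary)), dtype=int)
    for row, doc in enumerate(documents):
        for tok, n in Counter(doc.split()).items():
            col = vocabulary.get(tok)
            if col is not None:  # tokens unseen when fitting are ignored
                matrix[row, col] = n
    return matrix


train_docs = ["sir hero chance remember page",
              "congratulation well use tool well talk"]
vocab = fit_vocabulary(train_docs)           # learned on the training comments only
X_train = count_vectorize(train_docs, vocab)
X_new = count_vectorize(["well well unknown page"], vocab)
```

The vocabulary is learned once on the training comments and reused for any later text, which is the role of `fit_transform` followed by `transform` in scikit-learn.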

**Implement another vectorizer, this time relying on the *tf-idf* metric. Use a pipeline if needed.**

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer()
data_tfidf = tfv.fit_transform(data_train["comment_clean"])
data_tfidf.shape

(159571, 158769)
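
The tf-idf weighting can be reproduced by hand from a count matrix. The sketch below assumes the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, which is, as far as we know, what `TfidfVectorizer` does with its default settings. The function name `tfidf_from_counts` is hypothetical.

```python
import numpy as np


def tfidf_from_counts(counts):
    """Turn a dense count matrix into L2-normalized tf-idf weights."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                           # keep empty documents at zero
    return weighted / norms


# Three documents, three terms; the middle term occurs in every document.
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 0]])
weights = tfidf_from_counts(counts)
```

A term present in every document keeps the minimal idf of 1, so rarer terms dominate each row. In scikit-learn, the same two-step structure can be expressed as a `Pipeline` chaining `CountVectorizer` and `TfidfTransformer`.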

One may wish to take a deeper look at the dataset using various techniques.

**Find a suitable dimension reduction technique to study the structure of the data. Display your findings with visual means (you can use `seaborn`).**
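
One possible starting point, given as a sketch rather than the expected answer, is `TruncatedSVD`, which accepts the sparse matrices produced by the vectorizers directly. The toy comments and the `toxic` labels below are invented stand-ins for `data_train["comment_clean"]` and one label column.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy stand-in for the cleaned comments and one label column.
docs = [
    "thank helpful edit article",
    "great article thank",
    "stupid idiot edit",
    "idiot stupid page",
    "helpful page great",
    "stupid article idiot",
]
toxic = [0, 0, 1, 1, 0, 1]

X = TfidfVectorizer().fit_transform(docs)        # sparse tf-idf matrix
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)                    # dense array, shape (n_documents, 2)

projection = pd.DataFrame(coords, columns=["component_1", "component_2"])
projection["toxic"] = toxic
# In the notebook, the projection could then be drawn with, for instance:
# sns.scatterplot(data=projection, x="component_1", y="component_2", hue="toxic")
```

Two components rarely capture much of the variance of a text corpus, so `svd.explained_variance_ratio_` is worth reporting next to the plot.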

## Classification

We will study a small number of models during this project.

### Logistic regression

Logistic regression is among the simplest models one can use for classification, but it gives a good idea of the baseline that more complex models should beat.

**Implement a logistic classifier. Justify every parameter that you choose and how you chose it.**
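
Because a comment may carry several labels at once, one option is to fit one binary logistic regression per label. The sketch below does so with `OneVsRestClassifier` on invented toy comments and only two labels; the values `C=10.0` and `max_iter=1000` are placeholders to be justified (for instance by cross-validation), not recommendations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Invented toy comments; the real input would be data_train["comment_clean"].
docs = [
    "thank helpful edit", "great article thank", "helpful page great",
    "stupid idiot edit", "idiot stupid page", "stupid article idiot",
    "kill you idiot", "will kill you",
]
# Columns: toxic, threat (a two-label stand-in for the six real labels).
Y = np.array([
    [0, 0], [0, 0], [0, 0],
    [1, 0], [1, 0], [1, 0],
    [1, 1], [1, 1],
])

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(C=10.0, max_iter=1000)),
)
model.fit(docs, Y)
probabilities = model.predict_proba(["stupid idiot"])  # one column per label
```

Keeping the vectorizer inside the pipeline ensures that it is refitted on each training fold during cross-validation, so no information leaks from the held-out comments.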

In [7]:
## Write your code here

### SVM

The support vector machine used to be the state-of-the-art (SOTA) method for many tasks before neural networks became more popular among data scientists.
It has several advantages over logistic regression: it is a kernel method whose results remain relatively easy to interpret.

**Implement an SVM classifier, justifying your choice of hyperparameters.**
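
As a sketch of how a hyperparameter choice can be justified, the example below tunes the regularization strength `C` of a linear SVM by cross-validated grid search on invented toy comments and a single label. `LinearSVC` is an assumption made here for scalability: a kernelized `SVC` is also possible but its training cost grows quickly with the number of samples. The grid values are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy comments with a single binary label.
docs = [
    "thank helpful edit", "great article thank", "helpful page great", "great edit thank",
    "stupid idiot edit", "idiot stupid page", "stupid article idiot", "idiot page stupid",
]
toxic = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
# Step names prefix the parameter names: "svm__C" targets LinearSVC's C.
search = GridSearchCV(pipeline, {"svm__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(docs, toxic)

best_C = search.best_params_["svm__C"]
prediction = search.predict(["stupid idiot"])  # uses the pipeline refitted with best_C
```

For the six labels of the project, the same pipeline can be wrapped per label, for example with `OneVsRestClassifier(LinearSVC())`.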

## Other models

**Choose one model among the following:**
* **K-Nearest Neighbors (*KNN*)**
* **Decision Tree**
* **Random Forest**

**Describe IN YOUR OWN WORDS (plagiarism checks will be made if needed) how the method works, and implement it for the current case, discussing its hyperparameters as well.**

## Random Forest
The Random Forest method relies on **several** independent *decision trees* to produce a prediction that is more accurate than that of any individual tree.
A decision tree is a model that splits the data as well as possible through a sequence of decisions, represented by *branches*.
A single tree is very sensitive to variations in the training data. This is why a forest is generally preferred: by combining the results of several decision trees built on varying training samples, the random forest reduces the risk of errors caused by changes in that data.
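
The idea can be sketched by hand: draw a bootstrap sample for each tree, fit the tree, then let the trees vote. The synthetic data, the number of trees (25) and the hard majority vote are simplifying assumptions; `RandomForestClassifier` combines its trees by averaging their predicted class probabilities instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label is 1 when the first feature exceeds 0.5.
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)

trees = []
for seed in range(25):
    rows = rng.integers(0, len(X), size=len(X))   # bootstrap: sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Majority vote across the trees.
votes = np.stack([tree.predict(X) for tree in trees])      # shape (n_trees, n_samples)
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (forest_prediction == y).mean()                  # measured on the training data
```

Each tree sees a different resample and considers a random subset of features at every split, which is what decorrelates the trees and makes their combined vote more stable than any single one.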

In [22]:
# Build the samples needed for the trees
# Question: how many rows should each tree use? How many trees in total?

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rf = RandomForestClassifier(n_estimators = 100, criterion = "gini", max_depth = None, min_samples_split = 2,
                            min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = "sqrt",
                            max_leaf_nodes = None, min_impurity_decrease = 0.0, bootstrap = True,
                            oob_score = False, n_jobs = None, random_state = None, verbose = 0,
                            warm_start = False, class_weight = None, ccp_alpha = 0.0, max_samples = None)
# max_features = "sqrt": replaces the former "auto" alias, which recent scikit-learn releases no longer accept
# bootstrap = True: each tree is trained on a bootstrap resample of the training set

## Compare models

One must then compare the models on the test set and provide metrics to evaluate them.

**Compare the previously studied models, with both count (*tf*) and *tf-idf* vectorizers, each at its best hyperparameters.**
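
As an illustration of metrics that suit a multi-label problem, the sketch below computes a ROC AUC per label and a macro-averaged F1 score. The scores are hypothetical values written by hand, not the output of any model in this notebook, and only two labels are shown for brevity.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

labels = ["toxic", "insult"]  # two of the six labels, for brevity
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
# Hypothetical predicted probabilities, one column per label.
y_score = np.array([[0.9, 0.2], [0.1, 0.65], [0.8, 0.7], [0.4, 0.6]])
y_pred = (y_score >= 0.5).astype(int)   # hard decisions at a 0.5 threshold

per_label_auc = {
    name: roc_auc_score(y_true[:, j], y_score[:, j])
    for j, name in enumerate(labels)
}
mean_auc = float(np.mean(list(per_label_auc.values())))
macro_f1 = f1_score(y_true, y_pred, average="macro")
```

ROC AUC is computed from the scores and does not depend on a threshold, whereas F1 depends on the chosen cut-off. With labels as rare as `threat`, plain accuracy would look high even for a model that never predicts the label.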

In [10]:
## Write your code here

## Use your model

**Use the best model to build a Command-Line Interface (*CLI*) launched by the command `./cli.py [options]` using the `argparse` module. It should read English sentences from stdin (standard input), classify them, and display the result along with interesting metrics if relevant.**
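
A possible skeleton for such a script is sketched below. The options `--model` and `--threshold`, the default file name `model.joblib` and the placeholder scorer are all hypothetical; the trained model still has to be loaded and plugged in.

```python
import argparse
import sys

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def build_parser():
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Classify English sentences read from standard input.",
    )
    # Hypothetical options, chosen for illustration only.
    parser.add_argument("--model", default="model.joblib",
                        help="path to the serialized model")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="probability from which a label is reported")
    return parser


def classify_lines(lines, predict_proba, labels, threshold):
    """Yield (sentence, reported_labels) for every non-empty input line."""
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue
        scores = predict_proba(sentence)
        yield sentence, [lab for lab, s in zip(labels, scores) if s >= threshold]


def main(argv=None, stream=None):
    args = build_parser().parse_args(argv)
    stream = sys.stdin if stream is None else stream

    # Placeholder scorer: replace with the model loaded from args.model.
    def predict_proba(sentence):
        return [0.0] * len(LABELS)

    for sentence, found in classify_lines(stream, predict_proba, LABELS, args.threshold):
        print(f"{sentence}\t{', '.join(found) or 'no label'}")


# In cli.py, finish with:
#     if __name__ == "__main__":
#         main()
```

For `./cli.py` to be launched directly, the file also needs a first line such as `#!/usr/bin/env python3` and the executable permission (`chmod +x cli.py`).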