# Machine Learning Engineer Path

## Project 8: Enter a Kaggle competition!

## Table of contents
* [Introduction](#0)
* [1. Data cleansing](#1)
    * [1.1 Package installation](#1.1)
    * [1.2 Loading and viewing the data](#1.2)
    * [1.3 Cleaning steps](#1.3)
        * [1.3.1 Handling emojis](#1.3.1)
        * [1.3.2 Removing digits and other unwanted characters](#1.3.2)
        * [1.3.3 Removing contractions](#1.3.3)
        * [1.3.4 Handling spelling mistakes](#1.3.4)
        * [1.3.5 Language detection](#1.3.5)
* [2. Data augmentation](#2)
    * [2.1 Package installation](#2.1)
    * [2.2 Synonym method](#2.2)
* [3. Modeling](#3)
    * [3.1 TPU configuration](#3.1)
    * [3.2 Encoding function](#3.2)
    * [3.3 Building the model](#3.3)
    * [3.4 Training](#3.4)
    * [3.5 Submitting the model](#3.5)
    * [3.6 Procedure and tests](#3.6)
* [4. Results](#4)
* [Conclusion](#5)
* [Sources](#6)

## Introduction<a class="anchor" id="0"></a>

<p>In this notebook, I walk through in detail what I did for this new project in the OpenClassrooms Machine Learning Engineer path.
<p>The goal is to take part in a Kaggle competition. I entered the "Jigsaw Multilingual Toxic Comment Classification" competition, whose aim is to classify text data. It is therefore an NLP problem, built around the use of the TPU, an accelerator designed by Google, which makes it possible to test and apply the best-performing pretrained models currently available. The data to classify consists of comments collected from "Wikipedia talk page comments". The aim is to build a model that can automatically detect whether a comment is toxic or not; in other words, to set up an automatic moderator.
<p>First, I present the cleaning I carried out, detailing each strategy. Next, I cover the data augmentation that I wanted to try in this project. Finally, the last part is devoted to modeling and to the procedure used to find the best possible combination of machine learning algorithm and data.

## 1. Data cleansing<a class="anchor" id="1"></a>

### 1.1 Package installation<a class="anchor" id="1.1"></a>

<p>Let us start by installing all the libraries I will use for preprocessing the data.

In [1]:
!pip install emoji
!pip install demoji
!pip install spacy_cld
!pip install autocorrect
!pip install pandarallel
!python -m spacy download xx_ent_wiki_sm
!pip install texthero
!pip install contractions

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42176 sha256=0a662047e674c20723f9219ade3cc23b3f0a3eaa46ebafd692550a9d39de3da2
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.5.4
Collecting demoji
  Downloading htt

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



In [3]:
import os
import gc

import spacy
from spacy_cld import LanguageDetector
import xx_ent_wiki_sm

from autocorrect import Speller

from tqdm import tqdm
tqdm.pandas()

import re
import nltk

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

import emoji
import demoji
demoji.download_codes()

import texthero as hero
import contractions

import time

import string

punct = set(string.punctuation)
print(punct)

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Downloading emoji data ...
... OK (Got response in 0.10 seconds)
Writing emoji data to /root/.demoji/codes.json ...
... OK


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


{'+', "'", '_', '`', '*', '\\', '&', '[', '?', '"', '>', '~', '{', '#', '-', '$', '/', '<', ':', '@', '%', '|', '}', '^', '(', '!', '.', ']', ')', '=', ';', ','}


In [6]:
pd.set_option('display.max_colwidth', None)

### 1.2 Loading and viewing the data<a class="anchor" id="1.2"></a>

<p>We start by loading 3 datasets into 3 dataframes. These 3 datasets are:
<ul>
    <li>train: the training set, from which only 2 pieces of information are kept: the "comment_text" column, which holds the comments, and the "toxic" column, a boolean (0/1) that is the target to predict. The other columns that categorize the text, such as whether it is an insult or an obscene comment, are not used. Note that the comments in this training set are in English.</li>
    <li>valid: the validation set, used to validate the trained model. The comments in this dataset are multilingual.</li>
    <li>test: the test set used for the competition submission. The idea is to predict the toxicity of the comments in this file with the model trained beforehand.</li>
</ul>
    

In [6]:
train = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
valid = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/test.csv')

train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)

train['lang'] = 'en'

#train
display(train.head())
display(train.shape)

#valid
display(valid.head())
display(valid.shape)

#test
display(test.head())
display(test.shape)

Unnamed: 0,id,comment_text,toxic,lang
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en
1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,en
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en


(223549, 4)

Unnamed: 0,id,comment_text,lang,toxic
0,0,"Este usuario ni siquiera llega al rango de hereje . Por lo tanto debería ser quemado en la barbacoa para purificar su alma y nuestro aparato digestivo mediante su ingestión. Skipe linkin 22px Honor, valor, leltad. 17:48 13 mar 2008 (UTC)",es,0
1,1,"Il testo di questa voce pare esser scopiazzato direttamente da qui. Immagino possano esserci problemi di copyright, nel fare cio .",it,0
2,2,"Vale. Sólo expongo mi pasado. Todo tiempo pasado fue mejor, ni mucho menos, yo no quisiera retroceder 31 años a nivel particular. Las volveria a pasar putas.Fernando",es,1
3,3,"Bu maddenin alt başlığı olarak uluslararası ilişkiler ile konuyu sürdürmek ile ilgili tereddütlerim var.Önerim siyaset bilimi ana başlığından sonra siyasal yaşam ve toplum, siyasal güç, siyasal çatışma, siyasal gruplar, çağdaş ideolojiler, din, siyasal değişme, kamuoyu, propaganda ve siyasal katılma temelinde çoğulcu siyasal sistemler.Bu alt başlıkların daha anlamlı olacağı kanaatindeyim.",tr,0
4,4,"Belçika nın şehirlerinin yanında ilçe ve beldelerini yaparken sanırım Portekizi örnek alacaksın. Ben de uzak gelecekte(2-3 yıl) bu tip şeyler düşünüyorum. Tabii futbol maddelerinin hakkından geldikten sonra.. daha önce mesajlarınızı görmüştüm, hatta anon bölümünü bizzat kullanıyordum sözünü anlamadım?? tanışmak bugüneymiş gibi bir şey eklemeyi düşündüm ama vazgeçtim. orayı da silmeyi unuttum. boşverin Kıdemli +",tr,0


(8000, 4)

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olarak bir viki yazarı kendi adını eklemiştir. Şahsen düzelttim. Onaylarsanız sevinirim. Occipital,tr
1,1,"Вполне возможно, но я пока не вижу необходимости выделять материал в отдельную статью. Если про правосудие в СССР будет написано хотя бы килобайт 20-30 — тогда да, следует разделить. Пока же мы в итоге получим одну куцую статью Правосудие и другую не менее куцую статью Правосудие в СССР. Мне кажется, что этот вопрос вполне разумно решать на основе правил ВП:Размер статей? которые не предписывают разделения, пока размер статьи не достигнет хотя бы 50 тыс. знаков.",ru
2,2,"Quindi tu sei uno di quelli conservativi , che preferiscono non cancellare. Ok. Avresti lasciato anche sfaccimma ? Si? Ok. Contento te... io non approvo per nulla, ma non conto nemmeno nulla... Allora lo sai che faccio? Me ne frego! (Aborro il fascismo, ma quando ce vo , ce vo !) Elborgo (sms)",it
3,3,"Malesef gerçekleştirilmedi ancak şöyle bir şey vardı. Belki yararlanırsınız. İyi çalışmalar. Kud yaz Teşekkür ederim. Abidenin maddesini de genişletmeyi düşünüyorum, ileride işime yarayacak bu. cobija Kullandın mı bilmiyorum ama şunu ve şunu da ben iliştireyim. Belki kaynakçaları lazım olur )RapsarEfendim? Yok mu artıran? ) . Kullandınız mı bilmiyorum ama kullanmadıysanız alttaki model, 3d, senaryo ve yerleştirme başlıklarını da incelemenizi tavsiye ederim. Kud yaz Aynen ya, çok güzel bir kaynak ama çalışma sahiplerine attığım e-postaya bir cevap gelmedi. Oradaki çalışmaları kullanabilseydim güzel olacaktı. cobija",tr
4,4,":Resim:Seldabagcan.jpg resminde kaynak sorunu :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Ancak dosyanın tanım sayfasında içeriğin kimin tarafından yapıldığı hakkında ayrıntılı bilgi bulunmamaktadır, yani telif durumu açık değildir. Eğer dosyayı kendiniz yapmadıysanız, içeriğin sahibini belirtmelisiniz. Bir internet sitesinden elde ettiyseniz nereden aldığınızı net şekilde gösteren bir bağlantı veriniz. Diğer yüklediğiniz resimleri kontrol etmek istiyorsanız bu bağlantıyı tıklayın. Kaynaksız ve lisanssız resimler hızlı silme kriterlerinde belirtildiği üzere işaretlendikten bir hafta sonra silinirler. Telif hakları saklı olup adil kullanım politikasına uymayan resimler 48 saat sonra silinirler . Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj :Resim:Seldabagcan.jpg için adil kullanım gerekçesi :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Yüklediğiniz resim adil kullanım politikasına uymak zorundadır ancak bu politikaya nasıl uyduğunu gösteren bir açıklama veya gerekçe bulunmamaktadır. Resim tanım sayfasına, kullanıldığı her madde için ayrı ayrı olacak şekilde bir adil kullanım gerekçesi yazmalısınız. Yüklediğiniz diğer resimleri kontrol etmek için bu bağlantıyı tıklayınız. Gerekçesi eksik olan adil kullanım resimleri hızlı silme kriterleri gereğince bir hafta sonra silinirler. Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj",tr


(63812, 3)

<p>Let us check how the data is split on the "toxic" variable. Only about 10% of the training data (21,384 of 223,549 comments) is labeled toxic.

In [7]:
display(train.toxic.value_counts())
display(valid.toxic.value_counts())

0    202165
1     21384
Name: toxic, dtype: int64

0    6770
1    1230
Name: toxic, dtype: int64
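
The counts above can be turned into class proportions to quantify the imbalance. The sketch below is a minimal, standard-library illustration that reuses the counts printed by the previous cell; the `positive_rate` helper and the two dictionaries are names introduced here for illustration and are not part of the notebook.

```python
# Counts copied from the value_counts() output above (0 = non-toxic, 1 = toxic).
train_counts = {0: 202165, 1: 21384}
valid_counts = {0: 6770, 1: 1230}


def positive_rate(counts):
    """Return the share of rows labeled 1 in a {label: count} mapping."""
    total = sum(counts.values())
    return counts[1] / total


print(f"train toxic share: {positive_rate(train_counts):.1%}")
print(f"valid toxic share: {positive_rate(valid_counts):.1%}")
```

The training share comes out just under 10% and the validation share a little over 15%, so the two sets are not balanced in the same way. On the dataframes themselves, `train.toxic.value_counts(normalize=True)` gives the same proportions directly.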

### 1.3 Cleaning steps<a class="anchor" id="1.3"></a>

#### 1.3.1 Handling emojis<a class="anchor" id="1.3.1"></a>

<p>The first cleaning step deals with emojis. I use 2 libraries for this: demoji and emoji. The demoji library lets me list, without duplicates, every emoji present in the "comment_text" column of the dataset. The emoji library is then used to convert the emojis to text. That text is also cleaned by replacing the "_" characters with spaces. This produces double spaces, which are removed in the next cleaning step.

In [8]:
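# Collect every character that is not alphanumeric, whitespace or ASCII punctuation
other_characters = []
for s in train['comment_text'].fillna('').astype(str):
    for c in s:
        # isalnum() already covers the isdigit() and isalpha() cases
        if c.isalnum() or c.isspace() or c in punct:
            continue
        other_characters.append(c)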
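
As a complement to the cell above, here is a minimal sketch of the same filter that also counts how often each unusual character occurs, which can help decide which symbols are worth handling. It uses only the standard library; the `unusual_characters` helper and the two sample strings are illustrative assumptions, not part of the notebook.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def unusual_characters(texts):
    """Count characters that are not alphanumeric, whitespace or ASCII punctuation."""
    counts = Counter()
    for text in texts:
        for ch in text:
            if ch.isalnum() or ch.isspace() or ch in PUNCT:
                continue
            counts[ch] += 1
    return counts


# Hypothetical sample; in the notebook the input would be train['comment_text'].
sample = ["game is on 🔥😂", "© 2016, all good 🔥"]
counts = unusual_characters(sample)
print(counts.most_common())
```

Note that `str.isalnum()` is true for accented letters and non-Latin scripts, so only symbols and emojis end up in the counter.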

In [9]:
demoji.findall(''.join(other_characters))

{'©': 'copyright',
 '®': 'registered',
 '‼': 'double exclamation mark',
 '™': 'trade mark',
 '↔': 'left-right arrow',
 '↕': 'up-down arrow',
 '↗': 'up-right arrow',
 '↘': 'down-right arrow',
 '↙': 'down-left arrow',
 'Ⓜ': 'circled M',
 '▪': 'black small square',
 '▫': 'white small square',
 '▶': 'play button',
 '◀': 'reverse button',
 '◾': 'black medium-small square',
 '☀': 'sun',
 '☁': 'cloud',
 '☂': 'umbrella',
 '☃': 'snowman',
 '☄': 'comet',
 '☎': 'telephone',
 '☑': 'check box with check',
 '☘': 'shamrock',
 '☝': 'index pointing up',
 '☠': 'skull and crossbones',
 '☢': 'radioactive',
 '☣': 'biohazard',
 '☪': 'star and crescent',
 '☮': 'peace symbol',
 '☯': 'yin yang',
 '☸': 'wheel of dharma',
 '☺': 'smiling face',
 '♀': 'female sign',
 '♂': 'male sign',
 '♑': 'Capricorn',
 '♟': 'chess pawn',
 '♠': 'spade suit',
 '♣': 'club suit',
 '♥': 'heart suit',
 '♦': 'diamond suit',
 '♨': 'hot springs',
 '⚔': 'crossed swords',
 '✈': 'airplane',
 '✉': 'envelope',
 '✋🏼': 'raised hand: medium-ligh

In [10]:
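def convert_emoji(text):
    # Replace each emoji with its text name, delimited by spaces instead of colons
    text_clean = emoji.demojize(text, delimiters=(" ", " "))
    # Emoji names use underscores between words; note that this also replaces
    # underscores already present in the comment (e.g. in Wikipedia page names)
    text_clean = text_clean.replace("_", " ")
    return text_clean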

<p>Below is an example of the function's output.

In [11]:
text = "game is on 🔥😂"

In [12]:
convert_emoji(text)

'game is on  fire  face with tears of joy '
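
To make the double-space side effect concrete, the sketch below mimics the conversion with the standard library only and then collapses the extra spaces. It is an approximation under stated assumptions: `unicodedata.name` stands in for `emoji.demojize` (the two do not always return the same names), and `describe_symbols` and `collapse_spaces` are hypothetical helpers introduced for this illustration.

```python
import unicodedata


def describe_symbols(text):
    """Replace each 'Symbol, other' character by its lowercase Unicode name,
    padded with one space on each side (as the notebook's delimiters do)."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            parts.append(" " + unicodedata.name(ch, "").lower() + " ")
        else:
            parts.append(ch)
    return "".join(parts)


def collapse_spaces(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


converted = describe_symbols("game is on 🔥😂")
print(repr(converted))                   # 'game is on  fire  face with tears of joy '
print(repr(collapse_spaces(converted)))  # 'game is on fire face with tears of joy'
```

The padding is what prevents two adjacent emojis from being glued together, at the cost of the doubled spaces that a later whitespace step has to remove.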

In [15]:
train['comment_text'] = train.progress_apply(lambda x: convert_emoji(x['comment_text']), axis=1)

100%|██████████| 223549/223549 [06:28<00:00, 575.62it/s]


In [16]:
train.head()

Unnamed: 0,id,comment_text,toxic,lang
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en
1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0,en
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en


In [29]:
valid['comment_text'] = valid.progress_apply(lambda x: convert_emoji(x['comment_text']), axis=1)

100%|██████████| 8000/8000 [00:13<00:00, 593.69it/s]


In [30]:
test['content'] = test.progress_apply(lambda x: convert_emoji(x['content']), axis=1)

100%|██████████| 63812/63812 [01:57<00:00, 543.57it/s]


#### 1.3.2 Removing digits and other unwanted characters<a class="anchor" id="1.3.2"></a>

<p>The next step removes unwanted characters from the "comment_text" column of the dataset. For this I use the <b>texthero</b> library, which is convenient because several cleaning tasks can be chained very quickly: each task is simply added as a step in the pipeline. Here, I remove digits, diacritics (accents, diaereses...), redundant whitespace (double spaces, line breaks, carriage returns), URLs and HTML tags. The result is stored in a new column named "comment_text_clean" and is shown below.

In [17]:
train['comment_text_clean'] = (
    train['comment_text']
    .pipe(hero.remove_digits)
    .pipe(hero.remove_diacritics)
    .pipe(hero.remove_whitespace)
    .pipe(hero.remove_urls)
    .pipe(hero.remove_html_tags))
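
For readers who want to see what each step of such a pipeline roughly does, here is a standard-library sketch that chains five small functions in the same order. It is an approximation, not texthero's implementation: the regular expressions are simplified assumptions, and the function names merely mirror the pipeline steps for readability.

```python
import re
import unicodedata


def remove_digits(text):
    # Replace standalone blocks of digits with a space.
    return re.sub(r"\b\d+\b", " ", text)


def remove_diacritics(text):
    # Decompose characters, then drop the combining marks (accents, diaereses).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_whitespace(text):
    # Collapse spaces, tabs and line breaks into single spaces.
    return " ".join(text.split())


def remove_urls(text):
    # Simplified URL pattern (assumption): http(s) followed by non-space characters.
    return re.sub(r"https?://\S+", " ", text)


def remove_html_tags(text):
    # Simplified tag pattern (assumption): anything between < and >.
    return re.sub(r"<[^>]+>", " ", text)


def clean(text):
    steps = (remove_digits, remove_diacritics, remove_whitespace,
             remove_urls, remove_html_tags)
    for step in steps:
        text = step(text)
    return text


print(clean("Thanks. (talk) 21:51, January 11, 2016 (UTC)"))
# Thanks. (talk) : , January , (UTC)
```

The order of the steps matters. In this sketch, digits are stripped before URLs, so a URL that contains numbers is already split by the time the URL pattern runs and fragments of it can survive; running the URL and HTML steps first may be worth testing.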

In [18]:
train.head()

Unnamed: 0,id,comment_text,toxic,lang,comment_text_clean
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en,"Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now. . . ."
1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) : , January , (UTC)"
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0,en,""" More I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en,"You, sir, are my hero. Any chance you remember what page that's on?"


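<p>The same cleaning is applied to the validation and test sets, except that the diacritics step is left out of these two pipelines, so accented characters are kept in the multilingual comments.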

In [38]:
valid['comment_text_clean'] = (
    valid['comment_text']
    .pipe(hero.remove_digits)
    .pipe(hero.remove_whitespace)
    .pipe(hero.remove_urls)
    .pipe(hero.remove_html_tags))

In [39]:
valid.head()

Unnamed: 0,id,comment_text,lang,toxic,comment_text_clean
0,0,"Este usuario ni siquiera llega al rango de hereje . Por lo tanto debería ser quemado en la barbacoa para purificar su alma y nuestro aparato digestivo mediante su ingestión. Skipe linkin 22px Honor, valor, leltad. 17:48 13 mar 2008 (UTC)",es,0,"Este usuario ni siquiera llega al rango de hereje . Por lo tanto debería ser quemado en la barbacoa para purificar su alma y nuestro aparato digestivo mediante su ingestión. Skipe linkin 22px Honor, valor, leltad. : mar (UTC)"
1,1,"Il testo di questa voce pare esser scopiazzato direttamente da qui. Immagino possano esserci problemi di copyright, nel fare cio .",it,0,"Il testo di questa voce pare esser scopiazzato direttamente da qui. Immagino possano esserci problemi di copyright, nel fare cio ."
2,2,"Vale. Sólo expongo mi pasado. Todo tiempo pasado fue mejor, ni mucho menos, yo no quisiera retroceder 31 años a nivel particular. Las volveria a pasar putas.Fernando",es,1,"Vale. Sólo expongo mi pasado. Todo tiempo pasado fue mejor, ni mucho menos, yo no quisiera retroceder años a nivel particular. Las volveria a pasar putas.Fernando"
3,3,"Bu maddenin alt başlığı olarak uluslararası ilişkiler ile konuyu sürdürmek ile ilgili tereddütlerim var.Önerim siyaset bilimi ana başlığından sonra siyasal yaşam ve toplum, siyasal güç, siyasal çatışma, siyasal gruplar, çağdaş ideolojiler, din, siyasal değişme, kamuoyu, propaganda ve siyasal katılma temelinde çoğulcu siyasal sistemler.Bu alt başlıkların daha anlamlı olacağı kanaatindeyim.",tr,0,"Bu maddenin alt başlığı olarak uluslararası ilişkiler ile konuyu sürdürmek ile ilgili tereddütlerim var.Önerim siyaset bilimi ana başlığından sonra siyasal yaşam ve toplum, siyasal güç, siyasal çatışma, siyasal gruplar, çağdaş ideolojiler, din, siyasal değişme, kamuoyu, propaganda ve siyasal katılma temelinde çoğulcu siyasal sistemler.Bu alt başlıkların daha anlamlı olacağı kanaatindeyim."
4,4,"Belçika nın şehirlerinin yanında ilçe ve beldelerini yaparken sanırım Portekizi örnek alacaksın. Ben de uzak gelecekte(2-3 yıl) bu tip şeyler düşünüyorum. Tabii futbol maddelerinin hakkından geldikten sonra.. daha önce mesajlarınızı görmüştüm, hatta anon bölümünü bizzat kullanıyordum sözünü anlamadım?? tanışmak bugüneymiş gibi bir şey eklemeyi düşündüm ama vazgeçtim. orayı da silmeyi unuttum. boşverin Kıdemli +",tr,0,"Belçika nın şehirlerinin yanında ilçe ve beldelerini yaparken sanırım Portekizi örnek alacaksın. Ben de uzak gelecekte( - yıl) bu tip şeyler düşünüyorum. Tabii futbol maddelerinin hakkından geldikten sonra.. daha önce mesajlarınızı görmüştüm, hatta anon bölümünü bizzat kullanıyordum sözünü anlamadım?? tanışmak bugüneymiş gibi bir şey eklemeyi düşündüm ama vazgeçtim. orayı da silmeyi unuttum. boşverin Kıdemli +"


In [40]:
test['content_clean'] = (
    test['content']
    .pipe(hero.remove_digits)
    .pipe(hero.remove_whitespace)
    .pipe(hero.remove_urls)
    .pipe(hero.remove_html_tags))

In [41]:
test.head()

Unnamed: 0,id,content,lang,content_clean
0,0,Doctor Who adlı viki başlığına 12. doctor olarak bir viki yazarı kendi adını eklemiştir. Şahsen düzelttim. Onaylarsanız sevinirim. Occipital,tr,Doctor Who adlı viki başlığına . doctor olarak bir viki yazarı kendi adını eklemiştir. Şahsen düzelttim. Onaylarsanız sevinirim. Occipital
1,1,"Вполне возможно, но я пока не вижу необходимости выделять материал в отдельную статью. Если про правосудие в СССР будет написано хотя бы килобайт 20-30 — тогда да, следует разделить. Пока же мы в итоге получим одну куцую статью Правосудие и другую не менее куцую статью Правосудие в СССР. Мне кажется, что этот вопрос вполне разумно решать на основе правил ВП:Размер статей? которые не предписывают разделения, пока размер статьи не достигнет хотя бы 50 тыс. знаков.",ru,"Вполне возможно, но я пока не вижу необходимости выделять материал в отдельную статью. Если про правосудие в СССР будет написано хотя бы килобайт - — тогда да, следует разделить. Пока же мы в итоге получим одну куцую статью Правосудие и другую не менее куцую статью Правосудие в СССР. Мне кажется, что этот вопрос вполне разумно решать на основе правил ВП:Размер статей? которые не предписывают разделения, пока размер статьи не достигнет хотя бы тыс. знаков."
2,2,"Quindi tu sei uno di quelli conservativi , che preferiscono non cancellare. Ok. Avresti lasciato anche sfaccimma ? Si? Ok. Contento te... io non approvo per nulla, ma non conto nemmeno nulla... Allora lo sai che faccio? Me ne frego! (Aborro il fascismo, ma quando ce vo , ce vo !) Elborgo (sms)",it,"Quindi tu sei uno di quelli conservativi , che preferiscono non cancellare. Ok. Avresti lasciato anche sfaccimma ? Si? Ok. Contento te... io non approvo per nulla, ma non conto nemmeno nulla... Allora lo sai che faccio? Me ne frego! (Aborro il fascismo, ma quando ce vo , ce vo !) Elborgo (sms)"
3,3,"Malesef gerçekleştirilmedi ancak şöyle bir şey vardı. Belki yararlanırsınız. İyi çalışmalar. Kud yaz Teşekkür ederim. Abidenin maddesini de genişletmeyi düşünüyorum, ileride işime yarayacak bu. cobija Kullandın mı bilmiyorum ama şunu ve şunu da ben iliştireyim. Belki kaynakçaları lazım olur )RapsarEfendim? Yok mu artıran? ) . Kullandınız mı bilmiyorum ama kullanmadıysanız alttaki model, 3d, senaryo ve yerleştirme başlıklarını da incelemenizi tavsiye ederim. Kud yaz Aynen ya, çok güzel bir kaynak ama çalışma sahiplerine attığım e-postaya bir cevap gelmedi. Oradaki çalışmaları kullanabilseydim güzel olacaktı. cobija",tr,"Malesef gerçekleştirilmedi ancak şöyle bir şey vardı. Belki yararlanırsınız. İyi çalışmalar. Kud yaz Teşekkür ederim. Abidenin maddesini de genişletmeyi düşünüyorum, ileride işime yarayacak bu. cobija Kullandın mı bilmiyorum ama şunu ve şunu da ben iliştireyim. Belki kaynakçaları lazım olur )RapsarEfendim? Yok mu artıran? ) . Kullandınız mı bilmiyorum ama kullanmadıysanız alttaki model, 3d, senaryo ve yerleştirme başlıklarını da incelemenizi tavsiye ederim. Kud yaz Aynen ya, çok güzel bir kaynak ama çalışma sahiplerine attığım e-postaya bir cevap gelmedi. Oradaki çalışmaları kullanabilseydim güzel olacaktı. cobija"
4,4,":Resim:Seldabagcan.jpg resminde kaynak sorunu :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Ancak dosyanın tanım sayfasında içeriğin kimin tarafından yapıldığı hakkında ayrıntılı bilgi bulunmamaktadır, yani telif durumu açık değildir. Eğer dosyayı kendiniz yapmadıysanız, içeriğin sahibini belirtmelisiniz. Bir internet sitesinden elde ettiyseniz nereden aldığınızı net şekilde gösteren bir bağlantı veriniz. Diğer yüklediğiniz resimleri kontrol etmek istiyorsanız bu bağlantıyı tıklayın. Kaynaksız ve lisanssız resimler hızlı silme kriterlerinde belirtildiği üzere işaretlendikten bir hafta sonra silinirler. Telif hakları saklı olup adil kullanım politikasına uymayan resimler 48 saat sonra silinirler . Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj :Resim:Seldabagcan.jpg için adil kullanım gerekçesi :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Yüklediğiniz resim adil kullanım politikasına uymak zorundadır ancak bu politikaya nasıl uyduğunu gösteren bir açıklama veya gerekçe bulunmamaktadır. Resim tanım sayfasına, kullanıldığı her madde için ayrı ayrı olacak şekilde bir adil kullanım gerekçesi yazmalısınız. Yüklediğiniz diğer resimleri kontrol etmek için bu bağlantıyı tıklayınız. Gerekçesi eksik olan adil kullanım resimleri hızlı silme kriterleri gereğince bir hafta sonra silinirler. Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj",tr,":Resim:Seldabagcan.jpg resminde kaynak sorunu :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Ancak dosyanın tanım sayfasında içeriğin kimin tarafından yapıldığı hakkında ayrıntılı bilgi bulunmamaktadır, yani telif durumu açık değildir. Eğer dosyayı kendiniz yapmadıysanız, içeriğin sahibini belirtmelisiniz. Bir internet sitesinden elde ettiyseniz nereden aldığınızı net şekilde gösteren bir bağlantı veriniz. Diğer yüklediğiniz resimleri kontrol etmek istiyorsanız bu bağlantıyı tıklayın. Kaynaksız ve lisanssız resimler hızlı silme kriterlerinde belirtildiği üzere işaretlendikten bir hafta sonra silinirler. Telif hakları saklı olup adil kullanım politikasına uymayan resimler saat sonra silinirler . Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj :Resim:Seldabagcan.jpg için adil kullanım gerekçesi :Resim:Seldabagcan.jpg resmini yüklediğiniz için teşekkürler. Yüklediğiniz resim adil kullanım politikasına uymak zorundadır ancak bu politikaya nasıl uyduğunu gösteren bir açıklama veya gerekçe bulunmamaktadır. Resim tanım sayfasına, kullanıldığı her madde için ayrı ayrı olacak şekilde bir adil kullanım gerekçesi yazmalısınız. Yüklediğiniz diğer resimleri kontrol etmek için bu bağlantıyı tıklayınız. Gerekçesi eksik olan adil kullanım resimleri hızlı silme kriterleri gereğince bir hafta sonra silinirler. Sorularınız için Vikipedi:Medya telif soruları sayfasını kullanabilirsiniz. Teşekkürler. Yabancı msj"


#### 1.3.3 Removing contractions<a class="anchor" id="1.3.3"></a>

<p>The aim here is to tidy up certain English terms by expanding contractions (which mostly occur on auxiliary verbs). For example, "you're" becomes "you are". This transformation is applied to the "comment_text_clean" column using the contractions library.

In [19]:
def remove_contractions(text):
    # Cast to str so that missing values (NaN) do not raise an error
    text_clean = contractions.fix(str(text))
    return text_clean

<p>Below is an example that tests the function.

In [20]:
remove_contractions("I'm ok")

'I am ok'
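
To illustrate the principle behind this step, here is a minimal dictionary-based sketch written with the standard library only. It is not how the contractions library is implemented: the `CONTRACTIONS` table is a tiny hypothetical sample and `expand_contractions` is a helper introduced for this illustration.

```python
import re

# Tiny illustrative table (assumption); a real table holds many more entries.
CONTRACTIONS = {
    "i'm": "I am",
    "you're": "you are",
    "it's": "it is",
    "don't": "do not",
    "can't": "can not",
    "weren't": "were not",
}

# One alternation over all known contractions, matched case-insensitively.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction by its expanded form."""
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. "It's" -> "It is".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return PATTERN.sub(replace, text)


print(expand_contractions("I'm ok"))                  # I am ok
print(expand_contractions("It's fine, don't worry"))  # It is fine, do not worry
```

One design choice differs from what the notebook output shows further down: there, "It's" at the start of a sentence comes back as lowercase "it is", whereas this sketch keeps the capital letter.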

In [21]:
train['comment_text_clean'] = train.progress_apply(lambda x: remove_contractions(x['comment_text_clean']), axis=1)

100%|██████████| 223549/223549 [00:11<00:00, 19447.64it/s]


In [22]:
train.head()

Unnamed: 0,id,comment_text,toxic,lang,comment_text_clean
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en,"Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They were not vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please do not remove the template from the talk page since I am retired now. . . ."
1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en,"D'aww! He matches this background colour I am seemingly stuck with. Thanks. (talk) : , January , (UTC)"
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en,"Hey man, I am really not trying to edit war. it is just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0,en,""" More I can not make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. it is listed in the relevant form eg Wikipedia:Good article nominations#Transport """
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en,"You, sir, are my hero. Any chance you remember what page that is on?"


#### 1.3.4 Handling spelling mistakes<a class="anchor" id="1.3.4"></a>

Here I try to correct, as far as possible, the spelling mistakes that may have slipped into the comments. I use the autocorrect library with the fast parameter set to True. This option saves a lot of processing time but can slightly degrade the quality of the corrections. The correction is applied only to English comments, on the "comment_text_clean" column.

In [23]:
check = Speller(lang='en', fast=True)

In [24]:
def make_corrections(text):
  text_clean = check(text)
  return text_clean

<p>Let us test the function on a misspelled text.

In [25]:
make_corrections("It is veiry goood !")

'It is very good !'

In [27]:
train['comment_text_clean'] = train.progress_apply(lambda x: make_corrections(x['comment_text_clean']), axis=1)

100%|██████████| 223549/223549 [08:16<00:00, 450.41it/s]


In [28]:
train.head()

Unnamed: 0,id,comment_text,toxic,lang,comment_text_clean
0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en,"Explanation Why the edits made under my username Hardcore Metallic Fan were reverted? They were not vandalisms, just closure on some As after I voted at New York Polls FoC. And please do not remove the template from the talk page since I am retired now. . . ."
1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en,"D'www! He matches this background colour I am seemingly stuck with. Thanks. (talk) : , January , (UTC)"
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en,"Hey man, I am really not trying to edit war. it is just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0,en,""" More I can not make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. it is listed in the relevant form eg Wikipedia:Good article nominations#Transport """
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en,"You, sir, are my hero. Any chance you remember what page that is on?"
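The output above shows a side effect of correcting every word: proper nouns and acronyms get rewritten ("Metallica" becomes "Metallic", "Dolls" becomes "Polls", "GAs" becomes "As"). The sketch below is one possible mitigation, not something the notebook does: it only passes purely lowercase ASCII words to the corrector and leaves everything else untouched. The function name `correct_lowercase_words` and the toy corrector are assumptions introduced for illustration.

```python
import re

# A "word" here is a run of letters, optionally joined by apostrophes.
_WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def correct_lowercase_words(text, corrector):
    """Apply `corrector` only to lowercase ASCII words; keep the rest as-is."""
    def _fix(match):
        word = match.group(0)
        if word.islower() and word.isascii():
            return corrector(word)
        # Capitalised words, acronyms and non-ASCII words are left alone.
        return word

    return _WORD.sub(_fix, str(text))


# Toy corrector standing in for a real speller (hypothetical data).
_toy_fixes = {"veiry": "very", "goood": "good"}


def toy_corrector(word):
    return _toy_fixes.get(word, word)


print(correct_lowercase_words("It is veiry goood !", toy_corrector))
```

In the notebook, the `check` object defined earlier could presumably be passed as `corrector`, since it is called on strings in `make_corrections`. The trade-off is that misspelled capitalised words are no longer corrected.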


#### 1.3.5 Language detection<a class="anchor" id="1.3.5"></a>

<p>Finally, for this last step, I try to detect the language of the comments in the training set. To do so, I use the spacy library with a multilingual model.

In [21]:
nlp = xx_ent_wiki_sm.load()
language_detect = LanguageDetector()
nlp.add_pipe(language_detect)

In [22]:
def get_lang_score(text, lang):
    try:
        doc = nlp(str(text))
        language_scores = doc._.language_scores
        return language_scores.get(lang, 0)
    except Exception:
        return 0
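The function above returns 0 in two situations: when an exception is raised, and when the expected language is simply absent from the detector's scores (the default of `.get(lang, 0)`). The small sketch below reproduces only the second case on plain dictionaries; the score values are hypothetical and `expected_language_score` is a name introduced for illustration.

```python
def expected_language_score(language_scores, expected_lang):
    """Same lookup as in get_lang_score, applied to a plain dict."""
    return language_scores.get(expected_lang, 0)


# Hypothetical detector outputs, for illustration only.
detected_english = {"en": 0.99}
detected_german = {"de": 0.97}
nothing_detected = {}

print(expected_language_score(detected_english, "en"))
print(expected_language_score(detected_german, "en"))
print(expected_language_score(nothing_detected, "en"))
```

A score of 0 therefore does not distinguish "another language was detected" from "nothing could be detected", which is worth keeping in mind when reading the filtered rows.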

In [23]:
train['lang_score'] = train.progress_apply(lambda x: get_lang_score(x['comment_text_clean'], x['lang']), axis=1)

100%|██████████| 223549/223549 [25:58<00:00, 143.44it/s]


In [24]:
train[train['lang_score'] < 0.8]

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic,lang,comment_text_clean,lang_score
146,146,005de39c51ef844a,"Azari or Azerbaijani? \n\nAzari-iranian,azerbaijani-turkic nation.",0,en,"Zari or Azerbaijani? Zari-iranian,azerbaijani-turkic nation.",0.0
177,177,006ca45465868e64,"86.29.244.57|86.29.244.57]] 04:21, 14 May 2007",0,en,". . . | . . . ]] : , May",0.0
182,182,006eaaaca322e12d,""") (ETA: John D. Haynes House. SarekOfVulcan (talk) """,0,en,""") (TA: John D. Raynes House. SarekOfVulcan (talk) """,0.0
281,281,00b211fa0c65d328,"2005 (UTC)\n\n 15:59, 17 December",0,en,"(UTC) : , December",0.0
702,702,01e82a7c3b00c42a,"Valerie Poxleitner \n\nValeri Poxleitner, A.K.A. Lights. If",0,en,"Valerin Poxleitner Valerin Poxleitner, A.K.A. Lights. If",0.0
...,...,...,...,...,...,...,...
223351,223351,ff2a0602fc19d1cc,"== Madde == \n\n Bu madde yalan Olmuş !!!! \n\n Bu kapsamda, Türkiye’nin güvenlik kaygıları temel olarak \n Terörizm, uzun menzilli füzeler ve kitle imha silahlarının yayılması, İrticai faaliyetler, Bölgesel çatışmalardan kaynaklanmaktadır.",0,en,"== Made == Bu made alan Ulmus !!!! Bu kapsamda, Turkize'in guvenlik kaygilari tewel olarak Terorizm, uzan menzilli fueler ve title imha silahlarinin yayilmasi, Irticai faaliyetler, Bolgesel catismalardan kaynaklanmaktadir.",0.0
223355,223355,ff2c0ea4be7d7a16,""":::::::::::Nije ni slicna. Crnogorac je uz etnicku i regionalna oznaka slicno kao Sumadija, Vojvodina, itd., a Bunjevac je pod-etnicka oznaka za odredjenu skupinu Hrvata, slicno kao i Sokci, Boduli, Janjevci....vec sam to rekao pet puta ali cini se da ti to nikako ne ulazi u glavu. A """"velikohrvatsvo"""" ne postoji, ono je izmisljeni pojam da bi se izjednacio sa pojmom 'Velika Srbija' koje su srpski nacionalisti sami izmislili i koristili. \n\n """,0,en,""":::::::::::Nine ni slicna. Crnogorac je up etnicku i regionalna oznaka slicno ka Sumadija, Vojvodina, it would., a Bunjevac je pod-etnicka oznaka za odredjenu skupinu Hrvata, slicno ka i Sorci, Noduli, Janjevci....ve sam to rekao pet put ali mini se da ti to nikako ne ulazi you glave. A """"velikohrvatsvo"""" ne postoji, ono je izmisljeni pojam da bi se izjednacio sa pojmom 'Vedika Srbija' kore su srpski nacionalisti sami izmislili i koristili. """,0.0
223380,223380,ff4ae0cb03e213ca,OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTOMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF 
OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTOMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF 
OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF OMFG WTF,0,en,MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF 
MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTOMFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF 
MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTOMFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF MFG WTF,0.0
223405,223405,ff60bd49939604f4,"Mein lieber Brendel, Ich bin auch deutscher Herkunft. Mein Vater ist Deutsch, meine Mutter aus Spanien. Ich meine die Nazi-Germanisten, nicht die Deutschen.",0,en,"Vein limber Trendel, Ch bin such deutsche Herkunft. Vein Later ist Deutsche, mine Utter as Spaniel. Ch mine die Nazi-Germanisten, nicht die Deutsche.",0.0


<p>This method does not seem robust enough to me. In the end I decide not to filter anything, even though some comments clearly add nothing and are in any case not labelled as "toxic". Moreover, since this is a multilingual problem, having a few comments in languages other than English should not cause the model any trouble during training.

## 2. Data augmentation<a class="anchor" id="2"></a>

<p>As with image data, it is entirely possible to do data augmentation in NLP. I wanted to test it in this project and measure its effect on our problem. This augmentation can take several forms: replacing words with synonyms, or inserting or substituting words with others considered to belong to the same context... To do this, one relies on pretrained resources such as BERT, XLNet or WordNet. I will use the last of these to replace words with synonyms.

### 2.1 Installing the packages<a class="anchor" id="2.1"></a>

<p>First, as for the cleaning step, this part covers installing the required packages and dictionaries. In my case, the nlpaug library is installed and used.

In [9]:
!pip install nlpaug

Collecting nlpaug
  Downloading https://files.pythonhosted.org/packages/1f/6c/ca85b6bd29926561229e8c9f677c36c65db9ef1947bfc175e6641bc82ace/nlpaug-0.0.14-py3-none-any.whl (101kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-0.0.14


In [14]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### 2.2 Synonym method<a class="anchor" id="2.2"></a>

<p>As explained above, the synonym method is the one used and tested here. The example below illustrates the value of this method: it enriches the corpus with new features, in the hope of improving the model further.

In [15]:
aug = naw.SynonymAug(aug_src='wordnet')

In [16]:
text = 'The quick brown fox jumps over the lazy dog .'
print(text)

The quick brown fox jumps over the lazy dog .


In [17]:
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The fast brown fox leap out over the faineant dog .
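The mechanism behind this kind of augmentation can be sketched with the standard library alone. The version below is a toy: it uses a hypothetical hand-written synonym table instead of WordNet, splits on whitespace, and is not nlpaug's implementation. `TOY_SYNONYMS`, `synonym_augment` and the `aug_p` parameter are names chosen for illustration.

```python
import random

# Hypothetical synonym table (illustration only); nlpaug queries WordNet instead.
TOY_SYNONYMS = {
    "quick": ["fast", "speedy"],
    "jumps": ["leaps"],
    "lazy": ["idle", "indolent"],
}


def synonym_augment(text, synonyms, aug_p=0.3, rng=None):
    """Replace each word that has synonyms with probability `aug_p`."""
    rng = rng or random.Random()
    augmented = []
    for token in text.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < aug_p:
            # Context is ignored: any listed synonym may be picked.
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)
    return " ".join(augmented)


sentence = "The quick brown fox jumps over the lazy dog ."
print(synonym_augment(sentence, TOY_SYNONYMS, aug_p=1.0, rng=random.Random(0)))
```

Because the lookup ignores context, a word can be swapped for a synonym of the wrong sense. This probably explains replacements such as "He" turning into "Helium" in the augmented column shown further down.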


<p>The result is saved in a new column named "comment_text_aug". This approach lets me test my different models on 3 different types of input. The goal is then to find the best possible combination.

In [28]:
def make_augmentation(text):
  text_aug = aug.augment(str(text))
  return text_aug

In [None]:
train['comment_text_aug'] = train.progress_apply(lambda x: make_augmentation(x['comment_text_clean']), axis=1)

In [30]:
train.head()

Unnamed: 0.1,Unnamed: 0,id,comment_text,toxic,lang,comment_text_clean,lang_score,comment_text_aug
0,0,0000997932d777bf,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,en,"Explanation Why the edits made under my username Hardcore Metallic Fan were reverted? They were not vandalisms, just closure on some As after I voted at New York Polls FoC. And please do not remove the template from the talk page since I am retired now. . . .",0.99,"Explanation Why the edits made nether my username Hardcore Metallic Fan were reverted ? They were not vandalisms , just closure on some As after I voted at New House of york Polls FoC . And please do not remove the template from the talk pageboy since I make up adjourn now . . . ."
1,1,000103f0d9cfb60f,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,en,"D'www! He matches this background colour I am seemingly stuck with. Thanks. (talk) : , January , (UTC)",0.98,"500 ' www ! Helium match this background colour I am on the face of it stuck with . Thanks . ( talk ) : , Jan , ( coordinated universal time )"
2,2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,en,"Hey man, I am really not trying to edit war. it is just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0.99,"Hey man , I am really not trying to edit war . it is just that this cat is constantly removing relevant information and talking to pine tree state through edits instead of my talk page . Atomic number 2 seems to care more about the data formatting than the actual info ."
3,3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0,en,""" More I can not make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. it is listed in the relevant form eg Wikipedia:Good article nominations#Transport """,0.99,""" More Single can not get any genuine suggestions on improvement - I wondered if the section statistics should be later on , or a subdivision of "" "" types of accidents "" "" - I think the references may need tidying indeed that they are all in the exact same format ie date format etc . I lav do that later on , if no - one else does first - if you have any preferences for formatting style on references or want to do it yourself delight let me acknowledge . There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up . it comprise listed in the relevant form eg Wikipedia : Good article nominations # Transport """
4,4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember what page that's on?",0,en,"You, sir, are my hero. Any chance you remember what page that is on?",0.98,"You , sir , are my bomber . Any chance you call up what thomas nelson page that is on ?"


<p>The resulting datasets are saved to make the various model submissions on Kaggle easier during the competition.

In [31]:
train.to_csv('train_clean.csv')
valid.to_csv('valid_clean.csv')
test.to_csv('test_clean.csv')

## 3. Modelling<a class="anchor" id="3"></a>

<p>Now that our data has been cleaned, we can move on to modelling. Several tests will be run on 2 types of models that are particularly well suited to multilingual contexts. I split this chapter into several steps that explain the general case while linking it to the final model. I then detail the procedure followed to run my tests.

### 3.1 TPU configuration<a class="anchor" id="3.1"></a>

<p>The first step is to configure the TPU and load the various libraries, such as transformers and Keras from TensorFlow. The TPU detection code was taken directly from another Kaggle notebook. Using the TPU shortens turnaround time, making processing roughly 20 times faster.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting tokenizers==0.8.1.rc1 (from transformers)
  Downloading https://files.pythonhosted.org/packages/bf/9f/0bc9d97fc87b91a9f9be68623652734017caac523465ff47b980dd453ae4/tokenizers-0.8.1rc1-cp37-cp37m-win_amd64.whl (1.9MB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading https://files.pythonhosted.org/packages/9c/d1/d2ecb51a8cb38c8278e77a2731c1366881e0dea9671f135d2625f15a73a4/regex-2020.7.14-cp37-cp37m-win_amd64.whl (268kB)
Collecting sentencepiece!=0.1.92 (from transformers)
  Downloading https://files.pythonhosted.org/packages/78/c7/fb817b7f0e8a4df1b1973a8a66c4db6fe10794a679cb3f39cd27cd1e182c/sentencepiece-0.1.91-cp37-cp37m-win_amd64.whl (1.2MB)
Collecting sacremoses (from transformers)
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a153

In [3]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

import transformers
from transformers import TFAutoModel, AutoTokenizer

from tokenizers import BertWordPieceTokenizer

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

### 3.2 Fonction d'encodage<a class="anchor" id="3.2"></a>

<p>Encoding is a required step before the models can be run. It converts the words and sentences of the input into a numerical representation, in 2 steps:
    <ul>
        <li><b>Tokenization:</b> each token of the corpus is mapped to an index in a vocabulary. This vocabulary comes from a pre-trained model, the same one that is used later to build the classifier. Here, the tokenizer is loaded by the "get_tokenizer" function.</li>
        <li><b>Token encoding:</b> the "encode" function converts each comment into a sequence of token indices by calling the tokenizer's "batch_encode_plus" method. All sequences must have the same length, which is why the method takes a padding parameter. The result is returned as a NumPy array; the transformer layer later turns these indices into context-dependent vector representations.</li>
    </ul>
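<p>The two steps above can be illustrated with a toy sketch. It is not the Hugging Face tokenizer: the vocabulary, the whitespace splitting and the "toy_encode" name are assumptions made for the example, whereas the real tokenizer uses a learned subword vocabulary. It only shows how texts become fixed-length sequences of indices through truncation and padding.

```python
# Toy illustration only: hypothetical vocabulary, whitespace splitting.
TOY_VOCAB = {"<pad>": 0, "<unk>": 1, "you": 2, "are": 3, "great": 4}


def toy_encode(texts, max_len):
    """Map words to indices, then truncate or pad each sequence to max_len."""
    encoded = []
    for text in texts:
        # Unknown words fall back to the <unk> index
        ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
        ids = ids[:max_len]                                  # truncate long sequences
        ids += [TOY_VOCAB["<pad>"]] * (max_len - len(ids))   # pad short ones
        encoded.append(ids)
    return encoded


batch = toy_encode(["You are great", "you are really really great"], max_len=4)
print(batch)
```

Every row of "batch" has the same length, which is what allows the sequences to be stacked into a single array.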
<p>To run these computations, and to make it easy to switch from one pre-trained model to another, I use the "Auto" classes of the Hugging Face transformers library (AutoTokenizer and TFAutoModel). These classes select the target architecture from the name or path of the pre-trained model passed to the "from_pretrained" method. For example, passing "bert-base-multilingual-cased" tells the library to load the BERT architecture.
<p>Before encoding, I define the variables used throughout the modelling: the batch size, the length of the encoded sequences (limited to the first 192 tokens), the number of training epochs and, above all, the pre-trained model used for transfer learning. For the final model, the MODEL variable is set to 'jplu/tf-xlm-roberta-large', the pre-trained model that gave the best results, presented later.

In [7]:
# Configuration

AUTO = tf.data.experimental.AUTOTUNE

SEQUENCE_LENGTH = 192

BATCH_SIZE = 16 * strategy.num_replicas_in_sync

EPOCHS = 3

#MODEL = 'distilbert-base-multilingual-cased'
#MODEL = 'bert-base-multilingual-cased'
MODEL = 'jplu/tf-xlm-roberta-large'

In [None]:
def get_tokenizer():
    """Get Tokenizer"""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
  
    return tokenizer

tokenizer = get_tokenizer()

In [12]:
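def encode(texts, tokenizer, maxlen=512):
    """Tokenize texts and return their token ids as an array of shape (n_texts, maxlen)."""
    # Argument names follow the transformers release used for the competition.
    # Recent releases expect return_attention_mask=False, padding='max_length'
    # and truncation=True instead of the two deprecated arguments below.
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_attention_masks=False,
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return np.array(enc_di['input_ids'])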

<p>Now that the encoding functions are defined, I can apply them to my inputs. As a reminder, the cleaning phase left me with 3 possible datasets: raw, cleaned, and cleaned with synonym-based data augmentation. The idea is to test the pre-trained models on each available dataset and determine the best combination.
<p>The tests show that the cleaned data without augmentation is the best option.

In [None]:
%%time 

#Text original
#x_train = encode(train.comment_text.values, tokenizer, maxlen=SEQUENCE_LENGTH)
#x_valid = encode(valid.comment_text.values, tokenizer, maxlen=SEQUENCE_LENGTH)
#x_test = encode(test.content.values, tokenizer, maxlen=SEQUENCE_LENGTH)

#Text clean
x_train = encode(train.comment_text_clean.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)
x_valid = encode(valid.comment_text_clean.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)
x_test = encode(test.content_clean.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)

#Text augmentation
#x_train = encode(train.comment_text_aug.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)
#x_valid = encode(valid.comment_text_clean.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)
#x_test = encode(test.content_clean.astype(str), tokenizer, maxlen=SEQUENCE_LENGTH)

y_train = train.toxic.values
y_valid = valid.toxic.values

<p>To finalize the datasets, I use TensorFlow's tf.data.Dataset API, which builds a dataset from the input data and applies transformations to it. Organizing the data this way lets the pipeline consume the inputs efficiently, in both execution time and memory.

In [None]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
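<p>To give an intuition of what the "shuffle(2048)" and "batch(BATCH_SIZE)" transformations do, here is a minimal pure-Python sketch. It mimics the idea of a fixed-size shuffle buffer followed by batching; it is not TensorFlow's implementation, and the names "buffered_shuffle" and "batched" are hypothetical helpers written for this example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)           # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                 # emit a random buffered element
        buffer[idx] = item                # and replace it with the new one
    rng.shuffle(buffer)
    yield from buffer                     # drain what is left at the end


def batched(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


batches = list(batched(buffered_shuffle(range(10), buffer_size=4), batch_size=4))
print(batches)
```

An element can only be emitted once it has entered the buffer, so a small buffer gives a weak shuffle and a buffer as large as the dataset gives a full shuffle.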

### 3.3 Building the model<a class="anchor" id="3.3"></a>

<p>The next step is to build the model. It is a neural network whose first layer wraps the transformer model loaded with TFAutoModel. The "from_pretrained" method retrieves the weights of the selected model.
<p>The build_model function builds and returns the model from the transformer architecture under test. A single-unit sigmoid output layer, matching our binary problem, is added on top of the representation of the first token of the sequence. The model is then compiled with the Adam optimizer, a "binary crossentropy" loss and "accuracy" as the metric.

In [None]:
def build_model(transformer, max_len=512):
    """
    Build and compile a binary classifier on top of a transformer layer
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    
    sequence_output = transformer(input_word_ids)[0]
    # Keep the hidden state of the first token as the sequence representation
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
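<p>The output layer and the loss used in build_model can be checked by hand on a single example. The sketch below is a plain-Python illustration of the sigmoid activation and of the binary cross-entropy formula; the logit value is hypothetical, and the clipping constant "eps" is an assumption added to avoid log(0).

```python
import math


def sigmoid(z):
    """Squash a raw score (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithms finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


logit = 2.0                            # hypothetical raw output of the Dense unit
p = sigmoid(logit)                     # predicted probability of "toxic"
loss_if_toxic = binary_crossentropy(1, p)
loss_if_clean = binary_crossentropy(0, p)
print(p, loss_if_toxic, loss_if_clean)
```

A confident prediction is cheap when it is right and expensive when it is wrong, which is what pushes the model towards calibrated probabilities.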

<p>This step builds the model and places it on the TPU. This is also where the target architecture is selected, through the "from_pretrained" method that loads the pre-trained weights.

In [None]:
%%time
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    model = build_model(transformer_layer, max_len=SEQUENCE_LENGTH)
model.summary()

### 3.4 Training<a class="anchor" id="3.4"></a>

<p>To recap: the data is loaded into training, validation and test datasets, and the model has been compiled and placed on the TPU. It is time to train. Training is split into 2 steps:
    <ul>
        <li>Training on train_dataset with validation on valid_dataset: the model is trained on the English data for 3 epochs.</li>
        <li>Training on valid_dataset: additional training steps are run on the validation set, which contains multilingual data. This dataset is smaller and is trained for twice as many epochs, i.e. 6.</li>
    </ul>

In [None]:
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

In [None]:
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=EPOCHS*2
)
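<p>Because the training datasets are repeated indefinitely, Keras cannot infer where an epoch ends, which is why "steps_per_epoch" is passed explicitly in both calls to "fit". The arithmetic can be sketched as follows. The number of replicas and the row counts are hypothetical values chosen for the example, not the actual sizes of the competition files.

```python
replicas = 8                     # assumption: number of TPU cores in sync
batch_size = 16 * replicas       # same formula as BATCH_SIZE in the configuration
epochs = 3                       # same value as EPOCHS

n_train_rows = 200_000           # hypothetical size of the training set
n_valid_rows = 8_000             # hypothetical size of the validation set

# One step consumes one global batch; the incomplete last batch is dropped
train_steps = n_train_rows // batch_size
valid_steps = n_valid_rows // batch_size

# First phase: EPOCHS epochs on the training set; second phase: EPOCHS*2 on validation
total_updates = train_steps * epochs + valid_steps * epochs * 2
print(batch_size, train_steps, valid_steps, total_updates)
```

With these assumed sizes, the second phase adds only a small number of weight updates compared with the first one, even with twice as many epochs.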

### 3.5 Submitting the model<a class="anchor" id="3.5"></a>

<p>This is the last step of the process. The model is trained, so I can predict the target on the test set, add it to the dataframe and save the result to a csv file named "submission.csv". Submitting this file gives me my score.

In [None]:
sub = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv")
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)

### 3.6 Procedure and tests<a class="anchor" id="3.6"></a>

<p>The previous sections described the methodology and the code of the model that gave my best score.
<p>Other tests were also run, for about ten submissions in total. I tested:
    <ul>
        <li>The pre-trained model "distilbert-base-multilingual-cased" with the 3 available datasets</li>
        <li>The pre-trained model "jplu/tf-xlm-roberta-large" with the 3 available datasets</li>
        <li>The pre-trained model "bert-base-multilingual-cased" with the "comment_text_clean" column</li>
        <li>The pre-trained model "jplu/tf-xlm-roberta-large" with the "comment_text_clean" column, using EarlyStopping so that more training epochs could be run</li>
        <li>The pre-trained model "jplu/tf-xlm-roberta-large" with the "comment_text_clean" column, adding a Dropout layer (see code below)</li>
    </ul>

In [8]:
def build_model(transformer, max_len=512):
    """
    Build and compile a binary classifier with Dropout before the output layer
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    # Randomly zero half of the features during training to limit overfitting
    x = tf.keras.layers.Dropout(0.5)(cls_token) 
    out = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
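<p>To make the effect of the Dropout layer concrete, here is a minimal plain-Python sketch of inverted dropout, the scheme in which kept values are rescaled during training so that nothing needs to change at inference time. The "dropout" function and its "seed" argument are hypothetical helpers for this illustration, not the Keras layer itself.

```python
import random


def dropout(values, rate, training, seed=0):
    """Zero each value with probability `rate` and rescale the kept ones."""
    if not training or rate == 0.0:
        return list(values)              # inference: values pass through unchanged
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Kept values are divided by the keep probability to preserve the expected sum
    return [v / keep if rng.random() < keep else 0.0 for v in values]


features = [1.0] * 8                     # hypothetical first-token features
train_out = dropout(features, rate=0.5, training=True)
infer_out = dropout(features, rate=0.5, training=False)
print(train_out)
print(infer_out)
```

With a rate of 0.5, each feature is either dropped or doubled during training, and left untouched at prediction time.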

<p>From there, I ran one last test, keeping the best model but training it on more data. From the file '../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv', which contains nearly 2 million comments, I added the first 250,000 rows to the training set. The 'toxic' variable had to be transformed because it holds a toxicity probability rather than a binary label: a probability of 0.5 or more becomes 1, otherwise 0. This data was also cleaned following the same approach as in the preprocessing part above. (see code below)

In [None]:
train_1 = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
train_2 = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv')

train_1 = train_1[['id','comment_text','toxic']]
train_2 = train_2[['id','comment_text','toxic']]

# Keep the first 250,000 rows; .copy() avoids a SettingWithCopyWarning on the next line
train_2 = train_2[0:250000].copy()
# Binarize the toxicity probability: below 0.5 -> 0, otherwise 1
train_2['toxic'] = train_2['toxic'].apply(lambda x:0 if x<0.5 else 1)

# Stack both training sets
frames = [train_1,train_2]
train = pd.concat(frames)
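<p>The thresholding rule applied to the 'toxic' column can be checked on a few values. This is a small stand-alone sketch; the "binarize" helper and the sample scores are made up for the illustration.

```python
def binarize(scores, threshold=0.5):
    """Turn toxicity probabilities into 0/1 labels (threshold itself maps to 1)."""
    return [0 if s < threshold else 1 for s in scores]


sample_scores = [0.0, 0.49, 0.5, 0.83]   # hypothetical toxicity probabilities
labels = binarize(sample_scores)
print(labels)
```

Note that a score of exactly 0.5 is labelled as toxic with this rule.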

## 4. Results<a class="anchor" id="4"></a>

<p>Each test described above gave a score on the Kaggle platform, which changed as improvements and modifications were made, sometimes for the better, sometimes not. My public score went from 0.8674 to a maximum of <b>0.9346</b>.
<p>Compared with the best results of the competition, this places me in the bottom half of the leaderboard, around 1000th place (out of 1650), the best score being 0.9556.
<p>From the community discussions, I concluded that the chosen pre-trained model (XLM-RoBERTa) seemed to be the most suitable for this problem. This model is based on RoBERTa (Robustly optimized BERT approach), a retraining of BERT with an improved training methodology and much more data and compute time. It is therefore logical to get better results with this model than with BERT or DistilBERT.
<p>I therefore think the gap with the top scores comes down to 2 things: the volume of training data and the preprocessing. I observed that cleaned data gave better results, so a second, more thorough cleaning pass might raise the score. Also, I used only a small part of the second training set provided. The fairly long processing time of a single run discouraged me from using more data, but it might have been worthwhile to add the whole second file in the hope of improving the score once more.

## Conclusion<a class="anchor" id="5"></a>

<p>This project was very rewarding for me in several respects. First, I got a much better idea of what the Kaggle platform can bring me in the future and of how its competitions work. Second, the text preprocessing part let me discover new libraries and do something different from a previous project. Finally, I got to apply pre-trained NLP models to multilingual problems and to use cases found in everyday life.

## Sources<a class="anchor" id="6"></a>

<p>I relied on several kernels shared by the community for this competition:
    <ul>
        <li> https://www.kaggle.com/rftexas/cleaning-and-removing-mis-spells-from-texts</li>
        <li> https://www.kaggle.com/mobassir/understanding-cross-lingual-models#Setup-TPU-configuration</li>
        <li> https://www.kaggle.com/xhlulu/jigsaw-tpu-xlm-roberta/comments#Build-datasets-objects</li>
    </ul>