# Multilingual Toxic Comment Classification
It only takes one toxic comment to sour an online discussion. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet.

The main difference between this challenge and earlier ones is the **English-only training data**, while the test set is multilingual: Wikipedia talk page comments in several different languages.

As our computing resources and modeling capabilities grow, so does our potential to support healthy conversations across the globe. Develop strategies to build effective multilingual models and you'll help Conversation AI and the entire industry realize that potential.

Disclaimer: This notebook contains text that may be considered profane, vulgar, or offensive.

In [2]:
# Some imports

import pandas as pd
import numpy as np
import torch
import PersonCreator
import matplotlib.pyplot as plt

from tqdm import tqdm
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.metrics import accuracy_score, roc_curve
from scipy.special import softmax
from joblib import load



The competition organizers provide two training datasets from previous competitions: [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [Jigsaw Unintended Bias In Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification).
## Jigsaw Toxic Comment Data

In [7]:
train_data_toxic_comment = pd.read_csv("./Data/jigsaw-toxic-comment-train.csv")

In [10]:
print("Samples from the dataset: ")
train_data_toxic_comment.sample(5).iloc[:, 1:]

Samples from the dataset: 


| | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 10488 | December 2007\n Benjiboi | 0 | 0 | 0 | 0 | 0 | 0 |
| 17865 | Fuck you Greenman \n\nFucking piece of shit.\n... | 1 | 1 | 1 | 0 | 1 | 0 |
| 2450 | "==On the Removal of the Redirect==\n\nThis pa... | 0 | 0 | 0 | 0 | 0 | 0 |
| 203288 | " \n\n \n\n \n\n \n Hello \n Do you think ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 162097 | " \n\n == Charlie and the Chocolate Factory (v... | 0 | 0 | 0 | 0 | 0 | 0 |


In [13]:
def make_len(x):
    return len(x.split())

train_data_toxic_comment["len"] = train_data_toxic_comment["comment_text"].apply(make_len)

print("Mean length of the comments in words: ", train_data_toxic_comment["len"].mean())

Mean length of the comments in words:  66.51716625885153


In [17]:
print("Toxic column contains 0 or 1 - non-toxic or toxic: ")
print("The percent of toxic comments: ", train_data_toxic_comment["toxic"].mean() * 100)
print("Total number of comments: ", len(train_data_toxic_comment))

Toxic column contains 0 or 1 - non-toxic or toxic: 
The percent of toxic comments:  9.565688059441108
Total number of comments:  223549
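
Beyond the mean length and the toxic share, the other summary figures for this dataset (uniqueness, missing values, length spread, class counts) could be computed along the following lines. This is a minimal sketch on a toy DataFrame, not the notebook's own code: the helper name `summarize` and the toy rows are assumptions made for illustration, and only the column names `comment_text` and `toxic` come from the data above.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical rows, same column names).
toy = pd.DataFrame({
    "comment_text": [
        "thanks for the edit",
        "you are an idiot",
        "see the talk page please",
        "nice work",
    ],
    "toxic": [0, 1, 0, 0],
})

def summarize(df):
    """Return basic quality and class-balance statistics for a comment DataFrame."""
    # Comment length in whitespace-separated words, as in make_len above.
    lengths = df["comment_text"].str.split().str.len()
    return {
        "n_comments": len(df),
        "all_unique": bool(df["comment_text"].is_unique),
        "n_missing": int(df.isna().sum().sum()),
        "mean_len": float(lengths.mean()),
        "std_len": float(lengths.std()),
        "n_toxic": int((df["toxic"] == 1).sum()),
        "n_non_toxic": int((df["toxic"] == 0).sum()),
    }

stats = summarize(toy)
print(stats)
```

Calling `summarize(train_data_toxic_comment)` on the real frame should give the corresponding figures for the full dataset, assuming the columns are named as shown in the sample.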


Summary of the Toxic Comment data:
* 223549 - total number of comments
* All of the comments are unique
* 5 additional columns, describing different types of toxicity:  
   * severe_toxic
   * obscene
   * threat
   * insult 
   * identity_hate
* No missing values
* 66.5 words - mean comment length
* 99 words - standard deviation of comment length
* 202165 non-toxic comments
* 21384 toxic comments
* **About 10% of comments are toxic** - a serious class imbalance

## Jigsaw Unintended Bias Data

In [7]:
train_data_unintended_bias = pd.read_csv("./Data/jigsaw-unintended-bias-train.csv")

In [10]:
print("Samples from the dataset: ")
train_data_toxic_comment.sample(5).iloc[:, 1:]

Samples from the dataset: 


Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
10488,December 2007\n Benjiboi,0,0,0,0,0,0
17865,Fuck you Greenman \n\nFucking piece of shit.\n...,1,1,1,0,1,0
2450,"""==On the Removal of the Redirect==\n\nThis pa...",0,0,0,0,0,0
203288,""" \n\n \n\n \n\n \n Hello \n Do you think ...",0,0,0,0,0,0
162097,""" \n\n == Charlie and the Chocolate Factory (v...",0,0,0,0,0,0


In [13]:
def make_len(x):
    return len(x.split())

train_data_toxic_comment["len"] = train_data_toxic_comment["comment_text"].apply(make_len)

print("Mean length of the comments in words: ", train_data_toxic_comment["len"].mean())

Mean length of the comments in words:  66.51716625885153


In [17]:
print("Toxic column contains 0 or 1 - non-toxic or toxic: ")
print("The percent of toxic comments: ", train_data_toxic_comment["toxic"].mean() * 100)
print("Total number of comments: ", len(train_data_toxic_comment))

Toxic column contains 0 or 1 - non-toxic or toxic: 
The percent of toxic comments:  9.565688059441108
Total number of comments:  223549


In [2]:
clf = load('random_forest.joblib')

PersonCreator.display_creator()

[Interactive widget: five rows of identity checkboxes, starting with 'asian', 'atheist', 'bisexual', 'female' and 'intellectual_or_learning_disability' respectively]

In [6]:
person = PersonCreator.create().reshape(1, -1)
print(person)
print(clf.predict(person))

tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.]])
[0.]


### The main approach

![BERT](https://cdn-images-1.medium.com/max/1600/1*qimS08ZL6noWDB9k29EOWQ.png)

### Main challenges
* The training data is English-only, while the test data is multilingual, so the model has to learn the classes from English text and extrapolate them to several other languages.
* The classes are imbalanced: toxic comments are a small minority.


### First approach
* We use a pretrained multilingual BERT and fine-tune it on the training dataset.
* We also upsample the minority class and undersample the majority class to reduce class imbalance and make the model less biased.
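
The resampling step could look roughly like the sketch below. It is an illustration under stated assumptions rather than the notebook's actual procedure: the helper `rebalance`, its parameters, and the default target size (the mean of the two class counts) are hypothetical choices, and the toy frame only mimics the `toxic` column of the real data.

```python
import pandas as pd

def rebalance(df, label_col="toxic", n_per_class=None, seed=0):
    """Undersample larger classes and upsample smaller ones to a common size."""
    counts = df[label_col].value_counts()
    if n_per_class is None:
        # Assumed default: meet halfway between the class sizes.
        n_per_class = int(counts.mean())
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        # Sample with replacement only when the class is smaller than the target.
        with_replacement = bool(count < n_per_class)
        parts.append(
            group.sample(n=n_per_class, replace=with_replacement, random_state=seed)
        )
    # Shuffle so that the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data with the same kind of imbalance: 1 toxic comment out of 10.
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [1] + [0] * 9,
})
balanced = rebalance(toy)
print(balanced["toxic"].value_counts())
```

Resampling of this kind is normally applied to the training split only, so that validation and test metrics still reflect the original class distribution.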