# Simple EDA - Multilingual Toxic Comment

## Description
> This year, we're taking advantage of Kaggle's new TPU support and challenging you to build multilingual models with English-only training data.

### Data description

> **What should I expect the data format to be?**
>
> The primary data for the competition is, in each provided file, the `comment_text` column. This contains the text of a comment which has been classified as toxic or non-toxic (0...1 in the `toxic` column). The train set's comments are entirely in English and come either from *Civil Comments* or *Wikipedia talk page* edits. The test data's `comment_text` columns are composed of multiple non-English languages.
>
> The `*-train.csv` files and `validation.csv` file also contain a toxic column that is the target to be trained on.
>
> The `jigsaw-toxic-comment-train.csv` and `jigsaw-unintended-bias-train.csv` contain training data (`comment_text` and `toxic`) from the two previous Jigsaw competitions, as well as additional columns that you may find useful.
>
> `*-seqlen128.csv` files contain training, validation, and test data that has been processed for input into BERT.

### What am I predicting?
> You are predicting the probability that a comment is `toxic`. A toxic comment would receive a `1.0`. A benign, non-toxic comment would receive a `0.0`. In the test set, all comments are classified as either a `1.0` or a `0.0`.

### Columns
- **id** - identifier within each file.
- **comment_text** - the text of the comment to be classified.
- **lang** - the language of the comment.
- **toxic** - whether or not the comment is classified as toxic. (Does not exist in test.csv.)
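
As a quick sanity check on the column list above, here is a minimal sketch that reports which of the listed columns are missing from a frame. The helper name `missing_columns` and the toy frame are my own assumptions for illustration; they are not part of the competition data or of any library.

```python
import pandas as pd

# Column names taken from the list above.
EXPECTED = {"id", "comment_text", "lang", "toxic"}

def missing_columns(df, expected=EXPECTED):
    """Return the expected column names that are absent from df, sorted."""
    return sorted(set(expected) - set(df.columns))

# Toy stand-in for a test-like file: it has no `toxic` column.
toy_test = pd.DataFrame({
    "id": [0, 1],
    "comment_text": ["first comment", "second comment"],
    "lang": ["tr", "es"],
})

print(missing_columns(toy_test))
```

Running the same helper on each loaded file is a cheap way to see which columns can be shared across train, validation and test.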

-----------------------------
**I'll update this EDA notebook over the coming days/weeks, stay tuned!**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from wordcloud import WordCloud, STOPWORDS

In [None]:
DIR_INPUT = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification'

## Train dataset

In [None]:
train_df1 = pd.read_csv(DIR_INPUT + '/jigsaw-toxic-comment-train.csv')
train_df1['src'] = 0
train_df1.head()

In [None]:
train_df2 = pd.read_csv(DIR_INPUT + '/jigsaw-unintended-bias-train.csv')
train_df2['src'] = 1
train_df2.head()

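Because the validation data only shares the `['id', 'comment_text', 'toxic']` columns with the train files (and test.csv has no `toxic` column), I drop everything else from train. The `src` column records which source file each row came from.

*Note: The `toxic` ratio is not the same in the two source datasets!*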

In [None]:
keep_cols = ['id', 'comment_text', 'toxic', 'src']
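# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
# ignore_index=True gives the combined frame a fresh, unique index.
train_df = pd.concat([train_df1[keep_cols], train_df2[keep_cols]], ignore_index=True)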
train_df.head()

In [None]:
del train_df1, train_df2

In [None]:
train_df['toxic'] = (train_df['toxic'] > 0.5).astype(np.uint)
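
To show what this thresholding does, here is a small sketch on toy scores (the values are made up, and I am assuming that at least one source stores `toxic` as a fraction between 0 and 1 rather than as a hard label). Note that the comparison is strict, so a score of exactly 0.5 ends up as non-toxic.

```python
import numpy as np
import pandas as pd

# Made-up fractional toxicity scores, for illustration only.
toy = pd.DataFrame({"toxic": [0.0, 0.2, 0.5, 0.51, 1.0]})

# Same rule as the cell above: strictly greater than 0.5 becomes 1, everything else 0.
toy["toxic_bin"] = (toy["toxic"] > 0.5).astype(np.uint8)

print(toy)
```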

In [None]:
print("We have {} English comments in the train datasets.".format(train_df.shape[0]))

In [None]:
train_df['toxic'].value_counts(normalize=True)

In [None]:
train_df.groupby(by=['toxic', 'src']).count()[['id']]
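
The counts above can also be turned into a toxic ratio per source, which makes the earlier note about the differing ratios easier to read. A minimal sketch on toy data (the numbers are invented; only the `toxic` and `src` column names come from this notebook): since `toxic` is 0/1, its mean per group is the share of toxic comments.

```python
import pandas as pd

# Invented labels: 4 rows per source, just to show the computation.
toy = pd.DataFrame({
    "toxic": [0, 0, 0, 1, 0, 1, 1, 0],
    "src":   [0, 0, 0, 0, 1, 1, 1, 1],
})

# Mean of a 0/1 column = fraction of rows that are toxic.
ratio_per_src = toy.groupby("src")["toxic"].mean()
print(ratio_per_src)
```

On the real data the same line would be `train_df.groupby('src')['toxic'].mean()`.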

In [None]:
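# value_counts() sorts by frequency; sort_index() guarantees the order [0, 1] to match the x labels
fig = go.Figure([go.Bar(x=['Not-toxic', 'Toxic'], y=train_df.toxic.value_counts().sort_index())])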
fig.update_layout(
    title='Toxic/non-toxic comments distribution in the train dataset'
)
fig.show()

In [None]:
# Character length and whitespace-separated word count of each comment
train_df['comment_text_len'] = train_df['comment_text'].str.len()
train_df['comment_text_word_cnt'] = train_df['comment_text'].str.split().str.len()
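
A word count depends on how the text is split, so here is a small sketch comparing the two obvious choices on toy strings (the strings are made up). Splitting on a single space counts empty strings between repeated spaces and ignores newlines, while splitting on any whitespace does neither.

```python
import pandas as pd

# Made-up comments: one with a double space, one with a newline.
toy = pd.Series(["hello  world", "one two\nthree"])

char_len = toy.str.len()                     # number of characters
words_space = toy.str.split(" ").str.len()   # split on a single space only
words_ws = toy.str.split().str.len()         # split on any run of whitespace

print(pd.DataFrame({"chars": char_len, "split(' ')": words_space, "split()": words_ws}))
```

For real comments, which often contain line breaks, the whitespace split is usually the closer match to what a reader would call a word.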

In [None]:
fig = px.histogram(train_df, x='comment_text_len', color='toxic', nbins=200)
fig.show(renderer="kaggle")

In [None]:
fig = px.histogram(train_df[train_df['src'] == 0],
                   x='comment_text_len',
                   color='toxic',
                   nbins=200,
                   title='Text length - Source: Jigsaw toxic comment (train)')
fig.show(renderer="kaggle")

In [None]:
fig = px.histogram(train_df[train_df['src'] == 1],
                   x='comment_text_len',
                   color='toxic',
                   nbins=200,
                   title='Text length - Source: Jigsaw unintended bias (train)')
fig.show(renderer="kaggle")

In [None]:
fig = px.histogram(train_df[train_df['src'] == 0],
                   x='comment_text_word_cnt',
                   color='toxic',
                   nbins=200,
                   title='Word count - Source: Jigsaw toxic comment (train)')
fig.show(renderer="kaggle")

In [None]:
fig = px.histogram(train_df[train_df['src'] == 1],
                   x='comment_text_word_cnt',
                   color='toxic',
                   nbins=200,
                   title='Word count - Source: Jigsaw unintended bias (train)')
fig.show(renderer="kaggle")

## Test/Valid dataset

In [None]:
valid_df = pd.read_csv(DIR_INPUT + '/validation.csv')
valid_df.head()

In the validation set, 15.3% of the comments are toxic, while in the train set the ratio is only 6.3%.

*Note: the train set is a combination of the data from the two previous competitions.*

In [None]:
per_lang = valid_df['lang'].value_counts()
fig = go.Figure([go.Bar(x=per_lang.index, y=per_lang.values)])
fig.update_layout(
    title='Language distribution in the validation dataset'
)
fig.show()

In [None]:
valid_df.toxic.value_counts(normalize=True)

In [None]:
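# value_counts() sorts by frequency; sort_index() guarantees the order [0, 1] to match the x labels
fig = go.Figure([go.Bar(x=['Not-toxic', 'Toxic'], y=valid_df.toxic.value_counts().sort_index())])
fig.update_layout(
    title='Toxic/non-toxic comments distribution in the validation dataset'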
)
fig.show()

In [None]:
per_lang = valid_df.groupby(by=['lang', 'toxic']).count()[['id']]
per_lang
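
The grouped counts can also be reshaped so that every language has a value for both classes, and turned into a toxic rate per language. A minimal sketch on a toy frame (rows and language codes are invented; only the column names match the validation file): `unstack(fill_value=0)` fills in a zero where a language has no comments of one class, which a plain `groupby(...).count()` would simply leave out.

```python
import pandas as pd

# Invented validation-like rows: "tr" has no toxic comment, "it" has no non-toxic one.
toy_valid = pd.DataFrame({
    "id": range(6),
    "lang": ["es", "es", "es", "tr", "tr", "it"],
    "toxic": [0, 1, 0, 0, 0, 1],
})

# One row per language, one column per class (0 and 1), zeros for missing combinations.
counts = toy_valid.groupby(["lang", "toxic"])["id"].count().unstack(fill_value=0)

# Share of toxic comments per language.
rate = toy_valid.groupby("lang")["toxic"].mean()

print(counts)
print(rate)
```

Having both classes present for every language matters when the counts are plotted against fixed `['Non-toxic', 'Toxic']` labels.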

In [None]:
data = []

for lang in valid_df['lang'].unique():
    y = per_lang[per_lang.index.get_level_values('lang') == lang].values.flatten()
    data.append(go.Bar(name=lang, x=['Non-toxic', 'Toxic'], y=y))

fig = go.Figure(data=data)
fig.update_layout(
    title='Toxic/non-toxic comments per language in the validation dataset',
    barmode='group'
)
fig.show()

In [None]:
test_df = pd.read_csv(DIR_INPUT + '/test.csv')
test_df.head()

In [None]:
test_df['lang'].value_counts()

In [None]:
per_lang = test_df['lang'].value_counts()
fig = go.Figure([go.Bar(x=per_lang.index, y=per_lang.values)])
fig.update_layout(
    title='Language distribution in the test dataset',
)
fig.show()

# Comments

In [None]:
toxic_samples = train_df[train_df['toxic'] == 1].sample(n=5)['comment_text']

for toxic in toxic_samples.values:
    print("")
    print("==============================")
    print(toxic)
    print("==============================")
    print("")

## Wordclouds - Frequent words

In [None]:
rnd_comments = train_df.sample(n=2500)['comment_text'].values
# set.update() returns None, so build the extended stop word set with union() instead
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS.union(['Trump', 'people', 'one', 'will']))
wc.generate(" ".join(rnd_comments))

plt.figure(figsize=(20,10))
plt.axis("off")
plt.title("Random words", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()
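
A note on adding extra stop words: `set.update()` changes the set in place and returns `None`, while `set.union()` returns a new set and leaves the original alone. The sketch below shows the difference with a plain Python set standing in for `wordcloud.STOPWORDS` (the three base words are placeholders, not the library's real list).

```python
# Placeholder for wordcloud.STOPWORDS, which is also a plain Python set.
base_stopwords = {"the", "a", "of"}
extra = ["Trump", "people", "one", "will"]

# update() mutates its set and returns None, so this is NOT a usable stop word set.
result_of_update = set(base_stopwords).update(extra)

# union() returns a new set containing both groups and does not touch the original.
combined = base_stopwords.union(extra)

print(result_of_update)
print(sorted(combined))
```

As far as I know, `WordCloud` falls back to its default stop words when `stopwords=None`, so passing the result of `update()` only works by accident, because the shared default set was already mutated.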

In [None]:
rnd_comments = train_df[train_df['toxic'] == 0].sample(n=10000)['comment_text'].values
# set.update() returns None, so build the extended stop word set with union() instead
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS.union(['Trump', 'people', 'one', 'will']))
wc.generate(" ".join(rnd_comments))

plt.figure(figsize=(20,10))
plt.axis("off")
plt.title("Frequent words in non-toxic comments", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()

In [None]:
rnd_comments = train_df[train_df['toxic'] == 1].sample(n=10000)['comment_text'].values
# set.update() returns None, so build the extended stop word set with union() instead
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS.union(['Trump', 'people', 'one', 'will']))
wc.generate(" ".join(rnd_comments))

plt.figure(figsize=(20,10))
plt.axis("off")
plt.title("Frequent words in toxic comments", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()