In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
import warnings
warnings.filterwarnings("ignore")
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
%matplotlib inline
sns.set()
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
PATH = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification/'

In [None]:
# Build two summary tables: one for numeric columns and one for object columns
def eda(df): 
    # Numeric columns: describe() statistics plus null counts and dtypes
    numeric_summary = df.describe().T
    numeric_summary['null_sum'] = df.isnull().sum()
    numeric_summary['null_pct'] = df.isnull().mean()
    numeric_summary['dtypes'] = df.dtypes
    
    # Remaining (non-numeric) columns get the same extra columns
    objects = df[[x for x in df.columns if x not in numeric_summary.index]]
    object_summary = objects.describe().T
    object_summary['null_sum'] = df.isnull().sum()
    object_summary['null_pct'] = df.isnull().mean()
    object_summary['dtypes'] = df.dtypes
    return numeric_summary, object_summary

# EDA
This is the third in a series of NLP competitions, so the main data to consider here are obviously the comments. However, the datasets are enriched with some other variables that are worth exploring, just for the sake of learning. In this notebook I will focus only on those variables. Let's proceed.

## File 1 - Unintended Bias
This is an expanded version of the Civil Comments dataset with a range of additional labels.  
Some basic data exploration.

In [None]:
train1 = pd.read_csv(PATH+'jigsaw-unintended-bias-train.csv')
train1.head()

In [None]:
train1_eda, train1_eda_objects = eda(train1)
train1_eda

In [None]:
train1_eda_objects

Knowing that the target variable is 'toxic', the remaining columns can be grouped into the following types:
- ***Toxic Ratios***: Columns with values ranging [0,1] that are available for every row (no nulls). They clearly represent offensive comments. Columns are: severe_toxicity, obscene, identity_attack, insult, threat and sexual_explicit
- ***Feature Ratios***: Columns with values ranging [0,1] that are not 'toxic ratios'. It appears that those ratios are available only for less than 25% of the rows though.
- ***ID's***: Basically publication_id, parent_id and article_id. We'll get to them later
- ***User reactions***: funny, wow, sad, likes and disagree. These seem to be the reaction to a comment. No nulls in these columns
- ***Others***: These are comment_text (main column with texts, we are not going to analyze it here), created_date and rating


The target variable in this training dataset is not a binary 0/1 label but a probability between 0 and 1. Let's check out its distribution.

In [None]:
fig, ax = plt.subplots(figsize=(10,6), nrows=1, ncols=2)
fig.suptitle("Distribution of Target Variable", size=25)
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(train1['toxic'], bins=20, ax=ax[0])
ax[0].set(xlabel='Distribution')
sns.histplot(train1['toxic'], bins=[0,0.5,1], ax=ax[1])
ax[1].set(xlabel='Threshold = 0.5')
plt.show()

The target distribution is far from normal: it is heavily skewed towards 0, so the most common value is 0 (not toxic). The second figure shows the split under a hypothetical threshold of 0.5.
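
As a minimal sketch of the thresholding idea, assuming a toy `scores` series in place of `train1['toxic']` (the values are invented for illustration), the fractional target can be binarized at 0.5 and the share of each class read off directly:

```python
import pandas as pd

# Hypothetical toxicity scores standing in for train1['toxic']
scores = pd.Series([0.0, 0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0])

# Label a comment as toxic when its score reaches the assumed 0.5 threshold
labels = (scores >= 0.5).astype(int)

# Share of comments falling on each side of the threshold
share = labels.value_counts(normalize=True).sort_index()
print(share)
```

The threshold itself is an assumption; a different cut-off would shift the class balance.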

## Toxic Ratios
Columns with values ranging [0,1] that are available for every row (no nulls). They clearly represent offensive comments. Columns are: severe_toxicity, obscene, identity_attack, insult, threat and sexual_explicit. Let's examine how they correlate to each other.

In [None]:
# Pairwise relationships between the toxic ratio columns
toxic_ratios = ['severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat', 'sexual_explicit']
train1_ratios = train1[toxic_ratios]
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(train1_ratios)
plt.show()

The variables show low correlation with each other. One thing that stands out is that 'severe_toxicity' rarely goes beyond 0.5. Let's check how they relate to the target variable.

In [None]:
fig, ax = plt.subplots(figsize=(15,10), nrows=2, ncols=3)
for i,t in enumerate(toxic_ratios):
    r,c = divmod(i, 3)  # row and column in the subplot grid
    sns.scatterplot(x=t, y="toxic", data=train1, ax=ax[r][c])
    ax[r][c].set(xlabel=t)
    ax[r][c].plot([0,1], [0,1], color='red')  # reference line y = x
plt.show()

There seems to be a pattern in all these variables: in most cases (with some exceptions), the value of the target variable is greater than or equal to the value of the variable. The exceptions are the points below the red line in each figure. Let's find out how many exceptions we have in each case.

In [None]:
exceptions = []
for i,t in enumerate(toxic_ratios):
    c = len(train1[train1[t]>train1['toxic']])
    exceptions.append({'feature':t,'count': c, 'pct': c/len(train1)})
pd.DataFrame(exceptions).set_index('feature')

For most variables, less than 1% of the rows have a value higher than the corresponding toxic value. Can we consider them outliers?
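
The same count can also be written without an explicit loop. The sketch below is only an illustration on an invented `toy` frame (the column names mimic the real ones, the numbers do not come from the dataset): it compares every sub-type column against the target row by row and averages the resulting boolean mask.

```python
import pandas as pd

# Toy data: 'toxic' is the target, the other columns mimic sub-type scores
toy = pd.DataFrame({
    "toxic":  [0.9, 0.5, 0.2, 0.0],
    "insult": [0.8, 0.6, 0.1, 0.0],
    "threat": [0.0, 0.1, 0.3, 0.0],
})
subtypes = ["insult", "threat"]

# gt(..., axis=0) aligns the target with the rows of each sub-type column;
# the mean of the boolean mask is the share of exceptions per column
exception_pct = toy[subtypes].gt(toy["toxic"], axis=0).mean()
print(exception_pct)
```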

Finally, adding them up into a single variable and plotting it against the target, as we just did above, yields the following result:

In [None]:
train1['ratios'] = train1[toxic_ratios].sum(axis=1)
fig = plt.figure(figsize=(8,8))
sns.scatterplot(x='ratios', y="toxic", data=train1)
plt.show()

## Feature Ratios
Columns with values ranging [0,1] that are not 'toxic ratios'. It appears that those ratios are available only for less than 25% of the rows though.

In [None]:
feature_ratios = list(train1_eda[train1_eda['null_sum']>1000000].index) 
train1_ratios = train1[feature_ratios + ['toxic']].dropna()
train1_ratios.head()

In [None]:
fig, ax = plt.subplots(figsize=(21,14), nrows=4, ncols=6)
for i,t in enumerate(feature_ratios):
    r,c = divmod(i, 6)  # row and column in the subplot grid
    sns.scatterplot(x=t, y="toxic", data=train1_ratios, ax=ax[r][c])
    ax[r][c].set(xlabel=t)
    ax[r][c].plot([0,1], [0,1], color='red')  # reference line y = x
# Spacing only needs to be set once, after all subplots are drawn
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

From the figures above we can't see the same pattern as with the "toxic_ratios".

In [None]:
correlations = train1_ratios.corrwith(train1_ratios['toxic']).iloc[:-1].to_frame()
sorted_correlations = correlations[0].sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(5,10))
sns.heatmap(sorted_correlations.to_frame(), cmap='coolwarm', annot=True, vmin=-1, vmax=1, ax=ax)

Correlations with the target variable are very low. This is probably because the data in these variables are very sparse: values default to 0 unless there is some component of that feature in the comment. So let's check it out.

In [None]:
correlations = []
for t in feature_ratios:
    corr = {'feature':t}
    # Pearson correlation on all rows
    corr['original'] = pearsonr(train1_ratios['toxic'], train1_ratios[t])[0]
    # Pearson correlation restricted to rows where the feature is non-zero
    df = train1_ratios[train1_ratios[t]>0]
    corr['filtered'] = pearsonr(df['toxic'], df[t])[0]
    correlations.append(corr)
    
correlations = pd.DataFrame(correlations).set_index('feature')
correlations

In [None]:
fig = plt.figure(figsize=(8,10))
fig.suptitle('Pearson Correlation vs target BEFORE vs AFTER filtering zeros', size=25)
for t in correlations.index:
    plt.plot([correlations.loc[t,'original'],correlations.loc[t,'filtered']], label=t)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

Removing 0 values from the features has mixed effects. In some cases, it increases the positive correlation with the target, which means the existence of certain language components increases the chances of a comment being toxic. However, correlations for most features remain very close to 0, which means little to no correlation at all.
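
To illustrate why a large block of zeros can mask a relationship, here is a small synthetic sketch. Everything in it is an assumption made for the demonstration (the 90% zero rate, the noise levels, the `toy` frame), and it is deliberately constructed so that filtering helps, which is not what happens for most of the real columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Hypothetical sparse feature: roughly 90% exact zeros, the rest in [0.01, 1]
present = rng.random(n) >= 0.9
feature = np.where(present, rng.uniform(0.01, 1.0, n), 0.0)

# Toy target: tracks the feature when it is present, unrelated noise otherwise
target = np.where(present, feature + rng.normal(0, 0.05, n), rng.random(n))

toy = pd.DataFrame({"feature": feature, "toxic": target})
r_all = toy["feature"].corr(toy["toxic"])              # Pearson on every row
nonzero = toy[toy["feature"] > 0]
r_nonzero = nonzero["feature"].corr(nonzero["toxic"])  # Pearson on non-zero rows
print(f"all rows: {r_all:.2f} | non-zero rows: {r_nonzero:.2f} ({len(nonzero)} rows)")
```

Note that the filtered correlation is computed on far fewer rows, so it is also noisier.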

## ID's
Basically publication_id, parent_id and article_id.  
Apparently there is little to explore in ID columns. However, I would like to analyze the possible relationship between rows with the same **parent_id**.

In [None]:
ids = ['id','publication_id', 'parent_id', 'article_id']
train1[ids].nunique()

Clearly *id* is just the row id, so there is nothing to consider there. However, the other variables have very different numbers of distinct values. Feature ***publication_id*** is likely to carry some information, as *just* 53 different values exist.

In [None]:
plt.figure(figsize=(50,10)) # adjust the fig size to see everything
sns.barplot(x=train1['publication_id'].value_counts().index, y=train1['publication_id'].value_counts())
plt.show()

Most of the comments belong to roughly 10 publications.
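
One way to put a number on this kind of concentration is the cumulative share of the largest categories. The sketch below uses an invented `pubs` series instead of `train1['publication_id']`, and the 80% coverage level is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical publication ids; the real column is train1['publication_id']
pubs = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 3, 4])

# Share of comments per publication (largest first), then accumulated
cum_share = pubs.value_counts(normalize=True).cumsum()
print(cum_share)

# Number of publications needed to cover at least 80% of the comments
k = int((cum_share < 0.8).sum()) + 1
print(k)
```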

## User reactions
funny, wow, sad, likes and disagree. These seem to be the reaction to a comment. No nulls in these columns. Each comment can have more than one reaction, so they are not mutually exclusive.

First thing is to find out how many comments don't have any reaction at all.

In [None]:
reactions = ['funny', 'wow', 'sad', 'likes' , 'disagree']
train1_reaction = train1[reactions].copy()  # copy to avoid SettingWithCopyWarning
train1_reaction['nreactions'] = train1_reaction.sum(axis=1)
n = len(train1_reaction[train1_reaction['nreactions']==0])
print('Number of comments without reaction: {}'.format(n))
print('Share of comments without reaction: {s:.3f}'.format(s=n/len(train1)))

Almost a third of the comments have no reaction at all. Let's check the correlations

In [None]:
train1_reaction = train1[reactions].copy()  # copy to avoid SettingWithCopyWarning
train1_reaction['nreactions'] = train1_reaction.sum(axis=1)
sns.pairplot(train1_reaction[train1_reaction['nreactions']!=0].drop('nreactions',axis=1))
plt.show()

The relationship among the user reaction features is mostly inverse. This effect is more pronounced when reactions have a different sentiment, like funny vs disagree. When the sentiment is the same, like disagree vs sad, the effect is less pronounced (as expected) but still inverse. This suggests there is some sort of consensus among the users when rating comments.

However, any single comment may have a variable number of reactions of each type, so establishing an isolated relationship between each feature and the target variable could be tricky and misleading. Instead, I will assign a synthetic variable to each row based on the dominant reaction (the feature with the highest count) and then check its relationship with the target variable.
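
The idea of a dominant reaction can be sketched on a few invented rows. The `toy` frame and its values are assumptions for illustration only; `idxmax(axis=1)` returns the label of the column holding the row maximum (the first such column on a tie), and averaging a 0/1 indicator per group gives the share of toxic comments for that reaction.

```python
import pandas as pd

reaction_cols = ["funny", "wow", "sad", "likes", "disagree"]

# Toy rows standing in for train1; the values are invented
toy = pd.DataFrame({
    "funny":    [3, 0, 0, 1],
    "wow":      [0, 0, 0, 0],
    "sad":      [0, 2, 0, 0],
    "likes":    [1, 1, 5, 0],
    "disagree": [0, 0, 1, 4],
    "toxic":    [0.1, 0.7, 0.0, 0.9],
})

# Dominant reaction: column label with the highest count in each row
toy["reaction"] = toy[reaction_cols].idxmax(axis=1)

# Binarize the target at 0.5 and average it within each dominant reaction
toy["toxic_tr"] = (toy["toxic"] >= 0.5).astype(int)
p_toxic = toy.groupby("reaction")["toxic_tr"].mean()
print(p_toxic)
```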

In [None]:
train1_reaction = train1[reactions+['toxic']].copy()
train1_reaction['nreactions'] = train1_reaction[reactions].sum(axis=1)
# Keep only comments that received at least one reaction
train1_reaction = train1_reaction[train1_reaction['nreactions']>0].drop('nreactions', axis=1)
# Dominant reaction: idxmax returns the column label (argmax returns a position in recent pandas)
train1_reaction['reaction'] = train1_reaction[reactions].idxmax(axis=1)
# Binarize the target with a 0.5 threshold
train1_reaction['toxic_tr'] = (train1_reaction['toxic']>=0.5).astype(int)
# Percentage of non-toxic (0) and toxic (1) comments for each dominant reaction
grouped = 100*pd.crosstab(train1_reaction['reaction'], train1_reaction['toxic_tr'], normalize='index')
grouped

In [None]:
fig = plt.figure()
cm = plt.get_cmap('viridis')
ax = fig.add_axes([0,0,1,1])
ax.set_title('$P( toxic | reaction=x)$', size=23)
colors = [ cm(i/(len(grouped.index))) for i in range(len(grouped.index))]
labels = grouped.index
values = grouped[1]
rects = ax.bar(labels, values, color=colors)
for p in rects:
    ax.text( p.get_x() + p.get_width() / 2., p.get_height()* 1.05, s=str('{0:.2f}'.format(p.get_height())), ha = 'center', va = 'center')
plt.show()

The above bars represent $P( toxic | reaction=x)$, in other words, the percentage of toxic comments (given our threshold of 0.5) when the main reaction is $x$.  

The insight here is that "negative" reactions (disagree, sad) are more likely when the comment is toxic. I can't draw any conclusion from the relatively low figures (less than 10%), but I would say that reaction is not a key variable in the toxicity of a comment.

## Others
Features that are not numeric types but objects. Let's start with 'rating'.

In [None]:
train1['rating'].value_counts()

Feature 'rating' has just 2 distinct values, so we can convert it to a binary variable for modelling. First, let's find out its correlation with the target variable.

In [None]:
train1['rating_binary'] = train1['rating'].apply(lambda x: 1 if x == 'approved' else 0)
sns.boxplot(x=train1['rating_binary'], y=train1['toxic']).set_title('Toxic comments by Rating')
plt.show()

We have converted value "approved" to 1 and "rejected" to 0, and we are checking the distribution of the target variable given each rating value. Distributions are different, which suggests that this feature could be of significance. Let's find out the Pearson correlation.

In [None]:
print("Pearson correlation between rating and target variable {s:.2f}".format(s=pearsonr(train1['rating_binary'], train1['toxic'])[0]))

Correlation is negative, as expected (it would have been positive if we had assigned the value 1 to 'rejected').  
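
The sign argument can be checked on a handful of invented values. In this sketch the `rating` and `toxic` arrays are made up; flipping the 0/1 encoding of a binary variable negates its Pearson correlation with any other variable while leaving the magnitude unchanged.

```python
import numpy as np

# Invented example: rating labels and toxicity scores
rating = np.array(["approved", "approved", "rejected", "approved", "rejected"])
toxic = np.array([0.0, 0.1, 0.8, 0.2, 0.6])

approved_is_1 = (rating == "approved").astype(float)
rejected_is_1 = 1.0 - approved_is_1

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is Pearson's r
r_approved = np.corrcoef(approved_is_1, toxic)[0, 1]
r_rejected = np.corrcoef(rejected_is_1, toxic)[0, 1]
print(r_approved, r_rejected)
```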

Finally, let's check created_date, to find out if there is any kind of seasonality or trend related with the target variable.

In [None]:
train1['created_date_date'] = pd.to_datetime(train1['created_date']).dt.date
grouped = train1.groupby('created_date_date').count()[['id']]
fig = plt.figure(figsize=(20,5))
ax = sns.lineplot(x=grouped.index, y= grouped.id)
ax.set_title('Number of comments by date', size=23)
plt.show()

The first thing we see is that the number of comments increases over time. This is apparently of little value, because the dataset was selected by the competition organizers and does not represent the whole population of comments.

I would like to check whether more toxic comments tend to be posted in specific periods, and whether there is any seasonality. To do so, I group the 'toxic' values by date and average them, which gives a '*toxicity average*' per date.

In [None]:
train1['created_date_date'] = pd.to_datetime(train1['created_date']).dt.date
grouped = train1[['created_date_date','toxic']].groupby('created_date_date').mean()
fig = plt.figure(figsize=(20,6))
ax = sns.lineplot(x=grouped.index, y= grouped.toxic)
ax.set_title('Average Toxic by Date', size=23)
plt.show()

The time series seems to be stationary. There is no apparent trend, although the initial months show higher variance. Let's decompose it.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
plt.rcParams['figure.figsize'] = 20, 8
# 'freq' was renamed to 'period' in newer statsmodels releases
result = seasonal_decompose(grouped, model='additive', period=1)
fig = result.plot()
plt.show()

This decomposition is probably unable to separate the components properly (a period of 1 leaves no room for a seasonal component), so let's run ADF and KPSS tests on the series.
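
A simpler seasonality check, which is not what this notebook does, is to average the daily values by weekday. The sketch below runs on a synthetic `daily` series with a weekend bump built in on purpose; on the real data, `grouped['toxic']` with a datetime index would take its place (an assumption, since the index here is made of plain dates).

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a built-in weekend bump (purely illustrative)
dates = pd.date_range("2017-01-02", periods=28, freq="D")  # starts on a Monday
values = np.where(dates.dayofweek >= 5, 0.2, 0.1)
daily = pd.Series(values, index=dates, name="toxic")

# Average by weekday (0 = Monday ... 6 = Sunday) to expose a weekly pattern
by_weekday = daily.groupby(daily.index.dayofweek).mean()
print(by_weekday)
```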

In [None]:
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import kpss

def adf_test(timeseries):
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
    
def kpss_test(timeseries):
    print ('Results of KPSS Test:')
    # An explicit nlags avoids the deprecated nlags=None; 'legacy' uses the sample-size-based rule
    kpsstest = kpss(timeseries, regression='c', nlags='legacy')
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
    for key,value in kpsstest[3].items():
        kpss_output['Critical Value (%s)'%key] = value
    print (kpss_output)
    
# Pass the column as a 1-D Series rather than a one-column DataFrame
adf_test(grouped['toxic'])
print('*'*20)
kpss_test(grouped['toxic'])

ADF indicates the series is stationary and KPSS indicates the opposite, so the series might be difference stationary.
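
The way the two tests are usually read together can be sketched as a small helper. `interpret` is a hypothetical function written for this illustration, the 0.05 significance level is an assumption, and the p-values in the usage line are made up rather than taken from the output above.

```python
def interpret(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into one of the four usual readings."""
    # ADF null hypothesis: unit root (non-stationary); rejecting it suggests stationarity
    adf_stationary = adf_pvalue < alpha
    # KPSS null hypothesis: stationary; failing to reject it suggests stationarity
    kpss_stationary = kpss_pvalue >= alpha

    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if adf_stationary:
        # ADF says stationary, KPSS disagrees: differencing is the usual remedy
        return "difference stationary"
    # KPSS says stationary, ADF disagrees: detrending is the usual remedy
    return "trend stationary"

print(interpret(adf_pvalue=0.01, kpss_pvalue=0.01))
```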

In [None]:
grouped['toxic_diff'] = grouped['toxic'] - grouped['toxic'].shift(1)
grouped['toxic_diff'].dropna().plot(figsize=(12,8))

In [None]:
adf_test(grouped['toxic_diff'].dropna())
print('*'*20)
kpss_test(grouped['toxic_diff'].dropna())

This time, both tests indicate that the series is stationary, so the relationship between 'created_date' and 'toxic' yields little information.

## File 2 - Toxic Comment
The dataset is made up of English comments from Wikipedia’s talk page edits.

In [None]:
train2 = pd.read_csv(PATH+'jigsaw-toxic-comment-train.csv')
train2.head()

In this file we only have what we called the 'toxic ratios', with slight differences. Feature 'identity_hate' is not present in the other dataset. At the same time, some features of this category that are present in the other dataset are missing here.

In [None]:
train2_eda, train2_eda_objects = eda(train2)
train2_eda

Just from this summary we can see that numeric variables are all binary, unlike the previous dataset. So just by looking at the mean column, we can see the proportion of 1's and 0's for each variable.
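
A quick illustration of why the mean is enough for 0/1 columns, using an invented `toy` frame in the style of `train2`:

```python
import pandas as pd

# Invented binary labels in the style of train2
toy = pd.DataFrame({
    "toxic":   [1, 0, 0, 1, 0],
    "obscene": [1, 0, 0, 0, 0],
})

# For 0/1 columns the mean equals the fraction of rows equal to 1
share_of_ones = toy.mean()
print(share_of_ones)
```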

In [None]:
train2_eda_objects

Let's take a look at the relationship between the target variable and each of the features.

In [None]:
toxic_ratios = ['severe_toxic', 'obscene', 'identity_hate', 'insult', 'threat']
fig, ax = plt.subplots(figsize=(25,3), nrows=1,ncols=5)
for i,t in enumerate(toxic_ratios):
    df = train2[['toxic',t,'id']].groupby(['toxic',t]).count().reset_index()
    df = df.pivot(index='toxic', columns=t, values='id')
    sns.heatmap( data=df, ax=ax[i], annot=True, fmt='.0f')
    ax[i].set(xlabel=t)
    plt.subplots_adjust( wspace= 0.5)
plt.show()

The pattern is clear here. The presence of any of these features indicates a high chance that the comment is toxic.
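
The counts in the heatmaps can be turned into conditional probabilities. This sketch uses a made-up `toy` frame rather than `train2`; with `normalize='columns'`, each column of the contingency table sums to 1, so the row for `toxic = 1` holds $P(toxic = 1 | feature = value)$.

```python
import pandas as pd

# Invented binary labels: every insulting comment here is also toxic
toy = pd.DataFrame({
    "toxic":  [1, 1, 1, 0, 0, 0, 0, 0],
    "insult": [1, 1, 0, 0, 0, 0, 0, 0],
})

# Rows are 'toxic' values, columns are 'insult' values, each column sums to 1
cond = pd.crosstab(toy["toxic"], toy["insult"], normalize="columns")
print(cond)
```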

In [None]:
correlations = []
for t in toxic_ratios:
    corr = {'feature':t}
    # Pearson correlation between the binary target and each binary feature
    corr['correlation'] = pearsonr(train2['toxic'], train2[t])[0]
    correlations.append(corr)
    
correlations = pd.DataFrame(correlations).set_index('feature')
correlations

Next steps will be to explore the comments column.  
If you liked this notebook, please upvote!!