<div style="background-color:#dee3e3; padding: 25px 50px 25px 50px;">
<h1 style="font-family: Arial, Helvetica, sans-serif;text-shadow: grey 0px 0px 3px; font-weight: bold; font-size: 230%;"><center>Jigsaw Toxic Severity Rating</center></h1>

<p style="font-size: 110%;font-style: italic;" >This notebook performs EDA on the Jigsaw Toxic Severity Rating dataset. The focus in this competition is on ranking the severity of comment toxicity from innocuous to outrageous. </p>
</div>

<div style="">
<p id="lib" style="color: white; text-shadow: black 0px 0px 3px; font-weight: bold; font-size: 150%; font-family: Arial, Helvetica, sans-serif;
          padding: 10px 10px; background-color: #f2843a; border-radius: 5px;">
    <b> Libraries</b>
</p> 

<p style="font-size: 110%;font-style: italic;" >Install dependencies and import libraries</p>
</div>

In [None]:
!pip install textstat
!pip install pyicu
!pip install pycld2
!pip install polyglot

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import math
import string
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from transformers import pipeline
import textstat
from polyglot.detect import Detector

plt.style.use('ggplot')

<p id="load_data" style="color: white; text-shadow: black 0px 0px 3px; font-weight: bold; font-size: 150%; font-family: Arial, Helvetica, sans-serif;
          padding: 10px 10px; background-color: #f2843a; border-radius: 5px;">
    <b> Load data</b>
</p> 

<p style="font-size: 110%;font-style: italic;" > There are three files in the input folder. Let's load the data and look at a few samples from it.</p>


In [None]:
test_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
validation_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')
print(f'Validation Data is of shape: {validation_df.shape}')
print(f'Comments to score (test) is of shape: {test_df.shape}')

In [None]:
test_df.head()

<p style="font-size: 110%;font-style: italic;" > 
    The test file has a comment_id and the text associated with each comment. Next, let's look at the validation data.
</p>



In [None]:
validation_df.head(5)

<p style="font-size: 110%;font-style: italic;" > 
The validation data has the id of the worker who judged each pair of comments, along with the comment rated less toxic and the one rated more toxic. There are more than 30K rows in the dataset.
</p>




<p id="eda" style="color: white; text-shadow: black 0px 0px 3px; font-weight: bold; font-size: 150%; font-family: Arial, Helvetica, sans-serif;
          padding: 10px 10px; background-color: #f2843a; border-radius: 5px;">
    <b> Exploratory data analysis</b>
</p> 

<p style="font-size: 110%;font-style: italic;" > Now that we have everything we need in the dataframes, let's explore the details!</p>


* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> Unique comments and workers</span>

In [None]:
print(f'Total workers involved in validation are => {len(validation_df.worker.unique())}')

In [None]:
print(f'Less toxic unique comments => {len(validation_df.less_toxic.unique())}')
print(f'More toxic unique comments => {len(validation_df.more_toxic.unique())}')
# Series.append was removed in pandas 2.0, so stack the two columns with pd.concat instead.
print(f'Total unique comments in both columns => {pd.concat([validation_df.more_toxic, validation_df.less_toxic]).nunique()}')

<p style="font-size: 110%; font-style: italic;" > 
There are around 14K unique comments in total, and they were rated by 753 workers!
</p>
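
<p style="font-size: 110%; font-style: italic;" > 
A toy illustration of the count above, on a made-up frame that only borrows the two column names (the comments are invented). <code>pd.concat</code> stacks both columns so that <code>nunique()</code> removes duplicates across them, not just within each column.
</p>

```python
import pandas as pd

# Made-up stand-in for validation_df; only the column names match the real data.
toy_df = pd.DataFrame({
    'less_toxic': ['a', 'b', 'a', 'c'],
    'more_toxic': ['b', 'd', 'd', 'd'],
})

# Stack both columns into one Series, then count distinct comments.
all_toy_comments = pd.concat([toy_df['less_toxic'], toy_df['more_toxic']])
n_unique = all_toy_comments.nunique()
print(f'Unique comments across both columns => {n_unique}')
```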


* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> Languages in validation comments</span>

In [None]:
def detect_language(text):
    return Detector("".join(x for x in text if x.isprintable()), quiet=True).languages[0].name

# pd.concat replaces Series.append, which was removed in pandas 2.0.
all_comments = pd.concat([validation_df.less_toxic, validation_df.more_toxic]).value_counts().index.values
languages = [detect_language(comment) for comment in all_comments]
language_df = pd.DataFrame({
    'text': all_comments,
    'language': languages
})

language_df['language'].value_counts().to_frame().head(15)

<p style="font-size: 110%; font-style: italic;" > 
So, 99.16% of the comments are in English, with other languages occurring very rarely in comparison.
</p>
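
<p style="font-size: 110%; font-style: italic;" > 
A percentage like the one above can be read straight off <code>value_counts(normalize=True)</code>. The sketch below uses made-up language labels, not the detector output, so the numbers are only illustrative.
</p>

```python
import pandas as pd

# Made-up labels standing in for language_df['language'].
toy_languages = pd.Series(['English'] * 8 + ['German', 'French'])

# normalize=True returns fractions instead of raw counts; scale them to percent.
language_share = toy_languages.value_counts(normalize=True).mul(100).round(2)
print(language_share)
```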


* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> Words and Sentences </span>
<p style="font-size: 110%; font-style: italic;" > 
Do the word and sentence counts of a comment indicate something? Let's plot the distributions to find out.
</p>


In [None]:
less_toxic_wc = validation_df.less_toxic.apply(lambda comment: textstat.lexicon_count(str(comment)))
more_toxic_wc = validation_df.more_toxic.apply(lambda comment: textstat.lexicon_count(str(comment)))

less_toxic_sc = validation_df.less_toxic.apply(lambda comment: textstat.sentence_count(str(comment)))
more_toxic_sc = validation_df.more_toxic.apply(lambda comment: textstat.sentence_count(str(comment)))

fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(121)
bins = np.linspace(0, 500, 50)
ax1.hist(less_toxic_wc, bins, alpha=0.9, label='less toxic wc', color = "skyblue")
ax1.hist(more_toxic_wc, bins, alpha=0.5, label='more toxic wc', color = "red")
ax1.set_title('Word count distribution')
ax1.legend(loc='upper right')

ax2 = fig.add_subplot(122)
bins = np.linspace(0, 60, 40)
# sc = sentence count
ax2.hist(less_toxic_sc, bins, alpha=0.9, label='less toxic sc', color = "skyblue")
ax2.hist(more_toxic_sc, bins, alpha=0.5, label='more toxic sc', color = "red")
ax2.set_title('Sentence count distribution')
ax2.legend(loc='upper right')

plt.show()

<p style="font-size: 110%; font-style: italic;" > 
So, more toxic comments tend to be short, whereas less toxic comments are usually longer. That being said, there's also quite a lot of overlap! <br>
Further, most comments have 1 or 2 sentences, and there's very little to differentiate the two classes here as well.
</p>



<p style="font-size: 110%; font-style: italic;" > 
OK, then let's compare the word counts within each pair and look at the distribution of the difference for a clue.
</p>

In [None]:
wc_diff = validation_df.apply(lambda row: len(row.less_toxic.split()) - len(row.more_toxic.split()), axis=1)

plt.figure(figsize=(8, 6))
bins = np.linspace(-200, 200, 30)
plt.hist(wc_diff, bins, alpha=0.5, label='wc difference', color = "green", ec="black")
plt.legend(loc='upper right')
plt.title('Word count difference [less toxic -  more toxic]')
plt.show()

<p style="font-size: 110%; font-style: italic;" > 
Although the distribution is slightly right-skewed, which means the less toxic comment was more often the longer one of the pair, there isn't much separating them.
</p>
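
<p style="font-size: 110%; font-style: italic;" > 
A histogram can be backed up with a few summary numbers. The sketch below computes the same per-pair difference on a made-up frame (the comments are invented) and reports the mean, the median and the share of pairs in which the less toxic comment is longer.
</p>

```python
import pandas as pd

# Made-up pairs; only the column names match validation_df.
toy_df = pd.DataFrame({
    'less_toxic': ['you are wrong about this edit', 'please stop', 'thanks for the help'],
    'more_toxic': ['go away', 'this is total garbage', 'nobody asked you'],
})

# Whitespace-separated word counts per comment.
less_wc = toy_df['less_toxic'].str.split().str.len()
more_wc = toy_df['more_toxic'].str.split().str.len()
toy_wc_diff = less_wc - more_wc

summary = {
    'mean_diff': toy_wc_diff.mean(),
    'median_diff': toy_wc_diff.median(),
    # Fraction of pairs in which the less toxic comment has more words.
    'share_less_toxic_longer': (toy_wc_diff > 0).mean(),
}
print(summary)
```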

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> Worker work load </span>
<p style="font-size: 110%; font-style: italic;" > 
How many pairs of comments did each worker score? Let's plot the distribution.
</p>


In [None]:
worker_freq_df = validation_df.worker.value_counts()

plt.figure(figsize=(8, 6))
bins = math.ceil((worker_freq_df.max() - worker_freq_df.min())/10)
plt.hist(worker_freq_df, bins, alpha=0.3, label='pairs per worker', color = "blue", ec="black")
plt.legend(loc='upper right')
plt.title('Worker workload')
plt.show()

<p style="font-size: 110%; font-style: italic;" > 
So, the distribution tells us that most workers scored some 1-20 comment pairs but there were also some workers who did upwards of 200 pairs! Hope they detoxed themselves.

</p>

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Comment pair frequency
</span>
<p style="font-size: 110%; font-style: italic;" > 
We saw earlier that there are around 14K unique comments, but the dataset has around 30K rows of comment pairs. This suggests that many pairs are repeated; to check this, we plot the frequency distribution.
</p>


In [None]:
pair_occ_df = validation_df[['less_toxic','more_toxic']].apply(lambda row: ' ~ '.join(np.sort(list(row))), axis=1).value_counts().value_counts()

plt.figure(figsize=(8, 6))
plt.bar(pair_occ_df.index, pair_occ_df.values, alpha=0.5, label='', color = "blue", ec="black")
plt.xticks(pair_occ_df.index)
# plt.legend(loc='upper left')
plt.title('Occurrence count of comment pairs')
plt.show()


<p style="font-size: 110%; font-style: italic;" > 
So basically, each comment pair either occurs once or thrice in the dataset.
</p>
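
<p style="font-size: 110%; font-style: italic;" > 
The counting trick used here is easier to follow on a small example. In the sketch below (made-up comments), sorting the two texts before joining them gives a key that ignores which comment was rated less toxic, so <code>a, b</code> and <code>b, a</code> count as the same pair. The second <code>value_counts()</code> then counts how many pairs occur once, twice, and so on.
</p>

```python
import pandas as pd

# Made-up pairs; the first three rows are the same pair in different orders.
toy_df = pd.DataFrame({
    'less_toxic': ['a', 'b', 'a', 'c'],
    'more_toxic': ['b', 'a', 'b', 'd'],
})

# Order-independent key: sort the two comments before joining them.
pair_key = toy_df.apply(
    lambda row: ' ~ '.join(sorted([row['less_toxic'], row['more_toxic']])), axis=1
)

pair_counts = pair_key.value_counts()                       # occurrences per pair
occurrence_hist = pair_counts.value_counts().sort_index()   # pairs per occurrence count
print(occurrence_hist)
```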

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Agreement among workers
</span>
<p style="font-size: 110%; font-style: italic;" > 
Now that we know there are 10K pairs scored thrice, let's check how often the workers agree and how often they disagree.
</p>


In [None]:
validation_df['comment_pair_ordered'] = validation_df['less_toxic'] + ' : ' + validation_df['more_toxic']
val_order_dict = validation_df['comment_pair_ordered'].value_counts().to_dict()
validation_df['n_agreements'] = validation_df['comment_pair_ordered'].map(val_order_dict)

validation_df['agreement'] = validation_df['n_agreements'].map({1: 'Reviewer Disagreed',
                                                                2: 'Agreed with One Reviewer',
                                                                3: 'All Three Reviewers Agreed'})

ax = validation_df['agreement'].value_counts().plot(kind='bar',  alpha=0.7, figsize=(12, 5))

ax.tick_params(axis='x', rotation=0)
ax.set_title('Worker Agreement', fontsize=16)
plt.show()
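
<p style="font-size: 110%; font-style: italic;" > 
An ordered key alone cannot tell a lone dissenting vote from a pair that was simply rated once. One possible way to separate the two, sketched on made-up data, is to group by an order-independent pair key and compare the number of ratings with the size of the largest group of identical votes. The helper columns <code>pair</code> and <code>vote</code> are names chosen for this illustration.
</p>

```python
import pandas as pd

# Made-up ratings: pair (a, b) is rated three times unanimously,
# pair (c, d) is rated three times with one dissenting vote.
toy_df = pd.DataFrame({
    'less_toxic': ['a', 'a', 'a', 'c', 'd', 'c'],
    'more_toxic': ['b', 'b', 'b', 'd', 'c', 'd'],
})

# 'pair' ignores the rating direction, 'vote' keeps it.
toy_df['pair'] = [' ~ '.join(sorted(p)) for p in zip(toy_df['less_toxic'], toy_df['more_toxic'])]
toy_df['vote'] = toy_df['less_toxic'] + ' : ' + toy_df['more_toxic']

grouped = toy_df.groupby('pair')['vote']
agreement = pd.DataFrame({
    'n_ratings': grouped.size(),
    # Size of the largest group of identical votes for the pair.
    'majority_size': grouped.agg(lambda votes: votes.value_counts().iloc[0]),
})
agreement['unanimous'] = agreement['n_ratings'] == agreement['majority_size']
print(agreement)
```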

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Punctuation marks
</span>
<p style="font-size: 110%; font-style: italic;" > 
Often there are several punctuation marks in a toxic comment. Looking at the correlation should give us some idea.
</p>


In [None]:
# all punctuation count
punctuation_set = set(string.punctuation)
punctuation_count = lambda text: sum(1 for ch in text if ch in punctuation_set)
validation_df['less_toxic_punctuation_count'] = validation_df.less_toxic.apply(punctuation_count)
validation_df['more_toxic_punctuation_count'] = validation_df.more_toxic.apply(punctuation_count)

# exclamation mark count
exclamation_count = lambda text: text.count('!')
validation_df['less_toxic_exclamation_count'] = validation_df.less_toxic.apply(exclamation_count)
validation_df['more_toxic_exclamation_count'] = validation_df.more_toxic.apply(exclamation_count)

# 1 = the more toxic comment has strictly more marks, 0 = otherwise (ties included).
# value_counts() sorts by frequency, so reindex([0, 1]) pins the order to match the pie labels.
punctuation_value_counts = (validation_df.more_toxic_punctuation_count > validation_df.less_toxic_punctuation_count).astype(int).value_counts().reindex([0, 1], fill_value=0)
exclamation_value_counts = (validation_df.more_toxic_exclamation_count > validation_df.less_toxic_exclamation_count).astype(int).value_counts().reindex([0, 1], fill_value=0)

fig = plt.figure(figsize=(12,8))

ax1 = fig.add_subplot(121)
ax1.pie(punctuation_value_counts, labels = ['Less toxic', 'More toxic'], colors = ['skyblue', '#ff6666'])
ax1.set_title('Punctuation count')

ax2 = fig.add_subplot(122)
ax2.pie(exclamation_value_counts, labels = ['Less toxic', 'More toxic'], colors = ['skyblue', '#ff6666'])
ax2.set_title('Exclamation count')

plt.show()

<p style="font-size: 110%; font-style: italic;" > 
Well, the charts suggest that less toxic comments usually have more punctuation marks. The exclamation count points the same way, which wasn't expected, as the more toxic comments are usually the ones that use more exclamation marks! <br>
Anyway, it is what it is.
</p>
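
<p style="font-size: 110%; font-style: italic;" > 
A pairwise count comparison really has three outcomes, because many pairs tie (for example, when neither comment contains an exclamation mark). The sketch below, on made-up comments, keeps ties as their own category instead of folding them into one side. It is only one possible way to slice the comparison.
</p>

```python
import numpy as np
import pandas as pd

# Made-up pairs; only the column names match validation_df.
toy_df = pd.DataFrame({
    'less_toxic': ['Thanks!', 'ok', 'Why? Really?'],
    'more_toxic': ['Go away!!', 'no', 'Stop'],
})

less_count = toy_df['less_toxic'].str.count('!')
more_count = toy_df['more_toxic'].str.count('!')

# The sign of the difference gives -1, 0 or 1; map it to a readable label.
outcome = np.sign(more_count - less_count).map({
    -1: 'less toxic has more',
    0: 'tie',
    1: 'more toxic has more',
})
outcome_counts = outcome.value_counts()
print(outcome_counts)
```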


* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Language readability
</span>
<p style="font-size: 110%; font-style: italic;" > 
The premise behind checking readability is that less toxic comments are easier to read than more toxic ones. <br/>
We will use Textstat for this. Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level. It offers various metrics, but the one checked in this notebook is the Flesch Reading Ease score, a readability score that indicates how difficult a passage in English is to understand. 
</p>
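
<p style="font-size: 110%; font-style: italic;" > 
For reference, the standard Flesch Reading Ease formula combines the average sentence length with the average number of syllables per word; higher scores mean easier text. The sketch below applies the formula to raw counts. The helper name and the example counts are made up, and Textstat does its own word, sentence and syllable counting, so its values for real text may differ.
</p>

```python
def flesch_reading_ease_from_counts(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up counts: 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease_from_counts(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```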


In [None]:
less_toxic_reading_score = validation_df.less_toxic.apply(lambda comment: textstat.flesch_reading_ease(comment)).mean()
more_toxic_reading_score = validation_df.more_toxic.apply(lambda comment: textstat.flesch_reading_ease(comment)).mean()

plt.figure(figsize=(8, 6))
plt.bar(['Less toxic', 'More toxic'], [less_toxic_reading_score, more_toxic_reading_score], alpha=0.5, width=0.6)
plt.ylabel('Readability score')
plt.title('Flesch Readability score')
plt.show()


<p style="font-size: 110%; font-style: italic;" > 
As expected, the less toxic comments have a better readability score than the more toxic ones. It would also be a good idea to try out other features from the textstat library, but we are not going to do that in this notebook.
</p>

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Most repeated less toxic comment
</span>
<p style="font-size: 110%; font-style: italic;" > 
Comments marked LESS TOXIC the most number of times are as below
</p>


In [None]:
validation_df.less_toxic.value_counts().to_frame().head(5)

<p style="font-size: 110%; font-style: italic;" > 
Comments marked MORE TOXIC the most number of times are as below
</p>

In [None]:
validation_df.more_toxic.value_counts().to_frame().head(5)

* <span style="color:#f2843a;font-size: 150%; font-weight: bold;"> 
Word Cloud
</span>
<p style="font-size: 110%; font-style: italic;" > 
Finally, everyone's favourite Word clouds!!!
</p>


In [None]:
less_toxic_comments = validation_df['less_toxic'].value_counts().to_frame().head(1000)
less_toxic_text = ' '.join(less_toxic_comments.index.tolist())

more_toxic_comments = validation_df['more_toxic'].value_counts().to_frame().head(1000)
more_toxic_text = ' '.join(more_toxic_comments.index.tolist())


less_toxic_wordcloud = WordCloud(max_font_size=50, max_words=100,
                      width=500, height=500,background_color="white").generate(less_toxic_text)


more_toxic_wordcloud = WordCloud(max_font_size=50, max_words=100,
                       width=500, height=500, background_color="black").generate(more_toxic_text)


fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(15,15))
ax1.imshow(less_toxic_wordcloud, interpolation="bilinear")
ax1.axis("off")
ax2.imshow(more_toxic_wordcloud, interpolation="bilinear")
ax2.axis("off")
ax1.set_title('Less Toxic Comments', fontsize=15)
ax2.set_title('More Toxic Comments', fontsize=15)
plt.tight_layout(pad = 5)
plt.show()

<p id="ref" style="color: white; text-shadow: black 0px 0px 3px; font-weight: bold; font-size: 150%; font-family: Arial, Helvetica, sans-serif;
          padding: 10px 10px; background-color: #f2843a; border-radius: 5px;">
    <b>References</b>
</p> 

<p style="font-size: 110%;" >Thank you! </p>


> - [jiggsaw-toxic-comments-eda-twitch-stream](https://www.kaggle.com/robikscube/jiggsaw-toxic-comments-eda-twitch-stream/notebook)
>---

<p style="font-size: 110%; font-style: italic;" > 
Voilà! You made it till the end!!  <br /><br />
Thanks a lot for sticking around and taking the time to read this. Do let me know if something needs to be corrected, and feel free to drop a comment. <br />
Have a good one! Stay safe
</p>