### Jigsaw data visualization for beginners

This notebook visualizes the text data to identify patterns that might help us differentiate between less_toxic and more_toxic comments.

# Problem Statement
<ul style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>
<li>Build a model that produces scores that rank each pair of comments the same way as the professional raters in the training dataset.</li>
</ul>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Why this competition?</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>As evident from the problem statement, this competition presents a unique challenge for a greater purpose. Online bullying has become an epidemic with the boom in connectivity.<br>Hopefully the solutions contribute towards controlling this behaviour so that the internet remains a safe place for everyone.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Expected Outcome</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>In this competition we will be ranking comments in order of severity of toxicity.<br>We are given a list of comments, and each comment should be scored according to its relative toxicity. Comments with a higher degree of toxicity should receive a higher numerical value than comments with a lower degree of toxicity.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Data Description</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>There is no training data for this competition. We can refer to previous Jigsaw competitions for data that might be useful to train models.<br>However, we are provided with a set of paired toxicity rankings (as per expert raters) that can be used to validate models.</p>

<h2 style='font-family: Segoe UI; font-weight: 400;'>Grading Metric</h2>
<p style='font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 15px'>Submissions are evaluated on <b>Average Agreement</b> with Annotators.<br>
For the ground truth, annotators were shown two comments and asked to identify which of the two was more toxic. Pairs of comments can be, and often are, rated by more than one annotator, and may have been ordered differently by different annotators.</p>
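
As a rough illustration of the metric, the sketch below computes the share of annotated pairs in which a model's score for the more_toxic comment is strictly higher than its score for the less_toxic comment. The function name `average_agreement`, the toy pairs and the toy scores are assumptions made for this example; the official scoring code may treat ties or other details differently.

```python
import pandas as pd


def average_agreement(pairs: pd.DataFrame, scores: dict) -> float:
    """Share of pairs where the more_toxic comment gets the higher score.

    `pairs` needs the columns less_toxic and more_toxic; `scores` maps a
    comment text to a numeric score (hypothetical helper, not official code).
    """
    less = pairs['less_toxic'].map(scores)
    more = pairs['more_toxic'].map(scores)
    return float((less < more).mean())


# Toy data: three annotated pairs over three made-up comments.
toy_pairs = pd.DataFrame({
    'less_toxic': ['a', 'a', 'b'],
    'more_toxic': ['b', 'c', 'a'],
})
toy_scores = {'a': 0.1, 'b': 0.5, 'c': 0.9}

agreement = average_agreement(toy_pairs, toy_scores)
print(f'Average agreement: {agreement:.3f}')
```

Here the scores agree with the first two pairs and disagree with the third, so the agreement is 2 out of 3. Because annotators sometimes order the same pair differently, a perfect score on such data is not reachable.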

<p style='background:MediumSeaGreen; border:0; color: white; text-align: center; font-family: Segoe UI; font-size: 1.5em; font-weight: 400; font-size: 24px'>If you found this notebook useful or use parts of it in your work, please don't forget to show your appreciation by upvoting this kernel. That keeps me motivated and inspires me to write and share such public kernels.<br>Thanks! 😊</p>

# Get GPU Info

In [None]:
!nvidia-smi

In [None]:
import pandas as pd  # data analysis library
import numpy as np  # numerical arrays and math routines
import matplotlib.pyplot as plt  # state-based plotting interface
import seaborn as sns  # statistical visualization
from tqdm import tqdm  # progress bar for iterators
import os  # operating system interface

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator  # word cloud building library

import warnings  # warning control
warnings.filterwarnings("ignore")

from collections import defaultdict  # dict that creates a default value (of the given type) for a missing key instead of raising KeyError

from itertools import cycle  # iterator that repeats a sequence endlessly
plt.style.use('ggplot')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

- validation_data.csv - This contains pairs of rankings not from comments_to_score. It gives us an idea of how the rankings were applied. We also can learn about the annotators from this dataset.
- comments_to_score.csv (aka test set) - for each comment text in this file, we need to rank the comments in order of toxicity.
- sample_submission.csv - a sample submission file.

In [None]:
# Look at the data names and size
!ls -Flash --color ../input/jigsaw-toxic-severity-rating/

In [None]:
val = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')
comments = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
ss = pd.read_csv('../input/jigsaw-toxic-severity-rating/sample_submission.csv')
print(f'Validation Data csv is of shape: {val.shape}')
print(f'Comments csv is of shape: {comments.shape}')
print(f'Sample submission csv is of shape: {ss.shape}')

In [None]:
print(f'Total workers involved in validation are => {len(val.worker.unique())}')

In [None]:
print(f'Less toxic unique comments => {len(val.less_toxic.unique())}')
print(f'More toxic unique comments => {len(val.more_toxic.unique())}')
# Series.append was removed in pandas 2.0, so pd.concat is used instead
print(f'Total unique comments in both columns => {len(pd.concat([val.more_toxic, val.less_toxic]).unique())}')

Total:
- there are more than 30,000 rows in the dataset (30,108 to be precise), so the total number of comments for analysis is 30,108 * 2 = 60,216
- unique less toxic comments: 11,532 out of 60,216
- unique more toxic comments: 11,678 out of 60,216
- total unique comments across both columns: 14,251 out of 60,216
- in total, 753 workers were involved in the validation; between them they divided the 60,216 comments into less or more toxic


In [None]:
lens=comments.text.str.len()

In [None]:
lens.hist(color='orange', figsize=(30, 10))

### Validation Data
In this dataset we have three columns. The worker column identifies the person who ordered the pair of comments. The two columns less_toxic and more_toxic show the comments as the worker ordered them.

### Comments most and least commonly ranked less_toxic and more_toxic

In [None]:
# Top 25 "Less Toxic" Comments.
val['less_toxic'].value_counts() \
    .to_frame().head(25)

In [None]:
# Top 25 "More Toxic" Comments.
val['more_toxic'].value_counts() \
    .to_frame().head(25)

### Consider the most common words of the more toxic and less toxic comments
Unigrams are single words in a sentence. It's the smallest unit of word measurement.

In [None]:
# In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is 
# a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, 
# letters, words or base pairs according to the application.

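def generate_ngrams(text, n_gram=1):
    # lowercase, split on spaces, then drop empty tokens and stop words
    tokens = [token for token in text.lower().split(' ') if token != '' and token not in STOPWORDS]
    # zip n shifted copies of the token list to get consecutive n-token windows
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]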

df =  val
N = 50  # N number of n-grams to visualize



less_toxic_unigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 1):
        less_toxic_unigrams[word] += 1
        
df_less_toxic_unigrams = pd.DataFrame(sorted(less_toxic_unigrams.items(), key=lambda x: x[1])[::-1])

unigrams_less_100 = df_less_toxic_unigrams[:N]

more_toxic_unigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 1):
        more_toxic_unigrams[word] += 1
        
df_more_toxic_unigrams = pd.DataFrame(sorted(more_toxic_unigrams.items(), key=lambda x: x[1])[::-1])

unigrams_more_100 = df_more_toxic_unigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=unigrams_less_100[0], x=unigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=unigrams_more_100[0], x=unigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common unigrams in less_toxic comments', fontsize=15)
axes[1].set_title(f'Top {N} most common unigrams in more_toxic comments', fontsize=15)

plt.show()

In addition to words, we see that there are often symbols that do not carry any semantic meaning.
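
One possible way to keep such symbols out of the counts is to strip them before tokenizing. The snippet below is a minimal sketch under that assumption; the helper name `strip_symbols` and the sample string are made up for illustration, and the regular expression is only one of many reasonable choices.

```python
import re


def strip_symbols(text):
    """Lowercase the text and replace everything except letters, digits
    and whitespace with a space (hypothetical helper for illustration)."""
    cleaned = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    # collapse the runs of whitespace left behind by the removed symbols
    return re.sub(r'\s+', ' ', cleaned).strip()


sample = '== Hello!! ==  "quoted" text...'
cleaned = strip_symbols(sample)
print(cleaned)
```

Note that this simple pattern also removes apostrophes and all non-ASCII letters, so it may be too aggressive for some comments.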

### bi-grams
Bi-grams are two consecutive words joined together. If we iterate through each word in a sentence, then the pair of that word and the next word is called a bi-gram.
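
To make the idea concrete, here is a small sketch on a made-up sentence. The helper `toy_ngrams` is a simplified stand-in written for this example: it uses the same zip-of-shifted-lists trick as `generate_ngrams` but skips the stop word filtering.

```python
def toy_ngrams(text, n):
    """Return all runs of n consecutive words in the text."""
    tokens = text.lower().split()
    # zip n copies of the token list, each shifted by one more position
    return [' '.join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]


sentence = 'you are not welcome here'
bigrams = toy_ngrams(sentence, 2)
trigrams = toy_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of five words gives four bi-grams and three tri-grams, because each n-gram starts one word later than the previous one.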

In [None]:
less_toxic_bigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 2):
        less_toxic_bigrams[word] += 1
        
df_less_toxic_bigrams = pd.DataFrame(sorted(less_toxic_bigrams.items(), key=lambda x: x[1])[::-1])

bigrams_less_100 = df_less_toxic_bigrams[:N]

more_toxic_bigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 2):
        more_toxic_bigrams[word] += 1
        
df_more_toxic_bigrams = pd.DataFrame(sorted(more_toxic_bigrams.items(), key=lambda x: x[1])[::-1])

bigrams_more_100 = df_more_toxic_bigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=bigrams_less_100[0], x=bigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=bigrams_more_100[0], x=bigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common bigrams in less_toxic comments', fontsize=15)
axes[1].set_title(f'Top {N} most common bigrams in more_toxic comments', fontsize=15)

plt.show()

### tri-grams
Similarly, tri-grams are 3 consecutive words in a sentence.

In [None]:
less_toxic_trigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 3):
        less_toxic_trigrams[word] += 1
        
df_less_toxic_trigrams = pd.DataFrame(sorted(less_toxic_trigrams.items(), key=lambda x: x[1])[::-1])

trigrams_less_100 = df_less_toxic_trigrams[:N]

more_toxic_trigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 3):
        more_toxic_trigrams[word] += 1
        
df_more_toxic_trigrams = pd.DataFrame(sorted(more_toxic_trigrams.items(), key=lambda x: x[1])[::-1])

trigrams_more_100 = df_more_toxic_trigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(30, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=trigrams_less_100[0], x=trigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=trigrams_more_100[0], x=trigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common trigrams in less_toxic comments', fontsize=35)
axes[1].set_title(f'Top {N} most common trigrams in more_toxic comments', fontsize=35)

plt.show()

### Comment occurrence in the validation set
How often do comments appear in the validation set? What is the distribution, and what are the most and least frequently occurring comments?

Some things to note:

- Comments tend to occur in multiples of 3 (3, 6, 9, etc.).
- Most workers only score a small number of comments. However, there are workers who score far more than the rest of the population (200+ pairs).
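
The first observation can be checked numerically as well as visually. The sketch below shows one way to do it on a toy Series; the data is invented, and in the notebook the same two lines could be applied to the combined less_toxic and more_toxic columns instead.

```python
import pandas as pd

# Toy stand-in for the combined comment columns: 'a' and 'c' appear 3 times,
# 'b' appears 6 times and 'd' appears 4 times.
toy_comments = pd.Series(['a'] * 3 + ['b'] * 6 + ['c'] * 3 + ['d'] * 4)

# How often each distinct comment occurs
counts = toy_comments.value_counts()

# Share of distinct comments whose occurrence count is a multiple of 3
share_multiple_of_3 = (counts % 3 == 0).mean()
print(f'Share of comments occurring a multiple of 3 times: {share_multiple_of_3:.2f}')
```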

In [None]:
all_comments = pd.concat([val['less_toxic'],
                          val['more_toxic']]) \
    .reset_index(drop=True)

ax = pd.DataFrame(index=range(1,19)) \
    .merge(all_comments.value_counts() \
           .value_counts().to_frame(),
           left_index=True, right_index=True, how='outer').fillna(0) \
    .astype('int').rename(columns={0:'Comment Frequency'}) \
    .plot(kind='bar',
          figsize=(12, 5))
plt.xticks(rotation=0)
ax.set_title('Comment Frequency in Val Dataset', fontsize=20)
ax.set_xlabel('Comment Occurrence')
ax.set_ylabel('Number of Comments')
ax.legend().remove()
plt.show()

In [None]:
ax = val['worker'].value_counts() \
    .plot(kind='hist', bins=50,
          color=color_pal[1], figsize=(30, 20))
ax.set_title('Frequency of Workers in Val Set', fontsize=20)
ax.set_xlabel('Rows in Validation set for a Worker')

So, the distribution tells us that most workers scored some 1-20 comment pairs but there were also some workers who did upwards of 200 pairs!
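
The histogram can be backed up with a couple of numbers. The following sketch uses a toy worker column (invented ids and counts) to show how the share of light workers and the list of heavy workers could be computed; the thresholds of 20 and 200 simply mirror the ranges mentioned above.

```python
import pandas as pd

# Toy stand-in for val['worker']: worker 1 has 5 rows, worker 2 has 15 rows
# and worker 3 has 250 rows.
toy_workers = pd.Series([1] * 5 + [2] * 15 + [3] * 250)

# Number of rows (scored pairs) per worker
rows_per_worker = toy_workers.value_counts()

# Share of workers with at most 20 pairs, and the ids of workers with 200+
share_light = (rows_per_worker <= 20).mean()
heavy_workers = rows_per_worker[rows_per_worker >= 200].index.tolist()

print(f'Share of workers with at most 20 pairs: {share_light:.2f}')
print(f'Workers with 200+ pairs: {heavy_workers}')
```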

In [None]:
# The most commonly occurring comments.
# Naming the column in to_frame avoids relying on the default column name,
# which differs between pandas versions.
all_comments.value_counts() \
    .to_frame('Total Comment Count') \
    .head()

In [None]:
# The least common comments.
all_comments.value_counts() \
    .to_frame('Total Comment Count') \
    .tail()

### Repeated Pairs in Validation Set
How much do workers agree or disagree?

Comment pairs occur in the same order 1, 2 or 3 times, but never more.
When we take the comments and undo the ordering (sort them alphabetically), we find that the pairs almost always occur 3 times.
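
The difference between ordered and unordered pairs is easiest to see on a tiny example. In the sketch below, three invented workers rate the same pair of comments and one of them disagrees; the data and variable names are made up for illustration.

```python
import pandas as pd

# Three workers rate the same two comments; worker 3 orders them the other way.
toy = pd.DataFrame({
    'worker': [1, 2, 3],
    'less_toxic': ['A', 'A', 'B'],
    'more_toxic': ['B', 'B', 'A'],
})

# Ordered key: keeps the direction chosen by the worker
ordered = toy['less_toxic'] + ' : ' + toy['more_toxic']

# Unordered key: sort the two comments so the direction no longer matters
unordered = toy[['less_toxic', 'more_toxic']] \
    .apply(lambda row: ' : '.join(sorted(row)), axis=1)

ordered_counts = ordered.value_counts().to_dict()
unordered_counts = unordered.value_counts().to_dict()
print(ordered_counts)
print(unordered_counts)
```

The ordered key splits the three ratings into a group of two and a group of one, while the unordered key shows that all three rows describe the same pair.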

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 10), sharey=True)
val['comment_pair_ordered'] = val['less_toxic'] + ' : ' + val['more_toxic']
# The most common pair
val['comment_pair_ordered'] \
    .value_counts().value_counts() \
    .plot(kind='bar', title='Ordered Comment Pairs',
          color=color_pal[4], ax=ax1)
ax1.tick_params(axis='x', rotation=0)
ax1.set_ylabel('Occurrence')
ax1.set_xlabel('Number of times Pair is Found in Dataset')


# Comment Pairs in a standard alphabetical order
val['comment_pair_not_ordered'] = val[['less_toxic','more_toxic']] \
    .apply(lambda x: ':'.join(np.sort(list(x))), axis=1)
val['comment_pair_not_ordered'].value_counts().value_counts() \
    .sort_index() \
    .plot(kind='bar', title='Unordered Comment Pairs', ax=ax2,
          color=color_pal[5])
ax2.tick_params(axis='x', rotation=0)
ax2.set_xlabel('Number of times Unordered Pair is Found in Dataset')
plt.show()

### Comments to Grade
Do they appear in the validation data? Yes: 100% of the public comments to score also appear in the validation data. As seen above, most comment pairs occur three times in the validation data, and a smaller share occur once.

In [None]:
comments['text'].isin(all_comments).mean()

### Where do labelers disagree the most?
We now know that pairs occur three times in the validation dataset. This leads us to ask the question... are there any "workers" who disagree more than others?

We can create a new column n_agreements to see, for each row, how many of the three workers gave the same order for the given pair.

In [None]:
val_order_dict = val['comment_pair_ordered'].value_counts().to_dict()
val['n_agreements'] = val['comment_pair_ordered'].map(val_order_dict)

In [None]:
val['agreement'] = val['n_agreements'].map({1: 'Reviewer Disagreed',
                         2: 'Agreed with One Reviewer',
                         3: 'All Three Reviewers Agreed'})
ax = val['agreement'].value_counts().plot(kind='bar', color=color_pal[5],
                                         figsize=(30, 10))
ax.tick_params(axis='x', rotation=0)
ax.set_title('Worker Agreement', fontsize=16)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 20))
# Reviewers with the most disagreements
val.query('n_agreements == 1')['worker'].value_counts(ascending=True) \
    .tail(20) \
    .plot(kind='barh', title='Reviewers with the Most Disagreements', ax=ax1)

# Reviewers with the most agreements
val.query('n_agreements == 3')['worker'].value_counts(ascending=True) \
    .tail(60) \
    .plot(kind='barh', title='Reviewers with the Most Agreements', ax=ax2,
         color=color_pal[1])
plt.show()

### Let's look at disagreement count vs. total label reviews

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 15))

# Total number of rows (reviews) per worker. Naming each Series explicitly
# avoids relying on merge suffixes, which depend on the pandas version.
n_reviews = val['worker'].value_counts().rename('Number of Reviews')

# Rows where the worker's ordering matched no other worker (n_agreements == 1)
n_reviews.to_frame().merge(
    val.query('n_agreements == 1')['worker'].value_counts()
        .rename('Number of Disagreements').to_frame(),
    left_index=True, right_index=True
).plot(x='Number of Reviews', y='Number of Disagreements',
       kind='scatter', title='Worker Reviews vs Disagreements', ax=ax1)

# Rows where all three workers gave the same ordering (n_agreements == 3)
n_reviews.to_frame().merge(
    val.query('n_agreements == 3')['worker'].value_counts()
        .rename('Number of Agreements').to_frame(),
    left_index=True, right_index=True
).plot(x='Number of Reviews', y='Number of Agreements',
       kind='scatter', title='Worker Reviews vs Agreements', ax=ax2, color=color_pal[2])

plt.show()

## Wordclouds of Toxic and Non-Toxic Comments.

In [None]:
non_toxic_comments = val['less_toxic'].value_counts() \
    .to_frame().head(1000)
non_toxic_text = ' '.join(non_toxic_comments.index.tolist())

toxic_comments = val['more_toxic'].value_counts() \
    .to_frame().head(1000)
toxic_text = ' '.join(toxic_comments.index.tolist())


wordcloud = WordCloud(max_font_size=50, max_words=100,width=500, height=500,
                      background_color="white") \
    .generate(non_toxic_text)


wordcloud2 = WordCloud(max_font_size=50, max_words=100,width=500, height=500,
                      background_color="black") \
    .generate(toxic_text)


fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(30,20))

ax1.imshow(wordcloud, interpolation="bilinear")
ax1.axis("off")
ax2.imshow(wordcloud2, interpolation="bilinear")
ax2.axis("off")
ax1.set_title('Non Toxic Comments', fontsize=25)
ax2.set_title('Toxic Comments', fontsize=25)
plt.show()

In [None]:
import requests
from io import BytesIO
from PIL import Image
try:
    url="https://user-images.githubusercontent.com/74188336/142692890-641ebc21-2e47-4556-9d37-1c0b9e1a0587.jpeg"
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))

    text = ' '.join(df['less_toxic'].values)
    mask = np.array(img)
    wordcloud = WordCloud(max_font_size=50, max_words=1000, background_color="white", mask=mask, colormap='BuGn').generate(text.lower())
    plt.figure(figsize=(15,15))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
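except Exception as e:
    # Report the failure (e.g. no internet access) instead of hiding it
    print(f'Could not build the masked word cloud: {e}')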

In [None]:
try:
    text = ' '.join(df['more_toxic'].values)
    url="https://user-images.githubusercontent.com/74188336/142692894-c17240e4-1101-4591-9d10-71793e460816.jpeg"
    
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))

    mask = np.array(img)
    wordcloud = WordCloud(max_font_size=50, max_words=2000, background_color="white", mask=mask, contour_width=0, contour_color='grey', colormap='Reds').generate(text.lower())
    plt.figure(figsize=(15,15))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
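except Exception as e:
    # Report the failure (e.g. no internet access) instead of hiding it
    print(f'Could not build the masked word cloud: {e}')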

Total:
- there are more than 30,000 rows in the dataset (30,108 to be precise), so the total number of comments for analysis is 30,108 * 2 = 60,216
- unique less toxic comments: 11,532 out of 60,216
- unique more toxic comments: 11,678 out of 60,216
- total unique comments across both columns: 14,251 out of 60,216
- in total, 753 workers were involved in the validation; between them they divided the 60,216 comments into less or more toxic
- the text contains many unnecessary symbols that play no role in determining toxicity, such as quotes or the equals sign
- pairs typically occur three times in the validation dataset. Moreover, the same comments are found both in repeated pairs and as part of other pairs
- at the same time, different workers who rated the same pair of comments did not always give the same ordering, which reflects the subjectivity of the workers' assessment of toxicity
- the distribution tells us that most workers scored some 1-20 comment pairs, but there were also some workers who did upwards of 200 pairs

All this suggests that, before training our model, careful pre-processing of the text is necessary.

Thanks a lot for sticking along and taking the time to read this. Do let me know if something needs to be corrected, and feel free to drop a comment.