# Data Exploration

In this notebook we work with the outputs of the previous notebook (Toxicity_and_Sentiment_extraction) and explore the relationships between variables in the data and the classes predicted by the toxicity and sentiment models.

In [28]:
import pandas as pd
import numpy as np
from google.colab import drive
from scipy.stats import pearsonr

drive.mount('/content/drive')
file_location = '/content/drive/My Drive/datasets/toxicity/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Research questions:

1. Which are the most toxic games with more than $30$ reviews?
2. Is there a correlation between negative sentiment and toxicity?
3. Is there a correlation between recommendation and toxicity?
4. How well does the sentiment of a review's text predict whether the game is recommended?

In [29]:
classification_results=pd.read_csv(file_location+"classification_results.csv", lineterminator="\n")
classification_results.drop(columns=["Unnamed: 0"], inplace=True)
classification_results.head()

,item,uid,text,toxicity,sentiment_flag,toxicity_flag
0,1250,0,simple yet with great replayability in my opin...,0.001871,1,0
1,22200,0,it s unique and worth a playthrough,0.000363,1,0
2,43110,0,great atmosphere the gunplay can be a bit chun...,0.000975,1,0
3,251610,1,i know what you think when you see this title ...,0.001804,1,0
4,227300,1,for a simple it s actually not all that simple...,0.002723,1,0


In [30]:
classification_results.dropna(inplace=True)
classification_results.isnull().sum()


item              0
uid               0
text              0
toxicity          0
sentiment_flag    0
toxicity_flag     0
dtype: int64

In [31]:
user_reviews=pd.read_csv(file_location+"user_reviews_clean.csv", lineterminator="\n")
user_reviews = user_reviews.loc[:, user_reviews.columns.intersection(['item','uid', 'recommend_flag'])]
user_reviews.head()


,item,uid,recommend_flag
0,1250,0,1
1,22200,0,1
2,43110,0,1
3,251610,1,1
4,227300,1,1


In [32]:
classification_results = classification_results.merge(user_reviews, on=["item","uid"])
classification_results.head()

,item,uid,text,toxicity,sentiment_flag,toxicity_flag,recommend_flag
0,1250,0,simple yet with great replayability in my opin...,0.001871,1,0,1
1,22200,0,it s unique and worth a playthrough,0.000363,1,0,1
2,43110,0,great atmosphere the gunplay can be a bit chun...,0.000975,1,0,1
3,251610,1,i know what you think when you see this title ...,0.001804,1,0,1
4,227300,1,for a simple it s actually not all that simple...,0.002723,1,0,1


We encode positive sentiment as positivity = $1$ and negative or neutral sentiment as $0$, then check whether positivity correlates significantly with the recommendation flag:

In [33]:
def encode_positivity(sentiment):
  if sentiment==1:
    return 1
  else:
    return 0

classification_results["positivity"] = classification_results["sentiment_flag"].apply(lambda x: encode_positivity(x))



Check the correlation between positive sentiment and recommendation:

In [34]:
correlation, p_value = pearsonr(classification_results["positivity"], classification_results["recommend_flag"])

print("The correlation between positivity and a recommendation is " + str(correlation) + " with a significance level of " + str(p_value) + "<0.05 indicating a significant result.")

The correlation between positivity and a recommendation is 0.20288329982760742 with a significance level of 0.0<0.05 indicating a significant result.


RQ4: The correlation between positivity and recommendation is $0.202883$ with a p-value reported as $0.0$ (below $0.05$), indicating a statistically significant result. A correlation of ~$0.20$ indicates a weak positive relationship.
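A correlation coefficient can be hard to read as "predictiveness". One complementary view, sketched here on made-up toy data (the `toy` frame and its values are illustrative assumptions, not the notebook's data), is to compare the share of recommended reviews within each positivity group:

```python
import pandas as pd

# Toy data standing in for classification_results (values are made up).
toy = pd.DataFrame({
    "positivity":     [1, 1, 1, 1, 0, 0, 0, 0],
    "recommend_flag": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Share of recommended reviews within each positivity group.
recommend_rate = toy.groupby("positivity")["recommend_flag"].mean()
print(recommend_rate)
```

The gap between the two group rates shows how much the recommendation share shifts with positive sentiment, which may be easier to interpret than the coefficient alone.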

We encode negative sentiment as negativity = $1$ and positive or neutral sentiment as $0$, then check whether negativity correlates significantly with the toxicity flag:

In [35]:
def encode_negativity(sentiment):
  if sentiment==-1:
    return 1
  else:
    return 0

classification_results["negativity"] = classification_results["sentiment_flag"].apply(lambda x: encode_negativity(x))
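The two encoders can also be written as vectorized comparisons. This is a minimal sketch on a toy frame, assuming `sentiment_flag` takes the values $-1$, $0$ and $1$ (the neutral value $0$ is an assumption; only $1$ and $-1$ are tested in the cells above):

```python
import pandas as pd

# Toy frame; the sentiment values are illustrative.
toy = pd.DataFrame({"sentiment_flag": [1, 0, -1, 1, -1]})

# A boolean comparison cast to int gives the same 0/1 encoding as the helpers.
toy["positivity"] = (toy["sentiment_flag"] == 1).astype(int)
toy["negativity"] = (toy["sentiment_flag"] == -1).astype(int)
print(toy)
```

This avoids a Python-level function call per row, which can matter on large review tables.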

Check the correlation between negative sentiment and toxicity:

In [36]:
correlation, p_value = pearsonr(classification_results["negativity"], classification_results["toxicity_flag"])

print("The correlation between negativity and toxicity is " + str(correlation) + " with a significance level of " + str(p_value) + "<0.05 indicating a significant result.")

The correlation between negativity and toxicity is 0.2454663340281999 with a significance level of 0.0<0.05 indicating a significant result.


RQ2: The correlation between negativity and toxicity is $0.245466$ with a p-value reported as $0.0$ (below $0.05$), indicating a statistically significant result. A correlation of ~$0.25$ indicates a weak positive relationship.
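Because both variables are binary flags, the Pearson correlation here coincides with the phi coefficient of the $2 \times 2$ contingency table. The sketch below checks this on made-up toy data (the `toy` values are illustrative assumptions, not the notebook's data):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy binary flags (values are made up).
toy = pd.DataFrame({
    "negativity":    [1, 1, 1, 0, 0, 0, 0, 0],
    "toxicity_flag": [1, 1, 0, 1, 0, 0, 0, 0],
})

# 2x2 contingency table: rows are negativity, columns are toxicity_flag.
table = pd.crosstab(toy["negativity"], toy["toxicity_flag"])
n00, n01 = table.loc[0, 0], table.loc[0, 1]
n10, n11 = table.loc[1, 0], table.loc[1, 1]

# Phi coefficient computed from the cell counts and the marginal totals.
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10)
)

r, _ = pearsonr(toy["negativity"], toy["toxicity_flag"])
print(phi, r)
```

Inspecting the contingency table alongside the coefficient shows which cells drive the association.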


Check the correlation between toxicity and recommendation:

In [37]:
correlation, p_value = pearsonr(classification_results["toxicity_flag"], classification_results["recommend_flag"])

print("The correlation between toxicity and recommendation is " + str(correlation) + " with a significance level of " + str(p_value) + "<0.05 indicating a significant result.")

The correlation between toxicity and recommendation is -0.09594413156700413 with a significance level of 2.0127196943827242e-119<0.05 indicating a significant result.


RQ3: The correlation between toxicity and recommendation is $-0.095944$ with a p-value of about $2.01 \times 10^{-119} < 0.05$, indicating a statistically significant result. A correlation of ~$-0.10$ indicates a very weak negative relationship.


In [38]:
# Count reviews per item (using only the item and uid columns) and keep items with more than 30 reviews
counts = classification_results.loc[:, classification_results.columns.intersection(['item', 'uid'])].groupby(['item']).count()
significant_items = counts[counts["uid"]>30].index
significant_items = np.array(significant_items)
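An alternative way to express the same filter is to attach the per-item review count to every row and filter on it directly. This is a sketch on toy data; `min_reviews` is a hypothetical name, and the toy threshold of 2 stands in for the 30 used in this notebook:

```python
import pandas as pd

# Toy reviews: item 10 has three reviews, item 20 has two, item 30 has one.
toy = pd.DataFrame({
    "item": [10, 10, 10, 20, 20, 30],
    "uid":  [0, 1, 2, 0, 1, 0],
})
min_reviews = 2  # hypothetical threshold for the toy data; the notebook uses 30

# transform("size") returns the group size aligned to the original rows.
reviews_per_item = toy.groupby("item")["uid"].transform("size")
filtered = toy[reviews_per_item > min_reviews]
print(filtered)
```

This skips the intermediate array of item ids, at the cost of computing one count per row.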

In [39]:
significant_results = classification_results[classification_results["item"].isin(significant_items)]
significant_results.head()

,item,uid,text,toxicity,sentiment_flag,toxicity_flag,recommend_flag,positivity,negativity
0,1250,0,simple yet with great replayability in my opin...,0.001871,1,0,1,1,0
2,43110,0,great atmosphere the gunplay can be a bit chun...,0.000975,1,0,1,1,0
4,227300,1,for a simple it s actually not all that simple...,0.002723,1,0,1,1,0
5,239030,1,very fun little game to play when your bored o...,0.004443,1,0,1,1,0
6,248820,2,a suitably punishing roguelike platformer winn...,0.001388,1,0,1,1,0


In [40]:
summary_table = pd.DataFrame()
summary_table["toxic_count"] = significant_results.loc[:, significant_results.columns.intersection(['item','toxicity_flag'])].groupby(['item'])['toxicity_flag'].agg('sum').sort_values(ascending=False)
summary_table.head(20)

item,toxic_count
730,258
440,148
4000,90
570,80
221100,78
252490,76
218620,68
304930,64
550,60
49520,36


Load item_id to item name mapping:

In [41]:
item_id2item_map = pd.read_csv(file_location+"item_id2item_map.csv", names=["item","name"])
item_id2item_map.head()

,item,name
0,10,Counter-Strike
1,20,Team Fortress Classic
2,30,Day of Defeat
3,40,Deathmatch Classic
4,50,Half-Life: Opposing Force


RQ1: Get the top 20 games by raw toxic comment count, with game names attached:

In [42]:
top_20_raw_toxicity = summary_table.merge(item_id2item_map, on="item")
top_20_raw_toxicity.head(20)

,item,toxic_count,name
0,730,258,Counter-Strike: Global Offensive
1,4000,90,Garry's Mod
2,221100,78,DayZ
3,252490,76,Rust
4,218620,68,PAYDAY 2
5,304930,64,Unturned
6,550,60,Left 4 Dead 2
7,49520,36,Borderlands 2
8,72850,35,The Elder Scrolls V: Skyrim
9,208090,32,Loadout
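Items 440 and 570 appear in the toxic-count table but not in the named table, which is consistent with the default inner merge dropping ids that are absent from the mapping file. A left merge with `indicator=True` makes such ids visible. This is a sketch on a small excerpt of the values shown above, with the mapping deliberately missing item 440 (the frame names `summary` and `names` are illustrative):

```python
import pandas as pd

# Excerpt of the toxic counts shown above.
summary = pd.DataFrame({"item": [730, 440, 4000], "toxic_count": [258, 148, 90]})

# Mapping excerpt that lacks item 440, to mimic a missing id.
names = pd.DataFrame({
    "item": [730, 4000],
    "name": ["Counter-Strike: Global Offensive", "Garry's Mod"],
})

# how="left" keeps every summary row; _merge records where each row came from.
merged = summary.merge(names, on="item", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "item"].tolist()
print(unmatched)
```

Rows flagged `left_only` keep their counts and get a missing `name`, so they can be reviewed instead of silently disappearing.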


Obtain the review counts per item as the basis for the ratio of toxic comments to all comments. With raw counts, games that are played by more people get disproportionately high representation:

In [43]:
review_counts = pd.DataFrame()
review_counts["review_count"] = significant_results.groupby('item')['toxicity_flag'].count()
review_counts

item,review_count
10,56
70,61
220,214
240,237
300,32
...,...
391540,223
413150,98
417860,171
427520,35


In [44]:
summary_table = summary_table.merge(review_counts, on="item")
summary_table["toxicity_ratio"] = summary_table["toxic_count"] / summary_table["review_count"]
summary_table.head()

item,toxic_count,review_count,toxicity_ratio
730,258,3701,0.069711
440,148,3689,0.040119
4000,90,1704,0.052817
570,80,1541,0.051914
221100,78,729,0.106996
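The ratio is simply toxic_count divided by review_count; for item 730, $258 / 3701 \approx 0.0697$. Since `toxicity_flag` is a 0/1 flag, the ratio is also the group mean, so all three columns can be computed in one grouped aggregation. This is a sketch on made-up toy data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Toy reviews: item 1 has one toxic review out of four, item 2 has two out of two.
toy = pd.DataFrame({
    "item":          [1, 1, 1, 1, 2, 2],
    "toxicity_flag": [1, 0, 0, 0, 1, 1],
})

# Named aggregation: sum counts toxic reviews, count counts all reviews,
# and mean of a 0/1 flag equals toxic_count / review_count.
per_item = toy.groupby("item")["toxicity_flag"].agg(
    toxic_count="sum", review_count="count", toxicity_ratio="mean"
)
print(per_item)
```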


In [45]:
summary_table = summary_table.merge(item_id2item_map, on="item")

RQ1: Get the top 20 games by toxicity ratio:

In [46]:
summary_table.sort_values("toxicity_ratio", ascending=False).head(20)

,item,toxic_count,review_count,toxicity_ratio,name
86,334230,8,34,0.235294,Town of Salem
55,63380,11,62,0.177419,Sniper Elite V2
18,265630,22,129,0.170543,Fistful of Frags
92,339800,7,43,0.162791,HuniePop
98,242680,6,41,0.146341,Nuclear Throne
73,70000,9,65,0.138462,Dino D-Day
41,233720,13,94,0.138298,Surgeon Simulator
50,223470,12,89,0.134831,POSTAL 2
95,317360,7,52,0.134615,Double Action: Boogaloo
77,374570,9,67,0.134328,Kung Fury
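Most of the games shown at the top of this ranking have fewer than 100 reviews, so their ratios rest on a handful of toxic comments. One way to gauge that uncertainty, which is not part of the original analysis, is a Wilson score interval for each ratio. The sketch below uses a hypothetical helper `wilson_interval` and counts taken from the tables above:

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the tables above: (toxic_count, review_count).
small_sample = wilson_interval(8, 34)      # Town of Salem
large_sample = wilson_interval(258, 3701)  # item 730
print(small_sample, large_sample)
```

A much wider interval for the small sample suggests that a ranking by point estimate alone may overstate how toxic the least-reviewed games are.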
