# Toxicity and Sentiment extraction

In this notebook we clean the review text and then extract sentiment and toxicity using tools built for social media comments.

Install necessary packages:

In [1]:
%%capture
!pip install detoxify
!pip install vaderSentiment


Import all necessary packages:

In [2]:
from google.colab import drive
import pandas as pd

In [3]:
   
import re
import string
from string import ascii_lowercase

import nltk
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Read the source data from Google Drive:

In [4]:
drive.mount('/content/drive')
file_location = '/content/drive/My Drive/datasets/toxicity/' 
file_name='user_reviews_clean.csv'
user_reviews = pd.read_csv(file_location+file_name, lineterminator='\n')
user_reviews.drop(columns=["Unnamed: 0"], inplace=True)
user_reviews.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,item,review,uid,month_day,recommend_flag,review_length
0,1250,Simple yet with great replayability. In my opi...,0,November 5,1,249
1,22200,It's unique and worth a playthrough.,0,July 15,1,36
2,43110,Great atmosphere. The gunplay can be a bit chu...,0,April 21,1,182
3,251610,I know what you think when you see this title ...,1,June 24,1,566
4,227300,For a simple (it's actually not all that simpl...,1,September 8,1,590


Remove duplicates:

In [5]:
user_reviews.drop_duplicates(inplace=True)
user_reviews.duplicated().sum()

0

Include the custom script text_preprocess.py:

In [6]:
#text preprocessing
import sys
sys.path.insert(0,'/content/drive/My Drive/NLP Project/scripts')

import text_preprocess as tp

In [7]:
user_reviews.dtypes

item               int64
review            object
uid                int64
month_day         object
recommend_flag     int64
review_length      int64
dtype: object

Recode insults to their original form using the clean_text function from text_preprocess:

In [8]:
user_reviews["text"] = user_reviews["review"].apply(lambda x: tp.clean_text(str(x)))
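The text_preprocess.py script is not shown in this notebook, so the block below is only a guess at what clean_text might do, inferred from the cleaned output further down (lowercase text with punctuation replaced by spaces). The function name clean_text_sketch and the OBFUSCATED mapping are hypothetical and exist purely for illustration.

```python
import re

# Hypothetical mapping of obfuscated spellings to plain words. The real
# text_preprocess.py may use a different and much larger list.
OBFUSCATED = {
    r"\bst\*pid\b": "stupid",
    r"\bidi0t\b": "idiot",
}


def clean_text_sketch(text: str) -> str:
    """Assumed cleaning steps: lowercase, recode insults, drop punctuation."""
    text = text.lower()
    # Recode masked insults before punctuation is stripped, otherwise the
    # masking characters (such as '*') would already be gone.
    for pattern, plain in OBFUSCATED.items():
        text = re.sub(pattern, plain, text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


example = clean_text_sketch("It's unique and worth a playthrough.")
```

On the second review in the dataset this sketch yields the same cleaned string that appears in the results table, but that does not prove the real script works this way.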

Infer toxicity values using the Detoxify model:

In [9]:
from detoxify import Detoxify

classificator = Detoxify('unbiased')

# Copy the selected columns so that new columns are added to an independent DataFrame
classification_results = user_reviews.loc[:, user_reviews.columns.intersection(['uid', 'item', 'text'])].copy()
classification_results['toxicity'] = classification_results["text"].apply(lambda x: classificator.predict(x)['toxicity'])
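Scoring one review per call is slow on tens of thousands of rows. Detoxify's predict also accepts a list of strings and then returns a list of scores per label, so the texts can be scored in chunks. The helper below is a minimal sketch of that idea. The names predict_in_batches and fake_predict are hypothetical, and fake_predict only stands in for the real model so the block runs without downloading weights.

```python
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[List[str]], Dict[str, List[float]]],
    key: str = "toxicity",
    batch_size: int = 32,
) -> List[float]:
    """Score texts in fixed-size chunks and return one score per text."""
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # predict_fn is expected to return {label: [score, ...]} for a list input.
        scores.extend(predict_fn(batch)[key])
    return scores


def fake_predict(batch: List[str]) -> Dict[str, List[float]]:
    """Stand-in predictor (NOT Detoxify): scores grow with text length."""
    return {"toxicity": [min(len(text) / 100, 1.0) for text in batch]}


demo_scores = predict_in_batches(
    ["good game", "meh", "wow just wow"], fake_predict, batch_size=2
)
```

In this notebook the call would presumably look like predict_in_batches(classification_results["text"].tolist(), classificator.predict). A suitable batch_size depends on the available memory.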

In [10]:
classification_results.head()

Unnamed: 0,item,uid,text,toxicity
0,1250,0,simple yet with great replayability in my opin...,0.001871
1,22200,0,it s unique and worth a playthrough,0.000363
2,43110,0,great atmosphere the gunplay can be a bit chun...,0.000975
3,251610,1,i know what you think when you see this title ...,0.001804
4,227300,1,for a simple it s actually not all that simple...,0.002723


Round the raw toxicity score to a binary toxicity_flag (scores above 0.5 become 1):

In [11]:
classification_results['toxicity_flag']= classification_results['toxicity'].apply(lambda x: round(x))
classification_results['toxicity_flag'].value_counts()

0    54928
1     3503
Name: toxicity_flag, dtype: int64
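Rounding is equivalent to a fixed cutoff of 0.5. Writing the cutoff explicitly makes it easier to tune later. The sketch below assumes a hypothetical helper named toxicity_flag and uses three scores taken from the tables in this notebook.

```python
def toxicity_flag(score: float, threshold: float = 0.5) -> int:
    """Return 1 when the score is strictly above the threshold, else 0."""
    return int(score > threshold)


# Scores copied from the result tables in this notebook.
flags = [toxicity_flag(score) for score in (0.001871, 0.34814, 0.788883)]
```

The comparison is strict because Python's round(0.5) evaluates to 0, so a score of exactly 0.5 is not flagged in either version.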

VADER is a rule-based sentiment analyzer that scores text against a sentiment lexicon. I chose it because it is tuned for social media content.

In [13]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sid_obj = SentimentIntensityAnalyzer()

def sentiment_vader(sentence):
    # Score the sentence with the shared analyzer created above.
    sentiment_dict = sid_obj.polarity_scores(sentence)
    compound = sentiment_dict['compound']

    # Map the compound score to -1 / 0 / 1 using cutoffs of +/-0.05.
    if compound >= 0.05:
        overall_sentiment = 1
    elif compound <= -0.05:
        overall_sentiment = -1
    else:
        overall_sentiment = 0

    return overall_sentiment
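The labeling rule in sentiment_vader depends only on the compound score, so it can be separated from the analyzer and checked without vaderSentiment installed. The sketch below assumes a hypothetical helper named label_from_compound with the same cutoffs of 0.05 and -0.05.

```python
def label_from_compound(compound: float, cutoff: float = 0.05) -> int:
    """Map a compound score in [-1, 1] to 1 (positive), -1 (negative) or 0."""
    if compound >= cutoff:
        return 1
    if compound <= -cutoff:
        return -1
    return 0


# Boundary values land on the positive and negative side respectively.
labels = [label_from_compound(c) for c in (0.05, 0.049, 0.0, -0.05)]
```

With this split, sentiment_vader could reduce to a call such as label_from_compound(sid_obj.polarity_scores(sentence)['compound']).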



Obtain sentiment_flag using VADER:

In [14]:
classification_results["sentiment_flag"] = classification_results["text"].apply(lambda x: sentiment_vader(x))

In [15]:
classification_results['sentiment_flag'].value_counts()

 1    37097
 0    11854
-1     9480
Name: sentiment_flag, dtype: int64

Check a random item in the final report:

In [18]:
classification_results[classification_results['item']==212070]

Unnamed: 0,item,uid,text,toxicity,toxicity_flag,sentiment_flag
149,212070,68,i s freaking rad and the ingame graphics are w...,0.788883,1,-1
4748,212070,1887,if you like space battle s with exploding ship...,0.006995,0,0
8822,212070,3453,i love it ah,0.000533,0,1
9904,212070,3891,graphics are amazin,0.000618,0,0
12096,212070,4757,good if you like war thunder and space games a...,0.007146,0,-1
13934,212070,5470,star conflict is a action sci fi space shooter...,0.000668,0,1
16848,212070,6700,it s a realy fantastic game with a wide variet...,0.34814,0,1
18561,212070,7363,wow just wow,0.000544,0,1
19475,212070,7689,getting over the controls was tough lots of pr...,0.000489,0,1
20218,212070,7975,great game it s a game i really love playing it,0.000634,0,1


In [19]:
classification_results.to_csv(file_location+"classification_results.csv")
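By default to_csv also writes the DataFrame index, and that is where a column named "Unnamed: 0" comes from when the file is read back. The sketch below shows the round trip on a tiny in-memory frame. The two rows are copied from the table above and the variable names are illustrative only.

```python
import io

import pandas as pd

df = pd.DataFrame({"item": [1250, 22200], "toxicity": [0.001871, 0.000363]})

# Write to an in-memory buffer instead of Google Drive for the demonstration.
buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored instead
```

Either passing index_col=0 when reading or index=False when writing avoids the extra column. Which one fits depends on whether the row index is needed downstream.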