# Data Processing

Install necessary packages:


In [1]:
!pip install demjson

Collecting demjson
  Downloading demjson-2.2.4.tar.gz (131 kB)

Import the necessary packages:

In [2]:
from google.colab import drive

import pandas as pd
from demjson import decode
import csv
import os

Mount Google Drive and read the lines of the pseudo-JSON source file:

In [3]:
drive.mount('/content/drive')
file_location = '/content/drive/My Drive/datasets/toxicity' 
file_name = 'australian_user_reviews_clean.json'

file_address = os.path.join(file_location,file_name)
with open(file_address, 'r',errors="ignore") as file_used:
    Lines = file_used.readlines()

Mounted at /content/drive


Decode each line into a dictionary and build a pandas data frame with one row per review:

In [4]:
rows = []
for line in Lines:
    data_dict = decode(line)
    data_row = data_dict['reviews']
    name_row = data_dict['user_id']
    for row in data_row:
        row_mod = {'Name':name_row, 'item': row['item_id'], 
                   'recommend': row['recommend'], 'posted': row['posted'], 'review': row['review']}
        rows.append(row_mod)
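
The flattening step above can be sketched on in-memory data. This is a minimal illustration, not the notebook's method: it swaps demjson's decode for the standard library's ast.literal_eval, which only works under the assumption that each line is a valid Python literal. The sample lines and the helper name flatten_reviews are hypothetical.

```python
import ast

# Hypothetical sample lines; the real file's exact syntax is an assumption.
sample_lines = [
    "{'user_id': 'u1', 'reviews': [{'item_id': '10', 'recommend': True, "
    "'posted': 'Posted July 15, 2011.', 'review': 'Fun.'}]}",
    "{'user_id': 'u2', 'reviews': []}",
]


def flatten_reviews(lines):
    """Turn one-record-per-user lines into one-row-per-review dictionaries."""
    flat = []
    for line in lines:
        record = ast.literal_eval(line)  # stand-in for demjson.decode
        for review in record.get('reviews', []):
            flat.append({
                'Name': record['user_id'],
                'item': review['item_id'],
                'recommend': review['recommend'],
                'posted': review['posted'],
                'review': review['review'],
            })
    return flat


flat_rows = flatten_reviews(sample_lines)
```

Users without reviews contribute no rows, so they disappear from the flattened result.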
        

In [5]:
user_reviews= pd.DataFrame(rows)
user_reviews.head()

Unnamed: 0,Name,item,recommend,posted,review
0,76561197970982479,1250,True,"Posted November 5, 2011.",Simple yet with great replayability. In my opi...
1,76561197970982479,22200,True,"Posted July 15, 2011.",It's unique and worth a playthrough.
2,76561197970982479,43110,True,"Posted April 21, 2011.",Great atmosphere. The gunplay can be a bit chu...
3,js41637,251610,True,"Posted June 24, 2014.",I know what you think when you see this title ...
4,js41637,227300,True,"Posted September 8, 2013.",For a simple (it's actually not all that simpl...


Check for missing values and duplicates:

In [6]:
user_reviews.isnull().sum()

Name         0
item         0
recommend    0
posted       0
review       0
dtype: int64

In [7]:
user_reviews.duplicated().sum()

874

Remove duplicates:

In [8]:
user_reviews.drop_duplicates(inplace=True)
user_reviews.duplicated().sum()

0
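
To make the duplicate check concrete, here is a small sketch on a hypothetical toy frame. duplicated() flags a row only when every column matches an earlier row, so the same review text from a different user is not counted.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Name": ["u1", "u1", "u2", "u2"],
    "item": ["10", "10", "10", "20"],
    "review": ["Fun.", "Fun.", "Fun.", "Meh."],
})

# Row 1 repeats row 0 exactly; row 2 differs in Name, so it is kept.
n_duplicates = int(toy.duplicated().sum())

# Non-inplace variant: returns a new frame and renumbers the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Unlike the inplace call above, this variant leaves the original frame untouched, which can help when the raw data is needed again later.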

Check how many reviews each item has, and list the items with more than 30 reviews:

In [9]:
counts = user_reviews.groupby(['item']).count()
print(counts.shape)
counts[counts["review"]>30]

(3682, 4)


item,Name,recommend,posted,review
10,56,56,56,56
10090,51,51,51,51
10180,93,93,93,93
104700,42,42,42,42
104900,154,154,154,154
...,...,...,...,...
8980,44,44,44,44
91310,80,80,80,80
9480,37,37,37,37
9900,31,31,31,31
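
Because count() repeats the same number in every column, a single series of counts is often enough. The sketch below uses groupby(...).size() on hypothetical toy data, with a hypothetical threshold of 2 in place of the 30 used above.

```python
import pandas as pd

# Hypothetical toy data; item IDs are kept as strings, as the
# text-like sort order of the index in the output above suggests.
toy = pd.DataFrame({
    "item": ["10", "10", "10", "20", "30", "30"],
    "review": ["a", "b", "c", "d", "e", "f"],
})

# One count per item instead of one count per column.
reviews_per_item = toy.groupby("item").size()

min_reviews = 2  # hypothetical threshold; the notebook uses 30
popular_items = reviews_per_item[reviews_per_item > min_reviews]
```

Here only item "10" has more than two reviews, so it is the only entry left in popular_items.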


Anonymize the original Steam user IDs using a sequential index:

In [10]:
df_unique_users = pd.DataFrame(user_reviews["Name"].unique(),columns=["user_id"])
df_unique_users.head()

Unnamed: 0,user_id
0,76561197970982479
1,js41637
2,evcentric
3,doctr
4,maplemage


In [11]:
df_unique_users["uid"] = df_unique_users.index
df_unique_users.head()

Unnamed: 0,user_id,uid
0,76561197970982479,0
1,js41637,1
2,evcentric,2
3,doctr,3
4,maplemage,4


In [12]:
df_unique_users.to_csv("/content/drive/My Drive/datasets/toxicity/user_id2_uid_map.csv")

In [13]:
user_reviews=user_reviews.join(df_unique_users.set_index("user_id"), on="Name")
user_reviews.drop(columns=['Name'], inplace=True)
user_reviews.head()

Unnamed: 0,item,recommend,posted,review,uid
0,1250,True,"Posted November 5, 2011.",Simple yet with great replayability. In my opi...,0
1,22200,True,"Posted July 15, 2011.",It's unique and worth a playthrough.,0
2,43110,True,"Posted April 21, 2011.",Great atmosphere. The gunplay can be a bit chu...,0
3,251610,True,"Posted June 24, 2014.",I know what you think when you see this title ...,1
4,227300,True,"Posted September 8, 2013.",For a simple (it's actually not all that simpl...,1
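
The same sequential numbering can be sketched more compactly with pd.factorize, which assigns integer codes in order of first appearance. This is an alternative under the assumption that no separate mapping file is needed in between; the user names below are hypothetical.

```python
import pandas as pd

# Hypothetical user names, repeated as they would be across reviews.
names = pd.Series(["alice", "bob", "bob", "carol", "alice"])

# codes[i] is the anonymized ID of names[i]; uniques[k] is the
# original name behind ID k, so it doubles as the lookup table.
codes, uniques = pd.factorize(names)

mapping = pd.DataFrame({"user_id": uniques, "uid": range(len(uniques))})
```

The mapping frame has the same two columns as the one saved above, so it could be written to disk in the same way.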


Extract the month and day from the posted column. Unfortunately the data is incomplete:

In [14]:
date_format = "Posted %B %d, %Y."  # full format of the posted column (not used below)
user_reviews["month_day"] = user_reviews["posted"].apply(lambda x: x.split(",")[0].replace("Posted ","").replace(".",""))
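
For rows where the full date is present, it could be parsed with the standard library as sketched below. The assumption, not confirmed by the data shown here, is that incomplete entries simply do not match the full format (for example because the year is missing); those return None. The helper name parse_posted is hypothetical.

```python
from datetime import datetime

FULL_FORMAT = "Posted %B %d, %Y."


def parse_posted(posted):
    """Return a date when the full format matches, otherwise None."""
    try:
        return datetime.strptime(posted, FULL_FORMAT).date()
    except ValueError:
        # Assumed case: incomplete entry, e.g. one without a year.
        return None


complete = parse_posted("Posted November 5, 2011.")
incomplete = parse_posted("Posted July 15.")  # hypothetical incomplete value
```

Note that %B matches English month names only under an English or C locale.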

Create a numeric recommend_flag for ease of use, and derive the review length, which is needed if a language model is to be used:

In [15]:
def create_recommend_flag(str_bool):
    # Map the string values 'True'/'False' to 1/0; anything else yields None.
    if str_bool == 'True':
        return 1
    elif str_bool == 'False':
        return 0
    return None

user_reviews["recommend_flag"] = user_reviews["recommend"].apply(create_recommend_flag)
user_reviews["review_length"] = user_reviews["review"].apply(len)
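
An equivalent mapping can be sketched with Series.map and a dictionary. Values missing from the dictionary become NaN, which is why a missing-value check after this step is useful. The toy values, including "unknown", are hypothetical.

```python
import pandas as pd

# Hypothetical values; "unknown" stands in for any unexpected entry.
recommend = pd.Series(["True", "False", "True", "unknown"])
review = pd.Series(["Fun.", "", "Great atmosphere.", "Meh"])

# Unmapped values turn into NaN instead of raising an error.
recommend_flag = recommend.map({"True": 1, "False": 0})
n_unmapped = int(recommend_flag.isnull().sum())

# Vectorized string length; an empty review has length 0.
review_length = review.str.len()
```

With one unexpected value, the flag column holds one NaN and is stored as floats rather than integers.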

Check for missing values after the transformations:

In [16]:
user_reviews.isnull().sum()

item              0
recommend         0
posted            0
review            0
uid               0
month_day         0
recommend_flag    0
review_length     0
dtype: int64

In [17]:
print(user_reviews["review_length"].describe())
#print(user_reviews[user_reviews["review_length"]>2000])

count    58431.000000
mean       215.942034
std        456.307700
min          0.000000
25%         30.000000
50%         78.000000
75%        207.000000
max       8000.000000
Name: review_length, dtype: float64


Check the final form of the data frame:

In [18]:
user_reviews.head()

Unnamed: 0,item,recommend,posted,review,uid,month_day,recommend_flag,review_length
0,1250,True,"Posted November 5, 2011.",Simple yet with great replayability. In my opi...,0,November 5,1,249
1,22200,True,"Posted July 15, 2011.",It's unique and worth a playthrough.,0,July 15,1,36
2,43110,True,"Posted April 21, 2011.",Great atmosphere. The gunplay can be a bit chu...,0,April 21,1,182
3,251610,True,"Posted June 24, 2014.",I know what you think when you see this title ...,1,June 24,1,566
4,227300,True,"Posted September 8, 2013.",For a simple (it's actually not all that simpl...,1,September 8,1,590


Save the final results:

In [None]:
user_reviews_saved = user_reviews.drop(columns=["posted", "recommend"])
user_reviews_saved.to_csv(os.path.join(file_location, "user_reviews_clean.csv"))
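
One detail worth knowing when the file is read back: to_csv writes the index as an unnamed first column by default, which read_csv then loads as "Unnamed: 0". The sketch below shows both variants on a hypothetical toy frame, using an in-memory buffer instead of Google Drive.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the final data.
toy = pd.DataFrame({"item": ["10", "20"], "uid": [0, 1]})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False keeps only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
```

Whether to keep the index depends on whether the original row numbers are needed downstream.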