# Dataset Creation

Redo the dataset creation from scratch, as the original creation in sa_hkuchatgpt is flawed.

(This is a template script for implementing each version of the train/test/validation datasets.)

This script creates a balanced dataset for training, testing, and validation.

It changes the label representing negative sentiment from -1 to 0 for easier training.

It does not pre-process the data (e.g. stemming or removing symbols).
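
The label change and the balancing described above can be sketched on a toy frame. This is only an illustration under assumptions: the rows are made up, and down-sampling each class to the size of the smallest one is one possible way to balance, not necessarily the one used in the later task-specific files.

```python
import pandas as pd

# Toy stand-in for the review data; the column names match the real dataset,
# the rows are invented for illustration.
toy = pd.DataFrame({
    "review_text": ["good", "great", "fine", "nice", "bad", "awful"],
    "review_score": [1, 1, 1, 1, -1, -1],
})

# Map the negative label -1 to 0 so the labels become {0, 1}.
toy["review_score"] = toy["review_score"].replace({-1: 0})

# Balance the classes by down-sampling every class to the smallest class size
# (an assumed strategy, chosen here only to keep the sketch short).
n_per_class = toy["review_score"].value_counts().min()
balanced = (
    toy.groupby("review_score", group_keys=False)
       .sample(n=n_per_class, random_state=13)
       .reset_index(drop=True)
)

print(balanced["review_score"].value_counts().to_dict())
```

Fixing `random_state` keeps the sample reproducible across runs, in the same spirit as the `random.seed(13)` call used in this notebook.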

In [13]:
import pandas as pd
import numpy as np

from pathlib import Path
import random
from datetime import datetime

random.seed(13)

dataset_path = Path('dataset.csv')

dataset = pd.read_csv(dataset_path)
dataset

Unnamed: 0,app_id,app_name,review_text,review_score,review_votes
0,10,Counter-Strike,Ruined my life.,1,0
1,10,Counter-Strike,This will be more of a ''my experience with th...,1,1
2,10,Counter-Strike,This game saved my virginity.,1,0
3,10,Counter-Strike,• Do you like original games? • Do you like ga...,1,0
4,10,Counter-Strike,"Easy to learn, hard to master.",1,1
...,...,...,...,...,...
6417101,99910,Puzzle Pirates,I really ove this game but it needs somethings...,-1,0
6417102,99910,Puzzle Pirates,"Used to play Puzzel Pirates 'way back when', b...",-1,0
6417103,99910,Puzzle Pirates,"This game was aright, though a bit annoying. W...",-1,0
6417104,99910,Puzzle Pirates,"I had a nice review to recommend this game, bu...",-1,0


In [14]:
dataset2 = pd.read_csv('steam.csv', names=['app_id', 'review_text', 'review_score', 'review_votes'])

dataset2

Unnamed: 0,app_id,review_text,review_score,review_votes
0,10,Ruined my life.,1,0
1,10,This will be more of a ''my experience with th...,1,1
2,10,This game saved my virginity.,1,0
3,10,• Do you like original games? • Do you like ga...,1,0
4,10,"Easy to learn, hard to master.",1,1
...,...,...,...,...
6417101,99910,I really ove this game but it needs somethings...,-1,0
6417102,99910,"Used to play Puzzel Pirates 'way back when', b...",-1,0
6417103,99910,"This game was aright, though a bit annoying. W...",-1,0
6417104,99910,"I had a nice review to recommend this game, bu...",-1,0


In [15]:
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6417106 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 244.8+ MB


In [16]:
dataset2.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6417106 entries, 0 to 6417105
Data columns (total 4 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   review_text   object
 2   review_score  int64 
 3   review_votes  int64 
dtypes: int64(3), object(1)
memory usage: 195.8+ MB


In [17]:
# cross-check whether both datasets are identical in terms of app_id, review_text, review_score and review_votes
# (app_name exists only in the Kaggle version, so drop it before comparing)

dataset = dataset.drop(columns=['app_name'])
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6417106 entries, 0 to 6417105
Data columns (total 4 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   review_text   object
 2   review_score  int64 
 3   review_votes  int64 
dtypes: int64(3), object(1)
memory usage: 195.8+ MB


In [18]:
dataset.equals(dataset2)

True

This proves that the dataset downloaded from Kaggle and the original one from Zenodo are identical.

The Kaggle version is more helpful because it includes the name of the game as an additional column.

After removing that column, the comparison shows it is identical to the Zenodo original.

For convenience, we will use the one from Kaggle.
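
As a side note on the comparison, `DataFrame.equals` treats missing values in the same position as equal, which matters here because the data contains nulls. The sketch below uses made-up rows to show that behaviour and one possible way to locate differing rows if the check had returned `False`.

```python
import numpy as np
import pandas as pd

# Two toy frames with the same shape and labels (rows invented for illustration).
a = pd.DataFrame({"app_id": [10, 10, 20], "review_text": ["ok", np.nan, "bad"]})
b = a.copy()

# equals() considers NaN values in the same location to be equal.
print(a.equals(b))

# Introduce one difference, then locate it row by row.
b.loc[2, "review_text"] = "good"
same_cell = (a == b) | (a.isna() & b.isna())   # element-wise match, NaN-aware
differing_rows = a.index[~same_cell.all(axis=1)].tolist()
print(differing_rows)
```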

---

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path
import random
from datetime import datetime

random.seed(13)

dataset_path = Path('dataset.csv')

dataset = pd.read_csv(dataset_path)
dataset

Unnamed: 0,app_id,app_name,review_text,review_score,review_votes
0,10,Counter-Strike,Ruined my life.,1,0
1,10,Counter-Strike,This will be more of a ''my experience with th...,1,1
2,10,Counter-Strike,This game saved my virginity.,1,0
3,10,Counter-Strike,• Do you like original games? • Do you like ga...,1,0
4,10,Counter-Strike,"Easy to learn, hard to master.",1,1
...,...,...,...,...,...
6417101,99910,Puzzle Pirates,I really ove this game but it needs somethings...,-1,0
6417102,99910,Puzzle Pirates,"Used to play Puzzel Pirates 'way back when', b...",-1,0
6417103,99910,Puzzle Pirates,"This game was aright, though a bit annoying. W...",-1,0
6417104,99910,Puzzle Pirates,"I had a nice review to recommend this game, bu...",-1,0


- check for null values and remove the rows that contain them
- remove duplicate reviews by checking review_text, review_score, and app_id
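
The null-removal step in the list above can also be written with `dropna`. The sketch below uses invented rows and checks that it selects the same rows as chaining two boolean masks.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in both text columns (rows are made up).
toy = pd.DataFrame({
    "app_id": [10, 10, 20, 20],
    "app_name": ["A", "A", np.nan, "B"],
    "review_text": ["fun", np.nan, "meh", "fun"],
    "review_score": [1, 1, -1, 1],
})

# Count the nulls per column first.
null_counts = toy.isnull().sum()

# Drop every row where app_name or review_text is missing.
cleaned = toy.dropna(subset=["app_name", "review_text"])

# Equivalent formulation with boolean masks.
masked = toy[toy["app_name"].notnull() & toy["review_text"].notnull()]

print(null_counts.to_dict())
print(cleaned.index.tolist())
```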

In [2]:
# notice there are some null values
dataset.isnull().sum()

app_id               0
app_name        183234
review_text       7305
review_score         0
review_votes         0
dtype: int64

In [3]:
# Remove rows that contain null values (in either the app_name or the review_text column)
dataset = dataset[dataset['app_name'].notnull()]

dataset = dataset[dataset['review_text'].notnull()]

dataset.isnull().sum()

app_id          0
app_name        0
review_text     0
review_score    0
review_votes    0
dtype: int64

In [4]:
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 6226728 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 285.0+ MB


In [5]:
# convert the text columns to string type
# (work on an explicit copy so the column assignments do not trigger SettingWithCopyWarning)

dataset = dataset.copy()
dataset['review_text'] = dataset['review_text'].astype(str)
dataset['app_name'] = dataset['app_name'].astype(str)
dataset.info(verbose=True)


<class 'pandas.core.frame.DataFrame'>
Index: 6226728 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 285.0+ MB


In [6]:
dataset.head(10)

Unnamed: 0,app_id,app_name,review_text,review_score,review_votes
0,10,Counter-Strike,Ruined my life.,1,0
1,10,Counter-Strike,This will be more of a ''my experience with th...,1,1
2,10,Counter-Strike,This game saved my virginity.,1,0
3,10,Counter-Strike,• Do you like original games? • Do you like ga...,1,0
4,10,Counter-Strike,"Easy to learn, hard to master.",1,1
5,10,Counter-Strike,"No r8 revolver, 10/10 will play again.",1,1
6,10,Counter-Strike,Still better than Call of Duty: Ghosts...,1,1
7,10,Counter-Strike,"cant buy skins, cases, keys, stickers - gaben ...",1,1
8,10,Counter-Strike,"Counter-Strike: Ok, after 9 years of unlimited...",1,1
9,10,Counter-Strike,Every server is spanish or french. I can now f...,1,0


In [7]:
# find the most frequent review_text values and build a DataFrame showing the top 50

top_review_text = dataset['review_text'].value_counts().rename_axis('review_text').reset_index(name='counts')
top_review_text.head(50)

Unnamed: 0,review_text,counts
0,Early Access Review,977399
1,Early Access Review,10571
2,10/10,6050
3,.,4769
4,Great game,3662
5,great game,3554
6,Great game!,2440
7,:),2093
8,Nice game,1793
9,Great Game,1659


## Different requirements for different tasks

For sentiment analysis, a unique pair of 'review_text' and 'review_score' is sufficient for all three of our models (no LLM); the name of the game and whether people found the review useful would only matter for an LLM, which we could prompt with them.

For topic modeling, the results can be game-dependent, so we keep each unique combination of 'review_text', 'review_score', 'app_id', and 'review_votes'.  
This is to classify unique comments within a game.

Each will be done in a separate file specific to the task.
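
The effect of the two key sets can be seen on a small made-up example. The names `sentiment_keys` and `topic_keys` are hypothetical, introduced only for this sketch.

```python
import pandas as pd

# The same short review posted for different games and with different vote flags
# (rows invented for illustration).
toy = pd.DataFrame({
    "app_id": [10, 20, 20, 20],
    "review_text": ["Great game", "Great game", "Great game", "Great game"],
    "review_score": [1, 1, 1, 1],
    "review_votes": [0, 0, 0, 1],
})

# Hypothetical names for the two de-duplication keys described above.
sentiment_keys = ["review_text", "review_score"]
topic_keys = ["app_id", "review_text", "review_score", "review_votes"]

sentiment_df = toy.drop_duplicates(subset=sentiment_keys, keep="first")
topic_df = toy.drop_duplicates(subset=topic_keys, keep="first")

print(len(sentiment_df), len(topic_df))
```

The sentiment key collapses the repeated text to a single row, while the topic-modeling key keeps one row per game and vote flag.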

First, remove the 'Early Access Review' rows.

In [8]:
# we notice that some rows contain "Early Access Review" -> not helpful to the analysis
# we remove every row whose text contains that phrase (substring match, not only exact matches).

print("before removing:", dataset.shape)

dataset = dataset[~dataset['review_text'].str.contains("Early Access Review")]

print("after removing:", dataset.shape)

before removing: (6226728, 5)
after removing: (5238690, 5)
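
Note that `str.contains` matches substrings, so it also drops longer reviews that merely mention the phrase. The sketch below contrasts it with an exact match after stripping whitespace. The sample strings are made up, and the trailing-space variant is only an assumption about why the phrase can show up as more than one distinct value.

```python
import pandas as pd

reviews = pd.Series([
    "Early Access Review",
    "Early Access Review ",   # assumed variant with trailing whitespace
    "Bought it after reading an Early Access Review, still fun",
    "Great game",
])

# Substring match: flags any review that contains the phrase anywhere.
contains_mask = reviews.str.contains("Early Access Review", regex=False)

# Exact match after stripping whitespace: flags only the placeholder rows.
exact_mask = reviews.str.strip() == "Early Access Review"

print(int(contains_mask.sum()), int(exact_mask.sum()))
```

Which of the two is preferable depends on whether reviews that quote the phrase should be kept; the cell above uses the substring match.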


In [9]:
# remove reviews that contain ♥ (Steam replaces censored profanity with ♥, so it indicates foul language)

dataset_noheart = dataset[dataset['review_text'].str.contains('♥') == False]
print(dataset_noheart.shape)
print(dataset.shape)

(4891928, 5)
(5238690, 5)


In [10]:
dataset_noheart.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 4891928 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 223.9+ MB


In [12]:
# remove duplicate rows (using the topic-modeling key: app_id, review_text, review_score, review_votes)

dataset_noheart = dataset_noheart.drop_duplicates(subset=['app_id', 'review_text', 'review_score', 'review_votes'], keep='first')
dataset_noheart.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 4180148 entries, 0 to 6417105
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   app_id        int64 
 1   app_name      object
 2   review_text   object
 3   review_score  int64 
 4   review_votes  int64 
dtypes: int64(3), object(2)
memory usage: 191.4+ MB


In [13]:
# reset the index; the old index is kept as an 'index' column so rows can be linked back to the original dataset

dataset_noheart = dataset_noheart.reset_index()
dataset_noheart.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4180148 entries, 0 to 4180147
Data columns (total 6 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   index         int64 
 1   app_id        int64 
 2   app_name      object
 3   review_text   object
 4   review_score  int64 
 5   review_votes  int64 
dtypes: int64(4), object(2)
memory usage: 191.4+ MB
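
How the extra `index` column links back to the unfiltered data can be sketched as follows. The frame names `original` and `filtered` are hypothetical and the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for the unfiltered dataset.
original = pd.DataFrame({"review_text": ["a", "b", "b", "c"]})

# Filtering leaves gaps in the index; reset_index() renumbers the rows
# and keeps the old labels in a new 'index' column.
filtered = original.drop_duplicates(subset=["review_text"]).reset_index()

# Look up the original row behind the third filtered row.
original_label = filtered.loc[2, "index"]
original_text = original.loc[original_label, "review_text"]

print(filtered["index"].tolist(), original_text)
```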


In [14]:
dataset_noheart.head(10)

Unnamed: 0,index,app_id,app_name,review_text,review_score,review_votes
0,0,10,Counter-Strike,Ruined my life.,1,0
1,1,10,Counter-Strike,This will be more of a ''my experience with th...,1,1
2,2,10,Counter-Strike,This game saved my virginity.,1,0
3,3,10,Counter-Strike,• Do you like original games? • Do you like ga...,1,0
4,4,10,Counter-Strike,"Easy to learn, hard to master.",1,1
5,5,10,Counter-Strike,"No r8 revolver, 10/10 will play again.",1,1
6,6,10,Counter-Strike,Still better than Call of Duty: Ghosts...,1,1
7,7,10,Counter-Strike,"cant buy skins, cases, keys, stickers - gaben ...",1,1
8,8,10,Counter-Strike,"Counter-Strike: Ok, after 9 years of unlimited...",1,1
9,9,10,Counter-Strike,Every server is spanish or french. I can now f...,1,0


In [15]:
# save the cleaned dataset to a new .pkl file named with today's date

from datetime import datetime

dataset_noheart.to_pickle(f'dataset_heartless_{datetime.now().strftime("%Y%m%d")}.pkl')
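
To confirm that a pickle written this way can be read back unchanged, a round trip along the following lines may help. It is a sketch on a toy frame written to a temporary directory, with a made-up file name, not the actual dataset file.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset (rows invented for illustration).
toy = pd.DataFrame({
    "index": [0, 3],
    "review_text": ["a", "c"],
    "review_score": [1, -1],
})

with tempfile.TemporaryDirectory() as tmp:
    # Same date-stamped naming pattern as above, with a hypothetical prefix.
    out_path = Path(tmp) / f"toy_{datetime.now().strftime('%Y%m%d')}.pkl"
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

print(restored.equals(toy))
```

Unlike CSV, a pickle preserves the dtypes and the index, so no re-parsing is needed when the task-specific files load it. Pickle files should only be loaded from trusted sources.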

---