# Data Collection

To collect the training data, I use mostly the same methods I used to get the sample data. But instead of getting all the reviews from one game, I need a wide variety of games. I start by getting every App ID on Steam, over 115,000 in total. For each app, I then grab the first 50 reviews labeled "most helpful". Many games had no reviews, or fewer than 50, especially when counting only English reviews. I end up with over one million reviews for this dataset.

## Getting App IDs

In [1]:
import pandas as pd
import requests

In [2]:
response = requests.get('https://api.steampowered.com/ISteamApps/GetAppList/v2/')
app_ids_df = pd.DataFrame(response.json()['applist']['apps'])
app_ids_df.to_pickle('../data/app_ids.pkl.gz')
app_ids_df

Unnamed: 0,appid,name
0,216938,Pieterw test app76 ( 216938 )
1,660010,test2
2,660130,test3
3,1118314,
4,463797,Warhammer Vermintide - Kruber 'Carroburg Liver...
...,...,...
115467,1632570,Visual Novel Maker - RPG Orchestral Essentials...
115468,1494840,SCOOT
115469,1140180,MAZEMAN
115470,1073910,Before We Leave


In [5]:
app_ids = app_ids_df['appid'].to_numpy()
app_ids, len(app_ids)

(array([ 216938,  660010,  660130, ..., 1140180, 1073910, 1626870],
       dtype=int64),
 115472)

## Getting Reviews

In [6]:
def get_reviews(appid, params):
    url_start = 'https://store.steampowered.com/appreviews/'
    try:
        response = requests.get(url=url_start+str(appid), params=params, headers={'User-Agent': 'Mozilla/5.0'})
    except requests.RequestException: # network-level failure: treat the app as having no reviews
        return {'reviews' : []}
    return response.json() # parse the JSON body of the response into a dict

In [7]:
reviews = []
cursor = '*'
params = { # https://partner.steamgames.com/doc/store/getreviews
    'json' : 1,
    'filter' : 'all', # sort by: recent, updated, all (helpfulness)
    'language' : 'english', # https://partner.steamgames.com/doc/store/localization
    'day_range' : 9223372036854775807, # shows reviews from all time
    'review_type' : 'all', # all, positive, negative
    'purchase_type' : 'all', # all, non_steam_purchase, steam
    'num_per_page' : 50,
    'cursor' : '*'.encode()
}
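
To make the request concrete, here is a minimal sketch of the URL these parameters turn into. It uses only the standard library rather than `requests`, so the exact escaping may differ slightly from what `requests` sends. The helper name `build_review_url` and the shortened `example_params` are mine, for illustration only; the app ID is one from the table above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_review_url(appid, params):
    """Return the URL a GET request with these params would target."""
    base = 'https://store.steampowered.com/appreviews/'
    return f'{base}{appid}?{urlencode(params)}'

# Shortened copy of the params above (illustrative, not the full set)
example_params = {
    'json': 1,
    'filter': 'all',
    'language': 'english',
    'num_per_page': 50,
    'cursor': '*'.encode(),
}

example_url = build_review_url(463797, example_params)
print(example_url)

# Decoding the query string recovers the original values
query = parse_qs(urlsplit(example_url).query)
print(query['cursor'], query['num_per_page'])
```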

In [8]:
for i, app_id in enumerate(app_ids):
    reviews += get_reviews(app_id, params)['reviews']
    if (i+1)%500 == 0:
        print(f'{i+1} of {len(app_ids)}: {len(reviews)} reviews')

500 of 115472: 10598 reviews
1000 of 115472: 20406 reviews
1500 of 115472: 31317 reviews
2000 of 115472: 40640 reviews
2500 of 115472: 49629 reviews
3000 of 115472: 58870 reviews
3500 of 115472: 66955 reviews
4000 of 115472: 75854 reviews
4500 of 115472: 85265 reviews
5000 of 115472: 95172 reviews
5500 of 115472: 102967 reviews
6000 of 115472: 113580 reviews
6500 of 115472: 123413 reviews
7000 of 115472: 133254 reviews
7500 of 115472: 144513 reviews
8000 of 115472: 154169 reviews
8500 of 115472: 165639 reviews
9000 of 115472: 178371 reviews
9500 of 115472: 190025 reviews
10000 of 115472: 200838 reviews
10500 of 115472: 211107 reviews
11000 of 115472: 222326 reviews
11500 of 115472: 232323 reviews
12000 of 115472: 244254 reviews
12500 of 115472: 256739 reviews
13000 of 115472: 268477 reviews
13500 of 115472: 281895 reviews
14000 of 115472: 289666 reviews
14500 of 115472: 303750 reviews
15000 of 115472: 316116 reviews
15500 of 115472: 327800 reviews
16000 of 115472: 338292 reviews
16500 

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

When running the above code to get the reviews, I got the following error:

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

It seems to be fixable by decoding the response as `utf-8-sig`, as the error message suggests, instead of relying on the default encoding in the requests module. But seeing as this code takes about half a day to run, and I already have over a million reviews, this will be good enough.
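
For a future run, a minimal sketch of that fix could look like the following. I have not run it against the Steam API. The helper `parse_json_bytes` is a hypothetical name, and the sample payload is made up to imitate a response that starts with a byte order mark.

```python
import json

def parse_json_bytes(raw):
    """Decode a JSON payload, tolerating a leading UTF-8 byte order mark."""
    try:
        # 'utf-8-sig' strips a BOM if present and otherwise behaves like 'utf-8'
        return json.loads(raw.decode('utf-8-sig'))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Mirror get_reviews: treat an unreadable response as "no reviews"
        return {'reviews': []}

# Made-up payload: three BOM bytes followed by ordinary JSON
with_bom = b'\xef\xbb\xbf{"reviews": [{"review": "ok", "voted_up": true}]}'
print(parse_json_bytes(with_bom)['reviews'])
```

Inside `get_reviews`, this would mean returning `parse_json_bytes(response.content)` instead of `response.json()`, so one malformed response skips a single app instead of stopping the whole loop.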

In [9]:
len(reviews)

1006078

In [10]:
reviews_df = pd.DataFrame(reviews)[['review', 'voted_up']]
reviews_df.dropna(inplace=True)
reviews_df.reset_index(inplace=True)
reviews_df

Unnamed: 0,index,review,voted_up
0,0,Overpriced palette swapped skin.,False
1,1,"Best used with the ""Talabheim Cavalier"" Hat fo...",True
2,2,Carroburg greatswords...thats all you need to ...,True
3,3,"I usually turn my nose up at cosmetic DLC's, b...",True
4,4,Love this game. Bought 'em all :),True
...,...,...,...
1006073,1006073,I like it - let's you take on several train ty...,True
1006074,1006074,I love this route the only reason i got it for...,True
1006075,1006075,"This is a very fun add-on to play, you can pla...",True
1006076,1006076,"GWE is a great DLC, one of my favourites for T...",True


In [11]:
reviews_df.voted_up.value_counts(normalize=True)

True     0.731222
False    0.268778
Name: voted_up, dtype: float64

In [12]:
reviews_df.to_pickle('../data/reviews_raw.pkl.gz')

The initial dataset for this project used games taken from the "hot" section of Steam, and had an 80-20 class imbalance. This data, taken from all Steam apps, has a roughly 73-27 class imbalance. Popular games are more likely to have more reviews, and also have more DLC, which counts as its own app for review purposes. This imbalance might be improved by only counting games, and by taking fewer reviews per game. However, that would end up with significantly less data overall, so I have decided not to go there for now.
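
If I do revisit this, capping the number of reviews per app could be sketched as below. This is only an illustration: the saved dataset keeps just `review` and `voted_up`, so the `(appid, review)` pairs assumed here would require tagging each review with its app ID during collection, which the loop above does not do. The function name `cap_reviews_per_app` and the toy data are hypothetical.

```python
from collections import defaultdict

def cap_reviews_per_app(tagged_reviews, cap):
    """Keep at most `cap` reviews per app, preserving the original order."""
    kept_per_app = defaultdict(int)
    kept = []
    for appid, review in tagged_reviews:
        if kept_per_app[appid] < cap:
            kept_per_app[appid] += 1
            kept.append((appid, review))
    return kept

# Toy data: app 1 has three reviews, app 2 has one
toy_reviews = [
    (1, {'review': 'great', 'voted_up': True}),
    (1, {'review': 'fun', 'voted_up': True}),
    (1, {'review': 'love it', 'voted_up': True}),
    (2, {'review': 'broken', 'voted_up': False}),
]
capped = cap_reviews_per_app(toy_reviews, cap=2)
print(len(capped))
```

Whether a cap actually moves the class balance depends on how positive and negative reviews are distributed across apps, which I have not measured.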

A future improvement could also be to get data from sources other than Steam. Metacritic seems like a good choice, as its reviews come with scores. There are also storefronts, such as itch.io or GOG, that cater to different types of games than Steam or Metacritic, and so might help the model's ability to generalize.

However, at present, adding more data would just prevent my computer from running these models at all. If I want to increase the dataset size, I first need to get these notebooks running on a better computer, or on something like Amazon SageMaker.

## Splitting and Pickling

The raw reviews data is too large for my computer to handle in one piece, so here I split it into 10 parts and save them separately. I also shuffle the dataframe before splitting.

In [2]:
import numpy as np

In [3]:
reviews_df = pd.read_pickle('../data/reviews_raw.pkl.gz')
reviews_df = reviews_df.sample(frac=1).reset_index(drop=True)

In [4]:
for i, df in enumerate(np.array_split(reviews_df, 10)):
    df.to_pickle(f'../data/reviews_raw_{i}.pkl.gz')
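
As a quick check on how large each part is: `np.array_split` does not require an even split, and gives one extra row to the first few parts. The sketch below reproduces that sizing rule in plain Python. The helper `split_sizes` is a hypothetical name, and the row count assumes that `dropna` removed nothing from the 1,006,078 collected reviews.

```python
def split_sizes(total_rows, parts):
    """Sizes of the chunks np.array_split produces along one axis."""
    base, extra = divmod(total_rows, parts)
    # The first `extra` chunks receive one additional row each
    return [base + 1] * extra + [base] * (parts - extra)

sizes = split_sizes(1006078, 10)
print(sizes[0], sizes[-1], sum(sizes))
```

Under that assumption, each saved file holds about 100,600 reviews.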

The functions below are from earlier versions of the data gathering. They are no longer used.

In [None]:
def get_n_reviews(appid, n=100):
    reviews = []
    cursor = '*'
    params = { # https://partner.steamgames.com/doc/store/getreviews
        'json' : 1,
        'filter' : 'all', # sort by: recent, updated, all (helpfulness)
        'language' : 'english', # https://partner.steamgames.com/doc/store/localization
        'day_range' : 9223372036854775807, # shows reviews from all time
        'review_type' : 'all', # all, positive, negative
        'purchase_type' : 'all', # all, non_steam_purchase, steam
    }
    while n > 0:
        params['cursor'] = cursor.encode() # for pagination
        params['num_per_page'] = min(100, n) # 100 is the max possible reviews in one request
        n -= 100

        try:
            response = get_reviews(appid, params)
        except Exception: # e.g. a response that is not valid JSON
            return []

        reviews += response['reviews']

        # a short page means there is nothing left to fetch; a failed request
        # returns no 'cursor' key at all, so stop before reading it
        if len(response['reviews']) < 100: break
        cursor = response['cursor']

    return reviews

In [None]:
from bs4 import BeautifulSoup

def get_n_appids(n=100, filter_by='topsellers'):
    appids = []
    url = f'https://store.steampowered.com/search/?category1=998&filter={filter_by}&page='
    page = 0
    
    while page*25 < n:
        page += 1
        response = requests.get(url=url+str(page), headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.text, 'html.parser')
        for row in soup.find_all(class_='search_result_row'):
            appids.append(row['data-ds-appid'])
    
    return appids[:n]