# Data Collection

To collect the training data, I use mostly the same methods I used to get the sample data. But instead of getting all the reviews from one game, I need a wide variety of games. I start by getting every App ID on Steam, over 115,000. I then grab the first 100 reviews for each app, sorted by "most helpful". Many apps had no reviews, or fewer than 100, especially when only taking English reviews into account. I end up with 2,132,897 reviews for this dataset.

## Getting App IDs

In [1]:
import pandas as pd
import requests

In [6]:
response = requests.get('https://api.steampowered.com/ISteamApps/GetAppList/v2/')
app_ids_df = pd.DataFrame(response.json()['applist']['apps'])
app_ids_df.to_feather('../data/app_ids.feather')
app_ids_df

Unnamed: 0,appid,name
0,216938,Pieterw test app76 ( 216938 )
1,660010,test2
2,660130,test3
3,1118314,
4,1276781,Tidal Shock: DIVE CREW DLC
...,...,...
115331,1590950,Namnokh
115332,271590,Grand Theft Auto V
115333,1118200,People Playground
115334,965580,Root


In [10]:
app_ids = app_ids_df['appid'].to_numpy()
app_ids

array([ 216938,  660010,  660130, ..., 1118200,  965580,  719890],
      dtype=int64)
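
The app list above includes test entries and at least one app with a blank name. The notebook requests reviews for every ID regardless, but a filter along the following lines could skip such entries before making requests. This is only a sketch on made-up rows; `drop_unnamed_apps` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the GetAppList result; these rows are made up for illustration.
toy_apps = pd.DataFrame({
    'appid': [10, 20, 20, 30],
    'name': ['Game A', '', '', 'Game C'],
})

def drop_unnamed_apps(apps):
    """Return unique app IDs whose name is not blank (hypothetical helper)."""
    named = apps[apps['name'].str.strip() != '']
    return named.drop_duplicates(subset='appid')['appid'].to_numpy()

filtered_ids = drop_unnamed_apps(toy_apps)
print(filtered_ids)
```

Whether dropping blank-named apps would remove any apps that actually have reviews is untested here, so the counts in the rest of this notebook are based on the unfiltered list.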

## Getting Reviews

In [23]:
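def get_reviews(appid, params):
    """Fetch one page of reviews for an app; return an empty result on failure."""
    url_start = 'https://store.steampowered.com/appreviews/'
    try:
        response = requests.get(url=url_start+str(appid), params=params, headers={'User-Agent': 'Mozilla/5.0'})
        return response.json() # return data extracted from the json response
    except (requests.RequestException, ValueError): # network error or invalid JSON
        return {'reviews' : []}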

In [24]:
reviews = []
cursor = '*'
params = { # https://partner.steamgames.com/doc/store/getreviews
    'json' : 1,
    'filter' : 'all', # sort by: recent, updated, all (helpfulness)
    'language' : 'english', # https://partner.steamgames.com/doc/store/localization
    'day_range' : 9223372036854775807, # shows reviews from all time
    'review_type' : 'all', # all, positive, negative
    'purchase_type' : 'all', # all, non_steam_purchase, steam
    'num_per_page' : 100,
    'cursor' : '*'.encode()
}

In [25]:
for i, app_id in enumerate(app_ids):
    params['cursor'] = cursor.encode() # for pagination
    response = get_reviews(app_id, params)
    reviews += response['reviews']
    if i % 500 == 0: # report progress every 500 apps
        print(f'{i+1} of {len(app_ids)}: {len(reviews)} reviews')

1 of 115336: 0 reviews
501 of 115336: 4161 reviews
1001 of 115336: 7183 reviews
1501 of 115336: 11815 reviews
2001 of 115336: 16041 reviews
2501 of 115336: 20800 reviews
3001 of 115336: 24584 reviews
3501 of 115336: 26394 reviews
4001 of 115336: 28276 reviews
4501 of 115336: 30743 reviews
5001 of 115336: 33758 reviews
5501 of 115336: 36023 reviews
6001 of 115336: 39187 reviews
6501 of 115336: 44296 reviews
7001 of 115336: 50731 reviews
7501 of 115336: 54956 reviews
8001 of 115336: 60361 reviews
8501 of 115336: 63855 reviews
9001 of 115336: 68875 reviews
9501 of 115336: 71455 reviews
10001 of 115336: 73035 reviews
10501 of 115336: 74228 reviews
11001 of 115336: 76346 reviews
11501 of 115336: 78189 reviews
12001 of 115336: 80174 reviews
12501 of 115336: 87360 reviews
13001 of 115336: 93655 reviews
13501 of 115336: 99988 reviews
14001 of 115336: 107035 reviews
14501 of 115336: 112237 reviews
15001 of 115336: 116494 reviews
15501 of 115336: 121839 reviews
16001 of 115336: 128785 reviews
...

In [None]:
reviews_df = pd.DataFrame(reviews)[['review', 'voted_up']]
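
Each element of `reviews` is a dictionary, and only its `review` and `voted_up` keys are kept. The sketch below shows that selection on a hand-written stand-in for a parsed response. The `extra_field` key and the `reviews_to_frame` helper are made up for illustration; the helper also handles an empty list, for which plain column selection would raise a `KeyError`.

```python
import pandas as pd

# Hand-written stand-in for one parsed response. Only the 'review' and
# 'voted_up' keys come from the notebook; 'extra_field' is illustrative.
fake_response = {
    'reviews': [
        {'review': 'Great game', 'voted_up': True, 'extra_field': 1},
        {'review': 'Not for me', 'voted_up': False, 'extra_field': 2},
    ]
}

def reviews_to_frame(review_dicts):
    """Keep only the text and label columns, tolerating an empty list."""
    columns = ['review', 'voted_up']
    if not review_dicts:
        return pd.DataFrame(columns=columns)
    return pd.DataFrame(review_dicts)[columns]

toy_df = reviews_to_frame(fake_response['reviews'])
empty_df = reviews_to_frame([])
print(toy_df)
```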

In [28]:
reviews_df

Unnamed: 0,review,voted_up
0,Great bundle! Offers some nice and clean skins.,True
1,More outstanding characters to select for Tida...,True
2,A strange blend of war strategy with some ship...,True
3,I love the concept as well as the music. Playi...,True
4,my antivirus said it was malware. that's good ...,True
...,...,...
2132892,"This game has the same issues as Isle, but to ...",False
2132893,I've been really trying to like this game sinc...,False
2132894,"As the other reviews have stated, the game is ...",False
2132895,This game is trash. You can spend 40+ hours of...,False


In [31]:
reviews_df.to_pickle('../data/reviews.pkl.gz', compression='gzip')

In [2]:
#reviews_df = pd.read_pickle('../data/reviews.pkl.gz')
reviews_df.dropna(inplace=True)
reviews_df.reset_index(drop=True, inplace=True) # drop=True avoids adding the old index as a column
reviews_df.voted_up.value_counts(normalize=True)

True     0.745499
False    0.254501
Name: voted_up, dtype: float64

The initial dataset for this project used games taken from the "hot" section of Steam, and had an 80-20 class imbalance. This data, taken from all Steam apps, has a 75-25 class imbalance. Popular games are more likely to have more reviews, and also have more DLC, which counts as its own app for review purposes. This imbalance might be improved by only counting games, and by taking fewer reviews per game. However, that would leave significantly less data overall, so I have decided not to do that for now.
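
As a rough illustration of the "fewer reviews per game" idea, the sketch below caps the number of rows per app on toy data and compares the share of positive reviews before and after. It assumes an `appid` column, which the saved dataset does not have (only `review` and `voted_up` are kept), so the real pipeline would need to retain the ID first. `cap_reviews_per_game` is a hypothetical helper.

```python
import pandas as pd

# Toy data. The 'appid' column is an assumption: the notebook's dataset
# keeps only 'review' and 'voted_up'.
toy_reviews = pd.DataFrame({
    'appid':    [1, 1, 1, 1, 2, 2, 3],
    'voted_up': [True, True, True, True, True, False, False],
})

def cap_reviews_per_game(df, max_per_game):
    """Keep at most max_per_game rows for each appid, in original order."""
    return df.groupby('appid').head(max_per_game).reset_index(drop=True)

before = toy_reviews['voted_up'].mean()
capped = cap_reviews_per_game(toy_reviews, max_per_game=2)
after = capped['voted_up'].mean()
print(f'positive share before: {before:.2f}, after: {after:.2f}')
```

On this toy data the cap lowers the positive share, but whether it would do the same on the real reviews has not been checked.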

Another future improvement could be to get data from other sources in addition to Steam. Metacritic seems like a good choice, as its reviews come with scores. There are also storefronts, such as itch.io or GOG, that cater to different types of games than Steam or Metacritic, and so might help the model's ability to generalize.

However, at present, adding more data would just prevent my computer from running these models at all. If I want to increase the dataset size, I first need to get these notebooks running on a better computer, or on something like Amazon SageMaker.

## Splitting and Pickling

The raw reviews data is too large for my computer to handle, so here I split it into 10 parts and save them separately. I also shuffle the dataframe before splitting.

In [2]:
import numpy as np

In [3]:
reviews_df = pd.read_pickle('../data/reviews.pkl.gz')
reviews_df = reviews_df.sample(frac=1).reset_index(drop=True)

In [4]:
for i, df in enumerate(np.array_split(reviews_df, 10)):
    df.to_pickle(f'../data/reviews_raw_{i}.pkl.gz')
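
The shuffle-then-split step can also be written by splitting row positions instead of passing the DataFrame itself to `np.array_split`, which may emit a deprecation warning on newer pandas versions. This is a minimal sketch on toy data. The `split_frame` helper and its `seed` argument are assumptions added for reproducibility; the notebook shuffles without a fixed seed.

```python
import numpy as np
import pandas as pd

def split_frame(df, n_parts, seed=0):
    """Shuffle df, then return n_parts nearly equal slices (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Split the row positions rather than the DataFrame itself.
    position_chunks = np.array_split(np.arange(len(shuffled)), n_parts)
    return [shuffled.iloc[chunk] for chunk in position_chunks]

toy_df = pd.DataFrame({'review': list('abcdefghijk'), 'voted_up': [True] * 11})
parts = split_frame(toy_df, n_parts=3)

# Concatenating the parts recovers every row exactly once.
rejoined = pd.concat(parts, ignore_index=True)
print([len(part) for part in parts])
```

Each part could then be written with `to_pickle` as in the cell above.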

The functions below are from earlier versions of the data-gathering code. They are no longer used.

In [16]:
def get_n_reviews(appid, n=100):
    reviews = []
    cursor = '*'
    params = { # https://partner.steamgames.com/doc/store/getreviews
        'json' : 1,
        'filter' : 'all', # sort by: recent, updated, all (helpfulness)
        'language' : 'english', # https://partner.steamgames.com/doc/store/localization
        'day_range' : 9223372036854775807, # shows reviews from all time
        'review_type' : 'all', # all, positive, negative
        'purchase_type' : 'all', # all, non_steam_purchase, steam
    }
    while n > 0:
        params['cursor'] = cursor.encode() # for pagination
        params['num_per_page'] = min(100, n) # 100 is the max possible reviews in one request
        n -= 100

        try:
            response = get_reviews(appid, params)
        except Exception:
            return []

        reviews += response['reviews']

        # a short page means there is nothing more to fetch; checking this first
        # also avoids reading 'cursor' from the empty fallback result
        if len(response['reviews']) < 100: break
        cursor = response['cursor']

    return reviews

In [4]:
from bs4 import BeautifulSoup

def get_n_appids(n=100, filter_by='topsellers'):
    appids = []
    url = f'https://store.steampowered.com/search/?category1=998&filter={filter_by}&page='
    page = 0
    
    while page*25 < n:
        page += 1
        response = requests.get(url=url+str(page), headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.text, 'html.parser')
        for row in soup.find_all(class_='search_result_row'):
            appids.append(row['data-ds-appid'])
    
    return appids[:n]