# Data Collection

In the first part of this steam project, we are going to create our datasets by requesting data from two APIs, one from the steamspy website and one from steam itself. 

The overall goal of this project is to analyse apps (games) from the steam store to determine what sort of games perform better in terms of sales, play time and ratings. Most of our information about the games will come from the steam API, whereas the target variables will mostly be supplied by the steamspy API.

For the purposes of this project, we'll be performing as little data cleaning as possible at this stage, providing a 'dirty' data set for data cleaning, which is the next step. We begin by importing the libraries we will be using.

Section outline:

- Create app list from steamspy api using 'all' request (the steam api can also provide one, but it includes too many videos and demos)
- Retrieve individual app data from steam, by iterating through app list
- Retrieve individual app data from steamspy, by iterating through app list
- Save app list, steam data and steamspy data to csv files

API references:

- https://partner.steamgames.com/doc/webapi/ISteamApps
- https://steamapi.xpaw.me/#
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamspy.com/api.php

In [10]:
# import libraries
import csv
import datetime as dt
import json
import os
import statistics
import time

import numpy as np
import pandas as pd
import requests

Next, we define a general, all-purpose function to process get requests from an API, supplied through a URL parameter. A dictionary of parameters can be supplied which are passed into the get request automatically, depending on the requirements of the API.

Rather than simply returning the response, we handle a couple of scenarios to help automation. Occasionally we encounter an SSLError, in which case we simply wait a few seconds then try again (by recursively calling the function). When this happens, and generally throughout this project, we provide quite verbose feedback to show when these errors are encountered and how they are handled.

Occasionally a request comes back unsuccessful, in which case the response object evaluates as false. This usually happens when too many requests are made in a period of time, and the polling limit has been reached. We try to avoid this by pausing briefly between requests, as we'll see later, but in case we breach the polling limit we wait 10 seconds then try again.

Handling these errors in this way ensures that our function almost always returns the desired response, which we return in json format to make processing easier.

In [11]:
def get_request(url, parameters=None):
    """Return json-formatted response of a get request using optional parameters.
    
    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request
    
    Returns
    -------
    json_data
        json-formatted response (dict-like)
    """
    try:
        response = requests.get(url=url, params=parameters)
    except requests.exceptions.SSLError as s:
        print('SSL Error:', s)
        
        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)
        
        # recursively try again
        return get_request(url, parameters)
    
    if response:
        return response.json()
    else:
        # a falsy response (error status) usually means too many requests; wait and try again
        print('No response, waiting 10 seconds...')
        time.sleep(10)
        print('Retrying.')
        return get_request(url, parameters)
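
The recursive retry above never gives up, so a persistent outage would keep it recursing until Python's recursion limit is reached. One possible alternative, shown here only as a sketch, is a loop with a retry cap. The names `fetch_with_retries`, `max_attempts`, `wait`, `retry_on` and `sleep` are hypothetical and not part of the original code; `fetch` stands in for any zero-argument callable, such as `lambda: requests.get(url, params=parameters)`.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, wait=10, retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() until it returns a truthy result or attempts run out.

    fetch : zero-argument callable performing the request (hypothetical hook)
    retry_on : exception types treated as temporary failures
    sleep : injectable so the wait can be skipped in tests
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except retry_on as error:
            print('Attempt {} failed: {}'.format(attempt, error))
            result = None

        if result:
            return result

        # only wait if another attempt is coming
        if attempt < max_attempts:
            sleep(wait)

    raise RuntimeError('No response after {} attempts'.format(max_attempts))
```

The default `retry_on=(OSError,)` is an assumption chosen because the exceptions raised by requests derive from `OSError`; a narrower tuple could be passed instead. Raising after the last attempt makes a long outage visible rather than leaving the script waiting indefinitely.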

With this initial setup complete, we now need a list of app ids from which to build our data sets. It's possible to generate one from the steam API, but it has over 70,000 entries, many of which are demos and videos with no way to tell them apart. Instead, steamspy provides an 'all' request, which returns some information about every app it tracks. It doesn't return everything about each app, so we still need to request each app's details individually, but it provides a good starting point.

Because many of the return fields are strings containing commas and other punctuation, it is easiest to read the response into a pandas dataframe, and export the required appid and name fields to a csv. We could just keep the appid column as a list or pandas series, but it may be useful to keep the app name at this stage.

In [12]:
url = "https://steamspy.com/api.php"
parameters = {"request": "all"}

# request 'all' from steam spy and parse into dataframe
json_data = get_request(url, parameters=parameters)
steam_spy_all = pd.DataFrame.from_dict(json_data, orient='index')

# generate sorted app_list from steamspy data, to be used for retrieving individual app data from steam
app_list = steam_spy_all[['appid', 'name']].sort_values('appid').reset_index(drop=True)

# export disabled to keep consistency across download sessions, instead read from stored csv
# app_list.to_csv('../data/raw/app_list.csv', index=False)
app_list = pd.read_csv('../data/raw/app_list.csv')

# display first few rows
app_list.head()

Unnamed: 0,appid,name
0,10,Counter-Strike
1,20,Team Fortress Classic
2,30,Day of Defeat
3,40,Deathmatch Classic
4,50,Half-Life: Opposing Force
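
To make the shape of the 'all' response concrete: judging by the `orient='index'` call above, it is a dictionary keyed by app id, with one dictionary of fields per app. The sketch below reproduces the appid and name extraction with the standard library only. The `sample_response` payload is a hand-written stand-in (the `positive` values are placeholders), not real API output, and `build_app_list` is a hypothetical helper name.

```python
import csv
import io

# Hand-written stand-in for the steamspy 'all' response (assumed shape)
sample_response = {
    '20': {'appid': 20, 'name': 'Team Fortress Classic', 'positive': 0},
    '10': {'appid': 10, 'name': 'Counter-Strike', 'positive': 0},
}


def build_app_list(response):
    """Return [{'appid': ..., 'name': ...}, ...] sorted by appid."""
    rows = [{'appid': app['appid'], 'name': app['name']} for app in response.values()]
    return sorted(rows, key=lambda row: row['appid'])


app_rows = build_app_list(sample_response)

# Write to an in-memory buffer; the csv module quotes names containing commas
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['appid', 'name'])
writer.writeheader()
writer.writerows(app_rows)
print(buffer.getvalue())
```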


Now we have our app dataframe, we can iterate over the app ids and request the individual app data from the servers. Here we set out our logic to retrieve and process this information, then finally store the data as a csv file.

Because it takes a long time to retrieve the data, it would be dangerous to attempt it all in one go as any errors or connection time-outs would cause the loss of our data. For this reason we define a function to download and process the requests in chunks, appending each chunk to file and keeping track of the highest index written in an external file.

This not only provides security, allowing us to easily restart the process if an error is encountered, but also means we can complete the download across multiple sessions.

Again, we provide verbose output for rows exported, chunks complete, time taken and estimated time remaining.

Finally, we define some functions to reset the index for testing and demonstration, retrieve the index from file (maintains persistence across sessions) and prepare our csv file to store the data, clearing the file if it exists and writing a header row. This last function will only wipe the file if the index is 0.

In [13]:
def get_app_data(start, stop, parser, pause):
    """Return list of app data generated from parser.
    
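    start, stop : slice bounds into the global app_list dataframe
    parser : function to handle request, called as parser(appid, name)
    pause : seconds to wait after each api request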
    """    
    
    app_data = []

    for index, row in app_list[start:stop].iterrows():
        print('Current index: {}'.format(index), end='\r')
        
        appid = row['appid']
        name = row['name']

        data = parser(appid, name)
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests
                
    return app_data


def process_chunks(parser, app_list, data_filename, index_filename, columns, begin=0, end=-1, chunksize=100, pause=1):
    """Process app data in chunks for stability, appending directly to file.
    
    parser : custom function to format request
    app_list : dataframe of appid and name
    data_filename : filename to save app data
    index_filename : filename to store highest index written
    columns : column names for file
    
    Keyword arguments:
    
    begin : starting index (get from index_filename, default 0)
    end : index to finish (defaults to end of app_list)
    chunksize : number of apps to write in each chunk (default 100)
    pause : time to wait after each api request (default 1)
    
    returns: none
    """
    
    print('Starting at index {}:\n'.format(begin))
    
    if end == -1:
        end = len(app_list)  # slices exclude the stop index, so no + 1 is needed
        
    chunks = np.arange(begin, end, chunksize)
    chunks = np.append(chunks, end)
    
    apps_written = 0
    chunk_times = []
    
    for i in range(len(chunks) - 1):
        start_time = time.time()
        
        start = chunks[i]
        stop = chunks[i+1]
        
        app_data = get_app_data(start, stop, parser, pause)
        
        rel_path = os.path.join('../data/raw', data_filename)
        
        with open(rel_path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
            
            for j in range(3,0,-1):
                print("\rAbout to write data, don't stop script! ({})".format(j), end='')
                time.sleep(0.5)
            
            writer.writerows(app_data)
            print('\rExported lines {}-{} to {}.'.format(start, stop-1, data_filename), end=' ')
            
        apps_written += len(app_data)
        
        idx_path = os.path.join('../src', index_filename)
        
        with open(idx_path, 'w') as f:
            index = stop
            print(index, file=f)
            
        end_time = time.time()
        time_taken = end_time - start_time
        
        chunk_times.append(time_taken)
        mean_time = statistics.mean(chunk_times)
        
        est_remaining = (len(chunks) - i - 2) * mean_time
        
        remaining_td = dt.timedelta(seconds=round(est_remaining))
        time_td = dt.timedelta(seconds=round(time_taken))
        mean_td = dt.timedelta(seconds=round(mean_time))
        
        print('Chunk {} time: {} (avg: {}, remaining: {})'.format(i, time_td, mean_td, remaining_td))
            
    print('\nProcessing chunks complete. {} apps written'.format(apps_written))

    
def reset_index(index_filename):
    """Reset index in file to 0"""
    rel_path = os.path.join('../src', index_filename)
    with open(rel_path, 'w') as f:
        print(0, file=f)
        

def get_index(index_filename):
    """Retrieve index from file, returning 0 if file not found"""
    try:
        rel_path = os.path.join('../src', index_filename)

        with open(rel_path, 'r') as f:
            index = int(f.readline())
    
    except FileNotFoundError:
        index = 0
        
    return index


def prepare_data_file(filename, index, columns):
    """Create file and write headers if index is 0"""
    if index == 0:
        rel_path = os.path.join('../data/raw', filename)

        with open(rel_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()
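
The chunk boundaries built in `process_chunks` with `np.arange` and `np.append` can be checked in isolation. The helper below, `chunk_bounds` (a hypothetical name, written in plain Python rather than NumPy), is a sketch that yields the same (start, stop) pairs, including a shorter final chunk when the range is not a multiple of `chunksize`.

```python
def chunk_bounds(begin, end, chunksize):
    """Return (start, stop) index pairs covering begin..end-1 in steps of chunksize."""
    return [(start, min(start + chunksize, end)) for start in range(begin, end, chunksize)]


for start, stop in chunk_bounds(0, 7, 3):
    print('rows {}-{}'.format(start, stop - 1))
# rows 0-2
# rows 3-5
# rows 6-6
```

Resuming works the same way: with a stored index of 4, `chunk_bounds(4, 10, 2)` gives `[(4, 6), (6, 8), (8, 10)]`, so only the rows not yet written are requested.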

Now we are ready to start downloading data and writing it to file. We first define the logic particular to the steam API: if the request reports no data for an app, we return just its name and appid. We then set some parameters: the files we will write our data and index to, and the columns for the csv file. The API doesn't return every column for every app, so it is best to set these explicitly.

Next we run our functions to set up our files, and make a call to `process_chunks` to begin our process. 

In [14]:
def parse_steam_request(appid, name):
    """Return app data from the steam API, or just name and appid if the request was unsuccessful."""
    url = "http://store.steampowered.com/api/appdetails/"
    parameters = {"appids": appid}
    
    json_data = get_request(url, parameters=parameters)
    json_app_data = json_data[str(appid)]
    
    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'name': name, 'steam_appid': appid}
        
    return data


steam_app_data = 'steam_app_data_demo.csv'
steam_index = 'steam_index_demo.txt'
steam_columns = ['type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support', 'dlc', 'detailed_description', 'about_the_game', 'short_description', 'fullgame', 'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements', 'linux_requirements', 'legal_notice', 'drm_notice', 'ext_user_account_notice', 'developers', 'publishers', 'demos', 'price_overview', 'packages', 'package_groups', 'platforms', 'metacritic', 'reviews', 'categories', 'genres', 'screenshots', 'movies', 'recommendations', 'achievements', 'release_date', 'support_info', 'background', 'content_descriptors']

# Overwrite last index for demonstration
reset_index(steam_index)

index = get_index(steam_index)

# Wipe or create data file and write headers if index is 0
prepare_data_file(steam_app_data, index, steam_columns)

# Set end and chunksize for demonstration - remove to run through entire app list
process_chunks(
    parser=parse_steam_request,
    app_list=app_list,
    data_filename=steam_app_data,
    index_filename=steam_index,
    columns=steam_columns,
    begin=index,
    end=10,
    chunksize=2
)

Starting at index 0:

Exported lines 0-1 to steam_app_data_demo.csv. Chunk 0 time: 0:00:05 (avg: 0:00:05, remaining: 0:00:19)
Exported lines 2-3 to steam_app_data_demo.csv. Chunk 1 time: 0:00:05 (avg: 0:00:05, remaining: 0:00:14)
Exported lines 4-5 to steam_app_data_demo.csv. Chunk 2 time: 0:00:05 (avg: 0:00:05, remaining: 0:00:10)
Exported lines 6-7 to steam_app_data_demo.csv. Chunk 3 time: 0:00:05 (avg: 0:00:05, remaining: 0:00:05)
Exported lines 8-9 to steam_app_data_demo.csv. Chunk 4 time: 0:00:05 (avg: 0:00:05, remaining: 0:00:00)

Processing chunks complete. 10 apps written
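
Because each app returns a different subset of fields, the fixed `columns` list does the work of aligning rows. The sketch below shows, on two made-up records, how `csv.DictWriter` behaves with `extrasaction='ignore'`: keys outside `fieldnames` are dropped, and missing keys are written as empty strings (the default `restval`). The records, the shortened `columns` list and the `unexpected_field` key are invented for illustration.

```python
import csv
import io

columns = ['name', 'steam_appid', 'is_free']

records = [
    # full record plus a key that is not in columns
    {'name': 'Counter-Strike', 'steam_appid': 10, 'is_free': False, 'unexpected_field': 'x'},
    # fallback record of the kind returned when 'success' is False
    {'name': 'Some App', 'steam_appid': 99},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction='ignore')
writer.writeheader()
writer.writerows(records)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
# name,steam_appid,is_free
# Counter-Strike,10,False
# Some App,99,
```

Without `extrasaction='ignore'`, the first record would raise a `ValueError` because of the extra key, which is why an explicit column list and this setting go together.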


In [21]:
# inspect downloaded data
pd.read_csv('../data/raw/steam_app_data_demo.csv').head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65801},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2805},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1993},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 933},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4366},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
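
Note that the nested fields above are stored as the string form of Python objects (single-quoted keys, `None`, `False`) rather than as JSON, because the csv writer converts each dict or list with `str()`. As a pointer for the cleaning stage, one way to recover them is `ast.literal_eval`. This is a sketch on values copied from the output above, not part of the original pipeline.

```python
import ast

# Cell values as they appear in the csv output above
release_date_cell = "{'coming_soon': False, 'date': '1 Nov, 2000'}"
recommendations_cell = "{'total': 65801}"

# literal_eval safely parses Python literals without executing code
release_date = ast.literal_eval(release_date_cell)
recommendations = ast.literal_eval(recommendations_cell)

print(release_date['date'])       # 1 Nov, 2000
print(recommendations['total'])   # 65801
```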


Now we have our steam data, we perform a very similar process to retrieve data from steamspy. 

Our parse function is a little simpler because of how data is returned, and the maximum polling rate of this API is higher so we can set `pause` lower and download faster.

In [15]:
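def parse_steamspy_request(appid, name):
    """Return the steamspy 'appdetails' response for appid (name is unused, kept to match the parser interface)."""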
    url = "https://steamspy.com/api.php"
    parameters = {"request": "appdetails", "appid": appid}
    
    json_data = get_request(url, parameters)
    return json_data


# set files and columns
steamspy_data = 'steamspy_data_demo.csv'
steamspy_index = 'steamspy_index_demo.txt'
steamspy_columns = ['appid', 'name', 'developer', 'publisher', 'score_rank', 'positive', 'negative', 'userscore', 'owners', 'average_forever', 'average_2weeks', 'median_forever', 'median_2weeks', 'price', 'initialprice', 'discount', 'languages', 'genre', 'ccu', 'tags']
            
reset_index(steamspy_index)
index = get_index(steamspy_index)

# Wipe data file if index is 0
prepare_data_file(steamspy_data, index, steamspy_columns)

process_chunks(
    parser=parse_steamspy_request,
    app_list=app_list,
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    columns=steamspy_columns,
    begin=index,
    end=20,
    chunksize=5,
    pause=0.3
)

Starting at index 0:

Exported lines 0-4 to steamspy_data_demo.csv. Chunk 0 time: 0:00:03 (avg: 0:00:03, remaining: 0:00:10)
Exported lines 5-9 to steamspy_data_demo.csv. Chunk 1 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:07)
Exported lines 10-14 to steamspy_data_demo.csv. Chunk 2 time: 0:00:03 (avg: 0:00:04, remaining: 0:00:04)
Exported lines 15-19 to steamspy_data_demo.csv. Chunk 3 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:00)

Processing chunks complete. 20 apps written


In [22]:
pd.read_csv('../data/raw/steamspy_data_demo.csv').head()

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
0,10,Counter-Strike,Valve,Valve,,124638,3338,0,"10,000,000 .. 20,000,000",8586,1854,255,835,999,999,0,"English, French, German, Italian, Spanish - Sp...",Action,15197,"{'Action': 2681, 'FPS': 2048, 'Multiplayer': 1..."
1,20,Team Fortress Classic,Valve,Valve,,3321,635,0,"5,000,000 .. 10,000,000",1541,0,17,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action,73,"{'Action': 208, 'FPS': 188, 'Multiplayer': 172..."
2,30,Day of Defeat,Valve,Valve,,3420,399,0,"5,000,000 .. 10,000,000",1860,1,27,1,499,499,0,"English, French, German, Italian, Spanish - Spain",Action,118,"{'FPS': 138, 'World War II': 122, 'Multiplayer..."
3,40,Deathmatch Classic,Valve,Valve,,1275,270,0,"5,000,000 .. 10,000,000",2453,0,95,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action,5,"{'Action': 85, 'FPS': 71, 'Multiplayer': 58, '..."
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,,5259,290,0,"5,000,000 .. 10,000,000",2099,0,230,0,499,499,0,"English, French, German, Korean",Action,55,"{'FPS': 235, 'Action': 211, 'Sci-fi': 166, 'Si..."


# Next Steps

Now we have our downloaded data, stored in two separate csv files, we can move on to the next step: data cleaning. As previously stated, we have performed as little cleaning and parsing of columns as possible to help create 'dirty data' for cleaning.