# Steam Data Analysis

## Project Goals

<!-- PELICAN_BEGIN_SUMMARY -->

The goal is to gather, process and analyze Steam Store data to gain insight into trends in the video game market. Because Steam is an online marketplace with publicly available data, it offers more possibilities than analyzing console game data, where we would have to rely on an existing dataset.

We want to focus on two main aspects. The first is a general market analysis: which genres are the most popular, which pricing strategies are used, and so on. This could be interesting for a new developer planning a game or deciding on a pricing policy. It has already been studied by other enthusiasts on the internet, and by marketing companies that help publishers.

To offer a different angle, we also want to focus on developers and publishers: which ones are the most successful, how they have improved or declined over the years, which titles cemented their success, and so on. Recent years have seen many acquisitions by large publishers such as Tencent, Microsoft and Sony, which makes this a particularly interesting topic.

This will be a complete data project: a data acquisition section (using some APIs and web scraping), then data cleaning and joining of data from different sources, an exploratory data analysis, and finally some key conclusions.


## Data Acquisition

This is the section where I struggled initially. Several datasets were available at [kaggle](https://www.kaggle.com/datasets), reddit and similar websites, but most were outdated or did not contain all the information I wanted to explore. I also wanted to extract the data directly from an API or by web scraping, if possible, to learn a bit more (I already had experience with Twitter, which has an excellent API).

[SteamSpy](https://steamspy.com/about) is a website that offers data about Steam games. In the past it could even deliver a good estimate of sales, but that has become harder over the years; see [VG Insights](https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales) for more information. It is also a good site if you want to explore market data on your own.
Most importantly for us, SteamSpy has its own [API](https://steamspy.com/api.php). It easily provides an already filtered list of games (no other apps or DLCs), along with some metrics that are not directly available from Steam, such as an estimate of sales and the number of positive and negative reviews (Steam only gives us the total number of reviews).

Steam's own API is documented at https://partner.steamgames.com/. However, you need a developer key, and most of the functions are tied to that key because they are intended for managing your own products on the Steam store. Thanks to [Nik Davis](http://nik-davis.github.io) I discovered that a few functions of the web API can be used without a key at all. See [StorefrontAPI](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI) for more details.

Getting the information from Steam will be a bit more difficult, but it will give us additional fields, such as release date and genre.

We will first retrieve the list of app IDs together with the information available at SteamSpy, then get the information from Steam for each app ID and combine both into a single dataframe. Because app IDs are unique, the join loses no information. Afterwards, we will clean the data and finally start analyzing our dataset.
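
To make the join step concrete, here is a minimal sketch with two toy dataframes. The values are made up, and using `validate="one_to_one"` is my own suggestion (not something the download code does) so that pandas raises an error if an app ID ever appears twice on either side.

```python
import pandas as pd

# Toy stand-ins for the two downloads (values are made up for illustration).
steamspy = pd.DataFrame({
    "appid": [10, 20, 30],
    "positive": [100, 50, 7],
    "negative": [10, 5, 3],
})
steam = pd.DataFrame({
    "steam_appid": [10, 20, 30],
    "release_date": ["2000-11-01", "1999-04-01", "2003-05-01"],
})

# validate="one_to_one" makes pandas raise if an app ID is duplicated on either side
combined = steamspy.merge(
    steam, left_on="appid", right_on="steam_appid", how="left", validate="one_to_one"
)
print(combined.shape)
```

A left join keeps every SteamSpy row even if Steam returned nothing for that app, which matches the idea of using the SteamSpy list as the master list of IDs.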

## Process:

- Create an app list and gather available data from SteamSpy API using 'all' request
- Retrieve individual app data from Steam API, by iterating through app list
- Export app list, Steam data and SteamSpy data to csv files

## API references:

- https://partner.steamgames.com/doc/webapi
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php


## Credits

The most important source I found while looking into how to connect to the APIs was Nik Davis. Check his blog for a different analysis of Steam data (from 2019): http://nik-davis.github.io
The download functions for the APIs are based on his "Steam Data Download" notebook. I had to make some changes and simplify them a bit.

SteamSpy seems to have changed its API, so I had to change the download method to fetch all the data page by page (sets of 1,000 IDs). The functions defined for the Steam API itself still work as is.

In [16]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time

# third-party imports
import numpy as np
import pandas as pd
import requests

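# customisations - ensure tables show all columns
# (the full option name is required; the short "max_columns" is ambiguous in recent pandas)
pd.set_option("display.max_columns", 100)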

In [17]:
import seaborn as sns
import matplotlib.pyplot as plt

The next function uses the requests library to get a JSON response from a web API. It is based on Nik Davis's earlier work and is quite standard: (thankfully) web APIs use a standard format, and requests makes this really easy.

In [18]:
def get_request(url, parameters=None, steamspy=False):
    """Return json-formatted response of a get request using optional parameters.
    
    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request
    steamspy : bool
        if True, return "stop" on a failed request instead of retrying
    
    Returns
    -------
    json_data
        json-formatted response (dict-like)
    """
    try:
        response = requests.get(url=url, params=parameters)
    except requests.exceptions.SSLError as s:
        print('SSL Error:', s)
        
        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)
        
        # recursively try again
        return get_request(url, parameters, steamspy)
    
    if response:
        return response.json()
    else:
        # We do not know how many pages SteamSpy has, so a failed request
        # is used as the signal to stop paging.
        if steamspy:
            return "stop"
        else:
            # a falsy response usually means too many requests: wait and try again
            print('No response, waiting 10 seconds...')
            time.sleep(10)
            print('Retrying.')
            return get_request(url, parameters, steamspy)
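
The function above retries by calling itself with no upper limit, which is fine for a supervised download but could loop forever if an endpoint disappears. As a hedged alternative, a bounded retry could look like the sketch below. `fetch_with_retries` and the `fetch` callable are hypothetical names used only for illustration; they are not part of the original code or of requests.

```python
import time

def fetch_with_retries(fetch, max_retries=3, pause=0.0):
    """Call fetch() until it returns a non-None value or retries run out.

    fetch : hypothetical zero-argument callable standing in for one API request
    """
    for attempt in range(1, max_retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries:
            time.sleep(pause * attempt)  # wait a little longer after each failure
    return None

# Stub that fails twice, then succeeds (illustration only).
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    return {"ok": True} if calls["n"] >= 3 else None

result = fetch_with_retries(flaky, max_retries=5)
print(result, calls["n"])
```

Returning `None` after the last attempt lets the caller decide whether to skip the app or abort the batch.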

## Steam Spy API

Apps on Steam have a unique ID. Requests to the Steam API (which has more information than SteamSpy) have to be made for a specific ID, so we first need a list of IDs.

There are several ways to do this; this is the approach I decided to follow:

* Use the SteamSpy API (see https://steamspy.com/api.php) to get the list of IDs and, at the same time, the SteamSpy metadata. Alternatively, we could use the Steam API to get a list of apps and then filter them (see https://api.steampowered.com/ISteamApps/GetAppList/v2/? or https://steamapi.xpaw.me/#IStoreService/GetAppInfo ).

* Then use the Steam API to loop over each ID in the list and get the complete info.

We are going to use this request: https://steamspy.com/api.php?request=all&page=1, which returns apps 1,000-1,999 of the full list.
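
As a small illustration of how the page number maps to the request URL and to positions in the list, here is a sketch using only the standard library. The helper names `page_url` and `page_bounds` are mine, and the page size of 1,000 is taken from the description above.

```python
from urllib.parse import urlencode

BASE_URL = "https://steamspy.com/api.php"

def page_url(page):
    """Return the URL of one page of the 'all' request."""
    return BASE_URL + "?" + urlencode({"request": "all", "page": page})

def page_bounds(page, page_size=1000):
    """Return the first and last position covered by a page (0-based)."""
    first = page * page_size
    return first, first + page_size - 1

print(page_url(1))     # https://steamspy.com/api.php?request=all&page=1
print(page_bounds(1))  # (1000, 1999)
```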

In [6]:
url = "https://steamspy.com/api.php"
parameters = {"request": "all"}
pages = []
i = 0

# request 'all' from SteamSpy page by page until a request fails
while True:
    parameters["page"] = i
    json_data = get_request(url, parameters=parameters, steamspy=True)
    if json_data == "stop":
        # we reached the end of the SteamSpy pages
        break
    # parse each page into a dataframe (one row per app)
    pages.append(pd.DataFrame.from_dict(json_data, orient='index'))
    time.sleep(1)
    i = i + 1

# DataFrame.append was removed in pandas 2.0, so concatenate the pages instead
steam_spy_all = pd.concat(pages)
steam_spy_all.to_csv("../data/download/steamspy_appid.csv")
# generate sorted app_list from steamspy data
# app_list = steam_spy_all[['appid', 'name']].sort_values('appid').reset_index(drop=True)


# display first few rows
# app_list.head()

We did not know in advance how many app IDs were registered in SteamSpy as of 26/1/2022 (SteamSpy seems to rank them by user average); it turns out there are about 51k.

The code above uses the steamspy flag to stop as soon as a request fails. No retry logic is needed because the SteamSpy API is quite generous: it let us download all the data with just a 1 s pause between requests (although its documentation says we should wait 60 s between full-page requests).

Instead of telling us that a page does not exist (for example, by returning null), SteamSpy responds with a server error, so we use that error as the stop signal. Steam just returns null in this scenario.
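
The "page until it fails" idea can be tried without touching the network. The sketch below uses a stub in place of the real API; `download_all_pages`, `fake_fetch` and the `max_pages` cap are hypothetical names and choices for illustration, and the app IDs are made up.

```python
def download_all_pages(fetch_page, max_pages=1000):
    """Collect pages until fetch_page signals the end by returning None.

    fetch_page : hypothetical callable taking a page number and returning a
                 dict of {appid: record}, or None when the request fails
    """
    records = {}
    for page in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        data = fetch_page(page)
        if data is None:
            break
        records.update(data)
    return records

# Stub API with three pages of two apps each (made-up IDs).
fake_pages = [
    {"10": {"appid": 10}, "20": {"appid": 20}},
    {"30": {"appid": 30}, "40": {"appid": 40}},
    {"50": {"appid": 50}, "60": {"appid": 60}},
]

def fake_fetch(page):
    return fake_pages[page] if page < len(fake_pages) else None

all_records = download_all_pages(fake_fetch)
print(len(all_records))
```

One caveat of this approach is that a temporary server error looks exactly like the end of the data, so it is worth checking the final row count against what SteamSpy shows on its site.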

In [19]:
df_steam_spy = pd.read_csv("../data/download/steamspy_appid.csv")
# generate sorted app_list from steamspy data
app_list = df_steam_spy[['appid', 'name']].sort_values('appid').reset_index(drop=True)

In [20]:
df_steam_spy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51353 entries, 0 to 51352
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       51353 non-null  int64  
 1   appid            51353 non-null  int64  
 2   name             51340 non-null  object 
 3   developer        51137 non-null  object 
 4   publisher        51202 non-null  object 
 5   score_rank       43 non-null     float64
 6   positive         51353 non-null  int64  
 7   negative         51353 non-null  int64  
 8   userscore        51353 non-null  int64  
 9   owners           51353 non-null  object 
 10  average_forever  51353 non-null  int64  
 11  average_2weeks   51353 non-null  int64  
 12  median_forever   51353 non-null  int64  
 13  median_2weeks    51353 non-null  int64  
 14  price            51324 non-null  float64
 15  initialprice     51331 non-null  float64
 16  discount         51331 non-null  float64
 17  ccu         

## Define Download Logic

Here I simply reuse Nik Davis's earlier work to get the information for each app ID from Steam. Now that I know the polling rate of the storefront API, I could probably do something less sophisticated, but I prefer to focus on the analysis rather than on the acquisition phase. I keep Nik Davis's original comments `in quotes` so the reader can follow the process.

`Now we have the app_list dataframe, we can iterate over the app IDs and request individual app data from the servers. Here we set out our logic to retrieve and process this information, then finally store the data as a csv file.`

`Because it takes a long time to retrieve the data, it would be dangerous to attempt it all in one go as any errors or connection time-outs could cause the loss of all our data. For this reason we define a function to download and process the requests in batches, appending each batch to an external file and keeping track of the highest index written in a separate file.`

`This not only provides security, allowing us to easily restart the process if an error is encountered, but also means we can complete the download across multiple sessions.`

`Again, we provide verbose output for rows exported, batches complete, time taken and estimated time remaining.`
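
Before reading the full function, it may help to see how the batch boundaries are derived. This is a pure-Python sketch of the same idea (the notebook code uses `np.arange` for this); `batch_bounds` is a hypothetical helper name.

```python
def batch_bounds(begin, end, batchsize):
    """Return (start, stop) pairs covering [begin, end) in steps of batchsize."""
    starts = list(range(begin, end, batchsize))
    stops = starts[1:] + [end]
    return list(zip(starts, stops))

bounds = batch_bounds(begin=0, end=250, batchsize=100)
print(bounds)  # [(0, 100), (100, 200), (200, 250)]

# Resuming from a stored index only changes where the first batch starts.
print(batch_bounds(begin=200, end=250, batchsize=100))  # [(200, 250)]
```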

In [21]:
def get_app_data(start, stop, parser, pause):
    """Return list of app data generated from parser.
    
    parser : function to handle request
    """
    app_data = []
    
    # iterate through each row of app_list, confined by start and stop
    for index, row in app_list[start:stop].iterrows():
        print('Current index: {}'.format(index), end='\r')
        
        appid = row['appid']
        name = row['name']

        # retrieve app data for a row, handled by supplied parser, and append to list
        data = parser(appid, name)
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests
    
    return app_data


def process_batches(parser, app_list, download_path, data_filename, index_filename,
                    columns, begin=0, end=-1, batchsize=100, pause=1):
    """Process app data in batches, writing directly to file.
    
    parser : custom function to format request
    app_list : dataframe of appid and name
    download_path : path to store data
    data_filename : filename to save app data
    index_filename : filename to store highest index written
    columns : column names for file
    
    Keyword arguments:
    
    begin : starting index (get from index_filename, default 0)
    end : index to finish (defaults to end of app_list)
    batchsize : number of apps to write in each batch (default 100)
    pause : time to wait after each api request (default 1)
    
    returns: none
    """
    print('Starting at index {}:\n'.format(begin))
    
    # by default, process all apps in app_list
    # (len(app_list) is already an exclusive upper bound for slicing, so no +1 is needed)
    if end == -1:
        end = len(app_list)
    
    # generate array of batch begin and end points
    batches = np.arange(begin, end, batchsize)
    batches = np.append(batches, end)
    
    apps_written = 0
    batch_times = []
    
    for i in range(len(batches) - 1):
        start_time = time.time()
        
        start = batches[i]
        stop = batches[i+1]
        
        app_data = get_app_data(start, stop, parser, pause)
        
        rel_path = os.path.join(download_path, data_filename)
        
        # writing app data to file
        with open(rel_path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
            
            for j in range(3,0,-1):
                print("\rAbout to write data, don't stop script! ({})".format(j), end='')
                time.sleep(0.5)
            
            writer.writerows(app_data)
            print('\rExported lines {}-{} to {}.'.format(start, stop-1, data_filename), end=' ')
            
        apps_written += len(app_data)
        
        idx_path = os.path.join(download_path, index_filename)
        
        # writing last index to file
        with open(idx_path, 'w') as f:
            index = stop
            print(index, file=f)
            
        # logging time taken
        end_time = time.time()
        time_taken = end_time - start_time
        
        batch_times.append(time_taken)
        mean_time = statistics.mean(batch_times)
        
        est_remaining = (len(batches) - i - 2) * mean_time
        
        remaining_td = dt.timedelta(seconds=round(est_remaining))
        time_td = dt.timedelta(seconds=round(time_taken))
        mean_td = dt.timedelta(seconds=round(mean_time))
        
        print('Batch {} time: {} (avg: {}, remaining: {})'.format(i, time_td, mean_td, remaining_td))
            
    print('\nProcessing batches complete. {} apps written'.format(apps_written))

`Next we define some functions to handle and prepare the external files.`

`We use reset_index for testing and demonstration, allowing us to easily reset the index in the stored file to 0, effectively restarting the entire download process.`

`We define get_index to retrieve the index from file, maintaining persistence across sessions. Every time a batch of information (app data) is written to file, we write the highest index within app_data that was retrieved. As stated, this is partially for security, ensuring that if there is an error during the download we can read the index from file and continue from the end of the last successful batch. Keeping track of the index also allows us to pause the download, continuing at a later time.`

`Finally, the prepare_data_file function readies the csv for storing the data. If the index we retrieved is 0, it means we are either starting for the first time or starting over. In either case, we want a blank csv file with only the header row to begin writing to, so we wipe the file (by opening in write mode) and write the header. Conversely, if the index is anything other than 0, it means we already have downloaded information, and can leave the csv file alone.`
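
The checkpoint idea is easy to test in isolation. Here is a minimal sketch that writes and reads an index inside a temporary directory; `save_checkpoint`, `load_checkpoint` and the file name `demo_index.txt` are hypothetical and only mirror the behaviour described above.

```python
import os
import tempfile

def save_checkpoint(path, index):
    """Overwrite the checkpoint file with the next index to process."""
    with open(path, "w") as f:
        f.write(str(index))

def load_checkpoint(path):
    """Return the stored index, or 0 when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return int(f.readline())
    except FileNotFoundError:
        return 0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_index.txt")  # hypothetical file name
    first_run = load_checkpoint(path)   # no file yet
    save_checkpoint(path, 18000)        # pretend 18,000 rows were written
    resumed = load_checkpoint(path)

print(first_run, resumed)  # 0 18000
```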

In [24]:
def reset_index(download_path, index_filename):
    """Reset index in file to 0."""
    rel_path = os.path.join(download_path, index_filename)
    
    # use a context manager so the file is flushed and closed
    with open(rel_path, 'w') as f:
        f.write("0")
        

def get_index(download_path, index_filename):
    """Retrieve index from file, returning 0 if file not found."""
    try:
        rel_path = os.path.join(download_path, index_filename)

        with open(rel_path, 'r') as f:
            index = int(f.readline())
    
    except FileNotFoundError:
        index = 0
        
    return index


def prepare_data_file(download_path, filename, index, columns):
    """Create file and write headers if index is 0."""
    if index == 0:
        rel_path = os.path.join(download_path, filename)

        with open(rel_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()

## Download Steam Data

`Now we are ready to start downloading data and writing to file. We define our logic particular to handling the steam API - in fact if no data is returned we return just the name and appid - then begin setting some parameters. We define the files we will write our data and index to, and the columns for the csv file. The API doesn't return every column for every app, so it is best to explicitly set these.`

`Next we run our functions to set up the files, and make a call to process_batches to begin the process. Some additional parameters have been added for demonstration, to constrain the download to just a few rows and smaller batches. Removing these would allow the entire download process to be repeated.`
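
To see why the column list is set explicitly, here is a small offline sketch. The responses are made up but shaped like the ones the parser in this post handles, `extract_app_data` is a hypothetical helper, and the column list is shortened for the demo. `csv.DictWriter` leaves missing columns empty and, with `extrasaction="ignore"`, drops keys that are not in the list.

```python
import csv
import io

COLUMNS = ["name", "steam_appid", "is_free"]  # shortened column list for the demo

def extract_app_data(response, appid, name):
    """Return the 'data' payload, or a minimal record when success is false."""
    entry = response[str(appid)]
    if entry.get("success"):
        return entry["data"]
    return {"name": name, "steam_appid": appid}

# Made-up responses for illustration.
ok = {"10": {"success": True, "data": {
    "name": "Demo", "steam_appid": 10, "is_free": False, "unlisted_field": "x"}}}
missing = {"20": {"success": False}}

rows = [extract_app_data(ok, 10, "Demo"), extract_app_data(missing, 20, "Gone")]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
lines = buffer.getvalue().splitlines()
print(lines)  # ['name,steam_appid,is_free', 'Demo,10,False', 'Gone,20,']
```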

I tweaked several of these parameters to check whether the download could be made in batches (requesting several apps at the same time), or with a faster polling rate (right now it is one request per second).

The storefront API (http://store.steampowered.com/api/) is largely undocumented, but the key takeaway is that neither is possible. The official Steamworks API lets us do other things and is quite well documented, but it does not expose the data shown on a Steam store page, which is what interests us.

The storefront API is only accessible through these [requests](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI), and according to this [stackoverflow discussion](https://stackoverflow.com/questions/46330864/steam-api-all-games):
`There is a general API rate limit for each unique IP address of 200 requests in five minutes which is one request every 1.5 seconds.` This matches our experience: Nik Davis put a pause of just 1 second between requests, and with that setting we get only a few reconnect errors. With no pause at all, we end up limited by the 200 requests every 5 minutes anyway.

That means that for a volume of around 50k apps in January 2022 (the Steam apps that are also available at SteamSpy, already filtered to games with some owner data), this download will take around 21 hours. Thankfully we can resume it and run it across several sessions.
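
The estimate follows directly from the quoted rate limit. As a quick check of the arithmetic (the 50,000 figure is the approximate volume mentioned above):

```python
import datetime as dt

REQUESTS_PER_WINDOW = 200      # limit quoted in the discussion above
WINDOW_SECONDS = 5 * 60
N_APPS = 50_000                # approximate volume in January 2022

seconds_per_request = WINDOW_SECONDS / REQUESTS_PER_WINDOW
total_seconds = N_APPS * seconds_per_request

print(seconds_per_request)                  # 1.5
print(dt.timedelta(seconds=total_seconds))  # 20:50:00
```

The log further down agrees with this: batches of 100 apps take about 2 minutes 32 seconds, which is roughly 1.5 seconds per app.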

If we were building a web app and wanted to update the information daily, we could instead pull the full app list from Steam along with the "last updated" field and only request the full information for the IDs that changed. This is probably what SteamSpy does, while SteamDB uses a more sophisticated approach: it is notified of any changes to app IDs via Steamworks.
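
A rough sketch of that incremental idea is shown below. I have not implemented this for the project; the `last_modified` field name is an assumption about what the app-list response contains, `ids_to_refresh` is a hypothetical helper, and the data is made up.

```python
def ids_to_refresh(app_list, last_sync):
    """Return the appids whose last_modified timestamp is newer than last_sync.

    app_list  : list of dicts with 'appid' and 'last_modified' (Unix seconds);
                the field name is an assumption about the app-list response
    last_sync : Unix timestamp of the previous full or partial download
    """
    return [app["appid"] for app in app_list if app["last_modified"] > last_sync]

# Made-up app list for illustration.
apps = [
    {"appid": 10, "last_modified": 1_640_000_000},
    {"appid": 20, "last_modified": 1_643_000_000},
    {"appid": 30, "last_modified": 1_643_200_000},
]

stale = ids_to_refresh(apps, last_sync=1_642_000_000)
print(stale)  # [20, 30]
```

Only the IDs in `stale` would then go through the slow per-app request, which keeps a daily update well under the rate limit as long as few apps change per day.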

In any case, for a one-shot analysis (rather than a web page where users explore the data), the full download approach is fine.

In [25]:
def parse_steam_request(appid, name):
    """Unique parser to handle data from Steam Store API.
    
    Returns : json formatted data (dict-like)
    """
    url = "http://store.steampowered.com/api/appdetails/"
    # the storefront appdetails endpoint works without an API key, so none is sent
    parameters = {"appids": appid}
    
    json_data = get_request(url, parameters=parameters)
    json_app_data = json_data[str(appid)]
    
    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'name': name, 'steam_appid': appid}
        
    return data


# Set file parameters
download_path = '../data/download'
steam_app_data = 'steam_app_data.csv'
steam_index = 'steam_index.txt'

steam_columns = [
    'type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support',
    'dlc', 'detailed_description', 'about_the_game', 'short_description', 'fullgame',
    'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements',
    'linux_requirements', 'legal_notice', 'drm_notice', 'ext_user_account_notice',
    'developers', 'publishers', 'demos', 'price_overview', 'packages', 'package_groups',
    'platforms', 'metacritic', 'reviews', 'categories', 'genres', 'screenshots',
    'movies', 'recommendations', 'achievements', 'release_date', 'support_info',
    'background', 'content_descriptors'
]

# Overwrites last index for demonstration (would usually store highest index so can continue across sessions)
# reset_index(download_path, steam_index)

# Retrieve last index downloaded from file
index = get_index(download_path, steam_index)

# Wipe or create data file and write headers if index is 0
prepare_data_file(download_path, steam_app_data, index, steam_columns)

# Uncomment end and batchsize to constrain the run for demonstration; as written, the entire app list is processed
process_batches(
    parser=parse_steam_request,
    app_list=app_list,
    download_path=download_path,
    data_filename=steam_app_data,
    index_filename=steam_index,
    columns=steam_columns,
    begin=index,
    #end=10,
    #batchsize=5
)

Starting at index 18000:

Exported lines 18000-18099 to steam_app_data.csv. Batch 0 time: 0:02:28 (avg: 0:02:28, remaining: 13:40:52)
Exported lines 18100-18199 to steam_app_data.csv. Batch 1 time: 0:02:30 (avg: 0:02:29, remaining: 13:44:44)
No response, waiting 10 seconds...
Retrying.
Exported lines 18200-18299 to steam_app_data.csv. Batch 2 time: 0:02:43 (avg: 0:02:34, remaining: 14:07:28)
Exported lines 18300-18399 to steam_app_data.csv. Batch 3 time: 0:02:28 (avg: 0:02:32, remaining: 13:57:35)
Exported lines 18400-18499 to steam_app_data.csv. Batch 4 time: 0:02:27 (avg: 0:02:31, remaining: 13:49:26)
Exported lines 18500-18599 to steam_app_data.csv. Batch 5 time: 0:02:31 (avg: 0:02:31, remaining: 13:46:37)
No response, waiting 10 seconds...
Retrying.
Exported lines 18600-18699 to steam_app_data.csv. Batch 6 time: 0:02:39 (avg: 0:02:32, remaining: 13:50:16)
Exported lines 18700-18799 to steam_app_data.csv. Batch 7 time: 0:02:29 (avg: 0:02:32, remaining: 13:45:37)
No response, waiting

Exported lines 24700-24799 to steam_app_data.csv. Batch 67 time: 0:02:31 (avg: 0:02:32, remaining: 11:15:44)
Exported lines 24800-24899 to steam_app_data.csv. Batch 68 time: 0:02:33 (avg: 0:02:32, remaining: 11:13:13)
Exported lines 24900-24999 to steam_app_data.csv. Batch 69 time: 0:02:30 (avg: 0:02:32, remaining: 11:10:31)
Exported lines 25000-25099 to steam_app_data.csv. Batch 70 time: 0:02:33 (avg: 0:02:32, remaining: 11:08:00)
Exported lines 25100-25199 to steam_app_data.csv. Batch 71 time: 0:02:31 (avg: 0:02:32, remaining: 11:05:23)
Exported lines 25200-25299 to steam_app_data.csv. Batch 72 time: 0:02:33 (avg: 0:02:32, remaining: 11:02:52)
Exported lines 25300-25399 to steam_app_data.csv. Batch 73 time: 0:02:30 (avg: 0:02:32, remaining: 11:00:10)
Exported lines 25400-25499 to steam_app_data.csv. Batch 74 time: 0:02:29 (avg: 0:02:32, remaining: 10:57:26)
No response, waiting 10 seconds...
Retrying.
Exported lines 25500-25599 to steam_app_data.csv. Batch 75 time: 0:02:40 (avg: 0:02

Exported lines 31500-31599 to steam_app_data.csv. Batch 135 time: 0:02:37 (avg: 0:02:33, remaining: 8:23:29)
Exported lines 31600-31699 to steam_app_data.csv. Batch 136 time: 0:02:28 (avg: 0:02:33, remaining: 8:20:49)
No response, waiting 10 seconds...
Retrying.
Exported lines 31700-31799 to steam_app_data.csv. Batch 137 time: 0:02:39 (avg: 0:02:33, remaining: 8:18:25)
Exported lines 31800-31899 to steam_app_data.csv. Batch 138 time: 0:02:28 (avg: 0:02:33, remaining: 8:15:46)
No response, waiting 10 seconds...
Retrying.
Exported lines 31900-31999 to steam_app_data.csv. Batch 139 time: 0:02:40 (avg: 0:02:33, remaining: 8:13:24)
Exported lines 32000-32099 to steam_app_data.csv. Batch 140 time: 0:02:29 (avg: 0:02:33, remaining: 8:10:46)
Exported lines 32100-32199 to steam_app_data.csv. Batch 141 time: 0:02:30 (avg: 0:02:33, remaining: 8:08:11)
Exported lines 32200-32299 to steam_app_data.csv. Batch 142 time: 0:02:29 (avg: 0:02:33, remaining: 8:05:33)
No response, waiting 10 seconds...
Ret

Exported lines 36300-36399 to steam_app_data.csv. Batch 183 time: 0:02:33 (avg: 0:02:36, remaining: 6:30:12)
Exported lines 36400-36499 to steam_app_data.csv. Batch 184 time: 0:02:52 (avg: 0:02:36, remaining: 6:27:49)
Exported lines 36500-36599 to steam_app_data.csv. Batch 185 time: 0:02:58 (avg: 0:02:36, remaining: 6:25:31)
Exported lines 36600-36699 to steam_app_data.csv. Batch 186 time: 0:02:34 (avg: 0:02:36, remaining: 6:22:53)
Exported lines 36700-36799 to steam_app_data.csv. Batch 187 time: 0:02:38 (avg: 0:02:36, remaining: 6:20:18)
Exported lines 36800-36899 to steam_app_data.csv. Batch 188 time: 0:02:34 (avg: 0:02:36, remaining: 6:17:40)
Exported lines 36900-36999 to steam_app_data.csv. Batch 189 time: 0:02:34 (avg: 0:02:36, remaining: 6:15:02)
Exported lines 37000-37099 to steam_app_data.csv. Batch 190 time: 0:02:45 (avg: 0:02:36, remaining: 6:12:32)
Exported lines 37100-37199 to steam_app_data.csv. Batch 191 time: 0:02:41 (avg: 0:02:36, remaining: 6:10:00)
Exported lines 3720

Exported lines 43600-43699 to steam_app_data.csv. Batch 256 time: 0:02:39 (avg: 0:02:36, remaining: 3:20:21)
Exported lines 43700-43799 to steam_app_data.csv. Batch 257 time: 0:02:27 (avg: 0:02:36, remaining: 3:17:42)
No response, waiting 10 seconds...
Retrying.
Exported lines 43800-43899 to steam_app_data.csv. Batch 258 time: 0:02:40 (avg: 0:02:36, remaining: 3:15:07)
Exported lines 43900-43999 to steam_app_data.csv. Batch 259 time: 0:02:27 (avg: 0:02:36, remaining: 3:12:28)
No response, waiting 10 seconds...
Retrying.
Exported lines 44000-44099 to steam_app_data.csv. Batch 260 time: 0:02:41 (avg: 0:02:36, remaining: 3:09:54)
Exported lines 44100-44199 to steam_app_data.csv. Batch 261 time: 0:02:26 (avg: 0:02:36, remaining: 3:07:15)
No response, waiting 10 seconds...
Retrying.
Exported lines 44200-44299 to steam_app_data.csv. Batch 262 time: 0:02:39 (avg: 0:02:36, remaining: 3:04:40)
Exported lines 44300-44399 to steam_app_data.csv. Batch 263 time: 0:02:30 (avg: 0:02:36, remaining: 3:

Exported lines 50500-50599 to steam_app_data.csv. Batch 325 time: 0:02:36 (avg: 0:02:36, remaining: 0:20:46)
Exported lines 50600-50699 to steam_app_data.csv. Batch 326 time: 0:02:26 (avg: 0:02:36, remaining: 0:18:10)
No response, waiting 10 seconds...
Retrying.
Exported lines 50700-50799 to steam_app_data.csv. Batch 327 time: 0:02:36 (avg: 0:02:36, remaining: 0:15:34)
Exported lines 50800-50899 to steam_app_data.csv. Batch 328 time: 0:02:26 (avg: 0:02:36, remaining: 0:12:58)
No response, waiting 10 seconds...
Retrying.
Exported lines 50900-50999 to steam_app_data.csv. Batch 329 time: 0:02:38 (avg: 0:02:36, remaining: 0:10:23)
Exported lines 51000-51099 to steam_app_data.csv. Batch 330 time: 0:02:27 (avg: 0:02:36, remaining: 0:07:47)
No response, waiting 10 seconds...
Retrying.
Exported lines 51100-51199 to steam_app_data.csv. Batch 331 time: 0:02:38 (avg: 0:02:36, remaining: 0:05:11)
Exported lines 51200-51299 to steam_app_data.csv. Batch 332 time: 0:02:26 (avg: 0:02:36, remaining: 0:

In [26]:
# inspect downloaded data
pd.read_csv('../data/download/steam_app_data.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51353 entries, 0 to 51352
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     51199 non-null  object 
 1   name                     51343 non-null  object 
 2   steam_appid              51353 non-null  int64  
 3   required_age             51199 non-null  object 
 4   is_free                  51199 non-null  object 
 5   controller_support       10647 non-null  object 
 6   dlc                      8615 non-null   object 
 7   detailed_description     51157 non-null  object 
 8   about_the_game           51156 non-null  object 
 9   short_description        51160 non-null  object 
 10  fullgame                 0 non-null      float64
 11  supported_languages      51165 non-null  object 
 12  header_image             51199 non-null  object 
 13  website                  29258 non-null  object 
 14  pc_requirements       

In [27]:
# inspect downloaded steamspy data (saved above as steamspy_appid.csv)
pd.read_csv('../data/download/steamspy_appid.csv').head()

## Next Steps

Here we have defined and demonstrated the download process used to generate the data sets. This was completed separately but the full, raw data can be found [on Kaggle](https://kaggle.com/nikdavis/steam-store-raw).

We now have two tables of data with a variety of information about apps on the Steam store. From the Steam data it looks like there are some useful columns like `required_age`, `developers` and `genres` which we can eventually turn into features for analysis, and a `price_overview` column which may inform the success and sales of each game. The `owners` column of the SteamSpy data could be useful; however, the [margin of error](https://steamspy.com/about) means the data may not be accurate enough for meaningful analysis, so we'll have to see what we can manage after cleaning. Instead we may have to use the `positive` and `negative` ratings or average play-time to create our metrics. Note that the SteamSpy table downloaded here has 18 columns ending in `ccu` and no `tags` column; SteamSpy tags appear to overlap with the `categories` and `genres` columns in the Steam data, so if we add them later we may wish to merge these, or keep one over the other.
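
As a first idea of what a ratings-based metric could look like, here is a minimal sketch using the `positive` and `negative` column names from the SteamSpy table. The numbers are made up, and the `positive_ratio` column is just one possible definition, not a decision for the later analysis.

```python
import pandas as pd

# Made-up review counts using the SteamSpy column names shown earlier.
df = pd.DataFrame({
    "appid": [10, 20, 30],
    "positive": [90, 5, 0],
    "negative": [10, 5, 0],
})

total = df["positive"] + df["negative"]
# where() turns a zero total into NaN, so apps without reviews get NaN instead of a division error
df["positive_ratio"] = df["positive"] / total.where(total > 0)

print(df)
```

A plain ratio treats 5 out of 10 reviews the same as 5,000 out of 10,000, so the number of reviews will probably have to be taken into account as well.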

These are all decisions we'll come to in later stages of the project. With the data downloaded, this stage is now complete. In the next step, we'll take care of preparing and cleaning the data, readying a complete data set to use for analysis.