<a href="https://colab.research.google.com/github/Jcc329/Jessica_DATA606/blob/main/Raw_data/1.Accessing_Steam_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data 606 - Data Science Capstone
### Jessica Conroy

Project Stage: Data Acquisition

This notebook aims to access and request data from the Steam API and Steamspy API. 

### Accessing Steam Data Process

The first call to the steam API gets a list of all games currently or soon to be available on the Steam service.

This list is then converted into a pandas dataframe and cleaned by removing as many blank, test, or beta games as possible based on the name of the game. This is so that the final dataset doesn't contain new games that don't have enough review information, or 'games' that were created without any associated data (for example, by someone testing how to use the platform).

The final dataframe is then passed to a function define below. That function randomizes the dataframe using sklearn shuffle and then impliments 3 api calls for each appid in the list, adding the data for that game to a dictionary. The first API requests the general steam data, the second requests the top 20 reviews and associated review metadata, the third requests supplementary data available from the steamspy API. 

This loop runs for 6 hours and then ends. The goal being to collect a large random sample of games that I can then analyze while keeping in mind time limitations and rate limits.

The function then converts the final dictionary into a dataframe and returns that dataframe.

### Saving the data

Output data is saved as a CSV to my local machine.

### Primary Analysis

Basic descriptive statistics are run using describe and info.

Data Cleaning will occur in the next notebook of this series.

### Sources

Inspiration came from https://nik-davis.github.io/posts/2019/steam-data-collection/ 


In [1]:
!pip install steamspypi

Collecting steamspypi
  Downloading steamspypi-1.1.1-py3-none-any.whl (11 kB)
Installing collected packages: steamspypi
Successfully installed steamspypi-1.1.1


In [2]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time

import numpy as np
import pandas as pd
import requests
import steamspypi
from sklearn.utils import shuffle

pd.set_option("max_columns", 100)
pd.set_option('display.max_rows', None)

# Stage 1: Collect all Game IDs and Clean

In [3]:
#Get all game ids and names
#URL call found here: https://partner.steamgames.com/doc/webapi/ISteamApps
URL = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'

response = requests.get(url=URL)
json_data = response.json()
GameIDs = pd.DataFrame.from_dict(json_data['applist']['apps'])
#Clean up the dataframe to remove empty strings and test/demo games
GameIDs['name'] = GameIDs['name'].str.strip()
GameIDs['name'] = GameIDs['name'].str.lower()

#First I will remove all perfect matches that I found that do not lend themselves to the contain statement:
GameIDs = GameIDs[GameIDs['name'].isin(['','pieterw test app76 ( 216938 )','test2','test3', 'tidewoken public test', 
                                        'now testing: 407', 'test re(quietmansion1 special teaser)', '<h1>test</h1>', 
                                        'test', 'test project', 'steamvr performance test', 'testcontent', 'vrq test'
                                        ]) == False]

#Second I will remove the partial matches that don't inacurately remove names that keep (for example, the first line below removes all games that contain playtest in the 
#name of the game. This is okay because playtest is a very specific term. I included all names that specifically contain 'test' that I wanted to remove because just removing
#anything that contains test would remove things with 'contest' in the name, or 'testemate' and so on.)

GameIDs = GameIDs[GameIDs['name'].str.contains('playtest')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('closed testing')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('testapp')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains(' test ')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('betatest')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('test server')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('beta test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('tidewoken public test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('open test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('dev test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('- test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('feature test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('technical test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('early access testing')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('_test')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains(' demo')==False]
GameIDs = GameIDs[GameIDs['name'].str.contains('public test')==False]


In [4]:
GameIDs.shape

(126017, 2)

# Stage 2: Gather data for a Sample of the games

In [5]:
#Create function to collect data from APIs
def CollectSteamData(GameIDDF, timeLimit_min):
    '''
    input: dataframe containing IDs and names of games 
    output: dataframe containing all api data from a random sample of the games
    '''
    #Steam API 1: primary game data
    #https://stackoverflow.com/questions/69512319/steam-api-to-get-game-info
    #Steam API 2: Review data
    #https://partner.steamgames.com/doc/store/getreviews
    #Steamspy API: Supplemental usage and cost data
    # https://pypi.org/project/steamspypi/
    # https://steamspy.com/api.php
    
    #Randomize the data frame
    IDs = shuffle(GameIDDF)
    GameDict = {}
    starttime = time.time()
    for appid in IDs['appid']:
        try:
            gameURL = 'http://store.steampowered.com/api/appdetails?appids=' + str(appid)
            response = requests.get(url=gameURL)
            json_data = response.json()
            GameData = json_data[str(appid)]['data']
            time.sleep(1) # 1 second rate limit on API calls
            reviewURL = 'http://store.steampowered.com/appreviews/' + str(appid) + '?json=1'
            response = requests.get(url=reviewURL)
            json_data = response.json()
            ReviewScore = json_data['query_summary']['review_score']
            ReviewScoreDesc = json_data['query_summary']['review_score_desc']
            reviewText = ''
            for review in json_data['reviews']:
                reviewText = reviewText + review['review']
            
            ReviewDict = {'Review Score':ReviewScore, 'Review Score Description': ReviewScoreDesc, 'Top Reviews by Upvotes':reviewText}

            data_request = dict()
            data_request['request'] = 'appdetails'
            data_request['appid'] = str(appid)
            steamspydata = steamspypi.download(data_request)

            # Combine all three json dictionaries and convert to dataframe
            GameData.update(ReviewDict)
            GameData.update(steamspydata)
            time.sleep(1) # 1 second rate limit on API calls

        except: #games that do not have any associated data or other failed api calls
            time.sleep(1)
        endtime = time.time()
        elapsedtime = (endtime-starttime)/60
        if elapsedtime >= timeLimit_min: #If Greater than or equal to a set number of minutes, then end
            break
        #add all data for current app loop to GameDict
        GameDict.update({str(appid): GameData})
    #Convert to Dataframe
    GameDF = pd.DataFrame.from_dict(GameDict, orient='index')

    return GameDF

In [6]:
Hours = 18
minutes = Hours*60
Sample_Game_Data = CollectSteamData(GameIDs, minutes)

In [7]:
from google.colab import files
Sample_Game_Data.to_csv('RawSteamGameData2.csv') 
files.download('RawSteamGameData2.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Step 3: Explore raw data
I ended up with 7,309 games and 62 fields upon initial data extraction. 

Several fields were dropped due to high number of nulls while others were dropped because they represented duplicate data or weren't relevant. 

For the remaining columns, fields containing multiple data values were expanded into individual columns. 

Only about 250 games in the sample had metacritic scores. I will therefore work primarily with review scores for predicting success, which are based on weighted scores of users leaving reviews.

In [8]:
Sample_Game_Data.shape

(18424, 62)

In [9]:
Sample_Game_Data.columns

Index(['type', 'name', 'steam_appid', 'required_age', 'is_free',
       'detailed_description', 'about_the_game', 'short_description',
       'fullgame', 'header_image', 'website', 'pc_requirements',
       'mac_requirements', 'linux_requirements', 'developers', 'publishers',
       'price_overview', 'packages', 'package_groups', 'platforms',
       'screenshots', 'release_date', 'support_info', 'background',
       'content_descriptors', 'supported_languages', 'categories', 'genres',
       'movies', 'achievements', 'Review Score', 'Review Score Description',
       'Top Reviews by Upvotes', 'appid', 'developer', 'publisher',
       'score_rank', 'positive', 'negative', 'userscore', 'owners',
       'average_forever', 'average_2weeks', 'median_forever', 'median_2weeks',
       'price', 'initialprice', 'discount', 'ccu', 'languages', 'genre',
       'tags', 'legal_notice', 'demos', 'controller_support', 'dlc', 'reviews',
       'metacritic', 'recommendations', 'ext_user_account_notice'

In [10]:
Sample_Game_Data.describe(include='all')

Unnamed: 0,type,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,fullgame,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,screenshots,release_date,support_info,background,content_descriptors,supported_languages,categories,genres,movies,achievements,Review Score,Review Score Description,Top Reviews by Upvotes,appid,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,ccu,languages,genre,tags,legal_notice,demos,controller_support,dlc,reviews,metacritic,recommendations,ext_user_account_notice,drm_notice,alternate_appid
count,18424,18424.0,18424.0,18424.0,18424,18424.0,18424.0,18424.0,6776,18424,10477,18424,18424,18424,17573,18424,13590,13893,18424,18424,17598,18424,18424,18424.0,18424,17719,17582,17235,11633,5113,18422.0,18422,18422.0,18422.0,18422.0,18422.0,18422.0,18422.0,18422.0,18422.0,18422,18422.0,18422.0,18422.0,18422.0,17007.0,17007.0,17007.0,18422.0,17007,18422.0,18422,7171,1084,4825,1590,1560,656,2171,189,116,2.0
unique,10,16099.0,,10.0,2,15275.0,15274.0,15564.0,2733,16094,6368,12285,3221,2084,10041,8366,740,12121,12024,6,15404,3511,9752,15409.0,1587,2951,2511,1066,10178,4396,,19,8834.0,,9091.0,7500.0,5.0,,,,12,,,,,209.0,96.0,50.0,,2335,972.0,8213,4428,945,1,1399,1351,557,1087,134,37,2.0
top,game,,,0.0,False,,,,"{'appid': '252690', 'name': 'Fantasy Grounds C...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.fantasygrounds.com,[],[],[],"[SmiteWorks USA, LLC]",[],"{'currency': 'TWD', 'initial': 2200, 'final': ...",[130890],[],"{'windows': True, 'mac': False, 'linux': False}","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",English,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 256742437, 'name': 'Sebino Lake - Trai...",{'total': 0},,No user reviews,,,,,,,,,"0 .. 20,000",,,,,0.0,0.0,0.0,,English,,[],© 2015 UBISOFT ENTERTAINMENT. ALL RIGHTS RESER...,"[{'appid': 1742710, 'description': ''}]",full,[498590],“Block Busters is the next Rocket League in th...,"{'score': 87, 'url': 'https://www.metacritic.c...",{'total': 108},Uplay (Supports Linking to Steam Account),Denuvo Anti-tamper<br>5 different PC within a ...,37920.0
freq,11201,16.0,,18096.0,16596,813.0,813.0,351.0,277,7,277,1017,11152,12743,347,2469,1544,9,4700,12827,5,254,1104,821.0,15834,4569,3073,893,5,109,,8229,8232.0,,2161.0,3765.0,18405.0,,,,15895,,,,,3343.0,3343.0,15922.0,,8202,2502.0,7406,189,3,4825,4,4,4,16,11,43,1.0
mean,,,1014737.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.669308,,,1014816.0,,,,1259.372,177.808436,0.065737,,62.788459,3.639507,55.019542,3.683476,,,,1433.175,,,,,,,,,,,,,
std,,,498115.7,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.897592,,,497995.8,,,,47454.22,6523.135592,2.242255,,580.320889,62.17416,513.002715,73.777249,,,,182665.4,,,,,,,,,,,,,
min,,,10.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,10.0,,,,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,,,,0.0,,,,,,,,,,,,,
25%,,,603727.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,603810.0,,,,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,,,,0.0,,,,,,,,,,,,,
50%,,,1011725.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,1011725.0,,,,2.0,0.0,0.0,,0.0,0.0,0.0,0.0,,,,0.0,,,,,,,,,,,,,
75%,,,1438870.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,1438820.0,,,,25.0,8.0,0.0,,0.0,0.0,0.0,0.0,,,,0.0,,,,,,,,,,,,,


In [11]:
GameData.isnull().sum()

NameError: ignored

In [None]:
GameData.info()