# Steam Data Analysis

# Data Cleaning

## Introduction

The data-gathering step left us with two separate CSV files:

* `steamspy_appid.csv`: Full request to Steam Spy, for all available IDs (9 February 2022)
* `steam_app_data.csv`: Full App Info from the Steam Storefront related to the previous IDs (8 February 2022)

Almost all the data necessary for the analysis should be in `steam_app_data.csv`. However, `steamspy_appid.csv` contains additional information that might be very useful:

* Positive Reviews (count)
* Negative Reviews (count)
* Average and Medians of Concurrent Players (several columns)
* Peak Concurrent Players (ccu column)
* Owners estimate, by using Steam Spy algorithm (wide ranges)
* Tags (list)

That is why we should consolidate both datasets into a single dataframe for our analysis. After all, every row in either file describes a unique app on Steam. We also need to check carefully for missing or inconsistent data: the Steam Storefront has been around for many years, so older apps may carry different information and newer apps may carry less. It is also quite likely that some columns hold duplicated values (for example, the same developer listed under several names).

Note that we also downloaded the metadata for the Steam reviews separately and stored it in `steamreviews_data.csv`. We did this after noticing discrepancies between the review totals in the Steam Store and the numbers in Steam Spy. This file has the exact counts and the rating shown on Steam, so it will probably serve as the baseline for this information.

We will probably need to do the cleaning in two big steps: first any cleaning needed before merging the two dataframes, and then a final cleaning after inspecting the joined dataset. This makes sense because we will have more complete data after merging.

Let us proceed with cleaning and joining, which will set us up for a good analysis.

In [1]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time
import re

# third-party imports
import numpy as np
import seaborn as sns
import pandas as pd
import requests
import plotly.express as px
import matplotlib.pyplot as plt

In [2]:
spy = pd.read_csv("../data/download/steamspy_data.csv")
store = pd.read_csv("../data/download/steam_app_data.csv")


## Assessment - Pre-merge

Later, we will do a full Exploratory Data Analysis focused on gaining insights. Here, however, we focus only on an assessment to detect data quality and tidiness issues.

We will look at any possible issues within the data, and make a list of necessary changes.

After that, we will define some functions to make these changes, then clean and join both datasets. That way, if we need to gather and clean the data again later, we can rerun the whole process with few if any changes.

Then we will clean the data again after joining, as some columns will be merged.

Let's start with a summary of how much data we have in each dataset, and which columns are available.

In [3]:
spy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63968 entries, 0 to 63967
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            63968 non-null  int64  
 1   name             63865 non-null  object 
 2   developer        56705 non-null  object 
 3   publisher        56740 non-null  object 
 4   score_rank       48 non-null     float64
 5   positive         63968 non-null  int64  
 6   negative         63968 non-null  int64  
 7   userscore        63968 non-null  int64  
 8   owners           63968 non-null  object 
 9   average_forever  63968 non-null  int64  
 10  average_2weeks   63968 non-null  int64  
 11  median_forever   63968 non-null  int64  
 12  median_2weeks    63968 non-null  int64  
 13  price            56854 non-null  float64
 14  initialprice     56856 non-null  float64
 15  discount         56856 non-null  float64
 16  languages        56814 non-null  object 
 17  genre       

In [4]:
store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66655 non-null  object 
 1   name                     66803 non-null  object 
 2   steam_appid              66803 non-null  int64  
 3   required_age             66655 non-null  object 
 4   is_free                  66655 non-null  object 
 5   controller_support       14422 non-null  object 
 6   dlc                      9342 non-null   object 
 7   detailed_description     66587 non-null  object 
 8   about_the_game           66585 non-null  object 
 9   short_description        66593 non-null  object 
 10  fullgame                 0 non-null      float64
 11  supported_languages      66599 non-null  object 
 12  header_image             66655 non-null  object 
 13  website                  36119 non-null  object 
 14  pc_requirements       

Out of roughly 65,000 unique app ids, Steam Spy has non-null data for almost all of them, while the Steam store data is missing information in many columns.

However, more than half of the store columns have over 65k non-null values. Even in those cases, we have to check why information is missing, as there may be valid reasons. One approach would be to simply keep those apps out of the analysis (e.g. if they are very marginal games with few or no reviews, have a lot of missing data, or are not games at all), but it is worth checking what they are first.
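To put numbers on the missing data described above, a small helper can list the null count and null share per column. This is only a sketch: `missing_summary` is a hypothetical helper that is not part of the notebook, and the toy frame below uses made-up values in place of `store`.

```python
import pandas as pd

def missing_summary(df):
    """Return the null count and null percentage per column, worst columns first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy frame with made-up values, standing in for `store`.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "website": [None, "https://example.com", None, None],
    "dlc": [None, None, None, "[1]"],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(store)` and `missing_summary(spy)` would give the same view for the real data.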

There are clearly columns in the store data with "optional" information, such as `dlc`, `controller_support`, `demos`, `reviews`, `metacritic`, `recommendations`, `achievements`, `drm_notice`, `ext_user_account_notice`. Some of these we will have to check: they are probably null when the game lacks that feature altogether, and in the case of external reviews/recommendations/metacritic, the game may have them without their being included in the Steam storefront (for instance, if the game had mostly negative reviews).

Another issue, which might also be an opportunity, is the overlap between Steam Spy and Steam Store data: names, developers, publishers and price data are available in both datasets, and at first glance Steam Store seems to have more data. We should check and consolidate them before further cleaning.

`store` vs `spy`

`name` vs `name`

`price_overview` vs `price`, `initialprice`, `discount`

`genres` vs `genre`

`developers` vs `developer`

`publishers` vs `publisher`

`supported_languages` vs `languages`
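
One way to test the "Steam Store has more data" impression is to count nulls on each side for every overlapping pair. The sketch below assumes the column names listed above; `compare_nulls` is a hypothetical helper, the toy frames hold made-up values, and the price columns are left out on the assumption that the two sources store prices in different shapes.

```python
import pandas as pd

# (store column, spy column) pairs taken from the comparison above.
OVERLAP_PAIRS = [
    ("name", "name"),
    ("genres", "genre"),
    ("developers", "developer"),
    ("publishers", "publisher"),
    ("supported_languages", "languages"),
]

def compare_nulls(store_df, spy_df, pairs=OVERLAP_PAIRS):
    """Count nulls on each side for every pair whose columns exist in both frames."""
    rows = []
    for store_col, spy_col in pairs:
        if store_col in store_df.columns and spy_col in spy_df.columns:
            rows.append({
                "store_column": store_col,
                "store_nulls": int(store_df[store_col].isnull().sum()),
                "spy_column": spy_col,
                "spy_nulls": int(spy_df[spy_col].isnull().sum()),
            })
    return pd.DataFrame(rows)

# Toy frames with made-up values, standing in for `store` and `spy`.
toy_store = pd.DataFrame({"name": ["A", "B"], "developers": ["['X']", None]})
toy_spy = pd.DataFrame({"name": ["A", None], "developer": ["X", "Y"]})
null_report = compare_nulls(toy_store, toy_spy)
print(null_report)
```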

Let's start by checking how many games are really missing important information, and decide whether we should just drop them.


### Unique IDs

We just said the app ids are unique, but we should check whether our dataframes contain duplicated app ids. We gathered the data iteratively, and some ids may redirect to a new id when requested. We observed this when trying to open the Steam Store page directly for some of the "missing" ids. For instance, different versions of Guild Wars 2 all lead to a single store page on Steam, as the old versions no longer exist.

In [5]:
spy["appid"].duplicated().sum()

0

No duplicate ids for Steam Spy, which is great news! This fits the reasoning above: Steam Spy just keeps the old records as well.

In [6]:
store["steam_appid"].duplicated().sum()

0

In [7]:
store.duplicated().sum()

0

Well, it seems we were successful in rewriting the request functions to get only new ids and purge any duplicates! We can move on to cleaning the columns that hold meaningful data.
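
Since the two files have different row counts, it may also help to see how the id sets overlap before merging. The following is a minimal sketch with toy ids; `compare_ids` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd

def compare_ids(spy_ids, store_ids):
    """Count ids shared by both sources and ids found in only one of them."""
    spy_idx = pd.Index(spy_ids).unique()
    store_idx = pd.Index(store_ids).unique()
    return {
        "in_both": len(spy_idx.intersection(store_idx)),
        "only_spy": len(spy_idx.difference(store_idx)),
        "only_store": len(store_idx.difference(spy_idx)),
    }

# Toy ids with made-up values, standing in for spy["appid"] and store["steam_appid"].
toy_spy_ids = pd.Series([10, 20, 30, 40])
toy_store_ids = pd.Series([20, 30, 40, 50, 60])
overlap = compare_ids(toy_spy_ids, toy_store_ids)
print(overlap)
```

On the real data this would be called as `compare_ids(spy["appid"], store["steam_appid"])`.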

### Name

In [8]:
spy[spy["name"].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 1094 to 63967
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            103 non-null    int64  
 1   name             0 non-null      object 
 2   developer        7 non-null      object 
 3   publisher        7 non-null      object 
 4   score_rank       0 non-null      float64
 5   positive         103 non-null    int64  
 6   negative         103 non-null    int64  
 7   userscore        103 non-null    int64  
 8   owners           103 non-null    object 
 9   average_forever  103 non-null    int64  
 10  average_2weeks   103 non-null    int64  
 11  median_forever   103 non-null    int64  
 12  median_2weeks    103 non-null    int64  
 13  price            11 non-null     float64
 14  initialprice     11 non-null     float64
 15  discount         11 non-null     float64
 16  languages        9 non-null      object 
 17  genre      

In [9]:
store[store["name"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


In [10]:
store[store["name"].isnull()]["steam_appid"]

Series([], Name: steam_appid, dtype: int64)

In [11]:
spy[spy["name"].isnull()]["appid"]

1094       63970
3918      315210
6892      396420
9074      460250
13110     576960
          ...   
63961    1899160
63962    1899200
63963    1899430
63965    1900820
63967    1902210
Name: appid, Length: 103, dtype: int64

#### Name overview
Analyzing these briefly (viewing them manually on Steam), we see two root causes for blank names in the Steam Spy data: either the game uses some sort of emoji in its name, or the game has been deleted from Steam. Let's cross-check them against the store data.

In [12]:
store[store["steam_appid"].isin(spy[spy["name"].isnull()]["appid"].values)]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4835,game,🐰RabbiruN🐰,806160,0,False,,[835480],Rabbirun is a casual endless running game wher...,Rabbirun is a casual endless running game wher...,Rabbirun is a casual endless running game wher...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256711130, 'name': 'Rabbirun Trailer',...",,"{'total': 156, 'highlighted': [{'name': '', 'p...","{'coming_soon': False, 'date': '27 Mar, 2018'}","{'url': 'http://palenogames.ru', 'email': 'con...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
4837,game,🚀👾Absolute Blue 👾🚀,806220,0,False,,[810280],Note: I am currently working on a new ambitiou...,Note: I am currently working on a new ambitiou...,Sidescrolling Shoot’em’up that will transport ...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256708652, 'name': 'Gameplay Trailer',...",,,"{'coming_soon': False, 'date': '21 Apr, 2018'}","{'url': 'http://intermediaware.com', 'email': ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
9800,game,🚀 Human Rocket Person,965340,0,False,full,,"<img src=""https://cdn.akamai.steamstatic.com/s...","<img src=""https://cdn.akamai.steamstatic.com/s...",Human Rocket Person is an absurd &amp; fun pla...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256733564, 'name': 'Human Rocket Perso...",,"{'total': 22, 'highlighted': [{'name': 'Apple ...","{'coming_soon': False, 'date': '14 Nov, 2018'}","{'url': '', 'email': 'feedback@2ndstudio.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [1, 5], 'notes': 'This game contains f..."
13052,game,👾 Foreign Frugglers,1071920,0,False,full,,The Fruggle is real!<br />\r\nEight brave surv...,The Fruggle is real!<br />\r\nEight brave surv...,The Fruggle is real! Four brave survivors need...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256776726, 'name': 'Gameplay Trailer',...",,"{'total': 16, 'highlighted': [{'name': 'Ghetto...","{'coming_soon': False, 'date': '26 Jun, 2019'}","{'url': '', 'email': 'games.ultimo@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
39655,game,🔴 Circles,460250,0.0,False,,,"Circles is a unique, intuitive puzzle game whe...","Circles is a unique, intuitive puzzle game whe...",Circles is an abstract puzzle game that takes ...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256662578, 'name': 'Circles Trailer', ...",,"{'total': 8, 'highlighted': [{'name': '•', 'pa...","{'coming_soon': False, 'date': '17 Feb, 2017'}","{'url': '', 'email': 'support@illusivegames.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63957,game,Puzzle Mix,1899160,0.0,False,,,We offer you a puzzle game. There are 70 puzzl...,We offer you a puzzle game. There are 70 puzzl...,We offer you a puzzle game. The meaning of the...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256872607, 'name': 'robot87', 'thumbna...",,,"{'coming_soon': True, 'date': '18 Feb, 2022'}","{'url': '', 'email': 'robot72@yandex.ru'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
63958,game,Bombardier,1899200,0.0,False,,,Bombardier is a 2 - 4 local multiplayer game o...,Bombardier is a 2 - 4 local multiplayer game o...,Bombardier is a quick and chaotic online multi...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256872169, 'name': 'ClusterTruck9000',...",,,"{'coming_soon': True, 'date': 'Oct 2022'}","{'url': '', 'email': 'info@studiosquidinc.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
63959,game,VEREDA - Mystery Escape Room Adventure,1899430,0.0,False,,,"<h2 class=""bb_tag""> VEREDA - Mystery Escape Ro...","<h2 class=""bb_tag""> VEREDA - Mystery Escape Ro...",Vereda is a 3d escape room puzzle adventure. P...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256872306, 'name': 'Vereda Trailer', '...",,,"{'coming_soon': True, 'date': '9 Mar, 2022'}","{'url': 'https://www.m9games.co.uk/Contact/', ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
63961,game,Mighty Castles,1900820,0.0,False,,,Mighty Castles is a Tower Defense game with RP...,Mighty Castles is a Tower Defense game with RP...,Mighty Castles is a Tower Defense game with RP...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '3', 'description': 'RPG'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256872374, 'name': '1', 'thumbnail': '...",,,"{'coming_soon': True, 'date': '21 Feb, 2022'}","{'url': '', 'email': '21eremin21@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
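
To quantify the cross-check, the Steam Spy ids with a null name can be split by whether the store data has a row for them. This is a sketch under the column names used so far (`appid`, `steam_appid`); `split_missing_names` is a hypothetical helper and the toy frames contain made-up values.

```python
import pandas as pd

def split_missing_names(spy_df, store_df):
    """Split Steam Spy ids with a null name by whether the store data has a row for them."""
    missing_ids = spy_df.loc[spy_df["name"].isnull(), "appid"]
    in_store = missing_ids.isin(store_df["steam_appid"])
    return missing_ids[in_store].tolist(), missing_ids[~in_store].tolist()

# Toy frames with made-up values, standing in for `spy` and `store`.
toy_spy = pd.DataFrame({"appid": [1, 2, 3, 4], "name": ["Game", None, None, None]})
toy_store = pd.DataFrame({"steam_appid": [1, 2, 3], "name": ["Game", "Rabbit", "Puzzle"]})
recoverable, unrecoverable = split_missing_names(toy_spy, toy_store)
print(recoverable, unrecoverable)
```

Ids in the first list can take their name from the store data; ids in the second list have no store record to fall back on.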


#### Are there any duplicate names?

In [13]:
store["name"].value_counts()

Alone                        5
Lost                         4
Bounce                       4
Fireflies                    4
Escape                       4
                            ..
Fire Guild                   1
The Tragedy of little Joy    1
探灵警探                         1
Lucen                        1
Profectus                    1
Name: name, Length: 66414, dtype: int64

In [14]:
store[store["name"]=="Alone"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
5823,game,Alone,837930,0.0,False,,,"<h2 class=""bb_tag""><strong>The Sun Falls Behin...","<h2 class=""bb_tag""><strong>The Sun Falls Behin...",Alone is a Pixelated Survival Game set in the ...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256712913, 'name': 'Alone - the Only S...",,"{'total': 20, 'highlighted': [{'name': 'Frozen...","{'coming_soon': False, 'date': '1 May, 2018'}",{'url': 'https://killedpixel.wixsite.com/home'...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
6981,game,Alone,871870,0.0,False,,,You wake up finding yourself alone in your cam...,You wake up finding yourself alone in your cam...,Alone is a 3D Puzzle Sidescroller with heavy f...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256717925, 'name': 'Alone Gameplay Tra...",,"{'total': 7, 'highlighted': [{'name': 'First o...","{'coming_soon': False, 'date': '21 Jun, 2018'}","{'url': '', 'email': 'avasion.alone@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
27986,game,Alone,1640090,0.0,False,full,,Alone is a <strong>precision platformer</stron...,Alone is a <strong>precision platformer</stron...,"An epic, unforgiving platformer in which you n...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256844155, 'name': 'Alone Trailer', 't...",,"{'total': 7, 'highlighted': [{'name': 'You los...","{'coming_soon': False, 'date': '29 Jul, 2021'}","{'url': '', 'email': 'Contact@AdamJN.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
30153,game,Alone,1789460,0.0,False,,,Alone is all about surviving in a deserted isl...,Alone is all about surviving in a deserted isl...,An open world survival crafting game based mor...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256859195, 'name': 'Alone Trailer', 't...",,,"{'coming_soon': False, 'date': '11 Nov, 2021'}","{'url': '', 'email': 'achillesgamestudio@gmail...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
59082,game,Alone,1686490,0.0,False,,,"<img src=""https://cdn.akamai.steamstatic.com/s...","<img src=""https://cdn.akamai.steamstatic.com/s...",Alone is a puzzle platformer where you control...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256847947, 'name': 'Alone - Kickstarte...",,,"{'coming_soon': True, 'date': '15 Sep, 2022'}","{'url': '', 'email': 'connorwarrington27@gmail...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


In [15]:
store[store["name"]=="['']"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


We have some duplicate names, which seems a bit odd, but they really are different games.

Just in case, let's also check for some weird names.

In [16]:
store[store["name"].apply(lambda x: len(x) < 6)]["name"].value_counts()

Alone    5
Lost     4
Dodge    3
Helix    3
Surge    3
        ..
就这消消乐    1
GNOG     1
Gogoo    1
大小串串烧    1
灵异AE     1
Name: name, Length: 2551, dtype: int64

In [17]:
store[store["name"].isin(["none","None","na","Na","False","false",0,"","invalid","Invalid"])]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
65991,game,none,339860,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
66127,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,,"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
66173,game,none,398970,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"


#### Name cleaning decision
All games in the store data have valid names, except the three detected above, which we should clearly remove. We keep the rest of the store `name` column as is.

* Delete the rows with name = "none"
* When merging both dataframes, regarding column `name`, keep `name` from Steam Storefront

In [18]:
def cleanName(store):
    # Drop the rows whose name is literally the string "none" (the three found above).
    store = store[store["name"] != "none"].copy()
    return store

### Developers

Unlike publishers, where the store dataset has no null values, developers has a few missing entries. Let's check them just in case.

In [19]:
store[store["developers"].isnull()&~store["recommendations"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
2360,game,A Hat in Time - Modding Tools,734880,0.0,False,,,,,,...,,,,,{'total': 126},,"{'coming_soon': False, 'date': '13 Oct, 2017'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
14129,game,妄想破绽 Broken Delusion,1108320,0.0,False,,[1284010],"<img src=""https://media.st.dl.pinyuncloud.com/...","<img src=""https://media.st.dl.pinyuncloud.com/...",《妄想破绽》是一款文字冒险（AVG）游戏，数十万字精彩原创剧情，将带给大家一段近未来架空科技...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256766448, 'name': '概念PV', 'thumbnail'...",{'total': 3788},"{'total': 27, 'highlighted': [{'name': 'Broken...","{'coming_soon': False, 'date': '27 Nov, 2019'}","{'url': 'http://game.bilibili.com/kf/', 'email...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
31140,game,Tycoon City: New York,9730,0.0,False,,,<h1>Special Offer</h1><p>Officially Licensed T...,Here's your chance to make it big in the Big A...,Here's your chance to make it big in the Big A...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 169},,"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
31180,game,Crash Time 2,11390,0.0,False,,,Solve exciting criminal cases on the mean stre...,Solve exciting criminal cases on the mean stre...,Crash Time 2 is an open-world combat racing ga...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256810412, 'name': 'Crash Time 2 Steam...",{'total': 1079},,"{'coming_soon': False, 'date': '27 Aug, 2009'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
31556,game,18 Wheels of Steel: Extreme Trucker,33730,0.0,False,,,You ‘da Boss! Move it better and faster while ...,You ‘da Boss! Move it better and faster while ...,You ‘da Boss! Move it better and faster while ...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 109},,"{'coming_soon': False, 'date': '23 Sep, 2009'}","{'url': 'https://playhardgames.net/contact/', ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
32499,game,Patterns,218980,0.0,False,,,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2028932, 'name': 'Patterns Trailer 2',...",{'total': 108},,"{'coming_soon': False, 'date': ''}",{'url': 'http://www.buildpatterns.com/#!commun...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
32625,game,Rise of Venice,227020,0.0,False,,[260860],<h1>A local Venetian merchant tells stories of...,Venice was at the peak of its power during the...,"Venice: As a young man striving for success, p...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2029090, 'name': 'Rise of Venice Gamep...",{'total': 288},"{'total': 56, 'highlighted': [{'name': 'Every ...","{'coming_soon': False, 'date': '27 Sep, 2013'}","{'url': 'forum.kalypsomedia.com', 'email': 'su...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
32653,game,Teenage Mutant Ninja Turtles™: Out of the Shadows,228560,0.0,False,,,Teenage Mutant Ninja Turtles: Out of the Shado...,Teenage Mutant Ninja Turtles: Out of the Shado...,Teenage Mutant Ninja Turtles: Out of the Shado...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 632},"{'total': 30, 'highlighted': [{'name': 'The Sw...","{'coming_soon': False, 'date': '28 Aug, 2013'}","{'url': 'https://support.activision.com/', 'em...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
32846,game,Rain Blood Chronicles: Mirage,240660,0.0,False,,,"Told as a “Wuxia” story, which literally means...","Told as a “Wuxia” story, which literally means...","Told as a “Wuxia” story, which literally means...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2029042, 'name': 'Mirage- Featurette',...",{'total': 129},"{'total': 102, 'highlighted': [{'name': 'Rich ...","{'coming_soon': False, 'date': '11 Nov, 2013'}","{'url': '', 'email': 'mirage@origo-games.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
32870,game,Rayman® Legends,242550,0.0,False,,,"Michel Ancel, the celebrated creator of Rayman...","Michel Ancel, the celebrated creator of Rayman...","Michel Ancel, the celebrated creator of Rayman...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2029252, 'name': 'Rayman Legends Launc...",{'total': 4983},,"{'coming_soon': False, 'date': '29 Aug, 2013'}","{'url': 'http://www.support.ubi.com', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


There are around 160 entries without developers. Some are games that are no longer available, but if we filter by recommendations a few valid ones remain. These are mostly retro games to which some publisher holds the rights, while the developer is intentionally unlisted (it might not exist anymore, or they just do not care).
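
Filtering by recommendations becomes more useful once the count is numeric. In the output above the column shows values such as `{'total': 126}`; assuming these were loaded from the CSV as strings, they can be parsed with `ast.literal_eval`. The helper name `recommendation_total` is hypothetical and the toy series is made up.

```python
import ast
import pandas as pd

def recommendation_total(value):
    """Return the 'total' from a value like "{'total': 126}", or 0 when it is missing."""
    if pd.isnull(value):
        return 0
    # Assumption: the CSV stores the dict as a string; pass real dicts through as-is.
    parsed = ast.literal_eval(value) if isinstance(value, str) else value
    return int(parsed.get("total", 0))

# Toy series with made-up values, standing in for store["recommendations"].
toy = pd.Series(["{'total': 126}", None, "{'total': 3788}"])
totals = toy.apply(recommendation_total)
print(totals.tolist())
```

With a numeric column, entries without developers could then be filtered by a minimum number of recommendations instead of just non-null values.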

In [20]:
store["developers"].value_counts().head(60)

['Choice of Games']                     144
['Laush Dmitriy Sergeevich']            113
['Creobit']                             102
['KOEI TECMO GAMES CO., LTD.']           93
['Sokpop Collective']                    90
['Boogygames Studios']                   88
['Hosted Games']                         85
['Elephant Games']                       75
['Blender Games']                        71
['SEGA']                                 67
['RewindApp']                            62
['Ripknot Systems']                      62
['Somer Games']                          60
['AMAX Interactive']                     58
['MAGIX Software GmbH']                  57
['ImperiumGame']                         57
['William at Oxford']                    55
['Eipix Entertainment']                  54
['Nikita "Ghost_RUS"']                   51
['玫瑰工作室']                                46
['Individual Software']                  44
['Snkl Studio']                          43
['HotFoodGames']                

In [21]:
store[store["developers"]=="['']"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


You will see in the publishers section why I checked that: `['']` seems to be a placeholder in Steam for mandatory values that were never filled in or have been deleted.

In [22]:
spy[spy["appid"].isin(store[store["developers"].isnull()]["steam_appid"].values)]["developer"].value_counts().head(60)

dtp – young entertainment Gmbh &amp; Co. KG         5
Softstar Technology (Beijing) Co.,Ltd               3
CS-REPORTERS.INC                                    2
ArenaNet®                                           2
AIDIS                                               1
Ubisoft                                             1
Obsidian Entertainment                              1
NEXON Korea Corp.                                   1
Atomic Jelly                                        1
MegaFun Games Ltd.                                  1
MCGame                                              1
KingsIsle Entertainment                             1
IPBuilders                                          1
Saber Interactive                                   1
Climax Studios                                      1
Rhaon Entertainment                                 1
BitLight                                            1
Gravity Interactive                                 1
Gravity, Inc.               

#### Developers: Cleaning Decision

* First we will merge store and spy, keeping store data unless we have a NaN
* Since this process will be the same for other duplicate columns, we will do it first!

* Then we will copy the publisher name into the developer, for the cases without developers. Games with other missing information we will take care of afterwards.
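
The second step, copying the publisher into a missing developer, can be sketched with `Series.fillna`, which aligns the two columns on the index. The toy frame below uses made-up values; the real columns hold list-like strings such as `['Choice of Games']`.

```python
import pandas as pd

# Toy frame with made-up values, indexed by a hypothetical app id.
toy = pd.DataFrame(
    {
        "developers": ["['Studio A']", None, "['Studio C']"],
        "publishers": ["['Pub A']", "['Pub B']", "['Pub C']"],
    },
    index=pd.Index([10, 20, 30], name="id"),
)

# Only the missing developer (id 20) is replaced by its publisher.
toy["developers"] = toy["developers"].fillna(toy["publishers"])
print(toy)
```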

In [23]:
# To simplify cleaning, let's change appid and steam_appid to.. just id. And in fact, let's make that our index!
# Since we will be using df.fillna(df2) later, it would be useful to change similar column names so they are the same.
def renameIDs(store,spy):
    store = store.rename(columns={"steam_appid":"id"})
    store = store.set_index("id")
    spy = spy.rename(columns={"appid":"id", "genre":"genres", "developer":"developers", "publisher":"publishers",
                              "languages":"supported_languages"})
    spy = spy.set_index("id")
    return store, spy

In [24]:
store, spy = renameIDs(store,spy)

In [25]:
# In this function, the index from both df must be the same - the old appid in our case.
# Also, the column names where we will be getting our values should also be the same.
# Lastly, ideally we would like the values to be formatted in the same way - but we can also check later.
def updateFromAlternateSource(maindf,subdf):
    df = maindf.copy()
    df = df.fillna(subdf)
    return df

Now we could actually run this function and update the developers from Steam Spy. But doing this will also fill in any missing values in `genres` and `supported_languages` (Steam Spy's `languages`) with the Steam Spy values. This might be a problem, since the two sources format them differently.

We will have to take this into account when formatting these two columns, as the information from Steam Spy will be added for the NaN.
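To make the side effect concrete, here is a small sketch of how `DataFrame.fillna` behaves when it is given another dataframe. The two toy frames and all their values are invented for illustration; only the mechanism matters: gaps are filled cell by cell where index and column names match, and columns that exist only in the secondary frame are ignored.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for store and spy; every value here is made up.
store_toy = pd.DataFrame(
    {
        "developers": ["Studio A", np.nan],
        "genres": [np.nan, "[{'id': '1', 'description': 'Action'}]"],
    },
    index=pd.Index([10, 20], name="id"),
)
spy_toy = pd.DataFrame(
    {
        "developers": ["Studio A Ltd", "Studio B"],
        "genres": ["Indie", "Action"],
        "ccu": [5, 7],  # exists only in the secondary frame
    },
    index=pd.Index([10, 20], name="id"),
)

# Only the NaN cells of store_toy are replaced; existing values win.
filled = store_toy.fillna(spy_toy)
print(filled)
```

Note how the filled `genres` cell for id 10 arrives in the Steam Spy format, while id 20 keeps the Steam Store format. That is exactly the mix we will have to handle when formatting these columns.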

Let's also define a new function that prints the name and store link of every app in a given subset, for troubleshooting.

In [26]:
def getSteamLink(df):
    for item in df.index:
        print(df.loc[item]["name"]+" https://store.steampowered.com/app/"+str(item))

### Publishers

It seemed that the publishers were fine, as there are no NaNs. However, there are a lot of blank names. Publisher is probably a mandatory metadata field on Steam, and some apps have managed to leave it effectively empty anyway.

Let's look at them. If there are valid ones (i.e. ones that have a developer), we can consider them self-published and do the same as before, this time copying the developer name into the publisher.

In [27]:
store[store["publishers"]==""]

id,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


That's not right... While doing the cleaning process, I detected some empty strings... right?

In [28]:
store["publishers"].value_counts()

['Big Fish Games']    416
['']                  377
['SEGA']              168
['8floor']            168
['Strategy First']    155
                     ... 
['SOUP STUDIOS']        1
['地心游戏']                1
['Cupfox Studio']       1
['mchernykh']           1
['Juno Morrow']         1
Name: publishers, Length: 36353, dtype: int64

In [29]:
(store["publishers"]=="['']").sum()

377

In [30]:
store[store["publishers"]=="['']"]

id,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
312860,game,Defense Grid 2: A Matter of Endurance,0.0,False,,,,,,,...,,,,,,,"{'coming_soon': False, 'date': '22 Sep, 2014'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
384930,game,Pilot Crusader,0.0,False,,,Pilot Crusader is a retro shoot-em-up with a s...,Pilot Crusader is a retro shoot-em-up with a s...,Take control of a powerful spaceship and wage ...,,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2040586, 'name': 'Pilot Crusader Trail...",,,"{'coming_soon': False, 'date': '10 Jul, 2015'}","{'url': 'http://radlabgaming.com', 'email': 's...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
636560,game,Runewards: Strategy Card Game,0.0,True,,,Runewards is a free-to-play competitive strate...,Runewards is a free-to-play competitive strate...,Runewards is a free-to-play competitive strate...,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '37', 'description': 'Free to Play'}, ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256686468, 'name': 'Runewards - Game T...",,,"{'coming_soon': False, 'date': '13 Feb, 2018'}",{'url': 'https://forum.runewards.com/index.php...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
666870,game,Perfect Crime,18.0,False,,"[1227610, 1327410]",<h1>English Supported Completely!</h1><p><img ...,"<img src=""https://cdn.akamai.steamstatic.com/s...","With superb script and calm narrative style, P...",,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256760057, 'name': '宣传视频', 'thumbnail'...",{'total': 325},"{'total': 40, 'highlighted': [{'name': 'People...","{'coming_soon': False, 'date': '1 Oct, 2020'}","{'url': '', 'email': '570505836@qq.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': None}"
688150,game,JUMPER : SPEEDRUN,0,True,,,<strong>JUMPER : SPEEDRUN</strong> is an FPS P...,<strong>JUMPER : SPEEDRUN</strong> is an FPS P...,JUMPER : SPEEDRUN is a new Type of FPS Platfor...,,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256691573, 'name': 'JUMPER : SPEEDRUN'...",,,"{'coming_soon': False, 'date': '28 Aug, 2017'}","{'url': '', 'email': 'aaent.games@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
616090,game,龙魂时刻,0.0,False,,,"《龙魂时刻》是一款3D无锁定动作网游,拥有东西方文化碰撞的独特世界观/精彩刺激的PVP玩法和...","《龙魂时刻》是一款3D无锁定动作网游,拥有东西方文化碰撞的独特世界观/精彩刺激的PVP玩法和...",全新3D无锁定动作网游，独具特色的主角时刻，充满挑战的团队副本，全面革新的3D动作竞技，流畅...,,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256688609, 'name': 'x5-newCJ626', 'thu...",,,"{'coming_soon': False, 'date': '27 Jun, 2017'}","{'url': 'http://help.163.com/', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
633130,game,Nongünz,0.0,False,full,,<h1>NONGÜNZ IS OUT! THANK YOU!</h1><p><img src...,<strong>Nongünz</strong> is a nihilistic actio...,Nongünz is a nihilistic action platformer rogu...,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256685374, 'name': 'Launch Trailer', '...",{'total': 179},"{'total': 26, 'highlighted': [{'name': 'A', 'p...","{'coming_soon': False, 'date': '19 May, 2017'}","{'url': 'www.digerati.games', 'email': 'george...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
658440,game,Enshrouded World: Home Truths,0.0,False,,,,,,,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",,,,"{'total': 10, 'highlighted': [{'name': 'Bona F...","{'coming_soon': False, 'date': '26 Feb, 2018'}","{'url': '', 'email': 'enshroudedworld@gmail.com'}",,"{'ids': [], 'notes': None}"
669280,game,Les Quatre Alices,0.0,False,,,A young girl walks in a dark forest. <br>She m...,A young girl walks in a dark forest. <br>She m...,"A short visual novel, made in RPG MAKER MV, wh...",,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256694073, 'name': 'Gameplay Trailer',...",,"{'total': 9, 'highlighted': [{'name': ""Let's b...","{'coming_soon': False, 'date': '28 Aug, 2017'}","{'url': 'http://miagames.net/en', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


That's just evil. For these apps the field holds the literal string `['']`: a list containing one empty string, stored as text. We will change these values to NaN. We might want to revisit names, developers... everything really, just in case. Since getting the NaN counts right is really important, we will do it before assessing the rest of the variables.

Checking the values manually, some of them belong to publishers who asked to have the game removed from Steam. But if you enter the id manually, the page is still available (e.g. https://store.steampowered.com/app/633130/Nongnz/ ). Maybe the publisher itself asked to delete its account.

Is it possible that some of these values were registered at some point by Steam Spy and preserved? Let's check. If not, we will simply treat them as NaNs.

In [31]:
(~spy[spy.index.isin(store[store["publishers"]=="['']"].index)]["publishers"].isnull()).sum()

262

It seems we can recover up to 262 values from Steam Spy, now that we have discovered that this supposedly complete column had some hidden NaNs.

This is the reason I decided to split the cleaning section into a pre-merge and post-merge cleaning.
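The two checks above can be reproduced on a toy example. This is only a sketch with invented rows (the names `store_toy` and `spy_toy` are not part of the notebook): comparing against `""` finds nothing because the cell holds the text `['']`, and the recoverable count is the number of those ids for which Steam Spy has a non-null publisher.

```python
import pandas as pd

# Invented rows; both frames share the same "id" index, as in the notebook.
store_toy = pd.DataFrame(
    {"publishers": ["['SEGA']", "['']", "['']"]},
    index=pd.Index([1, 2, 3], name="id"),
)
spy_toy = pd.DataFrame(
    {"publishers": ["SEGA", "Some Publisher", None]},
    index=pd.Index([1, 2, 3], name="id"),
)

# The cell is never an empty string, so this mask is all False.
empty_string = store_toy["publishers"] == ""

# The placeholder is the literal text "['']".
placeholder = store_toy["publishers"] == "['']"

# How many placeholder ids have a usable publisher in the secondary frame?
placeholder_ids = store_toy.index[placeholder]
recoverable = spy_toy.loc[spy_toy.index.isin(placeholder_ids), "publishers"].notna().sum()

print(empty_string.sum(), placeholder.sum(), recoverable)
```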

#### Publisher/Others: Cleaning Decision

* Using `store = store.replace("['']", np.nan)` we should catch any `['']` placeholder strings in the Steam Store data, which we thought was the more complete source. Then we merge by id, using the Steam Store value (if available) and falling back to Steam Spy if possible.

* If there is no publisher, but we have a developer, then we will use the developer as publisher as well. If there is no publisher or developer, we will simply delete the record.

In [32]:
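# Return row[current], falling back to row[alternate] when the current value is missing.
def getOtherColumnValue(row,current,alternate):
    if pd.isna(row[current]):
        return row[alternate]
    else:
        return row[current]

def fixDevPub(store, spy):
    # The literal string "['']" is Steam's empty placeholder: turn it into a real NaN.
    store = store.replace("['']", np.nan)
    # Fill the remaining gaps from Steam Spy (needs the same index and column names).
    store = updateFromAlternateSource(store,spy)
    # Cross-fill: a missing developer takes the publisher, and vice versa.
    store["developers"] = store.apply(getOtherColumnValue, current="developers", alternate="publishers", axis=1)
    store["publishers"] = store.apply(getOtherColumnValue, current="publishers", alternate="developers", axis=1)
    return store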

Running this function will get any values from steam spy which are useful from the repeated columns. We have also eliminated the empty string values and replaced them with NaN, to ensure our cleaning functions detect them properly.

However, note that we have also updated genres and languages by doing it this way...
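As a side note, the developer/publisher cross-fill can also be written without a row-wise `apply`. The following is a sketch on invented data, not the code the notebook runs; it relies only on `Series.fillna` accepting another Series aligned on the same index.

```python
import numpy as np
import pandas as pd

# Invented rows covering the three cases: no publisher, no developer, neither.
toy = pd.DataFrame(
    {
        "developers": ["Dev A", np.nan, np.nan],
        "publishers": [np.nan, "Pub B", np.nan],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Each column borrows from the other one only where it is missing.
developers = toy["developers"].fillna(toy["publishers"])
publishers = toy["publishers"].fillna(toy["developers"])
toy = toy.assign(developers=developers, publishers=publishers)

print(toy)
```

Rows where both columns are missing stay NaN in both, so they can still be dropped afterwards.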

In [33]:
store = fixDevPub(store, spy)

In [34]:
store.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66803 entries, 10140 to 676480
Data columns (total 38 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66655 non-null  object 
 1   name                     66803 non-null  object 
 2   required_age             66655 non-null  object 
 3   is_free                  66655 non-null  object 
 4   controller_support       14422 non-null  object 
 5   dlc                      9342 non-null   object 
 6   detailed_description     66587 non-null  object 
 7   about_the_game           66585 non-null  object 
 8   short_description        66593 non-null  object 
 9   fullgame                 0 non-null      float64
 10  supported_languages      66667 non-null  object 
 11  header_image             66655 non-null  object 
 12  website                  36119 non-null  object 
 13  pc_requirements          66655 non-null  object 
 14  mac_requirements 

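Before continuing, let's drop any rows where we have neither publisher nor developer. Since the two columns were cross-filled above, a missing publisher now means that both are missing.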

In [35]:
store = store[~store["publishers"].isnull()]

### Genre

There are 3 similar types of data here: genre, categories, and tags. However, genre is the only one present in both dataframes, so we will check it first. We may end up consolidating these 3 columns into one or two final ones.

But it could be more complex. If we want to do any machine learning approach, we might need to turn these unique categories, tags etcetera into unique columns with 1/0 values. In any case, that is food for thought for later processes. What is important right now is that we do not lose any information.
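For reference, this is roughly what such a 1/0 encoding could look like. It is a sketch under the assumption that the genres have already been flattened into the comma-separated form Steam Spy uses; the three rows are invented.

```python
import pandas as pd

# Invented genre strings in the "Genre A, Genre B" format.
genres_toy = pd.Series(
    ["Action, Indie", "Casual", "Action, Casual, Indie"],
    index=pd.Index([1, 2, 3], name="id"),
)

# One indicator column per distinct genre, split on ", ".
dummies = genres_toy.str.get_dummies(sep=", ")
print(dummies)
```

Note that this only works cleanly if no genre name contains the separator itself.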

Recalling the first part of this section, genre had almost no NaNs in store, but a lot in spy. Let's check if we can complete it with store alone, and compare the different formats.

In [36]:
store["genres"].value_counts()

[{'id': '1', 'description': 'Action'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                                                                                                          3980
[{'id': '4', 'description': 'Casual'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                                                                                                          3598
[{'id': '1', 'description': 'Action'}, {'id': '25', 'description': 'Adventure'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                            

In [37]:
spy["genres"].value_counts()

Action, Indie                                                                                                        3626
Casual, Indie                                                                                                        3360
Action, Adventure, Indie                                                                                             2924
Adventure, Indie                                                                                                     2495
Action, Casual, Indie                                                                                                1944
                                                                                                                     ... 
Action, Casual, Free to Play, Indie, Massively Multiplayer, Racing, Strategy, Early Access                              1
Action, Adventure, Casual, Free to Play, Indie, Massively Multiplayer, Simulation, Sports, Strategy, Early Access       1
Casual, Free to Play, In

If no genre name contains a comma itself, it would make sense to list them exactly as Steam Spy does. If not, we will look for a different separator, or even just split them into a list. Either way we want something clearer than the stringified list of dicts that the Steam Store provides.
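One way to run that comma check is to collect every distinct genre description and look for the separator in it. The sketch below uses three invented values in the Steam Store format and `ast.literal_eval` for the parsing; on the real data the input would be the raw `genres` column.

```python
import ast

import pandas as pd

# Invented values in the Steam Store format (stringified list of dicts).
raw = pd.Series([
    "[{'id': '1', 'description': 'Action'}, {'id': '23', 'description': 'Indie'}]",
    "[{'id': '3', 'description': 'RPG'}]",
    None,
])

# Gather every distinct description.
descriptions = set()
for value in raw.dropna():
    for entry in ast.literal_eval(value):
        descriptions.add(entry["description"])

# Any genre listed here would break a comma-separated format.
with_comma = sorted(d for d in descriptions if "," in d)
print(sorted(descriptions), with_comma)
```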

In [38]:
store["genres"].iloc[12]

"[{'id': '3', 'description': 'RPG'}]"

In [39]:
store["genres"].isnull().sum()

114

In [40]:
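import ast

def extractDict(weirdList, key):
    # Missing values stay missing.
    if pd.isna(weirdList):
        return np.nan
    try:
        # literal_eval only parses Python literals, so it is safer than eval here.
        evalList = ast.literal_eval(weirdList)
        # Warning: the join below expects a list of dicts. A single dict would be
        # iterated key by key, so we handle it separately.
        if isinstance(evalList, dict):
            return evalList[key]
        return ", ".join(dictionary[key] for dictionary in evalList)
    except Exception:
        # Not a list of dicts: the value is already a plain string
        # (e.g. "Action, Indie" coming from Steam Spy), so we return it as it is.
        return weirdList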

A little explanation of the above. Most games are indeed formatted as a list of dicts. But there are a few (48) that, on closer inspection, already had the genre column formatted as genre names separated by commas. Of these, only one is a valid game (one that still exists in the store): https://store.steampowered.com/app/22330/The_Elder_Scrolls_IV_Oblivion_Game_of_the_Year_Edition/

This value was actually recovered by the update function we defined and ran above for the developers and publishers, so the information comes from Steam Spy.

In any case, since the other invalid games will be deleted later, we make an exception when there is no dict and just return the original string, using a try and except. We will have to do something similar with the languages.

In [41]:
store["genres"] = store["genres"].apply(extractDict, key="description")

In [42]:
store["genres"]

id
10140                                                Sports
10240                                             Adventure
11050                                     Adventure, Casual
11230                                 Casual, Indie, Racing
12430                                                Casual
                                ...                        
673400                                Design & Illustration
673730                  Indie, Simulation, Audio Production
674010                             Action, Adventure, Indie
675230    Adventure, Casual, Free to Play, Massively Mul...
676480                                                Indie
Name: genres, Length: 66653, dtype: object

In [43]:
store.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66653 entries, 10140 to 676480
Data columns (total 38 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66606 non-null  object 
 1   name                     66653 non-null  object 
 2   required_age             66606 non-null  object 
 3   is_free                  66606 non-null  object 
 4   controller_support       14409 non-null  object 
 5   dlc                      9341 non-null   object 
 6   detailed_description     66555 non-null  object 
 7   about_the_game           66553 non-null  object 
 8   short_description        66560 non-null  object 
 9   fullgame                 0 non-null      float64
 10  supported_languages      66633 non-null  object 
 11  header_image             66606 non-null  object 
 12  website                  36099 non-null  object 
 13  pc_requirements          66606 non-null  object 
 14  mac_requirements 

Let's see what we have in categories.

In [44]:
store["categories"].apply(extractDict, key="description")

id
10140                                         Single-player
10240                                         Single-player
11050                                         Single-player
11230                           Single-player, Multi-player
12430                                         Single-player
                                ...                        
673400                                                  NaN
673730                                        Single-player
674010    Single-player, Steam Achievements, Steam Works...
675230    Multi-player, MMO, PvP, Online PvP, Co-op, Onl...
676480    Single-player, Steam Achievements, Full contro...
Name: categories, Length: 66653, dtype: object

Looking at the rows with NaN values, there are a lot of abandoned games, mixed with applications which should not have been listed as games. We will drop any row which has NaNs in categories.
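The drop itself is a one-liner with `dropna`. A minimal sketch on two invented rows, just to show the `subset` argument restricting the check to the `categories` column:

```python
import numpy as np
import pandas as pd

# Invented rows: one app with categories, one without.
toy = pd.DataFrame(
    {
        "name": ["Game A", "Tool B"],
        "categories": ["[{'id': 2, 'description': 'Single-player'}]", np.nan],
    },
    index=pd.Index([1, 2], name="id"),
)

# Only rows with a missing value in "categories" are removed.
cleaned = toy.dropna(subset=["categories"])
print(cleaned)
```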

In [45]:
store["categories"].value_counts()

[{'id': 2, 'description': 'Single-player'}]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    18057
[{'id': 2, 'description': 'Single-player'}, {'id': 22, 'description': 'Steam Achievements'}]                                                                                                                                                       

In [46]:
store["categories"].head()

id
10140          [{'id': 2, 'description': 'Single-player'}]
10240          [{'id': 2, 'description': 'Single-player'}]
11050          [{'id': 2, 'description': 'Single-player'}]
11230    [{'id': 2, 'description': 'Single-player'}, {'...
12430          [{'id': 2, 'description': 'Single-player'}]
Name: categories, dtype: object

There is actually a ton of useful metadata here. This seems to be what is shown on the right-hand side of the Steam Store page.

We might want to extract useful data into separate columns, such as Single-player and Multi-player, and also get/compare metadata with other columns regarding achievements and controller support.
There might be additional features which could be good to add to our analysis, such as Steam Workshop or Steam Cloud functionality.

Let's see which categories may be interesting, and how often each appears across our games.

In [47]:
def getCategoryList(series):
    # Seed both Series with category id 2 so that they share an index from the start.
    lista = pd.Series(data="Single-player", index=[2])
    numbers = pd.Series(data=0, index=[2])

    for idx, value in series.items():
        try:
            value = eval(value)
            for item in value:
                lista[item["id"]] = item["description"]
                try:
                    numbers[item["id"]] += 1
                except KeyError:
                    # First time we see this category id.
                    numbers[item["id"]] = 1
        except Exception:
            # NaN or unparseable rows are skipped.
            pass
    data = {"category": lista, "numbers": numbers}
    lista = pd.concat(data, axis=1)
    return lista

In [48]:
lista = getCategoryList(store["categories"])

In [49]:
lista.sort_values(by="numbers")

Unnamed: 0,category,numbers
6,Mods (require HL2),2
19,Mods,2
40,SteamVR Collectibles,43
51,Steam Workshop,45
16,Includes Source SDK,56
32,Steam Turn Notifications,108
8,Valve Anti-Cheat enabled,110
31,VR Support,253
14,Commentary available,260
48,LAN Co-op,610
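The same tally can be sketched with `collections.Counter`, which avoids the nested try/except. This is an alternative written for illustration on three invented rows, not the function used to produce the table above.

```python
import ast
from collections import Counter

import pandas as pd

# Invented rows in the Steam Store "categories" format.
categories_toy = pd.Series([
    "[{'id': 2, 'description': 'Single-player'}]",
    "[{'id': 2, 'description': 'Single-player'}, {'id': 22, 'description': 'Steam Achievements'}]",
    None,
])

counts = Counter()  # category id -> number of apps
labels = {}         # category id -> description
for value in categories_toy.dropna():
    for item in ast.literal_eval(value):
        counts[item["id"]] += 1
        labels[item["id"]] = item["description"]

# Both Series are indexed by category id, so they align when combined.
summary = pd.DataFrame({"category": pd.Series(labels), "numbers": pd.Series(counts)})
print(summary.sort_values(by="numbers"))
```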


We will extract the following data:

`controller_support` : Available as a separate column already, but that one only includes Full Controller Support. Here we have Full and Partial, so let's add also that information.

`achievements`, `cloud`, `trading_cards`, `leaderboards`, `workshop`, `level_editor` : False or True. These are popular Steam Features.

`in_app_purchases`: Not really common in steam, but we actually have quite some games with it. True or False.



Multiplayer, online and so on come with a LOT of options. Remote Play does not make much sense as a feature, as it is in theory available for any game. We will make an exception for Remote Play Together, as it lets you share the game with a friend and play in a kind of stream / split screen, so we will consider it "Co-op". Let's group them into sensible columns, to avoid having a lot of separate columns with mixed meanings, using the following logic:

`singleplayer`: The id that already identifies it.

`multiplayer`: Its tag, and just in case, any of the other multiplayer tags.

`online`: Online Co-op, Online PvP, Remote Play Together, Cross Platform Multiplayer, MMO

`pvp`: True if PvP, shared/split screen PvP, LAN PvP, Online PvP.

`co-op`: True if Co-op, shared/split screen Co-op, LAN Co-op, Online Co-op, Remote Play Together

`local_multiplayer`: True if shared/split screen, and shared/split screen PvP or Co-op.

`lan`: True if we get any of the LAN possibilities.

`mmo` : True if MMO

We will ignore other minor features such as captions, commentary... The VR category is not reliable either: more games support VR than this category shows, as we can check simply by doing a search on Steam (at least 5k games support VR: https://store.steampowered.com/search/?category1=998&vrsupport=402). We should look at this separately.
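The grouping logic above is essentially a lookup from category name to a set of flags. Here is a sketch of that idea with a hypothetical `CATEGORY_FLAGS` table and `flags_for` helper (neither is part of the notebook), covering only a handful of the categories; the category ids in the example string are invented.

```python
import ast

import pandas as pd

# Hypothetical lookup table: category description -> flags it switches on.
# Only a subset of the categories is listed here.
CATEGORY_FLAGS = {
    "Single-player": {"singleplayer"},
    "Multi-player": {"multiplayer"},
    "PvP": {"multiplayer", "pvp"},
    "Online PvP": {"multiplayer", "pvp", "online"},
    "Co-op": {"multiplayer", "co-op"},
    "Online Co-op": {"multiplayer", "co-op", "online"},
    "LAN Co-op": {"multiplayer", "co-op", "lan"},
    "MMO": {"multiplayer", "mmo", "online"},
}

def flags_for(raw):
    """Return the set of flags for one raw 'categories' cell."""
    flags = set()
    if pd.isna(raw):
        return flags
    for item in ast.literal_eval(raw):
        # Categories we do not track contribute nothing.
        flags |= CATEGORY_FLAGS.get(item["description"], set())
    return flags

example = "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online PvP'}]"
print(flags_for(example))
```

A table like this keeps the rules in one place, so a rule can be changed without touching the loop.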

In [50]:
def extractCategories(df):
    df["singleplayer"] = False
    df["multiplayer"] = False
    df["pvp"] = False
    df["co-op"] = False
    df["online"] = False
    df["local_multiplayer"] = False
    df["mmo"] = False
    df["lan"] = False
    df["achievements"] = False
    df["cloud"] = False
    df["trading_cards"] = False
    df["leaderboards"] = False
    df["workshop"] = False
    df["in_app_purchases"] = False
    df["level_editor"] = False
    df["controller_support"] = df["controller_support"].fillna("none")
    
    for idx, value in df["categories"].items():
        try:
            value = eval(value)
            for item in value:
                cat = item["description"]
                if cat == "Single-player":
                    df.at[idx,"singleplayer"] = True
                    
                elif cat == "Multi-player":
                    df.at[idx,"multiplayer"] = True
                    
                elif cat == "PvP":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"pvp"] = True
                    
                elif cat == "Co-op":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"co-op"] = True
                    
                elif cat == "Online PvP":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"pvp"] = True
                    df.at[idx,"online"] = True
                    
                elif cat == "Online Co-op":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"co-op"] = True
                    df.at[idx,"online"] = True
                    
                elif cat == "Shared/Split Screen":
                    df.at[idx,"local_multiplayer"] = True
                    df.at[idx,"multiplayer"] = True
                    
                elif cat == "Shared/Split Screen PvP":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"pvp"] = True
                    df.at[idx,"local_multiplayer"] = True
                    
                elif cat == "Shared/Split Screen Co-op":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"co-op"] = True
                    df.at[idx,"local_multiplayer"] = True  
                    
                elif cat == "Cross-Platform Multiplayer":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"online"] = True                  
                    
                elif cat == "MMO":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"mmo"] = True
                    df.at[idx,"online"] = True

                elif cat == "LAN PvP":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"pvp"] = True
                    df.at[idx,"lan"] = True
                    
                elif cat == "LAN Co-op":
                    df.at[idx,"multiplayer"] = True
                    df.at[idx,"co-op"] = True
                    df.at[idx,"lan"] = True
                    
                elif cat == "Partial Controller Support":
                    df.at[idx,"controller_support"] = "partial"

                elif cat == "Full controller support":
                    df.at[idx,"controller_support"] = "full"
                    
                elif cat == "Steam Achievements":
                    df.at[idx,"achievements"] = True

                elif cat == "Steam Cloud":
                    df.at[idx,"cloud"] = True
                    
                elif cat == "Steam Trading Cards":
                    df.at[idx,"trading_cards"] = True
                    
                elif cat == "Steam Leaderboards":
                    df.at[idx,"leaderboards"] = True

                elif cat == "Steam Workshop":
                    df.at[idx,"workshop"] = True                    

                elif cat == "Includes level editor":
                    df.at[idx,"level_editor"] = True                    

                elif cat == "In-App Purchases":
                    df.at[idx,"in_app_purchases"] = True                     
        except Exception:
            # NaN or unparseable categories: leave every flag at its default.
            pass
    return df

In [51]:
store = extractCategories(store)

In [52]:
store["in_app_purchases"].value_counts()

False    64802
True      1851
Name: in_app_purchases, dtype: int64

In [53]:
store["level_editor"].value_counts()

False    64797
True      1856
Name: level_editor, dtype: int64

In [54]:
store["workshop"].value_counts()

False    64892
True      1761
Name: workshop, dtype: int64

In [55]:
store["singleplayer"].value_counts()

True     62067
False     4586
Name: singleplayer, dtype: int64

In [56]:
store["multiplayer"].value_counts()

False    52449
True     14204
Name: multiplayer, dtype: int64

In [57]:
store["mmo"].value_counts()

False    65523
True      1130
Name: mmo, dtype: int64

In [58]:
store["local_multiplayer"].value_counts()

False    60755
True      5898
Name: local_multiplayer, dtype: int64

In [59]:
store["online"].value_counts()

False    58126
True      8527
Name: online, dtype: int64

In [60]:
store["achievements"].value_counts()

False    35885
True     30768
Name: achievements, dtype: int64

In [61]:
store["cloud"].value_counts()

False    51455
True     15198
Name: cloud, dtype: int64

In [62]:
store["trading_cards"].value_counts()

False    57252
True      9401
Name: trading_cards, dtype: int64

In [63]:
store["controller_support"].value_counts()

none       43031
full       14396
partial     9226
Name: controller_support, dtype: int64


In [64]:
store["pvp"].value_counts()

False    57639
True      9014
Name: pvp, dtype: int64

In [111]:
store["lan"].value_counts()

False    65697
True       908
Name: lan, dtype: int64

In [112]:
store["co-op"].value_counts()

False    59515
True      7090
Name: co-op, dtype: int64

Now that we've done this, let's at least check for games that have neither single-player nor multiplayer, which does not make any sense.

In [116]:
store[~store["multiplayer"] & ~store["singleplayer"]]

id,type,name,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,...,trading_cards,leaderboards,workshop,in_app_purchases,level_editor,mature,total_positive,total_negative,total_reviews,rating
312860,game,Defense Grid 2: A Matter of Endurance,False,none,False,,,,,"English, Not supported",...,False,False,False,False,False,False,4,10,14,38.054619
339510,game,Psy High,False,none,False,When the kids at your high school start develo...,When the kids at your high school start develo...,Interactive teen supernatural mystery! You and...,,English,...,False,False,False,False,False,False,108,30,138,71.862434
516130,game,Runner3,False,full,True,The rhythm-music platformer gameplay of BIT.TR...,The rhythm-music platformer gameplay of BIT.TR...,The rhythm-music gameplay of BIT.TRIP RUNNER a...,,"English<strong>*</strong>, French, Italian, Ge...",...,False,True,False,False,False,False,153,49,202,70.542298
681410,game,Adventures of the Worm,False,none,False,Adventures of the Worm is a puzzle game in whi...,Adventures of the Worm is a puzzle game in whi...,Adventures of the Worm is a puzzle game in whi...,,"English<strong>*</strong>, Czech<strong>*</str...",...,False,False,False,False,False,False,8,0,8,74.194375
685770,game,2D Mahjong Temple,False,none,False,<h1>🔥 Upcoming games - Wishlist now!</h1><p><a...,Experience Far Eastern Mahjong fun and help th...,Experience Far Eastern Mahjong fun and help th...,,"English, French, Italian, German, Spanish - Sp...",...,False,False,False,False,False,False,3,3,6,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
670470,game,MAGIX Video deluxe 2018 Steam Edition,False,none,False,"<h1>New Version</h1><p><a href=""https://store....",Introduction to video editing: <strong>Video d...,Create amazing videos with Video deluxe 2018 S...,,"English, French, Italian, German, Spanish - Sp...",...,False,False,False,False,False,False,12,10,22,52.776746
671190,game,Pro Motion NG,False,none,True,"An industry standard for two decades, talent f...","An industry standard for two decades, talent f...",Pro Motion is not only a powerhouse of tools f...,,English,...,False,False,False,False,False,False,91,6,97,82.794006
672760,game,Oneiric Masterpieces - Paris,False,none,False,The Oneiric Collection is the new way to visit...,The Oneiric Collection is the new way to visit...,Oneiric Masterpieces - Paris is the most compl...,,English<strong>*</strong><br><strong>*</strong...,...,False,False,False,False,False,False,1,1,2,50.000000
672870,game,ScreenPlay,True,none,False,ScreenPlay is an Open Source cross-platform ap...,ScreenPlay is an Open Source cross-platform ap...,ScreenPlay is an Open Source cross-platform ap...,,"English, German, French, Spanish - Spain, Russ...",...,False,False,True,False,False,False,0,0,0,0.500000


### Metadata - Store

The following columns hold useful information (mostly available only in the storefront dataframe) and, curiously, all of them have the same non-null count. It would be interesting to check whether the missing values all come from the same app IDs. Since those games would be missing crucial information (especially the release date), we could simply delete those rows.

* type 
* required_age 
* is_free
* header_image 
* pc_requirements
* mac_requirements
* linux_requirements
* package_groups 
* platforms 
* release_date
* support_info
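
One way to check whether the missing values line up on the same app IDs is to compare the NaN masks of these columns. The sketch below is only an illustration: `missing_rows_coincide` and the toy dataframe are hypothetical, not part of the notebook, and it assumes pandas is available as in the rest of the analysis.

```python
import pandas as pd

def missing_rows_coincide(df, columns):
    """Return True when every row is either complete or missing in ALL given columns."""
    missing = df[columns].isna()
    # If "missing in all" equals "missing in any" on every row, the NaNs coincide
    return bool((missing.all(axis=1) == missing.any(axis=1)).all())

# Toy data (made up): app 20 lacks all three metadata fields at once
toy = pd.DataFrame(
    {
        "type": ["game", None, "game"],
        "is_free": [False, None, True],
        "release_date": ["x", None, "y"],
    },
    index=[10, 20, 30],
)
toy_columns = ["type", "is_free", "release_date"]
print(missing_rows_coincide(toy, toy_columns))
```

If this returns True for the real metadata columns, filtering on any single one of them (for example `type`) removes the incomplete rows for all of them.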

In [65]:
store[store["type"].notnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66606 entries, 10140 to 676480
Data columns (total 52 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66606 non-null  object 
 1   name                     66606 non-null  object 
 2   required_age             66606 non-null  object 
 3   is_free                  66606 non-null  object 
 4   controller_support       66606 non-null  object 
 5   dlc                      9341 non-null   object 
 6   detailed_description     66555 non-null  object 
 7   about_the_game           66553 non-null  object 
 8   short_description        66560 non-null  object 
 9   fullgame                 0 non-null      float64
 10  supported_languages      66586 non-null  object 
 11  header_image             66606 non-null  object 
 12  website                  36099 non-null  object 
 13  pc_requirements          66606 non-null  object 
 14  mac_requirements 

In [66]:
store[store["type"].isnull()]

id,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,online,local_multiplayer,mmo,lan,cloud,trading_cards,leaderboards,workshop,in_app_purchases,level_editor
681810,,(Chinese PaladinSword and Fairy 6),,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
681820,,(Chinese PaladinSword and Fairy 4),,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
681830,,(Chinese PaladinSword and Fairy 5),,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
681840,,Chinese PaladinSword and Fairy 5 Prequel,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
698600,,Tooth and Claw,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
700740,,Teenage Mutant Ninja Turtles: Portal Power,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
710130,,(Hidden Dragon Legend: Shadow Trace),,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
799960,,Wizard101,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
804780,,VRでレムと異世界生活-膝枕&添寝編,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False
806560,,VRでエミリアと異世界生活-膝枕&添寝編,,,none,,,,,,...,False,False,False,False,False,False,False,False,False,False


In [67]:
store["type"].value_counts()

game           66605
advertising        1
Name: type, dtype: int64

It can be useful to at least check the value_counts of each column to detect any strange values.

In [68]:
store["required_age"].value_counts()

0                                                                             41448
0.0                                                                           17165
0.0                                                                            6951
18                                                                              241
18.0                                                                            195
16                                                                              105
16.0                                                                            103
18.0                                                                             84
16.0                                                                             50
12                                                                               43
12.0                                                                             36
17.0                                                                        

In [69]:
getSteamLink(store[store["required_age"]==12.0])

Anachronox https://store.steampowered.com/app/242940
Startopia https://store.steampowered.com/app/243040
Starlight Inception™ https://store.steampowered.com/app/250720
//N.P.P.D. RUSH//- The milk of Ultraviolet https://store.steampowered.com/app/270090
LARA CROFT AND THE TEMPLE OF OSIRIS™ https://store.steampowered.com/app/289690
Final Fantasy IV (3D Remake) https://store.steampowered.com/app/312750
Tallowmere https://store.steampowered.com/app/340520
CroNix https://store.steampowered.com/app/343630
Ascent - The Space Game https://store.steampowered.com/app/345010
FINAL FANTASY IV: THE AFTER YEARS https://store.steampowered.com/app/346830
Poppy Kart https://store.steampowered.com/app/352530
FINAL FANTASY X/X-2 HD Remaster https://store.steampowered.com/app/359870
Contradiction - Spot The Liar! https://store.steampowered.com/app/373390
FINAL FANTASY IX https://store.steampowered.com/app/377840
Dungeon Nightmares II : The Memory https://store.steampowered.com/app/382090
Mad Snowboarding

This column is really messy. Almost all games have a value of 0, which probably means no restriction, followed by a few with 18 and a lot of strange age values.

According to PEGI the values should be 3, 7, 12, 16 and 18.
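
The counts above list what looks like the same age several times (`0`, `0.0`, `18`, `18.0`), which suggests the column mixes numbers and text. A minimal sketch of how such values could be collapsed before counting, assuming that reading the first run of digits is good enough; `normalize_age` and the sample list are hypothetical:

```python
import re
from collections import Counter

def normalize_age(value):
    """Return the age as an int, or None when the value contains no digits."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None

# Made-up sample imitating the mixed representations seen in the output above
raw_ages = [0, 0.0, "0", 18, "18.0", 16.0, "12", 17.0, float("nan")]
age_counts = Counter(normalize_age(v) for v in raw_ages)
print(age_counts)
```

After this step the remaining distinct values can be compared directly against the PEGI set (3, 7, 12, 16, 18).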

This sounds like something that might be informed by the "content_descriptors" column.

In [70]:
store["content_descriptors"].value_counts()

{'ids': [], 'notes': None}                                                                                                                                                                                                                                                                                                                  55590
{'ids': [2, 5], 'notes': None}                                                                                                                                                                                                                                                                                                                925
{'ids': [1, 5], 'notes': None}                                                                                                                                                                                                                                                                                                      

In [71]:
getSteamLink(store[store["content_descriptors"]=="{'ids': [2, 5], 'notes': None}"])

The Way We All Go https://store.steampowered.com/app/352610
Gil's Lucid Dreams https://store.steampowered.com/app/556260
Perfect Crime https://store.steampowered.com/app/666870
Forsaken Remastered https://store.steampowered.com/app/668980
CODE VEIN https://store.steampowered.com/app/678960
Kama Bullet Heritage https://store.steampowered.com/app/680690
ZomDay https://store.steampowered.com/app/681390
MarZ: Tactical Base Defense https://store.steampowered.com/app/682530
Beyond the Wall https://store.steampowered.com/app/684560
Days Of Purgatory https://store.steampowered.com/app/684840
Insanity VR: Last Score https://store.steampowered.com/app/686340
Transparent Black https://store.steampowered.com/app/687830
TOXICANT https://store.steampowered.com/app/688120
Cynoclept: The Game https://store.steampowered.com/app/688880
Memories https://store.steampowered.com/app/689680
Sword Legacy: Omen https://store.steampowered.com/app/690140
Killing Floor: Incursion https://store.steampowered.com/ap

Darklands:Awakening https://store.steampowered.com/app/1423500
Strong towers https://store.steampowered.com/app/1433930
CatMafia https://store.steampowered.com/app/1436830
Elite Commander https://store.steampowered.com/app/1439180
SIDE https://store.steampowered.com/app/1440680
The Trap: Remastered https://store.steampowered.com/app/1446130
The Tides of Time https://store.steampowered.com/app/1452360
Shooty https://store.steampowered.com/app/1454060
Misery Mansion https://store.steampowered.com/app/1458980
Panic Attack https://store.steampowered.com/app/1462100
Snow Survival https://store.steampowered.com/app/1463080
Demon heart https://store.steampowered.com/app/1464190
Eclipse https://store.steampowered.com/app/1464380
末日杀 Might & Trap: Apocalypse https://store.steampowered.com/app/1464410
Legendary Tales https://store.steampowered.com/app/1465070
SamuraiZero https://store.steampowered.com/app/1467080
Maniac Path 2 https://store.steampowered.com/app/1479470
Sentinel: Cursed Knight ht

Viking Age: Odin's Warrior https://store.steampowered.com/app/773380
R.A.I.D. https://store.steampowered.com/app/790560
Feed Eve https://store.steampowered.com/app/790950
Tale of Ronin https://store.steampowered.com/app/791630
SOLOS https://store.steampowered.com/app/809060
Death from Unknown: Survival https://store.steampowered.com/app/809550
Bloodstone https://store.steampowered.com/app/811030
RAW FOOTAGE https://store.steampowered.com/app/812090
Galactic Tanks https://store.steampowered.com/app/818710
Before the Blood https://store.steampowered.com/app/824110
Oroborus: Planes Of The Dead https://store.steampowered.com/app/824690
Vertigo FPS https://store.steampowered.com/app/833940
Die In The Dark https://store.steampowered.com/app/847860
A Survival Game Project https://store.steampowered.com/app/849890
Order of the Assassin https://store.steampowered.com/app/851690
Barry Has a Secret https://store.steampowered.com/app/864230
Gull Kebap VR https://store.steampowered.com/app/867750
T

It seems that content descriptor [2,5] adds this statement: 

The developers describe the content like this:

This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content

Developers may add a note, but the key point here is that [2] means frequent violence or gore and [5] means general mature content.
More details here:
https://steamcommunity.com/games/593110/announcements/detail/1708442022337025126

Honestly, compared with the PEGI statistics at https://pegi.info/page/statistics-about-pegi (around 16% of games are 18+), having only about a thousand 18+ titles here looks far too low.

In any case, these two columns should probably be simplified into a single flag (18+ or not), since the ratings seem largely unregulated, although popular 18+ titles are probably rated correctly. It is unclear whether this information will be useful, but it is the only source available, so this is the most sensible way to get some value out of it.

Let's check a very popular +18 title and see if it makes sense.
https://store.steampowered.com/app/271590/Grand_Theft_Auto_V/

In [72]:
print(store.loc[271590]["required_age"], store.loc[271590]["content_descriptors"])

17.0 {'ids': [5], 'notes': None}


It seems sensible to flag as 18+ any title with a required age of 17 or more, or with descriptor id 5 (mature content).

In [73]:
store["is_free"].value_counts()

False    59073
True      7533
Name: is_free, dtype: int64

In [74]:
store["pc_requirements"].value_counts()

{'minimum': '<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7</li></ul>'}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    226
{'minimum': '<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 10</li></ul>'}                           

This could be useful information, but it is HTML-formatted and not standardized. We will not clean it at all; instead, we will separate it from our main table at the end, in case someone needs this information. The same applies to the Mac and Linux requirements.

If a developer wants to know what systems users have, it is better to just use the Steam Hardware Survey: https://store.steampowered.com/hwsurvey
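
For anyone who does want to read these fields later, the markup can be reduced to plain text with the standard library. This is only a sketch under the assumption that the fragments are simple HTML like the examples above; `requirements_to_text` and the collector class are hypothetical helpers, not part of the notebook.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)

def requirements_to_text(fragment):
    """Return the visible text of an HTML fragment as one space-joined string."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join(collector.parts)

# The 'minimum' value from the most common entry shown above
sample_html = '<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7</li></ul>'
print(requirements_to_text(sample_html))
```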

In [75]:
print(store.iloc[0]["header_image"])

https://cdn.akamai.steamstatic.com/steam/apps/10140/header.jpg?t=1534443458


This column just gives us a link to a header image. This could be useful in the visualization phase, if we want to highlight any particular game.

In [76]:
store.iloc[0]["package_groups"]

"[{'name': 'default', 'title': 'Buy 3D Ultra™ Minigolf Adventures', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 1873, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': '3D Ultra Mini Golf Adventures - 9,99€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 999}]}]"

This seems to describe the packs in which this game is available. Let's look at a game that is sold both by itself and in a pack to confirm it.
https://store.steampowered.com/app/413410/Danganronpa_Trigger_Happy_Havoc/

In [77]:
store.loc[413410]["package_groups"]

'[{\'name\': \'default\', \'title\': \'Buy Danganronpa: Trigger Happy Havoc\', \'description\': \'\', \'selection_text\': \'Select a purchase option\', \'save_text\': \'\', \'display_type\': 0, \'is_recurring_subscription\': \'false\', \'subs\': [{\'packageid\': 82833, \'percent_savings_text\': \'-60% \', \'percent_savings\': 0, \'option_text\': \'Danganronpa: Trigger Happy Havoc - <span class="discount_original_price">19,99€</span> 7,99€\', \'option_description\': \'\', \'can_get_free_license\': \'0\', \'is_free_license\': False, \'price_in_cents_with_discount\': 799}]}]'

Well... the packs are not visible here, only the game by itself. We have some metadata related to the buy box, and also a discount that was active at the time. Depending on how `price_overview` looks, this might be useful to complete it. But the format is a bit messy (not standard, see those two examples), so it would be best to avoid it and drop this column.
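
If the column is ever used to cross-check prices, the stored string can be parsed as a Python literal. A hedged sketch: `package_prices` is a hypothetical helper, the sample is a shortened copy of the first example above, and the divide-by-100 step assumes `price_in_cents_with_discount` is always expressed in cents.

```python
import ast

def package_prices(package_groups):
    """Return (option_text, price) pairs for every purchase option in the string."""
    groups = ast.literal_eval(package_groups)  # parses the literal without executing code
    prices = []
    for group in groups:
        for sub in group.get("subs", []):
            prices.append((sub["option_text"], sub["price_in_cents_with_discount"] / 100))
    return prices

# Shortened version of the first package_groups example shown above
sample_groups = (
    "[{'name': 'default', 'title': 'Buy 3D Ultra™ Minigolf Adventures', "
    "'subs': [{'packageid': 1873, 'percent_savings': 0, "
    "'option_text': '3D Ultra Mini Golf Adventures - 9,99€', "
    "'is_free_license': False, 'price_in_cents_with_discount': 999}]}]"
)
print(package_prices(sample_groups))
```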

In [78]:
store["platforms"].value_counts()

{'windows': True, 'mac': False, 'linux': False}    49841
{'windows': True, 'mac': True, 'linux': True}       7895
{'windows': True, 'mac': True, 'linux': False}      6965
{'windows': True, 'mac': False, 'linux': True}      1883
{'windows': False, 'mac': True, 'linux': False}       15
{'windows': False, 'mac': False, 'linux': True}        6
{'windows': False, 'mac': True, 'linux': True}         1
Name: platforms, dtype: int64

We will need to change the format to comma-separated values, but the information is good.
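
One possible way to do that conversion, sketched here as an alternative to matching each string by hand, is to parse the dictionary and join the platforms that are set to True. `platforms_to_text` is a hypothetical name, and the sketch assumes every value is a well-formed dict string like the ones listed above.

```python
import ast

def platforms_to_text(value):
    """Turn "{'windows': True, 'mac': False, 'linux': True}" into "Windows, Linux"."""
    flags = ast.literal_eval(value)
    order = ["windows", "mac", "linux"]  # fixed order so equal sets give equal text
    return ", ".join(name.capitalize() for name in order if flags.get(name))

print(platforms_to_text("{'windows': True, 'mac': False, 'linux': True}"))
```

Parsing avoids having to enumerate all seven combinations, at the cost of failing loudly on values that are not valid dict strings.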

In [79]:
store["release_date"].value_counts()

{'coming_soon': True, 'date': '2022'}                                1125
{'coming_soon': True, 'date': 'TBA'}                                  834
{'coming_soon': True, 'date': 'Coming Soon'}                          653
{'coming_soon': True, 'date': ''}                                     399
{'coming_soon': True, 'date': 'TBD'}                                  251
                                                                     ... 
{'coming_soon': True, 'date': 'early access is planned for 2022'}       1
{'coming_soon': True, 'date': 'Date Soon'}                              1
{'coming_soon': True, 'date': '12 Sep, 2022'}                           1
{'coming_soon': True, 'date': 'Early 2022 - Wishlist Now! 🔔'}           1
{'coming_soon': True, 'date': 'Падажжите скора будет'}                  1
Name: release_date, Length: 6363, dtype: int64

Here we have two different fields: whether the game is still unreleased (`coming_soon`) and the date itself. The date seems to be free text for unreleased games, but we should check whether it always has the same format for released ones.

In [80]:
store["release_date"].sample(n=10)

id
740990      {'coming_soon': False, 'date': '6 Dec, 2017'}
1341230    {'coming_soon': False, 'date': '19 Nov, 2020'}
1037570    {'coming_soon': False, 'date': '28 Apr, 2019'}
1277360    {'coming_soon': False, 'date': '25 Apr, 2020'}
833440     {'coming_soon': False, 'date': '11 Oct, 2018'}
1508660     {'coming_soon': False, 'date': '7 Feb, 2021'}
1112950    {'coming_soon': False, 'date': '25 Jul, 2019'}
1264050    {'coming_soon': False, 'date': '26 Mar, 2021'}
363960      {'coming_soon': False, 'date': '6 Oct, 2015'}
1254160    {'coming_soon': False, 'date': '29 Jun, 2020'}
Name: release_date, dtype: object

After testing a few, that seems to be the case: free text if the game is not released, and `day Month, year` otherwise. We will revisit this when we convert it. Since the free text will be troublesome, maybe we can keep this in a single column holding the real date if the game has been released, and False if it has not.
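
The idea can be sketched with the standard library alone. `parse_release_date` is a hypothetical helper; returning None for a released game whose date does not match `day Month, year` is an assumption of this sketch, and `%b` relies on English month abbreviations (the default C locale).

```python
import ast
from datetime import datetime

def parse_release_date(value):
    """Return a date for released games, False if coming soon, None if unparseable."""
    info = ast.literal_eval(value)
    if info.get("coming_soon"):
        return False  # free-text dates of unreleased games are ignored
    try:
        return datetime.strptime(info["date"], "%d %b, %Y").date()
    except (KeyError, ValueError):
        return None  # released, but the date is not in the expected format

released = parse_release_date("{'coming_soon': False, 'date': '6 Dec, 2017'}")
upcoming = parse_release_date("{'coming_soon': True, 'date': 'TBA'}")
print(released, upcoming)
```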

In [81]:
store["support_info"].value_counts()

{'url': '', 'email': ''}                                                                                  1008
{'url': 'https://bigfishgames.custhelp.com/app/home', 'email': 'info@bigfishgames.com'}                    221
{'url': 'https://www.facebook.com/8FloorGames', 'email': 'mikhail.zverev@8floor.net'}                      195
{'url': '', 'email': 'support@quanticlab.com'}                                                             136
{'url': '', 'email': 'mail@garage-games.ru'}                                                               118
                                                                                                          ... 
{'url': '', 'email': 'josh@broodlingstudios.com'}                                                            1
{'url': 'http://www.poorrolemodel.com/', 'email': ''}                                                        1
{'url': '', 'email': 'evgeniy7000@gmail.com'}                                                                1
{

This information is listed on each Steam store page, but it does not seem to be mandatory. That makes sense, as Steam hosts a forum for each game, called "Steam Discussions". We can safely remove this column from the main table as it has no use for our analysis. It could perhaps be used to enrich developer or publisher data, but those columns are already quite clean.

In case it is useful for anybody, we will keep it separated.

In [82]:
store["website"].value_counts()

https://www.facebook.com/8FloorGames/            171
https://www.choiceofgames.com/                   152
https://steamcommunity.com/groups/alawargames     71
http://www.exosyphen.com                          68
https://www.facebook.com/DnovelGames/             65
                                                ... 
https://pqube.co.uk/gal-gun-returns/               1
http://www.daedalus-thegame.com/                   1
http://www.octodadgame.com                         1
http://www.gamecity.com.tw/souzou/                 1
http://deanforge.com/                              1
Name: website, Length: 30002, dtype: int64

In [118]:
store["ext_user_account_notice"].value_counts()

Uplay (Supports Linking to Steam Account)                                                                                                                                                   40
EA Account (Supports Linking to Steam Account)                                                                                                                                              30
Slitherine PBEM++ for Multiplayer                                                                                                                                                           27
PlayFab (Supports Linking to Steam Account)                                                                                                                                                 25
Ubisoft Account (Supports Linking to Steam Account)                                                                                                                                         23
                                             

#### Metadata - Store: First Columns

We can confirm that a group of "metadata" columns has NaNs in exactly the same rows, so we will drop those rows. Looking up these IDs directly on Steam, we cannot even reach their store pages; they mostly seem to be betas, old non-functional demos or tests.

* To remove the NaNs from the following columns, it is enough to keep the rows where any one of them has a value, since the missing information comes from the same rows.
`type  required_age  is_free  header_image  pc_requirements  mac_requirements  linux_requirements package_groups  platforms  release_date  support_info`

Additional cleaning required:

`type`
* We will drop the only app_id whose type is not `game` (a single `advertising` entry).

`required_age content_descriptors`
* A new column will be created with Mature being True or False. If the game is 17+, or it has an id in content_descriptors of a mature game, it will be considered Mature.

`is_free`
* Already clean, True or False.

`header_image`
* Will be kept to aid our visualizations. Might be interesting for image-based ML.

`pc_requirements  mac_requirements  linux_requirements`
* We will not treat this, it will be separated from the main dataframe as it is not useful and poorly formatted.

`package_groups`
* Eventually this will be dropped, but first we will see if it can be used to fix any errors in the price column.

`platforms`
* We need to change it into comma-separated values.

`release_date`
* The format will be changed to a single date if it has been released, or False if it has not been released yet.

`support_info`
* We will not treat this, it will be separated from the main dataframe as it is not useful.

After some review, there are a few additional columns that we clearly will not use but that could be useful for an image-oriented ML approach, so we will also separate `movies`, `screenshots`, `background` and `website` in this step.

#### Metadata - Store: Columns with a lot of NaNs

Next we have the following columns:

* `metacritic` : The numeric value if available, False if it has no value.
* `reviews` : It is free text, so we will change it to True or False, where True means a professional review was linked.
* `recommendations` : After inspecting this, it seems linked to the user reviews on Steam. However, we only got the total number of recommendations and there are lots of NaNs, which makes no sense. We will get this information from other sources.



* `demos`
* `drm_notice`
* `dlc`

This group deals with features: the game may have a demo, use DRM, or have DLC. We will check these and see whether they can be turned into True / False columns with no NaNs.

* `fullgame`: Useless column filled with NaNs.
* `legal_notice`: This has no useful information for us.
* `ext_user_account_notice`: This tells us if we need to link an external account (such as uplay for Ubisoft, or Origin in case of EA). It is not clear if it will be useful but we will keep it anyway. Let's add False for all these cases where we got no notice.

In [83]:
import ast
import re

def isAgeMature(age):
    # required_age may hold numbers or text, so read the first run of digits from its string form
    match = re.search(r"\d+", str(age))
    if match is None:
        return False
    # 17 or older counts as mature; values of 30 or more are treated as noise
    return 16 < int(match.group()) < 30

def isDescriptorMature(descriptor):
    # The cell holds a dict stored as text; literal_eval parses it without executing code
    try:
        return 5 in ast.literal_eval(descriptor)["ids"]  # id 5 = general mature content
    except (ValueError, SyntaxError, TypeError, KeyError):
        return False

def getMatureMetadata(store):
    # Combine both signals into a single boolean "mature" column
    store["required_age"] = store["required_age"].apply(isAgeMature)
    store["content_descriptors"] = store["content_descriptors"].apply(isDescriptorMature)
    store["mature"] = store["required_age"] | store["content_descriptors"]
    store = store.drop(columns=["required_age", "content_descriptors"])
    return store

def getMyPlatform(value):
    if value == "{'windows': True, 'mac': False, 'linux': False}":
        return "Windows"
    elif value == "{'windows': True, 'mac': True, 'linux': True}":
        return "Windows, Mac, Linux"
    elif value == "{'windows': True, 'mac': True, 'linux': False}":
        return "Windows, Mac"
    elif value == "{'windows': False, 'mac': True, 'linux': False}":
        return "Mac"
    elif value == "{'windows': False, 'mac': False, 'linux': True}":
        return "Linux"
    elif value == "{'windows': True, 'mac': False, 'linux': True}":
        return "Windows, Linux"
    else:
        # Only remaining combination seen in the value counts above
        return "Mac, Linux"

def getPlatforms(store):
    store["platforms"] = store["platforms"].apply(getMyPlatform)
    return store

def getReleaseDate(value):
    # extractDict is a helper defined elsewhere in this notebook
    if extractDict(value,"coming_soon") == True:
        return False
    else:
        thisDate = extractDict(value, "date")
        return pd.to_datetime(thisDate)

def getMetacritic(value):
    # Returns the numeric score, or False when there is no usable Metacritic entry
    try:
        return int(ast.literal_eval(value)["score"])
    except (ValueError, SyntaxError, TypeError, KeyError):
        return False
    
def getCleanMetadata(store):
    # Keeping only type == "game" drops both the rows with missing metadata (NaN type) and the non-game entries
    store = store[store["type"]=="game"].copy()
    
    store = getMatureMetadata(store)
    
    extra = store.loc[:,["name","support_info", "header_image", "website",
                         "pc_requirements","mac_requirements","linux_requirements",
                         "background","screenshots","movies"]].copy()
    
    store = getPlatforms(store)
    
    store["metacritic"] = store["metacritic"].apply(getMetacritic)
    
    store["demos"] = ~store["demos"].isna()
    
    store["reviews"] = ~store["reviews"].isna()
    
    store["drm_notice"] = ~store["drm_notice"].isna()
    
    store["dlc"] = ~store["dlc"].isna()
    
    store["release_date"] = store["release_date"].apply(getReleaseDate)
    
    store["ext_user_account_notice"] = store["ext_user_account_notice"].fillna(False)
    
    store = store.rename(columns={"drm_notice":"drm"})
    
    store = store.drop(columns=["pc_requirements","mac_requirements","linux_requirements",
                         "background","screenshots","movies","support_info",
                               "website", "fullgame"])
    
    
    return store, extra

In [84]:
store, extra = getCleanMetadata(store)

In [85]:
store.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66605 entries, 10140 to 676480
Data columns (total 43 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66605 non-null  object 
 1   name                     66605 non-null  object 
 2   is_free                  66605 non-null  object 
 3   controller_support       66605 non-null  object 
 4   dlc                      66605 non-null  bool   
 5   detailed_description     66554 non-null  object 
 6   about_the_game           66552 non-null  object 
 7   short_description        66559 non-null  object 
 8   fullgame                 0 non-null      float64
 9   supported_languages      66585 non-null  object 
 10  header_image             66605 non-null  object 
 11  legal_notice             19498 non-null  object 
 12  drm                      66605 non-null  bool   
 13  ext_user_account_notice  1124 non-null   object 
 14  developers       

The most noticeable thing is that we got missing 

### Reviews/Recommendations

This is one of the most important features we want. How many reviews / recommendations does a game have in total (which might give us insight into how many copies were sold), and has the game been received positively?

In [86]:
store["recommendations"]

id
10140                NaN
10240                NaN
11050     {'total': 166}
11230     {'total': 310}
12430     {'total': 101}
               ...      
673400               NaN
673730               NaN
674010               NaN
675230               NaN
676480               NaN
Name: recommendations, Length: 66605, dtype: object

This column is the total number of user reviews on Steam. There are some NaNs, and the Steam Store does not show the same numbers. We have to tackle this column together with the positive/negative reviews from Steam Spy.

Here we must make a decision. Steam does not give us any metric of success except the number of total reviews. From Steam Spy, we can calculate the ratio of positive / total number of reviews. Should we forget about this column from Steam, and instead only keep records with positive/negative information from Steam Spy?
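
As an illustration of that ratio, here is a minimal sketch; `positive_share` is a hypothetical helper, and returning None for games without any reviews is an assumption of this sketch rather than a decision made in the notebook.

```python
def positive_share(positive, negative):
    """Return the percentage of positive reviews, or None when there are no reviews."""
    total = positive + negative
    if total == 0:
        return None  # no reviews at all: a ratio would be meaningless
    return 100 * positive / total

# Made-up counts: 90 positive and 10 negative reviews
example_share = positive_share(90, 10)
print(example_share)
```

With the `spy` dataframe this could be applied row by row on the `positive` and `negative` columns.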

In [87]:
store["reviews"]

id
10140     False
10240     False
11050      True
11230     False
12430     False
          ...  
673400    False
673730    False
674010    False
675230    False
676480     True
Name: reviews, Length: 66605, dtype: bool

In [88]:
store["metacritic"]

id
10140     False
10240     False
11050        70
11230     False
12430     False
          ...  
673400    False
673730    False
674010    False
675230    False
676480    False
Name: metacritic, Length: 66605, dtype: object

In [89]:
getSteamLink(store.head())

3D Ultra™ Minigolf Adventures https://store.steampowered.com/app/10140
A Stroke of Fate: Operation Valkyrie https://store.steampowered.com/app/10240
Dracula: Origin https://store.steampowered.com/app/11050
Gumboy Tournament https://store.steampowered.com/app/11230
SlamIt Pinball Big Score https://store.steampowered.com/app/12430


I am just going to check Dracula: Origin to understand what these fields are telling us.

From `reviews`, it seems the game has a professional review listed on its store page, presumably hand-picked because it is a positive one. It might be worth turning this into a True / False column (has professional reviews), as we cannot get the score from it.

The same game shows a `metacritic` score, even if it is not a very good one. We might want to parse this into the numeric score, or False if the game does not have a Metacritic score.

Let's get back to user reviews in Steam.

In [90]:
cs = spy.loc[10]
print(f"{cs['name']} +{cs['positive']}/{cs['positive'] + cs['negative']}")

Counter-Strike +193046/197986


In [91]:
print(store.loc[10]["name"]+" "+str(store.loc[10]["recommendations"]))

Counter-Strike {'total': 118156}


In [92]:
getSteamLink(store[store["name"].str.contains("Counter-Strike")])

Counter-Strike https://store.steampowered.com/app/10
Counter-Strike: Condition Zero https://store.steampowered.com/app/80
Counter-Strike: Source https://store.steampowered.com/app/240
Counter-Strike: Global Offensive https://store.steampowered.com/app/730
Counter-Strike Nexon: Studio https://store.steampowered.com/app/273110


Looking at Steam directly, the total number of recommendations from the Steam Store is accurate (around 119k). So why is there such a huge divergence for this game, one of the first available on Steam? Maybe some accounts were disabled over time, or Steam redid its review system at some point... Let's compare with a newer game.

https://store.steampowered.com/app/548430/Deep_Rock_Galactic/

On the store page, 97% of 98,762 reviews are positive. This figure was read almost a month after the data was downloaded from the APIs.

In [93]:
drg = spy.loc[548430]
drg_total = drg["positive"] + drg["negative"]
print(f"{drg['name']} +{drg['positive']}/{drg_total} {drg['positive'] * 100 / drg_total}")

Deep Rock Galactic +117228/120917 96.94914693550122


In [94]:
print(store.loc[548430]["name"]+" "+str(store.loc[548430]["recommendations"]))

Deep Rock Galactic {'total': 96680}


At least the percentage matches.
Let's try a newer game, one released in late 2021. Checking the store page directly is less reliable here, as it will have gathered many newer reviews since the download.

https://store.steampowered.com/app/1551360/Forza_Horizon_5/
Store page: 86% of 59,855 reviews are positive.



In [95]:
fh5 = spy.loc[1551360]
fh5_total = fh5["positive"] + fh5["negative"]
print(f"{fh5['name']} +{fh5['positive']}/{fh5_total} {fh5['positive'] * 100 / fh5_total}")

Forza Horizon 5 +53761/62209 86.41997138677684


In [96]:
print(store.loc[1551360]["name"]+" "+str(store.loc[1551360]["recommendations"]))

Forza Horizon 5 {'total': 54837}


In any case, the metrics are similar, and the most important one (the percentage) can be obtained from Steam Spy and matches the Steam Store page. As we have a lot of NaNs from Steam, we will trust Steam Spy here, although it is puzzling that the numbers do not match. Maybe Steam Spy counts re-reviews as two different reviews?
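
The manual comparisons above repeat the same long `print` expression for each game. A minimal sketch of a helper that performs the same comparison for any app ID could look like this. The name `compare_review_counts` is hypothetical, the toy frames only mimic the `spy` and `store` columns used above (with the Deep Rock Galactic figures), and the sketch assumes `recommendations` holds dicts; if the column was loaded as strings, it would need parsing first.

```python
import pandas as pd


def compare_review_counts(spy, store, appid):
    """Return Steam Spy and Steam Store review figures for one app ID."""
    positive = int(spy.loc[appid, "positive"])
    negative = int(spy.loc[appid, "negative"])
    spy_total = positive + negative
    recommendations = store.loc[appid, "recommendations"]
    # Assumption: a dict such as {'total': 96680}, or NaN when Steam has no value
    store_total = recommendations["total"] if isinstance(recommendations, dict) else None
    return {
        "name": spy.loc[appid, "name"],
        "spy_positive": positive,
        "spy_total": spy_total,
        "spy_positive_pct": positive * 100 / spy_total if spy_total else None,
        "store_total": store_total,
    }


# Toy stand-ins for the real `spy` and `store` dataframes
spy_demo = pd.DataFrame(
    {"name": ["Deep Rock Galactic"], "positive": [117228], "negative": [3689]},
    index=pd.Index([548430], name="appid"),
)
store_demo = pd.DataFrame(
    {"name": ["Deep Rock Galactic"], "recommendations": [{"total": 96680}]},
    index=pd.Index([548430], name="id"),
)

summary = compare_review_counts(spy_demo, store_demo, 548430)
print(summary)
```

Returning a dict rather than printing makes it easy to build a small comparison table for several app IDs at once.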

Looking for an alternative, it seems possible to get the information from the Steam API directly. Initially I thought this would not be possible, as the endpoint is documented on the partner side. See this query, for example:

https://store.steampowered.com/appreviews/10?json=1&num_per_page=0&language=all

Note: it is important to put language=all, otherwise we get fewer results for the totals.
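
As an illustration of the query above, here is a rough sketch of how the URL could be built and how the totals could be pulled out of a decoded response. It is not the download code from the data collection section. The helper names are hypothetical, and the sketch assumes the JSON response carries the totals under a `query_summary` key with the same field names as the columns of `steamreviews_data.csv`. The sample payload is hand-written, not fetched.

```python
from urllib.parse import urlencode

APPREVIEWS_URL = "https://store.steampowered.com/appreviews/{appid}"


def build_reviews_url(appid):
    """Build the review-summary URL for one app ID."""
    # language=all is needed, otherwise the totals cover fewer reviews
    params = {"json": 1, "num_per_page": 0, "language": "all"}
    return APPREVIEWS_URL.format(appid=appid) + "?" + urlencode(params)


def extract_summary(appid, payload):
    """Flatten the assumed `query_summary` part of a decoded JSON response."""
    summary = payload.get("query_summary", {})
    fields = ["review_score", "review_score_desc",
              "total_positive", "total_negative", "total_reviews"]
    row = {"appid": appid}
    row.update({field: summary.get(field) for field in fields})
    return row


url = build_reviews_url(10)
print(url)

# Hand-written payload for illustration only
sample_payload = {
    "success": 1,
    "query_summary": {
        "num_reviews": 0,
        "review_score": 5,
        "review_score_desc": "Mixed",
        "total_positive": 59,
        "total_negative": 32,
        "total_reviews": 91,
    },
}
row = extract_summary(10140, sample_payload)
print(row)
```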

In the end, we went this route and documented everything in the data collection section. The result is the following dataset: `steamreviews_data.csv`

In [97]:
steamreviews = pd.read_csv('../data/download/steamreviews_data.csv')

In [98]:
steamreviews

Unnamed: 0,appid,review_score,review_score_desc,total_positive,total_negative,total_reviews
0,10140,5,Mixed,59,32,91
1,10240,5,Mixed,16,8,24
2,11050,5,Mixed,114,56,170
3,11230,6,Mostly Positive,224,84,308
4,12430,4,Mostly Negative,33,68,101
...,...,...,...,...,...,...
66798,673400,0,9 user reviews,1,8,9
66799,673730,0,1 user reviews,1,0,1
66800,674010,0,2 user reviews,0,2,2
66801,675230,0,No user reviews,0,0,0


Here we have several columns, but their information is interdependent. In reality, only `total_positive` and `total_negative` are independent values. What will be important for us later on?

* Total reviews: Gives us insight on number of sales
* Review score: Gives us insight on how well appreciated the game is.

So in that sense, we should only care about two columns. But since this is key information, we will keep `total_positive`, `total_negative` and `total_reviews`, and engineer a new column for the score.

Let's take SteamDB's advice on how to calculate the rating: https://steamdb.info/stats/gameratings/
In summary, games with very few reviews will tend towards a 50% rating, and for every tenfold increase in the number of reviews we become twice as sure that the actual ratio of positive/total reviews is correct.

So, roughly, a game with a single review sits at 50% regardless of whether the review is positive, and a game with 10 out of 10 positive reviews gets 50% x 0.5 + 100% x 0.5 = 75%. With 100 reviews (all positive) it would be 87.5%. These figures ignore the `+1` that the formula adds to the review count, so the computed values differ slightly for very small counts.
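
To make the arithmetic concrete, here is a small scalar sketch of the same formula that is applied to the dataframe further down. The function name `steamdb_rating` is hypothetical, and returning 50.0 for zero reviews is an assumption that mirrors how the missing ratings are filled later in this section. With the `+1` included, a single positive review scores about 59.4%, 10 out of 10 about 75.7%, and 100 out of 100 about 87.5%.

```python
import math


def steamdb_rating(positive, total):
    """SteamDB-style rating in percent; 50.0 when there are no reviews."""
    if total == 0:
        return 50.0
    score = positive / total
    # Uncertainty weight: halves every time (total + 1) grows tenfold
    weight = 2 ** (-math.log10(total + 1))
    # Pull the raw score towards 0.5 in proportion to the uncertainty
    return (score - (score - 0.5) * weight) * 100


for positive, total in [(1, 1), (10, 10), (100, 100), (59, 91)]:
    print(positive, total, round(steamdb_rating(positive, total), 1))
```

The last pair (59 positive out of 91) is the first row of `steamreviews`, so its result can be compared with the `rating` column computed below.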

Let's see if this might be unfair. Do we have many games below 50% pure rating?

In [99]:
steamreviews["review_score"].value_counts()

0    36853
8     8178
5     7630
7     6444
6     5331
4     1313
9      746
3      264
2       37
1        7
Name: review_score, dtype: int64

Most entries are unrated games (score 0, which at this point means fewer than 10 reviews). Then we have around 27.6k games between 5 and 8. There are only about 1.6k bad games (1 to 4), and most of them are only moderately bad (4). Finally, we have 746 excellent games (9).

So it seems that it is reasonable to use the algorithm above to calculate the rating.

In [100]:
# Share of positive reviews, pulled towards 50% when there are few reviews
score = steamreviews["total_positive"] / steamreviews["total_reviews"]
weight = np.power(2, -np.log10(steamreviews["total_reviews"] + 1))
steamreviews["rating"] = (score - (score - 0.5) * weight) * 100

In [101]:
steamreviews.head()

Unnamed: 0,appid,review_score,review_score_desc,total_positive,total_negative,total_reviews,rating
0,10140,5,Mixed,59,32,91,61.032103
1,10240,5,Mixed,16,8,24,60.342157
2,11050,5,Mixed,114,56,170,63.43013
3,11230,6,Mostly Positive,224,84,308,68.681559
4,12430,4,Mostly Negative,33,68,101,36.979205


In [102]:
steamreviews.sort_values(by="rating", ascending=False).head()

Unnamed: 0,appid,review_score,review_score_desc,total_positive,total_negative,total_reviews,rating
30898,620,9,Overwhelmingly Positive,227794,2760,230554,97.61687
14429,1118200,9,Overwhelmingly Positive,90690,946,91636,97.396651
38521,427520,9,Overwhelmingly Positive,107963,1222,109185,97.393155
15275,1145360,9,Overwhelmingly Positive,164786,2217,167003,97.369054
32114,105600,9,Overwhelmingly Positive,715943,14293,730236,97.217507


Let's see which games these are!

In [103]:
store.loc[steamreviews.sort_values(by="rating", ascending=False).head()["appid"]]

id,type,name,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,...,local_multiplayer,mmo,lan,cloud,trading_cards,leaderboards,workshop,in_app_purchases,level_editor,mature
620,game,Portal 2,False,full,True,Portal 2 draws from the award-winning formula ...,Portal 2 draws from the award-winning formula ...,The &quot;Perpetual Testing Initiative&quot; h...,,"English<strong>*</strong>, French<strong>*</st...",...,True,False,False,True,True,False,True,False,True,False
1118200,game,People Playground,False,none,False,"<img src=""https://cdn.cloudflare.steamstatic.c...","<img src=""https://cdn.cloudflare.steamstatic.c...","Shoot, stab, burn, poison, tear, vaporise, or ...",,English,...,False,False,False,True,True,False,True,False,False,True
427520,game,Factorio,False,none,True,<strong>Factorio</strong> is a game in which y...,<strong>Factorio</strong> is a game in which y...,Factorio is a game about building and creating...,,"English, French, Italian, German, Spanish - Sp...",...,False,False,True,True,False,False,False,False,True,False
1145360,game,Hades,False,full,True,"<img src=""https://cdn.cloudflare.steamstatic.c...","<img src=""https://cdn.cloudflare.steamstatic.c...",Defy the god of the dead as you hack and slash...,,"English<strong>*</strong>, French, Italian, Ge...",...,False,False,False,True,True,False,False,False,False,False
105600,game,Terraria,False,full,True,"Dig, Fight, Explore, Build: The very world is...","Dig, Fight, Explore, Build: The very world is...","Dig, fight, explore, build! Nothing is impossi...",,"English, French, Italian, German, Spanish - Sp...",...,False,False,False,True,True,False,False,False,False,False


This matches what we see at https://steamdb.info/stats/gameratings/ . The review counts are slightly different, but we decided to go with the reviews extracted directly from Steam.

In [104]:
steamreviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   appid              66803 non-null  int64  
 1   review_score       66803 non-null  int64  
 2   review_score_desc  66803 non-null  object 
 3   total_positive     66803 non-null  int64  
 4   total_negative     66803 non-null  int64  
 5   total_reviews      66803 non-null  int64  
 6   rating             47799 non-null  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 3.6+ MB


These operations produced a lot of NaNs (19,004 of 66,803 rows), caused by games with 0 total reviews, where positive/total is 0/0. Let's assign them a score of 50%, the midpoint. This is consistent with the algorithm above, which pulls games with few reviews towards 50%.

In [105]:
steamreviews["rating"] = steamreviews["rating"].fillna(50.0)
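
Before filling, it can be worth confirming that the missing ratings really come only from apps with zero reviews. The following is a sketch on a toy frame (two rows copied from the table above plus one app without reviews), not on the real `steamreviews` dataframe; the names `demo` and `nan_only_for_zero` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `steamreviews`
demo = pd.DataFrame(
    {"total_positive": [59, 33, 0], "total_reviews": [91, 101, 0]},
    index=pd.Index([10140, 12430, 675230], name="appid"),
)

score = demo["total_positive"] / demo["total_reviews"]  # 0/0 yields NaN
weight = np.power(2, -np.log10(demo["total_reviews"] + 1))
demo["rating"] = (score - (score - 0.5) * weight) * 100

# Every missing rating should belong to an app with zero reviews, and vice versa
nan_only_for_zero = bool((demo["rating"].isna() == demo["total_reviews"].eq(0)).all())
print(nan_only_for_zero)

# Same fill as in the cell above
demo["rating"] = demo["rating"].fillna(50.0)
print(demo)
```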

In [106]:
steamreviews = steamreviews.rename(columns={"appid":"id"})
steamreviews = steamreviews.set_index("id")

In [107]:
steamreviews = steamreviews.drop(["review_score", "review_score_desc"], axis=1)

In [108]:
store = store.join(steamreviews)
store = store.drop(columns=["recommendations"])
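
Note that `steamreviews` has 66803 rows while `store` has 66605. `DataFrame.join` defaults to a left join on the index, so review rows without a matching `store` ID are dropped, and any `store` ID without review data would get NaNs. A minimal sketch of that behaviour on toy frames (the frames and the ID 999999 are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for `store` and `steamreviews`, both indexed by app ID
store_demo = pd.DataFrame(
    {"name": ["Gumboy Tournament", "Hypothetical app"]},
    index=pd.Index([11230, 999999], name="id"),
)
reviews_demo = pd.DataFrame(
    {"total_reviews": [308, 101], "rating": [68.681559, 36.979205]},
    index=pd.Index([11230, 12430], name="id"),
)

# Left join on the index: keeps every row of store_demo, nothing else
joined = store_demo.join(reviews_demo)

# IDs from the left frame that found no review data
missing_ids = joined.index[joined["total_reviews"].isna()].tolist()
print(joined)
print(missing_ids)
```

Running the same `isna()` check on the real joined dataframe would show whether any app ended up without review information.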

In [109]:
store.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66605 entries, 10140 to 676480
Data columns (total 46 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66605 non-null  object 
 1   name                     66605 non-null  object 
 2   is_free                  66605 non-null  object 
 3   controller_support       66605 non-null  object 
 4   dlc                      66605 non-null  bool   
 5   detailed_description     66554 non-null  object 
 6   about_the_game           66552 non-null  object 
 7   short_description        66559 non-null  object 
 8   fullgame                 0 non-null      float64
 9   supported_languages      66585 non-null  object 
 10  header_image             66605 non-null  object 
 11  legal_notice             19498 non-null  object 
 12  drm                      66605 non-null  bool   
 13  ext_user_account_notice  1124 non-null   object 
 14  developers       

We still have two important features left: price and language. We will also need to do some further cleaning of the few NaNs scattered around. Let's start with price.

## Price

The price information is stored in `price_overview`. We also noticed that there is information in `package_groups`. Let's try getting the information from `price_overview` and see if we need anything extra from other sources.

In [123]:
print(store.iloc[0]["price_overview"])
print(store.iloc[1]["price_overview"])
print(store.iloc[2]["price_overview"])

{'currency': 'EUR', 'initial': 999, 'final': 999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '9,99€'}
{'currency': 'EUR', 'initial': 699, 'final': 699, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '6,99€'}
{'currency': 'EUR', 'initial': 999, 'final': 999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '9,99€'}
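
One possible way to turn these entries into numeric columns is sketched below. It assumes that `initial` and `final` are expressed in cents (999 next to '9,99€' suggests so) and that an entry may be either a dict or its string representation. The function name `parse_price` is hypothetical.

```python
import ast


def parse_price(raw):
    """Return currency, prices in currency units and discount, or None if unusable."""
    if isinstance(raw, str):
        try:
            raw = ast.literal_eval(raw)  # string such as "{'currency': 'EUR', ...}"
        except (ValueError, SyntaxError):
            return None
    if not isinstance(raw, dict):
        return None  # NaN or any other unexpected value
    return {
        "currency": raw.get("currency"),
        "initial": raw["initial"] / 100,  # assumption: values are in cents
        "final": raw["final"] / 100,
        "discount_percent": raw.get("discount_percent", 0),
    }


sample = ("{'currency': 'EUR', 'initial': 999, 'final': 999, 'discount_percent': 0, "
          "'initial_formatted': '', 'final_formatted': '9,99€'}")
price = parse_price(sample)
print(price)
print(parse_price(float("nan")))
```

Applied to the real column, something like `store["price_overview"].apply(parse_price)` would give one dict (or None) per app, which could then be expanded into separate columns.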


TODO: download again, or reuse the code below?
https://store.steampowered.com/api/appdetails/?appids=1608290

The URL above is an example request for a single ID. The download code expects an applist dataframe with the IDs for the Steam download, so we have to get the app IDs again from the previous dataframe (and remove the Steam Spy part).

In [110]:
# sns.pairplot(store,y_vars=["rating","total_reviews"])