This notebook analyses the Steam Store Games dataset from Kaggle: https://www.kaggle.com/nikdavis/steam-store-games.

In [1]:
# Importing standard packages for data exploration and processing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


data = pd.read_csv('data/kaggle/2_steam.csv')
data.head()

,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object 
 17  price             27075 non-null  float64

In [3]:
data.describe()

,appid,english,required_age,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,price
count,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0
mean,596203.5,0.981127,0.354903,45.248864,1000.559,211.027147,149.804949,146.05603,6.078193
std,250894.2,0.136081,2.406044,352.670281,18988.72,4284.938531,1827.038141,2353.88008,7.874922
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,401230.0,1.0,0.0,0.0,6.0,2.0,0.0,0.0,1.69
50%,599070.0,1.0,0.0,7.0,24.0,9.0,0.0,0.0,3.99
75%,798760.0,1.0,0.0,23.0,126.0,42.0,0.0,0.0,7.19
max,1069460.0,1.0,18.0,9821.0,2644404.0,487076.0,190625.0,190625.0,421.99


There is no missing data anywhere, which is good. However, some columns hold several values per row (platforms, categories and tags at the very least), and the estimated number of owners is an interval rather than an integer. We might want to create dummy variables for those columns later, but let us first clean up the dataset a bit.
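As a rough sketch of what creating dummies for a multi-valued column could look like, the snippet below splits a ';'-separated column into one 0/1 indicator column per value using `Series.str.get_dummies`. The three sample rows and the `platform_` prefix are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny hand-made sample mimicking the ';'-separated columns (not the real data).
sample = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'platforms': ['windows;mac;linux', 'windows', 'windows;mac'],
})

# One indicator (0/1) column per distinct platform.
platform_dummies = sample['platforms'].str.get_dummies(sep=';')

# Prefix the new columns (hypothetical naming) and attach them to the frame.
encoded = sample.join(platform_dummies.add_prefix('platform_'))
print(encoded)
```

The same call should work for the categories and tags columns, although those contain many more distinct values and would produce a much wider frame.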

In [4]:
# Let us make it a bit more presentable.
data.columns = data.columns.str.capitalize()
data = data.rename(columns={'Appid': 'App_id'})

# Almost all apps are in English and that is our language of interest.
data = data[data['English'] == 1].drop(columns='English')
data.head()

,App_id,Name,Release_date,Developer,Publisher,Platforms,Required_age,Categories,Genres,Steamspy_tags,Achievements,Positive_ratings,Negative_ratings,Average_playtime,Median_playtime,Owners,Price
0,10,Counter-Strike,2000-11-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [5]:
data.groupby('Owners').size()

Owners
0-20000                18166
100000-200000           1373
1000000-2000000          287
10000000-20000000         21
100000000-200000000        1
20000-50000             3016
200000-500000           1268
2000000-5000000          192
20000000-50000000          3
50000-100000            1676
500000-1000000           513
5000000-10000000          46
50000000-100000000         2
dtype: int64

We might need an integer value for the number of game owners in the future. To that end, let us add a column with estimates equal to the midpoint of each interval, along with a few more columns derived from existing features. Moreover, since we are primarily interested in the more popular games, let us drop all games with fewer than 100,000 owners. This rules out most of the dataset but should make the analysis more robust, and the threshold can easily be lowered later on.
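The midpoint idea can also be sketched in vectorised form on a few interval strings. This is only an illustration under stated assumptions: the four sample values are hand-picked from the interval labels above, and integer division is used because every interval in this dataset has an integer midpoint.

```python
import pandas as pd

# Hand-picked interval labels in the same 'low-high' format as the Owners column.
owners = pd.Series(['0-20000', '100000-200000', '20000-50000', '50000-100000'])

# Split each interval once into two integer columns, then average them.
bounds = owners.str.split('-', expand=True).astype('int64')
midpoints = (bounds[0] + bounds[1]) // 2

# Sorting by midpoint restores numeric order, unlike sorting the raw strings.
ordered = owners.loc[midpoints.sort_values().index].tolist()

# Which intervals survive a 100,000-owner threshold on the midpoint.
kept = owners[midpoints >= 100000].tolist()
print(midpoints.tolist(), ordered, kept)
```

Note that the grouped counts above are sorted as strings, which is why '100000-200000' appears before '20000-50000'; ordering by the numeric midpoint avoids that.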

In [6]:
data['Estimated_owners'] = data['Owners'].apply(lambda x: (int(x.split('-')[0]) + int(x.split('-')[1])) / 2)
data['Total_ratings'] = data['Positive_ratings'] + data['Negative_ratings']
data['Recommended'] = data['Positive_ratings'] / data['Total_ratings']
data['Playtime_proportion'] = data['Average_playtime'] / data['Median_playtime']
# Copy the filtered frame so the assignments below do not trigger SettingWithCopyWarning.
data = data[data['Estimated_owners'] >= 100000].copy()

# Making it pretty.
data['Estimated_owners'] = data['Estimated_owners'].astype('int')
data = data.round(2)
data = data[list(data.columns[:11]) + ['Total_ratings', 'Positive_ratings', 'Negative_ratings', 'Recommended'] +
            ['Average_playtime', 'Median_playtime', 'Playtime_proportion', 'Owners', 'Estimated_owners', 'Price']]
data.head()

,App_id,Name,Release_date,Developer,Publisher,Platforms,Required_age,Categories,Genres,Steamspy_tags,...,Total_ratings,Positive_ratings,Negative_ratings,Recommended,Average_playtime,Median_playtime,Playtime_proportion,Owners,Estimated_owners,Price
0,10,Counter-Strike,2000-11-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,127873,124534,3339,0.97,17612,317,55.56,10000000-20000000,15000000,7.19
1,20,Team Fortress Classic,1999-04-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,3951,3318,633,0.84,277,62,4.47,5000000-10000000,7500000,3.99
2,30,Day of Defeat,2003-05-01,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,...,3814,3416,398,0.9,187,34,5.5,5000000-10000000,7500000,3.99
3,40,Deathmatch Classic,2001-06-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,1540,1273,267,0.83,258,184,1.4,5000000-10000000,7500000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,...,5538,5250,288,0.95,624,415,1.5,5000000-10000000,7500000,3.99


In [7]:
len(data)

3706

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3706 entries, 0 to 26951
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App_id               3706 non-null   int64  
 1   Name                 3706 non-null   object 
 2   Release_date         3706 non-null   object 
 3   Developer            3706 non-null   object 
 4   Publisher            3706 non-null   object 
 5   Platforms            3706 non-null   object 
 6   Required_age         3706 non-null   int64  
 7   Categories           3706 non-null   object 
 8   Genres               3706 non-null   object 
 9   Steamspy_tags        3706 non-null   object 
 10  Achievements         3706 non-null   int64  
 11  Total_ratings        3706 non-null   int64  
 12  Positive_ratings     3706 non-null   int64  
 13  Negative_ratings     3706 non-null   int64  
 14  Recommended          3706 non-null   float64
 15  Average_playtime     3706 non-null   int64  

The only column containing null values is 'Playtime_proportion', which makes sense. After all, many of these games are very old, and it is possible that nobody played them in the two weeks before this dataset was gathered. In such cases both playtime columns are 0, and dividing 0 by 0 gives a null (NaN) value. We will need to separate those games from the others in any analysis that takes playtime into account.
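A minimal sketch of that separation, assuming a boolean mask on the ratio column is enough for the later analysis. The three sample rows are invented for illustration, and the names `played` and `unplayed` are hypothetical.

```python
import pandas as pd

# Invented playtime values; the zero row mimics a game with no recorded playtime.
sample = pd.DataFrame({
    'Average_playtime': [624, 0, 187],
    'Median_playtime': [415, 0, 34],
})

# pandas does not raise on 0 / 0; the result is NaN.
sample['Playtime_proportion'] = sample['Average_playtime'] / sample['Median_playtime']

# Boolean mask that is True only where the ratio is defined.
has_playtime = sample['Playtime_proportion'].notna()
played = sample[has_playtime]
unplayed = sample[~has_playtime]
print(len(played), len(unplayed))
```

Keeping the two subsets as separate frames leaves the original data untouched, so the games without playtime can still be used in analyses that do not depend on it.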