This notebook analyses a Kaggle dataset of Steam store games, available at https://www.kaggle.com/nikdavis/steam-store-games.

In [1]:
# Importing standard packages for data exploration and processing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format

data = pd.read_csv('data/kaggle/2_steam.csv')
data.head()

,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object

In [3]:
data.describe()

,appid,english,required_age,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,price
count,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0
mean,596203.51,0.98,0.35,45.25,1000.56,211.03,149.8,146.06,6.08
std,250894.17,0.14,2.41,352.67,18988.72,4284.94,1827.04,2353.88,7.87
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,401230.0,1.0,0.0,0.0,6.0,2.0,0.0,0.0,1.69
50%,599070.0,1.0,0.0,7.0,24.0,9.0,0.0,0.0,3.99
75%,798760.0,1.0,0.0,23.0,126.0,42.0,0.0,0.0,7.19
max,1069460.0,1.0,18.0,9821.0,2644404.0,487076.0,190625.0,190625.0,421.99


In [4]:
data.groupby('owners').size()

owners
0-20000                18596
100000-200000           1386
1000000-2000000          288
10000000-20000000         21
100000000-200000000        1
20000-50000             3059
200000-500000           1272
2000000-5000000          193
20000000-50000000          3
50000-100000            1695
500000-1000000           513
5000000-10000000          46
50000000-100000000         2
dtype: int64

There is no missing data anywhere, which is good. However, several columns pack multiple semicolon-separated values into a single cell, so we will need to create dummy variables for them. The estimated number of owners is given as an interval rather than a single integer, so let us add another column holding the midpoint of each interval.

We will also derive a few more columns from existing features. Since we are primarily interested in the more popular games, let us drop all games with fewer than 100,000 estimated owners. This rules out roughly 86% of the dataset (the three lowest owner brackets above), but it keeps the analysis focused on games with enough ratings and playtime data to be meaningful, and the threshold can easily be lowered later on.
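The midpoint calculation and the popularity filter described above can also be written without a row-wise `apply`. The following is a minimal sketch on a toy frame, not the notebook's actual code: the three rows are illustrative values in the same `owners` format, and `min_owners` is a hypothetical name for the cut-off.

```python
import pandas as pd

# Toy frame using the same 'lower-upper' format as the dataset's owners column.
toy = pd.DataFrame({'owners': ['0-20000', '100000-200000', '10000000-20000000']})

# Split each interval into its two bounds and convert them to integers.
bounds = toy['owners'].str.split('-', expand=True).astype('int64')

# The midpoint of the interval serves as the owner estimate.
toy['estimated_owners'] = (bounds[0] + bounds[1]) // 2

# Keep only games at or above the popularity threshold.
min_owners = 100_000  # hypothetical name for the cut-off
popular = toy[toy['estimated_owners'] >= min_owners]
print(popular)
```

Keeping the threshold in a single variable makes it straightforward to lower it later and rerun the rest of the analysis.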

In [5]:
# Let us make it a bit more presentable.
data.columns = data.columns.str.capitalize()
data = data.rename(columns={'Appid': 'App_id', 'Steamspy_tags': 'Tags'})
data['App_id'] = data['App_id'].astype('str')

# Almost all apps are in English, which is our language of interest.
# Reassign rather than drop in place, so we never modify a filtered slice.
data = data[data['English'] == 1].drop(columns='English')

# Adding new columns.
data['Estimated_owners'] = data['Owners'].apply(lambda x: (int(x.split('-')[0]) + int(x.split('-')[1])) / 2)
data['Total_ratings'] = data['Positive_ratings'] + data['Negative_ratings']
data['Recommended'] = data['Positive_ratings'] / data['Total_ratings']
data['Playtime_proportion'] = data['Average_playtime'] / data['Median_playtime']
# Keep only the popular games; copy so later assignments act on an independent frame.
data = data[data['Estimated_owners'] >= 100000].copy()

# Making it pretty.
data['Estimated_owners'] = data['Estimated_owners'].astype('int')
data.reset_index(drop=True, inplace=True)
data = data[list(data.columns[:11]) + ['Total_ratings', 'Positive_ratings', 'Negative_ratings', 'Recommended'] +
            ['Average_playtime', 'Median_playtime', 'Playtime_proportion', 'Owners', 'Estimated_owners', 'Price']]
data.head()

,App_id,Name,Release_date,Developer,Publisher,Platforms,Required_age,Categories,Genres,Tags,...,Total_ratings,Positive_ratings,Negative_ratings,Recommended,Average_playtime,Median_playtime,Playtime_proportion,Owners,Estimated_owners,Price
0,10,Counter-Strike,2000-11-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,127873,124534,3339,0.97,17612,317,55.56,10000000-20000000,15000000,7.19
1,20,Team Fortress Classic,1999-04-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,3951,3318,633,0.84,277,62,4.47,5000000-10000000,7500000,3.99
2,30,Day of Defeat,2003-05-01,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,...,3814,3416,398,0.9,187,34,5.5,5000000-10000000,7500000,3.99
3,40,Deathmatch Classic,2001-06-01,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,...,1540,1273,267,0.83,258,184,1.4,5000000-10000000,7500000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,...,5538,5250,288,0.95,624,415,1.5,5000000-10000000,7500000,3.99


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3706 entries, 0 to 3705
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App_id               3706 non-null   object 
 1   Name                 3706 non-null   object 
 2   Release_date         3706 non-null   object 
 3   Developer            3706 non-null   object 
 4   Publisher            3706 non-null   object 
 5   Platforms            3706 non-null   object 
 6   Required_age         3706 non-null   int64  
 7   Categories           3706 non-null   object 
 8   Genres               3706 non-null   object 
 9   Tags                 3706 non-null   object 
 10  Achievements         3706 non-null   int64  
 11  Total_ratings        3706 non-null   int64  
 12  Positive_ratings     3706 non-null   int64  
 13  Negative_ratings     3706 non-null   int64  
 14  Recommended          3706 non-null   float64
 15  Average_playtime     3706 non-null   i

The only column containing missing values is 'Playtime_proportion', which makes sense. Many of these games are very old, and it is possible that nobody played them in the two weeks before this dataset was gathered. In such cases both playtime columns are 0, and dividing 0 by 0 gives NaN. We will need to separate those games from the others in any analysis that takes playtime into account. Before we do that, let us create the dummies.
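To make the division behaviour concrete, here is a small sketch with made-up playtime values (the column names mirror the dataset, the numbers do not come from it). It shows that pandas returns NaN for 0 / 0 and infinity for a positive value divided by 0, without raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical playtime values chosen to cover the three possible cases.
toy = pd.DataFrame({
    'average_playtime': [0, 624, 5],
    'median_playtime':  [0, 415, 0],
})

# Element-wise division: 0 / 0 -> NaN, 624 / 415 -> about 1.5, 5 / 0 -> inf.
toy['playtime_proportion'] = toy['average_playtime'] / toy['median_playtime']

# isnull() flags only the NaN row; the inf row passes a '> 0' comparison.
no_playtime = toy[toy['playtime_proportion'].isnull()]
with_playtime = toy[toy['playtime_proportion'] > 0]
print(toy)
```

Note that a `> 0` filter would keep an infinite ratio, so if a zero median with a non-zero average can occur in the data, `np.isinf` is one way to check for it.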

In [7]:
# Now on to the dummies.
dummy_columns = ['Platforms', 'Categories', 'Genres', 'Tags', 'Owners']
dummies = pd.DataFrame()
for column in dummy_columns:
    dummies = pd.concat([dummies, data[column].str.get_dummies(sep=';')])
dummies.head()

,linux,mac,windows,Captions available,Co-op,Commentary available,Cross-Platform Multiplayer,Full controller support,In-App Purchases,Includes Source SDK,...,100000-200000,1000000-2000000,10000000-20000000,100000000-200000000,200000-500000,2000000-5000000,20000000-50000000,500000-1000000,5000000-10000000,50000000-100000000
0,1.0,1.0,1.0,,,,,,,,...,,,,,,,,,,
1,1.0,1.0,1.0,,,,,,,,...,,,,,,,,,,
2,1.0,1.0,1.0,,,,,,,,...,,,,,,,,,,
3,1.0,1.0,1.0,,,,,,,,...,,,,,,,,,,
4,1.0,1.0,1.0,,,,,,,,...,,,,,,,,,,


In [8]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18530 entries, 0 to 3705
Columns: 309 entries, linux to 50000000-100000000
dtypes: float64(309)
memory usage: 43.8 MB


In [9]:
dummies.describe()

,linux,mac,windows,Captions available,Co-op,Commentary available,Cross-Platform Multiplayer,Full controller support,In-App Purchases,Includes Source SDK,...,100000-200000,1000000-2000000,10000000-20000000,100000000-200000000,200000-500000,2000000-5000000,20000000-50000000,500000-1000000,5000000-10000000,50000000-100000000
count,3706.0,3706.0,3706.0,3706.0,7412.0,3706.0,3706.0,3706.0,3706.0,3706.0,...,3706.0,3706.0,3706.0,3706.0,3706.0,3706.0,3706.0,3706.0,3706.0,3706.0
mean,0.3,0.41,1.0,0.04,0.08,0.01,0.08,0.27,0.07,0.01,...,0.37,0.08,0.01,0.0,0.34,0.05,0.0,0.14,0.01,0.0
std,0.46,0.49,0.0,0.21,0.27,0.12,0.28,0.44,0.26,0.08,...,0.48,0.27,0.08,0.02,0.47,0.22,0.03,0.35,0.11,0.02
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


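We now have 309 dummy features. Note that the frame has 18,530 rows rather than 3,706: `pd.concat` stacks frames row-wise by default, so each of the five source columns contributes its own block of 3,706 rows, with NaN wherever a feature does not belong to that block. A name that occurs in two source columns, such as 'Co-op', is therefore counted twice (7,412 non-null values). In addition, none of the feature names indicate which column they originated from. We will address that later.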
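One possible way to address both points is to concatenate column-wise and prefix every dummy with the name of its source column. This is only a sketch on a two-row toy frame with illustrative values, not necessarily the approach taken later in this notebook.

```python
import pandas as pd

# Toy frame with two multi-valued columns in the dataset's ';'-separated format.
toy = pd.DataFrame({
    'Platforms': ['windows;mac;linux', 'windows'],
    'Genres': ['Action', 'Action;Indie'],
})

# Build one block of dummies per source column and tag each with a prefix.
blocks = [
    toy[column].str.get_dummies(sep=';').add_prefix(f'{column}_')
    for column in ['Platforms', 'Genres']
]

# axis=1 places the blocks side by side, so the row count stays the same
# and no NaN padding is introduced.
dummies = pd.concat(blocks, axis=1)
print(dummies)
```

With prefixes, a value that appears in two source columns yields two distinct features instead of being merged under one name.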

In [10]:
data.drop(dummy_columns, axis=1, inplace=True)
played = data[data['Playtime_proportion'] > 0].copy()
unplayed = data[data['Playtime_proportion'].isnull()].copy()
unplayed.head()

,App_id,Name,Release_date,Developer,Publisher,Required_age,Achievements,Total_ratings,Positive_ratings,Negative_ratings,Recommended,Average_playtime,Median_playtime,Playtime_proportion,Estimated_owners,Price
28,1300,SiN Episodes: Emergence,2006-05-10,Ritual Entertainment,Ritual Entertainment,0,0,529,468,61,0.88,0,0,,150000,7.19
33,1630,Disciples II: Rise of the Elves,2006-07-06,Strategy First,Strategy First,0,0,559,451,108,0.81,0,0,,150000,4.99
50,2350,QUAKE III: Team Arena,2007-08-03,id Software,id Software,0,0,139,108,31,0.78,0,0,,350000,12.99
52,2370,HeXen: Deathkings of the Dark Citadel,2007-08-03,Raven Software,id Software,0,0,77,57,20,0.74,0,0,,150000,2.99
53,2390,Heretic: Shadow of the Serpent Riders,2007-08-03,Raven Software,id Software,0,0,446,417,29,0.93,0,0,,350000,2.99


In [11]:
unplayed.describe()

,Required_age,Achievements,Total_ratings,Positive_ratings,Negative_ratings,Recommended,Average_playtime,Median_playtime,Playtime_proportion,Estimated_owners,Price
count,347.0,347.0,347.0,347.0,347.0,347.0,347.0,347.0,0.0,347.0,347.0
mean,0.37,27.81,854.45,646.03,208.43,0.72,0.0,0.0,,204466.86,7.25
std,2.43,218.59,1068.69,884.32,312.65,0.18,0.0,0.0,,145208.4,8.86
min,0.0,0.0,19.0,8.0,5.0,0.17,0.0,0.0,,150000.0,0.0
25%,0.0,0.0,242.5,150.5,54.5,0.6,0.0,0.0,,150000.0,0.0
50%,0.0,0.0,481.0,356.0,104.0,0.75,0.0,0.0,,150000.0,4.99
75%,0.0,23.0,992.5,799.0,234.0,0.86,0.0,0.0,,150000.0,9.99
max,18.0,4034.0,7172.0,6770.0,2520.0,0.98,0.0,0.0,,1500000.0,69.99


As we can see, even the games with no playtime recorded for the past two weeks tend to have a fair number of ratings, most of them positive (a median of 481 ratings and a mean recommendation share of 0.72), as well as decent owner estimates (at least 150,000 each). These are probably games that were once popular but are now too old to attract and retain players.