# Steam Library Data Analysis
## Project outline
<center><img src="https://cdn.cloudflare.steamstatic.com/store/home/store_home_share.jpg"></center>


<p>Steam is a digital video game distribution service and storefront operated by Valve Corporation. Launched in 2003 and expanded in 2005 to distribute third-party titles, Steam quickly gained market traction and now serves millions of video game players as well as thousands of developers and publishers.</p>

<p>As Steam serves millions of users in the PC gaming market, data gathered from its storefront can help identify trends within that market, which can in turn inform future software development. This project aims to analyse and identify such trends using a dataset collected from the Steam Official Store by Gustavo Querede.</p>

## Data preparation
### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from tabulate import tabulate
import plotly.graph_objects as go

### Importing dataset

In [2]:
df= pd.read_csv('/kaggle/input/game-recommendations-on-steam/games.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50796 entries, 0 to 50795
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app_id          50796 non-null  int64  
 1   title           50796 non-null  object 
 2   date_release    50796 non-null  object 
 3   win             50796 non-null  bool   
 4   mac             50796 non-null  bool   
 5   linux           50796 non-null  bool   
 6   rating          50796 non-null  object 
 7   positive_ratio  50796 non-null  int64  
 8   user_reviews    50796 non-null  int64  
 9   price_final     50796 non-null  float64
 10  price_original  50796 non-null  float64
 11  discount        50796 non-null  float64
 12  steam_deck      50796 non-null  bool   
dtypes: bool(4), float64(3), int64(3), object(3)
memory usage: 3.7+ MB


In [3]:
data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

| | Sum of Data Type |
|---|---|
| bool | 4 |
| int64 | 3 |
| object | 3 |
| float64 | 3 |


**Summary**
<br>
Initial investigation into this dataset reveals that we have 50796 entries across 13 columns with 3 int, 3 object, 4 boolean and 3 float datatypes. For data analysis it will be important to:
1. Convert 'object' columns to 'string' columns and convert the date_release column to a datetime datatype
2. Double-check for any null values and/or duplicates


### 1. Conversion of datatypes

#### Converting object datatypes to string datatypes

In [4]:
object_select = df.select_dtypes(include = 'object').columns
df[object_select]=df[object_select].astype('string')

#### Converting the date_release column to the datetime datatype

In [5]:
df['date_release']= pd.to_datetime(df['date_release'])

#### Converted datatypes

In [6]:
data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

| | Sum of Data Type |
|---|---|
| bool | 4 |
| int64 | 3 |
| float64 | 3 |
| string | 2 |
| datetime64[ns] | 1 |


**Summary**
<p>Here we can see that we have successfully converted all object datatypes into strings. We have also converted the date_release column to datetime.</p>
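
Before relying on the converted column, it can be worth confirming that every date parsed. The following is a minimal sketch on hypothetical values rather than the real dataset. The `format="%Y-%m-%d"` argument is an assumption about how `date_release` is stored, and `raw_dates` is an illustrative name.

```python
import pandas as pd

# Hypothetical values, not rows from the dataset; the second entry is deliberately malformed
raw_dates = pd.Series(["2013-11-21", "21/11/2013", "2020-02-29"])

# format="%Y-%m-%d" is an assumption about how date_release is stored;
# errors="coerce" turns anything that does not match into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Keep the original strings that failed to parse so they can be inspected
unparsed = raw_dates[parsed.isna()]
print(f"{len(unparsed)} value(s) could not be parsed: {unparsed.tolist()}")
```

If nothing is reported as unparsed on the real column, the conversion can be treated as lossless.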

### 2a. Null Check

In [7]:
null_check = df.isnull().sum()

missing_data_table = pd.concat([null_check], axis = 1, keys = ['Total Missing Data'])

missing_data_table

| | Total Missing Data |
|---|---|
| app_id | 0 |
| title | 0 |
| date_release | 0 |
| win | 0 |
| mac | 0 |
| linux | 0 |
| rating | 0 |
| positive_ratio | 0 |
| user_reviews | 0 |
| price_final | 0 |


### 2b. Duplicate Check

In [8]:
duplicate_check = df.duplicated().sum()
print(f"There are {duplicate_check} duplicate rows within this dataset")

There are 0 duplicate rows within this dataset


**Summary**
<p>We can see from our data validation that we have no null values and no duplicate rows within our data.</p>
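
`df.duplicated()` only flags rows that are identical across every column. As a complementary check, the sketch below looks for repeated values in a key column. It runs on a small hypothetical frame (`sample`) rather than the real dataset, and it assumes that `app_id` is meant to identify each game uniquely.

```python
import pandas as pd

# Hypothetical sample, not taken from the dataset: two rows share an app_id
# but differ in title, so they are not full-row duplicates.
sample = pd.DataFrame({
    "app_id": [10, 20, 20, 30],
    "title": ["Game A", "Game B", "Game B (Demo)", "Game C"],
})

# Rows that are identical in every column
full_row_duplicates = sample.duplicated().sum()

# keep=False marks every row that shares its app_id with another row
repeated_ids = sample[sample.duplicated(subset="app_id", keep=False)]

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Rows sharing an app_id: {len(repeated_ids)}")
```

On the real data the same two lines could be run against `df`. A non-zero second count would suggest entries that need a closer look even though the full-row check passed.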

## Data Analysis
### Initial investigation
#### Top 20 rated Steam games released

In [9]:
top_games = df[(df['positive_ratio'] >= 90) & (df['rating'] == 'Overwhelmingly Positive')].sort_values(by=['user_reviews', 'positive_ratio'], ascending=[False, True]).head(20)
top_games[['title','user_reviews','positive_ratio']]

| | title | user_reviews | positive_ratio |
|---|---|---|---|
| 12942 | Terraria | 943413 | 97 |
| 15277 | Garry's Mod | 853733 | 96 |
| 12704 | The Witcher® 3: Wild Hunt | 668455 | 96 |
| 47688 | Wallpaper Engine | 637341 | 98 |
| 12618 | Left 4 Dead 2 | 574470 | 97 |
| 14372 | Stardew Valley | 505882 | 98 |
| 47444 | Euro Truck Simulator 2 | 494214 | 97 |
| 13107 | Phasmophobia | 486466 | 96 |
| 11624 | The Forest | 416113 | 95 |
| 48090 | Valheim | 356617 | 95 |


In [10]:
fig= px.scatter(top_games,x= 'title', y='price_final', title= 'Top 20 positive rated games on Steam', hover_name="title", hover_data={'title': False, 'date_release': True, 'user_reviews': True, 'positive_ratio': True}, labels={"title":"Title", "date_release":"Release Date", "price_final":"Price","positive_ratio": "Positive Ratio","user_reviews":"Total User Reviews"})
fig.update_yaxes(title_text='Price of game ($)')
fig.update_xaxes(title_text='Game Title')


fig.show()

**Summary**
<p>Here we can see the top 20 games released on Steam, drawn from games with a positive ratio of 90 or above and a rating of 'Overwhelmingly Positive', the highest rating possible. Games were then sorted by number of user reviews in descending order, with positive ratio used to break ties, so that only the most-reviewed titles in this group were displayed. This ensured that games with few user reviews were not present, as a small number of reviews often produces less reliable and skewed positive ratios and ratings.</p>
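
The concern about small review counts can also be expressed as a single score. One common option, which is not the method used in this notebook, is the lower bound of the Wilson score interval. It discounts a high positive ratio when it rests on few reviews. The sketch below is illustrative only: `wilson_lower_bound` and the `sample` frame are hypothetical, and it assumes every game has at least one review.

```python
import numpy as np
import pandas as pd

def wilson_lower_bound(positive_ratio, user_reviews, z=1.96):
    """Lower bound of the Wilson score interval for the share of positive reviews.

    positive_ratio is a percentage (0-100) and user_reviews is the review count,
    matching the column conventions of the dataset. z=1.96 gives a 95% interval.
    Assumes user_reviews > 0.
    """
    p = np.asarray(positive_ratio, dtype=float) / 100
    n = np.asarray(user_reviews, dtype=float)
    centre = p + z**2 / (2 * n)
    margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)

# Hypothetical games, not rows from the dataset
sample = pd.DataFrame({
    "title": ["Few reviews", "Many reviews"],
    "positive_ratio": [100, 97],
    "user_reviews": [10, 100_000],
})
sample["wilson_lb"] = wilson_lower_bound(sample["positive_ratio"], sample["user_reviews"])

# Rank by the lower bound rather than by the raw ratio
ranked = sample.sort_values("wilson_lb", ascending=False)
print(ranked)
```

Here the game with a perfect ratio but only 10 reviews ranks below the game with a 97% ratio and 100,000 reviews, which matches the intuition behind sorting by review count.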

#### Lowest 20 rated Steam games released

In [11]:
bottom_games = df[(df['positive_ratio'] <= 50) & (df['rating'].isin(['Overwhelmingly Negative', 'Negative','Mostly Negative']))].sort_values(by=['user_reviews', 'positive_ratio'], ascending=[False, True]).head(20)
bottom_games[['title','user_reviews','positive_ratio']]

| | title | user_reviews | positive_ratio |
|---|---|---|---|
| 14190 | Overwatch® 2 | 181198 | 9 |
| 37392 | Mirror 2: Project X | 110981 | 26 |
| 48774 | eFootball™ 2024 | 59071 | 37 |
| 50794 | PAYDAY 3 | 29458 | 38 |
| 3191 | War of the Three Kingdoms | 21276 | 15 |
| 27239 | Call of Duty®: Warzone™ 2.0 | 19770 | 34 |
| 48536 | MOBILE SUIT GUNDAM BATTLE OPERATION 2 | 15685 | 20 |
| 34045 | Cultivation Tales | 12299 | 37 |
| 34902 | DYNASTY WARRIORS 9 | 9363 | 34 |
| 12651 | Destiny 2: Lightfall | 7343 | 31 |


In [12]:
fig= px.scatter(bottom_games,x= 'title', y='price_final', title= 'Top 20 negative rated games on Steam', hover_name="title", hover_data={'title': False, 'date_release': True, 'user_reviews': True, 'positive_ratio': True}, labels={"title":"Game Title", "date_release":"Release Date", "price_final":"Price","positive_ratio": "Positive Ratio","user_reviews":"Total User Reviews"})
fig.update_yaxes(title_text='Price of game ($)')
fig.update_xaxes(title_text='Game Title')


fig.show()

**Summary**
<p>Here we can see the bottom 20 games released on Steam, drawn from games with a positive ratio of 50 or below and a rating of 'Overwhelmingly Negative', 'Mostly Negative' or 'Negative'. Games were then sorted in descending order of user reviews and ascending order of positive ratio, so that only the titles with the most user reviews and the worst positive ratios were displayed. As in the previous analysis, this ensured that games with few user reviews were not present, as a small number of reviews often produces less reliable and skewed positive ratios and ratings.</p>


### Descriptive Statistics of software pricing across each operating system (Windows, Mac, Linux)

In [13]:
windows_games = df[(df['win'] == True) & (df['price_final'] != 0)]
mac_games = df[(df['mac'] == True) & (df['price_final'] != 0)]
linux_games = df[(df['linux'] == True) & (df['price_final'] != 0)]

os_stats = {
    'Windows': windows_games['price_final'].agg(['mean','median','min','max']).round(2),
    'Mac': mac_games['price_final'].agg(['mean','median','min','max']).round(2),
    'Linux': linux_games['price_final'].agg(['mean','median','min','max']).round(2)
}

os_stats_df = pd.DataFrame(os_stats)
os_stats_final_df = os_stats_df.transpose()
print(tabulate(os_stats_final_df, headers= 'keys', tablefmt='simple_outline'))

┌─────────┬────────┬──────────┬───────┬────────┐
│         │   mean │   median │   min │    max │
├─────────┼────────┼──────────┼───────┼────────┤
│ Windows │  10.53 │     6.99 │  0.27 │ 299.99 │
│ Mac     │   9.96 │     6.99 │  0.27 │ 269.99 │
│ Linux   │   9.74 │     6.99 │  0.27 │ 199.99 │
└─────────┴────────┴──────────┴───────┴────────┘


**Summary**
<p>Steam hosts a storefront for 3 different operating systems. During this analysis, we split the data into three datasets for each operating system that Steam provides games for. The data was then investigated using descriptive statistics to find if any specific operating system was priced differently from others. Interestingly the mean and median of each operating system are quite similar to each other, suggesting that the operating system has little to no effect on game pricing.</p>
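
The per-platform split can also be written as a single loop over the platform flags. The sketch below is one possible compact form, run on a hypothetical `sample` frame that reuses the dataset's column names. The `count` column is an addition for illustration and is not part of the table above.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "win":   [True, True, True, False],
    "mac":   [True, False, True, True],
    "linux": [False, True, True, False],
    "price_final": [9.99, 19.99, 0.0, 4.99],
})

# Drop free items, as in the cell above
paid = sample[sample["price_final"] > 0]

# One row of statistics per platform. A game that supports several
# platforms contributes to each of them, so the groups overlap.
os_columns = {"Windows": "win", "Mac": "mac", "Linux": "linux"}
stats = pd.DataFrame({
    name: paid.loc[paid[col], "price_final"].agg(["count", "mean", "median", "min", "max"])
    for name, col in os_columns.items()
}).T.round(2)

print(stats)
```

Because most multiplatform games also run on Windows, the three groups share many of the same titles, which is one reason their medians can coincide.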

#### Game Rating Distribution 

In [14]:
rating_count = df['rating'].value_counts()
rating_count_df= pd.DataFrame(rating_count).reset_index()
rating_count_df.columns = ['Rating', 'Game Count']

fig= px.bar(rating_count_df, x='Game Count', y='Rating', color='Rating', text_auto= True, title= 'Game Rating Distribution', orientation='h')

fig.show()

**Summary**
<p>To analyse the game rating distribution, we isolated and counted all the game rating values. Overall, we can see that within this dataset games are on average rated more positively, with most games falling between 'Mixed' and 'Very Positive'. It was uncommon to find games rated at either extreme, with only 1110 games rated as 'Overwhelmingly Positive' and only 14 games as 'Overwhelmingly Negative'.</p>

<p>This distribution is expected, as the Steam rating scale splits games into three groups by review count (10-49 reviews, 50-499 reviews and 500+ reviews). For a game to reach 'Very Positive' or 'Overwhelmingly Positive' it must have at least 50 or 500 reviews respectively, with a positive review ratio of 80% or above. The same applies at the negative end, with a positive review ratio of 19% or less. While this explains why there are fewer games at either extreme, it does not explain why ratings on Steam skew positive overall.</p>
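
The review-count and ratio thresholds behind each label can be checked against the data rather than taken on trust. The sketch below shows one way to do that by summarising the observed ranges per rating. It uses a hypothetical `sample` frame with the dataset's column names, so the numbers it prints say nothing about the real thresholds.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "rating": ["Very Positive", "Very Positive", "Positive", "Overwhelmingly Positive", "Mixed"],
    "positive_ratio": [85, 92, 88, 97, 55],
    "user_reviews": [120, 4000, 25, 900, 300],
})

# Observed range of review counts and positive ratios for each rating label
rating_ranges = sample.groupby("rating").agg(
    min_reviews=("user_reviews", "min"),
    max_reviews=("user_reviews", "max"),
    min_ratio=("positive_ratio", "min"),
    max_ratio=("positive_ratio", "max"),
)
print(rating_ranges)
```

Running the same aggregation on `df` would show the smallest review count and lowest positive ratio that appear under each label in this dataset.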

#### Average cost of a Steam game per year

In [15]:
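df['Year'] = df['date_release'].dt.year
# Keep paid items released from 2012 onwards; free items are excluded so they do not pull the average down
df_from_2012 = df[(df['Year'] >= 2012) & (df['price_final'] != 0)]
mean_final_prices = df_from_2012.groupby('Year')['price_final'].mean().round(2).reset_index()


fig= px.line(mean_final_prices, x= 'Year', y='price_final', title='Average price of a game on Steam each year')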
fig.update_traces(mode="markers+lines", hovertemplate=None)
fig.update_layout(hovermode="x unified")
fig.update_yaxes(title_text='Average price of game ($)')
fig.update_xaxes(title_text='Year released')

fig.show()

**Summary**
<p>Understanding the average market price for a video game is important: to price their games correctly, developers need to know what players typically pay. From this chart we can see that video game pricing across Steam has remained relatively stable. Keep in mind that this data includes everything listed on Steam, from software to downloadable content (DLC). Items listed for free were removed from the analysis to avoid diluting the average.</p>
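
A yearly mean can be pulled upward by a handful of expensive items, so it can help to look at the median alongside it. The sketch below illustrates this on a hypothetical `sample` frame with free items excluded. The dates and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "date_release": pd.to_datetime([
        "2020-01-15", "2020-03-10", "2020-06-01", "2020-09-09",
        "2021-03-03", "2021-11-20",
    ]),
    "price_final": [0.0, 4.99, 9.99, 59.99, 9.99, 14.99],
})

# Exclude free items so they do not dilute the averages
paid = sample[sample["price_final"] > 0]

# Mean, median and number of paid items per release year
yearly_prices = (
    paid.groupby(paid["date_release"].dt.year)["price_final"]
    .agg(["mean", "median", "count"])
    .round(2)
    .rename_axis("Year")
)
print(yearly_prices)
```

In this sample the 2020 mean (24.99) is well above the median (9.99) because of a single high-priced item, which is the kind of gap worth checking for in the real data.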

#### Steam Games released per year

In [16]:
platform_counts_per_year = df.groupby(df['date_release'].dt.year)[['win', 'mac', 'linux']].sum()
platform_counts_per_year.reset_index(inplace=True)
platform_counts_per_year.rename(columns={'win': 'Windows Games', 'mac': 'Mac Games', 'linux': 'Linux Games'}, inplace=True)

fig = px.bar(platform_counts_per_year, x='date_release', y=['Windows Games','Mac Games' ,'Linux Games'], text_auto=True,labels={"variable":"Operating System", "value":"Total"}, hover_name="date_release", hover_data={'date_release': False})

fig.show()

**Summary**
<p>Operating system support within the desktop gaming industry has been a common concern for many Windows, Mac and Linux users. While Windows has historically been the go-to operating system for both game development and game optimisation, macOS and Linux have slowly increased their user bases to a stage where developers must now decide whether multiplatform support for their game is financially feasible. Over time we can see that game support for macOS has steadily increased alongside its user base, while Linux users have received less support each year since the dip in 2019.</p>
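
Raw counts per year rise and fall with the total number of releases, so a platform's share of each year's releases can be a useful complement. The sketch below computes that share on a hypothetical `sample` frame. It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "date_release": pd.to_datetime([
        "2018-02-01", "2018-07-01",
        "2019-03-01", "2019-05-01", "2019-08-01", "2019-12-01",
    ]),
    "win":   [True, True, True, True, True, True],
    "mac":   [True, False, True, True, False, False],
    "linux": [True, False, False, True, False, False],
})

year = sample["date_release"].dt.year.rename("Year")

# Percentage of each year's releases that support each platform
platform_share = (sample.groupby(year)[["win", "mac", "linux"]].mean() * 100).round(1)
print(platform_share)
```

In this sample the number of Linux releases is the same in both years, yet the Linux share halves from 50% to 25% because total releases doubled. A count-based chart alone would not show that.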