# Steam Stats Exploratory Data Analysis

This notebook contains an exploratory data analysis of the Steam game statistics dataset.

## Table of Contents
1. [Data Loading](#data-loading)
2. [Data Overview](#data-overview)
3. [Data Cleaning](#data-cleaning)
4. [Exploratory Analysis](#exploratory-analysis)
5. [Visualizations](#visualizations)
6. [Key Findings](#key-findings)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Project-local helpers for loading and cleaning the dataset
from data.data_loader import load_steam_data, clean_steam_data

# Plot styling
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

## Data Loading

Load the Steam dataset downloaded from Kaggle. The CSV file (`games_march2025_full.csv` in the cell below) must be in the `data/raw/` directory before the cell runs.

In [None]:
data_path = Path.cwd().parent / 'data' / 'raw' / 'games_march2025_full.csv'
df = load_steam_data(data_path)

# Display basic information
print(f"Dataset shape: {df.shape}")
df.head()

Successfully loaded dataset with 94948 rows and 47 columns
Dataset shape: (94948, 47)


,appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,...,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent
0,730,Counter-Strike 2,2012-08-21,0,0.0,1,"For over two decades, Counter-Strike has offer...","For over two decades, Counter-Strike has offer...","For over two decades, Counter-Strike has offer...",,...,879,5174,350,0,1212356,"{'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...",86,8632939,82,96473
1,578080,PUBG: BATTLEGROUNDS,2017-12-21,0,0.0,0,"LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...","LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...",Play PUBG: BATTLEGROUNDS for free. Land on str...,,...,0,0,0,0,616738,"{'Survival': 14838, 'Shooter': 12727, 'Battle ...",59,2513842,68,16720
2,570,Dota 2,2013-07-09,0,0.0,2,"The most-played game on Steam. Every day, mill...","The most-played game on Steam. Every day, mill...","Every day, millions of players worldwide enter...",“A modern multiplayer masterpiece.” 9.5/10 – D...,...,1536,898,892,0,555977,"{'Free to Play': 59933, 'MOBA': 20158, 'Multip...",81,2452595,80,29366
3,271590,Grand Theft Auto V Legacy,2015-04-13,17,0.0,0,"When a young street hustler, a retired bank ro...","When a young street hustler, a retired bank ro...",Grand Theft Auto V for PC offers players the o...,,...,771,7101,74,0,117698,"{'Open World': 32644, 'Action': 23539, 'Multip...",87,1803832,92,17517
4,488824,Tom Clancy's Rainbow Six® Siege,2015-12-01,17,19.99,9,Edition Comparison Ultimate Edition The Tom Cl...,“One of the best first-person shooters ever ma...,"Tom Clancy's Rainbow Six® Siege is an elite, t...",,...,0,0,0,0,0,"{'FPS': 8082, 'Multiplayer': 6139, 'Tactical':...",84,1168404,76,13017
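
`load_steam_data` lives in the project's `data.data_loader` module and is not shown in this notebook. As a rough idea of what such a loader might do, the sketch below checks that the file exists before reading it. The name `load_csv_checked` and the error message are assumptions for illustration, not the project's actual implementation.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Hypothetical stand-in for load_steam_data: read a CSV after checking it exists."""
    path = Path(path)
    if not path.is_file():
        # Fail early with a hint instead of a bare pandas error
        raise FileNotFoundError(
            f"Dataset not found at {path}. Place the Kaggle CSV in data/raw/ first."
        )
    df = pd.read_csv(path)
    print(f"Successfully loaded dataset with {len(df)} rows and {df.shape[1]} columns")
    return df
```

Failing early on a missing file makes the "place your dataset in `data/raw/`" requirement explicit at the point where it matters.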


## Data Overview

Get an overview of the dataset structure and basic statistics.

In [3]:
# Basic dataset information
print("Dataset Info:")
df.info()

print("\nBasic Statistics:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94948 entries, 0 to 94947
Data columns (total 47 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   appid                     94948 non-null  int64  
 1   name                      94946 non-null  object 
 2   release_date              94948 non-null  object 
 3   required_age              94948 non-null  int64  
 4   price                     94948 non-null  float64
 5   dlc_count                 94948 non-null  int64  
 6   detailed_description      89522 non-null  object 
 7   about_the_game            89499 non-null  object 
 8   short_description         89599 non-null  object 
 9   reviews                   10428 non-null  object 
 10  header_image              94948 non-null  object 
 11  website                   41194 non-null  object 
 12  support_url               44185 non-null  object 
 13  support_email             78848 non-null  object 

,appid,required_age,price,dlc_count,metacritic_score,achievements,recommendations,user_score,score_rank,positive,...,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent
count,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,39.0,94948.0,...,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0,94948.0
mean,1707531.0,0.178287,6.911444,0.563203,2.763966,19.543729,1022.212,0.030975,99.128205,1217.905,...,108.6097,4.75732,108.3665,5.017926,4.307094,92.85272,44.630261,1448.044,5.32798,16.879871
std,926434.6,1.701329,13.071148,14.915685,14.111183,159.798834,22741.51,1.569178,0.695076,30979.74,...,6620.827,175.961001,8555.995,184.244795,16.111535,5554.794,40.837047,35481.41,22.460691,459.114933
min,20.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0
25%,887337.5,0.0,0.99,0.0,0.0,0.0,0.0,0.0,99.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0
50%,1591145.0,0.0,3.99,0.0,0.0,2.0,0.0,0.0,99.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,58.0,15.0,-1.0,-1.0
75%,2491702.0,0.0,9.99,0.0,0.0,19.0,0.0,0.0,100.0,51.0,...,0.0,0.0,0.0,0.0,0.0,0.0,84.0,80.0,-1.0,-1.0
max,3570420.0,21.0,999.98,3427.0,97.0,9821.0,4401572.0,100.0,100.0,7480813.0,...,1462997.0,18568.0,1462997.0,18568.0,100.0,1212356.0,100.0,8632939.0,100.0,96473.0


In [14]:
# Check for missing values
print("Missing Values:")
missing_data = df.isnull().sum()
missing_data[missing_data > 0].sort_values(ascending=False)

Missing Values:


score_rank              94909
metacritic_url          91372
reviews                 84520
notes                   78296
website                 53754
support_url             50763
support_email           16100
about_the_game           5449
detailed_description     5426
short_description        5349
name                        2
dtype: int64
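
Raw counts are easier to judge next to the share of rows they represent. For example, `score_rank` is missing in 94909 of 94948 rows, or about 99.96%. A minimal sketch of a helper that reports both figures follows; `missing_summary` is a hypothetical name, not part of the project code.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(df)` on the loaded frame would list the same columns as the output above, with an extra percentage column.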

## Data Cleaning

Clean and preprocess the data for analysis.

In [17]:
# Clean the dataset
df_clean = clean_steam_data(df)

print(f"Original dataset: {len(df)} rows")
print(f"Cleaned dataset: {len(df_clean)} rows")
print(f"Rows removed: {len(df) - len(df_clean)}")

Data cleaned: 94948 rows remaining after preprocessing
Original dataset: 94948 rows
Cleaned dataset: 94948 rows
Rows removed: 0
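
No rows were removed, but the summary statistics in the Data Overview section show a minimum of -1 for `pct_pos_total`, `num_reviews_total`, `pct_pos_recent`, and `num_reviews_recent`. If -1 is a placeholder for "no data" (an assumption; the dataset documentation should confirm it), leaving it in place would skew means and percentiles. The sketch below masks it as `NaN`; `mask_sentinels` is a hypothetical helper, not part of `clean_steam_data`.

```python
import pandas as pd

# Columns whose minimum is -1 in the describe() output (assumed sentinel)
SENTINEL_COLUMNS = [
    "pct_pos_total",
    "num_reviews_total",
    "pct_pos_recent",
    "num_reviews_recent",
]


def mask_sentinels(frame, columns=SENTINEL_COLUMNS, sentinel=-1):
    """Return a copy of frame with the sentinel value replaced by NaN."""
    out = frame.copy()
    for col in columns:
        if col in out.columns:
            # mask() sets matching entries to NaN; integer columns become float
            out[col] = out[col].mask(out[col] == sentinel)
    return out
```

Working on a copy keeps the original frame available for comparing statistics before and after masking.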


## Exploratory Analysis

Perform detailed exploratory analysis of the Steam data.

In [None]:
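# TODO: Add specific analysis based on your dataset columns
# Examples:
# - Game price analysis
# - Genre popularity
# - Release date trends
# - Rating analysis
# - Platform analysis

# List the available columns to pick analysis targets
print("Column names in the dataset:")
print(df.columns.tolist())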

Column names in the dataset:
['appid', 'name', 'release_date', 'required_age', 'price', 'dlc_count', 'detailed_description', 'about_the_game', 'short_description', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'windows', 'mac', 'linux', 'metacritic_score', 'metacritic_url', 'achievements', 'recommendations', 'notes', 'supported_languages', 'full_audio_languages', 'packages', 'developers', 'publishers', 'categories', 'genres', 'screenshots', 'movies', 'user_score', 'score_rank', 'positive', 'negative', 'estimated_owners', 'average_playtime_forever', 'average_playtime_2weeks', 'median_playtime_forever', 'median_playtime_2weeks', 'discount', 'peak_ccu', 'tags', 'pct_pos_total', 'num_reviews_total', 'pct_pos_recent', 'num_reviews_recent']
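
Two of the TODO items, release date trends and price analysis, can be started with only the `release_date` and `price` columns. The following is a minimal sketch; the helper names are hypothetical, and it assumes `release_date` holds date strings such as `2012-08-21`, as in the preview above.

```python
import pandas as pd


def releases_per_year(frame, date_col="release_date"):
    """Count games per release year; unparseable dates are dropped."""
    dates = pd.to_datetime(frame[date_col], errors="coerce")
    years = dates.dt.year.dropna().astype(int)
    return years.value_counts().sort_index()


def free_vs_paid(frame, price_col="price"):
    """Count free (price == 0) versus paid games."""
    is_free = frame[price_col] == 0
    return is_free.map({True: "free", False: "paid"}).value_counts()
```

For example, `releases_per_year(df_clean)` returns a Series indexed by year that can be passed directly to `.plot()`.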


## Visualizations

Create visualizations to better understand the data patterns.

In [None]:
# TODO: Uncomment and modify based on your dataset columns
# plot_price_distribution(df_clean, 'price')
# plot_genre_popularity(df_clean, 'genres')
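
`plot_price_distribution` and `plot_genre_popularity` are not defined or imported anywhere in this notebook. One possible shape for them is sketched below under two assumptions: prices are clipped at an arbitrary 60 so the long tail (the maximum is 999.98) does not flatten the histogram, and each `genres` cell is a list or the string form of one, such as `"['Action', 'Indie']"`. Neither assumption is confirmed by the output above.

```python
import ast
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd


def plot_price_distribution(frame, price_col="price", max_price=60.0, ax=None):
    """Histogram of prices, clipped at max_price (an arbitrary cutoff)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    prices = frame[price_col].dropna().clip(upper=max_price)
    ax.hist(prices, bins=30)
    ax.set_xlabel(f"Price (clipped at {max_price:g})")
    ax.set_ylabel("Number of games")
    ax.set_title("Price distribution")
    return ax


def count_genres(frame, genre_col="genres"):
    """Count genre labels, assuming each cell is a list or its string form."""
    counts = Counter()
    for cell in frame[genre_col].dropna():
        # literal_eval raises on malformed strings; inspect the column first
        labels = ast.literal_eval(cell) if isinstance(cell, str) else cell
        counts.update(labels)
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)


def plot_genre_popularity(frame, genre_col="genres", top_n=15, ax=None):
    """Horizontal bar chart of the top_n most frequent genres."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    top = count_genres(frame, genre_col).head(top_n)
    # Reverse so the most frequent genre appears at the top of the chart
    ax.barh(top.index[::-1], top.values[::-1])
    ax.set_xlabel("Number of games")
    ax.set_title(f"Top {len(top)} genres")
    return ax
```

Both functions return the Matplotlib `Axes`, so labels and limits can still be adjusted after the call.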

## Key Findings

Summarize the key insights from the exploratory data analysis:

1. **Finding 1**: Description of key insight
2. **Finding 2**: Description of key insight
3. **Finding 3**: Description of key insight

### Next Steps
- Data preprocessing for dashboard
- Feature engineering
- Dashboard development