# Steam Games: Data Cleaning Notebook

**Purpose:** Prepare a clean, analysis‑ready subset of the Steam Games dataset.

**Outputs:** `data/steam_games_cleaned.csv`



## 1. Setup & Config

- Set the path to the raw dataset (CSV).  
- If your file name differs, update `RAW_DATA_PATH`.  
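
If you are unsure which file name to use, a small helper can list the CSV files in a folder. This is a minimal sketch: `list_csvs` is a hypothetical helper that is not part of the notebook, and the demo runs in a temporary directory with made-up file names.

```python
from pathlib import Path
import tempfile


def list_csvs(data_dir: Path) -> list[Path]:
    """Return the CSV files directly inside data_dir, sorted by name."""
    return sorted(p for p in data_dir.glob("*.csv") if p.is_file())


# Demonstrate on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "games_raw.csv").write_text("AppID,Name\n")
    (tmp_dir / "notes.txt").write_text("not a dataset\n")
    found = [p.name for p in list_csvs(tmp_dir)]

print(found)
```

In the notebook, calling `list_csvs(Path("data"))` would show the candidate names to assign to `RAW_DATA_PATH`.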

In [None]:
import kagglehub
# Downloading the dataset
# path = kagglehub.dataset_download("fronkongames/steam-games-dataset")
# print("Path to dataset files:", path)

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

# Configure paths 
RAW_DATA_PATH = Path("data/games_raw.csv")  
CLEAN_PATH = Path("data/steam_games_cleaned.csv")
FIXED_DATA_PATH = Path("data/games_fixed.csv") # <-- replace with your actual filename
RAW_DATA_PATH, CLEAN_PATH, FIXED_DATA_PATH

(PosixPath('data/games_raw.csv'),
 PosixPath('data/steam_games_cleaned.csv'),
 PosixPath('data/games_fixed.csv'))


## 2. Load & Inspect


In [194]:
# Load the dataset (check the same path that is actually read)
if not FIXED_DATA_PATH.exists():
    raise FileNotFoundError(f"Dataset not found at {FIXED_DATA_PATH}")

df_raw = pd.read_csv(FIXED_DATA_PATH)
print(df_raw.shape)
df_raw.head(3)

(111452, 40)


Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,Discount,DLC count,About the game,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,...,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,655370,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....,...,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,1732930,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...,...,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


In [None]:
df_raw.info()

In [None]:
# Summary statistics before cleaning
df_raw.describe()

In [None]:
df_raw.nunique()  # Count unique values in each column

In [196]:
print(df_raw.columns.tolist())

df_raw[["AppID","Name","Release date","Estimated owners","Peak CCU","Required age","Price","Discount","DLC count","About the game"]].head(3)

['AppID', 'Name', 'Release date', 'Estimated owners', 'Peak CCU', 'Required age', 'Price', 'Discount', 'DLC count', 'About the game', 'Supported languages', 'Full audio languages', 'Reviews', 'Header image', 'Website', 'Support url', 'Support email', 'Windows', 'Mac', 'Linux', 'Metacritic score', 'Metacritic url', 'User score', 'Positive', 'Negative', 'Score rank', 'Achievements', 'Recommendations', 'Notes', 'Average playtime forever', 'Average playtime two weeks', 'Median playtime forever', 'Median playtime two weeks', 'Developers', 'Publishers', 'Categories', 'Genres', 'Tags', 'Screenshots', 'Movies']


Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,Discount,DLC count,About the game
0,20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...
1,655370,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....
2,1732930,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...


# Drop NLP-specific / unwanted columns

`Header image`, `Website`, `Support url`, `Support email`, `Metacritic url`, `Screenshots`, `Movies`, `Score rank`, `Reviews`, `Notes`, `Supported languages`, `Full audio languages`

- Change numerical NaNs to 0
- Change categorical NaNs to 'Unknown'
- Drop duplicate AppIDs
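
The three rules above can be tried on a toy frame before running them on the full dataset. This is a minimal sketch with made-up values; only the column names are taken from the Steam dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the Steam dataset.
toy = pd.DataFrame({
    "AppID": [10, 10, 20, 30],
    "Achievements": [5.0, 5.0, np.nan, 12.0],
    "Genres": ["Action", "Action", None, "Indie"],
})

# Rule 1: numeric NaNs become 0.
toy["Achievements"] = toy["Achievements"].fillna(0)
# Rule 2: categorical NaNs become the literal string 'Unknown'.
toy["Genres"] = toy["Genres"].fillna("Unknown")
# Rule 3: keep only the first row for each AppID.
toy = toy.drop_duplicates(subset=["AppID"]).reset_index(drop=True)

print(toy)
```

Filling with 0 treats "not recorded" as "none", which is a modelling choice rather than a neutral default, so it is worth keeping in mind when averaging these columns later.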

In [None]:
# Change numerical NaNs to 0
# Change categorical NaNs to 'Unknown'
# Drop duplicate AppIDs


# Drop irrelevant / sparse columns.
# "Developers" is kept: it is filled with 'Unknown' further down.
drop_cols = [
    "Supported languages", "Full audio languages", "Reviews", "Release date",
    "Header image", "Website", "Support url", "Support email",
    "Metacritic url", "Notes", "Screenshots", "Movies", "Score rank",  # NLP specific
]

# drop_cols =[]

df = df_raw.copy()

# Parse release year from release date; unparseable dates become NaT and are dropped below
df["Release year"] = pd.to_datetime(df["Release date"], errors="coerce").dt.year

df = df.drop(columns=drop_cols, errors="ignore")

# Handle nulls
# Drop rows missing critical identifiers
df = df.dropna(subset=["AppID", "Name", "Release year", "Price"])

df["Release year"] = df["Release year"].astype(int)

# Fill numeric NaNs with 0 
num_fill_zero = ["Achievements", "Recommendations", 
                 "Average playtime forever", "Median playtime forever"]
for col in num_fill_zero:
    if col in df.columns:
        df[col] = df[col].fillna(0)

# Fill categorical NaNs with 'Unknown'
cat_fill_unknown = ["Genres", "Categories", "Tags", "Developers", "Publishers"]
for col in cat_fill_unknown:
    if col in df.columns:
        df[col] = df[col].fillna("Unknown")

df = df.drop_duplicates(subset=["AppID"])
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 111315 entries, 0 to 111451
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111315 non-null  int64  
 1   Name                        111315 non-null  object 
 2   Estimated owners            111315 non-null  object 
 3   Peak CCU                    111315 non-null  int64  
 4   Required age                111315 non-null  int64  
 5   Price                       111315 non-null  float64
 6   Discount                    111315 non-null  int64  
 7   DLC count                   111315 non-null  int64  
 8   About the game              104838 non-null  object 
 9   Windows                     111315 non-null  bool   
 10  Mac                         111315 non-null  bool   
 11  Linux                       111315 non-null  bool   
 12  Metacritic score            111315 non-null  int64  
 13  User score         
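
The year-parsing step in the cell above relies on `errors="coerce"` turning unparseable dates into `NaT`, which `dropna` then removes before the integer cast. A minimal sketch on made-up dates (the third value is deliberately invalid):

```python
import pandas as pd

# Hypothetical sample; the first two strings follow the "Oct 21, 2008" style seen in the data.
toy = pd.DataFrame({"Release date": ["Oct 21, 2008", "Oct 12, 2017", "not a date"]})

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(toy["Release date"], errors="coerce")

# .dt.year yields NaN for NaT, so the column is float until the gaps are dropped.
toy["Release year"] = parsed.dt.year
toy = toy.dropna(subset=["Release year"])
toy["Release year"] = toy["Release year"].astype(int)

print(toy["Release year"].tolist())
```

The cast to `int` has to come after `dropna`, because a column that still contains NaN cannot be converted to a plain integer dtype.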

In [198]:
missing_values_count = df_raw.isnull().sum()
missing_values_count[0:40]  # Display counts of missing values for the first 40 columns

AppID                              0
Name                               6
Release date                       0
Estimated owners                   0
Peak CCU                           0
Required age                       0
Price                              0
Discount                           0
DLC count                          0
About the game                  6483
Supported languages                0
Full audio languages               0
Reviews                       100828
Header image                       0
Website                        64994
Support url                    60693
Support email                  19025
Windows                            0
Mac                                0
Linux                              0
Metacritic score                   0
Metacritic url                107447
User score                         0
Positive                           0
Negative                           0
Score rank                    111408
Achievements                       0
R

In [199]:
total_cells = np.prod(df_raw.shape)
print(total_cells)
total_missing = missing_values_count.sum()
# Percent of all cells that are missing
percent_missing = (total_missing / total_cells) * 100
print(percent_missing)

4458080
14.40936456950077
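
The overall percentage hides how unevenly the gaps are spread: in the counts above, columns such as `Reviews`, `Metacritic url`, and `Score rank` are almost entirely empty while most others have no gaps. A per-column view makes that visible. This is a minimal sketch; `missing_report` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": counts / len(frame) * 100,
    })
    return report.sort_values("n_missing", ascending=False)


toy = pd.DataFrame({
    "Name": ["A", None, "C", "D"],
    "Reviews": [None, None, None, "Great"],
    "Price": [1.0, 2.0, 3.0, 4.0],
})

report = missing_report(toy)
# Same overall figure as in the notebook: missing cells divided by all cells.
overall_pct = toy.isna().sum().sum() / toy.size * 100

print(report)
print(overall_pct)
```

Running `missing_report(df_raw)` in the notebook would give the same breakdown for the real data.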


In [200]:
df_copy = df.copy()
missing_values_count1 = df_copy.isnull().sum()
missing_values_count1[0:40]  # Display counts of missing values for the first 40 columns

AppID                            0
Name                             0
Estimated owners                 0
Peak CCU                         0
Required age                     0
Price                            0
Discount                         0
DLC count                        0
About the game                6477
Windows                          0
Mac                              0
Linux                            0
Metacritic score                 0
User score                       0
Positive                         0
Negative                         0
Achievements                     0
Recommendations                  0
Average playtime forever         0
Average playtime two weeks       0
Median playtime forever          0
Median playtime two weeks        0
Developers                       0
Publishers                       0
Categories                       0
Genres                           0
Tags                             0
Release year                     0
dtype: int64

In [201]:
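# Save cleaned data (create the output folder first if it does not exist)
CLEAN_PATH.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(CLEAN_PATH, index=False)
print(f"Cleaned data saved to {CLEAN_PATH}")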

Cleaned data saved to data/steam_games_cleaned.csv


In [202]:

total_cells1 = np.prod(df_copy.shape)
print(total_cells1)
total_missing1 = missing_values_count1.sum()
# percent of data that is missing
percent_missing = (total_missing1/total_cells1) * 100
print(percent_missing)

3116820
0.20780795811115174


In [None]:
# Print rows at positions 100-209, then show the first 100 game names
print(df.iloc[100:210])
df["Name"].head(100)

In [None]:
game_opinion = df[["Name", "Release year", "Estimated owners", "Required age", "Price",
                   "Recommendations", "User score", "Positive", "Negative", "Achievements"]]
game_opinion.head(200)

In [None]:
# Check ranges and summary stats of numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns
summary = df[num_cols].describe().T
summary["num_missing"] = df[num_cols].isna().sum()
summary["num_negatives"] = (df[num_cols] < 0).sum()
summary["num_zeros"] = (df[num_cols] == 0).sum()
summary

In [203]:
print(df.shape)
df[["AppID","Name","Release year","Price"]].head()
df.isna().sum().sort_values(ascending=False).head(10)

(111315, 28)


About the game      6477
AppID                  0
Estimated owners       0
Name                   0
Peak CCU               0
Required age           0
Discount               0
Price                  0
DLC count              0
Windows                0
dtype: int64

In [None]:
# display(df_raw.head(3))
# Candidate keys
for key in [["AppID"], ["Name"], ["Publishers"], ["Developers"]]:
    if all(c in df_copy.columns for c in key):
        dup = df_copy.duplicated(subset=key).sum()
        print(f"Duplicates on {key}: {dup}")
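
When reading these counts, note that `duplicated()` flags only the repeats after the first occurrence, so the number printed is smaller than the number of rows involved. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows: three different AppIDs share one title.
toy = pd.DataFrame({
    "AppID": [1, 2, 3, 4],
    "Name": ["Chess", "Chess", "Golf", "Chess"],
})

# Default keep="first": every repeat after the first occurrence is flagged.
n_repeats = toy.duplicated(subset=["Name"]).sum()

# keep=False flags every member of a duplicated group, which is handy for inspection.
all_members = toy[toy.duplicated(subset=["Name"], keep=False)]

print(n_repeats)
print(all_members)
```

Duplicates on `Name`, `Publishers`, or `Developers` are not necessarily errors, since distinct AppIDs may legitimately share those values; only `AppID` is used as the deduplication key in this notebook.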


## 3. Notes for README.md

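- **Source:** Steam games dataset (Kaggle/Steam API; Kaggle slug `fronkongames/steam-games-dataset`).   
- **Cleaning decisions:**   
  - Parsed `Release year` from `Release date`, then dropped `Release date`.  
  - Dropped rows missing any of `AppID`, `Name`, `Release year`, or `Price`.  
  - Dropped URL, media, and free-text columns not needed for analysis (for example `Website`, `Screenshots`, `Reviews`, `Notes`).  
  - Filled NaNs with 0 in `Achievements`, `Recommendations`, `Average playtime forever`, and `Median playtime forever`, and with `'Unknown'` in `Genres`, `Categories`, `Tags`, `Developers`, and `Publishers`.  
  - Removed duplicate rows by `AppID`.  
- **Result:** 111,315 rows × 28 columns (from 111,452 × 40); `About the game` still has 6,477 missing values.  
- **Output:** `data/steam_games_cleaned.csv`  
- EDA-ready columns include `Price`, `Positive`, `Negative`, `Genres`, `Release year`.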
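
The cleaning decisions listed above can be turned into a quick check on the saved file. This is a minimal sketch: `check_cleaned` and `REQUIRED` are hypothetical names that are not part of the notebook, and the demo frames contain only a few illustrative rows.

```python
import pandas as pd

# Critical fields taken from the dropna step; the helper itself is hypothetical.
REQUIRED = ["AppID", "Name", "Release year", "Price"]


def check_cleaned(frame: pd.DataFrame) -> list[str]:
    """Return a list of violated cleaning rules (an empty list means all checks pass)."""
    missing_cols = [c for c in REQUIRED if c not in frame.columns]
    if missing_cols:
        return [f"missing columns: {missing_cols}"]
    problems = []
    if frame[REQUIRED].isna().any().any():
        problems.append("NaN in a critical field")
    if not frame["AppID"].is_unique:
        problems.append("duplicate AppID")
    return problems


good = pd.DataFrame({
    "AppID": [20200, 655370],
    "Name": ["Galactic Bowling", "Train Bandit"],
    "Release year": [2008, 2017],
    "Price": [19.99, 0.99],
})
# Append a copy of the first row to simulate a duplicate AppID.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

print(check_cleaned(good))
print(check_cleaned(bad))
```

In practice the check would be run on `pd.read_csv("data/steam_games_cleaned.csv")` after the save step.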
