# COGS 108 - Data Checkpoint

## Authors
Team:
- Ashley Vo: Conceptualization, Writing – review & editing, Project administration, Data curation
- Dorje Pradhan: Conceptualization, Writing – original draft, Writing – review & editing, Data curation
- Kilhoon (Andy) Kim: Writing – original draft, Writing – review & editing, Data curation
- Kobe Wood: Data curation, Writing – original draft, Writing – review & editing
- Vy (Kiet) Dang: Background research, Writing – original draft, Data curation

## Research Question

How did the popularity of each different video game mode (singleplayer, multiplayer, online co-op) on PC change between the pre-COVID period (2018-2019), the COVID period (2020-2021), and post-COVID period (2022-2023) among the top 250 player-count Steam games from each year from each mode?

where we define **popularity** by the following metrics:
- Average concurrent player count over a given period 
- Peak player count over a given period

## Background and Prior Work

The global coronavirus outbreak in 2020, known as COVID-19, became a global pandemic that forced people to isolate and quarantine. During the lockdown period, people spent most of their time at home and turned to digital entertainment and video games as a way to socialize and de-stress. As a result, the gaming industry saw peaks in player activity, play time, sales, and stock values during this period.

In this study, we are interested in finding which games were the most popular in each game mode (single player, multiplayer, online co-op). Seeing which game modes people are most interested in could give the gaming industry a better understanding of its consumers' desires. At first, the group was interested in researching the top 50 games from each game mode across the pre-COVID period (2018-2019), the COVID era (2020-2021), and the post-COVID era (2022-2023), on multiple popular platforms such as Steam, Epic Games, Xbox, PlayStation, and Nintendo. **However**, not all platforms share their player statistics with the public. So, we narrowed our platform domain to Steam alone, because its player data is publicly available through sites such as "Steam Charts" and "SteamDB".

Several research papers have analyzed video game activity and pricing over time, such as:
- Aliev et al. (2025): These researchers investigated how the pandemic affected the prices and player reviews of mostly indie games on Steam. By analyzing SteamDB data, they found that player reviews and activity levels are highly correlated. This study supports our use of SteamDB as a reliable tool for our project. However, their work focused mostly on the indie category and treated AAA (a pricing category) as its own game category, whereas we want to look at the top games overall from each game mode.

- Şener et al. (2021): This paper investigated the broader economic impact of COVID-19 on the gaming industry and showed a significant rise in player activity on Steam during 2020. Their findings gave us a "baseline" for when player activity started to increase, but they did not cover other factors such as playtime, pricing dynamics, or game modes.

- Toledo (2021): Toledo used consumer **surveys** to study how gaming habits changed during lockdown. The surveys showed that games became a bridge to "online social life" during the quarantine period. This finding is why we emphasize comparing Online Co-op and Multiplayer modes against Single Player games: we want to see whether that online social trend lasted after the pandemic ended. These modes are harder to categorize because Steam tags can be misleading, often labeling a single game as both singleplayer and multiplayer.

Therefore, as curious gamers and data analysts, we decided to research which games were the most popular in each game mode on Steam before, during, and after the COVID-19 era.

References:
1. Aliev, A. R., Eyniyev, R., & Aliyev, T. A. (n.d.). Analyzing Price Dynamics, Activity of Players and Reviews of Popular Indie Games on Steam Post-COVID-19 Pandemic using SteamDB. https://www.mecs-press.org/ijitcs/ijitcs-v17-n3/IJITCS-V17-N3-3.pdf
2. Şener, Mehmet & Yalcin, Turkan & Gulseven, Osman. (2021). The Impact of COVID-19 on the Video Game Industry. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3766147
3. Toledo, M. (2021). Video Game Habits COVID-19. Journal of Marketing Management and Consumer Behavior, 3(4), 66–89. https://doi.org/10.2139/ssrn.3676004

## Hypothesis


We believe there will be a dramatic rise in the popularity of online co-op and multiplayer games during the COVID era, with some of that increase continuing after COVID. Our thinking is as follows: people were stuck inside and had largely lost the ability to connect with each other in person, so games that allowed online interaction became more appealing. In this study, popularity will be measured using average concurrent player count and peak player count, and we will examine these patterns within the top 250 Steam games across the pre-COVID period (2018-2019), COVID period (2020-2021), and post-COVID period (2022-2023). We also expect that, within the top 250, the number of games tagged as multiplayer or online co-op will increase during COVID compared with pre-COVID, though we recognize that tags are not mutually exclusive and a game may appear in more than one mode. 

## Data

### Data overview
#### Dataset 1: Steam250 - Top 250 Games Of Each Year
- Dataset name: Steam250 - Top 250 Games Of Each Year
- Link to the dataset:
  - 2018: https://steam250.com/2018 
  - 2019: https://steam250.com/2019
  - 2020: https://steam250.com/2020
  - 2021: https://steam250.com/2021
  - 2022: https://steam250.com/2022
  - 2023: https://steam250.com/2023
- Number of observations (per year): 250 
- Number of variables (per year): 5
- Important notes: Steam250 does not provide play mode data, so we will need to obtain and join this with another dataset, in our case, Dataset 2.

#### Dataset 2: Steam Games Dataset (Kaggle)
- Dataset name: Steam Games Dataset (Kaggle)
- Link to the dataset: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset
- Number of observations: 122611
- Number of variables: 39
- Important notes: This dataset serves as a supplementary dataset to provide playmode data for Dataset 1.

#### Dataset 3: Steam Charts Historical Player Activity
- Dataset name: Steam Charts game-level player activity (scraped/collected)
- Link to the dataset: https://steamcharts.com/
- Number of observations: 1463 (roughly the number of games × the number of months in 2018–2023, not double-counting repeat games)
- Number of variables: 8 (year, rank, name, appid, month, avg_players, peak_players, status)
- This dataset is the core time-series source for our project because it contains the two popularity metrics we can reliably measure across all periods: average concurrent players and peak concurrent players. In practical terms, average concurrent players captures the typical number of people actively playing a game at the same time during a month, while peak concurrent players captures the maximum simultaneous activity reached during that month. Both metrics are counts of players (not percentages), and both are useful: average concurrency reflects sustained engagement, while peak concurrency reflects major surges and maximum demand.
- Important notes: We first gather the list of the top 250 games from each year before pulling from SteamCharts, so that we only collect the data we need.
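
To make the difference between the two metrics concrete, here is a minimal sketch. The two games and their concurrent-player samples are invented purely for illustration (they are not SteamCharts data), and `summarize` is a hypothetical helper name. Both games end up with the same average, but very different peaks.

```python
from statistics import mean

# Hypothetical concurrent-player samples taken at regular intervals in one month.
# These numbers are invented for illustration only.
samples = {
    "steady_game": [1000, 1000, 1000, 1000],
    "spiky_game": [400, 400, 400, 2800],
}

def summarize(counts):
    """Return (average concurrent players, peak concurrent players)."""
    return mean(counts), max(counts)

summary = {name: summarize(counts) for name, counts in samples.items()}

for name, (avg_players, peak_players) in summary.items():
    print(f"{name}: avg={avg_players}, peak={peak_players}")
```

This is why we track both: the average alone would make these two hypothetical games look identical, while the peak alone would hide how sustained the activity is.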

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

# (!) Our Steam250 and SteamCharts data comes from scraping HTML, so the template download list below stays commented out.
import get_data # this is where we get the function we need to download data
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
# datafiles = [
#     { 'url': '', 'filename':''},
#     { 'url': '', 'filename':''}
# ]
# get_data.get_raw(datafiles,destination_directory='data/00-raw/')


# OUR IMPORTS
import csv
import time
import hashlib
from pathlib import Path
from typing import Optional
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Note: this will need to be run to get games.csv (Dataset 2) as the CSV is too large for GitHub.
get_data.main()

### Steam250 Data Collection
Steam250 is a site that displays a ranking of the top 250 games of each year. Although no CSV is provided, we can scrape the website to gather the data we need. It provides 5 variables of interest to us:
1. Rank is an integer that denotes the ranking of each game with 1 being the best of that year and 250 being the 250th best of the year.
2. AppID is an integer that uniquely identifies each and every game on Steam.
3. Name is a string that belongs to the name of the game.
4. Rating is a player-voted score. This is an integer but realistically acts as a percentage (e.g. 94 corresponds to 94%).
5. Number of votes is an integer that denotes the number of players that have voted for that game.

The biggest caveat we have to acknowledge is that our analysis is based on Steam250's definition of top games. However, after [reading their process](https://steam250.com/about), we concluded that it is a reasonable approach, since a ranking of "top games" isn't readily available elsewhere.

Let's load all of the CSVs so we can observe their structure en masse.

In [None]:
steam250_2018_df = pd.read_csv('data/00-raw/2018_top250.csv')
steam250_2019_df = pd.read_csv('data/00-raw/2019_top250.csv')
steam250_2020_df = pd.read_csv('data/00-raw/2020_top250.csv')
steam250_2021_df = pd.read_csv('data/00-raw/2021_top250.csv')
steam250_2022_df = pd.read_csv('data/00-raw/2022_top250.csv')
steam250_2023_df = pd.read_csv('data/00-raw/2023_top250.csv')

yearly_dfs = [steam250_2018_df, steam250_2019_df, steam250_2020_df, steam250_2021_df, steam250_2022_df, steam250_2023_df]

We want to make sure there aren't any missing values and our data types line up correctly.

In [None]:
for i in range(6):
    curr_df = yearly_dfs[i]

    print(
        '=' * 64, '\n',
        f'Current year data frame: {2018 + i}\n',
        '=' * 64, '\n',
        'Shape: ',
        curr_df.shape, '\n\n',
        'Any nulls?\n',
        curr_df.isna().any(), '\n\n',
        'Column types:\n',
        curr_df.dtypes, '\n\n',
        'First five rows of the data:\n',
        curr_df.head(5), '\n\n',
        sep=''
    )

The data looks as we expect!

### Steam Games Dataset
This dataset was provided by Martin Bustos on Kaggle. Bustos created the dataset using Steam's API and Steam Spy. It provides metadata on over 122,000 Steam games across 39 variables. However, we are only interested in three key variables; the rest are not relevant to our research question:
1. AppID is an integer value that uniquely identifies each and every game on Steam.
2. Name is a string that belongs to the name of the game.
3. Categories should be an array of strings. Each string corresponds to a characterization of the game such as "singleplayer" or "PvP."

As we'll discover, reading the CSV is a little troublesome because of a column mismatch. Since we're only concerned with the Categories this dataset provides for each game, we can safely disregard the other columns. We also experimented with other datasets, but they didn't cover all of the games we've gathered.

First, let's read in the CSV downloaded from Kaggle and get an idea of its size.

In [None]:

steam_games_df = pd.read_csv('./data/00-raw/games.csv', index_col=False)

steam_games_df.shape

Now, let's take a peek at what the data looks like:

In [None]:
steam_games_df.head()

Let's look at the column types.

In [None]:
steam_games_df.dtypes

This dataset is not clean. The columns don't line up with the values in the rows. For example, we see what looks like descriptions of games under "Supported Languages" rather than "About the game." In printing out the column types, we see some discrepancies like "Metacritic url" being an `int64` rather than an object. Luckily, we only care about a few variables in the dataset: AppID, Name, and Categories. We can ignore the rest. Note that the Categories values currently sit under the "Genres" column. We can fix that:

In [None]:
# Note: because of the column shift, 'Genres' actually holds the Categories values
steam_games_subset_df = (
    steam_games_df[['AppID', 'Name', 'Genres']]
    .rename(columns={'AppID': 'appid', 'Name': 'name', 'Genres': 'genres'})
)

steam_games_subset_df.head()
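
One plausible way a shift like this can arise is a header row that has fewer names than each data row has fields. This is an assumption we have not verified against the Kaggle file. The toy CSV below is hypothetical (the "Price" field and all values are made up) and uses only the standard `csv` module to show how values slide under the wrong labels when names and values are paired by position.

```python
import csv
import io

# Hypothetical miniature CSV: the header omits the "Price" name,
# so every value after "Name" sits one column to the left of its true label.
raw = (
    "AppID,Name,Categories,Genres\n"
    "10,Game A,9.99,Single-player,Action\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)
row = next(reader)

# zip() stops at the shorter sequence, so the trailing field is silently dropped.
labeled = dict(zip(header, row))
print(labeled)
```

In this toy case the play-mode value ends up under "Genres" and the real genre is lost, which resembles what we observe in the real dataset.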

Let's check for missing values.

In [None]:
steam_games_subset_df.isna().sum()

There appears to be one entry without a name. Let's find out what game that is and whether it should be cause for concern.

In [None]:
steam_games_subset_df[steam_games_subset_df['name'].isna()]

Interestingly enough, after [looking up the game on Steam](https://store.steampowered.com/app/396420/_/), it literally has no name, so this isn't something to worry about. Now, we can move on to merging datasets 1 and 2 together.

### Merging: Steam250 and Steam Games (Kaggle)

Now, we want to merge datasets 1 and 2. There are a few things we must fix first, such as adjusting the categories to the ones we need. We'll use 2018 as an example to find out which categories we need to map and whether there are any Python typing quirks in our data.

In [None]:
steam250_2018_df_merged = steam250_2018_df.merge(
    steam_games_subset_df,
    on='appid',
    how='left'
)

print('What is the type of the tags column?\n', type(steam250_2018_df_merged['genres'].iloc[0]), '\n')

Since it's a string, we should convert it to a list to easily parse it.

In [None]:
steam250_2018_df_merged['genres'] = steam250_2018_df_merged['genres'].fillna('').str.split(',')
steam250_2018_df_merged
# print('What is the type of the tags column now?\n', type(steam250_2018_df_merged['genres'].iloc[0]), '\n')

In [None]:
# See unique tags
unique_tags = (
    steam250_2018_df_merged['genres']
    .explode()
    .str
    .strip()
    .unique()
)

print('What are the unique values that can appear under categories?\n', unique_tags)

Let's create two things:
- A mapping that groups and normalizes similar play modes into the three of interest to our research question: singleplayer, multiplayer, and co-op
- A helper function that cleans up the tags in each row for us

In [None]:
tag_map = {
    'Single-player': 'singleplayer',

    'Multi-player': 'multiplayer',
    'MMO': 'multiplayer',
    'PvP': 'multiplayer',
    'Online PvP': 'multiplayer',
    'LAN PvP': 'multiplayer',
    'Shared/Split Screen PvP': 'multiplayer',
    'Cross-Platform Multiplayer': 'multiplayer',

    'Co-op': 'co-op',
    'Online Co-op': 'co-op',
    'LAN Co-op': 'co-op',
    'Shared/Split Screen Co-op': 'co-op',
}

In [None]:
def clean_tags(tag_list):
    if not isinstance(tag_list, list):
        return []

    cleaned = []

    for tag in tag_list:
        tag = tag.strip()

        if tag in tag_map:
            cleaned.append(tag_map[tag])

    # sorted() keeps the output order stable across runs (plain set order can vary)
    return sorted(set(cleaned))
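
As a quick illustration of the normalization idea, here is a self-contained sketch. `demo_tag_map` is a trimmed copy of the mapping above, and `normalize_modes` is a hypothetical stand-in that works directly on a comma-separated string; the example category string is made up.

```python
# Trimmed copy of the mapping above, kept small for illustration.
demo_tag_map = {
    "Single-player": "singleplayer",
    "Multi-player": "multiplayer",
    "Online PvP": "multiplayer",
    "Online Co-op": "co-op",
}

def normalize_modes(raw_categories: str) -> list:
    """Split a comma-separated category string and map it to play modes."""
    modes = set()
    for tag in raw_categories.split(","):
        tag = tag.strip()
        if tag in demo_tag_map:
            modes.add(demo_tag_map[tag])
    # sorted() gives a stable order; plain set order can vary between runs
    return sorted(modes)

example = "Single-player, Multi-player,Online PvP,Steam Achievements"
print(normalize_modes(example))   # ['multiplayer', 'singleplayer']
print(normalize_modes(""))        # []
```

Note that "Multi-player" and "Online PvP" collapse into a single `multiplayer` entry, and unrelated categories such as "Steam Achievements" are dropped.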

Now, we can adjust the tags for each top 250 per year in batch.

In [None]:
for i in range(6):
    curr_year_df = pd.read_csv(f'./data/00-raw/{2018 + i}_top250.csv')

    curr_merged_df = curr_year_df.merge(
        steam_games_subset_df,
        on='appid',
        how='left'
    )

    curr_merged_df = curr_merged_df[['rank', 'appid', 'name_x', 'num_votes', 'rating', 'genres']]
    curr_merged_df.columns = ['rank', 'appid', 'name', 'num_votes', 'rating', 'tags']

    # Convert the tags column to a list
    curr_merged_df['tags'] = (
        curr_merged_df['tags']
        .fillna('')
        .str.split(',')
    )

    curr_merged_df['tags'] = curr_merged_df['tags'].apply(clean_tags)

    # Save to the data/02-processed directory
    # (index=False avoids writing the row index as an extra unnamed column)
    curr_merged_df.to_csv(f'./data/02-processed/{2018 + i}_top250_final.csv', index=False)

Let's make sure all of our data is in order.

In [None]:
for i in range(6): 
    curr_df = pd.read_csv(f'./data/02-processed/{2018 + i}_top250_final.csv')
    print(
        '=' * 64, '\n',
        f'Current year data frame: {2018 + i}\n',
        '=' * 64, '\n',
        'Shape: ',
        curr_df.shape, '\n\n',
        'Any nulls?\n',
        curr_df.isna().any(), '\n\n',
        'Column types:\n',
        curr_df.dtypes, '\n\n',
        'First five rows of the data:\n',
        curr_df.head(5), '\n\n',
        sep=''
    )
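
One thing to keep in mind when reading these files back: a list stored in a DataFrame cell is written to CSV as its string form, so the `tags` column comes back as text such as `"['co-op', 'singleplayer']"` rather than as a list. Below is a minimal sketch of one way to recover the lists and count modes, using only the standard library. The sample cells are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical values as they would appear in the "tags" column after reading
# the CSV back: each cell is the string form of a Python list.
tags_as_text = [
    "['singleplayer']",
    "['co-op', 'multiplayer', 'singleplayer']",
    "[]",
]

# ast.literal_eval evaluates Python literals only (no arbitrary code execution)
tags_as_lists = [ast.literal_eval(cell) for cell in tags_as_text]

# A game can carry several modes, so it is counted once per mode it supports
mode_counts = Counter(mode for tags in tags_as_lists for mode in tags)
print(mode_counts)
```

Because tags are not mutually exclusive, these counts can add up to more than the number of games.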

And with that, everything looks as expected, so we can now look to SteamCharts for more granular data!

### SteamCharts Player Data Collection
For this project, the most relevant variables are: game identifier (name and appid), date (month and year), average concurrent players, and peak concurrent players. 
- A game name (a string) may not be unique against the entire Steam catalog, so there exists a Steam appid that uniquely identifies each game. This is a number, and it is expected every game has one. 
- Dates are going to have some format similar to `YYYY-MM` such that we can parse it to aggregate the monthly player data by the defined study periods.
- Average and peak concurrent players are numeric values that we expect to be greater than or equal to zero. Since we are looking at the top 250 Steam games, we expect these values to be well above zero.

We may later aggregate monthly values into three study periods: pre-COVID (2018–2019), COVID (2020–2021), and post-COVID (2022–2023). This allows direct period-to-period comparisons for each game and for groups of games by mode tags.
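
A minimal sketch of that aggregation, using only the standard library, is shown below. The monthly rows are invented for illustration, `period_for_month` is a hypothetical helper, and summarizing a period as the mean of monthly averages and the max of monthly peaks is one possible choice rather than a settled decision.

```python
from collections import defaultdict
from statistics import mean

# Study periods as defined in the research question
PERIODS = {
    "pre-COVID": (2018, 2019),
    "COVID": (2020, 2021),
    "post-COVID": (2022, 2023),
}

def period_for_month(month: str):
    """Map a 'YYYY-MM' string to a study period name, or None if out of range."""
    year = int(month[:4])
    for name, (start, end) in PERIODS.items():
        if start <= year <= end:
            return name
    return None

# Hypothetical monthly rows for one game (values invented for illustration)
rows = [
    {"month": "2019-12", "avg_players": 100.0, "peak_players": 250.0},
    {"month": "2020-03", "avg_players": 300.0, "peak_players": 900.0},
    {"month": "2020-04", "avg_players": 500.0, "peak_players": 1200.0},
    {"month": "2022-01", "avg_players": 200.0, "peak_players": 450.0},
]

# Group the monthly rows by study period
grouped = defaultdict(list)
for row in rows:
    grouped[period_for_month(row["month"])].append(row)

# Mean of monthly averages and max of monthly peaks per period
period_summary = {
    period: {
        "avg_players": mean(r["avg_players"] for r in period_rows),
        "peak_players": max(r["peak_players"] for r in period_rows),
    }
    for period, period_rows in grouped.items()
}
print(period_summary)
```

The unweighted mean treats every month equally even though months differ in length, which we consider an acceptable approximation for period-level comparisons.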

A key strength of this dataset is that it provides consistent, public Steam activity data at scale. SteamCharts obtains its data directly from Steam's Web API. The main shortcoming is that it is Steam-only and therefore not representative of console ecosystems such as Nintendo, Sony, and Xbox. Historical coverage may also be incomplete for certain titles, and the data does not provide causal explanations for changes in player activity. It is also important to note that Steam does not record data for players who play offline, i.e. disconnected from the internet and therefore from Steam's servers. As such, the data does not account for these players even if they are playing through Steam. Lastly, the top-game selection introduces survivorship/popularity bias relative to the full Steam catalog, leaving out games that may see interesting growth or variability in their player populations despite not being in the top 250.

#### Utility Functions Overview
The helper functions below standardize small repeated tasks so the collection pipeline is easier to read and debug.  
- `clean_num` converts numeric text scraped from HTML (including commas and dash placeholders) into floats.  
- `build_game_url` creates a SteamCharts URL from a game `appid`.  
- `cache_key_for_url` creates a deterministic filename-safe cache key from a URL so cached HTML pages can be reused across runs.

In [None]:
# SteamCharts collection helpers + pipeline 

# Base URL pattern for SteamCharts game pages
# inject appid into {appid}, e.g. appid=730 -> https://steamcharts.com/app/730
BASE_URL = "https://steamcharts.com/app/{appid}"

# Apparently some sites block requests that do not provide a browser-like user agent.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; SteamChartsYearScraper/1.0)"
}

# Some utility helpers! ==========================================================================

def clean_num(value: str) -> Optional[float]:
    """
    Convert numeric text to float.

    Handles:
    - commas: "12,345.6" -> 12345.6
    - blanks/dashes -> None
    - invalid values -> None
    """
    if value is None:
        return None

    text = str(value).strip().replace(",", "")
    if text in {"", "-", "—"}:
        return None

    try:
        return float(text)
    except ValueError:
        return None


def build_game_url(appid: int) -> str:
    """
    Build SteamCharts URL for one appid.
    Example: appid=730 -> "https://steamcharts.com/app/730"
    """
    return BASE_URL.format(appid=int(appid))



def cache_key_for_url(url: str) -> str:
    """
    Build deterministic cache filename key from URL.
    Using md5 keeps filenames short and filesystem-safe.
    Why this exists:
    - URL text may not be ideal as a filename.
    - Hash gives stable and filesystem-safe names.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()
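
The properties that make a hash convenient as a cache key can be checked directly. The sketch below uses a hypothetical `md5_key` helper that mirrors the idea above; the second appid (570) is just another example value.

```python
import hashlib

def md5_key(url: str) -> str:
    """Hex MD5 digest of a URL, mirroring the cache-key idea above."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

url_a = "https://steamcharts.com/app/730"
url_b = "https://steamcharts.com/app/570"

key_a_first = md5_key(url_a)
key_a_second = md5_key(url_a)
key_b = md5_key(url_b)

print(len(key_a_first))             # 32 hex characters
print(key_a_first == key_a_second)  # True: same URL, same key
print(key_a_first == key_b)         # False: different URLs, different keys
```

MD5 is used here only to name files, not for security, so its known cryptographic weaknesses do not matter for this purpose.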

#### Input Loading and Validation
Before scraping SteamCharts, we validate each yearly input CSV to ensure it has the expected schema: `rank`, `name`, and `appid`.  
This step enforces consistent data types, removes empty names, drops duplicates, and sorts by rank so processing is deterministic.  
Failing early on malformed input helps avoid harder-to-diagnose errors later in the scraping pipeline.

In [None]:
# Input and loading validation ==================================================================
def load_input_csv(file_path: Path) -> pd.DataFrame:
    """
    Load and validate one input CSV.
    (Probably not necessary, but I think it's good practice just in case)

    Required columns:
    - rank
    - name
    - appid

    Returns:
    - Cleaned DataFrame with normalized dtypes:
      rank:int, name:str, appid:int
    """

    # Read CSV (utf-8-sig strips a leading byte-order mark if one is present)
    df = pd.read_csv(file_path, encoding="utf-8-sig")

    # Validate required columns
    required = {"rank", "name", "appid"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(
            f"{file_path.name} is missing required columns: {sorted(missing)}. "
            f"Found columns: {list(df.columns)}"
        )

    # Keep only needed columns in predictable order 
    df = df[["rank", "name", "appid"]].copy()

    # Convert numeric fields and fail LOUDLY if invalid
    df["rank"] = pd.to_numeric(df["rank"], errors="raise").astype(int)
    df["appid"] = pd.to_numeric(df["appid"], errors="raise").astype(int)

    # Strip whitespace on name
    # (fillna first so missing names become "" rather than the string "nan")
    df["name"] = df["name"].fillna("").astype(str).str.strip()

    # Remove rows with empty names 
    df = df[df["name"] != ""].copy()

    # Drop duplicates on rank+appid
    df = df.drop_duplicates(subset=["rank", "appid"]).reset_index(drop=True)

    # Sort by rank for deterministic processing
    df = df.sort_values("rank").reset_index(drop=True)

    # Return cleaned DataFrame
    return df

#### Network Requests and HTML Caching
The scraper first checks whether a game's HTML page is already cached locally. If so, it reads from cache; otherwise, it fetches the page from SteamCharts and stores a local copy.  
This reduces repeated web requests, improves reproducibility across reruns, and speeds up development/testing.  
A short delay is added between live requests to avoid overloading the source website.
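
The cache-first idea described above can be illustrated in isolation. In this sketch, `cached_fetch` and `fake_fetch` are hypothetical names: a fake fetcher stands in for the real HTTP request so that we can count how often the "network" is actually hit.

```python
import tempfile
from pathlib import Path

def cached_fetch(key: str, fetch, cache_dir: Path) -> str:
    """Return cached text for `key` if present; otherwise call `fetch` and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = fetch()
    cache_file.write_text(text, encoding="utf-8")
    return text

# Fake fetcher that counts calls, standing in for a real HTTP request
calls = {"count": 0}

def fake_fetch() -> str:
    calls["count"] += 1
    return "<html>demo</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("demo", fake_fetch, Path(tmp))
    second = cached_fetch("demo", fake_fetch, Path(tmp))

print(calls["count"])  # 1: the second call was served from the cache
```

The real scraper follows the same pattern, with the addition of a short sleep after each live request.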

In [None]:
# Network + cache =================================================================================
def get_game_page_html(
    appid: int,
    session: requests.Session,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
) -> str:
    """
    Return HTML for one game page, using cache when available.

    Flow:
    1) Build game URL from appid
    2) Compute cache filename from URL hash
    3) If cache exists and use_cache=True -> return cached HTML
    4) Else fetch from network, save cache, sleep briefly, return HTML
    """

    # Ensure cache_dir exists
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Compute URL and cache filename
    url = build_game_url(appid)
    cache_file = cache_dir / f"{cache_key_for_url(url)}.html"

    # If cache hit and use_cache: read + return
    if cache_file.exists() and use_cache:
        return cache_file.read_text(encoding="utf-8")

    # Live request path
    resp = session.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    html = resp.text

    # Save to cache for future runs
    cache_file.write_text(html, encoding="utf-8")

    # Don't attack our lord and savior GabeN with rapid-fire requests
    time.sleep(request_delay_sec)

    return html


# HTML parsing ===================================================================================


def parse_year_data_from_html(html: str, target_year: int) -> list[dict]:
    """
    Parse SteamCharts monthly table from one app page, filtered to target year.

    Input:
    - html: raw page HTML
    - target_year: year to keep (e.g., 2021)

    Output row shape:
    {
      "month": "YYYY-MM",
      "avg_players": float|None,
      "peak_players": float|None
    }

    Why this exists:
    - Pure parser function (HTML in -> structured rows out).
    - Easy to test independently from I/O.
    """

    # Parse html with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    parsed_rows = []

    # Select monthly table rows (table.common-table tbody tr)
    rows = soup.select("table.common-table tbody tr")

    # For each row, parse month/avg/peak
    for tr in rows: 
        tds = tr.find_all("td")

        # Monthly rows should have at least 5 columns
        # Month | Avg. Players | Gain | Gain % | Peak Players
        if len(tds) < 5:
            continue

        month_text = tds[0].get_text(" ", strip=True)

        # Skip "Last 30 Days"
        if month_text.lower() == "last 30 days":
            continue

        # Parse month text with pd.to_datetime(..., format="%B %Y")
        month_dt = pd.to_datetime(month_text, format="%B %Y", errors="coerce")
        if pd.isna(month_dt):
            continue

        # Keep rows where parsed year == target_year
        if int(month_dt.year) != int(target_year):
            continue

        # Extract the raw text for the avg and peak columns
        avg_text = tds[1].get_text(" ", strip=True)
        peak_text = tds[4].get_text(" ", strip=True)

        # Clean numeric fields with clean_num() and store the row
        parsed_rows.append({
            "month": month_dt.strftime("%Y-%m"),
            "avg_players": clean_num(avg_text),
            "peak_players": clean_num(peak_text),
        })

    # Keep output in chronological order
    parsed_rows.sort(key=lambda r: r["month"])
    return parsed_rows
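
The month handling in the parser can be illustrated without any HTML. The sketch below swaps in the standard library's `datetime.strptime` for `pd.to_datetime` and uses a hypothetical helper, `month_key`. It assumes English month names under the default locale, and the labels are examples of the kind of text found in the first table column.

```python
from datetime import datetime

def month_key(month_text: str, target_year: int):
    """Return 'YYYY-MM' if month_text is a '<Month name> <year>' label in target_year, else None."""
    try:
        parsed = datetime.strptime(month_text.strip(), "%B %Y")
    except ValueError:
        # Labels such as "Last 30 Days" are not month names
        return None
    if parsed.year != target_year:
        return None
    return parsed.strftime("%Y-%m")

labels = ["Last 30 Days", "March 2020", "December 2019"]
keys = [month_key(label, 2020) for label in labels]
print(keys)  # [None, '2020-03', None]
```

Zero-padded `YYYY-MM` strings sort chronologically as plain text, which is why sorting the parsed rows by `month` is enough.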



#### Yearly Collection Logic
For each year, we read the corresponding top-250 appid file, fetch each game's SteamCharts page, and parse only rows that match the target year.  
Each output row is assigned a status:
- `ok` if monthly rows were parsed successfully,
- `no_data_for_year` if the page loaded but no rows matched the year,
- `request_error` for HTTP/network failures,
- `parse_error` for HTML parsing issues.

This status field helps us audit data quality before analysis.

In [None]:
# year collection ====================================================================================
def collect_one_year(
    input_csv: Path,
    year: int,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
) -> pd.DataFrame:
    """
    Collect SteamCharts data for one year's input list.

    Steps:
    - Load CSV (rank, name, appid)
    - For each game:
      - Fetch or read cached HTML
      - Parse target-year monthly rows
      - Emit result rows with status labels

    Status values:
    - "ok"               : parsed monthly rows exist
    - "no_data_for_year" : page loaded, but no rows for that year
    - "request_error"    : failed HTTP request
    - "parse_error"      : page fetched but parse failed
    """

    # Load and validate the input list of games
    games_df = load_input_csv(input_csv)

    # Collect output rows (one per game-month, or one per failed game)
    out_rows = []
    total = len(games_df)

    # Reuse a single HTTP session across all requests
    with requests.Session() as session:
        for idx, row in games_df.iterrows():
            rank = int(row["rank"])
            name = row["name"]
            appid = int(row["appid"])

            # Fetch HTML (cache first)
            try: 
                html = get_game_page_html(
                    appid=appid,
                    session=session,
                    cache_dir=cache_dir,
                    use_cache=use_cache,
                    request_delay_sec=request_delay_sec,
                )
            except Exception:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "request_error",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): request error")
                continue

            # Parse only rows for target year
            try:
                parsed_rows = parse_year_data_from_html(html, target_year=year)
            except Exception:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "parse_error",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): parse error")
                continue

            # No rows found for this year
            if not parsed_rows:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "no_data_for_year",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): no data for year")
                continue

            # Found rows: attach metadata 
            for pr in parsed_rows:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": pr["month"],
                    "avg_players": pr["avg_players"],
                    "peak_players": pr["peak_players"],
                    "status": "ok",
                })

            print(f"[{idx+1}/{total}] {name} ({appid}) -> ok ({len(parsed_rows)} months)")

    result_df = pd.DataFrame(
        out_rows,
        columns=[
            "year", 
            "rank", 
            "name", 
            "appid",
            "month", 
            "avg_players", 
            "peak_players",
            "status",
        ],)

    # Sort for readability (rank, then month)
    result_df = result_df.sort_values(
                    by=["rank", "month"], 
                    na_position="last"
                ).reset_index(drop=True)

    return result_df


def collect_year_range(
    start_year: int,
    end_year: int,
    input_dir: Path,
    input_pattern: str,      # e.g. "{year}_top250_ids.csv"
    output_dir: Path,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
    write_combined: bool = True,
) -> pd.DataFrame:
    """
    Run collection across a year range using predictable filenames.

    For each year:
    - Build input file path from input_pattern
    - Skip year if file missing
    - Collect year data
    - Write per-year CSV

    Optionally:
    - Combine all years into one DataFrame + CSV
    """

    # Ensure the output and cache directories exist
    output_dir.mkdir(parents=True, exist_ok=True)
    cache_dir.mkdir(parents=True, exist_ok=True)

    all_parts = []

    for year in range(start_year, end_year + 1):
        # Build the input CSV path from the pattern
        input_csv = input_dir / input_pattern.format(year=year)

        # If the input file is missing, report it and skip this year
        if not input_csv.exists():
            print(f"{year} SKIP - missing input file: {input_csv}")
            continue

        year_df = collect_one_year(
            input_csv=input_csv,
            year=year,
            cache_dir=cache_dir,
            use_cache=use_cache,
            request_delay_sec=request_delay_sec,
        )
        # Write the per-year CSV to output_dir
        year_out_path = output_dir / f"steamcharts_{year}_top250.csv"
        year_df.to_csv(year_out_path, index=False)
        print(f"[{year}] wrote {year_out_path} ({len(year_df)} rows)")

        status_counts = year_df["status"].value_counts(dropna=False)
        print(f"[{year}] status summary:\n{status_counts.to_string()}")

        # Keep this year's frame for the combined output
        all_parts.append(year_df)

    # If nothing processed, return empty with expected schema
    if not all_parts:
        return pd.DataFrame(columns=[
            "year", "rank", "name", "appid", "month",
            "avg_players", "peak_players", "status"
        ])

    combined_df = pd.concat(all_parts, ignore_index=True)

    if write_combined:
        combined_out_path = output_dir / f"steamcharts_{start_year}_{end_year}_combined.csv"
        combined_df.to_csv(combined_out_path, index=False)
        print(f"\nWrote combined file: {combined_out_path} ({len(combined_df)} rows)")

    return combined_df
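As a minimal sketch of how the per-year input paths are resolved, the snippet below formats the same `{year}_top250_ids.csv` pattern over a year range and separates the years whose file exists from the years that would be skipped. The helper name `resolve_year_inputs` and the temporary directory are illustrative assumptions, not part of the pipeline above.

```python
from pathlib import Path
import tempfile


def resolve_year_inputs(start_year, end_year, input_dir, input_pattern):
    """Hypothetical helper: split a year range into found and missing input files."""
    found, missing = {}, []
    for year in range(start_year, end_year + 1):
        candidate = Path(input_dir) / input_pattern.format(year=year)
        if candidate.exists():
            found[year] = candidate
        else:
            missing.append(year)
    return found, missing


# Demonstration in a throwaway directory where only two of three inputs exist
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for y in (2018, 2020):
        (tmp_dir / f"{y}_top250_ids.csv").write_text("rank,name,appid\n")

    found, missing = resolve_year_inputs(
        2018, 2020, tmp_dir, "{year}_top250_ids.csv"
    )
    found_years = sorted(found)
    print("found:", found_years)
    print("missing:", missing)
```

Running a check like this before a long scrape could make it clear up front which years will be skipped, rather than discovering it from the log afterwards.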


#### Status filtering
Not every request is expected to succeed, so each row carries a status value that makes it easy to pick out the games with missing data.

This function takes a year, loads that year’s interim SteamCharts CSV, filters to rows where `status == "ok"`, and writes the filtered result to `data/02-processed`.

In [None]:

from pathlib import Path
import pandas as pd

def keep_only_ok_status(
    year: int,
    interim_dir: Path = Path("data/01-interim"),
    processed_dir: Path = Path("data/02-processed"),
) -> Path:
    """
    Reads one yearly interim SteamCharts CSV and writes an ok-only version.

    Input:
      data/01-interim/steamcharts_{year}_top250.csv
    Output:
      data/02-processed/steamcharts_{year}_top250_ok.csv

    Returns the output path.
    """
    input_path = interim_dir / f"steamcharts_{year}_top250.csv"
    output_path = processed_dir / f"steamcharts_{year}_top250_ok.csv"

    df = pd.read_csv(input_path)

    if "status" not in df.columns:
        raise ValueError(f"'status' column not found in {input_path}")

    df_ok = df[df["status"] == "ok"].copy()

    output_path.parent.mkdir(parents=True, exist_ok=True)
    df_ok.to_csv(output_path, index=False)

    return output_path



def make_combined_ok_file(
    start_year: int,
    end_year: int,
    processed_dir: Path = Path("data/02-processed"),
) -> Path:
    """
    Combines the yearly ok-only files into one ok-only combined file.

    Inputs:
      data/02-processed/steamcharts_{year}_top250_ok.csv
    Output:
      data/02-processed/steamcharts_{start}_{end}_ok.csv

    Returns the output path.
    """
    parts = []

    for year in range(start_year, end_year + 1):
        fp = processed_dir / f"steamcharts_{year}_top250_ok.csv"
        if fp.exists():
            parts.append(pd.read_csv(fp))

    if not parts:
        raise FileNotFoundError("No yearly ok-only files found to combine.")

    combined_ok = pd.concat(parts, ignore_index=True)

    out_path = processed_dir / f"steamcharts_{start_year}_{end_year}_ok.csv"
    combined_ok.to_csv(out_path, index=False)

    return out_path
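To illustrate what the status filter does, here is a small sketch on an in-memory frame that mimics the interim schema. The rows, game names, and the `YYYY-MM` month format are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Toy rows mimicking the interim schema (all values are illustrative)
toy = pd.DataFrame(
    {
        "year": [2020, 2020, 2020],
        "rank": [1, 1, 2],
        "name": ["Game A", "Game A", "Game B"],
        "appid": [10, 10, 20],
        "month": ["2020-01", "2020-02", None],
        "avg_players": [1000.0, 1200.0, None],
        "peak_players": [2500.0, 2600.0, None],
        "status": ["ok", "ok", "no_data_for_year"],
    }
)

# Same boolean mask used in keep_only_ok_status
toy_ok = toy[toy["status"] == "ok"].copy()

# Games that disappear entirely once non-ok rows are removed
dropped_appids = sorted(
    int(a) for a in set(toy["appid"]) - set(toy_ok["appid"])
)

print(toy_ok[["name", "month", "avg_players"]].to_string(index=False))
print("appids with no ok rows:", dropped_appids)
```

Tracking which `appid` values drop out completely may be worth doing on the real data as well, since those games no longer contribute to any period summary.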


#### Quick Data Quality and Summary
This section loads the combined SteamCharts output and reports high-level checks used before modeling:
- available columns,
- dataset size,
- status counts from collection,
- period-level descriptive summaries for `avg_players` and `peak_players` using only `status == "ok"` rows.

These checks provide a compact overview of data completeness and trend direction across pre-COVID, COVID, and post-COVID periods.

In [None]:
def report_steamcharts_summary(csv_path: Path, label: str = "") -> None:
    """
    Print a compact data-quality + descriptive summary for a SteamCharts scrape CSV.

    Reports:
    - Columns
    - Overall size
    - Status counts (if column exists)
    - Period-level summary of avg_players / peak_players
      (uses status=='ok' rows when status column exists)
    """
    df = pd.read_csv(csv_path)

    header = f"=== {label} ===" if label else "==="
    print(f"\n{header}")
    print(f"File: {csv_path}")
    print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}")

    print("\nColumns:")
    print(df.columns.tolist())

    # Status counts (only if present)
    if "status" in df.columns:
        print("\nStatus counts:")
        print(df["status"].value_counts(dropna=False).to_string())

    # Period-level summary (descriptive only)
    required = {"year", "avg_players", "peak_players"}
    if required.issubset(df.columns):
        tmp = df.copy()

        # light coercion for summary calculations only
        tmp["year"] = pd.to_numeric(tmp["year"], errors="coerce")
        tmp["avg_players"] = pd.to_numeric(tmp["avg_players"], errors="coerce")
        tmp["peak_players"] = pd.to_numeric(tmp["peak_players"], errors="coerce")

        # If status exists, restrict to ok for summaries
        if "status" in tmp.columns:
            tmp = tmp[tmp["status"] == "ok"].copy()

        def period_label(y):
            if pd.isna(y):
                return "unknown"
            y = int(y)
            if 2018 <= y <= 2019:
                return "pre_covid"
            if 2020 <= y <= 2021:
                return "covid"
            if 2022 <= y <= 2023:
                return "post_covid"
            return "other"

        tmp["period"] = tmp["year"].apply(period_label)

        period_summary = (
            tmp.groupby("period", as_index=False)
               .agg(
                   n_rows=("period", "size"),
                   avg_players_mean=("avg_players", "mean"),
                   avg_players_median=("avg_players", "median"),
                   peak_players_mean=("peak_players", "mean"),
                   peak_players_median=("peak_players", "median"),
               )
               .sort_values("period")
        )

        print("\nPeriod-level summary (avg/peak players):")
        print(period_summary.to_string(index=False))
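The period labeling above could also be expressed without a row-wise `apply`. The sketch below is one possible alternative, not the method used in the report: it bins years with `pd.cut` using the same period boundaries, on made-up numbers chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy data: one row per year, values are illustrative only
toy = pd.DataFrame(
    {
        "year": [2018, 2019, 2020, 2021, 2022, 2023],
        "avg_players": [100.0, 200.0, 400.0, 600.0, 300.0, 500.0],
        "peak_players": [150.0, 250.0, 900.0, 700.0, 450.0, 550.0],
    }
)

# Right-closed bins: (2017, 2019], (2019, 2021], (2021, 2023]
toy["period"] = pd.cut(
    toy["year"],
    bins=[2017, 2019, 2021, 2023],
    labels=["pre_covid", "covid", "post_covid"],
)

# Named aggregation, grouped on the ordered categorical
period_summary = (
    toy.groupby("period", observed=True)
       .agg(
           n_rows=("year", "size"),
           avg_players_mean=("avg_players", "mean"),
           peak_players_median=("peak_players", "median"),
       )
       .reset_index()
)

print(period_summary.to_string(index=False))
```

Because `pd.cut` yields an ordered categorical, the grouped rows come out in chronological order (pre_covid, covid, post_covid), whereas sorting the plain string labels orders them alphabetically. Years outside the bins become missing values and are dropped by the groupby instead of being labeled "other".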


#### Main Run Configuration
`main()` centralizes the user-editable run settings (year range, input pattern, output directory, cache directory, and delay settings).  
This keeps the rest of the pipeline stable while making it easy to rerun collection with different folders or year ranges.  
The run writes one CSV per year and a combined CSV for downstream summary and analysis. It then prints a summary of the combined data, filters the data to `status == "ok"` rows, and prints the summary once more.

In [None]:
def main():
    start_year = 2018
    end_year = 2023

    input_dir = Path("data/02-processed")
    input_pattern = "{year}_top250_ids.csv"
    output_dir = Path("data/01-interim")
    cache_dir = Path("data/00-raw/steamcharts_cache")

    # Run scraper to create interim per-year + combined
    collect_year_range(
        start_year=start_year,
        end_year=end_year,
        input_dir=input_dir,
        input_pattern=input_pattern,
        output_dir=output_dir,
        cache_dir=cache_dir,
        use_cache=True,
        request_delay_sec=0.6,
        write_combined=True,
    )

    # Print BEFORE summary (combined interim)
    combined_before = output_dir / f"steamcharts_{start_year}_{end_year}_combined.csv"
    report_steamcharts_summary(combined_before, label="Combined BEFORE status filtering")

    # Create yearly ok-only files
    for year in range(start_year, end_year + 1):
        keep_only_ok_status(year, interim_dir=output_dir, processed_dir=Path("data/02-processed"))

    # Create combined ok-only file
    combined_after = make_combined_ok_file(start_year, end_year, processed_dir=Path("data/02-processed"))

    # Print AFTER summary (combined ok-only)
    report_steamcharts_summary(combined_after, label="Combined AFTER status filtering (ok-only)")

main()

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Our project does not involve human subjects directly. We are analyzing publicly available gaming statistics from Steam, which does not require consent.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> We acknowledge that by focusing solely on Steam data, we introduce platform bias. However, our research question is specifically aimed at PC gaming on Steam.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> Our dataset contains no personally identifiable information (PII). We are using aggregate player statistics and game mode data only.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> We are not collecting data on protected groups such as gender or race, as our analysis focuses on gaming trends at the aggregate game-mode level rather than on individual player demographics.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> The data we are using is publicly available from Steam Charts and SteamDB. We are not collecting any new data or storing sensitive information.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> Not applicable, as we are not collecting or storing any personal information from individuals.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> Since we are using publicly available data and not collecting new data, data retention is not our concern. Any data we download for analysis will be retained only for the duration of this project.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> We acknowledge that other gaming platforms exist (e.g., Nintendo, Sony PlayStation, Microsoft Xbox, mobile) which provide different gaming experiences. Our dataset is limited to Steam users, who may not be representative of the global gaming population (e.g., mobile gamers or console players). However, our research question is intentionally narrowed to PC gaming on Steam due to data availability and accessibility.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> There could be potential bias in our dataset if certain games have wealthy backers who promote them more heavily, which could inflate their popularity metrics. We are aware of this possibility when interpreting our results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We are committed to representing our data honestly and will strive to create visualizations and statistics that accurately reflect the underlying trends without misleading interpretations.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
> Privacy is not a concern for our analysis since we are not using any data with PII.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> We are committed to documenting our analysis process thoroughly to ensure reproducibility. This includes maintaining clear records of our data sources and processing steps.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> We are considering PC players as a general population. Since our analysis focuses on game mode trends rather than player demographics, bias and discrimination concerns are not relevant to our research.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> Not applicable for the same reason as D.1.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We are not creating a predictive model or optimizing for specific metrics.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Not applicable, as we are not building a predictive model or making automated decisions.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> While we are not building a model, we will clearly communicate the limitations of our analysis, including our focus on Steam data only and potential biases in the dataset.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Not applicable. We are not deploying a model. This is a research project analyzing game mode trends.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
> Not applicable, as our analysis is unlikely to cause harm and we are not deploying a system that impacts individuals.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
> Not applicable for the same reason as E.1.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> We are using the data for research, but game developers might use our results. For example, if they see that older games were popular, they might simply copy them instead of making new, creative games. This could lead to too many similar games and less variety for players.

## Team Expectations 

Team expectations are also [found separately here](./admin/rules.md).

### Communication
**Discord** is our main form of communication. We have a group chat.
- **Responding** If a message needs a response or acknowledgement, respond within about 1 day/24 hours (48 hours can be acceptable if there's an emergency). The exception is when we have a planned meeting coming up and we ask for a ready check; in that case an almost immediate response is expected.
- **Respect** Stay reasonably respectful to one another. It's okay to disagree, but do talk about the issue together or bring in another person (or the entire group) to discuss the matter if needed to mediate. If you don't talk about something, there's no way we'd know what's wrong.
  
### Missing Tasks/Meetings
- **Tasks** If you can't complete a task, let us know as soon as possible (i.e. as soon as you find out) so we can reorganize task assignment or move our schedule around.
- **Meetings** If you can't make a meeting, that's okay, and it's not detrimental. However, that would mean you can't provide your input on something live. You can share your thoughts and ideas in our group chat in this event so we can discuss your ideas. We do take meeting notes, so please read them to stay up to date with the team.

### Team Structure and Decision Making
- **Team Roles** We don't plan on having established team roles, but we'll try to have everyone do a bit of everything (to the best of our ability). The only real "role" we'll have is one note taker per meeting.
- **Task Tracking** We'll use the GitHub Projects tab/Kanban on the team repository. 
- **Decision Making** If it comes to a decision, we'll have a vote to decide (more votes = win).

### Addressing Problem Members
This is our protocol on addressing non-responsive teammates/those refusing to do work:
1. First offense: check-in and see if everything is okay.
2. Second offense: what we do depends, but we'll talk with you again.
3. Clearly becoming a pattern: talk to a TA and/or the professor.

## Project Timeline Proposal

| Type | Date | Meeting/Due Time | To Complete Before Meeting | Discuss at Meeting |
| ---- | ---- | ---- | ---- |  ---- |
| Meeting | 2/22 | 2pm  | Read up on EDA checkpoint requirements; come into meeting with ideas on how to approach things | Discuss EDA and split up tasks for EDA checkpoint. |
| Meeting | 3/1  | 2pm  | Make 70-80% progress on EDA tasks | Check in on EDA progress and see what needs to be done. |
| **DUE** EDA Checkpoint | 3/4 | 11:59pm | - | - |
| Meeting | 3/8  | 2pm  | Wrap up any loose ends we didn't finish (if applicable). Read up on final project expectations. | Discuss final project tasks and split up tasks. |
| Meeting | 3/15 | 2pm  | Make about 80% progress on final project tasks. | Discuss the video work. |
| **DUE** Final Project + Video | 3/18 | 11:59pm | - | - |