## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors
Team:
- Ashley Vo: Conceptualization, Writing – review & editing, Project administration, Data curation
- Dorje Pradhan: Conceptualization, Writing – original draft, Writing – review & editing, Data curation
- Kilhoon (Andy) Kim: Writing – original draft, Writing – review & editing, Data curation
- Kobe Wood: Data curation, Writing – original draft, Writing – review & editing
- Vy (Kiet) Dang: Background research, Writing – original draft, Data curation

## Research Question

How did the popularity of each different video game mode (singleplayer, multiplayer, online co-op) on PC change between the pre-COVID period (2018-2019), the COVID period (2020-2021), and post-COVID period (2022-2023) among the top 250 player-count Steam games from each year from each mode?

where we are defining **popularity** by metrics of:
- Average concurrent player count over a given period 
- Peak player count over a given period

## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Hypothesis


We believe there will be a dramatic rise in the popularity of online co-op and multiplayer games during the COVID era, with some of that increase continuing after COVID. Our thinking is as follows: people were stuck inside and had largely lost the ability to connect with each other in person, so games that allowed online interaction became more appealing. In this study, popularity will be measured using average concurrent player count and peak player count, and we will examine these patterns within the top 250 Steam games across the pre-COVID period (2018-2019), COVID period (2020-2021), and post-COVID period (2022-2023). We also expect that, within the top 250, the number of games tagged as multiplayer or online co-op will increase during COVID compared with pre-COVID, though we recognize that tags are not mutually exclusive and a game may appear in more than one mode. 
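
Because mode tags are not mutually exclusive, per-mode game counts can add up to more than the number of games. The sketch below illustrates this on invented data. The `games` frame, its `modes` column, and the tag values are hypothetical stand-ins, not the actual Steam250 schema.

```python
import pandas as pd

# Hypothetical games with one or more mode tags each (invented for illustration)
games = pd.DataFrame({
    "appid": [1, 2, 3],
    "modes": [
        ["singleplayer"],
        ["multiplayer", "online co-op"],
        ["singleplayer", "multiplayer"],
    ],
})

# One row per (game, mode) pair, so a game with two tags appears twice
long = games.explode("modes").rename(columns={"modes": "mode"})

# Number of games carrying each tag
mode_counts = long["mode"].value_counts()
print(mode_counts)

# The per-mode counts sum to more than the number of games
print(mode_counts.sum(), "tag assignments across", len(games), "games")
```

When comparing modes, this means a single popular game can raise the totals for more than one mode at once.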

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

#### Dataset 1: Steam Charts Historical Player Activity

Dataset name: Steam Charts game-level player activity (scraped/collected)

Link to dataset source: https://steamcharts.com/

Number of observations: [1463] (roughly: number of games × number of months in 2018–2023, without re-counting games that appear in more than one year)

Number of variables: 8 (year, rank, name, appid, month, avg_players, peak_players, status), recorded per game per month for 6 years ([144000 total] values)

This dataset is the core time-series source for our project because it contains the two popularity metrics we can reliably measure across all periods: average concurrent players and peak concurrent players. In practical terms, average concurrent players captures the typical number of people actively playing a game at the same time during a month, while peak concurrent players captures the maximum simultaneous activity reached during that month. Both metrics are counts of players (not percentages), and both are useful: average concurrency reflects sustained engagement, while peak concurrency reflects major surges and maximum demand.
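
To make the distinction concrete, here is a minimal sketch on made-up monthly numbers. The `toy` frame and its values are illustrative assumptions, not real SteamCharts data; only the column names `avg_players` and `peak_players` come from our dataset.

```python
import pandas as pd

# Hypothetical monthly rows for one game (values invented for illustration)
toy = pd.DataFrame({
    "month": ["2020-01", "2020-02", "2020-03"],
    "avg_players": [1000.0, 1200.0, 2600.0],
    "peak_players": [2500.0, 2700.0, 9000.0],
})

# Sustained engagement: mean of the monthly average concurrent players
sustained = toy["avg_players"].mean()

# Maximum demand: the highest monthly peak in the window
max_peak = toy["peak_players"].max()

# Peak-to-average ratio hints at how "spiky" each month was
toy["peak_to_avg"] = toy["peak_players"] / toy["avg_players"]

print(sustained, max_peak)
print(toy)
```

Note that a plain mean of monthly averages weights every month equally regardless of its length, which is a simplification we accept for this sketch.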

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

# (!) Our data comes from scraping HTML so this section is unneeded.
# import get_data # this is where we get the function we need to download data
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
# datafiles = [
#     { 'url': '', 'filename':''},
#     { 'url': '', 'filename':''}
# ]
# get_data.get_raw(datafiles,destination_directory='data/00-raw/')


# OUR IMPORTS
# Setup imports for Dataset 1 pipeline
import csv
import time
import hashlib
from pathlib import Path
from typing import Optional

import requests
import pandas as pd
from bs4 import BeautifulSoup


### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125 mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you

### SteamCharts Player Data Collection
For this project, the most relevant variables are: game identifier (name and appid), date (month and year), average concurrent players, and peak concurrent players. We may later aggregate monthly values into three study periods: pre-COVID (2018–2019), COVID (2020–2021), and post-COVID (2022–2023). This allows direct period-to-period comparisons for each game and for groups of games by mode tags.
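
A minimal sketch of that per-game period aggregation, on invented rows. The `monthly` frame, the helper `period_for_year`, and the choice of mean for averages and max for peaks are assumptions made for illustration, not our final method.

```python
import pandas as pd

# Hypothetical rows in the same shape as our collected data (values invented)
monthly = pd.DataFrame({
    "appid": [10, 10, 10, 10, 20, 20],
    "month": ["2019-06", "2020-06", "2020-07", "2022-06", "2019-06", "2021-06"],
    "avg_players": [100.0, 300.0, 500.0, 200.0, 50.0, 70.0],
    "peak_players": [250.0, 900.0, 1100.0, 400.0, 80.0, 150.0],
})


def period_for_year(year: int) -> str:
    """Map a calendar year to a study period (assumes years within 2018-2023)."""
    if year <= 2019:
        return "pre_covid"
    if year <= 2021:
        return "covid"
    return "post_covid"


# Derive the year from the "YYYY-MM" month string, then label the period
monthly["year"] = monthly["month"].str[:4].astype(int)
monthly["period"] = monthly["year"].map(period_for_year)

# One row per (game, period): mean of monthly averages, max of monthly peaks
by_period = monthly.groupby(["appid", "period"], as_index=False).agg(
    avg_players_mean=("avg_players", "mean"),
    peak_players_max=("peak_players", "max"),
)
print(by_period)
```

Grouping by `appid` first keeps each game's periods comparable before any roll-up by mode tag.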

A key strength of this dataset is that it provides consistent and public Steam activity data at scale. The main shortcomings are that it is Steam-only (not representative of console ecosystems), may underrepresent edge cases where historical coverage is incomplete for certain titles, and does not directly provide causal explanations for changes in player activity. Also, top-game selection introduces survivorship/popularity bias relative to the full Steam catalog.

#### Aside: Getting Game IDs
Our second dataset is Steam250, but due to the modular way our team split up our work, we need to reference the top 250 Steam appids from Steam250 before going to SteamCharts to pull player data. So, we're going to collect those ids into CSVs (where one year = one CSV) to reference. 

In [None]:
# Build yearly top-250 appid CSVs from Steam250 
years = [2018, 2019, 2020, 2021, 2022, 2023]

for year in years:
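    url = 'https://steam250.com/reviews/' + str(year)
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(resp.text, 'html.parser')

    section = soup.select_one("section.applist.compact")
    if section is None:
        raise RuntimeError(f"Could not find the ranking list on {url}; the page layout may have changed")
    rows = section.find_all("div", id=True)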

    records = []

    for row in rows:
        try:
            rank_div = row.find_all("div", recursive=False)[0]

            # Ignore rank change stats e.g. "+2" or "-1"
            texts = [t.strip() for t in rank_div.contents if isinstance(t, str) and t.strip()]
            if texts:
                rank = int(texts[0])
            else:
                continue

            name_tag = row.select_one("a[title]")
            name = name_tag.text.strip()

            app_link = name_tag["href"]
            appid = int(app_link.split("/")[-1])

            rating = row.select_one("span.rating").text.strip().replace("%","")
            rating = float(rating)

            votes_raw = row.select_one("span.votes")["title"]
            votes = int(votes_raw.split()[0].replace(",", ""))

            records.append({
                "rank": rank,
                "name": name,
                "appid": appid
            })

        except Exception as e:
            print(e)
            continue

    curr_df = pd.DataFrame(records)
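    # Make sure the output folder exists before writing
    Path('data/02-processed').mkdir(parents=True, exist_ok=True)
    curr_file_name = 'data/02-processed/' + str(year) + '_top250_ids.csv' 

    curr_df.to_csv(curr_file_name, index=False)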

# Done message
print("Top-250 appid files generated!")

We can take a look at the data. First, let's store the results from the previous block into variables.

In [None]:
df_2018ids = pd.read_csv('data/02-processed/2018_top250_ids.csv')
df_2019ids = pd.read_csv('data/02-processed/2019_top250_ids.csv')
df_2020ids = pd.read_csv('data/02-processed/2020_top250_ids.csv')
df_2021ids = pd.read_csv('data/02-processed/2021_top250_ids.csv')
df_2022ids = pd.read_csv('data/02-processed/2022_top250_ids.csv')
df_2023ids = pd.read_csv('data/02-processed/2023_top250_ids.csv')

yearly_dfs = [df_2018ids, df_2019ids, df_2020ids, df_2021ids, df_2022ids, df_2023ids]

The code ensures a schema of `Steam250IDs(rank, name, appid)` for each CSV. While we're confident in the data's cleanliness and tidiness, we can check and summarize the results.

In [None]:
for i in range(6): 
    curr_df = yearly_dfs[i]
    print(
        'Current year data frame: ',
        str(2018 + i),
        '\n===================================================\n',
        '- Shape: ',
        curr_df.shape,
        '\n- Do we have 250 unique games? ',
        '(Yes)' if (len(curr_df) == 250 and curr_df['appid'].nunique() == 250) else '(No)',
        '\n- How many duplicate games in the current year? (',
        curr_df.duplicated().sum(),
        ')\n- Are there any nulls present?\n',
        curr_df.isna().any(),
        '\n\n',
        '- What are the column types?\n',
        curr_df.dtypes,
        '\n\n',
        '- First five rows of the data:\n',
        curr_df.head(),
        '\n\n',
        sep=''
    )

Everything looks as expected, so we can pull from SteamCharts now that we know what games to look for!

In [None]:
# SteamCharts collection helpers + pipeline 

# Base URL pattern for SteamCharts game pages
# inject app_id into {app_id}, e.g. app_id=730 -> https://steamcharts.com/app/730
BASE_URL = "https://steamcharts.com/app/{appid}"

# Apparently some sites block requests that do not provide a browser-like user agent.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; SteamChartsYearScraper/1.0)"
}

# Some utility helpers! =========================================================================

def clean_num(value: str) -> Optional[float]:
    """
    Convert numeric text to float.

    Handles:
    - commas: "12,345.6" -> 12345.6
    - blanks/dashes -> None
    - invalid values -> None
    """
    if value is None:
        return None

    text = str(value).strip().replace(",", "")
    if text in {"", "-", "—"}:
        return None

    try:
        return float(text)
    except ValueError:
        return None


def build_game_url(appid: int) -> str:
    """
    Build SteamCharts URL for one appid.
    Example: appid=730 -> "https://steamcharts.com/app/730"
    """
    return BASE_URL.format(appid=int(appid))



def cache_key_for_url(url: str) -> str:
    """
    Build deterministic cache filename key from URL.
    Using md5 keeps filenames short and filesystem-safe.
    Why this exists:
    - URL text may not be ideal as a filename.
    - Hash gives stable and filesystem-safe names.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()



# Input and loading validation ==================================================================


def load_input_csv(file_path: Path) -> pd.DataFrame:
    """
    Load and validate one input CSV.
    (Probably not necessary, but I think it's good practice just in case)

    Required columns:
    - rank
    - name
    - appid

    Returns:
    - Cleaned DataFrame with normalized dtypes:
      rank:int, name:str, appid:int
    """
    # Read CSV (utf-8-sig tolerates a leading byte-order mark)
    df = pd.read_csv(file_path, encoding="utf-8-sig")

    # Validate required columns
    required = {"rank", "name", "appid"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(
            f"{file_path.name} is missing required columns: {sorted(missing)}. "
            f"Found columns: {list(df.columns)}"
        )

    # Keep only needed columns in predictable order 
    df = df[["rank", "name", "appid"]].copy()

    # Convert numeric fields and fail LOUDLY if invalid
    df["rank"] = pd.to_numeric(df["rank"], errors="raise").astype(int)
    df["appid"] = pd.to_numeric(df["appid"], errors="raise").astype(int)

    # Strip whitespace on name (missing names become "" so they are dropped below)
    df["name"] = df["name"].fillna("").astype(str).str.strip()

    # Remove rows with empty names 
    df = df[df["name"] != ""].copy()

    # Drop duplicates on rank+appid, then sort by rank for deterministic processing
    df = (
        df.drop_duplicates(subset=["rank", "appid"])
          .sort_values("rank")
          .reset_index(drop=True)
    )


    return df


# Network + cache =================================================================================


def get_game_page_html(
    appid: int,
    session: requests.Session,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
) -> str:
    """
    Return HTML for one game page, using cache when available.

    Flow:
    1) Build game URL from appid
    2) Compute cache filename from URL hash
    3) If cache exists and use_cache=True -> return cached HTML
    4) Else fetch from network, save cache, sleep briefly, return HTML
    """
    # 1) Ensure cache_dir exists
    cache_dir.mkdir(parents=True, exist_ok=True)

    # 2) Compute URL and cache filename
    url = build_game_url(appid)
    cache_file = cache_dir / f"{cache_key_for_url(url)}.html"

    # 3) If cache hit and use_cache: read + return
    if cache_file.exists() and use_cache:
        return cache_file.read_text(encoding="utf-8")

    # Live request path
    resp = session.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    html = resp.text

    # Save to cache for future runs
    cache_file.write_text(html, encoding="utf-8")

    # Don't attack our lord and savior GabeN with rapid-fire requests
    time.sleep(request_delay_sec)

    return html


# HTML parsing ===================================================================================


def parse_year_data_from_html(html: str, target_year: int) -> list[dict]:
    """
    Parse SteamCharts monthly table from one app page, filtered to target year.

    Input:
    - html: raw page HTML
    - target_year: year to keep (e.g., 2021)

    Output row shape:
    {
      "month": "YYYY-MM",
      "avg_players": float|None,
      "peak_players": float|None
    }

    Why this exists:
    - Pure parser function (HTML in -> structured rows out).
    - Easy to test independently from I/O.
    """
    # 1) Parse html with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    parsed_rows = []

    # 2) Select monthly table rows (table.common-table tbody tr)
    rows = soup.select("table.common-table tbody tr")

    # 3) For each row, parse month/avg/peak
    for tr in rows: 
        tds = tr.find_all("td")

        # Monthly rows should have at least 5 columns
        # Month | Avg. Players | Gain | Gain % | Peak Players
        if len(tds) < 5:
            continue

        month_text = tds[0].get_text(" ", strip=True)

        # 4) Skip "Last 30 Days"
        if month_text.lower() == "last 30 days":
            continue

        # 5) Parse month text with pd.to_datetime(..., format="%B %Y")
        month_dt = pd.to_datetime(month_text, format="%B %Y", errors="coerce")
        if pd.isna(month_dt):
            continue

        # 6) Keep rows where parsed year == target_year
        if int(month_dt.year) != int(target_year):
            continue

        # 7) Clean numeric fields with clean_num()
        avg_text = tds[1].get_text(" ", strip=True)
        peak_text = tds[4].get_text(" ", strip=True)

        # 8) Collect the cleaned row (sorting by month happens after the loop)
        parsed_rows.append({
            "month": month_dt.strftime("%Y-%m"),
            "avg_players": clean_num(avg_text),
            "peak_players": clean_num(peak_text),
        })

    # Keep output in chronological order
    parsed_rows.sort(key=lambda r: r["month"])
    return parsed_rows



# year collection ====================================================================================


def collect_one_year(
    input_csv: Path,
    year: int,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
) -> pd.DataFrame:
    """
    Collect SteamCharts data for one year's input list.

    Steps:
    - Load CSV (rank, name, appid)
    - For each game:
      - Fetch or read cached HTML
      - Parse target-year monthly rows
      - Emit result rows with status labels

    Status values:
    - "ok"               : parsed monthly rows exist
    - "no_data_for_year" : page loaded, but no rows for that year
    - "request_error"    : failed HTTP request
    - "parse_error"      : page fetched but parse failed
    """

    # Load and validate the input list of games for this year
    games_df = load_input_csv(input_csv)

    # Initialize out_rows = []
    out_rows = []
    total = len(games_df)

    # Create requests.Session()
    with requests.Session() as session:
        for idx, row in games_df.iterrows():
            rank = int(row["rank"])
            name = row["name"]
            appid = int(row["appid"])

            # Fetch HTML (cache first)
            try: 
                html = get_game_page_html(
                    appid=appid,
                    session=session,
                    cache_dir=cache_dir,
                    use_cache=use_cache,
                    request_delay_sec=request_delay_sec,
                )
            except Exception:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "request_error",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): request error")
                continue

            # Parse only rows for target year
            try:
                parsed_rows = parse_year_data_from_html(html, target_year=year)
            except Exception:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "parse_error",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): parse error")
                continue

            # No rows found for this year
            if not parsed_rows:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": None,
                    "avg_players": None,
                    "peak_players": None,
                    "status": "no_data_for_year",
                })
                print(f"[{idx+1}/{total}] {name} (appid={appid}): no data for year")
                continue

            # Found rows: attach metadata 
            for pr in parsed_rows:
                out_rows.append({
                    "year": year,
                    "rank": rank,
                    "name": name,
                    "appid": appid,
                    "month": pr["month"],
                    "avg_players": pr["avg_players"],
                    "peak_players": pr["peak_players"],
                    "status": "ok",
                })

            print(f"[{idx+1}/{total}] {name} ({appid}) -> ok ({len(parsed_rows)} months)")

    result_df = pd.DataFrame(
        out_rows,
        columns=[
            "year", 
            "rank", 
            "name", 
            "appid",
            "month", 
            "avg_players", 
            "peak_players",
            "status",
        ],)

    # Sort for readability (rank, then month)
    result_df = result_df.sort_values(
                    by=["rank", "month"], 
                    na_position="last"
                ).reset_index(drop=True)

    return result_df


def collect_year_range(
    start_year: int,
    end_year: int,
    input_dir: Path,
    input_pattern: str,      # e.g. "{year}_top250_ids.csv"
    output_dir: Path,
    cache_dir: Path,
    use_cache: bool = True,
    request_delay_sec: float = 0.6,
    write_combined: bool = True,
) -> pd.DataFrame:
    """
    Run collection across a year range using predictable filenames.

    For each year:
    - Build input file path from input_pattern
    - Skip year if file missing
    - Collect year data
    - Write per-year CSV

    Optionally:
    - Combine all years into one DataFrame + CSV
    """

    # Ensure output_dir exists
    output_dir.mkdir(parents=True, exist_ok=True)
    cache_dir.mkdir(parents=True, exist_ok=True)

    all_parts = []

    for year in range(start_year, end_year + 1):
        # Build input_csv path from pattern
        input_csv = input_dir / input_pattern.format(year=year)

        # If the file is missing, print a skip message and continue
        if not input_csv.exists():
            print(f"{year} SKIP - missing input file:{input_csv}")
            continue

        year_df = collect_one_year(
            input_csv=input_csv,
            year=year,
            cache_dir=cache_dir,
            use_cache=use_cache,
            request_delay_sec=request_delay_sec,
        )
        # Write year_df to output_dir / f"steamcharts_{year}_top250.csv"
        year_out_path = output_dir / f"steamcharts_{year}_top250.csv"
        year_df.to_csv(year_out_path, index=False)
        print(f"[{year}] wrote {year_out_path} ({len(year_df)} rows)")

        status_counts = year_df["status"].value_counts(dropna=False)
        print(f"[{year}] status summary:\n{status_counts.to_string()}")

        # Append year_df to all_parts
        all_parts.append(year_df)

    # If nothing processed, return empty with expected schema
    if not all_parts:
        return pd.DataFrame(columns=[
            "year", "rank", "name", "appid", "month",
            "avg_players", "peak_players", "status"
        ])

    combined_df = pd.concat(all_parts, ignore_index=True)

    if write_combined:
        combined_out_path = output_dir / f"steamcharts_{start_year}_{end_year}_combined.csv"
        combined_df.to_csv(combined_out_path, index=False)
        print(f"\nWrote combined file: {combined_out_path} ({len(combined_df)} rows)")

    return combined_df


# MAIN =======================================================================================================


def main():
    """
    Central run configuration.

    Keep all user-editable settings here so the rest of the code
    stays stable and easy to reason about.
    """

    start_year = 2018
    end_year = 2023

    # IMPORTANT: change these paths as needed to point to the correct folders and files.
    # The yearly ID CSVs are written to data/02-processed/ by the Steam250 cell above,
    # so that is where we look for them by default.
    input_dir = Path("data/02-processed")

    # File name pattern for input CSVs, with {year} placeholder, e.g. "2018_top250_ids.csv"
    input_pattern = "{year}_top250_ids.csv"

    # Output and cache directory for per-year and combined CSVs
    output_dir = Path("outputs")  # e.g. Path("data/01-interim")
    cache_dir = Path("steamcharts_cache") # e.g. Path("data/01-interim/steamcharts_cache")

    # Cache behavior: True = reuse cached pages if present
    use_cache = True

    # Delay for live requests (in seconds) to avoid hammering the server
    request_delay_sec = 0.6

    combined_df = collect_year_range(
        start_year=start_year,
        end_year=end_year,
        input_dir=input_dir,
        input_pattern=input_pattern,
        output_dir=output_dir,
        cache_dir=cache_dir,
        use_cache=use_cache,
        request_delay_sec=request_delay_sec,
        write_combined=True,
    )

    print("\nCombined preview:")
    print(combined_df.head(20).to_string(index=False))

    print("\nCombined status summary:")
    if not combined_df.empty:
        print(combined_df["status"].value_counts(dropna=False).to_string())
    else:
        print("No data collected. Check input file paths/pattern.")


# Run collection
main()

In [None]:
# Some simple stats looking at the combined data from steamcharts_2018_2023_combined.csv

# CHANGE THIS PATH if needed
combined_path = Path("outputs/steamcharts_2018_2023_combined.csv")  # matches output_dir in main()

df = pd.read_csv(combined_path)

# Columns
print("=== Columns ===")
print(df.columns.tolist())

# Overall size
print("\n=== Overall Size ===")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

# Status counts
print("\n=== Status Counts ===")
if "status" in df.columns:
    print(df["status"].value_counts(dropna=False).to_string())
else:
    print("No 'status' column found.")

# Period-level summary (descriptive only)
required = {"year", "avg_players", "peak_players"}
if required.issubset(df.columns):
    tmp = df.copy()

    # light coercion for summary calculations only
    tmp["year"] = pd.to_numeric(tmp["year"], errors="coerce")
    tmp["avg_players"] = pd.to_numeric(tmp["avg_players"], errors="coerce")
    tmp["peak_players"] = pd.to_numeric(tmp["peak_players"], errors="coerce")

    # use only successful scrape rows if status exists
    if "status" in tmp.columns:
        tmp = tmp[tmp["status"] == "ok"].copy()

    def period_label(y):
        if pd.isna(y):
            return "unknown"
        y = int(y)
        if 2018 <= y <= 2019:
            return "pre_covid"
        elif 2020 <= y <= 2021:
            return "covid"
        elif 2022 <= y <= 2023:
            return "post_covid"
        return "other"

    tmp["period"] = tmp["year"].apply(period_label)

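    # Rank periods so the summary prints chronologically, not alphabetically
    period_rank = {"pre_covid": 0, "covid": 1, "post_covid": 2, "other": 3, "unknown": 4}

    period_summary = (
        tmp.groupby("period", as_index=False)
           .agg(
               n_rows=("period", "size"),
               avg_players_mean=("avg_players", "mean"),
               avg_players_median=("avg_players", "median"),
               peak_players_mean=("peak_players", "mean"),
               peak_players_median=("peak_players", "median"),
           )
           .sort_values("period", key=lambda s: s.map(period_rank))
    )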

    print("\n=== Period-Level Summary (status='ok') ===")
    print(period_summary.to_string(index=False))
else:
    print("\nMissing one or more required columns for period summary:", required)

### Dataset #2 

See instructions above for Dataset #1. Feel free to keep adding as many more datasets as you need. Put each new dataset in its own section just like these.

Lastly, if you do have multiple datasets, add another section where you demonstrate how you will join, align, or cross-reference them to combine data from the different datasets.

Please note that you can always keep adding more datasets in the future if the datasets you turn in for the checkpoint aren't sufficient. The goal here is to demonstrate that you can obtain and wrangle data. You are not tied down to only use what you turn in right now.
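
As a rough illustration of the join step described above, here is a minimal sketch of lining up two datasets on a shared `year` key. It assumes the SteamCharts rows are first collapsed to one row per year. The second table, its `other_metric` column, and all values shown are hypothetical placeholders rather than real project data.

```python
import pandas as pd

# Toy stand-in for the SteamCharts data (values are made up for illustration).
steam = pd.DataFrame(
    {
        "year": [2018, 2018, 2019, 2020, 2020],
        "avg_players": [100.0, 300.0, 250.0, 400.0, 999.0],
        "status": ["ok", "ok", "ok", "ok", "error"],
    }
)

# Hypothetical second dataset keyed by year; `other_metric` is a placeholder name.
other = pd.DataFrame({"year": [2018, 2019, 2021], "other_metric": [1.5, 2.5, 3.5]})

# Keep successful scrapes only, then collapse to one row per year
# so the join key is unique on the Steam side.
steam_yearly = (
    steam[steam["status"] == "ok"]
    .groupby("year", as_index=False)
    .agg(avg_players_mean=("avg_players", "mean"))
)

# Left join keeps every Steam year. `indicator=True` adds a `_merge` column
# flagging years with no match, and `validate` raises an error if either
# side has duplicate years.
joined = steam_yearly.merge(
    other, on="year", how="left", validate="one_to_one", indicator=True
)
print(joined.to_string(index=False))
```

Checking the `_merge` column after the join is a quick way to see which years fail to line up before doing any analysis on the combined table.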

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Team expectations are also [found separately here](./admin/rules.md).

### Communication
**Discord** is our main form of communication. We have a group chat.
- **Responding** If a message needs a response or acknowledgement, respond within about 1 day/24 hours (48 hours can be acceptable if there's an emergency). The exception is when we have a planned meeting coming up and we ask for a ready check; in that case, an almost immediate response is expected.
- **Respect** Stay reasonably respectful to one another. It's okay to disagree, but do talk about the issue together or bring in another person (or the entire group) to mediate if needed. If you don't talk about something, there's no way for us to know what's wrong.
  
### Missing Tasks/Meetings
- **Tasks** If you can't complete a task, let us know as soon as possible (i.e. as soon as you find out) so we can reorganize task assignment or move our schedule around.
- **Meetings** If you can't make a meeting, that's okay, and it's not detrimental. However, it does mean you can't provide your input live. In that case, share your thoughts and ideas in our group chat so we can discuss them. We do take meeting notes, so please read them to stay up to date with the team.

### Team Structure and Decision Making
- **Team Roles** We don't plan on having established team roles, but we'll try to have everyone do a bit of everything (to the best of our ability). The only real "role" we'll have is one note taker per meeting.
- **Task Tracking** We'll use the GitHub Projects tab/Kanban on the team repository. 
- **Decision Making** If we need to make a decision, we'll hold a vote, and the option with the most votes wins.

### Addressing Problem Members
This is our protocol for addressing teammates who are non-responsive or refuse to do work:
1. First offense: check in and see if everything is okay.
2. Second offense: what we do depends on the situation, but we'll talk with you again.
3. Clearly becoming a pattern: talk to a TA and/or the professor.

## Project Timeline Proposal

| Type | Date | Meeting/Due Time | To Complete Before Meeting | Discuss at Meeting |
| ---- | ---- | ---- | ---- |  ---- |
| Meeting | 2/22 | 2pm  | Read up on EDA checkpoint requirements; come into meeting with ideas on how to approach things | Discuss EDA and split up tasks for EDA checkpoint. |
| Meeting | 3/1  | 2pm  | Make 70-80% progress on EDA tasks | Check in on EDA progress and see what needs to be done. |
| **DUE** EDA Checkpoint | 3/4 | 11:59pm | - | - |
| Meeting | 3/8  | 2pm  | Wrap up any loose ends we didn't finish (if applicable). Read up on final project expectations. | Discuss final project tasks and split up tasks. |
| Meeting | 3/15 | 2pm  | Make about 80% progress on final project tasks. | Discuss the video work. |
| **DUE** Final Project + Video | 3/18 | 11:59pm | - | - |