# 01 — Data Collection Log

This notebook documents the data collection process: what was collected, from which sources, the problems encountered, and solutions applied.

## Data Sources

| Source | What it provides | Rate limit | Target |
|--------|-----------------|------------|--------|
| **Steam Web API** | User-game ownership + playtime via friend-graph BFS | ~100K req/day | 5,000–10,000 public profiles |
| **SteamSpy** | Estimated owners, price, reviews, CCU per game | 4 req/sec | All games in user dataset |
| **RAWG** | Genres, tags, platforms, Metacritic, release dates | 20K req/month | All games (fuzzy-matched by title) |
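
The rate limits above imply some client-side throttling. Below is a minimal sketch of one way to stay under a per-second limit such as SteamSpy's. The `Throttle` class and its parameters are hypothetical and are not taken from `src.collect`. The clock and sleep functions are injectable so the spacing can be checked without actually waiting.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Demonstration with a fake clock, so nothing really sleeps
fake_now = [0.0]
sleeps = []


def fake_sleep(seconds):
    sleeps.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_per_second=4, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # in real use, an API request would follow each wait()
```

In real use, `Throttle(max_per_second=4)` with the default clock and sleep would be created once per API and `wait()` called before every request.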

In [None]:
import json
import sys
from pathlib import Path

import pandas as pd

# Add the project root (the parent of the current working directory) to sys.path so `src` is importable
sys.path.insert(0, str(Path.cwd().parent))

from src.utils import load_config, setup_logging

setup_logging()

RAW_DIR = Path("../data/raw")
PROCESSED_DIR = Path("../data/processed")

## Stage 1: Steam Friend-Graph Crawl

**Strategy:** Breadth-first search (BFS) from a set of seed Steam IDs, expanding through friend lists. For each public profile reached, collect every owned game with its playtime.

**Key decisions:**
- Skip private profiles (communityvisibilitystate ≠ 3)
- Include free-to-play games (important for market coverage)
- Log-transform playtime downstream (dampens, but does not remove, AFK/idle inflation)
- Checkpoint every 100 users for crash resilience
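
The strategy and key decisions above can be sketched as follows. This is an illustration under stated assumptions, not the project's crawler: `crawl`, `save_checkpoint`, and the three `get_*` callables are hypothetical stand-ins for the Steam Web API calls, and a toy friend graph replaces real network requests.

```python
import json
import tempfile
from collections import deque
from pathlib import Path

PUBLIC = 3  # communityvisibilitystate value for a public profile


def save_checkpoint(path, rows, queue, seen):
    """Write the crawl state to a temp file, then rename it over the old checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"rows": rows, "queue": list(queue), "seen": sorted(seen)}))
    tmp.replace(path)


def crawl(seeds, get_visibility, get_friends, get_games,
          max_users, checkpoint_path, checkpoint_every=100):
    """BFS over the friend graph, collecting (user, game, playtime) rows."""
    queue, seen = deque(seeds), set(seeds)
    rows, crawled = [], 0
    while queue and crawled < max_users:
        steam_id = queue.popleft()
        if get_visibility(steam_id) != PUBLIC:
            continue  # private profile: skip it and do not expand its friends
        for app_id, minutes in get_games(steam_id):
            rows.append({"steam_id": steam_id, "app_id": app_id,
                         "playtime_forever": minutes})
        crawled += 1
        for friend in get_friends(steam_id):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
        if crawled % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, rows, queue, seen)
    save_checkpoint(checkpoint_path, rows, queue, seen)  # final state
    return rows


# Toy data (hypothetical): four users, one of them private
friends = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
visibility = {"A": 3, "B": 3, "C": 1, "D": 3}
games = {"A": [(570, 1200)], "B": [(730, 30), (570, 0)], "D": [(440, 600)]}

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "crawl_checkpoint.json"
    rows = crawl(
        seeds=["A"],
        get_visibility=visibility.get,
        get_friends=lambda steam_id: friends.get(steam_id, []),
        get_games=lambda steam_id: games.get(steam_id, []),
        max_users=10,
        checkpoint_path=checkpoint,
        checkpoint_every=2,
    )
    state = json.loads(checkpoint.read_text())

crawled_users = sorted({row["steam_id"] for row in rows})
```

Storing the queue and the visited set alongside the rows is one possible checkpoint layout; it lets a restarted crawl resume from the frontier instead of from the seeds.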

In [None]:
# Load the crawl results
ug_path = RAW_DIR / "user_games.csv"
if ug_path.exists():
    user_games_raw = pd.read_csv(ug_path)
    print(f"User-game rows:    {len(user_games_raw):,}")
    print(f"Unique users:      {user_games_raw['steam_id'].nunique():,}")
    print(f"Unique games:      {user_games_raw['app_id'].nunique():,}")
    print("\nPlaytime stats (minutes):")
    print(user_games_raw["playtime_forever"].describe())
else:
    print(f"No data yet at {ug_path}. Run: python -m src.collect --stage steam")

## Stage 2: SteamSpy Market Data

**Strategy:** For every unique app_id in the user dataset, pull detailed stats from SteamSpy.

**Key data:** Owner estimates (range string → parsed to low/mid/high), current price (cents), positive/negative reviews, average/median playtime.
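
A minimal sketch of the range parsing mentioned above. It assumes the owners field is a string of the form `"20,000 .. 50,000"` and that the midpoint is the arithmetic mean of the two bounds; `parse_owner_range` and the `owners_low` / `owners_high` keys are illustrative names (only `owners_mid` appears in this notebook).

```python
def parse_owner_range(owners: str) -> dict:
    """Split an owner range string into integer low, mid, and high estimates."""
    low_text, _, high_text = owners.partition("..")
    low = int(low_text.replace(",", "").strip())
    high = int(high_text.replace(",", "").strip())
    return {
        "owners_low": low,
        "owners_mid": (low + high) // 2,  # assumption: arithmetic midpoint
        "owners_high": high,
    }


parsed = parse_owner_range("20,000 .. 50,000")
```

A geometric midpoint would be a defensible alternative for ranges that span an order of magnitude; the arithmetic one is used here only for simplicity.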

In [None]:
spy_path = RAW_DIR / "steamspy_details.json"
if spy_path.exists():
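    with open(spy_path, encoding="utf-8") as f:
        spy_data = json.load(f)
    # If the file is keyed by app_id (one dict per game), make each game a row
    if isinstance(spy_data, dict):
        spy_df = pd.DataFrame.from_dict(spy_data, orient="index")
    else:
        spy_df = pd.DataFrame(spy_data)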
    print(f"SteamSpy records:  {len(spy_df):,}")
    print(f"Columns:           {list(spy_df.columns)}")
    if "owners_mid" in spy_df.columns:
        print("\nOwner midpoint stats:")
        print(spy_df["owners_mid"].describe())
else:
    print(f"No data yet at {spy_path}. Run: python -m src.collect --stage steamspy")

## Stage 3: RAWG Metadata

**Strategy:** For each unique game, search RAWG by title and accept the best candidate only if its fuzzy-match score (token_sort_ratio) is at least 75 out of 100. This tolerates naming differences such as trademark symbols, subtitles, and regional variants.

**Key data:** Genres, tags, platforms, Metacritic score, release date, developers, publishers.
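
To illustrate the matching idea, here is a standard-library approximation of a token-sort comparison. It is not the `token_sort_ratio` implementation from any fuzzy-matching package: it uses `difflib.SequenceMatcher` as a stand-in, so scores will differ somewhat from the real function. The names `normalize_title`, `token_sort_score`, and `best_match` are hypothetical.

```python
import re
from difflib import SequenceMatcher

MATCH_THRESHOLD = 75.0  # same cut-off as in the strategy above


def normalize_title(title: str) -> str:
    """Lowercase, drop trademark symbols and punctuation, and sort the tokens."""
    cleaned = re.sub(r"[™®©]", "", title.lower())
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    return " ".join(sorted(cleaned.split()))


def token_sort_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale between the token-sorted forms of two titles."""
    return 100.0 * SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def best_match(steam_title: str, candidates: list):
    """Return (candidate, score) for the best candidate, or None if below threshold."""
    scored = [(candidate, token_sort_score(steam_title, candidate)) for candidate in candidates]
    if not scored:
        return None
    winner = max(scored, key=lambda pair: pair[1])
    return winner if winner[1] >= MATCH_THRESHOLD else None


same_game = token_sort_score("The Witcher® 3: Wild Hunt", "Wild Hunt: The Witcher 3")
different_game = token_sort_score("Portal 2", "Stardew Valley")
match = best_match("The Witcher® 3: Wild Hunt", ["Stardew Valley", "The Witcher 3: Wild Hunt"])
no_match = best_match("Portal 2", ["Stardew Valley"])
```

Sorting tokens before comparing makes the score insensitive to word order, which is why the reordered Witcher title still scores as identical.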

In [None]:
rawg_path = RAW_DIR / "rawg_metadata.json"
if rawg_path.exists():
    with open(rawg_path, encoding="utf-8") as f:
        rawg_data = json.load(f)
    matched = rawg_data.get("matched", [])
    unmatched = rawg_data.get("unmatched", [])
    total = len(matched) + len(unmatched)
    match_rate = len(matched) / total if total > 0 else 0
    print(f"RAWG matched:      {len(matched):,}")
    print(f"RAWG unmatched:    {len(unmatched):,}")
    print(f"Match rate:        {match_rate:.1%}")
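    if matched:
        rawg_df = pd.DataFrame(matched)
        # Guard against records that lack a score, mirroring the owners_mid check in Stage 2
        if "match_score" in rawg_df.columns:
            print("\nMatch score distribution:")
            print(rawg_df["match_score"].describe())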
else:
    print(f"No data yet at {rawg_path}. Run: python -m src.collect --stage rawg")

## Data Quality & Bias Discussion

### Known Biases (documented honestly)

1. **Selection bias:** Friend-graph crawling over-represents socially connected users. Accounts that the BFS cannot reach or read (no friends, or a private profile) are systematically excluded.

2. **Visibility bias (self-selection):** Only public profiles are visible to the crawler. Users who set their profiles to private may have different gaming patterns, and that difference cannot be measured from this data.

3. **Playtime ≠ enjoyment:** Idle hours, AFK farming, and games left running in the background inflate playtime. We flag extreme outliers (>99.9th percentile) but cannot fully correct for this.

4. **SteamSpy owner estimates:** Estimates carry ±20–30% uncertainty for smaller titles, and the reported ranges are wider for games with fewer data points.

5. **Price is current, not historical:** The current price does not account for launch discounts, seasonal sales, or bundle deals, so revenue estimates based on it should be read as upper bounds.

6. **RAWG coverage gaps:** Not all Steam games exist in RAWG. The match rate is printed by the Stage 3 cell above. Unmatched games tend to be very small or niche titles.
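
The outlier flag from item 3 and the log transform from Stage 1 can be sketched with the standard library. The playtime values below are synthetic, and the use of `log1p` (rather than a plain logarithm) is an assumption made so that zero-minute entries stay defined.

```python
import math
import statistics

# Synthetic playtimes in minutes: 2,000 ordinary values plus one idle-farming outlier
playtimes = [(i % 500) * 6 for i in range(2000)] + [600_000]

# 99.9th percentile: the last of the 999 cut points that split the data into 1,000 parts
threshold = statistics.quantiles(playtimes, n=1000, method="inclusive")[-1]
flagged = [minutes for minutes in playtimes if minutes > threshold]

# log1p maps 0 minutes to 0.0 and compresses the long right tail
log_playtimes = [math.log1p(minutes) for minutes in playtimes]
```

On the real `user_games_raw` frame, `Series.quantile(0.999)` would give a comparable threshold. Flagging rather than dropping keeps the decision about extreme rows with the downstream analysis.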

### Mitigation

- All revenue estimates state explicit assumptions
- We report ranges (25th–75th percentile), not point estimates
- Bias discussion is included in every notebook that produces quantitative claims
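
As one illustration of reporting a range rather than a point estimate, the sketch below computes a 25th–75th percentile range over per-game revenue ceilings. The figures are made-up toy values, and the formula (midpoint owners × current price) is an assumption for illustration only; this notebook does not define the project's revenue model.

```python
import statistics

# Toy (owners_mid, price_cents) pairs; a price of 0 represents a free-to-play title
toy_games = [
    (35_000, 999),
    (150_000, 1999),
    (10_000, 499),
    (750_000, 0),
    (35_000, 1499),
]

# Assumed ceiling: every owner paid the full current price (converted to dollars)
revenue_ceilings = [owners * price_cents / 100 for owners, price_cents in toy_games]

# Quartile cut points; the first and last bound the middle 50% of games
q1, median, q3 = statistics.quantiles(revenue_ceilings, n=4, method="inclusive")
print(f"Revenue ceiling, 25th-75th percentile: ${q1:,.0f} to ${q3:,.0f}")
```

Stating the assumption next to the number, as the comment does here, is what the first mitigation bullet asks for.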

In [None]:
# Load and display the data quality report
report_path = PROCESSED_DIR / "data_quality_report.json"
if report_path.exists():
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    for key, value in report.items():
        print(f"{key}: {value}")
else:
    print("Run the clean stage first: python -m src.collect --stage clean")