## League of Legends - Data Processing Notebook

Purpose: This notebook focuses strictly on acquiring and persisting the RAW Riot match data (match detail + timeline). No cleaning, feature engineering, or train/val/test splitting happens here — those steps live in the DAIA and Modeling notebook.

Scope in this notebook:
- Authenticate with Riot API and fetch raw match JSONs to `data/raw/`.
- Maintain a lightweight `matchlist.json` to track which matches are available locally.

Out of scope (handled in DAIA & Modeling):
- Building pregame or 15-minute feature tables.
- Cleaning, filtering, or schema validation.
- Train/validation/test splits and modeling.

## Introduction & Goals

**Goals of this notebook:**

-   Collect match data from the Riot API.

-   Parse and extract both pre-game features (champion picks, roles) and early-game features at 15 minutes (gold diff, kills, objectives).

-   Produce a cleaned, versioned dataset (CSV) with clear documentation for each column.

**Why separate processing from modeling?**

Riot match timelines and match lists can become large (MBs–GBs). Processing and cleaning in a dedicated notebook or script avoids repeated heavy work during model experimentation and keeps the ML notebook focused on training and evaluation.

## Explainable AI (XAI)

### Key Concepts
- XAI = Explainable Artificial Intelligence; methods and techniques to make AI models transparent and understandable to humans.
- **Global explanations**: Show which features are most influential on average across a dataset.
- **Local explanations**: Explain why a specific prediction was made.
- **Intrinsic vs Post-hoc**: 
  - **Intrinsic**: Models that are transparent by design (e.g., decision trees, linear models).  
  - **Post-hoc**: Methods applied to black-box models after training (e.g., SHAP, LIME, counterfactuals).
- **Visualization & communication**: Plots, summary graphs, or textual explanations make insights accessible for different audiences.
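
To make the global/local distinction concrete, here is a minimal sketch built on an intrinsically interpretable toy model (a logistic model whose log-odds are a weighted sum of features). The weights, the bias, and the `first_tower_blue` feature are hypothetical values chosen for illustration; nothing here is fitted to the data in this project.

```python
import math

# Hypothetical weights for a toy logistic model (illustrative, not fitted).
WEIGHTS = {"gold_diff_15": 0.0004, "kills_diff_15": 0.05, "first_tower_blue": 0.3}
BIAS = 0.0


def local_contributions(row):
    """Local explanation: each feature's contribution to the log-odds of one match."""
    return {name: WEIGHTS[name] * row[name] for name in WEIGHTS}


def predict_blue_win_probability(row):
    """Turn the summed contributions into a probability with the logistic function."""
    logit = BIAS + sum(local_contributions(row).values())
    return 1.0 / (1.0 + math.exp(-logit))


def global_importance(rows):
    """Global explanation: mean absolute contribution per feature across all matches."""
    totals = {name: 0.0 for name in WEIGHTS}
    for row in rows:
        for name, value in local_contributions(row).items():
            totals[name] += abs(value)
    return {name: total / len(rows) for name, total in totals.items()}


# Two made-up matches: blue ahead in the first, behind in the second.
matches = [
    {"gold_diff_15": 2500, "kills_diff_15": 4, "first_tower_blue": 1},
    {"gold_diff_15": -1000, "kills_diff_15": -2, "first_tower_blue": 0},
]

print("Local (match 0):", local_contributions(matches[0]))
print("Global:", global_importance(matches))
```

For a black-box model the same two questions are answered post hoc, for example with SHAP values, rather than read directly from the weights.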

### Pros of XAI
- **Transparency:** Users understand why a model makes a prediction.
- **Trust:** Increases adoption by showing clear reasoning.
- **Debugging & improvement:** Identifies if models rely on meaningful or misleading signals.
- **Actionable insights:** Helps make decisions in strategy, operations, or coaching.
- **Flexible communication:** Tailored explanations for technical and non-technical users.

### Cons and Dangers
- **Complexity:** Adds extra steps to modeling pipelines.
- **Misinterpretation:** Oversimplified explanations may mislead users.
- **Trade-off with accuracy:** Simple interpretable models may be less accurate; black-box models need complex post-hoc explanations.
- **Computational cost:** Methods like SHAP can be resource-intensive on large datasets.
- **Partial explanations:** Some methods only approximate reasoning, which can be risky if users over-trust them.
- **Ethical & safety concerns:** Revealing certain features or internal logic may expose vulnerabilities (e.g., security risks, sensitive data leaks).

### Application Across Domains

#### Games (General)
- XAI is used to explain AI behavior in a wide range of games, from strategy to simulation:  
  - **NPC behavior**: Why an AI-controlled character made a certain move or decision.  
  - **Dynamic difficulty adjustment (DDA)**: Why the game adapts difficulty based on player performance.  
  - **Analytics and predictions**: Explaining why an AI predicts a certain outcome or strategy is likely to succeed.  
- Benefits: Helps developers debug and balance games, improves player trust, and provides educational insight into gameplay mechanics.  
- Challenges: High-dimensional state spaces, real-time decision-making, and latent human factors like coordination or communication are hard to capture.

#### League of Legends (Specific)
- Early-game gold, first tower, and objectives are strong predictors of match outcomes.  
- XAI helps **coaches, players, and analysts** understand *why* a team is favored.  
- Limitations: Some factors like team synergy, communication, or morale cannot be measured numerically.  
- Goal: Make predictions interpretable and actionable in real time without compromising fairness or strategy secrecy.

#### Science and Engineering
- XAI explains results from simulations, experiments, or predictive maintenance models.
- Enables **trust, safety, and accountability**: Users must understand why models make predictions that impact real-world systems.
- Examples: AI flagging faulty machinery, medical diagnosis, or chemical simulations.
- Challenge: Balancing **high performance** (accuracy) and **interpretability** is critical for safety.

#### Ethical and Regulatory Considerations
- XAI is increasingly required for ethical, legal, and safety compliance (e.g., GDPR in Europe, AI safety standards).
- Certain aspects of models should **not be fully disclosed** to prevent misuse:
  - Trade secrets or proprietary algorithms.
  - Sensitive personal data in healthcare or finance.
  - Safety-critical logic where exposing details could be dangerous (e.g., autonomous vehicles or critical infrastructure).
- Regulation ensures transparency without compromising security or privacy.

### What to Explain and What Not to Explain
- **Explain:**  
  - Key predictive features and their contributions.  
  - Model reasoning that impacts human decision-making.  
  - Patterns that help users learn, plan, or optimize strategy.
- **Do not explain fully:**  
  - Sensitive data or personally identifiable information.  
  - Internal logic that could create security risks.  
  - Proprietary algorithms where disclosure is not allowed.

### Personal Learning Takeaways
- XAI bridges the gap between AI predictions and human understanding, making models more trustworthy and actionable.
- Choosing the right model depends on **accuracy, interpretability, and domain needs**.
- Safety, ethics, and regulations are as important as performance: what we explain and what we hide must be carefully considered.
- Applications differ by domain: in games, clarity and strategic insight are priorities; in science or engineering, safety, reliability, and reproducibility are critical.


## Environment & setup

In [4]:
# Standard library
import json
import os
import time
from typing import Any, Dict, List

# Third-party
import matplotlib.pyplot as plt
import pandas as pd
import pyarrow  # Parquet engine used by pandas
import fastparquet  # alternative Parquet engine
import requests
import seaborn as sns
from dotenv import load_dotenv
from tqdm import tqdm

# Plotting defaults
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1)

# Load environment variables from a local .env file, if present
load_dotenv()


RIOT_API_KEY = os.getenv('RIOT_API_KEY')
if not RIOT_API_KEY:
    raise RuntimeError('Set RIOT_API_KEY as an environment variable or in a .env file')


HEADERS = {"X-Riot-Token": RIOT_API_KEY}

## Data sourcing (Riot API overview)

### Which endpoints are used:

- **Match IDs** by **PUUID:** /lol/match/v5/matches/by-puuid/{puuid}/ids - returns a list of match IDs for a given player.

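- **Match detail by matchId**: /lol/match/v5/matches/{matchId} - returns the full match info.

- **Match timeline by matchId**: /lol/match/v5/matches/{matchId}/timeline - returns the per-frame timeline. It is a separate endpoint, so each match costs two requests.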

### Filtering notes:

- Use **queue=420** to restrict to **Ranked Solo/Duo matches (competitive).**

- Filter out **remakes** or **extremely short games** (e.g., gameDuration < 9 minutes) to avoid noisy examples.
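
The two filters above can be sketched as a single predicate on a raw match JSON. This is a minimal sketch, assuming `info.gameDuration` is expressed in seconds (as in the schema further down); the function name and the 9-minute default are illustrative.

```python
RANKED_SOLO_QUEUE_ID = 420
MIN_DURATION_SECONDS = 9 * 60  # assumption: gameDuration is in seconds


def is_usable_match(match_json, min_duration_s=MIN_DURATION_SECONDS):
    """Return True for Ranked Solo/Duo matches that lasted long enough to be useful."""
    info = match_json.get("info", {})
    if info.get("queueId") != RANKED_SOLO_QUEUE_ID:
        return False
    duration = info.get("gameDuration") or 0
    return duration >= min_duration_s


# Made-up examples: a normal ranked game and a remake
print(is_usable_match({"info": {"queueId": 420, "gameDuration": 1800}}))
print(is_usable_match({"info": {"queueId": 420, "gameDuration": 200}}))
```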

### Rate limits & best practices:

- Riot enforces **rate limits**, so I will be using **time.sleep** to pause between requests and wait longer if blocked (HTTP 429).

- **Cache** raw match JSONs to disk so you don't re-download them.

## Data requirements and Schema


### 1. Core Features (and why)

- **Match metadata**: `matchId`, `queueId`, `gameDuration` - needed to filter matches.
- **Champion & role picks**: champion IDs/names, team positions - needed for pre-game features.
- **Lane metrics at 15 minutes**: gold diff, CS diff, XP diff, K/D/A per lane - strong early indicators.
- **Team objectives by 15 minutes**: `firstTower`, `firstDragon`, `firstHerald`, void grubs (`horde`).
- **Target variables**: `blue_win` (0/1) and `win_probability` (for model output).



### 2. Proposed Output Schema (one row per match)

### Match Metadata
- `matchId` (`str`)
- `queueId` (`int`)
- `gameDuration` (`int`, seconds)
- `blue_win` (`0/1`)

### Champion Picks & Roles
- `blue_champions` (list of champion IDs)
- `red_champions` (list of champion IDs)
- `blue_roles` (list of positions)
- `red_roles` (list of positions)

### Early Features (15 min)
- `blue_gold_15`, `red_gold_15`, `gold_diff_15` - team gold totals and their difference (blue minus red) at 15 minutes
- `blue_cs_15`, `red_cs_15`, `cs_diff_15` - team farm (CS) totals and their difference (blue minus red) at 15 minutes
- `blue_xp_15`, `red_xp_15`, `xp_diff_15` - team XP totals and their difference (blue minus red) at 15 minutes
- `blue_kills_15`, `red_kills_15`, `kills_diff_15` - team kill totals and their difference (blue minus red) at 15 minutes

### Objectives (15 min)
- `first_tower` (None / blue / red)
- `first_dragon` (None / blue / red)
- `first_herald` (None / blue / red)
- `first_grub` (None / blue / red)
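
A lightweight check of one output row against this schema could look like the sketch below. It covers only the scalar columns (not the champion and role lists), and `validate_row` plus the example values are hypothetical helpers for illustration.

```python
from typing import Any, Dict, List

INT_COLUMNS = ["queueId", "gameDuration", "blue_win"]
STATS = ["gold", "cs", "xp", "kills"]
OBJECTIVE_COLUMNS = ["first_tower", "first_dragon", "first_herald", "first_grub"]


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row looks fine."""
    problems: List[str] = []
    if not isinstance(row.get("matchId"), str):
        problems.append("matchId must be a string")
    for col in INT_COLUMNS:
        if not isinstance(row.get(col), int):
            problems.append(f"{col} must be an int")
    if row.get("blue_win") not in (0, 1):
        problems.append("blue_win must be 0 or 1")
    for stat in STATS:
        blue = row.get(f"blue_{stat}_15")
        red = row.get(f"red_{stat}_15")
        diff = row.get(f"{stat}_diff_15")
        if blue is None or red is None or diff is None:
            problems.append(f"{stat} columns are incomplete")
        elif diff != blue - red:
            problems.append(f"{stat}_diff_15 should equal blue_{stat}_15 - red_{stat}_15")
    for col in OBJECTIVE_COLUMNS:
        if row.get(col) not in (None, "blue", "red"):
            problems.append(f"{col} must be None, 'blue', or 'red'")
    return problems


# Made-up row that follows the schema
example_row = {
    "matchId": "EUW1_0000000000", "queueId": 420, "gameDuration": 1800, "blue_win": 1,
    "blue_gold_15": 25000, "red_gold_15": 23500, "gold_diff_15": 1500,
    "blue_cs_15": 350, "red_cs_15": 340, "cs_diff_15": 10,
    "blue_xp_15": 30000, "red_xp_15": 29000, "xp_diff_15": 1000,
    "blue_kills_15": 6, "red_kills_15": 4, "kills_diff_15": 2,
    "first_tower": "blue", "first_dragon": "red", "first_herald": None, "first_grub": "blue",
}
print(validate_row(example_row))
```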



## Storage strategy & versioning

For this project, storage is intentionally simple in this notebook: we cache RAW JSON responses under `data/raw/` and maintain `matchlist.json`. Processed/cleaned tables are created downstream in the DAIA & Modeling notebook.

- Raw data: every match JSON from the Riot API lives in `data/raw/` (both match detail and timeline). We never overwrite these so we can always reprocess.
- Match index: `matchlist.json` lists the match IDs we have locally.
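
One way to enforce the "never overwrite" rule at the file level is Python's exclusive-creation mode (`"x"`), which fails if the file already exists. The sketch below is an assumption about how this could be done, not the helper used later in this notebook; `save_raw_once` and the example match ID are hypothetical.

```python
import json
import os
import tempfile


def save_raw_once(path, data):
    """Write data as JSON only if path does not exist yet; return True if written."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    try:
        # Mode "x" raises FileExistsError instead of truncating an existing file
        with open(path, "x", encoding="utf-8") as f:
            json.dump(data, f)
    except FileExistsError:
        return False
    return True


# Demo in a temporary directory so nothing under data/raw/ is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "raw", "EUW1_0000000000.json")
    print(save_raw_once(demo_path, {"version": 1}))  # first write
    print(save_raw_once(demo_path, {"version": 2}))  # second attempt is refused
```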

Note: Processed artifacts (Parquet/CSV) will be written by the DAIA notebook, not here. This keeps acquisition (this notebook) separate from processing/modeling (DAIA).

## Raw data collection

### Important initial design choices:

- I am going to collect match IDs by querying several PUUIDs (players). Prefer high-activity public accounts.

- For the initial proof of concept, I will collect 1,000–5,000 matches. For the final model, I will have tens of thousands.
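
These targets translate directly into download time. The rough estimate below assumes two requests per match (detail plus timeline) and a budget of 99 requests per 120-second window, the values used in the collection cell; it ignores match-list lookups and any shorter per-second limit, so treat it as a lower bound.

```python
def estimate_download_hours(n_matches, requests_per_match=2, max_requests=99, window_s=120):
    """Approximate wall-clock hours needed to download n_matches under a windowed rate limit."""
    total_requests = n_matches * requests_per_match
    return total_requests / max_requests * window_s / 3600


for n in (1_000, 5_000, 10_000):
    print(f"{n:>6} matches: ~{estimate_download_hours(n):.1f} h")
```

Under these assumptions, 10,000 matches take several hours, which is why caching raw JSONs and resuming interrupted runs matters.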

In [10]:
# Constants + basic Riot fetch helpers (simple version)
REGION = 'europe'  # riot regional routing
HEADERS = {"X-Riot-Token": RIOT_API_KEY}  # auth header
MATCHLIST_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/by-puuid/{{puuid}}/ids'
MATCH_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}'
TIMELINE_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}/timeline'
RAW_DIR = 'data/raw'

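# Basic GET with retries, a request timeout, and rate limit handling
def safe_get(url, params=None, retries=5, timeout=10):
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, params=params, timeout=timeout)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:  # rate limited: honor the Retry-After header
            wait = int(resp.headers.get('Retry-After', 1))
            print(f"Rate limited. Waiting {wait}s...")
        elif resp.status_code in (401, 403, 404):  # retrying will not help
            raise RuntimeError(f"GET {url} failed with HTTP {resp.status_code}")
        else:  # likely a transient server error: back off exponentially
            wait = 2 ** attempt
        time.sleep(wait)
    raise RuntimeError(f"Failed to GET {url} after {retries} retries")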

# Get player unique id from riot name + tag
def get_puuid(game_name, tag_line):
    url = f"https://{REGION}.api.riotgames.com/riot/account/v1/accounts/by-riot-id/{game_name}/{tag_line}"
    resp = requests.get(url, headers=HEADERS)
    if resp.status_code == 200:
        return resp.json().get("puuid")
    print(f"Failed to get PUUID for {game_name}#{tag_line}: {resp.status_code}")
    return None

# Just grab some match ids for one player (solo queue)
def fetch_match_ids(puuid, count=20, queue=420):
    params = {"start": 0, "count": count, "queue": queue}
    return safe_get(MATCHLIST_URL.format(puuid=puuid), params=params)

# Save both match + timeline JSON locally if missing (cache)
def fetch_and_save_match_with_timeline(match_id):
    os.makedirs(RAW_DIR, exist_ok=True)

    match_path = os.path.join(RAW_DIR, f"{match_id}.json")
    if not os.path.exists(match_path):  # don't redownload
        match_data = safe_get(MATCH_URL.format(matchId=match_id))
        with open(match_path, 'w', encoding='utf-8') as f:
            json.dump(match_data, f)

    timeline_path = os.path.join(RAW_DIR, f"{match_id}_timeline.json")
    if not os.path.exists(timeline_path):
        timeline_data = safe_get(TIMELINE_URL.format(matchId=match_id))
        with open(timeline_path, 'w', encoding='utf-8') as f:
            json.dump(timeline_data, f)

    return match_path, timeline_path

# Generic fetch/save helper
def fetch_and_save_json(url, path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):  # already there
        return path
    data = safe_get(url)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    return path

# Single test player (can add more later)
players = [{"name": "BB99", "tag": "BLZNT"}]

# Resolve their PUUIDs
for p in players:
    p["puuid"] = get_puuid(p["name"], p["tag"])
puuids = [p["puuid"] for p in players if p.get("puuid")]
print("PUUIDs fetched:", puuids)

# Rate limiting guard (rough)
all_match_ids = []
MAX_REQUESTS = 99  # under 100 per 2 min window
WINDOW = 120
request_count = 0
start_time = time.time()

for puuid in puuids:
    match_ids = fetch_match_ids(puuid, count=20)  # small sample
    all_match_ids.extend(match_ids)
    
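    for match_id in tqdm(match_ids, desc=f"Downloading matches for {puuid}"):
        match_path = os.path.join(RAW_DIR, f"{match_id}.json")
        timeline_path = os.path.join(RAW_DIR, f"{match_id}_timeline.json")

        # Count only cache misses: files already on disk cost no API request
        request_count += sum(not os.path.exists(path) for path in (match_path, timeline_path))

        # Full match, then timeline (each is skipped if already cached)
        fetch_and_save_json(MATCH_URL.format(matchId=match_id), match_path)
        fetch_and_save_json(TIMELINE_URL.format(matchId=match_id), timeline_path)
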
        if request_count >= MAX_REQUESTS:
            elapsed = time.time() - start_time
            sleep_time = max(0, WINDOW - elapsed)
            if sleep_time > 0:
                print(f"Sleeping {sleep_time:.1f}s to avoid rate limit...")
                time.sleep(sleep_time)
            start_time = time.time()
            request_count = 0

# Persist the collected list
with open("matchlist.json", "w") as f:
    json.dump(all_match_ids, f, indent=2)

print(f" Downloaded {len(all_match_ids)} matches to '{RAW_DIR}/'")

PUUIDs fetched: ['T9lGIR8iroZCD7e9-3hWNs-wg9h3eA1UjcT_4YsKlkdZY0L9tZWhJlMg9kGT99wpuBNZxT5iDpY3lg']


Downloading matches for T9lGIR8iroZCD7e9-3hWNs-wg9h3eA1UjcT_4YsKlkdZY0L9tZWhJlMg9kGT99wpuBNZxT5iDpY3lg: 100%|██████████| 20/20 [00:00<00:00, 1362.76it/s]

 Downloaded 20 matches to 'data/raw/'





### Regenerate matchlist.json after adding new matches

In [7]:
# Regenerate matchlist.json from all match JSON files in data/raw
import os
import json

raw_dir = 'data/raw'
match_ids = []
for fname in os.listdir(raw_dir):
    if fname.endswith('.json') and not fname.endswith('_timeline.json'):
        match_id = fname[:-len('.json')]  # strip only the trailing extension
        match_ids.append(match_id)

# Remove duplicates, just in case
match_ids = sorted(set(match_ids))

with open("matchlist.json", "w") as f:
    json.dump(match_ids, f, indent=2)

print(f"Updated matchlist.json with {len(match_ids)} match IDs from {raw_dir}")

Updated matchlist.json with 10000 match IDs from data/raw


## Processing & feature extraction

### Design approach (how I actually set this up)



To keep it simple, the data is split into two tables based on the point in the game each one describes:



1. **Pregame table** - only info known before the game starts:

   - Champion picks by lane (top, jg, mid, bot, support) for each team.

   - Match duration (just for quick filtering later).



2. **15-minute table** - snapshot of early game:

   - First objective takers (`tower`, `dragon`, `herald`, void grubs = `horde`).

   - Team gold / xp / cs / kills and their diffs at 15:00.

   - Games shorter than 15 minutes are skipped (remakes / noise).



Both tables have one row per `matchId`, so joining them later is trivial.
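
As a minimal sketch of that join, assuming pandas and using made-up match IDs and values, the two tables can be merged on `matchId`. The `validate="one_to_one"` argument makes pandas raise an error if either side contains duplicate match IDs.

```python
import pandas as pd

# Tiny stand-ins for the two tables (hypothetical values)
pregame = pd.DataFrame({
    "matchId": ["EUW1_1", "EUW1_2", "EUW1_3"],
    "bluetop": ["Garen", "Darius", "Fiora"],
    "redtop": ["Teemo", "Jax", "Camille"],
})
fifteen = pd.DataFrame({
    "matchId": ["EUW1_1", "EUW1_3"],  # EUW1_2 was shorter than 15 minutes
    "gold_diff_15": [1200, -800],
    "blue_win": [1, 0],
})

# Inner join keeps only matches present in both tables
joined = pregame.merge(fifteen, on="matchId", how="inner", validate="one_to_one")
print(joined)
```

In the actual exports both tables carry a `gameDuration` column, so it should be dropped from one side before merging to avoid `_x`/`_y` suffixes.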



Why bother splitting? Faster iteration. I can train a draft-only model without loading timeline data, and I can tweak early game logic without touching the pregame export.



Storage:

- Parquet main files: `lol_pregame_data.parquet` and `lol_15min_data.parquet`.

- Small CSV previews (`*_preview.csv`) for a quick look at the dataset.



No manifest, no version bumps - if I change logic I just overwrite the files as this is a small scale project. If I later need history I can start versioning.



Flow I run:

1. Regenerate `matchlist.json` from raw if needed.

2. Build pregame (no timeline needed).

3. Build 15m table (needs timeline + duration filter).

4. Save Parquet + preview CSVs.


## Pregame Table

In [9]:
# Build Pregame Table from local raw JSONs (no API calls)
import os
import json
import pandas as pd

RAW_DIR = 'data/raw'
PROCESSED_DIR = 'data/processed'
os.makedirs(PROCESSED_DIR, exist_ok=True)

# Utility: normalize Riot position fields to our 5 lanes
POS_MAP = {
    'TOP': 'top',
    'JUNGLE': 'jg',
    'MIDDLE': 'mid',
    'MID': 'mid',
    'BOTTOM': 'bot',
    'BOT': 'bot',
    'UTILITY': 'support',
    'SUPPORT': 'support',
}

def normalize_role(p):
    # Prefer teamPosition, then individualPosition, then lane/role combos
    team_pos = (p.get('teamPosition') or '').upper()
    indiv_pos = (p.get('individualPosition') or '').upper()
    lane = (p.get('lane') or '').upper()
    role = (p.get('role') or '').upper()

    for cand in (team_pos, indiv_pos, lane):
        if cand in POS_MAP:
            return POS_MAP[cand]

    # Duo bottom handling
    if lane in ('BOTTOM', 'BOT'):
        if 'SUPPORT' in role or role == 'UTILITY':
            return 'support'
        return 'bot'

    return None  # unknown

# Load full match list
with open('matchlist.json', 'r', encoding='utf-8') as f:
    match_ids = json.load(f)

rows = []
skipped_missing = []
for match_id in match_ids:
    match_path = os.path.join(RAW_DIR, f'{match_id}.json')
    if not os.path.exists(match_path):
        skipped_missing.append(match_id)
        continue

    try:
        with open(match_path, 'r', encoding='utf-8') as f:
            m = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file: treat it like a missing match
        skipped_missing.append(match_id)
        continue

    info = m.get('info', {})
    parts = info.get('participants', []) or []
    if len(parts) != 10:
        skipped_missing.append(match_id)
        continue

    # Prepare containers for each team
    blue = {'top': None, 'jg': None, 'mid': None, 'bot': None, 'support': None}
    red = {'top': None, 'jg': None, 'mid': None, 'bot': None, 'support': None}

    for p in parts:
        champ = p.get('championName') or str(p.get('championId'))
        team = 'blue' if p.get('teamId') == 100 else 'red'
        r = normalize_role(p)  # None if the role cannot be determined; that slot stays empty
        if team == 'blue':
            if r in blue and blue[r] is None:
                blue[r] = champ
        else:
            if r in red and red[r] is None:
                red[r] = champ

    row = {
        'matchId': match_id,
        'gameDuration': info.get('gameDuration'),
        'bluetop': blue['top'],
        'bluejg': blue['jg'],
        'bluemid': blue['mid'],
        'bluebot': blue['bot'],
        'bluesupport': blue['support'],
        'redtop': red['top'],
        'redjg': red['jg'],
        'redmid': red['mid'],
        'redbot': red['bot'],
        'redsupport': red['support'],
    }
    rows.append(row)

pregame_df = pd.DataFrame(rows)

# Save outputs (full + small preview)
pregame_parquet = os.path.join(PROCESSED_DIR, 'lol_pregame_data.parquet')
pregame_csv = os.path.join(PROCESSED_DIR, 'lol_pregame_data_preview.csv')

pregame_df.to_parquet(pregame_parquet, index=False)
pregame_df.head(50).to_csv(pregame_csv, index=False)

print(f"Pregame rows written: {len(pregame_df)}")
print(f"Missing raw match JSONs: {len(skipped_missing)}")
print('Files saved:')
print(' -', pregame_parquet)
print(' -', pregame_csv)

Pregame rows written: 10000
Missing raw match JSONs: 0
Files saved:
 - data/processed\lol_pregame_data.parquet
 - data/processed\lol_pregame_data_preview.csv


## Fifteen Minute Table

In [1]:
# Setup for 15-minute processing
# Define the target minute and output directory used below
import os

TARGET_MINUTE = 15  # minutes
processed_dir = 'data/processed'
os.makedirs(processed_dir, exist_ok=True)

print(f"TARGET_MINUTE set to {TARGET_MINUTE} minutes")
print(f"Processed output directory: {processed_dir}")


TARGET_MINUTE set to 15 minutes
Processed output directory: data/processed


In [None]:
# Minimal helpers to load raw JSON and aggregate 15m participant stats
import json
from typing import Dict

RAW_DIR = 'data/raw'


def load_match_json(match_id: str) -> Dict:
    path = os.path.join(RAW_DIR, f"{match_id}.json")
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


def load_timeline_json(match_id: str) -> Dict:
    path = os.path.join(RAW_DIR, f"{match_id}_timeline.json")
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Aggregate participant stats at target minute
def aggregate_participant_stats_at_minute(match_json: Dict, timeline_json: Dict, minute: int = None) -> Dict[int, Dict[str, int]]:
    if minute is None:
        minute = TARGET_MINUTE
    target_ms = minute * 60_000

    frames = timeline_json.get('info', {}).get('frames', [])
    if not frames:
        raise ValueError('Timeline has no frames')

    # Find the latest frame at or before target time
    frame_idx = None
    for i, fr in enumerate(frames):
        if fr.get('timestamp', 0) <= target_ms:
            frame_idx = i
        else:
            break
    if frame_idx is None:
        raise ValueError(f'No frame available at or before {minute} minutes')

    frame = frames[frame_idx]
    part_frames = frame.get('participantFrames') or {}

    # Initialize kills up to target frame from events
    kills = {pid: 0 for pid in range(1, 11)}
    for fr in frames[: frame_idx + 1]:
        for ev in fr.get('events', []) or []:
            if ev.get('type') == 'CHAMPION_KILL':
                kid = ev.get('killerId')
                if isinstance(kid, int) and 1 <= kid <= 10:
                    kills[kid] = kills.get(kid, 0) + 1

    out: Dict[int, Dict[str, int]] = {}
    for k, pf in part_frames.items():
        # Keys may be strings "1".."10"
        try:
            pid = int(k)
        except Exception:
            continue
        tg = pf.get('totalGold')
        xp = pf.get('xp')
        cs = pf.get('minionsKilled')  # lane minions only; jungle camps are reported separately as 'jungleMinionsKilled'
        # Fallbacks
        tg = 0 if tg is None else tg
        xp = 0 if xp is None else xp
        cs = 0 if cs is None else cs
        out[pid] = {
            'totalGold': int(tg),
            'xp': int(xp),
            'minionsKilled': int(cs),
            'kills': int(kills.get(pid, 0)),
        }

    if not out:
        raise ValueError('participantFrames missing at target minute')
    return out


In [None]:
# Rebuild 15-minute table with correct loop placement
import os
import pandas as pd
from tqdm import tqdm

# You can set SAMPLE_N=None to process all matches; keep a small sample first for a quick sanity-check
SAMPLE_N = None

MIN_GAME_DURATION = TARGET_MINUTE * 60
rows = []
skipped_short, missing_timeline, skipped_snapshot = [], [], []

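# Load the full list of match IDs tracked in matchlist.json
with open('matchlist.json', 'r', encoding='utf-8') as f:
    match_ids_all = json.load(f)

ids_iter = match_ids_all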
if SAMPLE_N is not None and SAMPLE_N > 0:
    ids_iter = match_ids_all[:SAMPLE_N]

for match_id in tqdm(ids_iter, desc=f"Processing matches (15m, SAMPLE_N={SAMPLE_N})"):
    try:
        match_json = load_match_json(match_id)
    except FileNotFoundError:
        continue

    game_duration = match_json.get('info', {}).get('gameDuration', 0)
    if game_duration < MIN_GAME_DURATION:
        skipped_short.append((match_id, game_duration))
        continue

    try:
        timeline_json = load_timeline_json(match_id)
    except FileNotFoundError:
        missing_timeline.append(match_id)
        continue

    try:
        stats_snapshot = aggregate_participant_stats_at_minute(match_json, timeline_json, minute=TARGET_MINUTE)
    except ValueError as exc:
        skipped_snapshot.append((match_id, str(exc)))
        continue

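    teams = match_json['info']['teams']

    # Note: objectives[...]['first'] is a whole-match flag, so the objective may
    # have been taken after the 15-minute mark. Assumes teams[0] is blue (teamId 100).
    def first_team(obj_key):
        try:
            if teams[0]['objectives'][obj_key]['first']:
                return 'blue'
            if teams[1]['objectives'][obj_key]['first']:
                return 'red'
        except (KeyError, IndexError, TypeError):
            return None
        return None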

    row = {
        'matchId': match_id,
        'queueId': match_json['info'].get('queueId'),
        'gameDuration': game_duration,
        'blue_win': int(teams[0]['win']),
        'first_tower': first_team('tower'),
        'first_dragon': first_team('dragon'),
        'first_herald': first_team('riftHerald'),
        'first_grub': first_team('horde'),
    }

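    participants = match_json['info']['participants']
    # Split participants by teamId (100 = blue, 200 = red) instead of relying on list order
    blue_ids = [p['participantId'] for p in participants if p.get('teamId') == 100]
    red_ids = [p['participantId'] for p in participants if p.get('teamId') == 200]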

    row.update({
        'blue_gold_15': sum(stats_snapshot.get(pid, {}).get('totalGold', 0) for pid in blue_ids),
        'red_gold_15': sum(stats_snapshot.get(pid, {}).get('totalGold', 0) for pid in red_ids),
        'blue_xp_15': sum(stats_snapshot.get(pid, {}).get('xp', 0) for pid in blue_ids),
        'red_xp_15': sum(stats_snapshot.get(pid, {}).get('xp', 0) for pid in red_ids),
        'blue_cs_15': sum(stats_snapshot.get(pid, {}).get('minionsKilled', 0) for pid in blue_ids),
        'red_cs_15': sum(stats_snapshot.get(pid, {}).get('minionsKilled', 0) for pid in red_ids),
        'blue_kills_15': sum(stats_snapshot.get(pid, {}).get('kills', 0) for pid in blue_ids),
        'red_kills_15': sum(stats_snapshot.get(pid, {}).get('kills', 0) for pid in red_ids),
    })

    row['gold_diff_15'] = row['blue_gold_15'] - row['red_gold_15']
    row['cs_diff_15'] = row['blue_cs_15'] - row['red_cs_15']
    row['xp_diff_15'] = row['blue_xp_15'] - row['red_xp_15']
    row['kills_diff_15'] = row['blue_kills_15'] - row['red_kills_15']

    rows.append(row)

# Build DataFrame and save
fixed_df_15m = pd.DataFrame(rows)
print(
    f"Processed: {len(ids_iter)} | Kept: {len(fixed_df_15m)} | "
    f"Skipped<15m: {len(skipped_short)} | MissingTimeline: {len(missing_timeline)} | SnapshotErrors: {len(skipped_snapshot)}"
)

# Only write if we have rows
if len(fixed_df_15m) > 0:
    fifteen_parquet = os.path.join(processed_dir, 'lol_15min_data.parquet')
    fifteen_csv = os.path.join(processed_dir, 'lol_15min_data_preview.csv')
    fixed_df_15m.to_parquet(fifteen_parquet, index=False)
    fixed_df_15m.head(50).to_csv(fifteen_csv, index=False)
    print('Saved 15-minute dataset files (Parquet + preview CSV) [corrected loop].')
else:
    print('No rows to write; check skip counters above.')


Number of base match JSON files in data/raw: 10000
Total match IDs from matchlist.json: 10000


Processing matches (15m):   9%|▊         | 860/10000 [00:40<07:11, 21.18it/s]



KeyboardInterrupt: 