## League of Legends - Data Processing Notebook

Purpose: This notebook documents a clear and reproducible data processing workflow for the League of Legends Match Prediction challenge. It explains why each step is needed, shows the code that collects and transforms Riot match data, and produces a clean dataset ready for modeling.

## Introduction & Goals

**Goals of this notebook:**

-   Collect match data from the Riot API.

-   Parse and extract both pre-game features (champion picks, roles) and early-game features at 15 minutes (gold diff, kills, objectives).

-   Produce a cleaned dataset (Parquet, with small CSV previews) with clear documentation for each column.

**Why separate processing from modeling?**

Riot match timelines and match lists can become large (MBs–GBs). Processing and cleaning in a dedicated notebook or script avoids repeated heavy work during model experimentation and keeps the ML notebook focused on training and evaluation.

## Environment & setup

In [9]:
# Import core Python libraries
import os
import time
import json
from typing import Dict, Any, List

import pandas as pd
import requests
from tqdm import tqdm
from dotenv import load_dotenv

# Parquet engines (imported here so a missing install fails early)
import pyarrow
import fastparquet

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1)


# Load environment variables
load_dotenv()


RIOT_API_KEY = os.getenv('RIOT_API_KEY')
if not RIOT_API_KEY:
    raise RuntimeError('Set RIOT_API_KEY as an environment variable or in a .env file')


HEADERS = {"X-Riot-Token": RIOT_API_KEY}

## Data sourcing (Riot API overview)

### Endpoints used:

- **Match IDs by PUUID**: /lol/match/v5/matches/by-puuid/{puuid}/ids - returns a list of match IDs for a given player.

- **Match detail by matchId**: /lol/match/v5/matches/{matchId} - returns the full match info. The timeline is not included in this response; it comes from a separate endpoint, /lol/match/v5/matches/{matchId}/timeline.

### Filtering notes:

- Use **queue=420** to restrict to **Ranked Solo/Duo matches (competitive).**

- Filter out **remakes** or **extremely short games** (e.g., gameDuration < 9 minutes) to avoid noisy examples.
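
The two filters above can be sketched as a single predicate over a downloaded match JSON. This is a minimal sketch, not the notebook's actual filter: `keep_match` and both constants are hypothetical names, and it assumes `gameDuration` is reported in seconds, as the schema later in this notebook states.

```python
from typing import Any, Dict

RANKED_SOLO_QUEUE_ID = 420      # Ranked Solo/Duo
MIN_DURATION_SECONDS = 9 * 60   # hypothetical cutoff, taken from the 9-minute example


def keep_match(match_json: Dict[str, Any],
               queue_id: int = RANKED_SOLO_QUEUE_ID,
               min_duration: int = MIN_DURATION_SECONDS) -> bool:
    """Return True if the match is in the wanted queue and long enough."""
    info = match_json.get("info", {})
    if info.get("queueId") != queue_id:
        return False
    # Assumes gameDuration is in seconds; a missing value counts as too short.
    return info.get("gameDuration", 0) >= min_duration


print(keep_match({"info": {"queueId": 420, "gameDuration": 1800}}))
```

Passing `queue=420` to the match-list endpoint already restricts the IDs returned, so the queue check here only acts as a second guard on cached files.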

### Rate limits & best practices:

- Riot enforces **rate limits**, so I use **time.sleep** to pause between requests and to wait for the `Retry-After` interval when a request is blocked (HTTP 429).

- **Cache** raw match JSONs to disk so you don't re-download them.
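
One way to decide how long to pause is sketched below. It is an illustration under assumptions rather than the notebook's own logic: `wait_seconds` is a hypothetical helper that prefers the `Retry-After` header value (in seconds) and falls back to capped exponential backoff when the header is missing or malformed.

```python
from typing import Optional


def wait_seconds(attempt: int, retry_after: Optional[str] = None,
                 base: float = 1.0, cap: float = 60.0) -> float:
    """Return how long to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Header value is expected to be a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # malformed header: fall through to exponential backoff
    # 1s, 2s, 4s, ... capped so a long outage does not stall the run forever
    return min(cap, base * (2 ** attempt))


print(wait_seconds(0), wait_seconds(3), wait_seconds(2, "5"))
```

The returned value would be passed to `time.sleep` before the next attempt.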

## Data requirements and Schema


### 1. Core Features (and why)

- **Match metadata**: `matchId`, `queueId`, `gameDuration` - needed to filter matches.
- **Champion & role picks**: champion IDs/names, team positions - needed for pre-game features.
- **Lane metrics at 15 minutes**: gold diff, CS diff, XP diff, K/D/A per lane - strong early indicators.
- **Team objectives by 15 minutes**: `firstTower`, `firstDragon`, `firstHerald`, void grubs (`horde`).
- **Target variables**: `blue_win` (0/1) and `win_probability` (for model output).



### 2. Proposed Output Schema (one row per match)

#### Match Metadata
- `matchId` (`str`)
- `queueId` (`int`)
- `gameDuration` (`int`, seconds)
- `blue_win` (`0/1`)

#### Champion Picks & Roles
- `blue_champions` (list of champion IDs)
- `red_champions` (list of champion IDs)
- `blue_roles` (list of positions)
- `red_roles` (list of positions)

#### Early Features (15 min)
- `blue_gold_15`, `red_gold_15`, `gold_diff_15` - team gold totals and their difference (blue - red) at 15 minutes
- `blue_cs_15`, `red_cs_15`, `cs_diff_15` - team farm totals and their difference (blue - red) at 15 minutes
- `blue_xp_15`, `red_xp_15`, `xp_diff_15` - team XP totals and their difference (blue - red) at 15 minutes
- `blue_kills_15`, `red_kills_15`, `kills_diff_15` - team kill totals and their difference (blue - red) at 15 minutes

#### Objectives (15 min)
- `first_tower` (None / blue / red)
- `first_dragon` (None / blue / red)
- `first_herald` (None / blue / red)
- `first_grub` (None / blue / red)
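
A lightweight check of a row against this schema could look like the sketch below. `EXPECTED_TYPES` and `validate_row` are hypothetical names introduced for illustration, and only a subset of the columns is checked; it is meant as a sanity check before writing, not a full validation layer.

```python
from typing import Any, Dict, List

# Expected Python type per column (subset of the schema above).
EXPECTED_TYPES = {
    "matchId": str,
    "queueId": int,
    "gameDuration": int,
    "blue_win": int,
    "gold_diff_15": int,
    "cs_diff_15": int,
    "xp_diff_15": int,
    "kills_diff_15": int,
}
OBJECTIVE_COLUMNS = ("first_tower", "first_dragon", "first_herald", "first_grub")
OBJECTIVE_VALUES = {None, "blue", "red"}


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row is fine."""
    problems = []
    for col, expected in EXPECTED_TYPES.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], expected):
            problems.append(f"{col} should be {expected.__name__}")
    if "blue_win" in row and row["blue_win"] not in (0, 1):
        problems.append("blue_win must be 0 or 1")
    for col in OBJECTIVE_COLUMNS:
        # A missing objective column is treated the same as None (nobody took it).
        if row.get(col) not in OBJECTIVE_VALUES:
            problems.append(f"{col} must be None, 'blue' or 'red'")
    return problems


good = {"matchId": "EUN1_1", "queueId": 420, "gameDuration": 1800, "blue_win": 1,
        "gold_diff_15": 1200, "cs_diff_15": -8, "xp_diff_15": 300, "kills_diff_15": 2,
        "first_tower": "blue", "first_dragon": None}
print(validate_row(good))
```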



## Storage strategy & versioning

For this project, I decided to keep storage dead simple.

### Raw and Processed Data
- **Raw data**: every match JSON from the Riot API lives in `data/raw/` (both the match detail and its timeline). I never overwrite these so I can always reprocess.
- **Processed data**: two Parquet files:
  - `data/processed/lol_pregame_data.parquet` (lane champion picks)
  - `data/processed/lol_15min_data.parquet` (early game stats + first objectives)
- **Preview CSVs**: `lol_pregame_data_preview.csv` and `lol_15min_data_preview.csv` (just first 50 rows to eyeball).

Why Parquet? Smaller + keeps types + faster to load. CSV is only for quick inspection.

No manifest / no versioning right now — I just overwrite while iterating. If I later need reproducibility or experiment tracking, I can start saving versioned snapshots like `lol_pregame_data_v002.parquet`.
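
If versioned snapshots become necessary, picking the next file name could be as simple as the sketch below. `next_snapshot_path` is a hypothetical helper, and the `_vNNN` naming is only assumed from the `lol_pregame_data_v002.parquet` example above.

```python
import re
from pathlib import Path


def next_snapshot_path(directory, stem, suffix=".parquet"):
    """Return the path for the next `<stem>_vNNN<suffix>` snapshot in `directory`."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = []
    for path in Path(directory).glob(f"{stem}_v*{suffix}"):
        match = pattern.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    # Start at v001 when no snapshot exists yet.
    next_version = max(versions, default=0) + 1
    return Path(directory) / f"{stem}_v{next_version:03d}{suffix}"


print(next_snapshot_path("data/processed", "lol_pregame_data"))
```

The result can be passed straight to `DataFrame.to_parquet`, which leaves earlier snapshots untouched.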

## Raw data collection

### Important initial design choices:

- I am going to collect match IDs by querying several PUUIDs (players). Prefer high-activity public accounts.

- For the initial proof of concept, I will collect 1,000–5,000 matches. For the final model I will need tens of thousands.

In [None]:
# Constants + basic Riot fetch helpers (simple version)
REGION = 'europe'  # riot regional routing
HEADERS = {"X-Riot-Token": RIOT_API_KEY}  # auth header
MATCHLIST_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/by-puuid/{{puuid}}/ids'
MATCH_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}'
TIMELINE_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}/timeline'
RAW_DIR = 'data/raw'

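# Basic GET with a small retry loop + rate limit handling
def safe_get(url, params=None, retries=5):
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, params=params, timeout=10)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 429:  # rate limited: wait as long as Riot asks
            wait = int(resp.headers.get('Retry-After', 1))
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        elif resp.status_code >= 500:  # transient server error: short backoff, then retry
            time.sleep(2 ** attempt)
        else:  # other 4xx (bad API key, unknown match) will not succeed on retry
            raise RuntimeError(f"Request failed ({resp.status_code}) for {url}: {resp.text}")
    raise RuntimeError(f"Failed to GET {url} after {retries} retries")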

# Get player unique id from riot name + tag
def get_puuid(game_name, tag_line):
    # URL-encode name and tag so spaces or special characters survive in the path
    name = requests.utils.quote(game_name, safe='')
    tag = requests.utils.quote(tag_line, safe='')
    url = f"https://{REGION}.api.riotgames.com/riot/account/v1/accounts/by-riot-id/{name}/{tag}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    if resp.status_code == 200:
        return resp.json().get("puuid")
    print(f"Failed to get PUUID for {game_name}#{tag_line}: {resp.status_code}")
    return None

# Just grab some match ids for one player (solo queue)
def fetch_match_ids(puuid, count=20, queue=420):
    params = {"start": 0, "count": count, "queue": queue}
    return safe_get(MATCHLIST_URL.format(puuid=puuid), params=params)

# Save both match + timeline JSON locally if missing (cache)
def fetch_and_save_match_with_timeline(match_id):
    os.makedirs(RAW_DIR, exist_ok=True)

    match_path = os.path.join(RAW_DIR, f"{match_id}.json")
    if not os.path.exists(match_path):  # don't redownload
        match_data = safe_get(MATCH_URL.format(matchId=match_id))
        with open(match_path, 'w', encoding='utf-8') as f:
            json.dump(match_data, f)

    timeline_path = os.path.join(RAW_DIR, f"{match_id}_timeline.json")
    if not os.path.exists(timeline_path):
        timeline_data = safe_get(TIMELINE_URL.format(matchId=match_id))
        with open(timeline_path, 'w', encoding='utf-8') as f:
            json.dump(timeline_data, f)

    return match_path, timeline_path

# Generic fetch/save helper
def fetch_and_save_json(url, path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):  # already there
        return path
    data = safe_get(url)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    return path

# Single test player (can add more later)
players = [{"name": "BB99", "tag": "BLZNT"}]

# Resolve their PUUIDs
for p in players:
    p["puuid"] = get_puuid(p["name"], p["tag"])
puuids = [p["puuid"] for p in players if p.get("puuid")]
print("PUUIDs fetched:", puuids)

# Rate limiting guard (rough)
all_match_ids = []
MAX_REQUESTS = 99  # under 100 per 2 min window
WINDOW = 120
request_count = 0
start_time = time.time()

for puuid in puuids:
    match_ids = fetch_match_ids(puuid, count=20)  # small sample
    all_match_ids.extend(match_ids)
    
    for match_id in tqdm(match_ids, desc=f"Downloading matches for {puuid}"):
        # Full match
        fetch_and_save_json(MATCH_URL.format(matchId=match_id), os.path.join(RAW_DIR, f"{match_id}.json"))
        # Timeline
        fetch_and_save_json(TIMELINE_URL.format(matchId=match_id), os.path.join(RAW_DIR, f"{match_id}_timeline.json"))
        
        request_count += 2  # two calls per match
        if request_count >= MAX_REQUESTS:
            elapsed = time.time() - start_time
            sleep_time = max(0, WINDOW - elapsed)
            if sleep_time > 0:
                print(f"Sleeping {sleep_time:.1f}s to avoid rate limit...")
                time.sleep(sleep_time)
            start_time = time.time()
            request_count = 0

# Persist the collected list
with open("matchlist.json", "w") as f:
    json.dump(all_match_ids, f, indent=2)

print(f"Downloaded {len(all_match_ids)} matches to '{RAW_DIR}/'")

2025-10-24 00:48:06,127 [INFO] Starting with 1000 existing matches


RuntimeError: Request failed (401) for https://eun1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/GOLD/I: {"status":{"message":"Unknown apikey","status_code":401}}

### Regenerate matchlist.json after adding new matches

In [19]:
# Regenerate matchlist.json from all match JSON files in data/raw
import os
import json

raw_dir = 'data/raw'
match_ids = []
for fname in os.listdir(raw_dir):
    if fname.endswith('.json') and not fname.endswith('_timeline.json'):
        match_id = fname.replace('.json', '')
        match_ids.append(match_id)

# Remove duplicates, just in case
match_ids = sorted(set(match_ids))

with open("matchlist.json", "w") as f:
    json.dump(match_ids, f, indent=2)

print(f"Updated matchlist.json with {len(match_ids)} match IDs from {raw_dir}")

Updated matchlist.json with 45 match IDs from data/raw


## Processing & feature extraction

### Design approach (how I actually set this up)



To keep it simple, the data is split into two tables based on the state of the game each one describes:



1. **Pregame table** - only info known before the game starts:

   - Champion picks by lane (top, jg, mid, bot, support) for each team.

   - Match duration (just for quick filtering later).



2. **15-minute table** - snapshot of early game:

   - First objective takers (`tower`, `dragon`, `herald`, void grubs = `horde`).

   - Team gold / xp / cs / kills and their diffs at 15:00.

   - Games shorter than 15 minutes are skipped (remakes / noise).



Both tables have one row per `matchId`, so joining them later is trivial.
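
As a rough illustration of that join, the sketch below merges two toy frames on `matchId`. The IDs and values are made up for the example, and only two feature columns are shown; the real tables would be loaded from the Parquet files instead.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
pregame = pd.DataFrame({
    "matchId": ["EUN1_1", "EUN1_2", "EUN1_3"],
    "bluetop": ["Teemo", "Illaoi", "Volibear"],
})
fifteen = pd.DataFrame({
    "matchId": ["EUN1_1", "EUN1_3"],   # EUN1_2 was shorter than 15 minutes
    "gold_diff_15": [1200, -800],
})

# validate="one_to_one" raises if either table has duplicate matchIds.
combined = pregame.merge(fifteen, on="matchId", how="inner", validate="one_to_one")
print(combined)
```

An inner join drops matches that only appear in the pregame table, which matches the rule that games shorter than 15 minutes are skipped in the 15-minute table.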



Why bother splitting? Faster iteration. I can train a draft-only model without loading timeline data, and I can tweak early game logic without touching the pregame export.



Storage:

- Parquet main files: `lol_pregame_data.parquet` and `lol_15min_data.parquet`.

- Small CSV previews (`*_preview.csv`) for a quick look at the dataset.



No manifest, no version bumps - if I change logic I just overwrite the files as this is a small scale project. If I later need history I can start versioning.



Flow I run:

1. Regenerate `matchlist.json` from raw if needed.

2. Build pregame (no timeline needed).

3. Build 15m table (needs timeline + duration filter).

4. Save Parquet + preview CSVs.


In [None]:
from scripts.fetch_eune_matches import main as fetch_matches_main

# Runs like: !python scripts/fetch_eune_matches.py --target-match-count 1000 ...
fetch_matches_main([
    "--target-match-count", "1000",
    "--matches-per-player", "100",
    "--history-window", "600",
])

## Pregame Table
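
A single pregame row can be derived from the match JSON alone. The sketch below is one possible way to do it, not necessarily the code that produced the output in this section: `build_pregame_row` is a hypothetical name, and it assumes the match-v5 participant fields `teamId` (100 = blue, 200 = red), `teamPosition` and `championName`.

```python
from typing import Any, Dict

# Map Riot's position labels to the lane suffixes used in the column names.
POSITION_TO_LANE = {"TOP": "top", "JUNGLE": "jg", "MIDDLE": "mid",
                    "BOTTOM": "bot", "UTILITY": "support"}
TEAM_ID_TO_SIDE = {100: "blue", 200: "red"}


def build_pregame_row(match_json: Dict[str, Any]) -> Dict[str, Any]:
    """Return one pregame row: matchId, gameDuration and a champion per side/lane."""
    info = match_json["info"]
    row = {"matchId": match_json["metadata"]["matchId"],
           "gameDuration": info["gameDuration"]}
    # Pre-fill every lane with an empty string so missing positions stay visible.
    for side in TEAM_ID_TO_SIDE.values():
        for lane in POSITION_TO_LANE.values():
            row[f"{side}{lane}"] = ""
    for participant in info["participants"]:
        side = TEAM_ID_TO_SIDE.get(participant.get("teamId"))
        lane = POSITION_TO_LANE.get(participant.get("teamPosition"))
        if side and lane:
            row[f"{side}{lane}"] = participant.get("championName", "")
    return row


example = {"metadata": {"matchId": "EUN1_1"},
           "info": {"gameDuration": 1800, "participants": [
               {"teamId": 100, "teamPosition": "TOP", "championName": "Teemo"},
               {"teamId": 200, "teamPosition": "UTILITY", "championName": "Bard"},
               {"teamId": 200, "teamPosition": "", "championName": "Ashe"}]}}
print(build_pregame_row(example))
```

Participants with an empty `teamPosition` are left out, which is what an "empty lane value counts" check would surface.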


Pregame (all lanes): 100%|██████████| 1000/1000 [00:12<00:00, 79.41it/s]



Simplified pregame table saved: data/processed\lol_pregame_data.parquet (rows=1000)
Columns: ['matchId', 'gameDuration', 'bluetop', 'bluejg', 'bluemid', 'bluebot', 'bluesupport', 'redtop', 'redjg', 'redmid', 'redbot', 'redsupport']
Empty lane value counts: {'bluetop': 0, 'bluejg': 0, 'bluemid': 0, 'bluebot': 0, 'bluesupport': 0, 'redtop': 0, 'redjg': 0, 'redmid': 0, 'redbot': 0, 'redsupport': 0}
Sample non-empty rows:
           matchId  gameDuration       bluetop    bluejg bluemid      bluebot  \
0  EUN1_3830286977          2248  FiddleSticks     Sylas    Ahri         Ashe   
1  EUN1_3830307017          1896         Teemo     Viego   Akali  MissFortune   
2  EUN1_3830671280           999      Volibear  Nocturne    Ashe      Caitlyn   
3  EUN1_3830680693          2021        Illaoi     Shaco   Yasuo        Vayne   
4  EUN1_3830696669          1438   Mordekaiser  MasterYi    Ekko        Vayne   

  bluesupport        redtop    redjg    redmid   redbot redsupport  
0        Bard  Heimerd

## Fifteen Minute Table
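
The per-participant snapshot used in this section could be computed roughly as sketched below. This is a hypothetical stand-in, not the project's `aggregate_participant_stats_at_minute`: `snapshot_at_minute` is an invented name, and it assumes the timeline layout `info.frames[*].timestamp` (milliseconds), `participantFrames` and `events` with `CHAMPION_KILL` entries carrying a `killerId`.

```python
from typing import Any, Dict


def snapshot_at_minute(timeline_json: Dict[str, Any], minute: int = 15) -> Dict[int, Dict[str, int]]:
    """Return {participantId: stats} taken from the first frame at or after `minute`."""
    target_ms = minute * 60_000
    frames = timeline_json["info"]["frames"]
    frame = next((f for f in frames if f["timestamp"] >= target_ms), None)
    if frame is None:
        raise ValueError(f"timeline ends before minute {minute}")

    stats = {}
    for pf in frame["participantFrames"].values():
        stats[pf["participantId"]] = {
            "totalGold": pf["totalGold"],
            "xp": pf["xp"],
            "minionsKilled": pf["minionsKilled"],
            "kills": 0,
        }
    # Kills are not stored per frame, so count kill events up to the target time.
    for f in frames:
        for event in f.get("events", []):
            if event.get("type") == "CHAMPION_KILL" and event["timestamp"] <= target_ms:
                killer = event.get("killerId")
                if killer in stats:  # skips executions, which have no champion killer
                    stats[killer]["kills"] += 1
    return stats
```

Frame timestamps are usually a few milliseconds past each whole minute, which is why the sketch takes the first frame at or after the target instead of requiring an exact match. The timeline also reports `jungleMinionsKilled` separately, which this sketch ignores.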

In [None]:
# Build 15-Minute Table

raw_dir = 'data/raw'
processed_dir = 'data/processed'  # output folder for processed tables
TARGET_MINUTE = 15                # snapshot minute for early-game stats

# NOTE: load_match_json, load_timeline_json and aggregate_participant_stats_at_minute
# must already be defined before running this cell (they are not shown here).

all_match_files = [f.replace('.json', '') for f in os.listdir(raw_dir) if f.endswith('.json') and not f.endswith('_timeline.json')]
print("Number of base match JSON files in data/raw:", len(all_match_files))


# Load full list from matchlist.json
with open('matchlist.json', 'r') as f:
    match_ids_all = json.load(f)
print('Total match IDs from matchlist.json:', len(match_ids_all))


# Determine which team took the first objective of a given type
def first_team(teams, obj_key):
    try:
        if teams[0]['objectives'][obj_key]['first']:
            return 'blue'
        if teams[1]['objectives'][obj_key]['first']:
            return 'red'
    except (KeyError, IndexError, TypeError):  # objective missing from this match
        return None
    return None


MIN_GAME_DURATION = TARGET_MINUTE * 60  # require full duration to compute 15m stats
rows = []
skipped_short = []
missing_timeline = []
skipped_snapshot = []

# Process each match
for match_id in tqdm(match_ids_all, desc="Processing matches (15m)"):
    try:
        match_json = load_match_json(match_id)
    except FileNotFoundError:
        continue  # skip if match file missing
    game_duration = match_json.get('info', {}).get('gameDuration', 0)
    if game_duration < MIN_GAME_DURATION:  # ignore super short games (remakes)
        skipped_short.append((match_id, game_duration))
        continue
    try:
        timeline_json = load_timeline_json(match_id)  # need events for early stats
    except FileNotFoundError:
        missing_timeline.append(match_id)
        continue
    try:
        stats_snapshot = aggregate_participant_stats_at_minute(match_json, timeline_json)
    except ValueError as exc:
        skipped_snapshot.append((match_id, str(exc)))
        continue
    teams = match_json['info']['teams']
    participants = match_json['info']['participants']

    # base row
    row = {
        'matchId': match_id,
        'queueId': match_json['info'].get('queueId'),
        'gameDuration': game_duration,
        'blue_win': int(teams[0]['win']),  # 1 if blue won
        'first_tower': first_team(teams, 'tower'),
        'first_dragon': first_team(teams, 'dragon'),
        'first_herald': first_team(teams, 'riftHerald'),
        'first_grub': first_team(teams, 'horde'),  # void grubs are exposed as HORDE in the API
    }

    # participantId lists split by side (teamId 100 = blue, 200 = red)
    blue_ids = [p['participantId'] for p in participants if p['teamId'] == 100]
    red_ids = [p['participantId'] for p in participants if p['teamId'] == 200]

    # aggregate team sums at 15m
    row.update({
        'blue_gold_15': sum(stats_snapshot[pid]['totalGold'] for pid in blue_ids),
        'red_gold_15': sum(stats_snapshot[pid]['totalGold'] for pid in red_ids),
        'blue_xp_15': sum(stats_snapshot[pid]['xp'] for pid in blue_ids),
        'red_xp_15': sum(stats_snapshot[pid]['xp'] for pid in red_ids),
        'blue_cs_15': sum(stats_snapshot[pid]['minionsKilled'] for pid in blue_ids),
        'red_cs_15': sum(stats_snapshot[pid]['minionsKilled'] for pid in red_ids),
        'blue_kills_15': sum(stats_snapshot[pid]['kills'] for pid in blue_ids),
        'red_kills_15': sum(stats_snapshot[pid]['kills'] for pid in red_ids),
    })

    # diffs (blue - red)
    row['gold_diff_15'] = row['blue_gold_15'] - row['red_gold_15']
    row['cs_diff_15'] = row['blue_cs_15'] - row['red_cs_15']
    row['xp_diff_15'] = row['blue_xp_15'] - row['red_xp_15']
    row['kills_diff_15'] = row['blue_kills_15'] - row['red_kills_15']
    rows.append(row)

# Build DataFrame
df_15m = pd.DataFrame(rows)
print(f"Rows in 15m DataFrame (>= {MIN_GAME_DURATION}s): {len(df_15m)}")
print(f"Skipped {len(skipped_short)} short games (<{MIN_GAME_DURATION}s)")

# Save outputs (full + small preview)
os.makedirs(processed_dir, exist_ok=True)
fifteen_parquet = os.path.join(processed_dir, 'lol_15min_data.parquet')
fifteen_csv = os.path.join(processed_dir, 'lol_15min_data_preview.csv')
df_15m.to_parquet(fifteen_parquet, index=False)
df_15m.head(50).to_csv(fifteen_csv, index=False)
print('Saved 15-minute dataset files (Parquet + preview CSV).')

Number of base match JSON files in data/raw: 1000
Total match IDs from matchlist.json: 1000


Processing matches (15m): 100%|██████████| 1000/1000 [00:22<00:00, 45.08it/s]

Rows in 15m DataFrame (>= 900s): 998
Skipped 2 short games (<900s)
           matchId  gameDuration first_tower first_dragon first_herald  \
0  EUN1_3830286977          2248         red         blue          red   
1  EUN1_3830307017          1896         red         blue         blue   
2  EUN1_3830671280           999        blue         blue         blue   
3  EUN1_3830680693          2021         red          red         blue   
4  EUN1_3830696669          1438        blue          red         blue   

  first_grub  
0        red  
1       blue  
2        red  
3       blue  
4       blue  
Saved 15-minute dataset files (Parquet + preview CSV).
15m DataFrame columns: ['matchId', 'queueId', 'gameDuration', 'blue_win', 'first_tower', 'first_dragon', 'first_herald', 'first_grub', 'blue_gold_15', 'red_gold_15', 'blue_xp_15', 'red_xp_15', 'blue_cs_15', 'red_cs_15', 'blue_kills_15', 'red_kills_15', 'gold_diff_15', 'cs_diff_15', 'xp_diff_15', 'kills_diff_15']
Objective value counts (first


