## League of Legends - Data Processing Notebook

Purpose: This notebook documents a clear and reproducible data processing workflow for the League of Legends Match Prediction challenge. It explains why each step is needed, provides the code to collect and transform Riot match data, and produces a clean dataset ready for modeling.

## Introduction & Goals

**Goals of this notebook:**

-   Collect match data from the Riot API.

-   Parse and extract both pre-game features (champion picks, roles) and early-game features at 10 minutes (gold diff, kills, objectives).

-   Produce a cleaned, versioned dataset (Parquet, plus a CSV preview) with clear documentation for each column.

**Why separate processing from modeling?**

Riot match timelines and match lists can become large (MBs–GBs). Processing and cleaning in a dedicated notebook or script avoids repeated heavy work during model experimentation and keeps the ML notebook focused on training and evaluation.

## Environment & setup

In [1]:
# Import core Python libraries
import json
import os
import time
from typing import Any, Dict, List

import pandas as pd
import requests
from dotenv import load_dotenv
from tqdm import tqdm

# Parquet engines used by pandas
import pyarrow
import fastparquet

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1)


# Load environment variables
load_dotenv()


RIOT_API_KEY = os.getenv('RIOT_API_KEY')
if not RIOT_API_KEY:
    raise RuntimeError('Set RIOT_API_KEY as an environment variable or in a .env file')


HEADERS = {"X-Riot-Token": RIOT_API_KEY}

## Data sourcing (Riot API overview)

### Which endpoints will be used:

- **Match IDs** by **PUUID:** /lol/match/v5/matches/by-puuid/{puuid}/ids - returns a list of match IDs for a given player.

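- **Match detail by matchId**: /lol/match/v5/matches/{matchId} - returns the full match info (participants, teams, objectives).

- **Match timeline by matchId**: /lol/match/v5/matches/{matchId}/timeline - returns the timeline for the match; it is a separate request from the match detail.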

### Filtering notes:

- Use **queue=420** to restrict to **Ranked Solo/Duo matches (competitive).**

- Filter out **remakes** or **extremely short games** (e.g., gameDuration < 600 seconds, i.e. under 10 minutes) to avoid noisy examples.
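The two filters above can be sketched as a small predicate. This is a minimal sketch, not the notebook's own code: the function names are hypothetical, and the 600-second cut-off is assumed from the processing step later in this notebook. It also assumes Riot's documented behavior that `gameDuration` is in seconds when `gameEndTimestamp` is present and in milliseconds otherwise (older matches).

```python
from typing import Any, Dict

RANKED_SOLO_QUEUE = 420      # queue ID for Ranked Solo/Duo
MIN_GAME_DURATION_S = 600    # assumed cut-off: 10 minutes


def game_duration_seconds(info: Dict[str, Any]) -> float:
    """Return the game length in seconds.

    Assumption: gameDuration is in milliseconds when gameEndTimestamp
    is missing from the response, and in seconds when it is present.
    """
    duration = info.get("gameDuration", 0)
    if "gameEndTimestamp" not in info:
        return duration / 1000
    return duration


def is_usable_match(match_json: Dict[str, Any],
                    min_duration_s: int = MIN_GAME_DURATION_S) -> bool:
    """Keep only ranked solo/duo games that lasted long enough."""
    info = match_json.get("info", {})
    if info.get("queueId") != RANKED_SOLO_QUEUE:
        return False
    return game_duration_seconds(info) >= min_duration_s
```

Checking the queue again after download is cheap insurance in case raw files from other queues end up in the same folder.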

### Rate limits & best practices:

- Riot enforces **rate limits**, so I use **time.sleep** to pause between requests and, when blocked (HTTP 429), wait for the number of seconds given in the `Retry-After` header.

- **Cache** raw match JSONs to disk so you don't re-download them.
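As an alternative to counting requests in fixed windows, a client-side sliding-window throttle is one way to stay under a request budget. This is a hedged sketch: the class name `SlidingWindowLimiter` is hypothetical, and the default budget of 99 requests per 120 seconds is an assumption taken from the limit used in this notebook's collection loop, so check the limits shown for your own API key.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Block until another request fits inside the rolling window."""

    def __init__(self, max_requests: int = 99, window_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps: Deque[float] = deque()

    def _prune(self, now: float) -> None:
        # Forget requests that have left the rolling window
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()

    def wait(self) -> None:
        """Call once before every request."""
        now = self._clock()
        self._prune(now)
        if len(self._stamps) >= self.max_requests:
            # Sleep until the oldest request falls out of the window
            self._sleep(self.window_s - (now - self._stamps[0]))
            now = self._clock()
            self._prune(now)
        self._stamps.append(now)
```

Calling `limiter.wait()` before each `requests.get` spaces requests out only when the budget is actually used up, and cached files never need to call it at all.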

## Data requirements and Schema


### 1. Core Features (and why)

- **Match metadata**: `matchId`, `queueId`, `gameDuration` - needed to filter matches.
- **Champion & role picks**: champion IDs/names, team positions - needed for pre-game features.
- **Lane metrics at 10 minutes**: gold diff, CS diff, XP diff, K/D/A per lane - strong early indicators.
- **Team objectives at 10 minutes**: `firstBlood`, `firstTower`, `firstDragon` (+ type), `firstHerald`, counts of towers/dragons.
- **Target variable**: `blue_win` (0/1). The model's output, `win_probability`, is produced at modeling time and is not a dataset column.



### 2. Proposed Output Schema (one row per match)

### Match Metadata
- `matchId` (`str`)
- `queueId` (`int`)
- `gameDuration` (`int`, seconds)
- `blue_win` (`0/1`)

### Champion Picks & Roles
- `blue_champions` (list of champion IDs)
- `red_champions` (list of champion IDs)
- `blue_roles` (list of positions)
- `red_roles` (list of positions)

### Early Features (10 min)
- `blue_gold_10`, `red_gold_10`, `gold_diff_10` - difference in gold at 10 minutes
- `blue_cs_10`, `red_cs_10`, `cs_diff_10` - difference in farm at 10 minutes
- `blue_xp_10`, `red_xp_10`, `xp_diff_10` - difference in XP at 10 minutes
- `blue_kills_10`, `red_kills_10`, `kills_diff_10` - difference in kills at 10 minutes

### Objectives (10 min)
- `first_blood` (None / blue / red)
- `first_tower` (None / blue / red)
- `first_dragon` (None / blue / red)
- `first_rift_herald` (None / blue / red)
- `blue_towers_10`, `red_towers_10` - tower kills at 10 minutes
- `blue_dragons_10`, `red_dragons_10` - dragon kills at 10 minutes
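A schema like this is easy to check automatically before saving. The sketch below is one possible sanity check, not part of the original pipeline: the function name `check_early_game_columns` is hypothetical, and it assumes every diff column is defined as blue minus red, which is how the processing code in this notebook computes them.

```python
import pandas as pd

STATS = ("gold", "cs", "xp", "kills")


def check_early_game_columns(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for stat in STATS:
        blue, red, diff = f"blue_{stat}_10", f"red_{stat}_10", f"{stat}_diff_10"
        missing = [c for c in (blue, red, diff) if c not in df.columns]
        if missing:
            problems.append(f"missing columns: {missing}")
            continue
        # Assumed convention: diff = blue - red
        if not (df[diff] == df[blue] - df[red]).all():
            problems.append(f"{diff} is not {blue} - {red}")
    if "blue_win" in df.columns and not df["blue_win"].isin([0, 1]).all():
        problems.append("blue_win must be 0 or 1")
    return problems
```

Running it on the final DataFrame and failing loudly on a non-empty result catches sign mistakes in the diff columns before they reach the model.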


## Storage strategy & versioning

For this project, I store the raw and processed match data in a way that’s easy to manage and scale:

### Raw and Processed Data
- **Raw data**: every match JSON I collect from the Riot API is saved under `data/raw/{matchId}.json`, and its timeline under `data/raw/{matchId}_timeline.json`.
  This way, I have a permanent copy of the original data that I can always reprocess if needed.
- **Processed data**: I’m saving the cleaned dataset as `data/processed/lol_matches_v001.parquet`.
  I chose **Parquet** because:
  - It’s **much smaller and faster** than CSV, which matters since I’ll eventually have thousands of matches.
  - It keeps **data types** (int, float, string) correctly without me having to convert them manually.
  - I can quickly load only the columns I need for analysis or modeling.
- **Manifest**: I also create `data/processed/manifest.json` that tracks:
  - the dataset version  
  - number of rows  
  - the date it was generated  
  - the raw files used  

I could use CSV instead, which is easier to open in Excel or share, but for this size of data, Parquet makes working with it way faster and more efficient. I might save a CSV copy later if I want to show it.
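The manifest described above could be produced with something like the following. This is only a sketch under assumptions: the helper names and the JSON key names (`version`, `rows`, `generated_at`, `raw_files`) are hypothetical, chosen to mirror the four items listed for the manifest.

```python
import json
import os
from datetime import datetime, timezone
from typing import Any, Dict


def build_manifest(version: str, n_rows: int, raw_dir: str) -> Dict[str, Any]:
    """Collect the facts the manifest should record (key names are assumed)."""
    raw_files = sorted(f for f in os.listdir(raw_dir) if f.endswith(".json"))
    return {
        "version": version,
        "rows": n_rows,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "raw_files": raw_files,
    }


def write_manifest(manifest: Dict[str, Any], path: str) -> None:
    """Write the manifest as pretty-printed JSON, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```

For example, `write_manifest(build_manifest("v001", len(df), "data/raw"), "data/processed/manifest.json")` would record the current run.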

### Versioning

- I use **sequential version numbers** for the dataset: `v001`, `v002`, etc.
- The **raw JSON files are never overwritten**. If I need to reprocess or add more matches, I create a new processed version.
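One way to pick the next `vNNN` file name without overwriting an existing version is to scan the processed folder. This is a minimal sketch: the function name `next_version_path` is hypothetical, and the `lol_matches` stem and `.parquet` extension are assumed from the file name used in this notebook.

```python
import os
import re


def next_version_path(processed_dir: str, stem: str = "lol_matches",
                      ext: str = ".parquet") -> str:
    """Return the path for the next unused version, e.g. ..._v002.parquet."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d{{3,}}){re.escape(ext)}$")
    versions = []
    if os.path.isdir(processed_dir):
        for name in os.listdir(processed_dir):
            match = pattern.match(name)
            if match:
                versions.append(int(match.group(1)))
    # Start at v001 when no earlier version exists
    next_number = max(versions, default=0) + 1
    return os.path.join(processed_dir, f"{stem}_v{next_number:03d}{ext}")
```

Because the number is derived from what is already on disk, rerunning the export never silently replaces an older processed file.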


## Raw data collection

### Important initial design choices:

- I am going to collect match IDs by querying several PUUIDs (players), preferring high-activity public accounts.

- For the initial proof of concept, I will collect 1,000–5,000 matches. For the final model I aim for tens of thousands.

In [23]:
#Constants
REGION = 'europe'
HEADERS = {"X-Riot-Token": RIOT_API_KEY}
MATCHLIST_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/by-puuid/{{puuid}}/ids'
MATCH_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}'
TIMELINE_URL = f'https://{REGION}.api.riotgames.com/lol/match/v5/matches/{{matchId}}/timeline'
RAW_DIR = 'data/raw'

# Functions!
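def safe_get(url, params=None, retries=5, timeout=10):
    """GET a Riot API URL and return the parsed JSON, retrying on 429 and 5xx."""
    last_status = None
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, params=params, timeout=timeout)
        last_status = resp.status_code
        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 429:  # rate limited: wait as long as Riot asks
            wait = int(resp.headers.get('Retry-After', 1))
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        elif resp.status_code >= 500:  # transient server error: back off, then retry
            time.sleep(2 ** attempt)
        else:  # other 4xx errors will not succeed on retry
            break
    raise RuntimeError(f"Failed to GET {url} (last status: {last_status}, max retries: {retries})")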

def get_puuid(game_name, tag_line):
    url = f"https://{REGION}.api.riotgames.com/riot/account/v1/accounts/by-riot-id/{game_name}/{tag_line}"
    resp = requests.get(url, headers=HEADERS)
    if resp.status_code == 200:
        return resp.json().get("puuid")
    print(f"Failed to get PUUID for {game_name}#{tag_line}: {resp.status_code}")
    return None

def fetch_match_ids(puuid, count=20, queue=420):
    params = {"start": 0, "count": count, "queue": queue}
    return safe_get(MATCHLIST_URL.format(puuid=puuid), params=params)

# Fetch and save match JSON and timeline JSON
def fetch_and_save_match_with_timeline(match_id):
    os.makedirs(RAW_DIR, exist_ok=True)

    # Match JSON
    match_path = os.path.join(RAW_DIR, f"{match_id}.json")
    if not os.path.exists(match_path):
        match_data = safe_get(MATCH_URL.format(matchId=match_id))
        with open(match_path, 'w', encoding='utf-8') as f:
            json.dump(match_data, f)

    # Timeline JSON
    timeline_path = os.path.join(RAW_DIR, f"{match_id}_timeline.json")
    if not os.path.exists(timeline_path):
        timeline_data = safe_get(TIMELINE_URL.format(matchId=match_id))
        with open(timeline_path, 'w', encoding='utf-8') as f:
            json.dump(timeline_data, f)

    return match_path, timeline_path

# Fetch JSON from URL and save if it doesn't exist.
def fetch_and_save_json(url, path):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return path
    data = safe_get(url)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    return path

# Players to fetch data from
players = [{"name": "BB99", "tag": "BLZNT"}]

# Fetch PUUIDs
for p in players:
    p["puuid"] = get_puuid(p["name"], p["tag"])
puuids = [p["puuid"] for p in players if p.get("puuid")]
print("PUUIDs fetched:", puuids)

# Fetch matches with rate-limit handling
all_match_ids = []
MAX_REQUESTS = 99  # just under Riot's 100 requests per 2 min
WINDOW = 120
request_count = 0
start_time = time.time()

for puuid in puuids:
    match_ids = fetch_match_ids(puuid, count=20)
    all_match_ids.extend(match_ids)
    
    for match_id in tqdm(match_ids, desc=f"Downloading matches for {puuid}"):
        # Full match details
        fetch_and_save_json(MATCH_URL.format(matchId=match_id), os.path.join(RAW_DIR, f"{match_id}.json"))

        # Match timeline
        fetch_and_save_json(TIMELINE_URL.format(matchId=match_id), os.path.join(RAW_DIR, f"{match_id}_timeline.json"))
        
        request_count += 2  # two requests per match
        if request_count >= MAX_REQUESTS:
            elapsed = time.time() - start_time
            sleep_time = max(0, WINDOW - elapsed)
            if sleep_time > 0:
                print(f"Sleeping {sleep_time:.1f}s to avoid rate limit...")
                time.sleep(sleep_time)
            start_time = time.time()
            request_count = 0

# Save full list of match IDs
with open("matchlist.json", "w") as f:
    json.dump(all_match_ids, f, indent=2)

print(f" Downloaded {len(all_match_ids)} matches to '{RAW_DIR}/'")

PUUIDs fetched: ['T9lGIR8iroZCD7e9-3hWNs-wg9h3eA1UjcT_4YsKlkdZY0L9tZWhJlMg9kGT99wpuBNZxT5iDpY3lg']


Downloading matches for T9lGIR8iroZCD7e9-3hWNs-wg9h3eA1UjcT_4YsKlkdZY0L9tZWhJlMg9kGT99wpuBNZxT5iDpY3lg: 100%|██████████| 20/20 [00:00<00:00, 3202.49it/s]

 Downloaded 20 matches to 'data/raw/'





### Regenerate matchlist.json after adding new matches

In [27]:
# Regenerate matchlist.json from all match JSON files in data/raw
import os
import json

raw_dir = 'data/raw'
match_ids = []
for fname in os.listdir(raw_dir):
    if fname.endswith('.json') and not fname.endswith('_timeline.json'):
        match_id = fname.replace('.json', '')
        match_ids.append(match_id)

# Remove duplicates, just in case
match_ids = sorted(set(match_ids))

with open("matchlist.json", "w") as f:
    json.dump(match_ids, f, indent=2)

print(f"Updated matchlist.json with {len(match_ids)} match IDs from {raw_dir}")

Updated matchlist.json with 45 match IDs from data/raw


## Parsing & feature extraction

### Design approach (how I actually set this up)

To keep it simple, the data is split into two parts based on when the information becomes available:

1. **Pregame table** – only info known before the game starts:
   - Match metadata (`id`, `queue`, `version`, `duration`).
   - Champion picks (`blue` vs `red`) in pick order.
   - Positions/roles (which champion is facing which and in what position).
   Lists are stored as pipe-separated strings to keep it flat (easy to split back later).

2. **10-minute table** – what the game state looks like at/around 10:00:
   - First objective takers (`tower`, `dragon`, `herald`, grubs = `horde`).
   - Team gold / xp / cs / kills and their diffs.
   - I skip matches shorter than 10 min because those don’t have a real early game.

Both tables have one row per `matchId`, so joining them later is trivial.
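To illustrate, here is a hedged sketch of joining the two tables on `matchId` and splitting a pipe-separated column back into a list. The rows are made up for illustration (the match IDs and champion IDs are not real data), and `split_ids` is a hypothetical helper.

```python
import pandas as pd

# Illustrative rows only; match IDs and champion IDs are made up.
pregame = pd.DataFrame({
    "matchId": ["EXAMPLE_1", "EXAMPLE_2", "EXAMPLE_3"],
    "blue_champions": ["1|2|3|4|5", "6|7|8|9|10", "11|12|13|14|15"],
})
early = pd.DataFrame({
    "matchId": ["EXAMPLE_1", "EXAMPLE_3"],  # EXAMPLE_2 was too short for a 10m row
    "gold_diff_10": [850, -1200],
})

# Inner join keeps matches present in both tables;
# validate raises if either side has more than one row per matchId.
joined = pregame.merge(early, on="matchId", how="inner", validate="one_to_one")


def split_ids(value: str) -> list:
    """Turn a pipe-separated string such as '1|2|3' back into a list of ints."""
    return [int(part) for part in value.split("|")]


joined["blue_champions"] = joined["blue_champions"].apply(split_ids)
```

The `validate="one_to_one"` argument turns the "one row per `matchId`" assumption into a check, so a duplicated match fails at join time instead of inflating the dataset.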

Why bother splitting? Faster iteration. I can train a draft-only model without loading timeline data, and I can tweak early game logic without touching the pregame export.

Storage: Parquet for the real thing, small CSV preview just so I can eyeball outputs. A `manifest.json` tracks versions + counts.

Version bump any time I change schema or logic (e.g., add lane diffs → go to v002).

Flow I run:
1. Load raw JSON list.
2. Build pregame (no timeline).
3. Build 10m table (needs timeline + duration filter).
4. Save Parquet/CSV + previews + update manifest.
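The processing cell relies on three helpers (`load_match_json`, `load_timeline_json`, `aggregate_participant_stats_at_10min`) whose definitions are not shown in this section. The sketch below is one possible implementation, not necessarily the one used for the results here. It assumes the timeline JSON stores frames under `info.frames`, each with a `timestamp` in milliseconds, a `participantFrames` mapping keyed by participant ID strings, and an `events` list containing `CHAMPION_KILL` events with a `killerId`.

```python
import json
import os
from typing import Any, Dict

RAW_DIR = "data/raw"
TEN_MINUTES_MS = 10 * 60 * 1000


def load_match_json(match_id: str, raw_dir: str = RAW_DIR) -> Dict[str, Any]:
    with open(os.path.join(raw_dir, f"{match_id}.json"), encoding="utf-8") as f:
        return json.load(f)


def load_timeline_json(match_id: str, raw_dir: str = RAW_DIR) -> Dict[str, Any]:
    with open(os.path.join(raw_dir, f"{match_id}_timeline.json"), encoding="utf-8") as f:
        return json.load(f)


def aggregate_participant_stats_at_10min(
    match_json: Dict[str, Any],
    timeline_json: Dict[str, Any],
    cutoff_ms: int = TEN_MINUTES_MS,
) -> Dict[int, Dict[str, int]]:
    """Return {participantId: {totalGold, xp, minionsKilled, kills}} near the cutoff."""
    frames = timeline_json["info"]["frames"]
    # Frame timestamps rarely land exactly on the minute, so take the closest one
    frame = min(frames, key=lambda fr: abs(fr["timestamp"] - cutoff_ms))

    stats = {
        p["participantId"]: {"totalGold": 0, "xp": 0, "minionsKilled": 0, "kills": 0}
        for p in match_json["info"]["participants"]
    }
    for pid_str, pf in frame["participantFrames"].items():
        entry = stats.setdefault(
            int(pid_str), {"totalGold": 0, "xp": 0, "minionsKilled": 0, "kills": 0})
        entry["totalGold"] = pf.get("totalGold", 0)
        entry["xp"] = pf.get("xp", 0)
        entry["minionsKilled"] = pf.get("minionsKilled", 0)

    # Kills are not in participantFrames; count kill events up to the cutoff
    for fr in frames:
        for event in fr.get("events", []):
            if event.get("type") == "CHAMPION_KILL" and event.get("timestamp", 0) <= cutoff_ms:
                killer = event.get("killerId", 0)
                if killer in stats:  # skips kills with no champion killer
                    stats[killer]["kills"] += 1
    return stats
```

Note that `minionsKilled` here counts lane minions only; whether to add jungle monsters to the CS figure is a design choice this sketch leaves open.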

In [45]:
# Process all matches
# Note: load_match_json, load_timeline_json and aggregate_participant_stats_at_10min
# are assumed to be defined in an earlier cell.
raw_dir = 'data/raw'
processed_dir = 'data/processed'

# Match IDs come from the match files; timeline files share the .json suffix, so exclude them
json_files = [f for f in os.listdir(raw_dir) if f.endswith('.json')]
all_match_ids = sorted(f[:-len('.json')] for f in json_files if not f.endswith('_timeline.json'))

# Debug: this count includes both match and timeline files
print("Number of match JSON files in data/raw:", len(json_files))

MIN_GAME_DURATION = 600  # seconds; skip remakes / ultra-short games


def first_team(teams, obj_key):
    """Return 'blue' or 'red' for the team that took the objective first, else None."""
    # Assumes teams[0] is the blue side and teams[1] the red side
    for team, side in zip(teams, ('blue', 'red')):
        if team.get('objectives', {}).get(obj_key, {}).get('first'):
            return side
    return None


rows = []
skipped_short = []

for match_id in tqdm(all_match_ids, desc="Processing matches (10m)"):
    match_json = load_match_json(match_id)
    info = match_json['info']
    game_duration = info.get('gameDuration', 0)
    # Skip very short games (likely remakes)
    if game_duration < MIN_GAME_DURATION:
        skipped_short.append((match_id, game_duration))
        continue

    timeline_json = load_timeline_json(match_id)
    stats_10min = aggregate_participant_stats_at_10min(match_json, timeline_json)
    teams = info['teams']

    # Build dataset row
    row = {
        'matchId': match_id,
        'queueId': info.get('queueId'),
        'gameDuration': game_duration,
        'blue_win': int(teams[0]['win']),
        # first objective takers (void grubs now keyed as 'horde' in current patches)
        'first_tower': first_team(teams, 'tower'),
        'first_dragon': first_team(teams, 'dragon'),
        'first_herald': first_team(teams, 'riftHerald'),
        'first_grub': first_team(teams, 'horde')
    }

    # Aggregate team-level stats (assumes participants are listed blue first, then red)
    blue_ids = [p['participantId'] for p in info['participants'][:5]]
    red_ids = [p['participantId'] for p in info['participants'][5:]]

    row.update({
        'blue_gold_10': sum(stats_10min[pid]['totalGold'] for pid in blue_ids),
        'red_gold_10': sum(stats_10min[pid]['totalGold'] for pid in red_ids),
        'blue_xp_10': sum(stats_10min[pid]['xp'] for pid in blue_ids),
        'red_xp_10': sum(stats_10min[pid]['xp'] for pid in red_ids),
        'blue_cs_10': sum(stats_10min[pid]['minionsKilled'] for pid in blue_ids),
        'red_cs_10': sum(stats_10min[pid]['minionsKilled'] for pid in red_ids),
        'blue_kills_10': sum(stats_10min[pid]['kills'] for pid in blue_ids),
        'red_kills_10': sum(stats_10min[pid]['kills'] for pid in red_ids),
    })
    # Diffs are always blue minus red
    row['gold_diff_10'] = row['blue_gold_10'] - row['red_gold_10']
    row['cs_diff_10'] = row['blue_cs_10'] - row['red_cs_10']
    row['xp_diff_10'] = row['blue_xp_10'] - row['red_xp_10']
    row['kills_diff_10'] = row['blue_kills_10'] - row['red_kills_10']

    rows.append(row)

# Convert to DataFrame
df = pd.DataFrame(rows)
print(f"Rows in 10m DataFrame (>= {MIN_GAME_DURATION}s):", df.shape[0])
if skipped_short:
    print(f"Skipped {len(skipped_short)} short games (<{MIN_GAME_DURATION}s):")
    for mid, dur in skipped_short:
        print(f"  - {mid} (duration={dur}s)")
print(df[['matchId','gameDuration','first_tower','first_dragon','first_herald','first_grub']].head())

# Save processed dataset as CSV
os.makedirs(processed_dir, exist_ok=True)
df.to_csv(os.path.join(processed_dir, 'lol_10min_matches.csv'), index=False)
print("✅ Saved processed dataset as CSV with first objective takers (using 'horde' for void grubs). Remakes filtered.")

Number of match JSON files in data/raw: 90


Processing matches (10m): 100%|██████████| 45/45 [00:00<00:00, 123.65it/s]


Rows in 10m DataFrame (>= 600s): 43
Skipped 2 short games (<600s):
  - EUN1_3831011312 (duration=110s)
  - EUN1_3832486067 (duration=117s)
           matchId  gameDuration first_tower first_dragon first_herald  \
0  EUN1_3830286977          2248         red         blue          red   
1  EUN1_3830307017          1896         red         blue         blue   
2  EUN1_3830671280           999        blue         blue         blue   
3  EUN1_3830680693          2021         red          red         blue   
4  EUN1_3830696669          1438        blue          red         blue   

  first_grub  
0        red  
1       blue  
2        red  
3       blue  
4       blue  




