## Web-Scraping Script for Fotmob
*Author: James Njoroge*

This notebook scrapes the following metrics from the Fotmob website for the 2024/2025 Premier League season:
### Match Data:
1. **Label (match-wise, post-match)**:
    - Goals
    - Fouls
    - Expected Goals (xG)
    - Big chances
    - Total Shots
    - Shots on Target
    - Shots Inside Box
    - Corners
    - Touches in opposition box
    - Stadium attendance vs. stadium capacity (*Captures fan demand, since the crowd plays a part in influencing player behavior.*)
2. **Features (team-wise, pre-match)**:
    - xG/90
    - shots on target/90
    - corners/90
    - xG conceded/90
    - Home advantage
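
Because the features are pre-match, each per-90 value should use only the matches a team played before the fixture in question, so the post-match labels never leak into the features. A minimal sketch of that calculation, assuming each earlier match contributes a stat value and the minutes played (the `per_90` helper and the sample numbers are hypothetical and not part of the scraper):

```python
from typing import List, Optional, Tuple

def per_90(history: List[Tuple[float, float]]) -> Optional[float]:
    """Return a stat per 90 minutes from (stat_value, minutes_played) pairs.

    `history` should contain only matches played before the fixture of interest.
    """
    total_stat = sum(value for value, _ in history)
    total_minutes = sum(minutes for _, minutes in history)
    if total_minutes == 0:
        return None  # no earlier matches: the feature is undefined
    return 90.0 * total_stat / total_minutes

# Hypothetical xG values for a team's first three matches, 90 minutes each.
past_xg = [(1.2, 90), (0.8, 90), (2.0, 90)]
xg_per_90 = per_90(past_xg)
print(round(xg_per_90, 3))  # 1.333
```

For the first fixture of the season the history is empty, so the sketch returns `None` rather than guessing a value.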

### Intuition:

1. Using the browser developer tools (Network tab), we can see that whenever a specific match page or its statistics are opened, Fotmob queries its backend through an API. The response is a JSON document that the frontend parses to populate the UI.
2. That JSON file is what we need, and we need it for all 380 matches played in the 24-25 season.
3. The next logical question is whether we can intercept the JSON as it is passed to the frontend. We can, using selenium-wire. The flow is as follows:
    - Go to the Fotmob page that lists the matches for each round (38 rounds).
    - From this page, pull the link for each match in that round by targeting the section and anchor CSS classes.
    - Visit each match link and intercept the JSON response that the page requests from the backend through their private API.
    - In a few cases no data file is saved, probably because the page took too long to load. This is fine: as long as it is only one or two files, you can open the match in your browser, copy the JSON from the Network tab, and add it to your directory.
4. This JSON file holds all the data and metadata for each match (including the prevailing weather conditions), so we simply save it and process the parts we need later.
5. Querying the Fotmob API directly from the browser shows that it is largely locked down (good practice on their part), so we take the longer route of intercepting the page's own requests.
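
The selection step in point 3 comes down to scanning the captured requests for the one whose URL carries the ID of the match we opened. A minimal sketch of that filter, using plain URL strings as stand-ins for the request objects selenium-wire would provide (the `pick_details_url` helper and the first sample URL are illustrative assumptions):

```python
import re
from typing import List, Optional

# Covers both endpoint shapes: /api/matchDetails and /api/data/matchDetails.
DETAILS_RX = re.compile(r"fotmob\.com/api/(?:data/)?matchDetails\?matchId=(\d+)")

def pick_details_url(captured_urls: List[str], match_id: str) -> Optional[str]:
    """Return the first captured URL that is a matchDetails call for `match_id`."""
    for url in captured_urls:
        m = DETAILS_RX.search(url)
        if m and m.group(1) == match_id:
            return url
    return None

captured = [
    "https://www.fotmob.com/api/translations?lang=en",  # hypothetical unrelated call
    "https://www.fotmob.com/api/matchDetails?matchId=4506264",
    "https://www.fotmob.com/api/data/matchDetails?matchId=4506263",
]
print(pick_details_url(captured, "4506263"))
# https://www.fotmob.com/api/data/matchDetails?matchId=4506263
```

Matching on the ID matters because a match page can also trigger matchDetails calls for other fixtures.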

### Already scraped data for reference

I have provided my scraped match JSON files in `24-25_PL_Data_raw.zip` for your use or reference if you decide to test the script.

### Setup: Load Libraries and Season Configuration

1. Align all scraping dependencies (Selenium Wire, requests, compression helpers) so the notebook can monitor network traffic reliably.
2. Lock in the league/season identifiers plus filesystem and Selenium knobs that the rest of the flow reuses.
3. Next we package small utilities around these constants to keep the later scraping loop tidy.
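
Two of the constants defined in the next cell can be checked on their own before any browser is launched: the round URL template and the regex that pulls a match ID out of a link. A small sketch, assuming hrefs in the two shapes the pattern is written for (the sample hrefs are made up for illustration):

```python
import re

ROUND_URL_TMPL = (
    "https://www.fotmob.com/leagues/{league_id}/matches/{league_slug}"
    "?season={season}&group=by-round&round={round_no}"
)
MATCH_HREF_RX = re.compile(r"(?:/match(?:es)?/[^#]*#|/match/)(\d+)")

round_url = ROUND_URL_TMPL.format(
    league_id=47, league_slug="premier-league", season="2024-2025", round_no=0
)
print(round_url)

# Hypothetical hrefs: one with the ID after '#', one with the ID in the path.
hash_style = "/matches/manchester-united-vs-fulham/abc123#4506263"
path_style = "/match/4506263"
ids = [MATCH_HREF_RX.search(h).group(1) for h in (hash_style, path_style)]
print(ids)  # ['4506263', '4506263']
```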



In [None]:
# Standard-library imports: filesystem, regex, JSON, timing, and decompression
import os, re, json, time, random, gzip, zlib
from typing import Dict, List, Tuple, Optional
from urllib.parse import urljoin

import brotli        # for Content-Encoding: br
import requests      # optional direct fallback

# --- Selenium (network interception via selenium-wire)
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# =========================
# Config
# =========================
SEASON_LABEL       = "24-25_PL_Data"       # top-level folder
LEAGUE_ID          = 47                    # Premier League (can check fotmob for other league codes)
LEAGUE_SLUG        = "premier-league"      # basically how fotmob denotes their league name
SEASON_SLUG        = "2024-2025"           # specific season in fotmob's format
ROUND_START        = 0                     # first round index (rounds run 0..37 on fotmob)
ROUND_END          = 37
DATA_ROOT          = SEASON_LABEL
CHROMEDRIVER_PATH  = os.path.expanduser("/Path/to/your/chromedriver-mac-arm64/chromedriver") # macOS example; change it to your system's path (download chromedriver from the Chrome developer site first)
HEADLESS           = True                  # do the task without visible user interface (in the background)
PER_MATCH_SLEEP    = 0.25
PAGE_SETTLE_WAIT   = 3.0                   # give a bit more time for all elements and queries to load
RETRY_ON_MISS      = 1                     # extra quick retry if miss
ALLOW_DIRECT_FALLBACK = True               # try direct GET if interception misses
DEBUG              = False                 # set True to emit extra logs + snapshots

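# exact CSS classes of the elements that list the matches on the round page
# (generated class names like these may change, so re-check them in the browser if discovery returns nothing)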
SECTION_CLASS_EXACT = "css-o4yr0b-LeagueMatchesSectionCSS eaa01ac0"
ANCHOR_CLASS_EXACT  = "css-1ajdexg-MatchWrapper e1mxmq6p0"

ROUND_URL_TMPL = (
    "https://www.fotmob.com/leagues/{league_id}/matches/{league_slug}"
    "?season={season}&group=by-round&round={round_no}"
) # url template for the matchpages for each round (note league_id, season, and round numbers)

# Accept BOTH endpoints (some matches use /api/matchDetails, others /api/data/matchDetails)
DETAILS_PATTERNS = [
    re.compile(r"https://(?:www\.)?fotmob\.com/api/data/matchDetails\?matchId=(\d+)"),
    re.compile(r"https://(?:www\.)?fotmob\.com/api/matchDetails\?matchId=(\d+)")
]
# accept /matches/...#<id>  OR  /match/<id>
MATCH_HREF_RX = re.compile(r"(?:/match(?:es)?/[^#]*#|/match/)(\d+)")

# Make sure the output directory exists, plus a separate directory for debug snapshots when DEBUG is on.
os.makedirs(DATA_ROOT, exist_ok=True)
DEBUG_DIR = os.path.join(DATA_ROOT, "_debug")
if DEBUG: os.makedirs(DEBUG_DIR, exist_ok=True)


### Utilities: Shared Helpers

1. Bundle Selenium bootstrap, directory creation, and ID parsing helpers so later cells stay focused on workflow logic.
2. Centralise response decoding and safe file naming, giving us reusable building blocks for every match we touch.
3. With these basics covered, we can concentrate on discovering the per-round match URLs.
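
The decoding helper has to cope with two flavours of `deflate`: zlib-wrapped and raw. A standard-library sketch of that branch logic, leaving Brotli out so it runs without third-party packages (the `decode_stdlib` name and the tiny payload are illustrative assumptions):

```python
import gzip
import zlib

def decode_stdlib(raw: bytes, encoding: str) -> bytes:
    """Decompress gzip or deflate bodies; pass anything else through unchanged."""
    enc = encoding.lower()
    if "gzip" in enc:
        return gzip.decompress(raw)
    if "deflate" in enc:
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw

payload = b'{"general": {"matchId": "4506263"}}'  # stand-in body, not a real response

# Build a raw deflate stream (no zlib header) to exercise the fallback branch.
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_deflate = compressor.compress(payload) + compressor.flush()

from_gzip = decode_stdlib(gzip.compress(payload), "gzip")
from_zlib = decode_stdlib(zlib.compress(payload), "deflate")
from_raw = decode_stdlib(raw_deflate, "deflate")
print(from_gzip == from_zlib == from_raw == payload)  # True
```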



In [21]:
# =========================
# Utilities
# =========================
# Bootstrap selenium-wire so we can capture network traffic from match pages.
def init_driver(headless: bool = True):
    """Initialise a selenium-wire Chrome driver using the configured Chrome options."""
    opts = Options()
    if headless:
        opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    service = Service(CHROMEDRIVER_PATH)
    driver = webdriver.Chrome(service=service, options=opts)  # selenium-wire driver
    driver.set_page_load_timeout(45)
    return driver

# Keep a per-round folder structure for the downloaded JSON files.
def ensure_round_dir(round_no: int) -> str:
    """Create (if needed) and return the filesystem path for a given round."""
    rdir = os.path.join(DATA_ROOT, f"round_{round_no}")
    os.makedirs(rdir, exist_ok=True)
    return rdir

# Pull out the matchId token from Fotmob anchor links.
def match_id_from_href(href: str) -> Optional[str]:
    """Extract the matchId token from a Fotmob anchor href."""
    if not href:
        return None
    m = MATCH_HREF_RX.search(href)
    return m.group(1) if m else None

# Case-insensitive header lookup helper for captured responses.
def get_header(headers, key: str) -> Optional[str]:
    """Retrieve a header value in a case-insensitive way from selenium-wire metadata."""
    if not headers:
        return None
    key = key.lower()
    for k, v in headers.items():
        if k.lower() == key:
            return v
    return None

# Normalise compressed response bodies to readable JSON bytes.
def decode_body(raw: bytes, headers: Dict) -> bytes:
    """Decompress the response body based on Content-Encoding and return raw JSON bytes."""
    enc = (get_header(headers, "Content-Encoding") or "").lower()
    if "br" in enc:
        return brotli.decompress(raw)
    if "gzip" in enc:
        return gzip.decompress(raw)
    if "deflate" in enc:
        # some servers use raw deflate (zlib) — try both
        try:
            return zlib.decompress(raw)
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw  # no compression

# Persist intercepted payloads in a pretty-printed format for inspection.
def save_json_pretty(path: str, raw_json_bytes: bytes):
    """Write the captured payload to disk with readable indentation."""
    data = json.loads(raw_json_bytes.decode("utf-8"))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Sanitise strings when embedding team names into filenames.
def safe_name(s: str) -> str:
    """Sanitise strings for filesystem-friendly filenames."""
    return re.sub(r"[^a-zA-Z0-9_\-]+", "_", s).strip("_")

# Lightweight debug logger toggled via the DEBUG flag.
def _dbg(msg: str):
    """Emit debug output when verbose tracing is enabled."""
    if DEBUG:
        print(msg)

# Optional HTML snapshots make debugging missing elements easier.
def _save_snapshot(driver, round_no, tag):
    """Persist the HTML snapshot for troubleshooting when DEBUG mode is active."""
    if not DEBUG:
        return
    p = os.path.join(DEBUG_DIR, f"round_{round_no}_{tag}.html")
    try:
        with open(p, "w", encoding="utf-8") as f:
            f.write(driver.page_source)
        print(f"[SNAPSHOT] {p}")
    except Exception:
        pass

# Limit selenium-wire capture scope to matchDetails requests only.
def details_scope_pattern() -> str:
    """Limit selenium-wire interception to matchDetails endpoints across both URL patterns."""
    return r".*fotmob\.com/api/(?:data/)?matchDetails\?matchId=\d+.*"

# Extract matchId values from captured request URLs.
def extract_mid_from_url(url: str) -> Optional[str]:
    """Pull the matchId from a captured request URL using the configured regex patterns."""
    for rx in DETAILS_PATTERNS:
        m = rx.search(url)
        if m:
            return m.group(1)
    return None


### Discovery: Collect Match Links Per Round

1. Visit the Fotmob round page and target the match containers using the observed CSS classes.
2. Fall back to a broader anchor scan when the exact selectors miss, deduplicating links as we go.
3. Once we have the URLs, we can intercept the matchDetails payloads for each fixture.
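
Two small pieces of the discovery step can be tried without a browser: turning a space-separated class attribute into a compound CSS selector, and de-duplicating links while keeping their order. A sketch with hypothetical helper names (`class_selector`, `dedupe_keep_order`):

```python
from typing import Iterable, List

def class_selector(tag: str, class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector."""
    return tag + "." + ".".join(class_attr.split())

def dedupe_keep_order(items: Iterable[str]) -> List[str]:
    """Drop repeated items while keeping the first occurrence of each."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

selector = class_selector("section", "css-o4yr0b-LeagueMatchesSectionCSS eaa01ac0")
print(selector)  # section.css-o4yr0b-LeagueMatchesSectionCSS.eaa01ac0

links = ["/match/1", "/match/2", "/match/1"]
print(dedupe_keep_order(links))  # ['/match/1', '/match/2']
```

The compound selector requires both classes on the same element, which is why it is tried first and the broader anchor scan is kept as a fallback.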



In [22]:
# =========================
# Discovery (exact classes first, then fallback)
# =========================
def discover_match_urls_for_round(round_no: int) -> List[str]:
    """Open the round page and return the de-duplicated match URLs found on it."""
    url = ROUND_URL_TMPL.format(
        league_id=LEAGUE_ID,
        league_slug=LEAGUE_SLUG,
        season=SEASON_SLUG,
        round_no=round_no,
    )
    driver = init_driver(headless=HEADLESS)
    try:
        _dbg(f"[URL] navigating -> {url}")
        driver.get(url)
        time.sleep(1.0)
        _dbg(f"[URL] landed     -> {driver.current_url}")

        # Wait for sections with exact class
        section_css_exact = "section." + ".".join(SECTION_CLASS_EXACT.split())
        sections = []
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, section_css_exact))
            )
            sections = driver.find_elements(By.CSS_SELECTOR, section_css_exact)
            _dbg(f"[SECTIONS:EXACT] {len(sections)} via '{section_css_exact}'")
        except TimeoutException:
            _dbg(f"[SECTIONS:EXACT] none after wait")
            _save_snapshot(driver, round_no, "no_sections_exact")

        links: List[str] = []

        if sections:
            anchor_css_exact = "a." + ".".join(ANCHOR_CLASS_EXACT.split())
            for sec in sections:
                anchors = sec.find_elements(By.CSS_SELECTOR, f'{anchor_css_exact}[href]')
                for a in anchors:
                    href = a.get_attribute("href")
                    if not href: continue
                    href = urljoin("https://www.fotmob.com", href)
                    if "/matches/" in href or "/match/" in href:
                        links.append(href)

        # Fallback: whole page scan
        if not links:
            time.sleep(1.0)
            anchors = driver.find_elements(By.CSS_SELECTOR, 'a[href^="/matches/"], a[href*="/match/"]')
            for a in anchors:
                href = a.get_attribute("href")
                if not href: continue
                href = urljoin("https://www.fotmob.com", href)
                if "/matches/" in href or "/match/" in href:
                    links.append(href)

        # Deduplicate preserving order
        seen, out = set(), []
        for h in links:
            if h not in seen:
                seen.add(h)
                out.append(h)
        return out

    finally:
        driver.quit()

### Interception: Capture `matchDetails` Payloads

1. Spin up a fresh Selenium Wire session for each match URL so the request log starts clean and we do not carry over stale interceptions between fixtures.
2. Decode any compressed responses and retry or fall back to direct API calls when interception misses.
3. The returned bytes will feed the lightweight parsers that pull team names and support on-disk storage.
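
When falling back across the two endpoints, it helps if a bad response from one does not stop the other from being tried. A minimal sketch of that pattern with an injected fetch function, so it runs without network access (`first_valid_json` and `fake_fetch` are hypothetical names, and the fake responses are made up):

```python
import json
from typing import Callable, Iterable, Optional, Tuple

def first_valid_json(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[Tuple[str, bytes]]:
    """Try each URL in order and return the first body that parses as JSON."""
    for url in urls:
        try:
            body = fetch(url)
            json.loads(body.decode("utf-8"))
        except Exception:
            continue  # a failure on one endpoint should not block the next
        return (url, body)
    return None

# Stand-in fetcher: the first endpoint returns HTML, the second returns JSON.
def fake_fetch(url: str) -> bytes:
    return b"<html>blocked</html>" if "/api/data/" in url else b'{"ok": true}'

candidates = [
    "https://www.fotmob.com/api/data/matchDetails?matchId=4506263",
    "https://www.fotmob.com/api/matchDetails?matchId=4506263",
]
result = first_valid_json(candidates, fake_fetch)
print(result[0])  # https://www.fotmob.com/api/matchDetails?matchId=4506263
```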


In [23]:
# =========================
# Interception (selenium-wire) with decode + retries + fallback
# =========================
def intercept_matchdetails_for_url(match_url: str, expected_match_id: Optional[str]) -> Optional[Tuple[str, bytes]]:
    """
    Opens the match page, captures network responses, and returns
    (api_url, decoded_json_bytes) for the *current* matchId.
    """
    opts = Options()
    if HEADLESS:
        opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    service = Service(CHROMEDRIVER_PATH)
    driver = webdriver.Chrome(service=service, options=opts)  # selenium-wire driver
    try:
        driver.scopes = [details_scope_pattern()]
        del driver.requests  # selenium-wire clears its capture log via `del`; .clear() would only empty a copy of the list

        driver.get(match_url)
        time.sleep(PAGE_SETTLE_WAIT)

        # small nudge; some pages lazy-load
        try:
            driver.execute_script("window.scrollBy(0, 300);")
        except Exception:
            pass
        time.sleep(0.8)

        def pick_matchdetails() -> Optional[Tuple[str, bytes]]:
            hits = []
            for req in driver.requests:
                if not req.response: 
                    continue
                mid_in_url = extract_mid_from_url(req.url)
                if not mid_in_url:
                    continue
                if expected_match_id and mid_in_url != expected_match_id:
                    # skip head-to-head calls for other matches
                    continue
                try:
                    raw = req.response.body or b""
                except Exception:
                    raw = b""
                if not raw:
                    continue
                # decompress to readable JSON
                decoded = decode_body(raw, req.response.headers or {})
                # quick sanity: ensure it parses
                try:
                    json.loads(decoded.decode("utf-8"))
                except Exception:
                    continue
                hits.append((req.url, decoded))
            # return the first good hit (usually only one)
            return hits[0] if hits else None

        hit = pick_matchdetails()

        # quick retry if miss
        attempts = 0
        while not hit and attempts < RETRY_ON_MISS:
            time.sleep(1.5)
            try:
                driver.execute_script("window.scrollBy(0, 600);")
            except Exception:
                pass
            hit = pick_matchdetails()
            attempts += 1

        # OPTIONAL: direct fallback (single polite GET) if still missing
        if not hit and ALLOW_DIRECT_FALLBACK and expected_match_id:
            # prefer /api/data/matchDetails; fallback to /api/matchDetails
            sess = requests.Session()
            sess.headers.update({
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127 Safari/537.36",
                "Accept": "application/json, text/plain, */*",
                "Referer": "https://www.fotmob.com/",
            })
            urls = [
                f"https://www.fotmob.com/api/data/matchDetails?matchId={expected_match_id}",
                f"https://www.fotmob.com/api/matchDetails?matchId={expected_match_id}",
            ]
            for u in urls:
                # keep the try inside the loop so a failure on the first
                # endpoint does not skip the second one
                try:
                    r = sess.get(u, timeout=20)
                    if r.ok and r.content:
                        # requests auto-decompresses; just sanity-parse
                        json.loads(r.content.decode("utf-8"))
                        return (u, r.content)
                except Exception:
                    continue

        return hit

    finally:
        driver.quit()

### Parsing: Pull Team Metadata from Payloads

1. Quickly examine the JSON structure for the header/general sections to extract the home and away names.
2. Provide graceful fallbacks when the structure varies so downstream code can still log the match.
3. With this helper, saved files carry human-readable context in their filenames and index entries.
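
The fallback order can be illustrated on hand-written dictionaries: look under `header.teams` first, then under `general.homeTeam` and `general.awayTeam`. A simplified sketch (the `team_names` helper and both sample payloads are stand-ins, not real Fotmob responses):

```python
from typing import Optional, Tuple

def team_names(payload: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (home, away), trying header.teams before general.homeTeam/awayTeam."""
    header = payload.get("header")
    if isinstance(header, dict):
        teams = header.get("teams")
        if isinstance(teams, list) and len(teams) >= 2:
            names = [t.get("name") if isinstance(t, dict) else None for t in teams[:2]]
            if any(names):
                return (names[0], names[1])
    general = payload.get("general")
    if isinstance(general, dict):
        home = general.get("homeTeam")
        away = general.get("awayTeam")
        home_name = home.get("name") if isinstance(home, dict) else None
        away_name = away.get("name") if isinstance(away, dict) else None
        if home_name or away_name:
            return (home_name, away_name)
    return (None, None)

with_header = {"header": {"teams": [{"name": "Manchester United"}, {"name": "Fulham"}]}}
only_general = {"general": {"homeTeam": {"name": "Arsenal"}, "awayTeam": None}}

print(team_names(with_header))   # ('Manchester United', 'Fulham')
print(team_names(only_general))  # ('Arsenal', None)
print(team_names({}))            # (None, None)
```

Checking `isinstance(..., dict)` before each `.get("name")` keeps a `null` field in the JSON from raising an `AttributeError`.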



In [24]:
# =========================
# Parse minimal team info
# =========================
def extract_teams_from_payload(payload: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (home, away) team names from a matchDetails payload, or (None, None)."""
    try:
        d = json.loads(payload.decode("utf-8", errors="ignore"))
    except Exception:
        return (None, None)

    header = d.get("header") if isinstance(d, dict) else None
    if isinstance(header, dict):
        t = header.get("teams")
        if isinstance(t, list) and len(t) >= 2:
            home = t[0].get("name") if isinstance(t[0], dict) else None
            away = t[1].get("name") if isinstance(t[1], dict) else None
            if home or away:
                return (home, away)
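        # guard against non-dict values (e.g. null) before reading the name
        h = header.get("home")
        a = header.get("away")
        home = h.get("name") if isinstance(h, dict) else None
        away = a.get("name") if isinstance(a, dict) else None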
        if home or away:
            return (home, away)

    general = d.get("general") if isinstance(d, dict) else None
    if isinstance(general, dict):
        h = general.get("homeTeam", {})
        a = general.get("awayTeam", {})
        home = h.get("name") if isinstance(h, dict) else None
        away = a.get("name") if isinstance(a, dict) else None
        if home or away:
            return (home, away)

    return (None, None)

### Orchestration: Scrape the Entire Season

1. Loop across every round, use the discovery helpers to gather fixtures, and intercept the corresponding matchDetails JSON.
2. Persist each payload with clear filenames, build an index of metadata, and respect delays to keep the crawl polite.
3. After the loop, write an index.json summary that fuels validation and downstream analysis.
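
The filename scheme is easy to check in isolation: the match ID always comes first, and the team names are appended only when both are known. A sketch that wraps the same logic in a hypothetical `match_filename` helper (the `&` in the second team name is an assumption about how the name appears in the payload):

```python
import re
from typing import Optional

def safe_name(s: str) -> str:
    """Replace runs of characters outside [a-zA-Z0-9_-] with one underscore."""
    return re.sub(r"[^a-zA-Z0-9_\-]+", "_", s).strip("_")

def match_filename(match_id: str, home: Optional[str], away: Optional[str]) -> str:
    """Build '<id>_matchDetails[_<home>-vs-<away>].json'."""
    base = f"{match_id}_matchDetails"
    if home and away:
        base += f"_{safe_name(home)}-vs-{safe_name(away)}"
    return base + ".json"

print(match_filename("4506263", "Manchester United", "Fulham"))
# 4506263_matchDetails_Manchester_United-vs-Fulham.json
print(match_filename("4506266", "Everton", "Brighton & Hove Albion"))
# 4506266_matchDetails_Everton-vs-Brighton_Hove_Albion.json
print(match_filename("4506267", None, "Southampton"))
# 4506267_matchDetails.json
```

Keeping the ID as the leading token is what lets the later validation and index-rebuild steps recover it with a simple split on `_`.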



In [29]:
# =========================
# Main
# =========================
def main():
    index_rows: List[Dict] = []

    for round_no in range(ROUND_START, ROUND_END + 1):
        round_dir = ensure_round_dir(round_no)

        urls = discover_match_urls_for_round(round_no)

        # keep only links that yield a parseable matchId
        cleaned = []
        seen = set()
        for u in urls:
            mid = match_id_from_href(u)
            if not mid:
                continue
            if u not in seen:
                seen.add(u)
                cleaned.append(u)

        print(f"Round {round_no}: {len(cleaned)} matches")

        for i, u in enumerate(cleaned, 1):
            mid = match_id_from_href(u)
            saved_path = None
            api_url = None
            home = away = None

            hit = intercept_matchdetails_for_url(u, mid)
            if hit:
                api_url, payload = hit
                home, away = extract_teams_from_payload(payload)

                base = f"{mid}_matchDetails"
                if home and away:
                    base += f"_{safe_name(home)}-vs-{safe_name(away)}"
                fname = base + ".json"
                fpath = os.path.join(round_dir, fname)
                save_json_pretty(fpath, payload)          # <-- pretty, readable JSON
                saved_path = fpath
                print(f"  [{i}/{len(cleaned)}] saved {mid} -> {os.path.relpath(fpath)}")
            else:
                print(f"  [{i}/{len(cleaned)}] missed {mid} (no matchDetails)")
                # (It will still be recorded in index)

            index_rows.append({
                "round": round_no,
                "matchUrl": u,
                "matchId": mid,
                "apiUrl": api_url,
                "home": home,
                "away": away,
                "jsonPath": os.path.relpath(saved_path) if saved_path else None,
                "ts": time.time(),
            })

            time.sleep(PER_MATCH_SLEEP + random.uniform(0, 0.15))

    # write season-level index
    index_path = os.path.join(DATA_ROOT, "index.json")
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index_rows, f, ensure_ascii=False, indent=2)
    print(f"\nWrote index -> {index_path}")
    print("Done.")

if __name__ == "__main__":
    main()

[URL] navigating -> https://www.fotmob.com/leagues/47/matches/premier-league?season=2024-2025&group=by-round&round=0
[URL] landed     -> https://www.fotmob.com/leagues/47/matches/premier-league?season=2024-2025&group=by-round&round=0
[SECTIONS:EXACT] 4 via 'section.css-o4yr0b-LeagueMatchesSectionCSS.eaa01ac0'
Round 0: 10 matches
  [1/10] saved 4506263 -> 24-25_PL_Data/round_0/4506263_matchDetails_Manchester_United-vs-Fulham.json
  [2/10] saved 4506264 -> 24-25_PL_Data/round_0/4506264_matchDetails_Ipswich_Town-vs-Liverpool.json
  [3/10] saved 4506265 -> 24-25_PL_Data/round_0/4506265_matchDetails_Arsenal-vs-Wolverhampton_Wanderers.json
  [4/10] saved 4506266 -> 24-25_PL_Data/round_0/4506266_matchDetails_Everton-vs-Brighton_Hove_Albion.json
  [5/10] saved 4506267 -> 24-25_PL_Data/round_0/4506267_matchDetails_Newcastle_United-vs-Southampton.json
  [6/10] saved 4506268 -> 24-25_PL_Data/round_0/4506268_matchDetails_Nottingham_Forest-vs-AFC_Bournemouth.json
  [7/10] saved 4506269 -> 24-25_PL_

### QA: Verify Download Coverage

1. After scraping, sweep the filesystem to count distinct matchIds and confirm the expected range is present.
2. Surface any missing or unexpected IDs so we know instantly which fixtures need another pass.
3. This quick audit runs standalone, letting us rerun it after manual file changes or re-scrapes.
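
The audit is plain set arithmetic over match IDs: expected minus found gives the missing fixtures, and found minus expected gives the strays. A toy sketch with made-up IDs (`coverage_report` is a hypothetical helper):

```python
from typing import Iterable, List, Tuple

def coverage_report(
    found_ids: Iterable[int], first_id: int, last_id: int
) -> Tuple[List[int], List[int]]:
    """Return (missing, unexpected) IDs relative to the inclusive expected range."""
    expected = set(range(first_id, last_id + 1))
    found = set(found_ids)
    return sorted(expected - found), sorted(found - expected)

missing, unexpected = coverage_report([100, 101, 103, 250], first_id=100, last_id=104)
print(missing)     # [102, 104]
print(unexpected)  # [250]

# The season's range is inclusive at both ends, so it holds 380 IDs.
print(4506642 - 4506263 + 1)  # 380
```

This check relies on the season's match IDs being contiguous, as they are for the range used in the next cell.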



In [None]:
# Post-scraping validation of downloaded match JSON files
from pathlib import Path

EXPECTED_MATCH_COUNT = 380
EXPECTED_MATCH_ID_START = 4506263 # Pull from fotmob website
EXPECTED_MATCH_ID_END = 4506642 # Pull from fotmob website; equals the starting match id plus 379, since the 380-match range is inclusive at both ends.


def summarize_match_downloads(data_root: str = DATA_ROOT) -> None:
    data_root_path = Path(data_root)
    if not data_root_path.exists():
        print(f"Data root {data_root} does not exist.")
        return

    match_ids = []
    for round_dir in sorted(data_root_path.glob('round_*')):
        for json_path in round_dir.glob('*.json'):
            stem = json_path.stem
            match_id_part = stem.split('_', 1)[0]
            if match_id_part.isdigit():
                match_ids.append(int(match_id_part))

    unique_ids = sorted(set(match_ids))
    expected_ids = set(range(EXPECTED_MATCH_ID_START, EXPECTED_MATCH_ID_END + 1))
    missing_ids = sorted(expected_ids - set(unique_ids))
    unexpected_ids = sorted(set(unique_ids) - expected_ids)

    print(f"Found {len(unique_ids)} match JSON files (expected {EXPECTED_MATCH_COUNT}).")
    if unique_ids:
        print(f"Observed matchId range: {unique_ids[0]} - {unique_ids[-1]}")
    else:
        print('Observed matchId range: none')

    if missing_ids:
        print(f"Missing matchIds ({len(missing_ids)}): {', '.join(str(mid) for mid in missing_ids)}")
    else:
        print('Missing matchIds: none')

    if unexpected_ids:
        print(f"Unexpected matchIds ({len(unexpected_ids)}): {', '.join(str(mid) for mid in unexpected_ids)}")
    else:
        print('Unexpected matchIds: none')


summarize_match_downloads()

Found 380 match JSON files (expected 380).
Observed matchId range: 4506263 - 4506642
Missing matchIds: none
Unexpected matchIds: none


### Recovery: Reconstruct `index.json` from Disk

1. When the on-disk index drifts, rebuild it by reading each round's JSON files and extracting match metadata.
2. Recreate per-round structures with match URLs, team names, and relative file paths for easy navigation.
3. Writing the refreshed index.json keeps validation steps and future analytics consistent with the files we have.
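
Once the index is grouped per round, a quick consumer can flag rounds that do not hold the expected number of fixtures. A sketch that writes a hand-made index to a temporary directory and reads it back, assuming ten fixtures per round (380 matches over 38 rounds); the sample entries are stand-ins:

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in with the per-round shape: a list of {round, matches}.
sample_index = [
    {"round": 0, "matches": [
        {"matchId": 4506263, "home": "Manchester United", "away": "Fulham",
         "jsonPath": "round_0/4506263_matchDetails_Manchester_United-vs-Fulham.json"},
    ]},
    {"round": 1, "matches": []},
]

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    index_path.write_text(json.dumps(sample_index, indent=2), encoding="utf-8")

    rounds = json.loads(index_path.read_text(encoding="utf-8"))
    matches_per_round = {r["round"]: len(r["matches"]) for r in rounds}
    # Assumption: a complete round holds ten fixtures.
    short_rounds = [rn for rn, n in matches_per_round.items() if n != 10]

print(matches_per_round)  # {0: 1, 1: 0}
print(short_rounds)       # [0, 1]
```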



In [31]:
# Rebuild index.json from the downloaded match files if it was not built correctly.
# index.json gives a high-level overview of each match file: where it is stored,
# plus enough metadata to cross-check that its information is correct.
import json
from pathlib import Path


def _extract_teams_from_payload_dict(payload: dict):
    header = payload.get('header') if isinstance(payload, dict) else None
    if isinstance(header, dict):
        teams = header.get('teams')
        if isinstance(teams, list) and len(teams) >= 2:
            home = teams[0].get('name') if isinstance(teams[0], dict) else None
            away = teams[1].get('name') if isinstance(teams[1], dict) else None
            if home or away:
                return home, away
        home_name = header.get('home', {}).get('name') if isinstance(header.get('home'), dict) else None
        away_name = header.get('away', {}).get('name') if isinstance(header.get('away'), dict) else None
        if home_name or away_name:
            return home_name, away_name

    general = payload.get('general') if isinstance(payload, dict) else None
    if isinstance(general, dict):
        home = general.get('homeTeam', {}).get('name') if isinstance(general.get('homeTeam'), dict) else None
        away = general.get('awayTeam', {}).get('name') if isinstance(general.get('awayTeam'), dict) else None
        if home or away:
            return home, away

    return None, None


def rebuild_index_from_files(data_root: str = DATA_ROOT, output_path: Path | None = None) -> Path | None:
    data_root_path = Path(data_root)
    if not data_root_path.exists():
        print(f"Data root {data_root} does not exist.")
        return None

    rounds = []
    for round_dir in sorted(data_root_path.glob('round_*')):
        if not round_dir.is_dir():
            continue
        try:
            round_no = int(round_dir.name.split('_', 1)[1])
        except (IndexError, ValueError):
            continue

        matches = []
        for json_path in sorted(round_dir.glob('*.json')):
            stem = json_path.stem
            match_id_part = stem.split('_', 1)[0]
            if not match_id_part.isdigit():
                continue

            match_id = int(match_id_part)
            home_team = away_team = None
            try:
                payload = json.loads(json_path.read_text(encoding='utf-8'))
                home_team, away_team = _extract_teams_from_payload_dict(payload)
            except Exception as exc:
                print(f"Warning: failed to parse {json_path}: {exc}")

            matches.append({
                'matchId': match_id,
                'matchUrl': f"https://www.fotmob.com/match/{match_id}",
                'home': home_team,
                'away': away_team,
                'jsonPath': str(json_path.relative_to(data_root_path)),
            })

        rounds.append({
            'round': round_no,
            'matches': matches,
        })

    rounds.sort(key=lambda item: item['round'])
    if output_path is None:
        output_path = data_root_path / 'index.json'

    output_path.write_text(json.dumps(rounds, ensure_ascii=False, indent=2), encoding='utf-8')
    print(f"Wrote rebuilt index to {output_path}")
    return output_path


rebuild_index_from_files()



Wrote rebuilt index to 24-25_PL_Data/index.json


PosixPath('24-25_PL_Data/index.json')