# NHL Data Scraper v3.0.0

Previously, I used Selenium to scrape data from the NHL stats website. This version scrapes data from **https://api-web.nhle.com/** instead.
The API is documented here:

**https://github.com/Zmalski/NHL-API-Reference**

> ## Features
> 
> #### 1. Basic Game Information
> 1. ~~`game_id` - Unique identifier for the game~~
> 2. ~~`date` - Date of the game~~
> 3. ~~`home_team` - Home team abbreviation~~
> 4. ~~`away_team` - Away team abbreviation~~
> 5. ~~`home_win` - Target variable (1 = home win, 0 = home loss)~~
> 
> #### 2. Team Performance - Rolling Averages (Last 5 Games)
> 6. ~~`home_win_pct_l5` - Home team win percentage (last 5 games)~~
> 7. ~~`away_win_pct_l5` - Away team win percentage (last 5 games)~~
> 8. ~~`home_gf_per_game_l5` - Home team goals for per game (last 5)~~
> 9. ~~`away_gf_per_game_l5` - Away team goals for per game (last 5)~~
> 10. ~~`home_ga_per_game_l5` - Home team goals against per game (last 5)~~
> 11. ~~`away_ga_per_game_l5` - Away team goals against per game (last 5)~~
> 12. ~~`home_goal_diff_l5` - Home team goal differential (last 5)~~
> 13. ~~`away_goal_diff_l5` - Away team goal differential (last 5)~~
> 14. ~~`home_shots_per_game_l5` - Home team shots per game (last 5)~~
> 15. ~~`away_shots_per_game_l5` - Away team shots per game (last 5)~~
> 16. ~~`home_shot_diff_l5` - Home team shot differential (last 5)~~
> 17. ~~`away_shot_diff_l5` - Away team shot differential (last 5)~~
> 18. ~~`home_shooting_pct_l5` - Home team shooting percentage (last 5)~~
> 19. ~~`away_shooting_pct_l5` - Away team shooting percentage (last 5)~~
> 
> #### 3. Special Teams - Rolling Averages (Last 5 Games)
> 20. ~~`home_pp_pct_l5` - Home team power play percentage (last 5)~~
> 21. ~~`away_pp_pct_l5` - Away team power play percentage (last 5)~~
> 22. ~~`home_pk_pct_l5` - Home team penalty kill percentage (last 5)~~
> 23. ~~`away_pk_pct_l5` - Away team penalty kill percentage (last 5)~~
> 24. ~~`home_pp_opportunities_l5` - Home team power play opportunities (last 5)~~
> 25. ~~`away_pp_opportunities_l5` - Away team power play opportunities (last 5)~~
> 26. ~~`home_pk_opportunities_l5` - Home team penalty kill opportunities (last 5)~~
> 27. ~~`away_pk_opportunities_l5` - Away team penalty kill opportunities (last 5)~~
> 
> #### 4. Advanced Metrics - Rolling Averages (Last 5 Games)
> 28. ~~`home_save_pct_l5` - Home team save percentage (last 5)~~
> 29. ~~`away_save_pct_l5` - Away team save percentage (last 5)~~
> 30. ~~`home_faceoff_pct_l5` - Home team faceoff win percentage (last 5)~~
> 31. ~~`away_faceoff_pct_l5` - Away team faceoff win percentage (last 5)~~
> 
> #### 5. Season-to-Date Statistics
> 32. ~~`home_win_pct_season` - Home team win percentage (full season)~~
> 33. ~~`away_win_pct_season` - Away team win percentage (full season)~~
> 34. ~~`home_home_win_pct` - Home team win percentage at home (season)~~
> 35. ~~`away_away_win_pct` - Away team win percentage on road (season)~~
> 36. ~~`home_gf_per_game_season` - Home team goals for per game (season)~~
> 37. ~~`away_gf_per_game_season` - Away team goals for per game (season)~~
> 38. ~~`home_ga_per_game_season` - Home team goals against per game (season)~~
> 39. ~~`away_ga_per_game_season` - Away team goals against per game (season)~~
> 
> #### 6. Game Context & Situational Features
> 40. `home_days_rest` - Days of rest for home team
> 41. `away_days_rest` - Days of rest for away team
> 42. `home_back_to_back` - Home team playing back-to-back (1=yes, 0=no)
> 43. `away_back_to_back` - Away team playing back-to-back (1=yes, 0=no)
> 44. `division_game` - Teams in same division (1=yes, 0=no)
> 
> #### 7. Starting Goalie Statistics
> 50. ~~`home_goalie_save_pct_l5` - Home starting goalie save % (last 5 starts)~~
> 51. ~~`away_goalie_save_pct_l5` - Away starting goalie save % (last 5 starts)~~
> 54. ~~`home_goalie_gaa_l5` - Home starting goalie GAA (last 5 starts)~~
> 55. ~~`away_goalie_gaa_l5` - Away starting goalie GAA (last 5 starts)~~
> 56. ~~`home_goalie_wins_l5` - Home starting goalie wins (last 5 starts)~~
> 58. `home_goalie_days_rest` - Days since home goalie's last start
> 59. `away_goalie_days_rest` - Days since away goalie's last start
> 
> 
> #### 8. Streaks & Momentum
> 62. `home_current_streak` - Current win/loss streak (positive=wins, negative=losses)
> 63. `away_current_streak` - Away team win/loss streak
> 
> #### 9. Head-to-Head Features
> 66. `h2h_home_wins_season` - Home team wins vs away team (this season)
> 67. `h2h_away_wins_season` - Away team wins vs home team (this season)
> 68. `h2h_home_gf_avg` - Home team goals per game vs away team (season)
> 69. `h2h_away_gf_avg` - Away team goals per game vs home team (season)
> 
> #### 10. Differential Features (Engineered)
> 70. ~~`win_pct_diff_l5` - home_win_pct_l5 - away_win_pct_l5~~
> 71. ~~`goal_diff_l5` - home_goal_diff_l5 - away_goal_diff_l5~~
> 72. ~~`shot_diff_l5` - home_shot_diff_l5 - away_shot_diff_l5~~
> 73. `goalie_save_pct_diff` - home_goalie_save_pct - away_goalie_save_pct
> 74. `rest_advantage` - home_days_rest - away_days_rest


## Imports

In [1]:
import os
import time
import pandas as pd
import numpy as np
import requests
import json
from datetime import datetime, timedelta
import csv
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

## Constants

In [None]:
API_BASE_URL = "https://api-web.nhle.com/"
OUTPUT_DIR = "generated/data/"
CSV_FILE = f"{OUTPUT_DIR}/nhl_games_data.csv"


In [None]:
if os.path.exists(CSV_FILE):  # skip on a first run, when there is no CSV yet
    os.remove(CSV_FILE)


## Step 1. Fetch Basic Game Info

In [None]:
import asyncio
import aiohttp
from aiohttp import ClientTimeout

API_BASE_URL = "https://api-web.nhle.com"

TIMEOUT = ClientTimeout(total=15)
MAX_CONCURRENT_REQUESTS = 10
RETRIES = 3

semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def fetch_json(session, url):
    async with semaphore:
        for attempt in range(RETRIES):
            try:
                async with session.get(url) as resp:
                    if resp.status == 200:
                        return await resp.json()
                    elif resp.status in (429, 500, 502, 503, 504):
                        await asyncio.sleep(2 ** attempt)
                    else:
                        return None
            except Exception:
                await asyncio.sleep(2 ** attempt)
        return None

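The retry loop in `fetch_json` sleeps `2 ** attempt` seconds after a retryable status code or an exception. As a small illustration of that schedule (the helper name `backoff_delays` is hypothetical and not part of the notebook), the waits for `RETRIES = 3` work out like this:

```python
def backoff_delays(retries: int, base: int = 2) -> list[int]:
    """Return the sleep time (in seconds) used after each failed attempt."""
    # attempt runs 0, 1, ..., retries - 1, matching range(RETRIES) in fetch_json
    return [base ** attempt for attempt in range(retries)]


delays = backoff_delays(3)
print(delays)       # [1, 2, 4]
print(sum(delays))  # 7
```

So a URL that keeps failing costs up to 7 seconds of sleep, on top of the request time itself, before `fetch_json` gives up and returns `None`. Because the sleep happens inside `async with semaphore`, a retrying request also holds one of the `MAX_CONCURRENT_REQUESTS` slots during that time.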


In [None]:
# --- Setup requests session ---
# session = requests.Session()
# retries = Retry(
#     total=5,                  # retry up to 5 times
#     backoff_factor=0.5,         # wait 0.5s, 1s, 2s, etc. between retries
#     status_forcelist=[500, 502, 503, 504, 429],  # retry on server or rate-limit errors
#     allowed_methods=["GET"]   # only retry GET requests
# )
# session.mount("https://", HTTPAdapter(max_retries=retries))

# Request endpoint for game story data
async def fetch_game_story_async(session, game_id):
    url = f"{API_BASE_URL}/v1/wsc/game-story/{game_id}"
    return await fetch_json(session, url)

# Helpers to extract stats
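def get_num_powerplays(powerplay_str):
    # Parse a "goals/opportunities" string such as "1/4". Always return a 2-tuple
    # so callers can unpack the result even when the value is missing or malformed.
    if not powerplay_str:
        return 0, 0
    try:
        goals, opps = map(int, powerplay_str.split("/"))
        return goals, opps
    except (AttributeError, ValueError):
        return 0, 0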

def calc_num_penalty_kills(powerplays_against, ppg_against):
    return powerplays_against - ppg_against

def calc_penalty_kill_pct(powerplays, ppg_against):
    if powerplays == 0:
        return 0.0
    pk_successes = powerplays - ppg_against
    return round((pk_successes / powerplays) * 100, 2)

def extract_name(obj):
    if not obj:
        return ""
    if isinstance(obj, dict):
        return obj.get("default", "") if "default" in obj else ""
    return str(obj)

def extract_category_stat(stats, category):
    for stat in stats:
        if stat.get("category") == category:
            return stat.get("homeValue", 0), stat.get("awayValue", 0)
    return 0, 0

def extract_all_stats(game_data):

    is_future_or_live_game = game_data.get("gameState") in ["FUT", "LIVE"]
    if is_future_or_live_game:
        return None

    date_str = game_data.get("gameDate", "")
    season_year = game_data.get("season", 0)

    home = game_data.get("homeTeam", {})
    away = game_data.get("awayTeam", {})
    team_stats = game_data.get("summary", {}).get("teamGameStats", [])

    home_place = extract_name(home.get("placeName"))
    home_name = extract_name(home.get("name"))
    home_abbrev = home.get("abbrev", "")
    away_place = extract_name(away.get("placeName"))
    away_name = extract_name(away.get("name"))
    away_abbrev = away.get("abbrev", "")

    full_home_name = f"{home_place} {home_name}".strip()
    full_away_name = f"{away_place} {away_name}".strip()

    if full_home_name in ("Arizona Coyotes", "Utah Utah Hockey Club"):
        home_place = "Utah"
        home_name = "Mammoth"
        home_abbrev = "UTA"

    if full_away_name in ("Arizona Coyotes", "Utah Utah Hockey Club"):
        away_place = "Utah"
        away_name = "Mammoth"
        away_abbrev = "UTA"

    home_faceoffwin_pct, away_faceoffwin_pct = extract_category_stat(team_stats, "faceoffWinningPctg")
    home_powerplays_str, away_powerplays_str = extract_category_stat(team_stats, "powerPlay")

    home_pp_goals, home_pp_opps = get_num_powerplays(home_powerplays_str)
    away_pp_goals, away_pp_opps = get_num_powerplays(away_powerplays_str)

    home_penaltykills = calc_num_penalty_kills(away_pp_opps, away_pp_goals)
    away_penaltykills = calc_num_penalty_kills(home_pp_opps, home_pp_goals)
    home_pk_pct = calc_penalty_kill_pct(away_pp_opps, away_pp_goals)
    away_pk_pct = calc_penalty_kill_pct(home_pp_opps, home_pp_goals)
    home_powerplay_pct, away_powerplay_pct = extract_category_stat(team_stats, "powerPlayPctg")
    home_pims, away_pims = extract_category_stat(team_stats, "pim")
    home_hits, away_hits = extract_category_stat(team_stats, "hits")
    home_blockedshots, away_blockedshots = extract_category_stat(team_stats, "blockedShots")
    home_takeaways, away_takeaways = extract_category_stat(team_stats, "takeaways")
    home_giveaways, away_giveaways = extract_category_stat(team_stats, "giveaways")


    row = {
        "game_id": game_data.get("id"),
        "date": date_str,
        "season": season_year,
        "home_team": f"{home_place} {home_name}".strip(),
        "away_team": f"{away_place} {away_name}".strip(),
        "home_team_abbrev": home_abbrev,
        "away_team_abbrev": away_abbrev,
        "home_win": 1 if home.get("score", 0) > away.get("score", 0) else 0,
        "home_gf": home.get("score", 0),
        "away_gf": away.get("score", 0),
        "home_ga": away.get("score", 0),
        "away_ga": home.get("score", 0),
        "home_sog": home.get("sog", 0),
        "away_sog": away.get("sog", 0),
        "home_faceoffwin_pct": home_faceoffwin_pct,
        "away_faceoffwin_pct": away_faceoffwin_pct,
        "home_powerplays": home_pp_opps,
        "away_powerplays": away_pp_opps,
        "home_powerplay_pct": home_powerplay_pct,
        "away_powerplay_pct": away_powerplay_pct,
        "home_pk": home_penaltykills,
        "away_pk": away_penaltykills,
        "home_pk_pct": home_pk_pct,
        "away_pk_pct": away_pk_pct,
        "home_pims": home_pims,
        "away_pims": away_pims,
        "home_hits": home_hits,
        "away_hits": away_hits,
        "home_blockedshots": home_blockedshots,
        "away_blockedshots": away_blockedshots,
        "home_takeaways": home_takeaways,
        "away_takeaways": away_takeaways,
        "home_giveaways": home_giveaways,
        "away_giveaways": away_giveaways
    }
    return row


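The special-teams numbers above all derive from one string per team. Here is a minimal, self-contained sketch of that arithmetic, assuming the `powerPlay` category arrives as a `"goals/opportunities"` string such as `"1/4"` (which is what the parsing in this notebook implies). The function names and the sample values are made up for illustration:

```python
def parse_power_play(value) -> tuple[int, int]:
    """Split a 'goals/opportunities' string into two ints; (0, 0) if unusable."""
    try:
        goals, opps = map(int, value.split("/"))
        return goals, opps
    except (AttributeError, ValueError):
        # AttributeError: value is not a string; ValueError: wrong shape or non-numeric
        return 0, 0


def penalty_kill_pct(opps_against: int, goals_against: int) -> float:
    """Share of the opponent's power plays that were killed, as a percentage."""
    if opps_against == 0:
        return 0.0
    return round((opps_against - goals_against) / opps_against * 100, 2)


home_goals, home_opps = parse_power_play("1/4")  # home scored once on four chances
away_goals, away_opps = parse_power_play("0/3")  # away went scoreless on three

# A team's penalty kill is measured against the *other* team's power play
home_pk_pct = penalty_kill_pct(away_opps, away_goals)
away_pk_pct = penalty_kill_pct(home_opps, home_goals)
print(home_pk_pct, away_pk_pct)  # 100.0 75.0
```

The cross-over in the last two calls is the part that is easy to get backwards: the home penalty kill uses the away power-play numbers, and vice versa.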


In [None]:
START_SEASON = 2023
TEST_SEASON = 2024
END_SEASON = 2025
PREV_SEASON = 2022
LAST_N_GAMES = 100
MAX_GAMES = 1312
SLEEP_SEC = 0.1

SEASONS = [2023, 2024, 2025]

# Make output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
out_file = os.path.join(OUTPUT_DIR, "nhl_games_data.csv")

fieldnames = [
    "game_id", "date", "season", "home_team", "away_team",
    "home_team_abbrev", "away_team_abbrev", "home_win",
    "home_gf", "away_gf", "home_ga", "away_ga", "home_sog", "away_sog",
    "home_faceoffwin_pct", "away_faceoffwin_pct", "home_powerplays", "away_powerplays",
    "home_powerplay_pct", "away_powerplay_pct", "home_pk", "away_pk", "home_pk_pct", "away_pk_pct",
    "home_pims", "away_pims", "home_hits", "away_hits", "home_blockedshots", "away_blockedshots", 
    "home_takeaways", "away_takeaways", "home_giveaways", "away_giveaways"
]

# ---------------- FILE MODE ----------------
file_exists = os.path.exists(out_file)
mode = "a" if file_exists else "w"

async def fetch_season_games(season):
    game_ids = [
        f"{season}02{str(game_num).zfill(4)}"
        for game_num in range(1, MAX_GAMES + 1)
    ]

    rows = []

    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        for i in range(0, len(game_ids), MAX_CONCURRENT_REQUESTS):
            batch = game_ids[i:i + MAX_CONCURRENT_REQUESTS]

            tasks = [
                fetch_game_story_async(session, gid)
                for gid in batch
            ]

            results = await asyncio.gather(*tasks)

            for game_data in results:
                if not game_data:
                    continue

                row = extract_all_stats(game_data)
                if row is None:
                    break
                else:
                    rows.append(row)

            await asyncio.sleep(0.1)
            
    return rows


print("\n🏁 Finished fetching games!")


🏁 Finished fetching games!

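For reference, the game IDs that `fetch_season_games` requests are built from the season's start year, the two digits `02`, and a zero-padded game number. The sketch below repeats that construction on its own, assuming `02` marks regular-season games (the usual convention for NHL game IDs). The helper name `build_game_ids` is hypothetical:

```python
def build_game_ids(season: int, max_games: int) -> list[str]:
    """Season start year + '02' + four-digit, zero-padded game number."""
    return [f"{season}02{game_num:04d}" for game_num in range(1, max_games + 1)]


ids = build_game_ids(2023, 1312)
print(ids[0], ids[-1], len(ids))  # 2023020001 2023021312 1312
```

`MAX_GAMES = 1312` matches a full schedule of 32 teams playing 82 games each (32 × 82 / 2). IDs for games that have not been played yet come back as future games and produce no row.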


In [None]:
# ===== RUN STEP 1 =====
all_rows = []

for season in SEASONS:
    print(f"📅 Fetching season {season}...")
    season_rows = await fetch_season_games(season)
    all_rows.extend(season_rows)

df = pd.DataFrame(all_rows)

os.makedirs(OUTPUT_DIR, exist_ok=True)
df.to_csv(CSV_FILE, index=False)

print(f"✅ Saved {len(df)} games to {CSV_FILE}")


📅 Fetching season 2023...
📅 Fetching season 2024...
📅 Fetching season 2025...
✅ Saved 3478 games to generated/data//nhl_games_data.csv



## Step 2. Fetch Team Performance Data - Rolling Averages

At this point, we have all the features needed to calculate the rolling averages (last 5 games) for the following features:

> 6. `home_win_pct_l5` - Home team win percentage (last 5 games)
> 7. `away_win_pct_l5` - Away team win percentage (last 5 games)
> 8. `home_gf_per_game_l5` - Home team goals for per game (last 5)
> 9. `away_gf_per_game_l5` - Away team goals for per game (last 5)
> 10. `home_ga_per_game_l5` - Home team goals against per game (last 5)
> 11. `away_ga_per_game_l5` - Away team goals against per game (last 5)
> 12. `home_goal_diff_l5` - Home team goal differential (last 5)
> 13. `away_goal_diff_l5` - Away team goal differential (last 5)
> 14. `home_shots_per_game_l5` - Home team shots per game (last 5)
> 15. `away_shots_per_game_l5` - Away team shots per game (last 5)
> 16. `home_shot_diff_l5` - Home team shot differential (last 5)
> 17. `away_shot_diff_l5` - Away team shot differential (last 5)
> 18. `home_shooting_pct_l5` - Home team shooting percentage (last 5)
> 19. `away_shooting_pct_l5` - Away team shooting percentage (last 5)
>
> #### 3. Special Teams - Rolling Averages (Last 5 Games)
> 20. `home_pp_pct_l5` - Home team power play percentage (last 5)
> 21. `away_pp_pct_l5` - Away team power play percentage (last 5)
> 22. `home_pk_pct_l5` - Home team penalty kill percentage (last 5)
> 23. `away_pk_pct_l5` - Away team penalty kill percentage (last 5)
> 24. `home_pp_opportunities_l5` - Home team power play opportunities (last 5)
> 25. `away_pp_opportunities_l5` - Away team power play opportunities (last 5)
> 26. `home_pk_opportunities_l5` - Home team penalty kill opportunities (last 5)
> 27. `away_pk_opportunities_l5` - Away team penalty kill opportunities (last 5)
>
> #### 4. Advanced Metrics - Rolling Averages (Last 5 Games)
> 30. `home_faceoff_pct_l5` - Home team faceoff win percentage (last 5)
> 31. `away_faceoff_pct_l5` - Away team faceoff win percentage (last 5)

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

# Step 1. Prepare combined dataframe for all teams
home_stats = df[["date",
                 "season",
                 "home_team_abbrev",
                 "home_gf", 
                 "home_ga", 
                 "home_sog", 
                 "home_win", 
                 "home_powerplay_pct", 
                 "home_pk_pct", 
                 "home_powerplays", 
                 "home_pk",
                 "home_faceoffwin_pct",
                 "home_pims",
                 "home_hits",
                 "home_blockedshots",
                 "home_giveaways",
                 "home_takeaways"]].rename(columns={
    "home_team_abbrev": "team_abbrev",
    "home_gf": "goals_for",
    "home_ga": "goals_against",
    "home_sog": "shots_on_goal",
    "home_win": "win",
    "home_powerplay_pct": "powerplay_pct",
    "home_pk_pct": "penalty_kill_pct",
    "home_powerplays": "powerplays",
    "home_pk": "penalty_kills",
    "home_faceoffwin_pct": "faceoffwin_pct",
    "home_pims": "pims",
    "home_hits": "hits",
    "home_blockedshots": "blockedshots",
    "home_giveaways": "giveaways",
    "home_takeaways": "takeaways"
})

away_stats = df[["date", 
                 "season",
                 "away_team_abbrev", 
                 "away_gf", 
                 "away_ga", 
                 "away_sog", 
                 "home_win", 
                 "away_powerplay_pct", 
                 "away_pk_pct", 
                 "away_powerplays", 
                 "away_pk",
                 "away_faceoffwin_pct",
                 "away_pims",
                 "away_hits",
                 "away_blockedshots",
                 "away_giveaways",
                 "away_takeaways"]].rename(columns={
    "away_team_abbrev": "team_abbrev",
    "away_gf": "goals_for",
    "away_ga": "goals_against",
    "away_sog": "shots_on_goal",
    "home_win": "win",
    "away_powerplay_pct": "powerplay_pct",
    "away_pk_pct": "penalty_kill_pct",
    "away_powerplays": "powerplays",
    "away_pk": "penalty_kills",
    "away_faceoffwin_pct": "faceoffwin_pct",
    "away_pims": "pims",
    "away_hits": "hits",
    "away_blockedshots": "blockedshots",
    "away_giveaways": "giveaways",
    "away_takeaways": "takeaways"
})

# For away games, win = 1 - home_win
away_stats["win"] = 1 - away_stats["win"]

combined_stats = pd.concat([home_stats, away_stats], ignore_index=True)

# Step 2. Sort and compute rolling stats for last 5 games
# using lambda makes sure it uses data only from that team and doesn't spill over to other teams that come before it after sorting
combined_stats = (
    combined_stats
    .sort_values(by=["team_abbrev", "season", "date"])
    .reset_index(drop=True)
)
combined_stats["gf_per_game_l5"] = combined_stats.groupby(["team_abbrev", "season"])["goals_for"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["ga_per_game_l5"] = combined_stats.groupby(["team_abbrev", "season"])["goals_against"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["sog_per_game_l5"] = combined_stats.groupby(["team_abbrev", "season"])["shots_on_goal"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["wins_l5"] = combined_stats.groupby(["team_abbrev", "season"])["win"].transform(lambda x: x.rolling(window=5, min_periods=1).sum().shift(1))
combined_stats["powerplay_pct_l5"] = combined_stats.groupby(["team_abbrev", "season"])["powerplay_pct"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["penalty_kill_pct_l5"] = combined_stats.groupby(["team_abbrev", "season"])["penalty_kill_pct"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["powerplays_l5"] = combined_stats.groupby(["team_abbrev", "season"])["powerplays"].transform(lambda x: x.rolling(window=5, min_periods=1).sum().shift(1))
combined_stats["penalty_kills_l5"] = combined_stats.groupby(["team_abbrev", "season"])["penalty_kills"].transform(lambda x: x.rolling(window=5, min_periods=1).sum().shift(1))
combined_stats["faceoffwin_pct_l5"] = combined_stats.groupby(["team_abbrev", "season"])["faceoffwin_pct"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["pims_l5"] = combined_stats.groupby(["team_abbrev", "season"])["pims"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["hits_l5"] = combined_stats.groupby(["team_abbrev", "season"])["hits"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["blockedshots_l5"] = combined_stats.groupby(["team_abbrev", "season"])["blockedshots"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["giveaways_l5"] = combined_stats.groupby(["team_abbrev", "season"])["giveaways"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))
combined_stats["takeaways_l5"] = combined_stats.groupby(["team_abbrev", "season"])["takeaways"].transform(lambda x: x.rolling(5, min_periods=1).mean().shift(1))

# calc home and away win percentages over last 5 
combined_stats["games_l5"] = (
    combined_stats.groupby(["team_abbrev", "season"])["win"]
    .transform(lambda x: x.rolling(5, min_periods=1).count().shift(1))
)

combined_stats["win_pct_l5"] = combined_stats["wins_l5"] / combined_stats["games_l5"]

# Step 3. Merge rolling stats back to original dataframe
df = df.merge(
    combined_stats[["date",
                    "season",
                    "team_abbrev", 
                    "gf_per_game_l5", 
                    "ga_per_game_l5", 
                    "sog_per_game_l5", 
                    "wins_l5", 
                    "win_pct_l5", 
                    "powerplay_pct_l5", 
                    "penalty_kill_pct_l5", 
                    "powerplays_l5", 
                    "penalty_kills_l5",
                    "faceoffwin_pct_l5",
                    "pims_l5",
                    "hits_l5",
                    "blockedshots_l5",
                    "giveaways_l5",
                    "takeaways_l5"]],
    left_on=["home_team_abbrev", "date", "season"],
    right_on=["team_abbrev", "date", "season"],
    how="left"
).rename(columns={"gf_per_game_l5": "home_gf_per_game_l5", 
                  "ga_per_game_l5": "home_ga_per_game_l5", 
                  "sog_per_game_l5": "home_sog_per_game_l5", 
                  "wins_l5": "home_wins_l5", 
                  "win_pct_l5": "home_win_pct_l5",
                  "powerplay_pct_l5": "home_powerplay_pct_l5",
                  "penalty_kill_pct_l5": "home_penalty_kill_pct_l5",
                  "powerplays_l5": "home_powerplay_opps_l5",
                  "penalty_kills_l5": "home_pk_opps_l5",
                  "faceoffwin_pct_l5": "home_faceoffwin_pct_l5",
                  "pims_l5": "home_pims_l5",
                  "hits_l5": "home_hits_l5",
                  "blockedshots_l5": "home_blockedshots_l5",
                  "giveaways_l5": "home_giveaways_l5",
                  "takeaways_l5": "home_takeaways_l5"
}).drop(columns=["team_abbrev"])

df = df.merge(
    combined_stats[[
        "date", "season", "team_abbrev",
        "gf_per_game_l5", "ga_per_game_l5", "sog_per_game_l5",
        "wins_l5", "win_pct_l5",
        "powerplay_pct_l5", "penalty_kill_pct_l5",
        "powerplays_l5", "penalty_kills_l5",
        "faceoffwin_pct_l5", "pims_l5", "hits_l5",
        "blockedshots_l5", "giveaways_l5", "takeaways_l5"
    ]],
    left_on=["away_team_abbrev", "date", "season"],
    right_on=["team_abbrev", "date", "season"],
    how="left"
).rename(columns={
    "gf_per_game_l5": "away_gf_per_game_l5",
    "ga_per_game_l5": "away_ga_per_game_l5",
    "sog_per_game_l5": "away_sog_per_game_l5",
    "wins_l5": "away_wins_l5",
    "win_pct_l5": "away_win_pct_l5",
    "powerplay_pct_l5": "away_powerplay_pct_l5",
    "penalty_kill_pct_l5": "away_penalty_kill_pct_l5",
    "powerplays_l5": "away_powerplay_opps_l5",
    "penalty_kills_l5": "away_pk_opps_l5",
    "faceoffwin_pct_l5": "away_faceoffwin_pct_l5",
    "pims_l5": "away_pims_l5",
    "hits_l5": "away_hits_l5",
    "blockedshots_l5": "away_blockedshots_l5",
    "giveaways_l5": "away_giveaways_l5",
    "takeaways_l5": "away_takeaways_l5"
}).drop(columns=["team_abbrev"])

# Step 4. Save updated dataframe back to CSV
cols_to_round = [c for c in df.columns if c.endswith(("_l5"))]
df[cols_to_round] = df[cols_to_round].round(3)

df.to_csv(CSV_FILE, index=False)

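The key detail in the rolling features is the `.shift(1)`: each game's `_l5` value must only use games played *before* it, otherwise the feature leaks the result it is meant to predict. As a rough plain-Python equivalent of `rolling(5, min_periods=1).mean().shift(1)` for a single team and season (the function name and sample numbers are made up for illustration):

```python
from typing import Optional


def shifted_rolling_mean(values: list[float], window: int = 5) -> list[Optional[float]]:
    """Mean of up to `window` previous values, excluding the current one."""
    out: list[Optional[float]] = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # games strictly before game i
        out.append(sum(prior) / len(prior) if prior else None)
    return out


goals_for = [1, 2, 3, 4, 5, 6, 7]  # one team's goals, in date order
print(shifted_rolling_mean(goals_for))
# [None, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
```

The first game has no history, so its value is missing (`NaN` in pandas, `None` here). Since the notebook groups by both `team_abbrev` and `season`, that happens for every team's first game of every season.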

Now let's get the **diff** and **percentage** stats for the last 5 games using the average **gf, ga,** and **sog** we just calculated.

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

# Step 1: compute HOME − AWAY deltas directly
df["home_goal_diff_l5"] = (
    df["home_gf_per_game_l5"] - df["away_gf_per_game_l5"]
)

df["home_ga_diff_l5"] = (
    df["home_ga_per_game_l5"] - df["away_ga_per_game_l5"]
)

df["home_shot_diff_l5"] = (
    df["home_sog_per_game_l5"] - df["away_sog_per_game_l5"]
)

# Step 2: round only l5 features
cols_to_round = [
    "home_gf_per_game_l5",
    "away_gf_per_game_l5",
    "home_ga_per_game_l5",
    "away_ga_per_game_l5",
    "home_sog_per_game_l5",
    "away_sog_per_game_l5",
    "home_goal_diff_l5",
    "home_ga_diff_l5",
    "home_shot_diff_l5",
]

df[cols_to_round] = df[cols_to_round].round(3)

# Step 3: save
df.to_csv(CSV_FILE, index=False)



## Step 3. Fetch Team save% and Goalie Performance Data
Both are fetched in the same step because they come from the same endpoint.

The **combined save percentage (SV%)** for both goalies is calculated as:

$$
SV\% = \frac{\text{Saves}_{\text{goalie1}} + \text{Saves}_{\text{goalie2}}}
{\text{Shots}_{\text{goalie1}} + \text{Shots}_{\text{goalie2}}}
$$

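To make the formula concrete, here is a small worked example with made-up numbers. It assumes each goalie entry carries integer `saves` and `shotsAgainst` fields, which is how the boxscore data is read later in this step. The function name is hypothetical:

```python
def combined_save_pct(goalies: list[dict]) -> float:
    """Total saves divided by total shots faced, across all goalies who played."""
    saves = sum(g.get("saves", 0) for g in goalies)
    shots = sum(g.get("shotsAgainst", 0) for g in goalies)
    return round(saves / shots, 3) if shots else 0.0


goalies = [
    {"saves": 20, "shotsAgainst": 22},  # starter, replaced mid-game
    {"saves": 8, "shotsAgainst": 10},   # replacement
]
print(combined_save_pct(goalies))  # 0.875
```

That is (20 + 8) / (22 + 10) = 28 / 32 = 0.875. Averaging the two individual percentages instead would give roughly 0.855, which over-weights the goalie who faced fewer shots.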

In [None]:
# Same base URL and output paths as before, restated so this step can run on its own
API_BASE_URL = "https://api-web.nhle.com"
OUTPUT_DIR = "generated/data/"
CSV_FILE = f"{OUTPUT_DIR}/nhl_games_data.csv"


We also need to extract the starting goalie's performance, which the next section uses for season data and for averages over the last 5 games.

##### This data will also provide valuable details about the team's defense

> #### 7. Starting Goalie Statistics  
> 48. `home_goalie_save_pct` - Home starting goalie save %
> 49. `away_goalie_save_pct` - Away starting goalie save %
> 52. `home_goalie_ga` - Goals allowed by the home starting goalie in the game  
> 53. `away_goalie_ga` - Goals allowed by the away starting goalie in the game  
> 58. `home_goalie_days_rest` - Days since home goalie's last start  
> 59. `away_goalie_days_rest` - Days since away goalie's last start  
> 60. `home_goalie_evenStrengthShotsAgainst` - Even-strength shots faced by the home starting goalie  
> 61. `away_goalie_evenStrengthShotsAgainst` - Even-strength shots faced by the away starting goalie  
> 62. `home_goalie_powerPlayShotsAgainst` - Power-play shots faced by the home starting goalie  
> 63. `away_goalie_powerPlayShotsAgainst` - Power-play shots faced by the away starting goalie  
> 64. `home_goalie_shorthandedShotsAgainst` - Shorthanded shots faced by the home starting goalie  
> 65. `away_goalie_shorthandedShotsAgainst` - Shorthanded shots faced by the away starting goalie  
> 66. `home_goalie_evenStrengthGoalsAgainst` - Goals allowed at even strength by the home starting goalie  
> 67. `away_goalie_evenStrengthGoalsAgainst` - Goals allowed at even strength by the away starting goalie  
> 68. `home_goalie_powerPlayGoalsAgainst` - Power-play goals allowed by the home starting goalie  
> 69. `away_goalie_powerPlayGoalsAgainst` - Power-play goals allowed by the away starting goalie  
> 70. `home_goalie_goalsAgainst` - Total goals against for the home starting goalie (game)  
> 71. `home_goalie_saves` - Total saves recorded by the home starting goalie (game)  
> 72. `away_goalie_saves` - Total saves recorded by the away starting goalie (game)

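Most of these fields are read straight from the boxscore, but `home_goalie_days_rest` and `away_goalie_days_rest` have to be derived from the sequence of starts. One possible way to compute them is sketched below, assuming a date-sorted list of `(game_date, goalie_name)` starts. The function name, the input shape, and the sample data are all hypothetical:

```python
from datetime import date
from typing import Optional


def goalie_days_rest(starts: list[tuple[date, str]]) -> list[Optional[int]]:
    """For each (game_date, goalie) start, days since that goalie's previous start."""
    last_start: dict[str, date] = {}
    rest: list[Optional[int]] = []
    for game_date, goalie in starts:  # assumed to be sorted by date
        previous = last_start.get(goalie)
        rest.append((game_date - previous).days if previous else None)
        last_start[goalie] = game_date
    return rest


starts = [
    (date(2024, 1, 1), "Goalie A"),
    (date(2024, 1, 2), "Goalie B"),
    (date(2024, 1, 4), "Goalie A"),
    (date(2024, 1, 5), "Goalie A"),
]
print(goalie_days_rest(starts))  # [None, None, 3, 1]
```

A goalie's first recorded start has no previous start, so it stays `None` here and would need a fill value before modelling.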
In [None]:
# --- Setup requests session ---
# session = requests.Session()
# retries = Retry(
#     total=5,                  # retry up to 5 times
#     backoff_factor=0.5,         # wait 0.5s, 1s, 2s, etc. between retries
#     status_forcelist=[500, 502, 503, 504, 429],  # retry on server or rate-limit errors
#     allowed_methods=["GET"]   # only retry GET requests
# )
# session.mount("https://", HTTPAdapter(max_retries=retries))

# Request endpoint for boxscore data
async def fetch_boxscores(session, gid):
    url = f"{API_BASE_URL}/v1/gamecenter/{gid}/boxscore"
    return await fetch_json(session, url)

def extract_name(obj):
    if not obj:
        return ""
    if isinstance(obj, dict):
        return obj.get("default", "") if "default" in obj else ""
    return str(obj)

def calc_team_save_pct(saves, shots_against):
    if shots_against == 0:
        return 0.0
    return round((saves / shots_against), 3)

def extract_fractional_stat(stat_str):
    if not stat_str:
        return 0.0
    try:
        numerator, denominator = map(int, stat_str.split("/"))
        if denominator == 0:
            return 0.0
        return round(numerator / denominator, 3)
    except ValueError:
        return 0.0

def get_starter_goalie(goalies):
    if not goalies:
        return "", None

    starter = next((g for g in goalies if g.get("starter")), goalies[0])
    return extract_name(starter.get("name", {})), starter
    
def extract_all_goalie_stats(boxscore_data):

    date_str = boxscore_data.get("gameDate", "")
    season_year = boxscore_data.get("season", 0)

    home = boxscore_data.get("homeTeam", {})
    away = boxscore_data.get("awayTeam", {})
    home_players = boxscore_data.get("playerByGameStats", {}).get("homeTeam", {}).get("goalies", [])
    away_players = boxscore_data.get("playerByGameStats", {}).get("awayTeam", {}).get("goalies", [])

    home_starter_goalie, home_starter_obj = get_starter_goalie(home_players)
    away_starter_goalie, away_starter_obj = get_starter_goalie(away_players)


    # Use both goalies in case there was a change: each goalie's saves and shots count toward the team's overall save percentage
    # Sometimes the API returns only one goalie (or none), so guard the indexing
    if len(home_players) < 2:
        home_save_pct = home_players[0].get("savePctg", 0.0) if home_players else 0.0
    else:
        home_save_pct = calc_team_save_pct(
            home_players[0].get("saves", 0) + home_players[1].get("saves", 0),
            home_players[0].get("shotsAgainst", 0) + home_players[1].get("shotsAgainst", 0)
        )

    if len(away_players) < 2:
        away_save_pct = away_players[0].get("savePctg", 0.0) if away_players else 0.0
    else:
        away_save_pct = calc_team_save_pct(
            away_players[0].get("saves", 0) + away_players[1].get("saves", 0),
            away_players[0].get("shotsAgainst", 0) + away_players[1].get("shotsAgainst", 0)
        )

    # Keep the starters already picked by get_starter_goalie (first goalie flagged "starter", else the first listed);
    # fall back to an empty dict so the .get() calls below are safe when no goalies are returned
    home_starter_obj = home_starter_obj or {}
    away_starter_obj = away_starter_obj or {}
    
    home_goalie_save_pct = home_starter_obj.get("savePctg", 0.0)
    home_goalie_ga = home_starter_obj.get("goalsAgainst", 0)
    home_goalie_saves = home_starter_obj.get("saves", 0)
    home_goalie_evenStrengthShotsAgainst = extract_fractional_stat(home_starter_obj.get("evenStrengthShotsAgainst", "0/0"))
    home_goalie_powerPlayShotsAgainst = extract_fractional_stat(home_starter_obj.get("powerPlayShotsAgainst", "0/0"))
    home_goalie_shorthandedShotsAgainst = extract_fractional_stat(home_starter_obj.get("shorthandedShotsAgainst", "0/0"))
    home_goalie_evenStrengthGoalsAgainst = home_starter_obj.get("evenStrengthGoalsAgainst", 0)
    home_goalie_powerPlayGoalsAgainst = home_starter_obj.get("powerPlayGoalsAgainst", 0)
    away_goalie_save_pct = away_starter_obj.get("savePctg", 0.0)
    away_goalie_ga = away_starter_obj.get("goalsAgainst", 0)
    away_goalie_saves = away_starter_obj.get("saves", 0)
    away_goalie_evenStrengthShotsAgainst = extract_fractional_stat(away_starter_obj.get("evenStrengthShotsAgainst", "0/0"))
    away_goalie_powerPlayShotsAgainst = extract_fractional_stat(away_starter_obj.get("powerPlayShotsAgainst", "0/0"))
    away_goalie_shorthandedShotsAgainst = extract_fractional_stat(away_starter_obj.get("shorthandedShotsAgainst", "0/0"))
    away_goalie_evenStrengthGoalsAgainst = away_starter_obj.get("evenStrengthGoalsAgainst", 0)
    away_goalie_powerPlayGoalsAgainst = away_starter_obj.get("powerPlayGoalsAgainst", 0)

    home_abbrev = home.get("abbrev", "")
    away_abbrev = away.get("abbrev", "")

    # Treat Arizona (ARI) and Utah (UTA) as one team by mapping ARI to UTA
    if home_abbrev == "ARI":
        home_abbrev = "UTA"
    if away_abbrev == "ARI":
        away_abbrev = "UTA"

    row = {
        "game_id": boxscore_data.get("id"),
        "date": date_str,
        "season": season_year,
        "home_team_abbrev": home_abbrev,
        "away_team_abbrev": away_abbrev,
        "home_goalie_starter": home_starter_goalie,
        "away_goalie_starter": away_starter_goalie,
        "home_save_pct": home_save_pct,
        "away_save_pct": away_save_pct,
        "home_goalie_save_pct": home_goalie_save_pct,
        "away_goalie_save_pct": away_goalie_save_pct,
        "home_goalie_ga": home_goalie_ga,
        "away_goalie_ga": away_goalie_ga,
        "home_goalie_saves": home_goalie_saves,
        "away_goalie_saves": away_goalie_saves,
        "home_goalie_evenStrengthShotsAgainst": home_goalie_evenStrengthShotsAgainst,
        "away_goalie_evenStrengthShotsAgainst": away_goalie_evenStrengthShotsAgainst,
        "home_goalie_powerPlayShotsAgainst": home_goalie_powerPlayShotsAgainst,
        "away_goalie_powerPlayShotsAgainst": away_goalie_powerPlayShotsAgainst,
        "home_goalie_shorthandedShotsAgainst": home_goalie_shorthandedShotsAgainst,
        "away_goalie_shorthandedShotsAgainst": away_goalie_shorthandedShotsAgainst,
        "home_goalie_evenStrengthGoalsAgainst": home_goalie_evenStrengthGoalsAgainst,
        "away_goalie_evenStrengthGoalsAgainst": away_goalie_evenStrengthGoalsAgainst,
        "home_goalie_powerPlayGoalsAgainst": home_goalie_powerPlayGoalsAgainst,
        "away_goalie_powerPlayGoalsAgainst": away_goalie_powerPlayGoalsAgainst,
    }
    return row
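
The team-level save percentage above pools both goalies' saves and shots before dividing, rather than averaging their individual percentages. A minimal sketch of why that matters, assuming each goalie entry carries `saves` and `shotsAgainst` as in the boxscore fields used above. `pooled_save_pct` is a hypothetical helper for illustration, not the notebook's `calc_team_save_pct`, and the numbers are made up:

```python
def pooled_save_pct(goalies):
    """Return total saves / total shots against across all goalies (0.0 if no shots)."""
    saves = sum(g.get("saves", 0) for g in goalies)
    shots = sum(g.get("shotsAgainst", 0) for g in goalies)
    return round(saves / shots, 3) if shots else 0.0

# Hypothetical two-goalie game: the starter is pulled after facing 10 shots.
goalies = [
    {"saves": 6, "shotsAgainst": 10},
    {"saves": 19, "shotsAgainst": 20},
]
pooled = pooled_save_pct(goalies)   # weighted by shots faced
naive = (6 / 10 + 19 / 20) / 2      # unweighted mean of the two percentages
```

The pooled value weights each goalie by shots faced, so a short relief appearance does not count as much as a full game.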



## Step 3.1 - Find the L5 (last 5 games) average for all the stats in the previous cell

In [None]:
async def fetch_all_goalie_data(game_ids):
    results = []

    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        for i in range(0, len(game_ids), MAX_CONCURRENT_REQUESTS):
            batch = game_ids[i:i + MAX_CONCURRENT_REQUESTS]

            tasks = [
                fetch_boxscores(session, gid)
                for gid in batch
            ]
            boxscores = await asyncio.gather(*tasks)

            for boxscore in boxscores:
                if boxscore is None:
                    continue
                row = extract_all_goalie_stats(boxscore)
                results.append(row)

            await asyncio.sleep(0.1)  # rate-limit safety

    return results
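
The batching pattern in `fetch_all_goalie_data` can be tried without touching the network. The sketch below is an illustration under assumptions: `fake_fetch` and `fetch_in_batches` are hypothetical names, and the "request" is simulated with `asyncio.sleep`. It shows that `asyncio.gather` keeps the input order within a batch and that failed (`None`) responses are skipped:

```python
import asyncio

async def fake_fetch(game_id):
    """Stand-in for a network call; returns None for one ID to mimic a failed request."""
    await asyncio.sleep(0)
    return None if game_id == 3 else {"id": game_id}

async def fetch_in_batches(game_ids, batch_size=2, pause=0.0):
    """Fetch IDs in fixed-size batches, dropping failed (None) responses."""
    results = []
    for start in range(0, len(game_ids), batch_size):
        batch = game_ids[start:start + batch_size]
        # gather() runs the batch concurrently and returns results in input order
        responses = await asyncio.gather(*(fake_fetch(gid) for gid in batch))
        results.extend(r for r in responses if r is not None)
        await asyncio.sleep(pause)  # pause between batches
    return results

fetched = asyncio.run(fetch_in_batches([1, 2, 3, 4, 5]))
```

Inside a notebook, where an event loop is already running, `await fetch_in_batches(...)` would replace the `asyncio.run(...)` call.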


In [None]:
game_ids = df["game_id"].tolist()
goalie_data_rows = await fetch_all_goalie_data(game_ids)

goalie_df = pd.DataFrame(goalie_data_rows)



In [None]:
HOME_RENAME = {
    "home_goalie_starter": "goalie",
    "home_goalie_save_pct": "save_pct",
    "home_goalie_ga": "ga",
    "home_goalie_saves": "saves",
    "home_goalie_evenStrengthShotsAgainst": "ev_sa",
    "home_goalie_powerPlayShotsAgainst": "pp_sa",
    "home_goalie_shorthandedShotsAgainst": "sh_sa",
    "home_goalie_evenStrengthGoalsAgainst": "ev_ga",
    "home_goalie_powerPlayGoalsAgainst": "pp_ga",
}

AWAY_RENAME = {
    "away_goalie_starter": "goalie",
    "away_goalie_save_pct": "save_pct",
    "away_goalie_ga": "ga",
    "away_goalie_saves": "saves",
    "away_goalie_evenStrengthShotsAgainst": "ev_sa",
    "away_goalie_powerPlayShotsAgainst": "pp_sa",
    "away_goalie_shorthandedShotsAgainst": "sh_sa",
    "away_goalie_evenStrengthGoalsAgainst": "ev_ga",
    "away_goalie_powerPlayGoalsAgainst": "pp_ga",
}

HOME_TEAM_SAVE_STATS = "home_save_pct"
AWAY_TEAM_SAVE_STATS = "away_save_pct"



In [None]:
print(goalie_df.columns.tolist())

print("rows:", len(goalie_data_rows))
print("sample:", goalie_data_rows[:2])


['game_id', 'date', 'season', 'home_team_abbrev', 'away_team_abbrev', 'home_goalie_starter', 'away_goalie_starter', 'home_save_pct', 'away_save_pct', 'home_goalie_save_pct', 'away_goalie_save_pct', 'home_goalie_ga', 'away_goalie_ga', 'home_goalie_saves', 'away_goalie_saves', 'home_goalie_evenStrengthShotsAgainst', 'away_goalie_evenStrengthShotsAgainst', 'home_goalie_powerPlayShotsAgainst', 'away_goalie_powerPlayShotsAgainst', 'home_goalie_shorthandedShotsAgainst', 'away_goalie_shorthandedShotsAgainst', 'home_goalie_evenStrengthGoalsAgainst', 'away_goalie_evenStrengthGoalsAgainst', 'home_goalie_powerPlayGoalsAgainst', 'away_goalie_powerPlayGoalsAgainst']
rows: 3478
sample: [{'game_id': 2023020001, 'date': '2023-10-10', 'season': 20232024, 'home_team_abbrev': 'TBL', 'away_team_abbrev': 'NSH', 'home_goalie_starter': 'J. Johansson', 'away_goalie_starter': 'J. Saros', 'home_save_pct': 0.903, 'away_save_pct': 0.879, 'home_goalie_save_pct': 0.903226, 'away_goalie_save_pct': 0.878788, 'home_go


In [None]:
goalie_df['date'] = pd.to_datetime(goalie_df['date'])
goalie_df = goalie_df.sort_values("date").reset_index(drop=True)
ROLLING_N = 5

home_goalies = goalie_df[
    ["game_id", "date", "season", *HOME_RENAME.keys()]
].rename(columns=HOME_RENAME)

away_goalies = goalie_df[
    ["game_id", "date", "season", *AWAY_RENAME.keys()]
].rename(columns=AWAY_RENAME)

goalie_long = (
    pd.concat([home_goalies, away_goalies], ignore_index=True)
    .sort_values(["goalie", "season", "date"])
    .reset_index(drop=True)
)

GOALIE_STATS = ["save_pct", "ga", "saves", "ev_sa", "pp_sa", "sh_sa", "ev_ga", "pp_ga"]

for stat in GOALIE_STATS:
    goalie_long[f"{stat}_l5"] = (
        goalie_long
        .groupby(["goalie", "season"])[stat]
        .transform(lambda x: x.rolling(ROLLING_N, min_periods=1).mean().shift(1))
    )

goalie_l5 = goalie_long[
    ["game_id", "goalie"] + [f"{s}_l5" for s in GOALIE_STATS]
]

# HOME GOALIE
goalie_df = goalie_df.merge(
    goalie_l5,
    left_on=["game_id", "home_goalie_starter"],
    right_on=["game_id", "goalie"],
    how="left"
).rename(columns={
    f"{s}_l5": f"home_goalie_{s}_l5" for s in GOALIE_STATS
}).drop(columns=["goalie"])

# AWAY GOALIE
goalie_df = goalie_df.merge(
    goalie_l5,
    left_on=["game_id", "away_goalie_starter"],
    right_on=["game_id", "goalie"],
    how="left"
).rename(columns={
    f"{s}_l5": f"away_goalie_{s}_l5" for s in GOALIE_STATS
}).drop(columns=["goalie"])


# TEAM SAVE PCT L5
team_long = pd.concat(
    [
        goalie_df[["game_id", "date", "season", "home_team_abbrev", "home_save_pct"]]
        .rename(columns={"home_team_abbrev": "team", "home_save_pct": "save_pct"}),

        goalie_df[["game_id", "date", "season", "away_team_abbrev", "away_save_pct"]]
        .rename(columns={"away_team_abbrev": "team", "away_save_pct": "save_pct"}),
    ],
    ignore_index=True
).sort_values(["team", "season", "date"]).reset_index(drop=True)

team_long["team_save_pct_l5"] = (
    team_long
    .groupby(["team", "season"])["save_pct"]
    .transform(lambda x: x.rolling(ROLLING_N, min_periods=1).mean().shift(1))
)

# MERGE TEAM STATS BACK
goalie_df = goalie_df.merge(
    team_long[["game_id", "team", "team_save_pct_l5"]],
    left_on=["game_id", "home_team_abbrev"],
    right_on=["game_id", "team"],
    how="left"
).rename(columns={"team_save_pct_l5": "home_team_save_pct_l5"}).drop(columns=["team"])

goalie_df = goalie_df.merge(
    team_long[["game_id", "team", "team_save_pct_l5"]],
    left_on=["game_id", "away_team_abbrev"],
    right_on=["game_id", "team"],
    how="left"
).rename(columns={"team_save_pct_l5": "away_team_save_pct_l5"}).drop(columns=["team"])
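
The L5 features rely on `rolling(...).mean().shift(1)` inside a `groupby`, so each game only sees games played before it. A small sketch of that behaviour on made-up data, using a window of 2 instead of 5 to keep it readable (the goalie labels and values are illustrative only):

```python
import pandas as pd

games = pd.DataFrame({
    "goalie": ["A"] * 4 + ["B"] * 2,
    "save_pct": [0.90, 0.92, 0.88, 0.94, 0.91, 0.89],
})

# Rolling mean over up to 2 games, then shift so each row sees only earlier games.
games["save_pct_l2"] = (
    games.groupby("goalie")["save_pct"]
    .transform(lambda s: s.rolling(2, min_periods=1).mean().shift(1))
)
```

Each goalie's first game gets `NaN` because there is no earlier game to average, and the current game's own value never enters its feature.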


In [None]:
# Identify columns to merge

GOALIE_MERGE_COLS = [
    "game_id",

    # raw goalie + team stats
    "home_goalie_starter",
    "away_goalie_starter",
    "home_save_pct",
    "away_save_pct",
    "home_goalie_save_pct",
    "away_goalie_save_pct",
    "home_goalie_ga",
    "away_goalie_ga",
    "home_goalie_saves",
    "away_goalie_saves",
    "home_goalie_evenStrengthShotsAgainst",
    "away_goalie_evenStrengthShotsAgainst",
    "home_goalie_powerPlayShotsAgainst",
    "away_goalie_powerPlayShotsAgainst",
    "home_goalie_shorthandedShotsAgainst",
    "away_goalie_shorthandedShotsAgainst",
    "home_goalie_evenStrengthGoalsAgainst",
    "away_goalie_evenStrengthGoalsAgainst",
    "home_goalie_powerPlayGoalsAgainst",
    "away_goalie_powerPlayGoalsAgainst",

    # goalie L5
    "home_goalie_save_pct_l5",
    "home_goalie_ga_l5",
    "home_goalie_saves_l5",
    "home_goalie_ev_sa_l5",
    "home_goalie_pp_sa_l5",
    "home_goalie_sh_sa_l5",
    "home_goalie_ev_ga_l5",
    "home_goalie_pp_ga_l5",

    "away_goalie_save_pct_l5",
    "away_goalie_ga_l5",
    "away_goalie_saves_l5",
    "away_goalie_ev_sa_l5",
    "away_goalie_pp_sa_l5",
    "away_goalie_sh_sa_l5",
    "away_goalie_ev_ga_l5",
    "away_goalie_pp_ga_l5",

    # team L5
    "home_team_save_pct_l5",
    "away_team_save_pct_l5",
]

main_df = pd.read_csv(CSV_FILE, parse_dates=["date"])
main_df = main_df.merge(
    goalie_df[GOALIE_MERGE_COLS],
    on="game_id",
    how="left",
    validate="one_to_one"
)

main_df.to_csv(CSV_FILE, index=False)
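
The `validate="one_to_one"` argument makes the merge fail loudly if either side has repeated `game_id` values, instead of silently duplicating rows. A short sketch on made-up frames (the column values are illustrative):

```python
import pandas as pd

main = pd.DataFrame({"game_id": [1, 2, 3], "home_win": [1, 0, 1]})
extra = pd.DataFrame({"game_id": [1, 2, 2], "home_save_pct": [0.91, 0.88, 0.90]})

# game_id 2 appears twice on the right side, so the validated merge raises.
try:
    main.merge(extra, on="game_id", how="left", validate="one_to_one")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True

# After dropping the duplicate key, the same merge keeps one row per game.
clean = extra.drop_duplicates("game_id")
merged = main.merge(clean, on="game_id", how="left", validate="one_to_one")
```

With `how="left"`, games that have no match on the right side stay in the result with `NaN` in the new columns.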



In [None]:
# Sanity check: ensure no duplicate game_ids
assert df["game_id"].is_unique, "duplicate game_id values found"



## Step 4. - Fetch Season-to-Date Statistics
> 32. `home_win_pct_season` - Home team win percentage (full season)
> 33. `away_win_pct_season` - Away team win percentage (full season)
> 34. `home_home_win_pct` - Home team win percentage at home (season)
> 35. `away_away_win_pct` - Away team win percentage on road (season)
> 36. `home_gf_per_game_season` - Home team goals for per game (season)
> 37. `away_gf_per_game_season` - Away team goals for per game (season)
> 38. `home_ga_per_game_season` - Home team goals against per game (season)
> 39. `away_ga_per_game_season` - Away team goals against per game (season)

In [None]:
# New base URL needed for standings data, but reuse OUTPUT_DIR
API_BASE_URL = "https://api-web.nhle.com/v1/standings"
OUTPUT_DIR = "generated/data/"
CSV_FILE = f"{OUTPUT_DIR}/nhl_games_data.csv"


In [None]:
# # --- Setup requests session ---
# session = requests.Session()
# retries = Retry(
#     total=5,                  # retry up to 5 times
#     backoff_factor=0.5,         # wait 0.5s, 1s, 2s, etc. between retries
#     status_forcelist=[500, 502, 503, 504, 429],  # retry on server or rate-limit errors
#     allowed_methods=["GET"]   # only retry GET requests
# )
# session.mount("https://", HTTPAdapter(max_retries=retries))

# Request the standings as of a given date (YYYY-MM-DD)
async def fetch_standings_info(session, date):
    url = f"{API_BASE_URL}/{date}"
    return await fetch_json(session, url)

STANDINGS_FIELDS = [
    "pointPctg",
    "gamesPlayed",
    "goalsForPctg",
    "homeGamesPlayed",
    "homeWins",
    "homeLosses",
    "roadGamesPlayed",
    "roadWins",
    "roadLosses",
    "streakCode",
    "streakCount",
]

def extract_all_teams_playing_on_date(game_date):
    main_df = pd.read_csv(CSV_FILE, parse_dates=["date"])
    teams_home = main_df[main_df["date"] == game_date]["home_team_abbrev"].unique().tolist()
    teams_away = main_df[main_df["date"] == game_date]["away_team_abbrev"].unique().tolist()
    all_teams = set(teams_home + teams_away)
    return all_teams

def extract_all_standing_stats(standings_data, playing_teams):
    rows = {}

    for team in standings_data["standings"]:
        abbrev = team["teamAbbrev"]["default"]

        # Treat Arizona (ARI) and Utah (UTA) as one team by mapping ARI to UTA
        if abbrev == "ARI":
            abbrev = "UTA"

        # Skip teams not playing on this date
        if abbrev not in playing_teams:
            continue

        rows[abbrev] = {
            field: team.get(field)
            for field in STANDINGS_FIELDS
        }

    return rows
        



In [None]:
async def build_standings_dataframe():
    df = pd.read_csv(CSV_FILE, parse_dates=["date"])
    df_subset = df

    unique_dates = df_subset["date"].unique()

    standing_rows = []

    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        for i, date_val in enumerate(unique_dates):
            date_str = date_val.strftime("%Y-%m-%d")

            standings_data = await fetch_standings_info(session, date_str)
            if not standings_data:
                continue

            playing_teams = extract_all_teams_playing_on_date(
                pd.to_datetime(date_val)
            )

            standings_by_team = extract_all_standing_stats(
                standings_data,
                playing_teams
            )

            for _, row in df_subset[df_subset["date"] == date_val].iterrows():
                game_id = row["game_id"]
                home_abbrev = row["home_team_abbrev"]
                away_abbrev = row["away_team_abbrev"]

                home_stats = standings_by_team.get(home_abbrev, {})
                away_stats = standings_by_team.get(away_abbrev, {})

                out_row = {
                    "game_id": game_id,
                    "date": date_str,
                }

                for field in STANDINGS_FIELDS:
                    out_row[f"home_{field}"] = home_stats.get(field)
                    out_row[f"away_{field}"] = away_stats.get(field)

                standing_rows.append(out_row)

            if (i + 1) % 100 == 0:
                print(f"✅ Processed {i + 1}/{len(unique_dates)} dates")

            await asyncio.sleep(0.5)

    return pd.DataFrame(standing_rows)


standing_df = await build_standings_dataframe()

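# Teams missing from a standings response have no value here, so fill before casting
standing_df['home_gamesPlayed'] = standing_df['home_gamesPlayed'].fillna(0).astype(int)
standing_df['away_gamesPlayed'] = standing_df['away_gamesPlayed'].fillna(0).astype(int)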

# Total season win percentages
standing_df["home_win_pct_season"] = (
    standing_df["home_homeWins"] + standing_df["home_roadWins"]
) / (
    standing_df["home_gamesPlayed"]
)

standing_df["away_win_pct_season"] = (
    standing_df["away_homeWins"] + standing_df["away_roadWins"]
) / (
    standing_df["away_gamesPlayed"]
)

standing_df["home_home_win_pct"] = (
    standing_df["home_homeWins"]
    / standing_df["home_homeGamesPlayed"]
)

standing_df["away_away_win_pct"] = (
    standing_df["away_roadWins"]
    / standing_df["away_roadGamesPlayed"]
)

standing_df["home_gf_per_game_season"] = standing_df["home_goalsForPctg"]
standing_df["away_gf_per_game_season"] = standing_df["away_goalsForPctg"]

standing_df["home_pointPctg_season"] = standing_df["home_pointPctg"]
standing_df["away_pointPctg_season"] = standing_df["away_pointPctg"]

standing_df["pointPctg_diff"] = (
    standing_df["home_pointPctg_season"]
    - standing_df["away_pointPctg_season"]
)

# For a winning streak, subtract 1 from the streak count to get the number of consecutive wins before this game
standing_df["home_win_streak"] = standing_df.apply(
    lambda r: (r["home_streakCount"] - 1) if r["home_streakCode"] == "W" else 0,
    axis=1
)

standing_df["away_win_streak"] = standing_df.apply(
    lambda r: (r["away_streakCount"] - 1) if r["away_streakCode"] == "W" else 0,
    axis=1
)

standing_df = standing_df.fillna(0)



✅ Processed 100/471 dates
✅ Processed 200/471 dates
✅ Processed 300/471 dates
✅ Processed 400/471 dates
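
The season ratios above divide by games played, which is zero for a team's first game of the season. A minimal sketch, on made-up standings rows, of one way to keep that division explicit and to express the streak rule without a row-wise `apply`. This is an alternative formulation for illustration, not the notebook's code:

```python
import numpy as np
import pandas as pd

standings = pd.DataFrame({
    "homeWins": [3, 0],
    "roadWins": [2, 0],
    "gamesPlayed": [10, 0],
    "streakCode": ["W", "L"],
    "streakCount": [3, 2],
})

# Replace 0 games played with NaN so the division yields NaN, then fill with 0.
games_played = standings["gamesPlayed"].replace(0, np.nan)
standings["win_pct_season"] = (
    (standings["homeWins"] + standings["roadWins"]) / games_played
).fillna(0)

# Vectorized version of the row-wise streak rule: wins before the current game.
standings["win_streak"] = np.where(
    standings["streakCode"] == "W", standings["streakCount"] - 1, 0
)
```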



In [None]:
SEASON_STATS = [
    "home_win_pct_season",
    "away_win_pct_season",
    "home_home_win_pct",
    "away_away_win_pct",
    "home_gf_per_game_season",
    "away_gf_per_game_season",
    "home_pointPctg_season",
    "away_pointPctg_season",
    "pointPctg_diff",
    "home_win_streak",
    "away_win_streak",
]

df = pd.read_csv(CSV_FILE, parse_dates=["date"])

df = df.merge(
    standing_df[["game_id"] + SEASON_STATS],
    on="game_id",
    how="left",
    validate="one_to_one"
)
df.to_csv(CSV_FILE, index=False)


#### Note: round all numeric columns to max 3 decimal places

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])
df = df.round(3)
df.to_csv(CSV_FILE, index=False)


### Step 5 - Calculate rest days per team and goalie

In [None]:
# RESET MAIN DF 
# This will remove home and away rest days
# This is for testing purposes

# df = df.drop(columns=["home_rest_days", "away_rest_days", "home_rest_days.1", "away_rest_days.1"], errors="ignore")
# df = df.drop(columns=["home_goalie_rest_days", "away_goalie_rest_days", "home_goalie_rest_days.1", "away_goalie_rest_days.1"], errors="ignore")
# df.to_csv(CSV_FILE, index=False)


#### 5.1 Build a team-game timeline

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

home_games = df[["date", "season", "home_team_abbrev"]].rename(
    columns={"home_team_abbrev": "team"}
)

away_games = df[["date", "season", "away_team_abbrev"]].rename(
    columns={"away_team_abbrev": "team"}
)

team_games = pd.concat([home_games, away_games], ignore_index=True)

team_games.head()

Unnamed: 0,date,season,team
0,2023-10-10,20232024,TBL
1,2023-10-10,20232024,PIT
2,2023-10-10,20232024,VGK
3,2023-10-11,20232024,CAR
4,2023-10-11,20232024,TOR



#### 5.2 Sort and compute rest days

In [None]:
team_games = team_games.sort_values(["team", "season", "date"])

team_games["team_rest_days"] = (
    team_games.groupby(["team", "season"])["date"]
    .diff()
    .dt.days
    - 1
)

# Merge back into main dataframe
df = df.merge(
    team_games[["team", "season", "date", "team_rest_days"]],
    left_on=["home_team_abbrev", "season", "date"],
    right_on=["team", "season", "date"],
    how="left"
).rename(columns={"team_rest_days": "home_rest_days"}).drop(columns=["team"])

df = df.merge(
    team_games[["team", "season", "date", "team_rest_days"]],
    left_on=["away_team_abbrev", "season", "date"],
    right_on=["team", "season", "date"],
    how="left"
).rename(columns={"team_rest_days": "away_rest_days"}).drop(columns=["team"])

df.to_csv(CSV_FILE, index=False)

# open and read CSV
df = pd.read_csv(CSV_FILE, parse_dates=["date"])
df.tail()

Unnamed: 0,game_id,date,season,home_team,away_team,home_team_abbrev,away_team_abbrev,home_win,home_gf,away_gf,...,away_away_win_pct,home_gf_per_game_season,away_gf_per_game_season,home_pointPctg_season,away_pointPctg_season,pointPctg_diff,home_win_streak,away_win_streak,home_rest_days,away_rest_days
3473,2025020853,2026-01-29,20252026,Minnesota Wild,Calgary Flames,MIN,CGY,1,4,1,...,0.296,3.255,2.509,0.655,0.453,0.202,1,0,1.0,3.0
3474,2025020854,2026-01-29,20252026,Edmonton Oilers,San Jose Sharks,EDM,SJS,1,4,3,...,0.5,3.455,3.154,0.582,0.558,0.024,2,0,4.0,1.0
3475,2025020855,2026-01-29,20252026,Vancouver Canucks,Anaheim Ducks,VAN,ANA,1,2,0,...,0.433,2.648,3.278,0.38,0.546,-0.167,0,0,1.0,3.0
3476,2025020856,2026-01-29,20252026,Vegas Golden Knights,Dallas Stars,VGK,DAL,0,4,5,...,0.567,3.34,3.352,0.604,0.657,-0.054,0,2,1.0,1.0
3477,2025020857,2026-01-29,20252026,Seattle Kraken,Toronto Maple Leafs,SEA,TOR,1,5,2,...,0.348,2.887,3.259,0.557,0.528,0.029,2,0,1.0,1.0
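
The rest-day calculation is a per-team `diff()` on sorted dates, minus one so that back-to-back games count as 0 rest days. A small sketch with hypothetical teams `AAA` and `BBB` and made-up dates:

```python
import pandas as pd

schedule = pd.DataFrame({
    "team": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "season": [20232024] * 5,
    "date": pd.to_datetime(
        ["2023-10-10", "2023-10-11", "2023-10-14", "2023-10-10", "2023-10-12"]
    ),
}).sort_values(["team", "season", "date"])

# Days since the team's previous game in the same season, minus the game day itself.
schedule["rest_days"] = (
    schedule.groupby(["team", "season"])["date"].diff().dt.days - 1
)
```

Each team's first game of a season has no previous game, so its rest days are `NaN`. That is why the rest-day columns in the output above are floats.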



#### 5.3 Build goalie-game timeline

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

home_goalie_games = df[["date", "season", "home_team_abbrev", "home_goalie_starter"]].rename(
    columns={"home_team_abbrev": "team", 
            "home_goalie_starter": "goalie"}
)

away_goalie_games = df[["date", "season", "away_team_abbrev", "away_goalie_starter"]].rename(
    columns={"away_team_abbrev": "team",
            "away_goalie_starter": "goalie"}
)

goalie_games = pd.concat([home_goalie_games, away_goalie_games], ignore_index=True)
goalie_games.head()

Unnamed: 0,date,season,team,goalie
0,2023-10-10,20232024,TBL,J. Johansson
1,2023-10-10,20232024,PIT,T. Jarry
2,2023-10-10,20232024,VGK,A. Hill
3,2023-10-11,20232024,CAR,F. Andersen
4,2023-10-11,20232024,TOR,I. Samsonov



#### 5.4 Sort and compute rest days for GOALIES

In [None]:
goalie_games = goalie_games.sort_values(["goalie", "team", "season", "date"])

goalie_games["goalie_rest_days"] = (
    goalie_games.groupby(["goalie", "team", "season"])["date"]
    .diff()
    .dt.days
    - 1
)

# Merge back into main dataframe
df = df.merge(
    goalie_games[
        ["goalie", "team", "season", "date", "goalie_rest_days"]
    ],
    left_on=["home_goalie_starter", "home_team_abbrev", "season", "date"],
    right_on=["goalie", "team", "season", "date"],
    how="left"
).rename(columns={"goalie_rest_days": "home_goalie_rest_days"}).drop(columns=["goalie", "team"])

df = df.merge(
    goalie_games[["goalie", "team", "season", "date", "goalie_rest_days"]],
    left_on=["away_goalie_starter", "away_team_abbrev", "season", "date"],
    right_on=["goalie", "team", "season", "date"],
    how="left"
).rename(columns={"goalie_rest_days": "away_goalie_rest_days"}).drop(columns=["goalie", "team"])

df.to_csv(CSV_FILE, index=False)

# open and read CSV
df = pd.read_csv(CSV_FILE, parse_dates=["date"])
df.tail()

Unnamed: 0,game_id,date,season,home_team,away_team,home_team_abbrev,away_team_abbrev,home_win,home_gf,away_gf,...,away_gf_per_game_season,home_pointPctg_season,away_pointPctg_season,pointPctg_diff,home_win_streak,away_win_streak,home_rest_days,away_rest_days,home_goalie_rest_days,away_goalie_rest_days
3473,2025020853,2026-01-29,20252026,Minnesota Wild,Calgary Flames,MIN,CGY,1,4,1,...,2.509,0.655,0.453,0.202,1,0,1.0,3.0,4.0,5.0
3474,2025020854,2026-01-29,20252026,Edmonton Oilers,San Jose Sharks,EDM,SJS,1,4,3,...,3.154,0.582,0.558,0.024,2,0,4.0,1.0,4.0,1.0
3475,2025020855,2026-01-29,20252026,Vancouver Canucks,Anaheim Ducks,VAN,ANA,1,2,0,...,3.278,0.38,0.546,-0.167,0,0,1.0,3.0,11.0,3.0
3476,2025020856,2026-01-29,20252026,Vegas Golden Knights,Dallas Stars,VGK,DAL,0,4,5,...,3.352,0.604,0.657,-0.054,0,2,1.0,1.0,3.0,1.0
3477,2025020857,2026-01-29,20252026,Seattle Kraken,Toronto Maple Leafs,SEA,TOR,1,5,2,...,3.259,0.557,0.528,0.029,2,0,1.0,1.0,3.0,5.0



### Step 6. Head-to-Head data

> 66. `h2h_home_wins_season` - Home team wins vs away team (this season)
> 67. `h2h_away_wins_season` - Away team wins vs home team (this season)
> 68. `h2h_home_gf_avg` - Home team goals per game vs away team (season)
> 69. `h2h_away_gf_avg` - Away team goals per game vs home team (season)

In [None]:
# RESET MAIN DF 
# This will remove home and away head to head stats
# This is for testing purposes

# cols = ["home_h2h_wins", "away_h2h_wins", "home_h2h_wins_x", "home_h2h_wins_y", "away_h2h_wins_x", "away_h2h_wins_y", "home_h2h_gf_avg", "home_h2h_gf", "home_h2h_gf_x", "home_h2h_gf_y", "away_h2h_gf_avg", "away_h2h_gf", "away_h2h_gf_x", "away_h2h_gf_y", "home_h2h_wins_diff", "matchup"]
# df = df.drop(columns=cols, errors="ignore")
# df.to_csv(CSV_FILE, index=False)


#### 6.1 Create matchup key

In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

df["matchup"] = df.apply(
    lambda r: "_".join(sorted([r["home_team_abbrev"], r["away_team_abbrev"]])),
    axis=1
)

df = df.sort_values(["season", "date"]).reset_index(drop=True)
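
The matchup key sorts the two abbreviations before joining them, so the same pair of teams gets the same key regardless of who is at home. A quick sketch of that property, using a list comprehension in place of the row-wise `apply`:

```python
import pandas as pd

games = pd.DataFrame({
    "home_team_abbrev": ["TBL", "NSH"],
    "away_team_abbrev": ["NSH", "TBL"],
})

# Sort each pair alphabetically so home/away order does not change the key.
games["matchup"] = [
    "_".join(sorted(pair))
    for pair in zip(games["home_team_abbrev"], games["away_team_abbrev"])
]
```

Both rows end up with the key `NSH_TBL`, which is what lets the later `groupby` treat them as one head-to-head series.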



#### 6.3 Rolling H2H win differential (home − away)

In [None]:
h2h_long = pd.concat([
    df[[
        "game_id", "date", "season", "matchup",
        "home_team_abbrev", "away_team_abbrev",
        "home_gf", "home_win"
    ]]
    .rename(columns={
        "home_team_abbrev": "team",
        "away_team_abbrev": "opponent",
        "home_gf": "gf",
        "home_win": "win"
    }),

    df[[
        "game_id", "date", "season", "matchup",
        "away_team_abbrev", "home_team_abbrev",
        "away_gf", "home_win"
    ]]
    .assign(win=lambda x: 1 - x["home_win"])
    .rename(columns={
        "away_team_abbrev": "team",
        "home_team_abbrev": "opponent",
        "away_gf": "gf"
    })
    .drop(columns="home_win")
], ignore_index=True)


h2h_long.head()

Unnamed: 0,game_id,date,season,matchup,team,opponent,gf,win
0,2023020001,2023-10-10,20232024,NSH_TBL,TBL,NSH,5,1
1,2023020002,2023-10-10,20232024,CHI_PIT,PIT,CHI,2,0
2,2023020003,2023-10-10,20232024,SEA_VGK,VGK,SEA,4,1
3,2023020004,2023-10-11,20232024,CAR_OTT,CAR,OTT,5,1
4,2023020005,2023-10-11,20232024,MTL_TOR,TOR,MTL,6,1



#### 6.4 Season-only H2H goals-for averages

In [None]:
h2h_long = h2h_long.sort_values(["season", "matchup", "date"])

h2h_long["h2h_wins"] = (
    h2h_long
    .groupby(["season", "matchup", "team"])["win"]
    .transform(lambda s: s.cumsum().shift(1))
    .fillna(0)
)

h2h_long["h2h_gf"] = (
    h2h_long
    .groupby(["season", "matchup", "team"])["gf"]
    .transform(lambda s: s.expanding().mean().shift(1))
    .fillna(0)
).round(3)

h2h_long.head(20)

Unnamed: 0,game_id,date,season,matchup,team,opponent,gf,win,h2h_wins,h2h_gf
80,2023020081,2023-10-22,20232024,ANA_BOS,ANA,BOS,1,0,0.0,0.0
3558,2023020081,2023-10-22,20232024,ANA_BOS,BOS,ANA,3,1,0.0,0.0
99,2023020100,2023-10-26,20232024,ANA_BOS,BOS,ANA,3,0,1.0,3.0
3577,2023020100,2023-10-26,20232024,ANA_BOS,ANA,BOS,4,1,0.0,1.0
736,2023020737,2024-01-23,20232024,ANA_BUF,ANA,BUF,4,1,0.0,0.0
4214,2023020737,2024-01-23,20232024,ANA_BUF,BUF,ANA,2,0,0.0,0.0
865,2023020866,2024-02-19,20232024,ANA_BUF,BUF,ANA,3,0,0.0,2.0
4343,2023020866,2024-02-19,20232024,ANA_BUF,ANA,BUF,4,1,1.0,4.0
33,2023020034,2023-10-15,20232024,ANA_CAR,ANA,CAR,6,1,0.0,0.0
3511,2023020034,2023-10-15,20232024,ANA_CAR,CAR,ANA,3,0,0.0,0.0
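
The two head-to-head features are "everything before this game" aggregates: a cumulative sum of wins and an expanding mean of goals, each shifted by one game. A sketch on a made-up three-game series between hypothetical teams `AAA` and `BBB`:

```python
import pandas as pd

h2h = pd.DataFrame({
    "season": [20232024] * 6,
    "matchup": ["AAA_BBB"] * 6,
    "team": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "gf": [3, 1, 2, 4, 5, 2],
    "win": [1, 0, 0, 1, 1, 0],
})
keys = ["season", "matchup", "team"]

# Wins against this opponent before the current game.
h2h["h2h_wins"] = (
    h2h.groupby(keys)["win"].transform(lambda s: s.cumsum().shift(1)).fillna(0)
)
# Average goals against this opponent before the current game.
h2h["h2h_gf"] = (
    h2h.groupby(keys)["gf"].transform(lambda s: s.expanding().mean().shift(1)).fillna(0)
)
```

One consequence of `fillna(0)` is that a first meeting is indistinguishable from a history of scoreless games. Keeping the `NaN` or adding a games-played count would preserve that distinction.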



In [None]:
home_stats = h2h_long.rename(columns={
    "team": "home_team_abbrev",
    "h2h_wins": "home_h2h_wins",
    "h2h_gf": "home_h2h_gf"
})[[
    "game_id", "home_team_abbrev",
    "home_h2h_wins", "home_h2h_gf"
]]

away_stats = h2h_long.rename(columns={
    "team": "away_team_abbrev",
    "h2h_wins": "away_h2h_wins",
    "h2h_gf": "away_h2h_gf"
})[[
    "game_id", "away_team_abbrev",
    "away_h2h_wins", "away_h2h_gf"
]]

df = df.merge(home_stats, on=["game_id", "home_team_abbrev"], how="left")
df = df.merge(away_stats, on=["game_id", "away_team_abbrev"], how="left")

df["home_h2h_wins_diff"] = df["home_h2h_wins"] - df["away_h2h_wins"]


df.to_csv(CSV_FILE, index=False)


In [None]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

print(df.columns.tolist())


['game_id', 'date', 'season', 'home_team', 'away_team', 'home_team_abbrev', 'away_team_abbrev', 'home_win', 'home_gf', 'away_gf', 'home_ga', 'away_ga', 'home_sog', 'away_sog', 'home_faceoffwin_pct', 'away_faceoffwin_pct', 'home_powerplays', 'away_powerplays', 'home_powerplay_pct', 'away_powerplay_pct', 'home_pk', 'away_pk', 'home_pk_pct', 'away_pk_pct', 'home_pims', 'away_pims', 'home_hits', 'away_hits', 'home_blockedshots', 'away_blockedshots', 'home_takeaways', 'away_takeaways', 'home_giveaways', 'away_giveaways', 'home_gf_per_game_l5', 'home_ga_per_game_l5', 'home_sog_per_game_l5', 'home_wins_l5', 'home_win_pct_l5', 'home_powerplay_pct_l5', 'home_penalty_kill_pct_l5', 'home_powerplay_opps_l5', 'home_pk_opps_l5', 'home_faceoffwin_pct_l5', 'home_pims_l5', 'home_hits_l5', 'home_blockedshots_l5', 'home_giveaways_l5', 'home_takeaways_l5', 'away_gf_per_game_l5', 'away_ga_per_game_l5', 'away_sog_per_game_l5', 'away_wins_l5', 'away_win_pct_l5', 'away_powerplay_pct_l5', 'away_penalty_kill_pc


In [None]:
df.shape

(3478, 126)
