# NHL Data Scraper v3.0.0

Previously, I used Selenium to scrape data from the NHL stats website. This version pulls the data from **https://api-web.nhle.com/** instead.
The API endpoints are documented here:

**https://github.com/Zmalski/NHL-API-Reference**

> ## Features
> 
> #### 1. Basic Game Information
> 1. ~~`game_id` - Unique identifier for the game~~
> 2. ~~`date` - Date of the game~~
> 3. ~~`home_team` - Home team abbreviation~~
> 4. ~~`away_team` - Away team abbreviation~~
> 5. ~~`home_win` - Target variable (1 = home win, 0 = home loss)~~
> 
> #### 2. Team Performance - Rolling Averages (Last 5 Games)
> 6. ~~`home_win_pct_l5` - Home team win percentage (last 5 games)~~
> 7. ~~`away_win_pct_l5` - Away team win percentage (last 5 games)~~
> 8. ~~`home_gf_per_game_l5` - Home team goals for per game (last 5)~~
> 9. ~~`away_gf_per_game_l5` - Away team goals for per game (last 5)~~
> 10. ~~`home_ga_per_game_l5` - Home team goals against per game (last 5)~~
> 11. ~~`away_ga_per_game_l5` - Away team goals against per game (last 5)~~
> 12. ~~`home_goal_diff_l5` - Home team goal differential (last 5)~~
> 13. ~~`away_goal_diff_l5` - Away team goal differential (last 5)~~
> 14. ~~`home_shots_per_game_l5` - Home team shots per game (last 5)~~
> 15. ~~`away_shots_per_game_l5` - Away team shots per game (last 5)~~
> 16. ~~`home_shot_diff_l5` - Home team shot differential (last 5)~~
> 17. ~~`away_shot_diff_l5` - Away team shot differential (last 5)~~
> 18. ~~`home_shooting_pct_l5` - Home team shooting percentage (last 5)~~
> 19. ~~`away_shooting_pct_l5` - Away team shooting percentage (last 5)~~
> 
> #### 3. Special Teams - Rolling Averages (Last 5 Games)
> 20. ~~`home_pp_pct_l5` - Home team power play percentage (last 5)~~
> 21. ~~`away_pp_pct_l5` - Away team power play percentage (last 5)~~
> 22. ~~`home_pk_pct_l5` - Home team penalty kill percentage (last 5)~~
> 23. ~~`away_pk_pct_l5` - Away team penalty kill percentage (last 5)~~
> 24. ~~`home_pp_opportunities_l5` - Home team power play opportunities (last 5)~~
> 25. ~~`away_pp_opportunities_l5` - Away team power play opportunities (last 5)~~
> 26. ~~`home_pk_opportunities_l5` - Home team penalty kill opportunities (last 5)~~
> 27. ~~`away_pk_opportunities_l5` - Away team penalty kill opportunities (last 5)~~
> 
> #### 4. Advanced Metrics - Rolling Averages (Last 5 Games)
> 28. `home_save_pct_l5` - Home team save percentage (last 5)
> 29. `away_save_pct_l5` - Away team save percentage (last 5)
> 30. `home_pdo_l5` - Home team PDO (shooting% + save%) (last 5)
> 31. `away_pdo_l5` - Away team PDO (last 5)
> 32. ~~`home_faceoff_pct_l5` - Home team faceoff win percentage (last 5)~~
> 33. ~~`away_faceoff_pct_l5` - Away team faceoff win percentage (last 5)~~
> 
> #### 5. Season-to-Date Statistics
> 34. `home_win_pct_season` - Home team win percentage (full season)
> 35. `away_win_pct_season` - Away team win percentage (full season)
> 36. `home_home_win_pct` - Home team win percentage at home (season)
> 37. `away_away_win_pct` - Away team win percentage on road (season)
> 38. `home_gf_per_game_season` - Home team goals for per game (season)
> 39. `away_gf_per_game_season` - Away team goals for per game (season)
> 40. `home_ga_per_game_season` - Home team goals against per game (season)
> 41. `away_ga_per_game_season` - Away team goals against per game (season)
> 
> #### 6. Game Context & Situational Features
> 42. `home_days_rest` - Days of rest for home team
> 43. `away_days_rest` - Days of rest for away team
> 44. `home_back_to_back` - Home team playing back-to-back (1=yes, 0=no)
> 45. `away_back_to_back` - Away team playing back-to-back (1=yes, 0=no)
> 46. `division_game` - Teams in same division (1=yes, 0=no)
> 47. `conference_game` - Teams in same conference (1=yes, 0=no)
> 48. `day_of_week` - Day of the week
> 49. `is_weekend` - Weekend game indicator (1=yes, 0=no)
> 
> #### 7. Starting Goalie Statistics
> 50. `home_goalie_save_pct` - Home starting goalie save % (season)
> 51. `away_goalie_save_pct` - Away starting goalie save % (season)
> 52. `home_goalie_save_pct_l5` - Home starting goalie save % (last 5 starts)
> 53. `away_goalie_save_pct_l5` - Away starting goalie save % (last 5 starts)
> 54. `home_goalie_gaa` - Home starting goalie goals against avg (season)
> 55. `away_goalie_gaa` - Away starting goalie goals against avg (season)
> 56. `home_goalie_gaa_l5` - Home starting goalie GAA (last 5 starts)
> 57. `away_goalie_gaa_l5` - Away starting goalie GAA (last 5 starts)
> 58. `home_goalie_wins_l5` - Home starting goalie wins (last 5 starts)
> 59. `away_goalie_wins_l5` - Away starting goalie wins (last 5 starts)
> 60. `home_goalie_days_rest` - Days since home goalie's last start
> 61. `away_goalie_days_rest` - Days since away goalie's last start
> 62. `home_goalie_gsax` - Home goalie goals saved above expected (season)
> 63. `away_goalie_gsax` - Away goalie goals saved above expected (season)
> 
> #### 8. Backup Goalie Statistics (Goalie Depth)
> 64. `home_backup_save_pct` - Home backup goalie save % (season)
> 65. `away_backup_save_pct` - Away backup goalie save % (season)
> 66. `home_backup_gaa` - Home backup goalie GAA (season)
> 67. `away_backup_gaa` - Away backup goalie GAA (season)
> 68. `home_goalie_depth_score` - Average save % of top 2 goalies (home)
> 69. `away_goalie_depth_score` - Average save % of top 2 goalies (away)
> 
> #### 9. Streaks & Momentum
> 70. `home_current_streak` - Current win/loss streak (positive=wins, negative=losses)
> 71. `away_current_streak` - Away team win/loss streak
> 72. `home_ot_record` - Home team overtime/shootout record (season)
> 73. `away_ot_record` - Away team overtime/shootout record (season)
> 
> #### 10. Head-to-Head Features
> 74. `h2h_home_wins_season` - Home team wins vs away team (this season)
> 75. `h2h_away_wins_season` - Away team wins vs home team (this season)
> 76. `h2h_home_gf_avg` - Home team goals per game vs away team (season)
> 77. `h2h_away_gf_avg` - Away team goals per game vs home team (season)
> 
> #### 11. Differential Features (Engineered)
> 78. `win_pct_diff_l5` - home_win_pct_l5 - away_win_pct_l5
> 79. `goal_diff_l5` - home_goal_diff_l5 - away_goal_diff_l5
> 80. `shot_diff_l5` - home_shot_diff_l5 - away_shot_diff_l5
> 81. `goalie_save_pct_diff` - home_goalie_save_pct - away_goalie_save_pct
> 82. `rest_advantage` - home_days_rest - away_days_rest
> 83. `pp_pk_matchup` - home_pp_pct_l5 × away_pk_pct_l5
> 84. `pk_pp_matchup` - home_pk_pct_l5 × away_pp_pct_l5

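The engineered differentials in section 11 are plain column arithmetic once the per-team features exist. Here is a minimal sketch on a made-up two-game frame; the values and the fractional scale of the percentages are assumptions for illustration, not scraped data.

```python
import pandas as pd

# Toy values for two games (invented for illustration only)
features = pd.DataFrame({
    "home_win_pct_l5": [0.6, 0.4],
    "away_win_pct_l5": [0.4, 0.8],
    "home_days_rest": [2, 1],
    "away_days_rest": [1, 3],
    "home_pp_pct_l5": [0.25, 0.20],
    "away_pk_pct_l5": [0.80, 0.90],
})

# Feature 78: home minus away win percentage
features["win_pct_diff_l5"] = features["home_win_pct_l5"] - features["away_win_pct_l5"]
# Feature 82: positive means the home team is better rested
features["rest_advantage"] = features["home_days_rest"] - features["away_days_rest"]
# Feature 83: home power play against the away penalty kill
features["pp_pk_matchup"] = features["home_pp_pct_l5"] * features["away_pk_pct_l5"]

print(features[["win_pct_diff_l5", "rest_advantage", "pp_pk_matchup"]])
```

The remaining differentials (79, 80, 81, 84) follow the same pattern with different input columns.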

## Imports

In [132]:
import os
import time
import pandas as pd
import numpy as np
import requests
import json
from datetime import datetime, timedelta
import csv
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

## Constants

In [133]:
API_BASE_URL = "https://api-web.nhle.com"  # no trailing slash; endpoint paths start with "/"
OUTPUT_DIR = "generated/api_data/"

## Step 1. Fetch Basic Game Info

In [139]:
# --- Setup requests session ---
session = requests.Session()
retries = Retry(
    total=5,                  # retry up to 5 times
    backoff_factor=1,         # wait 1s, 2s, 4s, etc. between retries
    status_forcelist=[500, 502, 503, 504, 429],  # retry on server or rate-limit errors
    allowed_methods=["GET"]   # only retry GET requests
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Request endpoint for game story data
def fetch_game_story(game_id):
    url = f"{API_BASE_URL}/v1/wsc/game-story/{game_id}"
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Failed to fetch game {game_id}: {e}")
        return None

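# Helpers to extract stats
def get_num_powerplays(powerplay_str):
    # Expects "goals/opportunities" (e.g. "1/4"). Always returns a (goals, opps)
    # tuple so callers can unpack it even when the stat is missing or malformed.
    if not powerplay_str:
        return 0, 0
    try:
        goals, opps = map(int, str(powerplay_str).split("/"))
        return goals, opps
    except ValueError:
        return 0, 0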

def calc_num_penalty_kills(powerplays_against, ppg_against):
    return powerplays_against - ppg_against

def calc_penalty_kill_pct(powerplays, ppg_against):
    if powerplays == 0:
        return 0.0
    pk_successes = powerplays - ppg_against
    return round((pk_successes / powerplays) * 100, 2)

def extract_name(obj):
    if not obj:
        return ""
    if isinstance(obj, dict):
        return obj.get("default", "") if "default" in obj else ""
    return str(obj)

def extract_category_stat(stats, category):
    for stat in stats:
        if stat.get("category") == category:
            return stat.get("homeValue", 0), stat.get("awayValue", 0)
    return 0, 0

def extract_all_stats(game_data):
    home = game_data.get("homeTeam", {})
    away = game_data.get("awayTeam", {})
    team_stats = game_data.get("summary", {}).get("teamGameStats", [])

    home_place = extract_name(home.get("placeName"))
    home_name = extract_name(home.get("name"))
    away_place = extract_name(away.get("placeName"))
    away_name = extract_name(away.get("name"))

    home_faceoffwin_pct, away_faceoffwin_pct = extract_category_stat(team_stats, "faceoffWinningPctg")
    home_powerplays_str, away_powerplays_str = extract_category_stat(team_stats, "powerPlay")

    home_pp_goals, home_pp_opps = get_num_powerplays(home_powerplays_str)
    away_pp_goals, away_pp_opps = get_num_powerplays(away_powerplays_str)

    home_penaltykills = calc_num_penalty_kills(away_pp_opps, away_pp_goals)
    away_penaltykills = calc_num_penalty_kills(home_pp_opps, home_pp_goals)
    home_pk_pct = calc_penalty_kill_pct(away_pp_opps, away_pp_goals)
    away_pk_pct = calc_penalty_kill_pct(home_pp_opps, home_pp_goals)
    home_powerplay_pct, away_powerplay_pct = extract_category_stat(team_stats, "powerPlayPctg")
    home_pims, away_pims = extract_category_stat(team_stats, "pim")
    home_hits, away_hits = extract_category_stat(team_stats, "hits")
    home_blockedshots, away_blockedshots = extract_category_stat(team_stats, "blockedShots")
    home_takeaways, away_takeaways = extract_category_stat(team_stats, "takeaways")
    home_giveaways, away_giveaways = extract_category_stat(team_stats, "giveaways")


    row = {
        "game_id": game_data.get("id"),
        "date": game_data.get("gameDate", ""),
        "home_team": f"{home_place} {home_name}".strip(),
        "away_team": f"{away_place} {away_name}".strip(),
        "home_team_abbrev": home.get("abbrev", ""),
        "away_team_abbrev": away.get("abbrev", ""),
        "home_win": 1 if home.get("score", 0) > away.get("score", 0) else 0,
        "home_gf": home.get("score", 0),
        "away_gf": away.get("score", 0),
        "home_ga": away.get("score", 0),
        "away_ga": home.get("score", 0),
        "home_sog": home.get("sog", 0),
        "away_sog": away.get("sog", 0),
        "home_faceoffwin_pct": home_faceoffwin_pct,
        "away_faceoffwin_pct": away_faceoffwin_pct,
        "home_powerplays": home_pp_opps,
        "away_powerplays": away_pp_opps,
        "home_powerplay_pct": home_powerplay_pct,
        "away_powerplay_pct": away_powerplay_pct,
        "home_pk": home_penaltykills,
        "away_pk": away_penaltykills,
        "home_pk_pct": home_pk_pct,
        "away_pk_pct": away_pk_pct,
        "home_pims": home_pims,
        "away_pims": away_pims,
        "home_hits": home_hits,
        "away_hits": away_hits,
        "home_blockedshots": home_blockedshots,
        "away_blockedshots": away_blockedshots,
        "home_takeaways": home_takeaways,
        "away_takeaways": away_takeaways,
        "home_giveaways": home_giveaways,
        "away_giveaways": away_giveaways
    }
    return row
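
The special-teams helpers above treat the `powerPlay` category as a `goals/opportunities` string such as `1/4`; that format is inferred from how the code splits it. Below is a small standalone sketch of that parsing and of the penalty-kill percentage. The names `parse_power_play` and `penalty_kill_pct` are hypothetical and not part of the notebook.

```python
def parse_power_play(value):
    """Return (goals, opportunities) from a 'goals/opportunities' string."""
    try:
        goals, opps = map(int, str(value).split("/"))
    except ValueError:
        # Covers empty strings, missing values, and anything without exactly two integers
        return 0, 0
    return goals, opps


def penalty_kill_pct(opp_opportunities, opp_pp_goals):
    """Share of the opponent's power plays that were killed, as a percentage."""
    if opp_opportunities == 0:
        return 0.0
    kills = opp_opportunities - opp_pp_goals
    return round(kills / opp_opportunities * 100, 2)


home_pp_goals, home_pp_opps = parse_power_play("1/4")
away_pp_goals, away_pp_opps = parse_power_play("0/3")

# The home penalty kill faces the away power play, and vice versa
print(penalty_kill_pct(away_pp_opps, away_pp_goals))
print(penalty_kill_pct(home_pp_opps, home_pp_goals))
```

Returning a two-element tuple on every path matters because the caller unpacks the result into two variables.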



In [140]:
test_num_games = 200
num_games = 1312
season = 2022
prev_season = season - 1

# Make output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)
out_file = os.path.join(OUTPUT_DIR, "games_basic.csv")

fieldnames = [
    "game_id", "date", "home_team", "away_team",
    "home_team_abbrev", "away_team_abbrev", "home_win",
    "home_gf", "away_gf", "home_ga", "away_ga", "home_sog", "away_sog",
    "home_faceoffwin_pct", "away_faceoffwin_pct", "home_powerplays", "away_powerplays",
    "home_powerplay_pct", "away_powerplay_pct", "home_pk", "away_pk", "home_pk_pct", "away_pk_pct",
    "home_pims", "away_pims", "home_hits", "away_hits", "home_blockedshots", "away_blockedshots", 
    "home_takeaways", "away_takeaways", "home_giveaways", "away_giveaways"
]

with open(out_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()

    # Fetch the last 100 games of the previous season (game numbers num_games - 99
    # through num_games); these seed the rolling stats for the first games of the new season
    # NOTE: uncomment this block after first run
    print(f"Fetching last 10 games of {prev_season} season...")
    for game_num in range(num_games - 99, num_games + 1):
        game_key = f"{prev_season}02{str(game_num).zfill(4)}"
        game_data = fetch_game_story(game_key)
        if not game_data:
            continue
        
        row = extract_all_stats(game_data)
        
        writer.writerow(row)

        if game_num % 10 == 0:
            print(f"✅ Processed {game_num}/{num_games} games from previous season...")
        time.sleep(1)

    # Fetch current season games
    # NOTE: change season value above for new season
    #       this block will keep adding to the same CSV with new season data
    print(f"\nFetching {test_num_games} games from {season} season...")
    for i, game_num in enumerate(range(1, test_num_games + 1), start=1):
        game_key = f"{season}02{str(game_num).zfill(4)}"
        game_data = fetch_game_story(game_key)

        if not game_data:
            continue

        row = extract_all_stats(game_data)

        writer.writerow(row)
        # Delay + heartbeat
        if i % 50 == 0:
            print(f"✅ Processed {i}/{test_num_games} games...")
        time.sleep(1)

print("\n Finished fetching games!")

Fetching last 10 games of 2021 season...
✅ Processed 1220/1312 games from previous season...
✅ Processed 1230/1312 games from previous season...
✅ Processed 1240/1312 games from previous season...
✅ Processed 1250/1312 games from previous season...
✅ Processed 1260/1312 games from previous season...
✅ Processed 1270/1312 games from previous season...
✅ Processed 1280/1312 games from previous season...
✅ Processed 1290/1312 games from previous season...
✅ Processed 1300/1312 games from previous season...
✅ Processed 1310/1312 games from previous season...

Fetching 200 games from 2022 season...
✅ Processed 50/200 games...
✅ Processed 100/200 games...
✅ Processed 150/200 games...
✅ Processed 200/200 games...

 Finished fetching games!
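
The game keys used in the loops above are built by concatenating the season's starting year, the two-digit code `02`, and a zero-padded four-digit game number. As far as I can tell from the API reference, `02` denotes regular-season games; treat that as an assumption. A minimal sketch with hypothetical helper names:

```python
def build_game_id(season_start_year, game_number, game_type=2):
    """Compose a 10-digit game ID: YYYY + 2-digit game type + 4-digit game number."""
    return f"{season_start_year}{game_type:02d}{game_number:04d}"


def split_game_id(game_id):
    """Split a game ID back into (season_start_year, game_type, game_number)."""
    text = str(game_id)
    return int(text[:4]), int(text[4:6]), int(text[6:])


print(build_game_id(2022, 1))      # first game of the 2022 season
print(build_game_id(2021, 1312))   # last game fetched from the previous season
print(split_game_id(2022020001))
```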


## Step 2. Fetch Team Performance Data - Rolling Averages

At this point, we have all the features needed to calculate the rolling averages (last 5 games) for the following features:

> 6. `home_win_pct_l5` - Home team win percentage (last 5 games)
> 7. `away_win_pct_l5` - Away team win percentage (last 5 games)
> 8. `home_gf_per_game_l5` - Home team goals for per game (last 5)
> 9. `away_gf_per_game_l5` - Away team goals for per game (last 5)
> 10. `home_ga_per_game_l5` - Home team goals against per game (last 5)
> 11. `away_ga_per_game_l5` - Away team goals against per game (last 5)
> 12. `home_goal_diff_l5` - Home team goal differential (last 5)
> 13. `away_goal_diff_l5` - Away team goal differential (last 5)
> 14. `home_shots_per_game_l5` - Home team shots per game (last 5)
> 15. `away_shots_per_game_l5` - Away team shots per game (last 5)
> 16. `home_shot_diff_l5` - Home team shot differential (last 5)
> 17. `away_shot_diff_l5` - Away team shot differential (last 5)
> 18. `home_shooting_pct_l5` - Home team shooting percentage (last 5)
> 19. `away_shooting_pct_l5` - Away team shooting percentage (last 5)
>
> #### 3. Special Teams - Rolling Averages (Last 5 Games)
> 20. `home_pp_pct_l5` - Home team power play percentage (last 5)
> 21. `away_pp_pct_l5` - Away team power play percentage (last 5)
> 22. `home_pk_pct_l5` - Home team penalty kill percentage (last 5)
> 23. `away_pk_pct_l5` - Away team penalty kill percentage (last 5)
> 24. `home_pp_opportunities_l5` - Home team power play opportunities (last 5)
> 25. `away_pp_opportunities_l5` - Away team power play opportunities (last 5)
> 26. `home_pk_opportunities_l5` - Home team penalty kill opportunities (last 5)
> 27. `away_pk_opportunities_l5` - Away team penalty kill opportunities (last 5)
>
> #### 4. Advanced Metrics - Rolling Averages (Last 5 Games)
> 32. `home_faceoff_pct_l5` - Home team faceoff win percentage (last 5)
> 33. `away_faceoff_pct_l5` - Away team faceoff win percentage (last 5)

> **Note:** the first 5 games of every season will take stats from the last 5 games of the previous season

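To make the "last 5 games" window concrete, here is a minimal sketch on made-up data. It shifts each team's series by one game before rolling, so a game's feature only uses earlier games, and it does the shift inside each team's group so teams never borrow each other's values. The teams, dates, and scores are invented.

```python
import pandas as pd

games = pd.DataFrame({
    "team_abbrev": ["BOS"] * 4 + ["TOR"] * 3,
    "date": pd.to_datetime([
        "2022-04-26", "2022-04-28", "2022-10-12", "2022-10-15",
        "2022-04-29", "2022-10-12", "2022-10-13",
    ]),
    "goals_for": [4, 2, 5, 1, 3, 3, 0],
})

# Chronological order within each team is required before shifting
games = games.sort_values(["team_abbrev", "date"]).reset_index(drop=True)

# shift(1) drops the current game; rolling(5) then averages up to 5 earlier games
games["gf_per_game_l5"] = games.groupby("team_abbrev")["goals_for"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=1).mean()
)

print(games)
```

In this toy frame the first October game for BOS averages the two April games, which is the previous-season carryover described in the note. Each team's very first row stays `NaN` because there is no earlier game to draw from.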
In [141]:
CSV_FILE = os.path.join(OUTPUT_DIR, "games_basic.csv")  # OUTPUT_DIR already ends with "/"

df = pd.read_csv(CSV_FILE)

starting_game_id_2022 = 2022020001
starting_game_id_2023 = 2023020001
starting_game_id_2024 = 2024020001

# Data we need to track for each team
# - total wins in last 5 games
# - total goals for in last 5 games
# - total goals against in last 5 games
# - total shots in last 5 games
# - total shots against in last 5 games


In [142]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

# Step 1. Prepare combined dataframe for all teams
home_stats = df[["date", 
                 "home_team_abbrev",
                 "home_gf", 
                 "home_ga", 
                 "home_sog", 
                 "home_win", 
                 "home_powerplay_pct", 
                 "home_pk_pct", 
                 "home_powerplays", 
                 "home_pk",
                 "home_faceoffwin_pct",
                 "home_pims",
                 "home_hits",
                 "home_blockedshots",
                 "home_giveaways",
                 "home_takeaways"]].rename(columns={
    "home_team_abbrev": "team_abbrev",
    "home_gf": "goals_for",
    "home_ga": "goals_against",
    "home_sog": "shots_on_goal",
    "home_win": "win",
    "home_powerplay_pct": "powerplay_pct",
    "home_pk_pct": "penalty_kill_pct",
    "home_powerplays": "powerplays",
    "home_pk": "penalty_kills",
    "home_faceoffwin_pct": "faceoffwin_pct",
    "home_pims": "pims",
    "home_hits": "hits",
    "home_blockedshots": "blockedshots",
    "home_giveaways": "giveaways",
    "home_takeaways": "takeaways"
})

away_stats = df[["date", 
                 "away_team_abbrev", 
                 "away_gf", 
                 "away_ga", 
                 "away_sog", 
                 "home_win", 
                 "away_powerplay_pct", 
                 "away_pk_pct", 
                 "away_powerplays", 
                 "away_pk",
                 "away_faceoffwin_pct",
                 "away_pims",
                 "away_hits",
                 "away_blockedshots",
                 "away_giveaways",
                 "away_takeaways"]].rename(columns={
    "away_team_abbrev": "team_abbrev",
    "away_gf": "goals_for",
    "away_ga": "goals_against",
    "away_sog": "shots_on_goal",
    "home_win": "win",
    "away_powerplay_pct": "powerplay_pct",
    "away_pk_pct": "penalty_kill_pct",
    "away_powerplays": "powerplays",
    "away_pk": "penalty_kills",
    "away_faceoffwin_pct": "faceoffwin_pct",
    "away_pims": "pims",
    "away_hits": "hits",
    "away_blockedshots": "blockedshots",
    "away_giveaways": "giveaways",
    "away_takeaways": "takeaways"
})

# For away games, win = 1 - home_win
away_stats["win"] = 1 - away_stats["win"]

combined_stats = pd.concat([home_stats, away_stats], ignore_index=True)

# Step 2. Sort and compute rolling stats for last 5 games
combined_stats = combined_stats.sort_values(by=["team_abbrev", "date"])

def rolling_l5(column, how):
    # Shift inside each team's group so the current game is excluded and
    # a team's first row never picks up the previous team's value
    return combined_stats.groupby("team_abbrev")[column].transform(
        lambda s: s.shift(1).rolling(window=5, min_periods=1).agg(how)
    )

combined_stats["gf_per_game_l5"] = rolling_l5("goals_for", "mean")
combined_stats["ga_per_game_l5"] = rolling_l5("goals_against", "mean")
combined_stats["sog_per_game_l5"] = rolling_l5("shots_on_goal", "mean")
combined_stats["wins_l5"] = rolling_l5("win", "sum")
combined_stats["powerplay_pct_l5"] = rolling_l5("powerplay_pct", "mean")
combined_stats["penalty_kill_pct_l5"] = rolling_l5("penalty_kill_pct", "mean")
combined_stats["powerplays_l5"] = rolling_l5("powerplays", "sum")
combined_stats["penalty_kills_l5"] = rolling_l5("penalty_kills", "sum")
combined_stats["faceoffwin_pct_l5"] = rolling_l5("faceoffwin_pct", "mean")
combined_stats["pims_l5"] = rolling_l5("pims", "mean")
combined_stats["hits_l5"] = rolling_l5("hits", "mean")
combined_stats["blockedshots_l5"] = rolling_l5("blockedshots", "mean")
combined_stats["giveaways_l5"] = rolling_l5("giveaways", "mean")
combined_stats["takeaways_l5"] = rolling_l5("takeaways", "mean")

# Win percentage over the games actually in the window
# (dividing wins by 5 understates it when fewer than 5 earlier games exist)
combined_stats["win_pct_l5"] = rolling_l5("win", "mean")

# Step 3. Merge rolling stats back to original dataframe
df = df.merge(
    combined_stats[["date", 
                    "team_abbrev", 
                    "gf_per_game_l5", 
                    "ga_per_game_l5", 
                    "sog_per_game_l5", 
                    "wins_l5", 
                    "win_pct_l5", 
                    "powerplay_pct_l5", 
                    "penalty_kill_pct_l5", 
                    "powerplays_l5", 
                    "penalty_kills_l5",
                    "faceoffwin_pct_l5",
                    "pims_l5",
                    "hits_l5",
                    "blockedshots_l5",
                    "giveaways_l5",
                    "takeaways_l5"]],
    left_on=["home_team_abbrev", "date"],
    right_on=["team_abbrev", "date"],
    how="left"
).rename(columns={"gf_per_game_l5": "home_gf_per_game_l5", 
                  "ga_per_game_l5": "home_ga_per_game_l5", 
                  "sog_per_game_l5": "home_sog_per_game_l5", 
                  "wins_l5": "home_wins_l5", 
                  "win_pct_l5": "home_win_pct_l5",
                  "powerplay_pct_l5": "home_powerplay_pct_l5",
                  "penalty_kill_pct_l5": "home_penalty_kill_pct_l5",
                  "powerplays_l5": "home_powerplay_opps_l5",
                  "penalty_kills_l5": "home_pk_opps_l5",
                  "faceoffwin_pct_l5": "home_faceoffwin_pct_l5",
                  "pims_l5": "home_pims_l5",
                  "hits_l5": "home_hits_l5",
                  "blockedshots_l5": "home_blockedshots_l5",
                  "giveaways_l5": "home_giveaways_l5",
                  "takeaways_l5": "home_takeaways_l5"
}).drop(columns=["team_abbrev"])

df = df.merge(
    combined_stats[[
        "date", "team_abbrev",
        "gf_per_game_l5", "ga_per_game_l5", "sog_per_game_l5",
        "wins_l5", "win_pct_l5",
        "powerplay_pct_l5", "penalty_kill_pct_l5",
        "powerplays_l5", "penalty_kills_l5",
        "faceoffwin_pct_l5", "pims_l5", "hits_l5",
        "blockedshots_l5", "giveaways_l5", "takeaways_l5"
    ]],
    left_on=["away_team_abbrev", "date"],
    right_on=["team_abbrev", "date"],
    how="left"
).rename(columns={
    "gf_per_game_l5": "away_gf_per_game_l5",
    "ga_per_game_l5": "away_ga_per_game_l5",
    "sog_per_game_l5": "away_sog_per_game_l5",
    "wins_l5": "away_wins_l5",
    "win_pct_l5": "away_win_pct_l5",
    "powerplay_pct_l5": "away_powerplay_pct_l5",
    "penalty_kill_pct_l5": "away_penalty_kill_pct_l5",
    "powerplays_l5": "away_powerplay_opps_l5",
    "penalty_kills_l5": "away_pk_opps_l5",
    "faceoffwin_pct_l5": "away_faceoffwin_pct_l5",
    "pims_l5": "away_pims_l5",
    "hits_l5": "away_hits_l5",
    "blockedshots_l5": "away_blockedshots_l5",
    "giveaways_l5": "away_giveaways_l5",
    "takeaways_l5": "away_takeaways_l5"
}).drop(columns=["team_abbrev"])

# Step 4. Save updated dataframe back to CSV

cols_to_round = [c for c in df.columns if c.endswith(("_l5"))]
df[cols_to_round] = df[cols_to_round].round(2)

df.to_csv(CSV_FILE, index=False)
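
The merge step above joins the per-team rolling stats back onto the one-row-per-game frame twice, once for each side. A compact sketch of the same idea on toy data, renaming before merging so no helper column needs dropping afterwards; the `attach` helper and all values are hypothetical.

```python
import pandas as pd

games = pd.DataFrame({
    "date": pd.to_datetime(["2022-10-12", "2022-10-13"]),
    "home_team_abbrev": ["BOS", "TOR"],
    "away_team_abbrev": ["TOR", "BOS"],
})

team_stats = pd.DataFrame({
    "date": pd.to_datetime(["2022-10-12", "2022-10-12", "2022-10-13", "2022-10-13"]),
    "team_abbrev": ["BOS", "TOR", "BOS", "TOR"],
    "gf_per_game_l5": [3.0, 2.5, 3.4, 2.2],
})


def attach(games_df, stats_df, side):
    """Join per-team stats onto the game rows for one side ('home' or 'away')."""
    renamed = stats_df.rename(columns={
        "team_abbrev": f"{side}_team_abbrev",
        "gf_per_game_l5": f"{side}_gf_per_game_l5",
    })
    # validate raises if a (date, team) pair appears more than once in the stats
    return games_df.merge(
        renamed, on=["date", f"{side}_team_abbrev"], how="left", validate="many_to_one"
    )


wide = attach(attach(games, team_stats, "home"), team_stats, "away")
print(wide)
```

The `validate` argument is optional, but it guards against silently duplicating game rows if the stats frame ever contains repeated team and date pairs.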

Now let's get the **diff** and **percentage** stats for the last 5 games using the average **gf**, **ga**, and **sog** we just calculated.
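
For reference, a goal differential is goals for minus goals against, and a shooting percentage is goals divided by shots on goal. A minimal sketch from per-game averages, assuming both averages cover the same window; the frame and its values are made up.

```python
import pandas as pd

team_l5 = pd.DataFrame({
    "team_abbrev": ["BOS", "TOR"],
    "gf_per_game_l5": [3.2, 2.4],
    "ga_per_game_l5": [2.0, 3.0],
    "sog_per_game_l5": [32.0, 30.0],
})

# Goal differential per game over the window: goals for minus goals against
team_l5["goal_diff_l5"] = team_l5["gf_per_game_l5"] - team_l5["ga_per_game_l5"]

# Shooting percentage: the ratio of the two averages equals goals / shots
# over the window, because both are divided by the same number of games
team_l5["shooting_pct_l5"] = (
    team_l5["gf_per_game_l5"] / team_l5["sog_per_game_l5"] * 100
).round(2)

print(team_l5)
```

A window with zero shots would divide by zero here, so real data may need a guard for that case.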

In [143]:
df = pd.read_csv(CSV_FILE, parse_dates=["date"])

# Step 1. Prepare combined dataframe for all teams (one row per team per game)
home_stats = df[["date", "home_team_abbrev", "home_sog", "away_sog", "home_gf_per_game_l5", "home_ga_per_game_l5"]].rename(columns={
    "home_team_abbrev": "team_abbrev",
    "home_sog": "shots_for",
    "away_sog": "shots_against",
    "home_gf_per_game_l5": "gf_per_game_l5",
    "home_ga_per_game_l5": "ga_per_game_l5"
})

away_stats = df[["date", "away_team_abbrev", "away_sog", "home_sog", "away_gf_per_game_l5", "away_ga_per_game_l5"]].rename(columns={
    "away_team_abbrev": "team_abbrev",
    "away_sog": "shots_for",
    "home_sog": "shots_against",
    "away_gf_per_game_l5": "gf_per_game_l5",
    "away_ga_per_game_l5": "ga_per_game_l5"
})

combined_stats = pd.concat([home_stats, away_stats], ignore_index=True)

# Step 2. Sort by team and compute each team's own differentials.
# Series.diff() subtracts the previous row, which after sorting belongs to another team,
# so the differentials are taken column-wise (for minus against) instead.
combined_stats = combined_stats.sort_values(by=["team_abbrev", "date"]).reset_index(drop=True)

combined_stats["gf_per_game_l5_diff"] = combined_stats["gf_per_game_l5"] - combined_stats["ga_per_game_l5"]
# Mirror of the goal differential, kept so the home_ga_diff_l5 / away_ga_diff_l5 columns still exist
combined_stats["ga_per_game_l5_diff"] = combined_stats["ga_per_game_l5"] - combined_stats["gf_per_game_l5"]
# Average per-game shot differential over the previous 5 games (current game excluded)
shot_diff = combined_stats["shots_for"] - combined_stats["shots_against"]
combined_stats["sog_per_game_l5_diff"] = shot_diff.groupby(combined_stats["team_abbrev"]).transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=1).mean()
)

# Step 3. Merge diff stats back to original dataframe
df = df.merge(
    combined_stats[["date", "team_abbrev", "gf_per_game_l5_diff", "ga_per_game_l5_diff", "sog_per_game_l5_diff"]],
    left_on=["home_team_abbrev", "date"],
    right_on=["team_abbrev", "date"],
    how="left"
).rename(columns={
    "gf_per_game_l5_diff": "home_goal_diff_l5",
    "ga_per_game_l5_diff": "home_ga_diff_l5",
    "sog_per_game_l5_diff": "home_shot_diff_l5"
}).drop(columns=["team_abbrev"])

df = df.merge(
    combined_stats[["date", "team_abbrev", "gf_per_game_l5_diff", "ga_per_game_l5_diff", "sog_per_game_l5_diff"]],
    left_on=["away_team_abbrev", "date"],
    right_on=["team_abbrev", "date"],
    how="left"
).rename(columns={
    "gf_per_game_l5_diff": "away_goal_diff_l5",
    "ga_per_game_l5_diff": "away_ga_diff_l5",
    "sog_per_game_l5_diff": "away_shot_diff_l5"
}).drop(columns=["team_abbrev"])

# Step 4. Save updated dataframe back to CSV

cols_to_round = [c for c in df.columns if c.endswith(("_l5"))]
df[cols_to_round] = df[cols_to_round].round(2)

df.to_csv(CSV_FILE, index=False)