# NRL Feature Engineering Pipeline

This notebook transforms the raw NRL match data into a feature-rich, model-ready dataset. It follows a structured, multi-step process to engineer features related to team form, strength, and match context, while carefully preventing data leakage.

**Objective:** To create a comprehensive `nrl_matches_final_model_ready.csv` file that will serve as the input for our machine learning models.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## Step 1: Foundational Data Cleaning & Setup

This is the most critical first step. We perform essential cleaning and setup tasks:
- **Load Data**: Ingest the raw CSV file.
- **Date Conversion**: Convert the 'Date' column to a proper datetime format.
- **Chronological Sort**: Sort the entire dataset by date. **This is crucial for all time-series feature engineering** to prevent looking into the future.
- **Create Target Variable**: Engineer the `Home_Win` binary target variable.
- **Create Margin**: Calculate the `Home_Margin` for performance analysis.

In [10]:
def load_and_clean_nrl_data(filepath='../data/nrlBaselineData.csv'):
    """Load, clean, and sort the foundational NRL dataset."""
    print("--- Step 1: Loading & Cleaning Data ---")
    try:
        df = pd.read_csv(filepath)
        print(f"Loaded dataset: {filepath}")
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        return None
    
    # Date Conversion and Sorting
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    df = df.sort_values(by='Date').reset_index(drop=True)
    
    # Create Target and Margin
    df['Home_Win'] = (df['Home Score'] > df['Away Score']).astype(int)
    home_win_rate = df['Home_Win'].mean()
    df['Home_Margin'] = df['Home Score'] - df['Away Score']
    df['match_id'] = df.index

    print(f"Data cleaned and sorted. Shape: {df.shape}")
    print(f"Date range: {df['Date'].min().strftime('%Y-%m-%d')} to {df['Date'].max().strftime('%Y-%m-%d')}")
    print(f"Total matches: {len(df)}")
    print(f"Unique teams: {len(set(df['Home Team'].unique()) | set(df['Away Team'].unique()))}")
    print(f"Missing values per column:")
    for col in df.columns:
        missing = df[col].isnull().sum()
        if missing > 0:
            print(f"  {col}: {missing} ({missing/len(df)*100:.1f}%)")
    
    print(f"\nHome team advantages:")
    print(f"  Win rate: {home_win_rate:.3f}")
    print(f"  Average margin: {df['Home_Margin'].mean():.2f}")
    
    return df


def preview_data(df, n_rows=5):
    """
    Preview the cleaned dataset
    """
    print(f"\n=== DATA PREVIEW (First {n_rows} rows) ===")
    key_columns = ['Date', 'Home Team', 'Away Team', 'Home Score', 'Away Score', 
                   'Home_Win', 'Home_Margin', 'match_id', 'temperature_category']
    print(df[key_columns].head(n_rows).to_string(index=False))
    
    print(f"\n=== DATA TYPES ===")
    print(df[key_columns].dtypes)

# Execute Step 1
df_cleaned = load_and_clean_nrl_data()
if df_cleaned is not None:
    preview_data(df_cleaned)

--- Step 1: Loading & Cleaning Data ---
Loaded dataset: ../data/nrlBaselineData.csv
Data cleaned and sorted. Shape: (3336, 26)
Date range: 2009-03-13 to 2025-06-29
Total matches: 3336
Unique teams: 17
Missing values per column:
  Over Time?: 3210 (96.2%)
  Home Odds: 3 (0.1%)
  Draw Odds: 3 (0.1%)
  Away Odds: 3 (0.1%)
  temperature_c: 1 (0.0%)
  wind_speed_kph: 1 (0.0%)
  precipitation_mm: 1 (0.0%)

Home team advantages:
  Win rate: 0.569
  Average margin: 2.99

=== DATA PREVIEW (First 5 rows) ===
      Date            Home Team         Away Team  Home Score  Away Score  Home_Win  Home_Margin  match_id temperature_category
2009-03-13      Melbourne Storm St George Dragons          17          16         1            1         0                 warm
2009-03-13     Brisbane Broncos North QLD Cowboys          19          18         1            1         1                 warm
2009-03-14      Cronulla Sharks  Penrith Panthers          18          10         1            8         2      

## Step 2: Create Team-Level Stats DataFrame

To calculate rolling statistics for each team, we need to transform the data from a *match-centric* view to a *team-centric* view. We "melt" the DataFrame so that each match is represented by two rows: one for the home team and one for the away team.

This structure makes it trivial to perform `groupby('team_name').rolling(...)` operations in the next step.

In [28]:
def create_team_level_stats(df):
    """Transform match-level data to team-level data."""
    print("\n--- Step 2: Creating Team-Level Stats ---")
    home_df = df[['match_id', 'Date', 'Home Team', 'Home Score', 'Away Score', 'Home_Win','Venue', 'City']].copy()
    home_df.rename(columns={'Home Team': 'team_name', 'Home Score': 'points_for', 'Away Score': 'points_against', 'Home_Win': 'won'}, inplace=True)
    home_df['is_home'] = 1
    home_df['opponent'] = df['Away Team']

    away_df = df[['match_id', 'Date', 'Away Team', 'Home Score', 'Away Score', 'Home_Win','Venue', 'City']].copy()
    away_df.rename(columns={'Away Team': 'team_name', 'Away Score': 'points_for', 'Home Score': 'points_against'}, inplace=True)
    away_df['won'] = 1 - away_df['Home_Win']
    away_df['is_home'] = 0
    away_df['opponent'] = df['Home Team']
    away_df = away_df.drop(columns=['Home_Win'])

    team_stats_df = pd.concat([home_df, away_df], ignore_index=True)
    team_stats_df = team_stats_df.sort_values(['Date', 'team_name']).reset_index(drop=True)
    team_stats_df['margin'] = team_stats_df['points_for'] - team_stats_df['points_against']
    team_stats_df['lost'] = 1 - team_stats_df['won']
    
    print(f"Team-level data created. Shape: {team_stats_df.shape}")
    return team_stats_df

def preview_team_level_stats(match_df, team_stats_df, team_name=None, n_rows=10):
        # Data validation and summary
    print(f" Original matches: {len(match_df)}")
    print(f" Team records created: {len(team_stats_df)} (should be 2x matches)")
    print(f" Unique teams: {team_stats_df['team_name'].nunique()}")
    print(f" Date range: {team_stats_df['Date'].min().strftime('%Y-%m-%d')} to {team_stats_df['Date'].max().strftime('%Y-%m-%d')}")
    
    # Team performance summary
    team_summary = team_stats_df.groupby('team_name').agg({
        'won': ['count', 'sum', 'mean'],
        'points_for': 'mean',
        'points_against': 'mean',
        'margin': 'mean'
    }).round(3)
    
    team_summary.columns = ['Games_Played', 'Wins', 'Win_Rate', 'Avg_Points_For', 'Avg_Points_Against', 'Avg_Margin']
    team_summary = team_summary.sort_values('Win_Rate', ascending=False)
    
    print(f"\n=== TEAM PERFORMANCE SUMMARY ===")
    print("Top 5 teams by win rate:")
    print(team_summary.head().to_string())
    
    print(f"\nBottom 5 teams by win rate:")
    print(team_summary.tail().to_string())
    
    # Home vs Away performance
    home_away_stats = team_stats_df.groupby('is_home').agg({
        'won': 'mean',
        'points_for': 'mean',
        'points_against': 'mean',
        'margin': 'mean'
    }).round(3)
    
    home_away_stats.index = ['Away', 'Home']
    print(f"\n=== HOME vs AWAY ADVANTAGE ===")
    print(home_away_stats.to_string())

    if team_name:
        preview_df = team_stats_df[team_stats_df['team_name'] == team_name].head(n_rows)
        print(f"\n=== TEAM STATS PREVIEW: {team_name} (First {n_rows} games) ===")
    else:
        preview_df = team_stats_df.head(n_rows)
        print(f"\n=== TEAM STATS PREVIEW (First {n_rows} rows) ===")
    
    key_columns = ['Date', 'team_name', 'is_home', 'opponent', 'points_for', 
                   'points_against', 'margin', 'won']
    
    print(preview_df[key_columns].to_string(index=False))
    
    print(f"\n=== TEAM STATS DATA TYPES ===")
    print(team_stats_df[key_columns].dtypes)

# Execute Step 2
team_stats_df = create_team_level_stats(df_cleaned)
if team_stats_df is not None:
    preview_team_level_stats(df_cleaned, team_stats_df)


--- Step 2: Creating Team-Level Stats ---
Team-level data created. Shape: (6672, 12)
 Original matches: 3336
 Team records created: 6672 (should be 2x matches)
 Unique teams: 17
 Date range: 2009-03-13 to 2025-06-29

=== TEAM PERFORMANCE SUMMARY ===
Top 5 teams by win rate:
                        Games_Played  Wins  Win_Rate  Avg_Points_For  Avg_Points_Against  Avg_Margin
team_name                                                                                           
Melbourne Storm                  431   303     0.703          24.608              15.316       9.292
Penrith Panthers                 421   251     0.596          22.176              18.344       3.831
Sydney Roosters                  425   245     0.576          22.984              19.275       3.708
South Sydney Rabbitohs           420   230     0.548          23.129              20.581       2.548
Brisbane Broncos                 416   221     0.531          21.534              21.195       0.339

Bottom 5 teams b

## Step 3: Calculate Rolling "Form" Features

This is where we quantify each team's recent performance, or "form". We use rolling windows to calculate moving averages for key metrics.


**Crucial for preventing data leakage:** We use `.shift(1)` after every rolling calculation. This ensures that the features for a given match are calculated using data from *previous* matches only.

In [29]:
def calculate_rolling_features(team_stats_df):
    """Calculate rolling averages and streaks for each team."""
    print("\n--- Step 3: Calculating Rolling Form Features ---")
    df = team_stats_df.copy().sort_values(['team_name', 'Date'])
    windows = [3, 5, 8]

    for window in windows:
        df[f'rolling_avg_margin_{window}'] = df.groupby('team_name')['margin'].transform(lambda x: x.rolling(window, 1).mean().shift(1))
        df[f'rolling_win_percentage_{window}'] = df.groupby('team_name')['won'].transform(lambda x: x.rolling(window, 1).mean().shift(1))
        df[f'rolling_avg_points_for_{window}'] = df.groupby('team_name')['points_for'].transform(lambda x: x.rolling(window, 1).mean().shift(1))
        df[f'rolling_avg_points_against_{window}'] = df.groupby('team_name')['points_against'].transform(lambda x: x.rolling(window, 1).mean().shift(1))
        
    print(f" Rolling features calculated for windows: {windows}")
    return df

# Execute Step 3
team_stats_form = calculate_rolling_features(team_stats_df)


--- Step 3: Calculating Rolling Form Features ---
 Rolling features calculated for windows: [3, 5, 8]


streaks features

In [30]:
def calculate_streaks(df):
    """
    Calculate winning and losing streaks for each team
    """
    
    def get_current_streak(series):
        """Calculate current win/loss streak from a boolean series"""
        if len(series) == 0:
            return 0
        
        # Shift to prevent data leakage - look at previous games only
        shifted_series = series.shift(1)
        
        # Initialise streaks
        winning_streak = []
        losing_streak = []
        
        for i, won in enumerate(shifted_series):
            if pd.isna(won):  # First game has no history
                winning_streak.append(0)
                losing_streak.append(0)
                continue
                
            # Look backwards to count streak
            current_win_streak = 0
            current_loss_streak = 0
            
            # Count backwards from current position
            for j in range(i-1, -1, -1):
                if pd.isna(shifted_series.iloc[j]):
                    break
                    
                if shifted_series.iloc[j] == 1:  # Win
                    if current_loss_streak > 0:  # End of loss streak
                        break
                    current_win_streak += 1
                else:  # Loss
                    if current_win_streak > 0:  # End of win streak
                        break
                    current_loss_streak += 1
            
            winning_streak.append(current_win_streak)
            losing_streak.append(current_loss_streak)
        
        return pd.Series(winning_streak, index=series.index), pd.Series(losing_streak, index=series.index)
    
    # Apply streak calculation to each team
    streak_data = df.groupby('team_name')['won'].apply(get_current_streak)
    
    # Extract winning and losing streaks
    df['winning_streak'] = 0
    df['losing_streak'] = 0
    
    for team_name, (win_streaks, loss_streaks) in streak_data.items():
        team_mask = df['team_name'] == team_name
        df.loc[team_mask, 'winning_streak'] = win_streaks.values
        df.loc[team_mask, 'losing_streak'] = loss_streaks.values
    
    print("Winning/Losing streaks calculated")
    
    return df

team_stats_streaks = calculate_streaks(team_stats_form)

Winning/Losing streaks calculated


games since win/loss, past 3 games

In [31]:
def calculate_games_since(df):
    # Recent form (last 3 games) - more granular
    df['recent_wins_3'] = (
        df.groupby('team_name')['won']
        .rolling(window=3, min_periods=1)
        .sum()
        .shift(1)
        .reset_index(level=0, drop=True)
    )
    
    # Games since last win/loss (simplified approach to avoid index issues)
    df['games_since_win'] = 0
    df['games_since_loss'] = 0
    
    for team in df['team_name'].unique():
        team_mask = df['team_name'] == team
        team_data = df[team_mask].copy()
        team_data = team_data.sort_values('Date')
        
        games_since_win = []
        games_since_loss = []
        
        for i in range(len(team_data)):
            if i == 0:
                games_since_win.append(0)
                games_since_loss.append(0)
                continue
            
            # Count games since last win
            win_count = 0
            win_found = False
            for j in range(i-1, -1, -1):
                if team_data.iloc[j]['won'] == 1:
                    win_found = True
                    break
                win_count += 1
            games_since_win.append(win_count if win_found else i)
            
            # Count games since last loss
            loss_count = 0
            loss_found = False
            for j in range(i-1, -1, -1):
                if team_data.iloc[j]['won'] == 0:
                    loss_found = True
                    break
                loss_count += 1
            games_since_loss.append(loss_count if loss_found else i)
        
        df.loc[team_mask, 'games_since_win'] = games_since_win
        df.loc[team_mask, 'games_since_loss'] = games_since_loss
    print("Games since last win/lost & 3 game form calculated")
    
    return df

team_stats_games_since = calculate_games_since(team_stats_streaks)

Games since last win/lost & 3 game form calculated


preview engineered features, might delete

In [32]:
# Data validation and summary
rolling_features = [col for col in team_stats_games_since.columns if col.startswith('rolling_')]
streak_features = [col for col in team_stats_games_since.columns if 'streak' in col]
form_features = [col for col in team_stats_games_since.columns if col.startswith(('recent_', 'games_since_'))]

all_new_features = rolling_features + streak_features + form_features

print(f"\n=== FEATURE ENGINEERING SUMMARY ===")
print(f" Rolling features created: {len(rolling_features)}")
print(f" Streak features created: {len(streak_features)}")
print(f" Form features created: {len(form_features)}")
print(f" Total new features: {len(all_new_features)}")

print(f"\nRolling features: {rolling_features}")
print(f"Streak features: {streak_features}")
print(f"Form features: {form_features}")


print(f"\n=== DATA LEAKAGE VALIDATION ===")
# Sort by team and date to get actual first games
df_sorted = team_stats_games_since.sort_values(['team_name', 'Date']).reset_index(drop=True)
first_games = df_sorted.groupby('team_name').first()

# Count null values in rolling features for first games
null_count = first_games[rolling_features].isnull().sum().sum()
total_first_games = len(first_games)
expected_nulls = total_first_games * len(rolling_features)
print(f"First game null values: {null_count}/{expected_nulls}")

# Additional validation: check if any first game has non-null rolling features
non_null_teams = []
for team in first_games.index:
    team_first_game = first_games.loc[team]
    if not team_first_game[rolling_features].isnull().all():
        non_null_teams.append(team)

if len(non_null_teams) == 0:
    print(f"Data leakage prevention:  PASS - All teams have null rolling features in first game")
else:
    print(f"Data leakage prevention:   PARTIAL - {len(non_null_teams)} teams have non-null values")
    print(f"  Teams with issues: {non_null_teams[:3]}...")  # Show first 3


=== FEATURE ENGINEERING SUMMARY ===
 Rolling features created: 12
 Streak features created: 2
 Form features created: 3
 Total new features: 17

Rolling features: ['rolling_avg_margin_3', 'rolling_win_percentage_3', 'rolling_avg_points_for_3', 'rolling_avg_points_against_3', 'rolling_avg_margin_5', 'rolling_win_percentage_5', 'rolling_avg_points_for_5', 'rolling_avg_points_against_5', 'rolling_avg_margin_8', 'rolling_win_percentage_8', 'rolling_avg_points_for_8', 'rolling_avg_points_against_8']
Streak features: ['winning_streak', 'losing_streak']
Form features: ['recent_wins_3', 'games_since_win', 'games_since_loss']

=== DATA LEAKAGE VALIDATION ===
First game null values: 0/204
Data leakage prevention:   PARTIAL - 17 teams have non-null values
  Teams with issues: ['Brisbane Broncos', 'Canberra Raiders', 'Canterbury Bulldogs']...


## Step 4: Engineer Strength & Context Features

Beyond recent form, we need to capture inherent team strength and the context of the match. We engineer three key features:

### 4a. Elo Ratings
A dynamic rating system that measures a team's strength relative to its opponents over time. A win against a strong opponent yields more Elo points than a win against a weak one.

### 4b. Rest Days
Calculates the number of days a team has had to rest since their last match. This is a proxy for fatigue.

### 4c. Travel Distance
Calculates the distance an away team has to travel from their home city to the match venue. This is a proxy for travel fatigue.

In [33]:
def calculate_elo_ratings(team_stats_df, k_factor=20, initial_elo=1500):
    # Initialize Elo ratings for all teams
    teams = team_stats_df['team_name'].unique()
    elo_ratings = {team: initial_elo for team in teams}
    
    print(f" Initialised {len(teams)} teams with Elo rating: {initial_elo}")
    
    # Create copy and sort by date
    df = team_stats_df.copy()
    df = df.sort_values(['Date', 'match_id']).reset_index(drop=True)
    
    # Add columns for pre-match Elo ratings
    df['pre_match_elo'] = 0.0
    
    # Process each match (two rows at a time - home and away)
    processed_matches = set()
    
    for idx, row in df.iterrows():
        match_id = row['match_id']
        
        # Skip if we've already processed this match
        if match_id in processed_matches:
            continue
        
        # Get both teams' records for this match
        match_data = df[df['match_id'] == match_id]
        
        if len(match_data) != 2:
            continue
            
        home_row = match_data[match_data['is_home'] == 1].iloc[0]
        away_row = match_data[match_data['is_home'] == 0].iloc[0]
        
        home_team = home_row['team_name']
        away_team = away_row['team_name']
        
        # Store pre-match Elo ratings
        home_pre_elo = elo_ratings[home_team]
        away_pre_elo = elo_ratings[away_team]
        
        # Update DataFrame with pre-match Elo
        df.loc[df['match_id'] == match_id, 'pre_match_elo'] = df.loc[df['match_id'] == match_id, 'team_name'].map({
            home_team: home_pre_elo,
            away_team: away_pre_elo
        })
        
        # Calculate expected scores using Elo formula
        # Home field advantage: add 100 Elo points to home team
        home_elo_adjusted = home_pre_elo + 100
        away_elo_adjusted = away_pre_elo
        
        expected_home = 1 / (1 + 10**((away_elo_adjusted - home_elo_adjusted) / 400))
        expected_away = 1 - expected_home
        
        # Actual results
        home_won = home_row['won']
        away_won = away_row['won']
        
        # Update Elo ratings
        elo_ratings[home_team] += k_factor * (home_won - expected_home)
        elo_ratings[away_team] += k_factor * (away_won - expected_away)
        
        processed_matches.add(match_id)
    
    print(f" Processed {len(processed_matches)} matches for Elo calculation")
    
    # Add final Elo ratings summary
    final_elos = pd.Series(elo_ratings).sort_values(ascending=False)
    print(f"\n=== ELO RATINGS SUMMARY ===")
    print("Top 5 teams by final Elo rating:")
    print(final_elos.head().to_string())
    print(f"\nBottom 5 teams by final Elo rating:")
    print(final_elos.tail().to_string())
    
    return df

team_stats_elo = calculate_elo_ratings(team_stats_games_since)

 Initialised 17 teams with Elo rating: 1500
 Processed 3336 matches for Elo calculation

=== ELO RATINGS SUMMARY ===
Top 5 teams by final Elo rating:
Melbourne Storm     1690.579852
Penrith Panthers    1690.424673
Canberra Raiders    1595.585283
Sydney Roosters     1586.357830
Cronulla Sharks     1554.568276

Bottom 5 teams by final Elo rating:
South Sydney Rabbitohs    1442.037831
Parramatta Eels           1434.357528
St George Dragons         1405.294746
Gold Coast Titans         1350.946560
Wests Tigers              1299.040547


In [34]:
def calculate_rest_days(df):
    print("\n=== STEP 4b: Calculating Rest Days (Season-Aware) ===")
    
    df = df.copy()
    df = df.sort_values(['team_name', 'Date']).reset_index(drop=True)
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Extract year to identify seasons
    df['season'] = df['Date'].dt.year
    df['rest_days'] = np.nan
    
    print("Calculating rest days within seasons (excluding off-season gaps)...")
    
    # Calculate rest days for each team, respecting season boundaries
    for team in df['team_name'].unique():
        team_mask = df['team_name'] == team
        team_data = df[team_mask].copy()
        team_data = team_data.sort_values('Date').reset_index()
        
        # Calculate rest days between consecutive games
        for i in range(1, len(team_data)):
            current_season = team_data.iloc[i]['season']
            previous_season = team_data.iloc[i-1]['season']
            
            if current_season == previous_season:
                current_date = team_data.iloc[i]['Date']
                previous_date = team_data.iloc[i-1]['Date']
                rest_days_value = (current_date - previous_date).days
                
                # Apply data leakage prevention by using previous game's rest days
                # (shift effect built into the logic)
                if i >= 2:  # Need at least 2 previous games to prevent leakage
                    df.loc[team_data.iloc[i]['index'], 'rest_days'] = rest_days_value
            # If different seasons, leave as NaN (off-season gap ignored)
    return df
    
def preview_rest_days(df):
    valid_rest_days = df['rest_days'].dropna()
    
    print(f" Rest days calculated for all teams (within seasons only)")
    print(f" Off-season gaps excluded: {df['rest_days'].isna().sum()} records")
    print(f" Valid rest day calculations: {len(valid_rest_days)} records")
    
    if len(valid_rest_days) > 0:
        rest_stats = valid_rest_days.describe()
        print(f"\n=== REST DAYS STATISTICS (Season-Aware) ===")
        print(f"Mean rest days: {rest_stats['mean']:.1f}")
        print(f"Median rest days: {rest_stats['50%']:.1f}")
        print(f"Min rest days: {rest_stats['min']:.0f}")
        print(f"Max rest days: {rest_stats['max']:.0f}")
        
        # Count of different rest periods
        rest_counts = valid_rest_days.value_counts().sort_index()
        print(f"\nMost common rest periods:")
        print(rest_counts.head(8).to_string())
        
        # Season analysis
        season_counts = df.groupby('season')['rest_days'].count()
        print(f"\n=== SEASON BREAKDOWN ===")
        print("Valid rest day calculations per season:")
        print(season_counts.to_string())
        
        # Check for any suspiciously long rest periods (potential issues)
        long_rest = valid_rest_days[valid_rest_days > 30]
        if len(long_rest) > 0:
            print(f"\n  Found {len(long_rest)} rest periods > 30 days:")
            long_rest_matches = df[df['rest_days'] > 30][['Date', 'team_name', 'rest_days', 'season']]
            print(long_rest_matches.head().to_string(index=False))
            print(f"Note: These may be mid-season breaks or scheduling anomalies")
        else:
            print(f"\n All rest periods are within reasonable range (≤30 days)")
    else:
        print(f"\n  No valid rest day calculations found")
    
team_stats_rest = calculate_rest_days(team_stats_elo)
if team_stats_rest is not None:
    preview_rest_days(team_stats_rest)


=== STEP 4b: Calculating Rest Days (Season-Aware) ===
Calculating rest days within seasons (excluding off-season gaps)...
 Rest days calculated for all teams (within seasons only)
 Off-season gaps excluded: 292 records
 Valid rest day calculations: 6380 records

=== REST DAYS STATISTICS (Season-Aware) ===
Mean rest days: 7.9
Median rest days: 7.0
Min rest days: 4
Max rest days: 73

Most common rest periods:
rest_days
4.0        3
5.0      591
6.0     1526
7.0     1698
8.0     1129
9.0      569
10.0     151
11.0      45

=== SEASON BREAKDOWN ===
Valid rest day calculations per season:
season
2009    370
2010    386
2011    386
2012    386
2013    386
2014    386
2015    386
2016    386
2017    386
2018    386
2019    386
2020    322
2021    386
2022    386
2023    408
2024    409
2025    239

  Found 16 rest periods > 30 days:
      Date           team_name  rest_days  season
2020-05-28    Brisbane Broncos       69.0    2020
2020-05-30    Canberra Raiders       70.0    2020
2020-05-31 

In [35]:
def calculate_travel_distance(team_stats_df):
    print("\n=== STEP 4c: Calculating Travel Distance ===")
    
    # NRL team home cities (approximate coordinates)
    team_locations = {
        'Brisbane Broncos': (-27.4975, 153.0137),  # Brisbane
        'North Queensland Cowboys': (-19.2590, 146.8169),  # Townsville  
        'Gold Coast Titans': (-28.0167, 153.4000),  # Gold Coast
        'New Zealand Warriors': (-36.8485, 174.7633),  # Auckland
        'Melbourne Storm': (-37.8136, 144.9631),  # Melbourne
        'Canberra Raiders': (-35.2809, 149.1300),  # Canberra
        'Sydney Roosters': (-33.8688, 151.2093),  # Sydney
        'South Sydney Rabbitohs': (-33.8688, 151.2093),  # Sydney
        'St George Illawarra Dragons': (-34.4278, 150.8931),  # Wollongong
        'Cronulla-Sutherland Sharks': (-34.0544, 151.1518),  # Cronulla
        'Manly Sea Eagles': (-33.7969, 151.2841),  # Manly
        'Parramatta Eels': (-33.8176, 151.0032),  # Parramatta
        'Penrith Panthers': (-33.7506, 150.6934),  # Penrith
        'Wests Tigers': (-33.8688, 151.2093),  # Sydney
        'Canterbury Bulldogs': (-33.9173, 151.1851),  # Canterbury
        'Newcastle Knights': (-32.9283, 151.7817),  # Newcastle
        'Dolphins': (-27.4975, 153.0137),  # Brisbane (Redcliffe)
    }
    
    # Common venue locations
    venue_locations = {
        'Suncorp Stadium': (-27.4648, 153.0099),  # Brisbane
        'Queensland Country Bank Stadium': (-19.2598, 146.8181),  # Townsville
        'Cbus Super Stadium': (-28.0024, 153.3992),  # Gold Coast
        'AAMI Park': (-37.8255, 144.9816),  # Melbourne
        'GIO Stadium Canberra': (-35.2447, 149.1014),  # Canberra
        'Allianz Stadium': (-33.8878, 151.2273),  # Sydney
        'Accor Stadium': (-33.8474, 151.0616),  # Sydney Olympic Park
        'WIN Stadium': (-34.4056, 150.8841),  # Wollongong
        'PointsBet Stadium': (-34.0481, 151.1394),  # Cronulla
        '4 Pines Park': (-33.7742, 151.2606),  # Manly
        'CommBank Stadium': (-33.8007, 150.9810),  # Parramatta
        'BlueBet Stadium': (-33.7347, 150.6750),  # Penrith
        'Leichhardt Oval': (-33.8821, 151.1589),  # Leichhardt
        'McDonald Jones Stadium': (-32.9154, 151.7734),  # Newcastle
        'Mt Smart Stadium': (-36.9278, 174.8384),  # Auckland
        'Kayo Stadium': (-27.3644, 153.0486),  # Redcliffe
    }
    
    def haversine_distance(lat1, lon1, lat2, lon2):
        """Calculate distance between two points on Earth using Haversine formula"""
        from math import radians, sin, cos, sqrt, asin
        
        # Convert to radians
        lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
        
        # Haversine formula
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        
        # Earth's radius in kilometers
        r = 6371
        
        return c * r
    
    df = team_stats_df.copy()
    
    # Initialise travel distance column
    df['travel_distance_km'] = 0.0
    
    # Calculate travel distance only for away teams
    away_games = df[df['is_home'] == 0].copy()
    
    for idx, row in away_games.iterrows():
        team_name = row['team_name']
        venue = row['Venue']
        
        # Get team home location
        if team_name in team_locations:
            team_lat, team_lon = team_locations[team_name]
        else:
            # Default to Sydney for unknown teams
            team_lat, team_lon = (-33.8688, 151.2093)
        
        # Get venue location (try exact match first, then partial match)
        venue_lat, venue_lon = None, None
        
        # Exact match
        if venue in venue_locations:
            venue_lat, venue_lon = venue_locations[venue]
        else:
            # Partial match for similar venue names
            for venue_key in venue_locations:
                if venue_key.lower() in venue.lower() or venue.lower() in venue_key.lower():
                    venue_lat, venue_lon = venue_locations[venue_key]
                    break
        
        # Default to team's home city if venue not found
        if venue_lat is None:
            venue_lat, venue_lon = team_lat, team_lon
        
        # Calculate distance
        distance = haversine_distance(team_lat, team_lon, venue_lat, venue_lon)
        df.loc[idx, 'travel_distance_km'] = distance
    
    print(f" Travel distances calculated for away games")
    
    return df

def preview_travel_distance(df):
    # Summary statistics
    away_distances = df[df['is_home'] == 0]['travel_distance_km']
    travel_stats = away_distances.describe()
    
    print(f"\n=== TRAVEL DISTANCE STATISTICS ===")
    print(f"Mean travel distance: {travel_stats['mean']:.1f} km")
    print(f"Median travel distance: {travel_stats['50%']:.1f} km")
    print(f"Max travel distance: {travel_stats['max']:.1f} km")
    print(f"Min travel distance: {travel_stats['min']:.1f} km")
    
    # Show longest travels
    longest_travels = df[df['travel_distance_km'] > 0].nlargest(5, 'travel_distance_km')
    print(f"\nLongest travel distances:")
    travel_display = longest_travels[['Date', 'team_name', 'Venue', 'City', 'travel_distance_km']]
    print(travel_display.to_string(index=False))

team_stats_travel = calculate_travel_distance(team_stats_rest)
if team_stats_travel is not None:
    preview_travel_distance(team_stats_travel)

team_stats_final = team_stats_travel.copy()


=== STEP 4c: Calculating Travel Distance ===
 Travel distances calculated for away games

=== TRAVEL DISTANCE STATISTICS ===
Mean travel distance: 358.3 km
Median travel distance: 28.5 km
Max travel distance: 2630.6 km
Min travel distance: 0.0 km

Longest travel distances:
      Date       team_name            Venue     City  travel_distance_km
2009-09-05 Melbourne Storm Mt Smart Stadium Auckland         2630.638431
2010-03-20 Melbourne Storm Mt Smart Stadium Auckland         2630.638431
2011-06-26 Melbourne Storm Mt Smart Stadium Auckland         2630.638431
2012-06-03 Melbourne Storm Mt Smart Stadium Auckland         2630.638431
2013-07-28 Melbourne Storm Mt Smart Stadium Auckland         2630.638431


## Step 5: Assemble Final Model-Ready DataFrame

This is the final and most important step. We merge all the engineered team-level features back into our original match-level DataFrame.

The key action here is creating **difference features** (e.g., `elo_diff`, `form_margin_diff_5`). Our model will learn from the *relative difference* in form and strength between the two competing teams, which is much more predictive than looking at each team's stats in isolation.

In [38]:
def assemble_final_model_ready_dataframe(df, team_stats_final):
    print("\n--- Step 5: Assembling Final Model-Ready DataFrame ---")
    
    # Split features into home and away sets
    home_stats_df = team_stats_final[team_stats_final['is_home'] == 1].copy()
    away_stats_df = team_stats_final[team_stats_final['is_home'] == 0].copy()

    # Select and rename feature columns
    base_columns = ['match_id', 'Date', 'team_name', 'is_home', 'points_for', 
                   'points_against', 'won', 'opponent', 'Venue', 'City', 'margin', 'lost']
    
    feature_columns = [col for col in team_stats_final.columns if col not in base_columns]

    home_rename = {col: f'home_{col}' for col in feature_columns}
    home_features = home_stats_df[['match_id'] + feature_columns].rename(columns=home_rename)

    away_rename = {col: f'away_{col}' for col in feature_columns}
    away_features = away_stats_df[['match_id'] + feature_columns].rename(columns=away_rename)

    print("\n=== 5.2: Merging Back to Main DataFrame ===")
    # Start with original match dataframe
    df_final = df.copy()

    # Merge home team features
    df_final = df_final.merge(home_features, on='match_id', how='left')
    print(f" Merged home team features: {df_final.shape}")
    
    # Merge away team features
    df_final = df_final.merge(away_features, on='match_id', how='left')
    print(f" Merged away team features: {df_final.shape}")

    print("\n=== 5.3: Adding Market Features ===")
    
    df_final['home_implied_prob'] = np.where(
        df_final['Home Odds'].notna() & (df_final['Home Odds'] > 0),
        1 / df_final['Home Odds'],
        np.nan
    )
    
    df_final['away_implied_prob'] = np.where(
        df_final['Away Odds'].notna() & (df_final['Away Odds'] > 0),
        1 / df_final['Away Odds'], 
        np.nan
    )

    df_final['market_spread'] = df_final['home_implied_prob'] - df_final['away_implied_prob']

    print("\n=== 5.4: Creating Difference Features ===")
    print("Creating the critical difference features that compare home vs away teams...")
    
    # 1. Strength Difference (Most Important)
    df_final['elo_diff'] = df_final['home_pre_match_elo'] - df_final['away_pre_match_elo']
    print(" Elo difference calculated")
    
    # 2. Form Differences (Rolling Averages)
    windows = [3, 5, 8]
    
    for window in windows:
        # Margin differences
        df_final[f'form_margin_diff_{window}'] = (
            df_final[f'home_rolling_avg_margin_{window}'] - 
            df_final[f'away_rolling_avg_margin_{window}']
        )
        
        # Win rate differences  
        df_final[f'form_win_rate_diff_{window}'] = (
            df_final[f'home_rolling_win_percentage_{window}'] - 
            df_final[f'away_rolling_win_percentage_{window}']
        )
        
        # Points for differences
        df_final[f'form_points_for_diff_{window}'] = (
            df_final[f'home_rolling_avg_points_for_{window}'] - 
            df_final[f'away_rolling_avg_points_for_{window}']
        )
        
        # Points against differences
        df_final[f'form_points_against_diff_{window}'] = (
            df_final[f'home_rolling_avg_points_against_{window}'] - 
            df_final[f'away_rolling_avg_points_against_{window}']
        )
    
    print(f" Form differences calculated for {len(windows)} windows (12 features)")
    
    # 3. Streak & Recency Differences
    df_final['winning_streak_diff'] = df_final['home_winning_streak'] - df_final['away_winning_streak']
    df_final['losing_streak_diff'] = df_final['home_losing_streak'] - df_final['away_losing_streak']
    df_final['games_since_win_diff'] = df_final['home_games_since_win'] - df_final['away_games_since_win']
    df_final['games_since_loss_diff'] = df_final['home_games_since_loss'] - df_final['away_games_since_loss']
    df_final['recent_wins_3_diff'] = df_final['home_recent_wins_3'] - df_final['away_recent_wins_3']
    
    print(" Streak and recency differences calculated (5 features)")
    print(" Contextual features (rest days, travel) already available")
    return df_final

df_model_ready = assemble_final_model_ready_dataframe(df_cleaned, team_stats_final)


--- Step 5: Assembling Final Model-Ready DataFrame ---

=== 5.2: Merging Back to Main DataFrame ===
 Merged home team features: (3336, 47)
 Merged away team features: (3336, 68)

=== 5.3: Adding Market Features ===

=== 5.4: Creating Difference Features ===
Creating the critical difference features that compare home vs away teams...
 Elo difference calculated
 Form differences calculated for 3 windows (12 features)
 Streak and recency differences calculated (5 features)
 Contextual features (rest days, travel) already available


final model ready data summary

In [39]:
def preview_final_summary(df_final):
    # Count different types of features
    all_columns = df_final.columns.tolist()
    
    # Core model features
    core_features = []
    
    # Strength features
    strength_features = ['elo_diff']
    
    # Form difference features
    form_diff_features = [col for col in all_columns if col.startswith('form_') and col.endswith('_diff_3') or col.endswith('_diff_5') or col.endswith('_diff_8')]
    
    # Streak difference features
    streak_diff_features = [col for col in all_columns if col.endswith('_streak_diff') or col.endswith('_win_diff') or col.endswith('_loss_diff') or col.endswith('_wins_3_diff')]
    
    # Contextual features (absolute values)
    contextual_features = ['home_rest_days', 'away_rest_days', 'away_travel_distance_km']
    
    # Market features
    market_features = ['home_implied_prob', 'away_implied_prob', 'market_spread']

    # Weather Features
    weather_features = ['temperature_c', 'wind_speed_kph' ,'precipitation_mm', 'is_rainy,is_windy,temperature_category']
    
    core_features = strength_features + form_diff_features + streak_diff_features + contextual_features + market_features + weather_features
    
    print(f" FINAL FEATURE BREAKDOWN:")
    print(f"   • Total columns: {len(all_columns)}")
    print(f"   • Strength features: {len(strength_features)} - {strength_features}")
    print(f"   • Form difference features: {len(form_diff_features)}")
    print(f"   • Streak difference features: {len(streak_diff_features)}")
    print(f"   • Contextual features: {len(contextual_features)} - {contextual_features}")
    print(f"   • Market features: {len(market_features)} - {market_features}")
    print(f"   • CORE MODEL FEATURES: {len(core_features)}")
    
    # Data quality check
    print(f"\n DATA QUALITY CHECK:")
    
    # Check core features for missing values
    for feature in core_features[:10]:  # Check first 10 core features
        if feature in df_final.columns:
            missing_count = df_final[feature].isna().sum()
            missing_pct = missing_count / len(df_final) * 100
            if missing_count > 0:
                print(f"   • {feature}: {missing_count} missing ({missing_pct:.1f}%)")
    
    # Correlation check for most important features
    if 'elo_diff' in df_final.columns and 'Home_Win' in df_final.columns:
        elo_corr = df_final['elo_diff'].corr(df_final['Home_Win'])
        print(f"   • Elo difference correlation with wins: {elo_corr:.3f}")
    
    print(f"\n STEP 5 COMPLETE!")
    print(f"   • Final dataset shape: {df_final.shape}")
    print(f"   • Ready for machine learning model training!")

preview_final_summary(df_model_ready)

 FINAL FEATURE BREAKDOWN:
   • Total columns: 89
   • Strength features: 1 - ['elo_diff']
   • Form difference features: 12
   • Streak difference features: 5
   • Contextual features: 3 - ['home_rest_days', 'away_rest_days', 'away_travel_distance_km']
   • Market features: 3 - ['home_implied_prob', 'away_implied_prob', 'market_spread']
   • CORE MODEL FEATURES: 28

 DATA QUALITY CHECK:
   • form_margin_diff_3: 9 missing (0.3%)
   • form_win_rate_diff_3: 9 missing (0.3%)
   • form_points_for_diff_3: 9 missing (0.3%)
   • form_points_against_diff_3: 9 missing (0.3%)
   • form_margin_diff_5: 9 missing (0.3%)
   • form_win_rate_diff_5: 9 missing (0.3%)
   • form_points_for_diff_5: 9 missing (0.3%)
   • form_points_against_diff_5: 9 missing (0.3%)
   • form_margin_diff_8: 9 missing (0.3%)
   • Elo difference correlation with wins: 0.268

 STEP 5 COMPLETE!
   • Final dataset shape: (3336, 89)
   • Ready for machine learning model training!


## Step 6: Save Final Datasets

The pipeline is complete. We save the final model-ready DataFrame to a new CSV file. This file will be the single source of truth for all subsequent model training and evaluation.

In [41]:
print("\n--- Step 6: Saving Final Datasets ---")

final_match_path = '../data/nrl_matches_final_model_ready.csv'
df_model_ready.to_csv(final_match_path, index=False)

final_team_path = '../data/nrl_team_stats_final_complete.csv'
team_stats_final.to_csv(final_team_path, index=False)

print(f" Final dataset saved successfully to: {final_match_path}")
print(f" Final team dataset saved successfully to: {final_team_path}")


--- Step 6: Saving Final Datasets ---
 Final dataset saved successfully to: ../data/nrl_matches_final_model_ready.csv
 Final team dataset saved successfully to: ../data/nrl_team_stats_final_complete.csv
