# IPL Cricket Player Performance Prediction - Feature Engineering

This notebook performs comprehensive feature engineering for IPL cricket player performance prediction.

## Pipeline Overview:
1. Load and prepare data
2. Aggregate to player-match level
3. Engineer advanced features (PUT, rolling, venue, PvP, career)
4. Create target labels
5. Feature selection and preprocessing
6. Time-series aware train-test split
7. Create feature pipeline
8. Save artifacts

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import joblib
import warnings
warnings.filterwarnings("ignore")

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Loading and Preparing Data

In [2]:
def load_and_prepare_data():
    """Load merged data and prepare for feature engineering"""
    print("=" * 60)
    print("1. LOADING AND PREPARING DATA")
    print("=" * 60)
    
    # Load merged data
    df = pd.read_csv("data/merged_data.csv")
    
    # Convert date column to datetime
    df["date"] = pd.to_datetime(df["date"])
    
    # Rename batsman to batter for consistency
    df["batter"] = df["batsman"]
    
    print(f"Data loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Date range: {df["date"].min()} to {df["date"].max()}")
    print(f"Unique batters: {df["batter"].nunique()}")
    print(f"Unique venues: {df["venue"].nunique()}")
    print(f"Unique bowling teams: {df["bowling_team"].nunique()}")
    
    return df

# Load the data
df = load_and_prepare_data()

1. LOADING AND PREPARING DATA


Data loaded successfully!
Shape: (179078, 46)
Date range: 2008-04-18 00:00:00 to 2017-05-21 00:00:00
Unique batters: 516
Unique venues: 41
Unique bowling teams: 13


## 2. Aggregating to Player-Match Level

In [3]:
def aggregate_to_player_match(df):
    """Aggregate ball-by-ball data to player-match level"""
    print("\n" + "=" * 60)
    print("2. AGGREGATING TO PLAYER-MATCH LEVEL")
    print("=" * 60)
    
    # Aggregate to player-match level
    player_match = df.groupby(["batter", "match_id", "date", "venue", "bowling_team"]).agg({
        "batsman_runs": "sum",
        "ball": "count",
        "is_wicket": "sum",
        "is_boundary": "sum",
        "is_six": "sum",
        "is_four": "sum"
    }).reset_index()
    
    # Calculate strike rate
    player_match["strike_rate"] = (player_match["batsman_runs"] / player_match["ball"] * 100).round(2)
    
    # Sort by batter and date for time-series operations
    player_match = player_match.sort_values(["batter", "date"])
    
    print(f"Player-match level data shape: {player_match.shape}")
    print(f"Sample data:")
    print(player_match.head())
    
    return player_match

# Aggregate data to player-match level
player_match = aggregate_to_player_match(df)


2. AGGREGATING TO PLAYER-MATCH LEVEL


Player-match level data shape: (9515, 12)
Sample data:
           batter  match_id       date  \
0  A Ashish Reddy       346 2012-04-29   
1  A Ashish Reddy       352 2012-05-04   
2  A Ashish Reddy       359 2012-05-08   
3  A Ashish Reddy       373 2012-05-18   
4  A Ashish Reddy       376 2012-05-20   

                                       venue                 bowling_team  \
0                           Wankhede Stadium               Mumbai Indians   
1            MA Chidambaram Stadium, Chepauk          Chennai Super Kings   
2  Rajiv Gandhi International Stadium, Uppal                 Punjab Kings   
3  Rajiv Gandhi International Stadium, Uppal             Rajasthan Royals   
4  Rajiv Gandhi International Stadium, Uppal  Royal Challengers Bangalore   

   batsman_runs  ball  is_wicket  is_boundary  is_six  is_four  strike_rate  
0            10    10          1            1       1        0        100.0  
1             3     3          1            0       0        0        100

## 3.1 Engineering PUT Features (Player vs Team)

In [4]:
def engineer_put_features(player_match):
    """Create Player vs Team (PUT) features"""
    print("\n" + "=" * 60)
    print("3.1 ENGINEERING PUT FEATURES")
    print("=" * 60)
    
    # PUT (Player vs Team) average - overall average against bowling team
    player_match["put_avg"] = player_match.groupby(["batter", "bowling_team"])["batsman_runs"].transform("mean")
    
    # PUT average with expanding mean (cumulative performance)
    put_expanding = player_match.groupby(["batter", "bowling_team"])["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    # Reset index to match original dataframe
    put_expanding = put_expanding.reset_index(level=[0,1], drop=True)
    player_match["put_avg_expanding"] = put_expanding
    
    print("PUT features created:")
    print(player_match[["batter", "bowling_team", "batsman_runs", "put_avg", "put_avg_expanding"]].head())
    
    return player_match

# Create PUT features
player_match = engineer_put_features(player_match)


3.1 ENGINEERING PUT FEATURES


PUT features created:
           batter                 bowling_team  batsman_runs    put_avg  \
0  A Ashish Reddy               Mumbai Indians            10  13.500000   
1  A Ashish Reddy          Chennai Super Kings             3  15.000000   
2  A Ashish Reddy                 Punjab Kings             8  12.333333   
3  A Ashish Reddy             Rajasthan Royals            10  12.333333   
4  A Ashish Reddy  Royal Challengers Bangalore             4  11.000000   

   put_avg_expanding  
0           8.500000  
1                NaN  
2          13.000000  
3          12.333333  
4          12.333333  


## 3.2 Engineering Rolling Form Features

In [5]:
def engineer_rolling_features(player_match):
    """Create rolling form features (last 5 matches)"""
    print("\n" + "=" * 60)
    print("3.2 ENGINEERING ROLLING FORM FEATURES")
    print("=" * 60)
    
    # Rolling average of last 5 matches
    player_match["rolling_avg_5"] = player_match.groupby("batter")["batsman_runs"]\
        .rolling(5, min_periods=1).mean().reset_index(level=0, drop=True)
    
    # Rolling strike rate of last 5 matches
    player_match["rolling_sr_5"] = player_match.groupby("batter")["strike_rate"]\
        .rolling(5, min_periods=1).mean().reset_index(level=0, drop=True)
    
    print("Rolling form features created:")
    print(player_match[["batter", "date", "batsman_runs", "rolling_avg_5", "rolling_sr_5"]].head(10))
    
    return player_match

# Create rolling form features
player_match = engineer_rolling_features(player_match)


3.2 ENGINEERING ROLLING FORM FEATURES


Rolling form features created:
            batter       date  batsman_runs  rolling_avg_5  rolling_sr_5
0   A Ashish Reddy 2012-04-29            10          10.00       100.000
1   A Ashish Reddy 2012-05-04             3           6.50       100.000
2   A Ashish Reddy 2012-05-08             8           7.00       100.000
3   A Ashish Reddy 2012-05-18            10           7.75       137.500
4   A Ashish Reddy 2012-05-20             4           7.00       126.000
5   A Ashish Reddy 2013-04-05             7           6.40       141.000
6   A Ashish Reddy 2013-04-07            14           8.60       144.334
14  A Ashish Reddy 2013-04-09             3           7.60       139.334
7   A Ashish Reddy 2013-04-12            16           8.80       124.890
8   A Ashish Reddy 2013-04-14             4           8.80       124.890


## 3.3 Engineering Venue Features

In [6]:
def engineer_venue_features(player_match):
    """Create venue average features"""
    print("\n" + "=" * 60)
    print("3.3 ENGINEERING VENUE FEATURES")
    print("=" * 60)
    
    # Venue average for each batter
    player_match["venue_avg"] = player_match.groupby(["batter", "venue"])["batsman_runs"].transform("mean")
    
    # Overall venue average
    player_match["venue_overall_avg"] = player_match.groupby("venue")["batsman_runs"].transform("mean")
    
    print("Venue features created:")
    print(player_match[["batter", "venue", "batsman_runs", "venue_avg", "venue_overall_avg"]].head())
    
    return player_match

# Create venue features
player_match = engineer_venue_features(player_match)


3.3 ENGINEERING VENUE FEATURES
Venue features created:
           batter                                      venue  batsman_runs  \
0  A Ashish Reddy                           Wankhede Stadium            10   
1  A Ashish Reddy            MA Chidambaram Stadium, Chepauk             3   
2  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal             8   
3  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal            10   
4  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal             4   

   venue_avg  venue_overall_avg  
0  10.000000          19.321591  
1  19.500000          20.098611  
2   8.454545          19.611342  
3   8.454545          19.611342  
4   8.454545          19.611342  


## 3.4 Engineering PvP Features (Player vs Player)

In [7]:
def engineer_pvp_features(player_match, df):
    """Create Player vs Player (PvP) features"""
    print("\n" + "=" * 60)
    print("3.4 ENGINEERING PVP FEATURES")
    print("=" * 60)
    
    # For PvP, we need to consider the specific bowlers faced
    # First, get ball-by-ball data with bowler information
    pvp_data = df.groupby(["batter", "bowler", "match_id"])["batsman_runs"].sum().reset_index()
    pvp_data = pvp_data.sort_values(["batter", "match_id"])
    
    # Calculate PvP expanding average
    pvp_expanding = pvp_data.groupby(["batter", "bowler"])["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    pvp_expanding = pvp_expanding.reset_index(level=[0,1], drop=True)
    pvp_data["pvp_avg"] = pvp_expanding
    
    # Merge back to player-match level (take average of all bowlers faced in the match)
    pvp_match_avg = pvp_data.groupby(["batter", "match_id"])["pvp_avg"].mean().reset_index()
    player_match = player_match.merge(pvp_match_avg, on=["batter", "match_id"], how="left")
    
    print("PvP features created:")
    print(player_match[["batter", "match_id", "batsman_runs", "pvp_avg"]].head())
    
    return player_match

# Create PvP features
player_match = engineer_pvp_features(player_match, df)


3.4 ENGINEERING PVP FEATURES


PvP features created:
           batter  match_id  batsman_runs    pvp_avg
0  A Ashish Reddy       346            10   3.888889
1  A Ashish Reddy       352             3   3.250000
2  A Ashish Reddy       359             8   2.000000
3  A Ashish Reddy       373            10   4.000000
4  A Ashish Reddy       376             4  18.000000


## 3.5 Engineering Career Features

In [8]:
def engineer_career_features(player_match):
    """Create career average features"""
    print("\n" + "=" * 60)
    print("3.5 ENGINEERING CAREER FEATURES")
    print("=" * 60)
    
    # Career average using expanding mean
    career_avg_expanding = player_match.groupby("batter")["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    career_avg_expanding = career_avg_expanding.reset_index(level=0, drop=True)
    player_match["career_avg"] = career_avg_expanding
    
    # Career strike rate using expanding mean
    career_sr_expanding = player_match.groupby("batter")["strike_rate"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    career_sr_expanding = career_sr_expanding.reset_index(level=0, drop=True)
    player_match["career_sr"] = career_sr_expanding
    
    # Career matches played
    player_match["career_matches"] = player_match.groupby("batter").cumcount()
    
    print("Career features created:")
    print(player_match[["batter", "date", "batsman_runs", "career_avg", "career_sr", "career_matches"]].head(10))
    
    return player_match

# Create career features
player_match = engineer_career_features(player_match)


3.5 ENGINEERING CAREER FEATURES
Career features created:
           batter       date  batsman_runs  career_avg   career_sr  \
0  A Ashish Reddy 2012-04-29            10         NaN         NaN   
1  A Ashish Reddy 2012-05-04             3   10.000000  100.000000   
2  A Ashish Reddy 2012-05-08             8    6.500000  100.000000   
3  A Ashish Reddy 2012-05-18            10    7.000000  100.000000   
4  A Ashish Reddy 2012-05-20             4    7.750000  137.500000   
5  A Ashish Reddy 2013-04-05             7    7.000000  126.000000   
6  A Ashish Reddy 2013-04-07            14    7.000000  134.166667   
7  A Ashish Reddy 2013-04-09             3    8.000000  131.667143   
8  A Ashish Reddy 2013-04-12            16    7.375000  124.583750   
9  A Ashish Reddy 2013-04-14             4    8.333333  130.494444   

   career_matches  
0               0  
1               1  
2               2  
3               3  
4               4  
5               5  
6               6  
7          

## 4. Creating Target Labels

In [9]:
def create_target_labels(player_match):
    """Create target labels (next match performance)"""
    print("\n" + "=" * 60)
    print("4. CREATING TARGET LABELS")
    print("=" * 60)
    
    # Create target: runs in next match
    player_match = player_match.sort_values(["batter", "date"])
    player_match["target_next_runs"] = player_match.groupby("batter")["batsman_runs"].shift(-1)
    
    # Create target: strike rate in next match
    player_match["target_next_sr"] = player_match.groupby("batter")["strike_rate"].shift(-1)
    
    # Remove rows where target is NaN (last match for each player)
    player_match_clean = player_match.dropna(subset=["target_next_runs"])
    
    print(f"Data after creating targets: {player_match_clean.shape}")
    print(f"Target statistics:")
    print(f"Next match runs - Mean: {player_match_clean["target_next_runs"].mean():.2f}, Std: {player_match_clean["target_next_runs"].std():.2f}")
    print(f"Next match SR - Mean: {player_match_clean["target_next_sr"].mean():.2f}, Std: {player_match_clean["target_next_sr"].std():.2f}")
    
    return player_match_clean

# Create target labels
player_match_clean = create_target_labels(player_match)


4. CREATING TARGET LABELS
Data after creating targets: (9054, 24)
Target statistics:
Next match runs - Mean: 19.72, Std: 20.85
Next match SR - Mean: 106.99, Std: 63.58


## 5. Feature Selection and Preprocessing

In [10]:
def feature_selection_and_preprocessing(player_match_clean):
    """Select features and handle preprocessing"""
    print("\n" + "=" * 60)
    print("5. FEATURE SELECTION AND PREPROCESSING")
    print("=" * 60)
    
    # Select features for modeling
    feature_columns = [
        "put_avg", "put_avg_expanding",
        "rolling_avg_5", "rolling_sr_5",
        "venue_avg", "venue_overall_avg",
        "pvp_avg",
        "career_avg", "career_sr", "career_matches",
        "ball", "is_boundary", "is_six", "is_four"
    ]
    
    # Handle missing values in features
    for col in feature_columns:
        if col in player_match_clean.columns:
            player_match_clean[col] = player_match_clean[col].fillna(player_match_clean[col].median())
    
    features = player_match_clean[feature_columns]
    labels_runs = player_match_clean["target_next_runs"]
    labels_sr = player_match_clean["target_next_sr"]
    
    print(f"Features shape: {features.shape}")
    print(f"Features selected: {feature_columns}")
    print(f"Feature statistics:")
    print(features.describe())
    
    return features, labels_runs, labels_sr, feature_columns

# Select features and preprocess
features, labels_runs, labels_sr, feature_columns = feature_selection_and_preprocessing(player_match_clean)


5. FEATURE SELECTION AND PREPROCESSING
Features shape: (9054, 14)
Features selected: ['put_avg', 'put_avg_expanding', 'rolling_avg_5', 'rolling_sr_5', 'venue_avg', 'venue_overall_avg', 'pvp_avg', 'career_avg', 'career_sr', 'career_matches', 'ball', 'is_boundary', 'is_six', 'is_four']
Feature statistics:


           put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  \
count  9054.000000        9054.000000    9054.000000   9054.000000   
mean     19.754829          19.946419      19.972511    107.640728   
std      11.717716          14.720824      12.387729     34.774497   
min       0.000000           0.000000       0.000000      0.000000   
25%      10.666667           8.500000      10.500000     87.228000   
50%      20.000000          18.333333      18.500000    107.078000   
75%      27.000000          28.000000      27.400000    127.743000   
max     120.000000         158.000000     158.000000    400.000000   

         venue_avg  venue_overall_avg      pvp_avg   career_avg    career_sr  \
count  9054.000000        9054.000000  9054.000000  9054.000000  9054.000000   
mean     19.788203          19.327437     5.683232    19.948605   107.819885   
std      13.275306           1.445922     3.294719     9.929360    27.143746   
min       0.000000          14.787879     0.00000




## 6. Time-Series Aware Train-Test Split

In [11]:
def time_series_split(features, labels_runs, labels_sr, player_match_clean):
    """Perform time-series aware train-test split"""
    print("\n" + "=" * 60)
    print("6. TIME-SERIES AWARE TRAIN-TEST SPLIT")
    print("=" * 60)
    
    # Sort by date for proper time-series split
    player_match_clean = player_match_clean.sort_values("date")
    
    # Time-series split (80% train, 20% test)
    split_idx = int(len(player_match_clean) * 0.8)
    
    X_train = features[:split_idx]
    X_test = features[split_idx:]
    y_train_runs = labels_runs[:split_idx]
    y_test_runs = labels_runs[split_idx:]
    y_train_sr = labels_sr[:split_idx]
    y_test_sr = labels_sr[split_idx:]
    
    print(f"Train set size: {len(X_train)} ({len(X_train)/len(features)*100:.1f}%)")
    print(f"Test set size: {len(X_test)} ({len(X_test)/len(features)*100:.1f}%)")
    print(f"Train date range: {player_match_clean.iloc[:split_idx]["date"].min()} to {player_match_clean.iloc[:split_idx]["date"].max()}")
    print(f"Test date range: {player_match_clean.iloc[split_idx:]["date"].min()} to {player_match_clean.iloc[split_idx:]["date"].max()}")
    
    return X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr

# Perform time-series split
X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr = time_series_split(
    features, labels_runs, labels_sr, player_match_clean
)


6. TIME-SERIES AWARE TRAIN-TEST SPLIT
Train set size: 7243 (80.0%)
Test set size: 1811 (20.0%)
Train date range: 2008-04-18 00:00:00 to 2015-05-09 00:00:00
Test date range: 2015-05-09 00:00:00 to 2017-05-19 00:00:00


## 7. Creating Feature Pipeline

In [12]:
def create_feature_pipeline(X_train, X_test, feature_columns):
    """Create and apply feature preprocessing pipeline"""
    print("\n" + "=" * 60)
    print("7. CREATING FEATURE PIPELINE")
    print("=" * 60)
    
    # Create preprocessing pipeline
    feature_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])
    
    # Fit pipeline on training data
    X_train_scaled = feature_pipeline.fit_transform(X_train)
    X_test_scaled = feature_pipeline.transform(X_test)
    
    # Convert back to DataFrame for easier handling
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)
    
    print("Feature pipeline created and applied:")
    print(f"Scaled training data shape: {X_train_scaled.shape}")
    print(f"Scaled test data shape: {X_test_scaled.shape}")
    print(f"Sample scaled features:")
    print(X_train_scaled.head())
    
    return feature_pipeline, X_train_scaled, X_test_scaled

# Create feature pipeline
feature_pipeline, X_train_scaled, X_test_scaled = create_feature_pipeline(X_train, X_test, feature_columns)


7. CREATING FEATURE PIPELINE
Feature pipeline created and applied:
Scaled training data shape: (7243, 14)
Scaled test data shape: (1811, 14)
Sample scaled features:
    put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  venue_avg  \
0 -0.502035          -0.749183      -0.780673     -0.207564  -0.709098   
1 -0.374855          -0.084118      -1.063708     -0.207564   0.003366   
2 -0.600954          -0.444831      -1.023275     -0.207564  -0.825001   
3 -0.600954          -0.489921      -0.962624      0.861467  -0.825001   
4 -0.714003          -0.489921      -1.023275      0.533631  -0.825001   

   venue_overall_avg   pvp_avg  career_avg  career_sr  career_matches  \
0          -0.004717 -0.523122    0.118640   0.115093       -1.000485   
1           0.534424 -0.716012   -0.956843  -0.253947       -0.968057   
2           0.196329 -1.093406   -1.302976  -0.253947       -0.935629   
3           0.196329 -0.489576   -1.253529  -0.253947       -0.903202   
4           0.196329  3.

## 8. Saving Artifacts

In [13]:
def save_artifacts(feature_pipeline, final_dataset, X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr, feature_columns):
    """Save all artifacts"""
    print("\n" + "=" * 60)
    print("8. SAVING ARTIFACTS")
    print("=" * 60)
    
    # Save the feature pipeline
    joblib.dump(feature_pipeline, "feature_pipeline.pkl")
    
    # Save the final dataset with all features and targets
    final_dataset.to_csv("data/dataset.csv", index=False)
    
    # Save train-test splits
    train_data = pd.concat([X_train, y_train_runs, y_train_sr], axis=1)
    train_data.columns = feature_columns + ["target_runs", "target_sr"]
    train_data.to_csv("data/train_data.csv", index=False)
    
    test_data = pd.concat([X_test, y_test_runs, y_test_sr], axis=1)
    test_data.columns = feature_columns + ["target_runs", "target_sr"]
    test_data.to_csv("data/test_data.csv", index=False)
    
    print("Artifacts saved successfully:")
    print("- feature_pipeline.pkl: Preprocessing pipeline")
    print("- dataset.csv: Complete feature-engineered dataset")
    print("- train_data.csv: Training set")
    print("- test_data.csv: Test set")

# Save all artifacts
save_artifacts(
    feature_pipeline, player_match_clean, X_train, X_test, 
    y_train_runs, y_test_runs, y_train_sr, y_test_sr, feature_columns
)


8. SAVING ARTIFACTS


Artifacts saved successfully:
- feature_pipeline.pkl: Preprocessing pipeline
- dataset.csv: Complete feature-engineered dataset
- train_data.csv: Training set
- test_data.csv: Test set


## 9. WICKET FEATURE ENGINEERING

This section adds comprehensive wicket prediction features to complement the batting performance features.

In [14]:
def engineer_wicket_features(player_match, df):
    """Create comprehensive wicket prediction features"""
    print("\n" + "=" * 60)
    print("9. ENGINEERING WICKET FEATURES")
    print("=" * 60)
    
    # First, create wicket-specific data at player-match level
    # Get wicket data for each player in each match
    wicket_data = df[df['player_dismissed'].notna()].copy()
    
    # Create dismissal type mapping for common categories
    dismissal_mapping = {
        'caught': 'caught',
        'bowled': 'bowled', 
        'run out': 'run_out',
        'lbw': 'lbw',
        'stumped': 'stumped',
        'caught and bowled': 'caught_and_bowled',
        'retired hurt': 'retired_hurt',
        'hit wicket': 'hit_wicket',
        'obstructing the field': 'obstructing'
    }
    
    # Map dismissal types
    wicket_data['dismissal_type_mapped'] = wicket_data['dismissal_kind'].map(dismissal_mapping).fillna('other')
    
    # Create wicket features for each player-match
    player_wickets = wicket_data.groupby(['player_dismissed', 'match_id', 'date', 'venue', 'bowling_team']).agg({
        'dismissal_kind': 'count',  # Total wickets
        'dismissal_type_mapped': lambda x: (x == 'caught').sum(),  # Caught dismissals
        'over': 'mean',  # Average over when dismissed
        'ball': 'mean'   # Average ball when dismissed
    }).reset_index()
    
    player_wickets.columns = ['batter', 'match_id', 'date', 'venue', 'bowling_team', 
                             'wickets_lost', 'caught_dismissals', 'avg_over_dismissed', 'avg_ball_dismissed']
    
    # Merge wicket data back to player_match
    player_match_wickets = player_match.merge(
        player_wickets, 
        on=['batter', 'match_id', 'date', 'venue', 'bowling_team'], 
        how='left'
    )
    
    # Fill missing wicket data with 0 (no wickets in that match)
    player_match_wickets['wickets_lost'] = player_match_wickets['wickets_lost'].fillna(0)
    player_match_wickets['caught_dismissals'] = player_match_wickets['caught_dismissals'].fillna(0)
    player_match_wickets['avg_over_dismissed'] = player_match_wickets['avg_over_dismissed'].fillna(20)  # Default to last over
    player_match_wickets['avg_ball_dismissed'] = player_match_wickets['avg_ball_dismissed'].fillna(6)   # Default to last ball
    
    # Create wicket probability features
    # PUT wicket rate (Player vs Team wicket rate)
    player_match_wickets['put_wicket_rate'] = player_match_wickets.groupby(['batter', 'bowling_team'])['wickets_lost'].transform('mean')
    
    # PUT wicket rate expanding (cumulative)
    put_wicket_expanding = player_match_wickets.groupby(['batter', 'bowling_team'])['wickets_lost']\
        .expanding().mean().shift(1, fill_value=0)
    put_wicket_expanding = put_wicket_expanding.reset_index(level=[0,1], drop=True)
    player_match_wickets['put_wicket_rate_expanding'] = put_wicket_expanding
    
    # Venue wicket rate
    player_match_wickets['venue_wicket_rate'] = player_match_wickets.groupby(['batter', 'venue'])['wickets_lost'].transform('mean')
    
    # Overall venue wicket rate
    player_match_wickets['venue_overall_wicket_rate'] = player_match_wickets.groupby('venue')['wickets_lost'].transform('mean')
    
    # Rolling wicket features (last 5 matches)
    player_match_wickets['rolling_wicket_rate_5'] = player_match_wickets.groupby('batter')['wickets_lost']\
        .rolling(5, min_periods=1).mean().reset_index(level=0, drop=True)
    
    # Career wicket rate (expanding)
    career_wicket_expanding = player_match_wickets.groupby('batter')['wickets_lost']\
        .expanding().mean().shift(1, fill_value=0)
    career_wicket_expanding = career_wicket_expanding.reset_index(level=0, drop=True)
    player_match_wickets['career_wicket_rate'] = career_wicket_expanding
    
    # Dismissal type features
    player_match_wickets['caught_rate'] = player_match_wickets.groupby('batter')['caught_dismissals']\
        .expanding().mean().shift(1, fill_value=0).reset_index(level=0, drop=True)
    
    # Bowling strength features (quality of opposition bowling)
    bowling_strength = df.groupby(['bowling_team', 'match_id']).agg({
        'is_wicket': 'sum',
        'total_runs': 'mean'
    }).reset_index()
    bowling_strength.columns = ['bowling_team', 'match_id', 'team_wickets', 'avg_runs_conceded']
    
    # Merge bowling strength
    player_match_wickets = player_match_wickets.merge(
        bowling_strength, 
        on=['bowling_team', 'match_id'], 
        how='left'
    )
    
    print("Wicket features created:")
    print(f"Shape: {player_match_wickets.shape}")
    print(f"Sample wicket features:")
    print(player_match_wickets[['batter', 'wickets_lost', 'put_wicket_rate', 'venue_wicket_rate', 
                               'rolling_wicket_rate_5', 'career_wicket_rate']].head(10))
    
    return player_match_wickets

# Create wicket features
player_match_wickets = engineer_wicket_features(player_match_clean, df)


9. ENGINEERING WICKET FEATURES


Wicket features created:
Shape: (9054, 37)
Sample wicket features:
           batter  wickets_lost  put_wicket_rate  venue_wicket_rate  \
0  A Ashish Reddy           1.0         1.000000                1.0   
1  A Ashish Reddy           1.0         0.666667                0.5   
2  A Ashish Reddy           1.0         0.333333                0.7   
3  A Ashish Reddy           0.0         0.333333                0.7   
4  A Ashish Reddy           1.0         1.000000                0.7   
5  A Ashish Reddy           0.0         0.000000                0.7   
6  A Ashish Reddy           1.0         1.000000                0.7   
7  A Ashish Reddy           1.0         1.000000                1.0   
8  A Ashish Reddy           1.0         1.000000                1.0   
9  A Ashish Reddy           1.0         1.000000                1.0   

   rolling_wicket_rate_5  career_wicket_rate  
0                   1.00            0.000000  
1                   1.00            1.000000  
2         

In [15]:
## 10. WICKET TARGET LABELS

def create_wicket_targets(player_match_wickets):
    """Create wicket prediction target labels"""
    print("\n" + "=" * 60)
    print("10. CREATING WICKET TARGET LABELS")
    print("=" * 60)
    
    # Sort by batter and date for target creation
    player_match_wickets = player_match_wickets.sort_values(['batter', 'date'])
    
    # Create target: wickets in next match (binary - will player get out?)
    player_match_wickets['target_next_wicket'] = (player_match_wickets.groupby('batter')['wickets_lost'].shift(-1) > 0).astype(int)
    
    # Create target: probability of getting out (continuous)
    player_match_wickets['target_next_wicket_prob'] = player_match_wickets.groupby('batter')['wickets_lost'].shift(-1)
    
    # Create target: caught dismissal probability
    player_match_wickets['target_next_caught'] = (player_match_wickets.groupby('batter')['caught_dismissals'].shift(-1) > 0).astype(int)
    
    # Remove rows where target is NaN (last match for each player)
    player_match_wickets_clean = player_match_wickets.dropna(subset=['target_next_wicket'])
    
    print(f"Data after creating wicket targets: {player_match_wickets_clean.shape}")
    print(f"Wicket target statistics:")
    print(f"Next match wicket rate: {player_match_wickets_clean['target_next_wicket'].mean():.3f}")
    print(f"Next match caught rate: {player_match_wickets_clean['target_next_caught'].mean():.3f}")
    print(f"Wicket distribution:")
    print(player_match_wickets_clean['target_next_wicket'].value_counts(normalize=True))
    
    return player_match_wickets_clean

# Create wicket target labels
player_match_wickets_clean = create_wicket_targets(player_match_wickets)


10. CREATING WICKET TARGET LABELS
Data after creating wicket targets: (9054, 40)
Wicket target statistics:
Next match wicket rate: 0.745
Next match caught rate: 0.445
Wicket distribution:
target_next_wicket
1    0.744864
0    0.255136
Name: proportion, dtype: float64


In [16]:
## 11. COMBINED FEATURE SELECTION AND PREPROCESSING

def combined_feature_selection_and_preprocessing(player_match_wickets_clean):
    """Select and preprocess all features for both batting and wicket prediction"""
    print("\n" + "=" * 60)
    print("11. COMBINED FEATURE SELECTION AND PREPROCESSING")
    print("=" * 60)
    
    # Batting features (original)
    batting_features = [
        "put_avg", "put_avg_expanding",
        "rolling_avg_5", "rolling_sr_5",
        "venue_avg", "venue_overall_avg",
        "pvp_avg",
        "career_avg", "career_sr", "career_matches",
        "ball", "is_boundary", "is_six", "is_four"
    ]
    
    # New wicket features
    wicket_features = [
        "put_wicket_rate", "put_wicket_rate_expanding",
        "venue_wicket_rate", "venue_overall_wicket_rate", 
        "rolling_wicket_rate_5", "career_wicket_rate",
        "caught_rate", "avg_over_dismissed", "avg_ball_dismissed",
        "team_wickets", "avg_runs_conceded"
    ]
    
    # Combined feature list
    all_features = batting_features + wicket_features
    
    # Handle missing values in features
    for col in all_features:
        if col in player_match_wickets_clean.columns:
            player_match_wickets_clean[col] = player_match_wickets_clean[col].fillna(player_match_wickets_clean[col].median())
    
    # Create feature matrices
    features = player_match_wickets_clean[all_features]
    
    # Batting targets
    labels_runs = player_match_wickets_clean["target_next_runs"]
    labels_sr = player_match_wickets_clean["target_next_sr"]
    
    # Wicket targets
    labels_wicket = player_match_wickets_clean["target_next_wicket"]
    labels_wicket_prob = player_match_wickets_clean["target_next_wicket_prob"]
    labels_caught = player_match_wickets_clean["target_next_caught"]
    
    print(f"Combined features shape: {features.shape}")
    print(f"Batting features: {len(batting_features)}")
    print(f"Wicket features: {len(wicket_features)}")
    print(f"Total features: {len(all_features)}")
    print(f"\nFeature statistics:")
    print(features.describe())
    
    return (features, labels_runs, labels_sr, labels_wicket, labels_wicket_prob, labels_caught, 
            all_features, batting_features, wicket_features)

# Select and preprocess all features
(features, labels_runs, labels_sr, labels_wicket, labels_wicket_prob, labels_caught, 
 all_features, batting_features, wicket_features) = combined_feature_selection_and_preprocessing(player_match_wickets_clean)


11. COMBINED FEATURE SELECTION AND PREPROCESSING
Combined features shape: (9054, 25)
Batting features: 14
Wicket features: 11
Total features: 25

Feature statistics:
           put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  \
count  9054.000000        9054.000000    9054.000000   9054.000000   
mean     19.754829          19.946419      19.972511    107.640728   
std      11.717716          14.720824      12.387729     34.774497   
min       0.000000           0.000000       0.000000      0.000000   
25%      10.666667           8.500000      10.500000     87.228000   
50%      20.000000          18.333333      18.500000    107.078000   
75%      27.000000          28.000000      27.400000    127.743000   
max     120.000000         158.000000     158.000000    400.000000   

         venue_avg  venue_overall_avg      pvp_avg   career_avg    career_sr  \
count  9054.000000        9054.000000  9054.000000  9054.000000  9054.000000   
mean     19.788203          19.327437     

In [17]:
## 12. COMBINED TIME-SERIES TRAIN-TEST SPLIT

def combined_time_series_split(features, labels_runs, labels_sr, labels_wicket, 
                               labels_wicket_prob, labels_caught, player_match_wickets_clean):
    """Perform time-series aware train-test split for all targets"""
    print("\n" + "=" * 60)
    print("12. COMBINED TIME-SERIES TRAIN-TEST SPLIT")
    print("=" * 60)
    
    # Sort by date for proper time-series split
    player_match_wickets_clean = player_match_wickets_clean.sort_values("date")
    
    # Time-series split (80% train, 20% test)
    split_idx = int(len(player_match_wickets_clean) * 0.8)
    
    # Split features
    X_train = features[:split_idx]
    X_test = features[split_idx:]
    
    # Split batting targets
    y_train_runs = labels_runs[:split_idx]
    y_test_runs = labels_runs[split_idx:]
    y_train_sr = labels_sr[:split_idx]
    y_test_sr = labels_sr[split_idx:]
    
    # Split wicket targets
    y_train_wicket = labels_wicket[:split_idx]
    y_test_wicket = labels_wicket[split_idx:]
    y_train_wicket_prob = labels_wicket_prob[:split_idx]
    y_test_wicket_prob = labels_wicket_prob[split_idx:]
    y_train_caught = labels_caught[:split_idx]
    y_test_caught = labels_caught[split_idx:]
    
    print(f"Train set size: {len(X_train)} ({len(X_train)/len(features)*100:.1f}%)")
    print(f"Test set size: {len(X_test)} ({len(X_test)/len(features)*100:.1f}%)")
    print(f"Train date range: {player_match_wickets_clean.iloc[:split_idx]['date'].min()} to {player_match_wickets_clean.iloc[:split_idx]['date'].max()}")
    print(f"Test date range: {player_match_wickets_clean.iloc[split_idx:]['date'].min()} to {player_match_wickets_clean.iloc[split_idx:]['date'].max()}")
    
    print(f"\nTarget distributions in train set:")
    print(f"Runs - Mean: {y_train_runs.mean():.2f}, Std: {y_train_runs.std():.2f}")
    print(f"Strike Rate - Mean: {y_train_sr.mean():.2f}, Std: {y_train_sr.std():.2f}")
    print(f"Wicket Rate: {y_train_wicket.mean():.3f}")
    print(f"Caught Rate: {y_train_caught.mean():.3f}")
    
    return (X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr,
            y_train_wicket, y_test_wicket, y_train_wicket_prob, y_test_wicket_prob,
            y_train_caught, y_test_caught)

# Perform combined time-series split
(X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr,
 y_train_wicket, y_test_wicket, y_train_wicket_prob, y_test_wicket_prob,
 y_train_caught, y_test_caught) = combined_time_series_split(
    features, labels_runs, labels_sr, labels_wicket, labels_wicket_prob, labels_caught, player_match_wickets_clean
)


12. COMBINED TIME-SERIES TRAIN-TEST SPLIT
Train set size: 7243 (80.0%)
Test set size: 1811 (20.0%)
Train date range: 2008-04-18 00:00:00 to 2015-05-09 00:00:00
Test date range: 2015-05-09 00:00:00 to 2017-05-19 00:00:00

Target distributions in train set:
Runs - Mean: 19.37, Std: 20.64
Strike Rate - Mean: 106.80, Std: 64.22
Wicket Rate: 0.740
Caught Rate: 0.436


In [18]:
## 13. COMBINED FEATURE PIPELINE

def create_combined_feature_pipeline(X_train, X_test, all_features):
    """Create and apply feature preprocessing pipeline for all features"""
    print("\n" + "=" * 60)
    print("13. CREATING COMBINED FEATURE PIPELINE")
    print("=" * 60)
    
    # Create preprocessing pipeline
    combined_feature_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])
    
    # Fit pipeline on training data
    X_train_scaled = combined_feature_pipeline.fit_transform(X_train)
    X_test_scaled = combined_feature_pipeline.transform(X_test)
    
    # Convert back to DataFrame for easier handling
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=all_features)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=all_features)
    
    print("Combined feature pipeline created and applied:")
    print(f"Scaled training data shape: {X_train_scaled.shape}")
    print(f"Scaled test data shape: {X_test_scaled.shape}")
    print(f"Sample scaled features (first 5):")
    print(X_train_scaled.iloc[:, :5].head())
    
    return combined_feature_pipeline, X_train_scaled, X_test_scaled

# Create combined feature pipeline
combined_feature_pipeline, X_train_scaled, X_test_scaled = create_combined_feature_pipeline(X_train, X_test, all_features)


13. CREATING COMBINED FEATURE PIPELINE
Combined feature pipeline created and applied:
Scaled training data shape: (7243, 25)
Scaled test data shape: (1811, 25)
Sample scaled features (first 5):
    put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  venue_avg
0 -0.502035          -0.749183      -0.780673     -0.207564  -0.709098
1 -0.374855          -0.084118      -1.063708     -0.207564   0.003366
2 -0.600954          -0.444831      -1.023275     -0.207564  -0.825001
3 -0.600954          -0.489921      -0.962624      0.861467  -0.825001
4 -0.714003          -0.489921      -1.023275      0.533631  -0.825001


In [19]:
## 14. SAVING COMBINED ARTIFACTS

def save_combined_artifacts(combined_feature_pipeline, final_dataset, X_train_scaled, X_test_scaled,
                           y_train_runs, y_test_runs, y_train_sr, y_test_sr,
                           y_train_wicket, y_test_wicket, y_train_wicket_prob, y_test_wicket_prob,
                           y_train_caught, y_test_caught, all_features, batting_features, wicket_features):
    """Save all combined artifacts for both batting and wicket prediction"""
    print("\n" + "=" * 60)
    print("14. SAVING COMBINED ARTIFACTS")
    print("=" * 60)
    
    # Save the combined feature pipeline
    joblib.dump(combined_feature_pipeline, "combined_feature_pipeline.pkl")
    
    # Save the final dataset with all features and targets
    final_dataset.to_csv("data/combined_dataset.csv", index=False)
    
    # Save combined training data with all targets
    train_data_combined = pd.concat([
        X_train_scaled,
        y_train_runs.rename('target_runs'),
        y_train_sr.rename('target_sr'),
        y_train_wicket.rename('target_wicket'),
        y_train_wicket_prob.rename('target_wicket_prob'),
        y_train_caught.rename('target_caught')
    ], axis=1)
    train_data_combined.to_csv("data/combined_train_data.csv", index=False)
    
    # Save combined test data with all targets
    test_data_combined = pd.concat([
        X_test_scaled,
        y_test_runs.rename('target_runs'),
        y_test_sr.rename('target_sr'),
        y_test_wicket.rename('target_wicket'),
        y_test_wicket_prob.rename('target_wicket_prob'),
        y_test_caught.rename('target_caught')
    ], axis=1)
    test_data_combined.to_csv("data/combined_test_data.csv", index=False)
    
    # Save feature lists for reference
    feature_info = {
        'all_features': all_features,
        'batting_features': batting_features,
        'wicket_features': wicket_features,
        'total_features': len(all_features)
    }
    
    import json
    with open("feature_info.json", "w") as f:
        json.dump(feature_info, f, indent=2)
    
    print("Combined artifacts saved successfully:")
    print("- combined_feature_pipeline.pkl: Preprocessing pipeline for all features")
    print("- combined_dataset.csv: Complete dataset with batting and wicket features")
    print("- combined_train_data.csv: Training set with all targets")
    print("- combined_test_data.csv: Test set with all targets")
    print("- feature_info.json: Feature configuration and metadata")
    
    print(f"\nFinal Summary:")
    print(f"- Total features: {len(all_features)} ({len(batting_features)} batting + {len(wicket_features)} wicket)")
    print(f"- Training samples: {len(X_train_scaled)}")
    print(f"- Test samples: {len(X_test_scaled)}")
    print(f"- Targets available: runs, strike rate, wicket (binary), wicket probability, caught dismissal")

# Save all combined artifacts
save_combined_artifacts(
    combined_feature_pipeline, player_match_wickets_clean, X_train_scaled, X_test_scaled,
    y_train_runs, y_test_runs, y_train_sr, y_test_sr,
    y_train_wicket, y_test_wicket, y_train_wicket_prob, y_test_wicket_prob,
    y_train_caught, y_test_caught, all_features, batting_features, wicket_features
)


14. SAVING COMBINED ARTIFACTS


Combined artifacts saved successfully:
- combined_feature_pipeline.pkl: Preprocessing pipeline for all features
- combined_dataset.csv: Complete dataset with batting and wicket features
- combined_train_data.csv: Training set with all targets
- combined_test_data.csv: Test set with all targets
- feature_info.json: Feature configuration and metadata

Final Summary:
- Total features: 25 (14 batting + 11 wicket)
- Training samples: 7243
- Test samples: 1811
- Targets available: runs, strike rate, wicket (binary), wicket probability, caught dismissal
