# IPL Cricket Player Performance Prediction - Feature Engineering

This notebook performs comprehensive feature engineering for IPL cricket player performance prediction.

## Pipeline Overview:
1. Load and prepare data
2. Aggregate to player-match level
3. Engineer advanced features (PUT, rolling, venue, PvP, career)
4. Create target labels
5. Feature selection and preprocessing
6. Time-series aware train-test split
7. Create feature pipeline
8. Save artifacts

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import joblib
import warnings
warnings.filterwarnings("ignore")

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Loading and Preparing Data

In [2]:
def load_and_prepare_data():
    """Load merged data and prepare for feature engineering"""
    print("=" * 60)
    print("1. LOADING AND PREPARING DATA")
    print("=" * 60)
    
    # Load merged data
    df = pd.read_csv("merged_data.csv")
    
    # Convert date column to datetime
    df["date"] = pd.to_datetime(df["date"])
    
    # Rename batsman to batter for consistency
    df["batter"] = df["batsman"]
    
    print(f"Data loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Date range: {df["date"].min()} to {df["date"].max()}")
    print(f"Unique batters: {df["batter"].nunique()}")
    print(f"Unique venues: {df["venue"].nunique()}")
    print(f"Unique bowling teams: {df["bowling_team"].nunique()}")
    
    return df

# Load the data
df = load_and_prepare_data()

1. LOADING AND PREPARING DATA
Data loaded successfully!
Shape: (145967, 41)
Date range: 2008-04-18 00:00:00 to 2017-05-19 00:00:00
Unique batters: 406
Unique venues: 35
Unique bowling teams: 13


## 2. Aggregating to Player-Match Level

In [3]:
def aggregate_to_player_match(df):
    """Aggregate ball-by-ball data to player-match level"""
    print("\n" + "=" * 60)
    print("2. AGGREGATING TO PLAYER-MATCH LEVEL")
    print("=" * 60)
    
    # Aggregate to player-match level
    player_match = df.groupby(["batter", "match_id", "date", "venue", "bowling_team"]).agg({
        "batsman_runs": "sum",
        "ball": "count",
        "is_wicket": "sum",
        "is_boundary": "sum",
        "is_six": "sum",
        "is_four": "sum"
    }).reset_index()
    
    # Calculate strike rate
    player_match["strike_rate"] = (player_match["batsman_runs"] / player_match["ball"] * 100).round(2)
    
    # Sort by batter and date for time-series operations
    player_match = player_match.sort_values(["batter", "date"])
    
    print(f"Player-match level data shape: {player_match.shape}")
    print(f"Sample data:")
    print(player_match.head())
    
    return player_match

# Aggregate data to player-match level
player_match = aggregate_to_player_match(df)


2. AGGREGATING TO PLAYER-MATCH LEVEL
Player-match level data shape: (9054, 12)
Sample data:
           batter  match_id       date  \
0  A Ashish Reddy       346 2012-04-29   
1  A Ashish Reddy       352 2012-05-04   
2  A Ashish Reddy       359 2012-05-08   
3  A Ashish Reddy       373 2012-05-18   
4  A Ashish Reddy       376 2012-05-20   

                                       venue                 bowling_team  \
0                           Wankhede Stadium               Mumbai Indians   
1            MA Chidambaram Stadium, Chepauk          Chennai Super Kings   
2  Rajiv Gandhi International Stadium, Uppal                 Punjab Kings   
3  Rajiv Gandhi International Stadium, Uppal             Rajasthan Royals   
4  Rajiv Gandhi International Stadium, Uppal  Royal Challengers Bangalore   

   batsman_runs  ball  is_wicket  is_boundary  is_six  is_four  strike_rate  
0            10    10          1            1       0        1        100.0  
1             3     3          1   

## 3.1 Engineering PUT Features (Player vs Team)

In [4]:
def engineer_put_features(player_match):
    """Create Player vs Team (PUT) features"""
    print("\n" + "=" * 60)
    print("3.1 ENGINEERING PUT FEATURES")
    print("=" * 60)
    
    # PUT (Player vs Team) average - overall average against bowling team
    player_match["put_avg"] = player_match.groupby(["batter", "bowling_team"])["batsman_runs"].transform("mean")
    
    # PUT average with expanding mean (cumulative performance)
    put_expanding = player_match.groupby(["batter", "bowling_team"])["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    # Reset index to match original dataframe
    put_expanding = put_expanding.reset_index(level=[0,1], drop=True)
    player_match["put_avg_expanding"] = put_expanding
    
    print("PUT features created:")
    print(player_match[["batter", "bowling_team", "batsman_runs", "put_avg", "put_avg_expanding"]].head())
    
    return player_match

# Create PUT features
player_match = engineer_put_features(player_match)


3.1 ENGINEERING PUT FEATURES
PUT features created:
           batter                 bowling_team  batsman_runs    put_avg  \
0  A Ashish Reddy               Mumbai Indians            10  13.500000   
1  A Ashish Reddy          Chennai Super Kings             3  15.000000   
2  A Ashish Reddy                 Punjab Kings             8  12.333333   
3  A Ashish Reddy             Rajasthan Royals            10  12.333333   
4  A Ashish Reddy  Royal Challengers Bangalore             4  13.250000   

   put_avg_expanding  
0           8.500000  
1                NaN  
2          13.000000  
3          12.333333  
4          12.333333  


## 3.2 Engineering Rolling Form Features

In [5]:
def engineer_rolling_features(player_match):
    """Create rolling form features (last 5 matches)"""
    print("\n" + "=" * 60)
    print("3.2 ENGINEERING ROLLING FORM FEATURES")
    print("=" * 60)
    
    # Rolling average of last 5 matches
    player_match["rolling_avg_5"] = player_match.groupby("batter")["batsman_runs"]\
        .rolling(5, min_periods=1).mean().reset_index(level=0, drop=True)
    
    # Rolling strike rate of last 5 matches
    player_match["rolling_sr_5"] = player_match.groupby("batter")["strike_rate"]\
        .rolling(5, min_periods=1).mean().reset_index(level=0, drop=True)
    
    print("Rolling form features created:")
    print(player_match[["batter", "date", "batsman_runs", "rolling_avg_5", "rolling_sr_5"]].head(10))
    
    return player_match

# Create rolling form features
player_match = engineer_rolling_features(player_match)


3.2 ENGINEERING ROLLING FORM FEATURES
Rolling form features created:
            batter       date  batsman_runs  rolling_avg_5  rolling_sr_5
0   A Ashish Reddy 2012-04-29            10          10.00       100.000
1   A Ashish Reddy 2012-05-04             3           6.50       100.000
2   A Ashish Reddy 2012-05-08             8           7.00       100.000
3   A Ashish Reddy 2012-05-18            10           7.75       137.500
4   A Ashish Reddy 2012-05-20             4           7.00       126.000
5   A Ashish Reddy 2013-04-05             7           6.40       141.000
6   A Ashish Reddy 2013-04-07            14           8.60       144.334
14  A Ashish Reddy 2013-04-09             3           7.60       139.334
7   A Ashish Reddy 2013-04-12            16           8.80       124.890
8   A Ashish Reddy 2013-04-14             4           8.80       124.890


## 3.3 Engineering Venue Features

In [6]:
def engineer_venue_features(player_match):
    """Create venue average features"""
    print("\n" + "=" * 60)
    print("3.3 ENGINEERING VENUE FEATURES")
    print("=" * 60)
    
    # Venue average for each batter
    player_match["venue_avg"] = player_match.groupby(["batter", "venue"])["batsman_runs"].transform("mean")
    
    # Overall venue average
    player_match["venue_overall_avg"] = player_match.groupby("venue")["batsman_runs"].transform("mean")
    
    print("Venue features created:")
    print(player_match[["batter", "venue", "batsman_runs", "venue_avg", "venue_overall_avg"]].head())
    
    return player_match

# Create venue features
player_match = engineer_venue_features(player_match)


3.3 ENGINEERING VENUE FEATURES
Venue features created:
           batter                                      venue  batsman_runs  \
0  A Ashish Reddy                           Wankhede Stadium            10   
1  A Ashish Reddy            MA Chidambaram Stadium, Chepauk             3   
2  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal             8   
3  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal            10   
4  A Ashish Reddy  Rajiv Gandhi International Stadium, Uppal             4   

   venue_avg  venue_overall_avg  
0       10.0          19.918951  
1       19.5          20.362216  
2        9.1          20.465774  
3        9.1          20.465774  
4        9.1          20.465774  


## 3.4 Engineering PvP Features (Player vs Player)

In [7]:
def engineer_pvp_features(player_match, df):
    """Create Player vs Player (PvP) features"""
    print("\n" + "=" * 60)
    print("3.4 ENGINEERING PVP FEATURES")
    print("=" * 60)
    
    # For PvP, we need to consider the specific bowlers faced
    # First, get ball-by-ball data with bowler information
    pvp_data = df.groupby(["batter", "bowler", "match_id"])["batsman_runs"].sum().reset_index()
    pvp_data = pvp_data.sort_values(["batter", "match_id"])
    
    # Calculate PvP expanding average
    pvp_expanding = pvp_data.groupby(["batter", "bowler"])["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    pvp_expanding = pvp_expanding.reset_index(level=[0,1], drop=True)
    pvp_data["pvp_avg"] = pvp_expanding
    
    # Merge back to player-match level (take average of all bowlers faced in the match)
    pvp_match_avg = pvp_data.groupby(["batter", "match_id"])["pvp_avg"].mean().reset_index()
    player_match = player_match.merge(pvp_match_avg, on=["batter", "match_id"], how="left")
    
    print("PvP features created:")
    print(player_match[["batter", "match_id", "batsman_runs", "pvp_avg"]].head())
    
    return player_match

# Create PvP features
player_match = engineer_pvp_features(player_match, df)


3.4 ENGINEERING PVP FEATURES
PvP features created:
           batter  match_id  batsman_runs  pvp_avg
0  A Ashish Reddy       346            10      NaN
1  A Ashish Reddy       352             3    10.00
2  A Ashish Reddy       359             8     6.50
3  A Ashish Reddy       373            10     7.00
4  A Ashish Reddy       376             4     7.75


## 3.5 Engineering Career Features

In [8]:
def engineer_career_features(player_match):
    """Create career average features"""
    print("\n" + "=" * 60)
    print("3.5 ENGINEERING CAREER FEATURES")
    print("=" * 60)
    
    # Career average using expanding mean
    career_avg_expanding = player_match.groupby("batter")["batsman_runs"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    career_avg_expanding = career_avg_expanding.reset_index(level=0, drop=True)
    player_match["career_avg"] = career_avg_expanding
    
    # Career strike rate using expanding mean
    career_sr_expanding = player_match.groupby("batter")["strike_rate"]\
        .expanding().mean().shift(1, fill_value=np.nan)
    career_sr_expanding = career_sr_expanding.reset_index(level=0, drop=True)
    player_match["career_sr"] = career_sr_expanding
    
    # Career matches played
    player_match["career_matches"] = player_match.groupby("batter").cumcount()
    
    print("Career features created:")
    print(player_match[["batter", "date", "batsman_runs", "career_avg", "career_sr", "career_matches"]].head(10))
    
    return player_match

# Create career features
player_match = engineer_career_features(player_match)


3.5 ENGINEERING CAREER FEATURES
Career features created:
           batter       date  batsman_runs  career_avg   career_sr  \
0  A Ashish Reddy 2012-04-29            10         NaN         NaN   
1  A Ashish Reddy 2012-05-04             3   10.000000  100.000000   
2  A Ashish Reddy 2012-05-08             8    6.500000  100.000000   
3  A Ashish Reddy 2012-05-18            10    7.000000  100.000000   
4  A Ashish Reddy 2012-05-20             4    7.750000  137.500000   
5  A Ashish Reddy 2013-04-05             7    7.000000  126.000000   
6  A Ashish Reddy 2013-04-07            14    7.000000  134.166667   
7  A Ashish Reddy 2013-04-09             3    8.000000  131.667143   
8  A Ashish Reddy 2013-04-12            16    7.375000  124.583750   
9  A Ashish Reddy 2013-04-14             4    8.333333  130.494444   

   career_matches  
0               0  
1               1  
2               2  
3               3  
4               4  
5               5  
6               6  
7          

## 4. Creating Target Labels

In [9]:
def create_target_labels(player_match):
    """Create target labels (next match performance)"""
    print("\n" + "=" * 60)
    print("4. CREATING TARGET LABELS")
    print("=" * 60)
    
    # Create target: runs in next match
    player_match = player_match.sort_values(["batter", "date"])
    player_match["target_next_runs"] = player_match.groupby("batter")["batsman_runs"].shift(-1)
    
    # Create target: strike rate in next match
    player_match["target_next_sr"] = player_match.groupby("batter")["strike_rate"].shift(-1)
    
    # Remove rows where target is NaN (last match for each player)
    player_match_clean = player_match.dropna(subset=["target_next_runs"])
    
    print(f"Data after creating targets: {player_match_clean.shape}")
    print(f"Target statistics:")
    print(f"Next match runs - Mean: {player_match_clean["target_next_runs"].mean():.2f}, Std: {player_match_clean["target_next_runs"].std():.2f}")
    print(f"Next match SR - Mean: {player_match_clean["target_next_sr"].mean():.2f}, Std: {player_match_clean["target_next_sr"].std():.2f}")
    
    return player_match_clean

# Create target labels
player_match_clean = create_target_labels(player_match)


4. CREATING TARGET LABELS
Data after creating targets: (8648, 24)
Target statistics:
Next match runs - Mean: 20.18, Std: 21.04
Next match SR - Mean: 107.75, Std: 61.80


## 5. Feature Selection and Preprocessing

In [10]:
def feature_selection_and_preprocessing(player_match_clean):
    """Select features and handle preprocessing"""
    print("\n" + "=" * 60)
    print("5. FEATURE SELECTION AND PREPROCESSING")
    print("=" * 60)
    
    # Select features for modeling
    feature_columns = [
        "put_avg", "put_avg_expanding",
        "rolling_avg_5", "rolling_sr_5",
        "venue_avg", "venue_overall_avg",
        "pvp_avg",
        "career_avg", "career_sr", "career_matches",
        "ball", "is_boundary", "is_six", "is_four"
    ]
    
    # Handle missing values in features
    for col in feature_columns:
        if col in player_match_clean.columns:
            player_match_clean[col] = player_match_clean[col].fillna(player_match_clean[col].median())
    
    features = player_match_clean[feature_columns]
    labels_runs = player_match_clean["target_next_runs"]
    labels_sr = player_match_clean["target_next_sr"]
    
    print(f"Features shape: {features.shape}")
    print(f"Features selected: {feature_columns}")
    print(f"Feature statistics:")
    print(features.describe())
    
    return features, labels_runs, labels_sr, feature_columns

# Select features and preprocess
features, labels_runs, labels_sr, feature_columns = feature_selection_and_preprocessing(player_match_clean)


5. FEATURE SELECTION AND PREPROCESSING
Features shape: (8648, 14)
Features selected: ['put_avg', 'put_avg_expanding', 'rolling_avg_5', 'rolling_sr_5', 'venue_avg', 'venue_overall_avg', 'pvp_avg', 'career_avg', 'career_sr', 'career_matches', 'ball', 'is_boundary', 'is_six', 'is_four']
Feature statistics:
           put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  \
count  8648.000000        8648.000000    8648.000000   8648.000000   
mean     20.186318          20.327960      20.334908    107.868005   
std      11.678431          14.778165      12.388231     33.612365   
min       0.000000           0.000000       0.000000      0.000000   
25%      11.285714           9.000000      11.000000     87.913000   
50%      20.266667          19.000000      19.000000    107.495000   
75%      27.428571          28.333333      27.800000    127.924500   
max     120.000000         158.000000     158.000000    366.670000   

         venue_avg  venue_overall_avg      pvp_avg   career_avg

## 6. Time-Series Aware Train-Test Split

In [11]:
def time_series_split(features, labels_runs, labels_sr, player_match_clean):
    """Perform time-series aware train-test split"""
    print("\n" + "=" * 60)
    print("6. TIME-SERIES AWARE TRAIN-TEST SPLIT")
    print("=" * 60)
    
    # Sort by date for proper time-series split
    player_match_clean = player_match_clean.sort_values("date")
    
    # Time-series split (80% train, 20% test)
    split_idx = int(len(player_match_clean) * 0.8)
    
    X_train = features[:split_idx]
    X_test = features[split_idx:]
    y_train_runs = labels_runs[:split_idx]
    y_test_runs = labels_runs[split_idx:]
    y_train_sr = labels_sr[:split_idx]
    y_test_sr = labels_sr[split_idx:]
    
    print(f"Train set size: {len(X_train)} ({len(X_train)/len(features)*100:.1f}%)")
    print(f"Test set size: {len(X_test)} ({len(X_test)/len(features)*100:.1f}%)")
    print(f"Train date range: {player_match_clean.iloc[:split_idx]["date"].min()} to {player_match_clean.iloc[:split_idx]["date"].max()}")
    print(f"Test date range: {player_match_clean.iloc[split_idx:]["date"].min()} to {player_match_clean.iloc[split_idx:]["date"].max()}")
    
    return X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr

# Perform time-series split
X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr = time_series_split(
    features, labels_runs, labels_sr, player_match_clean
)


6. TIME-SERIES AWARE TRAIN-TEST SPLIT
Train set size: 6918 (80.0%)
Test set size: 1730 (20.0%)
Train date range: 2008-04-18 00:00:00 to 2015-05-04 00:00:00
Test date range: 2015-05-04 00:00:00 to 2017-05-16 00:00:00


## 7. Creating Feature Pipeline

In [12]:
def create_feature_pipeline(X_train, X_test, feature_columns):
    """Create and apply feature preprocessing pipeline"""
    print("\n" + "=" * 60)
    print("7. CREATING FEATURE PIPELINE")
    print("=" * 60)
    
    # Create preprocessing pipeline
    feature_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])
    
    # Fit pipeline on training data
    X_train_scaled = feature_pipeline.fit_transform(X_train)
    X_test_scaled = feature_pipeline.transform(X_test)
    
    # Convert back to DataFrame for easier handling
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)
    
    print("Feature pipeline created and applied:")
    print(f"Scaled training data shape: {X_train_scaled.shape}")
    print(f"Scaled test data shape: {X_test_scaled.shape}")
    print(f"Sample scaled features:")
    print(X_train_scaled.head())
    
    return feature_pipeline, X_train_scaled, X_test_scaled

# Create feature pipeline
feature_pipeline, X_train_scaled, X_test_scaled = create_feature_pipeline(X_train, X_test, feature_columns)


7. CREATING FEATURE PIPELINE
Feature pipeline created and applied:
Scaled training data shape: (6918, 14)
Scaled test data shape: (1730, 14)
Sample scaled features:
    put_avg  put_avg_expanding  rolling_avg_5  rolling_sr_5  venue_avg  \
0 -0.537849          -0.769929      -0.809109     -0.221044  -0.737760   
1 -0.410843          -0.063189      -1.091993     -0.221044  -0.027581   
2 -0.636631          -0.467041      -1.051581     -0.221044  -0.805040   
3 -0.636631          -0.511913      -0.990963      0.890980  -0.805040   
4 -0.559016          -0.511913      -1.051581      0.549959  -0.805040   

   venue_overall_avg   pvp_avg  career_avg  career_sr  career_matches  \
0           0.057440  0.128962    0.111646   0.115105       -1.017722   
1           0.338648 -1.040630   -0.997523  -0.270829       -0.985256   
2           0.404345 -1.390433   -1.345916  -0.270829       -0.952790   
3           0.404345 -1.340462   -1.296146  -0.270829       -0.920324   
4           0.404345 -1.

## 8. Saving Artifacts

In [13]:
def save_artifacts(feature_pipeline, final_dataset, X_train, X_test, y_train_runs, y_test_runs, y_train_sr, y_test_sr, feature_columns):
    """Save all artifacts"""
    print("\n" + "=" * 60)
    print("8. SAVING ARTIFACTS")
    print("=" * 60)
    
    # Save the feature pipeline
    joblib.dump(feature_pipeline, "feature_pipeline.pkl")
    
    # Save the final dataset with all features and targets
    final_dataset.to_csv("dataset.csv", index=False)
    
    # Save train-test splits
    train_data = pd.concat([X_train, y_train_runs, y_train_sr], axis=1)
    train_data.columns = feature_columns + ["target_runs", "target_sr"]
    train_data.to_csv("train_data.csv", index=False)
    
    test_data = pd.concat([X_test, y_test_runs, y_test_sr], axis=1)
    test_data.columns = feature_columns + ["target_runs", "target_sr"]
    test_data.to_csv("test_data.csv", index=False)
    
    print("Artifacts saved successfully:")
    print("- feature_pipeline.pkl: Preprocessing pipeline")
    print("- dataset.csv: Complete feature-engineered dataset")
    print("- train_data.csv: Training set")
    print("- test_data.csv: Test set")

# Save all artifacts
save_artifacts(
    feature_pipeline, player_match_clean, X_train, X_test, 
    y_train_runs, y_test_runs, y_train_sr, y_test_sr, feature_columns
)


8. SAVING ARTIFACTS
Artifacts saved successfully:
- feature_pipeline.pkl: Preprocessing pipeline
- dataset.csv: Complete feature-engineered dataset
- train_data.csv: Training set
- test_data.csv: Test set
