# AutoML Series: IPL Score Prediction Using TPOT

#### IPL Score Prediction - Model Training Notebook

#### 📌 Project Overview
This project aims to predict the final **first-innings score** of an IPL match from the match state at **any given over**, using **TPOT (Tree-based Pipeline Optimization Tool)**. TPOT automates the search for a machine learning pipeline, covering feature selection, model selection, and hyperparameter tuning. The final model will be integrated into a **pipeline** and a **web app** for real-time predictions.

#### 📊 Dataset Description
The dataset consists of two CSV files:
1. **matches.csv** – Contains match-level details like teams, venue, toss details, and final scores.
2. **deliveries.csv** – Contains ball-by-ball data, including runs scored, wickets, bowler, and batsman details.

#### 🔧 Data Preprocessing
- **Merging datasets** to map match information to each delivery.
- **Feature engineering**, including:
  - Current run rate (CRR)
  - Wickets lost
  - Batsman and bowler statistics
  - Venue and toss impact
- **Handling missing values** and data imbalances.
- **Encoding categorical variables** (e.g., team names, venues).
- **Train-test split** for model training.
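
The encoding step matters beyond training: the web app has to turn a team or venue name into the same integer the model saw. Below is a minimal sketch of one way to keep that mapping reusable. The helper `build_mapping` and the team names are illustrative assumptions, and JSON is just one possible storage format. Codes are assigned in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes.

```python
import json

def build_mapping(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {name: code for code, name in enumerate(sorted(set(values)))}

# Illustrative values only; the real ones come from the dataset columns.
teams = ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians"]
team_to_code = build_mapping(teams)

# Serialise the mapping so a separate prediction script can reload it.
saved = json.dumps(team_to_code)
restored = json.loads(saved)

encoded = [restored[t] for t in teams]
print(encoded)  # [1, 0, 1]
```

Storing one mapping per categorical column keeps the training and prediction sides consistent.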

This notebook handles **dataset preprocessing** and **model training** using TPOT.  
It finds the best machine learning pipeline for predicting the **final score** at any given over.

#### Note:
- This is the **model training script**.
- There is a separate **Streamlit script** for deploying a web app to make predictions.




Dataset credits: https://www.kaggle.com/datasets/patrickb1912/ipl-complete-dataset-20082020

In [None]:
# Importing libraries
import pandas as pd

# Import the AutoML package (install it first, e.g. `pip install tpot`)
from tpot import TPOTRegressor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
# Load the CSV files
# Note: I tend to restructure my notebooks, so verify these paths before running
# Forward slashes avoid invalid escape sequences such as '\d' and work on Windows too
deliveries = pd.read_csv('../datasets/ipl_2008_2024/deliveries.csv')
matches = pd.read_csv('../datasets/ipl_2008_2024/matches.csv')

In [None]:
# Show the first few rows of each DataFrame to understand the structure
print("Deliveries DataFrame:")
deliveries.head()

In [None]:
print("\nMatches DataFrame:")
matches.head()


In [None]:
deliveries.isnull().sum()

In [None]:
matches.isnull().sum()

In [9]:
# Merge datasets on match_id
ipl_data = deliveries.merge(matches, left_on='match_id', right_on='id', how='left')
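
Before relying on the merged frame, it can help to confirm that every delivery found exactly one match row. The sketch below uses toy stand-ins for the two CSV files (all values are made up) together with the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for deliveries.csv and matches.csv (values are made up).
deliveries_toy = pd.DataFrame({
    "match_id": [1, 1, 2, 3],
    "total_runs": [4, 0, 1, 6],
})
matches_toy = pd.DataFrame({
    "id": [1, 2],
    "venue": ["Venue A", "Venue B"],
})

merged = deliveries_toy.merge(
    matches_toy,
    left_on="match_id",
    right_on="id",
    how="left",
    validate="many_to_one",  # raises MergeError if `id` is duplicated in matches
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Deliveries whose match_id has no row in matches
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))  # 4 1
```

An empty `unmatched` frame on the real data would mean no delivery lost its match-level information in the left join.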

In [None]:
# Filter first innings data
first_innings = ipl_data[ipl_data['inning'] == 1]

# Aggregate features at each over level
overwise_data = first_innings.groupby(['match_id', 'batting_team', 'bowling_team', 'over']).agg(
    total_runs=('total_runs', 'sum'),
    wickets=('is_wicket', 'sum')
).reset_index()

overwise_data.head()

In [None]:



# Compute cumulative features
overwise_data['cumulative_runs'] = overwise_data.groupby('match_id')['total_runs'].cumsum()
overwise_data['cumulative_wickets'] = overwise_data.groupby('match_id')['wickets'].cumsum()

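# Current run rate: runs so far divided by overs bowled so far
# (assumes `over` is 0-indexed, hence the + 1)
overwise_data['run_rate'] = overwise_data['cumulative_runs'] / (overwise_data['over'] + 1)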

# Merge with match-level data for additional features
overwise_data = overwise_data.merge(matches[['id', 'venue']], 
                                    left_on='match_id', right_on='id', how='left')

# Target Variable: Final first-innings score
target_scores = first_innings.groupby('match_id')['total_runs'].sum().reset_index()
target_scores.rename(columns={'total_runs': 'final_score'}, inplace=True)

overwise_data = overwise_data.merge(target_scores, on='match_id', how='left')

# Save preprocessed data
overwise_data.to_csv("../datasets/ipl_2008_2024/preprocessed_ipl_data.csv", index=False)
print("Preprocessing complete. Data saved to 'preprocessed_ipl_data.csv'")
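
To make the cumulative features concrete, here is a small worked example in plain Python. The runs and wickets are made up, and the overs are assumed to be 0-indexed, as the `over + 1` in the run-rate formula suggests.

```python
from itertools import accumulate

# Made-up runs and wickets for the first four overs of one innings.
overs = [0, 1, 2, 3]
runs_per_over = [6, 10, 4, 12]
wickets_per_over = [0, 1, 0, 1]

# Running totals, equivalent to a per-match cumsum
cumulative_runs = list(accumulate(runs_per_over))
cumulative_wickets = list(accumulate(wickets_per_over))

# Runs so far divided by overs bowled so far
run_rate = [runs / (over + 1) for runs, over in zip(cumulative_runs, overs)]

print(cumulative_runs)     # [6, 16, 20, 32]
print(cumulative_wickets)  # [0, 1, 1, 2]
print(run_rate)            # [6.0, 8.0, 6.666666666666667, 8.0]
```

Each row of the preprocessed data therefore describes the innings as it stood at the end of one over, paired with the final score of that innings as the target.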


In [None]:
# Load preprocessed data
data = pd.read_csv("../datasets/ipl_2008_2024/preprocessed_ipl_data.csv")

# Encode categorical features, keeping one fitted encoder per column so the
# mappings stay available later (a single shared encoder is overwritten on each fit)
encoders = {}
for col in ['batting_team', 'bowling_team', 'venue']:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

# Define input features (X) and target variable (Y)
features = ['over', 'cumulative_runs', 'cumulative_wickets', 'run_rate', 'batting_team', 'bowling_team', 'venue']
X = data[features]
Y = data['final_score']

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Initialize and train TPOT model
tpot = TPOTRegressor(generations=5, population_size=20, max_time_mins=30, n_jobs=4, random_state=42)
tpot.fit(X_train, Y_train)

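# Evaluate model performance on the held-out test set
# (with classic TPOT's default regression scoring, negative mean squared error,
# values closer to 0 are better)
print("Best model test score:", tpot.score(X_test, Y_test))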

# Export the best model pipeline
tpot.export("best_tpot_pipeline.py")

print("TPOT training complete. Best model saved as 'best_tpot_pipeline.py'")
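
One caveat with the random row-level split used above: each match contributes up to 20 rows that share the same `final_score`, so rows from one match can land in both the training and test sets, which may make the test score look better than it would on unseen matches. Below is a minimal sketch of splitting by match instead, using only the standard library. `split_by_match` is a hypothetical helper, not part of the notebook; scikit-learn's `GroupShuffleSplit` offers the same idea as a library class.

```python
import random

def split_by_match(match_ids, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that no match appears on both sides."""
    unique_ids = sorted(set(match_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    test_matches = set(unique_ids[:n_test])
    train_idx = [i for i, m in enumerate(match_ids) if m not in test_matches]
    test_idx = [i for i, m in enumerate(match_ids) if m in test_matches]
    return train_idx, test_idx

# Toy example: 5 matches with 3 over-level rows each (ids are made up).
match_ids = [m for m in range(1, 6) for _ in range(3)]
train_idx, test_idx = split_by_match(match_ids)

train_matches = {match_ids[i] for i in train_idx}
test_matches = {match_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))  # 12 3
```

With the notebook's data, the returned positions could be applied as `X.iloc[train_idx]` and `Y.iloc[train_idx]`, assuming `data['match_id']` is passed in as `match_ids`.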

#### Now that the model has been built, it's time to deploy it

Refer to the Python script with a name similar to this notebook's.