**Problem Statement** -
The objective of this project is to build a predictive model that estimates the target runs in an IPL match. Predicting target runs can help teams strategize their innings effectively and optimize their gameplay based on historical match data.


**Understanding the Problem** -

•	In a T20 cricket match, the target runs are influenced by various factors, such as team performance, toss decision, venue conditions, and previous match statistics.

•	Using machine learning, we can analyze past IPL match data to uncover patterns and trends that determine the target runs for a given match.

•	By leveraging feature engineering and model tuning, we can improve prediction accuracy.

**What I am Predicting**  -

•	The target variable for this project is target_runs, which represents the number of runs a team must chase to win.

•	The features used for prediction include:
o	Match details (venue, date, teams playing)
o	Toss information (which team won the toss and the decision)
o	First innings performance (runs scored, wickets lost)
o	Other contextual factors (weather, pitch conditions, etc.)


**Importing the libariers**

In [82]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import  RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import svm
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

**Data collection**

Reading the data  - https://raw.githubusercontent.com/avinashyadav16/ipl-analytics/main/Datasets/matches_2008-2024.csv

In [85]:
df = pd.read_csv("https://raw.githubusercontent.com/avinashyadav16/ipl-analytics/main/Datasets/matches_2008-2024.csv")
df.head()

Unnamed: 0,id,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,335982,2008,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,335983,2008,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,335984,2008,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,335985,2008,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,335986,2008,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan


Splitting the data into train and test

In [86]:
df_train,df_test = train_test_split(df)
print(df_train.shape)
print(df_test.shape)

(821, 20)
(274, 20)


**Data Exploration**

Checking the data

In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1095 non-null   int64  
 1   season           1095 non-null   int64  
 2   city             1044 non-null   object 
 3   date             1095 non-null   object 
 4   match_type       1095 non-null   object 
 5   player_of_match  1090 non-null   object 
 6   venue            1095 non-null   object 
 7   team1            1095 non-null   object 
 8   team2            1095 non-null   object 
 9   toss_winner      1095 non-null   object 
 10  toss_decision    1095 non-null   object 
 11  winner           1090 non-null   object 
 12  result           1095 non-null   object 
 13  result_margin    1076 non-null   float64
 14  target_runs      1092 non-null   float64
 15  target_overs     1092 non-null   float64
 16  super_over       1095 non-null   object 
 17  method        

Checking the type of the dataset

In [88]:
df_train.dtypes

Unnamed: 0,0
id,int64
season,int64
city,object
date,object
match_type,object
player_of_match,object
venue,object
team1,object
team2,object
toss_winner,object


In the above dataset their is no use of id column so we can drop it.

In [89]:
df.drop("id",axis =1,inplace = True)
df

Unnamed: 0,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,2008,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,2008,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,2008,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,2008,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,2008,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,2024,Hyderabad,2024-05-19,League,Abhishek Sharma,"Rajiv Gandhi International Stadium, Uppal, Hyd...",Punjab Kings,Sunrisers Hyderabad,Punjab Kings,bat,Sunrisers Hyderabad,wickets,4.0,215.0,20.0,N,,Nitin Menon,VK Sharma
1091,2024,Ahmedabad,2024-05-21,Qualifier 1,MA Starc,"Narendra Modi Stadium, Ahmedabad",Sunrisers Hyderabad,Kolkata Knight Riders,Sunrisers Hyderabad,bat,Kolkata Knight Riders,wickets,8.0,160.0,20.0,N,,AK Chaudhary,R Pandit
1092,2024,Ahmedabad,2024-05-22,Eliminator,R Ashwin,"Narendra Modi Stadium, Ahmedabad",Royal Challengers Bengaluru,Rajasthan Royals,Rajasthan Royals,field,Rajasthan Royals,wickets,4.0,173.0,20.0,N,,KN Ananthapadmanabhan,MV Saidharshan Kumar
1093,2024,Chennai,2024-05-24,Qualifier 2,Shahbaz Ahmed,"MA Chidambaram Stadium, Chepauk, Chennai",Sunrisers Hyderabad,Rajasthan Royals,Rajasthan Royals,field,Sunrisers Hyderabad,runs,36.0,176.0,20.0,N,,Nitin Menon,VK Sharma


Checking the null values

In [90]:
df_train.isnull().sum().value_counts()

Unnamed: 0,count
0,13
5,2
3,2
34,1
18,1
806,1


Checking the duplicated values

In [91]:
df_train.duplicated().value_counts()

Unnamed: 0,count
False,821


Describeing the Dataset

In [92]:
df_train.describe()

Unnamed: 0,id,season,result_margin,target_runs,target_overs
count,821.0,821.0,803.0,818.0,818.0
mean,888149.2,2015.892814,16.901619,164.671149,19.764303
std,366911.4,4.921403,21.848927,34.349427,1.612071
min,335982.0,2008.0,1.0,43.0,5.0
25%,548316.0,2012.0,6.0,145.0,20.0
50%,980907.0,2016.0,8.0,165.0,20.0
75%,1216537.0,2020.0,19.0,186.0,20.0
max,1426311.0,2024.0,146.0,288.0,20.0


Replacing the Null values with NA

In [93]:
df_train = df_train.replace(np.nan,"NA")
df_test = df_test.replace(np.nan,"NA")

This code is cleaning and converting the target_runs column in both df_train and df_test:

 Converts target_runs to numeric (to_numeric with errors='coerce' turns non-numeric values into NaN).
 Fills missing values (NaN) with the column's mean to avoid data loss.

In [94]:
df_train['target_runs'] = pd.to_numeric(df_train['target_runs'], errors='coerce')
df_test['target_runs'] = pd.to_numeric(df_test['target_runs'], errors='coerce')
df_train['target_runs'] = df_train['target_runs'].fillna(df_train['target_runs'].mean())
df_test['target_runs'] = df_test['target_runs'].fillna(df_test['target_runs'].mean())

Here we are assing the df_train,df_test to x_train,x_test,y_train,y_test and droping the target_runs in x_train and x_test because we don't need it in x_train and x_test.

In [95]:
x_train = df_train.drop("target_runs",axis=1)
y_train = df_train["target_runs"]
x_test = df_test.drop("target_runs",axis=1)
y_test = df_test["target_runs"]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(821, 19)
(821,)
(274, 19)
(274,)


**Feature Engineering**

Here I did scalling but it giving the low score when compared to encoding

In [96]:
x_train = x_train.astype(str)
x_test = x_test.astype(str)
encoder = OneHotEncoder(handle_unknown="ignore",sparse_output=False)
x_train = encoder.fit_transform(x_train)
x_test = encoder.transform(x_test)

**Model Training**

Here I Tested DecisionTreeRegressor,  LinearRegression, SVM, along with below models also.

In [114]:
rfr = RandomForestRegressor()
rfr.fit(x_train,y_train)

In [115]:
gbr = GradientBoostingRegressor()
gbr.fit(x_train,y_train)

In [116]:
from xgboost import XGBRegressor
xg = XGBRegressor()
xg.fit(x_train,y_train)

**Model Evaluation**

In [117]:
y1 = rfr.predict(x_test)
y2 = gbr.predict(x_test)
y3 = xg.predict(x_test)

 Measured model accuracy using R² score, Mean Absolute Error (MAE), and Mean Squared Error (MSE).

In [118]:
mse1 = mean_squared_error(y_test, y1)
mse2 = mean_squared_error(y_test, y2)
mse3 = mean_squared_error(y_test, y3)
r2_1 = r2_score(y_test, y1)
r2_2 = r2_score(y_test, y2)
r2_3 = r2_score(y_test, y3)
mae1 = mean_absolute_error(y_test,y1)
mae2 = mean_absolute_error(y_test,y2)
mae3 = mean_absolute_error(y_test,y3)
print("\nRandomForestRegressor:")
print("Mean Squared Error:", mse3)
print("R-squared:", r2_3)
print("Mean Absolute Error:", mae3)
print("\nGradientBoostingRegressor:")
print("Mean Squared Error:", mse5)
print("R-squared:", r2_5)
print("Mean Absolute Error:", mae5)
print("\nXGBRegressor:")
print("Mean Squared Error:", mse6)
print("R-squared:", r2_6)
print("Mean Absolute Error:", mae6)


RandomForestRegressor:
Mean Squared Error: 810.2366894845293
R-squared: 0.11811372755934224
Mean Absolute Error: 21.702632291473613

GradientBoostingRegressor:
Mean Squared Error: 706.9511775182349
R-squared: 0.2305328222846491
Mean Absolute Error: 21.117892085918356

XGBRegressor:
Mean Squared Error: 810.2366894845293
R-squared: 0.11811372755934224
Mean Absolute Error: 21.702632291473613


**Hypermeter Tuning**

Here I am using RandomizedsearchCV because it is given me best score when comapred to GridsearchCV.

I used GridsearchCV also for below Models but it's given less score and the time taken is more.

In [105]:
gs_rtr = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
gd = RandomizedSearchCV(RandomForestRegressor(),gs_rtr,cv=5,n_iter=10)
gd.fit(x_train, y_train)

In [106]:
gd.best_params_

{'n_estimators': 50,
 'min_samples_split': 10,
 'min_samples_leaf': 2,
 'max_depth': None}

In [107]:
gd.best_score_

np.float64(0.2819688126118284)

In [108]:
gs_gbr = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5]
}
gd = RandomizedSearchCV(GradientBoostingRegressor(),gs_gbr,cv=5,n_iter=10)
gd.fit(x_train, y_train)

In [109]:
gd.best_params_

{'n_estimators': 300,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 3,
 'learning_rate': 0.05}

In [110]:
gd.best_score_

np.float64(0.2936612561271102)

In [111]:
gs_xg = {
          'n_estimators': [100, 200, 300, 500],
          'max_depth': [3, 5, 7, 10],
          'learning_rate': [0.01, 0.05, 0.1, 0.2],
          'subsample': [0.6, 0.8, 1.0],
          'colsample_bytree': [0.6, 0.8, 1.0],
          'gamma': [0, 0.1, 0.2, 0.3],
          'reg_alpha': [0, 0.01, 0.1, 1],
          'reg_lambda': [0, 0.01, 0.1, 1]
}
gd = RandomizedSearchCV(XGBRegressor(),gs_xg,cv=5,n_iter=10)
gd.fit(x_train, y_train)

In [112]:
gd.best_params_

{'subsample': 0.8,
 'reg_lambda': 0.01,
 'reg_alpha': 0.1,
 'n_estimators': 200,
 'max_depth': 7,
 'learning_rate': 0.05,
 'gamma': 0,
 'colsample_bytree': 0.6}

In [113]:
gd.best_score_

np.float64(0.3156092785924316)

By refering above models, I am  Considering ***XGB regression*** model as my Best model

**Conclusion** -

•	This project provides data-driven insights into IPL matches and improves decision-making for cricket analysts, teams, and enthusiasts.

•	The best-performing model can help teams predict realistic chase targets and plan their batting strategy accordingly.

•	Future improvements can include live match data integration and deep learning techniques for enhanced accuracy.
