**Problem Statement** -
The objective of this project is to build a predictive model that estimates the target runs in an IPL match. Predicting target runs can help teams strategize their innings effectively and optimize their gameplay based on historical match data.


**Understanding the Problem** -

•	In a T20 cricket match, the target runs are influenced by various factors, such as team performance, toss decision, venue conditions, and previous match statistics.

•	Using machine learning, we can analyze past IPL match data to uncover patterns and trends that determine the target runs for a given match.

•	By leveraging feature engineering and model tuning, we can improve prediction accuracy.

**What I am Predicting** -

•	The target variable for this project is target_runs, which represents the number of runs a team must chase to win.

•	The features used for prediction include:
o	Match details (venue, date, teams playing)
o	Toss information (which team won the toss and the decision)
o	First innings performance (runs scored, wickets lost)
o	Other contextual factors (weather, pitch conditions, etc.)


**Importing the libraries**

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import svm
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

**Data collection**

Reading the data  - https://raw.githubusercontent.com/avinashyadav16/ipl-analytics/main/Datasets/matches_2008-2024.csv

In [35]:
dataf = pd.read_csv("https://raw.githubusercontent.com/avinashyadav16/ipl-analytics/main/Datasets/matches_2008-2024.csv")
dataf.head()

Unnamed: 0,id,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,335982,2008,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,335983,2008,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,335984,2008,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,335985,2008,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,335986,2008,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan


Splitting the data into train and test

In [36]:
dataf_train,dataf_test = train_test_split(dataf)
print(dataf_train.shape)
print(dataf_test.shape)

(821, 20)
(274, 20)
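
Because `train_test_split` shuffles the rows at random, the rows that land in each part change from run to run, and so do the scores reported later. The following is a minimal sketch of a repeatable split, assuming a fixed seed is acceptable; the `random_state=42` value and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dataf; the values are illustrative only.
toy = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2010, 2010, 2011, 2011],
    "target_runs": [223.0, 241.0, 130.0, 166.0, 111.0, 180.0, 155.0, 172.0],
})

# Fixing random_state makes the shuffle repeatable across runs.
train_a, test_a = train_test_split(toy, random_state=42)
train_b, test_b = train_test_split(toy, random_state=42)

print(train_a.shape, test_a.shape)          # default test_size is 0.25
print(train_a.index.equals(train_b.index))  # same rows both times
```

With the default `test_size` of 0.25, the 1095 matches split into 821 training rows and 274 test rows, which matches the shapes printed above.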


**Data Exploration**

Checking the data

In [37]:
dataf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1095 non-null   int64  
 1   season           1095 non-null   int64  
 2   city             1044 non-null   object 
 3   date             1095 non-null   object 
 4   match_type       1095 non-null   object 
 5   player_of_match  1090 non-null   object 
 6   venue            1095 non-null   object 
 7   team1            1095 non-null   object 
 8   team2            1095 non-null   object 
 9   toss_winner      1095 non-null   object 
 10  toss_decision    1095 non-null   object 
 11  winner           1090 non-null   object 
 12  result           1095 non-null   object 
 13  result_margin    1076 non-null   float64
 14  target_runs      1092 non-null   float64
 15  target_overs     1092 non-null   float64
 16  super_over       1095 non-null   object 
 17  method        

Checking the data types of the columns

In [38]:
dataf_train.dtypes

Unnamed: 0,0
id,int64
season,int64
city,object
date,object
match_type,object
player_of_match,object
venue,object
team1,object
team2,object
toss_winner,object


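In the above dataset the id column is not useful, so we can drop it. Note that this drops it from dataf only; dataf_train and dataf_test were split off earlier and still contain id.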

In [39]:
dataf.drop("id",axis =1,inplace = True)
dataf

Unnamed: 0,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,2008,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,2008,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,2008,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,2008,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,2008,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,2024,Hyderabad,2024-05-19,League,Abhishek Sharma,"Rajiv Gandhi International Stadium, Uppal, Hyd...",Punjab Kings,Sunrisers Hyderabad,Punjab Kings,bat,Sunrisers Hyderabad,wickets,4.0,215.0,20.0,N,,Nitin Menon,VK Sharma
1091,2024,Ahmedabad,2024-05-21,Qualifier 1,MA Starc,"Narendra Modi Stadium, Ahmedabad",Sunrisers Hyderabad,Kolkata Knight Riders,Sunrisers Hyderabad,bat,Kolkata Knight Riders,wickets,8.0,160.0,20.0,N,,AK Chaudhary,R Pandit
1092,2024,Ahmedabad,2024-05-22,Eliminator,R Ashwin,"Narendra Modi Stadium, Ahmedabad",Royal Challengers Bengaluru,Rajasthan Royals,Rajasthan Royals,field,Rajasthan Royals,wickets,4.0,173.0,20.0,N,,KN Ananthapadmanabhan,MV Saidharshan Kumar
1093,2024,Chennai,2024-05-24,Qualifier 2,Shahbaz Ahmed,"MA Chidambaram Stadium, Chepauk, Chennai",Sunrisers Hyderabad,Rajasthan Royals,Rajasthan Royals,field,Sunrisers Hyderabad,runs,36.0,176.0,20.0,N,,Nitin Menon,VK Sharma
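
A minimal sketch of why a drop on the original frame does not reach parts that were split off earlier, and one way to remove the column from those parts too. The toy frame and the names `matches`, `train_part`, and `test_part` are hypothetical stand-ins for `dataf`, `dataf_train`, and `dataf_test`.

```python
import pandas as pd

# Toy stand-in for dataf; values are illustrative only.
matches = pd.DataFrame({
    "id": [335982, 335983, 335984, 335985],
    "season": [2008, 2008, 2008, 2008],
    "target_runs": [223.0, 241.0, 130.0, 166.0],
})

# Stand-ins for dataf_train / dataf_test, created before the drop.
train_part = matches.iloc[[0, 1, 2]]
test_part = matches.iloc[[3]]

# Dropping from the original frame leaves the earlier parts untouched.
matches.drop("id", axis=1, inplace=True)
train_still_has_id = "id" in train_part.columns
print("id" in matches.columns, train_still_has_id)

# Drop the column from each part explicitly.
train_part = train_part.drop(columns="id")
test_part = test_part.drop(columns="id")
print(list(train_part.columns))
```

Dropping `id` before calling `train_test_split` would achieve the same result in one step.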


Checking the null values

In [40]:
dataf_train.isnull().sum().value_counts()

Unnamed: 0,count
0,13
3,2
2,2
38,1
14,1
803,1
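
The `value_counts()` output above shows how many columns share each null count (13 columns have no missing values, and one column has 803), but not which columns they are. A small sketch of a per-column view, using a toy frame with illustrative values:

```python
import pandas as pd

# Toy stand-in for dataf_train; values are illustrative only.
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", None],
    "winner": ["Mumbai Indians", "Delhi Daredevils", None, "Chennai Super Kings"],
    "season": [2008, 2008, 2009, 2009],
})

null_counts = toy.isnull().sum()

# Keep only columns with at least one missing value, largest first.
missing = null_counts[null_counts > 0].sort_values(ascending=False)
print(missing)
```

Applied to `dataf_train`, the same two lines would name the columns behind each count.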


Checking the duplicated values

In [41]:
dataf_train.duplicated().value_counts()

Unnamed: 0,count
False,821


Describing the dataset

In [42]:
dataf_train.describe()

Unnamed: 0,id,season,result_margin,target_runs,target_overs
count,821.0,821.0,807.0,819.0,819.0
mean,893870.2,2015.953715,16.978934,165.211233,19.727961
std,362930.9,4.855502,21.134117,33.532498,1.68701
min,335982.0,2008.0,1.0,43.0,5.0
25%,548333.0,2012.0,6.0,146.0,20.0
50%,980931.0,2016.0,8.0,165.0,20.0
75%,1216531.0,2020.0,19.5,186.0,20.0
max,1426312.0,2024.0,144.0,288.0,20.0


Replacing the null values with the string "NA"

In [43]:
dataf_train = dataf_train.replace(np.nan,"NA")
dataf_test = dataf_test.replace(np.nan,"NA")

This code cleans and converts the target_runs column in both dataf_train and dataf_test:

•	Converts target_runs to numeric (to_numeric with errors='coerce' turns non-numeric values, including the "NA" strings inserted above, into NaN).
•	Fills missing values (NaN) with the column's mean to avoid data loss.

In [44]:
dataf_train['target_runs'] = pd.to_numeric(dataf_train['target_runs'], errors='coerce')
dataf_test['target_runs'] = pd.to_numeric(dataf_test['target_runs'], errors='coerce')
dataf_train['target_runs'] = dataf_train['target_runs'].fillna(dataf_train['target_runs'].mean())
dataf_test['target_runs'] = dataf_test['target_runs'].fillna(dataf_test['target_runs'].mean())
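
The round trip from "NA" back to NaN and then to the mean can be seen on a toy column. This sketch also shows one variation, labeled as an assumption: it fills the test column with the training mean rather than the test column's own mean, so that no statistic is computed from the test data. The series names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: after replace(np.nan, "NA") the column mixes floats and "NA".
train_runs = pd.Series([150.0, "NA", 170.0], dtype=object)
test_runs = pd.Series(["NA", 200.0], dtype=object)

# errors="coerce" turns the "NA" strings back into NaN.
train_runs = pd.to_numeric(train_runs, errors="coerce")
test_runs = pd.to_numeric(test_runs, errors="coerce")

# Variation (assumption): compute the mean on the training data only
# and reuse it for the test data.
train_mean = train_runs.mean()
train_runs = train_runs.fillna(train_mean)
test_runs = test_runs.fillna(train_mean)

print(train_runs.tolist())  # [150.0, 160.0, 170.0]
print(test_runs.tolist())   # [160.0, 200.0]
```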

Here we split dataf_train and dataf_test into x1_train, x1_test, y1_train, and y1_test, dropping target_runs from x1_train and x1_test because it is the value being predicted.

In [48]:
x1_train = dataf_train.drop("target_runs",axis=1)
y1_train = dataf_train["target_runs"]
x1_test = dataf_test.drop("target_runs",axis=1)
y1_test = dataf_test["target_runs"]
print(x1_train.shape)
print(y1_train.shape)
print(x1_test.shape)
print(y1_test.shape)

(821, 19)
(821,)
(274, 19)
(274,)


**Feature Engineering**

Here I also tried scaling, but it gave a lower score than encoding.

In [49]:
x1_train = x1_train.astype(str)
x1_test = x1_test.astype(str)
encoder = OneHotEncoder(handle_unknown="ignore",sparse_output=False)
x1_train = encoder.fit_transform(x1_train)
x1_test = encoder.transform(x1_test)

**Model Training**

Here I also tested DecisionTreeRegressor, LinearRegression, and SVM, along with the models below.

In [50]:
rfr = RandomForestRegressor()
rfr.fit(x1_train,y1_train)

In [51]:
gbr = GradientBoostingRegressor()
gbr.fit(x1_train,y1_train)

In [52]:
from xgboost import XGBRegressor
xg = XGBRegressor()
xg.fit(x1_train,y1_train)
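
A comparison across several regressors can be sketched as a loop over a dictionary of models. This is an illustration on synthetic data, not the notebook's actual experiment: the data, the seeds, and the choice of `SVR` as the SVM variant are assumptions, and XGBRegressor is left out to keep the sketch limited to scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 160 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record its R² on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Print the models from best to worst R².
for name, r2 in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R² = {r2:.3f}")
```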

**Model Evaluation**

In [53]:
z1 = rfr.predict(x1_test)
z2 = gbr.predict(x1_test)
z3 = xg.predict(x1_test)

Model accuracy is measured using the R² score, Mean Absolute Error (MAE), and Mean Squared Error (MSE).

In [54]:
mse1 = mean_squared_error(y1_test, z1)
mse2 = mean_squared_error(y1_test, z2)
mse3 = mean_squared_error(y1_test, z3)
r2_1 = r2_score(y1_test, z1)
r2_2 = r2_score(y1_test, z2)
r2_3 = r2_score(y1_test, z3)
mae1 = mean_absolute_error(y1_test,z1)
mae2 = mean_absolute_error(y1_test,z2)
mae3 = mean_absolute_error(y1_test,z3)
print("\nRandomForestRegressor:")
print("Mean Squared Error:", mse1)
print("R-squared:", r2_1)
print("Mean Absolute Error:", mae1)
print("\nGradientBoostingRegressor:")
print("Mean Squared Error:", mse2)
print("R-squared:", r2_2)
print("Mean Absolute Error:", mae2)
print("\nXGBRegressor:")
print("Mean Squared Error:", mse3)
print("R-squared:", r2_3)
print("Mean Absolute Error:", mae3)


RandomForestRegressor:
Mean Squared Error: 760.4431458373939
R-squared: 0.3020492458837716
Mean Absolute Error: 22.41635513310696

GradientBoostingRegressor:
Mean Squared Error: 749.3744566651104
R-squared: 0.31220832220283845
Mean Absolute Error: 22.15043392317121

XGBRegressor:
Mean Squared Error: 767.3726823927698
R-squared: 0.29568917111555904
Mean Absolute Error: 22.449463065545615
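
To make the three metrics concrete, the sketch below computes them by hand with NumPy on three made-up values, following the standard definitions. The numbers are illustrative and unrelated to the IPL data.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 210.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)     # mean of squared errors
mae = np.mean(np.abs(errors))  # mean of absolute errors
rmse = np.sqrt(mse)            # same units as the target

# R² compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae, rmse, r2)
```

Read this way, the R² of about 0.31 for GradientBoostingRegressor means the model explains roughly 31% of the variance in target_runs on the test split, and the MAE of about 22 means its predictions are off by about 22 runs on average.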


**Hyperparameter Tuning**

Here I am using RandomizedSearchCV because it gave me a better score than GridSearchCV.

I also tried GridSearchCV for the models below, but it gave a lower score and took more time.
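
The difference in running time follows from the number of candidates each method tries. The sketch below counts the combinations an exhaustive grid search would evaluate for the three parameter spaces used in this section, compared with the 10 candidates sampled when `n_iter=10`. The helper `grid_size` is a hypothetical name introduced for illustration.

```python
from math import prod

rs_rtr = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
rs_gbr = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5],
}
rs_xg = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0, 0.01, 0.1, 1],
}

def grid_size(param_space):
    """Number of combinations an exhaustive grid search would try."""
    return prod(len(values) for values in param_space.values())

sizes = {
    "RandomForestRegressor": grid_size(rs_rtr),
    "GradientBoostingRegressor": grid_size(rs_gbr),
    "XGBRegressor": grid_size(rs_xg),
}
for name, size in sizes.items():
    # With cv=5, every candidate is fitted five times.
    print(f"{name}: grid = {size} candidates ({size * 5} fits), randomized = 10 candidates (50 fits)")
```

Since the randomized search samples only 10 of these candidates, its best score can differ between runs unless `random_state` is set.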

In [55]:
rs_rtr = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
rd = RandomizedSearchCV(RandomForestRegressor(),rs_rtr,cv=5,n_iter=10)
rd.fit(x1_train, y1_train)

In [56]:
rd.best_params_

{'n_estimators': 50,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_depth': None}

In [57]:
rd.best_score_

np.float64(0.2485623569768749)

In [58]:
rs_gbr = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5]
}
rd = RandomizedSearchCV(GradientBoostingRegressor(),rs_gbr,cv=5,n_iter=10)
rd.fit(x1_train, y1_train)

In [60]:
rd.best_params_

{'n_estimators': 100,
 'min_samples_split': 10,
 'min_samples_leaf': 5,
 'max_depth': 5,
 'learning_rate': 0.05}

In [61]:
rd.best_score_

np.float64(0.2695405097085232)

In [62]:
rs_xg = {
          'n_estimators': [100, 200, 300, 500],
          'max_depth': [3, 5, 7, 10],
          'learning_rate': [0.01, 0.05, 0.1, 0.2],
          'subsample': [0.6, 0.8, 1.0],
          'colsample_bytree': [0.6, 0.8, 1.0],
          'gamma': [0, 0.1, 0.2, 0.3],
          'reg_alpha': [0, 0.01, 0.1, 1],
          'reg_lambda': [0, 0.01, 0.1, 1]
}
rd = RandomizedSearchCV(XGBRegressor(),rs_xg,cv=5,n_iter=10)
rd.fit(x1_train, y1_train)

In [63]:
rd.best_params_

{'subsample': 1.0,
 'reg_lambda': 0.01,
 'reg_alpha': 0.1,
 'n_estimators': 100,
 'max_depth': 3,
 'learning_rate': 0.2,
 'gamma': 0,
 'colsample_bytree': 1.0}

In [32]:
rd.best_score_

np.float64(0.27892575955832477)
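
The `best_score_` values above are mean cross-validated scores on the training data (R² by default for regressors), so they are not directly comparable with the test-set R² values in the Model Evaluation section. A minimal sketch of scoring a tuned model on held-out rows, using synthetic data and a small parameter space that are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): 150 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 160 + 15 * X[:, 0] + rng.normal(scale=5, size=150)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_space = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_space,
    n_iter=4,
    cv=3,
    scoring="r2",
    random_state=0,  # makes the sampled candidates repeatable
)
search.fit(X_train, y_train)

# best_estimator_ is refit on all training rows, so it can be scored on the test split.
test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_)
print(round(search.best_score_, 3), round(test_r2, 3))
```

Note that the notebook reuses the variable `rd` for all three searches, so only the last fitted search remains available afterwards; keeping each search under its own name would allow all three tuned models to be evaluated this way.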

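Based on the cross-validated scores above (about 0.279 for XGBRegressor, 0.270 for GradientBoostingRegressor, and 0.249 for RandomForestRegressor), I am considering the ***XGB regression*** model as my best model.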

**Conclusion** -

•	Integrating data analytics into IPL cricket has transformed the sport, enabling teams to make informed decisions and develop effective strategies.

•	By analyzing extensive datasets, teams can predict achievable chase targets and tailor their batting approaches accordingly.

•	Future improvements can include live match data integration and deep learning techniques for enhanced accuracy.
