# IPL  SCORE PREDICTION



<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5635484%2Fcc4775f61ed72a625e5485a3941e6e45%2FIPL%20pic.jpg?generation=1600083123785212&alt=media">

Cricket is a bat-and-ball game played between two teams of eleven players each on a cricket field, at the centre of which is a rectangular 20-metre (22-yard) pitch with a target at each end called the wicket (a set of three wooden stumps upon which two bails sit). Each phase of play is called an innings, during which one team bats, attempting to score as many runs as possible, whilst their opponents bowl and field, attempting to minimise the number of runs scored. When each innings ends, the teams usually swap roles for the next innings (i.e. the team that previously batted will bowl/field, and vice versa). The teams each bat for one or two innings, depending on the type of match. The winning team is the one that scores the most runs, including any extras gained (except when the result is not a win/loss result). Source: https://en.wikipedia.org/wiki/Cricket


## About Dataset

Indian Premier League (IPL) is a Twenty20 cricket format league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the game is Vivo. The league was founded by Board of Control for Cricket India (BCCI) in 2008.

## Problem Statement
We have to create a prediction model which can accurately predict the scores scored by a team in ipl based on their historical data as well as wickets and scores scored till that instance of time.















### Import Libraries
#### Let's import all necessary libraries for the analysis and along with it let's bring down our dataset

In [39]:
import pandas as pd
import pickle
import numpy as np
from datetime import datetime
from sklearn.metrics import mean_squared_error as mse

# Gathering Data

In [40]:
ipl_data = pd.read_csv('ipl.csv')

In [41]:
ipl_data.head()

Unnamed: 0,mid,date,venue,bat_team,bowl_team,batsman,bowler,runs,wickets,overs,runs_last_5,wickets_last_5,striker,non-striker,total
0,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,SC Ganguly,P Kumar,1,0,0.1,1,0,0,0,222
1,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,1,0,0.2,1,0,0,0,222
2,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.2,2,0,0,0,222
3,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.3,2,0,0,0,222
4,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.4,2,0,0,0,222


In [42]:
print ("The shape of the  data is (row, column):"+ str(ipl_data.shape))


The shape of the  data is (row, column):(76014, 15)


## Dataset Details

#### The dataset 'IPL Data Set.csv' consists of ball-to-ball informations about every match of IPL from Season 1 to 10 ie: (2008 to 2017)
## Dataset consists following columns:
* **mid**: Unique match id.
* **date**: Date on which the match was played.
* **venue**: Stadium where match was played.
* **batting_team**: Batting team name.
* **bowling_team**: Bowling team name.
* **batsman**: Batsman who faced that particular ball.
* **bowler**: Bowler who bowled that particular ball.
* **runs**: Runs scored by team till that point of instance.
* **wickets**: Number of Wickets fallen of the team till that point of instance.
* **overs**: Number of Overs bowled till that point of instance.
* **runs_last_5**: Runs scored in previous 5 overs.
* **wickets_last_5**: Number of Wickets that fell in previous 5 overs.
* **striker**: max(runs scored by striker, runs scored by non-striker).
* **non-striker**: min(runs scored by striker, runs scored by non-striker).
* **total**: Total runs scored by batting team at the end of first innings.


# Analyze the data

## Removing unwanted columns 

In [43]:
removed = ['mid', 'venue', 'batsman', 'bowler', 'striker', 'non-striker']
ipl_data.drop(labels=removed, axis=1, inplace=True)

## Checking for unique teams 


In [44]:
ipl_data['bat_team'].unique()

array(['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
       'Mumbai Indians', 'Deccan Chargers', 'Kings XI Punjab',
       'Royal Challengers Bangalore', 'Delhi Daredevils',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Sunrisers Hyderabad',
       'Rising Pune Supergiants', 'Gujarat Lions',
       'Rising Pune Supergiant'], dtype=object)

## Removing Unconsistent teams
from bating and bowling data

In [45]:
consistent_teams = ['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals','Mumbai Indians', 'Kings XI Punjab', 'Royal Challengers Bangalore','Delhi Daredevils', 'Sunrisers Hyderabad']
ipl_data = ipl_data[(ipl_data['bat_team'].isin(consistent_teams)) & (ipl_data['bowl_team'].isin(consistent_teams))]

In [46]:
ipl_data.head()

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
0,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.1,1,0,222
1,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.2,1,0,222
2,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.2,2,0,222
3,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.3,2,0,222
4,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.4,2,0,222


## checking for null values

In [47]:
ipl_data.isnull().sum()

date              0
bat_team          0
bowl_team         0
runs              0
wickets           0
overs             0
runs_last_5       0
wickets_last_5    0
total             0
dtype: int64

In [48]:
ipl_data

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
0,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.1,1,0,222
1,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.2,1,0,222
2,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.2,2,0,222
3,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.3,2,0,222
4,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.4,2,0,222
...,...,...,...,...,...,...,...,...,...
75884,2017-05-19,Kolkata Knight Riders,Mumbai Indians,106,9,18.1,29,4,107
75885,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.2,29,4,107
75886,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.3,28,4,107
75887,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.4,24,4,107


Observations:
Our dataset has no missing values.

## Removing the first 5 overs data in every match
Beacause we need atleast 5 overs to pridict next scores

In [49]:
ipl_data = ipl_data[ipl_data['overs']>=5.0]

In [50]:
ipl_data.describe()

Unnamed: 0,runs,wickets,overs,runs_last_5,wickets_last_5,total
count,40108.0,40108.0,40108.0,40108.0,40108.0,40108.0
mean,94.972699,3.042186,12.313459,38.887903,1.314027,161.947517
std,40.966837,1.906814,4.323001,11.50381,1.06265,29.831496
min,13.0,0.0,5.0,10.0,0.0,67.0
25%,62.0,2.0,8.5,31.0,1.0,142.0
50%,90.0,3.0,12.3,38.0,1.0,163.0
75%,124.0,4.0,16.2,46.0,2.0,183.0
max,246.0,10.0,19.6,94.0,7.0,246.0


# HANDLING CATEGORICAL FEATURES
using one hot encoding

In [51]:
encoded_df = pd.get_dummies(data=ipl_data, columns=['bat_team', 'bowl_team'])
encoded_df.head()

Unnamed: 0,date,runs,wickets,overs,runs_last_5,wickets_last_5,total,bat_team_Chennai Super Kings,bat_team_Delhi Daredevils,bat_team_Kings XI Punjab,...,bat_team_Royal Challengers Bangalore,bat_team_Sunrisers Hyderabad,bowl_team_Chennai Super Kings,bowl_team_Delhi Daredevils,bowl_team_Kings XI Punjab,bowl_team_Kolkata Knight Riders,bowl_team_Mumbai Indians,bowl_team_Rajasthan Royals,bowl_team_Royal Challengers Bangalore,bowl_team_Sunrisers Hyderabad
32,2008-04-18,61,0,5.1,59,0,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
33,2008-04-18,61,1,5.2,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
34,2008-04-18,61,1,5.3,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
35,2008-04-18,61,1,5.4,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
36,2008-04-18,61,1,5.5,58,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [52]:
## Our new colums
print(encoded_df.columns)

Index(['date', 'runs', 'wickets', 'overs', 'runs_last_5', 'wickets_last_5',
       'total', 'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils',
       'bat_team_Kings XI Punjab', 'bat_team_Kolkata Knight Riders',
       'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
       'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
       'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils',
       'bowl_team_Kings XI Punjab', 'bowl_team_Kolkata Knight Riders',
       'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
       'bowl_team_Royal Challengers Bangalore',
       'bowl_team_Sunrisers Hyderabad'],
      dtype='object')


### Converting the column 'date' from string into datetime object 
usefull while train-test-split

In [53]:
encoded_df['date'] = encoded_df['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

# The Train-Test Split
Splitting the data into train and test set

In [54]:
X_train = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year <= 2016]
X_test = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year >= 2017]

In [55]:
y_train = encoded_df[encoded_df['date'].dt.year <= 2016]['total'].values
y_test = encoded_df[encoded_df['date'].dt.year >= 2017]['total'].values
print("Training set: {} and Test set: {}".format(X_train.shape, X_test.shape))


Training set: (37330, 22) and Test set: (2778, 22)


## Removing the 'date' column

In [56]:

X_train.drop(labels='date', axis=True, inplace=True)
X_test.drop(labels='date', axis=True, inplace=True)

# Coorelations

In [57]:
cor = ipl_data.corr()
cor['total'].sort_values(ascending=False)

total             1.000000
runs_last_5       0.587091
runs              0.391254
overs             0.028468
wickets_last_5   -0.297397
wickets          -0.457055
Name: total, dtype: float64

### observation
- As we can see 'runs_last_5' shows positive correlation 
which mean if 'runs_last_5' total runs increase.
- Similarly if 'wickets' increase total decrease. 
'wickets' shows negative correaltion

# Feature Scaling the data

In [58]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Selecting a Desired model for prediction

# Linear Regression

In [59]:
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()
linear_regressor.fit(X_train,y_train)

LinearRegression()

In [60]:
# Predicting results
y_pred_lr = linear_regressor.predict(X_test)


### Custom accuracy
I have defined my own function to measure accuracy of model. Custom Accuracy is defined on the basis of difference between the predicted score and actual score. If this difference falls below a particular thresold, we count it as a correct prediction.


In [61]:
def custom_accuracy(y_test,y_pred,thresold):
    right = 0
    l = len(y_pred)
    for i in range(0,l):
        if(abs(y_pred[i]-y_test[i]) <= thresold):
            right += 1
    return ((right/l)*100)

In [62]:
# Linear Regression - Model Evaluation
print("---- Linear Regression - Model Evaluation ----")
print("Mean Squared Error (MSE): {}".format(mse(y_test, y_pred_lr)))
print("Root Mean Squared Error (RMSE): {}".format(np.sqrt(mse(y_test, y_pred_lr))))
score = linear_regressor.score(X_test,y_test)*100
print("R-squared value:" , score)
print("Custom accuracy:" , custom_accuracy(y_test,y_pred_lr,20))

---- Linear Regression - Model Evaluation ----
Mean Squared Error (MSE): 251.00792310417185
Root Mean Squared Error (RMSE): 15.843229566732026
R-squared value: 75.22633566350552
Custom accuracy: 80.92152627789777


# Decision Tree

In [63]:
# Decision Tree Regression Model
from sklearn.tree import DecisionTreeRegressor
decision_regressor = DecisionTreeRegressor()
decision_regressor.fit(X_train,y_train)

DecisionTreeRegressor()

In [64]:
# Predicting results
y_pred_dt = decision_regressor.predict(X_test)

In [65]:
# Decision Tree Regression - Model Evaluation
print("---- Decision Tree Regression - Model Evaluation ----")
print("Mean Squared Error (MSE): {}".format(mse(y_test, y_pred_dt)))
print("Root Mean Squared Error (RMSE): {}".format(np.sqrt(mse(y_test, y_pred_dt))))
score = decision_regressor.score(X_test,y_test)*100
print("R-squared value:" , score)
print("Custom accuracy:" , custom_accuracy(y_test,y_pred_dt,20))

---- Decision Tree Regression - Model Evaluation ----
Mean Squared Error (MSE): 534.662347012239
Root Mean Squared Error (RMSE): 23.122766854601096
R-squared value: 47.23056804566901
Custom accuracy: 67.31461483081354


# Random Forest

In [66]:
# Random Forest Regression Model
from sklearn.ensemble import RandomForestRegressor
random_regressor = RandomForestRegressor()
random_regressor.fit(X_train,y_train)

RandomForestRegressor()

In [67]:
# Predicting results
y_pred_rf = random_regressor.predict(X_test)

In [68]:
# Random Forest Regression - Model Evaluation
print("---- Random Forest Regression - Model Evaluation ----")
print("Mean Squared Error (MSE): {}".format(mse(y_test, y_pred_rf)))
print("Root Mean Squared Error (RMSE): {}".format(np.sqrt(mse(y_test, y_pred_rf))))
score = random_regressor.score(X_test,y_test)*100
print("R-squared value:" , score)
print("Custom accuracy:" , custom_accuracy(y_test,y_pred_rf,20))

---- Random Forest Regression - Model Evaluation ----
Mean Squared Error (MSE): 332.394559362371
Root Mean Squared Error (RMSE): 18.23169107248066
R-squared value: 67.1937397868398
Custom accuracy: 74.44204463642909


### Since Linear Regression model performs best as compared to other two, we use Linear Regression model 

# Predictions
* Model trained on the data from IPL Seasons 1 to 9 ie: (2008 to 2016)
* Model tested on data from IPL Season 10 ie: (2017)
* Model predicts on data from IPL Seasons 11 to 12 ie: (2018 to 2019)

In [69]:

def predict_score(batting_team='Chennai Super Kings', bowling_team='Mumbai Indians', overs=5.1, runs=50, wickets=0, runs_in_prev_5=50, wickets_in_prev_5=0):
  temp_array = list()

  # Batting Team
  if batting_team == 'Chennai Super Kings':
    temp_array = temp_array + [1,0,0,0,0,0,0,0]
  elif batting_team == 'Delhi Daredevils':
    temp_array = temp_array + [0,1,0,0,0,0,0,0]
  elif batting_team == 'Kings XI Punjab':
    temp_array = temp_array + [0,0,1,0,0,0,0,0]
  elif batting_team == 'Kolkata Knight Riders':
    temp_array = temp_array + [0,0,0,1,0,0,0,0]
  elif batting_team == 'Mumbai Indians':
    temp_array = temp_array + [0,0,0,0,1,0,0,0]
  elif batting_team == 'Rajasthan Royals':
    temp_array = temp_array + [0,0,0,0,0,1,0,0]
  elif batting_team == 'Royal Challengers Bangalore':
    temp_array = temp_array + [0,0,0,0,0,0,1,0]
  elif batting_team == 'Sunrisers Hyderabad':
    temp_array = temp_array + [0,0,0,0,0,0,0,1]

  # Bowling Team
  if bowling_team == 'Chennai Super Kings':
    temp_array = temp_array + [1,0,0,0,0,0,0,0]
  elif bowling_team == 'Delhi Daredevils':
    temp_array = temp_array + [0,1,0,0,0,0,0,0]
  elif bowling_team == 'Kings XI Punjab':
    temp_array = temp_array + [0,0,1,0,0,0,0,0]
  elif bowling_team == 'Kolkata Knight Riders':
    temp_array = temp_array + [0,0,0,1,0,0,0,0]
  elif bowling_team == 'Mumbai Indians':
    temp_array = temp_array + [0,0,0,0,1,0,0,0]
  elif bowling_team == 'Rajasthan Royals':
    temp_array = temp_array + [0,0,0,0,0,1,0,0]
  elif bowling_team == 'Royal Challengers Bangalore':
    temp_array = temp_array + [0,0,0,0,0,0,1,0]
  elif bowling_team == 'Sunrisers Hyderabad':
    temp_array = temp_array + [0,0,0,0,0,0,0,1]

  # Overs, Runs, Wickets, Runs_in_prev_5, Wickets_in_prev_5
  temp_array = temp_array + [overs, runs, wickets, runs_in_prev_5, wickets_in_prev_5]

  # Converting into numpy array
  temp_array = np.array([temp_array])

  # Prediction
  return int(linear_regressor.predict(temp_array)[0])

# Prediction 1
* Date: 14th April 2019
* IPL : Season 12
* Match number: 30
* Teams: Sunrisers Hyderabad vs. Delhi Daredevils
* First Innings final score: 155/7

In [70]:
final_score = predict_score(batting_team='Delhi Daredevils', bowling_team='Sunrisers Hyderabad', overs=11.5, runs=98, wickets=3, runs_in_prev_5=41, wickets_in_prev_5=1)
print("The final predicted score (range): {} to {}".format(final_score-10, final_score+5))

The final predicted score (range): 145 to 160



























Prediction 2

• Date: 10th May 2019

• IPL : Season 12

• Match number: 59 (Eliminator)

• Teams: Delhi Daredevils vs. Chennai Super Kings

• First Innings final score: 147/9

In [71]:
final_score = predict_score(batting_team='Delhi Daredevils', bowling_team='Chennai Super Kings', overs=10.2, runs=68, wickets=3, runs_in_prev_5=29, wickets_in_prev_5=1)
print("The final predicted score (range): {} to {}".format(final_score-10, final_score+5))

The final predicted score (range): 140 to 155


# Pickling

In [72]:
filename = 'model.pkl'
pickle.dump(linear_regressor, open(filename, 'wb'))