# IPL SCORE PREDICTION



<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5635484%2Fcc4775f61ed72a625e5485a3941e6e45%2FIPL%20pic.jpg?generation=1600083123785212&alt=media">

Cricket is a bat-and-ball game played between two teams of eleven players each on a cricket field, at the centre of which is a rectangular 20-metre (22-yard) pitch with a target at each end called the wicket (a set of three wooden stumps upon which two bails sit). Each phase of play is called an innings, during which one team bats, attempting to score as many runs as possible, whilst their opponents bowl and field, attempting to minimise the number of runs scored. When each innings ends, the teams usually swap roles for the next innings (i.e. the team that previously batted will bowl/field, and vice versa). The teams each bat for one or two innings, depending on the type of match. The winning team is the one that scores the most runs, including any extras gained (except when the result is not a win/loss result). Source: https://en.wikipedia.org/wiki/Cricket


## About Dataset

The Indian Premier League (IPL) is a Twenty20 cricket league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the league is Vivo. The league was founded by the Board of Control for Cricket in India (BCCI) in 2008.


### Import Libraries
#### Let's import the libraries needed for the analysis and load our dataset

In [67]:
import pandas as pd
import pickle
import numpy as np
from sklearn import metrics
from datetime import datetime


#Visualization Phase
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.pylab as pylab
from pandas.plotting import scatter_matrix
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Gathering Data

In [2]:
ipl_data = pd.read_csv('ipl.csv')

In [3]:
ipl_data.head()

Unnamed: 0,mid,date,venue,bat_team,bowl_team,batsman,bowler,runs,wickets,overs,runs_last_5,wickets_last_5,striker,non-striker,total
0,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,SC Ganguly,P Kumar,1,0,0.1,1,0,0,0,222
1,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,1,0,0.2,1,0,0,0,222
2,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.2,2,0,0,0,222
3,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.3,2,0,0,0,222
4,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.4,2,0,0,0,222


In [4]:
print ("The shape of the  data is (row, column):"+ str(ipl_data.shape))


The shape of the  data is (row, column):(76014, 15)


## Dataset Details


### ipl_data:

* **mid** - Match id.
* **date** - Date on which the match was played (UTC).
* **venue** - Venue where the match was played.
* **bat_team** - Batting team name.
* **bowl_team** - Bowling team name.
* **batsman** - Batsman name.
* **bowler** - Bowler name.
* **runs** - Cumulative runs scored in the innings up to that ball.
* **wickets** - Cumulative wickets lost in the innings up to that ball.
* **overs** - Overs bowled so far (the digit after the point counts balls in the current over).
* **runs_last_5** - Runs scored in the last 5 overs.
* **wickets_last_5** - Wickets taken in the last 5 overs.
* **striker** - Number of striker.
* **non-striker** - Number of non-striker.
* **total** - Final total score of the innings.

The data covers all Indian Premier League cricket matches between 2008-04-18 and 2017-05-21.



# Analyze the data

## Removing unwanted columns 

In [5]:
removed = ['mid', 'venue', 'batsman', 'bowler', 'striker', 'non-striker']
ipl_data.drop(labels=removed, axis=1, inplace=True)

## Checking for unique teams 


In [6]:
ipl_data['bat_team'].unique()

array(['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
       'Mumbai Indians', 'Deccan Chargers', 'Kings XI Punjab',
       'Royal Challengers Bangalore', 'Delhi Daredevils',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Sunrisers Hyderabad',
       'Rising Pune Supergiants', 'Gujarat Lions',
       'Rising Pune Supergiant'], dtype=object)

## Removing inconsistent teams
from the batting and bowling data

In [7]:
consistent_teams = ['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals','Mumbai Indians', 'Kings XI Punjab', 'Royal Challengers Bangalore','Delhi Daredevils', 'Sunrisers Hyderabad']
ipl_data = ipl_data[(ipl_data['bat_team'].isin(consistent_teams)) & (ipl_data['bowl_team'].isin(consistent_teams))]
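
To make the effect of this filter concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data and the shortened `keep` list are invented for illustration and are not taken from `ipl.csv`). A row survives only if both its batting team and its bowling team are in the list.

```python
import pandas as pd

# Hypothetical toy data, not taken from ipl.csv
toy = pd.DataFrame({
    'bat_team':  ['Mumbai Indians', 'Deccan Chargers', 'Chennai Super Kings'],
    'bowl_team': ['Chennai Super Kings', 'Mumbai Indians', 'Gujarat Lions'],
})
keep = ['Mumbai Indians', 'Chennai Super Kings']

# A row is kept only when BOTH teams appear in the list
mask = toy['bat_team'].isin(keep) & toy['bowl_team'].isin(keep)
filtered = toy[mask]
print(filtered)
```

Only the first row remains: the second has a batting team outside the list and the third has a bowling team outside the list.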

In [8]:
ipl_data.head()

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
0,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.1,1,0,222
1,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.2,1,0,222
2,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.2,2,0,222
3,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.3,2,0,222
4,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.4,2,0,222


## Checking for null values

In [9]:
ipl_data.isnull().sum()

date              0
bat_team          0
bowl_team         0
runs              0
wickets           0
overs             0
runs_last_5       0
wickets_last_5    0
total             0
dtype: int64

In [10]:
ipl_data

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
0,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.1,1,0,222
1,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.2,1,0,222
2,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.2,2,0,222
3,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.3,2,0,222
4,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.4,2,0,222
...,...,...,...,...,...,...,...,...,...
75884,2017-05-19,Kolkata Knight Riders,Mumbai Indians,106,9,18.1,29,4,107
75885,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.2,29,4,107
75886,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.3,28,4,107
75887,2017-05-19,Kolkata Knight Riders,Mumbai Indians,107,9,18.4,24,4,107


Observations:
Our dataset has no missing values.

## Removing the first 5 overs of data in every match
Because we need at least 5 overs of data to predict the total score

In [11]:
ipl_data = ipl_data[ipl_data['overs']>=5.0]
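
The threshold `5.0` relies on the cricket notation used in the `overs` column. As a rough sketch, assuming the digit after the point counts balls in the current over (which the sample rows 0.1 to 0.4 and the maximum of 19.6 suggest), the hypothetical helper `overs_to_balls` below converts that notation into a count of legal balls bowled. It is not part of the original notebook.

```python
def overs_to_balls(overs: float) -> int:
    """Convert overs notation (e.g. 5.1 = 5 overs + 1 ball) to balls bowled.

    Assumption: the first decimal digit is the ball number within the over.
    """
    whole = int(overs)
    # round() guards against floating-point noise such as 5.1 - 5 = 0.0999...
    balls = round((overs - whole) * 10)
    return whole * 6 + balls

print(overs_to_balls(5.1))   # 31
print(overs_to_balls(19.6))  # 120
```

Under that assumption, `overs >= 5.0` keeps only rows recorded once 30 or more legal balls have been bowled.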

In [12]:
ipl_data.describe()

Unnamed: 0,runs,wickets,overs,runs_last_5,wickets_last_5,total
count,40108.0,40108.0,40108.0,40108.0,40108.0,40108.0
mean,94.972699,3.042186,12.313459,38.887903,1.314027,161.947517
std,40.966837,1.906814,4.323001,11.50381,1.06265,29.831496
min,13.0,0.0,5.0,10.0,0.0,67.0
25%,62.0,2.0,8.5,31.0,1.0,142.0
50%,90.0,3.0,12.3,38.0,1.0,163.0
75%,124.0,4.0,16.2,46.0,2.0,183.0
max,246.0,10.0,19.6,94.0,7.0,246.0


# HANDLING CATEGORICAL FEATURES
using one hot encoding

In [13]:
encoded_df = pd.get_dummies(data=ipl_data, columns=['bat_team', 'bowl_team'])
encoded_df.head()

Unnamed: 0,date,runs,wickets,overs,runs_last_5,wickets_last_5,total,bat_team_Chennai Super Kings,bat_team_Delhi Daredevils,bat_team_Kings XI Punjab,...,bat_team_Royal Challengers Bangalore,bat_team_Sunrisers Hyderabad,bowl_team_Chennai Super Kings,bowl_team_Delhi Daredevils,bowl_team_Kings XI Punjab,bowl_team_Kolkata Knight Riders,bowl_team_Mumbai Indians,bowl_team_Rajasthan Royals,bowl_team_Royal Challengers Bangalore,bowl_team_Sunrisers Hyderabad
32,2008-04-18,61,0,5.1,59,0,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
33,2008-04-18,61,1,5.2,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
34,2008-04-18,61,1,5.3,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
35,2008-04-18,61,1,5.4,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
36,2008-04-18,61,1,5.5,58,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [14]:
# Our new columns
print(encoded_df.columns)

Index(['date', 'runs', 'wickets', 'overs', 'runs_last_5', 'wickets_last_5',
       'total', 'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils',
       'bat_team_Kings XI Punjab', 'bat_team_Kolkata Knight Riders',
       'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
       'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
       'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils',
       'bowl_team_Kings XI Punjab', 'bowl_team_Kolkata Knight Riders',
       'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
       'bowl_team_Royal Challengers Bangalore',
       'bowl_team_Sunrisers Hyderabad'],
      dtype='object')
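
For readers new to one-hot encoding, the following minimal sketch applies `pd.get_dummies` to a hypothetical three-row frame (invented data). Passing `dtype=int` is an assumption made here so that the indicator columns show 0/1 as in the output above; recent pandas versions produce booleans by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from ipl.csv
toy = pd.DataFrame({
    'bat_team': ['Mumbai Indians', 'Chennai Super Kings', 'Mumbai Indians'],
    'runs': [61, 45, 80],
})

# Each distinct team name becomes its own 0/1 indicator column
encoded = pd.get_dummies(data=toy, columns=['bat_team'], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1 per encoded column group, which is how the model sees team identity as numeric input.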


### Converting the 'date' column from string to datetime object
This is useful for the date-based train-test split.

In [15]:
encoded_df['date'] = encoded_df['date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

# The Train-Test Split
Splitting the data into train and test set

In [16]:
X_train = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year <= 2016]
X_test = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year >= 2017]

In [17]:
y_train = encoded_df[encoded_df['date'].dt.year <= 2016]['total'].values
y_test = encoded_df[encoded_df['date'].dt.year >= 2017]['total'].values
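
The split above is by season rather than random: earlier seasons train the model and the 2017 season tests it. A minimal sketch of the same year-based rule on a hypothetical four-row frame (invented values) is shown below; for brevity it drops the `date` column in the same step.

```python
import pandas as pd

# Hypothetical toy data, not taken from ipl.csv
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-05-01', '2016-04-20', '2017-04-10', '2017-05-19']),
    'runs': [90, 120, 75, 106],
    'total': [160, 185, 140, 107],
})

# One boolean mask, reused for features and target so they stay aligned
is_train = toy['date'].dt.year <= 2016

X_tr = toy.loc[is_train].drop(columns=['total', 'date'])
X_te = toy.loc[~is_train].drop(columns=['total', 'date'])
y_tr = toy.loc[is_train, 'total'].values
y_te = toy.loc[~is_train, 'total'].values
print(len(X_tr), len(X_te))
```

Building the mask once avoids the risk of the feature and target filters drifting apart.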

## Removing the 'date' column

In [18]:

X_train.drop(labels='date', axis=1, inplace=True)
X_test.drop(labels='date', axis=1, inplace=True)

# Correlations

In [19]:
# numeric_only=True (pandas >= 1.5) skips the non-numeric date and team columns
cor = ipl_data.corr(numeric_only=True)
cor['total'].sort_values(ascending=False)

total             1.000000
runs_last_5       0.587091
runs              0.391254
overs             0.028468
wickets_last_5   -0.297397
wickets          -0.457055
Name: total, dtype: float64

### Observation
- 'runs_last_5' shows the strongest positive correlation with 'total' (0.59): as the runs scored in the last 5 overs increase, the total tends to increase.
- Similarly, 'wickets' shows a negative correlation with 'total' (-0.46): as the wickets lost increase, the total tends to decrease.
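
The values reported by `corr()` are Pearson correlation coefficients. As a small illustration on hypothetical numbers (invented, not from the dataset), the sketch below computes the coefficient by hand and checks it against pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from ipl.csv
toy = pd.DataFrame({
    'runs_last_5': [30, 38, 46, 52],
    'wickets':     [5, 4, 2, 1],
    'total':       [140, 160, 175, 190],
})

x, y = toy['runs_last_5'], toy['total']
dx, dy = x - x.mean(), y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x.corr(y)
r_wickets = toy['wickets'].corr(toy['total'])
print(r_manual, r_pandas, r_wickets)
```

In this toy frame the first coefficient is positive and the wickets coefficient is negative, mirroring the signs observed in the real data.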

# Selecting a model for prediction

In [56]:
from sklearn.linear_model import Ridge


model = Ridge()
model.fit(X_train, y_train)

Ridge()

# Using Cross-Validation

In [57]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=10)

In [58]:
rmse_scores = np.sqrt(-scores)

In [59]:
rmse_scores

array([19.73740558, 15.91260446, 17.12674592, 19.48555003, 17.91646659,
       16.44733104, 18.94796102, 19.65136251, 17.69556262, 17.266714  ])

In [60]:
def print_scores(scores):
    print("SCORES: ",scores)
    print("Mean : ",scores.mean())
    print("Standard Deviation : ",scores.std())

In [61]:
print_scores(rmse_scores)

SCORES:  [19.73740558 15.91260446 17.12674592 19.48555003 17.91646659 16.44733104
 18.94796102 19.65136251 17.69556262 17.266714  ]
Mean :  18.018770377318468
Standard Deviation :  1.3044341305789664
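
The `-scores` in the cells above is needed because scikit-learn scorers follow a "higher is better" convention, so mean squared error is returned negated. The sketch below repeats the same 10-fold procedure on synthetic data (the data, coefficients, and seed are all invented for illustration) to show the sign handling in isolation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: a linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([10.0, -5.0, 2.0]) + 160 + rng.normal(scale=3.0, size=200)

# One negated MSE per fold; all values are <= 0
neg_mse = cross_val_score(Ridge(), X, y,
                          scoring='neg_mean_squared_error', cv=10)

# Flip the sign before the square root to get per-fold RMSE
rmse = np.sqrt(-neg_mse)
print(rmse.mean(), rmse.std())
```

Taking the square root without flipping the sign would yield NaN values, which is a common slip with this scorer.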


In [63]:
prediction = model.predict(X_test)

In [68]:

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

RMSE: 15.843248011910424
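
RMSE is the square root of the mean of the squared differences between actual and predicted totals, so it is expressed in runs. A worked example on three hypothetical predictions (invented numbers):

```python
import numpy as np

# Hypothetical actual and predicted totals
y_true = np.array([160, 170, 150])
y_pred = np.array([163, 166, 150])

errors = y_true - y_pred            # [-3, 4, 0]
mse = np.mean(errors ** 2)          # (9 + 16 + 0) / 3
rmse = np.sqrt(mse)
print(round(rmse, 3))               # 2.887
```

Read this way, the test RMSE above means the predicted total is typically off by roughly 16 runs.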


# Pickling 

In [70]:
filename = 'model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
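
To use the saved model later, it can be loaded back with `pickle.load`. The sketch below is a self-contained round trip on a tiny hypothetical model (the data and the temporary path are invented for illustration) and checks that the restored model predicts identically.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Tiny hypothetical model, standing in for the one trained above
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Ridge().fit(X, y)

# Write to a temporary directory so the example leaves no file behind in the cwd
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle files from a trusted source, and load them with the same scikit-learn version that saved them where possible.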