# Predicting through Probabilities
In this notebook, I try to predict the score, and ultimately who wins the match, given the batsmen, incrementally building up to include the bowlers as well. I use a purely probability-based method, similar to a Monte-Carlo simulation.
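
To make the idea concrete, here is a minimal sketch of what such a simulation could look like. It is not the model built later in this notebook: the function name `simulate_innings`, the probabilities, and the simplification that every batsman shares the same probabilities are all assumptions made for illustration. It also assumes a 20-over innings (120 legal balls) and 10 wickets.

```python
import numpy as np

def simulate_innings(run_probs, out_prob, n_balls=120, n_wickets=10, rng=None):
    """Simulate one innings ball by ball and return the total batsman runs.

    run_probs: probabilities of scoring 0..6 runs off a ball (must sum to 1).
    out_prob: probability that a ball results in a dismissal.
    """
    rng = np.random.default_rng() if rng is None else rng
    total, wickets = 0, 0
    for _ in range(n_balls):
        if rng.random() < out_prob:
            # The ball is a dismissal: no runs, one more wicket down
            wickets += 1
            if wickets == n_wickets:
                break
            continue
        # Otherwise draw the number of runs from the per-ball distribution
        total += rng.choice(7, p=run_probs)
    return int(total)

# Made-up probabilities for illustration only (not taken from the data)
run_probs = np.array([0.40, 0.38, 0.07, 0.01, 0.09, 0.00, 0.05])
rng = np.random.default_rng(0)
scores = np.array([simulate_innings(run_probs, 0.05, rng=rng) for _ in range(1000)])
print(scores.mean(), scores.std())
```

Repeating the simulation many times gives a distribution of scores rather than a single number, which is what makes comparing two teams possible.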

In [12]:
# First import everything
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import warnings
from scipy.stats import gaussian_kde
pd.options.display.max_columns = None
pd.options.display.max_rows = 300
warnings.filterwarnings('ignore')

## Obtaining Data
I have data on each delivery and each match through the 2016 season.

In [2]:
deliveries = pd.read_csv('deliveries.csv')
matches = pd.read_csv('matches.csv')

In [3]:
deliveries.head()

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,wide_runs,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,1,SC Ganguly,BB McCullum,P Kumar,0,0,0,1,0,0,0,1,1,,,
1,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,2,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
2,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,3,BB McCullum,SC Ganguly,P Kumar,0,1,0,0,0,0,0,1,1,,,
3,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,4,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
4,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,5,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,


In [4]:
matches.head()

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2008,Bangalore,4/18/2008,Kolkata Knight Riders,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Kolkata Knight Riders,140,0,BB McCullum,M Chinnaswamy Stadium,Asad Rauf,RE Koertzen,
1,2,2008,Chandigarh,4/19/2008,Chennai Super Kings,Kings XI Punjab,Chennai Super Kings,bat,normal,0,Chennai Super Kings,33,0,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",MR Benson,SL Shastri,
2,3,2008,Delhi,4/19/2008,Rajasthan Royals,Delhi Daredevils,Rajasthan Royals,bat,normal,0,Delhi Daredevils,0,9,MF Maharoof,Feroz Shah Kotla,Aleem Dar,GA Pratapkumar,
3,4,2008,Mumbai,4/20/2008,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,normal,0,Royal Challengers Bangalore,0,5,MV Boucher,Wankhede Stadium,SJ Davis,DJ Harper,
4,5,2008,Kolkata,4/20/2008,Deccan Chargers,Kolkata Knight Riders,Deccan Chargers,bat,normal,0,Kolkata Knight Riders,0,5,DJ Hussey,Eden Gardens,BF Bowden,K Hariharan,


## Data Preparation - Batsman Only
Let's start simple. For now, I will only get the probabilities for the batsmen and attempt to model the score through a simulation of the game. The statistics I will be collecting are the probabilities of a batsman hitting 0-6 runs off a ball, of getting out, and of coming in at a specific position (such as opening, or coming in after the first batsman is out). This is a standard exercise in counting outcomes. Luckily, the deliveries data has a `batsman_runs` column, which shows the number of runs credited to the batsman on each ball. However, a final score also includes extras. Thus, when I compare against test data, I'll have to subtract the extras from the total score.
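
As a small sketch of the extras adjustment, the snippet below aggregates a made-up sample per innings and removes the extras from the total. The sample rows are invented; only the column names come from `deliveries.csv`. It assumes that `total_runs` equals `batsman_runs + extra_runs` on every ball, which holds for the rows shown above but is not verified here for the whole file.

```python
import pandas as pd

# Tiny made-up sample with the same column names as deliveries.csv
sample = pd.DataFrame({
    'match_id':     [1, 1, 1, 1, 1, 1],
    'inning':       [1, 1, 1, 2, 2, 2],
    'batsman_runs': [0, 4, 1, 6, 0, 2],
    'extra_runs':   [1, 0, 0, 0, 2, 0],
    'total_runs':   [1, 4, 1, 6, 2, 2],
})

# Sum the run columns for each innings of each match
run_cols = ['batsman_runs', 'extra_runs', 'total_runs']
innings = sample.groupby(['match_id', 'inning'])[run_cols].sum()
# Score with the extras removed, comparable to a batsman-only simulation
innings['score_without_extras'] = innings['total_runs'] - innings['extra_runs']
print(innings)
```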

### Probabilities of Runs
First I'll go through each batsman and calculate the direct probabilities of runs and outs.

In [5]:
# Get all the batsmen with their runs and out instances
# (copy so the edits below don't act on a view of deliveries)
batsmenRuns = deliveries[['batsman', 'batsman_runs', 'player_dismissed']].copy()
# Treat a dismissal on a ball as the striker's: NaN -> 0, any name -> 1.
# Note: a run-out of the non-striker is also counted against the striker here.
batsmenRuns['player_dismissed'] = batsmenRuns['player_dismissed'].notna().astype(np.int64)
# Make dummies to see if each ball is 0-6 and concatenate
batsDummies = pd.get_dummies(batsmenRuns['batsman_runs']).astype(np.int64)
batsmenRuns = pd.concat([batsmenRuns, batsDummies], axis=1)
# But now the column names have 0, 1, 2, ... as integers so convert to string
batsmenRuns.columns = batsmenRuns.columns.astype(str)
# Group by batsman
batsGroup = batsmenRuns.groupby('batsman')

In [6]:
batsmenRuns.head()

Unnamed: 0,batsman,batsman_runs,player_dismissed,0,1,2,3,4,5,6
0,SC Ganguly,0,0,1,0,0,0,0,0,0
1,BB McCullum,0,0,1,0,0,0,0,0,0
2,BB McCullum,0,0,1,0,0,0,0,0,0
3,BB McCullum,0,0,1,0,0,0,0,0,0
4,BB McCullum,0,0,1,0,0,0,0,0,0


In [7]:
preData = batsGroup.sum()
preData['number_of_balls'] = preData.drop(['batsman_runs', 'player_dismissed'], axis=1).sum(axis=1)
preData

batsman,batsman_runs,player_dismissed,0,1,2,3,4,5,6,number_of_balls
A Ashish Reddy,280,15,61,83,20,1,16,0,15,196
A Chandila,4,1,3,4,0,0,0,0,0,7
A Chopra,53,5,45,21,2,0,7,0,0,75
A Flintoff,62,2,24,23,2,1,5,0,2,57
A Kumble,35,2,24,21,1,0,3,0,0,49
A Mishra,291,27,142,139,17,0,25,0,3,326
A Mithun,34,5,11,8,2,0,4,0,1,26
A Mukund,19,2,9,11,2,0,1,0,0,23
A Nehra,41,8,37,21,1,0,3,0,1,63
A Singh,2,4,8,2,0,0,0,0,0,10
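
Before any smoothing, the counts above can be turned into plain relative frequencies. The sketch below does this for three rows copied from the output above. The column names `runFreq` and `outFreq` are hypothetical, and dividing dismissals by `number_of_balls` is only a rough per-ball estimate, since every delivery row (including wides) is counted as a ball.

```python
import pandas as pd

run_cols = [str(r) for r in range(7)]
# Three rows copied from the preData output above
counts = pd.DataFrame(
    {
        'player_dismissed': [15, 1, 4],
        '0': [61, 3, 8], '1': [83, 4, 2], '2': [20, 0, 0], '3': [1, 0, 0],
        '4': [16, 0, 0], '5': [0, 0, 0], '6': [15, 0, 0],
    },
    index=['A Ashish Reddy', 'A Chandila', 'A Singh'],
)
counts['number_of_balls'] = counts[run_cols].sum(axis=1)

# Relative frequency of each run outcome per ball faced
empirical = counts[run_cols].div(counts['number_of_balls'], axis=0)
empirical.columns = [c + 'runFreq' for c in run_cols]
# Dismissals per ball faced (hypothetical column name)
empirical['outFreq'] = counts['player_dismissed'] / counts['number_of_balls']
print(empirical.round(3))
```

For players with very few balls, such as A Singh, these frequencies are noisy, which is one motivation for smoothing them.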


In [8]:
# For each player calculate probabilities using kernel density estimation
def calcProb(player):
    runValues = np.arange(7)
    # Expand the counts into one sample per ball, e.g. three 0s and four 1s
    counts = player[[str(r) for r in runValues]].to_numpy(dtype=int)
    runs = np.repeat(runValues, counts).astype(float)
    # Make a kernel, calculate the probabilities, and scale appropriately.
    # It's extremely rare, but if 'runs' only represents one outcome,
    # then kde will fail since the variance is zero. A primitive way
    # of combating this is to add one sample of each run and do
    # the density estimation once more
    if len(np.unique(runs)) == 1:
        runs = np.append(runs, runValues)
    kde = gaussian_kde(runs)
    probs = kde.pdf(runValues)
    probs /= np.sum(probs)
    # Return only the seven new probability columns
    return pd.Series(probs, index=[str(r) + 'runProb' for r in runValues])
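
To see what the kernel density step does compared with raw frequencies, here is a small standalone sketch using A Chandila's counts from the table above (three 0s and four 1s). It uses the default bandwidth of `gaussian_kde`, as the function above does. The point is only that the KDE spreads some probability onto outcomes that were never observed, whereas the raw frequencies assign them exactly zero.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Run counts for A Chandila from the table above: three 0s and four 1s
counts = np.array([3, 4, 0, 0, 0, 0, 0])
outcomes = np.arange(7)

# Raw relative frequencies: unobserved outcomes get probability zero
empirical = counts / counts.sum()

# KDE on the expanded samples, evaluated at 0..6 and renormalised
runs = np.repeat(outcomes, counts).astype(float)
kde_probs = gaussian_kde(runs).pdf(outcomes)
kde_probs /= kde_probs.sum()

for r in outcomes:
    print(r, round(empirical[r], 3), kde_probs[r])
```

Whether this smoothing is desirable is a modelling choice: it avoids hard zeros for players with few balls, but it also treats run values as points on a continuous scale, so neighbouring outcomes borrow probability from each other.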

In [9]:
finalData = pd.concat([preData, preData.apply(calcProb, axis=1)], axis=1)
finalData

batsman,batsman_runs,player_dismissed,0,1,2,3,4,5,6,number_of_balls,0runProb,1runProb,2runProb,3runProb,4runProb,5runProb,6runProb
A Ashish Reddy,280,15,61,83,20,1,16,0,15,196,0.296960,0.375425,0.148583,3.661273e-02,5.985771e-02,2.748872e-02,5.507292e-02
A Chandila,4,1,3,4,0,0,0,0,0,7,0.426390,0.561395,0.012215,1.321764e-07,6.997493e-16,1.812368e-27,2.296496e-42
A Chopra,53,5,45,21,2,0,7,0,0,75,0.536636,0.306394,0.054229,1.361653e-02,7.856816e-02,1.053143e-02,2.537059e-05
A Flintoff,62,2,24,23,2,1,5,0,2,57,0.365258,0.364745,0.116721,4.071712e-02,6.235889e-02,2.633209e-02,2.386808e-02
A Kumble,35,2,24,21,1,0,3,0,0,49,0.464315,0.417333,0.052976,6.705459e-03,5.365876e-02,5.007723e-03,4.070728e-06
A Mishra,291,27,142,139,17,0,25,0,3,326,0.429086,0.421841,0.061083,3.334305e-03,7.360066e-02,2.222726e-03,8.832114e-03
A Mithun,34,5,11,8,2,0,4,0,1,26,0.309963,0.299618,0.148336,7.671263e-02,8.585495e-02,5.316991e-02,2.634558e-02
A Mukund,19,2,9,11,2,0,1,0,0,23,0.374365,0.445693,0.124730,1.465789e-02,3.570014e-02,4.841593e-03,1.210115e-05
A Nehra,41,8,37,21,1,0,3,0,1,63,0.530532,0.349205,0.052215,7.479737e-03,3.988562e-02,7.376315e-03,1.330670e-02
A Singh,2,4,8,2,0,0,0,0,0,10,0.799351,0.200478,0.000171,1.066541e-13,4.866521e-29,1.622393e-50,3.951754e-78


Excellent. The next step is a bit trickier: now we need the probabilities of when a specific batsman comes in, i.e. whether he opens or comes in after someone gets out. Unfortunately, the data I have right now only gives each ball's result and the final result. It does not give the full batting order, including the batsmen that didn't bat. For that, I need to scrape a site that shows the full batting order for each match. At www.howstat.com, I can get the full order of batsmen.
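
As a rough sketch of what the ball-by-ball data can and cannot provide, the snippet below recovers the order in which players first appear at either end of the pitch. The sample rows and player names are placeholders, and `observed_order` is a hypothetical helper. It only covers players who actually came to the crease, so anyone further down the order who never batted is still missing, which is why the scraped batting order is needed.

```python
import pandas as pd

# Made-up deliveries in the same layout as deliveries.csv (names are placeholders)
sample = pd.DataFrame({
    'match_id':    [1, 1, 1, 1, 1],
    'inning':      [1, 1, 1, 1, 1],
    'batsman':     ['Player A', 'Player B', 'Player B', 'Player C', 'Player A'],
    'non_striker': ['Player B', 'Player A', 'Player A', 'Player A', 'Player C'],
})

def observed_order(innings_df):
    """Return players in order of first appearance at either end."""
    # Flatten row by row so striker and non-striker stay in ball order
    seen = innings_df[['batsman', 'non_striker']].to_numpy().ravel()
    # pd.unique keeps the order of first occurrence
    return list(pd.unique(seen))

order = observed_order(sample)
print(order)
```

On the real data this would be applied per `match_id` and `inning`, for example through a `groupby`.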

In [13]:
deliveries[deliveries['match_id'] == 1]

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,wide_runs,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,1,SC Ganguly,BB McCullum,P Kumar,0,0,0,1,0,0,0,1,1,,,
1,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,2,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
2,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,3,BB McCullum,SC Ganguly,P Kumar,0,1,0,0,0,0,0,1,1,,,
3,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,4,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
4,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,5,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
5,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,6,BB McCullum,SC Ganguly,P Kumar,0,0,0,0,0,0,0,0,0,,,
6,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,1,7,BB McCullum,SC Ganguly,P Kumar,0,0,0,1,0,0,0,1,1,,,
7,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,2,1,BB McCullum,SC Ganguly,Z Khan,0,0,0,0,0,0,0,0,0,,,
8,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,2,2,BB McCullum,SC Ganguly,Z Khan,0,0,0,0,0,0,4,0,4,,,
9,1,1,Kolkata Knight Riders,Royal Challengers Bangalore,2,3,BB McCullum,SC Ganguly,Z Khan,0,0,0,0,0,0,4,0,4,,,
