# `T20 Cricket: International`

Data source: https://cricsheet.org/downloads/


## Primary Objective:

Build a model that predicts the total runs in the first innings of a men's T20 international, given the state of the match at any point in that innings.

# Importing Libraries

In [36]:
import os
from yaml import safe_load

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Collecting the data

In [15]:
filenames = []
for file in os.listdir('/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s'):
    filenames.append(os.path.join('/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s',file))

In [34]:
print("Total files:",len(filenames))

Total files: 2725


In [25]:
print(filenames[:3])

['/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s/1359789.yaml', '/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s/1173055.yaml', '/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s/951375.yaml']


In [22]:
with open(filenames[0],'r') as f:
    df = pd.json_normalize(safe_load(f))

In [23]:
df

Unnamed: 0,innings,meta.data_version,meta.created,meta.revision,info.balls_per_over,info.city,info.dates,info.gender,info.match_type,info.match_type_number,...,info.registry.people.T Manders,info.registry.people.TS Fray,info.registry.people.VC Ahir,info.registry.people.Vijaya Mallela,info.registry.people.Z Burgess,info.teams,info.toss.winner,info.toss.decision,info.umpires,info.venue
0,"[{'1st innings': {'team': 'Bermuda', 'deliveri...",0.92,2023-03-01,1,6,Buenos Aires,[2023-02-28],male,T20,2009,...,6d2b9724,c7537f8a,0bb68705,85b6244a,d0042e16,"[Bermuda, Panama]",Bermuda,bat,"[N Duguid, Vijaya Mallela]","Belgrano Athletic Club Ground, Buenos Aires"


In [37]:
#Reading all datasets
op = pd.DataFrame()
for file in filenames:
    with open(file,'r') as f:
        df = pd.json_normalize(safe_load(f))
        op = op.append(df)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/kaushal.totla/Desktop/t20/t20_yamls/t20s/README.txt'

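The error above occurs because `README.txt` is still listed among the files in the folder. Since the exception stops the loop at that file, `op` contains only the matches read before it (2,049, as the shape below shows, out of 2,725 listed files). We continue with this subset and check whether the dataset is usable.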
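One way to avoid this error, sketched here under assumptions, is to collect only files with a `.yaml` extension before parsing. The helper name `list_yaml_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_yaml_files(folder):
    """Return sorted paths of the .yaml files directly inside `folder`.

    Hypothetical helper: anything else (for example README.txt) is skipped.
    """
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".yaml"
    )
```

With such a list, each parsed file can be collected in a Python list and combined once with `pd.concat`, which also avoids `DataFrame.append` (removed in pandas 2.0).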

In [39]:
op.head()

Unnamed: 0,innings,meta.data_version,meta.created,meta.revision,info.balls_per_over,info.city,info.dates,info.gender,info.match_type,info.match_type_number,...,info.registry.people.Arshad Mazarzei,info.registry.people.DK Dahani,info.registry.people.Emran Shahbakhsh,info.registry.people.Masood Jayezeh,info.registry.people.Mehran Siasar,info.registry.people.Nader Zahadiafzal,info.registry.people.Naiem Bameri,info.registry.people.Navid Abdollahpour,info.registry.people.Navid Balouch,info.registry.people.Yousef Shadzehisarjou
0,"[{'1st innings': {'team': 'Bermuda', 'deliveri...",0.92,2023-03-01,1,6.0,Buenos Aires,[2023-02-28],male,T20,2009.0,...,,,,,,,,,,
0,"[{'1st innings': {'team': 'West Indies', 'deli...",0.91,2020-03-01,1,6.0,Canberra,[2020-02-26],female,T20,853.0,...,,,,,,,,,,
0,"[{'1st innings': {'team': 'India', 'deliveries...",0.91,2016-03-17,2,6.0,Bangalore,[2016-03-15],female,T20,,...,,,,,,,,,,
0,"[{'1st innings': {'team': 'Rwanda', 'deliverie...",0.92,2021-08-21,2,6.0,Kigali City,[2021-08-21],male,T20,1234.0,...,,,,,,,,,,
0,"[{'1st innings': {'team': 'Qatar', 'deliveries...",0.92,2022-12-16,1,6.0,Nairobi,[2022-12-15],female,T20,1322.0,...,,,,,,,,,,


The dataset is loaded. Let's check its size.

In [43]:
#Creating backup
backup = op.copy()

In [105]:
op.shape

(2049, 5065)

- 2049 matches 
- 5065 descriptive columns

# Cleaning the data

In [106]:
op.columns

Index(['innings', 'meta.data_version', 'meta.created', 'meta.revision',
       'info.balls_per_over', 'info.city', 'info.dates', 'info.gender',
       'info.match_type', 'info.match_type_number',
       ...
       'info.registry.people.Arshad Mazarzei',
       'info.registry.people.DK Dahani',
       'info.registry.people.Emran Shahbakhsh',
       'info.registry.people.Masood Jayezeh',
       'info.registry.people.Mehran Siasar',
       'info.registry.people.Nader Zahadiafzal',
       'info.registry.people.Naiem Bameri',
       'info.registry.people.Navid Abdollahpour',
       'info.registry.people.Navid Balouch',
       'info.registry.people.Yousef Shadzehisarjou'],
      dtype='object', length=5065)

In [107]:
#Creating unique index
op.reset_index(drop=True, inplace=True)

In [108]:
#Filtering only for male
op = op[op['info.gender']=='male']

In [109]:
op.drop(op.columns[op.columns.str.startswith('info.registry.people')], axis=1, inplace=True)

In [110]:
pd.Series(op.columns).apply(lambda x: '.'.join(x.split('.')[:2])).unique()

array(['innings', 'meta.data_version', 'meta.created', 'meta.revision',
       'info.balls_per_over', 'info.city', 'info.dates', 'info.gender',
       'info.match_type', 'info.match_type_number', 'info.outcome',
       'info.overs', 'info.player_of_match', 'info.players', 'info.teams',
       'info.toss', 'info.umpires', 'info.venue', 'info.neutral_venue',
       'info.bowl_out'], dtype=object)

In [111]:
op.head(2)

Unnamed: 0,innings,meta.data_version,meta.created,meta.revision,info.balls_per_over,info.city,info.dates,info.gender,info.match_type,info.match_type_number,...,info.players.Seychelles,info.players.Eswatini,info.players.China,info.players.ICC World XI,info.players.Philippines,info.toss.uncontested,info.players.South Korea,info.players.Gambia,info.players.Israel,info.players.Iran
0,"[{'1st innings': {'team': 'Bermuda', 'deliveri...",0.92,2023-03-01,1,6.0,Buenos Aires,[2023-02-28],male,T20,2009.0,...,,,,,,,,,,
3,"[{'1st innings': {'team': 'Rwanda', 'deliverie...",0.92,2021-08-21,2,6.0,Kigali City,[2021-08-21],male,T20,1234.0,...,,,,,,,,,,


In [112]:
op['info.overs'].value_counts()

20    1397
Name: info.overs, dtype: int64

In [113]:
op['info.match_type'].value_counts()

T20    1397
Name: info.match_type, dtype: int64

In [114]:
op.drop(['info.gender','info.overs','info.match_type','info.balls_per_over',
         'info.toss.winner','info.toss.decision','meta.data_version','meta.created',
         'meta.revision','info.outcome.bowl_out','info.bowl_out',
         'info.outcome.eliminator','info.outcome.result','info.outcome.method','info.neutral_venue',
         'info.match_type_number','info.outcome.by.runs','info.outcome.by.wickets','info.outcome.winner',
         'info.player_of_match','info.umpires','info.toss.uncontested'], axis=1, inplace=True)

In [115]:
op.drop(op.columns[op.columns.str.startswith('info.players')], axis=1, inplace=True)
op.shape

(1397, 5)

In [116]:
op.head()

Unnamed: 0,innings,info.city,info.dates,info.teams,info.venue
0,"[{'1st innings': {'team': 'Bermuda', 'deliveri...",Buenos Aires,[2023-02-28],"[Bermuda, Panama]","Belgrano Athletic Club Ground, Buenos Aires"
3,"[{'1st innings': {'team': 'Rwanda', 'deliverie...",Kigali City,[2021-08-21],"[Rwanda, Ghana]",Gahanga International Cricket Stadium. Rwanda
5,"[{'1st innings': {'team': 'Netherlands', 'deli...",Kirtipur,[2021-04-21],"[Netherlands, Malaysia]",Tribhuvan University International Cricket Gro...
6,"[{'1st innings': {'team': 'West Indies', 'deli...",Lauderhill,[2019-08-03],"[India, West Indies]",Central Broward Regional Park Stadium Turf Ground
7,"[{'1st innings': {'team': 'Malta', 'deliveries...",Marsa,[2021-10-24],"[Malta, Switzerland]",Marsa Sports Club


In [117]:
op = op.reset_index(drop=True)

In [118]:
op.head(2)

Unnamed: 0,innings,info.city,info.dates,info.teams,info.venue
0,"[{'1st innings': {'team': 'Bermuda', 'deliveri...",Buenos Aires,[2023-02-28],"[Bermuda, Panama]","Belgrano Athletic Club Ground, Buenos Aires"
1,"[{'1st innings': {'team': 'Rwanda', 'deliverie...",Kigali City,[2021-08-21],"[Rwanda, Ghana]",Gahanga International Cricket Stadium. Rwanda


In [137]:
op['innings'][0][0]['1st innings'].keys()

dict_keys(['team', 'deliveries'])

In [139]:
op['innings'][0][0]['1st innings']['deliveries']

[{0.1: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},
   'extras': {'wides': 1},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.2: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.3: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.4: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},
   'extras': {'wides': 1},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.5: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.6: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 4, 'batsman': 4},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.7: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},


In [140]:
op['innings'][0][0]['1st innings']['team']

'Bermuda'

Most of the data we need sits at this level of the nested structure (parsed from the YAML file): the batting team and the list of deliveries.
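The nested layout shown above (a list of single-key dicts, one per delivery) can be flattened into one row per ball. The following is a minimal sketch on two deliveries copied from the output above; `flatten_deliveries` is a hypothetical helper, and `'0'` is used as the "no wicket" marker to match the convention used later in this notebook.

```python
def flatten_deliveries(innings):
    """Turn one innings dict (keys 'team' and 'deliveries') into flat rows."""
    rows = []
    for delivery in innings["deliveries"]:
        # Each delivery is a single-key dict: {ball_number: details}
        for ball, details in delivery.items():
            wicket = details.get("wicket")
            rows.append({
                "batting_team": innings["team"],
                "ball_cnt": ball,
                "batsman": details["batsman"],
                "bowler": details["bowler"],
                "runs": details["runs"]["total"],
                # '0' marks a delivery without a wicket
                "player_out": wicket["player_out"] if isinstance(wicket, dict) else "0",
            })
    return rows


sample = {
    "team": "Bermuda",
    "deliveries": [
        {0.1: {"bowler": "AM Patel",
               "runs": {"extras": 1, "total": 1, "batsman": 0},
               "extras": {"wides": 1},
               "non_striker": "T Manders",
               "batsman": "KS Leverock"}},
        {0.2: {"bowler": "AM Patel",
               "runs": {"extras": 0, "total": 0, "batsman": 0},
               "non_striker": "T Manders",
               "batsman": "KS Leverock"}},
    ],
}
rows = flatten_deliveries(sample)
```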

In [150]:
match = op['innings'][0][0]['1st innings']
match_cnt = 1

In [145]:
list(op['innings'][0][0]['1st innings']['deliveries'][0].keys())[0]

0.1

In [148]:
op['innings'][0][0]['1st innings']

{'team': 'Bermuda',
 'deliveries': [{0.1: {'bowler': 'AM Patel',
    'runs': {'extras': 1, 'total': 1, 'batsman': 0},
    'extras': {'wides': 1},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.2: {'bowler': 'AM Patel',
    'runs': {'extras': 0, 'total': 0, 'batsman': 0},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.3: {'bowler': 'AM Patel',
    'runs': {'extras': 0, 'total': 0, 'batsman': 0},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.4: {'bowler': 'AM Patel',
    'runs': {'extras': 1, 'total': 1, 'batsman': 0},
    'extras': {'wides': 1},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.5: {'bowler': 'AM Patel',
    'runs': {'extras': 0, 'total': 0, 'batsman': 0},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.6: {'bowler': 'AM Patel',
    'runs': {'extras': 0, 'total': 4, 'batsman': 4},
    'non_striker': 'T Manders',
    'batsman': 'KS Leverock'}},
  {0.7: {'bowler': 'AM

In [155]:
match['deliveries']

[{0.1: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},
   'extras': {'wides': 1},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.2: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.3: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.4: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},
   'extras': {'wides': 1},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.5: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 0, 'batsman': 0},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.6: {'bowler': 'AM Patel',
   'runs': {'extras': 0, 'total': 4, 'batsman': 4},
   'non_striker': 'T Manders',
   'batsman': 'KS Leverock'}},
 {0.7: {'bowler': 'AM Patel',
   'runs': {'extras': 1, 'total': 1, 'batsman': 0},


In [161]:
op.loc[0]

innings       [{'1st innings': {'team': 'Bermuda', 'deliveri...
info.city                                          Buenos Aires
info.dates                                         [2023-02-28]
info.teams                                    [Bermuda, Panama]
info.venue          Belgrano Athletic Club Ground, Buenos Aires
Name: 0, dtype: object

In [163]:
match['deliveries'][0]

{0.1: {'bowler': 'AM Patel',
  'runs': {'extras': 1, 'total': 1, 'batsman': 0},
  'extras': {'wides': 1},
  'non_striker': 'T Manders',
  'batsman': 'KS Leverock'}}

In [315]:
count = 0
t20_data = pd.DataFrame()
for index, data in op.iterrows():

    match_idx = []
    ball_cnt = []
    batsman = []
    bowler = []
    runs = []
    player_out = []
    teams = []
    batting_team = []
    city = []
    venue = []

    for d in data['innings'][0]['1st innings']['deliveries']:
        for key in d.keys():
            
            match_idx.append(count)
            ball_cnt.append(key)
            
            batsman.append(d[key]['batsman'])
            bowler.append(d[key]['bowler'])
            
            runs.append(d[key]['runs']['total'])
            
            try:
                player_out.append(d[key]['wicket']['player_out'])
            except (KeyError, TypeError):
                # no wicket on this delivery (or an unexpected wicket format)
                player_out.append('0')
            
            teams.append(data['info.teams'])
            batting_team.append(data['innings'][0]['1st innings']['team'])

            city.append(data['info.city'])
            venue.append(data['info.venue'])
            
    temp = pd.DataFrame({
        'match_idx':match_idx,'ball_cnt':ball_cnt,'batsman':batsman,'bowler':bowler,'runs':runs,
        'player_out':player_out,'teams':teams,'batting_team':batting_team,'city':city,'venue':venue
        })
    
    t20_data = t20_data.append(temp)
    count+=1

In [316]:
t20_data.head(2)

Unnamed: 0,match_idx,ball_cnt,batsman,bowler,runs,player_out,teams,batting_team,city,venue
0,0,0.1,KS Leverock,AM Patel,1,0,"[Bermuda, Panama]",Bermuda,Buenos Aires,"Belgrano Athletic Club Ground, Buenos Aires"
1,0,0.2,KS Leverock,AM Patel,0,0,"[Bermuda, Panama]",Bermuda,Buenos Aires,"Belgrano Athletic Club Ground, Buenos Aires"


In [317]:
t20_data.tail(2)

Unnamed: 0,match_idx,ball_cnt,batsman,bowler,runs,player_out,teams,batting_team,city,venue
120,1396,19.5,Mohammad Amir,TG Southee,2,0,"[New Zealand, Pakistan]",Pakistan,,Dubai International Cricket Stadium
121,1396,19.6,Mohammad Amir,TG Southee,0,Mohammad Amir,"[New Zealand, Pakistan]",Pakistan,,Dubai International Cricket Stadium


In [318]:
def bowling_team_find(x):
    for i in x['teams']:
        if i!=x['batting_team']:
            return i

In [319]:
t20_data['bowling_team'] = t20_data.apply(bowling_team_find, axis=1)

In [320]:
t20_data.drop('teams', axis=1, inplace=True)

In [321]:
t20_data.isnull().sum()*100/t20_data.shape[0]

match_idx       0.000000
ball_cnt        0.000000
batsman         0.000000
bowler          0.000000
runs            0.000000
player_out      0.000000
batting_team    0.000000
city            8.047421
venue           0.000000
bowling_team    0.000000
dtype: float64

- About 8% of the rows have a missing `city`.

In [322]:
t20_data[t20_data.city.isnull()]['venue'].value_counts()

Dubai International Cricket Stadium        4783
Harare Sports Club                         2478
Pallekele International Cricket Stadium    1569
Sharjah Cricket Stadium                    1245
Melbourne Cricket Ground                   1079
Sydney Cricket Ground                       623
Sylhet Stadium                              493
Mombasa Sports Club Ground                  371
Adelaide Oval                               371
Moara Vlasiei Cricket Ground                323
Rawalpindi Cricket Stadium                  245
Carrara Oval                                 64
Name: venue, dtype: int64

For every row with a missing `city`, the city can be taken from the ___first word of `venue`___ (for a two-word place such as Moara Vlasiei, this keeps only the first word).

In [323]:
t20_data['city'] = np.where(t20_data['city'].isnull(), t20_data['venue'].str.split(' ').apply(lambda x: x[0]), t20_data['city'])

In [324]:
t20_data.isnull().sum()

match_idx       0
ball_cnt        0
batsman         0
bowler          0
runs            0
player_out      0
batting_team    0
city            0
venue           0
bowling_team    0
dtype: int64

In [325]:
teams = t20_data['batting_team'].value_counts()[:11].index.to_list()
teams

['Pakistan',
 'India',
 'New Zealand',
 'Sri Lanka',
 'South Africa',
 'West Indies',
 'Australia',
 'England',
 'Afghanistan',
 'Bangladesh',
 'Zimbabwe']

- We keep only these 11 most frequent batting teams and drop rows where either side is any other team.

In [326]:
t20_data = t20_data[t20_data['batting_team'].isin(teams)]
t20_data = t20_data[t20_data['bowling_team'].isin(teams)]

In [327]:
t20_data.shape

(73585, 10)

In [328]:
t20_data.head()

Unnamed: 0,match_idx,ball_cnt,batsman,bowler,runs,player_out,batting_team,city,venue,bowling_team
0,3,0.1,JD Campbell,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
1,3,0.2,JD Campbell,Washington Sundar,0,JD Campbell,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
2,3,0.3,N Pooran,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
3,3,0.4,N Pooran,Washington Sundar,1,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
4,3,0.5,E Lewis,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India


- We will treat this as the final cleaned dataset.

In [329]:
t20_data.to_csv('cleaned_t20.csv')

# Data Exploration & Feature Engineering

In [507]:
df = t20_data.copy()

In [508]:
df.head()

Unnamed: 0,match_idx,ball_cnt,batsman,bowler,runs,player_out,batting_team,city,venue,bowling_team
0,3,0.1,JD Campbell,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
1,3,0.2,JD Campbell,Washington Sundar,0,JD Campbell,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
2,3,0.3,N Pooran,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
3,3,0.4,N Pooran,Washington Sundar,1,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India
4,3,0.5,E Lewis,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India


In [509]:
df.shape

(73585, 10)

In [510]:
#No. of unique values
df.nunique()

match_idx       606
ball_cnt        181
batsman         681
bowler          539
runs              8
player_out      606
batting_team     11
city             95
venue           141
bowling_team     11
dtype: int64

- 606 matches across 95 cities and 141 venues.
- Total 11 teams

There are many cities, so we keep the 40 with the most deliveries and drop the rest.

In [511]:
df['city'].value_counts()[:40]

Dubai              4024
Harare             3589
Colombo            3464
Mirpur             2563
Johannesburg       2509
Abu Dhabi          2112
Auckland           2033
Sydney             1950
Lahore             1858
Dhaka              1826
Melbourne          1699
Sharjah            1616
Cape Town          1615
Pallekele          1569
St Lucia           1358
Hamilton           1340
Durban             1273
Nottingham         1241
Centurion          1129
Southampton        1116
London             1115
Karachi            1112
Barbados           1051
Lauderhill          995
Christchurch        988
Wellington          977
Manchester          918
Perth               867
Ahmedabad           861
Kolkata             860
Brisbane            845
Gros Islet          773
Lucknow             754
Mount Maunganui     754
Chandigarh          751
Mumbai              747
Adelaide            742
Delhi               742
Cardiff             733
Bridgetown          685
Name: city, dtype: int64

In [512]:
df = df[df['city'].isin(df['city'].value_counts()[:40].index)]

In [513]:
#player_out holds player names (many unique values), which are not useful to the model, so we convert it to a 0/1 wicket flag
df['player_out'] = np.where(df['player_out']=='0',0,1)

In [514]:
df['player_out'].value_counts()

0    54132
1     3022
Name: player_out, dtype: int64

In [515]:
cumulatives = df.groupby('match_idx').cumsum()
cumulatives

Unnamed: 0,ball_cnt,runs,player_out
0,0.1,0,0
1,0.3,0,1
2,0.6,0,1
3,1.0,1,1
4,1.5,1,1
...,...,...,...
117,1123.6,158,6
118,1142.9,158,7
119,1162.3,159,7
120,1181.8,161,7


We store the cumulative `runs` as `score` and replace `player_out` with its cumulative count, so that each row carries the state of the match up to that ball.

In [516]:
df['score'] = cumulatives['runs']
df['player_out'] = cumulatives['player_out']
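The running totals restart at every match because the cumulative sum is taken per `match_idx` group. A minimal sketch on made-up data follows; the column name `wickets_fallen` is hypothetical and used only here.

```python
import pandas as pd

# Toy data: two matches, three deliveries each (values are made up)
toy = pd.DataFrame({
    "match_idx":  [0, 0, 0, 1, 1, 1],
    "runs":       [1, 0, 4, 0, 6, 2],
    "player_out": [0, 1, 0, 0, 0, 1],
})

# Selecting the numeric columns explicitly keeps text columns out of the sum
running = toy.groupby("match_idx")[["runs", "player_out"]].cumsum()
toy["score"] = running["runs"]
toy["wickets_fallen"] = running["player_out"]
```

Selecting the columns before calling `cumsum` is a design choice: it makes the result independent of how a given pandas version treats non-numeric columns.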

In [517]:
#balls and overs
df['ball_cnt']

0       0.1
1       0.2
2       0.3
3       0.4
4       0.5
       ... 
117    19.2
118    19.3
119    19.4
120    19.5
121    19.6
Name: ball_cnt, Length: 57154, dtype: float64

The part before the decimal point is the over (0-indexed); the part after it is the ball number within that over.

In [518]:
df['overs'] = df['ball_cnt'].astype('str').str.split('.').apply(lambda x:x[0]).astype(int)
df['balls'] = df['ball_cnt'].astype('str').str.split('.').apply(lambda x:x[1]).astype(int)

In [519]:
df[['overs','balls']].describe()

Unnamed: 0,overs,balls
count,57154.0,57154.0
mean,9.400602,3.622599
std,5.760721,1.808484
min,0.0,1.0
25%,4.0,2.0
50%,9.0,4.0
75%,14.0,5.0
max,19.0,12.0


In [520]:
df['balls_left'] = 120 - (df['overs']*6 + df['balls'])
df['balls_left'].describe()

count    57154.000000
mean        59.973790
std         34.611087
min         -3.000000
25%         30.000000
50%         60.000000
75%         90.000000
max        119.000000
Name: balls_left, dtype: float64

In [521]:
df['balls_left'] = df['balls_left'].apply(lambda x: 0 if x<0 else x)
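`balls_left` can fall below zero because an over that contains wides or no-balls has more than six recorded deliveries (the `balls` column above reaches 12), which is why it is floored at zero. The same logic for a single ball label can be sketched as follows; `split_ball` and `balls_left` are hypothetical helper names.

```python
def split_ball(ball_cnt):
    """Split a ball label such as 19.6 into (over, ball_in_over)."""
    over, ball = str(ball_cnt).split(".")
    return int(over), int(ball)


def balls_left(over, ball, total_balls=120):
    """Balls remaining in a 20-over innings, floored at zero."""
    return max(0, total_balls - (over * 6 + ball))


over, ball = split_ball(19.6)
remaining = balls_left(over, ball)
```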

In [522]:
df['wickets_left'] = 10 - df['player_out']
df['wickets_left'].describe()

count    57154.000000
mean         7.383630
std          2.141368
min          0.000000
25%          6.000000
50%          8.000000
75%          9.000000
max         10.000000
Name: wickets_left, dtype: float64

In [523]:
df['score'].describe()

count    57154.000000
mean        75.997813
std         49.885309
min          0.000000
25%         35.000000
50%         71.000000
75%        112.000000
max        258.000000
Name: score, dtype: float64

In [524]:
df['crr'] = df['score']/(df['overs']*6 + df['balls'])
df['crr'].describe()

count    57154.000000
mean         1.216507
std          0.389875
min          0.000000
25%          1.000000
50%          1.223177
75%          1.441176
max         14.000000
Name: crr, dtype: float64

In [525]:
# `overs` is 0-indexed, so overs<=6 flags the first seven overs (a T20 powerplay is the first six)
df['powerplay'] = np.where(df['overs']<=6,1,0)
df['death_overs'] = np.where(df['overs']>=15,1,0)
df[['powerplay','death_overs']].describe()

Unnamed: 0,powerplay,death_overs
count,57154.0,57154.0
mean,0.356388,0.244287
std,0.478936,0.429668
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,1.0,0.0
max,1.0,1.0


In [526]:
df['top_order_batsmen'] = np.where(df['player_out']<3,1,0)
df['middle_order_batsmen'] = np.where((df['player_out']>=3) & (df['player_out']<=4),1,0)
df['lower_order_batsmen'] = np.where((df['player_out']>=5) & (df['player_out']<=7),1,0)
df['tail_order_batsmen'] = np.where(df['player_out']>=8,1,0)

In [527]:
df['pressure_play'] = np.where((df['crr']<7)&(df['overs']>=2),1,0)
#flags deliveries after the first 2 overs where crr<7; crr here is runs per ball (max 14 above), so crr<7 holds for almost every row
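Note that `crr` above divides the score by the number of deliveries, so it is in runs per ball (its mean is about 1.2), whereas a conventional run rate is expressed per over. The conversion is sketched below; `runs_per_over` is a hypothetical helper and not part of the notebook.

```python
def runs_per_over(score, balls_bowled):
    """Run rate in runs per over; 0.0 before any ball is bowled."""
    if balls_bowled == 0:
        return 0.0
    return score * 6 / balls_bowled


rate = runs_per_over(42, 36)  # 42 runs off 36 balls
```

On this per-over scale, 42 runs off 36 balls is a rate of 7, which is the scale on which a threshold such as 7 would be meaningful.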

In [532]:
match_groups = df.groupby('match_idx')
mids = df['match_idx'].unique()
l5overs = []
for idx in mids:
    # rolling sum of runs over the last 30 deliveries (roughly 5 overs) of each match
    l5overs.extend(match_groups.get_group(idx)['runs'].rolling(min_periods=1, window=30).sum().to_list())

In [534]:
df['last_5_overs_score'] = l5overs

In [535]:
df['agressive'] = np.where(df['last_5_overs_score']>=45,1,0)
df['agressive'].value_counts()

0    44158
1    12996
Name: agressive, dtype: int64

In [537]:
pd.set_option('display.max_columns',50)

In [539]:
df.head(2)

Unnamed: 0,match_idx,ball_cnt,batsman,bowler,runs,player_out,batting_team,city,venue,bowling_team,score,overs,balls,balls_left,wickets_left,crr,powerplay,death_overs,top_order_batsmen,middle_order_batsmen,lower_order_batsmen,tail_order_batsmen,pressure_play,last_5_overs_score,agressive
0,3,0.1,JD Campbell,Washington Sundar,0,0,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India,0,0,1,119,10,0.0,1,0,1,0,0,0,0,0.0,0
1,3,0.2,JD Campbell,Washington Sundar,0,1,West Indies,Lauderhill,Central Broward Regional Park Stadium Turf Ground,India,0,0,2,118,9,0.0,1,0,1,0,0,0,0,0.0,0


### Building Target variable

The target variable is the total first-innings score of each match.

In [541]:
totals = df.groupby('match_idx')['runs'].sum()

In [544]:
df = pd.merge(df, totals, how='left', on='match_idx').rename(columns={'runs_x':'runs', 'runs_y':'total_runs'})

In [545]:
df.columns

Index(['match_idx', 'ball_cnt', 'batsman', 'bowler', 'runs', 'player_out',
       'batting_team', 'city', 'venue', 'bowling_team', 'score', 'overs',
       'balls', 'balls_left', 'wickets_left', 'crr', 'powerplay',
       'death_overs', 'top_order_batsmen', 'middle_order_batsmen',
       'lower_order_batsmen', 'tail_order_batsmen', 'pressure_play',
       'last_5_overs_score', 'agressive', 'total_runs'],
      dtype='object')

In [548]:
df.drop(['match_idx','ball_cnt','batsman','bowler','runs','player_out','balls','overs'], axis=1, inplace=True)

In [713]:
df.to_csv('t2_cleaned.csv')

# Data Preparation for Modelling

In [573]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [738]:
x = df.drop(['total_runs','venue'], axis=1)
y = df['total_runs']

In [750]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=10)

In [740]:
x_train.shape, x_test.shape

((45723, 16), (11431, 16))
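A row-level random split places deliveries from the same match in both the training and test sets, and every row of a match shares the same `total_runs`. One alternative, sketched here as an assumption rather than as what this notebook does, is to split by match with scikit-learn's `GroupShuffleSplit` so that each match falls entirely on one side. The data below is made up.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up example: 10 matches with 5 deliveries each
groups = np.repeat(np.arange(10), 5)        # match id for every row
X = np.arange(len(groups)).reshape(-1, 1)   # placeholder features
y = groups * 10.0                           # same target for every row of a match

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=10)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No match id appears on both sides of the split
shared = set(groups[train_idx]) & set(groups[test_idx])
```

In this notebook, that would require keeping `match_idx` until after the split, since it is dropped earlier.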

In [741]:
x_train['batting_team'].value_counts()

Pakistan        6770
India           6004
New Zealand     5537
Sri Lanka       4549
Australia       4507
South Africa    4165
West Indies     3995
England         3613
Bangladesh      2661
Afghanistan     1985
Zimbabwe        1937
Name: batting_team, dtype: int64

In [742]:
x_test['batting_team'].value_counts()

Pakistan        1723
India           1430
New Zealand     1347
Sri Lanka       1144
Australia       1123
West Indies     1029
South Africa     991
England          857
Bangladesh       762
Zimbabwe         516
Afghanistan      509
Name: batting_team, dtype: int64

In [743]:
cat_cols = ['batting_team', 'city', 'bowling_team'] #removing `venue` and checking
num_cols = ['score', 'balls_left','wickets_left', 'crr', 'last_5_overs_score']
binary_cols = [c for c in x.columns if c not in cat_cols + num_cols] #remaining 0/1 flag columns, in a stable order

In [744]:
transformations = ColumnTransformer([
    ('cols',OneHotEncoder(drop='first',sparse=False),cat_cols),
    ('nums',StandardScaler(),num_cols)
], remainder='passthrough')

In [745]:
ohe = OneHotEncoder(sparse=False, drop='first')
x_train_cats = ohe.fit_transform(x_train[cat_cols])
x_test_cats = ohe.transform(x_test[cat_cols])

In [746]:
scale = StandardScaler()
x_train_nums = scale.fit_transform(x_train[num_cols])
x_test_nums = scale.transform(x_test[num_cols])

In [747]:
x_train = np.hstack([x_train_cats,x_train_nums,x_train[binary_cols]])
x_test = np.hstack([x_test_cats,x_test_nums,x_test[binary_cols]])

In [748]:
x_train.shape

(45723, 72)

In [751]:
transformations.fit_transform(x_train).shape

(45723, 72)

# Modelling

- Because the dataset is large (about 57,000 rows), we avoid __distance__-based models.
- Because predictions need low latency, we also avoid SVMs.

We will consider the following models:
- LinearRegression as the baseline (Lasso and Ridge are its regularised variants)
- RandomForestRegressor
- GradientBoostingRegressor
- XGBRegressor


Performance metrics: `R2` and `RMSE`
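For reference, the two metrics can be written out directly. This is a plain-Python sketch of the standard definitions (the notebook itself uses scikit-learn's `r2_score` and `mean_squared_error`); the example values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (runs)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [150, 160, 170]
predicted = [155, 160, 165]
example_rmse = rmse(actual, predicted)
example_r2 = r2(actual, predicted)
```

RMSE reads as a typical error in runs, while R2 is the share of variance in the totals that the model explains.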

In [675]:
from sklearn.linear_model import Lasso,Ridge,LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

from sklearn.metrics import r2_score, mean_squared_error as mse

#### Linear Regression

In [676]:
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)

In [677]:
print("Train:")
print(r2_score(y_train, y_pred_train))
print(mse(y_train, y_pred_train)**(0.5))

print("Test:")
print(r2_score(y_test, y_pred_test))
print(mse(y_test, y_pred_test)**(0.5))

Train:
0.6128930305615583
20.48983568730257
Test:
0.6195675633413045
20.466217757142704


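- Linear Regression scores poorly (R2 of about 0.61 to 0.62, RMSE of about 20.5). Train and test scores are nearly identical, so the model is underfitting and adding regularisation (Lasso, Ridge) is unlikely to help.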

#### RandomForest

In [678]:
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred_train = rf.predict(x_train)
y_pred_test = rf.predict(x_test)

In [679]:
print("Train:")
print(r2_score(y_train, y_pred_train))
print(mse(y_train, y_pred_train)**(0.5))

print("Test:")
print(r2_score(y_test, y_pred_test))
print(mse(y_test, y_pred_test)**(0.5))

Train:
0.9923315098250338
2.8838848347995962
Test:
0.9565320929160185
6.9180418706223294


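- Random Forest performs far better (test R2 of about 0.957, RMSE of about 6.9), although the gap to the training scores (R2 of about 0.992, RMSE of about 2.9) points to some overfitting. We will consider it for hyperparameter tuning.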

#### GBDT

In [680]:
gb = GradientBoostingRegressor()
gb.fit(x_train, y_train)
y_pred_train = gb.predict(x_train)
y_pred_test = gb.predict(x_test)

In [681]:
print("Train:")
print(r2_score(y_train, y_pred_train))
print(mse(y_train, y_pred_train)**(0.5))

print("Test:")
print(r2_score(y_test, y_pred_test))
print(mse(y_test, y_pred_test)**(0.5))

Train:
0.6833175131855541
18.532542461666914
Test:
0.681831072028187
18.71663208709132


- Poor performance with default settings (test R2 of about 0.68, RMSE of about 18.7), only slightly better than `linear regression`.

#### XGBRegressor

In [682]:
xgb = XGBRegressor()
xgb.fit(x_train, y_train)
y_pred_train = xgb.predict(x_train)
y_pred_test = xgb.predict(x_test)

In [683]:
print("Train:")
print(r2_score(y_train, y_pred_train))
print(mse(y_train, y_pred_train)**(0.5))

print("Test:")
print(r2_score(y_test, y_pred_test))
print(mse(y_test, y_pred_test)**(0.5))

Train:
0.9365733117291603
8.293903890997907
Test:
0.9252506739557986
9.07198789038771


- Fast to train, with a small gap between train and test scores.
- Good R2 (about 0.925 on test), but its test RMSE (about 9.1) is higher than Random Forest's (about 6.9).

# Hyperparameter Tuning

In [693]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [706]:
rf_gs = RandomForestRegressor(n_jobs=-1)

#Random forest
rf_params_grid = {'n_estimators':[70,100,150,200],
              'max_features':['sqrt','log2',None]}

In [700]:
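# `scores` is not defined in the cells shown. The result columns below (`test_r2`, `test_mse`, with positive MSE values)
# are consistent with a definition like this one, which is an assumption:
scores = {'r2': make_scorer(r2_score), 'mse': make_scorer(mse)}
rf_gscv = GridSearchCV(rf_gs, rf_params_grid, cv=3, scoring=scores, refit='r2')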
rf_gscv.fit(x_train, y_train)

In [702]:
pd.DataFrame(rf_gscv.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_features,param_n_estimators,params,split0_test_r2,split1_test_r2,split2_test_r2,mean_test_r2,std_test_r2,rank_test_r2,split0_test_mse,split1_test_mse,split2_test_mse,mean_test_mse,std_test_mse,rank_test_mse
0,1.493787,1.05722,0.044774,0.004171,sqrt,70,"{'max_features': 'sqrt', 'n_estimators': 70}",0.932209,0.935015,0.931955,0.93306,0.001386,12,75.578158,69.826288,72.415571,72.606672,2.352076,1
1,0.95473,0.008988,0.05392,0.000596,sqrt,100,"{'max_features': 'sqrt', 'n_estimators': 100}",0.931891,0.934955,0.932681,0.933175,0.001299,11,75.933214,69.890882,71.643523,72.489206,2.538218,2
2,1.332153,0.021873,0.074145,0.000714,sqrt,150,"{'max_features': 'sqrt', 'n_estimators': 150}",0.932257,0.935736,0.933615,0.933869,0.001432,10,75.525144,69.051798,70.648941,71.741961,2.753431,3
3,1.701704,0.030918,0.094689,0.000741,sqrt,200,"{'max_features': 'sqrt', 'n_estimators': 200}",0.933314,0.936634,0.934696,0.934882,0.001362,9,74.346096,68.086361,69.498435,70.643631,2.680756,4
4,0.698117,0.008021,0.041712,0.000227,log2,70,"{'max_features': 'log2', 'n_estimators': 70}",0.933593,0.935951,0.936646,0.935397,0.001306,7,74.034958,68.820854,67.423706,70.093172,2.845026,6
5,0.895202,0.003696,0.053526,0.000274,log2,100,"{'max_features': 'log2', 'n_estimators': 100}",0.933901,0.938259,0.933875,0.935345,0.00206,8,73.692226,66.340934,70.372124,70.135094,3.005829,5
6,1.236849,0.003049,0.074033,0.000603,log2,150,"{'max_features': 'log2', 'n_estimators': 150}",0.935278,0.940151,0.935653,0.937027,0.002214,4,72.156763,64.307236,68.480361,68.314786,3.206694,8
7,1.57935,0.010883,0.096566,0.003325,log2,200,"{'max_features': 'log2', 'n_estimators': 200}",0.934032,0.939085,0.937673,0.93693,0.002129,6,73.545554,65.453424,66.329905,68.442961,3.625778,7
8,1.988187,0.025054,0.043584,0.006951,,70,"{'max_features': None, 'n_estimators': 70}",0.938357,0.935674,0.936962,0.936998,0.001096,5,68.723842,69.118643,67.087451,68.309978,0.879355,9
9,2.933981,0.04543,0.052655,0.004849,,100,"{'max_features': None, 'n_estimators': 100}",0.939932,0.937377,0.93703,0.938113,0.001294,3,66.968367,67.288008,67.014454,67.090276,0.141078,10


In [711]:
68.309978**(0.5)

8.264985057457757

From the full summary (the table above shows only the first 10 of 12 rows), the best results are as follows, with RMSE taken as the square root of the mean MSE:
1. `{'max_features': None, 'n_estimators': 200}` | Mean R2 = 0.938733 | Mean RMSE = 8.15
2. `{'max_features': None, 'n_estimators': 150}` | Mean R2 = 0.938189 | Mean RMSE = 8.18
3. `{'max_features': None, 'n_estimators': 100}` | Mean R2 = 0.938113 | Mean RMSE = 8.19
4. `{'max_features': None, 'n_estimators': 70}`  | Mean R2 = 0.936998 | Mean RMSE = 8.26

We will go with `{'max_features': None, 'n_estimators': 100}`.

- Its mean R2 is within about 0.001 of the best configuration, and using fewer trees (estimators) keeps predictions fast.


- We will also use the cross-validated RMSE of about 8 runs to report a range of roughly ±8 around the predicted final score.
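A minimal sketch of how such a range could be reported, assuming a symmetric band of one RMSE around the point prediction. `score_range` is a hypothetical helper, and an RMSE band is a rough indication of error rather than a formal confidence interval.

```python
def score_range(predicted_total, rmse=8.0):
    """Return (low, high) as prediction -/+ one RMSE, rounded to whole runs."""
    low = round(predicted_total - rmse)
    high = round(predicted_total + rmse)
    # A total cannot be negative
    return max(0, low), high


low, high = score_range(164.3)
```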

# Building Pickle Files for deployment

In [803]:
import pickle

In [712]:
model = RandomForestRegressor(n_estimators=100, max_features=None, n_jobs=-1)
model.fit(x_train, y_train)

In [796]:
cols = ColumnTransformer([
    ('ohe',OneHotEncoder(sparse=False,drop='first'),cat_cols),
    ('scale',StandardScaler(),num_cols)
], remainder='passthrough')

In [797]:
cols.fit(x_train)

In [800]:
pipe = Pipeline([
    ('transformations',cols),
    ('model',RandomForestRegressor(n_estimators=100, max_features=None, n_jobs=-1))
])

In [801]:
pipe.fit(x_train, y_train)

In [819]:
# Exporting pickle file
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe, f)

In [816]:
x.columns

Index(['batting_team', 'city', 'bowling_team', 'score', 'balls_left',
       'wickets_left', 'crr', 'powerplay', 'death_overs', 'top_order_batsmen',
       'middle_order_batsmen', 'lower_order_batsmen', 'tail_order_batsmen',
       'pressure_play', 'last_5_overs_score', 'agressive'],
      dtype='object')
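To use the exported pipeline, the input must be a DataFrame with the same columns as `x` above. The sketch below is self-contained rather than a reproduction of the notebook's model: it fits a small stand-in pipeline on made-up rows with a reduced set of columns, round-trips it through `pickle` in memory, and predicts for one match state. All data values are invented for illustration.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using a subset of the notebook's feature columns
train = pd.DataFrame({
    "batting_team": ["India", "Pakistan", "India", "Pakistan"],
    "city":         ["Dubai", "Dubai", "Lahore", "Lahore"],
    "score":        [40, 55, 120, 90],
    "balls_left":   [90, 80, 30, 45],
})
totals = [160, 170, 185, 150]

stand_in = Pipeline([
    ("transformations", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["batting_team", "city"]),
        ("scale", StandardScaler(), ["score", "balls_left"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
stand_in.fit(train, totals)

# Round trip through pickle, as with pipe.pkl (in memory here, no file needed)
restored = pickle.loads(pickle.dumps(stand_in))

# One match state, passed as a single-row DataFrame with the training columns
match_state = pd.DataFrame([{
    "batting_team": "India", "city": "Dubai", "score": 60, "balls_left": 70,
}])
prediction = restored.predict(match_state)[0]
```

Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that created them.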