### Aim: Run several sense checks on the data to gain confidence that it has been preprocessed appropriately.

In [1]:
import pandas as pd

df = pd.read_csv('preprocessed_data.csv')
df.head(5)

Unnamed: 0,matchid,innings,over,ball,runs,wickets,chasing_team_won,total_chasing
0,1001349,1,0,1,0,0,1.0,
1,1001349,1,0,2,0,0,1.0,
2,1001349,1,0,3,1,0,1.0,
3,1001349,1,0,4,3,0,1.0,
4,1001349,1,0,5,3,0,1.0,


In [2]:
# Inspect rows where the match outcome is missing (chasing_team_won is NaN)
df[df['chasing_team_won'].isna()]

Unnamed: 0,matchid,innings,over,ball,runs,wickets,chasing_team_won,total_chasing
33667,1142504,1,0,1,0,0,,
33668,1142504,1,0,2,1,0,,
33669,1142504,1,0,3,1,0,,
33670,1142504,1,0,4,1,0,,
33671,1142504,1,0,5,5,0,,
...,...,...,...,...,...,...,...,...
683528,287862,2,19,116,135,6,,141.0
683529,287862,2,19,117,137,6,,141.0
683530,287862,2,19,118,141,6,,141.0
683531,287862,2,19,119,141,6,,141.0


#### Check 1 - Number of matches

We started with 3,915 matches, but discarded any match that had only one innings (i.e. unfinished) or an innings with more than 120 balls (a data error). The remaining count should therefore be somewhat below 3,915.

In [3]:
df['matchid'].nunique()

3800
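
The discard rule described above can be expressed as a filter. The following is a minimal sketch on toy data, not the notebook's actual preprocessing code: `raw`, `clean`, and `kept_matches` are hypothetical names, and it assumes "more than 120 balls" can be detected from the highest `ball` number in a match.

```python
import pandas as pd

# Toy data (hypothetical): match 1 is valid, match 2 has only one innings,
# match 3 contains an innings with a ball numbered above 120.
raw = pd.DataFrame({
    "matchid": [1] * 4 + [2] * 2 + [3] * 4,
    "innings": [1, 1, 2, 2, 1, 1, 1, 1, 2, 2],
    "ball":    [1, 2, 1, 2, 1, 2, 1, 121, 1, 2],
})

# Number of distinct innings and the highest ball number seen in each match,
# broadcast back onto every row of that match.
n_innings = raw.groupby("matchid")["innings"].transform("nunique")
max_ball = raw.groupby("matchid")["ball"].transform("max")

# Keep only matches with two innings and no ball number above 120.
clean = raw[(n_innings == 2) & (max_ball <= 120)]
kept_matches = clean["matchid"].unique().tolist()
print(kept_matches)
```

Only match 1 survives in this toy example, mirroring how the 3,915 starting matches shrink to the count reported above.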

#### Check 2 - Number of innings

This should be 2 × the number of matches, i.e. 2 × 3,800 = 7,600.

In [4]:
df.groupby(['matchid', 'innings']).ngroups

7600
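
A matching total could in principle hide offsetting errors (for example, one match with three innings labels and another with one). A stricter per-match version of this check might look like the sketch below, where `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the notebook's `df`.
toy = pd.DataFrame({
    "matchid": [10, 10, 10, 11, 11],
    "innings": [1, 1, 2, 1, 2],
})

# Every match should contain exactly two distinct innings...
innings_per_match = toy.groupby("matchid")["innings"].nunique()
all_two = bool((innings_per_match == 2).all())

# ...and the innings labels should only ever be 1 or 2.
labels_ok = bool(toy["innings"].isin([1, 2]).all())
print(all_two, labels_ok)
```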

#### Check 3 - Maximum number of overs per innings

This has to be 19, since an innings has at most 20 overs (120 balls) and overs are 0-indexed.

In [5]:
df['over'].max()

np.int64(19)
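
Beyond the maximum, the `over` and `ball` columns can be checked against each other. The sketch below assumes the relationship suggested by the sample rows shown earlier (1-indexed `ball`, 0-indexed `over`, six balls per over), so that `over == (ball - 1) // 6`. That formula is an inference from the displayed rows, not something the preprocessing code is known to guarantee.

```python
import pandas as pd

# Toy rows (hypothetical) covering the first, second, and last over.
toy = pd.DataFrame({
    "over": [0, 0, 1, 19, 19],
    "ball": [1, 6, 7, 115, 120],
})

# Assumed relationship: six legal balls per over, ball numbering starts at 1.
expected_over = (toy["ball"] - 1) // 6

# Any row where the recorded over disagrees with the derived one is suspect.
mismatches = toy[toy["over"] != expected_over]
print(len(mismatches))
```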

#### Check 4 - Maximum number of balls per innings

This has to be 120. To clarify, we do not treat wides and no-balls as extra balls. If a bowler bowls a wide on the first delivery of the second over and the batsman then scores a 4, we record that as 5 runs scored off the 7th ball of the innings.

In [6]:
df['ball'].max()

np.int64(120)
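
The wide example above can be worked through in code. This is a minimal sketch under stated assumptions: `deliveries`, `is_legal`, and `runs_per_ball` are hypothetical names, and the raw data is assumed to have one row per delivery with a flag marking wides and no-balls as not legal.

```python
import pandas as pd

# Hypothetical raw deliveries: six dot balls in the first over, then a wide
# (1 run) followed by a four at the start of the second over.
deliveries = pd.DataFrame({
    "runs":     [0, 0, 0, 0, 0, 0, 1, 4],
    "is_legal": [True, True, True, True, True, True, False, True],
})

# An illegal delivery is attached to the next legal ball, so each delivery's
# ball number is (legal deliveries bowled before it) + 1.
legal_before = deliveries["is_legal"].cumsum() - deliveries["is_legal"].astype(int)
deliveries["ball"] = legal_before + 1

# Runs from the wide and the four are summed onto the same ball number.
runs_per_ball = deliveries.groupby("ball")["runs"].sum()
print(runs_per_ball.loc[7])  # 5
```

With this numbering the wide never creates a ball of its own, which is why the ball counter cannot exceed 120 in a complete innings.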

#### Check 5 - Average number of balls per innings

We should expect a number slightly below 120, since a team can be bowled out or complete a chase before the 120th ball.

In [7]:
len(df) / (df['matchid'].nunique() * 2)

108.77644736842105
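
The overall average can be split by innings, which is useful because only the second innings can end early through a completed chase. A sketch on toy data, assuming one row per ball; `toy`, `balls_per_innings`, and `by_innings` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical): one match, 3 balls in innings 1 and 2 in innings 2.
toy = pd.DataFrame({
    "matchid": [1, 1, 1, 1, 1],
    "innings": [1, 1, 1, 2, 2],
    "ball":    [1, 2, 3, 1, 2],
})

# Row count per (match, innings) pair, i.e. balls faced in each innings.
balls_per_innings = toy.groupby(["matchid", "innings"]).size()

# Overall average, equivalent to len(df) / number of innings.
avg_balls = balls_per_innings.mean()

# Average separately for first and second innings.
by_innings = balls_per_innings.groupby(level="innings").mean()
print(avg_balls, by_innings.to_dict())
```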

#### Check 6 - Proportion of chasing teams that won

We should expect this to be around 50%. Matches with no recorded result (`chasing_team_won` is NaN) are skipped by `mean()`.

In [8]:
df.groupby('matchid')['chasing_team_won'].mean().mean()

np.float64(0.48550342646283606)
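
Because the figure above is a mean of per-match means, it is only meaningful if `chasing_team_won` holds a single value per match. The sketch below checks that on toy data and counts matches with a missing result; all names other than the column are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): a chase won, a chase lost, and a match with no result.
toy = pd.DataFrame({
    "matchid": [1, 1, 2, 2, 3, 3],
    "chasing_team_won": [1.0, 1.0, 0.0, 0.0, None, None],
})

per_match = toy.groupby("matchid")["chasing_team_won"]

# Each match should carry exactly one distinct value (NaN counted as a value).
constant = bool((per_match.nunique(dropna=False) == 1).all())

# One result per match; first() yields NaN when every row in the match is NaN.
result = per_match.first()
n_missing = int(result.isna().sum())

# mean() skips NaN, so matches with no result are excluded from the rate.
win_rate = result.mean()
print(constant, n_missing, win_rate)
```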

#### Check 7 - Average chasing target

We should expect this to be around 140 runs.

In [9]:
df.groupby('matchid')['total_chasing'].mean().mean()

np.float64(138.10631578947368)
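
In the sample rows shown earlier, `total_chasing` is empty for first-innings rows and populated for second-innings rows. Assuming that pattern is intended throughout, it can be verified along these lines (a sketch on toy data with hypothetical variable names):

```python
import pandas as pd

# Toy data (hypothetical): total_chasing is only populated in the second innings.
toy = pd.DataFrame({
    "matchid": [1, 1, 1, 1],
    "innings": [1, 1, 2, 2],
    "total_chasing": [None, None, 141.0, 141.0],
})

# Assumption: first-innings rows never carry a value.
first_innings_empty = bool(
    toy.loc[toy["innings"] == 1, "total_chasing"].isna().all()
)

# Within the second innings, each match should have exactly one distinct value.
second = toy[toy["innings"] == 2]
values_per_match = second.groupby("matchid")["total_chasing"].nunique()
one_value_each = bool((values_per_match == 1).all())
print(first_innings_empty, one_value_each)
```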

#### Check 8 - Average number of wickets taken

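We should expect this to be around 8. Note that the code below takes the highest wicket count reached in either innings of each match, then averages across matches.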

In [10]:
df.groupby('matchid')['wickets'].max().mean()

np.float64(8.188421052631579)
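
Grouping by `matchid` alone gives the higher of the two innings' wicket counts. A per-innings average, plus a bound check that no innings exceeds 10 wickets, could be sketched as follows. This assumes `wickets` is a running total within each innings, as the sample rows suggest; `toy` and the result names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): 8 wickets fall in innings 1, 4 in innings 2.
toy = pd.DataFrame({
    "matchid": [1, 1, 1, 1],
    "innings": [1, 1, 2, 2],
    "wickets": [3, 8, 1, 4],
})

# The notebook's calculation: highest count in either innings of each match.
per_match = toy.groupby("matchid")["wickets"].max().mean()

# Alternative: final wicket count of every innings, averaged over all innings.
per_innings = toy.groupby(["matchid", "innings"])["wickets"].max().mean()

# A side cannot lose more than 10 wickets in an innings.
within_limit = bool((toy["wickets"] <= 10).all())
print(per_match, per_innings, within_limit)
```

The per-innings figure will be lower than the per-match one whenever the two innings of a match differ, so the two should not be compared against the same expected value.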