# Data Exploration
This document explores the raw data to determine what must be cleaned before modelling can begin. The data initialisation notebook was developed in parallel with this document, which exposes the inconsistencies and areas for improvement in the raw dataset.

To initialise this document, we must first load the necessary information and libraries:

In [1]:
# Perform necessary imports.
import pandas as pd
from lib.constants import DATA_PATH


## Exploring Match Data
In this section, we explore the raw match data, including:

1. Columns of the match data.
2. Gender of each match.
3. Recorded teams.
4. Teams of each international match.
5. Format length of each match.
6. Missing data.
7. Duplicate rows.
8. The series each match belongs to.

This section aims to determine which features of the match dataset should be considered during the cleaning phase. For example, as the project only aims to predict a batter’s performance in One-Day International cricket, it is not necessary to include Test or T20 International matches. By determining the minimum subset of eligible matches, the deliveries dataset can be reduced to allow data processing to occur more easily.

To begin with, we must load the raw match dataset:

In [2]:
# Load match data.
match_data = pd.read_csv(DATA_PATH + "/Matches.txt", delimiter="\t")


### 1. Explore the columns of the match data.

We can determine which fields are important for this project by exploring the columns of the match data.

In [3]:
# Print the columns of match data.
match_data.columns.values.tolist()


['Match Id',
 'Season Id',
 'Season',
 'Series Id',
 'Series',
 'Series Gender Id',
 'Series Gender',
 'Match Date',
 'Match YYMMDD',
 'Match Type Id',
 'Match Type',
 'Ball Type Id',
 'Ball Type',
 'TeamA Id',
 'TeamA',
 'TeamA At Home',
 'TeamB Id',
 'TeamB',
 'TeamB At Home',
 'Day/Night',
 'Venue Id',
 'Venue',
 'Toss Won By Id',
 'Toss Decision Id',
 'TeamA Innings1 Closure',
 'TeamA Innings2 Closure',
 'TeamB Innings1 Closure',
 'TeamB Innings2 Closure',
 'TeamA 1st Comparison',
 'TeamA Result Id',
 'TeamA Result',
 'TeamBattingIdMatchInnings1',
 'TeamBattingMatchInnings1',
 'TeamBattingIdMatchInnings2',
 'TeamBattingMatchInnings2',
 'TeamBattingIdMatchInnings3',
 'TeamBattingMatchInnings3',
 'TeamBattingIdMatchInnings4',
 'TeamBattingMatchInnings4',
 'TeamB Result Id',
 'TeamB Result',
 'TeamA Coach Id',
 'TeamA Coach Surname',
 'TeamA Coach Other Names',
 'TeamB Coach Id',
 'TeamB Coach Surname',
 'TeamB Coach Other Names',
 'Round Id',
 'Round',
 'Round Number',
 'Match Number

### 2. Explore series gender.

Due to large differences between male and female formats of cricket (e.g., boundary and ball size), it is necessary to focus on only one gender. Hence, we must determine the different genders present in the dataset.

In [4]:
# Print series gender.
print("Match genders: {}".format(match_data["Series Gender"].unique()))

# Print series gender ID.
print("Match gender IDs: {}".format(match_data["Series Gender Id"].unique()))


Match genders: ['Male' 'Female']
Match gender IDs: [1 2]
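
Once a gender has been chosen, the filter itself is a single boolean mask. The following is a minimal sketch only: the small `matches` frame is a hypothetical stand-in for `match_data`, and keeping the male series is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for match_data, using the column names printed above.
matches = pd.DataFrame({
    "Match Id": [1, 2, 3, 4],
    "Series Gender Id": [1, 2, 1, 2],
    "Series Gender": ["Male", "Female", "Male", "Female"],
})

# Keep only matches from male series (an assumption for this sketch).
male_matches = matches[matches["Series Gender"] == "Male"]

print(male_matches["Match Id"].tolist())
```

Filtering on the text label rather than the numeric ID avoids relying on which ID corresponds to which gender, which the output above does not state explicitly.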


### 3. Explore teams.

It is necessary to understand which teams have been recorded in the match data. Notably, we must consider disability teams that may have been recorded. Due to large differences between disability and non-disability matches, it is necessary to focus on one or the other.

In [5]:
# Print all recorded teams.
match_data["TeamA"].unique().tolist()


['Australia (M)',
 'West Indies (M)',
 'Pakistan (M)',
 'Sri Lanka (M)',
 'South Africa (M)',
 'England (M)',
 'New Zealand (M)',
 'India (M)',
 'SA (M)',
 'Tas (M)',
 'NSW (M)',
 'Victoria (M)',
 'WA (M)',
 'Aus Intellectual Disability',
 'Qld (M)',
 'Australia (F)',
 'WA (F)',
 'Qld Fire (F)',
 'New Zealand (F)',
 'NSW Breakers (F)',
 'Southern Scorpions (F)',
 'Australia A (M)',
 'Zimbabwe (M)',
 'Kenya (M)',
 'ACT Meteors (F)',
 'Victoria (F)',
 'England (F)',
 'Scotland (M)',
 'Bangladesh (M)',
 'Ireland (M)',
 'India (F)',
 'West Indies (F)',
 'Tasmania Tigers (F)',
 'Sydney Sixers (M)',
 'Melbourne Stars (M)',
 'Adelaide Strikers (M)',
 'Perth Scorchers  (M)',
 'Brisbane Heat (M)',
 'Hobart Hurricanes (M)',
 'Melbourne Renegades (M)',
 'Sydney Thunder (M)',
 'Canada (M)',
 'Gloucestershire (M)',
 'South Africa Women (F)',
 'Afghanistan (M)',
 'India A (M)',
 'South Africa A (M)',
 'Pakistan (F)',
 'Sri Lanka (F)',
 'Supernovas (F)',
 'Southern Vipers (F)',
 'Surrey Stars (F)',
 

### 4. Check international games.

Only international matches that Australia participates in are important for this project. It is necessary to explore the teams participating in each recorded international match.

In [6]:
# Check for international games where Australia is not playing.
match_data[match_data.Series.str.contains("International") &
           ~match_data.TeamA.str.contains("Australia") &
           ~match_data.TeamB.str.contains("Australia")][["TeamA", "TeamB"]]


Unnamed: 0,TeamA,TeamB
19,West Indies (M),Zimbabwe (M)
75,Sri Lanka (M),England (M)
76,Sri Lanka (M),England (M)
77,Sri Lanka (M),England (M)
78,South Africa (M),Sri Lanka (M)
...,...,...
4018,India (M),England (M)
4019,India (M),England (M)
4020,India (M),England (M)
4021,India (M),England (M)
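
The check above can be inverted to drop those matches during cleaning. The sketch below uses a hypothetical `matches` frame in place of `match_data`, and it assumes that domestic matches are retained while international matches are kept only when Australia takes part.

```python
import pandas as pd

# Hypothetical stand-in for match_data.
matches = pd.DataFrame({
    "Match Id": [1, 2, 3, 4],
    "Series": ["International ODI M", "International ODI M",
               "Domestic List A M", "International Tests M"],
    "TeamA": ["Australia (M)", "Sri Lanka (M)", "NSW (M)", "India (M)"],
    "TeamB": ["India (M)", "England (M)", "Victoria (M)", "Australia (M)"],
})

is_international = matches["Series"].str.contains("International")
has_australia = (matches["TeamA"].str.contains("Australia")
                 | matches["TeamB"].str.contains("Australia"))

# Keep domestic matches, plus international matches that involve Australia.
kept = matches[~is_international | has_australia]

print(kept["Match Id"].tolist())
```

Note that the substring test also matches names such as `Australia A (M)`, so a stricter comparison would be needed if only the senior side should count.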


### 5. Explore match format length.

At the international level, the only formats played consistently are T20, One-Day, and Five-Day. Domestically, Four-Day formats are preferred over Five-Day. All other formats are played irregularly and should not be considered. 

In [7]:
# Print match types.
print("Match formats: {}".format(match_data["Match Type"].unique()))

# Print match type Ids.
print("Match format IDs: {}".format(match_data["Match Type Id"].unique()))

Match formats: ['5 Day' '1 Day' '4 Day' 'Twenty20' '3 Day' '2 Day']
Match format IDs: [5 1 4 7 3 2]
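
Given the formats listed above, the irregular ones can be excluded with `isin`. This is a sketch under assumptions: `matches` is a hypothetical stand-in for `match_data`, and `REGULAR_FORMATS` is an illustrative name for the formats the text describes as regularly played.

```python
import pandas as pd

# Hypothetical stand-in for match_data, one row per format seen above.
matches = pd.DataFrame({
    "Match Id": [1, 2, 3, 4, 5, 6],
    "Match Type": ["5 Day", "1 Day", "4 Day", "Twenty20", "3 Day", "2 Day"],
})

# Formats described above as regularly played (illustrative constant).
REGULAR_FORMATS = ["Twenty20", "1 Day", "4 Day", "5 Day"]

regular_matches = matches[matches["Match Type"].isin(REGULAR_FORMATS)]
dropped_formats = sorted(set(matches["Match Type"]) - set(REGULAR_FORMATS))

print(regular_matches["Match Id"].tolist())
print(dropped_formats)
```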


### 6. Check for columns containing NaN.

It is necessary to explore which fields of the match dataset contain missing data to determine the importance of each field.

In [8]:
# Print columns containing NaN.
match_data.columns[match_data.isna().any()].tolist()


['TeamBattingMatchInnings1',
 'TeamBattingMatchInnings2',
 'TeamBattingMatchInnings3',
 'TeamBattingMatchInnings4',
 'TeamA Coach Id',
 'TeamA Coach Surname',
 'TeamA Coach Other Names',
 'TeamB Coach Id',
 'TeamB Coach Surname',
 'TeamB Coach Other Names',
 'Data Source',
 'Official1 Id',
 'Official1 Surname',
 'Official1 Other Names',
 'Official2 Id',
 'Official2 Surname',
 'Official2 Other Names',
 'Official3 Id',
 'Official3 Surname',
 'Official3 Other Names',
 'Official4 Id',
 'Official4 Surname',
 'Official4 Other Names',
 'Official5 Id',
 'Official5 Surname',
 'Official5 Other Names',
 'Official6 Id',
 'Official6 Surname',
 'Official6 Other Names']
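
Knowing that a column contains NaN says little about how much of it is missing. One way to rank the fields, sketched here on a hypothetical frame rather than the real `match_data`, is to compute the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for match_data with varying amounts of missing data.
matches = pd.DataFrame({
    "Match Id": [1, 2, 3, 4],
    "TeamA Coach Id": [10.0, np.nan, np.nan, 12.0],
    "Official6 Id": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, most incomplete first.
missing_fraction = matches.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
missing_fraction = missing_fraction[missing_fraction > 0]

print(missing_fraction)
```

A column that is almost entirely empty is a candidate for removal, whereas a mostly complete column may be worth keeping and imputing.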

### 7. Check for duplicate rows.

Duplicate matches may cause issues when summarising player data. In this section, we investigate the presence of duplicate rows in the match dataset. Particularly, we wish to know if matches have been recorded multiple times.

In [9]:
# Print duplicate rows.
match_data[match_data.duplicated(keep=False)]


Unnamed: 0,Match Id,Season Id,Season,Series Id,Series,Series Gender Id,Series Gender,Match Date,Match YYMMDD,Match Type Id,...,Official3 Other Names,Official4 Id,Official4 Surname,Official4 Other Names,Official5 Id,Official5 Surname,Official5 Other Names,Official6 Id,Official6 Surname,Official6 Other Names
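
The header-only output above suggests that no two rows are identical in every column. A match could still be recorded twice with small differences in other fields, so a stricter check compares rows on `Match Id` alone. The sketch below illustrates the difference on a hypothetical frame; the venue values are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in: match 102 appears twice with different venue spellings.
matches = pd.DataFrame({
    "Match Id": [101, 102, 102, 103],
    "Venue": ["MCG", "SCG", "Sydney Cricket Ground", "Gabba"],
})

# Rows identical in every column: none in this example.
full_duplicates = matches[matches.duplicated(keep=False)]

# Rows sharing a Match Id, regardless of the other columns.
id_duplicates = matches[matches.duplicated(subset=["Match Id"], keep=False)]

print(len(full_duplicates))
print(id_duplicates)
```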


### 8. Explore series.

It is necessary to understand the series that have been recorded. Particularly, it is important to know the naming convention for each series to determine a method for separating matches based on levels (International or Domestic).

In [10]:
# Print series names.
match_data.Series.unique()


array(['International Tests M', 'International ODI M',
       'Domestic 1st Class M', 'Domestic List A M', 'International T20 M',
       'Domestic T20 M', 'International ICC Trophy M',
       'International ODI World Cup M', 'International ODI F',
       'International T20 F', 'Domestic OD F',
       'International T20 World Cup M', 'Domestic T20 F',
       'International ODI World Cup F', 'International T20 World Cup F',
       'International Tests F', 'International 1st Class M',
       'International 1st Class F'], dtype=object)
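
Every series name printed above begins with its level and ends with a gender code, so both can be recovered by splitting on spaces. A minimal sketch, assuming that convention holds for all rows and using a hypothetical `series_names` sample:

```python
import pandas as pd

# A sample of the series names shown above.
series_names = pd.Series([
    "International Tests M",
    "Domestic List A M",
    "International ODI World Cup F",
    "Domestic T20 F",
])

tokens = series_names.str.split(" ")

parsed = pd.DataFrame({
    "Series": series_names,
    "Level": tokens.str[0],          # first word: International or Domestic
    "Gender Code": tokens.str[-1],   # last word: M or F
})

print(parsed)
```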

## Exploring Delivery Data
In this section, we will explore the raw deliveries data, including:

1. Columns of the data.
2. Missing data.
3. Duplicate rows.
4. Batters that have not played One-Day International matches.
5. Batting teams.
6. Batters that have played for multiple countries.

This section aims to determine which features of the deliveries dataset should be considered during the cleaning phase. For example, as the project aims to predict a batter’s performance in One-Day International cricket, it is not necessary to include games containing no One-Day International batters. Curating the deliveries dataset will ensure fewer issues occur during the data summarisation and modelling phases.

To begin, we must load the raw deliveries dataset. We will load the file in chunks, removing any delivery data not related to the reduced set of matches.

In [11]:
# Load cleaned match data.
match_data = pd.read_csv(DATA_PATH + "/Matches_Clean.txt", delimiter="\t")

# Determine which matches are important for delivery data.
match_ids = match_data["Match Id"]

# Determine duplicate columns between match and delivery data that should be dropped.
match_columns = set(match_data.columns)
match_columns.remove("Match Id")

# Load delivery data in chunks, keeping only deliveries from the cleaned matches.
filtered_chunks = []

for chunk in pd.read_csv(DATA_PATH + "/Deliveries.txt", delimiter="\t", chunksize=10**6, low_memory=False):
    # Keep only deliveries that belong to the reduced set of matches.
    chunk = chunk[chunk["Match Id"].isin(match_ids)]

    # Drop columns that duplicate the match data.
    chunk = chunk.drop(columns=[col for col in chunk.columns if col in match_columns])

    filtered_chunks.append(chunk)

# Combine filtered deliveries into a single dataframe.
delivery_data = pd.concat(filtered_chunks)


### 1. Explore delivery columns.

To determine which fields of the deliveries dataset are most important for this project, we must explore what data has been recorded for each delivery.

In [12]:
# Print delivery columns.
delivery_data.columns.values.tolist()


['Match Id',
 'Team Batting At Home',
 'Team Bowling At Home',
 'Toss Won By Team',
 'Toss Won By Batting Team',
 'Toss Decision',
 'TeamA ResultId',
 'Team Batting Id',
 'Team Batting',
 'Team Bowling Id',
 'Team Bowling',
 'Innings',
 'Delivery',
 'Day',
 'Session',
 'Time of Day (Hour)',
 'Time of Day (Min)',
 'Time of Day',
 'Striker Id',
 'Striker',
 'Striker Hand Id',
 'Striker Hand',
 'Non Striker Id',
 'Non Striker',
 'Non Striker Hand Id',
 'Non Striker Hand',
 'Bowler Id',
 'Bowler',
 'Bowler Hand Id',
 'Bowler Hand',
 'Pace / Spin',
 'Bowler Style',
 'Spell',
 'Over The Wicket',
 'Northern End',
 'Power Play',
 'Over',
 'Ball In Over',
 'Fair Ball In Over',
 'Ball Speed',
 'Ball RPM',
 'Pitch X',
 'Pitch Y',
 'At Batter X',
 'At Batter Y',
 'At Stumps X',
 'At Stumps Y',
 'Hit To Len',
 'Hit To Angle',
 'Bat Score',
 'ReachedBoundary',
 'Wides',
 'Noballs',
 'Byes',
 'Legbyes',
 'Penalty Runs',
 'Taken By WK',
 'Batter Out Id',
 'Batter Out',
 'How Out',
 'Fielder1 Id',
 'Fi

### 2. Check for columns containing NaN.

It is important to understand which columns contain missing data. Similarly to the match dataset, this will allow us to see which fields are not regularly recorded or may cause issues in data summarisation and modelling.

In [13]:
# Check columns that contain NaN
delivery_data.columns[delivery_data.isna().any()].tolist()


['Toss Won By Team',
 'Ball Speed',
 'Ball RPM',
 'Batter Out Id',
 'Batter Out',
 'How Out',
 'Fielder1 Id',
 'Fielder1',
 'Fielder2 Id',
 'Fielder2',
 'Fielder3 Id',
 'Fielder3',
 'Fielder4 Id',
 'Fielder4',
 'Fielder5 Id',
 'Fielder5',
 'Wind Description',
 'Event Grade',
 'Event Infield',
 'Fielder1 Catch',
 'Fielder1 Catch Assist',
 'Fielder1 Dropped Catch',
 'Fielder1 Runout',
 'Fielder1 Runout Missed',
 'Fielder1 Runout Assist',
 'Fielder1 Runout Assist Missed',
 'Fielder1 Missed Stumping',
 'Fielder1 Extra Effort',
 'Fielder1 Pressure Field',
 'Fielder1 Assist',
 'Fielder1 Fumble',
 'Fielder1 Misfield',
 'Fielder1 Dive Stop',
 'Fielder1 Dive Misfield',
 'Fielder1 Slide Stop',
 'Fielder1 Slide Miss',
 'Fielder1 Throw',
 'Fielder1 Good Throw',
 'Fielder1 Error Throw',
 'Fielder1 Throw Hit',
 'Fielder1 Throw Miss',
 'Fielder1 Throw Backed Up',
 'Fielder1 Throw Not Backed Up',
 'Fielder1 Keeper Drop',
 'Fielder1 Keeper Fumble',
 'Fielder1 Keeper Dive Stop',
 'Fielder1 Keeper Missed

### 3. Check for duplicate rows.

Deliveries recorded more than once can reduce the accuracy of our model, so any duplicates must be found and removed.

In [14]:
# Print duplicate rows.
delivery_data[delivery_data.duplicated(keep=False)]


Unnamed: 0,Match Id,Team Batting At Home,Team Bowling At Home,Toss Won By Team,Toss Won By Batting Team,Toss Decision,TeamA ResultId,Team Batting Id,Team Batting,Team Bowling Id,...,Movement Off Pitch,Stump Speed,Shot Aggression,Shot Quality Description,Hit To X Physical,Hit To Y Physical,Video File Name,Video Mark In Milliseconds,Keeper Id,Keeper


### 4. Explore players that have played One-Day International games.

It is important to understand which batters have played the One-Day International format as they will be the basis of our batter performance prediction dataset. This step will allow us to see issues with our current dataset, namely:

* Batting performances of non-Australian batters that are still recorded.
* Batting performances of batters with few international matches (e.g., <10) that are still recorded.

In [15]:
# Determine IDs of international One-Day games.
# The comparison is parenthesised because `&` binds more tightly than `==`.
odi_match_IDs = match_data[match_data.Series.str.contains(
    "International") & (match_data["Match Type Id"] == 1)]["Match Id"]

# Extract deliveries within international One-Day games.
odi_deliveries = delivery_data[delivery_data["Match Id"].isin(odi_match_IDs)]

# Count number of international One Day games per player.
by_columns = ["Striker"]
aggregates = {"Match Id": pd.Series.nunique}
groupby_data = odi_deliveries.groupby(by=by_columns, as_index=False).agg(
    aggregates).rename({"Match Id": "# Matches"}, axis=1)
groupby_data


Unnamed: 0,Striker,# Matches
0,"Aamer, Mohammad",2
1,"Abbott, Kyle",3
2,"Abbott, Sean",3
3,"Adams, Andre",2
4,"Adams, Jimmy",5
...,...,...
910,"van Schoor, Melt",1
911,"van Troost, Luuk",2
912,"van der Dussen, Rassie",1
913,"van der Merwe, Roelof",4
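
The second issue listed above, batters with few international matches, can be handled with a minimum-match threshold. The sketch below is illustrative only: the batter names and the `MIN_MATCHES` constant are hypothetical, and the threshold is set low to suit the toy data rather than the value of 10 mentioned above.

```python
import pandas as pd

# Hypothetical deliveries: one row per ball faced.
odi_deliveries = pd.DataFrame({
    "Striker": ["Batter A"] * 3 + ["Batter B"] * 2 + ["Batter C"],
    "Match Id": [1, 2, 3, 1, 1, 2],
})

# Illustrative threshold; the text above suggests something like 10.
MIN_MATCHES = 2

# Count distinct matches per batter.
match_counts = (odi_deliveries.groupby("Striker")["Match Id"]
                .nunique()
                .rename("# Matches")
                .reset_index())

# Keep batters who meet the threshold.
experienced = match_counts[match_counts["# Matches"] >= MIN_MATCHES]

print(experienced)
```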


### 5. Explore the batting teams.

As we are only concerned with the batting performances of Australian players, it is reasonable to only track deliveries faced by Australian teams. In this step, we determine which batting teams have been recorded.

In [16]:
# Print all batting teams.
delivery_data["Team Batting"].unique().tolist()


['Pakistan (M)',
 'Australia (M)',
 'South Africa (M)',
 'West Indies (M)',
 'Zimbabwe (M)',
 'Kenya (M)',
 'New Zealand (M)',
 'Bangladesh (M)',
 'Sri Lanka (M)',
 'England (M)',
 'SA (M)',
 'Victoria (M)',
 'NSW (M)',
 'Tas (M)',
 'WA (M)',
 'Qld (M)',
 'India (M)',
 'Ireland (M)',
 'Scotland (M)',
 'Netherlands (M)',
 'Australia A (M)',
 'USA (M)',
 'ICC World XI (M)',
 'Namibia (M)',
 'Brisbane Heat (M)',
 'Sydney Sixers (M)',
 'Melbourne Stars (M)',
 'Sydney Thunder (M)',
 'Adelaide Strikers (M)',
 'Melbourne Renegades (M)',
 'Hobart Hurricanes (M)',
 'Perth Scorchers  (M)',
 'Canada (M)',
 'South Africa A (M)',
 'India A (M)',
 'Afghanistan (M)',
 'England Lions (M)',
 'CA XI (M)',
 'Board President XI (M)',
 'India B (M)']
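
Restricting the deliveries to Australian batting teams could look like the sketch below. It assumes that "Australian teams" means any team whose name starts with `Australia`, which includes `Australia A (M)` but excludes the state and franchise sides; the `deliveries` frame is a hypothetical stand-in for `delivery_data`.

```python
import pandas as pd

# Hypothetical stand-in for delivery_data.
deliveries = pd.DataFrame({
    "Team Batting": ["Australia (M)", "Pakistan (M)", "Australia A (M)",
                     "NSW (M)", "Australia (M)"],
    "Bat Score": [4, 1, 0, 2, 6],
})

# Assumption: a team is Australian if its name starts with "Australia".
is_australian = deliveries["Team Batting"].str.startswith("Australia")
australian_deliveries = deliveries[is_australian]

print(australian_deliveries["Team Batting"].unique().tolist())
```

If domestic performances of Australian players should also be tracked, the mask would need to be extended with an explicit list of state and franchise teams.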

### 6. Explore players that have played for multiple countries.

Occasionally, a batter in the dataset has played for multiple international teams. Some have played for several teams from the same country (e.g., Australia A and Australia), while others have played for different countries (e.g., Australia and New Zealand). If not managed, this may raise issues during the summarisation and modelling stages.

In [17]:
# Count number of international teams per player.
by_columns = ["Striker"]
aggregates = {"Team Batting": pd.Series.nunique}
groupby_data = odi_deliveries.groupby(
    by=by_columns, as_index=False).agg(aggregates)

# Extract players that have played for multiple countries.
multiple_country_batters = groupby_data[groupby_data["Team Batting"] > 1]

# Print the teams played for by two batters
print("Teams Mayank Agarwal has batted for: {}".format(
    odi_deliveries[odi_deliveries["Striker"] == "Agarwal, Mayank"]["Team Batting"].unique().tolist()))
print("Teams Luke Ronchi has batted for: {}".format(
    odi_deliveries[odi_deliveries["Striker"] == "Ronchi, Luke"]["Team Batting"].unique().tolist()))

# Display players that have played for multiple countries.
multiple_country_batters


Teams Mayank Agarwal has batted for: ['India A (M)', 'Board President XI (M)', 'India B (M)', 'India (M)']
Teams Luke Ronchi has batted for: ['Australia (M)', 'New Zealand (M)']


Unnamed: 0,Striker,Team Batting
1,"Abbott, Kyle",2
2,"Abbott, Sean",2
7,"Afridi, Shahid",2
8,"Agar, Ashton",2
10,"Agarwal, Mayank",4
...,...,...
840,"Wade, Matthew",2
857,"White, Cameron",2
889,"Zampa, Adam",2
907,"van Jaarsveld, Vaughn",2
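
Counting distinct team names does not separate the two cases described above, because `India A (M)` and `India (M)` count as different teams. One possible approach, sketched here on hypothetical rows, is to reduce each team name to a country before counting. The `Country` column and the suffix rules are assumptions for illustration, and names such as `Board President XI (M)` would still need a manual mapping.

```python
import pandas as pd

# Hypothetical stand-in for odi_deliveries.
deliveries = pd.DataFrame({
    "Striker": ["Ronchi, Luke", "Ronchi, Luke",
                "Agarwal, Mayank", "Agarwal, Mayank", "Agarwal, Mayank"],
    "Team Batting": ["Australia (M)", "New Zealand (M)",
                     "India A (M)", "India B (M)", "India (M)"],
})

# Strip the gender suffix, then a trailing " A" or " B" squad marker.
country = (deliveries["Team Batting"]
           .str.replace(r"\s*\([MF]\)$", "", regex=True)
           .str.replace(r"\s+[AB]$", "", regex=True))
deliveries = deliveries.assign(Country=country)

# Batters who have represented more than one country.
countries_per_batter = deliveries.groupby("Striker")["Country"].nunique()
multi_country = countries_per_batter[countries_per_batter > 1].index.tolist()

print(multi_country)
```

Because the squad marker must be preceded by whitespace, abbreviations such as `SA (M)` and `WA (M)` are left unchanged.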
