# Data Initialisation
This document initialises the data used for the remainder of the project. In particular, it removes all unnecessary data from the match and deliveries datasets, following the findings of the Data Exploration notebook.

This document consists of three main sections:

1. Basic clean of the match dataset.
2. Basic clean of the deliveries dataset.
3. Final clean of the match and deliveries datasets together.

Each of the above sections is further divided into subsections, each containing an individual cleaning step.

To initialise this document, we must first load the necessary libraries and constants:

In [2]:
# Perform necessary imports.
import pandas as pd
# Import only the constants this notebook uses.
from lib.constants import DATA_PATH, MIN_INNINGS


## Basic Clean Match Data
Here, we perform a basic clean on the match data, including:

1. Removing unnecessary columns from match data.
2. Removing female formats.
3. Removing disability teams.
4. Removing non-Australian international matches.
5. Removing uncommon match formats.
6. Removing international games that are not of One-Day format.

To begin, we will load the raw match dataset:

In [3]:
# Load match data.
match_data = pd.read_csv(DATA_PATH + "/Matches.txt", delimiter="\t")


### 1. Remove Unnecessary Columns.

We will remove columns from the dataset that are not important for the summarisation and modelling phases. In particular, we will remove all columns related to the officials associated with a match, which reduces the dataset to a more manageable size.

In [4]:
# Remove unnecessary columns from match data.
# Select every column whose name contains "Official" and drop it.
match_data = match_data.drop(
    columns=match_data.filter(regex="Official").columns)
match_data.columns.values.tolist()


['Match Id',
 'Season Id',
 'Season',
 'Series Id',
 'Series',
 'Series Gender Id',
 'Series Gender',
 'Match Date',
 'Match YYMMDD',
 'Match Type Id',
 'Match Type',
 'Ball Type Id',
 'Ball Type',
 'TeamA Id',
 'TeamA',
 'TeamA At Home',
 'TeamB Id',
 'TeamB',
 'TeamB At Home',
 'Day/Night',
 'Venue Id',
 'Venue',
 'Toss Won By Id',
 'Toss Decision Id',
 'TeamA Innings1 Closure',
 'TeamA Innings2 Closure',
 'TeamB Innings1 Closure',
 'TeamB Innings2 Closure',
 'TeamA 1st Comparison',
 'TeamA Result Id',
 'TeamA Result',
 'TeamBattingIdMatchInnings1',
 'TeamBattingMatchInnings1',
 'TeamBattingIdMatchInnings2',
 'TeamBattingMatchInnings2',
 'TeamBattingIdMatchInnings3',
 'TeamBattingMatchInnings3',
 'TeamBattingIdMatchInnings4',
 'TeamBattingMatchInnings4',
 'TeamB Result Id',
 'TeamB Result',
 'TeamA Coach Id',
 'TeamA Coach Surname',
 'TeamA Coach Other Names',
 'TeamB Coach Id',
 'TeamB Coach Surname',
 'TeamB Coach Other Names',
 'Round Id',
 'Round',
 'Round Number',
 'Match Number']
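
The column filter above can be tried on a small frame before running it on the full dataset. The following is a minimal sketch; the toy column names and values are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy frame; these column names and values are hypothetical stand-ins.
toy = pd.DataFrame({
    "Match Id": [1, 2],
    "Official1 Id": [10, 11],
    "Official1 Surname": ["Smith", "Jones"],
    "Venue": ["MCG", "SCG"],
})

# DataFrame.filter(regex=...) selects the columns whose names match the pattern.
official_cols = toy.filter(regex="Official").columns

# Dropping them leaves only the columns needed downstream.
toy_clean = toy.drop(columns=official_cols)
print(toy_clean.columns.tolist())
```

Printing the remaining column names, as the notebook does, is a quick way to confirm that only the intended columns were removed.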

### 2. Remove Female Formats.

Due to large variations between male and female formats, the two cannot reasonably be compared. Thus, it is necessary to focus on one gender. As the match dataset contains substantially more data for male formats, this will be the focus of the project.

> As more data is collected for women's games, it would be interesting to repeat this research.

In [5]:
# Remove female series from match data.
match_data = match_data.loc[match_data["Series Gender Id"] == 1]
match_data["Series Gender"].unique().tolist()


['Male']

### 3. Remove Disability Matches.

A small set of disability matches is recorded in the dataset. Due to significant differences between non-disability and disability matches, this project will focus only on non-disability matches, for which there is substantially more data.

In [6]:
# Remove matches in which either team is a disability team.
match_data = match_data[~(match_data.TeamA.str.contains("Disability") |
                          match_data.TeamB.str.contains("Disability"))]
match_data["TeamA"].unique().tolist()


['Australia (M)',
 'West Indies (M)',
 'Pakistan (M)',
 'Sri Lanka (M)',
 'South Africa (M)',
 'England (M)',
 'New Zealand (M)',
 'India (M)',
 'SA (M)',
 'Tas (M)',
 'NSW (M)',
 'Victoria (M)',
 'WA (M)',
 'Qld (M)',
 'Australia A (M)',
 'Zimbabwe (M)',
 'Kenya (M)',
 'Scotland (M)',
 'Bangladesh (M)',
 'Ireland (M)',
 'Sydney Sixers (M)',
 'Melbourne Stars (M)',
 'Adelaide Strikers (M)',
 'Perth Scorchers  (M)',
 'Brisbane Heat (M)',
 'Hobart Hurricanes (M)',
 'Melbourne Renegades (M)',
 'Sydney Thunder (M)',
 'Canada (M)',
 'Gloucestershire (M)',
 'Afghanistan (M)',
 'India A (M)',
 'South Africa A (M)',
 'CA XI (M)',
 'India B (M)',
 'England Lions (M)']
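
When excluding rows based on two columns, the way the negations are combined matters. The sketch below uses invented team names to compare "neither side is flagged" with "not both sides are flagged"; it is an illustration only, not a statement about what the real dataset contains.

```python
import pandas as pd

# Toy fixtures; the team names are hypothetical.
toy = pd.DataFrame({
    "TeamA": ["NSW (M)", "NSW Disability (M)", "Qld Disability (M)"],
    "TeamB": ["Qld (M)", "Victoria (M)", "WA Disability (M)"],
})

a_flag = toy["TeamA"].str.contains("Disability")
b_flag = toy["TeamB"].str.contains("Disability")

# Keep a match only when neither side is flagged.
# By De Morgan's laws this is the same as ~a_flag & ~b_flag.
neither = toy[~(a_flag | b_flag)]

# ~a_flag | ~b_flag only drops matches where both sides are flagged,
# so the mixed fixture in row 1 survives.
not_both = toy[~a_flag | ~b_flag]

print(len(neither), len(not_both))
```

If disability teams only ever play each other, both forms give the same result, but the first form does not rely on that assumption.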

### 4. Remove Non-Australian International Matches.

The match dataset contains some international games played between two non-Australian teams. As this project aims to predict batter performance at the international level of One-Day cricket based on domestic performances in Australia, it is reasonable to remove these international matches.

In [7]:
# Remove international games where Australia is not playing.
match_data = match_data[~(match_data.Series.str.contains("International") & ~match_data.TeamA.str.contains(
    "Australia") & ~match_data.TeamB.str.contains("Australia"))]
len(match_data[match_data.Series.str.contains("International") & ~match_data.TeamA.str.contains(
    "Australia") & ~match_data.TeamB.str.contains("Australia")][["TeamA", "TeamB"]])


0
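
Long boolean filters like the one above are often easier to check when each condition is given a name. The following is a small sketch on invented rows, assuming the same `Series`, `TeamA`, and `TeamB` columns; the variable names are illustrative.

```python
import pandas as pd

# Toy rows; values are hypothetical.
toy = pd.DataFrame({
    "Series": ["International ODI M", "International ODI M", "Domestic List A M"],
    "TeamA": ["Australia (M)", "India (M)", "NSW (M)"],
    "TeamB": ["India (M)", "England (M)", "Qld (M)"],
})

is_international = toy["Series"].str.contains("International")
has_australia = (toy["TeamA"].str.contains("Australia")
                 | toy["TeamB"].str.contains("Australia"))

# Drop only the internationals that have no Australian side.
kept = toy[~(is_international & ~has_australia)]
print(kept["TeamA"].tolist())
```

Note that a substring test for "Australia" also matches names such as "Australia A (M)", which is the same behaviour as the filter in the cell above.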

### 5. Remove Uncommon Match Formats.

The only formats played consistently at the international level are T20, One-Day, and Five-Day. Domestically, Four-Day matches are favoured over the Five-Day format. In addition to these formats, the match dataset contains Two- and Three-Day games. These formats have little in common with those played at the international level, so they are removed from the dataset.

In [8]:
# Remove games that are not T20, 1 Day, 4 Day, or 5 Day formats.
match_data = match_data[match_data["Match Type Id"].isin([1, 4, 5, 7])]
match_data["Match Type"].unique().tolist()


['5 Day', '1 Day', '4 Day', 'Twenty20']

### 6. Remove International Matches That are not One-Day Format.

This project targets One-Day Internationals, so international T20 and Test matches are out of scope and are removed. All domestic formats are kept.

In [9]:
# Remove international games that are not ODI.
# The comparison is parenthesised because `&` binds more tightly than `==`.
match_data = match_data[((match_data["Match Type Id"] == 1) & match_data.Series.str.contains("International"))
                        | match_data.Series.str.contains("Domestic")]
match_data.Series.unique().tolist()


['International ODI M',
 'Domestic 1st Class M',
 'Domestic List A M',
 'Domestic T20 M',
 'International ICC Trophy M',
 'International ODI World Cup M',
 'International 1st Class M']
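
In Python, `&` binds more tightly than `==`, so a comparison combined with another mask needs its own parentheses. The sketch below shows the parenthesised pattern on invented rows; the id-to-format mapping and the "International T20 M" series name are assumptions made purely for illustration.

```python
import pandas as pd

# Toy data; the ids and the T20 series name are hypothetical.
toy = pd.DataFrame({
    "Match Type Id": [1, 7, 1],
    "Series": ["International ODI M", "International T20 M", "Domestic List A M"],
})

# Each condition is evaluated on its own before being combined.
is_type_one = (toy["Match Type Id"] == 1)
is_international = toy["Series"].str.contains("International")
is_domestic = toy["Series"].str.contains("Domestic")

# Keep internationals of type 1, plus every domestic match.
kept = toy[(is_type_one & is_international) | is_domestic]
print(kept["Series"].tolist())
```

Naming the masks also makes it straightforward to inspect each one separately if the filtered result looks unexpected.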

### Write Cleaned Matches to File.

We will now write this cleaned data to file to be used later.

In [10]:
# Write the cleaned data to file.
match_data.to_csv(DATA_PATH + "/Matches_Clean.txt", sep="\t", index=False)


## Basic Clean Deliveries Data
Here, we perform a basic clean on the deliveries data, including:

1. Removing deliveries in irrelevant matches.
2. Removing deliveries to foreign teams.

These steps are outlined below.

### 1. Remove Deliveries in Irrelevant Matches.

As the deliveries dataset is considerably larger than the match dataset, we must read it in chunks. As each chunk is loaded, we will remove deliveries that do not belong to a match in our cleaned match dataset.

In [12]:
# Determine which matches are important for delivery data.
match_ids = match_data["Match Id"]

# Determine duplicate columns between match and delivery data that should be dropped.
match_columns = set(match_data.columns)
match_columns.remove("Match Id")

# Load delivery data in chunks, keeping only deliveries from cleaned matches.
filtered_chunks = []

for chunk in pd.read_csv(DATA_PATH + "/Deliveries.txt", delimiter="\t", chunksize=10**6, low_memory=False):
    chunk = chunk[chunk["Match Id"].isin(match_ids)]
    # Drop columns already present in the match data. drop() returns a new
    # frame, which avoids modifying a filtered slice in place.
    chunk = chunk.drop(
        columns=[col for col in chunk.columns if col in match_columns])
    filtered_chunks.append(chunk)

# Combine filtered deliveries into a single dataframe.
delivery_data = pd.concat(filtered_chunks)
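
The chunked filter can be exercised without the full deliveries file. The following is a minimal sketch that reads a tiny in-memory tab-separated table in chunks of two rows; the data and the `wanted_ids` list are invented for illustration.

```python
import io
import pandas as pd

# A tiny tab-separated stand-in for the deliveries file (hypothetical values).
raw = (
    "Match Id\tStriker Id\tRuns\n"
    "1\t100\t4\n"
    "2\t101\t0\n"
    "1\t100\t1\n"
    "3\t102\t6\n"
    "2\t101\t2\n"
)
wanted_ids = [1, 3]

filtered_chunks = []
for chunk in pd.read_csv(io.StringIO(raw), delimiter="\t", chunksize=2):
    # Filter each chunk before keeping it, so unwanted rows never accumulate.
    filtered_chunks.append(chunk[chunk["Match Id"].isin(wanted_ids)])

# A single concat at the end avoids re-copying a growing frame on every pass.
deliveries = pd.concat(filtered_chunks, ignore_index=True)
print(deliveries["Match Id"].tolist())
```

Collecting the chunks in a list and concatenating once is a common design choice here, since calling `pd.concat` inside the loop copies all previously loaded rows on each iteration.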


### 2. Remove Deliveries to Foreign Teams.

The deliveries dataset contains all innings of a match. At the international level, this means that both Australian and non-Australian batting innings have been recorded. It is reasonable to remove the non-Australian innings as we only aim to predict Australian batting performances.

In [13]:
# Get a list of international One Day match IDs.
odi_IDs = match_data[match_data["Series"].str.contains(
    "International")]["Match Id"].tolist()

# Remove deliveries to foreign teams.
delivery_data = delivery_data[~(delivery_data["Match Id"].isin(
    odi_IDs) & ~delivery_data["Team Batting"].str.contains("Australia"))]

# Show the remaining batting teams.
delivery_data["Team Batting"].unique().tolist()


['Australia (M)',
 'SA (M)',
 'Victoria (M)',
 'NSW (M)',
 'Tas (M)',
 'WA (M)',
 'Qld (M)',
 'Australia A (M)',
 'Brisbane Heat (M)',
 'Sydney Sixers (M)',
 'Melbourne Stars (M)',
 'Sydney Thunder (M)',
 'Adelaide Strikers (M)',
 'Melbourne Renegades (M)',
 'Hobart Hurricanes (M)',
 'Perth Scorchers  (M)',
 'CA XI (M)']

## Further Cleaning of Entire Dataset
This section performs a final clean of the deliveries and match datasets together, including:

1. Removing matches that contain no relevant batters.

This step is outlined below.

### 1. Remove Matches That Contain no Relevant Batters.

We are only interested in modelling batters who have played at least 10 One-Day International and 10 domestic matches. In the code, this threshold is the `MIN_INNINGS` constant. We will remove all matches that contain no batters meeting both criteria.

In [14]:
# Extract international deliveries.
int_matches = match_data[match_data["Series"].str.contains(
    "International")]["Match Id"].tolist()
int_deliveries = delivery_data[delivery_data["Match Id"].isin(int_matches)]

# Count the distinct international matches in which each batter faced a delivery.
by_columns = ["Striker Id"]
aggregates = {"Match Id": pd.Series.nunique}
int_groupby_data = int_deliveries.groupby(by=by_columns).agg(aggregates)

# Keep batters who have batted in at least MIN_INNINGS international One-Day matches.
batter_ids = int_groupby_data[int_groupby_data["Match Id"]
                              >= MIN_INNINGS].index.tolist()

# Extract domestic deliveries.
dom_matches = match_data[match_data["Series"].str.contains(
    "Domestic")]["Match Id"].tolist()
dom_deliveries = delivery_data[delivery_data["Match Id"].isin(dom_matches)]

# Count the distinct domestic matches per batter, restricted to the batters kept above.
by_columns = ["Striker Id"]
aggregates = {"Match Id": pd.Series.nunique}
dom_groupby_data = dom_deliveries.groupby(by=by_columns).agg(aggregates)
dom_groupby_data = dom_groupby_data[dom_groupby_data.index.isin(batter_ids)]

# Keep batters who have also batted in at least MIN_INNINGS domestic matches.
batter_ids = dom_groupby_data[(
    dom_groupby_data["Match Id"] >= MIN_INNINGS)].index.tolist()

# Remove matches not containing the above players.
valid_matches = delivery_data[delivery_data["Striker Id"].isin(
    batter_ids)]["Match Id"].unique()
match_data = match_data[match_data["Match Id"].isin(valid_matches)]
delivery_data = delivery_data[delivery_data["Match Id"].isin(valid_matches)]
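
The counting step above relies on grouping deliveries by striker and counting distinct match IDs. The sketch below shows that idea on invented data; `MIN_MATCHES` is a hypothetical threshold chosen for the toy example and stands in for the notebook's `MIN_INNINGS`.

```python
import pandas as pd

# Hypothetical threshold for the toy data; the notebook uses MIN_INNINGS.
MIN_MATCHES = 2

# One row per delivery (values invented for illustration).
toy = pd.DataFrame({
    "Match Id":   [1, 1, 2, 2, 3, 3],
    "Striker Id": [100, 101, 100, 100, 101, 102],
})

# Number of distinct matches in which each striker faced at least one delivery.
matches_per_batter = toy.groupby("Striker Id")["Match Id"].nunique()

# Keep only the strikers who meet the threshold.
eligible = matches_per_batter[matches_per_batter >= MIN_MATCHES].index.tolist()
print(eligible)
```

Because the count is over distinct matches, a batter who bats twice in one multi-day match is counted once for that match.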


### Writing Data to File

Finally, we will write the datasets back to file in their cleaned states. Additionally, we will initialise the batter summary data file by writing the relevant Batter IDs to a new file.

In [15]:
# Write the cleaned data to file.
match_data.to_csv(DATA_PATH + "/Matches_Clean.txt", sep="\t", index=False)
delivery_data.to_csv(DATA_PATH + "/Deliveries_Clean.txt",
                     sep="\t", index=False)

# Initialise batter summary file.
batter_data = pd.DataFrame({"Batter_ID": batter_ids})
batter_data.to_csv(DATA_PATH + "/Batter_Summary.txt", sep="\t", index=False)
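
Before relying on the written files, it can be worth confirming that the three tables agree with each other. The following is a minimal sketch of such a check on invented frames; the checks themselves are a suggestion and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (hypothetical values).
matches = pd.DataFrame({"Match Id": [1, 2]})
deliveries = pd.DataFrame({"Match Id": [1, 1, 2], "Striker Id": [100, 101, 100]})
batters = pd.DataFrame({"Batter_ID": [100]})

# Every delivery should belong to a retained match.
orphans = deliveries[~deliveries["Match Id"].isin(matches["Match Id"])]

# Every listed batter should appear as a striker somewhere in the deliveries.
missing = batters[~batters["Batter_ID"].isin(deliveries["Striker Id"])]

print(len(orphans), len(missing))
```

Both counts being zero indicates that the match, delivery, and batter tables reference each other consistently.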
