# **Cricket Match Dataset: Test Nations (1877–2025):**

## **Data Engineering:**

In [None]:
import pandas as pd
import numpy as np
import sys
import warnings
from datetime import datetime
warnings.filterwarnings("ignore") # Ignore warnings

In [None]:
from data_preparation import process_data_pipeline

# Call the wrapper function to get the final processed data
data = process_data_pipeline()
print(data.shape)

(7793, 12)


In [None]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration'],
      dtype='object')

#### 1. **Create month column from the dataset:**

In [None]:
# add month column to the dataframe:
data["month"]= data["start_date"].dt.month

data["month"].unique()

array([12., 11.,  1.,  3.,  4.,  2., 10.,  9., nan,  6.,  7.,  8.,  5.])

In [None]:
data['month'].mode() # find the most frequently occurring month

0    1.0
Name: month, dtype: float64

In [None]:
# Replace NaN with the most frequently occurring month.
# mode() returns a Series, so take its first value and fill with a scalar:
data["month"] = data["month"].fillna(data["month"].mode()[0])
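The reason a scalar matters here: `Series.mode()` returns a Series (several values can tie), and `fillna` given a Series aligns on index labels rather than broadcasting. A minimal sketch on made-up values, where `months` is a hypothetical stand-in for `data["month"]`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data["month"]; the values are invented for illustration.
months = pd.Series([12.0, 1.0, np.nan, 1.0, np.nan, 3.0])

# mode() returns a Series because several values can tie for most frequent.
most_common = months.mode().iloc[0]

# Filling with the scalar replaces every NaN, whatever its index label.
filled = months.fillna(most_common)
print(filled.tolist())
```

If two months tie, `mode()` lists them in ascending order, so `.iloc[0]` picks the smaller one.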

In [73]:
# Map numeric month values to actual month names:
month_mapping = {
    1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'Apr',
    5.0: 'May', 6.0: 'Jun', 7.0: 'Jul', 8.0: 'Aug',
    9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'
}
# Apply the mapping to the 'month' column:
data['month'] = data['month'].map(month_mapping)
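As an alternative to typing the twelve entries by hand, the same dictionary could be built from the standard library. This is only a sketch: `calendar.month_abbr` is locale-dependent, so it assumes an English (or default C) locale, and `generated_mapping` is a hypothetical name.

```python
import calendar

# calendar.month_abbr[1] .. [12] hold the abbreviated month names;
# index 0 is an empty string, so the range starts at 1.
generated_mapping = {float(i): calendar.month_abbr[i] for i in range(1, 13)}

print(generated_mapping[1.0], generated_mapping[12.0])
```

The float keys mirror the float month values that appear once `NaN` is present in the column.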


In [74]:
data.head(3)


Unnamed: 0,team_1,team_2,winner,margin,ground,format,test_score,odi_score,t20i_score,year,start_date,total_duration,month
0,India,Pakistan,drawn,,Bengaluru,Test,1852,0,0,2007,2007-12-08,5,Dec
1,India,Pakistan,drawn,,Eden Gardens,Test,1850,0,0,2007,2007-11-30,5,Nov
2,India,Pakistan,India,6 wickets,Delhi,Test,1849,0,0,2007,2007-11-22,5,Nov


#### 2. **Create `"Is Neutral Ground"` column:**
In cricket, a match is said to be **neutral** when it is **not played in either team’s home country**.

**Purpose of Creating `'Is Neutral Ground'` Feature:**
- **`Analyze home advantage`**: See how much teams benefit when playing at home.
- **`Understand match conditions`**: Neutral venues (e.g., UAE for Pakistan vs India) can affect match outcomes.
- **`Better insights`**: Separating home/away vs neutral games gives more accurate performance trends.
- **`Useful for model features`**: If you build predictive models later, "neutral ground" can be an important feature.

We need a **`mapping of grounds to countries`** (e.g., `"Melbourne" → "Australia"`) to automate this.

In [None]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration', 'month'],
      dtype='object')

In [76]:
data["ground"].nunique()


179

So, the `ground` column contains `179` distinct grounds.

In [None]:
data["ground"].unique() # Array of all unique ground names

array(['Bengaluru', 'Eden Gardens', 'Delhi', 'Karachi', 'Faisalabad',
       'Lahore', 'Mohali', 'Rawalpindi', 'Multan', 'Chennai', 'Sialkot',
       'Ahmedabad', 'Jaipur', 'Nagpur', 'Jalandhar', 'Hyderabad (Sind)',
       'Kanpur', 'Wankhede', 'Brabourne', 'Peshawar', 'Bahawalpur',
       'Dhaka', 'Lucknow', 'Dubai (DICS)', 'Colombo (RPS)', 'Pallekele',
       'Manchester', 'The Oval', 'Birmingham', 'Adelaide', 'Mirpur',
       'Dambulla', 'Centurion', 'Gwalior', 'Guwahati', 'Abu Dhabi',
       'Jamshedpur', 'Visakhapatnam', 'Kochi', 'Amstelveen', 'Sharjah',
       'W.A.C.A', 'Hobart', 'Brisbane', 'Toronto', 'Colombo (SSC)',
       'Singapore', 'Sydney', 'Gujranwala', 'Pune', 'Hyderabad (Deccan)',
       'Indore', 'Melbourne', 'Quetta', 'Sahiwal', 'New York',
       'Johannesburg', 'Durban', 'Hambantota', 'Leeds', 'Fatullah',
       'Hangzhou', 'Perth', "Lord's", 'Colombo (PSS)', 'Taunton',
       'Nairobi (Gym)', 'Melbourne (Docklands)', 'Nottingham', 'Cardiff',
       'Canberra', 'H

Below, I create a mapping from each `ground_name` to its `country`. This approach is not perfect, but it will work for now.

In [78]:
# Create the mapping of each ground to its country
grounds = [
    'Bengaluru', 'Eden Gardens', 'Delhi', 'Karachi', 'Faisalabad',
    'Lahore', 'Mohali', 'Rawalpindi', 'Multan', 'Chennai', 'Sialkot',
    'Ahmedabad', 'Jaipur', 'Nagpur', 'Jalandhar', 'Hyderabad (Sind)',
    'Kanpur', 'Wankhede', 'Brabourne', 'Peshawar', 'Bahawalpur',
    'Dhaka', 'Lucknow', 'Dubai (DICS)', 'Colombo (RPS)', 'Pallekele',
    'Manchester', 'The Oval', 'Birmingham', 'Adelaide', 'Mirpur',
    'Dambulla', 'Centurion', 'Gwalior', 'Guwahati', 'Abu Dhabi',
    'Jamshedpur', 'Visakhapatnam', 'Kochi', 'Amstelveen', 'Sharjah',
    'W.A.C.A', 'Hobart', 'Brisbane', 'Toronto', 'Colombo (SSC)',
    'Singapore', 'Sydney', 'Gujranwala', 'Pune', 'Hyderabad (Deccan)',
    'Indore', 'Melbourne', 'Quetta', 'Sahiwal', 'New York',
    'Johannesburg', 'Durban', 'Hambantota', 'Leeds', 'Fatullah',
    'Hangzhou', 'Perth', "Lord's", 'Colombo (PSS)', 'Taunton',
    'Nairobi (Gym)', 'Melbourne (Docklands)', 'Nottingham', 'Cardiff',
    'Canberra', 'Harare', 'Gros Islet', 'Chattogram', 'Khulna',
    'Northampton', 'Moratuwa', 'Christchurch', 'Cape Town',
    'Southampton', 'Bristol', 'Chester-le-Street', 'Cuttack',
    'Bridgetown', 'Dublin (Malahide)', 'Dublin', 'Belfast', 'Kingston',
    'Lauderhill', 'Mount Maunganui', 'Hamilton', 'Wellington',
    'Napier', 'Dunedin', 'Auckland', 'Nelson', 'Queenstown', 'Derby',
    'East London', 'Gqeberha', 'Sheikhupura', 'Paarl', 'Benoni',
    'Bloemfontein', 'Tangier', 'Galle', 'Kandy', 'Colombo (CCC)',
    'Hyderabad', 'Kimberley', 'Sargodha', 'Swansea', 'King City (NW)',
    'Roseau', 'Basseterre', 'Providence', "St John's", 'Georgetown',
    'Port of Spain', 'Kingstown', "St George's", 'Albion', 'Bulawayo',
    'Dharamsala', 'Ranchi', 'Rajkot', 'Vadodara', 'Chandigarh',
    'Kuala Lumpur', 'Margao', 'Srinagar', 'Thiruvananthapuram',
    'New Delhi', 'Chelmsford', 'Raipur', 'North Sound', 'Mumbai',
    'Faridabad', 'Taupo', 'Amritsar', 'Launceston', 'Hove', 'Mackay',
    'Tarouba', 'Vijayawada', 'Jodhpur', 'Leicester', 'Tunbridge Wells',
    'Sylhet', 'Dehradun', 'Tolerance Oval', 'Greater Noida',
    'Rotterdam', 'Bready', 'Cairns', 'Darwin', 'Canterbury',
    'Sheffield', 'Potchefstroom', 'Carrara', 'Geelong', 'Castries',
    'Townsville', 'Bogra', 'Pietermaritzburg', 'Dallas',
    'Nairobi (Aga)', 'Scarborough', 'Ballarat', 'Devonport', 'Albury',
    'Aberdeen', 'Whangarei', 'Nairobi (Club)', 'Berri', 'Coolidge',
    'Worcester', 'Patna', 'New Plymouth'
]

# Country lookup based on cities/venues:
ground_to_country = {}

# Helper mapping (a ground is assigned by exact membership in a country's list).
# Each country must appear only once: a repeated key in a dict literal
# silently replaces the earlier list.
country_keywords = {
    'India': ['Bengaluru', 'Eden Gardens', 'Delhi', 'Mohali', 'Chennai', 'Ahmedabad', 'Jaipur',
              'Nagpur', 'Jalandhar', 'Kanpur', 'Wankhede', 'Brabourne', 'Lucknow', 'Gwalior',
              'Guwahati', 'Jamshedpur', 'Visakhapatnam', 'Kochi', 'Pune', 'Hyderabad (Deccan)',
              'Indore', 'Rajkot', 'Vadodara', 'Chandigarh', 'Margao', 'Srinagar',
              'Thiruvananthapuram', 'New Delhi', 'Raipur', 'Mumbai', 'Faridabad', 'Amritsar',
              'Vijayawada', 'Jodhpur', 'Dehradun', 'Greater Noida', 'Patna'],
    'Pakistan': ['Karachi', 'Faisalabad', 'Lahore', 'Rawalpindi', 'Multan', 'Sialkot',
                 'Hyderabad (Sind)', 'Peshawar', 'Bahawalpur', 'Gujranwala', 'Quetta',
                 'Sahiwal', 'Sheikhupura', 'Sargodha'],
    'Bangladesh': ['Dhaka', 'Mirpur', 'Fatullah', 'Chattogram', 'Khulna', 'Sylhet', 'Bogra'],
    'Sri Lanka': ['Colombo (RPS)', 'Pallekele', 'Dambulla', 'Hambantota', 'Colombo (SSC)',
                  'Moratuwa', 'Galle', 'Kandy', 'Colombo (PSS)', 'Colombo (CCC)'],
    'UAE': ['Dubai (DICS)', 'Sharjah', 'Abu Dhabi', 'Tolerance Oval'],
    'England': ['Manchester', 'The Oval', 'Birmingham', "Lord's", 'Taunton', 'Nottingham',
                'Leeds', 'Southampton', 'Bristol', 'Chester-le-Street', 'Northampton',
                'Derby', 'Chelmsford', 'Hove', 'Leicester', 'Tunbridge Wells', 'Sheffield',
                'Scarborough', 'Worcester', 'Canterbury'],
    'Australia': ['Adelaide', 'W.A.C.A', 'Hobart', 'Brisbane', 'Sydney', 'Melbourne',
                  'Melbourne (Docklands)', 'Perth', 'Canberra', 'Launceston', 'Mackay',
                  'Carrara', 'Geelong', 'Townsville', 'Ballarat', 'Devonport', 'Albury',
                  'Berri'],
    'New Zealand': ['Taupo', 'Mount Maunganui', 'Hamilton', 'Wellington', 'Napier',
                    'Dunedin', 'Auckland', 'Nelson', 'Queenstown', 'Whangarei',
                    'New Plymouth'],
    'South Africa': ['Johannesburg', 'Durban', 'Centurion', 'East London', 'Gqeberha',
                     'Paarl', 'Benoni', 'Bloemfontein', 'Kimberley', 'Potchefstroom',
                     'Pietermaritzburg'],
    'West Indies': ['Bridgetown', 'Kingston', 'Lauderhill', 'Roseau', 'Basseterre',
                    'Providence', "St John's", 'Georgetown', 'Port of Spain',
                    'Kingstown', "St George's", 'Albion', 'Castries', 'Coolidge',
                    'North Sound', 'Tarouba'],
    'Zimbabwe': ['Harare', 'Bulawayo'],
    'Ireland': ['Dublin (Malahide)', 'Dublin', 'Belfast', 'Bready'],
    'Scotland': ['Aberdeen'],
    'Kenya': ['Nairobi (Gym)', 'Nairobi (Aga)', 'Nairobi (Club)'],
    'Canada': ['Toronto', 'King City (NW)'],
    'USA': ['New York', 'Dallas'],
    'Netherlands': ['Amstelveen', 'Rotterdam'],
    'Malaysia': ['Kuala Lumpur'],
    'China': ['Hangzhou'],
    'Singapore': ['Singapore'],
    'Namibia': [],
    'Afghanistan': [],
    'Nepal': [],
    'Hong Kong': [],
    'Germany': [],
    'Italy': [],
    'Morocco': ['Tangier'],
    'France': [],
}

# Assign countries to grounds
for ground in grounds:
    assigned = False
    for country, keywords in country_keywords.items():
        if ground in keywords:
            ground_to_country[ground] = country
            assigned = True
            break
    if not assigned:
        ground_to_country[ground] = 'Unknown'

ground_to_country_sorted = dict(sorted(ground_to_country.items()))
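A Python dict literal keeps only the last value when a key is repeated, so a country listed twice would lose its first list without any error. One way to guard against that is to store the data as pairs and merge them before inverting. This is a rough sketch with a tiny made-up excerpt; `country_ground_pairs`, `merged`, and `ground_lookup` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical excerpt of the country -> grounds data, stored as pairs so a
# country may appear more than once without one entry overwriting another.
country_ground_pairs = [
    ("England", ["Manchester", "Leeds"]),
    ("UAE", ["Sharjah", "Abu Dhabi"]),
    ("England", ["Canterbury"]),  # second entry for the same country
]

# Merge repeated countries into a single list each.
merged = defaultdict(list)
for country, venues in country_ground_pairs:
    merged[country].extend(venues)

# Invert to ground -> country, refusing to map one ground to two countries.
ground_lookup = {}
for country, venues in merged.items():
    for venue in venues:
        if ground_lookup.get(venue, country) != country:
            raise ValueError(f"{venue!r} is listed under two countries")
        ground_lookup[venue] = country

print(ground_lookup["Manchester"], ground_lookup["Canterbury"])
```

Grounds absent from the lookup can still be given a default with `ground_lookup.get(name, 'Unknown')`.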


In [79]:
data.columns


Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration', 'month'],
      dtype='object')

Note that `ground_country` is also the `home_country` of the match.

In [80]:
# Map ground to country:
data['ground_country'] = data['ground'].map(ground_to_country_sorted)

# Flag matches where neither team is playing in its own country:
data['is_neutral_ground'] = ~(
    (data['team_1'] == data['ground_country']) |
    (data['team_2'] == data['ground_country'])
)
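The rule behind the boolean expression can be stated for a single match as a small function. This is an illustrative sketch only; `is_neutral` is a hypothetical helper, and the notebook itself uses the vectorised form.

```python
def is_neutral(team_1: str, team_2: str, ground_country: str) -> bool:
    """Return True when neither side is playing in its home country."""
    return ground_country not in (team_1, team_2)


print(is_neutral("India", "Pakistan", "India"))  # India is at home
print(is_neutral("India", "Pakistan", "UAE"))    # neither side is at home
```

Note that a `ground_country` of `'Unknown'` never equals a team name, so such matches are also reported as neutral.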


In [81]:
data.head(10)


Unnamed: 0,team_1,team_2,winner,margin,ground,format,test_score,odi_score,t20i_score,year,start_date,total_duration,month,ground_country,is_neutral_ground
0,India,Pakistan,drawn,,Bengaluru,Test,1852,0,0,2007,2007-12-08,5,Dec,India,False
1,India,Pakistan,drawn,,Eden Gardens,Test,1850,0,0,2007,2007-11-30,5,Nov,India,False
2,India,Pakistan,India,6 wickets,Delhi,Test,1849,0,0,2007,2007-11-22,5,Nov,India,False
3,Pakistan,India,Pakistan,341 runs,Karachi,Test,1783,0,0,2006,2006-01-29,4,Jan,Pakistan,False
4,Pakistan,India,drawn,,Faisalabad,Test,1782,0,0,2006,2006-01-21,5,Jan,Pakistan,False
5,Pakistan,India,drawn,,Lahore,Test,1781,0,0,2006,2006-01-13,5,Jan,Pakistan,False
6,India,Pakistan,Pakistan,168 runs,Bengaluru,Test,1743,0,0,2005,2005-03-24,5,Mar,India,False
7,India,Pakistan,India,195 runs,Eden Gardens,Test,1741,0,0,2005,2005-03-16,5,Mar,India,False
8,India,Pakistan,drawn,,Mohali,Test,1738,0,0,2005,2005-03-08,5,Mar,India,False
9,Pakistan,India,India,inns & 131 runs,Rawalpindi,Test,1697,0,0,2004,2004-04-13,4,Apr,Pakistan,False


In [82]:
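# Check the is_neutral_ground column for missing values:
data["is_neutral_ground"].isnull().any()


np.False_

The column has no missing values. This check cannot fail, though: the column is boolean, and a ground with no country match gets `'Unknown'` (or `NaN`) in `ground_country`, so its matches are simply flagged as neutral.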
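A more direct way to verify the mapping is to list the grounds whose country could not be resolved. The sketch below uses a toy frame; `toy`, `lookup`, and the ground name `'Somewhere'` are made up for illustration.

```python
import pandas as pd

# Toy frame; 'Somewhere' is an invented ground with no entry in the lookup.
toy = pd.DataFrame({"ground": ["Delhi", "Somewhere", "Karachi", "Somewhere"]})
lookup = {"Delhi": "India", "Karachi": "Pakistan"}

# map() yields NaN for grounds missing from the lookup; label them explicitly.
toy["ground_country"] = toy["ground"].map(lookup).fillna("Unknown")

# Collect the distinct grounds that still have no country.
unmapped = sorted(set(toy.loc[toy["ground_country"] == "Unknown", "ground"]))
print(unmapped)
```

An empty list would mean every ground in the data was assigned a country.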

#### 3. **Create `won_by_wickets` and `won_by_runs` columns from `margin`:**

In [None]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration', 'month', 'ground_country', 'is_neutral_ground'],
      dtype='object')

1. **Identify Matches with Results:**   
Filter out rows where the `winner` column contains "drawn" or is missing. Only process rows where a team has won.

In [84]:
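# Keep only matches with a winner (drop draws and rows with a missing winner).
# .copy() returns an independent frame, so new columns can be added safely later.
data = data[data["winner"].notna() & (data["winner"] != "drawn")].copy()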
data

Unnamed: 0,team_1,team_2,winner,margin,ground,format,test_score,odi_score,t20i_score,year,start_date,total_duration,month,ground_country,is_neutral_ground
2,India,Pakistan,India,6 wickets,Delhi,Test,1849,0,0,2007,2007-11-22,5,Nov,India,False
3,Pakistan,India,Pakistan,341 runs,Karachi,Test,1783,0,0,2006,2006-01-29,4,Jan,Pakistan,False
6,India,Pakistan,Pakistan,168 runs,Bengaluru,Test,1743,0,0,2005,2005-03-24,5,Mar,India,False
7,India,Pakistan,India,195 runs,Eden Gardens,Test,1741,0,0,2005,2005-03-16,5,Mar,India,False
9,Pakistan,India,India,inns & 131 runs,Rawalpindi,Test,1697,0,0,2004,2004-04-13,4,Apr,Pakistan,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7788,Australia,ICC World XI,Australia,210 runs,Sydney,Test,1768,0,0,2005,2005-10-14,4,Oct,Australia,False
7789,Australia,ICC World XI,Australia,156 runs,Melbourne (Docklands),ODI,0,2284,0,2005,2005-10-09,1,Oct,Australia,False
7790,Australia,ICC World XI,Australia,55 runs,Melbourne (Docklands),ODI,0,2283,0,2005,2005-10-07,1,Oct,Australia,False
7791,Australia,ICC World XI,Australia,93 runs,Melbourne (Docklands),ODI,0,2282,0,2005,2005-10-05,1,Oct,Australia,False


2. **Extract Wickets from the `margin` Column:**   
Check if the `margin` column contains the word `"wickets"` (or the singular `"wicket"`). Extract the numeric value before it and assign it to the `won_by_wickets` column.

In [85]:
# Extract the number of wickets from the margin column ("wickets?" also matches "1 wicket")
data["won_by_wickets"] = data["margin"].str.extract(r"(\d+)\s+wickets?", expand=False).astype(float)

3. **Extract Runs from the `margin` Column:**    
Check if the `margin` column contains the word `"runs"` (or the singular `"run"`). Extract the numeric value before it and assign it to the `won_by_runs` column.

In [86]:
# Extract the number of runs from the margin column ("runs?" also matches "1 run")
data["won_by_runs"] = data["margin"].str.extract(r"(\d+)\s+runs?", expand=False).astype(float)

4. **Handle Matches Won by an Innings and Runs:**    
The dataset records these margins as `"inns & X runs"`. The pattern in step 3 already captures the run count from such strings, so the cell below only fills values that step may have missed. The `won_by_wickets` column will remain `NaN` for these matches.

In [90]:
# Handle matches won by an innings (written "inns & X runs" in this dataset)
innings_and_runs = data["margin"].str.extract(r"inns\s*&\s*(\d+)\s+runs?", expand=False).astype(float)
data["won_by_runs"] = data["won_by_runs"].fillna(innings_and_runs)

5. **Fill Missing Values:**   
If a match is not won by runs or wickets, the corresponding columns will remain `NaN`.

In [None]:
# Optional: fill missing values with 0 (left commented out so that NaN keeps meaning "not applicable")
#data["won_by_wickets"] = data["won_by_wickets"].fillna(0).astype(int)
#data["won_by_runs"] = data["won_by_runs"].fillna(0).astype(int)


6. **Verify the Results:**   
Check the first few rows to ensure the new columns are created correctly.

In [None]:
# Verify the results
data[["winner", "margin", "won_by_wickets", "won_by_runs"]].head()


Unnamed: 0,winner,margin,won_by_wickets,won_by_runs
2,India,6 wickets,6.0,
3,Pakistan,341 runs,,341.0
6,Pakistan,168 runs,,168.0
7,India,195 runs,,195.0
9,India,inns & 131 runs,,131.0
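The extraction steps above can also be sketched as one plain-Python function, which makes it easy to try the margin formats seen in the output. `parse_margin` and `MARGIN_RE` are hypothetical names, and the third element of the result (an innings flag) is an illustrative extra, not a column the notebook creates.

```python
import re

# Optional "inns & " prefix, then a number, then "run(s)" or "wicket(s)".
MARGIN_RE = re.compile(
    r"(?P<inns>inns\s*&\s*)?(?P<num>\d+)\s+(?P<unit>runs?|wickets?)"
)


def parse_margin(margin):
    """Return (won_by_wickets, won_by_runs, by_innings) for one margin value."""
    match = MARGIN_RE.search(margin) if isinstance(margin, str) else None
    if match is None:
        # Missing or unrecognised margin: neither runs nor wickets apply.
        return (None, None, False)
    value = int(match.group("num"))
    by_innings = match.group("inns") is not None
    if match.group("unit").startswith("wicket"):
        return (value, None, by_innings)
    return (None, value, by_innings)


for example in ["6 wickets", "341 runs", "inns & 131 runs", "1 run", None]:
    print(example, "->", parse_margin(example))
```

Returning `None` instead of `0` keeps "not applicable" distinct from a real margin, matching the `NaN` values in the table above.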


In [None]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration', 'month', 'ground_country', 'is_neutral_ground',
       'won_by_wickets', 'won_by_runs'],
      dtype='object')

In [None]:
data.head()

Unnamed: 0,team_1,team_2,winner,margin,ground,format,test_score,odi_score,t20i_score,year,start_date,total_duration,month,ground_country,is_neutral_ground,won_by_wickets,won_by_runs
2,India,Pakistan,India,6 wickets,Delhi,Test,1849,0,0,2007,2007-11-22,5,Nov,India,False,6.0,
3,Pakistan,India,Pakistan,341 runs,Karachi,Test,1783,0,0,2006,2006-01-29,4,Jan,Pakistan,False,,341.0
6,India,Pakistan,Pakistan,168 runs,Bengaluru,Test,1743,0,0,2005,2005-03-24,5,Mar,India,False,,168.0
7,India,Pakistan,India,195 runs,Eden Gardens,Test,1741,0,0,2005,2005-03-16,5,Mar,India,False,,195.0
9,Pakistan,India,India,inns & 131 runs,Rawalpindi,Test,1697,0,0,2004,2004-04-13,4,Apr,Pakistan,False,,131.0
