# **Cricket Match Dataset: Test Nations (1877–2025):**

### **Data Preparation and Cleaning:**

In [1]:
'''import os
import sys
import warnings
from datetime import datetime
# os, sys: for file paths or custom module access.
# warnings: to suppress or manage warnings.
# datetime: useful for time-based data or tracking execution.

# ------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Ensures plots appear in the notebook
%matplotlib inline
sns.set_theme(style="whitegrid")  # Sets plot theme

# -------------------------------------------------------------------------------------

# plotly.express: high-level API for quick visualizations.
# plotly.graph_objects: for detailed, customized interactive plots.
import plotly.express as px
import plotly.graph_objects as go

# ----------------------------------------------------------------------------------------

warnings.filterwarnings("ignore") # Ignore warnings

# ----------------------------------------------------------------------------------------
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # Format floats nicely

# --------------------------------------------------------------------------------------

from IPython.display import display, HTML, Markdown
# display(): for displaying DataFrames, HTML, Markdown, or Plotly figures without needing print.
# HTML(): for embedding raw HTML (tables, styling, formatting).
# Markdown(): to render Markdown strings dynamically.

# -----------------------------------------------------------------------------------------
# Allows us to display output from multiple lines in a single cell
#  (useful when returning multiple objects).
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# --------------------------------------------------------------------------------------
# Automatically reloads modules we edit outside the notebook 
# without restarting the kernel 
# (useful in modular analysis or app development).
%load_ext autoreload
%autoreload 2

#  Adds clean and interactive progress bars in 
# loops and DataFrame operations.
from tqdm.notebook import tqdm

# jupyterthemes: for consistent UI style
# (optional but adds polish when presenting).
# !pip install jupyterthemes
# Example (after installing): 
# !jt -t grade3 -ofs 12 -nfs 12 -tfs 12 -cellw 88%
'''



In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import sys
import warnings
from datetime import datetime
warnings.filterwarnings("ignore") # Ignore warnings
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # Format floats nicely

In [3]:
data= pd.read_csv("C:/Users/MyMachine/Desktop/Mission-Project/00_DataSets/25_Cricket-all-teams-all-matches.csv")

**`Data Source`:** __https://www.kaggle.com/datasets/qammarshahzad/cricket-match-dataset-test-nations-18772025__

In [4]:
data.shape

(7793, 8)

In [5]:
data.columns

Index(['Team 1', 'Team 2', 'Winner', 'Margin', 'Ground', 'Match Date',
       'Scorecard', 'Format'],
      dtype='object')

This dataset contains the complete history of international cricket matches played by all `Test-playing nations` from `1877 to 2025`, including `Afghanistan`, `Australia`, `Bangladesh`, `England`, `India`, `Ireland`, `New Zealand`, `Pakistan`, `South Africa`, `Sri Lanka`, `West Indies`, and `Zimbabwe`.

It also includes `official ICC World XI matches` played against `Australia`, `Pakistan`, and `West Indies`, providing a more complete view of rare yet recognized international fixtures.

Match `results`, `margins`, `venues`, and `team performances` are covered across all formats. 

In [6]:
# check for the column data types:
data.dtypes

Team 1        object
Team 2        object
Winner        object
Margin        object
Ground        object
Match Date    object
Scorecard     object
Format        object
dtype: object

So every column is stored with the pandas `object` dtype, which here means plain Python strings.

In [7]:
# Check for null values:
data.isnull().any()

Team 1        False
Team 2        False
Winner        False
Margin         True
Ground        False
Match Date    False
Scorecard     False
Format        False
dtype: bool

In [8]:
data.isnull().sum()

Team 1           0
Team 2           0
Winner           0
Margin        1038
Ground           0
Match Date       0
Scorecard        0
Format           0
dtype: int64

Only the `Margin` column contains null values.

The dataset has `7793` rows and `8` columns, and `Margin` is missing in `1038` of those rows.

In [9]:
# % of null values in Margin column:
data["Margin"].isnull().sum() / len(data) * 100

np.float64(13.319645836006671)

So about 13.3% of the values in the `Margin` column are missing.
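A quick way to see where the missing margins sit is to count them per match result. The sketch below uses a small hand-made frame (an assumption for illustration, not rows loaded from the CSV) that keeps the original column names `Winner` and `Margin`; the helper column `margin_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values mimic rows displayed in this notebook.
toy = pd.DataFrame({
    "Winner": ["drawn", "drawn", "India", "Pakistan", "no result"],
    "Margin": [np.nan, np.nan, "6 wickets", "341 runs", np.nan],
})

# Flag rows whose margin is missing, then count the flagged rows per result.
toy["margin_missing"] = toy["Margin"].isna()
missing_by_result = toy.groupby("Winner")["margin_missing"].sum()
print(missing_by_result)
```

Running the same `groupby` on the full `data` frame would show whether the nulls are limited to results such as `drawn` or `no result`, which is what the displayed rows suggest but do not prove.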

In [10]:
data

Unnamed: 0,Team 1,Team 2,Winner,Margin,Ground,Match Date,Scorecard,Format
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test # 1852,Test
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test # 1850,Test
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test # 1849,Test
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test # 1783,Test
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test # 1782,Test
...,...,...,...,...,...,...,...,...
7788,Australia,ICC World XI,Australia,210 runs,Sydney,"Oct 14-17, 2005",Test # 1768,Test
7789,Australia,ICC World XI,Australia,156 runs,Melbourne (Docklands),"Oct 9, 2005",ODI # 2284,ODI
7790,Australia,ICC World XI,Australia,55 runs,Melbourne (Docklands),"Oct 7, 2005",ODI # 2283,ODI
7791,Australia,ICC World XI,Australia,93 runs,Melbourne (Docklands),"Oct 5, 2005",ODI # 2282,ODI


The `Margin` column cannot be used as a numerical column as it stands, because each value mixes a number with a unit (for example `6 wickets` or `341 runs`). Later we will split it into two or more columns.
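As a rough sketch of what such a split could look like, the hypothetical helper `parse_margin` below breaks a margin string into an innings-win flag, a number, and a unit. It only covers the three shapes visible in this notebook (`6 wickets`, `341 runs`, `inns & 131 runs`); any other shape is an assumption left unhandled and comes back as `(False, None, None)`.

```python
import re

# Hypothetical pattern: optional "inns & " prefix, a number, then runs or wickets.
MARGIN_RE = re.compile(r"^(inns & )?(\d+) (runs?|wickets?)$")

def parse_margin(margin):
    """Return (innings_win, value, unit), or (False, None, None) if unparseable."""
    if not isinstance(margin, str):
        return False, None, None  # NaN or otherwise missing margin
    match = MARGIN_RE.match(margin.strip())
    if match is None:
        return False, None, None  # shape not covered by this sketch
    # Normalise singular and plural spellings to one unit label.
    unit = "runs" if match.group(3).startswith("run") else "wickets"
    return match.group(1) is not None, int(match.group(2)), unit

print(parse_margin("6 wickets"))
print(parse_margin("inns & 131 runs"))
```

Applied with `data["Margin"].apply(parse_margin)`, this would yield tuples that can be expanded into separate columns.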

In [11]:
data.nunique()

Team 1          13
Team 2          14
Winner          16
Margin         556
Ground         179
Match Date    6921
Scorecard     7793
Format           3
dtype: int64

So `Team 1`, `Team 2`, `Winner`, and `Format` can be treated as categorical columns. `Ground` can also be treated as categorical, although it has a large number of categories (179).
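If these columns are later stored as categoricals, one possible approach is `astype("category")`. The frame below is a toy example made up for illustration, not part of the dataset.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Team 1": ["India", "Pakistan", "India"],
    "Format": ["Test", "ODI", "Test"],
})

# Convert low-cardinality string columns to the pandas category dtype.
for col in ["Team 1", "Format"]:
    toy[col] = toy[col].astype("category")

n_formats = toy["Format"].cat.categories.size
print(toy.dtypes)
print(n_formats)
```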

In [12]:
# check for duplicate values:
data.duplicated().any()

np.False_

There are no duplicated rows in the dataset.

In [13]:
data.head(10)

Unnamed: 0,Team 1,Team 2,Winner,Margin,Ground,Match Date,Scorecard,Format
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test # 1852,Test
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test # 1850,Test
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test # 1849,Test
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test # 1783,Test
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test # 1782,Test
5,Pakistan,India,drawn,,Lahore,"Jan 13-17, 2006",Test # 1781,Test
6,India,Pakistan,Pakistan,168 runs,Bengaluru,"Mar 24-28, 2005",Test # 1743,Test
7,India,Pakistan,India,195 runs,Eden Gardens,"Mar 16-20, 2005",Test # 1741,Test
8,India,Pakistan,drawn,,Mohali,"Mar 8-12, 2005",Test # 1738,Test
9,Pakistan,India,India,inns & 131 runs,Rawalpindi,"Apr 13-16, 2004",Test # 1697,Test


In [14]:
# Convert column names to lowercase:
data.columns = data.columns.str.lower()

# Replace spaces with underscores:
data.columns = data.columns.str.replace(' ', '_')

# Replace special characters with underscores:
data.columns = data.columns.str.replace('[^a-zA-Z0-9_]', '_', regex=True)

# Replace multiple underscores with a single underscore:
data.columns = data.columns.str.replace('__+', '_', regex=True)

# Remove underscores at the beginning and end of column names:
data.columns = data.columns.str.strip('_')

In [15]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'match_date',
       'scorecard', 'format'],
      dtype='object')

In [16]:
# Get unique values from the team_1 column:
data["team_1"].unique()

array(['India', 'Pakistan', 'Afghanistan', 'Australia', 'Bangladesh',
       'England', 'Ireland', 'New Zealand', 'South Africa', 'Sri Lanka',
       'West Indies', 'Zimbabwe', 'ICC World XI'], dtype=object)

In [17]:
# Get unique values from winner column:
data["winner"].unique()

array(['drawn', 'India', 'Pakistan', 'no result', 'tied', 'Afghanistan',
       'Australia', 'Bangladesh', 'England', 'World-XI', 'Ireland',
       'New Zealand', 'South Africa', 'Sri Lanka', 'West Indies',
       'Zimbabwe'], dtype=object)

In [18]:
data.head()

Unnamed: 0,team_1,team_2,winner,margin,ground,match_date,scorecard,format
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test # 1852,Test
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test # 1850,Test
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test # 1849,Test
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test # 1783,Test
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test # 1782,Test


In [19]:
# Listing out different categories from `format` column:
data["format"].unique()

array(['Test ', 'ODI ', 'T20I '], dtype=object)

So there are three formats: `Test`, `ODI`, and `T20I`. Note that each label in the output above carries a trailing space.
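The array printed above shows every format label ending in a space, which would make a comparison such as `data["format"] == "Test"` match nothing. A minimal sketch of the usual fix, on a stand-alone Series rather than the real column:

```python
import pandas as pd

# Stand-alone Series mimicking the labels printed above, trailing spaces included.
formats = pd.Series(["Test ", "ODI ", "T20I "])

# Remove leading and trailing whitespace from every label.
cleaned = formats.str.strip()
print(cleaned.tolist())
```

On the real frame the equivalent would be `data["format"] = data["format"].str.strip()`.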

In [20]:
data["scorecard"]

0       Test # 1852
1       Test # 1850
2       Test # 1849
3       Test # 1783
4       Test # 1782
           ...     
7788    Test # 1768
7789     ODI # 2284
7790     ODI # 2283
7791     ODI # 2282
7792     T20I # 666
Name: scorecard, Length: 7793, dtype: object

In [21]:
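# Split the scorecard column into three columns, one per format:
# test_score, odi_score and t20i_score.
# Note: despite the names, these hold the match number from the
# scorecard label (e.g. 1852 in "Test # 1852"), not runs scored.

# Extract the match number for each format using regex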
data['test_score'] = data['scorecard'].str.extract(r'Test\s#\s(\d+)', expand=False)
data['odi_score'] = data['scorecard'].str.extract(r'ODI\s#\s(\d+)', expand=False)
data['t20i_score'] = data['scorecard'].str.extract(r'T20I\s#\s(\d+)', expand=False)

In [22]:
# Convert the extracted values to numeric; rows belonging to another format stay NaN
data['test_score'] = pd.to_numeric(data['test_score'], errors='coerce')
data['odi_score'] = pd.to_numeric(data['odi_score'], errors='coerce')
data['t20i_score'] = pd.to_numeric(data['t20i_score'], errors='coerce')

In [23]:
data.head()

Unnamed: 0,team_1,team_2,winner,margin,ground,match_date,scorecard,format,test_score,odi_score,t20i_score
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test # 1852,Test,1852.0,,
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test # 1850,Test,1850.0,,
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test # 1849,Test,1849.0,,
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test # 1783,Test,1783.0,,
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test # 1782,Test,1782.0,,


In [24]:
# Delete scorecard column:
data = data.drop(columns=["scorecard"])

In [25]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'match_date',
       'format', 'test_score', 'odi_score', 't20i_score'],
      dtype='object')

In [26]:
# Define a function to split and parse the match_date column
def split_match_date(date_str):
    try:
        # Split the date into the range and year
        date_range, year = date_str.split(", ")
        
        # Extract the start date (first part of the range)
        start_date_str = date_range.split("-")[0] + " " + year
        start_date = pd.to_datetime(start_date_str, format="%b %d %Y", errors="coerce")
        
        # Bail out early when the start date cannot be parsed (NaT)
        if pd.isna(start_date):
            return None, None, None

        # Extract the year with month
        year_with_month = start_date.strftime("%Y-%b")
        
        # Calculate the total duration (days)
        if "-" in date_range:
            # Check if the second part of the range includes a month
            end_date_part = date_range.split("-")[-1]
            if not any(char.isalpha() for char in end_date_part):  # If no month is present
                end_date_str = date_range.split("-")[0][:3] + " " + end_date_part + " " + year
            else:
                end_date_str = end_date_part + " " + year
            
            end_date = pd.to_datetime(end_date_str, format="%b %d %Y", errors="coerce")
            total_duration = (end_date - start_date).days + 1 if pd.notna(start_date) and pd.notna(end_date) else None
        else:
            total_duration = 1  # Single-day match
        
        return year_with_month, start_date, total_duration
    except Exception:
        return None, None, None

# Apply the function to the match_date column and create new columns
data[["year_with_month", "start_date", "total_duration"]] = data["match_date"].apply(
    lambda x: pd.Series(split_match_date(x))
)

# Display the updated DataFrame
data.head()

Unnamed: 0,team_1,team_2,winner,margin,ground,match_date,format,test_score,odi_score,t20i_score,year_with_month,start_date,total_duration
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test,1852.0,,,2007-Dec,2007-12-08,5.0
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test,1850.0,,,2007-Nov,2007-11-30,5.0
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test,1849.0,,,2007-Nov,2007-11-22,5.0
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test,1783.0,,,2006-Jan,2006-01-29,4.0
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test,1782.0,,,2006-Jan,2006-01-21,5.0


The `year_with_month` column only needs to hold the year, because the full date is already available in the `start_date` column.
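Since `start_date` is already a datetime column, the year could also be derived from it directly instead of splitting a string. The sketch below is an alternative under assumptions, not the approach this notebook takes; it uses a toy frame and the nullable `Int64` dtype so that unparsed dates stay missing.

```python
import pandas as pd

# Toy frame for illustration only; the last row stands in for an unparsed date.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(["2007-12-08", "2006-01-29", None]),
})

# .dt.year yields NaN for NaT; the nullable Int64 dtype keeps that as <NA>.
toy["year"] = toy["start_date"].dt.year.astype("Int64")
print(toy)
```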

In [27]:
data["year_with_month"]= data["year_with_month"].str.split("-", expand= True)[0]

In [28]:
data

Unnamed: 0,team_1,team_2,winner,margin,ground,match_date,format,test_score,odi_score,t20i_score,year_with_month,start_date,total_duration
0,India,Pakistan,drawn,,Bengaluru,"Dec 8-12, 2007",Test,1852.00,,,2007,2007-12-08,5.00
1,India,Pakistan,drawn,,Eden Gardens,"Nov 30-Dec 4, 2007",Test,1850.00,,,2007,2007-11-30,5.00
2,India,Pakistan,India,6 wickets,Delhi,"Nov 22-26, 2007",Test,1849.00,,,2007,2007-11-22,5.00
3,Pakistan,India,Pakistan,341 runs,Karachi,"Jan 29-Feb 1, 2006",Test,1783.00,,,2006,2006-01-29,4.00
4,Pakistan,India,drawn,,Faisalabad,"Jan 21-25, 2006",Test,1782.00,,,2006,2006-01-21,5.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7788,Australia,ICC World XI,Australia,210 runs,Sydney,"Oct 14-17, 2005",Test,1768.00,,,2005,2005-10-14,4.00
7789,Australia,ICC World XI,Australia,156 runs,Melbourne (Docklands),"Oct 9, 2005",ODI,,2284.00,,2005,2005-10-09,1.00
7790,Australia,ICC World XI,Australia,55 runs,Melbourne (Docklands),"Oct 7, 2005",ODI,,2283.00,,2005,2005-10-07,1.00
7791,Australia,ICC World XI,Australia,93 runs,Melbourne (Docklands),"Oct 5, 2005",ODI,,2282.00,,2005,2005-10-05,1.00


In [29]:
# Rename columns: 
data.rename(columns={"year_with_month": "year"}, inplace=True)

In [30]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'match_date',
       'format', 'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration'],
      dtype='object')

In [31]:
# Delete match_date column:
data = data.drop(columns=["match_date"])

In [32]:
data.columns

Index(['team_1', 'team_2', 'winner', 'margin', 'ground', 'format',
       'test_score', 'odi_score', 't20i_score', 'year', 'start_date',
       'total_duration'],
      dtype='object')

> `test_score`, `odi_score` and `t20i_score` should be in `int` format. 

> `year` should be in datetime format or `int` format. 

> `start_date` should be in `datetime` form. 

> `total_duration` should be in `int` format. 
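One way to get integer columns without inventing values for missing entries is the pandas nullable integer dtype `Int64`. The sketch below is an assumption-laden alternative on a toy Series, not the conversion used in the following cells; `test_no` is a hypothetical name.

```python
import pandas as pd

# Toy Series mimicking the scorecard labels shown earlier.
scorecard = pd.Series(["Test # 1852", "ODI # 2284", "T20I # 666"])

# Extract the Test match number; rows of other formats become missing.
extracted = scorecard.str.extract(r"Test\s#\s(\d+)", expand=False)

# Int64 (capital I) stores integers and keeps missing entries as <NA>.
test_no = pd.to_numeric(extracted, errors="coerce").astype("Int64")
print(test_no)
```

With this dtype there is no need to fill gaps with a placeholder before casting, at the cost of handling `<NA>` downstream.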

In [33]:
data.dtypes

team_1                    object
team_2                    object
winner                    object
margin                    object
ground                    object
format                    object
test_score               float64
odi_score                float64
t20i_score               float64
year                      object
start_date        datetime64[ns]
total_duration           float64
dtype: object

In [34]:
data[["test_score", "odi_score", "t20i_score"]].sample(10)

Unnamed: 0,test_score,odi_score,t20i_score
1827,,3725.0,
1751,1054.0,,
5722,316.0,,
649,,1429.0,
5719,333.0,,
2895,,1089.0,
2646,,1766.0,
414,,,704.0
795,576.0,,
1662,,,454.0


In [35]:
# In the test_score, odi_score and t20i_score columns, replace missing values with 0:
data[["test_score", "odi_score", "t20i_score"]]= data[["test_score", "odi_score", "t20i_score"]].fillna(0)

In [36]:
# All scores should be in integer format:
data[["test_score", "odi_score", "t20i_score"]]= data[["test_score", "odi_score", "t20i_score"]].astype(int)

In [37]:
data["year"].unique() # check for the unique values in the column:

array(['2007', '2006', '2005', '2004', '1999', '1989', '1987', '1984',
       '1983', '1982', '1980', '1979', '1978', '1961', None, '1960',
       '1955', '1952', '2025', '2023', '2019', '2018', '2017', '2015',
       '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2003',
       '2000', '1998', '1997', '1996', '1995', '1994', '1992', '1991',
       '1990', '1988', '1986', '1985', '2024', '2022', '2021', '2016',
       '2002', '1981', '1977', '1976', '1973', '1972', '1964', '1959',
       '1956', '2001', '1993', '1975', '2020', '1974', '1971', '1969',
       '1967', '1962', '1954', '1965', '1958', '1968', '1948', '1947',
       '1951', '1946', '1936', '1934', '1933', '1932', '1966', '1953',
       '1949', '1970', '1963', '1950', '1938', '1937', '1930', '1929',
       '1928', '1926', '1925', '1924', '1921', '1920', '1912', '1911',
       '1909', '1908', '1907', '1905', '1904', '1903', '1902', '1901',
       '1899', '1898', '1897', '1896', '1895', '1894', '1893', '1892',
       '

In [38]:
# Replace None or NaN values in the year column with the sentinel value -1:
data["year"] = data["year"].fillna(-1)

# Convert the year column to integer type
data["year"] = data["year"].astype(int)

# Display the updated DataFrame
data["year"].dtype

dtype('int64')

In [42]:
data["total_duration"].unique()
# Fill missing durations with 0 so the column can be cast to int:
data["total_duration"] = data["total_duration"].fillna(0)

In [43]:
# total_duration should be in int format:
data["total_duration"]=data["total_duration"].astype(int)

In [44]:
data.dtypes

team_1                    object
team_2                    object
winner                    object
margin                    object
ground                    object
format                    object
test_score                 int64
odi_score                  int64
t20i_score                 int64
year                       int64
start_date        datetime64[ns]
total_duration             int64
dtype: object

In [45]:
# Let's check the dataset once again:
data.head(10)

Unnamed: 0,team_1,team_2,winner,margin,ground,format,test_score,odi_score,t20i_score,year,start_date,total_duration
0,India,Pakistan,drawn,,Bengaluru,Test,1852,0,0,2007,2007-12-08,5
1,India,Pakistan,drawn,,Eden Gardens,Test,1850,0,0,2007,2007-11-30,5
2,India,Pakistan,India,6 wickets,Delhi,Test,1849,0,0,2007,2007-11-22,5
3,Pakistan,India,Pakistan,341 runs,Karachi,Test,1783,0,0,2006,2006-01-29,4
4,Pakistan,India,drawn,,Faisalabad,Test,1782,0,0,2006,2006-01-21,5
5,Pakistan,India,drawn,,Lahore,Test,1781,0,0,2006,2006-01-13,5
6,India,Pakistan,Pakistan,168 runs,Bengaluru,Test,1743,0,0,2005,2005-03-24,5
7,India,Pakistan,India,195 runs,Eden Gardens,Test,1741,0,0,2005,2005-03-16,5
8,India,Pakistan,drawn,,Mohali,Test,1738,0,0,2005,2005-03-08,5
9,Pakistan,India,India,inns & 131 runs,Rawalpindi,Test,1697,0,0,2004,2004-04-13,4


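The column dtypes now match the targets listed above. Keep in mind that `0` in the `*_score` columns, `-1` in `year`, and `0` in `total_duration` are placeholders for missing values rather than real observations, so they should be filtered out before aggregating.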
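A cheap sanity check at this point is to confirm that each row has exactly one non-zero value among the three number columns. The sketch below runs on a toy frame made up for illustration; `filled_per_row` and `all_rows_ok` are hypothetical names.

```python
import pandas as pd

# Toy frame for illustration only, shaped like the cleaned dataset.
toy = pd.DataFrame({
    "format": ["Test", "ODI", "T20I"],
    "test_score": [1852, 0, 0],
    "odi_score": [0, 2284, 0],
    "t20i_score": [0, 0, 666],
})

score_cols = ["test_score", "odi_score", "t20i_score"]

# Count how many of the three columns are non-zero in each row.
filled_per_row = (toy[score_cols] != 0).sum(axis=1)

# Every match belongs to one format, so the count should always be 1.
all_rows_ok = bool((filled_per_row == 1).all())
print(all_rows_ok)
```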

----