In [1]:
# Created by: Jess Gallo
# Date created: 10/02/23
# Last Modified: 10/02/23
# Analytics Vidhya: The Sledge Hack: India vs Australia Cricket Hackathon

Are you a cricket and data science fan?

Welcome to The Sledge Hack - Experience the World's Biggest Cricket Hackathon before the upcoming epic clash between India and Australia at the ICC World Cup on Oct 8, 2023. It's a celebration of cricket passion and data science wizardry – a must for every cricket fan and data enthusiast!

This hackathon is a tribute to the spirit of cricket and data science enthusiasts. Join us in the pre-World Cup cricket celebration and send your best wishes to Team India on their World Cup journey! 


Problem Statement

Your task is to make precise predictions regarding the runs scored and wickets taken by each player who has been carefully selected to represent their respective teams, India and Australia 15 members squad in the highly anticipated ICC World Cup 2023 clash on Oct 8, 2023.

To accomplish this, you will need to use data science models and techniques based upon extensive historical data encompassing both player and team performance, allowing you to offer well-informed predictions.


About the Dataset

We have provided with you a dataset containing the batting and bowling statistics of the 30 players selected for the ICC World Cup 2023 of both teams India and Australia. The dataset contains the batting and bowling stats of each ODI played by the cricketer throughout his career.

Column          Description

player_id       Unique identifier of a player
player_name     Name of the player
runs_scored     No. of runs scored by the player in the match
wickets         Wickets taken by the player in the match
runs_conceded   No. of runs conceded by the player
catches         No. of catches taken by the player
stumpings       No. of stumpings done by the player
match_date      Date of the match
opposition      Opponent team name and Ground 
match_id        Unique identifier of the match


Important: Feel free to use any open source dataset or external dataset for the hackathon.

You will need to predict the runs scored and wickets taken by these 30 players in the upcoming clash between India and Australia on Oct 8, 2023.


Sample Submission File

You need to submit the solution file similar to the sample submission file. The solution file must contain the format similar to the sample submission file given below. 

Column          Description

player_id       Unique identifier of the player
runs            No. of runs scored by the player
wickets         No. of wickets taken by the player



Evaluation metric

Your actual score will be generated only after the clash between India and Australia on Oct 8, 2023 comes to an end.

Your solution file will be evaluated against the actual runs and wickets taken by the player in the match and final score is the weighted average of the RMSE is calculated between the predictions and actual runs scored and wickets taken by the players.

At the moment, the leaderboard displays scores calculated in comparison to the benchmark model. This provides a useful reference for enhancing the model. However, it's important to note that the final score, determined based on player statistics after the match, may vary significantly. 

The final leaderboard is displayed after the clash between India and Australia on Oct 8, 2023.

In [2]:
# Libraries
import pandas as pd
import numpy as np

Data Gathering

In [3]:
df = pd.read_csv('C://Users//Gallo//Downloads//data_zpyYWs0.csv')

In [4]:
df.head(10)

Unnamed: 0,player_id,player_name,runs_scored,wickets,runs_conceded,catches,stumpings,match_date,opposition,match_id
0,1,Pat Cummins,DNB,3,28,0,0,19 Oct 2011,v South Africa Centurion,1
1,2,Steve Smith,DNB,-,-,0,0,19 Oct 2011,v South Africa Centurion,1
2,10,Mitch Marsh,8*,1,19,1,0,19 Oct 2011,v South Africa Centurion,1
3,13,David Warner,20,-,-,0,0,19 Oct 2011,v South Africa Centurion,1
4,1,Pat Cummins,11*,1,73,0,0,23 Oct 2011,v South Africa Gqeberha,2
5,2,Steve Smith,26,1,24,1,0,23 Oct 2011,v South Africa Gqeberha,2
6,13,David Warner,74,-,-,0,0,23 Oct 2011,v South Africa Gqeberha,2
7,1,Pat Cummins,6*,1,49,0,0,28 Oct 2011,v South Africa Durban,3
8,13,David Warner,10,-,-,1,0,28 Oct 2011,v South Africa Durban,3
9,1,Pat Cummins,TDNB,1,11,0,0,23 Jun 2012,v Ireland Belfast,4


EDA

In [5]:
df.shape

(2575, 10)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2575 entries, 0 to 2574
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   player_id      2575 non-null   int64 
 1   player_name    2575 non-null   object
 2   runs_scored    2575 non-null   object
 3   wickets        2575 non-null   object
 4   runs_conceded  2575 non-null   object
 5   catches        2575 non-null   object
 6   stumpings      2575 non-null   object
 7   match_date     2575 non-null   object
 8   opposition     2575 non-null   object
 9   match_id       2575 non-null   int64 
dtypes: int64(2), object(8)
memory usage: 201.3+ KB


In [7]:
# Checking to see if any columns have missing data
df.isnull().any()

player_id        False
player_name      False
runs_scored      False
wickets          False
runs_conceded    False
catches          False
stumpings        False
match_date       False
opposition       False
match_id         False
dtype: bool

In [8]:
df.describe()

Unnamed: 0,player_id,match_id
count,2575.0,2575.0
mean,15.926214,248.114951
std,7.663574,168.410523
min,1.0,1.0
25%,11.0,77.0
50%,16.0,204.0
75%,22.0,411.0
max,30.0,557.0


Data Cleaning & Preparation

In [9]:
# Changing match_date column to datetime instead of object
df['match_date'] = pd.to_datetime(df['match_date'])

In [10]:
# Checking all the unique data in the runs_scored column
df['runs_scored'].unique()

array(['DNB', '8*', '20', '11*', '26', '74', '6*', '10', 'TDNB', '4', '8',
       '56', '1*', '104', '2', '0', '67', '7', '21', '37', '127', '0*',
       '47', '24', '138', '9', '2*', '14*', '102*', '7*', '1', '34',
       '21*', '11', '84', '44', '40*', '15', '59', '70', '64', '49', '5',
       '25', '13', '17', '85', '12', '164', '52', '72', '57', '76*',
       '119', '156', '39', '60', '29', '23', '16', '3', '108*', '23*',
       '35', '51', '78', '130', '128', '179', '6', '36', '146*', '27',
       '53', '42', '29*', '18', '22*', '71*', '5*', '28', '83', '14',
       '62*', '92', '10*', '63', '27*', '71', '15*', '124', '65', '41',
       '46', '125', '50', '45', '55', '96', '32', '33', '22', '36*', '40',
       '116', '31*', '123', '95', '25*', '4*', '89*', '73', '19', '69',
       '55*', '48', '82', '107', '46*', '17*', '166', '38', '38*', '122',
       '128*', '98', '80', '20*', '131', '54', '89', '44*', '76', '3*',
       '9*', '77', '43', '19*', '106', '108', '105', '90', '63*'

In [12]:
# Taking out the data with DNB and TDNB with nothing
df['runs_scored'] = df['runs_scored'].str.replace('DNB', 'np.nan', regex=True)
df['runs_scored'] = df['runs_scored'].str.replace('TDNB','np.nan', regex=True)
# Taking out the data with * after the numbers
df['runs_scored'] = df['runs_scored'].str.replace(r'\D',' ', regex=True)
df['runs_scored'].unique()

array(['      ', '8 ', '20', '11 ', '26', '74', '6 ', '10', '       ',
       '4', '8', '56', '1 ', '104', '2', '0', '67', '7', '21', '37',
       '127', '0 ', '47', '24', '138', '9', '2 ', '14 ', '102 ', '7 ',
       '1', '34', '21 ', '11', '84', '44', '40 ', '15', '59', '70', '64',
       '49', '5', '25', '13', '17', '85', '12', '164', '52', '72', '57',
       '76 ', '119', '156', '39', '60', '29', '23', '16', '3', '108 ',
       '23 ', '35', '51', '78', '130', '128', '179', '6', '36', '146 ',
       '27', '53', '42', '29 ', '18', '22 ', '71 ', '5 ', '28', '83',
       '14', '62 ', '92', '10 ', '63', '27 ', '71', '15 ', '124', '65',
       '41', '46', '125', '50', '45', '55', '96', '32', '33', '22', '36 ',
       '40', '116', '31 ', '123', '95', '25 ', '4 ', '89 ', '73', '19',
       '69', '55 ', '48', '82', '107', '46 ', '17 ', '166', '38', '38 ',
       '122', '128 ', '98', '80', '20 ', '131', '54', '89', '44 ', '76',
       '3 ', '9 ', '77', '43', '19 ', '106', '108', '105', '90',

In [None]:
# Checking to see the unique data
df['wickets'].unique()

In [None]:
# Replacing - data with nothing
df['wickets'] = df['wickets'].str.replace('-','NaN', regex=True)
df['wickets'].unique()

In [None]:
df['catches'].unique()

In [None]:
# Replacing - data with nothing
df['catches'] = df['catches'].str.replace('-','NaN', regex=True)
df['catches'].unique()

In [None]:
df['stumpings'].unique()

In [None]:
# Replacing - data with nothing
df['stumpings'] = df['stumpings'].str.replace('-','NaN', regex=True)
df['stumpings'].unique()

In [13]:
df.isnull().any()

player_id        False
player_name      False
runs_scored      False
wickets          False
runs_conceded    False
catches          False
stumpings        False
match_date       False
opposition       False
match_id         False
dtype: bool

Model Development