## Data Analysis of Test Cricket Bowlers' Stats

#### Prepared By: Ellin Ankon Dewan <br>Ahsanullah University of Science & Technology

In this project, I take performance data for the leading wicket-takers in Test cricket and carry out a data analysis on it.

Original Data Source: https://stats.espncricinfo.com/ci/content/records/93276.html
<br>The source is updated continuously, so the values for cricketers who are still playing may change in the future.
<br>For this project, the data was taken from the above link before 07 September 2021.

#### Importing necessary libraries

In [116]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)        #Ignoring Warning notifications
import numpy as np
import pandas as pd

#### Reading csv File

In [117]:
df = pd.read_csv('wickets.csv', encoding='unicode_escape')     # reading the csv file containing data
                                                               # Setting Text encoding attribute by unicode_escape

display(df.head(10))                                           # Displaying top 10 row information

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,149,274,29863,14590,524,8/15,11/121,27.84,2.93,56.9,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,7/37,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,7/51,11/60,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,9/83,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9


#### Explanation of each column of the dataset above:
This dataset covers the highest wicket-takers in the history of Test cricket. Test cricket is the oldest format of the game. A match lasts at most 5 days and is declared a "draw" if no result is reached by then. While one team bowls, the other team bats. Each team can bat at most twice and bowl at most twice.

- **Player:** Name of the player. The team(s) represented are shown in brackets.
- **Span:** Year of debut in Test cricket, followed by the year of retirement from Test cricket (for players who are still active, the latest year played).
- **Mat:** Means **Matches** played in one's career.
- **Inns:** Means **Innings**. In one Test match, a bowler can bowl in at most two innings.
- **Balls:** Number of balls a bowler delivered in his career.
- **Runs:** Number of runs a bowler conceded.
- **Wkts:** Number of wickets a bowler took, i.e. the number of batsmen dismissed off his bowling.
- **BBI:** Best bowling figures in an innings. The number on the left is the wickets taken and the number on the right is the runs conceded while taking them.
- **BBM:** Best bowling figures in a match. The number on the left is the wickets taken and the number on the right is the runs conceded while taking them.
- **Ave:** Means **Average**. It is the average number of runs conceded per wicket taken.
- **Econ:** Means **Economy**. It is the average number of runs conceded per over bowled. An over means 6 balls.
- **SR:** Means **Strike Rate**. It is the average number of balls bowled per wicket taken.
- **5:** Represents the number of times one took 5 wickets in an innings.
- **10:** Represents the number of times one took 10 wickets in a match.
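The three derived columns can be recomputed from **Balls**, **Runs** and **Wkts**. The sketch below does this for the first row of the table (M Muralitharan), assuming six-ball overs as described above. The variable names are illustrative only. The published values (22.72, 2.47, 55.0) appear to be cut to fewer decimals, so small differences in the last digit are expected.

```python
# Career figures for M Muralitharan, copied from the first row of the table above.
balls, runs, wickets = 44039, 18180, 800

average = runs / wickets          # runs conceded per wicket taken
economy = runs / (balls / 6)      # runs conceded per six-ball over (assumption: 6 balls per over)
strike_rate = balls / wickets     # balls bowled per wicket taken

print(f"Average:     {average:.3f}")
print(f"Economy:     {economy:.3f}")
print(f"Strike Rate: {strike_rate:.3f}")
```

A lower value is better for all three measures, which is why the later sections sort these columns in ascending order.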


#### Renaming Columns for user convenience

In [118]:
df = df.rename(columns={'Mat':'Matches Played', 'Inns':'Innings','Wkts':'Wickets', 'Ave':'Average','Econ':'Economy','SR':'Strike Rate','5':'5 Wkts', '10':'10 Wkts'})
display(df.head())

Unnamed: 0,Player,Span,Matches Played,Innings,Balls,Runs,Wickets,BBI,BBM,Average,Economy,Strike Rate,5 Wkts,10 Wkts
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3


#### Checking the number of Rows & Columns

In [119]:
print("No. of Rows: ", df.shape[0])                    # shape is a (rows, columns) tuple: [0] Rows, [1] Columns

print("No. of Columns: ", df.shape[1])


No. of Rows:  79
No. of Columns:  14


#### Checking data types & existence of null values before proceeding to next steps

In [120]:
display(df.info())                                     # info() prints its report itself and returns None, hence the "None" at the end of the output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          79 non-null     object 
 1   Span            79 non-null     object 
 2   Matches Played  79 non-null     object 
 3   Innings         79 non-null     int64  
 4   Balls           79 non-null     int64  
 5   Runs            79 non-null     int64  
 6   Wickets         79 non-null     int64  
 7   BBI             79 non-null     object 
 8   BBM             79 non-null     object 
 9   Average         79 non-null     float64
 10  Economy         79 non-null     float64
 11  Strike Rate     79 non-null     float64
 12  5 Wkts          79 non-null     int64  
 13  10 Wkts         79 non-null     int64  
dtypes: float64(3), int64(6), object(5)
memory usage: 8.8+ KB


None

- No null values exist in this dataset.
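The `info()` output lists **Matches Played** as `object` rather than an integer type. This is most likely caused by values such as `164*` in the table above. The sketch below shows one possible way to keep the marker as a separate flag and convert the column to integers. It uses a small sample frame, and the column name `Has_asterisk` is a hypothetical name chosen for illustration, not part of the original notebook.

```python
import pandas as pd

# Small sample mirroring the 'Matches Played' column (values copied from the table above).
sample = pd.DataFrame({
    "Player": ["M Muralitharan (ICC/SL)", "JM Anderson (ENG)", "SK Warne (AUS)"],
    "Matches Played": ["133", "164*", "145"],
})

# Keep the marker as a boolean flag (hypothetical column name), then strip it and convert.
sample["Has_asterisk"] = sample["Matches Played"].str.endswith("*")
sample["Matches Played"] = sample["Matches Played"].str.rstrip("*").astype(int)

print(sample.dtypes)
print(sample)
```

With the column stored as integers, it would also appear in `describe()` and could be used in numeric sorting.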

#### Basic statistics from dataset for multiple columns

In [121]:
display(df.describe())                                 # Basic stats from the dataset for a quick overall observation

Unnamed: 0,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,144.911392,18638.35443,8599.35443,317.21519,27.469747,2.806835,59.193671,16.35443,2.797468
std,51.180222,7199.256972,3085.168807,121.924911,3.655658,0.351577,9.350132,9.642372,3.235935
min,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,110.0,13583.0,6456.5,229.0,24.5,2.6,53.3,9.5,1.0
50%,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,304.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0


#### Removing multiple columns
- For this analysis, the columns **BBI** & **BBM** are not needed, so they are removed using the drop() function.

In [122]:
df = df.drop(['BBI','BBM'], axis =1)                         #Removing multiple columns
display(df.head())

Unnamed: 0,Player,Span,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,21.64,2.49,51.9,29,3


#### Splitting country column from player's name

In [123]:
df_name = df['Player'].str.split("(", expand=True)                  # Splitting the Player column to separate the country

df_name[0] = df_name[0].str.strip()                                 # Removing the trailing space left before the bracket
df_name[1] = df_name[1].str.replace(")", "", regex=False)           # Removing the closing bracket (literal match, not a regex)

df_name = df_name.rename({0:"Player_name", 1:"Country"}, axis=1)    # Renaming columns

df = pd.concat([df, df_name], axis=1)                               # Concatenating with main dataframe

df = df.drop('Player', axis=1)                                      # Dropping the previous Player column containing country name

new_column_seq = ['Player_name', 'Matches Played', 'Innings', 'Balls', 'Runs',
                 'Wickets', 'Average', 'Economy', 'Strike Rate', '5 Wkts', '10 Wkts', 'Span', 'Country']

df = df[new_column_seq]                                             # Convenient arrangement of columns
display(df.head())

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Span,Country
0,M Muralitharan,133,230,44039,18180,800,22.72,2.47,55.0,67,22,1992-2010,ICC/SL
1,SK Warne,145,273,40705,17995,708,25.41,2.65,57.4,37,10,1992-2007,AUS
2,JM Anderson,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,2003-2021,ENG
3,A Kumble,132,236,40850,18355,619,29.65,2.69,65.9,35,8,1990-2008,INDIA
4,GD McGrath,124,243,29248,12186,563,21.64,2.49,51.9,29,3,1993-2007,AUS
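As an alternative to splitting on the bracket and cleaning up afterwards, the name and country can be captured in one step with a regular expression. This is only a sketch on three sample values from the table. The pattern is an assumption about the `Name (COUNTRY)` layout and would need checking against the full column.

```python
import pandas as pd

# Sample values copied from the Player column shown earlier.
players = pd.Series(
    ["M Muralitharan (ICC/SL)", "SK Warne (AUS)", "N Kapil Dev (INDIA)"],
    name="Player",
)

# Group 1: text before the bracket, without the trailing space.
# Group 2: text inside the bracket.
pattern = r"^(?P<Player_name>.*?)\s*\((?P<Country>[^)]*)\)\s*$"
parts = players.str.extract(pattern)

print(parts)
```

Named groups become the column names of the result, so no separate rename step is needed.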


#### Splitting Career Span column to calculate Years Played

In [124]:
df_years = df['Span'].str.split("-", expand=True)                         # Splitting the Span column into debut and final year

df = pd.concat([df, df_years], axis=1)                                    # Concatenating with main dataframe

df = df.rename({0:"Debut_year", 1:"Retire_year"}, axis=1)                 # Renaming columns

df = df.drop('Span', axis=1)                                              # Dropping the original combined Span column

df['Debut_year'] = df['Debut_year'].astype('int')                         # Converting data type from string to integer
df['Retire_year'] = df['Retire_year'].astype('int')

df['Years_played'] = df['Retire_year'] - df['Debut_year']                 # Calculating years of career span
df = df.drop(['Debut_year','Retire_year'], axis=1)                        # Dropping both year columns, not needed further

display(df.head())


Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played
0,M Muralitharan,133,230,44039,18180,800,22.72,2.47,55.0,67,22,ICC/SL,18
1,SK Warne,145,273,40705,17995,708,25.41,2.65,57.4,37,10,AUS,15
2,JM Anderson,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,ENG,18
3,A Kumble,132,236,40850,18355,619,29.65,2.69,65.9,35,8,INDIA,18
4,GD McGrath,124,243,29248,12186,563,21.64,2.49,51.9,29,3,AUS,14
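The same calculation can be written without adding temporary columns to the main dataframe. The sketch below works on three sample spans from the table. The name `Last_year` is an illustrative choice, used because the second year is not a retirement year for players who are still active.

```python
import pandas as pd

# Sample values copied from the Span column shown earlier.
spans = pd.Series(["1992-2010", "1992-2007", "2003-2021"], name="Span")

# Split into two columns and convert both to integers in one step.
years = spans.str.split("-", expand=True).astype(int)
years.columns = ["Debut_year", "Last_year"]       # illustrative column names

# Difference between the two calendar years, as in the cell above.
years_played = years["Last_year"] - years["Debut_year"]

print(years_played.tolist())
```

Note that this counts the gap between calendar years, so a career that starts and ends in the same year would give 0.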


#### Creating a function to check if a player played for ICC or not
- If a player played for ICC, that row is assigned 1.
- If a player did not play for ICC, that row is assigned 0.


In [125]:
def check_icc(x):
    if "ICC" in x:
        return 1
    else:
        return 0
    
df['Played_for_ICC'] = df['Country'].apply(check_icc)            # Passing values of Country column to the function using ".apply()" method

display(df.head())
df["Played_for_ICC"].value_counts()                             # Counting how many players played for ICC

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
0,M Muralitharan,133,230,44039,18180,800,22.72,2.47,55.0,67,22,ICC/SL,18,1
1,SK Warne,145,273,40705,17995,708,25.41,2.65,57.4,37,10,AUS,15,0
2,JM Anderson,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,ENG,18,0
3,A Kumble,132,236,40850,18355,619,29.65,2.69,65.9,35,8,INDIA,18,0
4,GD McGrath,124,243,29248,12186,563,21.64,2.49,51.9,29,3,AUS,14,0


0    74
1     5
Name: Played_for_ICC, dtype: int64

- **5 players played at least one cricket match for ICC.**
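The same flag can also be built without a helper function by using pandas string methods. The sketch below is a minimal example on made-up sample values. `ENG/ICC` is included only to illustrate that the position of `ICC` inside the bracket does not matter.

```python
import pandas as pd

# Illustrative Country values; not the full column from the dataset.
country = pd.Series(["ICC/SL", "AUS", "ENG/ICC", "INDIA"], name="Country")

# True where the literal text "ICC" occurs, then converted to 1/0.
played_for_icc = country.str.contains("ICC", regex=False).astype(int)

print(played_for_icc.tolist())
print("Players with an ICC entry:", played_for_icc.sum())
```

Both approaches give the same result. The vectorized version avoids calling a Python function once per row.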


#### Number of countries present in the dataset

In [126]:
df['Country'] = df['Country'].str.replace("ICC/","")                    # Removing ICC from Country column
df['Country'] = df['Country'].str.replace("/ICC","")
df['Country'].describe()

count      79
unique     10
top       AUS
freq       18
Name: Country, dtype: object

- **The list contains data for a total of 79 players.**
- **Players from 10 different countries are present in this dataset.**

In [127]:
df['Country'].value_counts()

AUS      18
ENG      15
INDIA    10
WI        9
SA        8
NZ        7
PAK       7
SL        3
BDESH     1
ZIM       1
Name: Country, dtype: int64

- **18 players from Australia are present in this dataset.**
- **Only 1 Bangladeshi player is present in this dataset.**
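To put these counts in proportion, they can be converted to percentage shares. The sketch below re-enters the counts printed above by hand, so it runs without the CSV file. On the real dataframe, `value_counts(normalize=True)` should give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "AUS": 18, "ENG": 15, "INDIA": 10, "WI": 9, "SA": 8,
    "NZ": 7, "PAK": 7, "SL": 3, "BDESH": 1, "ZIM": 1,
})

# Share of each country as a percentage of all players in the list.
share = (counts / counts.sum() * 100).round(1)

print("Total players:", counts.sum())
print(share)
```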

#### Players with longest active career

In [128]:
df.sort_values(by=["Years_played","Wickets"], ascending= False).head()

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
21,Imran Khan,88,142,19458,8258,362,22.81,2.54,53.7,23,6,PAK,21,0
55,GS Sobers,93,159,21599,7999,235,34.03,2.22,91.9,6,0,WI,20,0
9,HMRKB Herath,93,170,25993,12157,433,28.07,2.8,60.0,34,9,SL,19,0
0,M Muralitharan,133,230,44039,18180,800,22.72,2.47,55.0,67,22,SL,18,1
2,JM Anderson,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,ENG,18,0


- This list shows the 5 players with the longest careers among the players in this dataset.
- Number of wickets was used as the secondary parameter for this sorting.
- **Imran Khan from Pakistan played Test cricket for 21 years. This is the longest career in the dataset.**
- He is followed by **GS Sobers with 20 years**, **HMRKB (Rangana) Herath with 19 years**, and the **remaining 2 players shown (M Muralitharan and JM Anderson) with 18 years each**.
- For a long career, a cricketer must be consistent, highly skilled and fit enough at the same time.

#### Players with shortest active career

In [129]:
df.sort_values(by=["Years_played","Wickets"], ascending= [True,False]).head(5)

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
44,GP Swann,60,109,15349,7642,255,29.96,2.98,60.1,17,3,ENG,5,0
71,K Rabada,47,86,8785,4846,213,22.75,3.3,41.2,10,4,SA,6,0
54,Yasir Shah,46*,84,13607,7248,235,30.84,3.19,57.9,16,3,PAK,7,0
61,SJ Harmison,63,115,13375,7192,226,31.82,3.22,59.1,8,1,ENG,7,1
72,JR Hazlewood,55,103,11887,5438,212,25.65,2.74,56.0,9,0,AUS,7,0


- This list contains the 5 players with the shortest careers among the players in this dataset.
- Number of wickets was used as the secondary parameter for this sorting.
- **GP Swann has the shortest career in the dataset**: **5 years, in which he took 255 wickets.**
- He is followed by **K Rabada with 6 years of career and 213 wickets.**
- Each of the **remaining 3 players played for 7 years.**
- Some of these players (for example K Rabada) were still active when the data was collected, so their career spans are not final.

#### Lowest Economy Rate

In [130]:
df.sort_values(by=["Economy","Wickets"], ascending= [True,False]).head(3)

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
32,LR Gibbs,79,148,27115,8989,309,29.09,1.98,87.7,18,2,WI,18,0
35,DL Underwood,86,151,21862,7674,297,25.83,2.1,73.6,17,6,ENG,16,0
47,R Benaud,63,116,19108,6704,248,27.03,2.1,77.0,16,1,AUS,12,0


- This list shows the top 3 bowlers with the lowest economy rates. Economy rate is the number of runs a bowler conceded per over bowled.
- Number of wickets was used as the secondary parameter for this sorting.
- **LR Gibbs from West Indies is on top of the list with an Economy Rate of 1.98.**
- **The next 2 bowlers share the same Economy Rate of 2.10, but DL Underwood from England is placed 2nd because he took more wickets.**
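The tie at 2.10 exists only at two decimal places. The sketch below recomputes economy from **Balls** and **Runs** for the three rows shown, assuming economy is runs per six-ball over as in the column description. The column name `Economy_unrounded` is illustrative. Under this assumption R Benaud's figure comes out marginally lower than DL Underwood's, so the wickets tie-break decides the order only for the published two-decimal values.

```python
import pandas as pd

# Balls, Runs and Wickets copied from the three rows shown above.
top3 = pd.DataFrame({
    "Player_name": ["LR Gibbs", "DL Underwood", "R Benaud"],
    "Balls": [27115, 21862, 19108],
    "Runs": [8989, 7674, 6704],
    "Wickets": [309, 297, 248],
})

# Assumption: economy = runs conceded per six-ball over.
top3["Economy_unrounded"] = top3["Runs"] / (top3["Balls"] / 6)

# Same sort keys as the cell above, but using the unrounded economy.
ranked = top3.sort_values(
    by=["Economy_unrounded", "Wickets"], ascending=[True, False]
).reset_index(drop=True)

print(ranked[["Player_name", "Economy_unrounded"]].round(4))
```

The gap between the two bowlers is about 0.001 runs per over, so for practical purposes they are level.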

#### Lowest Strike Rate

In [131]:
df.sort_values(by=["Strike Rate","Wickets"], ascending= [True,False]).head(3)

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
71,K Rabada,47,86,8785,4846,213,22.75,3.3,41.2,10,4,SA,6,0
7,DW Steyn,93,171,18608,10077,439,22.95,3.24,42.3,26,5,SA,15,0
20,Waqar Younis,87,154,16224,8788,373,23.56,3.25,43.4,22,5,PAK,14,0


- This list shows the top 3 bowlers with the lowest strike rates. Strike Rate represents the average number of balls a bowler needed to take one wicket.
- A higher number of wickets was used as the secondary parameter for this sorting.
- **K Rabada from South Africa leads the list with the lowest strike rate of 41.2.**
- He is followed by **DW Steyn from South Africa & Waqar Younis from Pakistan with strike rates of 42.3 & 43.4 respectively.**

#### Lowest Bowling Average

In [132]:
df.sort_values(by=["Average","Wickets"], ascending= [True,False]).head(3)

Unnamed: 0,Player_name,Matches Played,Innings,Balls,Runs,Wickets,Average,Economy,Strike Rate,5 Wkts,10 Wkts,Country,Years_played,Played_for_ICC
19,MD Marshall,81,151,17584,7876,376,20.94,2.68,46.7,22,4,WI,13,0
41,J Garner,58,111,13169,5433,259,20.97,2.47,50.8,7,0,WI,10,0
15,CEL Ambrose,98,179,22103,8501,405,20.99,2.3,54.5,22,3,WI,12,0


- This list shows the top 3 bowlers with the lowest bowling averages. These 3 bowlers conceded the fewest runs per wicket taken.
- A higher number of wickets was used as the secondary parameter for this sorting.
- The top 3 bowlers with the lowest bowling averages are all from West Indies.
- All 3 averages are just under 21, which means each of them conceded fewer than 21 runs per wicket on average. The 3rd lowest average on this list is 20.99, so no other bowler in this dataset conceded fewer than 21 runs per wicket.
- **MD Marshall** tops the list, ahead of the **2nd position holder J Garner** by a **difference of only 0.03.**
- **J Garner** is ahead of the **3rd position holder CEL Ambrose** by a **difference of only 0.02**.
- The 3 bowlers also have fairly close economy rates (2.30 to 2.68) and strike rates (46.7 to 54.5).
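The gaps quoted above can be checked with a few lines of plain Python. This is a small sketch using the three averages from the table. The differences are rounded to two decimals to hide floating-point noise.

```python
# Averages copied from the three rows above, already sorted from lowest to highest.
averages = [("MD Marshall", 20.94), ("J Garner", 20.97), ("CEL Ambrose", 20.99)]

gaps = []
for (name_a, ave_a), (name_b, ave_b) in zip(averages, averages[1:]):
    gap = round(ave_b - ave_a, 2)      # difference between neighbouring positions
    gaps.append(gap)
    print(f"{name_b} is {gap:.2f} behind {name_a}")

# Check that every listed average is below 21 runs per wicket.
all_below_21 = all(ave < 21 for _, ave in averages)
print("All three below 21:", all_below_21)
```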