# Data Analysis of the Leading Wicket-Takers in Test Cricket
This project was prepared for submission to the workshop "Data visualization with Python" organized by EMK Center.
<br>Its objective is to apply data analysis procedures to a dataset of the leading wicket-takers in Test cricket.
<br>The dataset was collected from ESPNcricinfo.
<br>Dataset source: https://stats.espncricinfo.com/ci/content/records/93276.html

## Brief description of the dataset

In [5]:
# required libraries for data analysis
import pandas as pd
import numpy as np

## Dataset Information
The dataset contains career bowling statistics for the leading wicket-takers in Test matches.
#### Features
1. __Player:__ The name of the bowler, followed by the team(s) he played for in parentheses
2. __Span:__ The first and last years of the player's Test cricket career
3. __Mat:__ The number of Test matches the player has played
4. __Inns:__ The total number of innings in which the bowler has bowled
5. __Balls:__ The total number of balls the bowler has bowled in his Test career
6. __Runs:__ The total runs the bowler has conceded in his Test career
7. __Wkts:__ The total wickets the bowler has taken
8. __BBI:__ Best bowling figures in an innings, written as wickets/runs
9. __BBM:__ Best bowling figures in a match, written as wickets/runs
10. __Ave:__ The average number of runs the bowler has conceded per wicket taken
11. __Econ:__ The average number of runs the bowler has conceded per over bowled
12. __SR:__ The average number of balls the bowler has bowled per wicket taken
13. __5:__ The number of times the bowler has taken 5 or more wickets in an innings
14. __10:__ The number of times the bowler has taken 10 or more wickets in a match
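
The three derived features (Ave, Econ, SR) follow directly from the Balls, Runs, and Wkts totals. The short sketch below works through them for M Muralitharan's career figures from the dataset. It assumes six-ball overs throughout, which is a simplification; the variable names are illustrative only.

```python
# Worked example using M Muralitharan's career totals from the dataset.
balls, runs, wickets = 44039, 18180, 800

BALLS_PER_OVER = 6  # assumption: every over is counted as six balls

average = runs / wickets                    # runs conceded per wicket
economy = runs / (balls / BALLS_PER_OVER)   # runs conceded per over
strike_rate = balls / wickets               # balls bowled per wicket

print(f"Ave  = {average:.3f}")
print(f"Econ = {economy:.3f}")
print(f"SR   = {strike_rate:.3f}")
```

The results agree with the published values (22.72, 2.47, 55.0) up to how the last digit is rounded or truncated.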

## Loading the dataset

In [4]:
# import the dataset
# read a csv file as pandas DataFrame
df = pd.read_csv('wickets.csv')
display(df.head(10))

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,149,274,29863,14590,524,8/15,11/121,27.84,2.93,56.9,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,7/37,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,7/51,11/60,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,9/83,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9
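
The preview shows `164*` as one player's `Mat` value. If such an annotation is present in the CSV, pandas reads the whole column as strings rather than integers. The sketch below shows one possible way to strip the marker and recover a numeric column; it uses a small stand-in Series rather than the real file, so treat it as an illustration under that assumption.

```python
import pandas as pd

# Small stand-in for a 'Mat' column that contains an annotated value.
mat = pd.Series(["133", "145", "164*", "132"], name="Mat")

# Strip any trailing asterisk, then convert the strings to integers.
mat_clean = pd.to_numeric(mat.str.rstrip("*"))

print(mat_clean.tolist())
print(mat_clean.dtype)
```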


## Check for missing values and data types of the columns

In [6]:
# number of rows
print("number of rows = ", df.shape[0])

# number of columns
print("number of columns = ", df.shape[1])

number of rows =  79
number of columns =  14


In [13]:
# checking for data types of each column
display(df.dtypes)

Player     object
Span       object
Mat         int64
Inns        int64
Balls       int64
Runs        int64
Wkts        int64
BBI        object
BBM        object
Ave       float64
Econ      float64
SR        float64
5           int64
10          int64
dtype: object
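
`BBI` and `BBM` are stored as `object` (string) columns because their values have the form wickets/runs, for example `9/51`. A minimal sketch of splitting such values into two numeric columns is given below. It runs on a small stand-in Series, and the output column names `BBI_wickets` and `BBI_runs` are hypothetical, not part of the dataset.

```python
import pandas as pd

# Stand-in values copied from the first rows of the 'BBI' column.
bbi = pd.Series(["9/51", "8/71", "10/74"], name="BBI")

# Split on "/" into two columns and convert both to integers.
parts = bbi.str.split("/", expand=True).astype(int)
parts.columns = ["BBI_wickets", "BBI_runs"]  # hypothetical column names

print(parts)
```

Keeping wickets and runs as separate integer columns makes the figures sortable and usable in numeric summaries such as `describe()`.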

In [18]:
# checking for missing values
display(df.isnull().sum())

Player    0
Span      0
Mat       0
Inns      0
Balls     0
Runs      0
Wkts      0
BBI       0
BBM       0
Ave       0
Econ      0
SR        0
5         0
10        0
dtype: int64

## Descriptive statistics

In [19]:
# checking data statistics
display(df.describe())

Unnamed: 0,Mat,Inns,Balls,Runs,Wkts,Ave,Econ,SR,5,10
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,80.101266,144.797468,18630.303797,8595.506329,317.101266,27.466456,2.806582,59.187342,16.35443,2.797468
std,28.537692,51.04231,7190.036515,3080.256645,121.731587,3.657561,0.351666,9.349337,9.642372,3.235935
min,37.0,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,60.5,110.0,13580.0,6456.5,229.0,24.425,2.6,53.3,9.5,1.0
50%,71.0,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,93.0,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,166.0,301.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0


In [21]:
# Rename the column names
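# Column labels read from a CSV header are strings, so '5' and '10' must be quoted
df = df.rename(columns={'Mat': 'Match',
                        'Inns': 'Innings',
                        'Balls': 'No_of_Balls',
                        # this label has a trailing space; the column list used for reordering later matches it
                        'Wkts': 'Wickets ',
                        'BBI': 'Best_Bowling_in_an_innings',
                        'BBM': 'Best_Bowling_in_a_match',
                        'Ave': 'Average',
                        'Econ': 'Economy rate',
                        'SR': 'Bowling_strike_rate',
                        '5': '5 Wickets',
                        '10': '10 Wickets'})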

display(df.head())

Unnamed: 0,Player,Span,Match,Innings,No_of_Balls,Runs,Wickets,Best_Bowling_in_an_innings,Best_Bowling_in_a_match,Average,Economy rate,Bowling_strike_rate,5 Wickets,10 Wickets
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
3,JM Anderson (ENG),2003-2021,162,301,34791,16457,617,7/42,11/71,26.67,2.83,56.3,30,3
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3


In [22]:
# Split 'Player' at the opening parenthesis into a name part and a country part
df_player = df['Player'].str.split("(", expand=True)

display(df_player.head())

Unnamed: 0,0,1
0,M Muralitharan,ICC/SL)
1,SK Warne,AUS)
2,A Kumble,INDIA)
3,JM Anderson,ENG)
4,GD McGrath,AUS)


In [25]:
df_player = df_player.rename(columns={0: 'Player',
                                      1: 'Country'})

display(df_player.head())

Unnamed: 0,Player,Country
0,M Muralitharan,ICC/SL)
1,SK Warne,AUS)
2,A Kumble,INDIA)
3,JM Anderson,ENG)
4,GD McGrath,AUS)


In [26]:
# Remove the closing parenthesis; regex=False treats ")" as a literal character
df_player['Country'] = df_player['Country'].str.replace(")", "", regex=False)

display(df_player.head())

Unnamed: 0,Player,Country
0,M Muralitharan,ICC/SL
1,SK Warne,AUS
2,A Kumble,INDIA
3,JM Anderson,ENG
4,GD McGrath,AUS
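
The split, rename, and replace steps above can also be combined into a single regular-expression extraction. This is an alternative sketch, not the approach the notebook uses; it runs on a small stand-in Series and assumes every value has the form `Name (TEAM)`. It also removes the space that is otherwise left at the end of the player name when splitting on `(`.

```python
import pandas as pd

# Stand-in values in the same "Name (TEAM)" form as the 'Player' column.
players = pd.Series(["M Muralitharan (ICC/SL)", "SK Warne (AUS)", "A Kumble (INDIA)"])

# Named groups become the output column names.
# \s* consumes the space before "(" so it is not kept in the player name.
pattern = r"^(?P<Player>.*?)\s*\((?P<Country>[^)]*)\)$"
split = players.str.extract(pattern)

print(split)
```

Rows that do not match the pattern come back as `NaN`, which makes unexpected formats easy to spot with `isnull()`.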


In [23]:
# Remove the original combined 'Player' column; the split name and country columns replace it
df.drop('Player', axis=1, inplace=True)

In [27]:
# Append the cleaned 'Player' and 'Country' columns to the main DataFrame
df = pd.concat([df, df_player], axis=1)

display(df.head())

Unnamed: 0,Span,Match,Innings,No_of_Balls,Runs,Wickets,Best_Bowling_in_an_innings,Best_Bowling_in_a_match,Average,Economy rate,Bowling_strike_rate,5 Wickets,10 Wickets,Player,Country
0,1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22,M Muralitharan,ICC/SL
1,1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10,SK Warne,AUS
2,1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8,A Kumble,INDIA
3,2003-2021,162,301,34791,16457,617,7/42,11/71,26.67,2.83,56.3,30,3,JM Anderson,ENG
4,1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3,GD McGrath,AUS


In [29]:
# Desired column order, with 'Player' and 'Country' moved to the front
new_col_sequence = ['Player', 'Country','Span', 'Match', 'Innings', 'No_of_Balls', 'Runs', 'Wickets ',
                    'Best_Bowling_in_an_innings', 'Best_Bowling_in_a_match', 'Average',
                    'Economy rate', 'Bowling_strike_rate', '5 Wickets', '10 Wickets',]

In [30]:
df = df[new_col_sequence]

display(df.head())

Unnamed: 0,Player,Country,Span,Match,Innings,No_of_Balls,Runs,Wickets,Best_Bowling_in_an_innings,Best_Bowling_in_a_match,Average,Economy rate,Bowling_strike_rate,5 Wickets,10 Wickets
0,M Muralitharan,ICC/SL,1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne,AUS,1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,A Kumble,INDIA,1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
3,JM Anderson,ENG,2003-2021,162,301,34791,16457,617,7/42,11/71,26.67,2.83,56.3,30,3
4,GD McGrath,AUS,1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
