### Data Analysis of Bowling records in Test matches 

This dataset lists the leading wicket-takers in Test cricket. Besides the number of wickets, it contains other features that describe a player's bowling performance over his career. Here are the features we can work with:

 * **Player:** Player's name
 * **Span:** Playing span or career duration of a player
 * **Mat:** No. of matches played
 * **Inns:** No. of innings bowled
 * **Balls:** No. of balls bowled
 * **Runs:** No. of runs conceded
 * **Wkts:** Total no. of wickets taken
 * **BBI:** Best Bowling in an Innings, i.e. the best figures in a single innings; e.g., 9/51 means 9 wickets taken for 51 runs conceded
 * **BBM:** BBM stands for Best Bowling in Match and gives the combined score over 2 innings in one match
 * **Ave:** Average (runs allowed per wicket taken)
 * **Econ:** Economy rate (runs conceded per over bowled)
 * **SR:** Strike Rate (balls bowled per wicket taken)
 * **5:** number of times this bowler has taken five wickets in an innings
 * **10:** number of times this bowler has taken ten wickets in a match (over both innings of a test)
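
The BBI and BBM columns pack two numbers into one string. A minimal sketch of how such a value could be split into separate wicket and run columns, using a few BBI values copied from the first rows of the dataset (the names `bbi_wickets` and `bbi_runs` are illustrative, not part of the dataset):

```python
import pandas as pd

# Sample BBI values taken from the first rows of the dataset
bbi = pd.Series(["9/51", "8/71", "7/42", "10/74"], name="BBI")

# Split "wickets/runs" on the slash and convert both parts to integers
parts = bbi.str.split("/", expand=True).astype(int)
parts.columns = ["bbi_wickets", "bbi_runs"]  # illustrative column names

print(parts)
```

This notebook drops both columns later on, so the split is only needed if the best-bowling figures are to be analysed numerically.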
 
#### Reference of this Dataset: https://stats.espncricinfo.com/ci/content/records/93276.html 
 

### Import Libraries

In [1]:
import pandas as pd  #data processing
import numpy as np   # linear algebra

### Read the dataset and display the first 10 rows of the dataframe

In [2]:
df = pd.read_csv("wickets.csv")
display(df.head(10))

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,149,274,29863,14590,524,8/15,11/121,27.84,2.93,56.9,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,7/37,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,7/51,11/60,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,9/83,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9


### Number of rows and columns in the Dataframe

In [3]:
print("No. of rows = ", df.shape[0])
print("No. of columns = ", df.shape[1])

No. of rows =  79
No. of columns =  14


* **There are 79 rows and 14 columns. Each row holds one player's career record as a bowler.**

### Check the Datatypes and missing values in the Dataset

In [4]:
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     object 
 3   Inns    79 non-null     int64  
 4   Balls   79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   Wkts    79 non-null     int64  
 7   BBI     79 non-null     object 
 8   BBM     79 non-null     object 
 9   Ave     79 non-null     float64
 10  Econ    79 non-null     float64
 11  SR      79 non-null     float64
 12  5       79 non-null     int64  
 13  10      79 non-null     int64  
dtypes: float64(3), int64(6), object(5)
memory usage: 8.8+ KB


None

#### At first glance, there are no missing values. The datatypes are:
   * 'Player', 'Span', 'Mat', 'BBI', 'BBM' : strings. 'Mat' should be numeric, but it is read as a string because some values carry a trailing `*` (e.g., `164*`).
   * 'Inns', 'Balls', 'Runs', 'Wkts', '5', '10' : integer values.
   * 'Ave', 'Econ', 'SR' : float values.
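
The `Mat` column could be made numeric by stripping the trailing `*` before conversion. The sketch below uses sample values from the first rows of the dataset; the notebook does not say what the `*` means, so discarding it is an assumption about what is acceptable for the analysis:

```python
import pandas as pd

# Sample 'Mat' values from the dataset; "164*" is what forces the object dtype
mat = pd.Series(["133", "145", "164*", "132"], name="Mat")

# Remove the trailing asterisk, then convert the cleaned strings to numbers
mat_clean = pd.to_numeric(mat.str.rstrip("*"))

print(mat_clean.dtype)
print(mat_clean.tolist())
```

If the marker matters, it could be kept in a separate boolean column (for example via `mat.str.endswith("*")`) before it is stripped.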

### Check Data Statistics

**`describe()` reports summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for every numerical column in the dataset.**

In [5]:
display(df.describe())

Unnamed: 0,Inns,Balls,Runs,Wkts,Ave,Econ,SR,5,10
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,144.911392,18638.35443,8599.35443,317.21519,27.469747,2.806835,59.193671,16.35443,2.797468
std,51.180222,7199.256972,3085.168807,121.924911,3.655658,0.351577,9.350132,9.642372,3.235935
min,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,110.0,13583.0,6456.5,229.0,24.5,2.6,53.3,9.5,1.0
50%,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,304.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0


**Observations:**
1. On average, the bowlers bowled in about 145 innings, with a minimum of 67 and a maximum of 304.
2. On average, the bowlers conceded about 8599 runs; the highest total is 18355.
3. On average, the bowlers took about 317 wickets, with a minimum of 200 and a maximum of 800.

### Rename the column names

In [6]:
print(df.columns)

Index(['Player', 'Span', 'Mat', 'Inns', 'Balls', 'Runs', 'Wkts', 'BBI', 'BBM',
       'Ave', 'Econ', 'SR', '5', '10'],
      dtype='object')


In [7]:
df = df.rename(columns={'Mat':'Match', 
                        'Inns':'Innings',
                        'Wkts': 'No_of_wickets_taken',
                        'Ave': 'Average',
                        'Econ': 'Economy_rate',
                        'SR': 'Strike_rate',
                       '5': 'Five_wickets_in_an_innings',
                       '10': 'Ten_wickets_in_a_match'})
print(df.columns)

Index(['Player', 'Span', 'Match', 'Innings', 'Balls', 'Runs',
       'No_of_wickets_taken', 'BBI', 'BBM', 'Average', 'Economy_rate',
       'Strike_rate', 'Five_wickets_in_an_innings', 'Ten_wickets_in_a_match'],
      dtype='object')


### Remove the columns BBI and BBM:

In [8]:
# drop both columns in a single call
df = df.drop(columns=["BBI", "BBM"])
print(df.columns)

Index(['Player', 'Span', 'Match', 'Innings', 'Balls', 'Runs',
       'No_of_wickets_taken', 'Average', 'Economy_rate', 'Strike_rate',
       'Five_wickets_in_an_innings', 'Ten_wickets_in_a_match'],
      dtype='object')


 ### How many players played for ICC?

In [9]:
def ICC_played(x):
    if 'ICC' in x:
        return 'Yes'
    else:
        return 'No'

In [10]:
df['played_for_ICC']= df['Player'].apply(ICC_played)
display(df['played_for_ICC'].head())
display(df['played_for_ICC'].value_counts())

0    Yes
1     No
2     No
3     No
4     No
Name: played_for_ICC, dtype: object

No     74
Yes     5
Name: played_for_ICC, dtype: int64

**Observations:**

Only **five** players carry the 'ICC' tag next to their name, which on ESPNcricinfo marks appearances for the ICC World XI. The other 74 players do not.
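
The same flag can be built without a helper function. This is a sketch of a vectorised alternative on three sample names in the dataset's "Name (TEAM)" format; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Sample names in the same "Name (TEAM)" format as the 'Player' column
players = pd.Series([
    "M Muralitharan (ICC/SL)",
    "SK Warne (AUS)",
    "JH Kallis (ICC/SA)",
])

# Literal substring test on the whole column at once
has_icc = players.str.contains("ICC", regex=False)

# Map the boolean mask to the same 'Yes'/'No' labels used above
played_for_icc = np.where(has_icc, "Yes", "No")

print(played_for_icc)
```

Keeping the boolean mask itself instead of 'Yes'/'No' strings would also work, and `has_icc.sum()` then gives the count directly.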

### How many different countries are present in this dataset? 

In [11]:
# split 'Player' at the first "(" into the name and the team tag
df_player = df['Player'].str.split("(", n=1, expand=True)

# concatenating the new columns with the main dataframe
df = pd.concat([df, df_player], axis=1)

# dropping the original 'Player' column
df = df.drop('Player', axis=1)

# renaming the new columns
df = df.rename(columns={0: 'Player',
                        1: 'Country'})

# strip the trailing space left on the name
df['Player'] = df['Player'].str.strip()

# remove the ")" from the 'Country' column; regex=False treats ")" as a
# literal character, so no warning filter is needed
df['Country'] = df['Country'].str.replace(")", "", regex=False).str.strip()

# rearrange the columns
col_sequence = ['Player', 'Country','Span', 'Match', 'Innings', 'Balls', 'Runs',
       'No_of_wickets_taken', 'Average', 'Economy_rate', 'Strike_rate',
       'Five_wickets_in_an_innings', 'Ten_wickets_in_a_match',
       'played_for_ICC']
df = df[col_sequence]

display(df.head())

Unnamed: 0,Player,Country,Span,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC
0,M Muralitharan,ICC/SL,1992-2010,133,230,44039,18180,800,22.72,2.47,55.0,67,22,Yes
1,SK Warne,AUS,1992-2007,145,273,40705,17995,708,25.41,2.65,57.4,37,10,No
2,JM Anderson,ENG,2003-2021,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,No
3,A Kumble,INDIA,1990-2008,132,236,40850,18355,619,29.65,2.69,65.9,35,8,No
4,GD McGrath,AUS,1993-2007,124,243,29248,12186,563,21.64,2.49,51.9,29,3,No


In [12]:
df['Country'].value_counts()

AUS        18
ENG        13
INDIA      10
WI          9
PAK         7
SA          7
NZ          6
SL          2
ENG/ICC     2
BDESH       1
ICC/SL      1
ICC/NZ      1
ICC/SA      1
ZIM         1
Name: Country, dtype: int64

**Observations:**
* The `Country` column has 14 distinct values, but four of them are combined tags such as 'ICC/SL'. Ignoring the 'ICC' part, the dataset covers **10 different countries**: Australia, England, India, West Indies, Pakistan, South Africa, New Zealand, Sri Lanka, Bangladesh and Zimbabwe.
* Players from England, New Zealand, Sri Lanka and South Africa also carry the 'ICC' tag.
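
The count of 10 can be reproduced in code by removing the 'ICC' part of each tag before counting. The sketch below runs on a handful of sample tags from the value counts above; the variable name `national` is illustrative:

```python
import pandas as pd

# Sample team tags, including the combined "ICC" forms seen in the dataset
country = pd.Series(["ICC/SL", "AUS", "ENG", "ENG/ICC", "SL", "ICC/SA", "SA"])

# Drop the literal "ICC" and then any slash left at either end
national = country.str.replace("ICC", "", regex=False).str.strip("/")

print(national.tolist())
print(national.nunique())
```

Applied to the full `Country` column, `nunique()` on the cleaned tags should give the 10 countries listed above, and `value_counts()` would then count each ICC World XI player under his national side.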

### Which player(s) had played for the longest period of time?
### Which player(s) had played for the shortest period of time?

In [13]:
# splitting the Span column and create two new columns
df['start_year'] = df['Span'].str[0:4]
df['end_year'] = df['Span'].str[5:]
# removing the "Span" column
df = df.drop("Span", axis=1)
# change data type string to integer
df['start_year'] = df['start_year'].astype('int') 
df['end_year'] = df['end_year'].astype('int')
#create new column to calculate the duration of playing of cricket
df['years_played'] = df['end_year'] - df['start_year']
#removing the 'start_year' and 'end_year' columns
df = df.drop(['start_year', "end_year"], axis=1)
#display(df.head())

In [14]:
#sorting the 'years_played' and 'Innings' columns in descending way and evaluate who has played the longest and shortest period of time
display(df.sort_values(by=['years_played','Innings'], ascending = False).head())
display(df.sort_values(by=['years_played','Innings'], ascending = False).tail())


Unnamed: 0,Player,Country,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC,years_played
21,Imran Khan,PAK,88,142,19458,8258,362,22.81,2.54,53.7,23,6,No,21
55,GS Sobers,WI,93,159,21599,7999,235,34.03,2.22,91.9,6,0,No,20
9,HMRKB Herath,SL,93,170,25993,12157,433,28.07,2.8,60.0,34,9,No,19
2,JM Anderson,ENG,164*,304,35079,16575,623,26.6,2.83,56.3,30,3,No,18
37,JH Kallis,ICC/SA,166,272,20232,9535,292,32.65,2.82,69.2,5,0,Yes,18


Unnamed: 0,Player,Country,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC,years_played
61,SJ Harmison,ENG/ICC,63,115,13375,7192,226,31.82,3.22,59.1,8,1,Yes,7
72,JR Hazlewood,AUS,55,103,11887,5438,212,25.65,2.74,56.0,9,0,No,7
54,Yasir Shah,PAK,46*,84,13607,7248,235,30.84,3.19,57.9,16,3,No,7
71,K Rabada,SA,47,86,8785,4846,213,22.75,3.3,41.2,10,4,No,6
44,GP Swann,ENG,60,109,15349,7642,255,29.96,2.98,60.1,17,3,No,5


**Observations:**
* The five players with the **longest** careers are listed below. Some players have the same career length, so ties are broken by the **'Innings'** column (more innings first).
1. Imran Khan = 21 years
2. GS Sobers = 20 years
3. HMRKB Herath = 19 years
4. JM Anderson = 18 years
5. JH Kallis = 18 years
* The five players with the **shortest** careers are:
1. GP Swann = 5 years
2. K Rabada = 6 years
3. Yasir Shah = 7 years
4. JR Hazlewood = 7 years
5. SJ Harmison = 7 years
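
Because several players share the same career length, a fixed `head()` or `tail()` can cut a tie in half. One possible alternative is `nlargest` / `nsmallest` with `keep="all"`, which returns every row tied with the last selected value. The sketch uses a small hand-copied sample of the `years_played` values above:

```python
import pandas as pd

# Small sample of career lengths copied from the tables above
careers = pd.DataFrame({
    "Player": ["Imran Khan", "GS Sobers", "HMRKB Herath",
               "JM Anderson", "JH Kallis", "GP Swann"],
    "years_played": [21, 20, 19, 18, 18, 5],
})

# Ask for the top 4; keep="all" also returns anyone tied with the 4th value
longest = careers.nlargest(4, "years_played", keep="all")

# The single shortest career, again keeping ties if there were any
shortest = careers.nsmallest(1, "years_played", keep="all")

print(longest)
print(shortest)
```

Here the request for four rows returns five, because JM Anderson and JH Kallis both played for 18 years.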


### How many Australian Bowlers are present in this dataset?

In [15]:
display(df["Country"].str.contains("AUS").sum())

18

The same figure appears in the `Country` value counts above: there are 18 Australian bowlers in this dataset.

### Is there any Bangladeshi player present in this dataset?

In [16]:
display(df["Country"].str.contains("BDESH").sum())

1

The same figure appears in the `Country` value counts above: there is only one Bangladeshi player in this dataset.

### Which player had the lowest economy rate?

In [17]:
df.sort_values(by='Economy_rate').head()

Unnamed: 0,Player,Country,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC,years_played
32,LR Gibbs,WI,79,148,27115,8989,309,29.09,1.98,87.7,18,2,No,18
47,R Benaud,AUS,63,116,19108,6704,248,27.03,2.1,77.0,16,1,No,12
35,DL Underwood,ENG,86,151,21862,7674,297,25.83,2.1,73.6,17,6,No,16
39,BS Bedi,INDIA,67,118,21364,7637,266,28.71,2.14,80.3,14,1,No,13
68,CV Grimmett,AUS,37,67,14513,5231,216,24.21,2.16,67.1,21,7,No,11


**Observation:**


Economy rate is the average number of runs conceded per over bowled. A lower economy rate is preferable: it means the bowler gives away fewer runs per over. How quickly a bowler takes wickets is measured by the strike rate, not the economy rate.

**LR Gibbs** has the lowest economy rate (1.98). He conceded 8989 runs from 27115 balls, which is approximately 4519 six-ball overs.
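
The overs figure and the economy rate can be recomputed from the raw columns as a sanity check. This sketch assumes six-ball overs throughout and uses three rows copied from the dataset; the column `econ_recomputed` is illustrative:

```python
import pandas as pd

# Rows copied from the dataset (Balls, Runs and the listed economy rate)
bowlers = pd.DataFrame({
    "Player": ["LR Gibbs", "R Benaud", "M Muralitharan"],
    "Balls": [27115, 19108, 44039],
    "Runs": [8989, 6704, 18180],
    "Economy_rate": [1.98, 2.10, 2.47],
})

# Assumption: every over has six balls
overs = bowlers["Balls"] / 6

# Economy rate = runs conceded per over
bowlers["econ_recomputed"] = bowlers["Runs"] / overs

print(overs.round(1).tolist())
print(bowlers[["Player", "Economy_rate", "econ_recomputed"]].round(3))
```

The recomputed values sit within about 0.01 of the listed ones, so small differences in the second decimal place are to be expected.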

### Which player had the lowest strike rate?

In [18]:
df.sort_values(by='Strike_rate').head()

Unnamed: 0,Player,Country,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC,years_played
71,K Rabada,SA,47,86,8785,4846,213,22.75,3.3,41.2,10,4,No,6
7,DW Steyn,SA,93,171,18608,10077,439,22.95,3.24,42.3,26,5,No,15
20,Waqar Younis,PAK,87,154,16224,8788,373,23.56,3.25,43.4,22,5,No,14
19,MD Marshall,WI,81,151,17584,7876,376,20.94,2.68,46.7,22,4,No,13
25,AA Donald,SA,72,129,15519,7344,330,22.25,2.83,47.0,20,3,No,10


**Observation:**

**K Rabada** has the lowest strike rate (41.2). On average, he needs about 41 balls to take a wicket.

### Which player had the lowest bowling average?

In [19]:
df.sort_values(by='Average').head()

Unnamed: 0,Player,Country,Match,Innings,Balls,Runs,No_of_wickets_taken,Average,Economy_rate,Strike_rate,Five_wickets_in_an_innings,Ten_wickets_in_a_match,played_for_ICC,years_played
19,MD Marshall,WI,81,151,17584,7876,376,20.94,2.68,46.7,22,4,No,13
41,J Garner,WI,58,111,13169,5433,259,20.97,2.47,50.8,7,0,No,10
15,CEL Ambrose,WI,98,179,22103,8501,405,20.99,2.3,54.5,22,3,No,12
33,FS Trueman,ENG,67,127,15178,6625,307,21.57,2.61,49.4,17,3,No,13
4,GD McGrath,AUS,124,243,29248,12186,563,21.64,2.49,51.9,29,3,No,14


**Observation:**

**MD Marshall** has the lowest bowling average (20.94). On average, he concedes about 21 runs per wicket taken.
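
Bowling average and strike rate follow directly from the definitions in the feature list (runs per wicket and balls per wicket). A short sketch that recomputes both for two rows copied from the dataset; the `*_recomputed` column names are illustrative:

```python
import pandas as pd

# Rows copied from the dataset
bowlers = pd.DataFrame({
    "Player": ["MD Marshall", "K Rabada"],
    "Balls": [17584, 8785],
    "Runs": [7876, 4846],
    "No_of_wickets_taken": [376, 213],
})

# Bowling average = runs conceded per wicket taken
bowlers["avg_recomputed"] = bowlers["Runs"] / bowlers["No_of_wickets_taken"]

# Strike rate = balls bowled per wicket taken
bowlers["sr_recomputed"] = bowlers["Balls"] / bowlers["No_of_wickets_taken"]

print(bowlers[["Player", "avg_recomputed", "sr_recomputed"]].round(2))
```

The results agree with the listed `Average` and `Strike_rate` values to within rounding of the last printed digit.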