Cricket Analysis

This analysis is a replication of the research done by " ". However, the dataset is different than the replicated research paper. Also, it looks at the batting rankings only. All of the data were obtained from ESPN Cricinfo.com.


Firstly, the required libraries were imported.

In [1]:
#Import required libraries
import pandas as pd 
import requests
import numpy as np
from pandas.tools.plotting import table


Importing the overall batting statistics for the 2013 T20 world cup. After importing the dataset , I have dropped the columns that I do not need for this analysis. 

In [2]:
#import the data set

def fetch_data_from_url(url, drop_columns,df_list_index):
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[df_list_index]
    df.index += 1 
    df.drop(df.columns[drop_columns],axis=1, inplace=True)
    return df

url = 'http://stats.espncricinfo.com/world-t20/engine/records/averages/batting.html?id=8083;type=tournament'
drop_columns= [9,10,11,12,13]
df_list_index= 0
df_1= fetch_data_from_url(url,drop_columns,df_list_index) 

In [3]:
df_1['Mat'].dtype

dtype('int64')

Since, all of the variables in the columns were of object type, I had to convert them to integer or float type to be able to make necessary calculation. However, I was unable to make the conversion in the High Score (HS) column, because of the * characters which represented the Not Out scores of batsmen. I stripped the character off from the data and only had numbers. 

In [4]:
def strip_asterix_from_column(dataframe,colname,remove_char):
    if dataframe[colname].dtype.kind == 'o':
        dataframe[colname]=dataframe[colname].map(lambda x:x.rstrip(remove_char))

def to_numeric (dataframe, column):
    dataframe[column]=dataframe[column].apply(pd.to_numeric, errors='coerce')



exclude_columns=['Player']

for column in df_1.columns:
    if not column in exclude_columns:
        strip_asterix_from_column(df_1,column,'*')
        to_numeric(df_1, column)
       




Once the dataset was stripped off the * character, I was able to make the transistion to integer. However, there were some '-' characters, I was able to add an exception using errors='coerce' so that the columns could be converted to interger. 

Once the conversion was done, I dropped off the players data who had batted less than 2 innings. 

In [5]:
#drop the players with lesser than 1 innings. (Efficient ways of doing this)
df_1 = df_1[df_1.Inns > 2]



Finally, I sorted the dataset and ranked the players according to their rankings. 

In [6]:
#Sort based on Average (Efficient ways of doing this)
#reset and start indexing from 1
def ranking_players (dataframe,colname):
    ranking_df= dataframe.sort_values(by=colname,ascending=False)
    dataframe.reset_index(drop=True)
    dataframe.index += 1 
    return ranking_df
    





In [22]:
Ave_rank= ranking_players(df_1,'Ave')

# players = list(Ave_rank['Player'].map(lambda x:x.split(' (')[0]))
# players.sort()
# players

In [23]:
# players_2 = list(more_than_two_inns['Player'].map(lambda x:x.split(' (')[0]))
# players_2.sort()
# players_2

In [24]:
# def diff(first, second):
#     second = set(second)
#     return [item for item in first if item not in second]
    
# diff(players, players_2)

However, this study required me to calculate the rankings based on other factors. Referring to the earlier document, I need to calculate the e2 and e6 values.

Where, e2 = (sumout + 2×sumno)/n and e6 = (sumout + f6×sumno)/n where f6 = 2.2 – 0.01×avno. 

This calculation could only be done if I had the sum of out scores, and sum of not out scores. And my earlier dataset did not contain any of this information. 


I therefore needed to get the individual scores each batsman scored during the tournament. And I needed the sum of total not out scores and sum of out scores. Again, I downloaded the dataset for the each batsman individual scores. I could have done it more efficiently, however, I repeated the process of getting the dataset from over ten tables. I made sure that the dataset contained scores of batsmen that had batted more than two innings as that was the requirement according to the earlier table. 

I dropped of unwanted columns from the datasets. 

In [10]:
def fetch_data_from_url(url):
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[2]
    df.index += 1 
    df.drop(df.columns[[2,3,4,5,6,7,8,9,10,11,12]],axis=1, inplace=True)
    return df

urls = [
    "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=2;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=3;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=4;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=5;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=6;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=7;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=8;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=9;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
    ,"http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;filter=advanced;host=25;orderby=batting_average;page=10;season=2013%2F14;template=results;trophy=89;type=batting;view=innings"
]

df = []

for url in urls:
    df.append(fetch_data_from_url(url))
    

In [None]:
# for i in range(10):
#      print (df[i].head(5))

In [11]:
#merge all the dataframe together in a row

result = pd.concat(df)

Finally, I merged all of the tables into one dataframe. This merger took place based on the rows. I needed player with the same names to be merged in the same column. For this I grouped the merged dataframe by Player name and joined the as a column. This gave me the dataframe that had Players name in one column, but all of the scores were in the second column separated by commas. I then separated the scores separated by commas into other columns and renamed them to 

Player| Inns_1||Inns_2|....|Inns_7|

In [12]:
#group by player 
result= result.groupby(['Player'])['Runs'].apply(', '.join).reset_index()
result=pd.concat([result[['Player']], result['Runs'].str.split(', ', expand=True)], axis=1)
result.columns= ['Player','Inns_1','Inns_2','Inns_3','Inns_4','Inns_5','Inns_6','Inns_7']


I then dropped of any player that might have not played more than two innings. 

In [25]:
#Drop rows with cols that have more more than two None
more_than_two_inns=result.dropna(subset=['Inns_1','Inns_2','Inns_3','Inns_4','Inns_5','Inns_6','Inns_7'], thresh=3)

I also replaced all the NAN values to blank. The Inns column could have (*) character meaning not-out scores, however, I was not able to convert them to integer because of the * sign. But replacing the NAN values also replaced all the * sign scores to blank. I now could calculate the out scores. 

In [26]:
#replace the nan values with blank space
more_than_two_inns = more_than_two_inns.replace(np.nan, '', regex=True)

In [84]:
out_scores= more_than_two_inns[:]

I copied my dataframe to a new dataframe for calculating the sum of out scores
Finally, I changed all the Inns column to integer, and I was able to add the out scores of the individual batsmen. The Inns column could have (*) character meaning not-out scores, however, using errors='coerce' I was able to make an exception to these cases. And all the not-out scores were replaced to NAN. Now, I could easily calculate the sum of Out scores. 

In [88]:
#Remove the Not-out scores (it has *)- I do that by converting the cols into integer whilst 
#ignoring the not-out scores

for column in out_scores.columns:
    if not column in exclude_columns:
        strip_asterix_from_column(out_scores,column,'*')
#         to_numeric(out_scores, column)
        
out_scores= out_scores.replace(np.nan, '', regex=True)        







In [89]:
#Add all the out_scores
out_scores['Total'] = out_scores.sum(axis=1)

In [90]:
out_scores.head(10)

Unnamed: 0,Player,Inns_1,Inns_2,Inns_3,Inns_4,Inns_5,Inns_6,Inns_7,Total
0,AB de Villiers (SA),69*,24,21,10.0,5.0,,,AB de Villiers (SA)69*2421105
1,AD Hales (ENG),116*,38,12,,,,,AD Hales (ENG)116*3812
2,AD Mathews (SL),43,40,11*,6.0,,,,AD Mathews (SL)434011*6
3,AD Poynter (IRE),57,23,4,,,,,AD Poynter (IRE)57234
5,AJ Finch (AUS),71,65,16,6.0,,,,AJ Finch (AUS)7165166
7,AM Rahane (INDIA),32,19,3,,,,,AM Rahane (INDIA)32193
10,Ahmed Shehzad (PAK),111*,22,5,,,,,Ahmed Shehzad (PAK)111*225
14,Amjad Ali (UAE),20,5,1,,,,,Amjad Ali (UAE)2051
15,Amjad Javed (UAE),19,8,2,,,,,Amjad Javed (UAE)1982
16,Anamul Haque (BDESH),44*,44,42,26.0,18.0,10.0,,Anamul Haque (BDESH)44*4442261810


In [79]:
all_scores=more_than_two_inns[:]

Again, I made a new dataframe to calculate the total runs scored by each batsman. I relaced the NAN values to blank and then stripped the Not out symbol * from the scores. I could then convert the columns to integer,and then added the total scores. 

In [80]:
all_scores = all_scores.replace(np.nan, '', regex=True)

In [83]:
all_scores.head(10)

Unnamed: 0,Player,Inns_1,Inns_2,Inns_3,Inns_4,Inns_5,Inns_6,Inns_7
0,AB de Villiers (SA),69*,24,21,10.0,5.0,,
1,AD Hales (ENG),116*,38,12,,,,
2,AD Mathews (SL),43,40,11*,6.0,,,
3,AD Poynter (IRE),57,23,4,,,,
5,AJ Finch (AUS),71,65,16,6.0,,,
7,AM Rahane (INDIA),32,19,3,,,,
10,Ahmed Shehzad (PAK),111*,22,5,,,,
14,Amjad Ali (UAE),20,5,1,,,,
15,Amjad Javed (UAE),19,8,2,,,,
16,Anamul Haque (BDESH),44*,44,42,26.0,18.0,10.0,


In [82]:
# for column in all_scores.columns:
#     strip_asterix_from_column(all_scores,column,'*')
# #     to_numeric(all_scores, column)
        
# all_scores['Inns_1'] = all_scores['Inns_1'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_2'] = all_scores['Inns_2'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_3'] = all_scores['Inns_3'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_4'] = all_scores['Inns_4'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_5'] = all_scores['Inns_5'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_6'] = all_scores['Inns_6'].map(lambda x:x.rstrip('*'))
# all_scores['Inns_7'] = all_scores['Inns_7'].map(lambda x:x.rstrip('*'))


In [None]:
all_scores['Inns_1']= pd.to_numeric(all_scores.Inns_1).astype(float)
all_scores['Inns_2']= pd.to_numeric(all_scores.Inns_2).astype(float)
all_scores['Inns_3']= pd.to_numeric(all_scores.Inns_3).astype(float)
all_scores['Inns_4']= pd.to_numeric(all_scores.Inns_4).astype(float)
all_scores['Inns_5']= pd.to_numeric(all_scores.Inns_5).astype(float)
all_scores['Inns_6']= pd.to_numeric(all_scores.Inns_6).astype(float)
all_scores['Inns_7']= pd.to_numeric(all_scores.Inns_7).astype(float)



In [None]:
all_scores['Total_runs']= all_scores.iloc[:, 1:-1].sum(axis=1)

In [68]:
all_scores.head(2)

Unnamed: 0,Player,Inns_1,Inns_2,Inns_3,Inns_4,Inns_5,Inns_6,Inns_7
0,AB de Villiers (SA),69*,24,21,10.0,5.0,,
1,AD Hales (ENG),116*,38,12,,,,


In [None]:
all_scores= all_scores.replace(np.nan, '', regex=True)

In [None]:
all_scores.head(2)

In [None]:
result_1 = pd.concat([out_scores, all_scores], axis=1, sort=False)

In [None]:
result_1.columns= ['Player',"","","","","","","",'Sumout','Player_1',"1","2","3","4","5","6","7","Total Runs"]

In [None]:
result_1=result_1.drop(result_1.columns[[1,2,3,4,5,6,7,9]],axis=1)

In [None]:
result_1= result_1[['Player', '1', '2', '3','4','5','6','7','Sumout','Total Runs']]

In [None]:
result_1.head(2)

Now to find the sum of Not out runs, I substracted the sum of out runs from the total runs. This gave me the column for sum for Not out runs. 

In [None]:
#Finding the total Not-Out Runs by substracting out runs from Total runs
result_1['Sumno']= result_1['Total Runs']-result_1['Sumout']

I then merged this individual innings dataframe with my first dataframe according to players name. I was roughly able to check to see if all the batsmen had similar amount of runs from the first table and the second innings by innigs table. 

In [None]:
result_1.head(5)

In [None]:
#sort the initial dataframe alphabhetically
df_main=df.sort_values('Player', ascending=True)

In [None]:
df_3 = pd.merge(df_main, result_1, on='Player', how='right')


In [None]:
df_3.head(2)

In [None]:
df_3.drop(df_3.columns[[1,4,5,7]],axis=1, inplace=True)

In [None]:
df_3.head(2)

In [None]:
#calculate e2
df_3['e2']= (df_3.Sumout + 2*(df_3.Sumno))/df_3.Inns

In [None]:
#calculate f6
avno= (df_3.Sumno/df_3.NO)

In [None]:
avno = avno.replace(np.nan, '0', regex=True)

In [None]:
avno=avno.apply(pd.to_numeric, errors='coerce')

In [None]:
f6=2.2-0.01*avno

In [None]:
#calculate e6 = (sumout + f6×sumno)/n where f6 = 2.2 – 0.01×avno. 
df_3['e6']=(df_3.Sumout+f6*df_3.Sumno)/df_3.Inns

In [None]:
#calculate e26
df_3['e26']=(df_3.e2+df_3.e6)/2

In [None]:
#SR average

ASR=df.SR.sum(axis=0)/209

In [None]:
df_3.head(2)

In [None]:
#calculate BP wher BP26 = e26 ×RP = e26×(SR/124.0)0.5 

df_3['BP26'] = df_3.e26*(df_3.SR/ASR)**0.5

In [None]:
main_table=df_3[:]

In [None]:
main_table.head(2)

In [None]:
#Sort based on BP26(Efficient ways of doing this)
main_table.drop(main_table.columns[[4,5,6,7,8,9,10,11]],axis=1, inplace=True)
main_table = main_table.sort_values(by='BP26', ascending=False)
#reset and start indexing from 1
main_table = main_table.reset_index(drop=True)

main_table.index += 1 

In [None]:
main_table.to_csv('batting_table.csv')
df.to_csv('batting1.csv')
result_1.to_csv('batting_sum.csv')
result.to_csv('batting_with_not_out_symbol')


In [None]:
final_table= main_table.head(50)

In [None]:

final_table.to_csv('final_table.csv')