In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

In [2]:
ipl_bbb = pd.read_csv('IPL_ball_by_ball_updated till 2024.csv',low_memory=False)

In [3]:
ipl_bbb.columns

Index(['Match id', 'Date', 'Season', 'Batting team', 'Bowling team',
       'Innings No', 'Ball No', 'Bowler', 'Striker', 'Non Striker',
       'runs_scored', 'extras', 'type of extras', 'score', 'score/wicket',
       'wicket_confirmation', 'wicket_type', 'fielders_involved',
       'Player Out'],
      dtype='object')

In [4]:
ipl_year_id = pd.DataFrame(columns=["id", "year"])
ipl_year_id["id"] = ipl_bbb["Match id"]
ipl_year_id["year"] = pd.to_datetime(ipl_bbb["Date"], dayfirst=True).dt.year


# Create a copy of the ipl_bbb dataframe
ipl_bbbc= ipl_bbb.copy()


ipl_bbbc['year'] = pd.to_datetime(ipl_bbb["Date"], dayfirst=True).dt.year


ipl_bbbc[["Match id", "year", "runs_scored","wicket_confirmation","Bowler",'Striker']].head()

,Match id,year,runs_scored,wicket_confirmation,Bowler,Striker
0,335982,2008,0,0,P Kumar,SC Ganguly
1,335982,2008,0,0,P Kumar,BB McCullum
2,335982,2008,0,0,P Kumar,BB McCullum
3,335982,2008,0,0,P Kumar,BB McCullum
4,335982,2008,0,0,P Kumar,BB McCullum


In [5]:
# Filter the data for the bowler named 'AR Patel'
ax_patel_data = ipl_bbbc[ipl_bbbc["Bowler"] == "AR Patel"]

# Group by year and sum the wickets.
# Note: wicket_confirmation may also count run-outs, which are not credited to the bowler.
total_wicket_for_player = ax_patel_data.groupby('year')["wicket_confirmation"].sum()
total_wicket_for_player

# Convert to dictionary
total_wicket_for_player_dict = total_wicket_for_player.to_dict()
print(total_wicket_for_player_dict)

{2014: 19, 2015: 14, 2016: 13, 2017: 15, 2018: 3, 2019: 12, 2020: 11, 2021: 16, 2022: 6, 2023: 11, 2024: 9}


In [6]:
# Convert dictionary to pandas Series for fitting distribution
ab = pd.Series(total_wicket_for_player_dict)

In [7]:
def get_best_distribution(data):
    dist_names = ['alpha', 'beta', 'betaprime', 'burr12', 'crystalball',
                  'dgamma', 'dweibull', 'erlang', 'exponnorm', 'f', 'fatiguelife',
                  'gamma', 'gengamma', 'gumbel_l', 'johnsonsb', 'kappa4',
                  'lognorm', 'nct', 'norm', 'norminvgauss', 'powernorm', 'rice',
                  'recipinvgauss', 't', 'trapz', 'truncnorm']
    dist_results = []
    params = {}
    
    # Iterate over each distribution and fit to data
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)
        params[dist_name] = param
        
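        # Applying the Kolmogorov-Smirnov test. The parameters were fitted to the
        # same data, so these p-values are optimistic and serve only for ranking.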
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for " + dist_name + " = " + str(p))
        dist_results.append((dist_name, p))
    
    # Select the best fitted distribution based on p-value
    best_dist, best_p = max(dist_results, key=lambda item: item[1])
    
    # Print results
    print("\nBest fitting distribution: " + str(best_dist))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))
    
    return best_dist, best_p, params[best_dist]

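# Call the function with the season-wise wicket totals.
# Note: silencing all warnings also hides convergence problems raised by dist.fit.
import warnings
warnings.filterwarnings('ignore')
best_dist_name, best_p_val, best_params = get_best_distribution(ab)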

p value for alpha = 0.7689921190849475
p value for beta = 0.3773013507770824
p value for betaprime = 0.8572934163265811
p value for burr12 = 0.9803988104338484
p value for crystalball = 0.8975015015759963
p value for dgamma = 0.8790016021005678
p value for dweibull = 0.8820049628137945
p value for erlang = 0.8548709668732069
p value for exponnorm = 0.8975028343761439
p value for f = 0.8948291453719155
p value for fatiguelife = 0.8871074086793793
p value for gamma = 0.8370833335034745
p value for gengamma = 0.841800600694424
p value for gumbel_l = 0.9980554225481311
p value for johnsonsb = 0.8869263642566767
p value for kappa4 = 0.39609266250994624
p value for lognorm = 0.005223244613261624
p value for nct = 0.8914546072687096
p value for norm = 0.8975014221709687
p value for norminvgauss = 0.9853997444817613
p value for powernorm = 0.9884662258014886
p value for rice = 0.8530212643851435
p value for recipinvgauss = 0.8655577282352929
p value for t = 0.8974954954342778
p value for trapz

-----------------------------------------------------------------------------------------------------------------------------------------------
# **Interpretation of how well the fitted distribution represents cricket player performance metrics:**

**Interpretation:**
Among the candidate distributions, the Gumbel Left distribution (gumbel_l) had the highest Kolmogorov-Smirnov p-value for AR Patel's season-wise wicket totals (2014 to 2024, 11 observations).

- Best p-value: 0.9980554225481311
- Parameters: (13.804635631105308, 3.7883690789301996), the location and scale respectively
- The location parameter (μ) is the mode of the distribution, the value around which the season totals cluster. The scale parameter (β) measures the spread of the distribution.
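
For reference, the left-skewed Gumbel density with location μ and scale β, which is the form SciPy's `gumbel_l` implements, can be written as follows (γ ≈ 0.5772 is the Euler-Mascheroni constant):

$$
f(x;\mu,\beta) = \frac{1}{\beta}\exp\!\left(z - e^{z}\right), \qquad z = \frac{x-\mu}{\beta}
$$

$$
\mathbb{E}[X] = \mu - \gamma\beta, \qquad \operatorname{Var}[X] = \frac{\pi^{2}\beta^{2}}{6}
$$

With the parameters reported above, the implied mean is about 13.80 − 0.5772 × 3.79 ≈ 11.6 wickets per season, close to the sample mean of 129 / 11 ≈ 11.7.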

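The Gumbel Left distribution had the highest p-value among the tested distributions. A high p-value only means that the KS test found no evidence against the fit; it does not prove that the data follow this distribution. With 11 observations the test has little power, and because the parameters were estimated from the same data, the p-values are biased upward. Several other candidates (powernorm, norminvgauss, burr12) also have p-values above 0.98, so the ranking should be read as tentative.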
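
Because the parameters are estimated from the same data that the KS test is then run on, the printed p-values tend to be too large. One common remedy is a parametric bootstrap. The sketch below is not part of the original analysis; the function name `bootstrap_ks_pvalue`, the number of replicates and the seed are illustrative choices, and the wicket totals are copied from the dictionary printed earlier.

```python
import numpy as np
import scipy.stats as st

# Season wicket totals copied from the notebook output (2014 to 2024).
wickets = np.array([19, 14, 13, 15, 3, 12, 11, 16, 6, 11, 9], dtype=float)

def bootstrap_ks_pvalue(data, dist, n_boot=200, seed=0):
    """Parametric-bootstrap p-value for a KS test with fitted parameters."""
    rng = np.random.default_rng(seed)
    params = dist.fit(data)
    d_obs = st.kstest(data, dist.cdf, args=params).statistic
    d_sim = np.empty(n_boot)
    for i in range(n_boot):
        # Simulate from the fitted model, then refit so the simulated
        # statistic goes through the same estimation step as the observed one.
        sample = dist.rvs(*params, size=len(data), random_state=rng)
        sim_params = dist.fit(sample)
        d_sim[i] = st.kstest(sample, dist.cdf, args=sim_params).statistic
    # Share of simulated statistics at least as large as the observed one.
    return d_obs, float(np.mean(d_sim >= d_obs))

d_obs, p_boot = bootstrap_ks_pvalue(wickets, st.gumbel_l)
print(f"KS statistic = {d_obs:.3f}, bootstrap p-value = {p_boot:.3f}")
```

The same function could be applied to any of the other candidates (for example `st.norm`) to compare them on a like-for-like basis.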

The Gumbel Left distribution is typically used in extreme value theory to model the minimum of a set of observations. Season wicket totals are sums rather than minima, so that interpretation does not carry over directly. What the fit does capture is the left skew of the data: most seasons cluster between 11 and 16 wickets, while a few poor seasons (3 wickets in 2018, 6 in 2022) stretch the lower tail.

The fitted location and scale therefore give a compact two-number summary of AR Patel's season-to-season wicket totals, but with so few seasons they should be treated as a description of past performance rather than a reliable basis for prediction.
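
To see what the fitted parameters imply in practice, the distribution can be "frozen" and queried for percentiles and tail probabilities. This is a minimal sketch that hard-codes the location and scale reported above; the chosen percentiles and the 3-wicket threshold are illustrative.

```python
import scipy.stats as st

# Location and scale as reported in the notebook output for gumbel_l.
loc, scale = 13.804635631105308, 3.7883690789301996
fitted = st.gumbel_l(loc=loc, scale=scale)

# Central 80% range of season totals implied by the fitted model.
low, high = fitted.ppf([0.10, 0.90])

# Probability the model assigns to a season as poor as 2018 (3 wickets or fewer).
p_poor = fitted.cdf(3)

# The model is continuous and unbounded, so some mass falls below zero wickets.
p_negative = fitted.cdf(0)

print(f"10th to 90th percentile: {low:.1f} to {high:.1f} wickets")
print(f"P(total <= 3) = {p_poor:.3f}, P(total < 0) = {p_negative:.3f}")
```

The non-zero probability of a negative total is a reminder that a continuous distribution is only an approximation for count data.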

-----------------------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------------------
# **Limitations, Assumptions and Adjustments of the Gumbel Distribution:**
**Assumptions:**

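1. The Gumbel distribution is an extreme value distribution: it arises as the limiting distribution of the maximum (or minimum) of many independent, identically distributed observations with light, exponential-type tails.
2. The right-skewed form (gumbel_r) models maxima and the left-skewed form (gumbel_l) models minima; each is the mirror image of the other.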
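
The relationship between the two Gumbel forms in SciPy can be checked numerically. The sketch below reuses the location and scale reported for the `gumbel_l` fit; the evaluation grid is arbitrary.

```python
import numpy as np
import scipy.stats as st

# Parameters reported for the gumbel_l fit in this notebook.
loc, scale = 13.804635631105308, 3.7883690789301996
x = np.linspace(0, 20, 41)

left = st.gumbel_l(loc=loc, scale=scale)
# Mirror image: negate the location and evaluate at -x.
right = st.gumbel_r(loc=-loc, scale=scale)

# P(X <= x) under gumbel_l equals P(Y >= -x) under the mirrored gumbel_r.
mirror_ok = np.allclose(left.cdf(x), right.sf(-x))
print("gumbel_l is the mirror image of gumbel_r:", mirror_ok)
```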

**Limitations:**

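1. The Gumbel distribution has only location and scale parameters, so its skewness is fixed. It cannot capture multimodal data or data with a different degree of skew.
2. It assumes a specific, exponentially decaying tail, which might not be present in real-world data.
3. It is continuous and unbounded, whereas wicket totals are non-negative counts, so the fitted model assigns some probability to impossible negative values.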

In the context of cricket:
- The fit describes only one aspect of the player's performance, i.e., wicket-taking ability.
- Other aspects of the game, such as runs scored and catches taken, should also be considered.

**Adjustments:**
1. Data Transformation - Box-Cox Transformation: This can stabilize variance and make the data more normally distributed, potentially improving the fit of parametric distributions.
2. Mixture Models: Use a combination of multiple distributions to model complex data structures, especially if the data shows multimodal characteristics.
3. Robust Fitting Methods: Use robust fitting techniques that minimize the influence of outliers.
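
As an illustration of the first adjustment, SciPy's `boxcox` can estimate the transformation parameter directly from the data. This is a sketch, not part of the original analysis: the Shapiro-Wilk test is used here as one possible normality check, and the wicket totals are copied from the dictionary printed earlier.

```python
import numpy as np
import scipy.stats as st
from scipy.special import inv_boxcox

wickets = np.array([19, 14, 13, 15, 3, 12, 11, 16, 6, 11, 9], dtype=float)

# Box-Cox needs strictly positive values; every season total here is > 0.
transformed, lam = st.boxcox(wickets)

# Compare how normal-looking the raw and transformed data are.
p_raw = st.shapiro(wickets).pvalue
p_bc = st.shapiro(transformed).pvalue
print(f"lambda = {lam:.2f}, Shapiro p raw = {p_raw:.3f}, transformed = {p_bc:.3f}")

# Map results back to the wicket scale with the inverse transform.
recovered = inv_boxcox(transformed, lam)
```

Any distribution fitted to the transformed values has to be mapped back with the inverse transform before it is interpreted in wickets.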

In the context of cricket:
- Create a fantasy points system that assigns points to each type of contribution, giving an overview of the player's overall performance.
- Take multiple variables into account, as mentioned above.
- Include other influencing factors, such as experience and number of matches played, which are not considered here but can be calculated from the dataset.
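
Matches played per season, for instance, can be derived from the ball-by-ball data and used to normalise the wicket totals. The helper below is a hypothetical sketch: `season_summary` is not defined in the notebook, it assumes the notebook's column names plus the derived `year` column, and the toy rows exist only to make the example runnable.

```python
import pandas as pd

def season_summary(bbb, bowler):
    """Per-season matches, wickets and wickets per match for one bowler."""
    rows = bbb[bbb["Bowler"] == bowler]
    out = rows.groupby("year").agg(
        matches=("Match id", "nunique"),         # distinct matches bowled in
        wickets=("wicket_confirmation", "sum"),  # wickets on the bowler's deliveries
    )
    out["wickets_per_match"] = out["wickets"] / out["matches"]
    return out

# Toy rows for illustration only; these are not real IPL records.
toy = pd.DataFrame({
    "Match id": [1, 1, 2, 3, 3],
    "year": [2020, 2020, 2020, 2021, 2021],
    "Bowler": ["Bowler A"] * 5,
    "wicket_confirmation": [1, 0, 1, 0, 1],
})
summary = season_summary(toy, "Bowler A")
print(summary)
```

In the notebook this would be called as `season_summary(ipl_bbbc, "AR Patel")`.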

-----------------------------------------------------------------------------------------------------------------------------------------------
