## Pythagorean Expectation and the Indian Premier League
The Indian Premier League (IPL) is the biggest cricket competition in the world. It brings together the world's best players in an eight-week tournament in which eight teams play sixty games in total. Each team plays every other team twice, once at home and once away, and the competition finishes with the four best teams competing in semi-finals and then a final.

Cricket, like baseball, is a bat-and-ball game in which teams score runs and the team that scores more runs wins. There are, of course, many differences, but statistically speaking, we can generate the same Pythagorean statistic that we generated for baseball. Our data here is derived from the competition that took place in 2018.
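
As a reminder of the statistic referred to above, here is a minimal sketch. It assumes the usual baseball form with an exponent of 2; the function name `pythagorean_expectation`, its parameter names, and the example run totals are illustrative assumptions, not taken from this notebook.

```python
def pythagorean_expectation(runs_scored, runs_conceded, exponent=2):
    """Expected share of wins implied by runs scored and runs conceded."""
    scored = runs_scored ** exponent
    conceded = runs_conceded ** exponent
    return scored / (scored + conceded)


# A team that scores exactly as much as it concedes is expected to win half its games.
even_team = pythagorean_expectation(1000, 1000)

# A team that outscores its opponents is expected to win more than half (made-up totals).
strong_team = pythagorean_expectation(1100, 1000)

print(even_team, round(strong_team, 4))
```

The exponent is kept as a parameter because the value that fits best may differ between sports; 2 is only the conventional starting point.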

The IPL is played in the T20 format, in which each team has up to 120 balls to score as many runs as it can (a game takes less than three hours to complete). One difference from baseball is that runs are much easier to score: in the IPL an average score is 170 runs. Another is that outs (called wickets) are much more costly: each team has only ten wickets in the entire game, and if a team runs out of wickets before the 120 balls have been bowled (pitched), its innings is over.
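
To put the scoring rate in perspective, the figures quoted above can be turned into runs per over and runs per ball. This is only a back-of-the-envelope sketch; the six-ball over is standard in cricket, and the constant names are illustrative.

```python
BALLS_PER_OVER = 6        # standard length of an over in cricket
BALLS_PER_INNINGS = 120   # T20 allocation quoted in the text
AVERAGE_SCORE = 170       # typical IPL innings total quoted in the text

overs_per_innings = BALLS_PER_INNINGS // BALLS_PER_OVER
runs_per_over = AVERAGE_SCORE / overs_per_innings
runs_per_ball = AVERAGE_SCORE / BALLS_PER_INNINGS

print(f"{overs_per_innings} overs, {runs_per_over:.2f} runs per over, "
      f"{runs_per_ball:.2f} runs per ball")
```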

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import os

In [8]:
cwd = os.getcwd()
file_name = 'IPL2018teams.xlsx'
file_folder = 'Raw_data'

file_path = os.path.join(cwd,file_folder,file_name)
ipl_raw = pd.read_excel(file_path)
ipl_raw.head(3)

,scorecard_id,start_date,phase,name,home_team,away_team,toss_winner,toss_decision,inn1team,innings1,...,adjusted_target_indicator,adjusted_target,team1_overs,team2_overs,mom_player_id,mom_player,scoring_status,result_type,result_margin,winning_team
0,1056637,2018-04-07,,"Wankhede Stadium, Mumbai",Mumbai Indians,Chennai Super Kings,Chennai Super Kings,f,Mumbai Indians,165,...,n,0,20.0,20,44613,DJ Bravo,live bbb,ww,1,Chennai Super Kings
1,1056638,2018-04-08,,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Delhi Daredevils,Kings XI Punjab,f,Delhi Daredevils,166,...,n,0,20.0,20,170187,KL Rahul,live bbb,ww,6,Kings XI Punjab
2,1056639,2018-04-08,,"Eden Gardens, Kolkata",Kolkata Knight Riders,Royal Challengers Bangalore,Kolkata Knight Riders,f,Royal Challengers Bangalore,176,...,n,0,20.0,20,412485,N Rana,live bbb,ww,4,Kolkata Knight Riders


In [9]:
check_nas = ipl_raw.isna().sum()
check_nas

scorecard_id                  0
start_date                    0
phase                        56
name                          0
home_team                     0
away_team                     0
toss_winner                   0
toss_decision                 0
inn1team                      0
innings1                      0
wickets1                      0
overs1                        0
closure1                      0
innings2                      0
wickets2                      0
overs2                        0
closure2                      0
adjusted_target_indicator     0
adjusted_target               0
team1_overs                   0
team2_overs                   0
mom_player_id                 0
mom_player                    0
scoring_status                0
result_type                   0
result_margin                 0
winning_team                  0
dtype: int64

-- Comments on the N/A values in phase
-- A phase refers to a specific part of an innings (especially in limited-overs formats such as ODIs and T20s) that is defined by overs and match context.
-- It is not an official statistic like runs or wickets. It is more of a tactical or analytical term used to describe periods or segments of a match, each with its own strategic focus.

* Since it is not an official statistic, and the column is not used in the analysis below, I am not removing the rows with an empty phase value.
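
Before deciding to keep the rows, it can help to look at which columns have missing values and what the few non-missing `phase` entries contain. The sketch below uses a small toy frame as a stand-in for `ipl_raw`; the toy values, including the `"stage A"` label, are invented placeholders and not taken from the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ipl_raw; all values here are placeholders.
toy = pd.DataFrame({
    "scorecard_id": [1, 2, 3, 4],
    "phase": [np.nan, np.nan, "stage A", np.nan],
    "winning_team": ["X", "Y", "X", "Z"],
})

# Keep only the columns that actually contain missing values.
na_counts = toy.isna().sum()
columns_with_nas = na_counts[na_counts > 0]

# Inspect the rows where phase is present, and how often each value occurs.
rows_with_phase = toy.loc[toy["phase"].notna(), ["scorecard_id", "phase"]]
phase_values = toy["phase"].value_counts(dropna=False)

print(columns_with_nas)
print(rows_with_phase)
print(phase_values)
```

Running the same three steps on `ipl_raw` would show what the non-missing `phase` entries actually contain.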

In [11]:
ipl_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   scorecard_id               60 non-null     int64         
 1   start_date                 60 non-null     datetime64[ns]
 2   phase                      4 non-null      object        
 3   name                       60 non-null     object        
 4   home_team                  60 non-null     object        
 5   away_team                  60 non-null     object        
 6   toss_winner                60 non-null     object        
 7   toss_decision              60 non-null     object        
 8   inn1team                   60 non-null     object        
 9   innings1                   60 non-null     int64         
 10  wickets1                   60 non-null     int64         
 11  overs1                     60 non-null     float64       
 12  closure1  

In [13]:
ipl = ipl_raw[["home_team","away_team","inn1team", "innings1","innings2","winning_team"]]
ipl.head(20)

,home_team,away_team,inn1team,innings1,innings2,winning_team
0,Mumbai Indians,Chennai Super Kings,Mumbai Indians,165,169,Chennai Super Kings
1,Kings XI Punjab,Delhi Daredevils,Delhi Daredevils,166,167,Kings XI Punjab
2,Kolkata Knight Riders,Royal Challengers Bangalore,Royal Challengers Bangalore,176,177,Kolkata Knight Riders
3,Sunrisers,Rajasthan Royals,Rajasthan Royals,125,127,Sunrisers
4,Chennai Super Kings,Kolkata Knight Riders,Kolkata Knight Riders,202,205,Chennai Super Kings
5,Rajasthan Royals,Delhi Daredevils,Rajasthan Royals,153,60,Rajasthan Royals
6,Sunrisers,Mumbai Indians,Mumbai Indians,147,151,Sunrisers
7,Royal Challengers Bangalore,Kings XI Punjab,Kings XI Punjab,155,159,Royal Challengers Bangalore
8,Mumbai Indians,Delhi Daredevils,Mumbai Indians,194,195,Delhi Daredevils
9,Kolkata Knight Riders,Sunrisers,Kolkata Knight Riders,138,139,Sunrisers


In [14]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
ipl = ipl.copy()

# Win indicators; away_win assumes every game has a winner (no ties or abandoned games)
ipl["home_win"] = np.where(ipl["home_team"] == ipl["winning_team"],1,0)
ipl["away_win"] = np.where(ipl["home_team"] != ipl["winning_team"],1,0)
# innings1 belongs to whichever team batted first (inn1team)
ipl["home_runs"] = np.where(ipl["home_team"] == ipl["inn1team"],ipl["innings1"],ipl["innings2"])
ipl["away_runs"] = np.where(ipl["away_team"] == ipl["inn1team"],ipl["innings1"],ipl["innings2"])
# Constant column used to count games when aggregating
ipl["game_count"] = 1

ipl

,home_team,away_team,inn1team,innings1,innings2,winning_team,home_win,away_win,home_runs,away_runs,game_count
0,Mumbai Indians,Chennai Super Kings,Mumbai Indians,165,169,Chennai Super Kings,0,1,165,169,1
1,Kings XI Punjab,Delhi Daredevils,Delhi Daredevils,166,167,Kings XI Punjab,1,0,167,166,1
2,Kolkata Knight Riders,Royal Challengers Bangalore,Royal Challengers Bangalore,176,177,Kolkata Knight Riders,1,0,177,176,1
3,Sunrisers,Rajasthan Royals,Rajasthan Royals,125,127,Sunrisers,1,0,127,125,1
4,Chennai Super Kings,Kolkata Knight Riders,Kolkata Knight Riders,202,205,Chennai Super Kings,1,0,205,202,1
5,Rajasthan Royals,Delhi Daredevils,Rajasthan Royals,153,60,Rajasthan Royals,1,0,153,60,1
6,Sunrisers,Mumbai Indians,Mumbai Indians,147,151,Sunrisers,1,0,151,147,1
7,Royal Challengers Bangalore,Kings XI Punjab,Kings XI Punjab,155,159,Royal Challengers Bangalore,1,0,159,155,1
8,Mumbai Indians,Delhi Daredevils,Mumbai Indians,194,195,Delhi Daredevils,0,1,194,195,1
9,Kolkata Knight Riders,Sunrisers,Kolkata Knight Riders,138,139,Sunrisers,0,1,138,139,1
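
A quick consistency check on derived columns like these can catch mistakes early. The sketch below rebuilds the columns on three rows copied from the table above (index 0, 1 and 5) and checks two things: each game has exactly one winner, and no runs are lost when innings are reassigned to home and away. Here `away_win` is computed directly from `away_team`, which is an alternative to the cell above, so a tie or abandoned game would show up as a failed check. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table above (index 0, 1 and 5).
games = pd.DataFrame({
    "home_team": ["Mumbai Indians", "Kings XI Punjab", "Rajasthan Royals"],
    "away_team": ["Chennai Super Kings", "Delhi Daredevils", "Delhi Daredevils"],
    "inn1team": ["Mumbai Indians", "Delhi Daredevils", "Rajasthan Royals"],
    "innings1": [165, 166, 153],
    "innings2": [169, 167, 60],
    "winning_team": ["Chennai Super Kings", "Kings XI Punjab", "Rajasthan Royals"],
})

# innings1 belongs to the home side only when the home side batted first.
home_batted_first = games["home_team"] == games["inn1team"]
games["home_runs"] = np.where(home_batted_first, games["innings1"], games["innings2"])
games["away_runs"] = np.where(home_batted_first, games["innings2"], games["innings1"])
games["home_win"] = (games["home_team"] == games["winning_team"]).astype(int)
games["away_win"] = (games["away_team"] == games["winning_team"]).astype(int)

# Every game should have exactly one winner ...
one_winner = bool((games["home_win"] + games["away_win"] == 1).all())
# ... and the runs should add up to the same total as the two innings.
runs_preserved = bool(
    (games["home_runs"] + games["away_runs"]
     == games["innings1"] + games["innings2"]).all()
)

print(one_winner, runs_preserved)
```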


In [27]:
# Totals per home team: here home_runs are runs scored and away_runs are runs conceded
home_agg = {"game_count":"sum","home_win":"sum","home_runs":"sum","away_runs":"sum"}
home_teams = ipl.groupby("home_team").agg(home_agg).rename_axis("team")
home_teams

team,game_count,home_win,home_runs,away_runs
Chennai Super Kings,9,8,1577,1486
Delhi Daredevils,7,4,1258,1122
Kings XI Punjab,7,4,1188,1202
Kolkata Knight Riders,9,5,1468,1417
Mumbai Indians,7,3,1194,1171
Rajasthan Royals,7,5,1120,994
Royal Challengers Bangalore,7,4,1298,1286
Sunrisers,7,5,1070,1050


In [28]:
# Totals per away team: here away_runs are runs scored and home_runs are runs conceded
away_agg = {"game_count":"sum","away_win":"sum","home_runs":"sum","away_runs":"sum"}
away_teams = ipl.groupby("away_team").agg(away_agg).rename_axis("team")
away_teams

team,game_count,away_win,home_runs,away_runs
Chennai Super Kings,7,3,1264,1232
Delhi Daredevils,7,1,1265,1085
Kings XI Punjab,7,2,1124,1022
Kolkata Knight Riders,7,4,1326,1291
Mumbai Indians,7,3,1111,1186
Rajasthan Royals,8,2,1362,1237
Royal Challengers Bangalore,7,2,1097,1024
Sunrisers,10,5,1624,1651
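
One way the home and away tables might be combined into season totals is sketched below for two teams, with figures copied from the two tables above. Note that the meaning of the run columns flips between the tables: for a team's away games, `away_runs` are the runs it scored and `home_runs` the runs it conceded. The renamed columns (`games_h`, `runs_for_a`, and so on), the totals `G`, `W`, `R`, `RA`, and the exponent of 2 are illustrative assumptions, not necessarily what the rest of this notebook uses.

```python
import pandas as pd

teams = pd.Index(["Chennai Super Kings", "Sunrisers"], name="team")

# Figures copied from the home and away tables above.
home = pd.DataFrame(
    {"game_count": [9, 7], "home_win": [8, 5],
     "home_runs": [1577, 1070], "away_runs": [1486, 1050]},
    index=teams,
)
away = pd.DataFrame(
    {"game_count": [7, 10], "away_win": [3, 5],
     "home_runs": [1264, 1624], "away_runs": [1232, 1651]},
    index=teams,
)

# Rename so that "for" and "against" are unambiguous before joining.
home = home.rename(columns={"game_count": "games_h", "home_win": "wins_h",
                            "home_runs": "runs_for_h", "away_runs": "runs_against_h"})
away = away.rename(columns={"game_count": "games_a", "away_win": "wins_a",
                            "away_runs": "runs_for_a", "home_runs": "runs_against_a"})

totals = home.join(away)
totals["G"] = totals["games_h"] + totals["games_a"]
totals["W"] = totals["wins_h"] + totals["wins_a"]
totals["R"] = totals["runs_for_h"] + totals["runs_for_a"]
totals["RA"] = totals["runs_against_h"] + totals["runs_against_a"]

# Win percentage and Pythagorean expectation (exponent 2 assumed).
totals["wpc"] = totals["W"] / totals["G"]
totals["pyth"] = totals["R"] ** 2 / (totals["R"] ** 2 + totals["RA"] ** 2)

print(totals[["G", "W", "R", "RA", "wpc", "pyth"]])
```

Renaming before the join avoids the column-name clash that would otherwise occur, since both tables contain `game_count`, `home_runs` and `away_runs`.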
