# IPL Dataset Analysis

## Problem Statement
We want to understand what happens during an IPL match. With only limited knowledge of cricket, the game it is based on, several questions come to mind. This analysis explores which factors lead a team to win and how much they matter.

## About the Dataset
The Indian Premier League (IPL) is a professional T20 cricket league in India contested during April-May of every year by teams representing Indian cities. It is the most-attended cricket league in the world and ranks sixth among all the sports leagues. It has teams with players from around the world and is very competitive and entertaining with a lot of close matches between teams.

The IPL and other cricket-related datasets are available at [cricsheet.org](https://cricsheet.org/). Feel free to visit the website and explore the data yourself; exploring new sources of data is one of the more interesting things a data scientist gets to do.

Snapshot of the data you will be working on:<br>
<br>
The dataset has 136522 data points and 24 features.<br>

|Features|Description|
|-----|-----|
|match_code|Code identifying an individual match|
|date|Date on which the match was played|
|city|City where the match was played|
|venue|Stadium in that city where the match was played|
|team1|First team|
|team2|Second team|
|toss_winner|Team that won the toss|
|toss_decision|Decision taken by the toss winner|
|winner|Team that won the match|
|win_type|How the team won (by wickets, runs, etc.)|
|win_margin|Margin by which the team won|
|inning|Inning (1st or 2nd)|
|delivery|Ball delivery|
|batting_team|Team currently batting|
|batsman|Batsman currently on strike|
|non_striker|Batsman at the non-striker's end|
|bowler|Current bowler|
|runs|Runs scored|
|extras|Extra runs scored|
|total|Total runs scored on that delivery (runs plus extras)|
|extras_type|Type of extra (wides, no ball or leg bye)|
|player_out|Player who got out|
|wicket_kind|How the player got out|
|wicket_fielders|Fielder who caught the player out|


### Analyzing data using pandas module

### Read the data using pandas module.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('./data/ipl_dataset.csv')
df.shape
df

Unnamed: 0,match_code,date,city,venue,team1,team2,toss_winner,toss_decision,winner,win_type,...,batsman,non_striker,bowler,runs,extras,total,extras_type,player_out,wicket_kind,wicket_fielders
0,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,ST Jayasuriya,SR Tendulkar,I Sharma,0,1,1,wides,,,
1,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,ST Jayasuriya,SR Tendulkar,I Sharma,1,0,1,,,,
2,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,0,1,1,wides,,,
3,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,0,0,0,,,,
4,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,2,0,2,,,,
5,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,0,0,0,,,,
6,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,0,0,0,,,,
7,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,SR Tendulkar,ST Jayasuriya,I Sharma,4,0,4,,,,
8,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,ST Jayasuriya,SR Tendulkar,AB Dinda,4,0,4,,,,
9,392203,2009-05-01,East London,Buffalo Park,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,bat,Mumbai Indians,runs,...,ST Jayasuriya,SR Tendulkar,AB Dinda,0,0,0,,,,


In [2]:
len(df['match_code'].unique())

# You can also use:
# df['match_code'].nunique()

577

### There are certain fixed cities all around the world where matches are held. Find the list of unique cities where matches were played 

In [3]:
# List the unique cities (not venues) and count them
cities = df['city'].unique()
for city in cities:
    print(city)
len(cities)

East London
Port Elizabeth
Centurion
neutral_venue
Chennai
Jaipur
Kolkata
Delhi
Chandigarh
Hyderabad
Ranchi
Mumbai
Bangalore
Dharamsala
Pune
Rajkot
Durban
Cuttack
Cape Town
Ahmedabad
Johannesburg
Visakhapatnam
Abu Dhabi
Raipur
Kochi
Kimberley
Nagpur
Bloemfontein
Indore
Kanpur


30
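The count of 30 above includes the entry `neutral_venue`, which looks like a placeholder rather than a city name. A minimal sketch of excluding it, assuming that reading is right; the small `city` Series below is made up for illustration and is not the real data:

```python
import pandas as pd

# Made-up sample of the 'city' column; 'neutral_venue' is assumed to be a placeholder.
city = pd.Series(
    ["Chennai", "neutral_venue", "Jaipur", "Chennai", "neutral_venue"],
    name="city",
)

# Drop the placeholder before collecting the unique city names.
real_cities = city[city != "neutral_venue"].unique()

print(list(real_cities))
print(len(real_cities))
```

On the full dataset the same filter would be applied to `df['city']`.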

### Find the columns which contain null values, if any

In [4]:
df.isna().any()

match_code         False
date               False
city               False
venue              False
team1              False
team2              False
toss_winner        False
toss_decision      False
winner              True
win_type            True
win_margin          True
inning             False
delivery           False
batting_team       False
batsman            False
non_striker        False
bowler             False
runs               False
extras             False
total              False
extras_type         True
player_out          True
wicket_kind         True
wicket_fielders     True
dtype: bool

In [5]:
df.isna().sum()

match_code              0
date                    0
city                    0
venue                   0
team1                   0
team2                   0
toss_winner             0
toss_decision           0
winner               1818
win_type             1818
win_margin           1818
inning                  0
delivery                0
batting_team            0
batsman                 0
non_striker             0
bowler                  0
runs                    0
extras                  0
total                   0
extras_type        129064
player_out         129807
wicket_kind        129807
wicket_fielders    131657
dtype: int64
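Raw null counts are easier to judge as a share of all rows. For example, `extras_type` is null on 129064 of 136522 deliveries (roughly 95%), presumably because most deliveries involve no extra at all. The sketch below shows one way to get the percentage per column; the toy frame and its values are made up, and only the column names come from the dataset:

```python
import pandas as pd

# Toy ball-by-ball frame with made-up values (column names follow the dataset).
toy = pd.DataFrame({
    "runs": [0, 1, 4, 0],
    "extras_type": ["wides", None, None, None],
    "player_out": [None, None, None, "SR Tendulkar"],
})

# isna() yields booleans, so the column mean is the fraction of missing values.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```

Applied to `df`, this would give the percentage of missing values for each of the columns listed above.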

### Although matches are held in different cities around the world, a city may or may not have multiple venues (stadiums where matches are held). List the top 5 most played venues


In [6]:
venue = df.groupby(['venue'])['match_code'].nunique().sort_values(ascending = False).head(5)
venue

venue
M Chinnaswamy Stadium              58
Eden Gardens                       54
Feroz Shah Kotla                   53
Wankhede Stadium                   49
MA Chidambaram Stadium, Chepauk    48
Name: match_code, dtype: int64

### Make a runs vs run-count frequency table

In [7]:
df['runs'].value_counts()

0    55870
1    50087
4    15409
2     8835
6     5806
3      473
5       42
Name: runs, dtype: int64
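The same frequency table can be expressed as relative shares, which makes it easier to see how often each outcome occurs. A small sketch using `value_counts(normalize=True)`; the `runs` Series here is made up for illustration:

```python
import pandas as pd

# Made-up runs-per-delivery values, not taken from the dataset.
runs = pd.Series([0, 0, 1, 1, 4, 6, 0, 2], name="runs")

# normalize=True returns fractions instead of counts; sort_index orders by run value.
share = runs.value_counts(normalize=True).sort_index()
print(share)
```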

### IPL seasons are held every year. Let's look at our data and find out how many seasons were recorded.

In [8]:
df['years'] = pd.DatetimeIndex(df['date']).year
a = df['years'].value_counts()
print('The number of seasons that were held is', a.count())

The number of seasons that were held is 9


### What is the total number of matches played per season?

In [9]:
df.groupby(['years'])['match_code'].nunique()

years
2008    58
2009    57
2010    60
2011    73
2012    74
2013    76
2014    60
2015    59
2016    60
Name: match_code, dtype: int64

### What are the total runs scored across each season 

In [10]:
df.groupby(['years','inning'])['runs'].sum().sort_values(ascending = False)
# Runs scored off the bat (extras excluded) in each season and inning, largest first

years  inning
2013   1         11286
2012   1         11054
2011   1         10474
2012   2         10268
2013   2         10141
2011   2          9454
2016   1          9342
2015   1          9291
2010   1          9279
2014   1          9239
2008   1          8788
2014   2          8683
2016   2          8621
2010   2          8455
2015   2          8118
2009   1          8088
2008   2          8021
2009   2          7256
Name: runs, dtype: int64
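The result above is split by inning and uses the `runs` column, which excludes extras. If a single figure per season is wanted, one option is to group by the year alone and sum both `runs` and `total`. A minimal sketch on a made-up toy frame (the values are illustrative; the column names follow the dataset):

```python
import pandas as pd

# Toy deliveries with made-up values.
toy = pd.DataFrame({
    "date": ["2008-04-18", "2008-04-18", "2009-05-01", "2009-05-01"],
    "inning": [1, 2, 1, 2],
    "runs": [4, 1, 0, 6],
    "extras": [0, 1, 1, 0],
    "total": [4, 2, 1, 6],
})

# Derive the season from the match date, as in the analysis above.
toy["years"] = pd.to_datetime(toy["date"]).dt.year

# One row per season: runs off the bat and total runs including extras.
season_runs = toy.groupby("years")[["runs", "total"]].sum()
print(season_runs)
```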

In [11]:
df.groupby(['years','inning','batting_team'])['total'].sum()
# Total runs (including extras) scored by each team, per season and inning

years  inning  batting_team               
2008   1       Chennai Super Kings            1548
               Deccan Chargers                1257
               Delhi Daredevils               1004
               Kings XI Punjab                1222
               Kolkata Knight Riders          1335
               Mumbai Indians                  979
               Rajasthan Royals               1028
               Royal Challengers Bangalore     962
       2       Chennai Super Kings             972
               Deccan Chargers                 972
               Delhi Daredevils               1114
               Kings XI Punjab                1242
               Kolkata Knight Riders           607
               Mumbai Indians                 1100
               Rajasthan Royals               1573
               Royal Challengers Bangalore    1021
2009   1       Chennai Super Kings            1613
               Deccan Chargers                1263
               Delhi Daredevils        

In [12]:
# pd.set_option('expand_frame_repr', True)
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', 100)

### Some teams are high performing and others low performing. Let's look at the performance of individual teams. Filter the data and aggregate the runs scored by each team. Display the top 10 results with more than 200 runs scored.

In [14]:
df.groupby(['match_code','batting_team','inning'])['total'].sum().sort_values(ascending = False).head(10)

match_code  batting_team                 inning
598027      Royal Challengers Bangalore  1         263
980987      Royal Challengers Bangalore  1         248
419137      Chennai Super Kings          1         246
335983      Chennai Super Kings          1         240
829795      Royal Challengers Bangalore  1         235
501260      Kings XI Punjab              1         232
733987      Kings XI Punjab              1         231
501223      Delhi Daredevils             1         231
980907      Royal Challengers Bangalore  1         227
829785      Royal Challengers Bangalore  1         226
Name: total, dtype: int64
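The cell above sorts the innings totals and keeps the top 10 without applying the 200-run threshold explicitly; the ten totals shown happen to exceed 200 anyway. A sketch of making the filter explicit, using a made-up toy frame in which each row stands for a chunk of runs rather than a real delivery:

```python
import pandas as pd

# Made-up rows; 'total' values are illustrative chunks, not real deliveries.
toy = pd.DataFrame({
    "match_code": [1, 1, 1, 2, 2, 2],
    "batting_team": ["A", "A", "B", "A", "C", "C"],
    "inning": [1, 1, 2, 1, 2, 2],
    "total": [150, 60, 120, 90, 100, 105],
})

# Aggregate runs per match, team and inning.
innings_totals = toy.groupby(["match_code", "batting_team", "inning"])["total"].sum()

# Keep only totals above 200, then take the 10 largest.
over_200 = innings_totals[innings_totals > 200].sort_values(ascending=False).head(10)
print(over_200)
```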

### Chasing a 200+ target is difficult in the T20 format. What are the chances that a first-innings score above 200 is chased down by the opposition in the 2nd inning?

In [15]:
f = pd.DataFrame(df[df['inning'] == 1].groupby('match_code')['total'].sum())
s = pd.DataFrame(df[df['inning'] == 2].groupby('match_code')['total'].sum())
f.head(5)

Unnamed: 0_level_0,total
match_code,Unnamed: 1_level_1
335982,222
335983,240
335984,129
335985,165
335986,110


In [16]:
df_joined = f.join(s, lsuffix='_first', rsuffix='_second')
df_joined.head(4)

Unnamed: 0_level_0,total_first,total_second
match_code,Unnamed: 1_level_1,Unnamed: 2_level_1
335982,222,82.0
335983,240,207.0
335984,129,132.0
335985,165,166.0


In [17]:
df_joined = df_joined[(df_joined['total_first'] > 200) & (df_joined['total_second'] > df_joined['total_first'])]
df_joined

Unnamed: 0_level_0,total_first,total_second
match_code,Unnamed: 1_level_1,Unnamed: 2_level_1
335990,214,217.0
419112,203,204.0
548318,205,208.0
729283,205,206.0
734007,205,211.0


In [18]:
teams_chased = df_joined['total_first'].count()
print(teams_chased)
total_matches = f[f['total'] > 200]['total'].count()
print(total_matches)
probability = teams_chased/total_matches
print(probability)

5
39
0.1282051282051282
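The estimate above is the ratio of successful chases to first-innings totals above 200, here 5 / 39 ≈ 0.128. The steps can be wrapped in one helper, sketched below under stated assumptions: `chase_rate` is a hypothetical name, the input Series are made up, and a chase counts as successful only when the second-innings total strictly exceeds the first. Matches with no second-innings total compare as `False` and so count as not chased.

```python
import pandas as pd

def chase_rate(first, second, target=200):
    """Return (chased, big_totals, rate) for first-innings totals above `target`."""
    # Align both innings on match_code; missing second innings become NaN.
    joined = pd.concat({"first": first, "second": second}, axis=1)
    big = joined[joined["first"] > target]
    # NaN > x evaluates to False, so incomplete matches are not counted as chased.
    chased = big["second"] > big["first"]
    return int(chased.sum()), len(big), chased.mean()

# Made-up totals indexed by a made-up match_code.
first = pd.Series({1: 222, 2: 214, 3: 129, 4: 205})
second = pd.Series({1: 82, 2: 217, 3: 132})  # match 4 has no second innings

n_chased, n_big, rate = chase_rate(first, second)
print(n_chased, n_big, rate)
```

With the real data, `first` and `second` would correspond to `f['total']` and `s['total']` from the cells above.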


### Every season has one team that outperforms the others and is in great form. Which team has the highest win count in each season?

In [27]:
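# Caution: nunique() on 'winner' counts the distinct winning teams across the matches each team batted in.
# It is not that team's number of wins, so this table cannot identify the best-performing team of a season.
print(pd.DataFrame(df.groupby(['years','batting_team'])['winner'].nunique()).sort_values(by = 'winner', ascending = False))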

                                   winner
years batting_team                       
2013  Delhi Daredevils                  9
      Pune Warriors                     9
      Kolkata Knight Riders             9
2011  Delhi Daredevils                  9
2012  Pune Warriors                     9
2010  Kings XI Punjab                   8
2015  Kings XI Punjab                   8
2011  Kochi Tuskers Kerala              8
2014  Delhi Daredevils                  8
2011  Pune Warriors                     8
2012  Deccan Chargers                   8
2016  Kings XI Punjab                   8
2008  Deccan Chargers                   8
2011  Deccan Chargers                   8
2014  Royal Challengers Bangalore       7
2013  Mumbai Indians                    7
      Rajasthan Royals                  7
2012  Kings XI Punjab                   7
      Mumbai Indians                    7
2011  Kings XI Punjab                   7
2012  Rajasthan Royals                  7
      Royal Challengers Bangalore 
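To count wins per team and season, one possible approach is to reduce the ball-by-ball data to one row per match and then count matches per `(years, winner)`. The sketch below is an illustration on made-up data, not the original notebook's method; the team names `A` and `B` are placeholders, and ties in win count are resolved arbitrarily.

```python
import pandas as pd

# Made-up ball-by-ball rows: several deliveries can share one match_code.
toy = pd.DataFrame({
    "match_code": [1, 1, 2, 3, 3, 4, 5],
    "years": [2008, 2008, 2008, 2008, 2008, 2009, 2009],
    "winner": ["A", "A", "A", "B", "B", "B", None],  # None: no result recorded
})

# One row per match, so each win is counted once.
matches = toy.drop_duplicates("match_code")

# Wins per season and team; groupby drops rows whose winner is missing.
wins = matches.groupby(["years", "winner"]).size().reset_index(name="wins")

# Team with the most wins in each season.
top = (
    wins.sort_values("wins", ascending=False)
        .drop_duplicates("years")
        .sort_values("years")
)
print(top)
```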