<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Exploratory Data Analysis of IPL Dataset</h1>
    <p style="font-size:16px;"></p>


The Metadata of the IPL Tournament from 2007 to 2023

1. **id**: Unique identifier for each match.
2. **season**: The year in which the IPL season took place.
3. **city**: The city where the match was played.
4. **date**: The date on which the match was played.
5. **match_type**: Type of match, usually "League", "Qualifier", "Eliminator", "Final", etc.
6. **player_of_match**: The player who was awarded the 'Player of the Match'.
7. **venue**: The stadium where the match was played.
8. **team1**: The name of the first team.
9. **team2**: The name of the second team.
10. **toss_winner**: The team that won the toss.
11. **toss_decision**: The decision made by the toss-winning team, either "bat" or "field".
12. **winner**: The team that won the match.
13. **result**: How the match was decided: "runs", "wickets", "tie", or "no result".
14. **result_margin**: The margin of the result, such as runs or wickets.
15. **target_runs**: The target runs set for the team batting second (only applicable in the second innings).
16. **target_overs**: The target overs set for the team batting second (only applicable in the second innings, typically in rain-affected matches).
17. **super_over**: Indicates whether the match went to a Super Over. Values are "N" (No) or "Y" (Yes).
18. **umpire1**: The name of the first umpire.
19. **umpire2**: The name of the second umpire.

<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Import Libraries</h1>
    <p style="font-size:16px;"></p>

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Load the Dataset</h1>
    <p style="font-size:16px;"></p>





In [36]:
data = pd.read_csv("../Data/ipl_dataset.csv")
data.head()

Unnamed: 0,id,season,city,date,match_type,player_of_match,venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,target_runs,target_overs,super_over,method,umpire1,umpire2
0,335982,2007/08,Bangalore,2008-04-18,League,BB McCullum,M Chinnaswamy Stadium,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,223.0,20.0,N,,Asad Rauf,RE Koertzen
1,335983,2007/08,Chandigarh,2008-04-19,League,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,241.0,20.0,N,,MR Benson,SL Shastri
2,335984,2007/08,Delhi,2008-04-19,League,MF Maharoof,Feroz Shah Kotla,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,130.0,20.0,N,,Aleem Dar,GA Pratapkumar
3,335985,2007/08,Mumbai,2008-04-20,League,MV Boucher,Wankhede Stadium,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,166.0,20.0,N,,SJ Davis,DJ Harper
4,335986,2007/08,Kolkata,2008-04-20,League,DJ Hussey,Eden Gardens,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,111.0,20.0,N,,BF Bowden,K Hariharan
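
The preview above shows `season` stored as a string such as `2007/08` rather than a single year. As a minimal sketch, assuming every `date` value parses cleanly, a plain numeric year could be derived from `date` instead. The `season_year` column name and the two-row sample frame are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny illustrative sample mirroring two preview rows (not the full dataset)
sample = pd.DataFrame({
    "season": ["2007/08", "2007/08"],
    "date": ["2008-04-18", "2008-04-19"],
})

# Parse the date strings, then take the calendar year as a numeric label
sample["date"] = pd.to_datetime(sample["date"])
sample["season_year"] = sample["date"].dt.year  # hypothetical helper column

print(sample[["season", "season_year"]])
```

A numeric year sorts correctly on plot axes, whereas mixed labels like `2007/08` and `2009` sort as text.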


In [None]:
data.info()

## **Observation**

Most of the columns are categorical (`object` dtype), only three columns are floating-point numeric (`float` dtype), and some columns have missing values. I'll handle these in the data preprocessing step.
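
The dtype and missing-value picture described above can be put into one table. This is a minimal sketch on a made-up four-row frame; the sample values and the `summary` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample with one text column, one float column, one sparse column
sample = pd.DataFrame({
    "city": ["Bangalore", None, "Delhi", "Mumbai"],
    "result_margin": [140.0, 33.0, np.nan, 5.0],
    "method": [None, None, None, "D/L"],
})

# One row per column: its dtype, missing count, and missing percentage
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
})

print(summary)
```

Running the same three expressions on `data` would show at a glance which columns are sparse enough to drop and which are worth imputing.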

<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Preprocess the Data</h1>
    <p style="font-size:16px;"></p>

In [None]:
# Missing values
print(data.isnull().sum())

In [None]:
# Duplicates
print(data.duplicated().sum())

In [37]:
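# Handling missing values: fill categorical columns with the mode
# and float columns with the mean
for col in data.select_dtypes(include='object').columns:
    # Assign back instead of using inplace=True on a column selection,
    # which may not modify `data` in newer pandas versions
    data[col] = data[col].fillna(data[col].mode()[0])

for col in data.select_dtypes(include='float').columns:
    data[col] = data[col].fillna(data[col].mean())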

# Remove duplicates
data.drop_duplicates(inplace=True)

In [None]:
# Drop the 'method' column because it had too many missing values in the raw data
data = data.drop("method" , axis=1)

In [None]:
data.isnull().sum()
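
Because the fill step runs before the `method` column is dropped, the missing counts at that point no longer show which columns were sparse. Below is a sketch of one alternative ordering: drop sparse columns first, then impute the rest. The 0.5 threshold, the `raw` and `clean` names, and the sample values are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the raw, un-imputed data
raw = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", None, "Delhi"],
    "result_margin": [10.0, np.nan, 30.0, 20.0],
    "method": [None, None, None, "D/L"],
})

# Step 1: drop columns that are missing in more than half of the rows
threshold = 0.5  # assumed cut-off, tune as needed
missing_share = raw.isnull().mean()
sparse_cols = missing_share[missing_share > threshold].index.tolist()
clean = raw.drop(columns=sparse_cols)

# Step 2: impute what is left (mean for numeric, mode for everything else)
for col in clean.columns:
    if pd.api.types.is_numeric_dtype(clean[col]):
        clean[col] = clean[col].fillna(clean[col].mean())
    else:
        clean[col] = clean[col].fillna(clean[col].mode()[0])

print(sparse_cols)
print(clean)
```

Dropping first means the decision is based on how much data was really missing, and no effort is spent imputing a column that is discarded anyway.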

<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Complete EDA with Visualization </h1>
    <p style="font-size:16px;"></p>

In [None]:
# Get the value counts of the 'player_of_match' column
player_of_match_counts = data['player_of_match'].value_counts()

# Filter players who have won the award 10 or more times
players_with_more_than_10_awards = player_of_match_counts[player_of_match_counts >=10]

# Display the filtered players
print(players_with_more_than_10_awards)


In [None]:
# Create a bar plot for the players with 10 or more awards
# plt.figure(figsize=(15, 8))
sns.barplot(x=players_with_more_than_10_awards.index, y=players_with_more_than_10_awards.values)
plt.title('Players with 10 or More Player of the Match Awards')
plt.xlabel('Player')
plt.ylabel('Number of Awards')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

## **Observation**

> I've shortlisted the players who earned the Player of the Match (POM) award 10 or more times. AB de Villiers earned it 25 times, the most in the history of the IPL. Chris Gayle is second with 22 awards, and so on. The bar plot above visualizes these counts.

In [None]:
data.groupby([ 'winner' , 'match_type']).size()

## **Observation** 

> Here I group by winner and match type to see how many matches each team won for each match type. Chennai Super Kings and Mumbai Indians have each won 5 trophies (5 Finals). Deccan Chargers, Gujarat Titans, Rajasthan Royals, and Sunrisers Hyderabad have each won the Final once, and Kolkata Knight Riders have won the Final twice.
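
The title counts above can also be read off directly by keeping only the finals. This is a minimal sketch that assumes finals are labelled exactly `"Final"` in `match_type`; the team names and rows are made up.

```python
import pandas as pd

# Made-up sample: three finals and two other matches
sample = pd.DataFrame({
    "match_type": ["League", "Final", "Final", "Qualifier 1", "Final"],
    "winner": ["Team A", "Team A", "Team B", "Team B", "Team A"],
})

# Keep only finals, then count wins per team to get titles
finals = sample[sample["match_type"] == "Final"]
titles = finals["winner"].value_counts()

print(titles)
```

Filtering before counting gives a short, sorted Series that is easier to scan than the full winner-by-match-type grouping.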

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap
sns.heatmap(pd.crosstab(data['season'], data['winner']), annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Season vs Winner')
plt.show()


## **Observation** 

> The heatmap above of season vs. winner shows how many matches each team won in each season. Several teams joined the league later, so they have no wins in the early seasons.

> Delhi Daredevils were renamed Delhi Capitals ahead of the 2019 season. Deccan Chargers were removed from the IPL due to financial problems.

> Gujarat Lions played two seasons (2016 and 2017). Gujarat Titans, who joined in 2022, are a separate franchise.

> Pune Warriors played only in 2011, 2012, and 2013 and then left the IPL over financial problems. Rising Pune Supergiants were a separate franchise that played only two seasons (2016 and 2017).

> Sunrisers Hyderabad, Chennai Super Kings, Mumbai Indians, Kolkata Knight Riders, and Royal Challengers Bangalore played in most or all seasons.
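
Because renamed franchises appear under more than one name, per-team counts can end up split across labels. Below is a minimal sketch of merging names before counting, using only the two renames this notebook mentions (Delhi Daredevils to Delhi Capitals, Kings XI Punjab to Punjab Kings). The `rename_map` dictionary, the `winner_current` column, and the sample rows are illustrative assumptions.

```python
import pandas as pd

# Made-up sample with old and new names of the same franchises
sample = pd.DataFrame({
    "winner": ["Delhi Daredevils", "Delhi Capitals", "Kings XI Punjab",
               "Punjab Kings", "Mumbai Indians"],
})

# Hypothetical mapping from old franchise name to current name
rename_map = {
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
}

# Names not in the mapping are left unchanged
sample["winner_current"] = sample["winner"].replace(rename_map)

print(sample["winner_current"].value_counts())
```

The same mapping would need to be applied to `team1`, `team2`, and `toss_winner` for the columns to stay consistent. Separate franchises from the same city are deliberately not merged here.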


In [None]:
data.groupby([ 'toss_winner' , 'toss_decision']).size()

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create the count plot
plt.figure(figsize=(15, 8))
sns.countplot(x='toss_winner', hue='toss_decision', data=data)
plt.title('Toss Decisions by Teams')
plt.xlabel('Toss Winner')
plt.ylabel('Count')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.legend(title='Toss Decision')
plt.show()

## **Observation**

> The visualization and the groupby show that teams most often choose to field first after winning the toss. Chasing is easier for the team batting second, partly because dew late in the evening helps the batting side. Teams also sometimes choose to bat first because of the batting-friendly pitches in India.

> Chennai Super Kings, however, do not lean strongly toward either toss decision. The toss does not seem to matter much for them, which may be part of why they have been champions 5 times.
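
The toss observations above could be checked numerically rather than by eye. This is a sketch on made-up rows that computes each team's share of "field" vs. "bat" decisions and how often the toss winner also won the match; `decision_share` and `toss_win_rate` are illustrative names.

```python
import pandas as pd

# Made-up sample of six matches
sample = pd.DataFrame({
    "toss_winner": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "toss_decision": ["field", "field", "bat", "bat", "field", "bat"],
    "winner": ["Team A", "Team B", "Team A", "Team B", "Team A", "Team A"],
})

# Row-normalized crosstab: each row sums to 1 across the decisions
decision_share = pd.crosstab(
    sample["toss_winner"], sample["toss_decision"], normalize="index"
)

# Fraction of matches in which the toss winner also won the match
toss_win_rate = (sample["toss_winner"] == sample["winner"]).mean()

print(decision_share)
print(toss_win_rate)
```

A team whose row sits near 0.5 for both decisions shows no strong toss preference, and a `toss_win_rate` close to 0.5 would suggest the toss gives little advantage overall.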

In [None]:
fig = px.box(data, x="season", y="target_runs", points="all")
fig.show()

## **Observation**

> The box plot above shows the targets set by teams in each season. The lowest target in the data is 43 (in 2014) and the highest is 264 (in 2013).

> For comparison, in the 2024 season Sunrisers Hyderabad set a target of 288 runs, which is huge in the history of the IPL.
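
The extremes read from the box plot can also be pulled out with a groupby. In this minimal sketch the two extreme values (264 in 2013, 43 in 2014) come from the observation above, while the other two rows are made up for illustration.

```python
import pandas as pd

# Two extreme targets from the observation plus two made-up rows
sample = pd.DataFrame({
    "season": ["2013", "2013", "2014", "2014"],
    "target_runs": [264.0, 150.0, 43.0, 180.0],
})

# Per-season summary of the targets
per_season = sample.groupby("season")["target_runs"].agg(["min", "max", "median"])

# Rows holding the overall lowest and highest targets
lowest = sample.loc[sample["target_runs"].idxmin()]
highest = sample.loc[sample["target_runs"].idxmax()]

print(per_season)
print(lowest["season"], lowest["target_runs"])
print(highest["season"], highest["target_runs"])
```

Note that if missing `target_runs` values were filled with the mean during preprocessing, those filled rows also feed into the per-season statistics.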

In [None]:
data['city'].value_counts()

In [None]:
# Create a bar plot for the 'city' column
plt.figure(figsize=(15, 8))  # Adjust the size as needed
sns.countplot(x='city', data=data)
plt.title('Frequency Distribution of Matches by City')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

## **Observation**

> The majority of IPL matches have been played in major Indian cities, with Mumbai hosting the most (166 matches), followed by Kolkata (86), Delhi (85), and Chennai (76). Other cities like Hyderabad, Bangalore, and Chandigarh also hosted a significant number of matches. Some matches have been played in international locations such as Abu Dhabi, Durban, and Dubai. Overall, Mumbai stands out as the primary venue for IPL matches.

In [None]:
data['toss_decision'].value_counts()

In [None]:
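# Build the pie chart from the computed counts rather than hard-coded values
toss_counts = data['toss_decision'].value_counts()
# Pull out the 'bat' slice; pull needs one value per label
pull = [0.3 if label == 'bat' else 0 for label in toss_counts.index]
fig1 = go.Figure(data=[go.Pie(labels=toss_counts.index, values=toss_counts.values, pull=pull)])
fig1.show()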

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x='toss_decision', data=data)
plt.title('Frequency Distribution of Toss Decisions')
plt.xlabel('Toss Decision')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


## **Observation**
> Teams winning the toss in IPL matches have preferred to field first more often (652 times) than to bat first (372 times).

In [None]:
data['winner'].value_counts()

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x='winner', data=data)
plt.title('Frequency Distribution of Match Winners')
plt.xlabel('Winner')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


## **Observation**

> Mumbai Indians have won the most IPL matches (140), followed by Chennai Super Kings (131) and Kolkata Knight Riders (120). Teams like Royal Challengers Bangalore and Rajasthan Royals also have a significant number of wins. Some teams have undergone name changes or rebranding, such as Delhi Daredevils becoming Delhi Capitals, and Kings XI Punjab becoming Punjab Kings. Newer teams like Gujarat Titans and Lucknow Super Giants have relatively fewer wins. Defunct teams like Deccan Chargers, Pune Warriors, and Kochi Tuskers Kerala have lower win counts due to their shorter participation in the IPL.

In [None]:
data['result'].value_counts()

In [None]:
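# Build the pie chart from the computed counts rather than hard-coded values
result_counts = data['result'].value_counts()

# pull is given as a fraction of the pie radius; pull out only the 'tie' slice
pull = [0.6 if label == 'tie' else 0 for label in result_counts.index]
fig = go.Figure(data=[go.Pie(labels=result_counts.index, values=result_counts.values, pull=pull)])
fig.show()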

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x='result', data=data)
plt.title('Frequency Distribution of Match Results')
plt.xlabel('Result')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


## **Observation**

> Most IPL matches have been decided by wickets (542 times) or runs (463 times), indicating a clear winner in the majority of games. Tied matches are rare, occurring only 14 times, and there have been very few matches with no result (5 times).
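
To put the counts above in proportion, they can be converted to percentage shares. This small sketch uses the four counts quoted in the observation; the `result_counts` Series is rebuilt by hand here so the block runs on its own.

```python
import pandas as pd

# Counts quoted in the observation above
result_counts = pd.Series({"wickets": 542, "runs": 463, "tie": 14, "no result": 5})

# Share of each result type as a percentage of all matches
total = result_counts.sum()
shares = (result_counts / total * 100).round(1)

print(total)
print(shares)
```

On the real data, `data['result'].value_counts(normalize=True)` would give the same shares as fractions without the manual division.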

In [None]:
data['super_over'].value_counts()

In [None]:
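# Build the pie chart from the computed counts rather than hard-coded values
super_over_counts = data['super_over'].value_counts()
# Pull out the 'Y' slice; pull needs one value per label
pull = [0.3 if label == 'Y' else 0 for label in super_over_counts.index]
fig1 = go.Figure(data=[go.Pie(labels=super_over_counts.index, values=super_over_counts.values, pull=pull)])
fig1.show()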

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x='super_over', data=data)
plt.title('Frequency Distribution of Super Overs')
plt.xlabel('Super Over')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


## **Observation**

> Super Overs are very rare in IPL matches, occurring only 14 times, while the vast majority of matches (1010) did not require a Super Over to determine the winner.

In [None]:
data['umpire1'].value_counts()

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x='umpire1', data=data)
plt.title('Frequency Distribution of First Umpires')
plt.xlabel('Umpire1')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


## **Observation**

> The umpire who has officiated the most IPL matches as the first umpire is AK Chaudhary, with 105 matches. Other frequently officiating umpires include HDPK Dharmasena (76 matches), KN Ananthapadmanabhan (53 matches), CB Gaffaney (53 matches), and Asad Rauf (51 matches). This indicates a mix of both Indian and international umpires playing significant roles in the league. The distribution suggests that a core group of umpires consistently officiates IPL matches, with many others contributing less frequently.
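
The counts above cover only the `umpire1` column, so an umpire who often stood as second umpire is undercounted. Below is a sketch of counting appearances across both columns, on made-up names; `total_by_umpire` is an illustrative variable name.

```python
import pandas as pd

# Made-up sample of three matches
sample = pd.DataFrame({
    "umpire1": ["Umpire A", "Umpire B", "Umpire A"],
    "umpire2": ["Umpire B", "Umpire C", "Umpire B"],
})

# Stack both umpire columns into one Series, then count appearances
total_by_umpire = pd.concat([sample["umpire1"], sample["umpire2"]]).value_counts()

print(total_by_umpire)
```

In this sample "Umpire B" leads overall despite appearing only once as first umpire, which is the kind of difference a single-column count would hide.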

<div style="background-color:#f0f0f0; padding:20px; border-radius:10px; margin-bottom:20px;">
    <h1 style="color:#2E86C1; text-align:center; font-size:36px; font-weight:bold;">Thank you very much</h1>
    <p style="font-size:16px;"></p>