### Project: IPL Cricket Data Analysis
### Author: Pranali Baban Dhobale


### Outline of the Project
```
1. Data Collection
2. Data Cleaning
3. Exploratory Data Analysis (EDA)
4. Summary of Key Findings
```

### 1. Data Collection

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="darkgrid")  # Set Seaborn style for plots (set_theme is the current name for sns.set)

# Load the Dataset
data = pd.read_csv('IPL_Complete_Dataset_2008_2020.csv')

# Display the first few rows of the dataset
data.head()

### 2. Data Cleaning

In [None]:
# Basic Dataset Information
print("Dataset Shape:", data.shape)
data.info()

# Check for missing values
missing_values = data.isnull().sum()
print("\nMissing Values:\n", missing_values)

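# Drop columns where 20% or more of the values are missing
threshold = 0.2
# .copy() makes data_cleaned an independent DataFrame, so the assignments below
# do not trigger SettingWithCopyWarning or silently modify a temporary object
data_cleaned = data.loc[:, data.isnull().mean() < threshold].copy()

# Fill missing values for numeric columns (example: median)
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which is not guaranteed to update the DataFrame in recent pandas versions
data_cleaned['win_by_runs'] = data_cleaned['win_by_runs'].fillna(data_cleaned['win_by_runs'].median())
data_cleaned['win_by_wickets'] = data_cleaned['win_by_wickets'].fillna(data_cleaned['win_by_wickets'].median())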

# Remove duplicates
data_cleaned = data_cleaned.drop_duplicates()

# Display cleaned data
data_cleaned.head()
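
The column filter and median fill used in this step can be checked on a small toy table. The sketch below is only an illustration: the values and the `umpire3` column are made up and are not taken from the IPL dataset. It assumes the same 20% threshold as the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the IPL dataset)
toy = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2010, 2010],
    "win_by_runs": [10.0, np.nan, 0.0, 35.0, 4.0, 0.0],     # 1 of 6 values missing
    "umpire3": [np.nan, np.nan, np.nan, "A", np.nan, np.nan],  # 5 of 6 values missing
})

threshold = 0.2

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Keep only the columns whose missing share is below the threshold
kept = toy.loc[:, missing_share < threshold].copy()

# Fill the remaining gaps in the numeric column with its median
kept["win_by_runs"] = kept["win_by_runs"].fillna(kept["win_by_runs"].median())

print(missing_share.round(2))
print(kept)
```

Here `umpire3` is dropped because most of its values are missing, while the single gap in `win_by_runs` is replaced by the median of the observed values.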

### 3. Exploratory Data Analysis (EDA)

In [None]:
# 1. Distribution of Matches by Season
season_counts = data_cleaned['season'].value_counts().sort_index()

plt.figure(figsize=(10, 6))
sns.barplot(x=season_counts.index, y=season_counts.values, palette='viridis')
plt.title('Number of Matches per Season')
plt.xlabel('Season')
plt.ylabel('Number of Matches')
plt.xticks(rotation=45)
plt.show()

In [None]:
# 2. Most Successful Teams
team_wins = data_cleaned['winner'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(y=team_wins.index, x=team_wins.values, palette='plasma')
plt.title('Most Successful Teams')
plt.xlabel('Number of Wins')
plt.ylabel('Team')
plt.show()

In [None]:
# 3. Toss Decision Analysis
toss_decision_counts = data_cleaned['toss_decision'].value_counts()

plt.figure(figsize=(6, 6))
plt.pie(toss_decision_counts, labels=toss_decision_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Toss Decision Distribution')
plt.show()
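
The percentages shown on the pie chart can also be obtained as numbers with `value_counts(normalize=True)`. This is a minimal sketch on a made-up series of toss decisions, not on the actual dataset.

```python
import pandas as pd

# Toy toss decisions (hypothetical values, not from the IPL dataset)
toss = pd.Series(["field", "bat", "field", "field", "bat"], name="toss_decision")

# normalize=True returns proportions instead of raw counts; scale to percent
shares = toss.value_counts(normalize=True).mul(100).round(1)

print(shares)
print("Most common decision:", shares.idxmax())
```

Applying the same call to `data_cleaned['toss_decision']` would give the exact share for each decision without reading it off the chart.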

In [None]:
# 4. Player of the Match Analysis
top_players = data_cleaned['player_of_match'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(y=top_players.index, x=top_players.values, palette='magma')
plt.title('Top 10 Players with Most Player of the Match Awards')
plt.xlabel('Number of Awards')
plt.ylabel('Player')
plt.show()

### 4. Summary of Key Findings

In [None]:
print("Key Insights from the IPL Data Analysis:")
print(f"1. Matches per season range from {season_counts.min()} to {season_counts.max()}; the busiest season is {season_counts.idxmax()}.")
print("2. The teams with the most wins are:", ", ".join(team_wins.index[:3].astype(str)))
print("3. Toss decisions indicate most teams prefer to", toss_decision_counts.idxmax(), "upon winning the toss.")
print("4. The player with the most 'Player of the Match' awards is:", top_players.index[0])