# **Project Name**    - Champions Trophy 2025 EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Nikesh Singh**


# **Project Summary -**

This project explores a ball-by-ball dataset of 7,352 delivery records from the 2025 Champions Trophy, covering its teams and venues. The analysis is structured around the following key aspects:

**Tournament Phases & Matches Played**

- The tournament was played in four distinct phases.
- The number of matches played in each phase was counted to show how the fixtures were distributed across stages.

**Batting Performance Analysis**

- The total runs scored by each batsman were calculated to identify the highest run-scorers.
- The average runs per match were analyzed to gauge batting consistency and rank players on performance.
- The batsmen who faced the most deliveries were identified, indicating the most reliable and enduring players at the crease.

**Bowling Performance Analysis**

- The bowlers who delivered the most overs were examined, showing which bowlers were used most in the tournament.
- The total wickets taken by each bowler were analyzed to rank the most effective wicket-takers.

**Match Venue Analysis**

- The venues were ranked by the number of matches hosted, highlighting the grounds that staged the most games.

Through these analyses, key trends were observed in terms of team performance, batting stability, and bowling efficiency.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**
The objective of this project is to analyze historical cricket match data to uncover key insights into team performance, player contributions, and match trends. By examining various aspects such as match phases, batting and bowling statistics, and venue preferences, this study aims to identify the top-performing players, most impactful bowlers, and critical match venues. The findings will provide valuable insights for teams, analysts, and cricket enthusiasts to make data-driven decisions and enhance game strategies.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/ct_2025_deliveries.csv')


### Dataset First View

In [None]:
# Dataset First Look

df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

## Brief info & conclusion



1. match_id (int64)
  - Unique identifier for each match.
  - No missing values.
  - Suggested data type: int32 (to save memory).
2. season (int64)
  - Represents the year in which the match was played.
  - No missing values.
  - The file covers the 2025 tournament only, so a single unique value is expected.
  - Suggested data type: category (since seasons are limited).
3. phase (object)
  - Represents the stage of the tournament (e.g., "Group Stage", "Playoffs", "Final").
  - No missing values.
  - Limited unique values.
  - Suggested data type: category (efficient for categorical variables).
4. match_no (int64)
  - The match number within the tournament.
  - No missing values.
  - Suggested data type: int16 (as match numbers are small values).
5. date (object)
  - Represents the date on which the match was played.
  - No missing values.
  - Suggested data type: datetime64 (to enable date-based operations).
6. venue (object)
  - The stadium where the match was played.
  - No missing values.
  - Contains repeated values (same venues used multiple times).
  - Suggested data type: category (for memory efficiency).
7. batting_team (object)
  - Name of the batting team.
  - No missing values.
  - Limited unique values (likely 8–10 teams).
  - Suggested data type: category.
8. bowling_team (object)
  - Name of the bowling team.
  - No missing values.
  - Limited unique values (same as batting_team).
  - Suggested data type: category.
9. innings (float64)
  - Represents the innings number (1 or 2 in a one-day match).
  - No missing values.
  - Suggested data type: int8 (values are always whole numbers, so float64 is unnecessary).
10. over (float64)
  - Represents the current over in decimal format (e.g., 1.2 means 1 over and 2 balls).
  - No missing values.
  - Suggested data type: float32 (half the memory of float64; float16 is too coarse here, resolving values near 50 only to about 0.03).
11. striker (object)
  - Name of the batsman on strike.
  - No missing values.
  - Repeated player names.
  - Suggested data type: category.
12. bowler (object)
  - Name of the bowler.
  - No missing values.
  - Repeated player names.
  - Suggested data type: category.
13. runs_of_bat (int64)
  - Runs scored by the batsman on a particular ball.
  - No missing values.
  - Suggested data type: int8 (since values are small).
14. extras (int64)
  - Extra runs (wide, no-ball, byes, leg byes).
  - No missing values.
  - Suggested data type: int8.
15. wide (int64)
  - Number of wides bowled in that delivery.
  - Mostly 0, with 1 for wide deliveries.
  - Suggested data type: int8.
16. legbyes (int64)
  - Number of leg-byes scored on that ball.
  - Mostly 0, with small values (1-4).
  - Suggested data type: int8.
17. byes (int64)
  - Number of byes scored.
  - Mostly 0, with small values.
  - Suggested data type: int8.
18. noballs (int64)
  - Number of no-balls in that delivery.
  - Mostly 0, with 1 for no-ball deliveries.
  - Suggested data type: int8.
19. wicket_type (object)
  - Describes how the batsman was dismissed (e.g., "Bowled", "Caught", "Run Out").
  - Many missing values (190 non-null out of 7352).
  - Suggested data type: category (limited unique values).
20. player_dismissed (object)
  - Name of the player who got out.
  - Many missing values (190 non-null out of 7352).
  - Suggested data type: category (repeated player names).
21. fielder (object)
  - Name of the fielder involved in a dismissal (if applicable).
  - Many missing values (141 non-null out of 7352).
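
The data type suggestions above could be applied along the following lines. This is a minimal sketch rather than part of the original analysis: the `sample` frame, its values, and the helper name `optimise_dtypes` are made up for illustration, and only a few of the listed columns are shown. The real `date` column may need an explicit `format=` argument depending on how dates are stored in the CSV.

```python
import pandas as pd

# Tiny illustrative frame reusing a few column names from the list above.
# The values are invented for demonstration only.
sample = pd.DataFrame({
    "match_id": [1, 1, 2],
    "date": ["2025-02-19", "2025-02-19", "2025-02-20"],
    "venue": ["Venue A", "Venue A", "Venue B"],
    "innings": [1.0, 1.0, 2.0],
    "runs_of_bat": [0, 4, 1],
})

def optimise_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with smaller dtypes applied to the columns that are present."""
    dtype_map = {
        "match_id": "int32",
        "venue": "category",
        "innings": "int8",
        "runs_of_bat": "int8",
    }
    out = frame.copy()
    for col, dtype in dtype_map.items():
        if col in out.columns:
            out[col] = out[col].astype(dtype)
    # Dates are parsed separately so that date-based operations become available
    if "date" in out.columns:
        out["date"] = pd.to_datetime(out["date"])
    return out

optimised = optimise_dtypes(sample)
print(optimised.dtypes)
# Compare memory use before and after the conversion
print(sample.memory_usage(deep=True).sum(), "->", optimised.memory_usage(deep=True).sum())
```

Working on a copy keeps the originally loaded frame untouched, which makes it easy to compare memory use before and after the conversion.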


## EDA

#### Teams that participated in the tournament

In [None]:

# Every team bats at least once, so the unique batting teams give the full list
team_participated = ', '.join(df['batting_team'].unique())
print(f'Teams that participated in the 2025 Champions Trophy: {team_participated}')

### Venues where matches were played

In [None]:
# Print each distinct venue once
for venue in df['venue'].unique():
  print(venue)

### The different phases of the tournament

In [None]:
# Derive the phase count from the data instead of hard-coding it
phases = df['phase'].unique()
print(f'The tournament was played in {len(phases)} phases: {", ".join(phases)}')

### How many matches were played in each phase?

In [None]:


matches_per_phase = df.groupby('phase')['match_id'].nunique()
matches_per_phase.reset_index(name='matches')


### Total runs scored by each player?

In [None]:
batsman_total_runs = df.groupby('striker')['runs_of_bat'].sum().reset_index()
batsman_total_runs.columns=(['Batsman','Total runs'])

batsman_total_runs.sort_values(by='Total runs',ascending=False,inplace=True)
batsman_total_runs




In [None]:
# visual
sns.barplot(x=batsman_total_runs['Total runs'].head(10),y=batsman_total_runs['Batsman'].head(10),palette='winter')
plt.show()

### Average runs scored by each batsman

In [None]:

batsman_stats = df.groupby("striker").agg(
    total_runs=("runs_of_bat", "sum"),
    matches_played=("match_id", "nunique")
)


batsman_stats["average_run"] = batsman_stats["total_runs"] / batsman_stats["matches_played"]


batsman_stats = batsman_stats.reset_index().rename(columns={"striker": "batsman"})




batsman_stats["average_run"] = batsman_stats["average_run"].round(2)


top_batsmen = batsman_stats


top_batsmen = top_batsmen.sort_values(by="average_run", ascending=False)


plt.figure(figsize=(10, 6))
plt.barh(top_batsmen["batsman"].head(10), top_batsmen["average_run"].head(10), color="skyblue")

# Set labels and title
plt.ylabel("Batsman")  # Keep batsman on the Y-axis for horizontal bar chart
plt.xlabel("Average Runs per Match")
plt.title("Top 10 Batsmen by Average Runs per Match")


plt.show()


### Venues ranked by number of matches played

In [None]:
top_5_venue = df.groupby('venue')['match_id'].nunique().sort_values(ascending=False).reset_index()
top_5_venue.columns=(['venue','match_played'])

sns.barplot(y=top_5_venue['venue'],x=top_5_venue['match_played'],color='skyblue')

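### The batsmen who faced the most deliveries

In [None]:

# Count balls faced per batsman; wides are excluded because they do not count as balls faced
batsman_deliveries = df[df['wide'] == 0].groupby('striker').size().reset_index(name='deliveries_faced')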


batsman_deliveries = batsman_deliveries.sort_values(by='deliveries_faced', ascending=False)

batsman_deliveries

In [None]:
# visual presentation
plt.figure(figsize=(8, 5))
sns.barplot(x=batsman_deliveries['striker'].head(10),
            y=batsman_deliveries['deliveries_faced'].head(10),
            palette="Blues_r")


plt.xlabel("Batsman")
plt.ylabel("Total Deliveries Faced")
plt.title("Top 10 Batsmen Who Faced the Most Deliveries")


plt.xticks(rotation=45, ha="right")


plt.show()

### Which batsman has scored the most runs?

In [None]:
# Total runs off the bat per batsman, highest first
batsman_runs = df.groupby('striker')['runs_of_bat'].sum().sort_values(ascending=False).reset_index()
batsman_runs



In [None]:
# visual presentation

sns.barplot(x=batsman_runs['runs_of_bat'].head(10),y=batsman_runs['striker'].head(10))

### Which bowler has bowled the most deliveries?



In [None]:
most_deliver_bowler = df.groupby('bowler')['over'].count().sort_values(ascending=False).reset_index()
most_deliver_bowler

In [None]:
# visual presentation
sns.barplot(x=most_deliver_bowler['over'].head(10),y=most_deliver_bowler['bowler'].head(10))

### Total wickets taken by each bowler

In [None]:


bowler_wicket_record =  df.groupby('bowler')['wicket_type'].count().sort_values(ascending=False).reset_index()
bowler_wicket_record.columns = (['Bowler','Wicket'])

In [None]:
sns.barplot(x=bowler_wicket_record['Wicket'].head(10),y=bowler_wicket_record['Bowler'].head(10))
plt.title('Bowler Performance: Total Wickets Taken')
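
Counting every non-null `wicket_type` also credits the bowler with dismissals such as run outs, which are conventionally not counted as the bowler's wickets. A hedged sketch of how those could be excluded is shown below. The `sample` frame is invented, and the labels in `NOT_BOWLER_WICKETS` are assumptions that should be checked against `df['wicket_type'].unique()` before use.

```python
import pandas as pd

# Made-up deliveries for illustration; None means no wicket fell on that ball.
sample = pd.DataFrame({
    "bowler": ["A", "A", "B", "B", "B"],
    "wicket_type": ["caught", "run out", None, "bowled", "lbw"],
})

# Dismissals that are conventionally not credited to the bowler.
# These label spellings are assumptions, not taken from the dataset.
NOT_BOWLER_WICKETS = {"run out", "retired hurt", "retired out", "obstructing the field"}

def bowler_wickets(frame: pd.DataFrame) -> pd.Series:
    """Count wickets per bowler, skipping dismissals not credited to the bowler."""
    kind = frame["wicket_type"].str.lower().str.strip()
    credited = frame[kind.notna() & ~kind.isin(NOT_BOWLER_WICKETS)]
    return credited.groupby("bowler").size().sort_values(ascending=False)

wickets = bowler_wickets(sample)
print(wickets)
```

Normalising the labels with `str.lower()` and `str.strip()` guards against inconsistent capitalisation or stray spaces in the raw data.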

### Which bowler conceded the most runs?

In [None]:
# Runs off the bat conceded by each bowler, highest first
bowler_give_runs =  df.groupby('bowler')['runs_of_bat'].sum().sort_values(ascending=False).reset_index()
bowler_give_runs.columns=(['Bowler','Runs given by bowler'])
bowler_give_runs

In [None]:
# visual present
sns.barplot(x=bowler_give_runs['Runs given by bowler'].head(10),y=bowler_give_runs['Bowler'].head(10),color='lightgreen')
plt.title("Bowlers Who Conceded the Most Runs")
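
Total runs conceded favours bowlers who simply bowled fewer overs, so an economy rate (runs conceded per over) can be a fairer comparison. The sketch below is one possible way to compute it and is not part of the original analysis. It assumes that the `wide` and `noballs` columns hold the runs charged for those deliveries, follows the usual convention that byes and leg-byes are not charged to the bowler, and counts only deliveries without a wide or no-ball as legal balls. The `sample` data is invented.

```python
import pandas as pd

# Invented deliveries: bowler A bowls seven balls including one wide,
# bowler B bowls six legal balls including one leg-bye.
sample = pd.DataFrame({
    "bowler":      ["A"] * 7 + ["B"] * 6,
    "runs_of_bat": [0, 4, 1, 0, 0, 2, 0,   1, 1, 0, 4, 0, 0],
    "wide":        [0, 0, 0, 1, 0, 0, 0] + [0] * 6,
    "noballs":     [0] * 13,
    "byes":        [0] * 13,
    "legbyes":     [0] * 9 + [1] + [0] * 3,
})

def bowler_economy(frame: pd.DataFrame) -> pd.DataFrame:
    """Runs conceded, legal balls and economy rate per bowler."""
    # Runs charged to the bowler: off the bat plus wides and no-balls
    charged = frame["runs_of_bat"] + frame["wide"] + frame["noballs"]
    # A delivery is legal when it is neither a wide nor a no-ball
    legal = (frame["wide"] == 0) & (frame["noballs"] == 0)
    summary = pd.DataFrame({
        "runs_conceded": charged.groupby(frame["bowler"]).sum(),
        "legal_balls": legal.groupby(frame["bowler"]).sum(),
    })
    summary["economy"] = summary["runs_conceded"] / (summary["legal_balls"] / 6)
    return summary.sort_values("economy")

economy = bowler_economy(sample)
print(economy)
```

On the real data it may be sensible to keep only bowlers above a minimum number of legal balls, since a rate based on one or two overs is noisy.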

### How consistently do batsmen score runs across matches?

In [None]:
batsman_average_score = ((df.groupby('striker')['runs_of_bat'].sum())/ (df.groupby('striker')['match_id'].nunique())).reset_index()
batsman_average_score.columns=(['Batsman','Avg_score'])
batsman_average_score = batsman_average_score.sort_values(by='Avg_score',ascending=False)
batsman_average_score

In [None]:
# visual presentation
sns.barplot(y=batsman_average_score['Batsman'].head(10),x=batsman_average_score['Avg_score'].head(10))
plt.title('Batsmen Performance: Average Runs Per Match')

### Which team conceded the most extras?

In [None]:
dd = df.groupby('bowling_team')['extras'].sum().sort_values(ascending=False).reset_index('bowling_team')
sns.barplot(y='extras',x='bowling_team',data=dd)

plt.title('Which bowling team conceded the most extra runs')

### Which bowler conceded the most extra runs

In [None]:
bowler_give_extra = df.groupby('bowler')['extras'].sum().reset_index().sort_values(by='extras',ascending=False)
bowler_give_extra


In [None]:
# visual presentation
sns.barplot(x=bowler_give_extra['extras'].head(10),y=bowler_give_extra['bowler'].head(10),palette='summer')
plt.title("Top Extra-Conceding Bowlers in the Tournament")

### Who are the most effective bowlers in the first 10 overs?

In [None]:
top_10_wicket_taker_in_1st_10_over = df[df['over'] <= 10].groupby('bowler')['wicket_type'].count().reset_index().sort_values(by='wicket_type',ascending=False)

top_10_wicket_taker_in_1st_10_over

In [None]:
# visual presentation
sns.barplot(y=top_10_wicket_taker_in_1st_10_over['bowler'].head(10),x=top_10_wicket_taker_in_1st_10_over['wicket_type'].head(10))
plt.title("Top Wicket-Takers in the First 10 Overs")

### Top 10 wicket-takers in the last 10 overs

In [None]:
top_10_wicket_taker_in_last_10_over = df[df['over'] >=40].groupby('bowler')['wicket_type'].count().reset_index().sort_values(by='wicket_type',ascending=False)
top_10_wicket_taker_in_last_10_over

In [None]:
# visual presentation
# Both axes must come from the last-10-overs table so bowlers and counts line up
sns.barplot(x=top_10_wicket_taker_in_last_10_over['wicket_type'].head(10),y=top_10_wicket_taker_in_last_10_over['bowler'].head(10),palette='summer')
plt.title("Top Wicket-Takers in last 10 Overs")

In [None]:
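# Runs per batting side in each match: runs off the bat plus extras
innings_totals = df.groupby(['batting_team','bowling_team','match_no'])[['runs_of_bat','wide','legbyes', 'byes', 'noballs','extras']].sum().reset_index()
# .copy() avoids a SettingWithCopyWarning when the new column is added below
innings_totals = innings_totals[innings_totals['runs_of_bat'] != 0].copy()
innings_totals['total_runs'] = innings_totals['runs_of_bat'] + innings_totals['extras']
innings_totals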

In [None]:
# Group by 'batting_team', 'bowling_team', and 'match_no'
grouped_df = df.groupby(['batting_team', 'bowling_team', 'match_no']).sum(numeric_only=True).reset_index()

# Calculate 'total_runs' as runs off the bat plus extras
# ('extras' already covers wides, leg-byes, byes and no-balls, so those columns are not added again)
grouped_df['total_runs'] = grouped_df['runs_of_bat'] + grouped_df['extras']

# Filter out rows where 'runs_of_bat' is zero
result_df = grouped_df[grouped_df['runs_of_bat'] != 0]

result_df

### Which bowler has the highest number of dot balls bowled?

In [None]:
# Build the per-bowler summary from index-aligned Series so every value stays
# attached to the right bowler (concatenating reset-index columns misaligns
# bowlers who never conceded a four or six, or never took a wicket).
by_bowler = df.groupby('bowler')

bowler_dot_ball_and_over = pd.DataFrame({
    # Despite the column name, this is a count of dot balls (no runs off the bat)
    'total_dot_over': df[df['runs_of_bat'] == 0].groupby('bowler').size(),
    # Deliveries bowled divided by 6, rounded to whole overs
    'total_over': (by_bowler.size() / 6).round(),
    'total_runs': by_bowler['runs_of_bat'].sum(),
    'total_wicket': by_bowler['wicket_type'].count(),
    'total_4': df[df['runs_of_bat'] == 4].groupby('bowler').size(),
    'total_6': df[df['runs_of_bat'] == 6].groupby('bowler').size(),
})

# Bowlers missing from a filtered count get 0 instead of NaN
bowler_dot_ball_and_over = bowler_dot_ball_and_over.fillna(0).rename_axis('bowler').reset_index()


bowler_dot_ball_and_over.sort_values(by='total_wicket',ascending=False).head(5)




### Which bowlers have bowled the most dot balls compared to total overs?

In [None]:
# sns.scatterplot(y=bowler_dot_ball_and_over['total_dot_over'],x=bowler_dot_ball_and_over['total_over'])

plt.figure(figsize=(25,10))
sns.scatterplot(y=bowler_dot_ball_and_over['total_dot_over'],
                x=bowler_dot_ball_and_over['total_over'])

plt.axhline(y=100, color='red', linestyle='--', linewidth=2, label="Horizontal Line (y=100)")

# Plot a vertical line at x = 30
plt.axvline(x=30, color='red', linestyle='-.', linewidth=2, label="Vertical Line (x=30)")



# Annotate each point with the bowler's name
for i in range(len(bowler_dot_ball_and_over)):
    plt.text(bowler_dot_ball_and_over['total_over'].iloc[i],
             bowler_dot_ball_and_over['total_dot_over'].iloc[i],
             bowler_dot_ball_and_over['bowler'].iloc[i],
             fontsize=12, ha='right', va='bottom')

# Show plot
plt.xlabel("Total Overs")
plt.ylabel("Total Dot Balls")
plt.title("Bowler Performance Scatter Plot")
plt.show()

### Which bowlers have been attacked the most with fours

In [None]:
# Ten bowlers who were hit for the most fours
bowlers_hit_for_most_fours = bowler_dot_ball_and_over.sort_values(by='total_4',ascending=False).head(10)

sns.barplot(x=bowlers_hit_for_most_fours['total_4'],y=bowlers_hit_for_most_fours['bowler'])

### Which bowlers have been attacked the most with sixes

In [None]:
# Ten bowlers who were hit for the most sixes
bowlers_hit_for_most_sixes = bowler_dot_ball_and_over.sort_values(by='total_6',ascending=False).head(10)

sns.barplot(x=bowlers_hit_for_most_sixes['total_6'],y=bowlers_hit_for_most_sixes['bowler'])

### Bowlers who bowled the most dot balls

In [None]:
total_dot_over_by_bowler = bowler_dot_ball_and_over.sort_values(by='total_dot_over',ascending=False).head(10)

sns.barplot(x=total_dot_over_by_bowler['total_dot_over'],y=total_dot_over_by_bowler['bowler'])

### Most wickets taken within a chosen range of overs

In [None]:


def count_wickets_in_specific_over(df, start_over=1, end_over=50):
    """
    Counts the number of wickets taken by each bowler within a specified range of overs.

    Parameters:
    df (DataFrame): Input DataFrame containing at least 'over', 'wicket_type', 'bowler', and 'match_id' columns.
    start_over (int): The starting over number (inclusive). Default is 1.
    end_over (int): The ending over number (inclusive). Default is 50.

    Returns:
    DataFrame: A DataFrame with columns 'Bowler' and 'Wickets', sorted by 'Wickets' in descending order.
    """
    # Validate input parameters
    if not isinstance(df, pd.DataFrame):
        raise TypeError("The input data must be a pandas DataFrame.")
    if not all(column in df.columns for column in ['over', 'wicket_type', 'bowler', 'match_id']):
        raise ValueError("The DataFrame must contain 'over', 'wicket_type', 'bowler', and 'match_id' columns.")
    if not (isinstance(start_over, int) and isinstance(end_over, int)):
        raise TypeError("Start and end overs must be integers.")
    if start_over < 0 or end_over < 0:
        raise ValueError("Over numbers must be non-negative.")
    if start_over > end_over:
        raise ValueError("Start over must be less than or equal to end over.")

    # Filter deliveries within the specified over range where a wicket was taken
    filtered_df = df[(df['over'] >= start_over) & (df['over'] <= end_over) & (df['wicket_type'].notnull())]

    # Group by 'bowler' and count the number of wickets
    wicket_counts = filtered_df.groupby('bowler')['match_id'].count().reset_index()

    # Rename columns for clarity
    wicket_counts.columns = ['Bowler', 'Wickets']

    # Sort the DataFrame by 'Wickets' in descending order
    sorted_wicket_counts = wicket_counts.sort_values(by='Wickets', ascending=False).reset_index(drop=True)

    return sorted_wicket_counts

# count_wickets_in_specific_over(df, start_over=10, end_over=30)

In [None]:
# Reuse the validated function defined in the previous cell for overs 20 to 40
count_wickets_in_specific_over(df, start_over=20, end_over=40)



### Which batsman has the highest number of ducks (zero runs) in the tournament?

In [None]:
import pandas as pd

# Dictionary containing team-wise players
teams = {
    "India": [
        "Rohit Sharma", "Shubman Gill", "Virat Kohli", "Shreyas Iyer", "KL Rahul",
        "Rishabh Pant", "Hardik Pandya", "Axar Patel", "Washington Sundar", "Kuldeep Yadav",
        "Harshit Rana", "Mohammed Shami", "Arshdeep Singh", "Ravindra Jadeja",
        "Varun Chakaravarthy"
    ],
    "Pakistan": [
        "Babar Azam", "Mohammad Rizwan", "Shaheen Afridi", "Fakhar Zaman", "Shadab Khan",
        "Imam-ul-Haq", "Haris Rauf", "Hasan Ali", "Mohammad Nawaz", "Usman Qadir",
        "Haider Ali", "Asif Ali", "Iftikhar Ahmed", "Naseem Shah", "Mohammad Wasim Jr."
    ],
    "Bangladesh": [
        "Shakib Al Hasan", "Tamim Iqbal", "Liton Das", "Mushfiqur Rahim", "Mahmudullah",
        "Afif Hossain", "Mehidy Hasan Miraz", "Taskin Ahmed", "Mustafizur Rahman",
        "Shoriful Islam", "Nasum Ahmed", "Towhid Hridoy", "Yasir Ali", "Ebadot Hossain",
        "Najmul Hossain Shanto"
    ],
    "New Zealand": [
        "Kane Williamson", "Devon Conway", "Tom Latham", "Will Young", "Daryl Mitchell",
        "Glenn Phillips", "James Neesham", "Mitchell Santner", "Ish Sodhi", "Tim Southee",
        "Trent Boult", "Lockie Ferguson", "Matt Henry", "Mark Chapman", "Finn Allen"
    ],
    "Australia": [
        "Steve Smith", "David Warner", "Marnus Labuschagne", "Glenn Maxwell",
        "Alex Carey", "Marcus Stoinis", "Mitchell Marsh", "Pat Cummins", "Josh Hazlewood",
        "Adam Zampa", "Sean Abbott", "Spencer Johnson", "Ben Dwarshuis",
        "Jake Fraser-McGurk", "Tanveer Sangha"
    ],
    "England": [
        "Jos Buttler", "Jonny Bairstow", "Joe Root", "Ben Stokes", "Moeen Ali",
        "Sam Curran", "Chris Woakes", "Jofra Archer", "Adil Rashid", "Mark Wood",
        "Dawid Malan", "Liam Livingstone", "Harry Brook", "Reece Topley", "Jason Roy"
    ],
    "South Africa": [
        "Temba Bavuma", "Quinton de Kock", "Rassie van der Dussen", "Aiden Markram",
        "David Miller", "Heinrich Klaasen", "Marco Jansen", "Kagiso Rabada",
        "Anrich Nortje", "Lungi Ngidi", "Keshav Maharaj", "Tabraiz Shamsi",
        "Reeza Hendricks", "Andile Phehlukwayo", "Gerald Coetzee"
    ],
    "Afghanistan": [
        "Hashmatullah Shahidi", "Rahmanullah Gurbaz", "Ibrahim Zadran", "Najibullah Zadran",
        "Mohammad Nabi", "Rashid Khan", "Mujeeb Ur Rahman", "Fazalhaq Farooqi",
        "Naveen-ul-Haq", "Azmatullah Omarzai", "Gulbadin Naib", "Sharafuddin Ashraf",
        "Usman Ghani", "Qais Ahmad", "Karim Janat"
    ]
}

# Convert dictionary to DataFrame with lowercase names
player_df = pd.DataFrame({k: pd.Series([name.lower() for name in v]) for k, v in teams.items()})

# Display the DataFrame
player_df
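
The squad table above does not itself count ducks. One possible way to answer the question from the ball-by-ball columns is sketched below, treating a duck as a dismissal in a match where the player scored no runs off the bat. The `sample` frame and the helper name `count_ducks` are invented for illustration, and the sketch assumes `player_dismissed` uses the same name spelling as `striker`.

```python
import pandas as pd

# Invented deliveries: p1 is out for 0 in both matches, p3 is out for 0 once,
# and p2 scores runs and is never dismissed.
sample = pd.DataFrame({
    "match_id":         [1, 1, 1, 1, 2, 2],
    "striker":          ["p1", "p1", "p2", "p3", "p1", "p2"],
    "runs_of_bat":      [0, 0, 4, 0, 0, 1],
    "player_dismissed": [None, "p1", None, "p3", "p1", None],
})

def count_ducks(frame: pd.DataFrame) -> pd.Series:
    """Number of dismissals for zero runs per player."""
    # Runs off the bat per batsman per match
    runs = (frame.groupby(["match_id", "striker"])["runs_of_bat"]
                 .sum().rename("runs").reset_index())
    # One row per dismissal
    outs = (frame.dropna(subset=["player_dismissed"])
                 [["match_id", "player_dismissed"]].drop_duplicates())
    merged = outs.merge(runs, how="left",
                        left_on=["match_id", "player_dismissed"],
                        right_on=["match_id", "striker"])
    # A player dismissed without facing a ball has no striker row, so runs is NaN
    merged["runs"] = merged["runs"].fillna(0)
    ducks = merged[merged["runs"] == 0]
    return ducks.groupby("player_dismissed").size().sort_values(ascending=False)

ducks = count_ducks(sample)
print(ducks)
```

The left merge keeps players who were run out without facing a delivery, who would otherwise be dropped because they never appear as `striker`.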


In [None]:
average_run_per_match = (df.groupby('striker')['runs_of_bat'].sum())/(df.groupby('striker')['match_id'].nunique())
match_played = df.groupby('striker')['match_id'].nunique()
total_run = df.groupby('striker')['runs_of_bat'].sum()
# Balls faced exclude wides, which do not count towards a batsman's strike rate
total_ball = df[df['wide'] == 0].groupby('striker')['over'].count()
strike_rate = (total_run/total_ball)*100


batsman_stats = pd.concat([average_run_per_match,strike_rate,match_played,total_run,total_ball],axis=1)
batsman_stats.columns = ['average_run_per_match','strike_rate','match_played','total_runs','total_ball']


batsman_statss = batsman_stats.reset_index().sort_values(by=['total_runs','average_run_per_match','match_played'],ascending=False)

batsman_statss

In [None]:

plt.figure(figsize=(18,10))
# Scatter plot
sns.scatterplot(y=batsman_statss['strike_rate'], x=batsman_statss['total_runs'])

# Plot horizontal and vertical lines
plt.axhline(y=100, color='red', linestyle='--', linewidth=2, label="Horizontal Line (y=100)")
plt.axvline(x=100, color='red', linestyle='-.', linewidth=2, label="Vertical Line (x=100)")

for i in range(len(batsman_statss)):
        plt.text(batsman_statss['total_runs'].iloc[i],  # X-coordinate
                 batsman_statss['strike_rate'].iloc[i], # Y-coordinate
                 batsman_statss['striker'].iloc[i],     # Text label
                 fontsize=8, ha='right', va='bottom', color='red')
# Labels and title
plt.xlabel("Total Runs")
plt.ylabel("Strike Rate")
plt.title("Batsman Performance Scatter Plot")
plt.legend()

# Show plot
plt.show()


Write the conclusion here.