# **IPL Data Analysis and 2025 winner prediction model**

This notebook analyzes Indian Premier League (IPL) data from the league's first season in 2008 through the 2024 season to uncover trends and patterns in the data. It covers data collection and preprocessing, exploratory data analysis (EDA) that visualizes metrics such as win rates, player performance, and team statistics, and statistical analysis of the factors that influence match outcomes. It then introduces a Random Forest classification model to predict the winner of the 2025 IPL season and explains the model's features, training, validation, and performance evaluation. The results section presents the model's predictions for 2025, discusses the model's strengths and limitations, and comments on the predicted performance of teams and key players. The primary objective is to use historical IPL data to build a model that forecasts match outcomes and yields a data-driven prediction of the 2025 IPL winner, while illustrating how machine learning techniques apply to real-world sports data.


In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [4]:
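from pathlib import Path

# '/deliveries.csv' and '/matches.csv' point at the filesystem root and raise
# FileNotFoundError. Assumption: the CSVs sit somewhere under /kaggle/input
# (the cell above prints their full paths), so look them up by file name.
input_dir = Path('/kaggle/input')
deliveries_df = pd.read_csv(next(input_dir.rglob('deliveries.csv')))
matches_df = pd.read_csv(next(input_dir.rglob('matches.csv')))

print("Deliveries shape:", deliveries_df.shape)
print("Matches shape:", matches_df.shape)

# Work on copies so the raw frames stay untouched
deliveries = deliveries_df.copy()
matches = matches_df.copy()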

# **IPL 2008 - 2024 Data Analysis**

In [None]:
matches.head()

In [None]:
matches.info()

In [None]:
matches.describe()
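
Before cleaning, it can help to quantify how many values are missing in each column. The sketch below is not part of the original analysis: `missing_summary` is a hypothetical helper, and the small `toy` frame only stands in for `matches` so the cell runs on its own. Its values are illustrative, not real IPL data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for `matches` (hypothetical values)
toy = pd.DataFrame({
    "winner": ["Team A", None, "Team B", "Team A"],
    "player_of_match": ["P1", None, None, "P4"],
    "result_margin": [5.0, np.nan, 20.0, 7.0],
    "season": ["2008", "2008", "2009", "2009"],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(matches)` in the notebook would show which columns the cleaning steps need to address.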

### **Data Cleaning**

In [None]:
# Drop rows with missing values in the 'winner' column
matches = matches.dropna(subset=['winner'])

In [None]:
# Impute missing values in 'player_of_match'
matches['player_of_match'] = matches['player_of_match'].fillna('Unknown')

In [None]:
# Drop unwanted columns from the dataset
matches.drop(['id', 'city', 'method'], axis=1, inplace=True)
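
The three cleaning steps above modify `matches` in place, so re-running the `drop` cell raises a `KeyError` once the columns are gone. As one possible alternative, the steps could be wrapped in a single function that returns a new frame and can be re-run safely. `clean_matches` is a hypothetical name, and the `toy` frame is made-up data used only so the sketch is self-contained.

```python
import pandas as pd

def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps and return a new frame (the input is untouched)."""
    # 1. Drop matches without a recorded winner
    cleaned = df.dropna(subset=["winner"]).copy()
    # 2. Fill missing player-of-match entries with a placeholder
    cleaned["player_of_match"] = cleaned["player_of_match"].fillna("Unknown")
    # 3. Drop unused columns; errors="ignore" makes the step safe to repeat
    return cleaned.drop(columns=["id", "city", "method"], errors="ignore")

# Toy stand-in for `matches` (hypothetical values)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["X", "Y", None],
    "method": [None, "D/L", None],
    "winner": ["Team A", None, "Team B"],
    "player_of_match": ["P1", None, None],
})
cleaned = clean_matches(toy)
print(cleaned)
```

Because the function never mutates its argument, the raw frame stays available for comparison after cleaning.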

In [None]:
import matplotlib.pyplot as plt

columns_to_handle_missing = ['result_margin', 'target_runs', 'target_overs']

# Plot box plots for each column
plt.figure(figsize=(5, 6))
matches[columns_to_handle_missing].boxplot()
plt.title('Box plot of columns with missing values')
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.show()
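
The box plot shows outliers visually. To put a number on them, one option is to count the values that fall outside the 1.5 × IQR fences, which is the convention box plots use by default for their whiskers. The helper `iqr_outlier_count` and the `toy_margin` series below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Toy stand-in for a column such as 'result_margin' (hypothetical values)
toy_margin = pd.Series([2, 4, 5, 6, 7, 8, 10, 140, None])
n_outliers = iqr_outlier_count(toy_margin)
print("Outliers:", n_outliers)
```

In the notebook, `matches[columns_to_handle_missing].apply(iqr_outlier_count)` would give one count per column.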

In [None]:
# Impute the selected columns with the median, which is robust to the outliers seen in the box plot

matches['result_margin'] = matches['result_margin'].fillna(matches['result_margin'].median())
matches['target_runs'] = matches['target_runs'].fillna(matches['target_runs'].median())
matches['target_overs'] = matches['target_overs'].fillna(matches['target_overs'].median())
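
The reason for preferring the median can be seen on a small example: a single extreme value pulls the mean far from the typical value, while the median barely moves. The `toy_margin` series below is made-up data for illustration, not taken from the IPL dataset.

```python
import pandas as pd

# Four typical margins, one extreme margin, and one missing value (hypothetical)
toy_margin = pd.Series([4.0, 6.0, 8.0, 10.0, 146.0, None])

mean_value = toy_margin.mean()      # pulled upward by the extreme value
median_value = toy_margin.median()  # stays near the typical margins

mean_fill = toy_margin.fillna(mean_value)
median_fill = toy_margin.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
print("value imputed with mean:", mean_fill.iloc[-1])
print("value imputed with median:", median_fill.iloc[-1])
```

Filling with the mean would assign the missing match a margin larger than four of the five observed values, which is why the median is the safer default here.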

In [None]:
matches.info()

In [None]:
matches.nunique()

### **Feature Engineering**