This project builds a complete data analysis pipeline on IPL (Indian Premier League) datasets using Python (Pandas).
It simulates a real-world workflow:
Ingest → Clean → Transform → Analyze → Report → Export
We use two datasets:
- deliveries.csv → Ball-by-ball data
- matches.csv → Match-level metadata
📎 Source: Kaggle IPL Dataset (2008–2020)
- Python
- Pandas
- NumPy
- Matplotlib (optional)
- Jupyter Notebook (VS Code)
IPL_Pipeline/
│
├── data/
│ ├── deliveries.csv
│ └── matches.csv
│
├── output/
│ ├── runs_per_match.csv
│ ├── top_batters.csv
│ ├── strike_rate.csv
│ ├── economy.csv
│ ├── team_scores.csv
│ ├── death_overs.csv
│ └── ipl_analysis.xlsx
│
├── analysis.ipynb
└── README.md
- Loaded datasets using Pandas
- Inspected shape, columns, and data types
- Handled missing values
- Fixed data types
- Validated match IDs between datasets
- Created derived columns
- Standardized column names
- Merged datasets into a unified DataFrame
- Total Runs per Match
- Runs per Team per Match
- Top 10 Batters
- Strike Rate of Batters
- Top Bowlers by Economy
- Most Consistent Batters
- Highest Individual Score
- Boundary Analysis
- Boundary Percentage
- Dot Ball Analysis
- Runs per Over Analysis
- Powerplay Performance
- Death Overs Performance
- Run Distribution (1st vs 2nd Innings)
- Toss Impact Analysis
- Player of Match Contribution
- Venue-wise Analysis
- City-wise Analysis
- Season-wise Trends
- Winning Team Analysis
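
As an illustration of two of the metrics listed above, strike rate and economy can be computed with grouped aggregations. This is a sketch under assumptions: the column names (`batsman`, `bowler`, `batsman_runs`, `total_runs`, `extras_type`) and the extras labels (`wides`, `noballs`, `byes`, `legbyes`) are assumed to match the dataset, and the data is a toy sample.

```python
import pandas as pd

# Toy ball-by-ball sample; real data would come from deliveries.csv.
balls = pd.DataFrame({
    "batsman": ["A", "A", "A", "B", "B", "B"],
    "bowler": ["X", "X", "X", "X", "X", "X"],
    "batsman_runs": [4, 0, 6, 1, 0, 2],
    "total_runs": [4, 1, 6, 1, 0, 2],
    "extras_type": [None, "wides", None, None, None, None],
})
extras = balls["extras_type"].fillna("")

# Strike rate = 100 * runs scored / balls faced (wides are not balls faced).
faced = balls[~extras.eq("wides")]
batting = faced.groupby("batsman").agg(
    runs=("batsman_runs", "sum"),
    balls=("batsman_runs", "count"),
)
batting["strike_rate"] = 100 * batting["runs"] / batting["balls"]
top_batters = batting.nlargest(10, "runs")

# Economy = runs conceded per over of six legal balls.
# Byes and leg-byes are not charged to the bowler; wides and no-balls
# are charged but do not count as legal balls.
charged = ~extras.isin(["byes", "legbyes"])
legal = ~extras.isin(["wides", "noballs"])
bowling = pd.DataFrame({
    "runs_conceded": balls[charged].groupby("bowler")["total_runs"].sum(),
    "legal_balls": balls[legal].groupby("bowler")["total_runs"].count(),
})
bowling["economy"] = bowling["runs_conceded"] / (bowling["legal_balls"] / 6)

print(top_batters)
print(bowling.sort_values("economy"))
```

On real data it is usually sensible to apply a minimum-balls filter before ranking, so that a batter who faced two deliveries does not top the strike-rate table.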
- Identified key performers
- Compared match phases (powerplay, death overs)
- Analyzed venue and season trends
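
One way to compare match phases is to bucket each delivery by over and aggregate per bucket. The sketch below assumes a 1-based `over` column (overs 1 to 20); if the dataset numbers overs from 0, add 1 first. The phase boundaries (1–6 powerplay, 7–15 middle, 16–20 death) are the conventional ones, and the run rate here counts every delivery as a ball, which is a simplification.

```python
import pandas as pd

# Toy sample; `over` is assumed to be 1-based.
balls = pd.DataFrame({
    "over": [1, 3, 6, 7, 15, 16, 20, 20],
    "total_runs": [1, 4, 0, 1, 2, 6, 4, 2],
})

# Right-inclusive bins: (0, 6], (6, 15], (15, 20].
balls["phase"] = pd.cut(
    balls["over"],
    bins=[0, 6, 15, 20],
    labels=["powerplay", "middle", "death"],
)

phase_stats = balls.groupby("phase", observed=True)["total_runs"].agg(
    runs="sum", balls="count"
)
# Runs per over, treating every delivery as a legal ball (simplification).
phase_stats["run_rate"] = 6 * phase_stats["runs"] / phase_stats["balls"]

print(phase_stats)
```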
- Cleaned and formatted DataFrames
- Structured outputs for readability
- Exported results to CSV files
- Generated Excel summary file
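
The export step might look like the sketch below: one CSV per result table plus a single Excel workbook with one sheet per table. The `results` dictionary and its contents are placeholders, and writing `.xlsx` files requires `openpyxl` to be installed.

```python
from pathlib import Path

import pandas as pd

# Placeholder result tables; in the pipeline these would be the analysis outputs.
results = {
    "top_batters": pd.DataFrame({"batsman": ["A", "B"], "runs": [10, 3]}),
    "economy": pd.DataFrame({"bowler": ["X"], "economy": [16.8]}),
}

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

# One CSV file per table.
for name, df in results.items():
    df.to_csv(out_dir / f"{name}.csv", index=False)

# One Excel workbook with a sheet per table (sheet names are capped at 31 characters).
with pd.ExcelWriter(out_dir / "ipl_analysis.xlsx", engine="openpyxl") as writer:
    for name, df in results.items():
        df.to_excel(writer, sheet_name=name[:31], index=False)
```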
- Death overs (16–20) have the highest scoring rates
- Certain venues with batting-friendly pitches consistently produce high scores
- Toss impact is moderate: the toss winner goes on to win roughly 50–55% of matches
- Some players dominate via boundaries, while others rely on consistency
- Strike rate is crucial in death overs, while average matters for consistency
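
A figure such as the toss impact can be derived as the share of decided matches in which the toss winner also won the match. The sketch below uses a made-up sample and assumes `toss_winner` and `winner` columns in `matches.csv`; matches without a result are dropped first.

```python
import pandas as pd

# Made-up sample; the last match has no result.
matches = pd.DataFrame({
    "toss_winner": ["P", "Q", "P", "R", "Q"],
    "winner": ["P", "P", "P", "Q", None],
})

decided = matches.dropna(subset=["winner"])
toss_win_pct = 100 * (decided["toss_winner"] == decided["winner"]).mean()

print(f"Toss winner won {toss_win_pct:.1f}% of decided matches")
```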
git clone <your-repo-link>
cd IPL_Pipeline
pip install pandas numpy matplotlib openpyxl

Open analysis.ipynb in VS Code or Jupyter and run all cells.
- CSV files in the /output folder
- Excel file: ipl_analysis.xlsx
This project demonstrates:
- End-to-end data pipeline design
- Data cleaning and transformation
- Multi-source data merging
- Efficient aggregation using Pandas
- Insight generation from real-world data
Akshat Walvekar