In [2]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# IPL Match Forecaster: A Data Science Deep Dive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Narrative: Decoding the Game of Cricket\n",
    "This project analyzes 12 seasons of Indian Premier League (IPL) match data. The goal is to go beyond descriptive statistics and build a machine learning model that forecasts match outcomes from patterns in historical results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Setup and Data Loading\n",
    "We start by importing the necessary libraries and loading the dataset from a public URL, so the notebook can be re-run without a local copy of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from xgboost import XGBClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# Set plot style for better visuals\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "sns.set_palette('viridis')\n",
    "\n",
    "# Load data directly from a raw GitHub URL\n",
    "data_url = 'https://raw.githubusercontent.com/datasets/ipl/main/data/matches.csv'\n",
    "matches = pd.read_csv(data_url)\n",
    "\n",
    "print('Dataset loaded successfully!')\n",
    "matches.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Data Cleaning & Preprocessing\n",
    "Real-world data is messy. We map renamed or successor franchises to a single team name and drop matches that have no recorded winner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Handle team name inconsistencies: map old or variant names to a single label\n",
    "# (Deccan Chargers is folded into Sunrisers Hyderabad, the Hyderabad franchise that replaced it)\n",
    "team_renames = {\n",
    "    'Delhi Daredevils': 'Delhi Capitals',\n",
    "    'Deccan Chargers': 'Sunrisers Hyderabad',\n",
    "    'Rising Pune Supergiant': 'Rising Pune Supergiants',\n",
    "}\n",
    "matches = matches.replace(team_renames)\n",
    "\n",
    "# Drop rows with no result (e.g., washed out matches)\n",
    "matches = matches.dropna(subset=['winner'])\n",
    "\n",
    "print('Data cleaned. Team name inconsistencies resolved.')"
   ]
  },
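  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the cleaning step, the sketch below lists the distinct team names that remain across the team columns. It assumes the `matches` DataFrame from the cells above; `team_columns` and `remaining_teams` are names introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `matches` from the cleaning cell above\n",
    "team_columns = ['team1', 'team2', 'toss_winner', 'winner']  # illustrative name\n",
    "\n",
    "# Pool every team column, drop missing values, and list the distinct names\n",
    "remaining_teams = sorted(pd.concat([matches[col] for col in team_columns]).dropna().unique())\n",
    "print(f'{len(remaining_teams)} distinct team names after cleaning:')\n",
    "for name in remaining_teams:\n",
    "    print(' -', name)"
   ]
  },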
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Exploratory Data Analysis (EDA)\n",
    "Before modeling, we must understand the data. What are the key trends and patterns?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 1: Number of matches won by each team\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.countplot(y='winner', data=matches, order=matches['winner'].value_counts().index)\n",
    "plt.title('Total Matches Won by Each IPL Team', fontsize=16)\n",
    "plt.xlabel('Number of Wins', fontsize=12)\n",
    "plt.ylabel('Team', fontsize=12)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 2: Impact of the Toss\n",
    "toss_impact = matches['toss_winner'] == matches['winner']\n",
    "plt.figure(figsize=(8, 5))\n",
    "sns.countplot(x=toss_impact)\n",
    "plt.title('Does the Toss Winner Also Win the Match?', fontsize=16)\n",
    "plt.xticks([0, 1], ['No', 'Yes'])\n",
    "plt.xlabel('Toss Winner Won Match', fontsize=12)\n",
    "plt.ylabel('Count', fontsize=12)\n",
    "plt.show()\n",
    "print(f\"The toss winner won the match {toss_impact.mean()*100:.2f}% of the time.\")"
   ]
  },
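  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The overall rate can also be split by what the toss winner chose to do. The sketch below is one way to compute that, assuming the `matches` DataFrame from the earlier cells; `toss_winner_won` and `toss_summary` are names introduced for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `matches` from the cells above\n",
    "toss_summary = (\n",
    "    matches.assign(toss_winner_won=matches['toss_winner'] == matches['winner'])\n",
    "    .groupby('toss_decision')['toss_winner_won']\n",
    "    .agg(['mean', 'count'])  # share of matches won by the toss winner, and sample size\n",
    "    .rename(columns={'mean': 'win_rate', 'count': 'matches'})\n",
    ")\n",
    "print(toss_summary)"
   ]
  },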
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Feature Engineering & Preparation\n",
    "We select our features and convert them into a numerical format that the model can understand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = matches[['team1', 'team2', 'venue', 'toss_winner', 'toss_decision', 'winner']].copy()\n",
    "\n",
    "# We will use one encoder for all team names to ensure consistency\n",
    "all_teams = pd.concat([df['team1'], df['team2'], df['toss_winner'], df['winner']]).unique()\n",
    "team_encoder = LabelEncoder().fit(all_teams)\n",
    "\n",
    "# We need to store all encoders to make predictions later\n",
    "encoders = {}\n",
    "for col in ['team1', 'team2', 'toss_winner', 'winner']:\n",
    "    df[col] = team_encoder.transform(df[col])\n",
    "    encoders[col] = team_encoder\n",
    "\n",
    "for col in ['venue', 'toss_decision']:\n",
    "    le = LabelEncoder()\n",
    "    df[col] = le.fit_transform(df[col])\n",
    "    encoders[col] = le\n",
    "\n",
    "print('Features have been successfully encoded.')\n",
    "df.head()"
   ]
  },
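  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`LabelEncoder` assigns integer codes following the sorted order of the labels, so the codes are identifiers rather than a ranking of teams. The sketch below prints the mapping and round-trips one code back to a name. It assumes the `team_encoder` fitted in the cell above; `team_codes` is a name introduced for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `team_encoder` from the encoding cell above\n",
    "# Each team's code is its position in the encoder's sorted `classes_` array\n",
    "team_codes = pd.Series(range(len(team_encoder.classes_)), index=team_encoder.classes_, name='code')\n",
    "print(team_codes)\n",
    "\n",
    "# Decode a code back to its team name\n",
    "print(team_encoder.inverse_transform([0]))"
   ]
  },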
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5: Model Training & Evaluation\n",
    "We train two models on an 80/20 train/test split: a Logistic Regression as a baseline and an XGBoost classifier, then compare their accuracy on the held-out matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df.drop('winner', axis=1)\n",
    "y = df['winner']\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "# Baseline Model: Logistic Regression\n",
    "lr_model = LogisticRegression(max_iter=1000)\n",
    "lr_model.fit(X_train, y_train)\n",
    "lr_accuracy = accuracy_score(y_test, lr_model.predict(X_test))\n",
    "print(f\"Baseline (Logistic Regression) Accuracy: {lr_accuracy*100:.2f}%\")\n",
    "\n",
    "# Advanced Model: XGBoost Classifier\n",
    "# (use_label_encoder is omitted: it is deprecated in recent XGBoost releases and no longer needed)\n",
    "xgb_model = XGBClassifier(eval_metric='mlogloss', random_state=42)\n",
    "xgb_model.fit(X_train, y_train)\n",
    "xgb_accuracy = accuracy_score(y_test, xgb_model.predict(X_test))\n",
    "print(f\"Advanced (XGBoost) Accuracy: {xgb_accuracy*100:.2f}%\")"
   ]
  },
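  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With more than two possible winners, raw accuracy is easier to interpret next to a trivial reference. As one possible reference point, the sketch below scores scikit-learn's `DummyClassifier`, which always predicts the most frequent winner in the training split. It assumes `X_train`, `X_test`, `y_train`, `y_test`, and `accuracy_score` from the cells above; `dummy_model` and `dummy_accuracy` are names introduced here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "# Ignores the features and always predicts the most frequent class in y_train\n",
    "dummy_model = DummyClassifier(strategy='most_frequent')\n",
    "dummy_model.fit(X_train, y_train)\n",
    "dummy_accuracy = accuracy_score(y_test, dummy_model.predict(X_test))\n",
    "print(f'Majority-class reference accuracy: {dummy_accuracy*100:.2f}%')"
   ]
  },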
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 6: Feature Importance\n",
    "Let's see which features the XGBoost model relies on most when predicting a winner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "importances = xgb_model.feature_importances_\n",
    "feature_names = X.columns\n",
    "feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances}).sort_values('importance', ascending=False)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.barplot(x='importance', y='feature', data=feature_importance_df, palette='mako')\n",
    "plt.title('Feature Importance for Match Prediction', fontsize=16)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 7: Prediction Function\n",
    "Finally, we create a function to use our trained model to predict the outcome of a new, hypothetical match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_outcome(team1, team2, venue, toss_winner, toss_decision):\n",
    "    \"\"\"Predicts win probability for a given match setup.\"\"\"\n",
    "    try:\n",
    "        input_data = pd.DataFrame({\n",
    "            'team1': [encoders['team1'].transform([team1])[0]],\n",
    "            'team2': [encoders['team2'].transform([team2])[0]],\n",
    "            'venue': [encoders['venue'].transform([venue])[0]],\n",
    "            'toss_winner': [encoders['toss_winner'].transform([toss_winner])[0]],\n",
    "            'toss_decision': [encoders['toss_decision'].transform([toss_decision])[0]]\n",
    "        })\n",
    "        \n",
    "        win_probs = xgb_model.predict_proba(input_data)[0]\n",
    "        prob_team1 = win_probs[encoders['winner'].transform([team1])[0]]\n",
    "        prob_team2 = win_probs[encoders['winner'].transform([team2])[0]]\n",
    "        \n",
    "        # Renormalize over the two competing teams, since the model assigns probability to every team\n",
    "        total_prob = prob_team1 + prob_team2\n",
    "        win_prob_team1 = round((prob_team1 / total_prob) * 100)\n",
    "        win_prob_team2 = 100 - win_prob_team1  # keeps the two percentages summing to 100\n",
    "        \n",
    "        print('--- Prediction ---')\n",
    "        print(f\"{team1} vs. {team2} at {venue}\")\n",
    "        print(f\"{team1} Win Probability: {win_prob_team1}%\")\n",
    "        print(f\"{team2} Win Probability: {win_prob_team2}%\")\n",
    "        \n",
    "    except Exception as e:\n",
    "        print(f\"Error: Could not make a prediction. One of the inputs might be new to the model. Details: {e}\")\n",
    "\n",
    "# --- Example Prediction ---\n",
    "predict_outcome(\n",
    "    team1='Mumbai Indians',\n",
    "    team2='Chennai Super Kings',\n",
    "    venue='Wankhede Stadium',\n",
    "    toss_winner='Mumbai Indians',\n",
    "    toss_decision='field'\n",
    ")\n",
    "\n",
    "predict_outcome(\n",
    "    team1='Kolkata Knight Riders',\n",
    "    team2='Royal Challengers Bangalore',\n",
    "    venue='Eden Gardens',\n",
    "    toss_winner='Royal Challengers Bangalore',\n",
    "    toss_decision='field'\n",
    ")"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}

