# 🤖 Predictive Intelligence: Modeling, Optimization & Interpretability

## 5.1 Problem Statement

Our objective is to predict the **Win Probability** of the batting team at each ball of the second innings. This is a binary classification problem where:
- **Positive Class (1)**: Batting team wins.
- **Negative Class (0)**: Bowling team wins.

We use features engineered in the previous phase (CRR, RRR, Wickets Left) to capture the situational pressure of a T20 chase.
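For reference, the two run-rate features follow the standard cricket definitions: runs per over scored so far (CRR) and runs per over still required (RRR). The sketch below is a minimal illustration and not necessarily how `create_match_features` computes them; the function name `chase_rates` and its arguments are hypothetical, and a full 120-ball innings is assumed.

```python
def chase_rates(target_score, runs_left, balls_left, total_balls=120):
    """Return (crr, rrr) for one state of a chase.

    Assumes a full 120-ball T20 innings; a shortened match would need
    a different total_balls.
    """
    balls_bowled = total_balls - balls_left
    runs_scored = target_score - runs_left
    # Runs per over so far; undefined before the first ball, so use 0.0.
    crr = runs_scored * 6 / balls_bowled if balls_bowled > 0 else 0.0
    # Runs per over still needed; treated as infinite once no balls remain
    # (an illustrative choice, not taken from the notebook).
    rrr = runs_left * 6 / balls_left if balls_left > 0 else float("inf")
    return crr, rrr


# Chasing 180: 100 scored after 12 overs (72 balls), 80 needed from 48 balls.
crr, rrr = chase_rates(target_score=180, runs_left=80, balls_left=48)
print(f"CRR={crr:.2f}, RRR={rrr:.2f}")  # CRR=8.33, RRR=10.00
```

A required rate well above the current rate, as in this example, is the kind of situational pressure the model is meant to pick up.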

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import optuna
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score

# Local Paths
sys.path.append(os.path.abspath('../'))
from src.feature_engineering import create_match_features
from src.preprocessing import get_preprocessing_pipeline, build_full_pipeline
from src.utils import plot_confusion_matrix, plot_roc_curve

FIGURE_PATH = '../reports/figures/'
match_df = pd.read_csv('../data/raw/IPL Matches 2008-2020.csv')
ball_df = pd.read_csv('../data/raw/IPL Ball-by-Ball 2008-2020.csv')
feature_df = create_match_features(match_df, ball_df)

sns.set_palette("coolwarm")

## 5.2 Baseline Model: Logistic Regression

We start with a linear baseline to establish a performance floor.

In [2]:
X = feature_df.drop('result', axis=1)
y = feature_df['result']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

cat_cols = ['batting_team', 'bowling_team', 'city']
num_cols = ['runs_left', 'balls_left', 'wickets_left', 'target_score', 'crr', 'rrr', 'pressure_index']

preprocessor = get_preprocessing_pipeline(cat_cols, num_cols)
baseline_pipeline = build_full_pipeline(preprocessor, LogisticRegression(solver='liblinear'))

baseline_pipeline.fit(X_train, y_train)
y_pred = baseline_pipeline.predict(X_test)
y_prob = baseline_pipeline.predict_proba(X_test)[:, 1]

print(f"Baseline ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Baseline ROC-AUC: 0.9008

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.78      0.80      8142
           1       0.82      0.84      0.83      9311

    accuracy                           0.81     17453
   macro avg       0.81      0.81      0.81     17453
weighted avg       0.81      0.81      0.81     17453



### Interpretation
With a ROC-AUC of 0.90 and 81% accuracy, the baseline shows that our engineered features (such as RRR) carry a strong linear signal. However, cricket is non-linear: a single wicket can shift momentum in ways a linear model may miss.
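To make "linear signal" concrete: logistic regression models the log-odds of a win as a weighted sum of the preprocessed features. The symbols $x_j$ (features), $w_j$ (coefficients) and $b$ (intercept) are introduced here for illustration only.

$$
\log\frac{p}{1-p} = b + \sum_j w_j x_j, \qquad p = P(\text{batting team wins} \mid x)
$$

Each feature therefore shifts the log-odds by a fixed amount per unit, whatever the values of the other features. Losing a wicket costs the same in log-odds whether the required rate is 6 or 12, unless an engineered feature such as `pressure_index` already encodes that interaction. A tree ensemble can learn such interactions directly.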

## 5.3 Production Model: Random Forest & Stratified CV

We use **Stratified 5-Fold Cross-Validation** to ensure the balance between win/loss outcomes is preserved across folds, providing a more robust estimate of generalization error.

In [3]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
prod_pipeline = build_full_pipeline(preprocessor, rf_model)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(prod_pipeline, X, y, cv=skf, scoring='roc_auc')

print(f"Production model Mean ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Production model Mean ROC-AUC: 0.9999 (+/- 0.0000)


## 5.4 Hyperparameter Tuning (Optuna)

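We use Optuna, whose default TPE sampler performs a form of Bayesian optimization, to search over tree depth and estimator count. With only 10 trials, the result is a good candidate rather than a guaranteed optimum.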

In [4]:
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 5, 20)
    
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    pipeline = build_full_pipeline(preprocessor, model)
    
    return cross_val_score(pipeline, X_train, y_train, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

print("Best Parameters Found:", study.best_params)

[I 2026-02-25 04:39:17,039] A new study created in memory with name: no-name-09a46180-2686-4ea7-a9de-feeee20125bc
[I 2026-02-25 04:39:31,279] Trial 0 finished with value: 0.9958975874852336 and parameters: {'n_estimators': 162, 'max_depth': 17}. Best is trial 0 with value: 0.9958975874852336.
[I 2026-02-25 04:39:40,030] Trial 1 finished with value: 0.9156983738935831 and parameters: {'n_estimators': 167, 'max_depth': 8}. Best is trial 0 with value: 0.9958975874852336.
[I 2026-02-25 04:39:47,653] Trial 2 finished with value: 0.9986469582969105 and parameters: {'n_estimators': 80, 'max_depth': 20}. Best is trial 2 with value: 0.9986469582969105.
[I 2026-02-25 04:39:55,828] Trial 3 finished with value: 0.9931419204890249 and parameters: {'n_estimators': 96, 'max_depth': 16}. Best is trial 2 with value: 0.9986469582969105.
[I 2026-02-25 04:40:02,236] Trial 4 finished with value: 0.904133777190084 and parameters: {'n_

Best Parameters Found: {'n_estimators': 80, 'max_depth': 20}


## 5.5 Final Evaluation & Tactical Simulation

We visualize the final performance and simulate a "live" match win probability shift.

In [5]:
# Final Model Fit
best_rf = RandomForestClassifier(**study.best_params, random_state=42)
final_pipeline = build_full_pipeline(preprocessor, best_rf)
final_pipeline.fit(X_train, y_train)

# Sample Simulation
sample_state = X_test.head(1)
win_prob = final_pipeline.predict_proba(sample_state)[0][1]
print(f"Predicted Win Probability for Sample State: {win_prob*100:.2f}%")

Predicted Win Probability for Sample State: 1.54%
