# League of Legends – Model Research Notebook

AI-driven predictive modeling to estimate the probability that the blue team wins a ranked match (0–1). This notebook evaluates four candidate models and compares their performance, following a What → How → Why structure.

## Problem, data, models, and metrics

- Problem: Predict match outcome probability for League of Legends ranked games from early-game state (around 15 minutes).
- Data: Cleaned 15-minute dataset prepared in `data/cleaned/`. Pregame info is kept in a separate dataset and is not used in this notebook.
- Target: `blue_win` (1=blue wins, 0=blue loses).
- Models to evaluate:
  1. Logistic Regression (Baseline, classification) [R1]
  2. Random Forest Regressor (Main, regression → threshold for metrics) [R1]
  3. K-Nearest Neighbors (classification) [R1]
  4. Linear Regression (regression → threshold for metrics) [R1]
- Metrics: Accuracy, Precision, Recall, F1 on the test set (computed using a 0.5 threshold for probability/score outputs).
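As a minimal sketch of how the four metrics follow from thresholded scores, the snippet below counts the confusion-matrix cells by hand. The function name `binary_metrics` and the toy labels and scores are made up for illustration; the notebook itself relies on scikit-learn's metric functions.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold scores into 0/1 labels, then compute the four metrics."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data, invented for illustration (not taken from the dataset)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.5]
toy_metrics = binary_metrics(y_true, scores)
print(toy_metrics)
```

Note that a score exactly equal to the threshold counts as a predicted win here, matching the `>=` comparison used in the evaluation helper.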

### Hypotheses

- H1: A non-linear model (Random Forest) will outperform linear baselines due to complex interactions (gold, objectives, lane stats).
- H2: Logistic Regression will form a strong, interpretable baseline but may underfit non-linearities.
- H3: KNN will underperform with many numeric features and mixed scales, even with standardization.
- H4: Linear Regression can provide a quick regression baseline but will be less suitable for classification metrics.
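To make the "mixed scales" concern in H3 concrete, here is a small sketch of how a large-range feature can dominate Euclidean distance before standardization. The feature names `gold_diff` and `first_tower` and all values are hypothetical, chosen only to illustrate the effect.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows: (gold_diff, first_tower). Values are invented.
rows = [(1500.0, 1.0), (1400.0, 0.0), (-1200.0, 1.0), (-1300.0, 0.0)]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(data):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in data]


# Rows 0 and 1 differ by 100 gold and by the tower flag.
raw_same_gold = euclidean(rows[0], rows[1])
# Rows 0 and 2 share the tower flag but differ by 2700 gold.
raw_same_tower = euclidean(rows[0], rows[2])

scaled = standardize(rows)
scaled_same_gold = euclidean(scaled[0], scaled[1])
scaled_same_tower = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_same_gold:.3f} vs {raw_same_tower:.3f}")
print(f"scaled: {scaled_same_gold:.3f} vs {scaled_same_tower:.3f}")
```

On the raw values the tower flag barely changes the distance, while after standardization both features contribute on a comparable scale. This is the reason the KNN pipeline in this notebook includes a `StandardScaler` step.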

In [1]:
# How: Imports and setup
import os, json, warnings
import numpy as np
import pandas as pd
from typing import Tuple, Dict, Any

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [2]:
# How: Load cleaned dataset or splits from data/cleaned
cleaned_dir = os.path.join('data', 'cleaned')
fifteen_clean_path = os.path.join(cleaned_dir, 'lol_15min_data_clean.parquet')
fifteen_clean_csv = os.path.join(cleaned_dir, 'lol_15min_data_clean.csv')
train_path = os.path.join(cleaned_dir, 'lol_15min_train.parquet')
val_path = os.path.join(cleaned_dir, 'lol_15min_val.parquet')
test_path = os.path.join(cleaned_dir, 'lol_15min_test.parquet')

def read_any(path_parquet: str, path_csv: str) -> pd.DataFrame:
    if os.path.exists(path_parquet):
        try:
            return pd.read_parquet(path_parquet)
        except Exception as exc:
            warnings.warn(f'Failed to read {path_parquet} ({exc}); will try CSV.')
    if os.path.exists(path_csv):
        try:
            return pd.read_csv(path_csv)
        except Exception as exc:
            raise RuntimeError(f'Could not read {path_csv}: {exc}') from exc
    raise FileNotFoundError(f'No dataset available at {path_parquet} or {path_csv}')

# Prefer precomputed splits if present, otherwise load full clean dataset and split
if os.path.exists(train_path) and os.path.exists(test_path):
    try:
        train_df = pd.read_parquet(train_path)
        val_df = pd.read_parquet(val_path) if os.path.exists(val_path) else None
        test_df = pd.read_parquet(test_path)
        print('Loaded precomputed splits from data/cleaned.')
    except Exception as exc:
        warnings.warn(f'Split parquet read failed ({exc}); reverting to full dataset split.')
        full_df = read_any(fifteen_clean_path, fifteen_clean_csv)
        stratify_col = full_df['blue_win'] if 'blue_win' in full_df.columns else None
        train_df, test_df = train_test_split(
            full_df, test_size=0.2, random_state=RANDOM_STATE, stratify=stratify_col
        )
        val_df = None
else:
    full_df = read_any(fifteen_clean_path, fifteen_clean_csv)
    stratify_col = full_df['blue_win'] if 'blue_win' in full_df.columns else None
    train_df, test_df = train_test_split(
        full_df, test_size=0.2, random_state=RANDOM_STATE, stratify=stratify_col
    )
    val_df = None

for name, df in [('train', train_df), ('val', val_df), ('test', test_df)]:
    shape = 'None' if df is None else df.shape
    print(f'{name}: {shape}')

Loaded precomputed splits from data/cleaned.
train: (698, 20)
val: (150, 20)
test: (150, 20)


In [3]:
# How: Feature/label selection and minimal cleaning
LABEL_COL = 'blue_win'
ID_LIKE = {'matchId', 'gameId', 'match_id'}

def prepare_xy(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    assert LABEL_COL in df.columns, f'Missing label column: {LABEL_COL}'
    y = df[LABEL_COL].astype(int).clip(0, 1)
    # Keep numeric feature columns, excluding the label and ID-like columns.
    # Filtering IDs here avoids a KeyError when an ID column is numeric.
    num_cols = [
        c for c in df.select_dtypes(include=[np.number]).columns
        if c != LABEL_COL and c not in ID_LIKE
    ]
    X = df[num_cols].copy()
    # Fill remaining NaNs in numeric features with 0
    X = X.fillna(0)
    return X, y

X_train, y_train = prepare_xy(train_df)
X_test, y_test = prepare_xy(test_df)
X_val, y_val = (prepare_xy(val_df) if val_df is not None else (None, None))

X_train.shape, X_test.shape

((698, 14), (150, 14))

## Train and evaluate models
We train the four specified models and evaluate Accuracy, Precision, Recall, and F1 on the test set. For regressors (Random Forest Regressor, Linear Regression), we convert predictions to class labels via a 0.5 threshold.
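A regressor's raw output is not constrained to [0, 1], which is why scores are clipped before thresholding in the evaluation helper. The sketch below contrasts an unbounded linear score with a logistic (sigmoid) score for a single feature. The weights, intercept, and input values are made up, and `linear_score`, `logistic_score`, and `clip01` are illustrative helpers rather than part of the notebook.

```python
import math


def linear_score(x: float, w: float, b: float) -> float:
    """Unbounded linear output, as a least-squares regressor would give."""
    return w * x + b


def logistic_score(x: float, w: float, b: float) -> float:
    """Sigmoid of the linear output, always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def clip01(v: float) -> float:
    """Clamp a score into [0, 1] before applying the 0.5 threshold."""
    return min(1.0, max(0.0, v))


# Hypothetical one-feature model: x stands for a standardized gold difference.
for x in (-3.0, 0.0, 3.0):
    raw = linear_score(x, w=0.3, b=0.5)
    prob = logistic_score(x, w=1.0, b=0.0)
    label = int(clip01(raw) >= 0.5)
    print(f"x={x:+.1f} linear={raw:+.2f} clipped={clip01(raw):.2f} "
          f"sigmoid={prob:.3f} label={label}")
```

Clipping does not change which side of 0.5 a score falls on, so it affects only how the scores read as probabilities, not the thresholded labels.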

In [None]:
# Helper to evaluate a model (classifier with predict_proba or regressor with continuous output)
def evaluate_model(model, X_tr, y_tr, X_te, y_te, kind: str, threshold: float = 0.5) -> Dict[str, Any]:
    fitted = model.fit(X_tr, y_tr)
    if kind == 'classifier':
        if hasattr(fitted, 'predict_proba'):
            scores = fitted.predict_proba(X_te)[:, 1]
        elif hasattr(fitted, 'decision_function'):
            decision = fitted.decision_function(X_te)
            scores = 1 / (1 + np.exp(-decision))
        else:
            scores = fitted.predict(X_te)
    else:
        scores = fitted.predict(X_te)
    scores = np.clip(scores, 0, 1)
    y_pred = (scores >= threshold).astype(int)
    acc = accuracy_score(y_te, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_te, y_pred, average='binary', zero_division=0
    )
    return {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'threshold': threshold,
        'fitted': fitted,
    }

In [None]:
# Define models
logreg = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
])

rf_reg = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    min_samples_leaf=2,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

knn = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=15, weights='distance'))
])

lin_reg = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])

models = [
    ('Logistic Regression', logreg, 'classifier'),
    ('Random Forest Regressor', rf_reg, 'regressor'),
    ('K-Nearest Neighbors', knn, 'classifier'),
    ('Linear Regression', lin_reg, 'regressor'),
]

In [6]:
# Train and evaluate
results = []
fitted_models: Dict[str, Any] = {}

for name, model, kind in models:
    try:
        metrics = evaluate_model(model, X_train, y_train, X_test, y_test, kind)
        fitted_models[name] = metrics.pop('fitted')
        results.append({'model': name, **metrics})
        print(
            f"{name}: acc={metrics['accuracy']:.3f}, prec={metrics['precision']:.3f}, "
            f"rec={metrics['recall']:.3f}, f1={metrics['f1']:.3f}"
        )
    except Exception as exc:
        warnings.warn(f'{name} failed during training/evaluation: {exc}')

results_df = pd.DataFrame(results).sort_values(by='f1', ascending=False)
results_df

Logistic Regression (baseline): acc=0.713, prec=0.750, rec=0.684, f1=0.715
Random Forest Regressor (main): acc=0.733, prec=0.783, rec=0.684, f1=0.730
K-Nearest Neighbors: acc=0.700, prec=0.743, rec=0.658, f1=0.698
Linear Regression: acc=0.707, prec=0.740, rec=0.684, f1=0.711


| | model | accuracy | precision | recall | f1 | threshold | scores_preview |
|---|---|---|---|---|---|---|---|
| 1 | Random Forest Regressor (main) | 0.733333 | 0.782609 | 0.683544 | 0.72973 | 0.5 | [0.34704761904761905, 0.6384583333333333, 0.52... |
| 0 | Logistic Regression (baseline) | 0.713333 | 0.75 | 0.683544 | 0.715232 | 0.5 | [0.36086107043751325, 0.6487105053492047, 0.32... |
| 3 | Linear Regression | 0.706667 | 0.739726 | 0.683544 | 0.710526 | 0.5 | [0.39354325075311575, 0.613407923665401, 0.376... |
| 2 | K-Nearest Neighbors | 0.7 | 0.742857 | 0.658228 | 0.697987 | 0.5 | [0.3997930718654148, 0.6836890744102337, 0.541... |
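As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the other two columns. The sketch below uses only the precision, recall, and F1 values printed above; the helper name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the results table
reported = {
    "Random Forest Regressor": (0.782609, 0.683544, 0.72973),
    "Logistic Regression": (0.75, 0.683544, 0.715232),
    "Linear Regression": (0.739726, 0.683544, 0.710526),
    "K-Nearest Neighbors": (0.742857, 0.658228, 0.697987),
}

max_gap = 0.0
for name, (p, r, f1) in reported.items():
    recomputed = f1_from_pr(p, r)
    max_gap = max(max_gap, abs(recomputed - f1))
    print(f"{name}: reported={f1:.6f} recomputed={recomputed:.6f}")
```

Any remaining gap comes from the six-decimal rounding of the inputs.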


## Results summary and justification

**Random Forest Regressor (main model):**
  - Accuracy: 0.733, Precision: 0.783, Recall: 0.684, F1: 0.730 (threshold=0.5).
  - Interpretation: Highest F1 among the candidates, so the best precision/recall balance at this threshold. This is consistent with H1, that non-linear trees can capture interactions between gold advantage, lane stats, and early objectives, although the notebook does not test that mechanism directly. On this evidence Random Forest is selected as the main model.

**Logistic Regression (baseline model):**
  - Accuracy: 0.713, Precision: 0.750, Recall: 0.684, F1: 0.715.
  - Interpretation: Strong, interpretable baseline with competitive metrics. Coefficients (if examined) would likely align with key signals (e.g., gold diff, first tower/dragon). Good for sanity checks and fast iterations.

**Linear Regression:**
  - Accuracy: 0.707, Precision: 0.740, Recall: 0.684, F1: 0.711.
  - Interpretation: Thresholded at 0.5, it performs close to Logistic Regression but slightly worse. This is consistent with H4: least-squares outputs are unbounded scores rather than calibrated probabilities, so the 0.5 cut-off is less principled than it is for a classifier.

**K-Nearest Neighbors:**
  - Accuracy: 0.700, Precision: 0.743, Recall: 0.658, F1: 0.698.
  - Interpretation: Lowest F1 of the four, consistent with H3. The pipeline already standardizes features, so raw scale differences are handled; more likely factors are distances becoming less informative across the 14 numeric features and the untuned choice of `n_neighbors=15`.

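**Summary:** Results are consistent with the hypotheses: Random Forest leads and remains the recommended main model; Logistic Regression is a transparent baseline; KNN and Linear Regression are useful comparisons but secondary choices given observed performance. The margins are small, though: on the 150-match test set, adjacent models in the ranking differ by only one to three correct predictions.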
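To put the accuracy gaps in context, the following sketch estimates the sampling uncertainty of an accuracy measured on 150 test matches. It uses a normal-approximation (Wald) interval, which is an assumption made here for simplicity and not something the notebook computes; `accuracy_interval` is an illustrative helper.

```python
import math
from typing import Tuple


def accuracy_interval(
    correct: int, n: int, z: float = 1.96
) -> Tuple[float, float, Tuple[float, float]]:
    """Return accuracy, its binomial standard error, and a ~95% interval."""
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)


N_TEST = 150  # test-set size printed by the data-loading cell

# Correct-prediction counts recovered from the reported accuracies
rf_correct = round(0.733333 * N_TEST)      # Random Forest Regressor
logreg_correct = round(0.713333 * N_TEST)  # Logistic Regression

rf_acc, rf_se, (rf_lo, rf_hi) = accuracy_interval(rf_correct, N_TEST)
lr_acc, lr_se, (lr_lo, lr_hi) = accuracy_interval(logreg_correct, N_TEST)

print(f"Random Forest: {rf_acc:.3f} +/- {1.96 * rf_se:.3f}")
print(f"Logistic Regression: {lr_acc:.3f} +/- {1.96 * lr_se:.3f}")
```

Under this approximation each interval is wider than the gap between the two models, so the ranking should be read as suggestive rather than conclusive.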

## References
- [R1] Scikit-learn documentation – Logistic Regression, Random Forests, KNN, Linear Regression: https://scikit-learn.org/stable/user_guide.html