# BundesligaBERT: BERT Fine-Tuning on Kaggle GPU

This notebook fine-tunes a German DistilBERT model (`distilbert-base-german-cased`) to predict Bundesliga stoppage time (in minutes) from live ticker text.

## Setup Instructions

1. **Upload Data**: Upload the JSON files (`train.json`, `val.json`, `test_history.json`, `test_future.json`) as a Kaggle dataset
2. **Attach Dataset**: Click "Add data" → Search for your dataset → Add it, then set `DATASET_NAME` in the Configuration cell to match
3. **Enable GPU**: Settings → Accelerator → GPU T4 x2 (or available GPU)
4. **Run All**: Run all cells to train the model and generate diagnostics

## Outputs

All outputs are saved to `/kaggle/working/`:
- Predictions CSV files for all splits
- Performance metrics JSON
- Training log JSON
- Aggregated statistics JSON
- 9 diagnostic plots (PNG)



## 1. Setup: Install Core Dependencies

Install core libraries needed for training. Plotting libraries will be installed later.


In [None]:
# Install core dependencies
!pip install -q transformers datasets accelerate scikit-learn

# Note: torch, pandas, numpy are pre-installed in Kaggle environment
# We'll install plotting libraries later after basic training works


## 2. Configuration

All hyperparameters are collected here so they can be tweaked without editing the code below.


In [None]:
# Hyperparameter configuration
CONFIG = {
    'model_name': 'distilbert-base-german-cased',
    'learning_rate': 2e-5,
    'num_train_epochs': 10,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 32,
    'weight_decay': 0.01,
    'max_length': 512,
    'early_stopping_patience': 3,
    'random_seed': 42
}

# Dataset path (update with your dataset name)
DATASET_NAME = 'your-dataset-name'  # Change this to your actual dataset name
DATASET_PATH = f'/kaggle/input/{DATASET_NAME}'

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
print(f"\nDataset path: {DATASET_PATH}")
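
With the `GPU T4 x2` accelerator, `Trainer` typically spreads each batch over both GPUs, so the effective batch size is `per_device_train_batch_size` times the number of visible GPUs. The sketch below estimates the resulting optimizer steps per epoch. The function name and the sample count of 5,000 are hypothetical, and the estimate assumes no gradient accumulation.

```python
import math

def steps_per_epoch(num_samples, per_device_batch_size, n_gpus=1):
    """Estimate optimizer steps per epoch (assumes no gradient accumulation)."""
    effective_batch = per_device_batch_size * max(n_gpus, 1)
    return math.ceil(num_samples / effective_batch)

# Hypothetical dataset size, for illustration only.
print(steps_per_epoch(5000, 16, n_gpus=1))  # 313
print(steps_per_epoch(5000, 16, n_gpus=2))  # 157
```

The `logging_steps` value in the training cell is counted in these same steps.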


## 3. Data Loading

Load JSON files from the Kaggle dataset.


In [None]:
import json
import os
from datasets import load_dataset

# List available files
print("Available files in dataset:")
for dirname, _, filenames in os.walk(DATASET_PATH):
    for filename in filenames:
        print(f"  {os.path.join(dirname, filename)}")

# Load JSON files
data_files = {
    'train': f'{DATASET_PATH}/train.json',
    'val': f'{DATASET_PATH}/val.json',
    'test_history': f'{DATASET_PATH}/test_history.json',
    'test_future': f'{DATASET_PATH}/test_future.json'
}

# Load datasets
datasets = load_dataset('json', data_files=data_files)

print("\nDatasets loaded successfully!")
print(f"Train: {len(datasets['train'])} samples")
print(f"Val: {len(datasets['val'])} samples")
print(f"Test History: {len(datasets['test_history'])} samples")
print(f"Test Future: {len(datasets['test_future'])} samples")
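
The later cells read five fields from every record: `text`, `label`, `match_id`, `season`, and `half`. A minimal sketch of a pre-flight check follows; `missing_fields` is a hypothetical helper and the sample record is made up for illustration.

```python
REQUIRED_FIELDS = ('text', 'label', 'match_id', 'season', 'half')

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from one record."""
    return [name for name in required if name not in record]

# Made-up record, for illustration only.
sample = {'text': 'Beispieltext', 'label': 3.0, 'match_id': 1, 'half': 90}
print(missing_fields(sample))  # ['season']
```

On the loaded data, the same check can be run per split by passing `datasets['train'].column_names` as `record`.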


## 4. Data Preparation

Tokenize the text data using the BERT tokenizer.


In [None]:
from transformers import AutoTokenizer

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(CONFIG['model_name'])

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=CONFIG['max_length']
    )

# Tokenize all datasets
print("Tokenizing datasets...")
train_dataset = datasets['train'].map(tokenize_function, batched=True)
val_dataset = datasets['val'].map(tokenize_function, batched=True)
test_history_dataset = datasets['test_history'].map(tokenize_function, batched=True)
test_future_dataset = datasets['test_future'].map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_history_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_future_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

print("Tokenization complete!")
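
Because `padding='max_length'` pads every example to `max_length` tokens, short ticker texts spend most of the sequence on padding. The sketch below estimates that share from a list of token counts; the helper name and the counts are hypothetical.

```python
def padding_fraction(token_counts, max_length=512):
    """Share of positions that are padding when every example is padded to max_length."""
    if not token_counts:
        return 0.0
    # Texts longer than max_length are truncated, so they contribute no padding
    used = sum(min(n, max_length) for n in token_counts)
    return 1.0 - used / (len(token_counts) * max_length)

# Hypothetical token counts, for illustration only.
print(f"{padding_fraction([128, 256, 600, 64]):.1%}")
```

If the share is high, dynamic padding (for example with `DataCollatorWithPadding` from `transformers`) is a common alternative, at the cost of changing the tokenization cell.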


## 5. Data Validation

Validate dataset sizes and distributions to catch issues before training.


In [None]:
import pandas as pd
import numpy as np

def validate_dataset(dataset, name):
    """Validate and print dataset statistics."""
    if len(dataset) == 0:
        print(f"⚠️  WARNING: {name} is EMPTY!")
        return
    
    # Convert to pandas for easier analysis
    df = pd.DataFrame({
        'match_id': dataset['match_id'],
        'season': dataset['season'],
        'half': dataset['half'],
        'label': dataset['label']
    })
    
    print(f"\n{name}:")
    print(f"  Total samples: {len(df)}")
    print(f"  Half distribution:")
    print(f"    45: {(df['half'] == 45).sum()} ({(df['half'] == 45).sum() / len(df) * 100:.1f}%)")
    print(f"    90: {(df['half'] == 90).sum()} ({(df['half'] == 90).sum() / len(df) * 100:.1f}%)")
    print(f"  Season distribution:")
    for season in sorted(df['season'].unique()):
        count = (df['season'] == season).sum()
        print(f"    {season}: {count} ({(count / len(df) * 100):.1f}%)")
    print(f"  Label statistics:")
    print(f"    Mean: {df['label'].mean():.2f}")
    print(f"    Std: {df['label'].std():.2f}")
    print(f"    Min: {df['label'].min():.2f}")
    print(f"    Max: {df['label'].max():.2f}")
    
    if len(df) < 10:
        print(f"  ⚠️  WARNING: {name} has very few samples ({len(df)})")

# Validate all datasets
validate_dataset(train_dataset, "Train")
validate_dataset(val_dataset, "Validation")
validate_dataset(test_history_dataset, "Test History")
validate_dataset(test_future_dataset, "Test Future")

print("\n✅ Data validation complete!")


## 6. Model Initialization

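Initialize the model with a single-output sequence classification head. With `num_labels=1`, the Hugging Face model treats the task as regression and trains with mean squared error loss.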


In [None]:
import torch
from transformers import AutoModelForSequenceClassification

# Set random seeds for reproducibility
torch.manual_seed(CONFIG['random_seed'])
np.random.seed(CONFIG['random_seed'])

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

# Initialize model
print(f"\nLoading model: {CONFIG['model_name']}")
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG['model_name'],
    num_labels=1  # Regression task
)

# Move model to device
model = model.to(device)
print("Model initialized and moved to device!")


## 7. Training

Configure and train the BERT model.
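
Training uses early stopping: it ends once the validation loss has failed to improve for `early_stopping_patience` consecutive evaluations, and `load_best_model_at_end=True` then restores the best checkpoint. The function below is a simplified illustration of that rule, not the library's implementation, and the loss values are made up.

```python
def early_stop_epoch(eval_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float('inf')
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0  # an improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
print(early_stop_epoch([1.20, 0.95, 0.97, 0.96, 0.99, 0.90]))  # 5
```

In this example the best loss is reached in epoch 2, so the model kept at the end would be the epoch-2 checkpoint.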


In [None]:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Compute metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.flatten()
    labels = labels.flatten()
    
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    mae = mean_absolute_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    
    return {
        'rmse': rmse,
        'mae': mae,
        'r2': r2
    }

# Training arguments
training_args = TrainingArguments(
    output_dir='/kaggle/working/checkpoints',
    num_train_epochs=CONFIG['num_train_epochs'],
    per_device_train_batch_size=CONFIG['per_device_train_batch_size'],
    per_device_eval_batch_size=CONFIG['per_device_eval_batch_size'],
    learning_rate=CONFIG['learning_rate'],
    weight_decay=CONFIG['weight_decay'],
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    logging_steps=50,
    eval_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=3,
    report_to='none'  # Disable wandb/tensorboard
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=CONFIG['early_stopping_patience'])]
)

# Train
print("Starting training...")
train_result = trainer.train()
print(f"Training completed! Average training loss: {train_result.training_loss:.4f}")


## 8. Predictions & Metrics

Generate predictions for all splits and extract metrics. `trainer.predict()` returns predictions, labels, and metrics in one call; the metric keys carry the default `test_` prefix (for example `test_rmse`).


In [None]:
# Function to get predictions and metadata
def get_predictions(trainer, dataset, split_name):
    """Get predictions and extract metadata."""
    # Temporarily change format to access metadata
    dataset.set_format(type=None)
    
    # Extract metadata
    metadata = {
        'match_ids': dataset['match_id'],
        'seasons': dataset['season'],
        'halves': dataset['half']
    }
    
    # Reset format for prediction
    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    
    # Get predictions (returns both metrics and predictions)
    pred_output = trainer.predict(dataset)
    
    return {
        'predictions': pred_output.predictions.flatten(),
        'labels': pred_output.label_ids.flatten(),
        'metrics': pred_output.metrics,
        'match_ids': metadata['match_ids'],
        'seasons': metadata['seasons'],
        'halves': metadata['halves']
    }

# Get predictions for all splits
print("Generating predictions...")
train_predictions = get_predictions(trainer, train_dataset, 'train')
val_predictions = get_predictions(trainer, val_dataset, 'val')
test_history_predictions = get_predictions(trainer, test_history_dataset, 'test_history')
test_future_predictions = get_predictions(trainer, test_future_dataset, 'test_future')

# Print metrics
print("\n" + "="*70)
print("METRICS SUMMARY")
print("="*70)
for name, preds in [
    ('Train', train_predictions),
    ('Validation', val_predictions),
    ('Test History', test_history_predictions),
    ('Test Future', test_future_predictions)
]:
    if len(preds['predictions']) > 0:
        metrics = preds['metrics']
        print(f"\n{name}:")
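        # Format each metric only if present; a missing key prints N/A instead of raising
        for metric_label, key in [('RMSE', 'test_rmse'), ('MAE', 'test_mae'), ('R²', 'test_r2')]:
            value = metrics.get(key)
            print(f"  {metric_label}: {value:.4f}" if value is not None else f"  {metric_label}: N/A")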
    else:
        print(f"\n{name}: No data")
print("="*70)
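
For reference, the three reported metrics can be reproduced by hand. The sketch below uses plain Python and made-up numbers and follows the standard definitions of RMSE, MAE, and R²; the function name is hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE and R² for two equal-length, non-empty sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R² = 1 - SS_res / SS_tot; this sketch returns NaN when all actual values are identical
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float('nan')
    return {'rmse': math.sqrt(mse), 'mae': mae, 'r2': r2}

# Made-up stoppage times in minutes, for illustration only.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0]))
```

Here the squared errors sum to 1.5 and the total sum of squares is 5.0, so R² is 1 - 1.5/5.0 = 0.7 and the MAE is 0.5.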


## 9. Statistics

Calculate aggregated statistics (mean, std, min, max) per split/subset for local comparison with regression results.


In [None]:
def _metric_block(predictions, labels):
    """Return error metrics and summary statistics for one set of predictions."""
    return {
        'rmse': float(np.sqrt(mean_squared_error(labels, predictions))),
        'mae': float(mean_absolute_error(labels, predictions)),
        'r2': float(r2_score(labels, predictions)),
        'mean_actual': float(np.mean(labels)),
        'mean_predicted': float(np.mean(predictions)),
        'std_actual': float(np.std(labels)),
        'std_predicted': float(np.std(predictions)),
        'min_actual': float(np.min(labels)),
        'max_actual': float(np.max(labels)),
        'min_predicted': float(np.min(predictions)),
        'max_predicted': float(np.max(predictions))
    }

def calculate_subset_metrics(predictions, labels, halves):
    """Calculate metrics for combined, subset_45, and subset_90."""
    predictions = np.array(predictions)
    labels = np.array(labels)
    halves = np.array(halves)

    if len(predictions) == 0 or len(labels) == 0:
        return {'combined': {}, 'subset_45': {}, 'subset_90': {}}

    result = {'combined': _metric_block(predictions, labels)}

    # Per-half metrics; a half with no samples keeps an empty dict
    for half in (45, 90):
        mask = halves == half
        result[f'subset_{half}'] = (
            _metric_block(predictions[mask], labels[mask]) if mask.sum() > 0 else {}
        )

    return result

# Calculate statistics for all splits
performance_metrics = {
    'train': calculate_subset_metrics(
        train_predictions['predictions'],
        train_predictions['labels'],
        train_predictions['halves']
    ),
    'val': calculate_subset_metrics(
        val_predictions['predictions'],
        val_predictions['labels'],
        val_predictions['halves']
    ),
    'test_history': calculate_subset_metrics(
        test_history_predictions['predictions'],
        test_history_predictions['labels'],
        test_history_predictions['halves']
    ),
    'test_future': calculate_subset_metrics(
        test_future_predictions['predictions'],
        test_future_predictions['labels'],
        test_future_predictions['halves']
    )
}

print("Statistics calculated for all splits!")


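## 10. Install Plotting Libraries

Install the plotting libraries now that training and evaluation have run.


In [None]: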
# Install plotting libraries
!pip install -q matplotlib seaborn scipy


## 11. Diagnostic Plots

Generate the 9 diagnostic plots for model evaluation.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300

# Create output directory
output_dir = '/kaggle/working/figures'
os.makedirs(output_dir, exist_ok=True)

# Prepare DataFrames
def prepare_df(predictions_dict, name):
    """Prepare DataFrame from predictions dictionary."""
    if len(predictions_dict['predictions']) == 0:
        return pd.DataFrame()
    return pd.DataFrame({
        'match_id': predictions_dict['match_ids'],
        'season': predictions_dict['seasons'],
        'half': predictions_dict['halves'],
        'actual': predictions_dict['labels'],
        'predicted': predictions_dict['predictions']
    })

history_df = prepare_df(test_history_predictions, 'history')
future_df = prepare_df(test_future_predictions, 'future')

if len(history_df) > 0:
    history_df['residual'] = history_df['actual'] - history_df['predicted']
    history_df['abs_error'] = np.abs(history_df['residual'])

if len(future_df) > 0:
    future_df['residual'] = future_df['actual'] - future_df['predicted']
    future_df['abs_error'] = np.abs(future_df['residual'])

print("DataFrames prepared for plotting")


In [None]:
# 1. Learning Curve
print("Plotting learning curve...")
history = trainer.state.log_history
train_losses = [h['loss'] for h in history if 'loss' in h]
train_steps = [h['step'] for h in history if 'loss' in h]
eval_losses = [h['eval_loss'] for h in history if 'eval_loss' in h]
eval_steps = [h['step'] for h in history if 'eval_loss' in h]

fig, ax = plt.subplots(figsize=(10, 6))
if train_losses:
    # Plot against the logged global step so both curves share one x-axis
    ax.plot(train_steps, train_losses, label='Training Loss', alpha=0.7)
if eval_losses:
    ax.plot(eval_steps, eval_losses, label='Validation Loss', marker='o', markersize=4)
ax.set_xlabel('Step')
ax.set_ylabel('Loss')
ax.set_title('Learning Curve: Training vs Validation Loss')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(f'{output_dir}/learning_curve.png', dpi=300, bbox_inches='tight')
plt.close()
print("✓ Learning curve saved")


In [None]:
# 2. CDF of Absolute Errors (REC curve: Regression Error Characteristic)
print("Plotting CDF of absolute errors...")
fig, ax = plt.subplots(figsize=(10, 6))
tolerances = np.linspace(0, 5, 100)

for name, df, half in [
    ('History-45', history_df[history_df['half'] == 45] if len(history_df) > 0 else pd.DataFrame(), 45),
    ('History-90', history_df[history_df['half'] == 90] if len(history_df) > 0 else pd.DataFrame(), 90),
    ('Future-45', future_df[future_df['half'] == 45] if len(future_df) > 0 else pd.DataFrame(), 45),
    ('Future-90', future_df[future_df['half'] == 90] if len(future_df) > 0 else pd.DataFrame(), 90)
]:
    if len(df) > 0:
        percentages = [100 * (df['abs_error'] <= tol).sum() / len(df) for tol in tolerances]
        ax.plot(tolerances, percentages, label=name, linewidth=2)

ax.set_xlabel('Tolerance (minutes)')
ax.set_ylabel('% of Predictions within Tolerance')
ax.set_title('CDF of Absolute Errors (REC Curve)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(f'{output_dir}/rec_curve.png', dpi=300, bbox_inches='tight')
plt.close()
print("✓ Rec curve saved")
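
Each point on this curve is the percentage of predictions whose absolute error is at most a given tolerance. A minimal sketch of that calculation in plain Python, with a hypothetical function name and made-up errors:

```python
def rec_curve(abs_errors, tolerances):
    """Percentage of predictions whose absolute error is within each tolerance."""
    n = len(abs_errors)
    if n == 0:
        return []
    return [100.0 * sum(e <= tol for e in abs_errors) / n for tol in tolerances]

# Made-up absolute errors in minutes, for illustration only.
print(rec_curve([0.2, 0.8, 1.5, 3.0], tolerances=[0.5, 1.0, 2.0, 5.0]))
# [25.0, 50.0, 75.0, 100.0]
```

A curve that rises faster means more predictions land close to the actual stoppage time.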


In [None]:
# 3. Prediction vs Actual Scatter
print("Plotting prediction vs actual scatter...")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# History
if len(history_df) > 0:
    for half in [45, 90]:
        subset = history_df[history_df['half'] == half]
        if len(subset) > 0:
            color = 'blue' if half == 45 else 'red'
            ax1.scatter(subset['actual'], subset['predicted'], alpha=0.5, label=f'Half {half}', color=color, s=20)
    
    max_val = max(history_df['actual'].max(), history_df['predicted'].max())
    ax1.plot([0, max_val], [0, max_val], 'k--', label='Perfect Fit', linewidth=2)
    ax1.set_xlabel('Actual')
    ax1.set_ylabel('Predicted')
    try:
        r2_hist = r2_score(history_df["actual"], history_df["predicted"])
        ax1.set_title(f'History (R² = {r2_hist:.3f})')
    except Exception:
        ax1.set_title('History')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
else:
    ax1.text(0.5, 0.5, 'No History data', ha='center', va='center', transform=ax1.transAxes)
    ax1.set_title('History')

# Future
if len(future_df) > 0:
    for half in [45, 90]:
        subset = future_df[future_df['half'] == half]
        if len(subset) > 0:
            color = 'blue' if half == 45 else 'red'
            ax2.scatter(subset['actual'], subset['predicted'], alpha=0.5, label=f'Half {half}', color=color, s=20)
    
    max_val = max(future_df['actual'].max(), future_df['predicted'].max())
    ax2.plot([0, max_val], [0, max_val], 'k--', label='Perfect Fit', linewidth=2)
    ax2.set_xlabel('Actual')
    ax2.set_ylabel('Predicted')
    try:
        r2_fut = r2_score(future_df["actual"], future_df["predicted"])
        ax2.set_title(f'Future (R² = {r2_fut:.3f})')
    except Exception:
        ax2.set_title('Future')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
else:
    ax2.text(0.5, 0.5, 'No Future data', ha='center', va='center', transform=ax2.transAxes)
    ax2.set_title('Future')

plt.tight_layout()
plt.savefig(f'{output_dir}/pred_vs_actual_scatter.png', dpi=300, bbox_inches='tight')
plt.close()
print("✓ Prediction vs actual scatter saved")


In [None]:
# 4. Residual Distribution
print("Plotting residual distribution...")
if len(history_df) > 0 or len(future_df) > 0:
    # Combine residuals from whichever test splits contain data
    all_residuals = pd.concat([df['residual'] for df in (history_df, future_df) if len(df) > 0])
    if len(all_residuals) > 0:
        fig, ax = plt.subplots(figsize=(10, 6))
        ax.hist(all_residuals, bins=50, density=True, alpha=0.7, label='Residuals', color='steelblue')
        
        # Fit Gaussian
        mu, sigma = stats.norm.fit(all_residuals)
        x = np.linspace(all_residuals.min(), all_residuals.max(), 100)
        ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label=f'Gaussian Fit (μ={mu:.2f}, σ={sigma:.2f})')
        ax.axvline(0, color='black', linestyle='--', linewidth=2, label='Zero')
        ax.set_xlabel('Residual (Actual - Predicted)')
        ax.set_ylabel('Density')
        ax.set_title('Residual Distribution')
        ax.legend()
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/residual_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Residual distribution saved")


In [None]:
# 5. Q-Q Plot
print("Plotting Q-Q plot...")
if len(history_df) > 0 or len(future_df) > 0:
    # Combine residuals from whichever test splits contain data
    all_residuals = pd.concat([df['residual'] for df in (history_df, future_df) if len(df) > 0])
    if len(all_residuals) > 0:
        fig, ax = plt.subplots(figsize=(8, 8))
        stats.probplot(all_residuals, dist="norm", plot=ax)
        ax.set_title('Q-Q Plot: Residuals vs Normal Distribution')
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/qq_plot_residuals.png', dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Q-Q plot saved")


In [None]:
# 6. Residuals vs Predicted
print("Plotting residuals vs predicted...")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

if len(history_df) > 0:
    ax1.scatter(history_df['predicted'], history_df['residual'], alpha=0.5, s=20)
ax1.axhline(0, color='black', linestyle='--', linewidth=2)
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Residual')
ax1.set_title('History: Residuals vs Predicted')
ax1.grid(True, alpha=0.3)

if len(future_df) > 0:
    ax2.scatter(future_df['predicted'], future_df['residual'], alpha=0.5, s=20)
ax2.axhline(0, color='black', linestyle='--', linewidth=2)
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Residual')
ax2.set_title('Future: Residuals vs Predicted')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(f'{output_dir}/residuals_vs_predicted.png', dpi=300, bbox_inches='tight')
plt.close()
print("✓ Residuals vs predicted saved")


In [None]:
# 7. Error by Season
print("Plotting error by season...")
if len(history_df) > 0 or len(future_df) > 0:
    # Tag each non-empty split and stack them; empty frames are left out
    combined_df = pd.concat([
        df.assign(split=split)
        for split, df in [('History', history_df), ('Future', future_df)]
        if len(df) > 0
    ])
    
    if len(combined_df) > 0:
        fig, axes = plt.subplots(1, 2, figsize=(14, 6))
        
        for idx, split in enumerate(['History', 'Future']):
            subset = combined_df[combined_df['split'] == split]
            if len(subset) > 0 and 'season' in subset.columns:
                subset.boxplot(column='abs_error', by='season', ax=axes[idx], grid=False)
                axes[idx].set_title(f'{split}: Absolute Error by Season')
                axes[idx].set_xlabel('Season')
                axes[idx].set_ylabel('Absolute Error (minutes)')
                plt.setp(axes[idx].xaxis.get_majorticklabels(), rotation=45, ha='right')
            else:
                axes[idx].text(0.5, 0.5, f'No data for {split}', ha='center', va='center', transform=axes[idx].transAxes)
                axes[idx].set_title(f'{split}: Absolute Error by Season')
        
        plt.suptitle('')
        plt.tight_layout()
        plt.savefig(f'{output_dir}/error_by_season.png', dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Error by season saved")


In [None]:
# 8. Error by Half
print("Plotting error by half...")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for idx, (name, df) in enumerate([('History', history_df), ('Future', future_df)]):
    if len(df) > 0:
        df.boxplot(column='abs_error', by='half', ax=axes[idx], grid=False)
        axes[idx].set_title(f'{name}: Absolute Error by Half')
        axes[idx].set_xlabel('Half')
        axes[idx].set_ylabel('Absolute Error (minutes)')
    else:
        axes[idx].text(0.5, 0.5, f'No data for {name}', ha='center', va='center', transform=axes[idx].transAxes)
        axes[idx].set_title(f'{name}: Absolute Error by Half')

plt.suptitle('')
plt.tight_layout()
plt.savefig(f'{output_dir}/error_by_half.png', dpi=300, bbox_inches='tight')
plt.close()
print("✓ Error by half saved")


In [None]:
# 9. Calibration Plot
print("Plotting calibration plot...")
if len(history_df) > 0 or len(future_df) > 0:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    for idx, (name, df) in enumerate([('History', history_df), ('Future', future_df)]):
        ax = ax1 if idx == 0 else ax2
        if len(df) > 0 and 'predicted' in df.columns and 'actual' in df.columns:
            try:
                # Bin predictions
                if df['predicted'].min() == df['predicted'].max():
                    bin_centers = [df['predicted'].mean()]
                    bin_means = [df['actual'].mean()]
                else:
                    bins = np.linspace(df['predicted'].min(), df['predicted'].max(), 10)
                    df_temp = df.copy()
                    # include_lowest keeps the smallest prediction inside the first bin
                    df_temp['pred_bin'] = pd.cut(df_temp['predicted'], bins=bins, include_lowest=True)
                    
                    bin_centers = []
                    bin_means = []
                    # observed=True skips empty bins, which would otherwise yield NaN means
                    for _, group in df_temp.groupby('pred_bin', observed=True):
                        bin_centers.append(group['predicted'].mean())
                        bin_means.append(group['actual'].mean())
                
                if bin_centers:
                    ax.scatter(bin_centers, bin_means, s=100, alpha=0.7)
                    max_val = max(max(bin_centers), max(bin_means)) if bin_centers else 10
                    ax.plot([0, max_val], [0, max_val], 'r--', linewidth=2, label='Perfect Calibration')
                    ax.set_xlabel('Mean Predicted (binned)')
                    ax.set_ylabel('Mean Actual')
                    ax.set_title(f'{name}: Calibration Plot')
                    ax.legend()
                    ax.grid(True, alpha=0.3)
                else:
                    ax.text(0.5, 0.5, f'No data for {name}', ha='center', va='center', transform=ax.transAxes)
                    ax.set_title(f'{name}: Calibration Plot')
            except Exception as e:
                print(f"Error creating calibration plot for {name}: {e}")
                ax.text(0.5, 0.5, f'Error plotting {name}', ha='center', va='center', transform=ax.transAxes)
                ax.set_title(f'{name}: Calibration Plot')
        else:
            ax.text(0.5, 0.5, f'No data for {name}', ha='center', va='center', transform=ax.transAxes)
            ax.set_title(f'{name}: Calibration Plot')
    
    plt.tight_layout()
    plt.savefig(f'{output_dir}/calibration_plot.png', dpi=300, bbox_inches='tight')
    plt.close()
    print("✓ Calibration plot saved")

print("\n✅ All diagnostic plots generated!")
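
The calibration plot groups predictions into equal-width bins and compares the mean prediction with the mean actual value in each bin. The sketch below is a simplified plain-Python variant of that idea: the function name is hypothetical, the values are made up, and its bin edges are left-closed, so borderline values may land in a different bin than with `pd.cut`.

```python
def binned_calibration(predicted, actual, n_bins=9):
    """Return (mean predicted, mean actual) for each non-empty equal-width bin."""
    if not predicted:
        return []
    lo, hi = min(predicted), max(predicted)
    if lo == hi:  # all predictions identical: a single point
        return [(sum(predicted) / len(predicted), sum(actual) / len(actual))]
    width = (hi - lo) / n_bins
    bins = {}
    for p, a in zip(predicted, actual):
        idx = min(int((p - lo) / width), n_bins - 1)  # the maximum goes into the last bin
        bins.setdefault(idx, []).append((p, a))
    return [
        (sum(p for p, _ in rows) / len(rows), sum(a for _, a in rows) / len(rows))
        for _, rows in sorted(bins.items())
    ]

# Made-up values, for illustration only.
points = binned_calibration([1.0, 1.2, 3.0, 5.0], [1.5, 0.5, 3.5, 4.0], n_bins=2)
print(points)
```

Points close to the diagonal indicate that, on average, predictions in that range match the observed stoppage time.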


## 12. Save Outputs

Save all predictions, metrics, and aggregated statistics to `/kaggle/working/` for download.


In [None]:
# Save predictions for all splits
def save_predictions(predictions_dict, filename):
    """Save predictions to CSV."""
    if len(predictions_dict['predictions']) > 0:
        df = pd.DataFrame({
            'match_id': predictions_dict['match_ids'],
            'season': predictions_dict['seasons'],
            'half': predictions_dict['halves'],
            'actual': predictions_dict['labels'],
            'predicted': predictions_dict['predictions'],
            'residual': predictions_dict['labels'] - predictions_dict['predictions'],
            'abs_error': np.abs(predictions_dict['labels'] - predictions_dict['predictions'])
        })
        df.to_csv(f'/kaggle/working/{filename}', index=False)
        print(f"✓ Saved {filename} ({len(df)} samples)")
    else:
        # Create empty DataFrame with correct columns
        df = pd.DataFrame(columns=['match_id', 'season', 'half', 'actual', 'predicted', 'residual', 'abs_error'])
        df.to_csv(f'/kaggle/working/{filename}', index=False)
        print(f"⚠️  {filename} is empty (no data)")

print("Saving predictions...")
save_predictions(train_predictions, 'bert_predictions_train.csv')
save_predictions(val_predictions, 'bert_predictions_val.csv')
save_predictions(test_history_predictions, 'bert_predictions_history.csv')
save_predictions(test_future_predictions, 'bert_predictions_future.csv')


In [None]:
# Save performance metrics
print("\nSaving performance metrics...")
with open('/kaggle/working/bert_performance.json', 'w') as f:
    json.dump(performance_metrics, f, indent=2)
print("✓ Saved bert_performance.json")

# Save training log
training_log = {
    'training_loss': float(train_result.training_loss),
    'log_history': trainer.state.log_history
}
with open('/kaggle/working/bert_training_log.json', 'w') as f:
    json.dump(training_log, f, indent=2, default=str)
print("✓ Saved bert_training_log.json")

# Save aggregated statistics (currently the same content as bert_performance.json)
with open('/kaggle/working/bert_aggregated_stats.json', 'w') as f:
    json.dump(performance_metrics, f, indent=2)
print("✓ Saved bert_aggregated_stats.json")
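
The saved metrics are nested as split → subset → metric. For a local comparison it can be easier to work with flat rows. The following is a minimal sketch with a hypothetical helper and made-up numbers.

```python
import json

def flatten_metrics(performance):
    """Turn {split: {subset: {metric: value}}} into a list of flat rows."""
    rows = []
    for split, subsets in performance.items():
        for subset, metrics in subsets.items():
            if metrics:  # empty dicts mark splits or subsets without data
                rows.append({'split': split, 'subset': subset, **metrics})
    return rows

# Made-up numbers in the same nesting as bert_performance.json, for illustration only.
example = json.loads('{"val": {"combined": {"rmse": 1.2, "mae": 0.9}, "subset_45": {}}}')
print(flatten_metrics(example))
# [{'split': 'val', 'subset': 'combined', 'rmse': 1.2, 'mae': 0.9}]
```

After downloading, the real file can be read with `json.load` and passed to the same function; the resulting rows fit directly into `pd.DataFrame`.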


## Summary

All outputs have been saved to `/kaggle/working/`:

**Predictions:**
- `bert_predictions_train.csv`
- `bert_predictions_val.csv`
- `bert_predictions_history.csv`
- `bert_predictions_future.csv`

**Metrics & Statistics:**
- `bert_performance.json` - Detailed metrics (RMSE, MAE, R²) for all splits
- `bert_training_log.json` - Training history
- `bert_aggregated_stats.json` - Aggregated statistics for local comparison (currently the same content as `bert_performance.json`)

**Diagnostic Plots (9 total):**
- All plots saved to `/kaggle/working/figures/`

You can download all files from the "Output" tab or via the file browser on the right.
