# ⚡ Optimization: Pre-computed Probabilities

<div style="background-color: #fff3e0; padding: 15px; border-radius: 5px; border-left: 5px solid #ff9800;">
<b>⚡ Performance Optimization</b><br>
<b>Level:</b> Intermediate<br>
<b>Estimated Time:</b> 15 minutes<br>
<b>Prerequisites:</b> 03_model_integration.ipynb<br>
<b>Dataset:</b> Digits (sklearn)
</div>

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- ✅ Understand when model inference becomes a bottleneck
- ✅ Pre-compute probabilities to save time
- ✅ Use `prob_cols` parameter in DBDataset
- ✅ Measure performance improvements
- ✅ Know when to use this optimization
- ✅ Apply to large datasets and heavy models

---

## 📚 Table of Contents

1. [The Problem](#problem)
2. [Setup](#setup)
3. [Baseline: Model in Memory](#baseline)
4. [Pre-compute Probabilities](#precompute)
5. [Use prob_cols](#probcols)
6. [Performance Comparison](#comparison)
7. [When to Use](#when)
8. [Best Practices](#practices)
9. [Conclusion](#conclusion)

<a id="problem"></a>
## 1. ⚠️ The Problem

### Scenario: Slow Model Inference

```python
# You have a heavy model (e.g., large neural network, ensemble)
# VeryHeavyModel is a placeholder name, not a real class
model = VeryHeavyModel()

# DeepBridge needs predictions for validation tests
dataset = DBDataset(data=df, target_column='target', model=model)
#                                                      ↑
#                  Every test calls model.predict() multiple times!
#                  Tests: robustness (100x), uncertainty (50x), etc.
#                  Total: potentially hundreds of predictions on the same data!
```

### The Bottleneck

**Time breakdown for 10K samples (illustrative figures):**
- Model inference: 5 seconds per call
- Number of calls in tests: 100+
- **Total time: 500+ seconds (8+ minutes!)** ⚠️

But the data doesn't change! We're computing the same predictions over and over! 😱
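
The redundancy can be made visible by counting calls. The sketch below is a toy illustration, not DeepBridge code: `CountingModel` and the two `run_tests_*` functions are hypothetical stand-ins that only count how often `predict_proba` is invoked when each simulated test asks the model again, versus when the result is stored once and reused.

```python
class CountingModel:
    """Hypothetical stand-in for an expensive classifier; it only counts calls."""

    def __init__(self):
        self.calls = 0

    def predict_proba(self, rows):
        self.calls += 1
        # Return a fixed two-class distribution for every row.
        return [[0.3, 0.7] for _ in rows]


def run_tests_with_model(model, rows, n_tests):
    """Each simulated test asks the model for probabilities again."""
    for _ in range(n_tests):
        model.predict_proba(rows)
    return model.calls


def run_tests_with_cache(model, rows, n_tests):
    """Probabilities are computed once; every simulated test reuses them."""
    cached = model.predict_proba(rows)
    for _ in range(n_tests):
        _ = cached  # a real test would read the stored values here
    return model.calls


rows = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
calls_direct = run_tests_with_model(CountingModel(), rows, n_tests=100)
calls_cached = run_tests_with_cache(CountingModel(), rows, n_tests=100)
print(f"Direct: {calls_direct} model calls, cached: {calls_cached} model call")
```

With 100 simulated tests, the direct path calls the model 100 times and the cached path calls it once.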

### The Solution ✅

**Pre-compute probabilities ONCE, reuse many times:**

```python
# Step 1: Compute probabilities once (a single predict_proba call)
probs = model.predict_proba(X)
df['prob_0'] = probs[:, 0]
df['prob_1'] = probs[:, 1]

# Step 2: Tell DBDataset to use pre-computed probabilities
dataset = DBDataset(
    data=df,
    target_column='target',
    prob_cols=['prob_0', 'prob_1']  # ← Use these instead of calling model!
)
```

**Result (same illustrative figures):**
- Model inference: 5 seconds (one-time)
- Tests: 0.1 seconds (just read columns!)
- **Total time: 5.1 seconds instead of 500+ (about 99% less time, roughly 98x faster!)** 🚀
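
The arithmetic behind these figures is worth spelling out, because "percent less time" and "times faster" are easy to confuse. The numbers below are the illustrative ones used in this section, not measurements.

```python
seconds_per_call = 5.0   # illustrative cost of one inference pass
n_calls = 100            # illustrative number of repeated predictions
read_overhead = 0.1      # illustrative cost of reading stored columns

baseline = seconds_per_call * n_calls          # model called every time
optimized = seconds_per_call + read_overhead   # one call, then column reads
reduction = 1 - optimized / baseline           # fraction of time saved
speedup = baseline / optimized                 # how many times faster

print(f"Baseline:  {baseline:.0f} s ({baseline / 60:.1f} min)")
print(f"Optimized: {optimized:.1f} s")
print(f"Reduction: {reduction:.1%}, speedup: {speedup:.0f}x")
```

Under these assumptions the run time drops by about 99%, which corresponds to a speedup of roughly 98x.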

**Let's see it in action!**

<a id="setup"></a>
## 2. 🛠️ Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings

# sklearn
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# DeepBridge
from deepbridge import DBDataset, Experiment

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ Setup complete!")
print("⚡ Topic: Performance Optimization with Pre-computed Probabilities")

### Load Data & Train Model

In [None]:
# Load digits dataset (0-9 classification)
digits = load_digits()
df = pd.DataFrame(digits.data, columns=[f'pixel_{i}' for i in range(digits.data.shape[1])])
df['target'] = digits.target

print("📊 Digits Dataset:")
print(f"   Shape: {df.shape}")
print(f"   Task: Multiclass classification (0-9)")
print(f"   Classes: {len(np.unique(digits.target))}")

# Train model
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Train a "heavy" model (simulate with many trees)
model = RandomForestClassifier(
    n_estimators=200,  # Many trees = slower
    max_depth=15,
    random_state=RANDOM_STATE,
    n_jobs=1  # Single core to simulate slow model
)

print("\n🌲 Training model...")
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(f"✅ Model trained! Accuracy: {acc:.3f}")

<a id="baseline"></a>
## 3. 📊 Baseline: Model in Memory

### Method 1: Pass model directly (traditional)

In [None]:
print("⏱️ METHOD 1: Passing model directly\n")
print("   Model will be called every time predictions are needed")
print("   Let's measure the time...\n")

# Measure time for dataset creation
start = time.time()

dataset_with_model = DBDataset(
    data=df,
    target_column='target',
    model=model,  # ← Model in memory
    test_size=0.2,
    random_state=RANDOM_STATE
)

time_with_model = time.time() - start

print("✅ DBDataset created")
print(f"   Time: {time_with_model:.3f}s")
print("\n   What happened:")
print("   • model.predict() called on train data")
print("   • model.predict() called on test data")
print("   • model.predict_proba() called for probabilities")

### Simulate Multiple Predictions (as tests do)

In [None]:
# Simulate what happens in validation tests
print("\n🔬 Simulating multiple prediction calls (as in tests)...\n")

num_calls = 10  # Robustness test might call 100+ times!
times = []

for i in range(num_calls):
    start = time.time()
    _ = model.predict_proba(X_test)
    times.append(time.time() - start)

avg_time = np.mean(times)
total_time_baseline = avg_time * num_calls

print(f"   Average time per call: {avg_time:.4f}s")
print(f"   Total time for {num_calls} calls: {total_time_baseline:.3f}s")
print(f"\n⚠️  If tests make 100 calls: {avg_time * 100:.1f}s ({avg_time * 100 / 60:.1f} minutes!)")
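
For short calls like these, `time.perf_counter()` is usually a better clock than `time.time()`, and a warm-up run plus the median can reduce noise. The helper below is a minimal sketch under those assumptions. `time_calls` and `fake_inference` are hypothetical names; in this notebook you would pass something like `lambda: model.predict_proba(X_test)` instead of the fake workload.

```python
import statistics
import time


def time_calls(fn, n_calls=10, warmup=1):
    """Return per-call durations in seconds for n_calls invocations of fn."""
    for _ in range(warmup):
        fn()  # warm-up runs are not recorded
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_inference():
    """Hypothetical workload standing in for model.predict_proba(X_test)."""
    return sum(i * i for i in range(10_000))


durations = time_calls(fake_inference, n_calls=10)
print(f"median: {statistics.median(durations):.6f}s, total: {sum(durations):.6f}s")
```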

<a id="precompute"></a>
## 4. ⚡ Pre-compute Probabilities

### Compute probabilities ONCE and save to DataFrame

In [None]:
print("⚡ PRE-COMPUTING PROBABILITIES...\n")

# Compute probabilities once
start = time.time()

# Get all probabilities (10 classes)
all_probs = model.predict_proba(X)

# Add to dataframe
df_with_probs = df.copy()
for i in range(10):  # 10 classes (0-9)
    df_with_probs[f'prob_{i}'] = all_probs[:, i]

precompute_time = time.time() - start

print("✅ Probabilities computed and saved!")
print(f"   Time: {precompute_time:.3f}s (one-time cost)")
print(f"\n   New columns added:")
print(f"   {[col for col in df_with_probs.columns if col.startswith('prob_')]}")

# Show example
print(f"\n   Example (first sample):")
prob_cols = [col for col in df_with_probs.columns if col.startswith('prob_')]
print(df_with_probs[['target'] + prob_cols].head(1))
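
The loop above hard-codes 10 classes. A more general variant can derive the column names from the fitted model's `classes_` attribute, which scikit-learn classifiers expose. The sketch below assumes that attribute and a `predict_proba` method; `add_prob_columns` and `StubModel` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def add_prob_columns(df, model, feature_cols, prefix="prob_"):
    """Return (copy of df with one probability column per class, column names)."""
    probs = model.predict_proba(df[feature_cols])  # single inference pass
    out = df.copy()
    names = [f"{prefix}{label}" for label in model.classes_]
    for j, name in enumerate(names):
        out[name] = probs[:, j]
    return out, names


class StubModel:
    """Hypothetical stand-in exposing the scikit-learn-style attributes used above."""

    classes_ = np.array([0, 1, 2])

    def predict_proba(self, X):
        raw = np.abs(np.asarray(X, dtype=float)) + 1.0
        return raw / raw.sum(axis=1, keepdims=True)  # rows sum to 1


demo = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [0.5, 0.5], "c": [3.0, 1.0], "target": [2, 0]}
)
demo_with_probs, prob_names = add_prob_columns(demo, StubModel(), ["a", "b", "c"])
print(prob_names)
print(demo_with_probs[prob_names].sum(axis=1).tolist())
```

Because the input frame is copied, the original data stays untouched, and the returned name list can be passed on as `prob_cols`.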

<a id="probcols"></a>
## 5. 🎯 Use prob_cols Parameter

### Create DBDataset with pre-computed probabilities

In [None]:
print("🎯 METHOD 2: Using prob_cols (optimized)\n")

# Measure time
start = time.time()

# Create DBDataset with prob_cols
prob_cols = [f'prob_{i}' for i in range(10)]

dataset_with_probs = DBDataset(
    data=df_with_probs,
    target_column='target',
    prob_cols=prob_cols,  # ← Use pre-computed probabilities!
    test_size=0.2,
    random_state=RANDOM_STATE
)

time_with_probs = time.time() - start

print("✅ DBDataset created with prob_cols")
print(f"   Time: {time_with_probs:.3f}s")
print("\n   What happened:")
print("   • NO model.predict() calls!")
print("   • Just read prob_cols from DataFrame")
print("   • Instant predictions!")

### Verify predictions are correct

In [None]:
# Verify that both methods give the same predictions, if DBDataset exposes them
if hasattr(dataset_with_model, 'test_predictions') and hasattr(dataset_with_probs, 'test_predictions'):
    pred1 = dataset_with_model.test_predictions
    pred2 = dataset_with_probs.test_predictions

    match = np.allclose(pred1, pred2)
    print("✅ VERIFICATION: Are the predictions identical?")
    print(f"   All predictions match: {match}")
    if match:
        print("\n   Both methods produce the same results!")
    else:
        print("\n   ⚠️ Predictions differ: check that prob_cols came from the same model and data")
else:
    print("💡 Predictions could not be compared here: 'test_predictions' is not available on both datasets")
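
A check that does not depend on DBDataset attributes is to recover labels from the stored probabilities and compare them with `model.predict`. For many scikit-learn classifiers, including random forests, the predicted label is the class with the highest probability. The sketch below uses small hypothetical arrays in place of the notebook's data.

```python
import numpy as np

classes = np.array([0, 1, 2])             # class labels, in column order
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])                                         # hypothetical stored probabilities
model_predictions = np.array([0, 2, 1])    # hypothetical output of model.predict

# Pick the most probable class per row and map the index back to its label.
labels_from_probs = classes[np.argmax(probs, axis=1)]
matches = np.array_equal(labels_from_probs, model_predictions)
print(f"Labels recovered from probabilities match: {matches}")
```

In this notebook the same comparison would use `df_with_probs[prob_cols].to_numpy()`, `model.classes_`, and `model.predict(X)`.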

<a id="comparison"></a>
## 6. 📊 Performance Comparison

### Benchmark: Direct comparison

In [None]:
print("📊 PERFORMANCE COMPARISON\n")
print("=" * 70)

comparison = pd.DataFrame({
    'Method': ['With Model', 'With prob_cols', 'Speedup'],
    'DBDataset Creation': [
        f"{time_with_model:.3f}s",
        f"{time_with_probs:.3f}s",
        f"{time_with_model / time_with_probs:.1f}x"
    ],
    'Multiple Calls (10x)': [
        f"{total_time_baseline:.3f}s",
        "~0.001s (assumed, not measured)",
        f"~{total_time_baseline / 0.001:.0f}x (estimate)"
    ]
})

display(comparison)

print("\n💡 Key Insights:")
print(f"   • DBDataset creation: {time_with_model / time_with_probs:.1f}x faster")
print(f"   • Multiple predictions: ~{total_time_baseline / 0.001:.0f}x faster (estimate, assuming a 0.001s column read)")
print(f"   • One-time cost: {precompute_time:.3f}s (amortized across all tests)")
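
The `0.001` figure in the comparison is a placeholder, so it is worth measuring the column read directly. The sketch below times repeated reads of probability columns from a DataFrame. It builds a random frame with the same shape as the Digits probabilities (1797 rows, 10 classes) so that it runs on its own; in the notebook you would time `df_with_probs[prob_cols].to_numpy()` instead.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_classes = 1797, 10
raw = rng.random((n_rows, n_classes))
prob_cols = [f"prob_{i}" for i in range(n_classes)]
# Hypothetical stand-in for the stored probabilities; rows are normalized to sum to 1.
demo = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True), columns=prob_cols)

n_reads = 10
start = time.perf_counter()
for _ in range(n_reads):
    _ = demo[prob_cols].to_numpy()  # what "reading prob_cols" amounts to
read_time = time.perf_counter() - start

print(f"{n_reads} column reads took {read_time:.6f}s")
```

Replacing the hard-coded value with a measured `read_time` keeps the speedup figure tied to what actually ran on your machine.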

### Visualize Speedup

In [None]:
# Bar chart comparison
methods = ['With Model\n(traditional)', 'With prob_cols\n(optimized)']
times_comparison = [time_with_model, time_with_probs]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Chart 1: DBDataset creation time
colors = ['coral', 'lightgreen']
bars = axes[0].bar(methods, times_comparison, color=colors, edgecolor='black', alpha=0.8)
axes[0].set_ylabel('Time (seconds)', fontsize=11, fontweight='bold')
axes[0].set_title('DBDataset Creation Time', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add values on bars
for i, (bar, time_val) in enumerate(zip(bars, times_comparison)):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{time_val:.3f}s', ha='center', va='bottom', fontweight='bold')

# Chart 2: Multiple calls simulation
multiple_times = [total_time_baseline, 0.001]  # 0.001s is an assumed placeholder for reading stored columns, not a measurement
bars2 = axes[1].bar(methods, multiple_times, color=colors, edgecolor='black', alpha=0.8)
axes[1].set_ylabel('Time (seconds)', fontsize=11, fontweight='bold')
axes[1].set_title('10 Prediction Calls', fontsize=13, fontweight='bold')
axes[1].set_yscale('log')  # Log scale to show difference
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🚀 The green bar shows MASSIVE speedup!")

<a id="when"></a>
## 7. 🤔 When to Use prob_cols?

### Decision Matrix

| Scenario | Use Model | Use prob_cols |
|----------|-----------|---------------|
| **Quick experimentation** | ✅ Easier | ❌ Overkill |
| **Small dataset (< 1K)** | ✅ Fast enough | ❌ Not needed |
| **Large dataset (> 10K)** | ⚠️ Slow | ✅ Recommended |
| **Heavy model (GPU, ensemble)** | ⚠️ Very slow | ✅ Highly recommended |
| **Multiple tests** | ❌ Redundant calls | ✅ Compute once! |
| **Production pipeline** | ❌ Bottleneck | ✅ Efficient |

### Use prob_cols when:

- ✅ **Large datasets** (> 10K samples)
- ✅ **Heavy models** (deep neural networks, large ensembles, GPU models)
- ✅ **Multiple validation tests** (robustness, uncertainty, etc.)
- ✅ **Repeated analyses** (tuning, experimentation)
- ✅ **Production pipelines** (automated validation)
- ✅ **Limited compute** (save CPU/GPU time)

### Don't use prob_cols when:

- ❌ **Quick prototyping** (model in memory is simpler)
- ❌ **Small datasets** (< 1K samples, speed difference negligible)
- ❌ **Fast models** (linear models, small trees)
- ❌ **One-time analysis** (not worth the setup)
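
These rules of thumb come down to a break-even comparison between calling the model every time and paying for one inference pass plus cheap lookups. The helper below is a deliberately simple cost model, not part of DeepBridge; `precompute_saves_time` is a hypothetical name, and the inputs are whatever you measure or estimate for your own model.

```python
def precompute_saves_time(seconds_per_call, expected_calls, read_seconds=0.0):
    """Return (worth_it, seconds_saved) under a simple cost model.

    Direct use costs seconds_per_call * expected_calls.
    Pre-computing costs one call plus read_seconds for each later lookup.
    """
    direct = seconds_per_call * expected_calls
    precomputed = seconds_per_call + read_seconds * expected_calls
    saved = direct - precomputed
    return saved > 0, saved


# Heavy model, many tests: pre-computing pays off clearly.
heavy = precompute_saves_time(seconds_per_call=5.0, expected_calls=100, read_seconds=0.001)
# Fast model, single analysis: nothing to gain.
one_off = precompute_saves_time(seconds_per_call=0.01, expected_calls=1, read_seconds=0.001)

print(f"Heavy model, 100 calls: worth it = {heavy[0]}, saved = {heavy[1]:.1f}s")
print(f"Fast model, 1 call:     worth it = {one_off[0]}, saved = {one_off[1]:.3f}s")
```

The model ignores the effort of managing an extra file, which is why the setup is not worth it for a one-time analysis even when the time difference is close to zero.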

<a id="practices"></a>
## 8. 💡 Best Practices

### 1. Save probabilities to file

In [None]:
# Save DataFrame with probabilities for reuse
# (to_parquet needs a parquet engine such as pyarrow or fastparquet installed)
output_path = '/tmp/data_with_probabilities.parquet'

df_with_probs.to_parquet(output_path, index=False)
print(f"✅ Data with probabilities saved: {output_path}")

# Later: load the file and reuse the stored probabilities
df_loaded = pd.read_parquet(output_path)
print(f"✅ Loaded back: {df_loaded.shape}")
print("\n💡 Now you can create a DBDataset without re-running the model!")
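
Stored probabilities can go stale or get damaged, for example when a column is dropped or the file comes from a different model. A quick sanity check after loading can catch this before the data reaches DBDataset. The function below is a sketch of such a check; `validate_prob_columns` is a hypothetical helper, and the two small frames stand in for a loaded file.

```python
import numpy as np
import pandas as pd


def validate_prob_columns(df, prob_cols, atol=1e-6):
    """Raise ValueError if the probability columns look malformed, else return True."""
    missing = [c for c in prob_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing probability columns: {missing}")
    values = df[prob_cols].to_numpy(dtype=float)
    if np.isnan(values).any():
        raise ValueError("Probability columns contain NaN values")
    if (values < 0).any() or (values > 1).any():
        raise ValueError("Probabilities must lie in [0, 1]")
    if not np.allclose(values.sum(axis=1), 1.0, atol=atol):
        raise ValueError("Each row of probabilities should sum to 1")
    return True


good = pd.DataFrame({"prob_0": [0.2, 0.9], "prob_1": [0.8, 0.1]})
bad = pd.DataFrame({"prob_0": [0.2, 0.9], "prob_1": [0.5, 0.1]})  # first row sums to 0.7

print(validate_prob_columns(good, ["prob_0", "prob_1"]))
try:
    validate_prob_columns(bad, ["prob_0", "prob_1"])
except ValueError as err:
    print(f"Rejected: {err}")
```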

### 2. Naming convention

In [None]:
print("📋 NAMING CONVENTIONS\n")
print("Good naming (recommended):")
print("   • Binary: ['prob_0', 'prob_1']")
print("   • Multiclass (N classes): ['prob_0', 'prob_1', ..., 'prob_{N-1}']")
print("   • Semantic: ['prob_negative', 'prob_positive']")
print("\nBad naming (avoid):")
print("   ❌ ['p0', 'p1'] - not clear")
print("   ❌ ['pred_0', 'pred_1'] - confusing with predictions")
print("   ❌ Mixed names - inconsistent")
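
Generating the names from the class labels, rather than typing them by hand, keeps them consistent and in class order. The helper below is a small sketch; `prob_column_names` is a hypothetical name.

```python
def prob_column_names(class_labels, prefix="prob_"):
    """Build one column name per class label, preserving the given order."""
    return [f"{prefix}{label}" for label in class_labels]


numeric_names = prob_column_names(range(3))
semantic_names = prob_column_names(["negative", "positive"])

print(numeric_names)   # ['prob_0', 'prob_1', 'prob_2']
print(semantic_names)  # ['prob_negative', 'prob_positive']
```

For the Digits model in this notebook, `prob_column_names(model.classes_)` would give `prob_0` through `prob_9`.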

### 3. Workflow for large projects

In [None]:
print("🔄 RECOMMENDED WORKFLOW FOR LARGE PROJECTS\n")
print("Step 1: Train model")
print("   model = train_your_model(...)")
print("\nStep 2: Compute probabilities once")
print("   probs = model.predict_proba(X)")
print("   for i in range(n_classes):")
print("       df[f'prob_{i}'] = probs[:, i]")
print("\nStep 3: Save to disk")
print("   df.to_parquet('data_with_probs.parquet')")
print("\nStep 4: Use prob_cols for all analyses")
print("   dataset = DBDataset(data=df, target_column='target', prob_cols=['prob_0', 'prob_1', ...])")
print("\n💡 Pay the setup cost once, reuse it for as long as the model and data stay the same!")

<a id="conclusion"></a>
## 9. 🎉 Conclusion

### What You Learned

In this notebook, you learned:
- ✅ **The bottleneck** - Model inference can be very slow
- ✅ **The solution** - Pre-compute probabilities, use prob_cols
- ✅ **How to do it** - Simple: add prob columns, use prob_cols parameter
- ✅ **Performance gains** - Often 10-100x speedup for multiple tests, depending on model cost and number of calls
- ✅ **When to use** - Large datasets, heavy models, multiple tests
- ✅ **Best practices** - Save to file, use good naming, workflow

### Key Takeaways

1. ⚡ **Massive speedup** - Often 10-100x faster for validation tests
2. 💰 **One-time cost** - Compute probabilities once, reuse while model and data are unchanged
3. 🎯 **Same results** - No difference in predictions, as long as the stored probabilities come from the same model and data
4. 🚀 **Scale better** - Essential for large datasets and heavy models
5. 💾 **Save to disk** - Reuse across sessions
6. 📊 **Production ready** - Efficient pipelines

### When to Use

**Always use prob_cols for:**
- Large datasets (> 10K samples)
- Heavy models (neural networks, large ensembles)
- Multiple validation tests
- Production pipelines

**Use model directly for:**
- Quick experiments
- Small datasets (< 1K)
- Fast models
- One-time analyses

### Real-world Impact

```
Before: 10 minutes of validation tests
After:  10 seconds of validation tests

Savings: ~98% time reduction (60x faster)!
```

---

### Notebook Metrics

```
📊 Dataset: Digits (1797 samples, 10 classes)
🤖 Model: RandomForestClassifier (200 trees)
⚡ Speedup: ~10-100x for multiple tests
💾 Storage: Minimal (1 column per class)
⏱️ Time: ~15 minutes
```

---

<div style="background-color: #d4edda; padding: 15px; border-radius: 5px; border-left: 5px solid #28a745;">
<b>✅ Pro Tip:</b> For very large datasets (millions of rows), consider using <code>parquet</code> format with compression - it's fast and space-efficient!
</div>

---

**Remember: Optimize where it matters, keep it simple where it doesn't!** ⚡