````xml
<VSCode.Cell language="markdown">
# Neural Execution Risk Predictor
## End-to-End Deep Learning System

**Project Goal:** Build a Deep Learning system that predicts runtime execution risk of autonomous agents before execution.

**Risk Levels:**
- **LOW_RISK (0)**: Safe to execute with normal limits
- **MEDIUM_RISK (1)**: Execute with moderate restrictions
- **HIGH_RISK (2)**: Execute with tight limits or block entirely
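
Because the three levels are ordered, it can help to treat them as ordinal values so that "predicted lower than actual" is an explicit comparison. The sketch below is illustrative only: `RiskLevel` and `is_underestimate` are hypothetical names and are not part of the project code.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Ordinal risk classes; values match the 0/1/2 codes listed above."""
    LOW_RISK = 0
    MEDIUM_RISK = 1
    HIGH_RISK = 2


def is_underestimate(predicted: RiskLevel, actual: RiskLevel) -> bool:
    """True when the predicted level is below the actual level."""
    return predicted < actual


print(RiskLevel(2).name)
print(is_underestimate(RiskLevel.LOW_RISK, RiskLevel.HIGH_RISK))
```

Using an `IntEnum` keeps the integer codes usable as array indices while making the ordering readable in guard logic.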

**Tech Stack:**
- TensorFlow 2.x
- Structured/tabular features (NO NLP)
- FastAPI for serving
- Production-ready and explainable
</VSCode.Cell>

<VSCode.Cell language="code">
# Import Required Libraries
import os
import sys
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    confusion_matrix, classification_report
)
from sklearn.inspection import permutation_importance

# Process Mining
import pm4py

print(f"TensorFlow Version: {tf.__version__}")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# GPU Configuration
print("\nGPU Available:", tf.config.list_physical_devices('GPU'))
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 1. Data Collection & Feature Engineering

We use a **hybrid dataset** combining:
1. **BPI Challenge 2012** event logs (real execution traces)
2. **Synthetic agent execution plans** (with explicit risk features)

### Feature Schema (9 input features):
- `num_steps`: Number of execution steps
- `num_tools`: Number of distinct tools used
- `tool_diversity`: Diversity of tools
- `has_high_risk_tool`: Boolean flag for risky tools (0/1)
- `est_tokens`: Estimated token budget
- `max_retries`: Maximum retry attempts
- `sequential_tool_calls`: Consecutive repeated tool calls
- `plan_depth`: Nesting depth of execution plan
- `time_limit_sec`: Time limit for execution

### Target:
- `failure_label`: 0=LOW_RISK, 1=MEDIUM_RISK, 2=HIGH_RISK
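
As a rough illustration of the schema above, one input row can be modelled as a small record type that fixes the column order. This is a sketch under assumptions: `PlanFeatures` and `to_vector` are hypothetical names, the field types are guesses from the descriptions, and the order is assumed to match the order in which the features are listed.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class PlanFeatures:
    """One row of model input; field order follows the feature schema."""
    num_steps: int
    num_tools: int
    tool_diversity: float
    has_high_risk_tool: int  # 0 or 1
    est_tokens: int
    max_retries: int
    sequential_tool_calls: int
    plan_depth: int
    time_limit_sec: int

    def __post_init__(self):
        # Reject anything other than the 0/1 encoding of the boolean flag
        if self.has_high_risk_tool not in (0, 1):
            raise ValueError("has_high_risk_tool must be 0 or 1")

    def to_vector(self):
        """Return the features as a list of floats in schema order."""
        return [float(v) for v in astuple(self)]


example = PlanFeatures(3, 2, 2, 0, 1500, 1, 0, 1, 60)
print(example.to_vector())
```

Keeping the order in one place reduces the chance that training and inference code disagree about which column is which.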
</VSCode.Cell>

<VSCode.Cell language="code">
# Define project paths
PROJECT_ROOT = r"D:\JAVA\CODE\PYTHON\ML\DL\Neural Execution Risk Predictor"
DATA_DIR = os.path.join(PROJECT_ROOT, "data")
SCRIPTS_DIR = os.path.join(PROJECT_ROOT, "scripts")
MODEL_DIR = os.path.join(PROJECT_ROOT, "model")
REPORTS_DIR = os.path.join(PROJECT_ROOT, "reports")

# Create directories if they don't exist
for dir_path in [DATA_DIR, MODEL_DIR, REPORTS_DIR]:
    os.makedirs(dir_path, exist_ok=True)

print("Project directories ready!")
</VSCode.Cell>

<VSCode.Cell language="markdown">
### 1.1 Extract Features from BPI Challenge 2012 XES File
</VSCode.Cell>

<VSCode.Cell language="code">
# Add scripts directory to path
sys.path.append(SCRIPTS_DIR)

from extract_bpi_features import extract_bpi_features

# Extract BPI features
xes_file_path = os.path.join(PROJECT_ROOT, "new_BPI_Challenge_2012.xes")
bpi_csv_path = os.path.join(DATA_DIR, "bpi_features.csv")

if not os.path.exists(bpi_csv_path):
    print("Extracting features from BPI Challenge 2012 XES file...")
    df_bpi = extract_bpi_features(xes_file_path, bpi_csv_path)
else:
    print("Loading existing BPI features...")
    df_bpi = pd.read_csv(bpi_csv_path)

print(f"\nBPI Dataset Shape: {df_bpi.shape}")
print("\nFirst 5 rows:")
print(df_bpi.head())
</VSCode.Cell>

<VSCode.Cell language="markdown">
### 1.2 Generate Synthetic Agent Execution Plans
</VSCode.Cell>

<VSCode.Cell language="code">
from generate_synthetic_plans import save_synthetic_dataset

# Generate synthetic plans
synthetic_csv_path = os.path.join(DATA_DIR, "synthetic_plans.csv")

if not os.path.exists(synthetic_csv_path):
    print("Generating synthetic execution plans...")
    df_synthetic = save_synthetic_dataset(
        synthetic_csv_path, 
        num_samples=2000, 
        include_edge_cases=True
    )
else:
    print("Loading existing synthetic plans...")
    df_synthetic = pd.read_csv(synthetic_csv_path)

print(f"\nSynthetic Dataset Shape: {df_synthetic.shape}")
print("\nFirst 5 rows:")
print(df_synthetic.head())
</VSCode.Cell>

<VSCode.Cell language="markdown">
### 1.3 Combine Datasets into Hybrid Dataset
</VSCode.Cell>

<VSCode.Cell language="code">
# Remove ID columns for combining
df_bpi_features = df_bpi.drop(columns=['case_id'])
df_synthetic_features = df_synthetic.drop(columns=['plan_id'])

# Combine datasets
df_combined = pd.concat([df_bpi_features, df_synthetic_features], ignore_index=True)

# Shuffle
df_combined = df_combined.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Combined Dataset Shape: {df_combined.shape}")
print(f"\nTotal Samples: {len(df_combined)}")

# Check class distribution
print("\n" + "="*60)
print("RISK LABEL DISTRIBUTION")
print("="*60)
label_counts = df_combined['failure_label'].value_counts().sort_index()
print(label_counts)
print(f"\nLOW_RISK (0): {label_counts[0]} ({label_counts[0]/len(df_combined)*100:.1f}%)")
print(f"MEDIUM_RISK (1): {label_counts[1]} ({label_counts[1]/len(df_combined)*100:.1f}%)")
print(f"HIGH_RISK (2): {label_counts[2]} ({label_counts[2]/len(df_combined)*100:.1f}%)")

# Save combined dataset
combined_csv_path = os.path.join(DATA_DIR, "execution_risk_dataset.csv")
df_combined.to_csv(combined_csv_path, index=False)
print(f"\nSaved combined dataset to: {combined_csv_path}")
</VSCode.Cell>

<VSCode.Cell language="markdown">
### 1.4 Exploratory Data Analysis
</VSCode.Cell>

<VSCode.Cell language="code">
# Display summary statistics
print("="*60)
print("FEATURE SUMMARY STATISTICS")
print("="*60)
print(df_combined.describe())

# Check for missing values
print("\n" + "="*60)
print("MISSING VALUES")
print("="*60)
print(df_combined.isnull().sum())
</VSCode.Cell>

<VSCode.Cell language="code">
# Visualize feature distributions by risk level
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

feature_cols = [col for col in df_combined.columns if col != 'failure_label']

for idx, col in enumerate(feature_cols):
    ax = axes[idx]
    for label in [0, 1, 2]:
        data = df_combined[df_combined['failure_label'] == label][col]
        ax.hist(data, alpha=0.5, label=f'Risk {label}', bins=20)
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax.set_title(f'Distribution of {col}')
    ax.legend()

plt.tight_layout()
plt.savefig(os.path.join(REPORTS_DIR, 'feature_distributions.png'), dpi=300, bbox_inches='tight')
plt.show()
print("Feature distribution plot saved!")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 2. Data Preprocessing

### Requirements:
1. Convert boolean features to 0/1 (already done)
2. Normalize numeric features using StandardScaler
3. Do NOT scale labels
4. Split into train/validation/test sets
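
The arithmetic behind steps 2 and 4 can be checked by hand. The sketch below uses toy numbers (not project data) to show how a 15% test hold-out followed by a 17.6% validation hold-out gives roughly 70/15/15, and how standardization reuses the training mean and population standard deviation (the `ddof=0` convention that `StandardScaler` follows) for validation data.

```python
from statistics import fmean, pstdev

# Two-stage split: hold out 15% for test, then 17.6% of the remaining 85%.
test_frac = 0.15
val_frac = 0.176 * (1 - test_frac)
train_frac = 1 - test_frac - val_frac
print(f"train={train_frac:.3f} val={val_frac:.3f} test={test_frac:.3f}")

# Standardization: z = (x - mean) / std, with statistics from training data only.
train_tokens = [1500.0, 6000.0, 15000.0]  # toy values, not real data
val_tokens = [3000.0]

mu = fmean(train_tokens)
sigma = pstdev(train_tokens)  # population std (ddof=0)

train_scaled = [(x - mu) / sigma for x in train_tokens]
val_scaled = [(x - mu) / sigma for x in val_tokens]  # reuse training statistics
print(train_scaled, val_scaled)
```

Fitting the scaler on the training split alone keeps validation and test statistics from leaking into the model's inputs.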
</VSCode.Cell>

<VSCode.Cell language="code">
# Separate features and labels
X = df_combined.drop(columns=['failure_label']).values
y = df_combined['failure_label'].values

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")

# Feature names for later use
feature_names = [col for col in df_combined.columns if col != 'failure_label']
print(f"\nFeature names: {feature_names}")
</VSCode.Cell>

<VSCode.Cell language="code">
# Split data: 70% train, 15% validation, 15% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp  # 0.176 of 0.85 ≈ 0.15
)

print("Dataset Split:")
print(f"  Training: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Validation: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"  Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Check label distribution in each set
print("\nLabel Distribution in Training Set:")
print(pd.Series(y_train).value_counts().sort_index())
print("\nLabel Distribution in Validation Set:")
print(pd.Series(y_val).value_counts().sort_index())
print("\nLabel Distribution in Test Set:")
print(pd.Series(y_test).value_counts().sort_index())
</VSCode.Cell>

<VSCode.Cell language="code">
# Normalize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("Feature Normalization Complete!")
print(f"  Mean of training features: {X_train_scaled.mean(axis=0)}")
print(f"  Std of training features: {X_train_scaled.std(axis=0)}")

# Save scaler for later use in API
import joblib
scaler_path = os.path.join(MODEL_DIR, 'scaler.joblib')
joblib.dump(scaler, scaler_path)
print(f"\nScaler saved to: {scaler_path}")
</VSCode.Cell>

<VSCode.Cell language="code">
# Convert labels to categorical (one-hot encoding) for neural network
y_train_cat = keras.utils.to_categorical(y_train, num_classes=3)
y_val_cat = keras.utils.to_categorical(y_val, num_classes=3)
y_test_cat = keras.utils.to_categorical(y_test, num_classes=3)

print("Label Encoding:")
print(f"  Original label shape: {y_train.shape}")
print(f"  Categorical label shape: {y_train_cat.shape}")
print(f"\nExample:")
print(f"  Original: {y_train[0]}")
print(f"  Categorical: {y_train_cat[0]}")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 3. Model Architecture

### Locked Architecture:
```
Input (9 features)
  ↓
Dense(64, ReLU)
  ↓
Dropout(0.2)
  ↓
Dense(32, ReLU)
  ↓
Dense(3, Softmax)
```
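
As a sanity check on the diagram above, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output, and dropout adds none. The helper name `dense_params` is hypothetical, and the count assumes exactly 9 input columns.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [9, 64, 32, 3]  # input, hidden 1, hidden 2, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [640, 2080, 99]
print(total)      # 2819
```

If the model summary reports a different total, the most likely cause is a dataset with more or fewer than 9 feature columns.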

### Training Configuration:
- Optimizer: Adam (lr=0.001)
- Loss: categorical_crossentropy
- Batch size: 32
- Epochs: 30
- Callbacks: EarlyStopping (patience=5)
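
The patience rule in the last bullet can be sketched in plain Python. This is a simplified illustration of the idea (stop after 5 epochs without a new minimum in validation loss, and keep the best epoch), not the Keras implementation itself; `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs pass without a new
    minimum in validation loss; best_epoch is the epoch whose weights
    would be kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


# Toy validation-loss curve (made up for illustration)
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65]
print(early_stopping_epoch(losses))
```

With a 30-epoch budget, this rule means training may end well before epoch 30 when validation loss plateaus.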
</VSCode.Cell>

<VSCode.Cell language="code">
# Build the neural network model
def build_model(input_dim=9):
    """
    Build the Neural Execution Risk Predictor model
    
    Architecture:
    - Input layer (9 features)
    - Dense(64, ReLU)
    - Dropout(0.2)
    - Dense(32, ReLU)
    - Output Dense(3, Softmax)
    """
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(64, activation='relu', name='hidden_layer_1'),
        layers.Dropout(0.2, name='dropout'),
        layers.Dense(32, activation='relu', name='hidden_layer_2'),
        layers.Dense(3, activation='softmax', name='output_layer')
    ], name='NeuralExecutionRiskPredictor')
    
    return model

# Create model
model = build_model(input_dim=X_train_scaled.shape[1])

# Display model architecture
model.summary()
</VSCode.Cell>

<VSCode.Cell language="code">
# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

print("Model compiled successfully!")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 4. Model Training
</VSCode.Cell>

<VSCode.Cell language="code">
# Define callbacks
early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,
    verbose=1
)

# Train the model
print("Starting model training...\n")

history = model.fit(
    X_train_scaled, y_train_cat,
    validation_data=(X_val_scaled, y_val_cat),
    epochs=30,
    batch_size=32,
    callbacks=[early_stopping],
    verbose=1
)

print("\nTraining complete!")
</VSCode.Cell>

<VSCode.Cell language="code">
# Save the trained model
model_path = os.path.join(MODEL_DIR, 'risk_model.h5')
model.save(model_path)
print(f"Model saved to: {model_path}")

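# Also save in SavedModel format for production
saved_model_dir = os.path.join(MODEL_DIR, 'risk_model_saved')
if hasattr(model, 'export'):
    # Newer Keras releases (Keras 3) reject save_format='tf'; export() writes a SavedModel
    model.export(saved_model_dir)
else:
    model.save(saved_model_dir, save_format='tf')
print(f"SavedModel format saved to: {saved_model_dir}")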
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 5. Model Evaluation
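
The per-class metrics reported in this section follow directly from the confusion matrix: precision divides the diagonal entry by its column sum, recall divides it by its row sum. The sketch below shows that calculation on a toy matrix with made-up counts; `per_class_precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_precision_recall(cm):
    """cm[i][j] = count of samples with actual class i predicted as class j."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


# Toy matrix (made-up counts), rows = actual, columns = predicted
toy_cm = [
    [50, 5, 0],
    [4, 30, 6],
    [1, 4, 20],
]
metrics = per_class_precision_recall(toy_cm)
for name, (p, r) in zip(["LOW", "MEDIUM", "HIGH"], metrics):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

For a risk predictor, recall on the HIGH class is usually the number to watch, since it measures how many truly high-risk plans were caught.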
</VSCode.Cell>

<VSCode.Cell language="code">
# Evaluate on test set
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test_cat, verbose=0)

print("="*60)
print("TEST SET EVALUATION")
print("="*60)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
</VSCode.Cell>

<VSCode.Cell language="code">
# Make predictions
y_pred_probs = model.predict(X_test_scaled, verbose=0)
y_pred = np.argmax(y_pred_probs, axis=1)

# Calculate metrics per class
precision = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)

print("\n" + "="*60)
print("PER-CLASS METRICS")
print("="*60)
print(f"\n{'Class':<15} {'Precision':<12} {'Recall':<12}")
print("-" * 40)
print(f"{'LOW_RISK (0)':<15} {precision[0]:<12.4f} {recall[0]:<12.4f}")
print(f"{'MEDIUM_RISK (1)':<15} {precision[1]:<12.4f} {recall[1]:<12.4f}")
print(f"{'HIGH_RISK (2)':<15} {precision[2]:<12.4f} {recall[2]:<12.4f}")

# Overall metrics
print(f"\n{'Overall':<15} {precision.mean():<12.4f} {recall.mean():<12.4f}")
</VSCode.Cell>

<VSCode.Cell language="code">
# Detailed classification report
print("\n" + "="*60)
print("DETAILED CLASSIFICATION REPORT")
print("="*60)
target_names = ['LOW_RISK', 'MEDIUM_RISK', 'HIGH_RISK']
print(classification_report(y_test, y_pred, target_names=target_names))
</VSCode.Cell>

<VSCode.Cell language="code">
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

print("\n" + "="*60)
print("CONFUSION MATRIX")
print("="*60)
print(cm)
print("\nRows = Actual, Columns = Predicted")
</VSCode.Cell>

<VSCode.Cell language="markdown">
### Error Analysis
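
With ordered classes, every off-diagonal cell of the confusion matrix is either an over-estimate (above the diagonal) or an under-estimate (below it). A minimal sketch of that split, using a toy matrix with made-up counts and a hypothetical helper `risk_error_summary`:

```python
def risk_error_summary(cm):
    """Split off-diagonal counts into over- and under-estimates of risk.

    cm[i][j] = count of samples with actual class i predicted as class j,
    with classes ordered from lowest to highest risk.
    """
    n = len(cm)
    over = sum(cm[i][j] for i in range(n) for j in range(n) if j > i)
    under = sum(cm[i][j] for i in range(n) for j in range(n) if j < i)
    high = n - 1
    total_high = sum(cm[high])
    missed_high = total_high - cm[high][high]
    rate = missed_high / total_high if total_high else 0.0
    return {"over": over, "under": under, "missed_high_rate": rate}


toy_cm = [
    [50, 5, 0],
    [4, 30, 6],
    [1, 4, 20],
]
print(risk_error_summary(toy_cm))
```

The `missed_high_rate` value corresponds to the critical error rate: the share of truly high-risk cases predicted as LOW or MEDIUM.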
</VSCode.Cell>

<VSCode.Cell language="code">
# Error Analysis
print("="*60)
print("ERROR ANALYSIS")
print("="*60)

# False Positives: Predicted higher risk than actual
false_positives = {
    'LOW→MEDIUM': cm[0, 1],
    'LOW→HIGH': cm[0, 2],
    'MEDIUM→HIGH': cm[1, 2]
}

# False Negatives: Predicted lower risk than actual
false_negatives = {
    'MEDIUM→LOW': cm[1, 0],
    'HIGH→LOW': cm[2, 0],
    'HIGH→MEDIUM': cm[2, 1]
}

print("\nFalse Positives (over-estimation of risk):")
for transition, count in false_positives.items():
    print(f"  {transition}: {count}")

print("\nFalse Negatives (under-estimation of risk):")
for transition, count in false_negatives.items():
    print(f"  {transition}: {count}")

# Analysis
print("\n" + "-"*60)
print("INTERPRETATION:")
print("-"*60)
print("""
- False Positives: Model predicts HIGHER risk than actual
  → Impact: Agent might be unnecessarily restricted
  → Safety: SAFER (conservative approach)
  
- False Negatives: Model predicts LOWER risk than actual
  → Impact: Agent runs with insufficient guards
  → Safety: DANGEROUS (could lead to failures)
  
For a production Runtime Guard:
- False negatives (HIGH→LOW, HIGH→MEDIUM) are MORE CRITICAL
- They represent missed high-risk scenarios
- Current model should prioritize minimizing these errors
""")

# Calculate critical error rate
critical_errors = cm[2, 0] + cm[2, 1]  # HIGH predicted as LOW or MEDIUM
total_high_risk = cm[2, :].sum()
if total_high_risk > 0:
    critical_error_rate = critical_errors / total_high_risk
    print(f"\nCRITICAL ERROR RATE (missed HIGH_RISK): {critical_error_rate:.2%}")
    print(f"  → {critical_errors} out of {total_high_risk} high-risk cases were underestimated")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 6. Visualization
</VSCode.Cell>

<VSCode.Cell language="code">
# Plot 1: Training & Validation Loss
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss', linewidth=2)
plt.plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training vs Validation Loss', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

# Plot 2: Training & Validation Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
plt.plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Training vs Validation Accuracy', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(REPORTS_DIR, 'training_curves.png'), dpi=300, bbox_inches='tight')
plt.show()
print("Training curves saved!")
</VSCode.Cell>

<VSCode.Cell language="code">
# Plot: Confusion Matrix Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['LOW', 'MEDIUM', 'HIGH'],
            yticklabels=['LOW', 'MEDIUM', 'HIGH'],
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Risk Level', fontsize=12, fontweight='bold')
plt.ylabel('Actual Risk Level', fontsize=12, fontweight='bold')
plt.title('Confusion Matrix - Neural Execution Risk Predictor', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(os.path.join(REPORTS_DIR, 'confusion_matrix.png'), dpi=300, bbox_inches='tight')
plt.show()
print("Confusion matrix heatmap saved!")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 7. Feature Importance Analysis

Using **Permutation Importance** to understand which features most influence predictions.
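
The idea can be sketched without any ML library: shuffle one column at a time, re-score the model, and record how much accuracy drops. The example below is a toy illustration with made-up data and a hand-written classifier; `permutation_importance_sketch` is a hypothetical helper and is not the scikit-learn function used in this notebook.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=42):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on column 0; column 1 is noise.
data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [int(row[0] > 0) for row in X]


def toy_predict(row):
    return int(row[0] > 0)


importances = permutation_importance_sketch(toy_predict, X, y)
print(importances)
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on produces a clear drop in accuracy.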
</VSCode.Cell>

<VSCode.Cell language="code">
# Build a wrapper for permutation importance (needs sklearn-compatible predict)
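class KerasClassifierWrapper:
    def __init__(self, model):
        self.model = model
    
    def fit(self, X, y=None):
        """No-op: the Keras model is already trained.

        scikit-learn's scoring utilities check for a `fit` method
        before accepting an object as an estimator.
        """
        return self
    
    def predict(self, X):
        """Predict class labels"""
        probs = self.model.predict(X, verbose=0)
        return np.argmax(probs, axis=1)
    
    def score(self, X, y):
        """Accuracy score"""
        y_pred = self.predict(X)
        return accuracy_score(y, y_pred)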

# Wrap the model
wrapped_model = KerasClassifierWrapper(model)

# Calculate permutation importance
print("Calculating permutation importance...")
print("(This may take a minute...)\n")

perm_importance = permutation_importance(
    wrapped_model, 
    X_test_scaled, 
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=1  # run in-process; parallel workers would need to pickle the Keras model
)

# Get importance scores
importance_mean = perm_importance.importances_mean
importance_std = perm_importance.importances_std

# Create DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_mean,
    'Std': importance_std
}).sort_values('Importance', ascending=False)

print("="*60)
print("FEATURE IMPORTANCE (Permutation)")
print("="*60)
print(importance_df.to_string(index=False))
</VSCode.Cell>

<VSCode.Cell language="code">
# Plot Feature Importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], 
         xerr=importance_df['Std'], capsize=5, color='steelblue', alpha=0.8)
plt.xlabel('Importance (Drop in Accuracy)', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Feature Importance - Neural Execution Risk Predictor', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig(os.path.join(REPORTS_DIR, 'feature_importance.png'), dpi=300, bbox_inches='tight')
plt.show()
print("Feature importance plot saved!")
</VSCode.Cell>

<VSCode.Cell language="markdown">
### Feature Importance Interpretation
</VSCode.Cell>

<VSCode.Cell language="code">
print("="*60)
print("FEATURE IMPORTANCE INTERPRETATION")
print("="*60)

top_features = importance_df.head(3)['Feature'].tolist()

print(f"\nTop 3 Most Important Features:")
for i, feat in enumerate(top_features, 1):
    print(f"  {i}. {feat}")

print("\n" + "-"*60)
print("SYSTEMS ENGINEERING PERSPECTIVE:")
print("-"*60)
print("""
The most important features for predicting HIGH_RISK likely include:

1. **est_tokens**: Token budget directly correlates with computational cost
   - High token usage → resource exhaustion risk
   - Critical for LLM-based agents

2. **num_steps**: Execution complexity indicator
   - More steps → higher chance of failure propagation
   - Long execution chains are inherently riskier

3. **sequential_tool_calls**: Indicator of retry loops or stuck execution
   - Repeated tool calls suggest potential infinite loops
   - Sign of planning failures or environmental issues

4. **has_high_risk_tool**: Explicit risk flagging
   - Certain tools (file operations, network calls) are inherently risky
   - Binary indicator with high signal value

These features make sense because they capture:
- Resource consumption (tokens, time)
- Execution complexity (steps, depth)
- Failure patterns (retries, loops)
- Explicit risk markers (dangerous tools)

This aligns with production agent monitoring requirements.
""")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 8. Model Export & Metadata
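
One detail worth knowing when the metadata is read back by a serving layer: JSON object keys are always strings, so a class map keyed by integers comes back keyed by `"0"`, `"1"`, `"2"`. A minimal sketch of the round trip (the variable names here are illustrative):

```python
import json

output_classes = {0: "LOW_RISK", 1: "MEDIUM_RISK", 2: "HIGH_RISK"}

# Integer keys are converted to strings when the dict is serialized.
restored = json.loads(json.dumps({"output_classes": output_classes}))
print(restored["output_classes"])

# Convert back to integer keys before indexing with an argmax result.
class_names = {int(k): v for k, v in restored["output_classes"].items()}
print(class_names[2])
```

Converting the keys once at load time avoids a `KeyError` when looking up a predicted class index.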
</VSCode.Cell>

<VSCode.Cell language="code">
# Save feature names for API
import json

metadata = {
    'model_name': 'Neural Execution Risk Predictor',
    'version': '1.0',
    'input_features': feature_names,
    'output_classes': {
        0: 'LOW_RISK',
        1: 'MEDIUM_RISK',
        2: 'HIGH_RISK'
    },
    'architecture': {
        'layers': ['Dense(64, ReLU)', 'Dropout(0.2)', 'Dense(32, ReLU)', 'Dense(3, Softmax)'],
        'optimizer': 'Adam',
        'learning_rate': 0.001,
        'loss': 'categorical_crossentropy'
    },
    'performance': {
        'test_accuracy': float(test_accuracy),
        'test_loss': float(test_loss),
        'precision_per_class': precision.tolist(),
        'recall_per_class': recall.tolist()
    },
    'feature_importance': importance_df.to_dict('records')
}

metadata_path = os.path.join(MODEL_DIR, 'model_metadata.json')
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Model metadata saved to: {metadata_path}")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 9. Test Inference (Sample Predictions)
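
The prediction step reduces to two operations on the output layer: softmax turns three scores into probabilities that sum to 1, and argmax picks the most likely class. The sketch below uses made-up scores; `softmax` and `decode` are hypothetical helpers written out for illustration.

```python
import math

RISK_LABELS = ["LOW_RISK", "MEDIUM_RISK", "HIGH_RISK"]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(probs):
    """Return (label, probability) for the most likely class."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return RISK_LABELS[idx], probs[idx]


probs = softmax([0.2, 1.0, 2.5])  # made-up scores
label, confidence = decode(probs)
print(label, round(confidence, 3))
```

The reported "confidence" is simply the largest softmax probability; it is not a calibrated estimate of how often the prediction is correct.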
</VSCode.Cell>

<VSCode.Cell language="code">
# Test with sample inputs
test_samples = [
    {
        'name': 'Simple Safe Execution',
        'num_steps': 3,
        'num_tools': 2,
        'tool_diversity': 2,
        'has_high_risk_tool': 0,
        'est_tokens': 1500,
        'max_retries': 1,
        'sequential_tool_calls': 0,
        'plan_depth': 1,
        'time_limit_sec': 60
    },
    {
        'name': 'Moderate Complexity',
        'num_steps': 8,
        'num_tools': 4,
        'tool_diversity': 4,
        'has_high_risk_tool': 1,
        'est_tokens': 6000,
        'max_retries': 3,
        'sequential_tool_calls': 2,
        'plan_depth': 2,
        'time_limit_sec': 180
    },
    {
        'name': 'High-Risk Execution',
        'num_steps': 18,
        'num_tools': 8,
        'tool_diversity': 7,
        'has_high_risk_tool': 1,
        'est_tokens': 15000,
        'max_retries': 6,
        'sequential_tool_calls': 12,
        'plan_depth': 4,
        'time_limit_sec': 500
    }
]

print("="*60)
print("SAMPLE PREDICTIONS")
print("="*60)

for sample in test_samples:
    # Work on a copy so re-running the cell does not hit a missing 'name' key
    sample = dict(sample)
    name = sample.pop('name')
    
    # Prepare input
    input_array = np.array([[
        sample['num_steps'],
        sample['num_tools'],
        sample['tool_diversity'],
        sample['has_high_risk_tool'],
        sample['est_tokens'],
        sample['max_retries'],
        sample['sequential_tool_calls'],
        sample['plan_depth'],
        sample['time_limit_sec']
    ]])
    
    # Scale input
    input_scaled = scaler.transform(input_array)
    
    # Predict
    prediction_probs = model.predict(input_scaled, verbose=0)[0]
    predicted_class = np.argmax(prediction_probs)
    risk_score = float(prediction_probs[predicted_class])
    
    risk_labels = ['LOW_RISK', 'MEDIUM_RISK', 'HIGH_RISK']
    
    print(f"\n{name}:")
    print(f"  Input: {sample}")
    print(f"  Prediction: {risk_labels[predicted_class]}")
    print(f"  Confidence: {risk_score:.2%}")
    print(f"  Risk Distribution: LOW={prediction_probs[0]:.3f}, MED={prediction_probs[1]:.3f}, HIGH={prediction_probs[2]:.3f}")
</VSCode.Cell>

<VSCode.Cell language="markdown">
## 10. Summary & Next Steps

### What We Built:
✅ Hybrid dataset from BPI Challenge 2012 + Synthetic plans  
✅ 9-feature tabular input (NO NLP)  
✅ Deep Neural Network (64→32→3 architecture)  
✅ Training with early stopping  
✅ Comprehensive evaluation & error analysis  
✅ Feature importance analysis  
✅ Production-ready model artifacts  

### Model Performance:
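- **Test Accuracy**: see the test-set evaluation output in Section 5
- **Key Insight**: For a runtime guard, under-estimating risk (false negatives) costs more than over-estimating it, so the missed HIGH_RISK rate from the error analysis matters as much as overall accuracy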

### Integration with Agent Runtime Guard:

```python
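# Sketch of Runtime Guard integration (limit profiles are illustrative)
import numpy as np

RISK_LABELS = ['LOW_RISK', 'MEDIUM_RISK', 'HIGH_RISK']
LIMIT_PROFILES = {'LOW_RISK': 'normal', 'MEDIUM_RISK': 'moderate', 'HIGH_RISK': 'tight'}

class AgentRuntimeGuard:
    def __init__(self, risk_model, scaler, feature_names):
        self.risk_model = risk_model
        self.scaler = scaler
        self.feature_names = feature_names

    def predict_risk(self, features):
        # features: dict keyed by the training feature names
        row = np.array([[features[name] for name in self.feature_names]], dtype=float)
        probs = self.risk_model.predict(self.scaler.transform(row), verbose=0)[0]
        idx = int(np.argmax(probs))
        return RISK_LABELS[idx], float(probs[idx])

    def evaluate_plan(self, features):
        # Predict risk, then pick the guardrail profile for that level
        risk_level, risk_score = self.predict_risk(features)
        return {
            'risk_level': risk_level,
            'risk_score': risk_score,
            'limits': LIMIT_PROFILES[risk_level],
        }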
```

### Files Generated:
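- `model/risk_model.h5` - Trained model (HDF5)
- `model/risk_model_saved/` - Trained model (SavedModel format)
- `model/scaler.joblib` - Feature scaler
- `model/model_metadata.json` - Model metadata
- `data/execution_risk_dataset.csv` - Combined hybrid dataset
- `reports/*.png` - Visualizations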

### Next Steps:
1. Build FastAPI service (see `api/main.py`)
2. Dockerize the application
3. Deploy to production
4. Monitor model performance in real-time
5. Retrain with production data
</VSCode.Cell>

<VSCode.Cell language="markdown">
---
**Project Complete!** 🎉

You now have a production-ready Deep Learning system for predicting agent execution risk.
</VSCode.Cell>
````