# Complete MLOps Pipeline Demo
## End-to-End Supplier Risk Prediction System

**Purpose**: Demonstrate the complete data flow through all pipeline components:
- Data Pipeline → Feature Engineering → Model Training → Prediction
- Auditing → Explainability → NLP → Visualization → Recommendations

**Why This Matters**: Shows how all components work together in a production MLOps workflow

## 1. Setup & Imports
**What**: Import all pipeline components
**Why**: Each module handles a specific part of the ML lifecycle

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Add project root to path
BASE_DIR = Path.cwd().parent
sys.path.insert(0, str(BASE_DIR))

# Import all pipeline components
from src import data_pipeline      # Data loading & preprocessing
from src import model_pipeline     # Model training & prediction
from src import explainability     # SHAP + LLM explanations
from src import nlp_layer          # NLP features & summarization
from src import auditing           # Data quality & logging
from src import visualization      # Charts & plots
from src import recommendation     # Business recommendations
from backend import visualization_engine  # Advanced visualizations
from backend import explainability_viz    # SHAP visualizations

print("✓ All pipeline components imported successfully")

Database initialized successfully.


✓ All pipeline components imported successfully


## 2. Data Pipeline - Load & Preprocess
**What**: Load raw data and prepare for modeling
**Why**: Clean, validated data is the foundation of ML success

In [2]:
print(" STEP 1: Data Pipeline")

training_df, weekly_df = data_pipeline.load_processed_datasets()

print(f"Training data shape: {training_df.shape}")
print(f"Weekly data shape: {weekly_df.shape}")
print(f"\nColumns: {list(training_df.columns[:10])}...")
print("\n Data loaded successfully")

 STEP 1: Data Pipeline
Training data shape: (10500, 26)
Weekly data shape: (10500, 27)

Columns: ['supplier_id', 'company_name', 'region', 'industry', 'annual_revenue', 'annual_spend', 'avg_payment_delay_days', 'contract_value', 'contract_duration_months', 'past_disputes']...

 Data loaded successfully


## 3. Auditing Pipeline - Data Quality Checks
**What**: Validate data quality and log metrics
**Why**: Catch data issues before they break models (garbage in = garbage out)

In [4]:
print("\n STEP 2: Auditing Pipeline")
print("=" * 70)

# Run data quality checks
quality_report = auditing.log_data_quality(training_df)

print("\nData Quality Report:")
print(quality_report.head(10))

# Log audit event
auditing.persist_audit_log(
    event_type="pipeline_execution",
    payload={
        "stage": "data_quality_check",
        "rows": len(training_df),
        "columns": len(training_df.columns)
    }
)

print("\n✓ Data quality validated and logged")


 STEP 2: Auditing Pipeline

Data Quality Report:
                metric  value  threshold  passed
0       max_null_ratio    0.0       0.15    True
1  region_domain_check    0.0       0.00    True

✓ Data quality validated and logged
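
The report above lists a `max_null_ratio` metric checked against a 0.15 threshold. As a rough illustration of how such a check could be computed with pandas, here is a minimal sketch. The helper name `null_ratio_check` and the toy data are assumptions for this example; the actual `auditing.log_data_quality` implementation is not shown in this notebook and may differ.

```python
import pandas as pd


def null_ratio_check(df: pd.DataFrame, threshold: float = 0.15) -> pd.DataFrame:
    """Hypothetical helper: report the worst per-column null ratio."""
    # isna().mean() gives the fraction of missing values per column;
    # max() keeps the worst column.
    max_ratio = float(df.isna().mean().max())
    return pd.DataFrame(
        [{
            "metric": "max_null_ratio",
            "value": max_ratio,
            "threshold": threshold,
            "passed": max_ratio <= threshold,
        }]
    )


# Toy frame: 'credit_score' has 1 missing value out of 4 rows.
toy = pd.DataFrame({
    "region": ["EU", "NA", "EU", "APAC"],
    "credit_score": [650.0, None, 700.0, 610.0],
})
report = null_ratio_check(toy)
print(report)
```

On the toy frame the worst column is 25% null, so the check fails the 0.15 threshold; on the training data above the ratio was 0.0 and the check passed.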


## 4. Feature Engineering - Prepare Training Data
**What**: Split features (X) and target (y)
**Why**: Models need clean separation of inputs and outputs

In [6]:
print("\n STEP 3: Feature Engineering")
print("=" * 70)

X, y = data_pipeline.prepare_training_data()

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns[:10])}...")
print(f"\nTarget distribution:\n{y.value_counts()}")
print("\n✓ Features prepared for training")


 STEP 3: Feature Engineering
Features (X) shape: (10500, 9)
Target (y) shape: (10500,)

Feature columns: ['region', 'industry', 'contract_criticality', 'annual_spend', 'credit_score', 'late_ratio', 'dispute_rate', 'avg_delay', 'clause_risk_score']...

Target distribution:
risk_label
medium    3943
high      3743
low       2814
Name: count, dtype: int64

✓ Features prepared for training
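
The raw counts above are easier to judge as shares. This short sketch converts the printed counts into class proportions and an imbalance ratio; the numbers are copied from the output above, and the variable names are only illustrative.

```python
# Class counts copied from the target distribution printed above.
counts = {"medium": 3943, "high": 3743, "low": 2814}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Largest class divided by smallest class; values near 1 mean balanced classes.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The classes are moderately imbalanced (about 37.6%, 35.6% and 26.8%), which is one reason the training step reports macro F1 alongside accuracy.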


## 5. Model Pipeline - Train Models
**What**: Train Random Forest and XGBoost models
**Why**: Tree ensembles give robust predictions on tabular data and expose feature importances

In [None]:
print("\n🤖 STEP 4: Model Training")
print("=" * 70)

# Train models
artifacts = model_pipeline.train_models(X, y)

print("\nModel Training Complete:")
for artifact in artifacts:
    print(f"\n{artifact.model_name}:")
    print(f"  Macro F1: {artifact.report['macro avg']['f1-score']:.4f}")
    print(f"  Accuracy: {artifact.report['accuracy']:.4f}")
    print(f"  Model saved: models/{artifact.model_name}.joblib")

print("\n✓ Models trained and persisted")
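
The cell above reads `report['macro avg']['f1-score']`, which resembles the dictionary layout of scikit-learn's `classification_report(..., output_dict=True)`; that the pipeline uses it is an assumption. To make the metric concrete, here is a small hand-rolled macro F1 on made-up labels. The function `macro_f1` is written for this example only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only.
y_true = ["low", "low", "medium", "medium", "high", "high"]
y_pred = ["low", "medium", "medium", "medium", "high", "low"]

score = macro_f1(y_true, y_pred)
print(f"Macro F1: {score:.4f}")
```

Because every class contributes equally to the average, macro F1 penalizes a model that does well on the large classes but poorly on the smallest one.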

## 6. Prediction Pipeline - Single Supplier Prediction
**What**: Predict risk for a single supplier
**Why**: Real-time predictions for procurement decisions

In [None]:
print("\n STEP 5: Prediction Pipeline")
print("=" * 70)

# Create test supplier
test_supplier = {
    "region": "North America",
    "industry": "Manufacturing",
    "contract_criticality": "High",
    "annual_revenue": 2000000.0,
    "annual_spend": 75000.0,
    "avg_payment_delay_days": 10.0,
    "contract_value": 150000.0,
    "contract_duration_months": 12,
    "past_disputes": 2,
    "delivery_score": 70.0,
    "financial_stability_index": 60.0,
    "relationship_years": 3,
    "txn_count": 80,
    "avg_txn_amount": 3000.0,
    "avg_delay": 8.0,
    "late_ratio": 0.15,
    "dispute_rate": 0.08,
    "avg_delivery_quality": 68.0,
    "clause_risk_score": 45.0,
    "credit_score": 650
}

# Make prediction
result = model_pipeline.predict_single("random_forest", test_supplier)

print(f"\nPrediction: {result['prediction'].upper()}")
print(f"\nProbabilities:")
for risk_level, prob in result['probabilities'].items():
    print(f"  {risk_level}: {prob*100:.2f}%")

print("\n✓ Prediction generated")
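
The result dictionary carries both a predicted label and per-class probabilities. A plausible relationship between the two is sketched below: the label is the most probable class and the confidence is its probability as a percentage. Whether `predict_single` derives them this way is an assumption, and the probabilities here are made up.

```python
def top_class(probabilities):
    """Return (label, confidence_percent) for the most probable class."""
    label = max(probabilities, key=probabilities.get)
    return label, round(probabilities[label] * 100, 2)


# Made-up probabilities for illustration only.
probabilities = {"low": 0.12, "medium": 0.33, "high": 0.55}

label, confidence = top_class(probabilities)
print(f"Prediction: {label.upper()} ({confidence}%)")
```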

## 7. Explainability Pipeline - SHAP + LLM Narratives
**What**: Generate human-friendly explanations using SHAP and Ollama
**Why**: Regulatory compliance and user trust require explainable AI

In [None]:
print("\n STEP 6: Explainability Pipeline")
print("=" * 70)

# Build explanation
explanation = explainability.build_explanation(
    risk_level=result['prediction'],
    probabilities=result['probabilities'],
    shap_values=result['shap_values'],
    feature_names=result['feature_names']
)

print(f"\nRisk Level: {explanation.risk_level.upper()}")
print(f"Confidence: {explanation.confidence}%")

print(f"\nTop 5 Contributing Features:")
for feat, val in explanation.top_features[:5]:
    impact = "increases" if val > 0 else "decreases"
    print(f"  {feat}: {val:.4f} ({impact} risk)")

print(f"\nBusiness Narrative:")
print(f"  {explanation.narrative}")

print("\n✓ Explanation generated")
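
The "top contributing features" idea can be illustrated without the pipeline: pair each feature with its SHAP value, sort by magnitude, and read the sign as direction. This is a sketch with illustrative values; `rank_features` is a hypothetical helper, not part of the `explainability` module.

```python
def rank_features(feature_names, shap_values, k=5):
    """Pair names with SHAP values and keep the k largest by magnitude."""
    pairs = zip(feature_names, shap_values)
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)[:k]


# Illustrative values only.
feature_names = ["credit_score", "late_ratio", "dispute_rate", "annual_spend"]
shap_values = [-0.21, 0.34, 0.05, -0.02]

ranked = rank_features(feature_names, shap_values, k=3)
for name, value in ranked:
    direction = "increases" if value > 0 else "decreases"
    print(f"{name}: {value:+.2f} ({direction} risk)")
```

Sorting by absolute value keeps strongly protective features (negative SHAP) in the list, which a sort on the signed value would push to the bottom.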

## 8. NLP Pipeline - Feature Summarization
**What**: Use NLP to summarize SHAP features
**Why**: Translate technical features to business language
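
As a rough idea of what this step does, the sketch below turns prefixed feature names (the `numeric__` and `categorical__` prefixes also appear in the plotting code later in this notebook) into a plain-language sentence. It is a simple template approach written for illustration; the helpers `humanize` and `summarize_features` are hypothetical and do not reflect the actual `nlp_layer` API.

```python
def humanize(feature_name):
    """Turn 'numeric__late_ratio' into 'late ratio'."""
    for prefix in ("numeric__", "categorical__"):
        if feature_name.startswith(prefix):
            feature_name = feature_name[len(prefix):]
    return feature_name.replace("_", " ")


def summarize_features(top_features, k=3):
    """Build one plain-language sentence from (name, shap_value) pairs."""
    parts = []
    for name, value in top_features[:k]:
        verb = "raises" if value > 0 else "lowers"
        parts.append(f"{humanize(name)} {verb} the risk")
    return "; ".join(parts) + "."


# Illustrative (name, SHAP value) pairs.
sample_features = [("numeric__late_ratio", 0.34), ("numeric__credit_score", -0.21)]
print(summarize_features(sample_features))
```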

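## 9. Recommendation Pipeline - Actionable Insights
**What**: Generate business recommendations based on risk
**Why**: Predictions without actions are useless - provide next steps

In [None]: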
print("\n STEP 8: Recommendation Pipeline")
print("=" * 70)

# Generate recommendations
recommendations = recommendation.build_recommendations(
    risk_level=explanation.risk_level,
    top_features=explanation.top_features
)

print("\nActionable Recommendations:")
for i, reco in enumerate(recommendations, 1):
    print(f"  {i}. {reco}")

print("\n✓ Recommendations generated")
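
One simple way to turn a risk level and its top features into actions is a rule table. The sketch below is an assumption about how such a mapping might look; the `ACTIONS` table, its wording, and the `recommend` function are invented for this example and are not the logic inside `recommendation.build_recommendations`.

```python
# Hypothetical rule table: feature name -> action when the feature raises risk.
ACTIONS = {
    "late_ratio": "Tighten delivery SLAs and add late-delivery penalties.",
    "dispute_rate": "Review open disputes with the supplier's account manager.",
    "credit_score": "Request updated financial statements.",
}


def recommend(risk_level, top_features):
    """Return action strings for a risk level and (name, shap_value) pairs."""
    recos = []
    if risk_level == "high":
        recos.append("Escalate to procurement risk review.")
    for name, value in top_features:
        # Only features that push risk up (positive SHAP) trigger an action.
        if value > 0 and name in ACTIONS:
            recos.append(ACTIONS[name])
    return recos or ["No action needed; continue routine monitoring."]


demo = recommend("high", [("late_ratio", 0.34), ("credit_score", -0.21)])
for i, reco in enumerate(demo, 1):
    print(f"{i}. {reco}")
```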

## 10. Visualization Pipeline - Feature Importance
**What**: Create visual explanations of SHAP values
**Why**: Visualizations help stakeholders understand model decisions

In [None]:
print("\n📊 STEP 9: Visualization Pipeline")
print("=" * 70)

# Create subplots for feature importance
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

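# Left: SHAP bar plot
# np.abs works for both NumPy arrays and plain Python lists; built-in abs() fails on a list
abs_shap = np.abs(np.asarray(explanation.shap_values, dtype=float))
top_features = sorted(zip(explanation.feature_names, abs_shap), key=lambda x: x[1], reverse=True)[:10]
features, values = zip(*top_features)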
axes[0].barh(range(len(features)), values, color='steelblue')
axes[0].set_yticks(range(len(features)))
axes[0].set_yticklabels([f.replace('numeric__', '').replace('categorical__', '') for f in features])
axes[0].invert_yaxis()
axes[0].set_xlabel('|SHAP Value|', fontsize=12)
axes[0].set_title('Top 10 SHAP Feature Importance', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Right: Feature importance with direction
top_features_signed = sorted(zip(explanation.feature_names, explanation.shap_values), key=lambda x: abs(x[1]), reverse=True)[:10]
features_s, values_s = zip(*top_features_signed)
colors = ['red' if v > 0 else 'green' for v in values_s]
axes[1].barh(range(len(features_s)), values_s, color=colors, alpha=0.7)
axes[1].set_yticks(range(len(features_s)))
axes[1].set_yticklabels([f.replace('numeric__', '').replace('categorical__', '') for f in features_s])
axes[1].invert_yaxis()
axes[1].set_xlabel('SHAP Value (Red=Increase Risk, Green=Decrease Risk)', fontsize=12)
axes[1].set_title('Feature Impact Direction', fontsize=14, fontweight='bold')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=1)
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Feature importance visualizations displayed")

## 11. Advanced Visualization - Data Exploration
**What**: Create advanced charts for data analysis
**Why**: Understand data distributions and relationships

In [None]:
print("\n📈 STEP 10: Advanced Visualization")
print("=" * 70)

import seaborn as sns  # matplotlib.pyplot is already imported as plt in the setup cell

# Pairplot - select key numeric features
plot_cols = ['credit_score', 'annual_spend', 'late_ratio', 'dispute_rate', 'risk_label']
plot_df = training_df[plot_cols].sample(n=min(500, len(training_df)), random_state=42)

print("\nGenerating pairplot (this may take a moment)...")
g = sns.pairplot(plot_df, hue='risk_label', palette={'low': 'green', 'medium': 'orange', 'high': 'red'},
                 diag_kind='kde', plot_kws={'alpha': 0.6}, height=2.5)
# Recent seaborn versions prefer the .figure attribute over the older .fig alias
g.figure.suptitle('Feature Relationships by Risk Level', y=1.02, fontsize=16, fontweight='bold')
plt.show()

# Subplots: Heatmap and Histogram
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Left: Correlation heatmap
numeric_cols = training_df.select_dtypes(include='number').columns[:10]  # 'number' covers all int/float widths
corr = training_df[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[0], 
            cbar_kws={'label': 'Correlation'})
axes[0].set_title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')

# Right: Credit score distribution by risk
for risk in ['low', 'medium', 'high']:
    data = training_df[training_df['risk_label'] == risk]['credit_score']
    axes[1].hist(data, bins=30, alpha=0.6, label=risk.capitalize())
axes[1].set_xlabel('Credit Score', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Credit Score Distribution by Risk Level', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Advanced visualizations displayed")

## 12. Audit Trail - Log Complete Pipeline Execution
**What**: Log all pipeline steps for compliance and debugging
**Why**: Production systems need complete audit trails

In [None]:
print("\n📝 STEP 11: Audit Trail")
print("=" * 70)

# Log complete pipeline execution
auditing.persist_audit_log(
    event_type="complete_pipeline_execution",
    payload={
        "supplier_id": "DEMO_001",
        "prediction": result['prediction'],
        "confidence": explanation.confidence,
        "top_features": [f[0] for f in explanation.top_features[:3]],
        "recommendations_count": len(recommendations),
        "visualizations_generated": 4
    }
)

# Fetch recent audit events with proper column handling
try:
    recent_events = auditing.db_connector.fetch_audit_trail(limit=5)
    print("\nRecent Audit Events:")
    
    # Dynamically check which columns exist
    available_cols = ['event_type']
    if 'created_at' in recent_events.columns:
        available_cols.append('created_at')
    elif 'timestamp' in recent_events.columns:
        available_cols.append('timestamp')
    
    print(recent_events[available_cols].head())
except Exception as e:
    print(f"\nNote: Could not fetch audit trail from database: {e}")
    print("Audit events are logged to CSV at reports/audit_logs/audit_events_log.csv")

print("\n✓ Pipeline execution logged")
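
The cell above falls back to a CSV log when the database is unavailable. The sketch below shows one way an append-only CSV audit log could work, using only the standard library. The function `append_audit_event` and the column layout are assumptions for illustration; the real `auditing.persist_audit_log` may store events differently.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_audit_event(log_path, event_type, payload):
    """Append one event row to a CSV log, writing the header on first use."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["created_at", "event_type", "payload"])
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),  # UTC timestamp
            event_type,
            json.dumps(payload, sort_keys=True),     # payload as a JSON string
        ])


# Demo in a temporary directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "audit_events_log.csv"
    append_audit_event(demo_log, "pipeline_execution", {"stage": "demo", "rows": 3})
    append_audit_event(demo_log, "pipeline_execution", {"stage": "demo", "rows": 4})
    with demo_log.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))

print(f"{len(rows) - 1} events logged")
```

Serializing the payload as a JSON string keeps the CSV schema fixed even when different pipeline stages log different fields.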

## 13. Complete Pipeline Summary
**What**: Summarize entire MLOps workflow
**Why**: Show how all components integrate for production ML

In [None]:
print("\n" + "=" * 70)
print("🎉 COMPLETE MLOPS PIPELINE SUMMARY")
print("=" * 70)

summary = f"""
✓ Data Pipeline: Loaded {len(training_df)} training records
✓ Auditing: Validated data quality and logged events
✓ Feature Engineering: Prepared {X.shape[1]} features
✓ Model Training: Trained Random Forest and XGBoost
✓ Prediction: Generated {result['prediction'].upper()} risk prediction
✓ Explainability: Created SHAP + LLM narrative
✓ NLP: Summarized top {len(explanation.top_features)} features
✓ Recommendations: Generated {len(recommendations)} action items
✓ Visualization: Created 4 charts and plots
✓ Audit Trail: Logged complete pipeline execution

Pipeline Components Used:
  • data_pipeline: Data loading & preprocessing
  • model_pipeline: Training & prediction
  • explainability: SHAP + Ollama narratives
  • nlp_layer: Feature summarization
  • auditing: Quality checks & logging
  • visualization: SHAP plots
  • visualization_engine: Advanced charts
  • explainability_viz: Feature importance
  • recommendation: Business actions

This demonstrates a complete production MLOps workflow!
"""

print(summary)
print("=" * 70)