# Complete MLOps Pipeline Demo
## End-to-End Supplier Risk Prediction System

**Purpose**: Demonstrate the complete data flow through all pipeline components:
- Data Pipeline → Feature Engineering → Model Training → Prediction
- Auditing → Explainability → NLP → Visualization → Recommendations

**Why This Matters**: Shows how all components work together in a production MLOps workflow

## 1. Setup & Imports
**What**: Import all pipeline components
**Why**: Each module handles a specific part of the ML lifecycle

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Add project root to path
BASE_DIR = Path.cwd().parent
sys.path.insert(0, str(BASE_DIR))

# Import all pipeline components
from src import data_pipeline      # Data loading & preprocessing
from src import model_pipeline     # Model training & prediction
from src import explainability     # SHAP + LLM explanations
from src import nlp_layer          # NLP features & summarization
from src import auditing           # Data quality & logging
from src import visualization      # Charts & plots
from src import recommendation     # Business recommendations
from backend import visualization_engine  # Advanced visualizations
from backend import explainability_viz    # SHAP visualizations

print("✓ All pipeline components imported successfully")

Database initialized successfully.


✓ All pipeline components imported successfully


## 2. Data Pipeline - Load & Preprocess
**What**: Load raw data and prepare it for modeling
**Why**: Clean, validated data is the foundation of ML success

In [2]:
print(" STEP 1: Data Pipeline")

training_df, weekly_df = data_pipeline.load_processed_datasets()

print(f"Training data shape: {training_df.shape}")
print(f"Weekly data shape: {weekly_df.shape}")
print(f"\nColumns: {list(training_df.columns[:10])}...")
print("\n Data loaded successfully")

 STEP 1: Data Pipeline
Training data shape: (10500, 26)
Weekly data shape: (10500, 27)

Columns: ['supplier_id', 'company_name', 'region', 'industry', 'annual_revenue', 'annual_spend', 'avg_payment_delay_days', 'contract_value', 'contract_duration_months', 'past_disputes']...

 Data loaded successfully


## 3. Auditing Pipeline - Data Quality Checks
**What**: Validate data quality and log metrics
**Why**: Catch data issues before they break models (garbage in = garbage out)

In [4]:
print("\n STEP 2: Auditing Pipeline")
print("=" * 70)

# Run data quality checks
quality_report = auditing.log_data_quality(training_df)

print("\nData Quality Report:")
print(quality_report.head(10))

# Log audit event
auditing.persist_audit_log(
    event_type="pipeline_execution",
    payload={
        "stage": "data_quality_check",
        "rows": len(training_df),
        "columns": len(training_df.columns)
    }
)

print("\n✓ Data quality validated and logged")


 STEP 2: Auditing Pipeline

Data Quality Report:
                metric  value  threshold  passed
0       max_null_ratio    0.0       0.15    True
1  region_domain_check    0.0       0.00    True

✓ Data quality validated and logged
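
The report above has one row per check, with `metric`, `value`, `threshold`, and `passed` columns. As a rough sketch of how such a report could be assembled with pandas: the helper name `build_quality_report`, the `allowed_regions` set, and the pass/fail rules below are assumptions for illustration, not the actual `auditing.log_data_quality` implementation.

```python
import pandas as pd


def build_quality_report(df, allowed_regions, max_null_threshold=0.15):
    """Hypothetical helper: return one row per data quality check."""
    # Largest share of missing values found in any single column.
    max_null_ratio = float(df.isna().mean().max())

    # Share of non-null rows whose region is outside the allowed set (assumed rule).
    regions = df["region"].dropna()
    invalid_region_ratio = (
        float((~regions.isin(allowed_regions)).mean()) if len(regions) else 0.0
    )

    rows = [
        {"metric": "max_null_ratio", "value": max_null_ratio,
         "threshold": max_null_threshold,
         "passed": max_null_ratio <= max_null_threshold},
        {"metric": "region_domain_check", "value": invalid_region_ratio,
         "threshold": 0.0, "passed": invalid_region_ratio <= 0.0},
    ]
    return pd.DataFrame(rows)


# Tiny made-up frame: one missing value per column and one unknown region.
demo = pd.DataFrame({
    "region": ["North America", "Europe", "Atlantis", None],
    "credit_score": [650, 700, None, 720],
})
report = build_quality_report(demo, allowed_regions={"North America", "Europe"})
print(report)
```

On this toy frame both checks fail, which is the case the audit step is meant to surface before training starts.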


## 4. Feature Engineering - Prepare Training Data
**What**: Split features (X) and target (y)
**Why**: Models need clean separation of inputs and outputs

In [6]:
print("\n STEP 3: Feature Engineering")
print("=" * 70)

X, y = data_pipeline.prepare_training_data()

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns[:10])}...")
print(f"\nTarget distribution:\n{y.value_counts()}")
print("\n✓ Features prepared for training")


 STEP 3: Feature Engineering
Features (X) shape: (10500, 9)
Target (y) shape: (10500,)

Feature columns: ['region', 'industry', 'contract_criticality', 'annual_spend', 'credit_score', 'late_ratio', 'dispute_rate', 'avg_delay', 'clause_risk_score']...

Target distribution:
risk_label
medium    3943
high      3743
low       2814
Name: count, dtype: int64

✓ Features prepared for training
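
The three classes above are not evenly sized. A quick way to quantify that, using only the counts printed in the output (the variable names are illustrative):

```python
# Class counts copied from the target distribution printed above.
counts = {"medium": 3943, "high": 3743, "low": 2814}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The shares come out at roughly 38%, 36%, and 27%, so the imbalance is mild. Per-class metrics for `low` are probably still worth checking alongside overall accuracy.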


## 5. Model Pipeline - Train Models
**What**: Train Random Forest and XGBoost models
**Why**: Ensemble models provide robust predictions with feature importance

In [7]:
print("\n STEP 4: Model Training")
print("=" * 70)

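# Train models
# Note: train_models returns a list (an earlier run of this cell failed with
# "AttributeError: 'list' object has no attribute 'items'"), so iterate over
# it directly instead of calling .items().
artifacts = model_pipeline.train_models(X, y)

print("\nModel Training Complete:")
for artifact in artifacts:
    if isinstance(artifact, dict):
        # Assumption: each entry is a metrics dict; "model_name" is a guessed key.
        print(f"\n{artifact.get('model_name', 'unknown model')}:")
        print(f"  Accuracy: {artifact.get('accuracy', 'N/A')}")
        print(f"  Model saved: {artifact.get('model_path', 'N/A')}")
    else:
        # Unknown entry type: print it as-is rather than guess its structure.
        print(f"\n{artifact!r}")

print("\n✓ Models trained and persisted")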

## 6. Prediction Pipeline - Single Supplier Prediction
**What**: Predict risk for a single supplier
**Why**: Real-time predictions for procurement decisions

In [None]:
print("\n STEP 5: Prediction Pipeline")
print("=" * 70)

# Create test supplier
test_supplier = {
    "region": "North America",
    "industry": "Manufacturing",
    "contract_criticality": "High",
    "annual_revenue": 2000000.0,
    "annual_spend": 75000.0,
    "avg_payment_delay_days": 10.0,
    "contract_value": 150000.0,
    "contract_duration_months": 12,
    "past_disputes": 2,
    "delivery_score": 70.0,
    "financial_stability_index": 60.0,
    "relationship_years": 3,
    "txn_count": 80,
    "avg_txn_amount": 3000.0,
    "avg_delay": 8.0,
    "late_ratio": 0.15,
    "dispute_rate": 0.08,
    "avg_delivery_quality": 68.0,
    "clause_risk_score": 45.0,
    "credit_score": 650
}

# Make prediction
result = model_pipeline.predict_single("random_forest", test_supplier)

print(f"\nPrediction: {result['prediction'].upper()}")
print(f"\nProbabilities:")
for risk_level, prob in result['probabilities'].items():
    print(f"  {risk_level}: {prob*100:.2f}%")

print("\n✓ Prediction generated")
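
The test supplier above carries 20 fields, while the training matrix from the feature engineering step has 9 feature columns. Whether `predict_single` drops or transforms the extra fields is not shown here. A minimal sketch for checking a payload against the model's features before calling it (`check_payload` is a hypothetical helper, not part of `model_pipeline`):

```python
# Feature columns as printed in the feature engineering step.
MODEL_FEATURES = [
    "region", "industry", "contract_criticality", "annual_spend", "credit_score",
    "late_ratio", "dispute_rate", "avg_delay", "clause_risk_score",
]


def check_payload(payload, required=MODEL_FEATURES):
    """Hypothetical helper: list required features that are missing
    and payload keys the model does not use."""
    missing = [name for name in required if name not in payload]
    extra = sorted(set(payload) - set(required))
    return missing, extra


# Deliberately incomplete payload to show both kinds of finding.
payload = {
    "region": "North America",
    "industry": "Manufacturing",
    "credit_score": 650,
    "txn_count": 80,
}
missing, extra = check_payload(payload)
print("missing:", missing)
print("extra:", extra)
```

Run against `test_supplier`, this check would report no missing features and 11 extra keys.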

## 7. Explainability Pipeline - SHAP + LLM Narratives
**What**: Generate human-friendly explanations using SHAP and Ollama
**Why**: Regulatory compliance and user trust require explainable AI

In [None]:
print("\n STEP 6: Explainability Pipeline")
print("=" * 70)

# Build explanation
explanation = explainability.build_explanation(
    risk_level=result['prediction'],
    probabilities=result['probabilities'],
    shap_values=result['shap_values'],
    feature_names=result['feature_names']
)

print(f"\nRisk Level: {explanation.risk_level.upper()}")
print(f"Confidence: {explanation.confidence}%")

print(f"\nTop 5 Contributing Features:")
for feat, val in explanation.top_features[:5]:
    impact = "increases" if val > 0 else "decreases"
    print(f"  {feat}: {val:.4f} ({impact} risk)")

print(f"\nBusiness Narrative:")
print(f"  {explanation.narrative}")

print("\n✓ Explanation generated")
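
The cell above reads `confidence` and `top_features` from the explanation object, but how `build_explanation` derives them is not shown. One common convention is to report the highest class probability as the confidence and to rank features by absolute SHAP value. The sketch below assumes that convention and a flat list of SHAP values (one per feature); `summarize_prediction` is a hypothetical name.

```python
def summarize_prediction(probabilities, shap_values, feature_names, top_k=5):
    """Hypothetical helper: confidence is the highest class probability in
    percent; top features are the largest absolute SHAP contributions."""
    risk_level = max(probabilities, key=probabilities.get)
    confidence = round(probabilities[risk_level] * 100, 2)

    # Pair each feature with its SHAP value, then sort by magnitude, largest first.
    ranked = sorted(
        zip(feature_names, shap_values),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return risk_level, confidence, ranked[:top_k]


# Made-up numbers purely for illustration.
probs = {"low": 0.10, "medium": 0.25, "high": 0.65}
names = ["late_ratio", "credit_score", "dispute_rate"]
values = [0.12, -0.30, 0.05]

level, confidence, top = summarize_prediction(probs, values, names, top_k=2)
print(level, confidence, top)
```

Sorting by magnitude rather than by signed value keeps strongly risk-reducing features (negative SHAP values) in the top list, which matches the "increases/decreases risk" labels printed above.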

## 8. NLP Pipeline - Feature Summarization
**What**: Use NLP to summarize SHAP features
**Why**: Translate technical features to business language

In [None]:
print("\n STEP 7: NLP Pipeline")
print("=" * 70)

# Summarize SHAP values using NLP
top_features_nlp = nlp_layer.summarize_shap_values(
    result['shap_values'],
    result['feature_names'],
    top_k=5
)

print("\nNLP-Enhanced Feature Summary:")
for feat, val in top_features_nlp:
    clean_name = feat.replace('numeric__', '').replace('categorical__', '').replace('_', ' ').title()
    print(f"  • {clean_name}: {val:.4f}")

print("\n✓ NLP summarization complete")
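
The `numeric__` and `categorical__` prefixes stripped above resemble the `<transformer name>__<column>` names that scikit-learn's `ColumnTransformer` produces, though the notebook does not confirm where they come from. Chained `.replace` calls would also remove those substrings from the middle of a name. A small sketch that strips only a leading prefix (`clean_feature_name` is a hypothetical helper):

```python
PREFIXES = ("numeric__", "categorical__")


def clean_feature_name(raw):
    """Hypothetical helper: drop one leading transformer prefix, then title-case."""
    for prefix in PREFIXES:
        if raw.startswith(prefix):
            raw = raw[len(prefix):]
            break
    return raw.replace("_", " ").title()


for raw in ["numeric__credit_score", "categorical__region_Europe", "late_ratio"]:
    print(f"{raw} -> {clean_feature_name(raw)}")
```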

## 9. Recommendation Pipeline - Actionable Insights
**What**: Generate business recommendations based on risk
**Why**: Predictions are only useful when paired with concrete next steps

In [None]:
print("\n STEP 8: Recommendation Pipeline")
print("=" * 70)

# Generate recommendations
recommendations = recommendation.build_recommendations(
    risk_level=explanation.risk_level,
    top_features=explanation.top_features
)

print("\nActionable Recommendations:")
for i, reco in enumerate(recommendations, 1):
    print(f"  {i}. {reco}")

print("\n✓ Recommendations generated")
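
The rules inside `recommendation.build_recommendations` are not shown in this notebook. As a purely illustrative sketch of one way to turn a risk level plus top features into actions, the lookup tables, action texts, and function name below are all invented for the example:

```python
# Hypothetical rule tables; the real recommendation module's rules are not shown.
BASE_ACTIONS = {
    "high": "Escalate to the procurement lead and review contract terms",
    "medium": "Schedule a quarterly supplier review",
    "low": "Continue standard monitoring",
}
FEATURE_ACTIONS = {
    "late_ratio": "Add delivery-time penalties or hold buffer stock",
    "dispute_rate": "Audit recent disputes with the supplier",
    "credit_score": "Request updated financial statements",
}


def sketch_recommendations(risk_level, top_features, max_items=3):
    """Return the base action for the risk level, plus actions for features
    that push risk upward (positive SHAP value)."""
    actions = [BASE_ACTIONS.get(risk_level.lower(), BASE_ACTIONS["medium"])]
    for name, value in top_features:
        if value > 0 and name in FEATURE_ACTIONS:
            actions.append(FEATURE_ACTIONS[name])
    return actions[:max_items]


recs = sketch_recommendations(
    "high",
    [("late_ratio", 0.12), ("credit_score", -0.30), ("dispute_rate", 0.05)],
)
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec}")
```

Only features with positive contributions trigger an action here, so a feature that lowers risk (like `credit_score` in the example) is skipped.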

## 10. Visualization Pipeline - Feature Importance
**What**: Create visual explanations of SHAP values
**Why**: Visualizations help stakeholders understand model decisions

In [None]:
print("\n STEP 9: Visualization Pipeline")
print("=" * 70)

# Generate SHAP summary plot
shap_plot_path = visualization.plot_shap_summary(
    explanation.shap_values,
    explanation.feature_names,
    output_name="pipeline_demo_shap"
)

print(f"\nSHAP plot saved: {shap_plot_path}")

# Generate feature importance plot
importance_plot_path = explainability_viz.plot_feature_importance(
    explanation.feature_names,
    explanation.shap_values,
    output_name="pipeline_demo_importance"
)

print(f"Feature importance plot saved: {importance_plot_path}")

print("\n✓ Visualizations generated")

## 11. Advanced Visualization - Data Exploration
**What**: Create advanced charts for data analysis
**Why**: Understand data distributions and relationships

In [None]:
print("\n STEP 10: Advanced Visualization")
print("=" * 70)

# Create correlation heatmap
heatmap_path = visualization_engine.heatmap(
    training_df,
    output_name="pipeline_demo_heatmap"
)

print(f"\nHeatmap saved: {heatmap_path}")

# Create histogram
hist_path = visualization_engine.histogram(
    training_df,
    column="credit_score",
    output_name="pipeline_demo_histogram"
)

print(f"Histogram saved: {hist_path}")

print("\n✓ Advanced visualizations generated")

## 12. Audit Trail - Log Complete Pipeline Execution
**What**: Log all pipeline steps for compliance and debugging
**Why**: Production systems need complete audit trails

In [None]:
print("\n📝 STEP 11: Audit Trail")
print("=" * 70)

# Log complete pipeline execution
auditing.persist_audit_log(
    event_type="complete_pipeline_execution",
    payload={
        "supplier_id": "DEMO_001",
        "prediction": result['prediction'],
        "confidence": explanation.confidence,
        "top_features": [f[0] for f in explanation.top_features[:3]],
        "recommendations_count": len(recommendations),
        "visualizations_generated": 4
    }
)

# Fetch recent audit events
recent_events = auditing.db_connector.fetch_audit_trail(limit=5)

print("\nRecent Audit Events:")
print(recent_events[['event_type', 'timestamp']].head())

print("\n✓ Pipeline execution logged")
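
The storage behind `persist_audit_log` and `fetch_audit_trail` is not shown. A minimal sketch of the same write-then-read pattern using the standard library's `sqlite3` follows; the table schema, the in-memory database, and the function names `persist_event` and `fetch_recent` are assumptions for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone

# In-memory database for the demo; a real deployment would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT NOT NULL, "
    "payload TEXT NOT NULL, "
    "timestamp TEXT NOT NULL)"
)


def persist_event(event_type, payload):
    """Hypothetical stand-in for auditing.persist_audit_log."""
    conn.execute(
        "INSERT INTO audit_log (event_type, payload, timestamp) VALUES (?, ?, ?)",
        (event_type, json.dumps(payload), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def fetch_recent(limit=5):
    """Return the newest events first, with payloads decoded from JSON."""
    rows = conn.execute(
        "SELECT event_type, payload, timestamp FROM audit_log "
        "ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [(event_type, json.loads(payload), ts) for event_type, payload, ts in rows]


persist_event("pipeline_execution", {"stage": "data_quality_check", "rows": 10500})
persist_event("complete_pipeline_execution", {"supplier_id": "DEMO_001"})
events = fetch_recent(limit=5)
for event in events:
    print(event)
```

Storing the payload as a JSON string keeps the schema stable while each pipeline stage logs different fields.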

## 13. Complete Pipeline Summary
**What**: Summarize entire MLOps workflow
**Why**: Show how all components integrate for production ML

In [None]:
print("\n" + "=" * 70)
print("🎉 COMPLETE MLOPS PIPELINE SUMMARY")
print("=" * 70)

summary = f"""
✓ Data Pipeline: Loaded {len(training_df)} training records
✓ Auditing: Validated data quality and logged events
✓ Feature Engineering: Prepared {X.shape[1]} features
✓ Model Training: Trained Random Forest and XGBoost
✓ Prediction: Generated {result['prediction'].upper()} risk prediction
✓ Explainability: Created SHAP + LLM narrative
✓ NLP: Summarized top {len(top_features_nlp)} features
✓ Recommendations: Generated {len(recommendations)} action items
✓ Visualization: Created 4 charts and plots
✓ Audit Trail: Logged complete pipeline execution

Pipeline Components Used:
  • data_pipeline: Data loading & preprocessing
  • model_pipeline: Training & prediction
  • explainability: SHAP + Ollama narratives
  • nlp_layer: Feature summarization
  • auditing: Quality checks & logging
  • visualization: SHAP plots
  • visualization_engine: Advanced charts
  • explainability_viz: Feature importance
  • recommendation: Business actions

This demonstrates a complete production MLOps workflow!
"""

print(summary)
print("=" * 70)