# Part 2: Snowflake Model Registry Deployment

## Overview

This notebook demonstrates **deploying an XGBoost model to Snowflake Model Registry** for production inference. You'll register the model trained in Part 1, run inference against it, inspect its logged metadata, and optionally save the training data to a Snowflake table for scalable, governed ML operations.

### Prerequisites

⚠️ **IMPORTANT**: Run `setup.sql` as ACCOUNTADMIN before starting this notebook.

The setup script creates:
- Role: `HEALTHCARE_ML_ROLE`
- Database: `HEALTHCARE_ML`
- Schema: `HEALTHCARE_ML.DIAGNOSTICS`
- Warehouse: `HEALTHCARE_ML_WH`
- Compute Pool: `HEALTHCARE_ML_CPU_POOL`

### What You'll Learn

1. **Register models** in Snowflake Model Registry
2. **Run inference** using registered models
3. **Track metadata** (metrics, versions, comments)
4. **Persist data** to Snowflake tables (optional)

> **Note**: This notebook requires Container Runtime and must be run from **Snowsight**.

## Step 1: Load Artifacts from Part 1

Load the trained model, data splits, and evaluation metrics that Part 1 saved to `/tmp/breast_cancer_artifacts.pkl`.
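
Before unpacking the pickle, it can help to confirm that it holds every key this notebook reads. The sketch below is a minimal check under stated assumptions: `REQUIRED_KEYS` mirrors the keys used in the loading cell, and `missing_artifact_keys` is a hypothetical helper that is not part of Part 1.

```python
# Keys this notebook reads from the Part 1 artifacts dictionary.
REQUIRED_KEYS = (
    "best_model", "X_train", "X_test", "y_train", "y_test",
    "test_accuracy", "test_f1", "roc_auc", "pr_auc",
    "cv_results", "feature_names",
)


def missing_artifact_keys(artifacts, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the artifacts dict."""
    return [key for key in required if key not in artifacts]


# Illustrative stand-in for the unpickled dictionary (values are placeholders).
example = {key: None for key in REQUIRED_KEYS if key != "pr_auc"}
missing = missing_artifact_keys(example)
if missing:
    print(f"Re-run Part 1; missing artifact keys: {missing}")
```

In the notebook, the same call could be made on the real `artifacts` dictionary right after `pickle.load`.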

In [None]:
import pickle
import pandas as pd
from snowflake.snowpark.context import get_active_session

# Load artifacts from Part 1
with open('/tmp/breast_cancer_artifacts.pkl', 'rb') as f:
    artifacts = pickle.load(f)

best_model = artifacts['best_model']
X_train = artifacts['X_train']
X_test = artifacts['X_test']
y_train = artifacts['y_train']
y_test = artifacts['y_test']
test_accuracy = artifacts['test_accuracy']
test_f1 = artifacts['test_f1']
roc_auc = artifacts['roc_auc']
pr_auc = artifacts['pr_auc']
cv_results = artifacts['cv_results']
feature_names = artifacts['feature_names']

print("=" * 60)
print("✅ ARTIFACTS LOADED FROM /tmp")
print("=" * 60)
print(f"Model: XGBoost ({best_model.n_estimators} estimators)")
print(f"Training data: {X_train.shape[0]} samples × {X_train.shape[1]} features")
print(f"Test data: {X_test.shape[0]} samples")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")

# Connect to Snowflake
session = get_active_session()
session.sql("""
    ALTER SESSION SET query_tag = '{"origin":"sf_sit-is","name":"healthcare_ml_classification","version":{"major":1,"minor":0},"attributes":{"is_quickstart":1,"source":"notebook"}}'
""").collect()
print(f"\n✅ Connected to Snowflake: {session.get_current_account()}")

## Step 2: Environment Setup and Model Registration

### Import Libraries

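This notebook combines data science libraries with Snowflake-specific ones. The cell below imports the registry client, selects the database and schema created by `setup.sql`, and logs the model together with its metrics:

| Library | Purpose |
|---------|---------|
| `snowflake.snowpark` | Snowflake session management |
| `snowflake.ml.registry` | `Registry` client for logging and retrieving models |
| `snowflake.ml.model` | Task definitions such as binary classification |
| `pandas`, `numpy` | Data manipulation and numerical operations |
| `matplotlib`, `seaborn` | Statistical visualizations |
| `sklearn` | ML utilities, metrics, and baseline models |
| `xgboost` | Gradient boosting implementation |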

> **Note**: All libraries are pre-installed in Container Runtime, so no `!pip install` commands or external access integrations (EAIs) are needed.

In [None]:
from snowflake.ml.registry import Registry
from snowflake.ml.model import task

DATABASE = "HEALTHCARE_ML"
SCHEMA = "DIAGNOSTICS"

session.use_database(DATABASE)
session.use_schema(SCHEMA)

registry = Registry(session=session)

MODEL_NAME = "BREAST_CANCER_CLASSIFIER"

print("Logging model to Snowflake Model Registry...")
mv = registry.log_model(
    best_model,
    model_name=MODEL_NAME,
    sample_input_data=X_train.head(),
    target_platforms=["WAREHOUSE"],
    task=task.Task.TABULAR_BINARY_CLASSIFICATION,
    options={'relax_version': False},
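    metrics={
        # Evaluation results carried over from Part 1
        "test_accuracy": float(test_accuracy),
        "test_f1_score": float(test_f1),
        "roc_auc": float(roc_auc),
        "cv_accuracy_mean": float(cv_results['XGBoost'].mean()),
        "cv_accuracy_std": float(cv_results['XGBoost'].std()),
        # Hyperparameters recorded alongside the metrics for reference;
        # keep these in sync with the values used to train the model in Part 1
        "n_estimators": 100,
        "max_depth": 6,
        "learning_rate": 0.1
    },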
    comment="XGBoost classifier for breast cancer diagnosis. Trained on Wisconsin Diagnostic dataset (569 samples, 30 features). Cross-validated."
)

print("=" * 60)
print("MODEL REGISTRY - SUCCESS")
print("=" * 60)
print(f"Model Name:    {MODEL_NAME}")
print(f"Version:       {mv.version_name}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"ROC AUC:       {roc_auc:.4f}")

## Step 3: Model Inference

### Running Predictions with the Registered Model

Once the model is logged to the Model Registry, you can run inference in two ways:

| Method | Use Case | Scalability |
|--------|----------|-------------|
| `mv.run()` (Python) | Notebooks, scripts | Batch processing |
| `MODEL!PREDICT()` (SQL) | Dashboards, ETL pipelines | Warehouse-scale |

The model executes **within Snowflake** - no data leaves the platform, maintaining security and governance.
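
To make the SQL route more concrete, the sketch below builds a `MODEL!PREDICT()` statement with an explicit, double-quoted column list instead of `*`. It is an illustration under assumptions: `quote_ident` and `build_predict_sql` are hypothetical helpers, the table and column names are placeholders, and the statement targets the model's default version. Check the exact call syntax against the Snowflake documentation before relying on it.

```python
def quote_ident(name):
    """Double-quote a Snowflake identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_predict_sql(model_name, table_name, feature_columns, method="PREDICT"):
    """Build a SELECT that calls a model method on the listed columns."""
    if not feature_columns:
        raise ValueError("feature_columns must not be empty")
    cols = ", ".join(quote_ident(c) for c in feature_columns)
    return f"SELECT {model_name}!{method}({cols}) AS RESULT FROM {table_name}"


sql = build_predict_sql(
    "BREAST_CANCER_CLASSIFIER",
    "your_patient_data",              # placeholder table name
    ["mean radius", "mean texture"],  # placeholder feature columns
)
print(sql)
# In the notebook, the string could be run with session.sql(sql).collect()
```

Listing the columns explicitly keeps the argument order under your control, which matters when the table holds columns the model was not trained on.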

In [None]:
print(f"Running inference using model: {mv.model_name} (version: {mv.version_name})")
predictions = mv.run(X_test, function_name="predict")
print(f"Prediction columns: {predictions.columns.tolist()}")
# Assumes the model output is the last column of the returned DataFrame
pred_col = predictions.columns[-1]
predictions[[pred_col]].rename(columns={pred_col: "PREDICTION"}).head(10)

## Step 4: Explore Registered Model

The Model Registry stores model artifacts along with metadata. Let's inspect:
- **Available methods**: predict, predict_proba
- **Logged metrics**: accuracy, AUC, hyperparameters

> **Tip**: View your model in Snowsight under **AI & ML > Models** for a visual interface.
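
Because this notebook logs hyperparameters in the same dictionary as performance metrics, it can be useful to separate the two when reporting. The sketch below is one way to do that, assuming a plain dictionary shaped like the one passed to `log_model`; `split_logged_values` is a hypothetical helper and the performance values are placeholders, not results.

```python
# Keys that this notebook logs as hyperparameters rather than metrics.
HYPERPARAMETER_KEYS = {"n_estimators", "max_depth", "learning_rate"}


def split_logged_values(logged):
    """Separate hyperparameters from performance metrics in a metrics dict."""
    hyperparams = {k: v for k, v in logged.items() if k in HYPERPARAMETER_KEYS}
    metrics = {k: v for k, v in logged.items() if k not in HYPERPARAMETER_KEYS}
    return metrics, hyperparams


# Placeholder values; in the notebook, mv.show_metrics() supplies the real ones.
logged = {
    "test_accuracy": 0.97,
    "roc_auc": 0.99,
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1,
}
metrics, hyperparams = split_logged_values(logged)
for name, value in sorted(metrics.items()):
    print(f"{name:<15} {value:.4f}")
print("hyperparameters:", hyperparams)
```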

In [None]:
print("Available methods:")
for func in mv.show_functions():
    print(f"  - {func['name']}")

print("\nModel metrics:")
mv.show_metrics()

## Step 5: (Optional) Persist Data to Snowflake

**Data Persistence Options:**

| Method | Use Case | Durability |
|--------|----------|------------|
| Snowflake Table | Structured data, SQL queries | Permanent |
| Snowflake Stage | Files, artifacts | Permanent |
| Notebook CWD | Temporary files | Session only ⚠️ |

> **Warning**: The notebook working directory (`/home/udf/`) does not persist between sessions. Always save important data to tables or stages.
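
When saving a DataFrame to a table, column names that contain spaces or mixed case must be double-quoted in every later SQL query. One option is to normalize them first. The sketch below assumes feature names such as `mean radius`; `to_snowflake_identifier` is a hypothetical helper, and it does not handle two names that collapse to the same identifier.

```python
import re


def to_snowflake_identifier(name):
    """Convert a free-form column name to an uppercase unquoted identifier."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip()).strip("_").upper()
    if not cleaned:
        raise ValueError(f"cannot derive an identifier from {name!r}")
    # Unquoted identifiers may not start with a digit, so prefix an underscore.
    return f"_{cleaned}" if cleaned[0].isdigit() else cleaned


# Example column names (assumed for illustration).
columns = ["mean radius", "worst concave points", "area error"]
print([to_snowflake_identifier(c) for c in columns])
# In the notebook:
# train_df.columns = [to_snowflake_identifier(c) for c in train_df.columns]
```

Note that a model logged with the original column names expects those names at inference time, so renaming is safest for tables used only for storage and SQL analysis.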

In [None]:
# OPTIONAL: Save training data to Snowflake
# Uncomment and update the database/schema names to match your environment

# train_df = X_train.copy()
# train_df["DIAGNOSIS"] = y_train.values
# 
# snowpark_df = session.create_dataframe(train_df)
# snowpark_df.write.mode("overwrite").save_as_table("HEALTHCARE_ML.DIAGNOSTICS.BREAST_CANCER_TRAINING_DATA")
# 
# print("Training data saved to Snowflake table")

## Summary and Key Takeaways

### What We Accomplished

| Step | Technique | Outcome |
|------|-----------|---------|
| Data Exploration | Statistical analysis + visualizations | Understood feature distributions and class balance |
| Feature Engineering | StandardScaler | Normalized features for fair model comparison |
| Model Selection | 5-Fold Stratified CV | Compared 3 algorithms, selected XGBoost |
| Evaluation | Multiple metrics + visualizations | Validated model with ~97% accuracy, 0.99 AUC |
| Deployment | Snowflake Model Registry | Production-ready model with versioning |

### Performance Summary

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Test Accuracy | ~97% | Correct predictions overall |
| ROC AUC | ~0.99 | Excellent discrimination |
| Malignant Recall | ~95%+ | Catches most cancers |
| Benign Precision | ~98%+ | Few malignant cases mislabeled as benign |
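
The per-class metrics in this table follow directly from confusion-matrix counts. The sketch below works through the arithmetic with hypothetical counts chosen only for illustration; they are not results from this notebook.

```python
def recall(true_pos, false_neg):
    """Share of actual positives that the model found."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Share of predicted positives that were correct."""
    return true_pos / (true_pos + false_pos)


# Hypothetical test-set counts, for illustration only.
malignant_correct = 40  # malignant predicted malignant
malignant_missed = 2    # malignant predicted benign
benign_correct = 70     # benign predicted benign
benign_flagged = 2      # benign predicted malignant

malignant_recall = recall(malignant_correct, malignant_missed)
# For the benign class, false positives are malignant cases predicted benign.
benign_precision = precision(benign_correct, malignant_missed)
total = malignant_correct + malignant_missed + benign_correct + benign_flagged
accuracy = (malignant_correct + benign_correct) / total

print(f"Malignant recall: {malignant_recall:.3f}")
print(f"Benign precision: {benign_precision:.3f}")
print(f"Accuracy:         {accuracy:.3f}")
```

Both malignant recall and benign precision fall when malignant cases are predicted as benign, which is why they are the metrics to watch for missed cancers.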

### Production Usage

```sql
-- SQL inference (calls the model's default version)
SELECT BREAST_CANCER_CLASSIFIER!PREDICT(*) FROM your_patient_data;
```

```python
# Python inference; replace "V1" with the version name printed in the registration step
model_version = registry.get_model("BREAST_CANCER_CLASSIFIER").version("V1")
predictions = model_version.run(new_data, function_name="predict")
```

### Next Steps

1. **Hyperparameter Tuning**: Use GridSearchCV or Optuna for optimization
2. **Feature Selection**: Reduce to top 10-15 features for efficiency
3. **Model Monitoring**: Track prediction drift in production
4. **A/B Testing**: Compare model versions on live data

> **Resources**: [Snowflake ML Documentation](https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview) | [XGBoost Documentation](https://xgboost.readthedocs.io/)