In [None]:
# Import Snowpark context for the active session
import streamlit as st
import pandas as pd
import altair as alt
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Define image in a stage and read the file
#image=session.file.get_stream('@aicollege.public.setup/MLObservabilityWorkflow.jpg' , decompress=False).read() 

# Display the image
#st.image(image, width=800)

## 📊 ML Observability in Action: Monitoring + Alerting Workflow

In this section, you’ll track and respond to model performance using **Snowflake Model Monitors**. Here's the recommended end-to-end workflow:

1. Create a **baseline table** using early batch inference data (e.g., Weeks 1–5).
2. Set up a **model monitor** that references the baseline to enable drift metrics immediately.
3. Explore key **performance metrics** in **Snowsight**, such as **AUC**, **F1 Score**, **Precision**, and **Recall**.
4. Enable all **8 Snowsight panels** (prediction count, actuals count, performance, drift, etc.) for full visibility.
5. Create an **alert** that triggers if **F1 Score** drops below a specified threshold (e.g., `0.7`).
6. Recreate the model monitor using **all available inference data**, while continuing to reference the baseline.
7. Understand when **drift metrics may become disabled** and how to **re-enable them** (e.g., when baseline is missing).
8. Use functions like `MODEL_MONITOR_PERFORMANCE_METRIC` and `MODEL_MONITOR_DRIFT_METRIC` to **query metrics over time**.
9. **Visualize trends** in Streamlit or matplotlib using custom dashboards — including manual drift analysis via **Hellinger Distance**.

This notebook walks you through each step to build a transparent, production-ready observability pipeline using the Snowflake Model Registry and Monitoring framework.

### 📌 Create a Baseline Table for Drift Monitoring

To enable drift metrics in Snowflake Model Monitoring, we must define a **baseline dataset**. This baseline acts as the reference point for detecting data drift over time.

In this lab, we’ll use the **first five weeks** of batch inference results as our baseline. These predictions represent a stable period before performance begins to shift.

The baseline will be stored in a table called `BASELINE_PREDICTIONS`. We’ll use it when creating our model monitor so that drift metrics are enabled from the start.

In [None]:
CREATE OR REPLACE TABLE AICOLLEGE.PUBLIC.BASELINE_PREDICTIONS AS  --- Create baseline_predictions table
SELECT *
FROM AICOLLEGE.PUBLIC.PREDICTIONS_WITH_GROUND_TRUTH  --- Use Week-by-week (Weeks 1–5) PREDICTIONS_WITH_GROUND_TRUTH as input
WHERE WEEK BETWEEN 1 AND 5;

In [None]:
-- Preview a few rows of the scored table
SELECT *
FROM AICOLLEGE.PUBLIC.PREDICTIONS_WITH_GROUND_TRUTH
LIMIT 10;

In [None]:
DESC TABLE AICOLLEGE.PUBLIC.PREDICTIONS_WITH_GROUND_TRUTH;

### 🛠️ Set Up the Model Monitor

Now that we’ve completed batch inference and saved the results, it’s time to enable ML Observability by creating a **Model Monitor**.

Snowflake model monitors allow you to:

- Track your model’s **performance** over time (accuracy, precision, recall)
- Detect **data drift** or changes in prediction confidence
- Monitor **prediction volume** and row-level trends

Each model monitor must be tied to:
- A specific model version
- A source table with inference + actuals
- A timestamp column for time-based aggregation

We’ll now use the `CREATE MODEL MONITOR` command to register our monitor and start collecting observability metrics. Once the monitor is active, you’ll be able to view insights in the **Snowsight Monitoring Dashboard** under `AI & ML > Models`.

> ⚠️ Make sure your source table includes the following:
> - A `PREDICTED_RESPONSE` column (binary)
> - A `PREDICTED_SCORE` column (probability between 0 and 1)
> - A `MORTGAGERESPONSE` column (ground truth)
> - A `WEEK_START_DATE` timestamp column (must be type `DATE` or `TIMESTAMP_NTZ`)
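
Before creating the monitor, it can help to sanity-check a small sample of the source table in pandas. The sketch below is one possible check, not part of the Snowflake API: `validate_monitor_source` is a hypothetical helper, the timestamp column is assumed to be `WEEK_START_DATE` (the name used in the `CREATE MODEL MONITOR` statement), and it is assumed that the timestamp has already been parsed with `pd.to_datetime`.

```python
import pandas as pd

# Column names taken from this lab; adjust them to your own table.
REQUIRED_COLUMNS = [
    "PREDICTED_RESPONSE",
    "PREDICTED_SCORE",
    "MORTGAGERESPONSE",
    "WEEK_START_DATE",
]


def validate_monitor_source(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the sample looks OK."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need every column
    for col in ("PREDICTED_RESPONSE", "MORTGAGERESPONSE"):
        if not df[col].dropna().isin([0, 1]).all():
            problems.append(f"{col} must contain only 0/1 values")
    if not df["PREDICTED_SCORE"].dropna().between(0, 1).all():
        problems.append("PREDICTED_SCORE must lie between 0 and 1")
    if not pd.api.types.is_datetime64_any_dtype(df["WEEK_START_DATE"]):
        problems.append("WEEK_START_DATE must be a date/timestamp column")
    return problems


# Made-up rows, for illustration only
sample = pd.DataFrame({
    "PREDICTED_RESPONSE": [1, 0],
    "PREDICTED_SCORE": [0.91, 0.12],
    "MORTGAGERESPONSE": [1, 0],
    "WEEK_START_DATE": pd.to_datetime(["2025-01-06", "2025-01-13"]),
})
print(validate_monitor_source(sample))
```

In the notebook, a sample could be pulled with `session.table(...).limit(1000).to_pandas()` and passed to the same function.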

In [None]:
use role aicollege;
CREATE OR REPLACE MODEL MONITOR MORTGAGE_MODEL_MONITOR  --- Create mortgage model monitor
WITH 
  MODEL = 'COLLEGE_AI_HOL_XGB_MORTGAGE_MODEL',  -- Use registered model name
  VERSION = 'V1',  -- Registered version
  FUNCTION = 'predict',  --- Use predict function
  
  -- Full inference table containing predictions + ground truth
  SOURCE = AICOLLEGE.PUBLIC.PREDICTIONS_WITH_GROUND_TRUTH,  --- Use Week-by-week (Weeks 1–5) PREDICTIONS_WITH_GROUND_TRUTH as initial input
  
  -- Enables drift metrics using early scoring results (Weeks 1–5)
  BASELINE = AICOLLEGE.PUBLIC.BASELINE_PREDICTIONS,  --- Use baseline table
  
  WAREHOUSE = AICOLLEGE,
  REFRESH_INTERVAL = '365 DAY',
  AGGREGATION_WINDOW = '7 DAYS',
  
  -- Timestamp must be TIMESTAMP_NTZ
  TIMESTAMP_COLUMN = WEEK_START_DATE,  
  
  ID_COLUMNS = ('LOAN_ID'),  -- Row identifier
  PREDICTION_CLASS_COLUMNS = ('PREDICTED_RESPONSE'),  --- Use binary prediction
  ACTUAL_CLASS_COLUMNS = ('MORTGAGERESPONSE'),  --- Ground truth
  PREDICTION_SCORE_COLUMNS = ('PREDICTED_SCORE');  --- Use model confidence score

### ✅ View the Monitor in Snowsight

To access your model's observability dashboard:

1. In Snowsight, go to **AI & ML > Models**.
2. Locate your model: **COLLEGE_AI_HOL_XGB_MORTGAGE_MODEL**.
3. Click into the model, then scroll down to the **Monitors** section.
4. Click on your monitor: **MORTGAGE_MODEL_MONITOR**.

You’ll now see a dashboard that includes:

- 📈 **Performance metrics** – Tracks accuracy, precision, and recall over time  
- 🧪 **Drift detection** – Highlights shifts in data distribution using metrics like Jensen-Shannon distance  
- 📊 **Volume stats** – Shows trends in row counts and prediction volume  
- 🕒 Use the **time range selector** at the top to explore different time windows.

📌 If your predictions were generated weekly, the **WEEK_START_DATE** column will anchor the time-based aggregation used in the visualizations.

### 🔔 Set Up an `F1_SCORE` Alert for Model Monitoring

To automate performance monitoring, we'll create a **Snowflake alert** that checks if the model’s `F1_SCORE` drops below `0.7`.

📉 **Why 0.7?**  
In this lab, an **F1 score below 0.7** is treated as a sign of **significant degradation in model performance**. F1 is the harmonic mean of **precision** (lowered by false positives) and **recall** (lowered by false negatives), so a dip below this level may signal:
- Poor model generalization to new data
- Increased number of incorrect classifications
- Potential data quality issues or concept drift

If this happens, the alert will trigger an **email notification** to the designated recipient — allowing MLOps teams to take proactive action when model performance drops below acceptable levels.
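
To make the threshold concrete, here is a minimal sketch of how F1 combines precision and recall and how a value compares against `0.7`. The weekly precision/recall pairs are hypothetical numbers chosen for illustration; they are not results from this lab's model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


THRESHOLD = 0.7  # same threshold as the alert in this lab

# Hypothetical weekly (precision, recall) pairs, for illustration only
weekly = {"week_a": (0.80, 0.75), "week_b": (0.90, 0.50)}

for week, (p, r) in weekly.items():
    score = f1_score(p, r)
    status = "ALERT" if score < THRESHOLD else "ok"
    print(f"{week}: F1={score:.3f} -> {status}")
```

Note that `week_b` has an arithmetic mean of 0.70 across precision and recall, yet its F1 falls below the threshold: the harmonic mean penalizes an imbalance between the two.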

---

**🧪 Testing the Alert**

For demonstration purposes, we’ll use a **fixed date range** around **mid-February 2025**, when we know the `F1_SCORE` dips below the threshold.  
This ensures the alert fires immediately — no matter when the notebook is run.

Later, this can be swapped for a rolling time window (e.g., `CURRENT_DATE() - 7`) for ongoing monitoring.

---

**🔍 What this alert does:**
- Queries the `MODEL_MONITOR_PERFORMANCE_METRIC` function for recent `F1_SCORE` values  
- Checks values between **Feb 9 and Feb 16**
- Triggers an email via the `ML_ALERTS` integration if any value falls below `0.7`
- Runs daily on a schedule (e.g., 9 AM UTC)

Once verified, this logic can be adapted for long-term production monitoring.

In [None]:
CREATE OR REPLACE ALERT F1_SCORE_DROP_ALERT  -- Set your model monitor alert called "F1_SCORE_DROP_ALERT"
  WAREHOUSE = AICOLLEGE
  SCHEDULE = 'USING CRON 0 9 * * * UTC'
  IF (EXISTS (
    SELECT 1
    FROM TABLE(
      MODEL_MONITOR_PERFORMANCE_METRIC(
        'MORTGAGE_MODEL_MONITOR',
        'F1_SCORE',
        '7 DAYS',
        '2025-02-09'::DATE,  -- Fixed start date for simulation
        '2025-02-16'::DATE   -- Fixed end date for simulation
        -- DATEADD(DAY, -7, CURRENT_DATE()), CURRENT_DATE() -- Use this for production later
      )
    )
    WHERE METRIC_VALUE < 0.7 -- Set your F1 score threshold to 0.7
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'ML_ALERTS',
    'harley.chen@snowflake.com',  --- Update with your email
    'F1_SCORE dropped below threshold',
    'The F1_SCORE for COLLEGE_AI_HOL_XGB_MORTGAGE_MODEL has dropped below 0.7 between Feb 9 and Feb 16.'
  );

### 🔁 Recreate the Model Monitor with Full Historical Data

Now that we've added more predictions to the `ALL_PREDICTIONS_WITH_GROUND_TRUTH` table (beyond just Weeks 1–5), it's important to **recreate the model monitor** to include all available inference results.

Because a model monitor can only be tied to a single source table and model version, the best practice is to:

1. **Drop the existing monitor** that only covered early batch inference.
2. **Recreate the monitor** using the same model and version — but now referencing the full table of predictions.

This ensures your observability dashboards and performance metrics in Snowsight reflect a more complete and up-to-date view of the model’s behavior in production.

In [None]:
use role aicollege;
-- Drop the previous monitor so it can be recreated with all historical data
-- (CREATE MODEL MONITOR without OR REPLACE fails if the monitor already exists)
DROP MODEL MONITOR IF EXISTS MORTGAGE_MODEL_MONITOR;

-- Recreate the monitor using the full prediction table and a defined baseline
CREATE MODEL MONITOR MORTGAGE_MODEL_MONITOR
WITH 
  MODEL = 'COLLEGE_AI_HOL_XGB_MORTGAGE_MODEL',
  VERSION = 'V1',
  FUNCTION = 'predict',
  SOURCE = AICOLLEGE.PUBLIC.ALL_PREDICTIONS_WITH_GROUND_TRUTH,  --- Use the full batch-scoring table (all weeks)
  BASELINE = AICOLLEGE.PUBLIC.BASELINE_PREDICTIONS,
  WAREHOUSE = AICOLLEGE,
  REFRESH_INTERVAL = '365 DAY',
  AGGREGATION_WINDOW = '7 DAYS',
  TIMESTAMP_COLUMN = WEEK_START_DATE,
  ID_COLUMNS = ('LOAN_ID'),  -- Row identifier
  PREDICTION_CLASS_COLUMNS = ('PREDICTED_RESPONSE'),  --- Use binary prediction
  ACTUAL_CLASS_COLUMNS = ('MORTGAGERESPONSE'),  --- Ground truth
  PREDICTION_SCORE_COLUMNS = ('PREDICTED_SCORE');  --- Use model confidence score

### 📣 Trigger the Alert Immediately for Testing

After defining an alert to monitor model performance, Snowflake will only evaluate it based on the scheduled time (e.g., daily at 9 AM UTC).

To test your alert setup immediately — and confirm that the logic and email integration work as expected — you can:

- **Resume the alert** (newly created alerts start in a suspended state, so the schedule does not take effect until you resume)
- **Manually execute the alert** to trigger the condition check

This is useful for:

- Verifying that your alert logic behaves as expected  
- Confirming that email notifications are successfully sent  
- Validating the threshold you’ve set for model metrics  

> ⚠️ The alert will continue to run automatically based on its schedule unless you pause or drop it later.

In [None]:
-- ✅ Resume the alert if it was paused
ALTER ALERT F1_SCORE_DROP_ALERT RESUME;

-- 🚀 Execute the alert immediately to check if an email should be sent
EXECUTE ALERT F1_SCORE_DROP_ALERT;

-- 📬 After executing the alert, check your email inbox (or other notification channel)
-- to confirm that the alert was triggered and delivered successfully.
-- Make sure your alert notification settings are correctly configured in Snowflake.

### 🔁 Review Updated Monitor Results in Snowsight

With the full prediction history now included in the monitor, revisit Snowsight to explore updated trends.

1. Go to **AI & ML > Models**, then locate:
   `COLLEGE_AI_HOL_XGB_MORTGAGE_MODEL > V1 > MORTGAGE_MODEL_MONITOR`

2. Use **Settings** to enable:
   - AUC, F1 Score, Precision, Recall, Classification Accuracy
   - Drift metrics: Prediction + Feature value drift (now available since a **baseline** was provided)

3. For drift metrics, select features like:
   - **PREDICTED_RESPONSE**, **PREDICTED_SCORE** for prediction drift
   - Input features (e.g., INCOME, LOAN_TYPE_NAME) for feature drift

This visibility helps track degradation over time and detect potential issues.

## 📊 ML Monitoring: Custom Drift & Performance Charts in Streamlit

While **Snowsight provides a rich UI** for exploring model performance and drift, you may want to build a **custom dashboard using Streamlit** — especially for embedding within existing **MLOps workflows** or sharing **lightweight visualizations**.

This section shows how to:

- **Programmatically access performance, drift, and volume metrics**
- **Create time-series visualizations** using metrics like **AUC**, **F1 Score**, **Precision**, and **Recall**
- **Monitor distribution drift** using metrics like **Jensen-Shannon distance** or **Hellinger distance**
- **Visualize predictions vs actuals over time** using interactive charts

This approach works well when you have a **registered model with an attached monitor** (as shown in Snowsight), and want to **surface select insights** for stakeholders or dev teams.
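
Drift metrics can be queried in the same programmatic style as performance metrics. The sketch below only builds the SQL text, so it runs without a Snowflake connection. `build_drift_query` is a hypothetical helper, and both the argument order passed to `MODEL_MONITOR_DRIFT_METRIC` and the `JENSEN_SHANNON` metric name are assumptions to confirm against the Snowflake documentation before use.

```python
from datetime import date, timedelta


def build_drift_query(monitor: str, metric: str, column: str,
                      granularity: str, start: date, end: date) -> str:
    """Build a SELECT over MODEL_MONITOR_DRIFT_METRIC.

    Assumed argument order: monitor, metric, column, granularity, start, end.
    """
    return (
        "SELECT * FROM TABLE(MODEL_MONITOR_DRIFT_METRIC("
        f"'{monitor}', '{metric}', '{column}', '{granularity}', "
        f"'{start.isoformat()}'::DATE, '{end.isoformat()}'::DATE))"
    )


# Fixed dates keep the example reproducible; use date.today() for a rolling window
end = date(2025, 3, 31)
start = end - timedelta(days=90)

query = build_drift_query(
    "MORTGAGE_MODEL_MONITOR", "JENSEN_SHANNON",
    "PREDICTED_SCORE", "7 DAYS", start, end,
)
print(query)
# In the notebook: df_drift = session.sql(query).to_pandas()
```

`SELECT *` is used deliberately so the sketch does not assume the names of the output columns.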

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import streamlit as st

# Query ROC_AUC from model monitor using updated column names
query = """
-- Use Snowflake ML MODEL_MONITOR_PERFORMANCE_METRIC function to retrieve ROC AUC over time
SELECT
  EVENT_TIMESTAMP,
  METRIC_VALUE AS ROC_AUC
FROM TABLE(
    MODEL_MONITOR_PERFORMANCE_METRIC(
    'MORTGAGE_MODEL_MONITOR',
    'ROC_AUC',
    '7 DAYS',
    DATEADD(DAY, -90, CURRENT_DATE()),
    CURRENT_DATE()
  )
)
ORDER BY EVENT_TIMESTAMP
"""

# Run the query
df_auc = session.sql(query).to_pandas()

# Plotting the AUC over time
plt.figure(figsize=(10, 4))
plt.plot(df_auc["EVENT_TIMESTAMP"], df_auc["ROC_AUC"], marker='o', linestyle='-')
plt.title("AUC Over Time")
plt.xlabel("Event Timestamp")
plt.ylabel("ROC AUC")
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

# Render the current figure in the notebook output
st.pyplot(plt.gcf())

### 🧮 Create Weekly Summary View for Feature Tracking

To visualize sampled feature trends across weeks, we create a lightweight view with one sample per week.

In [None]:
-- Create a view with one sample row per week to track prediction trends
CREATE OR REPLACE VIEW AICOLLEGE.PUBLIC.WEEKLY_MONITOR_SAMPLE AS
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY WEEK_START_DATE ORDER BY LOAN_ID) AS row_num
    FROM AICOLLEGE.PUBLIC.ALL_PREDICTIONS_WITH_GROUND_TRUTH
)
WHERE row_num = 1
ORDER BY WEEK_START_DATE;

In [None]:
import streamlit as st
import pandas as pd
import altair as alt

monitor_name = "MORTGAGE_MODEL_MONITOR"

tab1, tab2 = st.tabs(["📈 Model Metrics", "🔍 Weekly Feature Trends"])

# TAB 1 — MODEL METRICS
with tab1:
    st.subheader("📊 Model Performance Over Time")

    metrics = {
        "AUC": "ROC_AUC",
        "F1 Score": "F1_SCORE",
        "Accuracy": "CLASSIFICATION_ACCURACY",
        "Precision": "PRECISION",
        "Recall": "RECALL"
    }

    selected_metric_label = st.selectbox("📈 Select a performance metric:", list(metrics.keys()))
    selected_metric = metrics[selected_metric_label]

    metric_query = f"""
        SELECT EVENT_TIMESTAMP, METRIC_VALUE
        FROM TABLE(
            MODEL_MONITOR_PERFORMANCE_METRIC(
                '{monitor_name}',
                '{selected_metric}',
                '1 WEEK',
                DATE '2025-01-01',
                CURRENT_DATE()
            )
        )
        ORDER BY EVENT_TIMESTAMP
    """

    df = session.sql(metric_query).to_pandas()

    if not df.empty:
        st.markdown(f"### 🧭 Metric Trend: **{selected_metric_label}**")
        line = alt.Chart(df).mark_line(point=True).encode(
            x=alt.X("EVENT_TIMESTAMP:T", title="Date", axis=alt.Axis(format="%b %d")),
            y=alt.Y("METRIC_VALUE:Q", title=selected_metric_label),
            tooltip=["EVENT_TIMESTAMP", "METRIC_VALUE"]
        ).properties(width=700, height=400)

        st.altair_chart(line, use_container_width=True)
        with st.expander("📋 Show raw metric values"):
            st.dataframe(df)
    else:
        st.warning("⚠️ No data found for the selected metric.")

# TAB 2 — RAW WEEKLY FEATURE TRENDS
with tab2:
    st.subheader("🔍 Weekly Sampled Feature Trends")

    feature_options = [
        "MORTGAGERESPONSE", 
        "PREDICTED_RESPONSE", 
        "PREDICTED_SCORE",
        "APPLICANT_INCOME_000S",
        "LOAN_AMOUNT_000S"
    ]

    selected_feature = st.selectbox("🧮 Select a feature to track weekly:", feature_options)

    trend_query = f"""
        SELECT WEEK_START_DATE, {selected_feature}
        FROM AICOLLEGE.PUBLIC.WEEKLY_MONITOR_SAMPLE
        ORDER BY WEEK_START_DATE
    """

    trend_df = session.sql(trend_query).to_pandas()

    if not trend_df.empty:
        st.markdown(f"### 📊 Weekly Trend: **{selected_feature}** (sampled)")
        chart = alt.Chart(trend_df).mark_line(point=True).encode(
            x=alt.X("WEEK_START_DATE:T", title="Week Start", axis=alt.Axis(format="%b %d")),
            y=alt.Y(f"{selected_feature}:Q", title=selected_feature),
            tooltip=["WEEK_START_DATE", selected_feature]
        ).properties(width=700, height=400)

        st.altair_chart(chart, use_container_width=True)

        with st.expander("📋 Show raw weekly feature data"):
            st.dataframe(trend_df)
    else:
        st.warning("⚠️ No weekly data found for this feature.")

### 🔍 Feature Drift: Detecting Shifts in Input Data

While **performance metrics** like **AUC** and **F1 Score** reveal how well a model is doing, it's equally important to monitor for **feature drift** — shifts in the distribution of input data over time.

Even when performance appears stable, **drift in key input features** can signal:

- **ETL or pipeline changes**
- **Seasonal or regional trends**
- **Concept drift** that may affect long-term accuracy

Snowflake **ML Observability** provides built-in drift metrics like **Jensen-Shannon distance** and supports weekly comparisons. These metrics help detect early warning signs before they lead to degraded model performance.

📌 In this example, we go further and measure feature drift manually using **Hellinger Distance**, which compares two probability distributions:

- A **distance of `0.0`** = identical distributions (**no drift**)
- A **distance of `1.0`** = completely different distributions (**total drift**)

This hands-on drift view complements Snowflake’s automated dashboards and enables **deeper exploratory analysis** using the raw inference data in `ALL_PREDICTIONS_WITH_GROUND_TRUTH`.
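
The two endpoints of the scale can be checked with a small standalone sketch. For discrete distributions with bin probabilities $p_i$ and $q_i$, the Hellinger distance is $H(P, Q) = \sqrt{\tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. The helper names and toy samples below are made up for illustration. Histogram counts are normalized to sum to 1, which is what keeps the result between 0 and 1.

```python
import numpy as np


def hellinger(p, q) -> float:
    """Hellinger distance between two discrete distributions (each sums to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def to_probabilities(values, bins):
    """Histogram values on shared bin edges and normalize counts to sum to 1."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()


same = [0.25, 0.25, 0.5]
print(hellinger(same, same))              # identical distributions
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # no overlap at all

# Toy samples binned on shared edges, so both histograms are comparable
edges = np.array([0.0, 50.0, 100.0, 150.0])
week_a = to_probabilities([10, 20, 60, 70], edges)
week_b = to_probabilities([60, 70, 110, 120], edges)
print(round(hellinger(week_a, week_b), 4))
```

Sharing the bin edges between the two samples matters: if each week were binned separately, the probability vectors would not describe the same intervals and the distance would be meaningless.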

Use the dropdowns below to:

- Select a **numeric feature**
- Choose **two different weeks**
- Visualize how the distribution of that feature changes week-over-week

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

st.subheader("📐 Compare Feature Drift Using Hellinger Distance")

# Load the main prediction data with weekly timestamps
df = session.table("AICOLLEGE.PUBLIC.ALL_PREDICTIONS_WITH_GROUND_TRUTH").to_pandas()

# Extract readable weekly labels
df['WEEK'] = pd.to_datetime(df['WEEK_START_DATE']).dt.to_period('W').astype(str)
weeks = sorted(df['WEEK'].unique())

# Optional: display week labels as datetime strings for clearer dropdowns
week_label_map = {
    wk: f"Week of {pd.Period(wk).end_time.strftime('%b %d, %Y')}"
    for wk in weeks
}
week_labels = list(week_label_map.values())

# Dropdown selections
numeric_features = [
    "APPLICANT_INCOME_000S", 
    "LOAN_AMOUNT_000S", 
    "PREDICTED_SCORE"
]
feature = st.selectbox("🔢 Select a numeric feature:", numeric_features)

# Use dropdowns with friendly labels but map back to period keys
selected_label1 = st.selectbox("📆 Select Week 1 (baseline):", week_labels, index=0)
selected_label2 = st.selectbox("📆 Select Week 2 (comparison):", week_labels, index=1)
week1 = [k for k, v in week_label_map.items() if v == selected_label1][0]
week2 = [k for k, v in week_label_map.items() if v == selected_label2][0]

# Filter the data
dist1 = df[df['WEEK'] == week1][feature].dropna()
dist2 = df[df['WEEK'] == week2][feature].dropna()

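# Compute Hellinger Distance between two discrete probability vectors
def compute_hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Bin both weeks on shared edges, then normalize counts so each histogram sums to 1.
# (density=True divides by bin width, so the result would depend on the feature's scale
# and would no longer stay between 0 and 1.)
bins = np.histogram_bin_edges(np.concatenate([dist1, dist2]), bins=10)
p_counts, _ = np.histogram(dist1, bins=bins)
q_counts, _ = np.histogram(dist2, bins=bins)
p_hist = p_counts / p_counts.sum()
q_hist = q_counts / q_counts.sum()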

score = compute_hellinger(p_hist, q_hist)

# Display interpretation tip
st.info("ℹ️ A Hellinger Distance closer to 0.0 suggests low drift. Values near 1.0 indicate significant feature distribution changes.")

# Display result
st.markdown(f"### 📏 Hellinger Distance between **{selected_label1}** and **{selected_label2}** for **{feature}**: `{score:.4f}`")

# Plot histogram comparison
fig, ax = plt.subplots(figsize=(8, 4))
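# density=True so the y-axis matches its "Density" label even when week sizes differ
ax.hist(dist1, bins=bins, alpha=0.6, label=selected_label1, color='blue', density=True)
ax.hist(dist2, bins=bins, alpha=0.6, label=selected_label2, color='orange', density=True)
ax.set_title(f"Distribution Comparison: {feature}")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()

# Render the figure in the notebook output
st.pyplot(fig)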

In [None]:
-- DORA Evaluation Test #52: Validate that the model monitor was created
-- Step 1: Run SHOW command
SHOW MODEL MONITORS;

-- Step 2: Run this validation check
SELECT util_db.public.se_grader(
  'SEAI52',
  (actual = expected),
  actual,
  expected,
  '✅ Your MORTGAGE_MODEL_MONITOR has been successfully created!'
) AS graded_results
FROM (
  SELECT COUNT(*) AS actual,
         1 AS expected
  FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
  WHERE "name" = 'MORTGAGE_MODEL_MONITOR'
);

### 🧹 HOL Cleanup

To reset your environment and avoid lingering monitors or scheduled alerts, you can drop the model monitor and alert below.

This is especially useful when:

- You're sharing an environment with other users
- You plan to rerun the notebook from the top
- You want to avoid refresh or alert-triggering compute

⚠️ Skip this section if you're actively monitoring your model in production.


In [None]:
-- HOL CLEANUP: Monitoring & Alerting Components

-- Drop the alert first (it may reference the monitor or view)
DROP ALERT IF EXISTS F1_SCORE_DROP_ALERT;

-- Drop the model monitor (after removing dependent alert)
DROP MODEL MONITOR IF EXISTS MORTGAGE_MODEL_MONITOR;

-- Drop the monitoring view (if no longer needed)
DROP VIEW IF EXISTS AICOLLEGE.PUBLIC.WEEKLY_MONITOR_SAMPLE;