# Snowflake Model Registry: End-to-End Template

This notebook is a teaching scaffold: every step is explicit, and you can replace any block with your own code. Work top-to-bottom, editing the sections marked **“🔧 Customize”** as you go.



## 0. Prerequisites

1. Activate the project environment (from a terminal):
   ```bash
   conda activate legalzoom-env
   ```
2. Ensure your Snowflake CLI connection (e.g. `legalzoom`) is configured with the right credentials.
3. Run this notebook from `/Users/jdemlow/github/legal-zoom` so relative imports resolve.



In [None]:
from pathlib import Path
import json

import numpy as np
import pandas as pd

import sys
sys.path.append("../model_registry_showcase")

from logging_utils import get_logger, set_global_level
from core import (
    pipeline_config_from_mapping,
    generate_synthetic_data,
    save_to_csv,
    upload_to_snowflake,
    split_training_data,
    evaluate_model,
    save_training_artifacts,
    verify_pickles,
    init_registry,
    log_model_version,
    deploy_inference_service,
)
from custom_model import CustomZScaler, train_model_with_preprocessing



We'll use the shared logging utilities so that output is consistent with the command-line tools.



In [None]:
logger = get_logger(__name__)
set_global_level("INFO")



## 1. 🔧 Customize your configuration

Edit the dictionaries below with your own dataset parameters, Snowflake identifiers, and toggles. Everything else in the notebook reads from `cfg`.



In [None]:
base_cfg = {
    "data": {
        # dataset generation
        "n_samples": 5000,
        "n_features": 20,
        "random_state": 42,
        "csv_path": "notebook_synthetic_data.csv",
        "upload_to_snowflake": False,  # flip to True when ready
        "connection_name": "legalzoom",
        "database": "ML_SHOWCASE",
        "data_schema": "DATA",
        "table_name": "SYNTHETIC_DATA",
    },
    "train": {
        "test_size": 0.2,
        "random_state": 42,
        "scaler_path": "scaler.pkl",
        "model_path": "model.pkl",
        "test_data_path": "test_data.csv",
        "metrics_path": "model_metrics.json",
    },
    "registry": {
        "connection_name": "legalzoom",
        "database": "ML_SHOWCASE",
        "schema": "MODELS",
        "model_name": "LINEAR_REGRESSION_CUSTOM",
        "target_platform_mode": "WAREHOUSE_ONLY",  # or SNOWPARK_CONTAINER_SERVICES_ONLY
    },
    "steps": {
        "generate_data": True,
        "train_model": True,
        "verify_pickles": True,
        "log_model": False,  # set True once you're satisfied with the run
    },
    "serving": {
        "enabled": False,
        "compute_pool": "ML_INFERENCE_POOL",
        "service_name": "LINEAR_REGRESSION_SERVICE",
        "min_instances": 1,
        "max_instances": 1,
        "instance_family": "CPU_X64_M",
    },
}

cfg = pipeline_config_from_mapping(base_cfg)
cfg
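
The later cells read settings with attribute access (`cfg.data.csv_path`, `cfg.steps.log_model`). The actual return type of `pipeline_config_from_mapping` is defined in `core` and is not shown here. As a rough mental model only, a nested dictionary can be turned into an attribute-style object as in the sketch below. The helper name `to_namespace` is hypothetical.

```python
from types import SimpleNamespace


def to_namespace(mapping):
    """Recursively convert a nested dict into attribute-style namespaces.

    Hypothetical stand-in for illustration; this is not the real
    pipeline_config_from_mapping from core.
    """
    if isinstance(mapping, dict):
        return SimpleNamespace(
            **{key: to_namespace(value) for key, value in mapping.items()}
        )
    return mapping


demo_cfg = to_namespace(
    {
        "data": {"n_samples": 5000, "csv_path": "notebook_synthetic_data.csv"},
        "steps": {"log_model": False},
    }
)

print(demo_cfg.data.csv_path)    # notebook_synthetic_data.csv
print(demo_cfg.steps.log_model)  # False
```

Whatever the real implementation looks like, the practical point is the same: change values in `base_cfg`, rebuild `cfg`, and the downstream cells pick up the new settings.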



## 2. 🔧 Customize the dataset builder

Feel free to swap in your own data-loading logic. The helper below defaults to `sklearn.datasets.make_regression`, but you can replace it with SQL pulls, CSV loads, or feature engineering.



In [None]:
def build_dataset(config):
    """Return a pandas DataFrame with feature columns + TARGET.

    Replace this function with your own data ingestion if desired.
    """
    df = generate_synthetic_data(config.data)
    return df



In [None]:
df = build_dataset(cfg)
df.head()
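
If you replace `build_dataset`, the rest of the notebook still expects the column layout described in its docstring: a `TARGET` column plus feature columns, which the logging cell later selects by the `FEATURE_` prefix. A small check like the following sketch can catch a mismatch early. `check_dataset_columns` is a hypothetical helper, not part of `core`.

```python
def check_dataset_columns(columns, feature_prefix="FEATURE_", target="TARGET"):
    """Return the feature column names, or raise if the layout looks wrong.

    Hypothetical helper: the prefix and target name mirror what this
    notebook uses; adjust them if your data differs.
    """
    columns = list(columns)
    if target not in columns:
        raise ValueError(f"Missing target column {target!r}")
    features = [col for col in columns if col.startswith(feature_prefix)]
    if not features:
        raise ValueError(f"No columns start with {feature_prefix!r}")
    return features


# With a DataFrame you would pass df.columns; a plain list works the same way.
print(check_dataset_columns(["FEATURE_0", "FEATURE_1", "TARGET"]))
# ['FEATURE_0', 'FEATURE_1']
```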



## 3. Generate/save artifacts (local)

This step always saves a local CSV so you can inspect the raw features. The upload to Snowflake runs only if you set `upload_to_snowflake` to `True` in the configuration above.



In [None]:
csv_path = save_to_csv(df, cfg.data.csv_path)
logger.info("Saved local dataset to %s", csv_path)

if cfg.data.upload_to_snowflake:
    table_name = upload_to_snowflake(df, cfg.data)
    logger.info("Uploaded dataset to %s", table_name)



## 4. 🔧 Customize preprocessing/modeling

`CustomZScaler` is provided out of the box. If you want to add feature engineering, try editing the cell below (or swap in your own transformer/model entirely).



In [None]:
X_train, X_test, y_train, y_test = split_training_data(df, cfg.train)
scaler, model = train_model_with_preprocessing(X_train, y_train)
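
`CustomZScaler` lives in `custom_model` and its code is not shown here. The underlying idea, z-score standardization, is to learn each feature's mean and standard deviation on the training split and then apply those same statistics to any later data. The class below is a dependency-free sketch of that idea; the name `TinyZScaler` and its methods are assumptions, not the project's API, and a real transformer would typically operate on NumPy arrays.

```python
from statistics import fmean, pstdev


class TinyZScaler:
    """Illustrative z-score scaler over rows of floats (not CustomZScaler)."""

    def fit(self, rows):
        columns = list(zip(*rows))  # transpose rows into columns
        self.means_ = [fmean(col) for col in columns]
        # Constant columns have zero spread; fall back to 1.0 to avoid
        # dividing by zero in transform().
        self.stds_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [
                (value - mean) / std
                for value, mean, std in zip(row, self.means_, self.stds_)
            ]
            for row in rows
        ]


train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaler_demo = TinyZScaler().fit(train_rows)

# Statistics come from the training rows only, then get reused on new data.
print(scaler_demo.transform([[2.0, 10.0], [4.0, 12.0]]))
```

Fitting on the training split alone is the important design choice: computing the statistics on the full dataset would leak information from the test rows into training.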



In [None]:
metrics = evaluate_model(scaler, model, X_train, X_test, y_train, y_test)
metrics
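
The exact keys returned by `evaluate_model` depend on `core`. For a regression model, two common choices are RMSE and R², and comparing them between the training and test splits gives a quick read on overfitting. The sketch below computes both by hand; `regression_metrics` is a hypothetical name, and the returned keys are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE and R^2 for two equal-length lists of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the target around its own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    # R^2 is undefined when the target is constant; report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"rmse": rmse, "r2": r2}


demo_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(demo_metrics)
```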



## 5. Persist artifacts & verify pickles

This mirrors the CLI flow: write test data + metrics, pickle the scaler/model, then double-check they play nicely together.



In [None]:
save_training_artifacts(X_test, y_test, metrics, cfg.train)

verified = verify_pickles(cfg.train.scaler_path, cfg.train.model_path, cfg.train.test_data_path)
logger.info("Pickle verification passed? %s", verified)
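
What `verify_pickles` checks internally is not shown here. A typical round-trip check reloads both artifacts from disk and confirms that scaling followed by prediction gives the same results as before saving. The sketch below does this with plain dictionaries standing in for the fitted scaler and model; every name and the file layout are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def predict(scaler_params, model_params, rows):
    """Standardize each row, then apply a linear model."""
    predictions = []
    for row in rows:
        scaled = [
            (x - m) / s
            for x, m, s in zip(row, scaler_params["mean"], scaler_params["std"])
        ]
        predictions.append(
            model_params["intercept"]
            + sum(c * x for c, x in zip(model_params["coef"], scaled))
        )
    return predictions


def roundtrip_ok(scaler_params, model_params, rows):
    """Pickle both artifacts, reload them, and compare predictions."""
    expected = predict(scaler_params, model_params, rows)
    with tempfile.TemporaryDirectory() as tmp:
        scaler_path = Path(tmp) / "scaler.pkl"
        model_path = Path(tmp) / "model.pkl"
        scaler_path.write_bytes(pickle.dumps(scaler_params))
        model_path.write_bytes(pickle.dumps(model_params))
        reloaded_scaler = pickle.loads(scaler_path.read_bytes())
        reloaded_model = pickle.loads(model_path.read_bytes())
    return predict(reloaded_scaler, reloaded_model, rows) == expected


demo_scaler = {"mean": [2.0, 10.0], "std": [1.0, 2.0]}
demo_model = {"coef": [0.5, -1.0], "intercept": 3.0}
print(roundtrip_ok(demo_scaler, demo_model, [[3.0, 12.0], [2.0, 10.0]]))  # True
```

Only unpickle files you produced yourself: loading a pickle executes code embedded in it, so artifacts from untrusted sources are unsafe to open.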



## 6. 🔧 Log to Snowflake (toggle when ready)

Set `cfg.steps.log_model = True` and re-run the cell below to push the model into the registry. Make sure `upload_to_snowflake` was set to `True` in the configuration earlier so the dataset is staged in your account.



In [None]:
if cfg.steps.log_model:
    connection, registry = init_registry(cfg.registry)
    sample_df = pd.read_csv(cfg.train.test_data_path)
    feature_cols = [col for col in sample_df.columns if col.startswith("FEATURE_")]
    sample_data = sample_df[feature_cols].head(5)
    try:
        model_version = log_model_version(registry, cfg.registry, sample_data, metrics)
        logger.info("Logged Snowflake model version: %s", model_version.version_name)
    finally:
        connection.close()
else:
    logger.info("Model logging skipped (set cfg.steps.log_model = True to enable)")
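
The `try`/`finally` in the cell above guarantees that the Snowflake connection is closed even if logging raises. For any object that exposes a `close()` method, `contextlib.closing` from the standard library expresses the same guarantee more compactly. The sketch uses a stand-in `FakeConnection` class (an assumption for illustration) so that it runs without Snowflake credentials.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in for a real connection; only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn):
        # Simulate log_model_version failing partway through.
        raise RuntimeError("simulated failure while logging the model")
except RuntimeError:
    pass

print(conn.closed)  # True: close() ran despite the exception
```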



## 7. Optional: deploy to Snowpark Container Services

Fill in the compute pool details in the `serving` section of the configuration above, set `cfg.serving.enabled = True`, and run the cell when you're ready to request a managed service.



In [None]:
if cfg.serving.enabled:
    service = deploy_inference_service(cfg.registry, cfg.serving)
    logger.info("Snowpark Container Services deployment initiated: %s", service)
else:
    logger.info("SPCS deployment skipped (set cfg.serving.enabled = True to enable)")



---

### Where to go next
- Swap out the dataset builder for your own feature pipeline.
- Embed additional preprocessing inside `train_model_with_preprocessing` or replace it entirely.
- Turn on the Snowflake and serving flags, then run the CLI (`run_pipeline.py`) and compare its output with this notebook's.

