# Snowflake Model Registry Walkthrough

Exploration of the linear-regression showcase.

> We'll work iteratively—looking at code, running a cell, and immediately inspecting the artifacts. Feel free to run cells as we go; each one is designed to be fast and self-contained.



In [5]:
import json



In [3]:
from pathlib import Path
from pprint import pprint

import yaml

from core import (
    pipeline_config_from_mapping,
    run_pipeline,
    DataConfig,
    TrainConfig,
    RegistryConfig,
    PipelineSteps,
)

## 1. Start with a lightweight config

Rather than editing YAML off to the side, we'll build a dictionary in the notebook (fast.ai style) and convert it to the `PipelineConfig` dataclass. This keeps the workflow reproducible and easy to tweak mid-session.



In [4]:
base_cfg = {
    "data": {
        "n_samples": 2000,  # keep things snappy for interactive runs
        "csv_path": "notebook_synthetic_data.csv",
        "upload_to_snowflake": False,  # disable remote writes while we explore
    },
    "steps": {
        "log_model": False,  # skip registry logging locally; we'll show how later
    },
}

cfg = pipeline_config_from_mapping(base_cfg)
cfg


PipelineConfig(data=DataConfig(n_samples=2000, n_features=20, random_state=42, csv_path=PosixPath('notebook_synthetic_data.csv'), upload_to_snowflake=False, connection_name='legalzoom', database='ML_SHOWCASE', data_schema='DATA', table_name='SYNTHETIC_DATA'), train=TrainConfig(test_size=0.2, random_state=42, scaler_path=PosixPath('scaler.pkl'), model_path=PosixPath('model.pkl'), test_data_path=PosixPath('test_data.csv'), metrics_path=PosixPath('model_metrics.json')), registry=RegistryConfig(connection_name='legalzoom', database='ML_SHOWCASE', schema='MODELS', model_name='LINEAR_REGRESSION_CUSTOM', user_files={'preprocessing': ['scaler.pkl']}, conda_dependencies=['snowflake::scikit-learn==1.3.0', 'snowflake::pandas==2.0.3', 'snowflake::numpy==1.24.3'], python_version='3.10', enable_explainability=False, target_platform_mode='WAREHOUSE_ONLY'), steps=PipelineSteps(generate_data=True, train_model=True, verify_pickles=True, log_model=False), serving=ServingConfig(enabled=False, compute_pool
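The output above shows defaults filled in for every key we left out of `base_cfg`. To make that behaviour concrete, here is a minimal sketch of how a partial nested dict can be merged into dataclass defaults. It is not the code in `core.py`: the classes `DataSection`, `StepsSection`, and `SketchConfig`, the helper `from_mapping`, and the defaults marked "assumed" are stand-ins invented for this example.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from pathlib import Path
from typing import Mapping


@dataclass
class DataSection:
    n_samples: int = 10_000                      # assumed default, not from core.py
    n_features: int = 20
    csv_path: Path = Path("synthetic_data.csv")  # assumed default
    upload_to_snowflake: bool = True             # assumed default


@dataclass
class StepsSection:
    generate_data: bool = True
    train_model: bool = True
    verify_pickles: bool = True
    log_model: bool = True                       # assumed default


@dataclass
class SketchConfig:
    data: DataSection = field(default_factory=DataSection)
    steps: StepsSection = field(default_factory=StepsSection)


def from_mapping(cls, mapping):
    """Build `cls` from a (possibly partial) nested mapping, keeping defaults."""
    cfg = cls()  # start from the defaults
    valid = {f.name for f in fields(cls)}
    for name, value in mapping.items():
        if name not in valid:
            raise KeyError(f"unknown key for {cls.__name__}: {name!r}")
        current = getattr(cfg, name)
        if is_dataclass(current) and isinstance(value, Mapping):
            # Recurse into nested sections such as "data" or "steps".
            value = from_mapping(type(current), value)
        elif isinstance(current, Path):
            # Strings from the dict become Path objects, as in the repr above.
            value = Path(value)
        setattr(cfg, name, value)
    return cfg


sketch_cfg = from_mapping(
    SketchConfig,
    {
        "data": {"n_samples": 2000, "csv_path": "notebook_synthetic_data.csv",
                 "upload_to_snowflake": False},
        "steps": {"log_model": False},
    },
)
print(sketch_cfg.data.n_samples, sketch_cfg.data.n_features)  # 2000 20
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo such as `"upload_to_snowflak"` fails loudly instead of silently leaving the default in place.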

## 2. Run the local pipeline

Notebooks always *do the thing*, so let's run the pipeline and look at the outputs. Because we disabled the Snowflake calls, this cell should finish in a couple of seconds.

In [6]:
results = run_pipeline(cfg)
results.keys()

2025-11-10 17:51:32 | INFO | core | GENERATING SYNTHETIC DATASET
2025-11-10 17:51:32 | INFO | core | Dataset summary: samples=2,000, features=20
2025-11-10 17:51:32 | INFO | core | Target mean=45.09 std=208.06
2025-11-10 17:51:32 | INFO | core | SAVING DATA TO CSV
2025-11-10 17:51:32 | INFO | core | Saved data to notebook_synthetic_data.csv (0.79 MB)
2025-11-10 17:51:32 | INFO | core | Snowflake upload skipped (upload_to_snowflake=False)
2025-11-10 17:51:32 | INFO | core | Split data: train=1,600, test=400 (test_size=20%)
2025-11-10 17:51:32 | INFO | custom_model | TRAINING MODEL WITH PREPROCESSING
2025-11-10 17:51:32 | INFO | custom_model | 1. Fitting Custom Z-Scaler...
2025-11-10 17:51:32 | INFO | custom_model |    Scaler fitted
2025-11-10 17:51:32 | INFO | custom_model |    Features: 20
2025-11-10 17:51:32 | INFO | custom_model |    Mean range: [-0.05, 0.05]
2025-11-10 17:51:32 | INFO | custom_model |    Std range: [0.94, 1.03]
2025-11-10 17:51:32 | INFO | custom_model | 2. Training

dict_keys(['dataframe', 'csv_path', 'table_name', 'metrics', 'verification_passed'])

We now have a pandas DataFrame in memory, metrics persisted to disk, and verified pickles—all without touching Snowflake.
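What "verified pickles" means in practice is roughly a save, reload, and compare round trip. The sketch below illustrates the idea with a tiny z-score scaler built on the standard library only. It is an assumption about the kind of check involved, not the actual `verify_pickles` step, and `fit_z_scaler`, `transform`, and `verify_pickle` are hypothetical names.

```python
import pickle
import statistics
import tempfile
from pathlib import Path


def fit_z_scaler(rows):
    """Compute the per-column mean and population std for a list of rows."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(col) for col in columns],
        # Fall back to 1.0 so a constant column does not divide by zero.
        "std": [statistics.pstdev(col) or 1.0 for col in columns],
    }


def transform(params, rows):
    """Apply (x - mean) / std column by column."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["std"])]
        for row in rows
    ]


def verify_pickle(params, path, sample_rows):
    """Save, reload, and confirm the reloaded scaler transforms identically."""
    path = Path(path)
    with path.open("wb") as fh:
        pickle.dump(params, fh)
    with path.open("rb") as fh:
        reloaded = pickle.load(fh)
    return transform(reloaded, sample_rows) == transform(params, sample_rows)


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
params = fit_z_scaler(rows)
scaled = transform(params, rows)

with tempfile.TemporaryDirectory() as tmp:
    ok = verify_pickle(params, Path(tmp) / "scaler_sketch.pkl", rows)
print(ok)  # True
```

Comparing transformed outputs rather than the raw parameters checks what matters downstream: that the reloaded artifact behaves the same way at inference time.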



## 3. Inspect artifacts

Fast.ai notebooks celebrate curiosity—let's peek inside the pieces we just created.



In [7]:
results["dataframe"].head()


| | ID | FEATURE_00 | FEATURE_01 | FEATURE_02 | FEATURE_03 | FEATURE_04 | FEATURE_05 | FEATURE_06 | FEATURE_07 | FEATURE_08 | ... | FEATURE_11 | FEATURE_12 | FEATURE_13 | FEATURE_14 | FEATURE_15 | FEATURE_16 | FEATURE_17 | FEATURE_18 | FEATURE_19 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.811768 | 0.010015 | -1.497826 | -0.75213 | 0.286496 | -0.46929 | 0.449818 | 0.509838 | -0.414505 | ... | 1.3576 | 0.850633 | -1.005819 | 0.220245 | -1.172288 | -1.520067 | 0.682041 | 0.773936 | 0.74192 | -8.431706 |
| 1 | 2 | 0.36576 | 0.518219 | -0.179515 | -0.39309 | -0.954618 | 0.346117 | -1.941568 | 0.492014 | 0.158095 | ... | -1.492462 | -0.506189 | 0.894233 | -1.567988 | 2.06422 | -1.001315 | -0.766323 | -1.710603 | -2.384906 | -528.622889 |
| 2 | 3 | -0.328375 | 0.409141 | -0.874199 | -0.587738 | 0.204798 | 0.644311 | 1.107721 | 0.146476 | -0.80059 | ... | 0.638187 | 1.414841 | -0.111847 | -1.443201 | 0.523324 | 1.744311 | -0.45269 | 1.592025 | 0.566602 | 219.172623 |
| 3 | 4 | -0.247118 | 0.855351 | 0.313623 | 0.576917 | 1.620112 | -2.406278 | 1.368677 | 0.883493 | 0.859173 | ... | -0.144827 | 1.44768 | 0.039319 | 1.692713 | -1.272305 | -0.339397 | 0.423944 | -0.511764 | 0.365831 | 403.424173 |
| 4 | 5 | 0.985311 | -0.531123 | -0.055674 | 0.08856 | -0.941764 | 0.614847 | -0.924623 | -0.47453 | 1.862419 | ... | 0.016415 | -1.178452 | 1.090107 | 0.128195 | -0.419826 | -1.40384 | 0.726033 | 0.070687 | -0.456796 | -194.605547 |


In [8]:
metrics_path = cfg.train.metrics_path
with metrics_path.open() as fh:
    metrics = json.load(fh)

metrics

{'train': {'mse': 98.7510018487586,
  'rmse': 9.937353865529728,
  'mae': 7.984539740318553,
  'r2': 0.9977743020602813},
 'test': {'mse': 100.45261972353805,
  'rmse': 10.022605435890313,
  'mae': 8.00694078191657,
  'r2': 0.9974141111691142}}

In [9]:
pprint(metrics)



{'test': {'mae': 8.00694078191657,
          'mse': 100.45261972353805,
          'r2': 0.9974141111691142,
          'rmse': 10.022605435890313},
 'train': {'mae': 7.984539740318553,
           'mse': 98.7510018487586,
           'r2': 0.9977743020602813,
           'rmse': 9.937353865529728}}


Notice how the RMSE and MAE match the values logged during the training run. Both come straight out of the shared `core.py` utilities, so the notebook and the CLI report the same numbers.
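If you want to convince yourself that the four metrics hang together, the standard definitions fit in a few lines. The sketch below assumes the usual formulas (MSE, RMSE as its square root, MAE, and R² as one minus the ratio of residual to total sum of squares). How `core.py` actually computes them is not shown here, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Standard regression metrics using plain Python lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(toy["mse"], toy["mae"])  # 0.125 0.25

# Consistency check on the values reported above: rmse should equal sqrt(mse).
reported = {
    "train": {"mse": 98.7510018487586, "rmse": 9.937353865529728},
    "test": {"mse": 100.45261972353805, "rmse": 10.022605435890313},
}
consistent = all(
    math.isclose(math.sqrt(m["mse"]), m["rmse"], rel_tol=1e-9)
    for m in reported.values()
)
print(consistent)
```

As a rough cross-check, the log reported a target standard deviation of 208.06 for the full dataset, and 1 - 98.75 / 208.06² is about 0.9977, in line with the reported R². The match is only approximate because the train split has its own variance.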



## 4. Taking it all the way to Snowflake

When you're ready to promote the run, flip the switches that we disabled earlier. Here's the minimum set of changes:

1. Set `base_cfg["data"]["upload_to_snowflake"] = True`
2. Set `base_cfg["steps"]["log_model"] = True`
3. (Optional) Toggle `cfg.serving.enabled = True` and fill in `compute_pool` if you want SPCS deployment.

You can do this directly in the notebook (just re-run the config cell) or export the dict to `pipeline.yml` for the CLI tools.
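One way to do the export is sketched below, assuming PyYAML (the `yaml` module already imported at the top of this notebook). The helpers `promoted_config` and `export_config` are made up for this example, and whether the CLI tools expect exactly this layout in `pipeline.yml` is an assumption worth checking against `run_pipeline.py`. The sketch writes to a temporary directory so it cannot clobber a real config file.

```python
import tempfile
from copy import deepcopy
from pathlib import Path

import yaml  # PyYAML

# Same shape as the base_cfg dict built earlier in the notebook.
base_cfg = {
    "data": {
        "n_samples": 2000,
        "csv_path": "notebook_synthetic_data.csv",
        "upload_to_snowflake": False,
    },
    "steps": {"log_model": False},
}


def promoted_config(cfg_dict):
    """Return a copy with the Snowflake switches on; the input is not mutated."""
    promoted = deepcopy(cfg_dict)
    promoted["data"]["upload_to_snowflake"] = True
    promoted["steps"]["log_model"] = True
    return promoted


def export_config(cfg_dict, path):
    """Write the dict as YAML, preserving key order, and return the path."""
    path = Path(path)
    path.write_text(yaml.safe_dump(cfg_dict, sort_keys=False))
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = export_config(promoted_config(base_cfg), Path(tmp) / "pipeline.yml")
    text = out.read_text()

round_tripped = yaml.safe_load(text)
print(text)
```

Working on a deep copy keeps the exploratory `base_cfg` in its safe, local-only state, so re-running earlier cells will not start writing to Snowflake by accident.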



In [None]:
# Uncomment the following lines when you're ready to push a real model to Snowflake.
# base_cfg["data"]["upload_to_snowflake"] = True
# base_cfg["steps"]["log_model"] = True
# cfg = pipeline_config_from_mapping(base_cfg)
# run_pipeline(cfg)

print("Snowflake deployment is opt-in: flip the switches above when you're ready.")



---

**Next steps**
- Pair this notebook with `run_pipeline.py --summary` to compare outputs.
- Use the generated artifacts (`notebook_synthetic_data.csv`, `scaler.pkl`, etc.) as fixtures in integration tests.
- Drop into `deploy_service.py` once you're comfortable with the Snowflake flow and want an interactive SPCS deployment walkthrough.
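
To make the fixture idea concrete, here is a minimal sketch of a structural check you could run against a saved metrics file in an integration test. The expected keys mirror the `metrics` output shown earlier. `check_metrics_file` is a hypothetical helper, and the demo writes its own temporary copy rather than reading the real `model_metrics.json`.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_SPLITS = ("train", "test")
EXPECTED_KEYS = {"mse", "rmse", "mae", "r2"}


def check_metrics_file(path):
    """Load a metrics JSON file and assert it has the expected structure."""
    metrics = json.loads(Path(path).read_text())
    for split in EXPECTED_SPLITS:
        assert split in metrics, f"missing split: {split}"
        missing = EXPECTED_KEYS - set(metrics[split])
        assert not missing, f"{split} is missing keys: {sorted(missing)}"
        assert metrics[split]["mse"] >= 0, f"{split} mse must be non-negative"
    return metrics


# Demo with the values from this notebook, written to a throwaway file.
sample = {
    "train": {"mse": 98.7510018487586, "rmse": 9.937353865529728,
              "mae": 7.984539740318553, "r2": 0.9977743020602813},
    "test": {"mse": 100.45261972353805, "rmse": 10.022605435890313,
             "mae": 8.00694078191657, "r2": 0.9974141111691142},
}
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "model_metrics.json"
    fixture.write_text(json.dumps(sample))
    loaded = check_metrics_file(fixture)

print(sorted(loaded))  # ['test', 'train']
```

In a real test suite you would point the helper at the artifact the pipeline wrote (for example `cfg.train.metrics_path`) instead of a temporary file.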

