# Dependencies and imports

In [1]:
# Requirements.
%pip install "build>=1.0" "torch>=2.0" "transformers" -e "../../libs/buildlib" "pytest" "aiohttp"

Obtaining file:///Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/libs/buildlib
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: buildlib
  Building editable for buildlib (pyproject.toml) ... done
  Created wheel for buildlib: filename=buildlib-1.0.0-0.editable-py3-none-any.whl size=1457 sha256=bb5f929badaf3a24a415310e4e3fd7ec4afd3cc133dccd77d15ff3abdfc6ca71
  Stored in directory: /private/var/folders/_t/r3r35b_50cl44c_njstr2ykw0000gn/T/pip-ephem-wheel-cache-umvhwiqk/wheels/98/aa/29/91c30cc69abfc62073290dcfac26d918a56b77f98bcd5394c7
Successfully built buildlib
Installing collected packages: buildlib
  Attempting uninstall: buildlib
    Found existing installatio

In [2]:
import shutil
from pathlib import Path

import buildlib
import pytest
from build import ProjectBuilder
from transformers.models.bert.modeling_bert import BertModel
from transformers.models.bert.tokenization_bert_fast import BertTokenizerFast




# Introduction

To create a custom text embedding service, we need to build a Docker image, push it to Snowflake, and tell Snowflake to deploy it. In addition to building, pushing, and deploying, this notebook also covers how to test your Docker image locally to speed up the development cycle.

# Building

To build an image for our custom service, we put custom data (a `data/` directory) and logic (an `embed.py`, plus a `requirements.txt` specifying any dependencies) into the `build/` directory, and then we use `buildlib` to run the build process (`buildlib.build()`).
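Before calling `buildlib.build()`, it can help to fail fast if one of these pieces is missing. The sketch below is a hypothetical helper (`missing_build_inputs` is not part of `buildlib`) that checks only for the items named above.

```python
from pathlib import Path
from typing import List

# Items the build step expects to find in `build/` (per the description above).
REQUIRED_BUILD_ITEMS = ("data", "embed.py", "requirements.txt")


def missing_build_inputs(build_dir: Path) -> List[str]:
    """Hypothetical helper: return the required items absent from `build_dir`."""
    missing = []
    for name in REQUIRED_BUILD_ITEMS:
        path = build_dir / name
        # `data` must be a directory; the other two must be regular files.
        ok = path.is_dir() if name == "data" else path.is_file()
        if not ok:
            missing.append(name)
    return missing
```

For example, `assert not missing_build_inputs(BUILD_DIR)` just before building. This notebook also writes a `config.py` into `build/`, which you could add to the tuple.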

## What is `buildlib`?

`buildlib` is a small Python package that hard-codes much of the boilerplate needed to deploy a text embedding service. On the Docker side, it uses the [python-on-whales](https://github.com/gabrieldemarmiesse/python-on-whales) package to provide easy-to-use Python functions that trigger Docker builds specifically for the text embedding service. On the Snowflake side, it uses plain f-string templating and the Snowflake Python connector to generate and run the SQL commands that set up your service in Snowflake. If you're ever curious about what a particular `buildlib` function is doing behind the scenes, don't be afraid to check out the source code; it's not that complicated!
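To make the "plain f-string templating" idea concrete, here is a minimal sketch (not `buildlib`'s actual code): render a SQL template with an f-string, then split it into individual statements, dropping blank lines and `--` comments, much like the statements echoed by the setup cell later in this notebook. `render_setup_sql` and `split_sql_statements` are hypothetical names, and the naive split on `;` assumes no semicolons appear inside string literals or `$$` blocks.

```python
from textwrap import dedent
from typing import List


def render_setup_sql(database: str, schema: str, role: str) -> str:
    # Plain f-string templating: values are pasted into the SQL text verbatim,
    # so they must be trusted, valid identifiers.
    return dedent(
        f"""
        -- create a new database and schema
        create database if not exists {database};
        create schema if not exists {database}.{schema};

        -- grant access to a role
        grant all on database {database} to role {role};
        """
    )


def split_sql_statements(sql: str) -> List[str]:
    # Drop blank lines and full-line `--` comments, then split on `;`.
    # Naive: assumes no `;` inside string literals or `$$ ... $$` bodies.
    lines = [
        line.strip()
        for line in sql.splitlines()
        if line.strip() and not line.strip().startswith("--")
    ]
    statements = " ".join(lines).split(";")
    return [s.strip() + ";" for s in statements if s.strip()]
```

With a live connection, the Snowflake Python connector's `SnowflakeConnection.execute_string()` can also run a multi-statement script in a single call.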

## Clear out `build/`

Let's start with a clean slate by deleting and recreating an empty `build/`.

In [3]:
BUILD_DIR = Path(".").resolve().parents[1] / "build"
shutil.rmtree(BUILD_DIR, ignore_errors=True)
BUILD_DIR.mkdir()

## Prepare the data

For this example, we will use the [`e5-base-v2`](https://huggingface.co/intfloat/e5-base-v2) model from Microsoft. We'll demonstrate how to preload the model weights directly into our Docker image for simpler deployment.

Rather than using the [`transformers`](https://pypi.org/project/transformers/) library directly, we'll also demonstrate how to include a custom library in the Docker image by building a [pure Python wheel](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#pure-python-wheels) and placing it in `data/` alongside the model weights. The library we use for this demonstration is `embed_lib/`, a small example library we wrote to call the E5 model via `transformers`.
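The wheel's filename matters, because later in this notebook `requirements.txt` refers to the wheel by its exact name. Wheel filenames follow the `{name}-{version}(-{build})?-{python}-{abi}-{platform}.whl` convention, and a pure Python wheel carries the `none` ABI tag and the `any` platform tag, which is what lets a wheel built on a Mac install into the Linux image. The helper below is a hypothetical sketch (not part of `build` or `buildlib`) that parses a filename and checks for those tags.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its tags (ignoring the optional build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # name-version-build-python-abi-platform: drop the build tag.
        parts = parts[:2] + parts[3:]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename: {filename}")
    return WheelName(*parts)


def is_pure_python(filename: str) -> bool:
    """True if the wheel is not tied to a specific ABI or platform."""
    wheel = parse_wheel_filename(filename)
    return wheel.abi_tag == "none" and wheel.platform_tag == "any"
```

After building, `is_pure_python(Path(wheel_filename).name)` should be `True` for `embed_lib`.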

### Put model weights into `build/data/`

In [4]:
# Downloading the model weights.
MODEL_NAME = "intfloat/e5-base-v2"
MODEL_EMBEDDING_DIM = 768
DATA_DIR = BUILD_DIR / "data"
TOKENISER_DIR = DATA_DIR / "tokenizer"
MODEL_DIR = DATA_DIR / "model"

print(f"Downloading {MODEL_NAME} and saving tokenizer and weights to {DATA_DIR}")

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)
assert isinstance(model, BertModel)  # This is for typechecking.
tokenizer.save_pretrained(TOKENISER_DIR)
model.save_pretrained(MODEL_DIR)

# Validate that our saved files work by loading from them.
tokenizer = BertTokenizerFast.from_pretrained(TOKENISER_DIR)
model = BertModel.from_pretrained(MODEL_DIR)

Downloading intfloat/e5-base-v2 and saving tokenizer and weights to /Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/data


### Put packaged code into `build/data/`

In [5]:
# Package our Python package into a wheel in the `data/` directory.
wheel_filename = ProjectBuilder(Path(".") / "embed_lib").build(
    distribution="wheel", output_directory=DATA_DIR
)
print(f"Built {wheel_filename}")

running bdist_wheel
running build
running build_py
running egg_info
writing src/embed_lib.egg-info/PKG-INFO
writing dependency_links to src/embed_lib.egg-info/dependency_links.txt
writing requirements to src/embed_lib.egg-info/requires.txt
writing top-level names to src/embed_lib.egg-info/top_level.txt
reading manifest file 'src/embed_lib.egg-info/SOURCES.txt'
writing manifest file 'src/embed_lib.egg-info/SOURCES.txt'
installing to build/bdist.macosx-11.1-arm64/wheel
running install
running install_lib
creating build/bdist.macosx-11.1-arm64/wheel
creating build/bdist.macosx-11.1-arm64/wheel/embed_lib
copying build/lib/embed_lib/e5.py -> build/bdist.macosx-11.1-arm64/wheel/embed_lib
copying build/lib/embed_lib/__init__.py -> build/bdist.macosx-11.1-arm64/wheel/embed_lib
copying build/lib/embed_lib/_batch_iter_util.py -> build/bdist.macosx-11.1-arm64/wheel/embed_lib
running install_egg_info
Copying src/embed_lib.egg-info to build/bdist.macosx-11.1-arm64/wheel/embed_lib-1.0.0-py3.8.egg-in

## Prepare the logic

To actually use our custom package, load our model weights, and perform text embedding, we need to implement the core embedding logic. We do this by implementing `get_embed_fn()` inside a file called `embed.py`. `get_embed_fn` should load the model weights and return a function that maps a single input, a `Sequence[str]`, to a single 2D `numpy` array of dtype `np.float32`. Below is an example.
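A quick way to check a candidate `embed` function against this contract is a small harness like the one below. It is a sketch: `check_embed_fn` and `fake_embed` are hypothetical, and the expected shape `(len(texts), embedding_dim)` (one row per input text) is our assumption about what "2D array" means here.

```python
from typing import Callable, Sequence

import numpy as np

EmbedFn = Callable[[Sequence[str]], np.ndarray]


def check_embed_fn(embed: EmbedFn, texts: Sequence[str], embedding_dim: int) -> np.ndarray:
    """Hypothetical harness: assert the output matches the expected contract."""
    out = embed(texts)
    assert isinstance(out, np.ndarray), type(out)
    assert out.dtype == np.float32, out.dtype
    # Assumed: one row per input text, one column per embedding dimension.
    assert out.shape == (len(texts), embedding_dim), out.shape
    return out


def fake_embed(texts: Sequence[str]) -> np.ndarray:
    """Stand-in for a real model: random unit-length float32 vectors of size 8."""
    rng = np.random.default_rng(0)
    vectors = rng.standard_normal((len(texts), 8)).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors
```

With the real logic you would call `check_embed_fn(get_embed_fn(), ["hello world!"], MODEL_EMBEDDING_DIM)`; running that outside the image requires `BUILD_ROOT` to be set and `embed_lib` to be installed.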

### Implement `get_embed_fn` inside `build/embed.py`

In [6]:
%%writefile ../../build/embed.py
import logging
import os
from pathlib import Path
from typing import Callable
from typing import Sequence

import embed_lib.e5
import numpy as np
import torch

# Batch size passed to `embed_lib.e5.embed` on each call.
MAX_BATCH_SIZE = 4


def get_embed_fn() -> Callable[[Sequence[str]], np.ndarray]:
    # Load the model into memory once; the returned `embed` closure reuses it.
    logger = logging.getLogger(__name__)
    # `BUILD_ROOT` is expected to point at the build directory, which holds `data/`.
    data_dir = Path(os.environ["BUILD_ROOT"]) / "data"
    logger.info("[get_embed_fn] Loading model from disk to memory")
    e5_model = embed_lib.e5.load_e5_model(
        tokenizer_path=data_dir / "tokenizer", model_path=data_dir / "model"
    )
    logger.info("[get_embed_fn] CUDA available: %s", torch.cuda.is_available())

    def embed(texts: Sequence[str]) -> np.ndarray:
        # Embed in batches of at most MAX_BATCH_SIZE texts, with normalization on.
        result_tensor = embed_lib.e5.embed(
            e5_model=e5_model,
            texts=texts,
            batch_size=MAX_BATCH_SIZE,
            normalize=True,
            progress_bar=False,
        )
        # Move the result to the CPU and return it as a float32 NumPy array.
        result_array = result_tensor.cpu().numpy().astype(np.float32)
        return result_array

    return embed

Writing ../../build/embed.py
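`embed.py` passes `batch_size=MAX_BATCH_SIZE` to `embed_lib.e5.embed`. We don't show `embed_lib`'s internals (including its `_batch_iter_util.py` module) here, but the usual pattern is to walk the input in fixed-size chunks so that peak memory is bounded by the batch size rather than by the request size. `iter_batches` below is a hypothetical sketch of that pattern, not `embed_lib`'s code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Under this pattern, with `MAX_BATCH_SIZE = 4`, a 10-text request would be processed as batches of 4, 4, and 2.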


### Add a `build/requirements.txt`

We also need to specify the requirements for our embedding logic. During the build, we will populate the `BUILD_ROOT` environment variable, which enables you to include custom packages (like your `embed_lib` wheel) in your `requirements.txt` by absolute filepath.

In [7]:
%%writefile ../../build/requirements.txt

${BUILD_ROOT}/data/embed_lib-1.0.0-py3-none-any.whl

Writing ../../build/requirements.txt
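pip expands environment variables written in the `${VAR}` form (uppercase letters, digits, and underscores only) inside requirements files, which is what makes the `${BUILD_ROOT}` line above work. The function below is a simplified re-implementation of that substitution, for illustration only, so you can preview what pip will see; `expand_requirement_line` is a hypothetical name and, like pip, it leaves references to unset (or empty) variables untouched.

```python
import os
import re
from typing import Mapping, Optional

# `${NAME}` references where NAME uses uppercase letters, digits, and underscores.
ENV_VAR_RE = re.compile(r"\$\{([A-Z0-9_]+)\}")


def expand_requirement_line(line: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute `${VAR}` references; leave unset or empty variables as-is."""
    env = os.environ if env is None else env

    def substitute(match: "re.Match") -> str:
        value = env.get(match.group(1))
        return value if value else match.group(0)

    return ENV_VAR_RE.sub(substitute, line)
```

Note that a bare `$BUILD_ROOT` (without braces) is not expanded; the `/build` value used in the checks is an arbitrary example.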


## Configure

NOTE: We use f-string templating here to inject the value of the `MODEL_EMBEDDING_DIM` constant into the file. You can also use the same `%%writefile` magic as above, but in that case you must write the literal value (i.e., `768`) rather than the name of the constant, since the constant is not defined in the file you're writing.

In [8]:
(BUILD_DIR / "config.py").write_text(
f"""from service_config import Configuration

USER_CONFIG = Configuration(embedding_dim={MODEL_EMBEDDING_DIM})
"""
)

89
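Because `embedding_dim` is written into `config.py` separately from the model, it's worth confirming that the two agree. `save_pretrained` writes a `config.json` next to the weights, and for BERT-style models like E5 its `hidden_size` field gives the width of the hidden states (and so of the embeddings, assuming `embed_lib` pools them without a projection). `check_embedding_dim` is a hypothetical helper that reads that file with the standard library.

```python
import json
from pathlib import Path


def check_embedding_dim(model_dir: Path, expected_dim: int) -> None:
    """Hypothetical check: compare `hidden_size` in the saved config to `expected_dim`."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    hidden_size = config["hidden_size"]
    if hidden_size != expected_dim:
        raise ValueError(
            f"config.py says embedding_dim={expected_dim}, "
            f"but the model's hidden_size is {hidden_size}"
        )
```

In this notebook, `check_embedding_dim(MODEL_DIR, MODEL_EMBEDDING_DIM)` would run the check against the weights saved earlier.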

## Build!

Now that all the pieces are in place inside `build/`, we can trigger a build via `buildlib`.

In [9]:
# We now have all the pieces we need to build our service!
list(BUILD_DIR.iterdir()) + list(DATA_DIR.iterdir())

[PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/config.py'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/embed.py'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/requirements.txt'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/data'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/data/tokenizer'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/data/model'),
 PosixPath('/Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service/build/data/embed_lib-1.0.0-py3-none-any.whl')]

In [10]:
# If you are building on a Mac with an ARM CPU, building and running for AMD64 may
# be quite slow due to CPU emulation (e.g., perf tests will look really slow).
# To get around this, you can start by building and testing locally for ARM CPUs.
# (However, since Snowpark Container Services uses AMD64 CPUs, you still need to do the AMD64 build before deploying!)
# buildlib.build(build_dir=BUILD_DIR, platform="linux/arm64", tag="arm_for_local_testing")
buildlib.build(build_dir=BUILD_DIR, platform="linux/amd64", tag="latest")

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 2.48kB done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
#3 DONE 1.5s

#4 [ 1/14] FROM docker.io/nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04@sha256:f3a7fb39fa3ffbe54da713dd2e93063885e5be2f4586a705c39031b8284d379a
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 439.02MB 2.6s done
#5 DONE 2.6s

#6 [ 9/14] COPY ./services_common_code ./services_common_code
#6 CACHED

#7 [ 5/14] RUN chmod +x ~/miniconda.sh &&     bash ~/miniconda.sh -b -p /opt/conda &&     rm ~/miniconda.sh &&     /opt/conda/bin/conda install -y python=3.8 &&     /opt/conda/bin/conda clean -ya
#7 CACHED

#8 [ 6/14] COPY ./libs ./libs
#8 CACHED

#9 [ 7/14] COPY ./service_api ./service_api
#9 CACHED

#10 [ 8/14] COPY ./service_embed_loop ./service_embed_loop
#10 CACHED

#11 [ 4/14] RUN 
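Rather than toggling the commented-out `buildlib.build(...)` line by hand, you could pick the local-testing platform from the host CPU using the standard library's `platform.machine()`, which typically reports `arm64` on Apple silicon Macs, `aarch64` on ARM Linux, and `x86_64` (or `AMD64` on Windows) on Intel/AMD machines. `local_docker_platform` is a hypothetical helper; the build you deploy should still target `linux/amd64`.

```python
import platform
from typing import Optional


def local_docker_platform(machine: Optional[str] = None) -> str:
    """Pick a Docker platform that runs natively on this host (for local tests only)."""
    machine = (machine or platform.machine()).lower()
    if machine in ("arm64", "aarch64"):
        return "linux/arm64"
    # Default to AMD64, which is also what Snowpark Container Services runs.
    return "linux/amd64"
```

Whatever tag you build the local image with must match the tag you later pass to `buildlib.run_container_context`.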

# Test locally

You can use `pytest` on the tests in `testing/tests` to quickly check whether the service works as expected.

In [11]:
test_path = BUILD_DIR.parent / "testing" / "tests" / "test_end_to_end.py"
# with buildlib.run_container_context(tag="arm_for_local_testing"):
with buildlib.run_container_context():
    pytest.main([str(test_path)])

platform darwin -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /Users/lmerrick/Code/sfguide-text-embedding-snowpark-container-service
collected 1 item

../../testing/tests/test_end_to_end.py .                                 [100%]
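`buildlib.run_container_context()` wraps the test run in a `with` block, presumably so that the container is running during the tests and is cleaned up afterwards. We haven't shown `buildlib`'s implementation; the sketch below is a hypothetical illustration of that pattern, with `start` and `stop` standing in for whatever Docker calls it actually makes. The `try`/`finally` guarantees `stop` runs even if the body raises. Also note that `pytest.main` returns an exit code rather than raising on failure, so checking `pytest.main([...]) == 0` in the cell is one way to make a failing run fail loudly.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def running(start: Callable[[], None], stop: Callable[[], None]) -> Iterator[None]:
    """Hypothetical pattern: start a resource, yield, and always stop it."""
    start()
    try:
        yield
    finally:
        # Runs on normal exit and also when the body raises an exception.
        stop()
```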



## Performance testing

You can use the performance testing scripts to see how your configuration performs. Keep in mind that [Snowflake has a 30-second timeout](https://docs.snowflake.com/en/sql-reference/external-functions-implementation) that can be triggered if too many large simultaneous requests bog down the service. Setting small batch sizes and a small internal queue size helps mitigate timeout issues. The default values are conservative (good for running medium-sized models on CPU), but you may want to try different parameters (by supplying more keyword arguments to `Configuration` in the `config.py` file above), especially if you will be using a large CPU instance type or a GPU instance type in the compute pool that runs your service.
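As a rough back-of-the-envelope check (a simplification, not a model of how the service actually schedules work): if a queue of `queued_rows` rows is processed in batches of `batch_size` rows taking about `seconds_per_batch` each, draining it takes roughly `ceil(queued_rows / batch_size) * seconds_per_batch`. The hypothetical helpers below compute that estimate, which shows why a small queue helps keep each request under the 30-second limit.

```python
import math

SNOWFLAKE_TIMEOUT_SECONDS = 30.0


def estimated_wait_seconds(queued_rows: int, batch_size: int, seconds_per_batch: float) -> float:
    """Rough time to drain the queue: number of batches times per-batch latency."""
    batches = math.ceil(queued_rows / batch_size)
    return batches * seconds_per_batch


def at_risk_of_timeout(queued_rows: int, batch_size: int, seconds_per_batch: float) -> bool:
    """True if the rough estimate exceeds the 30-second limit."""
    wait = estimated_wait_seconds(queued_rows, batch_size, seconds_per_batch)
    return wait > SNOWFLAKE_TIMEOUT_SECONDS
```

For instance, 256 queued rows in batches of 4 at 0.5 s per batch gives 64 batches, or 32 s, which is over the limit, while 64 queued rows gives 8 s. The 0.5 s figure is made up; substitute timings from your own perf runs.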

In [None]:
# buildlib.start_locally(tag="arm_for_local_testing")
buildlib.start_locally()

In [None]:
# If `last_retry_time` exceeds 30 seconds for any query, that is equivalent to the
# conditions in which you can expect a timeout error when serving in Snowflake.
!python ../../testing/perf_test_scripts/checking_for_timeouts.py

In [None]:
buildlib.stop_locally()

# Deploy

Now that we have built our image and made sure it passes local tests, we can deploy our service.

In [12]:
from getpass import getpass
from textwrap import dedent
import snowflake.connector


# Edit these parameters.
connection_params = {
    "account"   : input("Account: "),
    "user"      : input("Username: "),
}

# Establish and configure connection.
connection_params["password"] = getpass("Password: ")
connection = snowflake.connector.connect(**connection_params)
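If you rerun this notebook often, you might prefer to read credentials from environment variables and only prompt for whatever is missing. The sketch below assumes hypothetical variable names `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, and `SNOWFLAKE_PASSWORD`; the resulting dict can be passed to `snowflake.connector.connect(**params)` just like `connection_params` above.

```python
import os
from getpass import getpass
from typing import Callable, Dict, Mapping, Optional

# Hypothetical environment-variable names; rename to suit your setup.
ENV_NAMES = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
}


def connection_params_from_env(
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = input,
    secret_prompt: Callable[[str], str] = getpass,
) -> Dict[str, str]:
    """Read each parameter from `env`, prompting only for the missing ones."""
    env = os.environ if env is None else env
    params = {}
    for key, var in ENV_NAMES.items():
        value = env.get(var)
        if not value:
            # Never echo the password while typing it.
            ask = secret_prompt if key == "password" else prompt
            value = ask(f"{key.capitalize()}: ")
        params[key] = value
    return params
```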


## Setup prerequisites

To deploy a service, you need to have a compute pool, a database and schema, and an image repository already set up. You also need a role that you can assume which has the permissions necessary to set up the service. Below is an example setup script.

**NOTE:** If you write your embedding code to exploit GPU acceleration (as we do in this walkthrough), choosing a GPU instance type (like GPU_3, which we use in the walkthrough) for your compute pool will unlock strong text embedding throughput. However, for lightweight models and low-throughput use cases (e.g., fewer than 10 short single-row embeddings per second), you may be able to get by on a modest CPU instance type like STANDARD_1 or STANDARD_3.

In [13]:
compute_pool = "text_embed_gpu"
database = "custom_ml"
schema = "ml"
role = "embed_text_manager"
image_repository = "image_repo"
spec_stage = "service_specs"
setup_sql_commands = dedent(
    f"""
    use role accountadmin;

    -- create a compute pool
    create compute pool if not exists {compute_pool}
    min_nodes = 1
    max_nodes = 1
    instance_family = gpu_3;

    -- create a new database and schema
    create or replace database {database};
    create or replace schema {schema};

    -- give yourself a new role to manage text embedding
    create or replace role {role};
    grant all on database {database} to role {role};
    grant all on schema {database}.{schema} to role {role};
    grant usage on compute pool {compute_pool} to role {role};
    grant monitor on compute pool {compute_pool} to role {role};
    grant role {role} to user {connection_params["user"]};

    -- use the new role to set up the image repo and service spec stage
    use role {role};
    use database {database};
    use schema {schema};
    create or replace image repository {image_repository};
    create or replace stage {spec_stage};
    """
)
buildlib._run_sql(connection, setup_sql_commands)

use role accountadmin;
-- create a compute pool
create compute pool if not exists text_embed_gpu
min_nodes = 1
max_nodes = 1
instance_family = gpu_3;
-- create a new database and schema
create or replace database custom_ml;
create or replace schema ml;
-- give yourself a new role to manage text embedding
create or replace role embed_text_manager;
grant all on database custom_ml to role embed_text_manager;
grant all on schema custom_ml.ml to role embed_text_manager;
grant usage on compute pool text_embed_gpu to role embed_text_manager;
grant monitor on compute pool text_embed_gpu to role embed_text_manager;
grant role embed_text_manager to user admin;
-- use the new role to set up the image repo and service spec stage
use role embed_text_manager;
use database custom_ml;
use schema ml;
create or replace image repository image_repo;
create or replace stage service_specs;
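Because the setup script interpolates these names straight into SQL, a typo or stray character produces either a confusing SQL error or an unintended statement. A cheap safeguard is to check each name against the rules for unquoted Snowflake identifiers (start with a letter or underscore; then only letters, digits, underscores, or dollar signs; at most 255 characters) before rendering the SQL. `assert_unquoted_identifier` is a hypothetical helper, not part of `buildlib`.

```python
import re

# Unquoted Snowflake identifier: letter or underscore, then letters, digits, `_`, or `$`.
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]{0,254}")


def assert_unquoted_identifier(name: str) -> str:
    """Raise ValueError unless `name` is safe to paste unquoted into SQL."""
    if not _UNQUOTED_IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a valid unquoted identifier: {name!r}")
    return name
```

For example, loop over `(compute_pool, database, schema, role, image_repository, spec_stage)` and call the helper on each before building `setup_sql_commands`.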


## Push the image

You can use the `buildlib.push` convenience function to push your image. If you're already logged into the image repository in Docker, you can omit the `username` and `password` arguments and pass `skip_login=True` instead.

In [14]:
image_repository_url = (
    f"{connection_params['account']}.registry.snowflakecomputing.com/"
    f"{database}/{schema}/{image_repository}"
)
buildlib.push(
    repo_url=image_repository_url,
    username=connection_params["user"],
    password=connection_params["password"],
    tag="latest",
    skip_login=False,
)

Login Succeeded
Pushing <accountname>.registry.snowflakecomputing.com/custom_ml/ml/image_repo/embed_text_service:latest
The push refers to repository [<accountname>.registry.snowflakecomputing.com/custom_ml/ml/image_repo/embed_text_service]
20d0cc1c6a81: Preparing
af514d472114: Preparing
7591540d1176: Preparing
518c15b0779f: Preparing
f4724c52227c: Preparing
6e716c10ddae: Preparing
51084bf0cad3: Preparing
7d123e6b7f6d: Preparing
bc51bc5bba77: Preparing
07d57957db0f: Preparing
7cb7ebd732dc: Preparing
fc4161923a73: Preparing
5f70bf18a086: Preparing
30cb8a64ad61: Preparing
600c676771a0: Preparing
6ac15100dff6: Preparing
40f0eb1871b9: Preparing
8d113b7b997c: Preparing
cd77f58b80cd: Preparing
e4b1bddcbe63: Preparing
765423415d69: Preparing
7b9433fba79b: Preparing
256d88da4185: Preparing
6e716c10ddae: Waiting
51084bf0cad3: Waiting
7d123e6b7f6d: Waiting
bc51bc5bba77: Waiting
07d57957db0f: Waiting
7cb7ebd732dc: Waiting
fc4161923a73: Waiting
5f70bf18a086: Waiting
30cb8a64ad61: Waiting
600c67677
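The push output and the service spec refer to the same image in two forms: a full registry reference (`<accountname>.registry.snowflakecomputing.com/custom_ml/ml/image_repo/embed_text_service:latest`) for `docker push`, and a host-less path (`/custom_ml/ml/image_repo/embed_text_service:latest`) in the spec's `image:` field. The hypothetical helper below derives both from the same pieces; the default image name `embed_text_service` is taken from the output in this notebook, and `myaccount` in the check is a placeholder.

```python
from typing import NamedTuple


class ImageRefs(NamedTuple):
    push_ref: str  # used with `docker push`
    spec_ref: str  # used in the service specification's `image:` field


def image_refs(
    account: str,
    database: str,
    schema: str,
    repository: str,
    image: str = "embed_text_service",
    tag: str = "latest",
) -> ImageRefs:
    """Hypothetical helper: build the registry and spec forms of one image reference."""
    path = f"/{database}/{schema}/{repository}/{image}:{tag}"
    host = f"{account}.registry.snowflakecomputing.com"
    return ImageRefs(push_ref=host + path, spec_ref=path)
```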

## Create the service

We're almost there! We've built and pushed our image; now we just need to put together a service spec and run a `create service ...;` statement. Luckily, `buildlib.deploy_service` makes this straightforward.

In [18]:
buildlib.deploy_service(
    connection,
    embedding_dim=MODEL_EMBEDDING_DIM,
    # If you use a CPU-instance-type pool, set `num_gpus=0`.
    num_gpus=1,
    role=role,
    database=database,
    schema=schema,
    spec_stage=spec_stage,
    compute_pool=compute_pool,
    image_repository=image_repository,
    # If for some reason your image repository is in a different database/schema
    # you can specify a separate database/schema, too. Buildlib expects the spec
    # stage to be in the same database/schema as the service, though.
    image_database=database,
    image_schema=schema,
)

use role embed_text_manager;
use database custom_ml;
use schema ml;
drop service if exists embed_text_service;
create service embed_text_service
    in compute pool text_embed_gpu
    from specification $$
        spec:
          containers:
          - image: /custom_ml/ml/image_repo/embed_text_service:latest
            name: embed-text-service
            readinessProbe:
              path: /healthcheck
              port: 8000
            resources:
              limits:
                nvidia.com/gpu: 1
              requests:
                nvidia.com/gpu: 1
          endpoint:
          - name: endpoint
            port: 8000

    $$
    min_instances = 1
    max_instances = 1;
create or replace function _embed_to_base64(input string)
    returns string
    service=embed_text_service!endpoint
    max_batch_rows=4
    as '/embed';
create or replace function _unpack_binary_array(B binary)
    returns array
    language javascript
    immutable
    as
    $$
        return Array.f
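The generated SQL pairs a service function named `_embed_to_base64` (returning a string) with a JavaScript UDF `_unpack_binary_array` (taking `binary`), which suggests embeddings travel from the service as base64-encoded binary and are unpacked into an array inside Snowflake. The exact byte layout isn't visible here (the output above is truncated), so the round trip below assumes little-endian `float32` values, consistent with `embed.py` returning `np.float32` arrays; `pack_embedding` and `unpack_embedding` are hypothetical names.

```python
import base64
import struct
from typing import List, Sequence


def pack_embedding(values: Sequence[float]) -> str:
    """Encode floats as little-endian float32 bytes, then as base64 text."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def unpack_embedding(encoded: str) -> List[float]:
    """Inverse of `pack_embedding`: base64 text -> list of float32 values."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))
```

Under this assumption, a 768-dimensional embedding packs into 3,072 bytes, or exactly 4,096 base64 characters.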

# Watch it come up

In [26]:
import json

status_json = connection.cursor().execute(
    "CALL SYSTEM$GET_SERVICE_STATUS('custom_ml.ml.embed_text_service');"
).fetchone()[0]
print(json.loads(status_json)[0])

{'status': 'READY', 'message': 'Running', 'containerName': 'embed-text-service', 'instanceId': '0', 'serviceName': 'EMBED_TEXT_SERVICE', 'image': '<accountname>.registry.snowflakecomputing.com/custom_ml/ml/image_repo/embed_text_service:latest', 'restartCount': 0, 'startTime': '2023-12-19T01:00:50Z'}
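Right after `create service`, the status may not be `READY` yet. Instead of re-running the status cell by hand, you can poll. The sketch below takes a `fetch_status_json` callable (hypothetical; for example, a lambda that runs the same `CALL SYSTEM$GET_SERVICE_STATUS(...)` query as above and returns `fetchone()[0]`) and parses the result the same way as the cell above, checking the first entry's `status` field.

```python
import json
import time
from typing import Callable


def wait_until_ready(
    fetch_status_json: Callable[[], str],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 10.0,
) -> dict:
    """Poll until the first container reports READY; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = json.loads(fetch_status_json())[0]
        if status["status"] == "READY":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"service not ready: {status}")
        time.sleep(poll_seconds)
```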


In [27]:
print(connection.cursor().execute("CALL SYSTEM$GET_SERVICE_LOGS('custom_ml.ml.embed_text_service', '0', 'embed-text-service');").fetchone()[0])


== CUDA ==

CUDA Version 12.1.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

INFO:     Started server process [1]
INFO:     Waiting for application startup.
2023-12-19 01:00:51,708 __main__ INFO: Beginning application setup
2023-12-19 01:00:51,708 __main__ INFO: Allocating shared-memory data structures
2023-12-19 01:00:51,714 __main__ INFO: Launching lo

# Success!

Text embedding should now be live! Here's an example of giving all users access to the `embed_text` function and calling this function via a preexisting warehouse called `compute_wh`.

In [28]:
buildlib._run_sql(
    connection,
    "use role accountadmin; "
    "grant usage on database custom_ml to role public; "
    "grant usage on schema custom_ml.ml to role public; "
    "use role embed_text_manager; "
    "grant usage on function custom_ml.ml.embed_text(text) to role public; "
    "use role public; "
    "use warehouse compute_wh;"
)
print(connection.cursor().execute("select embed_text('hello world!');").fetchone()[0])

use role accountadmin;
grant usage on database custom_ml to role public;
grant usage on schema custom_ml.ml to role public;
use role embed_text_manager;
grant usage on function custom_ml.ml.embed_text(text) to role public;
use role public;
use warehouse compute_wh;
[0.0032345708459615707, -0.008582721464335918, -0.037687599658966064, -0.00331854703836143, 0.04481322318315506, -0.03027017042040825, 0.0320584774017334, 0.05281173065304756, -0.0010827032383531332, -0.01863020844757557, -0.02498818002641201, 0.047690510749816895, -0.0871417447924614, 0.01861654967069626, -0.031328581273555756, 0.00654643913730979, 0.02404150180518627, -0.00824214331805706, 0.03759678453207016, -0.020063813775777817, -0.048249438405036926, -0.05238138884305954, 0.04970218241214752, -0.008793247863650322, 0.005695943720638752, 0.0111410366371274, -0.005848483182489872, 0.001621723989956081, -0.0469050332903862, -0.03712565079331398, 0.01579623855650425, 0.03618093207478523, 0.058089494705200195, -0.051545158
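Because `embed.py` calls `embed_lib.e5.embed` with `normalize=True`, the returned embeddings are presumably unit-length (assuming `normalize` means L2 normalization), so the cosine similarity of two embeddings reduces to their dot product. A minimal pure-Python sketch for comparing two `embed_text` results once fetched (the Python connector returns `ARRAY` values as JSON text, so `json.loads` them first):

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """General form; equals `dot(a, b)` when both vectors have unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```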