In [1]:
# Initialization boilerplate
from typing import *
import time
import ray
import transformers
import torch
import torchvision
import numpy as np
import pandas as pd

import copy
import os

transformers.logging.set_verbosity_error()

def reboot_ray():
    if ray.is_initialized():
        ray.shutdown()

    if torch.cuda.is_available():
        return ray.init(num_gpus=1)
    else:     
        return ray.init()

pass

In [2]:
reboot_ray()

2021-08-09 11:41:36,732	INFO services.py:1267 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


{'node_ip_address': '192.168.0.239',
 'raylet_ip_address': '192.168.0.239',
 'redis_address': '192.168.0.239:6379',
 'object_store_address': '/tmp/ray/session_2021-08-09_11-41-35_233184_3247/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-08-09_11-41-35_233184_3247/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-08-09_11-41-35_233184_3247',
 'metrics_export_port': 57518,
 'node_id': 'c7398c11516e63334b45cd9be388dcccb40d05b7774f8a28cfce6236'}

# How to Load Deep Learning Models 300 Times Faster with Ray

*One of the challenges of using deep learning in production is managing the cost of loading huge models for inference. In this article, we'll show how you can reduce this cost almost to zero by leveraging features of PyTorch and Ray.*

## Introduction

Deep learning models are big and cumbersome. Because of their size, they take a long time to load. This model loading cost leads to a great deal of engineering effort when deploying models in production. Model inference platforms like [TFX](https://www.tensorflow.org/tfx/serving/serving_basic), [TorchServe](https://github.com/pytorch/serve), and [IBM Spectrum Conductor Deep Learning Impact](https://www.ibm.com/products/spectrum-deep-learning-impact?cm_mmc=text_extensions_for_pandas) run deep learning models inside dedicated, long-lived processes and containers, with lots of complex code to start and stop containers and to pass data between them.

![Block diagram of the TorchServe model inference platform, showing how TorchServe dedicates a pool of dedicated, long-lived processes to each model in order to amortize model loading costs. Source: https://github.com/pytorch/serve; License: Apache V2](images/torch_serve_arch.jpg)

But what if this conventional wisdom isn't entirely correct? What if there was a way to load a deep learning model in a tiny fraction of a second? It might be possible to run model inference in production with a much simpler architecture.

Let's see how fast we can make model loading go.

## Background: BERT 

For the examples in this article, we'll use the [BERT](https://arxiv.org/abs/1810.04805) masked language model. BERT belongs to a group of general-purpose models that capture the nuances of human language in a (relatively) compact format. You can use these models to do many different natural language processing (NLP) tasks, ranging from document classification to machine translation. However, to do any task with high accuracy, you need to start with a model trained on your target language and [*fine-tune* the model](https://towardsdatascience.com/fine-tuning-a-bert-model-with-transformers-c8e49c4e008b) for the task.

Tuning a BERT model for a task effectively creates a new model. If your application needs to perform three tasks in three different languages, you'll need *nine* copies of BERT --- one for each combination of language and task. This proliferation of models creates  headaches in production. Being able to load and unload different BERT-based model really fast would save a lot of trouble.

Let's start by loading up a BERT model in the most straightforward way.

## Loading a BERT Model

The [transformers library](https://github.com/huggingface/transformers) from [Huggingface](https://huggingface.co/) provides convenient ways to load different variants of BERT. The code snippet that follows shows how to load `bert-base-uncased`, a medium-sized model with about 420 MB of parameters.

In [3]:
bert = transformers.BertModel.from_pretrained("bert-base-uncased")

The `transformers.BertModel.from_pretrained()` method follows PyTorch's [recommended practice](https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-load-state-dict-recommended) for loading models: First, construct an instance of your model, which should be a subclass of `torch.nn.Module`. Then use `torch.load()` to load a PyTorch *state dictionary* of model weights. Finally, call your model's `load_state_dict()` method to copy the model weights from the state dictionary into your model's `torch.Tensor` objects.

This method takes about 1.4 seconds to load BERT on my laptop, provided that the model is on local disk. That's fairly impressive for a model that's over 400MB in size, but it's still a long time. For comparison, running inference with this model only takes a fraction of a second.

The main reason this method is so slow is that it is optimized for reading models in a portable way over a slow network connection. It copies the model's parameters several times while building the state dictionary, then it copies them some more while installing the weights into the model's Python object.

PyTorch has an [alternate model loading method](https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-load-entire-model) that gives up some compatibility but only copies model weights once. Here's what the code to load BERT with that method looks like:

In [4]:
# Serialize the model we loaded in the previous code listing.
torch.save(bert, "outputs/bert.pt")

# Load the model back in.
bert_2 = torch.load("outputs/bert.pt")

This method loads BERT in 0.125 seconds on my laptop. That's 11 times faster.

If dropping the number of copies to 1 makes model loading that much faster, imagine what would happen if we dropped the number of copies to zero! Is it possible to do that?

## Zero-Copy Model Loading

It turns out that we can indeed load PyTorch models while copying weights zero times. We can achieve this goal by leveraging some features of PyTorch and Ray.

First, some background on [Ray](https://ray.io). Ray is an open source system for building high-performance distributed applications. One of Ray's unique features is its main-memory object store, [Plasma](https://docs.ray.io/en/master/serialization.html#plasma-store). Plasma uses shared memory to pass objects between processes on each machine in a Ray cluster. Ray uses Plasma's shared memory model to implement zero-copy transfer of [NumPy](https://numpy.org/) arrays. If a Ray [task](https://docs.ray.io/en/master/walkthrough.html#remote-functions-tasks) needs to read a NumPy array from Plasma, the task can access the array's data directly out of shared memory without copying any data into its local heap.

So if we store the weights of a model as NumPy arrays on Plasma, we can access those weights directly out of Plasma's shared memory segments, without making any copies. 

But we still need to connect those weights to the rest of the PyTorch model, which requires them to be wrapped in PyTorch `Tensor` objects. The standard method of creating a `Tensor` involves copying the contents of the tensor, but PyTorch also has an alternate code path for initializing `Tensor`s *without* performing a copy. You can access this code path by passing your NumPy array to `torch.as_tensor()` instead of using `Tensor.__new__()`.

With all of this background information in mind, here's a high-level overview of how to do zero-copy model loading from Plasma. 

First, you need to load the model into the Plasma object store, which is a three-step process:

1. Load the model from disk.
2. Separate the original PyTorch model into its weights and its graph of operations, and convert the weights to NumPy arrays.
3. Upload the NumPy arrays and the model (minus weights) to Plasma.

Once the model and its weights are in object storage, it becomes possible to do a zero-copy load of the model. Here are the steps to follow:

1. Deserialize the model (minus weights) from Plasma
2. Extract the weights from Plasma (without copying any data)
3. Wrap the weights in PyTorch `Tensors` (without copying any data)
4. Install the weight tensors back in the reconstructed model (without copying any data)

If a copy of the model is in the local machine's Plasma shared memory segment, these steps will load load BERT in **0.004 seconds**. That's **340 times faster** than loading the model with `BertModel.from_pretrained()`!

![Comparison of running times for different ways of loading the bert-base-uncased model. BertModel.from_pretrained() takes 1.4 seconds, torch.load() takes 0.125 seconds, and zero-copy loading takes 0.004 seconds.](images/bert_load_times.png)

This loading time is an order of magnitude less than the time it takes to run one inference request on this model with a general purpose CPU. That means that you can load the model *on demand* with almost no performance penalty. There's need to spin up a dedicated model serving platform or a Ray [actor pool](https://docs.ray.io/en/master/actors.html#actor-pool), tying up resources for models that aren't currently running inference.

## The Details

Let's break down how to implement each of the steps for zero-copy model loading, starting with getting the model onto Plasma in an appropriate format.

We've already covered how to load a PyTorch model from disk. The next step after that initial loading is to separate the model into its weights and its graph of operations, converting the weights to NumPy arrays. Here's a Python function that will do all those things:

In [4]:
def extract_tensors(m: torch.nn.Module) -> Tuple[torch.nn.Module, List[Dict]]:
    """
    Remove the tensors from a PyTorch model, convert them to NumPy
    arrays, and return the stripped model and tensors.
    """
    tensors = []
    for _, module in m.named_modules():
        # Store the tensors in Python dictionaries
        params = {
            name: torch.clone(param).detach().numpy()
            for name, param in module.named_parameters(recurse=False)
        }
        buffers = {
            name: torch.clone(buf).detach().numpy()
            for name, buf in module.named_buffers(recurse=False)
        }
        tensors.append({"params": params, "buffers": buffers})
    
    # Make a copy of the original model and strip all tensors and
    # temporary buffers out of the copy.
    m_copy = copy.deepcopy(m)
    for _, module in m_copy.named_modules():
        for name in ([name for name, _ in module.named_parameters(recurse=False)]
                     + [name for name, _ in module.named_buffers(recurse=False)]):
            setattr(module, name, None)   

    # Make sure the copy is configured for inference.
    m_copy.train(False)
    return m_copy, tensors

Most PyTorch models are built on top the PyTorch class `torch.nn.Module`. The model is a graph of Python objects, and every object is a subclasses of `Module`.

The `Module` class provides two places to store model weights: *parameters* for weights that are trained by gradient descent, and *buffers* for weights that are trained in other ways. Lines 6-17 of the listing above iterate over the components of the model, pull out the parameters and buffers, and convert their values to NumPy arrays. Then lines 21-25 create a copy of the model and remove all the weights from the copy. Finally, line 29 returns the copy and the converted weight tensors as a Python tuple.

We can pass the return value from this function directly to `ray.put()` to upload the model and its weights onto Plasma. Here's what the upload operation looks like.

In [6]:
bert_ref = ray.put(extract_tensors(bert))

The variable `bert_ref` here is a Ray object reference. We can retrieve the model and weights by passing this object reference to `ray.get()`, as in the following listing.

In [7]:
bert_skeleton, bert_weights = ray.get(bert_ref)

If the object that `bert_ref` points to isn't available on the current node of your Ray cluster, the first attempt to read the model will block while Ray [downloads the object to the node's local shared memory segment](https://github.com/ray-project/ray/blob/c1b9f921a614a0927013ff0daeb6e130aaebb473/src/ray/core_worker/store_provider/plasma_store_provider.cc#L274). Subsequent calls to `ray.get(bert_ref)` will return the local copy immediately.

Now we need to convert `bert_weights` from NumPy arrays to `torch.Tensor` objects and attach them to the model in `bert_skeleton`, all without performing any additional copies. Here is a Python function that does those steps.

In [8]:
def replace_tensors(m: torch.nn.Module, tensors: List[Dict]):
    """
    Restore the tensors that extract_tensors() stripped out of a 
    PyTorch model.
    :param no_parameters_objects: Skip wrapping tensors in 
     ``torch.nn.Parameters`` objects (~20% speedup, may impact
     some models)
    """
    modules = [module for _, module in m.named_modules()] 
    for module, tensor_dict in zip(modules, tensors):
        # There are separate APIs to set parameters and buffers.
        for name, array in tensor_dict["params"].items():
            module.register_parameter(name, 
                torch.nn.Parameter(torch.as_tensor(array)))
        for name, array in tensor_dict["buffers"].items():
            module.register_buffer(name, torch.as_tensor(array))    

This function does roughly the same thing as PyTorch's `load_state_dict()` function, except that it avoids copying tensors. The `replace_tensors()` function modifies the reconstituted model in place. After calling `replace_tensors()`, we can run the model, producing the same results as the original copy of the model. Here's some code that shows running a BERT model after loading its weights with `replace_tensors()`.

In [11]:
# Load tensors into the model's graph of Python objects
replace_tensors(bert_skeleton, bert_weights)

# Preprocess an example input string for BERT.
test_text = "All work and no play makes Jack a dull boy."
tokenizer = transformers.BertTokenizerFast.from_pretrained(
    "bert-base-uncased")
test_tokens = tokenizer(test_text, return_tensors="pt")

# Run the original model and the copy that we just loaded
print("Original model's output:")
print(bert(**test_tokens).last_hidden_state)
print("\nModel output after zero-copy model loading:")
print(bert_skeleton(**test_tokens).last_hidden_state)

Original model's output:
tensor([[[-0.1153,  0.2566, -0.2220,  ..., -0.3130,  0.6333,  0.6588],
         [ 0.2769,  0.5195,  0.2059,  ..., -0.1062,  1.1186,  0.3836],
         [ 0.9019,  0.7557, -0.1615,  ...,  0.0588,  0.3570, -0.0296],
         ...,
         [ 0.0155, -0.0602,  0.3365,  ..., -0.0936,  0.8055, -0.5007],
         [ 0.6198,  0.2695, -0.3402,  ...,  0.0860, -0.3373, -0.4606],
         [ 0.8493,  0.3726, -0.2073,  ..., -0.1145, -0.5216, -0.4418]]],
       grad_fn=<NativeLayerNormBackward>)

Model output after zero-copy model loading:
tensor([[[-0.1153,  0.2566, -0.2220,  ..., -0.3130,  0.6333,  0.6588],
         [ 0.2769,  0.5195,  0.2059,  ..., -0.1062,  1.1186,  0.3836],
         [ 0.9019,  0.7557, -0.1615,  ...,  0.0588,  0.3570, -0.0296],
         ...,
         [ 0.0155, -0.0602,  0.3365,  ..., -0.0936,  0.8055, -0.5007],
         [ 0.6198,  0.2695, -0.3402,  ...,  0.0860, -0.3373, -0.4606],
         [ 0.8493,  0.3726, -0.2073,  ..., -0.1145, -0.5216, -0.4418]]],
    

## Caveats

The first time you call the `replace_tensors()` function, PyTorch will print out a warning:

```
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. [...]
```

Most PyTorch models don't modify their own weights during inference, but PyTorch doesn't prevent models from doing so. If you load your weights via the zero-copy method and your model modifies a weights tensor, it will change the copy of those weights in Plasma's shared memory. Ray (as of version 1.4) [always opens shared memory segments in read-write mode](https://github.com/ray-project/plasma/blob/7d6acc7af2878fc932ec5314cbcda0e79a9d6a4b/src/plasma_client.c#L111). 

If you're sure that you model does not not modify its own weights during inference, you can safely ignore this warning. You can test for these modifications by comparing your model's weights before and after inference. If your model does modify some of its weights, it's important to copy the relevant tensors prior to running inference.

Another thing to note is that this method loads the model for CPU-based inference. To use GPU acceleration, you will need to call the model's `cuda()` method to copy all the tensors to the GPU device. Copying our example BERT model's weights to GPU memory takes about 0.07 seconds.

## Conclusion


[project CodeFlare](https://www.research.ibm.com/blog/codeflare-ml-experiments?cm_mmc=text_extensions_for_pandas]

[IBM Cloud Code Engine](https://www.ibm.com/cloud/code-engine?cm_mmc=text_extensions_for_pandas)

## (This part not in blog) Source code for timing measurements

faster version that can skip wrapping models in Parameter objects

In [9]:
# Timing measurements
# Don't include this cell in the blog

# A version of replace_tensors() that optionally allows a slightly 
# faster but slightly dangerous shortcut when loading Parameters. 
def replace_tensors_direct(m: torch.nn.Module, tensors: List[Dict]):
    """
    Restore the tensors that extract_tensors() stripped out of a 
    PyTorch model.
    """
    modules = [module for _, module in m.named_modules()] 
    for module, tensor_dict in zip(modules, tensors):
        # There are separate APIs to set parameters and buffers.
        for name, array in tensor_dict["params"].items():
            # Super fast, somewhat risky version avoids 
            # wrapping parameters in Parameters objects.
            module._parameters[name] = torch.as_tensor(array)
        for name, array in tensor_dict["buffers"].items():
            module.register_buffer(name, torch.as_tensor(array))         
        
def restore_from_plasma(model_and_tensors_ref):
    model, tensors = ray.get(model_and_tensors_ref)
    replace_tensors(model, tensors)
    return model

def restore_from_plasma_direct(model_and_tensors_ref):
    model, tensors = ray.get(model_and_tensors_ref)
    replace_tensors_direct(model, tensors)
    return model

bert_model_name = "bert-base-uncased"


# Begin comparison
print("MODEL LOADING TIMINGS:\n")

# Baseline: Load via the official API
print("Loading via official API:")
bert_times = %timeit -o -r 100 transformers.BertModel.from_pretrained(bert_model_name)
bert = transformers.BertModel.from_pretrained(bert_model_name)

# Baseline 2: torch.load()
print("Loading with torch.load():")
bert = transformers.BertModel.from_pretrained(bert_model_name)
bert_file = "outputs/bert.pt"
torch.save(bert, bert_file)

bert_2_times = %timeit -o -r 100 torch.load(bert_file)
bert_2 = torch.load(bert_file)
      
# Baseline 3: ray.get()
print("Loading with ray.get():")
bert_ref = ray.put(bert)

# Ray.put() actually returns before things have completely settled down.
time.sleep(1)

bert_3_times = %timeit -o -r 100 ray.get(bert_ref)
bert_3 = ray.get(bert_ref)

# The main event: Zero-copy load 
bert_4_ref = ray.put(extract_tensors(bert))

# Ray.put() returns before things have completely settled down.
time.sleep(1)

print("Zero-copy load, using official APIs")
bert_4_times = %timeit -o -r 100 restore_from_plasma(bert_4_ref)
bert_4 = restore_from_plasma(bert_4_ref)

print("Zero-copy load, bypassing Parameter class")
bert_5_times = %timeit -o -r 100 restore_from_plasma_direct(bert_4_ref)
bert_5 = restore_from_plasma_direct(bert_4_ref)


# Test with CUDA if available
if torch.cuda.is_available():
    def restore_from_plasma_to_cuda(model_and_tensors_ref):
        model, tensors = ray.get(model_and_tensors_ref)
        replace_tensors(model, tensors)
        model.cuda()
        return model

    bert = transformers.BertModel.from_pretrained(bert_model_name)
    torch.save(bert, bert_file)
    print("Loading with torch.load() to CUDA")
    bert_2_cuda_times = %timeit -o -r 100 torch.load(bert_file).cuda()

    print("Zero-copy load to CUDA")
    bert_4_cuda_times = %timeit -o -r 100 restore_from_plasma_to_cuda(bert_4_ref)

MODEL LOADING TIMINGS:

Loading via official API:
1.39 s ± 132 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
Loading with torch.load():
106 ms ± 3.97 ms per loop (mean ± std. dev. of 100 runs, 10 loops each)
Loading with ray.get():
180 ms ± 5.42 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
Zero-copy load, using official APIs
4.21 ms ± 313 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
Zero-copy load, bypassing Parameter class
3.92 ms ± 315 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)


In [10]:
# Don't include this cell in the blog.

# Number crunching for performance graph

def stats_to_triple(timeit_output, name: str) -> Dict:
    """
    Extract out 5%-95% range and mean stats from the output of %timeit
    
    :param timeit_output: Object returned by %timeit -o
    :param name: Name for the run that produced the performance numbers
    
    :returns: Dictionary with keys "name", "5_percentile", "95_percentile", 
      and "mean", suitable for populating one row of a DataFrame.
    """
    times = np.array(timeit_output.all_runs) / timeit_output.loops
    return {
        "name": name,
        "5_percentile": np.percentile(times, 5),
        "95_percentile": np.percentile(times, 96),
        "mean": np.mean(times)
    }

name_to_run = {
    "from_pretrained()": bert_times,
    "torch.load()": bert_2_times,
    "ray.get()": bert_3_times,
    "zero_copy": bert_4_times,
    "zero_copy_hack": bert_5_times,
}

if torch.cuda.is_available():
    name_to_run["torch.load() CUDA"] = bert_2_cuda_times
    name_to_run["zero_copy CUDA"] = bert_4_cuda_times


records = [
    stats_to_triple(times, name) for name, times in name_to_run.items()
]

timings = pd.DataFrame.from_records(records)
timings

Unnamed: 0,name,5_percentile,95_percentile,mean
0,from_pretrained(),1.297487,1.749576,1.392883
1,torch.load(),0.102273,0.113069,0.106403
2,ray.get(),0.171977,0.191029,0.18048
3,zero_copy,0.003145,0.004494,0.004208
4,zero_copy_hack,0.002938,0.004254,0.003916


## (This part not in blog) Measure how long inference takes

In [27]:
# Don't include this cell in the blog.

# Inference timings

# Redo tokenization to make this cell self-contained
test_text = "All work and no play makes Jack a dull boy."
tokenizer = transformers.BertTokenizerFast.from_pretrained(
    "bert-base-uncased")
test_tokens = tokenizer(test_text, return_tensors="pt")

# Common code to run inference
def run_bert(b, t):
    with torch.no_grad():
        return b(**t).last_hidden_state

print("LOCAL INFERENCE TIMINGS:\n")

# Reload from scratch each time to be sure we aren't using stale values
print("Original model, no CUDA:")
bert = transformers.BertModel.from_pretrained("bert-base-uncased")
%timeit run_bert(bert, test_tokens)

print("Zero-copy model loading, no CUDA:")
bert = transformers.BertModel.from_pretrained("bert-base-uncased")
bert_ref = ray.put(extract_tensors(bert))
bert_skeleton, bert_weights = ray.get(bert_ref)
replace_tensors(bert_skeleton, bert_weights)
%timeit run_bert(bert_skeleton, test_tokens)

if torch.cuda.is_available():
    
    def run_bert_cuda(b, t):
        # Inputs need to be on GPU if model is on GPU
        t = {k: v.to("cuda") for k, v in t.items()}
        with torch.no_grad():
            return b(**t).last_hidden_state
 
    print("Original model, CUDA:")
    bert = transformers.BertModel.from_pretrained("bert-base-uncased")
    bert.cuda()
    %timeit run_bert_cuda(bert, test_tokens)

    print("Zero-copy model loading, CUDA:")
    bert = transformers.BertModel.from_pretrained("bert-base-uncased")
    bert_ref = ray.put(extract_tensors(bert))
    bert_skeleton, bert_weights = ray.get(bert_ref)
    replace_tensors(bert_skeleton, bert_weights)
    bert_skeleton.cuda()
    %timeit run_bert_cuda(bert_skeleton, test_tokens)

LOCAL INFERENCE TIMINGS:

Original model, no CUDA:
52.9 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Zero-copy model loading, no CUDA:
55.2 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## (This part not in blog) Measure how long inference takes via a Ray task


In [18]:
reboot_ray()

2021-08-09 16:39:39,937	INFO services.py:1267 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


{'node_ip_address': '192.168.0.239',
 'raylet_ip_address': '192.168.0.239',
 'redis_address': '192.168.0.239:6379',
 'object_store_address': '/tmp/ray/session_2021-08-09_16-39-38_072041_3247/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-08-09_16-39-38_072041_3247/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-08-09_16-39-38_072041_3247',
 'metrics_export_port': 63244,
 'node_id': '0931f12732bf564d1b2996a84923a7f5e9e63b9e1a4620e8c74997e2'}

In [26]:
# Don't include this cell in the blog.

# Inference timings in remote process

bert = transformers.BertModel.from_pretrained("bert-base-uncased")
bert_ref = ray.put(extract_tensors(bert))

@ray.remote
def run_bert_zero_copy(tokens):
    bert_skeleton, bert_weights = ray.get(bert_ref)
    replace_tensors(bert_skeleton, bert_weights)
    with torch.no_grad():
        return bert_skeleton(**tokens).last_hidden_state.detach().numpy()

@ray.remote
def run_bert_zero_copy_cuda(tokens):
    bert_skeleton, bert_weights = ray.get(bert_ref)
    replace_tensors(bert_skeleton, bert_weights)
    bert_skeleton.cuda()
    
    # Inputs also need to be on the GPU
    tokens = {k: v.to("cuda") for k, v in tokens.items()}
    with torch.no_grad():
        return bert_skeleton(**tokens).last_hidden_state.detach().numpy()

@ray.remote
class BertActor:
    def __init__(self):
        import transformers
        transformers.logging.set_verbosity_error()
        self._bert = transformers.BertModel.from_pretrained("bert-base-uncased")
        self._bert.train(False)
    
    def run_bert(self, tokens):
        with torch.no_grad():
            return self._bert(**tokens).last_hidden_state.detach().numpy()
    
@ray.remote
class BertActorCuda:
    def __init__(self):
        import transformers
        transformers.logging.set_verbosity_error()
        self._bert = transformers.BertModel.from_pretrained("bert-base-uncased").cuda()
        self._bert.train(False)
    
    def run_bert(self, tokens):
        with torch.no_grad():
            tokens = {k: v.to("cuda") for k, v in tokens.items()}
            return self._bert(**tokens).last_hidden_state.detach().numpy()
    

# Redo tokenization to make this cell self-contained
test_text = "All work and no play makes Jack a dull boy."
tokenizer = transformers.BertTokenizerFast.from_pretrained(
    "bert-base-uncased")
test_tokens = tokenizer(test_text, return_tensors="pt")


print("REMOTE INFERENCE TIMINGS:\n")

print("Actor, no CUDA:")
actor = BertActor.remote()
%timeit -o -r 100 ray.get(actor.run_bert.remote(test_tokens))
del(actor)

print("Zero-copy, no CUDA:")
%timeit -o -r 100 ray.get(run_bert_zero_copy.remote(test_tokens))


if torch.cuda.is_available():
    print("Actor, with CUDA:")
    actor = BertActorCuda.remote()
    %timeit -o -r 100 ray.get(actor.run_bert.remote(test_tokens))
    del(actor)

    print("Zero-copy, with CUDA:")
    %timeit -o -r 100 ray.get(run_bert_zero_copy_cuda.remote(test_tokens))

REMOTE INFERENCE TIMINGS:

Actor, no CUDA:
63.9 ms ± 3.06 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
Zero-copy, no CUDA:




72 ms ± 7.38 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)


# Scratchpad

## Experiments on ResNet50

In [None]:
# Download and cache the ResNet model
resnet_model_name = "resnet50"
resnet_func = torchvision.models.resnet50
resnet_file = "outputs/resnet.pth"
# See https://pytorch.org/vision/0.8/_modules/torchvision/models/resnet.html
resnet_url = "https://download.pytorch.org/models/resnet50-0676ba61.pth"
#resnet_url = "https://download.pytorch.org/models/resnet152-b121ed2d.pth"


if not os.path.exists(resnet_file):
    os.system(f"wget -O {resnet_file} {resnet_url}")


In [None]:
# Baseline method: Instantiate the model and call load_state_dict()
%timeit resnet_func(pretrained=False)
resnet = resnet_func(pretrained=False)

In [None]:
# Baseline 2: torch.load()
torch.save(resnet, "outputs/resnet.torch")
%timeit torch.load("outputs/resnet.torch")
resnet_2 = torch.load("outputs/resnet.torch")

In [None]:
# Baseline 3: ray.get()
resnet_ref = ray.put(resnet)

# Ray.put() actually returns before things have completely settled down.
time.sleep(1)

%timeit ray.get(resnet_ref)
resnet_3 = ray.get(resnet_ref)

In [None]:
resnet_skeleton, resnet_tensors = extract_tensors(resnet)

resnet_skeleton_ref = ray.put(resnet_skeleton)
resnet_tensors_ref = ray.put(resnet_tensors)

# Ray.put() actually returns before things have completely settled down.
time.sleep(1)

%timeit -r 20 restore_from_plasma(resnet_skeleton_ref, resnet_tensors_ref)
resnet_4 = restore_from_plasma(resnet_skeleton_ref, resnet_tensors_ref)

In [None]:
import urllib
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
try: urllib.URLopener().retrieve(url, filename)
except: urllib.request.urlretrieve(url, filename)

def run_image_through_resnet(model):
    from PIL import Image
    from torchvision import transforms
    input_image = Image.open(filename)
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

    # move the input and model to GPU for speed if available
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model.to('cuda')

    with torch.no_grad():
        output = model(input_batch)
    # Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
    # The output has unnormalized scores. To get probabilities, you can run a softmax on it.
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    return(probabilities)

In [None]:
# Make sure the models still run
before_sec = time.time()
result = run_image_through_resnet(resnet)[0:10]
print(f"{1000 * (time.time() - before_sec):1.2f} msec elapsed")
result

In [None]:
before_sec = time.time()
result = run_image_through_resnet(resnet_2)[0:10]
print(f"{1000 * (time.time() - before_sec):1.2f} msec elapsed")
result

In [None]:
before_sec = time.time()
result = run_image_through_resnet(resnet_3)[0:10]
print(f"{1000 * (time.time() - before_sec):1.2f} msec elapsed")
result

In [None]:
before_sec = time.time()
result = run_image_through_resnet(resnet_4)[0:10]
print(f"{1000 * (time.time() - before_sec):1.2f} msec elapsed")
result