# **Training and Deploying TensorFlow Models at Scale**
Once you have a beautiful model that makes amazing predictions, what do you do with it? Well, you need to put it in production! This could be as simple as running the model on a batch of data, and perhaps writing a script that runs this model every night. However, it is often much more involved. Various parts of your infrastructure may need to use this model on live data, in which case you will probably want to wrap your model in a web service: this way, any part of your infrastructure can query the model at any time using a simple REST API (or some other protocol), as we discussed in Chapter 2. But as time passes, you’ll need to regularly retrain your model on fresh data and push the updated version to production. You must handle model versioning, gracefully transition from one model to the next, possibly roll back to the previous model in case of problems, and perhaps run multiple different models in parallel to perform A/B experiments. If your product becomes successful, your service may start to get a large number of queries per second (QPS), and it must scale up to support the load. A great solution to scale up your service, as you will see in this chapter, is to use TF Serving, either on your own hardware infrastructure or via a cloud service such as Google Vertex AI. It will take care of efficiently serving your model, handle graceful model transitions, and more. If you use the cloud platform, you will also get many extra features, such as powerful monitoring tools.

Moreover, if you have a lot of training data and compute-intensive models, then training time may be prohibitively long. If your product needs to adapt to changes quickly, then a long training time can be a showstopper (e.g., think of a news recommendation system promoting news from last week).

Perhaps even more importantly, a long training time will prevent you from experimenting with new ideas. In machine learning (as in many other fields), it is hard to know in advance which ideas will work, so you should try out as many as possible, as fast as possible. One way to speed up training is to use hardware accelerators such as GPUs or TPUs. To go even faster, you can train a model across multiple machines, each equipped with multiple hardware accelerators. TensorFlow’s simple yet powerful distribution strategies API makes this easy, as you will see.

In this chapter we will look at how to deploy models, first using TF Serving, then using Vertex AI. We will also take a quick look at deploying models to mobile apps, embedded devices, and web apps. Then we will discuss how to speed up computations using GPUs and how to train models across multiple devices and servers using the distribution strategies API. Lastly, we will explore how to train models and fine-tune their hyperparameters at scale using Vertex AI. That’s a lot of topics to discuss, so let’s dive in!

## **Serving a TensorFlow Model**
Once you have trained a TensorFlow model, you can easily use it in any Python code: if it’s a Keras model, just call its predict() method! But as your infrastructure grows, there comes a point where it is preferable to wrap your model in a small service whose sole role is to make predictions and have the rest of the infrastructure query it (e.g., via a REST or gRPC API). This decouples your model from the rest of the infrastructure, making it possible to easily switch model versions or scale the service up as needed (independently from the rest of your infrastructure), perform A/B experiments, and ensure that all your software components rely on the same model versions. It also simplifies testing and development, and more. You could create your own microservice using any technology you want (e.g., using the Flask library), but why reinvent the wheel when you can just use TF Serving?

### **Using TensorFlow Serving**
TF Serving is a very efficient, battle-tested model server, written in C++. It can sustain a high load, serve multiple versions of your models and watch a model repository to automatically deploy the latest versions, and more (see Figure 19-1 from the book).

![*TF Serving can serve multiple models and automatically deploy the latest version of each model*](tfserv.png)

So let’s suppose you have trained an MNIST model using Keras, and you want to deploy it to TF Serving. The first thing you have to do is export this model to the SavedModel format, introduced in Chapter 10.

#### **Exporting SavedModels**
You already know how to save a model. To produce a SavedModel here, the cell below passes the loaded Keras model to ***tf.saved_model.save()***. Now to version the model, you just need to create a subdirectory for each model version. Easy!
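As a rough sketch of this directory convention, the helper below picks the next free numeric subdirectory under a base export path. Note that `next_version_dir` is a hypothetical name, not part of TensorFlow; the only assumption it relies on is that TF Serving identifies versions by integer-named subdirectories.

```python
import tempfile
from pathlib import Path


def next_version_dir(base_dir):
    """Return base_dir/<N+1>, where N is the highest numeric subdirectory (0 if none)."""
    base = Path(base_dir)
    versions = []
    if base.exists():
        # Only integer-named directories count as model versions
        versions = [int(p.name) for p in base.iterdir()
                    if p.is_dir() and p.name.isdecimal()]
    return base / str(max(versions, default=0) + 1)


# Demo in a throwaway directory, so nothing is written next to the notebook
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "face_age_detector_saved_model"
    first = next_version_dir(base)
    first.mkdir(parents=True)
    second = next_version_dir(base)
    demo_versions = (first.name, second.name)

print(demo_versions)
```

The returned path could then be passed as the export directory instead of hardcoding `/1` or `/2`.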

In [9]:
import tensorflow as tf

model_1 = tf.keras.models.load_model("models/face_age_detector.keras")
tf.saved_model.save(model_1, "models/face_age_detector_saved_model/1")

model_2 = tf.keras.models.load_model("models/face_age_detector.keras")
tf.saved_model.save(model_2, "models/face_age_detector_saved_model/2")

  saveable.load_own_variables(weights_store.get(inner_path))
INFO:tensorflow:Assets written to: models/face_age_detector_saved_model/1/assets
  saveable.load_own_variables(weights_store.get(inner_path))
INFO:tensorflow:Assets written to: models/face_age_detector_saved_model/2/assets


It’s usually a good idea to include all the preprocessing layers in the final model you export so that it can ingest data in its natural form once it is deployed to production. This avoids having to take care of preprocessing separately within the application that uses the model. Bundling the preprocessing steps within the model also makes it simpler to update them later on and limits the risk of mismatch between a model and the preprocessing steps it requires.

> #### **WARNING**
> Since a SavedModel saves the computation graph, it can only be used with models that are based exclusively on TensorFlow operations, excluding the ***tf.py_function()*** operation, which wraps arbitrary Python code.

TensorFlow comes with a small ***saved_model_cli*** command-line interface to inspect SavedModels. Let’s use it to inspect our exported model:

In [None]:
!saved_model_cli show --dir models/face_age_detector_saved_model/1 --all

2025-08-09 14:56:04.725836: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754740564.763741   13449 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754740564.779919   13449 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754740564.807312   13449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754740564.807409   13449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754740564.807426   13449 computation_placer.cc:177] computation placer alr

In [50]:
def inspect_model(model_path):
    m = tf.keras.models.load_model(model_path)
    m.summary()
    print("Input shape:", getattr(m, "input_shape", None))
    print("Output shape:", getattr(m, "output_shape", None))
    print("Layer count:", len(m.layers))

In [51]:
inspect_model("models/better_face_age_detector.keras")

Input shape: (None, 224, 224, 3)
Output shape: (None, 5)
Layer count: 244


What does this output mean? Well, a SavedModel contains one or more *metagraphs*. A metagraph is a computation graph plus some function signature definitions, including their input and output names, types, and shapes. Each metagraph is identified by a set of tags. For example, you may want to have a metagraph containing the full computation graph, including the training operations: you would typically tag this one as "train". And you might have another metagraph containing a pruned computation graph with only the prediction operations, including some GPU-specific operations: this one might be tagged as "***serve***", "***gpu***". You might want to have other metagraphs as well. This can be done using TensorFlow’s low-level [SavedModel](https://homl.info/savedmodel) API. However, when you save a Keras model using its ***save()*** method, it saves a single metagraph tagged as "***serve***". Let’s inspect this "***serve***" tag set:

In [None]:
!saved_model_cli show --dir models/face_age_detector_saved_model/1 --tag_set serve

2025-08-09 14:14:04.421768: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754738044.452292  167328 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754738044.462459  167328 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754738044.488061  167328 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754738044.488099  167328 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754738044.488104  167328 computation_placer.cc:177] computation placer alr

This metagraph contains two signature definitions: an initialization function called "***__saved_model_init_op***", which you do not need to worry about, and a default serving function called "***serving_default***". When saving a Keras model, the default serving function is the model’s ***call()*** method, which makes predictions, as you already know. Let’s get more details about this serving function:

In [None]:
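# %env sets variables in the kernel's environment, which later "!" commands inherit
# (a "!set ..." line runs in its own short-lived shell and has no lasting effect)
%env CUDA_VISIBLE_DEVICES=-1
%env TF_CPP_MIN_LOG_LEVEL=2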
!saved_model_cli show --dir models/face_age_detector_saved_model/1 --tag_set serve --signature_def serving_default

2025-08-09 14:14:36.686486: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754738076.719782  167586 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754738076.729982  167586 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1754738076.756449  167586 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754738076.756488  167586 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1754738076.756493  167586 computation_placer.cc:177] computation placer alr

Note that the function’s input is named "***flatten_input***", and the output is named "***dense_1***". These correspond to the Keras model’s input and output layer names. You can also see the type and shape of the input and output data. Looks good!

Now that you have a SavedModel, the next step is to install TF Serving.

#### **Installing and Starting TensorFlow Serving**
There are many ways to install TF Serving: using the system’s package manager, using a Docker image, installing from source, and more. Since Colab runs on Ubuntu, we can use Ubuntu’s apt package manager like this:
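The installation cell itself is missing from this export. Judging by the description that follows and by the apt instructions in TF Serving's setup guide, it probably resembled the commands below. Treat the repository URL, the key file name, and the sources file path as assumptions to check against the current documentation. The sketch only assembles and prints the commands (`tf_serving_install_commands` is a hypothetical helper); in Colab, each line would be run with a leading `!`, with `%pip` in place of `pip`.

```python
def tf_serving_install_commands(
    repo_url="https://storage.googleapis.com/tensorflow-serving-apt",  # assumed URL
):
    """Return shell commands for an apt-based TF Serving install (assumed layout)."""
    sources = "stable tensorflow-model-server tensorflow-model-server-universal"
    return [
        # 1. add TensorFlow's package repository to apt's list of sources
        f"echo 'deb {repo_url} {sources}' > /etc/apt/sources.list.d/tensorflow-serving.list",
        # 2. download the public GPG key and add it to the package manager's key list
        f"curl '{repo_url}/tensorflow-serving.release.pub.gpg' | apt-key add -",
        # 3. install the model server package
        "apt update -q && apt-get install -y tensorflow-model-server",
        # 4. install the client library used to talk to the server
        "pip install -q -U tensorflow-serving-api",
    ]


for command in tf_serving_install_commands():
    print(command)
```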

This code starts by adding TensorFlow’s package repository to Ubuntu’s list of package sources. Then it downloads TensorFlow’s public GPG key and adds it to the package manager’s key list so it can verify TensorFlow’s package signatures. Next, it uses apt to install the ***tensorflow-model-server*** package. Lastly, it installs the ***tensorflow-serving-api*** library, which we will need to communicate with the server.

Now we want to start the server. The command will require the absolute path of the base model directory (i.e., the path to ***face_age_detector_saved_model***, not to one of its version subdirectories such as ***1***), so let’s save that to the ***MODEL_DIR*** environment variable:

In [10]:
import os
from pathlib import Path

model_base_dir = Path("models/face_age_detector_saved_model").absolute()

os.environ["MODEL_DIR"] = str(model_base_dir)
print("MODEL_DIR is:", os.environ["MODEL_DIR"])

MODEL_DIR is: /media/jaxon/Peaceful_daddy_s/CONTENTS/AI/CODE/Neural Networks and Deep Learning/CHAPTER_19/models/face_age_detector_saved_model


In [12]:
# Quote the path: it contains spaces, so an unquoted path would be split into several arguments
!ls -l "$MODEL_DIR"


We can then start the server:

In [11]:
%%bash --bg
export MODEL_DIR="${MODEL_DIR}"
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=face_age_detector \
  --model_base_path="${MODEL_DIR}" > my_server.log 2>&1


In Jupyter or Colab, the ***%%bash --bg*** magic command executes the cell as a bash script, running it in the background. The ***> my_server.log 2>&1*** part redirects the standard output and standard error to the *my_server.log* file. And that’s it! TF Serving is now running in the background, and its logs are saved to *my_server.log*. It loaded the latest version of our model (version 2, since by default TF Serving serves the highest version number it finds), and it is now waiting for gRPC and REST requests, respectively, on ports 8500 and 8501.
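Before sending prediction requests, it can help to confirm that the model has finished loading. The sketch below polls TF Serving's model status endpoint (`GET /v1/models/<model_name>`), whose JSON response lists each loaded version with a `state` field. The helper names and the retry parameters are illustrative choices, and no network call happens unless you call `wait_until_available()`.

```python
import json
import time
import urllib.request


def status_url(model_name, host="localhost", port=8501):
    """URL of the REST model status endpoint for the given model name."""
    return f"http://{host}:{port}/v1/models/{model_name}"


def available_versions(status):
    """Extract the version numbers reported as AVAILABLE in a status response dict."""
    return sorted(
        int(entry["version"])
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    )


def wait_until_available(model_name, retries=10, delay=1.0):
    """Poll the status endpoint until at least one version is AVAILABLE."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(status_url(model_name)) as resp:
                versions = available_versions(json.load(resp))
            if versions:
                return versions
        except OSError:
            pass  # server not accepting connections yet, or model not found yet
        time.sleep(delay)
    raise TimeoutError(f"{model_name} not available after {retries} attempts")
```

With the server started as above, `wait_until_available("face_age_detector")` would return the list of versions ready to serve; if it times out, *my_server.log* is the first place to look.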

> #### **RUNNING TF SERVING IN A DOCKER CONTAINER**
> If you are running the notebook on your own machine and you have installed [Docker](https://docker.com), you can run ***docker pull tensorflow/serving*** in a terminal to download the TF Serving image. The TensorFlow team highly recommends this installation method because it is simple, it will not mess with your system, and it offers high performance. To start the server inside a Docker container, you can run the following command in a terminal:
>
> `docker run -it --rm -v "/path/to/my_mnist_model:/models/my_mnist_model" -p 8500:8500 -p 8501:8501 -e MODEL_NAME=my_mnist_model tensorflow/serving`
>
> Here is what all these command-line options mean:
> - ***-it***
> 
>   Makes the container interactive (so you can press Ctrl-C to stop it) and displays the server’s output.
>
> - ***--rm***
> 
>   Deletes the container when you stop it: no need to clutter your machine with interrupted containers. However, it does not delete the image.
>
> - ***-v "/path/to/my_mnist_model:/models/my_mnist_model"***
>
>   Makes the host’s my_mnist_model directory available to the container at the path /models/my_mnist_model. You must replace /path/to/my_mnist_model with the absolute path of this directory. On Windows, remember to use \ instead of / in the host path, but not in the container path (since the container runs on Linux).
>
> - ***-p 8500:8500***
>
>   Makes the Docker engine forward the host’s TCP port 8500 to the container’s TCP port 8500. By default, TF Serving uses this port to serve the gRPC API.
>
> - ***-p 8501:8501***
>
>   Forwards the host’s TCP port 8501 to the container’s TCP port 8501. The Docker image is configured to use this port by default to serve the REST API.
>
> - ***-e MODEL_NAME=my_mnist_model***
>
>   Sets the container’s ***MODEL_NAME*** environment variable, so TF Serving knows which model to serve. By default, it will look for models in the /models directory, and it will automatically serve the latest version it finds.
>
> - ***tensorflow/serving***
>
>   This is the name of the image to run.
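
To adapt these options to this notebook's export directory, whose absolute path contains spaces, one option is to assemble the command in Python and let `shlex.join()` handle the quoting. This is only a sketch: `docker_run_command` is a hypothetical helper, and the directory and model names are the ones used earlier in this notebook.

```python
import shlex
from pathlib import Path


def docker_run_command(host_model_dir, model_name, image="tensorflow/serving"):
    """Build a shell-safe `docker run` command line from the options listed above."""
    host_path = Path(host_model_dir).absolute()  # the host side of -v must be absolute
    args = [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_path}:/models/{model_name}",  # mount the versioned model directory
        "-p", "8500:8500",                          # gRPC port
        "-p", "8501:8501",                          # REST port
        "-e", f"MODEL_NAME={model_name}",
        image,
    ]
    return shlex.join(args)  # quotes any argument that contains spaces


print(docker_run_command("models/face_age_detector_saved_model", "face_age_detector"))
```

The printed command can be pasted into a terminal; the model is then reachable under the name `face_age_detector`, just as with the `--model_name` flag used above.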

Now that the server is up and running, let’s query it, first using the REST API, then the gRPC API.

#### **Querying TF Serving through the REST API**
Let’s start by creating the query. It must contain the name of the function signature you want to call, and of course the input data. Since the request must use the JSON format, we have to convert the input images from a NumPy array to a Python list:
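As a minimal sketch of what such a request looks like, the helpers below build the predict URL and the JSON body without contacting the server. The URL follows TF Serving's REST pattern `/v1/models/<model_name>[/versions/<N>]:predict`, where the model name is the value passed to `--model_name`, not the export directory name. The function names and the tiny fake batch are illustrative assumptions.

```python
import json


def predict_url(model_name, version=None, host="localhost", port=8501):
    """Build a TF Serving REST predict URL, optionally pinned to one version."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"


def build_request_body(instances, signature_name="serving_default"):
    """Serialize a batch (nested Python lists, one entry per instance) as JSON."""
    return json.dumps({"signature_name": signature_name, "instances": instances})


# A fake batch holding one 2x2 RGB "image"; real code would use array.tolist()
fake_batch = [[[[0.0, 0.5, 1.0], [0.1, 0.2, 0.3]],
               [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]]]

print(predict_url("face_age_detector"))
print(predict_url("face_age_detector", version=2))
print(build_request_body(fake_batch)[:60], "...")
```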

In [42]:
import numpy as np
import requests
from PIL import Image

# --- config ---
image_paths = ["face_test/25.jpg", "face_test/26.jpg", "face_test/27.jpg", "face_test/28.jpg", "face_test/29.jpg"]
# The URL uses the name passed to --model_name (face_age_detector), not the export directory name.
# A path that does not match /v1/models/<name>[/versions/<N>]:predict is rejected with
# HTTP 400 "Malformed request".
server_url = "http://localhost:8501/v1/models/face_age_detector:predict"  # latest version
# server_url = "http://localhost:8501/v1/models/face_age_detector/versions/2:predict"  # to target version 2

for image_path in image_paths:
    # --- load & preprocess (the model expects 224x224 RGB float32 images) ---
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    arr = np.array(img).astype(np.float32)

    # --- normalize: keep exactly one option, matching how the model was trained ---
    arr /= 255.0                              # use if the model expects [0, 1]
    # arr = (arr - 127.5) / 127.5             # use instead if the model expects [-1, 1]
    # arr = arr.astype(np.uint8)              # use instead if the model expects raw bytes 0-255

    # add a batch dimension: shape -> (1, 224, 224, 3)
    batch = np.expand_dims(arr, axis=0)

    # build the JSON payload (NumPy arrays are not JSON-serializable, hence tolist()) and send it
    payload = {"signature_name": "serving_default", "instances": batch.tolist()}
    resp = requests.post(server_url, json=payload)

    print("HTTP status:", resp.status_code)
    if resp.status_code != 200:
        print("ERROR:", resp.text)
    else:
        # requests that use "instances" get their results back under "predictions"
        preds = np.array(resp.json()["predictions"])
        print("predictions shape:", preds.shape)
        print("probabilities:", preds)
        # predicted class index (for a single image)
        pred_class = int(np.argmax(preds[0]))
        print("predicted class index:", pred_class)


In [47]:
# Query a single image, targeting version 2 explicitly (edit img_path if needed)
import json

import numpy as np
import requests
from PIL import Image

img_path = "face_test/25.jpg"
url = "http://localhost:8501/v1/models/face_age_detector/versions/2:predict"  # explicit version 2

# load + preprocess (model expects 224x224 RGB float32)
img = Image.open(img_path).convert("RGB").resize((224, 224))
arr = np.array(img).astype(np.float32) / 255.0   # change if your model expects different scaling
batch = np.expand_dims(arr, 0)  # shape -> (1, 224, 224, 3)

print("batch.shape:", batch.shape, "dtype:", batch.dtype)
payload = {"signature_name": "serving_default", "instances": batch.tolist()}

resp = requests.post(url, json=payload)
print("HTTP status:", resp.status_code)
try:
    print(json.dumps(resp.json(), indent=2))
except ValueError:  # the response body was not valid JSON
    print("Response text:", resp.text)


batch.shape: (1, 224, 224, 3) dtype: float32
