# Chapter 19: Training and Deploying TensorFlow Models at Scale

This notebook contains the code reproductions and theoretical explanations for Chapter 19 of *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*.

## Chapter Summary

This chapter covers the crucial final steps of a machine learning project: deploying a trained model to production and scaling up training for very large datasets and models.

Key topics covered include:

* **Serving TensorFlow Models:** We learn how to export a model to TensorFlow's `SavedModel` format. We then deploy this model using **TensorFlow Serving (TF Serving)**, a high-performance serving system. We learn how to install it (using Docker), run it, and query it using both its REST and gRPC APIs.

* **Deploying to the Cloud:** We explore how to deploy a model to **Google Cloud AI Platform**, which provides a fully managed, scalable serving solution that handles versioning, monitoring, and more. We also create a service account and write client code to query the deployed model.

* **Mobile and Web Deployment:** We briefly look at **TensorFlow Lite (TFLite)** for deploying models on mobile and embedded devices, focusing on optimization techniques like **post-training quantization** and **quantization-aware training**. We also look at **TensorFlow.js** for running models directly in a web browser.

* **Speeding Up Training with GPUs:** We discuss how to use GPUs to accelerate training, either by getting a local GPU, using a GPU-equipped VM on the cloud, or using Google's **Colaboratory (Colab)**. We also cover essential techniques for managing GPU RAM.

* **Distributed Training:** We learn how to train a single model across multiple devices and servers. We explore the concepts of **model parallelism** and **data parallelism** (including synchronous vs. asynchronous updates).

* **Distribution Strategies API:** We use TensorFlow's `tf.distribute.Strategy` API to easily implement data parallelism. This includes `MirroredStrategy` (for multiple GPUs on one machine) and `MultiWorkerMirroredStrategy` (for multiple servers).

* **Hyperparameter Tuning on AI Platform:** Finally, we see how to use Google Cloud's powerful black-box optimization service to perform large-scale hyperparameter tuning.

## Setup

First, let's import the necessary libraries and set up the environment. We'll also train a basic Fashion MNIST model to use for the deployment examples; it is exported in the next section.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os
import json
import requests

# Common setup for plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [None]:
# Load and prepare Fashion MNIST data
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_new = X_test[:3]

# Train a simple model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])

model.fit(X_train_full, y_train_full, epochs=10, validation_data=(X_test, y_test), verbose=0)

print("Model trained.")

## Serving a TensorFlow Model

Once a model is trained, you need to deploy it. Instead of embedding it in every application, it's usually better to wrap it in a dedicated web service. **TensorFlow Serving** is a high-performance, battle-tested system designed for this. It can serve multiple models, handle versioning, and automatically deploy the latest version of a model.

### Exporting SavedModels

The first step is to export the model to TensorFlow's **`SavedModel`** format. This format is a directory containing the model's computation graph (as a protobuf) and its weights (variables). It's the universal, language-neutral format for TensorFlow models.

We can use `tf.saved_model.save()` or simply the model's `save()` method.

In [None]:
model_version = "0001"
model_name = "my_mnist_model"
model_path = os.path.join(model_name, model_version)

# Export the model to SavedModel format
tf.saved_model.save(model, model_path)

# You could also use: 
# model.save(model_path) # This saves as SavedModel if path is not .h5
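
Because each export lives in its own numbered subdirectory, publishing a new version amounts to writing to the next number. The sketch below shows one way to compute that path using only the standard library. The helper name `next_version_path` is hypothetical, and the temporary directory merely mimics `my_mnist_model/0001`.

```python
import os
import tempfile

def next_version_path(model_dir, width=4):
    """Return the path for the next numeric version subdirectory of model_dir."""
    existing = [int(name) for name in os.listdir(model_dir)
                if name.isdigit() and os.path.isdir(os.path.join(model_dir, name))]
    next_version = max(existing, default=0) + 1
    # Zero-pad so the name follows the "0001" convention used above
    return os.path.join(model_dir, str(next_version).zfill(width))

# Demonstrate on a throwaway directory that already holds version 0001
with tempfile.TemporaryDirectory() as model_dir:
    os.makedirs(os.path.join(model_dir, "0001"))
    new_path = next_version_path(model_dir)
    new_version = os.path.basename(new_path)

print(new_version)
```

In a real project you would pass the returned path to `tf.saved_model.save()` instead of hard-coding the version string.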

Let's inspect the contents of the `SavedModel` using the `saved_model_cli` tool. In a notebook, the leading `!` runs it as a shell command; in a terminal, drop the `!` and spell out the model path.

In [None]:
# Shell command: the leading ! lets the notebook run it
!saved_model_cli show --dir {model_path} --all

A `SavedModel` contains one or more *metagraphs* (a graph + function signatures), each identified by tags. When you save a Keras model, TensorFlow writes a single metagraph tagged `"serve"`. This metagraph contains the `serving_default` signature, which corresponds to the model's `call()` function (the forward pass that `model.predict()` relies on).

### Installing TensorFlow Serving

The easiest way to install TF Serving is using Docker. 

1.  **Pull the image:** `docker pull tensorflow/serving`
2.  **Run the container:** (This command is for your terminal, not this notebook)

```bash
# Get the absolute path to your model directory
ML_PATH=$(pwd)

docker run -it --rm -p 8500:8500 -p 8501:8501 \
   -v "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
   -e MODEL_NAME=my_mnist_model \
   tensorflow/serving
```
This command:
* Runs the `tensorflow/serving` image.
* Forwards ports `8501` (for REST API) and `8500` (for gRPC API).
* Mounts your local `my_mnist_model` directory into the container at `/models/my_mnist_model`.
* Sets the `MODEL_NAME` environment variable so TF Serving knows which model to serve.

### Querying TF Serving through the REST API

The REST API is simple and uses JSON. It works well when the input and output data are not too large, since JSON is text-based and fairly verbose.

In [None]:
# 1. Create the request in JSON format
# The 'instances' key holds our batch of new images (as a list)
input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})

In [None]:
# 2. Send the POST request to the server's REST API endpoint
SERVER_URL = 'http://localhost:8501/v1/models/my_mnist_model:predict'

try:
    response = requests.post(SERVER_URL, data=input_data_json)
    response.raise_for_status() # Raise an exception in case of error
    response = response.json()
    
    # 3. Parse the JSON response
    y_proba = np.array(response["predictions"])
    print(y_proba.round(2))

except requests.exceptions.ConnectionError:
    print("Could not connect to TF Serving. Is the Docker container running?")
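
To make the request and response shapes concrete without a running server, here is a minimal sketch that uses only the standard library. The helper names `build_predict_payload` and `parse_predict_response` are hypothetical, and the response string is hand-written for illustration, not real server output.

```python
import json

def build_predict_payload(instances, signature_name="serving_default"):
    """Serialize a batch of instances into a TF Serving REST request body."""
    return json.dumps({"signature_name": signature_name, "instances": instances})

def parse_predict_response(body):
    """Return the list of predictions, or raise if the reply carries an error."""
    data = json.loads(body)
    if "error" in data:
        raise RuntimeError(data["error"])
    return data["predictions"]

# Two tiny 2x2 "images" stand in for X_new.tolist()
payload = build_predict_payload([[[0.0, 0.5], [1.0, 0.0]],
                                 [[0.2, 0.2], [0.2, 0.2]]])

# Hand-written stand-in for a server reply: one row of scores per instance
fake_response = '{"predictions": [[0.9, 0.1], [0.3, 0.7]]}'
predictions = parse_predict_response(fake_response)
print(len(predictions), "predictions")
```

Keeping the payload construction and the parsing in small functions makes it easy to test client code without starting the Docker container.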

### Querying TF Serving through the gRPC API

gRPC is a more efficient, binary protocol based on protocol buffers. Because it avoids JSON's text encoding, it is generally more compact and faster than REST, and it is recommended for high-performance applications, especially when transferring large amounts of data.

You'll need to install the `tensorflow-serving-api` library: `pip install -U tensorflow-serving-api`

In [None]:
try:
    import grpc
    from tensorflow_serving.apis import predict_pb2
    from tensorflow_serving.apis import prediction_service_pb2_grpc

    # 1. Create the request protobuf
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    input_name = model.input_names[0]
    request.inputs[input_name].CopyFrom(tf.make_tensor_proto(X_new))

    # 2. Open a gRPC channel and send the request
    channel = grpc.insecure_channel('localhost:8500')
    predict_service = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    response = predict_service.Predict(request, timeout=10.0)

    # 3. Convert the response protobuf back to a tensor
    output_name = model.output_names[0]
    outputs_proto = response.outputs[output_name]
    y_proba_grpc = tf.make_ndarray(outputs_proto)
    print(y_proba_grpc.round(2))

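except ImportError:
    # Caught first, so the clause below is never evaluated when grpc is missing
    print("gRPC client libraries not found.")
    print("Install with: pip install -U tensorflow-serving-api grpcio")
except grpc.RpcError as e:
    print("gRPC request failed. Is the Docker container running?", e)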

## Creating a Prediction Service on GCP AI Platform

**Theoretical Explanation:**

Instead of managing Docker containers yourself, you can use a fully managed service like **Google Cloud AI Platform**. 

The process is:
1.  **Set up GCP:** Create a project, enable billing, and enable the AI Platform and Cloud Storage APIs.
2.  **Create a GCS Bucket:** Create a bucket in Google Cloud Storage (GCS) to store your models.
3.  **Upload SavedModel:** Upload your `SavedModel` directory (e.g., `my_mnist_model/0001`) to the bucket.
4.  **Create a Model:** In the AI Platform console, create a "model" resource (this is just a container for versions).
5.  **Create a Version:** Create a "version" of your model, pointing it to the `SavedModel` directory in your GCS bucket. AI Platform will spin up TF Serving instances to serve your model.
6.  **Create a Service Account:** For security, create a service account with the "ML Engine Developer" role and download its JSON private key.
7.  **Query the Service:** Use Google's API Client Library to authenticate with the service account key and send prediction requests.

In [None]:
# This code assumes you have followed the steps in the book (1-6).
# 1. Set the service account key environment variable
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "my_service_account_key.json"

# 2. Build the client resource object
import googleapiclient.discovery

project_id = "your-gcp-project-id" # CHANGE THIS
model_id = "my_mnist_model"
# Use a separate name so the local SavedModel path in `model_path` stays valid
gcp_model_path = "projects/{}/models/{}".format(project_id, model_id)

# To use a specific version:
# gcp_model_path += "/versions/0001"

# The import above fails if the client library is not installed. Install it with:
# !pip install -U google-api-python-client

# ml_resource = googleapiclient.discovery.build("ml", "v1").projects()

In [None]:
# 3. Create a function to query the service
def predict_gcp(ml_resource, X):
    input_data_json = {"signature_name": "serving_default",
                       "instances": X.tolist()}
    request = ml_resource.predict(name=gcp_model_path, body=input_data_json)
    response = request.execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    output_name = model.output_names[0]
    return np.array([pred[output_name] for pred in response["predictions"]])

# 4. Query the model (this will fail unless you set up GCP)
# try:
#     Y_probas = predict_gcp(ml_resource, X_new)
#     print(Y_probas.round(2))
# except NameError:
#     print("GCP client not configured. Skipping GCP prediction.")
# except Exception as e:
#     print("GCP prediction failed:", e)

## Deploying a Model to a Mobile or Embedded Device

**Theoretical Explanation:**

To run models on devices with limited compute, power, and RAM, you need to use **TensorFlow Lite (TFLite)**.

TFLite's main tool is a **converter** that takes a `SavedModel` and converts it to a much lighter `.tflite` file (based on FlatBuffers). This converter:
1.  **Reduces Model Size:** It prunes unused operations (like training ops) and optimizes the graph.
2.  **Reduces Latency & Power:** It can perform **quantization**, which converts the 32-bit float weights to 8-bit integers. 

**Post-training quantization** is the simplest method: it quantizes the weights after training. This gives roughly a 4x reduction in size (8 bits per weight instead of 32), but computations are still done with floats (the 8-bit integers are dequantized at runtime).

**Quantization-aware training** is more complex. It adds "fake" quantization operations to the model *during* training. This makes the model robust to the loss of precision, resulting in higher accuracy after quantization.
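
The arithmetic behind weight quantization can be sketched in pure Python. This is a simplified symmetric scheme with one scale per tensor, chosen for illustration; it is not TFLite's exact algorithm, and the helper names `quantize` and `dequantize` are hypothetical.

```python
def quantize(weights, num_bits=8):
    """Map floats to signed integers using one symmetric scale for the whole tensor."""
    q_max = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values, as happens at runtime before computing."""
    return [q * scale for q in quantized]

weights = [-1.27, -0.4, 0.0, 0.33, 1.0]
q_weights, scale = quantize(weights)
restored = dequantize(q_weights, scale)

# Rounding means each weight is off by at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

The small rounding error per weight is the precision loss that quantization-aware training teaches the model to tolerate.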

In [None]:
# Convert the SavedModel to a TFLite FlatBuffer
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
tflite_model = converter.convert()

with open("converted_model.tflite", "wb") as f:
    f.write(tflite_model)

In [None]:
# To use post-training quantization:
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # OPTIMIZE_FOR_SIZE is a deprecated alias
tflite_model_quantized = converter.convert()

print(f"SavedModel graph file size (weights not included): {os.path.getsize(os.path.join(model_path, 'saved_model.pb'))} bytes")
print(f"TFLite model size: {len(tflite_model)} bytes")
print(f"Quantized TFLite model size: {len(tflite_model_quantized)} bytes")

## Deploying a Model to a Web Browser (TensorFlow.js)

You can also deploy models to a web browser using **TensorFlow.js**. This is great for privacy (data never leaves the user's machine) and low latency (no server round-trip).

1.  **Convert the model:** Use the `tensorflowjs_converter` command-line tool (which comes with the `tensorflowjs` pip package). With `--input_format=tf_saved_model`, it converts a `SavedModel` to the TensorFlow.js graph model format (a `model.json` file and one or more binary weight shard files). A Keras HDF5 file is converted to the Layers format instead.
2.  **Load in JavaScript:** Use `tf.loadGraphModel()` for a graph model (or `tf.loadLayersModel()` for a Layers model) to load the model and make predictions.

In [None]:
# 1. Install the converter (in a terminal, drop the leading !)
# !pip install -U tensorflowjs

# 2. Run the converter (in a terminal, drop the leading ! and spell out the path)
# !tensorflowjs_converter --input_format=tf_saved_model {model_path} ./my_tfjs_model

In [None]:
// 3. Example JavaScript code to run the model in a browser
// import * as tf from '@tensorflow/tfjs';
// const model = await tf.loadGraphModel('https://example.com/tfjs/model.json');
// const image = tf.browser.fromPixels(webcamElement);
// const prediction = model.predict(image);

## Using GPUs to Speed Up Computations

Training deep nets on CPUs is very slow. You can get a massive speedup by using a **Graphics Processing Unit (GPU)**. 

To do this, you need:
1.  A supported GPU (currently NVIDIA, with CUDA Compute Capability 3.5+).
2.  The NVIDIA drivers.
3.  NVIDIA's CUDA library and cuDNN library.
4.  The `tensorflow-gpu` package (or just `tensorflow` as of 2.1, which bundles GPU support).

In [None]:
# Check if TensorFlow can see the GPU
# (tf.test.is_gpu_available() is deprecated in favor of listing physical devices)
physical_gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
print("Is GPU available:", len(physical_gpus) > 0)
print("GPU device name:", tf.test.gpu_device_name())
print("Physical GPUs:", physical_gpus)

### Managing the GPU RAM

By default, TensorFlow automatically grabs *all* the RAM in all available GPUs the first time it runs a computation. If you want to run multiple programs on one machine, you must manage this.

**Option 1: Limit visible devices (via environment variable)**
This is the cleanest way. Set this in your terminal *before* running your script.

In [None]:
# Terminal A:
# $ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python3 program_1.py

# Terminal B:
# $ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 python3 program_2.py
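
The same isolation can be arranged from Python by handing each child process its own environment. This is a minimal standard-library sketch: the helper name `run_on_gpu` is hypothetical, and the child command only echoes the variable, standing in for `program_1.py`.

```python
import os
import subprocess
import sys

def run_on_gpu(gpu_ids, command):
    """Run `command` in a child process that can only see the given GPU IDs."""
    env = dict(os.environ,
               CUDA_DEVICE_ORDER="PCI_BUS_ID",
               CUDA_VISIBLE_DEVICES=gpu_ids)
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, check=True)

# Stand-in for `python3 program_1.py`: the child just reports what it can see
child = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_on_gpu("0", child)
visible = result.stdout.strip()
print("Child sees GPU(s):", visible)
```

Setting the variables for the child, rather than changing them inside a program that is already running, ensures they are in place before the GPU libraries initialize.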

**Option 2: Tell TF to only grab memory as needed (memory growth)**
This must be done right at the start of your program.

In [None]:
physical_gpus = tf.config.experimental.list_physical_devices("GPU")
if physical_gpus:
    try:
        for gpu in physical_gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

**Option 3: Create virtual devices with a fixed limit**

In [None]:
physical_gpus = tf.config.experimental.list_physical_devices("GPU")
if physical_gpus:
    try:
        # Set two virtual GPUs on the first physical GPU, each with 2GiB of RAM
        tf.config.experimental.set_virtual_device_configuration(
            physical_gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    except RuntimeError as e:
        print(e)

### Placing Operations on Devices

By default, TF places an op on the first GPU (`/gpu:0`) if the op has a GPU kernel; otherwise it falls back to the CPU (`/cpu:0`). You can manually control this with a `tf.device()` context.

In [None]:
with tf.device("/cpu:0"): 
    # These operations will run on the CPU
    c = tf.Variable(42.0)
    print(c.device)

## Training Models Across Multiple Devices

**Theoretical Explanation:**

There are two main approaches to training a single model across multiple devices:

1.  **Model Parallelism:** You split your model across devices. For example, the bottom layers run on GPU 0 and the top layers run on GPU 1. This is complex and rarely efficient because of the high communication cost between devices.
2.  **Data Parallelism:** You replicate the *entire model* on every device. At each step, you give each replica a different mini-batch of data. Each replica computes the gradients for its batch. These gradients are then aggregated (e.g., averaged) and used to update the parameters on *all* replicas. This is the most common and effective strategy.

This aggregation can be done:
* **Synchronously:** (e.g., **Mirrored Strategy**) All replicas wait until every replica has computed its gradients. The gradients are averaged (using an **AllReduce** algorithm), and every replica applies the same update. This is the simplest and most common approach.
* **Asynchronously:** (e.g., **Parameter Server Strategy**) Replicas run independently. When a replica finishes its gradients, it sends them to a "parameter server," which updates the central parameters and sends the new parameters back. This avoids waiting but can lead to *stale gradients*, which can destabilize training.
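
Synchronous data parallelism can be simulated in a few lines of pure Python. The sketch below trains the one-parameter model `y = w * x` on two "replicas". It is a toy illustration, not how TensorFlow implements it, and the function names `gradient` and `synchronous_step` are hypothetical.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def synchronous_step(w, shards, learning_rate):
    """One data-parallel step: each replica computes a gradient, all apply the mean."""
    replica_grads = [gradient(w, shard) for shard in shards]  # parallel in practice
    mean_grad = sum(replica_grads) / len(replica_grads)       # the AllReduce step
    return w - learning_rate * mean_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]                           # one mini-batch per replica

w = 0.0
for _ in range(50):
    w = synchronous_step(w, shards, learning_rate=0.05)
print(round(w, 4))
```

Because the shards have equal size, the mean of the per-replica gradients equals the gradient over the combined batch, so every replica ends up with the same parameters after each step.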

### Training at Scale Using the Distribution Strategies API

TensorFlow's `tf.distribute.Strategy` API makes data parallelism incredibly simple. You just create a strategy object and define/compile your Keras model *within its scope*.

In [None]:
# MirroredStrategy: For synchronous training on all GPUs on one machine.

# List available devices (to see if you have multiple GPUs)
print(tf.config.experimental.list_physical_devices("GPU"))

# Create the strategy
distribution = tf.distribute.MirroredStrategy()

# If you only want to use a subset of GPUs:
# distribution = tf.distribute.MirroredStrategy(["/gpu:0", "/gpu:1"])

with distribution.scope():
    # Same architecture as the Fashion MNIST model from the Setup section
    mirrored_model = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(10, activation="softmax")
    ])
    mirrored_model.compile(loss="sparse_categorical_crossentropy",
                           optimizer=keras.optimizers.SGD(learning_rate=1e-2),
                           metrics=["accuracy"])

# The batch size must be divisible by the number of replicas (GPUs)
batch_size = 64 # e.g., if 2 GPUs, each gets 32 instances
mirrored_model.fit(X_train_full, y_train_full, epochs=10, batch_size=batch_size)
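
The divisibility rule from the comment above can be checked up front. This small sketch uses a hypothetical helper, `per_replica_batch_size`, and takes the replica count as a plain integer so that it runs without TensorFlow.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Split a global batch evenly across replicas, or fail loudly if it cannot be done."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"batch size {global_batch_size} is not divisible by {num_replicas} replicas")
    return global_batch_size // num_replicas

print(per_replica_batch_size(64, 2))   # 32
```

With a real strategy object, the replica count is available as `distribution.num_replicas_in_sync`.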

### Training on a TensorFlow Cluster

**Theoretical Explanation:**

A **TF Cluster** is a group of TF processes (tasks) running on different machines, talking to each other. Each task has a **type** (job) and an **index**:
* **`"worker"`:** A task that performs computations (usually on a GPU).
* **`"chief"`:** A special worker (usually `worker:0`) that handles extra work like saving checkpoints and writing TensorBoard logs.
* **`"ps"`:** A **Parameter Server** task. It only stores and updates model parameters. Used by the `ParameterServerStrategy`.

To configure a cluster, you must set the `TF_CONFIG` environment variable on *each machine* before it starts. This JSON variable defines the addresses of all tasks (`cluster` key) and the current task's role (`task` key).

In [None]:
# Example TF_CONFIG for worker 0
cluster_spec = {
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222"   # /job:worker/task:1
    ],
    "ps": ["machine-a.example.com:2221"] # /job:ps/task:0
}

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0}
})
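
Since every task needs its own `TF_CONFIG`, it can help to generate all of them from a single cluster specification. The sketch below does this with the standard library only. The helper name `make_tf_config` is hypothetical, and the host names are the same placeholders used above.

```python
import json

def make_tf_config(cluster_spec, task_type, task_index):
    """Build the TF_CONFIG JSON string for one task of the cluster."""
    if not 0 <= task_index < len(cluster_spec[task_type]):
        raise ValueError(f"no task {task_index} in job {task_type!r}")
    return json.dumps({
        "cluster": cluster_spec,
        "task": {"type": task_type, "index": task_index},
    })

cluster_spec = {
    "worker": ["machine-a.example.com:2222", "machine-b.example.com:2222"],
    "ps": ["machine-a.example.com:2221"],
}

# One configuration per task: the cluster part is shared, only the task part differs
configs = {
    (task_type, index): make_tf_config(cluster_spec, task_type, index)
    for task_type, addresses in cluster_spec.items()
    for index in range(len(addresses))
}
for (task_type, index), config in configs.items():
    print(f"/job:{task_type}/task:{index} -> {config}")
```

Each string would then be exported as `TF_CONFIG` on the machine that runs the corresponding task, before TensorFlow starts.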

In [None]:
# Using MultiWorkerMirroredStrategy (for synchronous multi-worker training)
# You would run this same script on all workers.

# Note: No "ps" jobs are needed for this strategy.
# (From TF 2.4 on, the class is also exposed as tf.distribute.MultiWorkerMirroredStrategy.)
# distribution = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Using ParameterServerStrategy (for asynchronous multi-worker training)
# You would run this script on all workers and parameter servers.

# distribution = tf.distribute.experimental.ParameterServerStrategy()

# The rest of the code (defining, compiling, and fitting the model)
# would be the same as in the MirroredStrategy example.

### Black Box Hyperparameter Tuning on AI Platform

**Theoretical Explanation:**

GCP AI Platform offers a powerful **Bayesian optimization** service (Google Vizier) for hyperparameter tuning. 

You provide:
1.  A YAML configuration file (`tuning.yaml`) specifying the hyperparameter search space (e.g., `n_layers` from 10 to 100), the metric to optimize (e.g., `accuracy`), and the number of trials.
2.  Training code that accepts the hyperparameters as command-line arguments.
3.  A `TensorBoard` callback in that training code, so that it logs the metric you want to optimize.

AI Platform will then run your training job multiple times (trials), and it will use the results from previous trials to intelligently choose the hyperparameter values for the next trial, quickly homing in on the optimal values.
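
On the training-code side, accepting hyperparameters as command-line arguments can be sketched with `argparse`. The argument names mirror the `n_layers` and `momentum` parameters used in this section. The function name `parse_hyperparameters` and the default values are assumptions made for illustration.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that the tuning service passes on the command line."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--n_layers", type=int, default=20)
    parser.add_argument("--momentum", type=float, default=0.9)
    # parse_known_args tolerates any extra arguments instead of exiting with an error
    args, _unknown = parser.parse_known_args(argv)
    return args

# Simulate one trial's command line
args = parse_hyperparameters(["--n_layers", "30", "--momentum", "0.5"])
print(args.n_layers, args.momentum)
```

The parsed values would then be used to build and compile the model before calling `fit()` with the `TensorBoard` callback.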

In [None]:
# Example 'tuning.yaml' file (this is not Python code)
"""
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: n_layers
        type: INTEGER
        minValue: 10
        maxValue: 100
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: momentum
        type: DOUBLE
        minValue: 0.1
        maxValue: 1.0
        scaleType: UNIT_LOG_SCALE
"""

# Example gcloud command to launch the tuning job (run in terminal)
# !gcloud ai-platform jobs submit training my_tuning_job \
#   --config tuning.yaml \
#   --package-path /my_project/src/trainer \
#   --module-name trainer.task \
#   [...other gcloud args...]

## Exercises

See Appendix A in the book.