##### Copyright 2020 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Training Keras models with TensorFlow Cloud

## Introduction


[TensorFlow Cloud](https://github.com/tensorflow/cloud) is a Python package that
provides APIs for a seamless transition from local debugging to distributed training
in Google Cloud. It reduces training a TensorFlow model on the cloud to a single
function call, requiring minimal setup and no changes to your model. TensorFlow Cloud
automatically handles cloud-specific tasks such as creating VM instances and choosing
a distribution strategy for your model. This guide demonstrates how to interface with
Google Cloud through TensorFlow Cloud and walks through the functionality it provides,
starting with the simplest use case.

## Setup

We'll get started by installing TensorFlow Cloud and importing the packages we
will need in this guide.

In [2]:
!pip install -q tensorflow_cloud

In [3]:
import tensorflow as tf
import tensorflow_cloud as tfc

from tensorflow import keras
from tensorflow.keras import layers

## API overview: a first end-to-end example

Let's begin with a Keras model training script, such as the following CNN:

In [None]:

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28)),
        # Use a Rescaling layer to make sure input values are in the [0, 1] range.
        layers.experimental.preprocessing.Rescaling(1.0 / 255),
        # The original images have shape (28, 28), so we reshape them to (28, 28, 1)
        layers.Reshape(target_shape=(28, 28, 1)),
        # Follow-up with a classic small convnet
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
)

model.fit(x_train, y_train, epochs=20, batch_size=128, validation_split=0.1)

To train this model on Google Cloud, we just need to add a call to `run()` at
the beginning of the script, before the other imports (`tensorflow_cloud` itself
must already be imported as `tfc`):

In [None]:
tfc.run()

You don't need to worry about cloud-specific tasks such as creating VM instances
and distribution strategies when using TensorFlow Cloud.
The API includes intelligent defaults for all the parameters. Everything is
configurable, but many models can rely on these defaults.

Upon calling `run()`, TensorFlow Cloud will:

- Make your Python script or notebook distribution-ready.
- Convert it into a Docker image with the required dependencies.
- Run the training job on a GPU-powered GCP VM.
- Stream relevant logs and job information.

The default VM configuration is 1 chief and 0 workers, with 8 CPU cores and
1 Tesla T4 GPU.

## Google Cloud configuration

Cloud training requires some first-time setup. If you're a new Google Cloud user,
there are a few preliminary steps you will need to take:

1. Create a GCP Project;
2. Enable AI Platform Services;
3. Create a Service Account;
4. Download an authorization key;
5. Create a Cloud Storage bucket.

Detailed first-time setup instructions can be found in the
[TensorFlow Cloud README](https://github.com/tensorflow/cloud#setup-instructions),
and an additional setup example is shown on the
[TensorFlow Blog](https://blog.tensorflow.org/2020/08/train-your-tensorflow-model-on-google.html).
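
Once the authorization key from step 4 is downloaded, a quick local check can catch a missing or mistyped key path before a job is submitted. The sketch below is a minimal illustration and not part of TensorFlow Cloud: it assumes the key is exposed through the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable used by Google Cloud client libraries, and `check_credentials` is a hypothetical helper name.

```python
import os


def check_credentials(env=None):
    """Return the service-account key path, or raise if it is not usable.

    Assumption: the key location is stored in GOOGLE_APPLICATION_CREDENTIALS.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set.")
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"Key file not found: {key_path}")
    return key_path
```

Calling `check_credentials()` with no arguments reads the real environment; passing a dictionary makes the helper easy to exercise without touching your shell configuration.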


## Common workflows and Cloud storage

In most cases, you'll want to retrieve your model after training on Google Cloud.
For this, it's crucial to redirect saving and loading to Cloud Storage while
training remotely. We can direct TensorFlow Cloud to our Cloud Storage bucket for
a variety of tasks. The storage bucket can be used to save and load large training
datasets, store callback logs or model weights, and save trained model files.
To begin, let's configure `fit()` to save the model to Cloud Storage, and set
up TensorBoard monitoring to track training progress.

In [4]:
def create_model():
    model = keras.Sequential(
        [
            keras.Input(shape=(28, 28)),
            layers.experimental.preprocessing.Rescaling(1.0 / 255),
            layers.Reshape(target_shape=(28, 28, 1)),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dense(10),
        ]
    )

    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=keras.metrics.SparseCategoricalAccuracy(),
    )
    return model


Let's save the TensorBoard logs and model checkpoints generated during training
in our Cloud Storage bucket.

In [5]:
import datetime
import os

# Note: Please change the gcp_bucket to your bucket name.
gcp_bucket = "keras-examples"

checkpoint_path = os.path.join("gs://", gcp_bucket, "mnist_example", "save_at_{epoch}")

# Timestamp included to enable timeseries graphs.
tensorboard_path = os.path.join(
    "gs://", gcp_bucket, "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
)

callbacks = [
    # TensorBoard will store logs for each epoch and graph performance for us.
    keras.callbacks.TensorBoard(log_dir=tensorboard_path, histogram_freq=1),
    # ModelCheckpoint will save models after each epoch for retrieval later.
    keras.callbacks.ModelCheckpoint(checkpoint_path),
    # EarlyStopping will terminate training when val_loss ceases to improve.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]

model = create_model()
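
To make the `{epoch}` placeholder in the checkpoint path concrete, here is a small standalone sketch. `ModelCheckpoint` fills that placeholder in for every saved checkpoint; the loop below only imitates the substitution with `str.format`. It also swaps in `posixpath.join` for `os.path.join`, an optional choice that keeps the `gs://` URI built with forward slashes regardless of the operating system. The bucket name is the same placeholder used above.

```python
import posixpath

gcp_bucket = "keras-examples"  # placeholder bucket name, replace with your own

# posixpath.join always joins with "/", so the URI looks the same on every OS.
checkpoint_template = posixpath.join(
    "gs://", gcp_bucket, "mnist_example", "save_at_{epoch}"
)

# Imitate the per-epoch substitution for the first three epochs.
for epoch in (1, 2, 3):
    print(checkpoint_template.format(epoch=epoch))
# prints:
# gs://keras-examples/mnist_example/save_at_1
# gs://keras-examples/mnist_example/save_at_2
# gs://keras-examples/mnist_example/save_at_3
```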

Here, we will load our data from Keras directly. In general, it's best practice
to store your dataset in your Cloud Storage bucket; however, TensorFlow Cloud can
also accommodate datasets stored locally. That's covered in the Multi-file projects
section of this guide.

In [6]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

The [TensorFlow Cloud](https://github.com/tensorflow/cloud) API provides the
`remote()` function to determine whether code is being executed locally or on
the cloud. This lets you set separate `fit()` parameters for local and remote
execution, and makes it easy to debug without overloading your local machine.

In [7]:
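if tfc.remote():
    epochs = 100
    batch_size = 128
    # Keep the callbacks list defined above for remote runs.
else:
    epochs = 5
    batch_size = 64
    callbacks = None

# validation_split provides the val_loss that EarlyStopping monitors.
model.fit(
    x_train,
    y_train,
    epochs=epochs,
    callbacks=callbacks,
    batch_size=batch_size,
    validation_split=0.1,
)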

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa57866c4e0>

Let's save the model in GCS after the training is complete.

In [8]:
save_path = os.path.join("gs://", gcp_bucket, "mnist_example")

if tfc.remote():
    model.save(save_path)

We can also use this storage bucket for Docker image building, instead of your local
Docker instance. For this, just pass your bucket name as the `docker_image_bucket_name`
parameter.

In [9]:
# docs_infra: no_execute
tfc.run(docker_image_bucket_name=gcp_bucket)

After training the model, we can load the saved model and view our TensorBoard logs
to monitor performance.

In [9]:
# docs_infra: no_execute
model = keras.models.load_model(save_path)

In [9]:
!#docs_infra: no_execute
!tensorboard dev upload --logdir "gs://keras-examples-jonah/logs/fit" --name "Guide MNIST"

## Large-scale projects

In many cases, your project containing a Keras model may encompass more than one
Python script, or may involve external data or specific dependencies. TensorFlow
Cloud is flexible enough for large-scale deployment, and provides a number of
features to aid your projects.

### Entry points: support for Python scripts and Jupyter notebooks

Your call to the `run()` API won't always be contained inside the same Python script
as your model training code. For this purpose, we provide an `entry_point` parameter.
The `entry_point` parameter can be used to specify the Python script or notebook in
which your model training code lives. When calling `run()` from the same script as
your model, use the `entry_point` default of `None`.

### `pip` dependencies

If your project relies on additional `pip` dependencies, you can specify the
required libraries by including a `requirements.txt` file. In this file, simply
list all the required dependencies, and TensorFlow Cloud will handle integrating
them into your cloud build.

### Python notebooks

TensorFlow Cloud is also runnable from Python notebooks. Additionally, your specified
`entry_point` can be a notebook if needed. There are two key differences to keep
in mind when using TensorFlow Cloud from notebooks rather than scripts:

- When calling `run()` from within a notebook, a Cloud Storage bucket must be specified
for building and storing your Docker image.
- GCloud authentication happens entirely through your authentication key, without
project specification.

An example workflow using TensorFlow Cloud from a notebook
is provided in the "Putting it all together" section of this guide.


### Multi-file projects

If your model depends on additional files, you only need to ensure that these files
live in the same directory as the specified entry point, or in one of its subdirectories.
Every file stored in the same directory as the specified `entry_point` will be included
in the Docker image, as will any files stored in subdirectories adjacent to the
`entry_point`. This is also true for dependencies you may need which can't be acquired
through `pip`.
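
To preview which files sit next to an entry point before submitting a job, a short local listing can help. This is only an approximation of the rule described above, written with the standard library. It is not TensorFlow Cloud's packaging code, and `files_next_to` and the file names are hypothetical.

```python
import os
import tempfile


def files_next_to(entry_point):
    """List files in the entry point's directory and its subdirectories."""
    root = os.path.dirname(os.path.abspath(entry_point))
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store paths relative to the entry point's directory.
            found.append(os.path.relpath(full, root))
    return sorted(found)


# Build a throwaway project layout to demonstrate the listing.
with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, "data"))
    for rel in ("train_model.py", "utils.py", os.path.join("data", "labels.csv")):
        with open(os.path.join(project, rel), "w") as handle:
            handle.write("")
    bundled = files_next_to(os.path.join(project, "train_model.py"))
    print(bundled)
```

The listing includes the entry point itself, its sibling `utils.py`, and the file inside the `data` subdirectory.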

For an example of a custom entry-point and multi-file project with additional pip
dependencies, take a look at this multi-file example on the
[TensorFlow Cloud Repository](https://github.com/tensorflow/cloud/tree/master/src/python/tensorflow_cloud/core/tests/examples/multi_file_example).

For brevity, we'll just include the example's `run()` call:

In [None]:
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    entry_point="train_model.py",
    requirements="requirements.txt"
)

## Machine configuration and distributed training

Model training may require a wide range of different resources, depending on the
size of the model or the dataset. When accounting for configurations with multiple
GPUs, it becomes critical to choose a fitting
[distribution strategy](https://www.tensorflow.org/guide/distributed_training).
Here, we outline a few possible configurations:

### Multi-worker distribution

Here, we can use `COMMON_MACHINE_CONFIGS` to designate a CPU-only chief and 2 workers,
each using the `T4_4X` configuration (4 Tesla T4 GPUs per worker, as the name suggests).

```python
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X']
)
```

By default, TensorFlow Cloud chooses a distribution strategy for your machine
configuration using a simple rule based on the `chief_config`, `worker_config` and
`worker_count` parameters provided.

- If the number of GPUs specified is greater than zero, `tf.distribute.MirroredStrategy` will be chosen.
- If the number of workers is greater than zero, `tf.distribute.experimental.MultiWorkerMirroredStrategy` or `tf.distribute.experimental.TPUStrategy` will be chosen based on the accelerator type.
- Otherwise, `tf.distribute.OneDeviceStrategy` will be chosen.
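
The selection rules can be written down as a small decision function. This is a sketch of the rules as stated in this guide, not TensorFlow Cloud's implementation. The function name `pick_strategy`, its arguments, and the precedence (the worker rule is checked before the GPU rule) are assumptions made for illustration.

```python
def pick_strategy(chief_gpus, worker_count, worker_accelerator="GPU"):
    """Return the name of the strategy suggested by the rules in this guide.

    Assumption: the worker rule is checked first, so a multi-worker
    setup never falls back to MirroredStrategy.
    """
    if worker_count > 0:
        if worker_accelerator == "TPU":
            return "tf.distribute.experimental.TPUStrategy"
        return "tf.distribute.experimental.MultiWorkerMirroredStrategy"
    if chief_gpus > 0:
        return "tf.distribute.MirroredStrategy"
    return "tf.distribute.OneDeviceStrategy"


print(pick_strategy(chief_gpus=0, worker_count=2))
print(pick_strategy(chief_gpus=0, worker_count=1, worker_accelerator="TPU"))
print(pick_strategy(chief_gpus=1, worker_count=0))
print(pick_strategy(chief_gpus=0, worker_count=0))
```

Under these assumptions, the CPU chief with two GPU workers used above would map to the multi-worker mirrored strategy.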

### TPU distribution

Let's train the same model on TPU, as shown:

```python
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"]
)
```

### Custom distribution strategy

To specify a custom distribution strategy, write your code as you normally would
according to the
[distributed training guide](https://www.tensorflow.org/guide/distributed_training)
and set `distribution_strategy` to `None`. Below, we'll specify our own distribution
strategy for the same MNIST model.

```python
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

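mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = create_model()

if tfc.remote():
    epochs = 100
    batch_size = 128
    # `callbacks` is the list defined earlier in this guide.
else:
    epochs = 10
    batch_size = 64
    callbacks = None

# validation_split provides the val_loss that EarlyStopping monitors.
model.fit(
    x_train,
    y_train,
    epochs=epochs,
    callbacks=callbacks,
    batch_size=batch_size,
    validation_split=0.1,
)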

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X'],
    distribution_strategy=None
)
```

## Custom Docker images

By default, TensorFlow Cloud uses a
[Docker base image](https://hub.docker.com/r/tensorflow/tensorflow/)
supplied by Google and corresponding to your current TensorFlow version. However,
you can also specify a custom Docker image to fit your build requirements, if necessary.
For this example, we will specify the Docker image from an older version of TensorFlow:

```python
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    base_docker_image="tensorflow/tensorflow:2.1.0-gpu"
)
```

## Additional metrics

You may find it useful to tag your Cloud jobs with specific labels, or to stream
your model's logs during Cloud training.
It's good practice to maintain proper labeling on all Cloud jobs, for record-keeping.
For this purpose, `run()` accepts a dictionary of up to 64 key-value label pairs,
which are visible from the Cloud build logs. Logs such as epoch performance and model
saving internals can be accessed using the link provided by executing `tfc.run`, or
printed to your local terminal using the `stream_logs` flag.

```python
job_labels = {"job": "mnist-example", "team": "keras-io", "user": "jonah"}

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    job_labels=job_labels,
    stream_logs=True
)
```
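
Since the label dictionary is limited to 64 key-value pairs, a local check before submitting can save a failed job. The helper below is a hypothetical sketch, not part of TensorFlow Cloud. The 64-pair limit comes from this guide, while requiring string keys and values is an added assumption.

```python
MAX_LABELS = 64  # limit quoted in this guide


def check_job_labels(labels):
    """Raise ValueError if `labels` has too many entries or non-string items."""
    if len(labels) > MAX_LABELS:
        raise ValueError(f"Too many labels: {len(labels)} > {MAX_LABELS}")
    for key, value in labels.items():
        # Assumption: labels map strings to strings.
        if not isinstance(key, str) or not isinstance(value, str):
            raise ValueError(f"Label {key!r} must map a string to a string")
    return labels


job_labels = check_job_labels(
    {"job": "mnist-example", "team": "keras-io", "user": "jonah"}
)
```

The validated dictionary can then be passed as `job_labels` in the `run()` call.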

## Putting it all together

For an in-depth Colab which uses many of the features described in this guide,
follow along with
[this example](https://github.com/tensorflow/cloud/blob/master/src/python/tensorflow_cloud/core/tests/examples/dogs_classification.ipynb)
to train a state-of-the-art model to recognize dog breeds from photos using feature
extraction.