# **Writing a training loop from scratch**

### **Setup**

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

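### **Introduction**

Keras provides default training and evaluation loops, `fit()` and `evaluate()`. Their usage is covered in the guide [Training & evaluation with the built-in methods](https://www.tensorflow.org/guide/keras/train_and_evaluate/).

If you want to customize the learning algorithm of your model while still leveraging the convenience of `fit()` (for instance, to train a GAN using `fit()`), you can subclass the `Model` class and implement your own `train_step()` method, which is called repeatedly during `fit()`. This is covered in the guide [Customizing what happens in fit()](https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit/).

Now, if you want very low-level control over training and evaluation, you should write your own training and evaluation loops from scratch. This is what this guide is about.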

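### **Using the GradientTape: a first end-to-end example**

Calling a model inside a `GradientTape` scope enables you to retrieve the gradients of a loss value with respect to the trainable weights of the layers. Using an optimizer instance, you can use these gradients to update these variables (which you can retrieve using `model.trainable_weights`).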
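To make the "compute gradients, then update the variables" idea concrete, here is a minimal framework-free sketch. It is not how Keras implements anything: the one-weight model, the data, the starting value and the learning rate are all made up for illustration, and the gradient is written out by hand where `GradientTape` would derive it automatically.

```python
# Framework-free sketch: one-weight linear model y = w * x with a
# mean-squared-error loss. tf.GradientTape would compute `grad` for us;
# here the derivative is spelled out so the update rule is visible.

def loss_and_grad(w, xs, ys):
    """Return the mean squared error and its derivative with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated with the "true" weight w = 2
w = 0.0                # hypothetical starting value
learning_rate = 0.05   # hypothetical step size

losses = []
for step in range(50):
    loss, grad = loss_and_grad(w, xs, ys)
    losses.append(loss)
    w -= learning_rate * grad   # plain SGD update: w <- w - lr * dL/dw
```

After the loop, `w` should sit very close to 2 and the recorded losses should shrink from step to step, which is the same pattern the Keras loop in this section follows with many weights at once.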

Let's consider a simple MNIST model:

In [2]:
inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(64, activation="relu")(inputs)
x2 = layers.Dense(64, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)

Let's train it using mini-batch gradient descent with a custom training loop.

First, we're going to need an optimizer, a loss function, and a dataset:

In [3]:
# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Flatten each 28x28 image into a 784-dimensional vector.
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
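The loss above is created with `from_logits=True`, meaning the model outputs raw scores and the softmax is folded into the loss. As a rough illustration of that arithmetic (a plain-Python sketch, not the Keras implementation; the function names are made up), the per-sample loss is `-log(softmax(logits)[label])`, and the batch loss is assumed here to be the mean over samples:

```python
import math

def sparse_categorical_crossentropy_from_logits(label, logits):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def batch_loss(labels, batch_logits):
    """Mean of the per-sample losses over a batch (assumed reduction)."""
    per_sample = [
        sparse_categorical_crossentropy_from_logits(y, z)
        for y, z in zip(labels, batch_logits)
    ]
    return sum(per_sample) / len(per_sample)

# Ten equal logits: the model is maximally unsure, so the loss is log(10).
uniform = sparse_categorical_crossentropy_from_logits(3, [0.0] * 10)
# A large logit on the correct class gives a loss close to zero.
confident = sparse_categorical_crossentropy_from_logits(0, [10.0, 0.0, 0.0])
```

An untrained 10-class model that outputs near-equal logits would therefore start around `log(10)`, roughly 2.3, per sample.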


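Here's our training loop:

+ We open a `for` loop that iterates over epochs
+ For each epoch, we open a `for` loop that iterates over the dataset, in batches
+ For each batch, we open a `GradientTape()` scope
+ Inside this scope, we call the model (forward pass) and compute the loss
+ Outside the scope, we retrieve the gradients of the loss with respect to the weights of the model
+ Finally, we use the optimizer to update the weights of the model based on the gradients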

In [4]:
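epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run during the
        # forward pass, which enables automatic differentiation.
        with tf.GradientTape() as tape:

            # Run the forward pass of the layers.
            # The operations that the layers apply to their inputs are
            # recorded on the GradientTape.
            logits = model(x_batch_train, training=True)  # Logits for this minibatch

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)

        # Use the gradient tape to automatically retrieve the gradients of
        # the loss with respect to the trainable variables.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating the value of the
        # variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * 64))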


Start of epoch 0
Training loss (for one batch) at step 0: 70.8490
Seen so far: 64 samples
Training loss (for one batch) at step 200: 1.5749
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.8477
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 0.6865
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.7668
Seen so far: 51264 samples

Start of epoch 1
Training loss (for one batch) at step 0: 0.6792
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.7830
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.8290
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 0.7765
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.5991
Seen so far: 51264 samples
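The step numbers and sample counts in this log follow directly from the dataset size and the batch size. A small arithmetic sketch, assuming the 60,000-image MNIST training set is used in full as in the cell above:

```python
import math

num_samples = 60000   # size of the MNIST training set used above
batch_size = 64
log_every = 200

steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
logged_steps = list(range(0, steps_per_epoch, log_every))
seen_at_logged_steps = [(step + 1) * batch_size for step in logged_steps]

print(steps_per_epoch)       # 938
print(last_batch_size)       # 32
print(logged_steps)          # [0, 200, 400, 600, 800]
print(seen_at_logged_steps)  # [64, 12864, 25664, 38464, 51264]
```

Note that `(step + 1) * 64` is only exact while every batch is full; the final batch of an epoch holds 32 samples here.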


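### **Low-level handling of metrics**

Let's add metrics monitoring to this basic loop.

You can readily reuse the built-in metrics (or custom ones you wrote) in training loops written from scratch. Here's the flow:

+ Instantiate the metric at the start of the loop
+ Call `metric.update_state()` after each batch
+ Call `metric.result()` when you need to display the current value of the metric
+ Call `metric.reset_states()` when you need to clear the state of the metric, typically at the end of an epoch (the method is named `reset_state()` in more recent Keras releases)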
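The flow above relies on metrics being stateful objects that accumulate across batches. The following plain-Python class is a hypothetical stand-in (it is not a Keras class, and `RunningAccuracy` is a made-up name) that mimics the same three-method protocol, to show what gets accumulated and when it is cleared:

```python
class RunningAccuracy:
    """Hypothetical stand-in for a stateful accuracy metric (not a Keras class)."""

    def __init__(self):
        self.reset_states()

    def update_state(self, y_true, y_pred):
        # Accumulate counts for one batch of integer labels and predicted classes.
        self.correct += sum(int(t == p) for t, p in zip(y_true, y_pred))
        self.total += len(y_true)

    def result(self):
        # Accuracy over everything seen since the last reset.
        return self.correct / self.total if self.total else 0.0

    def reset_states(self):
        self.correct = 0
        self.total = 0


metric = RunningAccuracy()
metric.update_state([1, 0, 2], [1, 0, 1])   # batch 1: 2 of 3 correct
metric.update_state([2, 2], [2, 2])         # batch 2: 2 of 2 correct
epoch_accuracy = metric.result()            # 4 of 5 correct overall
metric.reset_states()                       # start the next epoch from zero
```

Forgetting the reset would make the second epoch's reported accuracy an average over both epochs, which is why the loops in this guide reset right after reading the result.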

Let's use this knowledge to compute `SparseCategoricalAccuracy` on validation data at the end of each epoch:

In [5]:
# Get the model.
inputs = keras.Input(shape=(784,), name="digits")
x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
x = layers.Dense(64, activation="relu", name="dense_2")(x)
outputs = layers.Dense(10, name="predictions")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

# Instantiate an optimizer to train the model.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Prepare the metrics.
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_acc_metric = keras.metrics.SparseCategoricalAccuracy()

# Prepare the training dataset.
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

# Prepare the validation dataset.
# Reserve 10,000 samples for validation.
# Caveat: train_dataset above was built before this split, so it still
# contains these 10,000 samples. Build it after the split if the validation
# data must stay unseen during training.
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(64)

Here's our training and evaluation loop:

In [6]:
import time

epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            logits = model(x_batch_train, training=True)
            loss_value = loss_fn(y_batch_train, logits)
        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Update the training metric.
        train_acc_metric.update_state(y_batch_train, logits)

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * 64))

    # Display metrics at the end of each epoch.
    train_acc = train_acc_metric.result()
    print("Training acc over epoch: %.4f" % (float(train_acc),))

    # Reset the training metric at the end of each epoch.
    train_acc_metric.reset_states()

    # Run a validation loop at the end of each epoch.
    for x_batch_val, y_batch_val in val_dataset:
        val_logits = model(x_batch_val, training=False)
        # Update the validation metric.
        val_acc_metric.update_state(y_batch_val, val_logits)
    val_acc = val_acc_metric.result()
    val_acc_metric.reset_states()
    print("Validation acc: %.4f" % (float(val_acc),))
    print("Time taken: %.2fs" % (time.time() - start_time))


Start of epoch 0
Training loss (for one batch) at step 0: 105.9828
Seen so far: 64 samples
Training loss (for one batch) at step 200: 1.3818
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 1.3547
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 1.0363
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.9450
Seen so far: 51264 samples
Training acc over epoch: 0.6877
Validation acc: 0.8265
Time taken: 8.47s

Start of epoch 1
Training loss (for one batch) at step 0: 0.7976
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.5406
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.6293
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 0.6605
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.5388
Seen so far: 51264 samples
Training acc over epoch: 0.8345
Validation acc: 0.8751
Time taken: 8.41s


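### **Speeding up your training step with tf.function**

The default runtime in TensorFlow 2 is eager execution. As such, our training loop above executes eagerly.

This is great for debugging, but graph compilation has a definite performance advantage. Describing your computation as a static graph enables the framework to apply global performance optimizations. This is impossible when the framework is constrained to greedily execute one operation after another, with no knowledge of what comes next.

You can compile into a static graph any function that takes tensors as input. Just add a `@tf.function` decorator on it, like this: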

In [7]:
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value
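One practical consequence of compiling a function this way is that ordinary Python statements inside it run only while TensorFlow traces the function, not on every call. The sketch below illustrates this with a toy function; `scaled_sum` and `trace_log` are made-up names used only for this illustration, and it assumes both calls pass tensors of the same shape and dtype so that the traced graph is reused.

```python
import tensorflow as tf

trace_log = []  # hypothetical helper list, used only to count traces

@tf.function
def scaled_sum(x):
    # This Python-level append runs only while TensorFlow traces the
    # function, not on every call of the compiled graph.
    trace_log.append("traced")
    return tf.reduce_sum(x) * 2.0

first = scaled_sum(tf.constant([1.0, 2.0, 3.0]))
second = scaled_sum(tf.constant([4.0, 5.0, 6.0]))  # same shape and dtype
num_traces = len(trace_log)
```

Under those assumptions `num_traces` ends up as 1 even though the function was called twice. For a training step, this means Python-side logging belongs outside the decorated function, which is how the loops in this guide are written.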

Let's do the same with the evaluation step:

In [8]:
@tf.function
def test_step(x, y):
    val_logits = model(x, training=False)
    val_acc_metric.update_state(y, val_logits)


Now, let's re-run our training loop with this compiled training step:

In [9]:
import time

epochs = 2
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * 64))

    # Display metrics at the end of each epoch.
    train_acc = train_acc_metric.result()
    print("Training acc over epoch: %.4f" % (float(train_acc),))

    # Reset the training metric at the end of each epoch.
    train_acc_metric.reset_states()

    # Run a validation loop at the end of each epoch.
    for x_batch_val, y_batch_val in val_dataset:
        test_step(x_batch_val, y_batch_val)

    val_acc = val_acc_metric.result()
    val_acc_metric.reset_states()
    print("Validation acc: %.4f" % (float(val_acc),))
    print("Time taken: %.2fs" % (time.time() - start_time))


Start of epoch 0
Training loss (for one batch) at step 0: 0.4298
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.5934
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.3899
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 0.5006
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.5795
Seen so far: 51264 samples
Training acc over epoch: 0.8699
Validation acc: 0.8963
Time taken: 1.97s

Start of epoch 1
Training loss (for one batch) at step 0: 0.2032
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.2656
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.6380
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 0.9145
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 0.6284
Seen so far: 51264 samples
Training acc over epoch: 0.8886
Validation acc: 0.9111
Time taken: 1.52s


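Much faster, isn't it? In the logs above, an epoch takes roughly 1.5 to 2 seconds with the compiled steps, versus about 8.4 seconds when running eagerly.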

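### **Low-level handling of losses tracked by the model**

Layers and models recursively track any losses created during the forward pass by layers that call `self.add_loss(value)`. The resulting list of scalar loss values is available via the property `model.losses` at the end of the forward pass.

If you want to use these loss components, you should sum them and add them to the main loss in your training step.

Consider this layer, which creates an activity regularization loss: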

In [10]:
class ActivityRegularizationLayer(layers.Layer):
    def call(self, inputs):
        self.add_loss(1e-2 * tf.reduce_sum(inputs))
        return inputs
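To get a feel for what this layer contributes, here is a plain-Python sketch of the arithmetic only. The nested list stands in for a small `(batch, features)` activation tensor, and both the activation values and the main loss value are made up:

```python
# Plain-Python sketch of the arithmetic; the values are hypothetical.
activations = [
    [0.0, 1.5],
    [2.0, 0.5],
]
# Same formula as the layer: 1e-2 times the sum of all activations.
activity_penalty = 1e-2 * sum(sum(row) for row in activations)

main_loss = 0.70                    # hypothetical cross-entropy value
extra_losses = [activity_penalty]   # plays the role of model.losses
total_loss = main_loss + sum(extra_losses)
```

Here the penalty comes to 0.04 and the total to 0.74. Because the penalty grows with the summed activations, minimizing the total loss nudges the network toward smaller activations at that point in the model.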

Let's build a really simple model that uses it:

In [11]:
inputs = keras.Input(shape=(784,), name="digits")
x = layers.Dense(64, activation="relu")(inputs)
# Insert activity regularization as a layer.
x = ActivityRegularizationLayer()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(10, name="predictions")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

Here's what our training step should look like now:

In [13]:
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
        # Add any extra losses created during the forward pass.
        loss_value += sum(model.losses)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

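### **Summary**

Now you know everything there is to know about using built-in training loops and writing your own from scratch.

To conclude, here's a simple end-to-end example that ties together everything you've learned in this guide: a DCGAN trained on MNIST digits.

### **End-to-end example: a GAN training loop from scratch**
You may be familiar with Generative Adversarial Networks (GANs). GANs can generate new images that look almost real, by learning the latent distribution of a training dataset of images (the "latent space" of the images).

A GAN is made of two parts: a "generator" model that maps points in the latent space to points in image space, and a "discriminator" model, a classifier that can tell the difference between real images (from the training dataset) and fake images (the output of the generator network).

A GAN training loop looks like this:

1) Train the discriminator.
+ Sample a batch of random points in the latent space.
+ Turn the points into fake images via the "generator" model.
+ Get a batch of real images and combine them with the generated images.
+ Train the "discriminator" model to classify generated vs. real images.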

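2) Train the generator.
+ Sample random points in the latent space.
+ Turn the points into fake images via the "generator" network.
+ Pass the fake images through the discriminator (no real images are needed in this step).
+ Train the "generator" model to "fool" the discriminator into classifying the fake images as real.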

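For a much more detailed overview of how GANs work, see [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python).

Let's implement this training loop. First, create the discriminator meant to classify fake vs. real digits: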

In [12]:
discriminator = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(64, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.Conv2D(128, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.GlobalMaxPooling2D(),
        layers.Dense(1),
    ],
    name="discriminator",
)
discriminator.summary()

Model: "discriminator"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 14, 14, 64)        640       
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 14, 14, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 7, 7, 128)         73856     
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 7, 7, 128)         0         
_________________________________________________________________
global_max_pooling2d (Global (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 74,625
Trainable params: 74,625
Non-trainable params: 0
_________________________________________________
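The parameter counts in this summary can be checked by hand. The sketch below uses the usual counting rule for these layer types (weights plus one bias per filter or unit); the helper names are made up for this illustration:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return in_features * units + units

conv2d = conv2d_params(3, 3, 1, 64)       # 28x28x1 input -> 64 filters
conv2d_1 = conv2d_params(3, 3, 64, 128)   # 64 channels -> 128 filters
dense = dense_params(128, 1)              # pooled 128-vector -> 1 logit
total = conv2d + conv2d_1 + dense

print(conv2d, conv2d_1, dense, total)     # 640 73856 129 74625
```

The `LeakyReLU` and `GlobalMaxPooling2D` layers have no weights, which is why they show 0 parameters in the summary.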

Then let's create a generator network that turns latent vectors into outputs of shape `(28, 28, 1)` (representing MNIST digits):

In [14]:
latent_dim = 128

generator = keras.Sequential(
    [
        keras.Input(shape=(latent_dim,)),
        # We want to generate 7 * 7 * 128 coefficients to reshape into a 7x7x128 map.
        layers.Dense(7 * 7 * 128),
        layers.LeakyReLU(alpha=0.2),
        layers.Reshape((7, 7, 128)),
        layers.Conv2DTranspose(128, (4, 4), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.Conv2DTranspose(128, (4, 4), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.Conv2D(1, (7, 7), padding="same", activation="sigmoid"),
    ],
    name="generator",
)
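It can help to trace how the generator grows a 7x7 feature map into a 28x28 image. The following sketch walks through the spatial shapes, relying on the rule that a transposed convolution with `padding="same"` multiplies the spatial size by its stride; the helper name is made up for this illustration:

```python
def conv2d_transpose_same(size, stride):
    """Spatial size after a transposed convolution with padding="same"."""
    return size * stride

latent_dim = 128
dense_units = 7 * 7 * 128                # values produced by the Dense layer
shape = (7, 7, 128)                      # after Reshape
shapes = [shape]
for _ in range(2):                       # the two Conv2DTranspose layers
    h, w, _channels = shape
    shape = (conv2d_transpose_same(h, 2), conv2d_transpose_same(w, 2), 128)
    shapes.append(shape)
shape = (shape[0], shape[1], 1)          # final Conv2D, padding="same", 1 filter
shapes.append(shape)

print(dense_units)   # 6272
print(shapes)        # [(7, 7, 128), (14, 14, 128), (28, 28, 128), (28, 28, 1)]
```

The final shape matches the `(28, 28, 1)` input expected by the discriminator, and the sigmoid activation on the last layer keeps pixel values in the range 0 to 1.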

Here's the key bit: the training loop. As you can see, it is quite straightforward. The training step function only takes 17 lines.

In [15]:
# Instantiate one optimizer for the discriminator and another for the generator.
d_optimizer = keras.optimizers.Adam(learning_rate=0.0003)
g_optimizer = keras.optimizers.Adam(learning_rate=0.0004)

# Instantiate a loss function.
loss_fn = keras.losses.BinaryCrossentropy(from_logits=True)


@tf.function
def train_step(real_images):
    # Sample random points in the latent space.
    random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
    # Decode them to fake images.
    generated_images = generator(random_latent_vectors)
    # Combine them with real images.
    combined_images = tf.concat([generated_images, real_images], axis=0)

    # Assemble labels discriminating real from fake images
    # (in this example, 1 marks a generated image and 0 a real one).
    labels = tf.concat(
        [tf.ones((batch_size, 1)), tf.zeros((real_images.shape[0], 1))], axis=0
    )
    # Add random noise to the labels - an important trick!
    labels += 0.05 * tf.random.uniform(labels.shape)

    # Train the discriminator.
    with tf.GradientTape() as tape:
        predictions = discriminator(combined_images)
        d_loss = loss_fn(labels, predictions)
    grads = tape.gradient(d_loss, discriminator.trainable_weights)
    d_optimizer.apply_gradients(zip(grads, discriminator.trainable_weights))

    # Sample random points in the latent space.
    random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
    # Assemble labels that say "all real images" (0 means real here).
    misleading_labels = tf.zeros((batch_size, 1))

    # Train the generator (note that we should *not* update the weights
    # of the discriminator)!
    with tf.GradientTape() as tape:
        predictions = discriminator(generator(random_latent_vectors))
        g_loss = loss_fn(misleading_labels, predictions)
    grads = tape.gradient(g_loss, generator.trainable_weights)
    g_optimizer.apply_gradients(zip(grads, generator.trainable_weights))
    return d_loss, g_loss, generated_images
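The label handling in this step is easy to misread, so here is a plain-Python sketch of just that part. It mirrors the convention used in the step (1 for generated images, 0 for real ones) with tiny made-up batch sizes, and uses the standard `random` module in place of `tf.random.uniform`:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

num_generated, num_real = 3, 3   # tiny hypothetical batch sizes

# Same convention as the training step: 1.0 marks a generated image,
# 0.0 marks a real one.
labels = [1.0] * num_generated + [0.0] * num_real

# Add uniform noise in [0, 0.05) to every label, mirroring
# `labels += 0.05 * tf.random.uniform(labels.shape)`.
noisy_labels = [label + 0.05 * random.random() for label in labels]

# Labels handed to the loss when training the generator: all "real".
misleading_labels = [0.0] * num_generated
```

The noise keeps the discriminator's targets from being exactly 0 or 1. The generator is then trained against all-zero labels, so its loss is low only when the discriminator mistakes its outputs for real images.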

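Let's train our GAN by repeatedly calling `train_step` on batches of images.

Since our discriminator and generator are convnets, you're going to want to run this code on a GPU.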

In [16]:
import os

# Prepare the dataset. We use both the training and test MNIST digits.
batch_size = 64
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
all_digits = np.concatenate([x_train, x_test])
all_digits = all_digits.astype("float32") / 255.0
all_digits = np.reshape(all_digits, (-1, 28, 28, 1))
dataset = tf.data.Dataset.from_tensor_slices(all_digits)
dataset = dataset.shuffle(buffer_size=1024).batch(batch_size)

epochs = 1  # In practice you need at least 20 epochs to generate nice digits.
save_dir = "./"

for epoch in range(epochs):
    print("\nStart epoch", epoch)

    for step, real_images in enumerate(dataset):
        # Train the discriminator and generator on one batch of real images.
        d_loss, g_loss, generated_images = train_step(real_images)

        # Logging.
        if step % 200 == 0:
            # Print metrics.
            print("discriminator loss at step %d: %.2f" % (step, d_loss))
            print("adversarial loss at step %d: %.2f" % (step, g_loss))

            # Save one generated image.
            img = tf.keras.preprocessing.image.array_to_img(
                generated_images[0] * 255.0, scale=False
            )
            img.save(os.path.join(save_dir, "generated_img" + str(step) + ".png"))

        # To limit execution time, we stop early (once `step` exceeds 10).
        if step > 10:
            break


Start epoch 0
discriminator loss at step 0: 0.70
adversarial loss at step 0: 0.73


That's it! You'll get nice-looking fake MNIST digits after about 30 seconds of training on a Colab GPU.