In [1]:
import tensorflow as tf

In [8]:
for name in vars(tf.train):
    print(name)

__name__
__doc__
__package__
__loader__
__spec__
__path__
__file__
__cached__
__builtins__
_sys
experimental
ServerDef
CheckpointManager
get_checkpoint_state
latest_checkpoint
checkpoints_iterator
list_variables
load_checkpoint
load_variable
Coordinator
ExponentialMovingAverage
ClusterSpec
Checkpoint
BytesList
ClusterDef
Example
Feature
FeatureList
FeatureLists
Features
FloatList
Int64List
JobDef
SequenceExample


# tf.\_api.v2.train -- Overview

[Related guide](https://tensorflow.org/api_guides/python/train)



**Classes**

- BytesList
- [Checkpoint](#tf.train.Checkpoint())
- [CheckpointManager](#tf.train.CheckpointManager())
- ClusterDef
- ClusterSpec
- Coordinator
- Example
- ExponentialMovingAverage
- Feature
- FeatureList
- FeatureLists
- Features
- FloatList
- Int64List
- JobDef
- SequenceExample
- ServerDef

**Functions**

- checkpoints_iterator
- get_checkpoint_state
- latest_checkpoint
- list_variables
- load_checkpoint
- load_variable

# 

# tf.train.CheckpointManager()
```python
tf.train.CheckpointManager(
    checkpoint,
    directory,
    max_to_keep,
    keep_checkpoint_every_n_hours=None,
    checkpoint_name='ckpt',
    step_counter=None,
    checkpoint_interval=None,
    init_fn=None,
)
```
**Docstring**:

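Configures a `CheckpointManager` for use in `directory`. If a `CheckpointManager` was previously used in that directory, its state is restored, including the list of managed checkpoints and the timestamp bookkeeping needed to support `keep_checkpoint_every_n_hours`. The newly instantiated `CheckpointManager` behaves the same as the previous one, including deleting checkpoints when necessary: the manager guarantees that only `max_to_keep` checkpoints remain in the active set.

Once a checkpoint has been preserved by `keep_checkpoint_every_n_hours`, it is never deleted by this `CheckpointManager` or by any later `CheckpointManager` instantiated in `directory`, regardless of whether the `keep_checkpoint_every_n_hours` setting changes. The `max_to_keep` checkpoints in the active set, however, may be deleted by this `CheckpointManager` or by a future one instantiated in `directory`.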


**Args**:
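- checkpoint: The `tf.train.Checkpoint` instance to save and manage checkpoints for.
- directory: The path where checkpoints are written. A file named "checkpoint" is also written to this directory in a text format; it contains the state of the `CheckpointManager`.
- max_to_keep: The number of checkpoints to keep. Excess checkpoints are deleted from the active set unless they are preserved by `keep_checkpoint_every_n_hours`. If None, no checkpoints are deleted and all of them stay in the active set. Note that `max_to_keep=None` keeps every checkpoint path in memory and in the checkpoint state protocol buffer on disk.
- keep_checkpoint_every_n_hours: When a checkpoint is removed from the active set, it is preserved if at least `keep_checkpoint_every_n_hours` hours have passed since the last preserved checkpoint. None means checkpoints are not preserved this way.
- checkpoint_name: A custom name for the checkpoint files.
- step_counter: A `tf.Variable` instance whose value is checked to obtain the current step count, for users who want to save a checkpoint every N steps.
- checkpoint_interval: An integer, the minimum number of steps between two saved checkpoints.
- init_fn: A callable that performs custom initialization when the directory contains no checkpoints.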

**Readonly properties**:

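- checkpoint: Returns the managed `tf.train.Checkpoint` object.
- checkpoint_interval, directory: Same as in Args.
- checkpoints: Returns a list of filenames, one per checkpoint, sorted from oldest to newest. Note that checkpoints preserved by `keep_checkpoint_every_n_hours` do not appear in this list.
- latest_checkpoint: Returns the prefix of the most recent checkpoint in `directory`, equivalent to `tf.train.latest_checkpoint(directory)`. The returned value can be passed to `Checkpoint.restore` to resume training, i.e. `ckpt.restore(manager.latest_checkpoint)`. Returns None if there are no checkpoints.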
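
A minimal sketch of how `max_to_keep` bounds the active set, assuming eager execution in TensorFlow 2.x. The tracked `step` variable and the temporary directory are hypothetical stand-ins for real training state.

```python
import tempfile

import tensorflow as tf

# Hypothetical state to checkpoint: a single step counter.
step = tf.Variable(0, dtype=tf.int64)
ckpt = tf.train.Checkpoint(step=step)

ckpt_dir = tempfile.mkdtemp()
manager = tf.train.CheckpointManager(ckpt, directory=ckpt_dir, max_to_keep=2)

saved_paths = []
for _ in range(4):
    step.assign_add(1)
    saved_paths.append(manager.save())

# Only the two most recent checkpoints should remain in the active set.
kept = manager.checkpoints
print(kept)
print(manager.latest_checkpoint)
```

Since `keep_checkpoint_every_n_hours` is left at None here, the older checkpoints are deleted from disk rather than preserved.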

**File**:    \tensorflow\python\training\checkpoint_management.py

**Type**:           type

## tf.train.CheckpointManager.restore_or_initialize()
`manager.restore_or_initialize()`

**Docstring**

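This method first tries to restore from the most recent checkpoint in `directory` and returns the path of the restored checkpoint. If the directory contains no checkpoints and `init_fn` was specified, it calls `init_fn` to perform custom initialization and returns None.

Note that, unlike `tf.train.Checkpoint.restore()`, this method does not return an object on which assertions such as `assert_consumed()` can be run. To run assertions, call `tf.train.Checkpoint.restore()` directly.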
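
A short sketch of both branches, assuming eager execution. The `weights` variable and the `init_fn` body are hypothetical; the first call finds no checkpoint and runs `init_fn`, the second call restores the checkpoint saved in between.

```python
import tempfile

import tensorflow as tf

weights = tf.Variable([0.0, 0.0])
ckpt = tf.train.Checkpoint(weights=weights)


def init_fn():
    # Hypothetical custom initialization, used only when no checkpoint exists.
    weights.assign([1.0, 1.0])


ckpt_dir = tempfile.mkdtemp()
manager = tf.train.CheckpointManager(
    ckpt, ckpt_dir, max_to_keep=3, init_fn=init_fn)

# Empty directory: init_fn runs and the method returns None.
first = manager.restore_or_initialize()
after_init = weights.numpy().tolist()

saved_path = manager.save()

# Overwrite the variable, then restore it from the saved checkpoint.
weights.assign([5.0, 5.0])
second = manager.restore_or_initialize()
after_restore = weights.numpy().tolist()

print(first, second, after_restore)
```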


  
## tf.train.CheckpointManager.save()
`manager.save(checkpoint_number=None, check_interval=True)`


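Creates a new checkpoint and manages it, returning the path to the new checkpoint. The checkpoint is also recorded in the `checkpoints` and `latest_checkpoint` properties. Returns None if no checkpoint was saved.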

**Args**:

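- checkpoint_number: An integer, or an integer `Variable` or `Tensor`, used to number the checkpoint. If None (the default), checkpoints are numbered with `checkpoint.save_counter`. Note that `save_counter` is still incremented even when `checkpoint_number` is provided.
- check_interval: Only takes effect when `checkpoint_interval` was specified. If True, the manager saves a checkpoint only when the interval since the last saved checkpoint is at least `checkpoint_interval` steps; otherwise it saves on every call.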
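
The interaction between `step_counter`, `checkpoint_interval`, and `check_interval` can be sketched as follows, assuming eager execution. The loop and step values are hypothetical; calls that fall inside the interval return None instead of a path.

```python
import tempfile

import tensorflow as tf

step = tf.Variable(0, dtype=tf.int64)
ckpt = tf.train.Checkpoint(step=step)
manager = tf.train.CheckpointManager(
    ckpt,
    tempfile.mkdtemp(),
    max_to_keep=None,
    step_counter=step,
    checkpoint_interval=3,
)

results = []
for _ in range(7):
    step.assign_add(1)
    # Number each checkpoint by the training step instead of save_counter.
    results.append(manager.save(checkpoint_number=int(step.numpy())))

# Calls made too soon after the previous checkpoint returned None.
saved = [path for path in results if path is not None]
print(saved)

# check_interval=False bypasses the interval check for this call.
step.assign_add(1)
forced = manager.save(checkpoint_number=int(step.numpy()), check_interval=False)
print(forced)
```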

# 

# tf.train.Checkpoint()
`tf.train.Checkpoint(**kwargs)`

**Docstring:**

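The values of `kwargs` are objects that hold trackable state, such as instances of `Variable`, `Layer`, `Model`, or `Optimizer`. `Checkpoint` saves these values into a checkpoint and numbers the checkpoints using `save_counter`.

Unlike `tf.compat.v1.train.Saver`, whose checkpoints are keyed by `variable.name`, the `.save` and `.restore` methods of `Checkpoint` write and read object-based checkpoints. An object-based checkpoint stores a graph of dependencies between Python objects (`Layer`, `Optimizer`, and so on) with named edges, and this graph is used to match variables when the checkpoint is restored. This makes checkpoints more robust to changes in the Python program and allows variables to be restored as they are created.

A `Checkpoint` depends on the objects passed to its constructor through `kwargs`, and each dependency gets the name of the keyword argument that created it. Classes such as `Layer` and `Optimizer` automatically add dependencies on their own variables. Because `Model` hooks into attribute assignment, managing dependencies is easier in classes that inherit from `tf.keras.Model`, as in the following example: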

```python
class Regress(tf.keras.Model):
    def __init__(self):
        super(Regress, self).__init__()
        self.input_transform = tf.keras.layers.Dense(10)
  
    def call(self, inputs):
        x = self.input_transform(inputs)
        return x
```
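`Regress` has a dependency named `input_transform` on its `Dense` layer, which in turn depends on its own variables. As a result, saving `Regress` through a `Checkpoint` also saves all the variables of the `Dense` layer.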
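
One way to see this dependency chain is to save the model and list the keys stored in the checkpoint. This is a sketch assuming eager execution; the input shape and the `model` keyword are arbitrary choices, and the exact key suffixes may differ between TensorFlow and Keras versions.

```python
import os
import tempfile

import tensorflow as tf


class Regress(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.input_transform = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.input_transform(inputs)


model = Regress()
model(tf.ones([1, 3]))  # Call once so the Dense layer creates its variables.

ckpt = tf.train.Checkpoint(model=model)
path = ckpt.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Keys follow the named edges: keyword argument, then attribute name.
names = [name for name, _ in tf.train.list_variables(path)]
for name in names:
    print(name)
```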

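Although `Model.save_weights` and `Checkpoint.save` write the same format, the root of the resulting checkpoint is the object that the save method is attached to. This means that saving a `Model` with `save_weights` and then loading into a `Checkpoint` that has the `Model` attached (or vice versa) will not match the `Model`'s variables. See the [related guide](https://www.tensorflow.org/guide/checkpoint) for details.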

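When variables are assigned to multiple workers, each worker writes its own section of the checkpoint. These sections are then merged and re-indexed to produce a single checkpoint. This avoids copying every variable to one worker, but requires all workers to share a common filesystem.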

**Args**:
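- `**kwargs`: Keyword arguments that are set as attributes of this object and saved with the checkpoint. The values must be trackable objects; otherwise an exception is raised.

**Attributes**:
- save_counter: Used to number checkpoints; incremented by 1 each time `save()` is called.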

### Example
```python
import os

import tensorflow as tf

# `optimizer`, `model`, and `num_training_steps` are assumed to be defined elsewhere.
checkpoint_directory = "/tmp/training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")

checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
status = checkpoint.restore(tf.train.latest_checkpoint(checkpoint_directory))
for _ in range(num_training_steps):
    optimizer.minimize( ... )  # Variables will be restored on creation.
status.assert_consumed()  # Optional sanity checks.
checkpoint.save(file_prefix=checkpoint_prefix)
```

## tf.train.Checkpoint.restore()
`ckpt.restore(save_path)`

Restore a training checkpoint.

Restores this `Checkpoint` and any objects it depends on.

Either assigns values immediately if variables to restore have been created
already, or defers restoration until the variables are created. Dependencies
added after this call will be matched if they have a corresponding object in
the checkpoint (the restore request will queue in any trackable object
waiting for the expected dependency to be added).
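
A small sketch of deferred restoration, assuming eager execution. `Lazy` is a hypothetical `tf.Module` that creates its variable only when `build()` is called, so the restore request has to wait until the attribute is assigned.

```python
import os
import tempfile

import tensorflow as tf


class Lazy(tf.Module):
    """Hypothetical module whose variable does not exist until build()."""

    def build(self):
        self.w = tf.Variable([0.0, 0.0])


# Save a checkpoint from a module whose variable already exists.
source = Lazy()
source.build()
source.w.assign([3.0, 4.0])
source_ckpt = tf.train.Checkpoint(m=source)
path = source_ckpt.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Restore into a module that has not created its variable yet.
target = Lazy()
target_ckpt = tf.train.Checkpoint(m=target)
status = target_ckpt.restore(path)

target.build()  # The queued restore assigns the saved value on creation.
restored = target.w.numpy().tolist()
status.assert_consumed()
print(restored)
```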

To ensure that loading is complete and no more assignments will take place,
use the `assert_consumed()` method of the status object returned by
`restore`:

```python
checkpoint = tf.train.Checkpoint( ... )
checkpoint.restore(path).assert_consumed()
```

An exception will be raised if any Python objects in the dependency graph
were not found in the checkpoint, or if any checkpointed values do not have
a matching Python object.

Name-based `tf.compat.v1.train.Saver` checkpoints from TensorFlow 1.x can be
loaded using this method. Names are used to match variables. Re-encode
name-based checkpoints using `tf.train.Checkpoint.save` as soon as possible.

**Args**:
- save_path: The path to the checkpoint, as returned by `save` or `tf.train.latest_checkpoint`. If None (as when there is no latest checkpoint for `tf.train.latest_checkpoint` to return), returns an object which may run initializers for objects in the dependency graph. If the checkpoint was written by the name-based `tf.compat.v1.train.Saver`, names are used to match variables.

**Returns**:

A load status object, which can be used to make assertions about the
status of a checkpoint restoration.

The returned status object has the following methods:

* `assert_consumed()`:
    Raises an exception if any variables are unmatched: either
    checkpointed values which don't have a matching Python object or
    Python objects in the dependency graph with no values in the
    checkpoint. This method returns the status object, and so may be
    chained with other assertions.

* `assert_existing_objects_matched()`:
    Raises an exception if any existing Python objects in the dependency
    graph are unmatched. Unlike `assert_consumed`, this assertion will
    pass if values in the checkpoint have no corresponding Python
    objects. For example a `tf.keras.Layer` object which has not yet been
    built, and so has not created any variables, will pass this assertion
    but fail `assert_consumed`. Useful when loading part of a larger
    checkpoint into a new Python program, e.g. a training checkpoint with
    a `tf.compat.v1.train.Optimizer` was saved but only the state required
    for inference is being loaded. This method returns the status object,
    and so may be chained with other assertions.

* `assert_nontrivial_match()`: Asserts that something aside from the root
    object was matched. This is a very weak assertion, but is useful for
    sanity checking in library code where objects may exist in the
    checkpoint which haven't been created in Python and some Python
    objects may not have a checkpointed value.

* `expect_partial()`: Silence warnings about incomplete checkpoint
    restores. Warnings are otherwise printed for unused parts of the
    checkpoint file or object when the `Checkpoint` object is deleted
    (often at program shutdown).
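
The difference between these assertions can be sketched with a partial restore, assuming eager execution. The variable names `a` and `b` are hypothetical; the checkpoint holds both, while the restoring program only creates `a`.

```python
import os
import tempfile

import tensorflow as tf

# Write a checkpoint that contains two variables.
full = tf.train.Checkpoint(a=tf.Variable(1.0), b=tf.Variable(2.0))
path = full.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Restore into a program that only knows about `a`.
a = tf.Variable(0.0)
partial = tf.train.Checkpoint(a=a)
status = partial.restore(path)

# Every existing Python object found a value, so this passes.
status.assert_existing_objects_matched()

# `b` in the checkpoint has no matching Python object, so this raises.
try:
    status.assert_consumed()
    consumed = True
except AssertionError:
    consumed = False

status.expect_partial()  # Silence warnings about the unused `b`.
print(a.numpy(), consumed)
```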

## tf.train.Checkpoint.save()
`ckpt.save(file_prefix)`

Saves a training checkpoint and provides basic checkpoint management.

The saved checkpoint includes variables created by this object and any
trackable objects it depends on at the time `Checkpoint.save()` is
called.

`save` is a basic convenience wrapper around the `write` method,
sequentially numbering checkpoints using `save_counter` and updating the
metadata used by `tf.train.latest_checkpoint`. More advanced checkpoint
management, for example garbage collection and custom numbering, may be
provided by other utilities which also wrap `write`
(`tf.train.CheckpointManager` for example).

**Args**:
- file_prefix: A prefix to use for the checkpoint filenames (`/path/to/directory/and_a_prefix`). Names are generated based on this prefix and `Checkpoint.save_counter`.

**Returns**:

The full path to the checkpoint.

## tf.train.Checkpoint.write()
`ckpt.write(file_prefix)`

Writes a training checkpoint.

The checkpoint includes variables created by this object and any
trackable objects it depends on at the time `Checkpoint.write()` is
called.

`write` does not number checkpoints, increment `save_counter`, or update the
metadata used by `tf.train.latest_checkpoint`. It is primarily intended for
use by higher level checkpoint management utilities. `save` provides a very
basic implementation of these features.

**Args**:
- file_prefix: A prefix to use for the checkpoint filenames (`/path/to/directory/and_a_prefix`).

**Returns**:

The full path to the checkpoint (i.e. `file_prefix`).
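
A brief sketch contrasting `write` and `save`, assuming eager execution. The variable `v` and the `-manual` suffix are hypothetical; the point is that only `save` numbers the checkpoint and increments `save_counter`.

```python
import os
import tempfile

import tensorflow as tf

ckpt = tf.train.Checkpoint(v=tf.Variable(3.0))
prefix = os.path.join(tempfile.mkdtemp(), "ckpt")

# write() uses the prefix as given and leaves save_counter untouched.
written = ckpt.write(prefix + "-manual")
counter_after_write = int(ckpt.save_counter.numpy())

# save() appends the incremented save_counter to the prefix.
saved = ckpt.save(prefix)
counter_after_save = int(ckpt.save_counter.numpy())

print(written, counter_after_write)
print(saved, counter_after_save)
```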