```python
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow_datasets.core import ReadInstruction as ri
```

## Splits and slicing
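The subsets of a dataset that a `DatasetBuilder` exposes are called splits, for example `train` and `test`. When building a dataset with `tfds.load()` or `tfds.core.DatasetBuilder.as_dataset()`, you can specify exactly which slices of which splits to retrieve. To do so, pass either a string or a `ReadInstruction` instance to the `split` argument of these two functions.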




### Using strings

```python
trainset = tfds.load('mnist', split='train')
trainset, testset = tfds.load('mnist', split=['train', 'test'])
train_testset = tfds.load('mnist', split='train+test')
trainset_10_to_20 = tfds.load('mnist', split='train[10:20]')
trainset_top10pct = tfds.load('mnist', split='train[:10%]')
trainset_top10pct_last80pct = tfds.load(
    'mnist', split='train[:10%]+train[-80%:]'
)
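# 10-fold cross-validation: each call returns a list of 10 datasets,
# one validation slice and its complementary training slice per fold.
valsets = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
trainsets = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])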
```
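To see what the two list comprehensions above expand to, the following sketch builds the same split strings with a small helper. `kfold_split_specs` is a hypothetical name introduced here for illustration and is not part of TFDS; it only produces strings and does not load any data.

```python
def kfold_split_specs(split_name, n_folds=10):
    """Return (validation, training) split strings, one pair per fold.

    Assumes n_folds divides 100 so that every boundary is a whole percent.
    """
    if 100 % n_folds != 0:
        raise ValueError("n_folds must divide 100")
    step = 100 // n_folds
    val_specs, train_specs = [], []
    for k in range(0, 100, step):
        # Validation fold: the k% .. (k+step)% slice.
        val_specs.append(f"{split_name}[{k}%:{k + step}%]")
        # Training fold: everything before and after the validation slice.
        train_specs.append(f"{split_name}[:{k}%]+{split_name}[{k + step}%:]")
    return val_specs, train_specs


val_specs, train_specs = kfold_split_specs("train", n_folds=10)
print(val_specs[0])    # train[0%:10%]
print(train_specs[3])  # train[:30%]+train[40%:]
```

Either list can then be passed as the `split` argument, in the same way as the inline comprehensions.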




### Using the ReadInstruction API
```python
from tensorflow_datasets.core import ReadInstruction as ri

trainset = tfds.load('mnist', split=ri('train'))
trainset, testset = tfds.load('mnist', split=[ri('train'), ri('test')])
train_testset = tfds.load('mnist', split=ri('train') + ri('test'))
trainset_10_to_20 = tfds.load(
    'mnist', split=ri('train', from_=10, to=20, unit='abs')
)
trainset_top10pct = tfds.load(
    'mnist', split=ri('train', to=10, unit='%')
)
ri_ = ri('train', to=10, unit='%') + ri('train', from_=-80, unit='%')
trainset_top10pct_last80pct = tfds.load('mnist', split=ri_)
# 10-fold cross-validation: one validation slice and one training slice per fold.
valsets = tfds.load(
    'mnist',
    split=[ri('train', from_=k, to=k+10, unit='%')
           for k in range(0, 100, 10)],
)
trainsets = tfds.load(
    'mnist',
    split=[ri('train', to=k, unit='%') + ri('train', from_=k+10, unit='%')
           for k in range(0, 100, 10)],
)
```




### Using `tfds.even_splits`
Passing the result of `tfds.even_splits` to the `split` argument yields a list of non-overlapping sub-splits of equal size, each of which is a `Dataset` instance:

```python
list_of_split = tfds.load('mnist', split=tfds.even_splits('train', n=3))
for split in list_of_split:
    assert isinstance(split, tf.data.Dataset)
```
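Note that when a dataset is divided into 3 sub-splits as above, 3 does not divide 100, so the boundaries default to 0% to 33%, 33% to 67%, and 67% to 100%. This happens even when the total number of examples in the dataset is divisible by 3: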
```python
assert tfds.even_splits('train', n=3) == [
    'train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]',
]
whole_trainset = tfds.load('mnist', split="train")
assert len(whole_trainset) == 60000
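# The sizes follow the 33% / 34% / 33% boundaries, not an exact 3-way split.
# Integer arithmetic avoids comparing a float product with an int length.
assert len(list_of_split[0]) == len(whole_trainset) * 33 // 100  # 19800
assert len(list_of_split[1]) == len(whole_trainset) * 34 // 100  # 20400
assert len(list_of_split[2]) == len(whole_trainset) * 33 // 100  # 19800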
```
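The percentage boundaries above can be reproduced with plain arithmetic. The sketch below assumes the boundaries come from rounding `i * 100 / n` to the nearest whole percent; `even_split_specs` is a hypothetical helper written for illustration, not the TFDS implementation.

```python
def even_split_specs(split_name, n):
    """Build n contiguous percent slices that together cover 0% to 100%."""
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    return [
        f"{split_name}[{bounds[i]}%:{bounds[i + 1]}%]" for i in range(n)
    ]


specs = even_split_specs("train", 3)
print(specs)  # ['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']

# Resulting sizes for a 60000-example split, using whole-percent boundaries.
num_examples = 60000
sizes = [
    num_examples * hi // 100 - num_examples * lo // 100
    for lo, hi in [(0, 33), (33, 67), (67, 100)]
]
print(sizes)  # [19800, 20400, 19800]
```

The middle slice ends up 600 examples larger than the other two, which is the unevenness described above.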




### Percentage slicing and rounding
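When slicing by percentage, a slice boundary may fall on a fractional index. By default the boundary is rounded to the nearest integer, and the slice then follows the usual indexing rule, i.e. the end index is excluded. For example, for a dataset with 101 examples, taking the slice from 49% to 50% returns, after rounding, the 49th and 50th examples.

Alternatively, you can use the `rounding='pct1_dropremainder'` rounding mode, in which percentage boundaries are treated as multiples of 1%. This is useful when the lengths of slices taken at different percentages must be consistent, for example when the dataset obtained with `split='train[:30%]'` must be exactly 3 times as long as the one obtained with `split='train[:10%]'`. It also means that if `info.splits[split_name].num_examples % 100 != 0`, the last examples may be truncated.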
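The difference between the two rounding modes can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not TFDS code: both helper names are hypothetical, and the default mode is modelled with Python's built-in `round()`, so the handling of exact `.5` ties here is an assumption.

```python
import math


def pct_to_abs_closest(pct, num_examples):
    """Default mode: round the boundary to the nearest integer index.

    Tie handling follows Python's round(); this is an assumption of the sketch.
    """
    return int(round(pct * num_examples / 100))


def pct_to_abs_pct1_dropremainder(pct, num_examples):
    """1% is a fixed whole number of examples; the remainder is dropped."""
    examples_per_pct = math.trunc(num_examples / 100)
    return pct * examples_per_pct


num_examples = 104  # deliberately not a multiple of 100

closest_10 = pct_to_abs_closest(10, num_examples)  # 10.4 rounds to 10
closest_30 = pct_to_abs_closest(30, num_examples)  # 31.2 rounds to 31
pct1_10 = pct_to_abs_pct1_dropremainder(10, num_examples)    # 10 * 1
pct1_30 = pct_to_abs_pct1_dropremainder(30, num_examples)    # 30 * 1
pct1_100 = pct_to_abs_pct1_dropremainder(100, num_examples)  # 100 * 1

print(closest_30 == 3 * closest_10)  # False: 31 != 30
print(pct1_30 == 3 * pct1_10)        # True: 30 == 30
print(num_examples - pct1_100)       # 4 examples are never reachable
```

With the `ReadInstruction` API, the mode is selected through its `rounding` argument, for example `ri('train', to=10, unit='%', rounding='pct1_dropremainder')`.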




### Reproducibility
The sub-split API guarantees that any given split slice (or `ReadInstruction`) will always produce the same set of records on a given dataset, as long as the major version of the dataset is constant.

For example, `tfds.load("mnist:3.0.0", split="train[10:20]")` and `tfds.load("mnist:3.2.0", split="train[10:20]")` will always contain the same elements, regardless of platform, architecture, etc., even though some of the records might have different values (e.g. image encoding, label, ...).



## Performance tips


This section provides TFDS-specific performance tips. Note that TFDS provides datasets as `tf.data.Dataset` objects, so the advice from the `tf.data` guide still applies.

### Benchmark datasets

Use `tfds.benchmark(ds)` to benchmark any `tf.data.Dataset` object.

Make sure to pass `batch_size=` so that the results are normalized (e.g. 100 iter/sec -> 3200 ex/sec). This works with any iterable (e.g. `tfds.benchmark(tfds.as_numpy(ds))`).

```python
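# `prefetch` requires a buffer size; AUTOTUNE lets tf.data pick one.
ds = tfds.load('mnist', split='train').batch(32).prefetch(
    tf.data.experimental.AUTOTUNE
)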
# Display some benchmark statistics
tfds.benchmark(ds, batch_size=32)
# Second iteration is much faster, due to auto-caching
tfds.benchmark(ds, batch_size=32)
```

### Small datasets (< 1 GB)

All TFDS datasets store the data on disk in the `TFRecord` format. For small datasets (e.g. MNIST, CIFAR, ...), reading from `.tfrecord` files can add significant overhead.

As those datasets fit in memory, it is possible to significantly improve performance by caching or pre-loading the dataset. Note that TFDS automatically caches small datasets (see the Auto-caching section below for details).

#### Caching the dataset

Here is an example of a data pipeline which explicitly caches the dataset after normalizing the images.

```python
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label


ds, ds_info = tfds.load(
    'mnist',
    split='train',
    as_supervised=True,  # returns `(img, label)` instead of dict(image=, ...)
    with_info=True,
)
# Applying normalization before `ds.cache()` to re-use it.
# Note: Random transformations (e.g. images augmentations) should be applied
# after both `ds.cache()` (to avoid caching randomness) and `ds.batch()` (for
# vectorization [1]).
ds = ds.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.cache()
# For true randomness, we set the shuffle buffer to the full dataset size.
ds = ds.shuffle(ds_info.splits['train'].num_examples)
# Batch after shuffling to get unique batches at each epoch.
ds = ds.batch(128)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```
[1] [Vectorizing mapping](https://www.tensorflow.org/guide/data_performance#vectorizing_mapping)

When iterating over this dataset, the second iteration will be much faster than the first one thanks to the caching.

#### Auto-caching

By default, TFDS auto-caches datasets which satisfy the following constraints:

- Total dataset size (all splits) is defined and < 250 MiB
- `shuffle_files` is disabled, or only a single shard is read

It is possible to opt out of auto-caching by passing `try_autocache=False` to `tfds.ReadConfig` in `tfds.load`. Have a look at the dataset catalog documentation to see if a specific dataset will use auto-cache.
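The two constraints can be written down as a small predicate. This is only a restatement of the rules listed above, assuming both must hold at once; `would_autocache` and `AUTOCACHE_MAX_BYTES` are hypothetical names, and TFDS's own check may take further factors into account.

```python
AUTOCACHE_MAX_BYTES = 250 * 2**20  # 250 MiB, the threshold quoted above


def would_autocache(total_size_bytes, shuffle_files, num_shards_read):
    """Return True when both auto-caching constraints hold.

    total_size_bytes is None when the dataset size is not defined.
    """
    size_ok = (
        total_size_bytes is not None
        and total_size_bytes < AUTOCACHE_MAX_BYTES
    )
    shuffle_ok = (not shuffle_files) or num_shards_read == 1
    return size_ok and shuffle_ok


# A 20 MiB dataset read without file shuffling qualifies.
print(would_autocache(20 * 2**20, shuffle_files=False, num_shards_read=1))  # True
# The same dataset with file shuffling over several shards does not.
print(would_autocache(20 * 2**20, shuffle_files=True, num_shards_read=4))   # False
# A dataset of unknown size never qualifies.
print(would_autocache(None, shuffle_files=False, num_shards_read=1))        # False
```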

#### Loading the full data as a single Tensor

If your dataset fits into memory, you can also load the full dataset as a single Tensor or NumPy array. It is possible to do so by setting `batch_size=-1` to batch all examples in a single `tf.Tensor`. Then use `tfds.as_numpy` for the conversion from `tf.Tensor` to `np.array`.

```python
(img_train, label_train), (img_test, label_test) = tfds.as_numpy(tfds.load(
    'mnist',
    split=['train', 'test'],
    batch_size=-1,
    as_supervised=True,
))
```

### Large datasets

Large datasets are sharded (split in multiple files), and typically do not fit in memory so they should not be cached.

#### Shuffle and training

During training, it's important to shuffle the data well; poorly shuffled data can result in lower training accuracy.

In addition to using `ds.shuffle` to shuffle records, you should also set `shuffle_files=True` to get good shuffling behavior for larger datasets that are sharded into multiple files. Otherwise, epochs will read the shards in the same order, and so data won't be truly randomized.

```python
ds = tfds.load('imagenet2012', split='train', shuffle_files=True)
```

Additionally, when `shuffle_files=True`, TFDS disables `options.experimental_deterministic`, which may give a slight performance boost. To get deterministic shuffling, it is possible to opt out of this feature with `tfds.ReadConfig`: either by setting `read_config.shuffle_seed` or overwriting `read_config.options.experimental_deterministic`.

#### Auto-shard your data across workers

When training on multiple workers, you can use the `input_context` argument of `tfds.ReadConfig`, so each worker will read a subset of the data.

```python
input_context = tf.distribute.InputContext(
    input_pipeline_id=1,  # Worker id
    num_input_pipelines=4,  # Total number of workers
)
read_config = tfds.ReadConfig(
    input_context=input_context,
)
ds = tfds.load('dataset', split='train', read_config=read_config)
```
This is complementary to the subsplit API. First the subsplit API is applied (`train[:50%]` is converted into a list of files to read), then a `ds.shard()` op is applied on those files. Example: when using `train[:50%]` with `num_input_pipelines=2`, each of the 2 workers will read 1/4 of the data.
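The "1/4 of the data" figure can be checked with a toy model of the two steps. The file names and the `files_for_worker` helper below are hypothetical, and the sketch assumes the 50% boundary falls exactly between two files; it mimics the selection pattern of `Dataset.shard(num_shards, index)`, which keeps every `num_shards`-th element starting at `index`.

```python
def files_for_worker(files, fraction, num_input_pipelines, input_pipeline_id):
    """Apply a sub-split first, then shard the remaining files across workers."""
    # Step 1: the sub-split keeps a prefix of the files (stand-in for train[:50%]).
    selected = files[: int(len(files) * fraction)]
    # Step 2: keep every num_input_pipelines-th file, starting at the worker id.
    return selected[input_pipeline_id::num_input_pipelines]


# Hypothetical shard names for a split stored in 8 files.
files = [f"train.tfrecord-{i:05d}-of-00008" for i in range(8)]

worker0 = files_for_worker(files, 0.5, num_input_pipelines=2, input_pipeline_id=0)
worker1 = files_for_worker(files, 0.5, num_input_pipelines=2, input_pipeline_id=1)

print(len(worker0) / len(files))  # 0.25
print(len(worker1) / len(files))  # 0.25
print(set(worker0) & set(worker1))  # set(): the workers read disjoint files
```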

When `shuffle_files=True`, files are shuffled within one worker, but not across workers. Each worker will read the same subset of files between epochs.

Note: When using `tf.distribute.Strategy`, the `input_context` can be automatically created with `experimental_distribute_datasets_from_function`.

#### Faster image decoding

By default, TFDS automatically decodes images. However, there are cases where it can be more performant to skip the image decoding with `tfds.decode.SkipDecoding` and manually apply the `tf.io.decode_image` op:

- When filtering examples (with `ds.filter`), to decode images after examples have been filtered.
- When cropping images, to use the fused `tf.image.decode_and_crop_jpeg` op.

The code for both examples is available in the decode guide.


## Datasets versioning


### Semantic

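Every `DatasetBuilder` defined in TFDS has a version, which follows [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html). One purpose of versioning is to guarantee reproducibility: loading a dataset at a fixed version yields exactly the same data. More specifically:

- If the `PATCH` version is incremented, the data may be serialized differently on disk, or the metadata may differ, but the data read by the client is the same. Any given slice returns the same set of records.
- If the `MINOR` version is incremented, the existing data read by the client is unchanged, but there is additional data, such as extra features in each record. Any given slice returns the same set of records.
- If the `MAJOR` version is incremented, the existing data has changed, and a given slice does not necessarily return the same set of records.

If a change to the TFDS library code affects how a dataset is serialized or how the client reads it, the corresponding builder version is incremented according to the rules above.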
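As a rough illustration of these rules, the check below compares only the `MAJOR` component of two version strings to decide whether a given slice is expected to return the same records. `parse_version` and `same_slice_records` are hypothetical helpers, not TFDS functions, and they assume plain `MAJOR.MINOR.PATCH` strings.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def same_slice_records(version_a, version_b):
    """True when the versioning rules promise the same records for a slice."""
    return parse_version(version_a)[0] == parse_version(version_b)[0]


print(same_slice_records("3.0.0", "3.2.0"))  # True: only MINOR differs
print(same_slice_records("2.0.1", "3.0.0"))  # False: MAJOR differs
```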

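Note that these semantics are best effort: an unnoticed bug may affect a dataset without its version having been incremented. Such bugs are eventually fixed, but if you rely heavily on versioning, we advise you to use a released version of TFDS (as opposed to `HEAD`). Also note that some datasets have another versioning scheme that is independent of the TFDS version. For example, the Open Images dataset has several versions, and in TFDS the corresponding builders are `open_images_v4`, `open_images_v5`, and so on.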

### Supported versions
A `DatasetBuilder` can support several versions of a dataset, which may be higher or lower than the canonical version. For example:
```python
class Imagenet2012(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version(
        '2.0.1', 'Encoding fix. No changes from user POV'
    )
    SUPPORTED_VERSIONS = [
        tfds.core.Version('3.0.0', 'S3: tensorflow.org/datasets/splits'),
        tfds.core.Version('1.0.0'),
        tfds.core.Version('0.0.9', tfds_version_to_prepare="v1.0.0"),
    ]
```
Whether an older version continues to be supported is decided case by case, based mainly on the popularity of the dataset and of that version. In the example above, ImageNet-2012 version 2.0.0 is no longer supported, although from the user's point of view it is identical to 2.0.1.

A version can specify `tfds_version_to_prepare`. This means that the current TFDS code can read this dataset version only if it has already been prepared by an older version of the code; the current code cannot prepare it. The value of `tfds_version_to_prepare` is the last known TFDS version that can download and prepare the dataset at this version.

### Loading a specific version

When loading a dataset or a `DatasetBuilder`, you can specify the version to use. For example:

```python
tfds.load('imagenet2012:2.0.1')
tfds.builder('imagenet2012:2.0.1')

tfds.load('imagenet2012:2.0.0')  # Error: unsupported version.

# Resolves to 3.0.0 for now, but would resolve to 3.1.1 if/when it is added.
tfds.load('imagenet2012:3.*.*')
```

If using TFDS for a publication, we advise you to:

- fix the `MAJOR` component of the version only;
- advertise which version of the dataset was used in your results.

Doing so should make it easier for your future self, your readers and reviewers to reproduce your results.
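The wildcard behaviour described above can be modelled as "pick the highest supported version that matches the pattern". The sketch below is an assumption about that behaviour based on the comment in the example, not the TFDS resolver itself; `resolve_version` is a hypothetical helper, and the version list is taken from the `Imagenet2012` example.

```python
def resolve_version(requested, available):
    """Pick the highest available version matching a pattern such as '3.*.*'."""
    wanted = requested.split(".")
    matches = []
    for version in available:
        parts = version.split(".")
        if len(parts) != len(wanted):
            continue
        # A '*' component matches anything; other components must be equal.
        if all(w == "*" or w == p for w, p in zip(wanted, parts)):
            matches.append(tuple(int(p) for p in parts))
    if not matches:
        raise ValueError(f"No supported version matches {requested!r}")
    return ".".join(str(n) for n in max(matches))


# Canonical and supported versions from the Imagenet2012 example.
available = ["2.0.1", "3.0.0", "1.0.0", "0.0.9"]

print(resolve_version("3.*.*", available))              # 3.0.0
print(resolve_version("3.*.*", available + ["3.1.1"]))  # 3.1.1
```

Requesting `"2.0.0"` raises `ValueError` in this sketch, which mirrors the unsupported-version error in the example.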

### Experiments

To gradually roll out changes in TFDS which are impacting many dataset builders, we introduced the notion of experiments. When first introduced, an experiment is disabled by default, but specific dataset versions can decide to enable it. This will typically be done on "future" versions (not made canonical yet) at first. For example:

```python
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  SUPPORTED_VERSIONS = [
      tfds.core.Version("2.0.0", "EXP1: Opt-in for experiment 1",
                        experiments={tfds.core.Experiment.EXP1: True}),
  ]
```

Once an experiment has been verified to work as expected, it will be extended to all or most datasets, at which point it can be enabled by default, and the above definition would then look like:

```python
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0",
                              experiments={tfds.core.Experiment.EXP1: False})
  SUPPORTED_VERSIONS = [
      tfds.core.Version("2.0.0", "EXP1: Opt-in for experiment 1"),
  ]
```

Once an experiment is used across all dataset versions (there is no dataset version left specifying `{experiment: False}`), the experiment can be deleted.

Experiments and their descriptions are defined in `core/utils/version.py`.

### BUILDER_CONFIGS and versions

Some datasets define several `BUILDER_CONFIGS`. When that is the case, `version` and `supported_versions` are defined on the config objects themselves. Other than that, semantics and usage are identical. For example:

```python
class OpenImagesV4(tfds.core.GeneratorBasedBuilder):

  BUILDER_CONFIGS = [
      OpenImagesV4Config(
          name='original',
          version=tfds.core.Version('0.2.0'),
          supported_versions=[
            tfds.core.Version('1.0.0', "Major change in data"),
          ],
          description='Images at their original resolution and quality.'),
      ...
  ]

tfds.load('open_images_v4/original:1.*.*')
```

Last updated 2021-01-28 UTC.



## Other Guides
- [TFDS CLI](https://www.tensorflow.org/datasets/cli)
- [FeatureConnector](https://www.tensorflow.org/datasets/features)
- [Customizing feature decoding](https://www.tensorflow.org/datasets/decode)
- [Community](https://www.tensorflow.org/datasets/community)