In [2]:
import tensorflow as tf
import tensorflow.data as data

# tf.data.Dataset()
`tf.data.Dataset(variant_tensor)`

**Docstring**

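Creates a `DatasetV2` object, which represents a dataset that may contain a very large number of elements. `tf.data.Dataset` supports writing efficient, descriptive input pipelines, and its usage follows this pattern:

1. Create a source dataset from the input data. For example, from a `list` with `data.Dataset.from_tensor_slices([1, 2, 3])`, from the lines of text files with `data.TextLineDataset(["file1.txt", "file2.txt"])`, from `TFRecord` files with `data.TFRecordDataset(["file1.tfrecords", "file2.tfrecords"])`, or as a dataset of all files matching a pattern with `data.Dataset.list_files("/path/*.txt")`. For more ways to create a dataset, see `tf.data.FixedLengthRecordDataset` and `tf.data.Dataset.from_generator`.

2. Apply dataset transformations to preprocess the data, for example `dataset = dataset.map(lambda x: x*2)`.

3. Iterate over the dataset and process its elements.

Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.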
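
The three steps above can be sketched without TensorFlow. The snippet below is only a plain-Python model of the streaming idea, not how `tf.data` is implemented; the names `source` and `doubled` are hypothetical and stand in for a source dataset and a `map` transformation.

```python
from itertools import islice


def source():
    """Step 1: a source that yields elements one at a time (an endless counter)."""
    n = 1
    while True:
        yield n
        n += 1


def doubled(elements):
    """Step 2: a lazy transformation, analogous to dataset.map(lambda x: x * 2)."""
    for x in elements:
        yield x * 2


# Step 3: iterate. Only the elements that are actually requested get produced,
# so the (infinite) source never has to exist in memory as a whole.
first_three = list(islice(doubled(source()), 3))
print(first_three)  # [2, 4, 6]
```

Because every stage is a generator, memory use depends on the elements currently in flight rather than on the size of the source, which is the property the streaming remark describes.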

**Args**:

- variant_tensor: a `DT_VARIANT` tensor that represents the dataset


**Common Terms**:

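- Element: the output of calling `next()` on a dataset iterator. An element can be a nested structure such as a tuple, a dict, or a `namedtuple`.

- Component: a leaf in the nested structure of an element. A component can be of any type representable by `tf.TypeSpec`, for example `tf.Tensor`, `tf.data.Dataset`, `tf.SparseTensor`, `tf.RaggedTensor`, or `tf.TensorArray`.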
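
To make the two terms concrete, here is a small plain-Python sketch that lists the components (leaves) of one nested element. The helper `components` and the `Point` structure are hypothetical and written only for this illustration; visiting dict keys in sorted order is an assumption of the sketch, chosen so the result is deterministic.

```python
from collections import namedtuple


def components(element):
    """Return the leaves of a nested element, depth first."""
    if isinstance(element, dict):
        # Assumption of this sketch: visit keys in sorted order.
        return [leaf for key in sorted(element) for leaf in components(element[key])]
    if isinstance(element, tuple):  # also covers namedtuple instances
        return [leaf for item in element for leaf in components(item)]
    return [element]  # anything else counts as a single component


Point = namedtuple("Point", ["x", "y"])  # hypothetical structure
element = {"label": "A", "features": (1.0, Point(x=2, y=3))}
print(components(element))  # [1.0, 2, 3, 'A']
```

In a real dataset each leaf would be a tensor-like value rather than a Python scalar, but the split between the nested element and its leaf components is the same.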

**File**:   \tensorflow\python\data\ops\dataset_ops.py

**Type**:           ABCMeta

**Subclasses**:     DatasetV1, DatasetSource, UnaryDataset, \_VariantDataset, ZipDataset, ConcatenateDataset, TFRecordDatasetV2, \_DirectedInterleaveDataset, \_PerDeviceGenerator, \_ReincarnatedPerDeviceGenerator, ...

## tf.data.Dataset.from_tensor_slices()

`tf.data.Dataset.from_tensor_slices(tensors)`

__Docstring__

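`tensors` is a dataset element made up of tensors, all of which must have the same size in their first dimension. The function slices `tensors` along that first dimension. This preserves the structure of the input tensors, removes the first dimension of each tensor, and uses it as the dataset dimension.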
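
The slicing rule can be modelled in plain Python, with nested lists standing in for tensors. This is a sketch of the semantics described above, not the TensorFlow implementation; `first_dim`, `take`, and `slice_first_dim` are hypothetical helper names.

```python
def first_dim(tensors):
    """Yield the first-dimension length of every leaf (a list) in the structure."""
    if isinstance(tensors, dict):
        for value in tensors.values():
            yield from first_dim(value)
    elif isinstance(tensors, tuple):
        for item in tensors:
            yield from first_dim(item)
    else:
        yield len(tensors)


def take(tensors, i):
    """Pick index i from every leaf while keeping the surrounding structure."""
    if isinstance(tensors, dict):
        return {key: take(value, i) for key, value in tensors.items()}
    if isinstance(tensors, tuple):
        return tuple(take(item, i) for item in tensors)
    return tensors[i]


def slice_first_dim(tensors):
    """Return the elements obtained by slicing along the first dimension."""
    sizes = set(first_dim(tensors))
    if len(sizes) != 1:
        raise ValueError(f"first dimensions differ: {sorted(sizes)}")
    return [take(tensors, i) for i in range(sizes.pop())]


print(slice_first_dim([[1, 2], [3, 4], [5, 6]]))
# [[1, 2], [3, 4], [5, 6]]  (three elements, each of length 2)
print(slice_first_dim(([1, 2, 3], ["A", "B", "A"])))
# [(1, 'A'), (2, 'B'), (3, 'A')]
print(slice_first_dim({"a": [1, 2], "b": [3, 4]}))
# [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```

The tuple and dict wrappers survive in every element, while the leading dimension of each leaf becomes the number of elements. Leaves whose first dimensions disagree are rejected.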

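Note: if `tensors` contains a NumPy array and eager execution is not enabled, the values are embedded in the graph as one or more `tf.constant` operations. For large datasets (> 1 GB), this can waste memory and run into the byte limits of graph serialization. If `tensors` contains one or more large NumPy arrays, consider the alternative described in [this guide](https://tensorflow.org/guide/data#consuming_numpy_arrays).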

__Type__: function

In [None]:
# Slicing a 1D tensor produces scalar tensor elements.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for k, v in dataset.__dict__.items():
    print(k, v, sep="\n")
    print()
print(list(dataset.as_numpy_iterator()))

In [None]:
# Slicing a 2D tensor produces 1D tensor elements.
dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print(dataset._tensors)
for element in dataset.as_numpy_iterator():
    print(element)
for item in dataset:  # `item` avoids shadowing the `data` module alias
    print(item)

In [None]:
# Slicing a tuple of 1D tensors produces tuple elements containing scalar tensors.
dataset = tf.data.Dataset.from_tensor_slices(([1, 2, 3], [3, 4, 5], [5, 6, 7]))
print(dataset._tensors, end="\n\n")
for element in dataset.as_numpy_iterator():
    print(element)
for item in dataset:  # `item` avoids shadowing the `data` module alias
    print(item)
    print()

In [None]:
# Dictionary structure is also preserved.
dataset = tf.data.Dataset.from_tensor_slices({"a": [1, 2, 3], "b": [3, 4, 5]})
print(dataset._tensors, end="\n\n")
for element in dataset.as_numpy_iterator():
    print(element)
for item in dataset:  # `item` avoids shadowing the `data` module alias
    print(item, end="\n\n")
for k, v in dataset.__dict__.items():
    print(k, v, sep="\n", end="\n\n")

In [50]:
# Two tensors can be combined into one Dataset object.
features = tf.constant([[1, 3], [2, 1], [3, 3]])
labels = tf.constant(['A', 'B', 'A'])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

for item in dataset:  # `item` avoids shadowing the `data` module alias
    print(item, end="\n\n")
for element in dataset.as_numpy_iterator():
    print(element, end="\n\n")
print(dataset._tensors)

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 1])>, <tf.Tensor: shape=(), dtype=string, numpy=b'B'>)

(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 3])>, <tf.Tensor: shape=(), dtype=string, numpy=b'A'>)

(array([1, 3]), b'A')

(array([2, 1]), b'B')

(array([3, 3]), b'A')

[<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[1, 3],
       [2, 1],
       [3, 3]])>, <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'A', b'B', b'A'], dtype=object)>]


In [47]:
# Both the features and the labels tensors can be converted to a Dataset object separately and combined after.
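features_dataset = tf.data.Dataset.from_tensor_slices(features)
labels_dataset = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

# Materialize the whole dataset once, then iterate element by element.
print(list(dataset.as_numpy_iterator()), end="\n\n")
for element in dataset.as_numpy_iterator():
    print(element)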

[(array([1, 3]), b'A'), (array([2, 1]), b'B'), (array([3, 3]), b'A')]

(array([1, 3]), b'A')
(array([2, 1]), b'B')
(array([3, 3]), b'A')


In [46]:
# A batched feature and label set can be converted to a Dataset in similar fashion.
batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2))
batched_labels = tf.constant([['A', 'A'],
                              ['B', 'B'],
                              ['A', 'B']], shape=(3, 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
for element in dataset.as_numpy_iterator():
    print(element)

(array([[1, 3],
       [2, 3]]), array([[b'A'],
       [b'A']], dtype=object))
(array([[2, 1],
       [1, 2]]), array([[b'B'],
       [b'B']], dtype=object))
(array([[3, 3],
       [3, 2]]), array([[b'A'],
       [b'B']], dtype=object))
