## PyTorch Tutorial: 02 Dataset and Iterator
**Overview**

In this tutorial, we cover the basics of constructing datasets and iterators so that we can train models with gradient descent.

A more complete tutorial is available on the official website (https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

In the end we loop over the dataset, because each optimizer step is fed only part of the data (for example, 16 samples at a time).
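
As a rough sketch of what "feeding the optimizer part of the data at a time" looks like, the loop below performs one gradient-descent update per batch of 16. The toy data, the linear model, the learning rate, and all variable names are assumptions made for illustration; they are not part of the original notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy regression data (made up for illustration): 128 samples, 4 features.
features = torch.randn(128, 4)
targets = torch.randn(128, 1)
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

num_steps = 0
for batch_x, batch_y in loader:          # one full pass = one epoch
    optimizer.zero_grad()                # clear gradients from the previous step
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()                      # gradients computed from this batch only
    optimizer.step()                     # one gradient-descent update
    num_steps += 1

print(num_steps)  # 128 / 16 = 8 updates per epoch
```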

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

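To write your own dataset, subclass `Dataset` and override the `__init__`, `__len__`, and `__getitem__` methods:
- `__init__` builds a random tensor holding 1024 observations.
- `__len__` returns the number of observations.
- `__getitem__` returns the element of `self.data` at the requested index.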

In [13]:
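class MyStupidDataset(Dataset):
    def __init__(self):
        super().__init__()
        # 1024 observations, each a 10 x 10 tensor of standard-normal noise
        self.data = torch.randn([1024, 10, 10])

    def __len__(self):
        # Derive the length from the data instead of hard-coding 1024
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Return the idx-th observation, shape [10, 10]
        return self.data[idx, :, :]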

Instantiate the `MyStupidDataset` class.

In [14]:
my_stupid_dataset = MyStupidDataset()

Draw the samples in shuffled order, 64 per batch.

In [15]:
my_data_loader = DataLoader(my_stupid_dataset, batch_size=64, shuffle=True)

In [16]:
for i in my_data_loader:
    print(i)

tensor([[[-0.5248, -0.3630, -0.1930,  ..., -0.0119, -0.8769, -0.0260],
         [ 0.7588,  0.3513,  0.3004,  ..., -0.1390, -1.9918, -3.0955],
         [ 1.3722,  0.9617, -0.1913,  ...,  1.0919, -0.7060,  1.7729],
         ...,
         [-0.2414, -0.4232,  0.8512,  ..., -0.8399,  1.5990,  0.2912],
         [ 0.8594,  1.7106,  0.2697,  ..., -0.7389,  0.1953,  0.5315],
         [ 2.2895, -0.0143, -0.2215,  ..., -0.5882,  0.7238, -1.5884]],

        [[-1.0130,  1.4657,  0.0318,  ...,  0.3085, -1.3065,  1.7797],
         [ 0.8293,  1.7893,  1.4623,  ...,  1.7632,  1.7780, -0.1956],
         [-1.2968,  0.6986, -1.1993,  ...,  0.2518,  1.5495,  0.1413],
         ...,
         [ 1.2509, -1.0730,  0.5910,  ..., -0.3773,  0.1063, -0.2226],
         [ 0.5676, -0.0704,  1.0895,  ...,  0.2139, -0.8109, -0.5394],
         [ 0.8168, -0.1895, -1.5879,  ...,  0.6088,  0.1515, -1.8768]],

        [[-1.1208,  2.3808, -0.1466,  ...,  1.1621, -0.7439,  0.5811],
         [-0.5014, -1.7511, -1.3245,  ..., -0
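
Printing whole batches is hard to read, so a quicker check is to inspect shapes. The sketch below rebuilds an equivalent dataset under a hypothetical name (`RandomCubeDataset`, not from the original notebook) and confirms that the loader stacks 64 samples along a new leading dimension.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomCubeDataset(Dataset):  # hypothetical name, mirrors MyStupidDataset
    def __init__(self, num_samples=1024):
        super().__init__()
        self.data = torch.randn(num_samples, 10, 10)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx]

dataset = RandomCubeDataset()
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Collect the shape of every batch in one pass over the data.
batch_shapes = [tuple(batch.shape) for batch in loader]

print(len(loader))      # batches per epoch: 1024 / 64 = 16
print(batch_shapes[0])  # (64, 10, 10)
```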

`__getitem__` can also return a dictionary.

In [17]:
# A common pattern: return each sample as a dict of named fields
class MyDictDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.x = torch.randn(1024, 10)  # 1024 samples, 10 features each
        self.y = torch.randn(1024)      # one scalar target per sample

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return {'x': self.x[idx, :], 'y': self.y[idx]}

In [18]:

my_dict_dataset = MyDictDataset()
my_data_loader = DataLoader(my_dict_dataset, batch_size=64, shuffle=True)
for batch in my_data_loader:
    print(batch['x'])
    print(batch['y'])

tensor([[-1.0969e+00,  1.9487e-01, -1.7817e+00, -3.5030e-02,  2.7460e-01,
          9.0036e-01,  2.8186e-01,  8.1132e-01, -5.3291e-01, -2.5058e-01],
        [-7.2048e-01, -9.6821e-01, -2.1010e-02,  1.2674e-01,  5.2837e-01,
          1.0812e+00, -1.9834e-01,  1.3347e-01,  5.5947e-01, -2.6234e+00],
        [-1.3195e+00, -4.7648e-01,  5.4709e-01, -1.5657e+00, -6.8983e-01,
          1.1476e+00,  4.5488e-01, -3.0486e-01,  4.4632e-01, -1.0609e+00],
        [-4.0256e-01,  9.4457e-01, -4.3113e-02,  9.6263e-01,  8.8403e-01,
         -8.7362e-01, -8.1732e-01, -5.1808e-02,  6.6693e-01,  1.2780e+00],
        [ 3.7785e-01,  3.2189e-01, -2.0458e+00, -1.3171e+00, -1.0227e+00,
          2.0059e+00, -2.8464e-02, -1.5992e+00,  4.3703e-01, -1.4566e+00],
        [-9.7553e-01, -2.0615e-01, -1.3457e+00, -1.1816e+00,  4.4045e-01,
          8.9409e-01, -4.8239e-01, -2.5157e-02, -9.6210e-01, -3.6987e-01],
        [-1.0285e+00,  2.1357e+00, -1.7101e-02,  1.8494e+00, -5.6660e-01,
         -1.0794e+00, -7.2071e-0
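
The default collate function of `DataLoader` batches dictionaries key by key, which is why `batch['x']` and `batch['y']` above are tensors rather than lists of samples. A minimal sketch of the shapes involved, using a hypothetical `PairDataset` that mirrors `MyDictDataset`:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):  # hypothetical name, mirrors MyDictDataset
    def __init__(self, num_samples=1024):
        super().__init__()
        self.x = torch.randn(num_samples, 10)
        self.y = torch.randn(num_samples)

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return {'x': self.x[idx], 'y': self.y[idx]}

pair_dataset = PairDataset()
sample = pair_dataset[0]
batch = next(iter(DataLoader(pair_dataset, batch_size=64, shuffle=True)))

# A single sample has no batch dimension; a batch gains a leading one.
print(sample['x'].shape, sample['y'].shape)  # torch.Size([10]) torch.Size([])
print(batch['x'].shape, batch['y'].shape)    # torch.Size([64, 10]) torch.Size([64])
```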

You can also use a ready-made dataset class and pass the tensors to it directly.

In [20]:
from torch.utils.data import TensorDataset
x = torch.randn(10,100)
y = torch.randn(10)
tensor_dataset = TensorDataset(x, y)
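
For completeness, here is a short sketch of how a `TensorDataset` behaves once built: indexing returns a tuple with one entry per wrapped tensor, and a `DataLoader` yields batched tuples. The batch size of 4 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.randn(10, 100)  # 10 samples, 100 features each
y = torch.randn(10)       # one target per sample
tensor_dataset = TensorDataset(x, y)

print(len(tensor_dataset))            # 10
first_x, first_y = tensor_dataset[0]  # indexing returns a tuple (x[0], y[0])

loader = DataLoader(tensor_dataset, batch_size=4)
batch_sizes = [batch_x.shape[0] for batch_x, batch_y in loader]
print(batch_sizes)  # [4, 4, 2]; the last batch is smaller by default
```

Passing `drop_last=True` to `DataLoader` would discard the final, smaller batch.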

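Why write your own Dataset?

Example: a text corpus is very large and you need to iterate over it.

1. Each access opens the file and reads only part of the text through file I/O (this uses little memory or GPU memory, but reads are slower).

2. Each time you iterate, the text is tokenized again on the fly.

Writing your own Dataset lets you decide at which stage each of these operations is performed.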
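
The two points above can be sketched as follows. This is a minimal illustration, not a production pipeline: `LazyLineDataset`, `keep_as_list`, the temporary corpus file, and the whitespace tokenizer are all assumptions made for the example. The dataset stores only byte offsets in `__init__`, and reads and tokenizes one line per `__getitem__` call.

```python
import os
import tempfile
from torch.utils.data import Dataset, DataLoader

class LazyLineDataset(Dataset):
    """Reads one line per item from disk and tokenizes it on access."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        self.offsets = []
        # Single pass to record where each line starts; the text is not kept.
        with open(path, 'rb') as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Stage 1: file I/O, read only the requested line.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode('utf-8')
        # Stage 2: tokenize on every access (whitespace split as a stand-in).
        return line.split()

def keep_as_list(samples):
    # Token lists differ in length, so skip the default tensor stacking.
    return samples

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'corpus.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('the quick brown fox\njumps over\nthe lazy dog\n')
    text_dataset = LazyLineDataset(path)
    batches = list(DataLoader(text_dataset, batch_size=2, collate_fn=keep_as_list))

print(batches[0])  # [['the', 'quick', 'brown', 'fox'], ['jumps', 'over']]
```

Moving the tokenization into `__init__` would trade memory for speed; keeping it in `__getitem__` keeps memory low at the cost of repeating the work every epoch.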