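The sections above covered in detail how to split the data, shuffle the training samples, generate batches, and wrap data reading and preprocessing in functions. In the PaddlePaddle framework, defining and loading a dataset comes down to the following two core steps.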
![](https://ai-studio-static-online.cdn.bcebos.com/a5fd990c5355426183a71b95aa28a59f979014f6905144ddb415c5a4fe647441)

With PaddlePaddle's two APIs `paddle.io.Dataset` and `paddle.io.DataLoader`, it is easy to create an iterator that reads data asynchronously.

In [10]:
# Import the required libraries
import json
import gzip
import paddle
from paddle.vision.transforms import Normalize # normalization transform
from paddle.io import Dataset # dataset base class
from paddle.nn import functional as F

In [11]:
# Define the image normalization transform; 'CHW' means the image layout must be [C channels, H height, W width]
transform = Normalize(mean=[127.5], std=[127.5], data_format='CHW')

class MNISTDataset(Dataset):
    """
    Step 1: subclass paddle.io.Dataset
    """
    
    def __init__(self, datafile, mode='train', transform=None):
        """
        Step 2: implement the constructor
        """
        super().__init__()

        self.mode = mode
        self.transform = transform

        print('loading mnist dataset from {} ......'.format(datafile))
        # Load the gzip-compressed JSON data file, closing the file handle afterwards
        with gzip.open(datafile) as f:
            data = json.load(f)
        print('mnist dataset load done')
   
        # The loaded data is split into training, validation, and test sets
        train_set, val_set, eval_set = data

        if mode=='train':
            # Use the training set
            self.imgs, self.labels = train_set[0], train_set[1]
        elif mode=='valid':
            # Use the validation set
            self.imgs, self.labels = val_set[0], val_set[1]
        elif mode=='test':
            # Use the test set
            self.imgs, self.labels = eval_set[0], eval_set[1]
        else:
            raise Exception("mode can only be one of ['train', 'valid', 'test']")
    
    def __getitem__(self, index):
        """
        Step 3: implement __getitem__, which defines how to fetch the sample at a given index
        """
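        data = self.imgs[index]
        label = self.labels[index]

        # Apply the transform only if one was provided (transform defaults to None)
        if self.transform is not None:
            data = self.transform(data)
        return data, label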

    def __len__(self):
        """
        Step 4: implement __len__, which returns the total number of samples
        """
        return len(self.imgs)

datafile = './dataset/mnist.json.gz'

# Load the dataset file and initialize the Dataset objects
train_dataset = MNISTDataset(datafile, mode='train', transform=transform)
test_dataset = MNISTDataset(datafile, mode='test', transform=transform)

print('train images: ', len(train_dataset), ', test images: ', len(test_dataset))

loading mnist dataset from ./dataset/mnist.json.gz ......
mnist dataset load done
loading mnist dataset from ./dataset/mnist.json.gz ......
mnist dataset load done
train images:  50000 , test images:  10000
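
The `Normalize(mean=[127.5], std=[127.5], ...)` transform defined above computes `(x - mean) / std` for each value. The following is a minimal plain-Python sketch of that arithmetic, not Paddle's implementation; the helper name `normalize` and the sample values are hypothetical, and the example assumes 8-bit pixel values in the range 0 to 255.

```python
def normalize(pixels, mean=127.5, std=127.5):
    """Apply (x - mean) / std to every value (hypothetical helper for illustration)."""
    return [(p - mean) / std for p in pixels]

# Assumed sample values: the minimum, midpoint, and maximum of an 8-bit pixel range
sample = [0, 127.5, 255]
normalized = normalize(sample)
print(normalized)  # [-1.0, 0.0, 1.0]
```

With these parameters, inputs in [0, 255] are mapped onto [-1, 1], which keeps the inputs centered around zero.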


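Once the `paddle.io.Dataset` is defined, the `paddle.io.DataLoader` API provides asynchronous data reading: data is prefetched by Python threads and pushed asynchronously into a queue, and the split into batches is handled automatically.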
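
Conceptually, the shuffling, batching, and dropping of a final incomplete batch can be sketched in plain Python as follows. This is a single-threaded illustration only and does not reflect how `paddle.io.DataLoader` is implemented (it does no prefetching); `ToyDataset` and `iterate_batches` are hypothetical names introduced for this example.

```python
import random

class ToyDataset:
    """Hypothetical map-style dataset exposing __getitem__ and __len__."""
    def __init__(self, n):
        self.samples = [(float(i), i % 10) for i in range(n)]

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

def iterate_batches(dataset, batch_size, shuffle=True, drop_last=True, seed=0):
    """Yield (features, labels) batches by indexing into the dataset."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # shuffle the sample order
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        if drop_last and len(batch_indices) < batch_size:
            break  # discard the final incomplete batch
        features = [dataset[i][0] for i in batch_indices]
        labels = [dataset[i][1] for i in batch_indices]
        yield features, labels

batches = list(iterate_batches(ToyDataset(10), batch_size=4))
print(len(batches))  # 2
```

With 10 samples and a batch size of 4, the last 2 samples do not fill a batch and are dropped, leaving 2 full batches.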

In [12]:
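# Define and initialize the data loader
train_loader = paddle.io.DataLoader(train_dataset,  # the dataset instance
                                    batch_size=32,  # number of samples per batch
                                    shuffle=True,  # shuffle the sample order
                                    num_workers=4,  # number of worker subprocesses used to load data
                                    drop_last=True # drop the last incomplete batch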
                                    ) 
print('step num:',len(train_loader))

step num: 1562
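
The step count printed above follows directly from the dataset size and the batch size. A short worked check, using the 50000 training images and `batch_size=32` from this section:

```python
import math

num_samples = 50000   # training images reported above
batch_size = 32       # batch size passed to the DataLoader

steps_drop_last = num_samples // batch_size             # incomplete batch discarded
steps_keep_last = math.ceil(num_samples / batch_size)   # incomplete batch kept
leftover = num_samples % batch_size                     # samples in the incomplete batch

print(steps_drop_last, steps_keep_last, leftover)  # 1562 1563 16
```

With `drop_last=True`, the 16 leftover samples of each epoch are skipped; because the data is reshuffled, a different set of samples is left over each time.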




In [13]:
class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()
        # Define a single fully connected layer with output dimension 1
        self.fc = paddle.nn.Linear(in_features=784, out_features=1)

    def forward(self, inputs):
        outputs = self.fc(inputs)
        return outputs

In [14]:
def train(model):
    print('train:')
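    model.train() # switch to training mode; layers such as dropout and batch norm behave differently in eval mode
    opt = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters()) # use stochastic gradient descent as the optimizer
    EPOCH_NUM = 3 # train for three epochs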
    for epoch_id in range(EPOCH_NUM):
        print('epoch:',epoch_id)
        for batch_id, data in enumerate(train_loader()):
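            images, labels = data
            images = paddle.to_tensor(images).astype('float32')
            labels = paddle.to_tensor(labels).astype('float32')
            # Reshape labels to [batch_size, 1] to match the model output; a [batch_size] label
            # would broadcast against [batch_size, 1] predictions into a [batch_size, batch_size] loss
            labels = paddle.reshape(labels, [-1, 1])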
            
            images = paddle.reshape(images, [images.shape[0], images.shape[2]*images.shape[3]])

            # Forward pass
            predicts = model(images)

            # Compute the loss, averaged over the samples in the batch
            loss = F.square_error_cost(predicts, labels) # squared error loss
            avg_loss = paddle.mean(loss)        # mean loss over the batch
            
            # Print the current loss every 200 batches
            if batch_id % 200 == 0:
                print("epoch: {}, batch: {}, loss is: {}".format(epoch_id, batch_id, avg_loss.numpy()))
            
            # Backward pass and parameter update
            avg_loss.backward()
            opt.step()
            opt.clear_grad()

    # Save the model parameters
    paddle.save(model.state_dict(), './model/mnist.pdparams')

# Create the model
print("create model:")
model = MNIST()
# Start training
train(model)

create model:
train:
epoch: 0
epoch: 0, batch: 0, loss is: [49.721504]
epoch: 0, batch: 200, loss is: [8.000044]
epoch: 0, batch: 400, loss is: [7.875365]
epoch: 0, batch: 600, loss is: [8.398896]
epoch: 0, batch: 800, loss is: [8.500625]
epoch: 0, batch: 1000, loss is: [10.710493]
epoch: 0, batch: 1200, loss is: [8.897474]
epoch: 0, batch: 1400, loss is: [8.95763]
epoch: 1
epoch: 1, batch: 0, loss is: [6.9973726]
epoch: 1, batch: 200, loss is: [7.125519]
epoch: 1, batch: 400, loss is: [8.90251]
epoch: 1, batch: 600, loss is: [9.347914]
epoch: 1, batch: 800, loss is: [5.661331]
epoch: 1, batch: 1000, loss is: [12.945835]
epoch: 1, batch: 1200, loss is: [8.424143]
epoch: 1, batch: 1400, loss is: [6.9396296]
epoch: 2
epoch: 2, batch: 0, loss is: [8.589945]
epoch: 2, batch: 200, loss is: [9.462098]
epoch: 2, batch: 400, loss is: [7.1859465]
epoch: 2, batch: 600, loss is: [7.637679]
epoch: 2, batch: 800, loss is: [10.674815]
epoch: 2, batch: 1000, loss is: [9.763366]
epoch: 2, batch: 1200,