# Dataset and DataLoader

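PyTorch typically uses two utility classes, Dataset and DataLoader, to build data pipelines.

A Dataset defines the contents of a data set. It behaves like a list: it has a fixed length, and its elements can be retrieved by index.

A DataLoader defines how the data set is loaded batch by batch. It is an iterable object that implements the `__iter__` method, and each iteration yields one batch of data.

The DataLoader controls the batch size, how the elements of each batch are sampled, and how a batch is collated into the input form the model expects. It can also read data with multiple processes.

In most cases, users only need to implement the `__len__` and `__getitem__` methods of Dataset to build their own data set and load it with the default data pipeline.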
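
As a minimal sketch of that workflow, the hypothetical `SquaresDataset` below (a toy class invented for illustration, not part of PyTorch) implements only `__len__` and `__getitem__` and is then loaded with a default DataLoader.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy data set: sample i is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of samples in the data set
        return self.n

    def __getitem__(self, index):
        # Return one (feature, label) pair for the given index
        x = torch.tensor(float(index))
        return x, x * x


ds = SquaresDataset(10)
dl = DataLoader(ds, batch_size=4)  # shuffle defaults to False
first_x, first_y = next(iter(dl))
print(first_x)  # tensor([0., 1., 2., 3.])
print(first_y)  # tensor([0., 1., 4., 9.])
```

The default collate function stacks the per-sample tensors, so each batch arrives as a pair of 1-D tensors.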

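## Overview of Dataset and DataLoader

Fetching one batch of data takes four steps:

1. Determine the length of the data set.
2. Sample batch_size indices from the range of that length.
3. Fetch the elements at those indices from the data set.
4. Collate the elements into Tensors and output them.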
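
The four steps can be walked through by hand. The sketch below uses a small made-up TensorDataset and `torch.randperm` for the sampling step; it only illustrates the idea and is not how DataLoader is implemented internally.

```python
import torch
from torch.utils.data import TensorDataset

# Toy data (an assumption for illustration): feature i is [i, i], label is i
features = torch.arange(10, dtype=torch.float32).unsqueeze(1).repeat(1, 2)
labels = torch.arange(10)
ds = TensorDataset(features, labels)
batch_size = 4

n = len(ds)                                         # step 1: length of the data set
indices = torch.randperm(n)[:batch_size].tolist()   # step 2: sample batch_size indices
samples = [ds[i] for i in indices]                  # step 3: fetch the elements
batch_x = torch.stack([x for x, _ in samples])      # step 4: collate into Tensors
batch_y = torch.stack([y for _, y in samples])

print(batch_x.shape, batch_y.shape)
```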

Pseudocode for the core interface logic of Dataset and DataLoader:

```python
import torch

class Dataset(object):
    def __init__(self):
        pass

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self,index):
        raise NotImplementedError


class DataLoader(object):
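    def __init__(self,dataset,batch_size,collate_fn,shuffle = True,drop_last = False):
        self.dataset = dataset
        self.collate_fn = collate_fn  # stored so that __next__ can collate each batch
        # The sampler yields one index at a time, shuffled or in order.
        sampler_cls = torch.utils.data.RandomSampler if shuffle else \
           torch.utils.data.SequentialSampler
        # The batch sampler groups those indices into lists of batch_size.
        self.batch_sampler = torch.utils.data.BatchSampler(
            sampler_cls(range(len(dataset))),
            batch_size = batch_size,drop_last = drop_last)
        self.sample_iter = iter(self.batch_sampler)

    def __iter__(self):
        # Restart the index iterator so the loader can be looped over again.
        self.sample_iter = iter(self.batch_sampler)
        return self

    def __next__(self):
        indices = next(self.sample_iter)  # raises StopIteration after the last batch
        batch = self.collate_fn([self.dataset[i] for i in indices])
        return batch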
```
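
To make the role of the batch sampler concrete, the short sketch below prints the index lists that `BatchSampler` produces for ten sequential indices. The sizes are arbitrary values chosen for illustration.

```python
from torch.utils.data import BatchSampler, SequentialSampler

sampler = SequentialSampler(range(10))  # yields 0, 1, ..., 9 in order

# Each item produced by a BatchSampler is a list of indices for one batch
keep_last = list(BatchSampler(sampler, batch_size=4, drop_last=False))
drop_last = list(BatchSampler(sampler, batch_size=4, drop_last=True))

print(keep_last)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(drop_last)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```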


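## Creating Datasets with Dataset

Common ways to create a data set:

* Use torch.utils.data.TensorDataset to create a data set from Tensors (a numpy array or Pandas DataFrame must first be converted to a Tensor).
* Use torchvision.datasets.ImageFolder to create an image data set from a directory of images.
* Subclass torch.utils.data.Dataset to create a custom data set.

In addition:

* torch.utils.data.random_split splits a data set into several parts, and is often used to produce training, validation, and test sets.
* The addition operator (+) of Dataset merges multiple data sets into one.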

### Creating a Dataset from Tensors

In [1]:
import numpy as np
import torch
from torch.utils.data import TensorDataset, Dataset, DataLoader, random_split


In [2]:
from sklearn import datasets

iris = datasets.load_iris()
ds_iris = TensorDataset(torch.tensor(iris.data), torch.tensor(iris.target))

n_train = int(len(ds_iris) * 0.8)
n_valid = len(ds_iris) - n_train
ds_train, ds_valid = random_split(ds_iris, [n_train, n_valid])

print(type(ds_iris))
print(type(ds_train))
print(len(ds_iris))
print(len(ds_train))
print(len(ds_valid))

<class 'torch.utils.data.dataset.TensorDataset'>
<class 'torch.utils.data.dataset.Subset'>
150
120
30
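
random_split assigns samples at random, so the split changes from run to run. If a reproducible split is wanted, one option is to pass a seeded `torch.Generator`. The sketch below uses a made-up ten-element data set, and the helper name `split` and the seed value are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(torch.arange(10))  # toy data set with 10 samples


def split(seed):
    # A generator with a fixed seed makes the random permutation repeatable
    gen = torch.Generator().manual_seed(seed)
    train, valid = random_split(ds, [8, 2], generator=gen)
    return list(train.indices), list(valid.indices)


train_a, valid_a = split(0)
train_b, valid_b = split(0)
print(train_a == train_b, valid_a == valid_b)  # True True
```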


In [3]:
dl_train, dl_valid = DataLoader(ds_train, batch_size=8), DataLoader(ds_valid, batch_size=8)
for features, labels in dl_valid:
    print(features)
    print(labels)
    break

tensor([[5.5000, 4.2000, 1.4000, 0.2000],
        [5.1000, 3.7000, 1.5000, 0.4000],
        [6.5000, 3.0000, 5.8000, 2.2000],
        [6.4000, 2.8000, 5.6000, 2.1000],
        [5.7000, 3.0000, 4.2000, 1.2000],
        [5.8000, 2.7000, 4.1000, 1.0000],
        [4.6000, 3.4000, 1.4000, 0.3000],
        [4.4000, 3.2000, 1.3000, 0.2000]], dtype=torch.float64)
tensor([0, 0, 2, 2, 1, 1, 0, 0], dtype=torch.int32)
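
The output above shows float64 features and int32 labels, which are the dtypes inherited from the numpy arrays. Layers such as nn.Linear use float32 parameters by default, and nn.CrossEntropyLoss expects int64 class indices, so it is common to cast before building the data set. A minimal sketch, using stand-in random arrays rather than the iris data:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Stand-in arrays (assumption): float64 features and integer labels
x_np = np.random.rand(6, 4)
y_np = np.array([0, 0, 1, 1, 2, 2])

x = torch.tensor(x_np).float()  # float64 -> float32
y = torch.tensor(y_np).long()   # integer labels -> int64
ds = TensorDataset(x, y)

print(x.dtype, y.dtype)  # torch.float32 torch.int64
```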


In [4]:
ds_data = ds_train + ds_valid

print('len(ds_train) = ', len(ds_train))
print('len(ds_valid) = ', len(ds_valid))
print('len(ds_train+ds_valid) = ', len(ds_data))

print(type(ds_data))

len(ds_train) =  120
len(ds_valid) =  30
len(ds_train+ds_valid) =  150
<class 'torch.utils.data.dataset.ConcatDataset'>
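
In a ConcatDataset, the indices of the second data set continue where the first one ends. The sketch below shows this on two tiny made-up data sets.

```python
import torch
from torch.utils.data import TensorDataset

ds_a = TensorDataset(torch.tensor([0, 1, 2]))
ds_b = TensorDataset(torch.tensor([10, 11]))
ds_ab = ds_a + ds_b  # Dataset.__add__ returns a ConcatDataset

print(len(ds_ab))          # 5
print(ds_ab[2][0].item())  # 2, the last element of ds_a
print(ds_ab[3][0].item())  # 10, the first element of ds_b
```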


### Creating an image Dataset from an image directory

In [5]:
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms, datasets

In [6]:
# Define the image augmentation operations

transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(45),
    transforms.ToTensor()
])

transform_val = transforms.Compose([
    transforms.ToTensor()
])

In [7]:
ds_train = datasets.ImageFolder("./data/cifar2/train/",
                                transform=transform_train,
                                target_transform=lambda t:torch.tensor([t]).float())
ds_valid = datasets.ImageFolder("./data/cifar2/test/",
                                transform=transform_val,
                                target_transform=lambda t:torch.tensor([t]).float())

print(ds_train.class_to_idx)

{'0_airplane': 0, '1_automobile': 1}


In [8]:
dl_train = DataLoader(ds_train, batch_size=50, shuffle=True)
dl_valid = DataLoader(ds_valid, batch_size=50, shuffle=True)

In [9]:
for features, labels in dl_train:
    print(features.shape)
    print(labels.shape)
    break

torch.Size([50, 3, 32, 32])
torch.Size([50, 1])


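### Creating a custom Dataset

The following example builds a custom data set from the IMDB text files under ./data/imdb/ by subclassing Dataset. It builds a vocabulary, converts each text to token ids, writes every sample to its own file, and then reads those files on demand.

In [10]: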
import numpy as np
import pandas as pd
from collections import OrderedDict
import re
import string

MAX_WORDS = 10000 # Only keep the 10000 most frequent words
MAX_LEN = 200 # Keep 200 tokens per sample
BATCH_SIZE = 20
train_data_path = './data/imdb/train.tsv'
test_data_path = './data/imdb/test.tsv'
train_token_path = './data/imdb/train_token.tsv'
test_token_path = './data/imdb/test_token.tsv'
train_samples_path = './data/imdb/train_samples/'
test_samples_path = './data/imdb/test_samples/'

In [11]:
# Build the vocabulary

word_count_dict = {}

# Clean the text

def clean_text(text):
    lowercase = text.lower().replace("\n", " ")
    stripped_html = re.sub('<br />', ' ', lowercase)
    cleaned_punctuation = re.sub('[%s]'%re.escape(string.punctuation), '', stripped_html)
    return cleaned_punctuation

with open(train_data_path, 'r', encoding='utf-8') as f:
    for line in f:
        label, text = line.split('\t')
        cleaned_text = clean_text(text)
        for word in cleaned_text.split(' '):
            word_count_dict[word] = word_count_dict.get(word, 0) + 1

df_word_dict = pd.DataFrame(pd.Series(word_count_dict, name='count'))
df_word_dict = df_word_dict.sort_values(by='count', ascending=False)

df_word_dict = df_word_dict[0 : MAX_WORDS-2]
df_word_dict['word_id'] = range(2, MAX_WORDS)

word_id_dict = df_word_dict['word_id'].to_dict()

df_word_dict.head(10)

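| word | count | word_id |
|---|---|---|
| the | 268230 | 2 |
| and | 129713 | 3 |
| a | 129479 | 4 |
| of | 116497 | 5 |
| to | 108296 | 6 |
| is | 85615 | 7 |
| (empty string) | 84074 | 8 |
| in | 74715 | 9 |
| it | 62587 | 10 |
| i | 60837 | 11 |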


In [12]:
# Convert text to tokens

# Pad the token list

def pad(data_list, pad_length):
    padded_list = data_list.copy()
    if len(data_list) > pad_length:
        padded_list = data_list[-pad_length:]
    if len(data_list) < pad_length:
        padded_list = [1] * (pad_length - len(data_list)) + data_list
    return padded_list

def text_to_token(text_file, token_file):
    with open(text_file, 'r', encoding='utf-8') as fin,\
        open(token_file, 'w', encoding='utf-8') as fout:
        for line in fin:
            label, text = line.split('\t')
            cleaned_text = clean_text(text)
            word_token_list = [word_id_dict.get(word, 0) for word in cleaned_text.split(' ')]
            pad_list = pad(word_token_list, MAX_LEN)
            out_line = label + '\t' + " ".join(str(x) for x in pad_list)
            fout.write(out_line + '\n')

text_to_token(train_data_path, train_token_path)
text_to_token(test_data_path, test_token_path)
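
To make the behaviour of `pad` concrete, the sketch below repeats the function from the cell above and calls it on short, long, and exact-length inputs. Short lists are left-padded with the id 1, and long lists keep only their last `pad_length` items. The input lists are made-up values.

```python
def pad(data_list, pad_length):
    # Same logic as the pad function in the cell above
    padded_list = data_list.copy()
    if len(data_list) > pad_length:
        padded_list = data_list[-pad_length:]
    if len(data_list) < pad_length:
        padded_list = [1] * (pad_length - len(data_list)) + data_list
    return padded_list


print(pad([5, 6, 7], 5))              # [1, 1, 5, 6, 7]
print(pad([2, 3, 4, 5, 6, 7, 8], 5))  # [4, 5, 6, 7, 8]
print(pad([2, 3, 4, 5, 6], 5))        # [2, 3, 4, 5, 6]
```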

In [13]:
import os

if not os.path.exists(train_samples_path):
    os.mkdir(train_samples_path)

if not os.path.exists(test_samples_path):
    os.mkdir(test_samples_path)

def split_samples(token_path, samples_dir):
    with open(token_path, 'r', encoding='utf-8') as fin:
        i = 0
        for line in fin:
            with open(samples_dir+"%d.txt"%i, 'w', encoding='utf-8') as fout:
                fout.write(line)
            i = i + 1

split_samples(train_token_path, train_samples_path)
split_samples(test_token_path, test_samples_path)

In [14]:
print(os.listdir(train_samples_path)[0:10])

['0.txt', '1.txt', '10.txt', '100.txt', '1000.txt', '10000.txt', '10001.txt', '10002.txt', '10003.txt', '10004.txt']
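
The listing above is not in numeric order, and os.listdir makes no promise about ordering. If sample i must correspond to line i of the token file, one option is to sort the names by their numeric stem. A small sketch on a hypothetical list of file names:

```python
import os

# Hypothetical file names in the same "<number>.txt" pattern as above
names = ['0.txt', '1.txt', '10.txt', '100.txt', '2.txt']

# Sort by the integer before the extension instead of by string order
ordered = sorted(names, key=lambda name: int(os.path.splitext(name)[0]))
print(ordered)  # ['0.txt', '1.txt', '2.txt', '10.txt', '100.txt']
```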


In [15]:
import os
import torch
from torch.utils.data import Dataset, DataLoader

class imdbDataset(Dataset):
    def __init__(self, samples_dir):
        self.samples_dir = samples_dir
        self.samples_paths = os.listdir(samples_dir)

    def __len__(self):
        return len(self.samples_paths)

    def __getitem__(self, index):
        path = self.samples_dir + self.samples_paths[index]
        with open(path, 'r', encoding='utf-8') as f:
            line = f.readline()
            label, tokens = line.split('\t')
            label = torch.tensor([float(label)], dtype=torch.float)
            feature = torch.tensor([int(x) for x in tokens.split(' ')], dtype=torch.long)
            return (feature, label)

ds_train = imdbDataset(train_samples_path)
ds_test = imdbDataset(test_samples_path)

print(len(ds_train))
print(len(ds_test))

dl_train = DataLoader(dataset=ds_train, batch_size=BATCH_SIZE, shuffle=True)
dl_test = DataLoader(dataset=ds_test, batch_size=BATCH_SIZE)

for feature, label in dl_train:
    print(feature)
    print(label)
    break

20000
5000
tensor([[   1,    1,    1,  ...,   17,    0,    8],
        [   1,    1,    1,  ..., 7917, 3554,    8],
        [  61,  610,   21,  ...,  162,   10,    8],
        ...,
        [   1,    1,    1,  ...,   34, 2166,    8],
        [   1,    1,    1,  ...,  114,  460,    8],
        [   1,    1,    1,  ..., 5112,  218,    8]])
tensor([[0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [1.]])


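## Loading Datasets with DataLoader

The DataLoader controls the batch size, how the elements of each batch are sampled, and how a batch is collated into the input form the model expects. It can also read data with multiple processes.

The signature of DataLoader is as follows: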

```
DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    sampler=None,
    batch_sampler=None,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
    drop_last=False,
    timeout=0,
    worker_init_fn=None,
    multiprocessing_context=None,
)
```

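In most cases we only configure the five parameters dataset, batch_size, shuffle, num_workers, and drop_last, and leave the others at their default values.

Besides torch.utils.data.Dataset, DataLoader can also load another kind of data set, torch.utils.data.IterableDataset.

Whereas a Dataset behaves like a list, an IterableDataset behaves like an iterator. It is more complex to work with and is used less often.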
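
For comparison, here is a minimal sketch of an IterableDataset. The class name `CountStream` is hypothetical. It defines only `__iter__`, has no length or index access, and is read in a single process here (with num_workers > 0, each worker would iterate its own copy and the data would be duplicated unless the stream is split per worker).

```python
from torch.utils.data import DataLoader, IterableDataset


class CountStream(IterableDataset):
    """Hypothetical stream that yields the integers start, ..., end - 1."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        # Samples are produced one after another; there is no __getitem__
        return iter(range(self.start, self.end))


dl = DataLoader(CountStream(0, 10), batch_size=4)
batches = [batch.tolist() for batch in dl]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```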

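* dataset: the data set to load.
* batch_size: the number of samples per batch.
* shuffle: whether to shuffle the samples.
* sampler: the sampler that draws sample indices; usually left unset.
* batch_sampler: the sampler that draws batches of indices; usually left unset.
* num_workers: the number of worker processes used to read data; 0 loads data in the main process.
* collate_fn: the function that collates the samples of one batch.
* pin_memory: whether to place batches in pinned (page-locked) memory. Defaults to False. Pinned memory is never swapped out to disk (virtual memory), so copying from it to the GPU is faster.
* drop_last: whether to drop the last batch when it has fewer than batch_size samples.
* timeout: the maximum time to wait for a batch; usually left unset.
* worker_init_fn: a function called in each worker process at startup, often used with IterableDataset; rarely needed.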
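
As an illustration of collate_fn, the sketch below pads variable-length sequences to a common length within each batch using `torch.nn.utils.rnn.pad_sequence`. The sample data and the function name `collate_pad` are made up for this example, and a plain Python list stands in for the data set.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical variable-length samples: (token ids, label)
samples = [
    (torch.tensor([5, 6, 7]), 1),
    (torch.tensor([8]), 0),
    (torch.tensor([9, 10]), 1),
    (torch.tensor([11, 12, 13, 14]), 0),
]


def collate_pad(batch):
    # batch is a list of (sequence, label) pairs chosen by the sampler
    seqs, labels = zip(*batch)
    # Right-pad every sequence with 0 up to the longest one in this batch
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)


dl = DataLoader(samples, batch_size=2, collate_fn=collate_pad)
batches = list(dl)
print(batches[0][0])  # first batch of padded sequences, shape [2, 3]
print(batches[1][0])  # second batch of padded sequences, shape [2, 4]
```

Padding per batch rather than to one global length keeps short batches small, at the cost of batch shapes that vary.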

In [16]:
# Build the input data pipeline
ds = TensorDataset(torch.arange(1,50))
dl = DataLoader(ds,
                batch_size = 10,
                shuffle= True,
                num_workers=2,
                drop_last = True)
# Iterate over the data
for batch, in dl:
    print(batch)

tensor([41, 43,  1, 28,  5, 45, 18, 22, 13, 49])
tensor([10,  4, 34, 26, 21, 38, 32, 25,  7, 44])
tensor([19, 23,  3, 27, 17,  2,  6,  9, 36, 24])
tensor([39,  8, 37, 31, 42, 48, 35, 33, 12, 29])
