# <font color='Blue'>Dataloader 1: Loading structured data with the PyTorch DataLoader</font>

## Case: a small dataset that can be loaded directly into memory

<font size=5> We use the ***Boston housing prices dataset*** as the example in this course.</font><br>

<font size=4> 
- 1. Load the data from sklearn.datasets to get a structured (tabular) dataset.<br>
- 2. Split the data into training and test sets.<br>
</font>

In [1]:
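import warnings
warnings.filterwarnings("ignore")  # silence the deprecation warning raised by load_boston

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2 to run as written.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import torch

def load_boston_sklearn():
    # return_X_y=True returns the feature matrix and the target vector as NumPy arrays
    X, Y = load_boston(return_X_y=True)
    # hold out 20% of the rows as the test set; random_state makes the split repeatable
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2022)
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = load_boston_sklearn()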
print('X_train:{}'.format(X_train.shape))
print('X_test:{}'.format(X_test.shape))
print('Y_train:{}'.format(Y_train.shape))
print('Y_test:{}'.format(Y_test.shape))

X_train:(404, 13)
X_test:(102, 13)
Y_train:(404,)
Y_test:(102,)


----------------------------------
<font size=3> 
- 3. Convert the NumPy arrays to torch tensors.<br>
- 4. Wrap the loaded structured data in a torch.utils.data.TensorDataset.
</font>

In [2]:
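# torch.float is float32; the NumPy arrays from sklearn are float64
X_train = torch.tensor(X_train, dtype=torch.float)
Y_train = torch.tensor(Y_train, dtype=torch.float)
print('X_train:{}'.format(X_train.shape))
print('Y_train:{}'.format(Y_train.shape))
# pair each feature row with its target so that indexing returns (x, y)
mydatasets = torch.utils.data.TensorDataset(X_train, Y_train)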

X_train:torch.Size([404, 13])
Y_train:torch.Size([404])


<font size=5> Let's take a look: what do we get from _mydatasets_? </font>

In [3]:
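# iterate over the dataset: each item is a (features, target) tuple
for i, tmp in enumerate(mydatasets):
    print('{}-th:{}'.format(i, tmp))
    if i == 1:
        break
# the same items can be fetched directly by index
print('manually selection')
print('0-th:{}'.format(mydatasets[0]))
print('1-th:{}'.format(mydatasets[1]))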

0-th:(tensor([1.5380e-02, 9.0000e+01, 3.7500e+00, 0.0000e+00, 3.9400e-01, 7.4540e+00,
        3.4200e+01, 6.3361e+00, 3.0000e+00, 2.4400e+02, 1.5900e+01, 3.8634e+02,
        3.1100e+00]), tensor(44.))
1-th:(tensor([4.5900e-02, 5.2500e+01, 5.3200e+00, 0.0000e+00, 4.0500e-01, 6.3150e+00,
        4.5600e+01, 7.3172e+00, 6.0000e+00, 2.9300e+02, 1.6600e+01, 3.9690e+02,
        7.6000e+00]), tensor(22.3000))
manually selection
0-th:(tensor([1.5380e-02, 9.0000e+01, 3.7500e+00, 0.0000e+00, 3.9400e-01, 7.4540e+00,
        3.4200e+01, 6.3361e+00, 3.0000e+00, 2.4400e+02, 1.5900e+01, 3.8634e+02,
        3.1100e+00]), tensor(44.))
1-th:(tensor([4.5900e-02, 5.2500e+01, 5.3200e+00, 0.0000e+00, 4.0500e-01, 6.3150e+00,
        4.5600e+01, 7.3172e+00, 6.0000e+00, 2.9300e+02, 1.6600e+01, 3.9690e+02,
        7.6000e+00]), tensor(22.3000))
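
The output above shows that each item is a `(features, target)` tuple. As a rough sketch of why, the class below is a minimal stand-in that behaves like `TensorDataset` for exactly two tensors. `PairDataset` and the toy tensors are hypothetical names introduced only for illustration; this is not the library's actual implementation.

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    """Hypothetical stand-in for TensorDataset with one feature and one target tensor."""

    def __init__(self, features, targets):
        # both tensors must have the same number of samples (first dimension)
        assert features.size(0) == targets.size(0)
        self.features = features
        self.targets = targets

    def __len__(self):
        # number of samples
        return self.features.size(0)

    def __getitem__(self, index):
        # indexing returns the (features, target) pair of one sample
        return self.features[index], self.targets[index]

# toy data (an assumption for this sketch): 5 samples with 2 features each
toy_X = torch.arange(10.0).reshape(5, 2)
toy_Y = torch.arange(5.0)
toy_dataset = PairDataset(toy_X, toy_Y)
first_x, first_y = toy_dataset[0]
print(len(toy_dataset), first_x, first_y)
```

Any object that defines `__len__` and `__getitem__` in this way can be passed to `torch.utils.data.DataLoader` as a map-style dataset.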


------------------------

<font size=5 color='blue'> To keep the samples from appearing in a fixed order during training, we need to shuffle the data in every training epoch.</font><br><br>
<font size=4>
Q: Writing the shuffle code yourself is easy, but why not use ready-made code?<br><br>
A: Use "torch.utils.data.DataLoader".<br>
</font>
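
For comparison, here is a minimal sketch of what "writing the shuffle code yourself" might look like. The helper `manual_shuffled_batches` and the toy tensors are hypothetical, introduced only for this illustration; the idea is to draw a fresh random permutation of the sample indices and slice it into mini-batches.

```python
import torch

def manual_shuffled_batches(features, targets, batch_size):
    """Yield (features, targets) mini-batches in a fresh random order (hypothetical helper)."""
    n_samples = features.size(0)
    order = torch.randperm(n_samples)  # random permutation of 0..n_samples-1
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]  # the last batch may be smaller
        yield features[idx], targets[idx]

# toy data (an assumption for this sketch): row i is [2*i, 2*i + 1], target is i
toy_X = torch.arange(10.0).reshape(5, 2)
toy_Y = torch.arange(5.0)

# call the generator once per epoch to get a new order each time
manual_batches = list(manual_shuffled_batches(toy_X, toy_Y, batch_size=2))
for batch_x, batch_y in manual_batches:
    print(batch_x, batch_y)
```

This works, but `DataLoader` also covers batching, optional worker processes, and custom samplers, which is why the ready-made class is usually the more convenient choice.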

In [4]:
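# shuffle=False: batches follow the dataset order, so the first batch holds samples 0, 1, 2
dataloader_train = torch.utils.data.DataLoader(mydatasets, batch_size=3, shuffle=False, num_workers=0)
for i, (data, target) in enumerate(dataloader_train):
    # data has shape (batch_size, 13), target has shape (batch_size,)
    print('{}-th data:{}'.format(i, data))
    print('{}-th target:{}'.format(i, target))
    if i == 1:
        break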

0-th data:tensor([[1.5380e-02, 9.0000e+01, 3.7500e+00, 0.0000e+00, 3.9400e-01, 7.4540e+00,
         3.4200e+01, 6.3361e+00, 3.0000e+00, 2.4400e+02, 1.5900e+01, 3.8634e+02,
         3.1100e+00],
        [4.5900e-02, 5.2500e+01, 5.3200e+00, 0.0000e+00, 4.0500e-01, 6.3150e+00,
         4.5600e+01, 7.3172e+00, 6.0000e+00, 2.9300e+02, 1.6600e+01, 3.9690e+02,
         7.6000e+00],
        [4.0202e-01, 0.0000e+00, 9.9000e+00, 0.0000e+00, 5.4400e-01, 6.3820e+00,
         6.7200e+01, 3.5325e+00, 4.0000e+00, 3.0400e+02, 1.8400e+01, 3.9521e+02,
         1.0360e+01]])
0-th target:tensor([44.0000, 22.3000, 23.1000])
1-th data:tensor([[3.2264e-01, 0.0000e+00, 2.1890e+01, 0.0000e+00, 6.2400e-01, 5.9420e+00,
         9.3500e+01, 1.9669e+00, 4.0000e+00, 4.3700e+02, 2.1200e+01, 3.7825e+02,
         1.6900e+01],
        [3.3147e-01, 0.0000e+00, 6.2000e+00, 0.0000e+00, 5.0700e-01, 8.2470e+00,
         7.0400e+01, 3.6519e+00, 8.0000e+00, 3.0700e+02, 1.7400e+01, 3.7895e+02,
         3.9500e+00],
        [1.

In [5]:
# without shuffling, every pass over the loader starts with the same first batch
dataloader_train = torch.utils.data.DataLoader(mydatasets, batch_size=3, shuffle=False, num_workers=0)
for i_repeat in range(5):
    for i, (data, target) in enumerate(dataloader_train):
        print('{}-iter :\n{}-th target:\n{}'.format(i_repeat, i, target))
        if i == 0:
            break

0-iter :
0-th target:
tensor([44.0000, 22.3000, 23.1000])
1-iter :
0-th target:
tensor([44.0000, 22.3000, 23.1000])
2-iter :
0-th target:
tensor([44.0000, 22.3000, 23.1000])
3-iter :
0-th target:
tensor([44.0000, 22.3000, 23.1000])
4-iter :
0-th target:
tensor([44.0000, 22.3000, 23.1000])


In [6]:
# shuffle=True: the sample order is redrawn each time the loader is iterated
dataloader_train = torch.utils.data.DataLoader(mydatasets, batch_size=3, shuffle=True, num_workers=0)
for i_repeat in range(5):
    for i, (data, target) in enumerate(dataloader_train):
        print('{}-iter :\n{}-th target:\n{}'.format(i_repeat, i, target))
        if i == 0:
            break

0-iter :
0-th target:
tensor([29.4000, 17.8000, 50.0000])
1-iter :
0-th target:
tensor([17.4000, 31.7000, 15.4000])
2-iter :
0-th target:
tensor([20.3000, 22.6000, 17.1000])
3-iter :
0-th target:
tensor([22.2000, 20.1000,  7.2000])
4-iter :
0-th target:
tensor([15.2000, 43.8000, 14.3000])
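
Because the order is random, re-running the cell above will generally print different targets. If a repeatable order is wanted, one option is to pass a seeded `torch.Generator` through the `generator` argument of `DataLoader`. The sketch below uses a toy dataset and a hypothetical helper `first_epoch_targets`, both assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset (an assumption for this sketch): 5 samples, targets 0..4
toy_dataset = TensorDataset(torch.arange(10.0).reshape(5, 2), torch.arange(5.0))

def first_epoch_targets(seed):
    """Return the targets of one shuffled epoch for a given seed (hypothetical helper)."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(toy_dataset, batch_size=2, shuffle=True, generator=generator)
    return torch.cat([target for _, target in loader])

run_a = first_epoch_targets(seed=2022)
run_b = first_epoch_targets(seed=2022)
print(run_a)
print(run_b)  # same order as run_a, since both loaders start from the same seed
```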


------------------------
<font size=5> 
The cells above illustrate how to use "torch.utils.data.DataLoader" to get shuffled data.<br>

Let's take a closer look at the arguments of "torch.utils.data.DataLoader", shown below.
</font><br>

![image.png](attachment:image.png)

<font size=5>
Did you notice the **sampler** argument?<br><br>
We will discuss it after "DataLoader for a custom Dataset".<br>
</font><br>
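
As a small preview before that discussion, the sketch below inspects the `sampler` attribute of two loaders built on a toy dataset (the dataset and variable names are assumptions for illustration). It suggests that the `shuffle` flag mainly selects which built-in sampler produces the sample indices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset (an assumption for this sketch): 5 samples, targets 0..4
toy_dataset = TensorDataset(torch.arange(10.0).reshape(5, 2), torch.arange(5.0))

ordered_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)
shuffled_loader = DataLoader(toy_dataset, batch_size=2, shuffle=True)

# the sampler decides the order in which sample indices are drawn
ordered_sampler_name = type(ordered_loader.sampler).__name__
shuffled_sampler_name = type(shuffled_loader.sampler).__name__
print(ordered_sampler_name)   # SequentialSampler
print(shuffled_sampler_name)  # RandomSampler

# a sampler is simply an iterable over dataset indices
ordered_indices = list(ordered_loader.sampler)
print(ordered_indices)        # [0, 1, 2, 3, 4]
```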
