# Datasets with PyTorch


In [2]:
!nvidia-smi

Fri Jan 12 05:39:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Loading data from files
We've seen how to load NumPy arrays into PyTorch, and anyone familiar with <tt>pandas.read_csv()</tt> can use it to prepare data before forming tensors. Here we'll load the <a href='https://en.wikipedia.org/wiki/Iris_flower_data_set'>iris flower dataset</a> saved as a .csv file.

In [3]:
df = pd.read_csv('iris.csv')
df.sample(50)

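|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 74  | 6.4               | 2.9              | 4.3               | 1.3              | 1.0    |
| 31  | 5.4               | 3.4              | 1.5               | 0.4              | 0.0    |
| 121 | 5.6               | 2.8              | 4.9               | 2.0              | 2.0    |
| 135 | 7.7               | 3.0              | 6.1               | 2.3              | 2.0    |
| 144 | 6.7               | 3.3              | 5.7               | 2.5              | 2.0    |
| 51  | 6.4               | 3.2              | 4.5               | 1.5              | 1.0    |
| 137 | 6.4               | 3.1              | 5.5               | 1.8              | 2.0    |
| 41  | 4.5               | 2.3              | 1.3               | 0.3              | 0.0    |
| 25  | 5.0               | 3.0              | 1.6               | 0.2              | 0.0    |
| 73  | 6.1               | 2.8              | 4.7               | 1.2              | 1.0    |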


In [3]:
df.shape

(150, 5)

In [4]:
df['target'].value_counts()

0.0    50
1.0    50
2.0    50
Name: target, dtype: int64

The iris dataset consists of 50 samples each from three species of Iris (<em>Iris setosa</em>, <em>Iris versicolor</em> and <em>Iris virginica</em>), for 150 total samples. We have four features (sepal length and width, petal length and width) and three unique labels:
0. <em>Iris setosa</em>
1. <em>Iris versicolor</em>
2. <em>Iris virginica</em>
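
To make the integer labels easier to read, we can map them to species names. The sketch below is illustrative only: it assumes <tt>iris.csv</tt> was exported from scikit-learn's <tt>load_iris</tt>, whose label order is setosa, versicolor, virginica. The <tt>species_names</tt> dictionary and the small <tt>toy</tt> frame are hypothetical stand-ins, with rows copied from the sample shown earlier.

```python
import pandas as pd

# Assumed label order (scikit-learn's load_iris): 0 setosa, 1 versicolor, 2 virginica.
species_names = {0: "Iris setosa", 1: "Iris versicolor", 2: "Iris virginica"}

# Tiny stand-in for the real DataFrame; values copied from rows 31, 74 and 121 above.
toy = pd.DataFrame({
    "petal length (cm)": [1.5, 4.3, 4.9],
    "target": [0.0, 1.0, 2.0],
})

# The labels are stored as floats, so cast to int before looking up the names.
toy["species"] = toy["target"].astype(int).map(species_names)
print(toy)
```

The same <tt>astype(int).map(...)</tt> call should work on the full <tt>df</tt>, provided the label order assumption holds for your copy of the file.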

### The classic method for building train/test split tensors
Before introducing PyTorch's Dataset and DataLoader classes, we'll take a quick look at the manual alternative: split the arrays with scikit-learn, then convert each piece to a tensor by hand.

In [12]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(df.drop('target',axis=1).values,
                                                    df['target'].values, test_size=0.2,
                                                    random_state=33)

X_train = torch.FloatTensor(train_X)
X_test = torch.FloatTensor(test_X)
y_train = torch.LongTensor(train_y).reshape(-1, 1)
y_test = torch.LongTensor(test_y).reshape(-1, 1)

print("X_train shape: ", X_train.shape, "\nX_test shape: ", X_test.shape, "\ny_train shape: ", y_train.shape, "\ny_test shape: ", y_test.shape)

X_train shape:  torch.Size([120, 4]) 
X_test shape:  torch.Size([30, 4]) 
y_train shape:  torch.Size([120, 1]) 
y_test shape:  torch.Size([30, 1])
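
With the classic approach, shuffling and batching are also left to us. The following is a minimal sketch of what that might look like, assuming tensors shaped like <tt>X_train</tt> and <tt>y_train</tt> above. The random stand-in values and the <tt>batch_size</tt> of 20 are illustrative choices, not part of the original notebook.

```python
import torch

# Stand-in tensors with the same shapes as X_train / y_train (random values, not iris data).
X_train = torch.rand(120, 4)
y_train = torch.randint(0, 3, (120, 1))

batch_size = 20  # illustrative choice
perm = torch.randperm(X_train.shape[0])  # shuffled row indices for one epoch

batch_shapes = []
for start in range(0, X_train.shape[0], batch_size):
    idx = perm[start:start + batch_size]           # indices for this mini-batch
    X_batch, y_batch = X_train[idx], y_train[idx]  # features and labels stay aligned
    batch_shapes.append((tuple(X_batch.shape), tuple(y_batch.shape)))

print(len(batch_shapes), batch_shapes[0])
```

This bookkeeping has to be repeated every epoch, which is the kind of work a DataLoader takes over.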


## Using PyTorch's Dataset and DataLoader classes
A more convenient alternative is to leverage PyTorch's <a href='https://pytorch.org/docs/stable/data.html'><strong><tt>Dataset</tt></strong></a> and <a href='https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader'><strong><tt>DataLoader</tt></strong></a> classes, which handle indexing, batching and shuffling for us.

Usually, to set up a Dataset specific to our investigation we would define our own custom class that inherits from <tt>torch.utils.data.Dataset</tt> (we'll do this in the CNN section). For now, we can use the built-in <a href='https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset'><strong><tt>TensorDataset</tt></strong></a> class.
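
For orientation, a custom map-style Dataset only needs <tt>__len__</tt> and <tt>__getitem__</tt>. The class below is a minimal sketch rather than the version used later in the course. The name <tt>ArrayDataset</tt> and its constructor arguments are hypothetical.

```python
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical example: wraps array-like features and integer labels."""

    def __init__(self, features, labels):
        # Convert once up front so __getitem__ stays cheap.
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (features, label) pair.
        return self.features[idx], self.labels[idx]


# Two rows copied from the iris sample shown earlier.
toy_ds = ArrayDataset([[5.4, 3.4, 1.5, 0.4], [6.4, 2.9, 4.3, 1.3]], [0, 1])
x0, y0 = toy_ds[0]
print(len(toy_ds), x0, y0)
```

<tt>TensorDataset</tt> behaves much like this sketch, which is why it is enough for tabular data that already fits in memory.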

In [13]:
from torch.utils.data import TensorDataset, DataLoader

data = df.drop('target',axis=1).values
labels = df['target'].values



In [15]:
labels

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [14]:
data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [7]:
data.shape

(150, 4)

In [8]:
labels.shape

(150,)

In [16]:
iris = TensorDataset(torch.FloatTensor(data),torch.LongTensor(labels))

In [17]:
len(iris)

150

In [18]:
type(iris)

torch.utils.data.dataset.TensorDataset

In [19]:
for i, (x, y) in enumerate(iris, start=1):
    print('x= ', x, 'y=',y,'i',i)

x=  tensor([5.1000, 3.5000, 1.4000, 0.2000]) y= tensor(0) i 1
x=  tensor([4.9000, 3.0000, 1.4000, 0.2000]) y= tensor(0) i 2
x=  tensor([4.7000, 3.2000, 1.3000, 0.2000]) y= tensor(0) i 3
x=  tensor([4.6000, 3.1000, 1.5000, 0.2000]) y= tensor(0) i 4
x=  tensor([5.0000, 3.6000, 1.4000, 0.2000]) y= tensor(0) i 5
x=  tensor([5.4000, 3.9000, 1.7000, 0.4000]) y= tensor(0) i 6
x=  tensor([4.6000, 3.4000, 1.4000, 0.3000]) y= tensor(0) i 7
x=  tensor([5.0000, 3.4000, 1.5000, 0.2000]) y= tensor(0) i 8
x=  tensor([4.4000, 2.9000, 1.4000, 0.2000]) y= tensor(0) i 9
x=  tensor([4.9000, 3.1000, 1.5000, 0.1000]) y= tensor(0) i 10
x=  tensor([5.4000, 3.7000, 1.5000, 0.2000]) y= tensor(0) i 11
x=  tensor([4.8000, 3.4000, 1.6000, 0.2000]) y= tensor(0) i 12
x=  tensor([4.8000, 3.0000, 1.4000, 0.1000]) y= tensor(0) i 13
x=  tensor([4.3000, 3.0000, 1.1000, 0.1000]) y= tensor(0) i 14
x=  tensor([5.8000, 4.0000, 1.2000, 0.2000]) y= tensor(0) i 15
x=  tensor([5.7000, 4.4000, 1.5000, 0.4000]) y= tensor(0) i 16
...
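
Besides iteration, a <tt>TensorDataset</tt> supports indexing and slicing, and keeps the wrapped tensors in its <tt>tensors</tt> attribute. Here is a small sketch using three rows copied from the output above. The variable names <tt>features</tt>, <tt>targets</tt> and <tt>toy</tt> are illustrative.

```python
import torch
from torch.utils.data import TensorDataset

# First three iris rows, copied from the printed output.
features = torch.tensor([[5.1, 3.5, 1.4, 0.2],
                         [4.9, 3.0, 1.4, 0.2],
                         [4.7, 3.2, 1.3, 0.2]])
targets = torch.tensor([0, 0, 0])
toy = TensorDataset(features, targets)

x, y = toy[1]        # one sample: a (features, label) tuple
xs, ys = toy[0:2]    # a slice: each element of the tuple is sliced the same way
print(x, y)
print(xs.shape, ys.shape)
print(toy.tensors[0].shape)  # the full feature tensor
```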

Once we have a dataset we can wrap it with a DataLoader, which groups samples into batches, optionally shuffles them each epoch, and provides single- or multi-process iterators over the dataset.

In [20]:
iris_loader = DataLoader(iris, batch_size=20, shuffle=True)

In [21]:
for X, Y in iris_loader:
    print(X.shape, Y.shape)

torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([20, 4]) torch.Size([20])
torch.Size([10, 4]) torch.Size([10])
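
A train/test split can also be done without scikit-learn. The sketch below assumes a 120/30 split and uses <tt>torch.utils.data.random_split</tt> on a stand-in dataset of random values with the same shape as <tt>iris</tt>. The seed of 33 and the loader names are illustrative choices. Note that <tt>random_split</tt> does not stratify by class.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in for the iris TensorDataset: 150 samples, 4 features, 3 classes (random values).
full = TensorDataset(torch.rand(150, 4), torch.randint(0, 3, (150,)))

# A seeded generator makes the split reproducible.
gen = torch.Generator().manual_seed(33)
train_ds, test_ds = random_split(full, [120, 30], generator=gen)

# Shuffle only the training data; evaluation order does not matter.
train_loader = DataLoader(train_ds, batch_size=20, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=20, shuffle=False)

print(len(train_ds), len(test_ds))
print(len(train_loader), len(test_loader))  # number of batches per epoch
```

As in the output above, the last test batch is smaller than <tt>batch_size</tt> (10 samples here) because 30 is not a multiple of 20.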
