Dataset Class:

When building a PyTorch model, we need something that hands us individual samples from our dataset. That is what a dataset class is for.

In [70]:
import torch
from sklearn.datasets import make_classification

One way of doing that:

In [71]:
# class CustomDataset(torch.utils.data.Dataset):

But we won't be doing that here. We will write our own dataset class as a plain Python class, so nothing extra has to be imported from torch.utils.data.

It needs:
- a constructor (__init__)
- a size calculator (__len__)
- a function that takes an index and returns the item at that index (__getitem__)
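
To make the three requirements concrete, here is a minimal plain-Python sketch with no PyTorch involved. TinyDataset and its values are hypothetical names used only for illustration; the point is that Python routes len(obj) to __len__ and obj[idx] to __getitem__.

```python
class TinyDataset:
    """Hypothetical toy class: just enough to support len() and indexing."""

    def __init__(self, values):
        # constructor: store whatever the dataset wraps
        self.values = values

    def __len__(self):
        # called by the built-in len()
        return len(self.values)

    def __getitem__(self, idx):
        # called by the square-bracket syntax: obj[idx]
        return self.values[idx]


tiny = TinyDataset(["a", "b", "c"])
n_items = len(tiny)   # same as tiny.__len__()
first_item = tiny[0]  # same as tiny.__getitem__(0)
print(n_items, first_item)
```

This is also why len(obj) is normally preferred over calling obj.__len__() directly.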

In [None]:
class CustomDataset:
    def __init__(self, data, targets):
        # data: 2D NumPy array of shape (n_samples, n_features)
        # targets: 1D array or list with one label per sample
        self.data = data
        self.targets = targets

    def __len__(self):
        # number of samples = number of rows in data
        return self.data.shape[0]

    def __getitem__(self, idx):
        # assumes a NumPy array holding a tabular dataset
        current_sample = self.data[idx, :]
        # 1D array or list of labels; with multiple targets per sample use self.targets[idx, :]
        current_target = self.targets[idx]
        return {
            "sample": torch.tensor(current_sample, dtype=torch.float),
            "target": torch.tensor(current_target, dtype=torch.long),
        }
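
A class like this is typically consumed through torch.utils.data.DataLoader, which for map-style datasets relies on __len__ and __getitem__. The sketch below repeats the class so that it runs on its own and feeds it toy data; toy_data, toy_targets and the batch size of 4 are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader


class CustomDataset:
    """Plain Python dataset: constructor, length, and item access."""

    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return {
            "sample": torch.tensor(self.data[idx, :], dtype=torch.float),
            "target": torch.tensor(self.targets[idx], dtype=torch.long),
        }


# Toy data (hypothetical): 10 samples, 4 features, alternating binary labels.
toy_data = np.arange(40, dtype=np.float64).reshape(10, 4)
toy_targets = np.array([0, 1] * 5)

toy_dataset = CustomDataset(data=toy_data, targets=toy_targets)
loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_shapes = []
for batch in loader:
    # The default collate function stacks each dict entry along a new batch dimension.
    batch_shapes.append(
        (tuple(batch["sample"].shape), tuple(batch["target"].shape))
    )
    print(batch["sample"].shape, batch["target"].shape)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples.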


In [73]:
?make_classification

Signature:
make_classification(
    n_samples=100,
    n_features=20,
    *,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
    return_X_y=True,
)
Docstring:
Generate a random n-class classification problem.

This initially creates clusters of points normally distributed (std=1)
about vertices of an ``n_informative``-dimensional hypercube with sides of
length ``2*class_sep`` and assigns an equal number of clusters to each
class. It introduces interdependence between these features and adds
various types of further noise to the data.

Without shuffling, ``X``

In [74]:
data, targets = make_classification(n_samples=1000)

In [75]:
data.shape

(1000, 20)

In [76]:
targets.shape

(1000,)

In [77]:
data

array([[ 0.16979023,  0.10691182,  0.30414502, ...,  1.0626695 ,
        -0.29436545, -1.10699998],
       [ 0.32425587, -0.32573108,  0.17012535, ...,  0.08461653,
        -1.45996237, -0.13193471],
       [-0.68078443,  0.05358512,  1.10849775, ...,  0.43914515,
        -0.43305859,  1.24266871],
       ...,
       [-0.62086795, -0.61069116, -0.07145637, ..., -0.62541193,
         1.88244595, -2.53305355],
       [ 1.15202562, -0.72934549,  0.02442004, ...,  0.38376808,
        -0.96925367,  0.35392736],
       [-1.70521712,  0.88608599, -1.478207  , ..., -1.56631137,
        -0.01758584, -0.20728131]], shape=(1000, 20))

In [78]:
custom_dataset = CustomDataset(data=data, targets=targets)

In [79]:
custom_dataset.data

array([[ 0.16979023,  0.10691182,  0.30414502, ...,  1.0626695 ,
        -0.29436545, -1.10699998],
       [ 0.32425587, -0.32573108,  0.17012535, ...,  0.08461653,
        -1.45996237, -0.13193471],
       [-0.68078443,  0.05358512,  1.10849775, ...,  0.43914515,
        -0.43305859,  1.24266871],
       ...,
       [-0.62086795, -0.61069116, -0.07145637, ..., -0.62541193,
         1.88244595, -2.53305355],
       [ 1.15202562, -0.72934549,  0.02442004, ...,  0.38376808,
        -0.96925367,  0.35392736],
       [-1.70521712,  0.88608599, -1.478207  , ..., -1.56631137,
        -0.01758584, -0.20728131]], shape=(1000, 20))

In [80]:
custom_dataset.targets.shape

(1000,)

In [82]:
custom_dataset.__len__

<bound method CustomDataset.__len__ of <__main__.CustomDataset object at 0x13755f590>>

In [84]:
len(custom_dataset)

1000

In [86]:
custom_dataset[0]

{'sample': tensor([ 0.1698,  0.1069,  0.3041, -0.2440, -1.4553, -0.7016, -0.8871,  1.2583,
         -1.8462, -2.4219,  1.0927,  0.1537, -0.6797, -0.7101, -1.3596,  2.2179,
         -0.5649,  1.0627, -0.2944, -1.1070]),
 'target': tensor(1)}

In [88]:
custom_dataset[0]["sample"].shape

torch.Size([20])

In [90]:
custom_dataset[0]["sample"]


tensor([ 0.1698,  0.1069,  0.3041, -0.2440, -1.4553, -0.7016, -0.8871,  1.2583,
        -1.8462, -2.4219,  1.0927,  0.1537, -0.6797, -0.7101, -1.3596,  2.2179,
        -0.5649,  1.0627, -0.2944, -1.1070])

In [92]:
custom_dataset[0]["target"].shape

torch.Size([])

In [93]:
custom_dataset[0]["target"]

tensor(1)
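
The empty torch.Size([]) above means the target is a zero-dimensional (scalar) tensor. A small sketch, using a hand-made tensor rather than the dataset itself, of how to inspect such a tensor and pull out the plain Python value:

```python
import torch

# Stand-in for custom_dataset[0]["target"]; the value 1 is just an example label.
target = torch.tensor(1, dtype=torch.long)

print(target.shape)  # empty shape: no dimensions
print(target.ndim)   # number of dimensions of a scalar tensor

# .item() converts a one-element tensor into a Python number
label = target.item()
print(type(label), label)
```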