# Datasets

PyTorch represents a collection of data as a subclass of `torch.utils.data.Dataset`. A map-style dataset is expected to implement two methods: `__len__`, which returns the number of samples in the dataset, and `__getitem__`, which returns the sample at a given index.
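As a minimal sketch of this interface, the toy class below implements both methods. The name `SquaresDataset` and its contents are assumptions made for illustration; they are not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the pair (i, i**2) as float tensors."""

    def __init__(self, size):
        # Number of samples the dataset exposes
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # Reject indices outside the valid range
        if not 0 <= idx < self.size:
            raise IndexError(f"index {idx} out of range for size {self.size}")
        x = torch.tensor(float(idx))
        return x, x ** 2


dataset = SquaresDataset(size=5)
print(len(dataset))  # 5
print(dataset[3])    # (tensor(3.), tensor(9.))
```

Raising `IndexError` for out-of-range indices also lets a plain `for sample in dataset` loop terminate, since Python falls back to calling `__getitem__` with increasing indices until that error is raised.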

## MNIST Dataset

The MNIST dataset is a collection of 70,000 images of handwritten digits. Each image is 28x28 pixels in grayscale, with every pixel stored as an integer between 0 and 255. The dataset is split into a training set of 60,000 images and a test set of 10,000 images, and it is a popular starting point for deep learning and computer vision.

The `torchvision` package provides a convenient way to download and use the MNIST dataset. The `torchvision.datasets` module contains a number of popular datasets, including MNIST.

In [1]:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    '''
    Template class for creating a dataset in PyTorch.
    '''
    
    def __init__(self, **kwargs):
        # Initialize the dataset
        pass

    def __len__(self):
        # Return the size of the dataset
        pass

    def __getitem__(self, idx):
        # Return the sample at the given index
        pass


In [3]:
from torchvision.datasets import MNIST

# Download the dataset
train_dataset = MNIST(root="data/MNIST/", train=True, download=True)

# Note: train_dataset is already a Dataset object,
# so there is no need to implement a subclass ourselves

# Get the size of the dataset
print(len(train_dataset)) # 60000

# Get the sample at index 0
sample = train_dataset[0]
print(sample) 
# (<PIL.Image.Image image mode=L size=28x28 at 0x7F9C70B726E0>, 5)


60000
(<PIL.Image.Image image mode=L size=28x28 at 0x7F9C70B726E0>, 5)
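The sample printed above is a `(PIL image, label)` tuple, whereas a model expects tensors. The sketch below reproduces the scaling and reshaping that `torchvision.transforms.ToTensor` applies to a grayscale image. To run without downloading anything, it uses a random `uint8` tensor as a hypothetical stand-in for an MNIST digit.

```python
import torch

# Hypothetical stand-in for one MNIST image: 28x28 pixels with values in [0, 255]
raw_image = torch.randint(low=0, high=256, size=(28, 28), dtype=torch.uint8)

# Scale to [0.0, 1.0] and add a leading channel dimension: (28, 28) -> (1, 28, 28)
image = raw_image.to(torch.float32).div(255.0).unsqueeze(0)

print(image.shape)  # torch.Size([1, 28, 28])
print(image.dtype)  # torch.float32
```

With the real dataset, passing `transform=ToTensor()` to `MNIST(...)` performs this conversion on every sample.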





## ITALIC: ITALian Intent Classification

The ITALIC dataset is a collection of around 16,000 audio recordings of Italian sentences. Each sentence has an associated intent, a label that describes the purpose of the sentence. The dataset is available on [Zenodo](https://zenodo.org/records/8040649) and through the Hugging Face `datasets` library. It comes in different configurations:
- `massive`: recordings are split into `train`, `validation`, and `test` sets according to the [textual version](https://arxiv.org/abs/2106.03714) of the dataset.
- `hard_noisy`: recordings are split into `train`, `validation`, and `test` sets according to the auto-annotated level of noise in the recordings (e.g., test set contains only recordings with high noise).
- `hard_speaker`: recordings are split into `train`, `validation`, and `test` sets according to the speaker of the recording (e.g., test set contains only recordings from speakers not in the training set).
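To make the idea behind a speaker-based split concrete, the sketch below builds and checks a speaker-disjoint split on a handful of made-up recordings. The recording and speaker identifiers are hypothetical, and this is not necessarily how the ITALIC splits were produced.

```python
# Hypothetical (recording_id, speaker_id) pairs; not taken from ITALIC
recordings = [
    (0, "spk_a"), (1, "spk_a"), (2, "spk_b"),
    (3, "spk_c"), (4, "spk_c"), (5, "spk_d"),
]

# Assumption: the speakers held out for testing are chosen by hand here
test_speakers = {"spk_c", "spk_d"}

train = [rec for rec in recordings if rec[1] not in test_speakers]
test = [rec for rec in recordings if rec[1] in test_speakers]

train_speakers = {speaker for _, speaker in train}
held_out_speakers = {speaker for _, speaker in test}

# A speaker-disjoint split shares no speaker between the two sets
assert train_speakers.isdisjoint(held_out_speakers)
print(len(train), len(test))  # 3 3
```

A split of this kind measures how well a model generalizes to voices it has never heard, which is typically harder than a random split.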

The `datasets` library from Hugging Face provides a convenient way to download and use datasets hosted on the Hugging Face Hub, including ITALIC.

In [17]:
from datasets import load_dataset

# Please make sure an access token is set, either with `huggingface-cli login`
# or by following https://huggingface.co/docs/hub/security-tokens
# Recent releases of `datasets` take token=True;
# older releases used use_auth_token=True instead

# The "hard_speaker" and "hard_noisy" configs can be used in place of "massive"
italic = load_dataset("RiTA-nlp/ITALIC", "massive", token=True)
italic_train = italic["train"]
italic_valid = italic["validation"]
italic_test  = italic["test"]

# Get the size of the dataset
print(len(italic_train)) # 11514

# Get the sample at index 0
sample = italic_train[0]
print(sample)

11514
{'id': 1, 'age': 27, 'gender': 'male', 'region': 'abruzzo', 'nationality': 'italiana', 'lisp': 'nessuno', 'education': 'master', 'speaker_id': 72, 'environment': 'silent', 'device': 'phone', 'scenario': 'alarm', 'field': 'close', 'intent': 'alarm_set', 'utt': 'svegliami alle nove di mattina venerdì', 'audio': {'path': '/home/mlaquatra/.cache/huggingface/datasets/downloads/extracted/7f25fd6b6a74a983b3f0c3ea3ec3768f916c1fd6a84cd344bc1cedbd9249e698/zenodo_dataset/recordings/1.wav', 'array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 'sampling_rate': 16000}}
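Each sample is a dictionary, so turning it into model inputs mostly means reading the right keys. The sketch below derives an integer label from the `intent` field and a duration from the `audio` field. It assumes samples shaped like the one printed above; the three samples and their values are made up so that the code runs without access to the dataset.

```python
import numpy as np

# Hypothetical samples mimicking the structure printed above (values are made up)
samples = [
    {"intent": "alarm_set",
     "audio": {"array": np.zeros(32000, dtype=np.float32), "sampling_rate": 16000}},
    {"intent": "weather_query",
     "audio": {"array": np.zeros(24000, dtype=np.float32), "sampling_rate": 16000}},
    {"intent": "alarm_set",
     "audio": {"array": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}},
]

# Map each distinct intent string to an integer class id (sorted for reproducibility)
intent_names = sorted({sample["intent"] for sample in samples})
intent_to_id = {name: idx for idx, name in enumerate(intent_names)}

for sample in samples:
    audio = sample["audio"]
    # Duration in seconds = number of audio samples / samples per second
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    label = intent_to_id[sample["intent"]]
    print(f"{sample['intent']} -> {label}, {duration_s:.2f} s")
```

On the real dataset, the same mapping would be built once from the training split and reused for validation and test, so that label ids stay consistent.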


## Complete MNIST Example

The following cell contains a complete example that trains a small custom CNN on the MNIST training set and evaluates it on the test set.

In [20]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from tqdm import tqdm # Progress bar

# Create the dataset
train_dataset = MNIST(root="data/MNIST/", train=True, download=True, transform=ToTensor())
test_dataset = MNIST(root="data/MNIST/", train=False, download=True, transform=ToTensor())

# A validation split could be carved out of train_dataset,
# e.g. with torch.utils.data.random_split

# Create the dataloader
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Define the model
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # The two 2x2 poolings reduce 28x28 to 14x14 and then 7x7, over 32 channels
        self.linear1 = nn.Linear(in_features=32*7*7, out_features=10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = self.linear1(x)
        return x

# Create the model
model = MyModel()

# Define the loss function
loss_fn = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Define the scheduler
# (with step_size=30, the learning rate is not decayed during the 10 epochs below)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop
model = model.train() # Set the model in training mode
for epoch in range(10):
    # Iterate over the batches
    losses = []
    for batch in tqdm(train_dataloader):
        # Get the input data
        input_data = batch[0]
        # Get the target data
        target = batch[1]
        # Forward pass
        output = model(input_data)
        # Loss computation
        loss = loss_fn(output, target)
        # Backward pass
        loss.backward()
        # Parameters update
        optimizer.step()
        # Reset the gradients
        optimizer.zero_grad()
        losses.append(loss.item())
    # Update the learning rate
    scheduler.step()
    avg_training_loss = sum(losses) / len(losses)
    print(f"Epoch {epoch+1} - Training loss: {avg_training_loss:.3f}")

# Evaluation loop
model = model.eval() # Set the model in evaluation mode

predictions = []
targets = []
with torch.no_grad(): # Disable gradient computation
    for batch in test_dataloader:
        # Get the input data
        input_data = batch[0]
        # Get the target data
        target = batch[1]
        # Forward pass
        output = model(input_data)
        # Save the predictions
        predictions.append(output)
        # Save the targets
        targets.append(target)

# Concatenate the predictions
predictions = torch.cat(predictions, dim=0)
# Concatenate the targets
targets = torch.cat(targets, dim=0)

# Compute the accuracy
accuracy = (predictions.argmax(dim=1) == targets).float().mean()
print(f"Accuracy: {accuracy:.5f}")

Epoch 1 - Training loss: 0.202
Epoch 2 - Training loss: 0.069
Epoch 3 - Training loss: 0.051
Epoch 4 - Training loss: 0.042
Epoch 5 - Training loss: 0.035
Epoch 6 - Training loss: 0.031
Epoch 7 - Training loss: 0.026
Epoch 8 - Training loss: 0.025
Epoch 9 - Training loss: 0.021
Epoch 10 - Training loss: 0.019
Accuracy: 0.99000
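
The training cell above mentions a validation split without implementing one. One possible approach, sketched below, is `torch.utils.data.random_split`. The 90/10 ratio, the seed, and the random `TensorDataset` standing in for MNIST are assumptions chosen so that the example runs without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for the MNIST training set: 100 random "images" and labels
full_train = TensorDataset(torch.rand(100, 1, 28, 28), torch.randint(0, 10, (100,)))

# Hold out 10% for validation; the fixed seed makes the split reproducible
generator = torch.Generator().manual_seed(0)
train_subset, valid_subset = random_split(full_train, [90, 10], generator=generator)

# Shuffle only the training data; validation order does not matter
train_loader = DataLoader(train_subset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_subset, batch_size=32, shuffle=False)

print(len(train_subset), len(valid_subset))  # 90 10
```

With the real data, the same call would be `random_split(train_dataset, [54000, 6000])`. The validation loader can then be evaluated after each epoch with the same loop used for the test set, which keeps the test set untouched until the final evaluation.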
