# Tensors

Tensors are a specialized data structure, very similar to arrays and matrices, that can run on both GPUs and CPUs. By default they are created on the CPU. To move a tensor to the GPU, use the `.to` method.

In [None]:
import torch

tensor = torch.rand(3, 4)  # example tensor, created on the CPU by default
# We move our tensor to the GPU if available
if torch.cuda.is_available():
    tensor = tensor.to('cuda')

In [None]:
import torch
import numpy as np

data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)

np_array = np.array(data)
x_np = torch.from_numpy(np_array)
print(x_np)

`shape` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.

In [None]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)
print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

Tensor attributes describe a tensor's shape, datatype, and the device on which it is stored.

In [None]:

tensor = torch.rand(3,4)
print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Tensors on the CPU and NumPy arrays can share their underlying memory locations, and changing one will change the other.
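
The memory sharing described above can be sketched as follows. This is a minimal illustration, assuming a CPU tensor; the variable names are made up for the example. It also shows that `torch.tensor(...)` copies its input, so it does not take part in the sharing.

```python
import numpy as np
import torch

# Tensor -> NumPy: the array is a view over the tensor's CPU memory.
t = torch.ones(5)
n = t.numpy()
t.add_(1)  # in-place add on the tensor
print(n)   # the NumPy array reflects the change

# NumPy -> tensor: torch.from_numpy also shares memory with the source array.
a = np.zeros(3)
b = torch.from_numpy(a)
np.add(a, 2, out=a)  # in-place add on the array
print(b)             # the tensor reflects the change

# torch.tensor(...) copies the data, so later writes to `a` reach `b` but not `c`.
c = torch.tensor(a)
a[0] = -1.0
print(b[0].item(), c[0].item())
```

If the two objects need to be independent, copying explicitly (for example with `torch.tensor(a)` or `n.copy()`) avoids surprises from in-place updates.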

___

## Datasets and DataLoaders

Ideally we want dataset code to be decoupled from model training code for better readability and modularity. PyTorch provides two data primitives, `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`, that allow you to use pre-loaded datasets as well as your own data.

### Loading a Dataset

This demo loads the FashionMNIST dataset from `torchvision` with the following parameters:

- `root` is the path where the training and testing data is stored.
- `train` selects the training split (`True`) or the test split (`False`).
- `download=True` downloads data from the internet if it's not available at `root`.
- `transform` and `target_transform` specify the feature and label transformations.

In [None]:
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

We can index a `Dataset` manually like a list with `training_data[index]`.

Visualize the samples in the training data with `matplotlib`.
- Note: this could be an interesting way to manipulate data, or to see the training state.

In [None]:
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()

To create a custom dataset for your own files, the class must implement the following three functions:

- `__init__`: The `__init__` function is run once when instantiating the Dataset object. We initialize the directory containing the images, the annotations file, and both transforms (covered in more detail in the next section).
- `__len__`: The `__len__` function returns the number of samples in our dataset.
- `__getitem__`: The `__getitem__` function loads and returns a sample from the dataset at the given index `idx`. Based on the index, it identifies the image's location on disk, converts that to a tensor using `read_image`, retrieves the corresponding label from the CSV data in `self.img_labels`, calls the transform functions on them (if applicable), and returns the tensor image and corresponding label in a tuple.

In this implementation, the FashionMNIST images are stored in a directory `img_dir`, and their labels are stored separately in a CSV file `annotations_file`.

In [None]:
import os 
import pandas as pd
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform
    
    def __len__(self):
        return len(self.img_labels)
    
    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx,0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)  # keep the transformed image
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
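
Because `CustomImageDataset` needs image files and a CSV on disk, the same three-method pattern can be tried out with data held in memory. The sketch below is illustrative only: `InMemoryDataset`, `toy_features`, `toy_targets`, and `toy_data` are hypothetical names, and the division by 10 stands in for a real transform.

```python
import torch
from torch.utils.data import Dataset


class InMemoryDataset(Dataset):  # hypothetical class, for illustration only
    def __init__(self, features, labels, transform=None, target_transform=None):
        # Runs once: store the data and the optional transforms.
        self.features = features
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.labels)

    def __getitem__(self, idx):
        # Fetch one sample, apply the transforms, return a (feature, label) tuple.
        x = self.features[idx]
        y = self.labels[idx]
        if self.transform:
            x = self.transform(x)
        if self.target_transform:
            y = self.target_transform(y)
        return x, y


toy_features = torch.arange(12, dtype=torch.float32).reshape(4, 3)
toy_targets = [0, 1, 0, 1]
toy_data = InMemoryDataset(toy_features, toy_targets, transform=lambda x: x / 10)

print(len(toy_data))    # calls __len__
x, y = toy_data[2]      # calls __getitem__(2)
print(x, y)
```

Note that the transform's return value is assigned back to `x`; calling the transform without keeping its result would silently return the untransformed sample.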

During model training it is common to pass samples in minibatches, reshuffle the data at every epoch to reduce model overfitting, and use Python's `multiprocessing` to speed up data retrieval.
> An epoch is one complete pass through the entire training dataset during the training phase of a model. In simpler terms, it is one cycle in which the model has seen and learned from every training example once.

We use a `DataLoader` to wrap each dataset in an iterable for easy access to the sample data. Each iteration returns a batch of `train_features` and `train_labels`.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")
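
To see minibatching and per-epoch reshuffling without downloading FashionMNIST, here is a small sketch on a made-up dataset of ten samples. The names `toy_dataset` and `toy_loader` and the batch size of 4 are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples: one feature each, with labels 0..9.
toy_inputs = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
toy_labels = torch.arange(10)
toy_dataset = TensorDataset(toy_inputs, toy_labels)

# shuffle=True draws a new sample order each time the loader is iterated.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    seen = []         # labels in the order they were served this epoch
    batch_sizes = []  # size of each minibatch
    for batch_inputs, batch_labels in toy_loader:
        batch_sizes.append(len(batch_labels))
        seen.extend(batch_labels.tolist())
    print(f"epoch {epoch}: batch sizes {batch_sizes}, label order {seen}")
```

Each epoch visits all ten samples exactly once, in batches of 4, 4, and 2. The last batch is smaller because 10 is not a multiple of 4; passing `drop_last=True` to `DataLoader` would discard it instead.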

___