## 🔮 Deep Learning in Practice


### Have Considered PyTorch


So far, we have considered PyTorch only as a tensor library (i.e., like NumPy), albeit one that also offers GPU support and automatic differentiation.


<img src="https://i.imgur.com/FXSdMjM.png" />


<table>
  <tr> <th>torch</th> <th>torch.nn</th> <th>torch.nn.functional</th> <th>torch.optim</th> <th>torch.utils</th> </tr>
  <tr> 
  <td> Top-level package that contains all other modules and provides tensors with GPU support and automatic differentiation </td> 
  <td> Basic building blocks of neural networks (i.e., layers, activations and loss functions) </td> 
  <td> Stateless functional versions of the blocks in torch.nn </td> 
  <td> Optimization algorithms and learning rate schedulers </td> 
  <td> Utilities for reading data, batching, logging, etc. </td> </tr>
</table>
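
To make the table concrete, here is a minimal sketch in which each of the five modules plays its role in one pass over a tiny dataset. The data, layer sizes and learning rate are made up purely for illustration.

```Python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# torch: tensors (made-up data: 8 examples, 5 features, 2 classes)
x = torch.randn(8, 5)
y = torch.randint(0, 2, (8,))

# torch.utils.data: dataset wrapping and batching
loader = DataLoader(TensorDataset(x, y), batch_size=4)

# torch.nn: a layer that owns its parameters
model = nn.Linear(5, 2)

# torch.optim: an optimizer that updates those parameters
optimizer = optim.SGD(model.parameters(), lr=0.1)

for xb, yb in loader:
    logits = model(xb)
    # torch.nn.functional: a stateless loss function
    loss = F.cross_entropy(logits, yb)
    optimizer.zero_grad()
    loss.backward()          # torch: automatic differentiation
    optimizer.step()

print("last batch loss:", loss.item())
```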


Aside from these modules, the PyTorch project also maintains companion libraries such as `torchvision`, `torchtext` and `torchaudio`. In general, these provide **transformations** specific to the data type (e.g., rotation for images, pitch shift for audio, tokenization for text) as well as well-known built-in datasets and models.

We covered the main module `torch` in the last tutorial, so let's explore the rest:

### 📊 Loading Data

Let's look into `torch.utils.data`.

With PyTorch, you start by wrapping your data in a `Dataset` object:

In [2]:
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

x_data = torch.tensor([
    [0.3, 2.1, 3.3, 4.2, 3.5],
    [1.4, 2.5, 3.6, 4.1, 5.5], 
    [1.2, 2.1, 3.4, 0.4, 5.9], 
    [0.2, 2.1, 7.4, 0.4, 5.9], 
    [1.3, 6.1, 3.4, 1.4, 2.9], 

    ])
y_data = torch.tensor([0, 1, 0, 1, 0])
# note: x and y could equally be created from NumPy arrays

dataset = TensorDataset(x_data, y_data)
print("first example of dataset:", dataset[0], " where dataset is of length:", len(dataset))

first example of dataset: (tensor([0.3000, 2.1000, 3.3000, 4.2000, 3.5000]), tensor(0))  where dataset is of length: 5
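
As the comment in the cell above hints, the tensors do not have to be typed by hand. A minimal sketch, assuming the data already lives in NumPy arrays (the arrays below are made up):

```Python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Made-up NumPy data: 5 examples with 3 features each
x_np = np.random.rand(5, 3).astype(np.float32)
y_np = np.array([0, 1, 0, 1, 0], dtype=np.int64)

# torch.from_numpy shares memory with the source array (no copy is made)
x_t = torch.from_numpy(x_np)
y_t = torch.from_numpy(y_np)

np_dataset = TensorDataset(x_t, y_t)
features, label = np_dataset[0]
print(features.dtype, label.dtype, len(np_dataset))
```

Casting to `float32` up front is a choice made here because PyTorch layers use `float32` parameters by default.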


Then split it into different sets if needed:

In [3]:
# Split the dataset randomly into training and validation sets
train_dataset, val_dataset = random_split(dataset, [0.8, 0.2])
print(f"now the train dataset is of length: {len(train_dataset)} and the validation dataset is of length: {len(val_dataset)}")

now the train dataset is of length: 4 and the validation dataset is of length: 1
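
Two practical caveats about `random_split`: the split changes on every run unless you pass a seeded generator, and fractional lengths such as `[0.8, 0.2]` are only accepted by newer PyTorch releases (1.13 onward, as far as we know). A sketch with explicit integer lengths and a fixed seed, using a made-up toy dataset:

```Python
import torch
from torch.utils.data import TensorDataset, random_split

# Made-up dataset of 10 examples
toy = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

# Explicit integer lengths plus a seeded generator
gen = torch.Generator().manual_seed(42)
train_part, val_part = random_split(toy, [8, 2], generator=gen)

# Re-creating the generator with the same seed reproduces the same split
gen_again = torch.Generator().manual_seed(42)
train_again, val_again = random_split(toy, [8, 2], generator=gen_again)

print(len(train_part), len(val_part))
print(train_part.indices == train_again.indices)
```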


Once you have your `Dataset` object, you can use `DataLoader`, which gives you the following for FREE:

- Automatic batching and random sampling 

- Multiprocessing data loading and memory pinning

In [4]:
batch_size = 2
train_loader = DataLoader(train_dataset, 
                          batch_size=batch_size, 
                          shuffle=True,
                          sampler=None,                     # default; with shuffle=True a RandomSampler is used internally
                          num_workers=0
                          )

val_loader = DataLoader(val_dataset, 
                        batch_size=batch_size, 
                        shuffle=False,
                        sampler=None,
                        num_workers=0
                        )

`DataLoader` returns a non-indexable iterable (i.e., you can loop over it but not index into it).

In [5]:
for i, (inputs, targets) in enumerate(train_loader):
    print("Batch Number:", i)
    print("Inputs:", inputs)
    print("Targets:", targets)
    if i == 1: break

Batch Number: 0
Inputs: tensor([[1.3000, 6.1000, 3.4000, 1.4000, 2.9000],
        [1.2000, 2.1000, 3.4000, 0.4000, 5.9000]])
Targets: tensor([0, 0])
Batch Number: 1
Inputs: tensor([[0.3000, 2.1000, 3.3000, 4.2000, 3.5000],
        [1.4000, 2.5000, 3.6000, 4.1000, 5.5000]])
Targets: tensor([0, 1])
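
Since a `DataLoader` cannot be indexed, a common way to peek at a single batch is `next(iter(loader))`. The sketch below, on a made-up dataset, also shows that `len()` counts batches rather than examples:

```Python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset: 5 examples, 3 features
toy = TensorDataset(torch.randn(5, 3), torch.tensor([0, 1, 0, 1, 0]))
loader = DataLoader(toy, batch_size=2, shuffle=False)

# len() gives the number of batches: ceil(5 / 2) = 3
print("number of batches:", len(loader))

# Grab a single batch without writing a loop
first_inputs, first_targets = next(iter(loader))
print(first_inputs.shape, first_targets)

# Indexing is not supported
try:
    loader[0]
except TypeError as err:
    print("cannot index a DataLoader:", err)

# drop_last=True discards the final, smaller batch
short_loader = DataLoader(toy, batch_size=2, drop_last=True)
print("batches with drop_last:", len(short_loader))
```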


However, most datasets you come across will be unstructured and stored locally as files, in which case `TensorDataset` is not so helpful.

**Generic Solution:** 

Just write a class that loads your dataset and inherits from PyTorch's `Dataset` class.

```Python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, params):
        pass

    def __len__(self):                      # Condition 1: must have length
        pass

    def __getitem__(self, idx):             # Condition 2: and be indexable
        pass
```

Now, instead of `dataset = TensorDataset(x_data, y_data)`, we write `dataset = CustomDataset(params)`. Once the three methods are implemented, the rest (splitting, batching, shuffling) will just work!

Let's see a basic example. Suppose we want to load scikit-learn datasets into PyTorch while sidestepping any upfront type or format conversions (as these aren't always possible).

In [7]:
from torch.utils.data import Dataset, DataLoader, random_split
from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import StandardScaler

class ClassicDataset(Dataset):
    def __init__(self, type='iris'):
        sklearn_dataset = load_iris() if type == 'iris' else load_digits()
        self.data = sklearn_dataset.data
        self.targets = sklearn_dataset.target

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.targets[idx]

        return x, y

# Example usage:
iris_dataset = ClassicDataset(type="iris")                              # constructor arguments are allowed!
train_dataset, val_dataset = random_split(iris_dataset, [0.8, 0.2])
train_dataloader = DataLoader(train_dataset, batch_size=12, shuffle=True)
for (xb, yb) in train_dataloader:
    print(xb)
    break

tensor([[7.2000, 3.2000, 6.0000, 1.8000],
        [6.3000, 3.3000, 4.7000, 1.6000],
        [6.6000, 3.0000, 4.4000, 1.4000],
        [4.5000, 2.3000, 1.3000, 0.3000],
        [4.6000, 3.6000, 1.0000, 0.2000],
        [4.8000, 3.4000, 1.9000, 0.2000],
        [5.2000, 2.7000, 3.9000, 1.4000],
        [5.5000, 2.5000, 4.0000, 1.3000],
        [5.6000, 3.0000, 4.5000, 1.5000],
        [6.4000, 3.2000, 4.5000, 1.5000],
        [5.0000, 3.4000, 1.6000, 0.4000],
        [5.0000, 3.2000, 1.2000, 0.2000]], dtype=torch.float64)
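
Note the `dtype=torch.float64` in the output: scikit-learn hands back NumPy `float64` arrays, while PyTorch layers use `float32` parameters by default. One possible variation, sketched below, converts each example inside `__getitem__`. The class name `ArrayDataset` and the random arrays standing in for `load_iris().data` and `.target` are hypothetical.

```Python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Hypothetical variant of ClassicDataset that returns float32 tensors."""

    def __init__(self, data, targets):
        self.data = data            # NumPy array, shape (n_samples, n_features)
        self.targets = targets      # NumPy array, shape (n_samples,)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Convert per example: float32 features, int64 class label
        x = torch.as_tensor(self.data[idx], dtype=torch.float32)
        y = torch.tensor(int(self.targets[idx]), dtype=torch.long)
        return x, y

# Made-up stand-ins for sklearn's `.data` and `.target` arrays
rng = np.random.default_rng(0)
data = rng.random((10, 4))                 # float64, like load_iris().data
targets = rng.integers(0, 3, size=10)

loader = DataLoader(ArrayDataset(data, targets), batch_size=4)
xb, yb = next(iter(loader))
print(xb.dtype, yb.dtype, xb.shape)
```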


Less generally, `torchvision`, `torchaudio` or `torchtext` may have already implemented common custom datasets:

<img src="https://i.imgur.com/PvLR9So.png" width="1100">

And they allow transformations! Let's look at:

- [Torch Vision Transforms](https://pytorch.org/vision/0.9/transforms.html)

- [Torch Audio Transforms](https://pytorch.org/audio/stable/transforms.html)

- [Torch Text Transforms](https://pytorch.org/text/stable/transforms.html)

In [9]:
from torchvision import datasets, transforms        # torchvision.models additionally offers pretrained models


train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),      #crop a random (within limits) piece and resize to 224x224.
        transforms.RandomHorizontalFlip(),      #By default 50% chance to flip the image horizontally.
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5) , (0.25, 0.25, 0.25))
    ])


val_transforms = transforms.Compose([
        transforms.Resize(256),                 #simply resize the image
        transforms.CenterCrop(224),             #center crop of size 224x224
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5) , (0.25, 0.25, 0.25))
    ])

train_data = datasets.ImageFolder('./data/Hymenoptera/train', train_transforms)
val_data = datasets.ImageFolder('./data/Hymenoptera/val', val_transforms)
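
To see what the last two steps of each pipeline do numerically: `ToTensor` produces a channel-first tensor with values scaled to `[0, 1]`, and `Normalize` then computes `(x - mean) / std` per channel. With the mean of 0.5 and std of 0.25 used above, pixel values end up in `[-2, 2]`. The sketch below reproduces that arithmetic in plain `torch` on a made-up image, so it runs without any image files:

```Python
import torch

# Made-up "image" already scaled to [0, 1], channel-first: (C, H, W)
img = torch.rand(3, 4, 4)

# One mean and one std per channel, shaped to broadcast over H and W
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.25, 0.25, 0.25]).view(3, 1, 1)

# Same arithmetic as Normalize: subtract the channel mean, divide by the channel std
normalized = (img - mean) / std

print(normalized.min().item() >= -2.0, normalized.max().item() <= 2.0)
```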

And as we said earlier, each of [torchvision](https://pytorch.org/vision/stable/datasets.html), [torchaudio](https://pytorch.org/audio/stable/datasets.html) and [torchtext](https://pytorch.org/text/stable/datasets.html) also comes with many popular built-in datasets. Follow the links to have a look, and let's try one out now!

In [11]:
import torchaudio
commands_data = torchaudio.datasets.CMUARCTIC(root='.', download=True)      # Speech dataset from CMU

100%|██████████| 89.0M/89.0M [00:51<00:00, 1.80MB/s]


The last main component of [torchvision](https://pytorch.org/vision/0.9/models.html), [torchaudio](https://pytorch.org/audio/stable/models.html) and [torchtext](https://pytorch.org/text/stable/models.html) is pretrained models. We can have a look at some of them now, but we will try them out later. Note that the `HuggingFace` library, rather than `torchtext`, dominates the area of NLP pretrained models.

#### 🧠 Let's Recap

- You must wrap your dataset in a PyTorch `Dataset` object (from `torch.utils.data`)

- We covered building a `Dataset` directly from tensors, through a custom class, or through helper libraries

- We saw that PyTorch helper libraries also offer built-in datasets, transformations and models.

The rest of [torch.utils](https://pytorch.org/docs/stable/utils.html) is niche but we may explore more of it later. 

Before we move on, let's discuss how different types of data are represented as tensors in deep learning.

<table>
<tr>
<th> Table </th> <th> Images </th> <th> Audio </th> <th> Text </th>
</tr>
<tr>
<td> Each observation is already a vector </td> 
<td> 

Each observation is some $(l,w,3)$ tensor 

</td> 
<td> 

Initially a sequence of amplitudes (discrete signal), typically converted to the frequency domain $(n_w,w)$ (e.g., [Mel Spectrogram](https://commons.wikimedia.org/wiki/File:Spektogram_Vokale.png)) 

</td> 
<td> 

Statistical or deep learning approaches assign each word/sentence to a [meaningful vector](https://community.sap.com/t5/technology-blogs-by-members/vector-databases-and-embeddings-revolutionizing-ai-in-rag-in-llm-or-gpt/ba-p/13575985)  

</td>
</tr>
</table>

We may see some examples soon.
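
In the meantime, here is a rough sketch of the shapes involved, using made-up data and plain `torch` only. The sizes, the STFT settings and the vocabulary size are arbitrary assumptions, and the audio example computes a plain magnitude spectrogram rather than a Mel spectrogram.

```Python
import torch
import torch.nn as nn

# Table: each row is already a feature vector -> (n_samples, n_features)
table = torch.randn(150, 4)

# Image: PyTorch layers expect channel-first (3, height, width),
# so an (l, w, 3) array is permuted before use
image_hwc = torch.rand(32, 48, 3)
image_chw = image_hwc.permute(2, 0, 1)

# Audio: 1 second of made-up signal at 16 kHz -> magnitude spectrogram
waveform = torch.randn(16000)
n_fft = 400
spec = torch.stft(waveform, n_fft=n_fft, hop_length=160,
                  window=torch.hann_window(n_fft),
                  return_complex=True).abs()      # (frequency bins, time frames)

# Text: made-up token ids mapped to learned vectors
token_ids = torch.tensor([4, 17, 9, 2])
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)
vectors = embedding(token_ids)

print(table.shape, image_chw.shape, spec.shape, vectors.shape)
```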

<img src="https://i.imgur.com/TEFUEow.png">