# Multi-Input Models

**Why multi-input?**

In [None]:
"""

Multi-input models, or models that accept more than one source of data, have many applications.

First, we might want the model to use multiple information sources, such as two images of the same car to predict its model.

Second, multi-modal models can work on different input types, such as an image and text, to answer a question about the image.

Next, in metric learning, the model learns whether two inputs represent the same object.
Think about an automated passport control where the system compares our passport photo with a picture it takes of us.

Finally, in self-supervised learning, the model learns a data representation by learning that two augmented versions of the same input represent the same object.

"""

**Omniglot Dataset**

In [None]:
"""

We will use the Omniglot dataset, a collection of images of 964 different handwritten characters from 30 different alphabets.

The first input will be the image of the character, such as this Latin letter "k".
The second input will be the alphabet that it comes from, expressed as a one-hot vector.

Both inputs will be processed separately, then we concatenate their representations. Finally, a classification layer predicts one of the 964 classes.
We need two elements to build such a model: a custom Dataset and an appropriate model architecture.

"""

**Two-Input Dataset**

In [None]:
from PIL import Image
from torch.utils.data import Dataset

class OmniglotDataset(Dataset):
    def __init__(self, transform, samples):
        self.transform = transform
        # Each sample is a 3-tuple: image file path, alphabet as a one-hot vector,
        # and target label (the character class index)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, alphabet, label = self.samples[idx]
        # convert('L') makes sure that the image is read as grayscale
        img = Image.open(img_path).convert('L')
        img = self.transform(img)
        return img, alphabet, label

**Two-Input Architecture**

In [None]:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Image branch: convolution, pooling, then a 128-dimensional embedding
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ELU(),
            nn.Flatten(),
            nn.Linear(16*32*32, 128)
        )

        # Alphabet branch: input size is 30, the number of alphabets,
        # mapped to an arbitrarily chosen output size of 8
        self.alphabet_layer = nn.Sequential(
            nn.Linear(30, 8),
            nn.ELU(),
        )

        # Classifier takes the concatenated embeddings (128 + 8 features)
        self.classifier = nn.Sequential(
            nn.Linear(128 + 8, 964),
        )

    def forward(self, x_image, x_alphabet):
        x_image = self.image_layer(x_image)
        x_alphabet = self.alphabet_layer(x_alphabet)
        x = torch.cat((x_image, x_alphabet), dim=1)
        return self.classifier(x)


"""
In the forward method, we pass each input through its corresponding layer. Then, we concatenate the outputs with torch.cat.
Finally, we pass the result through the classifier layer and return.
"""

**Training Loop**

In [None]:
import torch.optim as optim

net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for epoch in range(10):
    for img, alpha, labels in dataloader_train:
        optimizer.zero_grad()
        outputs = net(img, alpha)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

In [None]:
"""

Building a multi-input model starts with crafting a custom dataset that can supply all the inputs to the model.
In this exercise, you will build the Omniglot dataset that serves triplets consisting of:

The image of a character to be classified,
The one-hot encoded alphabet vector of length 30, with zeros everywhere but for a single one denoting the ID of the alphabet the character comes from,
The target label, an integer between 0 and 963.
You are provided with samples, a list of 3-tuples comprising an image's file path, its alphabet vector, and the target label.
Also, the following imports have already been done for you.

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

"""




"""

Assign transform and samples to class attributes with the same names.

Implement the .__len__() method such that it returns the number of samples stored in the class's samples attribute.

Unpack the sample at index idx, assigning its contents to img_path, alphabet, and label.
Transform the loaded image with self.transform() and assign it to img_transformed.

"""

class OmniglotDataset(Dataset):
    def __init__(self, transform, samples):
        # Assign transform and samples to class attributes
        self.transform = transform
        self.samples = samples

    def __len__(self):
        # Return number of samples
        return len(self.samples)

    def __getitem__(self, idx):
        # Unpack the sample at index idx
        img_path, alphabet, label = self.samples[idx]
        img = Image.open(img_path).convert('L')
        # Transform the image
        img_transformed = self.transform(img)
        return img_transformed, alphabet, label





### With the implementation of OmniglotDataset ready, you can create the dataset and DataLoader, just as you did before
# dataset_train = OmniglotDataset(
#     transform=transforms.Compose([
#         transforms.ToTensor(),
#         transforms.Resize((64, 64)),
#     ]),
#     samples=samples,
# )

# dataloader_train = DataLoader(
#     dataset_train, shuffle=True, batch_size=3,
# )
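
To see what the DataLoader yields for a dataset of triplets, the sketch below uses a hypothetical `ToyTripletDataset` that fabricates tensors in memory instead of reading image files. It is a stand-in for `OmniglotDataset`, not the real dataset, and all values in it are made up.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTripletDataset(Dataset):
    """Hypothetical stand-in for OmniglotDataset that needs no image files."""

    def __init__(self, num_samples=6):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        img = torch.zeros(1, 64, 64)   # fake 64x64 grayscale image
        alphabet = torch.zeros(30)     # one-hot alphabet vector
        alphabet[idx % 30] = 1.0
        label = idx                    # fake character class index
        return img, alphabet, label


loader = DataLoader(ToyTripletDataset(), batch_size=3, shuffle=False)
img, alpha, labels = next(iter(loader))

print(img.shape)     # torch.Size([3, 1, 64, 64])
print(alpha.shape)   # torch.Size([3, 30])
print(labels.shape)  # torch.Size([3])
```

The default collate function stacks each element of the triplet separately, so one batch unpacks into three tensors.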

In [None]:
"""

Define image, alphabet and classifier sub-networks as sequential models, assigning them to self.image_layer, self.alphabet_layer and self.classifier, respectively.

Pass the image and alphabet through the appropriate model layers.

Concatenate the outputs from image and alphabet layers and assign the result to x.


"""

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Define sub-networks as sequential models
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ELU(),
            nn.Flatten(),
            nn.Linear(16*32*32, 128)
        )
        self.alphabet_layer = nn.Sequential(
            nn.Linear(30, 8),
            nn.ELU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 + 8, 964),
        )

    def forward(self, x_image, x_alphabet):
        # Pass the x_image and x_alphabet through appropriate layers
        x_image = self.image_layer(x_image)
        x_alphabet = self.alphabet_layer(x_alphabet)
        # Concatenate x_image and x_alphabet
        x = torch.cat((x_image, x_alphabet), dim=1)
        return self.classifier(x)
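
The `16*32*32` input size of the linear layer follows from the shapes inside `image_layer`. The sketch below traces a single dummy image through the first layers, assuming inputs are resized to 64x64 as in the transforms used in this section.

```python
import torch
import torch.nn as nn

# One grayscale 64x64 image (batch of 1), filled with zeros for illustration
x = torch.zeros(1, 1, 64, 64)

# padding=1 with a 3x3 kernel keeps the spatial size at 64x64
x = nn.Conv2d(1, 16, kernel_size=3, padding=1)(x)
print(x.shape)  # torch.Size([1, 16, 64, 64])

# 2x2 max pooling halves height and width
x = nn.MaxPool2d(kernel_size=2)(x)
print(x.shape)  # torch.Size([1, 16, 32, 32])

# Flatten keeps the batch dimension and merges the rest: 16 * 32 * 32 = 16384
x = nn.Flatten()(x)
print(x.shape)  # torch.Size([1, 16384])
```

If the images were resized to a different resolution, the linear layer's input size would have to change accordingly.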

# Multi-output Models

**Why multi-output?**

In [None]:
"""

Just like multi-input models, multi-output architectures are everywhere. Their simplest use-case is for multi-task learning,
where we want to predict two things from the same input, such as a car's make and model from its picture.
In a multi-label classification problem, the input can belong to multiple classes simultaneously. For instance, an image can depict both a beach and people.
For each of these labels, a separate output from the model is needed.

Finally, in very deep models built of blocks of layers, it is a common practice to add extra outputs predicting the same targets after each block.
These additional outputs ensure that the early parts of the model are learning features useful for the task
at hand while also serving as a form of regularization to boost the robustness of the network.


""

In [None]:
"""

We will use the Omniglot dataset again to build a model that predicts, based on the image, both the character (A, B, ...; 964 in total) and the alphabet it comes from (for example Latin or Bengali; 30 in total).
First, we will pass the image through some layers to obtain its embedding.

Then we add two independent classifiers on top, one for each output.


Changing the Dataset Format
-----------------------------------
Before, when we used the alphabet as an input, we represented it using a one-hot vector (a list with mostly zeros and a single 1).
Now, since the alphabet is an output, we just use an integer label (e.g., 0 for first alphabet, 1 for second, and so on up to 29).

Before: Alphabet 'Greek' → [0, 1, 0, ..., 0] (one-hot vector).
Now: Alphabet 'Greek' → 1 (just a single number).



Forward Pass (Prediction Process)
------------------------------------

When we pass an image into the model:

The image processing layers extract features.
These features go into two separate classifier layers.
Each classifier makes a prediction (one for character, one for alphabet).
Finally, the model outputs both predictions.


Example: Suppose we give the model an image of a handwritten character.

It processes the image and extracts features.
The character classifier predicts: "This is character 340."
The alphabet classifier predicts: "This belongs to alphabet 5."


"""

class OmniglotDataset(Dataset):
    def __init__(self, transform, samples):
        self.transform = transform
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, alphabet, label = self.samples[idx]
        img = Image.open(img_path).convert('L')
        img = self.transform(img)
        return img, alphabet, label




class Net(nn.Module):
    def __init__(self, num_alpha, num_char):
        super().__init__()
        self.image_layer = nn.Sequential(
          nn.Conv2d(1, 16, kernel_size=3, padding=1),
          nn.MaxPool2d(kernel_size=2),
          nn.ELU(),
          nn.Flatten(),
          nn.Linear(16*32*32, 128)
        )

        # Size each head from the constructor arguments (30 alphabets, 964 characters here)
        self.classifier_alpha = nn.Linear(128, num_alpha)
        self.classifier_char = nn.Linear(128, num_char)

    def forward(self, x):
        x_image = self.image_layer(x)
        output_alpha = self.classifier_alpha(x_image)
        output_char = self.classifier_char(x_image)
        return output_alpha, output_char

**Training Loop**

In [None]:
for epoch in range(10):
    for images, labels_alpha, labels_char in dataloader_train:
        optimizer.zero_grad()
        outputs_alpha, outputs_char = net(images)

        loss_alpha = criterion(outputs_alpha, labels_alpha)
        loss_char = criterion(outputs_char, labels_char)

        loss = loss_alpha + loss_char
        loss.backward()
        optimizer.step()

In [None]:
"""

Use your OmniglotDataset to create dataset_train, passing the two image transforms you have used before: convert the image to a tensor and resize it to (64, 64).

Create dataloader_train from dataset_train; shuffle the training images and set batch size to 32.
"""

# Print the sample at index 100
print(samples[100])

# Create dataset_train
dataset_train = OmniglotDataset(
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((64, 64)),
    ]),
    samples=samples,
)

# Create dataloader_train
dataloader_train = DataLoader(
    dataset_train, shuffle=True, batch_size=32,
)

In [None]:
"""

Define self.classifier_alpha and self.classifier_char as linear layers with input shapes matching the output of image_layer,
and output shapes corresponding to the number of alphabets (30) and the number of characters (964), respectively.


Pass the image embedding x_image separately through each of the classifiers, assigning the results to output_alpha and output_char, respectively,
and return them in this order

"""

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_layer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
            nn.ELU(),
            nn.Flatten(),
            nn.Linear(16*32*32, 128)
        )

        # Define the two classifier layers
        self.classifier_alpha = nn.Linear(128, 30)
        self.classifier_char = nn.Linear(128, 964)

    def forward(self, x):
        x_image = self.image_layer(x)
        # Pass x_image through the classifiers and return both results
        output_alpha = self.classifier_alpha(x_image)
        output_char = self.classifier_char(x_image)
        return output_alpha, output_char


In [None]:
"""

Calculate the alphabet classification loss and assign it to loss_alpha.
Calculate the character classification loss and assign it to loss_char.
Compute the total loss as the sum of the two partial losses and assign it to loss.

"""


net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.05)

for epoch in range(1):
    for images, labels_alpha, labels_char in dataloader_train:
        optimizer.zero_grad()
        outputs_alpha, outputs_char = net(images)
        # Compute alphabet classification loss
        loss_alpha = criterion(
            outputs_alpha, labels_alpha
        )
        # Compute character classification loss
        loss_char = criterion(
            outputs_char, labels_char
        )
        # Compute total loss
        loss = loss_alpha + loss_char
        loss.backward()
        optimizer.step()

# Evaluation of Multi-output Models and Loss Weighting

In [None]:
"""

We chose to define the final loss as the sum of the two partial losses. By doing so, we are telling the model that
recognizing characters and recognizing alphabets are equally important to us. If that is not the case, we can combine the two losses differently.

"""

**Warning: losses on different scales**

In [None]:
"""

There is just one caveat: when assigning loss weights, we must be aware of the magnitudes of the loss values.
If the losses are not on the same scale, one loss could dominate the other, causing the model to effectively ignore the smaller loss.
Consider a scenario where we're building a model to predict house prices, and use MSE loss.

If we also want to use the same model to provide a quality assessment of the house, categorized as "Low", "Medium", or "High", we would use cross-entropy loss.

Cross-entropy is typically in the single-digit range, while MSE can reach tens of thousands.
Combining these two would result in the model ignoring the quality assessment task completely.
A solution is to scale each loss by dividing it by its maximum value in the batch.
This brings both losses to the same range, allowing us to weight them if desired and add them together.

"""

In [None]:
"""

Define acc_alpha and acc_char as multi-class Accuracy() metrics for the two outputs, alphabets and characters, with the appropriate number of classes each
(there are 30 alphabets and 964 characters in the dataset).


Define the evaluation loop by iterating over test images, labels_alpha, and labels_char.
Inside the for-loop, obtain model results for the test data batch and assign them to outputs_alpha, outputs_char.


Update the two accuracy metrics with the current batch's data.

"""

import torch
from torchmetrics import Accuracy

def evaluate_model(model):
    # Define accuracy metrics
    acc_alpha = Accuracy(task="multiclass", num_classes=30)
    acc_char = Accuracy(task="multiclass", num_classes=964)

    model.eval()
    with torch.no_grad():
        for images, labels_alpha, labels_char in dataloader_test:
            # Obtain model outputs
            outputs_alpha, outputs_char = model(images)
            _, pred_alpha = torch.max(outputs_alpha, 1)
            _, pred_char = torch.max(outputs_char, 1)
            acc_alpha(pred_alpha, labels_alpha)
            acc_char(pred_char, labels_char)

    print(f"Alphabet: {acc_alpha.compute()}")
    print(f"Character: {acc_char.compute()}")