# Object localization with MLPs
Object localization is a computer-vision task that involves identifying the location of one or more objects within an image or video. The goal of object localization is to identify the precise location and size of the object(s) within an image. This is typically done by drawing a bounding box around the object, which indicates its location and extent within the image.

Object localization is an important task in many applications, such as object detection, tracking, and recognition. It is often used in fields like autonomous driving, robotics, and surveillance, where identifying the location and motion of objects is critical for decision-making. Also in agri-food, object localization is often used to detect, for instance, a fruit in an image for robotic harvesting, or a cow in a video frame to measure the distance that she walked.



<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ft1.daumcdn.net%2Fcfile%2Ftistory%2F999C58335BA4941F0F&f=1&nofb=1&ipt=ccd6041b1f8ce1fe478ea771aa1b9bd412286cf9ea3342a3221c37ec87379612&ipo=images" width=800>

The image above displays the difference between classification, object localization, object detection, and object instance segmentation. For object localization, the method needs to detect one object in the image and predict its bounding box.

A bounding box is defined by four coordinates: 
* two coordinates for the position
* one for the width, and 
* one for the height.
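
As a minimal sketch of this representation, the snippet below converts a box given by its center, width, and height into corner coordinates and back. It assumes that the two position coordinates refer to the center of the box, which is the convention the helper functions in this notebook use. The function names `center_to_corners` and `corners_to_center` are hypothetical and only serve this illustration.

```python
def center_to_corners(x, y, w, h):
    """Convert (center x, center y, width, height) to (left, top, right, bottom)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def corners_to_center(left, top, right, bottom):
    """Convert (left, top, right, bottom) back to (center x, center y, width, height)."""
    return ((left + right) / 2, (top + bottom) / 2, right - left, bottom - top)


# Illustrative box: centered at (50, 40), 20 pixels wide and 10 pixels high
box = (50, 40, 20, 10)
corners = center_to_corners(*box)
print(corners)                      # (40.0, 35.0, 60.0, 45.0)
print(corners_to_center(*corners))  # (50.0, 40.0, 20.0, 10.0)
```

Drawing libraries usually expect the corner format, while the center format is convenient as a regression target because position and size are separate values.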

Object localization can be done using a variety of techniques, such as template matching, feature-based methods, deep learning, and other machine-learning approaches. In this practical, you'll implement an MLP that performs object localization of ellipses in images.

Let's start with the required imports.

In [None]:
!pip install d2l==0.16 --quiet

In [None]:
# General imports
from PIL import Image, ImageDraw
import random
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

# Deep learning imports
from d2l import torch as d2l
import torch
from torch import nn
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor, to_pil_image


## Generating images with ellipses
In this notebook, we will work with synthetic images containing ellipses. We will start with a relatively simple situation where the task is to locate a red ellipse on a black background. We will then make it more challenging by adding variation to the colors of the foreground and the background. In later tutorials, you will learn more advanced neural network architectures that allow you to locate real objects in real images.

The next cell contains a function that generates images containing an ellipse, and some helper functions that you'll need for visualization purposes.

**Exercise:**
* Study the code briefly to understand what the functions do
* Run the code to load the functions.



In [None]:
def draw_random_ellipse(width, height, color_ellipse, color_background="black"):
    """
    This function draws a random ellipse on an image with the given width and height.
    The ellipse has a random size and position, and is filled with the specified color.
    The background of the image can also be specified, defaulting to black.
    Returns the resulting image, ellipse center coordinates, and ellipse width and height.

    Parameters:
        - width (int): the width of the image in pixels
        - height (int): the height of the image in pixels
        - color_ellipse (tuple[int]): the color of the ellipse in the format '(red, green, blue)'
        - color_background (str or tuple[int]): the color of the background in the format '(red, green, blue)' or string.
            Default: 'black'

    Returns:
        A tuple containing:
        - img (PIL.Image): the resulting image
        - x (int): the x coordinate of the center of the ellipse
        - y (int): the y coordinate of the center of the ellipse
        - w (int): the width of the ellipse
        - h (int): the height of the ellipse
    """
    # Create image
    img = Image.new("RGB", (width, height), color=color_background)

    # Create a drawing context
    draw = ImageDraw.Draw(img)

    # Calculate minimum ellipse size
    min_rx = width // 12
    min_ry = height // 12

    # Calculate maximum ellipse size
    max_rx = width // 6
    max_ry = height // 6
    # Generate random ellipse parameters within maximum size
    rx = random.randint(min_rx, max_rx)
    ry = random.randint(min_ry, max_ry)
    x = random.randint(rx, width - rx)
    y = random.randint(ry, height - ry)

    # Draw ellipse onto image
    draw.ellipse((x - rx, y - ry, x + rx, y + ry), fill=color_ellipse)

    return img, x, y, 2 * rx, 2 * ry

def plot_bbox(img, x, y, w, h, color="blue"):
    """
    This function plots a bounding box on the given image.
    The bounding box is defined by its center coordinates (x, y) and its width and height (w, h).
    The bounding box is drawn in the given color (blue by default) with a line width of 2 pixels.
    Returns the resulting image.

    Parameters:
        - img (PIL.Image): the image to plot the bounding box on
        - x (int): the x coordinate of the center of the bounding box
        - y (int): the y coordinate of the center of the bounding box
        - w (int): the width of the bounding box
        - h (int): the height of the bounding box
        - color (str or tuple[int]): the outline color of the bounding box. Default: 'blue'

    Returns:
        A PIL.Image object representing the original image with the bounding box plotted on it.
    """
    draw = ImageDraw.Draw(img)
    draw.rectangle(
        (x - w / 2, y - h / 2, x + w / 2, y + h / 2),
        outline=color,
        width=2,
    )
    return img


def plot_grid(imgs, nrows, ncols):
    """
    This function plots a grid of images using the given list of images.
    The grid has the specified number of rows and columns.
    The size of the figure is set to 10x10 inches.
    Returns None.

    Parameters:
        - imgs (List[PIL.Image]): a list of PIL.Image objects to plot
        - nrows (int): the number of rows in the grid
        - ncols (int): the number of columns in the grid

    Returns:
        None.
    """
    assert len(imgs) == nrows * ncols, "nrows*ncols must be equal to the number of images"
    fig = plt.figure(figsize=(10.0, 10.0))
    grid = ImageGrid(
        fig,
        111,  # similar to subplot(111)
        nrows_ncols=(nrows, ncols),
        axes_pad=0.1,  # pad between axes in inch.
    )
    for ax, im in zip(grid, imgs):
        # Iterating over the grid returns the Axes.
        ax.imshow(im)
    plt.show()


Let's test the function by letting it create ten images with random ellipses, at random locations and with random sizes. 

**Exercise:**
* Run the code
* You should see ten images, followed by the same ten images with their bounding boxes drawn
* Think about what the input and the output of the MLP that we are going to define will be

In [None]:
# Let's test our functions
img_size = 100
display_imgs = []
display_imgs_bbox = []

for i in range(10):
    img, x, y, w, h = draw_random_ellipse(width=img_size, height=img_size, color_ellipse=(255, 0, 0))
    display_imgs.append(img.copy())
    img = plot_bbox(img, x, y, w, h)
    display_imgs_bbox.append(img)
plot_grid(imgs=display_imgs, nrows=2, ncols=5)
plot_grid(imgs=display_imgs_bbox, nrows=2, ncols=5)



**Exercise:**
* Run the code below and inspect the shape of the image. 
* Calculate how many values this image contains. Consider that an image has `img_size*img_size` pixels and each pixel has 3 channels (red, green, blue).

In [None]:
import numpy as np 
img, x, y, w, h = draw_random_ellipse(width=img_size, height=img_size, color_ellipse=(255, 0, 0))
np.array(img).shape

## The Dataset class
The PyTorch **Dataset** class can be used to define a custom dataset for use with PyTorch's DataLoader, which allows for efficient loading and batching of data.

In PyTorch, a dataset is represented by a subclass of the Dataset class. Each dataset subclass must implement three methods:
- `__init__(self, ...)`:
This method is the constructor of the dataset class. It initializes the dataset with any necessary information such as file paths, labels, etc. It can also perform some pre-processing steps, such as loading the data into memory or computing statistics of the data. 

- `__getitem__(self, index)`:
This method is used to retrieve an individual sample from the dataset at the given index. It takes an index as input and returns the corresponding data sample and its label (or other metadata). This method is what enables us to use PyTorch's DataLoader to iterate through the dataset and retrieve batches of data.

- `__len__(self)`:
It returns the total number of samples in the dataset. This method is used by the DataLoader to determine the length of the dataset and to split the data into batches.

By defining a custom Dataset class, you can easily create a data pipeline that loads, transforms, and feeds your data into a machine-learning model. The DataLoader class then allows you to efficiently load and batch the data for training or inference. Overall, the Dataset class is an essential building block for creating efficient, scalable, and customizable data pipelines in PyTorch. You **MUST** learn how to build a dataset class for a given application.
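
To make the three methods concrete, here is a minimal sketch of a custom dataset on a deliberately unrelated toy problem. The class `SquaresDataset` and its contents are hypothetical and only illustrate the pattern: sample `i` is the number `i` and its label is `i` squared.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): sample i is the number i, its label is i squared."""

    def __init__(self, n_samples):
        # Store what the other methods need
        self.n_samples = n_samples

    def __len__(self):
        # Total number of samples in the dataset
        return self.n_samples

    def __getitem__(self, idx):
        # Build one (data, label) pair for the requested index
        data = torch.tensor([float(idx)])
        label = torch.tensor([float(idx) ** 2])
        return data, label


dataset = SquaresDataset(n_samples=10)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# The DataLoader stacks individual samples into batches
batch_shapes = [tuple(X.shape) for X, y in loader]
print(len(dataset))   # 10
print(batch_shapes)   # [(4, 1), (4, 1), (2, 1)]
```

Note that the last batch is smaller because 10 samples do not divide evenly into batches of 4.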

**Exercise:** 
* Complete the dataset class. Hints:
  * In  `__init__`, you should set the class attributes `self.number_imgs` and `self.img_size`
  * In  `__len__`, you need to return the length of the dataset, which is provided in the init function.
  * In `__getitem__`, you need to use the function `draw_random_ellipse()` to get a random image and the x, y, width and height of the bounding box.
  * In `__getitem__`, you need to set the data by making a tensor of the image
  * In `__getitem__`, you need to define the label for the image, which should be a tensor storing the four bounding box coordinates **normalized with respect to the image size** using:
  ```torch.tensor([x, y, w, h]) / self.img_size```

* Mind that the class for the dataset is called `EllipseDataset`.


In [None]:
class EllipseDataset(Dataset):
    """
    This PyTorch dataset generates a set of images containing a random ellipse with a corresponding label.
    The dataset generates a specified number of images with the given size.

    Attributes:
        - number_imgs (int): the number of images to generate
        - img_size (int): the size of each image in pixels
    """

    def __init__(self, number_imgs, img_size):
        """
        Constructs a new EllipseDataset object.

        Parameters:
            - number_imgs (int): the number of images to generate
            - img_size (int): the size of each image in pixels
        """
        # TODO: save the method inputs as attributes of the class so they can be used by other methods (2 lines)
        ..

    def __len__(self):
        """
        Returns the number of images in the dataset.

        Returns:
            An integer representing the number of images in the dataset.
        """
        # TODO: return the size of the dataset (Hint: is set using the constructor function)
        ..

    def __getitem__(self, idx):
        """
        Generates a random ellipse image with a corresponding label (=bounding box).

        Parameters:
            - idx (int): the index of the image to generate. Not used in this case as we generate images randomly.

        Returns:
            A tuple containing:
            - data (Tensor): a tensor representing the image data
            - label (Tensor): a tensor representing the label data: [x, y, w, h] relative to the image size
        """
        # TODO: create an image with an ellipse using the provided function. Keep the color of the ellipse as red (255, 0, 0)
        # NOTE: the label (or bounding box) should be normalized using the image size so the values range from 0 to 1
        img, x, y, w, h = ..

        data = to_tensor(img)
        label = ..
        return data, label


## Losses and performance metrics on regression networks
The main differences between training a regression task and a classification task are the loss function and the performance metrics. Apart from that, the rest of the training loop is practically identical in PyTorch. A common loss function for regression is the Mean Squared Error (MSE), which can also serve as a performance metric. Another common metric for evaluating regression tasks is the Pearson correlation coefficient:

- **Mean Squared Error (MSE)** is a popular loss function used in regression. It measures the average of the squared differences between the predicted and actual values of the target variable. In other words, it calculates the average of the squared errors. A lower MSE value indicates better performance of the model in predicting the target variable.

- **Pearson correlation coefficient** is a performance metric commonly used in regression to evaluate the strength and direction of the linear relationship between the predicted values and the actual values of the target variable. The Pearson correlation coefficient, also known as Pearson's r, measures the degree of linear correlation between two variables, where a value of +1 indicates a perfect positive linear correlation, a value of 0 indicates no linear correlation, and a value of -1 indicates a perfect negative linear correlation. In regression tasks, a high Pearson correlation coefficient indicates that the predictions rise and fall linearly with the true values.
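
For $n$ true values $y_i$ with predictions $\hat{y}_i$, and means $\bar{y}$ and $\bar{\hat{y}}$, the two metrics can be written as:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad r = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The sketch below computes both by hand in plain Python on a few made-up values. The helper names `mse` and `pearson` and the sample numbers are assumptions for illustration only. It shows why the two metrics are best read together: predictions that are all off by the same constant still have a perfect correlation, while the MSE reveals the error.

```python
import math


def mse(y_hat, y):
    """Mean of the squared differences between predictions and targets."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


def pearson(x1, x2):
    """Pearson correlation coefficient between two equally long lists."""
    mean1 = sum(x1) / len(x1)
    mean2 = sum(x2) / len(x2)
    d1 = [a - mean1 for a in x1]
    d2 = [b - mean2 for b in x2]
    numerator = sum(a * b for a, b in zip(d1, d2))
    denominator = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return numerator / denominator


# Made-up targets, and predictions that are all 0.1 too high
y = [0.2, 0.4, 0.6, 0.8]
y_shifted = [value + 0.1 for value in y]

print(round(mse(y_shifted, y), 4))      # 0.01
print(round(pearson(y_shifted, y), 4))  # 1.0
```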

Run the code below to load function definitions to implement the MSE and Pearson correlation, and to evaluate the accuracy using these two metrics.

In [None]:
def pearson_correlation(x1, x2, eps=1e-8):
    """
    Calculates the Pearson correlation coefficient between two 1D tensors.

    Args:
        x1 (torch.Tensor): First input tensor (1D).
        x2 (torch.Tensor): Second input tensor (1D, with size matching x1).
        eps (float, optional): A small value added to the denominator to avoid division by zero. Default: 1e-8.

    Example:
        >>> input1 = torch.randn(128)
        >>> input2 = torch.randn(128)
        >>> output = pearson_correlation(input1, input2)
        >>> print(output)

    Returns:
        A tensor containing the Pearson correlation coefficient between x1 and x2.
    """
    assert x1.dim() == 1, "Input must be 1D matrix / vector."
    assert x1.size() == x2.size(), "Input sizes must be equal."
    x1_bar = x1 - x1.mean()
    x2_bar = x2 - x2.mean()
    dot_prod = x1_bar.dot(x2_bar)
    norm_prod = x1_bar.norm(2) * x2_bar.norm(2)
    return dot_prod / norm_prod.clamp(min=eps)


def MSE(y_hat, y):
    """
    Calculates the Mean Squared Error (MSE) between two 1D tensors.

    Args:
        y_hat (torch.Tensor): Predictions tensor (1D).
        y (torch.Tensor): Ground truth tensor (1D, with size matching y_hat).

    Example:
        >>> y_hat = torch.randn(128)
        >>> y = torch.randn(128)
        >>> output = MSE(y_hat, y)
        >>> print(output)

    Returns:
        A tensor containing the Mean Squared Error (MSE) between y_hat and y.
    """
    return torch.square(y - y_hat).mean()


def evaluate_accuracy(net, data_iter):
    """
    Evaluates the accuracy of a network on a given dataset.

    Args:
        net (nn.Module): The network to evaluate.
        data_iter (DataLoader): The DataLoader representing the dataset to evaluate.

    Returns:
        A tuple containing the Pearson correlation coefficient and the Mean Squared Error (MSE)
        between the network predictions and the ground truth targets on the dataset.
    """
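    # Evaluate in inference mode and without tracking gradients
    if isinstance(net, torch.nn.Module):
        net.eval()
    metric = d2l.Accumulator(3)  # Sum of Pearson r, sum of MSE, no. of batches
    with torch.no_grad():
        for X, y in data_iter:
            y_hat = net(X).flatten()
            y = y.flatten()
            metric.add(pearson_correlation(y_hat, y), MSE(y_hat, y), 1)
    # Average both metrics over the batches
    return metric[0] / metric[2], metric[1] / metric[2]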


## Training loop for bounding-box regression
As mentioned above, apart from the loss and performance metrics, very few things change in the regression training loop. Therefore, you should be able to fill the gaps in the following code cell.

**Exercise:**
* Fill in the gaps in the code to
  * Use the network to predict the output `y_hat` based on the input `X`.
  * Calculate the loss with the loss function by comparing the predicted and true output (`y`)

In [None]:
def train_epoch_regression(net, train_iter, loss, updater):
    """Similar training loop to the one defined in Chapter 3."""
    # Set the model to training mode
    if isinstance(net, torch.nn.Module):
        net.train()
    # Sum of training loss, sum of Pearson correlation, no. of label values, no. of batches
    metric = d2l.Accumulator(4)
    for X, y in train_iter:
        # TODO: Use the network to predict the output y_hat for the given input X
        y_hat = ..
        # TODO: calculate the loss from the predicted and true y values
        l = loss(..,  ..)

        # Compute gradients and update parameters
        updater.zero_grad()
        l.backward()
        updater.step()

        # Log metrics: batch-mean loss, batch Pearson correlation, no. of label values, no. of batches
        metric.add(float(l), pearson_correlation(y_hat.flatten(), y.flatten()), y.size().numel(), 1)

    # Return training loss and training correlation, both averaged over the batches
    # (the loss is already a per-batch mean, so it is divided by the number of batches)
    return metric[0] / metric[3], metric[1] / metric[3]


def train_regression(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train a model (similar to the one defined in Chapter 3)."""
    animator = d2l.Animator(
        xlabel="epoch", xlim=[1, num_epochs], ylim=[0, 1], legend=["train loss", "train acc", "test acc", "test loss"]
    )
    for epoch in range(num_epochs):
        # Run one training epoch
        train_metrics = train_epoch_regression(net, train_iter, loss, updater)
        # Evaluate the network
        test_acc, test_loss = evaluate_accuracy(net, test_iter)
        # Show the performance of the network during training
        animator.add(epoch + 1, train_metrics + (test_acc, test_loss))
        
    train_loss, train_acc = train_metrics


## Training parameters
**Exercise:** You need to define the following parameters before starting to train:
1. Loss: use `nn.MSELoss()` ([documentation](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html))
2. Define the train and test dataset using the earlier defined class `EllipseDataset`. Use 1000 images in the training set and 500 in the test set, with the img_size set to `img_size`.
3. Some code is already defined for you to create the dataloaders based on the dataset. Check the code to see how this is done.
4. Your network. Define the MLP as follows:
  * First a flatten layer to flatten the 2D image into a 1D vector. 
    * Calculate the length of this input vector. Consider that the image has `img_size*img_size` pixels and each pixel has 3 channels (red, green, blue).
  * A hidden (fully connected) layer with 256 hidden neurons and using ReLU activation 
  * An output layer with 4 output neurons (for the four bounding box coordinates)
5. Your trainer (also called optimizer) using `torch.optim.SGD(net.parameters(), lr=lr)`
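
To illustrate how a flatten layer, a hidden layer, and an output layer fit together, here is a small sketch on a deliberately different toy setting: 1-channel images of 8x8 pixels and 2 regression outputs. The sizes and the name `toy_net` are hypothetical and differ from the network you need to build. The point is only that the input length of the first linear layer must equal channels times height times width.

```python
import torch
from torch import nn

# Hypothetical toy setting: 1-channel 8x8 images and 2 regression targets
channels, size, n_outputs = 1, 8, 2

toy_net = nn.Sequential(
    nn.Flatten(),                            # [batch, 1, 8, 8] -> [batch, 64]
    nn.Linear(channels * size * size, 16),   # hidden layer with 16 neurons
    nn.ReLU(),
    nn.Linear(16, n_outputs),                # one output neuron per target
)

batch = torch.rand(5, channels, size, size)  # a batch of 5 random images
out = toy_net(batch)
print(out.shape)  # torch.Size([5, 2])
```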

In [None]:
# Defining some parameters
batch_size, lr = 16, 0.025
img_size = 100

# Loss
### TODO: 1. ADD YOUR CODE HERE (1 line)
loss = ..

# Let's load the dataset and create dataloaders
### TODO: 2. ADD YOUR CODE HERE TO LOAD THE DATASETS (~4 lines)
train_dataset = .. 
test_dataset = ..
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=2)

# Let's define the architecture of our MLP
### TODO: 4. CREATE YOUR MLP
net = ..

# Initialize the network with small random weights
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.02)


net.apply(init_weights)


# Define trainer (or optimizer)
### TODO: 5. ADD YOUR CODE HERE (1 line)
trainer = ..

## Training
Now we can start training our algorithm. Run the next cell to start training your MLP.

In [None]:
# Train
num_epochs = 50
train_regression(net, train_loader, test_loader, loss, num_epochs, trainer)


## Visualization of results

Now that we have trained the network, we can use it to predict the bounding-box coordinates of an image in the test set.

* Some steps we need to take:
  * We sample an image from the test set
  * Our network expects batches of images, that is, a tensor with dimensions `[batch_size, 3, img_size, img_size]`. Therefore, to feed a single image with dimensions `[3, img_size, img_size]` to the network, we need to expand its dimensions like this:
`img = img.unsqueeze(0)` to create a batch with one image
  * We use the trained network to predict the bbox coordinates for the image
  * As the network returns a tensor with bbox coordinates for all images in the batch, and since we have a batch of one image, we need to take index 0 of the network output
  * Remember that the bbox coordinates that we put in the training data are relative to the size of the image, so we have to upscale the predicted coordinates to match the size of the image.
* Study the code below to identify these steps
* Run the code to see a prediction
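
The shape changes in these steps can be checked in isolation. The sketch below uses a random tensor in place of a real test image and a hand-written tensor in place of a network output, so all values here are made up for illustration.

```python
import torch

size = 100  # illustrative image size

# A single image tensor: [channels, height, width]
single_img = torch.rand(3, size, size)

# Adding a batch dimension gives a batch containing one image
batch_of_one = single_img.unsqueeze(0)
print(batch_of_one.shape)  # torch.Size([1, 3, 100, 100])

# Stand-in for a network output: one row of normalized [x, y, w, h] per image
fake_output = torch.tensor([[0.5, 0.4, 0.2, 0.1]])

# Take the prediction for the only image and scale it back to pixels
pred_pixels = fake_output[0] * size
print(pred_pixels.shape)  # torch.Size([4])
```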

In [None]:
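test_img_id = 10
test_data = test_dataset[test_img_id]  # (image tensor, label tensor)
test_img = test_data[0]
test_img_input = test_img.unsqueeze(0)  # Add a batch dimension: [1, 3, img_size, img_size]
with torch.no_grad():  # No gradients are needed for inference
    pred = net(test_img_input)
pred = pred[0]  # Take the only prediction in the batch
pred = pred * img_size  # Scale the normalized coordinates back to pixels
print('Predicted bounding-box coordinates:', pred)

# Show the image and the predicted bounding box
img_show = to_pil_image(test_img)
img_show = plot_bbox(img_show, pred[0], pred[1], pred[2], pred[3])
img_show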

**Exercise:**
* Use the code above to finish the code cell below, so that it plots the results for multiple test images

In [None]:
# Setting our network to eval mode
net.eval()

# Let's select 10 images and pass them through our network
test_imgs = [test_dataset[idx][0] for idx in range(10)]
display_imgs = []
for test_img in test_imgs:
    # TODO: unsqueeze the test image, run it through the network
    #       and scale the predicted bbox coordinates to the image size

    # Draw the bounding box in the image for visualization
    img_show = to_pil_image(test_img) # Convert img from tensor to PIL image format
    img_show = plot_bbox(img_show, pred[0], pred[1], pred[2], pred[3])
    display_imgs.append(img_show)

# Plot the results
plot_grid(imgs=display_imgs, nrows=2, ncols=5)


## Making our dataset more challenging
Until now, we've been using only red ellipses over a black background. Let's make things more challenging by having random ellipse colors. You can use the following line to define a random RGB color:
```python
color = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))
```

**Exercise:**
* Change the code in `__getitem__` to now create ellipses with random colors
* Train and test a new network on this data by completing the three code cells below.

In [None]:
class EllipseDataset(Dataset):
    """
    This PyTorch dataset generates a set of images containing a random ellipse with a corresponding label.
    The dataset generates a specified number of images with the given size.

    Attributes:
        - number_imgs (int): the number of images to generate
        - img_size (int): the size of each image in pixels
    """

    def __init__(self, number_imgs, img_size):
        """
        Constructs a new EllipseDataset object.

        Parameters:
            - number_imgs (int): the number of images to generate
            - img_size (int): the size of each image in pixels
        """
        self.number_imgs = number_imgs
        self.img_size = img_size


    def __len__(self):
        """
        Returns the number of images in the dataset.

        Returns:
            An integer representing the number of images in the dataset.
        """
        return self.number_imgs

    def __getitem__(self, idx):
        """
        Generates a random image with a corresponding label.

        Parameters:
            - idx (int): the index of the image to generate. Not used in this case as we generate images randomly.

        Returns:
            A tuple containing:
            - data (Tensor): a tensor representing the image data
            - label (Tensor): a tensor representing the label data
        """
        # TODO: create an image with an ellipse with random colors using the provided function.


        data = to_tensor(img)
        return data, label


In [None]:
# Defining some parameters
batch_size, lr, num_epochs = 16, 0.025, 50
img_size = 100

# Loss
### TODO: ADD YOUR CODE HERE (1 line)


# Let's load the dataset and create dataloaders
### TODO: ADD YOUR CODE HERE (~4 lines)

# Let's define the architecture of our MLP
### TODO: CREATE YOUR MLP


def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.02)


net.apply(init_weights)


# Define trainer (or optimizer)
### TODO: ADD YOUR CODE HERE (1 line)


# Train
train_regression(net, train_loader, test_loader, loss, num_epochs, trainer)


In [None]:
# Setting our network to eval mode
net.eval()

# Let's select 10 images and pass them through our network
imgs = [test_dataset[idx][0] for idx in range(10)]
display_imgs = []
for img in imgs:
    ### TODO: pass the image through the network and scale the predicted bounding box according to the image size (~3 lines)


    # Convert img from tensor to PIL image format
    img = to_pil_image(img)

    # TODO: plot the bounding box into the image using the provided function in this notebook (~1 line)


    # Add image to list
    display_imgs.append(img)

# Plot the results
plot_grid(imgs=display_imgs, nrows=2, ncols=5)


## Even more challenging: random background color

You can see that, with a random ellipse color, the MLP can still predict the location to some extent, although the accuracy is lower. Let's now make the task even harder by also giving the background a random color.

**Exercise:** 
* Change the dataset class so now the color of the background is random as well. You can do so by using the parameter `color_background` in the function `draw_random_ellipse`.
```python
draw_random_ellipse(..., color_ellipse=color_ellipse, color_background=color_background)
```
* Train the same network from scratch for 50 epochs. 
* Evaluate the results. Also look at how the loss and accuracy develop over the epochs during training. Would it help to train for more epochs?
* Try a more complex MLP network architecture. Does that improve results? 

In [None]:
# TODO: Now also add a random color to the background.
#       Train the network again from scratch, and
#       Evaluate the results.

