In [None]:
import math
import random
import requests
import time

import numpy as np
import torch
import torchvision

import matplotlib.pyplot as plt
import torch.nn.functional as F

from io import BytesIO
from PIL import Image

# Lakota AI Code Camp Lesson 14: Introduction to Neural Networks IV - Loss Functions and Metrics

## Introduction

We've already seen loss functions and metrics in prior lessons, when we watched neural networks train.
Last time the loss function was **Mean Squared Error**, but there are others.

We've also seen metrics.
For example, when we trained the neural network model yesterday, the metric we were concerned about was the accuracy of our predictions.

We'll go more into this later in the lesson.

## Loss Functions

A loss function is a function that is a proxy for how well our model predicts a label or category.
In the context of neural networks, we require it to output a real number.
In particular, this is the number we want to reduce by training our neural network.

We'll go through some loss functions, their mathematical definition, and how to program them from scratch.

### Binary Cross Entropy

Binary cross entropy is a loss function that you would use if you were trying to use a model to classify two categories: for example, is this a picture of a dog or cat?

The mathematical definition is:
$$
-\left[ y \log \widehat{y} + (1 - y) \log (1 - \widehat{y}) \right]
$$
where $y$ is the actual label (0 or 1) and $\widehat{y}$ is the predicted probability that the label is 1.
Let's define it from scratch:

In [None]:
def binary_cross_entropy(actual, predict):

    loss = actual * torch.log(predict) + (1 - actual) * torch.log(1 - predict)

    return -loss

In [None]:
input = torch.nn.Sigmoid()(torch.randn(3))
target = torch.bernoulli(input)

In [None]:
binary_cross_entropy(target, input)

tensor([0.5331, 0.0360, 0.3462])

In [None]:
torch.nn.BCELoss(reduction='none')(input, target)

tensor([0.5331, 0.0360, 0.3462])

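There is a possible problem.
If our model is very confident in its prediction, either negative or positive, then either $\widehat{y}$ or $1 - \widehat{y}$ is very close to zero.
You may or may not know that the logarithm function has a singularity at 0: as its input approaches 0, $\log$ goes to $-\infty$.
So when a confident prediction is wrong, the loss becomes extremely large, and at exactly 0 it is infinite.
The gradient blows up in the same way, which can produce `inf` or `nan` values and derail training.
PyTorch's `BCELoss` guards against this by clamping its log outputs to be at least $-100$.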
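One common way to avoid the singularity in a from-scratch version is to keep the prediction strictly inside $(0, 1)$ before taking the logarithm. The sketch below is only an illustration: the function name `binary_cross_entropy_stable` and the constant `eps=1e-7` are assumptions made for this example, not something PyTorch defines.

```python
import torch

def binary_cross_entropy_stable(actual, predict, eps=1e-7):
    # eps is an assumed small constant; it keeps predict away from exactly 0 and 1
    predict = torch.clamp(predict, eps, 1 - eps)
    loss = actual * torch.log(predict) + (1 - actual) * torch.log(1 - predict)
    return -loss

actual = torch.tensor([1.0, 0.0, 1.0])
predict = torch.tensor([1.0, 0.0, 0.0])  # fully confident; the last prediction is wrong

# The unclamped formula hits log(0) and produces nan or inf values
naive = -(actual * torch.log(predict) + (1 - actual) * torch.log(1 - predict))
stable = binary_cross_entropy_stable(actual, predict)

print(naive)
print(stable)
```

The clamped version stays finite everywhere, while still giving the confidently wrong prediction a large loss.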

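In the current PyTorch version at the time of writing, 2.0.1, there are only two meaningful parameters for this loss function:
1.  `weight`, a manual rescaling weight applied to the loss of each batch element;
1.  `reduction`, which controls how the per-example losses are combined (`'none'`, `'mean'`, or `'sum'`; the default is `'mean'`).

At most, you will probably only change `reduction`, but it's usually better to stick with the default parameters.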
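To see what `reduction` does, here is a small sketch with made-up predictions and labels (the numbers are chosen only for illustration). It compares the three settings of `torch.nn.BCELoss`.

```python
import torch

predict = torch.tensor([0.9, 0.2, 0.6])  # made-up predicted probabilities
actual = torch.tensor([1.0, 0.0, 1.0])   # made-up labels

# 'none' keeps one loss value per example
per_example = torch.nn.BCELoss(reduction='none')(predict, actual)
# 'mean' (the default) averages the per-example losses
mean_loss = torch.nn.BCELoss(reduction='mean')(predict, actual)
# 'sum' adds them up
sum_loss = torch.nn.BCELoss(reduction='sum')(predict, actual)

print(per_example)
print(mean_loss, per_example.mean())
print(sum_loss, per_example.sum())
```

Training loops usually want a single number to call `.backward()` on, which is why the default is `'mean'`.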


### Cross Entropy

Cross entropy is the generalization of binary cross entropy loss.
We can have arbitrarily many categories as opposed to only two categories.
Suppose that we have $n$ categories. Then the loss is:
$$
-\sum_{i = 1}^{n} y_{i} \log \widehat{y}_{i},
$$
where ${\bf y} = (y_{1}, \ldots, y_{n})$ is our one-hot encoding and ${\bf \widehat{y}} = (\widehat{y}_{1}, \ldots, \widehat{y}_{n})$ is the prediction.

In practice, ${\bf y}$ will typically have only one nonzero entry, say $y_{j} = 1$, so our loss reduces to $-\log \widehat{y}_{j}$.

This loss function is heavily used in image recognition.

In [None]:
predict = torch.randn(3, 5, requires_grad=True)  # raw scores (logits); CrossEntropyLoss applies softmax internally
actual = torch.empty(3, dtype=torch.long).random_(5)

In [None]:
torch.nn.CrossEntropyLoss(reduction='none')(predict, actual)

tensor([1.7429, 1.7805, 1.6963], grad_fn=<NllLossBackward0>)
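As with the other losses, we can sketch cross entropy from scratch. One detail worth knowing is that `torch.nn.CrossEntropyLoss` expects raw scores (logits) and applies the softmax itself, so the sketch below does the same. The function name `cross_entropy` and the example labels are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

def cross_entropy(logits, actual):
    # logits: (batch, n) raw scores; actual: (batch,) integer class indices
    log_probs = F.log_softmax(logits, dim=1)  # log of the predicted probabilities
    one_hot = F.one_hot(actual, num_classes=logits.shape[1]).float()
    # -sum_i y_i * log(y_hat_i), computed separately for each example
    return -(one_hot * log_probs).sum(dim=1)

logits = torch.randn(3, 5)
actual = torch.tensor([0, 3, 4])  # made-up class labels

ours = cross_entropy(logits, actual)
builtin = torch.nn.CrossEntropyLoss(reduction='none')(logits, actual)

print(ours)
print(builtin)
```

Because the one-hot vector has a single nonzero entry, each loss value is just the negative log-probability of the correct class.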

### $L_{1}$ loss

The $L_{1}$ loss is generated by what is called a **norm**.
The most common norm that you may have seen is the absolute value.
Mathematically, this loss is defined as:
$$
\sum_{i = 1}^{n} |x_{i} - y_{i}|
$$
where ${\bf x} = (x_{1}, \ldots, x_{n})$ and ${\bf y} = (y_{1}, \ldots, y_{n})$ are $n$-dimensional vectors.

In [None]:
def l1_loss(actual, predict):

    loss = torch.abs(actual - predict)
    loss = torch.sum(loss)

    return loss

In [None]:
x = torch.tensor([1, 2, 3], dtype=torch.float)
y = torch.tensor([0, 1, 2], dtype=torch.float)

In [None]:
l1_loss(x, y)

tensor(3.)

In [None]:
torch.nn.L1Loss(reduction='sum')(x, y)

tensor(3.)

### Mean Squared Error

We saw this error last time.
It also comes from a norm, but this norm is the usual distance you think of (as the crow flies).

Mathematically, this loss is defined as:
$$
\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \widehat{y}_{i})^{2}
$$
where ${\bf y} = (y_{1}, \ldots, y_{n})$ and ${\bf \widehat{y}} = (\widehat{y}_{1}, \ldots, \widehat{y}_{n})$ are the actual and predicted values.

In [None]:
def mse_loss(actual, predict):

    loss = torch.square(actual - predict)
    loss = torch.mean(loss)

    return loss

In [None]:
predict = torch.randn(3, 5)
actual = torch.randn(3, 5)

In [None]:
mse_loss(actual, predict)

tensor(1.1471)

In [None]:
torch.nn.MSELoss()(predict, actual)

tensor(1.1471)

### Object Detection Loss

We'll talk more about this in a later lecture, due to the complexity of the loss and metrics.

## Metrics

### Accuracy

One of the most common metrics is accuracy.
It's basically the number of correct predictions divided by the total number of predictions.
This metric is particularly useful with image recognition models.

Let's look at code from the last lecture:

In [None]:
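def evaluate(model, test_loader):
    # `device` is assumed to be defined earlier in the notebook, as in the last lecture
    model.eval()  # switch off training-only behavior such as dropout
    correct = 0
    total = 0

    with torch.no_grad():  # gradients are not needed for evaluation
        for inputs, label in test_loader:
            inputs, label = inputs.to(device), label.to(device)
            outputs = model(inputs)

            # the predicted class is the index of the largest output
            _, predicted = torch.max(outputs, 1)
            total += label.size(0)
            correct += (predicted == label).sum().item()

    accuracy = 100 * correct // total
    print(f'Accuracy of the network on {total} test images: {accuracy} %')
    return accuracy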
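To see the core of the accuracy computation without a model or a data loader, here is a tiny sketch on made-up outputs and labels (all numbers are invented for illustration).

```python
import torch

# Made-up model outputs for 4 examples and 3 classes
outputs = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9],
                        [1.2, 0.4, 0.8]])
labels = torch.tensor([0, 1, 1, 0])  # made-up correct classes

predicted = outputs.argmax(dim=1)             # index of the largest score per row
correct = (predicted == labels).sum().item()  # number of matches
accuracy = correct / labels.size(0)

print(predicted)
print(accuracy)
```

Here the third example is predicted as class 2 but labeled 1, so 3 out of 4 predictions are correct.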

### Area Under ROC curve (AUROC)

Before we talk about AUROC, we'll make a short digression into the different ways a model can classify outputs.
Further, we're going to assume that we're talking about binary models only.
The **true positive rate** is the number of actual positive results classified as positive divided by the total number of positives.
The **true negative rate** is similar.
It's the number of actual negative results classified as negative divided by the total number of negatives.

The **false positive rate** is the number of negatives that were incorrectly classified as positive by our model, divided by the total number of negatives.
The **false negative rate** is similar.
It is the number of positives that were incorrectly classified as negative, divided by the total number of positives.
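The four rates can be computed by simple counting. The sketch below uses plain Python lists with made-up labels and predictions; the function name `classification_rates` is an assumption made for this example.

```python
def classification_rates(actual, predicted):
    # Count each of the four outcomes; 1 means positive, 0 means negative
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

    positives = tp + fn  # everything that is actually positive
    negatives = tn + fp  # everything that is actually negative

    return {
        "tpr": tp / positives,
        "fnr": fn / positives,
        "tnr": tn / negatives,
        "fpr": fp / negatives,
    }

actual    = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up labels
predicted = [1, 0, 1, 0, 1, 0, 0, 1]  # made-up predictions

rates = classification_rates(actual, predicted)
print(rates)
```

Notice that the true positive rate and false negative rate add up to 1, and so do the true negative rate and false positive rate.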

The ROC (receiver operating characteristics) curve is graphed with the $y$-axis as the true positive rate and the $x$-axis as the false positive rate.

Models are supposed to give predictions, but given an input our model spits out a number.
It's up to us to determine how to use that number.
That's where thresholding comes in.
We plot the true positive rate vs the false positive rate for various thresholds.

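For example, we can say that anything above $0.5$ will be classified as a positive example.
We can replace $0.5$ with any number between 0 and 1, and each choice gives one point on the graph.
Sweeping the threshold from 0 to 1 traces out the ROC curve.

Finally, AUROC is exactly what its name says: the area under the ROC curve.
A perfect classifier has an AUROC of 1, while random guessing gives about 0.5.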

We won't use it, but it's an important metric in data science, so you should try to remember it.
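To make the thresholding idea concrete, here is a minimal sketch that builds ROC points for a handful of thresholds and estimates the area with the trapezoid rule. The scores, labels, thresholds, and the function names `roc_points` and `auroc` are all assumptions made for this example; libraries such as scikit-learn provide more careful implementations.

```python
def roc_points(actual, scores, thresholds):
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        # anything scoring above the threshold is classified as positive
        predicted = [1 if s > t else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        # x is the false positive rate, y is the true positive rate
        points.append((fp / negatives, tp / positives))
    return sorted(points)

def auroc(points):
    # trapezoid rule over consecutive (x, y) points
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

actual = [0, 0, 1, 1]            # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]   # made-up model outputs
thresholds = [0.0, 0.1, 0.35, 0.4, 0.8, 1.0]

points = roc_points(actual, scores, thresholds)
print(points)
print(auroc(points))
```

With more thresholds the curve gets more detailed, but for a small dataset like this one, using the scores themselves as thresholds already captures every change in the curve.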

### IoU

We'll briefly go over the IoU metric.
IoU stands for Intersection over Union.
Given two bounding boxes, we calculate the area of their intersection and the area of their union.
The IoU of two bounding boxes is the area of the intersection divided by the area of the union.
We'll go over this more in a future lecture.
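As a preview, here is a minimal sketch of IoU for two axis-aligned boxes. It assumes each box is given as `(x1, y1, x2, y2)` with the top-left corner first and the bottom-right corner second; that format and the function name `iou` are assumptions for this example, and other box formats exist.

```python
def iou(box_a, box_b):
    # Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero if the boxes do not overlap
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # The overlap is counted in both areas, so subtract it once
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partially overlapping boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # boxes that do not touch
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
```

IoU is always between 0 (no overlap) and 1 (the boxes are identical).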
