# Logistic regression and Gradient descent
This notebook has been created by Oscar Pina (oscar.pina@upc.edu) for the DLAI course (Fall 2023).

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import torch
import torch.nn.functional as F
import functorch
import matplotlib.pyplot as plt
import seaborn as sns

## Dataset

The dataset we are going to use in this notebook is the Breast Cancer Wisconsin (Diagnostic) dataset [1]. The features are statistics of the cell nuclei detected in distinct images, and the target is whether the tumor in the image is benign or malignant. The dataset contains the mean, standard error and worst value of cell nuclei texture and morphological features, specifically: the radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.

[1] K. P. Bennett and O. L. Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.

### Load and visualize data

In this notebook, we are only going to use the "mean" value of each nucleus feature.

In [2]:
# load dataset
dataset = load_breast_cancer(as_frame=True) # include dataframe for visualization

# load dataframe
df = dataset['frame']

# dataset columns with the mean values
mean_columns = [ c for c in dataset['frame'].columns if c.startswith("mean") ]

N = df.shape[0]
D = len(mean_columns)
# dataset statistics
print(f"Number of samples:  {N}")
print(f"Number of features: {D}")
print(f"Number of malignant samples (target = 0): {df[df.target==0].shape[0]}")
print(f"Number of benign samples (target = 1):    {df[df.target==1].shape[0]}")

Number of samples:  569
Number of features: 10
Number of malignant samples (target = 0): 212
Number of benign samples (target = 1):    357


In [3]:
# visualize data (it may take a while)
sns.pairplot(df,
             x_vars=mean_columns,  # pairplot expects column names here
             y_vars=mean_columns,
             hue='target')
plt.show()


As you may have observed, some of the features are redundant, that is, one feature could easily be predicted from another. Here, it would be interesting to apply some feature engineering techniques to remove the redundant information. However, we will skip this step as it is not the main goal of this lab session.

In [4]:
# load features and target into X and Y matrices, respectively
X = df[mean_columns].values
Y = df['target'].values

### Split data

The quality of a model is determined by its ability to generalize rather than to memorize, that is, by its performance on data not seen during training, not by how well it memorizes the training data. Therefore, we must split the data into training, validation and test sets. The purpose of each partition is:
1. Training set: data used for training. We compute the loss function and update the parameters of the model based on the result (i.e., via gradient descent).
2. Validation set: data used to tune the hyperparameters of the model such as learning rate, regularization, etc. We measure the performance on this set and select the best configuration for our problem.
3. Test set: data used to assess the generalization of the model.

For now, do not worry about it; you will see more about this in future lectures. We will use 60 % for training, 20 % for validation and the remaining 20 % for testing.

In [5]:
# split dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train, X_val,  Y_train, Y_val  = train_test_split(X_train, Y_train, test_size=0.2/0.8)

N_train, N_val, N_test = X_train.shape[0], X_val.shape[0], X_test.shape[0]
print(f"Number of training samples: {N_train}")
print(f"Number of validation samples: {N_val}")
print(f"Number of testing samples: {N_test}")

Number of training samples: 341
Number of validation samples: 114
Number of testing samples: 114
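
The second call uses `test_size=0.2/0.8` because it operates on the 80 % of the data left after the first split: 20 % of the whole dataset is 25 % of that remainder. The arithmetic can be checked with a small sketch in plain Python. `split_sizes` is a hypothetical helper written for this illustration, and it assumes that each held-out partition is rounded up to a whole sample, which reproduces the counts printed above.

```python
import math

def split_sizes(n_samples, test_fraction, val_fraction):
    """Return (n_train, n_val, n_test) for two successive hold-out splits.

    Hypothetical helper: assumes each held-out partition is rounded up.
    """
    # first split: hold out the test set from the full dataset
    n_test = math.ceil(test_fraction * n_samples)
    n_remaining = n_samples - n_test

    # second split: the validation fraction must be rescaled, because it is
    # now taken from the remaining samples instead of the full dataset
    rescaled_val_fraction = val_fraction / (1.0 - test_fraction)
    n_val = math.ceil(rescaled_val_fraction * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(569, test_fraction=0.2, val_fraction=0.2))  # (341, 114, 114)
```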


In [6]:
# convert dataset to PyTorch
X_train, Y_train = torch.from_numpy(X_train).float(), torch.from_numpy(Y_train).float().unsqueeze(1)
X_val,   Y_val   = torch.from_numpy(X_val).float(),   torch.from_numpy(Y_val).float().unsqueeze(1)
X_test,  Y_test  = torch.from_numpy(X_test).float(), torch.from_numpy(Y_test).float().unsqueeze(1)

### Standardize the data

We usually need to standardize the data by subtracting the mean and dividing by the standard deviation. This is applied to every sample of the dataset:

$$
 \mu = \frac{1}{N} \sum_{i=1}^N x_i \in \mathbb{R}^d
$$

$$
std = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2} \in \mathbb{R}^d
$$

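Each sample is then standardized as $\tilde{x}_i = (x_i - \mu) / std$, where the square, the square root and the division are applied element-wise. Note that $\mu, std \in \mathbb{R}^d$, that is, they can be multi-dimensional. This is because the mean and std are computed independently for every feature, and the standardization is carried out separately for each feature too.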

Traditionally, $\mu$ and $std$ are computed on the training data, and then used for the training, validation and test samples.
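
To make the "fit on training data, apply everywhere" convention concrete, here is a minimal sketch in plain Python (no PyTorch, to keep it dependency-free). The helper names `fit_standardizer` and `apply_standardizer` and the toy values are assumptions made for this illustration only.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Per-feature mean and population (1/N) standard deviation of a list of rows."""
    columns = list(zip(*rows))  # one tuple per feature
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Subtract the mean and divide by the std, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# toy data with two features on very different scales (values are made up)
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
val = [[2.0, 700.0]]

means, stds = fit_standardizer(train)           # statistics from training data only
train_std = apply_standardizer(train, means, stds)
val_std = apply_standardizer(val, means, stds)  # reuse the training statistics

print(means)    # [2.0, 300.0]
print(val_std)  # the second value lies well outside the training range
```

After this step every training feature has zero mean and unit standard deviation, whereas validation samples are expressed relative to the training distribution.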

In [9]:
def standardize(x, mean, std):
  """Standardizes the given data.

  Args:
    x: A torch tensor of shape (n_samples, n_features) containing the data to be standardized.
    mean: A torch tensor broadcastable to x, e.g. of shape (n_features,), containing the mean of each feature.
    std: A torch tensor broadcastable to x, e.g. of shape (n_features,), containing the standard deviation of each feature.

  Returns:
    A torch tensor of the same shape as x containing the standardized data.
  """
  # TODO : standardize the data by the given mean and std
  # subtract the mean first, then scale by the standard deviation
  x_std = (x - mean) / std
  return x_std

In [23]:
# TODO : get mean of training data
mean = torch.mean(X_train, dim=0)  # one mean per feature

# TODO : get std of training data
import math
# one std per feature (dim=0); unbiased=False matches the 1/N formula above
std = torch.std(X_train, dim=0, unbiased=False)

# standardization
X_train = standardize(X_train, mean, std)
X_val   = standardize(X_val,   mean, std)
X_test  = standardize(X_test,  mean, std)

## Logistic regression

Our model is a log-linear classifier composed of a linear regression layer $(f_{W,b})$, which computes a weighted sum of the input features, followed by a sigmoid function $(\sigma)$, which maps the output to a probability between 0 and 1, as required for a classification problem.

Therefore, given an input sample $x_i$, we make a prediction $\hat{y}_i=\sigma( f_{W,b}(x_i) )$, where...

- $P(y=1 | x=x_i) = \hat{y}_i$
- $P(y=0 | x=x_i) = 1-\hat{y}_i$

### Define the model
The first step is to define the linear regression layer, implemented as a linear transformation applied to the input data and a bias term added to the result.

$$
 f_{W,b} : x \in \mathbb{R}^d \rightarrow z = x W^T + b \in \mathbb{R}
$$

In [11]:
def linear_regression(x, w, b):
  """Computes the linear regression prediction for the given input data and weights.

  Args:
    x: A torch tensor of shape (n_samples, n_features).
    w: A torch tensor of shape (n_features, 1).
    b: A torch tensor of shape (1, 1).

  Returns:
    A torch tensor of shape (n_samples, 1) containing the linear regression
    predictions.
  """
  # TODO : define the linear regression layer
  # matrix product: (n_samples, n_features) @ (n_features, 1) -> (n_samples, 1)
  z = x @ w + b
  return z

Nonetheless, a linear transformation can output values outside the range [0, 1], so they cannot be interpreted as probabilities. To overcome this issue, the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is applied:

 $$
 \sigma : z \in \mathbb{R} \rightarrow y \in (0, 1)
 $$

In [18]:
def sigmoid(x):
  """Computes the sigmoid function of the given input data.

  Args:
    x: A torch tensor of any shape.

  Returns:
    A torch tensor of the same shape as x containing the sigmoid function values.
  """
  # TODO : define the sigmoid function
  # torch.exp works element-wise on tensors (math.exp only accepts scalars)
  y = 1 / (1 + torch.exp(-x))
  return y
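
As a quick sanity check of this definition, the sketch below evaluates the sigmoid at a few points using plain Python floats. `sigmoid_scalar` is a hypothetical scalar helper introduced only for this illustration; it is not part of the notebook's code.

```python
import math

def sigmoid_scalar(z):
    """Sigmoid of a single Python float (hypothetical scalar counterpart)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid_scalar(z):.4f}")
# sigmoid(-4.0) = 0.0180
# sigmoid(+0.0) = 0.5000
# sigmoid(+4.0) = 0.9820
```

The values are symmetric around 0.5, that is, $\sigma(-z) = 1 - \sigma(z)$.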

As you can see, the output of the sigmoid function is restricted to the interval (0, 1), with $\sigma(0.0) = 0.5$:

In [24]:
# visualize the function
x_range = torch.arange(-10, 10, step=0.5)

# plot
plt.figure(figsize=(10,5))
plt.plot(x_range, sigmoid(x_range))
plt.grid()
plt.show()


Finally, we can implement the composition of these two functions to get our logistic regressor:

In [16]:
def logistic_regression(x, w, b):
  """Computes the logistic regression prediction for the given input data and weights.

  Args:
    x: A torch tensor of shape (n_samples, n_features).
    w: A torch tensor of shape (n_features, 1).
    b: A torch tensor of shape (1, 1).

  Returns:
    A torch tensor of shape (n_samples, 1) containing the logistic regression
    predictions.
  """

  # TODO : Perform linear regression.
  z = linear_regression(x,w,b)

  # TODO : Apply the sigmoid function.
  y = sigmoid(z)

  return y

### Loss function

To train the model for binary classification, we employ the binary cross entropy loss. This quantity measures the cross-entropy between the predicted probability and the ground truth probability, averaged over all data samples.

$$
\begin{aligned}
J &= \frac{1}{N} \sum_{i=1}^{N} J_i \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log P(y=1|x=x_i) + (1-y_i) \cdot \log P(y=0|x=x_i) \right] \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i) \right]
\end{aligned}
$$

In [20]:
def binary_cross_entropy(y_pred, y_true):
  """Computes the binary cross entropy loss function for the given predicted and
  true output values.

  Args:
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted
      output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output
      values.

  Returns:
    A scalar torch tensor containing the binary cross entropy loss value.
  """
  # TODO : Define the binary cross entropy loss
  # torch.log and torch.mean accept tensors of any shape (math.log and len do not)
  loss = -torch.mean(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
  return loss

When the target = 1, the cost is a decreasing function of the prediction: the closer $P(y=1|x)$ is to 1, the lower the cost, whereas low values of $P(y=1|x)$ are heavily penalized. The opposite intuition applies when the target = 0.
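
A few numbers make this asymmetry tangible. The sketch below evaluates the per-sample loss $J_i$ in plain Python; `sample_bce` is a hypothetical scalar helper used only for this illustration.

```python
import math

def sample_bce(y_hat, y):
    """Per-sample binary cross entropy J_i for a probability y_hat and a label y."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# target = 1: confident and correct vs. confident and wrong
print(round(sample_bce(0.9, 1), 4))  # 0.1054
print(round(sample_bce(0.1, 1), 4))  # 2.3026
# target = 0: the roles are swapped
print(round(sample_bce(0.1, 0), 4))  # 0.1054
```

A confident wrong prediction costs roughly twenty times more than a confident correct one in this example.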

In [25]:
# visualize the loss function
plt.figure(figsize=(5, 5))

y_range = torch.arange(0, 1, step=0.01)

plt.plot(y_range, [binary_cross_entropy(y_range[i], torch.tensor(0.)) for i in range(y_range.shape[0])], label="target = 0" )
plt.plot(y_range, [binary_cross_entropy(y_range[i], torch.tensor(1.)) for i in range(y_range.shape[0])], label="target = 1" )
plt.ylabel("Loss function $(J_i)$")
plt.xlabel(r"$\hat{y} = P(y=1|x)$")  # raw string so that \h is not treated as an escape
plt.legend()
plt.grid()


The loss is optimized with respect to the parameters of the model via gradient descent (next section). Therefore, it is interesting to visualize the loss landscape according to these parameters. Whereas it is not possible to do it in high dimensional spaces, we can work with a lower dimensional problem to get an intuition.

In this example we will only work with the first feature of our dataset (the mean radius, selected as `X_train[:,0]` below), so that each data sample is a single scalar $x_i \in \mathbb{R}$. Therefore, our model only has 2 parameters: the weight $(w)$ and the bias $(b)$. Let's plot the loss function for our training set wrt $(w)$ and $(b)$.

In [26]:
def loss_mesh(x, y, wrange=(-10, 5), brange=(-10,5)):
  ws = torch.linspace(wrange[0], wrange[1], 100)
  bs = torch.linspace(brange[0], brange[1], 100)

  Jwb = torch.zeros(ws.size(0), bs.size(0))
  for i in range(ws.size(0)):
    for j in range(bs.size(0)):
      y_pred = logistic_regression(x, ws[i].view(1,1), bs[j].view(1,1))
      Jwb[i, j] = binary_cross_entropy(y_pred, y)
  return Jwb, ws, bs

def loss_mesh_plot(J, ws, bs):
  ww, bb = torch.meshgrid(ws, bs, indexing="ij")  # J[i, j] pairs ws[i] with bs[j]
  plt.figure()
  plt.scatter(ww.reshape(-1), bb.reshape(-1), c=J.reshape(-1))  # use the argument J, not a global
  plt.xlabel("weight")
  plt.ylabel("bias")

In [27]:
# clone the data (first feature only; it is already standardized)
x = X_train[:,0].clone().view(-1,1)
y = Y_train.clone()

# plot
Jwb, ws, bs = loss_mesh(x, y)
loss_mesh_plot(Jwb, ws, bs)
plt.show()


**EXERCISE: Why are there empty regions in the loss surface plot?**

## Gradient descent

In linear regression, the optimal values of the weights and bias (denoted as $\hat{w}$ and $\hat{b}$) can be obtained with a closed-form solution. However, for logistic regression no such closed-form solution exists in general. Instead, we employ an iterative optimization algorithm to get the optimal values: gradient descent.

The gradient descent algorithm consists of making an initial guess of the values of ${w}$ and ${b}$ (usually initialized as random) and iteratively:

 1. making the prediction on the training data with the given parameters

 2. evaluating the cost:
 $$J(w, b)$$

 3. computing the gradients of the cost wrt $w$ and $b$:
 $$\frac{\partial J}{\partial w}, \frac{\partial J}{\partial b} $$

 4. updating the values of $w$ and $b$ in the opposite direction of the gradient:

 $$ w \leftarrow w - \alpha \frac{\partial J}{\partial w} $$


 $$ b \leftarrow b - \alpha \frac{\partial J}{\partial b} $$

 where $\alpha$ is the learning rate, which defines the length of the step taken in the direction opposite to the gradient. Note that, in general, $w$ and $b$ can be a matrix and a vector, respectively; to simplify the notation, $b$ is often treated as one more weight. All weights and biases are updated simultaneously.
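
The four steps can be sketched end to end on a toy one-dimensional problem. This is a minimal illustration in plain Python, not the notebook's implementation: the data, the helper names and the hyperparameters are made up, and the gradient expressions are the standard ones for the binary cross entropy, which are derived in the next section.

```python
import math

def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_gradients(xs, ys, w, b):
    """Mean binary cross entropy and its gradients for a 1-D logistic model."""
    n = len(xs)
    loss, dw, db = 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid_scalar(w * x + b)  # step 1: prediction
        loss -= (y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)) / n  # step 2: cost
        dw += (y_hat - y) * x / n          # step 3: gradients
        db += (y_hat - y) / n
    return loss, dw, db

# toy 1-D data (made up): larger x tends to mean label 1, with some overlap
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b, alpha = 0.0, 0.0, 0.5
losses = []
for _ in range(200):
    loss, dw, db = loss_and_gradients(xs, ys, w, b)
    losses.append(loss)
    w, b = w - alpha * dw, b - alpha * db  # step 4: simultaneous update

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With $w = b = 0$ every prediction is 0.5, so the first loss equals $\log 2 \approx 0.6931$; the loss then decreases at every iteration for this step size.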

### Computing the gradients

The gradients of the binary cross entropy wrt the weight and bias of our model are well known. They are easy to derive by using the fact that the derivative of a sum is the sum of the derivatives, $\frac{\partial J}{\partial w} = \frac{1}{N} \sum_{i=1}^N \frac{\partial J_i}{\partial w}$, and noting that:

- $\frac{\partial J_i}{\partial w} = (\sigma(z_i) - y_i)x_i$


- $\frac{\partial J_i}{\partial b} = (\sigma(z_i) - y_i)$

where $z_i = f_{W,b}(x_i)$
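
One way to gain confidence in these expressions is to compare them with a numerical approximation. The sketch below does so for a single sample with central finite differences, in plain Python; the helper names and the chosen values of $x$, $y$, $w$ and $b$ are arbitrary assumptions for illustration.

```python
import math

def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_loss(x, y, w, b):
    """Per-sample binary cross entropy J_i of a 1-D logistic model."""
    y_hat = sigmoid_scalar(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def analytic_gradients(x, y, w, b):
    """Closed-form dJ_i/dw and dJ_i/db from the expressions above."""
    error = sigmoid_scalar(w * x + b) - y
    return error * x, error

def numeric_gradients(x, y, w, b, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    dw = (sample_loss(x, y, w + eps, b) - sample_loss(x, y, w - eps, b)) / (2 * eps)
    db = (sample_loss(x, y, w, b + eps) - sample_loss(x, y, w, b - eps)) / (2 * eps)
    return dw, db

x, y, w, b = 1.5, 1, 0.5, -0.5  # arbitrary values chosen for illustration
print(analytic_gradients(x, y, w, b))
print(numeric_gradients(x, y, w, b))
```

Both lines should agree to several decimal places; a mismatch would point to an error in either the derivation or the implementation.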

In [None]:
def compute_gradients(x, y_pred, y_true, w, b):
  """Computes the gradients of the binary cross entropy loss function with
  respect to the weights and biases.

  Args:
    x: A torch tensor of shape (n_samples, n_features).
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted
      output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output
      values.
    w: A torch tensor of shape (n_features, 1).
    b: A torch tensor of shape (1, 1).

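  Returns:
    A tuple (dw, db) with the gradients of the binary cross entropy loss
    function with respect to the weights, shape (n_features, 1), and the bias,
    shape (1, 1).
  """
  n_samples = x.shape[0]
  error = y_pred - y_true  # sigma(z_i) - y_i for every sample, shape (n_samples, 1)

  # TODO : Compute gradients for the weights
  dw = x.T @ error / n_samples

  # TODO : Compute gradients for the bias
  db = torch.mean(error, dim=0, keepdim=True)

  return dw, db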

### Gradient descent step

Given the gradients, we can update the weight and bias parameters in the appropriate direction.

Note that, since we can compute the gradients directly, in this problem we do not need to evaluate the loss function at the current values of $w$ and $b$ (step 2). However, for visualization purposes, we are going to evaluate it and return its value. With the given gradients, we can update the current values of $w$ and $b$ towards better ones.

In [None]:
def gradient_descent_step(x, y, w, b, learning_rate=1e-3):
  """Performs one step of gradient descent on the given logistic regression model.

  Args:
    x: A torch tensor of shape (n_samples, n_features) containing the input data.
    y: A torch tensor of shape (n_samples, 1) containing the output data.
    w: A torch tensor of shape (n_features, 1) containing the weights of the logistic regression model.
    b: A torch tensor of shape (1, 1) containing the bias of the logistic regression model.
    learning_rate: A float scalar representing the learning rate.

  Returns:
    A tuple of three torch tensors, where the first tensor contains loss, the second tensor contains the updated weights and the third tensor contains the updated bias.
  """

  # TODO : Compute the predicted output values.
  y_pred = logistic_regression(x, w, b)

  # TODO : Compute the loss function.
  loss = binary_cross_entropy(y_pred, y)

  # TODO : Compute the gradients of the loss function with respect to the weights and bias.
  dw, db = compute_gradients(x, y_pred, y, w, b)

  # TODO : Update the weights.
  w = w - learning_rate * dw

  # TODO : Update the bias
  b = b - learning_rate * db

  return loss, w, b


The length of the step depends on the magnitude of the gradient itself, but it is scaled by a hyperparameter, the learning rate $(\alpha)$, which has to be tuned for every problem, model and dataset. Values of $\alpha$ that are too large can prevent the model from converging, whereas values that are too small can make convergence extremely slow.
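
The effect of the learning rate is easiest to see on a cost simpler than ours. The sketch below runs gradient descent on the toy cost $J(w) = w^2$, which is an assumption made purely for illustration (it is not the logistic loss), with three values of $\alpha$; `run_gradient_descent` is a hypothetical helper.

```python
def run_gradient_descent(alpha, num_steps=20, w0=1.0):
    """Gradient descent on the toy cost J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(num_steps):
        w = w - alpha * 2.0 * w
    return w

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha = {alpha}: w after 20 steps = {run_gradient_descent(alpha):.4f}")
```

The minimum is at $w = 0$. With the smallest step the parameter has barely moved from its initial value of 1, with the intermediate step it is close to 0, and with the largest step every update overshoots the minimum so that $|w|$ grows.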

### One dimensional example
To check that the implementation works properly, and to visualize what is going on behind the scenes, we will run our logistic regression model with only one input feature, that is, $x_i \in \mathbb{R}$.

Feel free to modify the learning rate, initial values for weight and bias as well as the number of gradient descent steps to see the behavior.

In [None]:
# hyperparameters
learning_rate = 1e-1
num_steps     = 1000

# weight and bias initialization
w = torch.tensor([[4.0]])
b = torch.tensor([[-4.0]])

# clone the data
x = X_train[:,0].clone().view(-1,1)
y = Y_train.clone()

x_val = X_val[:,0].clone().view(-1,1)
y_val = Y_val.clone()

# gradient descent
train_losses, val_losses, ws, bs = list(), list(), list(), list()
for _ in range(num_steps):
  loss, w, b = gradient_descent_step(x, y, w, b, learning_rate=learning_rate)
  train_losses.append(loss)
  val_losses.append( binary_cross_entropy( logistic_regression(x_val, w, b), y_val) )
  ws.append(w.item())
  bs.append(b.item())

# visualization

# loss mesh
Jwb, wgrid, bgrid = loss_mesh(x, y)
loss_mesh_plot(Jwb, wgrid, bgrid)
plt.plot(ws, bs, color='red', marker='x')
plt.show()

We track the training loss to see whether the model is actually learning, and the validation loss to evaluate whether it is generalizing. **However, no gradients are computed on the validation set.** As we perform gradient descent steps, we should see the value of the loss function decrease. It is common to visualize the loss as a function of the gradient descent iteration:

In [None]:
plt.figure()
plt.plot(train_losses, label='train')
plt.plot(val_losses,   label='val')
plt.ylabel("Loss")
plt.xlabel("Gradient descent iteration")
plt.legend()
plt.show()

### Multidimensional
So far, we have worked with one-dimensional input signals $x_i \in \mathbb{R}$ for visualization purposes. However, our original dataset contains multiple dimensions and the extension is straightforward. Although we cannot visualize the loss landscape and the gradient updates, we can check that the loss function is actually decreasing after each gradient descent step.

**Exercise: if our input is 10-dimensional, how many parameters are in total in our model? (weights + bias)**

In [None]:
# clone the data
x = X_train.clone()
y = Y_train.clone()

x_val = X_val.clone()
y_val = Y_val.clone()

# weight and bias initialization
w = torch.randn(x.size(1), 1)
b = torch.tensor([[0.0]])

# gradient descent
train_losses, val_losses = list(), list()
for _ in range(10000):
  loss, w, b = gradient_descent_step(x, y, w, b, learning_rate=1e-3)
  train_losses.append(loss)
  val_losses.append( binary_cross_entropy( logistic_regression(x_val, w, b), y_val) )

plt.figure()
plt.plot(train_losses, label='train')
plt.plot(val_losses,   label='val')
plt.ylabel("Loss")
plt.xlabel("Gradient descent iteration")
plt.legend()
plt.show()

## Evaluation

There are several metrics to assess the performance of a binary classifier. For instance, accuracy is a well-known metric that measures the ratio of correctly classified samples. Here, we introduce a different set of metrics: precision, recall and F-score.

1. The precision measures the ratio between the correctly predicted positive samples and all samples predicted as positive, therefore:

$$ P = \frac{TP}{TP + FP} $$

2. The recall, instead, quantifies how many positive instances have been predicted from the total amount of positive instances in the data:

$$ R = \frac{TP}{TP + FN} $$

3. Finally, the F-score is the harmonic mean of the previous two metrics:
$$ F = 2 \frac{P \cdot R}{P + R} $$
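
As a worked example of the three formulas, the sketch below computes them from confusion-matrix counts in plain Python. The counts are made up, and `precision_recall_f1` is a hypothetical helper used only here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-score from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# made-up counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"P = {p:.3f}, R = {r:.3f}, F = {f:.3f}")  # P = 0.800, R = 0.667, F = 0.727
```

Being a harmonic mean, the F-score lies between the two metrics and closer to the lower one.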


In [None]:
def precision(y_pred, y_true):
  """Computes the precision of a binary classification model.

  Args:
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output values.

  Returns:
    A scalar torch tensor containing the precision of the model.
  """

  # TODO : get true positive
  tp = ((y_pred == 1) & (y_true == 1)).sum().float()

  # TODO : get false positive
  fp = ((y_pred == 1) & (y_true == 0)).sum().float()

  return tp / (tp + fp)

def recall(y_pred, y_true):
  """Computes the recall of a binary classification model.

  Args:
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output values.

  Returns:
    A scalar torch tensor containing the recall of the model.
  """

  # TODO : get true positive
  tp = ((y_pred == 1) & (y_true == 1)).sum().float()

  # TODO : get false negative
  fn = ((y_pred == 0) & (y_true == 1)).sum().float()

  return tp / (tp + fn)

def f1_score(y_pred, y_true):
  """Computes the F1 score of a binary classification model.

  Args:
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output values.

  Returns:
    A scalar torch tensor containing the F1 score of the model.
  """

  p = precision(y_pred, y_true)
  r = recall(y_pred, y_true)

  return 2 * (p * r) / (p + r)

def compute_metrics(y_pred, y_true):
  return precision(y_pred, y_true), \
         recall(y_pred, y_true), \
         f1_score(y_pred, y_true)

Now, we can see how our model performs on the validation set after each gradient descent iteration:

In [None]:
# clone the data (it is already standardized)
x = X_train.clone()
y = Y_train.clone()

# weight and bias initialization
w = torch.randn(x.size(1), 1)
b = torch.tensor([[0.0]])

# gradient descent
losses, ps, rs, fs = list(), list(), list(), list()
for _ in range(10000):
  # fit step
  loss, w, b = gradient_descent_step(x, y,
                                     w, b, learning_rate=1e-3)

  # eval
  y_pred_val = (logistic_regression(X_val, w, b)>0.5).long()
  p, r, f = compute_metrics(y_pred_val, Y_val)

  ps.append(p)
  rs.append(r)
  fs.append(f)

plt.figure()
plt.plot(ps, label='Precision')
plt.plot(rs, label='Recall')
plt.plot(fs, label='F1-Score')
plt.ylabel("Metrics")
plt.xlabel("Gradient descent iteration")
plt.legend()
plt.grid()
plt.show()

## Chain rule

We have computed the gradients of the loss wrt $w$ and $b$ by developing and simplifying the mathematical expression. However, there is another way to do it: applying the chain rule to every step:

 $$
 \frac{\partial J}{\partial w} = \frac{1}{N} \sum_{i=1}^N \frac{\partial J_i}{\partial w} = \frac{1}{N} \sum_{i=1}^N \frac{\partial J_i}{\partial \sigma(z_i)} \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial w}
 $$
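
For reference, the three factors can be written out and multiplied to confirm that the chain rule recovers the expression used earlier for $\frac{\partial J_i}{\partial w}$. The factors follow from differentiating the per-sample loss, the sigmoid and the linear projection:

$$
\begin{aligned}
\frac{\partial J_i}{\partial \sigma(z_i)} &= -\frac{y_i}{\sigma(z_i)} + \frac{1-y_i}{1-\sigma(z_i)} = \frac{\sigma(z_i) - y_i}{\sigma(z_i)\,(1-\sigma(z_i))} \\
\frac{\partial \sigma(z_i)}{\partial z_i} &= \sigma(z_i)\,(1-\sigma(z_i)) \\
\frac{\partial z_i}{\partial w} &= x_i, \qquad \frac{\partial z_i}{\partial b} = 1 \\
\frac{\partial J_i}{\partial w} &= \frac{\sigma(z_i) - y_i}{\sigma(z_i)\,(1-\sigma(z_i))} \cdot \sigma(z_i)\,(1-\sigma(z_i)) \cdot x_i = (\sigma(z_i) - y_i)\,x_i
\end{aligned}
$$

The same product with $\frac{\partial z_i}{\partial b} = 1$ in place of $x_i$ gives $\frac{\partial J_i}{\partial b} = \sigma(z_i) - y_i$.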

In [None]:
def dJi_dS(y_pred, y_true):
  """Computes the derivative of the binary cross entropy loss function with respect to the sigmoid function.

  Args:
    y_pred: A torch tensor of shape (n_samples, 1) containing the predicted output values.
    y_true: A torch tensor of shape (n_samples, 1) containing the true output values.

  Returns:
    A torch tensor of shape (n_samples, 1) containing the derivative of the binary cross entropy loss function with respect to the sigmoid function.
  """
  # TODO : Compute derivative of the sample loss wrt the sigmoid probabilities (y_pred)
  dS = -y_true / y_pred + (1 - y_true) / (1 - y_pred)

  return dS

def dS_dZ(y_pred):
  """Computes the derivative of the sigmoid function with respect to its input.

  Args:
    y_pred: A torch tensor of any shape containing the output of the sigmoid function.

  Returns:
    A torch tensor of the same shape as y_pred containing the derivative of the sigmoid function with respect to the linear projection.
  """
  # TODO : Compute the derivative of the sigmoid
  dz = y_pred * (1 - y_pred)

  return dz

def dZ_dW(x):
  """Computes the derivative of the linear projection with respect to the weights.

  Args:
    x: A torch tensor of shape (n_samples, n_features) containing the input data.

  Returns:
    A torch tensor of shape (n_samples, n_features) containing, for every sample, the derivative of the linear projection with respect to the weights.
  """
  # TODO : Compute the derivative of the linear projection wrt W
  dw = x

  return dw

def dZ_dB(x):
  """Computes the derivative of the linear projection with respect to the bias.

  Args:
    x: A torch tensor of shape (n_samples, n_features) containing the input data.

  Returns:
    A torch tensor of shape (n_samples, 1) containing, for every sample, the derivative of the linear projection with respect to the bias.
  """
  # TODO : Compute the derivative of the linear projection wrt b
  db = torch.ones(x.size(0), 1)

  return db

def compute_gradients_v2(x, z, y_pred, y_true, w, b):
  dS  = dJi_dS(y_pred, y_true)
  dZ  = dS_dZ(y_pred)
  dW  = dZ_dW(x)
  dB  = dZ_dB(x)

  dw = torch.mean(dW * dZ * dS, dim=0)
  db = torch.mean(dB * dZ * dS, dim=0)

  return dw, db

In [None]:
# weight and bias initialization
w = torch.tensor([[0.5]])
b = torch.tensor([[-0.5]])

# clone the data (first feature only; it is already standardized)
x = X_train[:,0].clone().view(-1,1)
y = Y_train.clone()

# compute intermediate values
z = linear_regression(x, w, b)
y_pred = sigmoid(z)

# compute gradients (v1)
dw1, db1 = compute_gradients(x, y_pred, y, w, b)

# compute gradients (v2)
dw2, db2 = compute_gradients_v2(x, z, y_pred, y, w, b)

In [None]:
print(f"Weight gradient : v1 ({dw1.item()}) and v2 ({dw2.item()})")
print(f"Bias gradient : v1 ({db1.item()}) and v2 ({db2.item()})")

This is the core idea behind the optimization of deep learning models: the backpropagation algorithm. In the next lab you will see how this is implemented in PyTorch.

In this lab we have implemented our own versions of the linear layer, the sigmoid function and the binary cross entropy, and we have even derived the gradients for a log-linear classifier. However, as you will see in future labs, we do not need to do all of this by hand: PyTorch includes a set of built-in functions and modules, as well as an automatic differentiation package that computes the gradients for us.