# Lab 2: Linear Regression

Author: Seungjae Lee (이승재)

<div class="alert alert-warning">
    We implement linear regression here with low-level PyTorch operations. In most real applications, abstractions such as <code>nn.Module</code> or <code>nn.Linear</code> are used instead.
</div>

## Theoretical Overview

$$ H(x) = Wx + b $$

$$ cost(W, b) = \frac{1}{m} \sum^m_{i=1} \left( H(x^{(i)}) - y^{(i)} \right)^2 $$

 - $H(x)$: How to make predictions for a given $x$ value
 - $cost(W, b)$: How well $H(x)$ predicts $y$
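
As a quick worked example of the two formulas, the sketch below evaluates $H(x)$ and $cost(W, b)$ in plain Python. The helper names `predict` and `mse_cost` are hypothetical and exist only for this illustration; the toy data matches the values used later in the lab.

```python
# Illustrative helpers (hypothetical names) mirroring H(x) and cost(W, b).
def predict(W, b, x):
    """H(x) = W * x + b for a single scalar input."""
    return W * x + b


def mse_cost(W, b, xs, ys):
    """Mean of squared differences between H(x) and y over m samples."""
    m = len(xs)
    return sum((predict(W, b, x) - y) ** 2 for x, y in zip(xs, ys)) / m


xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

cost_at_zero = mse_cost(0.0, 0.0, xs, ys)   # (1 + 4 + 9) / 3
cost_at_best = mse_cost(1.0, 0.0, xs, ys)   # perfect fit for y = x
print(cost_at_zero, cost_at_best)
```

With $W = 0$ and $b = 0$ every prediction is zero, so the cost is the mean of the squared targets. With $W = 1$ and $b = 0$ the predictions equal the targets and the cost vanishes.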

## Imports

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [20]:
# For reproducibility
torch.manual_seed(1)

<torch._C.Generator at 0xff22f0737030>

## Data

We will use fake data for this example.

In [3]:
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])

In [4]:
print(x_train)
print(x_train.shape)

tensor([[1.],
        [2.],
        [3.]])
torch.Size([3, 1])


In [5]:
print(y_train)
print(y_train.shape)

tensor([[1.],
        [2.],
        [3.]])
torch.Size([3, 1])


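By default, PyTorch uses the NCHW layout, so the first dimension is the batch dimension. Here, both tensors have shape `[3, 1]`: three samples with one feature each.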

## Weight Initialization

In [6]:
W = torch.zeros(1, requires_grad=True)
print(W)

tensor([0.], requires_grad=True)


In [7]:
b = torch.zeros(1, requires_grad=True)
print(b)

tensor([0.], requires_grad=True)


## Hypothesis

$$ H(x) = Wx + b $$

In [8]:
hypothesis = x_train * W + b
print(hypothesis)

tensor([[0.],
        [0.],
        [0.]], grad_fn=<AddBackward0>)
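
The expression `x_train * W + b` relies on broadcasting: `W` and `b` have shape `(1,)` while `x_train` has shape `(3, 1)`. The sketch below makes this visible with non-zero values; the numbers `2.0` and `0.5` are arbitrary choices for illustration, not values from the lab.

```python
import torch

x = torch.FloatTensor([[1], [2], [3]])    # shape (3, 1): 3 samples, 1 feature
W = torch.full((1,), 2.0)                 # shape (1,), illustrative value
b = torch.full((1,), 0.5)                 # shape (1,), illustrative value

h = x * W + b                             # W and b are broadcast across the 3 rows
print(h.shape)
print(h)
```

The result keeps the shape of `x`, with the same `W` and `b` applied to every row.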


## Cost

Cost is another word for "loss": a function that measures how well the model's predictions match the true data. During training, the goal is to minimize this function.

$$ cost(W, b) = \frac{1}{m} \sum^m_{i=1} \left( H(x^{(i)}) - y^{(i)} \right)^2 $$

In [9]:
print(hypothesis)

tensor([[0.],
        [0.],
        [0.]], grad_fn=<AddBackward0>)


In [10]:
print(y_train)

tensor([[1.],
        [2.],
        [3.]])


In [11]:
print(hypothesis - y_train)

tensor([[-1.],
        [-2.],
        [-3.]], grad_fn=<SubBackward0>)


In [12]:
print((hypothesis - y_train) ** 2)

tensor([[1.],
        [4.],
        [9.]], grad_fn=<PowBackward0>)


In [13]:
cost = torch.mean((hypothesis - y_train) ** 2)
print(cost)

tensor(4.6667, grad_fn=<MeanBackward1>)


## Gradient Descent
Gradient descent is a fundamental optimization technique commonly used in machine learning and deep learning to minimize the loss function and improve model performance.

In [14]:
optimizer = optim.SGD([W, b], lr=0.01) # defines an optimizer

Explanation:

`optim.SGD`:

`optim` is a module in PyTorch that contains various optimization algorithms.
`SGD` stands for Stochastic Gradient Descent, which is a popular optimization algorithm used for training machine learning models.

`[W, b]`:

This is a list of parameters that the optimizer will update. In this case, `W` and `b` are the parameters of the model (e.g., weights and biases) that require gradient updates.
These parameters should be tensors with `requires_grad=True` so that gradients can be computed for them during backpropagation.

`lr=0.01`:

`lr` stands for learning rate, which is a hyperparameter that controls the step size during the parameter updates.
A smaller learning rate means smaller updates and can lead to more precise convergence, but it might take longer to train.
A larger learning rate means larger updates and can speed up training, but it might overshoot the optimal solution.
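
To make the trade-off concrete, the following sketch runs plain gradient descent on the lab's toy data in pure Python, with two learning rates. The function `run_gd` is a hypothetical helper, and the value `0.5` is an assumption chosen only because it is large enough to overshoot on this particular data.

```python
def run_gd(lr, steps, xs, ys):
    """Plain gradient descent on MSE for H(x) = W*x + b; returns the final cost."""
    W, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [W * x + b - y for x, y in zip(xs, ys)]
        grad_W = 2.0 / m * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / m * sum(errors)
        W -= lr * grad_W
        b -= lr * grad_b
    return sum((W * x + b - y) ** 2 for x, y in zip(xs, ys)) / m


xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

initial_cost = run_gd(lr=0.01, steps=0, xs=xs, ys=ys)     # cost before any update
small_lr_cost = run_gd(lr=0.01, steps=10, xs=xs, ys=ys)   # small, steady steps
large_lr_cost = run_gd(lr=0.5, steps=10, xs=xs, ys=ys)    # steps that overshoot
print(initial_cost, small_lr_cost, large_lr_cost)
```

On this data the small learning rate lowers the cost, while the large one ends with a cost above the starting value because each update jumps past the minimum.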

In [15]:
optimizer.zero_grad() # reset accumulated gradients
cost.backward() # compute gradients
optimizer.step() # update W and b

1. **`optimizer.zero_grad()`**:
   - This function sets the gradients of all model parameters to zero. 
   - Gradients are accumulated by default in PyTorch, so if you don't zero them out, the gradients from the previous iteration will be added to the current gradients, which is not what you want.
   - This step ensures that you start with a clean slate for the current iteration.

2. **`cost.backward()`**:
   - This function computes the gradient of the loss (cost) with respect to each parameter (i.e., `W` and `b`) using backpropagation.
   - The gradients are stored in the `.grad` attribute of each parameter tensor.
   - These gradients indicate the direction and magnitude of change needed to reduce the loss.

3. **`optimizer.step()`**:
   - This function updates the parameters using the gradients computed in the previous step.
   - The optimizer adjusts the parameters (`W` and `b`) based on the gradients and the learning rate.
   - For example, in Stochastic Gradient Descent (SGD), the update rule is:
     ```python
     W = W - learning_rate * W.grad
     b = b - learning_rate * b.grad
     ```
   - This step moves the parameters in the direction that reduces the loss.
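
The three calls can also be reproduced by hand, which shows what the optimizer does for plain SGD. This is a minimal sketch under the assumption of SGD without momentum or weight decay; the `torch.no_grad()` block and in-place updates are one common way to write it, not the optimizer's actual source.

```python
import torch

x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])
W = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.01

cost = torch.mean((x_train * W + b - y_train) ** 2)
cost.backward()                 # fills W.grad and b.grad

with torch.no_grad():           # the update itself must not be tracked by autograd
    W -= lr * W.grad            # in-place, so W stays the same leaf tensor
    b -= lr * b.grad

W.grad.zero_()                  # plays the role of optimizer.zero_grad()
b.grad.zero_()
print(W, b)
```

Starting from zeros, one manual step gives roughly `W = 0.0933` and `b = 0.04`.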

In [16]:
print(W)
print(b)

tensor([0.0933], requires_grad=True)
tensor([0.0400], requires_grad=True)


Why `W` and `b` Change:

- **Gradient Computation**: During `cost.backward()`, the gradients of the loss with respect to `W` and `b` are computed. These gradients tell us how much and in which direction to change `W` and `b` to reduce the loss.
- **Parameter Update**: During `optimizer.step()`, the optimizer uses these gradients to update `W` and `b`. The parameters are adjusted slightly in the direction that reduces the loss.
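
The stored gradients can be inspected directly through `.grad`. The sketch below also calls `backward()` twice without zeroing in between, to show the accumulation behaviour that makes `optimizer.zero_grad()` necessary. The variable names `grad_first` and `grad_second` are illustrative.

```python
import torch

x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])
W = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

cost = torch.mean((x_train * W + b - y_train) ** 2)
cost.backward()
grad_first = W.grad.clone()     # gradient after one backward pass

# A second backward pass without zeroing adds to the stored gradient.
cost = torch.mean((x_train * W + b - y_train) ** 2)
cost.backward()
grad_second = W.grad.clone()

print(grad_first, grad_second)
```

The first gradient is negative, so subtracting `lr * W.grad` increases `W`, which is why `W` moves from 0 toward a positive value. The second value is twice the first because the two passes were summed.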



Let's check if the hypothesis is now better.

In [17]:
hypothesis = x_train * W + b
print(hypothesis)

tensor([[0.1333],
        [0.2267],
        [0.3200]], grad_fn=<AddBackward0>)


In [18]:
cost = torch.mean((hypothesis - y_train) ** 2)
print(cost)

tensor(3.6927, grad_fn=<MeanBackward1>)


By iterating through this process, the model parameters are optimized to fit the data better, reducing the loss over time.

## Training with Full Code

In practice, we train on the dataset for multiple epochs, which can be done with a simple loop.

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

# Data
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[5], [8], [11]])
# Initialize model
W = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# optimizer settings
optimizer = optim.SGD([W, b], lr=0.01)

nb_epochs = 2000
for epoch in range(nb_epochs + 1):
    
    # Compute H(x)
    hypothesis = x_train * W + b

    # Compute cost
    cost = torch.mean((hypothesis - y_train) ** 2)

    # Improve H(x) using cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log every 100 epochs
    if epoch % 100 == 0:
        print('Epoch {:4d}/{} W: {:.3f}, b: {:.3f} Cost: {:.6f}'.format(
            epoch, nb_epochs, W.item(), b.item(), cost.item()
        ))

Epoch    0/2000 W: 0.360, b: 0.160 Cost: 70.000000
Epoch  100/2000 W: 3.197, b: 1.553 Cost: 0.028801
Epoch  200/2000 W: 3.155, b: 1.649 Cost: 0.017797
Epoch  300/2000 W: 3.122, b: 1.724 Cost: 0.010998
Epoch  400/2000 W: 3.096, b: 1.783 Cost: 0.006796
Epoch  500/2000 W: 3.075, b: 1.829 Cost: 0.004199
Epoch  600/2000 W: 3.059, b: 1.866 Cost: 0.002595
Epoch  700/2000 W: 3.046, b: 1.895 Cost: 0.001604
Epoch  800/2000 W: 3.036, b: 1.917 Cost: 0.000991
Epoch  900/2000 W: 3.029, b: 1.935 Cost: 0.000612
Epoch 1000/2000 W: 3.023, b: 1.949 Cost: 0.000378
Epoch 1100/2000 W: 3.018, b: 1.960 Cost: 0.000234
Epoch 1200/2000 W: 3.014, b: 1.968 Cost: 0.000144
Epoch 1300/2000 W: 3.011, b: 1.975 Cost: 0.000089
Epoch 1400/2000 W: 3.009, b: 1.980 Cost: 0.000055
Epoch 1500/2000 W: 3.007, b: 1.985 Cost: 0.000034
Epoch 1600/2000 W: 3.005, b: 1.988 Cost: 0.000021
Epoch 1700/2000 W: 3.004, b: 1.990 Cost: 0.000013
Epoch 1800/2000 W: 3.003, b: 1.993 Cost: 0.000008
Epoch 1900/2000 W: 3.003, b: 1.994 Cost: 0.000005
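
The log suggests that `W` approaches 3 and `b` approaches 2. A quick check, assuming those limiting values, confirms that they fit this data exactly. The names `W_guess` and `b_guess` are illustrative.

```python
import torch

x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[5], [8], [11]])

# Candidate parameters read off the trend in the training log.
W_guess, b_guess = 3.0, 2.0
cost_at_guess = torch.mean((x_train * W_guess + b_guess - y_train) ** 2)
print(cost_at_guess)
```

Since $3x + 2$ reproduces every target, the cost at these values is zero, which is the minimum gradient descent is converging to.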

In [19]:
# Data
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])
# Initialize model
W = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# optimizer settings
optimizer = optim.SGD([W, b], lr=0.01)

nb_epochs = 1000
for epoch in range(nb_epochs + 1):
    
    # Compute H(x)
    hypothesis = x_train * W + b

    # Compute cost
    cost = torch.mean((hypothesis - y_train) ** 2)

    # Improve H(x) using cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log every 100 epochs
    if epoch % 100 == 0:
        print('Epoch {:4d}/{} W: {:.3f}, b: {:.3f} Cost: {:.6f}'.format(
            epoch, nb_epochs, W.item(), b.item(), cost.item()
        ))

Epoch    0/1000 W: 0.093, b: 0.040 Cost: 4.666667
Epoch  100/1000 W: 0.873, b: 0.289 Cost: 0.012043
Epoch  200/1000 W: 0.900, b: 0.227 Cost: 0.007442
Epoch  300/1000 W: 0.921, b: 0.179 Cost: 0.004598
Epoch  400/1000 W: 0.938, b: 0.140 Cost: 0.002842
Epoch  500/1000 W: 0.951, b: 0.110 Cost: 0.001756
Epoch  600/1000 W: 0.962, b: 0.087 Cost: 0.001085
Epoch  700/1000 W: 0.970, b: 0.068 Cost: 0.000670
Epoch  800/1000 W: 0.976, b: 0.054 Cost: 0.000414
Epoch  900/1000 W: 0.981, b: 0.042 Cost: 0.000256
Epoch 1000/1000 W: 0.985, b: 0.033 Cost: 0.000158


# Linear Regression Model

Now, we will rewrite the program using the linear regression building blocks in `torch.nn`.

Remember that we had this fake data.

In [5]:
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])

Now we need to create a linear regression model. In PyTorch, models are conventionally defined by subclassing the provided `nn.Module`.

In [6]:
class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

**Class Definition**

`LinearRegressionModel` is a subclass of `nn.Module`, which is the base class for all neural network modules in PyTorch.

By inheriting from `nn.Module`, this class gains access to a variety of useful methods and properties for building and training neural networks.

**Constructor Method**
- The `__init__` method is the constructor of the class. It initializes the instance of the class.

- `super()` is a built-in function that allows you to call methods from a parent class. `super().__init__()` calls the constructor of the parent class (`nn.Module`). This is necessary to properly initialize the base class.

- `self.linear = nn.Linear(1, 1)` creates a linear layer with one input feature and one output feature. This layer will perform the linear transformation $y = xW + b$, where $W$ is the weight and $b$ is the bias.

**Forward Method**
- The `forward` method defines the forward pass of the model. This is where the input data is passed through the network.
- `x` is the input tensor.
- `return self.linear(x)` applies the linear transformation defined in the `self.linear` layer to the input `x` and returns the result.



In the `__init__` method of the model, we define the layers to be used. Here, we will use `nn.Linear` because we are creating a linear regression model. In the `forward` method, we specify how the model calculates the output from the input.

`nn.Linear(1, 1)` specifies the input and output dimensions: one input feature and one output feature.
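
To see what the layer holds, the sketch below inspects the parameters of a standalone `nn.Linear(1, 1)` and rewrites its forward computation by hand. The seed value is arbitrary and only makes the printed numbers repeatable; the variable `manual` is an illustrative name.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)            # arbitrary seed, only for repeatable printing
linear = nn.Linear(1, 1)        # in_features=1, out_features=1

print(linear.weight.shape)      # (out_features, in_features)
print(linear.bias.shape)        # (out_features,)

x = torch.FloatTensor([[1], [2], [3]])
manual = x @ linear.weight.T + linear.bias   # same computation written out
print(torch.allclose(linear(x), manual))
```

Unlike the zero-initialized `W` and `b` used earlier, the layer's weight and bias start from random values, which is why the first predictions of the model are non-zero.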

In [7]:
model = LinearRegressionModel()

## Hypothesis

Let's create a model and calculate the predicted value $H(x)$.

In [8]:
hypothesis = model(x_train)

In [9]:
print(hypothesis)

tensor([[-0.2248],
        [-0.8860],
        [-1.5473]], grad_fn=<AddmmBackward0>)


## Cost

Now let's calculate the cost using mean squared error (MSE). PyTorch provides MSE as `F.mse_loss`.

In [11]:
print(hypothesis)
print(y_train)

tensor([[-0.2248],
        [-0.8860],
        [-1.5473]], grad_fn=<AddmmBackward0>)
tensor([[1.],
        [2.],
        [3.]])


In [12]:
cost = F.mse_loss(hypothesis, y_train)

In [13]:
print(cost)

tensor(10.1690, grad_fn=<MseLossBackward0>)
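
As a sanity check, `F.mse_loss` with its default `reduction='mean'` gives the same value as the manual `torch.mean((hypothesis - y_train) ** 2)` formula used in the first half of the lab. The tensors below are arbitrary illustrative values, not outputs of the model.

```python
import torch
import torch.nn.functional as F

# Illustrative predictions and targets (values chosen arbitrarily).
prediction = torch.tensor([[0.5], [1.5], [3.5]])
target = torch.tensor([[1.0], [2.0], [3.0]])

manual_mse = torch.mean((prediction - target) ** 2)
builtin_mse = F.mse_loss(prediction, target)     # default reduction='mean'
print(manual_mse, builtin_mse)
```

Each prediction is off by 0.5, so both expressions evaluate to 0.25.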


## Gradient Descent

Using the cost computed above, we adjust the $W$ and $b$ of $H(x)$ to reduce it. Any of the optimizers in PyTorch's `torch.optim` can be used for this step.

In [14]:
optimizer = optim.SGD(model.parameters(), lr=0.01)
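
`model.parameters()` hands the optimizer every learnable tensor registered on the module, so `W` and `b` no longer need to be listed by hand. A small sketch using `named_parameters()` shows which tensors those are; the class is repeated here only to keep the snippet self-contained.

```python
import torch.nn as nn


class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)


model = LinearRegressionModel()
names = [name for name, _ in model.named_parameters()]
print(names)
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
```

The weight comes first and the bias second, which is the order that `list(model.parameters())` returns as well.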

In [15]:
optimizer.zero_grad()
cost.backward()
optimizer.step()

## Training with Full Code

Now that we understand the Linear Regression code, let's run the code to fit the model.

In [None]:
# Data
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])
# initialize model
model = LinearRegressionModel()
# optimizer setting
optimizer = optim.SGD(model.parameters(), lr=0.01)

nb_epochs = 1000
for epoch in range(nb_epochs + 1):
    
    # Compute H(x)
    prediction = model(x_train)  # equivalent to model.forward(x_train): calling the model invokes forward

    # Compute cost
    cost = F.mse_loss(prediction, y_train)

    # Improve H(x) using cost
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

    # Log every 100 epochs
    if epoch % 100 == 0:
        params = list(model.parameters())
        W = params[0].item()
        b = params[1].item()
        print('Epoch {:4d}/{} W: {:.3f}, b: {:.3f} Cost: {:.6f}'.format(
            epoch, nb_epochs, W, b, cost.item()
        ))

Epoch    0/1000 W: 0.224, b: -0.086 Cost: 3.896044
Epoch  100/1000 W: 0.926, b: 0.168 Cost: 0.004088
Epoch  200/1000 W: 0.942, b: 0.132 Cost: 0.002526
Epoch  300/1000 W: 0.954, b: 0.104 Cost: 0.001561
Epoch  400/1000 W: 0.964, b: 0.082 Cost: 0.000965
Epoch  500/1000 W: 0.972, b: 0.064 Cost: 0.000596
Epoch  600/1000 W: 0.978, b: 0.051 Cost: 0.000368
Epoch  700/1000 W: 0.983, b: 0.040 Cost: 0.000228
Epoch  800/1000 W: 0.986, b: 0.031 Cost: 0.000141
Epoch  900/1000 W: 0.989, b: 0.025 Cost: 0.000087
Epoch 1000/1000 W: 0.992, b: 0.019 Cost: 0.000054


You can see that `cost` decreases as `W` and `b` are adjusted.

Compared with the previous version, this code uses the `nn.Module` and `nn.Linear` abstractions in `torch.nn` to make the code simpler and more reusable. Specifically, model initialization, the calculation of $H(x)$, and the way `W` and `b` are accessed have changed.
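
Once trained, the model can be used for prediction, and its parameters can be read by name rather than by position. The following is a minimal sketch: the epoch count of 2000, the seed, and the test input `4` are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)


torch.manual_seed(1)
x_train = torch.FloatTensor([[1], [2], [3]])
y_train = torch.FloatTensor([[1], [2], [3]])

model = LinearRegressionModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for _ in range(2000):           # epoch count is an illustrative choice
    cost = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

# Named access to the learned parameters instead of list(model.parameters()).
W = model.linear.weight.item()
b = model.linear.bias.item()

# Inference on an unseen input; no_grad() skips building the autograd graph.
with torch.no_grad():
    new_prediction = model(torch.FloatTensor([[4]]))

print(W, b, new_prediction.item())
```

Because the training data follows $y = x$, the prediction for an input of 4 should land close to 4.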