* Regression problems arise whenever we want to predict a numerical value.
* Examples include predicting prices, predicting a patient's length of stay in a hospital, and forecasting demand.
* To develop a model for predicting house prices, we need data that includes the sale price, area, number of rooms, and so on.
* This dataset is called a `training dataset` or `training set`, and each row, which contains the data corresponding to one sale, is called a *data point* or *sample*.
* The quantity we are trying to predict is called a *label* or *target*.
* The variables on which the predictions are based are called `features` or `covariates`.

In [13]:
!pip install d2l --no-deps



In [14]:
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

## 1.1 Basics

* In linear regression we first assume that the relationship between features `x` and target `y` is approximately linear.
* We use superscripts to enumerate samples and targets, and subscripts to index coordinates.
* $x^{(i)}$ denotes the $i^{th}$ sample and $x_j^{(i)}$ denotes its $j^{th}$ coordinate.

## Model

* The assumption of linearity means that the expected value of the target (price) can be expressed as a weighted sum of the features (area and age):

* $\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$

* Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called *weights* and $b$ is called the *bias*.
* The weights determine the influence of each feature on our prediction.
* The bias determines the value of the estimate when all features are zero.
* In machine learning, we usually work with high-dimensional datasets, where it is more convenient to employ compact linear algebra notation.
* When our inputs consist of $d$ features, we assign each an index (between 1 and $d$) and express our prediction $\hat{y}$ as:

   $\hat{y} = w_1 x_1 + \dots + w_d x_d + b$

* We can also express our model compactly via the dot product between the weight vector `w` and the feature vector `x`:

  $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$

* The vector `x` in the above formula corresponds to the features of a single example.
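
The dot-product form extends directly to several features. The sketch below is only an illustration: the feature values, the weights, the bias, and the `_demo` variable names are made up here and do not come from a real dataset.

```python
import numpy as np

# Hypothetical toy data: 3 examples, 2 features each (say, area and age).
X_demo = np.array([[120.0, 10.0],
                   [80.0, 25.0],
                   [200.0, 2.0]])
w_demo = np.array([0.5, -0.2])  # illustrative weights, one per feature
b_demo = 3.0                    # illustrative bias

# Prediction for a single example: dot product of w and x, plus the bias.
y_hat_single = np.dot(w_demo, X_demo[0]) + b_demo

# Predictions for all examples at once: matrix-vector product plus the bias.
y_hat_all = X_demo @ w_demo + b_demo

print(y_hat_single)
print(y_hat_all)
```

Each row of `X_demo` plays the role of one $x^{(i)}$, so the matrix-vector product computes every $\hat{y}^{(i)}$ in a single call instead of a Python loop.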

In [27]:
##implementing a basic regression example
import numpy as np
x = np.arange(12)
y_true = np.arange(12,24)
b = 1.5
y_hat = []

# For a single-feature linear regression, 'w' should be a single scalar weight.
w = 1.0

for x_val in x: # Iterate through each feature value in x
  preds = w * x_val + b
  y_hat.append(preds)

print(y_hat)
print(f"Number of predictions: {len(y_hat)}")
print(f"True values: {y_true}")

[np.float64(1.5), np.float64(2.5), np.float64(3.5), np.float64(4.5), np.float64(5.5), np.float64(6.5), np.float64(7.5), np.float64(8.5), np.float64(9.5), np.float64(10.5), np.float64(11.5), np.float64(12.5)]
Number of predictions: 12
True values: [12 13 14 15 16 17 18 19 20 21 22 23]


## Loss function

* Suppose we want to measure how well our model performs.
* We use a function called a *loss function* to do this.
* It quantifies the distance between the `real` and `predicted` values of the target.
* For regression problems, the most common loss function is the *squared error*.
* When our prediction for an example $i$ is $\hat{y}^{(i)}$ and corresponding true label is $y^{(i)}$, the squared error is given by:

$l^{(i)}(w, b) = \frac{1}{2} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

* The constant $1/2$ makes no real difference but is notationally convenient, since it cancels out when we take the derivative of the loss.

* To measure the quality of a model on the entire dataset of `n` samples we simply average the losses on the training set:

 $
 L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right)^2
 $

* When training the model, we seek parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:

 $\mathbf{w}^*, b^* = \underset{\mathbf{w}, b}{\mathrm{argmin}} \ L(\mathbf{w}, b)$



In [30]:
##defining loss function
# Convert y_hat to a numpy array for element-wise operations
y_hat_np = np.array(y_hat)

# Calculate the squared error for each prediction, then take the mean and halve it,
# following the formula L(w, b) = (1/n) * sum( 0.5 * (y_hat - y_true)^2 ).
# The 0.5 factor is included because it cancels when the loss is differentiated.
loss = 0.5 * np.mean((y_hat_np - y_true)**2)

print(f"Average loss: {loss}")

Average loss: 55.125


## Analytic Solution

* We can find the optimal parameters analytically by applying a simple formula, as follows.
* First, we can subsume the bias `b` into the parameter `w` by appending a column consisting of all 1s to the design matrix. Our prediction problem is then to minimize $\|\mathbf{y} - \mathbf{Xw}\|^2$.
* As long as the design matrix $\mathbf{X}$ has full rank, that is, no feature is linearly dependent on the others, there will be just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain.
* Taking the derivative of the loss with respect to `w` and setting it equal to zero yields:
 $\frac{\partial}{\partial \mathbf{w}} \|\mathbf{y} - \mathbf{Xw}\|^2 = 2\mathbf{X}^\top (\mathbf{Xw} - \mathbf{y}) = 0$, hence $\mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{Xw}$.
* Solving for $\mathbf{w}$ gives the optimal solution $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$, which is unique when $\mathbf{X}^\top \mathbf{X}$ is invertible, i.e. when the columns of $\mathbf{X}$ are linearly independent.
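
As a rough check of the condition $X^\top y = X^\top X w$, the sketch below solves it numerically on the same toy data used in the earlier example, where `y_true` equals `x + 12`. The variable names (`x_toy`, `y_toy`, `X_design`, `w_star`) are chosen here for illustration, and using `np.linalg.solve` instead of an explicit matrix inverse is one reasonable choice among several.

```python
import numpy as np

# Same toy data as the earlier example: y = x + 12 exactly.
x_toy = np.arange(12, dtype=float)
y_toy = np.arange(12, 24, dtype=float)

# Design matrix with a column of ones appended,
# so that the bias becomes the last entry of the parameter vector.
X_design = np.column_stack([x_toy, np.ones_like(x_toy)])

# Solve the linear system X^T X w = X^T y without forming an explicit inverse.
w_star = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)

print(f"weight: {w_star[0]:.4f}, bias: {w_star[1]:.4f}")
```

Because the toy targets are an exact linear function of the inputs, the recovered weight and bias should be very close to 1 and 12, up to floating-point error.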

## Minibatch Stochastic Gradient Descent

* The key technique for optimizing nearly every deep learning model consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function.

* This algorithm is called `gradient descent`.
* Think of a hiker trying to find the lowest point in a mountainous valley while surrounded by fog: they can't see the bottom, but they can feel the slope of the ground under their feet and take a step downhill.
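
In symbols, one downhill step of the minibatch version can be written as follows. The symbols $\eta > 0$ (the learning rate, i.e. the step size) and $\mathcal{B}$ (a randomly sampled minibatch of training examples) are notation introduced here for illustration and are not defined elsewhere in these notes:

 $
 (\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b)
 $

For the squared error, the per-example gradients are $\partial_{\mathbf{w}} l^{(i)} = \left( \hat{y}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$ and $\partial_{b} l^{(i)} = \hat{y}^{(i)} - y^{(i)}$. Taking $\mathcal{B}$ to be the whole training set recovers plain (full-batch) gradient descent.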


## 1.2 Training the Model: Stochastic Gradient Descent

Now that we have our model and loss function, we need a way to find the values of `w` and `b` that minimize the loss. For this demonstration we use the simplest variant, full-batch gradient descent, which computes the gradient over the entire dataset at every step. In practice, we would use minibatch SGD, which estimates the gradient from a small random subset of the samples at each step.

In [31]:
# Initialize parameters
w = np.array([0.0]) # Start with some initial weight
b = 0.0              # Start with some initial bias

# Learning rate (controls the step size in each update)
learning_rate = 0.01

# Number of epochs (how many times to iterate over the dataset)
num_epochs = 100

# Store loss values to visualize training progress
losses = []

print(f"Initial w: {w[0]:.4f}, Initial b: {b:.4f}")

for epoch in range(num_epochs):
    # Calculate predictions
    y_hat_epoch = w[0] * x + b

    # Calculate the gradient of the loss with respect to w and b
    # Loss: L(w, b) = 0.5 * (1/n) * sum((y_hat - y_true)^2)
    # dL/dw = (1/n) * sum((y_hat - y_true) * x)
    # dL/db = (1/n) * sum(y_hat - y_true)

    grad_w = np.mean((y_hat_epoch - y_true) * x)
    grad_b = np.mean(y_hat_epoch - y_true)

    # Update parameters using gradient descent
    w[0] = w[0] - learning_rate * grad_w
    b = b - learning_rate * grad_b

    # Store the loss of the predictions made before this epoch's update
    current_loss = 0.5 * np.mean((y_hat_epoch - y_true)**2)
    losses.append(current_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1:3d}, Loss: {current_loss:.4f}, w: {w[0]:.4f}, b: {b:.4f}")

print(f"\nFinal w: {w[0]:.4f}, Final b: {b:.4f}")

Initial w: 0.0000, Initial b: 0.0000
Epoch  10, Loss: 18.3042, w: 2.4806, b: 0.6514
Epoch  20, Loss: 17.3080, w: 2.4490, b: 0.9640
Epoch  30, Loss: 16.3711, w: 2.4093, b: 1.2668
Epoch  40, Loss: 15.4850, w: 2.3706, b: 1.5613
Epoch  50, Loss: 14.6468, w: 2.3330, b: 1.8478
Epoch  60, Loss: 13.8540, w: 2.2964, b: 2.1263
Epoch  70, Loss: 13.1042, w: 2.2608, b: 2.3973
Epoch  80, Loss: 12.3949, w: 2.2262, b: 2.6608
Epoch  90, Loss: 11.7240, w: 2.1926, b: 2.9171
Epoch 100, Loss: 11.0894, w: 2.1599, b: 3.1663

Final w: 2.1599, Final b: 3.1663
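
The loop above computes each gradient from all 12 samples. A minibatch variant could look like the sketch below. The batch size of 4, the fixed random seed, and the `_mb` variable names are assumptions made for this illustration, not values taken from the notes; the learning rate and epoch count simply mirror the loop above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an arbitrary choice, for reproducibility

# Same toy data as above: y = x + 12.
x_mb = np.arange(12, dtype=float)
y_mb = np.arange(12, 24, dtype=float)

w_mb, b_mb = 0.0, 0.0
lr = 0.01            # learning rate, mirroring the loop above
batch_size = 4       # assumed minibatch size
num_epochs_mb = 100

initial_loss = 0.5 * np.mean((w_mb * x_mb + b_mb - y_mb) ** 2)

for epoch in range(num_epochs_mb):
    # Shuffle the sample indices so that minibatches differ between epochs.
    perm = rng.permutation(len(x_mb))
    for start in range(0, len(x_mb), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x_mb[idx], y_mb[idx]
        err = w_mb * xb + b_mb - yb
        # Gradients of the squared error averaged over the minibatch only.
        w_mb -= lr * np.mean(err * xb)
        b_mb -= lr * np.mean(err)

final_loss = 0.5 * np.mean((w_mb * x_mb + b_mb - y_mb) ** 2)
print(f"Initial loss: {initial_loss:.4f}, final loss: {final_loss:.4f}")
print(f"w: {w_mb:.4f}, b: {b_mb:.4f}")
```

With three minibatches per epoch, this version makes three parameter updates per pass over the data instead of one, at the cost of noisier individual steps.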
