
# Week 1 Lab – Linear Regression with One Feature

In this lab we will work with a very simple version of supervised learning:

- Our data consists of pairs $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is one feature and $y^{(i)}$ is the target.
- We will use a **linear regression model** with one feature.
- We will measure how good the model is using a **cost function** based on mean squared error.
- We will train the model with **gradient descent**.
- This file uses vectorized Python code (NumPy array operations instead of explicit loops). If you are not familiar with vectorization, start with the similarly named notebook that does not use it.



## 0. Theory Refresher

### 0.1 Linear Regression (One Feature)

We assume there is (approximately) a linear relationship between the input $x$ and the output $y$.  
Our model (or hypothesis) is a function that depends on the parameters $w$ and $b$:

$$
f_{w,b}(x) = wx + b
$$

- $w$ is the **slope**: how much $f_{w,b}(x)$ changes when $x$ increases by 1.
- $b$ is the **intercept**: the value of $f_{w,b}(x)$ when $x = 0$.
- For a dataset with $m$ examples, we write the $i$-th example as $(x^{(i)}, y^{(i)})$.  
  The prediction for that example is:
  $$
  \hat{y}^{(i)} = f_{w,b}(x^{(i)}) = w x^{(i)} + b
  $$
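
As a quick illustration, the prediction formula can be applied to a whole array of inputs at once with NumPy. The values of `w`, `b`, and `x_small` below are made up for this sketch and are not part of the lab's dataset.

```python
import numpy as np

# Hypothetical parameters and inputs, chosen only for illustration
w = 2.0
b = 1.0
x_small = np.array([0.0, 1.0, 2.0])

# Vectorized prediction: w * x + b is applied to every element of x_small
y_hat_small = w * x_small + b
print(y_hat_small)  # [1. 3. 5.]
```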



### 0.2 Cost Function (Mean Squared Error)

We need a way to measure how well a particular line (given by $w$ and $b$) fits the data.

We use the **mean squared error (MSE)** cost function:

$$
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \big( \hat{y}^{(i)} - y^{(i)} \big)^2
       = \frac{1}{2m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big)^2
$$

- The term $(\hat{y}^{(i)} - y^{(i)})$ is the **error** for example $i$.
- We square the error so that positive and negative errors do not cancel out, and to penalize large errors more.
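- The factor $\frac{1}{m}$ averages the squared errors over the $m$ examples. The extra $\frac{1}{2}$ cancels the factor of 2 that appears when differentiating the square; it does not change which $(w,b)$ minimize the cost.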

Our goal is to find values of $w$ and $b$ that **minimize** $J(w,b)$.
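
To make the formula concrete, here is a small worked sketch on a made-up three-point dataset (the names `x_toy`, `y_toy`, and `toy_cost` are hypothetical and used only here). A line that misses the points has a positive cost, and a line that passes through all of them has cost zero.

```python
import numpy as np

# Tiny made-up dataset that lies exactly on the line y = 2x
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])
m_toy = x_toy.shape[0]

def toy_cost(w, b):
    """J(w,b) = 1/(2m) * sum of squared errors on the toy dataset."""
    errors = (w * x_toy + b) - y_toy
    return np.sum(errors ** 2) / (2 * m_toy)

cost_bad = toy_cost(1.0, 0.0)   # errors are -1, -2, -3, so J = 14 / 6
cost_good = toy_cost(2.0, 0.0)  # the line passes through every point, so J = 0
print(cost_bad, cost_good)
```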



### 0.3 Gradient Descent

To minimize $J(w,b)$, we use **gradient descent**.  
The idea is to start with some initial $(w,b)$ and repeatedly update them in the direction that decreases the cost.

We compute the partial derivatives:

$$
\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big) x^{(i)}
$$

$$
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big)
$$
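
For reference, the derivative with respect to $w$ follows from applying the chain rule to the definition of $J(w,b)$. The 2 produced by differentiating the square cancels the $\frac{1}{2}$ in the cost:

$$
\frac{\partial J}{\partial w}
= \frac{1}{2m} \sum_{i=1}^{m} 2 \big( w x^{(i)} + b - y^{(i)} \big) \cdot \frac{\partial}{\partial w} \big( w x^{(i)} + b - y^{(i)} \big)
= \frac{1}{m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big) x^{(i)}
$$

The derivation for $b$ is identical, except that the inner derivative $\frac{\partial}{\partial b} \big( w x^{(i)} + b - y^{(i)} \big)$ equals 1, so no factor $x^{(i)}$ appears.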

Given a **learning rate** $\alpha > 0$, we update:

$$
w := w - \alpha \frac{\partial J}{\partial w}, \qquad
b := b - \alpha \frac{\partial J}{\partial b}
$$

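We repeat these updates many times. If $\alpha$ is small enough, the cost $J(w,b)$ decreases at every step and $(w,b)$ moves toward the values that minimize it. If $\alpha$ is too large, the updates overshoot the minimum and the cost can grow instead; if it is very small, convergence is slow.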
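
The effect of the learning rate can be seen on a tiny example. The sketch below uses a made-up three-point dataset and a hypothetical helper `run_toy_descent`. On this particular data, `alpha=0.1` lowers the cost while `alpha=0.5` makes it grow, but the threshold between the two regimes depends on the data.

```python
import numpy as np

# Tiny made-up dataset on the line y = 2x (illustration only)
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])

def toy_cost(w, b):
    errors = (w * x_toy + b) - y_toy
    return np.mean(errors ** 2) / 2

def run_toy_descent(alpha, steps):
    """Run `steps` updates starting from (w, b) = (0, 0); return the final cost."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = (w * x_toy + b) - y_toy
        dj_dw = np.mean(errors * x_toy)
        dj_db = np.mean(errors)
        # Both gradients are computed before either parameter changes
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return toy_cost(w, b)

cost_start = toy_cost(0.0, 0.0)
cost_small_alpha = run_toy_descent(alpha=0.1, steps=20)
cost_large_alpha = run_toy_descent(alpha=0.5, steps=20)

print("start:    ", cost_start)
print("alpha=0.1:", cost_small_alpha)  # lower than the starting cost
print("alpha=0.5:", cost_large_alpha)  # higher than the starting cost (overshooting)
```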


## 1. Setup

In [None]:

# Install required libraries (run this once if needed)
%pip install numpy pandas matplotlib


In [None]:

import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(precision=4, suppress=True)



## 2. Create or Load a Simple Dataset

We will create a synthetic dataset that follows a linear relationship plus Gaussian noise (mean 0, standard deviation 2):

$$
y = 3 x + 2 + \text{noise}
$$

Each point is a pair $(x^{(i)}, y^{(i)})$ with **one feature** $x^{(i)}$.


In [None]:

m = 50

x = np.linspace(0, 10, m)

true_w = 3.0
true_b = 2.0

rng = np.random.default_rng(0)
noise = rng.normal(loc=0.0, scale=2.0, size=m)

y = true_w * x + true_b + noise

print(f"Number of examples m = {m}")
print("First 5 x values:", x[:5])
print("First 5 y values:", y[:5])


### 2.1 Visualize the Data

In [None]:

plt.figure()
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Dataset: one feature x vs target y")
plt.show()



## 3. Linear Regression Model with One Feature

We use the model (hypothesis function):

$$
f_{w,b}(x^{(i)}) = w x^{(i)} + b
$$

where:
- $w$ is the slope,
- $b$ is the intercept.


In [None]:

def predict(x, w, b):
    """Compute the predicted y values for given x, using f_{w,b}(x) = w x + b."""
    return w * x + b

w_test = 0.0
b_test = 0.0
y_hat_test = predict(x, w_test, b_test)
print("First 5 predictions with w=0, b=0:", y_hat_test[:5])



## 4. Cost Function $J(w,b)$

We define the **mean squared error** cost function:

$$
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big)^2
$$

This measures how well the model $f_{w,b}(x) = w x + b$ fits the data.
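
Since this lab relies on vectorization, it may help to see the same cost written both ways. The sketch below compares an explicit loop with the array version on made-up data; `cost_loop`, `cost_vectorized`, `x_demo`, and `y_demo` are hypothetical names used only for this comparison.

```python
import numpy as np

def cost_loop(x, y, w, b):
    """Cost J(w,b) computed one example at a time."""
    m = len(x)
    total = 0.0
    for i in range(m):
        error = (w * x[i] + b) - y[i]
        total += error ** 2
    return total / (2 * m)

def cost_vectorized(x, y, w, b):
    """Same cost, computed on whole arrays at once."""
    errors = (w * x + b) - y
    return np.sum(errors ** 2) / (2 * x.shape[0])

# Made-up sample data for the comparison
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.5, 4.5, 7.0])

j_loop = cost_loop(x_demo, y_demo, 1.5, 1.0)
j_vec = cost_vectorized(x_demo, y_demo, 1.5, 1.0)
print(j_loop, j_vec)  # both versions give the same value
```

The two functions compute the same sum; the vectorized one lets NumPy do the per-example arithmetic instead of a Python loop.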


In [None]:

def compute_cost(x, y, w, b):
    m = x.shape[0]
    y_hat = w * x + b  # f_{w,b}(x)
    errors = y_hat - y
    cost = (1 / (2 * m)) * np.sum(errors ** 2)
    return cost

print("Cost with w=0, b=0:", compute_cost(x, y, w_test, b_test))



### 4.1 Visualize the Cost Function as a Surface

We can visualize how $J(w,b)$ changes as we vary $w$ and $b$.  
Below we plot the **cost surface** $J(w,b)$ in 3D.
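
The cost at every grid point can also be computed without a Python loop by using NumPy broadcasting. This is a minimal sketch on noise-free made-up data (`x_demo`, `y_demo`, and the grid ranges are assumptions chosen so that the minimum is known in advance); it also shows how to read the lowest-cost grid point off the array.

```python
import numpy as np

# Noise-free made-up data on the line y = 3x + 2, so the minimum is known in advance
x_demo = np.linspace(0, 10, 50)
y_demo = 3.0 * x_demo + 2.0

w_grid = np.linspace(2.0, 4.0, 21)  # grid includes w = 3
b_grid = np.linspace(1.0, 3.0, 21)  # grid includes b = 2
W_demo, B_demo = np.meshgrid(w_grid, b_grid)

# Broadcasting: W_demo[..., None] has shape (21, 21, 1) and x_demo has shape (50,),
# so errors has shape (21, 21, 50): one vector of errors per (w, b) pair.
errors = W_demo[..., None] * x_demo + B_demo[..., None] - y_demo
J_grid = np.sum(errors ** 2, axis=-1) / (2 * x_demo.shape[0])

# Locate the grid point with the smallest cost
i_min, j_min = np.unravel_index(np.argmin(J_grid), J_grid.shape)
w_best, b_best = W_demo[i_min, j_min], B_demo[i_min, j_min]
print(w_best, b_best, J_grid[i_min, j_min])
```

The intermediate `errors` array holds one value per grid point per example, so memory use grows quickly with grid resolution; a coarse grid is usually enough for plotting.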


In [None]:

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# A 100 x 100 grid is enough for the plot and keeps the nested loop below fast
w_values = np.linspace(2.5, 3.5, 100)
b_values = np.linspace(0.5, 1.5, 100)

W, B = np.meshgrid(w_values, b_values)
J_vals = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        J_vals[i, j] = compute_cost(x, y, W[i, j], B[i, j])

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W, B, J_vals, cmap=cm.viridis, linewidth=0, antialiased=True)
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("J(w,b)")
ax.set_title("Cost surface J(w,b)")
plt.show()




## 5. Gradient Descent

We use **gradient descent**. The gradients of the cost are:

$$
\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big) x^{(i)}, \quad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \big( f_{w,b}(x^{(i)}) - y^{(i)} \big)
$$

The update rules are:

$$
w := w - \alpha \frac{\partial J}{\partial w}, \quad
b := b - \alpha \frac{\partial J}{\partial b}
$$
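
A common way to sanity-check gradient formulas like these is to compare them with a finite-difference approximation of the cost. The sketch below does this on made-up data; the function names, the test point `(0.5, -1.0)`, and the step `eps` are assumptions for illustration.

```python
import numpy as np

def cost(x, y, w, b):
    errors = (w * x + b) - y
    return np.sum(errors ** 2) / (2 * x.shape[0])

def analytic_gradients(x, y, w, b):
    """Gradients from the closed-form partial derivatives of J."""
    errors = (w * x + b) - y
    return np.mean(errors * x), np.mean(errors)

def numeric_gradients(x, y, w, b, eps=1e-6):
    """Central finite-difference approximation of the same gradients."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

# Made-up sample data
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.5, 4.5, 7.0])

grad_analytic = analytic_gradients(x_demo, y_demo, 0.5, -1.0)
grad_numeric = numeric_gradients(x_demo, y_demo, 0.5, -1.0)
print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

If the two pairs of numbers agree to several decimal places, the derivative formulas and their implementation are consistent with the cost function.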


In [None]:

def compute_gradients(x, y, w, b):
    """Compute the partial derivatives of J(w,b) with respect to w and b."""
    m = x.shape[0]
    y_hat = w * x + b  # f_{w,b}(x)
    errors = y_hat - y

    dj_dw = (1 / m) * np.sum(errors * x)  # dJ/dw
    dj_db = (1 / m) * np.sum(errors)      # dJ/db
    return dj_dw, dj_db

dj_dw_test, dj_db_test = compute_gradients(x, y, w_test, b_test)
print("Gradients at w=0, b=0:", dj_dw_test, dj_db_test)


### 5.1 Implement the Gradient Descent Loop

In [None]:

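def gradient_descent(x, y, w_init, b_init, alpha, num_iterations):
    """Run gradient descent and return the learned w, b and the cost history.

    history is a list of (iteration, cost) pairs, where cost is measured
    after that iteration's update.
    """
    w = w_init
    b = b_init
    history = []

    for i in range(num_iterations):
        # Compute both gradients at the current (w, b), then update both parameters
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        cost = compute_cost(x, y, w, b)
        history.append((i, cost))

        # Print progress about 10 times over the whole run
        if i % max(1, (num_iterations // 10)) == 0:
            print(f"Iteration {i:4d}: w={w:7.4f}, b={b:7.4f}, cost={cost:8.4f}")

    return w, b, history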

alpha = 0.01
num_iterations = 2000

w_init = 0.0
b_init = 0.0

w_learned, b_learned, history = gradient_descent(x, y, w_init, b_init, alpha, num_iterations)
print("\nLearned parameters:")
print("w =", w_learned)
print("b =", b_learned)


### 5.2 Plot the Cost over Iterations

In [None]:

iterations = [it for it, c in history]
costs = [c for it, c in history]

plt.figure()
plt.plot(iterations[15:], costs[15:])  # skip the first 15 iterations so the large early costs do not dominate the y-axis
plt.xlabel("Iteration")
plt.ylabel("Cost J(w,b)")
plt.title("Gradient Descent: Cost vs Iterations")
plt.show()


### 5.3 Visualize the Fitted Line

In [None]:

plt.figure()
plt.scatter(x, y, label="Data")
y_pred = predict(x, w_learned, b_learned)
plt.plot(x, y_pred, label="Fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear Regression Fit (one feature)")
plt.legend()
plt.show()



## 6. Exercises (for You to Try)

1. **Change the learning rate $\alpha$**:
   - Try values like `0.001`, `0.1`, `0.5`.
   - What happens to the speed of convergence? Does the algorithm diverge for some values?

2. **Change the number of iterations**:
   - Try `num_iterations = 100`, `500`, `2000`.
   - How does the final cost change?

3. **Try different initial values** for `w_init` and `b_init`:
   - Does gradient descent still converge to similar values?

4. **Noise level**:
   - Go back to the cell where we define `noise` and change `scale` (e.g., `scale=0.5` or `scale=5.0`).
   - How does the fitted line look with less/more noise?

5. **(Optional) Manual check**:
   - Pick some values of $w$ and $b$ and compute $J(w,b)$ using `compute_cost`.
   - Plot the corresponding line and check visually whether a smaller cost corresponds to a better fit.
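
When working through these exercises, it can help to have a reference answer to compare against. One option, sketched below under assumptions, is the least-squares line returned by `np.polyfit`, which minimizes the same squared-error objective in closed form. The data here is made up in the style of the lab's dataset (the seed, `x_demo`, and `y_demo` are hypothetical), and the gradient descent loop is a compact stand-in rather than the lab's `gradient_descent` function.

```python
import numpy as np

# Made-up data in the style of the lab's dataset (hypothetical seed)
rng_demo = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 50)
y_demo = 3.0 * x_demo + 2.0 + rng_demo.normal(scale=2.0, size=x_demo.shape[0])

# Closed-form least-squares line; np.polyfit returns coefficients from highest degree down
w_ref, b_ref = np.polyfit(x_demo, y_demo, deg=1)

# Plain gradient descent on the same data, starting from (0, 0)
w_gd, b_gd = 0.0, 0.0
for _ in range(5000):
    errors = (w_gd * x_demo + b_gd) - y_demo
    w_gd, b_gd = w_gd - 0.01 * np.mean(errors * x_demo), b_gd - 0.01 * np.mean(errors)

print("polyfit:          w =", w_ref, " b =", b_ref)
print("gradient descent: w =", w_gd, " b =", b_gd)
```

If gradient descent has run long enough with a suitable learning rate, the two pairs of parameters should be close; a visible gap usually means more iterations are needed.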
