## 🏠 Week 02 – Regression with Multiple Input Variables

In the original version of linear regression, you had a single feature \( x \), the size of the house, and you were able to predict \( y \), the price of the house.  
The model was:

$$
f_{w,b}(x) = wx + b \tag{1}
$$

But now, what if you did not only have the size of the house as a feature to predict the price, but also the **number of bedrooms**, **number of floors**, and **age of the home in years**?

That gives you much more information to predict the price.

---

### 🧮 Notation

Let:

- $\vec{x} = [x_1, x_2, x_3, x_4]$: the four input features (e.g. size, bedrooms, floors, age)  
- $x_j$: the $j$-th feature, with $j = 1, 2, ..., n$  
- $\vec{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)}]$: feature vector of the $i$-th training example  
- $n$: number of features (e.g., $n = 4$)

So for a training example \( i \), we have:

$$
f_{\vec{w},b}(\vec{x}^{(i)}) = w_1 x_1^{(i)} + w_2 x_2^{(i)} + w_3 x_3^{(i)} + w_4 x_4^{(i)} + b \tag{2}
$$

Or more generally, for \( n \) features:

$$
f_{\vec{w},b}(\vec{x}^{(i)}) = \sum_{j=1}^{n} w_j x_j^{(i)} + b \tag{3}
$$

---

### 📐 Vector Notation

Let:

- $\vec{w} = [w_1, w_2, ..., w_n]$: vector of weights  
- $\vec{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)}]$: feature vector  
- $b$: bias (a scalar)

Then the model becomes:

$$
f_{\vec{w},b}(\vec{x}^{(i)}) = \vec{w} \cdot \vec{x}^{(i)} + b \tag{4}
$$

This uses the **dot product** of two vectors:

$$
\vec{w} \cdot \vec{x}^{(i)} = \sum_{j=1}^{n} w_j x_j^{(i)} \tag{5}
$$

---

### 🧠 Name of the Model

This model is called **multiple linear regression** (not *multivariate regression* — that refers to predicting multiple outputs, which is a different topic).

It's the natural extension of **univariate linear regression** (with one feature) to multiple input features.

## 🧮 Vectorization

Vectorization is a technique to implement learning algorithms **more efficiently** — both in terms of **execution time** and **code clarity**. It allows you to take advantage of optimized **linear algebra libraries** (e.g., NumPy) and even **GPU acceleration**.

---

### Parameters and Features

Let:

- $\vec{w} = [w_1, w_2, w_3]$ with $n = 3$: weight vector  
- $\vec{x} = [x_1, x_2, x_3]$: feature vector  
- $b$: bias (a scalar)

In NumPy (Python):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
x = np.array([10, 20, 30])
b = 4
```

---

### ❌ Without Vectorization

**Manual computation (bad for large $n$):**

$$
f_{\vec{w}, b}(\vec{x}) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b
$$

**Using a for loop:**

```python
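n = w.shape[0]            # number of features
f = 0
for j in range(n):        # accumulate w_j * x_j one term at a time
    f += w[j] * x[j]
f += b                    # add the bias once, after the loop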
```

---

### ✅ With Vectorization

**Mathematical expression:**

$$
f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b
$$

**Python (NumPy):**

```python
f = np.dot(w, x) + b
```
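
As a quick sanity check, plugging in the example values defined earlier ($\vec{w} = [1.0, 2.5, -3.3]$, $\vec{x} = [10, 20, 30]$, $b = 4$) gives the same number by hand that the loop and `np.dot` versions should produce (up to floating-point rounding):

$$
\vec{w} \cdot \vec{x} + b = (1.0)(10) + (2.5)(20) + (-3.3)(30) + 4 = 10 + 50 - 99 + 4 = -35
$$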

---

### 🚀 Benefits of Vectorization

- ✅ **Shorter, cleaner code**
- ✅ **Faster execution**, especially for large $n$
- ✅ Uses optimized backend libraries (e.g., BLAS, LAPACK) and, with suitable libraries, parallel hardware such as GPUs
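
The speed difference is easy to measure. The sketch below times both versions on random vectors. The vector length, the seed, and the variable names are arbitrary choices for illustration, and the exact timings will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 200_000                      # assumed vector length for this demo
w = rng.random(n)
x = rng.random(n)
b = 4.0

# Loop version: one multiply-add per Python iteration
start = time.perf_counter()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_seconds = time.perf_counter() - start

# Vectorized version: a single call into NumPy's compiled code
start = time.perf_counter()
f_vec = np.dot(w, x) + b
vec_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vec_seconds:.4f} s")
print("same result:", np.isclose(f_loop, f_vec))
```

Both versions compute the same value; only the time taken differs.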

## Gradient Descent for Multiple Linear Regression

We're going to repeatedly update each parameter $w_j$ using the rule:

$$
w_j := w_j - \alpha \frac{\partial J(\vec{w}, b)}{\partial w_j}
$$

And the bias $b$ using:

$$
b := b - \alpha \frac{\partial J(\vec{w}, b)}{\partial b}
$$

Where:
- $\alpha$ is the learning rate  
- $J(\vec{w}, b)$ is the cost function  
- $w_j$ refers to the $j$-th weight in the parameter vector $\vec{w}$  

---

### 🧠 Intuition

In univariate regression (one feature), we had the update rules:

$$
w := w - \alpha \frac{\partial J(w, b)}{\partial w}
$$

$$
b := b - \alpha \frac{\partial J(w, b)}{\partial b}
$$

Now, with **multiple features** ($n \geq 2$), we update **each weight** $w_j$ individually:

$$
w_j := w_j - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} \right)
$$

And we update $b$ as:

$$
b := b - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) \right)
$$

Where:
- $f_{\vec{w}, b}(\vec{x}^{(i)}) = \vec{w} \cdot \vec{x}^{(i)} + b$ is the prediction
- $x_j^{(i)}$ is the $j$-th feature of the $i$-th example
- $m$ is the number of training examples
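
The update rules above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names and the tiny synthetic data set are made up for the example, and the cost is assumed to be the squared-error cost with the $\frac{1}{2m}$ convention, which is what produces the $\frac{1}{m}$ factor in the gradients.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b), assuming the 1/(2m) convention."""
    m = X.shape[0]
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * m)

def compute_gradient(X, y, w, b):
    """Return dJ/dw (shape (n,)) and dJ/db (scalar)."""
    m = X.shape[0]
    errors = X @ w + b - y          # f(x^(i)) - y^(i) for every example
    dj_dw = (X.T @ errors) / m      # one partial derivative per weight w_j
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Apply the simultaneous update rule num_iters times."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 (illustrative values)
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train @ np.array([2.0, 3.0]) + 1.0

w_fit, b_fit = gradient_descent(X_train, y_train, np.zeros(2), 0.0,
                                alpha=0.05, num_iters=20000)
print("w:", w_fit, "b:", b_fit)
print("final cost:", compute_cost(X_train, y_train, w_fit, b_fit))
```

Because the data was generated without noise, the fitted parameters should approach the generating values $w = [2, 3]$ and $b = 1$.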

---

### 🧪 Implementation note

Instead of hardcoding 3 or 4 parameters, you’ll now use **loops over $j = 1..n$** to update all weights $\vec{w}$. The update for $b$ stays similar.

---

### 🧮 Bonus: The Normal Equation

There's another method that can compute the **optimal weights $\vec{w}$ and bias $b$ without iterations**, called the **normal equation**. It's based on linear algebra.

✅ Pros:
- No need to choose learning rate $\alpha$
- No need for iteration

⚠️ Cons:
- Computationally expensive for large $n$
- Not generalizable to other algorithms (like logistic regression or neural networks)

That's why in practice, most implementations still use **gradient descent** or **advanced solvers**, even for linear regression.
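
For comparison, here is a minimal sketch of the normal equation. The notes above do not spell out the formula; the code uses the standard closed form, solving $(X_b^T X_b)\,\theta = X_b^T y$, where $X_b$ is the feature matrix with a column of ones appended so that the last entry of $\theta$ plays the role of $b$. The function name and the data are illustrative assumptions.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (Xb^T Xb) theta = Xb^T y, where Xb is X plus a column of ones."""
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])           # last column carries the bias b
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # assumes Xb^T Xb is invertible
    return theta[:-1], theta[-1]                   # (w, b)

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 (illustrative values)
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train @ np.array([2.0, 3.0]) + 1.0

w_fit, b_fit = normal_equation(X_train, y_train)
print("w:", w_fit, "b:", b_fit)
```

There is no learning rate and no loop, but the linear system grows with the number of features, which is the cost noted in the list above.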

## ⚖️ Feature Scaling

Feature scaling is a technique used to **normalize the range of independent variables or features** in your data. It ensures that all features contribute equally to the learning algorithm, especially for **gradient descent**, where features with larger scales can dominate the cost function and slow down convergence.

### 📌 Why is feature scaling important?

- Features may have **different units** (e.g., size in m², age in years, price in dollars).
- Without scaling, **gradient descent may converge slowly** or bounce back and forth across the cost surface.
- It helps to **improve numerical stability**.
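- Needed for algorithms that compute distances between points (e.g., **k-NN**, **SVM**). Logistic regression is not distance-based; it benefits from scaling because it is usually trained with gradient-based optimization.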

### ✅ When to scale

Feature scaling is **recommended** when:
- You're using **gradient-based algorithms** (like linear regression, logistic regression, neural networks).
- You have features with **different magnitudes** or units.
- Your model does not handle feature scaling internally.

---

### 🔢 Common Methods of Feature Scaling

Each method below puts the features on comparable ranges. The most common choices are:

- **Min-Max Scaling:** Subtract the minimum and divide by the range (max - min). Rescales features to lie between 0 and 1.
- **Mean Normalization:** Subtract the mean and divide by the range. Centers the data around 0.
- **Z-score Normalization (Standardization):** Subtract the mean and divide by the standard deviation. Ensures zero mean and unit variance.

| Method                 | Formula                                                  | Typical Range        |
|------------------------|----------------------------------------------------------|----------------------|
| **Min-Max Scaling**    | $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$            | $[0,\ 1]$            |
| **Mean Normalization** | $x' = \dfrac{x - \mu}{\max(x) - \min(x)}$                | $[-1,\ 1]$, mean = 0 |
| **Standardization**    | $x' = \dfrac{x - \mu}{\sigma}$                           | Mean = 0, Std = 1    |
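
The two range-based formulas in the table can be sketched in NumPy as shown here. The data values are hypothetical (four houses with size and bedroom count), and the variable names are chosen only for this example.

```python
import numpy as np

# Hypothetical data: 4 houses, columns = [size in sq ft, bedrooms]
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

x_min = X.min(axis=0)   # per-feature minimum
x_max = X.max(axis=0)   # per-feature maximum
mu    = X.mean(axis=0)  # per-feature mean

X_minmax   = (X - x_min) / (x_max - x_min)   # each column now spans [0, 1]
X_meannorm = (X - mu)    / (x_max - x_min)   # each column now has mean 0

print(X_minmax)
print(X_meannorm)
```

Using `axis=0` computes the statistics per column, so each feature is scaled independently of the others.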

---

### 🧠 Example in Python (Standardization)

```python
import numpy as np

# Assuming X is a NumPy array of shape (m, n)
mu     = np.mean(X, axis=0)   # mean of each feature (column)
sigma  = np.std(X, axis=0)    # standard deviation of each feature
X_norm = (X - mu) / sigma     # standardized features: zero mean, unit variance
```

## 📉 How to Tell if Gradient Descent is Converging

To verify if gradient descent is working properly (i.e., minimizing the cost function \( J(w, b) \)), it helps to analyze the **learning curve** — a plot of the cost function over iterations.

### 🔍 Key Concepts

- **Gradient Descent Update Rule:**
  $$ w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \quad b := b - \alpha \frac{\partial J(w, b)}{\partial b} $$

- **Learning Curve:**
  - Horizontal axis: Number of iterations
  - Vertical axis: Cost function \( J(w, b) \)
  - Each point: Value of the cost after one update of \( w \) and \( b \)
  - Purpose: Visualize if the cost is decreasing steadily

### ✅ Signs Gradient Descent is Working

- The cost \( J(w, b) \) **decreases after every iteration**
- The curve **smoothly flattens out**, indicating convergence
- Cost function becomes stable (i.e., little to no change)

### ❌ Signs Something is Wrong

- Cost **increases** after an iteration → likely:
  - Learning rate \( \alpha \) is too large
  - Bug in the implementation
- Curve is **not flattening** → may not be converging

### 📈 Convergence Criteria

- **Visual inspection** is the most reliable:
  - Helps detect problems early (e.g., bad learning rate)
  - Shows whether training should stop
- **Automatic test (epsilon rule):**
  - Define a small threshold \( \varepsilon \), e.g., 0.001
  - If the cost decreases by less than \( \varepsilon \) in one iteration, declare convergence
  - Less preferred due to difficulty in choosing a good \( \varepsilon \)
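
The epsilon rule can be sketched as below. This is an illustration under assumptions: the function names, the default values of `alpha`, `epsilon` and `max_iters`, and the synthetic data are all made up for the example. The recorded `history` list is also exactly what you would plot as a learning curve.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared-error cost, assuming the 1/(2m) convention."""
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * X.shape[0])

def run_until_converged(X, y, alpha=0.05, epsilon=1e-3, max_iters=10000):
    """Gradient descent that stops once the cost drops by less than epsilon."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = [cost(X, y, w, b)]
    for _ in range(max_iters):
        errors = X @ w + b - y
        w = w - alpha * (X.T @ errors) / m
        b = b - alpha * np.sum(errors) / m
        history.append(cost(X, y, w, b))
        # Epsilon rule; note this also fires if the cost went UP,
        # so inspect the history before trusting the result.
        if history[-2] - history[-1] < epsilon:
            break
    return w, b, history

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 (illustrative values)
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train @ np.array([2.0, 3.0]) + 1.0

w_fit, b_fit, history = run_until_converged(X_train, y_train)
print(f"stopped after {len(history) - 1} iterations, final cost {history[-1]:.4f}")
```

A larger `epsilon` stops earlier with a rougher fit, which is one reason the threshold is hard to choose in advance.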

### ⏳ Number of Iterations

- Varies **greatly** depending on the application:
  - Some models converge in 30 iterations
  - Others may take 1,000 or 100,000
- Hence, plotting the learning curve is more reliable than predefining a number

---

## 🔧 Choosing a Good Learning Rate (α)

Selecting the right learning rate is **crucial** for gradient descent to work effectively.

---

### 🧠 What Can Go Wrong?

#### 🔸 Learning rate too **large**:
- Gradient descent **does not converge**
- Cost function `J(w,b)` may **oscillate** or even **increase**
- Can **overshoot** the minimum repeatedly
- Possible graph shape: zig-zag or upward curve

#### 🔸 Learning rate too **small**:
- Gradient descent **converges very slowly**
- Training takes many iterations

---

### ✅ Good Gradient Descent Behavior

- Cost \( J(w,b) \) **decreases on every iteration**
- Learning curve is smooth and decreasing
- Converges towards a stable minimum

---

### 🧪 Debugging Tip

> 🔍 "If gradient descent isn’t working, set α to a very **small** value (e.g., `0.0001`) and check if cost decreases at every step. If it doesn't, you may have a bug (like a wrong sign in the update rule)."

---

### 🚨 Common Implementation Bug

Make sure the update rule is:
$$
w := w - \alpha \cdot \frac{\partial J}{\partial w}
$$
Not:
$$
w := w + \alpha \cdot \frac{\partial J}{\partial w} \quad \text{❌ Wrong!}
$$

---

### 📊 How to Choose α in Practice

Try a **range of learning rates**, e.g.:

- `α = 0.001`
- `α = 0.003` (3x larger)
- `α = 0.01`
- `α = 0.03`
- `α = 0.1`
- ...

For each:
- Run gradient descent for a few iterations
- Plot the **cost vs. iteration**
- Choose the largest α that **still gives consistent decrease**

---

### 🔁 Summary of Good Strategy

1. Start small (e.g., `0.001`)
2. Multiply by ~3 each time until cost no longer decreases
3. Pick the **largest α** that still causes consistent descent
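
The strategy above can be sketched as a small sweep. The helper name `cost_history`, the number of iterations, and the synthetic data are assumptions for illustration. On a real problem you would plot each cost history rather than only checking it programmatically.

```python
import numpy as np

def cost_history(X, y, alpha, num_iters=50):
    """Run gradient descent from zeros; return the cost after each iteration."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    for _ in range(num_iters):
        errors = X @ w + b - y
        w = w - alpha * (X.T @ errors) / m
        b = b - alpha * np.sum(errors) / m
        costs.append(np.sum((X @ w + b - y) ** 2) / (2 * m))
    return costs

# Synthetic data generated from y = 2*x1 + 3*x2 + 1 (illustrative values)
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = X_train @ np.array([2.0, 3.0]) + 1.0

results = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    costs = cost_history(X_train, y_train, alpha)
    # True only if the cost never goes up from one iteration to the next
    decreasing = all(later <= earlier for earlier, later in zip(costs, costs[1:]))
    results[alpha] = decreasing
    print(f"alpha={alpha:<6} final cost={costs[-1]:.3e} decreasing={decreasing}")

best_alpha = max(a for a, ok in results.items() if ok)
print("largest alpha with consistent decrease:", best_alpha)
```

On this particular data set the cost blows up for the largest candidate, which is the zig-zag or upward curve described earlier.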

---

### 📌 Recommendation

Also try the **optional lab** to:
- Experiment with different α values
- See effects of **feature scaling** on convergence
- Observe real plots of learning curves

## 🛠️ Feature Engineering

The choice and design of features can significantly impact the performance of a learning algorithm. In many real-world scenarios, **feature engineering** — the process of transforming raw input data into more meaningful representations — is a **critical step** to improve predictive performance.

---

### 📍 Example: Predicting House Prices

Suppose we have two original features:

- $x_1$: **Frontage** (width of the lot)
- $x_2$: **Depth** (length of the lot)

You could use these directly in a linear model:

$$
f_{\mathbf{w},b}(x) = w_1 x_1 + w_2 x_2 + b
$$

However, you might notice that **area** (frontage × depth) is more predictive than width or depth separately:

- Define a **new feature**: $x_3 = x_1 \cdot x_2$ (the lot area)

Now, your model becomes:

$$
f_{\mathbf{w},b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b
$$

This allows the algorithm to **learn the relative importance** of frontage, depth, and area.
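
Adding the engineered feature is a one-line operation in NumPy. The lot dimensions below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical lots: columns = [frontage, depth] in feet
X = np.array([[40.0, 100.0],
              [60.0,  80.0],
              [50.0, 120.0]])

area = X[:, 0] * X[:, 1]              # x3 = x1 * x2
X_eng = np.column_stack([X, area])    # columns = [frontage, depth, area]

print(X_eng.shape)   # (3, 3)
print(X_eng[:, 2])   # [4000. 4800. 6000.]
```

The model is then trained on `X_eng` exactly as before; only the number of features changes from 2 to 3.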

---

### 🔁 What is Feature Engineering?

Feature engineering is the process of:
- Creating **new features** from existing ones.
- Using **domain knowledge** or **intuition** to design better inputs.
- Making the data **easier** for the model to understand and learn from.

This can include:
- Mathematical transformations (e.g., area, ratios)
- Polynomial features (e.g., $x^2$, $x^3$)
- Encoding categorical variables
- Normalization or scaling

---

### 💡 Why It Matters

- Helps the model **fit better** (especially in small datasets)
- Can capture **non-linear relationships**
- Enables better **generalization** and predictive power

In the next section, we’ll see how **polynomial features** allow models to fit **non-linear functions** using linear regression.

## 📈 Polynomial Regression & Feature Engineering

### 🧠 Motivation

So far, linear regression fits **straight lines** to data. But what if your data exhibits **non-linear patterns**?

Using **polynomial regression**, you can extend linear models to capture **curves**, by introducing **higher-order features** (e.g., $x^2$, $x^3$, etc.).

---

### 🏠 Housing Price Example

Suppose you want to predict house prices from **size (x)**. A linear model may not fit the data well:

- **Linear model:**  
  $$
  f(x) = w_1x + b
  $$  
  → Poor fit for curved trends.

- **Quadratic model:**  
  $$
  f(x) = w_1x + w_2x^2 + b
  $$  
  → May fit better, but a quadratic that bends downward ($w_2 < 0$) eventually comes back down for large $x$ (not realistic for house prices).

- **Cubic model:**  
  $$
  f(x) = w_1x + w_2x^2 + w_3x^3 + b
  $$  
  → Can model increasing trends more realistically.

- **Square root model:**  
  $$
  f(x) = w_1x + w_2\sqrt{x} + b
  $$  
  → Models fast growth early, then flattens (another realistic option).

---

### 🛠️ What is Polynomial Regression?

Polynomial regression **extends linear regression** by adding powers of the input features as new features.  
Although the hypothesis is **non-linear in $x$**, it's **linear in the parameters** $w$.

> ✅ Still considered a linear model (in $w$), but allows modeling **non-linear relationships**.

Example for one feature:
- Input: $x$
- Features: $[x, x^2, x^3]$
- Model:  
  $$f_{\mathbf{w},b}(x) = w_1x + w_2x^2 + w_3x^3 + b$$
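
A minimal sketch of this idea: build the columns $[x, x^2, x^3]$ by hand and fit them with an ordinary linear solver. To keep the example short it uses NumPy's least-squares routine `np.linalg.lstsq` in place of gradient descent; the target function and variable names are illustrative assumptions.

```python
import numpy as np

x = np.arange(0.0, 20.0)     # one raw input feature
y = 1.0 + x ** 2             # illustrative curved target

# Engineered features [x, x^2, x^3]; the model stays linear in w
X_poly = np.column_stack([x, x ** 2, x ** 3])

# Fit w and b by least squares (a column of ones carries the bias b)
A = np.column_stack([X_poly, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:-1], theta[-1]

print("w:", np.round(w, 4))
print("b:", round(float(b), 4))
```

Since the target is exactly quadratic, the fit should place nearly all the weight on the $x^2$ column and leave the $x$ and $x^3$ weights close to zero.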

---

### ⚖️ Importance of Feature Scaling

As you raise features to higher powers:
- Their **range increases dramatically**
  - e.g. if $x \in [1, 1000]$  
    → $x^2 \in [1, 10^6]$, $x^3 \in [1, 10^9]$
- This leads to **unstable gradients** and slow convergence

✅ Apply **feature scaling** (like Z-score normalization) to the polynomial features before running gradient descent.
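
The effect is easy to see numerically. The sketch below builds polynomial columns for $x \in [1, 1000]$ and compares their ranges before and after z-score normalization. The sample grid and variable names are assumptions for the example.

```python
import numpy as np

x = np.linspace(1.0, 1000.0, 200)
X_poly = np.column_stack([x, x ** 2, x ** 3])

# Range (max - min) of each column before scaling: about 1e3, 1e6, 1e9
raw_ranges = X_poly.max(axis=0) - X_poly.min(axis=0)
print("raw ranges:   ", raw_ranges)

mu = X_poly.mean(axis=0)
sigma = X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma     # z-score each column separately

# After scaling, all three columns span only a few units
scaled_ranges = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print("scaled ranges:", scaled_ranges)
```

Note that the scaling is applied after the powers are computed, so that $x$, $x^2$ and $x^3$ each end up on a comparable scale.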

---

### 🎯 Summary: Key Ideas

- Polynomial regression = **linear regression + feature engineering**
- Helps fit **curved data**
- You can choose from many transformations:
  - $x^2$, $x^3$, $\sqrt{x}$, $\log(x)$, $x^{-1}$, etc.
- Your choice of features has a **huge impact** on model performance
- Try different feature sets and **evaluate model performance**
- Scikit-learn makes it easy to implement polynomial regression (via `PolynomialFeatures`)

---

### 🔍 Practice Lab

- Try coding your own polynomial regression
- Then test using `scikit-learn` to validate your implementation
- Explore how changing degree and learning rate ($\alpha$) affects performance

🎉 Congrats on finishing Week 2!  
Next up: **Classification**, where we move from predicting numbers to **predicting categories**.