# Linear Regression and Gradient Descent - Complete Guide

![Cover](Assets/Supervised_Learning/Capa.png)


## 1. Supervised Learning

The **Supervised Learning** process consists of taking the **Training Set** and feeding this data to our **Learning Algorithm**. The algorithm's job is to present us with a function that makes predictions.

By convention, this function is called a **hypothesis**. The hypothesis's job is to take information (features) it hasn't seen yet and estimate the output as accurately as possible.

---

## 2. Notation and Fundamental Concepts

### Dataset Notation

- $m$ = number of training examples (number of rows in the table)
- $n$ = number of features (input variables)
- $x$ = *input* / features (what we use to make predictions)
- $y$ = *output* / target value (what we want to predict)
- $(x, y)$ = a training example
- $(x^{(i)}, y^{(i)})$ = the i-th training example

### Parameters

$\theta$ (theta) are called **parameters**. The learning algorithm's job is to choose the parameters $\theta$ that allow for good predictions.

---

## 3. Example Dataset

![Example Dataset](Assets/Supervised_Learning/1.png)

Let's use a super simple dataset with only **3 houses** to understand the calculations:

| Size (m²) | Price (thousand R$) |
|-----------|---------------------|
| 50        | 150                 |
| 80        | 200                 |
| 110       | 250                 |

So we have:
- $m = 3$ (3 training examples)
- $n = 1$ (1 feature: size)
- $x^{(1)} = 50, \quad y^{(1)} = 150$
- $x^{(2)} = 80, \quad y^{(2)} = 200$
- $x^{(3)} = 110, \quad y^{(3)} = 250$

---

## 4. Hypothesis (Our Line)

### Basic Representation

In **Linear Regression** with one feature, the hypothesis is represented as:

$$h(x) = \theta_0 + \theta_1 x$$

![Hypothesis](Assets/Supervised_Learning/2.png)

In the example above, $\theta_0 = 1.5$ and $\theta_1 = 0$.

This has the same form as a first-degree equation, $f(x) = b + ax$.

Where:
- $\theta_0$ = intercept (where the line crosses the y-axis)
- $\theta_1$ = slope (angular coefficient)
- $x$ = input (size), i.e., what we want to use to make predictions

**Numerical example:** If $\theta_0 = 50$ and $\theta_1 = 2$, then:

$$h(x) = 50 + 2x$$

For an 80 m² house:
$$h(80) = 50 + 2(80) = 50 + 160 = 210 \text{ thousand R\$}$$

### Multiple Features

When we have more than one feature (variable), for example size and number of bedrooms, the hypothesis gains one term per feature:

$$h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2$$


Where:
- $x_1$ = size
- $x_2$ = number of bedrooms
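
The two-feature hypothesis can be sketched in NumPy as an intercept plus a dot product. The function name `hypothesis_multi`, the parameter values, and the bedroom count below are hypothetical, chosen only for illustration; they do not come from the dataset above.

```python
import numpy as np


def hypothesis_multi(x, theta):
    """Return h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n for one example."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # theta[0] is the intercept; theta[1:] pairs up with the features.
    return theta[0] + np.dot(theta[1:], x)


# Hypothetical parameters: intercept, price per m², price per bedroom.
theta = [50.0, 2.0, 10.0]
# Hypothetical house: 80 m², 3 bedrooms.
price = hypothesis_multi([80, 3], theta)
print(price)  # 50 + 2*80 + 10*3 = 240.0
```

Adding a feature only lengthens `theta` and `x`; the function itself stays the same.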

---

## 5. Cost Function - Measuring Error

The **Cost Function** $J(\theta)$ measures how far our predictions are from the actual values. For each example, we take the difference between the value predicted by the line ($h(x)$) and the actual value ($y$), square it, sum the results across all data points, and divide by $2m$:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$$

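The goal is to make this cost as small as possible.

For a single feature, the $\theta_0$ and $\theta_1$ that minimize $J$ can also be computed directly, where $\bar{x}$ and $\bar{y}$ are the means of the inputs and outputs: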

$$
\theta_1
=
\frac
{
\sum_{i=1}^{m}
\left(x^{(i)} - \bar{x}\right)
\left(y^{(i)} - \bar{y}\right)
}
{
\sum_{i=1}^{m}
\left(x^{(i)} - \bar{x}\right)^2
}
$$


$$
\theta_0 = \bar{y} - \theta_1 \bar{x}
$$
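
These two formulas can be evaluated directly on the three-house dataset. A minimal sketch using NumPy (the variable names are illustrative):

```python
import numpy as np

X = np.array([50.0, 80.0, 110.0])    # size (m²)
y = np.array([150.0, 200.0, 250.0])  # price (thousand R$)

x_bar, y_bar = X.mean(), y.mean()    # 80.0 and 200.0

# Slope: sum of co-deviations divided by sum of squared x-deviations.
theta1 = np.sum((X - x_bar) * (y - y_bar)) / np.sum((X - x_bar) ** 2)
# Intercept: the fitted line passes through (x_bar, y_bar).
theta0 = y_bar - theta1 * x_bar

print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2f}")
# theta0 = 66.67, theta1 = 1.67
```

Because the three points lie exactly on one line, this line reproduces every price in the table and has cost $J = 0$.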


### Step-by-Step Calculation Example

Let's calculate $J(\theta_0, \theta_1)$ for $\theta_0 = 50$ and $\theta_1 = 2$ using our dataset:

**Step 1:** Calculate predictions $h(x^{(i)})$
- $h(x^{(1)}) = 50 + 2(50) = 150$
- $h(x^{(2)}) = 50 + 2(80) = 210$
- $h(x^{(3)}) = 50 + 2(110) = 270$

**Step 2:** Calculate errors $(h(x^{(i)}) - y^{(i)})$
- Error 1: $150 - 150 = 0$
- Error 2: $210 - 200 = 10$
- Error 3: $270 - 250 = 20$

**Step 3:** Square the errors
- Error² 1: $0^2 = 0$
- Error² 2: $10^2 = 100$
- Error² 3: $20^2 = 400$

**Step 4:** Sum and divide by $2m$

$$J(50, 2) = \frac{1}{2(3)}(0 + 100 + 400) = \frac{500}{6} \approx 83.33$$

---

## 6. Gradient Descent - Finding the Best Parameters

![Gradient Descent](Assets/Supervised_Learning/3.png)


To minimize $J(\theta)$, we use **Gradient Descent**. Picture standing on a hillside: we look around in all directions (360°), take a small step in whichever direction goes downhill fastest, and repeat, adjusting $\theta_0$ and $\theta_1$ until we reach the smallest $J(\theta)$.

### Gradient Descent Algorithm

Repeat until convergence, updating every $\theta_j$ simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Where:
- $:=$ means *assignment*, i.e., the value is updated by a new one
- $\alpha$ is the **learning rate**
- $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function

### Learning Rate

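- A common starting point is $\alpha = 0.01$, but the right value depends on the scale of the data: if $\alpha$ is too large the updates overshoot and diverge, and if it is too small convergence is slow
- The gradient (the vector of partial derivatives) points in the direction of steepest *ascent*, so subtracting it moves the parameters in the direction of the *steepest descent*, i.e., going *downhill* as fast as possible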

### Partial Derivatives for Linear Regression

For Linear Regression, the derivatives are:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})$$

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$
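
These two expressions follow from applying the chain rule to $J$. A short derivation, using $h(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2\,(h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial h(x^{(i)})}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot \frac{\partial h(x^{(i)})}{\partial \theta_j}$$

Since $\frac{\partial h(x^{(i)})}{\partial \theta_0} = 1$ and $\frac{\partial h(x^{(i)})}{\partial \theta_1} = x^{(i)}$, substituting gives the two formulas above. The factor $\frac{1}{2}$ in $J$ is there to cancel the 2 produced by differentiating the square.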

### Detailed Numerical Example

Let's execute Gradient Descent manually with our dataset!

**Initial values:**
- $\theta_0 = 0$
- $\theta_1 = 0$
- $\alpha = 0.01$ (learning rate)

#### Iteration 1

**Step 1:** Calculate predictions with $\theta_0 = 0, \theta_1 = 0$
- $h(50) = 0 + 0(50) = 0$
- $h(80) = 0 + 0(80) = 0$
- $h(110) = 0 + 0(110) = 0$

**Step 2:** Calculate errors
- Error 1: $0 - 150 = -150$
- Error 2: $0 - 200 = -200$
- Error 3: $0 - 250 = -250$

**Step 3:** Calculate derivative for $\theta_0$

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{3}(-150 - 200 - 250) = \frac{-600}{3} = -200$$

**Step 4:** Calculate derivative for $\theta_1$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{3}[(-150)(50) + (-200)(80) + (-250)(110)]$$
$$= \frac{1}{3}[-7500 - 16000 - 27500] = \frac{-51000}{3} = -17000$$

**Step 5:** Update parameters

$$\theta_0 := 0 - 0.01(-200) = 0 + 2 = 2$$
$$\theta_1 := 0 - 0.01(-17000) = 0 + 170 = 170$$

Now we have: $\theta_0 = 2$ and $\theta_1 = 170$

#### Iteration 2

**Step 1:** Calculate predictions with $\theta_0 = 2, \theta_1 = 170$
- $h(50) = 2 + 170(50) = 8502$
- $h(80) = 2 + 170(80) = 13602$
- $h(110) = 2 + 170(110) = 18702$

**Step 2:** Calculate errors
- Error 1: $8502 - 150 = 8352$
- Error 2: $13602 - 200 = 13402$
- Error 3: $18702 - 250 = 18452$

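**Observe:** The predictions are now far worse than before. On this unscaled dataset, $\alpha = 0.01$ multiplied by such large derivatives produces enormous "jumps" that overshoot the minimum, and each further iteration overshoots by more: the algorithm diverges.

**With a sufficiently small learning rate** (or with scaled features), the cost decreases at every iteration and the values converge to the optimal values!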
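
The effect of the learning rate on these jumps can be checked numerically. This is a minimal sketch, assuming the same dataset and update rule as above; the helper names `cost` and `run` are illustrative, and $\alpha = 0.0001$ is a hypothetical smaller rate picked for comparison.

```python
import numpy as np

X = np.array([50.0, 80.0, 110.0])
y = np.array([150.0, 200.0, 250.0])


def cost(theta0, theta1):
    """J(theta0, theta1) = (1/2m) * sum((h(x) - y)^2)."""
    return np.mean((theta0 + theta1 * X - y) ** 2) / 2


def run(alpha, iterations):
    """Batch gradient descent from (0, 0); returns the cost after each step."""
    theta0, theta1, costs = 0.0, 0.0, []
    for _ in range(iterations):
        errors = theta0 + theta1 * X - y
        # Compute both gradients before updating (simultaneous update).
        d0, d1 = errors.mean(), (errors * X).mean()
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
        costs.append(cost(theta0, theta1))
    return costs


costs_large = run(alpha=0.01, iterations=5)
costs_small = run(alpha=0.0001, iterations=5)
print("alpha=0.01:  ", [f"{c:.3g}" for c in costs_large])
print("alpha=0.0001:", [f"{c:.3g}" for c in costs_small])
```

With $\alpha = 0.01$ the cost should grow at every step, while the smaller rate should shrink it. Even then, progress on $\theta_0$ tends to be slow on unscaled data, which is the problem feature scaling addresses.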

---

## 7. Types of Gradient Descent

### Batch Gradient Descent

![Batch Gradient Descent](Assets/Supervised_Learning/4.png)

Uses **all the data** in each iteration. This is what we did above:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$

Here $x_0^{(i)} = 1$, so the same rule also covers $\theta_0$. Each gradient descent step requires going through the **entire dataset**, which becomes very time-consuming when the dataset is very large.
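
For any number of features, the batch update can be written as one matrix expression. A minimal sketch, assuming a design matrix whose first column is all ones (so that $x_0^{(i)} = 1$ pairs with $\theta_0$); the function name `batch_gradient_step` is illustrative, and the data is min-max scaled here only so that a moderate $\alpha$ stays stable.

```python
import numpy as np


def batch_gradient_step(A, y, theta, alpha):
    """One batch update: theta_j -= alpha * (1/m) * sum((h(x) - y) * x_j)."""
    m = len(y)
    errors = A @ theta - y        # h(x^(i)) - y^(i) for all i at once
    gradient = A.T @ errors / m   # one entry per parameter theta_j
    return theta - alpha * gradient


# Three houses with size min-max scaled to [0, 1]; prices scaled likewise.
x_scaled = np.array([0.0, 0.5, 1.0])
y_scaled = np.array([0.0, 0.5, 1.0])
A = np.column_stack([np.ones_like(x_scaled), x_scaled])  # columns: x_0 = 1, x_1

theta = np.zeros(2)
for _ in range(2000):
    theta = batch_gradient_step(A, y_scaled, theta, alpha=0.1)
print(theta)  # should approach [0, 1] on this scaled data
```

Every call touches all rows of `A`, which is exactly the per-step cost that grows with the size of the dataset.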

### Stochastic Gradient Descent (SGD)

![Stochastic Gradient Descent](Assets/Supervised_Learning/5.png)

To overcome the Batch GD problem, there's **Stochastic Gradient Descent**, which uses **only 1 example** at a time, randomly chosen:

$$\theta_j := \theta_j - \alpha (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$

SGD takes a random house, predicts its price, and adjusts the parameters using that single error; it then repeats with another random house. The path is noisy, but it keeps iterating until it finds, or gets very close to, the *global optimum*.

**Advantage:** Much faster for large datasets!
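
A single-example update loop can be sketched as follows. This is an illustrative sketch, not a tuned implementation: it assumes min-max scaled data, a fixed seed for repeatability, and hypothetical choices of $\alpha = 0.1$ and 3000 steps.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable

# Min-max scaled version of the three-house dataset.
x_scaled = np.array([0.0, 0.5, 1.0])
y_scaled = np.array([0.0, 0.5, 1.0])

theta0, theta1, alpha = 0.0, 0.0, 0.1
for _ in range(3000):
    i = rng.integers(len(x_scaled))   # pick one example at random
    error = theta0 + theta1 * x_scaled[i] - y_scaled[i]
    # Update both parameters from this single example's error.
    theta0, theta1 = theta0 - alpha * error, theta1 - alpha * error * x_scaled[i]

print(f"theta0 = {theta0:.4f}, theta1 = {theta1:.4f}")  # should be near 0 and 1
```

Each step reads one row instead of the whole dataset, so the cost of an update does not depend on $m$.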

---

## 8. Final Result

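After enough iterations, Gradient Descent approaches the optimal values; how many it takes depends on the learning rate and on whether the features are scaled. For our dataset, the optimal values are approximately:

$$\theta_0 \approx 66.67$$
$$\theta_1 \approx 1.67$$

So our final line is:

$$h(x) = 66.67 + 1.67x$$

The three houses lie exactly on this line (for example, $h(50) \approx 66.67 + 83.33 = 150$), so it gives $J = 0$.

**Prediction example:** To predict the price of a 90 m² house:

$$h(90) \approx 66.67 + 1.6667(90) \approx 66.67 + 150.00 = 216.67 \text{ thousand R\$}$$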

---

The code below implements each step of this guide.

# Code Example 1

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# ============================================
# 1. DATASET - 3 HOUSES
# ============================================
print("=" * 60)
print("1. EXAMPLE DATASET")
print("=" * 60)

# Our 3 data points
X = np.array([50, 80, 110])   # Size (m²)
y = np.array([150, 200, 250])  # Price (thousand R$)
m = len(X)  # number of examples (m)
n = 1       # number of features (n)

print(f"Number of examples (m): {m}")
print(f"Number of features (n): {n}")
print(f"\nData:")
for i in range(m):
    print(f"  x^({i+1}) = {X[i]:3d} m²  →  y^({i+1}) = {y[i]:3d} thousand R$")
print()

In [1]:
# ============================================
# 2. HYPOTHESIS FUNCTION
# ============================================
print("=" * 60)
print("2. HYPOTHESIS h(x) = θ₀ + θ₁x")
print("=" * 60)


def hypothesis(X, theta0, theta1):
    """Calculate h(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * X


# Example with θ₀ = 50, θ₁ = 2
theta0_example = 50
theta1_example = 2
x_example = 80

h_example = hypothesis(x_example, theta0_example, theta1_example)
print(f"Example: h(x) = {theta0_example} + {theta1_example}x")
print(f"For x = {x_example}m²:")
print(f"h({x_example}) = {theta0_example} + {theta1_example}({x_example}) = {h_example} thousand R$")
print()

2. HYPOTHESIS h(x) = θ₀ + θ₁x
Example: h(x) = 50 + 2x
For x = 80m²:
h(80) = 50 + 2(80) = 210 thousand R$



In [2]:
# ============================================
# 3. COST FUNCTION (LOSS FUNCTION)
# ============================================
print("=" * 60)
print("3. COST FUNCTION J(θ₀, θ₁)")
print("=" * 60)


def cost_function(X, y, theta0, theta1):
    """
    Calculate J(theta0, theta1) = (1/2m) * sum((h(x) - y)²)
    """
    m = len(X)
    predictions = hypothesis(X, theta0, theta1)
    errors = predictions - y
    squared_errors = errors ** 2
    cost = (1 / (2 * m)) * np.sum(squared_errors)
    return cost


# Example of manual calculation with θ₀=50, θ₁=2
print(f"Calculating J({theta0_example}, {theta1_example}):\n")

predictions = hypothesis(X, theta0_example, theta1_example)
print("Step 1 - Predictions h(x^(i)):")
for i in range(m):
    print(
        f"  h(x^({i+1})) = {theta0_example} + {theta1_example}({X[i]}) = {predictions[i]:.0f}")

errors = predictions - y
print("\nStep 2 - Errors (h(x^(i)) - y^(i)):")
for i in range(m):
    print(f"  Error {i+1}: {predictions[i]:.0f} - {y[i]} = {errors[i]:.0f}")

squared_errors = errors ** 2
print("\nStep 3 - Squared errors:")
for i in range(m):
    print(f"  Error² {i+1}: ({errors[i]:.0f})² = {squared_errors[i]:.0f}")

cost = cost_function(X, y, theta0_example, theta1_example)
print(f"\nStep 4 - Sum and divide by 2m:")
print(
    f"  J({theta0_example}, {theta1_example}) = (1/6)({squared_errors[0]:.0f} + {squared_errors[1]:.0f} + {squared_errors[2]:.0f})")
print(f"  J({theta0_example}, {theta1_example}) = {cost:.2f}")
print()


In [3]:
# ============================================
# 4. GRADIENT DESCENT - IMPLEMENTATION
# ============================================
print("=" * 60)
print("4. GRADIENT DESCENT")
print("=" * 60)


def compute_gradients(X, y, theta0, theta1):
    """
    Calculate the partial derivatives:
    ∂J/∂θ₀ = (1/m) * sum(h(x) - y)
    ∂J/∂θ₁ = (1/m) * sum((h(x) - y) * x)
    """
    m = len(X)
    predictions = hypothesis(X, theta0, theta1)
    errors = predictions - y

    d_theta0 = (1/m) * np.sum(errors)
    d_theta1 = (1/m) * np.sum(errors * X)

    return d_theta0, d_theta1


def gradient_descent(X, y, theta0_init, theta1_init, alpha, iterations, verbose=True):
    """
    Execute Gradient Descent (BATCH)

    θⱼ := θⱼ - α * (∂J/∂θⱼ)
    """
    theta0 = theta0_init
    theta1 = theta1_init
    m = len(X)
    history = []

    if verbose:
        print(f"Initial values: θ₀ = {theta0}, θ₁ = {theta1}")
        print(f"Learning rate (α): {alpha}")
        print(f"Iterations: {iterations}\n")

    for i in range(iterations):
        # Calculate predictions
        predictions = hypothesis(X, theta0, theta1)

        # Calculate errors
        errors = predictions - y

        # Calculate partial derivatives (gradients)
        d_theta0, d_theta1 = compute_gradients(X, y, theta0, theta1)

        # Update parameters using gradient descent rule
        theta0 = theta0 - alpha * d_theta0
        theta1 = theta1 - alpha * d_theta1

        # Calculate cost
        cost = cost_function(X, y, theta0, theta1)

        # Save to history
        history.append((theta0, theta1, cost))

        # Show detailed progress for the first 2 iterations
        if verbose and i < 2:
            print(f"{'─' * 50}")
            print(f"ITERATION {i+1}")
            print(f"{'─' * 50}")
            print(
                f"Predictions h(x): [{predictions[0]:.2f}, {predictions[1]:.2f}, {predictions[2]:.2f}]")
            print(
                f"Errors (h(x) - y): [{errors[0]:.2f}, {errors[1]:.2f}, {errors[2]:.2f}]")
            print(f"\nPartial derivatives:")
            print(
                f"  ∂J/∂θ₀ = (1/{m}) × ({errors[0]:.2f} + {errors[1]:.2f} + {errors[2]:.2f})")
            print(f"         = {d_theta0:.4f}")
            print(
                f"\n  ∂J/∂θ₁ = (1/{m}) × ({errors[0]:.2f}×{X[0]} + {errors[1]:.2f}×{X[1]} + {errors[2]:.2f}×{X[2]})")
            print(f"         = {d_theta1:.4f}")
            print(f"\nParameter update:")
            prev_theta0 = theta0 + alpha * d_theta0
            prev_theta1 = theta1 + alpha * d_theta1
            print(
                f"  θ₀ := {prev_theta0:.4f} - {alpha} × {d_theta0:.4f} = {theta0:.4f}")
            print(
                f"  θ₁ := {prev_theta1:.4f} - {alpha} × {d_theta1:.4f} = {theta1:.4f}")
            print(f"\nCost J(θ) = {cost:.4f}\n")

    return theta0, theta1, history


# ⚠️ IMPORTANT: the learning rate must suit the scale of the data!
print("⚠️  IMPORTANT: CHOOSING THE CORRECT LEARNING RATE")
print("─" * 60)
print("For this dataset (large values: 50-250), we need")
print("a VERY SMALL learning rate to avoid divergence!\n")
print("Tested values:")
print("  α = 0.01    → EXPLODES! ❌")
print("  α = 0.001   → EXPLODES! ❌")
print("  α = 0.0001  → Stable, but θ₀ converges very slowly")
print("  α = 0.00001 → Stable, but slower still\n")

# Run Gradient Descent with appropriate learning rate
theta0_final, theta1_final, history = gradient_descent(
    X, y,
    theta0_init=0,
    theta1_init=0,
    alpha=0.00001,  # small enough to stay stable on unscaled data
    iterations=50000,  # More iterations needed
    verbose=True
)

print("=" * 60)
print(f"FINAL RESULT AFTER {len(history)} ITERATIONS")
print("=" * 60)
print(f"θ₀ (intercept) = {theta0_final:.4f}")
print(f"θ₁ (slope) = {theta1_final:.4f}")
print(f"Final cost J(θ) = {history[-1][2]:.4f}")
print(f"\nFinal line equation:")
print(f"h(x) = {theta0_final:.2f} + {theta1_final:.2f}x\n")


In [4]:
# ============================================
# 5. FEATURE SCALING (NORMALIZATION)
# ============================================
print("=" * 60)
print("5. FEATURE SCALING - THE SOLUTION FOR LEARNING RATE")
print("=" * 60)

print("\n💡 Why normalize the data?")
print("─" * 60)
print("When X and y values are large (50-250), the gradients")
print("become enormous, forcing the use of tiny learning rates.")
print("\nSOLUTION: Normalize data to the scale [0, 1] or [-1, 1]\n")

# Normalize using Min-Max scaling
X_norm = (X - X.min()) / (X.max() - X.min())
y_norm = (y - y.min()) / (y.max() - y.min())

print(f"Original data:")
print(f"  X = {X}")
print(f"  y = {y}")
print(f"\nNormalized data:")
print(f"  X_norm = {X_norm}")
print(f"  y_norm = {y_norm}\n")

# Run GD with normalized data and larger α
print("Running GD with NORMALIZED data and α = 0.1 (10,000x larger!):")
theta0_norm, theta1_norm, history_norm = gradient_descent(
    X_norm, y_norm,
    theta0_init=0,
    theta1_init=0,
    alpha=0.1,  # Now we can use much larger α!
    iterations=1000,
    verbose=False
)

print(f"\nResult (normalized data):")
print(f"  θ₀ = {theta0_norm:.4f}")
print(f"  θ₁ = {theta1_norm:.4f}")
print(f"  Final cost = {history_norm[-1][2]:.6f}")
print(f"  Converged in only 1000 iterations! ✓\n")

# Denormalize parameters to get original equation
# h(x) = theta0 + theta1 * x
# Denormalization: y = y_min + (y_max - y_min) * y_norm
# x_norm = (x - x_min) / (x_max - x_min)
theta1_original = theta1_norm * (y.max() - y.min()) / (X.max() - X.min())
theta0_original = y.min() + theta0_norm * (y.max() - y.min()) - \
    theta1_original * X.min()

print("Converting back to original scale:")
print(f"  θ₀ = {theta0_original:.4f}")
print(f"  θ₁ = {theta1_original:.4f}")
print(f"  (Should be close to θ₀≈66.67, θ₁≈1.67)\n")


In [5]:
# ============================================
# 6. MAKING PREDICTIONS WITH FINAL MODEL
# ============================================
print("=" * 60)
print("6. MAKING PREDICTIONS WITH TRAINED MODEL")
print("=" * 60)

test_sizes = [60, 90, 120]
for size in test_sizes:
    prediction = hypothesis(size, theta0_final, theta1_final)
    print(
        f"House of {size:3d}m² → Estimated price: R$ {prediction:.2f} thousand")
print()


In [6]:
# ============================================
# 7. VISUALIZATIONS
# ============================================
print("=" * 60)
print("7. GENERATING PLOTS")
print("=" * 60)

fig = plt.figure(figsize=(16, 10))

# 7.1 - Original data + Fitted line
ax1 = plt.subplot(2, 3, 1)
plt.scatter(X, y, color='red', s=150, marker='o',
            label='Real data', zorder=3)
x_line = np.linspace(40, 120, 100)
y_line = hypothesis(x_line, theta0_final, theta1_final)
plt.plot(x_line, y_line, color='blue', linewidth=2.5,
         label=f'h(x) = {theta0_final:.2f} + {theta1_final:.2f}x')
plt.xlabel('Size (m²)', fontsize=11, fontweight='bold')
plt.ylabel('Price (thousand R$)', fontsize=11, fontweight='bold')
plt.title('Non-Normalized Data + Fitted Line',
          fontsize=12, fontweight='bold')
plt.legend(fontsize=9)
plt.grid(True, alpha=0.3)

# 7.2 - Convergence of Cost J(θ) - Non-normalized data
ax2 = plt.subplot(2, 3, 2)
costs = [h[2] for h in history]
plt.plot(costs, color='green', linewidth=2)
plt.xlabel('Iteration', fontsize=11, fontweight='bold')
plt.ylabel('Cost J(θ)', fontsize=11, fontweight='bold')
plt.title(f'Convergence (α={0.00001})', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)

# 7.3 - Parameter trajectory - Non-normalized data
ax3 = plt.subplot(2, 3, 3)
theta0_history = [h[0] for h in history]
theta1_history = [h[1] for h in history]
plt.plot(theta0_history, theta1_history, 'o-', markersize=1,
         linewidth=1, alpha=0.6, color='purple')
plt.plot(theta0_history[0], theta1_history[0], 'go',
         markersize=10, label='Start', zorder=5)
plt.plot(theta0_history[-1], theta1_history[-1],
         'ro', markersize=10, label='End', zorder=5)
plt.xlabel('θ₀', fontsize=11, fontweight='bold')
plt.ylabel('θ₁', fontsize=11, fontweight='bold')
plt.title('Parameter Trajectory', fontsize=12, fontweight='bold')
plt.legend(fontsize=9)
plt.grid(True, alpha=0.3)

# 7.4 - NORMALIZED data + Fitted line
ax4 = plt.subplot(2, 3, 4)
plt.scatter(X_norm, y_norm, color='red', s=150, marker='o',
            label='Normalized data', zorder=3)
x_line_norm = np.linspace(0, 1, 100)
y_line_norm = hypothesis(x_line_norm, theta0_norm, theta1_norm)
plt.plot(x_line_norm, y_line_norm, color='blue', linewidth=2.5)
plt.xlabel('Size (normalized)', fontsize=11, fontweight='bold')
plt.ylabel('Price (normalized)', fontsize=11, fontweight='bold')
plt.title('NORMALIZED Data + Fitted Line', fontsize=12, fontweight='bold')
plt.legend(fontsize=9)
plt.grid(True, alpha=0.3)

# 7.5 - Convergence of Cost J(θ) - Normalized data
ax5 = plt.subplot(2, 3, 5)
costs_norm = [h[2] for h in history_norm]
plt.plot(costs_norm, color='green', linewidth=2)
plt.xlabel('Iteration', fontsize=11, fontweight='bold')
plt.ylabel('Cost J(θ)', fontsize=11, fontweight='bold')
plt.title(
    f'NORMALIZED Convergence (α={0.1})', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)

# 7.6 - Convergence comparison
ax6 = plt.subplot(2, 3, 6)
# Downsample the non-normalized run to about 1000 points for comparison
costs_sample = costs[::len(costs)//1000] if len(costs) > 1000 else costs
costs_norm_sample = costs_norm
plt.plot(range(len(costs_sample)), costs_sample,
         label='Without normalization', linewidth=2, alpha=0.7)
plt.plot(range(len(costs_norm_sample)), costs_norm_sample,
         label='With normalization', linewidth=2, alpha=0.7)
plt.xlabel('Iteration (adjusted scale)', fontsize=11, fontweight='bold')
plt.ylabel('Cost J(θ)', fontsize=11, fontweight='bold')
plt.title('Comparison: With vs Without Normalization',
          fontsize=12, fontweight='bold')
plt.legend(fontsize=9)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Plots generated!")
print("  • Top row: NON-normalized data")
print("  • Bottom row: NORMALIZED data\n")


In [7]:
# ============================================
# 8. COMPARISON: DIFFERENT LEARNING RATES
# ============================================
print("=" * 60)
print("8. LEARNING RATE IMPACT (Normalized Data)")
print("=" * 60)

alphas = [0.01, 0.1, 0.5]
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for idx, alpha in enumerate(alphas):
    _, _, hist = gradient_descent(
        X_norm, y_norm, 0, 0, alpha, 1000, verbose=False)
    costs = [h[2] for h in hist]

    axes[idx].plot(costs, linewidth=2.5)
    axes[idx].set_xlabel('Iteration', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Cost J(θ)', fontsize=11, fontweight='bold')
    axes[idx].set_title(f'α = {alpha}', fontsize=13, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

    final_cost = costs[-1] if not np.isinf(costs[-1]) else "inf"
    if isinstance(final_cost, float):
        axes[idx].text(0.98, 0.98, f'Final: {final_cost:.6f}',
                       transform=axes[idx].transAxes,
                       fontsize=9, verticalalignment='top', horizontalalignment='right',
                       bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    print(f"α = {alpha:4.2f}  →  Final cost = {final_cost if isinstance(final_cost, str) else f'{final_cost:.6f}'}")

plt.tight_layout()
plt.show()
print()


In [8]:
# ============================================
# 9. STOCHASTIC GRADIENT DESCENT (SGD)
# ============================================
print("=" * 60)
print("9. STOCHASTIC GRADIENT DESCENT")
print("=" * 60)


def stochastic_gradient_descent(X, y, theta0_init, theta1_init, alpha, iterations):
    """
    SGD: Update θ using only 1 random example per iteration
    """
    theta0 = theta0_init
    theta1 = theta1_init
    m = len(X)
    history = []

    for i in range(iterations):
        # Choose random example
        idx = np.random.randint(0, m)
        x_i = X[idx]
        y_i = y[idx]

        # Prediction and error for this example
        prediction = hypothesis(x_i, theta0, theta1)
        error = prediction - y_i

        # Update using only this example
        theta0 = theta0 - alpha * error
        theta1 = theta1 - alpha * error * x_i

        # Calculate cost with all data (for monitoring)
        cost = cost_function(X, y, theta0, theta1)
        history.append((theta0, theta1, cost))

    return theta0, theta1, history


# Run SGD with normalized data
theta0_sgd, theta1_sgd, history_sgd = stochastic_gradient_descent(
    X_norm, y_norm, 0, 0, 0.1, 1000
)

print("Comparison Batch GD vs Stochastic GD (normalized data):")
print(f"\nBatch GD:")
print(f"  θ₀ = {theta0_norm:.4f}, θ₁ = {theta1_norm:.4f}")
print(f"  Final cost = {history_norm[-1][2]:.6f}")

print(f"\nStochastic GD:")
print(f"  θ₀ = {theta0_sgd:.4f}, θ₁ = {theta1_sgd:.4f}")
print(f"  Final cost = {history_sgd[-1][2]:.6f}")

# Visualize comparison
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
costs_batch_norm = [h[2] for h in history_norm]
costs_sgd = [h[2] for h in history_sgd]
plt.plot(costs_batch_norm, label='Batch GD', linewidth=2.5, color='blue')
plt.plot(costs_sgd, label='Stochastic GD',
         linewidth=2, alpha=0.8, color='orange')
plt.xlabel('Iteration', fontsize=12, fontweight='bold')
plt.ylabel('Cost J(θ)', fontsize=12, fontweight='bold')
plt.title('Batch GD vs Stochastic GD', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
theta0_norm_hist = [h[0] for h in history_norm]
theta1_norm_hist = [h[1] for h in history_norm]
theta0_sgd_hist = [h[0] for h in history_sgd]
theta1_sgd_hist = [h[1] for h in history_sgd]
plt.plot(theta0_norm_hist, theta1_norm_hist, 'o-', markersize=2, linewidth=1.5,
         alpha=0.7, label='Batch GD', color='blue')
plt.plot(theta0_sgd_hist, theta1_sgd_hist, 'o-', markersize=2, linewidth=1.5,
         alpha=0.7, label='Stochastic GD', color='orange')
plt.xlabel('θ₀', fontsize=12, fontweight='bold')
plt.ylabel('θ₁', fontsize=12, fontweight='bold')
plt.title('Trajectory: Batch vs Stochastic', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("• Batch GD: smooth and deterministic convergence")
print("• Stochastic GD: more 'noise' but still converges!")
print("• With NORMALIZED data, both work much better!\n")

print("=" * 60)
print("✅ CODE EXECUTED SUCCESSFULLY!")
print("=" * 60)
print("\n💡 IMPORTANT LESSONS:")
print("1. Learning rate is VERY important!")
print("2. Data normalization = faster convergence")
print("3. With normalization, we can use much larger α")
print("4. Without normalization: α needs to be tiny (0.00001)")
print("5. With normalization: α can be 0.1 (10,000x larger!)")
