# Linear Regression

## Fundamental Concept

Linear Regression is a machine learning algorithm whose main objective is to find the line that best fits the available data points. The line rarely passes through every point; instead, it is the one that minimizes the overall prediction error. The method models a linear relationship between input and output variables, which lets us make predictions on new data.

## Supervised Learning

![Cover](../Assets/linear_regression/input.png)

Linear Regression is classified as a **supervised learning** algorithm because it uses a training dataset containing both input *features* and the corresponding output *targets*. During training, the model learns to map these known inputs to their known outputs.

## Mathematical Model

Once the model is trained, it becomes capable of predicting values $\hat{y}$ (*y-hat*) for new inputs $x$. This prediction can be mathematically represented by the following function:

![Cover](../Assets/linear_regression/w_e_b.png)

$$f(x) = wx + b$$

Where:
- $f(x)$ or $\hat{y}$ represents the predicted value
- $w$ is the slope of the line
- $b$ is the intercept of the line
- $x$ is the input feature
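As a small illustration of this function, the sketch below evaluates $f(x) = wx + b$ for a few inputs. The function name `predict` and the parameter values are assumptions chosen only for the example; they are not fitted to any dataset.

```python
def predict(x, w, b):
    """Return the model's prediction f(x) = w * x + b for a single input."""
    return w * x + b


# Hypothetical parameters: slope 2.0 and intercept 1.0
w, b = 2.0, 1.0
predictions = [predict(x, w, b) for x in [0.0, 1.0, 2.5]]
print(predictions)  # [1.0, 3.0, 6.0]
```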

## Model Parameters

The values chosen for parameters $w$ and $b$ are fundamental, as they completely determine the model's behavior. Specifically, these parameters define the prediction value $\hat{y}_i$ for each example $i$, based on the corresponding input feature $x_i$. The optimization of these parameters during training is what enables the model to make accurate predictions.

## Cost Function

To find the best values for $w$ and $b$, we minimize what we call the **cost function**. The goal is to choose values of $w$ and $b$ such that the prediction $\hat{y}_i$ is as close as possible to the actual value $y_i$ for all examples in the training set.

### Building the Cost Function

The cost function works by comparing the actual value $y_i$ with the predicted value $\hat{y}_i$. The difference between these values is expressed as:

$$\hat{y}_i - y_i$$

This difference is called the prediction **error**.

To obtain a more useful metric, we square this error. This serves two purposes: eliminating negative values (so that positive and negative errors don't cancel each other out) and penalizing larger errors more heavily:

$$(\hat{y}_i - y_i)^2$$

Since we want to quantify the error over the entire dataset, we sum the squared errors of all training examples:

$$\sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

where $m$ represents the total number of training examples or datapoints.

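Finally, we average these squared errors by dividing by $m$ and include an extra factor of $\frac{1}{2}$, so the sum is divided by $2m$. The extra factor does not change which values of $w$ and $b$ minimize the cost; it only simplifies the mathematics later, because it cancels the $2$ produced when taking the partial derivative of the squared term.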

Since $\hat{y}_i$ can be represented as $f(x_i)$, the complete **cost function** is expressed as:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$$

This function is the **Mean Squared Error** (MSE) scaled by $\frac{1}{2}$, and it is the most commonly used cost function in Linear Regression problems.
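To make the formula concrete, here is a minimal sketch that evaluates $J(w, b)$ on three made-up points lying exactly on the line $y = 2x + 1$. The name `compute_cost` and the toy data are assumptions for illustration only.

```python
import numpy as np


def compute_cost(x, y, w, b):
    """Cost J(w, b) = 1/(2m) * sum((w*x_i + b - y_i)^2), vectorized."""
    m = len(x)
    errors = (w * x + b) - y  # prediction error for every example
    return np.sum(errors ** 2) / (2 * m)


# Toy data (made up for illustration) that lies exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

cost_perfect = compute_cost(x, y, w=2.0, b=1.0)  # the line fits exactly
cost_zero = compute_cost(x, y, w=0.0, b=0.0)     # predicts 0 everywhere
print(cost_perfect, cost_zero)
```

With $w = 2$ and $b = 1$ every error is zero, so the cost is $0$. With $w = 0$ and $b = 0$ the errors are $-3$, $-5$, and $-7$, giving $J = \frac{9 + 25 + 49}{2 \cdot 3} = \frac{83}{6} \approx 13.83$.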

## Gradient Descent

To minimize the cost function, we use an algorithm called **Gradient Descent**. The process is relatively simple: we start with initial values for $w$ and $b$ (commonly $w = 0$ and $b = 0$).

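Next, we repeatedly update the parameters $w$ and $b$ in small steps with the goal of reducing the cost. We continue this iterative process until the cost stops decreasing, that is, until we reach a minimum. For Linear Regression with the squared-error cost, the cost surface is convex (bowl-shaped), so this minimum is the global minimum rather than just a local one. When the algorithm reaches this point, we say it has **converged**.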

![Gradient Descent](../Assets/linear_regression/gradient.png)

The process can be compared to descending a mountain: each step takes us closer to the bottom of the valley. The direction at each step is determined by the **gradient**, which always points in the direction of steepest ascent.

![Gradient Descent Steps](../Assets/linear_regression/steps.png)

To minimize cost, we move in the opposite direction of the gradient, that is, we take steps going downhill.

### Gradient Descent Formula

The parameter update at each iteration follows these equations:

$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$

$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$

Where:
- $w$ and $b$ are updated simultaneously at each iteration
- $\alpha$ is the **learning rate**
- $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ are the partial derivatives of the cost function

### Learning Rate

Choosing a good value for $\alpha$ is crucial:

- **$\alpha$ too small**: The algorithm will take very short steps, making convergence slow and time-consuming
- **$\alpha$ too large**: The algorithm may "jump" over the minimum, failing to converge or even diverging

![Learning Rate](../Assets/linear_regression/alpha.png)
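The effect of $\alpha$ can be seen on a deliberately simple problem. The sketch below minimizes the one-parameter cost $J(w) = w^2$, whose gradient is $2w$, instead of the regression cost; the function name `run_gradient_descent` and the three values of $\alpha$ are assumptions picked to show each behavior.

```python
def run_gradient_descent(alpha, steps=20, w_start=1.0):
    """Minimize J(w) = w**2 (gradient 2*w) and return the final w."""
    w = w_start
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w


w_small = run_gradient_descent(alpha=0.001)  # barely moves away from 1.0
w_good = run_gradient_descent(alpha=0.3)     # ends very close to the minimum at 0
w_large = run_gradient_descent(alpha=1.1)    # overshoots more each step and diverges
print(w_small, w_good, w_large)
```

Each update multiplies $w$ by $(1 - 2\alpha)$, so the iterates shrink toward $0$ only when $|1 - 2\alpha| < 1$. For this toy cost that means $0 < \alpha < 1$; the safe range for a real dataset depends on the data.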

## Calculating Partial Derivatives

To implement Gradient Descent, we need to calculate the partial derivatives of the cost function $J(w,b)$ with respect to $w$ and $b$. We start by recalling our cost function:

$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$

Since $f(x_i) = wx_i + b$, we can rewrite:

$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} ((wx_i + b) - y_i)^2$

### Partial Derivative with respect to $w$

We differentiate $J(w,b)$ with respect to $w$, starting from the definition:

$\frac{\partial J(w,b)}{\partial w} = \frac{\partial}{\partial w} \left[\frac{1}{2m} \sum_{i=1}^{m} ((wx_i + b) - y_i)^2\right]$

The constant $\frac{1}{2m}$ can be taken out of the derivative:

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial w} ((wx_i + b) - y_i)^2$

Applying the chain rule: $\frac{d}{dx}[g(x)]^2 = 2g(x) \cdot g'(x)$

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{2m} \sum_{i=1}^{m} 2((wx_i + b) - y_i) \cdot \frac{\partial}{\partial w}((wx_i + b) - y_i)$

Since $\frac{\partial}{\partial w}((wx_i + b) - y_i) = x_i$:

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{2m} \sum_{i=1}^{m} 2((wx_i + b) - y_i) \cdot x_i$

The factor $2$ cancels with the $2$ in the denominator:

$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} x_i \big((w x_i + b) - y_i\big)$

Substituting $wx_i + b$ with $f(x_i)$:

$\boxed{\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} x_i(f(x_i) - y_i) }$

### Partial Derivative with respect to $b$

Following the same process for $b$:

$\frac{\partial J(w,b)}{\partial b} = \frac{\partial}{\partial b} \left[\frac{1}{2m} \sum_{i=1}^{m} ((wx_i + b) - y_i)^2\right]$

$\frac{\partial J(w,b)}{\partial b} = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial b} ((wx_i + b) - y_i)^2$

Applying the chain rule:

$\frac{\partial J(w,b)}{\partial b} = \frac{1}{2m} \sum_{i=1}^{m} 2((wx_i + b) - y_i) \cdot \frac{\partial}{\partial b}((wx_i + b) - y_i)$

Since $\frac{\partial}{\partial b}((wx_i + b) - y_i) = 1$:

$\frac{\partial J(w,b)}{\partial b} = \frac{1}{2m} \sum_{i=1}^{m} 2((wx_i + b) - y_i) \cdot 1$

Simplifying:

$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} ((wx_i + b) - y_i)$

Substituting $(wx_i + b)$ with $f(x_i)$:

$\boxed{\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)}$
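One way to gain confidence in the two boxed formulas is to compare them with numerical derivatives. The sketch below does this with central finite differences on made-up data; the function names, the data, and the step size `eps` are assumptions for illustration.

```python
import numpy as np


def cost(x, y, w, b):
    """Cost J(w, b) with the 1/(2m) scaling."""
    return np.sum((w * x + b - y) ** 2) / (2 * len(x))


def analytic_gradients(x, y, w, b):
    """The boxed formulas: dJ/dw and dJ/db."""
    errors = w * x + b - y
    return np.mean(errors * x), np.mean(errors)


def numerical_gradients(x, y, w, b, eps=1e-6):
    """Central finite differences of the cost in w and in b."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db


# Made-up data and an arbitrary point (w, b) at which to compare
x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 3.0, 6.0])
analytic = analytic_gradients(x, y, w=0.5, b=0.1)
numeric = numerical_gradients(x, y, w=0.5, b=0.1)
print(analytic, numeric)
```

The two pairs should agree up to floating-point rounding. A large mismatch would point to a mistake in either the derivation or the implementation.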

### Complete Algorithm

With the derivatives calculated, the Gradient Descent algorithm for Linear Regression is:

**Repeat until convergence:**

$w = w - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} x_i(f(x_i) - y_i)$

$b = b - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)$

Where $f(x_i) = wx_i + b$

> **Important note**: Parameters $w$ and $b$ must be updated **simultaneously** at each iteration, that is, we calculate both derivatives with the old values before updating any parameter.
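A single simultaneous update might look like the sketch below. The function name `gradient_step` and the toy data are assumptions; the point is only that both gradients are computed before either parameter is overwritten.

```python
def gradient_step(x, y, w, b, alpha):
    """One simultaneous Gradient Descent update of w and b."""
    m = len(x)
    # Both gradients are computed from the OLD values of w and b
    errors = [(w * xi + b) - yi for xi, yi in zip(x, y)]
    dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
    dj_db = sum(errors) / m
    # Only now are the parameters replaced
    return w - alpha * dj_dw, b - alpha * dj_db


# Made-up data on the line y = 2x, starting from w = 0 and b = 0
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w_new, b_new = gradient_step(x, y, w=0.0, b=0.0, alpha=0.1)
print(w_new, b_new)
```

Here the errors are $-2$, $-4$, and $-6$, so $\frac{\partial J}{\partial w} = -\frac{28}{3}$ and $\frac{\partial J}{\partial b} = -4$, which moves the parameters to $w \approx 0.933$ and $b = 0.4$.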

# Below you will find an example using Python code

In [None]:
# Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Importing our Dataset
training_set = pd.read_csv('../Datasets/Salary_Data.csv')

In [None]:
# Definition of our Feature (input) and Target (output)
X_train = training_set['YearsExperience'].values
y_train = training_set['Salary'].values

In [None]:
# Scatter plot of the Feature against the Target
plt.scatter(X_train, y_train)
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

In [None]:
# We need three main functions to implement linear regression:
# 1) Cost function: Calculates how good the model is, using the Mean Squared Error (MSE), which measures the average squared error between the predicted values and the real values.

# 2) Gradient function: Calculates the derivatives of the cost function with respect to the parameters w and b.

# 3) Gradient descent function: Uses the gradients calculated by the gradient function to update the parameters w and b at each iteration, with the goal of minimizing the error.

def cost_function(x, y, w, b):
    m = len(x)
    cost_sum = 0

    for i in range(m):
        f = w * x[i] + b
        cost = (f - y[i]) ** 2
        cost_sum += cost

    total_cost = (1/(2*m)) * cost_sum
    return total_cost


def gradient_function(x, y, w, b):
    m = len(x)
    dc_dw = 0
    dc_db = 0

    for i in range(m):
        f = w * x[i] + b

        dc_dw += (f - y[i]) * x[i]
        dc_db += (f - y[i])

    dc_dw = (1/m) * dc_dw
    dc_db = (1/m) * dc_db

    return dc_dw, dc_db


def gradient_descent(x, y, alpha, iterations):
    w = 0
    b = 0

    for i in range(iterations):
        dc_dw, dc_db = gradient_function(x, y, w, b)

        w = w - alpha * dc_dw
        b = b - alpha * dc_db

        # Report progress occasionally instead of printing every iteration
        if i % 1000 == 0:
            print(f"Iteration {i}: Cost {cost_function(x, y, w, b)}")

    return w, b

In [None]:
# Indicates the Learning Rate and the number of iterations
learning_rate = 0.01
iterations = 10000
# Actually calculates the Gradient Descent
final_w, final_b = gradient_descent(
    X_train, y_train, learning_rate, iterations)
print(f"w: {final_w:.4f}, b: {final_b:.4f}")

In [None]:
# Visualizes the Regression Line
plt.scatter(X_train, y_train, label='Data Points')

X_vals = np.linspace(min(X_train), max(X_train), 100)
y_vals = final_w * X_vals + final_b
plt.plot(X_vals, y_vals, color='red', label='Regression Line')

plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
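A useful sanity check is to compare the parameters found by Gradient Descent with a closed-form least-squares fit such as `np.polyfit`. The sketch below uses synthetic data rather than `Salary_Data.csv`, and a compact vectorized loop rather than the functions above; the data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

# Synthetic data (an assumption for this sketch): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=x.shape)

# Compact vectorized Gradient Descent using the same update rule
w, b, alpha = 0.0, 0.0, 0.05
for _ in range(5000):
    errors = w * x + b - y
    dj_dw, dj_db = np.mean(errors * x), np.mean(errors)
    w, b = w - alpha * dj_dw, b - alpha * dj_db

# Closed-form least-squares line for comparison: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(f"gradient descent: w={w:.4f}, b={b:.4f}")
print(f"np.polyfit:       w={slope:.4f}, b={intercept:.4f}")
```

If Gradient Descent has converged, the two lines should agree to several decimal places. A visible gap usually means more iterations or a different learning rate is needed.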

# Linear Regression Optimization

## Introduction

After implementing the basic Linear Regression algorithm with Gradient Descent, two common challenges arise in practice:

1. **Very high cost values** - Making interpretation and convergence difficult
2. **Inadequate learning rate selection** - Resulting in slow convergence or training failure

This guide presents two essential techniques to solve these problems: **feature normalization** and **systematic learning rate testing**.

---

## Feature Normalization

### The Scale Problem

When working with data on different scales, the Gradient Descent algorithm faces difficulties. For example, if one feature varies between 0 and 100 and another between 0 and 100,000, the gradients will have very different magnitudes, causing:

- **Slow convergence**: The algorithm needs many iterations
- **Numerical instability**: Extremely large cost values
- **Difficulty choosing the learning rate**: An $\alpha$ that works for one feature may be inadequate for another

### Solution: Z-Score Normalization

**Z-score** normalization transforms the data so that it has mean 0 and standard deviation 1:

$$X_{norm} = \frac{X - \mu}{\sigma}$$

Where:
- $\mu$ is the mean of the data
- $\sigma$ is the standard deviation of the data

### Implementation

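```python
def normalize_features(X):
    """Normalizes data using z-score"""
    mean = np.mean(X)
    std = np.std(X)  # population standard deviation (NumPy default, ddof=0)
    X_norm = (X - mean) / std
    # Also return the statistics so the same scaling can be reused later
    return X_norm, mean, std
```

**Parameters:**
- `X`: Array with original data

**Returns:**
- `X_norm`: Normalized data
- `mean`: Mean of the original data
- `std`: Standard deviation of the original data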
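New inputs have to be scaled with the mean and standard deviation computed on the training data, not with their own statistics. A minimal sketch of that idea follows; the helper names `zscore_fit` and `zscore_apply` and the sample values are hypothetical and not part of the code in this guide.

```python
import numpy as np


def zscore_fit(X):
    """Return the mean and standard deviation of the training data."""
    return np.mean(X), np.std(X)


def zscore_apply(X, mean, std):
    """Normalize X with statistics computed earlier on the training data."""
    return (X - mean) / std


X_train_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values
mean, std = zscore_fit(X_train_demo)
X_norm_demo = zscore_apply(X_train_demo, mean, std)

# A new input is scaled with the training statistics, not its own
x_new_norm = zscore_apply(np.array([6.0]), mean, std)
print(X_norm_demo.mean(), X_norm_demo.std(), x_new_norm)
```

The normalized training data has mean 0 and standard deviation 1, while the new point lands outside the training range at roughly 2.12 standard deviations above the mean.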

### Benefits of Normalization

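1. **Much smaller cost values**: From billions to values below 1 in the example below (this reflects the rescaled target, not a better fit)
2. **Faster convergence**: Fewer iterations needed
3. **Balanced gradients**: Features on a comparable scale produce gradients of comparable magnitude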
4. **Facilitates learning rate selection**: Typical values (0.01 to 1.0) work well

### Practical Example

Before normalization:
```
X: min=1.10, max=10.50, mean=5.31
y: min=$37,731, max=$122,391, mean=$76,003
Initial cost: 1,344,612,525
```

After normalization:
```
X_norm: min=-1.51, max=1.86, mean=0.00
y_norm: min=-1.42, max=1.72, mean=0.00
Initial cost: 0.499
```

---

## Learning Rate Optimization

### The Learning Rate Dilemma

The learning rate ($\alpha$) controls the size of the steps the algorithm takes toward the minimum. Choosing this value is critical:

| Learning Rate | Behavior | Result |
|---------------|----------|--------|
| **Too small** | Tiny steps | Very slow convergence |
| **Adequate** | Balanced steps | Efficient convergence |
| **Too large** | Excessive steps | Oscillation or divergence |

### Systematic Testing Strategy

Instead of choosing arbitrarily, we test several values and select the best based on performance metrics.

### Implementation

```python
def test_learning_rates(X, y):
    """Tests different learning rates to find the best one"""
    learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]
    iterations = 5000

    print("="*60)
    print("TESTING DIFFERENT LEARNING RATES")
    print("="*60)

    results = []

    for lr in learning_rates:
        print(f"\n--- Testing α = {lr} ---")
        w, b, history = gradient_descent(
            X, y, lr, iterations, print_every=1000)

        # Calculate R²
        predictions = w * X + b
        ss_res = np.sum((y - predictions) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - (ss_res / ss_tot)

        results.append({
            'lr': lr,
            'final_cost': history[-1],
            'r2': r2,
            'w': w,
            'b': b
        })

        print(f"Final cost: {history[-1]:.6f}, R²: {r2:.4f}")

    # Select the best result
    best = max(results, key=lambda x: x['r2'])
    print("\n" + "="*60)
    print(f"BEST LEARNING RATE: α = {best['lr']}")
    print(f"R² = {best['r2']:.4f}")
    print("="*60)

    return best['lr']
```
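Note that `test_learning_rates` as written above passes a `print_every` argument and unpacks a third return value, `history`, which the `gradient_descent` function defined earlier does not provide. A minimal sketch of a variant that would satisfy this call is shown below. The name `gradient_descent_with_history` is hypothetical, and the sketch assumes `history` holds the cost after each update.

```python
import numpy as np


def gradient_descent_with_history(x, y, alpha, iterations, print_every=1000):
    """Gradient Descent that also records the cost after every update."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m = len(x)
    w, b = 0.0, 0.0
    history = []

    for i in range(iterations):
        errors = w * x + b - y
        # Simultaneous update: both gradients use the old w and b
        w, b = w - alpha * np.mean(errors * x), b - alpha * np.mean(errors)

        cost = np.sum((w * x + b - y) ** 2) / (2 * m)
        history.append(cost)
        if print_every and i % print_every == 0:
            print(f"Iteration {i}: Cost {cost:.6f}")

    return w, b, history


# Usage on made-up data lying on y = 2x
w, b, history = gradient_descent_with_history(
    np.array([-1.0, 0.0, 1.0]), np.array([-2.0, 0.0, 2.0]),
    alpha=0.1, iterations=500, print_every=100)
```

Keeping the full cost history also makes it possible to plot the cost against the iteration number, which is a common way to judge whether a learning rate is too small or too large.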

### Code Breakdown

#### 1. Defining Candidates

```python
learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]
```

We test values on a roughly **logarithmic scale** (with 0.5 added as an intermediate step), ranging from very conservative to aggressive.

#### 2. Training with Each Candidate

```python
for lr in learning_rates:
    w, b, history = gradient_descent(X, y, lr, iterations, print_every=1000)
```

Each learning rate is tested with the same number of iterations for fair comparison.

#### 3. R² Score Calculation

The **coefficient of determination** ($R^2$) measures how well the model explains the data variability:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
- $SS_{res} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$ (sum of squared residuals)
- $SS_{tot} = \sum_{i=1}^{m} (y_i - \bar{y})^2$ (total sum of squares)

```python
predictions = w * X + b
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - (ss_res / ss_tot)
```
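The boundary cases of $R^2$ can be checked directly. The sketch below wraps the same computation in a function and evaluates it on made-up targets; the name `r2_score_manual` and the data are assumptions for illustration.

```python
import numpy as np


def r2_score_manual(y, predictions):
    """R^2 = 1 - SS_res / SS_tot, as in the formula above."""
    ss_res = np.sum((y - predictions) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot


y_demo = np.array([1.0, 2.0, 3.0, 4.0])  # made-up targets

# Exact predictions leave no residual
r2_perfect = r2_score_manual(y_demo, y_demo)
# Always predicting the mean makes SS_res equal to SS_tot
r2_mean = r2_score_manual(y_demo, np.full_like(y_demo, y_demo.mean()))
# Reversed predictions are worse than predicting the mean
r2_bad = r2_score_manual(y_demo, y_demo[::-1])
print(r2_perfect, r2_mean, r2_bad)
```

The three cases give $1$, $0$, and $-3$ respectively, which shows that $R^2$ has no lower bound when the predictions are worse than the mean.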

#### R² Interpretation

As a rough guide (what counts as a good value depends heavily on the domain):

- R² = 1.0: Perfect model (explains 100% of variance)
- R² = 0.95: Excellent (explains 95% of variance)
- R² = 0.70: Good (explains 70% of variance)
- R² = 0.50: Average (explains 50% of variance)
- R² < 0.30: Poor (model has little predictive power)
- R² < 0: Model worse than simply using the mean

#### 4. Storing Results

```python
results.append({
    'lr': lr,
    'final_cost': history[-1],
    'r2': r2,
    'w': w,
    'b': b
})
```

Each test is stored in a dictionary containing all relevant metrics.

#### 5. Selecting the Best Learning Rate

```python
best = max(results, key=lambda x: x['r2'])
```

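We use R² as the selection criterion, choosing the learning rate that maximizes this metric. Because R² is computed on the training data, a higher R² corresponds exactly to a lower final cost, so this is equivalent to picking the learning rate with the lowest final cost. If several learning rates reach the same R², `max` returns the first one in the list.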

# Code in Practice!

In [None]:

# Bonus Code - Normalization and Proper Learning Rate

def normalize_features(X):
    """Normalizes data using z-score"""
    mean = np.mean(X)
    std = np.std(X)
    X_norm = (X - mean) / std
    return X_norm, mean, std


def test_learning_rates(X, y):
    """Tests different learning rates to find the best one"""
    learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0]
    iterations = 5000

    print("="*60)
    print("TESTING DIFFERENT LEARNING RATES")
    print("="*60)

    results = []

    for lr in learning_rates:
        print(f"\n--- Testing α = {lr} ---")
        w, b = gradient_descent(
            X, y, lr, iterations)

        # R²
        predictions = w * X + b
        ss_res = np.sum((y - predictions) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - (ss_res / ss_tot)

        results.append({
            'lr': lr,
            'r2': r2,
            'w': w,
            'b': b
        })

        print(f"R²: {r2:.4f}")

    # Best result
    best = max(results, key=lambda x: x['r2'])
    print("\n" + "="*60)
    print(f"BEST LEARNING RATE: α = {best['lr']}")
    print(f"R² = {best['r2']:.4f}")
    print("="*60)

    return best['lr']

In [None]:
# Normalize the data
X_norm, x_mean, x_std = normalize_features(X_train)
y_norm, y_mean, y_std = normalize_features(y_train)

print("\nNORMALIZED DATA")
print(
    f"X_norm: min={X_norm.min():.2f}, max={X_norm.max():.2f}, mean={X_norm.mean():.2f}")
print(
    f"y_norm: min={y_norm.min():.2f}, max={y_norm.max():.2f}, mean={y_norm.mean():.2f}")

# Test different learning rates
best_lr = test_learning_rates(X_norm, y_norm)

# Train again but now with the best learning rate
print("\n" + "="*60)
print(f"FINAL TRAINING WITH α = {best_lr}")
print("="*60)

# gradient_descent initializes w and b to 0 internally,
# so only the number of iterations needs to be set here
iterations = 10000


In [None]:
# Run Gradient Descent on the normalized data (cost values are now on a much smaller scale)
w_final, b_final = gradient_descent(
    X_norm, y_norm, best_lr, iterations
)

In [None]:
# Function to visualize the Regression Line
def plot_regression(X, y, w, b):
    """Plots the data points and the fitted regression line"""

    # Calculate predictions
    predictions = w * X + b

    # Plot
    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, color='blue', s=100, alpha=0.6, label='Data')
    plt.plot(X, predictions, color='red', linewidth=3, label='Regression')

    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Linear Regression: y = {w:.4f}x + {b:.4f}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()


plot_regression(X_norm, y_norm, w_final, b_final)
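The parameters found above describe a line in normalized units. To predict in the original units, they can be converted back using the stored means and standard deviations: substituting $x_{norm} = (x - \mu_x)/\sigma_x$ and $y_{norm} = (y - \mu_y)/\sigma_y$ into $y_{norm} = w x_{norm} + b$ gives a slope of $w \sigma_y / \sigma_x$ and an intercept of $\mu_y + \sigma_y b - (w \sigma_y / \sigma_x)\mu_x$. The sketch below applies this to made-up data; the function name `denormalize_parameters` is hypothetical.

```python
import numpy as np


def denormalize_parameters(w_norm, b_norm, x_mean, x_std, y_mean, y_std):
    """Convert a line fitted on z-scored x and y back to original units."""
    w_orig = w_norm * y_std / x_std
    b_orig = y_mean + y_std * b_norm - w_orig * x_mean
    return w_orig, b_orig


# Made-up data lying exactly on y = 1000x + 500
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1000.0 * x + 500.0
x_mean, x_std, y_mean, y_std = x.mean(), x.std(), y.mean(), y.std()

# For an exact increasing linear relationship, the normalized fit is w = 1, b = 0
w_orig, b_orig = denormalize_parameters(1.0, 0.0, x_mean, x_std, y_mean, y_std)
print(w_orig, b_orig)
```

In this toy case the conversion recovers the slope $1000$ and the intercept $500$ of the original line. In the notebook, the same conversion would take `w_final`, `b_final`, `x_mean`, `x_std`, `y_mean`, and `y_std` as inputs.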