# Module 03: Linear Regression

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 00: Introduction to Machine Learning](00_introduction_to_machine_learning.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)
- [Module 02: Data Preparation and Splitting](02_data_preparation_and_splitting.ipynb)
- Basic calculus and linear algebra

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the mathematical foundation of linear regression (OLS)
2. Distinguish between simple and multiple linear regression
3. Implement linear regression using scikit-learn
4. Create and use polynomial features for non-linear relationships
5. Evaluate regression models using R², MSE, RMSE, and MAE
6. Interpret coefficients and analyze residuals
7. Recognize and validate linear regression assumptions

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Scikit-learn for ML
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("All libraries imported successfully!")

## 2. What is Linear Regression?

### Definition

**Linear Regression** is a supervised learning algorithm that models the relationship between:
- **Independent variables (X)**: Features/predictors
- **Dependent variable (y)**: Target/outcome

It does this by finding the **best-fitting straight line** (or hyperplane in multiple dimensions) through the data.

### The Linear Equation

**Simple Linear Regression** (one feature):
```
y = β₀ + β₁x + ε
```

**Multiple Linear Regression** (many features):
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
```

Where:
- **β₀** (beta zero): Intercept (predicted value of y when all x = 0)
- **β₁, β₂, ...** (beta coefficients): Slopes (change in y per one-unit increase in that feature, holding the other features fixed)
- **ε** (epsilon): Error term (what the model can't explain)
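
To make the equation concrete, here is a minimal sketch that evaluates the multiple regression formula for a few rows at once. The coefficient values and the tiny feature matrix are made up purely for illustration; they are not learned from any dataset in this notebook.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 1.0                    # intercept
beta = np.array([2.0, -0.5])    # one slope per feature

# Three samples with two features each
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 2.0],
])

# y_hat = beta_0 + beta_1 * x_1 + beta_2 * x_2, computed for every row at once
y_hat = beta_0 + X @ beta
print(y_hat)  # [1. 3. 2.]
```

The first row has all features at 0, so its prediction equals the intercept. Moving from the first row to the second increases x₁ by one unit and raises the prediction by exactly β₁.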

### Goal: Minimize Error

Find the coefficients (β values) that **minimize the sum of squared errors** between predictions and actual values.

This is called **Ordinary Least Squares (OLS)** estimation.
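
As a rough sketch of what "minimizing the sum of squared errors" means in practice, the snippet below solves the least-squares problem directly with NumPy on synthetic data. The data-generating line (y = 3 + 2x) and the noise level are assumptions chosen for the example, and `np.linalg.lstsq` stands in here for whatever solver a library uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line: y = 3 + 2x + noise
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=x.shape)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the beta that minimizes ||A @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept_hat, slope_hat = beta_hat

# Sum of squared errors at the solution
sse = np.sum((y - A @ beta_hat) ** 2)

# Any other choice of coefficients gives a larger sum of squared errors
sse_shifted = np.sum((y - A @ (beta_hat + np.array([0.1, 0.0]))) ** 2)

print(f"intercept = {intercept_hat:.3f}, slope = {slope_hat:.3f}")
print(f"SSE at solution = {sse:.3f}, SSE after shifting intercept = {sse_shifted:.3f}")
```

The recovered intercept and slope should land close to 3 and 2, and nudging either coefficient away from the solution increases the squared error.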

## 3. Simple Linear Regression

Let's start with the simplest case: predicting one variable from another.

### Example: Years of Experience → Salary

In [None]:
# Create a simple dataset
np.random.seed(42)
years_experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
salary = 30000 + 5000 * years_experience + np.random.normal(0, 3000, 10)

# Create DataFrame for better visualization
simple_df = pd.DataFrame({
    'Years_Experience': years_experience,
    'Salary': salary
})

print("Simple Linear Regression Dataset:")
print(simple_df)
print(f"\nCorrelation: {simple_df.corr().iloc[0, 1]:.3f}")

In [None]:
# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(years_experience, salary, s=100, alpha=0.7, edgecolors='k')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.title('Relationship: Experience vs Salary', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice the roughly linear relationship!")

In [None]:
# Prepare data for sklearn (needs 2D arrays)
X_simple = years_experience.reshape(-1, 1)  # Reshape to column vector
y_simple = salary

# Create and train the model
simple_model = LinearRegression()
simple_model.fit(X_simple, y_simple)

# Get the learned parameters
intercept = simple_model.intercept_
slope = simple_model.coef_[0]

print("Learned Linear Equation:")
print(f"Salary = {intercept:.2f} + {slope:.2f} × Years")
print(f"\nInterpretation:")
print(f"- Base salary (no experience): ${intercept:.2f}")
print(f"- Salary increase per year: ${slope:.2f}")

In [None]:
# Make predictions
y_pred_simple = simple_model.predict(X_simple)

# Visualize the fitted line
plt.figure(figsize=(10, 6))
plt.scatter(years_experience, salary, s=100, alpha=0.7, 
           edgecolors='k', label='Actual Data')
plt.plot(years_experience, y_pred_simple, 'r-', linewidth=2, 
        label=f'Fitted Line: y = {intercept:.0f} + {slope:.0f}x')

# Show residuals (errors)
for i in range(len(years_experience)):
    plt.plot([years_experience[i], years_experience[i]], 
            [salary[i], y_pred_simple[i]], 
            'g--', alpha=0.5, linewidth=1)

plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.title('Simple Linear Regression: Best-Fit Line', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Green dashed lines show residuals (prediction errors)")
print("OLS minimizes the sum of squared residuals!")

## 4. Multiple Linear Regression

Real-world problems usually have multiple features. Let's use the California housing dataset.

In [None]:
# Load California housing dataset
housing = datasets.fetch_california_housing()
X = housing.data
y = housing.target  # Median house value (in $100,000s)

# Use subset for faster computation
X = X[:1000]
y = y[:1000]

# Create DataFrame for exploration
housing_df = pd.DataFrame(X, columns=housing.feature_names)
housing_df['MedHouseValue'] = y

print("California Housing Dataset:")
print(f"Shape: {housing_df.shape}")
print(f"\nFeatures: {list(housing.feature_names)}")
print(f"\nFirst few rows:")
print(housing_df.head())
print(f"\nTarget range: ${y.min()*100000:.0f} - ${y.max()*100000:.0f}")

In [None]:
# Explore correlations with target
correlations = housing_df.corr()['MedHouseValue'].sort_values(ascending=False)

print("Feature Correlations with House Value:")
print(correlations)

# Visualize
plt.figure(figsize=(10, 6))
correlations.drop('MedHouseValue').plot(kind='barh')  # exclude the target's correlation with itself
plt.xlabel('Correlation with Median House Value')
plt.title('Feature Importance (Correlation)', fontsize=14)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nMedInc (median income) has strongest positive correlation!")

In [None]:
# Split data (BEFORE preprocessing!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

In [None]:
# Scale features (fit on training data only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple linear regression model
multi_model = LinearRegression()
multi_model.fit(X_train_scaled, y_train)

print("Multiple Linear Regression Model Trained!")
print(f"\nIntercept: {multi_model.intercept_:.3f}")
print(f"\nCoefficients:")
for feature, coef in zip(housing.feature_names, multi_model.coef_):
    print(f"  {feature:12s}: {coef:8.3f}")

In [None]:
# Make predictions
y_train_pred = multi_model.predict(X_train_scaled)
y_test_pred = multi_model.predict(X_test_scaled)

# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
axes[0].scatter(y_train, y_train_pred, alpha=0.6, edgecolors='k')
axes[0].plot([y_train.min(), y_train.max()], 
            [y_train.min(), y_train.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Price ($100,000s)')
axes[0].set_ylabel('Predicted Price ($100,000s)')
axes[0].set_title('Training Set Predictions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test set
axes[1].scatter(y_test, y_test_pred, alpha=0.6, edgecolors='k')
axes[1].plot([y_test.min(), y_test.max()], 
            [y_test.min(), y_test.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Price ($100,000s)')
axes[1].set_ylabel('Predicted Price ($100,000s)')
axes[1].set_title('Test Set Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Points closer to red line = better predictions")

## 5. Regression Metrics

How do we measure regression model performance?

### 1. R² (R-squared) - Coefficient of Determination

**Formula**: R² = 1 - (SS_res / SS_tot)

**Interpretation**:
- At most 1; usually between 0 and 1, but negative when the model predicts worse than always guessing the mean of y
- **0.7** = model explains 70% of variance
- Higher is better
- **Perfect score**: 1.0

### 2. MSE (Mean Squared Error)

**Formula**: MSE = (1/n) Σ(y_actual - y_pred)²

**Interpretation**:
- Average squared difference
- Penalizes large errors heavily (squared term)
- Lower is better
- **Perfect score**: 0.0

### 3. RMSE (Root Mean Squared Error)

**Formula**: RMSE = √MSE

**Interpretation**:
- Same units as target variable
- More interpretable than MSE
- Lower is better

### 4. MAE (Mean Absolute Error)

**Formula**: MAE = (1/n) Σ|y_actual - y_pred|

**Interpretation**:
- Average absolute difference
- Less sensitive to outliers than MSE, because errors are not squared
- Lower is better
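
The four formulas above can be checked by hand on a tiny example. This sketch uses four made-up values (not from the housing data) and compares the manual results with the scikit-learn functions imported earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_actual - y_pred              # [0.5, -0.5, 0.0, -1.0]

mse = np.mean(errors ** 2)              # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
rmse = np.sqrt(mse)                     # square root of MSE
mae = np.mean(np.abs(errors))           # (0.5 + 0.5 + 0 + 1) / 4 = 0.5

ss_res = np.sum(errors ** 2)                         # 1.5
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # 20.0
r2 = 1 - ss_res / ss_tot                             # 0.925

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")

# The manual values agree with scikit-learn
print(np.isclose(mse, mean_squared_error(y_actual, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_actual, y_pred)))
print(np.isclose(r2, r2_score(y_actual, y_pred)))
```

Notice that the single error of size 1.0 contributes two thirds of the MSE but only half of the MAE, which is the "penalizes large errors" effect in miniature.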

In [None]:
# Calculate all metrics
def evaluate_regression(y_true, y_pred, set_name=""):
    """Calculate and display regression metrics"""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    
    print(f"\n{set_name} Metrics:")
    print(f"  R² Score:  {r2:.4f}  (Higher is better, max=1.0)")
    print(f"  MSE:       {mse:.4f}")
    print(f"  RMSE:      ${rmse*100000:.2f}  (in actual dollars)")
    print(f"  MAE:       ${mae*100000:.2f}  (average error)")
    print(f"  MAPE:      {mape:.2f}%  (percentage error)")
    
    return {'R2': r2, 'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'MAPE': mape}

# Evaluate on both sets
train_metrics = evaluate_regression(y_train, y_train_pred, "Training Set")
test_metrics = evaluate_regression(y_test, y_test_pred, "Test Set")

print(f"\nInterpretation:")
print(f"Model explains {test_metrics['R2']:.1%} of variance in house prices")
print(f"Average prediction error: ${test_metrics['MAE']*100000:.2f}")

## 6. Residual Analysis

**Residuals** are the differences between actual and predicted values.

Good residuals should:
1. **Be randomly distributed** (no patterns)
2. **Have mean ≈ 0**
3. **Have constant variance** (homoscedasticity)
4. **Be approximately normally distributed**

Patterns in residuals indicate model problems!
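
One caveat about the "mean ≈ 0" check: on the data used for fitting, OLS with an intercept produces residuals whose mean is zero by construction, so the check is more informative on held-out data. The sketch below illustrates this on synthetic data; the coefficients and noise level are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data with three features and known coefficients
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=200)

model = LinearRegression().fit(X, y)
train_residuals = y - model.predict(X)

# On the training data both quantities are zero up to floating-point error
mean_residual = train_residuals.mean()
feature_dot = X.T @ train_residuals   # residuals are orthogonal to every feature

print(f"Mean training residual: {mean_residual:.2e}")
print(f"Feature-residual dot products: {feature_dot}")
```

Because these properties hold automatically on training data, the visual checks for patterns and changing spread usually say more about model quality than the residual mean does.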

In [None]:
# Calculate residuals
residuals = y_test - y_test_pred

# Comprehensive residual analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Residuals vs Predicted Values
axes[0, 0].scatter(y_test_pred, residuals, alpha=0.6, edgecolors='k')
axes[0, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residual Plot (should be random around 0)')
axes[0, 0].grid(True, alpha=0.3)

# 2. Distribution of Residuals
axes[0, 1].hist(residuals, bins=30, edgecolor='k', alpha=0.7)
axes[0, 1].axvline(x=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title(f'Residual Distribution (mean={residuals.mean():.4f})')
axes[0, 1].grid(True, alpha=0.3)

# 3. Q-Q Plot (normality check)
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot (should follow red line for normality)')
axes[1, 0].grid(True, alpha=0.3)

# 4. Scale-Location Plot
standardized_residuals = (residuals - residuals.mean()) / residuals.std()
axes[1, 1].scatter(y_test_pred, np.sqrt(np.abs(standardized_residuals)), 
                  alpha=0.6, edgecolors='k')
axes[1, 1].set_xlabel('Predicted Values')
axes[1, 1].set_ylabel('√|Standardized Residuals|')
axes[1, 1].set_title('Scale-Location Plot (check for constant variance)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nResidual Analysis Checklist:")
print("✓ Top-left: Should show no clear pattern (random scatter)")
print("✓ Top-right: Should be roughly bell-shaped (normal distribution)")
print("✓ Bottom-left: Points should follow red line (normality)")
print("✓ Bottom-right: Should show constant spread (homoscedasticity)")

## 7. Polynomial Regression

What if the relationship isn't linear? Use **polynomial features**!

Transform: x → [x, x², x³, ...]

Still uses linear regression, but on transformed features.
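
Here is a minimal sketch of that idea using a scikit-learn pipeline. The data is a noise-free quadratic (an assumption made so the recovered coefficients are easy to verify), and `include_bias=False` is used because `LinearRegression` already fits its own intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free quadratic: y = 0.5x² + x + 2
x = np.linspace(0, 3, 20)
y = 0.5 * x**2 + x + 2
X = x.reshape(-1, 1)

# Step 1 expands x into [x, x²]; step 2 fits an ordinary linear model on those columns
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quadratic.fit(X, y)

linreg = quadratic[-1]   # the fitted LinearRegression step
print("Intercept:", round(linreg.intercept_, 4))
print("Coefficients for [x, x²]:", np.round(linreg.coef_, 4))
```

The model is still linear in its coefficients: it recovers the intercept 2 and the weights 1 and 0.5 for the x and x² columns. Wrapping both steps in a pipeline also means `quadratic.predict` applies the same transformation to new data automatically.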

In [None]:
# Create non-linear dataset
np.random.seed(42)
X_poly_raw = np.linspace(0, 3, 50)
y_poly_raw = 0.5 * X_poly_raw**2 + X_poly_raw + 2 + np.random.normal(0, 0.5, 50)

X_poly = X_poly_raw.reshape(-1, 1)

# Visualize the non-linear relationship
plt.figure(figsize=(10, 6))
plt.scatter(X_poly, y_poly_raw, s=50, alpha=0.7, edgecolors='k')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Non-Linear Relationship (Quadratic)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("This data has a quadratic (curved) relationship!")
print("Simple linear regression won't fit well.")

In [None]:
# Compare linear vs polynomial regression
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Linear (degree=1)
linear_model = LinearRegression()
linear_model.fit(X_poly, y_poly_raw)
y_linear_pred = linear_model.predict(X_poly)

axes[0].scatter(X_poly, y_poly_raw, s=50, alpha=0.7, edgecolors='k')
axes[0].plot(X_poly, y_linear_pred, 'r-', linewidth=2)
axes[0].set_title(f'Linear (R²={r2_score(y_poly_raw, y_linear_pred):.3f})')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].grid(True, alpha=0.3)

# 2. Polynomial (degree=2)
poly_features_2 = PolynomialFeatures(degree=2)
X_poly_2 = poly_features_2.fit_transform(X_poly)
poly_model_2 = LinearRegression()
poly_model_2.fit(X_poly_2, y_poly_raw)
y_poly_pred_2 = poly_model_2.predict(X_poly_2)

axes[1].scatter(X_poly, y_poly_raw, s=50, alpha=0.7, edgecolors='k')
axes[1].plot(X_poly, y_poly_pred_2, 'g-', linewidth=2)
axes[1].set_title(f'Polynomial deg=2 (R²={r2_score(y_poly_raw, y_poly_pred_2):.3f})')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].grid(True, alpha=0.3)

# 3. Polynomial (degree=5) - Overfitting!
poly_features_5 = PolynomialFeatures(degree=5)
X_poly_5 = poly_features_5.fit_transform(X_poly)
poly_model_5 = LinearRegression()
poly_model_5.fit(X_poly_5, y_poly_raw)
y_poly_pred_5 = poly_model_5.predict(X_poly_5)

axes[2].scatter(X_poly, y_poly_raw, s=50, alpha=0.7, edgecolors='k')
axes[2].plot(X_poly, y_poly_pred_5, 'purple', linewidth=2)
axes[2].set_title(f'Polynomial deg=5 (R²={r2_score(y_poly_raw, y_poly_pred_5):.3f})')
axes[2].set_xlabel('X')
axes[2].set_ylabel('y')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("• Degree=1 (linear): Underfits - a straight line misses the curvature")
print("• Degree=2 (quadratic): Matches the true shape of this data")
print("• Degree=5: Risks overfitting - extra terms can follow noise (R² here is on training data)")

In [None]:
# Show what PolynomialFeatures does
sample_X = np.array([[2], [3]])
poly_transformer = PolynomialFeatures(degree=3)
sample_X_poly = poly_transformer.fit_transform(sample_X)

print("Polynomial Features Transformation:")
print("\nOriginal features:")
print(sample_X)
print("\nPolynomial features (degree=3):")
print(sample_X_poly)
print("\nFeature names:")
print(poly_transformer.get_feature_names_out())
print("\nFor x=2: [1, x, x², x³] = [1, 2, 4, 8]")
print("For x=3: [1, x, x², x³] = [1, 3, 9, 27]")

## 8. Linear Regression Assumptions

For linear regression to work well, data should satisfy:

### 1. Linearity
- Relationship between X and y should be linear
- **Check**: Scatter plots, residual plots

### 2. Independence
- Observations should be independent
- **Check**: Domain knowledge, avoid time series without proper handling

### 3. Homoscedasticity
- Residuals should have constant variance
- **Check**: Scale-location plot

### 4. Normality of Residuals
- Residuals should be approximately normally distributed (this matters most for confidence intervals and hypothesis tests)
- **Check**: Q-Q plot, histogram of residuals

### 5. No Multicollinearity
- Features shouldn't be highly correlated with each other
- **Check**: Correlation matrix, VIF (Variance Inflation Factor)

**Note**: Violations don't always break the model, but they make its coefficients and predictions less reliable!
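
The VIF check mentioned under multicollinearity can be sketched with tools already used in this notebook. For feature j, VIF = 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing feature j on all the other features. The helper name `variance_inflation_factors` and the synthetic columns are hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R²_j) for each column j of X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)          # all columns except j
        target = X[:, j]
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r2))             # undefined if r2 is exactly 1
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
for name, vif in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {vif:.1f}")
```

The independent column `b` should have a VIF near 1, while `a` and `c` get very large values because each is almost fully predictable from the other. A common rule of thumb treats values above roughly 5 to 10 as a warning sign, though the exact threshold is a judgment call.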

## 9. Practice Exercises

### Exercise 1: Diabetes Dataset Regression

Load the diabetes dataset (`datasets.load_diabetes()`), build a linear regression model, and:
1. Calculate R², MSE, RMSE, and MAE
2. Identify the 3 most important features (highest coefficient magnitudes)
3. Create a residual plot

In [None]:
# Your code here


### Exercise 2: Feature Engineering Impact

Using the California housing dataset:
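1. Create a new feature: `rooms_per_person = AveRooms / AveOccup` (`AveRooms` is rooms per household and `AveOccup` is people per household, so their ratio is rooms per person)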
2. Train two models: one with original features, one with the new feature added
3. Compare R² scores - did the new feature help?

In [None]:
# Your code here


### Exercise 3: Polynomial Degree Selection

Create a synthetic dataset with a cubic relationship (y = x³ + noise).
Test polynomial regression with degrees 1, 2, 3, 4, and 5.
Which degree gives the best test set performance?

In [None]:
# Your code here


### Exercise 4: Interpretation Challenge

Train a linear regression model on the California housing dataset (the Boston housing dataset was removed from scikit-learn in version 1.2).
Answer these questions:
1. What does a coefficient of 2.5 for feature "X" mean?
2. If you increase "X" by 1 unit, how much does the prediction change?
3. Which feature has the strongest effect on price?

In [None]:
# Your code here


## 10. Summary

### Key Concepts Learned

1. **Linear Regression Fundamentals**:
   - Models linear relationship: y = β₀ + β₁x₁ + ... + βₙxₙ
   - Uses Ordinary Least Squares (OLS) to minimize errors
   - Simple (1 feature) vs Multiple (many features)

2. **Regression Metrics**:
   - **R²**: Proportion of variance explained (at most 1, higher better; negative if worse than predicting the mean)
   - **MSE**: Mean squared error (lower better, penalizes large errors)
   - **RMSE**: Root MSE (same units as target, more interpretable)
   - **MAE**: Mean absolute error (less sensitive to outliers than MSE)

3. **Residual Analysis**:
   - Residuals = actual - predicted
   - Should be randomly distributed around 0
   - No patterns = good model fit
   - Patterns indicate model problems

4. **Polynomial Regression**:
   - Handles non-linear relationships
   - Transforms features: x → [x, x², x³, ...]
   - Still uses linear regression on transformed features
   - Careful: high degrees can overfit!

5. **Model Assumptions**:
   - Linearity (or use polynomial features)
   - Independence of observations
   - Homoscedasticity (constant variance)
   - Normality of residuals
   - No multicollinearity

### When to Use Linear Regression

✅ **Good for**:
- Predicting continuous values
- Understanding feature importance
- Interpretable models (coefficients have meaning)
- Fast training and prediction
- Baseline model for comparison

❌ **Not ideal for**:
- Complex non-linear relationships (use polynomial or other models)
- Classification tasks (use logistic regression instead)
- High-dimensional data with many correlated features
- Data with many outliers

### Quick Reference: Scikit-learn

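```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error

# Linear regression (X_train, X_test, y_train, y_test come from train_test_split)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Polynomial regression: fit the transformer on training data only
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)

# Evaluate on the test set
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
```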

### Next Steps

In the next module, we'll explore:
- **Logistic Regression** for classification
- Sigmoid function and decision boundaries
- Binary and multi-class classification
- Classification metrics

### Additional Resources

- [Scikit-learn Linear Regression](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
- [StatQuest: Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo)
- [Andrew Ng's ML Course - Linear Regression](https://www.coursera.org/learn/machine-learning)
- [Polynomial Regression in Depth](https://towardsdatascience.com/polynomial-regression-bbe8b9d97491)