# Module 03: Linear Regression

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- [Module 00: Introduction to ML and scikit-learn](00_introduction_to_ml_and_sklearn.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)
- [Module 02: Data Preparation and Train/Test Split](02_data_preparation_train_test_split.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand the mathematical concept behind linear regression
2. Build simple linear regression models (one feature)
3. Build multiple linear regression models (many features)
4. Interpret coefficients and intercepts
5. Evaluate regression models using R², MSE, and RMSE
6. Make predictions on new data

## 1. What is Linear Regression?

**Linear Regression** is one of the simplest and most widely used machine learning algorithms. It models the relationship between features and a target variable using a straight line (or hyperplane in multiple dimensions).

### The Big Idea
Find the "best fit" line through your data points: the line that minimizes the sum of squared differences between actual and predicted values.

### Real-World Examples
- Predicting house prices based on size
- Estimating sales based on advertising spend
- Forecasting temperature based on historical data
- Predicting salary based on years of experience

### The Mathematical Formula

**Simple Linear Regression** (one feature):
```
y = mx + b
or
y = β₀ + β₁x
```

**Multiple Linear Regression** (many features):
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
```

Where:
- **y** = predicted value (target)
- **x** = feature value(s)
- **β₀** (beta zero) = intercept (where line crosses y-axis)
- **β₁, β₂, ...** = coefficients (slopes): the change in y for a one-unit increase in that feature, holding the other features constant
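To make the formula concrete, here is a minimal sketch that recovers β₀ and β₁ from noisy points with ordinary least squares, the same criterion scikit-learn's `LinearRegression` uses. The data is synthetic and the `x_demo`/`y_demo` names are made up for illustration; they are not part of the housing dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the first weight acts as the intercept β₀
A = np.column_stack([np.ones_like(x_demo), x_demo])

# Least-squares solution: the β that minimizes the sum of squared errors
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
beta0, beta1 = beta

print(f"Estimated intercept (β₀): {beta0:.2f}")
print(f"Estimated slope (β₁): {beta1:.2f}")

# A prediction for a new x is just the line equation y = β₀ + β₁x
x_new = 4.0
y_new = beta0 + beta1 * x_new
print(f"Prediction at x = {x_new}: {y_new:.2f}")
```

The estimates should land close to the true values 2 and 3, with the gap coming from the added noise.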

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. Simple Linear Regression (One Feature)

Let's start with the simplest case: predicting house value using just one feature (median income).

**Goal**: Find the best line: `house_value = β₀ + β₁ × median_income`

In [None]:
# Load California housing dataset
housing_df = pd.read_csv('data/sample/california_housing.csv')

print("Dataset Overview:")
print(f"Shape: {housing_df.shape}")
print(f"\nFirst few rows:")
housing_df.head()

In [None]:
# For simple linear regression, use only ONE feature
X_simple = housing_df[['MedInc']]  # Double brackets to keep as DataFrame
y = housing_df['median_house_value']

print(f"Feature (X): {X_simple.shape}")
print(f"Target (y): {y.shape}")
print(f"\nFeature name: {X_simple.columns[0]}")
print(f"Target name: {y.name}")

In [None]:
# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y, alpha=0.3, s=20)
plt.xlabel('Median Income (in $10,000s)', fontsize=12)
plt.ylabel('Median House Value ($)', fontsize=12)
plt.title('House Value vs Median Income\n(Looking for a linear relationship)', 
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservation: There's a clear positive linear trend!")
print("As income increases, house values tend to increase.")

In [None]:
# Build a simple linear regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.3, random_state=42
)

print("Data Split:")
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Create and train the model
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

print("\n✓ Model trained!")

In [None]:
# Examine the learned parameters
intercept = simple_model.intercept_
coefficient = simple_model.coef_[0]

print("Learned Model Parameters:")
print(f"Intercept (β₀): ${intercept:,.2f}")
print(f"Coefficient (β₁): ${coefficient:,.2f}")
print(f"\nModel Equation:")
print(f"house_value = {intercept:,.2f} + {coefficient:,.2f} × median_income")
print(f"\nInterpretation:")
print(f"- Predicted house value when income=0 (an extrapolation): ${intercept:,.2f}")
print(f"- For each $10,000 increase in median income,")
print(f"  the predicted house value increases by ${coefficient:,.2f}")

In [None]:
# Visualize the regression line
plt.figure(figsize=(10, 6))

# Scatter plot of actual data
plt.scatter(X_test, y_test, alpha=0.3, s=20, label='Actual Data')

# Regression line
# Create points for the line
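# Keep the column name so the input matches what the model was fitted on
X_line = pd.DataFrame({
    'MedInc': np.linspace(X_simple['MedInc'].min(), X_simple['MedInc'].max(), 100)
})
y_line = simple_model.predict(X_line)
plt.plot(X_line['MedInc'], y_line, 'r-', linewidth=3, label='Regression Line')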

plt.xlabel('Median Income (in $10,000s)', fontsize=12)
plt.ylabel('Median House Value ($)', fontsize=12)
plt.title('Simple Linear Regression: Best Fit Line', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The red line represents the model's predictions!")

In [None]:
# Make predictions
y_pred = simple_model.predict(X_test)

# Show some example predictions
results_df = pd.DataFrame({
    'Median_Income': X_test['MedInc'].values[:5],
    'Actual_Value': y_test.values[:5],
    'Predicted_Value': y_pred[:5],
    'Error': y_test.values[:5] - y_pred[:5]
})

print("Example Predictions:")
print(results_df.to_string(index=False))
print("\nNote: Error = Actual - Predicted")
print("Positive error = Model underestimated")
print("Negative error = Model overestimated")

## 3. Evaluating Regression Models

How do we know if our model is good? We use evaluation metrics:

### 1. Mean Squared Error (MSE)
- Average of squared errors
- Formula: MSE = (1/n) × Σ(actual - predicted)²
- **Lower is better** (0 is perfect)
- Units are squared (hard to interpret)

### 2. Root Mean Squared Error (RMSE)
- Square root of MSE
- Formula: RMSE = √MSE
- **Lower is better** (0 is perfect)
- Same units as target (easier to interpret)
- Read as the typical size of a prediction error; large errors weigh more because they are squared before averaging

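### 3. R² Score (Coefficient of Determination)
- Proportion of variance explained by the model
- Range: at most 1, usually between 0 and 1; negative when the model does worse than always predicting the mean
- **Higher is better** (1 is perfect)
- Interpretation: "Model explains X% of the variance"

### 4. Mean Absolute Error (MAE)
- Average of the absolute errors
- Formula: MAE = (1/n) × Σ|actual - predicted|
- **Lower is better** (0 is perfect)
- Same units as target; this is the metric that literally means "on average, predictions are off by X units"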
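The metric formulas above can be checked by hand on a tiny example. This sketch uses four made-up values (not from the housing data) and `demo_`-prefixed names chosen only for illustration. It also shows how R² goes negative when predictions are worse than simply predicting the mean.

```python
import numpy as np

# Four made-up actual values and predictions
demo_actual = np.array([3.0, 5.0, 7.0, 9.0])
demo_predicted = np.array([2.0, 5.0, 8.0, 10.0])

demo_errors = demo_actual - demo_predicted           # [1, 0, -1, -1]
demo_mse = np.mean(demo_errors ** 2)                 # average squared error
demo_rmse = np.sqrt(demo_mse)                        # back in the target's units

# R² compares the model's squared error to that of always predicting the mean
ss_res = np.sum(demo_errors ** 2)
ss_tot = np.sum((demo_actual - demo_actual.mean()) ** 2)
demo_r2 = 1 - ss_res / ss_tot

print(f"MSE:  {demo_mse:.3f}")   # 0.750
print(f"RMSE: {demo_rmse:.3f}")  # 0.866
print(f"R²:   {demo_r2:.3f}")    # 0.850

# Predictions in reverse order are far worse than predicting the mean
demo_bad = demo_actual[::-1]
demo_r2_bad = 1 - np.sum((demo_actual - demo_bad) ** 2) / ss_tot
print(f"R² for reversed predictions: {demo_r2_bad:.3f}")  # -3.000
```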

In [None]:
# Calculate evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Metrics on test set
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Simple Linear Regression Performance:")
print(f"\nMean Squared Error (MSE): {mse:,.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print(f"Mean Absolute Error (MAE): ${mae:,.2f}")
print(f"R² Score: {r2:.3f}")

print(f"\nInterpretation:")
print(f"- Typical prediction error (RMSE): ${rmse:,.2f}")
print(f"- On average, predictions are off by ${mae:,.2f} (MAE)")
print(f"- Model explains {r2*100:.1f}% of the variance in house values")
print(f"- All of this comes from just ONE feature (median income)")

## 4. Multiple Linear Regression (Many Features)

Real-world problems usually involve multiple features. Let's use ALL features to improve our predictions.

**Hypothesis**: Using more relevant features should reduce prediction error!

In [None]:
# Prepare data with ALL features
X_multiple = housing_df.drop('median_house_value', axis=1)
y = housing_df['median_house_value']

print("Multiple Linear Regression Setup:")
print(f"Number of features: {X_multiple.shape[1]}")
print(f"Features: {list(X_multiple.columns)}")
print(f"Number of samples: {len(X_multiple)}")

In [None]:
# Split the data
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_multiple, y, test_size=0.3, random_state=42
)

# Train the model
multi_model = LinearRegression()
multi_model.fit(X_train_m, y_train_m)

print("✓ Multiple linear regression model trained!")

In [None]:
# Examine coefficients for each feature
coefficients_df = pd.DataFrame({
    'Feature': X_multiple.columns,
    'Coefficient': multi_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("Coefficients (sorted by magnitude):")
print(coefficients_df.to_string(index=False))
print(f"\nIntercept: ${multi_model.intercept_:,.2f}")
print(f"\nInterpretation (holding the other features constant):")
print(f"- Positive coefficient = predicted house value rises as the feature increases")
print(f"- Negative coefficient = predicted house value falls as the feature increases")
print(f"- Magnitudes are only comparable between features measured on the same scale")
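Raw coefficients depend on each feature's units, so their magnitudes are not directly comparable across features. One common remedy, sketched here on made-up data, is to standardize the features before fitting. The `demo_` names and both columns are hypothetical and not part of the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales
demo_X = pd.DataFrame({
    'income_10k': rng.normal(4, 2, n),        # spread of about 2
    'population': rng.normal(1500, 1000, n),  # spread of about 1000
})
# By construction, each feature moves the target about 20 per standard deviation:
# 10 * 2 = 20 and 0.02 * 1000 = 20
demo_y = 10 * demo_X['income_10k'] + 0.02 * demo_X['population'] + rng.normal(0, 1, n)

# Fit once on raw features, once on standardized features
raw_model = LinearRegression().fit(demo_X, demo_y)
scaled_X = StandardScaler().fit_transform(demo_X)
scaled_model = LinearRegression().fit(scaled_X, demo_y)

demo_summary = pd.DataFrame({
    'Feature': demo_X.columns,
    'Raw coefficient': raw_model.coef_,
    'Standardized coefficient': scaled_model.coef_,
})
print(demo_summary.to_string(index=False))
```

The raw coefficients differ by a factor of several hundred, while the standardized ones come out roughly equal, which matches how the data was generated.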

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in coefficients_df['Coefficient']]
plt.barh(coefficients_df['Feature'], coefficients_df['Coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Coefficients in Multiple Linear Regression\n(Green=Positive, Red=Negative)', 
         fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

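# Report the extremes from the fitted model instead of hardcoding feature names
top_pos = coefficients_df.loc[coefficients_df['Coefficient'].idxmax()]
top_neg = coefficients_df.loc[coefficients_df['Coefficient'].idxmin()]
print("Key Insights:")
print(f"- Largest positive coefficient: {top_pos['Feature']} ({top_pos['Coefficient']:,.4f})")
print(f"- Largest negative coefficient: {top_neg['Feature']} ({top_neg['Coefficient']:,.4f})")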

In [None]:
# Evaluate multiple regression model
y_pred_m = multi_model.predict(X_test_m)

mse_m = mean_squared_error(y_test_m, y_pred_m)
rmse_m = np.sqrt(mse_m)
mae_m = mean_absolute_error(y_test_m, y_pred_m)
r2_m = r2_score(y_test_m, y_pred_m)

print("Multiple Linear Regression Performance:")
print(f"\nMean Squared Error (MSE): {mse_m:,.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse_m:,.2f}")
print(f"Mean Absolute Error (MAE): ${mae_m:,.2f}")
print(f"R² Score: {r2_m:.3f}")

In [None]:
# Compare simple vs multiple regression
comparison_df = pd.DataFrame({
    'Metric': ['RMSE', 'R² Score', 'Number of Features'],
    'Simple Regression': [f'${rmse:,.2f}', f'{r2:.3f}', '1'],
    'Multiple Regression': [f'${rmse_m:,.2f}', f'{r2_m:.3f}', f'{X_multiple.shape[1]}']
})

print("\nModel Comparison:")
print(comparison_df.to_string(index=False))

improvement_rmse = ((rmse - rmse_m) / rmse) * 100
improvement_r2 = r2_m - r2  # absolute change; a percentage misleads when R² is near zero

print(f"\nChange from simple to multiple regression:")
print(f"- RMSE reduced by {improvement_rmse:.1f}%")
print(f"- R² changed by {improvement_r2:+.3f}")
print(f"\nA lower RMSE and a higher R² mean the extra features improved predictions.")

## 5. Visualizing Predictions vs Actuals

A perfect model would have all points on the diagonal line (predicted = actual).

In [None]:
# Create prediction comparison plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Simple regression
axes[0].scatter(y_test, y_pred, alpha=0.3, s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
            'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual House Value ($)', fontsize=12)
axes[0].set_ylabel('Predicted House Value ($)', fontsize=12)
axes[0].set_title(f'Simple Regression\nR² = {r2:.3f}', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Multiple regression
axes[1].scatter(y_test_m, y_pred_m, alpha=0.3, s=20)
axes[1].plot([y_test_m.min(), y_test_m.max()], [y_test_m.min(), y_test_m.max()], 
            'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual House Value ($)', fontsize=12)
axes[1].set_ylabel('Predicted House Value ($)', fontsize=12)
axes[1].set_title(f'Multiple Regression\nR² = {r2_m:.3f}', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insight: Points closer to the diagonal line = better predictions!")
print("The model with the higher R² should show less scatter around the diagonal.")

## 6. Making Predictions on New Data

Once trained, we can use the model to predict house values for new properties.
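Under the hood, a prediction is just the intercept plus the sum of each coefficient times its feature value. The sketch below checks this on a small synthetic model; the `toy_` names and the `size`/`age` columns are made up for illustration. Note that the new row must use the same column names, in the same order, as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data with two hypothetical features
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    'size': rng.uniform(50, 200, 100),
    'age': rng.uniform(0, 40, 100),
})
toy_y = 50 + 2.0 * toy_X['size'] - 1.5 * toy_X['age'] + rng.normal(0, 5, 100)

toy_model = LinearRegression().fit(toy_X, toy_y)

# One new example, with columns in the same order as during training
toy_new = pd.DataFrame({'size': [120.0], 'age': [10.0]})

# Prediction from the model vs. the formula y = β₀ + β₁x₁ + β₂x₂ by hand
from_predict = toy_model.predict(toy_new)[0]
by_hand = toy_model.intercept_ + np.dot(toy_model.coef_, toy_new.iloc[0].to_numpy())

print(f"model.predict: {from_predict:.4f}")
print(f"By hand:       {by_hand:.4f}")
```

Both numbers should agree up to floating-point rounding.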

In [None]:
# Create a new house example
new_house = pd.DataFrame({
    'MedInc': [5.0],
    'HouseAge': [25.0],
    'AveRooms': [6.0],
    'AveBedrms': [1.2],
    'Population': [1500.0],
    'AveOccup': [3.0],
    'Latitude': [37.5],
    'Longitude': [-122.0]
})

# Make prediction
predicted_value = multi_model.predict(new_house)[0]

print("New House Characteristics:")
for col in new_house.columns:
    print(f"  {col}: {new_house[col].values[0]}")

print(f"\nPredicted House Value: ${predicted_value:,.2f}")
print(f"\nThis prediction is based on learned patterns from {len(X_train_m)} training examples!")

## Exercises

Practice building and evaluating linear regression models.

### Exercise 1: Simple Regression with Different Features

Build a simple linear regression model using only the 'HouseAge' feature instead of 'MedInc'.

Steps:
1. Create X with only 'HouseAge' column
2. Split data (70/30)
3. Train a LinearRegression model
4. Calculate and print the R² score
5. Compare it to the R² score we got with 'MedInc' (printed above)
6. Which feature is better for prediction?

In [None]:
# Your code here


### Exercise 2: Interpreting Coefficients

Using the multiple regression model we built (multi_model), answer these questions:

1. What is the coefficient for 'AveRooms'?
2. If a house has 1 additional room on average, by how much does the predicted value change?
3. Which feature has the largest positive impact on house value?
4. Which feature has the largest negative impact?

Print the answers using the coefficients from multi_model.

In [None]:
# Your code here


### Exercise 3: Regression on Diabetes Dataset

Build a multiple linear regression model to predict disease progression.

Steps:
1. Load the diabetes dataset from 'data/sample/diabetes.csv'
2. Separate features (all columns except 'progression') and target ('progression')
3. Split data (70/30)
4. Train a LinearRegression model
5. Calculate and print RMSE and R² score
6. Create a scatter plot of actual vs predicted values

In [None]:
# Your code here


### Exercise 4: Residual Analysis

Residuals are the differences between actual and predicted values: `residual = actual - predicted`

Using the multiple regression model (multi_model) and test predictions:
1. Calculate residuals
2. Create a histogram of residuals
3. Calculate mean and standard deviation of residuals
4. What does the distribution of residuals tell you about the model?

Hint: Ideally, residuals should be normally distributed around zero.

In [None]:
# Your code here


## Summary

Congratulations! You've worked through the essentials of linear regression, one of the most fundamental ML algorithms.

### Key Concepts

1. **Linear Regression**:
   - Models relationship between features and target using a linear equation
   - Simple: y = β₀ + β₁x (one feature)
   - Multiple: y = β₀ + β₁x₁ + β₂x₂ + ... (many features)
   - Goal: Find coefficients that minimize the sum of squared prediction errors

2. **Model Parameters**:
   - **Intercept (β₀)**: Base value when all features are zero
   - **Coefficients (β₁, β₂, ...)**: Change in the target per one-unit increase in a feature, holding the other features constant
   - Positive coefficient = target rises as the feature increases
   - Negative coefficient = target falls as the feature increases

3. **Evaluation Metrics**:
   - **MSE**: Mean Squared Error (lower is better)
   - **RMSE**: Root MSE, same units as target (lower is better)
   - **R²**: Proportion of variance explained (at most 1, higher is better; negative when worse than predicting the mean)
   - "Model explains X% of variance" interpretation

4. **Simple vs Multiple Regression**:
   - Simple uses one feature (easier to visualize)
   - Multiple uses many features (usually better performance)
   - More relevant features generally improve predictions

5. **Best Practices**:
   - Always split data before training
   - Visualize relationships before modeling
   - Examine coefficients to understand each feature's effect (standardize features before comparing magnitudes)
   - Compare predicted vs actual values
   - Use multiple metrics for comprehensive evaluation

### When to Use Linear Regression

**Good for:**
- Continuous target variables
- Linear relationships between features and target
- Need for interpretable models
- Quick baseline models

**Not good for:**
- Non-linear relationships (use polynomial features or other algorithms)
- Classification problems (use logistic regression instead)
- Complex interactions between features (try tree-based methods)
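As a rough illustration of the first point, the sketch below fits a straight line and a degree-2 polynomial model to synthetic curved data. The data and the `curve_` names are made up, and the scores are computed on the training data only to keep the example short, so treat the numbers as a demonstration rather than a proper evaluation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic U-shaped relationship: y = 1 + 2x² plus noise
rng = np.random.default_rng(0)
curve_X = rng.uniform(-3, 3, size=(300, 1))
curve_y = 1.0 + 2.0 * curve_X[:, 0] ** 2 + rng.normal(0, 0.5, 300)

# Plain linear regression can only draw a straight line through the curve
line_fit = LinearRegression().fit(curve_X, curve_y)

# Adding x² as a feature keeps the model linear in its coefficients
poly_fit = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(curve_X, curve_y)

line_r2 = line_fit.score(curve_X, curve_y)
poly_r2 = poly_fit.score(curve_X, curve_y)

print(f"Straight line R²:     {line_r2:.3f}")
print(f"Degree-2 features R²: {poly_r2:.3f}")
```

The straight line explains very little of this curved pattern, while the model with the squared feature fits it closely.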

### What's Next?

In **Module 04: Logistic Regression**, you'll learn:
- How to adapt regression for classification problems
- Understanding the sigmoid function and decision boundaries
- Binary and multiclass classification
- Probability predictions and class prediction

### Additional Resources

- [Linear Regression - StatQuest](https://www.youtube.com/watch?v=nk2CQITm_eo)
- [scikit-learn Linear Models](https://scikit-learn.org/stable/modules/linear_model.html)
- [Understanding R-squared](https://blog.minitab.com/en/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)