[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CompOmics/D012554A_2025/blob/main/notebooks/day_1/1.1b_Exercises_Linear_regression.ipynb)

# 1.1 Exercises — Linear Regression

In the lecture notebook you learned the fundamentals of linear regression using a diabetes dataset. In these exercises you will apply those same techniques to a new problem: predicting the fuel efficiency (miles per gallon) of automobiles from their technical specifications.

The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and contains data for 398 cars, of which 392 remain once rows with missing values are dropped. Each car is described by:

| Feature | Description |
|---------|-------------|
| **cylinders** | Number of cylinders |
| **displacement** | Engine displacement (cubic inches) |
| **horsepower** | Engine horsepower |
| **weight** | Vehicle weight (lbs) |
| **acceleration** | Time to accelerate from 0 to 60 mph (seconds) |
| **model_year** | Model year (e.g. 70 = 1970) |
| **origin** | Origin of car (usa, europe, japan) |

The **target** variable is **mpg** (miles per gallon).

Throughout these exercises you will:
- Explore and visualize the dataset
- Standardize features
- Manually construct and evaluate linear models using $R^2$
- Implement a cost function and gradient descent from scratch
- Use scikit-learn for multi-feature linear regression
- Evaluate generalization using a train/test split

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
sns.set_context("notebook", font_scale=1.2)
sns.set_style("whitegrid")

---
## Exercise 1 — Load and Explore the Data

**Tasks:**
1. Load the Auto MPG dataset from the URL provided below.
2. Print the first 10 rows.
3. How many cars and features are in the dataset?
4. Use `.describe()` to compute summary statistics. What is the average fuel efficiency (mpg)?
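
As a hint for the pandas calls involved, here is a minimal sketch on a tiny in-memory CSV. The three rows and the name `toy` are made up for illustration; they are not the Auto MPG data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file or URL
csv_text = "mpg,weight\n18.0,3504\n15.0,3693\n36.0,2130\n"
toy = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts a URL string

print(toy.head(2))                         # first 2 rows
n_rows, n_cols = toy.shape                 # (rows, columns)
print(n_rows, n_cols)
print(toy.describe().loc["mean", "mpg"])   # mean of the mpg column
```

The same calls should work unchanged on a DataFrame loaded from `url`.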

In [None]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"

# YOUR CODE HERE
dataset = ...

We will work only with numeric columns. Run the cell below to drop the `name` column, remove any rows with missing values, and encode `origin` as an integer.

In [None]:
dataset = dataset.drop(columns=['name']).dropna()

# Convert origin from string to numeric (usa=1, europe=2, japan=3)
origin_map = {'usa': 1, 'europe': 2, 'japan': 3}
dataset['origin'] = dataset['origin'].map(origin_map)

print(f"Clean dataset shape: {dataset.shape}")
dataset.head()


---
## Exercise 2 — Visualize Relationships

Before building any model, visually inspect which features might be good predictors of **mpg**.

**Tasks:**
1. Create a scatter plot of **weight** (x-axis) vs **mpg** (y-axis).
2. Create a scatter plot of **horsepower** (x-axis) vs **mpg** (y-axis).
3. Create a scatter plot of **acceleration** (x-axis) vs **mpg** (y-axis).
4. Which feature appears to have the strongest linear relationship with mpg? Is it positive or negative?
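
If you are unsure how to place several scatter plots side by side, the sketch below shows one possible layout on synthetic data. The arrays `x`, `y_neg` and `y_none` are invented for illustration and have nothing to do with the car dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)            # made-up feature
y_neg = -2.0 * x + rng.normal(0, 1, 50)    # made-up target with a decreasing trend
y_none = rng.normal(0, 1, 50)              # made-up target with no trend

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, y_neg)
axes[0].set(xlabel="x", ylabel="y", title="negative linear trend")
axes[1].scatter(x, y_none)
axes[1].set(xlabel="x", ylabel="y", title="no clear trend")
fig.tight_layout()
```

With `%matplotlib inline` the figure is rendered automatically at the end of the cell.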

In [None]:
# YOUR CODE HERE

*Write your answer about which feature has the strongest linear relationship here.*

---
## Exercise 3 — Standardize the Data

Recall from the lecture that we should standardize features before fitting a linear model so that all features are on the same scale.

**Tasks:**
1. Create a `StandardScaler` for the feature **weight** and standardize it.
2. Create a `StandardScaler` for the target **mpg** and standardize it.
3. Verify that the standardized weight has mean $\approx 0$ and standard deviation $\approx 1$.
4. Create a scatter plot of the standardized weight vs standardized mpg. Does the shape look different from Exercise 2?
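
The following sketch illustrates the `StandardScaler` mechanics on five made-up numbers (the array `values` is an assumption, not a column of the dataset). Note that the scaler expects a 2D input of shape `(n_samples, n_features)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up feature values

scaler = StandardScaler()
# reshape(-1, 1) turns the 1D array into a single-column 2D array
scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()

print(scaled.mean())  # approximately 0
print(scaled.std())   # approximately 1 (NumPy's default is the population std, ddof=0)
```

For a DataFrame column, `dataset[['weight']]` (double brackets) already has the required 2D shape. Keep in mind that pandas' `.std()` uses `ddof=1` by default, so it will report a value slightly different from 1.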

In [None]:
feature_scaler = StandardScaler()
label_scaler = StandardScaler()

# YOUR CODE HERE
# Fit and transform weight and mpg


# Verify mean ≈ 0 and std ≈ 1


# Scatter plot of standardized data

---
## Exercise 4 — Fit a Manual Linear Model

Now let's manually pick parameters for a linear model $f(x) = ax + b$ and evaluate the fit.

Since the data is standardized, we expect $b \approx 0$. Set $b = 0$.

**Tasks:**
1. Looking at the scatter plot from Exercise 3, the relationship between weight and mpg is **negative** (heavier cars get fewer miles per gallon). Pick a negative value for $a$ that you think fits the data well.
2. Plot your regression line on top of the standardized scatter plot.
3. Compute $R^2$ manually using:

   $$SS_{error} = \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}))^2, \quad SS_{total} = \sum_{i=1}^{n} (y^{(i)} - \bar{y})^2, \quad R^2 = 1 - \frac{SS_{error}}{SS_{total}}$$

4. Verify your result with `sklearn.metrics.r2_score()`.
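
To check your understanding of the formula, here is a small worked example on four made-up points (`y_true` and `y_pred` are illustrative values only). It should give $SS_{error} = 0.1$, $SS_{total} = 5$ and therefore $R^2 \approx 0.98$.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up targets
y_pred = np.array([1.1, 1.9, 3.2, 3.8])  # made-up predictions

ss_error = np.sum((y_true - y_pred) ** 2)
ss_total = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_error / ss_total

# r2_score takes the true values first, then the predictions
print(r2_manual, metrics.r2_score(y_true, y_pred))
```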

In [None]:
a = ...  # pick a negative value
b = 0

# Plot the data with your regression line
# YOUR CODE HERE

In [None]:
# Compute R-squared manually
# YOUR CODE HERE
SS_error = ...
SS_total = ...
R_squared = ...

print(f"Manual R-squared = {R_squared}")

In [None]:
# Verify with sklearn
# YOUR CODE HERE

---
## Exercise 5 — Implement the Cost Function

Recall the cost function:

$$J(a, b) = \frac{1}{2n} \sum_{i=1}^{n} (f(x^{(i)}, (a,b)) - y^{(i)})^2$$

**Tasks:**
1. Implement the function `cost_function(a, b, X, y)` below.
2. Compute $J$ for your value of $a$ from Exercise 4 (with $b = 0$).
3. Also compute $J$ for $a = 0$ and $a = -1$. Which has the lowest cost?

> **Warning:** Be careful with operator precedence! `1.0/2*n` evaluates as `(1.0/2) * n`, not `1.0 / (2*n)`.
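
The precedence pitfall is easy to verify directly. The value `n = 4` below is arbitrary.

```python
n = 4
wrong = 1.0 / 2 * n    # parsed as (1.0 / 2) * n
right = 1.0 / (2 * n)  # the intended 1 / (2n)
print(wrong, right)    # 2.0 0.125
```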

In [None]:
def cost_function(a, b, X, y):
    # YOUR CODE HERE
    ...

# Test your function

---
## Exercise 6 — Visualize the Cost Landscape

**Tasks:**
1. Compute the cost $J(a, 0)$ for values of $a$ ranging from $-2$ to $2$ in steps of $0.05$.
2. Plot $a$ on the x-axis and $J$ on the y-axis.
3. At approximately which value of $a$ is the cost minimized?
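
One way to read the minimum off a grid, rather than estimating it by eye, is `np.argmin`. The sketch below uses a hypothetical `toy_cost` function with a known minimum at $a = 0.5$; replace it with your own `cost_function` for the exercise.

```python
import numpy as np

def toy_cost(a):
    # Hypothetical cost with its minimum at a = 0.5
    return (a - 0.5) ** 2 + 1.0

grid = np.linspace(-2, 2, 81)                  # -2, -1.95, ..., 2
costs = np.array([toy_cost(a) for a in grid])  # one cost value per grid point
best = grid[np.argmin(costs)]                  # grid value with the lowest cost
print(best)
```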

In [None]:
# YOUR CODE HERE
a_values = np.linspace(-2, 2, 81)  # -2 to 2 inclusive, step 0.05 (np.arange(-2, 2, 0.05) would stop at 1.95)

*At which value of $a$ is the cost minimized?*

---
## Exercise 7 — Implement Gradient Descent

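Now implement the gradient descent algorithm to automatically find the optimal slope. We use a single feature, so the model is $f(x, \theta) = \theta_1 x$: $\theta_1$ plays the role of the slope $a$ from the previous exercises, and the intercept $\theta_0$ (previously $b$) is fixed at 0.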

Recall the stochastic gradient descent update rule for a randomly selected data point $i$:

$$\theta_1 := \theta_1 - \alpha \cdot (f(x^{(i)}, \theta) - y^{(i)}) \cdot x_1^{(i)}$$

**Tasks:**
1. Complete the `linear_regression()` function below by filling in the missing steps.
2. Run it on the standardized Auto MPG data (weight vs mpg) with `alpha=0.01` and `iterations=5000`.
3. Print the learned $\theta_1$ and the corresponding $R^2$.
4. Plot the data with the fitted regression line.
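
To make the update rule concrete, the sketch below performs a single update step on four made-up points that lie exactly on the line $y = -x$. The arrays `X_toy` and `y_toy`, the seed, and the value of `alpha` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = np.array([-1.0, 0.0, 1.0, 2.0])   # made-up feature values
y_toy = np.array([1.0, 0.0, -1.0, -2.0])  # made-up targets (true slope is -1)

theta1 = 0.0
alpha = 0.1

idx = rng.integers(len(X_toy))        # random index in [0, len(X_toy))
prediction = theta1 * X_toy[idx]      # f(x) = theta1 * x, with theta0 = 0
error = prediction - y_toy[idx]       # prediction minus actual
theta1 = theta1 - alpha * error * X_toy[idx]
print(idx, theta1)
```

Starting from `theta1 = 0`, a single step can only move the slope toward the true value of $-1$ (or leave it unchanged if the point with $x = 0$ is drawn). Repeating the step many times is what the loop in `linear_regression()` is meant to do.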

In [None]:
def linear_regression(X, y, alpha, iterations):
    """Perform linear regression using stochastic gradient descent.
    
    Parameters
    ----------
    X : array-like, feature values
    y : array-like, target values
    alpha : float, learning rate
    iterations : int, number of update steps
    
    Returns
    -------
    theta1 : float, the learned slope parameter
    """
    # Step 1: initialize theta1 (e.g. to 0 or a random value)
    theta1 = ...
    
    for i in range(iterations):
        # Step 2: pick a random data point index
        idx = ...
        
        # Step 3: compute the prediction for this data point
        predict = ...
        
        # Step 4: compute the error (prediction - actual)
        error = ...
        
        # Step 5: update theta1 using the gradient descent rule
        theta1 = ...
    
    return theta1

In [None]:
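# Run gradient descent on the standardized data (not the raw columns)
# NumPy arrays are used so that X[idx] is positional; the DataFrame index has gaps after dropna()
X_std = feature_scaler.fit_transform(dataset[['weight']]).ravel()
y_std = label_scaler.fit_transform(dataset[['mpg']]).ravel()
theta1 = linear_regression(X_std, y_std, alpha=0.01, iterations=5000)

print(f"theta1 = {theta1}")
print(f"R-squared = {metrics.r2_score(y_std, theta1 * X_std)}")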

# Plot the result
# YOUR CODE HERE

---
## Exercise 8 — Effect of the Learning Rate

The learning rate $\alpha$ controls the size of each update step.

**Tasks:**
1. Run `linear_regression()` for each of the following learning rates: `0.0001`, `0.001`, `0.01`, `0.1`, `1.0` (use 5000 iterations each).
2. For each, print the final $\theta_1$ and $R^2$.
3. What happens when $\alpha$ is too large? What happens when it is too small?

> **Tip:** Run each setting a few times since stochastic gradient descent uses random sampling.
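
The role of the step size can be seen in isolation on a one-dimensional toy problem. The sketch below is not the stochastic algorithm from Exercise 7: it runs plain (deterministic) gradient descent on the hypothetical cost $J(\theta) = \frac{1}{2}(\theta + 1)^2$, whose minimum is at $\theta = -1$. The function name `run_gd` and the three step sizes are assumptions made for this illustration.

```python
def run_gd(alpha, steps=100):
    """Deterministic gradient descent on J(theta) = 0.5 * (theta + 1)**2."""
    theta = 0.0
    for _ in range(steps):
        gradient = theta + 1.0            # dJ/dtheta
        theta = theta - alpha * gradient  # distance to -1 is multiplied by (1 - alpha)
    return theta

for alpha in [0.001, 0.5, 2.5]:
    print(alpha, run_gd(alpha))
```

Each step multiplies the distance to the optimum by $(1 - \alpha)$, so a very small $\alpha$ barely moves in 100 steps, while any $\alpha > 2$ makes the distance grow. The thresholds for the stochastic version on real data will differ, which is what the exercise asks you to explore.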

In [None]:
# YOUR CODE HERE
for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    pass  # replace with your code

*Write your observations about the effect of the learning rate here.*

---
## Exercise 9 — Multi-feature Regression with Scikit-learn

Using only **weight** gives a limited model. Let's use **all numeric features** to predict mpg.

**Tasks:**
1. Reload the dataset and drop the `name` column and missing values (the cleanup cell is provided).
2. Separate the target **mpg** from the remaining features.
3. Standardize **all** features using a single `StandardScaler`.
4. Fit a `sklearn.linear_model.SGDRegressor` with `eta0=0.001` and `max_iter=10000`.
5. Compute the $R^2$ on the full dataset. How does it compare to the single-feature model?
6. Print `model.coef_` alongside the feature names. Which feature has the largest absolute coefficient?
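
The overall workflow can be sketched on synthetic data as follows. Everything about the data here is invented: the feature names `f0`, `f1`, `f2`, the scales, and the coefficients used to generate `y`. Only the scikit-learn calls carry over to the exercise.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = ["f0", "f1", "f2"]                         # hypothetical feature names
X_raw = rng.normal(size=(200, 3)) * [1, 10, 100]   # features on very different scales
y = (3 * X_raw[:, 0] - 0.2 * X_raw[:, 1] + 0.005 * X_raw[:, 2]
     + rng.normal(0, 0.1, 200))                    # made-up linear target plus noise

X_std = StandardScaler().fit_transform(X_raw)      # one scaler for all columns
model = linear_model.SGDRegressor(eta0=0.001, max_iter=10000, random_state=0)
model.fit(X_std, y)

print(metrics.r2_score(y, model.predict(X_std)))
for name, coef in zip(names, model.coef_):         # pair each coefficient with its name
    print(f"{name}: {coef:.3f}")
```

Because the features were standardized, the magnitudes of the coefficients are directly comparable, which is what makes the last task meaningful. For a DataFrame, `features.columns` can take the place of `names`.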

In [None]:
from sklearn import linear_model

# Reload and clean
dataset = pd.read_csv(url).drop(columns=['name']).dropna()
origin_map = {'usa': 1, 'europe': 2, 'japan': 3}
dataset['origin'] = dataset['origin'].map(origin_map)

# Separate features and target
# YOUR CODE HERE
target = ...
features = ...

# Standardize features
# YOUR CODE HERE

# Fit SGDRegressor
# YOUR CODE HERE

# Print R-squared

# Print coefficients with feature names

*Which feature has the largest coefficient and what does this mean?*

---
## Exercise 10 — Train/Test Split

So far we evaluated the model on the **same data** it was trained on. This is problematic — the model might just memorize the training data without learning the true underlying relationship. This is called **overfitting**.

To get an honest estimate of model performance, we split the data into:
- A **training set** (used to fit the model)
- A **test set** (used to evaluate on unseen data)

**Tasks:**
1. Use `sklearn.model_selection.train_test_split` to split the standardized features and target into 80% training / 20% test (use `random_state=42`).
2. Fit an `SGDRegressor` on the **training** data only.
3. Compute $R^2$ on the **training** set and on the **test** set.
4. Is there a difference? What does this tell you about the model?
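
As a reminder of how `train_test_split` returns its results, here is a minimal sketch on ten made-up samples (`X_toy` and `y_toy` are illustrative arrays, not the car data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features
y_toy = np.arange(10)                 # one made-up target per sample

# Returns train and test parts for each input, in this order
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Rows of the features and the target are shuffled together, so each row in `X_train` still belongs to the entry of `y_train` at the same position.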

In [None]:
from sklearn.model_selection import train_test_split

# 1. Split the data
# YOUR CODE HERE

# 2. Fit the model on the training set
# YOUR CODE HERE

# 3. Evaluate on both sets
# YOUR CODE HERE

*Is there a difference between training and test $R^2$? What does this tell you?*

---
## Exercise 11 — Predict for a New Car

Suppose a new car has the following specifications:

| cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|-----------|-------------|------------|--------|--------------|------------|--------|
| 4 | 120 | 80 | 2500 | 15.5 | 82 | 3 (Japan) |

**Tasks:**
1. Create a DataFrame (or array) with this car's features.
2. Standardize the features using the **same scaler** you fitted in Exercise 9 (call `.transform()`, not `.fit_transform()`).
3. Use your trained model to predict the mpg for this car.
4. Why is it crucial to reuse the scaler that was fitted on the original data, rather than fitting a new one?
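
The difference between reusing a fitted scaler and fitting a fresh one can be tried out on a toy example. The numbers below are made up and use a single feature for brevity.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1000.0], [2000.0], [3000.0]])  # made-up training feature
new_point = np.array([[2500.0]])                  # made-up new sample

scaler = StandardScaler().fit(train)
reused = scaler.transform(new_point)               # uses the mean and std of `train`
refit = StandardScaler().fit_transform(new_point)  # mean and std of the new sample alone

print(reused)  # 2500 expressed in units of the training distribution
print(refit)   # [[0.]]
```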

In [None]:
# YOUR CODE HERE
new_car = ...


*Why must we use the same scaler?*

---
## Bonus Exercise — Residual Analysis

A **residual plot** helps diagnose whether a linear model is appropriate for the data.

**Tasks:**
1. Using your multi-feature model from Exercise 10, compute the predictions on the **test** set.
2. Compute the residuals: `residuals = y_test - predictions`.
3. Create a scatter plot of predictions (x-axis) vs residuals (y-axis).
4. Add a horizontal dashed line at $y = 0$.
5. If the model is appropriate, what pattern should you see? What pattern would indicate a problem?
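
The plotting mechanics for a residual plot might look like the sketch below. The `predictions` and `residuals` arrays are random placeholders generated for illustration; in the exercise they come from your model and test set.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
predictions = rng.uniform(10, 40, size=60)  # placeholder predictions
residuals = rng.normal(0, 2, size=60)       # placeholder residuals

fig, ax = plt.subplots()
ax.scatter(predictions, residuals)
ax.axhline(0, color="black", linestyle="--")  # horizontal reference line at y = 0
ax.set(xlabel="predicted value", ylabel="residual")
```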

In [None]:
# YOUR CODE HERE

*Describe the pattern you observe and what it tells you about the linear model.*