
# Regression Lab Sheet — *mtcars* Dataset

**Goal:** Practice the full regression workflow you will use in the quiz:  
*data exploration → simple regression → multiple regression → transformation → diagnostics → reflection.*

This lab uses a **different dataset** from the teaching notebook: the classic **`mtcars`** dataset.



## Learning Outcomes
By the end of this lab you should be able to:
1. Load and explore a dataset and identify candidate predictors.
2. Fit and interpret a **simple linear regression**.
3. Fit and interpret a **multiple linear regression**.
4. Apply a **log transformation** and explain why/when it helps.
5. Read key diagnostics: **R² / Adj. R², AIC/BIC, Durbin–Watson, normality tests, Condition Number**.
6. Answer short quiz-style questions about your results.



---
## 1) Setup & Load the Data

**Instruction:** Import the required libraries. Load the `mtcars` dataset via `statsmodels`. Explore the data.

> *What to think about*: Which variables might predict **miles per gallon (`mpg`)**?


In [None]:

# --- Code: import libraries and load the dataset ---
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

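# Load the mtcars dataset via statsmodels.
# get_rdataset downloads the data from the Rdatasets repository, so an internet connection is needed.
mtcars = sm.datasets.get_rdataset("mtcars").data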

# Preview the first rows
mtcars.head()



**Task A:** Use `.describe()` and `.info()` to explore the dataset.  
- Identify numeric predictors that might be relevant for `mpg` (e.g., `wt`, `hp`, `disp`, `cyl`).  
- Jot down a short hypothesis: *"I expect mpg to decrease as wt increases,"* etc.


In [None]:

# --- Code: quick data summary ---
mtcars.describe()


In [None]:

# --- Code: structure / types ---
mtcars.info()



---
## 2) Visual Exploration: Scatterplots (mpg vs each variable)

**Instruction:** Before fitting models, create **scatterplots** of `mpg` against **each numeric predictor** to see which variables appear to have a roughly **linear** relationship with `mpg`.

> *What to think about*: Look for approximately straight-line patterns (linear trend), obvious outliers, or curvature. Make notes on which variables you would include first.


In [None]:

# --- Code: scatterplots mpg vs each numeric variable (separate figures) ---
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Ensure mtcars is loaded (from earlier cell). If not, uncomment the next two lines:
# import statsmodels.api as sm
# mtcars = sm.datasets.get_rdataset("mtcars").data

numeric_cols = mtcars.select_dtypes(include=np.number).columns.tolist()

# Remove the response variable from the list
if "mpg" in numeric_cols:
    numeric_cols.remove("mpg")

for col in numeric_cols:
    plt.figure()
    plt.scatter(mtcars[col], mtcars["mpg"])
    plt.xlabel(col)
    plt.ylabel("mpg")
    plt.title(f"mpg vs {col}")
    plt.show()



**Task (Visual assessment):**
1. List 2–3 variables that show a **clear negative** relationship with `mpg`.
2. List any variables that appear to have a **nonlinear** pattern or notable **outliers**.
3. Based on your plots, which **one** variable would you try **first** to predict `mpg`? Why?



---
## 3) Simple Linear Regression

**Instruction:** Fit a regression predicting `mpg` from `wt` (car weight, in 1000 lbs).

> *What to think about*: What sign do you expect for the slope of `wt`?


In [None]:

# --- Code: simple regression mpg ~ wt ---
formula_str = "mpg ~ wt"
result_simple = smf.ols(formula=formula_str, data=mtcars).fit()
print(result_simple.summary())



**Task B (Short answers):**
1. Report the slope for `wt`. Interpret it in words.
2. Report **R²** and **Adj. R²**. What do they tell you about model fit?
3. Is `wt` statistically significant? How do you know?
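
The slope, R² and Adj. R² in the summary table can be reproduced by hand, which helps when interpreting them. The sketch below uses made-up toy numbers (not `mtcars`); the names `x`, `y` and every value in it are assumptions for illustration only.

```python
import numpy as np

# Made-up toy data (NOT mtcars): x plays the role of wt, y the role of mpg.
x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
y = np.array([30.0, 26.0, 24.0, 20.0, 19.0, 13.0])

n = len(y)   # number of observations
k = 1        # number of predictors (intercept not counted)

# Least-squares slope and intercept for y = intercept + slope * x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# R^2: share of the variation in y explained by the fitted line
fitted = intercept + slope * x
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 penalises R^2 for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R2={r2:.3f}, Adj R2={adj_r2:.3f}")
```

For the lab model, the same quantities are available as `result_simple.params`, `result_simple.rsquared` and `result_simple.rsquared_adj`.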



---
## 4) Multiple Linear Regression

**Instruction:** Add `hp` (horsepower) to the model: `mpg ~ wt + hp`.

> *Hint*: In OLS, adding a predictor never decreases R² (which is why Adj. R² is reported alongside it), but it can introduce **multicollinearity**.
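
The behaviour of R² when a predictor is added can be checked on synthetic data. In this sketch, the helper `r2_and_adj`, the seed and all variables are hypothetical. The second model adds a column of pure noise: R² cannot go down, while Adj. R² may.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)              # unrelated to y by construction
y = 3.0 - 2.0 * x1 + rng.normal(size=n)     # synthetic response

def r2_and_adj(predictors, y):
    """Fit OLS with an intercept via least squares; return (R^2, adjusted R^2)."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                      # predictors, excluding the intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_a, adj_a = r2_and_adj(x1, y)
r2_b, adj_b = r2_and_adj(np.column_stack([x1, noise_col]), y)

print(f"one predictor:    R2={r2_a:.4f}, Adj R2={adj_a:.4f}")
print(f"plus noise column: R2={r2_b:.4f}, Adj R2={adj_b:.4f}")
```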


In [None]:

# --- Code: multiple regression mpg ~ wt + hp ---
formula_str = "mpg ~ wt + hp"
result_multi = smf.ols(formula=formula_str, data=mtcars).fit()
print(result_multi.summary())



**Task C (Compare to the simple model):**
1. Did **R²** increase? By how much?
2. Which predictors are significant now?
3. Check the **Condition Number** in the summary. Is it concerning? Explain briefly.
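
A large Condition Number can come from correlated predictors, but also simply from predictors measured on very different scales. The sketch below illustrates this on synthetic data, assuming the summary's Condition Number is the 2-norm condition number of the design matrix (intercept included), which is what `np.linalg.cond` computes by default. The variables `wt` and `hp` here are simulated stand-ins, not the real columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
wt = rng.uniform(1.5, 5.5, size=n)                # synthetic "weight"
hp = 50 + 40 * wt + rng.normal(0, 20, size=n)     # synthetic "horsepower", correlated with wt

def design(*cols):
    """Design matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(n), *cols])

def zscore(v):
    """Centre to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()

cond_raw = np.linalg.cond(design(wt, hp))                   # original units
cond_std = np.linalg.cond(design(zscore(wt), zscore(hp)))   # standardised predictors
r = np.corrcoef(wt, hp)[0, 1]

print(f"corr(wt, hp) = {r:.3f}")
print(f"condition number, raw units:    {cond_raw:.1f}")
print(f"condition number, standardised: {cond_std:.1f}")
```

After standardising, what remains reflects only the correlation between the predictors. For the lab model, `result_multi.condition_number` gives the reported value.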



---
## 5) Transformation

**Instruction:** Create a log-transformed horsepower variable and fit `mpg ~ wt + log_hp`.

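> *Why log?* `hp` can be right-skewed; taking its natural log can straighten a curved relationship with `mpg` and reduces the pull of very high-horsepower cars. It does not by itself remove the correlation between `wt` and `hp`, although the smaller scale of `log_hp` tends to lower the Condition Number.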
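
To see what "linearizing" means, the sketch below builds a made-up response that is exactly linear in `log(x)` and compares the Pearson correlation before and after the transformation. All numbers are invented for illustration and are not taken from `mtcars`.

```python
import numpy as np

# Made-up horsepower-like values and a response that is exactly linear in log(x).
x = np.array([50.0, 75.0, 100.0, 150.0, 200.0, 300.0])
y = 60.0 - 8.0 * np.log(x)

r_raw = np.corrcoef(x, y)[0, 1]           # correlation on the original scale
r_log = np.corrcoef(np.log(x), y)[0, 1]   # correlation after logging x

print(f"corr(x, y)      = {r_raw:.4f}")
print(f"corr(log x, y)  = {r_log:.4f}")
```

On the original scale the points follow a curve, so the straight-line correlation is weaker; after logging `x` the relationship is a straight line.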


In [None]:

# --- Code: transformation & model mpg ~ wt + log_hp ---
mtcars["log_hp"] = np.log(mtcars["hp"])

formula_str = "mpg ~ wt + log_hp"
result_log = smf.ols(formula=formula_str, data=mtcars).fit()
print(result_log.summary())



**Task D (Model comparison):**
1. Compare **R² / Adj. R², AIC, BIC** across the three models (simple, multiple, log).
2. Interpret the coefficient of `log_hp`. *(Tip: `np.log` is the natural log, so a 1-unit increase in `log_hp` corresponds to multiplying `hp` by **e ≈ 2.72**. For a doubling of `hp`, multiply the coefficient by **ln 2 ≈ 0.693**.)*
3. Which model would you choose and **why**?
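
AIC and BIC both reward a smaller residual sum of squares and penalise extra parameters; lower values are better. The sketch below computes them from the Gaussian log-likelihood of an OLS fit. The function name `gaussian_aic_bic` and the example numbers are hypothetical, and counting only the regression coefficients (intercept included) as parameters is an assumption about the convention used in the summary.

```python
import math

def gaussian_aic_bic(ssr, n, n_params):
    """AIC and BIC for an OLS fit with Gaussian errors.

    ssr      : residual sum of squares
    n        : number of observations
    n_params : regression coefficients, intercept included (assumed convention)
    """
    llf = -0.5 * n * (math.log(2 * math.pi) + math.log(ssr / n) + 1)
    aic = -2 * llf + 2 * n_params
    bic = -2 * llf + math.log(n) * n_params
    return aic, bic

# Hypothetical comparison: the bigger model lowers SSR only slightly.
aic_small, bic_small = gaussian_aic_bic(ssr=10.0, n=10, n_params=2)
aic_big, bic_big = gaussian_aic_bic(ssr=9.8, n=10, n_params=3)

print(f"small model: AIC={aic_small:.3f}, BIC={bic_small:.3f}")
print(f"big model:   AIC={aic_big:.3f}, BIC={bic_big:.3f}")
```

In this made-up case the extra parameter does not pay for itself, so both criteria prefer the smaller model. For the lab models, the reported values are available as `.aic` and `.bic` on each fitted result (for example `result_log.aic`).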



---
## 6) Diagnostics & Assumptions

**Instruction:** Use the model summary to discuss key diagnostics.

> *Checklist*: Durbin–Watson, residual normality (Omnibus / Jarque–Bera), and Condition Number.


In [None]:

# --- Code: re-print the chosen model's summary (adjust if you chose a different one) ---
print(result_log.summary())



**Task E (Diagnostics):**
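1. **Durbin–Watson**: Is there evidence of autocorrelation? (Rule of thumb: values near 2 suggest none; values toward 0 or 4 suggest positive or negative autocorrelation. Because `mtcars` is not a time series, the statistic depends on row order, so interpret it cautiously.)
2. **Normality**: Do the Omnibus / JB p-values (`Prob(Omnibus)`, `Prob(JB)`) suggest the residuals are approximately normal? (Small p-values, e.g. below 0.05, are evidence against normality.)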
3. **Condition Number**: Any serious multicollinearity concerns?
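
The Durbin–Watson and Jarque–Bera numbers in the summary are simple functions of the residuals. The sketch below computes both from their textbook formulas on a tiny made-up residual vector; the function names `durbin_watson_stat` and `jarque_bera_stat` are hypothetical helpers written for this illustration.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def jarque_bera_stat(resid):
    """Jarque-Bera statistic and its chi-square(2) p-value."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = np.exp(-jb / 2)   # upper tail of a chi-square with 2 degrees of freedom
    return jb, p_value

# Made-up residuals that flip sign every row (negative autocorrelation).
toy_resid = np.array([1.0, -1.0, 1.0, -1.0])

dw = durbin_watson_stat(toy_resid)
jb, p = jarque_bera_stat(toy_resid)
print(f"Durbin-Watson = {dw:.2f}")
print(f"Jarque-Bera   = {jb:.3f}, p-value = {p:.3f}")
```

With alternating signs the Durbin–Watson value lands above 2, in the direction of negative autocorrelation. On the lab model, `statsmodels.stats.stattools.durbin_watson(result_log.resid)` and `statsmodels.stats.stattools.jarque_bera(result_log.resid)` give the library versions.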



---
## 7) (Optional) Quick Residual Plots

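**Instruction (optional):** Create a residuals-vs-fitted plot to visually check linearity and constant variance: the points should scatter evenly around zero with no curve or funnel shape.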


In [None]:

# --- Code: basic residual plot for the chosen model ---
import matplotlib.pyplot as plt

residuals = result_log.resid
fitted = result_log.fittedvalues

plt.figure()
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted (mpg ~ wt + log_hp)")
plt.show()



---
## 8) Mini Practice Quiz (Unmarked)

Answer briefly in your notebook:

1. In the model `mpg ~ wt`, if `wt` increases by 1 (1000 lbs), how does `mpg` change?
2. Which model fit best according to **AIC/BIC**?
3. Explain in one sentence what **R²** means.
4. If `hp` **doubles**, by approximately how many units does `mpg` change according to the `log_hp` model?
5. A **Condition Number** over 1000 indicates what issue? Why does it matter?

> *Tip:* Keep your answers concise and based on your own model outputs.
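
For question 4, the arithmetic follows from the model form: with `wt` held fixed, multiplying `hp` by a factor *c* changes the prediction by the `log_hp` coefficient times ln *c*. The sketch below applies this with a hypothetical coefficient of -5.0, chosen only for illustration; substitute the value from your own summary.

```python
import math

def mpg_change(beta_log_hp, hp_factor):
    """Predicted change in mpg when hp is multiplied by hp_factor, wt held fixed."""
    return beta_log_hp * math.log(hp_factor)

beta = -5.0   # hypothetical coefficient, NOT the lab result

change_double = mpg_change(beta, 2)        # hp doubles
change_e = mpg_change(beta, math.e)        # hp multiplied by e, i.e. log_hp rises by 1

print(f"hp x 2: mpg changes by {change_double:.3f}")
print(f"hp x e: mpg changes by {change_e:.3f}")
```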



---
### Wrap-up

- You’ve practiced the same workflow you will use in the marked exercise/quiz.  
- If any step felt unclear, review the relevant section in your **teaching notebook** and repeat the task here.

**Good luck — you’ve got this!**
