<a href="https://colab.research.google.com/github/HunterTzou/Data_201_Spring_2026/blob/main/Week3_part2_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 4 – Linear Regression in Python (Student Notebook)

**Audience:** R users transitioning to Python

**Focus:** Simple and Multiple Linear Regression (continuous outcomes)

> ✍️ **Instructions:** Complete all TODO sections. Reflection questions are required and assess conceptual understanding.

---

## 1. Learning Objectives

By the end of this lesson, you should be able to:

* Translate R `lm()` models into Python using `statsmodels`
* Fit simple and multiple linear regression models in Python
* Compare inference-focused vs prediction-focused workflows
* Interpret coefficients consistently across R and Python
* Evaluate predictive performance using appropriate metrics

---

## 2. Setup

```python
# TODO: Run this cell
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
```

## 3. Data

In **R**, you would typically load data like this:

```r
df <- read.csv("housing.csv")
```

### Python Translation

```python
# TODO: Load the dataset
df = pd.read_csv("housing.csv")
df.head()
```


Assume the following variables:

* `price` (continuous outcome)
* `size` (continuous predictor)
* `bedrooms` (numeric predictor)
* `neighborhood` (categorical predictor)
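
Before modeling, it can help to confirm that each column has the type you expect, much as you would with `str(df)` in R. The sketch below uses a tiny hand-made DataFrame named `toy` (hypothetical values, not the course dataset) so that it runs without `housing.csv`.

```python
import pandas as pd

# Hypothetical toy rows that mimic the assumed column layout (not real data)
toy = pd.DataFrame({
    "price": [250.0, 310.5, 199.9, 420.0],
    "size": [1100, 1500, 900, 2100],
    "bedrooms": [2, 3, 2, 4],
    "neighborhood": ["east", "north", "east", "south"],
})

toy.info()                                   # rough analogue of str(df) in R
print(toy.describe())                        # numeric summaries, similar to summary(df)
print(toy["neighborhood"].value_counts())    # analogue of table(df$neighborhood)

# Names of the columns pandas treats as numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

If `neighborhood` showed up among the numeric columns (for example, because it was stored as integer codes), it would need explicit categorical handling later on.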

---

## 4. Conceptual Review: What Linear Regression Models

### In R

```r
lm(price ~ size, data = df)
```

* Models the *expected value* of a continuous outcome
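* Each slope coefficient is a conditional mean difference: the difference in the expected outcome associated with a one-unit difference in that predictor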
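
To make the "conditional mean difference" reading concrete, here is a minimal sketch on synthetic data. The variables `x` and `y`, and the true intercept and slope (2 and 3), are assumptions chosen for illustration. It uses plain NumPy least squares rather than any regression library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known line: E[y | x] = 2 + 3x
x = rng.uniform(0, 10, size=500)
y = 2 + 3 * x + rng.normal(0, 1, size=500)

# Least-squares straight-line fit; polyfit returns the highest power first
slope, intercept = np.polyfit(x, y, deg=1)

def predicted_mean(x_value):
    """Estimated E[y | x = x_value] from the fitted line."""
    return intercept + slope * x_value

# Predicted means at two x values one unit apart differ by exactly the slope
diff = predicted_mean(6.0) - predicted_mean(5.0)
print(f"slope = {slope:.3f}, difference in predicted means = {diff:.3f}")
```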

### Key idea

✏️ **Reflection:** How does this interpretation change (or not) when we move to Python?

---

## 5. Exploratory Analysis


```python
# TODO: Create a scatterplot of price vs size
```


✏️ **Reflection:** Does a linear relationship seem plausible?
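
For R users used to `ggplot2`, one possible way to draw this kind of plot is sketched below with `seaborn`. The data are synthetic stand-ins (the coefficients and noise level are invented for illustration), so the picture says nothing about the real housing data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# Synthetic stand-in data (hypothetical values, not housing.csv)
toy = pd.DataFrame({"size": rng.uniform(800, 2500, size=100)})
toy["price"] = 50 + 0.12 * toy["size"] + rng.normal(0, 20, size=100)

# Scatterplot with a fitted straight line overlaid; similar in spirit to
# ggplot(toy, aes(size, price)) + geom_point() + geom_smooth(method = "lm")
ax = sns.regplot(data=toy, x="size", y="price", scatter_kws={"alpha": 0.6})
ax.set_title("Price vs. size (synthetic data)")
```

Use `sns.scatterplot` instead of `sns.regplot` if you want the points without the fitted line.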

---

## 6. Simple Linear Regression (Inference)

### R Reference

```r
model_r <- lm(price ~ size, data = df)
summary(model_r)
```

### Python Translation (statsmodels)

```python
# TODO: Fit a simple linear regression using statsmodels
model_sm_simple = ...

# TODO: Display the model summary
```

✏️ **Reflection:** Identify two similarities between `summary(model_r)` and the Python output.
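
As a reference for the translation pattern, the sketch below fits a one-predictor model with the `statsmodels` formula interface on synthetic data. The column names `x` and `y` and the true coefficients are assumptions for illustration, and the comments give the closest R call.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Synthetic data with a known intercept (2) and slope (3)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=500)})
toy["y"] = 2 + 3 * toy["x"] + rng.normal(0, 1, size=500)

# R: fit <- lm(y ~ x, data = toy)
fit = smf.ols("y ~ x", data=toy).fit()

print(fit.summary())     # R: summary(fit)
print(fit.params)        # R: coef(fit)
print(fit.conf_int())    # R: confint(fit)
```

Note the two-step pattern: `smf.ols(...)` only specifies the model, and `.fit()` estimates it. In R, `lm()` does both at once.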

---

## 7. Multiple Linear Regression (Inference)

### R Reference

```r
lm(price ~ size + bedrooms + neighborhood, data = df)
```

### Python Translation (statsmodels)



```python
# TODO: Fit a multiple linear regression using statsmodels
model_sm_multiple = ...

# TODO: Display the summary
```


✏️ **Reflection:** What does the coefficient on `size` represent *holding other variables constant*?
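
One way to see what "holding other variables constant" changes is to compare the same coefficient with and without a correlated second predictor. The sketch below does this on synthetic data. The variables `x1` and `x2` and all numeric values are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000

# x2 is correlated with x1, and both affect y (true effects: 2 and 3)
x1 = rng.normal(0, 1, size=n)
x2 = 0.8 * x1 + rng.normal(0, 0.6, size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(0, 1, size=n)
toy = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

simple = smf.ols("y ~ x1", data=toy).fit()         # x2 left out
multiple = smf.ols("y ~ x1 + x2", data=toy).fit()  # x2 held constant

# The simple slope absorbs x2's effect (about 2 + 3 * 0.8 = 4.4);
# the multiple-regression slope recovers the direct effect (about 2)
print(f"x1 slope, simple model:   {simple.params['x1']:.2f}")
print(f"x1 slope, multiple model: {multiple.params['x1']:.2f}")
```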

---

## 8. Categorical Predictors

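In R, `lm()` dummy-codes factor (and character) columns automatically.

In Python:

* `statsmodels` formulas also dummy-code string and `category` columns automatically
* Wrap a variable in `C()` to force categorical treatment (for example, when categories are stored as numeric codes) or to choose the reference level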

```python
# Example
# price ~ size + C(neighborhood)
```


✏️ **Reflection:** What is the reference category, and how can you tell?
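
The sketch below shows how dummy coding appears in the coefficient names, and one way to change the reference level (roughly what `relevel()` does in R). The neighborhood labels and effect sizes are hypothetical, and the data are synthetic.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300

# Hypothetical neighborhood labels and price effects (synthetic data)
effects = {"east": 0.0, "north": 20.0, "south": 50.0}
hood = rng.choice(list(effects), size=n)
size = rng.uniform(800, 2500, size=n)
price = 50 + 0.1 * size + np.array([effects[h] for h in hood]) + rng.normal(0, 10, size=n)

toy = pd.DataFrame({"price": price, "size": size, "neighborhood": hood})
toy["neighborhood"] = toy["neighborhood"].astype("category")  # like factor() in R

# Default treatment coding: the first level (alphabetically, "east") is the reference
fit_default = smf.ols("price ~ size + C(neighborhood)", data=toy).fit()
print(fit_default.params)

# Choose a different reference level explicitly
fit_south = smf.ols(
    'price ~ size + C(neighborhood, Treatment(reference="south"))', data=toy
).fit()
print(fit_south.params)
```

The level that has no `[T.<level>]` coefficient in the output is the reference category. Every other level's coefficient is a difference from it.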

---

## 9. Prediction-Focused Linear Regression (scikit-learn)

### R Analogy

In R, prediction-focused workflows often use:

* `caret`
* `tidymodels`

### Python Workflow


```python
# TODO: Define predictors and outcome
X = ...
y = ...

# TODO: Create train/test split
X_train, X_test, y_train, y_test = ...
```

```python
# TODO: Fit a scikit-learn linear regression model
model_sk = ...
model_sk.fit(X_train, y_train)
```

```python
# TODO: Generate predictions
y_pred = ...
```
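
For orientation, the sketch below runs through the same split, fit, and predict steps on synthetic data. Everything about the data (the `toy` frame, coefficients, and noise) is invented for illustration, and the 75/25 split and `random_state=42` are arbitrary choices rather than course requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 400

# Synthetic stand-in data (hypothetical values, not housing.csv)
toy = pd.DataFrame({
    "size": rng.uniform(800, 2500, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "neighborhood": rng.choice(["east", "north", "south"], size=n),
})
toy["price"] = 50 + 0.1 * toy["size"] + 5 * toy["bedrooms"] + rng.normal(0, 10, size=n)

# scikit-learn has no formula interface, so dummy-code categoricals by hand;
# drop_first=True omits one level, mirroring the reference category in lm()
X = pd.get_dummies(toy[["size", "bedrooms", "neighborhood"]], drop_first=True, dtype=float)
y = toy["price"]

# Hold out 25% of rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(dict(zip(X.columns, model.coef_.round(3))))
```

Unlike `statsmodels`, the fitted object exposes only point estimates (`coef_`, `intercept_`). It reports no standard errors or p-values.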

## 10. Model Evaluation

```python
# TODO: Compute RMSE


# TODO: Compute R^2
```

✏️ **Reflection:** The statsmodels summary reports an in-sample R², but not test-set RMSE or test-set R². Why are these out-of-sample metrics not provided automatically?
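
As a check on what the two metrics measure, the sketch below computes them with `sklearn.metrics` and by hand. It uses four made-up observations, and the arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Four made-up observed values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

# Library versions
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)

# By hand: RMSE is the square root of the mean squared error;
# R^2 is 1 minus (residual sum of squares / total sum of squares)
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}, R^2 = {r2:.4f}")  # RMSE = 0.3536, R^2 = 0.9750
```

RMSE is in the units of the outcome, while R² is unitless. Both should be computed on the test set when the goal is to judge predictive performance.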

---

## 11. Comparison Table (Fill In)

| Feature          | R (`lm`) | statsmodels | scikit-learn |
| ---------------- | -------- | ----------- | ------------ |
| Primary goal     |          |             |              |
| Uses formulas    |          |             |              |
| Train/test split |          |             |              |
| Output           |          |             |              |

---

## 12. Active Learning (15–20 minutes)

**Pair activity:**

1. One student explains a regression coefficient in plain language
2. The other explains the difference between inference and prediction
3. Together, decide which tool you would use for:

   * Estimating housing price effects
   * Predicting housing prices

Be prepared to justify your choices.

---

## 13. Takeaways

* The linear regression model and the interpretation of its coefficients are the same in R and Python; only the syntax differs
* statsmodels ≈ `lm`
* scikit-learn ≈ prediction-focused R workflows
* Tool choice depends on analytical goals
