# CME193 HW2: Scikit-learn Pipeline and PyTorch Model for Bike Sharing Demand

- **Semester**: Autumn 2025
- **Instructor**: Tianyu Du (`tianyudu@stanford.edu`)

### Deadline
- Deadline: 11/28/2025 (Friday).
- You may use your late days for this homework. The **final deadline** is 12/14/2025 (Sunday), which is the university's grading deadline. You **must** submit your homework before this date; no late submissions will be accepted after it.

### Grading rubric (guideline)
- 50% correctness of implementations (API, numerics, shapes)
- 20% on the performance of your PyTorch model. We have implemented a fairly weak Poisson regression model as a baseline in this homework, and **we expect your PyTorch model to outperform this baseline**. If you follow our instructions closely, you should be able to achieve this easily.
- 30% clarity of explanations/interpretations

### Overview of the Homework
In this homework, you will build a data processing and modeling pipeline for the Bike Sharing Demand dataset we explored in Lecture 7. The lecture covered a few model options for this dataset; here you will build a data-preprocessing pipeline and a PyTorch model for the prediction task. Since this is a relatively small dataset, you will not need a GPU.

### Submitting this homework
Rename your notebook to `CME193_HW2_Sklearn_Pipeline_<YOUR_NAME_AND_STANFORD_EMAIL>.ipynb` and submit to **Canvas**. Please keep outputs visible (do not clear them) so we can review your results.

### Expected time to complete this homework
About 1 hour. If you get stuck, email me (`tianyudu@stanford.edu`) or come to office hours.

### Academic integrity
Follow the course academic integrity policy. Collaboration is allowed; list collaborators. Write your own code.

### AI usage
You may consult AI tools (e.g., ChatGPT) for guidance, but you must write and understand your own code.

In [None]:
# Setup
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

# Core data manipulation and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports for model building and evaluation
from sklearn.model_selection import train_test_split, KFold, cross_validate, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.datasets import fetch_openml

# PyTorch imports for neural network modeling
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameter search helpers
from scipy.stats import loguniform

# Joblib for saving/loading models
import joblib

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

# Configure device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set matplotlib style for consistent plotting
plt.style.use("seaborn-v0_8-whitegrid")

## [Do not modify this section] Data Loading and Preprocessing
This block of code loads the dataset and does light preprocessing; please do not modify it.

Note: The printed counts of numeric vs categorical columns reflect raw dtypes. For modeling, we treat time-based integers (`hour`, `month`, `weekday`) as categorical features later, so those counts may differ from the dtype summary.

In [None]:
# 1) Load the dataset (Bike Sharing) and define X, y

bike = fetch_openml(name="Bike_Sharing_Demand", as_frame=True, parser="auto")
df = bike.frame.copy()
print("Bike Sharing shape:", df.shape)

# Normalize column names across OpenML versions
rename_map = {
    "cnt": "count",
    "atemp": "feel_temp",
    "hum": "humidity",
    "weathersit": "weather",
    "hr": "hour",
    "mnth": "month",
    "yr": "year",
    "dteday": "datetime",
}
df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

target_candidates = [c for c in ["count", "cnt"] if c in df.columns]
assert target_candidates, "Target column not found."
target = target_candidates[0]
if target != "count":
    df = df.rename(columns={target: "count"})
target = "count"

# Derive time-based features if needed
datetime_candidates = [c for c in ["datetime", "dteday"] if c in df.columns]
if datetime_candidates:
    dt = pd.to_datetime(df[datetime_candidates[0]])
    derived = {
        "year": dt.dt.year,
        "month": dt.dt.month,
        "day": dt.dt.day,
        "hour": dt.dt.hour,
        "weekday": dt.dt.weekday,
    }
    for name, values in derived.items():
        if name not in df.columns:
            df[name] = values

# Remove columns that would leak the target or are identifiers
cols_to_drop = set(datetime_candidates + ["casual", "registered", "instant", "index", "datetime", "dteday"])
cols_to_drop = [c for c in cols_to_drop if c in df.columns]
if cols_to_drop:
    df = df.drop(columns=cols_to_drop)

X = df.drop(columns=[target])
y = df[target]

print(f"Total features: {len(X.columns)}")
print("Numeric cols:", X.select_dtypes(include=[np.number]).shape[1],
      "| Categorical cols:", X.select_dtypes(exclude=[np.number]).shape[1])
print("Target summary:\n", y.describe())

# 2) Train/test split (chronological hold-out)
# Sort by time and take the last 20% as test set to simulate future prediction
sort_cols = [c for c in ["year", "month", "day", "hour"] if c in df.columns]
if sort_cols:
    df_sorted = df.sort_values(by=sort_cols).reset_index(drop=True)
else:
    df_sorted = df.reset_index(drop=True)

X_sorted = df_sorted.drop(columns=[target])
y_sorted = df_sorted[target]

split_idx = int(len(df_sorted) * 0.8)
X_train, X_test = X_sorted.iloc[:split_idx], X_sorted.iloc[split_idx:]
y_train, y_test = y_sorted.iloc[:split_idx], y_sorted.iloc[split_idx:]

print("Train:", X_train.shape, "Test:", X_test.shape)

# 3) Feature typing: numeric vs categorical (no cyclical encoding)
categorical_features = ["season", "weather", "holiday", "workingday", "weekday", "hour", "month"]
numeric_features = [c for c in X_train.columns if c not in categorical_features]

print(f"Categorical features: {categorical_features}")
print(f"Numeric features: {numeric_features}")

## [TODO] Build the preprocessing pipeline
Use `ColumnTransformer` together with `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` (see Lecture 7) to fill in the `transformers` list below.
You may choose the imputation strategies, the scaling, and which columns are treated as categorical vs numeric.
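
As a minimal sketch of how `ColumnTransformer`, `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` can fit together, the example below preprocesses a tiny made-up DataFrame. The toy columns (`temp`, `hour`), the `median` and `most_frequent` imputation strategies, and `handle_unknown="ignore"` are illustrative assumptions, not the required design for this homework.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one numeric column with a missing value,
# and one integer column that we choose to treat as categorical.
toy = pd.DataFrame({
    "temp": [10.0, np.nan, 30.0, 20.0],
    "hour": [0, 8, 8, 17],
})

# Numeric branch: fill missing values, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: fill missing values, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, ["temp"]),
        ("cat", categorical_pipe, ["hour"]),
    ],
    remainder="drop",
)

toy_encoded = toy_preprocessor.fit_transform(toy)
print(toy_encoded.shape)  # 1 scaled numeric column + 3 one-hot columns
```

Each entry in `transformers` is a `(name, transformer, columns)` triple, so the same pattern extends to longer column lists.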

In [None]:
# Define preprocessing pipeline (no cyclical transforms)
transformers = [
    # TODO: Add your transformers here; refer to Lecture 7 for details.
]

preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')

## [TODO] Briefly explain your preprocessing design (2–3 sentences)
- What imputers and scalers did you choose and why?
- Which columns are categorical vs numeric, and why?
- Optional: any alternatives you considered.

## [Do not modify this section] Baseline Model: Ridge and Poisson Regression (Covered in Lecture)
We now establish two baseline models: Ridge regression (i.e., linear regression with L2 regularization, which you implemented in the previous homework) and Poisson regression (covered in Lecture 7).

These two models serve as weak baselines for the PyTorch model you will be building in this homework. You do not need to tune these baselines; just run them.

**Grading Note**:
<mark>
We now fit both models and evaluate them on the test set. Your PyTorch model is expected to outperform both baselines, which serves as a check that it is working correctly.
</mark>

In [None]:
from sklearn.linear_model import PoissonRegressor, Ridge


ridge_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", Ridge(max_iter=50000)),
])

poisson_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", PoissonRegressor(alpha=1.0, max_iter=1000))
])

In [None]:
# Fit and evaluate Ridge
ridge_pipe.fit(X_train, y_train)
y_pred_ridge = ridge_pipe.predict(X_test)
ridge_mae = mean_absolute_error(y_test, y_pred_ridge)
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
ridge_r2 = r2_score(y_test, y_pred_ridge)

print(f"Ridge Regression Test MAE: {ridge_mae:,.1f}")
print(f"Ridge Regression Test RMSE: {ridge_rmse:,.1f}")
print(f"Ridge Regression Test R²: {ridge_r2:.3f}")

# Fit and evaluate Poisson
poisson_pipeline.fit(X_train, y_train)
y_pred_poisson = poisson_pipeline.predict(X_test)
poisson_mae = mean_absolute_error(y_test, y_pred_poisson)
poisson_rmse = np.sqrt(mean_squared_error(y_test, y_pred_poisson))
poisson_r2 = r2_score(y_test, y_pred_poisson)

print(f"Poisson Regression Test MAE: {poisson_mae:,.1f}")
print(f"Poisson Regression Test RMSE: {poisson_rmse:,.1f}")
print(f"Poisson Regression Test R²: {poisson_r2:.3f}")

## PyTorch Residual Network

This section mirrors the modeling pipeline with PyTorch. The data prep and training loop are provided; focus your effort on experimenting with the shallow residual network architecture and its hyperparameters. Fill in the TODOs to customize the model depth, hidden sizes, and residual connections.

Please refer to Lecture 8 for more details on PyTorch.

## [Do not modify this section] PyTorch train/evaluate helper (dataloader + training loop)
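
The helper below stops training early based on `patience` and `min_delta`. As a rough illustration of that rule in isolation, the sketch here replays it on a made-up list of validation losses. The function name `early_stopping_epoch` and the loss values are hypothetical; they only mirror the comparison used in the training loop.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=1e-3):
    """Return (best_epoch, stop_epoch) as 0-based indices."""
    best_val = float("inf")
    best_epoch = -1
    stop_epoch = len(val_losses) - 1  # default: all epochs were run
    for epoch, val_loss in enumerate(val_losses):
        # Count an improvement only if it beats the best loss by more than min_delta.
        if val_loss + min_delta < best_val:
            best_val = val_loss
            best_epoch = epoch
        # Stop once `patience` epochs have passed since the last improvement.
        if best_epoch != -1 and (epoch - best_epoch) >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch

# Made-up validation losses: improvement stalls after the third epoch.
losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.4]
result = early_stopping_epoch(losses, patience=2)
print(result)  # (2, 4)
```

With `patience=2`, training stops at index 4 and never reaches the lower loss at index 5, which is the trade-off a small `patience` makes.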

In [None]:
def train_and_evaluate_pytorch_model(model, X_train, y_train, X_test, y_test, preprocessor,
                                      epochs=25, lr=5e-4, weight_decay=1e-5, batch_size=256,
                                      patience=10, min_delta=1e-3):
    """
    Train a PyTorch model from scratch and evaluate on test set.

    Args:
        model: PyTorch model instance
        X_train: Training features (pandas DataFrame)
        y_train: Training labels (pandas Series)
        X_test: Test features (pandas DataFrame)
        y_test: Test labels (pandas Series)
        preprocessor: sklearn preprocessor (fitted inside on X_train)
        epochs: Number of training epochs
        lr: Learning rate
        weight_decay: L2 regularization strength
        batch_size: Batch size for training
        patience: Number of epochs with no improvement before early stopping
        min_delta: Minimum improvement in validation loss to reset patience

    Returns:
        results: Dictionary with loss history, metrics, and test predictions
    """
    # Recreate the preprocessing pipeline to obtain a dense design matrix for PyTorch
    preprocessor.fit(X_train, y_train)

    X_train_enc = preprocessor.transform(X_train)
    X_val_enc = preprocessor.transform(X_test)


    def to_tensor(matrix):
        if hasattr(matrix, "toarray"):
            matrix = matrix.toarray()
        return torch.tensor(matrix, dtype=torch.float32)


    X_train_tensor = to_tensor(X_train_enc)
    X_val_tensor = to_tensor(X_val_enc)
    y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)
    y_val_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32).unsqueeze(1)

    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # Training loop
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.MSELoss()
    history = {"train_loss": [], "val_loss": []}
    best_val = float("inf")
    best_state = None
    best_epoch = -1
    epochs_ran = 0

    for epoch in range(epochs):
        model.train()
        running_train = 0.0
        for xb, yb in train_loader:
            xb = xb.to(device)
            yb = yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            running_train += loss.item() * xb.size(0)
        train_loss = running_train / len(train_loader.dataset)

        model.eval()
        running_val = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb = xb.to(device)
                yb = yb.to(device)
                preds = model(xb)
                val_loss = criterion(preds, yb)
                running_val += val_loss.item() * xb.size(0)
        val_loss = running_val / len(val_loader.dataset)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        epochs_ran = epoch + 1

        if val_loss + min_delta < best_val:
            best_val = val_loss
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            best_epoch = epoch

        if (epoch + 1) % 5 == 0 or epoch == 0:
            status = "*" if epoch == best_epoch else ""
            print(f"Epoch {epoch + 1:02d} | train MSE: {train_loss:.2f} | val MSE: {val_loss:.2f} {status}")

        if best_epoch != -1 and (epoch - best_epoch) >= patience:
            print(f"Early stopping triggered at epoch {epoch + 1:02d} (best epoch {best_epoch + 1:02d})")
            break

    if best_state is not None:
        model.load_state_dict(best_state)

    # Evaluate on test set
    model.eval()
    with torch.no_grad():
        val_preds = model(X_val_tensor.to(device)).cpu().numpy().squeeze()


    torch_rmse = np.sqrt(mean_squared_error(y_test, val_preds))
    torch_mae = mean_absolute_error(y_test, val_preds)
    torch_r2 = r2_score(y_test, val_preds)

    metrics = {"rmse": torch_rmse, "mae": torch_mae, "r2": torch_r2}

    print(f"\nPyTorch Test RMSE: {torch_rmse:,.1f}")
    print(f"PyTorch Test MAE:  {torch_mae:,.1f}")
    print(f"PyTorch Test R²:  {torch_r2:.3f}")

    results = {
        "history": history,
        "metrics": metrics,
        "predictions": val_preds,
        "best_epoch": best_epoch + 1 if best_epoch != -1 else epochs_ran,
        "epochs_ran": epochs_ran,
    }
    return results

# [TODO] Define the PyTorch Neural Network

Implement a shallow residual neural network for regression using `torch.nn`. 

As covered in the lecture (Lecture 8: PyTorch models), a minimal PyTorch model is defined by subclassing `nn.Module`, and implementing both the `__init__` and `forward` methods.

1. The `__init__` method is used to define different components of the model, such as the layers, activation functions, etc.
2. The `forward` method is used to define the forward pass of the model, i.e., how the input data is transformed through the model and how the predictions are computed.

Outside the model definition, we also need a dataloader, an optimizer, and a training loop. To keep this homework simple, these are provided in the `train_and_evaluate_pytorch_model` function, so you do not need to implement them. You are welcome to read the code to understand how they work.

## Mathematical forward pass
We begin by defining the mathematical structure of the network you will implement in this exercise: three hidden linear layers with a residual connection, followed by a linear output layer.

$$
\begin{aligned}
\mathbf{h}_1 &= \mathrm{ReLU}(\mathbf{W}_1\,\mathbf{x} + \mathbf{b}_1) \\
\mathbf{r}   &= \mathrm{ReLU}(\mathbf{W}_2\,\mathbf{h}_1 + \mathbf{b}_2) \\
\mathbf{h}_2 &= \mathrm{ReLU}(\mathbf{W}_3\,\mathbf{r} + \mathbf{b}_3 + \mathbf{h}_1) \quad\text{(residual add)}\\
\mathbf{d}   &= \mathrm{Dropout}_p(\mathbf{h}_2) \quad\text{(zero-out activations with prob. }p\text{ during training, this does not affect dimensions)}\\
\hat{y}       &= \mathbf{w}_4^{\top}\,\mathbf{d} + b_4
\end{aligned}
$$

Let $d_\text{input}$ denote the dimension of the input data $\mathbf{x}$, and $d_\text{hidden}$ denote the dimension of the hidden layers. 
The components of this model and their trainable parameters are the following:
1. Weights and biases for the first layer, mapping from the raw input ($\mathbf{x}$) to the first set of hidden units ($\mathbf{h}_1$), $\mathbf{W}_1 \in \mathbb{R}^{d_\text{hidden} \times d_\text{input}}$, $\mathbf{b}_1 \in \mathbb{R}^{d_\text{hidden}}$.
2. There are no learnable parameters for the ReLU activation function.
3. Weights and biases for the second layer, mapping from the first set of hidden units ($\mathbf{h}_1$) to the residual component ($\mathbf{r}$), $\mathbf{W}_2 \in \mathbb{R}^{d_\text{hidden} \times d_\text{hidden}}$, $\mathbf{b}_2 \in \mathbb{R}^{d_\text{hidden}}$.
4. Weights and biases for the third layer, mapping from the residual component ($\mathbf{r}$) to the second set of hidden units ($\mathbf{h}_2$), $\mathbf{W}_3 \in \mathbb{R}^{d_\text{hidden} \times d_\text{hidden}}$, $\mathbf{b}_3 \in \mathbb{R}^{d_\text{hidden}}$.
5. A dropout layer with rate $p$ is applied before the final output layer. Dropout has no trainable parameters; during training it multiplies $\mathbf{h}_2$ by a Bernoulli mask (keep probability $1-p$) to discourage co-adaptation of hidden units (PyTorch's `nn.Dropout` also rescales the kept activations by $1/(1-p)$), while during evaluation it passes activations through unchanged.
6. Finally, the output layer maps the (possibly dropped-out) hidden activations to the prediction ($\hat{y}$), with $\mathbf{w}_4 \in \mathbb{R}^{d_\text{hidden}}$ and $b_4 \in \mathbb{R}$.
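
As a quick worked check of these dimensions, the sketch below counts trainable parameters for arbitrary illustrative sizes ($d_\text{input}=10$, $d_\text{hidden}=4$) and compares the closed-form count with what `nn.Linear` reports. The list of layers is a bookkeeping device only, not the model you are asked to implement.

```python
import torch
from torch import nn

d_input, d_hidden = 10, 4  # arbitrary sizes for illustration

# Closed-form count: each linear map has (out * in) weights plus `out` biases.
expected = (
    (d_hidden * d_input + d_hidden)      # W_1, b_1
    + (d_hidden * d_hidden + d_hidden)   # W_2, b_2
    + (d_hidden * d_hidden + d_hidden)   # W_3, b_3
    + (d_hidden + 1)                     # w_4, b_4
)

layers = [
    nn.Linear(d_input, d_hidden),
    nn.Linear(d_hidden, d_hidden),
    nn.Linear(d_hidden, d_hidden),
    nn.Linear(d_hidden, 1),
]
counted = sum(p.numel() for layer in layers for p in layer.parameters())

print(expected, counted)  # 89 89
```

ReLU and dropout contribute nothing to the count, since neither has learnable parameters.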

You could create the trainable parameters directly as PyTorch tensors (e.g., using `nn.Parameter`). A simpler way (as we discussed in the lecture) is to use `nn.Linear` to create the layers, followed by the `nn.ReLU` activation function. For regularization, you can also insert `nn.Dropout(p)` modules to randomly zero out activations during training (e.g., `p=0.1`).

To create a linear function, you can use `linear_layer = nn.Linear(input_dim, output_dim)` (by default, the bias/intercept is included). Calling this module with input $\mathbf{x}$ will yield the following:

$$
\texttt{linear\_layer}(\mathbf{x}) = \mathbf{W}\,\mathbf{x} + \mathbf{b}
$$

where $\mathbf{W} \in \mathbb{R}^{d_\text{output} \times d_\text{input}}$ and $\mathbf{b} \in \mathbb{R}^{d_\text{output}}$.
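
The following sketch checks this on a small batch. PyTorch stores `weight` with shape `(output_dim, input_dim)` and applies the layer to each row of a batch, so for a batch matrix `X` of shape `(batch, input_dim)` the result equals `X @ W.T + b`. The sizes used here are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
linear_layer = nn.Linear(3, 2)   # input_dim=3, output_dim=2
X = torch.randn(5, 3)            # batch of 5 inputs, one per row

print(linear_layer.weight.shape)  # torch.Size([2, 3])
print(linear_layer.bias.shape)    # torch.Size([2])

# Reproduce the layer by hand: one row of output per row of input.
manual = X @ linear_layer.weight.T + linear_layer.bias
matches = torch.allclose(linear_layer(X), manual, atol=1e-6)
print(matches)
```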

## Implementation guide
Looking closely at the model definition above, there are four linear mappings in total (i.e., four pairs of weights and biases), so we create four `nn.Linear` modules in the `__init__` method with the corresponding input and output dimensions.

In `__init__` (define modules)
  - `self.fc1 = nn.Linear(input_dim, hidden_dim)`     ← for $\mathbf{W}_1,\mathbf{b}_1$
  - `self.fc2 = nn.Linear(hidden_dim, hidden_dim)`    ← for $\mathbf{W}_2,\mathbf{b}_2$
  - `self.fc3 = nn.Linear(hidden_dim, hidden_dim)`    ← for $\mathbf{W}_3,\mathbf{b}_3$
  - `self.out = nn.Linear(hidden_dim, 1)`             ← for $\mathbf{w}_4,b_4$
  - `self.act = nn.ReLU()`                            ← for ReLU activation
  - `self.drop = nn.Dropout(p=dropout)`               ← applied once after the residual block

In `forward(self, x)`, follow the mathematical structure step by step:
  - `h1 = self.act(self.fc1(x))`                      ← computes $\mathbf{h}_1$
  - `r  = self.act(self.fc2(h1))`                     ← computes $\mathbf{r}$
  - `h2 = self.act(self.fc3(r) + h1)`                 ← computes $\mathbf{h}_2$ with residual add
  - `d  = self.drop(h2)`                              ← applies dropout to obtain $\mathbf{d}$
  - `return self.out(d)`                              ← computes $\hat{y}$
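
Before training, it can help to confirm that the model returns one prediction per row with shape `(batch, 1)`. The provided training loop reshapes targets to `(batch, 1)`, and an output of shape `(batch,)` would be broadcast inside `nn.MSELoss` (PyTorch only issues a warning) rather than raise an error. The sketch below uses a stand-in `nn.Sequential` model and a hypothetical helper named `check_output_shape`; substitute your own model in practice.

```python
import torch
from torch import nn

def check_output_shape(model: nn.Module, input_dim: int, batch_size: int = 4) -> tuple:
    """Run a dummy batch through `model` and return the output shape."""
    model.eval()  # disable dropout for a deterministic check
    model_device = next(model.parameters()).device  # keep input on the model's device
    with torch.no_grad():
        out = model(torch.zeros(batch_size, input_dim, device=model_device))
    return tuple(out.shape)

# Stand-in model for illustration only (not the residual network from this homework).
stand_in = nn.Sequential(nn.Linear(6, 8), nn.ReLU(), nn.Linear(8, 1))
shape = check_output_shape(stand_in, input_dim=6)
print(shape)  # (4, 1)
```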

## Training hint
- Keep the output layer linear: do not apply `ReLU` (or any other activation) after `self.out`.
- Dropout helps regularize the network by randomly dropping hidden units; try `p` in the 0.05–0.2 range if you observe overfitting.
- Use `model.train()` during training and `model.eval()` during evaluation to enable/disable dropout as intended.
- Start with `hidden_dim` in {64, 128}; `dropout=0.1` works well.
- Train with `train_and_evaluate_pytorch_model(model, X_train, y_train, X_test, y_test, preprocessor, ...)`. 
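
To see what `model.train()` and `model.eval()` change, the sketch below applies a standalone `nn.Dropout` to a vector of ones. In training mode PyTorch zeroes each entry with probability `p` and rescales the survivors by `1/(1-p)`; in evaluation mode the layer is the identity. The value `p=0.5` is chosen only to make the effect easy to see.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)
# Surviving entries are scaled by 1 / (1 - p) = 2, so values are either 0 or 2.
train_values = sorted(train_out.unique().tolist())

drop.eval()
eval_out = drop(x)
unchanged = torch.equal(eval_out, x)  # evaluation mode leaves the input untouched

print(train_values, unchanged)
```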

In [None]:
# You do NOT need to modify this section.
# Fit the preprocessor on the training data to learn scaling parameters
preprocessor.fit(X_train, y_train)
# Determine the number of features after preprocessing (for the neural network input layer)
input_dim = preprocessor.transform(X_train).shape[1]

# [TODO] Implement the `__init__` and `forward` methods in the `BikeDemandResNet` class below.

In [None]:
class BikeDemandResNet(nn.Module):
    """Three-layer MLP with a simple residual connection."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, dropout: float = 0.1) -> None:
        super().__init__()
        # TODO: (student) implement the `__init__` method here.
        raise NotImplementedError("You need to implement the `__init__` method.")  # comment this out after you have implemented the method


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # TODO: (student) implement the forward pass here.
        raise NotImplementedError("You need to implement the `forward` method.")  # comment this out after you have implemented the method
        # return ...

In [None]:
# Instantiate the model and move it to the selected device
torch_resnet_model = BikeDemandResNet(input_dim=input_dim, hidden_dim=128).to(device)

In [None]:

resnet_run = train_and_evaluate_pytorch_model(
    torch_resnet_model,
    X_train,
    y_train,
    X_test,
    y_test,
    preprocessor,
    epochs=80,
    lr=5e-4,
    batch_size=256,
    patience=8,
    min_delta=5e-4,
)
print("Best epoch:", resnet_run["best_epoch"], "of", resnet_run["epochs_ran"], "epochs run")
print("Metrics:", resnet_run["metrics"])


## [TODO] Run a few quick experiments and summarize (2-4 sentences)
Try varying one or two hyperparameters and report what changed:
- learning rate (e.g., 1e-3, 5e-4, 1e-4)
- hidden_dim (e.g., 64, 128)
- epochs or patience (e.g., 40 vs 80)

Optional template to record results:

| setting | best epoch | RMSE | MAE | R² |
|---|---:|---:|---:|---:|
| baseline (lr=5e-4, hidden=128) | | | | |
| change 1 | | | | |
| change 2 | | | | |

Write a short takeaway on what helped or didn’t.
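
If you prefer to fill in the table programmatically, one option is a small formatter that turns the dictionary returned by `train_and_evaluate_pytorch_model` into a table row. The helper name `format_result_row` is hypothetical, and the numbers in the example dictionary are placeholders with no relation to real results.

```python
def format_result_row(setting: str, run: dict) -> str:
    """Format one experiment as a markdown table row: setting, best epoch, RMSE, MAE, R²."""
    m = run["metrics"]
    return (
        f"| {setting} | {run['best_epoch']} | "
        f"{m['rmse']:.1f} | {m['mae']:.1f} | {m['r2']:.3f} |"
    )

# Placeholder dictionary with the same keys the training helper returns.
# The values are dummy numbers, not actual results.
dummy_run = {"best_epoch": 1, "metrics": {"rmse": 0.0, "mae": 0.0, "r2": 0.0}}
row = format_result_row("baseline (lr=5e-4, hidden=128)", dummy_run)
print(row)
```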