# Modeling: Baselines → Linear Regression → Random Forest

This notebook builds and evaluates time-aware forecasting models for hourly bike-share demand.

Goals:
- Evaluate a persistence baseline (previous-hour prediction).
- Train and evaluate a linear model and a `RandomForestRegressor` using time-series cross-validation (`TimeSeriesSplit`).
- Save final models, metrics, and comparison plots to `results/`.
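
As a point of reference for the first goal, a persistence baseline can be sketched in a few lines. This is a minimal illustration, not necessarily how the notebook implements it: the helper name `persistence_forecast` and the toy counts are made up for the example and do not come from the dataset.

```python
import numpy as np

def persistence_forecast(y):
    """Pair each hour's observed value with the previous hour's value as its prediction."""
    y = np.asarray(y, dtype=float)
    # The first hour has no predecessor, so it is dropped from the evaluation.
    return y[1:], y[:-1]

# Toy hourly counts (illustrative values only, not real `cnt` data)
toy_cnt = [10, 12, 15, 15, 9]
y_true, y_pred = persistence_forecast(toy_cnt)

mae = float(np.mean(np.abs(y_true - y_pred)))
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}")
```

Because the prediction for hour t uses only the value at hour t-1, the baseline never sees future data, which makes it a fair floor for the time-series CV scores of the learned models.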

Dataset assumptions:
- Processed data at `../data/processed_hour.csv` (or raw upload at `/mnt/data/hour.csv`).
- Target column: `cnt`.
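
One way to honor the "processed file, or raw upload" assumption is to pick the first candidate path that actually exists. The sketch below is a hedged example: `first_existing` is a hypothetical helper, not part of the notebook, and how the notebook really selects its input is not shown here.

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate that is an existing file, else raise FileNotFoundError."""
    for path in map(Path, candidates):
        if path.is_file():
            return path
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"None of these files exist: {tried}")
```

With the paths defined in the imports cell, this could be called as `first_existing([PROC, RAW_UPLOAD])` and the result passed to `pd.read_csv`. Failing loudly when neither file exists avoids silently training on the wrong data.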

In [1]:
# Imports & paths
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import joblib
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings("ignore")

ROOT = Path.cwd().parent  # notebook is in /notebooks
RAW_UPLOAD = Path("/mnt/data/hour.csv")   # uploaded raw file (local)
PROC = ROOT / "data" / "processed_hour.csv"
RESULTS = ROOT / "results"
MODELS = ROOT / "models"
for p in (RESULTS, MODELS):
    p.mkdir(parents=True, exist_ok=True)

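def rmse(y_true, y_pred):
    # Root mean squared error. np.sqrt is used instead of `squared=False`,
    # which was deprecated and later removed in newer scikit-learn releases.
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mape_pct(y_true, y_pred):
    # Mean absolute percentage error, in percent. Hours with a zero target
    # use a denominator of 1 to avoid division by zero.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / np.where(y_true == 0, 1, y_true))) * 100.0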

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
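
This `ValueError` typically means that a compiled extension was built against a different NumPy binary interface than the NumPy currently installed. Which package triggered it is not visible from the message alone, so the package list below is an assumption. The sketch reads installed versions from package metadata without importing the packages, so it works even when the imports themselves fail.

```python
from importlib import metadata

def installed_versions(packages=("numpy", "pandas", "scikit-learn", "scipy")):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print(installed_versions())
```

If the reported versions look mismatched, reinstalling or upgrading the compiled packages in the same environment and then restarting the kernel is the usual remedy.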