# BA4e: k-Fold Cross-Validation for Regression (Bike Rentals Dataset)

In this module, we apply k-fold cross-validation to estimate how well a regression model generalizes to unseen data.
Unlike a single train-test split, k-fold CV trains and scores the model k times, holding out a different fold each time, and averages the results, so the estimate depends less on any one split.
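
To make the procedure concrete, here is a minimal sketch of the loop that k-fold CV performs: split, fit on the training folds, score on the held-out fold, repeat. It runs on synthetic data, which is an assumption made so that the example works without `bike_rentals.csv`; the names `X_demo`, `y_demo`, `kfold_demo`, and `fold_scores` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the bike rentals dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, test_idx in kfold_demo.split(X_demo):
    # Fit a fresh model on the k-1 training folds.
    model_demo = LinearRegression()
    model_demo.fit(X_demo[train_idx], y_demo[train_idx])

    # Score it on the single held-out fold.
    preds = model_demo.predict(X_demo[test_idx])
    fold_scores.append(r2_score(y_demo[test_idx], preds))

print("Per-fold R²:", np.round(fold_scores, 3))
print("Mean R²:", np.mean(fold_scores))
```

Every row is used for testing exactly once and for training in the other k-1 rounds. `cross_val_score` wraps this same loop in a single call.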

### Step 1: Load and prepare the dataset

In [1]:
import pandas as pd
df = pd.read_csv('bike_rentals.csv')
X = df.drop(columns=['cnt', 'instant'])
y = df['cnt']

### Step 2: Define linear regression model and cross-validation setup

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

### Step 3: Run 5-fold cross-validation using R² as the evaluation metric

In [3]:
scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print("R² scores for each fold:", scores)
print("Average R² score:", scores.mean())

R² scores for each fold: [0.82757579 0.74597252 0.81285982 0.72086789 0.80809516]
Average R² score: 0.7830742372597278


### Step 4: Interpretation

The individual fold scores show how much model performance varies with the data split. Here they range from about 0.72 to 0.83.
If scores vary widely, the model's performance depends heavily on which rows end up in the training and test folds.
If scores are consistently high, the model is likely generalizing well.

The average R² score (about 0.78 here) is a more robust estimate of performance on unseen data than the score from a single train-test split.
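
One way to put a number on "how much the scores vary" is to report the standard deviation alongside the mean. The sketch below reuses the five fold scores printed in Step 3; the variable name `reported_scores` is illustrative, and using the sample standard deviation (`ddof=1`) is one reasonable choice rather than the only one.

```python
import numpy as np

# Fold scores copied from the Step 3 output.
reported_scores = np.array(
    [0.82757579, 0.74597252, 0.81285982, 0.72086789, 0.80809516]
)

mean_r2 = reported_scores.mean()
std_r2 = reported_scores.std(ddof=1)  # sample standard deviation across folds

print(f"Mean R²: {mean_r2:.3f}")
print(f"Std of R² across folds: {std_r2:.3f}")
print(f"Range: {reported_scores.min():.3f} to {reported_scores.max():.3f}")
```

A small standard deviation relative to the mean suggests the estimate is stable across splits. With only five folds, treat it as a rough indicator rather than a precise confidence interval.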

### Step 5: Try these

- Try changing `n_splits` to 10 or 3. How does that affect the stability of scores?
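- Use `cross_val_score(..., scoring='neg_mean_squared_error')` to evaluate with a different metric. scikit-learn returns the MSE negated so that higher is always better; values closer to zero indicate a better fit.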
- Try cross-validating with a `DecisionTreeRegressor` or `Ridge` model and compare performance.
- Explore using `ShuffleSplit` or `RepeatedKFold` for more randomized evaluations.
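
As a starting point for these exercises, here is a hedged sketch that compares three models using `RepeatedKFold` and `neg_mean_squared_error`. It uses synthetic data so that it runs on its own (an assumption; swap in `X` and `y` from Step 1 to apply it to the bike rentals data). The hyperparameters `alpha=1.0` and `max_depth=5`, and the names `X_demo`, `y_demo`, `candidates`, and `results`, are illustrative choices, not tuned or prescribed values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the bike rentals dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=300)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=5, random_state=42),
}

# 5 folds repeated 3 times with different shuffles gives 15 scores per model.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

results = {}
for name, estimator in candidates.items():
    neg_mse = cross_val_score(
        estimator, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
    )
    rmse = np.sqrt(-neg_mse)  # flip the sign, then take the root to get RMSE
    results[name] = (rmse.mean(), rmse.std())
    print(f"{name}: RMSE = {rmse.mean():.3f} ± {rmse.std():.3f}")
```

Converting to RMSE puts the error back in the units of the target, which is often easier to interpret than R² or raw MSE. Because every model is scored on the same splits, differences between them are less likely to be artifacts of one lucky or unlucky split.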