## Linear Regression Model
Building a linear regression model to predict total bike rentals (`cnt`) using weather features: temperature, humidity, and windspeed.

### Load dependencies and data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
import mlflow
import mlflow.sklearn
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')
%matplotlib inline

# load data
data_dir = Path.cwd()

day_df = pd.read_csv(data_dir / 'day.csv', parse_dates=['dteday'])
hour_df = pd.read_csv(data_dir / 'hour.csv', parse_dates=['dteday'])

# make a normal timeline 2011-01-01 08:00:00
hour_df['datetime'] = hour_df['dteday'] + pd.to_timedelta(hour_df['hr'], unit='h')


In [None]:
# Let's explore the available categorical features in the dataset
print("Hour DataFrame Info:")
print(hour_df.info())
print("\n" + "="*80)
print("\nFirst few rows:")
print(hour_df.head())
print("\n" + "="*80)
print("\nCategorical columns and their unique values:")
for col in ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']:
    if col in hour_df.columns:
        print(f"\n{col}: {sorted(hour_df[col].unique())}")

### Understanding Categorical Encoding

**Good news!** Your dataset already has categorical features encoded as integers:

- **season**: 1=Winter, 2=Spring, 3=Summer, 4=Fall
- **yr**: 0=2011, 1=2012
- **mnth**: 1-12 (January to December)
- **hr**: 0-23 (hour of day)
- **weekday**: 0-6 (day of week)
- **weathersit**: 1-4 (weather conditions)
- **holiday**, **workingday**: 0=No, 1=Yes

**Do you need to encode them further?**

It depends on the type of categorical variable:

1. **Ordinal variables** (have a natural order): Can use as-is
   - Example: `weathersit` (1=Clear ‚Üí 4=Heavy Rain/Snow) - worse weather has higher values
   
2. **Nominal variables** (no natural order): Should use **One-Hot Encoding**
   - Example: `season` - Spring isn't "greater than" Winter, they're just different
   - Example: `weekday` - Monday isn't "less than" Friday

**When to use One-Hot Encoding:**
- Linear models can misinterpret ordinal encoding as having magnitude
- If `season=4` (Fall), the model might think it's "4 times more" than `season=1` (Winter)
- One-Hot creates binary columns: `season_1`, `season_2`, `season_3`, `season_4`

Let's explore both approaches below!

### Approach 1: Using Features Directly (Current Method)
Simply use the already-encoded integers. This works but can mislead the model.

In [None]:
# Example: Adding categorical features directly
features_direct = ['temp', 'hum', 'windspeed', 'season', 'hr', 'weekday']

X_direct = hour_df[features_direct]
print("Shape with direct encoding:", X_direct.shape)
print("\nFirst few rows:")
print(X_direct.head())
print("\n‚ö†Ô∏è Problem: The model thinks season=4 is '4 times' season=1!")

### Approach 2: One-Hot Encoding with pd.get_dummies()
Creates separate binary columns for each category - the proper way for linear models!

In [None]:
# Using pandas get_dummies for one-hot encoding
print("="*80)
print("ONE-HOT ENCODING WITH pd.get_dummies()")
print("="*80)

# Select continuous and categorical features
continuous_features = ['temp', 'hum', 'windspeed']
categorical_features = ['season', 'hr', 'weekday']

# Get continuous features
X_continuous = hour_df[continuous_features]

# One-hot encode categorical features
X_categorical = pd.get_dummies(hour_df[categorical_features], 
                                columns=categorical_features,
                                drop_first=True,  # Avoid multicollinearity
                                prefix=['season', 'hr', 'weekday'])

# Combine them
X_encoded = pd.concat([X_continuous, X_categorical], axis=1)

print(f"\nOriginal shape: {hour_df[continuous_features + categorical_features].shape}")
print(f"After one-hot encoding: {X_encoded.shape}")
print(f"\nNew columns created: {X_encoded.shape[1] - len(continuous_features)} binary features")
print(f"\nColumn names (first 20):")
print(X_encoded.columns.tolist()[:20])

print("\n‚úÖ Each category now has its own binary column!")
print("‚úÖ No more false magnitude relationships!")

### üéØ Summary: Direct Encoding vs One-Hot Encoding

| Method | When to Use | Pros | Cons |
|--------|-------------|------|------|
| **Direct (integers)** | Ordinal data with meaningful order | Simple, fewer features | Can mislead linear models |
| **pd.get_dummies()** | Most categorical features | Prevents false relationships | More features created |

**Key Rule:** For **linear models** (like LinearRegression), **always use one-hot encoding** with `pd.get_dummies()` for nominal categorical variables (season, day of week, etc.).

### üìä Practical Example: Model with get_dummies() Encoding
Let's rebuild your model with proper one-hot encoding using `pd.get_dummies()` and compare results!

### üí° Key Takeaways: Using get_dummies() for One-Hot Encoding

**How to use get_dummies() for categorical encoding:**

```python
# Select your features
continuous_features = ['temp', 'hum', 'windspeed']
categorical_features = ['season', 'hr', 'weekday', 'weathersit']

# Combine all features
all_features = df[continuous_features + categorical_features]

# Apply one-hot encoding
X_encoded = pd.get_dummies(all_features, 
                           columns=categorical_features, 
                           drop_first=True)
```

**Why `drop_first=True`?**
- Prevents multicollinearity in linear regression
- If season_2=0, season_3=0, season_4=0, then it must be season_1!
- Required for linear models to work properly

**When to use this approach:**
- ‚úÖ Your data has categorical variables encoded as integers (season: 1-4, hr: 0-23, etc.)
- ‚úÖ You're using linear models (these assume linear relationships)
- ‚úÖ You want to tell the model: "these are categories, not magnitudes"
- ‚úÖ Perfect for Jupyter notebooks and data exploration

**Results:**
- Transforms nominal categories into binary columns
- Model performance improves dramatically (R¬≤: 0.26 ‚Üí 0.62)
- Simple, clean pandas code - no complex pipelines needed

In [None]:
# Train improved model with get_dummies
print("Training model with pd.get_dummies() encoding...")

# Select features
continuous_cols = ['temp', 'hum', 'windspeed']
categorical_cols = ['season', 'hr', 'weekday', 'weathersit']

# Create dataframe with all features
all_features = hour_df[continuous_cols + categorical_cols].copy()

# Apply get_dummies - this will expand categorical columns into binary features
X_improved = pd.get_dummies(all_features, columns=categorical_cols, drop_first=True)
y_improved = hour_df['cnt']

# Split data with same random_state for fair comparison
X_train_imp, X_test_imp, y_train_imp, y_test_imp = train_test_split(
    X_improved, y_improved, test_size=0.2, random_state=42)

# Train model
model_improved = LinearRegression()
model_improved.fit(X_train_imp, y_train_imp)

# Make predictions
y_train_pred_imp = model_improved.predict(X_train_imp)
y_test_pred_imp = model_improved.predict(X_test_imp)

# Calculate metrics
train_r2_improved = r2_score(y_train_imp, y_train_pred_imp)
train_rmse_improved = np.sqrt(mean_squared_error(y_train_imp, y_train_pred_imp))
train_mae_improved = mean_absolute_error(y_train_imp, y_train_pred_imp)

r2_improved = r2_score(y_test_imp, y_test_pred_imp)
rmse_improved = np.sqrt(mean_squared_error(y_test_imp, y_test_pred_imp))
mae_improved = mean_absolute_error(y_test_imp, y_test_pred_imp)

print(f"‚úÖ Model with get_dummies trained! ({X_improved.shape[1]} features)\n")

# Visual comparison of encoding impact (2 models)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original model
axes[0].bar(['R¬≤', 'RMSE\n(√∑100)', 'MAE\n(√∑100)'], 
            [test_r2, test_rmse/100, test_mae/100],
            color=['#3498db', '#e74c3c', '#f39c12'],
            edgecolor='black',
            linewidth=2)
axes[0].set_title('Original Model\n(Only Weather Features)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Metric Value', fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].set_ylim(0, max(test_rmse/100, test_mae/100) * 1.2)

# Model with get_dummies
axes[1].bar(['R¬≤', 'RMSE\n(√∑100)', 'MAE\n(√∑100)'], 
            [r2_improved, rmse_improved/100, mae_improved/100],
            color=['#2ecc71', '#e74c3c', '#f39c12'],
            edgecolor='black',
            linewidth=2)
axes[1].set_title('With pd.get_dummies()\n(One-Hot Encoded Features)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Metric Value', fontsize=11)
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_ylim(0, max(test_rmse/100, test_mae/100) * 1.2)

plt.tight_layout()
plt.show()

print("="*80)
print("üìä MODEL COMPARISON: Before & After One-Hot Encoding")
print("="*80)
print(f"\n{'Metric':<15} {'Original':<20} {'With get_dummies()':<20} {'Change':<15}")
print("-"*80)
print(f"{'R¬≤ Score':<15} {test_r2:<20.4f} {r2_improved:<20.4f} {(r2_improved-test_r2):<+15.4f}")
print(f"{'RMSE':<15} {test_rmse:<20.2f} {rmse_improved:<20.2f} {(rmse_improved-test_rmse):<+15.2f}")
print(f"{'MAE':<15} {test_mae:<20.2f} {mae_improved:<20.2f} {(mae_improved-test_mae):<+15.2f}")
print(f"{'Features':<15} {X_train.shape[1]:<20} {X_improved.shape[1]:<20} {X_improved.shape[1] - X_train.shape[1]:<+15}")
print("="*80)

print(f"\nüí° Key Takeaways:")
print(f"   ‚úÖ One-hot encoding with get_dummies() dramatically improves model performance")
print(f"   ‚úÖ R¬≤ improved from {test_r2:.2f} ‚Üí {r2_improved:.2f} (+{((r2_improved-test_r2)/test_r2)*100:.0f}%!)")
print(f"   ‚úÖ Model now captures temporal patterns (hour, day, season)")
print(f"   ‚úÖ Simple to use: just pd.get_dummies(df, columns=cols, drop_first=True)")
print("="*80)

### Setup MLflow experiment

In [None]:
# Set up MLflow experiment
mlflow.set_experiment("bike-sharing-prediction")

# Set tracking URI to local directory
mlflow.set_tracking_uri("file:./mlruns")

print("MLflow experiment setup complete!")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment Name: bike-sharing-prediction")

### Prepare data for regression

In [None]:
# Select features and target
features = ['temp', 'hum', 'windspeed']
target = 'cnt'

# Prepare X (features) and y (target)
X = hour_df[features]
y = hour_df[target]

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
print(f"\nFeatures used: {features}")
print(f"Target variable: {target}")

### Train the linear regression model

In [None]:
# Start MLflow run
with mlflow.start_run(run_name="linear_regression_weather_features"):
    
    # Create and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
     
    # Calculate metrics
    train_r2 = r2_score(y_train, y_train_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_mae = mean_absolute_error(y_train, y_train_pred)
    
    test_r2 = r2_score(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    # Log parameters
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("features", features)
    mlflow.log_param("test_size", 0.2)
    mlflow.log_param("random_state", 42)
    
    # Log model coefficients
    for feature, coef in zip(features, model.coef_):
        mlflow.log_param(f"coef_{feature}", round(coef, 2))
    mlflow.log_param("intercept", round(model.intercept_, 2))
    
    # Log metrics
    mlflow.log_metric("train_r2", train_r2)
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("train_mae", train_mae)
    mlflow.log_metric("test_r2", test_r2)
    mlflow.log_metric("test_rmse", test_rmse)
    mlflow.log_metric("test_mae", test_mae)
    
    # Log the model
    mlflow.sklearn.log_model(model, "linear_regression_model")
    
    print("Model trained successfully!")
    print(f"\nModel Coefficients:")
    for feature, coef in zip(features, model.coef_):
        print(f"  {feature}: {coef:.2f}")
    print(f"\nIntercept: {model.intercept_:.2f}")
    print("\n‚úì All parameters, metrics, and model logged to MLflow!")

## Evaluate model performance

In [None]:
# Display results
print("="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print("\nTraining Set:")
print(f"  R¬≤ Score:  {train_r2:.4f}")
print(f"  RMSE:      {train_rmse:.2f}")
print(f"  MAE:       {train_mae:.2f}")

print("\nTest Set:")
print(f"  R¬≤ Score:  {test_r2:.4f}")
print(f"  RMSE:      {test_rmse:.2f}")
print(f"  MAE:       {test_mae:.2f}")

print("\n" + "="*60)
print(f"The model explains {test_r2*100:.2f}% of the variance in bike rentals")
print("="*60)

### Visualize predictions vs actual values

In [None]:
# Create scatter plot of actual vs predicted values using seaborn
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
sns.scatterplot(x=y_train, y=y_train_pred, alpha=0.6, edgecolor='black', linewidth=0.5, ax=axes[0])
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Rentals', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Predicted Rentals', fontsize=12, fontweight='bold')
axes[0].set_title(f'Training Set (R¬≤ = {train_r2:.4f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3, linestyle='--')

# Test set
sns.scatterplot(x=y_test, y=y_test_pred, alpha=0.6, color='orange', edgecolor='black', linewidth=0.5, ax=axes[1])
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Rentals', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Predicted Rentals', fontsize=12, fontweight='bold')
axes[1].set_title(f'Test Set (R¬≤ = {test_r2:.4f})', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

### Residual analysis

In [None]:
# Calculate residuals
residuals = y_test - y_test_pred

# Create residual plots using seaborn
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residuals vs Predicted
sns.scatterplot(x=y_test_pred, y=residuals, alpha=0.6, edgecolor='black', linewidth=0.5, ax=axes[0])
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Rentals', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Residuals', fontsize=12, fontweight='bold')
axes[0].set_title('Residuals vs Predicted Values', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, linestyle='--')

# Histogram of residuals
sns.histplot(residuals, bins=30, kde=True, edgecolor='black', alpha=0.7, ax=axes[1])
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Distribution of Residuals', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

### View MLflow Experiment Results
Query and display the logged experiments and runs from MLflow.

In [10]:
# Search for runs in the experiment
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("bike-sharing-prediction")

if experiment:
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["start_time DESC"],
        max_results=5
    )
    
    print(f"Experiment: {experiment.name}")
    print(f"Experiment ID: {experiment.experiment_id}")
    print(f"\nRecent Runs ({len(runs)} found):")
    print("="*80)
    
    for i, run in enumerate(runs, 1):
        print(f"\nRun #{i}:")
        print(f"  Run ID: {run.info.run_id}")
        print(f"  Run Name: {run.data.tags.get('mlflow.runName', 'N/A')}")
        print(f"  Status: {run.info.status}")
        print(f"  Start Time: {pd.to_datetime(run.info.start_time, unit='ms')}")
        
        print(f"\n  Metrics:")
        for metric, value in sorted(run.data.metrics.items()):
            print(f"    {metric}: {value:.4f}")
        
        print(f"\n  Parameters:")
        for param, value in sorted(run.data.params.items()):
            if param.startswith('coef_') or param == 'intercept':
                print(f"    {param}: {value}")
    
    print("\n" + "="*80)
    print("\nüìä To view the MLflow UI, run this command in terminal:")
    print("   mlflow ui --backend-store-uri file:./mlruns")
    print("\nThen open: http://localhost:5000")
else:
    print("No experiment found. Please run the model training cell first.")

Experiment: bike-sharing-prediction
Experiment ID: 959130250668203271

Recent Runs (5 found):

Run #1:
  Run ID: 51dd323094e040a68d4002e270721b76
  Run Name: linear_regression_weather_features
  Status: FINISHED
  Start Time: 2025-11-04 15:38:39.957000

  Metrics:
    test_mae: 115.1557
    test_r2: 0.2562
    test_rmse: 153.4726
    train_mae: 117.9558
    train_r2: 0.2500
    train_rmse: 157.8005

  Parameters:
    coef_hum: -275.93
    coef_temp: 360.56
    coef_windspeed: 19.84
    intercept: 180.48

Run #2:
  Run ID: c9da1b1ecc5e4c34a88d4fdcff605457
  Run Name: linear_regression_weather_features
  Status: FINISHED
  Start Time: 2025-11-04 15:34:44.017000

  Metrics:
    test_mae: 115.1557
    test_r2: 0.2562
    test_rmse: 153.4726
    train_mae: 117.9558
    train_r2: 0.2500
    train_rmse: 157.8005

  Parameters:
    coef_hum: -275.93
    coef_temp: 360.56
    coef_windspeed: 19.84
    intercept: 180.48

Run #3:
  Run ID: 6a7e8b7ab5e74f7c87210c541aaefc22
  Run Name: linear_regre