# Predict Bike Sharing Demand with AutoGluon

## Introduction

In this project, we'll tackle the Bike Sharing Demand competition from Kaggle using AutoGluon, a powerful AutoML library. The goal is to predict bike rental demand (count of total rentals) based on date, time, and weather features.

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded.

Predicting bike sharing demand is highly relevant to related problems companies encounter, such as Uber, Lyft, and DoorDash. Accurate demand forecasting helps businesses prepare for spikes in their services and improves customer experience by limiting delays.

We'll use AutoGluon's Tabular Prediction to fit data from CSV files provided by the competition. Through several iterations of model training and optimization, we'll aim to achieve a competitive score on the Kaggle leaderboard.

## Step 1: Install and Import Required Libraries

In [None]:
# Install required packages if not already installed
# !pip install autogluon
# !pip install kaggle
# !pip install matplotlib seaborn pandas numpy

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
from pathlib import Path

from autogluon.tabular import TabularDataset, TabularPredictor

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('ggplot')
sns.set_style('whitegrid')

## Step 2: Download and Explore the Dataset

First, we need to download the competition data from Kaggle. If you haven't already set up your Kaggle API credentials, you'll need to do that first by creating a kaggle.json file in the ~/.kaggle/ directory.

In [None]:
# Create data directory if it doesn't exist
data_dir = Path('../BikeSharingSystem/data')
data_dir.mkdir(exist_ok=True)

# Download data using Kaggle API (uncomment if needed)
# !kaggle competitions download -c bike-sharing-demand -p {data_dir}
# !unzip -o {data_dir}/bike-sharing-demand.zip -d {data_dir}

In [None]:
# For this notebook, we'll assume the data files are already downloaded
# Define file paths
train_path = data_dir / 'train.csv'
test_path = data_dir / 'test.csv'

# If files don't exist, provide instructions
if not train_path.exists() or not test_path.exists():
    print("Data files not found. Please download from Kaggle and place in the data directory.")
    print("You can use the Kaggle API with: kaggle competitions download -c bike-sharing-demand")
else:
    print("Data files found. Ready to proceed.")

### 2.1 Load and Examine the Data

Let's load the training and test datasets and examine their structure.

In [None]:
# Load the datasets
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

# Display basic information about the training data
print("Training data shape:", train.shape)
print("\nTraining data info:")
train.info()

# Display the first few rows of the training data
print("\nFirst 5 rows of training data:")
train.head()

In [None]:
# Display basic information about the test data
print("Test data shape:", test.shape)
print("\nTest data info:")
test.info()

# Display the first few rows of the test data
print("\nFirst 5 rows of test data:")
test.head()

### 2.2 Understand the Features

Let's understand what each feature represents:

- **datetime**: hourly date + timestamp  
- **season**: 1 = spring, 2 = summer, 3 = fall, 4 = winter 
- **holiday**: whether the day is a holiday or not
- **workingday**: whether the day is neither a weekend nor holiday
- **weather**: 
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- **temp**: temperature in Celsius
- **atemp**: "feels like" temperature in Celsius
- **humidity**: relative humidity
- **windspeed**: wind speed
- **casual**: number of non-registered user rentals (only in train)
- **registered**: number of registered user rentals (only in train)
- **count**: number of total rentals (target variable, only in train)

### 2.3 Statistical Summary of the Data

In [None]:
# Statistical summary of the training data
train.describe()

### 2.4 Visualize the Target Variable Distribution

In [None]:
# Plot the distribution of the target variable (count)
plt.figure(figsize=(10, 6))
sns.histplot(train['count'], bins=50, kde=True)
plt.title('Distribution of Bike Rentals (Count)', fontsize=14)
plt.xlabel('Number of Rentals', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

# Also plot the log-transformed target, which might be more normally distributed
plt.figure(figsize=(10, 6))
sns.histplot(np.log1p(train['count']), bins=50, kde=True)
plt.title('Distribution of Log-Transformed Bike Rentals', fontsize=14)
plt.xlabel('Log(Number of Rentals + 1)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

The distribution of the target variable is right-skewed, which is common for count data. The log-transformed version appears more normally distributed, suggesting that a log transformation might be beneficial for modeling.

### 2.5 Explore Relationships Between Features and Target

In [None]:
# Convert datetime to datetime type
train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])

# Extract hour from datetime
train['hour'] = train['datetime'].dt.hour
test['hour'] = test['datetime'].dt.hour

# Plot bike rentals by hour
plt.figure(figsize=(12, 6))
sns.boxplot(x='hour', y='count', data=train)
plt.title('Bike Rentals by Hour of Day', fontsize=14)
plt.xlabel('Hour of Day', fontsize=12)
plt.ylabel('Number of Rentals', fontsize=12)
plt.xticks(range(0, 24))
plt.show()

In [None]:
# Extract day of week from datetime
train['dayofweek'] = train['datetime'].dt.dayofweek
test['dayofweek'] = test['datetime'].dt.dayofweek

# Plot bike rentals by day of week
plt.figure(figsize=(10, 6))
sns.boxplot(x='dayofweek', y='count', data=train)
plt.title('Bike Rentals by Day of Week', fontsize=14)
plt.xlabel('Day of Week (0=Monday, 6=Sunday)', fontsize=12)
plt.ylabel('Number of Rentals', fontsize=12)
plt.xticks(range(0, 7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()

In [None]:
# Plot bike rentals by season
plt.figure(figsize=(10, 6))
sns.boxplot(x='season', y='count', data=train)
plt.title('Bike Rentals by Season', fontsize=14)
plt.xlabel('Season (1=Spring, 2=Summer, 3=Fall, 4=Winter)', fontsize=12)
plt.ylabel('Number of Rentals', fontsize=12)
plt.xticks(range(0, 4), ['Spring', 'Summer', 'Fall', 'Winter'])
plt.show()

In [None]:
# Plot bike rentals by weather
plt.figure(figsize=(10, 6))
sns.boxplot(x='weather', y='count', data=train)
plt.title('Bike Rentals by Weather Condition', fontsize=14)
plt.xlabel('Weather Condition', fontsize=12)
plt.ylabel('Number of Rentals', fontsize=12)
plt.xticks(range(0, 4), ['Clear', 'Mist/Cloudy', 'Light Rain/Snow', 'Heavy Rain/Snow'])
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = train.drop('datetime', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix', fontsize=14)
plt.show()

### 2.6 Key Insights from Data Exploration

From our exploratory data analysis, we can draw several insights:

1. **Hourly Patterns**: There are clear peaks in bike rentals during commuting hours (8-9 AM and 5-6 PM), suggesting that many people use bikes for commuting to and from work.

2. **Day of Week Patterns**: Weekdays show different patterns compared to weekends, with weekdays having more pronounced morning and evening peaks.

3. **Seasonal Variation**: Bike rentals are higher in summer and fall, and lower in winter and spring, likely due to weather conditions.

4. **Weather Impact**: Clear weather conditions (1) have higher rental counts compared to more severe weather conditions (3 and 4).

5. **Temperature Correlation**: There's a positive correlation between temperature and bike rentals, indicating people are more likely to rent bikes in warmer weather.

6. **Registered vs. Casual Users**: Registered users make up a larger portion of the rentals compared to casual users.

These insights will guide our feature engineering process to create meaningful features that capture these patterns.

## Step 3: Feature Engineering

Based on our exploratory analysis, we'll create additional features to help the model capture important patterns in the data.

In [None]:
def feature_engineering(df):
    """Apply feature engineering to the dataframe"""
    # Make a copy to avoid modifying the original dataframe
    df = df.copy()
    
    # Convert datetime to datetime type if it's not already
    if df['datetime'].dtype != 'datetime64[ns]':
        df['datetime'] = pd.to_datetime(df['datetime'])
    
    # Extract datetime components
    df['hour'] = df['datetime'].dt.hour
    df['day'] = df['datetime'].dt.day
    df['month'] = df['datetime'].dt.month
    df['year'] = df['datetime'].dt.year
    df['dayofweek'] = df['datetime'].dt.dayofweek
    
    # Create time of day categories
    df['time_of_day'] = df['hour'].apply(lambda x: 
                                        'morning' if 6 <= x < 12 else
                                        'afternoon' if 12 <= x < 18 else
                                        'evening' if 18 <= x < 22 else
                                        'night')
    
    # Create rush hour flag (7-9 AM and 4-7 PM on weekdays)
    df['is_rush_hour'] = (((df['hour'] >= 7) & (df['hour'] <= 9) | 
                          (df['hour'] >= 16) & (df['hour'] <= 19)) & 
                          (df['dayofweek'] < 5)).astype(int)
    
    # Create weekend flag
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
    
    # Combine weather and season for more context
    df['weather_season'] = df['weather'].astype(str) + '_' + df['season'].astype(str)
    
    # Create temperature bins
    df['temp_bin'] = pd.cut(df['temp'], bins=[-20, 0, 10, 20, 30, 50], 
                           labels=['very_cold', 'cold', 'mild', 'warm', 'hot'])
    
    # Create humidity bins
    df['humidity_bin'] = pd.cut(df['humidity'], bins=[0, 25, 50, 75, 100], 
                              labels=['low', 'medium', 'high', 'very_high'])
    
    # Create windspeed bins
    df['windspeed_bin'] = pd.cut(df['windspeed'], bins=[0, 10, 20, 30, 100], 
                               labels=['low', 'medium', 'high', 'very_high'])
    
    # Create interaction features
    df['temp_humidity'] = df['temp'] * df['humidity']
    df['temp_windspeed'] = df['temp'] * df['windspeed']
    
    return df

In [None]:
# Apply feature engineering to both train and test sets
train_fe = feature_engineering(train)
test_fe = feature_engineering(test)

# Display the new features
print("Training data with engineered features:")
train_fe.head()

### 3.1 Prepare Data for AutoGluon

Now that we've created additional features, let's prepare the data for AutoGluon. We'll need to:
1. Remove the original datetime column (since we've extracted its components)
2. Convert categorical features to the appropriate type
3. Create a log-transformed version of the target variable

In [None]:
# Remove the datetime column
train_fe = train_fe.drop('datetime', axis=1)
test_fe = test_fe.drop('datetime', axis=1)

# Convert categorical features to category type
categorical_features = ['season', 'holiday', 'workingday', 'weather', 'time_of_day', 
                        'temp_bin', 'humidity_bin', 'windspeed_bin', 'weather_season']

for feature in categorical_features:
    train_fe[feature] = train_fe[feature].astype('category')
    test_fe[feature] = test_fe[feature].astype('category')

# Create a log-transformed version of the target variable
train_fe['count_log'] = np.log1p(train_fe['count'])

# Display the prepared data
print("Training data prepared for AutoGluon:")
train_fe.info()

## Step 4: Initial Model Training with AutoGluon

Let's train an initial model using AutoGluon with default settings to establish a baseline.

In [None]:
# Define features to use for training
features = [col for col in train_fe.columns if col not in ['casual', 'registered', 'count', 'count_log']]

models_dir = Path('BikeSharingSystem/models')
models_dir.mkdir(parents=True, exist_ok=True)  # Add parents=True to create parent directories

# Create TabularDataset for AutoGluon
train_data = TabularDataset(train_fe[features + ['count']])

# Train initial model with default settings
initial_predictor = TabularPredictor(label='count', eval_metric='root_mean_squared_error',path=str(models_dir / 'initial_model')
initial_predictor.fit(train_data=train_data, time_limit=600)  # 10 minutes time limit

In [None]:
# Evaluate the initial model
initial_leaderboard = initial_predictor.leaderboard()
print("Initial model leaderboard:")
initial_leaderboard

### 4.1 Generate Initial Predictions and Evaluate

In [None]:
# Generate predictions on the test set
test_data = TabularDataset(test_fe[features])
initial_predictions = initial_predictor.predict(test_data)

# Create submission file
initial_submission = pd.DataFrame({
    'datetime': test['datetime'].astype(str),
    'count': initial_predictions
})

# Ensure predictions are non-negative
initial_submission['count'] = initial_submission['count'].clip(lower=0)

# Save submission file
initial_submission_path = data_dir / 'initial_submission.csv'
initial_submission.to_csv(initial_submission_path, index=False)

print(f"Initial submission file saved to {initial_submission_path}")
initial_submission.head()

## Step 5: Model Optimization - Iteration 1 (Log Transform)

For our first optimization, we'll train a model using the log-transformed target variable, which is often more appropriate for count data.

In [None]:
# Create TabularDataset with log-transformed target
train_data_log = TabularDataset(train_fe[features + ['count_log']])

# Train model with log-transformed target
log_predictor = TabularPredictor(label='count_log', eval_metric='root_mean_squared_error')
log_predictor.fit(train_data=train_data_log, time_limit=600)  # 10 minutes time limit

In [None]:
# Evaluate the log-transformed model
log_leaderboard = log_predictor.leaderboard()
print("Log-transformed model leaderboard:")
log_leaderboard

In [None]:
# Generate predictions on the test set
log_predictions = log_predictor.predict(test_data)

# Transform predictions back to original scale
log_predictions_original_scale = np.expm1(log_predictions)

# Create submission file
log_submission = pd.DataFrame({
    'datetime': test['datetime'].astype(str),
    'count': log_predictions_original_scale
})

# Ensure predictions are non-negative
log_submission['count'] = log_submission['count'].clip(lower=0)

# Save submission file
log_submission_path = data_dir / 'log_submission.csv'
log_submission.to_csv(log_submission_path, index=False)

print(f"Log-transformed model submission file saved to {log_submission_path}")
log_submission.head()

## Step 6: Model Optimization - Iteration 2 (Advanced Configuration)

For our second optimization, we'll use more advanced AutoGluon configurations, including:
1. Using the 'best_quality' preset for more thorough model training
2. Implementing bagging with multiple folds
3. Using stacking to combine multiple models

In [None]:
# Train model with advanced configuration
advanced_predictor = TabularPredictor(label='count_log', eval_metric='root_mean_squared_error')
advanced_predictor.fit(
    train_data=train_data_log,
    time_limit=1200,  # 20 minutes time limit
    presets='best_quality',  # Use more advanced models and tuning
    num_bag_folds=5,  # Use bagging for better performance
    num_stack_levels=1  # Use stacking for better performance
)

In [None]:
# Evaluate the advanced model
advanced_leaderboard = advanced_predictor.leaderboard()
print("Advanced model leaderboard:")
advanced_leaderboard

In [None]:
# Generate predictions on the test set
advanced_predictions = advanced_predictor.predict(test_data)

# Transform predictions back to original scale
advanced_predictions_original_scale = np.expm1(advanced_predictions)

# Create submission file
advanced_submission = pd.DataFrame({
    'datetime': test['datetime'].astype(str),
    'count': advanced_predictions_original_scale
})

# Ensure predictions are non-negative
advanced_submission['count'] = advanced_submission['count'].clip(lower=0)

# Save submission file
advanced_submission_path = data_dir / 'advanced_submission.csv'
advanced_submission.to_csv(advanced_submission_path, index=False)

print(f"Advanced model submission file saved to {advanced_submission_path}")
advanced_submission.head()

## Step 7: Feature Importance Analysis

Let's analyze which features were most important for our best model.

In [None]:
# Get feature importance from the best model
feature_importance = advanced_predictor.feature_importance(test_data)
print("Feature importance:")
feature_importance

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
feature_importance.sort_values().plot(kind='barh')
plt.title('Feature Importance', fontsize=14)
plt.xlabel('Importance Score', fontsize=12)
plt.tight_layout()
plt.show()

## Step 8: Model Comparison and Final Submission

Let's compare the performance of our three models and select the best one for our final submission.

In [None]:
# Compare model performance
print("Initial Model Performance:")
print(initial_leaderboard.head(1))

print("\nLog-Transformed Model Performance:")
print(log_leaderboard.head(1))

print("\nAdvanced Model Performance:")
print(advanced_leaderboard.head(1))

In [None]:
# Create final submission file (using the advanced model)
final_submission = advanced_submission.copy()
final_submission_path = data_dir / 'final_submission.csv'
final_submission.to_csv(final_submission_path, index=False)

print(f"Final submission file saved to {final_submission_path}")
final_submission.head()

## Step 9: Summary and Conclusion

In this project, we tackled the Bike Sharing Demand competition using AutoGluon. Here's a summary of our approach and findings:

### Data Exploration and Insights
- We found clear patterns in bike rentals based on time of day, day of week, season, and weather conditions.
- Commuting hours (8-9 AM and 5-6 PM) showed peak rental activity.
- Weather and temperature had significant impacts on rental patterns.

### Feature Engineering
- We extracted datetime components (hour, day, month, year, day of week).
- Created categorical features like time of day, temperature bins, and humidity bins.
- Added interaction features combining weather and seasonal information.
- Created flags for rush hours and weekends.

### Model Development
1. **Initial Model**: Trained with default settings as a baseline.
2. **Log-Transformed Model**: Applied log transformation to the target variable to better handle the count data distribution.
3. **Advanced Model**: Used best_quality preset with bagging and stacking for improved performance.

### Key Findings
- Log transformation of the target variable significantly improved model performance.
- The most important features were hour of day, temperature, and season.
- Advanced configurations with bagging and stacking further improved model performance.

### Future Improvements
- Incorporate external data like public holidays or events that might affect bike rentals.
- Experiment with more complex feature interactions.
- Try time series specific models to better capture temporal patterns.
- Implement hyperparameter tuning for specific models within AutoGluon.

This project demonstrates the power of AutoGluon for quickly developing high-performing models with minimal manual tuning, while still allowing for sophisticated feature engineering and model optimization.