# Bergen Bysykkel 2024: Weather and Usage Analysis in Python
**Author**: Syed Amjad Ali

---


## Table of Contents
1. [Introduction](#introduction)
2. [Part 1: Exploratory Analysis](#part-1-exploratory-analysis)
    - [Data Collection](#data-collection)
    - [Data Cleaning](#data-cleaning)
3. [Part 2: Predictive Analytics](#part-2-predictive-analytics)
    - [Regression Models](#regression-models)
    - [Weather Data Integration](#weather-data-integration)
4. [Conclusion](#conclusion)



## Introduction
<a id="introduction"></a>

This notebook analyzes Bergen Bysykkel rental patterns using 2024 data, integrating weather data for advanced predictive modeling. It aims to:
- Understand bike usage trends by station and time of day.
- Predict hourly ride counts to optimize station capacity.
- Explore weather's impact on ride patterns.


## Part 1: Exploratory Analysis
<a id="part-1-exploratory-analysis"></a>



In [2]:
#pip install folium geopandas shapely

In [3]:
# Importing required libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split  # Equivalent to rsample
import folium  # Equivalent to leaflet
import geopandas as gpd  # For geospatial data (like ggmap)
import itertools  # For functional programming (like purrr)
from itertools import product
import os
from IPython.display import display  # For rendering tables nicely in Jupyter
# Set global plotting style
sns.set(style="whitegrid")




### Data Collection
<a id="data-collection"></a>


In [4]:
# Load the monthly CSV files and combine them into one DataFrame
# Read in all bike rides from January 1, 2024, to December 8, 2024
# Get a sorted list of all `.csv` files in the "byssykkel-data-2024" directory
directory = "byssykkel-data-2024"  # Directory containing your .csv files
filenames = sorted(os.path.join(directory, file) for file in os.listdir(directory) if file.endswith(".csv"))

# Combine all the data from these files into a single dataset
bike_rides_2024_data = pd.concat((pd.read_csv(file) for file in filenames), ignore_index=True)

# Render a formatted table for the first 10 rows
display(bike_rides_2024_data.head(10))  # This displays the table nicely in a Jupyter notebook

Unnamed: 0,started_at,ended_at,duration,start_station_id,start_station_name,start_station_description,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_description,end_station_latitude,end_station_longitude
0,2024-01-01 06:06:56.456000+00:00,2024-01-01 06:11:41.251000+00:00,284,8,Haukeland sykehus,Ved HUD-bygget,60.373638,5.356595,646,Bjørnsonsgate,Ved Årstad VGS,60.373986,5.342082
1,2024-01-01 06:21:00.424000+00:00,2024-01-01 06:33:39.184000+00:00,758,1048,Øvre Korskirkeallmenning,På venstre side av allmenning,60.395099,5.328645,646,Bjørnsonsgate,Ved Årstad VGS,60.373986,5.342082
2,2024-01-01 06:22:25.158000+00:00,2024-01-01 06:32:09.997000+00:00,584,156,Allehelgens plass,ved Politihuset,60.392651,5.328977,808,Damsgårdsveien 71,By walkway to Damsgårdsveien 69-71,60.380263,5.323223
3,2024-01-01 06:32:36.086000+00:00,2024-01-01 06:38:26.805000+00:00,350,790,Busstasjonen 3 Øst,Ved Taxiholdeplass,60.387856,5.334459,1048,Øvre Korskirkeallmenning,På venstre side av allmenning,60.395099,5.328645
4,2024-01-01 06:36:11.445000+00:00,2024-01-01 06:49:22.330000+00:00,790,814,Nykirken,i Strandgaten,60.396949,5.313495,222,Sandviken Brygge,ved Måseskjæret,60.412946,5.32087
5,2024-01-01 06:37:24.473000+00:00,2024-01-01 06:43:24.943000+00:00,360,217,Nykirkekaien,Langs C. Sundts gate,60.397057,5.314548,817,Nedre Korskirkeallmenning,Nedre Korskirkeallmenning 8,60.39455,5.32705
6,2024-01-01 06:37:57.937000+00:00,2024-01-01 06:45:25.180000+00:00,447,812,Hans Hauges gate,Jens Rolfens gate 6,60.401906,5.324748,792,Busstasjonen 2 Vest,Vis a vis Hotel Scandic Ørnen,60.388284,5.332873
7,2024-01-01 07:27:47.890000+00:00,2024-01-01 07:30:56.828000+00:00,188,1047,Welhavens gate 64,Langs gatetun,60.384451,5.324441,157,Florida Bybanestopp,ved Florida bybanestopp,60.382255,5.332332
8,2024-01-01 07:36:12.586000+00:00,2024-01-01 07:44:21.296000+00:00,488,642,Bøhmergaten,Magic Hotel,60.376253,5.334276,2336,Møllestranden,Ved badebrygge,60.380044,5.350824
9,2024-01-01 08:16:51.379000+00:00,2024-01-01 08:32:49.760000+00:00,958,639,Frydenbø Marina,Under overbygg v/nr. 135,60.384467,5.310008,24,Studentboligene,ved Grønneviken og KMD fakultetet,60.378317,5.350897
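
The `Unnamed: 0` column in the preview looks like a row index that was written into the CSV files. A minimal sketch, assuming that is the case, of reading that column as the index so it does not become a data column. The inline CSV text is made-up illustration data, not part of the Bysykkel export.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the layout of the preview above
csv_text = (
    ",started_at,duration,start_station_id\n"
    "0,2024-01-01 06:06:56.456000+00:00,284,8\n"
    "1,2024-01-01 06:21:00.424000+00:00,758,1048\n"
)

# Default read: the unnamed first column shows up as "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

If the real files do not carry a saved index, `index_col=0` would instead consume a genuine data column, so it is worth checking one file by hand first.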


### Data Cleaning
<a id="data-cleaning"></a>


In [5]:
# Convert timestamps and add derived features, handling mixed formats
bike_rides_2024_data['started_at'] = pd.to_datetime(
    bike_rides_2024_data['started_at'], 
    errors='coerce',  # Invalid parsing will be set to NaT
    format=None       # Allow pandas to infer the format automatically
)

# Check for parsing issues
if bike_rides_2024_data['started_at'].isna().any():
    print("Warning: Some rows could not be parsed and were set to NaT.")

# Continue processing only for valid timestamps
bike_rides_2024_data = bike_rides_2024_data.dropna(subset=['started_at'])

# Add derived features (the raw timestamps are UTC, so `hour` is the UTC hour)
bike_rides_2024_data['hour'] = bike_rides_2024_data['started_at'].dt.hour
bike_rides_2024_data['weekday'] = bike_rides_2024_data['started_at'].dt.dayofweek  # Monday = 0, Sunday = 6

# Aggregate ride counts by station, weekday, and hour
ride_counts = bike_rides_2024_data.groupby(['start_station_id', 'weekday', 'hour']).size().reset_index(name='ride_count')

print(ride_counts.head())




   start_station_id  weekday  hour  ride_count
0                 3        0     4           5
1                 3        0     5          16
2                 3        0     6          64
3                 3        0     7          58
4                 3        0     8          43
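
The timestamps in the raw data carry a `+00:00` offset, so the `hour` feature above is in UTC rather than Bergen local time. A small sketch of converting to local time before extracting the hour. Using `Europe/Oslo` as the zone name and local time as the better basis for commuting patterns are both assumptions, not something the notebook states.

```python
import pandas as pd

# Two timestamps in the same UTC format as the ride data
started_at = pd.to_datetime(
    pd.Series([
        "2024-01-01 06:06:56.456000+00:00",  # winter: Oslo is UTC+1
        "2024-06-08 13:00:00.000000+00:00",  # summer: Oslo is UTC+2
    ]),
    utc=True,
)

# Convert to local time before extracting the hour
local_time = started_at.dt.tz_convert("Europe/Oslo")

utc_hours = started_at.dt.hour.tolist()
local_hours = local_time.dt.hour.tolist()
print(utc_hours, local_hours)
```

The offset changes with daylight saving time, which is why a fixed "+1 hour" shift would be wrong for part of the year.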


In [6]:
num_missing_rows = bike_rides_2024_data['started_at'].isna().sum()
print(f"Number of rows with invalid timestamps: {num_missing_rows}")


Number of rows with invalid timestamps: 0


In [7]:
# Generate all station-hour combinations
hourly_timestamps = pd.date_range(
    start="2024-01-01 00:00:00", 
    end="2024-12-19 23:00:00", 
    freq="H"
)
unique_station_ids = bike_rides_2024_data["start_station_id"].unique()  # df_agg is only created in a later cell

# Create Cartesian product
station_hour_combinations = pd.DataFrame(
    list(product(hourly_timestamps, unique_station_ids)),
    columns=["floor_start_dh", "start_station_id"]
)

# Sort the DataFrame
station_hour_combinations = station_hour_combinations.sort_values(by=["start_station_id", "floor_start_dh"]).reset_index(drop=True)

# Preview the first 10 rows of the dataset
print(station_hour_combinations.head(10))
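
The station-hour grid can also be built without going through a Python list of tuples. The following is a sketch of one alternative using `pd.MultiIndex.from_product`. The station IDs and the three-hour range are illustrative stand-ins, and `freq="60min"` is used here only because it is accepted across pandas versions.

```python
import pandas as pd

# Small illustrative inputs: three hours and two station IDs
hourly_timestamps = pd.date_range("2024-01-01 00:00:00", periods=3, freq="60min")
station_ids = [8, 646]

# Build every (station, hour) pair; listing stations first yields rows
# already ordered by station and then by hour
grid = (
    pd.MultiIndex.from_product(
        [station_ids, hourly_timestamps],
        names=["start_station_id", "floor_start_dh"],
    )
    .to_frame(index=False)
)

print(grid)
```

Because the levels are passed in station-then-hour order, no separate `sort_values` call is needed as long as the inputs themselves are in the desired order.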

In [None]:
# Ensure the 'started_at' column is in datetime format
bike_rides_2024_data['started_at'] = pd.to_datetime(bike_rides_2024_data['started_at'], errors='coerce')

In [None]:
# Convert the "started_at" column to hourly timestamps (YYYY-MM-DD HH:00:00 format)
bike_rides_2024_data['start_date'] = bike_rides_2024_data['started_at'].dt.floor('H')

# Group the data by station ID and hourly timestamps, and count the number of rides for each combination
bike_rides_2024 = (
    bike_rides_2024_data
    .groupby(['start_station_id', 'start_date'])
    .size()
    .reset_index(name='n_rides')
)

# The result is a DataFrame with station IDs, hourly timestamps, and the corresponding ride counts
print(bike_rides_2024.head())  # Preview the first few rows

In [None]:
# Ensure both datetime columns have the same format
#station_hour_combinations["floor_start_dh"] = pd.to_datetime(station_hour_combinations["floor_start_dh"]).dt.tz_localize(None)
bike_rides_2024["start_date"] = pd.to_datetime(bike_rides_2024["start_date"]).dt.tz_localize(None)

# Perform the left join
df_agg = pd.merge(
    station_hour_combinations,
    bike_rides_2024,
    how="left",
    left_on=["start_station_id", "floor_start_dh"],
    right_on=["start_station_id", "start_date"]
)

# Replace NaN values in `n_rides` with 0 (station-hours without any rides) and restore the integer dtype
df_agg["n_rides"] = df_agg["n_rides"].fillna(0).astype(int)

# Drop redundant `start_date` column
df_agg = df_agg.drop(columns=["start_date"], errors="ignore")

# Preview the resulting DataFrame
print(df_agg.head())
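
A left join like the one above silently multiplies rows if either side contains duplicate keys. As a hedged sketch, the `validate` argument of `pd.merge` can make that assumption explicit. The two small frames below are made-up examples that only mirror the column names used in this notebook.

```python
import pandas as pd

# Hypothetical grid of station-hours and a sparse table of observed rides
grid = pd.DataFrame({
    "start_station_id": [8, 8, 8],
    "floor_start_dh": pd.to_datetime(
        ["2024-01-01 06:00", "2024-01-01 07:00", "2024-01-01 08:00"]
    ),
})
rides = pd.DataFrame({
    "start_station_id": [8],
    "start_date": pd.to_datetime(["2024-01-01 07:00"]),
    "n_rides": [3],
})

# validate="one_to_one" raises a MergeError if either side has duplicate keys
merged = pd.merge(
    grid,
    rides,
    how="left",
    left_on=["start_station_id", "floor_start_dh"],
    right_on=["start_station_id", "start_date"],
    validate="one_to_one",
)

# Station-hours without a match had no rides
merged["n_rides"] = merged["n_rides"].fillna(0).astype(int)
print(merged[["floor_start_dh", "n_rides"]])
```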



In [None]:
# Add two new columns to the dataset:
# - `start_hour`: Represents the hour of the day (0 to 23) for each observation.
# - `weekday_start`: Represents the day of the week (1 = Monday, 7 = Sunday).
df_agg['start_hour'] = df_agg['floor_start_dh'].dt.hour
df_agg['weekday_start'] = df_agg['floor_start_dh'].dt.weekday + 1  # Shift pandas' 0-6 to 1-7; note that R's wday() defaults to 1 = Sunday, so this is not the same numbering

# Display the first 10 rows of the updated dataset for verification.
print(df_agg.head(10))

In [None]:
# Validation Test 1: Check for duplicate records
# Group the data by station ID and hourly timestamp
# Summarize the number of records for each station-hour combination
# Ensure that the maximum count is equal to 1, indicating no duplicates

# Group by `start_station_id` and `floor_start_dh` to count occurrences
validation_test = (
    df_agg.groupby(['start_station_id', 'floor_start_dh'])
    .size()
    .reset_index(name='n')
)

# Get the maximum count of duplicates
max_n = validation_test['n'].max()

# Assert that the maximum count is 1
if max_n == 1:
    print("Validation Test 1 Passed: No duplicates found in station-hour combinations!")
else:
    raise AssertionError("Validation Test 1 Failed: Duplicates found in station-hour combinations!")
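
The same duplicate check can be written more compactly with `DataFrame.duplicated`. This is a sketch on a made-up three-row frame, not a replacement the notebook itself uses.

```python
import pandas as pd

# Illustrative data: the first two rows share the same station-hour key
df = pd.DataFrame({
    "start_station_id": [8, 8, 646],
    "floor_start_dh": pd.to_datetime(
        ["2024-01-01 06:00", "2024-01-01 06:00", "2024-01-01 06:00"]
    ),
})

# keep=False marks every member of a duplicated group, not just the repeats
dup_mask = df.duplicated(subset=["start_station_id", "floor_start_dh"], keep=False)
n_duplicates = int(dup_mask.sum())

print(f"Rows involved in duplicates: {n_duplicates}")
```

A check such as `assert n_duplicates == 0` would then play the role of Validation Test 1.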



In [None]:
# Verify that the time difference between consecutive observations is exactly 1 hour
# Group by `start_station_id` and calculate the time difference
# Note: the first row of each station has no predecessor, so its `timediff` is NaN
# and is flagged by the `!= 1` filter below

df_agg['timediff'] = (
    df_agg.groupby('start_station_id')['floor_start_dh']
    .diff()
    .dt.total_seconds() / 3600  # Convert time difference to hours
)

# Filter rows where the time difference is not equal to 1 hour
invalid_timediffs = df_agg[df_agg['timediff'] != 1]

# Check if there are any violations
if invalid_timediffs.empty:
    print("Validation Test 2 Passed: Time differences between observations are always 1 hour!")
else:
    print("Validation Test 2 Failed: Some time differences are not equal to 1 hour!")
    print("Sample of invalid time differences:")
    print(invalid_timediffs.head())  # Show a sample of invalid rows


In [None]:
# Inspect the failing rows
print("Inspecting invalid rows:")
print(invalid_timediffs)

# Check the previous and next rows for each invalid observation
for index, row in invalid_timediffs.iterrows():
    station_id = row['start_station_id']
    timestamp = row['floor_start_dh']
    
    # Find the previous and next rows for this station
    previous_row = df_agg[
        (df_agg['start_station_id'] == station_id) &
        (df_agg['floor_start_dh'] < timestamp)
    ].sort_values(by='floor_start_dh').tail(1)
    
    next_row = df_agg[
        (df_agg['start_station_id'] == station_id) &
        (df_agg['floor_start_dh'] > timestamp)
    ].sort_values(by='floor_start_dh').head(1)
    
    print(f"Station {station_id}, Timestamp {timestamp}")
    print("Previous row:")
    print(previous_row)
    print("Next row:")
    print(next_row)
    print("-" * 40)


In [None]:
# Ensure data is sorted before calculating differences
df_agg = df_agg.sort_values(by=['start_station_id', 'floor_start_dh'])

# Calculate time differences
df_agg['timediff'] = (
    df_agg.groupby('start_station_id')['floor_start_dh']
    .diff()
    .dt.total_seconds() / 3600  # Convert time difference to hours
)

# Filter rows with invalid time differences (excluding NaN rows)
invalid_timediffs = df_agg[(df_agg['timediff'] != 1) & (df_agg['timediff'].notna())]

# Validation test
if invalid_timediffs.empty:
    print("Validation Test Passed: Time differences between observations are always 1 hour!")
else:
    print("Validation Test Failed: Some time differences are not equal to 1 hour!")
    print("Number of invalid rows:", len(invalid_timediffs))
    print("Sample of invalid rows:")
    print(invalid_timediffs.head())
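
To see why the NaN rows have to be excluded, here is a small worked example on made-up data. A grouped `diff()` always leaves the first row of each station empty, so only the remaining rows can reveal a real gap.

```python
import pandas as pd

# Illustrative data: station 8 is missing the 02:00 observation
df = pd.DataFrame({
    "start_station_id": [8, 8, 8, 646, 646],
    "floor_start_dh": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00",
        "2024-01-01 00:00", "2024-01-01 01:00",
    ]),
})

# Hours elapsed since the previous observation of the same station
df["timediff"] = (
    df.groupby("start_station_id")["floor_start_dh"].diff().dt.total_seconds() / 3600
)

# First row per station: no predecessor, hence NaN
first_rows = df["timediff"].isna()

# Genuine gaps: a predecessor exists, but it is not exactly one hour earlier
gaps = df[~first_rows & (df["timediff"] != 1)]

print(df)
print(gaps)
```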



In [None]:
# Calculate average latitude and longitude for each station
station = (
    bike_rides_2024_data
    .groupby('start_station_id', as_index=False)  # Group by station ID
    .agg(
        lon=('start_station_longitude', 'mean'),  # Average longitude for each station
        lat=('start_station_latitude', 'mean')   # Average latitude for each station
    )
)

# Display the calculated averages
print(station.head(10))  # Preview the first 10 rows
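
Averaging coordinates is only needed if a station reports more than one position over the year. A possible sketch for checking that, on made-up rows in which station 646 has a second, invented latitude:

```python
import pandas as pd

# Illustrative rides; the differing latitude for station 646 is hypothetical
rides = pd.DataFrame({
    "start_station_id": [8, 8, 646, 646],
    "start_station_latitude": [60.373638, 60.373638, 60.373986, 60.374100],
    "start_station_longitude": [5.356595, 5.356595, 5.342082, 5.342082],
})

# Count distinct latitude and longitude values per station
coord_variants = (
    rides.groupby("start_station_id")[
        ["start_station_latitude", "start_station_longitude"]
    ].nunique()
)

# Stations with more than one distinct value in either coordinate
moved = coord_variants[(coord_variants > 1).any(axis=1)].index.tolist()
print(moved)
```

If the list is empty for the real data, taking the mean is equivalent to taking the first value per station.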


In [None]:
# Merge average coordinates into the main dataset (df_agg)
df_agg_lonlat = pd.merge(
    df_agg,                 # Main dataset
    station,                # Station dataset with average coordinates
    how='left',             # Perform a left join
    on='start_station_id'   # Match based on station ID
)

# Display the first 10 rows of the updated dataset for verification
print(df_agg_lonlat.head(10))


### Exploratory Data Analysis (EDA)
- Visualize hourly ride patterns.
- Identify station-level trends.


In [None]:
import folium
from folium.plugins import MiniMap, Fullscreen  # Only the plugins used below

# Define the function to create an interactive map
def plot_map_folium(date_and_hour):
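    # Filter the dataset for the specified date and hour
    filtered_df = df_agg_lonlat[df_agg_lonlat['floor_start_dh'] == date_and_hour]
    if filtered_df.empty:
        # Without rows the map centre would be NaN, so fail early with a clear message
        raise ValueError(f"No observations found for {date_and_hour}")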
    
    # Define color categories based on the number of rides
    def get_ride_color(n_rides):
        if n_rides == 0:
            return "#e41a1c"  # Red for 0 rides
        elif n_rides == 1:
            return "#377eb8"  # Blue for 1 ride
        elif n_rides == 2:
            return "#4daf4a"  # Green for 2 rides
        elif n_rides == 3:
            return "#984ea3"  # Purple for 3 rides
        elif n_rides == 4:
            return "#ff7f00"  # Orange for 4 rides
        else:
            return "#ffff33"  # Yellow for 5+ rides

    # Create a folium map centered at an average latitude and longitude
    avg_lat = filtered_df['lat'].mean()
    avg_lon = filtered_df['lon'].mean()
    bike_map = folium.Map(location=[avg_lat, avg_lon], zoom_start=12)

    # Add Circle Markers for each station
    for _, row in filtered_df.iterrows():
        folium.CircleMarker(
            location=[row['lat'], row['lon']],  # Latitude and Longitude
            radius=row['n_rides'] * 3 if row['n_rides'] > 0 else 5,  # Scaled marker size
            color=get_ride_color(row['n_rides']),  # Apply distinct colors
            fill=True,
            fill_opacity=0.7,
            popup=folium.Popup(
                f"<strong>Station ID:</strong> {row['start_station_id']}<br>"
                f"<strong>Number of Rides:</strong> {row['n_rides']}<br>"
                f"<strong>Time:</strong> {date_and_hour}",
                max_width=250
            ),
            tooltip=f"Station ID: {row['start_station_id']} | Rides: {row['n_rides']}"  # Tooltip
        ).add_to(bike_map)

    # Add a Legend
    legend_html = """
    <div style="position: fixed; 
                bottom: 50px; left: 50px; width: 200px; height: 150px; 
                background-color: white; z-index:1000; font-size:14px;
                border:2px solid grey; padding: 10px;">
    <h4>Bicycle Traffic Volume</h4>
    <i style="background: #e41a1c; width: 10px; height: 10px; display: inline-block;"></i> 0 Rides<br>
    <i style="background: #377eb8; width: 10px; height: 10px; display: inline-block;"></i> 1 Ride<br>
    <i style="background: #4daf4a; width: 10px; height: 10px; display: inline-block;"></i> 2 Rides<br>
    <i style="background: #984ea3; width: 10px; height: 10px; display: inline-block;"></i> 3 Rides<br>
    <i style="background: #ff7f00; width: 10px; height: 10px; display: inline-block;"></i> 4 Rides<br>
    <i style="background: #ffff33; width: 10px; height: 10px; display: inline-block;"></i> 5+ Rides<br>
    </div>
    """
    bike_map.get_root().html.add_child(folium.Element(legend_html))

    # Add additional plugins
    MiniMap(toggle_display=True).add_to(bike_map)  # Add MiniMap
    folium.LayerControl().add_to(bike_map)  # Add Layer Control
    Fullscreen().add_to(bike_map)  # Add Fullscreen Control

    # Return the map (Jupyter renders it when it is the last expression in a cell)
    return bike_map
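
The colour mapping inside `plot_map_folium` can also be expressed as a lookup table, which keeps the palette in one place next to the legend. A minimal sketch, assuming ride counts are non-negative numbers. The name `RIDE_COLORS` is introduced here for illustration only.

```python
# Palette in legend order: index 0..4 for 0..4 rides, last entry for 5 or more
RIDE_COLORS = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#ffff33"]


def get_ride_color(n_rides):
    """Return the legend colour for a ride count; 5 or more share the last colour."""
    # Clamp into the valid index range so large counts map to the final bucket
    bucket = max(0, min(int(n_rides), len(RIDE_COLORS) - 1))
    return RIDE_COLORS[bucket]


print(get_ride_color(0), get_ride_color(4.0), get_ride_color(12))
```

The `int()` call matters because `n_rides` becomes a float column after the left join and `fillna(0)`.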


In [None]:
# Test the function with specific date and time inputs
map1 = plot_map_folium("2024-06-08 13:00:00")
#map1.save("bicycle_map_2024-06-08_13.html")  # Save the map as an HTML file
map1  # Display the map in Jupyter Notebook




In [None]:
# Another test with a different date and time
map2 = plot_map_folium("2024-05-18 15:00:00")
#map2.save("bicycle_map_2024-05-18_15.html")  # Save the map as an HTML file
map2  # Display the map in Jupyter Notebook

In [None]:
# import folium
# from folium.plugins import MiniMap, Fullscreen
# from ipywidgets import interact, widgets
# from IPython.display import display

# # Function to create an interactive map
# def create_interactive_map(date_and_hour):
#     # Filter the dataset for the specified date and hour
#     filtered_df = df_agg_lonlat[df_agg_lonlat['floor_start_dh'] == date_and_hour]
    
#     # Define color categories based on the number of rides
#     def get_ride_color(n_rides):
#         if n_rides == 0:
#             return "#e41a1c"  # Red for 0 rides
#         elif n_rides == 1:
#             return "#377eb8"  # Blue for 1 ride
#         elif n_rides == 2:
#             return "#4daf4a"  # Green for 2 rides
#         elif n_rides == 3:
#             return "#984ea3"  # Purple for 3 rides
#         elif n_rides == 4:
#             return "#ff7f00"  # Orange for 4 rides
#         else:
#             return "#ffff33"  # Yellow for 5+ rides

#     # Create a folium map centered at an average latitude and longitude
#     avg_lat = filtered_df['lat'].mean()
#     avg_lon = filtered_df['lon'].mean()
#     bike_map = folium.Map(location=[avg_lat, avg_lon], zoom_start=12)

#     # Add Circle Markers for each station
#     for _, row in filtered_df.iterrows():
#         folium.CircleMarker(
#             location=[row['lat'], row['lon']],  # Latitude and Longitude
#             radius=row['n_rides'] * 3 if row['n_rides'] > 0 else 5,  # Scaled marker size
#             color=get_ride_color(row['n_rides']),  # Apply distinct colors
#             fill=True,
#             fill_opacity=0.7,
#             popup=folium.Popup(
#                 f"<strong>Station ID:</strong> {row['start_station_id']}<br>"
#                 f"<strong>Number of Rides:</strong> {row['n_rides']}<br>"
#                 f"<strong>Time:</strong> {date_and_hour}",
#                 max_width=250
#             ),
#             tooltip=f"Station ID: {row['start_station_id']} | Rides: {row['n_rides']}"  # Tooltip
#         ).add_to(bike_map)

#     # Add a Legend
#     legend_html = """
#     <div style="position: fixed; 
#                 bottom: 50px; left: 50px; width: 200px; height: 150px; 
#                 background-color: white; z-index:1000; font-size:14px;
#                 border:2px solid grey; padding: 10px;">
#     <h4>Bicycle Traffic Volume</h4>
#     <i style="background: #e41a1c; width: 10px; height: 10px; display: inline-block;"></i> 0 Rides<br>
#     <i style="background: #377eb8; width: 10px; height: 10px; display: inline-block;"></i> 1 Ride<br>
#     <i style="background: #4daf4a; width: 10px; height: 10px; display: inline-block;"></i> 2 Rides<br>
#     <i style="background: #984ea3; width: 10px; height: 10px; display: inline-block;"></i> 3 Rides<br>
#     <i style="background: #ff7f00; width: 10px; height: 10px; display: inline-block;"></i> 4 Rides<br>
#     <i style="background: #ffff33; width: 10px; height: 10px; display: inline-block;"></i> 5+ Rides<br>
#     </div>
#     """
#     bike_map.get_root().html.add_child(folium.Element(legend_html))

#     # Add additional plugins
#     MiniMap(toggle_display=True).add_to(bike_map)  # Add MiniMap
#     folium.LayerControl().add_to(bike_map)  # Add Layer Control
#     Fullscreen().add_to(bike_map)  # Add Fullscreen Control

#     # Return the map
#     return bike_map

# # Create a dropdown widget for timestamps
# timestamps = df_agg_lonlat['floor_start_dh'].unique()  # Get all unique timestamps
# dropdown = widgets.Dropdown(
#     options=timestamps,
#     description='Select Timestamp:',
#     style={'description_width': 'initial'}
# )

# # Function to display the map based on dropdown selection
# def display_map(selected_timestamp):
#     bike_map = create_interactive_map(selected_timestamp)
#     display(bike_map)

# # Use the `interact` function to link the dropdown to the map
# interact(display_map, selected_timestamp=dropdown)


## Part 2: Predictive Analytics
<a id="part-2-predictive-analytics"></a>



In [None]:
# Importing required libraries for machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data for modeling
X = ride_counts[['weekday', 'hour']]  # Features
y = ride_counts['ride_count']  # Target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
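
The model above treats `weekday` and `hour` as plain numbers, so it can only fit a straight-line trend across the day. As a hedged illustration of why that may matter, the sketch below compares it with a one-hot encoded variant on synthetic data with two daily peaks. The data, the peak shape, and the in-sample scoring are all assumptions made for this example and say nothing about the actual Bysykkel results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (not the Bysykkel data): peaks around 08:00 and 16:00
rng = np.random.default_rng(0)
grid = pd.DataFrame(
    [(d, h) for d in range(7) for h in range(24)], columns=["weekday", "hour"]
)
peaks = (
    50 * np.exp(-((grid["hour"] - 8) ** 2) / 8)
    + 50 * np.exp(-((grid["hour"] - 16) ** 2) / 8)
)
grid["ride_count"] = peaks + rng.normal(0, 2, len(grid))

X, y = grid[["weekday", "hour"]], grid["ride_count"]

# Numeric features: one slope per feature
linear = LinearRegression().fit(X, y)

# One-hot features: a separate coefficient for each weekday and each hour
one_hot = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LinearRegression()
).fit(X, y)

# In-sample R² only, to keep the sketch short
r2_linear = linear.score(X, y)
r2_one_hot = one_hot.score(X, y)
print(f"numeric: {r2_linear:.2f}, one-hot: {r2_one_hot:.2f}")
```

On real data the comparison should be made on a held-out test set, as in the cell above.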


### Regression Models
<a id="regression-models"></a>


### Weather Data Integration
<a id="weather-data-integration"></a>


In [None]:
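# Load weather data
# Replace 'path/to/weather_data.csv' with the actual path
# Assumption: the file has one row per hour with a `date` column and an `hour` column (0-23)
weather_data = pd.read_csv("path/to/weather_data.csv")
weather_data['date'] = pd.to_datetime(weather_data['date']).dt.normalize()

# `ride_counts` is aggregated by weekday and has no `date` column, so build the
# merge keys from the hourly station-level table (df_agg) instead
# Note: `floor_start_dh` is in UTC; the weather timestamps must use the same time zone
hourly_rides = df_agg.assign(
    date=df_agg['floor_start_dh'].dt.normalize(),
    hour=df_agg['floor_start_dh'].dt.hour
)

# Merge weather data with hourly ride counts
ride_weather_data = pd.merge(hourly_rides, weather_data, on=['date', 'hour'], how='inner')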

print("Combined data preview:")
print(ride_weather_data.head())
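
An inner join drops every hour that has no weather record without any warning. A small sketch, using invented weather values and a hypothetical `temperature_c` column, of how `indicator=True` can show which ride hours would be lost:

```python
import pandas as pd

# Illustrative hourly ride counts with merge keys derived from the timestamp
rides = pd.DataFrame({
    "floor_start_dh": pd.to_datetime(["2024-06-08 13:00", "2024-06-08 14:00"]),
    "n_rides": [4, 2],
})
rides["date"] = rides["floor_start_dh"].dt.normalize()
rides["hour"] = rides["floor_start_dh"].dt.hour

# Hypothetical weather table; column names and values are invented
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-08"]),
    "hour": [13],
    "temperature_c": [17.5],
})

# A left join with indicator=True adds a `_merge` column describing each row's origin
merged = rides.merge(weather, on=["date", "hour"], how="left", indicator=True)

# Ride hours that found no weather record
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["date", "hour", "n_rides"]])
```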


### Results and Insights
- Evaluate the impact of weather on ride patterns.
- Visualize model predictions and compare with actual data.


In [None]:
# Visualize predictions vs actual values
plt.figure(figsize=(12, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='red')
plt.title('Predicted vs Actual Ride Counts')
plt.xlabel('Actual Ride Counts')
plt.ylabel('Predicted Ride Counts')
plt.show()


## Conclusion
<a id="conclusion"></a>

