# Milestone 1 – Bluebikes Boston: Predicting Bike Demand at Stations

**Dataset:** [Bluebikes Trip Data – August 2024](https://s3.amazonaws.com/hubway-data/202408-bluebikes-tripdata.zip)
**Trips:** Approx. 538,000 (538,262 rows in the raw file)

---

## Project Goal

The goal of this project is to accurately predict the demand for bikes at individual stations throughout the day. Using historical usage data, we aim to identify both temporal and spatial patterns to:

- Ensure better availability of bikes
- Detect shortages or surpluses in advance
- Support data-driven rebalancing of the fleet

The resulting models are intended to support operational decisions, for example optimizing rebalancing routes or managing bike supply at high-demand stations.
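
One way to make "shortages or surpluses" concrete is the hourly net flow per station, i.e. arrivals minus departures. The following is a minimal sketch on a tiny made-up trip table. The helper name `hourly_net_flow` and the output column names are hypothetical; only the input column names follow the dataset.

```python
import pandas as pd


def hourly_net_flow(trips: pd.DataFrame) -> pd.DataFrame:
    """Count departures and arrivals per station and hour (hypothetical helper)."""
    departures = (
        trips.assign(station_id=trips["start_station_id"],
                     hour=trips["started_at"].dt.floor("h"))
        .groupby(["station_id", "hour"]).size().rename("departures")
    )
    arrivals = (
        trips.assign(station_id=trips["end_station_id"],
                     hour=trips["ended_at"].dt.floor("h"))
        .groupby(["station_id", "hour"]).size().rename("arrivals")
    )
    # Outer join keeps station-hours that only have arrivals or only departures
    flow = departures.to_frame().join(arrivals, how="outer").fillna(0).astype(int)
    # Negative net flow: the station loses bikes in that hour
    flow["net_flow"] = flow["arrivals"] - flow["departures"]
    return flow.reset_index()


# Made-up example trips between two stations
trips = pd.DataFrame({
    "start_station_id": ["D32005", "D32005", "C32007"],
    "end_station_id": ["C32007", "C32007", "D32005"],
    "started_at": pd.to_datetime(
        ["2024-08-01 08:05", "2024-08-01 08:30", "2024-08-01 08:40"]),
    "ended_at": pd.to_datetime(
        ["2024-08-01 08:20", "2024-08-01 09:10", "2024-08-01 08:55"]),
})

flow = hourly_net_flow(trips)
print(flow)
```

A persistently negative `net_flow` at a station would point to a likely shortage, a persistently positive one to a surplus.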

---

## Optional Extension: Weather Data

Since weather conditions strongly influence mobility behavior, we may train a second model that adds weather features such as temperature and precipitation. Comparing it with the model without weather features will show whether these external factors improve prediction accuracy.
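
A possible way to attach weather features is a left join of an hourly weather table onto an hourly demand table. This is only a sketch: no weather source has been chosen yet, so the `weather` table, its column names (`temperature_c`, `precipitation_mm`) and all numbers below are hypothetical.

```python
import pandas as pd

# Hourly pickups for one station (made-up numbers)
demand = pd.DataFrame({
    "start_station_id": ["A32025"] * 3,
    "hour": pd.to_datetime(
        ["2024-08-01 07:00", "2024-08-01 08:00", "2024-08-01 09:00"]),
    "pickups": [4, 11, 6],
})

# Hypothetical hourly weather observations; 09:00 is deliberately missing
weather = pd.DataFrame({
    "hour": pd.to_datetime(["2024-08-01 07:00", "2024-08-01 08:00"]),
    "temperature_c": [21.5, 23.0],
    "precipitation_mm": [0.0, 1.2],
})

# Left join keeps every demand row; validate guards against duplicate weather hours
demand_weather = demand.merge(weather, on="hour", how="left", validate="many_to_one")
print(demand_weather)
```

Hours without a weather observation end up as `NaN` and would need to be imputed or dropped before training.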

---

## Planned Features

- Start time (`started_at`)
- Day of the week (`weekday`)
- Time of day (as hour or time blocks) (`time_of_day`)
- Start station (`start_station_id`)
- User type (`member_casual`)
- Optional: Weather data (temperature, precipitation)
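
The planned features could be combined into a modeling table with one row per station, user type and hour, and the number of pickups as the target. The sketch below assumes this layout; the helper name `build_demand_table` and the columns `hour_start` and `pickups` are hypothetical, and the three example trips are made up.

```python
import pandas as pd


def build_demand_table(trips: pd.DataFrame) -> pd.DataFrame:
    """Aggregate single trips into hourly pickup counts (hypothetical helper)."""
    table = (
        trips.assign(hour_start=trips["started_at"].dt.floor("h"))
        .groupby(["start_station_id", "member_casual", "hour_start"])
        .size()
        .rename("pickups")
        .reset_index()
    )
    # Calendar features derived from the start of each hour
    table["weekday"] = table["hour_start"].dt.weekday  # Monday=0 ... Sunday=6
    table["hour"] = table["hour_start"].dt.hour
    return table


# Made-up example trips from one station
example_trips = pd.DataFrame({
    "start_station_id": ["D32005", "D32005", "D32005"],
    "member_casual": ["member", "member", "casual"],
    "started_at": pd.to_datetime(
        ["2024-08-01 07:10", "2024-08-01 07:40", "2024-08-01 07:50"]),
})

demand_table = build_demand_table(example_trips)
print(demand_table)
```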


In [1]:
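import pandas as pd

# Local path to the extracted August 2024 trip file
csv_path = r"C:\Users\hanac\PycharmProjects\Data-Science-and-Machine-Learning\202408-bluebikes-tripdata.csv"

# sep=None lets the Python engine detect the delimiter; malformed lines are skipped
df = pd.read_csv(
    csv_path,
    engine="python",
    sep=None,
    on_bad_lines="skip",
)

print(df.shape)
df  # the notebook shows only the first and last rows of the full frame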


(538262, 13)


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,9555B91492D25570,classic_bike,01.08.2024 07:10,01.08.2024 07:26,Main St at Baldwin St,D32036,Purchase St at Pearl St,A32026,42.380.857,-71.070.629,42.354.659,-71.053.181,member
1,82D93E8BDD45E43F,electric_bike,12.08.2024 15:43,12.08.2024 15:46,75 Binney St,M32064,Cambridge Crossing at North First Street,M32077,4.236.550.728.505.650,-710.801.375.997.653,42.371.141,-71.076.198,member
2,C99E6E4F4C76DFF9,classic_bike,28.08.2024 21:06,28.08.2024 21:10,Copley Square - Dartmouth St at Boylston St,D32005,Prudential Center - 101 Huntington Ave,C32007,4.234.992.828.230.050,-7.107.739.206.866.420,4.234.652.003.998.410,-7.108.065.776.545.110,member
3,AB67BC6000A4D4CE,classic_bike,11.08.2024 13:14,11.08.2024 13:21,Ink Block - Harrison Ave at Herald St,C32025,Massachusetts Ave at Columbus Ave,C32004,42.345.901,-71.063.187,42.340.835,-710.816.197,member
4,C0B1FA5CE04B942F,electric_bike,12.08.2024 10:43,12.08.2024 11:59,Copley Square - Dartmouth St at Boylston St,D32005,Prudential Center - 101 Huntington Ave,C32007,4.234.992.828.230.050,-7.107.739.206.866.420,4.234.652.003.998.410,-7.108.065.776.545.110,member
...,...,...,...,...,...,...,...,...,...,...,...,...,...
538257,7C21E3D56C7059C4,electric_bike,01.08.2024 17:01,2024-08-01 17:08:40.036,Nashua Street at Red Auerbach Way,A32025,CambridgeSide Galleria - CambridgeSide PL at L...,M32019,42.365.673,-71.064.263,42.367.074.071.490.900,-7.107.679.277.658.460,member
538258,EF474D39EC6642EE,electric_bike,05.08.2024 12:23,05.08.2024 12:29,Commonwealth Ave at Agganis Way,A32002,Silber Way,D32032,4.235.169.201.885.970,-7.111.903.488.636.010,4.234.949.599.514.000,-7.110.057.592.391.960,member
538259,702A2E66C5554807,classic_bike,20.08.2024 21:32,20.08.2024 21:57,Nashua Street at Red Auerbach Way,A32025,Somerville High School & Central Library,S32048,42.365.673,-71.064.263,423.864,-7.109.601,member
538260,413E52E8A16DF605,classic_bike,26.08.2024 09:36,26.08.2024 09:47,Nashua Street at Red Auerbach Way,A32025,75 Binney St,M32064,42.365.673,-71.064.263,4.236.550.728.505.650,-710.801.375.997.653,member
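
The coordinate columns in the preview above appear to have lost their decimal point and gained grouping dots instead (for example `42.380.857` or `-710.801.375.997.653`). One possible repair is sketched below. It rests on the assumption that every coordinate in this dataset has exactly two digits before the decimal point (latitude around 42, longitude around -71), which should be verified before use. The helper name `repair_coordinate` is hypothetical.

```python
import pandas as pd


def repair_coordinate(value) -> float:
    """Rebuild a float from a string such as '42.380.857' (hypothetical helper).

    Assumption: the true value has exactly two digits before the decimal point.
    """
    text = str(value).strip()
    sign = -1.0 if text.startswith("-") else 1.0
    digits = text.lstrip("-").replace(".", "")
    if len(digits) < 3 or not digits.isdigit():
        return float("nan")  # leave unparseable values as missing
    return sign * float(f"{digits[:2]}.{digits[2:]}")


# Values copied from the preview above
raw = pd.Series(["42.380.857", "-71.070.629", "-710.801.375.997.653", "423.864"])
repaired = raw.map(repair_coordinate)
print(repaired)
```

Applied to the real data, this would be mapped over `start_lat`, `start_lng`, `end_lat` and `end_lng`, followed by a range check on the results.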


In [2]:
import pandas as pd

# Read CSV
csv_path = r"C:\Users\hanac\PycharmProjects\Data-Science-and-Machine-Learning\202408-bluebikes-tripdata.csv"
df = pd.read_csv(csv_path, engine="python", sep=None, on_bad_lines="warn")

print("Original dataset:")
print(df.shape)

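# Convert datetime columns. Many values are day-first ("01.08.2024 07:10"), but some
# are ISO-style ("2024-08-01 17:08:40.036"); with errors='coerce', values that do
# not match the inferred format become NaT and are dropped in the next step.
df['started_at'] = pd.to_datetime(df['started_at'], dayfirst=True, errors='coerce')
df['ended_at'] = pd.to_datetime(df['ended_at'], dayfirst=True, errors='coerce')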
print("After datetime conversion:")
print(df.shape)

# Drop rows with missing datetime values
df = df.dropna(subset=['started_at', 'ended_at'])
print("After deletion of NaT times:")
print(df.shape)

# Calculate trip duration in seconds
df['tripduration'] = (df['ended_at'] - df['started_at']).dt.total_seconds()

# Keep only positive durations
df = df[df['tripduration'] > 0]
print("After deletion of 0 or negative trip durations:")
print(df.shape)

# total_seconds() already returns float64; this conversion is only a safeguard
df['tripduration'] = pd.to_numeric(df['tripduration'], errors='coerce')

# Create duration in minutes
df['duration_min'] = df['tripduration'] / 60

# Print data types
print("\nColumn types of 'tripduration' and 'duration_min':")
print(df[['tripduration', 'duration_min']].dtypes)

# Print examples
print("\nExamples:")
print(df[['started_at', 'ended_at', 'tripduration', 'duration_min']].head())


Original dataset:
(538262, 13)
After datetime conversion:
(538262, 13)
After deletion of NaT times:
(435181, 13)
After deletion of 0 or negative trip durations:
(435181, 14)

Column types of 'tripduration' and 'duration_min':
tripduration    float64
duration_min    float64
dtype: object

Examples:
           started_at            ended_at  tripduration  duration_min
0 2024-08-01 07:10:00 2024-08-01 07:26:00         960.0          16.0
1 2024-08-12 15:43:00 2024-08-12 15:46:00         180.0           3.0
2 2024-08-28 21:06:00 2024-08-28 21:10:00         240.0           4.0
3 2024-08-11 13:14:00 2024-08-11 13:21:00         420.0           7.0
4 2024-08-12 10:43:00 2024-08-12 11:59:00        4560.0          76.0
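
The step "After deletion of NaT times" removes 103,081 rows (538,262 to 435,181). The data preview shows that `ended_at` mixes two formats (`01.08.2024 07:26` and `2024-08-01 17:08:40.036`), so a likely cause is that values in the second format are coerced to `NaT`. A minimal sketch of one way to keep those rows is to parse each format explicitly and combine the results. The helper name `parse_mixed_timestamps` is hypothetical, and the two format strings are assumptions based only on the rows visible in the preview.

```python
import pandas as pd


def parse_mixed_timestamps(values: pd.Series) -> pd.Series:
    """Parse a column that mixes day-first and ISO-style timestamps (hypothetical helper)."""
    text = values.astype(str).str.strip()
    # Format 1: day-first without seconds, e.g. "01.08.2024 17:01"
    day_first = pd.to_datetime(text, format="%d.%m.%Y %H:%M", errors="coerce")
    # Format 2: ISO-style with fractional seconds, e.g. "2024-08-01 17:08:40.036"
    iso_style = pd.to_datetime(text, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")
    # Use the day-first result where available, otherwise the ISO-style result
    return day_first.fillna(iso_style)


sample = pd.Series(["01.08.2024 17:01", "2024-08-01 17:08:40.036", "not a date"])
parsed = parse_mixed_timestamps(sample)
print(parsed)
```

Values that match neither format still become `NaT`, so counting `parsed.isna().sum()` before dropping rows shows how much data is really unusable.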


In [3]:

# Calendar features derived from the start time
df['weekday'] = df['started_at'].dt.weekday  # Monday=0 ... Sunday=6
df['hour'] = df['started_at'].dt.hour


def get_time_category(hour):
    """Map an hour (0-23) to a coarse time-of-day block."""
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Midday'
    elif 18 <= hour < 24:
        return 'Evening'
    else:
        return 'Night'

df['time_of_day'] = df['hour'].apply(get_time_category)
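
On a table of this size, a vectorized alternative to `apply` may be worth considering. The sketch below uses `pd.cut` with the same hour boundaries as `get_time_category` and English labels; it runs on a small made-up series of hours rather than on `df`.

```python
import pandas as pd

# Made-up hours covering every boundary of the four blocks
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# right=False makes each bin left-closed: [0, 6), [6, 12), [12, 18), [18, 24)
time_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Midday", "Evening"],
    right=False,
)
print(time_of_day.astype(str).tolist())
```

The result is a categorical column, which also keeps the blocks in a fixed order for plots and group-bys.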
