# Milestone 1 – Bluebikes Boston: Predicting Bike Demand at Stations

**Dataset:** [Bluebikes Trip Data – August 2024](https://s3.amazonaws.com/hubway-data/202408-bluebikes-tripdata.zip)
**Trips:** Approx. 530,000

---

## Project Goal

The goal of this project is to accurately predict the demand for bikes at individual stations throughout the day. Using historical usage data, we aim to identify both temporal and spatial patterns to:

- ensure better bike availability,
- detect shortages or surpluses in advance, and
- support data-driven rebalancing of the fleet.

The resulting models will help improve operational decision-making, for example by optimizing rebalancing routes or managing bike supply at high-demand stations.
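
To make the prediction target concrete, one possible definition of "demand" is the number of trips started at a station within each hour. The sketch below illustrates that aggregation on a few made-up rows; the toy values and the helper column `hour_start` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy trips; column names follow the planned features, values are made up.
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2024-08-01 08:05:00",
        "2024-08-01 08:40:00",
        "2024-08-01 09:10:00",
        "2024-08-01 08:15:00",
    ]),
    "start_station_id": ["A", "A", "A", "B"],
})

# Demand = number of trips started per station per hour.
demand = (
    trips
    .assign(hour_start=trips["started_at"].dt.floor("h"))  # hypothetical helper column
    .groupby(["start_station_id", "hour_start"])
    .size()
    .reset_index(name="demand")
)
print(demand)
```

Other definitions, such as net flow (departures minus arrivals), would fit the rebalancing use case as well; hourly departures are simply the most direct starting point.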

---

## Optional Extension: Weather Data

Since weather conditions strongly influence mobility behavior, we will optionally create a model that includes weather features such as temperature and precipitation. The goal is to evaluate whether including these external factors improves the model’s prediction accuracy.
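
A minimal sketch of how weather features could be attached, assuming demand has already been aggregated per hour and that an hourly weather table is available. The column names `hour_start`, `temperature_c`, and `precipitation_mm` are hypothetical, and all values are made up.

```python
import pandas as pd

# Hypothetical hourly demand for one station (values are made up).
demand = pd.DataFrame({
    "hour_start": pd.to_datetime(
        ["2024-08-01 08:00", "2024-08-01 09:00", "2024-08-01 10:00"]
    ),
    "demand": [12, 7, 3],
})

# Hypothetical hourly weather observations; column names are assumptions.
weather = pd.DataFrame({
    "hour_start": pd.to_datetime(["2024-08-01 08:00", "2024-08-01 09:00"]),
    "temperature_c": [24.5, 25.1],
    "precipitation_mm": [0.0, 1.2],
})

# A left join keeps every demand row, even when no weather reading exists.
merged = demand.merge(weather, on="hour_start", how="left")
print(merged)
```

Hours without a weather reading end up with missing values, so a real pipeline would need to decide whether to fill or drop them before training.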

---

## Planned Features

- Start time (`started_at`)
- Day of the week (`weekday`)
- Time of day (as hour or time blocks) (`time_of_day`)
- Start station (`start_station_id`)
- User type (`member_casual`)
- Optional: Weather data (temperature, precipitation)
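
Several of the features listed above are categorical, and most models need them encoded numerically. The following is a minimal sketch using one-hot encoding on made-up rows; whether one-hot encoding is the right choice depends on the model that is eventually selected, so treat this as one option rather than the chosen approach.

```python
import pandas as pd

# Toy feature rows; the values are illustrative only.
features = pd.DataFrame({
    "weekday": [0, 5, 6],
    "time_of_day": ["Morning", "Evening", "Night"],
    "member_casual": ["member", "casual", "member"],
})

# One indicator column per category value.
encoded = pd.get_dummies(
    features, columns=["weekday", "time_of_day", "member_casual"]
)
print(encoded.columns.tolist())
```

A high-cardinality feature such as `start_station_id` would produce one column per station under this scheme, which may call for a different encoding.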


In [None]:
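import pandas as pd

# Local path to the extracted August 2024 trip data
csv_path = r"C:\Users\hanac\PycharmProjects\Data-Science-and-Machine-Learning\202408-bluebikes-tripdata.csv"

# sep=None lets the Python engine detect the delimiter; malformed rows are skipped
df = pd.read_csv(
    csv_path,
    engine="python",
    sep=None,
    on_bad_lines="skip",
)

print(df.shape)
df.head()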


In [None]:
import pandas as pd

# Read CSV
csv_path = r"C:\Users\hanac\PycharmProjects\Data-Science-and-Machine-Learning\202408-bluebikes-tripdata.csv"
df = pd.read_csv(csv_path, engine="python", sep=None, on_bad_lines="warn")

print("Original dataset:")
print(df.shape)

# Convert datetime columns
df['started_at'] = pd.to_datetime(df['started_at'], dayfirst=True, errors='coerce')
df['ended_at'] = pd.to_datetime(df['ended_at'], dayfirst=True, errors='coerce')
print("After datetime conversion:")
print(df.shape)

# Drop rows with missing datetime values
df = df.dropna(subset=['started_at', 'ended_at'])
print("After deletion of NaT times:")
print(df.shape)

# Calculate trip duration in seconds
df['tripduration'] = (df['ended_at'] - df['started_at']).dt.total_seconds()

# Keep only positive durations
df = df[df['tripduration'] > 0]
print("After deletion of 0 or negative trip durations:")
print(df.shape)

# total_seconds() already returns floats; to_numeric is kept only as a defensive check
df['tripduration'] = pd.to_numeric(df['tripduration'], errors='coerce')

# Create duration in minutes
df['duration_min'] = df['tripduration'] / 60

# Print data types
print("\nColumn types of 'tripduration' and 'duration_min':")
print(df[['tripduration', 'duration_min']].dtypes)

# Print examples
print("\nExamples:")
print(df[['started_at', 'ended_at', 'tripduration', 'duration_min']].head())


In [None]:

# Derive calendar features from the start timestamp (weekday: Monday=0 ... Sunday=6)
df['weekday'] = df['started_at'].dt.weekday
df['hour'] = df['started_at'].dt.hour


def get_time_category(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Midday'
    elif 18 <= hour < 24:
        return 'Evening'
    else:
        return 'Night'

df['time_of_day'] = df['hour'].apply(get_time_category)
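

The hour-to-block mapping can also be expressed without a Python-level function call per row. The following is a sketch of a vectorized alternative using `pd.cut` with the same block boundaries; the toy `hours` series stands in for `df['hour']`.

```python
import pandas as pd

# Toy hours covering each block boundary.
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23], name="hour")

# right=False makes the bins left-closed: [0, 6), [6, 12), [12, 18), [18, 24).
time_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Midday", "Evening"],
    right=False,
)
print(time_of_day.tolist())
```

The result is a categorical column, which also records the set of possible labels; on a dataset of this size the speed difference to `apply` is likely modest, so this is mainly a readability choice.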
