## Steps
1. Download all raw data for a given year and month.
1. Apply the filter to each raw file before saving it.
1. Transform the saved raw data into time-series (TS) data.
1. Convert the TS data into features and targets.
1. Save the transformed data.


The main objective is to write utility functions for all of these steps so that we can reuse them later.
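
As a rough illustration of step 4, the sketch below turns a one-dimensional series into a feature matrix and a target vector with a sliding window. The function name `make_features_and_targets` and the parameter `n_lags` are hypothetical, chosen only for this example; the project's own utilities in `src.utils` may work differently.

```python
import numpy as np


def make_features_and_targets(series, n_lags):
    """Slide a window over a 1-D series.

    Each feature row holds `n_lags` consecutive values, and the target
    is the value that immediately follows that window.
    (Hypothetical helper, not the project's actual implementation.)
    """
    values = np.asarray(series, dtype=float)
    if len(values) <= n_lags:
        raise ValueError("series must be longer than n_lags")

    n_rows = len(values) - n_lags
    # Row i contains values[i], ..., values[i + n_lags - 1]
    features = np.stack([values[i : i + n_lags] for i in range(n_rows)])
    # The target for row i is values[i + n_lags]
    targets = values[n_lags:]
    return features, targets


# Made-up hourly ride counts, purely for demonstration
hourly_rides = [3, 5, 8, 13, 21, 34]
X, y = make_features_and_targets(hourly_rides, n_lags=3)
print(X)
print(y)
```

With six observations and a window of three, this yields three feature rows and three targets, one for each position the window can take.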

In [1]:
import sys
import os
from pathlib import Path

# Add the parent directory to sys.path
sys.path.append(str(Path(os.getcwd()).resolve().parent))

# Now you can import
from src.utils import create_final_features


In [6]:
# 05_prepare_features.py

# =======================
# 📚 IMPORTS
# =======================

import pandas as pd
import numpy as np
from pathlib import Path

# 🛠 Import utility functions
from src.utils import (
    create_final_features,
)

# =======================
# 📋 SET PATHS
# =======================

# ✅ Correct Input: 8-hour transformed monthly data
input_data_path = Path("../data/processed/feature_eng_all_id")

# ✅ Correct Output: final merged and feature-engineered data
final_output_path = Path("../data/processed/final_features")

# Make sure output folder exists
final_output_path.mkdir(parents=True, exist_ok=True)

# =======================
# 🛠 STEP 1: Create Final Features
# =======================

# This will:
# 1. Read all monthly 8-hour feature-engineered files
# 2. Transform timestamps
# 3. Create features (sin/cos hour, day of week, holiday/weekend, etc.)
# 4. Drop old lags and create fresh lag features
# 5. Save two files: rides_citibike_final_2024_with_lags.parquet and rides_citibike_final_2025_with_lags.parquet

print("🚀 Creating final features and merging year-wise...")

create_final_features(
    input_dir=input_data_path,
    output_dir=final_output_path
)

print("✅ Feature creation and yearly merge complete!")


🚀 Creating final features and merging year-wise...
📂 Scanning ..\data\processed\feature_eng_all_id for 8-hour monthly parquet files...
🔄 Reading citibike_features_targets_8hours_2024_01.parquet
⚠️ Skipping unknown year in citibike_features_targets_8hours_2024_01.parquet
🔄 Reading citibike_features_targets_8hours_2024_02.parquet
🔄 Reading citibike_features_targets_8hours_2024_03.parquet
🔄 Reading citibike_features_targets_8hours_2024_04.parquet
🔄 Reading citibike_features_targets_8hours_2024_05.parquet
🔄 Reading citibike_features_targets_8hours_2024_06.parquet
🔄 Reading citibike_features_targets_8hours_2024_07.parquet
🔄 Reading citibike_features_targets_8hours_2024_08.parquet
🔄 Reading citibike_features_targets_8hours_2024_09.parquet
🔄 Reading citibike_features_targets_8hours_2024_10.parquet
🔄 Reading citibike_features_targets_8hours_2024_11.parquet
🔄 Reading citibike_features_targets_8hours_2024_12.parquet
🔄 Reading citibike_features_targets_8hours_2025_01.parquet
🔄 Reading citibike_fe
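
The time-based features listed in the cell above (sin/cos hour, day of week, weekend flag) can be sketched as follows. This is a minimal example that assumes the hour is encoded on a 24-hour cycle and that Monday is day 0; `time_features` is a hypothetical helper, not the code inside `create_final_features`, and holiday detection is left out because the source does not say which calendar is used.

```python
import math
from datetime import datetime


def time_features(ts):
    """Derive cyclical and calendar features from a timestamp.

    Hypothetical helper for illustration only; holidays are not handled.
    """
    # Map the hour onto a circle so that 23:00 and 00:00 end up close together
    angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "day_of_week": ts.weekday(),           # Monday = 0, Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }


features = time_features(datetime(2024, 2, 6, 11))
print(features)
```

Encoding the hour as a sine/cosine pair avoids the artificial jump between hour 23 and hour 0 that a plain integer column would introduce.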

In [7]:
pd.read_parquet("../data/processed/final_features/rides_citibike_final_2024_with_lags.parquet").head()

| | hour_ts | start_station_name | start_station_id | ride_count | hour | hour_sin | hour_cos | day_of_week | is_holiday_or_weekend | month | ... | ride_count_lag_663 | ride_count_lag_664 | ride_count_lag_665 | ride_count_lag_666 | ride_count_lag_667 | ride_count_lag_668 | ride_count_lag_669 | ride_count_lag_670 | ride_count_lag_671 | ride_count_lag_672 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-06 11:00:00-05:00 | West St & Chambers St | 5329 | 18 | 11 | 0.258819 | -0.965926 | 1 | 0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2024-02-06 11:00:00-05:00 | Lafayette St & E 8 St | 5788 | 20 | 11 | 0.258819 | -0.965926 | 1 | 0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2024-02-06 12:00:00-05:00 | University Pl & E 14 St | 5905 | 25 | 12 | 1.224647e-16 | -1.0 | 1 | 0 | 2 | ... | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2024-02-06 12:00:00-05:00 | West St & Chambers St | 5329 | 19 | 12 | 1.224647e-16 | -1.0 | 1 | 0 | 2 | ... | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2024-02-06 12:00:00-05:00 | 8 Ave & W 31 St | 6450 | 28 | 12 | 1.224647e-16 | -1.0 | 1 | 0 | 2 | ... | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
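
The `ride_count_lag_*` columns above run up to lag 672, which is 28 days of hourly steps (28 × 24). One plausible way to build such columns is a per-station `groupby(...).shift(...)`, sketched below. This assumes lags are computed separately for each station; `add_lag_features` is a hypothetical helper and the demo data is made up, so it may not match what `create_final_features` does internally.

```python
import pandas as pd


def add_lag_features(df, n_lags, value_col="ride_count",
                     group_col="start_station_id", time_col="hour_ts"):
    """Add value_col_lag_1 ... value_col_lag_n columns per group.

    Hypothetical helper for illustration; rows without a full lag
    history are dropped.
    """
    out = df.sort_values([group_col, time_col]).copy()
    grouped = out.groupby(group_col)[value_col]

    lag_cols = []
    for lag in range(1, n_lags + 1):
        col = f"{value_col}_lag_{lag}"
        # shift(lag) looks `lag` rows back within the same station
        out[col] = grouped.shift(lag)
        lag_cols.append(col)

    return out.dropna(subset=lag_cols).reset_index(drop=True)


# Made-up data: two stations, four consecutive hours each
hours = pd.to_datetime([
    "2024-02-06 09:00", "2024-02-06 10:00",
    "2024-02-06 11:00", "2024-02-06 12:00",
])
demo = pd.DataFrame({
    "hour_ts": list(hours) * 2,
    "start_station_id": ["5329"] * 4 + ["5788"] * 4,
    "ride_count": [10, 12, 18, 19, 7, 9, 20, 22],
})

result = add_lag_features(demo, n_lags=2)
print(result)
```

For hundreds of lags, assigning columns one at a time can trigger pandas fragmentation warnings; collecting the shifted series in a list and joining them once with `pd.concat(..., axis=1)` is a common alternative.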
