# Temporal/Time Feature Engineering

This notebook creates temporal and time-based features from the cleaned NYC taxi trip data.

**Features to create:**
- Basic time components (hour, day of week, month, day of month)
- Time-based flags (is_weekend, is_peak_hour, is_morning_rush, is_evening_rush, is_rush_hour)
- Time-of-day categories (morning/afternoon/evening/night)
- Cyclical encoding for hour and day of week (sin/cos transformations)
- Trip duration in minutes

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better-looking plots
plt.style.use('default')
sns.set_palette("husl")

## Load Cleaned Data

Load the dataset produced by the data cleaning phase.

In [2]:
# Load the cleaned data
df = pd.read_parquet("../../data/processed/processed_taxi_cleaned.parquet")

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

Dataset shape: (2451103, 22)

Columns: ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'Airport_fee', 'cbd_congestion_fee', 'trip_duration_min', 'has_congestion_fee']

First few rows:


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,cbd_congestion_fee,trip_duration_min,has_congestion_fee
0,2,2025-01-05 00:00:02,2025-01-05 00:17:47,1.0,8.66,1.0,N,138,43,1,...,0.5,12.85,6.94,1.0,64.24,0.0,1.75,0.0,17.75,0
1,2,2025-01-05 00:24:51,2025-01-05 00:50:27,1.0,9.73,1.0,N,138,61,1,...,0.5,10.01,0.0,1.0,60.06,0.0,1.75,0.0,25.6,0
2,2,2025-01-05 00:54:12,2025-01-05 01:17:58,1.0,10.33,1.0,N,138,62,1,...,0.5,10.0,0.0,1.0,60.75,0.0,1.75,0.0,23.766667,0
3,2,2025-01-05 00:10:19,2025-01-05 00:16:03,1.0,1.71,1.0,N,239,24,1,...,0.5,2.86,0.0,1.0,17.16,2.5,0.0,0.0,5.733333,0
4,2,2025-01-05 00:29:32,2025-01-05 00:37:11,1.0,1.8,1.0,N,236,143,1,...,0.5,3.14,0.0,1.0,18.84,2.5,0.0,0.0,7.65,0


## 1. Basic Time Components

Extract basic time components from the pickup datetime column. These are fundamental features that capture when trips occur.

In [3]:
# Ensure pickup datetime is in datetime format
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])

# Extract basic time components
df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour
df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.dayofweek  # 0=Monday, 6=Sunday
df['pickup_day_name'] = df['tpep_pickup_datetime'].dt.day_name()
df['pickup_month'] = df['tpep_pickup_datetime'].dt.month
df['pickup_day_of_month'] = df['tpep_pickup_datetime'].dt.day

# Display summary
print("Time Components Created:")
print(f"  - pickup_hour: {df['pickup_hour'].min()} to {df['pickup_hour'].max()}")
print(f"  - pickup_day_of_week: {df['pickup_day_of_week'].min()} to {df['pickup_day_of_week'].max()} (0=Monday, 6=Sunday)")
print(f"  - pickup_month: {df['pickup_month'].min()} to {df['pickup_month'].max()}")
print(f"  - pickup_day_of_month: {df['pickup_day_of_month'].min()} to {df['pickup_day_of_month'].max()}")

# Show distribution
print("\nTrips by Hour:")
print(df['pickup_hour'].value_counts().sort_index().head(10))

Time Components Created:
  - pickup_hour: 0 to 23
  - pickup_day_of_week: 0 to 6 (0=Monday, 6=Sunday)
  - pickup_month: 1 to 1
  - pickup_day_of_month: 5 to 31

Trips by Hour:
pickup_hour
0     54148
1     35068
2     22863
3     14452
4      9115
5     11887
6     29235
7     65784
8     94607
9    107736
Name: count, dtype: int64
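
As a quick sanity check of the `.dt` accessors used above, the sketch below applies them to a tiny hand-made frame. The three timestamps and the name `sample` are illustrative assumptions, not rows taken from the dataset; the point is only to confirm the 0=Monday, 6=Sunday convention on dates whose weekday is known.

```python
import pandas as pd

# Hypothetical three-row sample; the timestamps are illustrative only.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2025-01-05 00:00:02",  # a Sunday
        "2025-01-06 08:30:00",  # a Monday
        "2025-01-11 23:59:59",  # a Saturday
    ])
})

pickup = sample["tpep_pickup_datetime"]
sample["pickup_hour"] = pickup.dt.hour
sample["pickup_day_of_week"] = pickup.dt.dayofweek  # 0=Monday, 6=Sunday
sample["pickup_day_name"] = pickup.dt.day_name()

print(sample)
```

Sunday maps to 6 and Monday to 0, which matches the `pickup_day_of_week` range reported above.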


## 2. Time-Based Flags

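Create binary (0/1) flags for weekends, peak hours, and rush hours. These help identify patterns in trip behavior. As defined here, is_peak_hour and is_evening_rush use the same 17-19 hour window, so the two columns are identical.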

In [4]:
# Weekend flag (Saturday=5, Sunday=6)
df['is_weekend'] = (df['pickup_day_of_week'] >= 5).astype(int)

# Peak hours flag (pickup hours 17-19 inclusive, i.e. 17:00-19:59, based on EDA findings;
# same window as is_evening_rush below)
df['is_peak_hour'] = ((df['pickup_hour'] >= 17) & (df['pickup_hour'] <= 19)).astype(int)

# Rush hour flags (morning: 7-9, evening: 17-19)
df['is_morning_rush'] = ((df['pickup_hour'] >= 7) & (df['pickup_hour'] <= 9)).astype(int)
df['is_evening_rush'] = ((df['pickup_hour'] >= 17) & (df['pickup_hour'] <= 19)).astype(int)
df['is_rush_hour'] = (df['is_morning_rush'] | df['is_evening_rush']).astype(int)

# Display summary
print("Time-Based Flags Created:")
print(f"  - is_weekend: {df['is_weekend'].sum():,} trips ({df['is_weekend'].mean()*100:.1f}%)")
print(f"  - is_peak_hour: {df['is_peak_hour'].sum():,} trips ({df['is_peak_hour'].mean()*100:.1f}%)")
print(f"  - is_morning_rush: {df['is_morning_rush'].sum():,} trips ({df['is_morning_rush'].mean()*100:.1f}%)")
print(f"  - is_evening_rush: {df['is_evening_rush'].sum():,} trips ({df['is_evening_rush'].mean()*100:.1f}%)")
print(f"  - is_rush_hour: {df['is_rush_hour'].sum():,} trips ({df['is_rush_hour'].mean()*100:.1f}%)")

Time-Based Flags Created:
  - is_weekend: 599,574 trips (24.5%)
  - is_peak_hour: 522,188 trips (21.3%)
  - is_morning_rush: 268,127 trips (10.9%)
  - is_evening_rush: 522,188 trips (21.3%)
  - is_rush_hour: 790,315 trips (32.2%)
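
The identical counts for is_peak_hour and is_evening_rush follow directly from their shared definition. The sketch below reproduces the flag logic on a handful of hypothetical hour values (the `hours` series is an assumption for illustration) using `Series.between`, which is inclusive on both ends by default and is one possible alternative to the chained comparisons in the cell above.

```python
import pandas as pd

# Hypothetical pickup hours chosen to sit on and around the window edges.
hours = pd.Series([6, 7, 9, 10, 17, 19, 20], name="pickup_hour")

# Series.between includes both endpoints by default.
is_morning_rush = hours.between(7, 9).astype(int)
is_evening_rush = hours.between(17, 19).astype(int)
is_peak_hour = hours.between(17, 19).astype(int)

# Bitwise OR of two 0/1 integer columns is 1 when either flag is set.
is_rush_hour = (is_morning_rush | is_evening_rush).astype(int)

print(is_rush_hour.tolist())
print("peak == evening rush:", is_peak_hour.equals(is_evening_rush))
```

If the two flags are meant to capture different ideas, one of the windows would need to change; otherwise one of the columns could be dropped before modeling.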


## 3. Time-of-Day Categories

Categorize trips into time-of-day periods (morning, afternoon, evening, night) for easier pattern recognition.

In [5]:
# Define time-of-day categories
def get_time_of_day(hour):
    """Categorize hour into time-of-day periods"""
    if 5 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

df['time_of_day'] = df['pickup_hour'].apply(get_time_of_day)

# Display distribution
print("Time-of-Day Distribution:")
print(df['time_of_day'].value_counts())
print(f"\nPercentage breakdown:")
print(df['time_of_day'].value_counts(normalize=True) * 100)

Time-of-Day Distribution:
time_of_day
afternoon    758581
evening      663182
morning      549656
night        479684
Name: count, dtype: int64

Percentage breakdown:
time_of_day
afternoon    30.948557
evening      27.056472
morning      22.424843
night        19.570128
Name: proportion, dtype: float64


## 4. Cyclical Encoding

Convert hour and day of week to cyclical features using sine and cosine transformations. This helps models understand that hour 23 is close to hour 0, and Sunday is close to Monday.

In [6]:
# Cyclical encoding for hour (24-hour cycle)
df['hour_sin'] = np.sin(2 * np.pi * df['pickup_hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['pickup_hour'] / 24)

# Cyclical encoding for day of week (7-day cycle)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['pickup_day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['pickup_day_of_week'] / 7)

# Display sample values
print("Cyclical Encoding Sample:")
print(df[['pickup_hour', 'hour_sin', 'hour_cos', 'pickup_day_of_week', 'day_of_week_sin', 'day_of_week_cos']].head(10))

Cyclical Encoding Sample:
   pickup_hour  hour_sin  hour_cos  pickup_day_of_week  day_of_week_sin  \
0            0       0.0       1.0                   6        -0.781831   
1            0       0.0       1.0                   6        -0.781831   
2            0       0.0       1.0                   6        -0.781831   
3            0       0.0       1.0                   6        -0.781831   
4            0       0.0       1.0                   6        -0.781831   
5            0       0.0       1.0                   6        -0.781831   
6            0       0.0       1.0                   6        -0.781831   
7            0       0.0       1.0                   6        -0.781831   
8            0       0.0       1.0                   6        -0.781831   
9            0       0.0       1.0                   6        -0.781831   

   day_of_week_cos  
0          0.62349  
1          0.62349  
2          0.62349  
3          0.62349  
4          0.62349  
5          0.62349  
6          0.62349  
7          0.62349  
8          0.62349  
9          0.62349  
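
To make the "hour 23 is close to hour 0" claim concrete, the sketch below measures Euclidean distances between encoded hours. The helper `encode_hour` is a hypothetical name introduced for illustration; it applies the same sin/cos formula as the cell above.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


# Distance between two encoded hours is the chord length on the unit circle.
d_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))    # across midnight
d_11_12 = np.linalg.norm(encode_hour(11) - encode_hour(12))  # ordinary 1-hour step
d_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))    # opposite sides of the day

print(f"23 -> 0 : {d_23_0:.3f}")
print(f"11 -> 12: {d_11_12:.3f}")
print(f" 0 -> 12: {d_0_12:.3f}")
```

Any two hours one step apart are the same distance from each other, including the pair that straddles midnight, while midnight and noon sit at the maximum distance of 2. With the raw integer hour, 23 and 0 would be 23 units apart.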

## 5. Trip Duration

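Calculate trip duration in minutes from the pickup and dropoff timestamps. Trip duration is a key target variable for prediction. Note that the cleaned dataset already contains a `trip_duration_min` column, so `trip_duration_minutes` is a recomputation of the same quantity under a new name.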

In [7]:
# Ensure dropoff datetime is in datetime format
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Calculate trip duration in minutes
df['trip_duration_minutes'] = (
    (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
)

# Display statistics
print("Trip Duration Statistics:")
print(f"  Mean: {df['trip_duration_minutes'].mean():.2f} minutes")
print(f"  Median: {df['trip_duration_minutes'].median():.2f} minutes")
print(f"  Min: {df['trip_duration_minutes'].min():.2f} minutes")
print(f"  Max: {df['trip_duration_minutes'].max():.2f} minutes")
print(f"  Std: {df['trip_duration_minutes'].std():.2f} minutes")

Trip Duration Statistics:
  Mean: 14.09 minutes
  Median: 11.18 minutes
  Min: 1.00 minutes
  Max: 179.03 minutes
  Std: 10.70 minutes
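
Because the cleaned data already carries `trip_duration_min`, it may be worth confirming that the recomputed column agrees with it before keeping both. The sketch below does this on the first two rows shown in the data preview; the variable names `sample`, `recomputed`, and `matches` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# First two rows from the data preview above (timestamps and existing duration).
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2025-01-05 00:00:02", "2025-01-05 00:24:51"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2025-01-05 00:17:47", "2025-01-05 00:50:27"]
    ),
    "trip_duration_min": [17.75, 25.6],
})

# Same calculation as the cell above: timedelta -> seconds -> minutes.
recomputed = (
    sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Compare with a tolerance rather than ==, since both are floats.
matches = np.isclose(recomputed, sample["trip_duration_min"])
print("All durations match:", bool(matches.all()))
```

On the full frame, the same comparison between `df['trip_duration_minutes']` and `df['trip_duration_min']` would show whether the new column adds any information.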


## Summary of Created Features

Let's list all the temporal/time features we've created:

In [8]:
# List all temporal features created
temporal_features = [
    'pickup_hour',
    'pickup_day_of_week',
    'pickup_day_name',
    'pickup_month',
    'pickup_day_of_month',
    'is_weekend',
    'is_peak_hour',
    'is_morning_rush',
    'is_evening_rush',
    'is_rush_hour',
    'time_of_day',
    'hour_sin',
    'hour_cos',
    'day_of_week_sin',
    'day_of_week_cos',
    'trip_duration_minutes'
]

print(f"Total temporal features created: {len(temporal_features)}")
print("\nFeature List:")
for i, feat in enumerate(temporal_features, 1):
    print(f"  {i:2d}. {feat}")

# Display data types
print("\n\nFeature Data Types:")
print(df[temporal_features].dtypes)

Total temporal features created: 16

Feature List:
   1. pickup_hour
   2. pickup_day_of_week
   3. pickup_day_name
   4. pickup_month
   5. pickup_day_of_month
   6. is_weekend
   7. is_peak_hour
   8. is_morning_rush
   9. is_evening_rush
  10. is_rush_hour
  11. time_of_day
  12. hour_sin
  13. hour_cos
  14. day_of_week_sin
  15. day_of_week_cos
  16. trip_duration_minutes


Feature Data Types:
pickup_hour                int32
pickup_day_of_week         int32
pickup_day_name           object
pickup_month               int32
pickup_day_of_month        int32
is_weekend                 int64
is_peak_hour               int64
is_morning_rush            int64
is_evening_rush            int64
is_rush_hour               int64
time_of_day               object
hour_sin                 float64
hour_cos                 float64
day_of_week_sin          float64
day_of_week_cos          float64
trip_duration_minutes    float64
dtype: object
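
The two `object` columns hold a small fixed set of strings, so converting them to the pandas `category` dtype is one option for reducing memory on a frame with millions of rows. This is a sketch on hypothetical values, not a step the notebook performs, and whether it suits the downstream modeling code is an open assumption.

```python
import pandas as pd

# Hypothetical sample of the two string-valued temporal features.
sample = pd.DataFrame({
    "pickup_day_name": ["Sunday", "Monday", "Sunday", "Saturday"],
    "time_of_day": ["night", "morning", "night", "evening"],
})

# astype with a dict converts only the named columns.
as_category = sample.astype(
    {"pickup_day_name": "category", "time_of_day": "category"}
)

print(as_category.dtypes)
print(as_category["time_of_day"].cat.categories.tolist())
```

Categories inferred this way are sorted alphabetically and unordered; an explicit `pd.CategoricalDtype` would be needed if a morning-to-night ordering matters.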


## Save Enhanced Dataset

Save the dataset with all temporal features for use in modeling.

In [10]:
# Save the enhanced dataset
output_path = "../../data/processed/processed_taxi_with_temporal_features.parquet"
df.to_parquet(output_path, index=False)

print(f"Dataset saved to: {output_path}")
print(f"Final dataset shape: {df.shape}")
print(f"Total features: {len(df.columns)}")

Dataset saved to: ../../data/processed/processed_taxi_with_temporal_features.parquet
Final dataset shape: (2451103, 38)
Total features: 38
