# Module 06: Datetime Feature Engineering

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 60 minutes  
**Prerequisites**: Module 05 (Binning and Discretization)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Extract meaningful datetime components (year, month, day, hour, dayofweek) from timestamps
2. Create cyclical features using sine and cosine transformations for periodic patterns
3. Engineer time-since-event features to capture temporal relationships
4. Create binary time-based features (is_weekend, is_holiday) for categorical patterns
5. Demonstrate how datetime features improve e-commerce sales prediction models

## 1. Why Datetime Features Matter

**Temporal patterns are everywhere in real-world data**:
- E-commerce sales peak on weekends and holidays
- Traffic congestion follows hourly and daily patterns
- Energy consumption varies by season and time of day
- Customer behavior changes over time

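**Problem**: Most machine learning models cannot use a raw timestamp directly. At best they see it as one ever-increasing number, which hides recurring daily, weekly, and yearly patterns.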

**Solution**: Extract and engineer features that capture temporal patterns:
- **Component extraction**: Year, month, day, hour, minute
- **Cyclical encoding**: Handle periodic patterns (12 months, 24 hours)
- **Time differences**: Days since event, time between events
- **Binary indicators**: Weekend, holiday, business hours

In this module, we'll use an **e-commerce sales dataset** to demonstrate these techniques.
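
Before turning to that dataset, here is a minimal sketch of the problem on four made-up timestamps (two Mondays at 09:00 and two Saturdays at 20:00, chosen purely for illustration). It compares what a model would see if the timestamp were passed in as a raw number with two extracted components.

```python
import pandas as pd

# Four illustrative timestamps: two Mondays at 09:00 and two Saturdays at 20:00
ts = pd.Series(pd.to_datetime([
    '2023-01-02 09:00', '2023-01-07 20:00',
    '2023-01-09 09:00', '2023-01-14 20:00',
]))

# Raw representation: seconds since 1970-01-01, which only ever increases
raw_seconds = (ts - pd.Timestamp('1970-01-01')).dt.total_seconds()

# Extracted components make the repeated structure explicit
features = pd.DataFrame({
    'raw_seconds': raw_seconds,
    'hour': ts.dt.hour,
    'dayofweek': ts.dt.dayofweek,  # 0=Monday, 6=Sunday
})
print(features)
```

The `raw_seconds` column never repeats, while `hour` and `dayofweek` take the same values for rows that share a time of day or a weekday.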

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("✓ Setup complete!")

## 3. Create E-commerce Sales Dataset

We'll create a realistic e-commerce dataset with temporal patterns:
- Higher sales on weekends
- Peak hours during lunch (12-2pm) and evening (6-9pm)
- Seasonal variations
- Holiday spikes

In [None]:
# Generate 2 years of hourly e-commerce data
start_date = pd.Timestamp('2022-01-01')
end_date = pd.Timestamp('2023-12-31 23:00:00')
date_range = pd.date_range(start=start_date, end=end_date, freq='h')  # 'H' is deprecated since pandas 2.2

# Create base dataframe
ecommerce_data = pd.DataFrame({
    'order_timestamp': date_range
})

# Base sales amount
base_sales = 100

# Add temporal patterns
ecommerce_data['hour'] = ecommerce_data['order_timestamp'].dt.hour
ecommerce_data['dayofweek'] = ecommerce_data['order_timestamp'].dt.dayofweek
ecommerce_data['month'] = ecommerce_data['order_timestamp'].dt.month
ecommerce_data['is_weekend'] = ecommerce_data['dayofweek'].isin([5, 6]).astype(int)

# Weekend boost (30% higher on weekends)
weekend_multiplier = 1 + (0.3 * ecommerce_data['is_weekend'])

# Hour pattern (peak at lunch and evening)
hour_boost = np.where(
    ecommerce_data['hour'].between(12, 14) | ecommerce_data['hour'].between(18, 21),
    1.4,  # 40% boost during peak hours
    1.0
)

# Seasonal pattern (higher in Nov-Dec holiday season)
month_boost = np.where(
    ecommerce_data['month'].isin([11, 12]),
    1.5,  # 50% boost in holiday season
    1.0
)

# Combine all patterns with some random noise
ecommerce_data['sales_amount'] = (
    base_sales * 
    weekend_multiplier * 
    hour_boost * 
    month_boost * 
    np.random.uniform(0.8, 1.2, len(ecommerce_data))
).round(2)

# Keep only timestamp and sales for realistic scenario
ecommerce_data = ecommerce_data[['order_timestamp', 'sales_amount']]

print(f"Created e-commerce dataset with {len(ecommerce_data):,} hourly records")
print(f"Date range: {ecommerce_data['order_timestamp'].min()} to {ecommerce_data['order_timestamp'].max()}")
print(f"\nFirst few orders:")
ecommerce_data.head(10)

In [None]:
# Visualize sales over time
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Full timeline
axes[0].plot(ecommerce_data['order_timestamp'], ecommerce_data['sales_amount'], alpha=0.6, linewidth=0.5)
axes[0].set_title('E-commerce Sales Over Time (2 Years)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Sales Amount ($)')
axes[0].grid(True, alpha=0.3)

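# Zoom into one full week (Sunday 2023-01-01 through Saturday 2023-01-07) to see patterns
week_mask = (ecommerce_data['order_timestamp'] >= '2023-01-01') & (ecommerce_data['order_timestamp'] < '2023-01-08')
one_week = ecommerce_data[week_mask]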
axes[1].plot(one_week['order_timestamp'], one_week['sales_amount'], marker='o', linewidth=2)
axes[1].set_title('Sales Patterns in One Week (Notice weekend spikes and daily cycles)', 
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Sales Amount ($)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice the patterns:")
print("- Daily fluctuations (hour-of-day pattern)")
print("- Weekly fluctuations (weekend effect)")
print("- Seasonal trends (holiday season spikes)")

## 4. Technique 1: Extract Datetime Components

**Extract basic time units** from timestamps using pandas `.dt` accessor:
- Year, month, day
- Hour, minute, second
- Day of week (0=Monday, 6=Sunday)
- Day of year, week of year
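
The `.dt` accessor is only available on datetime-typed columns, and timestamps read from CSV files usually arrive as strings. The sketch below shows one way to convert them first. The column name `order_time`, the sample values, and the timestamp format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw data: timestamps stored as strings, one of them invalid
raw = pd.DataFrame({
    'order_time': ['2023-03-04 14:30:00', '2023-03-05 08:15:00', 'not a date']
})

# errors='coerce' turns unparseable values into NaT instead of raising
raw['order_time'] = pd.to_datetime(
    raw['order_time'], format='%Y-%m-%d %H:%M:%S', errors='coerce'
)

# The .dt accessor now works; rows with NaT yield missing components
raw['hour'] = raw['order_time'].dt.hour
raw['dayofweek'] = raw['order_time'].dt.dayofweek
raw['is_missing_time'] = raw['order_time'].isna().astype(int)
print(raw)
```

Keeping an explicit `is_missing_time` flag is one possible design choice: it lets a model distinguish "no timestamp" from any real hour or weekday.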

In [None]:
# Create a copy for feature engineering
df = ecommerce_data.copy()

# Extract datetime components
df['year'] = df['order_timestamp'].dt.year
df['month'] = df['order_timestamp'].dt.month
df['day'] = df['order_timestamp'].dt.day
df['hour'] = df['order_timestamp'].dt.hour
df['dayofweek'] = df['order_timestamp'].dt.dayofweek  # 0=Monday, 6=Sunday
df['quarter'] = df['order_timestamp'].dt.quarter
df['dayofyear'] = df['order_timestamp'].dt.dayofyear
df['weekofyear'] = df['order_timestamp'].dt.isocalendar().week

print("Extracted datetime components:")
df[['order_timestamp', 'year', 'month', 'day', 'hour', 'dayofweek', 'quarter']].head(10)

In [None]:
# Analyze patterns by different time components
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales by hour of day
hourly_avg = df.groupby('hour')['sales_amount'].mean()
axes[0, 0].bar(hourly_avg.index, hourly_avg.values, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Average Sales by Hour of Day', fontweight='bold')
axes[0, 0].set_xlabel('Hour')
axes[0, 0].set_ylabel('Average Sales ($)')
axes[0, 0].grid(True, alpha=0.3, axis='y')

# Sales by day of week
dow_avg = df.groupby('dayofweek')['sales_amount'].mean()
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[0, 1].bar(dow_avg.index, dow_avg.values, color='lightcoral', edgecolor='black')
axes[0, 1].set_title('Average Sales by Day of Week', fontweight='bold')
axes[0, 1].set_xlabel('Day of Week')
axes[0, 1].set_ylabel('Average Sales ($)')
axes[0, 1].set_xticks(range(7))
axes[0, 1].set_xticklabels(dow_labels)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Sales by month
monthly_avg = df.groupby('month')['sales_amount'].mean()
axes[1, 0].bar(monthly_avg.index, monthly_avg.values, color='lightgreen', edgecolor='black')
axes[1, 0].set_title('Average Sales by Month', fontweight='bold')
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Average Sales ($)')
axes[1, 0].set_xticks(range(1, 13))
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Sales by quarter
quarterly_avg = df.groupby('quarter')['sales_amount'].mean()
axes[1, 1].bar(quarterly_avg.index, quarterly_avg.values, color='plum', edgecolor='black')
axes[1, 1].set_title('Average Sales by Quarter', fontweight='bold')
axes[1, 1].set_xlabel('Quarter')
axes[1, 1].set_ylabel('Average Sales ($)')
axes[1, 1].set_xticks(range(1, 5))
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Key insights from datetime components:")
print(f"- Peak sales hours: {hourly_avg.nlargest(3).index.tolist()}")
print(f"- Highest sales day: {dow_labels[dow_avg.idxmax()]}")
print(f"- Best performing months: {monthly_avg.nlargest(2).index.tolist()}")

## 5. Technique 2: Cyclical Features (Sine/Cosine Encoding)

**Problem**: Month=1 (January) and Month=12 (December) are actually adjacent, but the model sees them as far apart!

**Solution**: Use sine and cosine transformations to encode cyclical nature:
- Captures that December (12) and January (1) are close
- Preserves periodic patterns
- Two features (sin and cos) together identify each position on the cycle uniquely; either one alone maps two different positions to the same value

**Formula** (`period` is the length of the cycle: 12 for months, 24 for hours, 7 for days of the week):
```python
sin_feature = sin(2 * π * value / period)
cos_feature = cos(2 * π * value / period)
```
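
As a worked check of the formula, the sketch below measures how far apart December and January are before and after encoding. The helper `encode_cyclical` is a hypothetical name introduced here; its `period` argument is the number of positions in the cycle (12 for months, 24 for hours).

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a value on a cycle of length `period` to a point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Raw months: December (12) and January (1) look 11 units apart
raw_gap = abs(12 - 1)

# Encoded months: December-January is as close as any other adjacent pair
dec_jan = np.linalg.norm(encode_cyclical(12, 12) - encode_cyclical(1, 12))
jun_jul = np.linalg.norm(encode_cyclical(6, 12) - encode_cyclical(7, 12))

# sin alone is ambiguous: hours 3 and 9 share a sine value but differ in cosine
h3, h9 = encode_cyclical(3, 24), encode_cyclical(9, 24)

print(f"raw gap: {raw_gap}, encoded Dec-Jan: {dec_jan:.3f}, encoded Jun-Jul: {jun_jul:.3f}")
print(f"hour 3 -> {h3.round(3)}, hour 9 -> {h9.round(3)}")
```

The two encoded distances are equal, which is the property the raw month number lacks. The hour example shows why both sine and cosine are kept.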

In [None]:
# Create cyclical features for month (12 months cycle)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Create cyclical features for hour (24 hour cycle)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Create cyclical features for day of week (7 day cycle)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

print("Cyclical features created:")
df[['month', 'month_sin', 'month_cos', 'hour', 'hour_sin', 'hour_cos']].head(10)

In [None]:
# Visualize cyclical encoding
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Month cyclical encoding
months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

axes[0].plot(months, month_sin, marker='o', label='sin(month)', linewidth=2)
axes[0].plot(months, month_cos, marker='s', label='cos(month)', linewidth=2)
axes[0].set_title('Cyclical Encoding of Months', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Encoded Value')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(months)

# Circular representation
angles = 2 * np.pi * months / 12
axes[1].remove()  # drop the Cartesian axes created by plt.subplots before adding a polar one
axes[1] = fig.add_subplot(1, 2, 2, projection='polar')
axes[1].plot(angles, np.ones(len(months)), 'o', markersize=10)
axes[1].set_title('Months on Unit Circle\n(Shows cyclical nature)', fontsize=12, fontweight='bold')
axes[1].set_xticks(angles)
axes[1].set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

plt.tight_layout()
plt.show()

print("Benefits of cyclical encoding:")
print("- January and December are now close in feature space")
print("- Preserves periodic patterns")
print("- Works for any cyclical variable (hours, days, months, seasons)")

## 6. Technique 3: Time-Since-Event Features

**Capture how long ago something happened**:
- Days since last purchase
- Hours since campaign launch
- Weeks since account creation

These features help models understand **recency effects**.
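
The sales dataset in this module has no customer identifier, so the first bullet ("days since last purchase") cannot be built from it directly. The sketch below illustrates the idea on a small, hypothetical purchase log; the column names and dates are made up for the example.

```python
import pandas as pd

# Hypothetical purchase log: two customers, purchases in arbitrary order
purchases = pd.DataFrame({
    'customer_id': [1, 2, 1, 1, 2],
    'purchase_date': pd.to_datetime([
        '2023-01-01', '2023-01-03', '2023-01-11', '2023-02-10', '2023-01-04',
    ]),
})

# Sort so that each customer's purchases are in chronological order
purchases = purchases.sort_values(['customer_id', 'purchase_date']).reset_index(drop=True)

# Gap to the same customer's previous purchase; NaN marks a first purchase
purchases['days_since_prev_purchase'] = (
    purchases.groupby('customer_id')['purchase_date'].diff().dt.days
)
print(purchases)
```

Because each gap is computed only from earlier rows of the same customer, the feature uses no information from the future of the row it describes.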

In [None]:
# Reference date: beginning of dataset
start_of_data = df['order_timestamp'].min()

# Calculate days since start of dataset
df['days_since_start'] = (df['order_timestamp'] - start_of_data).dt.total_seconds() / (24 * 3600)

# Calculate hours since start of dataset
df['hours_since_start'] = (df['order_timestamp'] - start_of_data).dt.total_seconds() / 3600

# Reference: beginning of each year (to capture yearly cycles)
df['year_start'] = pd.to_datetime(df['year'].astype(str) + '-01-01')
df['days_into_year'] = (df['order_timestamp'] - df['year_start']).dt.days

print("Time-since-event features:")
df[['order_timestamp', 'days_since_start', 'hours_since_start', 'days_into_year']].head(10)

In [None]:
# Visualize relationship between time elapsed and sales
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Sales vs days since start
sample = df.sample(1000, random_state=42)  # Sample for clearer visualization
axes[0].scatter(sample['days_since_start'], sample['sales_amount'], alpha=0.5, s=10)
axes[0].set_title('Sales vs Days Since Start', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Days Since Start')
axes[0].set_ylabel('Sales Amount ($)')
axes[0].grid(True, alpha=0.3)

# Sales vs days into year (shows annual pattern)
axes[1].scatter(sample['days_into_year'], sample['sales_amount'], alpha=0.5, s=10, color='orange')
axes[1].set_title('Sales vs Days Into Year (Shows Seasonal Pattern)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Days Into Year')
axes[1].set_ylabel('Sales Amount ($)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice the holiday season lift that starts around day 304 (November 1) and lasts through December!")

## 7. Technique 4: Binary Time Indicators

**Create binary (0/1) features** for categorical time patterns:
- is_weekend
- is_holiday
- is_business_hours
- is_peak_season

These are simple but powerful features!
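
The cell below builds every indicator in this list except `is_holiday`. One possible way to create it is sketched here with the holiday calendar that ships with pandas. Using the US federal calendar is an assumption made for illustration; the dataset in this module does not specify a country, and the three timestamps are made up.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Illustrative timestamps: Independence Day, the day after, and Christmas Day
orders = pd.DataFrame({'order_timestamp': pd.to_datetime([
    '2023-07-04 10:00', '2023-07-05 10:00', '2023-12-25 18:00',
])})

# Observed US federal holidays for the year, as a DatetimeIndex of midnights
holidays = USFederalHolidayCalendar().holidays(start='2023-01-01', end='2023-12-31')

# Drop the time of day before comparing against the holiday dates
orders['is_holiday'] = orders['order_timestamp'].dt.normalize().isin(holidays).astype(int)
print(orders)
```

The `normalize()` step matters: without it, a timestamp at 10:00 on a holiday would not match the holiday's midnight entry.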

In [None]:
# Is weekend (Saturday=5, Sunday=6)
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)

# Is business hours (9 AM - 5 PM): between() is inclusive, so hours 9-16 cover 09:00-16:59
df['is_business_hours'] = df['hour'].between(9, 16).astype(int)

# Is peak hours (hours 12-14 and 18-21 inclusive, matching the boost used to simulate the data)
df['is_peak_hours'] = (
    df['hour'].between(12, 14) | df['hour'].between(18, 21)
).astype(int)

# Is holiday season (November-December)
df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)

# Is summer (June-August)
df['is_summer'] = df['month'].isin([6, 7, 8]).astype(int)

# Is Q4 (often highest sales quarter)
df['is_q4'] = (df['quarter'] == 4).astype(int)

print("Binary time indicator features:")
df[['order_timestamp', 'is_weekend', 'is_business_hours', 
    'is_peak_hours', 'is_holiday_season']].head(20)

In [None]:
# Compare sales across binary indicators
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

binary_features = [
    ('is_weekend', 'Weekend vs Weekday'),
    ('is_peak_hours', 'Peak Hours vs Off-Peak'),
    ('is_holiday_season', 'Holiday Season vs Regular'),
    ('is_business_hours', 'Business Hours vs Off-Hours')
]

for idx, (feature, title) in enumerate(binary_features):
    row = idx // 2
    col = idx % 2
    
    comparison = df.groupby(feature)['sales_amount'].mean()
    labels = ['No', 'Yes']
    
    bars = axes[row, col].bar(labels, comparison.values, 
                               color=['lightblue', 'salmon'], 
                               edgecolor='black')
    axes[row, col].set_title(f'Average Sales: {title}', fontweight='bold')
    axes[row, col].set_ylabel('Average Sales ($)')
    axes[row, col].grid(True, alpha=0.3, axis='y')
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        axes[row, col].text(bar.get_x() + bar.get_width()/2., height,
                           f'${height:.2f}',
                           ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("Impact of binary features:")
for feature, title in binary_features:
    means = df.groupby(feature)['sales_amount'].mean()
    pct_increase = (means[1] - means[0]) / means[0] * 100
    print(f"- {title}: {pct_increase:+.1f}% difference")

## 8. Model Performance Comparison

Let's compare model performance with different datetime feature sets:
1. **Baseline**: No datetime features (predict average)
2. **Basic**: Just raw components (month, hour, dayofweek)
3. **Cyclical**: Add sine/cosine encoding
4. **Full**: All features including time-since and binary indicators
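
Because the records are ordered in time, evaluation should train on the past and test on the future. The next cell does this with a single cutoff date. A complementary option, sketched here on a toy array rather than on the sales data, is scikit-learn's `TimeSeriesSplit`, which produces several forward-looking splits.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for 12 time-ordered rows (row 0 is the oldest)
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training row precedes every test row, so no future data leaks in
    print(f"Fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```

Each fold extends the training window forward in time, which is closer to how a forecasting model would be retrained in practice than a random shuffle.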

In [None]:
# Split data (use last 3 months as test set for time series)
split_date = '2023-10-01'
train_data = df[df['order_timestamp'] < split_date].copy()
test_data = df[df['order_timestamp'] >= split_date].copy()

print(f"Training set: {len(train_data):,} records ({train_data['order_timestamp'].min()} to {train_data['order_timestamp'].max()})")
print(f"Test set: {len(test_data):,} records ({test_data['order_timestamp'].min()} to {test_data['order_timestamp'].max()})")

# Target variable
y_train = train_data['sales_amount']
y_test = test_data['sales_amount']

In [None]:
# Define feature sets
feature_sets = {
    'Baseline (No datetime features)': [],
    'Basic Components': ['year', 'month', 'day', 'hour', 'dayofweek', 'quarter'],
    'Basic + Cyclical': [
        'year', 'month', 'day', 'hour', 'dayofweek', 'quarter',
        'month_sin', 'month_cos', 'hour_sin', 'hour_cos', 
        'dayofweek_sin', 'dayofweek_cos'
    ],
    'Full Feature Set': [
        'year', 'month', 'day', 'hour', 'dayofweek', 'quarter',
        'month_sin', 'month_cos', 'hour_sin', 'hour_cos', 
        'dayofweek_sin', 'dayofweek_cos',
        'days_since_start', 'days_into_year',
        'is_weekend', 'is_business_hours', 'is_peak_hours', 
        'is_holiday_season', 'is_q4'
    ]
}

# Train and evaluate models
results = []

for name, features in feature_sets.items():
    if len(features) == 0:
        # Baseline: predict mean
        y_pred = np.full(len(y_test), y_train.mean())
    else:
        # Train model
        X_train = train_data[features]
        X_test = test_data[features]
        
        model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Feature Set': name,
        'Num Features': len(features),
        'RMSE': rmse,
        'MAE': mae,
        'R² Score': r2
    })

# Display results
results_df = pd.DataFrame(results)
print("\nModel Performance Comparison:")
print("="*80)
results_df

In [None]:
# Visualize improvement
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# RMSE comparison
axes[0].barh(results_df['Feature Set'], results_df['RMSE'], color='coral', edgecolor='black')
axes[0].set_xlabel('RMSE (Lower is Better)')
axes[0].set_title('Model Error by Feature Set', fontsize=12, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# R² comparison
axes[1].barh(results_df['Feature Set'], results_df['R² Score'], color='lightgreen', edgecolor='black')
axes[1].set_xlabel('R² Score (Higher is Better)')
axes[1].set_title('Model Performance by Feature Set', fontsize=12, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Calculate improvement
baseline_rmse = results_df.iloc[0]['RMSE']
best_rmse = results_df.iloc[-1]['RMSE']
improvement = (baseline_rmse - best_rmse) / baseline_rmse * 100

print(f"\n{'='*80}")
print(f"DATETIME FEATURE ENGINEERING IMPACT")
print(f"{'='*80}")
print(f"Error reduction: {improvement:.1f}%")
print(f"R² improvement: {results_df.iloc[0]['R² Score']:.3f} → {results_df.iloc[-1]['R² Score']:.3f}")
print(f"\nRMSE: {baseline_rmse:.2f} (mean baseline) → {best_rmse:.2f} (full feature set)")
print(f"{'='*80}")

## 9. Exercise Section

### Exercise 1: Traffic Volume Prediction

Create datetime features for predicting hourly traffic volume. Think about what patterns exist in traffic data.

In [None]:
# Exercise 1: Traffic volume dataset

# Generate hourly traffic data for 30 days
dates = pd.date_range('2024-01-01', periods=24*30, freq='h')  # 'H' is deprecated since pandas 2.2
traffic_data = pd.DataFrame({'timestamp': dates})

# Simulate traffic patterns
hour = traffic_data['timestamp'].dt.hour
dow = traffic_data['timestamp'].dt.dayofweek

# Rush hour pattern (7-9 AM, 5-7 PM on weekdays)
traffic_data['volume'] = (
    1000 +  # Base traffic
    500 * ((hour.between(7, 9) | hour.between(17, 19)) & (dow < 5)) +  # Rush hour
    200 * (dow >= 5) +  # Weekend boost
    np.random.normal(0, 100, len(traffic_data))  # Noise
)

print("Traffic dataset created:")
traffic_data.head()

# TODO: Create these datetime features:
# 1. Basic components (hour, dayofweek)
# 2. Cyclical encoding for hour
# 3. is_rush_hour (7-9 AM or 5-7 PM on weekdays)
# 4. is_weekday

# Your code here:


In [None]:
# Solution to Exercise 1

# 1. Basic components
traffic_data['hour'] = traffic_data['timestamp'].dt.hour
traffic_data['dayofweek'] = traffic_data['timestamp'].dt.dayofweek

# 2. Cyclical encoding for hour
traffic_data['hour_sin'] = np.sin(2 * np.pi * traffic_data['hour'] / 24)
traffic_data['hour_cos'] = np.cos(2 * np.pi * traffic_data['hour'] / 24)

# 3. is_rush_hour
traffic_data['is_rush_hour'] = (
    (traffic_data['hour'].between(7, 9) | traffic_data['hour'].between(17, 19)) & 
    (traffic_data['dayofweek'] < 5)
).astype(int)

# 4. is_weekday
traffic_data['is_weekday'] = (traffic_data['dayofweek'] < 5).astype(int)

print("Traffic features created:")
traffic_data[['timestamp', 'volume', 'hour', 'hour_sin', 'hour_cos', 
              'is_rush_hour', 'is_weekday']].head(10)

# Verify rush hour impact
print("\nAverage volume:")
print(f"Rush hour: {traffic_data[traffic_data['is_rush_hour']==1]['volume'].mean():.0f}")
print(f"Non-rush hour: {traffic_data[traffic_data['is_rush_hour']==0]['volume'].mean():.0f}")

### Exercise 2: Customer Subscription Prediction

Create time-since features to predict customer behavior based on account age.

In [None]:
# Exercise 2: Customer subscription data

# Simulate customer accounts created over 2 years
account_created = pd.date_range('2022-01-01', '2024-01-01', periods=1000)
current_date = pd.Timestamp('2024-03-01')

customer_data = pd.DataFrame({
    'customer_id': range(1000),
    'account_created_date': account_created,
    'last_purchase_date': account_created + pd.to_timedelta(np.random.randint(0, 365, 1000), unit='D')
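})

# Cap simulated purchases at current_date so that none fall in the future
customer_data['last_purchase_date'] = customer_data['last_purchase_date'].where(
    customer_data['last_purchase_date'] <= current_date, current_date
)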

print("Customer dataset:")
customer_data.head()

# TODO: Create these time-since features (using current_date as reference):
# 1. days_since_signup: Days between account creation and current date
# 2. days_since_last_purchase: Days between last purchase and current date
# 3. account_age_months: Account age in months
# 4. is_new_customer: 1 if account < 90 days old, 0 otherwise
# 5. is_dormant: 1 if last purchase > 180 days ago, 0 otherwise

# Your code here:


In [None]:
# Solution to Exercise 2

current_date = pd.Timestamp('2024-03-01')

# 1. Days since signup
customer_data['days_since_signup'] = (current_date - customer_data['account_created_date']).dt.days

# 2. Days since last purchase
customer_data['days_since_last_purchase'] = (current_date - customer_data['last_purchase_date']).dt.days

# 3. Account age in months
customer_data['account_age_months'] = customer_data['days_since_signup'] / 30.44  # Average days per month

# 4. Is new customer
customer_data['is_new_customer'] = (customer_data['days_since_signup'] < 90).astype(int)

# 5. Is dormant
customer_data['is_dormant'] = (customer_data['days_since_last_purchase'] > 180).astype(int)

print("Customer time-since features:")
customer_data.head(10)

print("\nCustomer segments:")
print(f"New customers: {customer_data['is_new_customer'].sum()} ({customer_data['is_new_customer'].mean()*100:.1f}%)")
print(f"Dormant customers: {customer_data['is_dormant'].sum()} ({customer_data['is_dormant'].mean()*100:.1f}%)")

### Exercise 3: Energy Consumption Prediction

Create a complete set of datetime features for predicting hourly energy consumption. Combine all techniques learned!

In [None]:
# Exercise 3: Energy consumption data

# Generate hourly energy data for 1 year
dates = pd.date_range('2023-01-01', '2023-12-31 23:00:00', freq='h')  # 'H' is deprecated since pandas 2.2
energy_data = pd.DataFrame({'timestamp': dates})

# Simulate energy patterns (higher in winter/summer, peak hours, weekday patterns)
hour = energy_data['timestamp'].dt.hour
month = energy_data['timestamp'].dt.month
dow = energy_data['timestamp'].dt.dayofweek

energy_data['consumption_kwh'] = (
    500 +  # Base consumption
    200 * ((month <= 2) | (month >= 11)) +  # Winter heating
    150 * (month.isin([6, 7, 8])) +  # Summer AC
    100 * (hour.between(18, 22)) +  # Evening peak
    -50 * (dow >= 5) +  # Lower on weekends
    np.random.normal(0, 50, len(energy_data))  # Noise
)

print("Energy dataset created:")
energy_data.head()

# TODO: Create a comprehensive datetime feature set:
# 1. Basic components (hour, dayofweek, month)
# 2. Cyclical features (hour_sin/cos, month_sin/cos)
# 3. Binary indicators (is_weekend, is_winter, is_summer, is_peak_hours)
# 4. Time-based (days_into_year)

# Your code here:


In [None]:
# Solution to Exercise 3

# 1. Basic components
energy_data['hour'] = energy_data['timestamp'].dt.hour
energy_data['dayofweek'] = energy_data['timestamp'].dt.dayofweek
energy_data['month'] = energy_data['timestamp'].dt.month
energy_data['quarter'] = energy_data['timestamp'].dt.quarter

# 2. Cyclical features
energy_data['hour_sin'] = np.sin(2 * np.pi * energy_data['hour'] / 24)
energy_data['hour_cos'] = np.cos(2 * np.pi * energy_data['hour'] / 24)
energy_data['month_sin'] = np.sin(2 * np.pi * energy_data['month'] / 12)
energy_data['month_cos'] = np.cos(2 * np.pi * energy_data['month'] / 12)

# 3. Binary indicators
energy_data['is_weekend'] = (energy_data['dayofweek'] >= 5).astype(int)
energy_data['is_winter'] = energy_data['month'].isin([11, 12, 1, 2]).astype(int)  # Nov-Feb, matching the simulated heating season
energy_data['is_summer'] = energy_data['month'].isin([6, 7, 8]).astype(int)
energy_data['is_peak_hours'] = energy_data['hour'].between(18, 22).astype(int)

# 4. Time-based features
year_start = pd.Timestamp('2023-01-01')
energy_data['days_into_year'] = (energy_data['timestamp'] - year_start).dt.days

print("Complete feature set for energy prediction:")
print(f"Total features created: {len(energy_data.columns) - 2}")  # Minus timestamp and consumption
energy_data.head()

# Quick model test
features = ['hour_sin', 'hour_cos', 'month_sin', 'month_cos', 
            'is_weekend', 'is_winter', 'is_summer', 'is_peak_hours']
X = energy_data[features]
y = energy_data['consumption_kwh']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)

print(f"\nModel R² Score with datetime features: {r2:.3f}")
print("A score close to 1 means the datetime features capture most of the simulated consumption pattern.")

## 10. Summary

### Key Takeaways

1. **Datetime features unlock temporal patterns** that models can't see in raw timestamps
   - Improved our e-commerce sales prediction over the mean-prediction baseline (see the RMSE and R² comparison in Section 8)
   - Essential for any time-series or temporal prediction task

2. **Four core techniques for datetime feature engineering**:
   - **Component extraction**: Year, month, day, hour, dayofweek, quarter
   - **Cyclical encoding**: Sin/cos transformations for periodic patterns
   - **Time-since features**: Days since event, hours elapsed, account age
   - **Binary indicators**: is_weekend, is_holiday, is_peak_hours

3. **Cyclical encoding is critical** for truly periodic variables:
   - Month 12 (December) and Month 1 (January) are adjacent!
   - Use both sin and cos to fully encode the cycle
   - Works for hours, days, months, seasons

4. **Domain knowledge drives feature selection**:
   - E-commerce: weekends, holidays, peak hours
   - Traffic: rush hour, weekday patterns
   - Energy: seasonal patterns, time of day

5. **Time-based features capture relationships**:
   - Recency effects (days since last event)
   - Account age and lifecycle stages
   - Temporal distance from important events
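
As a compact recap, the sketch below gathers one or two features of each kind into a single helper. The function name `add_datetime_features` and the demo data are hypothetical; this is a starting point to adapt, not a complete feature set.

```python
import numpy as np
import pandas as pd

def add_datetime_features(df, ts_col):
    """Return a copy of `df` with a few features of each kind derived from `ts_col`."""
    out = df.copy()
    ts = out[ts_col]

    # Component extraction
    out['hour'] = ts.dt.hour
    out['dayofweek'] = ts.dt.dayofweek
    out['month'] = ts.dt.month

    # Cyclical encoding (divide by the length of the cycle)
    out['hour_sin'] = np.sin(2 * np.pi * out['hour'] / 24)
    out['hour_cos'] = np.cos(2 * np.pi * out['hour'] / 24)

    # Time since a reference event (here: the earliest timestamp in the data)
    out['days_since_start'] = (ts - ts.min()).dt.total_seconds() / 86400

    # Binary indicator
    out['is_weekend'] = (out['dayofweek'] >= 5).astype(int)
    return out

# Demo: a Friday and a Saturday, two timestamps each
demo = pd.DataFrame({'ts': pd.to_datetime([
    '2024-01-05 00:00', '2024-01-05 12:00', '2024-01-06 00:00', '2024-01-06 12:00',
])})
result = add_datetime_features(demo, 'ts')
print(result)
```

When the helper is applied separately to training and test data, the reference date for `days_since_start` should be fixed once (for example from the training set) rather than recomputed with `ts.min()` on each subset.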

### When to Use Datetime Features

✅ **Use when**:
- Data has timestamp information
- Known periodic patterns exist (daily, weekly, seasonal)
- Predicting time-dependent outcomes
- Historical events matter (recency, frequency)

❌ **Skip when**:
- Data has no temporal component
- Time is irrelevant to prediction
- Very short time periods with no patterns

### Best Practices

1. **Start with exploratory analysis**: Plot data over time to identify patterns
2. **Use cyclical encoding** for truly periodic features (hours, months)
3. **Create domain-specific features**: Think about what matters in your domain
4. **Test incremental impact**: Add feature groups and measure improvement
5. **Handle time zones carefully**: Be consistent with UTC or local time
6. **Consider holidays**: Country-specific holidays often matter
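
To illustrate point 5, the sketch below shows how the same instant yields different hour and weekday features depending on the time zone. The event time and the choice of `America/New_York` as the customers' local zone are assumptions made for the example.

```python
import pandas as pd

# Hypothetical event logged in UTC without timezone information attached
events = pd.DataFrame({'ts_utc': pd.to_datetime(['2023-06-01 02:00:00'])})

# Attach UTC, then convert to the customers' local zone (assumed to be New York)
local = events['ts_utc'].dt.tz_localize('UTC').dt.tz_convert('America/New_York')

events['hour_utc'] = events['ts_utc'].dt.hour
events['hour_local'] = local.dt.hour
events['dayofweek_utc'] = events['ts_utc'].dt.dayofweek
events['dayofweek_local'] = local.dt.dayofweek
print(events)
```

In UTC this event falls early on a Thursday morning, while for a customer in New York it is a Wednesday evening. Features such as `is_peak_hours` should be derived from whichever clock actually drives the behavior being modeled.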

### What's Next?

**Module 07**: Text Feature Engineering - Learn TF-IDF, n-grams, and text vectorization for NLP tasks

### Additional Resources

- [Pandas datetime documentation](https://pandas.pydata.org/docs/user_guide/timeseries.html)
- [Feature Engineering for Time Series](https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/)
- [Cyclical Features Blog Post](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/)

---

**Congratulations!** You've completed Module 06. You now understand:
- How to extract datetime components from timestamps
- Why and how to create cyclical features with sin/cos
- How to engineer time-since-event features
- When to use binary time indicators
- The dramatic impact of datetime features on model performance

Ready to work with text data? Let's move to **Module 07: Text Feature Engineering**!