# Week 5: Date & Time Operations - Part 1: Python DateTime Fundamentals
**Business Scenario**: NaijaCommerce Seasonal Analysis
**Data Source**: Brazilian Olist E-commerce Dataset

## Learning Objectives
- Convert strings to datetime objects using pandas and datetime libraries
- Extract meaningful date components for temporal analysis
- Handle missing and malformed date data
- Prepare temporal data for business analysis

## 📚 Import Required Libraries
Let's start by importing all the libraries we'll need for our temporal analysis.

In [None]:
# Core data manipulation and analysis
import pandas as pd
import numpy as np

# Date and time handling
import datetime as dt
from datetime import datetime, timedelta

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📦 All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print("🐍 Python datetime module loaded")

## 📂 Loading the Olist E-commerce Dataset
On Thursday we analyzed the Brazilian Olist e-commerce data with SQL. Here we generate a synthetic sample with the same column structure, so the notebook runs without downloading anything.

In [None]:
# Build the orders dataset
# Note: In a real scenario, you would download the Olist dataset from Kaggle.
# For this lesson, we create synthetic sample data that mirrors the SQL table structure.

# Fix the random seed so everyone gets the same sample
np.random.seed(42)

# Generate sample order data
n_orders = 10000
start_date = datetime(2016, 9, 1)
end_date = datetime(2018, 10, 31)

# Create sample dataset
orders_df = pd.DataFrame({
    'order_id': [f'order_{i:06d}' for i in range(n_orders)],
    'customer_id': [f'customer_{np.random.randint(1, 5000):06d}' for _ in range(n_orders)],
    'order_purchase_timestamp': pd.date_range(start=start_date, end=end_date, periods=n_orders),
    'order_status': np.random.choice(['delivered', 'shipped', 'cancelled'], n_orders, p=[0.8, 0.15, 0.05])
})

# Add delivery timestamps with realistic delays
delivery_delays = np.random.exponential(scale=7, size=n_orders)  # Average 7 days
orders_df['order_delivered_customer_date'] = orders_df['order_purchase_timestamp'] + pd.to_timedelta(delivery_delays, unit='D')

# Add approval timestamps (exponentially distributed delay, 1.5 days on average)
approval_delays = np.random.exponential(scale=1.5, size=n_orders)  # Average 1.5 days
orders_df['order_approved_at'] = orders_df['order_purchase_timestamp'] + pd.to_timedelta(approval_delays, unit='D')

# Set delivery dates to NaT (missing) for non-delivered orders
mask = orders_df['order_status'] != 'delivered'
orders_df.loc[mask, 'order_delivered_customer_date'] = pd.NaT

print(f"📊 Created sample dataset with {len(orders_df):,} orders")
print(f"📅 Date range: {orders_df['order_purchase_timestamp'].min()} to {orders_df['order_purchase_timestamp'].max()}")
orders_df.head()
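
If you do download the real dataset, the timestamp columns arrive as text and need to be parsed on load. The sketch below shows one way to do that with `parse_dates`. The two inline rows are made up for illustration, and the file name `olist_orders_dataset.csv` mentioned in the comment is an assumption about how your download is named.

```python
import io
import pandas as pd

# Stand-in for the real file; in practice you would pass a path such as
# 'olist_orders_dataset.csv' (assumed file name) instead of a StringIO object.
csv_text = """order_id,customer_id,order_status,order_purchase_timestamp,order_delivered_customer_date
o1,c1,delivered,2017-10-02 10:56:33,2017-10-10 21:25:13
o2,c2,shipped,2018-07-24 20:41:37,
"""

date_cols = ['order_purchase_timestamp', 'order_delivered_customer_date']

# parse_dates converts the listed columns to datetime64 while reading;
# empty cells become NaT
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

print(orders.dtypes)
print(orders)
```

Parsing at load time keeps the `.dt` accessor available from the first cell onward.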

## 🔍 Part 1: Understanding Python Date/Time Objects
Let's explore the different types of datetime objects and how to work with them.

### 1.1 Python datetime vs Pandas datetime64

In [None]:
# Python's native datetime
python_datetime = datetime(2024, 12, 25, 14, 30, 0)
print(f"Python datetime: {python_datetime}")
print(f"Type: {type(python_datetime)}")

# Pandas Timestamp (scalar type; a Series of them is stored as datetime64)
pandas_datetime = pd.Timestamp('2024-12-25 14:30:00')
print(f"\nPandas datetime: {pandas_datetime}")
print(f"Type: {type(pandas_datetime)}")

# Check our dataset's datetime column type
print(f"\nOur order timestamp type: {type(orders_df['order_purchase_timestamp'].iloc[0])}")
print(f"Column dtype: {orders_df['order_purchase_timestamp'].dtype}")
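
The two types interoperate, which matters when you mix pandas with code that expects the standard library. Here is a minimal sketch of converting in both directions; the variable names are illustrative only.

```python
from datetime import datetime
import pandas as pd

py_dt = datetime(2024, 12, 25, 14, 30, 0)

# datetime -> Timestamp
ts = pd.Timestamp(py_dt)

# Timestamp -> datetime
back = ts.to_pydatetime()

# pd.Timestamp is a subclass of datetime, so it works wherever datetime does
print(isinstance(ts, datetime))
print(ts == py_dt)
print(type(back))

# A Series built from datetime objects is stored as a datetime64 column
s = pd.Series([py_dt, py_dt])
print(s.dtype)
```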

### 1.2 Converting Strings to DateTime Objects

In [None]:
# Common date string formats and how to parse them
date_strings = [
    '2024-12-25',           # ISO format
    '25/12/2024',           # European format
    'Dec 25, 2024',         # US readable format
    '2024-12-25 14:30:00',  # ISO with time
    '25-12-2024 2:30 PM'    # Mixed format
]

# Using pd.to_datetime() with different formats
print("🔄 Converting various date string formats:")
for date_str in date_strings:
    try:
        # pd.to_datetime() infers many common formats automatically.
        # Ambiguous strings such as '03/04/2024' are read month-first by default.
        converted = pd.to_datetime(date_str)
        print(f"'{date_str}' → {converted}")
    except Exception as e:
        print(f"'{date_str}' → Error: {e}")
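
Automatic inference can silently pick the wrong reading when day and month are both 12 or less. The sketch below parses one ambiguous string three ways to show the difference; the sample string is made up for illustration.

```python
import pandas as pd

ambiguous = '03/04/2024'

# Default behaviour: month first (March 4)
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: day first (April 3)
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guesswork entirely
explicit = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(f"Default:        {month_first.date()}")
print(f"dayfirst=True:  {day_first.date()}")
print(f"Explicit format: {explicit.date()}")
```

For data from a known source, an explicit `format` is usually the safer choice because it fails loudly instead of guessing.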

In [None]:
# Handling specific formats with explicit format strings
# This is faster when you know the exact format

date_str = '25-12-2024 14:30:00'
print(f"Original string: '{date_str}'")

# Using format specification (faster for large datasets)
parsed_with_format = pd.to_datetime(date_str, format='%d-%m-%Y %H:%M:%S')
print(f"Parsed with format: {parsed_with_format}")

# Let pd.to_datetime infer the format (slower but more flexible)
parsed_inferred = pd.to_datetime(date_str)
print(f"Parsed with inference: {parsed_inferred}")
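
An explicit format is also strict: a value that does not match raises an error rather than being reinterpreted. The following sketch applies one format to a whole Series in which the third value deliberately uses a different layout; the sample values are illustrative.

```python
import pandas as pd

raw = pd.Series([
    '25-12-2024 14:30:00',
    '26-12-2024 09:15:00',
    '2024-12-27 08:00:00',   # does not match the day-first format below
])
fmt = '%d-%m-%Y %H:%M:%S'

# Strict parsing: the mismatched row raises a ValueError
try:
    pd.to_datetime(raw, format=fmt)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print(f"Strict parsing failed: {exc}")

# errors='coerce' keeps the rows that match and marks the rest as NaT
parsed = pd.to_datetime(raw, format=fmt, errors='coerce')
print(parsed)
```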

### 1.3 Handling Missing and Malformed Date Data

In [None]:
# Create sample data with missing and malformed dates
problematic_dates = pd.Series([
    '2024-12-25',
    'invalid_date',
    '',
    None,
    '2024-13-45',  # Invalid month and day
    '2024-12-31',
    np.nan
])

print("🚨 Handling problematic date data:")
print("Original data:")
print(problematic_dates)

# Convert with error handling
# errors='coerce' converts invalid dates to NaT (Not a Time)
converted_dates = pd.to_datetime(problematic_dates, errors='coerce')
print("\nAfter conversion (errors='coerce'):")
print(converted_dates)

# Check for missing values
print(f"\n📊 Missing values: {converted_dates.isna().sum()}")
print(f"📊 Valid dates: {converted_dates.notna().sum()}")
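
After coercion, blank inputs and malformed inputs both show up as `NaT`, so the count alone does not tell you which rows need fixing at the source. One possible way to separate them, comparing the raw strings with the parsed result (helper names are illustrative):

```python
import numpy as np
import pandas as pd

raw = pd.Series(['2024-12-25', 'invalid_date', '', None,
                 '2024-13-45', '2024-12-31', np.nan])
parsed = pd.to_datetime(raw, errors='coerce')

# Blank: missing (None/NaN) or an empty / whitespace-only string
is_blank = raw.fillna('').str.strip().eq('')

# Malformed: something was supplied, but it could not be parsed
is_malformed = parsed.isna() & ~is_blank

print(f"Blank inputs:     {int(is_blank.sum())}")
print(f"Malformed inputs: {int(is_malformed.sum())}")
print(raw[is_malformed])
```

Malformed values usually point to a data-entry or export problem, while blanks may be legitimate (for example, an order that has not been delivered yet).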

## 🎯 Part 2: Pandas DateTime Accessor (.dt)
The `.dt` accessor is your gateway to extracting date components and performing temporal operations.

### 2.1 Extracting Date Components

In [None]:
# Extract various date components from our orders data
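# This is equivalent to SQL's EXTRACT() function.
# Watch the day-of-week numbering: pandas uses 0=Monday, while PostgreSQL's
# EXTRACT(DOW ...) uses 0=Sunday.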

# Create a smaller sample for clear demonstration
sample_orders = orders_df.head(10).copy()

# Extract date components
sample_orders['purchase_year'] = sample_orders['order_purchase_timestamp'].dt.year
sample_orders['purchase_month'] = sample_orders['order_purchase_timestamp'].dt.month
sample_orders['purchase_quarter'] = sample_orders['order_purchase_timestamp'].dt.quarter
sample_orders['purchase_day'] = sample_orders['order_purchase_timestamp'].dt.day
sample_orders['purchase_dayofweek'] = sample_orders['order_purchase_timestamp'].dt.dayofweek  # 0=Monday, 6=Sunday
sample_orders['purchase_hour'] = sample_orders['order_purchase_timestamp'].dt.hour

# Display the results
columns_to_show = ['order_id', 'order_purchase_timestamp', 'purchase_year', 
                   'purchase_month', 'purchase_quarter', 'purchase_dayofweek']
print("📅 Date component extraction (similar to SQL EXTRACT function):")
print(sample_orders[columns_to_show])
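
Week numbers are another component that often appears in business reports. As a small sketch, `.dt.isocalendar()` returns the ISO year, week, and day; the dates below are chosen to show that a date early in January can belong to the previous ISO year.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2017-01-01', '2018-01-01', '2018-06-15']))

# isocalendar() returns a DataFrame with columns: year, week, day
iso = ts.dt.isocalendar()

components = pd.DataFrame({
    'timestamp': ts,
    'dayofweek': ts.dt.dayofweek,   # 0=Monday, 6=Sunday
    'iso_year': iso['year'],
    'iso_week': iso['week'],
    'iso_day': iso['day'],          # 1=Monday, 7=Sunday
})
print(components)
```

When grouping by week, pair the ISO week with the ISO year rather than the calendar year, otherwise dates like 2017-01-01 land in the wrong bucket.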

In [None]:
# Create business-friendly date formats
# This is equivalent to SQL's TO_CHAR() function

sample_orders['month_name'] = sample_orders['order_purchase_timestamp'].dt.month_name()
sample_orders['day_name'] = sample_orders['order_purchase_timestamp'].dt.day_name()
sample_orders['date_only'] = sample_orders['order_purchase_timestamp'].dt.date
sample_orders['time_only'] = sample_orders['order_purchase_timestamp'].dt.time
sample_orders['month_year'] = sample_orders['order_purchase_timestamp'].dt.strftime('%B %Y')

# Display formatted dates
format_columns = ['order_id', 'order_purchase_timestamp', 'month_name', 'day_name', 'month_year']
print("📝 Business-friendly date formatting:")
print(sample_orders[format_columns])
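
One detail worth knowing: `.dt.date` returns plain Python `date` objects, so the resulting column no longer supports the `.dt` accessor. If you only want to drop the time part but keep a datetime column, `.dt.normalize()` is an alternative. A short sketch with made-up timestamps:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2018-01-15 10:30:00', '2018-02-20 18:45:00']))

# Python date objects: object dtype, no .dt accessor afterwards
date_objects = ts.dt.date

# Midnight timestamps: still datetime64, .dt keeps working
date_normalized = ts.dt.normalize()

print(f"dt.date dtype:        {date_objects.dtype}")
print(f"dt.normalize() dtype: {date_normalized.dtype}")
print(date_normalized.dt.month_name())
```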

### 2.2 Business Application: Monthly Sales Summary
Let's replicate the SQL analysis we did on Thursday!

In [None]:
# Monthly sales summary - Python equivalent of our SQL GROUP BY analysis
monthly_sales = orders_df.groupby([
    orders_df['order_purchase_timestamp'].dt.year.rename('year'),
    orders_df['order_purchase_timestamp'].dt.month.rename('month')
]).agg({
    'order_id': 'count',
    'customer_id': 'nunique'
}).rename(columns={
    'order_id': 'total_orders',
    'customer_id': 'unique_customers'
})

# Reset index to make year and month regular columns.
# The group keys are renamed above because both come from the same column;
# two index levels with the same name would make reset_index() raise a ValueError.
monthly_sales = monthly_sales.reset_index()

# Add month names for readability
monthly_sales['month_name'] = pd.to_datetime(monthly_sales[['year', 'month']].assign(day=1)).dt.month_name()

print("📊 Monthly Sales Summary (equivalent to our Thursday SQL analysis):")
print(monthly_sales.head(12))
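
An alternative to grouping by separate year and month keys is to group by a single monthly period. This is a sketch under the assumption that one combined key is acceptable for your report; the tiny `orders` frame is invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': ['a', 'b', 'c', 'd'],
    'customer_id': ['c1', 'c1', 'c2', 'c3'],
    'order_purchase_timestamp': pd.to_datetime(
        ['2017-01-05', '2017-01-20', '2017-02-03', '2018-01-10']
    ),
})

# to_period('M') collapses each timestamp to its year-month
period_key = orders['order_purchase_timestamp'].dt.to_period('M').rename('period')

monthly = (
    orders.groupby(period_key)
    .agg(total_orders=('order_id', 'count'),
         unique_customers=('customer_id', 'nunique'))
    .reset_index()
)
print(monthly)
```

A period key keeps January 2017 and January 2018 apart automatically and sorts chronologically.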

### 2.3 Weekend vs Weekday Analysis

In [None]:
# Analyze shopping patterns by day type
# Python equivalent of our SQL weekend/weekday analysis

# Add day type classification
orders_df['day_of_week'] = orders_df['order_purchase_timestamp'].dt.dayofweek
# Vectorized classification: 5=Saturday, 6=Sunday
orders_df['day_type'] = np.where(orders_df['day_of_week'] >= 5, 'Weekend', 'Weekday')

# Calculate summary by day type
day_type_summary = orders_df.groupby('day_type').agg({
    'order_id': 'count'
}).rename(columns={'order_id': 'total_orders'})

# Calculate percentages
day_type_summary['percentage_of_total'] = (
    day_type_summary['total_orders'] / day_type_summary['total_orders'].sum() * 100
).round(2)

print("📅 Weekend vs Weekday Shopping Patterns:")
print(day_type_summary)
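
Raw totals favour weekdays simply because there are five of them and only two weekend days. A fairer comparison is orders per calendar day. The sketch below uses an invented sample with exactly four orders on each of 14 consecutive days, so both day types should come out equal per day even though the totals differ.

```python
import numpy as np
import pandas as pd

# 14 consecutive days starting on a Monday, four orders per day (made-up data)
days = pd.date_range('2018-01-01', periods=14, freq='D')
orders = pd.DataFrame({'order_purchase_timestamp': days.repeat(4)})

orders['order_date'] = orders['order_purchase_timestamp'].dt.normalize()
orders['day_type'] = np.where(
    orders['order_purchase_timestamp'].dt.dayofweek >= 5, 'Weekend', 'Weekday'
)

summary = orders.groupby('day_type').agg(
    total_orders=('order_date', 'size'),
    active_days=('order_date', 'nunique'),
)
summary['orders_per_day'] = summary['total_orders'] / summary['active_days']
print(summary)
```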

## 📈 Part 3: Time-Based Grouping and Resampling

### 3.1 Using resample() for Time-Based Aggregation

In [None]:
# Set datetime as index for resampling
orders_indexed = orders_df.set_index('order_purchase_timestamp').sort_index()

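# Monthly resampling (similar to SQL DATE_TRUNC('month', date))
# Note: 'M' labels each bucket with the month-END date, whereas DATE_TRUNC returns
# the month start ('MS' does the same in pandas). pandas 2.2+ prefers the alias 'ME' over 'M'.
monthly_orders = orders_indexed.resample('M').agg({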
    'order_id': 'count',
    'customer_id': 'nunique'
}).rename(columns={
    'order_id': 'monthly_orders',
    'customer_id': 'monthly_customers'
})

print("📊 Monthly Order Trends (using resample):")
print(monthly_orders.head(10))

# Weekly resampling
weekly_orders = orders_indexed.resample('W').agg({
    'order_id': 'count'
}).rename(columns={'order_id': 'weekly_orders'})

print("\n📊 Weekly Order Trends:")
print(weekly_orders.head(10))
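
Unlike a `groupby` on extracted components, `resample()` also emits periods that contain no rows. The following sketch uses the `'MS'` (month start) alias and three invented orders to show the empty month appearing with a count of zero.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': ['a', 'b', 'c'],
    'order_purchase_timestamp': pd.to_datetime(
        ['2018-01-10', '2018-01-25', '2018-03-02']
    ),
})
indexed = orders.set_index('order_purchase_timestamp').sort_index()

# 'MS' labels each bucket with the first day of the month
monthly = indexed.resample('MS')['order_id'].count()
print(monthly)
```

February has no orders in this sample, yet it still appears in the result, which keeps line charts and month-over-month calculations continuous.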

### 3.2 Peak Shopping Hours Analysis

In [None]:
# Analyze shopping patterns by hour of day
# Python equivalent of our SQL hour analysis

hourly_orders = orders_df.groupby(
    orders_df['order_purchase_timestamp'].dt.hour
)['order_id'].count().reset_index()

hourly_orders.columns = ['hour_of_day', 'order_count']
hourly_orders['percentage'] = (
    hourly_orders['order_count'] / hourly_orders['order_count'].sum() * 100
).round(2)

# Sort by order count to find peak hours
hourly_orders_sorted = hourly_orders.sort_values('order_count', ascending=False)

print("🕐 Peak Shopping Hours Analysis:")
print("Top 5 shopping hours:")
print(hourly_orders_sorted.head())

# Create a simple visualization
plt.figure(figsize=(12, 6))
plt.bar(hourly_orders['hour_of_day'], hourly_orders['order_count'])
plt.title('Shopping Patterns by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Orders')
plt.xticks(range(0, 24))
plt.grid(axis='y', alpha=0.3)
plt.show()

# Cast to int: selecting a row with iloc[0] upcasts the values to float (e.g. 14.0)
print(f"\n🎯 Business Insight: Peak shopping hour is {int(hourly_orders_sorted.iloc[0]['hour_of_day'])}:00 with {int(hourly_orders_sorted.iloc[0]['order_count']):,} orders")
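
With real data, some hours may have no orders at all, and a plain `groupby` simply omits them. One way to guarantee all 24 hours are present is to reindex the result, sketched here on three made-up timestamps.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2018-01-01 09:15', '2018-01-01 09:45', '2018-01-02 21:05'
]))

# Count orders per hour, then fill in the hours that never occur with 0
hourly = ts.groupby(ts.dt.hour).size().reindex(range(24), fill_value=0)

peak_hour = int(hourly.idxmax())
print(hourly[hourly > 0])
print(f"Hours covered: {len(hourly)}, peak hour: {peak_hour}:00")
```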

## 💡 Business Insights Summary
Let's compare our Python results with Thursday's SQL findings.

In [None]:
# Create a comprehensive summary of our temporal analysis
print("📋 BUSINESS INSIGHTS SUMMARY")
print("=" * 40)

# 1. Dataset Overview
print(f"📊 Total Orders Analyzed: {len(orders_df):,}")
print(f"📅 Date Range: {orders_df['order_purchase_timestamp'].min().strftime('%Y-%m-%d')} to {orders_df['order_purchase_timestamp'].max().strftime('%Y-%m-%d')}")
print(f"🎯 Delivered Orders: {(orders_df['order_status'] == 'delivered').sum():,}")

# 2. Seasonal Patterns
print("\n🌿 SEASONAL PATTERNS:")
quarterly_summary = orders_df.groupby(
    orders_df['order_purchase_timestamp'].dt.quarter
)['order_id'].count()
peak_quarter = quarterly_summary.idxmax()
print(f"📈 Peak Quarter: Q{peak_quarter} with {quarterly_summary.max():,} orders")

# 3. Customer Behavior
print("\n👥 CUSTOMER BEHAVIOR:")
weekend_orders = day_type_summary.loc['Weekend', 'total_orders']
weekday_orders = day_type_summary.loc['Weekday', 'total_orders']
weekend_percentage = day_type_summary.loc['Weekend', 'percentage_of_total']
weekday_percentage = day_type_summary.loc['Weekday', 'percentage_of_total']
print(f"🏖️ Weekend Shopping: {weekend_orders:,} orders ({weekend_percentage:.2f}%)")
print(f"💼 Weekday Shopping: {weekday_orders:,} orders ({weekday_percentage:.2f}%)")

# 4. Peak Hours
# Cast to int: a row selected with iloc[0] holds floats, which would print as "14.0:00"
peak_hour = int(hourly_orders_sorted.iloc[0]['hour_of_day'])
peak_hour_orders = int(hourly_orders_sorted.iloc[0]['order_count'])
print(f"⏰ Peak Shopping Hour: {peak_hour}:00 with {peak_hour_orders:,} orders")

print("\n✅ Analysis complete! Compare these figures with Thursday's SQL findings; the synthetic sample will not reproduce them exactly.")

## 🎯 Key Takeaways

### Python vs SQL for DateTime Operations

| Operation | SQL | Python Pandas | Best Use Case |
|-----------|-----|---------------|---------------|
| Date Extraction | `EXTRACT(MONTH FROM date)` | `df['date'].dt.month` | Both equally effective |
| Date Formatting | `TO_CHAR(date, 'Month YYYY')` | `df['date'].dt.strftime('%B %Y')` | Python more flexible |
| Time Grouping | `DATE_TRUNC('month', date)` | `df.set_index('date').resample('M')` | Python better for analysis |
| Missing Data | `WHERE date IS NOT NULL` | `df['date'].notna()` | Both handle well |

### When to Use Each Tool
- **SQL**: Great for data extraction, filtering, and database-level aggregations
- **Python**: Better for complex analysis, visualizations, and machine learning
- **Both**: Can achieve the same business insights with different approaches

### Next Steps
In Part 2, we'll dive deeper into:
- Date arithmetic and timedelta operations
- Advanced time series analysis
- Rolling windows and trend analysis
- Seasonal decomposition techniques

## 📝 Practice Exercise
Try these quick exercises to reinforce your learning:

1. **Extract all orders from December** (any year)
2. **Find the busiest day of the week** for orders
3. **Calculate the percentage of orders placed on weekends vs weekdays**
4. **Create a month-year column** in the format "Jan 2018"

Try to solve these before looking at the solutions in the next notebook!