# PM2.5 Hexagon Data Enrichment - 2023 Complete Dataset

This notebook creates a unified dataset for **2023 ONLY** by:
1. Identifying all PM2.5 measurement hexagons at H3 resolution 7
2. Enriching them with traffic and weather data (local or nearest-neighbor)
3. Adding static features like terrain elevation
4. Creating a query interface for location-based lookups

**Version**: V2 - Complete 2023 data (including Sept-Oct from updated OpenMeteo)

## Phase 1: Setup and Data Loading

In [1]:
import pandas as pd
import numpy as np
import h3
import json
from datetime import datetime, timedelta
from tqdm import tqdm
import folium
from folium import plugins
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("Libraries loaded successfully")

Libraries loaded successfully


In [2]:
# Config
DATA_PATH = '/Users/vojtech/Code/Bard89/Project-Data/data/processed/'
H3_RESOLUTION = 7  # 5.16 km² hexagons

# YEAR FILTERING CONFIGURATION
PROCESS_YEAR = 2023
YEAR_START = f'{PROCESS_YEAR}-01-01'
YEAR_END = f'{PROCESS_YEAR}-12-31'

print(f"Data path: {DATA_PATH}")
print(f"H3 Resolution: {H3_RESOLUTION} (approx 5.16 km² per hexagon)")
print(f"Processing Year: {PROCESS_YEAR}")
print(f"Date Range: {YEAR_START} to {YEAR_END}")

Data path: /Users/vojtech/Code/Bard89/Project-Data/data/processed/
H3 Resolution: 7 (approx 5.16 km² per hexagon)
Processing Year: 2023
Date Range: 2023-01-01 to 2023-12-31


## Phase 2: Create PM2.5 Hexagon Registry

In [3]:
# Load PM2.5 data and filter for specified year
print("Loading PM2.5 air quality data...")
df_pm25 = pd.read_csv(f"{DATA_PATH}jp_openaq_processed_20230101_to_20231231.csv",
                       usecols=['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8', 
                                'timestamp', 'pm25_ugm3_mean'])

# Convert timestamp and filter for year
df_pm25['timestamp'] = pd.to_datetime(df_pm25['timestamp'])
print(f"Original data range: {df_pm25['timestamp'].min()} to {df_pm25['timestamp'].max()}")

# Filter for specified year only
df_pm25 = df_pm25[(df_pm25['timestamp'] >= YEAR_START) & (df_pm25['timestamp'] <= f'{YEAR_END} 23:59:59')]
print(f"Filtered to {PROCESS_YEAR}: {df_pm25['timestamp'].min()} to {df_pm25['timestamp'].max()}")

print(f"Loaded {len(df_pm25):,} PM2.5 records for year {PROCESS_YEAR}")
print(f"PM2.5 missing values: {df_pm25['pm25_ugm3_mean'].isna().mean():.1%}")

Loading PM2.5 air quality data...
Original data range: 2023-07-14 16:00:00+00:00 to 2025-07-26 05:00:00+00:00
Filtered to 2023: 2023-07-14 16:00:00+00:00 to 2023-12-31 23:00:00+00:00
Loaded 2,078,211 PM2.5 records for year 2023
PM2.5 missing values: 21.2%
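
The output above shows that the PM2.5 records only begin on 2023-07-14, so January through June are absent from this source even though the filename spans the full year. A minimal sketch of a month-coverage check, similar in spirit to the one this notebook applies to the weather data; the helper name `missing_months` is hypothetical and the timestamps are toy values, not the real data:

```python
import pandas as pd

def missing_months(timestamps, year):
    """Return the calendar months (1-12) of `year` that have no records."""
    ts = pd.to_datetime(pd.Series(list(timestamps)), utc=True)
    present = set(ts[ts.dt.year == year].dt.month)
    return [m for m in range(1, 13) if m not in present]

# Toy timestamps covering the same span as the filtered PM2.5 data
sample = ['2023-07-14 16:00:00+00:00',
          '2023-09-01 00:00:00+00:00',
          '2023-12-31 23:00:00+00:00']
gaps = missing_months(sample, 2023)
print(f"Months without PM2.5 records: {gaps}")
```

In the notebook this could be called as `missing_months(df_pm25['timestamp'], PROCESS_YEAR)` to make the coverage gap explicit before enrichment.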


In [4]:
# Create PM2.5 hexagon registry at resolution 7
print("\nCreating PM2.5 hexagon registry at H3 resolution 7...")

pm25_registry = {}

for _, row in tqdm(df_pm25[['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8']].drop_duplicates().iterrows(), 
                   total=df_pm25['h3_index_res8'].nunique()):
    if pd.notna(row['h3_index_res8']):
        hex7 = h3.cell_to_parent(row['h3_index_res8'], H3_RESOLUTION)
        
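        # Note: center_lat/center_lon are taken from the first res-8 child row seen,
        # so they approximate (rather than equal) the res-7 cell center
        if hex7 not in pm25_registry:
            pm25_registry[hex7] = {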
                'hex7_id': hex7,
                'center_lat': row['h3_lat_res8'],
                'center_lon': row['h3_lon_res8'],
                'res8_hexagons': [],
                'measurement_count': 0
            }
        
        pm25_registry[hex7]['res8_hexagons'].append(row['h3_index_res8'])

# Calculate measurement counts
hex7_counts = df_pm25.groupby(
    df_pm25['h3_index_res8'].apply(lambda x: h3.cell_to_parent(x, H3_RESOLUTION) if pd.notna(x) else None)
)['pm25_ugm3_mean'].count()

for hex7, count in hex7_counts.items():
    if hex7 in pm25_registry:
        pm25_registry[hex7]['measurement_count'] = count

print(f"\nCreated registry with {len(pm25_registry)} PM2.5 hexagons at resolution 7")

# Convert to DataFrame
pm25_hex_df = pd.DataFrame(pm25_registry.values())
pm25_hex_df = pm25_hex_df.sort_values('measurement_count', ascending=False)

print("\nTop 5 hexagons by measurement count:")
print(pm25_hex_df[['hex7_id', 'center_lat', 'center_lon', 'measurement_count']].head())


Creating PM2.5 hexagon registry at H3 resolution 7...


100%|██████████| 638/638 [00:00<00:00, 52779.35it/s]



Created registry with 629 PM2.5 hexagons at resolution 7

Top 5 hexagons by measurement count:
             hex7_id  center_lat  center_lon  measurement_count
466  872f5bc81ffffff      35.675     139.440               6792
427  872f5aaccffffff      35.607     139.722               6749
464  872f5bc24ffffff      35.465     139.470               6418
193  872e61ae1ffffff      34.401     135.304               3492
311  872e74906ffffff      36.133     139.612               3458


## Phase 3: Load and Map Auxiliary Data

In [5]:
# Load traffic data and filter for specified year
print("Loading traffic data...")
df_traffic = pd.read_csv(f"{DATA_PATH}jp_jartic_processed_20230101_to_20231231.csv",
                          usecols=['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8',
                                   'timestamp', 'avg_traffic_volume'])

# Convert timestamp and filter
df_traffic['timestamp'] = pd.to_datetime(df_traffic['timestamp'])
print(f"Original traffic data range: {df_traffic['timestamp'].min()} to {df_traffic['timestamp'].max()}")

df_traffic = df_traffic[(df_traffic['timestamp'] >= YEAR_START) & (df_traffic['timestamp'] <= f'{YEAR_END} 23:59:59')]
print(f"Filtered to {PROCESS_YEAR}: {df_traffic['timestamp'].min()} to {df_traffic['timestamp'].max()}")
print(f"Traffic records for {PROCESS_YEAR}: {len(df_traffic):,}")

# Create traffic hexagon registry at resolution 7
traffic_registry = {}
for _, row in df_traffic[['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8']].drop_duplicates().iterrows():
    if pd.notna(row['h3_index_res8']):
        hex7 = h3.cell_to_parent(row['h3_index_res8'], H3_RESOLUTION)
        if hex7 not in traffic_registry:
            traffic_registry[hex7] = {
                'hex7_id': hex7,
                'center_lat': row['h3_lat_res8'],
                'center_lon': row['h3_lon_res8']
            }

print(f"Found {len(traffic_registry)} traffic hexagons at resolution 7")

# Check overlap with PM2.5
pm25_with_traffic = set(pm25_registry.keys()) & set(traffic_registry.keys())
print(f"PM2.5 hexagons with local traffic data: {len(pm25_with_traffic)} ({len(pm25_with_traffic)/len(pm25_registry)*100:.1f}%)")

Loading traffic data...
Original traffic data range: 2022-12-31 15:00:00+00:00 to 2023-12-31 14:00:00+00:00
Filtered to 2023: 2023-01-01 00:00:00+00:00 to 2023-12-31 14:00:00+00:00
Traffic records for 2023: 8,668,679
Found 1017 traffic hexagons at resolution 7
PM2.5 hexagons with local traffic data: 26 (4.1%)


In [6]:
# Load weather data and filter for specified year  
print("Loading weather data...")
# Note: Now using the COMPLETE 2023 data with Sept-Oct included
df_weather = pd.read_csv(f"{DATA_PATH}jp_openmeteo_processed_20230101_to_20231231.csv",
                          usecols=['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8',
                                   'timestamp', 'temperature_c_mean', 'humidity_pct_mean',
                                   'precipitation_mm_mean'])

# Convert timestamp and filter
df_weather['timestamp'] = pd.to_datetime(df_weather['timestamp'])
print(f"Original weather data range: {df_weather['timestamp'].min()} to {df_weather['timestamp'].max()}")

df_weather = df_weather[(df_weather['timestamp'] >= YEAR_START) & (df_weather['timestamp'] <= f'{YEAR_END} 23:59:59')]
print(f"Filtered to {PROCESS_YEAR}: {df_weather['timestamp'].min()} to {df_weather['timestamp'].max()}")
print(f"Weather records for {PROCESS_YEAR}: {len(df_weather):,}")

# Verify we have complete 2023 data
unique_months = df_weather['timestamp'].dt.to_period('M').unique()
print(f"Months with weather data: {sorted(unique_months)}")
if len(unique_months) == 12:
    print("✓ Complete year coverage confirmed!")
else:
    print(f"⚠ Warning: Only {len(unique_months)}/12 months available")

# Create weather hexagon registry at resolution 7
weather_registry = {}
for _, row in df_weather[['h3_index_res8', 'h3_lat_res8', 'h3_lon_res8']].drop_duplicates().iterrows():
    if pd.notna(row['h3_index_res8']):
        hex7 = h3.cell_to_parent(row['h3_index_res8'], H3_RESOLUTION)
        if hex7 not in weather_registry:
            weather_registry[hex7] = {
                'hex7_id': hex7,
                'center_lat': row['h3_lat_res8'],
                'center_lon': row['h3_lon_res8']
            }

print(f"Found {len(weather_registry)} weather hexagons at resolution 7")

# Check overlap with PM2.5
pm25_with_weather = set(pm25_registry.keys()) & set(weather_registry.keys())
print(f"PM2.5 hexagons with local weather data: {len(pm25_with_weather)} ({len(pm25_with_weather)/len(pm25_registry)*100:.1f}%)")

Loading weather data...
Original weather data range: 2023-01-01 00:00:00+00:00 to 2023-12-31 23:00:00+00:00
Filtered to 2023: 2023-01-01 00:00:00+00:00 to 2023-12-31 23:00:00+00:00
Weather records for 2023: 4,400,280
Months with weather data: [Period('2023-01', 'M'), Period('2023-02', 'M'), Period('2023-03', 'M'), Period('2023-04', 'M'), Period('2023-05', 'M'), Period('2023-06', 'M'), Period('2023-07', 'M'), Period('2023-08', 'M'), Period('2023-09', 'M'), Period('2023-10', 'M'), Period('2023-11', 'M'), Period('2023-12', 'M')]
✓ Complete year coverage confirmed!
Found 536 weather hexagons at resolution 7
PM2.5 hexagons with local weather data: 8 (1.3%)


## Phase 4: Build Nearest Neighbor Lookup

In [7]:
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate the great circle distance between two points on Earth"""
    from math import radians, cos, sin, asin, sqrt
    
    # Convert to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of Earth in kilometers
    
    return c * r

def find_nearest_hexagon(target_hex, target_lat, target_lon, source_registry, max_distance_km=500):
    """Find nearest hexagon from source registry.

    Returns (hex_id, distance_km), or (None, inf) if nothing lies within max_distance_km.
    """
    # Same hexagon present in the source registry - distance is 0
    if target_hex in source_registry:
        return target_hex, 0.0
    
    min_distance = float('inf')
    nearest_hex = None
    
    for source_hex, source_info in source_registry.items():
        distance = haversine_distance(target_lat, target_lon, 
                                       source_info['center_lat'], 
                                       source_info['center_lon'])
        
        if distance < min_distance and distance < max_distance_km:
            min_distance = distance
            nearest_hex = source_hex
    
    return nearest_hex, min_distance
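
A quick sanity check for the two functions above, as a self-contained sketch: with r = 6371, one degree of latitude should come out near 111.2 km, and a target hexagon that is itself in the source registry should resolve to distance 0. The functions are re-declared in compact form so the block runs on its own, and the registry IDs and coordinates are invented for illustration.

```python
from math import radians, cos, sin, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km (same formula as haversine_distance)."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * r

def nearest_in_registry(target_hex, lat, lon, registry, max_distance_km=500):
    """Return (hex_id, distance_km), or (None, inf) if nothing is in range."""
    if target_hex in registry:
        return target_hex, 0.0
    best_hex, best_dist = None, float('inf')
    for hex_id, info in registry.items():
        d = haversine_km(lat, lon, info['center_lat'], info['center_lon'])
        if d < best_dist and d < max_distance_km:
            best_hex, best_dist = hex_id, d
    return best_hex, best_dist

# Hypothetical registry: IDs and coordinates are illustrative only
toy_registry = {
    'hex_tokyo': {'center_lat': 35.68, 'center_lon': 139.77},
    'hex_osaka': {'center_lat': 34.69, 'center_lon': 135.50},
}

one_degree = haversine_km(35.0, 139.0, 36.0, 139.0)
local = nearest_in_registry('hex_tokyo', 35.68, 139.77, toy_registry)
remote = nearest_in_registry('hex_yokohama', 35.44, 139.64, toy_registry)
out_of_range = nearest_in_registry('hex_far', 0.0, 0.0, toy_registry)

print(f"1 degree of latitude: {one_degree:.2f} km")
print(f"local: {local}, remote: {remote}, out of range: {out_of_range}")
```

Note the out-of-range case: the lookup returns `None` with an infinite distance, so downstream code has to handle a missing source hexagon.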

In [8]:
# Build nearest neighbor lookup for each PM2.5 hexagon
print("Building nearest neighbor lookup table...")
print("This may take a few minutes...")

nearest_lookup = {}

for hex7, hex_info in tqdm(pm25_registry.items(), desc="Processing PM2.5 hexagons"):
    lat = hex_info['center_lat']
    lon = hex_info['center_lon']
    
    # Find nearest traffic hexagon
    nearest_traffic, traffic_distance = find_nearest_hexagon(hex7, lat, lon, traffic_registry)
    
    # Find nearest weather hexagon
    nearest_weather, weather_distance = find_nearest_hexagon(hex7, lat, lon, weather_registry)
    
    nearest_lookup[hex7] = {
        'nearest_traffic_hex': nearest_traffic,
        'traffic_distance_km': traffic_distance,
        'has_local_traffic': traffic_distance == 0,
        'nearest_weather_hex': nearest_weather,
        'weather_distance_km': weather_distance,
        'has_local_weather': weather_distance == 0
    }

# Convert to DataFrame for analysis
nearest_df = pd.DataFrame(nearest_lookup).T.reset_index()
nearest_df.rename(columns={'index': 'hex7_id'}, inplace=True)

print("\nNearest neighbor statistics:")
print(f"Hexagons with local traffic: {nearest_df['has_local_traffic'].sum()}")
print(f"Hexagons with local weather: {nearest_df['has_local_weather'].sum()}")
print(f"\nTraffic distance statistics (km):")
print(nearest_df[nearest_df['traffic_distance_km'] > 0]['traffic_distance_km'].describe())
print(f"\nWeather distance statistics (km):")
print(nearest_df[nearest_df['weather_distance_km'] > 0]['weather_distance_km'].describe())

Building nearest neighbor lookup table...
This may take a few minutes...


Processing PM2.5 hexagons: 100%|██████████| 629/629 [00:00<00:00, 904.64it/s]


Nearest neighbor statistics:
Hexagons with local traffic: 26
Hexagons with local weather: 8

Traffic distance statistics (km):
count    603.000
unique   603.000
top        3.899
freq       1.000
Name: traffic_distance_km, dtype: float64

Weather distance statistics (km):
count    621.000
unique   621.000
top       32.302
freq       1.000
Name: weather_distance_km, dtype: float64
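
The `describe()` output above reports `unique`, `top` and `freq` instead of a mean and quantiles, which suggests the distance columns hold `object` values rather than floats. That is the expected result of `pd.DataFrame(nearest_lookup).T`: transposing a frame whose columns mix strings, floats and booleans yields all-object columns. One possible fix, sketched here on a toy lookup with made-up hexagon IDs, is `DataFrame.from_dict(..., orient='index')`, which infers a dtype per column:

```python
import pandas as pd

# Toy lookup in the same shape as nearest_lookup (IDs are illustrative)
toy_lookup = {
    'hex_a': {'nearest_traffic_hex': 'hex_t1', 'traffic_distance_km': 3.9,
              'has_local_traffic': False},
    'hex_b': {'nearest_traffic_hex': 'hex_b', 'traffic_distance_km': 0.0,
              'has_local_traffic': True},
}

# Transposing mixes all value types within each column -> object dtype
transposed = pd.DataFrame(toy_lookup).T

# Building row-wise lets pandas infer one dtype per column
typed = pd.DataFrame.from_dict(toy_lookup, orient='index')
typed = typed.rename_axis('hex7_id').reset_index()

print(transposed.dtypes)
print(typed.dtypes)
print(typed['traffic_distance_km'].describe())
```

With numeric dtypes in place, `describe()` reports mean, standard deviation and quantiles for the distances.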





## Phase 5: Create Enriched Dataset

In [9]:
# Aggregate PM2.5 data to resolution 7
print("Aggregating PM2.5 data to resolution 7...")

df_pm25['hex7_id'] = df_pm25['h3_index_res8'].apply(
    lambda x: h3.cell_to_parent(x, H3_RESOLUTION) if pd.notna(x) else None
)

# Aggregate to hourly at hex7
df_pm25['timestamp'] = pd.to_datetime(df_pm25['timestamp'])
df_pm25['hour'] = df_pm25['timestamp'].dt.floor('H')

pm25_hourly = df_pm25.groupby(['hex7_id', 'hour']).agg({
    'pm25_ugm3_mean': 'mean',
    'h3_lat_res8': 'mean',
    'h3_lon_res8': 'mean'
}).reset_index()

pm25_hourly.rename(columns={
    'hour': 'timestamp',
    'h3_lat_res8': 'lat',
    'h3_lon_res8': 'lon'
}, inplace=True)

print(f"Created {len(pm25_hourly):,} hourly PM2.5 records for {pm25_hourly['hex7_id'].nunique()} hexagons")

Aggregating PM2.5 data to resolution 7...
Created 2,049,356 hourly PM2.5 records for 629 hexagons


In [10]:
# Prepare traffic data aggregation
print("\nAggregating traffic data to resolution 7...")

df_traffic['hex7_id'] = df_traffic['h3_index_res8'].apply(
    lambda x: h3.cell_to_parent(x, H3_RESOLUTION) if pd.notna(x) else None
)

df_traffic['timestamp'] = pd.to_datetime(df_traffic['timestamp'])
df_traffic['hour'] = df_traffic['timestamp'].dt.floor('H')

traffic_hourly = df_traffic.groupby(['hex7_id', 'hour']).agg({
    'avg_traffic_volume': 'mean'
}).reset_index()

traffic_hourly.rename(columns={'hour': 'timestamp'}, inplace=True)

print(f"Created {len(traffic_hourly):,} hourly traffic records for {traffic_hourly['hex7_id'].nunique()} hexagons")


Aggregating traffic data to resolution 7...
Created 8,668,679 hourly traffic records for 1017 hexagons


In [11]:
# Prepare weather data aggregation
print("\nAggregating weather data to resolution 7...")

df_weather['hex7_id'] = df_weather['h3_index_res8'].apply(
    lambda x: h3.cell_to_parent(x, H3_RESOLUTION) if pd.notna(x) else None
)

df_weather['timestamp'] = pd.to_datetime(df_weather['timestamp'])
df_weather['hour'] = df_weather['timestamp'].dt.floor('H')

weather_hourly = df_weather.groupby(['hex7_id', 'hour']).agg({
    'temperature_c_mean': 'mean',
    'humidity_pct_mean': 'mean',
    'precipitation_mm_mean': 'mean'
}).reset_index()

weather_hourly.rename(columns={'hour': 'timestamp'}, inplace=True)

print(f"Created {len(weather_hourly):,} hourly weather records for {weather_hourly['hex7_id'].nunique()} hexagons")


Aggregating weather data to resolution 7...
Created 4,400,280 hourly weather records for 536 hexagons


In [None]:
import os
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel, delayed
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Generate timestamp for unique filenames
timestamp_str = datetime.now().strftime('%Y%m%d_%H%M%S')

# Output file paths with year and timestamp in filename
saved_file_csv = f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_{timestamp_str}.csv'
saved_file_parquet = f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_{timestamp_str}.parquet'

# Check for existing files with pattern
import glob
existing_parquet_files = glob.glob(f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_*.parquet')
existing_csv_files = glob.glob(f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_*.csv')

if existing_parquet_files or existing_csv_files:
    print(f"Found existing enriched data files for {PROCESS_YEAR}:")
    for f in existing_parquet_files[:3]:
        print(f"  - {os.path.basename(f)}")
    for f in existing_csv_files[:3]:
        print(f"  - {os.path.basename(f)}")
    if len(existing_parquet_files) + len(existing_csv_files) > 6:
        print(f"  ... and {len(existing_parquet_files) + len(existing_csv_files) - 6} more files")
    print(f"\nCreating new version with timestamp: {timestamp_str}")

# Merge data for PM2.5 hexagons with enrichment
print(f"\nCreating enriched dataset for year {PROCESS_YEAR}...")

# Start with PM2.5 data
enriched_data = pm25_hourly.copy()

# Add nearest neighbor information
enriched_data = enriched_data.merge(nearest_df, on='hex7_id', how='left')

# Create global lookups for faster access
print("Creating lookup dictionaries for faster processing...")

# Create traffic lookup: (hex_id, timestamp) -> data
# Zipping the columns avoids the per-row Series overhead of iterrows()
traffic_lookup = dict(zip(
    zip(traffic_hourly['hex7_id'], traffic_hourly['timestamp']),
    traffic_hourly['avg_traffic_volume']
))

# Create weather lookup: (hex_id, timestamp) -> data
weather_cols = ['temperature_c_mean', 'humidity_pct_mean', 'precipitation_mm_mean']
weather_lookup = {
    (hex_id, ts): dict(zip(weather_cols, values))
    for hex_id, ts, *values in weather_hourly[['hex7_id', 'timestamp'] + weather_cols].itertuples(index=False, name=None)
}

print(f"Created lookups: {len(traffic_lookup):,} traffic, {len(weather_lookup):,} weather entries")

# Function to process a single row
def process_row(row_data):
    """Process a single row for both traffic and weather enrichment"""
    idx, row = row_data
    
    # Process traffic
    if row['has_local_traffic']:
        source_hex = row['hex7_id']
    else:
        source_hex = row['nearest_traffic_hex']
    
    traffic_result = {}
    if pd.notna(source_hex):
        key = (source_hex, row['timestamp'])
        if key in traffic_lookup:
            traffic_result['avg_traffic_volume'] = traffic_lookup[key]
            traffic_result['traffic_source'] = 'local' if row['has_local_traffic'] else 'nearest'
        else:
            traffic_result['avg_traffic_volume'] = np.nan
            traffic_result['traffic_source'] = 'missing'
    else:
        traffic_result['avg_traffic_volume'] = np.nan
        traffic_result['traffic_source'] = np.nan
    
    # Process weather
    if row['has_local_weather']:
        source_hex = row['hex7_id']
    else:
        source_hex = row['nearest_weather_hex']
    
    weather_result = {}
    if pd.notna(source_hex):
        key = (source_hex, row['timestamp'])
        if key in weather_lookup:
            weather_data = weather_lookup[key]
            weather_result.update(weather_data)
            weather_result['weather_source'] = 'local' if row['has_local_weather'] else 'nearest'
        else:
            weather_result['temperature_c_mean'] = np.nan
            weather_result['humidity_pct_mean'] = np.nan
            weather_result['precipitation_mm_mean'] = np.nan
            weather_result['weather_source'] = 'missing'
    else:
        weather_result['temperature_c_mean'] = np.nan
        weather_result['humidity_pct_mean'] = np.nan
        weather_result['precipitation_mm_mean'] = np.nan
        weather_result['weather_source'] = np.nan
    
    # Combine results
    combined = {**traffic_result, **weather_result}
    return idx, combined

# Process full dataset
print(f"Processing FULL dataset for {PROCESS_YEAR}...")

print("\nProcessing data with parallel enrichment...")
total_records = len(enriched_data)
print(f"Total records to process: {total_records:,}")

# Determine number of jobs
n_jobs = min(8, os.cpu_count() or 1)  # Use up to 8 cores, fewer if the machine has fewer
print(f"Using parallel processing with {n_jobs} jobs")
print("="*60)

print(f"\n[PARALLEL] PROCESSING TRAFFIC AND WEATHER ENRICHMENT FOR {PROCESS_YEAR}")
print("-"*60)

# Prepare data for processing
row_data = list(enriched_data.iterrows())

# Process in parallel with progress bar
print("Processing records...")
print("This may take several minutes for the full dataset...")

results = Parallel(n_jobs=n_jobs, backend='threading')(
    delayed(process_row)(row) 
    for row in tqdm(row_data, desc=f"Enriching {PROCESS_YEAR} data")
)

# Sort results by index to maintain order
results.sort(key=lambda x: x[0])

# Extract enrichment data
enrichment_data = [result[1] for result in results]
enrichment_df = pd.DataFrame(enrichment_data)

# Add enrichments to the main dataframe
enriched_data = pd.concat([enriched_data, enrichment_df], axis=1)

# Final validation - ensure we only have 2023 data
enriched_data['timestamp'] = pd.to_datetime(enriched_data['timestamp'])
year_check = enriched_data['timestamp'].dt.year.unique()
print(f"\n✓ Enrichment complete! Years in dataset: {year_check}")

print("\n" + "="*60)
print(f"✓ ENRICHMENT FINISHED: Created {PROCESS_YEAR} dataset with {len(enriched_data):,} records")
print("="*60)

# Create data directory if it doesn't exist
os.makedirs('../../data', exist_ok=True)

# Save as both Parquet and CSV (if needed)
print(f"\nSaving enriched {PROCESS_YEAR} data with timestamp {timestamp_str}...")

print(f"Saving as Parquet to {saved_file_parquet}...")
enriched_data.to_parquet(saved_file_parquet, index=False, compression='snappy')
parquet_size_mb = os.path.getsize(saved_file_parquet) / 1024 / 1024
print(f"✓ Parquet saved! File size: {parquet_size_mb:.1f} MB")

SAVE_CSV = True  # Set to False to skip the slow CSV export

if SAVE_CSV:
    print(f"\nSaving as CSV to {saved_file_csv}...")
    print("WARNING: Saving large CSV will take several minutes...")
    
    # Save in chunks for better performance and progress tracking
    chunk_size = 500000  # Save 500k rows at a time
    n_chunks = (len(enriched_data) + chunk_size - 1) // chunk_size
    
    for i in tqdm(range(n_chunks), desc="Saving CSV chunks"):
        start_idx = i * chunk_size
        end_idx = min((i + 1) * chunk_size, len(enriched_data))
        chunk = enriched_data.iloc[start_idx:end_idx]
        
        # Write header only for first chunk
        mode = 'w' if i == 0 else 'a'
        header = i == 0
        
        chunk.to_csv(saved_file_csv, mode=mode, header=header, index=False)
    
    csv_size_mb = os.path.getsize(saved_file_csv) / 1024 / 1024
    print(f"✓ CSV saved! File size: {csv_size_mb:.1f} MB")
else:
    print("\nSkipping CSV save (set SAVE_CSV=True if needed)")
    print("Parquet format is recommended for large datasets (faster & smaller)")

# Also save lookup table to data folder with timestamp
if 'nearest_lookup' in locals():
    lookup_file = f'../../data/hexagon_lookup_table_{PROCESS_YEAR}_{timestamp_str}.json'
    with open(lookup_file, 'w') as f:
        json.dump(nearest_lookup, f, indent=2)
    print(f"✓ Saved hexagon lookup table to {lookup_file}")

# Final summary
print("\n" + "="*60)
print(f"PROCESSING COMPLETE FOR {PROCESS_YEAR}!")
print("="*60)
print(f"  Year processed: {PROCESS_YEAR}")
print(f"  Total records: {total_records:,}")
print(f"  Timestamp: {timestamp_str}")
print(f"  Output files:")
print(f"    - {saved_file_parquet}")
if SAVE_CSV:
    print(f"    - {saved_file_csv}")
print(f"  File size: {parquet_size_mb:.1f} MB (Parquet)")
print("="*60)
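
The row-by-row lookup above can also be expressed as a left join, which pandas executes in vectorized form. The following is a sketch on toy data, not a drop-in replacement: it assumes `nearest_traffic_hex` already equals `hex7_id` for hexagons with local data (which is what `find_nearest_hexagon` returns), it covers traffic only, and it labels rows without any source hexagon as `'missing'` rather than `NaN`. Hexagon IDs and values are invented.

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(['2023-07-14 16:00', '2023-07-14 17:00'], utc=True)

# Toy PM2.5 rows already merged with the nearest-neighbor table
pm25 = pd.DataFrame({
    'hex7_id': ['A', 'A', 'B', 'B'],
    'timestamp': [ts[0], ts[1], ts[0], ts[1]],
    'nearest_traffic_hex': ['A', 'A', 'T', 'T'],
    'has_local_traffic': [True, True, False, False],
})

# Toy hourly traffic: hexagon T has no record for the second hour
traffic = pd.DataFrame({
    'hex7_id': ['A', 'A', 'T'],
    'timestamp': [ts[0], ts[1], ts[0]],
    'avg_traffic_volume': [100.0, 120.0, 300.0],
})

# Join on (source hexagon, hour); indicator marks rows that found a match
joined = pm25.merge(
    traffic.rename(columns={'hex7_id': 'nearest_traffic_hex'}),
    on=['nearest_traffic_hex', 'timestamp'],
    how='left',
    indicator=True,
)
matched = joined['_merge'] == 'both'
joined['traffic_source'] = np.where(
    ~matched, 'missing',
    np.where(joined['has_local_traffic'], 'local', 'nearest'),
)
joined = joined.drop(columns='_merge')
print(joined[['hex7_id', 'timestamp', 'avg_traffic_volume', 'traffic_source']])
```

The same pattern would apply to the weather columns with `nearest_weather_hex` as the join key. Since `process_row` is pure Python, the threading backend is unlikely to give much speedup, so a join may be worth testing on a sample first.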

In [13]:
# Add temporal features
import os
import numpy as np

# Check if enriched data is already loaded or needs to be loaded from file
if 'enriched_data' not in locals() or len(enriched_data) == 0:
    print("Loading enriched data from saved file...")
    
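    # Prefer the newest timestamped file written by the enrichment cell;
    # fall back to the original ml_data/ paths if none is found
    import glob
    parquet_candidates = sorted(glob.glob(f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_*.parquet'))
    csv_candidates = sorted(glob.glob(f'../../data/pm25_enriched_hourly_{PROCESS_YEAR}_*.csv'))
    saved_file_parquet = parquet_candidates[-1] if parquet_candidates else 'ml_data/pm25_enriched_hourly.parquet'
    saved_file_csv = csv_candidates[-1] if csv_candidates else 'ml_data/pm25_enriched_hourly.csv'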
    
    if os.path.exists(saved_file_parquet):
        print(f"Loading from {saved_file_parquet}...")
        enriched_data = pd.read_parquet(saved_file_parquet)
        print(f"✓ Loaded {len(enriched_data):,} records from Parquet")
    elif os.path.exists(saved_file_csv):
        print(f"Loading from {saved_file_csv}...")
        print("Note: CSV loading may take a minute for large datasets...")
        enriched_data = pd.read_csv(saved_file_csv)
        # Convert timestamp to datetime if it's a string
        if 'timestamp' in enriched_data.columns and enriched_data['timestamp'].dtype == 'object':
            enriched_data['timestamp'] = pd.to_datetime(enriched_data['timestamp'])
        print(f"✓ Loaded {len(enriched_data):,} records from CSV")
    else:
        print("ERROR: No saved enriched data found!")
        print("Please run the enrichment cell in Phase 5 first to create the enriched dataset.")
        print("Expected files:")
        print(f"  - {saved_file_parquet}")
        print(f"  - {saved_file_csv}")
else:
    print(f"Using enriched data already in memory ({len(enriched_data):,} records)")

# Now add temporal features
if 'enriched_data' in locals() and len(enriched_data) > 0:
    
    # Check if temporal features already exist
    temporal_features = ['hour', 'day_of_week', 'month', 'is_weekend', 
                        'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 
                        'month_sin', 'month_cos']
    
    existing_features = [f for f in temporal_features if f in enriched_data.columns]
    
    if len(existing_features) == len(temporal_features):
        print(f"\n✓ Temporal features already exist in the dataset")
        print(f"  Features: {', '.join(existing_features)}")
    else:
        print("\nAdding temporal features...")
        
        # Ensure timestamp is datetime
        if 'timestamp' in enriched_data.columns:
            if enriched_data['timestamp'].dtype == 'object':
                print("  Converting timestamp from string to datetime...")
                enriched_data['timestamp'] = pd.to_datetime(enriched_data['timestamp'])
            
            # Extract temporal features
            print("  Extracting hour, day_of_week, month...")
            enriched_data['hour'] = enriched_data['timestamp'].dt.hour
            enriched_data['day_of_week'] = enriched_data['timestamp'].dt.dayofweek
            enriched_data['month'] = enriched_data['timestamp'].dt.month
            enriched_data['is_weekend'] = (enriched_data['day_of_week'] >= 5).astype(int)
            
            # Cyclical encoding
            print("  Creating cyclical encodings...")
            enriched_data['hour_sin'] = np.sin(2 * np.pi * enriched_data['hour'] / 24)
            enriched_data['hour_cos'] = np.cos(2 * np.pi * enriched_data['hour'] / 24)
            enriched_data['dow_sin'] = np.sin(2 * np.pi * enriched_data['day_of_week'] / 7)
            enriched_data['dow_cos'] = np.cos(2 * np.pi * enriched_data['day_of_week'] / 7)
            enriched_data['month_sin'] = np.sin(2 * np.pi * enriched_data['month'] / 12)
            enriched_data['month_cos'] = np.cos(2 * np.pi * enriched_data['month'] / 12)
            
            print(f"\n✓ Temporal features added successfully!")
            print(f"  Dataset shape: {enriched_data.shape}")
            print(f"  New features: {', '.join([f for f in temporal_features if f not in existing_features])}")
        else:
            print("ERROR: No timestamp column found in the dataset!")
    
    # Display dataset info
    print("\nDataset Summary:")
    print("-" * 40)
    print(f"Total records: {len(enriched_data):,}")
    print(f"Total columns: {len(enriched_data.columns)}")
    print(f"Memory usage: {enriched_data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    # Optional: Save the data with temporal features
    SAVE_WITH_TEMPORAL = False  # Set to True if you want to save
    
    if SAVE_WITH_TEMPORAL:
        output_file = 'ml_data/pm25_enriched_hourly_with_temporal.parquet'
        print(f"\nSaving data with temporal features to {output_file}...")
        enriched_data.to_parquet(output_file, index=False, compression='snappy')
        file_size_mb = os.path.getsize(output_file) / 1024 / 1024
        print(f"✓ Saved! File size: {file_size_mb:.1f} MB")
else:
    print("\nNo data available to add temporal features to.")
    print("Please ensure the enriched dataset is created and saved first.")

Using enriched data already in memory (2,049,356 records)

Adding temporal features...
  Extracting hour, day_of_week, month...
  Creating cyclical encodings...

✓ Temporal features added successfully!
  Dataset shape: (2049356, 27)
  New features: hour, day_of_week, month, is_weekend, hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos

Dataset Summary:
----------------------------------------
Total records: 2,049,356
Total columns: 27
Memory usage: 1180.5 MB
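
The sine/cosine pairs place each value on a unit circle, so the last and first values of a cycle end up adjacent, which a raw integer hour (23 vs. 0) would not convey. A small standard-library sketch of that property; the helper names are hypothetical:

```python
import math

def cyclical_encode(value, period):
    """Map a cyclic value (e.g. hour with period 24) to (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b, period):
    """Euclidean distance between two values in the encoded space."""
    sa, ca = cyclical_encode(a, period)
    sb, cb = cyclical_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

wrap_gap = encoded_distance(23, 0, 24)      # across midnight
step_gap = encoded_distance(0, 1, 24)       # ordinary one-hour step
opposite_gap = encoded_distance(0, 12, 24)  # half a day apart

print(f"23h -> 0h: {wrap_gap:.4f}")
print(f"0h -> 1h:  {step_gap:.4f}")
print(f"0h -> 12h: {opposite_gap:.4f}")
```

The same reasoning holds for `month` with period 12: December (12) maps to the same point as a hypothetical month 0, so it sits next to January.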


## Phase 6: Data Quality Analysis

In [14]:
print("Data Quality Analysis")
print("="*60)

print("\nPM2.5 Coverage:")
print(f"  Total hexagons: {enriched_data['hex7_id'].nunique()}")
print(f"  Total records: {len(enriched_data):,}")
print(f"  Missing PM2.5: {enriched_data['pm25_ugm3_mean'].isna().mean():.1%}")

print("\nTraffic Data Sources:")
if 'traffic_source' in enriched_data.columns:
    print(enriched_data['traffic_source'].value_counts())
    nearest_traffic = enriched_data[enriched_data['traffic_source']=='nearest']['traffic_distance_km']
    if len(nearest_traffic) > 0:
        # The distance column may be object dtype, so average the raw values
        print(f"\nAverage distance to traffic data: {nearest_traffic.values.mean():.1f} km")
else:
    print("  Traffic source information not available")

print("\nWeather Data Sources:")
if 'weather_source' in enriched_data.columns:
    print(enriched_data['weather_source'].value_counts())
    nearest_weather = enriched_data[enriched_data['weather_source']=='nearest']['weather_distance_km']
    if len(nearest_weather) > 0:
        # The distance column may be object dtype, so average the raw values
        print(f"\nAverage distance to weather data: {nearest_weather.values.mean():.1f} km")
else:
    print("  Weather source information not available")

print("\nFeature Completeness:")
for col in enriched_data.columns:
    if col not in ['hex7_id', 'timestamp', 'nearest_traffic_hex', 'nearest_weather_hex']:
        missing_pct = enriched_data[col].isna().mean() * 100
        if missing_pct > 0:
            print(f"  {col}: {100-missing_pct:.1f}% complete")

Data Quality Analysis

PM2.5 Coverage:
  Total hexagons: 629
  Total records: 2,049,356
  Missing PM2.5: 20.7%

Traffic Data Sources:
traffic_source
nearest    1921387
local        80447
missing      45435
Name: count, dtype: int64

Average distance to traffic data: 6.1 km

Weather Data Sources:
weather_source
nearest    1967412
missing      54417
local        27527
Name: count, dtype: int64

Average distance to weather data: 22.3 km

Feature Completeness:
  pm25_ugm3_mean: 79.3% complete
  avg_traffic_volume: 97.7% complete
  traffic_source: 99.9% complete
  temperature_c_mean: 97.1% complete
  humidity_pct_mean: 97.0% complete
  precipitation_mm_mean: 97.0% complete


In [15]:
class PM25QueryInterface:
    def __init__(self, enriched_data, pm25_registry):
        self.data = enriched_data
        self.registry = pm25_registry
        self.hex_locations = {hex_id: (info['center_lat'], info['center_lon']) 
                              for hex_id, info in pm25_registry.items()}
    
    def find_nearest_pm25_hexagon(self, query_lat, query_lon):
        """Find the nearest PM2.5 hexagon to a query location"""
        min_distance = float('inf')
        nearest_hex = None
        
        for hex_id, (lat, lon) in self.hex_locations.items():
            distance = haversine_distance(query_lat, query_lon, lat, lon)
            if distance < min_distance:
                min_distance = distance
                nearest_hex = hex_id
        
        return nearest_hex, min_distance
    
    def get_confidence_score(self, distance_km):
        """Calculate confidence score based on distance"""
        if distance_km < 5:
            return 'high', 0.9
        elif distance_km < 20:
            return 'medium', 0.7
        elif distance_km < 50:
            return 'low', 0.5
        else:
            return 'very_low', 0.3
    
    def query_location(self, lat, lon, timestamp=None):
        """Query PM2.5 data for a specific location"""
        
        # Find nearest PM2.5 hexagon
        nearest_hex, distance_km = self.find_nearest_pm25_hexagon(lat, lon)
        
        # Get confidence score
        confidence_level, confidence_score = self.get_confidence_score(distance_km)
        
        # Get data for the hexagon
        hex_data = self.data[self.data['hex7_id'] == nearest_hex]
        
        if timestamp:
            # Get data for specific timestamp
            # Lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in pandas >= 2.2
            timestamp = pd.to_datetime(timestamp).floor('h')
            hex_data = hex_data[hex_data['timestamp'] == timestamp]
        
        result = {
            'query_location': {'lat': lat, 'lon': lon},
            'nearest_hexagon': nearest_hex,
            'distance_km': distance_km,
            'confidence_level': confidence_level,
            'confidence_score': confidence_score,
            'data_available': len(hex_data) > 0
        }
        
        if len(hex_data) > 0:
            if timestamp:
                result['pm25_value'] = hex_data['pm25_ugm3_mean'].iloc[0]
                result['traffic_volume'] = hex_data['avg_traffic_volume'].iloc[0]
                result['temperature'] = hex_data['temperature_c_mean'].iloc[0]
                result['humidity'] = hex_data['humidity_pct_mean'].iloc[0]
            else:
                result['pm25_mean'] = hex_data['pm25_ugm3_mean'].mean()
                result['pm25_std'] = hex_data['pm25_ugm3_mean'].std()
                result['data_points'] = len(hex_data)
        
        return result

# Create query interface
query_interface = PM25QueryInterface(enriched_data, pm25_registry)
print("Query interface created")

Query interface created
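
`find_nearest_pm25_hexagon` scans every hexagon center in a Python loop, which is fine for 629 hexagons but becomes the bottleneck for bulk queries. As a rough sketch of an alternative, the same search can be done with one vectorized haversine computation in NumPy. The function name, the Earth radius constant, and the hexagon IDs and coordinates below are assumptions for illustration; the notebook's own `haversine_distance` is not reproduced here.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def nearest_hexagon(query_lat, query_lon, hex_ids, hex_lats, hex_lons):
    """Return (hex_id, distance_km) of the center closest to the query point."""
    lat1 = np.radians(query_lat)
    lon1 = np.radians(query_lon)
    lat2 = np.radians(np.asarray(hex_lats, dtype=float))
    lon2 = np.radians(np.asarray(hex_lons, dtype=float))

    # Haversine formula evaluated for all centers at once
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    i = int(np.argmin(dist_km))
    return hex_ids[i], float(dist_km[i])

# Hypothetical hexagon IDs and centers, for illustration only
hex_ids = ['hex_tokyo', 'hex_osaka', 'hex_sapporo']
hex_lats = [35.68, 34.69, 43.06]
hex_lons = [139.65, 135.50, 141.35]

nearest_id, nearest_km = nearest_hexagon(35.0, 140.0, hex_ids, hex_lats, hex_lons)
print(nearest_id, round(nearest_km, 1))
```

For many repeated queries, a spatial index (for example a ball tree with a haversine metric) would scale better still, but that adds a dependency this notebook does not otherwise use.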


In [16]:
# Test the query interface
print("Testing query interface...\n")

# Test locations
test_locations = [
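    (35.6762, 139.6503, "Tokyo Station"),  # NB: these are generic central-Tokyo coordinates; Tokyo Station itself is near (35.6812, 139.7671)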
    (34.6937, 135.5023, "Osaka"),
    (43.0642, 141.3469, "Sapporo"),
    (35.0, 140.0, "Rural area")
]

for lat, lon, name in test_locations:
    result = query_interface.query_location(lat, lon)
    print(f"{name} ({lat:.2f}, {lon:.2f}):")
    print(f"  Nearest PM2.5 sensor: {result['distance_km']:.1f} km away")
    print(f"  Confidence: {result['confidence_level']} ({result['confidence_score']:.1f})")
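    # Guard against a missing or NaN mean: formatting the 'N/A' fallback with :.1f would raise ValueError
    if result['data_available'] and pd.notna(result.get('pm25_mean')):
        print(f"  Average PM2.5: {result['pm25_mean']:.1f} μg/m³")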
    print()

Testing query interface...

Tokyo Station (35.68, 139.65):
  Nearest PM2.5 sensor: 3.3 km away
  Confidence: high (0.9)
  Average PM2.5: 8.4 μg/m³

Osaka (34.69, 135.50):
  Nearest PM2.5 sensor: 2.1 km away
  Confidence: high (0.9)
  Average PM2.5: 7.1 μg/m³

Sapporo (43.06, 141.35):
  Nearest PM2.5 sensor: 2.3 km away
  Confidence: high (0.9)
  Average PM2.5: 5.2 μg/m³

Rural area (35.00, 140.00):
  Nearest PM2.5 sensor: 35.6 km away
  Confidence: low (0.5)
  Average PM2.5: 9.9 μg/m³
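
The test cell only exercises the aggregated path (`timestamp=None`). The timestamped path floors the query time to the hour and then matches it against the hourly records. Below is a minimal, self-contained sketch of that step on synthetic data; `lookup_hour` is a hypothetical helper and the values are invented, only the column names follow the notebook.

```python
import pandas as pd

# Synthetic hourly records for one hypothetical hexagon
demo = pd.DataFrame({
    'hex7_id': ['hex_a'] * 3,
    'timestamp': pd.to_datetime(['2023-06-01 08:00',
                                 '2023-06-01 09:00',
                                 '2023-06-01 10:00']),
    'pm25_ugm3_mean': [8.0, 9.5, 7.0],
})

def lookup_hour(df, hex_id, timestamp):
    """Return the PM2.5 value for the hour containing `timestamp`, or None."""
    ts = pd.to_datetime(timestamp).floor('h')  # 09:42 -> 09:00
    rows = df[(df['hex7_id'] == hex_id) & (df['timestamp'] == ts)]
    if rows.empty:
        return None
    value = rows['pm25_ugm3_mean'].iloc[0]
    return None if pd.isna(value) else float(value)

value = lookup_hour(demo, 'hex_a', '2023-06-01 09:42')
missing = lookup_hour(demo, 'hex_a', '2023-06-01 12:00')
print(value, missing)
```

Returning `None` for both a missing hour and a NaN reading keeps the caller's handling simple, at the cost of not distinguishing the two cases.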



## Phase 9: Visualization

In [17]:
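class PM25QueryInterface:
    """Nearest-hexagon lookup over the enriched PM2.5 dataset.

    Redefines the In [15] class so that pm25_registry is optional; without it,
    hexagon centers are taken from the enriched data itself.
    """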
    def __init__(self, enriched_data, pm25_registry=None):
        self.data = enriched_data
        self.registry = pm25_registry
        
        # If no registry provided, create from enriched data
        if pm25_registry is None:
            print("Creating hexagon registry from enriched data...")
            self.hex_locations = {}
            for hex_id in enriched_data['hex7_id'].unique():
                hex_data = enriched_data[enriched_data['hex7_id'] == hex_id].iloc[0]
                self.hex_locations[hex_id] = (hex_data['lat'], hex_data['lon'])
        else:
            self.hex_locations = {hex_id: (info['center_lat'], info['center_lon']) 
                                  for hex_id, info in pm25_registry.items()}
    
    def find_nearest_pm25_hexagon(self, query_lat, query_lon):
        """Find the nearest PM2.5 hexagon to a query location"""
        min_distance = float('inf')
        nearest_hex = None
        
        for hex_id, (lat, lon) in self.hex_locations.items():
            distance = haversine_distance(query_lat, query_lon, lat, lon)
            if distance < min_distance:
                min_distance = distance
                nearest_hex = hex_id
        
        return nearest_hex, min_distance
    
    def get_confidence_score(self, distance_km):
        """Calculate confidence score based on distance"""
        if distance_km < 5:
            return 'high', 0.9
        elif distance_km < 20:
            return 'medium', 0.7
        elif distance_km < 50:
            return 'low', 0.5
        else:
            return 'very_low', 0.3
    
    def query_location(self, lat, lon, timestamp=None):
        """Query PM2.5 data for a specific location"""
        
        # Find nearest PM2.5 hexagon
        nearest_hex, distance_km = self.find_nearest_pm25_hexagon(lat, lon)
        
        # Get confidence score
        confidence_level, confidence_score = self.get_confidence_score(distance_km)
        
        # Get data for the hexagon
        hex_data = self.data[self.data['hex7_id'] == nearest_hex]
        
        if timestamp:
            # Get data for specific timestamp
            # Lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in pandas >= 2.2
            timestamp = pd.to_datetime(timestamp).floor('h')
            hex_data = hex_data[hex_data['timestamp'] == timestamp]
        
        result = {
            'query_location': {'lat': lat, 'lon': lon},
            'nearest_hexagon': nearest_hex,
            'distance_km': distance_km,
            'confidence_level': confidence_level,
            'confidence_score': confidence_score,
            'data_available': len(hex_data) > 0
        }
        
        if len(hex_data) > 0:
            if timestamp:
                result['pm25_value'] = hex_data['pm25_ugm3_mean'].iloc[0]
                if 'avg_traffic_volume' in hex_data.columns:
                    result['traffic_volume'] = hex_data['avg_traffic_volume'].iloc[0]
                if 'temperature_c_mean' in hex_data.columns:
                    result['temperature'] = hex_data['temperature_c_mean'].iloc[0]
                if 'humidity_pct_mean' in hex_data.columns:
                    result['humidity'] = hex_data['humidity_pct_mean'].iloc[0]
            else:
                result['pm25_mean'] = hex_data['pm25_ugm3_mean'].mean()
                result['pm25_std'] = hex_data['pm25_ugm3_mean'].std()
                result['data_points'] = len(hex_data)
        
        return result

# Load enriched data if not in memory
import os

if 'enriched_data' not in locals():
    print("Loading enriched data...")
    if os.path.exists('ml_data/pm25_enriched_hourly.parquet'):
        enriched_data = pd.read_parquet('ml_data/pm25_enriched_hourly.parquet')
    elif os.path.exists('ml_data/pm25_enriched_hourly.csv'):
        enriched_data = pd.read_csv('ml_data/pm25_enriched_hourly.csv')
        enriched_data['timestamp'] = pd.to_datetime(enriched_data['timestamp'])
    else:
        print("ERROR: No enriched data file found!")
        
# Create query interface
if 'enriched_data' in locals():
    # Try to use pm25_registry if it exists, otherwise pass None
    registry = pm25_registry if 'pm25_registry' in locals() else None
    query_interface = PM25QueryInterface(enriched_data, registry)
    print(f"Query interface created with {len(query_interface.hex_locations)} hexagon locations")
else:
    print("Cannot create query interface - no enriched data available")

Query interface created with 629 hexagon locations
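
When no registry is passed, the constructor filters the full frame once per hexagon, which means 629 passes over roughly two million rows. One possible alternative, sketched here under the assumption that the enriched data carries `lat` and `lon` columns as the fallback branch expects, is to take the first row per hexagon in a single pass. `build_hex_locations` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd

def build_hex_locations(df, id_col='hex7_id', lat_col='lat', lon_col='lon'):
    """Map each hexagon ID to the (lat, lon) of its first row."""
    first_rows = df.drop_duplicates(subset=id_col)  # keeps the first row per hexagon
    return {
        hex_id: (float(lat), float(lon))
        for hex_id, lat, lon in zip(first_rows[id_col],
                                    first_rows[lat_col],
                                    first_rows[lon_col])
    }

# Synthetic demo data (hypothetical IDs and coordinates)
demo = pd.DataFrame({
    'hex7_id': ['hex_a', 'hex_a', 'hex_b'],
    'lat': [35.68, 35.68, 34.69],
    'lon': [139.65, 139.65, 135.50],
})

locations = build_hex_locations(demo)
print(locations)
```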


## Summary

In [None]:
print("="*80)
print("PM2.5 HEXAGON ENRICHMENT COMPLETE")
print("="*80)
print(f"\nCreated enriched dataset with:")
print(f"  - {enriched_data['hex7_id'].nunique()} PM2.5 hexagons at H3 resolution 7")
print(f"  - {len(enriched_data):,} hourly records")
print(f"  - {len(pm25_with_traffic)} hexagons with local traffic data")
print(f"  - {len(pm25_with_weather)} hexagons with local weather data")
print(f"  - Nearest neighbor approximation for remaining hexagons")
print(f"\nOutput files:")
print(f"  - pm25_enriched_hourly.parquet: Enriched dataset")
print(f"  - hexagon_lookup_table.json: Nearest neighbor lookups")
print(f"  - pm25_enrichment_coverage_map.html: Coverage visualization")
print(f"\nQuery interface ready for location-based PM2.5 lookups")

PM2.5 HEXAGON ENRICHMENT COMPLETE

Created enriched dataset with:
  - 629 PM2.5 hexagons at H3 resolution 7
  - 2,049,356 hourly records
  - 26 hexagons with local traffic data
  - 8 hexagons with local weather data
  - Nearest neighbor approximation for remaining hexagons

Output files:
  - pm25_enriched_hourly.parquet: Enriched dataset
  - hexagon_lookup_table.json: Nearest neighbor lookups
  - pm25_enrichment_coverage_map.html: Coverage visualization

Query interface ready for location-based PM2.5 lookups

