# Singapore Air Quality Data Merging

Merge the cleaned datasets (pollutants, air temperature, weather forecast) into a single national-level daily dataset for modeling.

**Prerequisites:** Run `singapore_data_cleaning_processing.ipynb` first to generate:
- `pollutants_clean.csv`
- `airtemp_national.csv`
- `weather_forecast_national.csv`


## Step 6: Load Cleaned Datasets

Import libraries and load the cleaned CSV files.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
from pathlib import Path

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.3


In [2]:
# Set up file paths
base_path = Path('/Users/sharin/Downloads/COS30049/Assignment/Assignment_2/AirAware')
data_path = base_path / 'data' / 'singapore'
output_path = base_path / 'data' / 'singapore'
viz_path = base_path / 'visualizations'

print(f"Base path: {base_path}")
print(f"Data path: {data_path}")

# Load cleaned datasets
print("\n" + "="*50)
print("LOADING CLEANED DATASETS")
print("="*50)

# Load pollutants data
pollutants_file = data_path / 'pollutants_clean.csv'
if pollutants_file.exists():
    pollutants_df = pd.read_csv(pollutants_file)
    pollutants_df['date'] = pd.to_datetime(pollutants_df['date'])
    print(f"\n✓ Loaded pollutants data: {pollutants_df.shape}")
    print(f"  Date range: {pollutants_df['date'].min()} to {pollutants_df['date'].max()}")
    print(f"  Regions: {pollutants_df['region'].nunique()}")
else:
    print(f"\n✗ Error: {pollutants_file} not found!")
    pollutants_df = None

# Load temperature data
temp_file = data_path / 'airtemp_national.csv'
if temp_file.exists():
    temp_df = pd.read_csv(temp_file)
    temp_df['date'] = pd.to_datetime(temp_df['date'])
    print(f"\n✓ Loaded temperature data: {temp_df.shape}")
    print(f"  Date range: {temp_df['date'].min()} to {temp_df['date'].max()}")
else:
    print(f"\n✗ Error: {temp_file} not found!")
    temp_df = None

# Load weather forecast data
forecast_file = data_path / 'weather_forecast_national.csv'
if forecast_file.exists():
    forecast_df = pd.read_csv(forecast_file)
    forecast_df['date'] = pd.to_datetime(forecast_df['date'])
    print(f"\n✓ Loaded forecast data: {forecast_df.shape}")
    print(f"  Date range: {forecast_df['date'].min()} to {forecast_df['date'].max()}")
else:
    print(f"\n✗ Error: {forecast_file} not found!")
    forecast_df = None

print("\n" + "="*50)


Base path: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/AirAware
Data path: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/AirAware/data/singapore

LOADING CLEANED DATASETS

✓ Loaded pollutants data: (15980, 8)
  Date range: 2016-02-07 00:00:00 to 2024-12-31 00:00:00
  Regions: 5

✓ Loaded temperature data: (3114, 2)
  Date range: 2016-04-15 00:00:00 to 2024-12-31 00:00:00

✓ Loaded forecast data: (3085, 2)
  Date range: 2016-03-14 00:00:00 to 2024-12-31 00:00:00
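
The three loads above follow the same pattern: check that the file exists, read it, and parse `date`. A minimal sketch of a shared helper is shown below. The name `load_daily_csv` is hypothetical and not part of the notebook; it assumes every cleaned file has a `date` column, as the three files here do.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_daily_csv(path: Path, date_col: str = "date") -> Optional[pd.DataFrame]:
    """Return the CSV at `path` with `date_col` parsed as datetime, or None if absent.

    Hypothetical helper: mirrors the exists-check / read / parse steps used above.
    """
    if not path.exists():
        print(f"✗ Error: {path} not found!")
        return None
    # parse_dates converts the column while reading, replacing a separate to_datetime call
    df = pd.read_csv(path, parse_dates=[date_col])
    print(f"✓ Loaded {path.name}: {df.shape}")
    return df
```

With such a helper, each dataset could be loaded in one line, for example `temp_df = load_daily_csv(data_path / 'airtemp_national.csv')`.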



## Step 7: Merge Datasets

Merge pollutants, temperature, and forecast data into a single dataset.

**Strategy:** Aggregate pollutants to the national level by averaging across regions, then merge all three datasets on `date` using an outer join, so that dates present in only one source are kept.
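
The two steps of this strategy can be sketched on toy data. The values and region names below are illustrative only and are not taken from the real files. The `indicator=True` and `validate="one_to_one"` arguments are optional extras, not used in the notebook cell, that make it easier to see which source each date came from and to catch duplicate dates.

```python
import pandas as pd

# Toy regional readings: two regions, two dates (values are illustrative only)
regional = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]),
    "region": ["north", "south", "north", "south"],
    "pm25_twenty_four_hourly": [10.0, 14.0, 20.0, 22.0],
})
temperature = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "temperature_national": [27.5, 28.0],
})

# Step 1: collapse regions to one national row per date
national = regional.groupby("date", as_index=False)["pm25_twenty_four_hourly"].mean()

# Step 2: the outer join keeps dates found in either table;
# indicator=True adds a `_merge` column naming the source of each row,
# validate raises if either side has more than one row per date
merged = national.merge(
    temperature, on="date", how="outer", validate="one_to_one", indicator=True
).sort_values("date", ignore_index=True)

print(merged)
```

Rows marked `left_only` or `right_only` in `_merge` are the dates that become missing values in the other source's columns after the join.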


In [3]:
# Prepare datasets for merging
print("=== PREPARING DATASETS FOR MERGING ===\n")

# Get national-level pollutants data (aggregate across regions)
if pollutants_df is not None:
    print("Aggregating pollutants data to national level...")
    pollutants_national = pollutants_df.groupby('date').agg({
        'pm25_twenty_four_hourly': 'mean',
        'pm10_twenty_four_hourly': 'mean',
        'o3_eight_hour_max': 'mean',
        'co_eight_hour_max': 'mean',
        'so2_twenty_four_hourly': 'mean',
        'no2_one_hour_max': 'mean'
    }).reset_index()
    print(f"  ✓ National pollutants shape: {pollutants_national.shape}")
else:
    pollutants_national = None

# Merge datasets
print("\n" + "="*50)
print("MERGING DATASETS")
print("="*50)

if pollutants_national is not None and temp_df is not None and forecast_df is not None:
    # Start with pollutants data
    merged_df = pollutants_national.copy()
    print(f"\nStarting with pollutants: {merged_df.shape}")
    
    # Merge temperature data
    merged_df = pd.merge(merged_df, temp_df, on='date', how='outer')
    print(f"After merging temperature: {merged_df.shape}")
    
    # Merge forecast data
    merged_df = pd.merge(merged_df, forecast_df, on='date', how='outer')
    print(f"After merging forecast: {merged_df.shape}")
    
    # Sort by date
    merged_df = merged_df.sort_values('date').reset_index(drop=True)  # type: ignore
    
    print(f"\n✓ Final merged dataset: {merged_df.shape}")
    print(f"  Date range: {merged_df['date'].min()} to {merged_df['date'].max()}")
    print(f"  Total days: {len(merged_df)}")
    
    # Display first few rows
    print(f"\nFirst 5 rows of merged dataset:")
    display(merged_df.head())
    
    # Check for missing values
    print(f"\nMissing values:")
    missing_summary = merged_df.isnull().sum()
    missing_pct = (missing_summary / len(merged_df) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing Count': missing_summary,
        'Percentage': missing_pct
    })
    display(missing_df[missing_df['Missing Count'] > 0])
    
else:
    print("\n✗ Error: Cannot merge datasets. One or more datasets are missing.")
    merged_df = None


=== PREPARING DATASETS FOR MERGING ===

Aggregating pollutants data to national level...
  ✓ National pollutants shape: (3196, 7)

MERGING DATASETS

Starting with pollutants: (3196, 7)
After merging temperature: (3229, 8)
After merging forecast: (3229, 9)

✓ Final merged dataset: (3229, 9)
  Date range: 2016-02-07 00:00:00 to 2024-12-31 00:00:00
  Total days: 3229

First 5 rows of merged dataset:


| | date | pm25_twenty_four_hourly | pm10_twenty_four_hourly | o3_eight_hour_max | co_eight_hour_max | so2_twenty_four_hourly | no2_one_hour_max | temperature_national | forecast_category_national |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-07 | 10.0 | 20.4 | 56.6 | 0.362 | 3.2 | 6.0 | NaN | NaN |
| 1 | 2016-02-08 | 16.8 | 33.8 | 42.8 | 0.382 | 3.8 | 13.0 | NaN | NaN |
| 2 | 2016-02-09 | 18.758333 | 35.433333 | 44.341667 | 0.398917 | 3.4 | 7.708333 | NaN | NaN |
| 3 | 2016-02-10 | 16.025 | 29.808333 | 30.95 | 0.403667 | 3.491667 | 11.241667 | NaN | NaN |
| 4 | 2016-02-11 | 8.566667 | 17.475 | 27.7 | 0.364 | 3.791667 | 11.133333 | NaN | NaN |



Missing values:


| Column | Missing Count | Percentage (%) |
|---|---|---|
| pm25_twenty_four_hourly | 33 | 1.02 |
| pm10_twenty_four_hourly | 33 | 1.02 |
| o3_eight_hour_max | 33 | 1.02 |
| co_eight_hour_max | 41 | 1.27 |
| so2_twenty_four_hourly | 33 | 1.02 |
| no2_one_hour_max | 33 | 1.02 |
| temperature_national | 115 | 3.56 |
| forecast_category_national | 144 | 4.46 |
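
The counts above show how many values are missing, but not how they are grouped. Whether a 3-row interpolation limit is enough depends on the length of each gap. A minimal sketch for measuring gap lengths follows; the helper name `missing_run_lengths` is hypothetical and the series is toy data.

```python
import numpy as np
import pandas as pd


def missing_run_lengths(series: pd.Series) -> pd.Series:
    """Return the length of each run of consecutive missing values in `series`."""
    is_na = series.isna()
    # A new run id starts whenever the missing/present state flips
    run_id = (is_na != is_na.shift()).cumsum()
    # Keep only the runs made of missing values and count their rows
    return is_na[is_na].groupby(run_id[is_na]).size().reset_index(drop=True)


# Toy column: a leading gap of 1 row, a gap of 2 rows, a trailing gap of 4 rows
toy = pd.Series([np.nan, 1.0, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan])
print(missing_run_lengths(toy).tolist())
```

Applied to a column such as `merged_df['temperature_national']`, the result would show how many gaps exceed the chosen limit and therefore cannot be fully filled.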


### 7.1 Handle Missing Values


In [4]:
# Handle missing values
if merged_df is not None:
    print("=== HANDLING MISSING VALUES ===\n")
    
    # Make a copy for processing
    processed_df = merged_df.copy()
    
    # For numeric columns, use linear interpolation. limit=3 fills at most 3
    # consecutive missing rows per gap; longer gaps are only partially filled.
    numeric_cols = processed_df.select_dtypes(include=[np.number]).columns.tolist()
    
    print("Applying linear interpolation (limit: 3 consecutive days)...")
    for col in numeric_cols:
        before_missing = processed_df[col].isnull().sum()
        processed_df[col] = processed_df[col].interpolate(method='linear', limit=3)
        after_missing = processed_df[col].isnull().sum()
        filled = before_missing - after_missing
        if filled > 0:
            print(f"  {col}: filled {filled} missing values")
    
    # Drop rows still missing PM2.5 after interpolation (the only column treated as critical here)
    print("\nDropping rows with remaining missing PM2.5 values...")
    before_rows = len(processed_df)
    processed_df = processed_df.dropna(subset=['pm25_twenty_four_hourly'])
    after_rows = len(processed_df)
    dropped = before_rows - after_rows
    print(f"  Dropped {dropped} rows ({dropped/before_rows*100:.2f}%)")
    
    print(f"\nFinal dataset shape: {processed_df.shape}")
    print(f"Date range: {processed_df['date'].min()} to {processed_df['date'].max()}")
    
    # Check remaining missing values
    print(f"\nRemaining missing values:")
    missing_final = processed_df.isnull().sum()
    if missing_final.sum() > 0:
        display(missing_final[missing_final > 0])
    else:
        print("  ✓ No missing values!")
    
    # Save merged dataset
    output_file = output_path / 'singapore_merged.csv'
    processed_df.to_csv(output_file, index=False)
    print(f"\n✓ Saved merged dataset to: {output_file}")
    
else:
    print("Cannot process missing values: merged dataset not available")
    processed_df = None


=== HANDLING MISSING VALUES ===

Applying linear interpolation (limit: 3 consecutive days)...
  pm25_twenty_four_hourly: filled 28 missing values
  pm10_twenty_four_hourly: filled 28 missing values
  o3_eight_hour_max: filled 28 missing values
  co_eight_hour_max: filled 36 missing values
  so2_twenty_four_hourly: filled 28 missing values
  no2_one_hour_max: filled 28 missing values
  temperature_national: filled 24 missing values
  forecast_category_national: filled 49 missing values

Dropping rows with remaining missing PM2.5 values...
  Dropped 5 rows (0.15%)

Final dataset shape: (3224, 9)
Date range: 2016-02-07 00:00:00 to 2024-12-31 00:00:00

Remaining missing values:


temperature_national          91
forecast_category_national    91
dtype: int64


✓ Saved merged dataset to: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/AirAware/data/singapore/singapore_merged.csv
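
Two points about the interpolation step are worth a closer look. First, `limit=3` counts rows, not calendar days: the merged frame has 3,229 rows for a date range that spans 3,251 calendar days, so some dates have no row at all. Second, `forecast_category_national` is interpolated linearly along with the measurements; if it is a category code (an assumption based on its name, not confirmed by the outputs shown here), averaging neighbouring codes may produce values that match no category. The sketch below, on toy data, shows one possible alternative: reindex to a full daily calendar, interpolate the continuous column, and forward-fill the code column.

```python
import numpy as np
import pandas as pd

# Toy frame with a missing calendar day (2024-01-03 has no row at all)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
    "temperature_national": [27.0, np.nan, 29.0, 30.0],
    "forecast_category_national": [1.0, np.nan, 2.0, 2.0],
})

# Reindex onto every calendar day so that `limit` counts days, not rows
full_days = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")
daily = toy.set_index("date").reindex(full_days).rename_axis("date")

# Continuous measurement: linear interpolation, at most 3 consecutive days
daily["temperature_national"] = daily["temperature_national"].interpolate(
    method="linear", limit=3
)
# Category code (assumed): carry the last known value forward instead of averaging codes
daily["forecast_category_national"] = daily["forecast_category_national"].ffill(limit=3)

print(daily.reset_index())
```

Forward-filling is only one option for the code column; leaving it missing or using the most frequent category are equally defensible, depending on how the model consumes it.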


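## Summary

The merged dataset is saved as `singapore_merged.csv` in `data/singapore/`. It contains 3,224 daily rows and 9 columns covering 2016-02-07 to 2024-12-31. After interpolation, `temperature_national` and `forecast_category_national` each still have 91 missing values; all pollutant columns are complete.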
