# Singapore Final Data Merge (2015-2024)

This notebook merges the cleaned Singapore pollutant and weather data into the final format:
- **Columns**: Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed
- **Output**: Single merged file saved to `data/singapore/singapore_merged_2015_2024.csv`

In [8]:
# Import necessary libraries
import pandas as pd
from pathlib import Path

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Step 1: Setup Paths

In [9]:
# Define base paths
base_path = Path("/Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA")
data_path = base_path / "data" / "singapore"
clean_path = data_path / "clean"

print(f"📁 Clean data location: {clean_path}")
print(f"📁 Output will be saved to: {data_path}")

📁 Clean data location: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA/data/singapore/clean
📁 Output will be saved to: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA/data/singapore
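
The absolute path above resolves only on the original author's machine. A minimal sketch of one portable alternative follows. It assumes the project root is supplied through an environment variable and falls back to the current working directory; the variable name `PROJECT_ROOT` and the helper `resolve_base_path` are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_base_path(env_var: str = "PROJECT_ROOT") -> Path:
    """Return the project root from an environment variable, else the current directory.

    `PROJECT_ROOT` is a hypothetical variable name chosen for this sketch.
    """
    root = os.environ.get(env_var)
    return Path(root).expanduser().resolve() if root else Path.cwd()


# Same sub-folder layout as the notebook, built on the resolved root
base_path = resolve_base_path()
data_path = base_path / "data" / "singapore"
clean_path = data_path / "clean"
```

With this in place, the notebook would run unchanged on another machine as long as it is started from the repository root or the variable is set.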


## Step 2: Load and Combine Year-by-Year Data

In [10]:
# Load all pollutants data (year by year)
print("Loading pollutants data...")
pollutants_dfs = []
for year in range(2015, 2025):
    file_path = clean_path / 'pollutants' / f'pollutants_clean_{year}.csv'
    if file_path.exists():
        df = pd.read_csv(file_path)
        pollutants_dfs.append(df)
        print(f"  ✓ Loaded pollutants_{year}: {len(df)} records")
    else:
        print(f"  ⚠️  File not found: {file_path}")

# Combine all pollutants data
if pollutants_dfs:
    pollutants_data = pd.concat(pollutants_dfs, ignore_index=True)
    pollutants_data['Date'] = pd.to_datetime(pollutants_data['Date'])
    print(f"\n✅ Total pollutants records: {len(pollutants_data)}")
    print(f"   Date range: {pollutants_data['Date'].min()} to {pollutants_data['Date'].max()}")
else:
    print("\n❌ No pollutants data found!")
    pollutants_data = None

Loading pollutants data...
  ✓ Loaded pollutants_2015: 365 records
  ✓ Loaded pollutants_2016: 319 records
  ✓ Loaded pollutants_2017: 361 records
  ✓ Loaded pollutants_2018: 365 records
  ✓ Loaded pollutants_2019: 360 records
  ✓ Loaded pollutants_2020: 362 records
  ✓ Loaded pollutants_2021: 353 records
  ✓ Loaded pollutants_2022: 364 records
  ✓ Loaded pollutants_2023: 346 records
  ✓ Loaded pollutants_2024: 366 records

✅ Total pollutants records: 3561
   Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00


In [11]:
# Load all weather data (year by year)
print("\nLoading weather data...")
weather_dfs = []
for year in range(2015, 2025):
    file_path = clean_path / 'weather' / f'weather_clean_{year}.csv'
    if file_path.exists():
        df = pd.read_csv(file_path)
        weather_dfs.append(df)
        print(f"  ✓ Loaded weather_{year}: {len(df)} records")
    else:
        print(f"  ⚠️  File not found: {file_path}")

# Combine all weather data
if weather_dfs:
    weather_data = pd.concat(weather_dfs, ignore_index=True)
    weather_data['Date'] = pd.to_datetime(weather_data['Date'])
    print(f"\n✅ Total weather records: {len(weather_data)}")
    print(f"   Date range: {weather_data['Date'].min()} to {weather_data['Date'].max()}")
else:
    print("\n❌ No weather data found!")
    weather_data = None


Loading weather data...
  ✓ Loaded weather_2015: 365 records
  ✓ Loaded weather_2016: 366 records
  ✓ Loaded weather_2017: 365 records
  ✓ Loaded weather_2018: 365 records
  ✓ Loaded weather_2019: 365 records
  ✓ Loaded weather_2020: 366 records
  ✓ Loaded weather_2021: 365 records
  ✓ Loaded weather_2022: 365 records
  ✓ Loaded weather_2023: 365 records
  ✓ Loaded weather_2024: 366 records

✅ Total weather records: 3653
   Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00
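
The two loading cells differ only in the folder name and file prefix, so they could share one helper. The sketch below is an illustration, not the notebook's actual code: `load_yearly` is a hypothetical name, and it assumes every file follows the `<kind>/<kind>_clean_<year>.csv` pattern used above and contains a `Date` column.

```python
from pathlib import Path

import pandas as pd


def load_yearly(clean_path, kind, years=range(2015, 2025)):
    """Concatenate `<kind>/<kind>_clean_<year>.csv` files, skipping years with no file.

    Returns None when no file is found, mirroring the notebook's behaviour.
    """
    frames = []
    for year in years:
        file_path = Path(clean_path) / kind / f"{kind}_clean_{year}.csv"
        if file_path.exists():
            # parse_dates converts the Date column while reading
            frames.append(pd.read_csv(file_path, parse_dates=["Date"]))
        else:
            print(f"  File not found: {file_path}")
    if not frames:
        return None
    return pd.concat(frames, ignore_index=True)
```

Usage would then be `pollutants_data = load_yearly(clean_path, "pollutants")` and `weather_data = load_yearly(clean_path, "weather")`.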


## Step 3: Merge Pollutants and Weather Data

In [12]:
if pollutants_data is not None and weather_data is not None:
    print("\n" + "="*80)
    print("MERGING DATA")
    print("="*80)
    
    # Select only the columns we need from pollutants (Country, Region, Date, AQI)
    pollutants_subset = pollutants_data[['Country', 'Region', 'Date', 'AQI']].copy()
    
    # Select only the columns we need from weather (Date, Temperature, RelativeHumidity, WindSpeed)
    weather_subset = weather_data[['Date', 'Temperature', 'RelativeHumidity', 'WindSpeed']].copy()
    
    # Merge on Date (inner join to keep only matching records)
    merged_data = pd.merge(
        pollutants_subset,
        weather_subset,
        on='Date',
        how='inner'
    )
    
    # Reorder columns to match final structure: Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed
    merged_data = merged_data[['Country', 'Region', 'Date', 'AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']]
    
    # Sort by Date
    merged_data = merged_data.sort_values('Date').reset_index(drop=True)
    
    print("\n✅ Merge completed successfully!")
    print(f"   Total merged records: {len(merged_data)}")
    print(f"   Date range: {merged_data['Date'].min()} to {merged_data['Date'].max()}")
    print(f"   Columns: {list(merged_data.columns)}")
    
else:
    print("\n❌ Cannot merge - missing pollutants or weather data!")
    merged_data = None


MERGING DATA

✅ Merge completed successfully!
   Total merged records: 3561
   Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00
   Columns: ['Country', 'Region', 'Date', 'AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']
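
An inner join silently drops dates that exist on only one side, and it would multiply rows if the weather table ever contained duplicate dates. The sketch below shows one way to audit both risks using the `indicator` and `validate` parameters of `pd.merge`. The two small frames are made-up demo data, not values from the Singapore files.

```python
import pandas as pd

# Made-up demo frames: 2015-01-03 is missing from pollutants, 2015-01-04 from weather
pollutants_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-04"]),
    "AQI": [50.0, 55.0, 60.0],
})
weather_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "Temperature": [26.1, 26.5, 27.0],
})

# indicator=True adds a _merge column saying which side supplied each row.
# validate="many_to_one" raises MergeError if weather has duplicate dates.
audit = pd.merge(
    pollutants_demo,
    weather_demo,
    on="Date",
    how="outer",
    indicator=True,
    validate="many_to_one",
)

unmatched = audit[audit["_merge"] != "both"]
merged_demo = audit[audit["_merge"] == "both"].drop(columns="_merge")

print(unmatched[["Date", "_merge"]])
```

Filtering the audit frame to `_merge == "both"` gives the same rows as the inner join, while `unmatched` lists the dates that were left out and why.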


## Step 4: Data Quality Check

In [13]:
if merged_data is not None:
    print("\n" + "="*80)
    print("DATA QUALITY CHECK")
    print("="*80)
    
    # Check for missing values
    print("\nMissing values:")
    print(merged_data.isnull().sum())
    
    # Check data types
    print("\nData types:")
    print(merged_data.dtypes)
    
    # Check region distribution
    print("\nRegion distribution:")
    print(merged_data['Region'].value_counts().sort_index())
    
    # Statistical summary
    print("\nStatistical summary:")
    print(merged_data[['AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']].describe())
    
    # Sample data
    print("\nSample data (first 10 rows):")
    print(merged_data.head(10))


DATA QUALITY CHECK

Missing values:
Country             0
Region              0
Date                0
AQI                 0
Temperature         0
RelativeHumidity    0
WindSpeed           0
dtype: int64

Data types:
Country                     object
Region                      object
Date                datetime64[ns]
AQI                        float64
Temperature                float64
RelativeHumidity           float64
WindSpeed                  float64
dtype: object

Region distribution:
Region
Central       715
East          715
North         709
North-East    713
West          709
Name: count, dtype: int64

Statistical summary:
               AQI  Temperature  RelativeHumidity    WindSpeed
count  3561.000000  3561.000000       3561.000000  3561.000000
mean     48.174406    26.575667         85.714302     8.252011
std      10.154003     0.880058          4.021479     2.784917
min      18.900000    22.530000         63.920000     2.990000
25%      41.740000    26.000000         83
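
The checks in this step are printed for a human to read. A minimal sketch of the same idea as machine-checkable flags follows; the function name `check_quality` and the thresholds (humidity between 0 and 100, non-negative AQI) are assumptions made for illustration, not rules taken from the cleaning pipeline.

```python
import pandas as pd


def check_quality(df: pd.DataFrame) -> dict:
    """Return pass/fail flags for a merged frame (thresholds are illustrative)."""
    return {
        "no_missing": bool(not df.isnull().any().any()),
        "sorted_by_date": bool(df["Date"].is_monotonic_increasing),
        "humidity_in_range": bool(df["RelativeHumidity"].between(0, 100).all()),
        "aqi_non_negative": bool((df["AQI"] >= 0).all()),
    }
```

Calling `check_quality(merged_data)` and asserting `all(...)` on the returned values would stop the notebook before the save step if any check fails.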

## Step 5: Save Merged Data

In [14]:
if merged_data is not None:
    # Save the merged data
    output_file = data_path / 'singapore_merged_2015_2024.csv'
    merged_data.to_csv(output_file, index=False)
    
    print("\n" + "="*80)
    print("MERGE COMPLETE!")
    print("="*80)
    print(f"\n✅ Merged data saved to: {output_file}")
    print(f"   Total records: {len(merged_data)}")
    print(f"   Columns: {list(merged_data.columns)}")
    print(f"   File size: {output_file.stat().st_size / 1024:.2f} KB")
    print("\n📊 Final structure: Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed")
    print("\n🎉 Data is ready for analysis and modeling!")
else:
    print("\n❌ Merge failed - no data to save")


MERGE COMPLETE!

✅ Merged data saved to: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA/data/singapore/singapore_merged_2015_2024.csv
   Total records: 3561
   Columns: ['Country', 'Region', 'Date', 'AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']
   File size: 176.87 KB

📊 Final structure: Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed

🎉 Data is ready for analysis and modeling!
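
CSV does not store column types, so the `Date` column has to be parsed again when the file is read back. The sketch below round-trips a small made-up frame through a temporary file to show the pattern; the file name `demo_merged.csv` and the values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up demo rows, not values from the merged Singapore file
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "AQI": [48.17, 51.3],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "demo_merged.csv"
    demo.to_csv(out, index=False)
    # Without parse_dates, Date would come back as plain strings rather than datetimes
    reloaded = pd.read_csv(out, parse_dates=["Date"])

dates_match = bool((reloaded["Date"] == demo["Date"]).all())
values_match = bool((reloaded["AQI"] == demo["AQI"]).all())
print(dates_match, values_match)
```

The same `parse_dates=["Date"]` argument would apply when loading `singapore_merged_2015_2024.csv` in a later notebook.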


## Summary

This notebook successfully merged cleaned Singapore data (2015-2024) into a single file.

### Output File:
**`singapore_merged_2015_2024.csv`**
- **Location**: `data/singapore/`
- **Columns**: Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed
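- **Records**: 3,561 daily records (the inner join keeps only dates present in the pollutants data; the weather data alone covers 3,653 days)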
- **Date Range**: 2015-01-01 to 2024-12-31
- **Regions**: Central, East, North, North-East, West

### Data Quality:
- ✅ No missing values
- ✅ Numeric values appear to be stored to 2 decimal places, as delivered by the cleaned input files (this notebook does not round)
- ✅ Data sorted by date
- ✅ Consistent column names and formats
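
The decimal-places item in this list is not verified by any cell in the notebook. A minimal sketch of how it could be checked follows; `is_rounded` is a hypothetical helper, and the small tolerance is an assumption to absorb floating-point noise.

```python
import numpy as np
import pandas as pd


def is_rounded(series: pd.Series, decimals: int = 2) -> bool:
    """True if every value equals itself rounded to `decimals` places."""
    # rtol=0 keeps the tolerance absolute, so large values are not treated leniently
    return bool(np.isclose(series, series.round(decimals), rtol=0, atol=1e-9).all())
```

Applied to the merged frame, this could look like `all(is_rounded(merged_data[c]) for c in ['AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed'])`.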

### Next Steps:
1. Use this merged file for data analysis
2. Create visualizations (AQI trends, correlation plots, etc.)
3. Build machine learning models for AQI prediction
4. Compare with Thailand data for regional analysis