# Singapore Data Cleaning and Processing (2015-2024)

This notebook cleans and processes raw Singapore data:
- **Pollutants data** (PM10, PM2.5, CO, NO2, SO2, O3, AQI)
- **Weather data** (Temperature, Humidity, Wind Speed)
- **Air Temperature data** (Daily temperature readings)

Cleaned data will be saved to: `data/singapore/clean/`

In [24]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
from pathlib import Path

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Step 1: Setup Paths and Verify Data

In [25]:
# Define base paths
base_path = Path("/Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA")
data_path = base_path / "data" / "singapore"
raw_path = data_path / "raw"
clean_path = data_path / "clean"

# Create clean directory if it doesn't exist
os.makedirs(clean_path, exist_ok=True)

# Verify raw data directories exist
subdirs = ['pollutants', 'weather', 'air_temperature']
for subdir in subdirs:
    path = raw_path / subdir
    if path.exists():
        files = list(path.glob('*.csv'))
        print(f"✅ {subdir}: {len(files)} files found")
    else:
        print(f"❌ {subdir}: Directory not found!")

print(f"\n📁 Clean data will be saved to: {clean_path}")

✅ pollutants: 10 files found
✅ weather: 10 files found
✅ air_temperature: 10 files found

📁 Clean data will be saved to: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA/data/singapore/clean
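
The base path above is tied to one machine. A minimal sketch of a more portable setup, assuming the notebook is launched from the project root or that an environment variable points at it. The variable name `PROJECT_ROOT` is hypothetical and not part of this project.

```python
import os
from pathlib import Path

# PROJECT_ROOT is a hypothetical variable name; fall back to the
# current working directory when it is not set.
base_path = Path(os.environ.get('PROJECT_ROOT', Path.cwd()))
data_path = base_path / 'data' / 'singapore'
raw_path = data_path / 'raw'
clean_path = data_path / 'clean'

# The layout below base_path is unchanged from the notebook
print(clean_path.relative_to(base_path).as_posix())
```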


## Step 2: Clean Pollutants Data (2015-2024)

In [26]:
def process_pollutants_data(years):
    """
    Process pollutants data for given years.
    Aggregates hourly data to daily averages and handles missing values.
    """
    all_data = []
    
    for year in years:
        file_path = raw_path / 'pollutants' / f'pollutants_{year}.csv'
        
        if not file_path.exists():
            print(f"⚠️  Skipping {year}: File not found")
            continue
        
        print(f"Processing pollutants data for {year}...")
        
        # Read data in chunks to handle large files
        chunks = []
        for chunk in pd.read_csv(file_path, chunksize=10000):
            # Convert date column to datetime
            chunk['date'] = pd.to_datetime(chunk['date'])
            
            # The 2015 files (Open-Meteo) already use the standard column names;
            # the 2016+ files (Singapore API) are renamed to match.
            if 'pm10' not in chunk.columns:
                chunk = chunk.rename(columns={
                    'pm10_twenty_four_hourly': 'pm10',
                    'pm25_twenty_four_hourly': 'pm2_5',
                    'co_eight_hour_max': 'carbon_monoxide',
                    'no2_one_hour_max': 'nitrogen_dioxide',
                    'so2_twenty_four_hourly': 'sulphur_dioxide',
                    'o3_eight_hour_max': 'ozone',
                    'psi_twenty_four_hourly': 'aqi'
                })
            
            # Group by date and calculate daily averages (same for both formats)
            daily_chunk = chunk.groupby('date').agg({
                'pm10': 'mean',
                'pm2_5': 'mean',
                'carbon_monoxide': 'mean',
                'nitrogen_dioxide': 'mean',
                'sulphur_dioxide': 'mean',
                'ozone': 'mean',
                'aqi': 'mean'
            }).reset_index()
            
            chunks.append(daily_chunk)
        
        # Combine all chunks for this year
        year_data = pd.concat(chunks, ignore_index=True)
        
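        # Group again because a date's rows can be split across chunks.
        # Note: this averages the per-chunk means, which is only an
        # approximation when a date is split unevenly between two chunks.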
        year_data = year_data.groupby('date').mean().reset_index()
        
        all_data.append(year_data)
        print(f"  ✓ {year}: {len(year_data)} daily records processed")
    
    # Combine all years
    if not all_data:
        print("❌ No pollutants data found!")
        return None
    
    combined_data = pd.concat(all_data, ignore_index=True)
    combined_data = combined_data.sort_values('date').reset_index(drop=True)
    
    # Handle missing values with .ffill()/.bfill() (not the deprecated
    # fillna(method=...)), then zero-fill any column that is still empty
    pollutant_cols = ['pm10', 'pm2_5', 'carbon_monoxide', 'nitrogen_dioxide', 'sulphur_dioxide', 'ozone', 'aqi']
    combined_data[pollutant_cols] = combined_data[pollutant_cols].ffill().bfill().fillna(0)
    
    # Standardize column names for merging (matching final structure)
    combined_data = combined_data.rename(columns={'aqi': 'AQI'})
    
    # Round all numeric columns to 2 decimal places
    numeric_cols = ['pm10', 'pm2_5', 'carbon_monoxide', 'nitrogen_dioxide', 'sulphur_dioxide', 'ozone', 'AQI']
    combined_data[numeric_cols] = combined_data[numeric_cols].round(2)
    
    # Add Country column
    combined_data['Country'] = 'Singapore'
    
    # Assign regions cyclically based on date (to distribute data across 5 regions)
    # Singapore regions: Central, East, North, North-East, West
    regions = ['Central', 'East', 'North', 'North-East', 'West']
    combined_data['Region'] = combined_data['date'].apply(lambda x: regions[x.dayofyear % 5])
    
    # Rename date column to match final format (after using it for region assignment)
    combined_data = combined_data.rename(columns={'date': 'Date'})
    
    # Reorder columns: Country, Region, Date, then pollutants
    cols = ['Country', 'Region', 'Date', 'pm10', 'pm2_5', 'carbon_monoxide', 
            'nitrogen_dioxide', 'sulphur_dioxide', 'ozone', 'AQI']
    combined_data = combined_data[cols]
    
    return combined_data

# Process pollutants data for 2015-2024
pollutants_clean = process_pollutants_data(range(2015, 2025))

if pollutants_clean is not None:
    print(f"\n✅ Total pollutants records: {len(pollutants_clean)}")
    print(f"Date range: {pollutants_clean['Date'].min()} to {pollutants_clean['Date'].max()}")
    print(f"\nSample data:")
    print(pollutants_clean.head())
    print(f"\nMissing values:")
    print(pollutants_clean.isnull().sum())

Processing pollutants data for 2015...
  ✓ 2015: 365 daily records processed
Processing pollutants data for 2016...
  ✓ 2016: 319 daily records processed
Processing pollutants data for 2017...
  ✓ 2017: 361 daily records processed
Processing pollutants data for 2018...
  ✓ 2018: 365 daily records processed
Processing pollutants data for 2019...
  ✓ 2019: 360 daily records processed
Processing pollutants data for 2020...
  ✓ 2020: 362 daily records processed
Processing pollutants data for 2021...
  ✓ 2021: 353 daily records processed
Processing pollutants data for 2022...
  ✓ 2022: 364 daily records processed
Processing pollutants data for 2023...
  ✓ 2023: 346 daily records processed
Processing pollutants data for 2024...
  ✓ 2024: 366 daily records processed

✅ Total pollutants records: 3561
Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00

Sample data:
     Country      Region       Date   pm10  pm2_5  carbon_monoxide  \
0  Singapore        East 2015-01-01  19.71  13.62        
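
One caveat with the chunked aggregation in this step: when a date's rows straddle a chunk boundary, averaging the per-chunk means gives each partial mean equal weight, regardless of how many rows produced it. A minimal sketch of an exact alternative, which accumulates per-chunk sums and counts and divides once at the end. The helper name `exact_daily_mean` and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def exact_daily_mean(chunks, value_cols):
    """Combine per-chunk sums and counts so every row carries equal weight."""
    sums, counts = [], []
    for chunk in chunks:
        grouped = chunk.groupby('date')[value_cols]
        sums.append(grouped.sum())      # NaN values are skipped by default
        counts.append(grouped.count())  # counts only non-null values
    total_sum = pd.concat(sums).groupby(level=0).sum()
    total_count = pd.concat(counts).groupby(level=0).sum()
    return (total_sum / total_count).reset_index()

# Hypothetical data: three readings for one date, split 2 + 1 across chunks
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01'] * 3),
    'pm10': [10.0, 20.0, 60.0],
})
chunks = [df.iloc[:2], df.iloc[2:]]

exact = exact_daily_mean(chunks, ['pm10'])
mean_of_means = (
    pd.concat([c.groupby('date')['pm10'].mean() for c in chunks])
    .groupby(level=0).mean()
)
print(exact['pm10'].iloc[0])   # 30.0  (90 / 3)
print(mean_of_means.iloc[0])   # 37.5  ((15 + 60) / 2)
```

Whether the difference matters here depends on how often a date actually crosses a chunk boundary in the raw files.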

## Step 3: Clean Weather Data (2015-2024)

In [27]:
def process_weather_data(years):
    """
    Process weather data for given years.
    Weather data is already in daily format from the API.
    """
    all_data = []
    
    for year in years:
        file_path = raw_path / 'weather' / f'weather_{year}.csv'
        
        if not file_path.exists():
            print(f"⚠️  Skipping {year}: File not found")
            continue
        
        print(f"Processing weather data for {year}...")
        
        # Read weather data
        year_data = pd.read_csv(file_path)
        year_data['date'] = pd.to_datetime(year_data['date'])
        
        all_data.append(year_data)
        print(f"  ✓ {year}: {len(year_data)} daily records processed")
    
    # Combine all years
    if not all_data:
        print("❌ No weather data found!")
        return None
    
    combined_data = pd.concat(all_data, ignore_index=True)
    combined_data = combined_data.sort_values('date').reset_index(drop=True)
    
    # Handle missing values
    weather_cols = ['temperature_2m', 'relative_humidity_2m', 'wind_speed_10m']
    combined_data[weather_cols] = combined_data[weather_cols].ffill().bfill().fillna(0)
    
    # Add Country column
    combined_data['Country'] = 'Singapore'
    
    # Assign regions cyclically based on date (to distribute data across 5 regions)
    # Singapore regions: Central, East, North, North-East, West
    regions = ['Central', 'East', 'North', 'North-East', 'West']
    combined_data['Region'] = combined_data['date'].apply(lambda x: regions[x.dayofyear % 5])
    
    # Standardize column names for merging (matching final structure)
    combined_data = combined_data.rename(columns={
        'temperature_2m': 'Temperature',
        'relative_humidity_2m': 'RelativeHumidity',
        'wind_speed_10m': 'WindSpeed',
        'date': 'Date'
    })
    
    # Round all numeric columns to 2 decimal places
    numeric_cols = ['Temperature', 'RelativeHumidity', 'WindSpeed']
    combined_data[numeric_cols] = combined_data[numeric_cols].round(2)
    
    # Reorder columns: Country, Region, Date, Temperature, RelativeHumidity, WindSpeed
    combined_data = combined_data[['Country', 'Region', 'Date', 'Temperature', 'RelativeHumidity', 'WindSpeed']]
    
    return combined_data

# Process weather data for 2015-2024
weather_clean = process_weather_data(range(2015, 2025))

if weather_clean is not None:
    print(f"\n✅ Total weather records: {len(weather_clean)}")
    print(f"Date range: {weather_clean['Date'].min()} to {weather_clean['Date'].max()}")
    print(f"\nSample data:")
    print(weather_clean.head())
    print(f"\nMissing values:")
    print(weather_clean.isnull().sum())

Processing weather data for 2015...
  ✓ 2015: 365 daily records processed
Processing weather data for 2016...
  ✓ 2016: 366 daily records processed
Processing weather data for 2017...
  ✓ 2017: 365 daily records processed
Processing weather data for 2018...
  ✓ 2018: 365 daily records processed
Processing weather data for 2019...
  ✓ 2019: 365 daily records processed
Processing weather data for 2020...
  ✓ 2020: 366 daily records processed
Processing weather data for 2021...
  ✓ 2021: 365 daily records processed
Processing weather data for 2022...
  ✓ 2022: 365 daily records processed
Processing weather data for 2023...
  ✓ 2023: 365 daily records processed
Processing weather data for 2024...
  ✓ 2024: 366 daily records processed

✅ Total weather records: 3653
Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00

Sample data:
     Country      Region       Date  Temperature  RelativeHumidity  WindSpeed
0  Singapore        East 2015-01-01        24.62             88.69      11.65
1  S

## Step 4: Clean Air Temperature Data (2015-2024)

In [28]:
def process_temperature_data(years):
    """
    Process air temperature data for given years.
    Temperature data is already in daily format from the API.
    """
    all_data = []
    
    for year in years:
        file_path = raw_path / 'air_temperature' / f'airtemp_{year}.csv'
        
        if not file_path.exists():
            print(f"⚠️  Skipping {year}: File not found")
            continue
        
        print(f"Processing air temperature data for {year}...")
        
        # Read temperature data
        year_data = pd.read_csv(file_path)
        year_data['timestamp'] = pd.to_datetime(year_data['timestamp'])
        
        # Rename timestamp to date for consistency
        year_data = year_data.rename(columns={'timestamp': 'date'})
        
        all_data.append(year_data)
        print(f"  ✓ {year}: {len(year_data)} daily records processed")
    
    # Combine all years
    if not all_data:
        print("❌ No air temperature data found!")
        return None
    
    combined_data = pd.concat(all_data, ignore_index=True)
    combined_data = combined_data.sort_values('date').reset_index(drop=True)
    
    # Handle missing values
    combined_data['reading_value'] = combined_data['reading_value'].ffill().bfill().fillna(0)
    
    # Add Country column
    combined_data['Country'] = 'Singapore'
    
    # Assign regions cyclically based on date (to distribute data across 5 regions)
    # Singapore regions: Central, East, North, North-East, West
    regions = ['Central', 'East', 'North', 'North-East', 'West']
    combined_data['Region'] = combined_data['date'].apply(lambda x: regions[x.dayofyear % 5])
    
    # Standardize column names for merging (matching final structure)
    combined_data = combined_data.rename(columns={
        'reading_value': 'Temperature',
        'date': 'Date'
    })
    
    # Round temperature to 2 decimal places
    combined_data['Temperature'] = combined_data['Temperature'].round(2)
    
    # Reorder columns: Country, Region, Date, Temperature, reading_type, station_name
    combined_data = combined_data[['Country', 'Region', 'Date', 'Temperature', 'reading_type', 'station_name']]
    
    return combined_data

# Process air temperature data for 2015-2024
temperature_clean = process_temperature_data(range(2015, 2025))

if temperature_clean is not None:
    print(f"\n✅ Total air temperature records: {len(temperature_clean)}")
    print(f"Date range: {temperature_clean['Date'].min()} to {temperature_clean['Date'].max()}")
    print(f"\nSample data:")
    print(temperature_clean.head())
    print(f"\nMissing values:")
    print(temperature_clean.isnull().sum())

Processing air temperature data for 2015...
  ✓ 2015: 365 daily records processed
Processing air temperature data for 2016...
  ✓ 2016: 366 daily records processed
Processing air temperature data for 2017...
  ✓ 2017: 365 daily records processed
Processing air temperature data for 2018...
  ✓ 2018: 365 daily records processed
Processing air temperature data for 2019...
  ✓ 2019: 365 daily records processed
Processing air temperature data for 2020...
  ✓ 2020: 366 daily records processed
Processing air temperature data for 2021...
  ✓ 2021: 365 daily records processed
Processing air temperature data for 2022...
  ✓ 2022: 365 daily records processed
Processing air temperature data for 2023...
  ✓ 2023: 365 daily records processed
Processing air temperature data for 2024...
  ✓ 2024: 366 daily records processed

✅ Total air temperature records: 3653
Date range: 2015-01-01 00:00:00 to 2024-12-31 00:00:00

Sample data:
     Country      Region       Date  Temperature reading_type  \
0  Sing

## Step 5: Preview Final Merge Structure


In [29]:
# Preview how the final merged data will look
if pollutants_clean is not None and weather_clean is not None:
    print("=" * 80)
    print("PREVIEW: Final Merged Data Structure")
    print("=" * 80)
    
    # Merge pollutants and weather on Date
    preview_merge = pd.merge(
        pollutants_clean[['Country', 'Region', 'Date', 'AQI']],
        weather_clean[['Date', 'Temperature', 'RelativeHumidity', 'WindSpeed']],
        on='Date',
        how='inner'
    )
    
    # Reorder columns to match final structure
    preview_merge = preview_merge[['Country', 'Region', 'Date', 'AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']]
    
    print(f"\nFinal columns: {list(preview_merge.columns)}")
    print(f"Total merged records: {len(preview_merge)}")
    print(f"\nSample merged data (first 5 rows):")
    print(preview_merge.head())
    print(f"\nData types:")
    print(preview_merge.dtypes)
    print(f"\nAll numeric values are rounded to 2 decimal places ✓")
    print("=" * 80)


PREVIEW: Final Merged Data Structure

Final columns: ['Country', 'Region', 'Date', 'AQI', 'Temperature', 'RelativeHumidity', 'WindSpeed']
Total merged records: 3561

Sample merged data (first 5 rows):
     Country      Region       Date    AQI  Temperature  RelativeHumidity  \
0  Singapore        East 2015-01-01  53.46        24.62             88.69   
1  Singapore       North 2015-01-02  53.46        25.49             78.11   
2  Singapore  North-East 2015-01-03  53.46        25.88             79.30   
3  Singapore        West 2015-01-04  53.46        25.80             80.06   
4  Singapore     Central 2015-01-05  53.46        26.06             82.75   

   WindSpeed  
0      11.65  
1      14.05  
2      13.74  
3      12.01  
4       9.49  

Data types:
Country                     object
Region                      object
Date                datetime64[ns]
AQI                        float64
Temperature                float64
RelativeHumidity           float64
WindSpeed              
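
The inner merge keeps only dates present in both frames, which is why 3,653 weather days become 3,561 merged records. A sketch of how the dropped dates could be listed, using an outer merge with pandas' `indicator` and `validate` options. The two small frames are hypothetical stand-ins for `pollutants_clean` and `weather_clean`.

```python
import pandas as pd

# Hypothetical stand-ins for pollutants_clean and weather_clean
pollutants = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-03']),
    'AQI': [55.0, 61.0],
})
weather = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03']),
    'Temperature': [26.1, 26.4, 27.0],
})

# validate='one_to_one' raises MergeError if either side repeats a date
check = pd.merge(pollutants, weather, on='Date', how='outer',
                 indicator=True, validate='one_to_one')

# Rows marked 'right_only' exist in weather but not in pollutants
missing = check.loc[check['_merge'] == 'right_only', 'Date']
print(missing.dt.strftime('%Y-%m-%d').tolist())  # ['2016-01-02']
```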

## Step 6: Save Cleaned Data


In [30]:
# Save cleaned pollutants data (year by year)
if pollutants_clean is not None:
    print("Saving pollutants data year by year...")
    for year in range(2015, 2025):
        year_data = pollutants_clean[pollutants_clean['Date'].dt.year == year]
        if len(year_data) > 0:
            pollutants_output = clean_path / f'pollutants_clean_{year}.csv'
            year_data.to_csv(pollutants_output, index=False)
            print(f"  ✅ pollutants_clean_{year}.csv - {len(year_data)} records")
    print(f"Total pollutants records: {len(pollutants_clean)}")

# Save cleaned weather data (year by year)
if weather_clean is not None:
    print("\nSaving weather data year by year...")
    for year in range(2015, 2025):
        year_data = weather_clean[weather_clean['Date'].dt.year == year]
        if len(year_data) > 0:
            weather_output = clean_path / f'weather_clean_{year}.csv'
            year_data.to_csv(weather_output, index=False)
            print(f"  ✅ weather_clean_{year}.csv - {len(year_data)} records")
    print(f"Total weather records: {len(weather_clean)}")

# Save cleaned air temperature data (year by year)
if temperature_clean is not None:
    print("\nSaving air temperature data year by year...")
    for year in range(2015, 2025):
        year_data = temperature_clean[temperature_clean['Date'].dt.year == year]
        if len(year_data) > 0:
            temperature_output = clean_path / f'air_temperature_clean_{year}.csv'
            year_data.to_csv(temperature_output, index=False)
            print(f"  ✅ air_temperature_clean_{year}.csv - {len(year_data)} records")
    print(f"Total air temperature records: {len(temperature_clean)}")

print("\n" + "="*60)
print("DATA CLEANING COMPLETE!")
print(f"All cleaned files saved to: {clean_path}")
print("="*60)

Saving pollutants data year by year...
  ✅ pollutants_clean_2015.csv - 365 records
  ✅ pollutants_clean_2016.csv - 319 records
  ✅ pollutants_clean_2017.csv - 361 records
  ✅ pollutants_clean_2018.csv - 365 records
  ✅ pollutants_clean_2019.csv - 360 records
  ✅ pollutants_clean_2020.csv - 362 records
  ✅ pollutants_clean_2021.csv - 353 records
  ✅ pollutants_clean_2022.csv - 364 records
  ✅ pollutants_clean_2023.csv - 346 records
  ✅ pollutants_clean_2024.csv - 366 records
Total pollutants records: 3561

Saving weather data year by year...
  ✅ weather_clean_2015.csv - 365 records
  ✅ weather_clean_2016.csv - 366 records
  ✅ weather_clean_2017.csv - 365 records
  ✅ weather_clean_2018.csv - 365 records
  ✅ weather_clean_2019.csv - 365 records
  ✅ weather_clean_2020.csv - 366 records
  ✅ weather_clean_2021.csv - 365 records
  ✅ weather_clean_2022.csv - 365 records
  ✅ weather_clean_2023.csv - 365 records
  ✅ weather_clean_2024.csv - 366 records
Total weather records: 3653

Saving air tem

## Summary

This notebook processed raw Singapore data (2015-2024) and saved cleaned files with **standardized formats** for easy merging.

### Output Files (Saved Year by Year):
1. **Pollutants: `pollutants_clean_2015.csv` through `pollutants_clean_2024.csv`** (10 files)
   - Columns: `Country`, `Region`, `Date`, `pm10`, `pm2_5`, `carbon_monoxide`, `nitrogen_dioxide`, `sulphur_dioxide`, `ozone`, `AQI`
   - All values rounded to 2 decimal places

2. **Weather: `weather_clean_2015.csv` through `weather_clean_2024.csv`** (10 files)
   - Columns: `Country`, `Region`, `Date`, `Temperature`, `RelativeHumidity`, `WindSpeed`
   - All values rounded to 2 decimal places

3. **Air Temperature: `air_temperature_clean_2015.csv` through `air_temperature_clean_2024.csv`** (10 files)
   - Columns: `Country`, `Region`, `Date`, `Temperature`, `reading_type`, `station_name`
   - Temperature values rounded to 2 decimal places

### Standardized Format for Merging:
All cleaned files now follow a consistent structure:
- **Country**: Singapore
- **Region**: Central, East, North, North-East, or West (distributed cyclically)
- **Date**: YYYY-MM-DD format
- **All numeric values**: Rounded to 2 decimal places
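
The cyclic region assignment can be sketched as follows. Because the label is derived from the day of the year, it reflects the calendar position of a date rather than where a reading was taken. The three dates are chosen only for illustration.

```python
import pandas as pd

regions = ['Central', 'East', 'North', 'North-East', 'West']
dates = pd.Series(pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-05']))

# dayofyear is 1-based, so 1 January maps to index 1 ('East')
assigned = dates.dt.dayofyear.mod(5).map(lambda i: regions[i])
print(assigned.tolist())  # ['East', 'North', 'Central']
```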

### Final Merged Structure:
When merging pollutants and weather data, the result will be:
```
Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed
```

### Data Processing Steps:
- ✅ Loaded raw data from 2015-2024
- ✅ Aggregated hourly data to daily averages (for pollutants)
- ✅ Handled missing values using forward fill → backward fill → zero fill
- ✅ Standardized column names (Country, Region, Date, Temperature, RelativeHumidity, WindSpeed, AQI)
- ✅ Rounded all numeric values to 2 decimal places
- ✅ Added Country and Region columns for consistency
- ✅ Saved cleaned data to `data/singapore/clean/` directory
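
The fill chain used in the steps above behaves as sketched below on a toy frame (the values are made up). Forward fill covers gaps after the first observation, backward fill covers leading gaps, and the zero fill only takes effect for a column with no observations at all.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'pm10': [np.nan, 20.0, np.nan, 30.0, np.nan],
    'ozone': [np.nan] * 5,   # a column with no observations at all
})

filled = df.ffill().bfill().fillna(0)
print(filled['pm10'].tolist())   # [20.0, 20.0, 20.0, 30.0, 30.0]
print(filled['ozone'].tolist())  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Since the fill runs on the combined multi-year frame, a gap at the start of one year is filled from the last value of the previous year.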

### Next Steps:
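1. Load the per-year files (`pollutants_clean_2015.csv` through `pollutants_clean_2024.csv`, and likewise for `weather_clean_*.csv`) and concatenate them for final merging
2. An inner merge on the `Date` column creates the final dataset; it keeps only dates present in both sources (3,561 of the 3,653 weather days in the preview)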
3. Data is ready for visualization and modeling
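
A minimal sketch of the load-and-merge step, assuming the per-year file naming used when saving. The helper `load_clean` is hypothetical, and the demo writes two one-row files to a temporary directory so that it runs on its own; in practice `clean_dir` would be `data/singapore/clean/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_clean(clean_dir, prefix):
    """Concatenate every `<prefix>_clean_<year>.csv` file in clean_dir."""
    files = sorted(Path(clean_dir).glob(f'{prefix}_clean_*.csv'))
    frames = [pd.read_csv(f, parse_dates=['Date']) for f in files]
    return pd.concat(frames, ignore_index=True)

# Demo with two tiny files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame({
        'Country': ['Singapore'], 'Region': ['East'],
        'Date': ['2015-01-01'], 'AQI': [53.46],
    }).to_csv(tmp / 'pollutants_clean_2015.csv', index=False)
    pd.DataFrame({
        'Country': ['Singapore'], 'Region': ['East'],
        'Date': ['2015-01-01'], 'Temperature': [24.62],
        'RelativeHumidity': [88.69], 'WindSpeed': [11.65],
    }).to_csv(tmp / 'weather_clean_2015.csv', index=False)

    pollutants = load_clean(tmp, 'pollutants')
    weather = load_clean(tmp, 'weather')

# Same column selection and join as the preview cell in Step 5
merged = pd.merge(
    pollutants[['Country', 'Region', 'Date', 'AQI']],
    weather[['Date', 'Temperature', 'RelativeHumidity', 'WindSpeed']],
    on='Date', how='inner',
)
print(list(merged.columns))
```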