# Melbourne Air Quality Historical Data Collection
# Research Notebook for Environmental Analysis

## Overview
This notebook fetches historical air quality data from the OpenWeatherMap API for EPA Victoria air monitoring station locations across Melbourne. The data is collected for research purposes to analyze air quality trends and patterns in the Melbourne metropolitan area.

### Key Features:
- **Comprehensive Coverage**: 20+ monitoring locations across Melbourne
- **Secure API Key Management**: Environment variable integration
- **Robust Error Handling**: HTTP and network errors are caught per request and logged to both console and file
- **Research-Ready Output**: Clean CSV with location, coordinates, and timestamps alongside each pollutant reading
- **Progress Tracking**: Real-time progress monitoring with tqdm

## 1. Environment Setup and Dependencies

In [None]:
# =============================================================================
# CELL 1: SETUP, IMPORTS, AND LOGGER CONFIGURATION
# =============================================================================
"""
Melbourne Air Quality Data Collection System
===========================================

This notebook collects historical air quality data from OpenWeatherMap API
for EPA Victoria monitoring station locations across Melbourne.

Author: Research Team
Date: 2025
Purpose: Environmental air quality analysis and research
"""

import os
import sys
import requests
import csv
import time
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Any
from tqdm.notebook import tqdm
from dotenv import load_dotenv
import pandas as pd

# --- 1. UTF-8 Aware Logger Setup ---
def setup_logger(log_file='logs/01_air_quality_melbourne_data_collection.log', level=logging.INFO):
    """Configures a logger to be UTF-8 aware for both console and file output."""
    logger = logging.getLogger()
    logger.setLevel(level)
    
    # Prevents adding duplicate handlers if you re-run this cell
    if logger.hasHandlers():
        logger.handlers.clear()

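    # Console handler; its encoding follows sys.stdout
    console_handler = logging.StreamHandler(sys.stdout)
    console_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(console_formatter)
    
    # File handler with explicit UTF-8 encoding; create the log directory first,
    # otherwise FileHandler raises FileNotFoundError when 'logs/' does not exist
    log_dir = os.path.dirname(log_file)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)
    file_handler = logging.FileHandler(log_file, mode='w', encoding='utf-8')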
    file_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(file_formatter)

    # Add both handlers to the root logger
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    
    return logger

# --- 2. Initialize Environment ---
# Load environment variables from .env file
load_dotenv()

# Initialize the logger for use throughout the notebook
logger = setup_logger()

logger.info("📊 Melbourne Air Quality Data Collection System")
logger.info("=" * 50)
logger.info("🔬 Research Environment Initialized")
logger.info("📋 Dependencies loaded and UTF-8 logger configured.")

## 2. Configuration and Research Parameters

In [None]:
# =============================================================================
# CELL 2: RESEARCH CONFIGURATION
# =============================================================================

class AirQualityConfig:
    """Configuration class for air quality data collection."""
    
    def __init__(self):
        # API Configuration
        self.API_KEY = os.getenv('OPENWEATHER_API_KEY')
        if not self.API_KEY:
            logger.critical("❌ OpenWeather API key not found in .env file. Halting.")
            raise ValueError("API key not configured.")
        
        # Use HTTPS so the API key in the query string is not sent in plaintext
        self.BASE_URL = "https://api.openweathermap.org/data/2.5/air_pollution/history"
        
        # Research Period
        self.START_DATE = datetime(2020, 11, 25)
        self.END_DATE = datetime(2025, 1, 4)
        
        # Data Collection Parameters
        self.CHUNK_SIZE_DAYS = 5
        self.REQUEST_DELAY_SECONDS = 0.5
        self.REQUEST_TIMEOUT = 20  # seconds
        
        # Output Configuration
        self.OUTPUT_CSV_PATH = f"../../data/raw/melbourne_raw_air_quality_openweather_{self.START_DATE.strftime('%Y%m%d')}_to_{self.END_DATE.strftime('%Y%m%d')}.csv"
        # Note: currently unused; setup_logger() in Cell 1 writes to its own default path
        self.LOG_FILE_PATH = "air_quality_collection.log"
        
        # EPA Victoria Air Monitoring Stations & Representative Locations
        self.MONITORING_LOCATIONS = {
            "Melbourne CBD": [-37.8136, 144.9631], 
            "Footscray": [-37.7997, 144.9020],
            "Brooklyn": [-37.8161, 144.8415], 
            "Alphington": [-37.7833, 145.0333],
            "Spotswood": [-37.8335, 144.8863], 
            "Box Hill": [-37.8185, 145.1225],
            "Brighton": [-37.9056, 145.0028], 
            "Dandenong": [-37.9875, 145.2149],
            "Mooroolbark": [-37.7825, 145.3168], 
            "Altona North": [-37.8410, 144.8490],
            "Melton": [-37.6833, 144.5833], 
            "Point Cook": [-37.9148, 144.7509],
            "Macleod": [-37.7333, 145.0667], 
            "Carlton": [-37.8001, 144.9656],
            "Richmond": [-37.8183, 145.0014], 
            "St Kilda": [-37.8676, 144.9801],
            "Yarraville": [-37.8167, 144.9000], 
            "Frankston": [-38.1421, 145.1256],
            "Ringwood": [-37.8136, 145.2306], 
            "Werribee": [-37.9009, 144.6590],
            "Craigieburn": [-37.5986, 144.9425], 
            "Pakenham": [-38.0753, 145.4834],
            "Broadmeadows": [-37.6839, 144.9169],
        }
        
        # CSV headers; the order must match the row built in _process_to_csv_row
        self.CSV_HEADERS = [
            "location", "latitude", "longitude", "datetime", "timestamp_unix",
            "aqi", "co", "no", "no2", "o3", "so2", "pm2_5", "pm10", "nh3"
        ]

# --- Initialize configuration ---
try:
    config = AirQualityConfig()
    logger.info("🎯 Research Configuration Loaded")
    logger.info(f"📍 Monitoring {len(config.MONITORING_LOCATIONS)} locations")
    logger.info(f"💾 Output file will be: {config.OUTPUT_CSV_PATH}")
    logger.info("🔑 API key configured: ✅")
except ValueError as e:
    logger.error(f"Could not initialize configuration: {e}")
    raise  # Later cells depend on `config`, so stop here instead of continuing
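
The header list above has to stay in the same order as the row assembled later in `_process_to_csv_row`. One way to make that coupling harder to break is to flatten each API entry into a dictionary and let the header list drive the column order. The following is a minimal sketch of that idea, not the notebook's implementation: `flatten_entry` and `to_row` are hypothetical helpers, the sample values are made up, and the timestamp is formatted in UTC here as an assumption (the collector itself uses the local timezone).

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

CSV_HEADERS = [
    "location", "latitude", "longitude", "datetime", "timestamp_unix",
    "aqi", "co", "no", "no2", "o3", "so2", "pm2_5", "pm10", "nh3",
]

def flatten_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one enriched API record into a flat dict (hypothetical helper)."""
    # Assumption: format the timestamp in UTC so the output is machine-independent
    dt_utc = datetime.fromtimestamp(entry["dt"], tz=timezone.utc)
    flat = {
        "location": entry.get("location", ""),
        "latitude": entry.get("latitude", ""),
        "longitude": entry.get("longitude", ""),
        "datetime": dt_utc.strftime("%Y-%m-%d %H:%M:%S"),
        "timestamp_unix": entry["dt"],
        "aqi": entry.get("main", {}).get("aqi", ""),
    }
    # Pollutant concentrations sit under "components" in each record
    flat.update(entry.get("components", {}))
    return flat

def to_row(entry: Dict[str, Any], headers: List[str] = CSV_HEADERS) -> List[Any]:
    """Build a CSV row whose column order is dictated by the header list."""
    flat = flatten_entry(entry)
    return [flat.get(name, "") for name in headers]

# Illustrative record with made-up values; "nh3" is left out on purpose
sample_entry = {
    "dt": 1609459200,
    "main": {"aqi": 2},
    "components": {"co": 201.9, "no": 0.0, "no2": 0.8, "o3": 68.7,
                   "so2": 0.6, "pm2_5": 0.5, "pm10": 0.5},
    "location": "Footscray", "latitude": -37.7997, "longitude": 144.9020,
}
row = to_row(sample_entry)
print(dict(zip(CSV_HEADERS, row)))
```

Because the row is built by looking up each header name, reordering or extending `CSV_HEADERS` changes the output columns without touching the flattening code, and a missing pollutant simply becomes an empty cell.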

## 3. Data Collection Functions

In [None]:
# =============================================================================
# CELL 3: CORE DATA COLLECTION CLASS
# =============================================================================

class AirQualityDataCollector:
    """Handles the mechanics of fetching, processing, and writing data."""
    
    def __init__(self, config: AirQualityConfig):
        self.config = config
        self.session = requests.Session()
        self.total_records = 0
        self.failed_requests = 0
        
    def fetch_historical_data(
        self, location: str, lat: float, lon: float, start_unix: int, end_unix: int
    ) -> List[Dict]:
        """Fetches and enhances historical data for a single API call."""
        params = {
            "lat": lat, "lon": lon, "start": start_unix, "end": end_unix, "appid": self.config.API_KEY
        }
        try:
            response = self.session.get(self.config.BASE_URL, params=params, timeout=self.config.REQUEST_TIMEOUT)
            response.raise_for_status()
            raw_data = response.json().get("list", [])
            logger.info(f"✅ API call for {location} successful. Retrieved {len(raw_data)} records.")
            return [{**entry, 'location': location, 'latitude': lat, 'longitude': lon} for entry in raw_data]
        except requests.exceptions.HTTPError as e:
            logger.error(f"❌ HTTP Error {e.response.status_code} for {location}: {e.response.text}")
        except requests.exceptions.RequestException as e:
            logger.error(f"❌ Request failed for {location}: {e}")
        
        self.failed_requests += 1
        return []

    def _process_to_csv_row(self, entry: Dict) -> List:
        """Processes a single JSON data point into a list for CSV writing."""
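        # Note: fromtimestamp() without a tz argument converts to the local timezone
        # of the machine running the notebook, so 'datetime' is local time, not UTC
        dt_object = datetime.fromtimestamp(entry["dt"])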
        components = entry.get("components", {})
        main_data = entry.get("main", {})
        
        # Ensure the row matches the headers defined in config
        return [
            entry.get('location', ''), entry.get('latitude', ''), entry.get('longitude', ''),
            dt_object.strftime("%Y-%m-%d %H:%M:%S"), entry["dt"],
            main_data.get("aqi", ''), components.get("co", ''), components.get("no", ''),
            components.get("no2", ''), components.get("o3", ''), components.get("so2", ''),
            components.get("pm2_5", ''), components.get("pm10", ''), components.get("nh3", '')
        ]

    def _collect_for_location(self, writer: csv.writer, location: str, lat: float, lon: float) -> int:
        """Handles the data collection loop for a single location."""
        records_for_location = 0
        current_date = self.config.START_DATE
        while current_date <= self.config.END_DATE:
            start_unix = int(current_date.timestamp())
            end_date_for_chunk = min(current_date + timedelta(days=self.config.CHUNK_SIZE_DAYS), self.config.END_DATE + timedelta(days=1))
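            # Subtract one second so adjacent chunks never share a boundary timestamp
            # (avoids duplicate rows if the API treats `end` as inclusive)
            end_unix = int(end_date_for_chunk.timestamp()) - 1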
            
            data_points = self.fetch_historical_data(location, lat, lon, start_unix, end_unix)
            for entry in data_points:
                writer.writerow(self._process_to_csv_row(entry))
                records_for_location += 1
            
            current_date = end_date_for_chunk
            time.sleep(self.config.REQUEST_DELAY_SECONDS)
        return records_for_location

    def collect_all_data(self) -> Tuple[bool, int]:
        """Orchestrates the entire data collection process."""
        logger.info("🚀 Starting comprehensive air quality data collection...")
        self.total_records = 0
        self.failed_requests = 0
        
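        try:
            # Create the output directory if it does not exist yet
            output_dir = os.path.dirname(self.config.OUTPUT_CSV_PATH)
            if output_dir:
                os.makedirs(output_dir, exist_ok=True)
            with open(self.config.OUTPUT_CSV_PATH, mode="w", newline="", encoding="utf-8") as file: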
                writer = csv.writer(file)
                writer.writerow(self.config.CSV_HEADERS)
                location_progress = tqdm(self.config.MONITORING_LOCATIONS.items(), desc="🌍 Processing Locations", unit="location")
                for location, (lat, lon) in location_progress:
                    location_progress.set_postfix_str(f"📍 {location}")
                    self.total_records += self._collect_for_location(writer, location, lat, lon)
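            if self.failed_requests:
                logger.warning(f"⚠️ {self.failed_requests} API requests failed; those chunks are missing from the output.")
            return True, self.total_records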
        except IOError as e:
            logger.critical(f"💥 CRITICAL FILE ERROR: Could not write to {self.config.OUTPUT_CSV_PATH}. Reason: {e}")
            return False, 0
        except Exception as e:
            logger.critical(f"💥 UNEXPECTED CRITICAL ERROR during collection: {e}")
            return False, 0

# --- Initialize the collector ---
collector = AirQualityDataCollector(config)
logger.info("🔧 Data collector class defined and instance created.")
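
The collector walks each location's date range in fixed-size windows, one API call per window. The sketch below isolates that windowing idea so the request count can be estimated before a long run. It is illustrative only: `iter_chunks` is a hypothetical helper, and it uses UTC-aware datetimes as an assumption so the result does not depend on the local timezone.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

def iter_chunks(start: datetime, end: datetime, chunk_days: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_unix, end_unix) windows covering start through the end of the `end` day."""
    stop = end + timedelta(days=1)  # include the whole final day
    current = start
    while current < stop:
        chunk_end = min(current + timedelta(days=chunk_days), stop)
        yield int(current.timestamp()), int(chunk_end.timestamp())
        current = chunk_end

# Same research period and chunk size as the configuration cell
start = datetime(2020, 11, 25, tzinfo=timezone.utc)
end = datetime(2025, 1, 4, tzinfo=timezone.utc)
chunks = list(iter_chunks(start, end, chunk_days=5))

requests_per_location = len(chunks)
total_requests = requests_per_location * 23  # 23 configured locations
print(requests_per_location, total_requests)
```

With the configured period this gives 301 windows per location, or 6,923 requests across the 23 locations. At a 0.5 second delay between requests, the pauses alone add up to roughly 58 minutes. Each window's end equals the next window's start, so a caller may want to subtract one second from the end if the API treats it as inclusive.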

## 4. Execute Data Collection

In [None]:
# =============================================================================
# CELL 4: DATA COLLECTION ORCHESTRATOR
# =============================================================================

def _print_results_summary(start_time: datetime, end_time: datetime, total_records: int):
    """Logs the summary of the data collection results."""
    duration = end_time - start_time
    logger.info("=" * 60)
    logger.info("✅ DATA COLLECTION COMPLETED SUCCESSFULLY!")
    logger.info(f"📁 Data saved to: {config.OUTPUT_CSV_PATH}")
    logger.info(f"📊 Total records collected: {total_records}")
    logger.info(f"⏱️  Total duration: {duration}")
    if duration.total_seconds() > 0:
        logger.info(f"🎯 Collection efficiency: {total_records / duration.total_seconds():.2f} records/second")

def _print_final_guidance(success: bool):
    """Prints helpful next steps or troubleshooting advice to the console."""
    if success:
        print("\n🎉 Ready for research and analysis! See logs for details.")
    else:
        print("\n❌ DATA COLLECTION FAILED. Check the log file for detailed error information.")

def run_data_collection_pipeline() -> bool:
    """Orchestrates the entire data collection process."""
    # Capture the start time once so the logged time and the measured duration agree
    start_time = datetime.now()
    logger.info(f"🕐 Collection started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
    
    success, total_records = collector.collect_all_data()
    
    end_time = datetime.now()
    
    if success:
        _print_results_summary(start_time, end_time, total_records)
    
    return success

# --- Execute the Pipeline ---
was_successful = run_data_collection_pipeline()
_print_final_guidance(was_successful)
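
When a request fails, the collector logs the error and moves on, so that chunk is absent from the CSV. One possible way to recover from transient failures is to retry with exponential backoff. The sketch below shows the technique in isolation. `retry_with_backoff` is a hypothetical helper, not part of the notebook, and it assumes the wrapped callable raises on failure (unlike `fetch_historical_data`, which returns an empty list).

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func until it succeeds; wait base_delay * 2**n between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                return None  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))
    return None

# Demonstration with a simulated flaky call that succeeds on the third try
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, max_attempts=4, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In a real run, `retry_on` would likely be narrowed to network-related exceptions such as `requests.exceptions.RequestException`.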

## 5. Data Analysis and Validation

In [None]:
# =============================================================================
# CELL 5: DATA ANALYSIS AND VALIDATION
# =============================================================================

def analyze_collected_data():
    """Perform initial analysis and validation of the collected data CSV."""
    if not os.path.exists(config.OUTPUT_CSV_PATH):
        logger.error(f"❌ No data file found at '{config.OUTPUT_CSV_PATH}'. Please run data collection first.")
        return None
    
    logger.info(f"📊 Analyzing collected data from '{config.OUTPUT_CSV_PATH}'...")
    try:
        df = pd.read_csv(config.OUTPUT_CSV_PATH)
        
        print("\n--- Data Analysis Report ---")
        print(f"📈 Dataset Overview:")
        print(f"   - Total records: {len(df):,}")
        print(f"   - Unique Locations: {df['location'].nunique()}")
        
        print(f"\n📍 Location Coverage (Top 10):")
        print(df['location'].value_counts().head(10))
        
        print(f"\n🔬 Data Completeness per Parameter:")
        # Cover every pollutant column written to the CSV, including 'no' and 'nh3'
        for param in ['aqi', 'pm2_5', 'pm10', 'no', 'no2', 'o3', 'co', 'so2', 'nh3']:
            if param in df.columns:
                percentage = df[param].notna().mean() * 100
                print(f"   - {param.upper()}: {percentage:.1f}% complete")
        
        print(f"\n✅ Data Quality Checks:")
        print(f"   - Duplicate rows: {df.duplicated().sum()}")
        print(f"   - Rows with missing timestamps: {df['datetime'].isnull().sum()}")
        print("--- End of Report ---")
        
        return df
    except Exception as e:
        logger.error(f"❌ Analysis failed: {e}")
        return None

# --- Run the analysis on the generated file ---
# You can re-run this cell anytime after the collection is complete.
df_analysis = analyze_collected_data()
if df_analysis is not None:
    print("\nDataFrame returned to 'df_analysis' variable. You can now use it for further work.")
    # display(df_analysis.head()) # Uncomment to see the first few rows
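
The report above checks duplicates and missing timestamps, but not whether a location's time series has holes, for example from failed requests. A gap check over the `timestamp_unix` column is one way to spot them. This is a minimal sketch: `find_gaps` is a hypothetical helper, and the one-hour step is an assumption about the data's resolution that should be verified against the collected file.

```python
from typing import Iterable, List, Tuple

def find_gaps(timestamps: Iterable[int], step: int = 3600) -> List[Tuple[int, int]]:
    """Return (before, after) pairs where consecutive timestamps are more than `step` seconds apart."""
    ordered = sorted(set(timestamps))  # de-duplicate and sort first
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]

# Illustrative series with two hourly readings missing between 7200 and 18000
sample_timestamps = [0, 3600, 7200, 18000, 21600, 3600]
gaps = find_gaps(sample_timestamps)
print(gaps)
```

On the collected data this could be applied per location, for instance with `df_analysis.groupby('location')['timestamp_unix'].apply(find_gaps)`.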

This notebook provides:

1. **Professional Structure**: Clear markdown sections with research context
2. **Environment Variables**: Secure API key management with `.env` file
3. **Expanded Coverage**: Based on EPA Victoria's monitoring network including Melbourne CBD, Footscray, Brooklyn, and portable monitoring locations like Box Hill, Brighton, Dandenong, Mooroolbark, Altona North, Melton, Point Cook, and Macleod
4. **Research Features**: 
   - Comprehensive logging
   - Error handling
   - Progress tracking
   - Data validation
   - Analysis functions
5. **Professional Output**: Clean CSV with metadata and research documentation

To use this notebook:

1. Create a `.env` file with your OpenWeather API key:
```
OPENWEATHER_API_KEY=your_actual_api_key_here
```

2. Install required packages:
```bash
pip install requests tqdm python-dotenv pandas numpy
```

3. Run the notebook cells in order to collect and analyze your air quality data.
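
Before starting a long collection run, it can help to confirm that the key from step 1 is actually visible to the Python process without printing it in full. A small sketch, assuming `load_dotenv()` has already been called as in Cell 1; `mask_key` is a hypothetical helper:

```python
import os

def mask_key(key: str, visible: int = 4) -> str:
    """Return the key with everything after the first `visible` characters replaced by '*'."""
    if len(key) <= visible:
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - visible)

api_key = os.getenv("OPENWEATHER_API_KEY")
if api_key:
    print(f"Key loaded: {mask_key(api_key)}")
else:
    print("OPENWEATHER_API_KEY is not set; check the location of the .env file.")
```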

The notebook covers 23 monitoring locations across Melbourne, including official EPA monitoring stations and representative locations for comprehensive coverage of the metropolitan area.