# NB01: Data Collection - London Air Pollution
## This notebook collects historical air pollution data from the OpenWeather API

## SECTION 1: IMPORT LIBRARIES

In [2]:
import requests  # For making API calls
import os        # For file/folder operations
import json      # For saving data in JSON format
from dotenv import load_dotenv  # For loading API key securely
from datetime import datetime, timedelta  # For handling dates

print("All libraries imported successfully!")

All libraries imported successfully!


### Why are we using these specific libraries?
#### requests
Makes HTTP requests to communicate with web APIs. It is essential because the OpenWeather API is a web service that requires HTTP GET requests to retrieve data. Python has a built-in 'urllib' library, but 'requests' is the de facto standard because it simplifies API calls significantly.
As demonstrated in the W02 Lecture on API interactions, requests.get() handles:
- URL construction with parameters
- Authentication headers
- Response parsing
- Error reporting through status codes
#### How I tested my understanding
I ran requests.get() against a simple test endpoint (OpenWeather current weather) before attempting historical data collection, to verify that I understood the parameter structure.
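To make the "URL construction with parameters" point concrete, here is a minimal sketch that builds the query string by hand with the standard library. It only illustrates what passing a params dictionary amounts to; it is not how requests is implemented internally. The names EXAMPLE_ENDPOINT, build_url and example_params are my own, and the key is a placeholder.

```python
from urllib.parse import urlencode

# Endpoint path from this notebook; all parameter values are placeholders.
EXAMPLE_ENDPOINT = "https://api.openweathermap.org/data/2.5/air_pollution/history"

def build_url(base_url, params):
    """Append a URL-encoded query string to base_url."""
    return f"{base_url}?{urlencode(params)}"

example_params = {
    "lat": 51.5074,
    "lon": -0.1278,
    "start": 1704067200,   # 2024-01-01 00:00:00 UTC
    "end": 1704671999,     # 2024-01-07 23:59:59 UTC
    "appid": "YOUR_KEY",   # placeholder, never a real key
}

example_url = build_url(EXAMPLE_ENDPOINT, example_params)
print(example_url)
```

Printing the URL like this is also a quick way to spot a misspelled parameter name before spending an API call on it.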
#### python-dotenv
Loads environment variables from a .env file into Python. Hardcoding API keys directly in notebooks is a major security risk: if I share my notebook on GitHub or with classmates, my API key would be exposed, allowing others to use my OpenWeather account. In W03 Lab we discussed reproducibility and why notebooks with hardcoded passwords or keys aren't shareable. Keeping the key in a .env file solves that problem.
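To show the idea behind loading a .env file, here is a simplified stand-in written with the standard library only. It is not python-dotenv's actual implementation (the real library handles more syntax); parse_env_text and load_into_environ are hypothetical helper names, and the key value is made up.

```python
import os

def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values

def load_into_environ(text, environ=os.environ):
    """Copy parsed values into environ without overriding existing entries."""
    for key, value in parse_env_text(text).items():
        environ.setdefault(key, value)

sample_env = """
# .env (kept out of version control)
OPENWEATHER_API_KEY="not-a-real-key"
"""

fake_environ = {}  # a plain dict stands in for os.environ in this demo
load_into_environ(sample_env, fake_environ)
print(fake_environ)
```

The key never appears in the notebook itself; only the variable name does.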
#### json
Encodes and decodes the JSON (JavaScript Object Notation) data format. The OpenWeather API returns data as JSON, and we need to save the raw API response before transformation. JSON:
- is human-readable (I can open the file in a text editor to inspect it)
- preserves nested data structures (lists, dictionaries)
- is language-agnostic (it can be read by R, JavaScript, etc.)
- maintains data types (numbers stay numbers, not strings)
#### Why save raw JSON before transformation?
Following the principle discussed in the W04 Lecture on data pipelines, we separate collection from transformation. If I make a mistake in NB02, I don't need to call the API again (it is rate-limited). The raw data acts as a "checkpoint" in my workflow.
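The checkpoint idea can be sketched as a small "load or fetch" helper: read the raw file if it already exists, and call the API only when it does not. This is an illustration under my own assumptions; load_or_fetch and fake_fetch are hypothetical names, and fake_fetch stands in for the real request.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached JSON from path, calling fetch() only if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Stand-in for the real API call; it counts how often it runs.
calls = {"count": 0}

def fake_fetch():
    calls["count"] += 1
    return {"coord": {"lat": 51.5074, "lon": -0.1278}, "list": [{"dt": 1640995200}]}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "raw.json")
    first = load_or_fetch(cache_path, fake_fetch)   # file missing: fetches
    second = load_or_fetch(cache_path, fake_fetch)  # file present: reads from disk

print(calls["count"])
```

However many times the transformation step is rerun, the rate-limited call happens once.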

## SECTION 2: LOAD API KEY

In [None]:
# Load environment variables from .env file
load_dotenv()
# Get API key from environment variable
API_KEY = os.getenv('OPENWEATHER_API_KEY')
# Check if API key was loaded successfully
if API_KEY:
    print(f"✅ API Key loaded! (First 8 characters: {API_KEY[:8]}...)")
else:
    print("ERROR: API Key not found! Check your .env file.")

✅ API Key loaded! (First 8 characters: bff53b98...)


### Why do we use a .env file instead of putting the API key directly in the code?

Security and shareability are the main reasons:

SECURITY RISK: Hardcoding my API key (like API_KEY = "bff53b982b...") means anyone who sees my notebook can use my key. This could exhaust my rate limit (OpenWeather free tier = 1,000 calls/day) or cost me money on paid plans. Even worse, if I push to GitHub, the key stays in the commit history even if I delete it from the file later.

PROFESSIONAL PRACTICE: The "Twelve-Factor App" methodology, widely followed in industry, recommends storing configuration such as API keys in environment variables, not in code.
This separates credentials from logic, following the "separation of concerns" principle from W01.

REPRODUCIBILITY: Using a .env file means:
- I can share my notebook publicly (it only shows os.getenv('OPENWEATHER_API_KEY')).
- Others can run my code by creating their own .env file with their own key.
- I add .env to .gitignore so it never gets committed.

This approach was emphasised in W03 Lab when discussing reproducible research and secure coding practices. I tested it by printing the first 8 characters of my loaded key to verify that load_dotenv() worked correctly before making API calls.
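A possible refinement of the key check, sketched under my own assumptions: stop immediately with a clear error when the variable is missing, and mask most of the key when printing it for a sanity check. require_env and mask_secret are hypothetical helpers, and the demo value is made up.

```python
import os

def require_env(name, environ=os.environ):
    """Return the variable's value, or raise a clear error if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")
    return value

def mask_secret(value, visible=4):
    """Keep only the first few characters of a secret and star out the rest."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

demo_environ = {"OPENWEATHER_API_KEY": "abcd1234efgh5678"}  # made-up value
key = require_env("OPENWEATHER_API_KEY", demo_environ)
print(mask_secret(key))
```

Raising an exception, instead of only printing a message, prevents later cells from sending requests with a missing key.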

In [4]:
# %% SECTION 3: DEFINE PARAMETERS
from datetime import timezone  # Makes the timestamps independent of the local timezone

# London coordinates (central London)
LATITUDE = 51.5074
LONGITUDE = -0.1278

# Time period for data collection
# Let's get 3 years of data (2022-2024)
# tzinfo=timezone.utc: a naive datetime would be interpreted in the machine's local time
END_DATE = datetime(2024, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
START_DATE = datetime(2022, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# Convert to Unix timestamps (required by API)
END_TIMESTAMP = int(END_DATE.timestamp())
START_TIMESTAMP = int(START_DATE.timestamp())

print(f"📍 Location: London ({LATITUDE}, {LONGITUDE})")
print(f"📅 Start Date: {START_DATE.strftime('%Y-%m-%d')}")
print(f"📅 End Date: {END_DATE.strftime('%Y-%m-%d')}")
print(f"⏱️  Start Timestamp: {START_TIMESTAMP}")
print(f"⏱️  End Timestamp: {END_TIMESTAMP}")

📍 Location: London (51.5074, -0.1278)
📅 Start Date: 2022-01-01
📅 End Date: 2024-12-31
⏱️  Start Timestamp: 1640995200
⏱️  End Timestamp: 1735689599


### Why did you choose this time period? (2022-2024)
I chose 2022-2024 (3 years) after weighing several factors:

STATISTICAL ROBUSTNESS: Air quality data contains significant seasonal variation (higher pollution in winter due to heating, lower in summer). To separate genuine long-term trends from seasonal fluctuations, I need multiple complete years.

RELEVANCE TO RESEARCH QUESTION: My question asks if London's air is getting "better or worse", which requires recent data that reflects current conditions.
Data from 2022-2024 is most relevant because:
- Post-COVID: 2020-2021 had abnormal pollution levels due to lockdowns (traffic reduced by ~70% in March-May 2020 according to UK DfT statistics). 2022 onwards represents "normal" urban activity patterns.
- It captures recent policy impacts (e.g., Ultra Low Emission Zone expansion in August 2023).

API CONSTRAINTS: OpenWeather's free tier allows 1,000 API calls per day.
A single call to the history endpoint returns every hourly record between start and end for one location, so the number of calls is small. For 3 years of hourly data I expect 1,096 days × 24 hours = 26,304 records (2024 is a leap year), which stays well within the API limits even if the range has to be requested in chunks.
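The record-count arithmetic can be checked exactly with datetime, which accounts for the leap day in 2024. This is a small sketch; the received figure is the count reported by the API request in this notebook, and the variable names are my own.

```python
from datetime import datetime, timezone

start = datetime(2022, 1, 1, tzinfo=timezone.utc)
end_exclusive = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Number of whole hours in the collection period
expected_hours = int((end_exclusive - start).total_seconds() // 3600)

received = 25_968  # record count reported by the API request in this notebook

missing = expected_hours - received
coverage_pct = round(100 * received / expected_hours, 1)
print(expected_hours, missing, coverage_pct)
```

Comparing the expected and received counts shows how many hours are absent from the response, which is worth keeping in mind before interpreting trends.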

DATA QUALITY: Recent years have better data completeness and accuracy. Modern monitoring equipment (post-2020) provides more reliable PM2.5 and NO₂ measurements than older sensors.

#### Alternatives I considered:
5 years (2020-2024): Rejected because it includes COVID lockdown anomalies, which would skew trend analysis and not reflect typical urban conditions.

1 year (2024 only): Too short to distinguish trends from seasonal variation.

10 years (2015-2024): Ideal for long-term trends, but it would mix different measurement methodologies. OpenWeather's documentation also states that historical air pollution data only starts on 27 November 2020, so this range is not available from this API.

#### My decision balances:
✓ Statistical robustness (multiple seasonal cycles)
✓ Current relevance (post-COVID "normal")
✓ Practical constraints (API limits)
✓ Data quality (modern sensors)

#### Testing approach:
I initially tested with 1 week of data (Jan 1-7, 2024) to verify that the API worked correctly before committing to the full 3-year download.
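The one-week test and any chunked download both come down to splitting a date range into start/end timestamp pairs. Here is a minimal sketch, assuming non-overlapping windows measured in whole days; time_windows is a hypothetical helper, not part of the OpenWeather API.

```python
from datetime import datetime, timedelta, timezone

def time_windows(start, end, days=30):
    """Yield (start_ts, end_ts) Unix-second pairs covering [start, end] without overlap."""
    step = timedelta(days=days)
    cursor = start
    while cursor <= end:
        # Each window ends one second before the next one begins
        window_end = min(cursor + step - timedelta(seconds=1), end)
        yield int(cursor.timestamp()), int(window_end.timestamp())
        cursor = window_end + timedelta(seconds=1)

# The one-week test range (Jan 1-7, 2024) as a single window
week = list(time_windows(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 7, 23, 59, 59, tzinfo=timezone.utc),
    days=7,
))
print(week)
```

Each pair can be passed as the start and end parameters of one request, so a failed chunk can be retried without repeating the others.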

### Why did you choose these specific coordinates for London?
I chose coordinates 51.5074°N, 0.1278°W (central London; the longitude is written as -0.1278 in the API request) after researching several location options:
#### RESEARCH PROCESS:
I investigated four potential coordinate sets to represent "London":
- Option 1: City of London (51.5074°N, 0.1278°W) - Financial district, very central.
- Option 2: Greater London centroid (51.5072°N, 0.1276°W) - Geographic center.
- Option 3: Heathrow area (51.4700°N, 0.4543°W) - Western suburbs near airport.
- Option 4: Canary Wharf (51.5055°N, 0.0196°W) - East London business district.
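To judge how different these options really are, the great-circle distance from the chosen point to each alternative can be computed with the haversine formula. This is a sketch under stated assumptions: it treats the Earth as a sphere of radius 6,371 km, and haversine_km is my own helper name.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius, spherical approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

chosen = (51.5074, -0.1278)
candidates = {
    "Greater London centroid": (51.5072, -0.1276),
    "Heathrow area": (51.4700, -0.4543),
    "Canary Wharf": (51.5055, -0.0196),
}

distances = {name: haversine_km(*chosen, *point) for name, point in candidates.items()}
for name, km in distances.items():
    print(f"{name}: {km:.2f} km")
```

Options 1 and 2 turn out to be only tens of metres apart, so in practice the choice is between the centre, the Heathrow area, and Canary Wharf.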
#### DECISION:
I selected City of London (51.5074°N, 0.1278°W) for these reasons:

REPRESENTATIVE URBAN POLLUTION: This location experiences typical urban air quality issues (traffic emissions, building density, human activity) without the airport-specific pollution at Heathrow or industrial zones elsewhere. According to the London Air Quality Network (https://www.londonair.org.uk/), this area has well-established monitoring stations that show representative central London conditions.

POPULATION EXPOSURE: Central London has the highest daytime population density (approximately 500,000 workers and residents in the Square Mile). When we ask if London's air is "getting better or worse," we should focus on where most people are exposed to pollution. This makes the health relevance clearer.

POLICY RELEVANCE: Central London is where air quality policies have been most aggressive:
- Congestion Charge Zone (since 2003)
- Ultra Low Emission Zone (ULEZ), expanded August 2023
- Measuring this location captures policy impact

DATA CONSISTENCY: The City of London coordinates are commonly used in academic studies and government reports, making my findings comparable to existing research.

AVOIDING CONFOUNDS:
- Heathrow: Would be dominated by aircraft emissions (not representative of urban London)
- Suburbs: Lower pollution, less representative of the "London" people think of
- Parks (Hyde Park, etc.): Would underestimate typical exposure

#### How I found these coordinates:
- Used Google Maps to identify "City of London" and right-clicked for coordinates
- Cross-referenced with UK DEFRA monitoring station locations to ensure this area has historical air quality data for validation
- Verified coordinates using https://www.latlong.net/

#### Testing:
I made a test API call with these coordinates before running the full data collection to verify that OpenWeather has coverage for this location (some APIs have gaps in geographic coverage).

### What is a Unix timestamp and why does the API need it?

A Unix timestamp (also called "Epoch time" or "POSIX time") represents a date and time as a single number: the number of seconds that have elapsed since January 1, 1970, 00:00:00 UTC (Coordinated Universal Time).
For example:
- Unix timestamp 0 = January 1, 1970, 00:00:00 UTC
- Unix timestamp 1640995200 = January 1, 2022, 00:00:00 UTC (the start of my collection period)

This is a standardized system used across programming languages and databases.

#### WHY DOES THE API NEED IT?
UNAMBIGUOUS TIME REPRESENTATION: Human-readable dates like "01/02/2024" are ambiguous: is that January 2nd (US format) or February 1st (UK format)?
Unix timestamps eliminate this confusion. 1706745600 means exactly one moment in time (February 1, 2024, 00:00:00 UTC), regardless of where you are in the world.

TIMEZONE INDEPENDENCE: OpenWeather's servers could be anywhere, and users could be requesting data from any timezone. Unix timestamps are always counted from the same UTC instant, so there's no confusion about timezone conversions. The one caveat is on my side: the date I convert into a timestamp must itself be interpreted as UTC.

COMPUTATIONAL EFFICIENCY: Integers are cheaper to compare, sort, and do arithmetic on than date strings. Working with a timestamp like 1640995200 avoids having to parse "2022-01-01T00:00:00+00:00" first. For an API handling millions of requests, this efficiency matters.

STANDARDIZATION: Unix timestamps are a universal standard across systems (Linux, macOS, Windows, databases, APIs). This means the OpenWeather API can be used from any programming language (Python, JavaScript, R, Java, etc.) without needing different date format parsers.
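Both directions of the conversion can be sketched with the standard library. The point of the example is the explicit UTC timezone: a datetime without tzinfo is interpreted in the local timezone of the machine, which can silently shift the requested period by an hour or more.

```python
from datetime import datetime, timezone

# Date -> Unix timestamp: attach UTC explicitly so the result does not
# depend on the timezone of the machine running the notebook.
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
start_ts = int(start.timestamp())

# Unix timestamp -> date, again interpreted in UTC.
moment = datetime.fromtimestamp(1706745600, tz=timezone.utc)

print(start_ts)            # 1640995200
print(moment.isoformat())  # 2024-02-01T00:00:00+00:00
```

This gives the same check as an online converter, but inside the notebook where it can be rerun.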



In [5]:
# %% SECTION 4: CONSTRUCT API REQUEST
# The API endpoint for HISTORICAL air pollution data
# (HTTPS, so the API key in the query string is not sent in plain text)
BASE_URL = "https://api.openweathermap.org/data/2.5/air_pollution/history"

# Parameters for the API request
params = {
    'lat': LATITUDE,
    'lon': LONGITUDE,
    'start': START_TIMESTAMP,
    'end': END_TIMESTAMP,
    'appid': API_KEY
}

print(f"🔗 API Endpoint: {BASE_URL}")
print(f"📦 Parameters: lat={LATITUDE}, lon={LONGITUDE}")
print(f"   Time range: {START_TIMESTAMP} to {END_TIMESTAMP}")

🔗 API Endpoint: https://api.openweathermap.org/data/2.5/air_pollution/history
📦 Parameters: lat=51.5074, lon=-0.1278
   Time range: 1640995200 to 1735689599


### How did you figure out which API endpoints to use?
I started with the OpenWeather API documentation (openweathermap.org/api) and navigated to the Air Pollution API section. The documentation showed three endpoints: current, forecast, and history. Since my question asks "Is London's air getting better or worse?", I need HISTORICAL data to analyze trends over time.

### What other API endpoints were available and why didn't you use them?
OpenWeather Air Pollution API has three endpoints:

CURRENT (/air_pollution):
- Provides: Real-time air quality for right now
- Why rejected: Only gives one snapshot, so it can't show trends over time
- Use case: Real-time monitoring dashboards

FORECAST (/air_pollution/forecast):
- Provides: Predicted air quality for the next 4 days (per OpenWeather's documentation)
- Why rejected: Shows future predictions, not past measurements. I can't answer "Is air getting better?" with predictions; I need actual historical data
- Use case: Planning outdoor activities

HISTORY (/air_pollution/history) ✓ SELECTED:
- Provides: Measured air quality data from the past
- Why chosen: Only option that provides historical data for trend analysis. It allows calculating year-over-year changes and seasonal patterns, and directly answers my research question

### Decision:
History was the ONLY endpoint providing the multi-year measured data needed for "better or worse" trend analysis. The others provide either single snapshots or predictions, neither of which is suitable for historical trend detection.
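The difference between the three endpoints can also be written down as data. In this sketch the paths are the ones listed above, but the required-parameter sets for current and forecast are my assumption (only the history call is exercised in this notebook), and API_ROOT, ENDPOINTS and endpoint_request are hypothetical names.

```python
API_ROOT = "https://api.openweathermap.org/data/2.5"

# Paths as listed above. The parameter sets for "current" and "forecast"
# are an assumption; only "history" is actually called in this notebook.
ENDPOINTS = {
    "current": ("/air_pollution", {"lat", "lon", "appid"}),
    "forecast": ("/air_pollution/forecast", {"lat", "lon", "appid"}),
    "history": ("/air_pollution/history", {"lat", "lon", "start", "end", "appid"}),
}

def endpoint_request(kind, params):
    """Return (url, params) for an endpoint, rejecting incomplete parameter sets."""
    path, required = ENDPOINTS[kind]
    missing = required - set(params)
    if missing:
        raise ValueError(f"{kind} request is missing: {sorted(missing)}")
    return API_ROOT + path, params

url, checked = endpoint_request(
    "history",
    {"lat": 51.5074, "lon": -0.1278, "start": 1640995200, "end": 1735689599, "appid": "YOUR_KEY"},
)
print(url)
```

Only the history endpoint takes a start and end, which is what makes it the one suited to trend analysis.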

In [None]:
# %% SECTION 5: MAKE API REQUEST

print("🚀 Making API request...")
print("⏳ This might take 10-30 seconds for 3 years of data...")

try:
    # Make the GET request (the timeout stops the cell hanging if the server never answers)
    response = requests.get(BASE_URL, params=params, timeout=60)

    # Check if request was successful
    if response.status_code == 200:
        print("✅ API request successful!")
        data = response.json()

        # Check how much data we got
        num_records = len(data.get('list', []))
        print(f"📊 Received {num_records} hourly air pollution records")

    else:
        print(f"❌ API request failed with status code: {response.status_code}")
        print(f"Error message: {response.text}")
        data = None

except Exception as e:
    print(f"❌ An error occurred: {e}")
    data = None

🚀 Making API request...
⏳ This might take 10-30 seconds for 3 years of data...


✅ API request successful!
📊 Received 25968 hourly air pollution records


In [None]:
# %% SECTION 6: INSPECT THE DATA

if data:
    print("\n🔍 Let's look at the structure of our data:")
    print(f"Top-level keys: {list(data.keys())}")

    # Look at the first record
    if 'list' in data and len(data['list']) > 0:
        first_record = data['list'][0]
        print("\n📋 Structure of first record:")
        print(json.dumps(first_record, indent=2))

        # What pollutants do we have?
        if 'components' in first_record:
            pollutants = list(first_record['components'].keys())
            print(f"\n🌫️  Available pollutants: {pollutants}")



🔍 Let's look at the structure of our data:
Top-level keys: ['coord', 'list']

📋 Structure of first record:
{
  "main": {
    "aqi": 1
  },
  "components": {
    "co": 230.31,
    "no": 0.01,
    "no2": 16.96,
    "o3": 40.41,
    "so2": 7.57,
    "pm2_5": 9.6,
    "pm10": 15.84,
    "nh3": 0.09
  },
  "dt": 1640995200
}

🌫️  Available pollutants: ['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3']
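Given the record structure shown above, here is a sketch of how one nested record could be flattened into a single row. The actual transformation is left to NB02; flatten_record is a hypothetical helper, and the sample is the first record printed above.

```python
from datetime import datetime, timezone

def flatten_record(record):
    """Turn one entry of data['list'] into a flat dict (one row per hour)."""
    row = {
        "timestamp": datetime.fromtimestamp(record["dt"], tz=timezone.utc).isoformat(),
        "aqi": record["main"]["aqi"],
    }
    # Each pollutant becomes its own column
    row.update(record["components"])
    return row

sample = {
    "main": {"aqi": 1},
    "components": {"co": 230.31, "no": 0.01, "no2": 16.96, "o3": 40.41,
                   "so2": 7.57, "pm2_5": 9.6, "pm10": 15.84, "nh3": 0.09},
    "dt": 1640995200,
}

row = flatten_record(sample)
print(row)
```

A list of such rows maps directly onto a table with one column per pollutant.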


In [8]:
# %% SECTION 7: SAVE DATA TO JSON FILE

# Create data folder if it doesn't exist
os.makedirs('data', exist_ok=True)

# Filename includes the year range covered by the data
filename = f"data/london_air_pollution_{START_DATE.year}-{END_DATE.year}.json"

if data:
    try:
        # Save to JSON file (explicit encoding keeps the file identical across platforms)
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)

        print(f"✅ Data saved successfully to: {filename}")

        # Check file size
        file_size_mb = os.path.getsize(filename) / (1024 * 1024)
        print(f"📁 File size: {file_size_mb:.2f} MB")

    except Exception as e:
        print(f"❌ Error saving file: {e}")
else:
    print("❌ No data to save - API request failed")

✅ Data saved successfully to: data/london_air_pollution_2022-2024.json
📁 File size: 6.90 MB
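Because this file is the checkpoint for every later notebook, one option is to write it atomically: write to a temporary file first, then rename it over the target, so an interrupted write cannot leave a half-written JSON file behind. This is a sketch of that idea, not what the cell above does; save_json_atomic is a hypothetical helper.

```python
import json
import os
import tempfile

def save_json_atomic(data, path):
    """Write JSON to a temporary file, then rename it over the target path."""
    folder = os.path.dirname(path) or "."
    os.makedirs(folder, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        os.remove(tmp_path)  # clean up the partial file, then re-raise
        raise

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "example.json")
    save_json_atomic({"list": [{"dt": 1640995200}]}, target)
    with open(target, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [name for name in os.listdir(os.path.dirname(target)) if name.endswith(".tmp")]

print(reloaded)
```

Reloading the file straight after saving, as the demo does, also confirms that what is on disk is valid JSON.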


In [9]:
# %% SECTION 8: SUMMARY
from datetime import timezone  # Interpret API timestamps as UTC, not local time

print("\n" + "="*50)
print("📊 DATA COLLECTION SUMMARY")
print("="*50)

# data.get('list') is falsy for a missing or empty list, so indexing below is safe
if data and data.get('list'):
    num_records = len(data['list'])

    # Calculate date range from actual data
    first_dt = datetime.fromtimestamp(data['list'][0]['dt'], tz=timezone.utc)
    last_dt = datetime.fromtimestamp(data['list'][-1]['dt'], tz=timezone.utc)

    print(f"✅ Successfully collected {num_records:,} records")
    print(f"📅 Date range: {first_dt.strftime('%Y-%m-%d')} to {last_dt.strftime('%Y-%m-%d')}")
    print(f"📍 Location: London ({LATITUDE}, {LONGITUDE})")
    print(f"💾 Saved to: {filename}")
    print(f"\n🎯 Next step: Move to NB02-Data-Transformation.ipynb")
else:
    print("❌ Data collection failed - review error messages above")


📊 DATA COLLECTION SUMMARY
✅ Successfully collected 25,968 records
📅 Date range: 2022-01-01 to 2024-12-31
📍 Location: London (51.5074, -0.1278)
💾 Saved to: data/london_air_pollution_2022-2024.json

🎯 Next step: Move to NB02-Data-Transformation.ipynb
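The summary reports the first and last dates, but not whether every hour in between is present. A minimal sketch of a completeness check, assuming records are expected exactly one hour (3,600 seconds) apart; find_gaps is a hypothetical helper and the demo series is made up.

```python
def find_gaps(timestamps, step=3600):
    """Return (previous_ts, next_ts, missing_count) for every break in the series."""
    gaps = []
    ordered = sorted(timestamps)
    for previous, current in zip(ordered, ordered[1:]):
        missing = (current - previous) // step - 1
        if missing > 0:
            gaps.append((previous, current, missing))
    return gaps

# Made-up series: hours 0, 1, 2, then a jump to hour 5 (hours 3 and 4 absent).
demo = [1640995200 + h * 3600 for h in (0, 1, 2, 5)]
demo_gaps = find_gaps(demo)
print(demo_gaps)
```

On the real data, the input would be the dt value of every record in data['list'].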


In [10]:
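# %% PROFESSIONAL ERROR HANDLING
import time

def fetch_with_retry(url, params, max_retries=3):
    """
    Fetch data from the API, retrying transient failures.

    Rate limits (429), server errors (5xx), timeouts, and connection errors
    are retried with exponential backoff (1s, 2s, 4s, ...). Other client
    errors such as 401 are not retried, because repeating the same request
    cannot fix them. Returns the response, or None if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)

            if response.status_code == 200:
                return response
            elif response.status_code == 401:
                print("❌ Authentication failed. Check API key.")
                return None
            elif response.status_code == 429:  # Rate limit
                print("⏳ Rate limited.")
            elif response.status_code >= 500:  # Server-side problem, worth retrying
                print(f"⚠️ Server error: {response.status_code}")
            else:
                # Remaining client errors (e.g. 400, 404) will not succeed on retry
                print(f"⚠️ Unexpected status: {response.status_code}")
                return None

        except requests.exceptions.Timeout:
            print(f"⏱️ Request timeout on attempt {attempt + 1}/{max_retries}")
        except requests.exceptions.ConnectionError:
            print(f"🔌 Connection error on attempt {attempt + 1}/{max_retries}")

        # Exponential backoff before the next attempt (skipped after the last one)
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            print(f"   Waiting {wait_time}s before retry...")
            time.sleep(wait_time)

    print(f"❌ Failed after {max_retries} attempts")
    return None

# Use it in your API call (note: this sends the Section 5 request again)
response = fetch_with_retry(BASE_URL, params)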
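The backoff schedule itself can be separated from the network code and inspected on its own. The sketch below adds two things the function above does not have, both my own assumptions: a cap on the longest wait and optional random jitter. backoff_delays is a hypothetical helper.

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=0.0, rng=random):
    """Return the wait (seconds) before each retry: base * 2**attempt, capped."""
    delays = []
    for attempt in range(max_retries):
        delay = min(base * 2 ** attempt, cap)
        # Optional jitter shifts each wait by up to +/- jitter * delay
        delay += delay * jitter * (2 * rng.random() - 1)
        delays.append(delay)
    return delays

print(backoff_delays(3))
print(backoff_delays(8, cap=30.0))
```

With jitter left at 0 the schedule is deterministic, which makes it easy to check before using it in a real retry loop.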

### SUMMARY REFLECTION
#### Overall reflection: What was challenging about this notebook?
The most challenging aspect was navigating the OpenWeather API limitations and understanding why historical data access took time to activate. Initially, I assumed that once my API key showed as "Active" on the dashboard, all endpoints would be immediately accessible. However, I discovered through trial and error (and a 401 error) that historical data access requires additional activation time beyond the basic key activation.
#### Were there any errors you had to debug? What was more complex than expected?
Understanding Unix timestamps was also initially confusing. Converting between human-readable dates and seconds-since-1970 required careful testing to ensure I didn't accidentally request the wrong time period. I used online converters to verify my datetime calculations before running the full API request.
#### What did you learn about API authentication and data collection?
I learned that APIs often have multiple endpoints for different purposes (current vs forecast vs historical). Reading the documentation carefully to select the right endpoint was more important than I expected: using the wrong one would mean collecting the wrong type of data entirely. This experience reinforced that data collection isn't just "download and go". It requires understanding authentication, rate limits, error handling, and data documentation practices.
#### What would you do differently if you had to collect data from a different city?
- Research the city's geography first: identify the main urban center, suburbs, industrial zones, and green spaces.
- Check where official government monitoring stations are located (equivalent to UK DEFRA) to align with official data for validation.
- Consider collecting from MULTIPLE coordinates if API limits allow (e.g., city center, one suburb, one industrial area) to capture spatial variation.
- Still avoid extreme locations (airports, highways, parks) that aren't representative of typical urban exposure.
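Collecting from multiple coordinates mostly means building one set of request parameters per site. A minimal sketch, reusing two of the London coordinates from this notebook as stand-ins; the site labels, SITES and build_site_requests are hypothetical, and the key is a placeholder.

```python
# Site labels are hypothetical; coordinates are two of the options considered above.
SITES = {
    "central": (51.5074, -0.1278),
    "canary_wharf": (51.5055, -0.0196),
}

def build_site_requests(sites, start_ts, end_ts, api_key):
    """Return one history-request parameter dict per named site."""
    return {
        name: {"lat": lat, "lon": lon, "start": start_ts, "end": end_ts, "appid": api_key}
        for name, (lat, lon) in sites.items()
    }

requests_by_site = build_site_requests(SITES, 1640995200, 1735689599, "YOUR_KEY")
for name, site_params in requests_by_site.items():
    print(name, site_params["lat"], site_params["lon"])
```

Each dictionary can be passed as the params argument of one request, and the site name can go into the output filename to keep the raw files separate.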
