# DEFRA Dataset Assessment
1) I'll start by adding the main paths and modules I will be using in this notebook.

In [None]:
# Python modules I will be using below
import os
import pandas as pd
from pathlib import Path
import csv

# define the base path relative to the home directory instead of hardcoding an absolute path
base_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "optimised"
# metadata file with pollutant names, locations and site names
metadata_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "test" / "std_london_sites_pollutant.csv"

# output path for saving the statistics from the first function
stats_output_path = base_dir / "dataset_statistics.csv"

# log file from the NaN replacement process
nan_log_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "logs" / "NaN_values_record.csv"



## 1) Initial Dataset Assessment: Raw Numbers

Before conducting quality checks, I need to establish the baseline characteristics of the DEFRA dataset. This section calculates comprehensive statistics about the data collection effort, including file counts, measurement records, station coverage, and pollutant distribution.

### Purpose
- Document the scale and scope of data collection.
- Establish baseline metrics for comparison with LAQN.
- Provide context for subsequent quality analysis.

### Methodology
The function `get_defra_dataset_statistics()` performs the following:
1. Loads standardised metadata to identify unique stations and pollutants.
2. Counts files across all three yearly directories (2023, 2024, 2025).
3. Calculates total measurement records by reading all CSV files.
4. Determines spatial coverage from unique coordinate pairs.
5. Documents temporal coverage (35 months: January 2023 to November 2025).
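
The 35-month figure in step 5 can be checked with a little date arithmetic. This is a minimal sketch, not part of the original function: `months_covered` is a hypothetical helper name, and it counts every calendar month touched by the range, so the partial month of November 2025 counts as one month.

```python
from datetime import date


def months_covered(start: date, end: date) -> int:
    """Count calendar months touched by the range, inclusive of both ends."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# Dates taken from the temporal coverage used in this notebook.
total_months = months_covered(date(2023, 1, 1), date(2025, 11, 19))
print(total_months)  # 35
```

Deriving the value this way avoids keeping a hand-counted number in sync if the collection end date changes.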

### Notes
- File counting is fast (scans directory structure only).
- Record counting can be slow (reads every CSV file).
- Results can be saved to CSV (the save step is currently commented out in the run cell).
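
If record counting turns out to be too slow, one option is to count rows with the standard `csv` module instead of building a DataFrame for every file. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it assumes each file is UTF-8 with exactly one header row, and `count_csv_rows` and `count_records` are hypothetical helper names.

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path):
    """Count data rows in one CSV file, excluding the header line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if the file is not empty
        # Blank lines come back as empty lists; skip them, as pandas does by default.
        return sum(1 for row in reader if row)


def count_records(year_dir):
    """Sum data rows over every CSV file under year_dir (searched recursively)."""
    return sum(count_csv_rows(path) for path in Path(year_dir).rglob("*.csv"))
```

Because rows are streamed one at a time, memory use stays flat regardless of file size. Unlike `pd.read_csv`, this does not parse values, so it cannot count missing measurements.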

In [None]:


def get_defra_dataset_statistics(base_dir, metadata_path, nan_log_path):
    """
    Calculate summary statistics for the DEFRA dataset.

    This function walks through the yearly measurement directories (2023, 2024, 2025)
    and calculates the key metrics needed for reporting.

    parameters:
        base_dir : Path
            base directory containing the DEFRA data folders
        metadata_path : Path
            path to the standardised metadata CSV file
        nan_log_path : Path
            path to the log file written when flagged values were replaced with NaN
            (accepted here but not yet used in the missing-value count)

    returns:
        dict : dictionary containing all calculated statistics

    """
    
    stats = {}
    
    # read metadata to get station and pollutant info.
    metadata = pd.read_csv(metadata_path, encoding="utf-8")
    
    # Calculate metadata statistics.
    stats['unique_stations'] = metadata['station_name'].nunique()
    stats['total_combinations'] = len(metadata)
    stats['unique_pollutants'] = metadata['pollutant_std'].nunique()
    
    # Get pollutant breakdown
    pollutant_counts = metadata['pollutant_std'].value_counts()
    stats['pollutant_distribution'] = pollutant_counts.to_dict()
    
    # Count unique coordinates for spatial coverage; I will use this for the LAQN dataset as well.
    # Count unique lat/lon pairs instead of station names; validation comes afterwards.
    unique_coords = metadata[['latitude', 'longitude']].drop_duplicates()
    stats['unique_locations'] = len(unique_coords)
    
    # Count files in monthly data directories
    total_files = 0
    files_by_year = {}
    
    # Loop through each year's measurement directory
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        
        if year_dir.exists():
            # Count all CSV files in this year's directory and its subdirectories
            year_files = list(year_dir.rglob('*.csv'))
            files_by_year[year] = len(year_files)
            total_files += len(year_files)
            print(f"  {year}: {len(year_files)} files")
        else:
            files_by_year[year] = 0
            print(f"  {year}: Directory not found")
    
    stats['total_files'] = total_files
    stats['files_by_year'] = files_by_year
    
    # Calculate total measurement records, this requires reading all csv files and counting rows
    total_records = 0
    records_by_year = {}
    total_missing = 0
    missing_by_year = {}
    
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        year_records = 0
        # TODO: nan_log_path is not used yet to count missing values, so year_missing stays at 0
        year_missing = 0
        
        if year_dir.exists():
            # read each csv, count rows
            for csv_file in year_dir.rglob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    year_records += len(df)
                except Exception as e:
                    print(f"  Warning: Could not read {csv_file.name}: {e}")
            
            records_by_year[year] = year_records
            missing_by_year[year] = year_missing
            total_records += year_records
            total_missing += year_missing
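            # Guard against division by zero when a year directory holds no readable rows
            missing_pct = (year_missing / year_records * 100) if year_records > 0 else 0.0
            print(f"  {year}: {year_records:,} records, {year_missing:,} missing ({missing_pct:.2f}%)")
        else:
            records_by_year[year] = 0
            missing_by_year[year] = 0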
    
    stats['total_records'] = total_records
    stats['records_by_year'] = records_by_year

    stats['missing_by_year'] = missing_by_year
    stats['overall_completeness'] = ((total_records - total_missing) / total_records * 100) if total_records > 0 else 0
    
    
    # Temporal coverage is set manually from the collection period (not derived from the files)
    stats['temporal_coverage'] = {
        'start_date': '2023-01-01',
        'end_date': '2025-11-19',
        'total_months': 35  # Jan 2023 to 19 Nov 2025
    }
    
    return stats

In [15]:
def print_dataset_statistics(stats):
    """
    Print the dataset statistics in a formatted layout.
    
    parameters:
    stats : dict
        dictionary returned by get_defra_dataset_statistics()
    """
    
    print("\n" + "="*40)
    print("Defra dataset statistics: initial assessment")
    print("="*40)
    
    print("\nScale and scope:")
    print(f"Total files collected: {stats['total_files']:,}")
    print(f"Total measurement records: {stats['total_records']:,}")
    print(f"Unique monitoring stations: {stats['unique_stations']}")
    print(f"Total station-pollutant combinations: {stats['total_combinations']}")
    print(f"Unique pollutant types: {stats['unique_pollutants']}")
    print(f"Unique geographic locations: {stats['unique_locations']}")
    
    print("\nFiles by year:")
    for year, count in stats['files_by_year'].items():
        print(f"{year}: {count:,} files")
    
    print("\nRecords by year:")
    for year, count in stats['records_by_year'].items():
        print(f"{year}: {count:,} measurement records")
    
    print("\nTemporal coverage:")
    print(f"start date: {stats['temporal_coverage']['start_date']}")
    print(f"end date: {stats['temporal_coverage']['end_date']}")
    print(f"total months: {stats['temporal_coverage']['total_months']}")
    
    print("\nPollutant distribution:")
    print("Station/Pollutant combinations by type:")
    for pollutant, count in sorted(stats['pollutant_distribution'].items(), 
                                   key=lambda x: x[1], reverse=True):
        percentage = (count / stats['total_combinations']) * 100
        print(f"  {pollutant}: {count} ({percentage:.1f}%)")


In [None]:
# run the analysis
stats = get_defra_dataset_statistics(base_dir, metadata_dir, nan_log_path)
print_dataset_statistics(stats)

# # Save statistics for later use as csv
# pd.DataFrame(list(stats.items()), columns=["Metric", "Value"]).to_csv(stats_output_path, index=False)
# print(f"\nStatistics saved to: {stats_output_path}")

    2023: 1431 files
    2024: 1193 files
    2025: 939 files
    2023: 1,000,126 records
    2024: 868,320 records
    2025: 657,545 records

    ========================================
    Defra dataset statistics: initial assessment
    ========================================

    Scale and scope:
    Total files collected: 3,563
    Total measurement records: 2,525,991
    Unique monitoring stations: 18
    Total station-pollutant combinations: 144
    Unique pollutant types: 37
    Unique geographic locations: 20

    Files by year:
    2023: 1,431 files
    2024: 1,193 files
    2025: 939 files

    Records by year:
    2023: 1,000,126 measurement records
    2024: 868,320 measurement records
    2025: 657,545 measurement records

    Temporal coverage:
    start date: 2023-01-01
    end date: 2025-11-19
    total months: 35

    Pollutant distribution:
    Station/Pollutant combinations by type:
    PM10: 15 (10.4%)
    PM2.5: 15 (10.4%)
    NO2: 14 (9.7%)
    NOx: 14 (9.7%)
    NO: 14 (9.7%)
    O3: 9 (6.2%)
    SO2: 3 (2.1%)
    n-Pentane: 2 (1.4%)
    m,p-Xylene: 2 (1.4%)
    n-Butane: 2 (1.4%)
    n-Heptane: 2 (1.4%)
    n-Hexane: 2 (1.4%)
    n-Octane: 2 (1.4%)
    Propene: 2 (1.4%)
    o-Xylene: 2 (1.4%)
    Propane: 2 (1.4%)
    i-Pentane: 2 (1.4%)
    Toluene: 2 (1.4%)
    trans-2-Butene: 2 (1.4%)
    trans-2-Pentene: 2 (1.4%)
    Isoprene: 2 (1.4%)
    Ethyne: 2 (1.4%)
    i-Octane: 2 (1.4%)
    i-Hexane: 2 (1.4%)
    i-Butane: 2 (1.4%)
    Ethylbenzene: 2 (1.4%)
    Ethene: 2 (1.4%)
    Ethane: 2 (1.4%)
    cis-2-Butene: 2 (1.4%)
    Benzene: 2 (1.4%)
    1-Pentene: 2 (1.4%)
    1-Butene: 2 (1.4%)
    1,3-Butadiene: 2 (1.4%)
    1,3,5-TMB: 2 (1.4%)
    1,2,4-TMB: 2 (1.4%)
    1,2,3-TMB: 2 (1.4%)
    CO: 2 (1.4%)

- Notes on the statistics: there are 1,431 files for 2023, 1,193 for 2024 and 939 for 2025, a total of 3,563 files. That is roughly 1,000 fewer files than the LAQN dataset. DEFRA's issue rate is around 8%, while LAQN's only came down to about 17% after heavy cleaning. There are 18 monitoring stations, 37 unique pollutants and 144 station/pollutant combinations. Although the DEFRA dataset is smaller than the LAQN dataset in files and station/pollutant combinations, its lower issue rate promises better accuracy, and it covers roughly six times as many pollutant types.
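
One thing to watch when saving these statistics: the dictionary holds nested dictionaries (`files_by_year`, `pollutant_distribution`, `temporal_coverage`), so writing `stats.items()` straight to a two-column CSV would store each nested dictionary as a single text cell. A possible approach, sketched here with the hypothetical helpers `flatten_stats` and `save_stats_csv`, is to flatten the nesting into dotted metric names first.

```python
import csv


def flatten_stats(stats, parent_key=""):
    """Turn nested dictionaries into (metric, value) rows with dotted names."""
    rows = []
    for key, value in stats.items():
        name = f"{parent_key}.{key}" if parent_key else str(key)
        if isinstance(value, dict):
            rows.extend(flatten_stats(value, name))
        else:
            rows.append((name, value))
    return rows


def save_stats_csv(stats, output_path):
    """Write the flattened statistics as a two-column CSV file."""
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Metric", "Value"])
        writer.writerows(flatten_stats(stats))


# Small example using file counts from the printed output.
example = {"total_files": 3563, "files_by_year": {"2023": 1431, "2024": 1193, "2025": 939}}
print(flatten_stats(example))
```

Pollutant names such as `PM2.5` contain a dot themselves, so a different separator may be preferable if the metric names need to be split again later.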

## 2) Spatial Coverage Analysis

This section analyses spatial distribution patterns before accepting the dataset. I need to understand where DEFRA stations are located, identify any geographic biases, and compare coverage to LAQN.

### Purpose
- Create maps showing station locations across London.
- Analyse density by borough to identify coverage gaps.
- Compare spatial distribution to the LAQN network.
- Check whether any geographic areas are overrepresented or underrepresented.

### Methodology
1. Load DEFRA metadata with coordinates.
2. Create an interactive folium map showing all stations.
3. Calculate station density by borough.
4. Identify coverage gaps in London.
5. Compare to the LAQN spatial distribution.
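
Steps 3 and 4 could look roughly like the sketch below. It rests on assumptions: the `borough` column is hypothetical (the metadata file may need a spatial join or lookup to get it), and the station names, boroughs and coordinates are placeholders, not real DEFRA sites.

```python
import pandas as pd

# Illustrative rows only: every value here is a placeholder, and the
# "borough" column is hypothetical (not confirmed in the metadata file).
metadata = pd.DataFrame({
    "station_name": ["Site A", "Site A", "Site B", "Site C"],
    "pollutant_std": ["NO2", "PM10", "NO2", "O3"],
    "borough": ["Borough X", "Borough X", "Borough X", "Borough Y"],
    "latitude": [51.50, 51.50, 51.52, 51.45],
    "longitude": [-0.10, -0.10, -0.12, -0.20],
})

# The metadata has one row per station/pollutant pair, so count unique
# station names per borough rather than rows.
stations_per_borough = (
    metadata.groupby("borough")["station_name"]
    .nunique()
    .sort_values(ascending=False)
)

# Boroughs with no station never appear in the groupby result, so reindex
# against a full borough list (hypothetical here) to expose the gaps.
all_boroughs = ["Borough X", "Borough Y", "Borough Z"]
density = stations_per_borough.reindex(all_boroughs, fill_value=0)
gaps = density[density == 0].index.tolist()

print(density)
print("Boroughs without a station:", gaps)
```

Counting with `nunique()` matters here: a plain row count would rank boroughs by how many pollutants their stations measure, not by how many stations they have.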



Sources:
- folium: https://python-visualization.github.io/folium/latest/getting_started.html
- pandas groupby: https://pandas.pydata.org/docs/user_guide/groupby.html
- GeoPandas plotting: https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geoseries
- GeoPandas general: https://geopandas.org/en/stable/getting_started.html



## 3) Statistical Validation

This section addresses a critical gap in the LAQN report by applying formal statistical tests to validate data quality patterns. While descriptive statistics showed a 0% issue rate (before I noticed the dataset's flags), I need statistical evidence that this pattern is real and not due to chance.
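
As one possible starting point, a two-proportion z-test can check whether the issue rates of two datasets differ by more than chance would explain. This is a sketch of a standard test, not necessarily the one this analysis will use; `two_proportion_z_test` is a hypothetical helper, and the counts in the example are placeholders, not measured values.

```python
import math


def two_proportion_z_test(issues_a, total_a, issues_b, total_b):
    """Two-sided z-test for a difference between two issue rates."""
    rate_a = issues_a / total_a
    rate_b = issues_b / total_b
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (issues_a + issues_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Both samples are all issues or all clean: no detectable difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts for illustration only (not real DEFRA or LAQN figures).
z, p = two_proportion_z_test(issues_a=80, total_a=1000, issues_b=170, total_b=1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

The test assumes independent observations. Hourly measurements from the same station are usually autocorrelated, so the p-value is likely to be optimistic unless the data are aggregated or the test is adjusted.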