# LAQN Dataset Assesment


1) I'll be start adding my main paths and modules I will be using in this notebook below.

In [58]:
# possible python modules i will be using below
from curses import meta
import os
import pandas as pd
from pathlib import Path
import csv
from collections import defaultdict
#function 7 importing the full analysis function from pollution_analysis
import sys
sys.path.append('/mnt/user-data/outputs')

#last detailed anlasye and visualization imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set visualisation style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

#findings 7 .func
# for parse pdf uk pollutant limitations to csv
import re
# pdfplumber for pdf parsing

# function 5. chi-square test
from scipy import stats

#define base path  without hardcoding
base_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn" / "optimased"
#metadata file for pollutant name, location and site names
metadata_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"laqn"/"optimased_siteSpecies.csv"

# output path for saving statistics 1. function
#the first analyse dataset created without inclitiong nan optimased files, and cross referencing that's why changed the name to dataset_statistics-noNAN-incl.csv
os.makedirs(base_dir / "report", exist_ok=True)
stats_output_path =  Path.home()/"Desktop" / "data science projects" / "air-pollution-levels" / "data" /"laqn"/"report"/ "laqn_stats.csv"

# output paths for saving all the pollutant distribution and nan value analysis.
pollutant_distrubution_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"laqn" / "report"/"pollutant_distribution.csv"
nan_val_pollutant_split_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"laqn" / "report" / "nan_values_by_pollutant.csv"
nan_val_stationPollutant_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"laqn" / "report" / "nan_values_by_station_pollutant.csv"


# log file from nan replacement process
nan_log_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn" / "logs" / "NaN_values_record.csv"

# function for uk pollutant regulations pdf to parse csv file path
csv_output_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "capabilities" / "uk_pollutant_limits.csv"


# data quality metrics report output path
quality_output = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn"/ "report" / "quality_metrics_validation.csv"
quality_output.parent.mkdir(parents=True, exist_ok=True)

#chi-square test output path func 5
chi_square_output1 = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn" / "report" / "chi_square_tests1.csv"
chi_square_output = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn" / "report" / "chi_square_tests.csv"

# detailed last analysis and visualization output directory
report_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "laqn" / "report" / "detailed_analysis"

report_dir.mkdir(parents=True, exist_ok=True)

## 1) Initial Dataset Assessment: Raw Numbers

Before conducting quality checks, I need to establish the baseline characteristics of the LAQN dataset. This section calculates comprehensive statistics about the data collection effort, including file counts, measurement records, station coverage, and pollutant distribution.

### Purpose
- Document the scale and scope of data collection.
- Establish baseline metrics for comparison with LAQN.
- Provide context for subsequent quality analysis.

### Methodology
The function `get_laqn_dataset_statistics()` performs the following:
1. Loads standardised metadata to identify unique stations and pollutants.
2. Counts files across all three yearly directories (2023, 2024, 2025).
3. Calculates total measurement records by reading all CSV files.
4. Determines spatial coverage from unique coordinate pairs.
5. Documents temporal coverage (35 months: January 2023 to November 2025).

### Notes
- File counting is fast (scans directory structure only).
- Record counting can be slow (reads every CSV file).
- Results are saved to csv.

In [26]:
def get_laqn_dataset_statistics(base_dir, metadata_path, nan_log_path):
    """
    Calculate statistics for the LAQN dataset using the new column structure.
    This function scans all CSV files recursively under base_dir and calculates key metrics needed for reporting.

    Parameters:
        base_dir : Path
            Base directory containing LAQN data folders.
        metadata_path : Path
            Path to the standardised metadata csv file.
        nan_log_path : Path
            Path to the NaN values log file after notice data flags, changed them to NaN.

    Returns:
        dict : Dictionary containing all calculated statistics.
    """
    stats = {}

    # read metadata to get station and pollutant info
    print("\nloading metadata...")
    metadata = pd.read_csv(metadata_path, encoding="utf-8")

    # calculate metadata statistics
    stats['unique_stations'] = metadata['SiteCode'].nunique()
    stats['total_combinations'] = len(metadata)
    stats['unique_pollutants'] = metadata['SpeciesCode'].nunique()

    # get pollutant breakdown
    pollutant_counts = metadata['SpeciesCode'].value_counts()
    stats['pollutant_distribution'] = pollutant_counts.to_dict()

    # create set of expected (SiteCode, SpeciesCode) pairs from metadata
    expected_pairs = set(
        zip(metadata['SiteCode'], metadata['SpeciesCode'])
    )
    stats['expected_pairs'] = len(expected_pairs)
    print(f"  expected SiteCode/SpeciesCode pairs from metadata: {len(expected_pairs)}")

    # count unique coordinates for spatial coverage
    unique_coords = metadata[['Latitude', 'Longitude']].drop_duplicates()
    stats['unique_locations'] = len(unique_coords)

    # Scan all CSVs in all subfolders under base_dir
    print("\nscanning optimased directory for collected data...")
    all_csv_files = list(Path(base_dir).rglob('*.csv'))
    total_files = len(all_csv_files)
    print(f"\nTotal CSV files found: {total_files}")
    stats['total_files'] = total_files

    # Count files, records, and missing values by period (e.g., "2023_apr")
    files_by_period = defaultdict(int)
    records_by_period = defaultdict(int)
    missing_by_period = defaultdict(int)

    all_csvs = []
    total_records = 0
    total_missing = 0

    print("\nReading all CSV files to calculate statistics...")
    for csv_file in all_csv_files:
        period = csv_file.parent.name
        try:
            df = pd.read_csv(csv_file)
            n_records = len(df)
            n_missing = df['@Value'].isna().sum() + (df['@Value'] == "").sum() if '@Value' in df.columns else 0
            all_csvs.append(df)
            files_by_period[period] += 1
            records_by_period[period] += n_records
            missing_by_period[period] += n_missing
            total_records += n_records
            total_missing += n_missing
        except Exception as e:
            print(f"  warning: could not read {csv_file.name}: {e}")

    stats['files_by_period'] = dict(files_by_period)
    stats['records_by_period'] = dict(records_by_period)
    stats['missing_by_period'] = dict(missing_by_period)
    stats['total_records'] = total_records
    stats['total_missing'] = total_missing
    stats['overall_completeness'] = ((total_records - total_missing) / total_records * 100) if total_records > 0 else 0

    for period in files_by_period:
        rec = records_by_period[period]
        miss = missing_by_period[period]
        miss_pct = (miss / rec * 100) if rec > 0 else 0
        print(f"  {period}: {files_by_period[period]} files, {rec:,} records, {miss:,} missing ({miss_pct:.2f}%)")

    # cross-reference metadata with collected data
    print("\ncross-referencing collected data with metadata...")

    if all_csvs:
        all_data = pd.concat(all_csvs, ignore_index=True)

        # check if required columns exist in csv files
        if 'SiteCode' in all_data.columns and 'SpeciesCode' in all_data.columns:
            # identify actual (SiteCode, SpeciesCode) pairs in collected data
            collected_pairs = set(
                zip(all_data['SiteCode'], all_data['SpeciesCode'])
            )
            stats['collected_pairs'] = len(collected_pairs)

            # find missing pairs (in metadata but not in collected data)
            missing_pairs = expected_pairs - collected_pairs
            stats['missing_pairs'] = list(missing_pairs)
            stats['missing_pairs_count'] = len(missing_pairs)

            # find extra pairs (in collected data but not in metadata)
            extra_pairs = collected_pairs - expected_pairs
            stats['extra_pairs'] = list(extra_pairs)
            stats['extra_pairs_count'] = len(extra_pairs)

            print(f"  expected pairs from metadata: {len(expected_pairs)}")
            print(f"  actually collected pairs: {len(collected_pairs)}")
            print(f"  missing pairs (in metadata but not collected): {len(missing_pairs)}")
            print(f"  extra pairs (collected but not in metadata): {len(extra_pairs)}")

            # group by SiteCode and SpeciesCode, count missing values
            missing_breakdown = {}
            for (site, species), group in all_data.groupby(['SiteCode', 'SpeciesCode']):
                total_rows = len(group)
                if '@Value' in group.columns:
                    missing_rows = group['@Value'].isna().sum() + (group['@Value'] == "").sum()
                else:
                    missing_rows = 0
                missing_breakdown[(site, species)] = (int(missing_rows), int(total_rows))
            stats['missing_by_station_pollutant'] = missing_breakdown
        else:
            print("  warning: SiteCode or SpeciesCode columns not found")
            stats['missing_by_station_pollutant'] = {}
            stats['collected_pairs'] = 0
            stats['missing_pairs'] = []
            stats['missing_pairs_count'] = 0
            stats['extra_pairs'] = []
            stats['extra_pairs_count'] = 0
    else:
        stats['missing_by_station_pollutant'] = {}
        stats['collected_pairs'] = 0
        stats['missing_pairs'] = list(expected_pairs)
        stats['missing_pairs_count'] = len(expected_pairs)
        stats['extra_pairs'] = []
        stats['extra_pairs_count'] = 0

    # distribution of nan by pollutant over time
    if stats['missing_by_station_pollutant']:
        pollutant_missing_summary = {}
        for (site, species), (missing, total) in stats['missing_by_station_pollutant'].items():
            if species not in pollutant_missing_summary:
                pollutant_missing_summary[species] = {'total_missing': 0, 'total_records': 0}
            pollutant_missing_summary[species]['total_missing'] += missing
            pollutant_missing_summary[species]['total_records'] += total
        for species in pollutant_missing_summary:
            total_missing = pollutant_missing_summary[species]['total_missing']
            total_records = pollutant_missing_summary[species]['total_records']
            percentage = (total_missing / total_records * 100) if total_records > 0 else 0
            pollutant_missing_summary[species]['percentage_missing'] = percentage
        stats['missing_by_pollutant_type'] = pollutant_missing_summary
    else:
        stats['missing_by_pollutant_type'] = {}

    # log file created during data cleaning process
    if Path(nan_log_path).exists():
        nan_log = pd.read_csv(nan_log_path)
        replacements_by_year = nan_log.groupby('year_folder')['invalid_flags_replaced'].sum().to_dict()
        stats['nan_replacements_by_year'] = replacements_by_year
        stats['total_nan_replacements'] = nan_log['invalid_flags_replaced'].sum()
        stats['mean_invalid_percentage'] = nan_log['percentage_invalid'].mean()
        stats['max_invalid_percentage'] = nan_log['percentage_invalid'].max()
    else:
        stats['nan_replacements_by_year'] = {}
        stats['total_nan_replacements'] = 0
        stats['mean_invalid_percentage'] = 0
        stats['max_invalid_percentage'] = 0

    # calculate temporal coverage based on the files collected
    stats['temporal_coverage'] = {
        'start_date': '2023-01-01',
        'end_date': '2025-11-19',
        'total_months': 35
    }


    def extract_year(period):
        # period is usually like '2023_apr' or '2024_jan'
        return str(period)[:4] if len(str(period)) >= 4 and str(period)[:4].isdigit() else 'unknown'

    files_by_year = defaultdict(int)
    records_by_year = defaultdict(int)
    missing_by_year = defaultdict(int)
    for period in files_by_period:
        year = extract_year(period)
        files_by_year[year] += files_by_period[period]
        records_by_year[year] += records_by_period[period]
        missing_by_year[year] += missing_by_period[period]
    stats['files_by_year'] = dict(files_by_year)
    stats['records_by_year'] = dict(records_by_year)
    stats['missing_by_year'] = dict(missing_by_year)

    return stats

In [27]:
def print_dataset_statistics(stats):
    """
    Print dataset statistics for LAQN using new column structure.

    Parameters:
        stats : dict
            returned by get_laqn_dataset_statistics().
    """

    print("\n" + "="*40)
    print("LAQN dataset statistics: initial assessment")
    print("="*40)

    print("\nScale and scope:")
    print(f"Total files collected: {stats['total_files']:,}")
    print(f"Total measurement records: {stats['total_records']:,}")
    print(f"Total missing values (@Value): {stats['total_missing']:,}")
    print(f"Overall completeness: {stats['overall_completeness']:.2f}%")
    print(f"Unique monitoring sites (SiteCode): {stats['unique_stations']}")
    print(f"Total site-species combinations: {stats['total_combinations']}")
    print(f"Unique pollutant types (SpeciesCode): {stats['unique_pollutants']}")
    print(f"Unique geographic locations: {stats['unique_locations']}")

    # data collection coverage
    print("\nData collection coverage:")
    print(f"Expected SiteCode/SpeciesCode pairs (from metadata): {stats.get('expected_pairs', 0)}")
    print(f"Actually collected pairs: {stats.get('collected_pairs', 0)}")
    print(f"Missing pairs (not collected): {stats.get('missing_pairs_count', 0)}")
    print(f"Extra pairs (not in metadata): {stats.get('extra_pairs_count', 0)}")

    if stats.get('missing_pairs_count', 0) > 0:
        print(f"\nWarning: {stats['missing_pairs_count']} SiteCode/SpeciesCode pairs from metadata were not found in collected data.")
        print("First 10 missing pairs:")
        for i, (site, species) in enumerate(stats['missing_pairs'][:10], 1):
            print(f"  {i}. {site} - {species}")

    if stats.get('extra_pairs_count', 0) > 0:
        print(f"\nNote: {stats['extra_pairs_count']} SiteCode/SpeciesCode pairs in collected data are not in metadata.")

    print("\nFiles by year:")
    for year, count in stats['files_by_year'].items():
        print(f"  {year}: {count:,} files")

    print("\nRecords by year:")
    for year, count in stats['records_by_year'].items():
        missing = stats['missing_by_year'].get(year, 0)
        missing_pct = (missing / count * 100) if count > 0 else 0
        print(f"  {year}: {count:,} records, {missing:,} missing ({missing_pct:.2f}%)")

    # adding nan value summary below
    print("\nNaN replacement summary:")
    print(f"Total invalid flags replaced: {stats['total_nan_replacements']:,}")
    print(f"Mean invalid percentage per file: {stats['mean_invalid_percentage']:.2f}%")
    print(f"Max invalid percentage: {stats['max_invalid_percentage']:.2f}%")

    # count of replacements by year
    if stats['nan_replacements_by_year']:
        print("\nReplacements by year:")
        for year_folder, count in stats['nan_replacements_by_year'].items():
            print(f"  {year_folder}: {count:,} flags replaced")

    print("\nTemporal coverage:")
    print(f"Start date: {stats['temporal_coverage']['start_date']}")
    print(f"End date: {stats['temporal_coverage']['end_date']}")
    print(f"Total months: {stats['temporal_coverage']['total_months']}")

    print("\nPollutant (SpeciesCode) distribution:")
    print("Site/species combinations by type:")
    for species, count in sorted(stats['pollutant_distribution'].items(),
                                 key=lambda x: x[1], reverse=True):
        percentage = (count / stats['total_combinations']) * 100
        print(f"  {species}: {count} ({percentage:.1f}%)")

    # missing value distribution by pollutant type
    print("\nMissing value distribution by pollutant type (SpeciesCode):")
    if stats.get('missing_by_pollutant_type'):
        # sort by percentage missing (highest first)
        sorted_species = sorted(
            stats['missing_by_pollutant_type'].items(),
            key=lambda x: x[1]['percentage_missing'],
            reverse=True
        )
        print(f"{'SpeciesCode':<20} {'total records':>15} {'missing':>12} {'% missing':>12}")
        print("-" * 60)
        for species, data in sorted_species:
            print(f"{species:<20} {data['total_records']:>15,} {data['total_missing']:>12,} {data['percentage_missing']:>11.2f}%")
    else:
        print("  No missing value distribution available.")

    # print missing values by site/species breakdown with row_number column
    print("\nMissing values by site/species (SiteCode/SpeciesCode):")
    if stats.get('missing_by_station_pollutant'):
        # prepare a sorted list by missing percentage descending
        breakdown = []
        for (site, species), (missing, total) in stats['missing_by_station_pollutant'].items():
            percent = (missing / total * 100) if total > 0 else 0
            breakdown.append((site, species, missing, total, percent))
        # sort by percentage descending and take top 20
        breakdown.sort(key=lambda x: x[4], reverse=True)
        breakdown = breakdown[:20]
        print(f"{'SiteCode':<20} {'SpeciesCode':<20} {'missing':>10} {'total_row':>12} {'% missing':>12}")
        print("-" * 60)
        for site, species, missing, total, percent in breakdown:
            print(f"{site:<20} {species:<20} {missing:>10,} {total:>12,} {percent:>11.2f}%")
    else:
        print("  No missing value breakdown available.")

In [28]:
# Run the analysis
stats = get_laqn_dataset_statistics(base_dir, metadata_path, nan_log_path)
print_dataset_statistics(stats)

# Save statistics for later use as csv
# Prepare flat data structure for csv
stats_rows = []
stats_rows.append(["metric", "value"])
stats_rows.append(["total_files", stats['total_files']])
stats_rows.append(["total_records", stats['total_records']])
stats_rows.append(["total_missing", stats['total_missing']])
stats_rows.append(["overall_completeness_pct", f"{stats['overall_completeness']:.2f}"])
stats_rows.append(["unique_sites", stats['unique_stations']])
stats_rows.append(["total_site_species_combinations", stats['total_combinations']])
stats_rows.append(["unique_species", stats['unique_pollutants']])
stats_rows.append(["unique_locations", stats['unique_locations']])
stats_rows.append(["expected_site_species_pairs", stats.get('expected_pairs', 0)])
stats_rows.append(["collected_site_species_pairs", stats.get('collected_pairs', 0)])
stats_rows.append(["missing_site_species_pairs_count", stats.get('missing_pairs_count', 0)])
stats_rows.append(["extra_site_species_pairs_count", stats.get('extra_pairs_count', 0)])
stats_rows.append(["total_nan_replacements", stats['total_nan_replacements']])
stats_rows.append(["mean_invalid_pct", f"{stats['mean_invalid_percentage']:.2f}"])
stats_rows.append(["max_invalid_pct", f"{stats['max_invalid_percentage']:.2f}"])

# Add year-specific metrics
for year in ['2023', '2024', '2025']:
    stats_rows.append([f"files_{year}", stats['files_by_year'].get(year, 0)])
    stats_rows.append([f"records_{year}", stats['records_by_year'].get(year, 0)])
    stats_rows.append([f"missing_{year}", stats['missing_by_year'].get(year, 0)])
    year_key = f'{year}measurements'
    stats_rows.append([f"replacements_{year}", stats['nan_replacements_by_year'].get(year_key, 0)])

# Save to csv stats report
pd.DataFrame(stats_rows[1:], columns=stats_rows[0]).to_csv(stats_output_path, index=False)
print(f"\nStatistics saved to: {stats_output_path}")

# Save species (pollutant) distribution to csv
total_combinations = stats['total_combinations']
species_distribution_df = pd.DataFrame(
    [
        {
            'SpeciesCode': k,
            'count': v,
            'percentage': round((v / total_combinations) * 100, 2) if total_combinations > 0 else 0
        }
        for k, v in stats['pollutant_distribution'].items()
    ]
)
species_distribution_df.to_csv(pollutant_distrubution_path, index=False)
print(f"Species (pollutant) distribution saved to: {pollutant_distrubution_path}")

# Save missing value distribution by species to csv
if stats.get('missing_by_pollutant_type'):
    missing_by_species_df = pd.DataFrame([
        {
            'SpeciesCode': k,
            'total_records': v['total_records'],
            'total_missing': v['total_missing'],
            'percentage_missing': v['percentage_missing']
        }
        for k, v in stats['missing_by_pollutant_type'].items()
    ])
    missing_by_species_df.to_csv(nan_val_pollutant_split_path, index=False)
    print(f"Missing value distribution by species saved to: {nan_val_pollutant_split_path}")

# Save missing values by site/species to csv
if stats.get('missing_by_station_pollutant'):
    missing_by_site_species_df = pd.DataFrame([
        {
            'SiteCode': k[0],
            'SpeciesCode': k[1],
            'missing': v[0],
            'total_row': v[1],
            'percentage_missing': (v[0] / v[1] * 100) if v[1] > 0 else 0
        }
        for k, v in stats['missing_by_station_pollutant'].items()
    ])
    missing_by_site_species_df.to_csv(nan_val_stationPollutant_path, index=False)
    print(f"Missing values by site/species saved to: {nan_val_stationPollutant_path}")


loading metadata...
  expected SiteCode/SpeciesCode pairs from metadata: 170

scanning optimased directory for collected data...

Total CSV files found: 4932

Reading all CSV files to calculate statistics...
  2023_mar: 141 files, 101,520 records, 14,883 missing (14.66%)
  2025_feb: 141 files, 91,368 records, 12,034 missing (13.17%)
  2024_feb: 141 files, 94,752 records, 9,582 missing (10.11%)
  2025_aug: 141 files, 101,520 records, 16,123 missing (15.88%)
  2024_aug: 141 files, 101,520 records, 19,425 missing (19.13%)
  2025_mar: 141 files, 101,520 records, 15,384 missing (15.15%)
  2023_feb: 141 files, 91,368 records, 12,838 missing (14.05%)
  2024_mar: 141 files, 101,520 records, 11,279 missing (11.11%)
  2023_aug: 141 files, 101,520 records, 11,360 missing (11.19%)
  2024_jul: 141 files, 101,520 records, 12,934 missing (12.74%)
  2025_jul: 141 files, 101,520 records, 16,506 missing (16.26%)
  2024_oct: 141 files, 101,520 records, 11,079 missing (10.91%)
  2023_sep: 141 files, 98,1

    loading metadata...
    expected SiteCode/SpeciesCode pairs from metadata: 170

    scanning optimased directory for collected data...

    Total CSV files found: 4932

    Reading all CSV files to calculate statistics...
    2023_mar: 141 files, 101,520 records, 14,883 missing (14.66%)
    2025_feb: 141 files, 91,368 records, 12,034 missing (13.17%)
    2024_feb: 141 files, 94,752 records, 9,582 missing (10.11%)
    2025_aug: 141 files, 101,520 records, 16,123 missing (15.88%)
    2024_aug: 141 files, 101,520 records, 19,425 missing (19.13%)
    2025_mar: 141 files, 101,520 records, 15,384 missing (15.15%)
    2023_feb: 141 files, 91,368 records, 12,838 missing (14.05%)
    2024_mar: 141 files, 101,520 records, 11,279 missing (11.11%)
    2023_aug: 141 files, 101,520 records, 11,360 missing (11.19%)
    2024_jul: 141 files, 101,520 records, 12,934 missing (12.74%)
    2025_jul: 141 files, 101,520 records, 16,506 missing (16.26%)
    2024_oct: 141 files, 101,520 records, 11,079 missing (10.91%)
    2023_sep: 141 files, 98,136 records, 11,727 missing (11.95%)
    2025_oct: 141 files, 101,520 records, 12,342 missing (12.16%)
    2023_jan: 141 files, 101,520 records, 17,911 missing (17.64%)
    2023_jul: 141 files, 101,520 records, 11,160 missing (10.99%)
    2024_jan: 141 files, 101,520 records, 10,375 missing (10.22%)
    2025_sep: 141 files, 98,136 records, 18,344 missing (18.69%)
    2024_sep: 141 files, 98,136 records, 15,469 missing (15.76%)
    2023_oct: 141 files, 101,520 records, 13,964 missing (13.75%)
    2025_jan: 141 files, 101,520 records, 11,198 missing (11.03%)
    2024_dec: 141 files, 101,520 records, 9,421 missing (9.28%)
    2024_apr: 141 files, 98,136 records, 11,208 missing (11.42%)
    2024_nov: 141 files, 98,136 records, 9,063 missing (9.24%)
    2023_may: 141 files, 101,520 records, 12,641 missing (12.45%)
    2025_nov: 141 files, 60,912 records, 8,613 missing (14.14%)
    2025_apr: 141 files, 98,136 records, 10,606 missing (10.81%)
    2024_may: 141 files, 101,520 records, 12,839 missing (12.65%)
    2025_may: 141 files, 101,520 records, 10,843 missing (10.68%)
    2023_nov: 138 files, 96,048 records, 10,851 missing (11.30%)
    2023_apr: 141 files, 98,136 records, 10,992 missing (11.20%)
    2023_dec: 141 files, 101,520 records, 14,882 missing (14.66%)
    2025_jun: 141 files, 98,136 records, 14,018 missing (14.28%)
    2024_jun: 141 files, 98,136 records, 9,661 missing (9.84%)
    2023_jun: 141 files, 98,136 records, 12,103 missing (12.33%)

    cross-referencing collected data with metadata...
    expected pairs from metadata: 170
    actually collected pairs: 141
    missing pairs (in metadata but not collected): 53
    extra pairs (collected but not in metadata): 24

    ========================================
    LAQN dataset statistics: initial assessment
    ========================================

    Scale and scope:
    Total files collected: 4,932
    Total measurement records: 3,446,208
    Total missing values (@Value): 443,658
    Overall completeness: 87.13%
    Unique monitoring sites (SiteCode): 78
    Total site-species combinations: 173
    Unique pollutant types (SpeciesCode): 6
    Unique geographic locations: 76

    Data collection coverage:
    Expected SiteCode/SpeciesCode pairs (from metadata): 170
    Actually collected pairs: 141
    Missing pairs (not collected): 53
    Extra pairs (not in metadata): 24

    Warning: 53 SiteCode/SpeciesCode pairs from metadata were not found in collected data.
    First 10 missing pairs:
    1. BL0 - PM25
    2. TH4 - PM25
    3. BT6 - PM25
    4. MEB - PM25
    5. GN6 - PM25
    6. GR8 - PM25
    7. GN3 - PM25
    8. TL6 - PM25
    9. GT1 - PM25
    10. CE3 - PM25

    Note: 24 SiteCode/SpeciesCode pairs in collected data are not in metadata.

    Files by year:
    2023: 1,689 files
    2025: 1,551 files
    2024: 1,692 files

    Records by year:
    2023: 1,192,464 records, 155,312 missing (13.02%)
    2025: 1,055,808 records, 146,011 missing (13.83%)
    2024: 1,197,936 records, 142,335 missing (11.88%)

    NaN replacement summary:
    Total invalid flags replaced: 0
    Mean invalid percentage per file: 0.00%
    Max invalid percentage: 0.00%

    Temporal coverage:
    Start date: 2023-01-01
    End date: 2025-11-19
    Total months: 35

    Pollutant (SpeciesCode) distribution:
    Site/species combinations by type:
    NO2: 60 (34.7%)
    PM25: 53 (30.6%)
    PM10: 43 (24.9%)
    O3: 11 (6.4%)
    SO2: 4 (2.3%)
    CO: 2 (1.2%)

    Missing value distribution by pollutant type (SpeciesCode):
    SpeciesCode            total records      missing    % missing
    ------------------------------------------------------------
    O3                           268,320       47,056       17.54%
    PM2.5                        586,944      100,755       17.17%
    SO2                           97,824       15,803       16.15%
    PM10                       1,026,456      126,749       12.35%
    NO2                        1,417,752      148,803       10.50%
    CO                            48,912        4,492        9.18%

    Missing values by site/species (SiteCode/SpeciesCode):
    SiteCode             SpeciesCode             missing    total_row    % missing
    ------------------------------------------------------------
    WM6                  PM10                     15,357       24,456       62.79%
    CE3                  NO2                      11,394       24,456       46.59%
    TL4                  NO2                       9,869       24,456       40.35%
    RI2                  O3                        9,732       24,456       39.79%
    WA7                  NO2                       9,236       24,456       37.77%
    BG1                  SO2                       8,373       24,456       34.24%
    TH4                  PM2.5                     8,042       24,456       32.88%
    CD1                  PM10                      7,711       24,456       31.53%
    WAA                  NO2                       7,697       24,456       31.47%
    CE3                  PM10                      7,652       24,456       31.29%
    CE3                  PM2.5                     7,652       24,456       31.29%
    TH4                  PM10                      7,457       24,456       30.49%
    CD1                  PM2.5                     7,339       24,456       30.01%
    GN6                  PM2.5                     7,240       24,456       29.60%
    CR8                  PM2.5                     7,207       24,456       29.47%
    GN0                  PM2.5                     7,099       24,456       29.03%
    HG4                  O3                        7,092       24,456       29.00%
    CD1                  NO2                       6,984       24,456       28.56%
    BT5                  PM2.5                     6,783       24,456       27.74%
    MY1                  O3                        6,634       24,456       27.13%

    Statistics saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/report/laqn_stats.csv
    Species (pollutant) distribution saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/report/pollutant_distribution.csv
    Missing value distribution by species saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/report/nan_values_by_pollutant.csv
    Missing values by site/species saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/laqn/report/nan_values_by_station_pollutant.csv

## 3) Data Quality validations:


A critical gap from the laqn report by applying formal statistical tests to validate data quality patterns. While descriptive statistics show 0% (before I notice the flags of the dataset) issue rate, I need statistical evidence that this pattern is real and not due to chance.


#### Purpuse:
 Checking data qualities if it is in the limits of eea, and make sence for general logic.
- Outlier detection in pollutant measurements.
- Data validity ranges based on WHO/EEA standards.
- Measurement consistency across time periods.
- Quality flags and suspicious patterns.

### methodology
 applies environmental data quality assessment standards:
1. Load aggregated measurement data from all csv files.
2. Calculate statistical distributions for each pollutant type.
3. Identify outliers using IQR method and domain knowledge.
4. Check values against established valid ranges.
5. Flag suspicious patterns constant values, extreme spikes.
6. Calculate quality scores for each station-pollutant combination.

#### air quality measurement standards

- Uk air quality objectives, limits and policy.
- https://uk-air.defra.gov.uk/air-pollution/uk-limits
- chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://uk-air.defra.gov.uk/assets/documents/Air_Quality_Objectives_Update_20230403.pdf

- DEFRA. (2023). *Air Pollution in the UK 2022*.
  - Source: https://uk-air.defra.gov.uk/library/annualreport/
  - Air Quality Objectives and limit values
  - Compliance assessment methodology

- UK Air Information Resource. (2024). *Air Pollution: UK Limits*.
  - Source: https://uk-air.defra.gov.uk/air-pollution/uk-limits
  - Current UK air quality objectives
  - Legal limit values and target dates
  - Measurement unit specifications (µg/m³)

  -  for the rest of the pollutants


- uk voc policy:
  - https://assets.publishing.service.gov.uk/media/5d7a2912ed915d522e4164a5/VO__statement_Final_12092019_CS__1_.pdf



In [None]:

def find_negative_values(base_dir, dry_run=True):
    summary = []
    all_csvs = list(Path(base_dir).rglob("*.csv"))
    for csv_file in all_csvs:
        try:
            df = pd.read_csv(csv_file)
            if '@Value' in df.columns and 'SpeciesCode' in df.columns and 'SiteCode' in df.columns:
                df_valid = df[pd.to_numeric(df['@Value'], errors='coerce').notnull()].copy()
                df_valid['@Value'] = df_valid['@Value'].astype(float)
                negatives = df_valid[df_valid['@Value'] < 0]
                if not negatives.empty:
                    grouped = negatives.groupby(['SpeciesCode', 'SiteCode']).size().reset_index(name='neg_count')
                    for _, row in grouped.iterrows():
                        total = df_valid[(df_valid['SpeciesCode'] == row['SpeciesCode']) & (df_valid['SiteCode'] == row['SiteCode'])].shape[0]
                        percent = (row['neg_count'] / total * 100) if total > 0 else 0
                        summary.append({
                            'SpeciesCode': row['SpeciesCode'],
                            'SiteCode': row['SiteCode'],
                            'NegativeCount': row['neg_count'],
                            'TotalCount': total,
                            'PercentNegative': round(percent, 2),
                            'File': str(csv_file)
                        })
        except Exception as e:
            print(f"Warning: Could not process {csv_file}: {e}")

    summary_df = pd.DataFrame(summary)
    if not summary_df.empty:
        print("Negative value summary (by pollutant and site):")
        display(summary_df)
    else:
        print("No negative values found in the scanned files.")
    return summary_df

In [44]:
#  Example usage:
neg_summary = find_negative_values(base_dir)
neg_summary.to_csv("laqn_negative_value_summary.csv", index=False)

No negative values found in the scanned files.


In [41]:
def replace_negatives_with_nan(base_dir):
    base_dir = Path(base_dir)
    for csv_file in base_dir.rglob('*.csv'):
        df = pd.read_csv(csv_file)
        if '@Value' in df.columns:
            df['@Value'] = pd.to_numeric(df['@Value'], errors='coerce')
            df.loc[df['@Value'] < 0, '@Value'] = np.nan
            df.to_csv(csv_file, index=False)
        else:
            print(f"Skipped (no @Value column): {csv_file}")


In [42]:
# Example usage:
replace_negatives_with_nan(base_dir)

In [65]:
def calculate_quality_metrics(base_dir, csv_output_path):
    """
    validates laqn measurements against uk air quality objectives with averaging periods.
    
    - path: data/laqn/optimased/YYYY_month/SiteCode_SpeciesCode_YYYY-MM-DD_YYYY-MM-DD.csv
    - columns: @MeasurementDateGMT, @Value, SpeciesCode, SiteCode, SpeciesName, SiteName, SiteType, Latitude, Longitude
    
    parameters:
        base_dir : path to laqn optimased directory (e.g., data/laqn/optimased/)
        uk_limits_path : path to uk_pollutant_limits.csv from parsed pdf
    
    """
    if not Path(csv_output_path).exists():
        print(f"error: uk limits file not found at {csv_output_path}")
        return {}
    
    # load uk legal limits from parsed pdf
    uk_limits = pd.read_csv(csv_output_path, encoding="utf-8")
    uk_limits.columns = [col.strip().replace(' ', '_') for col in uk_limits.columns]
    uk_limits_dict = {}
    
    for _, row in uk_limits.iterrows():
        poll_std = row['pollutant_std']
        limit_val = row['limit']
        conc_type = str(row['concentration_measured_as']).lower().strip()
        unit = row['unit']
        
        if pd.notna(poll_std) and pd.notna(limit_val):
            if poll_std not in uk_limits_dict:
                uk_limits_dict[poll_std] = []
            
            # averaging period detection
            avg_period = 'unknown'
            if 'annual' in conc_type and 'running' in conc_type:
                avg_period = 'running_annual'
            elif 'running annual' in conc_type:
                avg_period = 'running_annual'
            elif 'annual' in conc_type:
                avg_period = 'annual'
            elif '24 hour' in conc_type or '24-hour' in conc_type:
                avg_period = '24hour'
            elif '8 hour' in conc_type or '8-hour' in conc_type:
                avg_period = '8hour'
            elif '1 hour' in conc_type or '1-hour' in conc_type or 'hour mean' in conc_type:
                avg_period = '1hour'
            elif 'maximum daily' in conc_type:
                avg_period = 'daily_max'
            
            uk_limits_dict[poll_std].append({
                'limit': float(limit_val),
                'type': conc_type,
                'unit': unit,
                'avg_period': avg_period
            })
    
    print(f"\nuk limits loaded for {len(uk_limits_dict)} pollutants:")
    for poll, limits in uk_limits_dict.items():
        period_info = ', '.join([f"{lim['avg_period']}: {lim['limit']}" for lim in limits])
        print(f"  {poll}: {period_info}")
    
    # load all laqn measurement data with timestamp
    print("\nloading laqn measurement data...")
    all_data = []
    base_path = Path(base_dir)
    
    # iterate through year-month folders (e.g., 2023_jan, 2024_feb, etc.)
    for year_month_dir in base_path.glob('*'):
        if year_month_dir.is_dir():
            print(f"processing {year_month_dir.name}...")
            # iterate through csv files (SiteCode_SpeciesCode_YYYY-MM-DD_YYYY-MM-DD.csv)
            for csv_file in year_month_dir.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    # check for required columns
                    required_cols = {'@MeasurementDateGMT', '@Value', 'SpeciesCode'}
                    if not required_cols.issubset(df.columns):
                        print(f"skipped {csv_file}: missing required columns. found: {list(df.columns)}")
                        continue
                    
                    if not df.empty:
                        all_data.append(df)
                except Exception as e:
                    print(f"error reading {csv_file}: {e}")
    
    if not all_data:
        print("error: no measurement data found")
        return {}
    
    df_all = pd.concat(all_data, ignore_index=True)
    print(f"loaded {len(df_all):,} total records")
    
    # filter valid values and parse timestamp
    df_all['@Value'] = pd.to_numeric(df_all['@Value'], errors='coerce')
    df_valid = df_all[df_all['@Value'].notna()].copy()
    
    # parse timestamp - laqn uses ISO format with timezone
    df_valid['@MeasurementDateGMT'] = pd.to_datetime(df_valid['@MeasurementDateGMT'], errors='coerce')
    df_valid = df_valid[df_valid['@MeasurementDateGMT'].notna()]
    
    print(f"analysing {len(df_valid):,} valid measurements with timestamps")
    
    # calculate quality metrics for each pollutant
    print("\nprocessing quality metrics by pollutant...")
    quality_results = {}
    
    for pollutant in df_valid['SpeciesCode'].unique():
        if pd.isna(pollutant):
            continue
        
        print(f"\nprocessing {pollutant}...")
        
        poll_data = df_valid[df_valid['SpeciesCode'] == pollutant].copy()
        
        if len(poll_data) == 0:
            continue
        
        # basic statistics on raw hourly data
        q_metrics = {
            'pollutant': pollutant,
            'count': int(len(poll_data)),
            'mean_hourly': float(poll_data['@Value'].mean()),
            'median_hourly': float(poll_data['@Value'].median()),
            'std_hourly': float(poll_data['@Value'].std()),
            'min': float(poll_data['@Value'].min()),
            'max': float(poll_data['@Value'].max()),
            'p95': float(poll_data['@Value'].quantile(0.95)),
            'p99': float(poll_data['@Value'].quantile(0.99))
        }
        
        # check for suspicious values
        negative_count = (poll_data['@Value'] < 0).sum()
        zero_count = (poll_data['@Value'] == 0).sum()
        
        q_metrics['negative_values'] = int(negative_count)
        q_metrics['negative_pct'] = float((negative_count / len(poll_data) * 100))
        q_metrics['zero_values'] = int(zero_count)
        q_metrics['zero_pct'] = float((zero_count / len(poll_data) * 100))
        
        # laqn uses standard codes (NO2, PM10, PM25, SO2, CO, O3) - direct match to uk limits
        # but PM2.5 might appear as PM25 in laqn
        poll_std_code = pollutant
        if pollutant == 'PM25':
            poll_std_code = 'PM25'  # uk limits use PM25
        
        # check against uk limits with proper averaging
        if poll_std_code in uk_limits_dict:
            uk_poll_limits = uk_limits_dict[poll_std_code]
            
            for limit_info in uk_poll_limits:
                avg_period = limit_info['avg_period']
                limit_value = limit_info['limit']
                
                if avg_period == 'annual':
                    poll_data['year'] = poll_data['@MeasurementDateGMT'].dt.year
                    annual_means = poll_data.groupby('year')['@Value'].mean()
                    
                    q_metrics['uk_annual_limit'] = limit_value
                    q_metrics['mean_annual'] = float(annual_means.mean())
                    q_metrics['exceeds_uk_annual'] = q_metrics['mean_annual'] > limit_value
                    
                    print(f"  annual mean: {q_metrics['mean_annual']:.2f} vs limit {limit_value}")
                
                elif avg_period == '24hour':
                    poll_data['date'] = poll_data['@MeasurementDateGMT'].dt.date
                    daily_means = poll_data.groupby('date')['@Value'].mean()
                    
                    exceedances = (daily_means > limit_value).sum()
                    
                    q_metrics['uk_24hour_limit'] = limit_value
                    q_metrics['daily_exceedances'] = int(exceedances)
                    q_metrics['daily_exceedances_pct'] = float((exceedances / len(daily_means) * 100))
                    
                    print(f"  24-hour: {exceedances} days exceed {limit_value}")
                
                elif avg_period == '8hour':
                    poll_data_sorted = poll_data.sort_values('@MeasurementDateGMT')
                    poll_data_sorted['rolling_8h'] = poll_data_sorted['@Value'].rolling(window=8, min_periods=6).mean()
                    
                    exceedances = (poll_data_sorted['rolling_8h'] > limit_value).sum()
                    
                    q_metrics['uk_8hour_limit'] = limit_value
                    q_metrics['8hour_exceedances'] = int(exceedances)
                    q_metrics['8hour_exceedances_pct'] = float((exceedances / len(poll_data_sorted) * 100))
                    
                    print(f"  8-hour: {exceedances} periods exceed {limit_value}")
                
                elif avg_period == '1hour':
                    exceedances = (poll_data['@Value'] > limit_value).sum()
                    
                    q_metrics['uk_1hour_limit'] = limit_value
                    q_metrics['hourly_exceedances'] = int(exceedances)
                    q_metrics['hourly_exceedances_pct'] = float((exceedances / len(poll_data) * 100))
                    
                    print(f"  1-hour: {exceedances} hours exceed {limit_value}")
                
                elif avg_period == 'running_annual':
                    poll_data_sorted = poll_data.sort_values('@MeasurementDateGMT')
                    poll_data_sorted['rolling_annual'] = poll_data_sorted['@Value'].rolling(window=24*365, min_periods=24*300).mean()
                    
                    q_metrics['uk_running_annual_limit'] = limit_value
                    q_metrics['mean_running_annual'] = float(poll_data_sorted['rolling_annual'].mean())
                    q_metrics['exceeds_running_annual'] = q_metrics['mean_running_annual'] > limit_value
                    
                    print(f"  running annual: {q_metrics['mean_running_annual']:.2f} vs limit {limit_value}")
                
                elif avg_period == 'daily_max':
                    poll_data_sorted = poll_data.sort_values('@MeasurementDateGMT')
                    poll_data_sorted['date'] = poll_data_sorted['@MeasurementDateGMT'].dt.date
                    poll_data_sorted['rolling_8h'] = poll_data_sorted['@Value'].rolling(window=8, min_periods=6).mean()
                    
                    daily_max = poll_data_sorted.groupby('date')['rolling_8h'].max()
                    exceedances = (daily_max > limit_value).sum()
                    
                    q_metrics['uk_daily_max_limit'] = limit_value
                    q_metrics['daily_max_exceedances'] = int(exceedances)
                    
                    print(f"  daily max 8h: {exceedances} days exceed {limit_value}")
            
            # overall assessment: use most restrictive limit for out of range check
            all_limits = [lim['limit'] for lim in uk_poll_limits]
            max_limit = max(all_limits)
            
            # define extreme threshold as 10x highest uk limit
            extreme_threshold = max_limit * 10
            out_of_range = (poll_data['@Value'] > extreme_threshold).sum()
            
            q_metrics['extreme_threshold'] = extreme_threshold
            q_metrics['out_of_range'] = int(out_of_range)
            q_metrics['out_of_range_pct'] = float((out_of_range / len(poll_data) * 100))
        
        else:
            print(f"  no uk limits defined for {poll_std_code}")
            q_metrics['uk_annual_limit'] = None
            q_metrics['exceeds_uk_annual'] = False
            q_metrics['out_of_range'] = 0
            q_metrics['out_of_range_pct'] = 0.0
        
        # always calculate annual mean
        poll_data['year'] = poll_data['@MeasurementDateGMT'].dt.year
        annual_means = poll_data.groupby('year')['@Value'].mean()
        q_metrics['mean_annual'] = float(annual_means.mean())
        
        # if annual limit exists, compare
        if 'uk_annual_limit' in q_metrics and q_metrics['uk_annual_limit'] is not None:
            q_metrics['exceeds_uk_annual'] = q_metrics['mean_annual'] > q_metrics['uk_annual_limit']
        else:
            q_metrics['exceeds_uk_annual'] = None
        
        # ozone (O3): count days where max running 8-hour mean > 100 µg/m³
        if poll_std_code == 'O3':
            poll_data_sorted = poll_data.sort_values('@MeasurementDateGMT')
            poll_data_sorted['rolling_8h'] = poll_data_sorted['@Value'].rolling(window=8, min_periods=6).mean()
            poll_data_sorted['date'] = poll_data_sorted['@MeasurementDateGMT'].dt.date
            daily_max_8h = poll_data_sorted.groupby('date')['rolling_8h'].max()
            o3_exceedance_days = (daily_max_8h > 100).sum()
            q_metrics['o3_exceedance_days'] = int(o3_exceedance_days)
        
        # CO: maximum daily running 8-hour mean
        if poll_std_code == 'CO':
            poll_data_sorted = poll_data.sort_values('@MeasurementDateGMT')
            poll_data_sorted['date'] = poll_data_sorted['@MeasurementDateGMT'].dt.date
            poll_data_sorted['rolling_8h'] = poll_data_sorted['@Value'].rolling(window=8, min_periods=6).mean()
            daily_max_8h = poll_data_sorted.groupby('date')['rolling_8h'].max()
            co_max_daily_8h_mean = daily_max_8h.max()
            q_metrics['co_max_daily_8h_mean'] = float(co_max_daily_8h_mean)
        
        quality_results[pollutant] = q_metrics
    
    return quality_results

In [66]:
def print_quality_metrics(quality_results):
    """
    print comprehensive quality metrics report with uk compliance for laqn data.
    
    parameters:
        quality_results : dict
            dictionary returned by calculate_quality_metrics_laqn
    """
    
    print("\n" + "="*40)
    print("laqn quality metrics report")
    print("="*40)
    
    for poll, metrics in quality_results.items():
        print(f"\n{poll}:")
        print(f"  total measurements: {metrics['count']:,}")
        print(f"  hourly mean: {metrics['mean_hourly']:.2f}")
        
        if 'mean_annual' in metrics:
            print(f"  annual mean: {metrics['mean_annual']:.2f}", end="")
            if 'uk_annual_limit' in metrics and metrics['uk_annual_limit'] is not None:
                print(f" (limit: {metrics['uk_annual_limit']})")
                status = "exceeds" if metrics['exceeds_uk_annual'] else "compliant"
                print(f"    status: {status}")
            else:
                print(" (no uk annual limit)")
        
        if 'o3_exceedance_days' in metrics:
            print(f"  O3 8-hour mean exceedance days: {metrics['o3_exceedance_days']}")
        if 'co_max_daily_8h_mean' in metrics:
            print(f"  CO max daily 8-hour mean: {metrics['co_max_daily_8h_mean']:.2f}")
        
        if 'daily_exceedances' in metrics:
            print(f"  24-hour exceedances: {metrics['daily_exceedances']} days")
        if 'hourly_exceedances' in metrics:
            print(f"  1-hour exceedances: {metrics['hourly_exceedances']} hours")
        if metrics['negative_values'] > 0:
            print(f"  warning: {metrics['negative_values']} negative values")
        if metrics['out_of_range'] > 0:
            print(f"  warning: {metrics['out_of_range']} extreme values")
    
    print("="*40)



    

In [68]:
# calculate quality metrics
quality_results = calculate_quality_metrics(base_dir, csv_output_path)

print_quality_metrics(quality_results)

if quality_results:
    # save comprehensive report
    print("\nsaving laqn quality metrics report...")
    
    quality_rows = []
    for poll, metrics in quality_results.items():
        row = {
            'pollutant': metrics['pollutant'],
            'total_measurements': metrics['count'],
            'mean_hourly': f"{metrics['mean_hourly']:.2f}",
            'min': f"{metrics['min']:.2f}",
            'max': f"{metrics['max']:.2f}",
            'p95': f"{metrics['p95']:.2f}",
            'negative_values': metrics['negative_values'],
            'zero_values': metrics['zero_values'],
            'out_of_range': metrics['out_of_range']
        }
        
        # add uk limit compliance fields
        if 'uk_annual_limit' in metrics and metrics['uk_annual_limit']:
            row['uk_annual_limit'] = metrics['uk_annual_limit']
            row['mean_annual'] = f"{metrics['mean_annual']:.2f}" if 'mean_annual' in metrics else 'n/a'
            row['exceeds_annual'] = 'yes' if metrics.get('exceeds_uk_annual', False) else 'no'
        
        if 'daily_exceedances' in metrics:
            row['uk_24hour_limit'] = metrics['uk_24hour_limit']
            row['daily_exceedances'] = metrics['daily_exceedances']
        
        if 'hourly_exceedances' in metrics:
            row['uk_1hour_limit'] = metrics['uk_1hour_limit']
            row['hourly_exceedances'] = metrics['hourly_exceedances']
        
        if 'o3_exceedance_days' in metrics:
            row['o3_exceedance_days'] = metrics['o3_exceedance_days']
        
        if 'co_max_daily_8h_mean' in metrics:
            row['co_max_daily_8h_mean'] = f"{metrics['co_max_daily_8h_mean']:.2f}"
        
        quality_rows.append(row)
    
    #ssave to quality metrics csv
    pd.DataFrame(quality_rows).to_csv("quality_metrics.csv", index=False)
    print(f"saved to: {quality_results}")
    print("done")
else:
    print("laqn quality metrics calculation failed")


uk limits loaded for 11 pollutants:
  PM10: 24hour: 50.0, annual: 40.0
  PM2.5: annual: 20.0
  NO2: annual: 40.0
  O3: 8hour: 100.0
  SO2: 24hour: 125.0
  PAH: annual: 0.25
  Benzene: running_annual: 16.25
  1,3-butadiene: running_annual: 2.25
  CO: daily_max: 10.0
  LEAD: annual: 0.5
  NOx: annual: 30.0

loading laqn measurement data...
processing 2023_mar...
processing 2025_feb...
processing 2024_feb...
processing 2025_aug...
processing 2024_aug...
processing 2025_mar...
processing 2023_feb...
processing 2024_mar...
processing 2023_aug...
processing 2024_jul...
processing 2025_jul...
processing 2024_oct...
processing 2023_sep...
processing 2025_oct...
processing 2023_jan...
processing 2023_jul...
processing 2024_jan...
processing 2025_sep...
processing 2024_sep...
processing 2023_oct...
processing 2025_jan...
processing 2024_dec...
processing 2024_apr...
processing 2024_nov...
processing 2023_may...
processing 2025_nov...
processing 2025_apr...
processing 2024_may...
processing 2025

    uk limits loaded for 11 pollutants:
    PM10: 24hour: 50.0, annual: 40.0
    PM2.5: annual: 20.0
    NO2: annual: 40.0
    O3: 8hour: 100.0
    SO2: 24hour: 125.0
    PAH: annual: 0.25
    Benzene: running_annual: 16.25
    1,3-butadiene: running_annual: 2.25
    CO: daily_max: 10.0
    LEAD: annual: 0.5
    NOx: annual: 30.0

    loading laqn measurement data...
    processing 2023_mar...
    processing 2025_feb...
    processing 2024_feb...
    processing 2025_aug...
    processing 2024_aug...
    processing 2025_mar...
    processing 2023_feb...
    processing 2024_mar...
    processing 2023_aug...
    processing 2024_jul...
    processing 2025_jul...
    processing 2024_oct...
    processing 2023_sep...
    processing 2025_oct...
    processing 2023_jan...
    processing 2023_jul...
    processing 2024_jan...
    processing 2025_sep...
    processing 2024_sep...
    processing 2023_oct...
    processing 2025_jan...
    processing 2024_dec...
    processing 2024_apr...
    processing 2024_nov...
    processing 2023_may...
    processing 2025_nov...
    processing 2025_apr...
    processing 2024_may...
    processing 2025_may...
    processing 2023_nov...
    processing 2023_apr...
    processing 2023_dec...
    processing report...
    processing 2025_jun...
    processing 2024_jun...
    processing 2023_jun...
    loaded 3,446,208 total records
    analysing 2,981,417 valid measurements with timestamps

    processing quality metrics by pollutant...

    processing NO2...
    annual mean: 23.30 vs limit 40.0

    processing PM2.5...
    annual mean: 9.29 vs limit 20.0

    processing PM10...
    24-hour: 3 days exceed 50.0
    annual mean: 17.32 vs limit 40.0

    processing SO2...
    24-hour: 0 days exceed 125.0

    processing O3...
    8-hour: 2741 periods exceed 100.0

    processing CO...
    daily max 8h: 0 days exceed 10.0

    ========================================
    laqn quality metrics report
    ========================================

    NO2:
    total measurements: 1,267,018
    hourly mean: 23.34
    annual mean: 23.30 (limit: 40.0)
        status: compliant

    PM2.5:
    total measurements: 482,559
    hourly mean: 9.25
    annual mean: 9.29 (limit: 20.0)
        status: compliant
    warning: 11 extreme values

    PM10:
    total measurements: 896,146
    hourly mean: 17.23
    annual mean: 17.32 (limit: 40.0)
        status: compliant
    24-hour exceedances: 3 days
    warning: 3 extreme values

    SO2:
    total measurements: 71,217
    hourly mean: 1.47
    annual mean: 1.52 (no uk annual limit)
    24-hour exceedances: 0 days

    O3:
    total measurements: 220,912
    hourly mean: 47.71
    annual mean: 47.80 (no uk annual limit)
    O3 8-hour mean exceedance days: 70

    CO:
    total measurements: 43,565
    hourly mean: 0.20
    annual mean: 0.20 (no uk annual limit)
    CO max daily 8-hour mean: 2.14
    ========================================

    saving laqn quality metrics report...
    saved to: {'NO2': {'pollutant': 'NO2', 'count': 1267018, 'mean_hourly': 23.33811129755063, 'median_hourly': 19.0, 'std_hourly': 17.169706453593633, 'min': 0.0, 'max': 376.3, 'p95': 57.1, 'p99': 78.3, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 632, 'zero_pct': 0.049880901455227944, 'uk_annual_limit': 40.0, 'mean_annual': 23.296558581709963, 'exceeds_uk_annual': False, 'extreme_threshold': 400.0, 'out_of_range': 0, 'out_of_range_pct': 0.0}, 'PM2.5': {'pollutant': 'PM2.5', 'count': 482559, 'mean_hourly': 9.249439550396948, 'median_hourly': 7.0, 'std_hourly': 7.869002875250606, 'min': 0.0, 'max': 909.0, 'p95': 23.9, 'p99': 38.8, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 2090, 'zero_pct': 0.4331076614465796, 'uk_annual_limit': 20.0, 'mean_annual': 9.287265567932431, 'exceeds_uk_annual': False, 'extreme_threshold': 200.0, 'out_of_range': 11, 'out_of_range_pct': 0.0022795140076135767}, 'PM10': {'pollutant': 'PM10', 'count': 896146, 'mean_hourly': 17.23079754861373, 'median_hourly': 14.4, 'std_hourly': 12.470894582830518, 'min': 0.0, 'max': 759.0, 'p95': 39.0, 'p99': 61.8, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 835, 'zero_pct': 0.09317678146194928, 'uk_24hour_limit': 50.0, 'daily_exceedances': 3, 'daily_exceedances_pct': 0.2944062806673209, 'uk_annual_limit': 40.0, 'mean_annual': 17.3172696741254, 'exceeds_uk_annual': False, 'extreme_threshold': 500.0, 'out_of_range': 3, 'out_of_range_pct': 0.0003347668795040094}, 'SO2': {'pollutant': 'SO2', 'count': 71217, 'mean_hourly': 1.46688431133016, 'median_hourly': 1.1, 'std_hourly': 3.2138526496443744, 'min': 0.0, 'max': 271.4, 'p95': 4.2, 'p99': 7.2, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 2348, 'zero_pct': 3.296965612143168, 'uk_24hour_limit': 125.0, 'daily_exceedances': 0, 'daily_exceedances_pct': 0.0, 'extreme_threshold': 1250.0, 'out_of_range': 0, 'out_of_range_pct': 0.0, 'mean_annual': 1.5215136717118256, 'exceeds_uk_annual': None}, 'O3': {'pollutant': 'O3', 'count': 220912, 'mean_hourly': 47.70532836604621, 'median_hourly': 48.1, 'std_hourly': 23.953550068547706, 'min': 0.0, 'max': 198.6, 'p95': 85.6, 'p99': 108.68899999999849, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 260, 'zero_pct': 0.11769392337220251, 'uk_8hour_limit': 100.0, '8hour_exceedances': 2741, '8hour_exceedances_pct': 1.2407655537046427, 'extreme_threshold': 1000.0, 'out_of_range': 0, 'out_of_range_pct': 0.0, 'mean_annual': 47.798197519325676, 'exceeds_uk_annual': None, 'o3_exceedance_days': 70}, 'CO': {'pollutant': 'CO', 'count': 43565, 'mean_hourly': 0.20249512223114888, 'median_hourly': 0.2, 'std_hourly': 0.18651998992979038, 'min': 0.0, 'max': 4.9, 'p95': 0.5, 'p99': 0.8, 'negative_values': 0, 'negative_pct': 0.0, 'zero_values': 6773, 'zero_pct': 15.546883966486858, 'uk_daily_max_limit': 10.0, 'daily_max_exceedances': 0, 'extreme_threshold': 100.0, 'out_of_range': 0, 'out_of_range_pct': 0.0, 'mean_annual': 0.2006708313657066, 'exceeds_uk_annual': None, 'co_max_daily_8h_mean': 2.1374999999999997}}
    done

## 5) Chi-Square Test for LAQN Data Quality

Uses statistical tests to mathematically prove that LAQN data collection process was consistent and reliable across time. 
It serves as a quality control check that ensures we didn't accidentally collect more data in some months than others, which could bias our analysis.

#### Why Chi-Square Test?
 - The chi-square test answers one simple question: Are my monthly file counts similar enough to trust, or are some months suspiciously different?
 - Air pollution varies by season
 - Policy decisions need unbiased evidence
 - Academic reviewers will question imbalanced datasets

### What Chi-Square Test Does

The chi-square test answers one simple question: Are my monthly file counts similar enough to trust, or are some months suspiciously different?

#### How It Works

1. What we observe: Count how many data files we have for each month.
2. What we expect: If data collection was perfect, each month should have roughly the same count.
3. The test: Measures how far observed counts are from the expected counts.
4. The result: Gives us a p-value that tells us if the differences are just random variation or a real problem.