# DEFRA Dataset Assesment


1) I'll be start adding my main paths and modules I will be using in this notebook below.

In [200]:
# possible python modules i will be using below
from curses import meta
import os
import pandas as pd
from pathlib import Path
import csv
#function 7 importing the full analysis function from pollution_analysis
import sys
sys.path.append('/mnt/user-data/outputs')

#last detailed anlasye and visualization imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set visualisation style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

#findings 7 .func
# for parse pdf uk pollutant limitations to csv
import re
# pdfplumber for pdf parsing
import pdfplumber

# function 5. chi-square test
from scipy import stats

#define base path  without hardcoding
base_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "optimised"
#metadata file for pollutant name, location and site names
metadata_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" /"test"/"std_london_sites_pollutant.csv"

# output path for saving statistics 1. function
#the first analyse dataset created without inclitiong nan optimased files, and cross referencing that's why changed the name to dataset_statistics-noNAN-incl.csv
os.makedirs(base_dir / "report", exist_ok=True)
stats_output_path = base_dir/"report"/ "defra_stats.csv"

# output paths for saving all the pollutant distribution and nan value analysis.
pollutant_distrubution_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"defra" / "pollutant_distribution.csv"
nan_val_pollutant_split_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"defra" / "report" / "nan_values_by_pollutant.csv"
nan_val_stationPollutant_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" /"defra" / "report" / "nan_values_by_station_pollutant.csv"


# log file from nan replacement process
nan_log_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "logs" / "NaN_values_record.csv"

# function for uk pollutant regulations pdf to parse csv file path
pdf_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "capabilities" / "Air_Quality_Objectives_Update.pdf"
csv_output_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "capabilities" / "uk_pollutant_limits.csv"


# data quality metrics report output path

quality_output = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra"/ "report" / "quality_metrics_validation.csv"
quality_output.parent.mkdir(parents=True, exist_ok=True)

#chi-square test output path func 5
chi_square_output1 = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "report" / "chi_square_tests1.csv"
chi_square_output = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "report" / "chi_square_tests.csv"

# detailed last analysis and visualization output directory
report_dir = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "report" / "detailed_analysis"

report_dir.mkdir(parents=True, exist_ok=True)

## 1) Initial Dataset Assessment: Raw Numbers

Before conducting quality checks, I need to establish the baseline characteristics of the DEFRA dataset. This section calculates comprehensive statistics about the data collection effort, including file counts, measurement records, station coverage, and pollutant distribution.

### Purpose
- Document the scale and scope of data collection.
- Establish baseline metrics for comparison with LAQN.
- Provide context for subsequent quality analysis.

### Methodology
The function `get_defra_dataset_statistics()` performs the following:
1. Loads standardised metadata to identify unique stations and pollutants.
2. Counts files across all three yearly directories (2023, 2024, 2025).
3. Calculates total measurement records by reading all CSV files.
4. Determines spatial coverage from unique coordinate pairs.
5. Documents temporal coverage (35 months: January 2023 to November 2025).

### Notes
- File counting is fast (scans directory structure only).
- Record counting can be slow (reads every CSV file).
- Results are saved to csv.

In [197]:
def get_defra_dataset_statistics(base_dir, metadata_path, nan_log_path):
    """
    Calculate statistics at DEFRA dataset.
    This function walks through the monthly data directories 2023, 2024, 2025 and calculates key metrics needed for reporting.
    
    Parameters:
        base_dir : Path
            Base directory containing defra data folders.
        metadata_path : Path
            Path to the standardised metadata csv file.
        nan_log_path : Path
            Path to the NaN values log file after notice data flags, changed them to NaN.
            
    Returns:
        dict : Dictionary containing all calculated statistics.
    """
    
    stats = {}
    
    # read metadata to get station and pollutant info
    print("\nloading metadata from std_london_sites_pollutant.csv...")
    metadata = pd.read_csv(metadata_path, encoding="utf-8")
    
    # calculate metadata statistics
    stats['unique_stations'] = metadata['station_name'].nunique()
    stats['total_combinations'] = len(metadata)
    stats['unique_pollutants'] = metadata['pollutant_std'].nunique()
    
    # get pollutant breakdown
    pollutant_counts = metadata['pollutant_std'].value_counts()
    stats['pollutant_distribution'] = pollutant_counts.to_dict()
    
    # create set of expected station/pollutant pairs from metadata
    expected_pairs = set(
        zip(metadata['station_name'], metadata['pollutant_std'])
    )
    stats['expected_pairs'] = len(expected_pairs)
    print(f"  expected station/pollutant pairs from metadata: {len(expected_pairs)}")
    
    # count unique coordinates for spatial coverage, i will be use this for laqn dataset as well
    # group by lat/lon and count unique locations, instead of station names and will do the validation afterwards
    unique_coords = metadata[['latitude', 'longitude']].drop_duplicates()
    stats['unique_locations'] = len(unique_coords)
    
    # count files in monthly data directories
    total_files = 0
    files_by_year = {}
    
    # loop through each years measurement directory
    print("\nscanning optimised directory for collected data...")
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        
        if year_dir.exists():
            # count all CSV files in this years directory and subdirectories
            year_files = list(year_dir.rglob('*.csv'))
            files_by_year[year] = len(year_files)
            total_files += len(year_files)
            print(f"  {year}: {len(year_files)} files")
        else:
            files_by_year[year] = 0
            print(f"  {year}: directory not found")
    
    stats['total_files'] = total_files
    stats['files_by_year'] = files_by_year
    
    # calculate total measurement records, this requires reading all csv files and counting rows
    total_records = 0
    records_by_year = {}
    total_missing = 0
    missing_by_year = {}
    
    # concatenate all CSVs for missing value breakdown
    all_csvs = []
    
    print("\nreading all CSV files to calculate statistics...")
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        year_records = 0
        year_missing = 0
        
        if year_dir.exists():
            # read each csv, count rows and missing values
            for csv_file in year_dir.rglob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    year_records += len(df)
                    
                    # count missing NaN or empty string values in value column
                    # calculation: missing values in value column only
                    if 'value' in df.columns:
                        missing_in_file = df['value'].isna().sum() + (df['value'] == "").sum()
                        year_missing += missing_in_file
                    
                    # store dataframe for later aggregation
                    all_csvs.append(df)
                    
                except Exception as e:
                    print(f"  warning: could not read {csv_file.name}: {e}")
            
            records_by_year[year] = year_records
            missing_by_year[year] = year_missing
            total_records += year_records
            total_missing += year_missing
            print(f"  {year}: {year_records:,} records, {year_missing:,} missing ({(year_missing/year_records*100):.2f}%)")
        else:
            records_by_year[year] = 0
            missing_by_year[year] = 0
    
    stats['total_records'] = total_records
    stats['records_by_year'] = records_by_year
    stats['missing_by_year'] = missing_by_year
    stats['total_missing'] = total_missing
    stats['overall_completeness'] = ((total_records - total_missing) / total_records * 100) if total_records > 0 else 0
    
    # cross-reference metadata with collected data
    print("\ncross-referencing collected data with metadata...")
    
    if all_csvs:
        all_data = pd.concat(all_csvs, ignore_index=True)
        
        # check if required columns exist in csv files
        # file structure: timestamp,value,timeseries_id,station_name,pollutant_name,pollutant_std,latitude,longitude
        if 'station_name' in all_data.columns and 'pollutant_std' in all_data.columns:
            # identify actual station/pollutant pairs in collected data
            collected_pairs = set(
                zip(all_data['station_name'], all_data['pollutant_std'])
            )
            stats['collected_pairs'] = len(collected_pairs)
            
            # find missing pairs (in metadata but not in collected data)
            missing_pairs = expected_pairs - collected_pairs
            stats['missing_pairs'] = list(missing_pairs)
            stats['missing_pairs_count'] = len(missing_pairs)
            
            # find extra pairs (in collected data but not in metadata)
            extra_pairs = collected_pairs - expected_pairs
            stats['extra_pairs'] = list(extra_pairs)
            stats['extra_pairs_count'] = len(extra_pairs)
            
            print(f"  expected pairs from metadata: {len(expected_pairs)}")
            print(f"  actually collected pairs: {len(collected_pairs)}")
            print(f"  missing pairs (in metadata but not collected): {len(missing_pairs)}")
            print(f"  extra pairs (collected but not in metadata): {len(extra_pairs)}")
            
            # group by station and pollutant_std, count missing values
            # calculation: (100 * missing value cell number) / (total number of row value col)
            missing_breakdown = {}
            
            for (station, pollutant), group in all_data.groupby(['station_name', 'pollutant_std']):
                total_rows = len(group)
                # count missing in value column
                if 'value' in group.columns:
                    missing_rows = group['value'].isna().sum() + (group['value'] == "").sum()
                else:
                    missing_rows = 0
                
                missing_breakdown[(station, pollutant)] = (int(missing_rows), int(total_rows))
            
            stats['missing_by_station_pollutant'] = missing_breakdown
        else:
            print("  warning: station_name or pollutant_std columns not found")
            stats['missing_by_station_pollutant'] = {}
            stats['collected_pairs'] = 0
            stats['missing_pairs'] = []
            stats['missing_pairs_count'] = 0
            stats['extra_pairs'] = []
            stats['extra_pairs_count'] = 0
    else:
        stats['missing_by_station_pollutant'] = {}
        stats['collected_pairs'] = 0
        stats['missing_pairs'] = list(expected_pairs)
        stats['missing_pairs_count'] = len(expected_pairs)
        stats['extra_pairs'] = []
        stats['extra_pairs_count'] = 0
    
    # distribution of nan by pollutant over time
    if stats['missing_by_station_pollutant']:
        pollutant_missing_summary = {}
        
        for (station, pollutant), (missing, total) in stats['missing_by_station_pollutant'].items():
            if pollutant not in pollutant_missing_summary:
                pollutant_missing_summary[pollutant] = {'total_missing': 0, 'total_records': 0}
            
            pollutant_missing_summary[pollutant]['total_missing'] += missing
            pollutant_missing_summary[pollutant]['total_records'] += total
        
        # calculate percentages
        for pollutant in pollutant_missing_summary:
            total_missing = pollutant_missing_summary[pollutant]['total_missing']
            total_records = pollutant_missing_summary[pollutant]['total_records']
            percentage = (total_missing / total_records * 100) if total_records > 0 else 0
            pollutant_missing_summary[pollutant]['percentage_missing'] = percentage
        
        stats['missing_by_pollutant_type'] = pollutant_missing_summary
    else:
        stats['missing_by_pollutant_type'] = {}
    
    # log file created during data cleaning process
    if Path(nan_log_path).exists():
        nan_log = pd.read_csv(nan_log_path)
        
        # calculate replacement statistics per year
        replacements_by_year = nan_log.groupby('year_folder')['invalid_flags_replaced'].sum().to_dict()
        stats['nan_replacements_by_year'] = replacements_by_year
        stats['total_nan_replacements'] = nan_log['invalid_flags_replaced'].sum()
        
        # get mean percentage of invalid flags
        stats['mean_invalid_percentage'] = nan_log['percentage_invalid'].mean()
        stats['max_invalid_percentage'] = nan_log['percentage_invalid'].max()
        
    else:
        stats['nan_replacements_by_year'] = {}
        stats['total_nan_replacements'] = 0
        stats['mean_invalid_percentage'] = 0
        stats['max_invalid_percentage'] = 0
    
    # calculate temporal coverage based on the files collected, understands which months have data
    stats['temporal_coverage'] = {
        'start_date': '2023-01-01',
        'end_date': '2025-11-19',  
        'total_months': 35
    }
    
    return stats

In [198]:
def print_dataset_statistics(stats):
    """
    Print dataset statistics
    
    Parameters:
        stats : dict
            returned by get_defra_dataset_statistics().
    """
    
    print("\n" + "="*40)
    print("Defra dataset statistics: initial assessment")
    print("="*40)
    
    print("\nScale and scope:")
    print(f"Total files collected: {stats['total_files']:,}")
    print(f"Total measurement records: {stats['total_records']:,}")
    print(f"Total missing values (nan): {stats['total_missing']:,}")
    print(f"Overall completeness: {stats['overall_completeness']:.2f}%")
    print(f"Unique monitoring stations: {stats['unique_stations']}")
    print(f"Total station-pollutant combinations: {stats['total_combinations']}")
    print(f"Unique pollutant types: {stats['unique_pollutants']}")
    print(f"Unique geographic locations: {stats['unique_locations']}")
    
    # data collection coverage
    print("\nData collection coverage:")
    print(f"Expected pairs (from metadata): {stats.get('expected_pairs', 0)}")
    print(f"Actually collected pairs: {stats.get('collected_pairs', 0)}")
    print(f"Missing pairs (not collected): {stats.get('missing_pairs_count', 0)}")
    print(f"Extra pairs (not in metadata): {stats.get('extra_pairs_count', 0)}")
    
    if stats.get('missing_pairs_count', 0) > 0:
        print(f"\nwarning: {stats['missing_pairs_count']} station/pollutant pairs from metadata were not found in collected data.")
        print("first 10 missing pairs:")
        for i, (station, pollutant) in enumerate(stats['missing_pairs'][:10], 1):
            print(f"  {i}. {station} - {pollutant}")
    
    if stats.get('extra_pairs_count', 0) > 0:
        print(f"\nNote: {stats['extra_pairs_count']} station/pollutant pairs in collected data are not in metadata.")
    
    print("\nfiles by year:")
    for year, count in stats['files_by_year'].items():
        print(f"  {year}: {count:,} files")
    
    print("\nrecords by year:")
    for year, count in stats['records_by_year'].items():
        missing = stats['missing_by_year'].get(year, 0)
        missing_pct = (missing / count * 100) if count > 0 else 0
        print(f"  {year}: {count:,} records, {missing:,} missing ({missing_pct:.2f}%)")
    
    # adding nan value summary below
    print("\nnan replacement summary:")
    print(f"Total invalid flags replaced: {stats['total_nan_replacements']:,}")
    print(f"Mean invalid percentage per file: {stats['mean_invalid_percentage']:.2f}%")
    print(f"Max invalid percentage: {stats['max_invalid_percentage']:.2f}%")
    
    # count of replacements by year
    if stats['nan_replacements_by_year']:
        print("\nreplacements by year:")
        for year_folder, count in stats['nan_replacements_by_year'].items():
            print(f"  {year_folder}: {count:,} flags replaced")
    
    print("\ntemporal coverage:")
    print(f"start date: {stats['temporal_coverage']['start_date']}")
    print(f"end date: {stats['temporal_coverage']['end_date']}")
    print(f"total months: {stats['temporal_coverage']['total_months']}")
    
    print("\npollutant distribution:")
    print("station/pollutant combinations by type:")
    for pollutant, count in sorted(stats['pollutant_distribution'].items(), 
                                   key=lambda x: x[1], reverse=True):
        percentage = (count / stats['total_combinations']) * 100
        print(f"  {pollutant}: {count} ({percentage:.1f}%)")
    
    # missing value distribution by pollutant type
    print("\nMissing value distribution by pollutant type:")
    if stats.get('missing_by_pollutant_type'):
        # sort by percentage missing (highest first)
        sorted_pollutants = sorted(
            stats['missing_by_pollutant_type'].items(),
            key=lambda x: x[1]['percentage_missing'],
            reverse=True
        )
        
        print(f"{'pollutant':<20} {'total records':>15} {'missing':>12} {'% missing':>12}")
        print("-" * 60)
        for pollutant, data in sorted_pollutants:
            print(f"{pollutant:<20} {data['total_records']:>15,} {data['total_missing']:>12,} {data['percentage_missing']:>11.2f}%")
    else:
        print("  no missing value distribution available.")
    
    # print missing values by station/pollutant breakdown with row_number column
    print("\nMissing values by station/pollutant:")
    if stats.get('missing_by_station_pollutant'):
        # prepare a sorted list by missing percentage descending
        breakdown = []
        for (station, pollutant), (missing, total) in stats['missing_by_station_pollutant'].items():
            percent = (missing / total * 100) if total > 0 else 0
            breakdown.append((station, pollutant, missing, total, percent))
        # sort by percentage descending and take top 20
        breakdown.sort(key=lambda x: x[4], reverse=True)
        breakdown = breakdown[:20]
        print(f"{'station':<30} {'pollutant':<20} {'missing':>10} {'total_row':>12} {'% missing':>12}")
        print("-" * 40)
        
        for station, pollutant, missing, total, percent in breakdown:
            print(f"{station:<30} {pollutant:<20} {missing:>10,} {total:>12,} {percent:>11.2f}%")
    else:
        print(" No missing value breakdown available.")

In [199]:
# run the analysis
stats = get_defra_dataset_statistics(base_dir, metadata_path, nan_log_path)
print_dataset_statistics(stats)

# # Save statistics for later use as csv
# save statistics for later use as csv
# prepare flat data structure for csv
stats_rows = []
stats_rows.append(["metric", "value"])
stats_rows.append(["total_files", stats['total_files']])
stats_rows.append(["total_records", stats['total_records']])
stats_rows.append(["total_missing", stats['total_missing']])
stats_rows.append(["overall_completeness_pct", f"{stats['overall_completeness']:.2f}"])
stats_rows.append(["unique_stations", stats['unique_stations']])
stats_rows.append(["total_combinations", stats['total_combinations']])
stats_rows.append(["unique_pollutants", stats['unique_pollutants']])
stats_rows.append(["unique_locations", stats['unique_locations']])
stats_rows.append(["expected_pairs", stats.get('expected_pairs', 0)])
stats_rows.append(["collected_pairs", stats.get('collected_pairs', 0)])
stats_rows.append(["missing_pairs_count", stats.get('missing_pairs_count', 0)])
stats_rows.append(["extra_pairs_count", stats.get('extra_pairs_count', 0)])
stats_rows.append(["total_nan_replacements", stats['total_nan_replacements']])
stats_rows.append(["mean_invalid_pct", f"{stats['mean_invalid_percentage']:.2f}"])
stats_rows.append(["max_invalid_pct", f"{stats['max_invalid_percentage']:.2f}"])

# add year-specific metrics
for year in ['2023', '2024', '2025']:
    stats_rows.append([f"files_{year}", stats['files_by_year'].get(year, 0)])
    stats_rows.append([f"records_{year}", stats['records_by_year'].get(year, 0)])
    stats_rows.append([f"missing_{year}", stats['missing_by_year'].get(year, 0)])
    year_key = f'{year}measurements'
    stats_rows.append([f"replacements_{year}", stats['nan_replacements_by_year'].get(year_key, 0)])

# save to csv stats report save func below (commented out for now to overwrite previous report)
pd.DataFrame(stats_rows[1:], columns=stats_rows[0]).to_csv(stats_output_path, index=False)
print(f"\nstatistics saved to: {stats_output_path}")

# save pollutant distribution to csv describe the path on top pollutant_distrubution_path
total_combinations = stats['total_combinations']
pollutant_distribution_df = pd.DataFrame(
    [
        {
            'pollutant': k,
            'count': v,
            'percentage': round((v / total_combinations) * 100, 2) if total_combinations > 0 else 0
        }
        for k, v in stats['pollutant_distribution'].items()
    ]
)
pollutant_distribution_df.to_csv(pollutant_distrubution_path, index=False)
print(f"Pollutant distribution saved to: {pollutant_distrubution_path}")

# Save missing value distribution by pollutant type to path described the path on top nan_val_pollutant_split_path
if stats.get('missing_by_pollutant_type'):
    missing_by_pollutant_df = pd.DataFrame([
        {
            'pollutant': k,
            'total_records': v['total_records'],
            'total_missing': v['total_missing'],
            'percentage_missing': v['percentage_missing']
        }
        for k, v in stats['missing_by_pollutant_type'].items()
    ])
    missing_by_pollutant_df.to_csv(nan_val_pollutant_split_path, index=False)
    print(f"Missing value distribution by pollutant type saved to: {nan_val_pollutant_split_path}")

# save missing values by station/pollutant to csv path on top nan_val_stationPollutant_path
if stats.get('missing_by_station_pollutant'):
    missing_by_station_pollutant_df = pd.DataFrame([
        {
            'station': k[0],
            'pollutant': k[1],
            'missing': v[0],
            'total_row': v[1],
            'percentage_missing': (v[0] / v[1] * 100) if v[1] > 0 else 0
        }
        for k, v in stats['missing_by_station_pollutant'].items()
    ])
    missing_by_station_pollutant_df.to_csv(nan_val_stationPollutant_path, index=False)
    print(f"Missing values by station/pollutant saved to: {nan_val_stationPollutant_path}")



loading metadata from std_london_sites_pollutant.csv...
  expected station/pollutant pairs from metadata: 141

scanning optimised directory for collected data...
  2023: 1431 files
  2024: 1193 files
  2025: 939 files

reading all CSV files to calculate statistics...
  2023: 1,000,126 records, 90,245 missing (9.02%)
  2024: 868,320 records, 101,522 missing (11.69%)
  2025: 657,545 records, 31,031 missing (4.72%)

cross-referencing collected data with metadata...
  expected pairs from metadata: 141
  actually collected pairs: 141
  missing pairs (in metadata but not collected): 0
  extra pairs (collected but not in metadata): 0

Defra dataset statistics: initial assessment

Scale and scope:
Total files collected: 3,563
Total measurement records: 2,525,991
Total missing values (nan): 222,798
Overall completeness: 91.18%
Unique monitoring stations: 18
Total station-pollutant combinations: 144
Unique pollutant types: 37
Unique geographic locations: 20

Data collection coverage:
Expected p

    loading metadata from std_london_sites_pollutant.csv...
    expected station/pollutant pairs from metadata: 141

    scanning optimised directory for collected data...
    2023: 1431 files
    2024: 1193 files
    2025: 939 files

    reading all CSV files to calculate statistics...
    2023: 1,000,126 records, 90,161 missing (9.01%)
    2024: 868,320 records, 101,256 missing (11.66%)
    2025: 657,545 records, 30,750 missing (4.68%)

    cross-referencing collected data with metadata...
    expected pairs from metadata: 141
    actually collected pairs: 141
    missing pairs (in metadata but not collected): 0
    extra pairs (collected but not in metadata): 0

    ========================================
    Defra dataset statistics: initial assessment
    ========================================

    Scale and scope:
    Total files collected: 3,563
    Total measurement records: 2,525,991
    Total missing values (nan): 222,167
    Overall completeness: 91.20%
    Unique monitoring stations: 18
    Total station-pollutant combinations: 144
    Unique pollutant types: 37
    Unique geographic locations: 20

    Data collection coverage:
    Expected pairs (from metadata): 141
    Actually collected pairs: 141
    Missing pairs (not collected): 0
    Extra pairs (not in metadata): 0

    files by year:
    2023: 1,431 files
    2024: 1,193 files
    2025: 939 files

    records by year:
    2023: 1,000,126 records, 90,161 missing (9.01%)
    2024: 868,320 records, 101,256 missing (11.66%)
    2025: 657,545 records, 30,750 missing (4.68%)

    nan replacement summary:
    Total invalid flags replaced: 222,167
    Mean invalid percentage per file: 9.61%
    Max invalid percentage: 100.00%

    replacements by year:
    2023measurements: 90,161 flags replaced
    2024measurements: 101,256 flags replaced
    2025measurements: 30,750 flags replaced

    temporal coverage:
    start date: 2023-01-01
    end date: 2025-11-19
    total months: 35

    pollutant distribution:
    station/pollutant combinations by type:
    PM10: 15 (10.4%)
    PM2.5: 15 (10.4%)
    NO2: 14 (9.7%)
    NOx: 14 (9.7%)
    NO: 14 (9.7%)
    O3: 9 (6.2%)
    SO2: 3 (2.1%)
    n-Pentane: 2 (1.4%)
    m,p-Xylene: 2 (1.4%)
    n-Butane: 2 (1.4%)
    n-Heptane: 2 (1.4%)
    n-Hexane: 2 (1.4%)
    n-Octane: 2 (1.4%)
    Propene: 2 (1.4%)
    o-Xylene: 2 (1.4%)
    Propane: 2 (1.4%)
    i-Pentane: 2 (1.4%)
    Toluene: 2 (1.4%)
    trans-2-Butene: 2 (1.4%)
    trans-2-Pentene: 2 (1.4%)
    Isoprene: 2 (1.4%)
    Ethyne: 2 (1.4%)
    i-Octane: 2 (1.4%)
    i-Hexane: 2 (1.4%)
    i-Butane: 2 (1.4%)
    Ethylbenzene: 2 (1.4%)
    Ethene: 2 (1.4%)
    Ethane: 2 (1.4%)
    cis-2-Butene: 2 (1.4%)
    Benzene: 2 (1.4%)
    1-Pentene: 2 (1.4%)
    1-Butene: 2 (1.4%)
    1,3-Butadiene: 2 (1.4%)
    1,3,5-TMB: 2 (1.4%)
    1,2,4-TMB: 2 (1.4%)
    1,2,3-TMB: 2 (1.4%)
    CO: 2 (1.4%)

    Missing value distribution by pollutant type:
    pollutant              total records      missing    % missing
    ------------------------------------------------------------
    PM10                         227,142       37,580       16.54%
    O3                           194,333       27,184       13.99%
    PM2.5                        234,748       29,623       12.62%
    SO2                           72,928        7,181        9.85%
    NO                           326,061       25,444        7.80%
    NO2                          326,072       25,429        7.80%
    NOx                          325,387       24,964        7.67%
    n-Octane                      26,649        1,764        6.62%
    CO                            48,578        3,078        6.34%
    m,p-Xylene                    25,503        1,612        6.32%
    1,3,5-TMB                     26,649        1,641        6.16%
    Toluene                       26,649        1,640        6.15%
    i-Octane                      26,649        1,624        6.09%
    n-Heptane                     26,649        1,622        6.09%
    1,2,4-TMB                     26,649        1,610        6.04%
    Ethylbenzene                  26,649        1,592        5.97%
    Benzene                       26,649        1,586        5.95%
    o-Xylene                      26,649        1,568        5.88%
    1,2,3-TMB                     26,649        1,560        5.85%
    1-Pentene                     26,572        1,381        5.20%
    cis-2-Butene                  26,599        1,378        5.18%
    trans-2-Pentene               26,599        1,366        5.14%
    Isoprene                      26,618        1,341        5.04%
    Ethyne                        26,529        1,328        5.01%
    1,3-Butadiene                 26,568        1,320        4.97%
    i-Hexane                      26,599        1,321        4.97%
    trans-2-Butene                26,599        1,321        4.97%
    n-Hexane                      26,580        1,320        4.97%
    Propane                       26,618        1,316        4.94%
    Ethane                        26,599        1,315        4.94%
    Ethene                        26,618        1,312        4.93%
    Propene                       26,618        1,312        4.93%
    i-Butane                      26,599        1,308        4.92%
    1-Butene                      26,599        1,307        4.91%
    n-Butane                      26,599        1,307        4.91%
    i-Pentane                     26,618        1,306        4.91%
    n-Pentane                     26,618        1,306        4.91%

    Missing values by station/pollutant:
    station                        pollutant               missing    total_row    % missing
    ----------------------------------------
    London Eltham                  PM10                     16,337       16,826       97.09%
    London Eltham                  NO2                      13,187       16,840       78.31%
    London Eltham                  NO                       13,182       16,835       78.30%
    London Eltham                  NOx                      13,125       16,793       78.16%
    London Eltham                  O3                       12,537       16,842       74.44%
    London Teddington Bushy Park   PM10                     10,525       24,327       43.26%
    London Teddington Bushy Park   PM2.5                    20,820       48,656       42.79%
    London Haringey Priory Park South O3                        8,171       24,288       33.64%
    London Marylebone Road         PM10                        632        2,355       26.84%
    London Marylebone Road         PM2.5                       479        2,355       20.34%
    London Norbury Manor School    PM10                        936        5,258       17.80%
    London Norbury Manor School    PM2.5                       936        5,258       17.80%
    London Bexley                  PM10                      4,012       24,273       16.53%
    Southwark A2 Old Kent Road     PM10                        388        2,355       16.48%
    Haringey Roadside              NOx                       3,725       24,250       15.36%
    Haringey Roadside              NO2                       3,708       24,285       15.27%
    Haringey Roadside              NO                        3,708       24,287       15.27%
    London Westminster             PM2.5                     3,463       24,299       14.25%
    London Marylebone Road         SO2                       2,987       24,290       12.30%
    London Marylebone Road         CO                        2,729       24,293       11.23%
    Pollutant distribution saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/report/pollutant_distribution.csv

## 2) Spatial Coverage Analysis

 analysing spatial distribution patterns before accepting the dataset. I need to understand where defra stations are located, identify any geographic biases, and compare coverage to laqn.

### Purpose
- Create maps showing station locations across London.
- Analyse density by borough to identify coverage gaps
- Compare spatial distribution to laqn network
- Ensure no geographic areas are overrepresented or underrepresented

### Methodology
1. Load defra metadata with coordinates
2. Create interactive folium map showing all stations
3. Calculate station density by borough
4. Identify coverage gaps in london
5. Compare to laqn spatial distribution



sources: 
- https://python-visualization.github.io/folium/latest/getting_started.html
- https://pandas.pydata.org/docs/user_guide/groupby.html 
- plotting: https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geoseries
    - general: https://geopandas.org/en/stable/getting_started.html

coordinates:
  -  https://www.ordnancesurvey.co.uk/
  - identifiers: https://www.ordnancesurvey.co.uk/products/search-for-os-products?category=387aa470-8f46-4b02-a4ea-b70d1835f812 
  - WGS84 coordinate system used for latitude/longitude.
  - london coordinates : 51.5072° N, 0.1276° W
  - Latitude and longitude coordinates are: 51.509865, -0.118092.

In [None]:
def analyse_spatial_coverage(metadata_path):
    """
    analyse the stations location on map 
    
    function validates coordinates, identifies  locations, and visulise the spatial distribution
    
    Parameters:
        metadata_path : 
             std metadata csv file.
            
    Returns:
        dictionary containing spatial statistics and coordinate data.
    
        *i got help for this section, sources folium tuttorials, plotting for geopandas and google. Also asked for my friend help as well which
        she works on geospatial data a lot for her phd research.

    """
    
    spatial_stats = {}
    
    # read metadata for coordinate information
    print("\nloading station coordinates from metadata...")
    metadata = pd.read_csv(metadata_path, encoding="utf-8")
    
    # check if coordinate columns exist
    if 'latitude' not in metadata.columns or 'longitude' not in metadata.columns:
        print("  error: latitude or longitude columns not found in metadata")
        return spatial_stats
    
    # validate coordinate completeness
    total_stations = len(metadata)
    missing_lat = metadata['latitude'].isna().sum()
    missing_lon = metadata['longitude'].isna().sum()
    missing_coords = metadata[['latitude', 'longitude']].isna().any(axis=1).sum()
    
    spatial_stats['total_stations'] = total_stations
    spatial_stats['missing_coordinates'] = missing_coords
    spatial_stats['missing_latitude'] = missing_lat
    spatial_stats['missing_longitude'] = missing_lon
    spatial_stats['coordinate_completeness'] = ((total_stations - missing_coords) / total_stations * 100) if total_stations > 0 else 0
    
    print(f"  total stations in metadata: {total_stations}")
    print(f"  missing coordinates: {missing_coords} ({(missing_coords/total_stations*100):.2f}%)")
    print(f"  coordinate completeness: {spatial_stats['coordinate_completeness']:.2f}%")
    
   

In [None]:
def print_spatial_statistics(spatial_stats):
    """
    Print spatial coverage statistics 
    
    Param:
        spatial_stats : 
            Dic returned by analyse_spatial_coverage()
    """
    
    

## 3) uk air quality standards framework

The UK has established legally binding air quality objectives, I'm missing in my dataset, so first i need to parse the pdf file to csv and std format to my dataset.
- chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://uk-air.defra.gov.uk/assets/documents/Air_Quality_Objectives_Update.pdf

In [29]:
def parse_defra_aq_objectives(pdf_path, csv_output_path, metadata_path):
    """
    Parse defra air quality objectives pdf and export to csv

    
    Output columns:
        pollutant: pollutant name from pdf
        pollutant_std: standardised pollutant code from metadata
        limit: numeric limit value extracted from objective column
        unit: unit of measurement (µg/m³, mg/m³, etc)
        objective: full objective text from pdf
        concentration measured as: averaging period (24 hour mean, annual mean, etc)
        applies: jurisdiction (uk only in this case)
    
    """
    
    print("\n" + "="*40)
    print("parsing defra air quality objectives pdf")
    print("="*40)
    print(f"pdf path: {pdf_path}")
    print(f"metadata path: {metadata_path}")
    print(f"output path: {csv_output_path}")
    
    # load metadata for pollutant mapping
    print("\nloading metadata for pollutant standardisation...")
    meta = pd.read_csv(metadata_path)
    print(f"loaded {len(meta)} metadata records")
    
    # build pollutant mapping dictionary
    pollutant_map = {}
    
    # first try direct pollutant column
    if 'pollutant' in meta.columns and 'pollutant_std' in meta.columns:
        meta_clean = meta[['pollutant', 'pollutant_std']].dropna().drop_duplicates()
        for _, row in meta_clean.iterrows():
            key = str(row['pollutant']).strip().lower()
            val = str(row['pollutant_std']).strip()
            pollutant_map[key] = val
    
    # then try pollutant_available column
    if 'pollutant_available' in meta.columns and 'pollutant_std' in meta.columns:
        meta_avail = meta[['pollutant_available', 'pollutant_std']].dropna().copy()
        
        for _, row in meta_avail.iterrows():
            pollutants = str(row['pollutant_available']).split(',')
            std_code = str(row['pollutant_std']).strip()
            
            for poll in pollutants:
                key = poll.strip().lower()
                if key and key != 'nan':
                    pollutant_map[key] = std_code
    
    print(f"built pollutant mapping with {len(pollutant_map)} entries")
    
    # extract tables from pdf using pdfplumber
    print("\nextracting tables from pdf...")
    all_rows = []
    
    with pdfplumber.open(pdf_path) as pdf:
        print(f"pdf has {len(pdf.pages)} pages")
        
        for page_num, page in enumerate(pdf.pages, 1):
            tables = page.extract_tables()
            
            if tables:
                print(f"page {page_num}: found {len(tables)} table(s)")
                
                for table in tables:
                    for row in table:
                        all_rows.append(row)
    
    if not all_rows:
        print("\nerror: no tables found in pdf")
        return None
    
    print(f"total rows extracted: {len(all_rows)}")
    
    # convert to dataframe
    df_raw = pd.DataFrame(all_rows)
    
    print("\nprocessing extracted data...")
    
    # remove completely empty rows
    df_raw = df_raw.replace(r'^\s*$', pd.NA, regex=True)
    df_raw = df_raw.dropna(how='all').reset_index(drop=True)
    
    # find header row
    header_idx = None
    for i in range(min(len(df_raw), 20)):
        row_text = ' '.join([str(x).lower() for x in df_raw.iloc[i].tolist() if pd.notna(x)])
        
        if 'pollutant' in row_text and 'applies' in row_text and 'objective' in row_text:
            header_idx = i
            print(f"found header at row {i}")
            break
    
    if header_idx is None:
        print("error: could not find header row")
        return None
    
    # use detected header
    header_row = [str(x).strip() if pd.notna(x) else '' for x in df_raw.iloc[header_idx].tolist()]
    df_raw.columns = header_row
    df_raw = df_raw.iloc[header_idx + 1:].reset_index(drop=True)
    
    print(f"original columns: {df_raw.columns.tolist()}")
    
    # find and map the concentration column (may be split or abbreviated)
    col_map = {}
    concentration_col = None
    
    for i, col in enumerate(df_raw.columns):
        col_lower = str(col).lower().strip()
        
        if col_lower == 'pollutant':
            col_map[col] = 'pollutant'
        elif col_lower == 'applies':
            col_map[col] = 'applies'
        elif col_lower == 'objective':
            col_map[col] = 'objective'
        elif 'concentration' in col_lower or col_lower == 'measured as':
            # this is the concentration measured as column
            concentration_col = col
            col_map[col] = 'concentration_measured_as'
    
    # if concentration column not found by name, try by position
    # typically it's the 4th column (index 3)
    if concentration_col is None:
        if len(df_raw.columns) > 3:
            concentration_col = df_raw.columns[3]
            col_map[concentration_col] = 'concentration_measured_as'
            print(f"using column position 3 for concentration: {concentration_col}")
    
    df_raw = df_raw.rename(columns=col_map)
    
    print(f"mapped columns: {list(col_map.values())}")
    
    # check required columns exist
    required_cols = ['pollutant', 'applies', 'objective', 'concentration_measured_as']
    missing_cols = [col for col in required_cols if col not in df_raw.columns]
    
    if missing_cols:
        print(f"\nerror: missing required columns: {missing_cols}")
        print(f"mapped columns: {df_raw.columns.tolist()}")
        
        # if only concentration is missing, check if we can merge columns
        if missing_cols == ['concentration_measured_as']:
            print("\nattempting to find concentration column by content...")
            
            # look for columns containing time period keywords
            for col in df_raw.columns:
                if col not in ['pollutant', 'applies', 'objective']:
                    # check if column contains time period text
                    sample_text = ' '.join(df_raw[col].dropna().astype(str).head(10).tolist()).lower()
                    if any(word in sample_text for word in ['hour', 'mean', 'annual', 'day', 'running']):
                        print(f"found concentration column by content: {col}")
                        df_raw = df_raw.rename(columns={col: 'concentration_measured_as'})
                        break
        
        # check again after attempted fix
        missing_cols = [col for col in required_cols if col not in df_raw.columns]
        if missing_cols:
            print(f"still missing: {missing_cols}")
            return None
    
    # select only needed columns
    df = df_raw[required_cols].copy()
    
    # clean text in all columns
    for col in df.columns:
        df[col] = df[col].astype(str).str.replace(r'\s+', ' ', regex=True).str.strip()
    
    # remove rows with missing critical data
    df = df.replace(['nan', 'None', '<NA>', ''], pd.NA)
    
    print(f"rows after cleanup: {len(df)}")
    
    # forward fill pollutant names
    df['pollutant'] = df['pollutant'].fillna(method='ffill')
    
    print("\nfiltering for uk only limits...")
    # filter for uk only
    df_uk = df[df['applies'].str.strip().str.upper() == 'UK'].copy()
    print(f"uk rows found: {len(df_uk)}")
    
    if len(df_uk) == 0:
        print("\nerror: no uk rows found after filtering")
        print("sample applies values found:")
        print(df['applies'].value_counts().head(10))
        return None
    
    print("\nextracting limit values and units from objectives...")
    
    # extract numeric limit from objective
    df_uk['limit'] = df_uk['objective'].str.extract(r'([\d,]+(?:\.\d+)?)', expand=False)
    df_uk['limit'] = df_uk['limit'].str.replace(',', '', regex=False)
    df_uk['limit'] = pd.to_numeric(df_uk['limit'], errors='coerce')
    
    # extract unit from objective
    df_uk['unit'] = df_uk['objective'].str.extract(r'[\d,]+(?:\.\d+)?\s*([^\s]+)', expand=False)
    
    # clean up unit extraction
    df_uk['unit'] = df_uk['unit'].str.extract(r'^([µμmng]+/m[²³3])', expand=False)
    
    # fallback for missing units
    mask_missing_unit = df_uk['unit'].isna()
    df_uk.loc[mask_missing_unit, 'unit'] = df_uk.loc[mask_missing_unit, 'objective'].str.extract(
        r'(µg/m³|μg/m³|mg/m³|ng/m³|ug/m3)', 
        expand=False
    )
    
    print(f"extracted limits for {df_uk['limit'].notna().sum()} rows")
    print(f"extracted units for {df_uk['unit'].notna().sum()} rows")
    
    # map pollutant names to standardised codes
    print("\nmapping pollutants to standardised codes...")
    
    df_uk['pollutant_std'] = df_uk['pollutant'].str.strip().str.lower().map(pollutant_map)
    
    # manual mappings for common pdf pollutant names
    manual_map = {
        'particles (pm10)': 'PM10',
        'particles (pm2.5)': 'PM2.5',
        'particles (pm2.5) exposure reduction': 'PM2.5',
        'pm10': 'PM10',
        'pm2.5': 'PM2.5',
        'nitrogen dioxide': 'NO2',
        'ozone': 'O3',
        'sulphur dioxide': 'SO2',
        'carbon monoxide': 'CO',
        'benzene': 'BENZENE',
        'lead': 'LEAD',
        '1,3-butadiene': 'BUTADIENE',
        'nitrogen oxides': 'NOX',
        'polycyclic aromatic hydrocarbons': 'PAH'
    }
    
    # apply manual mappings where metadata mapping failed
    mask_no_std = df_uk['pollutant_std'].isna()
    df_uk.loc[mask_no_std, 'pollutant_std'] = df_uk.loc[mask_no_std, 'pollutant'].str.strip().str.lower().map(manual_map)
    
    print(f"mapped {df_uk['pollutant_std'].notna().sum()} pollutants to standardised codes")
    
    # show pollutants that could not be mapped
    unmapped = df_uk[df_uk['pollutant_std'].isna()]
    if len(unmapped) > 0:
        print(f"\nwarning: {len(unmapped)} pollutants could not be mapped:")
        for poll in unmapped['pollutant'].unique():
            print(f"  {poll}")
    
    # rename column to match requirements
    df_uk = df_uk.rename(columns={'concentration_measured_as': 'concentration measured as'})
    
    # select final columns in specified order
    final_cols = [
        'pollutant',
        'pollutant_std', 
        'limit',
        'unit',
        'objective',
        'concentration measured as',
        'applies'
    ]
    
    df_final = df_uk[final_cols].copy()
    
    # warn about missing limits
    missing_limits = df_final['limit'].isna().sum()
    if missing_limits > 0:
        print(f"\nwarning: {missing_limits} rows have no numeric limit extracted")
    
    print(f"\nfinal dataset: {len(df_final)} rows")
    
    # show summary by pollutant
    print("\nsummary by pollutant:")
    print("-" * 40)
    summary = df_final.groupby('pollutant', dropna=False).agg({
        'limit': 'count',
        'pollutant_std': lambda x: x.mode()[0] if len(x.mode()) > 0 else None
    }).rename(columns={'limit': 'num_limits', 'pollutant_std': 'std_code'})
    
    for idx, row in summary.iterrows():
        std_code = row['std_code'] if pd.notna(row['std_code']) else 'unmapped'
        print(f"{idx}: {row['num_limits']} limit(s) [{std_code}]")
    
    # save to csv
    print(f"\nsaving to csv: {csv_output_path}")
    csv_output_path.parent.mkdir(parents=True, exist_ok=True)
    df_final.to_csv(csv_output_path, index=False, encoding='utf-8')
    
    print("done")
    print("="*40)
    
    return df_final

In [30]:
# run the parsing function
print("starting pdf parsing...")

result_df = parse_defra_aq_objectives(
    pdf_path=pdf_path,
    csv_output_path=csv_output_path,
    metadata_path=metadata_path
)

if result_df is not None:
    print("\n" + "="*40)
    print("preview of parsed data")
    print("="*40)
    print(result_df.head(20).to_string(index=False))
    
    print("\n" + "="*40)
    print("checking output file")
    print("="*40)
    
    if csv_output_path.exists():
        print(f"file created: {csv_output_path}")
        print(f"file size: {csv_output_path.stat().st_size / 1024:.2f} kb")
        
        verify_df = pd.read_csv(csv_output_path)
        print(f"csv readable: {len(verify_df)} rows")
        print(f"columns: {verify_df.columns.tolist()}")
        
        print("\nall pollutants found:")
        print("-" * 40)
        for poll in verify_df['pollutant'].unique():
            count = len(verify_df[verify_df['pollutant'] == poll])
            print(f"{poll}: {count} limit(s)")
    else:
        print("file was not created")
else:
    print("\nparsing failed, check error messages above")

starting pdf parsing...

parsing defra air quality objectives pdf
pdf path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/capabilities/Air_Quality_Objectives_Update.pdf
metadata path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/test/std_london_sites_pollutant.csv
output path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/capabilities/uk_pollutant_limits.csv

loading metadata for pollutant standardisation...
loaded 144 metadata records
built pollutant mapping with 185 entries

extracting tables from pdf...
pdf has 4 pages
page 1: found 1 table(s)
page 2: found 1 table(s)
page 3: found 1 table(s)
page 4: found 1 table(s)
total rows extracted: 120

processing extracted data...
found header at row 1
original columns: ['Pollutant', 'Applies', 'Objective', 'Concentration', '', 'Date to be', '', 'European Obligations', '', 'Date to be', '']
mapped columns: ['pollutant', 'a

starting pdf parsing...

parsing defra air quality objectives pdf
pdf path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/capabilities/Air_Quality_Objectives_Update.pdf
metadata path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/test/std_london_sites_pollutant.csv
output path: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/capabilities/uk_pollutant_limits.csv

loading metadata for pollutant standardisation...
loaded 144 metadata records
built pollutant mapping with 185 entries

extracting tables from pdf...
pdf has 4 pages
page 1: found 1 table(s)
page 2: found 1 table(s)
page 3: found 1 table(s)
page 4: found 1 table(s)
total rows extracted: 120

processing extracted data...
found header at row 1
original columns: ['Pollutant', 'Applies', 'Objective', 'Concentration', '', 'Date to be', '', 'European Obligations', '', 'Date to be', '']
mapped columns: ['pollutant', 'a

  df['pollutant'] = df['pollutant'].fillna(method='ffill')


## 4) Data Quality validations:


A critical gap from the laqn report by applying formal statistical tests to validate data quality patterns. While descriptive statistics show 0% (before I notice the flags of the dataset) issue rate, I need statistical evidence that this pattern is real and not due to chance.


#### Purpuse:
 Checking data qualities if it is in the limits of eea, and make sence for general logic.
- Outlier detection in pollutant measurements.
- Data validity ranges based on WHO/EEA standards.
- Measurement consistency across time periods.
- Quality flags and suspicious patterns.

### methodology
 applies environmental data quality assessment standards:
1. Load aggregated measurement data from all csv files.
2. Calculate statistical distributions for each pollutant type.
3. Identify outliers using IQR method and domain knowledge.
4. Check values against established valid ranges.
5. Flag suspicious patterns constant values, extreme spikes.
6. Calculate quality scores for each station-pollutant combination.

#### air quality measurement standards

- Uk air quality objectives, limits and policy.
- https://uk-air.defra.gov.uk/air-pollution/uk-limits
- chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://uk-air.defra.gov.uk/assets/documents/Air_Quality_Objectives_Update_20230403.pdf

- DEFRA. (2023). *Air Pollution in the UK 2022*.
  - Source: https://uk-air.defra.gov.uk/library/annualreport/
  - Air Quality Objectives and limit values
  - Compliance assessment methodology

- UK Air Information Resource. (2024). *Air Pollution: UK Limits*.
  - Source: https://uk-air.defra.gov.uk/air-pollution/uk-limits
  - Current UK air quality objectives
  - Legal limit values and target dates
  - Measurement unit specifications (µg/m³)

  -  for the rest of the pollutants


- uk voc policy:
  - https://assets.publishing.service.gov.uk/media/5d7a2912ed915d522e4164a5/VO__statement_Final_12092019_CS__1_.pdf



- uk_pollutant_limit.css uk policy base logicl flaw:
    - data I fetched hourly measurements.
    - UK limits: different averaging periods annual mean, 24-hour mean, 8-hour mean...
    - I need to iterate my raw data according to uk_limit csv file format.



#### checks negative values.

In [191]:
def find_negative_values(base_dir, dry_run=True):
    results = []
    base_dir = Path(base_dir)
    for csv_file in base_dir.rglob("*.csv"):
        try:
            df = pd.read_csv(csv_file)
            if 'value' in df.columns:
                neg_rows = df[df['value'] < 0]
                neg_count = len(neg_rows)
                if neg_count > 0:
                    for idx, row in neg_rows.iterrows():
                        result = {
                            "path": str(csv_file.parent),
                            "file_name": csv_file.name,
                            "neg_value": row['value'],
                            "neg_count": neg_count,
                            "pollutant_std": row.get('pollutant_std', 'N/A')
                        }
                        print(f"{result['path']}/{result['file_name']}/ {result['neg_value']}/ {result['neg_count']}/ {result['pollutant_std']}")
                        results.append(result)
        except Exception as e:
            print(f"Error reading {csv_file}: {e}")
    neg_df = pd.DataFrame(results)
    if not dry_run and not neg_df.empty:
        output_path = Path.home() / "Desktop" / "data science projects" / "air-pollution-levels" / "data" / "defra" / "report" / "neg_values.csv"
        neg_df.to_csv(output_path, index=False)
        print(f"\nSaved negative values summary to {output_path}")
    return neg_df

In [193]:
neg_df = find_negative_values(base_dir, dry_run=False) 

/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Westminster/PM2.5__2023_01.csv/ -1.6/ 3/ PM2.5
/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Westminster/PM2.5__2023_01.csv/ -0.6/ 3/ PM2.5
/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Westminster/PM2.5__2023_01.csv/ -0.6/ 3/ PM2.5
/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Westminster/NO__2023_03.csv/ -0.042/ 2/ NO
/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Westminster/NO__2023_03.csv/ -0.094/ 2/ NO
/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Harlington/NO2__2023_01.csv/ -0.065/ 1/ NO2
/Users/bu

I have 631 negative value, logs added report/neg_values.csv
- I will be replacing all negative values NaN.

In [194]:
def replace_negatives_with_nan(base_dir):
    base_dir = Path(base_dir)
    for csv_file in base_dir.rglob("*.csv"):
        try:
            df = pd.read_csv(csv_file)
            if 'value' in df.columns:
                negative_values = df[df['value'] < 0]
                if not negative_values.empty:
                    df.loc[df['value'] < 0, 'value'] = np.nan
                    df.to_csv(csv_file, index=False)
                    print(f"Replaced {len(negative_values)} negative values with NaN in {csv_file.name}")
                    if 'station_name' in negative_values.columns:
                        print(f"Affected stations: {negative_values['station_name'].nunique()}")
                    if 'pollutant_name' in negative_values.columns:
                        print(f"Affected pollutants: {negative_values['pollutant_name'].nunique()}")
        except Exception as e:
            print(f"Error processing {csv_file}: {e}")

In [195]:
replace_negatives_with_nan(base_dir)

Replaced 3 negative values with NaN in PM2.5__2023_01.csv
Affected stations: 1
Affected pollutants: 1
Replaced 2 negative values with NaN in NO__2023_03.csv
Affected stations: 1
Affected pollutants: 1
Replaced 1 negative values with NaN in NO2__2023_01.csv
Affected stations: 1
Affected pollutants: 1
Replaced 6 negative values with NaN in NO2__2023_03.csv
Affected stations: 1
Affected pollutants: 1
Replaced 4 negative values with NaN in NO2__2023_12.csv
Affected stations: 1
Affected pollutants: 1
Replaced 35 negative values with NaN in NO__2023_10.csv
Affected stations: 1
Affected pollutants: 1
Replaced 9 negative values with NaN in NO__2023_11.csv
Affected stations: 1
Affected pollutants: 1
Replaced 24 negative values with NaN in O3__2023_01.csv
Affected stations: 1
Affected pollutants: 1
Replaced 3 negative values with NaN in NO2__2025_11.csv
Affected stations: 1
Affected pollutants: 1
Replaced 10 negative values with NaN in NO2__2025_10.csv
Affected stations: 1
Affected pollutants: 1

In [149]:
def calculate_quality_metrics(base_dir, csv_output_path):
    """
    Checks if measurements are realistic using uk_pollutant_limit.csv 3rd section parsed from uk air pollution policy pdf.
    function validates all measurements against official uk air quality objectives
    
    - Loads all measurement files.
    - Reads uk legal limits from parsed pdf csv.
    - For each pollutant, checks if values exceed uk limits.
    - Finds negative values.
    - Finds extreme values probably sensor err.
    - Calculates uk stD.
    
    Parameters:
        base_dir : 
        uk_limits_path : uk_pollutant_limits.csv from parsed pdf
            
    """
    if not Path(csv_output_path).exists():
        print(f"error: uk limits file not found at {csv_output_path}")
        return {}

    # load uk legal limits from parsed pdf
    uk_limits = pd.read_csv(csv_output_path, encoding="utf-8")
    
    # Create uk limits lookup dict structure = {pollutant_std: {limit_type: limit_value}}
    uk_limits_dict = {}
    
    for _, row in uk_limits.iterrows():
        uk_limits.columns = [col.strip().replace(' ', '_') for col in uk_limits.columns]
        poll_std = row['pollutant_std']
        limit_val = uk_limits['limit']
        conc_type = str(row['concentration measured as']).lower().strip()
        

        unit = row['unit']
        
        if pd.notna(poll_std) and pd.notna(limit_val):
            if poll_std not in uk_limits_dict:
                uk_limits_dict[poll_std] = []
            
        if pd.notna(poll_std) and pd.notna(limit_val):
            if poll_std not in uk_limits_dict:
                uk_limits_dict[poll_std] = []

        #  averaging period detection
        avg_period = 'unknown'
        if 'annual' in conc_type and 'running' in conc_type:
            avg_period = 'running_annual'
        elif 'running annual' in conc_type:
            avg_period = 'running_annual'
        elif 'annual' in conc_type:
            avg_period = 'annual'
        elif '24 hour' in conc_type or '24-hour' in conc_type:
            avg_period = '24hour'
        elif '8 hour' in conc_type or '8-hour' in conc_type:
            avg_period = '8hour'
        elif '1 hour' in conc_type or '1-hour' in conc_type or 'hour mean' in conc_type:
            avg_period = '1hour'
        elif 'maximum daily' in conc_type:
            avg_period = 'daily_max'

        uk_limits_dict[poll_std].append({
            'limit': float(limit_val),
            'type': conc_type,
            'unit': unit,
            'avg_period': avg_period
        })
    
    print(f"\nUK limits loaded for {len(uk_limits_dict)} pollutants:")
    for poll, limits in uk_limits_dict.items():
        period_info = ', '.join([f"{lim['avg_period']}: {lim['limit']}" for lim in limits])
        print(f"  {poll}: {period_info}")
    
    # load all measurement data with timestamp
    print("\nLoading measurement data.")
    all_data = []
    
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        if year_dir.exists():
            for csv_file in year_dir.rglob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    if not df.empty and 'timestamp' in df.columns:
                        all_data.append(df)
                except Exception as e:
                    pass
    
    if not all_data:
        print("err no measurement data found")
        return {}
    
    df_all = pd.concat(all_data, ignore_index=True)
    print(f"loaded {len(df_all):,} total records")
    
    # filter valid values and parse timestamp
    df_valid = df_all[df_all['value'].notna()].copy()
    df_valid['value'] = pd.to_numeric(df_valid['value'], errors='coerce')
    df_valid = df_valid[df_valid['value'].notna()]
    
    # parse timestamp to datetime
    df_valid['timestamp'] = pd.to_datetime(df_valid['timestamp'], errors='coerce')
    df_valid = df_valid[df_valid['timestamp'].notna()]
    
    print(f"Analysing {len(df_valid):,} valid measurements with timestamps")
    
    # calculate quality metrics for each pollutant
    print("\nProcessing quality metrics by pollutant...")
    quality_results = {}
    
    for pollutant in df_valid['pollutant_std'].unique():
        if pd.isna(pollutant):
            continue
        
        print(f"\nprocessing {pollutant}...")
        
        poll_data = df_valid[df_valid['pollutant_std'] == pollutant].copy()
        
        if len(poll_data) == 0:
            continue
        
        # basic statistics on raw hourly data
        q_metrics = {
            'pollutant': pollutant,
            'count': int(len(poll_data)),
            'mean_hourly': float(poll_data['value'].mean()),
            'median_hourly': float(poll_data['value'].median()),
            'std_hourly': float(poll_data['value'].std()),
            'min': float(poll_data['value'].min()),
            'max': float(poll_data['value'].max()),
            'p95': float(poll_data['value'].quantile(0.95)),
            'p99': float(poll_data['value'].quantile(0.99))
        }
        
        # check for suspicious values
        negative_count = (poll_data['value'] < 0).sum()
        zero_count = (poll_data['value'] == 0).sum()
        
        q_metrics['negative_values'] = int(negative_count)
        q_metrics['negative_pct'] = float((negative_count / len(poll_data) * 100))
        q_metrics['zero_values'] = int(zero_count)
        q_metrics['zero_pct'] = float((zero_count / len(poll_data) * 100))
        
        # now check against uk limits with proper averaging
        if pollutant in uk_limits_dict:
            uk_poll_limits = uk_limits_dict[pollutant]
            
            for limit_info in uk_poll_limits:
                avg_period = limit_info['avg_period']
                limit_value = uk_limits['limit']
                
                if avg_period == 'annual':
                    # calculate annual mean
                    poll_data['year'] = poll_data['timestamp'].dt.year
                    annual_means = poll_data.groupby('year')['value'].mean()
                    
                    q_metrics['uk_annual_limit'] = limit_value
                    q_metrics['mean_annual'] = float(annual_means.mean())
                    q_metrics['exceeds_uk_annual'] = q_metrics['mean_annual'] > limit_value
                    
                    print(f"  annual mean: {q_metrics['mean_annual']:.2f} vs limit {limit_value}")
                
                elif avg_period == '24hour':
                    # calculate daily means
                    poll_data['date'] = poll_data['timestamp'].dt.date
                    daily_means = poll_data.groupby('date')['value'].mean()
                    
                    exceedances = (daily_means > limit_value).sum()
                    
                    q_metrics['uk_24hour_limit'] = limit_value
                    q_metrics['daily_exceedances'] = int(exceedances)
                    q_metrics['daily_exceedances_pct'] = float((exceedances / len(daily_means) * 100))
                    
                    print(f"  24-hour: {exceedances} days exceed {limit_value}")
                
                elif avg_period == '8hour':
                    # calculate 8-hour rolling mean
                    poll_data_sorted = poll_data.sort_values('timestamp')
                    poll_data_sorted['rolling_8h'] = poll_data_sorted['value'].rolling(window=8, min_periods=6).mean()
                    
                    exceedances = (poll_data_sorted['rolling_8h'] > limit_value).sum()
                    
                    q_metrics['uk_8hour_limit'] = limit_value
                    q_metrics['8hour_exceedances'] = int(exceedances)
                    q_metrics['8hour_exceedances_pct'] = float((exceedances / len(poll_data_sorted) * 100))
                    
                    print(f"  8-hour: {exceedances} periods exceed {limit_value}")
                
                elif avg_period == '1hour':
                    # compare hourly values directly
                    exceedances = (poll_data['value'] > limit_value).sum()
                    
                    q_metrics['uk_1hour_limit'] = limit_value
                    q_metrics['hourly_exceedances'] = int(exceedances)
                    q_metrics['hourly_exceedances_pct'] = float((exceedances / len(poll_data) * 100))
                    
                    print(f"  1-hour: {exceedances} hours exceed {limit_value}")
                
                elif avg_period == 'running_annual':
                    # running annual mean (365-day rolling average)
                    poll_data_sorted = poll_data.sort_values('timestamp')
                    poll_data_sorted['rolling_annual'] = poll_data_sorted['value'].rolling(window=24*365, min_periods=24*300).mean()
                    
                    q_metrics['uk_running_annual_limit'] = limit_value
                    q_metrics['mean_running_annual'] = float(poll_data_sorted['rolling_annual'].mean())
                    q_metrics['exceeds_running_annual'] = q_metrics['mean_running_annual'] > limit_value
                    
                    print(f"  running annual: {q_metrics['mean_running_annual']:.2f} vs limit {limit_value}")
                
                elif avg_period == 'daily_max':
                    # maximum daily 8-hour running mean
                    poll_data_sorted = poll_data.sort_values('timestamp')
                    poll_data_sorted['date'] = poll_data_sorted['timestamp'].dt.date
                    poll_data_sorted['rolling_8h'] = poll_data_sorted['value'].rolling(window=8, min_periods=6).mean()
                    
                    daily_max = poll_data_sorted.groupby('date')['rolling_8h'].max()
                    exceedances = (daily_max > limit_value).sum()
                    
                    q_metrics['uk_daily_max_limit'] = limit_value
                    q_metrics['daily_max_exceedances'] = int(exceedances)
                    
                    print(f"  daily max 8h: {exceedances} days exceed {limit_value}")
            
            # overall assessment: use most restrictive limit for out of range check
            all_limits = [lim['limit'] for lim in uk_poll_limits]
            max_limit = max(all_limits)
            
            # define extreme threshold as 10x highest uk limit
            extreme_threshold = max_limit * 10
            out_of_range = (poll_data['value'] > extreme_threshold).sum()
            
            q_metrics['extreme_threshold'] = extreme_threshold
            q_metrics['out_of_range'] = int(out_of_range)
            q_metrics['out_of_range_pct'] = float((out_of_range / len(poll_data) * 100))
            
        else:
            # no uk limit defined for this pollutant
            print(f"  no uk limits defined")
            q_metrics['uk_annual_limit'] = None
            q_metrics['exceeds_uk_annual'] = False
            q_metrics['out_of_range'] = 0
            q_metrics['out_of_range_pct'] = 0.0
        
        quality_results[pollutant] = q_metrics
    
    return quality_results

In [151]:

def print_quality_metrics(quality_results):
    """
    Print comprehensive quality metrics report with uk compliance.
    
    Parameters:
        quality_metrics : dict
            Dictionary returned by calculate_quality_metrics_uk_limits
    """
    
    print("\n" + "="*40)
    print("Quality metrics report")
    print("="*40)
    
 
    for poll, metrics in quality_results.items():
        print(f"\n{poll}:")
        print(f"  total measurements: {metrics['count']:,}")
        print(f"  hourly mean: {metrics['mean_hourly']:.2f}")
        
        if 'mean_annual' in metrics:
            print(f"  annual mean: {metrics['mean_annual']:.2f} (limit: {metrics['uk_annual_limit']})")
            status = "exceeds" if metrics['exceeds_uk_annual'] else "compliant"
            print(f"    status: {status}")
        
        if 'daily_exceedances' in metrics:
            print(f"  24-hour exceedances: {metrics['daily_exceedances']} days")
        
        if 'hourly_exceedances' in metrics:
            print(f"  1-hour exceedances: {metrics['hourly_exceedances']} hours")
        
        if metrics['negative_values'] > 0:
            print(f"  warning: {metrics['negative_values']} negative values")
        
        if metrics['out_of_range'] > 0:
            print(f"  warning: {metrics['out_of_range']} extreme values")
    
    print("="*40)
    
    return quality_results

In [152]:
# run quality metrics with proper averaging periods
print("starting quality metrics calculation...")

# Calculate quality metrics
quality_results = calculate_quality_metrics(base_dir, csv_output_path)

print_quality_metrics =(quality_results)


if quality_results:
    # # save comprehensive report
    # print("\nsaving quality metrics report...")
    

    
    quality_rows = []
    for poll, metrics in quality_results.items():
        row = {
            'pollutant': metrics['pollutant'],
            'total_measurements': metrics['count'],
            'mean_hourly': f"{metrics['mean_hourly']:.2f}",
            'min': f"{metrics['min']:.2f}",
            'max': f"{metrics['max']:.2f}",
            'p95': f"{metrics['p95']:.2f}",
            'negative_values': metrics['negative_values'],
            'zero_values': metrics['zero_values'],
            'out_of_range': metrics['out_of_range']
        }
        
        # add uk limit compliance fields
        if 'uk_annual_limit' in metrics and metrics['uk_annual_limit']:
            row['uk_annual_limit'] = metrics['uk_annual_limit']
            row['mean_annual'] = f"{metrics['mean_annual']:.2f}" if 'mean_annual' in metrics else 'n/a'
            row['exceeds_annual'] = 'yes' if metrics.get('exceeds_uk_annual', False) else 'no'
        
        if 'daily_exceedances' in metrics:
            row['uk_24hour_limit'] = metrics['uk_24hour_limit']
            row['daily_exceedances'] = metrics['daily_exceedances']
        
        if 'hourly_exceedances' in metrics:
            row['uk_1hour_limit'] = metrics['uk_1hour_limit']
            row['hourly_exceedances'] = metrics['hourly_exceedances']
        
        quality_rows.append(row)
    
    pd.DataFrame(quality_rows).to_csv(quality_output, index=False)
    print(f"saved to: {quality_output}")
    print("done")
else:
    print("quality metrics calculation failed")

starting quality metrics calculation...


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

## 5) Chi-Square test
Uses statistical tests to mathematically prove that defra data collection process was consistent and reliable across time. 
It as a quality control check that ensures didn't accidentally collect more data in some months than others, which could bias defra analysis.

#### Why Chi-square test?
 - The chi-square test answers one simple question: Are my monthly file counts similar enough to trust, or are some months suspiciously different? And environmental dataset chi square test most common why, according to google.

- Air pollution varies by season
- Policy decisions need unbiased evidence
- Academic reviewers will question imbalanced datasets

### What Chi-Square Test Does

The chi-square test answers one simple question: Are my monthly file counts similar enough to trust, or are some months suspiciously different?

#### How It Works

1. What  observe: Count how many data files  have for each month.
2. What  expect: If data collection was perfect, each month should have roughly the same count.
3. The test: Measures how far observed counts are from the expected counts.
4. The result: Gives  a p-value that tells  if the differences are just random variation or a real problem.


### P-Value Meaning

The p-value tells the probability that  observed pattern happened by random chance:

| P-Value | Interpretation | What It Means for DEFRA Data |
|---------|---------------|----------------------------|
| p greater than or equal to 0.05 | Accept null hypothesis | Data is evenly distributed. Small differences between months are just normal variation.  data collection was consistent. |
| p less than 0.05 | Reject null hypothesis | Data is unevenly distributed. Some months have significantly more or less data than others.  should investigate why. |



### 1. Methodological Rigor
 data collection needs to be reliable

- Mathematical evidence not just visual inspection
- A standardized statistical measure p-value
- Reproducible results 


## Output

The test produces:

   - Test name: "Chi-square uniformity"
   - Chi-square statistic (χ²)
   - P-value
   - Interpretation (evenly/unevenly distributed)

2. Console output showing:
   - Null hypothesis statement
   - Test statistic value
   - P-value

---

## Expected Results for  Dataset

Based on  data collection using the DEFRA API:

- Expected p-value: greater than 0.05 (likely around 0.3-0.7)
- Why:  API calls were automated and systematic
- What this proves: Each month has 249 station-pollutant files (one per combination)

### If Get p less than 0.05

This would suggest:
1. Some months might have missing API data
2. New monitoring stations came online mid-year
3. Some stations stopped reporting in certain months


### Results Data Quality Section

Include the statistical test results as evidence that dataset is:
- Temporally balanced
- Methodologically sound
- Suitable for seasonal analysis


| Aspect | Details |
|--------|---------|
| Test Used | Chi-square test for uniformity |
| What It Tests | Whether monthly file counts are evenly distributed |
| Null Hypothesis | Data is uniformly distributed across months |
| Alternative Hypothesis | Data shows significant monthly imbalance |
| Acceptance Criterion | p-value greater than or equal to 0.05 |
| What p greater than or equal to 0.05 Means | Data collection was consistent and reliable |
| What p less than 0.05 Means | Some months have significantly different data volumes |
| Why This Matters | Proves  dataset is methodologically sound for thesis |

---


### If p greater than or equal to 0.05 (Expected Result)

1. Document result in thesis methodology
2. Include p-value in data quality section
3. Proceed with confidence to seasonal analysis

### If p less than 0.05 (Unexpected Result)

1. Review monthly counts to identify outliers
2. Check API logs for that month
3. Document known issues (e.g., "Station X offline in April 2024")
4. Consider excluding problematic months OR
5. Use weighted analysis to account for imbalance

---

In [82]:

def chi_square_tests(base_dir):
    """
    Run statistical tests to prove data collection was consistent.
    
    - Chi-square test: Checks if  similar amounts of data for each month
    - If p-value < 0.05 Data isn't evenly spreats problem!
    - If p-value > 0.05 Data is evenly spreats good!
    
    Parameters:
        base_dir : 
            
    """
    
    # Count files per month, 2025 only 19th of nov.
    yearly_data = {'2023': 0, '2024': 0, '2025': 0}
    year_months = {'2023': 12, '2024': 12, '2025': 11}
    
    for year in ['2023', '2024', '2025']:
        year_dir = Path(base_dir) / f'{year}measurements'
        if not year_dir.exists():
            continue
        pattern = f'*__{year}_*.csv'
        files = list(year_dir.rglob(pattern))
        yearly_data[year] = len(files)
    
    # prep for chi-square
    year_counts = [yearly_data[y] for y in ['2023', '2024', '2025']]
    total_files = sum(year_counts)
    total_months = 35  # 12 + 12 + 11
    
    expected_counts = [
        total_files * (year_months[year] / total_months)
        for year in ['2023', '2024', '2025']
    ]
    
    # run test
    chi2, p_value = stats.chisquare(
        f_obs=year_counts, 
        f_exp=expected_counts
    )
    for year, count, expected in zip(['2023', '2024', '2025'], 
                                      year_counts, expected_counts):
        print(f"  {year}: {count:5d} files (expected: {expected:7.1f})")
    
    print()
    print(f"Chi-square statistic: {chi2:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print(f"Result reject null hypothesis p < 0.05")
        print(f"Interpretation: Years NOT evenly distributed")
    else:
        print(f"Result: accept null hypothesis p >= 0.05")
        print(f"Interpretation: Years evenly distributed")
    
    return {
        'test': 'Chi-square year-wise',
        'chi2_statistic': chi2,
        'p_value': p_value,
        'year_counts': year_counts,
        'expected_counts': expected_counts
    }

In [83]:
# Run tests
test_results = chi_square_tests(base_dir)

# Save results

pd.DataFrame([{
    'test_name': test_results['test'],
    'statistic': f"{test_results['chi2_statistic']:.4f}",
    'p_value': f"{test_results['p_value']:.4f}",
    'interpretation': ('Evenly distributed' 
                      if test_results['p_value'] >= 0.05 
                      else 'Unevenly distributed')
}]).to_csv(chi_square_output, index=False)

print(f"\nStatistical test results saved to: {chi_square_output}")

  2023:  1431 files (expected:  1221.6)
  2024:  1193 files (expected:  1221.6)
  2025:   939 files (expected:  1119.8)

Chi-square statistic: 65.7553
P-value: 0.0000
Result reject null hypothesis p < 0.05
Interpretation: Years NOT evenly distributed

Statistical test results saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/report/chi_square_tests.csv


    2023:  1431 files (expected:  1221.6)
    2024:  1193 files (expected:  1221.6)
    2025:   939 files (expected:  1119.8)

    Chi-square statistic: 65.7553
    P-value: 0.0000
    Result reject null hypothesis p < 0.05
    Interpretation: Years NOT evenly distributed

    Statistical test results saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/report/chi_square_tests.csv

In [None]:
def analyse_year_difference(base_dir):
    """
    Find which stations/pollutants are missing in later years.
    """
    
    base_dir = Path(base_dir)
    
    # unique station-pollutant combinations per year
    year_files = {}
    
    for year in ['2023', '2024', '2025']:
        year_dir = base_dir / f'{year}measurements'
        if not year_dir.exists():
            continue
        
        files = list(year_dir.rglob(f'*__{year}_*.csv'))
        
        # Extract station/pollutant combinations
        combinations = set()
        for f in files:
            # File format station_pollutant_YY_MM
            parts = f.stem.split('__')
            if len(parts) >= 2:
                station = parts[0]
                pollutant = parts[1]
                combinations.add((station, pollutant))
        
        year_files[year] = combinations
    
    # Find whats in 23 but missing in 2024/2025
    lost_2024 = year_files['2023'] - year_files['2024']
    lost_2025 = year_files['2023'] - year_files['2025']
    
    print("\nStation-pollutant combinations lost over time:")
    print(f" 2023 total: {len(year_files['2023'])}")
    print(f"2024 total: {len(year_files['2024'])}")
    print(f"2025 total: {len(year_files['2025'])}")
    print()
    print(f"Lost in 2024 (vs 2023): {len(lost_2024)}")
    print(f"Lost in 2025 (vs 2023): {len(lost_2025)}")
    
    if lost_2024:
        print("\nExamples lost in 2024:")
        for station, pollutant in list(lost_2024)[:10]:
            print(f"    {station} - {pollutant}")
    
    if lost_2025:
        print("\nExamples lost in 2025:")
        for station, pollutant in list(lost_2025)[:10]:
            print(f"    {station} - {pollutant}")
    
    return {
        '2023_count': len(year_files['2023']),
        '2024_count': len(year_files['2024']),
        '2025_count': len(year_files['2025']),
        'lost_2024': lost_2024,
        'lost_2025': lost_2025
    }



In [85]:
# Run
analysis = analyse_year_difference(base_dir)


Station-pollutant combinations lost over time:
 2023 total: 444
2024 total: 444
2025 total: 370

Lost in 2024 (vs 2023): 444
Lost in 2025 (vs 2023): 444

Examples lost in 2024:
    NOx - 2023_09
    m,p-Xylene - 2023_02
    n-Heptane - 2023_03
    Isoprene - 2023_03
    i-Hexane - 2023_02
    O3 - 2023_08
    1,2,4-TMB - 2023_10
    CO - 2023_11
    SO2 - 2023_09
    Propene - 2023_10

Examples lost in 2025:
    NOx - 2023_09
    m,p-Xylene - 2023_02
    n-Heptane - 2023_03
    Isoprene - 2023_03
    i-Hexane - 2023_02
    O3 - 2023_08
    1,2,4-TMB - 2023_10
    CO - 2023_11
    SO2 - 2023_09
    Propene - 2023_10


    Station-pollutant combinations lost over time:
    2023 total: 444
    2024 total: 444
    2025 total: 370

    Lost in 2024 (vs 2023): 444
    Lost in 2025 (vs 2023): 444

    Examples lost in 2024:
        NOx - 2023_09
        m,p-Xylene - 2023_02
        n-Heptane - 2023_03
        Isoprene - 2023_03
        i-Hexane - 2023_02
        O3 - 2023_08
        1,2,4-TMB - 2023_10
        CO - 2023_11
        SO2 - 2023_09
        Propene - 2023_10

    Examples lost in 2025:
        NOx - 2023_09
        m,p-Xylene - 2023_02
        n-Heptane - 2023_03
        Isoprene - 2023_03
        i-Hexane - 2023_02
        O3 - 2023_08
        1,2,4-TMB - 2023_10
        CO - 2023_11
        SO2 - 2023_09
        Propene - 2023_10

#### adding monthly 2025 data completeness

In [86]:
def months_25 (base_dir):
    """
    See which 2025 months have data.
    """
    
    base_dir = Path(base_dir)
    year_dir = base_dir / '2025measurements'
    
    monthly_counts = {}
    
    for month in range(1, 12):  # Jan/Nov
        pattern = f'*__2025_{month:02d}.csv'
        files = list(year_dir.rglob(pattern))
        monthly_counts[f'2025-{month:02d}'] = len(files)
    
    print("\n2025 monthly file counts:")
    for month, count in monthly_counts.items():
        print(f"  {month}: {count:4d} files")
    
    # Check if recent months have less data
    avg_early = sum(list(monthly_counts.values())[:3]) / 3
    avg_late = sum(list(monthly_counts.values())[-3:]) / 3
    
    print(f"\n  Early 2025 avg (Jan-Mar): {avg_early:.0f} files/month")
    print(f"  Late 2025 avg (Sep-Nov):  {avg_late:.0f} files/month")
    
    if avg_late < avg_early * 0.9:
        print("Recent months have noticeably less data")

months_25 (base_dir)


2025 monthly file counts:
  2025-01:   95 files
  2025-02:    0 files
  2025-03:   92 files
  2025-04:   94 files
  2025-05:   94 files
  2025-06:   94 files
  2025-07:   94 files
  2025-08:   94 files
  2025-09:   94 files
  2025-10:   94 files
  2025-11:   94 files

  Early 2025 avg (Jan-Mar): 62 files/month
  Late 2025 avg (Sep-Nov):  94 files/month


  2025 monthly file counts:
    2025-01:   95 files
    2025-02:    0 files
    2025-03:   92 files
    2025-04:   94 files
    2025-05:   94 files
    2025-06:   94 files
    2025-07:   94 files
    2025-08:   94 files
    2025-09:   94 files
    2025-10:   94 files
    2025-11:   94 files

    Early 2025 avg (Jan-Mar): 62 files/month
    Late 2025 avg (Sep-Nov):  94 files/month

## 6) seasonal trend analyse:
- analyses the cleaned DEFRA air quality data to identify:
- Temporal trends (yearly, monthly, daily patterns)
- Pollutant-specific characteristics.
- Geographic hotspots.
- Seasonal variations.
- Exceedances of UK legal limits.

### 1- overal pollutant trends:
 
- Yearly average comparison by pollutant
- Monthly trends across all years
- Statistical summ tables
Visualizations: bar charts & line plots

### Pollutant-Specific Analysis

- Time series plot (daily averages)
- Distribution histogram (mean, median)
- Hourly pattern (rush hour peaks)
- Monthly boxplots (seasonal variation)


### Seasonal Patterns

- Winter/Spring/Summer/Autumn definitions
- Seasonal averages per pollutant
- Comparison visualisation
- Geographic Distribution ( can add later) 
 
### UK Legal Limit Exceedances

- Compares actual measurements vs UK limits
- Calculates exceedance percentages
I- dentifies which pollutants/stations exceed limits most





##### Data Analysis and Results
##### Pollution patterns, trends, and insights from DEFRA dataset (2023-2025)

This section analyzes the cleaned DEFRA air quality data to identify:
- Temporal trends (yearly, monthly, daily patterns)
- Pollutant-specific characteristics
- Geographic hotspots
- Seasonal variations
- Exceedances of UK legal limits

Structure:
1) Overall pollution trends (2023-2025)
2) Pollutant-specific analysis
3) Temporal patterns (seasonal, weekly, daily)
4) Geographic distribution
5) Limit exceedances analysis
6) Key findings summary


    
    

#### 1) Load_pollution_data
     sample data for analyse. begin witho nlky anlyse 100 of the file, that's why the run function set up very bottom 100.
     once quick explaration done it will be change.

In [183]:

def load_pollution_data(base_dir, sample_size=None, pollutants=None, stations=None):
    """
    Load a sample of data for analysis.

    For full dataset analysis, set sample_size=None. For quick exploration, use sample_size=100

    Parameters:
        sample_size : int, optional
        pollutants : list, optional
        stations : list, optional

    Returns:
        Combined data with cols: timestamp, value, station_name, pollutant_name, year, month, day
    """
    base_dir = Path(base_dir)
    all_data = []
    file_count = 0

    for year in ['2023', '2024', '2025']:
        year_dir = base_dir / f'{year}measurements'
        if not year_dir.exists():
            print(f"  Skipping {year} - directory not found.")
            continue
        
        # Get all CSV files recursively
        all_files = list(year_dir.rglob('*.csv'))
        
        # Filter by pollutant if specified
        if pollutants:
            all_files = [f for f in all_files if any(p in f.stem for p in pollutants)]
        
        # Filter by station if specified
        if stations:
            all_files = [f for f in all_files if any(s in f.stem for s in stations)]
        
        # Apply sample size limit
        if sample_size and file_count >= sample_size:
            break
        
        files_to_load = all_files
        if sample_size:
            remaining = sample_size - file_count
            files_to_load = all_files[:remaining]
        
        # Load each file
        for filepath in files_to_load:
            try:
                df = pd.read_csv(filepath)
                
                # Check required columns exist
                if 'timestamp' not in df.columns or 'value' not in df.columns:
                    continue
                
                # Add metadata if missing
                if 'station_name' not in df.columns:
                    df['station_name'] = filepath.parent.name
                if 'pollutant_name' not in df.columns:
                    # Extract from filename pattern: station__pollutant__date.csv
                    parts = filepath.stem.split('__')
                    df['pollutant_name'] = parts[1] if len(parts) > 1 else 'Unknown'
                
                all_data.append(df)
                file_count += 1
                
            except Exception as e:
                print(f"  Warning: couldn't load {filepath.name}: {e}")
                continue
    
    if not all_data:
        print("\n  Error: no data loaded.")
        return pd.DataFrame()
    
    # Combine all data
    data = pd.concat(all_data, ignore_index=True)
    
    # Parse timestamp and extract time components
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data['year'] = data['timestamp'].dt.year
    data['month'] = data['timestamp'].dt.month
    data['day'] = data['timestamp'].dt.day
    data['hour'] = data['timestamp'].dt.hour
    data['dayofweek'] = data['timestamp'].dt.dayofweek
    data['week'] = data['timestamp'].dt.isocalendar().week
    
    # Add month name for plotting
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    data['month_name'] = data['month'].apply(lambda x: month_names[x-1])
    
    # Print summary
    print(f"\n  Loaded {len(data):,} measurements from {file_count} files.")
    print(f"  Date range: {data['timestamp'].min()} to {data['timestamp'].max()}")
    print(f"  Stations: {data['station_name'].nunique()}")
    print(f"  Pollutants: {data['pollutant_name'].nunique()}")
    print("\n  Pollutant breakdown:")
    for pollutant, count in data['pollutant_name'].value_counts().items():
        print(f"    {pollutant}: {count:,} measurements")
    
    return data

#### 2) function for analyse overall trends
- csv will be save as yearly_averages.csv

In [184]:

def analyse_overall_trends(data, report_dir):
    """
    Overall pollution trends across study period.
    
    Creates:
    - Yearly average comparison
    - Monthly trends
    - Overall statistics table
    """
    
    print("\n" + "="*40)
    print("Overall pollutant trends")
    print("="*40)
    
    if data is None or data.empty:
        print("  Error no data loaded")
        return None
    # Calculate yearly averages per pollutant
    yearly_avg = data.groupby(['year', 'pollutant_name'])['value'].agg([
        'mean', 'median', 'std', 'count'
    ]).round(2)
    
    print("\nYearly averages by pollutant:")
    print(yearly_avg)
    
    # Save to CSV yearly_averages.csv under report/detailed analysis

    os.makedirs(report_dir, exist_ok=True)
    yearly_avg.to_csv(report_dir / 'yearly_averages.csv')
    
    # Create visualisation with two subplots
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Plot 1: Yearly averages bar chart
    yearly_pivot = data.groupby(['year', 'pollutant_name'])['value'].mean().unstack()
    yearly_pivot.plot(kind='bar', ax=axes[0], width=0.8, alpha=0.8)
    axes[0].set_xlabel('Year', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Average concentration', fontsize=12, fontweight='bold')
    axes[0].set_title('Yearly average pollution levels by pollutant', 
                     fontsize=13, fontweight='bold')
    axes[0].legend(title='Pollutant', bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[0].grid(axis='y', alpha=0.3)
    axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)
    
    # Plot 2: Monthly trend across all years
    monthly_avg = data.groupby(['month_name', 'pollutant_name'])['value'].mean().unstack()
    month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    monthly_avg = monthly_avg.reindex(month_order)
    monthly_avg.plot(ax=axes[1], marker='o', linewidth=2)
    axes[1].set_xlabel('Month', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Average concentration', fontsize=12, fontweight='bold')
    axes[1].set_title('Monthly average pollution levels (all years combined)', 
                     fontsize=13, fontweight='bold')
    axes[1].legend(title='Pollutant', bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[1].grid(alpha=0.3)
    axes[1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.savefig(report_dir / 'overall_trends.png', dpi=300, bbox_inches='tight')
    print(f"\n  Visualisation saved: overall_trends.png")
    plt.close()
    
    return yearly_avg


#### 3) Analyse specific pollutant:
- here i i will do full analyse the pollutants  I fetched:
    1) all pollutants at optimased/year_measuraments/*
    2) the pollutantsa uk have limits - uk_pollutant_limits.csv
    3) and also pollutants common between defra/LAQN. NO2, PM10, PM2.5, CO, SO2, 03
- the png visulation will be saved as: (pollutant name)analysis.png 
- report/detailed_analysis path.

In [185]:
def analyse_pollutant_specific(data, pollutant, report_dir):
    """
    Deep dive into a specific pollutant.
    
    Parameters:
        pollutant: str - Pollutant name (e.g. 'NO2', 'PM2.5')
            
    Creates:
        - Time series plot
        - Distribution histogram
        - Hourly pattern
        - Statistics summary
    """
    print(f"\n" + "="*40)
    print(f"Pollutant specific analysis: {pollutant}")
    print("="*40)
    
    if data is None or data.empty:
        print("  Error: no data loaded.")
        return None
    
    # Filter for this pollutant
    poll_data = data[data['pollutant_name'] == pollutant].copy()
    
    if poll_data.empty:
        print(f"  Error: no data found for {pollutant}.")
        return None
    
    # Remove NaN values before analysis
    poll_data_clean = poll_data[poll_data['value'].notna()].copy()
    
    if poll_data_clean.empty:
        print(f"  Warning: all values are NaN for {pollutant}, skipping analysis.")
        return None
    
    print(f"\n  Total measurements: {len(poll_data):,}")
    print(f"  Valid (non-NaN) measurements: {len(poll_data_clean):,}")
    print(f"  Date range: {poll_data_clean['timestamp'].min()} to {poll_data_clean['timestamp'].max()}")
    print(f"  Stations: {poll_data_clean['station_name'].nunique()}")
    
    # Statistics
    stats = poll_data_clean['value'].describe()
    print(f"\n  Concentration statistics for {pollutant}:")
    print(f"    Mean: {stats['mean']:.2f}")
    print(f"    Median: {stats['50%']:.2f}")
    print(f"    Std dev: {stats['std']:.2f}")
    print(f"    Min: {stats['min']:.2f}")
    print(f"    Max: {stats['max']:.2f}")
    print(f"    25th percentile: {stats['25%']:.2f}")
    print(f"    75th percentile: {stats['75%']:.2f}")
    
    # Create 4 panel visualisation
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Panel 1: Time series (daily average)
    daily_avg = poll_data_clean.groupby(poll_data_clean['timestamp'].dt.date)['value'].mean()
    axes[0, 0].plot(daily_avg.index, daily_avg.values, linewidth=1, alpha=0.7)
    axes[0, 0].set_xlabel('Date', fontsize=11, fontweight='bold')
    axes[0, 0].set_ylabel(f'{pollutant} concentration', fontsize=11, fontweight='bold')
    axes[0, 0].set_title(f'Daily average {pollutant} levels (2023-2025)', 
                        fontsize=12, fontweight='bold')
    axes[0, 0].grid(alpha=0.3)
    axes[0, 0].tick_params(axis='x', rotation=45)
    
    # Panel 2: Distribution histogram (only non-NaN values)
    axes[0, 1].hist(poll_data_clean['value'], bins=50, color='steelblue', 
                   alpha=0.7, edgecolor='black')
    axes[0, 1].axvline(poll_data_clean['value'].mean(), color='red', linestyle='--', 
                      linewidth=2, label=f'Mean: {poll_data_clean["value"].mean():.1f}')
    axes[0, 1].axvline(poll_data_clean['value'].median(), color='orange', linestyle='--', 
                      linewidth=2, label=f'Median: {poll_data_clean["value"].median():.1f}')
    axes[0, 1].set_xlabel(f'{pollutant} concentration', fontsize=11, fontweight='bold')
    axes[0, 1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
    axes[0, 1].set_title(f'{pollutant} distribution', fontsize=12, fontweight='bold')
    axes[0, 1].legend()
    axes[0, 1].grid(alpha=0.3, axis='y')
    
    # Panel 3: Hourly pattern
    hourly_avg = poll_data_clean.groupby('hour')['value'].mean()
    axes[1, 0].plot(hourly_avg.index, hourly_avg.values, marker='o', 
                   linewidth=2, color='forestgreen')
    axes[1, 0].set_xlabel('Hour of day', fontsize=11, fontweight='bold')
    axes[1, 0].set_ylabel(f'Average {pollutant}', fontsize=11, fontweight='bold')
    axes[1, 0].set_title(f'Average {pollutant} by hour of day', 
                        fontsize=12, fontweight='bold')
    axes[1, 0].set_xticks(range(0, 24, 2))
    axes[1, 0].grid(alpha=0.3)
    
    # Panel 4: Monthly boxplot
    month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    poll_data_clean['month_name'] = pd.Categorical(poll_data_clean['month_name'], 
                                                   categories=month_order, ordered=True)
    poll_data_clean.boxplot(column='value', by='month_name', ax=axes[1, 1])
    axes[1, 1].set_xlabel('Month', fontsize=11, fontweight='bold')
    axes[1, 1].set_ylabel(f'{pollutant} concentration', fontsize=11, fontweight='bold')
    axes[1, 1].set_title(f'Monthly {pollutant} distribution', 
                        fontsize=12, fontweight='bold')
    axes[1, 1].get_figure().suptitle('')  # Remove default title
    plt.setp(axes[1, 1].xaxis.get_majorticklabels(), rotation=45)
    
    plt.tight_layout()
    safe_name = pollutant.replace('.', '_').replace('/', '_').replace(' ', '_')
    plt.savefig(report_dir / f'{safe_name}_analysis.png', dpi=300, bbox_inches='tight')
    print(f"\n  Visualisation saved: {safe_name}_analysis.png")
    plt.close()
    
    return poll_data_clean

#### 4) analyse_seasonal_patterns created for check seasonal patterns.
- func creates season definition 
- Calculates seasonal averages.
- and visualisa it.
- report/detailed_analysis folder and image name: seasonal_patterns.png

In [186]:
def analyse_seasonal_patterns(data, report_dir):
    """
  Seasonal variation analysis.
    
    Creates:
    - Season definitions (Winter, Spring, Summer, Autumn)
    - Seasonal averages
    - Visualization
    """
    
    print("\n" + "="*40)
    print("Seasonal pattern analyse")
    print("="*40)
    
    if data is None or data.empty:
        print("Error no data loaded")
        return None
    
   # Define seasons
    def assign_season(month):
        if month in [12, 1, 2]:
            return 'Winter'
        elif month in [3, 4, 5]:
            return 'Spring'
        elif month in [6, 7, 8]:
            return 'Summer'
        else:
            return 'Autumn'
    
    data_copy = data.copy()
    data_copy['season'] = data_copy['month'].apply(assign_season)
    
    # Calculate seasonal averages
    seasonal_avg = data_copy.groupby(['season', 'pollutant_name'])['value'].mean().unstack()
    season_order = ['Winter', 'Spring', 'Summer', 'Autumn']
    seasonal_avg = seasonal_avg.reindex(season_order)
    
    print("\nSeasonal averages")
    print(seasonal_avg.round(2))
    
    # Save
    seasonal_avg.to_csv(report_dir / 'seasonal_averages.csv')
    
    # Visualise
    fig, ax = plt.subplots(figsize=(12, 7))
    seasonal_avg.plot(kind='bar', ax=ax, width=0.8, alpha=0.8)
    ax.set_xlabel('Season', fontsize=12, fontweight='bold')
    ax.set_ylabel('Average Concentration', fontsize=12, fontweight='bold')
    ax.set_title('Seasonal Pollution Patterns by Pollutant', 
                fontsize=13, fontweight='bold')
    ax.legend(title='Pollutant', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.grid(axis='y', alpha=0.3)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
    
    plt.tight_layout()
    plt.savefig(report_dir / 'seasonal_patterns.png', dpi=300, bbox_inches='tight')
    print(f"\n  Visualisation saved seasonal_patterns.png")
    plt.close()
    
    return seasonal_avg

#### 5) analyse_limit_exceedances function checks the uk_ppollutant_limits.csv.
1) requires pollutant_limits.csv with columns: pollutant_std, limit, concentration measured as, objective.
2) report/detailed_analysis path cvreates file called: limit_exceedances.csv 
 - column structure of the file:  pollutant, objective, averaging_period, limit_value, total_measurements, exceedances, exceedance_pct

In [187]:
def analyse_limit_exceedances(data, report_dir, limits_path):
    """
    Analysis of UK legal limit exceedances.
    
    Matches pollutant names in data with standard codes in UK limits CSV.
    
    Creates:
        - Exceedance counts per pollutant
        - Percentage of measurements exceeding limits
        - Comparison against UK legal objectives
    """
    print("\n" + "="*40)
    print("UK legal limit exceedance analysis")
    print("="*40)
    
    if data is None or data.empty:
        print("  Error: no data loaded.")
        return None
    
    # Load limits
    try:
        limits_df = pd.read_csv(limits_path)
        print(f"\n  Loaded {len(limits_df)} UK limit standards from: {limits_path.name}")
    except Exception as e:
        print(f"  Error: couldn't load limits file: {e}")
        return None
    
    # Create mapping between data pollutant names and CSV standard codes
    pollutant_mapping = {
        'Nitrogen dioxide': 'NO2',
        'Nitrogen monoxide': 'NO',
        'Nitrogen oxides': 'NOx',
        'Sulphur dioxide': 'SO2',
        'Ozone': 'O3',
        'Particulate matter less than 2.5 micro m': 'PM2.5',
        'PM2.5 Particles': 'PM2.5',
        'Particulate matter less than 10 micro m': 'PM10',
        'PM10 Particles': 'PM10',
        'Benzene': 'Benzene',
        '1.3 Butadiene': '1,3-butadiene',
        '1,3-Butadiene': '1,3-butadiene',
        'Carbon monoxide': 'CO',
        'Lead': 'LEAD',
        'Polycyclic Aromatic Hydrocarbons': 'PAH'
    }
    
    # Prepare results storage
    exceedance_results = []
    pollutants_checked = 0
    pollutants_with_limits = 0
    
    # For each pollutant in data
    for pollutant in data['pollutant_name'].unique():
        pollutants_checked += 1
        
        # Map to standard code
        std_code = pollutant_mapping.get(pollutant, None)
        
        if std_code is None:
            continue  # Skip pollutants without standard codes
        
        # Get limits for this pollutant
        limits = limits_df[limits_df['pollutant_std'] == std_code]
        
        if limits.empty:
            continue  # Skip if no limits defined
        
        pollutants_with_limits += 1
        
        # Get clean data (remove NaN)
        poll_data = data[data['pollutant_name'] == pollutant].copy()
        poll_data = poll_data[poll_data['value'].notna()]
        
        if poll_data.empty:
            print(f"\n  {pollutant}: No valid data to check.")
            continue
        
        print(f"\n  Checking {pollutant} ({std_code}):")
        
        # Check each limit type for this pollutant
        for _, limit_row in limits.iterrows():
            limit_value = limit_row['limit']
            averaging = limit_row['concentration measured as']
            objective = limit_row['objective']
            unit = limit_row['unit']
            
            # Handle unit conversion if needed
            data_values = poll_data['value'].copy()
            if unit == 'mg/m3':
                # Convert µg/m3 to mg/m3 if needed (divide by 1000)
                # Assuming data is in µg/m3, convert to mg/m3
                data_values = data_values / 1000.0
                print(f"    Note: converted values from µg/m³ to mg/m³ for comparison")
            
            # Calculate exceedances
            exceedances = (data_values > limit_value).sum()
            total = len(data_values)
            pct = (exceedances / total * 100) if total > 0 else 0
            
            exceedance_results.append({
                'pollutant': pollutant,
                'pollutant_code': std_code,
                'objective': objective,
                'averaging_period': averaging,
                'limit_value': limit_value,
                'unit': unit,
                'total_measurements': total,
                'exceedances': exceedances,
                'exceedance_pct': round(pct, 2)
            })
            
            # Report results
            if exceedances > 0:
                print(f"    {averaging}: {exceedances:,} / {total:,} ({pct:.1f}%) EXCEED {limit_value} {unit}")
            else:
                print(f"    {averaging}: All measurements within limit ({limit_value} {unit})")
    
    # Summary
    print(f"\n  Summary:")
    print(f"    Total pollutants in dataset: {pollutants_checked}")
    print(f"    Pollutants with UK limits: {pollutants_with_limits}")
    
    if exceedance_results:
        exceedance_df = pd.DataFrame(exceedance_results)
        exceedance_df.to_csv(report_dir / 'limit_exceedances.csv', index=False)
        print(f"    Results saved: limit_exceedances.csv")
        
        # Print overall exceedance summary
        total_exceedances = exceedance_df['exceedances'].sum()
        total_checks = exceedance_df['total_measurements'].sum()
        print(f"\n  Overall: {total_exceedances:,} exceedances out of {total_checks:,} measurements checked")
        
        return exceedance_df
    else:
        print("\n  No matching limits found for any pollutants in dataset.")
        return None

#### 6) all analyse 
1) finally function for run_full_analysis for all the description.


In [188]:
# Example usage function
def run_full_analysis(base_dir, csv_output_path=None, 
                    sample_size=None):
    """
    Run complete analysis pipeline.

    Parameters:
        base_dir : str or Path
            Path to optimised directory.
        pollutant_limits_path : str or Path, optional
            Path to pollutant_limits.csv.
        sample_size : int, optional
            Number of files to load (None = all).
            
    Returns:
        PollutionAnalysis : Analyzer object with results.
    """

    print("\n" + "="*40)
    print("Starting full pollution analysis")
    print("="*40)
    
    # Load data
    data = load_pollution_data(base_dir, sample_size=sample_size)
    
    if data.empty:
        print("\n  Error: no data loaded, cannot continue.")
        return None
    
    # Run all analyses
    analyse_overall_trends(data, report_dir)
    
    # Analyze each pollutant found in data
    for pollutant in data['pollutant_name'].unique():
        analyse_pollutant_specific(data, pollutant, report_dir)
    
    analyse_seasonal_patterns(data, report_dir)
    
    if csv_output_path and Path(csv_output_path).exists():
        analyse_limit_exceedances(data, report_dir, Path(csv_output_path))
    
    print("\n" + "="*40)
    print("Analysis complete")
    print("="*40)
    print(f"\nAll results saved to: {report_dir}")
    print("\nGenerated files:")
    for f in sorted(report_dir.iterdir()):
        print(f"  - {f.name}")
    print("="*40)
    
    return data





In [189]:
# Run analysis - set sample_size=100 for testing, None for full dataset
data = run_full_analysis(
    base_dir=base_dir,
    csv_output_path=csv_output_path,
    sample_size=None  # Remove this or set to None to analyze all data
)



Starting full pollution analysis

  Loaded 2,525,991 measurements from 3563 files.
  Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
  Stations: 18
  Pollutants: 37

  Pollutant breakdown:
    Nitrogen dioxide: 326,072 measurements
    Nitrogen monoxide: 326,061 measurements
    Nitrogen oxides: 325,387 measurements
    Particulate matter less than 2.5 micro m: 234,748 measurements
    Particulate matter less than 10 micro m: 227,142 measurements
    Ozone: 194,333 measurements
    Sulphur dioxide: 72,928 measurements
    Carbon monoxide: 48,578 measurements
    Benzene: 26,649 measurements
    Ethyl benzene: 26,649 measurements
    o-Xylene: 26,649 measurements
    n-Octane: 26,649 measurements
    i-Octane: 26,649 measurements
    1,3,5-Trimethylbenzene: 26,649 measurements
    n-Heptane: 26,649 measurements
    Toluene: 26,649 measurements
    1,2,3-Trimethylbenzene: 26,649 measurements
    1,2,4-Trimethylbenzene: 26,649 measurements
    Ethene: 26,618 measurements
    Isopr

    ========================================
    Starting full pollution analysis
    ========================================

    Loaded 2,525,991 measurements from 3563 files.
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 18
    Pollutants: 37

    Pollutant breakdown:
        Nitrogen dioxide: 326,072 measurements
        Nitrogen monoxide: 326,061 measurements
        Nitrogen oxides: 325,387 measurements
        Particulate matter less than 2.5 micro m: 234,748 measurements
        Particulate matter less than 10 micro m: 227,142 measurements
        Ozone: 194,333 measurements
        Sulphur dioxide: 72,928 measurements
        Carbon monoxide: 48,578 measurements
        Benzene: 26,649 measurements
        Ethyl benzene: 26,649 measurements
        o-Xylene: 26,649 measurements
        n-Octane: 26,649 measurements
        i-Octane: 26,649 measurements
        1,3,5-Trimethylbenzene: 26,649 measurements
        n-Heptane: 26,649 measurements
        Toluene: 26,649 measurements
        1,2,3-Trimethylbenzene: 26,649 measurements
        1,2,4-Trimethylbenzene: 26,649 measurements
        Ethene: 26,618 measurements
        Isoprene: 26,618 measurements
        Propane: 26,618 measurements
        n-Pentane: 26,618 measurements
        i-Pentane: 26,618 measurements
        Propene: 26,618 measurements
        i-Hexane: 26,599 measurements
        trans-2-Pentene: 26,599 measurements
        cis-2-Butene: 26,599 measurements
        trans-2-Butene: 26,599 measurements
        1-Butene: 26,599 measurements
        n-Butane: 26,599 measurements
        Ethane: 26,599 measurements
        i-Butane: 26,599 measurements
        n-Hexane: 26,580 measurements
        1-Pentene: 26,572 measurements
        1.3 Butadiene: 26,568 measurements
        Ethyne: 26,529 measurements
        m,p-Xylene: 25,503 measurements

    ========================================
    Overall pollutant trends
    ========================================

    Yearly averages by pollutant:
                                mean  median    std  count
    year pollutant_name                                    
    2023 1,2,3-Trimethylbenzene  0.03    0.02   0.03  11091
        1,2,4-Trimethylbenzene  0.51    0.24   7.94  11079
        1,3,5-Trimethylbenzene  0.13    0.06   0.52  11049
        1-Butene                0.26    0.23   0.20  11060
        1-Pentene               0.18    0.04   1.44  10990
    ...                           ...     ...    ...    ...
    2025 n-Octane                0.12    0.10   0.15   6428
        n-Pentane               0.79    0.66   0.61   6548
        o-Xylene                1.54    0.62  11.65   6501
        trans-2-Butene          0.22    0.21   0.09   6533
        trans-2-Pentene         0.10    0.08   0.12   6525

    [111 rows x 4 columns]

    Visualisation saved: overall_trends.png

    ========================================
    Pollutant specific analysis: Toluene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,009
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Toluene:
        Mean: 1.53
        Median: 1.03
        Std dev: 2.97
        Min: 0.01
        Max: 241.58
        25th percentile: 0.52
        75th percentile: 1.84

    Visualisation saved: Toluene_analysis.png

    ========================================
    Pollutant specific analysis: i-Butane
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,291
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for i-Butane:
        Mean: 2.43
        Median: 1.77
        Std dev: 2.81
        Min: 0.01
        Max: 82.97
        25th percentile: 1.07
        75th percentile: 2.87

    Visualisation saved: i-Butane_analysis.png

    ========================================
    Pollutant specific analysis: 1,2,3-Trimethylbenzene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,089
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1,2,3-Trimethylbenzene:
        Mean: 0.03
        Median: 0.03
        Std dev: 0.02
        Min: 0.03
        Max: 3.16
        25th percentile: 0.03
        75th percentile: 0.03

    Visualisation saved: 1,2,3-Trimethylbenzene_analysis.png

    ========================================
    Pollutant specific analysis: Ethyne
    ========================================

    Total measurements: 26,529
    Valid (non-NaN) measurements: 25,201
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Ethyne:
        Mean: 0.60
        Median: 0.48
        Std dev: 0.52
        Min: 0.01
        Max: 10.33
        25th percentile: 0.32
        75th percentile: 0.74

    Visualisation saved: Ethyne_analysis.png

    ========================================
    Pollutant specific analysis: 1-Butene
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,292
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1-Butene:
        Mean: 0.31
        Median: 0.28
        Std dev: 0.24
        Min: 0.01
        Max: 9.56
        25th percentile: 0.20
        75th percentile: 0.36

    Visualisation saved: 1-Butene_analysis.png

    ========================================
    Pollutant specific analysis: Ozone
    ========================================

    Total measurements: 194,333
    Valid (non-NaN) measurements: 167,149
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 9

    Concentration statistics for Ozone:
        Mean: 46.65
        Median: 47.10
        Std dev: 24.07
        Min: -4.74
        Max: 200.17
        25th percentile: 29.74
        75th percentile: 62.47

    Visualisation saved: Ozone_analysis.png

    ========================================
    Pollutant specific analysis: 1,2,4-Trimethylbenzene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,039
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1,2,4-Trimethylbenzene:
        Mean: 0.88
        Median: 0.36
        Std dev: 6.84
        Min: 0.00
        Max: 592.16
        25th percentile: 0.17
        75th percentile: 0.70

    Visualisation saved: 1,2,4-Trimethylbenzene_analysis.png

    ========================================
    Pollutant specific analysis: cis-2-Butene
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,221
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for cis-2-Butene:
        Mean: 0.12
        Median: 0.10
        Std dev: 0.11
        Min: 0.00
        Max: 3.77
        25th percentile: 0.06
        75th percentile: 0.15

    Visualisation saved: cis-2-Butene_analysis.png

    ========================================
    Pollutant specific analysis: trans-2-Pentene
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,233
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for trans-2-Pentene:
        Mean: 0.08
        Median: 0.05
        Std dev: 0.10
        Min: 0.00
        Max: 6.30
        25th percentile: 0.01
        75th percentile: 0.10

    Visualisation saved: trans-2-Pentene_analysis.png

    ========================================
    Pollutant specific analysis: Nitrogen oxides
    ========================================

    Total measurements: 325,387
    Valid (non-NaN) measurements: 300,423
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 14

    Concentration statistics for Nitrogen oxides:
        Mean: 33.00
        Median: 20.66
        Std dev: 38.23
        Min: -1.68
        Max: 861.01
        25th percentile: 10.52
        75th percentile: 40.71

    Visualisation saved: Nitrogen_oxides_analysis.png

    ========================================
    Pollutant specific analysis: 1.3 Butadiene
    ========================================

    Total measurements: 26,568
    Valid (non-NaN) measurements: 25,248
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1.3 Butadiene:
        Mean: 0.15
        Median: 0.10
        Std dev: 0.16
        Min: 0.01
        Max: 9.76
        25th percentile: 0.05
        75th percentile: 0.22

    Visualisation saved: 1_3_Butadiene_analysis.png

    ========================================
    Pollutant specific analysis: i-Hexane
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,278
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for i-Hexane:
        Mean: 0.57
        Median: 0.36
        Std dev: 1.38
        Min: 0.01
        Max: 72.82
        25th percentile: 0.20
        75th percentile: 0.60

    Visualisation saved: i-Hexane_analysis.png

    ========================================
    Pollutant specific analysis: Nitrogen monoxide
    ========================================

    Total measurements: 326,061
    Valid (non-NaN) measurements: 300,617
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 14

    Concentration statistics for Nitrogen monoxide:
        Mean: 7.95
        Median: 2.22
        Std dev: 16.60
        Min: -1.50
        Max: 463.12
        25th percentile: 0.62
        75th percentile: 7.61

    Visualisation saved: Nitrogen_monoxide_analysis.png

    ========================================
    Pollutant specific analysis: Isoprene
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,277
    Date range: 2023-01-01 01:00:00 to 2025-11-09 22:00:00
    Stations: 2

    Concentration statistics for Isoprene:
        Mean: 0.16
        Median: 0.09
        Std dev: 0.24
        Min: 0.01
        Max: 4.03
        25th percentile: 0.03
        75th percentile: 0.18

    Visualisation saved: Isoprene_analysis.png

    ========================================
    Pollutant specific analysis: n-Hexane
    ========================================

    Total measurements: 26,580
    Valid (non-NaN) measurements: 25,260
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for n-Hexane:
        Mean: 0.37
        Median: 0.23
        Std dev: 2.17
        Min: 0.01
        Max: 94.56
        25th percentile: 0.14
        75th percentile: 0.36

    Visualisation saved: n-Hexane_analysis.png

    ========================================
    Pollutant specific analysis: Benzene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,063
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Benzene:
        Mean: 0.51
        Median: 0.38
        Std dev: 0.78
        Min: 0.00
        Max: 54.47
        25th percentile: 0.22
        75th percentile: 0.62

    Visualisation saved: Benzene_analysis.png

    ========================================
    Pollutant specific analysis: Ethene
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,306
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Ethene:
        Mean: 1.42
        Median: 1.18
        Std dev: 1.48
        Min: 0.01
        Max: 157.86
        25th percentile: 0.69
        75th percentile: 1.83

    Visualisation saved: Ethene_analysis.png

    ========================================
    Pollutant specific analysis: n-Pentane
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,312
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for n-Pentane:
        Mean: 0.73
        Median: 0.57
        Std dev: 0.73
        Min: 0.01
        Max: 13.18
        25th percentile: 0.35
        75th percentile: 0.87

    Visualisation saved: n-Pentane_analysis.png

    ========================================
    Pollutant specific analysis: Propene
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,306
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Propene:
        Mean: 0.81
        Median: 0.73
        Std dev: 0.57
        Min: 0.01
        Max: 8.56
        25th percentile: 0.43
        75th percentile: 1.05

    Visualisation saved: Propene_analysis.png

    ========================================
    Pollutant specific analysis: n-Butane
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,292
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for n-Butane:
        Mean: 3.91
        Median: 2.79
        Std dev: 9.69
        Min: 0.01
        Max: 1357.48
        25th percentile: 1.68
        75th percentile: 4.54

    Visualisation saved: n-Butane_analysis.png

    ========================================
    Pollutant specific analysis: trans-2-Butene
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,278
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for trans-2-Butene:
        Mean: 0.18
        Median: 0.14
        Std dev: 0.17
        Min: 0.01
        Max: 4.48
        25th percentile: 0.08
        75th percentile: 0.23

    Visualisation saved: trans-2-Butene_analysis.png

    ========================================
    Pollutant specific analysis: n-Heptane
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,027
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for n-Heptane:
        Mean: 0.32
        Median: 0.23
        Std dev: 0.50
        Min: 0.00
        Max: 22.98
        25th percentile: 0.12
        75th percentile: 0.37

    Visualisation saved: n-Heptane_analysis.png

    ========================================
    Pollutant specific analysis: 1,3,5-Trimethylbenzene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,008
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1,3,5-Trimethylbenzene:
        Mean: 0.14
        Median: 0.06
        Std dev: 0.52
        Min: 0.00
        Max: 47.30
        25th percentile: 0.03
        75th percentile: 0.15

    Visualisation saved: 1,3,5-Trimethylbenzene_analysis.png

    ========================================
    Pollutant specific analysis: Nitrogen dioxide
    ========================================

    Total measurements: 326,072
    Valid (non-NaN) measurements: 300,643
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 14

    Concentration statistics for Nitrogen dioxide:
        Mean: 20.79
        Median: 16.45
        Std dev: 16.23
        Min: -2.10
        Max: 274.62
        25th percentile: 8.61
        75th percentile: 28.69

    Visualisation saved: Nitrogen_dioxide_analysis.png

    ========================================
    Pollutant specific analysis: Sulphur dioxide
    ========================================

    Total measurements: 72,928
    Valid (non-NaN) measurements: 65,747
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 3

    Concentration statistics for Sulphur dioxide:
        Mean: 1.94
        Median: 1.46
        Std dev: 1.91
        Min: 0.00
        Max: 156.19
        25th percentile: 0.93
        75th percentile: 2.40

    Visualisation saved: Sulphur_dioxide_analysis.png

    ========================================
    Pollutant specific analysis: Particulate matter less than 2.5 micro m
    ========================================

    Total measurements: 234,748
    Valid (non-NaN) measurements: 205,125
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 13

    Concentration statistics for Particulate matter less than 2.5 micro m:
        Mean: 7.80
        Median: 5.75
        Std dev: 7.04
        Min: -1.60
        Max: 189.53
        25th percentile: 3.87
        75th percentile: 9.24

    Visualisation saved: Particulate_matter_less_than_2_5_micro_m_analysis.png

    ========================================
    Pollutant specific analysis: i-Pentane
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,312
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for i-Pentane:
        Mean: 1.59
        Median: 1.20
        Std dev: 1.65
        Min: 0.01
        Max: 51.20
        25th percentile: 0.68
        75th percentile: 1.96

    Visualisation saved: i-Pentane_analysis.png

    ========================================
    Pollutant specific analysis: Carbon monoxide
    ========================================

    Total measurements: 48,578
    Valid (non-NaN) measurements: 45,500
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Carbon monoxide:
        Mean: 0.22
        Median: 0.17
        Std dev: 0.18
        Min: 0.00
        Max: 2.54
        25th percentile: 0.09
        75th percentile: 0.30

    Visualisation saved: Carbon_monoxide_analysis.png

    ========================================
    Pollutant specific analysis: i-Octane
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,025
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for i-Octane:
        Mean: 0.30
        Median: 0.24
        Std dev: 0.41
        Min: 0.00
        Max: 34.95
        25th percentile: 0.11
        75th percentile: 0.40

    Visualisation saved: i-Octane_analysis.png

    ========================================
    Pollutant specific analysis: n-Octane
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 24,885
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for n-Octane:
        Mean: 0.11
        Median: 0.08
        Std dev: 0.15
        Min: 0.00
        Max: 5.76
        25th percentile: 0.03
        75th percentile: 0.15

    Visualisation saved: n-Octane_analysis.png

    ========================================
    Pollutant specific analysis: Propane
    ========================================

    Total measurements: 26,618
    Valid (non-NaN) measurements: 25,302
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Propane:
        Mean: 4.45
        Median: 3.32
        Std dev: 6.64
        Min: 0.01
        Max: 401.51
        25th percentile: 2.18
        75th percentile: 5.11

    Visualisation saved: Propane_analysis.png

    ========================================
    Pollutant specific analysis: m,p-Xylene
    ========================================

    Total measurements: 25,503
    Valid (non-NaN) measurements: 23,891
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for m,p-Xylene:
        Mean: 2.27
        Median: 1.00
        Std dev: 19.88
        Min: 0.00
        Max: 1557.71
        25th percentile: 0.44
        75th percentile: 1.91

    Visualisation saved: m,p-Xylene_analysis.png

    ========================================
    Pollutant specific analysis: 1-Pentene
    ========================================

    Total measurements: 26,572
    Valid (non-NaN) measurements: 25,191
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for 1-Pentene:
        Mean: 0.12
        Median: 0.05
        Std dev: 0.95
        Min: 0.00
        Max: 57.46
        25th percentile: 0.01
        75th percentile: 0.08

    Visualisation saved: 1-Pentene_analysis.png

    ========================================
    Pollutant specific analysis: Ethane
    ========================================

    Total measurements: 26,599
    Valid (non-NaN) measurements: 25,284
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Ethane:
        Mean: 6.83
        Median: 5.29
        Std dev: 6.27
        Min: 0.01
        Max: 121.84
        25th percentile: 3.69
        75th percentile: 7.97

    Visualisation saved: Ethane_analysis.png

    ========================================
    Pollutant specific analysis: o-Xylene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,081
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for o-Xylene:
        Mean: 0.87
        Median: 0.39
        Std dev: 6.20
        Min: 0.01
        Max: 468.71
        25th percentile: 0.19
        75th percentile: 0.73

    Visualisation saved: o-Xylene_analysis.png

    ========================================
    Pollutant specific analysis: Ethyl benzene
    ========================================

    Total measurements: 26,649
    Valid (non-NaN) measurements: 25,057
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 2

    Concentration statistics for Ethyl benzene:
        Mean: 0.65
        Median: 0.29
        Std dev: 5.19
        Min: 0.00
        Max: 495.85
        25th percentile: 0.14
        75th percentile: 0.56

    Visualisation saved: Ethyl_benzene_analysis.png

    ========================================
    Pollutant specific analysis: Particulate matter less than 10 micro m
    ========================================

    Total measurements: 227,142
    Valid (non-NaN) measurements: 189,562
    Date range: 2023-01-01 01:00:00 to 2025-11-09 23:00:00
    Stations: 14

    Concentration statistics for Particulate matter less than 10 micro m:
        Mean: 13.22
        Median: 10.60
        Std dev: 10.50
        Min: 0.00
        Max: 857.17
        25th percentile: 7.21
        75th percentile: 16.10

    Visualisation saved: Particulate_matter_less_than_10_micro_m_analysis.png

    ========================================
    Seasonal pattern analyse
    ========================================

    Seasonal averages
    pollutant_name  1,2,3-Trimethylbenzene  1,2,4-Trimethylbenzene  \
    season                                                           
    Winter                            0.03                    0.73   
    Spring                            0.02                    0.59   
    Summer                            0.03                    0.51   
    Autumn                            0.02                    2.05   

    pollutant_name  1,3,5-Trimethylbenzene  1-Butene  1-Pentene  1.3 Butadiene  \
    season                                                                       
    Winter                            0.17      0.35       0.07           0.14   
    Spring                            0.11      0.27       0.15           0.13   
    Summer                            0.13      0.29       0.15           0.16   
    Autumn                            0.14      0.36       0.08           0.17   

    pollutant_name  Benzene  Carbon monoxide  Ethane  Ethene  ...  i-Pentane  \
    season                                                    ...              
    Winter             0.80             0.28   10.36    1.92  ...       1.78   
    Spring             0.40             0.21    6.42    1.12  ...       1.20   
    Summer             0.37             0.17    4.54    1.13  ...       1.56   
    Autumn             0.53             0.23    6.62    1.72  ...       1.98   

    pollutant_name  m,p-Xylene  n-Butane  n-Heptane  n-Hexane  n-Octane  \
    season                                                                
    Winter                1.95      5.03       0.40      0.34      0.15   
    Spring                3.20      3.21       0.24      0.24      0.09   
    Summer                1.40      3.25       0.30      0.30      0.10   
    Autumn                2.44      4.60       0.38      0.68      0.13   

    pollutant_name  n-Pentane  o-Xylene  trans-2-Butene  trans-2-Pentene  
    season                                                                
    Winter               0.88      0.68            0.23             0.07  
    Spring               0.56      1.08            0.18             0.05  
    Summer               0.67      0.66            0.14             0.08  
    Autumn               0.88      1.07            0.18             0.12  

    [4 rows x 37 columns]

    Visualisation saved seasonal_patterns.png

    ========================================
    UK legal limit exceedance analysis
    ========================================

    Loaded 12 UK limit standards from: uk_pollutant_limits.csv

    Checking Ozone (O3):
        8 hour mean: 2,831 / 167,149 (1.7%) EXCEED 100.0 µg/m3

    Checking Nitrogen oxides (NOx):
        annual mean: 107,709 / 300,423 (35.9%) EXCEED 30.0 µg/m3

    Checking 1.3 Butadiene (1,3-butadiene):
        running annual mean: 5 / 25,248 (0.0%) EXCEED 2.25 µg/m3

    Checking Benzene (Benzene):
        running annual: 3 / 25,063 (0.0%) EXCEED 16.25 µg/m3

    Checking Nitrogen dioxide (NO2):
        annual mean: 37,422 / 300,643 (12.4%) EXCEED 40.0 µg/m3

    Checking Sulphur dioxide (SO2):
        24 hour mean: 2 / 65,747 (0.0%) EXCEED 125.0 µg/m3

    Checking Particulate matter less than 2.5 micro m (PM2.5):
        annual mean: 11,194 / 205,125 (5.5%) EXCEED 20.0 µg/m3

    Checking Carbon monoxide (CO):
        Note: converted values from µg/m³ to mg/m³ for comparison
        maximum daily: All measurements within limit (10.0 mg/m3)

    Checking Particulate matter less than 10 micro m (PM10):
        24 hour mean: 1,986 / 189,562 (1.0%) EXCEED 50.0 µg/m3
        annual mean: 4,462 / 189,562 (2.4%) EXCEED 40.0 µg/m3

    Summary:
        Total pollutants in dataset: 37
        Pollutants with UK limits: 9
        Results saved: limit_exceedances.csv

    Overall: 165,614 exceedances out of 1,514,022 measurements checked

    ========================================
    Analysis complete
    ========================================

    All results saved to: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/report/detailed_analysis

    Generated files:
    - 1,2,3-Trimethylbenzene_analysis.png
    - 1,2,4-Trimethylbenzene_analysis.png
    - 1,3,5-Trimethylbenzene_analysis.png
    - 1-Butene_analysis.png
    - 1-Pentene_analysis.png
    - 1_3_Butadiene_analysis.png
    - Benzene_analysis.png
    - Carbon_monoxide_analysis.png
    - Ethane_analysis.png
    - Ethene_analysis.png
    - Ethyl_benzene_analysis.png
    - Ethyne_analysis.png
    - Isoprene_analysis.png
    - Nitrogen_dioxide_analysis.png
    - Nitrogen_monoxide_analysis.png
    - Nitrogen_oxides_analysis.png
    - Ozone_analysis.png
    - Particulate_matter_less_than_10_micro_m_analysis.png
    - Particulate_matter_less_than_2_5_micro_m_analysis.png
    - Propane_analysis.png
    - Propene_analysis.png
    - Sulphur_dioxide_analysis.png
    - Toluene_analysis.png
    - cis-2-Butene_analysis.png
    - i-Butane_analysis.png
    - i-Hexane_analysis.png
    - i-Octane_analysis.png
    - i-Pentane_analysis.png
    - limit_exceedances.csv
    - m,p-Xylene_analysis.png
    - n-Butane_analysis.png
    - n-Heptane_analysis.png
    - n-Hexane_analysis.png
    - n-Octane_analysis.png
    - n-Pentane_analysis.png
    - o-Xylene_analysis.png
    - overall_trends.png
    - seasonal_averages.csv
    - seasonal_patterns.png
    - trans-2-Butene_analysis.png
    - trans-2-Pentene_analysis.png
    - yearly_averages.csv
    ========================================