# DEFRA dataset Find Missing Parts

- I will identify the missing values and data gaps in the DEFRA dataset and decide how to address them.
- I’ll start by importing the relevant modules and displaying the initial file paths.

In [21]:
import pandas as pd
from pathlib import Path
import logging
from typing import Dict
import numpy as np
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Use absolute path to avoid confusion
base_dir = Path("/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels")
processed_path = base_dir / "data" / "defra" / "processed"
metadata_path = base_dir / "data" / "defra" / "test" / "london_stations_clean.csv"

#new metadata path for parse and checking.
site_pollutant = base_dir / "data" / "defra" / "test" / "std_london_sites_pollutant.csv"
pollutant_mapping_path = base_dir / "src" / "data_prep" / "pollutant_mapps.py"

#new paths after coordinates_processed function
optimased_path = base_dir / "data" / "defra" / "optimised"

# Change output directory to data/defra/logs
logs_path = base_dir / "data" / "defra" / "logs"
logs_path.mkdir(parents=True, exist_ok=True)

# changing -99 and -1 values to NaN
output = optimased_path 

## 1. Optimise for accurate file reading.
 - metadata_path shows london_stations_clean.csv
    - 2 row of the csv file: 
    - station_id,station_name,pollutant_available,pollutant_air,latitude,longitude,timeseries_id,pollutant
    - 785876,Borehamwood Meadow Park,Nitrogen dioxide,air,51.661229,-0.2705499999910774,4565.0,6875 - Borehamwood Meadow Park-Nitrogen dioxide (air)

- processed folder files: example: processed_dir path = /Borehamwood_Meadow_Park/NO__2023_01.csv
    2 row of the csv file:
    - timestamp,value,timeseries_id,station_name,pollutant_name,pollutant_std
    - 2023-02-01 00:00:00,0.125,4564,Borehamwood Meadow Park,Nitrogen monoxide,NO

metadata_paths and processed_dir file structure not matching. Solution need to add pollutant_std column to metadata_path file

- new structure of metadata_path:
    - station_id,station_name,pollutant_available,pollutant_std,pollutant_air,latitude,longitude,timeseries_id,pollutant
    - 785876,Borehamwood Meadow Park,Nitrogen dioxide,HERE STD VERSION OF POLLUTANT(NO2),air,51.661229,-0.2705499999910774,4565.0,6875 - Borehamwood Meadow Park-Nitrogen dioxide (air)

- standartise the pollutant names on pollutant_mapps.csv 
            -- DEFRA mappings common ones in both datasets first.
            'Nitrogen dioxide': 'NO2',
            'Nitrogen Dioxide': 'NO2',
            'Nitrogen_dioxide': 'NO2',
            'Nitric oxide': 'NO',
            'Nitrogen_monoxide': 'NO',
            'Nitrogen oxides': 'NOx',
            'Nitrogen_oxides': 'NOx',
            'PM2.5 Particulate': 'PM2.5',
            'Particulate_matter_less_than_2.5_micro_m': 'PM2.5',
            'PM10 Particulate': 'PM10',
            'Particulate_matter_less_than_10_micro_m': 'PM10',
            'Sulphur Dioxide': 'SO2',
            'Sulphur dioxide': 'SO2',
            'Sulphur_dioxide': 'SO2',
            'Ozone': 'O3',
            'Carbon Monoxide': 'CO',
            'Carbon monoxide': 'CO',
            'Carbon_monoxide': 'CO',
            
            -- VOCs used simplified standard codes instead of their chemical names.
            'Benzene': 'Benzene',
            'Toluene': 'Toluene',
            'Ethylbenzene': 'Ethylbenzene',
            'Ethyl_benzene': 'Ethylbenzene',
            'o-Xylene': 'o-Xylene',
            'm,p-Xylene': 'm,p-Xylene',
            
            -- Trimethylbenzenes.
            '1,2,3-Trimethylbenzene': '1,2,3-TMB',
            '1,2,4-Trimethylbenzene': '1,2,4-TMB',
            '1,3,5-Trimethylbenzene': '1,3,5-TMB',
            
            -- Alkanes.
            'Ethane': 'Ethane',
            'Propane': 'Propane',
            'n-Butane': 'n-Butane',
            'i-Butane': 'i-Butane',
            'n-Pentane': 'n-Pentane',
            'i-Pentane': 'i-Pentane',
            'n-Hexane': 'n-Hexane',
            'i-Hexane': 'i-Hexane',
            'n-Heptane': 'n-Heptane',
            'n-Octane': 'n-Octane',
            'i-Octane': 'i-Octane',
            
            - Alkenes
            'Ethene': 'Ethene',
            'Propene': 'Propene',
            '1-Butene': '1-Butene',
            'cis-2-Butene': 'cis-2-Butene',
            'trans-2-Butene': 'trans-2-Butene',
            '1-Pentene': '1-Pentene',
            'trans-2-Pentene': 'trans-2-Pentene',
            
            - Other VOCs
            '1,3-Butadiene': '1,3-Butadiene',
            '1.3_Butadiene': '1,3-Butadiene',
            'Isoprene': 'Isoprene',
            'Ethyne': 'Ethyne',

- New metadata file after add pollutant_std col std_london_sites_pollutant.csv and path below
    -  site_pollutant = base_dir / "data" / "defra" / "test" / "std_london_sites_pollutant.csv"

- Last std process for processed folder files:/defra/processed/2023measurements/Borehamwood_Meadow_Park/NO__2023_01.csv
    - add latitude,longitude coorditanetion columns.
    - timestamp,value,timeseries_id,station_name,pollutant_name,pollutant_std, latitude,longitude
    - latitude,longitude columns:
        - parse coordination columns from std_london_site_pollutant.csv
        - the way to parse it matching pollutant_std and station_name columns. 


 

### 1) Add column pollutant_std to std_london_sites_pollutant.csv file
- adding pollutant_std colmn to london_stations_clean.csv file (metadata_path).
    - new metadata_path saved as std_london_sites_pollutant.csv the path name changed as:
    - site_pollutant = base_dir / "data" / "defra" / "test" / "std_london_sites_pollutant.csv"

In [8]:
# Add pollutant_std column to london_stations_clean.csv using DEFRA mappings
# and match it to pollutant_available column

defra_mappings = {
    'Nitrogen dioxide': 'NO2',
    'Nitrogen Dioxide': 'NO2',
    'Nitrogen_dioxide': 'NO2',
    'Nitric oxide': 'NO',
    'Nitrogen_monoxide': 'NO',
    'Nitrogen monoxide': 'NO',
    'Nitrogen oxides': 'NOx',
    'Nitrogen_oxides': 'NOx',
    'PM2.5 Particulate': 'PM2.5',
    'Particulate_matter_less_than_2.5_micro_m': 'PM2.5',
    'Particulate matter less than 2.5 micro m': 'PM2.5',
    'PM10 Particulate': 'PM10',
    'Particulate_matter_less_than_10_micro_m': 'PM10',
    'Particulate matter less than 10 micro m': 'PM10',
    'Sulphur Dioxide': 'SO2',
    'Sulphur dioxide': 'SO2',
    'Sulphur_dioxide': 'SO2',
    'Ozone': 'O3',
    'Carbon Monoxide': 'CO',
    'Carbon monoxide': 'CO',
    'Carbon_monoxide': 'CO',
    'Benzene': 'Benzene',
    'Toluene': 'Toluene',
    'Ethylbenzene': 'Ethylbenzene',
    'Ethyl_benzene': 'Ethylbenzene',
    'Ethyl benzene': 'Ethylbenzene',
    'o-Xylene': 'o-Xylene',
    'm,p-Xylene': 'm,p-Xylene',
    '1,2,3-Trimethylbenzene': '1,2,3-TMB',
    '1,2,4-Trimethylbenzene': '1,2,4-TMB',
    '1,3,5-Trimethylbenzene': '1,3,5-TMB',
    'Ethane': 'Ethane',
    'Propane': 'Propane',
    'n-Butane': 'n-Butane',
    'i-Butane': 'i-Butane',
    'n-Pentane': 'n-Pentane',
    'i-Pentane': 'i-Pentane',
    'n-Hexane': 'n-Hexane',
    'i-Hexane': 'i-Hexane',
    'n-Heptane': 'n-Heptane',
    'n-Octane': 'n-Octane',
    'i-Octane': 'i-Octane',
    'Ethene': 'Ethene',
    'Propene': 'Propene',
    '1-Butene': '1-Butene',
    'cis-2-Butene': 'cis-2-Butene',
    'trans-2-Butene': 'trans-2-Butene',
    '1-Pentene': '1-Pentene',
    'trans-2-Pentene': 'trans-2-Pentene',
    '1,3-Butadiene': '1,3-Butadiene',
    '1.3_Butadiene': '1,3-Butadiene',
    '1.3 Butadiene': '1,3-Butadiene',
    'Isoprene': 'Isoprene',
    'Ethyne': 'Ethyne',
}

# Load the stations metadata
stations_df = pd.read_csv(metadata_path)

# Add pollutant_std column by mapping pollutant_available
stations_df['pollutant_std'] = stations_df['pollutant_available'].map(defra_mappings)

# Save the updated DataFrame (commented out for now)
# stations_df.to_csv(metadata_path, index=False)



# ...existing code...
stations_df['pollutant_std'] = stations_df['pollutant_available'].map(defra_mappings)

# Check for NaN values in pollutant_std
nan_count = stations_df['pollutant_std'].isna().sum()
if nan_count > 0:
    print(f"Warning: {nan_count} rows have NaN in pollutant_std. Check for unmapped pollutant names.")
    print(stations_df[stations_df['pollutant_std'].isna()][['station_name', 'pollutant_available']])
else:
    print("No NaN values found in pollutant_std.")


# Display rows to check
stations_df


No NaN values found in pollutant_std.


Unnamed: 0,station_id,station_name,pollutant_available,pollutant_air,latitude,longitude,timeseries_id,pollutant,pollutant_std
0,785876,Borehamwood Meadow Park,Nitrogen dioxide,air,51.661229,-0.270550,4565.0,6875 - Borehamwood Meadow Park-Nitrogen dioxid...,NO2
1,785875,Borehamwood Meadow Park,Nitrogen monoxide,air,51.661229,-0.270550,4564.0,6874 - Borehamwood Meadow Park-Nitrogen monoxi...,NO
2,785877,Borehamwood Meadow Park,Nitrogen oxides,air,51.661229,-0.270550,4566.0,6876 - Borehamwood Meadow Park-Nitrogen oxides...,NOx
3,787202,Borehamwood Meadow Park,Particulate matter less than 10 micro m,aerosol,51.661229,-0.270550,4890.0,8893 - Borehamwood Meadow Park-Particulate mat...,PM10
4,787203,Borehamwood Meadow Park,Particulate matter less than 2.5 micro m,aerosol,51.661229,-0.270550,4892.0,8894 - Borehamwood Meadow Park-Particulate mat...,PM2.5
...,...,...,...,...,...,...,...,...,...
139,1129,Southwark A2 Old Kent Road,Nitrogen oxides,air,51.480499,-0.059550,461.0,1129 - Southwark A2 Old Kent Road-Nitrogen oxi...,NOx
140,1131,Southwark A2 Old Kent Road,Particulate matter less than 10 micro m,aerosol,51.480499,-0.059550,459.0,1131 - Southwark A2 Old Kent Road-Particulate ...,PM10
141,1158,Tower Hamlets Roadside,Nitrogen dioxide,air,51.522530,-0.042155,492.0,1158 - Tower Hamlets Roadside-Nitrogen dioxide...,NO2
142,1264,Tower Hamlets Roadside,Nitrogen monoxide,air,51.522530,-0.042155,4122.0,1264 - Tower Hamlets Roadside-Nitrogen monoxid...,NO


#### 1. std_london_sites_pollutant func saved to csv as srd_london_sites_pollutant.csv

In [13]:
def std_london_sites_pollutant(metadata_path, site_pollutant, defra_mappings):
    """
    Adds a 'pollutant_std' column to the metadata and saves as std_london_sites_pollutant.csv.
    """
    df = pd.read_csv(metadata_path, encoding='utf-8')
    df['pollutant_std'] = df['pollutant_available'].map(defra_mappings)
    df.to_csv(site_pollutant, index=False)
    print(f"Saved standardised metadata to {site_pollutant}")
    return df

In [12]:
std_london_sites_pollutant(metadata_path, site_pollutant, defra_mappings)

Saved standardised metadata to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/test/std_london_sites_pollutant.csv


Unnamed: 0,station_id,station_name,pollutant_available,pollutant_air,latitude,longitude,timeseries_id,pollutant,pollutant_std
0,785876,Borehamwood Meadow Park,Nitrogen dioxide,air,51.661229,-0.270550,4565.0,6875 - Borehamwood Meadow Park-Nitrogen dioxid...,NO2
1,785875,Borehamwood Meadow Park,Nitrogen monoxide,air,51.661229,-0.270550,4564.0,6874 - Borehamwood Meadow Park-Nitrogen monoxi...,NO
2,785877,Borehamwood Meadow Park,Nitrogen oxides,air,51.661229,-0.270550,4566.0,6876 - Borehamwood Meadow Park-Nitrogen oxides...,NOx
3,787202,Borehamwood Meadow Park,Particulate matter less than 10 micro m,aerosol,51.661229,-0.270550,4890.0,8893 - Borehamwood Meadow Park-Particulate mat...,PM10
4,787203,Borehamwood Meadow Park,Particulate matter less than 2.5 micro m,aerosol,51.661229,-0.270550,4892.0,8894 - Borehamwood Meadow Park-Particulate mat...,PM2.5
...,...,...,...,...,...,...,...,...,...
139,1129,Southwark A2 Old Kent Road,Nitrogen oxides,air,51.480499,-0.059550,461.0,1129 - Southwark A2 Old Kent Road-Nitrogen oxi...,NOx
140,1131,Southwark A2 Old Kent Road,Particulate matter less than 10 micro m,aerosol,51.480499,-0.059550,459.0,1131 - Southwark A2 Old Kent Road-Particulate ...,PM10
141,1158,Tower Hamlets Roadside,Nitrogen dioxide,air,51.522530,-0.042155,492.0,1158 - Tower Hamlets Roadside-Nitrogen dioxide...,NO2
142,1264,Tower Hamlets Roadside,Nitrogen monoxide,air,51.522530,-0.042155,4122.0,1264 - Tower Hamlets Roadside-Nitrogen monoxid...,NO


#### 2. function to add coordinates add_coordinates_processed to processed folder files.

- Last std process for processed folder files:/defra/processed/2023measurements/Borehamwood_Meadow_Park/NO__2023_01.csv
    - add latitude,longitude coorditanetion columns.
    - timestamp,value,timeseries_id,station_name,pollutant_name,pollutant_std, latitude,longitude
    - latitude,longitude columns:
        - parse coordination columns from std_london_site_pollutant.csv
        - the way to parse it matching pollutant_std and station_name columns. 
- new processed files saved:
    /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/standartise/(year)measuremnetns/(station_name)/(pollutant_std)_(year)_(month)

In [30]:
def add_coordinates_processed(processed_file, site_pollutant, base_dir):
    """
    Adds latitude and longitude columns to a processed DEFRA file based on station_name and pollutant_std.
    """
    df = pd.read_csv(processed_file)
    site_pollutant_df = pd.read_csv(site_pollutant)
    # Ensure pollutant_std is present in processed file
    if 'pollutant_std' not in df.columns:
        raise ValueError("pollutant_std column missing in processed file")
    # Merge latitude and longitude from std_metadata_df
    merged = pd.merge(
        df, 
        site_pollutant_df[['station_name', 'pollutant_std', 'latitude', 'longitude']],
        on=['station_name', 'pollutant_std'],
        how='left'
    )
    # Build new output path under 'standartise'
    processed_path_obj = Path(processed_file)
    output_path = base_dir / "data" / "defra" / "optimised" / processed_path_obj.relative_to(base_dir / "data" / "defra" / "processed")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    merged.to_csv(output_path, index=False, encoding='utf-8')
    print(f"Saved updated file to {output_path}")
    return merged

In [31]:
# use the function on all processed files
for file in processed_path.glob("*measurements/*/*.csv"):
    add_coordinates_processed(file, site_pollutant, base_dir)

Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Marylebone_Road/Toluene__2023_03.csv
Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Marylebone_Road/i-Butane__2023_09.csv
Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Marylebone_Road/1,2,3-TMB__2023_08.csv
Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Marylebone_Road/Ethyne__2023_03.csv
Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/defra/optimised/2023measurements/London_Marylebone_Road/1-Butene__2023_01.csv
Saved updated file to /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels/data/

## 2. Data Quality test Function:
The functions for discover and checks data quality metrics before cleaning, below.
- Counts total rows in dataset
- Identifies missing values per column (count + percentage)
- Counts duplicate rows based on timestamp
- Detects negative values in measurements
- Checks timestamp format issues


##### Defra dataset structure:
defra/processed/(year)_measurements/station_names(with empty spaces underscore)/(pollutant_std)_(year)_(month) 
 - first row of each csv file:
    - timestamp,value,timeseries_id,station_name,pollutant_name,pollutant_std

    



##### defra metadata: london_stations_clean.csv 
 - first row: station_id,station_name,pollutant_available,pollutant_air,latitude,longitude,timeseries_id,pollutant

In [36]:
# Main data quality analysis function for DEFRA dataset
def defra_data_quality_analysis(optimased_path, metadata_path, output_dir):
    """
    DEFRA Data Quality Analysis:
    - Checks empty files, missing columns, duplicates, missing values, types, format errors
    - Calculates issue rate for files with >20% missing 'value'
    - Explicitly detects and records files where the 'value' column is 100% empty (all NaN)
    """

    # Find all CSV files recursively in optimised_path
    all_csv_files = list(Path(optimased_path).rglob('*.csv'))
    missing_values_log = []
    all_issues = {
        'empty_files': [],
        'duplicate_timestamps': [],
        'high_missing_values': [],
        'column_errors': [],
        'format_errors': [],
        'completely_empty_value_column': []  # Only track files where 'value' is 100% empty
    }
    total_stats = {
        'total_files': 0,
        'files_processed': 0,
        'files_with_high_missing': 0,
        'total_rows': 0,
        'empty_files': 0
    }
    # Required columns for DEFRA files
    required_columns = ['timestamp', 'value', 'timeseries_id', 'station_name', 'pollutant_name', 'pollutant_std']

    for csv_file in all_csv_files:
        total_stats['total_files'] += 1
        try:
            df = pd.read_csv(csv_file)
            # Check file is empty
            if df.empty:
                all_issues['empty_files'].append(str(csv_file))
                total_stats['empty_files'] += 1
                continue
            total_stats['files_processed'] += 1
            total_stats['total_rows'] += len(df)

            # Check for missing required columns
            missing_cols = [col for col in required_columns if col not in df.columns]
            if missing_cols:
                all_issues['column_errors'].append({'file': str(csv_file), 'missing_columns': missing_cols})
                continue

            # Check for duplicate timestamps
            dup_ts = df['timestamp'].duplicated().sum()
            if dup_ts > 0:
                all_issues['duplicate_timestamps'].append({'file': str(csv_file), 'duplicate_count': int(dup_ts)})

            # Check if 'value' column is completely empty (all NaN)
            if 'value' in df.columns and df['value'].isna().all():
                all_issues['completely_empty_value_column'].append(str(csv_file))
                logger.warning(f"{csv_file.name}: 'value' column is completely empty (all NaN)")

            # Check missing values for each column print summary
            for col in df.columns:
                missing_count = df[col].isna().sum()
                missing_pct = (missing_count / len(df) * 100) if len(df) > 0 else 0
                logger.info(f"{csv_file.name}: Missing {col}: {missing_count} ({missing_pct:.2f}%)")

            # Calculate missing value percentage for 'value' column
            missing_values = df['value'].isna().sum()
            empty_value_percentage = (100 * missing_values / len(df)) if len(df) > 0 else 0
            logger.info(f"{csv_file.name}: Missing 'value': {missing_values}/{len(df)} ({empty_value_percentage:.2f}%)")
            if empty_value_percentage > 20:
                total_stats['files_with_high_missing'] += 1
                missing_values_log.append({
                    'filename': csv_file.name,
                    'path': str(csv_file),
                    'station_name': df['station_name'].iloc[0] if 'station_name' in df.columns else '',
                    'pollutant_std': df['pollutant_std'].iloc[0] if 'pollutant_std' in df.columns else '',
                    'EmptyValuePercentage': round(empty_value_percentage, 2)
                })
        except Exception as e:
            all_issues['format_errors'].append({'file': str(csv_file), 'error': str(e)})

    # Calculate issue rate percentage of processed files with >20% missing 'value' column
    if total_stats['files_processed'] > 0:
        issue_rate = (total_stats['files_with_high_missing'] / total_stats['files_processed']) * 100
    else:
        issue_rate = 0.0
    print(f"\nIssue rate: {issue_rate:.2f}% of files have >20% missing 'value' column.")

    # Save log to CSV (commented out for now)
    # if missing_values_log:
    #     pd.DataFrame(missing_values_log).to_csv(Path(output_dir) / "logs_missin_value.csv", index=False)

    return all_issues, total_stats, issue_rate  # Return issue rate for inspection

In [37]:
all_issues, total_stats, issue_rate = defra_data_quality_analysis(
    optimased_path,
    pollutant_mapping_path,
    logs_path
)

2025-12-12 02:04:19,421 - __main__ - INFO - Toluene__2023_03.csv: Missing timestamp: 0 (0.00%)
2025-12-12 02:04:19,422 - __main__ - INFO - Toluene__2023_03.csv: Missing value: 106 (21.77%)
2025-12-12 02:04:19,422 - __main__ - INFO - Toluene__2023_03.csv: Missing timeseries_id: 0 (0.00%)
2025-12-12 02:04:19,422 - __main__ - INFO - Toluene__2023_03.csv: Missing station_name: 0 (0.00%)
2025-12-12 02:04:19,423 - __main__ - INFO - Toluene__2023_03.csv: Missing pollutant_name: 0 (0.00%)
2025-12-12 02:04:19,424 - __main__ - INFO - Toluene__2023_03.csv: Missing pollutant_std: 0 (0.00%)
2025-12-12 02:04:19,424 - __main__ - INFO - Toluene__2023_03.csv: Missing latitude: 0 (0.00%)
2025-12-12 02:04:19,425 - __main__ - INFO - Toluene__2023_03.csv: Missing longitude: 0 (0.00%)
2025-12-12 02:04:19,425 - __main__ - INFO - Toluene__2023_03.csv: Missing 'value': 106/487 (21.77%)
2025-12-12 02:04:19,428 - __main__ - INFO - i-Butane__2023_09.csv: Missing timestamp: 0 (0.00%)
2025-12-12 02:04:19,428 - __ma


Issue rate: 8.78% of files have >20% missing 'value' column.


# Issue rate: 0.00% of files have >20% missing 'value' column.
as result of defra!

## miss alinged the check function, that's why maybe in my first run was %0 missing value. 
Issue rate: 8.78% of files have >20% missing 'value' column. currently after NaN replacement.


## 3) check defra negative values.
    - logically p,2.5 and pm10 can't be negative value because they're little particules on air.
    - NO2, SO2, CO, O3: Gas concentrations cannot be negative the same reason they're gas particules.
    - Defra uses concentration for air analyses.
    - below I will be checking if any negative value exsist.

In [55]:
#check for negative values in 'value' column

def negative_value(optimased_path):
    """ function to check for negative values in the 'value' column of optimased defra data.
    """
    files = Path(optimased_path).glob("**/*.csv")
    found_negatives = False

    for file in files:
        df = pd.read_csv(file)
        for pollutant in df['pollutant_std'].unique():
            subset = df[df['pollutant_std'] == pollutant]['value']
            neg_values= subset[subset < 0]
            neg_count = subset[subset < 0].count()
            neg_pct = (neg_count / len(subset) * 100) if len(subset) > 0 else 0

        if neg_count > 0:
            found_negatives = True
            print("Negative values found in the following pollutants:")
            print(f"\nFile: {file.name} | Pollutant: {pollutant}")
            print(f"  Negative count: {neg_count} ({neg_pct:.2f}%)")
            print(f"  Minimum value: {neg_values.min()}")
            print(f"  Example values: {neg_values.head(5).tolist()}")
    if not found_negatives:
        print("No negative values found in the DEFRA data.")
    return negative_value

In [56]:
# usage
negative_value(optimased_path)

Negative values found in the following pollutants:

File: Toluene__2023_03.csv | Pollutant: Toluene
  Negative count: 106 (21.77%)
  Minimum value: -99.0
  Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
Negative values found in the following pollutants:

File: i-Butane__2023_09.csv | Pollutant: i-Butane
  Negative count: 21 (5.02%)
  Minimum value: -99.0
  Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
Negative values found in the following pollutants:

File: 1,2,3-TMB__2023_08.csv | Pollutant: 1,2,3-TMB
  Negative count: 27 (4.08%)
  Minimum value: -99.0
  Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
Negative values found in the following pollutants:

File: Ethyne__2023_03.csv | Pollutant: Ethyne
  Negative count: 106 (21.77%)
  Minimum value: -99.0
  Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
Negative values found in the following pollutants:

File: 1-Butene__2023_01.csv | Pollutant: 1-Butene
  Negative count: 29 (3.91%)
  Minimum value: -99.0
  Example valu

<function __main__.negative_value(optimased_path)>

- negative values found on dataset.
But the values repeat itself:

        File: Ethane__2024_03.csv | Pollutant: Ethane
        Negative count: 19 (2.61%)
        Minimum value: -99.0
        Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
        Negative values found in the following pollutants:

        File: o-Xylene__2024_02.csv | Pollutant: o-Xylene
        Negative count: 20 (3.41%)
        Minimum value: -99.0
        Example values: [-99.0, -99.0, -99.0, -99.0, -99.0]
        Negative values found in the following pollutants:

I'm suspicious that -99 might mean something. I will check defra/sos documantation.
- SOS documentation, air quality directive e-reporting vocabulary, please see below the link:


https://dd.eionet.europa.eu/vocabulary/aq/observationvalidity

according to vocabulary : -99 means
   -  Not valid due to station maintenance or calibration
    - Definition: Data is considered to be invalid due to the regular calibration or the normal maintenance of the instrumentation (only used for primary data).

    

| Id   | Label                                                        | Status | Status Modified | Notation | Accepted Date | Not Accepted Date |
| :--- | :----------------------------------------------------------- | :----- | :-------------- | :------- | :------------ | :---------------- |
| -99  | [Not valid due to station maintenance or …](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/-99/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 22.03.2013      | -99      | 22.03.2013    |                   |
| -1   | [Not valid](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/-1/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 22.03.2013      | -1       | 22.03.2013    |                   |
| 1    | [Valid](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/1/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 22.03.2013      | 1        | 22.03.2013    |                   |
| 2    | [Valid, but below detection limit …](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/2/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 22.03.2013      | 2        | 22.03.2013    |                   |
| 3    | [Valid, but below detection limit and …](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/3/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 22.03.2013      | 3        | 22.03.2013    |                   |
| 4    | [Valid (Ozone only) using CCQM.O3.2019](https://dd.eionet.europa.eu/vocabularyconcept/aq/observationvalidity/4/view?vocabularyFolder.workingCopy=false&facet=HTML+Representation) | Valid  | 04.07.2024      | 4        | 04.07.2024    |                   |


## 4) NaN -99, -1 value replacement:
- I will be check defra's data according to eea data air quality directive id's -99, -1 integers means data not valid. Soo it is missing.
- Need to change the columns have -99, -1 value, because it shows that data missing.
- After that re-run the issue test.
- value = -99, -1 cells will be replaced by "NaN" data/defra/optimased folders.
- logs of the files had been change will be store data/defra/logs folder. NaN_values_record.csv


In [3]:
# invalid flags: -99, -1

def replace_invalid_flags(input_file : Path,year_folder: str, station_name: str) -> Dict:
    """
    Replace invalid flag values (-99, -1) in the 'value' column with NaN.
    Logs the number and percentage of replacements made.
    """
    try: 
        df = pd.read_csv(input_file, encoding ='utf-8')

        #check value colmn first
        if 'value' not in df.columns:
            logger.warning(f"'value' column not found in {input_file.name}")
            return {
                'status': 'skipped',
                'reason': "'value' column missing",
                'file' : input_file.name
            }
        
        #count invalid flags before replacement
        initial_count = len(df)
        invalid_mask = df['value'].isin([-99, -1])
        num_invalid = invalid_mask.sum()

        #skip no invalids
        if num_invalid == 0:
            return {
                'status': 'skipped',
                'reason': 'no invalid flags found',
                'file': input_file.name
            }
        
        #count flag type either -99 or -1
        flag_counts = {}
        for flag in [-99, -1]:
            count = (df['value'] == flag).sum()
            if count > 0:
                flag_counts[flag] = count

        #replace invalid flags with NaN
        df.loc[invalid_mask, 'value'] = np.nan
        df.to_csv(input_file, index=False, encoding='utf-8')

        #log entries
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'year_folder': year_folder,
            'station': station_name,
            'file': input_file.name,
            'total_rows': initial_count,
            'invalid_flags_replaced': num_invalid,
            'percentage_invalid': round(num_invalid / initial_count * 100, 2),
            'flag_minus_99': flag_counts.get(-99.0, 0),
            'flag_minus_1': flag_counts.get(-1.0, 0),
            'input_path': str(input_file),
            'output_path': str(optimased_path)
        }

        return {
            'status': 'processed',
            'replacements': num_invalid,
            'flag_counts': flag_counts,
            'log_entry': log_entry
        }
        
    except Exception as e:
        logger.error(f"error processing {input_file.name}: {e}")
        return {
            'status': 'error',
            'error': str(e),
            'file': input_file.name
        }

def process_all ():

    year_folders = [
        '2023measurements',
        '2024measurements',
        '2025measurements'
    ]

    # to track all variables log entries below
    all_logs = []
    total_files = 0
    total_replaced = 0
    total_skipped = 0
    total_errors = 0

    print("Starting invalid flag replacement process...")

    for year_folder in year_folders:
        year_path = optimased_path / year_folder

        if not year_path.exists():
            logger.warning(f"Year folder not found: {year_path}")
            continue

        print(f"Processing year folder: {year_folder}")

        # all station files

        station_dirs = [d for d in year_path.iterdir() if d.is_dir()]

        for station_dir in station_dirs:
            station_name = station_dir.name

            # all csv files in station dir
            csv_files = station_dir.glob("*.csv")

            for csv_file in csv_files:
                total_files += 1
                result = replace_invalid_flags(csv_file, year_folder, station_name)

                if result['status'] == 'processed':
                    total_files += 1
                    total_replaced += result['replacements']
                    all_logs.append(result['log_entry'])
                    
                    # print progress every 10 files
                    if total_files % 10 == 0:
                        print(f"processed {total_files} files, replaced {total_replaced} invalid flags")
                
                elif result['status'] == 'skipped':
                    total_skipped += 1
                
                elif result['status'] == 'error':
                    total_errors += 1
        
        print(f"completed year: {year_folder}")

        # log dataframe
    if all_logs:
        log_df = pd.DataFrame(all_logs)
    else:
        log_df = pd.DataFrame()

    print("Invalid flag replacement process completed.")
    print(f"total files processed: {total_files}")
    print(f"total files skipped: {total_skipped}")
    print(f"total errors: {total_errors}")
    print(f"total invalid flags replaced: {total_replaced}")

    return log_df


def save_log(log_df: pd.DataFrame):
    """ Save the log DataFrame to a CSV file in the logs directory. """

    if log_df.empty:
        logger.warning("No log entries to save.")
        return
    
    #save csv
    log_file = logs_path /"NaN_values_record.csv"
    log_df.to_csv(log_file, index=False, encoding='utf-8')

    print(f"Log saved to {log_file}")
    print(f"total log entries: {len(log_df)}")
    print(f" sample log entries:\n{log_df.head()}, \n{log_df.tail()}, \n{log_df.sample(5)}")

    print("Summary of invalid flags replaced per year:")
    year_summary = log_df.groupby('year_folder').agg({
        'invalid_flags_replaced': ['count', 'sum', 'mean']
    })
    print(year_summary)

In [4]:
# use

change_log_df = process_all()

#save log
save_log(change_log_df)

Starting invalid flag replacement process...
Processing year folder: 2023measurements
processed 10 files, replaced 289 invalid flags
processed 20 files, replaced 432 invalid flags
processed 30 files, replaced 676 invalid flags
processed 40 files, replaced 839 invalid flags
processed 50 files, replaced 1038 invalid flags
processed 60 files, replaced 1262 invalid flags
processed 70 files, replaced 1469 invalid flags
processed 80 files, replaced 1771 invalid flags
processed 90 files, replaced 1951 invalid flags
processed 100 files, replaced 2663 invalid flags
processed 110 files, replaced 2871 invalid flags
processed 120 files, replaced 3046 invalid flags
processed 130 files, replaced 3139 invalid flags
processed 140 files, replaced 3376 invalid flags
processed 150 files, replaced 3487 invalid flags
processed 160 files, replaced 3783 invalid flags
processed 440 files, replaced 11183 invalid flags
processed 450 files, replaced 11663 invalid flags
processed 460 files, replaced 12181 invalid

#### Data Flags:
Invalid data flags were identified in 3,160 files across three years 2023-2025, representing 222,167 individual measurements approx 3-4% of total data. The majority were -99 flags indicating missing  measurements during sensor calibration or maintenance periods. Data quality improves from 2023 to 2025, with average invalid flags per file decreasing from 70.8 to 39.0.
 - overrall worst year 2024
 - overall best year least flag: 2025.

## 5) NaN replacement validation function


In [14]:
def validate_file (opt_file, optimised_path, processed_path, year_folder):
    """
    Validate a single DEFRA optimased file for invalid flag values (-99, -1).
    Replace them with NaN and log the changes.
    """
    try: 
        rel_path = opt_file.relative_to(optimised_path / year_folder)
        # FIX: Use correct processed_path
        proc_file = processed_path / year_folder / rel_path
        if not proc_file.exists():
            # Always return a tuple for error
            return 'error', f'original file not found: {proc_file}'
        df_processed = pd.read_csv(proc_file)
        df_optimised = pd.read_csv(opt_file)
        if 'value' not in df_processed.columns or 'value' not in df_optimised.columns:
            # Always return a tuple for error
            return 'error', f'no value column: {opt_file.name}'
        # count invalid flags in processed file
        inv_proc = df_processed['value'].isin([-99.0, -1.0]).sum()

        # count invalid flags in optimised file
        inv_opt = df_optimised['value'].isin([-99.0, -1.0]).sum()

        # count NaN values
        nan_proc = df_processed['value'].isna().sum()
        nan_opt = df_optimised['value'].isna().sum()

        #expected NaN increase should equal invalid flags rm
        nan_increase = nan_opt - nan_proc
        # validate
        if inv_opt == 0 and nan_increase == inv_proc:
            status = 'valid'
        else:
            status = 'invalid'
        # Always return a tuple for all cases
        return status, opt_file.name
    except Exception as e:
        return 'error', opt_file.name
    
def validate_year_folder(year_folder, sample_size=20):
    print(f"\nValidating {year_folder}")
    opt_dir = optimased_path / year_folder
    if not opt_dir.exists():
        print("No directory.")
        return
    files = list(opt_dir.rglob("*.csv"))
    if not files:
        print("No files.")
        return
    if len(files) > sample_size:
        files = np.random.choice(files, sample_size, replace=False)
    counts = {'valid':0, 'invalid':0, 'error':0}
    for f in files:
        status, name = validate_file(f, optimased_path, processed_path, year_folder)
        counts[status] = counts.get(status, 0) + 1
        print(f"{status.upper()}: {name}")
    print("Summary:", counts)

def validate_change_log():
    log_file = logs_path / "NaN_values_record.csv"
    if not log_file.exists():
        print("No log file.")
        return
    df = pd.read_csv(log_file)
    print(f"Files: {len(df)}, Flags replaced: {df['invalid_flags_replaced'].sum()}")

for year in ["2023measurements", "2024measurements", "2025measurements"]:
    validate_year_folder(year, sample_size=20)
validate_change_log()
print("Done.")

def validate_change_log():
    log_file = logs_path / "NaN_values_record.csv"
    if not log_file.exists():
        print("No log file.")
        return
    df = pd.read_csv(log_file)
    print(f"Files: {len(df)}, Flags replaced: {df['invalid_flags_replaced'].sum()}")
    print(df.groupby('year_folder')['invalid_flags_replaced'].sum())
    print(f"-99: {df['flag_minus_99'].sum()} | -1: {df['flag_minus_1'].sum()}")
    print(f"Mean %: {df['percentage_invalid'].mean():.2f} | Max %: {df['percentage_invalid'].max():.2f}")
    print(df.nlargest(5, 'percentage_invalid')[['station','file','percentage_invalid']])



Validating 2023measurements
VALID: NO__2023_03.csv
VALID: NO__2023_12.csv
VALID: n-Heptane__2023_08.csv
VALID: cis-2-Butene__2023_03.csv
VALID: CO__2023_08.csv
VALID: NO2__2023_09.csv
VALID: NO2__2023_06.csv
VALID: NO2__2023_01.csv
VALID: Toluene__2023_04.csv
VALID: NO__2023_09.csv
VALID: NO__2023_01.csv
VALID: PM2.5__2023_08.csv
VALID: NO__2023_11.csv
VALID: 1-Butene__2023_03.csv
VALID: 1-Butene__2023_12.csv
VALID: PM10__2023_04.csv
VALID: O3__2023_07.csv
VALID: i-Butane__2023_02.csv
VALID: NO__2023_06.csv
VALID: trans-2-Butene__2023_01.csv
Summary: {'valid': 20, 'invalid': 0, 'error': 0}

Validating 2024measurements
VALID: Ethene__2024_11.csv
VALID: Toluene__2024_06.csv
VALID: 1,3-Butadiene__2024_11.csv
VALID: n-Hexane__2024_10.csv
VALID: NO2__2024_03.csv
VALID: Ethyne__2024_05.csv
VALID: NO__2024_02.csv
VALID: PM10__2024_09.csv
VALID: NO2__2024_08.csv
VALID: NO2__2024_01.csv
VALID: NO__2024_03.csv
VALID: NOx__2024_11.csv
VALID: NO__2024_10.csv
VALID: O3__2024_12.csv
VALID: SO2__202

In [15]:
# validate each year folder
for year in ["2023measurements", "2024measurements", "2025measurements"]:
    validate_year_folder(year, sample_size=10)

# validate change log
validate_change_log()


Validating 2023measurements
VALID: PM10__2023_09.csv
VALID: trans-2-Pentene__2023_12.csv
VALID: n-Heptane__2023_05.csv
VALID: n-Hexane__2023_04.csv
VALID: PM2.5__2023_03.csv
VALID: Ethyne__2023_02.csv
VALID: Ethyne__2023_07.csv
VALID: NO2__2023_07.csv
VALID: NO__2023_04.csv
VALID: Propane__2023_06.csv
Summary: {'valid': 10, 'invalid': 0, 'error': 0}

Validating 2024measurements
VALID: NO__2024_04.csv
VALID: SO2__2024_06.csv
VALID: NO__2024_05.csv
VALID: NO2__2024_01.csv
VALID: NOx__2024_05.csv
VALID: NO__2024_09.csv
VALID: NO__2024_04.csv
VALID: O3__2024_02.csv
VALID: NOx__2024_07.csv
VALID: 1,2,3-TMB__2024_03.csv
Summary: {'valid': 10, 'invalid': 0, 'error': 0}

Validating 2025measurements
VALID: SO2__2025_10.csv
VALID: O3__2025_03.csv
VALID: PM2.5__2025_10.csv
VALID: NO2__2025_03.csv
VALID: NO__2025_09.csv
VALID: PM2.5__2025_10.csv
VALID: NOx__2025_01.csv
VALID: PM10__2025_11.csv
VALID: i-Octane__2025_05.csv
VALID: Toluene__2025_08.csv
Summary: {'valid': 10, 'invalid': 0, 'error': 0

 The validation summary outputs are correct and complete:

    - All sampled files are marked as VALID for each year.
    - The summary counts show only valid files, with no invalid or error cases.
    - The log summary matches expectations: 3160 files, 222,167 flags replaced, and a breakdown by year.
    - The additional summary (mean/max percentage, top files with highest missing) is also working as intended.
    - NaN replacement and validation works.