# LAQN Data Analysis and Cleaning

## 1. Introduction
This notebook performs exploratory data analysis on the LAQN (London Air Quality Network) dataset for research. The goal is to understand the data structure, identify data quality issues, and prepare a cleaned dataset for further analysis.

## 2. Setup and Import Libraries

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pathlib
import glob
import sys
import modulefinder as mdfind

# I need to set the path of the datasets here before loading them.

## 3. Load the Dataset : sites_species_london.csv

In [20]:
# ensure notebook runs from project root
os.chdir("/Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels")
print("cwd now:", os.getcwd())

# Load the dataset
df = pd.read_csv("data/laqn/sites_species_london.csv", encoding="utf-8")
print("Loaded:", df.shape)

# print statements to verify shape of the df
print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")


cwd now: /Users/burdzhuchaglayan/Desktop/data science projects/air-pollution-levels
Loaded: (647, 18)
Shape: 647 rows, 18 columns
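
The `os.chdir` call above depends on an absolute path that exists on one machine only. A more portable alternative might look like the sketch below, which walks upwards from the current directory until it finds the project's `data` folder. The helper name `find_project_root` and the choice of `data` as the marker directory are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    `find_project_root` and the `marker` default are hypothetical names,
    introduced only for this sketch.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Possible usage in the notebook (assumes it is launched from inside the project):
# root = find_project_root(Path.cwd())
# df = pd.read_csv(root / "data" / "laqn" / "sites_species_london.csv", encoding="utf-8")
```

With this approach the notebook does not need to change the working directory at all, because every path is built from the detected root.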


## 4. Understanding the Data Structure

### 4.1 First Look at the Data:
- Shape: 647 rows, 18 columns

In [27]:
import io

# describe the dataset
print("Description of sites_species_london.csv:")
print(df.describe(include='all'))

# to get detailed information about the dataset.
def info():
    buffer = io.StringIO()
    df.info(buf=buffer)
    s = buffer.getvalue()
    print(s)

print("Information about the dataset:")
info()

# print first few rows of the dataframe
print(df.head(5))

Description of sites_species_london.csv:
        @LocalAuthorityCode @LocalAuthorityName @SiteCode  \
count            647.000000                 647       647   
unique                  NaN                  33       251   
top                     NaN         Westminster       WM0   
freq                    NaN                  55         6   
mean              18.486862                 NaN       NaN   
std               10.099171                 NaN       NaN   
min                1.000000                 NaN       NaN   
25%                9.000000                 NaN       NaN   
50%               19.000000                 NaN       NaN   
75%               28.000000                 NaN       NaN   
max               33.000000                 NaN       NaN   

                   @SiteName @SiteType          @DateClosed  \
count                    647       647                  395   
unique                   249         6                  125   
top     Hounslow - Brentford  Roadsid

### Findings

Based on the data structure exploration above, here are the key findings:

**Dataset Composition:**
- 647 total records representing site-species combinations
- 251 unique monitoring sites across London
- 6 different pollutant species measured
- 33 local authority areas covered
- All data managed by King's College London

**Missing Data Pattern:**
- `@DateClosed`: 252 missing values (39%) - represents currently active sites
- `@DateMeasurementFinished`: 222 missing values (34%) - represents ongoing measurements
- `@LatitudeWGS84`: 30 missing values (4.6%) - requires investigation
- `@LongitudeWGS84`: 30 missing values (4.6%) - requires investigation
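
The missing-value counts and percentages listed above can be reproduced with a short summary table. The sketch below runs on a tiny made-up frame called `sample` (an assumption for illustration, not real LAQN rows); on the real data the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df; the values are invented for illustration only.
sample = pd.DataFrame({
    "@SiteCode": ["AA1", "AA1", "BB2", "CC3"],
    "@DateClosed": [None, None, "2010-05-01 00:00:00", None],
    "@LatitudeWGS84": [51.5, 51.5, np.nan, 51.6],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
})
print(missing)

# Site codes whose latitude is missing, as a starting point for investigation.
sites_without_coords = sample.loc[
    sample["@LatitudeWGS84"].isnull(), "@SiteCode"
].unique()
print(sites_without_coords)
```

Listing the affected site codes makes it easier to decide whether the coordinates can be filled in from another source or whether those sites should be excluded from spatial analysis.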

**Data Type Issues:**
- Date columns stored as strings (object type) instead of datetime
- Need conversion for proper temporal analysis
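
A minimal sketch of the conversion, assuming the date strings look like `2010-05-01 00:00:00` (the exact format in the CSV is not shown in the output above, so this is an assumption). The frame `sample` is invented for illustration; on the real data the loop would run over the date columns of `df`.

```python
import pandas as pd

# Made-up rows; the date format is an assumption about the CSV.
sample = pd.DataFrame({
    "@DateClosed": ["2010-05-01 00:00:00", None, ""],
    "@DateMeasurementFinished": ["2009-12-31 00:00:00", None, "not a date"],
})

date_cols = ["@DateClosed", "@DateMeasurementFinished"]
for col in date_cols:
    # errors="coerce" turns unparseable or empty strings into NaT
    # instead of raising, so missing dates stay missing.
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
```

After conversion, the `.dt` accessor is available on these columns (for example `sample["@DateClosed"].dt.year`), and null checks such as `isnull()` keep working because `NaT` counts as missing.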

**Site Distribution:**
- Most common site type: Roadside (316 site-species records, 49%)
- Most monitored authority: Westminster (55 site-species combinations)
- Most monitored pollutant: NO2 (Nitrogen Dioxide), with 221 site-species records

**Geographic & Temporal Coverage:**
- Latitude range: 51.31°N to 51.67°N (Greater London)
- Longitude range: 0.46°W to 0.23°E
- Temporal span: From 1996 to 2025 (nearly 30 years of data)
- Peak measurement start year: 2008

## 5. Checking Currently Available Data

### 5.1 Active Measurements Analysis
Goal: Identify sites with ongoing species measurements (DateMeasurementFinished is null/missing)

#### Active Sites and Their Species
- Breakdown of each site and what species they're currently measuring.


In [28]:
actv_measurements = df[df['@DateMeasurementFinished'].isnull()]

print(f"Total active measurements: {len(actv_measurements)}")
print(f"Unique sites with active measurements: {actv_measurements['@SiteCode'].nunique()}")
print("\n" + "="*80)

Total active measurements: 222
Unique sites with active measurements: 84



#### Takeaways:
- Total active measurements: 222
- Unique sites with active measurements: 84

In [29]:
# group by site to see what species each site is actively measuring.
actv_by_site = actv_measurements.groupby('@SiteCode').agg({
    '@SiteName': 'first',
    '@SiteType': 'first',
    '@LocalAuthorityName': 'first',
    '@SpeciesCode': lambda x: ', '.join(sorted(x)),
    '@SpeciesDescription': 'count'
}).rename(columns={'@SpeciesDescription': 'Species_Count'})

actv_by_site = actv_by_site.sort_values(by='Species_Count', ascending=False)

print("sites with active measurements and their species: ")
print("="*80)
display(actv_by_site.head(10))


# Summary statistics of active measurements
print("\n" + "="*80)
print("Summary of active sites:")
print(f"Total active sites: {len(actv_by_site)}")
print(f"Average species per active site: {actv_by_site['Species_Count'].mean():.2f}")
print(f"Max species at one site: {actv_by_site['Species_Count'].max()}")
print(f"Min species at one site: {actv_by_site['Species_Count'].min()}")

sites with active measurements and their species: 


| @SiteCode | @SiteName | @SiteType | @LocalAuthorityName | @SpeciesCode | Species_Count |
|---|---|---|---|---|---|
| BL0 | Camden - Bloomsbury | Urban Background | Camden | CO, NO2, O3, PM10, PM25, SO2 | 6 |
| GR4 | Greenwich - Eltham | Suburban | Greenwich | NO2, O3, PM10, PM25, SO2 | 5 |
| TH5 | Tower Hamlets - Victoria Park | Urban Background | Tower Hamlets | NO2, PM10, PM25, SO2 | 4 |
| HK6 | Hackney - Old Street | Roadside | Hackney | NO2, O3, PM10, PM25 | 4 |
| RI2 | Richmond Upon Thames - Barnes Wetlands | Suburban | Richmond | NO2, O3, PM10, PM25 | 4 |
| GR9 | Greenwich - Westhorne Avenue | Roadside | Greenwich | NO2, O3, PM10, PM25 | 4 |
| CE2 | Waterloo Place (The Crown Estate) | Roadside | Westminster | NO2, O3, PM10, PM25 | 4 |
| GR8 | Greenwich - Woolwich Flyover | Roadside | Greenwich | NO2, PM10, PM10, PM25 | 4 |
| KC1 | Kensington and Chelsea - North Ken | Urban Background | Kensington and Chelsea | CO, NO2, O3, SO2 | 4 |
| HP1 | Lewisham - Honor Oak Park | Urban Background | Lewisham | NO2, O3, PM10, PM25 | 4 |



Summary of active sites:
Total active sites: 84
Average species per active site: 2.64
Max species at one site: 6
Min species at one site: 1


##### Takeaways:

- Total active sites: 84
- Average species per active site: 2.64
- Max species at one site: 6
- Min species at one site: 1
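
One detail in the table above deserves a check: the GR8 row lists `PM10` twice, so `Species_Count` counts rows rather than distinct species for that site. The output does not show why the duplicate exists. The sketch below, on a made-up frame `sample` that mimics the GR8 pattern (an assumption for illustration), shows one way to surface such duplicates and to count distinct species instead.

```python
import pandas as pd

# Invented rows that reproduce the repeated PM10 entry seen for GR8.
sample = pd.DataFrame({
    "@SiteCode": ["GR8", "GR8", "GR8", "GR8", "BL0"],
    "@SpeciesCode": ["NO2", "PM10", "PM10", "PM25", "NO2"],
})

# Rows where the same species appears more than once at the same site.
dupes = sample[sample.duplicated(subset=["@SiteCode", "@SpeciesCode"], keep=False)]
print(dupes)

# Per-site summary based on distinct species rather than row counts.
per_site = sample.groupby("@SiteCode")["@SpeciesCode"].agg(
    Species=lambda s: ", ".join(sorted(set(s))),
    Species_Count="nunique",
)
print(per_site)
```

Whether duplicates should be dropped or kept depends on what they represent in the source data, so this only flags them and leaves that decision open.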

### 5.2 Species Distribution in Active Measurements
- Number of active measurements per species across all sites.

In [None]:
#count active measurements by species
actv_species = actv_measurements.groupby(['@SpeciesCode', '@SpeciesDescription']).size().reset_index(name='Active_Measurement_Count')
actv_species = actv_species.sort_values(by='Active_Measurement_Count', ascending=False)
print("Active measurements by species:")
print("="*80)
display(actv_species.head(10))

print("\n" + "="*80)
print(f"Total unique species being actively monitored: {actv_species.shape[0]}")

Active measurements by species:


|   | @SpeciesCode | @SpeciesDescription | Active_Measurement_Count |
|---|---|---|---|
| 1 | NO2 | Nitrogen Dioxide | 80 |
| 3 | PM10 | PM10 Particulate | 58 |
| 4 | PM25 | PM2.5 Particulate | 53 |
| 2 | O3 | Ozone | 20 |
| 5 | SO2 | Sulphur Dioxide | 8 |
| 0 | CO | Carbon Monoxide | 3 |



Total unique species being actively monitored: 6


#### Takeaways:
- Six pollutants are actively monitored: NO2 (80 active measurements), PM10 (58), PM2.5 (53), O3 (20), SO2 (8), and CO (3).
- These counts sum to the 222 active measurements found in section 5.1.

### 5.3 Active Sites Summary
- Sites that are currently operational (DateClosed is null):

In [32]:
# Check sites that are still open (DateClosed is null)
open_sites = df[df['@DateClosed'].isnull()]

print(f"Total site-species combinations at open sites: {len(open_sites)}")
print(f"Unique sites that are still open: {open_sites['@SiteCode'].nunique()}")

# Cross-check: open sites vs active measurements
print("\n" + "="*80)
print("Cross-verification:")
print(f"Open sites (DateClosed is null): {open_sites['@SiteCode'].nunique()}")
print(f"Sites with active measurements (DateMeasurementFinished is null): {actv_measurements['@SiteCode'].nunique()}")

# Find sites that are open but have no active measurements
open_site_codes = set(open_sites['@SiteCode'].unique())
actv_site_codes = set(actv_measurements['@SiteCode'].unique())
open_but_inactv = open_site_codes - actv_site_codes

if open_but_inactv:
    print(f"\nSites that are open but have no active measurements: {len(open_but_inactv)}")
    print("These sites might be operational but the specific species measurements have ended")
else:
    print("\nAll open sites have at least one active measurement")

Total site-species combinations at open sites: 252
Unique sites that are still open: 84

Cross-verification:
Open sites (DateClosed is null): 84
Sites with active measurements (DateMeasurementFinished is null): 84

All open sites have at least one active measurement


#### Findings:
- Total site-species combinations at open sites: 252
- Unique sites that are still open: 84

#### Cross-verification:
- Open sites (DateClosed is null): 84
- Sites with active measurements (DateMeasurementFinished is null): 84

- All open sites have at least one active measurement
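
The counts above also show a gap: 252 rows belong to open sites, but only 222 measurements are active. If every active measurement sits at an open site (an assumption the check above does not test), about 30 rows are species whose measurement ended while the site stayed open. The sketch below shows one way to list those rows, using a made-up frame `sample` with invented values.

```python
import pandas as pd

# Invented rows; None marks a missing date, as in the real columns.
sample = pd.DataFrame({
    "@SiteCode": ["AA1", "AA1", "BB2", "CC3"],
    "@SpeciesCode": ["NO2", "O3", "NO2", "PM10"],
    "@DateClosed": [None, None, "2010-05-01 00:00:00", None],
    "@DateMeasurementFinished": [None, "2015-01-01 00:00:00", "2010-05-01 00:00:00", None],
})

site_open = sample["@DateClosed"].isnull()
measuring = sample["@DateMeasurementFinished"].isnull()

# 2x2 table of site status against measurement status.
status = pd.crosstab(
    site_open.rename("site_open"),
    measuring.rename("measurement_active"),
)
print(status)

# Species that are no longer measured although the site is still open.
ended_at_open_sites = sample[site_open & ~measuring]
print(ended_at_open_sites[["@SiteCode", "@SpeciesCode"]])
```

The crosstab also exposes the opposite case, an active measurement at a closed site, which would point to an inconsistency in the metadata.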