# 🔬 Beijing Air Quality
## 📘 Notebook 01 – Data Extraction

| Field         | Description                                        |
|:--------------|:---------------------------------------------------|
| Author:       | Robert Steven Elliott                              |
| Course:       | Code Institute – Data Analytics with AI Bootcamp   |
| Project Type: | Capstone                                           |
| Date:         | December 2025                                      |

This project complies with the CC BY 4.0 licence by including proper attribution.

## Objectives

- Load all 12 raw Beijing air-quality CSV files.
- Verify file structure, column consistency, and encoding.
- Concatenate the raw station datasets into a single combined dataframe.
- Save the combined dataset in the `data/combined/` directory for downstream cleaning and feature engineering.

## Inputs

- Raw dataset folder: `data/raw/`
    - Contains 12 station CSV files (e.g., `aotizhongxin.csv`, `changping.csv`)
- Expected columns (as defined in metadata):
    - No, PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM, station, year, month, day, hour

## Outputs

- A single concatenated dataframe containing all raw station data.
- Exported combined CSV: `data/combined/combined_stations.csv`
- Basic shape summary and verification logs.

## Additional Comments

- This notebook performs no cleaning; it preserves the raw data to ensure full provenance.
- Any inconsistencies identified here (missing timestamps, unexpected dtypes, column mismatches) will be addressed in Notebook 02 – Data Cleaning.
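
As a hedged illustration of how missing timestamps might be flagged at this stage, the sketch below rebuilds hourly timestamps from the `year`, `month`, `day`, and `hour` columns and lists any gaps. The helper name `find_missing_hours` and the toy data are assumptions made for illustration; they are not part of this notebook's code.

```python
import pandas as pd


def find_missing_hours(df):
    """Return the hourly timestamps absent between the first and last reading."""
    # pandas can assemble datetimes from columns named year/month/day/hour
    timestamps = pd.to_datetime(df[["year", "month", "day", "hour"]])
    expected = pd.date_range(
        timestamps.min(), timestamps.max(), freq=pd.Timedelta(hours=1)
    )
    return expected.difference(pd.DatetimeIndex(timestamps))


# Toy single-station frame: hour 2 is deliberately missing
toy = pd.DataFrame(
    {
        "year": [2013, 2013, 2013],
        "month": [3, 3, 3],
        "day": [1, 1, 1],
        "hour": [0, 1, 3],
    }
)
missing = find_missing_hours(toy)
print(missing)
```

On the combined data, a check like this would need to run once per station (for example inside a `groupby("station")`), because every station repeats the same hours.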

## Citation  
This project uses data from:

**Chen, Song (2017). _Beijing Multi-Site Air Quality_. UCI Machine Learning Repository.**  
Licensed under **CC BY 4.0**.  
DOI: https://doi.org/10.24432/C5RK5G  
Kaggle mirror by Manu Siddhartha.

---

## 🔍 Overview

This notebook performs the first step of the ETL pipeline: **extracting** the raw Beijing air-quality data.
The goal is to load all 12 raw station CSV files, verify their structure, and combine them into a single
dataset that can be used for cleaning and preprocessing in later notebooks.

This step focuses only on *data extraction*, not data cleaning.

## Import Required Libraries

The following libraries support file handling and data manipulation.


In [1]:
from pathlib import Path # Path is needed to handle file paths
import os # os is needed for operating system dependent functionality
import glob # glob is needed to find files matching a pattern
import pandas as pd # pandas is needed for data manipulation
import sys # sys is needed to manipulate Python runtime environment

## Set Up Project Paths

In [2]:
PROJECT_PATH = Path.cwd().parent # Assuming this script is in the 'notebooks' directory
sys.path.append(str(PROJECT_PATH)) # Add project root to sys.path
print(f"Project root : {PROJECT_PATH}") # Print the project root directory

RAW_FOLDER = PROJECT_PATH / "data" / "raw" # Define raw data folder path
COMBINED_FILEPATH = PROJECT_PATH / "data" / "combined" / "combined_stations.csv" # Define combined data file path (matches the documented output name)
RAW_METADATA_PATH = PROJECT_PATH / "data" / "raw" / "_metadata.yml" # Define metadata file path
COMBINED_METADATA_PATH = PROJECT_PATH / "data" / "combined" / "_metadata.yml" # Define combined metadata file path

COMBINED_FILEPATH.parent.mkdir(parents=True, exist_ok=True) # Create the combined data directory if it doesn't exist

Project root : /home/robert/Projects/beijing-air-quality


## Create _metadata.yml for the raw files

In [3]:
from utils.metadata_builder import MetadataBuilder

# List all .csv files in the raw folder
raw_files = sorted([f.name for f in RAW_FOLDER.glob("*.csv")])

builder = MetadataBuilder(
    dataset_path="data/raw/",
    dataset_name="Beijing Multi-Site Air Quality – Raw Stations",
    description=(
        "This folder contains the 12 raw CSV files from the Beijing Multi-Site Air Quality dataset. "
        "These files are preserved exactly as provided by the source (Kaggle/UCI) with no cleaning, "
        "renaming, or preprocessing applied. The raw data is kept intact to maintain full provenance."
    )
)

# Add source + licence (same for all downstream datasets)
builder.add_source_info()
builder.add_licence()

# Add file listing
builder.add_file_list(raw_files)

# Add column names (taken from one representative file)
sample_df = pd.read_csv(RAW_FOLDER / raw_files[0], nrows=5)
builder.add_columns(sample_df.columns)

# Add processing steps and notes
builder.add_step("Downloaded the raw datasets from Kaggle / UCI repository")
builder.add_step("Stored raw files without modification")
builder.add_step("Verified encoding and column structure")
builder.add_step("Validated presence of all 12 station files")

builder.metadata["notes"] = [
    "These files remain unmodified to maintain full provenance.",
    "No cleaning, renaming, or type conversion steps were applied."
]

# Write metadata
builder.write(RAW_METADATA_PATH)

📄 Metadata written to: /home/robert/Projects/beijing-air-quality/data/raw/_metadata.yml


## Initialise metadata for Combined dataset

In [4]:
builder = MetadataBuilder(
    dataset_path="data/combined/combined_stations.csv",
    dataset_name="Beijing Air Quality – Combined Dataset",
    description="A combined dataset merging all 12 raw station CSV files into one file."
)

builder.add_source_info()
builder.add_licence()
builder.add_creation_script("notebooks/01_data_extraction.ipynb")

## Load Raw Station CSV Files

The raw dataset contains **12 separate files**, one for each monitoring station:

- We use `glob` to find all `.csv` files in `data/raw/`.
- Each file is read using `pandas.read_csv()`.
- The `station` column is set from the file name (replacing any existing value) to preserve provenance.
- All station dataframes are collected into a list for later combination.

This maintains traceability and aligns with the Capstone data-governance requirements.


In [5]:
dfs = [] # List to hold individual dataframes
files = sorted(glob.glob(str(RAW_FOLDER / "*.csv"))) # Get all CSV files in the raw data directory, in a stable order

for file in files:
    df = pd.read_csv(file) # Read each CSV file
    station = os.path.basename(file).split(".")[0] # Use the file name without its extension (e.g. 'aotizhongxin') as the station name
    df["station"] = station # Add station column
    dfs.append(df) # Collect all dataframes

builder.add_step("Loaded 12 raw station CSV files") # Add step to metadata
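
The objectives list a check of file structure and column consistency. One way such a check might look is sketched below: it reads only the header row of each CSV and reports the files whose header differs from the first file's. The function names `read_header` and `find_header_mismatches` are hypothetical and do not come from this project. The sketch assumes UTF-8 and decodes only the start of each file, so it is not a full encoding check.

```python
import csv
from pathlib import Path


def read_header(csv_path, encoding="utf-8"):
    """Return the first row of a CSV file, decoded with the given encoding."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        return next(csv.reader(handle), [])


def find_header_mismatches(folder):
    """Return {file name: header} for CSVs whose header differs from the first file's."""
    paths = sorted(Path(folder).glob("*.csv"))
    if not paths:
        return {}
    reference = read_header(paths[0])
    mismatches = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            mismatches[path.name] = header
    return mismatches
```

Calling `find_header_mismatches(RAW_FOLDER)` would return an empty dictionary when every station file shares the same header.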

## Combine All Station Files

Once all 12 files are loaded individually, they are concatenated into a single dataframe.
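
To make the effect of concatenation concrete, here is a small sketch on toy data. The two frames and their values are invented for illustration. It shows that `pd.concat` stacks rows and that `ignore_index=True` replaces the per-file indices with one continuous index.

```python
import pandas as pd

# Two toy frames standing in for per-station dataframes (values are made up)
station_a = pd.DataFrame({"PM2.5": [4.0, 8.0], "station": ["A", "A"]})
station_b = pd.DataFrame({"PM2.5": [6.0], "station": ["B"]})

combined = pd.concat([station_a, station_b], ignore_index=True)

print(combined.shape)        # (3, 2): rows add up, columns are unchanged
print(list(combined.index))  # [0, 1, 2]: index is rebuilt, not repeated per file
```

Without `ignore_index=True`, each station's rows would keep their original index values, so index labels would repeat across stations.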


In [6]:

combined_df = pd.concat(dfs, ignore_index=True)
builder.add_step("Concatenated all stations into a single dataframe")

## Save the Combined Dataset

The combined dataframe is saved to `data/combined/combined_stations.csv`.

This file will be used in:

- Notebook 02 – Cleaning

In [7]:

combined_df.to_csv(COMBINED_FILEPATH, index=False) # Save the combined dataframe
print("Combined dataset shape:", combined_df.shape) # Print the shape of the combined dataset
builder.add_step(f"Saved combined dataset to {COMBINED_FILEPATH}") # Add step to metadata 

Combined dataset shape: (420768, 18)
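
The reported row count can be sanity-checked with a little arithmetic. The sketch below assumes every station has one row per hour from 1 March 2013 to 28 February 2017. That range is taken from the dataset's published description, not from anything verified in this notebook, so treat it as an assumption.

```python
from datetime import datetime

N_STATIONS = 12

# Assumed coverage per station (from the dataset description, not this notebook)
first_hour = datetime(2013, 3, 1, 0)
end_exclusive = datetime(2017, 3, 1, 0)  # one hour after 2017-02-28 23:00

hours_per_station = int((end_exclusive - first_hour).total_seconds() // 3600)
expected_rows = N_STATIONS * hours_per_station

print(hours_per_station)  # 35064
print(expected_rows)      # 420768
```

Under that assumption the expected total matches the 420768 rows reported above, which suggests no station file is missing whole rows.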


## Complete Metadata

In [8]:
builder.add_columns(combined_df.columns) # Add the columns of the combined dataframe
builder.add_record_count_from_df(combined_df) # Set record count from the combined dataframe
builder.add_record_stats(COMBINED_FILEPATH) # Add record statistics

builder.write(COMBINED_METADATA_PATH)

📄 Metadata written to: /home/robert/Projects/beijing-air-quality/data/combined/_metadata.yml


---
### AI Assistance Note
Some narrative text and minor formatting or wording improvements in this notebook were supported by AI-assisted tools (ChatGPT for documentation clarity, Copilot for small routine code suggestions, and Grammarly for proofreading). All analysis, code logic, feature engineering, modelling, and interpretations were independently created by the author.