# 🔬 Beijing Air Quality
## 📘 Notebook 01 – Data Extraction

| Field         | Description                                      |
|:--------------|:-------------------------------------------------|
| Author:       | Robert Steven Elliott                            |
| Course:       | Code Institute – Data Analytics with AI Bootcamp |
| Project Type: | Capstone                                         |
| Date:         | December 2025                                    |

## Objectives

- Load all 12 raw Beijing air-quality CSV files.
- Verify file structure, column consistency, and encoding.
- Concatenate the raw station datasets into a single combined dataframe.
- Save the combined dataset in the `data/combined/` directory for downstream cleaning and feature engineering.

## Inputs

- Raw dataset folder: `data/raw/`
    - Contains 12 station CSV files (e.g., `aotizhongxin.csv`, `changping.csv`)
- Expected columns (as defined in metadata):
    - PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM, station, year, month, day, hour
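
One way to compare each file against the expected columns is to parse only its header row. The sketch below is not part of the original notebook: `EXPECTED_COLUMNS`, `check_columns`, and `check_folder` are hypothetical names, and UTF-8 is an assumed encoding. Extra columns are reported rather than rejected, because the combined shape printed at the end of this notebook has 18 columns against the 17 listed here.

```python
from pathlib import Path

import pandas as pd

# Column names copied from the "Expected columns" list above.
EXPECTED_COLUMNS = [
    "PM2.5", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP",
    "RAIN", "wd", "WSPM", "station", "year", "month", "day", "hour",
]


def check_columns(csv_path, expected=EXPECTED_COLUMNS, encoding="utf-8"):
    """Return (missing, extra) column names for one CSV file.

    Hypothetical helper. Only the header row is parsed (nrows=0), so this
    checks column names and says nothing about the data rows.
    """
    header = pd.read_csv(csv_path, nrows=0, encoding=encoding)
    found = set(header.columns)
    missing = sorted(set(expected) - found)
    extra = sorted(found - set(expected))
    return missing, extra


def check_folder(raw_dir):
    """Map each CSV file name in raw_dir to its (missing, extra) columns."""
    return {p.name: check_columns(p) for p in sorted(Path(raw_dir).glob("*.csv"))}
```

A file that matches the list exactly yields `([], [])`; anything else can be logged here and left for the cleaning notebook to resolve.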

## Outputs

- A single concatenated dataframe containing all raw station data.
- Exported combined CSV: `data/combined/combined_stations.csv`
- Basic shape summary and verification logs.

## Additional Comments
- This notebook performs no cleaning; it preserves the raw data to ensure full provenance.
- Any inconsistencies identified here (missing timestamps, unexpected dtypes, column mismatches) will be addressed in Notebook 02 – Data Cleaning.
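
As an illustration of how one such inconsistency could be logged without altering the data, the sketch below counts missing hourly timestamps per station. It is a hypothetical helper (`count_missing_hours` does not appear in the original notebook). It assumes the `year`, `month`, `day`, `hour`, and `station` columns listed under Inputs are present, and that each station should have one row per hour between its first and last timestamp.

```python
import pandas as pd


def count_missing_hours(df):
    """Return a Series with the number of absent hourly timestamps per station.

    Read-only: timestamps are assembled into a new Series, so the input
    dataframe is left untouched.
    """
    # pd.to_datetime can assemble datetimes from year/month/day/hour columns.
    stamps = pd.to_datetime(df[["year", "month", "day", "hour"]])
    result = {}
    for station, group in stamps.groupby(df["station"]):
        # Every hour between the first and last observation for this station.
        expected = pd.date_range(group.min(), group.max(), freq=pd.Timedelta(hours=1))
        result[station] = len(expected.difference(pd.DatetimeIndex(group)))
    return pd.Series(result, name="missing_hours", dtype="int64")
```

The result is only a report; filling or dropping the gaps remains a cleaning decision.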

## Citation  
This project uses data from:

**Chen, Song (2017). _Beijing Multi-Site Air Quality_. UCI Machine Learning Repository.**  
DOI: https://doi.org/10.24432/C5RK5G  
Mirrored on Kaggle by Manu Siddhartha (CC BY 4.0 Licence).

---

## 🔍 Overview

This notebook performs the first step of the ETL pipeline: **extracting** the raw Beijing air-quality data.
The goal is to load all 12 raw station CSV files, verify their structure, and combine them into a single
dataset that can be used for cleaning and preprocessing in later notebooks.

This step focuses only on *data extraction*, not data cleaning.

## Import Required Libraries

In this section we import all necessary Python libraries:

- `pathlib` – handles directory paths in a platform-independent way  
- `os` – used to extract file names  
- `glob` – used to search for all CSV files in the raw dataset folder  
- `pandas` – the main library for loading and manipulating dataframes  

These tools together allow us to dynamically load each raw station file.

In [1]:
from pathlib import Path # Path is needed to handle file paths
import os # os is needed for operating system dependent functionality
import glob # glob is needed to find files matching a pattern
import pandas as pd # pandas is needed for data manipulation

## Set Up Project Paths

We define the project root and confirm that the notebook is pointing to the correct folder.

Because the root is derived from the working directory rather than hard-coded, the notebook stays portable across machines, provided it is launched from the `notebooks/` directory.

In [2]:
project_root = Path.cwd().parent # Assumes the notebook is run from the 'notebooks' directory
print(f"Project root : {project_root}") # Print the project root directory

current_dir = project_root / "notebooks" # Path to the notebooks folder (this does not change the working directory)
print(f"Current working directory : {current_dir}") # Print the notebooks directory path


Project root : /home/robert/Projects/beijing-air-quality
Current working directory : /home/robert/Projects/beijing-air-quality/notebooks
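
`Path.cwd().parent` only gives the right answer when the notebook is launched from `notebooks/`. A more tolerant option is to walk upward until a marker directory is found. The sketch below is an alternative under stated assumptions, not the notebook's method: `find_project_root` is a hypothetical helper, and `data/raw` is used as the marker only because it is the input folder named above.

```python
from pathlib import Path


def find_project_root(start=None, marker=("data", "raw")):
    """Walk up from `start` until a directory containing `marker` is found."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if candidate.joinpath(*marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {Path(*marker)}")
```

With this in place, `project_root = find_project_root()` would resolve to the same folder whether the notebook is started from the project root or from a subdirectory.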


## Load Raw Station CSV Files

The raw dataset contains **12 separate files**, one for each monitoring station:

- We use `glob` to find all `.csv` files in `data/raw/`.
- Each file is read using `pandas.read_csv()`.
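- The `station` column is set from the file name to preserve provenance. Because `station` is also among the expected columns, this overwrites any value already present in the file.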
- All station dataframes are collected into a list for later combination.

This maintains traceability and aligns with the Capstone data-governance requirements.


In [3]:
dfs = [] # List to hold individual dataframes
files = sorted(glob.glob(str(project_root / "data" / "raw" / "*.csv"))) # Get all CSV files in the raw data directory, sorted for a reproducible order

for file in files:
    df = pd.read_csv(file) # Read each CSV file
    station = os.path.basename(file).split(".")[0] # Use the file name without its extension, e.g. 'aotizhongxin' from 'aotizhongxin.csv'
    df["station"] = station # Set the station column from the file name
    dfs.append(df) # Collect all dataframes

## Combine All Station Files

Once all 12 files are loaded individually, they are concatenated into a single dataframe.


In [4]:

combined_df = pd.concat(dfs, ignore_index=True) # Stack all station dataframes and rebuild a continuous row index
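
A quick sanity check after concatenation is that no rows were lost and that every station is represented. The following is a minimal sketch; `summarise_concat` is a hypothetical helper, and it assumes each input dataframe already carries the `station` column set during loading.

```python
import pandas as pd


def summarise_concat(frames):
    """Concatenate frames and return (combined, rows_per_station).

    Raises ValueError if the frames do not all share the same columns,
    since pd.concat would otherwise fill the gaps with NaN without warning.
    """
    if not frames:
        raise ValueError("No dataframes to combine")
    first_cols = list(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != first_cols:
            raise ValueError(f"Frame {i} has different columns: {list(frame.columns)}")
    combined = pd.concat(frames, ignore_index=True)
    # The combined row count must equal the sum of the input row counts.
    assert len(combined) == sum(len(f) for f in frames)
    rows_per_station = combined["station"].value_counts().sort_index()
    return combined, rows_per_station
```

For reference, the 420768 rows reported at the end of this notebook would correspond to 35064 rows per station if all 12 stations cover the same period.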

## Save the Combined Dataset

The combined dataframe is saved to `data/combined/combined_stations.csv`.

This file will be used in:

- Notebook 02 – Data Cleaning

In [5]:

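output_path = project_root / "data" / "combined" / "combined_stations.csv" # Destination for the combined CSV
output_path.parent.mkdir(parents=True, exist_ok=True) # Create data/combined/ if it does not exist yet
combined_df.to_csv(output_path, index=False) # Save the combined dataframe
print("Combined dataset shape:", combined_df.shape) # Print the shape of the combined dataset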

Combined dataset shape: (420768, 18)
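
To confirm the export, the file can be read back and compared with the in-memory dataframe. This is a minimal sketch under stated assumptions: `save_and_verify` is a hypothetical helper, and only the shape and column names are compared, because dtypes and float formatting may differ after a CSV round trip.

```python
from pathlib import Path

import pandas as pd


def save_and_verify(df, out_path):
    """Write df to CSV, read it back, and check shape and column names."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path, low_memory=False)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape changed: {df.shape} -> {reloaded.shape}")
    if list(reloaded.columns) != [str(c) for c in df.columns]:
        raise ValueError("Column names changed during the round trip")
    return reloaded.shape
```

Called as `save_and_verify(combined_df, project_root / "data" / "combined" / "combined_stations.csv")`, it would return the shape of the file as stored on disk.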
