# DATA PREPROCESSING
Data preprocessing is a key step in any data analysis or machine learning pipeline, and it is often interleaved with **Data Exploration**. It involves cleaning, transforming and organizing raw data so that it is accurate, consistent and ready for modeling. It has a big impact on model building:

- Clean and well-structured data allows models to learn meaningful patterns rather than noise.

- Properly processed data prevents misleading inputs, leading to more reliable predictions.

- Organized data makes it simpler to create useful inputs for the model, enhancing model performance.

- Organized data supports better Exploratory Data Analysis (EDA), making patterns and trends more interpretable.

**Steps in Data Preprocessing:** Some key steps in data preprocessing are: **Data Cleaning**, **Data Reduction**, **Feature Engineering** and **Data Splitting**.

## I. Data Cleaning
Data cleaning is the process of correcting errors and inconsistencies, normalizing values, and handling the missing, duplicated or irrelevant data that we discovered during **Data Exploration**. Clean data is essential for effective analysis, as it improves the quality of results and enhances the performance of data models. We separate data cleaning into **six key steps**:

- **Syntax Errors**: Correcting issues such as **typos, incorrect data types, and invalid characters**.

- **Format Data**: Standardizing data formats, including **dates, numeric values, text cases, and units of measurement**.

- **Irrelevant Data**: Removing data fields or records that do not contribute to the analytical objective.

- **Handling Missing Data & Impossible Values**: Addressing missing values or abnormal values through **removal, imputation, or estimation**.

- **Handling Duplicated Data**: Eliminating duplicate records.

- **Handling Outliers**: Managing the extreme values that are detected.

From the insights gained in **Data Understanding of the Raw Dataset**, we know that the collected data is relatively clean, with no duplicated records and no wrongly formatted values. Therefore, in this section we address only the remaining issues, using the steps that are necessary.

As in every step, we start by importing the libraries and loading the data with `pandas`.

**Import Libraries**

In [1]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append(os.path.abspath(os.path.join('..')))
from src import preprocessing as dp

**Load the Dataset**

In [2]:
path = "../data/raw/vietnam_air_quality.csv"
try:
    df = pd.read_csv(path)
    print("[SUCCESS]: Loading dataset successful")
except Exception as e:
    print(f"[ERROR]: Loading dataset fail: {e}")

[SUCCESS]: Loading dataset successful


### 1. Format Data
As we found when exploring the data types in **Data Exploration**, most columns already have the right data type; the exception is the `timestamp` column. In this section we therefore convert it to the correct type, `datetime`.

In [3]:
try:
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='raise')
    print("[SUCCESS] Formating data is successful")
except Exception as e:
    print(f"[FAIL] Formating data fail: {e}")

[SUCCESS] Formating data is successful
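The difference between the `errors` modes of `pd.to_datetime` can be seen on a toy series. This is only a sketch with made-up strings, not values from the dataset: `errors="raise"` (used above) stops on the first unparseable value, while `errors="coerce"` turns it into `NaT` so it can be counted and handled as missing.

```python
import pandas as pd

# Made-up strings for illustration; the last one is not a valid timestamp
raw = pd.Series(["2025-01-01 00:00:00", "2025-01-01 01:00:00", "not a date"])

# errors="coerce": unparseable entries become NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")
n_bad = int(parsed.isna().sum())

# errors="raise": the same input raises a ValueError
try:
    pd.to_datetime(raw, errors="raise")
    raised = False
except ValueError:
    raised = True

print(parsed.dtype, n_bad, raised)
```

In this notebook the conversion succeeded with `errors="raise"`, which means every `timestamp` string was parseable.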


### 2. Irrelevant Data
The goal of this step is to reduce memory usage and noise by removing columns that are not essential for analysis or model building.

Of the **20 columns** in our dataset, we drop `lat` and `lon`, because they were only used for mapping to create the **`city`** column.
 

In [4]:
try:
    df.drop(columns=["lat", "lon"], inplace=True)
    print("[SUCCESS] Irrelevant column is successfull")
except Exception as e:
    print(f"[ERROR] Irrelevant column is fail: {e}")

[SUCCESS] Irrelevant column is successfull


### 3. Handling Missing Data & Impossible Values

- As identified in the **Data Exploration** phase, **93 records** violated a physical constraint: the fine particulate concentration ($PM_{2.5}$) exceeded the $PM_{10}$ concentration, which is impossible because $PM_{10}$ includes the $PM_{2.5}$ fraction. We addressed these anomalies using a three-step imputation strategy:
    - **Masking:** The invalid $PM_{10}$ values in these specific rows were converted to NaN (treated as missing). 
    - **Interpolation:** We applied **Time-based Interpolation** to reconstruct the missing values. Crucially, this operation was grouped by city to ensure spatial independence and preserve local time-series trends. 
    - **Constraint Enforcement:** A final consistency check was applied to strictly enforce $PM_{10} \geq PM_{2.5}$, ensuring physical validity across the entire dataset.


In [5]:
mask_invalid = df[df["pm2_5"] > df["pm10"]].index
df.loc[mask_invalid, "pm10"] = np.nan

# Set timestamp as Index (Required for time interpolation)
df = df.set_index("timestamp")

# Interpolate
df["pm10"] = df.groupby("city")["pm10"].transform(
    lambda group: group.interpolate(method="time", limit=5)
)

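# Reset index
df = df.reset_index()

# Enforce PM10 >= PM2.5. max() skips NaN, so any PM10 gap that the
# interpolation limit left unfilled falls back to the PM2.5 value.
df["pm10"] = df[["pm10", "pm2_5"]].max(axis=1)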

- **Verification After the Fix**

In [6]:
# Check remaining PM2.5 > PM10 violations
remaining_invalid = df[df["pm2_5"] > df["pm10"]].index
print(f"Post-fix Verification: Found {len(remaining_invalid)} impossible rows (PM2.5 > PM10).")

Post-fix Verification: Found 0 impossible rows (PM2.5 > PM10).
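The three steps (masking, time interpolation per city, constraint enforcement) can be traced on a toy frame. The values below are made up for illustration; only the column names `timestamp`, `city`, `pm2_5` and `pm10` follow the notebook.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings for one city; rows 1 and 3 violate PM10 >= PM2.5
toy = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=6, freq=pd.Timedelta(hours=1)),
    "city": ["A"] * 6,
    "pm2_5": [10.0, 21.0, 12.0, 40.0, 14.0, 15.0],
    "pm10": [20.0, 18.0, 24.0, 30.0, 28.0, 30.0],
})

# 1. Masking: invalid PM10 readings become missing
toy.loc[toy["pm2_5"] > toy["pm10"], "pm10"] = np.nan

# 2. Interpolation: time-based, computed separately for each city
toy = toy.set_index("timestamp")
toy["pm10"] = toy.groupby("city")["pm10"].transform(
    lambda s: s.interpolate(method="time", limit=5)
)
toy = toy.reset_index()
interpolated = toy["pm10"].tolist()

# 3. Constraint enforcement: lift PM10 to PM2.5 wherever it is still lower
toy["pm10"] = toy[["pm10", "pm2_5"]].max(axis=1)
final = toy["pm10"].tolist()

print(interpolated)
print(final)
```

Row 1 is repaired by interpolation alone (22 ≥ 21). Row 3 interpolates to 26, which is still below its $PM_{2.5}$ of 40, so the final step raises it to 40. In such rows the enforced $PM_{10}$ equals $PM_{2.5}$, which is worth keeping in mind when reading ratio features later.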


### 4. Handling Outliers
Before handling outliers, we save the data processed in the previous steps so that it can be explored further. This processed data is what we use to characterize the distributions and, from them, to gain insight into outliers.

**Save Processed Data**

In [7]:
path = "../data/processed/processed_data.csv"
df.to_csv(path, index = False)

Based on the insights derived from the `data_exploration.ipynb` notebook and the physical characteristics of air quality data, we have decided **NOT to remove or alter statistical outliers** (extreme high values). The reasons are:

1. **Authentic Environmental Variability vs. Error**
   - **Insight:** As noted in the **Data Exploration** phase, the "outliers" (e.g., AQI > 150) represent periods of **severe air pollution events**.
   - **Reasoning:** In environmental science, extreme values are often "features," not "bugs." A value of `AQI = 146` (as seen in *Bac Ninh*) is a realistic occurrence during winter or temperature inversion periods. Removing these values would artificially smooth the data, causing it to deviate from reality.

2. **Preservation of Time-Series Continuity**
    - **Constraint:** The data is a continuous hourly time series.
    - **Reasoning:** Deep learning models (e.g., LSTM, GRU) and feature engineering techniques (e.g., Lag features $t-1$, Rolling Windows) require an unbroken temporal sequence. Deleting rows containing outliers would create **time gaps**, disrupting the temporal dependencies required for accurate forecasting.

3. **Multivariate Integrity**
   - **Logic:** Pollution levels are physically correlated with meteorological factors (e.g., high PM2.5 often correlates with low `wind_speed` or high `humidity`).
   - **Reasoning:** Removing a high PM2.5 data point while retaining its corresponding weather conditions destroys the physical cause-and-effect relationship. The model needs these extreme examples to learn how specific weather patterns trigger pollution spikes.


However, if the accuracy or efficiency of the model turns out to be poor in the **Modeling** step, we will consider returning to this step, **Handling Outliers**.
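If that fallback is ever needed, one option that keeps the hourly sequence intact is to flag extreme values instead of deleting rows. The sketch below assumes the common 1.5 × IQR rule and a hypothetical `is_outlier` column; neither is something this notebook applies, and the values are made up.

```python
import pandas as pd

# Made-up PM2.5 readings with one extreme value
demo = pd.DataFrame({"pm2_5": [10.0, 12.0, 11.0, 13.0, 12.0, 200.0, 11.0, 12.0]})

# IQR fences (assumed rule: 1.5 * IQR beyond the quartiles)
q1 = demo["pm2_5"].quantile(0.25)
q3 = demo["pm2_5"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag instead of drop: every row stays, so no time gaps are created
demo["is_outlier"] = ~demo["pm2_5"].between(lower, upper)

print(demo["is_outlier"].sum(), len(demo))
```

A flag like this lets a model or a later analysis treat extreme hours differently without breaking lag and rolling-window features.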


## II. Data Reduction
**Data Reduction** produces a condensed description of the original data that is **much smaller in quantity but keeps the quality** of the original data. After the **EDA** step, we remove three more columns from `processed_data`: `aqi`, `pollution_level` and `pollution_class`, which we analyzed in **Data Understanding about Processed Data**. These columns are no longer needed in the data used for modeling.

In [8]:
cols_to_drop = ["aqi", "pollution_level", "pollution_class"]
try:
    df.drop(columns=cols_to_drop, inplace=True)
    print("[SUCCESS] Irrelevant column is successfull")
except Exception as e:
    print(f"[ERROR] Irrelevant column is fail: {e}")

[SUCCESS] Irrelevant column is successfull


## III. Feature Engineering

**Feature Engineering** is the process of selecting, scaling, transforming, or creating new features from raw data. It helps turn messy real-world data into a form that the model can understand, thereby improving its predictive performance.

At the **EDA (Data Exploration)** step, we initially identified several potential features based on domain knowledge. We revisit and refine them here because these features are hypothesized to significantly enhance regression performance. Additionally, **for time-series problems**, we specifically engineer variables to capture **temporal dependencies**, **seasonality**, and **lag effects**, which are critical components for forecasting accuracy.

We have categorized the feature extraction process into **4 main groups**, based on the physical and statistical nature of the data:

1. **Group 1: Temporal & Social Context**: To capture natural time cycles (diurnal/seasonal) and human behavioral patterns (rush hours, holidays) that influence emission levels.

- **Cyclical Features:** `hour_sin`, `hour_cos`, `month_sin`, `month_cos`.
  - *Significance:* These transform time variables (0-23 hours, 1-12 months) into sine/cosine coordinates, allowing the model to understand continuity (e.g., 23:00 is temporally close to 00:00).

- **Calendar Features:** `hour`, `day_of_week`, `month`, `season`.

- **Social Activity:**
  - `is_weekend`: Distinguishes between weekdays and weekends (reflecting differences in traffic flow).
  
  - `is_working_day`, `is_rush_hour`: Marks peak traffic hours and workdays, where vehicle emissions typically spike.
  
  - `day_part`: Segments the day into periods (Morning, Noon, Afternoon, Evening, Night) to group behavioral characteristics.

  **Impact:** Enables the model to learn **recurrent patterns** based on Hour, Day, and Season.

2. **Group 2: Physics & Meteorology:** To model the physical impact of weather conditions on the dispersion, accumulation, or cleansing of air pollutants.

- **Wind Vectors:** `wind_x`, `wind_y`.
  - *Significance:* Decomposes wind direction and speed into two orthogonal vectors. This handles directionality better than degrees (where 0° and 360° are physically identical but numerically distant).

- **Washout Effect:** `rain_sum_6h`.
  - *Significance:* Cumulative rainfall over the last 6 hours, reflecting the atmosphere's "cleansing" capacity (rain washout).

- **Thermodynamics:**
    * `temp_diff_24h`: Diurnal temperature range, influencing thermal inversion phenomena.
    
    * `humid_x_temp`: Interaction term between humidity and temperature (e.g., high humidity + low temp often leads to fog/haze, trapping pollutants).

  **Impact:** Reflects environmental **physical mechanisms**: Wind aids dispersion, rain cleanses the air, while specific temperature/humidity conditions facilitate secondary pollutant formation.

3. **Group 3: History & Trend (PM2.5):** To exploit the strong **autocorrelation** inherent in time-series data. Current pollution levels are highly dependent on the immediate past.

- **Lag Features:** `pm25_lag_1h`, `pm25_lag_2h`, `pm25_lag_3h`, `pm25_lag_24h`.
  - *Significance:* Direct PM2.5 values at previous time steps (1 hour ago, 1 day ago).

- **Rolling Statistics:**
  - `pm25_rm_6h`, `pm25_rm_24h`: Rolling means over 6h and 24h, representing the **background trend**.
  
  - `pm25_rs_6h`: Rolling standard deviation, representing data **volatility**.

- **Trend:** `pm25_trend_1h` (Instantaneous change compared to the previous hour).

  **Impact:** This is empirically the **most critical feature group**, contributing most significantly to the regression model's predictive performance.

4. **Group 4: Composition & Exogenous Variables:** To leverage correlations between PM2.5 and other pollutants (beneficial multicollinearity) and analyze pollution composition.

- **Dust Composition:**
  - `coarse_dust` ($PM_{10} - PM_{2.5}$): Represents the coarse particle fraction.
  
  - `pm_ratio` ($PM_{2.5} / PM_{10}$): The ratio of fine dust.
  
  - `coarse_dust_lag1h`, `pm_ratio_lag1h`.

- **Gas Pollutant Lags:** `no2_lag1h`, `so2_lag1h`, `co_lag1h`, `o3_lag1h`.
  - *Significance:* Uses past concentrations of precursor gases to predict PM2.5, as emission sources (vehicles, factories) typically release both gases and particulates simultaneously.

  **Impact:** Helps the model distinguish between pollution types (fine vs. coarse dust events) and utilizes predictive signals from chemical precursors.

**Replicate Data**

In [9]:
# Work on a copy so the processed data is left untouched
df_feat = df.copy()

**Group 1: Temporal & Social Context**

In [10]:
try:
    df_feat = dp.create_feature_temporal_social(df_feat)
    print("[SUCCESS] Creating features is successfull")
except Exception as e:
    print(f"[ERROR] Creating features is fail: {e}")

Processing Group 1: Temporal & Social...
[SUCCESS] Creating features is successfull
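The body of `dp.create_feature_temporal_social` lives in `src/preprocessing` and is not shown here. As a rough idea of what Group 1 features look like, here is a minimal sketch. The function name `add_temporal_features` and the rush-hour hours are assumptions made for illustration; they are not taken from the project code.

```python
import numpy as np
import pandas as pd

def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a few Group 1 features."""
    out = df.copy()
    ts = out["timestamp"]

    # Calendar features
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month

    # Cyclical encoding: map each value onto a circle
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    out["month_sin"] = np.sin(2 * np.pi * out["month"] / 12)
    out["month_cos"] = np.cos(2 * np.pi * out["month"] / 12)

    # Social activity (rush-hour hours below are an assumption)
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    out["is_rush_hour"] = out["hour"].isin([7, 8, 17, 18]).astype(int)
    return out

# Saturday 23:00, Sunday 00:00, Monday 08:00
demo = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2025-01-04 23:00", "2025-01-05 00:00", "2025-01-06 08:00"]
)})
feat = add_temporal_features(demo)

# Distance between 23:00 and 00:00 in (sin, cos) space
gap = np.hypot(
    feat.loc[0, "hour_sin"] - feat.loc[1, "hour_sin"],
    feat.loc[0, "hour_cos"] - feat.loc[1, "hour_cos"],
)
print(round(gap, 3))
```

On the raw `hour` column, 23:00 and 00:00 are 23 units apart; in the sine/cosine encoding they are as close as any other pair of consecutive hours.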


**Group 2: Physics & Meteorology**

In [11]:
try:
    df_feat = dp.create_feature_physic(df_feat)
    print("[SUCCESS] Creating features is successfull")
except Exception as e:
    print(f"[ERROR] Creating features is fail: {e}")

Processing Group 2: Physics & Meteo...
[SUCCESS] Creating features is successfull
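`dp.create_feature_physic` is likewise not shown in this notebook. The sketch below illustrates three of the Group 2 ideas under stated assumptions: the input column names `wind_direction`, `rain` and `temperature` are hypothetical (`wind_speed` and `humidity` are mentioned in the text above), the wind angle is treated as a plain mathematical angle in degrees, and `temp_diff_24h` is left out because its exact definition is not given here.

```python
import numpy as np
import pandas as pd

def add_physics_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a few Group 2 features."""
    out = df.sort_values(["city", "timestamp"]).copy()

    # Wind vectors: decompose speed and direction into two components
    theta = np.deg2rad(out["wind_direction"])
    out["wind_x"] = out["wind_speed"] * np.cos(theta)
    out["wind_y"] = out["wind_speed"] * np.sin(theta)

    # Washout: rainfall summed over the current and previous 5 hourly rows,
    # computed per city (assumes one row per hour with no gaps)
    out["rain_sum_6h"] = out.groupby("city")["rain"].transform(
        lambda s: s.rolling(6, min_periods=1).sum()
    )

    # Interaction term between humidity and temperature
    out["humid_x_temp"] = out["humidity"] * out["temperature"]
    return out

# Made-up readings for one city
demo = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=8, freq=pd.Timedelta(hours=1)),
    "city": ["A"] * 8,
    "wind_speed": [2.0] * 8,
    "wind_direction": [0, 360, 90, 180, 270, 45, 0, 360],
    "rain": [1.0] * 8,
    "humidity": [80.0] * 8,
    "temperature": [25.0] * 8,
})
feat = add_physics_features(demo)
print(feat[["wind_x", "wind_y", "rain_sum_6h"]].round(3))
```

Rows 0 and 1 have directions 0° and 360°, which are far apart as numbers but produce the same `wind_x` and `wind_y`.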


**Group 3: History & Trend (PM2.5)**

In [12]:
try:
    df_feat = dp.create_feature_history_trend(df_feat)
    print("[SUCCESS] Creating features is successfull")
except Exception as e:
    print(f"[ERROR] Creating features is fail: {e}")

Processing Group 3: History & Trend...


[SUCCESS] Creating features is successfull
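For Group 3, the important detail is that lags and rolling statistics are computed within each city, so one city's history never leaks into another's. The sketch below is a hypothetical stand-in for `dp.create_feature_history_trend` (whose code is not shown here). It assumes one row per city per hour with no gaps, so that shifting by rows equals shifting by hours.

```python
import pandas as pd

def add_history_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a few Group 3 features."""
    out = df.sort_values(["city", "timestamp"]).copy()
    grouped = out.groupby("city")["pm2_5"]

    # Lag features: PM2.5 value n rows (hours) earlier in the same city
    for lag in (1, 2, 3, 24):
        out[f"pm25_lag_{lag}h"] = grouped.shift(lag)

    # Rolling mean and standard deviation over the last 6 rows
    out["pm25_rm_6h"] = grouped.transform(lambda s: s.rolling(6).mean())
    out["pm25_rs_6h"] = grouped.transform(lambda s: s.rolling(6).std())

    # Trend: change compared with the previous hour
    out["pm25_trend_1h"] = grouped.diff(1)
    return out

# Made-up data: two cities, three hours each, rows interleaved on purpose
hours = pd.date_range("2025-01-01", periods=3, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "timestamp": list(hours.repeat(2)),
    "city": ["A", "B"] * 3,
    "pm2_5": [10.0, 50.0, 12.0, 40.0, 15.0, 45.0],
})
feat = add_history_features(demo)
city_b = feat[feat["city"] == "B"]
print(city_b[["timestamp", "pm2_5", "pm25_lag_1h", "pm25_trend_1h"]])
```

The first row of city B has no lag value: it is `NaN` rather than the last reading of city A. Rows like this are the missing values that the splitting step later removes.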


**Group 4: Composition & Exogenous Variables**

In [13]:
try:
    df_feat = dp.create_feature_composition(df_feat)
    print("[SUCCESS] Creating features is successfull")
except Exception as e:
    print(f"[ERROR] Creating features is fail: {e}")

Processing Group 4: Composition...
[SUCCESS] Creating features is successfull
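The Group 4 formulas can be sketched in a few lines. This is a hypothetical stand-in for `dp.create_feature_composition`, not the project code: the gas column name `no2` is an assumption, and guarding the ratio against a zero $PM_{10}$ is a choice made for this illustration.

```python
import numpy as np
import pandas as pd

def add_composition_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a few Group 4 features."""
    out = df.sort_values(["city", "timestamp"]).copy()

    # Coarse fraction and fine-dust ratio
    out["coarse_dust"] = out["pm10"] - out["pm2_5"]
    # A zero PM10 would give a division by zero, so treat it as missing
    out["pm_ratio"] = out["pm2_5"] / out["pm10"].replace(0, np.nan)

    # One-hour lags, computed within each city
    for col in ("coarse_dust", "pm_ratio", "no2"):
        out[f"{col}_lag1h"] = out.groupby("city")[col].shift(1)
    return out

# Made-up readings for one city; the middle row has PM10 == 0
demo = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=3, freq=pd.Timedelta(hours=1)),
    "city": ["A"] * 3,
    "pm2_5": [10.0, 0.0, 15.0],
    "pm10": [20.0, 0.0, 30.0],
    "no2": [5.0, 6.0, 7.0],
})
feat = add_composition_features(demo)
print(feat[["coarse_dust", "pm_ratio", "no2_lag1h"]])
```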


## IV. Data Splitting


To prepare the dataset for supervised learning, we construct a **one-hour-ahead forecasting target** by shifting the PM2.5 concentration forward by one time step within each city. Rows containing missing values introduced by shifting and feature engineering are subsequently removed to ensure data consistency.

The cleaned dataset is then split **chronologically** into three subsets to prevent temporal leakage:

- **Training set**: observations before *June 1, 2025*
- **Validation set**: observations from *June 1, 2025* to *September 30, 2025*
- **Test set**: observations from *October 1, 2025* onward

In [14]:
train, val, test, train_cut, val_cut = dp.train_val_test_split(df_feat)

print(f"Train set: {train.shape}, from start to {train_cut}")
print(f"Validation set: {val.shape}, from {train_cut} to {val_cut}")
print(f"Test set: {test.shape}, from {val_cut} to end")

Train set: (718896, 48), from start to 2025-06-01 00:00:00
Validation set: (99552, 48), from 2025-06-01 00:00:00 to 2025-10-01 00:00:00
Test set: (98736, 48), from 2025-10-01 00:00:00 to end
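The target construction and chronological split can be sketched as follows. This is a simplified illustration, not the code of `dp.train_val_test_split`: the function name `make_target_and_split`, the `target` column name and the explicit cut-off arguments are all assumptions.

```python
import pandas as pd

def make_target_and_split(df, train_cut, val_cut):
    """Hypothetical sketch: one-hour-ahead target plus a time-ordered split."""
    out = df.sort_values(["city", "timestamp"]).copy()

    # Target = next hour's PM2.5 within the same city
    out["target"] = out.groupby("city")["pm2_5"].shift(-1)

    # Drop rows made incomplete by shifting (the last hour of each city)
    out = out.dropna()

    # Split by time, never by random sampling, to avoid temporal leakage
    train = out[out["timestamp"] < train_cut]
    val = out[(out["timestamp"] >= train_cut) & (out["timestamp"] < val_cut)]
    test = out[out["timestamp"] >= val_cut]
    return train, val, test

# Made-up data: one city, six hourly readings
demo = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=6, freq=pd.Timedelta(hours=1)),
    "city": ["A"] * 6,
    "pm2_5": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
train_s, val_s, test_s = make_target_and_split(
    demo,
    train_cut=pd.Timestamp("2025-01-01 02:00"),
    val_cut=pd.Timestamp("2025-01-01 04:00"),
)
print(len(train_s), len(val_s), len(test_s))
```

Every training row is strictly earlier than every validation row, and every validation row is earlier than every test row, so the model is always evaluated on data from its own future.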


## Data Persistence
This step serves as the “bridge” between the Preprocessing step and the Modeling step. Once the data is preprocessed and ready for training, we save it for the next step.

In [15]:
path = "../data/model"

os.makedirs(path, exist_ok=True)

train.to_csv(os.path.join(path, "train.csv"), index = False)
val.to_csv(os.path.join(path, "val.csv"), index = False)
test.to_csv(os.path.join(path, "test.csv"), index = False)
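One thing to keep in mind when these files are loaded in the Modeling step: CSV does not store column types, so a datetime column comes back as text unless it is parsed again. The round trip below is a small sketch with made-up data written to a temporary directory; it assumes the saved files still contain a `timestamp` column.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for one of the saved splits
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-01 00:00", "2025-01-01 01:00"]),
    "pm2_5": [10.0, 12.0],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    demo.to_csv(csv_path, index=False)

    # Default read: the timestamp column is loaded as text
    plain = pd.read_csv(csv_path)

    # parse_dates restores the datetime type on load
    parsed = pd.read_csv(csv_path, parse_dates=["timestamp"])

print(plain["timestamp"].dtype, parsed["timestamp"].dtype)
```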