# DATA PREPROCESSING
Data preprocessing is a key step in any data analysis or machine learning pipeline, and it is often interleaved with **Data Exploration**. It involves cleaning, transforming and organizing raw data so that it is accurate, consistent and ready for modeling. It strongly affects model building in several ways:

- Clean and well-structured data allows models to learn meaningful patterns rather than noise.

- Properly processed data prevents misleading inputs, leading to more reliable predictions.

- Organized data makes it simpler to create useful inputs for the model, enhancing model performance.

- Organized data supports better Exploratory Data Analysis (EDA), making patterns and trends more interpretable.

**Steps in Data Preprocessing:** Some key steps in data preprocessing are: **Data Cleaning**, **Data Integration**, **Data Transformation**, and **Feature Engineering**.

## I. Data Cleaning
Data cleaning is the process of correcting errors and inconsistencies, normalizing values, and handling the missing, duplicated or irrelevant data that we discovered during **Data Exploration**. Clean data is essential for effective analysis, as it improves the quality of results and enhances the performance of data models. We separate data cleaning into **six key steps**:

- **Syntax Errors**: Correcting issues such as **typos, incorrect data types, and invalid characters**.

- **Format Data**: Standardizing data formats, including **dates, numeric values, text cases, and units of measurement**.

- **Irrelevant Data**: Removing data fields or records that do not contribute to the analytical objective.

- **Handling Missing Data**: Addressing missing values through **removal, imputation, or estimation**.

- **Handling Duplicated Data**: Eliminating duplicate records.

- **Handling Outliers & Impossible Values**: Managing the extreme or abnormal values that are detected.

From the insights gained in **Data Understanding of the Raw Dataset**, we know that the collected data is relatively clean, with no duplicated records and no wrongly formatted values. Therefore, this section focuses only on the remaining issues.
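The "relatively clean" observation can be re-checked with two quick counts. The snippet below is only a minimal sketch on a toy frame: the column names follow this notebook, but the rows and the `city_a` / `city_b` labels are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the raw dataset (hypothetical values)
toy = pd.DataFrame({
    "city": ["city_a", "city_a", "city_b", "city_b"],
    "timestamp": ["2024-01-01 00:00", "2024-01-01 00:00",
                  "2024-01-01 00:00", "2024-01-01 01:00"],
    "pm2_5": [35.0, 35.0, None, 12.0],
})

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column
n_missing = toy.isna().sum().to_dict()

print(f"Duplicated rows: {n_duplicates}")
print(f"Missing values per column: {n_missing}")
```

Running the same two calls on the real `df` after loading it would confirm (or contradict) the conclusion taken from the exploration notebook.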

As in any step, we begin by importing the libraries and loading the data with `pandas`.

**Import Libraries**

In [None]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append(os.path.abspath(os.path.join('..')))
from src import preprocessing as dp

**Load the Dataset**

In [None]:
path = "../data/raw/vietnam_air_quality.csv"
try:
    df = pd.read_csv(path)
    print("[SUCCESS]: Loading dataset successful")
except Exception as e:
    print(f"[ERROR]: Loading dataset failed: {e}")
    raise  # stop here: every later cell depends on df

[SUCCESS]: Loading dataset successful


### 1. Format Data
As we saw in the **explore datatype of the data** step of **Data Exploration**, most columns already have the right datatype; the exception is the `timestamp` column. In this section we therefore convert it to the right datatype, `datetime`.

In [None]:
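try:
    # errors='raise' (the default) aborts on the first value that cannot be parsed
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='raise')
    print("[SUCCESS] Formatting data is successful")
except Exception as e:
    print(f"[FAIL] Formatting data failed: {e}")
    raise  # later cells need a datetime column

[SUCCESS] Formatting data is successful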
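If the conversion ever fails, it can help to see how many values are unparseable before deciding what to do with them. The following is a hedged sketch on a made-up series (not the real `timestamp` column): with `errors="coerce"`, `pd.to_datetime` turns values it cannot parse into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately not a date
raw = pd.Series(["2024-01-01 00:00:00", "2024-01-01 01:00:00", "not a date"])

# Unparseable entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, errors="coerce")

# Count how many values failed to parse
n_bad = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_bad}")
```

This is a diagnostic only; the notebook keeps `errors='raise'` so that bad values are never silently converted.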


### 2. Irrelevant Data
The goal of this step is to reduce memory use and noise by removing columns that are not essential for analysis or model building.

Of the **20 columns** in our dataset, we drop `lat` and `lon`, because these two columns were only used for mapping to create the **`city`** column.

In [None]:
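try:
    df.drop(columns=["lat", "lon"], inplace=True)
    print("[SUCCESS] Dropping irrelevant columns is successful")
except Exception as e:
    # Report the actual cause (e.g. KeyError if the columns were already dropped)
    print(f"[ERROR] Dropping irrelevant columns failed: {e}")

[SUCCESS] Dropping irrelevant columns is successful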
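Two things are worth illustrating about this step: the memory effect, and the fact that re-running a `drop` cell raises `KeyError` once the columns are gone. The sketch below uses a toy frame with hypothetical coordinates and shows `errors="ignore"`, a standard `DataFrame.drop` option that makes the call safe to repeat.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "city": ["city_a", "city_b"],
    "lat": [21.0, 16.5],
    "lon": [105.8, 107.6],
    "pm2_5": [35.0, 12.0],
})

# Memory footprint in bytes before dropping
before = int(toy.memory_usage(deep=True).sum())

# errors="ignore" skips labels that do not exist instead of raising KeyError
slim = toy.drop(columns=["lat", "lon"], errors="ignore")
slim = slim.drop(columns=["lat", "lon"], errors="ignore")  # repeat: no error

# Memory footprint in bytes after dropping
after = int(slim.memory_usage(deep=True).sum())
print(f"Memory: {before} bytes -> {after} bytes")
```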


### 3. Handling Missing Data

As identified in the **Data Exploration** phase, **93 records** violated the physical constraint: their fine particulate matter ($PM_{2.5}$) exceeded their coarse particulate matter ($PM_{10}$). In addition, a small number of records had $PM_{2.5}$ or $PM_{10}$ equal to zero. Such values are physically implausible in the context of air quality monitoring and likely indicate sensor malfunction or implicitly encoded missing data.

To address these anomalies, we employed a **four-step imputation strategy**:

1. **Zero-value Masking:**  
   Records where $PM_{2.5}$ or $PM_{10}$ equaled zero were treated as implicit missing values and replaced with `NaN`.

2. **Constraint-based Masking:**  
   For records violating the physical constraint (that is, records where $PM_{2.5} > PM_{10}$), the invalid $PM_{10}$ values were converted to `NaN` (treated as missing).

3. **Time-based Interpolation:**  
   Missing values were reconstructed using **time-based interpolation**. Crucially, this operation was performed **within each city** to preserve spatial independence and local temporal patterns.

4. **Constraint Enforcement:**  
   A final consistency check was applied to strictly enforce the physical constraint $PM_{10} \geq PM_{2.5}$ across the entire dataset, ensuring physical validity after imputation.


In [None]:
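# Step 1: PM2.5 or PM10 equal to zero (implicit missing values) -> NaN
pm_zero_mask = (df["pm2_5"] == 0) | (df["pm10"] == 0)
df.loc[pm_zero_mask, ["pm2_5", "pm10"]] = np.nan

# Step 2: rows violating PM10 >= PM2.5 -> mask the invalid PM10 value
mask_invalid = df["pm2_5"] > df["pm10"]
df.loc[mask_invalid, "pm10"] = np.nan

# Ensure timestamp is datetime (no-op if already converted in "Format Data")
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Set timestamp as index (required for time interpolation)
df = df.set_index("timestamp")

# Step 3: time-based interpolation within each city.
# limit=5 fills at most 5 consecutive NaNs; longer gaps stay partly missing.
# Both columns are interpolated, because Step 1 masks PM2.5 as well as PM10.
for col in ["pm2_5", "pm10"]:
    df[col] = df.groupby("city")[col].transform(
        lambda group: group.interpolate(method="time", limit=5)
    )

# Reset index
df = df.reset_index()

# Step 4: enforce PM10 >= PM2.5 after imputation
df["pm10"] = df[["pm10", "pm2_5"]].max(axis=1)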

**Verification After the Fix**

In [None]:
remaining_invalid = df[df["pm2_5"] > df["pm10"]].index
print(f"Post-fix Verification: Found {len(remaining_invalid)} impossible rows.")

Post-fix Verification: Found 0 impossible rows.
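To make the four steps easier to follow, here is a small worked example on toy data. It is a sketch under stated assumptions: the values and the `city_a` / `city_b` labels are invented, no `limit` is used, and both PM columns are interpolated. With hourly readings, a single masked value is replaced by the midpoint of its two neighbours within the same city.

```python
import numpy as np
import pandas as pd

hours = ["2024-01-01 00:00", "2024-01-01 01:00",
         "2024-01-01 02:00", "2024-01-01 03:00"]
toy = pd.DataFrame({
    "city": ["city_a"] * 4 + ["city_b"] * 4,
    "timestamp": pd.to_datetime(hours * 2),
    "pm2_5": [10.0, 0.0, 14.0, 16.0, 20.0, 30.0, 22.0, 24.0],
    "pm10": [20.0, 25.0, 28.0, 32.0, 40.0, 25.0, 44.0, 48.0],
})

# Step 1: a zero in either column marks both readings as missing (row 1)
zero_mask = (toy["pm2_5"] == 0) | (toy["pm10"] == 0)
toy.loc[zero_mask, ["pm2_5", "pm10"]] = np.nan

# Step 2: PM2.5 > PM10 is impossible, so the PM10 value is masked (row 5)
invalid = toy["pm2_5"] > toy["pm10"]
toy.loc[invalid, "pm10"] = np.nan

# Step 3: time-based interpolation, done separately for each city
toy = toy.set_index("timestamp")
for col in ["pm2_5", "pm10"]:
    toy[col] = toy.groupby("city")[col].transform(
        lambda s: s.interpolate(method="time")
    )
toy = toy.reset_index()

# Step 4: enforce PM10 >= PM2.5 on every row
toy["pm10"] = toy[["pm10", "pm2_5"]].max(axis=1)

print(toy)
```

In this toy case row 1 becomes `pm2_5 = 12`, `pm10 = 24` (midpoints within `city_a`), and row 5 gets `pm10 = 42` (midpoint of 40 and 44 within `city_b`), so no value leaks from one city into the other.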


### 4. Handling Outliers
Before handling outliers, we save the data processed in the previous steps so that it can be explored further. This processed data is ready for examining the distributions, from which we can gain insight into the outliers.

#### 4.1 Save Processed Data

In [None]:
path = "../data/processed/processed_data.csv"
df.to_csv(path, index = False)
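One detail to keep in mind when this file is read back: CSV stores timestamps as plain text, so the `datetime` dtype set in **Format Data** is not preserved. The sketch below demonstrates this with an in-memory buffer and toy values (an assumption made to keep the example self-contained); `parse_dates` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Toy frame with a proper datetime column (hypothetical values)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "pm2_5": [35.0, 12.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns the timestamps as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype while reading
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["timestamp"])

print(plain["timestamp"].dtype, parsed["timestamp"].dtype)
```

Any notebook that loads `processed_data.csv` would therefore need `parse_dates=["timestamp"]` or another `pd.to_datetime` call.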

Based on the insights derived from the `data_exploration.ipynb` notebook and the physical characteristics of air quality data, we have decided **NOT to remove or alter statistical outliers** (extreme high values). The reasons are as follows:

1. **Authentic Environmental Variability vs. Error**
   - **Insight:** As noted in the **Data Exploration** phase, the "outliers" (e.g., AQI > 150) represent periods of **severe air pollution events**.
   - **Reasoning:** In environmental science, extreme values are often "features," not "bugs." A value of `AQI = 146` (as seen in *Bac Ninh*) is a realistic occurrence during winter or temperature inversion periods. Removing these values would artificially smooth the data, causing it to deviate from reality.

2. **Preservation of Time-Series Continuity**
   - **Constraint:** The data is a continuous hourly time series.
   - **Reasoning:** Deep learning models (e.g., LSTM, GRU) and feature engineering techniques (e.g., Lag features $t-1$, Rolling Windows) require an unbroken temporal sequence. Deleting rows containing outliers would create **time gaps**, disrupting the temporal dependencies required for accurate forecasting.

3. **Multivariate Integrity**
   - **Logic:** Pollution levels are physically correlated with meteorological factors (e.g., high PM2.5 often correlates with low `wind_speed` or high `humidity`).
   - **Reasoning:** Removing a high PM2.5 data point while retaining its corresponding weather conditions destroys the physical cause-and-effect relationship. The model needs these extreme examples to learn how specific weather patterns trigger pollution spikes.
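The continuity argument above can be illustrated with a short sketch. It assumes a toy hourly series, uses the common 1.5 × IQR rule as a stand-in outlier detector (not necessarily the rule used in the exploration notebook), and relies on a hypothetical helper `count_gaps`. Flagging the extreme value keeps the series intact, while deleting it leaves a hole.

```python
import pandas as pd

# Toy hourly series with one extreme AQI value
ts = pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"timestamp": ts, "aqi": [40, 42, 45, 146, 44, 41]})

# 1.5 * IQR upper fence (assumed rule, for illustration only)
q1, q3 = toy["aqi"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Flag the outlier in a new column instead of deleting the row
toy["is_outlier"] = toy["aqi"] > upper


def count_gaps(frame: pd.DataFrame) -> int:
    """Count consecutive rows that are not exactly one hour apart."""
    deltas = frame["timestamp"].diff().dropna()
    return int((deltas != pd.Timedelta(hours=1)).sum())


gaps_if_flagged = count_gaps(toy)                      # rows kept
gaps_if_dropped = count_gaps(toy[~toy["is_outlier"]])  # rows removed

print(f"Gaps when flagging: {gaps_if_flagged}")
print(f"Gaps when dropping: {gaps_if_dropped}")
```

A flag column like `is_outlier` also keeps the option open to treat these rows differently later without losing the weather context recorded alongside them.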


However, if the model's accuracy or efficiency turns out to be poor in the **Modeling** step, we will consider returning to this step, **Handling Outliers**.


## II. Data Integration
Data integration involves merging data from various sources into a single, unified dataset. It can be challenging due to differences in data formats, structures, and meanings. However, this dataset is stored in a single `csv` file, so no record linkage or data fusion is needed at this step.
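For reference only, the sketch below shows what integration could look like if a second source were added later. Everything in it is hypothetical: the `weather` frame, its `wind_speed` values and the city labels are invented. `validate` and `indicator` are standard `DataFrame.merge` parameters that check key uniqueness and report unmatched rows.

```python
import pandas as pd

t0 = pd.Timestamp("2024-01-01 00:00")
t1 = pd.Timestamp("2024-01-01 01:00")

# Hypothetical air quality readings
air = pd.DataFrame({
    "city": ["city_a", "city_a", "city_b"],
    "timestamp": [t0, t1, t0],
    "pm2_5": [35.0, 38.0, 12.0],
})

# Hypothetical second source, missing one (city, timestamp) pair
weather = pd.DataFrame({
    "city": ["city_a", "city_b"],
    "timestamp": [t0, t0],
    "wind_speed": [2.5, 4.0],
})

# Left join keeps every air quality row; validate raises on duplicated keys
merged = air.merge(
    weather,
    on=["city", "timestamp"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows with no matching weather record
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows without a weather match: {n_unmatched}")
```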