# PROJECT OVERVIEW REPORT: PM2.5 FINE PARTICULATE MATTER FORECASTING IN VIETNAM

## I. Project Overview
This project focuses on forecasting PM2.5 (Particulate Matter 2.5) concentrations across 34 key provinces and cities in Vietnam. This is framed as a Time Series Regression problem with the specific objective of predicting PM2.5 levels one hour in advance.

The dataset was constructed by combining geographic coordinates (via Geopy) with historical queries from the **Open-Meteo API**, spanning January 2023 to November 2025. The final dataset consists of over **860,000** records, integrating air quality variables (AQI, PM2.5, PM10, CO, NO2, SO2, O3) with meteorological factors (temperature, humidity, precipitation, wind speed, pressure, and cloud cover). The project's core value lies in building an early warning system that helps citizens and authorities take proactive measures against air pollution.
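
As an illustration of the data collection step, the sketch below builds a request URL for hourly pollutant data at one location. The endpoint, the parameter names, and the variable identifiers are assumptions based on the public Open-Meteo air quality API and should be checked against its current documentation; the helper `build_air_quality_url` is hypothetical and is not taken from the project code. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint and hourly variable names for the Open-Meteo air quality API.
AIR_QUALITY_URL = "https://air-quality-api.open-meteo.com/v1/air-quality"
HOURLY_VARS = [
    "pm2_5", "pm10", "carbon_monoxide",
    "nitrogen_dioxide", "sulphur_dioxide", "ozone",
]


def build_air_quality_url(lat: float, lon: float, start: str, end: str) -> str:
    """Return the request URL for hourly pollutant data at one location."""
    params = {
        "latitude": f"{lat:.4f}",
        "longitude": f"{lon:.4f}",
        "hourly": ",".join(HOURLY_VARS),
        "start_date": start,  # ISO date, e.g. "2023-01-01"
        "end_date": end,
        "timezone": "Asia/Ho_Chi_Minh",
    }
    return f"{AIR_QUALITY_URL}?{urlencode(params)}"


# Approximate coordinates for Hanoi, used only for illustration.
url = build_air_quality_url(21.0285, 105.8542, "2023-01-01", "2023-01-31")
print(url)
```

In a full pipeline, one such request per city (with coordinates resolved through Geopy) would be issued and the responses concatenated into a single table.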

## II. Exploration, Preprocessing, and Insights
This stage guided the entire model-building process.

**1. Data Exploration**

Correlation analysis showed that PM2.5 and PM10 are the pollutants most strongly correlated with the AQI. Among the meteorological factors, surface pressure had the strongest positive correlation with pollution, while wind and rain were associated with lower concentrations, consistent with their role as natural dispersing agents.

Research questions yielded several critical insights:
- **Human Activity Impact**: In Hanoi and Bac Ninh, pollution accumulates during the week and drops significantly on weekends. Conversely, in tourist destinations like Da Nang or port cities like Hai Phong, weekend AQI tends to be higher due to increased traffic and service activities.
- **Regional Characteristics**: A distinct North-South divide was observed. Northern provinces face severe pollution in winter due to thermal inversion, which traps fine dust near the ground. Meanwhile, the South and Central Highlands enjoy better air quality thanks to favorable thermal convection and open terrain.
- **Pollution Sources in Hanoi**: Pollution in Hanoi is a hybrid phenomenon. It stems not only from local urban emissions (traffic, construction) but is also heavily influenced by regional background concentrations, so levels remain stable throughout both day and night.
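
The weekday versus weekend comparison behind the first insight can be sketched as a simple group-by. The data below is synthetic and the column names (`city`, `time`, `aqi`) are assumptions; the weekend shifts are invented only to make the two patterns visible and are not project results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: four full weeks of hourly data.
rng = np.random.default_rng(0)
time = pd.date_range("2024-01-01", periods=24 * 28, freq=pd.Timedelta(hours=1))
is_weekend = time.dayofweek >= 5  # Monday=0, so Saturday=5 and Sunday=6

frames = []
for city, weekend_shift in [("Hanoi", -10.0), ("Da Nang", 5.0)]:
    # Invented weekend effect plus noise, purely for illustration.
    aqi = 80 + weekend_shift * is_weekend + rng.normal(0, 3, len(time))
    frames.append(pd.DataFrame({"city": city, "time": time, "aqi": aqi}))
df = pd.concat(frames, ignore_index=True)

# Label each record, then compare mean AQI per city and day type.
df["day_type"] = np.where(df["time"].dt.dayofweek >= 5, "weekend", "weekday")
summary = df.groupby(["city", "day_type"])["aqi"].mean().unstack("day_type")
summary["weekend_minus_weekday"] = summary["weekend"] - summary["weekday"]
print(summary.round(1))
```

A negative `weekend_minus_weekday` corresponds to the Hanoi-style pattern (cleaner weekends), a positive value to the pattern described for tourist and port cities.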

**2. Data Preprocessing**

The preprocessing workflow focused on physical integrity and feature engineering:
- **Physical Error Correction**: Records where PM2.5 exceeded PM10 are physically impossible, because PM2.5 is a subset of PM10. These records were corrected with time-based interpolation applied separately to each city.
- **Feature Engineering**: Developed four main feature groups: Temporal (cyclical Sin/Cos variables); Physics (wind vectors, rain washout effects); History (1h and 24h lag variables); and Composition (dust ratios, secondary gas lags).
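
The two preprocessing steps above can be sketched as follows. This is a minimal illustration under several assumptions rather than the project's actual code: the column names (`city`, `time`, `pm2_5`, `pm10`, `wind_speed`, `wind_dir`) are hypothetical, both PM columns are masked on invalid rows before interpolation, rows are assumed to be gap-free hourly records, and the wind components follow the meteorological convention (direction the wind blows from). Only a subset of the described features is shown.

```python
import numpy as np
import pandas as pd


def add_features(city_df: pd.DataFrame) -> pd.DataFrame:
    """Clean and featurize the records of a single city."""
    g = city_df.sort_values("time").set_index("time")

    # Physical error correction: mask rows where PM2.5 > PM10, then
    # fill them with time-based interpolation within this city only.
    bad = g["pm2_5"] > g["pm10"]
    g.loc[bad, ["pm2_5", "pm10"]] = np.nan
    g[["pm2_5", "pm10"]] = g[["pm2_5", "pm10"]].interpolate(
        method="time", limit_direction="both"
    )

    # Temporal: cyclical encoding so that hour 23 sits next to hour 0.
    hour = g.index.hour.to_numpy()
    g["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    g["hour_cos"] = np.cos(2 * np.pi * hour / 24)

    # Physics: decompose wind speed and direction into vector components.
    theta = np.deg2rad(g["wind_dir"])
    g["wind_u"] = -g["wind_speed"] * np.sin(theta)
    g["wind_v"] = -g["wind_speed"] * np.cos(theta)

    # History: lags relative to the current row (assumes hourly rows).
    g["pm2_5_lag_1h"] = g["pm2_5"].shift(1)
    g["pm2_5_lag_24h"] = g["pm2_5"].shift(24)

    # Composition: fine-to-coarse dust ratio, undefined when PM10 is zero.
    g["pm_ratio"] = g["pm2_5"] / g["pm10"].where(g["pm10"] > 0)

    # Target: PM2.5 one hour ahead.
    g["target_pm2_5_next_hour"] = g["pm2_5"].shift(-1)
    return g.reset_index()


# Synthetic demo data for two cities, with one injected invalid record.
rng = np.random.default_rng(1)
time = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
raw = pd.concat(
    [
        pd.DataFrame({
            "city": city,
            "time": time,
            "pm2_5": rng.uniform(10, 40, len(time)),
            "pm10": rng.uniform(50, 90, len(time)),
            "wind_speed": rng.uniform(0, 8, len(time)),
            "wind_dir": rng.uniform(0, 360, len(time)),
        })
        for city in ["Hanoi", "Da Nang"]
    ],
    ignore_index=True,
)
raw.loc[5, "pm2_5"] = 120.0  # PM2.5 > PM10: physically impossible

# Processing each city separately keeps lags from leaking across cities.
features = pd.concat(
    [add_features(g) for _, g in raw.groupby("city", sort=False)],
    ignore_index=True,
)
print(features.head())
```

Rows with missing lags or a missing target (the first 24 hours and the final hour of each city) would normally be dropped before training.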

## III. Modeling and Advanced Approaches
The modeling phase progressed from basic linear methods to advanced machine learning algorithms.

**1. Linear Models and the Multicollinearity Challenge**

Using Linear Regression as a baseline revealed severe multicollinearity between PM2.5 and PM10 (correlation as high as 0.96).

Multicollinearity Insights:
- In linear models, keeping both variables led to unstable regression coefficients and poor interpretability.
- Experiments showed that removing PM10-related features resulted in almost no change in RMSE, while the PM2.5 coefficient increased sharply to absorb the information previously carried by PM10. This indicates that multicollinearity primarily hinders model interpretation rather than severely degrading predictive accuracy.
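
The effect described above can be reproduced on synthetic data. The sketch below is not the project's experiment: the variables, coefficients, and noise levels are invented so that the two predictors have a correlation close to 0.96, and ordinary least squares is fitted with NumPy, once with both predictors and once without PM10.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Invented data: PM10 is almost a linear function of PM2.5.
pm25 = rng.gamma(shape=4.0, scale=8.0, size=n)
pm10 = 1.5 * pm25 + rng.normal(0, 7.0, n)
wind = rng.uniform(0, 8, n)
target = 0.6 * pm25 + 0.1 * pm10 - 1.0 * wind + rng.normal(0, 3.0, n)


def fit_ols(X: np.ndarray, y: np.ndarray):
    """Fit OLS with an intercept; return coefficients and in-sample RMSE."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
    return coef, rmse


# Coefficient order: [intercept, pm25, pm10, wind] and [intercept, pm25, wind].
coef_full, rmse_full = fit_ols(np.column_stack([pm25, pm10, wind]), target)
coef_drop, rmse_drop = fit_ols(np.column_stack([pm25, wind]), target)

corr = float(np.corrcoef(pm25, pm10)[0, 1])
print(f"corr(PM2.5, PM10)     = {corr:.3f}")
print(f"PM2.5 coef with PM10  = {coef_full[1]:.3f}, RMSE = {rmse_full:.3f}")
print(f"PM2.5 coef, PM10 gone = {coef_drop[1]:.3f}, RMSE = {rmse_drop:.3f}")
```

Under these assumptions the PM2.5 coefficient grows once PM10 is removed, while the RMSE moves only slightly, which mirrors the behaviour reported for the real data.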

**2. Advanced Linear Models**
- **Ridge Regression**: Adding L2 regularization helped stabilize the regression coefficients but did not provide a significant accuracy boost over traditional OLS.
- **PCR (Principal Component Regression)**: Compressed correlated variables into independent principal components. While it reduced noise, PCR occasionally lost target-specific signals if the components did not align well with PM2.5 variance.
- **PLS (Partial Least Squares)**: Unlike PCR, PLS finds components that explain both the independent variables and the target. However, PLS in this project showed a performance drop when dimensions were reduced too aggressively, suggesting that the raw information in the lag variables remains vital.
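
To make the behaviour of Ridge and PCR concrete, here is a minimal NumPy sketch on synthetic, standardized data. The data, the regularization strength `alpha`, and the helper names are all assumptions for illustration; the project presumably used library implementations. PLS is not shown; scikit-learn offers it as `PLSRegression` in `sklearn.cross_decomposition`.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Invented predictors: pm25 and pm10 have a correlation of about 0.96.
pm25 = rng.normal(0, 1, n)
pm10 = 0.96 * pm25 + np.sqrt(1 - 0.96**2) * rng.normal(0, 1, n)
wind = rng.normal(0, 1, n)
X = np.column_stack([pm25, pm10, wind])
y = 0.8 * pm25 + 0.1 * pm10 - 0.3 * wind + rng.normal(0, 0.5, n)

# Standardize predictors and center the target so no intercept is needed.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def ridge_coef(Xc: np.ndarray, yc: np.ndarray, alpha: float) -> np.ndarray:
    """Closed-form ridge solution; alpha=0 reduces to OLS."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)


def pcr_coef(Xc: np.ndarray, yc: np.ndarray, k: int) -> np.ndarray:
    """Regress on the top-k principal components, then map back to features."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # principal directions, shape (p, k)
    scores = Xc @ V_k                   # component scores, shape (n, k)
    gamma = np.linalg.lstsq(scores, yc, rcond=None)[0]
    return V_k @ gamma


def rmse(Xc: np.ndarray, yc: np.ndarray, beta: np.ndarray) -> float:
    return float(np.sqrt(np.mean((Xc @ beta - yc) ** 2)))


beta_ols = ridge_coef(Xc, yc, alpha=0.0)
beta_ridge = ridge_coef(Xc, yc, alpha=100.0)

print("OLS   coef:", beta_ols.round(3), "RMSE:", round(rmse(Xc, yc, beta_ols), 4))
print("Ridge coef:", beta_ridge.round(3), "RMSE:", round(rmse(Xc, yc, beta_ridge), 4))
for k in (1, 2, 3):
    beta_k = pcr_coef(Xc, yc, k)
    print(f"PCR k={k} coef:", beta_k.round(3), "RMSE:", round(rmse(Xc, yc, beta_k), 4))
```

With all three components PCR coincides with OLS. With a single component it keeps only the shared PM2.5/PM10 direction and discards the wind signal, which illustrates how variance-driven components can miss information that matters for the target.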

**3. Transition to XGBoost**

The turning point of the project was the implementation of the XGBoost Regressor combined with Optuna optimization.
- **Nonlinearity Handling**: XGBoost significantly outperformed the linear models (reducing MAE by over 20%) by learning the complex, non-linear interactions between meteorology and pollution.
- **Resolving Multicollinearity**: Unlike Linear Regression, the predictive accuracy of tree-based models such as XGBoost is largely unaffected by multicollinearity. In fact, PM10 emerged as one of the three most important features, acting as a physical reference frame that helps the model understand the dust structure.
- **Final Results**: The model achieved an R² score of 0.9766, meaning it explains roughly 97.7% of the variance in the target.
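
For reference, the metrics quoted above can be computed as in the sketch below. The function name `regression_metrics` and the toy numbers are illustrative assumptions and are not project results; the relative MAE reduction is the quantity behind a statement such as "reducing MAE by over 20%".

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err**2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err**2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Toy values only, chosen so the arithmetic is easy to follow.
y_true = [10.0, 20.0, 30.0, 40.0]
pred_linear = [14.0, 16.0, 35.0, 37.0]   # stand-in for a linear baseline
pred_boosted = [12.0, 18.0, 33.0, 39.0]  # stand-in for a stronger model

m_linear = regression_metrics(y_true, pred_linear)
m_boosted = regression_metrics(y_true, pred_boosted)

# Relative MAE reduction of the stronger model over the baseline.
mae_reduction = (m_linear["mae"] - m_boosted["mae"]) / m_linear["mae"]
print(m_linear)
print(m_boosted)
print(f"MAE reduction: {mae_reduction:.0%}")
```

For a time series problem such as this one, these metrics are most informative when computed on a held-out period that comes after the training data.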

## IV. Conclusion
The project demonstrates that forecasting PM2.5 is not merely a matter of analyzing historical data; it also requires an understanding of the underlying physical processes and socio-economic patterns. Combining structured EDA with robust machine learning models produced a forecasting system with high accuracy. The most valuable insight is that non-linear models such as XGBoost can exploit highly correlated variables as additional information, whereas linear models often benefit from dropping or compressing them for the sake of stable, interpretable coefficients.