## Predicting Air Quality Index (AQI) Using Machine Learning and Deep Learning Models
*Students:* Andrea Thomas, Joseph Edwards, Jinyuan He

# Introduction

Air pollution has become one of the most significant environmental and public health challenges in modern cities. Poor air quality is closely linked to respiratory illness, reduced life expectancy, productivity losses, and increased healthcare burden. For governments, environmental agencies, and businesses, accurate forecasting of air quality is essential for early warnings, public protection, and operational decision-making.

In this project, we focus on predicting the Air Quality Index (AQI) using the **Taiwan Air Quality Dataset (2016–2024)**, an hourly dataset published on Kaggle. Taiwan provides an excellent case study due to its dense urban centers, complex topography, and highly variable meteorological patterns influenced by monsoon seasons, typhoons, long-range dust transport, and industrial activity. These unique characteristics make AQI prediction both scientifically interesting and practically important.

The goal of this project is to build a complete machine learning pipeline to forecast hourly AQI values based on historical pollutant concentrations and weather conditions. We construct models from three categories:

- **Baseline Model:** Linear Regression  
- **Tree-Based Models:** Random Forest and XGBoost  
- **Deep Learning Models:** Feedforward Neural Network (FNN) and Long Short-Term Memory (LSTM), the latter being a sequence model

By comparing these approaches, we aim to understand:
1. Which model family performs best for Taiwan's AQI patterns  
2. Which pollutant and meteorological variables contribute most to prediction accuracy  
3. How temporal dependencies (lag features and sequences) influence forecasting quality  
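
To make the third question concrete, here is a minimal sketch of how lag and rolling-window features can be built per monitoring site. The toy values, the column names `aqi_lag1` and `aqi_roll2`, and the window sizes are illustrative assumptions, not the features used later in this notebook.

```python
import pandas as pd

# Toy hourly series for two hypothetical sites (values are made up).
hours = ["2024-01-01 00:00", "2024-01-01 01:00",
         "2024-01-01 02:00", "2024-01-01 03:00"]
toy = pd.DataFrame({
    "siteid": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(hours * 2),
    "aqi": [40.0, 50.0, 60.0, 70.0, 10.0, 20.0, 30.0, 40.0],
}).sort_values(["siteid", "date"]).reset_index(drop=True)

# Lag feature: AQI one hour earlier at the same site.
toy["aqi_lag1"] = toy.groupby("siteid")["aqi"].shift(1)

# Rolling feature: mean of the previous two hours at the same site.
# Shifting before rolling keeps the current hour's target out of its own feature.
toy["aqi_roll2"] = toy.groupby("siteid")["aqi"].transform(
    lambda s: s.shift(1).rolling(window=2).mean()
)

print(toy)
```

Grouping by `siteid` before shifting matters: without it, the first hour of one site would inherit the last hour of another.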

This notebook covers the following end-to-end steps:
- Data loading and exploratory analysis  
- Feature engineering (time features, lag features, rolling windows)  
- Model development across ML and DL techniques  
- Evaluation with MAE, RMSE, and R²  
- Visualization of predictions and feature importances  
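
As a reference for the evaluation step, the three metrics can be computed directly with NumPy. This is a small sketch on made-up numbers; the helper name `regression_metrics` is hypothetical, and the notebook itself may rely on library implementations instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))            # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))     # penalizes large errors more heavily
    ss_res = np.sum(err ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot            # undefined if y_true is constant
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Illustrative AQI values only, not results from this project.
metrics = regression_metrics([50, 60, 70, 80], [52, 58, 73, 79])
print(metrics)
```

MAE and RMSE are in AQI units, so they are easy to interpret; R² is unitless and describes the share of variance explained.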

Ultimately, this project provides a practical and data-driven approach to AQI forecasting tailored to Taiwan's environmental context, demonstrating how machine learning and deep learning can support smarter air-quality monitoring and early-warning systems.



# 1. Data Loading & Cleaning

In [10]:
# Install kagglehub if needed
#!pip install kagglehub[pandas-datasets]

In [2]:
import kagglehub
import pandas as pd

# Download the dataset (KaggleHub returns the local directory path)
local_dir = kagglehub.dataset_download("taweilo/taiwan-air-quality-data-20162024")

print("Downloaded to:", local_dir)

Using Colab cache for faster access to the 'taiwan-air-quality-data-20162024' dataset.
Downloaded to: /kaggle/input/taiwan-air-quality-data-20162024


In [2]:
df = pd.read_csv(f"{local_dir}/air_quality.csv")
df.head()



Unnamed: 0,date,sitename,county,aqi,pollutant,status,so2,co,o3,o3_8hr,...,windspeed,winddirec,unit,co_8hr,pm2.5_avg,pm10_avg,so2_avg,longitude,latitude,siteid
0,2024-08-31 23:00,Hukou,Hsinchu County,62.0,PM2.5,Moderate,0.9,0.17,35.0,40.2,...,2.3,225,,0.2,20.1,26.0,1.0,121.038869,24.900097,22.0
1,2024-08-31 23:00,Zhongming,Taichung City,50.0,,Good,1.6,0.32,27.9,35.1,...,1.1,184,,0.2,15.3,23.0,1.0,120.641092,24.151958,31.0
2,2024-08-31 23:00,Zhudong,Hsinchu County,45.0,,Good,0.4,0.17,25.1,40.6,...,0.4,210,,0.2,13.8,24.0,0.0,121.088955,24.740914,23.0
3,2024-08-31 23:00,Hsinchu,Hsinchu City,42.0,,Good,0.8,0.2,30.0,35.9,...,1.9,239,,0.2,13.0,26.0,1.0,120.972368,24.805636,24.0
4,2024-08-31 23:00,Toufen,Miaoli County,50.0,,Good,1.0,0.16,33.5,35.9,...,1.8,259,,0.1,15.3,28.0,1.0,120.898693,24.696907,25.0


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Summarize missing values per column (count and percentage of rows)
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})
# Keep only columns that have missing values, worst first
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)
print(missing_summary)

# Visualize the missing-data pattern (one row per record, one column per field)
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.show()

In [None]:
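# Convert to wide format
# NOTE: this assumes a long-format frame with a `value` column; the preview above
# shows no such column, so this cell may raise a KeyError on the CSV as loaded.
df_wide = df.pivot_table(
    index=['date', 'siteid', 'latitude', 'longitude'],
    columns='pollutant',
    values='value',
    aggfunc='mean'
).reset_index()

print("Data after conversion:")
print(df_wide.head())
print("\nMissing values after conversion:")
print(df_wide.isnull().sum())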
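
For reference, this is how a long-to-wide pivot behaves on a small hypothetical long-format frame. The `long_df` layout (one row per timestamp, site, and pollutant, with a `value` column) is an assumption for illustration; the readings are borrowed from the first two preview rows.

```python
import pandas as pd

# Hypothetical long-format readings: one row per (timestamp, site, pollutant).
long_df = pd.DataFrame({
    "date": ["2024-08-31 23:00"] * 4,
    "siteid": [22, 22, 31, 31],
    "pollutant": ["so2", "co", "so2", "co"],
    "value": [0.9, 0.17, 1.6, 0.32],
})

# Each distinct pollutant becomes its own column; duplicates are averaged.
wide_df = long_df.pivot_table(
    index=["date", "siteid"],
    columns="pollutant",
    values="value",
    aggfunc="mean",
).reset_index()
wide_df.columns.name = None  # drop the leftover "pollutant" axis label

print(wide_df)
```

A pivot like this is only needed when the source is stored in long format; a frame that already has one column per pollutant can skip it.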

In [None]:
df = df.sort_values(["siteid", "date"])
df_cleaned = df.copy()

# Drop rows where the date/siteid/latitude/longitude columns have missing values
df_cleaned = df_cleaned.dropna(subset=['date', 'siteid', 'latitude', 'longitude'])

df_cleaned["siteid"] = df_cleaned["siteid"].astype(int)
df_cleaned["date"] = pd.to_datetime(df_cleaned["date"], errors="coerce")
df_cleaned = df_cleaned[df_cleaned["siteid"]!=1]

# Drop columns that are not used downstream
df_cleaned = df_cleaned.drop(columns=['unit', 'o3_8hr', 'co_8hr', 'pm2.5_avg', 'pm10_avg', 'so2_avg', 'pollutant'])

# Drop ALL rows with any NaN
df_cleaned = df_cleaned.dropna().reset_index(drop=True)

print("Remaining missing values:")
print(df_cleaned.isna().sum())
print("Final shape:", df_cleaned.shape)
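
One detail of the cell above is worth illustrating: `errors="coerce"` turns unparseable timestamps into `NaT` rather than raising, and the later blanket `dropna()` then removes those rows. A minimal sketch on made-up strings:

```python
import pandas as pd

# Made-up inputs: one valid timestamp, one malformed string, one missing value.
raw = pd.Series(["2024-08-31 23:00", "not a date", None])

# Unparseable or missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)

# NaT counts as missing, so dropna() discards those entries.
kept = parsed.dropna()
print(len(kept), "of", len(raw), "timestamps kept")
```

Counting `NaT` values right after the conversion is a cheap way to see how many rows are lost to malformed dates as opposed to missing sensor readings.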

In [25]:

# Compute time differences per site
df_cleaned["time_diff"] = df_cleaned.groupby("siteid")["date"].diff()

# Find gaps > 4 hours
missing_gaps = df_cleaned[df_cleaned["time_diff"] > pd.Timedelta(hours=4)]

# Count number of gaps per siteid
gap_counts = missing_gaps.groupby("siteid").size()

# List of all siteids
all_siteids = df_cleaned["siteid"].unique()

# Siteids that have NO gaps
siteids_no_gaps = [sid for sid in all_siteids if sid not in gap_counts.index]

print("Site IDs with NO missing gaps > 4 hours:")
print(siteids_no_gaps)

Site IDs with NO missing gaps > 4 hours:
[np.int64(0)]
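
The gap check is easier to follow on a toy example. The sketch below applies the same `groupby(...).diff()` logic to two hypothetical sites, one of which skips eight hours; the site IDs and timestamps are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "siteid": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",  # site 1: hourly
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 09:00",  # site 2: 8-hour jump
    ]),
})

# Time since the previous reading at the same site (NaT for each site's first row).
toy["time_diff"] = toy.groupby("siteid")["date"].diff()

# Comparisons against NaT are False, so first rows are never flagged as gaps.
gaps = toy[toy["time_diff"] > pd.Timedelta(hours=4)]
gap_counts = gaps.groupby("siteid").size()

# Sites that never appear in gap_counts have no gap above the threshold.
sites_without_gaps = sorted(set(toy["siteid"]) - set(gap_counts.index))

print(gap_counts.to_dict())
print(sites_without_gaps)
```

The result depends on the rows being sorted by `siteid` and `date` beforehand; on unsorted data `diff()` can produce negative or misleading intervals.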
