# 01 - Preprocessing & EDA (Beijing Multi-Site Air Quality)
Goal: load the data, clean it, create classification labels (AQI class from the PM2.5 24h mean), build time and lag features, and save the result to `data/processed/cleaned.parquet`.

**Note:** if `USE_UCIMLREPO=True`, the notebook needs internet access to download the dataset from UCI.
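
To make the labeling and lag steps concrete, here is a minimal sketch of what they could look like. It is not the code in `src.classification_library`: the function name, the thresholds (borrowed from the reference lines drawn in Q1.4), the class names, and `min_periods=18` are all assumptions for illustration. It also assumes each station has one row per hour, since the rolling window and the shifts are row-based.

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds (ug/m3) and class names; the real ones live in the project library.
PM25_BINS = [0, 35, 75, 150, 250, np.inf]
PM25_LABELS = ["Good", "Moderate", "Unhealthy_SG", "Unhealthy", "Very_Unhealthy"]


def sketch_label_and_lags(df, lag_hours=(1, 3, 24)):
    """Illustrative only: per-station 24h rolling mean, a binned label, and lag columns."""
    out = df.sort_values(["station", "datetime"]).reset_index(drop=True)
    grouped = out.groupby("station")["PM2.5"]

    # 24-row rolling mean within each station (assumes hourly rows without gaps).
    out["pm25_24h"] = grouped.transform(lambda s: s.rolling(24, min_periods=18).mean())

    # Bin the 24h mean into ordered classes; rows with a NaN mean get a NaN label.
    out["aqi_class"] = pd.cut(
        out["pm25_24h"], bins=PM25_BINS, labels=PM25_LABELS, include_lowest=True
    )

    # Lag features are shifted within each station so values never leak across stations.
    for h in lag_hours:
        out[f"PM2.5_lag{h}"] = grouped.shift(h)
    return out
```

Shifting inside the `groupby` matters: a plain `df['PM2.5'].shift(1)` would give the first row of one station the last value of the previous station.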

In [None]:
USE_UCIMLREPO = False
RAW_ZIP_PATH = "data/raw/PRSA2017_Data_20130301-20170228.zip"

OUTPUT_CLEANED_PATH = 'data/processed/cleaned.parquet'
LAG_HOURS = [1, 3, 24]


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from src.classification_library import (
    load_beijing_air_quality,
    clean_air_quality_df,
    add_pm25_24h_and_label,
    add_time_features,
    add_lag_features,
)

PROJECT_ROOT = Path('..').resolve()
OUT_PATH = (PROJECT_ROOT / OUTPUT_CLEANED_PATH).resolve()
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)

In [None]:
df_raw = load_beijing_air_quality(use_ucimlrepo=USE_UCIMLREPO, raw_zip_path=RAW_ZIP_PATH)
print('raw shape:', df_raw.shape)
df_raw.head()

In [None]:
df = clean_air_quality_df(df_raw)
df = add_pm25_24h_and_label(df)
df = add_time_features(df)
df = add_lag_features(df, lag_hours=LAG_HOURS)
print('cleaned shape:', df.shape)
df[['datetime','station','PM2.5','pm25_24h','aqi_class']].head(10)

In [None]:
# Quick EDA: missingness and class distribution
missing_rate = df.isna().mean().sort_values(ascending=False)
missing_rate.head(20)

## Q1.1 - Check the time range and data frequency
Goal: confirm the period the data covers, whether the hourly frequency is continuous, and whether there are any gaps.
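
Besides looking at differences between consecutive timestamps, one can list the missing hours explicitly by comparing against a complete hourly range. The helper below is a small sketch; the name `find_missing_hours` is hypothetical and not part of the project library.

```python
import pandas as pd


def find_missing_hours(timestamps):
    """Return the hourly timestamps absent between the first and last observation."""
    ts = pd.DatetimeIndex(pd.to_datetime(timestamps)).unique().sort_values()
    # Every hour that should exist if the series were complete.
    full = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    return full.difference(ts)
```

It would be called on one station at a time, for example `find_missing_hours(df_station['datetime'])`; an empty result means the station has a row for every hour in its range.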

In [None]:
# Check the time range
print("="*60)
print("TIME RANGE AND DATA FREQUENCY CHECK")
print("="*60)

print(f"\n📅 Data time span:")
print(f"   Start: {df['datetime'].min()}")
print(f"   End:   {df['datetime'].max()}")
print(f"   Total records: {len(df):,}")

# Check the number of stations
stations = df['station'].unique()
print(f"\n📍 Number of monitoring stations: {len(stations)}")
print(f"   Stations: {list(stations)}")

# Check the hourly frequency, using one station as a sample
sample_station = stations[0]
df_station = df[df['station'] == sample_station].sort_values('datetime')
time_diff = df_station['datetime'].diff().dropna()

print(f"\n⏰ Frequency check (station {sample_station}):")
print(f"   Most common time step: {time_diff.mode().iloc[0]}")
print(f"   Min time step: {time_diff.min()}")
print(f"   Max time step: {time_diff.max()}")

# Check for gaps (steps longer than 1 hour)
gaps = time_diff[time_diff > pd.Timedelta(hours=1)]
if len(gaps) > 0:
    print(f"\n⚠️  Found {len(gaps)} gaps (steps longer than 1 hour):")
    print(gaps.value_counts().head(10))
else:
    print(f"\n✅ No gaps: the data is continuous at hourly resolution")

# Compute the expected number of hours
expected_hours = int((df['datetime'].max() - df['datetime'].min()).total_seconds() / 3600) + 1
actual_hours_per_station = len(df_station)
print(f"\n📊 Expected hours: {expected_hours:,}")
print(f"   Actual hours (station {sample_station}): {actual_hours_per_station:,}")

## Q1.2 - Missing-data rate by variable and over time
Goal: identify which variables have the most missing values, and whether the missingness is concentrated in particular periods or spread evenly.
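
One way to tell concentrated missingness from scattered missingness is to look at the lengths of consecutive missing runs: many runs of length 1-2 suggest scattered dropouts, while a few long runs suggest outages. The helper below is a sketch under that idea; `missing_run_lengths` is a hypothetical name, and it should be applied to a single station sorted by time.

```python
import pandas as pd


def missing_run_lengths(series):
    """Lengths of each run of consecutive NaN values, in order of appearance."""
    is_na = series.isna()
    # A new run starts whenever the NaN flag differs from the previous row.
    run_id = (is_na != is_na.shift()).cumsum()
    # Summing the boolean flag gives the run length for NaN runs and 0 for non-NaN runs.
    lengths = is_na.groupby(run_id).sum()
    return [int(n) for n in lengths if n > 0]
```

For example, `pd.Series(missing_run_lengths(df_station['PM2.5'])).describe()` would summarize how long the PM2.5 gaps typically are for that station.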

In [None]:
# Missing rate by variable
print("="*60)
print("MISSING-DATA RATE ANALYSIS")
print("="*60)

# Key variables to analyze
important_cols = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM', 'wd']
missing_by_var = df[important_cols].isna().mean().sort_values(ascending=False) * 100

print("\n📊 Missing rate by variable (%):")
for col, rate in missing_by_var.items():
    # 20-character bar, one block per 0.5%; clamped so rates above 10% still render a full-width bar
    filled = min(int(rate * 2), 20)
    bar = "█" * filled + "░" * (20 - filled)
    print(f"   {col:8s}: {bar} {rate:.2f}%")

# Visualize missing rate
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart: missing rate by variable
ax1 = axes[0]
missing_by_var.plot(kind='barh', ax=ax1, color='coral')
ax1.set_xlabel('Missing rate (%)')
ax1.set_title('Missing-data rate by variable')
ax1.axvline(x=2, color='red', linestyle='--', label='2% threshold')
ax1.legend()

# Missing over time, by month
# Group on a temporary key instead of adding a 'year_month' column,
# so the helper column does not end up in the saved parquet file
year_month = df['datetime'].dt.to_period('M')
missing_by_month = df.groupby(year_month)['PM2.5'].apply(lambda x: x.isna().mean() * 100)

ax2 = axes[1]
missing_by_month.plot(kind='line', ax=ax2, marker='o', markersize=3)
ax2.set_xlabel('Month')
ax2.set_ylabel('PM2.5 missing rate (%)')
ax2.set_title('PM2.5 missing rate over time')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# PM2.5 missing rate by station
print("\n📍 PM2.5 missing rate by station:")
missing_by_station = df.groupby('station')['PM2.5'].apply(lambda x: x.isna().mean() * 100).sort_values(ascending=False)
for station, rate in missing_by_station.items():
    print(f"   {station:20s}: {rate:.2f}%")

## Q1.3 - Detect outliers and skewed distributions
Goal: use boxplots and quantiles to get a quick look at the outliers and the distributions of PM2.5 and PM10.
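
The cell below works through PM2.5 only. To apply the same IQR rule to PM10 (or any other column) without copying the code, a small helper like the following could be used. It is a sketch: `iqr_outlier_summary` is a hypothetical name, `k=1.5` is the conventional fence multiplier, and each column is assumed to have at least one non-missing value.

```python
import pandas as pd


def iqr_outlier_summary(df, columns, k=1.5):
    """Per-column IQR fences, outlier counts, and skewness (missing values ignored)."""
    rows = []
    for col in columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((values < lower) | (values > upper)).sum())
        rows.append({
            "column": col,
            "lower": lower,
            "upper": upper,
            "n_outliers": n_out,
            "pct_outliers": 100 * n_out / len(values),
            "skew": values.skew(),
        })
    return pd.DataFrame(rows).set_index("column")
```

A call such as `iqr_outlier_summary(df, ['PM2.5', 'PM10'])` would then give both pollutants side by side.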

In [None]:
# Analyze outliers and skew in PM2.5
print("="*60)
print("OUTLIER AND SKEW DETECTION")
print("="*60)

# Take the PM2.5 values (drop missing)
pm25_data = df['PM2.5'].dropna()

print(f"\n📊 PM2.5 descriptive statistics:")
print(f"   Count:  {len(pm25_data):,}")
print(f"   Mean:   {pm25_data.mean():.2f}")
print(f"   Median: {pm25_data.median():.2f}")
print(f"   Std:    {pm25_data.std():.2f}")
print(f"   Min:    {pm25_data.min():.2f}")
print(f"   Max:    {pm25_data.max():.2f}")
print(f"\n   Quantiles:")
for q in [0.25, 0.50, 0.75, 0.90, 0.95, 0.99]:
    print(f"   Q{int(q*100):02d}:   {pm25_data.quantile(q):.2f}")

# Detect outliers with the IQR rule
Q1 = pm25_data.quantile(0.25)
Q3 = pm25_data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = pm25_data[(pm25_data < lower_bound) | (pm25_data > upper_bound)]

print(f"\n🔍 Outlier detection (IQR method):")
print(f"   Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
print(f"   Lower bound: {lower_bound:.2f}")
print(f"   Upper bound: {upper_bound:.2f}")
print(f"   Outliers: {len(outliers):,} ({len(outliers)/len(pm25_data)*100:.2f}%)")

# Histogram and boxplot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
pm25_data.hist(bins=100, ax=ax1, alpha=0.7, edgecolor='black')
ax1.axvline(pm25_data.mean(), color='red', linestyle='--', label=f'Mean={pm25_data.mean():.1f}')
ax1.axvline(pm25_data.median(), color='green', linestyle='--', label=f'Median={pm25_data.median():.1f}')
ax1.set_xlabel('PM2.5')
ax1.set_ylabel('Frequency')
ax1.set_title('PM2.5 distribution (all data)')
ax1.legend()

ax2 = axes[1]
ax2.boxplot(pm25_data, vert=True)
ax2.set_ylabel('PM2.5 (μg/m³)')
ax2.set_title('PM2.5 boxplot - outlier detection')

plt.tight_layout()
plt.show()

# Assess skewness (pandas reports excess kurtosis, so 0 corresponds to a normal distribution)
skewness = pm25_data.skew()
kurtosis = pm25_data.kurtosis()
print(f"\n📈 Distribution assessment:")
print(f"   Skewness: {skewness:.3f} {'(right-skewed, positive skew)' if skewness > 0 else '(left-skewed, negative skew)'}")
print(f"   Kurtosis: {kurtosis:.3f} {'(heavy tails)' if kurtosis > 0 else '(light tails)'}")
print(f"\n   → PM2.5 is strongly right-skewed with heavy tails")
print(f"   → There are many extreme high values (heavy pollution), but most values are low to moderate")

## Q1.4 - Plot the PM2.5 time series
Goal: plot the full period to see the long-term trend, then zoom in on 1-2 months to see the short-term fluctuations.

In [None]:
# Plot the PM2.5 time series for one chosen station
ANALYSIS_STATION = 'Aotizhongxin'
df_station = df[df['station'] == ANALYSIS_STATION].sort_values('datetime').copy()

print(f"📍 PM2.5 time-series analysis for station: {ANALYSIS_STATION}")
print(f"   Records: {len(df_station):,}")

# Panel 1: full period
fig, axes = plt.subplots(2, 1, figsize=(16, 10))

ax1 = axes[0]
ax1.plot(df_station['datetime'], df_station['PM2.5'], linewidth=0.5, alpha=0.7)
# Add a 7-day rolling mean
rolling_mean = df_station['PM2.5'].rolling(window=24*7, min_periods=24*3).mean()
ax1.plot(df_station['datetime'], rolling_mean, color='red', linewidth=1.5, label='Rolling mean (7 days)')
ax1.set_xlabel('Time')
ax1.set_ylabel('PM2.5 (μg/m³)')
ax1.set_title(f'PM2.5, full period - station {ANALYSIS_STATION} (2013-2017)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Panel 2: 2-month zoom (Dec 2016 - Jan 2017, a period that typically has high pollution)
zoom_start = '2016-12-01'
zoom_end = '2017-02-01'
df_zoom = df_station[(df_station['datetime'] >= zoom_start) & (df_station['datetime'] < zoom_end)]

ax2 = axes[1]
ax2.plot(df_zoom['datetime'], df_zoom['PM2.5'], linewidth=0.8, marker='.', markersize=2, alpha=0.7)
ax2.set_xlabel('Time')
ax2.set_ylabel('PM2.5 (μg/m³)')
ax2.set_title(f'PM2.5, 2-month zoom ({zoom_start} to {zoom_end}) - station {ANALYSIS_STATION}')
ax2.grid(True, alpha=0.3)

# Mark the AQI hazard thresholds; each label is shown in the legend
for threshold, color, label in [
    (35, 'green', 'Good (0-35)'),
    (75, 'yellow', 'Moderate (35-75)'),
    (150, 'orange', 'Unhealthy for Sensitive Groups'),
    (250, 'red', 'Unhealthy'),
]:
    ax2.axhline(y=threshold, color=color, linestyle='--', alpha=0.5, linewidth=1, label=label)
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\n📈 Observations from the plots:")
print("   1. Clear seasonality: PM2.5 is high in winter (Nov-Feb) and low in summer")
print("   2. There are sharp spikes (pollution peaks), especially in winter")
print("   3. Daily fluctuations are clearly visible in the 2-month zoom")
print("   4. The 7-day rolling mean shows a stable long-term trend")

## Q1.5 - Check autocorrelation
Goal: compare the autocorrelation of PM2.5 at lags of 24h (daily cycle) and 168h (weekly cycle) to look for periodicity.
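
One caveat when measuring lags in hours: `Series.autocorr(lag=k)` shifts by k rows, so if missing hours have been dropped, a lag of 24 rows is no longer exactly 24 hours. A sketch of a time-aligned alternative is shown below; `hourly_autocorr` is a hypothetical helper, and it assumes the series has a `DatetimeIndex` with unique timestamps.

```python
import pandas as pd


def hourly_autocorr(series, lag_hours):
    """Correlation between x(t) and x(t - lag_hours) on a regular hourly grid."""
    grid = pd.date_range(
        series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1)
    )
    # Missing hours are re-inserted as NaN, so a shift of k rows is exactly k hours.
    regular = series.reindex(grid)
    # Series.corr skips pairs where either value is NaN.
    return regular.corr(regular.shift(lag_hours))
```

With few missing values the two approaches should give similar numbers; the difference grows as more hours are missing.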

In [None]:
# Check autocorrelation
print("="*60)
print("AUTOCORRELATION CHECK")
print("="*60)

# PM2.5 series for one station (df_station from the Q1.4 cell), already sorted by time
# Note: dropna() removes missing hours, so a lag of k rows is only approximately k hours
pm25_series = df_station.set_index('datetime')['PM2.5'].dropna()

# Compute autocorrelation at key lags
lags_to_check = [1, 3, 6, 12, 24, 48, 72, 168, 336]
autocorr_values = {}

print("\n📊 Autocorrelation at selected lags:")
for lag in lags_to_check:
    if lag < len(pm25_series):
        acf_value = pm25_series.autocorr(lag=lag)
        autocorr_values[lag] = acf_value
        lag_label = ""
        if lag == 24:
            lag_label = " (1 day)"
        elif lag == 168:
            lag_label = " (1 week)"
        elif lag == 336:
            lag_label = " (2 weeks)"
        print(f"   Lag {lag:3d}h{lag_label:15s}: {acf_value:.4f}")

# Visualize autocorrelation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: autocorrelation at the selected lags
ax1 = axes[0]
lags = list(autocorr_values.keys())
acf_vals = list(autocorr_values.values())
colors = ['red' if l in [24, 168] else 'steelblue' for l in lags]
bars = ax1.bar(range(len(lags)), acf_vals, color=colors)
ax1.set_xticks(range(len(lags)))
ax1.set_xticklabels([str(l) for l in lags])
ax1.set_xlabel('Lag (hours)')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation at key lags')
ax1.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Highlight lags 24 and 168
for i, (lag, val) in enumerate(autocorr_values.items()):
    if lag in [24, 168]:
        ax1.annotate(f'{val:.3f}', (i, val), ha='center', va='bottom', fontsize=9, fontweight='bold')

# Plot 2: full ACF (using statsmodels)
from statsmodels.graphics.tsaplots import plot_acf
ax2 = axes[1]
plot_acf(pm25_series, lags=200, ax=ax2, alpha=0.05)
ax2.set_xlabel('Lag (hours)')
ax2.set_title('Full ACF (first 200 lags)')
ax2.axvline(x=24, color='red', linestyle='--', alpha=0.7, label='Lag 24h')
ax2.axvline(x=168, color='green', linestyle='--', alpha=0.7, label='Lag 168h')
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Observations:")
print(f"   - Lag-1h autocorrelation is very high ({autocorr_values.get(1, 0):.3f}): PM2.5 in a given hour depends strongly on the previous hour")
print(f"   - Lag-24h autocorrelation is still high ({autocorr_values.get(24, 0):.3f}): there is a clear daily cycle")
print(f"   - Lag-168h autocorrelation ({autocorr_values.get(168, 0):.3f}): there is a weekly-cycle signal, but it is weaker")
print("   - The ACF decays slowly → the series is highly persistent (the current value depends heavily on the past)")

## Q1.6 - Stationarity tests
Goal: run the ADF and KPSS tests to determine whether the series is stationary, and from that decide whether differencing (d) is needed.
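
Because the two tests have opposite null hypotheses (ADF: unit root; KPSS: stationary), their p-values have to be read in opposite directions. The sketch below encodes the usual four-way reading of the pair; `interpret_adf_kpss` is a hypothetical helper, the 0.05 level matches the cell below, and the suggested actions are rules of thumb rather than a definitive procedure.

```python
def interpret_adf_kpss(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a (label, suggestion) pair."""
    adf_stationary = adf_pvalue < alpha     # ADF rejects the unit-root null
    kpss_stationary = kpss_pvalue > alpha   # KPSS fails to reject the stationarity null

    if adf_stationary and kpss_stationary:
        return "stationary", "d=0 is a reasonable starting point"
    if adf_stationary and not kpss_stationary:
        return "difference stationary", "try differencing (d=1) and re-test"
    if not adf_stationary and kpss_stationary:
        return "trend stationary", "remove the trend, then re-test"
    return "non-stationary", "difference the series (d>=1) before fitting ARIMA"
```

For instance, `interpret_adf_kpss(adf_result[1], kpss_result[1])` could be called once both tests have been run.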

In [None]:
# Stationarity tests
from statsmodels.tsa.stattools import adfuller, kpss
import warnings
# Silences all warnings, including the KPSS InterpolationWarning that is raised
# when the statistic falls outside the p-value lookup table
warnings.filterwarnings('ignore')

print("="*60)
print("STATIONARITY TESTS")
print("="*60)

# ADF Test (Augmented Dickey-Fuller)
print("\n📊 ADF Test (Augmented Dickey-Fuller):")
print("   H0: the series has a unit root (non-stationary)")
print("   H1: the series is stationary")

adf_result = adfuller(pm25_series, autolag='AIC')
print(f"\n   ADF Statistic: {adf_result[0]:.6f}")
print(f"   p-value:       {adf_result[1]:.6f}")
print(f"   Lags used:     {adf_result[2]}")
print(f"   Observations:  {adf_result[3]}")
print("   Critical Values:")
for key, value in adf_result[4].items():
    print(f"      {key}: {value:.4f}")

if adf_result[1] < 0.05:
    print("\n   ✅ ADF CONCLUSION: p-value < 0.05 → reject H0 → the series IS STATIONARY")
else:
    print("\n   ⚠️ ADF CONCLUSION: p-value >= 0.05 → fail to reject H0 → the series is NOT STATIONARY")

# KPSS Test
print("\n" + "-"*60)
print("\n📊 KPSS Test (Kwiatkowski-Phillips-Schmidt-Shin):")
print("   H0: the series is stationary")
print("   H1: the series has a unit root (non-stationary)")

kpss_result = kpss(pm25_series, regression='c', nlags='auto')
print(f"\n   KPSS Statistic: {kpss_result[0]:.6f}")
print(f"   p-value:        {kpss_result[1]:.6f}")
print(f"   Lags used:      {kpss_result[2]}")
print("   Critical Values:")
for key, value in kpss_result[3].items():
    print(f"      {key}: {value:.4f}")

if kpss_result[1] > 0.05:
    print("\n   ✅ KPSS CONCLUSION: p-value > 0.05 → fail to reject H0 → the series IS STATIONARY")
else:
    print("\n   ⚠️ KPSS CONCLUSION: p-value <= 0.05 → reject H0 → the series is NOT STATIONARY")

# Overall conclusion
print("\n" + "="*60)
print("OVERALL STATIONARITY CONCLUSION")
print("="*60)

adf_stationary = adf_result[1] < 0.05
kpss_stationary = kpss_result[1] > 0.05

if adf_stationary and kpss_stationary:
    print("\n✅ BOTH TESTS indicate the series is STATIONARY")
    print("   → d=0 can be used in ARIMA")
elif adf_stationary and not kpss_stationary:
    print("\n⚠️ CONFLICTING RESULTS: ADF says stationary, KPSS says non-stationary")
    print("   → The series may be difference stationary")
    print("   → Consider d=0 or d=1, and check further with ACF/PACF")
elif not adf_stationary and kpss_stationary:
    print("\n⚠️ CONFLICTING RESULTS: ADF says non-stationary, KPSS says stationary")
    print("   → The series may be trend stationary; consider detrending, then analyze further")
else:
    print("\n❌ BOTH TESTS indicate the series is NOT STATIONARY")
    print("   → Differencing (d >= 1) is needed before applying ARIMA")

## Q1.7 - Discussion: which missing variable is most concerning for PM2.5 forecasting?

**Analysis and conclusions:**

1. **CO (~4.92% missing)**: This is the variable with the most missing values. CO (carbon monoxide) is usually correlated with PM2.5 because the two share emission sources (traffic, fuel combustion). However, CO is not the main feature for forecasting PM2.5.

2. **PM2.5 (~2.08% missing)**: This is the **most concerning variable** because:
   - PM2.5 is the **target variable**: where it is missing, there is nothing to forecast
   - For a univariate ARIMA model, missing PM2.5 creates **gaps in the series**, which affects how the autocorrelation structure is learned
   - For a regression model, missing PM2.5 also means missing **lag features** (PM2.5_lag1, lag3, lag24)

3. **Meteorological variables (TEMP, PRES, DEWP, RAIN, WSPM < 0.1%)**: Very little is missing, so these are not a concern. This is good news because these variables are important features for regression/SARIMAX models.

**Conclusion**:
- For **ARIMA**: missing PM2.5 is the most dangerous case → interpolate/fill before fitting
- For **regression**: missing PM2.5 and CO affect the lag features → drop the rows or fill sensibly
- Handling strategy: time-based interpolation for short gaps, or forward-fill if the gaps are small
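
As an illustration of the "interpolate short gaps only" strategy, the sketch below fills runs of missing values up to a chosen length and leaves longer runs untouched. The function name and the default `max_gap=3` are assumptions, not values taken from the project; the series must have a `DatetimeIndex` for `method="time"` to work, and it should cover a single station.

```python
import pandas as pd


def fill_short_gaps(series, max_gap=3):
    """Time-interpolate NaN runs of at most `max_gap` points; leave longer runs as NaN."""
    is_na = series.isna()
    # Label consecutive runs, then attach to every row the length of its NaN run.
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("sum")

    # Interpolate only between valid observations (no extrapolation at the edges).
    filled = series.interpolate(method="time", limit_area="inside")

    # Keep interpolated values only for rows that were observed or sit in a short run.
    return filled.where(~is_na | (run_len <= max_gap), series)
```

Using `interpolate(limit=...)` alone would not give the same result, because `limit` partially fills long gaps instead of skipping them.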

In [None]:
class_dist = df['aqi_class'].value_counts(dropna=False)
class_dist

In [None]:
class_dist.drop(index=[x for x in class_dist.index if pd.isna(x)], errors='ignore').plot(kind='bar')
plt.title('AQI class distribution (PM2.5 24h mean)')
plt.ylabel('count')
plt.tight_layout()
plt.show()

In [None]:
df.to_parquet(OUT_PATH, index=False)
print('Saved:', OUT_PATH)