# Final Analysis EDA

This notebook explores `data/FINAL_ANALYSIS.csv` to understand distributions, trends, correlations, and AQI-weather relationships across countries and regions. Figures are saved to `visualizations/final_analysis/` for the report.


In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

# Configure plotting
sns.set(style="whitegrid", context="notebook")
plt.rcParams.update({"figure.dpi": 140, "savefig.dpi": 200})

# Paths
PROJECT_ROOT = Path("/Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA")
DATA_PATH = PROJECT_ROOT / "data" / "FINAL_ANALYSIS.csv"
VIS_DIR = PROJECT_ROOT / "visualizations" / "final_analysis"
VIS_DIR.mkdir(parents=True, exist_ok=True)

# Load data
df = pd.read_csv(DATA_PATH, parse_dates=["Date"]) 
print(df.shape)
df.head()


(61581, 12)


|   | Country | Region | Date | AQI | Temperature | RelativeHumidity | WindSpeed | Year | Month | Quarter | MonthName | AQI_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Singapore | Central | 2016-02-07 | 47.0 | 25.87 | 87.26 | 17.8 | 2016 | 2 | 1 | February | Good |
| 1 | Singapore | Central | 2016-02-08 | 57.0 | 26.26 | 86.48 | 16.4 | 2016 | 2 | 1 | February | Moderate |
| 2 | Singapore | Central | 2016-02-09 | 57.88 | 25.97 | 87.06 | 9.5 | 2016 | 2 | 1 | February | Moderate |
| 3 | Singapore | Central | 2016-02-10 | 54.67 | 25.68 | 87.64 | 12.0 | 2016 | 2 | 1 | February | Moderate |
| 4 | Singapore | Central | 2016-02-11 | 32.79 | 26.36 | 86.28 | 11.0 | 2016 | 2 | 1 | February | Good |


In [12]:
# Basic info and missingness overview
summary = {
    "rows": len(df),
    "columns": df.shape[1],
    "date_range": (df["Date"].min(), df["Date"].max()),
    "countries": df["Country"].nunique(),
    "regions": df["Region"].nunique(),
    "aqi_categories": df["AQI_Category"].value_counts().to_dict(),
}
print(summary)

# Fraction of missing values per column (mean of the boolean NaN mask)
missing = df.isna().mean().sort_values(ascending=False)
print("Missingness (fraction):")
print(missing)

plt.figure(figsize=(8, 4))
sns.barplot(x=missing.values, y=missing.index, color="#4C78A8")
plt.title("Missingness by Column")
plt.xlabel("Fraction Missing")
plt.tight_layout()
plt.savefig(VIS_DIR / "00_missingness.png")
plt.close()


{'rows': 61581, 'columns': 12, 'date_range': (Timestamp('2014-01-01 00:00:00'), Timestamp('2024-12-31 00:00:00')), 'countries': 3, 'regions': 18, 'aqi_categories': {'Good': 28200, 'Moderate': 18867, 'Unhealthy': 8938, 'Unhealthy for Sensitive Groups': 4268, 'Very Unhealthy': 753, 'Unknown': 532, 'Hazardous': 23}}
Missingness (fraction):
AQI                 0.008639
Country             0.000000
Region              0.000000
Date                0.000000
Temperature         0.000000
RelativeHumidity    0.000000
WindSpeed           0.000000
Year                0.000000
Month               0.000000
Quarter             0.000000
MonthName           0.000000
AQI_Category        0.000000
dtype: float64
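
The category counts printed above are heavily skewed: `Hazardous` has only 23 of the 61,581 rows, and the 532 `Unknown` rows match the number of missing `AQI` values (0.008639 × 61581 ≈ 532). As a rough sketch of how this imbalance could be quantified before modelling, the snippet below recomputes class shares and inverse-frequency weights from those printed counts. The weighting formula `n_samples / (n_classes * count)` is a common heuristic rather than something this notebook uses, and excluding `Unknown` is an assumption.

```python
# Counts copied from the printed summary above.
aqi_category_counts = {
    "Good": 28200,
    "Moderate": 18867,
    "Unhealthy": 8938,
    "Unhealthy for Sensitive Groups": 4268,
    "Very Unhealthy": 753,
    "Unknown": 532,
    "Hazardous": 23,
}

# Assumption: rows labelled "Unknown" (missing AQI) are excluded from modelling.
labelled = {k: v for k, v in aqi_category_counts.items() if k != "Unknown"}
n_samples = sum(labelled.values())
n_classes = len(labelled)

# Share of each class and an inverse-frequency weight (illustrative heuristic).
class_share = {k: v / n_samples for k, v in labelled.items()}
class_weight = {k: n_samples / (n_classes * v) for k, v in labelled.items()}

# Print classes from most to least frequent.
for name in sorted(labelled, key=labelled.get, reverse=True):
    print(f"{name:32s} share={class_share[name]:.4f} weight={class_weight[name]:.2f}")
```

Rare classes receive large weights under this heuristic, which is one way to make the imbalance visible when comparing classifiers.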


In [13]:
# Numeric distributions
numeric_cols = ["AQI", "Temperature", "RelativeHumidity", "WindSpeed"]
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
axes = axes.ravel()
for i, col in enumerate(numeric_cols):
    sns.histplot(data=df, x=col, kde=True, ax=axes[i], color="#72B7B2")
    axes[i].set_title(f"Distribution of {col}")
plt.tight_layout()
plt.savefig(VIS_DIR / "01_distributions.png")
plt.close()


In [14]:
# Correlation heatmap
numeric_df = df[["AQI", "Temperature", "RelativeHumidity", "WindSpeed"]]
corr = numeric_df.corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="vlag", vmin=-1, vmax=1, fmt=".2f")
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig(VIS_DIR / "02_correlation_heatmap.png")
plt.close()
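
The heatmap above uses Pearson correlation (the default of `DataFrame.corr`), which measures linear association only. As a hedged sketch, a rank-based (Spearman) correlation can be obtained by correlating the per-column ranks; comparing the two matrices can hint at monotonic but nonlinear relationships. The helper name `spearman_corr` and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd


def spearman_corr(frame: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation as the Pearson correlation of per-column ranks.

    Rows with any missing value are dropped first so that every column
    is ranked on the same observations.
    """
    return frame.dropna().rank().corr()


# Toy data (hypothetical): y is a monotonic but nonlinear function of x.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr()           # linear association
spearman = spearman_corr(toy)  # monotonic association

print("Pearson :", round(pearson.loc["x", "y"], 3))
print("Spearman:", round(spearman.loc["x", "y"], 3))
```

In the notebook, the same helper could be applied as `spearman_corr(numeric_df)` and plotted with the existing `sns.heatmap` call.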


In [15]:
# Time series trends by country (lineplot averages all rows per date and country, and shades a 95% CI band)
ts_df = df.sort_values("Date").copy()
plt.figure(figsize=(10, 5))
sns.lineplot(data=ts_df, x="Date", y="AQI", hue="Country")
plt.title("AQI Over Time by Country")
plt.ylabel("AQI")
plt.tight_layout()
plt.savefig(VIS_DIR / "03_time_series_aqi_by_country.png")
plt.close()


In [16]:
# Yearly trend and monthly seasonality
# groupby(..., as_index=False) already returns a DataFrame
yearly_df = df.groupby(["Country", "Year"], as_index=False)["AQI"].mean()
plt.figure(figsize=(8, 5))
sns.lineplot(data=yearly_df, x="Year", y="AQI", hue="Country", marker="o")
plt.title("Yearly Average AQI by Country")
plt.tight_layout()
plt.savefig(VIS_DIR / "04_yearly_avg_aqi.png")
plt.close()

monthly_df = df.groupby(["Country", "Month"], as_index=False)["AQI"].mean()
plt.figure(figsize=(8, 5))
sns.lineplot(data=monthly_df, x="Month", y="AQI", hue="Country", marker="o")
plt.title("Monthly Average AQI by Country")
plt.tight_layout()
plt.savefig(VIS_DIR / "05_monthly_avg_aqi.png")
plt.close()


In [17]:
# Regional comparison boxplots
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="Country", y="AQI", hue="Region")
plt.title("Regional AQI Distribution by Country")
plt.tight_layout()
plt.savefig(VIS_DIR / "06_regional_boxplots.png")
plt.close()


In [18]:
# AQI Category distribution by country
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x="Country", hue="AQI_Category")
plt.title("AQI Category Distribution by Country")
plt.tight_layout()
plt.savefig(VIS_DIR / "07_aqi_category_by_country.png")
plt.close()


In [19]:
# Relationships: AQI vs weather
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
sns.scatterplot(data=df, x="Temperature", y="AQI", hue="Country", ax=axes[0], alpha=0.5)
axes[0].set_title("AQI vs Temperature")
sns.scatterplot(data=df, x="RelativeHumidity", y="AQI", hue="Country", ax=axes[1], alpha=0.5)
axes[1].set_title("AQI vs RelativeHumidity")
sns.scatterplot(data=df, x="WindSpeed", y="AQI", hue="Country", ax=axes[2], alpha=0.5)
axes[2].set_title("AQI vs WindSpeed")
plt.tight_layout()
plt.savefig(VIS_DIR / "08_aqi_vs_weather_scatter.png")
plt.close()


In [20]:
# Save a compact textual summary for the report
summary_lines = []
summary_lines.append(f"Rows: {len(df)}")
summary_lines.append(f"Date range: {df['Date'].min().date()} to {df['Date'].max().date()}")
summary_lines.append(f"Countries: {df['Country'].nunique()}, Regions: {df['Region'].nunique()}")
summary_lines.append(f"AQI category counts: {df['AQI_Category'].value_counts().to_dict()}")

with open(VIS_DIR / "00_analysis_summary.txt", "w") as f:
    f.write("\n".join(summary_lines))

print("Saved summary and figures to:", VIS_DIR)


Saved summary and figures to: /Users/sharin/Downloads/COS30049/Assignment/Assignment_2/COS30049-Computing-Technology-Innovation-Project-by-YSA/visualizations/final_analysis


## Notes for Model Selection

Candidate models, to be finalized once the EDA is complete:
- Classification (predicting `AQI_Category`): Logistic Regression, KNN, Random Forest Classifier.
- Regression (predicting numeric `AQI`): Linear Regression, Random Forest Regressor.

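Selection criteria:
- Linearity and feature correlation (heatmap, scatter plots): roughly linear relationships favor the linear models.
- Nonlinear patterns and feature interactions: these favor tree-based models such as Random Forest.
- Class balance across `AQI_Category`: `Hazardous` has only 23 rows, so consider resampling or class weights.
- Predictive stability across regions and countries: evaluate with cross-validation grouped by region or country.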
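
The last criterion, stability across countries or regions, can be checked by holding out one group at a time. Below is a minimal sketch of such leave-one-group-out splitting in plain Python. The helper name `leave_one_group_out` and the toy labels are hypothetical, and the notebook does not yet commit to a particular splitter.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices), one split per distinct group."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Toy group labels (hypothetical); in the notebook this could be df["Country"].tolist().
toy_groups = ["A", "A", "B", "C", "B", "C", "C"]
splits = list(leave_one_group_out(toy_groups))

for held_out, train_idx, test_idx in splits:
    print(f"held out {held_out}: train={train_idx} test={test_idx}")
```

Because every row of a held-out group lands in the test set, a large gap between per-group scores would suggest that a model does not transfer well between countries or regions.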

