# Urban Air Quality Forecaster
## Data Ingestion Notebook

This notebook handles:
- Fetching air quality data
- Fetching weather data
- Basic validation and storage


This notebook ingests:
- Air quality sensor data (PM2.5, NO2, O3)
- Weather data (temperature, wind)


In [20]:
# Imports

import pandas as pd
import numpy as np
import requests
from datetime import datetime

In [21]:
print("Kernel locked and ready")
print("Notebook environment ready")
# The lines above are a sanity check that the kernel is running and the cell executed.

Kernel locked and ready
Notebook environment ready


Mock air quality sensor data
- Simulate 24 hourly readings first

In [22]:
# 24 hourly readings at random locations inside a small lat/lon bounding box
data = {
    "timestamp": pd.date_range(start=datetime.now(), periods=24, freq="h"),
    "pm25": np.random.uniform(10, 150, 24),
    "no2": np.random.uniform(5, 80, 24),
    "o3": np.random.uniform(10, 120, 24),
    "lat": np.random.uniform(12.90, 13.05, 24),
    "lon": np.random.uniform(77.50, 77.70, 24),
}

df = pd.DataFrame(data)
df.head()



Unnamed: 0,timestamp,pm25,no2,o3,lat,lon
0,2026-01-18 17:32:25.406103,61.078337,66.99371,114.895069,12.954615,77.512818
1,2026-01-18 18:32:25.406103,133.877891,74.383799,104.75448,13.022563,77.606302
2,2026-01-18 19:32:25.406103,146.515163,28.464076,93.622592,12.918892,77.563464
3,2026-01-18 20:32:25.406103,111.58352,24.913692,88.19834,13.02924,77.699484
4,2026-01-18 21:32:25.406103,80.024366,24.61175,53.408993,12.96505,77.572189
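
The overview lists basic validation as one of the notebook's tasks. A minimal sketch of what such a check could look like is given here. The function name `validate_readings` and the values in `BOUNDS` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds (assumptions for illustration only).
BOUNDS = {
    "pm25": (0, 1000),
    "no2": (0, 1000),
    "o3": (0, 1000),
    "lat": (-90, 90),
    "lon": (-180, 180),
}


def validate_readings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return only rows with no missing values and every column within BOUNDS."""
    required = ["timestamp", *BOUNDS]
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Start with rows that have no missing values in the required columns.
    ok = frame[required].notna().all(axis=1)
    # Then require every measured value to lie inside its (inclusive) bounds.
    for col, (low, high) in BOUNDS.items():
        ok &= frame[col].between(low, high)
    return frame[ok].reset_index(drop=True)


# Small example: the second row has a negative PM2.5 value,
# the third row has a missing NO2 value.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2026-01-18 17:00", "2026-01-18 18:00", "2026-01-18 19:00"]
    ),
    "pm25": [61.0, -5.0, 80.0],
    "no2": [67.0, 74.0, np.nan],
    "o3": [114.0, 104.0, 53.0],
    "lat": [12.95, 13.02, 12.96],
    "lon": [77.51, 77.60, 77.57],
})
clean = validate_readings(sample)
print(len(clean))
```

Dropping invalid rows is only one option; flagging them in an extra column would keep the raw record intact.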


Save raw data to CSV

In [23]:
df.to_csv("C:/Users/Navyashree/Documents/urban-air-quality-forecaster/data/raw/air_quality_mock.csv", index=False)
print("Raw data saved")

Raw data saved
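
The absolute path used for saving exists on only one machine. One possible alternative, sketched under the assumption that the project keeps a `data/raw` folder below some root directory, builds the path with `pathlib`. The helper `save_raw` and its `project_root` parameter are hypothetical names, and the demonstration writes into a temporary folder rather than the real project.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_raw(frame: pd.DataFrame, project_root: Path,
             name: str = "air_quality_mock.csv") -> Path:
    """Write frame to <project_root>/data/raw/<name>, creating folders if needed."""
    out_dir = project_root / "data" / "raw"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / name
    frame.to_csv(out_path, index=False)
    return out_path


# Demonstration in a temporary folder so nothing is written into a real project.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"pm25": [61.0, 133.9], "no2": [67.0, 74.4]})
    saved = save_raw(demo, Path(tmp))
    round_trip = pd.read_csv(saved)
    print(saved.name, round_trip.shape)
```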


Feature Engineering: Spatial Grid Mapping (CORE CONCEPT)
- “Transform irregular point-source data into grid-cell-specific features (1 km × 1 km)”

Load raw data

In [24]:
df = pd.read_csv("C:/Users/Navyashree/Documents/urban-air-quality-forecaster/data/raw/air_quality_mock.csv")
df.head()

Unnamed: 0,timestamp,pm25,no2,o3,lat,lon
0,2026-01-18 17:32:25.406103,61.078337,66.99371,114.895069,12.954615,77.512818
1,2026-01-18 18:32:25.406103,133.877891,74.383799,104.75448,13.022563,77.606302
2,2026-01-18 19:32:25.406103,146.515163,28.464076,93.622592,12.918892,77.563464
3,2026-01-18 20:32:25.406103,111.58352,24.913692,88.19834,13.02924,77.699484
4,2026-01-18 21:32:25.406103,80.024366,24.61175,53.408993,12.96505,77.572189


Create spatial grid:
Simulate a 1 km grid by rounding latitude and longitude to two decimal places.

In [25]:
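# Approx. 1 km ≈ 0.01 degrees (closer to 1.1 km at this latitude);
# rounding snaps each point to the nearest grid-cell center.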
df["grid_lat"] = df["lat"].round(2)
df["grid_lon"] = df["lon"].round(2)

df[["lat", "lon", "grid_lat", "grid_lon"]].head()

Unnamed: 0,lat,lon,grid_lat,grid_lon
0,12.954615,77.512818,12.95,77.51
1,13.022563,77.606302,13.02,77.61
2,12.918892,77.563464,12.92,77.56
3,13.02924,77.699484,13.03,77.7
4,12.96505,77.572189,12.97,77.57
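
The "0.01 degrees is about 1 km" rule can be checked with a short calculation. The sketch below assumes a spherical-Earth approximation of 111.32 km per degree of latitude; `cell_size_km` is a hypothetical helper, not part of the original notebook.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude, in km


def cell_size_km(lat_deg: float, step_deg: float = 0.01) -> tuple:
    """Approximate (north-south, east-west) size in km of a step_deg cell at lat_deg."""
    north_south = KM_PER_DEGREE_LAT * step_deg
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = KM_PER_DEGREE_LAT * math.cos(math.radians(lat_deg)) * step_deg
    return north_south, east_west


ns, ew = cell_size_km(13.0)
print(f"{ns:.2f} km x {ew:.2f} km")  # prints "1.11 km x 1.08 km"
```

At the latitudes used in the mock data, a 0.01 degree cell is therefore slightly larger than 1 km on each side, which is close enough for a simulated grid.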


Aggregate per grid cell and hour

In [29]:
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.floor("h")

grid_df = (
    df.groupby(["grid_lat","grid_lon","hour"])
    .agg({
        "pm25": "mean",
        "no2": "mean",
        "o3": "mean"
    })
    .reset_index()
)

grid_df.head()


Unnamed: 0,grid_lat,grid_lon,hour,pm25,no2,o3
0,12.9,77.63,2026-01-19 10:00:00,96.479902,40.746327,33.020855
1,12.91,77.68,2026-01-18 22:00:00,128.436367,28.942489,101.960714
2,12.92,77.56,2026-01-18 19:00:00,146.515163,28.464076,93.622592
3,12.92,77.63,2026-01-19 15:00:00,105.413565,69.038529,44.56536
4,12.93,77.54,2026-01-19 13:00:00,13.666896,43.048675,52.656615
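
Because the 24 mock readings fall in 24 different hours, every grid-cell and hour group in this run holds exactly one reading, so each mean equals the raw value. A sketch using pandas named aggregation shows how a reading count could be kept alongside the mean once several sensors share a cell. The column names `pm25_mean` and `n_obs` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Three made-up readings: two share a grid cell and hour, one stands alone.
readings = pd.DataFrame({
    "grid_lat": [12.95, 12.95, 13.02],
    "grid_lon": [77.51, 77.51, 77.61],
    "hour": pd.to_datetime(
        ["2026-01-18 17:00", "2026-01-18 17:00", "2026-01-18 18:00"]
    ),
    "pm25": [60.0, 80.0, 130.0],
})

summary = (
    readings.groupby(["grid_lat", "grid_lon", "hour"])
    .agg(pm25_mean=("pm25", "mean"), n_obs=("pm25", "size"))
    .reset_index()
)
print(summary)
```

A count column like this makes it easy to see which cell averages rest on a single sensor.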


Save the processed data to CSV

In [32]:
grid_df.to_csv("C:/Users/Navyashree/Documents/urban-air-quality-forecaster/data/processed/grid_air_quality.csv",index=False)
print("Processed grid data saved")

Processed grid data saved
