## Step 1: Data Collection via API (46 Major Indian Cities)

To build a robust, real-world dataset for air quality and environmental analysis, I fetched **hourly air quality and weather data** for **46 major Indian cities**, covering most state capitals, several union territory capitals, and other urban hubs like Pune, Nagpur, Jamshedpur, and Dhanbad.

This was done using:
- **Open-Meteo Air Quality API**: For pollutants like PM10, PM2.5, Ozone, NO₂, SO₂, and CO
- **Open-Meteo Weather API**: For weather-related factors such as temperature, humidity, wind, pressure, and precipitation

📅 **Date Range**: From **25th April 2025 to 25th July 2025**  
📈 **Frequency**: Hourly (to enable high-resolution analysis)

**Collected Features Include:**
- **Air Pollutants**: PM2.5, PM10, O₃, NO₂, SO₂, CO
- **AQI Values**: Based on **US AQI** and **European AQI** standards
- **Meteorological Features**: Temperature, Humidity, Wind Speed/Direction, Surface Pressure, Precipitation
- **City Name + Timestamp**

Why this matters:
- Enables **fine-grained, time-series modeling**
- Much more **customizable and up-to-date** than static Kaggle datasets
- Demonstrates ability to **integrate multiple APIs**, handle real-world data complexities, and scale across cities


In [62]:
import requests
import pandas as pd
import time

# City list with lat/lon
cities = [
    {"name": "Delhi", "lat": 28.6139, "lon": 77.2090},
    {"name": "Mumbai", "lat": 19.0760, "lon": 72.8777},
    {"name": "Bangalore", "lat": 12.9716, "lon": 77.5946},
    {"name": "Chennai", "lat": 13.0827, "lon": 80.2707},
    {"name": "Kolkata", "lat": 22.5726, "lon": 88.3639},
    {"name": "Hyderabad", "lat": 17.3850, "lon": 78.4867},
    {"name": "Ahmedabad", "lat": 23.0225, "lon": 72.5714},
    {"name": "Pune", "lat": 18.5204, "lon": 73.8567},
    {"name": "Nagpur", "lat": 21.1458, "lon": 79.0882},
    {"name": "Jamshedpur", "lat": 22.8046, "lon": 86.2029},
    {"name": "Dhanbad", "lat": 23.7957, "lon": 86.4304},
    {"name": "Lucknow", "lat": 26.8467, "lon": 80.9462},
    {"name": "Jaipur", "lat": 26.9124, "lon": 75.7873},
    {"name": "Patna", "lat": 25.5941, "lon": 85.1376},
    {"name": "Bhopal", "lat": 23.2599, "lon": 77.4126},
    {"name": "Raipur", "lat": 21.2514, "lon": 81.6296},
    {"name": "Bhubaneswar", "lat": 20.2961, "lon": 85.8245},
    {"name": "Thiruvananthapuram", "lat": 8.5241, "lon": 76.9366},
    {"name": "Imphal", "lat": 24.8170, "lon": 93.9368},
    {"name": "Shillong", "lat": 25.5788, "lon": 91.8933},
    {"name": "Aizawl", "lat": 23.7271, "lon": 92.7176},
    {"name": "Kohima", "lat": 25.6701, "lon": 94.1077},
    {"name": "Itanagar", "lat": 27.0844, "lon": 93.6053},
    {"name": "Agartala", "lat": 23.8315, "lon": 91.2868},
    {"name": "Gangtok", "lat": 27.3389, "lon": 88.6065},
    {"name": "Dispur", "lat": 26.1433, "lon": 91.7898},
    {"name": "Panaji", "lat": 15.4909, "lon": 73.8278},
    {"name": "Chandigarh", "lat": 30.7333, "lon": 76.7794},
    {"name": "Shimla", "lat": 31.1048, "lon": 77.1734},
    {"name": "Dehradun", "lat": 30.3165, "lon": 78.0322},
    {"name": "Ranchi", "lat": 23.3441, "lon": 85.3096},
    {"name": "Guwahati", "lat": 26.1445, "lon": 91.7362},
    {"name": "Puducherry", "lat": 11.9416, "lon": 79.8083},
    {"name": "Port Blair", "lat": 11.6234, "lon": 92.7265},
    {"name": "Leh", "lat": 34.1526, "lon": 77.5771},
    {"name": "Srinagar", "lat": 34.0837, "lon": 74.7973},
    {"name": "Amritsar", "lat": 31.6340, "lon": 74.8723},
    {"name": "Gandhinagar", "lat": 23.2156, "lon": 72.6369},
    {"name": "Noida", "lat": 28.5355, "lon": 77.3910},
    {"name": "Faridabad", "lat": 28.4089, "lon": 77.3178},
    {"name": "Ghaziabad", "lat": 28.6692, "lon": 77.4538},
    {"name": "Varanasi", "lat": 25.3176, "lon": 82.9739},
    {"name": "Kanpur", "lat": 26.4499, "lon": 80.3319},
    {"name": "Surat", "lat": 21.1702, "lon": 72.8311},
    {"name": "Visakhapatnam", "lat": 17.6868, "lon": 83.2185},
    {"name": "Indore", "lat": 22.7196, "lon": 75.8577}
]

# Date range
start_date = "2025-04-25"
end_date = "2025-07-25"

# API base URL
base_url = "https://air-quality-api.open-meteo.com/v1/air-quality"

# Parameters to include
aqi_params = "pm10,pm2_5,ozone,nitrogen_dioxide,sulphur_dioxide,carbon_monoxide,us_aqi,european_aqi"

# Store data
all_data = []

for city in cities:
    params = {
        "latitude": city["lat"],
        "longitude": city["lon"],
        "start_date": start_date,
        "end_date": end_date,
        "hourly": aqi_params,
        "timezone": "Asia/Kolkata"
    }

    try:
        response = requests.get(base_url, params=params, timeout=30)  # fail fast instead of hanging on a stalled connection
        response.raise_for_status()
        data = response.json()["hourly"]

        num_hours = len(data["time"])

        for i in range(num_hours):
            record = {
                "city": city["name"],
                "datetime": data["time"][i],
                "pm10": data["pm10"][i],
                "pm2_5": data["pm2_5"][i],
                "ozone": data["ozone"][i],
                "nitrogen_dioxide": data["nitrogen_dioxide"][i],
                "sulphur_dioxide": data["sulphur_dioxide"][i],
                "carbon_monoxide": data["carbon_monoxide"][i],
                "us_aqi": data["us_aqi"][i],
                "european_aqi": data["european_aqi"][i]
            }
            all_data.append(record)

        print(f"✔ Data fetched for {city['name']}")
        time.sleep(1)  # Respect rate limits

    except Exception as e:
        print(f"❌ Failed for {city['name']}: {e}")

# Save to CSV
df = pd.DataFrame(all_data)
df.to_csv("india_aqi_extended_apr25_jul25_2025.csv", index=False)
print("✅ Data saved to 'india_aqi_extended_apr25_jul25_2025.csv'")


✔ Data fetched for Delhi
✔ Data fetched for Mumbai
✔ Data fetched for Bangalore
✔ Data fetched for Chennai
✔ Data fetched for Kolkata
✔ Data fetched for Hyderabad
✔ Data fetched for Ahmedabad
✔ Data fetched for Pune
✔ Data fetched for Nagpur
✔ Data fetched for Jamshedpur
✔ Data fetched for Dhanbad
✔ Data fetched for Lucknow
✔ Data fetched for Jaipur
✔ Data fetched for Patna
✔ Data fetched for Bhopal
✔ Data fetched for Raipur
✔ Data fetched for Bhubaneswar
✔ Data fetched for Thiruvananthapuram
✔ Data fetched for Imphal
✔ Data fetched for Shillong
✔ Data fetched for Aizawl
✔ Data fetched for Kohima
✔ Data fetched for Itanagar
✔ Data fetched for Agartala
✔ Data fetched for Gangtok
✔ Data fetched for Dispur
✔ Data fetched for Panaji
✔ Data fetched for Chandigarh
✔ Data fetched for Shimla
✔ Data fetched for Dehradun
✔ Data fetched for Ranchi
✔ Data fetched for Guwahati
✔ Data fetched for Puducherry
✔ Data fetched for Port Blair
✔ Data fetched for Leh
✔ Data fetched for Srinagar
✔ Data fetch
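
The cell above requests only the air-quality endpoint, yet the dataset inspected in Step 2 also carries six weather columns. The weather request and the merge are not shown here, so the following is a minimal sketch of how that second pass could look, not the notebook's actual code. The archive endpoint URL, the Open-Meteo variable names (`relative_humidity_2m`, `wind_speed_10m`, `wind_direction_10m`), the rename map, and the helper names `fetch_hourly`, `hourly_to_frame`, and `merge_city` are assumptions chosen for illustration.

```python
import pandas as pd
import requests

# Assumed endpoint for historical weather; the notebook does not show which one was used.
WEATHER_URL = "https://archive-api.open-meteo.com/v1/archive"

# Assumed Open-Meteo variable names for the six weather features.
WEATHER_VARS = (
    "temperature_2m,relative_humidity_2m,wind_speed_10m,"
    "wind_direction_10m,surface_pressure,precipitation"
)

# Assumed mapping from API names to the column names seen in Step 2.
WEATHER_RENAME = {
    "relative_humidity_2m": "humidity",
    "wind_speed_10m": "wind_speed",
    "wind_direction_10m": "wind_direction",
}


def fetch_hourly(url, lat, lon, hourly, start_date, end_date):
    """Request one city's hourly block and return the parsed 'hourly' dict."""
    params = {
        "latitude": lat, "longitude": lon,
        "start_date": start_date, "end_date": end_date,
        "hourly": hourly, "timezone": "Asia/Kolkata",
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["hourly"]


def hourly_to_frame(hourly, city):
    """Turn an 'hourly' dict of equal-length lists into a tidy DataFrame."""
    frame = pd.DataFrame(hourly).rename(columns={"time": "datetime", **WEATHER_RENAME})
    frame.insert(0, "city", city)
    return frame


def merge_city(aq_frame, wx_frame):
    """Left-join weather onto air quality so every air-quality hour is kept."""
    return aq_frame.merge(wx_frame, on=["city", "datetime"], how="left")
```

A left join keeps every air-quality row and leaves `NaN` wherever the weather response has no matching hour, which would be consistent with gaps that appear only in the weather columns.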

## Step 2: Data Inspection & Preprocessing

This step ensures the dataset is reliable, consistent, and ready for deeper analysis. After collecting the data from APIs for 46 major Indian cities, I performed basic inspections and preprocessing to handle potential issues.

#### ✅ What Was Done:

- **Initial Dataset Exploration:**
  - Verified dataset shape, data types, and column names.
  - Checked sample rows to confirm the structure.
  - Validated inclusion of pollutant levels, meteorological variables, AQI indices, and timestamps.

- **City-wise Missing Value Treatment:**
  - Missing values were found only in weather-related features (828 rows each, about 0.8% of the 101,568 rows):
    - `temperature_2m`, `humidity`, `wind_speed`, `wind_direction`, `surface_pressure`, and `precipitation`.
  - Imputation strategy:
    - `temperature_2m`, `humidity`, `wind_direction`: Forward fill + backward fill.
    - `wind_speed`, `surface_pressure`: Linear interpolation + forward/backward fill.
    - `precipitation`: Missing values filled with `0`, on the assumption that a missing reading means no recorded rainfall.
  - Applied imputations **per city** to preserve time-series and regional patterns.

- **Saved the cleaned dataset** as `cleaned_aqi_weather_dataset.csv` for further use.

After this step no missing values remain, and because every fill is computed within a single city's time-ordered series, values never leak from one city into another. Both properties matter for downstream modeling and exploratory analysis.


In [69]:
# 1. Basic info
print("\n📌 Dataset Info:")
df.info()


📌 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101568 entries, 0 to 101567
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   city              101568 non-null  object 
 1   datetime          101568 non-null  object 
 2   pm10              101568 non-null  float64
 3   pm2_5             101568 non-null  float64
 4   ozone             101568 non-null  float64
 5   nitrogen_dioxide  101568 non-null  float64
 6   sulphur_dioxide   101568 non-null  float64
 7   carbon_monoxide   101568 non-null  float64
 8   us_aqi            101568 non-null  int64  
 9   european_aqi      101568 non-null  int64  
 10  temperature_2m    100740 non-null  float64
 11  humidity          100740 non-null  float64
 12  wind_speed        100740 non-null  float64
 13  wind_direction    100740 non-null  float64
 14  surface_pressure  100740 non-null  float64
 15  precipitation     100740 non-null  float64
dtypes: 

In [70]:
# 2. Basic statistics
print("\n📌 Descriptive Statistics:")
print(df.describe(include='all'))


📌 Descriptive Statistics:
          city          datetime           pm10          pm2_5          ozone  \
count   101568            101568  101568.000000  101568.000000  101568.000000   
unique      46              2208            NaN            NaN            NaN   
top      Delhi  2025-07-25T07:00            NaN            NaN            NaN   
freq      2208                46            NaN            NaN            NaN   
mean       NaN               NaN      79.776889      31.739604      87.996712   
std        NaN               NaN     192.325726      29.524709      46.076639   
min        NaN               NaN       0.000000       0.000000       0.000000   
25%        NaN               NaN      18.200000      13.200000      52.000000   
50%        NaN               NaN      34.300000      24.100000      78.000000   
75%        NaN               NaN      61.900000      40.600000     117.000000   
max        NaN               NaN    3173.900000     380.900000     383.000000   



In [71]:
# 3. Check for missing values
print("\n📌 Missing Values:")
print(df.isnull().sum())


📌 Missing Values:
city                  0
datetime              0
pm10                  0
pm2_5                 0
ozone                 0
nitrogen_dioxide      0
sulphur_dioxide       0
carbon_monoxide       0
us_aqi                0
european_aqi          0
temperature_2m      828
humidity            828
wind_speed          828
wind_direction      828
surface_pressure    828
precipitation       828
dtype: int64
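
All six weather columns report the same count, and 828 = 46 × 18, which would mean 18 missing hours per city if the gaps are spread evenly. That is only an inference from the totals, and a per-city breakdown would confirm or refute it. A minimal sketch, where `missing_by_city` is a hypothetical helper and the toy frame (with made-up values) stands in for `df`:

```python
import numpy as np
import pandas as pd

WEATHER_COLS = ["temperature_2m", "humidity", "wind_speed",
                "wind_direction", "surface_pressure", "precipitation"]


def missing_by_city(frame, columns):
    """Count missing values per city for whichever of the given columns exist."""
    present = [c for c in columns if c in frame.columns]
    return frame[present].isna().groupby(frame["city"]).sum()


# Toy data standing in for the real df (values are made up).
toy = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune", "Pune"],
    "temperature_2m": [31.0, np.nan, np.nan, np.nan],
    "precipitation": [0.0, np.nan, 0.2, np.nan],
})
print(missing_by_city(toy, WEATHER_COLS))
```

On the real data this would be `missing_by_city(df, WEATHER_COLS)`. Looking at the earliest and latest `datetime` among the rows where `temperature_2m` is null would further show whether the gaps cluster at one end of the range.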


In [72]:
# 4. Check for duplicated rows
print("\n📌 Duplicate Rows:")
print(df.duplicated().sum())



📌 Duplicate Rows:
0


In [73]:
# 5. Check for unique cities
print("\n📌 Unique Cities in Dataset:")
print(df["city"].nunique(), "cities:", df["city"].unique())


📌 Unique Cities in Dataset:
46 cities: ['Delhi' 'Mumbai' 'Bangalore' 'Chennai' 'Kolkata' 'Hyderabad' 'Ahmedabad'
 'Pune' 'Nagpur' 'Jamshedpur' 'Dhanbad' 'Lucknow' 'Jaipur' 'Patna'
 'Bhopal' 'Raipur' 'Bhubaneswar' 'Thiruvananthapuram' 'Imphal' 'Shillong'
 'Aizawl' 'Kohima' 'Itanagar' 'Agartala' 'Gangtok' 'Dispur' 'Panaji'
 'Chandigarh' 'Shimla' 'Dehradun' 'Ranchi' 'Guwahati' 'Puducherry'
 'Port Blair' 'Leh' 'Srinagar' 'Amritsar' 'Gandhinagar' 'Noida'
 'Faridabad' 'Ghaziabad' 'Varanasi' 'Kanpur' 'Surat' 'Visakhapatnam'
 'Indore']


#### 🌍 Dataset Coverage: Urban India Air Quality Snapshot

This dataset offers a comprehensive view of air quality and meteorological conditions across **46 major Indian cities**, collected hourly between **April 25 and July 25, 2025**. The cities span a diverse set of geographic, climatic, and developmental zones, including:

- **Metropolitan hubs:** Delhi, Mumbai, Bengaluru (`Bangalore` in the dataset), Chennai, Hyderabad, Kolkata
- **Tier-2 and emerging cities:** Pune, Indore, Lucknow, Patna, Surat, Kanpur
- **State capitals and UTs:** Dehradun, Shillong, Gangtok, Port Blair, Leh, etc.
- **Northeastern region:** Represented via Imphal, Aizawl, Kohima, Agartala, and others

Such a wide-ranging selection enables:
- 📊 **Regional Comparisons** – North vs. South, urban inland vs. coastal, etc.
- 🌦️ **Climatic Diversity** – From humid tropics to arid plains and high-altitude zones
- 🏙️ **Urban Spectrum** – From mega cities to smaller, lesser-studied capitals

While this dataset focuses on **urban India**, it captures a **rich and diverse environmental landscape** that reflects the real-world challenges of air pollution, population growth, and climate variation in Indian cities.

This solid coverage sets the stage for reliable modeling, insightful analysis, and potential policy-relevant outcomes.


In [74]:
# 6. Time range check
print("\n📌 Date Range:")
df['datetime'] = pd.to_datetime(df['datetime'])
print(f"From {df['datetime'].min()} to {df['datetime'].max()}")


📌 Date Range:
From 2025-04-25 00:00:00 to 2025-07-25 23:00:00
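
The span from 25 April to 25 July covers 92 days, so a complete hourly series has 92 × 24 = 2,208 timestamps per city and 46 × 2,208 = 101,568 rows overall, which matches the `RangeIndex` size and the `unique` datetime count reported earlier. The same check can be made city by city. A minimal sketch, with `hours_missing_per_city` as a hypothetical helper:

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)


def hours_missing_per_city(frame, start, end):
    """Return, per city, how many expected hourly timestamps are absent."""
    expected = pd.date_range(start=start, end=end, freq=ONE_HOUR)
    stamps = pd.to_datetime(frame["datetime"])
    return stamps.groupby(frame["city"]).apply(
        lambda s: len(expected.difference(pd.DatetimeIndex(s)))
    )


# Expected size of a gap-free series for the notebook's date range.
expected_hours = len(pd.date_range("2025-04-25 00:00", "2025-07-25 23:00", freq=ONE_HOUR))
print(expected_hours, 46 * expected_hours)  # 2208 101568
```

Calling `hours_missing_per_city(df, "2025-04-25 00:00", "2025-07-25 23:00")` on the real data should return zero for every city if no hourly rows were dropped.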


In [81]:
# Convert datetime column to proper dtype
df["datetime"] = pd.to_datetime(df["datetime"])

# Extract useful time features
df["hour"] = df["datetime"].dt.hour
df["day"] = df["datetime"].dt.day
df["month"] = df["datetime"].dt.month
df["weekday"] = df["datetime"].dt.weekday  # Monday = 0

print(df.dtypes)

city                        object
datetime            datetime64[ns]
pm10                       float64
pm2_5                      float64
ozone                      float64
nitrogen_dioxide           float64
sulphur_dioxide            float64
carbon_monoxide            float64
us_aqi                       int64
european_aqi                 int64
temperature_2m             float64
humidity                   float64
wind_speed                 float64
wind_direction             float64
surface_pressure           float64
precipitation              float64
hour                         int32
day                          int32
month                        int32
weekday                      int32
dtype: object


In [82]:
# 7. Missing value treatment for meteorological features

def impute_weather_features(group):
    group = group.sort_values("datetime")  # Ensure time order
    group["temperature_2m"] = group["temperature_2m"].ffill().bfill()
    group["humidity"] = group["humidity"].ffill().bfill()
    group["wind_speed"] = group["wind_speed"].interpolate(method='linear').ffill().bfill()
    group["wind_direction"] = group["wind_direction"].ffill().bfill()
    group["surface_pressure"] = group["surface_pressure"].interpolate(method='linear').ffill().bfill()
    group["precipitation"] = group["precipitation"].fillna(0)  # Missing often means no rain
    return group

# Apply city-wise to preserve time-series integrity
df = (
    df.groupby("city", group_keys=False)
      .apply(impute_weather_features)
      .reset_index(drop=True)
)

# Save cleaned dataset to CSV
df.to_csv("cleaned_aqi_weather_dataset.csv", index=False)

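
The per-city imputation can also be written with `groupby(...).transform`, which returns each filled column aligned to the original index and never passes the grouping column into the function. pandas 2.2 and later emit a deprecation warning when `groupby.apply` operates on the grouping column, and this form sidesteps it. This is an alternative sketch under the same fill rules, not the cell that produced the results shown; `impute_weather_by_city` is a hypothetical name.

```python
import numpy as np
import pandas as pd

FILL_ONLY = ["temperature_2m", "humidity", "wind_direction"]
INTERPOLATE = ["wind_speed", "surface_pressure"]


def impute_weather_by_city(frame):
    """Apply the Step 2 fill rules within each city, in time order."""
    out = frame.sort_values(["city", "datetime"]).reset_index(drop=True)

    # Forward fill, then backward fill for any leading gap, per city.
    for col in FILL_ONLY:
        out[col] = out.groupby("city")[col].transform(lambda s: s.ffill().bfill())

    # Linear interpolation, with fills to cover gaps at either edge, per city.
    for col in INTERPOLATE:
        out[col] = out.groupby("city")[col].transform(
            lambda s: s.interpolate(method="linear").ffill().bfill()
        )

    # Assumption carried over from the notebook: missing precipitation means no rain.
    out["precipitation"] = out["precipitation"].fillna(0)
    return out
```

Usage would be `df = impute_weather_by_city(df)`, followed by the same `to_csv` call as above.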


In [83]:
df_cleaned = pd.read_csv("cleaned_aqi_weather_dataset.csv", parse_dates=["datetime"])  # restore the datetime dtype lost in the CSV round trip
print("Missing Values After Cleaning:")
print(df_cleaned.isna().sum())

Missing Values After Cleaning:
city                0
datetime            0
pm10                0
pm2_5               0
ozone               0
nitrogen_dioxide    0
sulphur_dioxide     0
carbon_monoxide     0
us_aqi              0
european_aqi        0
temperature_2m      0
humidity            0
wind_speed          0
wind_direction      0
surface_pressure    0
precipitation       0
hour                0
day                 0
month               0
weekday             0
dtype: int64
