Project Title
Analyzing Air Quality for an Early-Warning Insight System

Problem Statement: Climate change and rapid urbanization have intensified air pollution, yet traditional Air Quality Index (AQI) reporting often fails to highlight localized risks and disaster-level events. This project analyzes AQI datasets to identify hidden patterns in pollution spikes, detect anomalies linked to extreme weather or human activities, and develop an early-warning insight system that flags potential high-risk days before they escalate into public health disasters.

Description: This project follows a multi-stage data science workflow. It begins with exploratory data analysis to understand air quality trends, then applies machine learning models to identify key drivers of pollution spikes and forecast future AQI levels. The ultimate goal is a robust framework that provides actionable, data-driven insights for public health and environmental management.

Objective: To use data analysis and machine learning to build a predictive system for urban air pollution that provides timely insights to inform public health and disaster management strategies.
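To make the idea of flagging high-risk days concrete, here is a minimal sketch of one possible rule: a day is flagged when its AQI exceeds the trailing mean of the preceding days by a chosen number of standard deviations. The function name `flag_high_risk_days`, the window length, the threshold, and the sample values are all illustrative assumptions, not the method this project ultimately uses.

```python
import pandas as pd

def flag_high_risk_days(aqi: pd.Series, window: int = 5, z_threshold: float = 2.0) -> pd.Series:
    """Flag days whose AQI exceeds the trailing mean by z_threshold trailing std devs."""
    # shift(1) so each day is compared only with earlier days (no look-ahead).
    history = aqi.shift(1).rolling(window, min_periods=window)
    baseline = history.mean()
    spread = history.std()
    # Days without a full window of history compare against NaN and stay unflagged.
    return aqi > baseline + z_threshold * spread

# Made-up daily AQI values for a single city, with one obvious spike.
dates = pd.date_range("2015-01-01", periods=10, freq="D")
aqi = pd.Series([100, 102, 98, 101, 99, 100, 102, 180, 101, 100], index=dates, dtype=float)

flags = flag_high_risk_days(aqi)
print(aqi[flags])
```

On real data the rule would be applied per city (for example via `groupby("City")`), and the window and threshold would need tuning against known pollution events.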

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Load the city_day.csv dataset
df = pd.read_csv('city_day.csv')
df.head()

,City,Datetime,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Delhi,2015-01-01,153.3,241.7,182.9,33.0,81.3,38.5,1.87,64.5,83.6,18.93,20.81,8.32,204.5,Severe
1,Mumbai,2015-01-01,70.5,312.7,195.0,42.0,122.5,31.5,7.22,83.8,108.0,2.01,19.41,2.86,60.9,Satisfactory
2,Chennai,2015-01-01,174.1,275.4,56.2,68.8,230.9,28.5,8.56,60.8,43.9,19.07,10.19,9.63,486.5,Severe
3,Kolkata,2015-01-01,477.2,543.9,14.1,76.4,225.9,45.6,2.41,42.1,171.1,9.31,11.65,9.39,174.4,Very Poor
4,Bangalore,2015-01-01,171.6,117.7,123.3,12.4,61.9,49.7,1.26,79.7,164.3,6.04,12.74,9.59,489.7,Good


EXPLORE AND UNDERSTAND THE DATA


In [3]:
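# Basic info
# Note: df.info() prints its summary directly and returns None,
# so wrapping it in print() also emits a trailing "None".
print(df.info())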
print(df.describe())
print(df.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18265 entries, 0 to 18264
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        18265 non-null  object 
 1   Datetime    18265 non-null  object 
 2   PM2.5       18265 non-null  float64
 3   PM10        18265 non-null  float64
 4   NO          18265 non-null  float64
 5   NO2         18265 non-null  float64
 6   NOx         18265 non-null  float64
 7   NH3         18265 non-null  float64
 8   CO          18265 non-null  float64
 9   SO2         18265 non-null  float64
 10  O3          18265 non-null  float64
 11  Benzene     18265 non-null  float64
 12  Toluene     18265 non-null  float64
 13  Xylene      18265 non-null  float64
 14  AQI         18265 non-null  float64
 15  AQI_Bucket  18265 non-null  object 
dtypes: float64(13), object(3)
memory usage: 2.2+ MB
None
              PM2.5          PM10            NO           NO2           NOx  \
count  18265.000

OBSERVATIONS

1. The dataset contains 18,265 rows and 16 columns.
2. Columns include:
    a. City (categorical)
    b. Datetime (date/time, currently stored as object, so it must be parsed before any time-based analysis)
    c. 13 numerical columns: 12 pollutant measurements (PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, Xylene) plus AQI
    d. AQI_Bucket (categorical, air quality classification)
3. No missing values in any column.
4. Value ranges:
    a. PM2.5 ranges from 0.0 to ~500 with mean ~250.6
    b. PM10 ranges from 0.0 to ~600 with mean ~299.4
    c. AQI ranges from 0.0 to 500 with mean ~251 (a very high average, suggesting frequent poor air quality)
5. Many pollutants (e.g., NO, SO2, CO, Benzene) have minimum values of 0.0, which could indicate either clean-air periods or missing/zeroed-out sensor readings.
6. AQI_Bucket will be useful for classification tasks, while the pollutant values can be used for regression and anomaly detection.
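
The observations about the Datetime column and the zero-valued readings can be checked directly. The sketch below assumes the column names shown in the `df.info()` output; the helper name `zero_share` and the small stand-in frame (made-up values) are illustrative only and are not part of the original notebook.

```python
import pandas as pd

def zero_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of exactly-zero readings in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)

# Tiny stand-in frame so the sketch runs on its own;
# in the notebook, use the df loaded from city_day.csv instead.
sample = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "Datetime": ["2015-01-01", "2015-01-02", "2015-01-01", "2015-01-02"],
    "NO": [0.0, 12.5, 0.0, 8.1],
    "CO": [1.9, 0.0, 7.2, 3.3],
    "AQI": [204.5, 180.0, 60.9, 75.2],
})

# Datetime is stored as object (string); convert it before time-based analysis.
sample["Datetime"] = pd.to_datetime(sample["Datetime"])

shares = zero_share(sample)
print(shares)
```

A column with a large share of zeros is more likely to reflect placeholder or missing sensor readings than genuinely clean air, so it may be worth treating those zeros as missing before modeling.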