<a href="https://colab.research.google.com/github/AbhayBhise/Week1/blob/main/Week1_Air_Quality_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI-Based Air Quality Prediction and Health Risk Analysis

## Objective
The aim of this project is to predict the Air Quality Index (AQI) or PM2.5 levels for selected Indian cities using historical air quality and meteorological data. The project will also classify health risk levels based on predicted air quality.



## Dataset
- **Source**: [Air Quality Data in India (2015–2020) - Kaggle](https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india)
- **Description**: Contains concentrations of major air pollutants (PM2.5, PM10, NO₂, SO₂, CO, O₃) along with weather data.
- **Time Period**: 2015–2020
- **Locations**: Multiple cities across India


3.Import Libraries

In [1]:
# Importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: Configure display settings
pd.set_option('display.max_columns', None)


In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set the path to your project folder
project_path = '/content/drive/MyDrive/AI_Project_AirQuality/'
data_path = project_path + 'city_day.csv'

# Load the dataset
df = pd.read_csv(data_path)
df.head()


Mounted at /content/drive


Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,2015-01-01,,,0.92,18.22,17.15,,0.92,27.64,133.36,0.0,0.02,0.0,,
1,Ahmedabad,2015-01-02,,,0.97,15.69,16.46,,0.97,24.55,34.06,3.68,5.5,3.77,,
2,Ahmedabad,2015-01-03,,,17.4,19.3,29.7,,17.4,29.07,30.7,6.8,16.4,2.25,,
3,Ahmedabad,2015-01-04,,,1.7,18.48,17.97,,1.7,18.59,36.08,4.43,10.14,1.0,,
4,Ahmedabad,2015-01-05,,,22.1,21.42,37.76,,22.1,39.33,39.31,7.01,18.89,2.78,,


In [3]:
# Check dataset information
df.info()

# Get descriptive statistics
df.describe()

# Check for missing values
df.isnull().sum()

# Quick look at unique cities
print("Available cities:", df['City'].unique())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB
Available cities: ['Ahmedabad' 'Aizawl' 'Amaravati' 'Amritsar' 'Bengaluru' 'Bhopal'
 'Brajrajnagar' 

## Planned Workflow (AI/ML Project Life Cycle)

1. **Data Cleaning & Preprocessing**
   - Handle missing values, outliers, and duplicates
   - Extract time-based features (month, season, weekday)

2. **Exploratory Data Analysis (EDA)**
   - City-wise pollution trends
   - Correlation between pollutants and weather parameters

3. **Model Development**
   - Algorithms: Linear Regression, Random Forest, XGBoost
   - Target: AQI or PM2.5 prediction

4. **Model Evaluation**
   - Metrics: MAE, RMSE for regression
   - Classification report for health risk levels

5. **Optional Enhancements**
   - Hyperparameter tuning
   - Streamlit dashboard for visualization


## Current Status (Week 1)
- Project title finalized
- Dataset selected and linked
- Initial exploration performed (data loaded, basic statistics and missing values checked)
- Plan for data cleaning, EDA, and model training prepared
