<a href="https://colab.research.google.com/github/ShreshthaSinha/Air-Quality-Prediction-/blob/main/Air_Quality_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project:** Air Quality Forecasting using Machine Learning


# **Goal:** Predict short-term and medium-term pollutant levels (e.g., PM2.5) from sensor and weather data, enabling proactive health and policy actions.

# **Objectives:**
- Build forecasting models (baseline → advanced ML/DL)
- Create a full pipeline: data ingestion → preprocessing → modeling → evaluation

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#IMPORT LIBRARIES

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# READING DATASETS

station_hour = pd.read_excel('/content/drive/MyDrive/FSP/station-hour.xlsx')
station = pd.read_excel('/content/drive/MyDrive/FSP/station.xlsx')

In [None]:
station_hour.head()

Unnamed: 0,StationId,Date,Time,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,AP001,2017-11-25,09:00:00,104.0,148.5,1.93,23.0,13.75,9.8,0.1,15.3,117.62,0.3,10.4,0.23,155,Moderate
1,AP001,2017-11-25,10:00:00,94.5,142.0,1.33,16.25,9.75,9.65,0.1,17.0,136.23,0.28,7.1,0.15,159,Moderate
2,AP001,2017-11-25,11:00:00,82.75,126.5,1.47,14.83,9.07,9.7,0.1,15.4,149.92,0.2,4.55,0.08,173,Moderate
3,AP001,2017-11-25,14:00:00,68.5,117.0,1.35,13.6,8.35,7.4,0.1,21.8,161.7,0.1,2.3,0.0,191,Moderate
4,AP001,2017-11-25,15:00:00,69.25,112.25,1.52,11.8,7.55,9.25,0.1,21.38,161.68,0.1,2.35,0.0,191,Moderate


In [None]:
station.head()

Unnamed: 0,StationId,StationName,City,State,Status
0,AP001,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
1,AP005,"GVM Corporation, Visakhapatnam - APPCB",Visakhapatnam,Andhra Pradesh,Active
2,AS001,"Railway Colony, Guwahati - APCB",Guwahati,Assam,Active
3,BR005,"DRM Office Danapur, Patna - BSPCB",Patna,Bihar,Active
4,BR006,"Govt. High School Shikarpur, Patna - BSPCB",Patna,Bihar,Active


In [None]:
station_hour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219037 entries, 0 to 219036
Data columns (total 17 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   StationId   219037 non-null  object        
 1   Date        219037 non-null  datetime64[ns]
 2   Time        219037 non-null  object        
 3   PM2.5       211535 non-null  float64       
 4   PM10        214232 non-null  float64       
 5   NO          219037 non-null  float64       
 6   NO2         217683 non-null  float64       
 7   NOx         217839 non-null  float64       
 8   NH3         219037 non-null  float64       
 9   CO          216699 non-null  float64       
 10  SO2         219037 non-null  float64       
 11  O3          219037 non-null  float64       
 12  Benzene     219037 non-null  float64       
 13  Toluene     218942 non-null  float64       
 14  Xylene      219037 non-null  float64       
 15  AQI         219037 non-null  int64         
 16  AQ

In [None]:
station_hour['AQI_Bucket'].value_counts()

Unnamed: 0_level_0,count
AQI_Bucket,Unnamed: 1_level_1
Moderate,93653
Satisfactory,75287
Good,25788
Poor,12547
Very Poor,9685
Severe,2077


In [None]:
station_hour.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,7502
PM10,4805
NO,0
NO2,1354
NOx,1198
NH3,0
CO,2338


**Since the data is highly skewed, we replace the null values with the median of the respective column.**
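As a rough illustration of why the median is preferred for skewed readings, the sketch below checks skewness on a toy frame and fills the gaps column by column. The values and the helper name `impute_median` are assumptions made for this example, not part of the dataset or the notebook.

```python
import numpy as np
import pandas as pd

# Toy right-skewed readings with gaps (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "PM2.5": [10.0, 12.0, np.nan, 11.0, 300.0],
    "PM10": [20.0, np.nan, 22.0, 21.0, 500.0],
})

def impute_median(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's median."""
    out = df.copy()
    for col in columns:
        # Assign back to the column instead of using a chained inplace call
        out[col] = out[col].fillna(out[col].median())
    return out

skewness = toy.skew()   # positive values indicate a long right tail
filled = impute_median(toy, ["PM2.5", "PM10"])

print(skewness)
print(filled)
```

With a single extreme reading in each column, the mean is pulled far above the typical values while the median stays close to them, so the filled rows look like ordinary readings.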

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# REPLACING THE MISSING VALUES WITH THE MEDIAN VALUES OF THE RESPECTIVE COLUMNS
# Assign the result back; fillna(inplace=True) on a selected column stops working in pandas 3.0.

station_hour['PM2.5'] = station_hour['PM2.5'].fillna(station_hour['PM2.5'].median())
station_hour['PM10'] = station_hour['PM10'].fillna(station_hour['PM10'].median())


In [None]:
station_hour.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,0
PM10,0
NO,0
NO2,1354
NOx,1198
NH3,0
CO,2338


In [None]:
# REPLACING THE MISSING VALUES WITH THE MEDIAN VALUES OF THE RESPECTIVE COLUMNS
# Assign the result back; fillna(inplace=True) on a selected column stops working in pandas 3.0.

station_hour['NO2'] = station_hour['NO2'].fillna(station_hour['NO2'].median())
station_hour['NOx'] = station_hour['NOx'].fillna(station_hour['NOx'].median())
station_hour['CO'] = station_hour['CO'].fillna(station_hour['CO'].median())
station_hour['Toluene'] = station_hour['Toluene'].fillna(station_hour['Toluene'].median())

In [None]:
station_hour.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,0
PM10,0
NO,0
NO2,0
NOx,0
NH3,0
CO,0


**The station_hour dataset has many zero values that actually represent missing readings, so we convert them to null values.**

In [None]:
station_hour.replace(0.0, np.nan, inplace=True)
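A frame-wide replace also touches non-sensor columns such as `AQI`. The sketch below shows one possible variant that counts zeros first and then limits the replacement to a chosen list of pollutant columns. The toy values and the `pollutant_cols` subset are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few station_hour columns (illustrative values)
toy = pd.DataFrame({
    "StationId": ["AP001", "AP001", "AP001"],
    "NOx": [13.75, 0.0, 9.07],
    "CO": [0.0, 0.1, 0.0],
    "AQI": [155, 159, 173],
})

pollutant_cols = ["NOx", "CO"]  # hypothetical subset of sensor columns

# Count zeros per column before touching anything
zero_counts = (toy[pollutant_cols] == 0).sum()

# Replace zeros with NaN only in the selected columns
cleaned = toy.copy()
cleaned[pollutant_cols] = cleaned[pollutant_cols].replace(0.0, np.nan)

print(zero_counts)
print(cleaned.isnull().sum())
```

Counting zeros beforehand gives a number to compare against the null counts afterwards, which is an easy way to confirm that the replacement did what was intended.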

In [None]:
station_hour[station_hour['NOx'] == 0]

Unnamed: 0,StationId,Date,Time,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket


In [None]:
station_hour[station_hour['CO'] == 0]

Unnamed: 0,StationId,Date,Time,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket


In [None]:
station_hour[station_hour==0]

Unnamed: 0,StationId,Date,Time,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,,NaT,,,,,,,,,,,,,,,
1,,NaT,,,,,,,,,,,,,,,
2,,NaT,,,,,,,,,,,,,,,
3,,NaT,,,,,,,,,,,,,,,
4,,NaT,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219032,,NaT,,,,,,,,,,,,,,,
219033,,NaT,,,,,,,,,,,,,,,
219034,,NaT,,,,,,,,,,,,,,,
219035,,NaT,,,,,,,,,,,,,,,


In [None]:
station_hour.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,0
PM10,0
NO,0
NO2,0
NOx,981
NH3,0
CO,10794


**Replacing the newly created null values with the median values of the respective columns.**

In [None]:
# Assign the result back; fillna(inplace=True) on a selected column stops working in pandas 3.0.
station_hour['Benzene'] = station_hour['Benzene'].fillna(station_hour['Benzene'].median())
station_hour['NOx'] = station_hour['NOx'].fillna(station_hour['NOx'].median())
station_hour['CO'] = station_hour['CO'].fillna(station_hour['CO'].median())
station_hour['Toluene'] = station_hour['Toluene'].fillna(station_hour['Toluene'].median())

In [None]:
station_hour.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,0
PM10,0
NO,0
NO2,0
NOx,0
NH3,0
CO,0


**Since the Xylene column has more than 25% null values, we drop it.**

In [None]:
station_hour.drop(columns=['Xylene'], inplace=True)
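The 25% rule can also be applied programmatically rather than by naming the column. The following is a minimal sketch on a toy frame; the values and the `threshold` variable are assumptions, chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (illustrative values)
toy = pd.DataFrame({
    "Benzene": [0.3, 0.28, np.nan, 0.1, 0.1],
    "Toluene": [10.4, 7.1, 4.55, 2.3, 2.35],
    "Xylene": [0.23, np.nan, np.nan, 0.15, np.nan],
})

threshold = 0.25  # drop columns with more than 25% missing values

# isnull().mean() gives the fraction of missing entries per column
missing_fraction = toy.isnull().mean()
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_fraction)
print(reduced.columns.tolist())
```

Printing `missing_fraction` alongside the decision keeps a record of how far each column was from the cutoff.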

In [None]:
station_hour.columns

Index(['StationId', 'Date', 'Time', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3',
       'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'AQI', 'AQI_Bucket'],
      dtype='object')

**Changing the Time column to datetime type. Only a time of day is parsed, so pandas fills in the placeholder date 1900-01-01.**

In [None]:
station_hour['Time'] = pd.to_datetime(station_hour['Time'], format='%H:%M:%S')
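For time-series work it is often more convenient to have one column holding the full timestamp. The sketch below shows one way this could be done by adding the time of day to the calendar date. It assumes `Time` still holds values that render as `HH:MM:SS` (strings or `datetime.time` objects), and the `Timestamp` column name is hypothetical.

```python
import pandas as pd

# Toy frame with separate date and time-of-day columns (illustrative values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-11-25", "2017-11-25"]),
    "Time": ["09:00:00", "10:00:00"],
})

# Parsing only a time of day yields the placeholder date 1900-01-01
time_only = pd.to_datetime(toy["Time"], format="%H:%M:%S")

# Adding the time as a timedelta to the calendar date gives a full timestamp;
# astype(str) also covers datetime.time objects, which print as HH:MM:SS
toy["Timestamp"] = toy["Date"] + pd.to_timedelta(toy["Time"].astype(str))

print(time_only)
print(toy["Timestamp"])
```

A combined column like this can then be sorted or used as an index per station, which lag and rolling features typically rely on.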

In [None]:
station_hour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219037 entries, 0 to 219036
Data columns (total 16 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   StationId   219037 non-null  object        
 1   Date        219037 non-null  datetime64[ns]
 2   Time        219037 non-null  datetime64[ns]
 3   PM2.5       219037 non-null  float64       
 4   PM10        219037 non-null  float64       
 5   NO          219037 non-null  float64       
 6   NO2         219037 non-null  float64       
 7   NOx         219037 non-null  float64       
 8   NH3         219037 non-null  float64       
 9   CO          219037 non-null  float64       
 10  SO2         219037 non-null  float64       
 11  O3          219037 non-null  float64       
 12  Benzene     219037 non-null  float64       
 13  Toluene     219037 non-null  float64       
 14  AQI         219037 non-null  int64         
 15  AQI_Bucket  219037 non-null  object        
dtypes:

**Merging the two datasets on the StationId column.**

In [None]:
merged_df = pd.merge(station_hour, station, on='StationId', how='inner')  # Only matching IDs
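An inner join silently drops readings whose `StationId` has no match in the station table. The sketch below shows how `validate` and `indicator` in `pd.merge` could be used to check the join. The toy frames and the `ZZ999` station ID are made up for illustration.

```python
import pandas as pd

# Toy hourly readings; ZZ999 is a made-up ID with no station metadata
readings = pd.DataFrame({
    "StationId": ["AP001", "AP001", "ZZ999"],
    "PM2.5": [104.0, 94.5, 50.0],
})
stations = pd.DataFrame({
    "StationId": ["AP001", "AP005"],
    "City": ["Amaravati", "Visakhapatnam"],
})

# validate raises MergeError if StationId is duplicated in the station table
inner = pd.merge(readings, stations, on="StationId",
                 how="inner", validate="many_to_one")

# A left join with indicator=True reveals which readings would be dropped
audit = pd.merge(readings, stations, on="StationId",
                 how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "StationId"].unique().tolist()

print(len(inner), unmatched)
```

If `unmatched` is empty and the inner join has as many rows as the readings table, the merge lost nothing.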

In [None]:
merged_df

Unnamed: 0,StationId,Date,Time,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,AQI,AQI_Bucket,StationName,City,State,Status
0,AP001,2017-11-25,1900-01-01 09:00:00,104.00,148.50,1.93,23.00,13.75,9.80,0.10,15.30,117.62,0.30,10.40,155,Moderate,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
1,AP001,2017-11-25,1900-01-01 10:00:00,94.50,142.00,1.33,16.25,9.75,9.65,0.10,17.00,136.23,0.28,7.10,159,Moderate,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
2,AP001,2017-11-25,1900-01-01 11:00:00,82.75,126.50,1.47,14.83,9.07,9.70,0.10,15.40,149.92,0.20,4.55,173,Moderate,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
3,AP001,2017-11-25,1900-01-01 14:00:00,68.50,117.00,1.35,13.60,8.35,7.40,0.10,21.80,161.70,0.10,2.30,191,Moderate,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
4,AP001,2017-11-25,1900-01-01 15:00:00,69.25,112.25,1.52,11.80,7.55,9.25,0.10,21.38,161.68,0.10,2.35,191,Moderate,"Secretariat, Amaravati - APPCB",Amaravati,Andhra Pradesh,Active
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219032,WB010,2020-02-14,1900-01-01 11:00:00,85.50,161.35,16.15,65.22,81.32,39.08,0.79,8.60,20.47,57.14,51.31,222,Poor,"Jadavpur, Kolkata - WBPCB",Kolkata,West Bengal,Active
219033,WB010,2020-02-14,1900-01-01 12:00:00,73.75,143.65,8.40,52.65,61.08,38.53,0.66,9.82,26.58,53.71,51.38,219,Poor,"Jadavpur, Kolkata - WBPCB",Kolkata,West Bengal,Active
219034,WB010,2020-02-14,1900-01-01 13:00:00,71.50,133.38,5.60,45.03,50.62,42.62,0.55,9.57,28.28,56.80,56.27,217,Poor,"Jadavpur, Kolkata - WBPCB",Kolkata,West Bengal,Active
219035,WB010,2020-02-14,1900-01-01 14:00:00,54.47,117.12,4.20,39.00,43.17,48.02,0.62,9.20,31.63,56.04,55.58,215,Poor,"Jadavpur, Kolkata - WBPCB",Kolkata,West Bengal,Active


In [None]:
merged_df.isnull().sum()

Unnamed: 0,0
StationId,0
Date,0
Time,0
PM2.5,0
PM10,0
NO,0
NO2,0
NOx,0
NH3,0
CO,0


**Feature Engineering**