### Problem Statement




Air pollution is among the most serious environmental problems, with adverse effects on public health and the climate. The Air Quality Index (AQI) is a normalized measure of air quality. However, traditional AQI monitoring networks rely on expensive fixed-point sensors that cover a limited number of sites. This project aims to develop a machine learning model that predicts AQI from real-time concentrations of leading pollutants such as PM2.5, PM10, NO2, SO2, CO, and O3, and to deploy the model as a web application for better accessibility.


## AQI Prediction Model using Python
### PM2.5, PM10
### NO, NO2
### NH3 (Ammonia)
### CO
### SO2
### O3
### Benzene, Toluene, Xylene

In [1]:
pip install numpy pandas matplotlib seaborn scikit-learn streamlit

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip list

Package                        Version
------------------------------ ---------------
absl-py                        2.1.0
altair                         5.5.0
anyio                          4.2.0
argon2-cffi                    23.1.0
argon2-cffi-bindings           21.2.0
arrow                          1.3.0
asttokens                      2.4.1
astunparse                     1.6.3
async-lru                      2.0.4
attrs                          23.2.0
azure-cognitiveservices-speech 1.36.0
Babel                          2.14.0
beautifulsoup4                 4.12.2
bleach                         6.1.0
blinker                        1.8.2
CacheControl                   0.14.0
cachetools                     5.5.0
category-encoders              2.6.4
certifi                        2023.11.17
cffi                           1.16.0
charset-normalizer             3.3.2
click                          8.1.7
cloudpickle                    3.1.0
colorama                       0.4.6
comm         

In [3]:
# Import the necessary libraries

import numpy as np
import pandas as pd
import streamlit as st
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

In [4]:
# Load Dataset
df = pd.read_csv(r'air quality data.csv')
df

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,2015-01-01,,,0.92,18.22,17.15,,0.92,27.64,133.36,0.00,0.02,0.00,,
1,Ahmedabad,2015-01-02,,,0.97,15.69,16.46,,0.97,24.55,34.06,3.68,5.50,3.77,,
2,Ahmedabad,2015-01-03,,,17.40,19.30,29.70,,17.40,29.07,30.70,6.80,16.40,2.25,,
3,Ahmedabad,2015-01-04,,,1.70,18.48,17.97,,1.70,18.59,36.08,4.43,10.14,1.00,,
4,Ahmedabad,2015-01-05,,,22.10,21.42,37.76,,22.10,39.33,39.31,7.01,18.89,2.78,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,Visakhapatnam,2020-06-27,15.02,50.94,7.68,25.06,19.54,12.47,0.47,8.55,23.30,2.24,12.07,0.73,41.0,Good
29527,Visakhapatnam,2020-06-28,24.38,74.09,3.42,26.06,16.53,11.99,0.52,12.72,30.14,0.74,2.21,0.38,70.0,Satisfactory
29528,Visakhapatnam,2020-06-29,22.91,65.73,3.45,29.53,18.33,10.71,0.48,8.42,30.96,0.01,0.01,0.00,68.0,Satisfactory
29529,Visakhapatnam,2020-06-30,16.64,49.97,4.05,29.26,18.80,10.03,0.52,9.84,28.30,0.00,0.00,0.00,54.0,Satisfactory


In [5]:
#first 5 rows
df.head()

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,2015-01-01,,,0.92,18.22,17.15,,0.92,27.64,133.36,0.0,0.02,0.0,,
1,Ahmedabad,2015-01-02,,,0.97,15.69,16.46,,0.97,24.55,34.06,3.68,5.5,3.77,,
2,Ahmedabad,2015-01-03,,,17.4,19.3,29.7,,17.4,29.07,30.7,6.8,16.4,2.25,,
3,Ahmedabad,2015-01-04,,,1.7,18.48,17.97,,1.7,18.59,36.08,4.43,10.14,1.0,,
4,Ahmedabad,2015-01-05,,,22.1,21.42,37.76,,22.1,39.33,39.31,7.01,18.89,2.78,,


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB
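
The summary above lists `Date` with dtype `object`, so the dates are stored as text. A minimal sketch of parsing them, assuming the `YYYY-MM-DD` layout seen in the preview; the toy frame and the `Year`/`Month` column names are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy frame mimicking the City/Date columns of the dataset (illustrative values).
sample = pd.DataFrame({
    "City": ["Ahmedabad", "Ahmedabad", "Visakhapatnam"],
    "Date": ["2015-01-01", "2015-01-02", "2020-06-27"],
})

# Parse the text dates; errors="coerce" turns unparseable entries into NaT.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Calendar features become available through the .dt accessor.
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month
print(sample.dtypes)
```

Once parsed, the column can be used for time-based filtering or seasonal grouping.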


In [7]:
df.describe()

Unnamed: 0,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI
count,24933.0,18391.0,25949.0,25946.0,25346.0,19203.0,27472.0,25677.0,25509.0,23908.0,21490.0,11422.0,24850.0
mean,67.450578,118.127103,17.57473,28.560659,32.309123,23.483476,2.248598,14.531977,34.49143,3.28084,8.700972,3.070128,166.463581
std,64.661449,90.60511,22.785846,24.474746,31.646011,25.684275,6.962884,18.133775,21.694928,15.811136,19.969164,6.323247,140.696585
min,0.04,0.01,0.02,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,13.0
25%,28.82,56.255,5.63,11.75,12.82,8.58,0.51,5.67,18.86,0.12,0.6,0.14,81.0
50%,48.57,95.68,9.89,21.69,23.52,15.85,0.89,9.16,30.84,1.07,2.97,0.98,118.0
75%,80.59,149.745,19.95,37.62,40.1275,30.02,1.45,15.22,45.57,3.08,9.15,3.35,208.0
max,949.99,1000.0,390.68,362.21,467.63,352.89,175.81,193.86,257.73,455.03,454.85,170.37,2049.0


In [8]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
29526    False
29527    False
29528    False
29529    False
29530    False
Length: 29531, dtype: bool
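
The boolean Series above is hard to read at a glance; summing it gives the number of duplicated rows directly. A small sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with one fully repeated row (illustrative values).
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai"],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "AQI": [300.0, 300.0, 120.0],
})

# True counts as 1, so the sum is the number of rows that repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Restricting the check to key columns finds repeated City/Date pairs.
n_key_duplicates = int(toy.duplicated(subset=["City", "Date"]).sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()
print(n_duplicates, n_key_duplicates, len(deduplicated))
```

The same `df.duplicated().sum()` call can be run on the full dataset.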

In [9]:
# null values
df.isnull()

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,False,False,True,True,False,False,False,True,False,False,False,False,False,False,True,True
1,False,False,True,True,False,False,False,True,False,False,False,False,False,False,True,True
2,False,False,True,True,False,False,False,True,False,False,False,False,False,False,True,True
3,False,False,True,True,False,False,False,True,False,False,False,False,False,False,True,True
4,False,False,True,True,False,False,False,True,False,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29527,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29528,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
29529,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [10]:
# Total null values per column
df.isnull().sum()

City              0
Date              0
PM2.5          4598
PM10          11140
NO             3582
NO2            3585
NOx            4185
NH3           10328
CO             2059
SO2            3854
O3             4022
Benzene        5623
Toluene        8041
Xylene        18109
AQI            4681
AQI_Bucket     4681
dtype: int64

In [11]:
# Drop rows where AQI is missing
df.dropna(subset=['AQI'], inplace=True)

In [12]:
# arranging null values in order
df.isnull().sum().sort_values(ascending=False)

Xylene        15372
PM10           7086
NH3            6536
Toluene        5826
Benzene        3535
NOx            1857
O3              807
PM2.5           678
SO2             605
CO              445
NO2             391
NO              387
City              0
Date              0
AQI               0
AQI_Bucket        0
dtype: int64

In [13]:
#shape of dataset
df.shape

(24850, 16)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24850 entries, 28 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        24850 non-null  object 
 1   Date        24850 non-null  object 
 2   PM2.5       24172 non-null  float64
 3   PM10        17764 non-null  float64
 4   NO          24463 non-null  float64
 5   NO2         24459 non-null  float64
 6   NOx         22993 non-null  float64
 7   NH3         18314 non-null  float64
 8   CO          24405 non-null  float64
 9   SO2         24245 non-null  float64
 10  O3          24043 non-null  float64
 11  Benzene     21315 non-null  float64
 12  Toluene     19024 non-null  float64
 13  Xylene      9478 non-null   float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.2+ MB


In [15]:
#stats of the dataset
df.describe()

Unnamed: 0,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI
count,24172.0,17764.0,24463.0,24459.0,22993.0,18314.0,24405.0,24245.0,24043.0,21315.0,19024.0,9478.0,24850.0
mean,67.476613,118.454435,17.622421,28.978391,32.289012,23.848366,2.345267,14.362933,34.912885,3.458668,9.525714,3.588683,166.463581
std,63.075398,89.487976,22.421138,24.627054,30.712855,25.875981,7.075208,17.428693,21.724525,16.03602,20.881085,6.754324,140.696585
min,0.04,0.03,0.03,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,13.0
25%,29.0,56.7775,5.66,11.94,13.11,8.96,0.59,5.73,19.25,0.23,1.0275,0.39,81.0
50%,48.785,96.18,9.91,22.1,23.68,16.31,0.93,9.22,31.25,1.29,3.575,1.42,118.0
75%,80.925,150.1825,20.03,38.24,40.17,30.36,1.48,15.14,46.08,3.34,10.18,4.12,208.0
max,914.94,917.08,390.68,362.21,378.24,352.89,175.81,186.08,257.73,455.03,454.85,170.37,2049.0


In [16]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PM2.5,24172.0,67.476613,63.075398,0.04,29.0,48.785,80.925,914.94
PM10,17764.0,118.454435,89.487976,0.03,56.7775,96.18,150.1825,917.08
NO,24463.0,17.622421,22.421138,0.03,5.66,9.91,20.03,390.68
NO2,24459.0,28.978391,24.627054,0.01,11.94,22.1,38.24,362.21
NOx,22993.0,32.289012,30.712855,0.0,13.11,23.68,40.17,378.24
NH3,18314.0,23.848366,25.875981,0.01,8.96,16.31,30.36,352.89
CO,24405.0,2.345267,7.075208,0.0,0.59,0.93,1.48,175.81
SO2,24245.0,14.362933,17.428693,0.01,5.73,9.22,15.14,186.08
O3,24043.0,34.912885,21.724525,0.01,19.25,31.25,46.08,257.73
Benzene,21315.0,3.458668,16.03602,0.0,0.23,1.29,3.34,455.03


In [17]:
# Percentage of null values in each column
null_values_percentages = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
null_values_percentages

Xylene        61.859155
PM10          28.515091
NH3           26.301811
Toluene       23.444668
Benzene       14.225352
NOx            7.472837
O3             3.247485
PM2.5          2.728370
SO2            2.434608
CO             1.790744
NO2            1.573441
NO             1.557344
City           0.000000
Date           0.000000
AQI            0.000000
AQI_Bucket     0.000000
dtype: float64

## Key considerations
### Xylene has the highest percentage of missing values: 61.86%
### PM10 and NH3 are missing in 28.52% and 26.30% of rows
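
One way to act on these percentages is to drop columns that are mostly empty before imputing. The sketch below is an option, not the notebook's method: the 50% `threshold` is a hypothetical cut-off, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: Xylene is mostly missing, PM2.5 only occasionally (illustrative values).
toy = pd.DataFrame({
    "PM2.5": [10.0, 20.0, np.nan, 40.0, 50.0],
    "Xylene": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "AQI": [50.0, 80.0, 90.0, 120.0, 150.0],
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = toy.isnull().mean() * 100

threshold = 50.0  # hypothetical cut-off, chosen only for this example
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()

# Drop the columns whose missing share exceeds the threshold.
reduced = toy.drop(columns=sparse_columns)
print(sparse_columns, list(reduced.columns))
```

Whether a sparse column such as Xylene should be dropped or imputed is a modelling decision that depends on how much signal it carries.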

In [19]:
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# KNN imputation for missing values.
# iloc[:, 2:-2] selects the pollutant columns (PM2.5 through Xylene),
# skipping City, Date, AQI, and AQI_Bucket.
imputer = KNNImputer(n_neighbors=5)
df.iloc[:, 2:-2] = imputer.fit_transform(df.iloc[:, 2:-2])

In [20]:
# Splitting Data
X = df[['PM2.5', 'PM10', 'NO', 'NO2', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene']]
y = df['AQI']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
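
The imputer in this notebook is fitted on the full dataset before the split, so test rows influence the imputed training values. A hedged alternative is to wrap the imputer and the regressor in a scikit-learn pipeline so the imputer is fitted on the training rows only. The synthetic data, the three-feature subset, and the coefficients below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the pollutant table (not the real dataset).
rng = np.random.default_rng(0)
features = ["PM2.5", "PM10", "NO2"]
X = pd.DataFrame(rng.uniform(1, 200, size=(200, 3)), columns=features)
y = 0.8 * X["PM2.5"] + 0.3 * X["PM10"] + rng.normal(0, 5, size=200)

# Blank out roughly 10% of the feature values to imitate missing readings.
X = X.mask(rng.random(X.shape) < 0.1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the imputer on X_train only, then reuses it on X_test.
pipeline = make_pipeline(
    KNNImputer(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(y_pred[:5])
```

The fitted pipeline can also be applied to newly uploaded data, because it carries its own imputation step.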


In [21]:
# Train Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
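
A fitted random forest exposes `feature_importances_`, which can indicate which pollutants drive the predictions. The sketch below uses synthetic data in which the target depends on PM2.5 by construction; it says nothing about the importances in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is driven by PM2.5 only (by construction).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(0, 100, size=(300, 3)), columns=["PM2.5", "CO", "O3"])
y = 2.0 * X["PM2.5"] + rng.normal(0, 1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Running the same two lines on the notebook's `model` and `X_train.columns` would rank the real features.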

In [22]:
# Predictions
y_pred = model.predict(X_test)

In [23]:
# Evaluate (print is used because st.write only renders inside a running Streamlit app)
print("MAE:", mean_absolute_error(y_test, y_pred))
# Taking np.sqrt of the MSE avoids the squared=False argument,
# which is deprecated in recent scikit-learn releases
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R² Score:", r2_score(y_test, y_pred))
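
For reference, the three metrics can be computed by hand with NumPy. The four values below are made up so the arithmetic is easy to follow; they are not results from this model.

```python
import numpy as np

# Made-up true and predicted AQI values for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 230.0])

errors = y_true - y_hat

# MAE: average absolute error.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalises large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

MAE and RMSE are in AQI units, while R² is unitless and equals 1 for a perfect fit.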


In [24]:

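# Streamlit UI (save as a script and launch with `streamlit run`, not from the notebook)
st.title("AQI Prediction App")
st.write("Upload an air quality dataset to predict AQI")
uploaded_file = st.file_uploader("Upload CSV", type=["csv"])

if uploaded_file is not None:
    new_data = pd.read_csv(uploaded_file)
    # Keep only the columns the model was trained on, in the same order
    predictions = model.predict(new_data[X.columns])
    st.write(predictions)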
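
A Streamlit script runs in its own process, so a model trained in the notebook has to be saved to disk and loaded by the app. One common option is `joblib`; the sketch below is an assumption about how that hand-off could look, with a hypothetical file name and synthetic training data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the notebook's trained regressor.
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(100, 2))
y = X[:, 0] + 0.5 * X[:, 1]
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # "aqi_model.joblib" is a hypothetical file name used only for this example.
    model_path = os.path.join(tmp_dir, "aqi_model.joblib")
    joblib.dump(model, model_path)      # done once, after training
    restored = joblib.load(model_path)  # done in the app at start-up

# The restored model reproduces the original predictions.
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

In a real app the file would live next to the script rather than in a temporary directory.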



In [25]:
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

st.title("AQI Prediction Model")

uploaded_file = st.file_uploader("Upload Air Quality Dataset", type=["csv"])

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write("Dataset Preview:", df.head())

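    # Define features and target variable
    features = ['PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
    # LinearRegression cannot handle NaN, so drop incomplete rows first
    clean = df.dropna(subset=features + ['AQI'])
    X = clean[features]
    y = clean['AQI']

    # Train Model
    model = LinearRegression()
    model.fit(X, y)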

    # User Input
    st.subheader("Enter Pollutant Values:")
    pm25 = st.number_input("PM2.5", value=50.0)
    pm10 = st.number_input("PM10", value=80.0)
    no2 = st.number_input("NO2", value=20.0)
    so2 = st.number_input("SO2", value=10.0)
    co = st.number_input("CO", value=1.0)
    o3 = st.number_input("O3", value=30.0)

    # Make Prediction (a DataFrame keeps the feature names the model was fitted with)
    input_data = pd.DataFrame([[pm25, pm10, no2, so2, co, o3]], columns=X.columns)
    predicted_aqi = model.predict(input_data)[0]

    st.subheader(f"Predicted AQI: {predicted_aqi:.2f}")
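
The app could also report a category alongside the number, similar to the `AQI_Bucket` column in the dataset. The helper below is a hypothetical sketch: the cut-offs are assumed to follow the Indian national AQI ranges (0-50 Good, 51-100 Satisfactory, 101-200 Moderate, 201-300 Poor, 301-400 Very Poor, above 400 Severe) and should be checked against the dataset's own labels before use.

```python
def aqi_bucket(aqi: float) -> str:
    """Map a numeric AQI to a category label (assumed cut-offs, see lead-in)."""
    # Upper bound of each band paired with its label, in ascending order.
    bands = [
        (50, "Good"),
        (100, "Satisfactory"),
        (200, "Moderate"),
        (300, "Poor"),
        (400, "Very Poor"),
    ]
    for upper, label in bands:
        if aqi <= upper:
            return label
    # Anything above the last band falls into the top category.
    return "Severe"


# Example values taken from the dataset preview, plus one high reading.
for value in (41.0, 70.0, 450.0):
    print(value, aqi_bucket(value))
```

In the Streamlit script, `aqi_bucket(predicted_aqi)` could be shown next to the predicted value.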


