# AQI Prediction Model Using Python

In [None]:
Let us see how to predict the air quality index (AQI) using Python.
The AQI is calculated from the concentrations of chemical pollutants in the air.

| Air Pollutant | Typical Percentage |
| --- | --- |
| Particulate Matter (PM2.5) | ~10-20% |
| Particulate Matter (PM10) | ~20-30% |
| Nitrogen Oxides (NOx) | ~20-30% |
| Sulfur Dioxide (SO2) | ~5-10% |
| Carbon Monoxide (CO) | ~10-20% |
| Volatile Organic Compounds (VOCs) | ~10-15% |
| Ozone (O3) | ~5-10% |
| Ammonia (NH3) | <5% |

By using machine learning, we can predict the AQI.

AQI: The air quality index reports air quality on a daily basis.
In other words, it indicates how air pollution affects one’s health over a short period.
The AQI is calculated from the average concentration of a particular pollutant measured over a standard time interval.
Generally, the time interval is 24 hours for most pollutants and 8 hours for carbon monoxide and ozone.
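
The averaging step described above can be sketched with pandas time-based rolling windows. This is only an illustration: the hourly readings below are synthetic, and the dataset used in this notebook is already aggregated to one row per day.

```python
import pandas as pd

# Synthetic hourly readings (hypothetical values, not taken from the dataset).
hours = pd.date_range("2015-01-01", periods=48, freq=pd.Timedelta(hours=1))
readings = pd.DataFrame(
    {
        "PM2.5": [float(i) for i in range(48)],
        "CO": [float(i) for i in range(48)],
    },
    index=hours,
)

# 24-hour average for PM2.5 and 8-hour average for CO, as described above.
pm25_24h = readings["PM2.5"].rolling(pd.Timedelta(hours=24)).mean()
co_8h = readings["CO"].rolling(pd.Timedelta(hours=8)).mean()

print(pm25_24h.iloc[-1])  # mean of the last 24 hourly values
print(co_8h.iloc[-1])     # mean of the last 8 hourly values
```

Each averaged concentration would then be mapped to a sub-index using the breakpoints published by the relevant authority, a step this sketch leaves out.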
    
The pollutant columns are numeric, but the raw data contains missing values, so those rows must be handled before modeling.
Our goal is to predict the AQI, so the task is either classification or regression.
Because the target (AQI) is continuous, regression is the appropriate technique.
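
As a rough illustration of that choice, the dtype of the target column can hint at the task type. The rows below are hypothetical and `task_for` is a helper invented for this sketch. The dtype check is only a heuristic: integer-coded class labels would also look numeric.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset's two label columns.
labels = pd.DataFrame({
    "AQI": [78.0, 112.0, 166.0],
    "AQI_Bucket": ["Satisfactory", "Moderate", "Moderate"],
})

def task_for(column: pd.Series) -> str:
    """Suggest a task type from the dtype of the target column."""
    if pd.api.types.is_numeric_dtype(column):
        return "regression"
    return "classification"

print(task_for(labels["AQI"]))         # numeric target
print(task_for(labels["AQI_Bucket"]))  # categorical target
```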

Regression is a supervised learning technique that predicts a continuous target value from input features.
Examples of regression techniques available in Python:
- Random Forest Regressor
- AdaBoost Regressor
- Bagging Regressor
- Linear Regression, etc.
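
The regressors listed above are all available in scikit-learn. The following is a minimal sketch of fitting and comparing them. It uses synthetic data (an assumption for illustration, not the air quality dataset), and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear target built from three random features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 3))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Bagging": BaggingRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
}

# Fit each model and record its R^2 score on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

On real data the ranking of these models can differ considerably, so the scores here say nothing about which one suits AQI prediction.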

In [None]:
pip install numpy pandas matplotlib seaborn scikit-learn

In [6]:
# Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
df = pd.read_csv('air quality data.csv')
df.head()

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Ahmedabad,2015-01-01,,,0.92,18.22,17.15,,0.92,27.64,133.36,0.0,0.02,0.0,,
1,Ahmedabad,2015-01-02,,,0.97,15.69,16.46,,0.97,24.55,34.06,3.68,5.5,3.77,,
2,Ahmedabad,2015-01-03,,,17.4,19.3,29.7,,17.4,29.07,30.7,6.8,16.4,2.25,,
3,Ahmedabad,2015-01-04,,,1.7,18.48,17.97,,1.7,18.59,36.08,4.43,10.14,1.0,,
4,Ahmedabad,2015-01-05,,,22.1,21.42,37.76,,22.1,39.33,39.31,7.01,18.89,2.78,,
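
`pd.read_csv` loads the `Date` column as plain strings unless told otherwise. If date-based operations are needed later, one option is to parse it at load time. The sketch below uses two rows copied from the preview above, trimmed to a few columns, and reads them from an in-memory buffer instead of the CSV file.

```python
import io
import pandas as pd

# Two rows from the preview, trimmed to four columns for brevity.
sample_csv = io.StringIO(
    "City,Date,NO,NO2\n"
    "Ahmedabad,2015-01-01,0.92,18.22\n"
    "Ahmedabad,2015-01-02,0.97,15.69\n"
)

# parse_dates converts the named column to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["Date"])
print(sample.dtypes)
print(sample["Date"].dt.year.tolist())
```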


In [4]:
# Number of rows and columns
df.shape

(29531, 16)

In [7]:
# Information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB


In [8]:
# Duplicate values
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
29526    False
29527    False
29528    False
29529    False
29530    False
Length: 29531, dtype: bool

In [9]:
df.duplicated().sum()

np.int64(0)
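
`df.duplicated()` only flags rows that match an earlier row in every column. For daily readings it can also be worth checking for repeated City/Date pairs, which the `subset` argument allows. A small sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical readings: the last row repeats a City/Date pair
# but carries a different measurement.
toy = pd.DataFrame({
    "City": ["Ahmedabad", "Ahmedabad", "Ahmedabad"],
    "Date": ["2015-01-01", "2015-01-02", "2015-01-02"],
    "NO": [0.92, 0.97, 1.10],
})

full_row_dupes = toy.duplicated().sum()                       # all columns must match
key_dupes = toy.duplicated(subset=["City", "Date"]).sum()     # only the key columns
print(full_row_dupes, key_dupes)
```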

In [10]:
# Missing Values
df.isnull().sum()

City              0
Date              0
PM2.5          4598
PM10          11140
NO             3582
NO2            3585
NOx            4185
NH3           10328
CO             2059
SO2            3854
O3             4022
Benzene        5623
Toluene        8041
Xylene        18109
AQI            4681
AQI_Bucket     4681
dtype: int64
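
Raw counts are easier to judge as a share of the 29531 rows. The sketch below recomputes that share from the counts printed above. On the live frame, `df.isnull().mean() * 100` would give the same figures directly.

```python
import pandas as pd

TOTAL_ROWS = 29531  # from the df.shape output

# Missing-value counts copied from the output above.
missing = pd.Series({
    "PM2.5": 4598, "PM10": 11140, "NO": 3582, "NO2": 3585, "NOx": 4185,
    "NH3": 10328, "CO": 2059, "SO2": 3854, "O3": 4022, "Benzene": 5623,
    "Toluene": 8041, "Xylene": 18109, "AQI": 4681, "AQI_Bucket": 4681,
})

# Percentage of rows with a missing value, largest first.
missing_pct = (missing / TOTAL_ROWS * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Xylene is missing in well over half of the rows, which matters for the row-dropping step that follows.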

In [11]:
# Drop rows that have a missing value in any pollutant or AQI column
cols_with_missing = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2',
                     'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket']
df.dropna(subset=cols_with_missing, inplace=True)

In [13]:
df.isnull().sum()

City          0
Date          0
PM2.5         0
PM10          0
NO            0
NO2           0
NOx           0
NH3           0
CO            0
SO2           0
O3            0
Benzene       0
Toluene       0
Xylene        0
AQI           0
AQI_Bucket    0
dtype: int64

In [15]:
df.shape

(6236, 16)
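
Dropping every incomplete row keeps 6236 of the original 29531 rows. For comparison, here is a sketch on a hypothetical toy frame that contrasts row dropping with per-column median filling. Median filling is an alternative shown for illustration only, not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with scattered gaps.
toy = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 40.0],
    "PM10": [50.0, 60.0, np.nan, 80.0],
    "AQI": [70.0, 90.0, 110.0, 130.0],
})

dropped = toy.dropna()                              # keeps only complete rows
filled = toy.fillna(toy.median(numeric_only=True))  # fills each gap with the column median

print(len(dropped), len(filled))

# Share of rows kept in the notebook, from the two df.shape outputs.
retained = 6236 / 29531
print(f"{retained:.1%}")
```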

In [16]:
# Summary of the dataset
df.describe()

Unnamed: 0,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI
count,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0,6236.0
mean,61.327365,123.418321,17.015191,31.70819,32.448956,20.73707,0.984344,11.514426,36.127691,3.700361,10.323696,2.557439,140.510103
std,53.709682,85.791491,20.037836,18.784041,27.388129,16.088215,1.356161,7.166113,19.553695,5.062159,12.287223,4.53506,92.738826
min,2.0,7.8,0.25,0.17,0.17,0.12,0.0,0.71,1.55,0.0,0.0,0.0,23.0
25%,27.9275,66.97,5.08,15.9775,14.5475,10.39,0.49,6.5575,22.3575,0.91,2.21,0.3,78.0
50%,47.49,103.01,10.06,28.9,24.285,14.69,0.73,9.875,32.54,2.435,6.31,1.25,112.0
75%,73.4425,150.77,19.3925,43.6325,39.6225,28.545,1.06,14.43,45.5125,4.62,13.04,3.03,166.0
max,639.19,796.88,159.22,140.17,224.09,166.7,16.23,70.39,162.33,64.44,103.0,125.18,677.0
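
The summary shows maxima far above the 75th percentile for most pollutants. One common way to quantify that is the 1.5 × IQR rule of thumb, worked through below for PM2.5 using the quartiles printed above. The rule is a convention for flagging candidates, not proof that a reading is wrong.

```python
# Quartiles and maximum for PM2.5, copied from the df.describe() output.
q1, q3, observed_max = 27.9275, 73.4425, 639.19

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as potential outliers

print(f"IQR = {iqr:.4f}, upper fence = {upper_fence:.4f}")
print(observed_max > upper_fence)
```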


In [17]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PM2.5,6236.0,61.327365,53.709682,2.0,27.9275,47.49,73.4425,639.19
PM10,6236.0,123.418321,85.791491,7.8,66.97,103.01,150.77,796.88
NO,6236.0,17.015191,20.037836,0.25,5.08,10.06,19.3925,159.22
NO2,6236.0,31.70819,18.784041,0.17,15.9775,28.9,43.6325,140.17
NOx,6236.0,32.448956,27.388129,0.17,14.5475,24.285,39.6225,224.09
NH3,6236.0,20.73707,16.088215,0.12,10.39,14.69,28.545,166.7
CO,6236.0,0.984344,1.356161,0.0,0.49,0.73,1.06,16.23
SO2,6236.0,11.514426,7.166113,0.71,6.5575,9.875,14.43,70.39
O3,6236.0,36.127691,19.553695,1.55,22.3575,32.54,45.5125,162.33
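Benzene,6236.0,3.700361,5.062159,0.0,0.91,2.435,4.62,64.44
Toluene,6236.0,10.323696,12.287223,0.0,2.21,6.31,13.04,103.0
Xylene,6236.0,2.557439,4.53506,0.0,0.3,1.25,3.03,125.18
AQI,6236.0,140.510103,92.738826,23.0,78.0,112.0,166.0,677.0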


In [28]:
# Sum of each column's summary statistics (count, mean, std, min, quartiles, max)
# relative to its non-null count, in percent
percentage = ((df.describe().sum()/df.count())*100).sort_values(ascending = False)
percentage

PM10          121.402178
AQI           120.674293
PM2.5         114.513904
NOx           105.813856
O3            105.131036
NO2           104.479510
NH3           104.125566
NO            103.705188
Toluene       102.360021
Xylene        102.194556
SO2           101.934622
Benzene       101.301596
CO            100.334357
AQI_Bucket           NaN
City                 NaN
Date                 NaN
dtype: float64
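
To see what these figures represent, the PM10 value can be reproduced by hand from the `df.describe()` output shown earlier: it is the sum of the eight summary statistics divided by the row count, times 100. A short check under that reading of the expression:

```python
# Summary statistics for PM10, copied from the df.describe() output.
pm10_stats = {
    "count": 6236.0, "mean": 123.418321, "std": 85.791491, "min": 7.8,
    "25%": 66.97, "50%": 103.01, "75%": 150.77, "max": 796.88,
}

# Same arithmetic as df.describe().sum() / df.count() * 100 for one column.
pm10_value = sum(pm10_stats.values()) / pm10_stats["count"] * 100
print(round(pm10_value, 4))
```

Because the count itself is part of the sum, every numeric column lands above 100, and the non-numeric columns come out as NaN since `describe()` skips them by default.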