#***AIR QUALITY INDEX PREDICTION USING ML***

### **Dataset Information:**

1. Date (DD/MM/YYYY)  
2. Time (HH.MM.SS)  
3. True hourly averaged concentration CO in mg/m^3 (reference analyzer)  
4. PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)  
5. True hourly averaged overall Non Methanic HydroCarbons concentration in microg/m^3 (reference analyzer)  
6. True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)  
7. PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)  
8. True hourly averaged NOx concentration in ppb (reference analyzer)  
9. PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)  
10. True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)  
11. PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)  
12. PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)  
13. Temperature in °C  
14. Relative Humidity (%)  
15. AH Absolute Humidity


Data Set Information:
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. **Data were recorded from March 2004 to February 2005 (one year)**, representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. And Act. B, Vol. 129, 2.2.2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. **Missing values are tagged with -200 value**.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.

Github repository: https://github.com/12215212sudhiksha/Air-Quality-Index-Prediction

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/content/Air_Quality.csv', sep =';', decimal = ',')
# Values in the CSV file are separated by semicolons, and some columns use ',' as the decimal separator

In [None]:
df.head()

##Dropping the unwanted columns

In [None]:
#removing the last 2 columns from the dataframe
df = df.iloc[:, :-2]

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape


##Remove NaN rows

In [None]:
df.isna().sum()

In [None]:
sns.heatmap(df.isna(), yticklabels=False, cmap='coolwarm')
plt.show()

In [None]:
df.loc[[9356]]

Index 9356 is the last data point in the dataframe; the remaining rows contain only null values.

In [None]:
df.head(9357)

In [None]:
df.tail()

In [None]:
df.dropna(inplace=True)
df.tail()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

This shows that there are no missing values in the dataset. But the actual missing values are tagged with the value "-200".

In [None]:
#Counting the number of times -200 appears in each column
df.isin([-200]).sum(axis=0)

##Handling the missing values

Convert all -200 to NaN

Replace all NaN values with mean of that specific column.
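The order of these two steps matters: if the mean were taken while the -200 sentinels were still in the column, the fill value itself would be distorted. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column with one sentinel value standing in for a missing reading
toy = pd.DataFrame({"CO(GT)": [2.0, -200.0, 4.0]})

# Mean computed too early: the sentinel drags it far below any real reading
naive_mean = toy["CO(GT)"].mean()

# Convert the sentinel to NaN first, then fill with the mean of the real readings
cleaned = toy.replace(-200, np.nan)
cleaned = cleaned.fillna(cleaned.mean())

print(naive_mean)                   # negative, clearly not a valid concentration
print(cleaned["CO(GT)"].tolist())   # [2.0, 3.0, 4.0]
```

pandas skips NaN when computing `mean()`, which is why the fill value here is the average of the two valid readings only.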

In [None]:
df=df.replace(to_replace=-200, value = np.nan)

In [None]:
df.isnull().sum()

This shows the actual number of missing values

In [None]:
df.tail()

In [None]:
df.select_dtypes(include='number').mean()


In [None]:
#Replacing the missing values with mean value of each column
df = df.fillna(df.select_dtypes(include='number').mean())


In [None]:
df.tail()

In [None]:
df.isnull().sum()

##Handling Outliers

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(data=df,palette='rocket')
plt.xticks(rotation='vertical')
plt.show()

USING IQR METHOD TO HANDLE OUTLIERS
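As a small worked illustration of the IQR rule on one column, the sketch below flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` and replaces them with the column median. The helper name `replace_outliers_with_median` and the toy values are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

def replace_outliers_with_median(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside the IQR fences with the median of the series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (s < lower) | (s > upper)
    # Series.mask replaces entries where the condition is True
    return s.mask(is_outlier, s.median())

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
cleaned = replace_outliers_with_median(toy)
print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 3.5]
```

Here Q1 = 2.25 and Q3 = 4.75, so the fences are -1.5 and 8.5, and only 100.0 is replaced. Because the median always lies between Q1 and Q3, a replaced value can never fall outside the original fences.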

In [None]:
# Select only numeric columns
df_numeric = df.select_dtypes(include='number')

# Calculate IQR for numeric data
Q1 = df_numeric.quantile(0.25)
Q3 = df_numeric.quantile(0.75)
IQR = Q3 - Q1

# Count of outliers in each numeric column
outliers = ((df_numeric < (Q1 - 1.5 * IQR)) | (df_numeric > (Q3 + 1.5 * IQR))).sum()
print(outliers)



In [None]:
column_outlier = ['AH', 'C6H6(GT)', 'CO(GT)', 'NO2(GT)', 'NOx(GT)', 'PT08.S1(CO)',
                  'PT08.S2(NMHC)', 'PT08.S3(NOx)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
                  'RH', 'T']

# Convert columns to float
for i in column_outlier:
    df[i] = df[i].astype('float')

# Calculate Q1, Q3, and IQR for only the relevant columns
Q1 = df[column_outlier].quantile(0.25)
Q3 = df[column_outlier].quantile(0.75)
IQR = Q3 - Q1

# Detect outliers
outliers = (df[column_outlier] < (Q1 - 1.5 * IQR)) | (df[column_outlier] > (Q3 + 1.5 * IQR))

# Replace outliers with median
for i in column_outlier:
    median_val = df[i].median()
    df.loc[outliers[i], i] = median_val

# Check if outliers remain
remaining_outliers = ((df[column_outlier] < (Q1 - 1.5 * IQR)) |
                      (df[column_outlier] > (Q3 + 1.5 * IQR))).sum()

print("Remaining outliers after replacement:")
print(remaining_outliers)


In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(data=df,palette='rocket')
plt.xticks(rotation='vertical')
plt.show()
# Fewer points now fall outside the whiskers, indicating that the outliers have been handled

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select only numeric columns
numeric_df = df.select_dtypes(include='number')

# Plot correlation heatmap
plt.figure(figsize=(9, 5))
sns.heatmap(numeric_df.corr(), cmap='YlGnBu', annot=True)
plt.title("Correlation Heatmap")
plt.show()


1. CO(GT) and C6H6(GT) show strong positive correlations with several other features and pollutant indicators, especially gas sensor readings.

2. NOx(GT) and NO2(GT) show moderate to strong correlations with related gas sensors, indicating their influence on air quality.

3. Temperature (T), Relative Humidity (RH), and Absolute Humidity (AH) show low or negligible correlation with pollutant levels and sensor data, except for a moderate correlation between T and AH.

Therefore, we use the pollutants with the highest correlation as features. However, a clear range for C6H6 to calculate its AQI sub-index could not be found, so it has not been used as a feature.

In [None]:
#calculate subindex of CO
def CO_AQI_subindex(x):
    if x <= 1:
        return x * 50 / 1
    elif x <= 2:
        return 50 + (x - 1) * 50
    elif x <= 10:
        return 100 + (x - 2) * 100 / 8
    elif x <= 17:
        return 200 + (x - 10) * 100 / 7
    elif x <= 34:
        return 300 + (x - 17) * 100 / 17
    elif x > 34:
        return 400 + (x - 34) * 100 / 17
    else:
        return 0

df["CO_SubIndex"] = df["CO(GT)"].apply(lambda x: CO_AQI_subindex(x))

In [None]:
##calculate subindex of NO2
def NO2_AQI_subindex(x):
    if x <= 40:
        return x * 50 / 40
    elif x <= 80:
        return 50 + (x - 40) * 50 / 40
    elif x <= 180:
        return 100 + (x - 80) * 100 / 100
    elif x <= 280:
        return 200 + (x - 180) * 100 / 100
    elif x <= 400:
        return 300 + (x - 280) * 100 / 120
    elif x > 400:
        return 400 + (x - 400) * 100 / 120
    else:
        return 0

df["NO2_SubIndex"] = df["NO2(GT)"].apply(lambda x: NO2_AQI_subindex(x))

In [None]:
##calculate subindex of NOx
def NOx_AQI_subindex(x):
    if x <= 40:
        return x * 50 / 40
    elif x <= 80:
        return 50 + (x - 40) * 50 / 40
    elif x <= 180:
        return 100 + (x - 80) * 100 / 100
    elif x <= 280:
        return 200 + (x - 180) * 100 / 100
    elif x <= 400:
        return 300 + (x - 280) * 100 / 120
    elif x > 400:
        return 400 + (x - 400) * 100 / 120
    else:
        return 0
df["NOx_SubIndex"] = df["NOx(GT)"].apply(lambda x: NOx_AQI_subindex(x))


In [None]:
print(df.columns.tolist())


In [None]:
#calculating AQI
df["AQI"] = round(df[["NO2_SubIndex", "CO_SubIndex", "NOx_SubIndex"]].max(axis = 1))
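The three sub-index functions above share one pattern: linear interpolation between breakpoints, with the AQI taken as the largest sub-index. The sketch below expresses that pattern as a single table-driven helper. The names `sub_index`, `CO_BREAKPOINTS`, and `NO2_BREAKPOINTS` are hypothetical; the breakpoint values are copied from `CO_AQI_subindex` and `NO2_AQI_subindex`. It assumes finite inputs, so unlike the functions above it does not map NaN to 0.

```python
# (upper concentration, upper index) pairs copied from the functions above
CO_BREAKPOINTS = [(1, 50), (2, 100), (10, 200), (17, 300), (34, 400)]
NO2_BREAKPOINTS = [(40, 50), (80, 100), (180, 200), (280, 300), (400, 400)]

def sub_index(x, breakpoints, overflow_width):
    """Piecewise-linear sub-index; assumes x is a finite number."""
    low_conc, low_idx = 0.0, 0.0
    for high_conc, high_idx in breakpoints:
        if x <= high_conc:
            # interpolate within the current band
            return low_idx + (x - low_conc) * (high_idx - low_idx) / (high_conc - low_conc)
        low_conc, low_idx = high_conc, high_idx
    # above the last breakpoint: keep rising by 100 index points per overflow_width
    return low_idx + (x - low_conc) * 100 / overflow_width

co = sub_index(2.5, CO_BREAKPOINTS, overflow_width=17)
no2 = sub_index(35.0, NO2_BREAKPOINTS, overflow_width=120)
nox = sub_index(20.0, NO2_BREAKPOINTS, overflow_width=120)  # NOx reuses the NO2 table
aqi = round(max(co, no2, nox))
print(co, no2, nox, aqi)  # 106.25 43.75 25.0 106
```

For these inputs the CO sub-index dominates, so the AQI is 106, which falls in the 101 to 200 band.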

#Naive Bayes Classifier

In [None]:
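from sklearn.model_selection import train_test_split

# 'Air Quality' is not created by any earlier cell, so derive it from AQI here
# using the same class boundaries as air_qual (1 = Good ... 6 = Severe)
df['Air Quality'] = np.digitize(df['AQI'], [50, 100, 200, 300, 400], right=True) + 1

y = df['Air Quality'].values
features = ['CO(GT)', 'NO2(GT)', 'NOx(GT)']
X = df[features].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)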


In [None]:
#creating a Naive bayes class
class NaiveBayes:

    #fitting the model
    def fit(self, X, y):
        samples, features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)

        # calculate mean, var, and prior for each class
        self.mean = np.zeros((n_classes, features), dtype=np.float64)
        self.var = np.zeros((n_classes, features), dtype=np.float64)
        self.priors = np.zeros(n_classes, dtype=np.float64)

        for index, c in enumerate(self.classes):
            X_c = X[y == c]
            self.mean[index, :] = X_c.mean(axis=0)
            self.var[index, :] = X_c.var(axis=0)
            self.priors[index] = X_c.shape[0] / float(samples)


    def predict(self, X, noise_factor=0.0):
        # pass noise_factor down explicitly instead of relying on a global variable
        y_pred = [self._predict(x, noise_factor) for x in X]
        return np.array(y_pred)

    def _predict(self, x, noise_factor=0.0):
        probab = []

        # calculate the log-posterior for each class
        for index, c in enumerate(self.classes):
            prior = np.log(self.priors[index])
            posterior = np.sum(np.log(self.pdf(index, x)))
            posterior = posterior + prior
            probab.append(posterior)

        # introduce randomness by adding noise to the log-posteriors to see how it affects model performance
        probab_with_noise = np.array(probab) + np.random.normal(scale=noise_factor, size=len(probab))

        # return the class with the highest log-posterior after noise is added
        return self.classes[np.argmax(probab_with_noise)]

    #calculating the Gaussian probability density of each feature for the given class
    def pdf(self, class_index, x):
        mean = self.mean[class_index]
        var = self.var[class_index]
        num = np.exp(-((x - mean) ** 2) / (2 * var))
        denom = np.sqrt(2 * np.pi * var)
        return num / denom
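One practical caveat with `pdf` as written: for a sample far from a class mean, `np.exp` underflows to 0 and `np.log` then returns `-inf`, and a zero variance causes a division by zero. A common workaround is to compute the log-density directly. The function below is a sketch of that idea, not the notebook's implementation; the name `gaussian_log_pdf` and the small `eps` variance floor are assumptions introduced here.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var, eps=1e-9):
    """Log of the Gaussian density, computed without exponentiating first."""
    var = var + eps  # assumed floor to avoid division by zero for constant features
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# A point 100 standard deviations from the mean
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(-(100.0 ** 2) / 2.0) / np.sqrt(2.0 * np.pi))
stable = gaussian_log_pdf(100.0, 0.0, 1.0)

print(naive)   # -inf, because exp(-5000) underflows to 0
print(stable)  # a finite value close to -5000.92
```

Inside `_predict`, `np.sum(np.log(self.pdf(index, x)))` could be swapped for `np.sum(gaussian_log_pdf(x, self.mean[index], self.var[index]))` if underflow turns out to be a problem on this data.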

In [None]:
def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return np.sum(y_true == y_pred) / len(y_true)

# note: this re-splits the data with random_state=123, replacing the earlier split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
naive_bayes = NaiveBayes()
naive_bayes.fit(X_train, y_train)

# accuracy with noise included
noise_factor = 0.7
prediction = naive_bayes.predict(X_test, noise_factor)
print("Naive Bayes classification accuracy with noise", accuracy(y_test, prediction))

# accuracy without noise
noise_factor = 0
y_pred = naive_bayes.predict(X_test, noise_factor)
print("Naive Bayes classification accuracy without noise", accuracy(y_test, y_pred))

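High accuracy (94.8%): the Naive Bayes model correctly predicts the air quality category for most test samples using just 3 features.

Small drop with noise (94.8% → 94.2%): even after Gaussian noise with scale 0.7 is added to the log-posteriors, accuracy drops by only 0.6 percentage points.

This suggests that the predictions are stable: for most samples, the winning class leads by a margin large enough that the added randomness does not change the outcome. Note that the noise is applied to the class scores, not to the input measurements, so this test checks the stability of the decision rather than robustness to sensor error.

In summary, the model performs well on the held-out test set and is not overly sensitive to noise in its class scores, which makes it a reasonable choice for this prediction task.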

✅ Conclusion

This implementation builds a Naive Bayes classifier from scratch to predict air quality categories based on pollutant levels (CO(GT), NO2(GT), and NOx(GT)). The classifier assumes a Gaussian (Normal) distribution for each feature within each class and uses these distributions to calculate class probabilities.
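Written out, the decision rule that `_predict` applies before any noise is added is the standard Gaussian Naive Bayes rule. The notation is introduced here for illustration: $c$ ranges over the classes, $j$ over the three features, and $\mu_{cj}$ and $\sigma_{cj}^2$ are the per-class mean and variance stored in `self.mean` and `self.var`.

$$
\hat{y} = \arg\max_{c}\left[\log P(c) + \sum_{j=1}^{3} \log \mathcal{N}\left(x_j;\ \mu_{cj},\ \sigma_{cj}^2\right)\right],
\qquad
\mathcal{N}(x;\ \mu,\ \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

The sum over features is where the "naive" independence assumption enters: each feature contributes its own one-dimensional Gaussian term, with no covariance between features.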

To evaluate the model's robustness, controlled random noise is added to the posterior probabilities before making predictions. This simulates uncertainty in real-world scenarios and allows us to compare accuracy with and without noise.
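To get a feel for when this kind of noise changes a prediction, the sketch below estimates how often the winning class flips for a single sample, given its log-posterior scores. The function name `flip_rate` and the two example score vectors are made up for illustration; only the noise mechanism (Gaussian noise added to the scores before `argmax`) mirrors the classifier above.

```python
import numpy as np

def flip_rate(log_posteriors, noise_scale, n_trials=10_000, seed=0):
    """Fraction of noisy trials in which argmax differs from the noise-free choice."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(log_posteriors, dtype=float)
    clean_choice = np.argmax(scores)
    noise = rng.normal(scale=noise_scale, size=(n_trials, scores.size))
    noisy_choice = np.argmax(scores + noise, axis=1)
    return float(np.mean(noisy_choice != clean_choice))

confident = [-1.0, -3.0, -6.0]   # clear winner: gap of 2 between the top two classes
borderline = [-1.0, -1.2, -6.0]  # near tie: gap of 0.2 between the top two classes

print(flip_rate(confident, noise_scale=0.0))   # 0.0, no noise means no flips
print(flip_rate(confident, noise_scale=0.7))
print(flip_rate(borderline, noise_scale=0.7))
```

With a noise scale of 0.7, the confident sample rarely flips while the borderline one flips often, so a small overall accuracy drop under noise indicates that few test samples sit close to a class boundary.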

Results show how the model's performance can degrade when predictions are influenced by noise, providing insights into its stability and reliability under imperfect conditions.

In [119]:
import numpy as np

# Air Quality Mapping according to AQI values
def air_qual(x):
    if 0 <= x <= 50:
        return 1  # Good
    elif x <= 100:
        return 2  # Satisfactory
    elif x <= 200:
        return 3  # Moderately polluted
    elif x <= 300:
        return 4  # Poor
    elif x <= 400:
        return 5  # Very Poor
    elif x > 400:
        return 6  # Severe

# Air quality class mapping
class_map = {
    1: "Good",
    2: "Satisfactory",
    3: "Moderately polluted",
    4: "Poor",
    5: "Very Poor",
    6: "Severe"
}

# Example: Trained NaiveBayes model (make sure to run training code first!)
# naive_bayes = NaiveBayes()
# naive_bayes.fit(X_train, y_train)

# Get user inputs for the pollutants
co = float(input("Enter CO(GT) value (e.g., 2.5): "))
no2 = float(input("Enter NO2(GT) value (e.g., 35.0): "))
nox = float(input("Enter NOx(GT) value (e.g., 20.0): "))

# Prepare input data in the same format as used during training
input_data = np.array([[co, no2, nox]])

# Ask user if they want to include noise
add_noise = input("Add noise to prediction? (yes/no): ").strip().lower()
noise_factor = 0.7 if add_noise == "yes" else 0.0

# Make prediction using Naive Bayes model
prediction = naive_bayes.predict(input_data, noise_factor)

# Map the predicted class to the air quality label
predicted_class = prediction[0]
predicted_air_quality = class_map[predicted_class]

# Display the result
print(f"\nPredicted Air Quality: {predicted_air_quality}")


Enter CO(GT) value (e.g., 2.5): 2.5
Enter NO2(GT) value (e.g., 35.0): 35.0
Enter NOx(GT) value (e.g., 20.0): 20.0
Add noise to prediction? (yes/no): yes

Predicted Air Quality: Moderately polluted
