# AIR QUALITY PREDICTION
[Kaggle link for the dataset and its description](https://www.kaggle.com/datasets/mujtabamatin/air-quality-and-pollution-assessment/data)

## Team Members
1.   Imthias Abubakkar
2.   Sravana Sakthidharan
3.   Anu Neduvely Asokan
4.   Huong Ta
5.   Revanth Puvaneswaran





## Importing Necessary Packages

In [1]:
import pandas as pd
import numpy as np
import joblib
import pickle
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix)

Running the command below downloads the data file (CSV) directly into your current working directory.


*   In Google Colab, the file appears in the Files panel on the left, below the `sample_data` folder


In [2]:
!gdown 1HOe13ZP-5-t9syBLhEkgNSGS33CDqcKb

Downloading...
From: https://drive.google.com/uc?id=1HOe13ZP-5-t9syBLhEkgNSGS33CDqcKb
To: /content/updated_pollution_dataset.csv
100% 242k/242k [00:00<00:00, 104MB/s]


In [None]:
df = pd.read_csv('/content/updated_pollution_dataset.csv')
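The `/content/` path above is specific to Google Colab. As a minimal sketch for running the notebook elsewhere, the helper below looks for the CSV in a few candidate directories. The function name `load_dataset` and the list of search directories are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(filename="updated_pollution_dataset.csv", search_dirs=("/content", ".")):
    """Return the first CSV named `filename` found in `search_dirs` as a DataFrame.

    `search_dirs` is an assumed list of places to look: the Colab working
    directory first, then the current directory.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return pd.read_csv(candidate)
    raise FileNotFoundError(f"{filename} not found in {list(search_dirs)}")
```

Calling `df = load_dataset()` would then behave the same way in Colab and in a local checkout that has the CSV in its working directory.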

In [None]:
df

Unnamed: 0,Temperature,Humidity,PM2.5,PM10,NO2,SO2,CO,Proximity_to_Industrial_Areas,Population_Density,Air Quality
0,29.8,59.1,5.2,17.9,18.9,9.2,1.72,6.3,319,Moderate
1,28.3,75.6,2.3,12.2,30.8,9.7,1.64,6.0,611,Moderate
2,23.1,74.7,26.7,33.8,24.4,12.6,1.63,5.2,619,Moderate
3,27.1,39.1,6.1,6.3,13.5,5.3,1.15,11.1,551,Good
4,26.5,70.7,6.9,16.0,21.9,5.6,1.01,12.7,303,Good
...,...,...,...,...,...,...,...,...,...,...
4995,40.6,74.1,116.0,126.7,45.5,25.7,2.11,2.8,765,Hazardous
4996,28.1,96.9,6.9,25.0,25.3,10.8,1.54,5.7,709,Moderate
4997,25.9,78.2,14.2,22.1,34.8,7.8,1.63,9.6,379,Moderate
4998,25.3,44.4,21.4,29.0,23.7,5.7,0.89,11.6,241,Good


Below, we rename the columns to short snake_case names and then look at the size and summary statistics of the data.

In [None]:
df = df.rename(columns={'Temperature': 'temperature', 'Humidity': 'humidity','PM2.5':'pm_25','PM10':'pm_10',
                        'NO2':'no2','SO2':'so2','CO':'co','Proximity_to_Industrial_Areas':'proximity_level',
                        'Population_Density':'population_density','Air Quality':'air_quality'})

In [None]:
df.head()

Unnamed: 0,temperature,humidity,pm_25,pm_10,no2,so2,co,proximity_level,population_density
0,29.8,59.1,5.2,17.9,18.9,9.2,1.72,6.3,319
1,28.3,75.6,2.3,12.2,30.8,9.7,1.64,6.0,611
2,23.1,74.7,26.7,33.8,24.4,12.6,1.63,5.2,619
3,27.1,39.1,6.1,6.3,13.5,5.3,1.15,11.1,551
4,26.5,70.7,6.9,16.0,21.9,5.6,1.01,12.7,303


In [None]:
len(df)

5000

In [None]:
df.describe()

Unnamed: 0,Temperature,Humidity,PM2.5,PM10,NO2,SO2,CO,Proximity_to_Industrial_Areas,Population_Density
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,30.02902,70.05612,20.14214,30.21836,26.4121,10.01482,1.500354,8.4254,497.4238
std,6.720661,15.863577,24.554546,27.349199,8.895356,6.750303,0.546027,3.610944,152.754084
min,13.4,36.0,0.0,-0.2,7.4,-6.2,0.65,2.5,188.0
25%,25.1,58.3,4.6,12.3,20.1,5.1,1.03,5.4,381.0
50%,29.0,69.8,12.0,21.7,25.3,8.0,1.41,7.9,494.0
75%,34.0,80.3,26.1,38.1,31.9,13.725,1.84,11.1,600.0
max,58.6,128.1,295.0,315.8,64.9,44.9,3.72,25.8,957.0
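The summary above reports a minimum of -0.2 for PM10, a minimum of -6.2 for SO2, and a maximum humidity of 128.1. A minimal sketch for counting such values is shown below. The bounds (concentrations at or above 0, humidity between 0 and 100) are assumptions about what is physically plausible, the helper name `count_out_of_range` is hypothetical, and the toy rows only stand in for the real `df`.

```python
import pandas as pd


def count_out_of_range(frame, bounds):
    """Count values outside [low, high] per column; None means no limit on that side."""
    counts = {}
    for column, (low, high) in bounds.items():
        values = frame[column]
        mask = pd.Series(False, index=frame.index)
        if low is not None:
            mask |= values < low
        if high is not None:
            mask |= values > high
        counts[column] = int(mask.sum())
    return counts


# Toy rows using the renamed columns; replace `sample` with `df` in the notebook.
sample = pd.DataFrame({
    "pm_10": [17.9, -0.2, 33.8],
    "so2": [9.2, -6.2, 12.6],
    "humidity": [59.1, 128.1, 74.7],
})
# Assumed plausibility bounds, not taken from the dataset description.
assumed_bounds = {"pm_10": (0, None), "so2": (0, None), "humidity": (0, 100)}
print(count_out_of_range(sample, assumed_bounds))
```

Whether such rows need any treatment is a judgement call; the count simply makes the extent visible.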


Below you can see the dataset info: each column and its data type (float, int, or object).


*  The non-null counts show that we don't have any null values
*  We don't need any data cleaning operations
*  Let's assume we don't have any duplicate values



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   temperature         5000 non-null   float64
 1   humidity            5000 non-null   float64
 2   pm_25               5000 non-null   float64
 3   pm_10               5000 non-null   float64
 4   no2                 5000 non-null   float64
 5   so2                 5000 non-null   float64
 6   co                  5000 non-null   float64
 7   proximity_level     5000 non-null   float64
 8   population_density  5000 non-null   int64  
 9   air_quality         5000 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 390.8+ KB
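The points about null and duplicate values can be checked directly rather than assumed. The sketch below uses a small toy frame so that it runs on its own; the helper name `summarize_quality` is hypothetical, and in the notebook it would be called on `df`.

```python
import pandas as pd


def summarize_quality(frame):
    """Return the total number of null cells and of fully duplicated rows."""
    return {
        "null_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame with one missing value and one repeated row.
toy = pd.DataFrame({
    "temperature": [29.8, 28.3, 28.3, None],
    "humidity": [59.1, 75.6, 75.6, 39.1],
})
print(summarize_quality(toy))  # {'null_values': 1, 'duplicate_rows': 1}
```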


### Skipping the Exploratory Data Analysis part, as we are focused only on building a preliminary model

## Data Preparation

We apply a label encoder to our target variable.


*   In our case, the target variable is "Air Quality" (renamed to `air_quality`)
*   The rest of the columns are the independent variables (features)



In [None]:
# Before label encoding
Counter(df['air_quality'])

Counter({'Moderate': 1500, 'Good': 2000, 'Hazardous': 500, 'Poor': 1000})

In [None]:
le=LabelEncoder()
df['air_quality']=le.fit_transform(df['air_quality'])

In [None]:
# After label encoding
Counter(df['air_quality'])

Counter({2: 1500, 0: 2000, 1: 500, 3: 1000})

The label encoder assigned the following values (`LabelEncoder` sorts the classes alphabetically):

*   Good       == 0
*   Hazardous  == 1
*   Moderate   == 2
*   Poor       == 3
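The mapping listed above can also be read from the fitted encoder instead of being written out by hand. This is a small sketch on the four class names; the variable names `encoder` and `mapping` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Moderate", "Good", "Hazardous", "Poor"])

# classes_ holds the sorted class names; the position of each name is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Good': 0, 'Hazardous': 1, 'Moderate': 2, 'Poor': 3}
```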



In [None]:
x=df.drop(columns='air_quality')
y=df['air_quality']

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=22)
print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(3750, 9) (3750,) (1250, 9) (1250,)
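The classes are not equally frequent (2000 Good, 1500 Moderate, 1000 Poor, 500 Hazardous), and the split above is purely random. One option, not used in this notebook, is to pass `stratify` to `train_test_split` so that both subsets keep roughly the same class proportions. The sketch below uses synthetic labels with the same class counts and a placeholder feature column.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (codes 0..3).
labels = np.repeat([0, 1, 2, 3], [2000, 500, 1500, 1000])
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# stratify=labels keeps the class proportions close to 40/10/30/20 in both parts.
_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=22, stratify=labels
)
test_counts = Counter(y_te.tolist())
print(test_counts)
```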


In [None]:
x_train.head(2)

Unnamed: 0,temperature,humidity,pm_25,pm_10,no2,so2,co,proximity_level,population_density
150,42.6,94.6,17.9,30.1,44.3,4.0,1.91,2.6,525
4207,32.4,97.5,25.4,42.8,33.4,10.6,2.02,4.6,792


In [None]:
y_train.head(2)

Unnamed: 0,air_quality
150,1
4207,3


In [None]:
x_test.head(2)

Unnamed: 0,temperature,humidity,pm_25,pm_10,no2,so2,co,proximity_level,population_density
1332,21.8,43.3,1.3,5.1,19.8,8.0,0.84,10.6,451
2844,27.5,66.2,1.8,7.0,27.2,7.3,1.06,11.3,326


In [None]:
y_test.head(2)

Unnamed: 0,air_quality
1332,0
2844,0


## Model Building

We use a random forest classifier and tune its hyperparameters with a randomized search, scored by accuracy under 10-fold cross-validation.

In [None]:
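rf = RandomForestClassifier()
# Search space for the randomized search
params = {'criterion': ['gini', 'entropy'], 'min_samples_split': list(np.arange(2, 41)),
          'min_samples_leaf': list(np.arange(2, 41)), 'max_features': ['sqrt', 'log2', None],
          'n_estimators': [400]}

# n_iter keeps its default of 10, so 10 combinations are sampled and each is
# scored by accuracy with 10-fold cross-validation
model = RandomizedSearchCV(rf, param_distributions=params, random_state=16, cv=10,
                           scoring='accuracy', n_jobs=-1)
model.fit(x_train, y_train)
print(model.best_params_)
print(model.best_score_)

# Keep the best estimator, which is refit on all of x_train (refit=True by default)
model = model.best_estimator_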

{'n_estimators': 400, 'min_samples_split': 32, 'min_samples_leaf': 11, 'max_features': 'sqrt', 'criterion': 'gini'}
0.9469333333333333
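The search above fixes `random_state` on the search but not on the forest itself, so repeated runs may give slightly different scores. The sketch below shows one way to pin both and to make the number of sampled combinations explicit. It runs on synthetic data from `make_classification` with deliberately small settings (20 trees, 3 folds, 5 iterations) so that it finishes quickly; none of these values come from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data: 9 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=4, random_state=0
)

search_space = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": list(range(2, 41)),
    "min_samples_leaf": list(range(2, 41)),
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),  # seeded forest
    param_distributions=search_space,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,
    scoring="accuracy",
    random_state=16,   # seeds the sampling of combinations
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

With only 10 sampled combinations out of several thousand possible ones, the result in the notebook is a reasonable starting point rather than an exhaustive optimum.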


In [None]:
joblib.dump(model, "model.pkl")

['model.pkl']

In [None]:
mod = joblib.load("/content/model.pkl")

In [None]:
mod.predict(x_train.head(5))

array([1, 3, 0, 3, 2])

In [None]:
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)

In [None]:
print("Training Evaluation Metrics:")
print("Accuracy: ",accuracy_score(y_train,pred_train))
print("Precision: ",precision_score(y_train,pred_train,average='micro'))
print("Recall: ",recall_score(y_train,pred_train,average='micro'))
print("F1 Score: ",f1_score(y_train,pred_train,average='micro'))
print("\nClassification Report: \n",classification_report(y_train,pred_train))

Training Evaluation Metrics:
Accuracy:  0.9664
Precision:  0.9664
Recall:  0.9664
F1 Score:  0.9664

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      1509
           1       0.95      0.88      0.91       368
           2       0.96      0.98      0.97      1116
           3       0.92      0.92      0.92       757

    accuracy                           0.97      3750
   macro avg       0.96      0.95      0.95      3750
weighted avg       0.97      0.97      0.97      3750
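In a single-label multiclass problem, micro-averaged precision, recall, and F1 are all equal to accuracy, which is why the four numbers printed above are identical. Macro averaging weights every class equally and is therefore more sensitive to the smaller classes. The sketch below illustrates the difference on a small hand-made example; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Ten toy samples over four classes, with one class-1 sample predicted as class 3.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 0, 0, 0, 1, 3, 2, 2, 2, 3])

acc = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average="micro")
micro_recall = recall_score(y_true, y_pred, average="micro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print("accuracy:", acc)
print("micro precision:", micro_precision)
print("micro recall:", micro_recall)
print("macro recall:", macro_recall)
```

Here the micro scores match the accuracy, while the macro recall is lower because the single error halves the recall of the small class 1.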



In [None]:
print("Testing Evaluation Metrics:")
print("Accuracy: ",accuracy_score(y_test,pred_test))
print("Precision: ",precision_score(y_test,pred_test,average='micro'))
print("Recall: ",recall_score(y_test,pred_test,average='micro'))
print("F1 Score: ",f1_score(y_test,pred_test,average='micro'))
print("\nClassification Report: \n",classification_report(y_test,pred_test))

Testing Evaluation Metrics:
Accuracy:  0.9504
Precision:  0.9504
Recall:  0.9504
F1 Score:  0.9504

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       491
           1       0.94      0.80      0.86       132
           2       0.96      0.96      0.96       384
           3       0.85      0.92      0.88       243

    accuracy                           0.95      1250
   macro avg       0.94      0.92      0.93      1250
weighted avg       0.95      0.95      0.95      1250
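The report above shows lower recall for class 1 (Hazardous) and lower precision for class 3 (Poor) than for the other classes. `confusion_matrix` is imported at the top of the notebook but not used; a labelled confusion matrix would show which classes the errors fall into. The sketch below uses invented toy predictions, and in the notebook `y_test` and `pred_test` would take their place.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Class names in the order of their encoded values 0..3.
class_names = ["Good", "Hazardous", "Moderate", "Poor"]

# Toy labels: one Hazardous sample (1) is predicted as Poor (3).
toy_true = [0, 1, 1, 2, 3, 3]
toy_pred = [0, 1, 3, 2, 3, 3]

# Rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3])
table = pd.DataFrame(matrix, index=class_names, columns=class_names)
print(table)
```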



In [None]:
actual_labels = np.array(['Moderate', 'Good', 'Hazardous', 'Poor'])

In [None]:
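# Refit an encoder on the class names to decode predictions; LabelEncoder sorts
# classes alphabetically, so the codes match the encoder fitted on the target column
le = LabelEncoder()
le.fit(actual_labels)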

In [None]:
model.predict(x_train.head(5))

array([1, 3, 0, 3, 2])

In [None]:
predictions = np.array(model.predict(x_train.head(5)))

In [None]:
original_labels = le.inverse_transform(predictions)


In [None]:
print(original_labels)

['Hazardous' 'Poor' 'Good' 'Poor' 'Moderate']
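Decoding works here because a new encoder fitted on the same four names produces the same codes. An alternative, sketched below under stated assumptions, is to save the fitted encoder together with the model so the mapping always travels with it. The dictionary keys, the file name `model_bundle.pkl`, and the random toy data are all illustrative and not part of the original notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the real features and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
labels_toy = np.array(["Good", "Hazardous", "Moderate", "Poor"] * 10)

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels_toy)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Store model and encoder in one file, then load them back.
bundle_path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
joblib.dump({"model": clf, "label_encoder": encoder}, bundle_path)
bundle = joblib.load(bundle_path)

# Predict with the loaded model and decode with the loaded encoder.
decoded = bundle["label_encoder"].inverse_transform(bundle["model"].predict(X_toy[:5]))
print(decoded)
```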
