# Air Quality Predictor

## Dataset Description 
Data Set Information:

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities, as well as both concept and sensor drifts, is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), and may eventually affect the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.


Attribute Information:

0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity

Objective:
We predict the Relative Humidity (RH) at a given point in time from all the other attributes.


## Importing the libraries

In [253]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

## Importing the dataset

In [254]:
df=pd.read_excel('AirQualityUCI.xlsx')
df.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,2004-03-10,18:00:00,2.6,1360.0,150,11.881723,1045.5,166.0,1056.25,113.0,1692.0,1267.5,13.6,48.875001,0.757754
1,2004-03-10,19:00:00,2.0,1292.25,112,9.397165,954.75,103.0,1173.75,92.0,1558.75,972.25,13.3,47.7,0.725487
2,2004-03-10,20:00:00,2.2,1402.0,88,8.997817,939.25,131.0,1140.0,114.0,1554.5,1074.0,11.9,53.975,0.750239
3,2004-03-10,21:00:00,2.2,1375.5,80,9.228796,948.25,172.0,1092.0,122.0,1583.75,1203.25,11.0,60.0,0.786713
4,2004-03-10,22:00:00,1.6,1272.25,51,6.518224,835.5,131.0,1205.0,116.0,1490.0,1110.0,11.15,59.575001,0.788794


## Finding the Number of Missing Values (denoted by -200)


In [255]:
a=(df[['Time','CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH','AH' ]]==-200).sum()
print(a)

Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
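
The raw counts above are easier to compare as shares of the rows; for instance, NMHC(GT) is missing in 8443 of 9357 rows, roughly 90%. The following is a minimal sketch of that calculation on a small made-up frame (`toy` and its values are hypothetical, not the real data): the -200 sentinel is masked to NaN and the missing fraction is computed per column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real sheet; -200 marks a missing reading.
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2, -200.0],
    "T": [13.6, 13.3, -200.0, 11.0],
    "RH": [48.9, 47.7, 54.0, 60.0],
})

# Mask the sentinel so pandas treats those cells as missing (NaN).
toy_nan = toy.mask(toy == -200)

# Fraction of missing values in each column.
missing_share = toy_nan.isna().mean()
print(missing_share)
```

The same two lines (`mask` followed by `isna().mean()`) should apply to the numeric columns of `df`, assuming -200 never occurs as a genuine reading.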


In [256]:
# Drop NMHC(GT): most of its values (8443 of 9357 rows) are missing
df.drop('NMHC(GT)',axis=1,inplace=True)

In [257]:
# Extract the hour and the month as integers from the Time and Date columns
# and add them as new columns
df['Hour']=df['Time'].apply(lambda t: t.hour)
df['Month']=df['Date'].apply(lambda d: d.month)
df.shape

(9357, 16)

## Filling out missing values

In [258]:
# Use SimpleImputer to replace each missing value (-200) with the mean of its column
from sklearn.impute import SimpleImputer
features=df.iloc[:,2:] # temporary dataframe without the Date and Time columns
imputer=SimpleImputer(missing_values=-200,strategy='mean')
filled=imputer.fit_transform(features) # note: this also fills missing values of RH, the target
df=pd.DataFrame(filled,columns=features.columns) # store the filled values back in df
df.head()

Unnamed: 0,CO(GT),PT08.S1(CO),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Hour,Month
0,2.6,1360.0,11.881723,1045.5,166.0,1056.25,113.0,1692.0,1267.5,13.6,48.875001,0.757754,18.0,3.0
1,2.0,1292.25,9.397165,954.75,103.0,1173.75,92.0,1558.75,972.25,13.3,47.7,0.725487,19.0,3.0
2,2.2,1402.0,8.997817,939.25,131.0,1140.0,114.0,1554.5,1074.0,11.9,53.975,0.750239,20.0,3.0
3,2.2,1375.5,9.228796,948.25,172.0,1092.0,122.0,1583.75,1203.25,11.0,60.0,0.786713,21.0,3.0
4,1.6,1272.25,6.518224,835.5,131.0,1205.0,116.0,1490.0,1110.0,11.15,59.575001,0.788794,22.0,3.0
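
To make the effect of mean imputation concrete, here is a small sketch on a made-up array (the numbers are illustrative only). Each -200 entry is replaced by the mean of the remaining values in its column, and the fitted column means are available in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array; -200 marks a missing reading.
toy = np.array([
    [1.0, 10.0],
    [-200.0, 20.0],
    [3.0, -200.0],
])

imputer = SimpleImputer(missing_values=-200, strategy="mean")
filled = imputer.fit_transform(toy)

print(filled)               # missing cells replaced by the column means
print(imputer.statistics_)  # per-column means computed from the non-missing values
```

Here the first column's missing cell becomes 2.0 (the mean of 1 and 3) and the second column's becomes 15.0 (the mean of 10 and 20).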


## Train Test Split and Scaling the features

In [259]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
y=df.RH #dividing dataset in X and y
X=df.drop('RH',axis=1)
ss=StandardScaler()#creating object of the standard scaler     
X_std=ss.fit_transform(X)
X_std=pd.DataFrame(X_std)
X_train, X_test, y_train, y_test = train_test_split(X_std,y, random_state = 0)
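
The cell above fits the scaler on all rows before splitting, so statistics from the test rows influence the scaling of the training data. One common alternative, shown here only as a sketch on synthetic data (`X_demo`, `y_demo` and the coefficients are made up), is to split first and wrap the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration, not the air quality data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first (default test size is 25%), then fit scaler + model on the training part only.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

test_r2 = model.score(X_te, y_te)
print("R-squared score (test): {:.3f}".format(test_r2))
```

With this arrangement `model.score(X_te, y_te)` applies the training-set mean and standard deviation to the test rows, which is what would happen with genuinely unseen data.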

## Training and Testing for various methods
##### Metrics used for evaluation: R2 score and root mean squared error (RMSE)
Models tested:
1) Simple Linear Regression
2) K-nearest neighbors regression (n_neighbors=5)
3) Ridge regression with alpha=20
4) Decision Tree Regressor
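
For reference, the two metrics can be computed by hand. The sketch below uses four made-up RH values and predictions (not results from this notebook) and checks the manual formulas against scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([48.0, 52.0, 60.0, 40.0])
y_hat = np.array([50.0, 50.0, 58.0, 42.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("RMSE:", rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print("R2:", r2_manual, r2_score(y_true, y_hat))
```

RMSE is expressed in the units of the target (percentage points of RH here), while R2 is unitless, which is why the two are reported together.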

In [260]:
from sklearn.linear_model import LinearRegression
linreg=LinearRegression().fit(X_train,y_train)
print('R-squared score (training): {:.3f}' .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
y_pred=linreg.predict(X_test)                   
rmse=np.sqrt(mean_squared_error(y_test,y_pred))      
print('Baseline RMSE of model:',rmse)

R-squared score (training): 0.882
R-squared score (test): 0.880
Baseline RMSE of model: 5.9467598184347645


In [261]:
from sklearn.neighbors import KNeighborsRegressor
knnreg = KNeighborsRegressor(n_neighbors =5).fit(X_train, y_train)
print('R-squared score (training): {:.3f}' .format(knnreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(knnreg.score(X_test, y_test)))
y_pred=knnreg.predict(X_test)                   
rmse=np.sqrt(mean_squared_error(y_test,y_pred))      
print('Baseline RMSE of model:',rmse)

R-squared score (training): 0.935
R-squared score (test): 0.893
Baseline RMSE of model: 5.6029112723284396


In [262]:
from sklearn.linear_model import Ridge
linridge = Ridge(alpha=20.0).fit(X_train, y_train)
print('R-squared score (training): {:.3f}' .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linridge.score(X_test, y_test)))
y_pred=linridge.predict(X_test)                   
rmse=np.sqrt(mean_squared_error(y_test,y_pred))      
print('Baseline RMSE of model:',rmse)


R-squared score (training): 0.882
R-squared score (test): 0.880
Baseline RMSE of model: 5.94689278814495


In [266]:
from sklearn.tree import DecisionTreeRegressor
decreg=DecisionTreeRegressor().fit(X_train,y_train)
print('R-squared score (training): {:.3f}' .format(decreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(decreg.score(X_test, y_test)))
y_pred=decreg.predict(X_test)                   
rmse=np.sqrt(mean_squared_error(y_test,y_pred))      
print('Baseline RMSE of model',rmse)

R-squared score (training): 1.000
R-squared score (test): 0.994
Baseline RMSE of model 1.3095329145372698


#### Of the models tested, the Decision Tree has the best RMSE, so we try to improve it further by varying the maximum depth.

In [264]:
for i in range(4,20):
    decreg=DecisionTreeRegressor(max_depth=i).fit(X_train,y_train)
    y_pred=decreg.predict(X_test)                   
    rmse=np.sqrt(mean_squared_error(y_test,y_pred))      
    print(' RMSE of model for max depth {:.1f}:'.format(i),rmse)

 RMSE of model for max depth 4.0: 8.522432650537851
 RMSE of model for max depth 5.0: 6.751700640079649
 RMSE of model for max depth 6.0: 5.210192615541617
 RMSE of model for max depth 7.0: 3.8019998348748754
 RMSE of model for max depth 8.0: 2.798474524644858
 RMSE of model for max depth 9.0: 2.123719561975873
 RMSE of model for max depth 10.0: 1.696689712600099
 RMSE of model for max depth 11.0: 1.4583185085271357
 RMSE of model for max depth 12.0: 1.378332557407365
 RMSE of model for max depth 13.0: 1.3289465926892938
 RMSE of model for max depth 14.0: 1.30027230220531
 RMSE of model for max depth 15.0: 1.326261614615174
 RMSE of model for max depth 16.0: 1.3228279033322856
 RMSE of model for max depth 17.0: 1.3071350696194215
 RMSE of model for max depth 18.0: 1.310557426735483
 RMSE of model for max depth 19.0: 1.2835782543610346


The lowest RMSE values are obtained for maximum depths 19 (1.284) and 14 (1.300). Both are very close to the RMSE of the baseline tree with no depth limit (1.310), so we keep the baseline model to predict RH.
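
The loop above compares depths using the test set, which means the test rows take part in choosing the model. A possible alternative, sketched here on synthetic data (`X_demo`, `y_demo` and the fixed `random_state` are assumptions for illustration), is to pick `max_depth` by cross-validation on the training rows and touch the test set only once at the end.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the scaled features and RH target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 5-fold cross-validation over the same depth range as the loop (4 to 19).
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(4, 20))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign
test_rmse = np.sqrt(np.mean((y_te - search.predict(X_te)) ** 2))

print("Best max depth:", best_depth)
print("Cross-validated RMSE:", cv_rmse)
print("Test RMSE:", test_rmse)
```

Fixing `random_state` on the tree also makes repeated runs comparable, which helps when the differences between depths are as small as those listed above.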

## Conclusion
#### Among the models tested, the Decision Tree regressor gave the lowest RMSE (about 1.31).
#### Hence the Decision Tree regressor is the best of the techniques tried for predicting Relative Humidity (RH) from the remaining features.
Future work: I would like to try other techniques and apply feature engineering to improve the RMSE.