# Air Quality Index Prediction

#### Data Dictionary

    State  : State code where the recording was made
    Province : County code where the recording was made (the Province column in the data, e.g. COT102)
    City   : City code of the city where the recording was made
    Date   : Date when the recording was made
    Average_Value : Average concentration of the given pollutant on that day
    Max_Value_of_the_Day : Maximum concentration of the pollutant recorded on that day
    Hour_of_Max_Value : Hour of the day (0-23) at which the maximum concentration was recorded
    Pollutant_AQI : Air quality index of the pollutant
    Pollutant_Type : The type of the given pollutant

#### Import the required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor


#### Reading the training and test set

In [4]:
air_data = pd.read_csv("Hackathon_AQI_train1.0.csv")

In [5]:
test_air_data = pd.read_csv("Hackathon_AQI_test1.0.csv")

# Keep test_air_data (with Id) for the submission file; drop() returns a new frame without Id
test_air_data1 = test_air_data.drop(['Id'], axis=1)

#### Description and Information on the data

In [6]:
air_data.head()

|  | State | Province | City | Date | Average_Value | Max_Value_of_the_Day | Hour_of_Max_Value | Pollutant_AQI | Pollutant_Type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ST00 | COT102 | CT126 | 2011-01-29 | 34.043478 | 38.04 | 22 | 37 | A |
| 1 | ST00 | COT102 | CT126 | 2011-02-10 | 25.217391 | 35.87 | 7 | 36 | A |
| 2 | ST00 | COT102 | CT126 | 2011-06-03 | 13.5 | 37.7 | 22 | 39 | A |
| 3 | ST00 | COT102 | CT126 | 2011-09-18 | 10.695652 | 22.32 | 19 | 22 | A |
| 4 | ST00 | COT102 | CT126 | 2012-02-08 | 39.25 | 53.42 | 8 | 54 | A |


In [7]:
air_data.isnull().sum()

State                   0
Province                0
City                    0
Date                    0
Average_Value           0
Max_Value_of_the_Day    0
Hour_of_Max_Value       0
Pollutant_AQI           0
Pollutant_Type          0
dtype: int64

In [8]:
air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17604 entries, 0 to 17603
Data columns (total 9 columns):
State                   17604 non-null object
Province                17604 non-null object
City                    17604 non-null object
Date                    17604 non-null object
Average_Value           17604 non-null float64
Max_Value_of_the_Day    17604 non-null float64
Hour_of_Max_Value       17604 non-null int64
Pollutant_AQI           17604 non-null int64
Pollutant_Type          17604 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 894.0+ KB


In [9]:
air_data.describe()

|  | Average_Value | Max_Value_of_the_Day | Hour_of_Max_Value | Pollutant_AQI |
|---|---|---|---|---|
| count | 17604.0 | 17604.0 | 17604.0 | 17604.0 |
| mean | 3.845645 | 7.531067 | 9.273461 | 18.399739 |
| std | 7.275965 | 13.250808 | 7.099505 | 18.590594 |
| min | -1.379167 | -1.16 | 0.0 | 0.0 |
| 25% | 0.03825 | 0.05 | 3.0 | 3.0 |
| 50% | 0.4125 | 0.75 | 9.0 | 12.0 |
| 75% | 3.880208 | 8.75 | 13.0 | 30.0 |
| max | 67.086957 | 145.32 | 23.0 | 192.0 |


In [10]:
air_data.describe(include=['O'])

|  | State | Province | City | Date | Pollutant_Type |
|---|---|---|---|---|---|
| count | 17604 | 17604 | 17604 | 17604 | 17604 |
| unique | 47 | 131 | 142 | 5589 | 4 |
| top | ST42 | COT23 | CT03 | 2008-09-11 | D |
| freq | 5967 | 971 | 1293 | 13 | 4401 |


#### Making note of the categorical, numerical predictors and the target variable

In [11]:
air_copy = air_data.copy()

categorical_predictors = ['State','Province','City','Date','Pollutant_Type']
numerical_predictors = ['Average_Value','Max_Value_of_the_Day','Hour_of_Max_Value']

target = air_copy.Pollutant_AQI

air_copy = air_copy.drop(['Pollutant_AQI'],axis=1)

#### Splitting the data into train and test sets

In [12]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(air_copy,target, random_state=0)
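
`train_test_split` is called here without `test_size`, so scikit-learn's default of holding out 25% of the rows applies. A minimal sketch on a synthetic frame shows the resulting shapes. The `demo_*` names and the 100-row size are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for air_copy / target; not the hackathon data.
demo_X = pd.DataFrame({
    "Average_Value": np.arange(100, dtype=float),
    "Hour_of_Max_Value": np.arange(100) % 24,
})
demo_y = pd.Series(np.arange(100), name="Pollutant_AQI")

# No test_size given: 25% of the rows go to the test split by default.
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, random_state=0)

split_shapes = (demo_train_X.shape, demo_test_X.shape)
print(split_shapes)
```

The features and the target are shuffled together, so each row of `demo_train_X` keeps the same index as its label in `demo_train_y`.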

#### Feature engineering the Date column in both train and test sets for analysis

In [13]:
# train_X comes out of train_test_split as a slice of air_copy;
# taking an explicit copy avoids pandas' SettingWithCopyWarning.
train_X = train_X.copy()
date1 = pd.to_datetime(train_X['Date'])

train_X['Year'] = date1.dt.year
train_X['Month'] = date1.dt.month
train_X['Day'] = date1.dt.day
train_X = train_X.drop(['Date'], axis=1)


In [14]:
# Same treatment for the held-out split: copy first, then add the date parts.
test_X = test_X.copy()
date2 = pd.to_datetime(test_X['Date'])

test_X['Year'] = date2.dt.year
test_X['Month'] = date2.dt.month
test_X['Day'] = date2.dt.day
test_X = test_X.drop(['Date'], axis=1)


#### Feature engineering the Date column of the test data given for prediction

In [15]:
date3 = pd.to_datetime(test_air_data1['Date'])

test_air_data1['Year'] = date3.dt.year
test_air_data1['Month'] =  date3.dt.month
test_air_data1['Day'] =  date3.dt.day
test_air_data1=test_air_data1.drop(['Date'],axis=1)
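
The same four steps are repeated for three frames, so they could be wrapped in one helper. The sketch below is one way to do that. The function name `add_date_parts` and the two-row demo frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def add_date_parts(frame, column="Date"):
    """Return a copy of frame with Year/Month/Day columns replacing `column`."""
    out = frame.copy()  # work on a copy so the caller's frame is untouched
    parsed = pd.to_datetime(out[column])
    out["Year"] = parsed.dt.year
    out["Month"] = parsed.dt.month
    out["Day"] = parsed.dt.day
    return out.drop(columns=[column])

# Two made-up rows in the same shape as the notebook's data.
demo = pd.DataFrame({
    "Date": ["2011-01-29", "2012-02-08"],
    "Average_Value": [34.043478, 39.25],
})
demo_parts = add_date_parts(demo)
print(demo_parts)
```

Because the helper copies its input, it can be applied to a split produced by `train_test_split` without modifying the frame it was sliced from.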

#### Ordinal encoding the categorical variables

In [16]:
for variable in train_X.columns: # Loop through all columns in the dataframe
    if train_X[variable].dtype == 'object': # Only apply for columns with categorical strings
        train_X[variable] = pd.Categorical(train_X[variable]).codes # Replace strings with an integer
train_X.head()

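|  | State | Province | City | Average_Value | Max_Value_of_the_Day | Hour_of_Max_Value | Pollutant_Type | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|
| 9076 | 7 | 80 | 97 | 7.916667 | 13.71 | 12 | 2 | 2008 | 12 | 26 |
| 245 | 8 | 18 | 122 | 9.041667 | 14.8 | 1 | 0 | 2008 | 2 | 16 |
| 13556 | 8 | 20 | 9 | 0.229167 | 0.28 | 7 | 3 | 2015 | 5 | 18 |
| 1309 | 14 | 126 | 90 | 9.958333 | 20.72 | 1 | 0 | 2012 | 3 | 24 |
| 7602 | 42 | 55 | 133 | 0.014708 | 0.03 | 10 | 1 | 2004 | 2 | 1 |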


In [17]:
for variable in test_X.columns: # Loop through all columns in the dataframe
    if test_X[variable].dtype == 'object': # Only apply for columns with categorical strings
        test_X[variable] = pd.Categorical(test_X[variable]).codes # Replace strings with an integer
test_X.head()

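|  | State | Province | City | Average_Value | Max_Value_of_the_Day | Hour_of_Max_Value | Pollutant_Type | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|
| 16328 | 41 | 55 | 93 | 0.2125 | 0.29 | 7 | 3 | 2007 | 6 | 6 |
| 17387 | 42 | 38 | 44 | 0.341667 | 0.68 | 23 | 3 | 2014 | 5 | 25 |
| 7049 | 41 | 11 | 116 | 0.025917 | 0.04 | 12 | 1 | 2003 | 2 | 18 |
| 295 | 8 | 20 | 9 | 16.125 | 26.18 | 21 | 0 | 2002 | 7 | 18 |
| 14044 | 10 | 51 | 68 | 0.245833 | 0.28 | 0 | 3 | 2008 | 5 | 25 |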


#### Ordinal encoding the test data given for prediction

In [18]:
for variable in test_air_data1.columns: # Loop through all columns in the dataframe
    if test_air_data1[variable].dtype == 'object': # Only apply for columns with categorical strings
        test_air_data1[variable] = pd.Categorical(test_air_data1[variable]).codes # Replace strings with an integer
test_air_data1.head()

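|  | State | Province | City | Average_Value | Max_Value_of_the_Day | Hour_of_Max_Value | Pollutant_Type | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 12 | 38 | 42.409091 | 51.74 | 23 | 0 | 2011 | 1 | 11 |
| 1 | 0 | 12 | 38 | 12.0 | 30.48 | 5 | 0 | 2011 | 4 | 10 |
| 2 | 0 | 12 | 38 | 10.782609 | 27.54 | 6 | 0 | 2011 | 5 | 20 |
| 3 | 0 | 12 | 38 | 6.8 | 21.4 | 22 | 0 | 2011 | 5 | 29 |
| 4 | 0 | 12 | 38 | 16.333333 | 29.09 | 21 | 0 | 2011 | 6 | 28 |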
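
Each of the three loops above builds its codes from the values present in that frame alone. If a frame is missing some state, county or city, the same string can end up with a different integer in `train_X`, `test_X` and `test_air_data1`. One way to keep the mapping consistent is to take the category list from the training frame and reuse it everywhere. The sketch below assumes that approach. The helper names `fit_categories` and `encode` and the demo values are hypothetical.

```python
import pandas as pd

def fit_categories(train_frame):
    """Collect the sorted unique values of every non-numeric column in the training frame."""
    non_numeric = train_frame.select_dtypes(exclude="number").columns
    return {col: sorted(train_frame[col].unique()) for col in non_numeric}

def encode(frame, categories):
    """Return a copy of frame with the listed columns replaced by integer codes."""
    out = frame.copy()
    for col, cats in categories.items():
        # Values that never appeared in the training frame get the code -1.
        out[col] = pd.Categorical(out[col], categories=cats).codes
    return out

# Made-up frames: the test frame lacks ST00/ST08 and contains an unseen ST99.
demo_train = pd.DataFrame({"State": ["ST08", "ST42", "ST00"], "Average_Value": [1.0, 2.0, 3.0]})
demo_test = pd.DataFrame({"State": ["ST42", "ST99"], "Average_Value": [4.0, 5.0]})

categories = fit_categories(demo_train)
encoded_train = encode(demo_train, categories)
encoded_test = encode(demo_test, categories)

# For comparison: encoding the test frame on its own, as the loops above do.
independent_codes = pd.Categorical(demo_test["State"]).codes

print(encoded_train["State"].tolist())
print(encoded_test["State"].tolist())
print(independent_codes.tolist())
```

With the shared category list, `ST42` keeps the same code in both frames. Encoded independently, it receives a different code in the test frame, which a tree model would treat as a different state.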


### Using Decision tree regressor for prediction

In [19]:
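from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def getMae(max_leaf_nodes, train_X, test_X, train_Y, test_Y):
    # Fit a decision tree on the training split and return its MAE on the held-out split.
    # Note: max_leaf_nodes is accepted but not passed to the model,
    # so each call fits a default (unrestricted) tree.
    model = DecisionTreeRegressor()

    model.fit(train_X, train_Y)
    predicted_values = model.predict(test_X)
    mae = mean_absolute_error(test_Y, predicted_values)
    return mae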

#### Calculating the mean absolute error for different values of max leaf nodes

In [21]:
for leaf in [5,50,500]:
    mae = getMae(leaf,train_X,test_X, train_Y, test_Y)
    print("leaf : %d \t\t Mae : %d" %(leaf,mae))

leaf : 5 		 Mae : 1
leaf : 50 		 Mae : 1
leaf : 500 		 Mae : 1
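
All three runs print the same value for two reasons visible in the code: `getMae` never hands `max_leaf_nodes` to `DecisionTreeRegressor`, and `%d` truncates the MAE to an integer. The sketch below shows a variant that passes the parameter through and prints three decimals. It runs on synthetic data, not the hackathon set, and the name `get_mae_with_leaves` is made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae_with_leaves(max_leaf_nodes, train_X, test_X, train_Y, test_Y):
    """Fit a tree limited to max_leaf_nodes leaves and return its MAE on the test split."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_Y)
    return mean_absolute_error(test_Y, model.predict(test_X))

# Synthetic regression problem standing in for the AQI data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)
tr_X, te_X, tr_y, te_y = train_test_split(X, y, random_state=0)

leaf_maes = {}
for leaf in [5, 50, 500]:
    leaf_maes[leaf] = get_mae_with_leaves(leaf, tr_X, te_X, tr_y, te_y)
    # %.3f keeps the fractional part that %d would drop
    print("leaf : %d \t\t Mae : %.3f" % (leaf, leaf_maes[leaf]))
```

With the parameter wired in, the three settings give different errors, which makes the comparison informative.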


#### Dropping the Province variable, which has too many unique values

In [22]:
train_X2 = train_X.drop(['Province'],axis=1)
test_X2 = test_X.drop(['Province'],axis=1)
test_air_data2 = test_air_data1.drop(['Province'],axis=1)

#### Applying Random Forest regressor for predicting the values

In [23]:
model2 = RandomForestRegressor()
model2.fit(train_X2,train_Y)
pred = model2.predict(test_X2)
print(mean_absolute_error(test_Y,pred))

0.903726425812
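
`RandomForestRegressor()` is created above without a `random_state`, so the MAE can shift slightly from run to run. The sketch below, on synthetic data rather than the hackathon set, shows that fixing the seed makes two fits agree. The seed value `0` and `n_estimators=20` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data; not the notebook's features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

# Two forests built with the same seed draw the same bootstrap samples and splits.
forest_a = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
forest_b = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

same_predictions = np.allclose(forest_a.predict(X), forest_b.predict(X))
print(same_predictions)
```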


#### Calculating the predicted values for the given test set

In [24]:
predicted_AQI = model2.predict(test_air_data2)

#### Using XGBoost to test the model performance

In [25]:
from xgboost import XGBRegressor

# apply() returns a new frame, so the result has to be assigned back
train_X = train_X.apply(pd.to_numeric)
test_X = test_X.apply(pd.to_numeric)
my_model = XGBRegressor(n_estimators=800, learning_rate=0.05)
# Stop adding trees once the score on eval_set has not improved for 5 rounds
my_model.fit(train_X, train_Y, early_stopping_rounds=5, eval_set=[(test_X, test_Y)], verbose=False)

# make predictions
predictions = my_model.predict(test_X)
print("Mean Absolute Error : " + str(mean_absolute_error(test_Y, predictions)))




Mean Absolute Error : 1.2860398518


### Of the three models, the Random Forest Regressor performs best (MAE of about 0.90 on the held-out split)

In [16]:
#predictions = my_model.predict(test_air_data1)

# Pair each Id from the original test file with the Random Forest prediction
my_submission = pd.DataFrame({'Id': test_air_data.Id, 'Pollutant_AQI': predicted_AQI.astype(int)})
# Any filename works; submission.csv is used here
my_submission.to_csv('submission.csv', index=False)
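
`astype(int)` drops the fractional part rather than rounding, so a prediction of 36.9 is written out as 36. If the submission is meant to carry the nearest integer (an assumption, since the scoring rules are not stated here), applying `np.rint` before the cast is one option. The values below are made up for illustration.

```python
import numpy as np

demo_pred = np.array([36.9, 22.2, 53.6, 0.4])  # made-up predictions

truncated = demo_pred.astype(int)          # fractional part is discarded
rounded = np.rint(demo_pred).astype(int)   # nearest integer, then cast

print(truncated)
print(rounded)
```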