<img src="https://mk0eeborgicuypctuf7e.kinstacdn.com/wp-content/uploads/2017/02/Industrial-Emissions-10-web-1024x680.jpg" width="800">
    
# Introduction 🦾

This notebook predicts emission values for three key pollutants - carbon monoxide ($CO$), benzene ($C_6H_6$) and nitrogen oxides ($NO_X$) - using sensor readouts together with the date and time of measurement, relative and absolute humidity, and temperature. We will engineer a few features that encode temporal components, conduct cross-validation (CV) and fit a Gradient Boosting Regression (GBR) model. This dataset is part of the Tabular Playground Series - July 2021 competition.

It came to my attention that most top entries in this competition exploit some leaked test data using pseudo-labeling, thereby cutting down error by a substantial margin. I am not particularly fond of leveraging leaked information and as such **no external data is used in this analysis**.

We will go over feature engineering, CV and model building using the optimal CV hyperparameter values. To start things off we will load some important utilities, set random seed and a bunch of other constants and load the CSV files.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RepeatedKFold, GridSearchCV

# Path to the files, set seed
PATH = '../input/tabular-playground-series-jul-2021/'
SEED = 999
N_FOLDS = 3
N_REPEATS = 5
TARGET_VARS = ['target_carbon_monoxide',
               'target_benzene',
               'target_nitrogen_oxides']

np.random.seed(SEED)

# Load CSV files
train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'test.csv')
subm = pd.read_csv(PATH + 'sample_submission.csv', index_col='date_time')

# Feature engineering 🔨

The training predictor set is of size 7111 x 9, whereas the target set is 7111 x 3. The test predictor set is of size 2247 x 9. Given the relative paucity of predictors in the dataset we will make the best of it by investigating ways of engineering the underlying predictors. In the present analysis I consider the following:

* Capture periodicity over month and hour using $\sin$ and $\cos$ encodings. If we had picked a categorical encoding instead, the model would be oblivious to the fact that January and December are consecutive months. The same holds for hour of the day. Weekdays are not encoded this way: with only seven values, this particular encoding might degrade model performance.

* Identify weekends using a single binary feature. One might expect weekends to associate with lower emissions.
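To make the periodicity argument concrete, here is a minimal sketch of the $\sin$/$\cos$ idea on months alone. The helper names `cyclical_encode` and `dist` are hypothetical and exist only for this illustration; they are not part of the notebook's pipeline.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

months = np.arange(1, 13)          # 1 = January, ..., 12 = December
enc = cyclical_encode(months, 12)

def dist(a, b):
    # Euclidean distance between two encoded months (1-based month numbers)
    return np.linalg.norm(enc[a - 1] - enc[b - 1])

print(dist(12, 1))  # December vs January
print(dist(1, 2))   # January vs February: the same gap as above
print(dist(1, 7))   # January vs July: opposite sides of the circle
```

In the encoded space December sits as close to January as January does to February, which is the property an integer or categorical month would not provide.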

To extract these features we will process both train and test data using a custom utility defined below, `extract_dt_feats`.

In [None]:
def sin_cos_encoding(df, dt, feat_name, max_val):
    # Encode variable using sin and cos
    df['sin_' + feat_name] = np.sin(2 * np.pi * (dt/max_val))
    df['cos_' + feat_name] = np.cos(2 * np.pi * (dt/max_val))
    return None

def extract_dt_feats(df):
    # Extract month and hour
    date_enc = pd.to_datetime(df.date_time)
    month = date_enc.dt.month
    hour = date_enc.dt.hour
    # Add features, compute and add is_weekend
    sin_cos_encoding(df, month, 'month', 12)
    sin_cos_encoding(df, hour, 'hour', 24)  # hours run 0-23, so the period is 24
    df['is_weekend'] = date_enc.dt.day_name().isin(['Saturday', 'Sunday'])*1
    return df

# Expand features from train and test
x_train = extract_dt_feats(train.copy())
x_test = extract_dt_feats(test.copy())

# Visualize relationship between is_weekend and targets
sns.pairplot(x_train, hue='is_weekend', vars=TARGET_VARS, corner=True,
            plot_kws={'alpha':.1})

With this simple feature engineering step we increased the number of available features from 9 to 14, although `date_time` will not be used. The scatterplots above, which depict the bivariate distributions of the three seemingly intercorrelated target variables, suggest that weekends are indeed associated with lower emissions. The three target variables also appear to be right-skewed, with long tails of high values, and might therefore be better modeled after a log-transformation.

We may also note just over a hundred unusual measurements with high $CO$ and $NO_X$ but comparatively low $C_6H_6$. These look like outliers, but excluding them hurt model performance in a previous iteration.

Overall, I propose log-transforming the three target variables from `x_train`, moving them to a separate dataframe, `y_train`, and finally dropping `date_time` from both `x_train` and `x_test`.
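As a rough illustration of why the log-transformation helps, the sketch below measures skewness before and after `log(y + 1)`. The data are a synthetic long-tailed sample drawn for this example only, not the competition targets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in with a long tail of large values (assumption, not real data)
y = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=5000))

skew_before = y.skew()
skew_after = np.log1p(y).skew()  # log1p(y) computes log(y + 1)

print('skewness before:', skew_before)
print('skewness after :', skew_after)
```

A skewness closer to zero after the transformation indicates a more symmetric target distribution.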

In [None]:
# Log-transform target vars
x_train[TARGET_VARS] = np.log1p(x_train[TARGET_VARS])  # log(y + 1)

# Plot again, in log-scale
sns.pairplot(x_train, hue='is_weekend', vars=TARGET_VARS, corner=True,
            plot_kws={'alpha':.1})

# Split train X and Y, drop date_time from train and test
y_train = pd.concat([x_train.pop(target) for target in TARGET_VARS], axis=1)
x_train.drop(columns='date_time', inplace=True)
x_test.drop(columns='date_time', inplace=True)

Upon log-transformation, the correlation among the three emission targets is more evident. There are [interesting frameworks](https://www.datacamp.com/community/tutorials/gflasso-R) that induce parameter sharing for genuine multitask regression, and take advantage of such covariance among target variables. For simplicity we will not use multitask regression but rather model each response separately, using `MultiOutputRegressor`.
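To show what modeling each response separately means in practice, here is a small sketch on synthetic data. The arrays `X` and `Y` are made up for this example, and the settings are chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                # synthetic predictors
Y = np.column_stack([X[:, 0], X[:, 1], X[:, 0] + X[:, 1]])   # three toy targets

multi = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=50, random_state=0))
multi.fit(X, Y)

print(len(multi.estimators_))       # one fitted regressor per target column
print(multi.predict(X[:5]).shape)   # predictions keep the (n_samples, n_targets) layout
```

Each entry of `estimators_` is an independent clone of the base regressor, so no information is shared across targets.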

# Repeated *k*-Fold Cross-Validation ⏳

In order to select appropriate hyperparameters for the GBR model, 5x repeated three-fold cross-validation (CV) will be employed first. We will experiment with different values of `learning_rate`, `max_depth` and `subsample`.
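As a quick sanity check on the cost of this search, the sketch below counts the splits produced by repeated *k*-fold and the number of hyperparameter combinations, using the grid values searched in this notebook. The array `X_demo` is a dummy placeholder with 12 rows.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, ParameterGrid

# 3 folds repeated 5 times, as in this notebook
rkf = RepeatedKFold(n_splits=3, n_repeats=5, random_state=0)
X_demo = np.zeros((12, 2))            # dummy data, only the row count matters
splits = list(rkf.split(X_demo))

grid = ParameterGrid({'learning_rate': [.01, .05, .1],
                      'max_depth': [3, 5, 10],
                      'subsample': [.5, .75, 1.]})

print(len(splits))              # 15 train/validation splits
print(len(grid))                # 27 hyperparameter combinations
print(len(splits) * len(grid))  # 405 candidate fits, before the final refit
```

Every candidate fit trains one GBR per target, which is worth keeping in mind when extending the grid.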

In [None]:
%%time
# Define hyperparameter and CV parameter values
pars = {'estimator__learning_rate': [.01, .05, .1],
        'estimator__max_depth': [3, 5, 10],
        'estimator__subsample': [.5, .75, 1.],
        'estimator__n_estimators': [500]}
cv_pars = RepeatedKFold(n_splits=N_FOLDS, n_repeats=N_REPEATS, random_state=SEED)

# Build and initialize CV
cv_model = MultiOutputRegressor(GradientBoostingRegressor())
crossval = GridSearchCV(cv_model, pars, scoring='neg_mean_squared_error', cv=cv_pars, n_jobs=-1)
crossval.fit(x_train, y_train)

# Visualize CV error
error = np.vstack([crossval.cv_results_['split{}_test_score'.format(str(i))] for i in range(N_FOLDS*N_REPEATS)])
plt.figure(figsize=(16, 4))
plt.boxplot(error)  # one box per hyperparameter combination
plt.ylabel('neg_MSE')

# Fit model 🧠

Next, we take the optimal hyperparameters to fit the GBR model - in fact three models, as explained above - to the entire training set. These are contained in `crossval.best_params_`.
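The `estimator__` prefix in these parameter names routes each value to the regressor wrapped inside `MultiOutputRegressor`. A minimal sketch, assuming a hypothetical dictionary shaped like `best_params_` (the values are illustrative, not the ones found by CV):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Hypothetical best_params_-style dictionary, for illustration only
example_pars = {'estimator__learning_rate': 0.05,
                'estimator__max_depth': 5}

wrapped = MultiOutputRegressor(GradientBoostingRegressor())
wrapped.set_params(**example_pars)   # the prefix targets the inner regressor

print(wrapped.estimator.learning_rate)
print(wrapped.estimator.max_depth)
```

Since `GridSearchCV` refits on the full training set by default (`refit=True`), `crossval.best_estimator_` should serve the same purpose as rebuilding the model by hand.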

In [None]:
# Final model using optimal cross-validation parameters
opt_pars = crossval.best_params_

model = MultiOutputRegressor(GradientBoostingRegressor(learning_rate=opt_pars['estimator__learning_rate'],
                                                       max_depth=opt_pars['estimator__max_depth'],
                                                       subsample=opt_pars['estimator__subsample'],
                                                       n_estimators=opt_pars['estimator__n_estimators']))
model.fit(x_train, y_train)

print('The optimal hyperparameter values are:\n', opt_pars)

# Test set predictions ✍️

Finally, we predict on the test set and write our predictions for all three emission targets. Because the model was trained on log-transformed targets, its predictions are on the log scale; to revert the transformation we simply take $e^{\hat{y}} - 1$.
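A tiny round-trip check of this inverse, on made-up values, may be reassuring:

```python
import numpy as np

y = np.array([0.0, 0.5, 2.0, 10.0])   # toy values in original units
y_log = np.log1p(y)                   # forward transform: log(y + 1)
y_back = np.expm1(y_log)              # inverse transform: exp(y_log) - 1

print(np.allclose(y, y_back))
```

Here `np.expm1(x)` computes $e^{x} - 1$, the exact inverse of `np.log1p`.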

In [None]:
# Get predictions
preds = model.predict(x_test)
# Recover original units
inv_preds = np.expm1(preds)  # exp(preds) - 1

# Write to submission table, export
subm.iloc[:, :] = inv_preds
subm.to_csv("submission.csv")

And with this we come to the end 😎 If you found this notebook informative and entertaining, please comment and upvote!