<h1 style='color:white; background:#663399; border:0'><center>TPS-July: EDA, Baseline Analysis (XGBRegressor)</center></h1>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/26480/logos/header.png?t=2021-04-09-00-57-05)

<a id="section-start"></a>

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. They are a good fit for people looking for something between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is based on a real dataset but has synthetically generated aspects. The original dataset deals with predicting air pollution in a city from various input sensor values (e.g., a time series).

In this competition you are predicting the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

The three target values to predict are: `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from xgboost import XGBRegressor

# for feature importance study
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
import shap

# Custom theme
plt.style.use('fivethirtyeight')

figure = {'dpi': '200'}
font = {'family': 'serif'}
grid = {'linestyle': ':', 'alpha': .9}
axes = {'titlecolor': 'black', 'titlesize': 20, 'titleweight': 'bold',
        'labelsize': 12, 'labelweight': 'bold'}

plt.rc('font', **font)
plt.rc('figure', **figure)
plt.rc('grid', **grid)
plt.rc('axes', **axes)

my_colors = ['#DC143C', '#FF1493', '#FF7F50', '#FFD700', '#32CD32', 
             '#4ddbff', '#1E90FF', '#663399', '#708090']

caption = "© maksymshkliarevskyi"

# Show our custom palette
sns.palplot(sns.color_palette(my_colors))
plt.title('Custom palette')
plt.text(6.9, 0.75, caption, size = 8)
plt.show()

First, let's load the data and take a look at basic statistics.

In [None]:
base_train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv', 
                         index_col = 0)
base_train.index = base_train.index.astype('datetime64[ns]')
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv', 
                    index_col = 0)
train.index = train.index.astype('datetime64[ns]')
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv', 
                   index_col = 0)
test.index = test.index.astype('datetime64[ns]')
ss = pd.read_csv('../input/tabular-playground-series-jul-2021/sample_submission.csv')

<h2 style='color:white; background:#663399; border:0'><center>EDA</center></h2>

[**Back to the start**](#section-start)

In [None]:
train.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
test.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
dtypes = train.dtypes.value_counts().reset_index()

plt.figure(figsize = (12, 1))
plt.title('Data types\n')
plt.barh(str(dtypes.iloc[0, 0]), dtypes.iloc[0, 1],
         label = str(dtypes.iloc[0, 0]), color = my_colors[4])
plt.legend(loc = 'upper center', ncol = 3, fontsize = 13,
           bbox_to_anchor = (0.5, 1.45), frameon = False)
plt.yticks([])  # hide the y-axis ticks
plt.text(10, -0.9, caption, size = 8)
plt.show()

We have 7111 training and 2247 test observations. All our data is in `float64` format (the pandas default for floating-point columns read with `read_csv`).

Before we continue, let's pull the targets into separate variables.

In [None]:
target_carbon_monoxide = train.target_carbon_monoxide
target_benzene = train.target_benzene
target_nitrogen_oxides = train.target_nitrogen_oxides

train.drop(['target_carbon_monoxide', 'target_benzene', 
            'target_nitrogen_oxides'], 
           axis = 1, inplace = True)

It's important to see if our data has missing values.

In [None]:
# Concatenate train and test datasets
all_data = pd.concat([train, test], axis = 0)

# columns with missing values
cols_with_na = all_data.isna().sum()[all_data.isna().sum() > 0].sort_values(ascending = False)
cols_with_na

As in the previous competitions, our data has no missing values. Now, let's look at the feature distributions.

In [None]:
fig = plt.figure(figsize = (15, 10))
fig.suptitle('Train data', size = 25, weight = 'bold')
for idx, i in enumerate(train.columns):
    # add_subplot needs integer grid dimensions; np.ceil returns a float
    fig.add_subplot(int(np.ceil(len(train.columns) / 4)), 4, idx+1)
    train.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(1000, -200, caption, size = 12)
plt.show()

In [None]:
targets = [target_carbon_monoxide, target_benzene, target_nitrogen_oxides]
target_name = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']

fig = plt.figure(figsize = (15, 10))
fig.suptitle('Train targets', size = 25, weight = 'bold')
for i in range(len(targets)):
    fig.add_subplot(3, 3, i+1)
    targets[i].hist(bins = 20)
    plt.title(target_name[i])
plt.text(900, -450, caption, size = 12)
plt.show()

In [None]:
fig = plt.figure(figsize = (15, 10))
fig.suptitle('Test data', size = 25, weight = 'bold')
for idx, i in enumerate(test.columns):
    # add_subplot needs integer grid dimensions; np.ceil returns a float
    fig.add_subplot(int(np.ceil(len(test.columns) / 4)), 4, idx+1)
    test.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(1000, -40, caption, size = 12)
plt.show()

We should also look at the correlation between features.

In [None]:
corr = base_train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (13, 10))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5, annot = True)
plt.text(11, 14.5, caption, size = 12)
plt.show()

The data show a wide range of correlations, from strongly negative to strongly positive. The strongest linear relationship is between `target_benzene` and the second sensor.
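
To read the strongest pairs off a correlation matrix without scanning the heatmap, the lower triangle can be flattened and sorted by absolute value. This is a minimal sketch on synthetic data; the `top_correlations` helper and the column names in `demo` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep the strictly lower triangle so that each pair appears only once
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations are not pushed to the end
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data (hypothetical columns): sensor_b is almost exactly -sensor_a
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    'sensor_a': x,
    'sensor_b': -x + rng.normal(scale=0.1, size=500),
    'noise': rng.normal(size=500),
})
top = top_correlations(demo, n=1)
print(top)
```

Sorting by absolute value matters here because a correlation close to -1 is as informative as one close to +1.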

In [None]:
print('Min date in the train data: ', min(base_train.index))
print('Max date in the train data: ', max(base_train.index))

Let's visualize our data.

In [None]:
sns.pairplot(base_train, corner = True)
plt.text(100, -250, caption, size = 14)
plt.show()

In [None]:
sns.pairplot(test, corner = True)
plt.text(100, -100, caption, size = 14)
plt.show()

The relationships between sensors, as well as between sensors and targets, look mostly linear; in some cases they look logarithmic.
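
One rough way to check whether a pairwise relationship is closer to logarithmic than linear is to compare the Pearson correlation before and after log-transforming the predictor. The sketch below uses synthetic data only; `sensor` and `target` are stand-in arrays, not columns from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strictly positive "sensor" with a logarithmic effect on the target
sensor = rng.uniform(1.0, 100.0, size=1000)
target = 3.0 * np.log(sensor) + rng.normal(scale=0.2, size=1000)

# Pearson correlation on the raw scale and on the log scale
r_raw = np.corrcoef(sensor, target)[0, 1]
r_log = np.corrcoef(np.log(sensor), target)[0, 1]

print(f'raw: {r_raw:.3f}, log: {r_log:.3f}')
```

A clearly higher correlation on the log scale suggests the logarithmic form fits better, although a scatter plot remains the more reliable check.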

In [None]:
train.plot(figsize = (11, 25), subplots = True, linewidth = 0.8, 
           color = my_colors)
plt.xlabel('')
plt.show()

In [None]:
base_train[target_name].plot(figsize = (10, 10), subplots = True, 
                             linewidth = 0.8, color = my_colors[3:6])
plt.xlabel('')
plt.show()

<h2 style='color:white; background:#663399; border:0'><center>Baseline</center></h2>

[**Back to the start**](#section-start)

Now let's build a simple XGBRegressor as a baseline.

In [None]:
# Create data sets for training (first 80%) and validation (last 20%);
# shuffle = False keeps the rows in chronological order
X_train, X_valid, y_train, y_valid = train_test_split(train, base_train[target_name], 
                                                      test_size = 0.2,
                                                      random_state = 0,
                                                      shuffle = False)
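
Because the rows are ordered in time, `shuffle = False` makes the validation set the final 20% of the series, so the model is always validated on dates later than anything it was trained on. A minimal sketch on a synthetic daily index (the `demo_X` and `demo_y` objects are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic time-indexed data: 100 consecutive days
idx = pd.date_range('2021-01-01', periods=100, freq='D')
demo_X = pd.DataFrame({'x': np.arange(100)}, index=idx)
demo_y = pd.Series(np.arange(100), index=idx)

# Without shuffling, the split is a simple cut at the 80% mark
X_tr, X_va, y_tr, y_va = train_test_split(demo_X, demo_y,
                                          test_size=0.2, shuffle=False)

print(X_tr.index.max(), '<', X_va.index.min())
```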

In [None]:
# The basic model
params = {'n_estimators': 400,
          'subsample': 0.8,
          'max_depth': 8,
          'learning_rate': 0.05,
          'n_jobs': -1,
          'colsample_bytree': 0.8,
          'reg_alpha': 0.1,
          'reg_lambda': 0.1,
          'random_state': 0}

model1 = XGBRegressor(**params).fit(X_train, y_train.iloc[:, 0])
model2 = XGBRegressor(**params).fit(X_train, y_train.iloc[:, 1])
model3 = XGBRegressor(**params).fit(X_train, y_train.iloc[:, 2])

Let's check the results.

In [None]:
y_pred1 = model1.predict(X_valid)
print('RMSLE ({}): {}'.format(target_name[0], round(np.sqrt(mean_squared_log_error(y_valid.iloc[:, 0], y_pred1)), 4)))
y_pred2 = model2.predict(X_valid)
print('RMSLE ({}): {}'.format(target_name[1], round(np.sqrt(mean_squared_log_error(y_valid.iloc[:, 1], y_pred2)), 4)))
y_pred3 = model3.predict(X_valid)
print('RMSLE ({}): {}'.format(target_name[2], round(np.sqrt(mean_squared_log_error(y_valid.iloc[:, 2], y_pred3)), 4)))
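
For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`. The sketch below writes it out in NumPy. The `rmsle` helper is hypothetical, and clipping negative predictions to zero is an assumption added here, since tree models can produce small negative values and, depending on the scikit-learn version, `mean_squared_log_error` raises a `ValueError` for them.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error with negative predictions clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: treat negative predictions as 0 so that log1p stays well defined
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Perfect predictions give an error of zero
print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Because the error is measured on the log scale, it penalizes relative rather than absolute differences.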

In [None]:
date = pd.to_datetime(X_valid.reset_index().date_time).apply(lambda x: x.strftime('%Y/%m/%d'))

valid_preds = pd.DataFrame({'date': date,
                            'target_carbon_monoxide': y_valid.iloc[:, 0].values,
                            'target_benzene': y_valid.iloc[:, 1].values,
                            'target_nitrogen_oxides': y_valid.iloc[:, 2].values,
                            'preds_carbon_monoxide': y_pred1,
                            'preds_benzene': y_pred2,
                            'preds_nitrogen_oxides': y_pred3})
valid_preds = valid_preds.groupby('date').mean()
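
Daily means can also be computed directly from a `DatetimeIndex` with `resample`, which avoids formatting the dates as strings first. This is an alternative sketch on synthetic hourly values, not the approach used in the cell above:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series covering exactly two days
idx = pd.date_range('2021-01-01', periods=48, freq='60min')
hourly = pd.Series(np.arange(48, dtype=float), index=idx)

# One row per calendar day, holding the mean of that day's 24 values
daily = hourly.resample('D').mean()
print(daily)
```

The result keeps a real datetime index, so matplotlib can place and label the ticks as dates.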

In [None]:
plt.figure(figsize = (15, 5))
valid_preds['target_carbon_monoxide'].plot(color = my_colors[6])
valid_preds['preds_carbon_monoxide'].plot(color = my_colors[2])
plt.legend()
plt.xlabel('')
plt.text(52, -0.7, caption, size = 14)
plt.show()

In [None]:
plt.figure(figsize = (15, 5))
valid_preds['target_benzene'].plot(color = my_colors[6])
valid_preds['preds_benzene'].plot(color = my_colors[2])
plt.legend()
plt.xlabel('')
plt.text(52, -6, caption, size = 14)
plt.show()

In [None]:
plt.figure(figsize = (15, 5))
valid_preds['target_nitrogen_oxides'].plot(color = my_colors[6])
valid_preds['preds_nitrogen_oxides'].plot(color = my_colors[2])
plt.legend()
plt.xlabel('')
plt.text(52, -110, caption, size = 14)
plt.show()

Our model predicts `target_benzene` almost perfectly; the other two targets are predicted slightly worse.

Now let's look at the permutation importance of the features.

In [None]:
print('Permutation importance for Model#1 ({})'.format(target_name[0]))
pi1 = PermutationImportance(model1, random_state = 0).fit(X_valid, y_valid.iloc[:, 0])
eli5.show_weights(pi1, feature_names = X_valid.columns.tolist())

For the prediction of `target_carbon_monoxide` the second sensor is the most important.

In [None]:
print('Permutation importance for Model#2 ({})'.format(target_name[1]))
pi2 = PermutationImportance(model2, random_state = 0).fit(X_valid, y_valid.iloc[:, 1])
eli5.show_weights(pi2, feature_names = X_valid.columns.tolist())

`target_benzene`, it seems, can be predicted almost entirely from the second sensor.

In [None]:
print('Permutation importance for Model#3 ({})'.format(target_name[2]))
pi3 = PermutationImportance(model3, random_state = 0).fit(X_valid, y_valid.iloc[:, 2])
eli5.show_weights(pi3, feature_names = X_valid.columns.tolist())

For predicting `target_nitrogen_oxides`, by contrast, all but two of the sensors are quite important.
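
The idea behind permutation importance is simple: shuffle one column of the validation data, which breaks its link to the target, and measure how much the model's error grows. Below is a minimal NumPy sketch of that idea on synthetic data with a least-squares model. The `permutation_importance_mse` helper is hypothetical and scores by increase in MSE, whereas eli5's `PermutationImportance` uses the estimator's own score by default.

```python
import numpy as np

def permutation_importance_mse(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j destroys its relationship with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Ordinary least-squares fit stands in for the trained model
w, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = permutation_importance_mse(lambda A: A @ w, X, y)
print(imp.round(3))
```

A feature the model ignores gets an importance near zero, because shuffling it barely changes the predictions.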

In [None]:
explainer = shap.TreeExplainer(model1)
shap_values = explainer.shap_values(X_valid)

plt.title('Model #1 ({})'.format(target_name[0]))
shap.summary_plot(shap_values, X_valid)

In [None]:
explainer = shap.TreeExplainer(model2)
shap_values = explainer.shap_values(X_valid)

plt.title('Model #2 ({})'.format(target_name[1]))
shap.summary_plot(shap_values, X_valid)

In [None]:
explainer = shap.TreeExplainer(model3)
shap_values = explainer.shap_values(X_valid)

plt.title('Model #3 ({})'.format(target_name[2]))
shap.summary_plot(shap_values, X_valid)

The SHAP values support the same conclusions about feature importance. Large values of sensor 2 lead to large predicted values, which is consistent with the high positive correlation.
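
SHAP values are additive: for each row, the base value (the average prediction) plus the per-feature contributions equals the model's prediction. The sketch below illustrates that property with the closed form for a linear model with independent features, where a feature's contribution is its weight times its deviation from the mean. This is not what `shap.TreeExplainer` computes for the XGBoost models above; the weights and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Hypothetical linear model: positive weight, negative weight, unused feature
w = np.array([2.0, -1.0, 0.0])
b = 0.5
pred = X @ w + b

# Base value: the average prediction over the background data
base_value = pred.mean()

# Contribution of each feature: weight times deviation from the feature mean
phi = w * (X - X.mean(axis=0))

# Additivity: base value plus contributions reproduces every prediction
print(np.allclose(base_value + phi.sum(axis=1), pred))
```

With a positive weight, above-average feature values get positive contributions, which is the same pattern the summary plots show for sensor 2.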

In [None]:
# Train model on all the data
params = {'n_estimators': 400,
          'subsample': 0.8,
          'max_depth': 8,
          'learning_rate': 0.05,
          'n_jobs': -1,
          'colsample_bytree': 0.8,
          'reg_alpha': 0.1,
          'reg_lambda': 0.1,
          'random_state': 0}

model1 = XGBRegressor(**params).fit(train, targets[0])
model2 = XGBRegressor(**params).fit(train, targets[1])
model3 = XGBRegressor(**params).fit(train, targets[2])

In [None]:
ss['target_carbon_monoxide'] = model1.predict(test)
ss['target_benzene'] = model2.predict(test)
ss['target_nitrogen_oxides'] = model3.predict(test)

ss.head()

In [None]:
test_date = pd.to_datetime(test.reset_index().date_time).apply(lambda x: x.strftime('%Y/%m/%d'))

test_preds = pd.DataFrame({'date': test_date,
                            'test_carbon_monoxide': ss['target_carbon_monoxide'],
                            'test_benzene': ss['target_benzene'],
                            'test_nitrogen_oxides': ss['target_nitrogen_oxides']})
test_preds = test_preds.groupby('date').mean()

In [None]:
test_preds.plot(color = my_colors[3:6], subplots = True, figsize = (15, 10))
plt.xlabel('')
plt.text(80, -120, caption, size = 14)
plt.show()

In [None]:
ss.to_csv('submission.csv', index = False)

<h2 style='color:white; background:#663399; border:0'><center>WORK IN PROGRESS...</center></h2>

[**Back to the start**](#section-start)