# TPS July: A time series EDA and data visualizations

## Introduction:

This is the monthly Tabular Playground Series competition; the problem for July is a time series problem.


**Objective**: In this competition we are predicting the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

The three **target** values to predict are: `target_carbon_monoxide`, `target_benzene`, and `target_nitrogen_oxides`

**Files Provided:** 

`train.csv` - the training data, including the weather data, sensor data, and values for the 3 targets

`test.csv` - the same format as train.csv, but without the target values; your task is to predict the value of each of these targets.

`sample_submission.csv` - a sample submission file in the correct format.

**Evaluation metric:** Submissions are evaluated using the mean column-wise root mean squared logarithmic error (RMSLE). The RMSLE for a single column is calculated as:

RMSLE = $\sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(\log(p_i + 1)-\log(a_i +1))^2} $


where:

 $n$ : is the total number of observations
 
 $p_i$: is your prediction
 
 $a_i$: is the actual value
 
 $\log(x)$: is the natural logarithm of $x$
 
The final score is the mean of the RMSLE over all columns, in this case, 3.
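
The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official scoring code: the function names `rmsle` and `mean_columnwise_rmsle` and the toy arrays are assumptions made for this example.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error for a single column."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) computes log(x + 1), matching the formula above
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

def mean_columnwise_rmsle(actual, predicted):
    """Average the per-column RMSLE over all target columns."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    per_column = [rmsle(actual[:, j], predicted[:, j]) for j in range(actual.shape[1])]
    return float(np.mean(per_column))

# Toy values (made up for illustration, not competition data)
y_true = np.array([[1.0, 5.0, 100.0], [2.0, 8.0, 150.0]])
y_pred = y_true.copy()
perfect_score = mean_columnwise_rmsle(y_true, y_pred)
print(perfect_score)
```

Because the error is taken on `log(1 + x)`, the metric penalizes relative rather than absolute differences, and predictions must not be smaller than -1.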


### Import libraries and modules

In [None]:
# not all of these modules are necessarily used below
import numpy as np
import pandas as pd 

from pandas.plotting import lag_plot
from pandas.plotting import autocorrelation_plot

from scipy.stats import gaussian_kde  # the scipy.stats.kde namespace is deprecated
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kurtosis, skew

from plotly.subplots import make_subplots
import plotly.graph_objects as go

from bokeh.models import ColumnDataSource, RangeTool
from bokeh.io import output_file, show, output_notebook, curdoc
from bokeh.plotting import figure
from bokeh.layouts import column
from bokeh.models import LabelSet, HoverTool, Range1d, Label

output_notebook()

import holoviews as hv
from holoviews import opts, dim
hv.extension('bokeh')

theme_list = ['caliber', 'dark_minimal', 'light_minimal', "night_sky", "contrast"]
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load data

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/test.csv')
subm = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/sample_submission.csv')


## Data Overview
* **Data size:** 
   * Train data has 7111 rows and 12 columns. Of the 12 columns, 3 are target columns, 1 is the timestamp and the rest are continuous variables
   * Test data has 2247 rows and 9 columns. Of the 9 columns, 1 is the timestamp and the rest are continuous variables

* **Data type:**
    * The timestamp is the only column with object datatype. The rest are of float type.
   
* **Missing Values:**
    * Both train and test data have no missing values
    
* **Distributions:**
    * Judging by the histogram plots and the calculated skewness and kurtosis values, the features in the train and test data look fairly normally distributed, except for sensor_3.
    * The distributions of the three target variables are right-skewed.
    > A log transformation could be useful for the target variables
    
* **Features:**
    * date_time: timestamp of the data 
    * deg_C: ambient temperature
    * relative_humidity (ambient air property, amount of water vapor in the air relative to the maximum amount the air can contain at a given temperature)
    * absolute_humidity (ambient air property, amount of water vapor in the air)
    * sensor_1 to sensor_5 (sensor input data)
    
* **Targets:**
    > The target variables are air pollutants measured at some location in a city
    * target_carbon_monoxide (carbon monoxide formed as a result of combustion)
    * target_benzene (unburned benzene traces released to the atmosphere)
    * target_nitrogen_oxides (nitrogen oxides formed as a result of combustion)
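
The log transformation suggested above can be sketched as follows. This is an illustration on synthetic right-skewed values, not on the competition data: the series `target` and its lognormal parameters are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a target column (not real data)
target = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=5000), name="toy_target")

log_target = np.log1p(target)      # forward transform: log(1 + y)
restored = np.expm1(log_target)    # inverse transform back to the original scale

skew_before = target.skew()
skew_after = log_target.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Using `log1p` rather than a plain logarithm also lines up with the competition metric: the RMSE computed on `log1p`-transformed values is the RMSLE on the original values.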

### Train data

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

### Train data: features histogram

In [None]:
feat = train.columns[1:-3]
df_train = hv.Dataset(train)
df_test = hv.Dataset(test)

hist1 = df_train.hist(dimension=list(feat), bins=10, adjoin=False)
hist1.opts(opts.Histogram(alpha=0.9, width=300, height=200, bgcolor ='#f4f0ec'))
hist1.opts(title='Train data distribution', fontscale=1.15)
hist1.cols(3)

### Train data: Skewness and kurtosis

In [None]:
kurt = []
ske = []
for feat in train.columns[1:]:
    x = train[feat]     
    kurt.append(kurtosis(x))
    ske.append(skew(x))
 
    
fig = plt.figure(figsize=(8, 6), facecolor='#e8e8e8')
ax = fig.add_subplot(111, facecolor='#E8E8E8')
ax.plot(train.columns[1:], kurt, '*', markersize = 15, color= 'red', label="kurtosis")
ax.plot(train.columns[1:], ske, 'o', markersize = 15, color='blue',label="skewness" )
ax.hlines(y=0, xmin=0, xmax=10, colors='black', linestyles='solid', label='Normal-dist')
ax.set_xlabel('Features + targets', fontsize=12)
ax.set_ylabel('Skewness & kurtosis', fontsize=12)
ax.set_title('Skewness and kurtosis of all the train_features + targets', fontsize=16)
ax.grid()
ax.legend()
plt.xticks(rotation = 90)
plt.show()

### Test data

In [None]:
test.head()

In [None]:
test.info()

In [None]:
test.describe()

### Test data: features distribution

In [None]:
feat = test.columns[1:]
df_test = hv.Dataset(test)

hist2 = df_test.hist(dimension=list(feat), bins=10, adjoin=False)
hist2.opts(opts.Histogram(alpha=0.9, width=300, height=200, color='salmon', bgcolor ='#f4f0ec'))
hist2.opts(title='Test data distribution', fontscale=1.15)
hist2.cols(3)

### Test data: Skewness and kurtosis of features

In [None]:
kurt_test = []
ske_test = []
for feat in test.columns[1:]:
    x = test[feat]     
    kurt_test.append(kurtosis(x))
    ske_test.append(skew(x))
 
    
fig = plt.figure(figsize=(8, 6), facecolor='#e8e8e8')
ax = fig.add_subplot(facecolor='#e8e8e8')
ax.plot(test.columns[1:], kurt_test, '*', markersize = 15, color= 'seagreen', label="kurtosis")
ax.plot(test.columns[1:], ske_test, 'o', markersize = 15, color='salmon',label="skewness" )
ax.hlines(y=0, xmin=0, xmax=7, colors='black', linestyles='solid', label='Normal-dist')
ax.set_xlabel('Features', fontsize=12)
ax.set_ylabel('Skewness & kurtosis', fontsize=12)
ax.set_title('Skewness and kurtosis of all the test_feature', fontsize=16)
ax.grid()
ax.legend()
plt.xticks(rotation = 90)
plt.show()

### Targets' distribution

In [None]:
width = 300
height = 250

kde1 = (hv.Distribution(train.target_carbon_monoxide, label='CO'))

kde1.opts(width=width,
         height=height,
         title='Carbon monoxide',
         xrotation=0,
         xaxis='bottom',
         xlabel='CO', 
         ylabel='Density',  
         #tools=['hover'],
         bgcolor ='#f4f0ec',
         show_legend=False,                    

         )
kde2 = (hv.Distribution(train.target_benzene, label='Benzene'))

kde2.opts(width=width,
         height=height,
         title='Benzene',
         color= 'red',
         xrotation=0,
         xaxis='bottom',
         xlabel='Benzene', 
         ylabel='Density',  
         #tools=['hover'],
         bgcolor ='#f4f0ec',
         show_legend=False,                     

         )
kde3 = (hv.Distribution(train.target_nitrogen_oxides, label='NOx'))

kde3.opts(width=width,
         height=height,
         title='Nitrogen oxides',
         color= 'seagreen',
         xrotation=0,
         xaxis='bottom',
         xlabel='NOx', 
         ylabel='Density',  
         #tools=['hover'],
         bgcolor ='#f4f0ec',
         show_legend=False,                    

         )
(kde1 + kde2+ kde3)

### Correlation heatmap

* Pearson's correlation will do here since all columns except the timestamp (which is excluded from the correlation) are continuous variables.
* A few things to note:
    * `target_benzene` and `sensor_2` are the most strongly positively correlated pair, with a correlation coefficient of **0.96**.
    * `sensor_2` and `sensor_3` are the most strongly negatively correlated pair, with a correlation coefficient of **-0.82**.
    * Except for `sensor_3`, which is **negatively** correlated, the other `four sensors` are **positively** correlated with all **three target variables**.
    * `deg_C`, `relative_humidity` and `absolute_humidity` are the **least** correlated with the **three target variables**.
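
One way to read off the strongest pairs without inspecting a heatmap is to flatten the lower triangle of the correlation matrix and sort it. The sketch below uses a small synthetic frame; the column names `a` to `d` and the frame `toy` are made up for this example, and in the notebook the input would be the numeric columns of `train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=n)
# Toy frame (synthetic): 'b' moves with 'a', 'c' moves against 'a', 'd' is noise
toy = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=n),
    "c": -base + 0.5 * rng.normal(size=n),
    "d": rng.normal(size=n),
})

corr = toy.corr(method="pearson")
# Keep the strict lower triangle so each pair appears once and the diagonal is excluded
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack().dropna().sort_values()

most_negative = pairs.index[0]    # (row, column) labels of the most negative pair
most_positive = pairs.index[-1]   # (row, column) labels of the most positive pair
print(pairs)
```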

In [None]:
import plotly.figure_factory as ff
feats = train.columns[1:]
df_ = train[feats]
corr = df_.corr()

mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24
corr = corr.mask(mask)
fig = ff.create_annotated_heatmap(
    z=corr.to_numpy().round(2),
    x=list(corr.index.values),
    y=list(corr.columns.values),       
    xgap=3, ygap=3,
    zmin=-1, zmax=1,
    colorscale='earth',
    colorbar_thickness=30,
    colorbar_ticklen=3,
)
fig.update_layout(title_text='<b>Correlation heatmap</b>',
                  title_x=0.5,
                  title_font={'size': 24},
                  width=750, height=650,
                  xaxis_showgrid=False,
                  xaxis={'side': 'bottom'},
                  yaxis_showgrid=False,
                  yaxis_autorange='reversed',
                  template ='simple_white',
                  paper_bgcolor='lightgray',
                  )
fig.show()

### Time series plot of the independent variables

In [None]:
fig = make_subplots(rows=8, cols=1,                   
    subplot_titles=('deg_C', 'relative_humidity', 'absolute_humidity', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5')
                   )

for i, col in enumerate(test.columns[1:]):
    
    fig.add_trace(go.Scatter(x=train['date_time'], 
         y=train[col],name='train', line=dict(color='gray', width=1, dash='solid')
         ),
         row=i+1, col=1)
    fig.add_trace(go.Scatter(x=test['date_time'], 
         y=test[col],name='test', line=dict(color='salmon', width=1, dash='solid')
         ),
         row=i+1, col=1)

fig.update_layout(title_text='<b>Time series plots</b>',
                  title_font={'size': 28, 'family':'Courier'},
                  title_x=0.5,
                  showlegend=False,
                  autosize=True, 
                  width=1200, 
                  height=1200, 
                  template='ggplot2', 
                  paper_bgcolor='lightgray')
fig.show()

### Time series plot of the target variables
* Note the noticeable drop in readings in August and the spikes toward winter (Nov-Dec)


In [None]:
dates = np.array(train['date_time'], dtype=np.datetime64)
source = ColumnDataSource(data=dict(date=dates, signal=train['target_carbon_monoxide']))

fig = figure(title='Carbon Monoxide',plot_height=250, plot_width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",background_fill_color="#efefef", x_range=(dates[3500], dates[4500]))

fig.line('date', 'signal', color='red', source=source)
fig.yaxis.axis_label = None

fig.title.text_font_size = '20pt'
fig.title.text_font_style = 'bold'
fig.title.text_font = 'Serif'

select = figure(title="Drag the middle and edges of the selection box to change the range above",
                plot_height=130, plot_width=800, y_range=fig.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None, background_fill_color="#efefef")

range_tool = RangeTool(x_range=fig.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2

select.line('date', 'signal', color='red', source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

show(column(fig, select))

### Target Benzene

In [None]:
dates = np.array(train['date_time'], dtype=np.datetime64)
source = ColumnDataSource(data=dict(date=dates, signal=train['target_benzene']))

fig = figure(title='Benzene',
    plot_height=250, plot_width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",
           background_fill_color="#efefef", x_range=(dates[3500], dates[4500]))

fig.line('date', 'signal', color='gold',source=source)
fig.yaxis.axis_label = None

fig.title.text_font_size = '20pt'
fig.title.text_font_style = 'bold'
fig.title.text_font = 'Serif'

select = figure(title="Drag the middle and edges of the selection box to change the range above",
                plot_height=130, plot_width=800, y_range=fig.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None, background_fill_color="#efefef")

range_tool = RangeTool(x_range=fig.x_range)
range_tool.overlay.fill_color = "salmon"
range_tool.overlay.fill_alpha = 0.2

select.line('date', 'signal', color='gold', source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

show(column(fig, select))

### Target nitrogen oxides

In [None]:
dates = np.array(train['date_time'], dtype=np.datetime64)
source = ColumnDataSource(data=dict(date=dates, signal=train['target_nitrogen_oxides']))

fig = figure(title='Nitrogen Oxides',
    plot_height=250, plot_width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",
           background_fill_color="#efefef", x_range=(dates[3500], dates[4500]))

fig.line('date', 'signal', color='seagreen', source=source)
fig.yaxis.axis_label = None

fig.title.text_font_size = '20pt'
fig.title.text_font_style = 'bold'
fig.title.text_font = 'Serif'

select = figure(title="Drag the middle and edges of the selection box to change the range above",
                plot_height=130, plot_width=800, y_range=fig.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None, background_fill_color="#efefef")

range_tool = RangeTool(x_range=fig.x_range)
range_tool.overlay.fill_color = "salmon"
range_tool.overlay.fill_alpha = 0.2

select.line('date', 'signal', color='seagreen', source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

show(column(fig, select))

### Sensors data:
* What the sensors measure is not known. Nonetheless, let's see how they are related.
     * `sensor_1` is directly related to all the other sensors except `sensor_3`
     * `sensor_2` is directly related to `sensor_4` and `sensor_5`, but inversely related to `sensor_3`
     * `sensor_3` is inversely related to `sensor_4` and `sensor_5`
     * `sensor_4` is directly related to `sensor_5`
     > Note: we have already seen these relations in the correlation heatmap.
     

In [None]:
# If this code snippet feels a little cumbersome,
# ff.create_scatterplotmatrix() is a simpler way of making a pair plot in plotly

feat = train.columns[4:9]
df_study = train[feat]

fig = make_subplots(
    rows=5, cols=5,
    specs=[[{}, {}, {}, {}, {}],
           [{}, {}, {}, {}, {}],
           [{}, {}, {}, {}, {}],
           [{}, {}, {}, {}, {}],
           [{}, {}, {}, {}, {}]],
    print_grid=False,#shared_yaxes=True, 
    shared_xaxes=True)

# fig.add_trace(go.Violin(x=df_study['sensor_1'], line_color='lightseagreen', name='sensor1_train', y0=0, orientation='h', side='positive', meanline_visible=False), row=1, col=1)
# fig.add_trace(go.Violin(x=test['sensor_1'], line_color='#0077b3', name='sensor1_test', y0=0, orientation='h', side='positive', meanline_visible=False), row=1, col=1)
fig.add_trace(go.Violin(), row=1, col=1)

fig.add_trace(go.Scatter(x=df_study['sensor_1'], y=df_study['sensor_2'], name='sensor1_2', mode='markers',  marker_color='lightseagreen'), row=2, col=1)
#fig.add_trace(go.Violin(x=df_study['sensor_2'], line_color='gold', name='sensor2', y0=0, orientation='h', side='positive', meanline_visible=False), row=2, col=2)

fig.add_trace(go.Scatter(x=df_study['sensor_1'], y=df_study['sensor_3'], name='sensor1_3',mode='markers',  marker_color='#f86767'), row=3, col=1)
fig.add_trace(go.Scatter(x=df_study['sensor_2'], y=df_study['sensor_3'], name='sensor2_3',mode='markers',  marker_color='#f86767'), row=3, col=2)
#fig.add_trace(go.Violin(x=df_study['sensor_3'], line_color='gold', name='sensor3',y0=0, orientation='h', side='positive', meanline_visible=False), row=3, col=3)


fig.add_trace(go.Scatter(x=df_study['sensor_1'], y=df_study['sensor_4'], name='sensor1_4',mode='markers',  marker_color='lightseagreen'), row=4, col=1)
fig.add_trace(go.Scatter(x=df_study['sensor_2'], y=df_study['sensor_4'], name='sensor2_4',mode='markers',  marker_color='lightseagreen'), row=4, col=2)
fig.add_trace(go.Scatter(x=df_study['sensor_3'], y=df_study['sensor_4'], name='sensor3_4',mode='markers',  marker_color='#f86767'), row=4, col=3)
#fig.add_trace(go.Violin(x=df_study['sensor_4'], line_color='gold', name='', y0=0, orientation='h', side='positive', meanline_visible=False), row=4, col=4)


fig.add_trace(go.Scatter(x=df_study['sensor_1'], y=df_study['sensor_5'],  name='sensor1_5',mode='markers',  marker_color='lightseagreen'), row=5, col=1)
fig.add_trace(go.Scatter(x=df_study['sensor_2'], y=df_study['sensor_5'],  name='sensor2_5',mode='markers',  marker_color='lightseagreen'), row=5, col=2)
fig.add_trace(go.Scatter(x=df_study['sensor_3'], y=df_study['sensor_5'],  name='sensor3_5',mode='markers',  marker_color='#f86767'), row=5, col=3)
fig.add_trace(go.Scatter(x=df_study['sensor_4'], y=df_study['sensor_5'],  name='sensor4_5',mode='markers',  marker_color='lightseagreen'), row=5, col=4)
#fig.add_trace(go.Violin(x=df_study['sensor_5'], line_color='gold',  name='sensor5',y0=0, orientation='h', side='positive', meanline_visible=False), row=5, col=5)

####
#fig.add_trace(go.Scatter(x=test['sensor_1'], y=test['sensor_5'],  name='sensor1_5',mode='markers',  marker_color='seagreen'), row=1, col=5)

#fig.add_trace(go.Violin(x=test['sensor_1'], line_color='gold', name='sensor1', y0=0, orientation='h', side='positive', meanline_visible=False), row=1, col=1)

fig.add_trace(go.Scatter(x=test['sensor_2'], y=test['sensor_1'], name='sensor2_1', mode='markers',  marker_color='#0077b3'), row=1, col=2)
#fig.add_trace(go.Violin(x=test['sensor_2'], line_color='gold', name='sensor2', y0=0, orientation='h', side='positive', meanline_visible=False), row=2, col=2)

fig.add_trace(go.Scatter(y=test['sensor_1'], x=test['sensor_3'], name='sensor3_1',mode='markers',  marker_color='#b36b00'), row=1, col=3)
fig.add_trace(go.Scatter(y=test['sensor_2'], x=test['sensor_3'], name='sensor3_2',mode='markers',  marker_color='#b36b00'), row=2, col=3)
#fig.add_trace(go.Violin(x=test['sensor_3'], line_color='gold', name='sensor3',y0=0, orientation='h', side='positive', meanline_visible=False), row=3, col=3)


fig.add_trace(go.Scatter(y=test['sensor_1'], x=test['sensor_4'], name='sensor4_1',mode='markers',  marker_color='#0077b3'), row=1, col=4)
fig.add_trace(go.Scatter(y=test['sensor_2'], x=test['sensor_4'], name='sensor4_2',mode='markers',  marker_color='#0077b3'), row=2, col=4)
fig.add_trace(go.Scatter(y=test['sensor_3'], x=test['sensor_4'], name='sensor4_3',mode='markers',  marker_color='#b36b00'), row=3, col=4)
#fig.add_trace(go.Violin(x=test['sensor_4'], line_color='gold', name='', y0=0, orientation='h', side='positive', meanline_visible=False), row=4, col=4)


fig.add_trace(go.Scatter(y=test['sensor_1'], x=test['sensor_5'],  name='sensor5_1',mode='markers',  marker_color='#0077b3'), row=1, col=5)
fig.add_trace(go.Scatter(y=test['sensor_2'], x=test['sensor_5'],  name='sensor5_2',mode='markers',  marker_color='#0077b3'), row=2, col=5)
fig.add_trace(go.Scatter(y=test['sensor_3'], x=test['sensor_5'],  name='sensor5_3',mode='markers',  marker_color='#b36b00'), row=3, col=5)
fig.add_trace(go.Scatter(y=test['sensor_4'], x=test['sensor_5'],  name='sensor5_4',mode='markers',  marker_color='#0077b3'), row=4, col=5)
fig.add_trace(go.Scatter(), row=5, col=5)
#fig.add_trace(go.Violin(x=test['sensor_5'], line_color='gold',  name='sensor5',y0=0, orientation='h', side='positive', meanline_visible=False), row=5, col=5)


# Update xaxis properties

fig.update_xaxes(title_text="sensor_1", showgrid=False, row=5, col=1)
fig.update_xaxes(title_text="sensor_2", showgrid=False, row=5, col=2)
fig.update_xaxes(title_text="sensor_3", showgrid=False, row=5, col=3)
fig.update_xaxes(title_text="sensor_4", showgrid=False, row=5, col=4)
fig.update_xaxes(title_text="sensor_5", showgrid=False, row=5, col=5)

# Update yaxis properties
fig.update_yaxes(title_text="sensor_1",showgrid=False, row=1, col=1)
fig.update_yaxes(title_text="sensor_2",showgrid=False, row=2, col=1)
fig.update_yaxes(title_text="sensor_3",showgrid=False, row=3, col=1)
fig.update_yaxes(title_text="sensor_4",showgrid=False, row=4, col=1)
fig.update_yaxes(title_text="sensor_5",showgrid=False, row=5, col=1)

fig.update_layout(height=900, width=900,
                  showlegend=False,
                  title_text="<b>Sensor Data Pairplot</b>",
                  title_font={'size': 28, 'family':'Courier New'},
                  template='plotly_dark',
                  plot_bgcolor='#303330',
                  margin=dict(b=120,
                              t=120,
                 )
                 )            

annotations = []
annotations.append(dict(xref='paper', yref='paper',
                        x=0.99, y=1.1,
                        text='<b> Test data: Upper triangular <b>',
                             font=dict(family='Arial', size=20, color='yellow'),
                        showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.02, y=-0.15,
                        text='<b> Train data: Lower triangular<b>',
                             font=dict(family='Arial', size=20, color='yellow'),
                        showarrow=False))

fig.update_layout(annotations=annotations)
fig.show()

## Seasonality of the data

In [None]:
# change the date_time columns datatype from object to datetime
train['date_time'] = pd.to_datetime(train['date_time'])
train['month'] = train['date_time'].dt.month
train['hour'] = train['date_time'].dt.hour
train['day_of_week'] = train['date_time'].dt.dayofweek
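# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
train['week_of_year'] = train['date_time'].dt.isocalendar().week.astype(int)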

## Seasonality/periodicity of sensor readings
* Except for sensor_3, all the sensors show a noticeable valley in the early hours of the day and peaks at the `beginning` and `end` of the working hours
* There are also valleys on the weekends (especially on Sunday)
* There is a valley in `August` and there are peaks toward `winter` (October, November, December)

In [None]:
def _barPlots(data, x, features, rows, cols, title, pal='Greens'):
    """Draw one bar plot per feature, stacked vertically and grouped by column `x`."""
    fig, axes = plt.subplots(rows, cols, figsize=(10, 12), sharex=True)
    fig.subplots_adjust(top=0.92)
    # With cols=1, `axes` is a 1-D array, so it can be zipped with the feature names
    for name, ax in zip(features, axes):
        sns.barplot(data=data, x=x, y=name, ax=ax, palette=pal)
        ax.set_ylabel(None)
        ax.set_title(name)
        # Only the bottom plot keeps its x-axis label
        if ax is not axes[-1]:
            ax.set_xlabel('')
    plt.suptitle(title, fontsize=24)

In [None]:
feats = ['sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
title = 'Hourly sensors reading'

_barPlots(data=train, x='hour', features=feats, rows=len(feats), cols=1, title=title)

### Potential feature creation from the sensors
> The ratio of sensor_4 to sensor_3 may show a clearer periodicity during the day, one that matches the readings from the other sensors. Its pattern is very similar to that of sensor_5.

In [None]:
train['sensor_4/sensor_3'] = (train['sensor_4']/train['sensor_3'])
feats = ['sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_4/sensor_3']
title = 'Hourly sensors reading with the additional feature (sensor_4/sensor_3)'
_barPlots(data=train, x='hour', features=feats, rows=len(feats), cols=1, title=title)

In [None]:
feats = ['sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
title = 'Day of the week sensors reading'
_barPlots(data=train, x='day_of_week', features=feats, rows=len(feats), cols=1, title=title)

In [None]:
feats = ['sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
title = 'Monthly sensors reading'
_barPlots(data=train, x='month', features=feats, rows=len(feats), cols=1, title=title)

## Seasonality/periodicity of target variables
* A noticeable drop in the early hours of the day and peaks at the `beginning` and `end` of the working hours
* A drop on the weekends (especially on Sunday)
* A valley in `August` and peaks toward `winter` (October, November, December)


### Targets hourly readings

In [None]:
feats = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
title = 'Hourly target variations'
_barPlots(data=train, x='hour', features=feats, rows=len(feats), cols=1, title=title, pal='coolwarm')

### Targets day of the week readings

In [None]:
feats = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
title = 'Day of the week target variations'
_barPlots(data=train, x='day_of_week', features=feats, rows=len(feats), cols=1, title=title, pal='coolwarm')

### Targets reading by month average

In [None]:
feats = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
title = 'Monthly target variations'
_barPlots(data=train, x='month', features=feats, rows=len(feats), cols=1, title=title, pal='coolwarm')

 > ### Takeaway: Additional features derived from date_time are likely to be useful!
  > A few examples could be:
   > * Parts of the day: early hours, working hours (start and end of a working period), late night, etc.
   > * Day of the week: three groups (Mon + Sat, Sun, the rest) for example
   > * Month: 'Is winter?' 'Is summer?' Even a specific month such as 'Is August?' for example
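
The ideas in the takeaway above could be turned into columns along the following lines. This is only a sketch on a hypothetical frame `toy`: the bin edges for the parts of the day, the feature names, and the choice of October to December as the peak months (taken from the description of the monthly plots) are assumptions to be tuned against the bar plots.

```python
import pandas as pd

# Hypothetical hourly timestamps covering one week (2010-08-01 is a Sunday)
toy = pd.DataFrame({
    "date_time": pd.date_range("2010-08-01", periods=24 * 7, freq=pd.Timedelta(hours=1))
})

toy["hour"] = toy["date_time"].dt.hour
toy["day_of_week"] = toy["date_time"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["month"] = toy["date_time"].dt.month

# Part of the day: the bin edges are illustrative guesses, not derived from the data
toy["part_of_day"] = pd.cut(
    toy["hour"],
    bins=[-1, 5, 9, 16, 20, 23],
    labels=["early_hours", "morning_rush", "working_hours", "evening_rush", "late_night"],
)

# Simple 0/1 flags for the calendar effects noted above
toy["is_sunday"] = (toy["day_of_week"] == 6).astype(int)
toy["is_weekend"] = (toy["day_of_week"] >= 5).astype(int)
toy["is_august"] = (toy["month"] == 8).astype(int)
toy["is_peak_season"] = toy["month"].isin([10, 11, 12]).astype(int)

print(toy.head())
```

The same transformations would have to be applied to both `train` and `test` so that the model sees identical columns.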

### Autocorrelation plots (ACF)

An autocorrelation plot shows whether the elements of a time series are correlated with earlier elements of the same series. The vertical axis shows the value of the autocorrelation function (ACF), which ranges from -1 to 1. The horizontal axis shows the size of the lag between the elements being compared. The shaded region shows the confidence interval for the statistical significance of the correlations: points outside this region are statistically significantly correlated.
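
To make the quantity on the vertical axis concrete, here is a minimal NumPy sketch of the sample autocorrelation at a single lag. The helper `autocorr` and the synthetic 24-step signal are assumptions made for this illustration, and the `1.96 / sqrt(n)` band is a white-noise rule of thumb that may differ from the interval a plotting library draws.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    if lag == 0:
        return 1.0
    # Products of the series with its lagged copy, normalized by the lag-0 sum of squares
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

# Synthetic signal with a 24-step cycle plus noise (stands in for an hourly reading)
rng = np.random.default_rng(1)
t = np.arange(24 * 30)
signal = np.sin(2 * np.pi * t / 24) + 0.3 * rng.normal(size=t.size)

acf_12 = autocorr(signal, 12)   # half a cycle apart
acf_24 = autocorr(signal, 24)   # one full cycle apart
# Approximate 95% band under a white-noise assumption
conf_band = 1.96 / np.sqrt(signal.size)
print(acf_12, acf_24, conf_band)
```

For a series with a daily cycle, the value at half a cycle is expected to be negative and the value at a full cycle positive, both well outside the band.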

In [None]:
import statsmodels.api as sm
features = ['sensor_1','sensor_2', 'sensor_3','sensor_4','sensor_5','target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
for name in features:
    fig = sm.graphics.tsa.plot_acf(train[name], use_vlines= True,marker='o', color='red',lags=17, alpha=0.05, title='Autocorrelation: '+ name)    
    plt.xlabel("Lag periods")
    plt.ylabel("ACorr. coeff.")
    plt.show();

### Partial autocorrelation plots (PACF)

A partial autocorrelation plot shows the relationship between an observation in a time series and observations at prior time steps (lags), with the relationships of the intervening observations removed. Partial autocorrelations can be useful in identifying the order of an autoregressive (AR) model.
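
One way to see what "intervening observations removed" means is to estimate the partial autocorrelation at lag k as the last coefficient of a least-squares regression of the series on its first k lags. The sketch below does this with NumPy on a simulated AR(1) series; the helper `pacf_ols`, the coefficient `phi = 0.7`, and the series length are assumptions for this illustration, and a library may use a different estimator.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Partial autocorrelations for lags 1..max_lag via successive least-squares AR fits."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    values = []
    for k in range(1, max_lag + 1):
        # Regress x[t] on x[t-1], ..., x[t-k]; the coefficient on x[t-k] is the PACF at lag k
        lagged = np.column_stack([x[k - j : len(x) - j] for j in range(1, k + 1)])
        coeffs, *_ = np.linalg.lstsq(lagged, x[k:], rcond=None)
        values.append(float(coeffs[-1]))
    return np.array(values)

# Simulated AR(1) process: x[t] = phi * x[t-1] + noise[t]
rng = np.random.default_rng(7)
phi = 0.7
series = np.zeros(3000)
noise = rng.normal(size=series.size)
for t in range(1, series.size):
    series[t] = phi * series[t - 1] + noise[t]

pacf_values = pacf_ols(series, max_lag=5)
print(pacf_values.round(3))
```

For an AR(1) process the partial autocorrelation should be close to `phi` at lag 1 and close to zero afterwards, which is the cut-off pattern used to pick the order of an AR model.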

In [None]:
features = ['sensor_1','sensor_2', 'sensor_3','sensor_4','sensor_5','target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
for name in features:
    fig = sm.graphics.tsa.plot_pacf(train[name], use_vlines= True,marker='o', color='red',lags=20, alpha=0.05, title='Partial autocorrelation: '+ name)    
    plt.xlabel("Lag periods")
    plt.ylabel("PACorr. coeff.")
    plt.show();

The two takeaways:
* The ACF plots suggest that we can add new features based on lagged values
* We could consult the PACF plots to identify the order if we use an AR model
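
Lag features of the kind mentioned in the first takeaway can be built with `shift`. The sketch below uses a tiny synthetic frame `toy`; the lag choices and feature names are illustrative assumptions. Lagging the sensor columns rather than the targets keeps the features computable on the test set, where the targets are not available.

```python
import pandas as pd

# Toy hourly frame with made-up values; in the notebook this would be `train`
toy = pd.DataFrame({
    "date_time": pd.date_range("2010-03-10 18:00", periods=6, freq=pd.Timedelta(hours=1)),
    "sensor_1": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],
})
# Lags only make sense on rows ordered in time
toy = toy.sort_values("date_time").reset_index(drop=True)

for lag in (1, 2):   # illustrative lags; the ACF plots would guide the real choice
    toy[f"sensor_1_lag{lag}"] = toy["sensor_1"].shift(lag)

# Change since the previous reading, another common lag-derived feature
toy["sensor_1_diff1"] = toy["sensor_1"].diff()

print(toy)
```

The first rows of each lagged column are missing by construction and need to be dropped or filled before modelling.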

#### Thank you!