# Exploratory Data Analysis and Predictive Modelling of Local Atmospheric Pollutants

###### Kaggle Tabular Playground Series - July 2021

The [given dataset](https://www.kaggle.com/c/tabular-playground-series-jul-2021) includes dates, temperatures, humidities (relative and absolute), and data from sensors.

The goal of the project is to accurately predict the amounts of three target variables (pollutants);
target_carbon_monoxide, target_benzene, and target_nitrogen_oxides.

An additional personal goal is to determine the factors which effect each of the target variables the most.

I've also seen others use Plotly to visualise the data and findings. As I don't have much experience with Plotly in
Python, I'll be using it here also, as an opportunity to learn.

In [256]:
# import libraries

import os
import datetime as dt  # processing of dates and times

from termcolor import colored as cl  # text customization

import numpy as np  # calculations and functions
print('numpy version : {}'.format(np.__version__))
import scipy
from scipy.stats import normaltest
print('scipy version : {}'.format(np.__version__))
import pandas as pd  # data processing / wrangling
pd.set_option("display.max_columns", None)
print('pandas version : {}'.format(pd.__version__))

import matplotlib
import matplotlib.pyplot as plt  # data visualisation
from IPython.core.display import display
%matplotlib inline
print('matplotlib version : {}'.format(matplotlib.__version__))

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
print('plotly version : {}'.format(plotly.__version__))

import sklearn
print('sklearn version : {}'.format(sklearn.__version__))

numpy version : 1.20.2
scipy version : 1.20.2
pandas version : 1.2.5
matplotlib version : 3.3.4
plotly version : 4.14.3
sklearn version : 0.24.2


## Data Processing

### Reading and Inspecting Datasets

First, we import both the training and testing datasets. We also create a new dataframe containing the data from both
the training and testing datasets, minus the target data. This makes the following data processing steps easier.

In [257]:
# path to directory containing input files
input_dir = os.getcwd() + os.sep + 'input'

# read input data files
df_train = pd.read_csv(input_dir + os.sep + 'train.csv')
df_test = pd.read_csv(input_dir + os.sep + 'test.csv')

# remove target columns from train data and append to test data
target_cols = ['target_carbon_monoxide',
               'target_benzene',
               'target_nitrogen_oxides']
df_all = df_train.drop(columns=target_cols).append(df_test)

Next, let's take a look at the data to get to grips with what we're working with.

In [258]:
# training dataframe
print('Training data shape : ', df_train.shape)
print(df_train.info())
display(df_train.head(5))

# testing dataframe
print('Testing data shape : ', df_test.shape)
print(df_test.info())
display(df_test.head())

Training data shape :  (7111, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7111 entries, 0 to 7110
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   date_time               7111 non-null   object 
 1   deg_C                   7111 non-null   float64
 2   relative_humidity       7111 non-null   float64
 3   absolute_humidity       7111 non-null   float64
 4   sensor_1                7111 non-null   float64
 5   sensor_2                7111 non-null   float64
 6   sensor_3                7111 non-null   float64
 7   sensor_4                7111 non-null   float64
 8   sensor_5                7111 non-null   float64
 9   target_carbon_monoxide  7111 non-null   float64
 10  target_benzene          7111 non-null   float64
 11  target_nitrogen_oxides  7111 non-null   float64
dtypes: float64(11), object(1)
memory usage: 666.8+ KB
None


Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,target_carbon_monoxide,target_benzene,target_nitrogen_oxides
0,2010-03-10 18:00:00,13.1,46.0,0.7578,1387.2,1087.8,1056.0,1742.8,1293.4,2.5,12.0,167.7
1,2010-03-10 19:00:00,13.2,45.3,0.7255,1279.1,888.2,1197.5,1449.9,1010.9,2.1,9.9,98.9
2,2010-03-10 20:00:00,12.6,56.2,0.7502,1331.9,929.6,1060.2,1586.1,1117.0,2.2,9.2,127.1
3,2010-03-10 21:00:00,11.0,62.4,0.7867,1321.0,929.0,1102.9,1536.5,1263.2,2.2,9.7,177.2
4,2010-03-10 22:00:00,11.9,59.0,0.7888,1272.0,852.7,1180.9,1415.5,1132.2,1.5,6.4,121.8


Testing data shape :  (2247, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2247 entries, 0 to 2246
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   date_time          2247 non-null   object 
 1   deg_C              2247 non-null   float64
 2   relative_humidity  2247 non-null   float64
 3   absolute_humidity  2247 non-null   float64
 4   sensor_1           2247 non-null   float64
 5   sensor_2           2247 non-null   float64
 6   sensor_3           2247 non-null   float64
 7   sensor_4           2247 non-null   float64
 8   sensor_5           2247 non-null   float64
dtypes: float64(8), object(1)
memory usage: 158.1+ KB
None


Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5
0,2011-01-01 00:00:00,8.0,41.3,0.4375,1108.8,745.7,797.1,880.0,1273.1
1,2011-01-01 01:00:00,5.1,51.7,0.4564,1249.5,864.9,687.9,972.8,1714.0
2,2011-01-01 02:00:00,5.8,51.5,0.4689,1102.6,878.0,693.7,941.9,1300.8
3,2011-01-01 03:00:00,5.0,52.3,0.4693,1139.7,916.2,725.6,1011.0,1283.0
4,2011-01-01 04:00:00,4.5,57.5,0.465,1022.4,838.5,871.5,967.0,1142.3


There's a few things we should note about this data:
- Both datasets are relatively small, this may influence the type of model we choose.
- Certain input data (features) could be expanded. For example, the date can be used to add features such as season or
day of the week.
- There are no missing values (None or Null) in either set.

### Date and Time Processing

As usual, the dates and times listed in the date_time column could do with some work.

First, we'll inspect the dates and times and see if we can spot any patterns.

In [259]:
print('Training data:')
print(df_train.date_time)
print('')
print('Testing data:')
print(df_test.date_time)

Training data:
0       2010-03-10 18:00:00
1       2010-03-10 19:00:00
2       2010-03-10 20:00:00
3       2010-03-10 21:00:00
4       2010-03-10 22:00:00
               ...         
7106    2010-12-31 20:00:00
7107    2010-12-31 21:00:00
7108    2010-12-31 22:00:00
7109    2010-12-31 23:00:00
7110    2011-01-01 00:00:00
Name: date_time, Length: 7111, dtype: object

Testing data:
0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
               ...         
2242    2011-04-04 10:00:00
2243    2011-04-04 11:00:00
2244    2011-04-04 12:00:00
2245    2011-04-04 13:00:00
2246    2011-04-04 14:00:00
Name: date_time, Length: 2247, dtype: object


So it looks like we have a sample (row) for each hour, from 2010-03-10 6:00pm to 2011-01-01 12:00am for the training
data and from 2011-01-01 12:00am to 2011-04-04 2pm for the testing data.

Let's confirm that's the case by calculating the expected number of samples if we have 1 sample every hour, and
comparing this value to the actual number of samples.

In [260]:
# datetime format : (year, month, day, hour, minute, second)
train_initial_date = dt.datetime(2010, 3, 10, 18, 0, 0)
train_final_date = dt.datetime(2011, 1, 1, 0, 0, 0)
test_final_date = dt.datetime(2011, 4, 4, 14, 0, 0)

# difference between datetimes
SECONDS_PER_HOUR = 3600
train_timedelta = train_final_date - train_initial_date
train_hours = int(train_timedelta.total_seconds()/SECONDS_PER_HOUR)+1
test_timedelta = test_final_date - train_final_date
test_hours = int(test_timedelta.total_seconds()/SECONDS_PER_HOUR)+1

# get number of samples in datasets
train_num_samples = len(df_train)
test_num_samples = len(df_test)


# check if expected samples == actual samples
print('Training data:')
if train_hours == train_num_samples:
    print('1 sample every hour. No duplicates.')
else:
    print('Expected does not match actual number of samples.')
print('')
print('Testing data:')
if test_hours == test_num_samples:
    print('1 sample every hour. No duplicates.')
else:
    print('Expected does not match actual number of samples.')

Training data:
1 sample every hour. No duplicates.

Testing data:
1 sample every hour. No duplicates.


In validating that we have 1 sample every hour for both datasets, we discovered that the final date and time in the
training dataset is the same as the first date and time in the testing dataset.

Are these two samples identical?

In [261]:
display(df_train.iloc[-1:])
display(df_test.iloc[:1])

Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,target_carbon_monoxide,target_benzene,target_nitrogen_oxides
7110,2011-01-01 00:00:00,8.0,41.3,0.4375,1108.8,745.7,797.1,880.0,1273.1,1.4,4.1,186.5


Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5
0,2011-01-01 00:00:00,8.0,41.3,0.4375,1108.8,745.7,797.1,880.0,1273.1


Yes! These samples are identical. This is something to be aware of when we work with the concatenated df_all dataframe.

Now let's work on converting the date_time feature to the following features:
- Year
- Month
- Date
- Hour
- Minute
- Second
- Season
- Day of the week
- Time of day
- Weekend?
- Public holiday?

We'll start by breaking up date_time into its constituent parts.

In [262]:
# retain the intact date from the date_time values for ordering purposes
df_all['date'] = df_all.date_time.str.split(' ', expand=True)[0]

# convert values of date_time to datetime objects
df_all['date_time'] = pd.to_datetime(df_all['date_time'])
df_all['year'] = df_all['date_time'].dt.year
df_all['month'] = df_all['date_time'].dt.month
df_all['day'] = df_all['date_time'].dt.day
df_all['hour'] = df_all['date_time'].dt.hour
df_all['minute'] = df_all['date_time'].dt.minute
df_all['second'] = df_all['date_time'].dt.second

All of that was accomplished using the datetime library!

Now let's expand this feature even more, by adding some additional features from the list above.

The day of the week can also be found using the datetime library. From the day of the week we can determine whether the
date is on a weekday or a weekend.

The season is simply a function of the month. Similarly, the time of day is a function of the hour.

In [263]:
# create day of the week column
df_all['day_of_week'] = df_all['date_time'].dt.dayofweek

# create weekend column
df_all['weekend'] = 0
df_all.loc[df_all['day_of_week'].isin([5, 6, 7]), 'weekend'] = 1

# create season column (0 : winter, 1 : spring, 2 : summer, 3 : autumn)
df_all['season'] = 0
df_all.loc[df_all['month'].isin([3, 4, 5]), 'season'] = 1
df_all.loc[df_all['month'].isin([6, 7, 8]), 'season'] = 2
df_all.loc[df_all['month'].isin([9, 10, 11]), 'season'] = 3

# create time of day column (0 : morning, 1: afternoon, 2: evening, 3: night)
df_all['time_of_day'] = 0
df_all.loc[df_all['hour'].isin(range(12, 18)), 'season'] = 1
df_all.loc[df_all['hour'].isin(range(18, 23)), 'season'] = 2
df_all.loc[df_all['hour'].isin(range(23, 6)), 'season'] = 3

Whether the date falls on a public holiday is a little more complex, and involves importing data from
[here](https://www.kaggle.com/donnetew/us-holiday-dates-2004-2021).

In [264]:
# read csv file to pandas dataframe
df_holidays = pd.read_csv(input_dir + os.sep + 'us_holiday_dates_2004-2021.csv')

# inspect data
display(df_holidays.head())

# create public_holiday column on df_all
df_all['public_holiday'] = 0
df_all.loc[df_all['date'].isin(df_holidays.Date), 'public_holiday'] = 1

Unnamed: 0,Date,Holiday,WeekDay,Month,Day,Year
0,2004-07-04,4th of July,Sunday,7,4,2004
1,2005-07-04,4th of July,Monday,7,4,2005
2,2006-07-04,4th of July,Tuesday,7,4,2006
3,2007-07-04,4th of July,Wednesday,7,4,2007
4,2008-07-04,4th of July,Friday,7,4,2008


Done! Here is the dataframe with the expanded date and time features:

In [265]:
display(df_all.head(3))

Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,date,year,month,day,hour,minute,second,day_of_week,weekend,season,time_of_day,public_holiday
0,2010-03-10 18:00:00,13.1,46.0,0.7578,1387.2,1087.8,1056.0,1742.8,1293.4,2010-03-10,2010,3,10,18,0,0,2,0,2,0,0
1,2010-03-10 19:00:00,13.2,45.3,0.7255,1279.1,888.2,1197.5,1449.9,1010.9,2010-03-10,2010,3,10,19,0,0,2,0,2,0,0
2,2010-03-10 20:00:00,12.6,56.2,0.7502,1331.9,929.6,1060.2,1586.1,1117.0,2010-03-10,2010,3,10,20,0,0,2,0,2,0,0


### Additional Feature Expansion

Are there any other features we can extract from the current data set? Let's try to understand the other feature columns
by learning a bit about [humidity](https://www.engineersedge.com/thermodynamics/humidity.htm) and
[dew points](https://en.wikipedia.org/wiki/Dew_point).

**Absolute Humidity**<br>
*The total amount of water vapour present in a given volume of air, not taking temperature into consideration.*

**Relative Humidity**<br>
*The ratio of the partial pressure of water vapour in the mixture to the saturated vapour pressure of water at a given
temperature.*

**Dew Point**<br>
*The temperature at which air becomes saturated with water vapour and below which water vapour condenses to form liquid
(dew).*

We can use the relative humidities, and the temperatures, to *approximate* the
[dew point](https://bmcnoldy.rsmas.miami.edu/Humidity.html) using the following formula:

$$T_d = 243.04\times\frac{\log\left(\frac{RH}{100}\right)+\frac{17.625\times T}{243.04+T}}{17.625-\log\left(\frac{RH}{100}\right)-\frac{17.625\times T}{243.04+T}}$$

where $$T_d$$ is the dew point, $$RH$$ is the relative humidity, and $$T$$ is the temperature.

In [266]:
# create dew point column
df_all['dew_point'] = 243.04*(np.log(df_all['relative_humidity']/100)+((17.625*df_all['deg_C'])/
                      (243.04+df_all['deg_C'])))/(17.625-np.log(df_all['relative_humidity']/100)-
                      ((17.625*df_all['deg_C'])/(243.04+df_all['deg_C'])))

display(df_all.head(3))

Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,date,year,month,day,hour,minute,second,day_of_week,weekend,season,time_of_day,public_holiday,dew_point
0,2010-03-10 18:00:00,13.1,46.0,0.7578,1387.2,1087.8,1056.0,1742.8,1293.4,2010-03-10,2010,3,10,18,0,0,2,0,2,0,0,1.734357
1,2010-03-10 19:00:00,13.2,45.3,0.7255,1279.1,888.2,1197.5,1449.9,1010.9,2010-03-10,2010,3,10,19,0,0,2,0,2,0,0,1.611224
2,2010-03-10 20:00:00,12.6,56.2,0.7502,1331.9,929.6,1060.2,1586.1,1117.0,2010-03-10,2010,3,10,20,0,0,2,0,2,0,0,4.100765


### Visually Examining Datasets

#### Temperature Visualisation

Hey, the following figures are interactive! Try mousing over them or selecting a region to zoom in by drawing a
rectangle.

In [267]:
# remove duplicate sample from df_all
df_all = df_all.drop_duplicates()

# get temperature statistics
temp_mean = df_all.deg_C.mean()
temp_lq, temp_median, temp_uq = df_all.deg_C.quantile([0.25, 0.5, 0.75])
temp_min = df_all.deg_C.min()
temp_max = df_all.deg_C.max()

# show temperature statistics
print('TEMPERATURE STATISTICS')
print('----------------------')
print('Mean : ', end='')
print(round(temp_mean, 1))
print('Lower Quartile : ', end='')
print(temp_lq)
print('Median : ', end='')
print(temp_median)
print('Upper Quartile : ', end='')
print(temp_uq)
print('Minimum : ', end='')
print(temp_min)
print('Maximum : ', end='')
print(temp_max)
print('----------------------')

# get temperature range
temperature_range = df_all.deg_C.value_counts().sort_index()

# plot temperature distribution
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=temperature_range.index,
               y=temperature_range,
               line={'width': 1}
    )
)

# draw mean and quartiles on plot
fig.add_vline(
    x=temp_mean,
    line={'width': 1, 'color': 'red'},
    annotation_text='mean ({})'.format(round(temp_mean, 1)),
    annotation_position='top right',
    annotation_font_color='red'
)
fig.add_vline(
    x=temp_median,
    line={'width': 1, 'color': 'orange'},
    annotation_text='median ({})'.format(temp_median),
    annotation_position='top left',
    annotation_font_color='orange'
)

# set figure attributes
fig.update_layout(
    title='Temperature Distribution',
    xaxis_title='temperature (°C)',
    yaxis_title='number of occurrences'
)

# display figure
fig

TEMPERATURE STATISTICS
----------------------
Mean : 18.5
Lower Quartile : 12.0
Median : 18.1
Upper Quartile : 24.4
Minimum : -1.8
Maximum : 46.1
----------------------


Not too much to note from this. We can see the distribution is as expected, with low counts of extreme temperatures.

The duplicate sample discussed earlier has been removed from df_all now, to prevent this from having an effect on the
statistics.

In [268]:
# mean temperature per month
df_monthly_temp = df_all[['year', 'month', 'deg_C']] \
                    .groupby(['year', 'month'], as_index=False) \
                    .agg({'deg_C': 'mean'})
df_monthly_temp['date'] = pd.to_datetime(df_monthly_temp.year*10000+df_monthly_temp.month*100+1, format='%Y%m%d')
df_monthly_temp = df_monthly_temp.drop(columns=['year', 'month'])

# make figure
fig = go.Figure()

# draw temperature over time
fig.add_trace(
    go.Scatter(
        x=df_all.date_time,
        y=df_all.deg_C,
        name='hourly temperature',
        line={'width': 1}
    )
)

# draw monthly mean temperature
fig.add_trace(
    go.Scatter(
        x=df_monthly_temp.date,
        y=df_monthly_temp.deg_C,
        mode='lines',
        name='monthly mean temperature',
        line={
            'width': 1,
            'shape': 'spline'
        }
    )
)

# draw test region on plot
fig.add_vrect(
    x0=df_test.date_time.values[0],
    x1=df_test.date_time.values[-1],
    line_width=0,
    fillcolor='grey',
    opacity=0.1
)

# add test region annotations to plot
fig.add_annotation(
    x=df_test.date_time.values[0],
    y=45,
    xanchor='right',
    text='train',
    showarrow=False
)
fig.add_annotation(
    x=df_test.date_time.values[0],
    y=45,
    xanchor='left',
    text='test',
    showarrow=False
)

# set figure attributes
fig.update_layout(
    title='Temperature Change Over Time',
    xaxis_title='date',
    yaxis_title='temperature (°C)'
)

What can we learn from this plot?

It appears as though temperature is higher in the summer months, and lower in the winter months. This is, of course,
expected.

However, if we zoom in to the time period of December 2010 to March 2011, we can see 5 spikes in the
temperature. These spikes represent relatively high temperatures, sustained for multiple days. It's possible this is
correct, however, it could also be an example of some erroneous data.

In [269]:
# filter df_all for samples which occur during winter months
WINTER_START_DATE = dt.datetime(2010, 12, 1, 0, 0, 0)
WINTER_END_DATE = dt.datetime(2011, 3, 1, 0, 0, 0)
df_winter = df_all[df_all['date_time'].between(WINTER_START_DATE, WINTER_END_DATE)]

# check if winter temperature values follow normal distribution
_, p = normaltest(df_winter['deg_C'])
print('Do the winter temperatures follow a normal (Gaussian) distribution? ', end='')
if p <= 0.01:
    print('Yes!')
else:
    print('No.')

Do the winter temperatures follow a normal (Gaussian) distribution? Yes!


As the temperatures for the winter months follow a normal distribution, we can find outlier temperatures using Z-scores.

In [270]:
# get mean temperature and standard deviation of winter samples
winter_temp_mean = df_winter.deg_C.mean()
winter_temp_std = df_winter.deg_C.std()

# get Z-scores of winter samples
df_winter['z-score'] = (df_winter['deg_C']-winter_temp_mean)/winter_temp_std

# filter winter samples to only include samples with z-scores greater than +/-2
df_winter_outliers = df_winter[abs(df_winter['z-score']) >= 2]

# get mean temperature and standard deviation of winter outlier samples
winter_outliers_temp_mean = df_winter_outliers.deg_C.mean()
winter_outliers_temp_std = df_winter_outliers.deg_C.std()

# show results
print('WINTER TEMPERATURE STATISTICS')
print('-------------------------------------')
print('Mean : {}'.format(round(winter_temp_mean, 2)))
print('Standard deviation : {}'.format(round(winter_temp_std, 2)))
print('-------------------------------------')
print('WINTER TEMPERATURE OUTLIER STATISTICS')
print('-------------------------------------')
print('Mean : {}'.format(round(winter_outliers_temp_mean, 2)))
print('Standard deviation : {}'.format(round(winter_outliers_temp_std, 2)))
print('-------------------------------------')

Unnamed: 0,date_time,deg_C,relative_humidity,absolute_humidity,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,date,year,month,day,hour,minute,second,day_of_week,weekend,season,time_of_day,public_holiday,dew_point,z-score
6695,2010-12-14 17:00:00,23.9,31.1,0.2309,1040.4,373.7,1738.4,631.9,717.6,2010-12-14,2010,12,14,17,0,0,1,0,1,0,0,5.789226,2.288729
6696,2010-12-14 18:00:00,23.9,35.2,0.2308,1174.4,401.3,1757.3,619.8,754.6,2010-12-14,2010,12,14,18,0,0,1,0,2,0,0,7.592186,2.288729
6697,2010-12-14 19:00:00,23.6,31.0,0.2308,1084.4,405.2,1739.1,631.8,765.3,2010-12-14,2010,12,14,19,0,0,1,0,2,0,0,5.482087,2.238100
6698,2010-12-14 20:00:00,22.9,32.9,0.2300,1050.7,369.8,1869.9,583.3,725.8,2010-12-14,2010,12,14,20,0,0,1,0,2,0,0,5.730328,2.119967
6699,2010-12-14 21:00:00,24.1,34.6,0.2297,1184.4,389.5,1757.7,644.1,759.2,2010-12-14,2010,12,14,21,0,0,1,0,2,0,0,7.516270,2.322481
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1000,2011-02-11 16:00:00,23.8,33.4,0.2280,1103.5,373.5,1785.7,632.0,707.7,2011-02-11,2011,2,11,16,0,0,4,0,1,0,0,6.737184,2.271853
1001,2011-02-11 17:00:00,24.1,31.4,0.2282,1114.7,408.9,1953.1,577.4,628.4,2011-02-11,2011,2,11,17,0,0,4,0,1,0,0,6.102033,2.322481
1002,2011-02-11 18:00:00,22.9,33.0,0.2281,1103.4,401.1,1860.1,638.2,642.1,2011-02-11,2011,2,11,18,0,0,4,0,2,0,0,5.774182,2.119967
1003,2011-02-11 19:00:00,25.1,33.0,0.2280,1091.6,408.9,1916.2,601.6,625.1,2011-02-11,2011,2,11,19,0,0,4,0,2,0,0,7.698932,2.491243


WINTER TEMPERATURE STATISTICS
-------------------------------------
Mean : 10.34
Standard deviation : 5.93
-------------------------------------
WINTER TEMPERATURE OUTLIER STATISTICS
-------------------------------------
Mean : 23.86
Standard deviation : 0.98
-------------------------------------


It seems that all these spikes in temperature during the winter months have relatively consistant temperature values.
This could point to these outliers being unnatural. Let's keep an eye on this region when looking at the other features.
If similar outliers aren't visible in other features, we may want to exclude these temperature outliers from our data
before training our predictive model.

#### Humidity and Dew Point Visualisation

Next, let's plot the absolute and relative humidity features, as well as the newly added dew point feature.

In [273]:
# get ranges
abs_humidity_range = df_all.absolute_humidity.value_counts().sort_index()
rel_humidity_range = df_all.relative_humidity.value_counts().sort_index()
dew_point_range = df_all.dew_point.value_counts().sort_index()

# create figure
fig = make_subplots(rows=2, cols=3)

# draw feature distributions
fig.add_trace(
    go.Scatter(
        x=abs_humidity_range.index,
        y=abs_humidity_range,
        line={'width': 1}
    ),
    row=1,
    col=1
)
fig.add_trace(
    go.Scatter(
        x=rel_humidity_range.index,
        y=rel_humidity_range,
        line={'width': 1}
    ),
    row=1,
    col=2
)
fig.add_trace(
    go.Scatter(
        x=dew_point_range.index,
        y=dew_point_range,
        line={'width': 1}
    ),
    row=1,
    col=3
)