<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 12th exercise: <font color="#C70039">Deep Learning Basics: Preprocessing, Encoding and Initial Setup</font>
* Course: <a href="https://www.gernotheisenberg.de/time_series_forecasting.html">Time Series Forecasting (TSF)</a>
* Lecturer: <a href="https://www.gernotheisenberg.de/uebermich.html">Gernot Heisenberg</a>
* Date:   23.03.2025

<img src="./images/DL1.jpg" style="float: center;" width="450">

---------------------------------
**GENERAL NOTE 1**: 
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells and the code comments also explain how things work together as a whole. 

**GENERAL NOTE 2**: 
* When commenting source code, please use English only. 
* When describing an observation, please use English, too.
* This applies to all exercises throughout this course.  

---------------------

### <font color="ce33ff">DESCRIPTION OF THE NOTEBOOK CONTENT</font>:
This notebook teaches you the first steps, including data preprocessing and especially data encoding, when planning to forecast a time series with Deep Learning approaches. 

-------------------------------------------------------------------------------------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points. 
If a task is more challenging and consists of several steps, this is indicated as well. 
Make sure you have worked through the task list and documented what you did. 
This should be done by using markdown.<br> 
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date. 
    * set the date too and remove mine.
3. read the entire notebook carefully 
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time.
    * understand the output
4. Prepare a data set for DL and perform preprocessing
    * Download data set from the UCI Machine Learning Repository:
        * https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data.
    * Read the data and plot the target
    * Remove unnecessary columns
    * Identify whether there is daily seasonality and encode the time accordingly 
    * Split your data into training, validation and testing sets.
    * Scale the data using MinMaxScaler.
    * Save the train, validation and test sets to be used later.
-----------------------------------------------------------------------------------

# PART I - Data Preprocessing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
df = pd.read_csv('./data/DL/Metro_Interstate_Traffic_Volume.csv')
df.head()

In [None]:
df['date_time'] = pd.to_datetime(df['date_time'])

In [None]:
df = df.drop_duplicates(subset='date_time', ignore_index=True)

In [None]:
df.shape

In [None]:
date_range = list(pd.date_range('2012-10-02 09:00:00', '2018-09-30 23:00:00', freq='H'))
print(len(date_range))

In [None]:
new_df = pd.DataFrame({'date_time': date_range})
new_df.head()

In [None]:
full_df = pd.merge(new_df, df, how='left', on='date_time')

In [None]:
full_df.head()

In [None]:
full_df.isna().sum()

In [None]:
fig, ax = plt.subplots(figsize=(13,6))

ax.plot(full_df.traffic_volume)
ax.set_xlabel('Date')
ax.set_ylabel('Traffic volume')

fig.autofmt_xdate()
plt.tight_layout()

In [None]:
full_df[35000:].isna().sum()

In [None]:
full_df = full_df[35000:].reset_index(drop=True)

In [None]:
full_df.head()

In [None]:
full_df = full_df.drop(['holiday', 'weather_main', 'weather_description'], axis=1)
full_df.shape

In [None]:
full_df = full_df.fillna(full_df.groupby(full_df.date_time.dt.hour).transform('median'))

In [None]:
full_df.isna().sum()

In [None]:
fig, ax = plt.subplots(figsize=(13,6))

ax.plot(full_df.traffic_volume)
ax.set_xlabel('Date')
ax.set_ylabel('Traffic volume')

fig.autofmt_xdate()
plt.tight_layout()

In [None]:
full_df.to_csv('./data/DL/metro_interstate_traffic_volume_preprocessed.csv', index=False, header=True)

# PART II - Data Encodings

In [None]:
# load all remaining libs that have not been loaded in the first import section
import datetime

import seaborn as sns
import tensorflow as tf

from tensorflow.keras import Model, Sequential

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError

from tensorflow.keras.layers import Dense, Conv1D, LSTM, Lambda, Reshape, RNN, LSTMCell

In [None]:
# useful settings
plt.rcParams['figure.figsize'] = (10, 7.5)
plt.rcParams['axes.grid'] = False

In [None]:
tf.random.set_seed(42)
np.random.seed(42)

In [None]:
df = pd.read_csv('./data/DL/metro_interstate_traffic_volume_preprocessed.csv')
df.head()

In [None]:
df.tail()

In [None]:
df.shape

#### Visualization section
Visualize the evolution of the traffic volume over time. 
Since the dataset is very large, with more than 17,000 records, plot only the first 400 data points,
which is roughly equivalent to two weeks of data.

In [None]:
fig, ax = plt.subplots()

ax.plot(df['traffic_volume'])
ax.set_xlabel('Time')
ax.set_ylabel('Traffic volume')

plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.xlim(0, 400)# plot the first 400 data points only 

fig.autofmt_xdate()
plt.tight_layout()

Notice the clear daily seasonality: the traffic volume is lower at the start and end of each day.
The traffic volume is also lower on weekends. 

As for the trend, two weeks of data (0:400) is likely insufficient to draw a reasonable conclusion but it seems that the volume is neither increasing nor decreasing
over time in the figure.

Also plot the hourly temperature, as it will be a target for the multi-output models. Here, we expect to see both yearly and daily seasonality.

In [None]:
fig, ax = plt.subplots()

ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')

plt.xticks([2239, 10999], [2017, 2018])

fig.autofmt_xdate()
plt.tight_layout()

Now visualize only the first two weeks of the temperature, as was done for the traffic volume.

In [None]:
fig, ax = plt.subplots()

ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')

plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.xlim(0, 400) # two week again

fig.autofmt_xdate()
plt.tight_layout()


The yearly seasonality in the plot (upper one) should be due to the seasons in the year, while the daily seasonality (lower one) will be due to the fact that temperatures tend to be lower at night and higher during the day, although the data is a bit noisy.

#### Feature engineering and data splitting

Use the `describe` method to get a good overview of the features.

In [None]:
df.describe().transpose()

##### Remove unusable features

From the output, you’ll notice that rain_1h is mostly 0 throughout the dataset, as its third quartile is still at 0. Since at least 75% of the values for rain_1h are 0, it is unlikely that it is a strong predictor of traffic volume. Thus, this feature will be removed. 

Looking at snow_1h, you’ll notice that this variable is at 0 through the entire dataset. This is easily observable, since its minimum and maximum values are both 0.
Thus, this is not predictive of the variation in traffic volume over time. This feature will also be removed from the dataset.

In [None]:
cols_to_drop = ['rain_1h', 'snow_1h']
df = df.drop(cols_to_drop, axis=1)

df.shape
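
The quartile-based reasoning above can also be checked programmatically. The following is a minimal sketch, assuming a hypothetical helper `mostly_constant_columns` and a small toy DataFrame (neither is part of the original notebook): it flags every numeric column whose minimum equals its third quartile, which is the same criterion used above for `rain_1h` and `snow_1h`.

```python
import pandas as pd


def mostly_constant_columns(frame: pd.DataFrame, quantile: float = 0.75) -> list:
    """Return numeric columns whose minimum equals the given quantile.

    Hypothetical helper: such columns hold the same value for at least
    `quantile` of all rows and are candidates for removal.
    """
    numeric = frame.select_dtypes(include="number")
    mins = numeric.min()
    quantiles = numeric.quantile(quantile)
    return [col for col in numeric.columns if mins[col] == quantiles[col]]


# Toy data for illustration only (not the real traffic dataset)
toy = pd.DataFrame({
    "temp": [280.0, 285.5, 290.1, 275.3, 288.8, 292.4, 281.0, 279.9],
    "rain_1h": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5],
    "snow_1h": [0.0] * 8,
})

flagged = mostly_constant_columns(toy)
print(flagged)
```

A flagged column is only a candidate for removal; whether it is actually uninformative still depends on the target and should be judged case by case.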

##### Encoding of the time

Right now, the date_time feature is not usable by the models, since it is a datetime string. Thus, convert it into a numerical value.
A simple way to do that is to express the date as a number of seconds. This is achieved with the `timestamp` method of the `datetime` class.

<font color = red>NOTE:</font>
However, this leads us to losing the cyclical nature of time, because the number of seconds simply increases linearly with time.

Therefore, we must apply a transformation to recover the cyclical behavior of time. A simple way to do that is to apply a sine transformation. We know that the
sine function is cyclical, bounded between –1 and 1. This will help us regain part of the cyclical property of time.

However, we first need to confirm the seasonal cycle in the data. For this purpose, we visualize the power spectrum by means of a Fast Fourier Transform (FFT). The FFT maps the time series into the frequency domain, and the plot shows the amplitude of each frequency component over its frequency.

In [None]:
timestamp_s = pd.to_datetime(df['date_time']).map(datetime.datetime.timestamp)

In [None]:
fft = tf.signal.rfft(df['traffic_volume'])
f_per_dataset = np.arange(0, len(fft))

n_sample_h = len(df['traffic_volume'])
hours_per_week = 24 * 7
weeks_per_dataset = n_sample_h / hours_per_week

f_per_week = f_per_dataset / weeks_per_dataset

plt.step(f_per_week, np.abs(fft))
plt.xscale('log')
plt.xticks([1, 7], ['1/week', '1/day'])
plt.xlabel('Frequency (cycles per week, log scale)', color ='r')
plt.ylabel('Amplitude', color ='r')
plt.tight_layout()
plt.show()

The plot shows the amplitude of the weekly and daily seasonality in our target. The weekly peak is lower than the daily peak. 
Therefore, daily seasonality is the dominant cycle in our target.
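
To see how such a spectrum identifies the dominant cycle, here is a minimal sketch on a synthetic hourly signal. The signal, its amplitudes and the variable names are assumptions made for illustration only, and it uses `np.fft.rfft` instead of `tf.signal.rfft` to keep the example light.

```python
import numpy as np

hours_per_week = 24 * 7
n_weeks = 8
t = np.arange(n_weeks * hours_per_week)  # one sample per hour

# Synthetic signal (assumption): strong daily cycle plus a weaker weekly cycle
signal = (3.0 * np.sin(2 * np.pi * t / 24)
          + 1.0 * np.sin(2 * np.pi * t / hours_per_week))

# Amplitude spectrum; bin k corresponds to k cycles over the whole series
amplitude = np.abs(np.fft.rfft(signal))
cycles_per_week = np.arange(len(amplitude)) / n_weeks

# Skip bin 0 (the mean of the series) when searching for the peak
peak_bin = int(np.argmax(amplitude[1:])) + 1
dominant_cycles_per_week = cycles_per_week[peak_bin]

print(dominant_cycles_per_week)
```

Seven cycles per week correspond to one cycle per day, which is why the daily tick in the plot sits at position 7 on the frequency axis.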

#### Applying the sine / cosine encoding

With a single sine transformation, we regain some of the cyclical property that was lost when converting to seconds. 
However, a single sine value is ambiguous: counting from midnight, 12 a.m. and 12 p.m. both map to 0, and 5 a.m. and 7 a.m. map to the same value. 
This is undesired, as we want every time of day to have a distinct encoding. Thus, we’ll also apply a cosine transformation. The cosine is
a quarter of a cycle out of phase with the sine, so the pair (sine, cosine) is unique for each time of day. This allows us to distinguish, for example, between 12 a.m. and 12 p.m., 
expressing the cyclical nature of time in a day. After that, we can remove the date_time column from the DataFrame.
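
The effect of adding the cosine can be verified on the 24 full hours of a day. This is a minimal sketch, assuming time is counted in seconds from midnight (the notebook itself uses full timestamps, so the absolute phase there may differ); the variable names are illustrative.

```python
import numpy as np

seconds_per_day = 24 * 60 * 60
hours = np.arange(24)
seconds = hours * 60 * 60  # seconds since midnight (assumption)

angle = 2 * np.pi * seconds / seconds_per_day
# Round to suppress floating-point noise; adding 0.0 turns -0.0 into 0.0
day_sin = np.round(np.sin(angle), 6) + 0.0
day_cos = np.round(np.cos(angle), 6) + 0.0

# Sine alone: several hours share the same value
n_unique_sin = len(np.unique(day_sin))

# Sine and cosine together: every hour gets its own pair
n_unique_pairs = len({(s, c) for s, c in zip(day_sin, day_cos)})

print(n_unique_sin, n_unique_pairs)
print(day_sin[0], day_sin[12])  # midnight and noon have the same sine
print(day_cos[0], day_cos[12])  # but different cosines
```

Geometrically, the pair places each time of day on the unit circle, so 23:00 and 00:00 end up close together, which a plain hour number cannot express.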

In [None]:
# The timestamp is in seconds, so we must calculate the number of seconds in a day
# before applying the sine/cosine transformation.
day = 24 * 60 * 60

df['day_sin'] = (np.sin(timestamp_s * (2*np.pi/day))).values
df['day_cos'] = (np.cos(timestamp_s * (2*np.pi/day))).values

In [None]:
df = df.drop(['date_time'], axis=1)

df.head()

In [None]:
df.sample(50).plot.scatter('day_sin','day_cos').set_aspect('equal');
plt.tight_layout()

In [None]:
# for comparison: the original encoding as a timestamp in seconds, which grows linearly
fig, ax = plt.subplots()

ax.plot(timestamp_s)
ax.set_xlabel('Time')
ax.set_ylabel('Number of seconds')

plt.xticks([2239, 10999], [2017, 2018])

fig.autofmt_xdate()
plt.tight_layout()

##### Data split (train, val, test)

In [None]:
n = len(df)

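# Split 70:20:10 (train:validation:test), in chronological order without shuffling.
# .copy() makes each split an independent DataFrame, so the scaling step below
# does not write into a slice of df.
train_df = df[0:int(n*0.7)].copy()
val_df = df[int(n*0.7):int(n*0.9)].copy()
test_df = df[int(n*0.9):].copy()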

train_df.shape, val_df.shape, test_df.shape
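
The same split can be wrapped in a small reusable function. This is a minimal sketch, assuming a hypothetical helper `chronological_split` and a toy DataFrame; it mirrors the 70:20:10 boundaries used above and keeps the rows in time order, since shuffling would leak future values into the training set.

```python
import pandas as pd


def chronological_split(frame: pd.DataFrame,
                        train_end: float = 0.7,
                        val_end: float = 0.9):
    """Split a time-ordered DataFrame into train, validation and test parts.

    Hypothetical helper: `train_end` and `val_end` are cumulative fractions,
    so the defaults give a 70:20:10 split without shuffling.
    """
    n = len(frame)
    i_train = int(n * train_end)
    i_val = int(n * val_end)
    return (frame.iloc[:i_train].copy(),
            frame.iloc[i_train:i_val].copy(),
            frame.iloc[i_val:].copy())


# Toy data for illustration only
toy = pd.DataFrame({"value": range(100)})
train, val, test = chronological_split(toy)

print(len(train), len(val), len(test))
```

Because the parts do not overlap and follow each other in time, the validation and test sets always lie in the "future" of the training set.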

##### Scale all features to be between 0 and 1

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_df)

train_df[train_df.columns] = scaler.transform(train_df[train_df.columns])
val_df[val_df.columns] = scaler.transform(val_df[val_df.columns])
test_df[test_df.columns] = scaler.transform(test_df[test_df.columns])
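
The scaler is fitted on the training set only, so that no information from the validation or test set leaks into training. The following minimal sketch reproduces the min-max formula by hand with pandas on toy values (an illustration of the idea under these assumptions, not scikit-learn's implementation). It also shows a consequence: validation or test values can fall outside the range 0 to 1 when they exceed the range seen during training.

```python
import pandas as pd

# Toy data for illustration only
train = pd.DataFrame({"temp": [270.0, 280.0, 290.0],
                      "traffic_volume": [100.0, 300.0, 500.0]})
val = pd.DataFrame({"temp": [295.0],
                    "traffic_volume": [200.0]})

# Statistics come from the training set only
col_min = train.min()
col_range = train.max() - train.min()

# Min-max scaling: (x - min) / (max - min), applied column by column
train_scaled = (train - col_min) / col_range
val_scaled = (val - col_min) / col_range

print(train_scaled)
print(val_scaled)
```

Here the validation temperature of 295 K lies above the training maximum of 290 K and is therefore scaled to a value greater than 1.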

In [None]:
train_df.describe().transpose()

In [None]:
train_df.to_csv('./data/DL/train.csv')
val_df.to_csv('./data/DL/val.csv')
test_df.to_csv('./data/DL/test.csv')

In [None]:
test_df