## Assignment 3 - Weather forecasting

In preparation, I extracted the full dataset available at http://idojarasbudapest.hu/archivalt-idojaras into *weather_data.csv* and did a little feature engineering in LibreOffice Calc. I removed the columns related to wind and rain, and merged the t_min and t_max columns into a single column by averaging them. I also split the date column into year, month and day parts (so the model may be able to learn which years were warmer, which months are cold, etc.).
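The same preparation could also be done in pandas instead of a spreadsheet. The sketch below is only an illustration: the raw column names (`date`, `t_min`, `t_max`, `wind`, `rain`), the helper `engineer_features` and the sample values are assumptions, not the actual layout of the downloaded archive.

```python
import pandas as pd

# Hypothetical raw layout with made-up sample values; the real archive
# may use different column names and formats.
raw = pd.DataFrame(
    {
        "date": ["2011-09-29", "2011-09-30", "2011-10-01"],
        "t_min": [12.0, 11.0, 12.0],
        "t_max": [25.0, 24.0, 24.0],
        "wind": [3.0, 2.0, 4.0],
        "rain": [0.0, 0.0, 0.0],
    }
)


def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop wind/rain, average t_min and t_max, split the date into parts."""
    dates = pd.to_datetime(raw["date"])
    return pd.DataFrame(
        {
            "YEAR": dates.dt.year,
            "MONTH": dates.dt.month,
            "DAY": dates.dt.day,
            # Daily temperature as the mean of the minimum and maximum.
            "TEMPERATURE": (raw["t_min"] + raw["t_max"]) / 2,
        }
    )


features = engineer_features(raw)
print(features)
```

Building the result from scratch, rather than deleting columns in place, means the wind and rain columns are dropped implicitly.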

In [1]:
import numpy as np
import pandas as pd

np.random.seed(123456789)

We read the .csv with pandas and check the column types. For some reason **YEAR** was read as float, so I converted it back to integer. This makes no difference to the model, but it makes the preview a little nicer: no decimal points appear in the **YEAR** column when calling *tail()* or *head()*.

In [2]:
df = pd.read_csv("weather_data.csv")

In [3]:
df.dtypes

YEAR           float64
MONTH            int64
DAY              int64
TEMPERATURE    float64
dtype: object

In [4]:
df["YEAR"] = df["YEAR"].astype(np.int64)

In [5]:
df.head()

| | YEAR | MONTH | DAY | TEMPERATURE |
|---|---|---|---|---|
| 0 | 2011 | 9 | 29 | 18.5 |
| 1 | 2011 | 9 | 30 | 17.5 |
| 2 | 2011 | 10 | 1 | 18.0 |
| 3 | 2011 | 10 | 2 | 18.5 |
| 4 | 2011 | 10 | 3 | 17.0 |


In [6]:
df.tail()

| | YEAR | MONTH | DAY | TEMPERATURE |
|---|---|---|---|---|
| 3296 | 2020 | 10 | 19 | 11.1 |
| 3297 | 2020 | 10 | 20 | 12.2 |
| 3298 | 2020 | 10 | 21 | 12.8 |
| 3299 | 2020 | 10 | 22 | 13.0 |
| 3300 | 2020 | 10 | 23 | 13.6 |


We split the DataFrame returned by read_csv into a feature array **X** and a label array **y**.  

**df_X** still carries the column headers, so by extracting *values* (a NumPy ndarray) we can easily get rid of them. I did the same with **df_y**.

I reshaped **y** into 2D as I prefer dealing with 2D arrays.

In [7]:
df_X = df[["YEAR", "MONTH", "DAY"]]
df_y = df["TEMPERATURE"]

X = df_X.values
y = df_y.values
y = y.reshape((-1, 1))
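Beyond personal preference, keeping **y** two-dimensional also avoids a NumPy broadcasting pitfall when it is later compared with predictions of shape `(n, 1)`, which is what a network ending in a single-unit Dense layer typically returns. The small sketch below uses made-up numbers to show the difference.

```python
import numpy as np

# Made-up values, only to illustrate the shapes involved.
y_true_1d = np.array([18.5, 17.5, 18.0])          # shape (3,)
y_true_2d = y_true_1d.reshape((-1, 1))            # shape (3, 1)
y_pred = np.array([[18.0], [18.0], [18.0]])       # shape (3, 1), like a predict() result

# Matching shapes: element-wise difference, one error per day.
good = np.abs(y_pred - y_true_2d)

# Mismatched shapes: (3, 1) - (3,) broadcasts to a (3, 3) matrix,
# so every prediction is compared with every label.
bad = np.abs(y_pred - y_true_1d)

print(good.shape, good.sum())
print(bad.shape, bad.sum())
```

With the 1D labels the summed error is silently inflated, since each label is counted once per prediction.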

In [8]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

I scaled the features in the following step. I experimented with MinMaxScaler and StandardScaler and eventually chose StandardScaler, even though both performed equally well. I also split the data into a training set and a test set.

In [9]:
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, random_state=123456789)
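Note that the cell above fits the scaler on all rows before splitting. A common alternative is to split first and compute the scaling statistics from the training rows only, so that nothing about the test set leaks into preprocessing. For plain calendar features the difference is probably small. The sketch below shows the idea with NumPy only, on made-up rows. It uses the same formula as StandardScaler (subtract the mean, divide by the population standard deviation), and all names ending in `_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature rows in the notebook's [YEAR, MONTH, DAY] layout.
X_demo = np.array(
    [
        [2011, 9, 29],
        [2012, 1, 15],
        [2015, 6, 1],
        [2018, 12, 24],
        [2020, 10, 23],
    ],
    dtype=float,
)

# Split first: four rows for training, one held out.
idx = rng.permutation(len(X_demo))
train_idx, test_idx = idx[:4], idx[4:]

# Statistics come from the training rows only.
mean = X_demo[train_idx].mean(axis=0)
std = X_demo[train_idx].std(axis=0)  # ddof=0, the convention StandardScaler uses

# The same statistics are applied to both subsets.
X_train_demo = (X_demo[train_idx] - mean) / std
X_test_demo = (X_demo[test_idx] - mean) / std

print(X_train_demo.mean(axis=0), X_train_demo.std(axis=0))
print(X_test_demo)
```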

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

While creating the model I used Dense layers with ReLU activations. For the larger hidden layers I enabled dropout regularization. I also use the ModelCheckpoint callback to save the weights of the best model (the one with the lowest validation loss).

I considered using EarlyStopping here as well, but given the low number of epochs it seemed unnecessary.

In [11]:
model = Sequential(
    [
        Dense(units=32, activation="relu", input_shape=(X.shape[1],)),
        Dense(units=128, activation="relu"),
        Dropout(0.25),
        Dense(units=256, activation="relu"),
        Dropout(0.5),
        Dense(units=128, activation="relu"),
        Dropout(0.25),
        Dense(units=64, activation="relu"),
        Dense(units=32, activation="relu"),
        Dense(units=y.shape[1], activation="relu")
    ]
)

#es = EarlyStopping(monitor="val_loss", patience=25)
mc = ModelCheckpoint("best_model", monitor="val_loss", save_weights_only=True, save_best_only=True, verbose=1)
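One consequence of the architecture above is worth noting: the output layer also uses ReLU, so the network as defined cannot predict a temperature below 0 °C. Whether this matters depends on whether the dataset contains sub-zero daily means, which is not checked here. If it does, a linear output (for example `activation="linear"` on the last Dense layer) would be one option. The sketch below only demonstrates the clipping effect with NumPy, on hypothetical pre-activation values.

```python
import numpy as np


def relu(z):
    """ReLU as applied by the output layer: negative values become 0."""
    return np.maximum(z, 0.0)


# Hypothetical raw outputs of the last layer before the activation.
pre_activation = np.array([-3.2, -0.5, 0.0, 4.2, 10.4])

with_relu = relu(pre_activation)   # sub-zero values are clipped to 0
with_linear = pre_activation       # a linear output would pass them through

print(with_relu)
print(with_linear)
```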

I use RMSprop as the optimizer and MSE as the loss function. I train the model with the parameters that can be seen in the fit call. After the training ends I load the best weights back into **model**.

In [12]:
model.compile(optimizer=RMSprop(), loss="mse", metrics=["mse"])
model.fit(X_train, y_train, epochs=50, batch_size=8, verbose=2, validation_split=0.15, shuffle=True, callbacks=[mc])
model.load_weights("best_model")

Epoch 1/50

Epoch 00001: val_loss improved from inf to 22.24934, saving model to best_model
333/333 - 1s - loss: 91.2583 - mse: 91.2583 - val_loss: 22.2493 - val_mse: 22.2493
Epoch 2/50

Epoch 00002: val_loss improved from 22.24934 to 18.84693, saving model to best_model
333/333 - 0s - loss: 23.3958 - mse: 23.3958 - val_loss: 18.8469 - val_mse: 18.8469
Epoch 3/50

Epoch 00003: val_loss did not improve from 18.84693
333/333 - 0s - loss: 20.5763 - mse: 20.5763 - val_loss: 27.0613 - val_mse: 27.0613
Epoch 4/50

Epoch 00004: val_loss improved from 18.84693 to 14.73520, saving model to best_model
333/333 - 0s - loss: 19.5095 - mse: 19.5095 - val_loss: 14.7352 - val_mse: 14.7352
Epoch 5/50

Epoch 00005: val_loss did not improve from 14.73520
333/333 - 0s - loss: 19.5253 - mse: 19.5253 - val_loss: 16.1400 - val_mse: 16.1400
Epoch 6/50

Epoch 00006: val_loss did not improve from 14.73520
333/333 - 0s - loss: 18.6319 - mse: 18.6319 - val_loss: 15.4813 - val_mse: 15.4813
Epoch 7/50

Epoch 00007:

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fe1d06efc40>
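The bookkeeping behind "save best only" can be mimicked in a few lines of plain Python. The sketch below is a simplified illustration, not the Keras implementation: it replays the validation losses of the first six epochs shown in the log above and also shows how a patience-based early-stopping rule would behave on them. The function names are hypothetical.

```python
# Validation losses of epochs 1-6, copied from the training log.
val_losses = [22.2493, 18.8469, 27.0613, 14.7352, 16.1400, 15.4813]


def best_epoch(losses):
    """Return (epoch, loss) of the lowest validation loss seen so far."""
    best_value = float("inf")
    best_index = None
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_value:  # this is where a checkpoint would be saved
            best_value, best_index = loss, epoch
    return best_index, best_value


def early_stop_epoch(losses, patience):
    """Return the epoch at which a patience rule would stop, or None."""
    best_value = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_value:
            best_value, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


print(best_epoch(val_losses))                     # (4, 14.7352)
print(early_stop_epoch(val_losses, patience=2))   # 6
print(early_stop_epoch(val_losses, patience=25))  # None
```

With the patience of 25 from the commented-out callback, nothing would have stopped within these epochs, which is consistent with the remark that early stopping was not really needed for a short run.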

Just to measure how well the model performs on the test set, I calculate the summed absolute error.

In [13]:
y_preds = model.predict(X_test)
print("For", len(X_test), "days, the summed absolute error is", np.sum(np.abs(y_preds - y_test)))

For 166 days, the summed absolute error is 477.7634516477585
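The summed error grows with the size of the test set, so dividing it by the number of days gives a figure that is easier to interpret. The short calculation below uses only the two numbers printed above.

```python
# Values taken from the printed output above.
n_days = 166
summed_abs_error = 477.7634516477585

# Mean absolute error: the average miss per predicted day, in °C.
mae = summed_abs_error / n_days
print(round(mae, 2))  # 2.88
```

In other words, the predictions on the test set are off by roughly 2.9 °C per day on average.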


## Predictions

**October 28.**

In [14]:
y_oct28 = model.predict(scaler.transform([[2020, 10, 28]]))
print(*y_oct28[0], "°C")

10.433494 °C


**November 3.**

In [15]:
y_nov3 = model.predict(scaler.transform([[2020, 11, 3]]))
print(*y_nov3[0], "°C")

8.182936 °C


**November 24.**

In [16]:
y_nov24 = model.predict(scaler.transform([[2020, 11, 24]]))
print(*y_nov24[0], "°C")

4.1978006 °C
