# Deep Learning Homework 3
**Pfeifer Dániel<br>
N65V6V**

Our goal is to create a Neural Network that can predict the average temperature on a given day in Budapest. I have used the minimum and maximum temperatures recorded from October 2nd, 2018 to October 25th, 2020 (yesterday), which is around two years' worth of data. The data can be found here: http://idojarasbudapest.hu/archivalt-idojaras

I have also compiled a database containing the data from the website, which can be found along with this Notebook on GitHub.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import datetime
from datetime import timedelta
from datetime import date
import calendar
import matplotlib.pyplot as plt
random_state = 5555

### 1. Data Preparation

**First I load in the database.**

In [2]:
weather_base = pd.read_excel("./weather.xlsx", header=None, names=['day','Tmax','Tmin','wind','precip'])
weather_base

Unnamed: 0,day,Tmax,Tmin,wind,precip
0,2018-10-02 00:00:00,14.3,6.8,6.3,0.8
1,kedd,,,,
2,2018-10-03 00:00:00,16.8,3.5,7.4,0.1
3,szerda,,,,
4,2018-10-04 00:00:00,15.7,5.7,4.3,0.0
...,...,...,...,...,...
1505,péntek,,,,
1506,2020-10-24 00:00:00,17.9,12.1,5.6,0.0
1507,szombat,,,,
1508,2020-10-25 00:00:00,13.3,12.9,2.3,4.5


**However, it's not quite in a usable format, so I'm transforming it.**
- I'm removing every second row (these only hold the Hungarian name of the weekday),
- adding an `id` column,
- adding a `month` variable that I'll later use as a predictor variable,
- and keeping only the `Tmin` and `Tmax` columns (omitting the `wind` and `precip` columns, though including them might have given a better prediction).

In [3]:
id_list = []
date_list = []
month_list = []
Tmin_list = []
Tmax_list = []
counter = 0
for i, row in weather_base.iterrows():
    if i % 2 == 0:
        id_list.append(counter)
        counter += 1
        date_list.append(row['day'])
        month_list.append(row['day'].month)
        Tmin_list.append(row['Tmin'])
        Tmax_list.append(row['Tmax'])
weather = pd.DataFrame(data={'id':id_list,'date':date_list,'month':month_list,
                             'Tmin':Tmin_list, 'Tmax':Tmax_list})
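
The same cleaning steps can probably also be written with pandas slicing instead of an explicit loop. The following is only a minimal sketch on a made-up four-row stand-in for the raw sheet (the `raw` and `clean` names are assumptions for this example, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the raw sheet: date rows alternate with weekday-name rows.
raw = pd.DataFrame({
    "day": [pd.Timestamp("2018-10-02"), "kedd",
            pd.Timestamp("2018-10-03"), "szerda"],
    "Tmax": [14.3, None, 16.8, None],
    "Tmin": [6.8, None, 3.5, None],
})

# Keep rows 0, 2, 4, ... (the rows that carry a date and measurements).
clean = raw.iloc[::2].reset_index(drop=True)
clean = clean.rename(columns={"day": "date"})
clean["date"] = pd.to_datetime(clean["date"])

# Derive the month and add a running id, then fix the column order.
clean["month"] = clean["date"].dt.month
clean.insert(0, "id", range(len(clean)))
clean = clean[["id", "date", "month", "Tmin", "Tmax"]]
print(clean)
```

This relies on the date rows sitting at even positions, which is the same assumption the loop above makes with `i % 2 == 0`.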

**Now it seems quite a bit more usable:**

**Note:** Our goal is to predict the **average** temperature, which is by definition $\frac{\text{Tmin}+\text{Tmax}}{2}$. This can easily be computed from our predictions later.

In [4]:
weather

Unnamed: 0,id,date,month,Tmin,Tmax
0,0,2018-10-02,10,6.8,14.3
1,1,2018-10-03,10,3.5,16.8
2,2,2018-10-04,10,5.7,15.7
3,3,2018-10-05,10,2.9,17.1
4,4,2018-10-06,10,4.0,20.3
...,...,...,...,...,...
750,750,2020-10-21,10,8.9,16.6
751,751,2020-10-22,10,10.3,15.7
752,752,2020-10-23,10,10.8,16.4
753,753,2020-10-24,10,12.1,17.9


### 2. Train and test sets

- Each row of the **predictor variables** will contain 4 consecutive rows of our original database, flattened out into $[\text{month}_1,\text{Tmin}_1,\text{Tmax}_1,\dots,\text{month}_4,\text{Tmin}_4,\text{Tmax}_4]$, which is exactly 12 values per row.
- Each row of the **output variables** will contain the lowest and highest temperatures of the following day: $[\text{Tmin}_5,\text{Tmax}_5]$.
- This way our predictor has access to the temperatures of the previous 4 days, and knows roughly what season we're currently in.

In [5]:
X = []
y = []
window = 4  # number of consecutive days used as predictors
for i in range(len(weather) - window):
    current_X_element = []
    for j in range(window):
        cur_row = weather.iloc[i + j]  # j-th day of the current window
        current_X_element.append(int(cur_row['month']))
        current_X_element.append(float(cur_row['Tmin']))
        current_X_element.append(float(cur_row['Tmax']))
    X.append(current_X_element)
    # The target is the day right after the window.
    cur_target_row = weather.iloc[i + window]
    y.append([float(cur_target_row['Tmin']),
              float(cur_target_row['Tmax'])])
X = np.array(X)
y = np.array(y)
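
The window construction described above can also be sketched with NumPy slicing. This is just an illustration on the first six days of the table; the `make_windows` helper and the `toy` array are assumptions made up for this example:

```python
import numpy as np

def make_windows(data, window=4):
    """Build (X, y) from rows of [month, Tmin, Tmax].

    X[i] is the flattened block of days i .. i+window-1,
    y[i] is [Tmin, Tmax] of day i+window.
    """
    data = np.asarray(data, dtype=float)
    n = len(data) - window
    X = np.array([data[i:i + window].ravel() for i in range(n)])
    y = data[window:, 1:]
    return X, y

# First six days of the cleaned table: [month, Tmin, Tmax].
toy = np.array([[10, 6.8, 14.3],
                [10, 3.5, 16.8],
                [10, 5.7, 15.7],
                [10, 2.9, 17.1],
                [10, 4.0, 20.3],
                [10, 12.7, 20.3]])

X_toy, y_toy = make_windows(toy)
print(X_toy.shape, y_toy.shape)
```

Six days give only two windows, because each window needs 4 input days plus 1 target day. In the same way, a table of $n$ days yields $n-4$ rows.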

Our predictor variables are the following:

In [6]:
X

array([[10. ,  6.8, 14.3, ..., 10. ,  2.9, 17.1],
       [10. ,  3.5, 16.8, ..., 10. ,  4. , 20.3],
       [10. ,  5.7, 15.7, ..., 10. , 12.7, 20.3],
       ...,
       [10. ,  8. , 14.1, ..., 10. , 10.3, 15.7],
       [10. ,  9.6, 14.8, ..., 10. , 10.8, 16.4],
       [10. ,  8.9, 16.6, ..., 10. , 12.1, 17.9]])

751 rows and 12 columns (as I've described earlier):

In [7]:
X.shape

(751, 12)

And our output variables are the following:

In [8]:
y

array([[ 4. , 20.3],
       [12.7, 20.3],
       [11.9, 19.6],
       ...,
       [10.8, 16.4],
       [12.1, 17.9],
       [12.9, 13.3]])

Also 751 rows and 2 columns (minimum and maximum temperatures on the given day):

In [9]:
y.shape

(751, 2)

I'll use `sklearn`'s `train_test_split` function to split these randomly into a train set (70%) and a test set (30%):

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state)
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (525, 12)
Shape of X_test:  (226, 12)
Shape of y_train:  (525, 2)
Shape of y_test:  (226, 2)
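
To see where the 525 and 226 come from, here is a rough sketch of a shuffled split written with plain NumPy. It is not `sklearn`'s implementation and will not reproduce the same row order; the `shuffled_split` helper is made up for this example, and rounding the test size up is an assumption chosen so that the sizes agree with the shapes printed above:

```python
import math
import numpy as np

def shuffled_split(n_samples, test_size=0.3, seed=5555):
    """Return (train_indices, test_indices) of a random split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)          # shuffled row indices
    n_test = math.ceil(test_size * n_samples)   # 0.3 * 751 = 225.3 -> 226
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(751)
print(len(train_idx), len(test_idx))
```

Because the rows are shuffled before splitting, neighbouring (and overlapping) windows can end up on different sides of the split.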


### 3. Building the Neural Network

I'm defining a **Dense Layer** which I'll use multiple times, with:
- L1 kernel regularization with a factor of 0.01, and
- ReLU activation

In [11]:
def dense_layer(output_size):
    return tf.keras.layers.Dense(output_size,
                                kernel_regularizer=tf.keras.regularizers.l1(0.01),
                                activation='relu')
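
As a rough illustration of what one such layer computes, here is a minimal NumPy sketch (not the Keras implementation). The helper names `dense_relu` and `l1_penalty`, as well as the random inputs and weights, are made up for this example:

```python
import numpy as np

def dense_relu(x, W, b):
    """Affine map followed by ReLU: max(0, xW + b)."""
    return np.maximum(0.0, x @ W + b)

def l1_penalty(W, rate=0.01):
    """L1 kernel penalty added to the loss: rate * sum(|W|)."""
    return rate * np.abs(W).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 12))   # 5 hypothetical samples with 12 features
W = rng.normal(size=(12, 12))  # kernel of a 12 -> 12 layer
b = np.zeros(12)               # bias

out = dense_relu(x, W, b)
print(out.shape, l1_penalty(W))
```

The penalty grows with the absolute size of the weights, so minimizing the total loss pushes individual weights toward zero.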

The actual **Neural Network** is made up of:
- simply 3 of these layers with an input and output size of 12,
- then a final layer with an input size of 12, and an output size of 2 nodes;
- it's using Keras's `Adam` optimizer,
- and the loss function is simply the Mean Squared Error.
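
To back up the claim that this is a very small model, the number of trainable parameters can be counted by hand. This is a small sketch, assuming the usual dense-layer layout of one weight per input-output pair plus one bias per output (the `dense_params` helper is made up for this example):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one dense layer."""
    return n_in * n_out + n_out

# (input size, output size) of the four layers described above.
layer_sizes = [(12, 12), (12, 12), (12, 12), (12, 2)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)
print(per_layer, total)
```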

In [12]:
model = keras.Sequential()
model.add(dense_layer(12))
model.add(dense_layer(12))
model.add(dense_layer(12))
model.add(tf.keras.layers.Dense(2))
model.compile(optimizer='Adam', loss='mean_squared_error')



Then I'm fitting the model to the previously defined `X_train` and `y_train` variables, with:
- a batch size of 20,
- and 100 epochs, which might seem like a lot, but in my testing the model was still learning after 70-80 epochs and did not seem to overfit. Besides, the model is very simple and runs pretty fast even with 100 epochs.

In [13]:
model.fit(X_train, y_train, batch_size=20, epochs=100)

Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x28433ae8848>
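
To put the batch size and epoch count into perspective, here is a quick back-of-the-envelope calculation of how many weight updates this training run performs. It assumes the last, smaller batch of each epoch is also used:

```python
import math

n_train = 525      # rows in X_train
batch_size = 20
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)           # batches per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size   # size of the final batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```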

### 4. Predictions

Then I'm using the `X_test` set to obtain the model's prediction. I've also printed out some rows of the predicted minimum and maximum temperatures.

**Note:** The test rows are in shuffled order, so neighbouring rows belong to unrelated days. Each individual prediction was still made from 4 consecutive days.

In [14]:
prediction = model.predict(X_test)
prediction[10:20]

array([[ 2.257655 ,  5.878769 ],
       [ 5.2861385, 12.333096 ],
       [ 8.784059 , 16.89118  ],
       [16.22849  , 23.702333 ],
       [17.982056 , 26.115002 ],
       [-3.214916 ,  1.6870147],
       [ 5.468858 , 13.4878025],
       [21.172699 , 31.098381 ],
       [ 4.9253774, 15.948999 ],
       [ 1.1899122,  5.98812  ]], dtype=float32)

Then I'm taking the mean of each row and comparing it to the actual values. Here are some examples of what the Neural Network has predicted:

In [15]:
prediction_means = np.array([np.mean(e) for e in prediction])
actual_means = np.array([np.mean(e) for e in y_test])
comparison = [(actual_means[i], prediction_means[i]) for i in range(len(actual_means))][10:20]
print('Actual average temperature | Predicted average temperature')
print('----------------------------------------------------------')
def spaces(l):
    return ''.join([' ' for k in range(l)])
for i in range(len(comparison)):
    c0 = comparison[i][0]
    c1 = comparison[i][1]
    print(spaces(18-len(str(c0))), c0, spaces(6), "|", spaces(6), c1)

Actual average temperature | Predicted average temperature
----------------------------------------------------------
                2.5        |        4.068212
               7.25        |        8.809617
               14.5        |        12.83762
               20.5        |        19.965412
              24.05        |        22.048529
               -3.5        |        -0.76395065
               10.8        |        9.478331
              22.25        |        26.13554
              12.15        |        10.437188
                5.2        |        3.5890162


On these samples the predictions stay within a few degrees of the actual values, so I'd say it works well for predicting the temperature on the following day.<br><br>
We can also take the average of the absolute differences between the actual and predicted values (mean absolute error) to get a feel for how much this predictor misses by:

In [16]:
print("Average temperature miss:", np.mean([abs(prediction_means[i] - actual_means[i]) for i in range(len(actual_means))]))

Average temperature miss: 1.7174537472229088
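
The same two steps (row means, then mean absolute error) can also be written with vectorized NumPy calls. As a small sketch, here they are applied only to the ten sample rows printed earlier, so the result differs from the value for the full test set:

```python
import numpy as np

# The ten predicted [Tmin, Tmax] rows printed above.
pred = np.array([[ 2.257655 ,  5.878769 ],
                 [ 5.2861385, 12.333096 ],
                 [ 8.784059 , 16.89118  ],
                 [16.22849  , 23.702333 ],
                 [17.982056 , 26.115002 ],
                 [-3.214916 ,  1.6870147],
                 [ 5.468858 , 13.4878025],
                 [21.172699 , 31.098381 ],
                 [ 4.9253774, 15.948999 ],
                 [ 1.1899122,  5.98812  ]])

# The matching actual average temperatures from the comparison table.
actual_means = np.array([2.5, 7.25, 14.5, 20.5, 24.05,
                         -3.5, 10.8, 22.25, 12.15, 5.2])

pred_means = pred.mean(axis=1)                   # (Tmin + Tmax) / 2 per row
mae = np.abs(pred_means - actual_means).mean()   # mean absolute error
print(pred_means, mae)
```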


### 5. Predicting multiple days ahead

I will use the simple model of:
1. Predict the minimum and maximum temperatures for the next day.
2. Add it to the known values of the model.
3. Repeat.

For this I'll need a fully ordered set of days and temperatures, so I'll use the original `X` and `y` variables rather than `X_train` and `y_train`, which only contain about 70% of the windows in our database.<br>
However, I'll still use the previous `model`.<br><br>
The following function keeps a continuously updated array of the 12 most recent input windows (the `last_12_days` variable, where each row holds the month, `Tmin` and `Tmax` of 4 consecutive days). It predicts the next day from the newest window, appends a new window that ends with this prediction, and drops the oldest row. It repeats this until it reaches the day we're looking for.

In [17]:
today = list(weather['date'])[-1]
def predict_temperatures(days_from_now):
    predictions = []
    day_needed = today + timedelta(days=days_from_now)
    last_12_days = X[-12:]  # the 12 most recent input windows (4 days each)
    for i in range(days_from_now):
        # Only the prediction made from the newest window is used.
        next_pred = model.predict(last_12_days)[-1]
        predictions.append(next_pred)
        # New window: drop the oldest day of the newest window, then append
        # the target day's month and the predicted [Tmin, Tmax].
        new_row = np.concatenate([last_12_days[-1][3:], [day_needed.month], next_pred])
        last_12_days = np.vstack([last_12_days[1:], new_row])
    return predictions
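
For comparison, the "predict, append, repeat" idea can also be sketched in a model-agnostic way. This is not the function above but a simplified variant under stated assumptions: it starts from the last 4 observed days, stores each predicted day with its own month, and takes any function with the same input and output shapes as `model.predict`. The `rolling_forecast` and `persistence` names are made up for this example, and `persistence` is only a stand-in that repeats the most recent day:

```python
from datetime import date, timedelta
import numpy as np

def rolling_forecast(history, last_date, n_days, predict_fn):
    """history: the last 4 observed days as rows of [month, Tmin, Tmax]."""
    window = [[float(v) for v in row] for row in history]
    forecasts = []
    for step in range(1, n_days + 1):
        target_day = last_date + timedelta(days=step)
        features = np.array(window).ravel()[np.newaxis, :]  # shape (1, 12)
        tmin, tmax = predict_fn(features)[0]
        forecasts.append((target_day, float(tmin), float(tmax)))
        # Slide the window: drop the oldest day, append the predicted one.
        window = window[1:] + [[float(target_day.month), float(tmin), float(tmax)]]
    return forecasts

def persistence(features):
    """Hypothetical stand-in for model.predict: repeat the latest day."""
    return features[:, -2:]

# October 22nd-25th, 2020, taken from the tables above.
history = [[10, 10.3, 15.7], [10, 10.8, 16.4], [10, 12.1, 17.9], [10, 12.9, 13.3]]
out = rolling_forecast(history, date(2020, 10, 25), 10, persistence)
for day, tmin, tmax in out:
    print(day, tmin, tmax)
```

With a trained network, `predict_fn` would be `model.predict`. Since every step feeds on earlier predictions, errors can accumulate as the horizon grows.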

Here is an example: These are the predicted temperatures of the following 10 days (October 26th - November 4th):

In [18]:
next10_pred = predict_temperatures(10)
print("    Day   | Predicted minimum temperature | Predicted maximum temperature")
print("-------------------------------------------------------------------------")
for r in range(len(next10_pred)):
    day = today + timedelta(days=r+1)
    m = calendar.month_abbr[day.month]
    d = day.day
    mint = next10_pred[r][0]
    maxt = next10_pred[r][1]
    print(" ", m, d, spaces(5-len(str(m))-len(str(d))), "|", spaces(9), mint, spaces(18-len(str(mint))), "|", spaces(9), maxt)

    Day   | Predicted minimum temperature | Predicted maximum temperature
-------------------------------------------------------------------------
  Oct 26  |           11.423736           |           17.370682
  Oct 27  |           10.653222           |           16.565948
  Oct 28  |           9.992475            |           16.134771
  Oct 29  |           9.569747            |           15.814251
  Oct 30  |           9.09143             |           15.583388
  Oct 31  |           8.722247            |           15.2616005
  Nov 1   |           8.313805            |           14.895863
  Nov 2   |           7.8520064           |           14.49543
  Nov 3   |           7.3704844           |           14.076249
  Nov 4   |           6.9357166           |           13.642221


And the following function predicts for a specific day:

In [19]:
def predict_faraway_day_temperature(year, month, day):
    d = date(year, month, day)
    today = list(weather['date'])[-1]
    delta = (d - date(today.year, today.month, today.day)).days
    return predict_temperatures(delta)[-1]
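
The only arithmetic in this function is the day difference. As a quick check, here is how many steps ahead the three requested dates are, assuming the last known day is October 25th, 2020 (the variable names are made up for this example):

```python
from datetime import date

last_known = date(2020, 10, 25)
targets = [date(2020, 10, 28), date(2020, 11, 3), date(2020, 11, 24)]

# Number of days between the last known day and each target day.
horizons = [(t - last_known).days for t in targets]
print(horizons)
```

The longest forecast therefore chains 30 predictions together, so it should be treated with more caution than the 3-day one.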

**Here are the actual predictions the exercise asked for:**

In [20]:
print("Predicted average temperatures in Budapest (°C):")
print("------------------------------------------------")
print("October 28th:", np.mean(predict_faraway_day_temperature(2020,10,28)))
print("November 3rd:", np.mean(predict_faraway_day_temperature(2020,11,3)))
print("November 24th:", np.mean(predict_faraway_day_temperature(2020,11,24)))

Predicted average temperatures in Budapest (°C):
------------------------------------------------
October 28th: 13.207807
November 3rd: 10.723367
November 24th: 1.6468673
