# Univariate time series prediction 

This notebook is an example of how you can use observation data. We will try to forecast a univariate time series, the temperature at one ground station, using Recurrent Neural Networks (RNNs).

# Note

<font size="4.5">To use <span style="color:blue">**Cartopy**</span>, a library for plotting data on basemaps (see the cells below), you need to <span style="color:red">activate the internet connection</span> of this notebook: in edit mode, open the *Settings* section in the right-hand column and switch the *Internet* slider to **on**.</font>

In [None]:
import numpy as np
import pandas as pd
import time

import tensorflow as tf
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plt

# Map plotting library
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Input data files are available in the "../input/" directory.
# Any results you write to the current directory are saved as output.

## Quick data exploration
This first part will help you quickly explore the observation data used in this example. For more details about the full observation dataset, check the *open_ground_stations* notebook, or navigate to https://meteofrance.github.io/meteonet/data/ground-observations/. Here, we will use 3 years of temperature observations measured in the north-west quarter of France.

### Loading the data

In [None]:
param = 't'  # parameter to study, here the temperature 
zone = 'NW'  # zone to study 
path = '/kaggle/input/meteonet/NW_Ground_Stations/NW_Ground_Stations/NW_Ground_Stations_'  # path to the data 
cols = ['number_sta','lat','lon','date',param]    # columns we need in the array

# Load the three yearly files and stack them into a single DataFrame
df = pd.concat([pd.read_csv(path + year + '.csv', usecols=cols, parse_dates=['date'],
                            infer_datetime_format=True)
                for year in ['2016', '2017', '2018']], axis=0)

Let's take a look at our data!

In [None]:
display(df.head())
display(df.tail())

### Check the data availability 
We have 3 years of data, but we might need to check that the data is available during this entire period for a given station.  
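
As a complement to the per-station loop, the same check can be expressed as a single table of counts. The sketch below is only an illustration: it uses a tiny made-up DataFrame (the name `toy` and its values are assumptions, not MeteoNet data) with the same column names as `df`, and counts the non-missing temperature values per station and per year.

```python
import pandas as pd

# Tiny synthetic stand-in for the observation table (values are made up).
toy = pd.DataFrame({
    'number_sta': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2016-01-01', '2017-01-01', '2018-01-01',
                            '2016-01-01', '2017-01-01', '2018-01-01']),
    't': [280.0, 281.5, None, None, None, 279.0],
})

# Count the non-missing temperature values per station and per year.
# `count()` ignores NaN, so a 0 means "no usable data for that year".
availability = (
    toy.assign(year=toy['date'].dt.year)
       .groupby(['number_sta', 'year'])['t']
       .count()
       .unstack(fill_value=0)
)
print(availability)
```

A station is usable over the whole period only if none of its yearly counts is zero.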

In [None]:
for id_sta in np.unique(df['number_sta'])[0:3]:
    print('Station number:', id_sta)
    # Select the columns with a list (indexing with a set is not supported by recent pandas)
    uni_data = df[df['number_sta'] == id_sta][[param, 'date']]
    uni_data = uni_data[uni_data[param].notnull()]
    if not uni_data.empty:
        years = uni_data['date'].dt.year
        print('Parameter', param, 'available for the following years:', np.unique(years))
    else:
        print('Parameter', param, 'missing')

### Plot the map of temperature for a given date

In [None]:
date = '2016-01-01T06:00:00'
d_sub = df[df['date'] == date]
plt.figure()
plt.scatter(d_sub['lon'], d_sub['lat'], c=d_sub[param], cmap='jet')
plt.xlabel('Longitude (°E)')
plt.ylabel('Latitude (°N)')
plt.title(date)
plt.colorbar().set_label('Temperature (K)')
plt.show()

Let's add a basemap to our plot with Cartopy!

In [None]:
# Coordinates of studied area boundaries (in °N and °E)
lllat = 46.25  #lower left latitude
urlat = 51.896  #upper right latitude
lllon = -5.842  #lower left longitude
urlon = 2  #upper right longitude
extent = [lllon, urlon, lllat, urlat]

fig = plt.figure(figsize=(9,5))

# Select projection
ax = plt.axes(projection=ccrs.PlateCarree())

# Plot the data
plt.scatter(d_sub['lon'], d_sub['lat'], c=d_sub[param], cmap='jet')
plt.colorbar().set_label('Temperature (K)')
plt.title(date)

ax.coastlines(resolution='50m', linewidth=1)
ax.add_feature(cfeature.BORDERS.with_scale('50m'))

# Adjust the plot to the area we defined.
# /!\ This line crashes the Kaggle notebook and clears all the memory, so it is
# commented out and the plot is not fully adjusted to the data.
#ax.set_extent(extent)

plt.show()

If you wonder why some data points are in the middle of the sea, they were measured on small islands ;)

### Plot the evolution of the temperature at one station

In [None]:
number_sta = 29277001
uni_data = df[(df['number_sta'] == number_sta)][param]
uni_data.index = df[(df['number_sta'] == number_sta)]['date']

fig, ax = plt.subplots(figsize=(10,5))
# Draw on the axes created above so that the curve and its labels share one figure
uni_data.plot(ax=ax)
ax.set_ylabel('Temperature (K)')
ax.set_title('Temperature evolution of the station '+str(number_sta))
plt.show()

We can see that the temperature rises and falls with the seasons.

## Time series prediction

Let's come back to our example of univariate time series prediction! We will start by preprocessing the data.

In [None]:
# Replace NaN values with the mean of the series
# (np.nan_to_num returns a plain NumPy array, so the date index is dropped)
uni_data_mean = np.nanmean(uni_data)
uni_data = np.nan_to_num(uni_data, nan=uni_data_mean)

### Training and validation datasets

Let's start by splitting the dataset into a training and a validation dataset. 
It is also important to rescale the data before training our neural network. Standardization is a common way of doing this rescaling: subtract the mean of the feature and divide by its standard deviation. The cell below only computes the split index; the standardization itself is applied further down, once the input windows are built.
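
To make the rescaling concrete, here is a minimal sketch on made-up values (the array `series` and the index `split` are assumptions for illustration, not the station data). It standardizes with statistics computed on the training part only, and also shows the inverse transform, which is what you would use to bring a standardized prediction back to kelvin.

```python
import numpy as np

# Synthetic temperatures in kelvin (made-up values, for illustration only).
series = np.array([280.0, 282.0, 284.0, 286.0, 288.0, 290.0])
split = 4  # hypothetical split index: the first 4 values form the training set

train, val = series[:split], series[split:]

# The statistics come from the training part only, so that no information
# from the validation period leaks into the preprocessing.
mean, std = train.mean(), train.std()

train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

# Inverse transform: back from standardized units to kelvin.
val_back = val_scaled * std + mean

print(train_scaled)
print(val_scaled)
print(val_back)
```

Note that the validation values are not centred on zero after scaling: this is expected, since they are expressed relative to the training distribution.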

In [None]:
coeff_train = 0.7  #proportion of the dataset in the training set
TRAIN_SPLIT = round(uni_data.shape[0]*coeff_train)

Let's now create the data for the univariate model. The model takes as input an array of `input_size` consecutive values and uses this 'past' information to infer the maximum 'future' value of the parameter over the following period, which covers the next `target_size` values.

We give the model the last 24 hours of recorded temperature observations (`input_size` = 24 x 10 = 240: the data time step is 6 min, so there are 10 observations per hour) and it learns to predict the maximum temperature over the next 24 hours (`target_size` = 240).
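
Before applying this to 240-step windows, it may help to see the idea on a six-value series. The helper `make_windows` below is a hypothetical, simplified illustration (it is not the function used in this notebook, and its index handling differs slightly): each input holds `input_size` past values and each label is the maximum of the `target_size` values that follow.

```python
import numpy as np

def make_windows(series, input_size, target_size):
    """Return (inputs, labels): each input holds `input_size` past values,
    each label is the maximum of the following `target_size` values."""
    inputs, labels = [], []
    for i in range(input_size, len(series) - target_size + 1):
        inputs.append(series[i - input_size:i])
        labels.append(series[i:i + target_size].max())
    return np.array(inputs), np.array(labels)

# Made-up values, short enough to check by hand.
toy = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
x, y = make_windows(toy, input_size=3, target_size=2)

print(x)  # rows: [1, 3, 2] and [3, 2, 5]
print(y)  # labels: 5 (max of 5, 4) and 4 (max of 4, 0)
```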

In [None]:
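def univariate_data(x_data, start_index, end_index, input_size, target_size):
    """Build (window, label) pairs: each window holds `input_size` past values,
    each label is the maximum of the `target_size` values that follow."""
    data = []
    labels = []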
    
    start_index = start_index + input_size
    if end_index is None:
        end_index = len(x_data) - target_size

    for i in range(start_index, end_index):
        # Reshape data from (input_size,) to (input_size, 1)
        data.append(np.expand_dims(x_data[(i-input_size):i], axis=1))
        labels.append(np.max(x_data[i:i+target_size]))
    return np.array(data), np.array(labels)

Let's compute the training and validation set:

In [None]:
univariate_past_history = 240
univariate_future_target = 240

x_train,y_train = univariate_data(uni_data, 0,TRAIN_SPLIT,univariate_past_history,univariate_future_target)
x_val,y_val = univariate_data(uni_data, TRAIN_SPLIT, None,univariate_past_history,univariate_future_target)

Let's standardize the data:

In [None]:
x_train_mean = np.mean(x_train)
x_train_std = np.std(x_train)

y_train_mean = np.mean(y_train)
y_train_std = np.std(y_train)

# Apply the training statistics to both sets, so that the validation data
# does not influence the scaling
x_data_train = (x_train-x_train_mean)/x_train_std
x_data_val = (x_val-x_train_mean)/x_train_std
y_data_train = (y_train-y_train_mean)/y_train_std
y_data_val = (y_val-y_train_mean)/y_train_std

In [None]:
print('Single window of past history, first elements')
print(x_data_train[0][0:10])
print('\n Target', param, 'to predict')
print(y_data_train[0])

#### Single example of the preprocessed dataset

Now that the data has been created, let's take a look at a single example. The input window given to the network is shown in blue, and the network must predict the value marked by the red cross.

In [None]:
def create_time_steps(length):
    # Negative time steps: -length, ..., -1 (the past relative to the prediction)
    return list(range(-length, 0))

In [None]:
def show_plot(ind, plot_data, delta):
    labels = ['History : x', 'True Future : y']
    marker = ['.-', 'rx', 'go']
    time_steps = create_time_steps(plot_data[0].shape[0])
    if delta:
        future = delta
    else:
        future = 0

    plt.figure()  
    plt.title('Sample example '+str(ind))
    for i, x in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10,
               label=labels[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i])
    plt.legend()
    plt.xlim([time_steps[0], (future+5)*2])
    plt.xlabel('Time-Step')
    plt.ylabel('Standardized temperature')
    plt.show()

ind = 0
show_plot(ind,[x_data_train[ind], y_data_train[ind]], 0)

## Baseline

Before proceeding to train a model, let's first set a simple baseline. Given an input window, the baseline predicts that the maximum over the next 24 hours equals the maximum value of the last 24 hours.

In [None]:
def baseline(history):
    return np.max(history)

In [None]:
def show_baseline(ind, plot_data, delta):
    labels = ['History : x', 'True Future : y','Baseline : Max(History)']
    marker = ['.-', 'rx', 'go']
    time_steps = create_time_steps(plot_data[0].shape[0])
    if delta:
        future = delta
    else:
        future = 0

    plt.figure()  
    plt.title('Sample example '+str(ind))
    for i, x in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10,
               label=labels[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i])
    plt.legend()
    plt.xlim([time_steps[0], (future+5)*2])
    plt.xlabel('Time-Step')
    plt.ylabel('Standardized temperature')
    plt.show()

ind = 240
show_baseline(ind,[x_data_train[ind], y_data_train[ind], baseline(x_data_train[ind])], 0)

Let's see if you can beat this baseline using a recurrent neural network.

## Recurrent neural network

A Recurrent Neural Network (RNN) is a type of neural network well-suited to time series data. RNNs process a time series step by step, maintaining an internal state that summarizes the information they have seen so far. In this notebook, you will use a specialized RNN layer called Long Short-Term Memory (LSTM).
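
To illustrate what "maintaining an internal state" means, here is a minimal NumPy sketch of a plain (non-LSTM) recurrent layer. It is an assumption-laden toy: the weights are random and untrained, the names `w_x`, `w_h` and `rnn_forward` are made up for this example, and an LSTM adds gating mechanisms on top of this basic recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 4
# Randomly initialised weights (untrained, for illustration only).
w_x = rng.normal(size=(1, hidden_size))            # input  -> hidden
w_h = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_forward(sequence):
    """Run a plain RNN over a sequence of shape (time_steps, 1)."""
    h = np.zeros(hidden_size)  # internal state, empty before the first step
    for x_t in sequence:
        # The new state mixes the current input with the previous state.
        h = np.tanh(x_t @ w_x + h @ w_h + b)
    return h  # fixed-size summary of the whole sequence

# A short made-up univariate sequence: 10 time steps, 1 feature.
sequence = np.sin(np.linspace(0, 3, 10)).reshape(-1, 1)
final_state = rnn_forward(sequence)
print(final_state.shape)
```

Whatever the length of the input sequence, the final state has `hidden_size` values; a dense layer can then map it to a single prediction.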

First we set the seed to ensure the reproducibility of the experiment:

In [None]:
tf.random.set_seed(13)

Let's now use `tf.data` to shuffle, batch, and cache the dataset.

In [None]:
BATCH_SIZE = 256
BUFFER_SIZE = 10000

train_univariate = tf.data.Dataset.from_tensor_slices((x_data_train, y_data_train))
train_univariate = train_univariate.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()

val_univariate = tf.data.Dataset.from_tensor_slices((x_data_val, y_data_val))
val_univariate = val_univariate.batch(BATCH_SIZE).repeat()

The following visualisation should help you understand how the data is represented after batching:

![](https://www.tensorflow.org/tutorials/structured_data/images/time_series.png)

The LSTM layer needs to know the shape of its input, here `(time steps, features)`, which we take from the last two dimensions of `x_data_train`.

In [None]:
simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=x_data_train.shape[-2:]),
    tf.keras.layers.Dense(1)
])

simple_lstm_model.compile(optimizer='adam', loss='mae')

Let's make a sample prediction, to check the output of the model:

In [None]:
for x,y in val_univariate.take(1):
    print(simple_lstm_model.predict(x).shape)

Let's train the model now. 

In [None]:
EPOCHS = 5
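# Step counts must be integers; round up so that every sample is seen in each epoch
STEPS_PER_EPOCH = int(np.ceil(x_data_train.shape[0] / BATCH_SIZE))
VAL_STEPS = int(np.ceil(x_data_val.shape[0] / BATCH_SIZE))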

simple_lstm_model.fit(train_univariate, epochs=EPOCHS,
                      steps_per_epoch=STEPS_PER_EPOCH,
                      validation_data=val_univariate, validation_steps = VAL_STEPS)

### Predict using the simple LSTM model

Now that you have trained your simple LSTM, let's try and make a few predictions.

In [None]:
def show_result(index, plot_data, delta):
    labels = ['History : x', 'True Future : y', 'Prediction','Baseline : Max(History)']
    marker = ['.-', 'rx', 'go','go']
    colors = ['blue','red','green','darkblue']
    time_steps = create_time_steps(plot_data[0].shape[0])
    if delta:
        future = delta
    else:
        future = 0

    plt.figure()  
    plt.title('Sample example '+str(index))
    for i, x in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10,
               label=labels[i], color = colors[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i], color = colors[i])
    plt.legend()
    plt.xlim([time_steps[0], (future+5)*2])
    plt.xlabel('Time-Step')
    plt.ylabel('Standardized temperature')
    plt.show()

In [None]:
ind = 0
for x, y in val_univariate.take(3):
    # Compare the baseline and the LSTM on sample `ind` of each of the first 3 batches
    y_true = [y[ind].numpy()]
    y_pred = simple_lstm_model.predict(x)[ind]  # predict once, reuse below
    y_base = baseline(x[ind].numpy())
    print('mae baseline', mean_absolute_error(y_true, [y_base]))
    print('mae LSTM', mean_absolute_error(y_true, y_pred))
    show_result(ind, [x[ind].numpy(), y[ind].numpy(), y_pred, y_base], 0)

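On these few samples, the LSTM looks better than the baseline. Three samples are too few to conclude, though: a fair comparison needs the MAE over the whole validation set.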
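
The comparison above rests on three samples only. As a hedged sketch of a broader evaluation, the code below computes the baseline's MAE over every window of a validation set at once. It runs on synthetic data (a noisy daily cycle standing in for `x_data_val` and `y_data_val`, an assumption made so the example is self-contained), so the printed value says nothing about the real station.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the validation series (made-up data):
# a noisy daily cycle sampled 240 times per day, over 10 days.
steps = np.arange(240 * 10)
series = np.sin(2 * np.pi * steps / 240) + 0.1 * rng.normal(size=steps.size)

input_size = target_size = 240
starts = np.arange(input_size, series.size - target_size)

# Same layout as in the notebook: inputs of shape (n_windows, input_size, 1),
# labels holding the maximum over the following `target_size` values.
x_val = np.stack([series[i - input_size:i] for i in starts])[..., np.newaxis]
y_val = np.array([series[i:i + target_size].max() for i in starts])

# Baseline: the maximum of each input window, for all windows at once.
baseline_pred = x_val.max(axis=(1, 2))
baseline_mae = np.mean(np.abs(y_val - baseline_pred))
print('baseline MAE over', len(y_val), 'windows:', baseline_mae)

# With a trained Keras model (assumption: `model` was fitted as in this
# notebook), the matching metric would be:
# model_mae = np.mean(np.abs(y_val - model.predict(x_val).ravel()))
```

Both errors are then expressed in the same standardized units and computed on the same windows, which makes them directly comparable.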

Now that you have seen some basics, go ahead: play with the data and propose your own methods!