# Predict seasonal CO2 concentration

Time series prediction is a useful tool in many areas of physics and society.
Here we will use Keras to predict the seasonal change in CO2 concentration from our previous homework.
Note: here we look at the seasonal change, whereas in the previous homework we looked at the seasonally corrected outlook.

Loosely based on [https://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/]

Start by importing the libraries:

In [None]:
%pylab inline
import pandas as pd
import tensorflow as tf
from tensorflow import keras

This code was written for:
- tensorflow/keras 2.4
- Pandas version 1.2.3
- Python version 3.8.8

So check your versions first:

In [None]:
import sys
print("Python: {}".format(sys.version))
print("tensorflow: {}".format(tf.__version__))
print("keras: {}".format(keras.__version__))
print("pandas: {}".format(pd.__version__))

## Data input
Let's read in our CO2 data first.
We skip the header rows and set the decimal date as our index. We are only interested in the monthly average that is not seasonally corrected:

In [None]:
dfraw = pd.read_fwf('../DataAnalysis/FFT/co2_mm_mlo.txt',skiprows=71,index_col=2)

Let's filter out all missing values, which show up as non-positive entries:

In [None]:
df = dfraw.iloc[:,2][dfraw.iloc[:,2]>0]

In [None]:
df.plot()

For our model it is convenient to use normalized data, so we subtract the mean and divide by the standard deviation:

In [None]:
def normalize(data):
    data_mean = data.mean(axis=0)
    data_std = data.std(axis=0)
    return (data - data_mean) / data_std

In [None]:
normalize(df).plot()
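As a quick check of what `normalize` does, the following minimal sketch applies the same transformation to a few made-up values using only the standard library. The numbers are illustrative and not taken from the CO2 file, and the helper names `normalize_list` and `denormalize_list` are hypothetical. `statistics.stdev` is the sample standard deviation, which matches the pandas default for `Series.std`.

```python
import statistics

def normalize_list(values):
    """Z-score a list of floats: subtract the mean, divide by the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    return [(v - mean) / std for v in values], mean, std

def denormalize_list(scaled, mean, std):
    """Invert normalize_list to get back the original units."""
    return [s * std + mean for s in scaled]

# Made-up values in ppm, for illustration only
sample = [410.0, 412.0, 414.0, 411.0, 409.0]
scaled, mean, std = normalize_list(sample)
restored = denormalize_list(scaled, mean, std)

print(round(statistics.mean(scaled), 6))   # close to 0
print(round(statistics.stdev(scaled), 6))  # close to 1
```

Keeping the mean and standard deviation around is what makes it possible to convert model output back to ppm later.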

## Create Dataset

In the next step we split our dataset into a training set and a validation (test) set.
This helps us to validate the training against a known future.
You can experiment with the split ratio.

In [None]:
splitRatio = 0.67
train_size = int(len(df) * splitRatio)
test_size = len(df) - train_size
train = normalize(df).iloc[0:train_size].values
test = normalize(df).iloc[train_size:].values
print("length of training set: {},\nlength of validation set: {}".format(len(train), len(test)))
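Note that the cell above normalizes with the mean and standard deviation of the full series, so the training data indirectly sees statistics of the validation period. A common alternative, sketched here for a plain Python list with made-up values, computes the statistics from the training part only and reuses them for the validation part. The helper name `split_and_scale` is hypothetical.

```python
import statistics

def split_and_scale(series, split_ratio=0.67):
    """Split chronologically, then scale both parts with training statistics."""
    train_size = int(len(series) * split_ratio)
    train_raw, test_raw = series[:train_size], series[train_size:]
    mean = statistics.mean(train_raw)
    std = statistics.stdev(train_raw)
    train = [(v - mean) / std for v in train_raw]
    test = [(v - mean) / std for v in test_raw]
    return train, test

# 12 made-up, steadily rising values, for illustration only
series = [float(v) for v in range(300, 312)]
train, test = split_and_scale(series)
print(len(train), len(test))
```

For a series with a strong upward trend, the validation values then lie outside the range seen during training, which is worth keeping in mind when judging the test score.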

Now we create a dataset for our model.
Keras also has a built-in function for this,

`keras.preprocessing.timeseries_dataset_from_array`

[https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/timeseries_dataset_from_array]

but for educational purposes we do it ourselves.

In [None]:
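def create_dataset(dataset, look_back=1):
    """Build input windows of length look_back and the value following each window."""
    dataX, dataY = [], []
    # the trailing -1 leaves out the last possible window; the plotting offsets further down rely on this
    for i in range(len(dataset) - look_back - 1):
        # keep the whole window, not just its first element
        a = dataset[i:(i + look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back])
    return numpy.array(dataX), numpy.array(dataY)

What did we do here?
We use a *look_back* variable to define the number of months we want to use to predict the next month.
Each input sample is a window of *look_back* consecutive values, and its target is the value that follows the window. Stacking the windows gives an input array of shape (samples, *look_back*).
This gives us the data for our model.

Let's have a look at it:

In [None]:
x, y = create_dataset(df.iloc[0:10].values)
# with look_back=1 each window holds a single value, so flatten x for display
dd = pd.DataFrame([x.ravel(), y], index=['x', 'y'])
dd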

The second row (`y`) is the first row (`x`) shifted by *look_back* time steps, so each target is the observation that follows its input.
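To see the effect of a larger window, here is a small sketch of the same sliding-window idea on a toy list with `look_back=3`. It uses plain Python lists and a hypothetical helper name `make_windows`. Unlike `create_dataset`, it keeps the last possible sample.

```python
def make_windows(series, look_back=1):
    """Return (inputs, targets): each input holds look_back consecutive values,
    the target is the value that follows them."""
    inputs, targets = [], []
    for i in range(len(series) - look_back):
        inputs.append(series[i:i + look_back])
        targets.append(series[i + look_back])
    return inputs, targets

inputs, targets = make_windows([10, 20, 30, 40, 50, 60], look_back=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
# [30, 40, 50] -> 60
```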

Let's do it for our dataset:

In [None]:
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

## Create Model
Now we create our model.
What changes if you use fewer or more hidden layers?
You can also play around with the loss function and the optimizer to see the effects.
We use a batch size of 32, which is also the Keras default.

In [None]:
with tf.device("cpu:0"):
    model = keras.Sequential()
    # one hidden layer with 6 units, followed by a single linear output unit
    model.add(keras.layers.Dense(6, input_dim=look_back, activation='relu'))
    model.add(keras.layers.Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(trainX, trainY, epochs=15, batch_size=32, verbose=1)
    model.summary()
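When experimenting with the number of layers, it helps to know how the number of trainable parameters grows. The sketch below counts weights and biases for a stack of fully connected layers. `count_dense_params` is a hypothetical helper, not a Keras function. For the network above with `look_back = 1` (one input, 6 hidden units, 1 output) it gives 19, which should agree with the total reported by `model.summary()`.

```python
def count_dense_params(n_inputs, layer_units):
    """Count weights and biases of consecutive fully connected layers."""
    total = 0
    previous = n_inputs
    for units in layer_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total

print(count_dense_params(1, [6, 1]))     # 19
print(count_dense_params(1, [6, 6, 1]))  # 61
print(count_dense_params(12, [6, 1]))    # 85
```

The second call adds another hidden layer of 6 units, and the third keeps the original layers but uses a 12-month window as input.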

## Estimate model performance

In [None]:
with tf.device("cpu:0"):
    trainScore = model.evaluate(trainX, trainY, verbose=0)
    print("Train Score: {:.2f} MSE".format(trainScore))
    testScore = model.evaluate(testX, testY, verbose=0)
    print("Test Score: {:.2f} MSE".format(testScore))
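Because the model was trained on normalized data, these scores are mean squared errors in units of squared standard deviations. A rough conversion to an error in ppm takes the square root and multiplies by the standard deviation of the original series. The sketch below uses made-up numbers for both the score and the standard deviation, and `rmse_in_original_units` is a hypothetical helper.

```python
import math

def rmse_in_original_units(mse_normalized, data_std):
    """Convert an MSE on z-scored data to an RMSE in the original units."""
    return math.sqrt(mse_normalized) * data_std

# Hypothetical values, for illustration only
example_mse = 0.01
example_std = 25.0  # ppm
print(rmse_in_original_units(example_mse, example_std))
```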

## Prediction

Now we use the Keras predictions to check the model against our data.
Is that the expected outcome?

In [None]:
with tf.device("cpu:0"):
    # generate predictions for the training and validation sets
    trainPredict = model.predict(trainX)
    testPredict = model.predict(testX)
    # shift train predictions for plotting; positions without a prediction stay NaN
    trainPredictPlot = np.full(train_size, np.nan)
    trainPredictPlot[look_back:len(trainPredict)+look_back] = trainPredict[:,0]
    # shift test predictions for plotting
    testPredictPlot = np.full(len(df), np.nan)
    testPredictPlot[len(trainPredict)+(look_back*2)+1:len(df)-1] = testPredict[:,0]
    # plot baseline and predictions
    plt.plot(normalize(df).values)
    plt.plot(trainPredictPlot)
    plt.plot(testPredictPlot)
    plt.ylim(-2, 2)
    plt.show()
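With `look_back = 1` the network only sees the previous month, so a useful reference point is the naive persistence forecast, which simply predicts that next month equals this month. The sketch below computes its mean squared error on a made-up series. If the trained network scores about the same on the real data, it has likely learned little beyond copying its input. `persistence_mse` is a hypothetical helper.

```python
def persistence_mse(series):
    """MSE of predicting each value by the value immediately before it."""
    errors = [(series[i + 1] - series[i]) ** 2 for i in range(len(series) - 1)]
    return sum(errors) / len(errors)

# Made-up normalized values, for illustration only
toy = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
print(persistence_mse(toy))  # 0.25
```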
    