## What is Deep Learning?

Deep learning, a subset of machine learning, uses artificial neural networks to mimic human decision-making. These networks consist of layers of interconnected nodes, or "neurons", that process input data and generate output.

Deep learning is especially effective with large, unstructured datasets, and although it demands substantial computational power, it can produce highly accurate models. Tools like TensorFlow and Keras have made deep learning more accessible by simplifying the design, training, and deployment of models. A comprehensive explanation of the field is beyond this page's scope. For those interested in diving deeper, [IBM](https://www.ibm.com/topics/deep-learning) offers an excellent explanation for data-savvy readers that doesn't get lost in the weeds.


The critical elements of deep learning leveraged in the following models are neural networks, weights and biases, activation functions, backpropagation, and gradient descent. In the models below, I will make specific note of when each of these elements is being used.

**Neural Networks:** These are inspired by the human brain's structure, with networks transforming input data through layers of interconnected nodes. Each layer refines the input it receives, gradually building up complex patterns and associations much like our brain does when processing information.

**Weights and Biases:** These are parameters that are fine-tuned during the learning process. They determine the significance of inputs and play a pivotal role in the accuracy of the model's predictions. Think of them as the factors that determine how much importance should be given to each input when making a prediction.

**Activation Functions:** These functions dictate whether a neuron should be activated based on its inputs. They're like the gatekeepers of information that decide whether the input they receive is relevant enough to be passed on to the next layer.

**Backpropagation and Gradient Descent:** These are techniques used to adjust weights and biases and minimize the error function. They are the backbone of learning in neural networks, allowing the model to learn from its mistakes and improve over time.
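To make the four elements above concrete, here is a minimal sketch of a single neuron trained by hand. It is not part of the Keras models below: the toy data, the learning rate, and the function names `train_neuron` and `mse` are assumptions chosen for illustration, and it uses plain Python so it runs without any dependencies.

```python
import math

def train_neuron(xs, ys, lr=0.1, epochs=2000):
    """Fit y ~ tanh(w*x + b) by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # weight and bias, started at zero (arbitrary choice)
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            pred = math.tanh(w * x + b)  # forward pass through the activation
            # Backpropagation (chain rule): dL/dz = 2*(pred - y) * (1 - pred^2)
            dz = 2.0 * (pred - y) * (1.0 - pred ** 2)
            grad_w += dz * x / n
            grad_b += dz / n
        w -= lr * grad_w  # gradient descent step on the weight
        b -= lr * grad_b  # gradient descent step on the bias
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the neuron's predictions."""
    return sum((math.tanh(w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated from a known neuron (w=1.5, b=-0.5), purely illustrative
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [math.tanh(1.5 * x - 0.5) for x in xs]

w, b = train_neuron(xs, ys)
print(f"loss before: {mse(xs, ys, 0.0, 0.0):.4f}, loss after: {mse(xs, ys, w, b):.6f}")
```

The recurrent layers used later apply the same idea, with many more weights and with gradients propagated back through the time steps of each input window.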

Using the same univariate time-series data as in the ARMA/ARIMA models, I implemented and trained three neural network models with Keras: a simple Recurrent Neural Network (RNN), a Gated Recurrent Unit (GRU) network, and a Long Short-Term Memory (LSTM) network. Part of the data was held out for testing, and a further share of the training samples was reserved for validation, so that the models are evaluated on data they were not fit to. To mitigate the risk of overfitting, I incorporated regularization into each model's architecture. The trained models are then used to predict future CO2 concentrations.

![](images\deep-learning.png)

In [1]:
import pandas as pd
import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, GRU
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

from tensorflow.keras import regularizers

import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
pio.renderers.default = "plotly_mimetype+notebook_connected"

## Import and Prepare Data

In [2]:
df = pd.read_csv('co2.csv')[['year','month','average']]
print(df)

     year  month  average
0    1958      3   315.70
1    1958      4   317.45
2    1958      5   317.51
3    1958      6   317.24
4    1958      7   315.86
..    ...    ...      ...
773  2022      8   417.19
774  2022      9   415.95
775  2022     10   415.78
776  2022     11   417.51
777  2022     12   418.95

[778 rows x 3 columns]


### Convert features to Numpy Array

In [3]:
X = np.array(df["average"].values.astype('float32')).reshape(df.shape[0],1)
print(X.shape)

(778, 1)


### Visualize Raw Data

In [4]:
def plotly_line_plot(t,y,title="Plot", x_label="Months since March 1958", y_label="CO2 Concentration"):
    fig = px.line(x=t[0],y=y[0], title=title, render_mode='SVG')  
    for i in range(1,len(y)):
        if len(t[i])==1: 
            fig.add_scatter(x=t[i],y=y[i])
        else: 
            fig.add_scatter(x=t[i],y=y[i], mode='lines')
    fig.update_layout(xaxis_title=x_label, yaxis_title=y_label, template="plotly_white", showlegend=False)
    fig.show()

In [5]:
plotly_line_plot([[*range(0,len(X))]],[X[:,0]],title="Atmospheric CO2 Concentration since March 1958")

### Perform Train-Test Split

In [6]:
def get_train_test(data, split_percent=0.8):
    # Scale to [0, 1]; note that the scaler is fit on the full series before splitting
    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data).flatten()
    split = int(len(data)*split_percent)  # index of the first test observation
    train_data = data[:split]
    test_data = data[split:]
    return train_data, test_data, data

In [7]:
train_data, test_data, data = get_train_test(X)
print('Train Data:',train_data.shape)
print('Test Data:',test_data.shape)

Train Data: (622,)
Test Data: (156,)
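Because `get_train_test` fits the scaler on the whole series, the minimum and maximum of the test period influence how the training data is scaled. A common alternative is to learn the scaling bounds from the training portion only. The sketch below illustrates that idea in plain Python on a made-up upward-trending series; the helper names `fit_min_max` and `apply_min_max` are hypothetical and this is not the approach used in this notebook.

```python
def fit_min_max(values):
    """Return (min, max) learned from the given values only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned bounds; results may fall outside [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative upward-trending series, not the real CO2 data
series = [310.0 + 0.5 * i for i in range(20)]
split = int(len(series) * 0.8)
train_raw, test_raw = series[:split], series[split:]

lo, hi = fit_min_max(train_raw)  # bounds come from the training data only
train_scaled = apply_min_max(train_raw, lo, hi)
test_scaled = apply_min_max(test_raw, lo, hi)

print(max(train_scaled))  # 1.0
print(max(test_scaled) > 1.0)  # True: the test values exceed the training range
```

With scikit-learn, the equivalent would be calling `scaler.fit` on the training data and `scaler.transform` on the test data. For a trending series like this one, the scaled test values then exceed 1, which is expected.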


### Visualize Train-Test Split

In [8]:
t1=[*range(0,len(train_data))]
t2=len(train_data)+np.array([*range(0,len(test_data))])
plotly_line_plot([t1,t2],[train_data,test_data],title="Atmospheric CO2 Concentration since March 1958: Train-Test Split")

### Re-format Data for Deep Learning 

In [9]:
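def get_XY(dat, time_steps, plot_data_partition=False):
    # Targets: every time_steps-th observation, starting at index time_steps
    Y_ind = np.arange(time_steps, len(dat), time_steps)
    Y = dat[Y_ind]

    # Inputs: the time_steps-1 observations that precede each target
    rows_x = len(Y)
    X_ind = [*range(time_steps*rows_x)]
    del X_ind[::time_steps]  # drop indices 0, time_steps, 2*time_steps, ...
    X = dat[X_ind]

    if plot_data_partition:
        plt.figure(figsize=(15, 6), dpi=80)
        plt.plot(Y_ind, Y, 'o', X_ind, X, '-')
        plt.show()

    # Reshape to (samples, time steps, features), as Keras recurrent layers expect
    X1 = np.reshape(X, (rows_x, time_steps-1, 1))

    return X1, Y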

In [10]:
p=12 # seasonal lag

testX, testY = get_XY(test_data, p)
trainX, trainY = get_XY(train_data, p)

print('Train Data:',trainX.shape,trainY.shape)
print('Test Data:',testX.shape,testY.shape)

Train Data: (51, 11, 1) (51,)
Test Data: (12, 11, 1) (12,)
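The indexing inside `get_XY` is easier to follow on a tiny series. The sketch below mirrors that indexing in plain Python: the series is cut into blocks of `time_steps` points, the first point of each block is skipped, the next `time_steps - 1` points are the inputs, and the point right after them is the target. The function name `make_windows` and the toy series are assumptions for illustration.

```python
def make_windows(series, time_steps):
    """Return (input indices, target index) pairs using the same layout as get_XY."""
    target_idx = list(range(time_steps, len(series), time_steps))
    windows = []
    for r, t in enumerate(target_idx):
        start = r * time_steps + 1          # index r*time_steps itself is skipped
        input_idx = list(range(start, t))   # time_steps - 1 consecutive points
        windows.append((input_idx, t))
    return windows

toy = list(range(10))  # values equal their indices, for readability
for input_idx, t in make_windows(toy, time_steps=4):
    print(input_idx, "->", t)
# [1, 2, 3] -> 4
# [5, 6, 7] -> 8

# Same sample counts as the shapes printed above
print(len(make_windows(list(range(622)), 12)))  # 51
print(len(make_windows(list(range(156)), 12)))  # 12
```

With `p=12`, each sample therefore holds 11 consecutive monthly values and the target is the following month, with targets spaced 12 months apart.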


### Visualize 

In [11]:
tmp1=[]  # x-coordinates of each trace
tmp2=[]  # y-values of each trace
count=0

for i in range(0,trainX.shape[0]):
    n_inputs = trainX[i,:,0].shape[0]
    tmp1.append(count+np.array([*range(0,n_inputs)]))  # input window (drawn as a line)
    tmp1.append([count+n_inputs])                      # target (drawn as a single point)
    tmp2.append(trainX[i,:,0])
    tmp2.append([trainY[i]])
    count+=n_inputs+1

plotly_line_plot(tmp1,tmp2,title="Atmospheric CO2 Concentration since March 1958: Training Points")

## Create Models

### Model and Training Parameters 

In [12]:
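recurrent_hidden_units=3  # units in each recurrent layer
epochs=60
f_batch=0.2               # batch size as a fraction of the training samples
optimizer="RMSprop"
validation_split=0.2      # fraction of training samples held out for validation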

### 3 Models: LSTM, SimpleRNN, and GRU 

In [13]:
mod_lstm = Sequential()
mod_lstm.add(
    LSTM(
        recurrent_hidden_units,
        return_sequences=False,
        input_shape=(trainX.shape[1],trainX.shape[2]), 
        recurrent_dropout=0.8,
        recurrent_regularizer=regularizers.L2(1e-1),
        activation='tanh'
    )
) 
     
mod_lstm.add(Dense(units=1, activation='linear'))
mod_lstm.compile(loss='MeanSquaredError', optimizer=optimizer)
mod_lstm.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 3)                 60        
                                                                 
 dense (Dense)               (None, 1)                 4         
                                                                 
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________


In [14]:
mod_srnn = Sequential()
mod_srnn.add(
    SimpleRNN(
        recurrent_hidden_units,
        return_sequences=False,
        input_shape=(trainX.shape[1],trainX.shape[2]), 
        recurrent_dropout=0.8,
        recurrent_regularizer=regularizers.L2(1e-1),
        activation='tanh'
    )
) 
     
mod_srnn.add(Dense(units=1, activation='linear'))
mod_srnn.compile(loss='MeanSquaredError', optimizer=optimizer)
mod_srnn.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 3)                 15        
                                                                 
 dense_1 (Dense)             (None, 1)                 4         
                                                                 
Total params: 19
Trainable params: 19
Non-trainable params: 0
_________________________________________________________________


In [15]:
mod_gru = Sequential()
mod_gru.add(
    GRU(
        recurrent_hidden_units,
        return_sequences=False,
        input_shape=(trainX.shape[1],trainX.shape[2]), 
        recurrent_dropout=0.8,
        recurrent_regularizer=regularizers.L2(1e-1),
        activation='tanh'
    )
) 
     
mod_gru.add(Dense(units=1, activation='linear'))
mod_gru.compile(loss='MeanSquaredError', optimizer=optimizer)
mod_gru.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru (GRU)                   (None, 3)                 54        
                                                                 
 dense_2 (Dense)             (None, 1)                 4         
                                                                 
Total params: 58
Trainable params: 58
Non-trainable params: 0
_________________________________________________________________
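The parameter counts in the three summaries above can be reproduced by hand, which is a useful check on what the weights and biases of each layer actually are. The sketch below uses the standard per-gate counting; the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate. The function names are hypothetical.

```python
def simple_rnn_params(units, input_dim):
    # Input weights + recurrent weights + one bias per unit
    return units * (input_dim + units + 1)

def lstm_params(units, input_dim):
    # Four weight sets (input, forget, and output gates plus the cell candidate),
    # each with the same shape as a SimpleRNN layer
    return 4 * simple_rnn_params(units, input_dim)

def gru_params(units, input_dim):
    # Three weight sets (update and reset gates plus the candidate state),
    # each with two bias vectors under the assumed reset_after=True default
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(units, input_dim):
    # One weight per input plus one bias, for each output unit
    return units * (input_dim + 1)

units, input_dim = 3, 1  # recurrent_hidden_units and one feature per time step
print(simple_rnn_params(units, input_dim))  # 15
print(lstm_params(units, input_dim))        # 60
print(gru_params(units, input_dim))         # 54
print(dense_params(1, units))               # 4
```

The number of time steps (11) does not appear in any of these formulas because a recurrent layer reuses the same weights at every step.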


## Train Models

In [16]:
models = {
    'LSTM' : mod_lstm,
    'SRNN' : mod_srnn,
    'GRU' : mod_gru
}

histories = {
    'LSTM' : None,
    'SRNN' : None,
    'GRU' : None
}

for mod in models.keys():
    model = models[mod]
    history = model.fit(
        trainX, trainY, 
        epochs=epochs, 
        batch_size=int(f_batch*trainX.shape[0]), 
        validation_split=validation_split,
        verbose=0
    )
    histories[mod] = history

## Compare Models

In [17]:
train_predictions = {}
test_predictions = {}

for mod in models.keys():
    model = models[mod]
    train_predict = model.predict(trainX).squeeze()
    test_predict = model.predict(testX).squeeze()
    train_predictions[mod] = train_predict
    test_predictions[mod] = test_predict



### Compute RMSE

In [18]:
def calculate_rsme(model_name, trainY, testY, train_predict, test_predict):
    print(f'\n{model_name}:\n')
    train_rmse = np.sqrt(mean_squared_error(trainY, train_predict))
    test_rmse = np.sqrt(mean_squared_error(testY, test_predict))

    print('Train MSE = %.5f. RMSE = %.5f' % (train_rmse**2.0,train_rmse))
    print('Test MSE = %.5f. RMSE = %.5f' % (test_rmse**2.0,test_rmse))

In [19]:
calculate_rsme('LSTM', trainY, testY, train_predictions['LSTM'], test_predictions['LSTM'])


LSTM:

Train MSE = 0.00021. RMSE = 0.01450
Test MSE = 0.01208. RMSE = 0.10992


In [20]:
calculate_rsme('SRNN', trainY, testY, train_predictions['SRNN'], test_predictions['SRNN'])


SRNN:

Train MSE = 0.01732. RMSE = 0.13161
Test MSE = 0.22555. RMSE = 0.47493


In [21]:
calculate_rsme('GRU', trainY, testY, train_predictions['GRU'], test_predictions['GRU'])


GRU:

Train MSE = 0.00170. RMSE = 0.04129
Test MSE = 0.02943. RMSE = 0.17155


### Loss by Epoch

In [22]:
def loss_by_epoch(model_name, epochs_steps, history):
    plotly_line_plot(
        [epochs_steps, epochs_steps],
        [history.history['loss'],
         history.history['val_loss']],
        title=f"{model_name}: Loss by Training Epoch",
        x_label="Training Epochs",
        y_label="Loss (MSE)"
    )

In [23]:
epochs_steps = [*range(0, len(histories['LSTM'].history['loss']))]
loss_by_epoch('LSTM', epochs_steps, histories['LSTM'])

In [24]:
epochs_steps = [*range(0, len(histories['SRNN'].history['loss']))]
loss_by_epoch('SRNN', epochs_steps, histories['SRNN'])

In [25]:
epochs_steps = [*range(0, len(histories['GRU'].history['loss']))]
loss_by_epoch('GRU', epochs_steps, histories['GRU'])

### Parity Plot

In [26]:
def parity_plot(model_name, trainY, testY, train_predict, test_predict):
    fig = px.scatter(x=trainY,y=train_predict,height=600,width=800)
    fig.add_scatter(x=testY,y=test_predict,mode="markers")
    fig.add_scatter(x=trainY,y=trainY, mode='lines')

    fig.update_layout(
        title=f"{model_name}: Predicted vs True Y-Values",
        xaxis_title="True Y-Value",
        yaxis_title="Predicted Y-Value",
        template="plotly_white",
        showlegend=False
    )

    fig.show()

In [27]:
parity_plot('LSTM', trainY, testY, train_predictions['LSTM'], test_predictions['LSTM'])

In [28]:
parity_plot('SRNN', trainY, testY, train_predictions['SRNN'], test_predictions['SRNN'])

In [29]:
parity_plot('GRU', trainY, testY, train_predictions['GRU'], test_predictions['GRU'])

### Prediction Result

In [30]:
def plot_result(model_name, trainY, testY, train_predict, test_predict):
    train_len = np.arange(len(trainY))
    test_len = np.arange(len(trainY), len(trainY) + len(testY))

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train_len, y=trainY,mode='lines+markers',name='Training Data'))
    fig.add_trace(go.Scatter(x=test_len, y=testY,mode='lines+markers',name='Test Data'))
    fig.add_trace(go.Scatter(x=train_len, y=train_predict,mode='lines+markers',name='Training Predictions'))
    fig.add_trace(go.Scatter(x=test_len, y=test_predict,mode='lines+markers',name='Test Predictions'))

    fig.update_layout(title_text=f'{model_name}: Actual and Predicted Values',
                      xaxis_title='Observation Number',
                      yaxis_title='CO2 Concentration (Scaled)',
                      template="plotly_white")
    fig.show()

In [31]:
plot_result('LSTM', trainY, testY, train_predictions['LSTM'], test_predictions['LSTM'])

In [32]:
plot_result('SRNN', trainY, testY, train_predictions['SRNN'], test_predictions['SRNN'])

In [33]:
plot_result('GRU', trainY, testY, train_predictions['GRU'], test_predictions['GRU'])

## Analysis Questions

### How do the results from the 3 different ANN models compare with each other in terms of accuracy and predictive power?

In this section, we went on the arduous journey of creating three different Artificial Neural Networks (ANNs): a Simple Recurrent Neural Network (SRNN), a Long Short-Term Memory (LSTM) network, and a Gated Recurrent Unit (GRU) network. Our objective was to model and predict the atmospheric $CO_2$ concentration over the next decade. The neural nets harness the autocorrelation of the time-series data by arranging the observations around the 12-month seasonal lag: each sample feeds 11 consecutive monthly values to the network, and the target is the value in the month that follows. Because the targets are spaced 12 months apart, every target within a dataset falls in the same calendar month (e.g., May), so the models effectively predict that month's CO2 concentration year after year. This strategy delivered predictions of the atmospheric CO2 level for the next ten years, extending until 2033.

Comparing the performance of these models gives mixed results. Performance varied notably across different runs of the code, with either the LSTM or the GRU model showing adequate to good performance. The two were typically close on any given run, while the SRNN consistently trailed behind. In the final run, the LSTM was the best model (test RMSE of 0.110 on the scaled data), with the GRU close behind (0.172) and the SRNN a distant third (0.475). This ranking held across every evaluation method: Root Mean Squared Error (RMSE), loss by epoch, parity plots, and prediction results. However, this was just the final run; in other iterations, the GRU model matched the LSTM model. Because training is stochastic (random weight initialization and dropout), the 'best' model can vary from one run to the next.

### What effect does including regularization have on your results?

Regularization had a subtle yet noticeable impact on the precision of our models. As mentioned earlier, the accuracy of the models varied considerably across runs, which makes it hard to quantify the influence of regularization definitively. Generally speaking, however, the models' performance seemed to dip slightly when regularization was omitted.

I surmise that the modest effect of regularization can be attributed to the relatively stable scale of the data throughout the time series. In March 1958, the atmospheric CO2 concentration was about 316 ppm, and it rose to about 419 ppm by December 2022, an increase of roughly 33%. While this increase is alarmingly significant from an environmental perspective, it is not a drastic enough shift in scale to amplify the impact of regularization in our models.

Regularization tends to matter most when a model has enough capacity to overfit or when the inputs differ widely in scale. Here the series was min-max scaled to [0, 1] and each model has only three recurrent units, so the muted effect is understandable. Nonetheless, its influence, albeit small, contributed to a slight overall improvement in the performance of our models.

### How far into the future can the deep learning model accurately predict the future?

Determining precisely how far ahead our deep learning models can predict accurately is challenging, due in part to the fluctuations in model performance across code runs. A general observation is that the predictions tend to be internally consistent: if a model predicts values correctly up to a certain point, it is likely, though not guaranteed, to continue doing so. In the case of the LSTM model, the training predictions mirrored the actual values up to a specific point, beyond which the model began to underestimate the true values.

This underestimation could be attributed to the accelerating growth of the atmospheric CO2 concentration. The model seems to predict a linear continuation of the trend, but reality has outpaced that linear progression. This acceleration is a grave concern for climate researchers and has been the focus of numerous studies. So while the predictions are consistent with their own trend, that does not make them accurate against real-world data. The LSTM model illustrates this: minor inaccuracies accumulate over time, and the model eventually falls short in predicting the CO2 concentration towards the end of the test period. It serves as a reminder that these models must be periodically recalibrated and retrained to adapt to changing trends in our ever-dynamic climate.

### How does your deep learning modeling compare to the traditional single-variable time-series ARMA/ARIMA models from HW-3? 

The deep learning models, specifically the LSTM and GRU models, significantly outperformed the traditional single-variable time-series ARMA/ARIMA models in predicting atmospheric CO2 concentrations. Initially, I applied the ARMA/ARIMA models to SF6 concentration data, as it did not exhibit seasonality. However, upon revisiting the analysis with CO2 concentration data, the deep learning models demonstrated superior performance.

The best CO2 models using the ARMA/ARIMA methods achieved an RMSE of 0.6, while the LSTM model had an RMSE of 0.11 and the GRU model an RMSE of 0.17 on the test data. One caveat: the deep learning RMSEs are computed on min-max scaled values, so the comparison is like-for-like only if the ARMA/ARIMA RMSE was measured on the same scale. Training both ARIMA and deep learning models is a complex and challenging task, and alternative hyperparameter configurations for either method could well yield more accurate predictions. Nonetheless, based on the models we have, the deep learning models appear substantially more effective at modeling CO2 data and making accurate predictions.
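The deep learning RMSEs reported here are measured on min-max scaled values (the result plots label the axis "CO2 Concentration (Scaled)"). To compare them with an RMSE expressed in ppm, they can be converted back to the original units. The sketch below shows the relationship in plain Python. The ppm values and the scaler bounds `lo` and `hi` are made up for illustration and are not taken from the notebook's data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def min_max_scale(values, lo, hi):
    """Scale values to [0, 1] given the bounds lo and hi."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values in ppm, for illustration only
actual_ppm = [410.0, 412.0, 415.0, 418.0]
predicted_ppm = [408.0, 411.0, 411.0, 413.0]
lo, hi = 315.0, 420.0  # assumed scaler bounds, not the real data minimum and maximum

rmse_ppm = rmse(actual_ppm, predicted_ppm)
rmse_scaled = rmse(min_max_scale(actual_ppm, lo, hi),
                   min_max_scale(predicted_ppm, lo, hi))

# Min-max scaling divides every error by (hi - lo), so the RMSE scales the same way
print(rmse_ppm, rmse_scaled * (hi - lo))
```

In the notebook itself, one way to do this would be to return the fitted `MinMaxScaler` from `get_train_test` and call its `inverse_transform` method on the predictions before computing the RMSE.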

### Compare your models (use RMSE) and forecasts from these sections with your Deep Learning Models: ARIMA/SARIMA/VAR

When comparing the deep learning models with the SARIMA model, the outcomes echo those of the ARIMA comparison. The deep learning models, specifically LSTM and GRU, substantially outperform the traditional time-series models in modeling and predicting CO2 concentrations. While the SARIMA model did improve on ARIMA, its best RMSE was 0.4, compared with 0.11 for the LSTM model and 0.17 for the GRU model on the test data. As with the ARIMA model, the SARIMA model's performance may have been hindered by the lack of extensive hyperparameter tuning, so it is hard to state definitively which model is superior without further optimization. Nonetheless, the preliminary results indicate that the deep learning models outperform the traditional time-series models.

In the VAR section, I employed data from four greenhouse gases to construct a multivariate model, aiming to predict future concentrations of all four gases concurrently. I did my best to implement a deep learning model for this task but encountered challenges during the data wrangling stage. As Professor James suggested, the most challenging aspect of the deep learning process (and data science in general) is ensuring the data is appropriately prepared and in a format that TensorFlow accepts. I am still working on the multivariate deep learning model and aim to have it ready for the final deliverable. Comparing just the CO2 predictions from the VAR model, we find the same result as with the SARIMA model: the VAR model's best RMSE for CO2 is 0.4, well above the RMSEs of the deep learning models.