Module 3 Assignment - Part 4: Use TF/Keras to Create both an RNN and an LSTM and Compare Results
---

For this part, you can choose any dataset you like. I recommend choosing text data (such as movie reviews: <https://keras.io/api/datasets/imdb/>) and predicting sentiment. However, I will allow you to select any sequential dataset you like as long as it is appropriate for RNNs.

Use TF/Keras to create two models. The first will be a simple RNN. The second will be an LSTM model.

There are great examples online that you can review and learn from (but not copy :)). You can also have a look at one of my examples that compares ANN, CNN, RNN, and LSTM: <https://gatesboltonanalytics.com/?page_id=903>

Just as in Parts 1 and 2 (I will not repeat the requirements again here - but please assume them) you will illustrate your process and results. You will also discuss and illustrate a comparison between the RNN and the LSTM models.

---


In [36]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Data

I started experimenting with the movie review data, but ended up struggling for a while since I'm less familiar with text data. So instead, I'm going to use data from my undergrad research work, which is time series data: NASA's [OMNI dataset](https://omniweb.gsfc.nasa.gov/).

One thing to note, however: the problem I tackle with this data is a **regression problem, not a classification problem**. So I won't be outputting a confusion matrix, but I will use another form of performance evaluation visualization.
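
Since the targets are continuous, per-target error metrics can stand in for the confusion matrix. The following is a minimal sketch, assuming predictions and true values are arrays with one column per target; the helper name `regression_report` and the demo arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Per-target MAE, RMSE, and R^2 for arrays of shape (n_samples, n_targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err), axis=0)
    rmse = np.sqrt(np.mean(err ** 2, axis=0))
    ss_res = np.sum(err ** 2, axis=0)                               # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical two-target example: the first target is predicted perfectly
y_true_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred_demo = np.array([[1.0, 12.0], [2.0, 18.0], [3.0, 30.0]])
report = regression_report(y_true_demo, y_pred_demo)
print(report)
```

A scatter plot of predicted versus true values for each target (with a 1:1 reference line) is a common visual companion to these numbers.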

## Data loading

If I submitted as intended, the CSV data should be included in my submission (titled "`omni_df.csv`").

In [28]:
omni_df = pd.read_csv('omni_df.csv')

In [29]:
omni_df.head()

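| | Epoch | BX_GSE | BY_GSE | BZ_GSE | Vx | Vy | Vz | proton_density | T | AE_INDEX | AL_INDEX | AU_INDEX | SYM_H | ASY_H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 00:00:00 | -5.94 | 0.24 | -0.15 | NaN | NaN | NaN | NaN | NaN | 668.0 | -486.0 | 182.0 | -44.0 | 48.0 |
| 1 | 2000-01-01 00:01:00 | -5.88 | 2.17 | 0.53 | -662.6 | 7.3 | -46.5 | 3.12 | 343841.0 | 638.0 | -487.0 | 151.0 | -45.0 | 48.0 |
| 2 | 2000-01-01 00:02:00 | -5.71 | 3.23 | 1.44 | -661.4 | 2.4 | -46.3 | 3.24 | 326583.0 | 666.0 | -527.0 | 139.0 | -45.0 | 48.0 |
| 3 | 2000-01-01 00:03:00 | -5.33 | 3.8 | 1.84 | -659.8 | -8.4 | -56.2 | 3.11 | 306470.0 | 615.0 | -474.0 | 141.0 | -45.0 | 48.0 |
| 4 | 2000-01-01 00:04:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 554.0 | -418.0 | 136.0 | -45.0 | 47.0 |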


## EDA

### Data columns and their meanings:

**JK Note to TAs:** It's not crucial to this work for you to understand these for the purposes of this assignment, but just in case it might be helpful. 

You can also refer to the [about OMNI data page](https://omniweb.gsfc.nasa.gov/html/ow_data.html).

Essentially, this is all solar wind data.

- `Epoch`: Date and Time (datetime index)
- Interplanetary Magnetic Field (units = nano-Teslas (nT))
    - `BX_GSE`: x-component of Magnetic Field in [GSE coordinate system](https://www.spenvis.oma.be/help/background/coortran/coortran.html#GSE) 
    - `BY_GSE`: y-component of Magnetic Field in GSE coordinate system
    - `BZ_GSE`: z-component of Magnetic Field in GSE coordinate system
- Solar Wind velocity (km/sec)
    - `Vx`: Solar wind velocity in x
    - `Vy`: Solar wind velocity in y
    - `Vz`: Solar wind velocity in z
- `proton_density`: Proton density in the solar wind (n/cc)
- `T`: Temperature (K)
- Geomagnetic Indices 
    - all units = nano Teslas (nT)
    - [Auroral Electrojet](https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ngdc.stp.indices:G00584#:~:text=The%20AU%20and%20AL%20indices%20are%20intended%20to%20express%20the,Show%20more...)
        - `AE_INDEX`: Auroral Electrojet Index (AE = AU-AL)
        - `AU_INDEX`: Auroral Electrojet upper
        - `AL_INDEX`: Auroral Electrojet lower
    - [SYM/ASY H](https://wdc.kugi.kyoto-u.ac.jp/aeasy/asy.pdf)
        - `SYM_H`: Longitudinally symmetric disturbance of geomagnetic H-component
            - Sym-H is the same as the [Dst Index](https://www.ngdc.noaa.gov/stp/geomag/dst.html), but at higher resolution
        - `ASY_H`: Longitudinally asymmetric disturbance of geomagnetic H-component



For my purposes, the features (predictor variables, model inputs, or whatever you want to call them) are:
`BX_GSE`, `BY_GSE`, `BZ_GSE`, `AE_INDEX`, `AL_INDEX`, `AU_INDEX`, `SYM_H`, `ASY_H`

And the targets ("label", the model output, etc.) are: `Vx`, `Vy`, `Vz`, `proton_density`, `T`

In [30]:
omni_df.describe()

| | BX_GSE | BY_GSE | BZ_GSE | Vx | Vy | Vz | proton_density | T | AE_INDEX | AL_INDEX | AU_INDEX | SYM_H | ASY_H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8888632.0 | 8888632.0 | 8888632.0 | 7679877.0 | 7679877.0 | 7679877.0 | 7679877.0 | 7673164.0 | 9552960.0 | 9552960.0 | 9552960.0 | 9552960.0 | 9552960.0 |
| mean | 0.002586364 | 0.0599756 | 0.002418269 | -435.3644 | -1.359368 | -2.238432 | 5.999525 | 98806.42 | 179.2767 | -111.4364 | 67.84032 | -12.03162 | 20.1168 |
| std | 3.660013 | 4.227383 | 3.395529 | 105.3745 | 25.52232 | 22.66174 | 4.988173 | 97444.86 | 210.2548 | 155.054 | 73.83967 | 19.71669 | 15.93993 |
| min | -49.51 | -51.09 | -60.64 | -1136.5 | -864.9 | -310.4 | 0.03 | 543.0 | 1.0 | -4141.0 | -975.0 | -490.0 | 0.0 |
| 25% | -2.65 | -2.67 | -1.73 | -494.8 | -16.3 | -14.9 | 3.0 | 35519.0 | 42.0 | -147.0 | 19.0 | -19.0 | 11.0 |
| 50% | 0.0 | 0.02 | 0.0 | -412.6 | -3.5 | -2.1 | 4.6 | 70020.0 | 93.0 | -45.0 | 41.0 | -9.0 | 16.0 |
| 75% | 2.63 | 2.76 | 1.73 | -355.6 | 11.4 | 10.2 | 7.24 | 130872.0 | 240.0 | -18.0 | 90.0 | -1.0 | 24.0 |
| max | 39.95 | 55.09 | 64.38 | -93.4 | 329.1 | 451.6 | 80.95 | 7617765.0 | 4192.0 | 79.0 | 2063.0 | 151.0 | 984.0 |


## Data Cleaning

In [31]:
# number of missing values in each column
omni_df.isnull().sum()

Epoch                   0
BX_GSE             664328
BY_GSE             664328
BZ_GSE             664328
Vx                1873083
Vy                1873083
Vz                1873083
proton_density    1873083
T                 1879796
AE_INDEX                0
AL_INDEX                0
AU_INDEX                0
SYM_H                   0
ASY_H                   0
dtype: int64

The data clearly has some missing values, which I'll combat by interpolating the input parameters and then dropping any rows that still have missing target values. This is done with the `interpolate_input()` function defined below.

(The following is the same code I used to clean my data for work.)
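
To make the cleaning strategy concrete, here is a minimal sketch of the same two steps on a tiny made-up frame (the `toy` values are hypothetical, not OMNI data): feature gaps are filled by linear interpolation, and rows whose target is still missing are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one feature column and one target column
toy = pd.DataFrame({
    "BX_GSE": [1.0, np.nan, 3.0, 4.0],       # feature with a gap
    "Vx": [-400.0, -410.0, np.nan, -430.0],  # target with a gap
})

# Step 1: interpolate the features only
toy_features = toy[["BX_GSE"]].interpolate(method="linear")

# Step 2: recombine with the untouched targets and drop rows with missing targets
toy_clean = pd.concat([toy_features, toy[["Vx"]]], axis=1).dropna()
print(toy_clean)
```

The feature gap at index 1 is filled with the midpoint of its neighbours, while the row at index 2 disappears because its target is missing. Targets are deliberately not interpolated, so the model is never trained on synthetic labels.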

In [4]:
def interpolate_input(df,features,targets,interpolation_method = 'linear',has_time_col=True,
                      drop_target_nan=True,includes_TH=False,th_len=15,include_target_th = False):
    
    #Creating X and y df from Time History data th_df (if applicable)-----------------------------------------------
    if includes_TH:
        X_col = []
        y_col =[]
        
        for f in features:
            for i in range(0,th_len+1):
                X_col.append(f+'_m{}'.format(i))
        if include_target_th:
            for tt in targets:
                for j in range(1,th_len+1):
                    X_col.append(tt+'_m{}'.format(j))
        for tt in targets:
            y_col.append(tt+'_m0')
            
        features = X_col
        targets =y_col
    #---------------------------------------------------------------------------------------------------------------    
    
    
    #separate df into features only & targets only dataframes========================================================
    
    feat_df = df[features].astype('float32')
    targ_df = df[targets].astype('float32')
    if has_time_col:
        if includes_TH:
            time = df['time']
        else:
            time = df['Epoch']      # TODO: generalize so this also works for dfs whose time column is not named 'Epoch'
    
    #interpolate features-only dataframe==============================================================================
    interpolated_feat_df = feat_df.interpolate(interpolation_method)    #if method not specified when function 
                                                                        # is passed, defaults to linear interpolation
    #combine df's together again=====================================================================================
    if has_time_col:
        concat_df = pd.concat([time,interpolated_feat_df,targ_df],axis=1)
    else:
        concat_df = pd.concat([interpolated_feat_df,targ_df],axis=1)
        
    print("New df with interpolated input created")
    # TO drop or NOT to drop... the nan of target parameters---------------------------------------------------------
    if drop_target_nan:
        new_df = concat_df.dropna()
        print("Target Nan's were dropped (no missing values in df;can be used for model training & testing)")
    else:
        new_df = concat_df
        print("Target NaN's were NOT dropped (dataframe still contains missing values)")
    
        
    return new_df

In [5]:
features = ['BX_GSE','BY_GSE','BZ_GSE','AE_INDEX','AL_INDEX','AU_INDEX','SYM_H','ASY_H']
targets = ['Vx','Vy','Vz','proton_density','T']

In [6]:
df = interpolate_input(omni_df,features=features,
                  targets=targets, 
                  has_time_col=True,
                  drop_target_nan=True,
                  includes_TH=False)

New df with interpolated input created
Target Nan's were dropped (no missing values in df;can be used for model training & testing)


In [32]:
df.isnull().sum()

Epoch             0
BX_GSE            0
BY_GSE            0
BZ_GSE            0
AE_INDEX          0
AL_INDEX          0
AU_INDEX          0
SYM_H             0
ASY_H             0
Vx                0
Vy                0
Vz                0
proton_density    0
T                 0
dtype: int64

Voilà! No more missing data.

### Training and Testing sets

The following is also taken directly from my work code (so there might be some unnecessary things).

In [39]:
def train_test_dataframes(df,features,targets,split_type='random',test_size=0.2,scale=None,
                          includes_TH=False,th_len=15,include_target_th = False):
    
    #Creating X and y df from Time History data th_df (if applicable)-----------------------------------------------
    if includes_TH:
        X_col = []
        y_col =[]
        
        for f in features:
            for i in range(0,th_len+1):
                X_col.append(f+'_m{}'.format(i))
        if include_target_th:
            for tt in targets:
                for j in range(1,th_len+1):
                    X_col.append(tt+'_m{}'.format(j))
        for tt in targets:
            y_col.append(tt+'_m0')
    else:
        X_col = features
        y_col = targets
    #--------------------------------------------------------------------------------------------------------------- 
    
    X = df[X_col].astype('float32')
    y = df[y_col].astype('float32')
    
    # Training-Test Split=====================================================
    if split_type == 'sequential':
        # (Sequential) Train-Test Split: first (1 - test_size) of the rows train, the rest test
        n_train = int((1-test_size)*len(X))
        
        X_train = X[:n_train]    #values up to % indicated by (1 - test_size)
        X_test = X[n_train:]     #remaining sequential values from X
        y_train = y[:n_train]
        y_test = y[n_train:]
        
        split = 'Sequential split'
    else:
        # (Random) Train-Test Split
        X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= test_size,random_state=123)
        split = 'Random split'
    # Print split type--------------------------------------------------------
    print('Train-Test split was:', split)
    
    #Scaling==================================================================
    if scale is not None:
        scaler = scale
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        print('X_train & X_test have been scaled using {}'.format(str(scale)))
    else:
        print('X_train & X_test are not scaled')
        
    print('Dataframes are complete')  
    
    return X_train,X_test,y_train,y_test



In [40]:
X_train,X_test,y_train,y_test = train_test_dataframes(df,
                      features=features,
                      targets=targets,
                      split_type='sequential',
                      test_size=0.2,
                      scale=StandardScaler(),
                      includes_TH=False)

Train-Test split was: Sequential split
X_train & X_test have been scaled using StandardScaler()
Dataframes are complete
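
As a small illustration of what the sequential split and scaling do, the sketch below repeats the two steps on a made-up array (`X_toy` is hypothetical). The earliest 80% of rows become the training set, and the scaler is fitted on the training rows only, so no statistics from the later (test) period leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical time-ordered data: 10 rows, 2 features, values increasing with time
X_toy = np.arange(20, dtype="float32").reshape(10, 2)

# Sequential split: first 80% of rows for training, last 20% for testing
cut = int(0.8 * len(X_toy))
X_tr, X_te = X_toy[:cut], X_toy[cut:]

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print(X_tr_s.mean(axis=0), X_te_s.min())
```

Because the test rows come later in time and were not used to fit the scaler, their scaled values can fall outside the range seen in training, which is the realistic situation for forecasting on new data.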


### Data Shapes

In [41]:
print("X_train shape:",X_train.shape)
print("y_train shape:",y_train.shape)
print("X_test shape:",X_test.shape)
print("y_test shape:",y_test.shape)

X_train shape: (6734530, 8)
y_train shape: (6734530, 5)
X_test shape: (1683633, 8)
y_test shape: (1683633, 5)
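
Note that these arrays are 2-D, `(samples, features)`, whereas Keras recurrent layers such as `SimpleRNN` and `LSTM` expect 3-D input of shape `(samples, timesteps, features)`. One way to get there is to cut the series into overlapping windows. The following is a rough sketch, assuming each window of `timesteps` consecutive rows is used to predict the targets at the window's last row; the helper name `make_windows` and the demo arrays are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(X, y, timesteps):
    """Turn (N, F) features and (N, K) targets into (N-T+1, T, F) windows and aligned targets."""
    X = np.asarray(X)
    y = np.asarray(y)
    # sliding_window_view appends the window axis last: shape (N-T+1, F, T)
    windows = sliding_window_view(X, window_shape=timesteps, axis=0)
    X_seq = windows.transpose(0, 2, 1)   # reorder to (N-T+1, T, F)
    y_seq = y[timesteps - 1:]            # target at the last row of each window
    return X_seq, y_seq

# Hypothetical demo: 6 time steps, 2 features, 1 target
X_demo = np.arange(12, dtype="float32").reshape(6, 2)
y_demo = np.arange(6, dtype="float32").reshape(6, 1)
X_seq, y_seq = make_windows(X_demo, y_demo, timesteps=3)
print(X_seq.shape, y_seq.shape)
```

Two caveats apply here. First, rows were dropped during cleaning, so consecutive rows are not always one minute apart and some windows will span gaps. Second, with several million rows, materializing every window in memory may be impractical, so a subsample or a batch generator is probably a better fit for the full dataset.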


# RNN Model

In [43]:
RNN_Model = tf.keras.models.Sequential([
  tf.keras.layers.Embedding(input_dim=8,output_dim=5),
  tf.keras.layers.SimpleRNN(units =50),
  ## If not using Embedding, you would use SimpleRNN(units, input_shape=(x,y))
])

RNN_Model.summary()

2023-11-30 19:55:56.289532: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (100)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 5)           40        
                                                                 
 simple_rnn (SimpleRNN)      (None, 50)                2800      
                                                                 
 dense (Dense)               (None, 1)                 51        
                                                                 
Total params: 2,891
Trainable params: 2,891
Non-trainable params: 0
_________________________________________________________________
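
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper function names are made up for illustration. Note that the summary lists a `Dense` layer with one unit, so it appears to come from a version of the model that had a final `Dense(1)` layer, which the code cell above does not show. The last helper estimates what an LSTM layer of the same width would need, since an LSTM carries four sets of weights per unit.

```python
def embedding_params(input_dim, output_dim):
    # one output_dim-sized vector per possible input index
    return input_dim * output_dim

def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # weights + bias
    return units * (input_dim + 1)

def lstm_params(input_dim, units):
    # four gates, each with its own input kernel, recurrent kernel, and bias
    return 4 * units * (input_dim + units + 1)

counts = {
    "embedding": embedding_params(8, 5),
    "simple_rnn": simple_rnn_params(5, 50),
    "dense": dense_params(50, 1),
}
total_params = sum(counts.values())
lstm_equivalent = lstm_params(5, 50)

print(counts, total_params)   # {'embedding': 40, 'simple_rnn': 2800, 'dense': 51} 2891
print(lstm_equivalent)        # 11200
```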


In [45]:
loss_function = keras.losses.MeanSquaredError
RNN_Model.compile(
                 loss=loss_function,
                 metrics=["accuracy"],
                 optimizer='adam'
                 )

In [46]:
%%time
Hist=RNN_Model.fit(X_train,y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5


TypeError: in user code:

    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/training.py", line 1284, in train_function  *
        return step_function(self, iterator)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/training.py", line 1268, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/training.py", line 1249, in run_step  **
        outputs = model.train_step(data)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/training.py", line 1051, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/training.py", line 1109, in compute_loss
        return self.compiled_loss(
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/engine/compile_utils.py", line 265, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/losses.py", line 142, in __call__
        losses = call_fn(y_true, y_pred)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/losses.py", line 268, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/losses.py", line 358, in __init__  **
        super().__init__(mean_squared_error, name=name, reduction=reduction)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/losses.py", line 246, in __init__
        super().__init__(reduction=reduction, name=name)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/losses.py", line 83, in __init__
        losses_utils.ReductionV2.validate(reduction)
    File "/home/jasminekobayashi/anaconda3/envs/code-ref-notebook/lib/python3.11/site-packages/keras/utils/losses_utils.py", line 87, in validate
        if key not in cls.all():

    TypeError: Expected float32 passed to parameter 'y' of op 'Equal', got 'auto' of type 'str' instead. Error: Expected float32, but got auto of type 'str'.
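
This `TypeError` most likely comes from passing the loss *class* (`keras.losses.MeanSquaredError`) to `compile()` instead of an *instance* (`keras.losses.MeanSquaredError()`). Keras then calls the class like a function with `(y_true, y_pred)`, so the tensors end up in the constructor's `reduction` argument, which is what the last frames of the traceback show. Two further issues would surface once that is fixed: `Embedding` expects non-negative integer indices rather than scaled float features, and `"accuracy"` is not a meaningful metric for continuous targets.

The following is a minimal sketch of a regression setup that avoids these problems, not the notebook's final model. The helper name `build_regressor`, the window length of 10, and the random demo arrays are all assumptions for illustration; the same helper accepts `keras.layers.LSTM` so that the two architectures can be compared like for like.

```python
import numpy as np
from tensorflow import keras

def build_regressor(recurrent_layer, timesteps, n_features, n_targets, units=50):
    """Recurrent regressor: one (timesteps, n_features) window in, n_targets values out."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        recurrent_layer(units),
        keras.layers.Dense(n_targets),          # linear output for regression
    ])
    model.compile(
        loss=keras.losses.MeanSquaredError(),   # an instance, not the class
        optimizer="adam",
        metrics=["mae"],                        # regression metric instead of accuracy
    )
    return model

# Hypothetical stand-in data shaped like windowed inputs: 8 features in, 5 targets out
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 10, 8)).astype("float32")
y_demo = rng.normal(size=(64, 5)).astype("float32")

rnn_demo = build_regressor(keras.layers.SimpleRNN, timesteps=10, n_features=8, n_targets=5)
history = rnn_demo.fit(X_demo, y_demo, epochs=1, batch_size=16, verbose=0)
pred_shape = rnn_demo.predict(X_demo, verbose=0).shape
print(pred_shape)
```

Swapping `keras.layers.SimpleRNN` for `keras.layers.LSTM` in the call to `build_regressor` gives the second model with everything else held constant.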
