# Hybrid model:

This notebook trains an LSTM model on Nike (NKE) stock data using two feature sets, each selected by a different method: forward feature selection (FFS) and recursive feature elimination (RFE). Based on the performance evaluation, we pick the better feature set (FFS or RFE) and use it to develop the "hybrid" model.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import math
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
import timeit
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.stattools import adfuller
from keras.layers import LSTM, Dense, Dropout, Lambda
from keras import backend as K
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import tensorflow as tf

In [2]:
# Set Seeds
import random as rn
seed = 1992
np.random.seed(seed)
rn.seed(seed)

## Load data

In [3]:
# FFS (forward feature selection) data:
NIKE_ffs= pd.read_csv('/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/Data/Nike_ffs.csv',index_col=0)
# RFE (recursive feature elimination) data:
NIKE_rfe = pd.read_csv('/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/Data/Nike_rfe.csv',index_col=0)

In [4]:
NIKE_ffs.head(5)

Unnamed: 0,open,high,low,close,symbol,volume,inflation,unemployment,retail_sales,CPI,MACD,MACD_Signal,RSI,stoch-slowK,stoch-slowD,volume_shift,unemployment_shift,retail_sales_shift
2024-02-15,107.02,107.34,105.83,106.05,NKE,7019618.0,8.0028,3.7,554784.0,308.417,0.0293,-0.7664,54.7524,64.4738,73.2061,10086580.0,3.7,554784.0
2024-02-14,104.82,106.42,104.46,106.33,NKE,5743277.0,8.0028,3.7,554784.0,308.417,-0.1351,-0.9653,55.5998,74.6893,81.6098,8051815.0,3.7,554784.0
2024-02-13,104.99,105.8,104.245,105.0,NKE,6180509.0,8.0028,3.7,554784.0,308.417,-0.3714,-1.1729,52.347,80.4553,85.7871,8690871.0,3.7,554784.0
2024-02-12,104.74,107.43,104.645,107.18,NKE,7501946.0,8.0028,3.7,554784.0,308.417,-0.5321,-1.3733,58.9164,89.6848,85.3288,9601810.0,3.7,554784.0
2024-02-09,103.8,104.94,103.33,104.5,NKE,5449022.0,8.0028,3.7,554784.0,308.417,-0.9516,-1.5835,52.0466,87.2212,74.2464,7287568.0,3.7,554784.0


In [5]:
NIKE_ffs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6019 entries, 2024-02-15 to 2000-03-15
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   open                6019 non-null   float64
 1   high                6019 non-null   float64
 2   low                 6019 non-null   float64
 3   close               6019 non-null   float64
 4   symbol              6019 non-null   object 
 5   volume              6019 non-null   float64
 6   inflation           6019 non-null   float64
 7   unemployment        6019 non-null   float64
 8   retail_sales        6019 non-null   float64
 9   CPI                 6019 non-null   float64
 10  MACD                6019 non-null   float64
 11  MACD_Signal         6019 non-null   float64
 12  RSI                 6019 non-null   float64
 13  stoch-slowK         6019 non-null   float64
 14  stoch-slowD         6019 non-null   float64
 15  volume_shift        6019 non-null   float64
 

In [6]:
NIKE_rfe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6019 entries, 2024-02-15 to 2000-03-15
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   open                  6019 non-null   float64
 1   high                  6019 non-null   float64
 2   low                   6019 non-null   float64
 3   close                 6019 non-null   float64
 4   symbol                6019 non-null   object 
 5   inflation             6019 non-null   float64
 6   unemployment          6019 non-null   float64
 7   CPI                   6019 non-null   float64
 8   fed_funds_rate        6019 non-null   float64
 9   yield10y              6019 non-null   float64
 10  yield3m               6019 non-null   float64
 11  MACD                  6019 non-null   float64
 12  bbands-upper          6019 non-null   float64
 13  bbands-middle         6019 non-null   float64
 14  bbands-lower          6019 non-null   float64
 15  unemploymen

## Data Preprocessing

### Transform to stationary

Why is stationarity important in time series analysis?
<li>Non-stationary data can lead to unreliable model outputs and inaccurate predictions, because many time series models assume that the statistical properties of the series do not change over time.
<li>Stationary series are easier to model and forecast.
<li>Trends and patterns are easier to interpret.
<li>Diagnostic checks are more reliable.
<li>Model performance generally improves.
<br> Ref: https://hex.tech/blog/stationarity-in-time-series/#:~:text=Non%2Dstationary%20data%20can%20lead,than%20non%2Dstationary%20time%20series.

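The Augmented Dickey-Fuller (ADF) test checks whether a given time series is stationary. Its null hypothesis is that the series has a unit root (is non-stationary); the null is rejected when the test statistic falls below the critical value, or equivalently when the p-value is below the chosen significance level.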
<br>Ref: https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/
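
The decision rule behind the ADF test can be sketched as follows. This is a minimal illustration rather than part of the original analysis: `adf_verdicts` is a hypothetical helper, and the statistic and critical values are the rounded numbers reported by the test cells in this section.

```python
def adf_verdicts(t_stat, critical_values):
    """Return a verdict per significance level.

    The null hypothesis (unit root, i.e. non-stationary) is rejected
    only when the test statistic is below the critical value.
    """
    return {
        level: "stationary" if t_stat < crit else "non-stationary"
        for level, crit in critical_values.items()
    }


# Rounded values taken from the ADF output in this notebook.
verdicts = adf_verdicts(-2.19, {"1%": -3.43, "5%": -2.86, "10%": -2.57})
for level, verdict in verdicts.items():
    print(f"{level}: {verdict}")
```

With these numbers the null hypothesis is not rejected at any level, which agrees with the reported p-value of 0.21: the log-transformed closing price still looks non-stationary.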

<b>NIKE_ffs</b>

In [7]:
df_nike_ffs=NIKE_ffs.copy()
transform = 'log'
fea_to_transform = ['open','high','low','close', 'stoch-slowK', 'stoch-slowD']
for feature in df_nike_ffs.columns:
    if feature in fea_to_transform:
      df_nike_ffs[feature] = df_nike_ffs[feature].apply(np.log)

# check the closing price for stationarity using the augmented Dickey-Fuller test
# (null hypothesis: the series has a unit root, i.e. it is non-stationary)
t_stat, p_value, _, _, critical_values, _  = adfuller(df_nike_ffs.close.values, autolag='AIC')
print('Augmented Dickey-Fuller Test for Stationarity')
print("-"*60)
print(f'ADF Statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.2f}')
print('Critical Values:')
for key, value in critical_values.items():
    # reject the null (=> stationary) only when the statistic is below the critical value
    if t_stat < value:
        print(f'   {key}, {value:.2f} => stationary')
    else:
        print(f'   {key}, {value:.2f} => non-stationary')

Augmented Dickey-Fuller Test for Stationarity
------------------------------------------------------------
ADF Statistic: -2.19
p-value: 0.21
Critical Values:
   1%, -3.43 => non-stationary
   5%, -2.86 => non-stationary
   10%, -2.57 => non-stationary


<b>NIKE_rfe</b>

In [70]:
df_nike_rfe=NIKE_rfe.copy()
transform = 'log'
fea_to_transform_rfe = ['open','high','low','close', 'stoch-slowK', 'stoch-slowD','bbands-upper', 'bbands-middle', 'bbands-lower']
for feature in df_nike_rfe.columns:
    if feature in fea_to_transform_rfe:
      df_nike_rfe[feature] = df_nike_rfe[feature].apply(np.log)

# check the closing price for stationarity using the augmented Dickey-Fuller test
# (null hypothesis: the series has a unit root, i.e. it is non-stationary)
t_stat, p_value, _, _, critical_values, _  = adfuller(df_nike_rfe.close.values, autolag='AIC')
print('Augmented Dickey-Fuller Test for Stationarity')
print("-"*60)
print(f'ADF Statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.2f}')
print('Critical Values:')
for key, value in critical_values.items():
    # reject the null (=> stationary) only when the statistic is below the critical value
    if t_stat < value:
        print(f'   {key}, {value:.2f} => stationary')
    else:
        print(f'   {key}, {value:.2f} => non-stationary')

Augmented Dickey-Fuller Test for Stationarity
------------------------------------------------------------
ADF Statistic: -2.19
p-value: 0.21
Critical Values:
   1%, -3.43 => non-stationary
   5%, -2.86 => non-stationary
   10%, -2.57 => non-stationary


### Prepare the Data 
+ Reverse the date-time index so that the data is in ascending (chronological) order
+ Replace the date-time index with an integer index
+ Convert to NumPy arrays, as the LSTM model requires

The LSTM model requires a three-dimensional array as input: (samples, time steps, features).
Ref: https://www.kaggle.com/code/shivajbd/input-and-output-shape-in-lstm-keras
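
The three-dimensional input shape can be illustrated with a small sketch. The sizes and the `idx_close` position below are toy assumptions chosen for readability, not the notebook's data; the windowing mirrors the sliding-window idea used later in the notebook.

```python
import numpy as np

n_rows, n_features = 10, 4   # toy sizes, not the Nike data
n_steps, n_predict = 3, 2    # input window length, forecast horizon
idx_close = 0                # assumed column position of the target

# Toy 2-D table: one row per day, one column per feature.
data = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)

steps = range(n_steps, n_rows - n_predict)
# Each sample is the previous n_steps rows with all features.
X = np.array([data[s - n_steps:s, :] for s in steps])
# Each target is the next n_predict values of the target column.
y = np.array([data[s:s + n_predict, idx_close] for s in steps])

print(X.shape)  # (samples, time steps, features)
print(y.shape)  # (samples, n_predict)
```

Stacking the windows turns the two-dimensional table into the (samples, time steps, features) array that the LSTM layer consumes.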

<b>NIKE_ffs</b>

In [9]:
# reverse data:
df_nike_ffs=df_nike_ffs.iloc[::-1]
# get the feature column names (everything except the ticker symbol):
ffs_features =[c for c in df_nike_ffs.columns if c not in ['symbol']]
# get the target values:
df_nike_ffs_y = df_nike_ffs['close']
# get the feature values:
df_nike_ffs_X= df_nike_ffs[ffs_features]
# replace date index with integer index:
df_nike_ffs_X.reset_index(drop=True, inplace=True)
# convert to numpy arrays:
ffs_array_X = np.array(df_nike_ffs_X)
ffs_array_y = np.array(df_nike_ffs_y).reshape(-1,1)

<b>NIKE_rfe</b>

In [71]:
# reverse data:
df_nike_rfe=df_nike_rfe.iloc[::-1]
# get the feature column names (everything except the ticker symbol):
rfe_features =[c for c in df_nike_rfe.columns if c not in ['symbol']]
# get the target values:
df_nike_rfe_y = df_nike_rfe['close']
# get the feature values:
df_nike_rfe_X= df_nike_rfe[rfe_features]
# replace date index with integer index:
df_nike_rfe_X.reset_index(drop=True, inplace=True)
# convert to numpy arrays:
rfe_array_X = np.array(df_nike_rfe_X)
rfe_array_y = np.array(df_nike_rfe_y).reshape(-1,1)

### Scale the inputs and outputs

In [72]:
# FFS:
scaler_X_ffs = MinMaxScaler(feature_range=(0,1))
scaler_y_ffs = MinMaxScaler(feature_range=(0,1))
scaled_X_ffs = scaler_X_ffs.fit_transform(ffs_array_X)
scaled_y_ffs = scaler_y_ffs.fit_transform(ffs_array_y)
# RFE:
scaler_X_rfe = MinMaxScaler(feature_range=(0,1))
scaler_y_rfe = MinMaxScaler(feature_range=(0,1))
scaled_X_rfe = scaler_X_rfe.fit_transform(rfe_array_X)
scaled_y_rfe = scaler_y_rfe.fit_transform(rfe_array_y)
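
The cell above fits each scaler on the full series, so the minimum and maximum of the test period influence how the training data is scaled. A common alternative, sketched here with plain NumPy on toy numbers as an assumption rather than the notebook's method, is to take the minimum and maximum from the training rows only and reuse them for the test rows. `fit_min_max` and `apply_min_max` are hypothetical helpers.

```python
import numpy as np


def fit_min_max(train):
    """Column-wise minimum and maximum of the training rows."""
    return train.min(axis=0), train.max(axis=0)


def apply_min_max(values, lo, hi):
    """Scale values with a previously fitted minimum and maximum."""
    return (values - lo) / (hi - lo)


# Toy single-feature series in chronological order.
series = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
train, test = series[:4], series[4:]

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach, test values can fall outside the [0, 1] range when the test period reaches levels not seen in training.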

### Split train, test

In [12]:
def split_train_test(X,train_pct=0.8):
    train_n =math.ceil(X.shape[0]*train_pct) #get the number of samples allocated for training
    train_X = X[0:train_n,:] # subset the first "train_n" rows of dataset with all the feature columns
    
    # Testing data:
    test_X = X[train_n:,:] # subset the rows from "train_n" rows of dataset
    return train_X,test_X

In [73]:
train_pct = 0.80

# FFS:
train_x_ffs,test_x_ffs=split_train_test(scaled_X_ffs,train_pct)
# RFE:
train_x_rfe,test_x_rfe=split_train_test(scaled_X_rfe,train_pct)

### Partition the train/test data

<b>Why partition data?</b>
<br>Data partitioning is essential for evaluating how well a model performs and generalizes. One subset of the data trains the model and another tests it on new, unseen data. This guards against overfitting, where the model fits the training data too closely and fails to generalize. Partitioning also helps with tuning model parameters, comparing models, and assessing the reliability of the results. In this notebook, the train and test sets are each further cut into sliding windows: n_steps rows of input followed by n_predict closing prices to predict.
<br>Ref: https://www.linkedin.com/advice/3/what-best-ways-partition-data-training-testing#:~:text=Partitioning%20data%20into%20training%20and,performance%20in%20real%2Dworld%20scenarios.

In [14]:
def create_partitions(data,idx_close,n_steps,n_predict,visualize=False):
    # note: `visualize` is kept for compatibility with the calls below but is not used
    n = data.shape[0]
    i, p = [],[]
    # create the partitions
    for step in range(n_steps, n-n_predict):
        # input window: the previous n_steps rows with all features
        i.append(data[step-n_steps:step,:])
        # prediction window: the next n_predict closing prices
        p.append(data[step:step+n_predict,idx_close])
    return np.array(i),np.array(p)

In [15]:
def plot_training_window(x_array,y_array,idx_close,n_steps,n_predict,batch,plot_name):
    #Plots a single training batch showing the train/prediction windows
    # convert the arrays to dataframes
    # align the x indexes to compare
    df_y = pd.DataFrame(y_array[batch],index=range(n_steps-1,n_predict+n_steps-1),columns=['y'])
    df_x = pd.DataFrame(x_array[batch+1])[idx_close]
    df_x = pd.DataFrame(df_x)
    df_x.columns = ['x']

    # create the plots
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_x.index,y=df_x['x'],name='train window',line_color='#a8b8d0',fill='tozeroy'))
    fig.add_trace(go.Scatter(x=df_y.index,y=df_y['y'],name='predict window',line_color='orange',fill='tozeroy'))
    fig.update_layout(template='plotly_white',
                    title=f'Train/Predict Windows: {plot_name}',
                    yaxis_title = 'Closing Price (Scaled)',
                    xaxis_title='Period',
                    width = 700,
                    height = 500)
    fig.show()

In [74]:
n_steps = 25
n_predict = 3 # the model will predict for 3 days in the future
# FFS:
index_close_ffs = df_nike_ffs.columns.get_loc('close')
x_train_ffs, y_train_ffs = create_partitions(train_x_ffs,index_close_ffs,n_steps, n_predict,True)
x_test_ffs,  y_test_ffs  = create_partitions(test_x_ffs, index_close_ffs,n_steps, n_predict)
plot_training_window(x_train_ffs, y_train_ffs,index_close_ffs,n_steps,n_predict,seed//10,'FFS method')

# RFE:
index_close_rfe = df_nike_rfe.columns.get_loc('close')
x_train_rfe, y_train_rfe = create_partitions(train_x_rfe,index_close_rfe,n_steps, n_predict,True)
x_test_rfe,  y_test_rfe  = create_partitions(test_x_rfe, index_close_rfe,n_steps, n_predict)
plot_training_window(x_train_rfe, y_train_rfe,index_close_rfe,n_steps,n_predict,seed//10,'RFE method')
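
As a sanity check, the expected array shapes for the FFS data can be worked out from the row count and window settings alone. This is a sketch under stated assumptions: the 6019 rows come from the `info()` output, the 17 features are inferred as the 18 listed columns minus `symbol`, and `n_windows` is a hypothetical helper that counts the iterations of the loop in `create_partitions`.

```python
import math

n_rows = 6019          # rows reported by NIKE_ffs.info()
train_pct = 0.80
n_steps, n_predict = 25, 3
n_features = 17        # inferred: 18 columns minus 'symbol'


def n_windows(n, n_steps, n_predict):
    """Number of windows produced by range(n_steps, n - n_predict)."""
    return len(range(n_steps, n - n_predict))


# Chronological split: the first 80% of rows train, the rest test.
train_n = math.ceil(n_rows * train_pct)
test_n = n_rows - train_n

x_train_shape = (n_windows(train_n, n_steps, n_predict), n_steps, n_features)
x_test_shape = (n_windows(test_n, n_steps, n_predict), n_steps, n_features)

print(train_n, test_n)
print(x_train_shape, x_test_shape)
```

Comparing these numbers with `x_train_ffs.shape` and `x_test_ffs.shape` is a quick way to confirm that the split and the windowing behave as intended.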

## LSTM models

In [17]:
# learning rate:
lr=0.001
# layer specification, one tuple per layer: (type, units, return_sequences, dropout rate)
layers = [('lstm',256,True,0.3),('lstm',128,False,0.2),('dense',32,None,0.1),('dense',16,None,0.1)]


<b>Model for data formed by FFS method</b>

In [59]:

# FFS:
n_ffs_features = x_train_ffs.shape[2]
# clear previous models
K.clear_session()
# Create model
model_ffs = Sequential(name='LSTM')
model_ffs.add(LSTM(n_steps,return_sequences=True,input_shape=(n_steps,n_ffs_features)))
# add additional layers
for layer,nodes,ret_seq,drop in layers:
    if layer=='lstm':
        model_ffs.add(LSTM(nodes,return_sequences=ret_seq))
        if drop is not None:
            model_ffs.add(Dropout(drop))
    elif layer=='dense':
        model_ffs.add(Dense(nodes))
        if drop is not None:
            model_ffs.add(Dropout(drop))
# add prediction layer
model_ffs.add(Dense(n_predict))
#compile:
model_ffs.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),loss='mse')
model_ffs.summary()



Model: "LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 25, 25)            4300      
                                                                 
 lstm_1 (LSTM)               (None, 25, 256)           288768    
                                                                 
 dropout (Dropout)           (None, 25, 256)           0         
                                                                 
 lstm_2 (LSTM)               (None, 128)               197120    
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dropout_2 (Dropout)         (None, 32)                0      
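
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and a fully connected layer with bias; `lstm_params` and `dense_params` are hypothetical helpers, and the 17 input features are inferred from the FFS column list minus `symbol`.

```python
def lstm_params(units, input_dim):
    """Trainable parameters of an LSTM layer with bias.

    Four gates, each with input weights (input_dim x units),
    recurrent weights (units x units) and a bias vector (units).
    """
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    """Trainable parameters of a dense layer with bias."""
    return units * input_dim + units


print(lstm_params(25, 17))    # lstm:   17 input features -> 25 units
print(lstm_params(256, 25))   # lstm_1: 25 inputs -> 256 units
print(lstm_params(128, 256))  # lstm_2: 256 inputs -> 128 units
print(dense_params(32, 128))  # dense:  128 inputs -> 32 units
```

The same formula explains why only the first layer's count differs between the models in this notebook: it is the only layer whose input size depends on the number of features.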

In [60]:
# Train the model
tic = timeit.default_timer()
history_ffs = model_ffs.fit(x_train_ffs,
                    y_train_ffs,
                    batch_size=128,
                    epochs=250,
                    callbacks = EarlyStopping(monitor='loss',patience=10),
                    validation_split =0.10, 
                    verbose = 0
                    )
# print the training time
toc =timeit.default_timer()
print('\nTraining Time')
print('='*60)
print(f'Minutes:{round((toc-tic)/60,2)}\n')


Training Time
Minutes:9.63



In [42]:
# Predict price and calculate the model performance:
def calculate_performance(x_test,y_test,model,scaler_y):
    # Predict the prices
    y_pred = model.predict(x_test)
    # undo the min-max scaling; close was log-transformed earlier, so these values are log prices
    y_pred_unscaled = scaler_y.inverse_transform(y_pred)
    y_test_unscaled = scaler_y.inverse_transform(y_test)
    rmse  = math.sqrt(mean_squared_error(y_test_unscaled, y_pred_unscaled))
    mae   = mean_absolute_error(y_test_unscaled, y_pred_unscaled)
    mape  = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred_unscaled)/ y_test_unscaled))) * 100
    mdape = np.median((np.abs(np.subtract(y_test_unscaled, y_pred_unscaled)/ y_test_unscaled)) ) * 100

    print("\nModel Error")
    print("="*62)
    print(f'{"Mean Absolute Error (MAE)" :-<55} {np.round(mae, 2):>5}')
    print(f'{"Root Mean Squared Error (RMSE)" :-<55} {np.round(rmse,2):>5}')
    print(f'{"Mean Absolute Percentage Error (MAPE)" :-<55} {np.round(mape, 2):>5}%')
    print(f'{"Median Absolute Percentage Error (MDAPE)" :-<55} {np.round(mdape, 2):>5}%')

    return y_pred
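
Because the closing price was log-transformed before scaling, inverting the min-max scaling alone yields log prices, so the errors reported by this function are in log units. A minimal sketch of the full round trip back to a price follows; `to_price` is a hypothetical helper and the price range is made up for illustration.

```python
import math


def to_price(scaled, log_min, log_max):
    """Undo min-max scaling, then undo the natural log."""
    log_price = scaled * (log_max - log_min) + log_min
    return math.exp(log_price)


# Made-up range: log prices between log(50) and log(150).
log_min, log_max = math.log(50.0), math.log(150.0)

# Forward transform of a price of 100: log, then min-max scale.
scaled = (math.log(100.0) - log_min) / (log_max - log_min)

print(round(to_price(scaled, log_min, log_max), 2))
```

Applying the same exponentiation to the predictions and targets before computing MAE or MAPE would express those errors in price terms rather than log terms.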

In [61]:
y_pred_scaled_ffs = calculate_performance(x_test_ffs,y_test_ffs,model_ffs,scaler_y_ffs)


Model Error
Mean Absolute Error (MAE)------------------------------  0.15
Root Mean Squared Error (RMSE)-------------------------  0.18
Mean Absolute Percentage Error (MAPE)------------------  3.15%
Median Absolute Percentage Error (MDAPE)---------------   3.0%


In [44]:
# Plot price predictions:
def plot_price_predictions(batch, idx_close, x_test, y_pred_scaled,scaler_y):
    # unscale the y predictions
    y_pred_unscaled = scaler_y.inverse_transform(y_pred_scaled)
    # unscale the x_test data
    x_test_np = np.array(pd.DataFrame(x_test[batch])[idx_close]).reshape(-1,1)
    x_test_unscaled = scaler_y.inverse_transform(x_test_np)
    x_test_df = pd.DataFrame(x_test_unscaled)
    # set the indexes for plotting
    max_test_idx=x_test_df.shape[0]
    max_pred_idx =y_pred_unscaled[0].shape[0]
    test_idx = list(range(batch,batch + max_test_idx))
    pred_idx = list(range(batch + max_test_idx,batch + max_test_idx + max_pred_idx))
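    # combine the predicted prices with reference values for plotting
    # NOTE: 'actual' holds the first n_predict values of the input window (zip truncates x_test_df[0]),
    # not the true future prices, which are in y_test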
    data = pd.DataFrame(list(zip(y_pred_unscaled[batch], x_test_df[0])), columns=['pred', 'actual'])
    # create the plot
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=test_idx, y=x_test_df[0],
                        mode='lines',
                        name='Test Data',
                        fill='tozeroy',
                        line_color='#ccc'))
    fig.add_trace(go.Scatter(x=pred_idx, y=data['actual'],
                        mode='lines+markers', 
                        name='Actual Price',
                        fill='tozeroy',
                        line_color ='#ccc')) 
    fig.add_trace(go.Scatter(x=pred_idx, y=data['pred'],
                        mode='lines+markers',
                        name='Predicted Price',
                        line_color='red'))

    fig.update_layout(template = 'plotly_white',
                      title= 'Actual vs Predicted Price',
                      xaxis_title = 'Batch',
                      yaxis_title = 'Price',
                      width=600,
                      height=400)

    fig.show()

In [62]:
plot_price_predictions(128, index_close_ffs, x_test_ffs, y_pred_scaled_ffs,scaler_y_ffs)

In [46]:
def plot_training_metrics(history):
    # get the number of epochs
    epochs = list(range(1, len(history.history['loss']) + 1))

    # create the line plots
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=epochs,
                           y=history.history['loss'],
                           name = 'train-loss',
                           line=dict(width=3,color='royalblue')))
    fig.add_trace(go.Scatter(x=epochs,
                           y=history.history['val_loss'],
                           name='val-loss',
                           line=dict(width=3,color='crimson')))

    fig.update_layout(title = 'Training Metrics',
                    template="plotly_white",
                    width = 700,
                    height= 500,
                    yaxis_title='loss',
                    xaxis_title='epochs')

    fig.show()

In [63]:
plot_training_metrics(history_ffs)

In [91]:
# save model and training performance
model_ffs.save('/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_ffs')

df_hist_NIKE_ffs = pd.DataFrame(history_ffs.history) 
df_hist_NIKE_ffs.to_csv(f'/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_ffs_train_history.csv')

INFO:tensorflow:Assets written to: /Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_ffs/assets




<b>Model for data formed by RFE method</b>

In [75]:

# RFE:
n_rfe_features = x_train_rfe.shape[2]
# clear previous models
K.clear_session()
# Create model:
model_rfe = Sequential(name='LSTM')
model_rfe.add(LSTM(n_steps,return_sequences=True,input_shape=(n_steps,n_rfe_features)))
# add additional layers
for layer,nodes,ret_seq,drop in layers:
    if layer=='lstm':
        model_rfe.add(LSTM(nodes,return_sequences=ret_seq))
        if drop is not None:
            model_rfe.add(Dropout(drop))
    elif layer=='dense':
        model_rfe.add(Dense(nodes))
        if drop is not None:
            model_rfe.add(Dropout(drop))
# add prediction layer
model_rfe.add(Dense(n_predict))
#compile:
model_rfe.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),loss='mse')
model_rfe.summary()



Model: "LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 25, 25)            4400      
                                                                 
 lstm_1 (LSTM)               (None, 25, 256)           288768    
                                                                 
 dropout (Dropout)           (None, 25, 256)           0         
                                                                 
 lstm_2 (LSTM)               (None, 128)               197120    
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dropout_2 (Dropout)         (None, 32)                0      

In [76]:
# Train the model
tic = timeit.default_timer()
history_rfe = model_rfe.fit(x_train_rfe,
                    y_train_rfe,
                    batch_size=128,
                    epochs=250,
                    callbacks = EarlyStopping(monitor='loss',patience=10),
                    validation_split =0.10, 
                    verbose = 0
                    )
# print the training time
toc =timeit.default_timer()
print('\nTraining Time')
print('='*60)
print(f'Minutes:{round((toc-tic)/60,2)}\n')


Training Time
Minutes:9.1



In [77]:
# Predict price and calculate the model performance:
y_pred_scaled_rfe = calculate_performance(x_test_rfe,y_test_rfe,model_rfe,scaler_y_rfe)


Model Error
Mean Absolute Error (MAE)------------------------------  0.08
Root Mean Squared Error (RMSE)-------------------------  0.09
Mean Absolute Percentage Error (MAPE)------------------  1.65%
Median Absolute Percentage Error (MDAPE)---------------  1.63%


In [78]:
plot_price_predictions(128, index_close_rfe, x_test_rfe, y_pred_scaled_rfe,scaler_y_rfe)

In [79]:
plot_training_metrics(history_rfe)

In [92]:
# save model and training performance
model_rfe.save('/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_rfe')

df_hist_NIKE_rfe = pd.DataFrame(history_rfe.history) 
df_hist_NIKE_rfe.to_csv(f'/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_rfe_train_history.csv')

INFO:tensorflow:Assets written to: /Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_rfe/assets




### Chosen model to build "Hybrid" model
The model trained on the RFE-selected features outperforms the one trained on the FFS-selected features on every reported metric (for example, a MAPE of 1.65% versus 3.15%).
<br> As a result, model_rfe and its feature set are chosen as the basis for the "hybrid" model.

## 'Hybrid' model

### Add the 50-day moving average (MA) as a feature for the chosen LSTM model:

In [80]:
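# Add the 50-day moving average of the (log) closing price as an extra feature for the hybrid LSTM model:
rfe_hybrid = df_nike_rfe.copy()
rfe_hybrid['50day_MA'] = rfe_hybrid["close"].rolling(50).mean()
# the first 49 rows have no full window; back-fill them with the first available average
rfe_hybrid.bfill(inplace=True,axis = 0)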
rfe_hybrid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6019 entries, 2000-03-15 to 2024-02-15
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   open                  6019 non-null   float64
 1   high                  6019 non-null   float64
 2   low                   6019 non-null   float64
 3   close                 6019 non-null   float64
 4   symbol                6019 non-null   object 
 5   inflation             6019 non-null   float64
 6   unemployment          6019 non-null   float64
 7   CPI                   6019 non-null   float64
 8   fed_funds_rate        6019 non-null   float64
 9   yield10y              6019 non-null   float64
 10  yield3m               6019 non-null   float64
 11  MACD                  6019 non-null   float64
 12  bbands-upper          6019 non-null   float64
 13  bbands-middle         6019 non-null   float64
 14  bbands-lower          6019 non-null   float64
 15  unemploymen
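
The effect of the rolling mean and the back-fill can be seen on a toy series. This is a sketch with a window of 3 instead of 50; the `min_periods=1` variant is shown only as a possible alternative, not as what the notebook does.

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3  # toy window; the notebook uses 50

# Rolling mean leaves the first window-1 rows empty; back-fill copies
# the first available average into them (values from later rows).
ma_bfill = close.rolling(window).mean().bfill()

# Alternative: average over however many rows are available so far.
ma_partial = close.rolling(window, min_periods=1).mean()

print(ma_bfill.tolist())
print(ma_partial.tolist())
```

Back-filling keeps the column free of missing values, but the earliest rows then carry an average computed from later observations; the partial-window variant avoids that at the cost of noisier early values.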

<b>Process the data</b>

In [83]:
rfe_hybrid_copy =rfe_hybrid.copy()
hybrid_features =[c for c in rfe_hybrid_copy.columns if c not in ['symbol']]

index_close = rfe_hybrid_copy.columns.get_loc('close')
transform = 'log'

In [84]:
# get the target values:
rfe_hybrid_y = rfe_hybrid_copy['close']
# get the features:
rfe_hybrid_X= rfe_hybrid_copy[hybrid_features]
# replace date index with integer index:
rfe_hybrid_X.reset_index(drop=True, inplace=True)
# convert to numpy arrays:
hybrid_array_X = np.array(rfe_hybrid_X)
hybrid_array_y = np.array(rfe_hybrid_y).reshape(-1,1)

<b>Scale the input, output</b>

In [85]:
# Hybrid:
scaler_X_hybrid = MinMaxScaler(feature_range=(0,1))
scaler_y_hybrid = MinMaxScaler(feature_range=(0,1))
scaled_X_hybrid = scaler_X_hybrid.fit_transform(hybrid_array_X)
scaled_y_hybrid = scaler_y_hybrid.fit_transform(hybrid_array_y)

<b>Split train, test</b>

In [86]:
train_x_hybrid,test_x_hybrid=split_train_test(scaled_X_hybrid,train_pct)

<b>Partition the train, test</b>

In [87]:
index_close_hybrid = rfe_hybrid_copy.columns.get_loc('close')
x_train_hybrid, y_train_hybrid = create_partitions(train_x_hybrid,index_close_hybrid,n_steps, n_predict,True)
x_test_hybrid,  y_test_hybrid  = create_partitions(test_x_hybrid, index_close_hybrid,n_steps, n_predict)
plot_training_window(x_train_hybrid, y_train_hybrid,index_close_hybrid,n_steps,n_predict,seed//10,'Hybrid model')

<b>Train the model</b>

In [88]:
# clear previous models
K.clear_session()
# Create Hybrid model:
# (use a separate name so the hybrid_features column list defined earlier is not overwritten)
n_hybrid_features = x_train_hybrid.shape[2]
model_hybrid = Sequential(name='LSTM')
model_hybrid.add(LSTM(n_steps,return_sequences=True,input_shape=(n_steps,n_hybrid_features)))
# add additional layers
for layer,nodes,ret_seq,drop in layers:
    if layer=='lstm':
        model_hybrid.add(LSTM(nodes,return_sequences=ret_seq))
        if drop is not None:
            model_hybrid.add(Dropout(drop))
    elif layer=='dense':
        model_hybrid.add(Dense(nodes))
        if drop is not None:
            model_hybrid.add(Dropout(drop))
# add prediction layer
model_hybrid.add(Dense(n_predict))
#compile:
model_hybrid.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),loss='mse')
model_hybrid.summary()



Model: "LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 25, 25)            4500      
                                                                 
 lstm_1 (LSTM)               (None, 25, 256)           288768    
                                                                 
 dropout (Dropout)           (None, 25, 256)           0         
                                                                 
 lstm_2 (LSTM)               (None, 128)               197120    
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dropout_2 (Dropout)         (None, 32)                0      

In [89]:
# Train the model
tic = timeit.default_timer()
history_hybrid = model_hybrid.fit(x_train_hybrid,
                    y_train_hybrid,
                    batch_size=128,
                    epochs=250,
                    callbacks = EarlyStopping(monitor='loss',patience=10),
                    validation_split =0.10, 
                    verbose = 0
                    )
# print the training time
toc =timeit.default_timer()
print('\nTraining Time')
print('='*60)
print(f'Minutes:{round((toc-tic)/60,2)}\n')


Training Time
Minutes:8.11



In [90]:
# Predict price and calculate the model performance:
y_pred_scaled_hybrid = calculate_performance(x_test_hybrid,y_test_hybrid,model_hybrid,scaler_y_hybrid)
# Price prediction
plot_price_predictions(128, index_close_hybrid, x_test_hybrid, y_pred_scaled_hybrid,scaler_y_hybrid)
# Training metrics
plot_training_metrics(history_hybrid)


Model Error
Mean Absolute Error (MAE)------------------------------  0.06
Root Mean Squared Error (RMSE)-------------------------  0.07
Mean Absolute Percentage Error (MAPE)------------------  1.17%
Median Absolute Percentage Error (MDAPE)---------------  1.01%


In [93]:
# save model and training performance
model_hybrid.save('/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_hybrid')

df_hist_NIKE_hybrid = pd.DataFrame(history_hybrid.history) 
df_hist_NIKE_hybrid.to_csv(f'/Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_hybrid_train_history.csv')

INFO:tensorflow:Assets written to: /Users/kienguyen/Documents/DATA SCIENCE/MSDS/11. Practicum I/model/final/NIKE_hybrid/assets


