# 4. Data Modeling

<a id="contents"></a>
# Table of Contents  
4.1. [Introduction](#introduction)  
4.2. [Imports](#imports)   
4.3. [Load Pre-processed Data](#load)     
4.4. [Build and Compile the LSTM Model](#build)       
4.5. [Model Training](#model)       
4.6. [Model Evaluation and Prediction](#eval)  
4.7. [Visualization](#results)

## 4.1 Introduction<a id="introduction"></a>

The goal of this notebook is to build and evaluate a final LSTM model that predicts stock prices for the following four stocks:

1. The Estée Lauder Companies Inc. (EL)
2. Ulta Beauty, Inc. (ULTA)
3. Coty Inc. (COTY)
4. e.l.f. Beauty, Inc. (ELF)

## 4.2 Imports<a id="imports"></a>

In [1]:
import os
import math 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from keras.models import Sequential
from keras.optimizers import Adam
from keras.layers import LSTM, Dense, Dropout

#ignore warning messages to ensure clean outputs
import warnings
warnings.filterwarnings('ignore')

2024-06-19 13:33:50.104992: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 4.3 Load Pre-processed Data<a id="load"></a>

In [2]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/Updated_df.csv')
with open('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/scalers.pkl', 'rb') as f:
    scalers = pickle.load(f)

X_train = np.load('X_train.npy', allow_pickle=True)
X_test = np.load('X_test.npy', allow_pickle=True)
y_train = np.load('y_train.npy', allow_pickle=True)
y_test = np.load('y_test.npy', allow_pickle=True)
stock_symbols_test = np.load('stock_symbols_test.npy', allow_pickle=True)

print("Data loaded successfully.")
# Verify the shape and dtype of the loaded arrays
print(f"X_train shape: {X_train.shape}, dtype: {X_train.dtype}")
print(f"X_test shape: {X_test.shape}, dtype: {X_test.dtype}")
print(f"y_train shape: {y_train.shape}, dtype: {y_train.dtype}")
print(f"y_test shape: {y_test.shape}, dtype: {y_test.dtype}")

Data loaded successfully.
X_train shape: (12544, 50, 6), dtype: object
X_test shape: (3136, 50, 6), dtype: object
y_train shape: (12544,), dtype: float64
y_test shape: (3136,), dtype: float64


In [3]:
# Remove stock symbols from the feature arrays if present
# Assuming stock symbols were in the last column
if isinstance(X_train[0, 0, -1], str):
    X_train = X_train[:, :, :-1]
    X_test = X_test[:, :, :-1]

# Convert data types to float32
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)

In [4]:
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 15880 entries, 1995-11-17 to 2024-03-28
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          15880 non-null  float64
 1   High          15880 non-null  float64
 2   Low           15880 non-null  float64
 3   Close         15880 non-null  float64
 4   Volume        15880 non-null  int64  
 5   stock_symbol  15880 non-null  object 
dtypes: float64(4), int64(1), object(1)
memory usage: 868.4+ KB


## 4.4 Build and Compile the LSTM Model<a id="build"></a>

In [6]:
# Define the LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))

model.compile(optimizer='adam', loss='mean_squared_error')
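
As a sanity check on the architecture above, the number of trainable parameters can be worked out by hand. The sketch below assumes 5 numeric input features (the 6 loaded columns minus the stock symbol) and uses the standard count for an LSTM layer: four gates, each with input weights, recurrent weights, and a bias. The helper names are illustrative and are not part of Keras.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Trainable parameters of a standard LSTM layer with bias."""
    # Four gates (input, forget, cell, output), each with an
    # (input_dim x units) kernel, a (units x units) recurrent kernel,
    # and a bias vector of length `units`.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim: int, units: int) -> int:
    """Trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


n_features = 5  # assumption: Open, High, Low, Close, Volume
units = 50

first_lstm = lstm_param_count(n_features, units)   # receives the raw features
second_lstm = lstm_param_count(units, units)       # receives the first layer's 50 outputs
output_layer = dense_param_count(units, 1)         # single regression output
total_params = first_lstm + second_lstm + output_layer  # Dropout adds no parameters

print(first_lstm, second_lstm, output_layer, total_params)
```

These figures can be compared against the output of `model.summary()`.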

## 4.5 Model Training<a id="model"></a>

In [9]:
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

train_loss = model.evaluate(X_train, y_train, verbose=0)
test_loss = model.evaluate(X_test, y_test, verbose=0)

print(f"Train Loss: {train_loss}")
print(f"Test Loss: {test_loss}")

Epoch 1/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 8s 17ms/step - loss: 0.0784 - val_loss: 0.0172
Epoch 2/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 7s 18ms/step - loss: 0.0139 - val_loss: 0.0103
Epoch 3/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 7s 18ms/step - loss: 0.0117 - val_loss: 0.0080
Epoch 4/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 7s 17ms/step - loss: 0.0107 - val_loss: 0.0089
Epoch 5/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 7s 18ms/step - loss: 0.0096 - val_loss: 0.0070
Epoch 6/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 8s 20ms/step - loss: 0.0092 - val_loss: 0.0058
Epoch 7/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 8s 21ms/step - loss: 0.0094 - val_loss: 0.0084
Epoch 8/10
392/392 ━━━━━━━━━━━━━━━━━━━━ 9s 23ms/step - loss: 0.0097 - val_loss: 0.0059
Epoch 9/10
392/392
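
The training call passes the test set as `validation_data`, so the validation loss is measured on the same data that is later used for the final evaluation. One alternative is to hold out the last part of the training windows for validation. This is a minimal sketch with placeholder arrays; `chronological_holdout` and the 10% fraction are illustrative choices, and it assumes the training windows are stored in chronological order.

```python
import numpy as np


def chronological_holdout(X, y, val_fraction=0.1):
    """Split off the last `val_fraction` of the samples without shuffling."""
    n_val = max(1, int(len(X) * val_fraction))
    return X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]


# Placeholder arrays laid out like X_train / y_train: (samples, timesteps, features)
X_demo = np.arange(100 * 50 * 5, dtype=np.float32).reshape(100, 50, 5)
y_demo = np.arange(100, dtype=np.float32)

X_fit, y_fit, X_val, y_val = chronological_holdout(X_demo, y_demo, val_fraction=0.1)
print(X_fit.shape, X_val.shape)
```

The held-out arrays would then be passed as `validation_data=(X_val, y_val)` to `model.fit`, leaving `X_test` untouched until the final evaluation.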

In [10]:
# Save the trained model
model.save('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Stock_Predictor_Capstone/stock_prediction_model.h5')



## 4.6 Model Evaluation and Prediction<a id="eval"></a>

In [16]:
# Make predictions
predictions = model.predict(X_test)

# Index of the target column in the scaled data. 0 is 'Open' in the
# Open, High, Low, Close, Volume column order; this must match the column
# used to build y during pre-processing (for example, 3 for 'Close').
target_col_idx = 0

# Inverse transform predictions and y_test back to original scale
def inverse_transform_column(scaled_values, scaler, column_idx):
    # The scaler was fitted on several feature columns, so the placeholder
    # needs that many columns, not just the single column being rescaled
    placeholder = np.zeros((len(scaled_values), scaler.n_features_in_))

    # Flatten (n, 1) predictions or (n,) labels into the target column
    placeholder[:, column_idx] = np.ravel(scaled_values)

    # Inverse transform the full-width array, then keep only the target column
    inversed = scaler.inverse_transform(placeholder)
    return inversed[:, column_idx]

# Inverse transform predictions and y_test for each stock
predictions_rescaled = []
y_test_rescaled = []

for stock in np.unique(stock_symbols_test):
    stock_scaler = scalers[stock]

    # Boolean mask selecting this stock's test samples
    stock_indices = stock_symbols_test == stock
    stock_predictions = predictions[stock_indices]
    stock_y_test = y_test[stock_indices]

    predictions_rescaled.extend(inverse_transform_column(stock_predictions, stock_scaler, target_col_idx))
    y_test_rescaled.extend(inverse_transform_column(stock_y_test, stock_scaler, target_col_idx))

# Both arrays are now grouped by stock, in np.unique order
predictions_rescaled = np.array(predictions_rescaled)
y_test_rescaled = np.array(y_test_rescaled)

98/98 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step
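
The rescaling step only needs the statistics of the target column, because min-max scaling acts on each column independently. The sketch below inverts a single column directly, without building a full-width array. It assumes the scalers are `MinMaxScaler` instances with the default (0, 1) feature range, in which case the per-column minimum and maximum would be read from the fitted `data_min_` and `data_max_` attributes. The numbers and the helper name `inverse_minmax_column` are made up for illustration.

```python
import numpy as np


def inverse_minmax_column(scaled_values, col_min, col_max):
    """Undo (0, 1) min-max scaling for one column: x = x_scaled * (max - min) + min."""
    scaled = np.ravel(np.asarray(scaled_values, dtype=np.float64))
    return scaled * (col_max - col_min) + col_min


# Hypothetical per-column statistics for a 5-feature scaler
# (Open, High, Low, Close, Volume); not taken from the real data
data_min = np.array([10.0, 11.0, 9.0, 10.5, 1_000.0])
data_max = np.array([110.0, 115.0, 105.0, 112.5, 9_000.0])
target_col = 0  # assumption: the target is the first column

# Shape (n, 1), like the output of model.predict
scaled_predictions = np.array([[0.0], [0.25], [1.0]])

restored = inverse_minmax_column(
    scaled_predictions, data_min[target_col], data_max[target_col]
)
print(restored)
```

Because the helper flattens its input, it accepts both the `(n, 1)` prediction array and the `(n,)` label array.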

In [None]:
# Evaluate the model
# Take the square root explicitly; recent scikit-learn releases no longer accept `squared=False`
rmse = np.sqrt(mean_squared_error(y_test_rescaled, predictions_rescaled))
print(f"Root Mean Squared Error (RMSE): {rmse}")
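
A single RMSE over all four stocks tends to be dominated by the stocks with the highest prices, since the error is measured in price units. A per-stock breakdown can be sketched as follows; `rmse_by_group` and the demo arrays are illustrative and not part of the notebook's data.

```python
import numpy as np


def rmse_by_group(y_true, y_pred, groups):
    """Return {group: RMSE}, computed separately for each group label."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    groups = np.asarray(groups)
    result = {}
    for group in np.unique(groups):
        mask = groups == group  # samples belonging to this group
        squared_errors = (y_true[mask] - y_pred[mask]) ** 2
        result[str(group)] = float(np.sqrt(np.mean(squared_errors)))
    return result


# Made-up values for two tickers
symbols_demo = np.array(["EL", "EL", "ULTA", "ULTA"])
actual_demo = np.array([100.0, 102.0, 400.0, 410.0])
predicted_demo = np.array([103.0, 98.0, 400.0, 400.0])

per_stock_rmse = rmse_by_group(actual_demo, predicted_demo, symbols_demo)
print(per_stock_rmse)
```

The group labels must be in the same order as the values. If the rescaled arrays are built stock by stock in `np.unique` order, `np.sort(stock_symbols_test)` would give labels in a matching order.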

In [None]:
# Save the rescaled predictions and labels for further analysis
np.save('predictions_rescaled.npy', predictions_rescaled)
np.save('y_test_rescaled.npy', y_test_rescaled)

In [None]:
# Sort the DataFrame by the index
df = df.sort_index()

# Now you can access all dates from the DataFrame
dates = df.index

# Check the first few dates to ensure they are correct
print(dates[:5])

## 4.7 Visualization<a id="results"></a>

In [None]:
# Get the unique years from the dates
unique_years = np.unique([date.year for date in dates])

# Generate the x-axis ticks for each quarter of each year
quarter_ticks = []
for year in unique_years:
    for quarter in range(1, 5):
        quarter_ticks.append(f'{year} Q{quarter}')

print(quarter_ticks)  # Check the generated quarter ticks
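
The loop above emits four labels for every year, including quarters that fall outside the data (the index starts in November 1995 and ends in March 2024). An alternative is to derive each label from an actual date, so only quarters present in the data appear. This is a minimal sketch using the standard library; the function names and demo dates are illustrative.

```python
from datetime import date


def quarter_label(d: date) -> str:
    """Format a date as 'YYYY Qn'."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return f"{d.year} Q{quarter}"


def quarter_labels(dates):
    """Distinct quarter labels, in the order they first appear."""
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(quarter_label(d) for d in dates))


demo_dates = [date(2023, 11, 30), date(2023, 12, 29), date(2024, 1, 2), date(2024, 3, 28)]
labels = quarter_labels(demo_dates)
print(labels)
```

The same functions accept pandas timestamps, which expose `year` and `month` in the same way.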

In [None]:
print(dates[:])

In [None]:
# Evaluate the model on the scaled data. The rescaled arrays (predictions_rescaled,
# y_test_rescaled) come from the per-stock inverse transform in Section 4.6;
# no single global scaler is defined in this notebook.
train_predictions = model.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"Train RMSE (scaled units): {train_rmse}")
print(f"Test RMSE (scaled units): {test_rmse}")

In [None]:
# Get the dates for the test set, trimmed to the length of the rescaled arrays
# Note: this assumes the test samples correspond to the most recent rows of df
dates = df.index[-len(y_test_rescaled):]

# Place roughly 20 evenly spaced ticks and label each one with its own year
# and quarter, so the number of ticks always matches the number of labels
tick_step = max(1, len(dates) // 20)
tick_dates = dates[::tick_step]
tick_labels = [f'{date.year} Q{date.quarter}' for date in tick_dates]

# Plot the true and predicted 'Close' prices
plt.figure(figsize=(12, 6))
plt.plot(dates, y_test_rescaled, label='True Close', color='blue')
plt.plot(dates, predictions_rescaled, label='Predicted Close', color='red')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('True vs Predicted Close Prices')
plt.legend()
plt.xticks(ticks=tick_dates, labels=tick_labels, rotation=45)  # Set custom ticks and labels
plt.tight_layout()
plt.show()