# Profit Prophet: The Stock Market ML Predictor

This project uses an LSTM (long short-term memory) neural network to predict stock market prices.

### Our Team

* Daniel Oliyarnik
* Harry van der Veen
* Darren Sun
* Jialu Xu

## Problem Statement

We are trying to build an LSTM enhanced with a trend-adjusted forecast (TAF) to accurately predict stock market prices. We believe that stock prices can be forecast from their previous numerical values, but that they are also heavily affected by speculation, which we can capture using corporate news articles.

## Stock Data

We use a combination of the fundamental stock data (open, high, low, and close prices) along with some calculated indicators (RSI, MACD), stored as a CSV file.

Additionally, we use news articles related to the companies of interest.
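
As a rough illustration of what the two indicators measure, the sketch below computes textbook versions of RSI and MACD with NumPy. The function names (`ema`, `macd`, `rsi`) and the way the averages are seeded are assumptions made for this example. The project itself obtains these columns from OpenBB, whose values may differ slightly, especially in the first rows.

```python
import numpy as np

def ema(values, span):
    """Exponential moving average seeded with the first value (illustrative choice)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(values), dtype=float)
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out

def macd(close, fast=12, slow=26, signal=9):
    """Return the MACD line, the signal line, and the histogram."""
    close = np.asarray(close, dtype=float)
    line = ema(close, fast) - ema(close, slow)
    signal_line = ema(line, signal)
    return line, signal_line, line - signal_line

def rsi(close, length=14):
    """Relative Strength Index with Wilder-style smoothing; the first `length` rows are NaN."""
    close = np.asarray(close, dtype=float)
    delta = np.diff(close)
    gains = np.clip(delta, 0, None)
    losses = np.clip(-delta, 0, None)
    avg_gain = gains[:length].mean()
    avg_loss = losses[:length].mean()
    out = np.full(len(close), np.nan)
    for i in range(length, len(close)):
        if i > length:
            avg_gain = (avg_gain * (length - 1) + gains[i - 1]) / length
            avg_loss = (avg_loss * (length - 1) + losses[i - 1]) / length
        out[i] = 100.0 if avg_loss == 0 else 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out

# Made-up closing prices, purely for demonstration
prices = np.linspace(100.0, 120.0, 40) + np.sin(np.arange(40))
macd_line, macd_signal, macd_hist = macd(prices)
print(rsi(prices)[-1], macd_line[-1], macd_hist[-1])
```

The default lengths (14 for RSI, 12/26/9 for MACD) mirror the values used in the download code later in this notebook.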

## Approach

First we train an LSTM model for the fundamental data to get a forecast of the market based on pure numeric data. Then we apply a TAF to the prediction to reduce RMSE. This produces our forecast based on the numeric data.

Next, we take the news data and extract embeddings to output a vector containing the relevance and estimations of stock impact. This output is combined with the fundamental stock data and pushed to another LSTM to predict future values. This provides the speculative forecast which accounts for corporate news.

![Block Diagram.png](images/Block%20Diagram.png)

Our LSTM model is used for time series forecasting of stock data. The implementation in `lstm_model.py` builds a sequential neural network whose LSTM layers capture temporal patterns in sequential data.

The model's architecture is configurable and supports multiple LSTM layers. Each LSTM cell maintains internal states (a cell state and a hidden state) that help it remember information over long sequences. The model uses dropout for regularization to prevent overfitting and supports different activation functions such as 'tanh' or 'relu'.

The implementation uses stateful LSTMs, where the cell states are preserved between batches during training. This allows the model to maintain context across different batches of the same sequence. Training fits the model without shuffling the data (important for time series), and the states are reset after each epoch via the `reset_model_states()` method.

For forecasting, the `ForecastEngine` class extends the LSTM model to make multi-step predictions by iteratively feeding its own predictions back in as input, a technique known as autoregressive forecasting. The model uses the last batch of training data as a starting point and then generates future predictions one step at a time, updating its input with each new prediction.

## Tools

There are multiple ways to get stock prices, including Bloomberg Terminals and OpenBB.

News data can be imported from a variety of sources including Bloomberg and OpenBB, but also regular sites such as the Financial Times.

### Bloomberg API

Bloomberg terminals are the de facto way to get stock information. UW also provides access to four of these terminals in the MC building. The Bloomberg API requires the terminal to be running, so the API can only run on a machine with the terminal open.

For this reason, we are moving away from the Bloomberg API.

In [None]:
# Bloomberg API

from xbbg import blp
import pandas as pd

DATA_DIR = './Data/'

tickers = ['NVDA US Equity', 'AAPL US Equity']
fields = ['High', 'Low', 'Last_Price']
start_date = '2024-11-01'
end_date = '2024-11-10'

# This line hangs unless it is running with a Bloomberg terminal
hist_tick_data = blp.bdh(tickers=tickers, fields=fields, start_date=start_date, end_date=end_date)

filename = f'tick_data_{start_date}_to_{end_date}.csv'
hist_tick_data.to_csv(DATA_DIR + filename)



### OpenBB

OpenBB is a free, open-source alternative to Bloomberg's stock viewer. It can be run without any special software running in the background.

The data is aggregated from multiple sources, though some data is inaccessible unless we purchase API keys from the corresponding sources.

### Finnhub

Finnhub is a free data provider for corporate news. We can use this to import up to 3 months of news on a particular company.

In [2]:
import openbb
openbb.build()

In [3]:
# Rate limiter class
# Some of the libraries used in the code are rate limited. This class can be used
# to limit the number of requests made to the library in a given time period.

import threading
import time

class TokenBucket:
    def __init__(self, tokens, refill_rate):
        self.capacity = tokens  # Max tokens (60)
        self.tokens = tokens    # Initial tokens
        self.refill_rate = refill_rate  # Tokens added per second (60)
        self.lock = threading.Lock()
        self.last_refill = time.time()

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        # Calculate tokens to add based on elapsed time
        new_tokens = elapsed * self.refill_rate
        if new_tokens > 0:
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill = now

    def consume(self, tokens=1):
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False
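
To make the refill arithmetic concrete, here is a small single-threaded sketch of the same idea with an injectable clock, so the behaviour can be checked without real waiting. `SimpleBucket` and `fake_time` are hypothetical names used only for this illustration. The refill rate of 1 token per second is an assumed example, corresponding to a limit of 60 requests per minute. It is not a documented limit of any particular provider.

```python
class SimpleBucket:
    """Minimal token bucket with an injectable clock (illustrative only)."""

    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.last_refill = clock()

    def consume(self, tokens=1):
        now = self.clock()
        # Add tokens for the elapsed time, never exceeding the capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

fake_time = [0.0]
bucket = SimpleBucket(capacity=60, refill_rate=1.0, clock=lambda: fake_time[0])

burst = sum(bucket.consume() for _ in range(61))       # 60 succeed, the 61st is refused
fake_time[0] += 5.0                                     # five seconds pass, five tokens return
after_wait = sum(bucket.consume() for _ in range(10))   # only five of ten succeed
print(burst, after_wait)  # 60 5
```

Because `refill_rate` is expressed in tokens per second, a bucket created with `tokens=60, refill_rate=60` allows a sustained 60 requests per second after the initial burst.
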

In [None]:
# OpenBB API

from openbb import obb
import finnhub
from ftfy import fix_text

from concurrent.futures import ThreadPoolExecutor  # Use ProcessPoolExecutor for CPU-bound tasks
import pandas as pd
import traceback

finnhub_client = finnhub.Client(api_key="[Key removed - please add your own]")
obb.user.preferences.output_type = 'dataframe'
rate_limiter = TokenBucket(tokens=60, refill_rate=60)

FAST = 12
SLOW = 26
SIGNAL = 9

MIN_POINTS = SLOW + SIGNAL - 1
DAYS_TO_PAD = -(-(MIN_POINTS * 1.5) // 1) # Not every day has data. Round up to nearest integer

def process_symbol_data(symbol, start_date, end_date):
    # Wait until a token is available
    while not rate_limiter.consume():
        time.sleep(0.001)  # Avoid busy-waiting
        
    # Process data for a single symbol with technical indicators
    try:
        # Fetch OHLCV data
        symbol_data_df = obb.equity.price.historical(symbol, start_date=start_date, end_date=end_date)
        symbol_data_df['symbol'] = symbol  # Add symbol column

        # Remove any duplicate dates
        symbol_data_df = symbol_data_df[~symbol_data_df.index.duplicated(keep='first')] 

        # RSI
        symbol_data_df = obb.technical.rsi(data=symbol_data_df, target='close', length=14, scalar=100.0, drift=1)
        symbol_data_df.rename(columns={'close_RSI_14': 'rsi'}, inplace=True)

        # MACD
        symbol_data_df = obb.technical.macd(data=symbol_data_df, target='close', fast=FAST, slow=SLOW, signal=SIGNAL)
        symbol_data_df.rename(columns={f'close_MACD_{str(FAST)}_{str(SLOW)}_{str(SIGNAL)}': 'macd',
                                       f'close_MACDh_{str(FAST)}_{str(SLOW)}_{str(SIGNAL)}': 'macdh',
                                       f'close_MACDs_{str(FAST)}_{str(SLOW)}_{str(SIGNAL)}': 'macds'}, inplace=True)
        
        # Convert 'date' index to regular index
        symbol_data_df.reset_index(inplace=True)

        # News
        symbol_news = finnhub_client.company_news(symbol, _from=start_date, to=end_date)

        # Fix encoding for all text fields in the raw API response
        for article in symbol_news:
            for text_field in ['headline', 'summary', 'source']:
                if text_field in article and article[text_field] is not None:
                    article[text_field] = fix_text(article[text_field])

        group_column = 'date'
        text_columns = ['headline', 'summary', 'source']

        symbol_news_df = (pd.DataFrame(symbol_news)
            .assign(datetime=lambda x: pd.to_datetime(x['datetime'], unit='s', errors="coerce"))
            .dropna(subset=["datetime"])  # Remove invalid rows
            .assign(datetime=lambda x: x["datetime"].dt.strftime("%Y-%m-%d"))
            .rename(columns={'datetime': group_column})
            [[group_column] + text_columns]
        )

        # Ensure 'date' is datetime in both DataFrames
        symbol_data_df['date'] = pd.to_datetime(symbol_data_df['date'])
        symbol_news_df['date'] = pd.to_datetime(symbol_news_df['date'])

        # Aggregate news data to one row per date
        symbol_news_df = symbol_news_df.groupby('date').agg({
            'headline': lambda x: '\n'.join(x.astype(str)),
            'summary': lambda x: '\n'.join(x.astype(str)),
            'source': lambda x: '\n'.join(x.astype(str))
        }).reset_index()

        symbol_data_df = symbol_data_df.merge(symbol_news_df, on='date', how='outer')

        return symbol_data_df
    
    except Exception as e:
        print(f"Error processing {symbol}: {traceback.format_exc()}")
        return pd.DataFrame()

def downloadStockData(symbols, start_date=None, end_date=None, parallel=True):
    try:
        # Fetch S&P 500 data once for all symbols
        sp500_df = obb.equity.price.historical("SPX", start_date=start_date, end_date=end_date)
        sp500_df = sp500_df[['close']].rename(columns={'close': 'SP500'})
        sp500_df.reset_index(inplace=True)
        sp500_df['date'] = pd.to_datetime(sp500_df['date'])

        # Process symbols in parallel or sequentially
        if parallel:
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = [executor.submit(process_symbol_data, symbol, start_date, end_date) for symbol in symbols]
                results = [f.result() for f in futures]
        else:
            results = [process_symbol_data(symbol, start_date, end_date) for symbol in symbols]

        # Combine all symbols and merge with SP500
        combined_df = pd.concat(results)

        final_df = combined_df.merge(sp500_df, on='date', how='outer')        

        # Sort the rows by index within each symbol group
        final_df = final_df.groupby('symbol', group_keys=False).apply(lambda x: x.sort_index())
        final_df.reset_index(inplace=True, drop=True)
    
        return final_df
    
    except Exception as e:
        print(f"Error during download: {traceback.format_exc()}")
        return pd.DataFrame()

# Declare search bounds 
symbols = ['AAPL', 'NVDA']
start_date = '1950-01-01'
end_date = '2025-03-01'

data_df = downloadStockData(symbols, start_date, end_date)

data_df.to_csv('Stock Data.csv')
display(data_df)

| | date | open | high | low | close | volume | symbol | rsi | macd | macdh | macds | headline | summary | source | SP500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-01-02 | 0.39 | 0.39 | 0.38 | 0.38 | 2.024994e+09 | AAPL | | | | | | | | 1108.48 |
| 1 | 2004-01-02 | 0.20 | 0.20 | 0.19 | 0.19 | 1.309248e+09 | NVDA | | | | | | | | 1108.48 |
| 2 | 2004-01-05 | 0.38 | 0.40 | 0.38 | 0.40 | 5.530258e+09 | AAPL | | | | | | | | 1122.22 |
| 3 | 2004-01-05 | 0.20 | 0.20 | 0.19 | 0.20 | 1.725876e+09 | NVDA | | | | | | | | 1122.22 |
| 4 | 2004-01-06 | 0.40 | 0.40 | 0.39 | 0.40 | 7.130872e+09 | AAPL | | | | | | | | 1123.67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10647 | 2025-02-26 | 129.99 | 133.73 | 128.49 | 131.28 | 3.225538e+08 | NVDA | 47.845685 | -0.139424 | -0.190138 | 0.050714 | QTUM: Our Favorite Way To Invest In Quantum Co... | Invest in the Defiance Quantum ETF (QTUM) for ... | SeekingAlpha\nFinnhub\nSeekingAlpha\nSeekingAl... | 5956.06 |
| 10648 | 2025-02-27 | 239.66 | 242.46 | 237.06 | 237.30 | 4.115364e+07 | AAPL | 46.727976 | 2.034598 | 0.185772 | 1.848826 | Auxier Asset Management Winter 2024 Market Com... | Auxier Focus Fund's Investor Class declined 2.... | SeekingAlpha\nSeekingAlpha\nSeekingAlpha\nMark... | 5861.57 |
| 10649 | 2025-02-27 | 134.97 | 135.01 | 120.01 | 120.15 | 4.431758e+08 | NVDA | 38.123629 | -1.153285 | -0.963199 | -0.190086 | Auxier Asset Management Winter 2024 Market Com... | Auxier Focus Fund's Investor Class declined 2.... | SeekingAlpha\nSeekingAlpha\nSeekingAlpha\nSeek... | 5861.57 |
| 10650 | 2025-02-28 | 236.91 | 242.09 | 230.20 | 241.84 | 5.683336e+07 | AAPL | 53.196852 | 1.906182 | 0.045885 | 1.860297 | The AI Smartphone Battle Of Titans: iPhone 16 ... | Apple's iPhone 16 and Samsung's Galaxy S25 mar... | SeekingAlpha\nDowJones\nMarketWatch\nSeekingAl... | 5954.50 |


# Data Preprocessor

We create an extended MinMaxScaler which includes extra headroom for future values. This class is also used to create a sliding window dataset.
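
The effect of the headroom can be seen with a small worked example. This is a simplified one-dimensional sketch, not the class used in this notebook. `buffered_min_max` is a hypothetical helper name, and the prices are made up.

```python
import numpy as np

def buffered_min_max(values, headroom=0.2):
    """Scale to [0, 1] after raising the maximum by `headroom` times the data range."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    hi_buffered = hi + (hi - lo) * headroom  # e.g. 20 + (20 - 10) * 0.2 = 22
    return (values - lo) / (hi_buffered - lo)

prices = np.array([10.0, 15.0, 20.0])
scaled = buffered_min_max(prices, headroom=0.2)
# The training maximum maps to 10/12 (about 0.833) instead of 1.0,
# so a later price of up to 22 would still scale to a value within [0, 1].
print(scaled)
```

With `headroom=0`, this reduces to ordinary min-max scaling, where the training maximum maps to exactly 1.0.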

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

class BufferedMinMaxScaler(MinMaxScaler):
    """Subclass of MinMaxScaler that applies a custom, buffered scale"""
    def __init__(self, headroom=0.2):
        super().__init__()
        self.headroom = headroom

    def fit(self, X, y=None):
        X = np.asarray(X)
        # 1. Store original data min/max
        self.orig_data_min_ = X.min(axis=0)
        self.orig_data_max_ = X.max(axis=0)

        # 2. Calculate buffer-adjusted max
        data_range = self.orig_data_max_ - self.orig_data_min_
        self.data_max_ = self.orig_data_max_ + data_range * self.headroom
        self.data_min_ = self.orig_data_min_  # Keep original min (for now, potentially will need to change)

        # 3. Calculate parent class parameters (data_range_ is for parent class, incorporating the headroom)
        self.data_range_ = self.data_max_ - self.data_min_
        self.scale_ = (self.feature_range[1] - self.feature_range[0]) / self.data_range_
        self.min_ = self.feature_range[0] - self.data_min_ * self.scale_
        return self 

class DataPreprocessor:
    def __init__(self, headroom=0.2):
        # Uses the extra headroom by default
        self.scaler = BufferedMinMaxScaler(headroom=headroom)

    def fit_transform(self, data):
        return self.scaler.fit_transform(data)

    def inverse_transform(self, data):
        return self.scaler.inverse_transform(data)

    def create_dataset(self, dataset, look_back=1, target_feature=0, forecast_horizon=1):
        dataX, dataY = [], []

        for i in range(len(dataset)-look_back-1):
            # Slice all feature columns to get a 2D window of shape (look_back, num_features)
            window = dataset[i:(i+look_back), :]
            dataX.append(window)
            # The target is the full feature row that immediately follows the window
            target = dataset[i + look_back, :]
            dataY.append(target)

        return np.array(dataX), np.array(dataY)

    def trim_XY(self, dataX, dataY, batch_size):
        # Removing any odd data depending on batch size
        trim_size = len(dataX) - (len(dataX) % batch_size)
        return dataX[:trim_size], dataY[:trim_size]

    def invert_1d_prediction(self, pred_1d, feature_cols_num, target_feature=0):
        # If we are looking at ALL the included features
        if pred_1d.shape[1] == feature_cols_num:
            inverted = self.scaler.inverse_transform(pred_1d)
        else:
            # pred_1d shape: (samples,1) -> what the INFERENCE output of the LSTM is
            # (samples, num_features) -> (samples,1) keeping target column
            padded = np.zeros((pred_1d.shape[0], feature_cols_num), dtype=np.float32)
            # [0, 0, 1, 0]
            # [0, 0, 1, 0] <- is essentially this
            # [0, 0, 1, 0]
            if pred_1d.shape[1] == 1:
                padded[:, target_feature] = pred_1d[:, 0]
            else:
                padded[:, target_feature] = pred_1d[:, target_feature]
            inverted = self.scaler.inverse_transform(padded)
        # Return just the "target" column (out of all the feature columns)
        return inverted[:, target_feature].reshape(-1,1)

def filter_multi_features(dataset, stock_rows, feature_cols):
    df_symbol = dataset[dataset['symbol'] == stock_rows].copy()
    df_symbol[feature_cols] = df_symbol[feature_cols].fillna(0)
    full_dataset = df_symbol[feature_cols].values.astype('float32')
    return full_dataset
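
To illustrate the shapes produced by the sliding window, the following sketch reproduces the same loop bounds as `create_dataset` on a tiny made-up array. `sliding_windows` is a hypothetical standalone name used only here.

```python
import numpy as np

def sliding_windows(dataset, look_back):
    """Build (window, next-row) pairs using the same loop bounds as create_dataset."""
    X, y = [], []
    # Note: with these bounds the final possible window is skipped
    for i in range(len(dataset) - look_back - 1):
        X.append(dataset[i:i + look_back, :])   # (look_back, num_features)
        y.append(dataset[i + look_back, :])     # the row right after the window
    return np.array(X), np.array(y)

data = np.arange(12, dtype=float).reshape(6, 2)  # 6 timesteps, 2 features
X, y = sliding_windows(data, look_back=2)
print(X.shape, y.shape)  # (3, 2, 2) (3, 2)
```

Each input sample is therefore a 2D window, and stacking the samples gives the 3D `(samples, look_back, num_features)` array that the LSTM expects.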

# LSTM Model

The LSTM is implemented below:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, InputLayer, RepeatVector, TimeDistributed

class LSTMModel:
    def __init__(self, layers, look_back, batch_size, neurons, epochs, activation, dropout, features=1, isReturnSeq=False, forecast_horizon=1):
        self.layers = layers
        self.isReturnSeq = isReturnSeq # should be either True or False
        self.features = features
        self.look_back = look_back
        self.batch_size = batch_size
        self.neurons = neurons
        self.epochs = epochs
        self.activation = activation # should be either the str 'tanh' or 'relu'
        self.dropout = dropout
        self.model = self._build_model(self.batch_size, forecast_horizon) 
        # What the class will fill
        self.trainPredict = None

    def _build_model(self, batch_size, forecast_horizon):
        # SHOULD ONLY HAVE ONE MODEL IN MEMORY (TF handles the models in memory strangely)
        # So clear_session is to clear the way tf stores/handles the models
        tf.keras.backend.clear_session()
        model = Sequential()
        # batch_input_shape (batch_size, num_steps, features)
        model.add(InputLayer(batch_input_shape=(batch_size, self.look_back, self.features)))
        # the more complex the data -> more neurons needed
        if self.layers > 1:
            for l in range(self.layers-1):
                model.add(LSTM(self.neurons, activation=self.activation, dropout=self.dropout, stateful=True, return_sequences=True))
            # In multi-layered the last is non-stateful ***
            model.add(LSTM(self.neurons, activation=self.activation, dropout=self.dropout, return_sequences=self.isReturnSeq))
        else:
            model.add(LSTM(self.neurons, activation=self.activation, dropout=self.dropout, stateful=True, return_sequences=self.isReturnSeq))
        # One output per input feature, so predictions can be fed back in as inputs
        model.add(Dense(self.features))
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.summary()
        return model

    def reset_model_states(self):
        for layer in self.model.layers:
            if isinstance(layer, LSTM) and layer.stateful:
                layer.reset_states()
        print('Model states reset')

    def train(self, trainX, trainY):
        self.reset_model_states()

        for i in range(self.epochs):
            # Unlike CNNs, for RNNs (LSTM fitted) we do not shuffle
            self.model.fit(trainX, trainY, 
                        epochs=1, 
                        batch_size=self.batch_size, 
                        shuffle=False, 
                        verbose=2)
            
            # Resetting states after each epoch for stateful LSTM
            self.reset_model_states()
            print(f"Epoch {i+1}/{self.epochs} --- Completed")

    def predict(self, trainX):
        # X is the training data
        self.trainPredict = self.model.predict(trainX, batch_size=self.batch_size)
        return self.trainPredict

# Forecast Engine

The Forecast Engine allows us to create longer predictions by feeding the last prediction into the input of the next one, in a rolling window fashion.
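
The rolling-window idea can be sketched independently of the LSTM. In the example below, `rolling_forecast` and `linear_extrapolation` are hypothetical names, and the stand-in predictor simply continues the last observed difference. It only serves to show how the window is updated at each step.

```python
import numpy as np

def rolling_forecast(predict_next, window, steps):
    """Forecast `steps` ahead by feeding each prediction back into the window.

    predict_next: callable mapping a (look_back, features) window to a
                  (features,) vector for the next timestep (stand-in for the model).
    """
    window = np.array(window, dtype=float)
    predictions = []
    for _ in range(steps):
        next_step = np.asarray(predict_next(window), dtype=float)
        predictions.append(next_step)
        # Drop the oldest timestep and append the new prediction
        window = np.concatenate([window[1:], next_step[np.newaxis, :]], axis=0)
    return np.array(predictions)

def linear_extrapolation(window):
    """Stand-in model: continue the most recent difference."""
    return window[-1] + (window[-1] - window[-2])

history = np.array([[1.0], [2.0], [3.0]])  # look_back = 3, one feature
forecast = rolling_forecast(linear_extrapolation, history, steps=3)
print(forecast.ravel())  # [4. 5. 6.]
```

Because every step after the first depends on earlier predictions rather than observed data, errors can accumulate as the horizon grows.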

In [None]:
import math
import numpy as np
import tensorflow as tf

class ForecastEngine(LSTMModel):
    def __init__(self, trained_model, layers=None, isReturnSeq=True, features=None, look_back=None, batch_size=None, neurons=None, epochs=None, activation=None, dropout=None):
        params = {'layers': layers,
                'isReturnSeq' : isReturnSeq,
                'features' : features,
                'look_back': look_back,
                'batch_size': batch_size,
                'neurons': neurons,
                'epochs': epochs,
                'activation': activation,
                'dropout': dropout}

        # Use explicitly passed parameters; fall back to the trained_model's values otherwise
        for param_name in params:
            if params[param_name] is None:
                params[param_name] = getattr(trained_model, param_name)

        # Initialize LSTMModel with inherited parameters
        super().__init__(layers=params['layers'],
                         isReturnSeq=params['isReturnSeq'],
                         features=params['features'],
                         look_back=params['look_back'],
                         batch_size=params['batch_size'],
                         neurons=params['neurons'],
                         epochs=params['epochs'],
                         activation=params['activation'],
                         dropout=params['dropout'])

        self.trained_model = trained_model
        # What the class will fill
        self.futurePredictions = None

    def forecast(self, start_input, steps, target_col_idx=0):
        # Use _build_model from LSTMModel to create a new model for forecasting
        forecast_model = self._build_model(self.batch_size, forecast_horizon=steps)  # Rebuild model for forecasting

        # Set weights from the trained model
        forecast_model.set_weights(self.trained_model.model.get_weights())

        new_predictions = []
        current_batch = start_input[-self.batch_size:]  # Get the last batch for prediction

        for i in range(steps):
            # Make a prediction for the next step
            pred = forecast_model.predict(current_batch, batch_size=self.batch_size)

            new_predictions.append(pred[-1, -1, target_col_idx])

            # Take the last inferred timestep for every sequence and reshape to 3D (batch, 1, features)
            new_step = pred[:, -1, :].reshape(self.batch_size, 1, -1)

            # Update the batch for the next prediction step: drop the oldest timestep in the
            # look-back window and append the newly inferred values
            current_batch = np.concatenate([current_batch[:, 1:, :], new_step], axis=1)

        # Convert predictions to a numpy array (predictions_array)
        self.futurePredictions = np.array(new_predictions).reshape(-1, 1)

        # ***** May need to manage resetting the states better, however may not be necessary 

        return self.futurePredictions

# TAF Shift

To apply the TAF, we calculate the smoothed error and trend factors used to adjust the predictions. A grid search function then finds the alpha, beta, and weight parameters that minimize the prediction error against the test data.
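
Written out, the recurrences that the `calculate_taf` method in this section appears to implement are the following. The symbols are notation chosen for this note: $A_t$ is the value at step $t$ of the concatenated history and forecast, $S_t$ the smoothed forecast, $T_t$ the smoothed trend, $\hat{y}_t$ the LSTM forecast, and $w$ the weight.

$$
\begin{aligned}
S_0 &= A_0, \qquad T_0 = A_1 - A_0 \\
\mathrm{TAF}_t &= S_{t-1} + T_{t-1} \\
S_t &= \mathrm{TAF}_t + \alpha \, (A_t - \mathrm{TAF}_t) \\
T_t &= T_{t-1} + \beta \, (\mathrm{TAF}_t - \mathrm{TAF}_{t-1} - T_{t-1}) \\
\hat{y}^{\,\mathrm{adj}}_t &= \hat{y}_t + w \cdot \mathrm{TAF}_t
\end{aligned}
$$

The last line corresponds to `apply_taf`. With $w = 0$, the adjusted forecast equals the unadjusted LSTM forecast.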

In [None]:
import numpy as np

class TAFShift:
    def __init__(self, alpha=0.5, beta=0.5):
        self.alpha = alpha  # Smoothing constant for error TAF
        self.beta = beta    # Smoothing constant for trend in TAF

    def calculate_taf(self, data, predictions):
        """Return (full_taf, predicted_taf): TAF values for the whole series and for the forecast portion only."""
        data = np.asarray(data).flatten()
        predictions = np.asarray(predictions).flatten()

        # Create a single array with all datapoints
        combined_data = np.concatenate([data, predictions])
        n = len(combined_data)

        st = np.zeros(n)          # Smoothed forecast (previous TAF plus smoothed error)
        tt = np.zeros(n)          # Smoothed trend factor
        taf_values = np.zeros(n)  # taf_values[0] is left at 0

        # Use the first timestep as the initial smoothed forecast and
        # the first difference as the initial trend
        st[0] = combined_data[0]
        tt[0] = combined_data[1] - combined_data[0]
        
        # Calculate TAF values for the complete dataset:
        # https://courses.worldcampus.psu.edu/welcome/mangt515/lesson02_13.html
        for t in range(1, n):
            taf_values[t] = st[t-1] + tt[t-1]
            st[t] = taf_values[t] + self.alpha*(combined_data[t] - taf_values[t])
            tt[t] = tt[t-1] + self.beta*(taf_values[t] - taf_values[t-1] - tt[t-1])

        return taf_values, taf_values[len(data):]

    def apply_taf(self, historical_data, forecasted, normalize=False, weight=0.0):
        _, predicted_taf = self.calculate_taf(historical_data, forecasted)
        # calculate_taf flattens its inputs, so reshape back to (n, 1) to match the forecast
        predicted_taf = predicted_taf.reshape(-1, 1)

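        # Robust-style scaling: center on the median and divide by a percentile spread
        # (22nd to 95th percentile here, rather than the usual 25th to 75th IQR)
        if normalize:
            median_taf = np.median(predicted_taf)
            p_low, p_high = np.percentile(predicted_taf, [22, 95])
            spread = p_high - p_low
            if spread < 1e-6:
                # Fall back to the standard deviation (or 1.0) when the spread is degenerate
                spread = np.std(predicted_taf) if np.std(predicted_taf) > 0 else 1.0
            predicted_taf = (predicted_taf - median_taf) / spread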

        adjusted_forecast = forecasted + weight * predicted_taf
        return adjusted_forecast
    

def taf_search_test(calculate_rmse, historical_data, forecasted, test_data, normalize=False):
    """Function for getting optimal TAF"""
    optimal_TAF_params = {}
    alpha_range = np.arange(0.0, 1.0, 0.1)
    optimal_alpha = 0
    beta_range = np.arange(0.0, 1.0, 0.1)
    optimal_beta = 0
    weight_range = np.arange(0.0, 0.2, 0.01)
    optimal_weight = 0

    # Exhaustive grid search over alpha, beta, and weight; keep the combination with the lowest RMSE
    lowest_rmse = float('inf')
    optimal_taf = TAFShift()
    optimalTAF_forecast = None
    for a in alpha_range:
        for b in beta_range:
            taf_shift = TAFShift(alpha=a, beta=b)
            for w in weight_range:
                adjusted_forecast = taf_shift.apply_taf(historical_data, forecasted, normalize, weight=w)
                rmse = calculate_rmse(adjusted_forecast[:, 0], test_data)
                if rmse < lowest_rmse:
                    lowest_rmse = rmse
                    optimal_alpha = a
                    optimal_beta = b
                    optimal_weight = w
                    optimalTAF_forecast = adjusted_forecast

    print(f"TAF alpha: {optimal_alpha}, beta: {optimal_beta}, weight: {optimal_weight} | RMSE={lowest_rmse:.2f}")

    optimal_TAF_params[(optimal_alpha, optimal_beta, optimal_weight)] = (lowest_rmse, optimalTAF_forecast)
    return optimal_TAF_params

# Cross Validation

This code implements time series cross-validation for forecasting models with a focus on LSTM networks. It preprocesses data, trains models, generates forecasts, and evaluates predictions using RMSE metrics, with optional TAF optimization to improve forecast accuracy through parameter tuning.
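
The index arithmetic that decides where training ends and testing begins can be checked by hand. The sketch below mirrors the steps in `time_series_cross_validation` (window creation, trimming to a multiple of the batch size, then adding the look-back) and includes a plain NumPy RMSE. `fold_indices` and `rmse` are hypothetical helper names, and the numbers are made up.

```python
import numpy as np

def fold_indices(initial_train_size, look_back, batch_size, forecast_horizon, start=0):
    """Return (training samples kept, train end index, test end index)."""
    windows = initial_train_size - look_back - 1   # samples produced by the sliding window
    trimmed = windows - (windows % batch_size)     # keep a whole number of batches
    train_end = start + trimmed + look_back
    test_end = train_end + forecast_horizon
    return trimmed, train_end, test_end

def rmse(true_values, predicted_values):
    """Root mean squared error."""
    diff = np.asarray(true_values, dtype=float) - np.asarray(predicted_values, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

print(fold_indices(100, 10, 16, 5))  # (80, 90, 95)
print(rmse([1, 2, 3], [1, 2, 5]))
```

In this example, 89 windows are trimmed to 80 (five batches of 16), so the test segment covers indices 90 to 94 rather than starting at index 100.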

In [None]:
import math
import numpy as np
import tensorflow as tf
from sklearn.metrics import mean_squared_error

def calculate_rmse(true_values, predicted_values):
    return np.sqrt(mean_squared_error(true_values, predicted_values))

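def time_series_cross_validation(curr_dataset, model_params, forecast_horizon, initial_train_size, step_size, target_feature_col=0, OPTIMAL=False):
    """
    Run a single train/forecast fold and evaluate it with RMSE.

    curr_dataset: the full dataset (2D array of shape (n_samples, n_features))
    model_params: dict with keys: layers, features, look_back, batch_size, neurons, epochs, activation, dropout, headroom
    forecast_horizon: number of points to forecast
    initial_train_size: the number of samples used for training
    step_size: number of samples to roll forward between folds (currently unused)
    target_feature_col: column index of the feature to evaluate
    OPTIMAL: if True, also grid-search the TAF parameters (alpha, beta, weight)

    Returns: ([data_preprocessor, lstm_model, forecast_engine], train_predict_inverted,
              effective_train_end, test_end, forecasted_inverted, rmse_taf_preTAF),
             plus the TAF search results when OPTIMAL is True
    """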
    tf.keras.backend.clear_session()
    curr_dataset = np.nan_to_num(curr_dataset, nan=0.0)
    rmse_list = []
    n = len(curr_dataset)
    start = 0

    train_data = curr_dataset[start:start+initial_train_size]

    data_preprocessor = DataPreprocessor(headroom=model_params['headroom'])
    scaled_train = data_preprocessor.fit_transform(train_data)

    

    dataX, dataY = data_preprocessor.create_dataset(scaled_train, 
                                                    look_back=model_params['look_back'],
                                                    target_feature=target_feature_col)
    if len(dataX) == 0:
        raise ValueError("No training/data samples generated (dataX). Look back may be too big")
    dataX, dataY = data_preprocessor.trim_XY(dataX, dataY, model_params['batch_size'])

    effective_train_samples = len(dataX)
    effective_train_end = start + effective_train_samples + model_params['look_back']

    test_end = effective_train_end + forecast_horizon
    test_data = curr_dataset[effective_train_end:test_end]
    print('test_data shape: ', test_data.shape)

    print('*********** Starting New Cross-Validation ***********')
    print(f"Fold (Cross Validation) with train indices {start}:{effective_train_end} and test indices {effective_train_end}:{test_end}")

    # dataX is already 3D with shape (samples, look_back, num_features), so no reshape is needed
    trainX = dataX
    print('trainX shape (After Reshape): ', trainX.shape)
    trainY = dataY

    lstm_model = LSTMModel(layers=model_params['layers'],
                            isReturnSeq=False,
                            features=model_params['features'],
                            look_back=model_params['look_back'],
                            batch_size=model_params['batch_size'],
                            neurons=model_params['neurons'],
                            epochs=model_params['epochs'],
                            activation=model_params['activation'],
                            dropout=model_params['dropout'],
                            forecast_horizon=forecast_horizon)
    lstm_model.train(trainX, trainY)
    trainPredict = lstm_model.predict(trainX)

    forecast_engine = ForecastEngine(trained_model=lstm_model, isReturnSeq=True)
    forecastPredict = forecast_engine.forecast(trainX, forecast_horizon, target_feature_col)
    print('Forecast inferred data shape: ', forecastPredict.shape)
    print("NaNs in forecastPredict:", np.isnan(forecastPredict).sum())

    # Invert the scaling for the forecast, train and test data
    forecasted_inverted = data_preprocessor.invert_1d_prediction(forecastPredict, model_params['features'], target_feature_col)
    print("NaNs in forecasted_inverted:", np.isnan(forecasted_inverted).sum())

    # Ensure forecasted_inverted has the same number of rows as test_data;
    # mismatches can occur because of the batch size
    num_test_samples = test_data.shape[0]
    if forecasted_inverted.shape[0] > num_test_samples:
        forecasted_inverted = forecasted_inverted[:num_test_samples]
    elif forecasted_inverted.shape[0] < num_test_samples:
        test_data = test_data[:forecasted_inverted.shape[0]]

    train_predict_inverted = 0

    rmse_taf_preTAF = calculate_rmse(forecasted_inverted[:, 0], test_data[:, target_feature_col])
    print('pre TAF: ', rmse_taf_preTAF)

    if OPTIMAL:
        # Pass the history in original units so it matches forecasted_inverted and test_data
        rmse_TAF_results = taf_search_test(calculate_rmse, train_data[:effective_train_end, target_feature_col], forecasted_inverted, test_data[:, target_feature_col], normalize=False)
        return [data_preprocessor, lstm_model, forecast_engine], train_predict_inverted, effective_train_end, test_end, forecasted_inverted, rmse_taf_preTAF, rmse_TAF_results
    else:
        return [data_preprocessor, lstm_model, forecast_engine], train_predict_inverted, effective_train_end, test_end, forecasted_inverted, rmse_taf_preTAF

# Visualizer

This code plots the forecasted predictions against the test data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

class Visualizer:
    def __init__(self, scaler: 'BufferedMinMaxScaler', trained_model: 'LSTMModel', forecast_engine: 'ForecastEngine'):
        """Instance needs to be of || BufferedMinMaxScaler | LSTMModel | ForecastEngine || type"""
        if not isinstance(scaler, BufferedMinMaxScaler):
            raise TypeError("Expected an instance of BufferedMinMaxScaler.")
        if not isinstance(trained_model, LSTMModel):
            raise TypeError("Expected an instance of LSTMModel.")
        if not isinstance(forecast_engine, ForecastEngine):
            raise TypeError("Expected an instance of ForecastEngine.")
        self.scaler = scaler
        self.trained_model = trained_model
        self.forecast_engine = forecast_engine

    def create_plot_array(self, full_length, location_num, value_arr):
        plot_array = np.zeros((len(full_length), 1))
        plot_array[location_num:location_num+len(value_arr)] = value_arr
        return plot_array

    def plot_results(self, rmse, train_predictions_inverted, train_end, test_end, future_predictions_inverted, curr_dataset, curr_system, target_col, curr_dir, TAFvars):
        """Both Train and Forecast must be given INVERTED, following the .predict/forecast output"""
        # Extract parameters from the trained model
        look_back = self.trained_model.look_back
        batch_size = self.trained_model.batch_size
        neurons = self.trained_model.neurons
        epochs = self.trained_model.epochs
        dropout = self.trained_model.dropout
        train_predictions = self.trained_model.trainPredict

        # Extract parameters from the forecasted data
        layers = self.forecast_engine.layers
        future_predictions = self.forecast_engine.futurePredictions

        # Extract parameters from the scaler
        headroom = self.scaler.headroom

        # Create full time array for x-axis
        full_time = np.arange(len(future_predictions_inverted))

        # Create plot array for plot values
        plot_array_forecast = self.create_plot_array(full_time, 0, future_predictions_inverted)

        # Plotting
        plt.figure(figsize=(12, 6))

        plt.plot(full_time[:], curr_dataset[train_end:test_end, target_col], color='green', linewidth=1.5, alpha=0.95, label='Given Data (TEST)')
        plt.plot(full_time[:], plot_array_forecast[:], color='red', linestyle='--', linewidth=1.5, alpha=0.75, label='Future Predictions')

        # Add labels and title
        plt.xlabel('Time (Days)')
        plt.ylabel('Value (Points)')
        plt.title(f'{curr_system}, RMSE: {rmse:.2f} | TAF A:{TAFvars[0]} B:{TAFvars[1]} W:{TAFvars[2]}')
        
        plt.legend()

        save_dir = f"{curr_dir}/"
        
        plt.savefig(f"{save_dir}{curr_system}_predictions Lr_{layers} H_{headroom} N_{neurons} B_{batch_size} L_{look_back} E_{epochs} D_{dropout} (TAF A {TAFvars[0]} B {TAFvars[1]} W {TAFvars[2]}).png")
        
        plt.close()
        
        print(f"Saved plot for column: {curr_system}")

# Run Tests

This code runs the forecasting pipeline.

In [None]:
import time
import math
import traceback
import numpy as np
import pandas as pd
import tensorflow as tf
from bayes_opt import BayesianOptimization
import matplotlib.pyplot as plt

# Enabling multi-GPU usage on 1 node
# gpu_strategy = tf.distribute.MirroredStrategy()
# print(f"Number of GPUs Available: {gpu_strategy.num_replicas_in_sync}")

gpus = len(tf.config.list_physical_devices('GPU'))
print(f"Num GPUs Available: {gpus}")
#print(f"Worker (1 task per node) {os.environ.get('SLURM_PROCID', 'N/A')} sees {gpus} GPU(s).")

# ------------------------------
#          Run Script
# ------------------------------
def run():
    
    data_dir = 'data/'
    results_dir = 'Model/results/'

    curr_dir = 'results1_long'

    # 1. Load and Prepare Data
    # Ensure the CSV is divided into columns named 'System1', 'System2', etc.
    file = 'Stock Data.csv'
    try:
        df = pd.read_csv(data_dir+file)
    except FileNotFoundError:
        print(f"Error: '{file}' not found. Place it in the '{data_dir}' directory.")
        return

    # ==================== Global Parameters ====================
    feature_cols = ['open',
                    'high',
                    'low',
                    'close',
                    'volume',
                    'rsi',
                    'macd',
                    'macdh',
                    'macds']
    target_feature_col = 3
    features = len(feature_cols)
    
    # Fixed parameters
    batch_size = 256
    headroom = 1.0
    dropout = 0.0
    layers = 2
    neurons = 100
    activation = 'tanh'
    # activation = 'relu' # NO Good!
    
    # Forecasting and dataset parameters
    forecast_horizon = 60   # Number of future points to forecast per fold
    initial_train_size = 4500
    step_size = 0            # For rolling window (0 -> no rolling)

    stocks = ['AAPL', 'NVDA']
    

    for stock in stocks:
        if stock != 'AAPL':
            continue

        curr_dataset = filter_multi_features(df, stock, feature_cols)

        # ==================== Bayesian Optimization Setup ====================
        # Define the objective function for Bayesian Optimization
        def objective(look_back, epochs):
            look_back = int(look_back)
            epochs = int(epochs)
            
            model_params = {
                'features': features,
                'look_back': look_back,
                'batch_size': batch_size,
                'epochs': epochs,
                'headroom': headroom,
                'dropout': dropout,
                'layers': layers,
                'neurons': neurons,
                'activation': activation
            }
            
            try:
                # Run cross-validation (non-TAF version) and obtain RMSE.
                # time_series_cross_validation (FALSE) is expected to return:
                # model_components, train_predict_inverted, effective_train_end, test_end, forecasted_inverted, rmse_taf_preTAF
                _, _, _, _, _, rmse = time_series_cross_validation(
                    curr_dataset, model_params, forecast_horizon, initial_train_size, step_size, target_feature_col, False)
            except Exception as e:
                print("Error during evaluation:", e)
                traceback.print_exc()
                rmse = 1e6

            # BayesianOptimization maximizes the objective so return negative RMSE.
            return -rmse

        # *****************************
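        # Define parameter bounds. Both ranges are pinned to the best values found in
        # earlier runs; widen them (low < high) to run an actual search.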
        pbounds = {
            'look_back': (2786, 2786),
            'epochs': (348, 348)
        }
        # *****************************
        
        optimizer = BayesianOptimization(
            f=objective,
            pbounds=pbounds,
            random_state=42  # For reproducibility
        )
        
        print("Starting Bayesian optimization for:", stock)
        optimizer.maximize(
            init_points=5,   # Number of random initialization points
            n_iter=20        # Number of iterations for the optimization
        )
        
        best_params = optimizer.max['params']
        optimal_look_back = int(best_params['look_back'])
        optimal_epochs = int(best_params['epochs'])
        print(f"Optimal parameters for {stock}: look_back={optimal_look_back}, epochs={optimal_epochs}")
        
        # ==================== Run Cross-Validation with Optimal Parameters ====================
        optimal_model_params = {
            'features': features,
            'look_back': optimal_look_back,
            'batch_size': batch_size,
            'epochs': optimal_epochs,
            'headroom': headroom,
            'dropout': dropout,
            'layers': layers,
            'neurons': neurons,
            'activation': activation
        }
        
        model, train_data_inverted, train_end, test_end, non_taf_forecast, rmse_non_taf, rmse_TAFs = time_series_cross_validation(curr_dataset, optimal_model_params, forecast_horizon, initial_train_size, step_size, target_feature_col, True)
        
        visualizer = Visualizer(scaler=model[0].scaler,
                                trained_model=model[1],
                                forecast_engine=model[2])
        visualizer.plot_results(rmse_non_taf, train_data_inverted, train_end, test_end, non_taf_forecast, curr_dataset, stock, target_feature_col, results_dir+curr_dir, [0, 0, 0])
        print("|=====================================|")
        print("Cross-Validation RMSEs (Non-TAF):", rmse_non_taf)
        print("|=====================================|")
        for (alpha, beta, weight), (rmse_taf, adjusted_forecast) in rmse_TAFs.items():
            visualizer.plot_results(rmse_taf, train_data_inverted, train_end, test_end, adjusted_forecast, curr_dataset, stock, target_feature_col, results_dir+curr_dir, [alpha, beta, weight])
            print("|=====================================|")
            print("Cross-Validation RMSEs (TAF):", rmse_taf)
            print("|=====================================|")

if __name__ == "__main__":
    run()

# News Embeddings

In [None]:
import os
import pandas as pd
import numpy as np
from google import genai

# --- Gemini API Embedding Function ---
# Read the key from the environment instead of hard-coding it in the notebook.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def get_article_embedding(article_text):
    """
    Uses the Gemini API to embed the article text.
    Returns a numpy array representing the embedding.
    """
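    result = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=article_text
    )
    # embed_content returns one embedding object per input; take the vector of the single input.
    return np.array(result.embeddings[0].values)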


def generate_target_label(article_text):
    """
    Generates a target label for an article based on simple heuristics.
    Returns:
        label: A list in the format [relevance, up_weight, down_weight, unchanged_weight]
    """
    text = article_text.lower()
    
    relevance_terms = [
        # High Confidence Names (1.0)
        ("apple inc.", 1.0),
        ("apple incorporated", 1.0),
        ("apple computer", 1.0),
        ("apple corporation", 1.0),
        ("apple co.", 1.0),
        ("apple headquarters", 1.0),
        ("cupertino-based company", 1.0),

        # Stock Ticker
        ("aapl", 0.5),
        ("aapl shares", 0.75),
        ("aapl stock", 0.75),
        ("nasdaq:aapl", 0.75),

        # General Apple Mentions (0.75)
        ("apple", 0.75),
        ("apple brand", 0.75),
        ("apple products", 0.75),
        ("apple ecosystem", 0.75),

        # Key People
        ("steve jobs", 0.75),
        ("tim cook", 0.75),
        ("jony ive", 0.75),
        ("phil schiller", 0.75),
        ("craig federighi", 0.75),
        ("luca maestri", 0.75),

        # Major Products
        ("iphone", 0.75),
        ("iphone 15", 0.75),
        ("iphone 14", 0.75),
        ("iphone pro", 0.75),
        ("ipad", 0.75),
        ("ipad pro", 0.75),
        ("ipad air", 0.75),
        ("macbook", 0.75),
        ("macbook pro", 0.75),
        ("macbook air", 0.75),
        ("mac studio", 0.75),
        ("mac pro", 0.75),
        ("mac mini", 0.75),
        ("mac os", 0.75),
        ("macos", 0.75),

        # Services
        ("apple music", 0.75),
        ("apple tv", 0.75),
        ("apple tv+", 0.75),
        ("apple arcade", 0.75),
        ("apple news", 0.75),
        ("apple fitness+", 0.75),
        ("icloud", 0.75),
        ("apple id", 0.75),

        # Hardware
        ("apple watch", 0.75),
        ("apple pencil", 0.75),
        ("apple silicon", 0.75),
        ("m1 chip", 0.75),
        ("m2 chip", 0.75),
        ("m3 chip", 0.75),
        ("t2 chip", 0.75),
        ("homepod", 0.75),
        ("airpods", 0.75),
        ("airpods pro", 0.75),
        ("airpods max", 0.75),
        ("vision pro", 0.75),
        ("apple glasses", 0.75),

        # Operating Systems
        ("ios", 0.75),
        ("ios 17", 0.75),
        ("ipados", 0.75),
        ("macos ventura", 0.75),
        ("macos sonoma", 0.75),
        ("watchos", 0.75),
        ("tvos", 0.75),

        # Financials and Market
        ("apple stock", 0.75),
        ("apple shares", 0.75),
        ("apple earnings", 0.75),
        ("apple revenue", 0.75),
        ("apple profits", 0.75),
        ("apple forecast", 0.75),
        ("apple quarterly results", 0.75),
        ("apple guidance", 0.75),
        ("apple investors", 0.75),
        ("apple market cap", 0.75),
        ("apple dividends", 0.75),
        ("apple buybacks", 0.75),

        # News & Events
        ("apple event", 0.75),
        ("apple keynote", 0.75),
        ("wwdc", 0.75),
        ("apple launch", 0.75),
        ("spring event", 0.75),
        ("fall event", 0.75),

        # Legal & Regulatory
        ("apple lawsuit", 0.75),
        ("apple antitrust", 0.75),
        ("apple vs epic", 0.75),
        ("apple regulation", 0.75),
        ("apple privacy", 0.75),

        # Business Activities
        ("apple acquisition", 0.75),
        ("apple partnership", 0.75),
        ("apple investment", 0.75),
        ("apple r&d", 0.75),
        ("apple supply chain", 0.75),
        ("apple manufacturing", 0.75),
        ("foxconn", 0.75),
        ("apple retail", 0.75),
        ("apple online store", 0.75),

        # Technology & Innovation
        ("apple innovation", 0.75),
        ("apple ai", 0.75),
        ("apple machine learning", 0.75),
        ("apple chips", 0.75),
        ("apple ar", 0.75),
        ("apple vr", 0.75),
        ("apple car", 0.75),
        ("project titan", 0.75),
        ("apple security", 0.75),
        ("apple encryption", 0.75),

        # Software & Apps
        ("apple store", 0.75),
        ("app store", 0.75),
        ("apple developer", 0.75),
        ("xcode", 0.75),
        ("apple sdk", 0.75),
        ("testflight", 0.75),
        ("apple beta", 0.75),

        # Medium Relevance (0.5)
        ("smartphone market", 0.5),
        ("consumer electronics", 0.5),
        ("wearables", 0.5),
        ("tech giant", 0.5),
        ("big tech", 0.5),
        ("us tech stock", 0.5),
        ("silicon valley", 0.5),
        ("hardware company", 0.5),
        ("tablet market", 0.5),
        ("smartwatch sales", 0.5),
        ("mobile os", 0.5),
        ("voice assistant", 0.5),
        ("supply chain disruption", 0.5),

        # Low Relevance (0.25)
        ("consumer trends", 0.25),
        ("technology adoption", 0.25),
        ("semiconductor trends", 0.25),
        ("software update", 0.25),
        ("ai assistant", 0.25),
        ("smart home", 0.25),
        ("us markets", 0.25),
        ("tech stocks", 0.25),
        ("electronics retail", 0.25),
        ("app monetization", 0.25),
        ("digital marketplace", 0.25),
        ("cloud storage", 0.25),
        ("eco-friendly tech", 0.25),

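        # Irrelevant (0.0). These never raise the max-based score below, so they do not
        # override a match such as "apple" inside "apple pie" or "pineapple".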
        ("banana", 0.0),
        ("orange fruit", 0.0),
        ("pineapple", 0.0),
        ("fruit basket", 0.0),
        ("orchard", 0.0),
        ("apple pie", 0.0),
        ("apple cider", 0.0),
        ("fruit nutrition", 0.0),
        ("apple picking", 0.0),
        ("red delicious", 0.0),
    ]

    # --- Relevance Score ---
    relevance = 0.0  # Default relevance
    for term, score in relevance_terms:
        if term in text:
            relevance = max(relevance, score)

    # --- Directional Weights ---
    # Define keywords for directional sentiment related to business performance.
    up_terms = [
        "growth", "profit", "increase", "rise", "gain", "surge", "expansion",
        "record", "strong", "improved", "bullish", "beat", "outperform", "upgrade"
    ]
    down_terms = [
        "loss", "decline", "drop", "fall", "decrease", "slump", "weak",
        "bearish", "downgrade", "miss", "underperform", "cut", "reduction"
    ]
    unchanged_terms = [
        "stable", "steady", "unchanged", "flat", "consistent", "no change", "sideways"
    ]
    
    # Count occurrences for each category
    def count_terms(text, terms):
        count = 0
        for term in terms:
            count += text.count(term)
        return count

    up_count = count_terms(text, up_terms)
    down_count = count_terms(text, down_terms)
    unchanged_count = count_terms(text, unchanged_terms)
    
    total = up_count + down_count + unchanged_count
    if total == 0:
        # If no keywords found, default to a neutral distribution
        up_weight, down_weight, unchanged_weight = 0.0, 0.0, 1.0
    else:
        up_weight = up_count / total
        down_weight = down_count / total
        unchanged_weight = unchanged_count / total

    # Return the label in the format: [relevance, up, down, unchanged]
    return [relevance, up_weight, down_weight, unchanged_weight]


class ArticleProcessor:
    def __init__(self, excel_path, sheet_name=None):
        """
        Loads the Excel file into a DataFrame.
        Assumes the first column is a date and subsequent columns contain article text.
        """
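        # pandas returns a dict of DataFrames when sheet_name is None, so fall back to the first sheet.
        self.data = pd.read_excel(excel_path, sheet_name=0 if sheet_name is None else sheet_name)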


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split

class ArticleWeightPredictor:
    def __init__(self, input_dim):
        self.input_dim = input_dim
        self.model = self._build_model()
        
    def _build_model(self):
        """
        Constructs a neural network mapping embeddings to four outputs:
           - Relevance score (0 to 1)
           - Direction weights (stock up, stock down, stock unchanged) that sum to 1
        """
        input_layer = layers.Input(shape=(self.input_dim,))
        # Shared hidden layers
        x = layers.Dense(128, activation='relu')(input_layer)
        x = layers.Dense(64, activation='relu')(x)
        
        relevance = layers.Dense(1, activation='sigmoid', name='relevance')(x)
        
        direction_logits = layers.Dense(3, name='direction_logits')(x)
        # Softmax ensures the three direction weights sum to 1
        direction = layers.Activation('softmax', name='direction')(direction_logits)
        
        output = layers.Concatenate(name='output')([relevance, direction])
        
        model = models.Model(inputs=input_layer, outputs=output)
        model.compile(optimizer='adam', loss='mean_squared_error')
        model.summary()
        return model
    
    def train(self, x, y, epochs=50, batch_size=32, validation_split=0.2):
        """
        Parameters:
          - x: Array of embeddings (shape: [num_samples, input_dim])
          - y: Array of target outputs (shape: [num_samples, 4])
               where the first element is the relevance score, and the next three are directional weights.
        """
        x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=validation_split, random_state=42)
        history = self.model.fit(
            x_train,
            y_train,
            validation_data=(x_val, y_val),
            epochs=epochs,
            batch_size=batch_size,
            callbacks=[
                tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
                tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=3)
            ]
        )
        return history
    
    def predict(self, embedding):
        """Output is 4 dimensional"""
        return self.model.predict(np.array([embedding]))
    
    def save_model(self, filepath):
        self.model.save(filepath)
    
    @classmethod
    def load_model(cls, filepath, input_dim):
        predictor = cls(input_dim)
        predictor.model = tf.keras.models.load_model(filepath)
        return predictor
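
The shape of the network's output can be sketched without TensorFlow. The example below is a minimal illustration, assuming the standard definitions of sigmoid and softmax; the function `combine_heads` and the input values are hypothetical and only mimic how the `relevance` and `direction` heads are concatenated into one 4-element vector.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def combine_heads(relevance_logit, direction_logits):
    """Return [relevance, up, down, unchanged], mirroring the concatenated output layer."""
    return [sigmoid(relevance_logit)] + softmax(direction_logits)

# Hypothetical pre-activation values for one article.
output = combine_heads(0.0, [2.0, 1.0, 0.0])
print(output)
print("direction weights sum to:", sum(output[1:]))
```

Because the relevance score and the direction weights live on different scales of meaning, a single mean-squared-error loss over the concatenated vector treats them uniformly; separate losses per head would be an alternative design.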

In [None]:
import numpy as np
import pandas as pd
from src.article_weigher import ArticleWeightPredictor
from src.article_processor import ArticleProcessor, get_article_embedding


class Predictor:
    def __init__(self, excel_path, model_filepath, sheet_name=None):
        self.processor = ArticleProcessor(excel_path, sheet_name)
        self.model_filepath = model_filepath
        self.model = None  # Will be loaded in load_model()
    
    def load_model(self, input_dim):
        """
        Loads the saved model from disk.
        """
        self.model = ArticleWeightPredictor.load_model(self.model_filepath, input_dim)
    
    def predict_daily(self):
        """
        Processes the Excel file row by row. For each day (row), it:
          - Loops through each article (columns 2+)
          - Gets the embedding via the Gemini API for each article
          - Uses the neural network to predict the 4 outputs for each article
          - Averages the 4-dimensional outputs across all articles for that day
        Returns:
          - dates: List of dates (from the first column)
          - daily_predictions: A NumPy array with one 4-d vector per day.
        """
        # Use the raw DataFrame from the processor
        data = self.processor.data
        dates = []
        daily_predictions = []
        
        for _, row in data.iterrows():
            # The first column is assumed to be the date
            date = row.iloc[0]
            article_preds = []
            # Loop through each article (columns 2+)
            for article in row.iloc[1:]:
                if isinstance(article, str) and article.strip():
                    try:
                        # Get the embedding for this article
                        embedding = get_article_embedding(article)
                        # ArticleWeightPredictor.predict already adds the batch dimension,
                        # so pass the embedding directly and take the single 4-d row.
                        pred = self.model.predict(embedding)[0]
                        article_preds.append(pred)
                    except Exception as e:
                        print(f"Error processing article: {e}")
            # If there are predictions for the day, average them; otherwise use zeros.
            if article_preds:
                daily_avg = np.mean(article_preds, axis=0)
            else:
                daily_avg = np.zeros(4)
            dates.append(date)
            daily_predictions.append(daily_avg)
        
        return dates, np.array(daily_predictions)
    
    def save_predictions(self, dates, predictions, output_excel):
        """
        Saves the daily predictions to an Excel file.
        Each row in the Excel file corresponds to a day with 4 prediction values.
        """
        df = pd.DataFrame(predictions, columns=['relevance', 'up', 'down', 'unchanged'])
        df.insert(0, "date", dates)
        df.to_excel(output_excel, index=False)
        print(f"Predictions saved to {output_excel}")
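
The per-day aggregation in `predict_daily` reduces to an element-wise mean with a zero fallback. The sketch below shows that step on its own, in plain Python; `average_daily_predictions` is a hypothetical helper and the vectors are made-up values in the `[relevance, up, down, unchanged]` layout.

```python
def average_daily_predictions(article_predictions, width=4):
    """Average per-article vectors element-wise; return zeros when a day has no articles."""
    if not article_predictions:
        return [0.0] * width
    count = len(article_predictions)
    return [sum(pred[i] for pred in article_predictions) / count for i in range(width)]

# Hypothetical [relevance, up, down, unchanged] outputs for two articles on one day.
day_with_news = [[1.0, 0.5, 0.25, 0.25], [0.5, 0.0, 0.5, 0.5]]
print(average_daily_predictions(day_with_news))  # [0.75, 0.25, 0.375, 0.375]
print(average_daily_predictions([]))             # [0.0, 0.0, 0.0, 0.0]
```

Note that the zero vector used for days without articles does not sum to 1 across the direction weights, so any downstream consumer has to treat it as "no signal" rather than as a valid distribution.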

In [None]:
from src.article_weigher import ArticleWeightPredictor
from src.article_processor import ArticleProcessor

# TODO: add Bayesian optimization for the hyperparameters.
class Trainer:
    def __init__(self, excel_path, sheet_name=None):
        self.processor = ArticleProcessor(excel_path, sheet_name)
        self.model = None
    
    def prepare_data(self):
        """
        Process the Excel file to build the daily dataset.
        Returns dates and aggregated daily embeddings.
        """
        dates, x = self.processor.build_daily_dataset()
        return dates, x
    
    def train_model(self, x, y, epochs=50):
        """
        Trains the ArticleWeightPredictor using the provided data.
        Returns the trained model.
        """
        input_dim = x.shape[1]
        predictor = ArticleWeightPredictor(input_dim)
        predictor.train(x, y, epochs=epochs)
        self.model = predictor
        return predictor
    
    def save_model(self, model_filepath):
        """
        Saves the trained model to the given filepath.
        """
        if self.model:
            self.model.save_model(model_filepath)
        else:
            print("No model has been trained yet.")

# Results

Due to the large processing demand, all the tests were performed on the ECE Nebula server. The results are below:

# Data Features
---

To establish a baseline, we first trained the model on a **single feature**, namely the target time series itself, without any additional inputs. **Images 1 and 2** show these baseline forecasts for the SP500 index with TAF (Trend Adjusted Forecast) disabled and enabled, respectively. In **Image 1**, TAF was turned off (\( A=0, B=0, W=0 \)) and the test RMSE was **138.69**. In **Image 2**, the TAF parameters were set to \( A=0.0 \), \( B=0.8 \), and \( W=0.17 \), which lowered the test RMSE to **101.79**. Even with TAF, the model still underfit abrupt market swings and had difficulty tracking rapid price changes over short intervals. Neither configuration captured the full volatility or the finer price fluctuations, which points to a clear need to **expand the feature set** and refine the trend-adjustment mechanism.

In contrast, the later experiments shown in **Images 3 through 10** incorporated multiple features and systematically tested different TAF parameters. In some of those cases the search settled on \( A=0.0, B=0.0, W=0.0 \), meaning TAF was calculated but did not improve RMSE. Because the multi-feature models typically achieved lower errors, it became evident that the single feature alone was insufficient to model stock movements with the desired accuracy.
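
For reference, assuming RMSE here is the usual root-mean-square error over the \( N \) test points (the symbols \( y_i \) for observed values and \( \hat{y}_i \) for forecasts are introduced only for this illustration), the relative improvement from enabling TAF on the single-feature baseline works out as follows:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},
\qquad
\frac{138.69 - 101.79}{138.69} = \frac{36.90}{138.69} \approx 0.266
```

That is a reduction of roughly 27% relative to the run without TAF.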

## Using just one feature

![SP500_predictions_1](images/SP500_predictions_Lr_2_H_1.0_N_100_B_256_L_5000_E_20_D_0_TAF_A_0_B_0_W_0.png)

![SP500_predictions_2](images/SP500_predictions_Lr_2_H_1.0_N_100_B_256_L_5000_E_20_D_0_TAF_A_0.0_B_0.8_W_0.17.png)

## Using all features

![Y_AAPL](images/Y_AAPL_predictions_Lr_2_H_1.0_N_100_B_256_L_4000_E_18_D_0_TAF_A_0.1_B_0.1_W_0.19.png)

## Additional multi-feature results

![alt text](images/o81sckc7.png)

![alt text](images/uu3pi7gu.png)

![alt text](images/2kvteh0t.png)

![alt text](images/owmuet3d.png)

![alt text](images/v1d1p1op.png)

![alt text](images/y8e910b5.png)

![alt text](images/5tapcpl4.png)

# Bayesian Optimization
---

To refine our model configurations for the forecast horizon of **60 steps (days)**, we used **Bayesian Optimization** to tune key hyperparameters, particularly the **look-back window** (how many past observations the model sees at once) and the number of **epochs**. **Images 1 through 4** illustrate the “before” phase, where initial guesswork and manual experimentation led to suboptimal or inconsistent results. By systematically searching the hyperparameter space, Bayesian Optimization let us converge on more effective settings without exhaustive trial and error.
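
The way the tuning objective is framed can be sketched without the optimizer library. In the example below, a plain random search stands in for the Bayesian optimizer, and `toy_rmse` is a hypothetical error surface rather than the real cross-validation; only the conventions are taken from the pipeline: continuous suggestions are cast to integers, failed evaluations receive a large penalty, and the RMSE is negated because the optimizer maximizes.

```python
import random

def make_objective(evaluate_rmse, failure_rmse=1e6):
    """Wrap an RMSE evaluator so that a maximizer can use it."""
    def objective(look_back, epochs):
        try:
            rmse = evaluate_rmse(int(look_back), int(epochs))
        except Exception:
            rmse = failure_rmse  # penalize configurations that fail to train
        return -rmse  # maximizing -RMSE is the same as minimizing RMSE
    return objective

def toy_rmse(look_back, epochs):
    """Hypothetical error surface with its minimum at look_back=30, epochs=20."""
    return 1.0 + ((look_back - 30) / 10) ** 2 + ((epochs - 20) / 10) ** 2

def random_search(objective, bounds, trials, seed=42):
    """Random search, used here only as a stand-in for the Bayesian optimizer."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

bounds = {"look_back": (5, 60), "epochs": (5, 50)}
best_params, best_score = random_search(make_objective(toy_rmse), bounds, trials=200)
print({name: int(value) for name, value in best_params.items()}, "RMSE:", -best_score)
```

A Bayesian optimizer differs from this stand-in in that it fits a surrogate model to past evaluations and uses it to choose the next point, which matters when each evaluation is a full training run.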

Once the optimal look-back and epoch combinations were identified, we explored two distinct strategies for generating the 60 future points. The first was a **sequential** approach that forecasts one time step at a time, moving from step 1 to 2, then 2 to 3, and so on until reaching step 60. While straightforward, this method risks “error accumulation,” because each forecasted point feeds into the next prediction. The second strategy used a **TimeDistributed** layer with a **Dense** layer covering all 60 points to predict every step **at once**, avoiding the compounding effect of sequential errors. However, this all-at-once approach imposed significantly higher computational costs and tended to underfit given the limited training data; the model struggled to balance the complexity of producing 60 simultaneous predictions against the amount of information available, as illustrated in **Images 1 and 2**. In **Images 3 onward**, we present the refined results after applying Bayesian Optimization, where the final model choices (including whether to use sequential or TimeDistributed forecasting) were guided by a balance of predictive accuracy, computational feasibility, and generalization potential.
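
The sequential strategy and its error accumulation can be shown with a toy one-step model. This is a sketch under stated assumptions: `recursive_forecast` and `drift_model` are hypothetical names, and the drift model merely stands in for the trained LSTM.

```python
def recursive_forecast(predict_next, history, look_back, horizon):
    """Forecast `horizon` steps one at a time, feeding each prediction back in."""
    window = list(history[-look_back:])
    forecast = []
    for _ in range(horizon):
        next_value = predict_next(window)
        forecast.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [next_value]
    return forecast

def drift_model(window):
    """Hypothetical one-step model: last value plus the average step in the window."""
    steps = [b - a for a, b in zip(window, window[1:])]
    return window[-1] + sum(steps) / len(steps)

history = [100.0, 101.0, 102.0, 103.0, 104.0]
forecast = recursive_forecast(drift_model, history, look_back=4, horizon=3)
print(forecast)  # [105.0, 106.0, 107.0]

# A model that is off by a constant 0.5 per step drifts further away each step,
# because every biased prediction becomes an input to the next one.
biased = recursive_forecast(lambda w: drift_model(w) + 0.5, history, look_back=4, horizon=3)
errors = [b - f for b, f in zip(biased, forecast)]
print(errors)
```

The growing gap in `errors` is the compounding effect that the all-at-once strategy avoids, at the cost of a harder learning problem.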

## Bayesian Optimization

![alt text](images/67kqa5zt.png)

![alt text](images/n22vhf95.png)

![alt text](images/xgs782uc.png)

![alt text](images/ti2o5uh4.png)

## Sequential Distribution vs Bulk

![alt text](images/zw8g2f1w.png)

![alt text](images/f731i1gf.png)

# Activation Function & Batch Size
---
In the next phase of experimentation, we evaluated the impact of different activation functions on our model’s predictive stability. **Images 1 and 2** show that ReLU led to a severe exponential decay in forecast values: after just a few time steps, the predicted series collapsed toward zero, erasing any meaningful signal. To address this, we switched to tanh, which better preserved signal dynamics over the extended forecasting horizon and kept outputs in a more realistic range. Concurrently, we tested a significantly larger batch size (1024), intended to accommodate more timesteps in training and inference. This inadvertently dampened the beneficial effects of TAF. As seen in **Images 3 through 6**, TAF’s contribution to error reduction became negligible when the network processed such large batches, suggesting that the smoothing influence of a large batch size may already address the same temporal dynamics that TAF attempts to correct. These findings guided us toward moderate batch sizes and tanh activations as the best balance between stable training behavior and effective trend adjustment.

## ReLU Activation Function

![alt text](images/jykluk22.png)

![alt text](images/7semowwj.png)

## Large Batch Size

![alt text](images/czka1ame.png)

![alt text](images/m2nm403t.png)

![alt text](images/evb2wvzf.png)

![alt text](images/o2ylpjz5.png)

# News Embeddings

Due to a lack of news data, we were not able to generate news-based predictions.


# Conclusions

We experimented with a variety of techniques to reduce RMSE and improve our stock price predictions. We were able to predict stock prices 60 days into the future with an RMSE between 1 and 7.

## Techniques

### Best LSTM Parameters

After optimizing our parameters, we found the best results using 2 layers, 100 neurons, a batch size of 256, a look-back of 2786, and 348 epochs. Additionally, we got our best results with the dropout rate set to 0. We found that the LSTM parameters are strongly interdependent and are best optimized together rather than individually. Look-back and epochs in particular depended heavily on batch size.

For hyperparameter optimization, we found Bayesian optimization to be very useful for training the LSTM, mostly for look back and epochs.

We found results were best when all features were forecasted in the same step instead of one at a time. Conversely, we found better predictions when forecasting one time step (1 day) at a time instead of all 60 at once.

### Trend Adjusted Forecast

The TAF was less impactful than we hoped. Many of our trials ended with the TAF adjustments converging to 0, meaning it was not useful. With larger batch sizes, the TAF almost never had any effect.

### Activation Functions

We found that tanh performed better than ReLU for stock prediction.

## Insights on Data Analytics

Techniques that work well for other applications did not automatically perform better in our situation (tanh vs relu). It is important to be aware of all of them and see what works in each case.

## Recommendations for Future Work

Most of our LSTM models suffered from underfitting; a more complex architecture with deeper layers might address this. Also, while numerical stock data was easy to collect, news data was much harder to gather, which left us with a very small dataset. A potential solution would be to purchase access to this news from one of the many official sources that have years of archives (including OpenBB sources).