# Profit Prophet: The Stock Market ML Predictor

This project uses an LSTM (long short-term memory) model to forecast stock prices and recommend a course of action.

## Stock Data

TBD. What specific data should we collect for the LSTM?

## Approach

First, we train an LSTM model for the [TBD. Aggregated index? Each stock?] to forecast the market from purely numeric data.
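
Whichever series is chosen, an LSTM is typically trained on fixed-length windows of past values, each paired with the value to predict. A minimal sketch of that windowing step, assuming a single numeric series such as daily closes. `make_windows`, `lookback`, and `horizon` are hypothetical names and are not part of this project yet:

```python
def make_windows(series, lookback, horizon=1):
    """Split a sequence into (input window, target) pairs.

    Each input holds `lookback` consecutive values; the target is the
    value `horizon` steps after the end of that window.
    """
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    inputs = [list(series[i : i + lookback]) for i in range(n_samples)]
    targets = [series[i + lookback + horizon - 1] for i in range(n_samples)]
    return inputs, targets


# Toy series standing in for closing prices (illustrative values only)
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
inputs, targets = make_windows(closes, lookback=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

The resulting lists can be converted to an array of shape (samples, lookback, features) for whichever LSTM library ends up being used.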

Simultaneously, we read the news, use ML to categorize each item, and adjust our estimates as follows:

|    Stock Implication   |          Past          |         Present           |          Future          |
| :--------------------: | :--------------------: | :-----------------------: | :----------------------: |
| Artificially Increased |  Reduce Past Estimate  |  Reduce Present Estimate  |  Reduce Future Estimate  |
| Artificially Decreased | Increase Past Estimate | Increase Present Estimate | Increase Future Estimate |
|        No Change       |       Do Nothing       |         Do Nothing        |        Do Nothing        |

This gives us an estimate and forecast of the *True* value of the stock, which we can use to make fat stacks.
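
The table above can be read as a lookup from news classification to the direction of the correction. A rough sketch under stated assumptions: the `Implication` enum, the `adjust_estimate` helper, and the 2% `strength` are all hypothetical placeholders, since the size of the correction has not been decided:

```python
from enum import Enum


class Implication(Enum):
    ARTIFICIALLY_INCREASED = "artificially increased"
    ARTIFICIALLY_DECREASED = "artificially decreased"
    NO_CHANGE = "no change"


# Direction of the correction for each row of the table:
# an artificially inflated price means the estimate should come down.
DIRECTION = {
    Implication.ARTIFICIALLY_INCREASED: -1,
    Implication.ARTIFICIALLY_DECREASED: +1,
    Implication.NO_CHANGE: 0,
}


def adjust_estimate(estimate, implication, strength=0.02):
    """Scale an estimate up or down by a fractional `strength` (placeholder value)."""
    return estimate * (1 + DIRECTION[implication] * strength)


# The same rule applies to past, present, and future estimates (toy numbers)
estimates = {"past": 100.0, "present": 105.0, "future": 110.0}
corrected = {
    period: adjust_estimate(value, Implication.ARTIFICIALLY_INCREASED)
    for period, value in estimates.items()
}
print(corrected)
```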

## Tools

There are multiple ways to get stock prices, including Bloomberg terminals and OpenBB.

### Bloomberg API

Bloomberg terminals are the de facto way to get stock information. UW also provides access to 4 of these terminals in the MC building. The Bloomberg API requires the terminal to be running, so it can only be used on a machine with the terminal open.

For this reason, we are moving away from the Bloomberg API.

In [None]:
# Bloomberg API

from xbbg import blp
import pandas as pd

DATA_DIR = './Data/'

tickers = ['NVDA US Equity', 'AAPL US Equity']
fields = ['High', 'Low', 'Last_Price']
start_date = '2024-11-01'
end_date = '2024-11-10'

# This line hangs unless it is running with a Bloomberg terminal
hist_tick_data = blp.bdh(tickers=tickers, fields=fields, start_date=start_date, end_date=end_date)

filename = f'tick_data_{start_date}_to_{end_date}.csv'
hist_tick_data.to_csv(DATA_DIR + filename)
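
Because the request above hangs when no terminal is running, one option is to check for the terminal before calling it. This is only a sketch: it assumes the Bloomberg desktop API listens on its usual default of `localhost:8194`, and `bloomberg_port_open` is a hypothetical helper, not part of `xbbg`:

```python
import socket


def bloomberg_port_open(host="localhost", port=8194, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    8194 is assumed to be the port the Bloomberg terminal exposes locally.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if bloomberg_port_open():
    print("Bloomberg API port is reachable; the download can be attempted")
else:
    print("No Bloomberg terminal detected; skipping the Bloomberg download")
```

An open port does not guarantee the terminal is logged in, so this only avoids the most common cause of the hang.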



### OpenBB

OpenBB is a free, open-source alternative to the Bloomberg terminal. It runs without any special software in the background.

In [2]:
import openbb
openbb.build()

In [3]:
# Rate limiter class
# Some of the libraries used in this code are rate limited. This class can be used
# to limit the number of requests made to a library in a given time period.

import threading
import time

class TokenBucket:
    def __init__(self, tokens, refill_rate):
        self.capacity = tokens  # Max tokens (60)
        self.tokens = tokens    # Initial tokens
        self.refill_rate = refill_rate  # Tokens added per second (60)
        self.lock = threading.Lock()
        self.last_refill = time.monotonic()  # Monotonic clock is unaffected by system clock changes

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Calculate tokens to add based on elapsed time
        new_tokens = elapsed * self.refill_rate
        if new_tokens > 0:
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill = now

    def consume(self, tokens=1):
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False
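
The refill step is plain arithmetic, and the unit of `refill_rate` matters: it is tokens per second, so a limit quoted per minute has to be divided by 60. A small standalone sketch of the same formula used in `_refill`, with the elapsed time passed in explicitly so it can be checked by hand (`refilled_tokens` is a hypothetical helper written for illustration):

```python
def refilled_tokens(tokens, capacity, refill_rate, elapsed):
    """Tokens available after `elapsed` seconds, capped at `capacity`."""
    return min(capacity, tokens + elapsed * refill_rate)


# An empty bucket of capacity 60 refilled at 1 token/second (60 per minute)
print(refilled_tokens(tokens=0, capacity=60, refill_rate=1.0, elapsed=30))

# The same bucket refilled at 60 tokens/second is full again after one second
print(refilled_tokens(tokens=0, capacity=60, refill_rate=60.0, elapsed=1))

# Tokens never exceed the capacity, however long the bucket sits idle
print(refilled_tokens(tokens=50, capacity=60, refill_rate=1.0, elapsed=30))
```

If a provider's limit is 60 requests per minute (an assumption about the provider, not something stated here), a `refill_rate` of 1 would match it, while 60 would allow 60 requests every second.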

In [5]:
# OpenBB API

from openbb import obb
import finnhub
from ftfy import fix_text

from concurrent.futures import ThreadPoolExecutor  # Use ProcessPoolExecutor for CPU-bound tasks
import pandas as pd
import traceback

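# Read the Finnhub API key from the environment rather than hardcoding it.
# FINNHUB_API_KEY is an assumed variable name; set it before running this cell.
import os
finnhub_client = finnhub.Client(api_key=os.environ["FINNHUB_API_KEY"])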
obb.user.preferences.output_type = 'dataframe'
rate_limiter = TokenBucket(tokens=60, refill_rate=60)

FAST = 12
SLOW = 26
SIGNAL = 9

MIN_POINTS = SLOW + SIGNAL - 1
DAYS_TO_PAD = -(-MIN_POINTS * 3 // 2)  # Not every day has data. ceil(MIN_POINTS * 1.5), kept as an int

def process_symbol_data(symbol, start_date, end_date):
    # Wait until a token is available
    while not rate_limiter.consume():
        time.sleep(0.001)  # Avoid busy-waiting
        
    # Process data for a single symbol with technical indicators
    try:
        # Fetch OHLCV data
        symbol_data_df = obb.equity.price.historical(symbol, start_date=start_date, end_date=end_date)
        symbol_data_df['symbol'] = symbol  # Add symbol column

        # Remove any duplicate dates
        symbol_data_df = symbol_data_df[~symbol_data_df.index.duplicated(keep='first')] 

        # RSI
        symbol_data_df = obb.technical.rsi(data=symbol_data_df, target='close', length=14, scalar=100.0, drift=1)
        symbol_data_df.rename(columns={'close_RSI_14': 'rsi'}, inplace=True)

        # MACD
        symbol_data_df = obb.technical.macd(data=symbol_data_df, target='close', fast=FAST, slow=SLOW, signal=SIGNAL)
        symbol_data_df.rename(columns={f'close_MACD_{FAST}_{SLOW}_{SIGNAL}': 'macd',
                                       f'close_MACDh_{FAST}_{SLOW}_{SIGNAL}': 'macdh',
                                       f'close_MACDs_{FAST}_{SLOW}_{SIGNAL}': 'macds'}, inplace=True)
        
        # Move the 'date' index into a regular column
        symbol_data_df.reset_index(inplace=True)

        # News
        symbol_news = finnhub_client.company_news(symbol, _from=start_date, to=end_date)

        # Fix encoding for all text fields in the raw API response
        for article in symbol_news:
            for text_field in ['headline', 'summary', 'source']:
                if text_field in article and article[text_field] is not None:
                    article[text_field] = fix_text(article[text_field])

        group_column = 'date'
        text_columns = ['headline', 'summary', 'source']

        symbol_news_df = (pd.DataFrame(symbol_news)
            .assign(datetime=lambda x: pd.to_datetime(x['datetime'], unit='s', errors="coerce"))
            .dropna(subset=["datetime"])  # Remove invalid rows
            .assign(datetime=lambda x: x["datetime"].dt.strftime("%Y-%m-%d"))
            .rename(columns={'datetime': group_column})
            [[group_column] + text_columns]
        )

        # Ensure 'date' is datetime in both DataFrames
        symbol_data_df['date'] = pd.to_datetime(symbol_data_df['date'])
        symbol_news_df['date'] = pd.to_datetime(symbol_news_df['date'])

        # Aggregate news data to one row per date
        symbol_news_df = symbol_news_df.groupby('date').agg({
            'headline': lambda x: '\n'.join(x.astype(str)),
            'summary': lambda x: '\n'.join(x.astype(str)),
            'source': lambda x: '\n'.join(x.astype(str))
        }).reset_index()

        symbol_data_df = symbol_data_df.merge(symbol_news_df, on='date', how='outer')

        return symbol_data_df
    
    except Exception as e:
        print(f"Error processing {symbol}: {traceback.format_exc()}")
        return pd.DataFrame()

def downloadStockData(symbols, start_date=None, end_date=None, parallel=True):
    try:
        # Fetch S&P 500 data once for all symbols
        sp500_df = obb.equity.price.historical("SPX", start_date=start_date, end_date=end_date)
        sp500_df = sp500_df[['close']].rename(columns={'close': 'SP500'})
        sp500_df.reset_index(inplace=True)
        sp500_df['date'] = pd.to_datetime(sp500_df['date'])

        # Process symbols in parallel or sequentially
        if parallel:
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = [executor.submit(process_symbol_data, symbol, start_date, end_date) for symbol in symbols]
                results = [f.result() for f in futures]
        else:
            results = [process_symbol_data(symbol, start_date, end_date) for symbol in symbols]

        # Combine all symbols and merge with SP500
        combined_df = pd.concat(results)

        final_df = combined_df.merge(sp500_df, on='date', how='outer')        

        # Sort each symbol's rows by index; groupby also drops rows that have no symbol
        final_df = final_df.groupby('symbol', group_keys=False).apply(lambda x: x.sort_index())
        final_df.reset_index(inplace=True, drop=True)
    
        return final_df
    
    except Exception as e:
        print(f"Error during download: {traceback.format_exc()}")
        return pd.DataFrame()

# Declare search bounds 
symbols = ['AAPL', 'NVDA']
start_date = '1950-01-01'
end_date = '2025-03-01'

data_df = downloadStockData(symbols, start_date, end_date)

data_df.to_csv('Stock Data.csv')
display(data_df)

|  | date | open | high | low | close | volume | symbol | rsi | macd | macdh | macds | headline | summary | source | SP500 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 2004-01-02 | 0.39 | 0.39 | 0.38 | 0.38 | 2.024994e+09 | AAPL |  |  |  |  |  |  |  | 1108.48 |
| 1 | 2004-01-02 | 0.20 | 0.20 | 0.19 | 0.19 | 1.309248e+09 | NVDA |  |  |  |  |  |  |  | 1108.48 |
| 2 | 2004-01-05 | 0.38 | 0.40 | 0.38 | 0.40 | 5.530258e+09 | AAPL |  |  |  |  |  |  |  | 1122.22 |
| 3 | 2004-01-05 | 0.20 | 0.20 | 0.19 | 0.20 | 1.725876e+09 | NVDA |  |  |  |  |  |  |  | 1122.22 |
| 4 | 2004-01-06 | 0.40 | 0.40 | 0.39 | 0.40 | 7.130872e+09 | AAPL |  |  |  |  |  |  |  | 1123.67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10647 | 2025-02-26 | 129.99 | 133.73 | 128.49 | 131.28 | 3.225538e+08 | NVDA | 47.845685 | -0.139424 | -0.190138 | 0.050714 | QTUM: Our Favorite Way To Invest In Quantum Co... | Invest in the Defiance Quantum ETF (QTUM) for ... | SeekingAlpha\nFinnhub\nSeekingAlpha\nSeekingAl... | 5956.06 |
| 10648 | 2025-02-27 | 239.66 | 242.46 | 237.06 | 237.30 | 4.115364e+07 | AAPL | 46.727976 | 2.034598 | 0.185772 | 1.848826 | Auxier Asset Management Winter 2024 Market Com... | Auxier Focus Fund's Investor Class declined 2.... | SeekingAlpha\nSeekingAlpha\nSeekingAlpha\nMark... | 5861.57 |
| 10649 | 2025-02-27 | 134.97 | 135.01 | 120.01 | 120.15 | 4.431758e+08 | NVDA | 38.123629 | -1.153285 | -0.963199 | -0.190086 | Auxier Asset Management Winter 2024 Market Com... | Auxier Focus Fund's Investor Class declined 2.... | SeekingAlpha\nSeekingAlpha\nSeekingAlpha\nSeek... | 5861.57 |
| 10650 | 2025-02-28 | 236.91 | 242.09 | 230.20 | 241.84 | 5.683336e+07 | AAPL | 53.196852 | 1.906182 | 0.045885 | 1.860297 | The AI Smartphone Battle Of Titans: iPhone 16 ... | Apple's iPhone 16 and Samsung's Galaxy S25 mar... | SeekingAlpha\nDowJones\nMarketWatch\nSeekingAl... | 5954.50 |
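
The empty news columns in the early rows of this output are a consequence of merging prices and news with `how='outer'`: dates with prices but no news keep their price data, and dates with news but no prices end up without a symbol. A toy sketch of that behaviour follows; the two closes are taken from the output above, while the headlines are made-up placeholders:

```python
import pandas as pd

# Two trading days of prices for one symbol
prices = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-27", "2025-02-28"]),
    "close": [237.30, 241.84],
    "symbol": "AAPL",
})

# News for one trading day and one Saturday (placeholder headlines)
news = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-28", "2025-03-01"]),
    "headline": ["Headline A", "Headline B"],
})

merged = prices.merge(news, on="date", how="outer")
print(merged)

# Rows without a symbol are excluded by a later groupby('symbol'),
# which is modelled here with dropna
kept = merged.dropna(subset=["symbol"])
print(kept)
```

In effect, news published on days without price data is discarded unless it is carried forward to the next trading day before the merge.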
