# LSTM Stock Prediction Notebook

## Table of Contents
1. [Introduction](#1-introduction)
2. [Libraries](#2-libraries)  
3. [Data Collection](#3-data-collection)  
4. [Data Exploration](#4-data-exploration)  
5. [Feature Engineering](#5-feature-engineering)  
6. [Model Architecture](#6-model-architecture)  
7. [Training & Evaluation](#7-training--evaluation)  
8. [Results & Diagnostics](#8-results--diagnostics)  
9. [Conclusion](#9-conclusion)

## 1. Introduction

## 2. Libraries

In [2]:
# Core Libraries 
import pandas as pd
import numpy as np
import os
from dotenv import load_dotenv
from datetime import datetime

# Machine Learning Libraries
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import mplfinance as mpf

# Alpaca Libraries
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame

# Statistical Libraries
from scipy.stats import entropy
from statsmodels.tsa.stattools import acf

## 3. Data Collection

In [10]:
# Load environment variables
load_dotenv()

# Gather data from Alpaca
client = StockHistoricalDataClient(os.getenv('ALPACA_API_KEY'), os.getenv('ALPACA_SECRET_KEY'))
request_params = StockBarsRequest(
    symbol_or_symbols=["AAPL"],
    timeframe=TimeFrame.Day,
    start=datetime(2016, 7, 1),
    end=datetime(2025, 7, 1)
)
bars = client.get_stock_bars(request_params)

# Convert to DataFrame
df = bars.df.reset_index()

## 4. Data Exploration

In [11]:
print("Head of DataFrame:")
print(df.head())

Head of DataFrame:
  symbol                 timestamp   open    high    low  close      volume  \
0   AAPL 2016-07-01 04:00:00+00:00  95.49  96.465  95.33  95.89  27180926.0   
1   AAPL 2016-07-05 04:00:00+00:00  95.39  95.400  94.46  95.04  30590138.0   
2   AAPL 2016-07-06 04:00:00+00:00  94.60  95.660  94.37  95.53  32320508.0   
3   AAPL 2016-07-07 04:00:00+00:00  95.70  96.500  95.62  95.94  26759405.0   
4   AAPL 2016-07-08 04:00:00+00:00  96.49  96.890  96.05  96.68  30976552.0   

   trade_count       vwap  
0     154544.0  95.995066  
1     153278.0  94.848509  
2     187589.0  95.158127  
3     143923.0  96.051727  
4     168615.0  96.635640  


In [12]:
print("\n DataFrame Info:")
print(df.info())


 DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2261 entries, 0 to 2260
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   symbol       2261 non-null   object             
 1   timestamp    2261 non-null   datetime64[ns, UTC]
 2   open         2261 non-null   float64            
 3   high         2261 non-null   float64            
 4   low          2261 non-null   float64            
 5   close        2261 non-null   float64            
 6   volume       2261 non-null   float64            
 7   trade_count  2261 non-null   float64            
 8   vwap         2261 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(7), object(1)
memory usage: 159.1+ KB
None


In [13]:
print("\n DataFrame Descriptive Statistics:")
print(df.describe())


 DataFrame Descriptive Statistics:
              open         high          low        close        volume  \
count  2261.000000  2261.000000  2261.000000  2261.000000  2.261000e+03   
mean    182.703186   184.678842   180.916845   182.912084  5.948293e+07   
std      58.044565    58.851109    57.386703    58.246307  3.868233e+07   
min      94.600000    95.400000    94.370000    95.040000  2.080515e+06   
25%     145.660000   147.230000   144.370000   145.910000  3.100255e+07   
50%     172.400000   174.010000   170.970000   172.570000  4.871406e+07   
75%     204.430000   207.160000   202.586900   204.610000  7.803178e+07   
max     514.790000   515.140000   500.330000   506.090000  3.570209e+08   

        trade_count         vwap  
count  2.261000e+03  2261.000000  
mean   4.735065e+05   182.881165  
std    3.062269e+05    58.176753  
min    3.000000e+00    94.848509  
25%    2.021250e+05   145.814255  
50%    4.770020e+05   172.638542  
75%    6.347160e+05   204.960355  
max    2

## 5. Feature Engineering

In this section, we enrich the raw OHLCV data with derived features that help the LSTM model learn temporal patterns beyond raw price alone.

In [None]:
# Add Datetime index
df['date'] = pd.to_datetime(df['timestamp'])
df.set_index('date', inplace=True)

### 5.1 Price-Based Features

We begin by generating basic transformations of open, high, low, and close to give the model awareness of price structure and returns.

In [None]:
df["close_lag_1"] = df["close"].shift(1)
df["close_lag_2"] = df["close"].shift(2)
df["close_minus_open"] = df["close"] - df["open"]
df["high_minus_low"] = df["high"] - df["low"]
df["close_over_open"] = df["close"] / df["open"]
df["high_over_low"] = df["high"] / df["low"]
df["daily_return"] = df["close"].pct_change()
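
Lagged and return features are undefined for the first few rows, because there is no earlier bar to look back on. The sketch below uses a made-up four-row `toy` frame (not the Alpaca data) to show where the `NaN` values appear. Dropping them with `dropna()` is only one possible treatment and is an assumption here, not a step shown in this notebook.

```python
import pandas as pd

# Hypothetical toy prices, for illustration only
toy = pd.DataFrame({"close": [100.0, 102.0, 101.0, 104.0]})
toy["close_lag_1"] = toy["close"].shift(1)       # previous day's close
toy["close_lag_2"] = toy["close"].shift(2)       # close two days earlier
toy["daily_return"] = toy["close"].pct_change()  # (close_t / close_{t-1}) - 1

# Count the warm-up NaNs introduced by each feature
nan_counts = toy.isna().sum()
print(nan_counts)

# One option: keep only rows where every feature is defined
clean = toy.dropna()
print(clean)
```

The longest lookback sets the number of unusable leading rows: a two-day lag loses two rows, a 20-day rolling window loses 19.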

### 5.2 Technical Indicators

Here we compute standard indicators used in quantitative analysis, such as moving averages, RSI, MACD, Bollinger Bands, momentum, and stochastic oscillators.
- **SMA (Simple Moving Average)**: Average of the closing price over a window; helps identify trend direction.
- **EMA (Exponential Moving Average)**: Similar to SMA but gives more weight to recent prices; reacts faster to change.
- **RSI (Relative Strength Index)**: Momentum indicator that compares recent gains to recent losses on a 0-100 scale; detects overbought/oversold conditions.
- **MACD (Moving Average Convergence Divergence)**: The difference between a fast and a slow EMA; useful for gauging trend strength and reversals.
- **Bollinger Bands**: Bands above and below the SMA that expand and contract with volatility; used to detect price extremes.
- **Momentum**: The change in closing price over a fixed lookback (here, 5 days); helps capture the speed and direction of recent moves.
- **Stochastic Oscillator**: Compares the closing price to the recent high-low range; signals potential reversals when crossing certain levels.
- **OBV (On-Balance Volume)**: Cumulative volume that adds on up days and subtracts on down days; used to confirm trends (computed with the volume-based features).

In [None]:
# Moving Averages
df["SMA_10"] = df["close"].rolling(window=10).mean()
df["SMA_20"] = df["close"].rolling(window=20).mean()
df["EMA_20"] = df["close"].ewm(span=20).mean()
df["EMA_50"] = df["close"].ewm(span=50).mean()

# Volatility
df["volatility_10"] = df["daily_return"].rolling(window=10).std()
df["volatility_20"] = df["daily_return"].rolling(window=20).std()

# RSI (14-day period)
delta = df["close"].diff()
gain = np.where(delta > 0, delta, 0)
loss = np.where(delta < 0, -delta, 0)
avg_gain = pd.Series(gain, index=df.index).rolling(window=14).mean()
avg_loss = pd.Series(loss, index=df.index).rolling(window=14).mean()
rs = avg_gain / avg_loss
df["RSI_14"] = 100 - (100 / (1 + rs))

# MACD & Signal Line
EMA_12 = df["close"].ewm(span=12).mean()
EMA_26 = df["close"].ewm(span=26).mean()
df["MACD"] = EMA_12 - EMA_26
df["MACD_signal"] = df["MACD"].ewm(span=9).mean()

# Bollinger Bands
df["BB_upper"] = df["SMA_20"] + (2 * df["close"].rolling(window=20).std())
df["BB_lower"] = df["SMA_20"] - (2 * df["close"].rolling(window=20).std())

# Momentum 
df["momentum_5"] = df["close"] - df["close"].shift(5)

# Stochastic Oscillator 
low_14 = df["low"].rolling(window=14).min()
high_14 = df["high"].rolling(window=14).max()
df["stochastic_14"] = (df["close"] - low_14) / (high_14 - low_14) * 100
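
The RSI above averages gains and losses with a simple 14-day rolling mean. A common alternative is Wilder's smoothing, an exponential average with `alpha = 1 / period`. The sketch below is that swapped-in variant, not the method used in this notebook. The function name `rsi_wilder` and the toy series are hypothetical.

```python
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI using Wilder's smoothing (exponential average with alpha = 1/period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)     # positive moves, zeros elsewhere
    loss = (-delta).clip(lower=0)  # magnitudes of negative moves, zeros elsewhere
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss       # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)

# Hypothetical toy series: one strictly rising, one strictly falling
rising = pd.Series(range(1, 31), dtype=float)
falling = rising[::-1].reset_index(drop=True)

rsi_up = rsi_wilder(rising)
rsi_down = rsi_wilder(falling)
print(rsi_up.tail(3))
print(rsi_down.tail(3))
```

A series with no losses pins the RSI at the top of its range and a series with no gains pins it at the bottom, which makes a quick check that the gain/loss split is the right way round.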


### 5.3 Volatility & Rolling Statistics

We use rolling statistics to capture trends in market volatility and relative deviation.

In [None]:
df["rolling_std_5"] = df["close"].rolling(window=5).std()
df["rolling_std_10"] = df["close"].rolling(window=10).std()
df["rolling_std_20"] = df["close"].rolling(window=20).std()

df["rolling_mean_5"] = df["close"].rolling(window=5).mean()
df["rolling_mean_10"] = df["close"].rolling(window=10).mean()
df["rolling_mean_20"] = df["close"].rolling(window=20).mean()

# Z-score
df["zscore_close_10"] = (df["close"] - df["rolling_mean_10"]) / df["rolling_std_10"]
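
To make the z-score concrete, here is a small worked example on made-up numbers (the `toy_close` series and the 5-day window are illustrative assumptions, not values from the dataset). Note that pandas' rolling `std()` uses the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical prices: four flat days followed by a jump
toy_close = pd.Series([10.0, 10.0, 10.0, 10.0, 20.0])
window = 5

rolling_mean = toy_close.rolling(window=window).mean()
rolling_std = toy_close.rolling(window=window).std()  # sample std (ddof=1)
zscore = (toy_close - rolling_mean) / rolling_std

# Last row by hand: mean = 12, std = sqrt(80 / 4) = sqrt(20), z = 8 / sqrt(20)
last_z = zscore.iloc[-1]
print(last_z)
```

A perfectly flat window has zero standard deviation, so the ratio is undefined there; such rows may need separate handling before training.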

### 5.4 Volume-Based Features

Volume patterns can be strong predictors of price action. Here we include lagged volume, changes, and VWAP deviations.

In [None]:
df["volume_lag_1"] = df["volume"].shift(1)
df["volume_change"] = df["volume"].pct_change()
df["volume_over_rolling_avg_10"] = df["volume"] / df["volume"].rolling(window=10).mean()
df["vwap_diff"] = df["close"] - df["vwap"]
df["vwap_ratio"] = df["close"] / df["vwap"]
df["trade_count_change"] = df["trade_count"].pct_change()

# On-Balance Volume (OBV)
close_vals = df["close"].to_numpy()    # positional arrays; df is indexed by date
volume_vals = df["volume"].to_numpy()
obv = [0]  # Initialize OBV list
for i in range(1, len(df)):
    if close_vals[i] > close_vals[i - 1]:
        obv.append(obv[-1] + volume_vals[i])
    elif close_vals[i] < close_vals[i - 1]:
        obv.append(obv[-1] - volume_vals[i])
    else:
        obv.append(obv[-1])
df["OBV"] = obv
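
The same OBV rule can also be written without a Python loop. The sketch below is an equivalent vectorized formulation, checked on made-up numbers. The helper name `on_balance_volume` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def on_balance_volume(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Cumulative volume signed by the direction of the daily close change."""
    # +1 on up days, -1 on down days, 0 when unchanged (and on the first row)
    direction = np.sign(close.diff()).fillna(0.0)
    return (direction * volume).cumsum()

# Hypothetical toy data: up, down, flat, up
toy_close = pd.Series([10.0, 11.0, 10.5, 10.5, 12.0])
toy_volume = pd.Series([100.0, 200.0, 150.0, 300.0, 250.0])

toy_obv = on_balance_volume(toy_close, toy_volume)
print(toy_obv.tolist())
```

On the real frame this would be called as `on_balance_volume(df["close"], df["volume"])`. It avoids positional indexing entirely, so it does not depend on the type of index `df` carries.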

### 5.5 Statistical Features

These features attempt to extract hidden structures, periodicity, and randomness from the time series - often useful in regime detection, anomaly spotting, and advanced modeling.

- **Shannon Entropy**: Measures the degree of randomness in a distribution. High entropy suggests noise or unpredictability, while low entropy implies more structure.
- **Hurst Exponent**: Estimates the long-term memory of a series. A value > 0.5 implies trending behaviour, < 0.5 suggests mean-reverting tendencies, and ~0.5 indicates a random walk.
- **FFT (Fast Fourier Transform)**: Decomposes the time series into a sum of sine/cosine waves. Useful for identifying dominant periodicities (cycles).
- **Autocorrelation (Lags 1-3)**: Measures the correlation between a series and a lagged copy of itself. High autocorrelation can indicate persistence or seasonality.
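
As an illustration of how an FFT exposes a dominant cycle, the sketch below recovers the period of a synthetic sine wave. The signal, its 20-sample period, and the variable names are assumptions made for the example. Real price series are far noisier and usually need detrending first.

```python
import numpy as np

# Hypothetical signal: a constant level plus a sine wave with a 20-sample period
n_samples = 200
true_period = 20
t = np.arange(n_samples)
signal = 50.0 + np.sin(2 * np.pi * t / true_period)

# Remove the mean so the zero-frequency (DC) term does not dominate
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n_samples, d=1.0)  # cycles per sample

# Skip index 0 (DC) and pick the strongest remaining frequency
peak = np.argmax(spectrum[1:]) + 1
dominant_period = 1.0 / freqs[peak]
print(dominant_period)
```

`np.fft.rfft` is used here because the input is real-valued, so only the non-negative frequencies are needed.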

In [None]:
# Shannon Entropy of Close Price Over a Rolling Window
def shannon_entropy(series, bins=20):
    """Compute the Shannon entropy of a windowed series."""
    hist, _ = np.histogram(series.dropna(), bins=bins, density=True)
    return entropy(hist)

df["shannon_entropy_close"] = df["close"].rolling(window=20).apply(shannon_entropy, raw=False)

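# Hurst Exponent Over a Rolling Window
def hurst_exponent(ts):
    """Estimate the Hurst exponent of a time series window."""
    ts = np.asarray(ts, dtype=float)  # positional slicing, no index alignment
    lags = np.arange(2, 20)
    # Standard deviation of the lagged differences grows roughly like lag**H
    tau = [np.std(ts[lag:] - ts[:-lag]) for lag in lags]
    # Slope of log(tau) against log(lag) estimates H
    return np.polyfit(np.log(lags), np.log(tau), 1)[0]

df["hurst_exponent_close"] = df["close"].rolling(window=100).apply(hurst_exponent, raw=True)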

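# FFT magnitudes of the 5 lowest non-zero frequencies, computed once on the
# full series and broadcast to every row (a global, not rolling, statistic)
fft_vals = np.fft.fft(df["close"].dropna().values)
fft_abs = np.abs(fft_vals)
top_n = 5
for i in range(1, top_n + 1):
    df[f"fft_{i}"] = 0.0  # float column so the magnitude assignment keeps its dtype
    df.loc[df.index[:len(fft_abs)], f"fft_{i}"] = fft_abs[i]  # broadcast value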

# Rolling Autocorrelation (Lag 1 to 3)
df["autocorr_lag_1"] = df["close"].rolling(window=20).apply(lambda x: acf(x, nlags=1)[1], raw=False)
df["autocorr_lag_2"] = df["close"].rolling(window=20).apply(lambda x: acf(x, nlags=2)[2], raw=False)
df["autocorr_lag_3"] = df["close"].rolling(window=20).apply(lambda x: acf(x, nlags=3)[3], raw=False)
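
One way to sanity-check a lagged-difference Hurst estimator is to run it on synthetic series whose behaviour is known. The sketch below is a standalone check under stated assumptions: the function name `hurst_from_lagged_std`, the seed, and the series length are all hypothetical. A simulated random walk should land near 0.5, and white noise, which is strongly mean-reverting, should land near 0.

```python
import numpy as np

def hurst_from_lagged_std(values, max_lag=20):
    """Slope of log(std of lagged differences) against log(lag)."""
    values = np.asarray(values, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.std(values[lag:] - values[:-lag]) for lag in lags]
    return np.polyfit(np.log(lags), np.log(tau), 1)[0]

rng = np.random.default_rng(0)          # fixed seed for reproducibility
steps = rng.standard_normal(5000)

# Cumulative sum of independent steps is a random walk
h_random_walk = hurst_from_lagged_std(np.cumsum(steps))
# The raw steps themselves are white noise
h_white_noise = hurst_from_lagged_std(steps)

print(h_random_walk, h_white_noise)
```

With only 100 points per rolling window, as in the feature above, the estimate will be considerably noisier than on these 5000-point series.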
