# 📊 Data Workspace Notebook
# Author: David Linger
# Created: October 2025

"""
This notebook is part of the local data workspace environment.
It is designed for exploratory analysis, data cleaning, and model development.

Environment:
- Python 3.14(local install)
- Virtual Environment: data_env
- Kernel: Python (data_env)

Tools & Libraries:
- pandas, numpy, matplotlib, seaborn
- scikit-learn, ipykernel, jupyter

Version Control:
- Managed via Git (local repo)

Notes:
- All paths are local (no cloud sync)
- Virtual environment is excluded from version control via .gitignore
- For reproducibility, install dependencies via requirements.txt

"""


In [1]:
import pandas as pd
import matplotlib as plt
import yfinance as yf
import ta
from datetime import datetime, timedelta

To start off the dataset we are going to be looking at a small subset of the stock market and will be focussing on ETF's by producing a 5-ticker set for a diversified preliminary dataset that will provide broad exposure to large-cap U.S. equities across different industries. 

1. SPY – SPDR S&P 500 ETF
- Tracks: S&P 500 Index (500 large U.S. companies)
- Focus: Broad U.S. market exposure
- Top Holdings: Apple, Microsoft, Amazon
- Use Case: Core equity benchmark

2. QQQ – Invesco Nasdaq-100 ETF
- Tracks: Nasdaq-100 Index (100 largest non-financial U.S. companies)
- Focus: Tech-heavy growth stocks
- Top Holdings: NVIDIA, Apple, Meta, Google
- Use Case: High-growth, innovation-focused exposure

3. DIA – SPDR Dow Jones Industrial Average ETF
- Tracks: Dow Jones Industrial Average (30 blue-chip companies)
- Focus: Stable, mature U.S. companies
- Top Holdings: UnitedHealth, Goldman Sachs, Boeing
- Use Case: Defensive, value-oriented investing

4. TLT – iShares 20+ Year Treasury Bond ETF
- Tracks: Long-term U.S. Treasury bonds
- Focus: Fixed income, interest rate sensitivity
- Top Holdings: U.S. government bonds
- Use Case: Hedge against equity risk, macro exposure

5. VXUS – Vanguard Total International Stock ETF
- Tracks: Global stocks outside the U.S.
- Focus: International diversification
- Top Holdings: Nestlé, Samsung, Toyota
- Use Case: Exposure to developed and emerging markets

ETF(Exchange-Traded Fund) - is a type of investment fund that can be bought and sold just like a regular stock on a stock exchange. But instead of representing one company, an ETF holds a conjunction of assets, such as stocks, bonds, commodities, or currencies.

Candle Granularity

For the purposes of pensionfund trading were going to be focussing on: 
- end-of-day trading; a model that will be optimized for buy/sell once a day after market close.
- Swing trading; Holding positions for several days to weeks.
- Portfolio rotation:; Rebalance weekly or monthly based on daily signals.

In [2]:
# fetch the tables for the ETFs

ETFS = ["SPY", "QQQ", "DIA", "TLT", "VXUS"]
leadup_days = 50
start_date = (datetime.strptime("2023-01-01", "%Y-%m-%d") - timedelta(days=leadup_days)).strftime("%Y-%m-%d")
df_etf = yf.download(ETFS, interval="1d", start=start_date, end="2025-01-01", group_by="ticker")


  df_etf = yf.download(ETFS, interval="1d", start=start_date, end="2025-01-01", group_by="ticker")
[*********************100%***********************]  5 of 5 completed


In [3]:
df_etf

Ticker,TLT,TLT,TLT,TLT,TLT,DIA,DIA,DIA,DIA,DIA,...,SPY,SPY,SPY,SPY,SPY,VXUS,VXUS,VXUS,VXUS,VXUS
Price,Open,High,Low,Close,Volume,Open,High,Low,Close,Volume,...,Open,High,Low,Close,Volume,Open,High,Low,Close,Volume
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2022-11-14,87.562717,87.634258,86.829370,87.330193,13741100,319.323563,322.320396,318.232950,318.432098,3303300,...,380.674142,384.052271,378.917877,379.196198,71903500,47.130042,47.430583,47.038971,47.057182,4871300
2022-11-15,87.866775,88.805812,87.741571,88.743210,26608200,321.011695,322.557533,316.203512,318.849426,4824100,...,384.983192,386.096446,378.591594,382.430389,93194500,47.940585,48.013445,47.139146,47.530758,5610700
2022-11-16,89.449697,90.728573,89.199289,90.683861,28507700,318.337305,319.835707,318.223477,318.716644,3085500,...,380.789326,381.749025,378.879535,379.512939,68508500,47.394151,47.476117,47.075395,47.184685,4175000
2022-11-17,89.610693,89.941594,89.235081,89.726959,24528900,315.549135,319.389998,315.549135,318.745117,3606500,...,374.724024,379.033092,374.416943,378.351685,74496300,46.519867,47.239338,46.519867,47.221123,3871300
2022-11-18,89.968416,90.227769,89.029386,89.109871,14941300,320.684583,321.340284,318.593964,320.599060,3659100,...,381.710657,381.777843,377.200090,380.069580,92922500,47.275763,47.339514,47.002546,47.166477,2951400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-12-24,84.257634,85.080461,84.199555,85.061104,22377600,423.748295,427.443449,423.145592,427.315002,1431700,...,590.882430,596.116596,590.297529,596.076904,33160100,58.599468,58.599468,58.264558,58.520664,11058600
2024-12-26,84.422202,85.148227,84.412520,85.012703,19981800,425.803283,428.470925,425.753894,428.016418,1867400,...,594.292655,597.246751,592.885006,596.116699,41219100,58.579767,58.737372,58.461561,58.638866,2977800
2024-12-27,84.683577,84.973983,84.276997,84.315720,27262300,425.417974,427.107495,422.305743,424.844940,2429100,...,592.349606,592.587572,585.628530,589.841614,64969300,58.441867,58.619169,58.313812,58.510818,2758800
2024-12-30,85.022387,85.225673,84.867498,84.993347,48519600,420.369319,422.533067,417.454673,420.665710,3858300,...,582.783506,586.600040,579.333692,583.110596,56578800,58.136503,58.353211,57.909947,58.116802,4583800


In [4]:
# Reset column index
df_flat = df_etf.stack(level=0).reset_index()

# Rename columns for clarity
df_flat.columns = ["Date", "Ticker", "Open", "High", "Low", "Close", "Volume"]

  df_flat = df_etf.stack(level=0).reset_index()


In [5]:
df_flat

Unnamed: 0,Date,Ticker,Open,High,Low,Close,Volume
0,2022-11-14,DIA,319.323563,322.320396,318.232950,318.432098,3303300
1,2022-11-14,QQQ,280.487499,283.774544,279.094203,280.075409,55290700
2,2022-11-14,SPY,380.674142,384.052271,378.917877,379.196198,71903500
3,2022-11-14,TLT,87.562717,87.634258,86.829370,87.330193,13741100
4,2022-11-14,VXUS,47.130042,47.430583,47.038971,47.057182,4871300
...,...,...,...,...,...,...,...
2670,2024-12-31,DIA,421.792026,422.612062,418.956437,420.398926,2442700
2671,2024-12-31,QQQ,514.954366,515.711454,508.339345,509.305695,29117000
2672,2024-12-31,SPY,584.785885,585.509585,579.343582,580.989197,57052700
2673,2024-12-31,TLT,85.312791,85.457997,84.470606,84.538368,31917300


In [6]:
def add_indicators(df):
    df = df.copy()

    # Momentum - Detect overbought/oversold conditions
    df["RSI"] = ta.momentum.RSIIndicator(close=df["Close"]).rsi()
    df["StochRSI"] = ta.momentum.StochRSIIndicator(close=df["Close"]).stochrsi()

    # Trend - Identify direction and momentum
    macd = ta.trend.MACD(close=df["Close"], window_fast=12, window_slow=26, window_sign=9)
    df["MACD"] = macd.macd()
    df["MACD_signal"] = macd.macd_signal()
    df["MACD_diff"] = macd.macd_diff()

    df["SMA_20"] = ta.trend.SMAIndicator(close=df["Close"], window=20).sma_indicator()
    df["EMA_20"] = ta.trend.EMAIndicator(close=df["Close"], window=20).ema_indicator()
    df["SMA_50"] = ta.trend.SMAIndicator(close=df["Close"], window=50).sma_indicator()
    df["EMA_50"] = ta.trend.EMAIndicator(close=df["Close"], window=50).ema_indicator()

    # Volatility - Gauge price fluctuation and breakout potential
    bb = ta.volatility.BollingerBands(close=df["Close"], window=20, window_dev=2)
    df["BB_high"] = bb.bollinger_hband()
    df["BB_low"] = bb.bollinger_lband()
    df["BB_width"] = df["BB_high"] - df["BB_low"]

    # Average True Range (ATR) - measures market volatility
    df["ATR"] = ta.volatility.AverageTrueRange(
        high=df["High"], 
        low=df["Low"], 
        close=df["Close"], 
        window=14
    ).average_true_range()

    # Volume - Confirm price moves with volume strength
    df["OBV"] = ta.volume.OnBalanceVolumeIndicator(
        close=df["Close"], 
        volume=df["Volume"]
    ).on_balance_volume()

    # Chaikin Money Flow (CMF) - measures accumulation/distribution pressure
    df["CMF"] = ta.volume.ChaikinMoneyFlowIndicator(
        high=df["High"], 
        low=df["Low"], 
        close=df["Close"], 
        volume=df["Volume"], 
        window=20
    ).chaikin_money_flow()

    return df
    


Some breakdowns of the Technical Indicators from the book: Technical Analysis from A to Z by Steven B. Achelis

### Momentum 
- RSI (Relative strength index): Welles Wilder, New Concepts in Technical Trading
Systems(1978)

"Popular oscilator for comparing the internal strength of a single 'security'. Price following oscilator that ranges between 0 to 100. A popular method of analyzing the RSI is to look for a divergence in which the security is making a new high, but the RSI is failing to surpass its previous high. This divergence is an indication of an impending reversal. When the RSI then turns down and falls below its most recent trough, it is said to have completed a "failure swing." The failure swing is considered a confirmation of the impending reversal."

- StochRSI The StochRSI oscillator was developed to take advantage of both momentum indicators in order to create a more sensitive indicator that is attuned to a specific security’s historical performance rather than a generalized analysis of price change.
 


### Trend
- MACD

"The MACD is the difference between a 26-day and 12-day exponential moving average. A 9-day exponential moving average, called the "signal" (or "trigger") line is plotted on top of the MACD to show buy/sell opportunities."

There are three main ways to deploy the MACD; Crossover, Overbought/Oversold Conditions and Divergences. 

1.  Crossover: the general rule to follow is that when the MACD falls below the signal line the asset should be sold, conversely when it is above the signal line the asset should be bought. Additionally when the signal rises or falls above/below zero, it is also common practice to buy or sell relative to it's position.

2. Overbought/Oversold: The MACD can also indicate this attributem when the shorter moving average dramatically pulls away from the longer moving average this is a strong indicator that the asset price is overextending and soon will have to move back to more realistic levels. Since these conditions vary from stock to stock we won't be using this, as we're trying to generalize across market regimes and industries for various stocks at once. 

3.  Divergences: an indication that an end of a trend may be near occurs when the MACD diverges from the asset. A "bearish" divergence is when the MACD is going into new lows whilst the prices fail to reach those lows, A "bullish" divergence is when the MACD is making new highs whilst prices fail to rech new highs. both of these are most significant when overbought/oversold levals are high.

- SMA

"A simple, or arithmetic, moving average is calculated by adding the closing price of the security for a number of time periods (e.g., 12 days) and then dividing this total by the number of time periods. The result is the average price of the security over the time period. Simple moving averages give equal weight to each daily price."

The average is calculated by simply aggregating the closing prices of the stock for each candle at closing time, and dividing it by the number of candles within that timeframe. Since the SMA gives equal weight to all prices in the window, it is slower to react to price changes but better in identifying long-term trends. 

- EMA

"An exponential (or exponentially weighted) moving average is calculated by applying a percentage of today's closing price to yesterday's moving average value. Exponential moving averages place more weight on recent prices."

### Volatility

Bollinger Bands (BB)
Developed by John Bollinger in the 1980s, Bollinger Bands are envelopes (or bands) plotted at a standard deviation level above and below a moving average of price. Since the distance of the bands is based on standard deviation, they expand and contract as volatility increases or decreases.

A common interpretation is that prices tend to revert to the mean, so when the price touches or breaks above the upper band, the market is considered overbought; conversely, when the price touches or breaks below the lower band, the market may be oversold. Bollinger Bands are not designed to generate standalone buy or sell signals but rather to provide a relative definition of high and low prices.

The "squeeze" — a narrowing of the bands — often precedes a significant price movement, while a "band expansion" signals increased volatility and the potential continuation of a trend.

ATR (Average True Range)
The Average True Range, introduced by J. Welles Wilder in New Concepts in Technical Trading Systems (1978), measures market volatility by decomposing the entire range of an asset price for a given period. Unlike other volatility measures, ATR does not indicate direction (bullish or bearish) — it simply quantifies the degree of price movement.

ATR is calculated as the moving average of the True Range (TR), where TR is the greatest of the following:

Current high minus current low,

The absolute value of the current high minus the previous close, and

The absolute value of the current low minus the previous close.

A higher ATR value indicates greater volatility (wider price swings), while a lower ATR reflects calmer markets. Traders use ATR to set stop-loss levels and gauge the likelihood of breakout events.

### Volume

OBV (On-Balance Volume)
Developed by Joseph Granville in 1963, On-Balance Volume (OBV) measures buying and selling pressure as a cumulative indicator that adds volume on up days and subtracts volume on down days. The basic idea is that volume precedes price movement — if a security is seeing increasing OBV while price remains stable, the rising volume may foreshadow a price breakout.

Granville believed that when OBV increases sharply without a corresponding increase in price, prices will eventually rise to confirm the higher OBV, and vice versa. Divergences between OBV and price are often seen as early warnings of potential trend reversals.

CMF (Chaikin Money Flow)
The Chaikin Money Flow, developed by Marc Chaikin, quantifies the amount of Money Flow Volume over a specific period (commonly 20 days). It combines price and volume to show whether money is flowing into or out of a security.

CMF is based on the principle that accumulation (buying pressure) tends to occur when prices close near the high of the range with increasing volume, while distribution (selling pressure) occurs when prices close near the low with increasing volume.

Values range between -1 and +1.

A positive CMF suggests accumulation (bullish sentiment).

A negative CMF suggests distribution (bearish sentiment).

Divergences between CMF and price action can indicate weakening trends or reversals.

In [52]:
ti_series = add_indicators(df_flat)
ti_series = ti_series.groupby("Ticker", group_keys=False).apply(add_indicators)
ti_series

  ti_series = ti_series.groupby("Ticker", group_keys=False).apply(add_indicators)


Unnamed: 0,Date,Ticker,Open,High,Low,Close,Volume,RSI,StochRSI,MACD,...,SMA_20,EMA_20,SMA_50,EMA_50,BB_high,BB_low,BB_width,ATR,OBV,CMF
0,2022-11-14,DIA,319.323563,322.320396,318.232950,318.432098,3303300,,,,...,,,,,,,,0.000000,3303300,
1,2022-11-14,QQQ,280.487499,283.774544,279.094203,280.075409,55290700,,,,...,,,,,,,,0.000000,55290700,
2,2022-11-14,SPY,380.674142,384.052271,378.917877,379.196198,71903500,,,,...,,,,,,,,0.000000,71903500,
3,2022-11-14,TLT,87.562717,87.634258,86.829370,87.330193,13741100,,,,...,,,,,,,,0.000000,13741100,
4,2022-11-14,VXUS,47.130042,47.430583,47.038971,47.057182,4871300,,,,...,,,,,,,,0.000000,4871300,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2670,2024-12-31,DIA,421.792026,422.612062,418.956437,420.398926,2442700,37.711105,0.347999,-2.686600,...,430.583728,427.716480,428.498165,426.692069,446.989090,414.178366,32.810724,4.756474,193033100,-0.268159
2671,2024-12-31,QQQ,514.954366,515.711454,508.339345,509.305695,29117000,44.908909,0.000000,2.609418,...,521.112582,517.766850,507.829073,508.651419,534.970306,507.254859,27.715448,8.013936,2107284300,0.017238
2672,2024-12-31,SPY,584.785885,585.509585,579.343582,580.989197,57052700,41.593674,0.148148,-0.427133,...,593.028421,589.825451,585.973876,584.695211,607.509027,578.547815,28.961212,6.719261,2294842200,-0.074308
2673,2024-12-31,TLT,85.312791,85.457997,84.470606,84.538368,31917300,35.236680,0.215565,-1.149875,...,87.297853,86.449028,87.740692,87.916672,91.986199,82.609508,9.376692,0.975991,222013500,-0.418228


In [None]:
# Reset column index
df_flat = df_etf.stack(level=0).reset_index()

# Rename columns for clarity
df_flat.columns = ["Date", "Ticker", "Open", "High", "Low", "Close", "Volume"]

In [90]:
# Download ^IRX (13-week Treasury Bill rate)
rf = yf.download("^IRX", start=start_date, end="2025-01-01", interval="1d")

  rf = yf.download("^IRX", start=start_date, end="2025-01-01", interval="1d")
[*********************100%***********************]  1 of 1 completed


In [101]:
# Reset column index
rf_flat = rf.stack(level=1).reset_index()

# Name columns for clarity
rf_flat.columns = ["Date", "Ticker", "Open", "High", "Low", "Close", "Volume"]

# Drop unnessecary columns (Only need Date and Close)
rf_new = rf_flat.drop(columns=["Ticker", "Open", "High", "Low", "Volume"])

# rename and convert Risk-free rate
rf_new.columns = ["Date", "RiskFreeRate"]
rf_new["RiskFreeRate"] = rf_new["RiskFreeRate"] / 100  # Convert % to decimal

rf_new


  rf_flat = rf.stack(level=1).reset_index()


Unnamed: 0,Date,RiskFreeRate
0,2022-11-14,0.04078
1,2022-11-15,0.04125
2,2022-11-16,0.04120
3,2022-11-17,0.04128
4,2022-11-18,0.04123
...,...,...
530,2024-12-24,0.04220
531,2024-12-26,0.04210
532,2024-12-27,0.04203
533,2024-12-30,0.04178


In [103]:
ti_series["Date"] = pd.to_datetime(ti_series["Date"])
rf_new["Date"] = pd.to_datetime(rf_new["Date"])

df_merged = ti_series.merge(rf_new, on="Date", how="left")
# df_merged["RiskFreeRate"].fillna(method="ffill", inplace=True)


In [106]:
df_working = df_merged.dropna()
df_working

Unnamed: 0,Date,Ticker,Open,High,Low,Close,Volume,RSI,StochRSI,MACD,...,EMA_20,SMA_50,EMA_50,BB_high,BB_low,BB_width,ATR,OBV,CMF,RiskFreeRate
245,2023-01-26,DIA,322.625098,323.587853,320.480301,323.492523,2930900,56.216923,0.666116,0.749848,...,320.070778,319.988742,319.568451,327.175633,311.562722,15.612911,4.387267,17238600,0.233536,0.0454
246,2023-01-26,QQQ,286.430860,288.693024,283.775249,288.515991,51596300,67.042529,1.000000,3.635249,...,275.348414,274.374741,274.477274,289.871179,252.068591,37.802588,5.750424,-2413700,0.302198,0.0454
247,2023-01-26,SPY,388.659925,390.385682,385.671191,390.221771,72287400,63.625421,1.000000,3.139059,...,380.033194,378.212150,378.331911,392.609755,361.677702,30.932053,5.855391,-194078100,0.246413,0.0454
248,2023-01-26,TLT,96.277709,96.736005,95.693607,96.133934,15549400,55.653493,0.314276,0.930721,...,95.064663,93.651888,93.407699,99.229024,89.363514,9.865510,1.536824,217403000,0.215131,0.0454
249,2023-01-26,VXUS,52.186704,52.278889,51.836395,52.223579,2630800,76.412713,1.000000,1.036595,...,50.513774,48.783931,49.215089,53.230899,47.062338,6.168561,0.582645,43425300,0.189366,0.0454
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2670,2024-12-31,DIA,421.792026,422.612062,418.956437,420.398926,2442700,37.711105,0.347999,-2.686600,...,427.716480,428.498165,426.692069,446.989090,414.178366,32.810724,4.756474,193033100,-0.268159,0.0421
2671,2024-12-31,QQQ,514.954366,515.711454,508.339345,509.305695,29117000,44.908909,0.000000,2.609418,...,517.766850,507.829073,508.651419,534.970306,507.254859,27.715448,8.013936,2107284300,0.017238,0.0421
2672,2024-12-31,SPY,584.785885,585.509585,579.343582,580.989197,57052700,41.593674,0.148148,-0.427133,...,589.825451,585.973876,584.695211,607.509027,578.547815,28.961212,6.719261,2294842200,-0.074308,0.0421
2673,2024-12-31,TLT,85.312791,85.457997,84.470606,84.538368,31917300,35.236680,0.215565,-1.149875,...,86.449028,87.740692,87.916672,91.986199,82.609508,9.376692,0.975991,222013500,-0.418228,0.0421


In [112]:
# Compute daily returns
df_working.loc[:, "Return"] = (
    df_working
    .groupby("Ticker")["Close"]
    .pct_change()
)

df_working = df_working.dropna()
df_working

Unnamed: 0,Date,Ticker,Open,High,Low,Close,Volume,RSI,StochRSI,MACD,...,SMA_50,EMA_50,BB_high,BB_low,BB_width,ATR,OBV,CMF,RiskFreeRate,Return
255,2023-01-30,DIA,322.634614,324.512504,321.099906,321.300079,3032700,51.975106,0.445685,0.889574,...,320.143730,319.793187,327.426634,312.894054,14.532580,4.240357,17113300,0.218881,0.04545,-0.007509
256,2023-01-30,QQQ,288.112781,289.538948,285.122795,285.496521,49405800,60.493853,0.541674,4.517052,...,274.631899,275.546562,293.573526,254.266601,39.306925,5.821996,4622400,0.290045,0.04545,-0.020219
257,2023-01-30,SPY,388.341786,390.588169,385.912250,386.211121,74202000,57.165089,0.544017,3.541691,...,378.526209,379.122668,394.353181,364.385546,29.967634,5.722476,-199933900,0.216428,0.04545,-0.012547
258,2023-01-30,TLT,95.666653,96.295689,95.424030,95.540848,11459700,53.152002,0.065702,0.806478,...,93.959063,93.584929,98.898841,90.843923,8.054918,1.439234,194097800,0.223361,0.04545,-0.003655
259,2023-01-30,VXUS,51.753428,51.928584,51.550617,51.559837,2793900,65.073551,0.000000,0.976521,...,48.965444,49.415878,53.327812,47.777587,5.550225,0.565738,37863200,0.188122,0.04545,-0.010439
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2670,2024-12-31,DIA,421.792026,422.612062,418.956437,420.398926,2442700,37.711105,0.347999,-2.686600,...,428.498165,426.692069,446.989090,414.178366,32.810724,4.756474,193033100,-0.268159,0.04210,-0.000634
2671,2024-12-31,QQQ,514.954366,515.711454,508.339345,509.305695,29117000,44.908909,0.000000,2.609418,...,507.829073,508.651419,534.970306,507.254859,27.715448,8.013936,2107284300,0.017238,0.04210,-0.008495
2672,2024-12-31,SPY,584.785885,585.509585,579.343582,580.989197,57052700,41.593674,0.148148,-0.427133,...,585.973876,584.695211,607.509027,578.547815,28.961212,6.719261,2294842200,-0.074308,0.04210,-0.003638
2673,2024-12-31,TLT,85.312791,85.457997,84.470606,84.538368,31917300,35.236680,0.215565,-1.149875,...,87.740692,87.916672,91.986199,82.609508,9.376692,0.975991,222013500,-0.418228,0.04210,-0.005353


# Rules

In [121]:
# creaing a rule base for the labeling

df_working["OBV_prev"] = df_working["OBV"].shift(1)

def label_row(row):
    bullish_signals = 0
    bearish_signals = 0

    # Momentum
    if row["RSI"] < 30 or row["StochRSI"] < 0.2: bullish_signals += 1
    if row["RSI"] > 70 or row["StochRSI"] > 0.8: bearish_signals += 1

    # Trend
    if row["MACD"] > row["MACD_signal"]: bullish_signals += 1
    if row["MACD"] < row["MACD_signal"]: bearish_signals += 1
    if row["EMA_20"] > row["EMA_50"]: bullish_signals += 1
    if row["EMA_20"] < row["EMA_50"]: bearish_signals += 1

    # Volatility
    if row["Close"] < row["BB_low"]: bullish_signals += 1
    if row["Close"] > row["BB_high"]: bearish_signals += 1

    # Volume (fixed)
    if row["CMF"] > 0 or row["OBV"] > row["OBV_prev"]: bullish_signals += 1
    if row["CMF"] < 0 or row["OBV"] < row["OBV_prev"]: bearish_signals += 1

    # Decision
    if bullish_signals - bearish_signals >= 2:
        return 1    # Buy
    elif bearish_signals - bullish_signals >= 2:
        return -1   # Sell
    else:
        return 0    # Hold

df_working["Label"] = df_working.apply(label_row, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_working["OBV_prev"] = df_working["OBV"].shift(1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_working["Label"] = df_working.apply(label_row, axis=1)




The labeling strategy is derived from classical technical analysis indicators. 
- The thresholds for momentum-based conditions (e.g., RSI < 30 for oversold) follow the conventions introduced by Wilder (1978). 
- Trend confirmation is obtained via MACD crossovers as formulated by Appel (1979)
- Volatility-based breakout conditions rely on Bollinger’s (2001) band theory. 
- Volume confirmation integrates the On-Balance Volume (Granville, 1963) and Chaikin Money Flow (Chaikin, 1994) indicators. 

This multi-indicator approach aligns with the integrated framework outlined by Murphy (1999) and is consistent with prior computational applications of technical analysis in automated systems (Neely et al., 1997).

In [127]:
df_working = df_working.dropna()
df_working.head(10)

Unnamed: 0,Date,Ticker,Open,High,Low,Close,Volume,RSI,StochRSI,MACD,...,BB_high,BB_low,BB_width,ATR,OBV,CMF,RiskFreeRate,Return,OBV_prev,Label
257,2023-01-30,SPY,388.341786,390.588169,385.91225,386.211121,74202000,57.165089,0.544017,3.541691,...,394.353181,364.385546,29.967634,5.722476,-199933900,0.216428,0.04545,-0.012547,4622400.0,1
258,2023-01-30,TLT,95.666653,96.295689,95.42403,95.540848,11459700,53.152002,0.065702,0.806478,...,98.898841,90.843923,8.054918,1.439234,194097800,0.223361,0.04545,-0.003655,-199933900.0,1
259,2023-01-30,VXUS,51.753428,51.928584,51.550617,51.559837,2793900,65.073551,0.0,0.976521,...,53.327812,47.777587,5.550225,0.565738,37863200,0.188122,0.04545,-0.010439,194097800.0,1
260,2023-01-31,DIA,321.814843,324.922395,320.775811,324.893799,2593500,57.473889,0.731436,1.124302,...,327.85129,313.401731,14.44956,4.233659,19706800,0.20724,0.0458,0.011185,37863200.0,1
261,2023-01-31,QQQ,285.535874,289.873355,285.427697,289.774994,46705100,64.02833,0.724099,4.881781,...,295.310598,255.316923,39.993675,5.723686,51327500,0.296909,0.0458,0.014986,19706800.0,1
262,2023-01-31,SPY,386.731768,391.937934,386.384675,391.88974,86811800,62.429677,0.875485,3.902873,...,395.61963,365.437774,30.181856,5.722786,-113122100,0.220715,0.0458,0.014703,51327500.0,0
263,2023-01-31,TLT,96.133963,96.403543,95.136498,96.304695,13705300,55.978963,0.334854,0.79876,...,98.425856,92.000756,6.4251,1.426935,207803100,0.294537,0.0458,0.007995,-113122100.0,0
264,2023-01-31,VXUS,51.430774,51.836393,51.292493,51.817955,2668200,67.142234,0.182437,0.941071,...,53.249281,48.270034,4.979247,0.564178,40531400,0.290244,0.0458,0.005006,207803100.0,1
265,2023-02-01,DIA,323.26374,327.219693,319.889252,324.931915,6175400,57.529434,0.734323,1.298434,...,328.148361,314.051279,14.097081,4.454858,25882200,0.229023,0.0456,0.000117,40531400.0,1
266,2023-02-01,QQQ,289.568422,298.440077,287.493105,295.971375,67562200,68.433129,0.959717,5.606203,...,297.752106,256.459484,41.292622,6.096778,118889700,0.344647,0.0456,0.021383,25882200.0,1


# NN Architecture