### Feature Engineering

Using historical OHLCV data, we engineer a set of features (X) that describe recent price behavior, risk, market state, and trading activity. These features are motivated by exploratory analysis and are used to predict the future 5-day return (y).

In [19]:
import pandas as pd
import numpy as np

df = pd.read_csv("amzn_data.csv")
df.head()

Unnamed: 0,Date,Close,High,Low,Open,Volume,daily_return,future_5d_return,ret_5d_past,ret_5d_future,vol_20d,rsi,rsi_bucket,vol_chg
0,2014-01-02,19.8985,19.968,19.701,19.940001,42756000,,0.007639,,0.007639,,,,
1,2014-01-03,19.822001,20.1355,19.811001,19.914499,44204000,-0.003845,0.003077,,0.003077,,,,0.033867
2,2014-01-06,19.681499,19.85,19.421,19.7925,63412000,-0.007088,-0.006732,,-0.006732,,,,0.434531
3,2014-01-07,19.901501,19.9235,19.7145,19.752001,38320000,0.011178,-0.001231,,-0.001231,,,,-0.395698
4,2014-01-08,20.096001,20.15,19.802,19.9235,46330000,0.009773,-0.015053,,-0.015053,,,,0.209029


In [20]:
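# Columns that must be numeric before computing features
price_cols = ["Open", "High", "Low", "Close", "Volume", "future_5d_return"]

# errors="coerce" turns any unparseable value into NaN instead of raising
for col in price_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")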


## Momentum Features

In [21]:

# Trailing returns over 5, 10, and 20 rows (trading days)
df["ret_5d"] = df["Close"].pct_change(5)
df["ret_10d"] = df["Close"].pct_change(10)
df["ret_20d"] = df["Close"].pct_change(20)


In [22]:
df[["Close", "ret_5d", "ret_10d", "ret_20d"]].head(25)


Unnamed: 0,Close,ret_5d,ret_10d,ret_20d
0,19.8985,,,
1,19.822001,,,
2,19.681499,,,
3,19.901501,,,
4,20.096001,,,
5,20.050501,0.007639,,
6,19.882999,0.003077,,
7,19.549,-0.006732,,
8,19.877001,-0.001231,,
9,19.793501,-0.015053,,


### Interpretation

These momentum features describe recent price direction over three horizons (5, 10, and 20 trading days).
They do not attempt to predict the future directly, but instead provide the model
with context about where the stock has been recently.

All momentum features are computed using only historical data: `pct_change(n)` compares each close with the close n rows earlier, so the first n rows are NaN.
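
To make the backward-looking nature of these features concrete, here is a minimal sketch on a toy price series (the values are made up and are not the AMZN data). It checks that `pct_change(5)` matches the explicit ratio of the current close to the close 5 rows earlier.

```python
import pandas as pd

# Toy closing prices (made-up values, not the AMZN data).
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0, 110.0, 99.0])

# pct_change(5) compares each price with the price 5 rows earlier.
ret_5d = close.pct_change(5)

# The same quantity written out explicitly: P_t / P_{t-5} - 1.
ret_5d_manual = close / close.shift(5) - 1

# The first 5 rows have no price 5 rows back, so they are NaN.
n_warmup = int(ret_5d.isna().sum())
```

Because only `shift(5)` with a positive offset is involved, each value depends on the current and earlier rows only.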


## Volatility Features



In [23]:
# 10-day rolling volatility
df["vol_10d"] = df["daily_return"].rolling(10).std()

# 20-day rolling volatility
df["vol_20d"] = df["daily_return"].rolling(20).std()


In [24]:
df[["vol_10d", "vol_20d"]].head(30)

Unnamed: 0,vol_10d,vol_20d
0,,
1,,
2,,
3,,
4,,
5,,
6,,
7,,
8,,
9,,


### Interpretation

These volatility features quantify recent uncertainty in returns.
Higher volatility implies a wider range of possible future outcomes,
while lower volatility suggests tighter return ranges.

These features are particularly important for learning downside (q10)
and upside (q90) quantiles.
> Volatility controls how wide the prediction range should be.
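
As a rough illustration of how rolling volatility tracks a change in regime, the sketch below uses synthetic returns (a calm stretch followed by a turbulent one). The regime lengths, standard deviations, and random seed are assumptions chosen for the example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily returns: a calm regime followed by a turbulent one.
calm = rng.normal(0.0, 0.005, size=60)
turbulent = rng.normal(0.0, 0.03, size=60)
daily_return = pd.Series(np.concatenate([calm, turbulent]))

vol_10d = daily_return.rolling(10).std()

# rolling(10) needs 10 observations, so the first 9 rows are NaN.
n_warmup = int(vol_10d.isna().sum())

# Average rolling volatility for windows lying fully inside each regime.
calm_vol = vol_10d.iloc[9:60].mean()
turbulent_vol = vol_10d.iloc[69:].mean()
```

In the notebook itself, `daily_return` is NaN in the first row, so `vol_10d` has one extra warm-up row: rows 0 through 9 are NaN, as the output above shows.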



## Market State Indicator (RSI)




In [25]:

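# Day-over-day change in closing price
delta = df["Close"].diff()

# Split changes into gains and losses (both non-negative)
gain = delta.clip(lower=0)
loss = -delta.clip(upper=0)

# Simple 14-day averages of gains and losses
avg_gain = gain.rolling(14).mean()
avg_loss = loss.rolling(14).mean()

# Relative strength and RSI on a 0-100 scale
rs = avg_gain / avg_loss
df["rsi"] = 100 - (100 / (1 + rs))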


### Interpretation

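RSI provides a compact, bounded (0 to 100) summary of recent market state: the share of recent price movement that came from gains rather than losses.
Here it is computed with simple 14-day rolling means of gains and losses, rather than the smoothed averages of Wilder's original formulation.
Extreme RSI values may correspond to asymmetric future return distributions, which is the motivation for including RSI as a feature.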
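
To see how this RSI variant behaves at its extremes, the sketch below wraps the same steps in a helper and applies it to a toy series of 20 straight up days followed by 20 straight down days. The function name `simple_rsi` and the toy prices are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd


def simple_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI from simple rolling means of gains and losses (hypothetical helper)."""
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    avg_gain = gain.rolling(window).mean()
    avg_loss = loss.rolling(window).mean()
    rs = avg_gain / avg_loss
    return 100 - (100 / (1 + rs))


# Toy series: 20 rising closes, then 20 falling closes (made-up values).
up = np.arange(100.0, 120.0)         # 100, 101, ..., 119
down = np.arange(118.0, 98.0, -1.0)  # 118, 117, ..., 99
close = pd.Series(np.concatenate([up, down]))

rsi = simple_rsi(close)

first_valid = rsi.iloc[14]  # window contains only gains
balanced = rsi.iloc[26]     # window contains 7 gains and 7 losses
last = rsi.iloc[39]         # window contains only losses
```

A window with only gains pushes RSI to 100, a window with only losses pushes it to 0, and equal gains and losses give 50. The first 14 rows are NaN because the 14-row window still includes the undefined first difference.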





## Volume-Based Features



In [26]:
# Day-over-day percentage change in traded volume (volume, not volatility)
df["vol_chg"] = df["Volume"].pct_change()


In [27]:
# Relative volume: today's volume divided by its 20-day rolling mean
df["vol_avg_20d"] = df["Volume"].rolling(20).mean()
df["vol_ratio"] = df["Volume"] / df["vol_avg_20d"]


### Interpretation

Volume change captures sudden spikes in trading activity,
while relative volume compares current participation to recent norms.

Together, these features help the model distinguish between
high-conviction and low-conviction price movements, which is useful
for assessing the reliability of predicted return ranges.
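
As a small worked example of the two volume features, the sketch below uses a toy volume series (made-up values) with 24 ordinary days followed by a single spike, and computes both features on the spike day.

```python
import pandas as pd

# Toy volumes: 24 ordinary days at 1,000,000 shares, then a spike.
volume = pd.Series([1_000_000.0] * 24 + [3_000_000.0])

vol_chg = volume.pct_change()
vol_avg_20d = volume.rolling(20).mean()
vol_ratio = volume / vol_avg_20d

spike_chg = vol_chg.iloc[-1]      # day-over-day change on the spike day
spike_ratio = vol_ratio.iloc[-1]  # spike volume relative to its 20-day mean
```

On the spike day the volume triples, so `spike_chg` is 2.0, while `spike_ratio` is 3,000,000 / 1,100,000, roughly 2.73. The ratio is below 3 because the 20-day rolling mean includes the current day, so a spike raises its own baseline.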


## Lagged Risk Features



In [28]:


# 10-day volatility as it stood 1 and 5 rows earlier
df["vol_10d_lag1"] = df["vol_10d"].shift(1)
df["vol_10d_lag5"] = df["vol_10d"].shift(5)


In [29]:


# 20-day volatility as it stood 1 and 5 rows earlier
df["vol_20d_lag1"] = df["vol_20d"].shift(1)
df["vol_20d_lag5"] = df["vol_20d"].shift(5)


### Interpretation

Lagged volatility features capture the recent evolution of risk levels.
They allow the model to distinguish between:

- Sustained high volatility regimes
- Temporarily elevated risk
- Transition phases between regimes

These features are especially useful for learning stable q10 and q90 bounds.
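
The sketch below shows what the lags look like on a toy volatility path (made-up values) that is flat and then rises. The `vol_change_5` column is a hypothetical derived signal, not one of the notebook's features, included only to show how current and lagged values together reveal a transition.

```python
import pandas as pd

# Toy volatility path (made-up values): calm, then rising.
vol = pd.Series([0.010, 0.010, 0.010, 0.010, 0.010, 0.012, 0.015, 0.020])

lagged = pd.DataFrame({
    "vol": vol,
    "vol_lag1": vol.shift(1),  # value from the previous row
    "vol_lag5": vol.shift(5),  # value from 5 rows earlier
})

# Hypothetical derived signal: change in volatility over the last 5 rows.
lagged["vol_change_5"] = lagged["vol"] - lagged["vol_lag5"]
```

In the last row the current value is 0.020 while the 5-row lag is still 0.010, which is the pattern one would expect during a transition into a higher-volatility regime. In a sustained regime the current and lagged values would be close to each other.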


In [30]:
df.head()

Unnamed: 0,Date,Close,High,Low,Open,Volume,daily_return,future_5d_return,ret_5d_past,ret_5d_future,...,ret_5d,ret_10d,ret_20d,vol_10d,vol_avg_20d,vol_ratio,vol_10d_lag1,vol_10d_lag5,vol_20d_lag1,vol_20d_lag5
0,2014-01-02,19.8985,19.968,19.701,19.940001,42756000,,0.007639,,0.007639,...,,,,,,,,,,
1,2014-01-03,19.822001,20.1355,19.811001,19.914499,44204000,-0.003845,0.003077,,0.003077,...,,,,,,,,,,
2,2014-01-06,19.681499,19.85,19.421,19.7925,63412000,-0.007088,-0.006732,,-0.006732,...,,,,,,,,,,
3,2014-01-07,19.901501,19.9235,19.7145,19.752001,38320000,0.011178,-0.001231,,-0.001231,...,,,,,,,,,,
4,2014-01-08,20.096001,20.15,19.802,19.9235,46330000,0.009773,-0.015053,,-0.015053,...,,,,,,,,,,


In [31]:
df.columns

Index(['Date', 'Close', 'High', 'Low', 'Open', 'Volume', 'daily_return',
       'future_5d_return', 'ret_5d_past', 'ret_5d_future', 'vol_20d', 'rsi',
       'rsi_bucket', 'vol_chg', 'ret_5d', 'ret_10d', 'ret_20d', 'vol_10d',
       'vol_avg_20d', 'vol_ratio', 'vol_10d_lag1', 'vol_10d_lag5',
       'vol_20d_lag1', 'vol_20d_lag5'],
      dtype='object')

In [33]:
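# Drop columns that are not kept in the final feature set
drop_cols = [
    "daily_return",
    "ret_5d_past",
    "ret_5d_future",  # same values as future_5d_return in the rows shown above
    "rsi_bucket",
    "vol_avg_20d",    # intermediate used only to compute vol_ratio
]

df = df.drop(columns=drop_cols)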


In [34]:
df.columns

Index(['Date', 'Close', 'High', 'Low', 'Open', 'Volume', 'future_5d_return',
       'vol_20d', 'rsi', 'vol_chg', 'ret_5d', 'ret_10d', 'ret_20d', 'vol_10d',
       'vol_ratio', 'vol_10d_lag1', 'vol_10d_lag5', 'vol_20d_lag1',
       'vol_20d_lag5'],
      dtype='object')

In [35]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("amzn_features_final.csv", index=False)
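
As a possible next step, the saved table could be split into features (X) and the target (y). The sketch below is an assumption, not the notebook's own method: it uses a toy frame with made-up values and only a subset of the columns, and it drops rows where a rolling-window feature or the 5-day-ahead target is not yet available.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table (made-up values, subset of columns).
df = pd.DataFrame({
    "Date": pd.date_range("2014-01-02", periods=6, freq="B"),
    "ret_5d": [np.nan, 0.01, 0.02, -0.01, 0.00, 0.03],
    "vol_10d": [np.nan, np.nan, 0.012, 0.011, 0.013, 0.015],
    "future_5d_return": [0.02, 0.01, -0.01, 0.00, np.nan, np.nan],
})

target = "future_5d_return"
feature_cols = [c for c in df.columns if c not in ("Date", target)]

# Keep only rows where every feature and the target are available.
model_df = df.dropna(subset=feature_cols + [target])

X = model_df[feature_cols]
y = model_df[target]
```

Here only rows 2 and 3 survive: the earlier rows lack warm-up history for the features, and the later rows have no realized future return yet.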