# 03 Label Creation

In regression we don't create a 0/1 label. Instead we create:
- X = features (predictors at time t)
- y = continuous target (future volatility at time t + horizon)
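
To make the "features at time t, target at time t + horizon" alignment concrete, here is a minimal sketch on toy data. The `target_vol` column in this project already exists in `volatility_features.csv`, so how it was actually built is not shown here; the `horizon` value, the toy series, and the name `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for realized volatility.
dates = pd.bdate_range("2024-01-01", periods=8)
rv = pd.Series(np.linspace(0.10, 0.17, 8), index=dates, name="realized_vol")

horizon = 2  # assumed forecast horizon, measured in rows (trading days)

toy = pd.DataFrame({"realized_vol": rv})
# shift(-horizon) pulls the value observed at t + horizon back onto row t,
# so each row pairs today's feature with the future target.
toy["target_vol"] = toy["realized_vol"].shift(-horizon)

# The last `horizon` rows have no future value yet, so they become NaN
# and are dropped.
toy = toy.dropna()
print(toy)
```

The point of the sketch is that the target column looks forward in time while every feature looks backward, which is why the split later on has to be by date rather than random.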

This notebook’s job:
1. load volatility_features.csv
2. define X and y cleanly
3. split the data by date into train and test sets
4. save train_regression.csv / test_regression.csv (optional, but keeps later notebooks clean)

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data_processed/volatility_features.csv", index_col=0, parse_dates=True)
df.head(), df.shape

(            realized_vol  target_vol   rv_lag1   rv_lag5  rv_lag20  \
 Date                                                                 
 2010-04-28      0.135474    0.179863  0.133121  0.098533  0.072647   
 2010-04-29      0.140353    0.213912  0.135474  0.095851  0.075440   
 2010-04-30      0.153672    0.215462  0.140353  0.096389  0.077133   
 2010-05-03      0.157710    0.271736  0.153672  0.098244  0.067241   
 2010-05-04      0.179769    0.271638  0.157710  0.133121  0.066840   
 
             rv_change1  rv_change5  ret_lag1  ret_lag5   abs_ret  \
 Date                                                               
 2010-04-28    0.002353    0.036941 -0.023935 -0.001821  0.007568   
 2010-04-29    0.004879    0.044502  0.007568  0.002979  0.012321   
 2010-04-30    0.013319    0.057284  0.012321  0.006507  0.017107   
 2010-05-03    0.004037    0.059466 -0.017107 -0.003784  0.012878   
 2010-05-04    0.022060    0.046649  0.012878 -0.023935  0.023796   
 
             abs

### 1. Store the features

In [5]:
target_col = "target_vol"

# Basic: use everything except the target
feature_cols = [c for c in df.columns if c != target_col]

X = df[feature_cols].copy()
y = df[target_col].copy()

X.shape, y.shape

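# Log-transform the target to reduce right skew (closer to Gaussian).
# np.log needs strictly positive values; y <= 0 would yield -inf or NaN.
y_log = np.log(y)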

df_model = X.copy()
df_model["y"] = y
df_model["y_log"] = y_log
df_model = df_model.dropna()

df_model.head(), df_model.shape

(            realized_vol   rv_lag1   rv_lag5  rv_lag20  rv_change1  \
 Date                                                                 
 2010-04-28      0.135474  0.133121  0.098533  0.072647    0.002353   
 2010-04-29      0.140353  0.135474  0.095851  0.075440    0.004879   
 2010-04-30      0.153672  0.140353  0.096389  0.077133    0.013319   
 2010-05-03      0.157710  0.153672  0.098244  0.067241    0.004037   
 2010-05-04      0.179769  0.157710  0.133121  0.066840    0.022060   
 
             rv_change5  ret_lag1  ret_lag5   abs_ret  abs_ret_mean5  \
 Date                                                                  
 2010-04-28    0.036941 -0.023935 -0.001821  0.007568       0.008955   
 2010-04-29    0.044502  0.007568  0.002979  0.012321       0.010823   
 2010-04-30    0.057284  0.012321  0.006507  0.017107       0.012943   
 2010-05-03    0.059466 -0.017107 -0.003784  0.012878       0.014762   
 2010-05-04    0.046649  0.012878 -0.023935  0.023796       0.014734 
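
One detail worth checking around the log transform: `np.log(0)` returns `-inf`, and `dropna()` does not remove infinite values by default. The sketch below, on made-up numbers, shows one possible guard and the matching back-transform with `np.exp`. The names `y_toy`, `valid`, and `y_back` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical target values, including a zero and a missing entry.
y_toy = pd.Series([0.18, 0.21, 0.27, 0.0, np.nan])

# Keep only strictly positive values. NaN > 0 evaluates to False,
# so missing entries are excluded by the same mask.
valid = y_toy > 0
y_toy_log = np.log(y_toy[valid])

# Predictions made on the log scale can be mapped back to the
# volatility scale with the inverse transform.
y_back = np.exp(y_toy_log)
print(y_back)
```

If the real `target_vol` is always positive, the mask is a no-op and the notebook's `dropna()` is sufficient.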

### 2. Split up the data into train and test

In [7]:
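# Rows strictly before split_date go to train; rows on or after it go to test.
# Splitting by date (not randomly) keeps future data out of the training set.
split_date = "2024-01-01"

train = df_model[df_model.index < split_date].copy()
test  = df_model[df_model.index >= split_date].copy()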

train.shape, test.shape


((3443, 17), (497, 17))
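
A quick sanity check for a date-based split is that the last training date falls before the first test date and that no rows are lost. The following is a small sketch on toy data; the helper `time_split` and the toy frame are hypothetical and only mirror the boolean-mask logic used above.

```python
import pandas as pd

def time_split(frame, split_date):
    """Split a date-indexed frame into rows before and on/after split_date."""
    cutoff = pd.Timestamp(split_date)
    train_part = frame[frame.index < cutoff].copy()
    test_part = frame[frame.index >= cutoff].copy()
    return train_part, test_part

# Hypothetical frame with six business days straddling the cutoff.
idx = pd.bdate_range("2023-12-27", periods=6)
toy = pd.DataFrame({"y": range(6)}, index=idx)

train_part, test_part = time_split(toy, "2024-01-01")

# No overlap in time, and every row lands in exactly one part.
assert train_part.index.max() < test_part.index.min()
assert len(train_part) + len(test_part) == len(toy)
```

The same two assertions could be run on `train` and `test` directly in the notebook.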

In [8]:
train.to_csv("../data_processed/train_regression.csv")
test.to_csv("../data_processed/test_regression.csv")

print("Saved:")
print("- ../data_processed/train_regression.csv")
print("- ../data_processed/test_regression.csv")


Saved:
- ../data_processed/train_regression.csv
- ../data_processed/test_regression.csv
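
When these files are read back in a later notebook, the saved frame contains both target columns (`y` and `y_log`) next to the features, so both should be excluded from X. The sketch below round-trips a tiny hypothetical frame through an in-memory buffer instead of the real files; the column values and the names `saved`, `loaded`, `X_loaded`, and `y_loaded` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same layout as train_regression.csv:
# feature columns plus the two target columns "y" and "y_log".
saved = pd.DataFrame(
    {"realized_vol": [0.14, 0.15], "y": [0.18, 0.21], "y_log": [-1.71, -1.56]},
    index=pd.DatetimeIndex(["2010-04-28", "2010-04-29"], name="Date"),
)

# Write to an in-memory buffer so the sketch does not touch the disk.
buffer = io.StringIO()
saved.to_csv(buffer)
buffer.seek(0)

# Same read options as the load cell at the top of this notebook.
loaded = pd.read_csv(buffer, index_col=0, parse_dates=True)

# Drop both targets from the features to avoid leaking y into X.
target_cols = ["y", "y_log"]
X_loaded = loaded.drop(columns=target_cols)
y_loaded = loaded["y_log"]
print(X_loaded.columns.tolist())
```

Whether to model `y` or `y_log` is a choice for the modeling notebook; the sketch picks `y_log` only as an example.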
