# 🧩 **02 - Feature Engineering**

**Notebook Purpose:**
Transform insights from exploratory data analysis into *explicit, reproducible features* suitable for time-series classification. This notebook focuses on constructing lagged, rolling, and event-aware features while preserving temporal integrity and avoiding data leakage.

---

**Competition:** *Detect Reversal Points in US Equities*
**Deadline:** December 31, 2025
**Repository:** `Kaggle-Detect-Reversal-Points-in-US-Equities`
**Author:** Brice Nelson

---

**Notebook Date Created:** 2025-12-15<br>
**Notebook Last Updated:** 2025-12-15

---

## 🧭 **Goals of This Notebook**

- Ingest validated raw and baseline datasets
- Engineer time-aware features derived from Signal Descriptor columns
- Construct lag-based and rolling window features without leakage
- Encode sparse event information relevant to reversal detection
- Maintain compatibility with baseline and advanced modeling pipelines
- Persist engineered datasets to `/data/processed/`
- Document feature rationale for downstream interpretation

---

## 🔗 **Context from Prior Analysis**

Feature engineering decisions in this notebook are informed by findings from the light EDA phase, including:

- Extremely wide feature space dominated by boolean Signal Descriptor columns
- Sparse, event-based target labels (`H`, `L`, `None`)
- Strong temporal ordering within each `ticker_id`
- Need for models robust to high-dimensional, sparse inputs

Detailed exploratory analysis and visualization are deferred to
`01_eda_detailed.ipynb`.

---

## 📂 **References**

- Light EDA: `notebooks/01_eda.ipynb`
- Detailed EDA (planned): `notebooks/01_eda_detailed.ipynb`
- Project Plan: `docs/00_overview/01_reversal_points_project_plan.md`
- Feature Design Notes: `docs/03_notebooks/02_notebook_notes/03_feature_engineering/01_feature_engineering.md`
- Project Structure: `docs/01_architecture/01_project_structure.md`


In [21]:
# import libraries

import os
import sys
import pandas as pd
import numpy as np
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import pyarrow

# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.data.eda_utils import get_prefix_counts

# configurations
sns.set_theme(style="darkgrid")
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

plt.style.use("seaborn-v0_8")

### Load Raw Data
- Load via DuckDB
- Create a connection
- Read the training and test CSVs into pandas DataFrames

In [4]:
# Create an in-memory DuckDB connection

conn = duckdb.connect()

# Read each CSV with DuckDB and materialize it as a pandas DataFrame

train_df = conn.execute("""
                        SELECT * FROM
                            read_csv_auto('../data/raw/new_competition_data/train.csv',
                            max_line_size=5000000)""").df()
test_df = conn.execute("""
                       SELECT * FROM
                           read_csv_auto('../data/raw/new_competition_data/test.csv',
                           max_line_size=5000000)""").df()

print('Train dataframe created.')
print('Test dataframe created.')

Train dataframe created.
Test dataframe created.


In [5]:
# Shape

print('Train dataframe shape:', train_df.shape)
print('Test dataframe shape:', test_df.shape)

Train dataframe shape: (2683, 68507)
Test dataframe shape: (1151, 68506)


In [6]:
# Register for SQL use
duckdb.register('train', train_df)
duckdb.register('test', test_df)

# Put the column names into a single-column pandas DataFrame
col_df = pd.DataFrame({'column_name': train_df.columns})
duckdb.register('cols', col_df)

# Extract prefixes and count them
duckdb.query("""
    SELECT
        regexp_extract(column_name, '^[^_]+') AS prefix,
        COUNT(*) AS count
    FROM cols
    GROUP BY prefix
    ORDER BY count DESC
""").df()


| | prefix | count |
|---|---|---|
| 0 | occurs | 34220 |
| 1 | happens | 34220 |
| 2 | cross | 26 |
| 3 | zone | 13 |
| 4 | trending | 10 |
| 5 | peaks | 5 |
| 6 | troughs | 5 |
| 7 | sm | 2 |
| 8 | ticker | 1 |
| 9 | t | 1 |
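
The same prefix tally can also be computed without SQL. The following is a minimal pandas sketch, assuming column names follow a `prefix_rest` pattern; the toy column names are hypothetical stand-ins for `train_df.columns`.

```python
import pandas as pd

# Hypothetical column names standing in for train_df.columns
columns = pd.Index([
    "ticker_id", "t",
    "occurs_a", "occurs_b", "occurs_c",
    "happens_a", "happens_b",
    "cross_x",
])

# Keep everything before the first underscore, mirroring the regex '^[^_]+'
prefixes = columns.to_series().str.extract(r"^([^_]+)", expand=False)

# Count columns per prefix, largest groups first
prefix_counts = (
    prefixes.value_counts()
    .rename_axis("prefix")
    .reset_index(name="count")
)
print(prefix_counts)
```

On the real frame, `columns` would simply be `train_df.columns`.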


In [7]:
# Identify Signal Columns (Once, Explicitly)

signal_cols = [
    col for col, dtype in zip(train_df.columns, train_df.dtypes)
    if dtype == "bool"
]

len(signal_cols)


68499

In [8]:
# Create Signal Population Features (DuckDB SQL)

signal_array_expr = ", ".join(f'"{col}"' for col in signal_cols)


signal_count = len(signal_cols)


In [9]:
# Create a View with Signal Count (DuckDB)

conn.execute(f"""
CREATE OR REPLACE VIEW train_signal_counts AS
SELECT
    *,
    (
        SELECT SUM(CAST(val AS INT))
        FROM UNNEST([{signal_array_expr}]) AS t(val)
    ) AS signal_count
FROM train_df
""")


<_duckdb.DuckDBPyConnection at 0x7b72ea627670>

In [10]:
# Identify the datetime column explicitly

time_cols = train_df.select_dtypes(include=["datetime64"]).columns
time_cols



Index(['t'], dtype='object')

In [11]:
assert len(time_cols) == 1, f"Expected 1 datetime column, found {len(time_cols)}"
TIME_COL = time_cols[0]
TIME_COL


't'

In [12]:
# Aggregate boolean signals (this should run fast)
train_df["signal_count"] = train_df[signal_cols].sum(axis=1)

n_signals = len(signal_cols)
train_df["signal_density"] = train_df["signal_count"] / n_signals




In [13]:
# Quick sanity checks (train only)

train_df[["signal_count", "signal_density"]].describe()


| | signal_count | signal_density |
|---|---|---|
| count | 2683.0 | 2683.0 |
| mean | 316.43347 | 0.00462 |
| std | 436.44966 | 0.006372 |
| min | 6.0 | 8.8e-05 |
| 25% | 26.0 | 0.00038 |
| 50% | 150.0 | 0.00219 |
| 75% | 472.0 | 0.006891 |
| max | 5402.0 | 0.078862 |
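
Because `signal_density` is `signal_count` divided by the number of signal columns (68499), the two columns of the summary differ only by that constant factor: for example, 316.43 / 68499 ≈ 0.00462 for the mean and 5402 / 68499 ≈ 0.0789 for the maximum. A small sketch of the same computation on a toy frame, with hypothetical column names `sig_a`, `sig_b`, `sig_c`:

```python
import pandas as pd

# Toy frame with three hypothetical boolean signal columns
toy = pd.DataFrame(
    {
        "sig_a": [True, False, False, True],
        "sig_b": [False, False, False, True],
        "sig_c": [True, False, True, True],
    }
)
signal_cols = list(toy.columns)

# Row-wise count of active signals, computed on the boolean NumPy block
toy["signal_count"] = toy[signal_cols].to_numpy(dtype=bool).sum(axis=1)

# Density is the share of signal columns that are active in each row
toy["signal_density"] = toy["signal_count"] / len(signal_cols)
print(toy[["signal_count", "signal_density"]])
```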


In [14]:
# Peek at the time column next to signal_count (train only)

train_df[[TIME_COL, "signal_count"]].head()


| | t | signal_count |
|---|---|---|
| 0 | 2024-01-10 | 718 |
| 1 | 2025-06-06 | 235 |
| 2 | 2024-07-29 | 1313 |
| 3 | 2024-11-11 | 549 |
| 4 | 2025-01-15 | 44 |


In [15]:
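# Sort chronologically within each ticker so later lag features are leakage-safe.
# Note: pandas applies `kind` only when sorting on a single column or label.
train_df = train_df.sort_values(
    by=["ticker_id", TIME_COL],
    kind="mergesort"   # stable sort algorithm
).reset_index(drop=True)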


In [16]:
train_df.shape


(2683, 68509)

In [17]:
train_df.memory_usage(deep=True).sum() / 1e6


np.float64(184.235929)

In [18]:
train_df.index.is_monotonic_increasing


True
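
The index check above is satisfied by any frame straight after `reset_index(drop=True)`, so it does not by itself confirm the time ordering. A stronger test is to check that the time column is non-decreasing inside each ticker. The sketch below assumes the `ticker_id` and `t` column names from this notebook; the helper `is_time_sorted_within_ticker` is hypothetical.

```python
import pandas as pd

# Toy data, deliberately out of order
toy = pd.DataFrame(
    {
        "ticker_id": ["B", "A", "A", "B", "A"],
        "t": pd.to_datetime(
            ["2024-01-03", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-03"]
        ),
    }
)

def is_time_sorted_within_ticker(df, group_col="ticker_id", time_col="t"):
    """Return True if time_col is non-decreasing inside every group."""
    per_group = df.groupby(group_col, sort=False)[time_col].apply(
        lambda s: s.is_monotonic_increasing
    )
    return bool(per_group.all())

unsorted_ok = is_time_sorted_within_ticker(toy)

# Same sort as in the notebook, then re-check
sorted_toy = toy.sort_values(["ticker_id", "t"]).reset_index(drop=True)
sorted_ok = is_time_sorted_within_ticker(sorted_toy)
print(unsorted_ok, sorted_ok)
```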

In [23]:
train_df.to_pickle(
    "../data/processed/train_sorted_checkpoint.pkl"
)



In [25]:
# sanity check
os.path.getsize("../data/processed/train_sorted_checkpoint.pkl") / 1e6

189.050198
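
Restarting from the checkpoint is the mirror image of the save step. A minimal round-trip sketch on a toy frame, written to a temporary directory rather than the real `../data/processed/` path:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the sorted training data
toy = pd.DataFrame(
    {
        "ticker_id": ["A", "A", "B"],
        "t": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
        "signal_count": [3, 5, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; the notebook writes to ../data/processed/
    checkpoint_path = os.path.join(tmp_dir, "train_sorted_checkpoint.pkl")
    toy.to_pickle(checkpoint_path)
    size_mb = os.path.getsize(checkpoint_path) / 1e6
    restored = pd.read_pickle(checkpoint_path)

# Raises AssertionError if values, dtypes, or index differ after the round trip
pd.testing.assert_frame_equal(toy, restored)
```

Pickle preserves dtypes exactly, which suits a private checkpoint; pickle files should only be loaded from trusted sources.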

## 🧠 Feature Engineering - Tier 1 Summary

This notebook establishes a **stable, leakage-safe Tier 1 feature baseline** for the training set under strict hardware constraints.

Key outcomes:
- The dataset's **extreme width (~68k columns)** required careful avoidance of full-row SQL window operations.
- Tier 1 features were intentionally limited to **signal population and simple lag features** to validate signal utility before adding complexity.
- **Vectorized pandas operations** were used for boolean aggregation and lagging to minimize memory pressure.
- Data was **explicitly sorted by `ticker_id` and time**, with temporal integrity verified.
- Progress was **checkpointed using a pickle artifact** to avoid repeated long-running operations.
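
As an illustration of a leakage-safe lag on the sorted frame, the sketch below shifts `signal_count` within each `ticker_id`, so a row only sees its own ticker's earlier values. This is one possible formulation, not necessarily the exact code used; the output names `signal_count_lag1` and `signal_count_diff1` are hypothetical.

```python
import pandas as pd

# Toy frame, sorted by ticker and time as in the notebook
toy = (
    pd.DataFrame(
        {
            "ticker_id": ["A", "A", "A", "B", "B"],
            "t": pd.to_datetime(
                ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
            ),
            "signal_count": [10, 20, 15, 7, 9],
        }
    )
    .sort_values(["ticker_id", "t"])
    .reset_index(drop=True)
)

# Previous row's value within the same ticker; the first row of each ticker is NaN
toy["signal_count_lag1"] = toy.groupby("ticker_id")["signal_count"].shift(1)

# Change relative to the previous observation of the same ticker
toy["signal_count_diff1"] = toy["signal_count"] - toy["signal_count_lag1"]
print(toy)
```

Shifting inside the group keeps the last row of one ticker from leaking into the first row of the next.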

Deferred by design:
- Test-set feature engineering
- Rolling window features
- Event-distance features
- Model training and evaluation
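
When rolling window features are added later, the same leakage concern applies: the window should cover only rows strictly before the current one. A possible sketch, assuming a 3-row window and a hypothetical output name `signal_count_roll3_mean`:

```python
import pandas as pd

# Toy frame, already sorted by ticker and time
toy = pd.DataFrame(
    {
        "ticker_id": ["A", "A", "A", "A", "B", "B"],
        "t": pd.to_datetime(
            [
                "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
                "2024-01-01", "2024-01-02",
            ]
        ),
        "signal_count": [10, 20, 30, 40, 5, 15],
    }
)

# shift(1) drops the current row from its own window, then the rolling mean
# averages up to 3 earlier rows of the same ticker
toy["signal_count_roll3_mean"] = toy.groupby("ticker_id")["signal_count"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
print(toy)
```

The first row of each ticker has no history, so its rolling value stays missing rather than being filled from another ticker.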

This Tier 1 checkpoint serves as a **lightweight, restart-safe foundation** for incremental feature expansion and modeling in subsequent phases.
