# 🧩 **02 - Feature Engineering**

**Notebook Purpose:**
Transform insights from exploratory data analysis into *explicit, reproducible features* suitable for time-series classification. This notebook focuses on constructing lagged, rolling, and event-aware features while preserving temporal integrity and avoiding data leakage.

---

**Competition:** *Detect Reversal Points in US Equities*<br>
**Deadline:** December 31, 2025<br>
**Repository:** `Kaggle-Detect-Reversal-Points-in-US-Equities`<br>
**Author:** Brice Nelson

---

**Notebook Date Created:** 2025-12-15<br>
**Notebook Last Updated:** 2025-12-15

---

## 🧭 **Goals of This Notebook**

- Ingest validated raw and baseline datasets
- Engineer time-aware features derived from Signal Descriptor columns
- Construct lag-based and rolling window features without leakage
- Encode sparse event information relevant to reversal detection
- Maintain compatibility with baseline and advanced modeling pipelines
- Persist engineered datasets to `/data/processed/`
- Document feature rationale for downstream interpretation
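
The lag and rolling goals above hinge on one rule: a feature for a given row may only use rows that come earlier for the same ticker. A minimal sketch on a toy frame follows. The column names `ts` and `signal_count` and the window size of 2 are illustrative assumptions, not the notebook's final feature set.

```python
import pandas as pd

# Toy data; `ts` and `signal_count` are assumed names used for illustration only.
toy = pd.DataFrame({
    "ticker_id": ["A", "A", "A", "A", "B", "B", "B"],
    "ts": pd.to_datetime([
        "2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04",
        "2025-01-01", "2025-01-02", "2025-01-03",
    ]),
    "signal_count": [1, 3, 2, 5, 4, 0, 2],
})

# Lags are positional, so rows must be time-ordered within each ticker first.
toy = toy.sort_values(["ticker_id", "ts"]).reset_index(drop=True)
grouped = toy.groupby("ticker_id")["signal_count"]

# Lag-1: the previous row's value within the same ticker (NaN on a ticker's first row).
toy["signal_count_lag1"] = grouped.shift(1)

# Rolling mean over the two previous rows. Shifting before rolling keeps the
# current row out of its own window, which is what prevents leakage.
toy["signal_count_roll2_prev"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)

print(toy)
```

Grouping by `ticker_id` before shifting matters as much as the shift itself: without it, the first row of one ticker would inherit the last value of the previous ticker.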

---

## 🔗 **Context from Prior Analysis**

Feature engineering decisions in this notebook are informed by findings from the light EDA phase, including:

- Extremely wide feature space dominated by boolean Signal Descriptor columns
- Sparse, event-based target labels (`H`, `L`, `None`)
- Strong temporal ordering within each `ticker_id`
- Need for models robust to high-dimensional, sparse inputs
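
Because events are sparse and rows are ordered within each `ticker_id`, one common way to make a sparse flag more informative is to count how many rows have passed since it last fired. The sketch below is one possible encoding, not a confirmed part of this project. The `event` column is a hypothetical stand-in for a single boolean Signal Descriptor column, and the frame is assumed to be time-sorted within each ticker already.

```python
import pandas as pd

# Toy data; `event` is a hypothetical boolean signal, rows already in time order.
toy = pd.DataFrame({
    "ticker_id": ["A"] * 5 + ["B"] * 3,
    "event": [False, True, False, False, True, False, False, True],
})

# Row position within each ticker (0, 1, 2, ...), as float so NaN can be stored.
pos = toy.groupby("ticker_id").cumcount().astype("float64")

# Position of the most recent event up to and including the current row.
last_event_pos = pos.where(toy["event"]).groupby(toy["ticker_id"]).ffill()

# Shift by one row so only events strictly before the current row are counted.
prev_event_pos = last_event_pos.groupby(toy["ticker_id"]).shift(1)

# Rows elapsed since the previous event; NaN until a ticker has seen one.
toy["bars_since_event"] = pos - prev_event_pos

print(toy)
```

The same pattern should not be applied to the target labels of the row being predicted, since that would leak the answer into the features.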

Detailed exploratory analysis and visualization are deferred to
`01_eda_detailed.ipynb`.

---

## 📂 **References**

- Light EDA: `notebooks/01_eda.ipynb`
- Detailed EDA (planned): `notebooks/01_eda_detailed.ipynb`
- Project Plan: `docs/00_overview/01_reversal_points_project_plan.md`
- Feature Design Notes: `docs/03_notebooks/02_notebook_notes/03_feature_engineering/01_feature_engineering.md`
- Project Structure: `docs/01_architecture/01_project_structure.md`


In [1]:
# import libraries

import os
import sys
import pandas as pd
import numpy as np
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.data.eda_utils import get_prefix_counts

# configurations
sns.set_theme(style="darkgrid")
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

plt.style.use("seaborn-v0_8")

### Load Raw Data
- Load via DuckDB
- Create a connection
- Load the training and test datasets into dataframes

In [None]:
# Create an in-memory DuckDB connection

conn = duckdb.connect()

# Read the train and test CSVs through DuckDB into pandas DataFrames

train_df = conn.execute("""
                        SELECT * FROM
                            read_csv_auto('../data/raw/new_competition_data/train.csv',
                            max_line_size=5000000)""").df()
test_df = conn.execute("""
                       SELECT * FROM
                           read_csv_auto('../data/raw/new_competition_data/test.csv',
                           max_line_size=5000000)""").df()

print('Train dataframe created.')
print('Test dataframe created.')

In [None]:
# Shape

print('Train dataframe shape:', train_df.shape)
print('Test dataframe shape:', test_df.shape)

In [None]:
# Register the DataFrames on the same connection used for loading
conn.register('train', train_df)
conn.register('test', test_df)

# Put the column names in a one-column DataFrame so SQL can query them
col_df = pd.DataFrame({'column_name': train_df.columns})
conn.register('cols', col_df)

# Extract each column's prefix (text before the first underscore) and count them
conn.execute("""
    SELECT
        regexp_extract(column_name, '^[^_]+') AS prefix,
        COUNT(*) AS count
    FROM cols
    GROUP BY prefix
    ORDER BY count DESC
""").df()


In [None]:
# Identify signal columns once, explicitly: every bool-typed column

signal_cols = train_df.select_dtypes(include="bool").columns.tolist()

len(signal_cols)


In [None]:
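# Build the inputs for the signal population features (DuckDB SQL)

# Comma-separated, double-quoted column list for use inside the SQL text
signal_array_expr = ", ".join(f'"{col}"' for col in signal_cols)

# Number of signal columns
signal_count = len(signal_cols)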


In [None]:
# Create a View with Signal Count (DuckDB)

conn.execute(f"""
CREATE OR REPLACE VIEW train_signal_counts AS
SELECT
    *,
    (
        SELECT SUM(CAST(val AS INT))
        FROM UNNEST([{signal_array_expr}]) AS t(val)
    ) AS signal_count
FROM train_df
""")


In [None]:
# Identify the datetime column explicitly

time_cols = train_df.select_dtypes(include=["datetime64"]).columns
time_cols



In [None]:
assert len(time_cols) == 1, f"Expected 1 datetime column, found {len(time_cols)}"
TIME_COL = time_cols[0]
TIME_COL


In [None]:
# Aggregate boolean signals (this should run fast)
train_df["signal_count"] = train_df[signal_cols].sum(axis=1)

n_signals = len(signal_cols)
train_df["signal_density"] = train_df["signal_count"] / n_signals




In [None]:
# Quick sanity checks (train only)

train_df[["signal_count", "signal_density"]].describe()


In [None]:
# Peek at the timestamp alongside the signal count (train only)

train_df[[TIME_COL, "signal_count"]].head()


In [None]:
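# Order rows by ticker, then time, so later lag/rolling features only look backward.
# Note: pandas documents `kind` as applying only to single-column sorts,
# so it may have no effect with two sort keys.
train_df = train_df.sort_values(
    by=["ticker_id", TIME_COL],
    kind="mergesort"
).reset_index(drop=True)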
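
Every positional lag or rolling feature built later assumes this ordering holds, so a cheap check after sorting can catch mistakes early. The helper below, `is_time_sorted_within_groups`, is a hypothetical name and not part of the project's `src` package. It is a minimal sketch that assumes the group column contains no missing values.

```python
import pandas as pd


def is_time_sorted_within_groups(df: pd.DataFrame, group_col: str, time_col: str) -> bool:
    """Return True if each group is one contiguous block with non-decreasing times."""
    prev_group = df[group_col].shift(1)
    same_group = df[group_col].eq(prev_group)

    # A negative time step between two adjacent rows of the same group breaks the order.
    went_backward = same_group & (df[time_col].diff() < pd.Timedelta(0))

    # Each group should start exactly once; more starts means it is split into pieces.
    n_blocks = int(df[group_col].ne(prev_group).sum())
    contiguous = n_blocks == df[group_col].nunique()

    return bool(contiguous and not went_backward.any())


# Toy check; `ts` is an assumed column name used for illustration only.
toy = pd.DataFrame({
    "ticker_id": ["A", "A", "B", "B"],
    "ts": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-01", "2025-01-02"]),
})
sorted_ok = is_time_sorted_within_groups(toy, "ticker_id", "ts")
reversed_ok = is_time_sorted_within_groups(toy.iloc[::-1], "ticker_id", "ts")
print(sorted_ok, reversed_ok)
```

In this notebook the equivalent call would be `is_time_sorted_within_groups(train_df, "ticker_id", TIME_COL)`, wrapped in an `assert` if a hard failure is preferred.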
