# Project 101 - Feature Engineering

This notebook focuses on transforming the cleaned dataset into meaningful
features that can be used for machine learning models.  
The goal is to capture patterns, trends, and temporal behavior in the data
while keeping the features interpretable.

---


In [1]:
# ==============================
# Environment setup & imports
# ==============================

import pandas as pd
import numpy as np


## 1. Load Processed Data

In [2]:
# Load the processed dataset generated from the EDA step
# This ensures a clean separation between data preparation and feature creation

processed_path = "../data/processed/stock_data_clean.csv"
df = pd.read_csv(processed_path, parse_dates=["Date"])

# Sort chronologically (on real datetimes, not strings) so that
# shift() and rolling() below operate in time order
df = df.sort_values("Date").reset_index(drop=True)

df.head()

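|   | Date       | Price | Open | High | Low  | Vol.      | Change % |
|---|------------|-------|------|------|------|-----------|----------|
| 0 | 2000-01-03 | 6.67  | 6.67 | 6.67 | 6.67 | 4430000.0 | 0.0      |
| 1 | 2000-01-04 | 6.67  | 6.67 | 6.67 | 6.67 | 4430000.0 | 0.0      |
| 2 | 2000-01-08 | 6.67  | 6.67 | 6.67 | 6.67 | 4430000.0 | 0.0      |
| 3 | 2000-01-09 | 6.67  | 6.67 | 6.67 | 6.67 | 4430000.0 | 0.0      |
| 4 | 2000-01-10 | 6.67  | 6.67 | 6.67 | 6.67 | 4430000.0 | 0.0      |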


## 2. Feature Engineering Strategy

The feature engineering process in this project is intentionally simple
and interpretable.

The goal is not to maximize model complexity, but to create features that:
- Reflect real market behavior
- Are easy to explain to non-technical stakeholders
- Can serve as a strong and transparent baseline

The features are grouped into the following categories:

- **Temporal features**: Capture short-term momentum and recent price behavior.
- **Lagged features**: Allow the model to learn from historical patterns.
- **Statistical features**: Summarize trends and volatility without relying on complex indicators.
- **Volume-based features**: Represent changes in market participation.

This structured approach ensures that each feature has a clear meaning
and supports the explainability goals of the project.

## 3. Temporal & Lag Features

In [3]:
# Daily returns capture short-term price momentum
df["return_1d"] = df["Price"].pct_change()
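
As a quick check of what `pct_change()` computes, the sketch below uses a made-up price series (the values are illustrative, not taken from the dataset) and compares the result with the explicit formula `P_t / P_{t-1} - 1`. The first return is `NaN` because the first row has no previous price.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only to illustrate the calculation
prices = pd.Series([100.0, 102.0, 99.96, 99.96])

# pct_change() computes P_t / P_{t-1} - 1 for each row
returns = prices.pct_change()

# The same calculation written out explicitly
manual = prices / prices.shift(1) - 1

print(returns.round(4).tolist())
print(np.allclose(returns.iloc[1:], manual.iloc[1:]))
```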

In [4]:
# Lag features allow the model to learn from recent historical behavior
for lag in [1, 3, 5]:
    df[f"return_lag_{lag}"] = df["return_1d"].shift(lag)
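
To make the alignment of the lag columns concrete, here is a small sketch on hypothetical returns (the numbers are invented for illustration). After `shift(lag)`, row `t` holds the return from `lag` rows earlier, and the first `lag` rows become `NaN`.

```python
import pandas as pd

# Hypothetical daily returns, for illustration only
demo = pd.DataFrame({"return_1d": [0.01, -0.02, 0.03, 0.00, 0.05, -0.01]})

for lag in [1, 3, 5]:
    # shift(lag) moves values down by `lag` rows,
    # so row t holds the return observed at row t - lag
    demo[f"return_lag_{lag}"] = demo["return_1d"].shift(lag)

print(demo)
```

Because each lag column only looks backward, it uses information that was already available on the day of the row.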

## 4. Statistical / Technical Features

In [5]:
# Moving averages smooth price action and highlight trend direction
for window in [5, 10, 20]:
    df[f"ma_{window}"] = df["Price"].rolling(window).mean()
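
The sketch below illustrates the warm-up behavior of `rolling(window).mean()` on made-up prices: the first `window - 1` rows are `NaN` because the window is not yet full. The `min_periods=1` variant is shown only as a possible alternative; it is an assumption for illustration and is not used in this notebook.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Default behavior: a value is produced only once 3 observations are available
ma_3 = prices.rolling(3).mean()

# Alternative (not used in the notebook): average over whatever is available so far
ma_3_partial = prices.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"price": prices, "ma_3": ma_3, "ma_3_partial": ma_3_partial}))
```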

In [6]:
# Volatility reflects uncertainty and market risk
df["volatility_5"] = df["return_1d"].rolling(5).std()
df["volatility_10"] = df["return_1d"].rolling(10).std()
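
For reference, `rolling(n).std()` in pandas computes the sample standard deviation (`ddof=1`) over the last `n` returns. The sketch below checks this on hypothetical returns by comparing the first full window against NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns, for illustration only
returns = pd.Series([0.01, -0.02, 0.03, 0.00, 0.05, -0.01])

# Rolling 5-day volatility; the first 4 rows are NaN (window not yet full)
vol_5 = returns.rolling(5).std()

# Sample standard deviation of the first 5 returns, computed directly
manual = np.std(returns.iloc[0:5].to_numpy(), ddof=1)

print(vol_5.iloc[4], manual)
```

In the notebook itself the first row of `return_1d` is already `NaN`, so the first valid `volatility_5` value appears one row later than in this sketch.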

In [7]:
# Volume change captures shifts in market participation
# Note: pct_change() yields inf when the previous day's volume is 0
df["volume_change"] = df["Vol."].pct_change()
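
One edge case worth checking: if the previous day's volume is zero, `pct_change()` produces `inf`, and `dropna()` does not remove infinite values. The sketch below reproduces this on made-up volumes and shows one possible way to handle it. The replacement step is an assumption for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical volumes, including a zero-volume day
volume = pd.Series([1000.0, 0.0, 500.0, 750.0])

# 1000 -> 0 gives -1.0, 0 -> 500 gives inf, 500 -> 750 gives 0.5
change = volume.pct_change()

# Possible handling: turn infinities into NaN so a later dropna() removes them
cleaned = change.replace([np.inf, -np.inf], np.nan)

print(pd.DataFrame({"volume": volume, "change": change, "cleaned": cleaned}))
```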

## 5. Final Feature Dataset

In [8]:
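# Binary target: 1 if the next day's close (the Price column) is higher, else 0.
# Unchanged prices count as 0. The final row has no next-day price, so its
# comparison evaluates to False and it is also labeled 0.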
df["target"] = (df["Price"].shift(-1) > df["Price"]).astype(int)
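
Two details of this target definition can be seen on a small made-up series: days where the price is unchanged get label 0, and the last row gets label 0 even though its next-day price is unknown. The masked variant below is a hypothetical alternative that leaves the last label undefined, so that a later `dropna()` would remove the row. It is a sketch, not the notebook's method.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([10.0, 11.0, 10.5, 10.5, 12.0])
next_price = prices.shift(-1)

# As in the notebook: the last comparison is NaN > 12.0, which is False
target = (next_price > prices).astype(int)

# Hypothetical alternative: keep the label as NaN where no next-day price exists
target_masked = (next_price > prices).astype("float").where(next_price.notna())

print(pd.DataFrame({"price": prices, "target": target, "target_masked": target_masked}))
```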

In [9]:
# Drop rows with missing values created by rolling windows and lags
df_features = df.dropna().reset_index(drop=True)

df_features.shape


(4981, 18)
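
To see which features determine how many rows `dropna()` removes, the sketch below rebuilds the same feature columns on synthetic data (random prices and volumes standing in for the real file) and counts the missing values per column. On data with no other gaps, the longest warm-up period comes from the 20-day moving average.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are random, not real prices)
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Price": 100 + rng.normal(0, 1, n).cumsum(),
    "Vol.": rng.integers(1_000, 5_000, n).astype(float),
})

# Same feature definitions as in the notebook
demo["return_1d"] = demo["Price"].pct_change()
for lag in [1, 3, 5]:
    demo[f"return_lag_{lag}"] = demo["return_1d"].shift(lag)
for window in [5, 10, 20]:
    demo[f"ma_{window}"] = demo["Price"].rolling(window).mean()
demo["volatility_5"] = demo["return_1d"].rolling(5).std()
demo["volatility_10"] = demo["return_1d"].rolling(10).std()
demo["volume_change"] = demo["Vol."].pct_change()

# Missing values per column, and total rows lost to dropna()
nan_counts = demo.isna().sum().sort_values(ascending=False)
rows_dropped = len(demo) - len(demo.dropna())

print(nan_counts.head())
print("rows dropped:", rows_dropped)
```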

## 6. Save Features for Modeling

This dataset represents the first stable feature set for modeling.
It is intentionally simple, interpretable, and designed to serve as a baseline.

In [10]:
import os

features_path = "../data/processed/stock_features_v1.csv"

# Create the target directory if it does not exist yet
os.makedirs(os.path.dirname(features_path), exist_ok=True)

df_features.to_csv(features_path, index=False)
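
A round-trip check can confirm that the saved file reloads with the expected columns. The sketch below uses a tiny made-up frame and a temporary directory (both hypothetical, so the real file is untouched). Writing with `index=False` means no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mini feature table, for illustration only
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "return_1d": [0.01, -0.02],
    "target": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features_demo.csv")
    demo.to_csv(path, index=False)

    # Parse dates on reload, since CSV stores them as plain text
    reloaded = pd.read_csv(path, parse_dates=["Date"])

print(reloaded.dtypes)
```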


## Next Steps
- Validate feature distributions
- Prepare a chronological train/test split (no shuffling, to avoid look-ahead leakage)
- Build baseline machine learning models
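
As a starting point for the split step, here is a minimal sketch of a time-ordered split on made-up data. The 80/20 ratio and the column values are assumptions for illustration. The idea is that every training row precedes every test row.

```python
import pandas as pd

# Hypothetical feature table sorted by date, for illustration only
demo = pd.DataFrame({
    "Date": pd.date_range("2021-03-01", periods=10, freq="D"),
    "target": [0, 1] * 5,
})

# Assumed 80/20 split by position; rows are not shuffled
split_idx = int(len(demo) * 0.8)
train = demo.iloc[:split_idx]
test = demo.iloc[split_idx:]

print(len(train), len(test))
print(train["Date"].max() < test["Date"].min())
```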