# 📊 Feature Engineering for Stock Clustering

**Goal**: Transform raw stock prices into meaningful risk indicators.

**Why?** Clustering algorithms need numeric features that capture different aspects of risk:
- **Volatility**: How much the price swings
- **Returns**: Profitability patterns
- **Technical Indicators**: Market sentiment signals
- **Liquidity**: How easily the stock can be traded
- **Risk-adjusted performance**: Return vs risk tradeoff

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../src')

from features import (
    calculate_returns,
    calculate_volatility_features,
    calculate_risk_metrics,
    calculate_technical_indicators,
    calculate_liquidity_features,
    calculate_momentum_features,
    calculate_drawdown,
    aggregate_stock_features
)

## 1️⃣ Load Cleaned Data

In [None]:
df = pd.read_csv('../Data/Processed/cleaned_nse.csv')
print(f"Loaded {len(df):,} records for {df['Stock_code'].nunique()} stocks")
df.head()

## 2️⃣ Feature Engineering Pipeline

We'll apply 7 transformations to create ~25 features per stock:

### A) Returns (Profitability)
Daily returns show how much the stock gained or lost each day.

In [None]:
print("Step 1/7: Calculating returns...")
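# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_returns, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)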

print("✅ Added: daily_return, log_return")
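
`calculate_returns` lives in `../src/features.py` and is not shown in this notebook. As a rough sketch of what such a function might compute, assuming a hypothetical `close` price column (the real column name and implementation may differ):

```python
import numpy as np
import pandas as pd

def sketch_returns(group: pd.DataFrame, price_col: str = "close") -> pd.DataFrame:
    """Hypothetical stand-in for calculate_returns; `close` is an assumed column name."""
    out = group.copy()
    # Simple return: (P_t - P_{t-1}) / P_{t-1}
    out["daily_return"] = out[price_col].pct_change()
    # Log return: ln(P_t / P_{t-1}); log returns add up across days
    out["log_return"] = np.log(out[price_col] / out[price_col].shift(1))
    return out

toy = pd.DataFrame({"close": [100.0, 110.0, 99.0]})
result = sketch_returns(toy)
print(result)
```

The first row has no previous price, so both return columns start with `NaN`.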

### B) Volatility (Price Swings)
Rolling standard deviation of returns (over 7, 14, and 30 days) = how unpredictable the stock is

In [None]:
print("Step 2/7: Calculating volatility...")
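# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_volatility_features, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)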

print("✅ Added: volatility_7d, volatility_14d, volatility_30d")
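
To make the idea concrete, here is a minimal sketch of rolling volatility on synthetic returns. The helper name `sketch_volatility` is hypothetical, and the actual `calculate_volatility_features` may use different windows or scaling:

```python
import numpy as np
import pandas as pd

def sketch_volatility(returns: pd.Series, windows=(7, 14, 30)) -> pd.DataFrame:
    """Rolling sample standard deviation of returns for each window length."""
    return pd.DataFrame(
        {f"volatility_{w}d": returns.rolling(window=w).std() for w in windows}
    )

# Synthetic daily returns, used only for illustration
rng = np.random.default_rng(0)
toy_returns = pd.Series(rng.normal(0.0, 0.02, size=60))
vol = sketch_volatility(toy_returns)
print(vol.tail(3))
```

Each column is `NaN` until its window has filled, so the 30-day column needs 30 observations before it produces a value.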

### C) Advanced Risk Metrics
- **Downside deviation**: Only measures bad volatility (losses)
- **Value at Risk (VaR)**: "5% chance of losing this much or more"

In [None]:
print("Step 3/7: Calculating advanced risk metrics...")
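# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_risk_metrics, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)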

print("✅ Added: downside_deviation_30d, var_95")
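
The two metrics above can be sketched on a handful of returns. This is an illustration under assumptions, not the code in `features.py`: it uses a zero return target, the historical (empirical quantile) form of VaR, and the whole sample rather than a 30-day rolling window.

```python
import numpy as np
import pandas as pd

def downside_deviation(returns: pd.Series, target: float = 0.0) -> float:
    """Root-mean-square of returns below `target`; gains contribute zero."""
    shortfall = np.minimum(returns - target, 0.0)
    return float(np.sqrt((shortfall ** 2).mean()))

def historical_var(returns: pd.Series, confidence: float = 0.95) -> float:
    """Historical VaR: the (1 - confidence) quantile of the observed returns."""
    return float(returns.quantile(1.0 - confidence))

toy_returns = pd.Series([0.02, -0.01, 0.03, -0.04, 0.01])
dd = downside_deviation(toy_returns)
var_95 = historical_var(toy_returns)
print(f"downside deviation: {dd:.4f}, 95% VaR: {var_95:.4f}")
```

A VaR of, say, -0.034 reads as "on 5% of days the return was -3.4% or worse" for this sample.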

### D) Technical Indicators
- **RSI** (0-100): <30 = oversold, >70 = overbought
- **Bollinger Bands**: Volatility envelope around price
- **MACD**: Trend momentum indicator

In [None]:
print("Step 4/7: Calculating technical indicators...")
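# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_technical_indicators, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)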

print("✅ Added: rsi, bb_width, bb_position, macd, macd_signal")
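
A sketch of how these five columns could be derived from a closing-price series. The parameters (14-day RSI, 20-day bands at 2 standard deviations, 12/26/9 MACD) are common defaults assumed here, and the RSI uses simple moving averages rather than Wilder's smoothing; `calculate_technical_indicators` may differ on any of these.

```python
import numpy as np
import pandas as pd

def sketch_indicators(close: pd.Series, rsi_n: int = 14, bb_n: int = 20) -> pd.DataFrame:
    out = pd.DataFrame(index=close.index)

    # RSI (simple-moving-average variant): 100 - 100 / (1 + avg_gain / avg_loss)
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_n).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_n).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Bollinger Bands: moving average +/- 2 rolling standard deviations
    mid = close.rolling(bb_n).mean()
    std = close.rolling(bb_n).std()
    upper, lower = mid + 2 * std, mid - 2 * std
    out["bb_width"] = (upper - lower) / mid                 # band width relative to price
    out["bb_position"] = (close - lower) / (upper - lower)  # 0 = lower band, 1 = upper band

    # MACD: 12-period EMA minus 26-period EMA, with a 9-period EMA signal line
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    out["macd"] = ema_fast - ema_slow
    out["macd_signal"] = out["macd"].ewm(span=9, adjust=False).mean()
    return out

# Synthetic random-walk prices, used only for illustration
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, 120).cumsum())
ind = sketch_indicators(close)
print(ind.tail(3).round(3))
```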

### E) Liquidity Features
Can you buy/sell easily? High volume = liquid, low volume = illiquid (risky)

In [None]:
print("Step 5/7: Calculating liquidity features...")
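# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_liquidity_features, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)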

print("✅ Added: avg_volume, volume_volatility, volume_trend, amihud_illiquidity")
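
Of these features, `amihud_illiquidity` is the least self-explanatory. In its usual form it is the average absolute daily return divided by the value traded that day, so a stock whose price moves a lot on little volume scores high. A minimal sketch, assuming price and volume series are available (the exact formula in `features.py` is not shown and may differ):

```python
import numpy as np
import pandas as pd

def amihud_illiquidity(close: pd.Series, volume: pd.Series) -> float:
    """Average absolute daily return per unit of traded value."""
    abs_return = close.pct_change().abs()
    traded_value = close * volume
    # Zero-volume days would divide by zero, so treat them as missing
    ratio = abs_return / traded_value.replace(0, np.nan)
    return float(ratio.mean())

close = pd.Series([100.0, 101.0, 100.0, 102.0])
liquid = amihud_illiquidity(close, pd.Series([1_000_000] * 4))
illiquid = amihud_illiquidity(close, pd.Series([1_000] * 4))
print(liquid < illiquid)  # True: same price moves on less volume means less liquid
```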

### F) Momentum and Trends
Is the stock going up, down, or sideways? These features compare the current price to its moving averages.

In [None]:
print("Step 6/7: Calculating momentum features...")
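# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_momentum_features, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)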

print("✅ Added: momentum_7d, momentum_30d, momentum_90d, ma_7, ma_30, ma_50, price_to_ma30, price_to_ma50")
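
One plausible way to build these eight columns is shown below. It assumes momentum means the percentage price change over the lookback window and that the moving averages are simple (unweighted); the real `calculate_momentum_features` may define them differently.

```python
import pandas as pd

def sketch_momentum(close: pd.Series) -> pd.DataFrame:
    out = pd.DataFrame(index=close.index)
    # Momentum: percentage price change over the lookback window
    for days in (7, 30, 90):
        out[f"momentum_{days}d"] = close.pct_change(periods=days)
    # Simple moving averages
    for days in (7, 30, 50):
        out[f"ma_{days}"] = close.rolling(days).mean()
    # Ratio > 1 means the price sits above its moving average (uptrend)
    out["price_to_ma30"] = close / out["ma_30"]
    out["price_to_ma50"] = close / out["ma_50"]
    return out

close = pd.Series(range(100, 200), dtype=float)  # steadily rising toy prices
mom = sketch_momentum(close)
print(mom.iloc[-1].round(3))
```

For a steadily rising series like this one, both price-to-average ratios stay above 1.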

### G) Drawdown (Crash Risk)
**Max Drawdown**: Largest peak-to-trough decline. Shows worst-case scenario.

In [None]:
print("Step 7/7: Calculating drawdown metrics...")
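# Restore Stock_code from the group index; include_groups=False alone drops it.
df = df.groupby('Stock_code', group_keys=True).apply(
    calculate_drawdown, include_groups=False
).reset_index(level='Stock_code').reset_index(drop=True)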

print("✅ Added: current_drawdown, max_drawdown, days_from_peak")
print("\n🎉 Feature engineering complete!")
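
The drawdown columns can be illustrated on five prices. This is a sketch under the assumption that drawdown is measured against the running maximum of the closing price and that `days_from_peak` counts rows (trading days) since the last new high; `calculate_drawdown` itself is not shown.

```python
import pandas as pd

def sketch_drawdown(close: pd.Series) -> pd.DataFrame:
    out = pd.DataFrame(index=close.index)
    running_peak = close.cummax()
    # Fractional drop from the highest price seen so far (always <= 0)
    out["current_drawdown"] = close / running_peak - 1.0
    # Worst drawdown observed up to each row
    out["max_drawdown"] = out["current_drawdown"].cummin()
    # Rows since the last new high: count non-peak rows within each peak's run
    at_peak = close == running_peak
    out["days_from_peak"] = (~at_peak).astype(int).groupby(at_peak.cumsum()).cumsum()
    return out

close = pd.Series([100.0, 120.0, 90.0, 108.0, 130.0])
dd = sketch_drawdown(close)
print(dd)
```

Here the price falls from a peak of 120 to 90, a 25% drawdown, and only recovers two rows later when it sets a new high at 130.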

## 3️⃣ Aggregate to Stock Level

**Problem**: We have ~1000 rows per stock (one per day)

**Solution**: Summarize each feature's daily values with **summary statistics** (mean, max, std, etc.) to get ONE row per stock

**Key aggregated features**:
- **Volatility**: mean, max
- **Returns**: mean, std, skew, kurtosis
- **Sharpe Ratio**: Return per unit of risk (CRUCIAL!)
- **Technical**: RSI mean, Bollinger width, MACD volatility
- **Liquidity**: volume, trading frequency, illiquidity
- **Risk**: max drawdown, VaR, downside deviation

In [None]:
print("Aggregating features to stock level...")

features_list = []
for stock_code, group in df.groupby('Stock_code'):
    stock_features = aggregate_stock_features(group)
    if stock_features is not None:
        features_list.append(stock_features)

df_features = pd.DataFrame(features_list)
print(f"\n✅ Created {len(df_features)} stock profiles with {len(df_features.columns)} features")
df_features.head()
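
`aggregate_stock_features` is not shown, so the following is only a sketch of the kind of per-stock summary it might return. The helper name, the zero risk-free rate, and the 252-day annualization factor are assumptions; only the `Stock_code` and `daily_return` column names come from this notebook.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor

def sketch_aggregate(group: pd.DataFrame, risk_free_daily: float = 0.0) -> dict:
    """Hypothetical stand-in for aggregate_stock_features; needs a daily_return column."""
    r = group["daily_return"].dropna()
    excess = r - risk_free_daily
    return {
        "return_mean": r.mean(),
        "return_std": r.std(),
        "return_skew": r.skew(),
        "return_kurtosis": r.kurt(),
        # Annualized Sharpe ratio: mean excess return / std, scaled by sqrt(252)
        "sharpe_ratio": excess.mean() / excess.std() * np.sqrt(TRADING_DAYS),
    }

# Two synthetic stocks: one calm with positive drift, one volatile with none
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Stock_code": ["AAA"] * 250 + ["BBB"] * 250,
    "daily_return": np.concatenate(
        [rng.normal(0.001, 0.01, 250), rng.normal(0.0, 0.03, 250)]
    ),
})
profiles = pd.DataFrame(
    [{"Stock_code": code, **sketch_aggregate(g)} for code, g in toy.groupby("Stock_code")]
)
print(profiles.round(4))
```

Each stock collapses to a single row, which is the shape a clustering algorithm expects.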

## 4️⃣ Inspect Key Features

In [None]:
print("Feature Statistics:\n")
print(df_features[[
    'volatility_mean', 'sharpe_ratio', 'max_drawdown', 
    'trading_frequency', 'rsi_mean', 'downside_deviation'
]].describe().round(4))

## 5️⃣ Save Features

In [None]:
output_path = '../Data/Processed/nse_features.csv'
df_features.to_csv(output_path, index=False)
print(f"✅ Saved features to {output_path}")

---

## 📚 Summary

**What we did**:
1. ✅ Calculated returns and volatility (basic risk)
2. ✅ Added advanced risk metrics (downside dev, VaR)
3. ✅ Computed technical indicators (RSI, Bollinger, MACD)
4. ✅ Measured liquidity (volume, illiquidity)
5. ✅ Tracked momentum and trends (MAs, price ratios)
6. ✅ Analyzed drawdowns (max loss)
7. ✅ Aggregated ~1000 daily rows → 1 stock profile

**Key insight**: Clustering works MUCH better with diverse features that capture different risk dimensions.

**Next**: Use these features for K-Means clustering! 🎯