# Notebook 02: Feature Engineering & Profiling

## Table of Contents
1. [Setup and Data Load](#setup-and-data-load)
2. [Granularity Transformation (Daily Aggregation)](#granularity-transformation)
    * **Base Metrics:** Net PnL, Volume, Trade Frequency, Win Rate, Avg Leverage, Aggression (Taker %), MAE.
3. [Advanced Metric Calculation](#advanced-metric-calculation)
    * **Efficiency:** Profit Factor, PnL Efficiency, Avg Trade Size.
    * **Risk:** Risk-Adjusted Return, ROE, Avg Loss Severity.
4. [Behavioral Features](#behavioral-features)
    * Long/Short Ratio
    * Panic Factor (Leveraged Loss Proxy)
    * Risk-Sentiment Interaction
5. [Leakage Prevention (Lagging)](#leakage-prevention)
    * Creating $t-1$ features for ML.
6. [Export](#export)
7. [Documentation](#documentation)
    * Engineering Log
    * Metric Dictionary

### 1. Setup and Data Load <a id="setup-and-data-load"></a>

In [8]:
import pandas as pd
import numpy as np
import os

PROCESSED_DATA_PATH = '../data/processed'
OUTPUT_TABLES_PATH = '../outputs/tables'
os.makedirs(OUTPUT_TABLES_PATH, exist_ok=True)

df = pd.read_csv(os.path.join(PROCESSED_DATA_PATH, '01_merged_data.csv'))
df['date'] = pd.to_datetime(df['date'])

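# Per-trade helper columns consumed by the daily aggregation below.
df['is_win'] = (df['Closed PnL'] > 0).astype(int)
df['gross_win'] = df['Closed PnL'].apply(lambda x: x if x > 0 else 0)        # winning PnL only
df['gross_loss'] = df['Closed PnL'].apply(lambda x: abs(x) if x < 0 else 0)  # losing PnL as a positive number
df['is_long'] = df['Side'].apply(lambda x: 1 if str(x).upper() in ['BUY', 'LONG'] else 0)
df['is_short'] = df['Side'].apply(lambda x: 1 if str(x).upper() in ['SELL', 'SHORT'] else 0)
# 'Start Position' serves as the margin base for ROE; treat it as a proxy rather than actual margin.
df['margin_used'] = df['Start Position']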
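
The row-wise `apply` calls above can probably be replaced with vectorized operations, which tend to be faster on large trade logs. The sketch below uses a made-up `toy` frame (the values are hypothetical; only the column names come from this notebook) and is meant to reproduce the same flags, not to replace the cell above.

```python
import pandas as pd

# Hypothetical trade rows; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "Closed PnL": [120.0, -45.5, 0.0, 10.0],
    "Side": ["BUY", "sell", "Long", None],
})

pnl = toy["Closed PnL"]
side = toy["Side"].astype(str).str.upper()

# where(cond, 0.0) keeps the value where cond holds and writes 0.0 elsewhere.
toy["gross_win"] = pnl.where(pnl > 0, 0.0)
toy["gross_loss"] = (-pnl).where(pnl < 0, 0.0)
toy["is_win"] = (pnl > 0).astype(int)

# isin() replaces the per-row membership test; unknown or missing sides map to 0.
toy["is_long"] = side.isin(["BUY", "LONG"]).astype(int)
toy["is_short"] = side.isin(["SELL", "SHORT"]).astype(int)

print(toy)
```

A breakeven trade (PnL of exactly 0) counts as neither a win nor a loss under both versions.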

### 2. Granularity Transformation (Daily Aggregation) <a id="granularity-transformation"></a>

In [9]:
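# One row per (date, Account): the "trader-day" profile.
daily_stats = df.groupby(['date', 'Account']).agg({
    'Closed PnL': ['sum', 'std', 'min'],  # net PnL, PnL dispersion, worst single trade
    'gross_win': 'sum',
    'gross_loss': 'sum',
    'Size USD': 'sum',                    # traded volume
    'Trade ID': 'count',                  # trades per day
    'is_win': 'mean',                     # share of winning trades
    'leverage_capped': 'mean',
    'Crossed': 'mean',                    # share of taker orders
    'is_long': 'sum',
    'is_short': 'sum',
    'margin_used': 'sum',
    'sentiment_score': 'first',           # assumed identical for all rows of a day
    'sentiment_class': 'first'
}).reset_index()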

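# Flatten the two-level column index: ('Closed PnL', 'sum') -> 'Closed PnL_sum'.
# The group keys have an empty second level, so they stay 'date' and 'Account'.
daily_stats.columns = ['_'.join(col).strip() if col[1] else col[0] for col in daily_stats.columns.values]

daily_stats.rename(columns={
    'Closed PnL_sum': 'net_pnl', # Daily PnL per Trader
    'Closed PnL_std': 'pnl_std',
    'Closed PnL_min': 'max_adverse_excursion', # Worst single closed trade of the day (MAE proxy)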
    'gross_win_sum': 'total_gross_win',
    'gross_loss_sum': 'total_gross_loss',
    'Size USD_sum': 'total_volume',
    'Trade ID_count': 'trade_frequency', # Trades per Day
    'is_win_mean': 'win_rate', # Win Rate per Account
    'leverage_capped_mean': 'avg_leverage', # Leverage Distribution
    'Crossed_mean': 'aggression_score', # Aggression (Taker vs Maker)
    'sentiment_score_first': 'sentiment_score',
    'sentiment_class_first': 'sentiment_class'
}, inplace=True)
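
To make the column flattening easier to follow, here is a small sketch on a hypothetical three-trade frame. It shows what the two-level column index looks like after `agg` and `reset_index`, and how the list comprehension turns it into flat names. The `toy` data is invented for illustration.

```python
import pandas as pd

# Hypothetical trades: two for account 0xA, one for 0xB, all on the same day.
toy = pd.DataFrame({
    "date": ["2024-10-27", "2024-10-27", "2024-10-27"],
    "Account": ["0xA", "0xA", "0xB"],
    "Closed PnL": [10.0, -4.0, 2.5],
    "is_win": [1, 0, 1],
})

agg = toy.groupby(["date", "Account"]).agg({
    "Closed PnL": ["sum", "min"],
    "is_win": "mean",
}).reset_index()

# Before flattening, each column is a (name, aggregation) tuple.
print(agg.columns.tolist())

# Join the two levels with "_"; keep the bare name when the second level is empty.
flat = ["_".join(col).strip() if col[1] else col[0] for col in agg.columns.values]
agg.columns = flat
print(flat)
```

Columns that are not listed in the `rename` mapping (for example `is_long_sum`, `is_short_sum`, and `margin_used_sum`) keep these flattened names, which is why later cells refer to them with the `_sum` suffix.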

### 3. Advanced Metric Calculation <a id="advanced-metric-calculation"></a>

In [10]:
daily_stats['profit_factor'] = np.where(
    daily_stats['total_gross_loss'] > 0,
    daily_stats['total_gross_win'] / daily_stats['total_gross_loss'],
    np.where(daily_stats['total_gross_win'] > 0, 100.0, 0.0)
) # Profit Factor

daily_stats['roe'] = daily_stats['net_pnl'] / daily_stats['margin_used_sum'].replace(0, np.nan) # ROE

daily_stats['risk_adjusted_return'] = daily_stats['net_pnl'] / daily_stats['pnl_std'].replace(0, np.nan) # Risk Adj Return

daily_stats['avg_trade_size'] = daily_stats['total_volume'] / daily_stats['trade_frequency'].replace(0, 1) # Avg Trade Size

daily_stats['pnl_efficiency'] = daily_stats['net_pnl'] / daily_stats['total_volume'].replace(0, np.nan) # PnL Efficiency

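# win_rate treats only PnL > 0 as a win, so this estimate also counts breakeven (PnL == 0) trades.
est_losing_trades = (daily_stats['trade_frequency'] * (1 - daily_stats['win_rate'])).round()
daily_stats['avg_loss_severity'] = daily_stats['total_gross_loss'] / est_losing_trades.replace(0, 1) # Loss Severity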
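
The ratio metrics above handle zero denominators in two different ways: Profit Factor falls back to a sentinel value, while ROE and Risk-Adjusted Return become `NaN`. The sketch below runs both conventions on three hypothetical trader-days (a normal day, a day with wins and no losses, and a day with no PnL at all) so the edge cases are visible. The numbers are invented.

```python
import numpy as np
import pandas as pd

# Three hypothetical trader-days covering the edge cases.
toy = pd.DataFrame({
    "total_gross_win":  [300.0, 120.0, 0.0],
    "total_gross_loss": [100.0, 0.0, 0.0],
    "net_pnl":          [200.0, 120.0, 0.0],
    "pnl_std":          [50.0, 0.0, np.nan],  # NaN: only one trade that day
})

# Sentinel convention: wins with no losses -> 100.0, no wins and no losses -> 0.0.
toy["profit_factor"] = np.where(
    toy["total_gross_loss"] > 0,
    toy["total_gross_win"] / toy["total_gross_loss"],
    np.where(toy["total_gross_win"] > 0, 100.0, 0.0),
)

# NaN convention: a zero (or missing) denominator leaves the metric undefined.
toy["risk_adjusted_return"] = toy["net_pnl"] / toy["pnl_std"].replace(0, np.nan)

print(toy[["profit_factor", "risk_adjusted_return"]])
```

The sentinel of 100.0 is far above typical Profit Factor values, so downstream models or summary statistics may need clipping or a separate "no losses" flag.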

### 4. Behavioral Features <a id="behavioral-features"></a>

In [11]:
daily_stats['long_short_ratio'] = daily_stats['is_long_sum'] / daily_stats['is_short_sum'].replace(0, 1) # Long/Short Ratio

daily_stats['panic_score'] = daily_stats['total_gross_loss'] * daily_stats['avg_leverage'] # Panic Proxy

daily_stats['risk_sentiment_interaction'] = daily_stats['avg_leverage'] * daily_stats['sentiment_score']

### 5. Leakage Prevention (Lagging) <a id="leakage-prevention"></a>

In [12]:
daily_stats = daily_stats.sort_values(['Account', 'date'])

features_to_lag = ['net_pnl', 'total_volume', 'sentiment_score', 'win_rate', 'avg_leverage']

for col in features_to_lag:
    daily_stats[f'prev_{col}'] = daily_stats.groupby('Account')[col].shift(1)

next_day_pnl = daily_stats.groupby('Account')['net_pnl'].shift(-1)
# Keep the label missing on each account's last day: `NaN > 0` is False and would otherwise become 0.0.
daily_stats['target_profitable_next_day'] = (next_day_pnl > 0).astype(float).where(next_day_pnl.notna())
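
A toy sketch of per-account lagging may help when checking this step. It uses invented PnL values for two hypothetical accounts, builds one `prev_` feature and the next-day label, and leaves the label missing on each account's final row because no next day exists for it. The `days_since_prev` column is an addition for illustration: `shift` moves by rows, so with sparse trading days the "previous" row can be months old.

```python
import pandas as pd

# Hypothetical trader-days for two accounts; dates are deliberately sparse.
toy = pd.DataFrame({
    "Account": ["0xA", "0xA", "0xA", "0xB", "0xB"],
    "date": pd.to_datetime([
        "2024-10-27", "2025-02-19", "2025-06-15", "2024-10-27", "2025-02-19",
    ]),
    "net_pnl": [100.0, -50.0, 20.0, -5.0, 8.0],
}).sort_values(["Account", "date"])

# Previous observed day per account; the first row of each account is NaN.
toy["prev_net_pnl"] = toy.groupby("Account")["net_pnl"].shift(1)

# Calendar gap to the previous observation, to see how stale the lag is.
toy["days_since_prev"] = toy.groupby("Account")["date"].diff().dt.days

# Next-day label; masked to NaN where there is no next observation.
next_pnl = toy.groupby("Account")["net_pnl"].shift(-1)
toy["target_profitable_next_day"] = (next_pnl > 0).astype(float).where(next_pnl.notna())

print(toy)
```

Rows with a missing label can then be dropped before training, for example with `dropna(subset=['target_profitable_next_day'])`.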

### 6. Export <a id="export"></a>

In [13]:
output_path = os.path.join(PROCESSED_DATA_PATH, '02_engineered_features.csv')
daily_stats.to_csv(output_path, index=False)

print(f"Feature Engineering Complete. Shape: {daily_stats.shape}")
print(daily_stats.head())

Feature Engineering Complete. Shape: (102, 32)
                        date                                     Account  \
16 2024-10-27 00:00:00+00:00  0x083384f897ee0f19899168e3b1bec365f52a9012   
45 2025-02-19 00:00:00+00:00  0x083384f897ee0f19899168e3b1bec365f52a9012   
17 2024-10-27 00:00:00+00:00  0x23e7a7f8d14b550961925fbfdaa92f5d195ba5bd   
46 2025-02-19 00:00:00+00:00  0x23e7a7f8d14b550961925fbfdaa92f5d195ba5bd   
77 2025-06-15 00:00:00+00:00  0x23e7a7f8d14b550961925fbfdaa92f5d195ba5bd   

         net_pnl      pnl_std  max_adverse_excursion  total_gross_win  \
16 -3.275059e+05  5734.473080         -117990.104100     1.603100e+03   
45  1.927736e+06  4509.256818          -19841.240140     2.030102e+06   
17  2.060745e+04   214.005887               0.000000     2.060745e+04   
46  1.709873e+04   176.854670           -6820.769550     5.046399e+04   
77  1.017915e+04    36.793048           -1162.026288     1.992737e+04   

    total_gross_loss  total_volume  trade_frequency  win_

### 7. Documentation <a id="documentation"></a>

#### Table 1: Engineering Reasoning Log

| **Phase** | **Action Taken** | **Engineering Justification** |
| --- | --- | --- |
| **Resampling** | Grouped trade-level data by `date` and `Account`. | Aligns high-frequency trade data with the low-frequency (daily) Sentiment Index so the two can be correlated. |
| **Aggregation** | Calculated sums (Volume, PnL) and means (rates). | Compresses thousands of rows into a single "Trader-Day" profile while preserving performance signals. |
| **Win/Loss Logic** | Separated `Gross Win` and `Gross Loss`. | Essential for Profit Factor; Net PnL hides the "churn" (high wins + high losses). |
| **Lagging** | Shifted key features by one row per account ($t-1$, the previous observed trading day). | **Crucial:** Prevents data leakage. Models must predict "Tomorrow" using only data available by the end of "Today". |

#### Table 2: Metric Dictionary

| **Metric Name** | **Definition** | **Purpose / Insight** |
| --- | --- | --- |
| **Profit Factor** | $\frac{\text{Gross Win}}{\text{Gross Loss}}$ | Common gauge of sustainability. PF > 1 means gross wins exceed gross losses; PF > 1.5 is often read as a sign of a skilled strategy. Days with wins and no losses are set to 100. |
| **Aggression Score** | $\%$ of Taker (Market) Orders. | Identifies "Impulsive" vs "Patient" traders. A high score may indicate FOMO-driven entries. |
| **Risk Adj Return** | $\frac{\text{Net PnL}}{\text{StdDev PnL}}$ | Measures efficiency. How much risk did you take to make that dollar? |
| **PnL Efficiency** | $\frac{\text{Net PnL}}{\text{Volume}}$ | Measures edge per dollar traded. High volume + Low Efficiency = Churning/Rebating. |
| **Panic Proxy** | $\text{Gross Loss} \times \text{Leverage}$ | Identifies high-risk behavior: Losing lots of money while highly leveraged. |
| **Loss Severity** | Average \$ lost per losing trade. | Checks if a trader cuts losses early or lets them bleed (Risk Management check). |
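
As a worked example of the definitions in Table 2, the sketch below computes each metric for a single hypothetical trader-day with four trades. All values are invented, and Loss Severity here divides by the actual count of losing trades, which can differ from the notebook's estimate when breakeven trades are present.

```python
import pandas as pd

# One hypothetical trader-day: four trades with invented values.
trades = pd.DataFrame({
    "Closed PnL": [50.0, -20.0, 30.0, -10.0],
    "Size USD": [1000.0, 500.0, 800.0, 700.0],
    "Crossed": [True, False, True, True],      # True = taker order
    "leverage_capped": [5.0, 5.0, 5.0, 5.0],
})

pnl = trades["Closed PnL"]
gross_win = pnl.clip(lower=0).sum()      # 50 + 30 = 80
gross_loss = (-pnl).clip(lower=0).sum()  # 20 + 10 = 30
net_pnl = pnl.sum()                      # 80 - 30 = 50

metrics = {
    "profit_factor": gross_win / gross_loss,
    "aggression_score": trades["Crossed"].mean(),
    "risk_adjusted_return": net_pnl / pnl.std(),  # sample std (ddof=1), as in pandas agg
    "pnl_efficiency": net_pnl / trades["Size USD"].sum(),
    "panic_score": gross_loss * trades["leverage_capped"].mean(),
    "avg_loss_severity": gross_loss / (pnl < 0).sum(),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```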