# **AGENT EVALUATION**

## **Part 1: Logging**

The `Logger` class provides a streamlined way to record agent information, market states, actions, rewards, and any other relevant data. This data is output into a CSV file.

#### **What does the Logger do?**
- Opens a CSV file at the specified path
- Uses the provided columns, or infers them from the first logging call
- Appends one row per `log_step()` call, recording necessary data
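
The three behaviors above can be sketched with the standard-library `csv` module. This is a minimal illustration, not the project's actual `Logger`: the class name `SketchLogger` and its internals are assumptions made for the example.

```python
import csv
import os


class SketchLogger:
    """Illustrative stand-in for the project's Logger (hypothetical, not its real code)."""

    def __init__(self, path, headers=None):
        # Create the parent directory if needed, then open the CSV for writing.
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        self._file = open(path, "w", newline="")
        self._headers = list(headers) if headers else None
        self._writer = None

    def log_step(self, **fields):
        # On the first call, infer the columns from the keyword names if none were given.
        if self._writer is None:
            if self._headers is None:
                self._headers = list(fields)
            self._writer = csv.DictWriter(self._file, fieldnames=self._headers)
            self._writer.writeheader()
        # DictWriter raises ValueError for keys that are not in the header.
        self._writer.writerow(fields)

    def close(self):
        # Closing the file flushes any buffered rows to disk.
        self._file.close()
```

Writing the header lazily on the first `log_step()` call is what makes column inference possible; a real implementation might also flush periodically so that a crashed run still leaves a usable log.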

#### **Why a Class?**
- Reusability: Having the `Logger` class allows us to instantiate multiple independent Loggers, which can give us separate logs for training, validation, and backtesting.
- Extensibility: Future features, such as JSON output, can be added by extending the `Logger` class.

#### **How to Use**
1. Import the `Logger` class and create an instance before you start collecting data to log:
  ```python
  from evaluation.logger import Logger

  headers = [
      "episode_id", "step_idx", "timestamp", "mid_price",
      "action", "reward", "pnl", "inventory"
  ]
  logger = Logger("logs/agent_run.csv", headers=headers)
  ```
2. Call `log_step()` at each step with arguments matching the headers:
  ```python
  for episode in range(num_episodes):
      for step in range(max_steps):
          # Run agent/environment update...
          logger.log_step(
              episode_id=episode,
              step_idx=step,
              timestamp=current_time,
              mid_price=env.mid_price,
              action=agent_action,
              reward=step_reward,
              pnl=agent.cumulative_pnl,
              inventory=agent.position,
          )
  ```
3. Close the Logger to finish writing to the output file:
  ```python
  logger.close()
  ```

## **Part 2: Clustering and Agent Evaluation**

In complex trading environments, an agent can exhibit a variety of behaviors -- sometimes switching between trend-following, mean-reversion, market-making or other unseen tactics as market conditions evolve. We use clustering and other data-driven evaluation methods to help visualize and understand the agent's actions and what strategies, existing or novel, the agent uses.

### **cluster_strategy_rediscovery**

This function uses clustering to automatically rediscover any strategies that the agent may have used. Instead of assuming the agent only knows certain strategies by pre-defining them, we can discover which tactics the policy uses by visualizing the data. This is crucial for validating that the model has learned distinct behaviors and for monitoring how these behaviors emerge during training.

#### **Usage**

```python
from evaluation.cluster import cluster_strategy_rediscovery

labels = cluster_strategy_rediscovery(df,
    pca_components=3,
    min_samples=10,
    min_cluster_size=50
)
df["strategy_cluster"] = labels
```

#### **How does this function work?**
- Reduces high-dimensional step-by-step logs into low-dimensional embedding using PCA.
- Runs HDBSCAN over the PCA embedding to form clusters, each of which is expected to correspond to an existing strategy.

#### **What features are used?**
- Normalized step index
- Inventory
- Market signals: spread, volume imbalance, price velocity
- Engineered features:
  - Price action correlation
  - Directional correctness
  - Churn - trades per step
  - 5-step smoothed velocity
  - Step-by-step inventory change
- Agent performance: reward, PnL
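
A few of the simpler features in this list could be derived from the log as sketched below. The exact definitions the function uses are not documented here, so the formulas, the output column names, and the `price_velocity` input column are all assumptions made for illustration.

```python
import pandas as pd


def add_sketch_features(df):
    """Add three illustrative per-episode features (hypothetical definitions)."""
    out = df.copy()
    by_episode = df.groupby("episode_id")
    # Normalized step index: position within the episode, scaled to [0, 1].
    last_step = by_episode["step_idx"].transform("max").clip(lower=1)
    out["step_norm"] = df["step_idx"] / last_step
    # Step-by-step inventory change (0 at the first step of each episode).
    out["inv_change"] = by_episode["inventory"].diff().fillna(0)
    # 5-step smoothed velocity: rolling mean of the raw price velocity.
    out["velocity_smooth"] = by_episode["price_velocity"].transform(
        lambda s: s.rolling(5, min_periods=1).mean()
    )
    return out


log = pd.DataFrame({
    "episode_id": [0, 0, 0, 1, 1],
    "step_idx": [0, 1, 2, 0, 1],
    "inventory": [0, 1, 3, 5, 4],
    "price_velocity": [1.0, 2.0, 3.0, 10.0, 20.0],
})
features = add_sketch_features(log)
```

Grouping by `episode_id` before differencing or smoothing keeps values from one episode from leaking into the next.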

#### **Why PCA + HDBSCAN?**
- PCA
  - Our raw feature set (inventory, spread, velocity, engineered features, etc.) exists in a high-dimensional space that makes it difficult to cluster.
  - PCA can reduce the dimensionality of this data by finding the axes that capture the most variance, preserving the key differences between strategies.
  - This technique speeds up clustering and reduces noise from redundant or low-variance features.
- HDBSCAN
  - We expect each core existing strategy to form its own "cloud" of similar behavior in PCA space, but these clouds may vary in density and size.
  - HDBSCAN automatically discovers clusters at multiple density levels, requires only a minimum cluster size (no ɛ to hand-tune), and explicitly labels outliers as noise.
  - This lets us reliably pull out each latent strategy without worrying about having to specify a single radius for all scenarios or merging distinct patterns.

#### **HDBSCAN vs. DBSCAN**

TODO

### **cluster_novel_strategy**

This function uncovers small, tight clusters of unusual or emerging behavior that aren't captured by the main strategy clusters -- these may be novel strategies. By exaggerating local neighborhoods and then using density-based clustering, it reveals small patterns or tactics that our agent explores only briefly. This matters because such behavior can range from unintended exploits to genuine strategies, and either kind could enhance performance or introduce hidden risks.

#### **Usage**
```python
from evaluation.cluster import cluster_novel_strategy

novel_labels = cluster_novel_strategy(df,
    tsne_components=2,
    min_samples=10,
    min_cluster_size=50
)
df["novel_cluster"] = novel_labels
```

#### **How does this function work?**
- Runs t-SNE to project the high-dimensional features into a low-dimensional space that accentuates very local similarities, making tight and uncommon behavioral pockets stand out.
- Applies HDBSCAN on the t-SNE embedding to automatically discover dense micro-clusters and label other points as noise.

#### **What features are used?**
- Reward
- Spread
- Volume imbalance

#### **Why t-SNE + HDBSCAN?**
- t-SNE
  - Emphasizes very local similarities in the high-dimensional feature space, creating distinct islands from sparse micro-behaviors.
  - Ideal for revealing tight pockets of steps that traditional linear projections would lump together.
- HDBSCAN
  - Automatically discovers clusters at all density levels, which allows us to catch both small, high-density novel tactics and larger strategies.
  - Explicitly labels low-density or transitional points as noise, keeping the novel clusters focused on actual strategies.

### **evaluate_risk_awareness**

This function measures per-episode risk and flags any data that surpasses the set thresholds as well as any statistical anomalies. By computing drawdowns, position volatility, and reaction lags, it quantifies how safely the agent trades under realistic constraints. Ensuring our model stays within predefined risk limits is essential when analyzing and potentially deploying emergent and rediscovered strategies.

#### **Usage**
```python
from evaluation.cluster import evaluate_risk_awareness

risk_df = evaluate_risk_awareness(df,
    drawdown_q=0.95,
    inv_std_q=0.95,
    lag_q=0.95,
    iforest_contam=0.05
)
```

#### **How does this function work?**
- Groups the log by `episode_id` and computes important features that determine the agent's risk level
- Sets quantile-based thresholds to classify episodes as "safe" vs. "risky"
- Runs an IsolationForest on these metrics to flag additional anomalies

#### **What features are used?**
- `max_dd` - worst PnL drawdown
- `inv_std` - standard deviation of inventory
- `react_lag` - mean number of steps to respond to a drawdown
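
The first two metrics have standard definitions that can be sketched with the standard library alone. Reporting drawdown as a positive magnitude and using the sample standard deviation are assumptions here; `react_lag` is omitted because its exact definition is not given in this document.

```python
import statistics


def max_drawdown(pnl):
    """Largest peak-to-trough drop of a cumulative PnL series (0.0 if it never falls)."""
    peak = pnl[0]
    worst = 0.0
    for value in pnl:
        peak = max(peak, value)           # highest PnL seen so far
        worst = max(worst, peak - value)  # deepest drop below that peak
    return worst


def inventory_std(inventory):
    """Sample standard deviation of the position over one episode."""
    return statistics.stdev(inventory)


pnl = [0.0, 2.0, 1.0, 3.0, -1.0, 0.5]
inventory = [0, 1, 1, 2, 0, -1]
max_dd = max_drawdown(pnl)          # peak of 3.0 followed by a trough of -1.0
inv_std = inventory_std(inventory)
```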

#### **Why thresholding + IsolationForest?**
- Thresholding
  - Provides clear, interpretable risk limits aligned with real-world tolerances.
  - Allows for a deterministic and reproducible system that ensures an episode either passes or fails in a predictable way.
- IsolationForest
  - Captures episodes that are unusual in the joint distribution of risk metrics, even if they don't surpass any single threshold.
  - Serves as a data-driven safety net for spotting complex risk patterns.
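
The two mechanisms can be combined roughly as sketched below, using NumPy quantiles and scikit-learn's `IsolationForest`. The rule "risky if any metric exceeds its threshold", the column order, and the synthetic metrics are assumptions for illustration, not the function's confirmed behavior.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic per-episode metrics; assumed column order: max_dd, inv_std, react_lag.
rng = np.random.default_rng(0)
metrics = np.abs(rng.normal(size=(200, 3)))

# Thresholding: one cutoff per metric at its 0.95 quantile.
thresholds = np.quantile(metrics, 0.95, axis=0)
risky = (metrics > thresholds).any(axis=1)

# IsolationForest: fit_predict returns -1 for anomalies and 1 for normal points.
forest = IsolationForest(contamination=0.05, random_state=0)
anomaly = forest.fit_predict(metrics) == -1
```

An episode can be flagged by one mechanism and not the other, which is the point of running both: thresholds catch extreme values on a single metric, while the forest catches unusual combinations.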

### **evaluate_profitability**

This function assigns each episode to a profit category and tests whether the agent's profits are statistically significant. By bucketing episodes into low vs. high profit and running a one-sample t-test against zero, it provides both a performance breakdown and a measure of how likely the observed mean return is to have arisen by chance. Such statistical validation is key before claiming genuine trading success or comparing across different model versions.

#### **Usage**
```python
from evaluation.cluster import evaluate_profitability

profit_df, t_stat, p_val = evaluate_profitability(df)
```

#### **How does this function work?**
- Aggregates per-episode metrics
- Buckets episodes into `low` vs. `high` profit based on PnL quantiles
- Performs a one-sample t-test of `end_pnl` against zero to assess statistical significance
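
The bucketing and the test statistic can be sketched with the standard library. Splitting at the median is an assumption (the quantile the function actually uses is not stated here), and the helper names are hypothetical. The p-value additionally needs the Student's t distribution with `n - 1` degrees of freedom; `scipy.stats.ttest_1samp` returns both values.

```python
import math
import statistics


def one_sample_t(values, popmean=0.0):
    """t-statistic for H0: the mean of `values` equals `popmean`."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return (mean - popmean) / std_err


def bucket_by_median(values):
    """Label each value 'high' if it is above the median, otherwise 'low'."""
    median = statistics.median(values)
    return ["high" if v > median else "low" for v in values]


end_pnl = [1.0, 2.0, 3.0, 4.0, 5.0]
t_stat = one_sample_t(end_pnl)
buckets = bucket_by_median(end_pnl)
```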

#### **What features are used?**
- `end_pnl` - total episode profit/loss
- `avg_reward` - average reward per step
- `trade_freq` - total trades in the episode

#### **Why profit quantiles + one-sample t-test?**
- Profit quantiles
  - Offer a relative performance split which allows us to compare episodes on a consistent scale.
  - Reveals the distribution of returns, not just the mean.
- One-sample t-test
  - Provides a t-statistic and p-value that indicate whether the mean profit differs significantly from zero.
  - Adds statistical rigor and confidence to profitability claims.

## **Part 3: Plotting**

The `plot.py` file provides visualization functions for evaluating and understanding the performance of our trading agent. For each plot, we explain how it works and why it's useful.

### **Plotting Functions**
- Reward per Episode
  - Groups the log by `episode_id` and sums the `reward` column for each episode.
  - Plots episode index against total reward using a simple line chart.
  - This function tracks the total reward the agent accumulates in each episode. This gives a high-level view of learning progress and stability over time, which is ideal for spotting trends, improvements, or plateaus in training.
- PnL Over Time
  - Groups the log by `episode_id` and for each episode, plots the `step_idx` against the `pnl` column.
  - This function visualizes the trajectory of realized profit and loss within each episode. It helps determine whether the agent is profiting, suffering big drawdowns, or fluctuating unpredictably.
- Action Frequency Distribution
  - Uses the `action` column and tallies up each action code.
  - Creates a bar chart of counts per action with readable labels mapped from their respective numeric codes.
  - This function examines how often the agent chooses each action (e.g., hold, buy, sell). An unbalanced distribution can indicate bias towards one action or insufficient exploration.
- Cluster Scatter
  - There are two modes: one for plotting strategy rediscovery clustering and one for plotting novel strategy clustering.
  - Standardizes the data.
  - Runs PCA or t-SNE based on which mode was specified and calls the necessary clustering function to get the labels.
  - Plots each cluster in a scatter plot using these determined labels.
  - This function embeds high-dimensional agent data into two dimensions and applies clustering to discover patterns in strategy usage.
- Inventory Over Time
  - Groups by `episode_id` and plots `step_idx` by `inventory` for each episode.
  - This function keeps track of the agent's position throughout each episode. It is useful for verifying inventory limits, assessing risk exposure, and ensuring that holding penalties align with desired behavior.
- Risk Scatter Plot
  - Computes per-episode metrics using `evaluate_risk_awareness`.
  - Plots standard deviation of inventory against the max drawdown.
  - This function allows us to visualize the relationship between inventory volatility and maximum drawdown. It helps evaluate whether risk-aware components are effective.
- Anomaly Detection Histogram
  - Uses `evaluate_risk_awareness` to get an anomaly flag per episode.
  - Creates a bar chart of normal vs. anomaly counts.
  - This function shows the count of episodes that are anomalies, which are flagged by risk or behavior detectors. It is useful to quickly gauge how many runs fall outside normal strategies.
- Profitability by Quantile
  - Calls `evaluate_profitability` to get data grouped by `profit_quantile` and a one-sample t-test result.
  - Creates a boxplot of `end_pnl`, distributed by quantile.
  - This function compares end-of-episode PnL across quantiles and statistically tests whether the mean differs from zero. We can also use it to assess whether top-performing strategies significantly outperform or merely break even.
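
As an illustration of the pattern these functions follow, the Reward per Episode plot could be sketched with pandas and matplotlib as below. The function name `plot_reward_per_episode`, its return values, and the sample data are assumptions; the actual signatures in `plot.py` may differ.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_reward_per_episode(df):
    """Sum reward per episode and draw it as a simple line chart."""
    totals = df.groupby("episode_id")["reward"].sum()
    fig, ax = plt.subplots()
    ax.plot(totals.index, totals.values)
    ax.set_xlabel("Episode")
    ax.set_ylabel("Total reward")
    ax.set_title("Reward per Episode")
    return fig, totals


log = pd.DataFrame({
    "episode_id": [0, 0, 1, 1, 2],
    "reward": [1.0, 0.5, -0.25, 0.75, 2.0],
})
fig, totals = plot_reward_per_episode(log)
```

Returning the figure instead of calling `plt.show()` lets the caller decide whether to display it or save it with `fig.savefig(...)`.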