### Load dataframe

In [1]:
import pandas as pd

df = pd.read_csv("/Users/aleksandr/Desktop/Meta_Test.csv")
df = df.dropna()

### Initial cleaning

In [2]:
from clean import preprocess_tick_data

df_clean, df_diagnostics, outlier_counter = preprocess_tick_data(df)
df = df_clean
df = df.drop(columns="VOLATILITY")

Starting preprocessing with 570771 rows
After filtering trading hours: 282810 rows
After cleaning outliers: 282301 rows
Final clean dataset: 278585 rows

Outlier counts by detection method:
  zscore: 64
  extreme_deviation: 69
  isolated_point: 390
  price_reversal: 93
  timestamp_group: 34
  price_velocity: 3703
  suspicious_cluster: 52
  wavelet_outlier: 24


### Volatility estimation

In [3]:
from volatility_v1 import estimate_advanced_volatility

df = estimate_advanced_volatility(df)

Estimating advanced tick-level volatility for 278585 ticks...
Processing 278585 ticks...
Completed volatility estimation for 278585 ticks
Completed advanced tick-level volatility estimation


In [5]:
df.drop(columns=["return", "SYMBOL", "emd_vol", "sv_vol", "log_price", "smooth_vol", "raw_vol"], inplace=True)
df.rename(columns={"filtered_vol": "Volatility",
                   "TIMESTAMP": "Timestamp",
                   "VALUE": "Value",
                   "VOLUME": "Volume"}, inplace=True)
df.head()

Unnamed: 0,Timestamp,Value,Volume,Volatility
0,2025-01-30 09:30:00.740000+00:00,694.24,13.0,9.214653000000001e-17
1,2025-01-30 09:30:00.740000+00:00,694.17,15.0,0.00205606
2,2025-01-30 09:30:00.740000+00:00,694.17,15.0,0.003217889
3,2025-01-30 09:30:00.740000+00:00,694.11,8.0,0.002402309
4,2025-01-30 09:30:00.740000+00:00,694.1,249.0,0.003357669


### Encoder-only transformer feature engine

In [23]:
import os
import pandas as pd
from datetime import datetime
from Feature_engineering.feature_model_v1 import process_market_data

# Limit the run to the first 10,000 ticks and set the model directory
df = df[:10000]
model_dir = '/Users/aleksandr/code/scripts/CronusV1/Feature_engineering/saved_models'

# Process data with the model 
features_df, model = process_market_data(
    df=df,
    model_dir=model_dir,
    retrain=True,
    num_epochs=10,
    context_length=50,
    num_attention_heads=8,
    num_encoder_layers=4,
    causal=True,
    temperature=0.5,
    grad_clip_norm=1.0
)

print("\nProcess completed successfully!")

Model will be trained and saved to: /Users/aleksandr/code/scripts/CronusV1/Feature_engineering/saved_models
Using causal mode. (Suitable for real-time applications)
Setting up feature extractor...
Extracting microstructure features...
Skip outlier clipping for Volatility: 596
Skip outlier clipping for price_change: 596
Skip outlier clipping for log_return: 596
Skip outlier clipping for time_delta: 596
Skip outlier clipping for trade_direction: 596
Skip outlier clipping for is_buy: 596
Skip outlier clipping for tick_imbalance: 596
Skip outlier clipping for jump_diffusion: 596
Skip outlier clipping for jump_magnitude: 596
Skip outlier clipping for jump_arrival: 596
Skip outlier clipping for kyle_lambda: 596
Skip outlier clipping for orderflow_imbalance: 596
Skip outlier clipping for momentum_short: 596
Skip outlier clipping for momentum_medium: 596
Skip outlier clipping for momentum_long: 596
Skip outlier clipping for price_range_short: 596
Skip outlier clipping for price_range_medium: 5

In [1]:
import pandas as pd

ttest = pd.read_csv("/Users/aleksandr/code/scripts/CronusV1/Feature_engineering/saved_models/regime_features.csv")

In [2]:
ttest.shape

(9951, 20)

In [4]:
from Feature_engineering.feature_selection import run_feature_selection

filtered_df = run_feature_selection(
    df=ttest,  
    min_gain_threshold=0.06,  
    max_features=7,
    correlation_threshold=0.65,
    verbose=True,
    weights={
        'diversity': 0.35,       
        'uniqueness': 0.30,      
        'signal_to_noise': 0.20, 
        'orthogonal_variance': 0.15  
    }
)

print(f"\nSelected {filtered_df.shape[1] - 4} features:")  

Working with 9951 rows and 16 features
Topology-First Feature Selection: min_gain=0.06, corr_threshold=0.65, max_features=7
Step 1: Computing feature metrics...
  - Computing topological diversity scores...
  - Computing information uniqueness scores...
  - Computing signal stability scores...
  - Computing orthogonal variance scores...
Step 2: Combining scores with weights: {'diversity': 0.35, 'uniqueness': 0.3, 'signal_to_noise': 0.2, 'orthogonal_variance': 0.15}

Top features by topology-focused score:
  1. regime_feature_11: 0.8475
  2. regime_feature_10: 0.8404
  3. regime_feature_7: 0.8346
  4. regime_feature_13: 0.7945
  5. regime_feature_1: 0.7848
  6. regime_feature_4: 0.7762
  7. regime_feature_6: 0.7648
  8. regime_feature_16: 0.7645
  9. regime_feature_3: 0.7606
  10. regime_feature_14: 0.7413

Step 4: Calculating information gain (before correlation filtering)...
Reached maximum number of features (7)
After information gain: 7 features with cumulative gain: 2.5125

Step 5:

In [5]:
filtered_df.shape

(9951, 9)

### TDA

# Note

# Hierarchical Regime Identification in TDA Pipeline

## Core Concept
Enhance the standard TDA-based regime identification by introducing hierarchical labeling to capture both primary market regimes and their sub-regimes using a tuple representation.

## Enhanced Mapper Function Implementation
1. **Integrated Hierarchical Mapper**
   - Extend the KMapper class to include hierarchical regime identification
   - Implement a single enhanced mapper function that handles both parent and child regimes
   - Maintain a single responsibility focused on structure identification at multiple scales
   - Return complete hierarchical labeling in one pass

2. **Primary-to-Secondary Regime Process**
   - First identify primary regimes using standard topological analysis
   - Within the same mapper function, apply HDBSCAN to each primary regime
   - Ensure algorithmic consistency through integrated parameter handling
   - Optimize computational efficiency by avoiding redundant data passing

3. **Hierarchical Labeling System**
   - Replace scalar labels with tuple representation: (parent_regime, child_regime)
   - Main regimes without sub-clustering: (1,0), (2,0), (3,0), etc.
   - Sub-regimes of regime 1: (1,1), (1,2), (1,3), etc.
   - Regimes failing quality checks remain as (n,0) without sub-divisions

4. **Direct Tensor Output**
   - Have the mapper function directly output a tensor with hierarchical labels
   - Eliminate need for separate post-processing modules
   - Simplify code architecture and data flow
   - Reduce pipeline complexity and improve maintainability
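
The tuple labeling scheme above can be sketched in a few lines. This is a minimal illustration, not the KMapper extension itself: `hierarchical_labels`, `sub_cluster`, and `toy_sub_cluster` are hypothetical names, and the toy splitter only stands in for HDBSCAN. Child id 0 is assumed to mean "no sub-regime", matching the (n,0) convention.

```python
from typing import Callable, List, Sequence, Tuple


def hierarchical_labels(
    points: Sequence[Sequence[float]],
    parent_labels: Sequence[int],
    sub_cluster: Callable[[list], List[int]],
) -> List[Tuple[int, int]]:
    """Return one (parent_regime, child_regime) tuple per point.

    `sub_cluster` is a hypothetical callable (for example a wrapper around
    HDBSCAN) that receives the points of one parent regime and returns one
    child id per point, with 0 meaning "no sub-regime".
    """
    labels = [(p, 0) for p in parent_labels]
    for parent in sorted(set(parent_labels)):
        idx = [i for i, p in enumerate(parent_labels) if p == parent]
        children = sub_cluster([points[i] for i in idx])
        for i, child in zip(idx, children):
            labels[i] = (parent, child)
    return labels


def toy_sub_cluster(pts: list) -> List[int]:
    # Stand-in for HDBSCAN: split on the sign of the first coordinate,
    # but only when both sides are populated.
    signs = [1 if p[0] >= 0 else 2 for p in pts]
    return signs if len(set(signs)) > 1 else [0] * len(pts)


points = [[-1.0], [1.0], [2.0], [3.0], [-2.0]]
parents = [1, 1, 2, 2, 1]
labels = hierarchical_labels(points, parents, toy_sub_cluster)
print(labels)  # [(1, 2), (1, 1), (2, 0), (2, 0), (1, 2)]
```

Regime 2 has no internal split in this toy data, so its points keep the (2,0) label.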

## Window Treatment for Pipeline
- **Critical**: Maintain entire parent regimes as single windows
- Do NOT split parent regimes into separate sub-regime windows
- Use sub-regime labels as conditioning metadata only
- Preserve temporal continuity and transition patterns between sub-regimes
- Keep statistical power by using larger parent-level windows
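
A rough sketch of this window treatment, assuming a window is a contiguous run of one parent regime in time order (the notes do not say how a parent regime that recurs later should be handled; here it starts a new window). The function name `parent_windows` and the dictionary keys are illustrative.

```python
from itertools import groupby


def parent_windows(labels):
    """Group consecutive points that share a parent regime into one window.

    `labels` is a time-ordered list of (parent_regime, child_regime) tuples.
    Child labels are kept as per-point metadata; windows are never split
    along sub-regime boundaries.
    """
    windows = []
    start = 0
    for parent, run in groupby(labels, key=lambda lab: lab[0]):
        children = [child for _, child in run]
        windows.append({
            "parent": parent,
            "start": start,
            "stop": start + len(children),  # exclusive end index
            "children": children,
        })
        start += len(children)
    return windows


labels = [(1, 1), (1, 2), (1, 1), (2, 0), (2, 0), (1, 0)]
windows = parent_windows(labels)
for w in windows:
    print(w)
```

The first window spans indices 0 to 3 and carries the child sequence `[1, 2, 1]`, so the transitions between sub-regimes stay inside a single window.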

## Enhanced Sub-Regime Quality Controls
- **Minimum Size Enforcement**: Require at least 5% of parent regime points to form a valid sub-regime
- **Statistical Validation**: Apply silhouette score threshold (>0.3) to ensure meaningful clusters
- **Automatic Optimization**: Use gap statistic to determine optimal number of sub-regimes
- **Density-Based Approach**: HDBSCAN naturally handles varying cluster densities and outliers
- **Bayesian GMM Alternative**: For regimes with Gaussian-like distribution characteristics
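
The first two checks can be sketched without external dependencies. This is an illustration under stated assumptions: the whole split is rejected (all children reset to 0) if any sub-regime is too small or the silhouette score does not exceed the threshold, and HDBSCAN noise labels get no special treatment. `silhouette` and `validate_sub_regimes` are hypothetical names; in practice a library implementation of the silhouette score would likely replace the O(n²) loop.

```python
import math
from collections import Counter


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2))."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def validate_sub_regimes(points, child_labels, min_frac=0.05, min_silhouette=0.3):
    """Return the child labels, or all zeros if the split fails the checks."""
    n = len(child_labels)
    counts = Counter(child_labels)
    too_small = any(c < min_frac * n for c in counts.values())
    if len(counts) < 2 or too_small:
        return [0] * n
    if silhouette(points, child_labels) <= min_silhouette:
        return [0] * n
    return list(child_labels)


pts = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
good = validate_sub_regimes(pts, [1, 1, 1, 2, 2, 2])  # well separated: kept
bad = validate_sub_regimes(pts, [1, 2, 1, 2, 1, 2])   # interleaved: rejected
print(good, bad)
```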

## Code Architecture Benefits
- **Data Efficiency**: Minimizes redundant data passing between components
- **Consistency**: Ensures parameter and methodology alignment across levels
- **Simplicity**: Reduces distinct components to maintain and debug
- **Cohesion**: Keeps related functionality together for better maintainability
- **Performance**: Reduces overhead from multiple processing stages

## Technical Benefits
- Preserves richer topological structure identified by TDA
- Captures nested behavior patterns within major regimes
- Enables more precise conditioning in the diffusion model
- Provides structured hierarchical information for Titan's memory mechanisms
- Maintains window sizes sufficient for robust statistical analysis

## Implementation Notes
- Adaptive sub-regime identification based on parent regime characteristics
- Assign no sub-regimes when the parent regime is already coherent (keep it as (n,0))
- Implement visualization tools to display hierarchical structure
- Calculate transition probabilities between sub-regimes for additional insights
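
The transition probabilities mentioned in the last note can be estimated by counting consecutive label pairs and normalizing each row. A minimal sketch, assuming a time-ordered label sequence and counting self-transitions (staying in the same sub-regime) as transitions; `transition_probabilities` is an illustrative name.

```python
from collections import defaultdict


def transition_probabilities(labels):
    """Estimate P(next label | current label) from a time-ordered sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(labels, labels[1:]):
        counts[src][dst] += 1
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


seq = [(1, 1), (1, 1), (1, 2), (1, 1), (1, 2), (1, 2)]
probs = transition_probabilities(seq)
for src, row in probs.items():
    print(src, row)
```

In this toy sequence, sub-regime (1,1) moves to (1,2) in two of its three observed transitions.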

## Integration with Subsequent Pipeline Steps
- **For Diffusion Model**: 
  - Feed entire parent regime windows to the model
  - Use tuple labels as conditioning information only
  - Maintain temporal coherence within parent windows
  - Learn to denoise and represent parent windows with awareness of internal structure

- **For Tensor Structure**: 
  - Adjust conditioning dimensions to accommodate tuple representation
  - Use parent regime for window boundaries
  - Incorporate sub-regime information as metadata within windows

- **For Titan**: 
  - Leverage hierarchical structure for more precise memory-based pattern matching
  - Utilize sub-regime transitions as potential predictive signals

## Key Metrics to Validate Approach
- Sub-regime stability over time
- Predictive power improvement compared to flat regime structure
- Transition patterns between sub-regimes
- Information gain from hierarchical representation
- Statistical significance of identified sub-regimes

## Priority Development Tasks
1. Extend KMapper class with hierarchical mapping capability
2. Implement integrated HDBSCAN sub-regime detection
3. Add quality control mechanisms within mapper function
4. Create direct tensor output with hierarchical labels
5. Update visualizations to show hierarchical structure

# Window-based diffusion model conditioned on volatility regime

# Note

# Enhanced Market Regime Representation Using Diffusion Models

## Core Concept
Extend traditional feature representation with diffusion models to create a 4D tensor representation of market data, capturing both explicit features and latent regime characteristics for high-frequency trading applications.

## Data Structure
- **Input**: Tensor of shape (i,j,k)
  - i = window number (segmented by volatility regimes)
  - j = datapoints within regime window 
  - k = feature dimensions
- **Output**: Enhanced tensor with either:
  - Extended features: (i,j,q) where q > k
  - Multi-scale representation: (i,(j,k,r)) with new dimension r for latent features
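
The two output shapes can be illustrated with NumPy. This sketch rests on several assumptions: all windows are padded or truncated to a common length j (real regime windows vary in length), the sizes are arbitrary, random arrays stand in for learned latents, and the multi-scale form is read as one (j, k, r) block per window, stacked into a 4D array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: windows, points per window, features, latent size.
i, j, k, r = 4, 50, 7, 3

x = rng.normal(size=(i, j, k))               # input tensor (i, j, k)

# Option 1: extended features. Append r latent columns, so q = k + r > k.
latent = rng.normal(size=(i, j, r))          # placeholder for learned latents
extended = np.concatenate([x, latent], axis=-1)

# Option 2: extra axis. Each window becomes (j, k, r), read here as
# r latent values per original feature; stacking i windows gives 4D.
multi_scale = rng.normal(size=(i, j, k, r))  # placeholder for learned latents

print(extended.shape)     # (4, 50, 10)
print(multi_scale.shape)  # (4, 50, 7, 3)
```

Option 1 keeps the raw features untouched in the first k columns, which makes it easy to ablate the latent part later.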

## Key Benefits
1. Captures subtle market microstructure patterns specific to volatility regimes
2. Enhances regime transition identification
3. Creates richer representations for Titan's memory mechanisms 
4. Improves pattern recognition across similar historical regimes
5. Preserves both raw features and their latent abstractions

## Technical Implementation Path

### Phase 1: Diffusion Model Design
- Design diffusion model architecture for window-level processing
- Implement forward diffusion process (noise injection)
- Implement reverse diffusion process (denoising)
- Incorporate contrastive learning objective to enhance regime separation
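
For the forward (noise injection) step, a common choice is the closed-form DDPM sampler, where a noisy window at step t is drawn directly from the clean window. The sketch below assumes a linear beta schedule and a single (j, k) window; the schedule values and function names are illustrative, not taken from this project. The reverse process needs a trained denoiser and is not shown.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative product of (1 - beta)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = rng.standard_normal((50, 7))  # one window: 50 ticks, 7 features (illustrative)
xt, eps = forward_diffuse(x0, t=500, alpha_bar=alpha_bar, rng=rng)
print(xt.shape, float(alpha_bar[0]), float(alpha_bar[-1]))
```

The returned `eps` is the usual regression target when training a denoiser to predict the injected noise.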

### Phase 2: Latent Space Extension
- Extract intermediate latent representations from diffusion model
- Define structure of additional dimension to capture regime-specific patterns
- Implement dimension extension procedure
- Design feature fusion mechanism to combine original and latent features

### Phase 3: Training Pipeline
- Train diffusion model on windows from similar volatility regimes
- Fine-tune with contrastive loss to enhance regime separation
- Evaluate quality of latent representations using clustering metrics
- Optimize hyperparameters for regime distinction
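
One way to express the contrastive fine-tuning objective is a supervised contrastive loss in which windows from the same regime are treated as positives. The notes do not specify the loss, so the form below, the function name, and the default temperature are assumptions; the NumPy version only illustrates the computation and would need a differentiable framework for training.

```python
import numpy as np


def regime_contrastive_loss(z, regimes, temperature=0.5):
    """Supervised contrastive loss over window embeddings.

    z: (n, d) array of window embeddings; regimes: (n,) regime ids.
    Anchors without any same-regime partner are skipped.
    """
    z = np.asarray(z, dtype=float)
    regimes = np.asarray(regimes)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    self_mask = np.eye(len(z), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (regimes[:, None] == regimes[None, :]) & ~self_mask
    has_pos = pos.any(axis=1)
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos.sum(axis=1)[has_pos]
    return float(per_anchor.mean())


z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = regime_contrastive_loss(z, [0, 0, 1, 1])     # labels match geometry
loss_misaligned = regime_contrastive_loss(z, [0, 1, 0, 1])  # labels cut across it
print(loss_aligned < loss_misaligned)
```

The loss is lower when embeddings from the same regime already sit close together, which is the separation the fine-tuning step aims to increase.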

### Phase 4: Integration with Titan Model
- Structure 4D tensor as input for Titan's memory mechanisms
- Configure Titan to utilize both raw features and latent representations
- Optimize memory update parameters for enhanced pattern recognition
- Evaluate improvements in prediction accuracy and regime identification

## Evaluation Metrics
1. Latent space clustering quality (silhouette score, within-cluster variance)
2. Regime transition detection accuracy
3. Prediction performance across different volatility regimes
4. Memory utilization efficiency in Titan architecture
5. Trading strategy performance metrics

## Considerations
- Carefully balance complexity vs. inference speed for HFT applications
- Focus on regime transition periods for maximum trading advantage
- Structure inference pipeline to minimize latency in production
- Design ablation studies to quantify value of latent dimension extension

## Implementation Priority
1. Design and implement basic diffusion model for window denoising
2. Add contrastive learning component for regime separation
3. Implement latent feature extraction mechanism
4. Integrate with Titan architecture
5. Optimize for production performance