## 1. Data Preprocessing

### Cleaning + Estimating volatility
- Ensuring validity of data points
- Cleaning out outliers: extreme deviations, isolated points, etc.
- Estimating Volatility

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from datetime import datetime

df = pd.read_csv("/Users/aleksandr/Desktop/Meta_Test.csv")
df = df.dropna()

In [2]:
from preprocess_td import preprocess_tick_data

# TODO: estimate (tune) this function's parameters from the data
df_clean, df_diagnostics, outlier_counter = preprocess_tick_data(df)
df = df_clean
df = df.drop(columns="VOLATILITY")

Starting preprocessing with 570771 rows
After filtering trading hours: 282810 rows
After cleaning outliers: 282301 rows
Final clean dataset: 278585 rows

Outlier counts by detection method:
  zscore: 64
  extreme_deviation: 69
  isolated_point: 390
  price_reversal: 93
  market_open_artifact: 0
  timestamp_group: 34
  price_velocity: 3703
  suspicious_cluster: 52
  wavelet_outlier: 24
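
The internals of `preprocess_tick_data` are not shown in this notebook. As a rough illustration of what a `zscore`-style detector might look like, the sketch below flags ticks that deviate from a trailing median by more than a chosen number of robust standard deviations. The function name, window length, and threshold are assumptions made for illustration, not the actual implementation.

```python
import numpy as np

def flag_zscore_outliers(prices, window=50, threshold=5.0):
    """Hypothetical detector: robust z-score against a trailing window."""
    prices = np.asarray(prices, dtype=float)
    flags = np.zeros(len(prices), dtype=bool)
    for i in range(window, len(prices)):
        past = prices[i - window:i]              # trailing window only, no look-ahead
        median = np.median(past)
        mad = np.median(np.abs(past - median))   # median absolute deviation
        scale = 1.4826 * mad                     # MAD -> std under normality
        if scale > 0 and abs(prices[i] - median) > threshold * scale:
            flags[i] = True
    return flags

rng = np.random.default_rng(0)
prices = 694.0 + np.cumsum(rng.normal(0.0, 0.01, 500))
prices[300] += 5.0                               # inject one obvious bad tick
flags = flag_zscore_outliers(prices)
print(np.flatnonzero(flags))
```

Using the median and MAD rather than the mean and standard deviation keeps a single bad tick from inflating the scale used to judge its neighbours.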


In [3]:
# TODO: estimate (tune) this function's parameters from the data
from volatility_estimation import estimate_tick_volatility

df = estimate_tick_volatility(df, method='wavelet')

Estimating advanced tick-level volatility for 278585 ticks...
Computing wavelet-based volatility for META.O...
Completed advanced tick-level volatility estimation


In [4]:
df.drop(columns=['return', 'SYMBOL'], inplace=True)
df.rename(columns={'wavelet_vol': 'Volatility',
                   'TIMESTAMP': 'Timestamp',
                   'VALUE': 'Value',
                   'VOLUME': 'Volume'}, inplace=True)
df.head()

| | Timestamp | Value | Volume | Volatility |
|---|---|---|---|---|
| 0 | 2025-01-30 09:30:00.740000+00:00 | 694.24 | 13.0 | 0.00026 |
| 1 | 2025-01-30 09:30:00.740000+00:00 | 694.17 | 15.0 | 0.00026 |
| 2 | 2025-01-30 09:30:00.740000+00:00 | 694.17 | 15.0 | 0.000261 |
| 3 | 2025-01-30 09:30:00.740000+00:00 | 694.11 | 8.0 | 0.000261 |
| 4 | 2025-01-30 09:30:00.740000+00:00 | 694.1 | 249.0 | 0.000261 |


## 2. Synthetic Noise Injection 

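Suppose that the latent log-price $X_t$ is an Itô semimartingale of the form

$dX_t = b_t dt + \sigma_t dW_t + dJ_t,$  
$d\sigma_t = \tilde{b}_t dt + \tilde{\sigma}^{(1)}_t dW_t + \tilde{\sigma}^{(2)}_t d\tilde{W}_t + d\tilde{J}_t,$

where $b_t$ and $\tilde{b}_t$ are drift terms, $\sigma_t$ is the spot volatility, $W_t$ and $\tilde{W}_t$ are independent Brownian motions, and $J_t$ and $\tilde{J}_t$ are jump processes.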
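
To build intuition for these dynamics, here is a minimal Euler-Maruyama sketch. It is a simplification under stated assumptions: the drift is constant, the general volatility equation is replaced by a mean-reverting process on $\log \sigma_t$ (which keeps $\sigma_t$ positive), price jumps are compound Poisson, and volatility jumps are omitted. All parameter values are illustrative and not calibrated to the data.

```python
import numpy as np

def simulate_log_price(n_steps=1000, dt=1.0 / 23400, seed=0):
    """Euler-Maruyama sketch of dX = b dt + sigma dW + dJ with stochastic sigma."""
    rng = np.random.default_rng(seed)
    b = 0.0                                     # drift of the log-price (assumed constant)
    kappa, theta, xi = 5.0, np.log(0.02), 0.5   # mean-reversion parameters for log(sigma)
    rho = -0.5                                  # loading of the vol shock on dW
    jump_rate, jump_std = 10.0, 0.001           # compound Poisson jumps in X

    x = np.empty(n_steps + 1)
    log_sigma = np.empty(n_steps + 1)
    x[0], log_sigma[0] = np.log(694.0), theta

    for t in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))
        dw_tilde = rng.normal(0.0, np.sqrt(dt))  # independent Brownian increment
        # Volatility is driven by both W and W-tilde, as in the second equation
        vol_shock = rho * dw + np.sqrt(1 - rho**2) * dw_tilde
        n_jumps = rng.poisson(jump_rate * dt)
        jump = rng.normal(0.0, jump_std, n_jumps).sum()

        sigma = np.exp(log_sigma[t])
        x[t + 1] = x[t] + b * dt + sigma * dw + jump
        log_sigma[t + 1] = (log_sigma[t]
                            + kappa * (theta - log_sigma[t]) * dt
                            + xi * vol_shock)
    return x, np.exp(log_sigma)

x, sigma = simulate_log_price()
print(x.shape, sigma.shape)
```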





### Market Microstructure Noise:
- Quote Spread Noise: The price fluctuation caused by trades alternating between bid and ask prices, creating a "bouncing" effect that obscures the true efficient price.

- Order Flow Noise: Price movements driven by the imbalance between buy and sell orders, where persistent directional trading pressure can create temporary price trends away from the efficient price.

- Strategic Order Noise: Price distortions created when large traders split their orders into smaller pieces to minimize market impact.

- Quote Positioning Noise: Price effects from market makers strategically placing and canceling quotes to create false impressions of supply and demand.
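
Of these four components, quote spread noise is the easiest to write down. The sketch below, in the spirit of a Roll-type bid-ask bounce model, adds half the spread times a random trade direction to an efficient price. It is a standalone illustration with assumed names and parameters, not the code inside `MarketMicrostructureSimulator`.

```python
import numpy as np

def add_bid_ask_bounce(efficient_price, spread=0.02, seed=0):
    """Observed price = efficient price + half-spread * trade direction."""
    rng = np.random.default_rng(seed)
    efficient_price = np.asarray(efficient_price, dtype=float)
    # +1 for a buyer-initiated trade (at the ask), -1 for seller-initiated (at the bid)
    direction = rng.choice([-1.0, 1.0], size=efficient_price.shape)
    noise = 0.5 * spread * direction
    return efficient_price + noise, noise

rng = np.random.default_rng(1)
efficient = 694.0 + np.cumsum(rng.normal(0.0, 0.005, 1000))
observed, noise = add_bid_ask_bounce(efficient)

# The bounce induces negative first-order autocorrelation in price changes
changes = np.diff(observed)
autocorr = np.corrcoef(changes[:-1], changes[1:])[0, 1]
print(f"lag-1 autocorrelation of observed price changes: {autocorr:.3f}")
```

The negative lag-1 autocorrelation is the "bouncing" signature described in the first bullet: a trade at the ask is more likely to be followed by a downward move than an upward one.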

In [5]:
# Generate 1000 noise samples
from microstructure_simulator import MarketMicrostructureSimulator

simulator = MarketMicrostructureSimulator(n_samples=1000)
result = simulator.simulate_full_microstructure()
noise_components = simulator.extract_noise_components(result)
noise_components["total_noise"][:10]

Simulating efficient price process...
Adding quote spread noise...
Simulating order flow noise...
Simulating strategic order splitting...
Simulating strategic quote positioning...


array([-0.01059243, -0.02092213, -0.0834803 , -0.00851905, -0.01510907,
       -0.12214871, -0.08950043,  0.02924293, -0.08329231, -0.03416489])

## 3. Windowing data
- Split the series into overlapping or non-overlapping windows.
- Ensure no future leakage.
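
Both bullets can be illustrated with index arithmetic alone. The sketch below builds fixed-size windows with a configurable stride (a stride smaller than the window size gives overlap) and splits them chronologically, dropping any window that shares ticks with the preceding split. The helper names are hypothetical and independent of the class defined below.

```python
def make_windows(n_ticks, window_size, stride):
    """Return (start, end) index pairs; stride < window_size gives overlap."""
    starts = range(0, n_ticks - window_size + 1, stride)
    return [(s, s + window_size) for s in starts]

def chronological_split(windows, train_ratio=0.7, val_ratio=0.15):
    """Split time-ordered windows without shuffling, dropping any window
    that overlaps the previous split so no tick is shared across splits."""
    n = len(windows)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = windows[:n_train]
    train_end = train[-1][1] if train else 0
    rest = [w for w in windows[n_train:] if w[0] >= train_end]
    val = rest[:n_val]
    val_end = val[-1][1] if val else train_end
    test = [w for w in rest[n_val:] if w[0] >= val_end]
    return train, val, test

windows = make_windows(n_ticks=1000, window_size=100, stride=50)
train, val, test = chronological_split(windows)
print(len(windows), len(train), len(val), len(test))
```

Dropping the straddling windows costs a little data but guarantees that every validation and test tick comes strictly after the last training tick.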

In [40]:
class VolatilityWindowGenerator:
    """
    Generates windows based on volatility regimes and provides tensor conversion utilities.
    
    This class handles:
    1. Volatility regime detection and window creation
    2. Conversion to tensor representations
    3. Train/val/test splitting
    4. Window analysis and diagnostics
    """
    
    def __init__(self, df, feature_columns=['Value', 'Volume', 'Volatility']):
        """
        Initialize the window generator with tick data.
        
        Parameters:
        -----------
        df : pandas.DataFrame
            DataFrame with columns ['Timestamp', 'Value', 'Volume', 'Volatility']
        feature_columns : list
            Columns to include as features
        """
        self.df = df.copy()
        self.feature_columns = list(feature_columns)  # copy so the shared default list is never mutated
        self.windows = []
        self.window_tensors = None
        self.window_info = None
    
    def detect_volatility_regimes(self, sensitivity=1.5, min_regime_size=30, smoothing_window=5):
        """
        Detect regime changes based on significant volatility shifts.
        
        Parameters:
        -----------
        sensitivity : float
            Multiplier for standard deviation to detect regime changes
        min_regime_size : int
            Minimum number of ticks in a regime
        smoothing_window : int
            Window size for volatility smoothing
        
        Returns:
        --------
        list of int
            Indices where regime changes occur
        """
        # Smooth volatility for cleaner regime detection
        smoothed_vol = self.df['Volatility'].rolling(window=smoothing_window, min_periods=1).mean()
        
        # Compute volatility changes
        vol_changes = smoothed_vol.diff().abs()
        
        # Identify significant changes (> sensitivity * std)
        std_change = vol_changes.std()
        threshold = sensitivity * std_change
        
        # Get positional indices of significant changes (safe to use with iloc
        # even if the DataFrame index is not a clean 0..n-1 range)
        potential_breakpoints = np.flatnonzero((vol_changes > threshold).to_numpy())
        
        # Filter breakpoints to ensure minimum regime size
        breakpoints = [0]  # Always start with the first point
        for bp in potential_breakpoints:
            if bp - breakpoints[-1] >= min_regime_size:
                breakpoints.append(int(bp))
        
        # Add the end of series if the last segment is long enough
        # (or if no breakpoint was found, so the whole series is one regime)
        if len(self.df) - breakpoints[-1] >= min_regime_size or len(breakpoints) == 1:
            breakpoints.append(len(self.df))
        else:
            # Merge the short tail into the previous regime
            breakpoints[-1] = len(self.df)
        
        print(f"Detected {len(breakpoints)-1} volatility regimes")
        return breakpoints
    
    def create_regime_windows(self, sensitivity=1.5, min_regime_size=30, 
                            max_window_size=200, smoothing_window=5):
        """
        Create windows based on detected volatility regimes.
        """
        # Get regime breakpoints
        breakpoints = self.detect_volatility_regimes(
            sensitivity=sensitivity, 
            min_regime_size=min_regime_size,
            smoothing_window=smoothing_window
        )
        
        windows = []
        window_id = 0
        
        # Create windows from regimes
        for i in range(len(breakpoints) - 1):
            start_idx = breakpoints[i]
            end_idx = breakpoints[i+1]
            
            regime_length = end_idx - start_idx
            
            # Skip if regime is too small
            if regime_length < min_regime_size:
                continue
                
            # If regime is too large, split it into multiple windows
            if regime_length > max_window_size:
                # Calculate number of windows needed
                n_windows = int(np.ceil(regime_length / max_window_size))
                window_size = regime_length // n_windows  # Base size; any remainder goes to the last window
                
                # Create windows of equal size
                for j in range(n_windows):
                    seg_start = start_idx + (j * window_size)
                    seg_end = start_idx + ((j + 1) * window_size)
                    
                    # The last window absorbs the remainder (up to n_windows - 1 ticks),
                    # so it can slightly exceed max_window_size
                    if j == n_windows - 1:
                        seg_end = end_idx
                    
                    window = self.df.iloc[seg_start:seg_end].copy()
                    
                    # Only add if window has data
                    if len(window) >= min_regime_size:
                        window['window_start'] = seg_start
                        window['window_end'] = seg_end
                        window['window_id'] = window_id
                        window['regime_id'] = i
                        windows.append(window)
                        window_id += 1
            else:
                # Use the entire regime as a window
                window = self.df.iloc[start_idx:end_idx].copy()
                if len(window) >= min_regime_size:
                    window['window_start'] = start_idx
                    window['window_end'] = end_idx
                    window['window_id'] = window_id
                    window['regime_id'] = i
                    windows.append(window)
                    window_id += 1
        
        self.windows = windows
        
        print(f"Created {len(windows)} windows from {len(breakpoints)-1} regimes")
        if windows:  # Only print statistics if there are windows
            print(f"Average window size: {sum(len(w) for w in windows) / len(windows):.2f}")
            print(f"Min window size: {min(len(w) for w in windows)}")
            print(f"Max window size: {max(len(w) for w in windows)}")
        else:
            print("Warning: No windows were created!")
        
        return windows
    
    def create_adaptive_windows(self, base_window_size=100, min_window_size=50, 
                              max_window_size=200, overlap_ratio=0.0):
        """
        Create windows with adaptive sizing based on volatility levels.
        
        Parameters:
        -----------
        base_window_size : int
            Base window size for median volatility
        min_window_size : int
            Minimum window size for high volatility periods
        max_window_size : int
            Maximum window size for low volatility periods
        overlap_ratio : float
            Ratio of overlap between consecutive windows (0 to 1)
            
        Returns:
        --------
        list of DataFrames
            List of windowed DataFrames
        """
        # Calculate median volatility for scaling
        median_vol = self.df['Volatility'].median()
        
        windows = []
        i = 0
        window_id = 0
        
        while i < len(self.df):
            # Get current volatility
            current_vol = self.df['Volatility'].iloc[i]
            
            # Adjust window size based on volatility
            # Higher volatility = smaller window, Lower volatility = larger window
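            vol_ratio = current_vol / median_vol if median_vol > 0 else np.nan
            if np.isfinite(vol_ratio) and vol_ratio > 0:
                window_size = int(min(base_window_size / vol_ratio, max_window_size))
            else:
                # Zero or undefined volatility: fall back to the largest window
                window_size = max_window_size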
            
            # Ensure window size stays within bounds
            window_size = max(min_window_size, min(window_size, max_window_size))
            
            # Create window
            if i + window_size <= len(self.df):
                window = self.df.iloc[i:i + window_size].copy()
                
                # Only add windows with sufficient data
                if len(window) >= min_window_size:
                    # Add window index information
                    window['window_start'] = i
                    window['window_end'] = i + window_size
                    window['window_id'] = window_id
                    windows.append(window)
                    window_id += 1
            
            # Move by step size (with overlap)
            step_size = int(window_size * (1 - overlap_ratio))
            i += max(1, step_size)  # Ensure we move at least 1 step
        
        self.windows = windows
        
        print(f"Created {len(windows)} {'overlapping ' if overlap_ratio > 0 else ''}windows")
        if windows:  # Only print statistics if there are windows
            print(f"Average window size: {sum(len(w) for w in windows) / len(windows):.2f}")
            print(f"Min window size: {min(len(w) for w in windows)}")
            print(f"Max window size: {max(len(w) for w in windows)}")
        else:
            print("Warning: No windows were created!")
        
        return windows


    def analyze_windows(self):
        """
        Print statistics about volatility in the windows
        """
        if not self.windows:
            raise ValueError("No windows created yet. Call create_regime_windows() or create_adaptive_windows() first.")
        
        # First verify all windows have data
        empty_windows = [i for i, w in enumerate(self.windows) if len(w) == 0]
        if empty_windows:
            raise ValueError(f"Found empty windows at indices: {empty_windows}")
            
        window_stats = []
        for w in self.windows:
            try:
                stats = {
                    'window_id': w['window_id'].iloc[0],
                    'size': len(w),
                    'mean_vol': w['Volatility'].mean(),
                    'max_vol': w['Volatility'].max(),
                    'min_vol': w['Volatility'].min(),
                    'vol_range': w['Volatility'].max() - w['Volatility'].min(),
                    'vol_stability': w['Volatility'].std() / w['Volatility'].mean(),
                    'start_time': w['Timestamp'].iloc[0],
                    'end_time': w['Timestamp'].iloc[-1]
                }
                
                # Add regime_id if available
                if 'regime_id' in w.columns:
                    stats['regime_id'] = w['regime_id'].iloc[0]
                    
                window_stats.append(stats)
            except Exception as e:
                print(f"Error processing window {w['window_id'].iloc[0] if len(w) > 0 else 'unknown'}:")
                print(f"Window length: {len(w)}")
                print(f"Error: {str(e)}")
                raise
        
        stats_df = pd.DataFrame(window_stats)
        print("\nWindow Statistics:")
        print(stats_df.describe())
        return stats_df


    
    def _timestamp_to_seconds(self, ts):
        """Convert timestamp to seconds within the day"""
        if isinstance(ts, str):
            ts = pd.to_datetime(ts)
        return ts.hour * 3600 + ts.minute * 60 + ts.second + ts.microsecond / 1e6
    
    def create_tensor_windows(self):
        """
        Convert windows to tensor representation.
        
        Returns:
        --------
        tf.RaggedTensor, dict
            A ragged tensor of windows and metadata dictionary
        """
        if not self.windows:
            raise ValueError("No windows created yet. Call create_regime_windows() or create_adaptive_windows() first.")
        
        # Create list to hold window arrays
        window_arrays = []
        window_lengths = []
        window_metadata = []
        
        for window in self.windows:
            # Convert timestamp to numerical feature
            timestamps = pd.to_datetime(window['Timestamp'])
            time_seconds = timestamps.apply(self._timestamp_to_seconds)
            
            # Ensure all features are properly shaped
            features = np.column_stack([
                time_seconds.values,  # Make sure it's a numpy array
                window[self.feature_columns].values
            ]).astype(np.float32)  # Ensure float32 dtype
            
            window_arrays.append(features)
            window_lengths.append(len(features))
            
            # Store metadata
            metadata = {
                'window_id': window['window_id'].iloc[0],
                'start_time': window['Timestamp'].iloc[0],
                'end_time': window['Timestamp'].iloc[-1],
                'mean_vol': window['Volatility'].mean()
            }
            
            # Add regime_id if available
            if 'regime_id' in window.columns:
                metadata['regime_id'] = window['regime_id'].iloc[0]
                
            window_metadata.append(metadata)
        
        # Create ragged tensor with explicit shape
        ragged_tensor = tf.RaggedTensor.from_row_lengths(
            values=tf.concat(window_arrays, axis=0),
            row_lengths=[len(w) for w in window_arrays]
        )
        
        # Create dictionary with window information
        window_info = {
            'n_windows': len(self.windows),
            'max_length': max(window_lengths),
            'min_length': min(window_lengths),
            'avg_length': sum(window_lengths) / len(window_lengths),
            'feature_names': ['time_seconds'] + self.feature_columns,
            'metadata': window_metadata,
            'n_features': len(self.feature_columns) + 1  # Add 1 for time_seconds
        }
        
        self.window_tensors = ragged_tensor
        self.window_info = window_info
        
        print(f"\nTensor shape: {ragged_tensor.shape}")
        print(f"Number of features: {window_info['n_features']}")
        print(f"Max window length: {window_info['max_length']}")
        print(f"Min window length: {window_info['min_length']}")
        print(f"Average window length: {window_info['avg_length']:.2f}")
        
        return ragged_tensor, window_info
    
    def pad_to_dense(self):
        """
        Convert ragged tensor to dense tensor by padding with zeros
        
        Returns:
        --------
        tf.Tensor
            Padded dense tensor
        """
        if self.window_tensors is None:
            raise ValueError("No tensor windows created yet. Call create_tensor_windows() first.")
            
        return self.window_tensors.to_tensor(default_value=0.0)
    
    def split_windows(self, train_ratio=0.7, val_ratio=0.15):
        """
        Split window tensor into train/validation/test sets
        
        Parameters:
        -----------
        train_ratio : float
            Proportion of data for training
        val_ratio : float
            Proportion of data for validation
            
        Returns:
        --------
        tuple
            (train_tensor, val_tensor, test_tensor)
        """
        if self.window_tensors is None:
            raise ValueError("No tensor windows created yet. Call create_tensor_windows() first.")
            
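        n_windows = int(self.window_tensors.shape[0])
        # Windows are stored in chronological order; keep that order so the
        # validation and test sets come strictly after the training set
        # (a random permutation here would leak future data into training)
        indices = np.arange(n_windows)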
        
        train_size = int(n_windows * train_ratio)
        val_size = int(n_windows * val_ratio)
        
        train_indices = indices[:train_size]
        val_indices = indices[train_size:train_size + val_size]
        test_indices = indices[train_size + val_size:]
        
        train_tensor = tf.gather(self.window_tensors, train_indices)
        val_tensor = tf.gather(self.window_tensors, val_indices)
        test_tensor = tf.gather(self.window_tensors, test_indices)
        
        print(f"Training tensor shape: {train_tensor.shape}")
        print(f"Validation tensor shape: {val_tensor.shape}")
        print(f"Testing tensor shape: {test_tensor.shape}")
        
        return train_tensor, val_tensor, test_tensor

In [41]:
# Initialize the window generator
window_gen = VolatilityWindowGenerator(df)

# Use regime-based windowing (better for handling different market conditions)
windows = window_gen.create_regime_windows(sensitivity=1.8, min_regime_size=40)

# Analyze the windows
stats = window_gen.analyze_windows()

# Convert to tensors
tensors, info = window_gen.create_tensor_windows()

# Split into train/val/test
train_tensor, val_tensor, test_tensor = window_gen.split_windows()

print(f"Features: {info['feature_names']}")

Detected 762 volatility regimes
Created 1884 windows from 762 regimes
Average window size: 147.87
Min window size: 40
Max window size: 229

Window Statistics:
         window_id         size     mean_vol      max_vol      min_vol  \
count  1884.000000  1884.000000  1884.000000  1884.000000  1884.000000   
mean    941.500000   147.868896     0.000334     0.000348     0.000320   
std     544.008272    59.940387     0.000361     0.000371     0.000351   
min       0.000000    40.000000     0.000018     0.000020     0.000014   
25%     470.750000    88.000000     0.000232     0.000242     0.000221   
50%     941.500000   180.500000     0.000270     0.000281     0.000259   
75%    1412.250000   192.000000     0.000316     0.000329     0.000305   
max    1883.000000   229.000000     0.003218     0.003225     0.003201   

          vol_range  vol_stability    regime_id  
count  1.884000e+03    1884.000000  1884.000000  
mean   2.810576e-05       0.030617   429.266454  
std    4.234299e-05     

## 3. Model Architecture Adaptations

Modify the original CSDI code (GitHub) for denoising:

### Remove Imputation Logic:
- Delete code blocks that handle missing value imputation.

### Time Embeddings:
- Normalize irregular timestamps to [0, 1] and encode them as sinusoidal embeddings.
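
A minimal sketch of such an embedding, assuming timestamps are given in seconds and min-max normalized per window before a transformer-style sinusoidal encoding is applied. The `d_model` and `scale` parameters are assumptions; `scale` stretches the normalized [0, 1] range so the sinusoids actually vary across a window.

```python
import numpy as np

def encode_timestamps(time_seconds, d_model=16, scale=1000.0):
    """Sinusoidal embedding of irregular timestamps.

    time_seconds : 1-D array of tick times within one window.
    Returns an array of shape (len(time_seconds), d_model); d_model must be even.
    """
    t = np.asarray(time_seconds, dtype=float)
    span = t.max() - t.min()
    # Min-max normalize to [0, 1]; a window of identical timestamps maps to 0
    t_norm = (t - t.min()) / span if span > 0 else np.zeros_like(t)

    # Geometric frequency ladder as in transformer positional encodings
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    angles = scale * t_norm[:, None] * freqs[None, :]

    emb = np.empty((len(t), d_model))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

# Irregularly spaced ticks, in seconds since midnight
ticks = np.array([34200.74, 34200.74, 34201.10, 34203.95, 34210.00])
emb = encode_timestamps(ticks)
print(emb.shape)
```

Ticks that share a timestamp, as the first rows of the data above do, receive identical embeddings, so the model has to rely on the other features to tell them apart.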

### Conditioning Mechanism:
- Use the noisy input as the conditional context (instead of partial observations).
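
One way this conditioning could be wired up, loosely following the CSDI idea of stacking the conditional observations and the diffused sample as two input channels, is sketched below. The array names and shapes are assumptions for illustration.

```python
import numpy as np

def build_model_input(noisy_ticks, x_t):
    """Stack the conditioning context and the diffusion state as channels.

    noisy_ticks : (batch, features, length) observed, noise-corrupted ticks
    x_t         : (batch, features, length) current diffusion state
    Returns     : (batch, 2, features, length)
    """
    if noisy_ticks.shape != x_t.shape:
        raise ValueError("condition and diffusion state must have the same shape")
    return np.stack([noisy_ticks, x_t], axis=1)

rng = np.random.default_rng(0)
noisy_ticks = rng.normal(size=(64, 3, 200))   # batch of 64 windows, 3 features
x_t = rng.normal(size=(64, 3, 200))
model_input = build_model_input(noisy_ticks, x_t)
print(model_input.shape)
```

Because every position is observed, the conditioning channel is passed through unmasked.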

### Diffusion Process:
- Use the original diffusion steps but disable masking (all positions are observed).
- Adjust the noise schedule (β) to match financial noise characteristics.


## 4. Training Pipeline

**Objective:** Learn to reverse the diffusion process conditioned on noisy ticks.

### Inputs:
- **noisy_data:** Corrupted ticks (price, volume, etc.).
- **mask:** All ones (no missing data).
- **time_embeddings:** Encoded timestamps.

### Forward Process:
- Gradually add Gaussian noise to `noisy_data` across diffusion timesteps.
- Possible extension: replace the Gaussian SDE with a market-realistic stochastic process.
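
For reference, the standard Gaussian (DDPM-style) forward process has the closed form below. The notation is assumed rather than taken from this notebook: $x_0$ is the data being diffused, $\beta_t$ is the noise schedule, and $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$.

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
$$

Sampling $x_t$ directly from $x_0$ in this way avoids simulating every intermediate step during training.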

### Reverse Process:
- Train the model to predict the score (the gradient of the log-density of the noised data), which is then used to denoise the data.

### Loss Function:
- Weighted MSE between predicted and true noise at each diffusion step.
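
A minimal sketch of such a loss, assuming one weight per diffusion step. The text does not specify the weighting, so both weight vectors below are illustrative; uniform weights recover the plain unweighted noise-prediction objective.

```python
import numpy as np

def weighted_noise_mse(eps_pred, eps_true, t, step_weights):
    """Weighted MSE between predicted and true noise.

    eps_pred, eps_true : (batch, length, features)
    t                  : (batch,) integer diffusion step of each sample
    step_weights       : (T,) non-negative weight per diffusion step
    """
    per_sample = np.mean((eps_pred - eps_true) ** 2, axis=(1, 2))
    w = step_weights[t]
    return float(np.sum(w * per_sample) / np.sum(w))

T = 1000
rng = np.random.default_rng(0)
eps_true = rng.normal(size=(64, 200, 3))
eps_pred = eps_true + 0.1 * rng.normal(size=eps_true.shape)  # stand-in for model output
t = rng.integers(0, T, size=64)

uniform = np.ones(T)                      # plain, unweighted objective
late_heavy = np.linspace(0.5, 1.5, T)     # hypothetical: emphasize noisier steps
print(weighted_noise_mse(eps_pred, eps_true, t, uniform))
print(weighted_noise_mse(eps_pred, eps_true, t, late_heavy))
```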

## 5. Key Functions to Implement

### Data Loader:
```python
def load_tick_data():
    """Reads raw ticks and converts to windowed sequences."""
    pass

def inject_microstructure_noise():
    """Adds synthetic bid-ask bounce, order flow noise."""
    pass
```

### Time Embeddings:
```python
def encode_timestamps():
    """Converts irregular timestamps to continuous embeddings."""
    pass
```

### Diffusion Utils:
```python
def beta_scheduler():
    """Defines the noise schedule (linear, cosine, etc.)."""
    pass

def q_sample():
    """Forward diffusion process (adding noise)."""
    pass
```
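
As a starting point, the two stubs above could be filled in roughly as follows. This is a sketch under the usual DDPM assumptions (Gaussian forward process, closed-form sampling of $x_t$ from $x_0$), written with NumPy rather than a deep learning framework. The signatures, including the extra `alpha_bar` and `rng` arguments, are assumptions.

```python
import numpy as np

def beta_scheduler(T=1000, beta_start=1e-4, beta_end=0.02, kind="linear"):
    """Return the noise schedule beta_1..beta_T as an array of shape (T,)."""
    if kind == "linear":
        return np.linspace(beta_start, beta_end, T)
    if kind == "cosine":
        # Cosine schedule defined through alpha_bar; beta_start/beta_end are unused
        s = 0.008
        steps = np.arange(T + 1) / T
        f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
        return np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
    raise ValueError(f"unknown schedule: {kind}")

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    # Broadcast the per-sample coefficient over the trailing dimensions
    a = alpha_bar[t].reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = beta_scheduler()
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 200, 3))          # batch of normalized windows
t = rng.integers(0, len(betas), size=64)    # one diffusion step per sample
x_t, eps = q_sample(x0, t, alpha_bar, rng)
print(x_t.shape)
```

Returning `eps` alongside `x_t` is convenient because the training loss compares the model's prediction against exactly that noise draw.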

### Model:
```python
class ConditionalScoreModel:
    """Modified CSDI backbone (transformer/TCN)."""
    pass

def train_step():
    """Computes loss and updates weights."""
    pass
```

## 6. Training Process

### Hyperparameters:
- Diffusion steps (`T=1000`), learning rate (`1e-4`), batch size (`64`).
- Noise schedule (e.g., `beta_start=1e-4`, `beta_end=0.02`).

### Training Loop:
For each batch:
1. Generate noisy data via `inject_microstructure_noise()`.
2. Compute time embeddings for irregular timestamps.
3. **Forward pass:** Corrupt noisy data with diffusion.
4. **Reverse pass:** Predict denoised data.
5. Update model weights via gradient descent.

### Checkpointing:
- Save model weights periodically (e.g., every epoch).
- Track validation loss on a held-out tick dataset.

## 7. Validation & Testing

### Metrics:
- **Reconstruction Loss:** MSE between denoised and clean data (if synthetic).
- **Volatility Consistency:** Compare realized volatility of raw vs. denoised data.
- **Microstructure Preservation:** Autocorrelation of trade signs.
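
The last two metrics can be sketched with NumPy alone. The data set carries no explicit trade signs, so the example assumes a simple tick-rule-style proxy (the sign of each non-zero price change). The function names and the synthetic series are illustrative.

```python
import numpy as np

def realized_volatility(prices):
    """Square root of the sum of squared log-returns."""
    log_returns = np.diff(np.log(prices))
    return float(np.sqrt(np.sum(log_returns ** 2)))

def trade_sign_autocorrelation(prices, lag=1):
    """Autocorrelation of proxy trade signs (sign of non-zero price changes)."""
    signs = np.sign(np.diff(prices))
    signs = signs[signs != 0]
    return float(np.corrcoef(signs[:-lag], signs[lag:])[0, 1])

rng = np.random.default_rng(0)
clean = 694.0 * np.exp(np.cumsum(rng.normal(0.0, 1e-5, 5000)))
raw = clean + 0.01 * rng.choice([-1.0, 1.0], size=clean.shape)  # add bid-ask bounce

rv_raw, rv_clean = realized_volatility(raw), realized_volatility(clean)
ac_raw, ac_clean = trade_sign_autocorrelation(raw), trade_sign_autocorrelation(clean)
print(f"RV raw: {rv_raw:.6f}, RV clean: {rv_clean:.6f}")
print(f"sign autocorr raw: {ac_raw:.3f}, clean: {ac_clean:.3f}")
```

On this synthetic series the raw realized volatility is inflated by the bounce, and the raw sign autocorrelation is negative. A good denoiser should pull realized volatility back toward the clean level.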

### Visual Checks:
- Plot raw vs. denoised ticks (ensure no new timestamps are added).

## 8. Deployment Pipeline

### Preprocessing:
- Normalize new tick data using the same scaler from training.
- Encode timestamps.

### Inference:
- Run the trained CSDI model in reverse diffusion mode with `mask=1` (all positions observed).

### Postprocessing:
- Inverse-transform normalized denoised data.
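
The normalize and inverse-transform steps can be sketched as below. `TickScaler` is a hypothetical helper, not part of the notebook; the point is that its statistics are fitted once on training data and reused unchanged at inference.

```python
import numpy as np

class TickScaler:
    """Minimal per-feature standardizer (hypothetical helper)."""

    def fit(self, x):
        # x has shape (n_ticks, n_features); statistics come from training data only
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0)
        self.std_[self.std_ == 0] = 1.0     # avoid division by zero for constant features
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

    def inverse_transform(self, z):
        return z * self.std_ + self.mean_

rng = np.random.default_rng(0)
train_ticks = rng.normal(loc=[694.0, 50.0], scale=[1.0, 30.0], size=(1000, 2))
new_ticks = rng.normal(loc=[695.0, 45.0], scale=[1.0, 30.0], size=(10, 2))

scaler = TickScaler().fit(train_ticks)           # fitted once, at training time
z = scaler.transform(new_ticks)                  # preprocessing at inference
denoised_z = z                                   # placeholder for the model's output
restored = scaler.inverse_transform(denoised_z)  # postprocessing
print(np.allclose(restored, new_ticks))
```

Refitting the scaler on new data would shift the inputs away from the distribution the model was trained on, which is why the fitted statistics should be saved alongside the model weights.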