# VizFlow Demo - Basic Usage

This notebook demonstrates the basic workflow of VizFlow for analyzing TB-scale financial market data.

## 1. Setup

Import VizFlow and Polars, then configure the global settings.

In [1]:
import polars as pl
import vizflow as vf

# Configure VizFlow once at the top of your notebook
config = vf.Config(
    market="CN",
    input_dir="data/raw",
    output_dir="data/processed",
    columns={
        "timestamp": "ticktime",
        "price": "close",
        "volume": "vol",
        "symbol": "ukey"
    },
    binwidths={
        "alpha": 1e-4,
        "return": 1e-4
    }
)

# Set global config - now all VizFlow functions will use these settings
vf.set_config(config)

## 2. Load Sample Data

Create sample tick data for demonstration. In production, you'd use:
```python
df = pl.scan_parquet("data/raw/20241201.parquet")
```

In [2]:
# Sample tick data - CN market trading hours
df = pl.DataFrame({
    "ukey": ["600000", "600000", "600000", "000001", "000001", "000001"],
    "ticktime": [
        93000000,  # 09:30:00.000 (market open)
        93012145,  # 09:30:12.145
        100000000, # 10:00:00.000
        130000000, # 13:00:00.000 (afternoon open)
        142058425, # 14:20:58.425
        150000000  # 15:00:00.000 (market close)
    ],
    "close": [10.52, 10.53, 10.51, 15.82, 15.85, 15.83],
    "vol": [1000, 1500, 2000, 3000, 2500, 1800],
    "alpha": [0.00012, 0.00025, -0.00018, 0.00032, 0.00015, -0.00008]
}).lazy()

print("Sample data loaded (LazyFrame - not yet computed)")

Sample data loaded (LazyFrame - not yet computed)


## 3. Parse Timestamps

Convert integer timestamps in HHMMSSMMM format (for example, `93012145` is 09:30:12.145) to:
- **tod_ticktime**: Time of day (`pl.Time`), which can be plotted directly
- **elapsed_ticktime**: Milliseconds of trading time since market open (`pl.Int64`)

In [3]:
# Parse timestamps - no need to pass market parameter!
df = vf.parse_time(df, timestamp_col="ticktime")

# Collect and display
result = df.collect()
print("\nTimestamp columns added:")
print(result.select(["ticktime", "tod_ticktime", "elapsed_ticktime"]))


Timestamp columns added:
shape: (6, 3)
┌───────────┬──────────────┬──────────────────┐
│ ticktime  ┆ tod_ticktime ┆ elapsed_ticktime │
│ ---       ┆ ---          ┆ ---              │
│ i64       ┆ time         ┆ i64              │
╞═══════════╪══════════════╪══════════════════╡
│ 93000000  ┆ 09:30:00     ┆ 0                │
│ 93012145  ┆ 09:30:12.145 ┆ 12145            │
│ 100000000 ┆ 10:00:00     ┆ 1800000          │
│ 130000000 ┆ 13:00:00     ┆ 7200000          │
│ 142058425 ┆ 14:20:58.425 ┆ 12058425         │
│ 150000000 ┆ 15:00:00     ┆ 14400000         │
└───────────┴──────────────┴──────────────────┘


**Key observations:**
- Morning open: 09:30:00 → elapsed = 0 ms
- Afternoon open: 13:00:00 → elapsed = 7,200,000 ms, the length of the 2-hour morning session (09:30 to 11:30); the lunch break is not counted
- `tod_ticktime` is a `pl.Time` column, so it can be used directly as the x-axis of a plot
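
The arithmetic behind these columns can be sketched in plain Python. This is an illustration only, not VizFlow's implementation: the function names are hypothetical, the session boundaries (09:30 to 11:30 and 13:00 to 15:00) are hard-coded for the CN market, and clamping lunch-break ticks to the end of the morning session is an assumption.

```python
from datetime import time

# Session boundaries in milliseconds since midnight (CN market, assumed fixed).
MORNING_OPEN_MS = (9 * 60 + 30) * 60_000     # 09:30:00.000
MORNING_CLOSE_MS = (11 * 60 + 30) * 60_000   # 11:30:00.000
AFTERNOON_OPEN_MS = 13 * 60 * 60_000         # 13:00:00.000
MORNING_LENGTH_MS = MORNING_CLOSE_MS - MORNING_OPEN_MS


def split_hhmmssmmm(ts: int) -> tuple[int, int, int, int]:
    """Split an HHMMSSMMM integer into (hours, minutes, seconds, millis)."""
    hours, rest = divmod(ts, 10_000_000)
    minutes, rest = divmod(rest, 100_000)
    seconds, millis = divmod(rest, 1_000)
    return hours, minutes, seconds, millis


def time_of_day(ts: int) -> time:
    """Time of day as a datetime.time (microsecond resolution)."""
    hours, minutes, seconds, millis = split_hhmmssmmm(ts)
    return time(hours, minutes, seconds, millis * 1_000)


def elapsed_ms(ts: int) -> int:
    """Milliseconds of trading time since the morning open, skipping lunch."""
    hours, minutes, seconds, millis = split_hhmmssmmm(ts)
    since_midnight = ((hours * 60 + minutes) * 60 + seconds) * 1_000 + millis
    if since_midnight < AFTERNOON_OPEN_MS:
        # Assumption: ticks stamped during the lunch break are clamped
        # to the end of the morning session.
        return min(since_midnight - MORNING_OPEN_MS, MORNING_LENGTH_MS)
    return MORNING_LENGTH_MS + (since_midnight - AFTERNOON_OPEN_MS)


for ts in (93000000, 100000000, 130000000, 142058425, 150000000):
    print(ts, time_of_day(ts), elapsed_ms(ts))
```

The elapsed values printed by the loop match the `elapsed_ticktime` column in the table above.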

## 4. Discretize Values (Binning)

Round values to bins for aggregation and analysis.

In [4]:
# Bin alpha values to 0.0001 increments
df = vf.bin(df, widths={"alpha": 1e-4})

result = df.collect()
print("\nAlpha values binned:")
print(result.select(["ukey", "alpha", "alpha_bin", "tod_ticktime"]))


Alpha values binned:
shape: (6, 4)
┌────────┬──────────┬───────────┬──────────────┐
│ ukey   ┆ alpha    ┆ alpha_bin ┆ tod_ticktime │
│ ---    ┆ ---      ┆ ---       ┆ ---          │
│ str    ┆ f64      ┆ i64       ┆ time         │
╞════════╪══════════╪═══════════╪══════════════╡
│ 600000 ┆ 0.00012  ┆ 1         ┆ 09:30:00     │
│ 600000 ┆ 0.00025  ┆ 2         ┆ 09:30:12.145 │
│ 600000 ┆ -0.00018 ┆ -2        ┆ 10:00:00     │
│ 000001 ┆ 0.00032  ┆ 3         ┆ 13:00:00     │
│ 000001 ┆ 0.00015  ┆ 1         ┆ 14:20:58.425 │
│ 000001 ┆ -0.00008 ┆ -1        ┆ 15:00:00     │
└────────┴──────────┴───────────┴──────────────┘


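**Binning logic:**
- `alpha_bin = round(alpha / binwidth)`
- Example: alpha=0.00012 → bin=1 (because 0.00012/0.0001 ≈ 1.2 rounds to 1)
- Values halfway between two bins (0.00015 → 1 and 0.00025 → 2 in the table above) depend on floating-point representation and the rounding rule, so do not rely on a particular tie direction
- Useful for grouping similar values together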
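
The formula can be tried out in plain Python. This is a minimal sketch rather than VizFlow's code: `bin_index` and `bin_center` are hypothetical helper names, and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rule VizFlow uses. The sample values deliberately avoid ties.

```python
from collections import defaultdict


def bin_index(value: float, width: float) -> int:
    """Nearest bin index, following alpha_bin = round(alpha / binwidth)."""
    return round(value / width)


def bin_center(index: int, width: float) -> float:
    """Representative value of a bin, useful for labelling plot axes."""
    return index * width


# Group a few alpha values by their bin index.
alphas = [0.00012, -0.00018, 0.00032, -0.00008, 0.00009]
groups = defaultdict(list)
for alpha in alphas:
    groups[bin_index(alpha, 1e-4)].append(alpha)

for index in sorted(groups):
    print(index, groups[index])
```

Here 0.00012 and 0.00009 both land in bin 1, which is what makes the bin index usable as a group-by key.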

## 5. Aggregate Data

Group by bins and calculate metrics using Polars expressions.

In [5]:
# Define aggregation metrics
metrics = {
    "count": pl.len(),
    "total_volume": pl.col("vol").sum(),
    "avg_price": pl.col("close").mean(),
    "vwap": (pl.col("close") * pl.col("vol")).sum() / pl.col("vol").sum()
}

# Aggregate by symbol and alpha bin
agg_df = vf.aggregate(
    df,
    group_by=["ukey", "alpha_bin"],
    metrics=metrics
)

result = agg_df.collect().sort(["ukey", "alpha_bin"])
print("\nAggregated results:")
print(result)


Aggregated results:
shape: (6, 6)
┌────────┬───────────┬───────┬──────────────┬───────────┬───────┐
│ ukey   ┆ alpha_bin ┆ count ┆ total_volume ┆ avg_price ┆ vwap  │
│ ---    ┆ ---       ┆ ---   ┆ ---          ┆ ---       ┆ ---   │
│ str    ┆ i64       ┆ u32   ┆ i64          ┆ f64       ┆ f64   │
╞════════╪═══════════╪═══════╪══════════════╪═══════════╪═══════╡
│ 000001 ┆ -1        ┆ 1     ┆ 1800         ┆ 15.83     ┆ 15.83 │
│ 000001 ┆ 1         ┆ 1     ┆ 2500         ┆ 15.85     ┆ 15.85 │
│ 000001 ┆ 3         ┆ 1     ┆ 3000         ┆ 15.82     ┆ 15.82 │
│ 600000 ┆ -2        ┆ 1     ┆ 2000         ┆ 10.51     ┆ 10.51 │
│ 600000 ┆ 1         ┆ 1     ┆ 1000         ┆ 10.52     ┆ 10.52 │
│ 600000 ┆ 2         ┆ 1     ┆ 1500         ┆ 10.53     ┆ 10.53 │
└────────┴───────────┴───────┴──────────────┴───────────┴───────┘
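
In the table above every group contains a single tick, so `vwap` equals `avg_price`. The two diverge once a group holds several ticks. The following plain-Python sketch recomputes both metrics for the three `600000` ticks from the sample data, treated as one group; it mirrors the arithmetic of the Polars expressions and is not part of VizFlow.

```python
# The three sample ticks for symbol 600000, grouped together.
ticks = [
    {"close": 10.52, "vol": 1000},
    {"close": 10.53, "vol": 1500},
    {"close": 10.51, "vol": 2000},
]

total_volume = sum(t["vol"] for t in ticks)

# Simple mean: every tick counts equally.
avg_price = sum(t["close"] for t in ticks) / len(ticks)

# Volume-weighted average price: each price is weighted by its traded volume.
vwap = sum(t["close"] * t["vol"] for t in ticks) / total_volume

print(f"total_volume={total_volume} avg_price={avg_price:.4f} vwap={vwap:.4f}")
```

The VWAP comes out slightly below the simple mean because the largest trade (2000 shares) occurred at the lowest price.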


## 6. Complete Workflow Example

Putting it all together in a typical analysis pipeline.

In [6]:
# Typical workflow: Load → Parse → Filter → Bin → Aggregate
(
    pl.DataFrame({
        "ukey": ["600000"] * 100,
        # One tick per second for 100 seconds; the minute field rolls over after 59 s
        "ticktime": [93000000 + (i // 60) * 100000 + (i % 60) * 1000 for i in range(100)],
        "close": [10.50 + (i % 10) * 0.01 for i in range(100)],
        "vol": [1000 + (i % 20) * 100 for i in range(100)],
        "alpha": [(i % 30 - 15) * 0.00001 for i in range(100)]
    }).lazy()
    # Parse timestamps
    .pipe(vf.parse_time, timestamp_col="ticktime")
    # Filter to first minute only
    .filter(pl.col("elapsed_ticktime") < 60000)
    # Bin alpha
    .pipe(vf.bin, widths={"alpha": 1e-5})
    # Aggregate
    .pipe(vf.aggregate, 
          group_by=["alpha_bin"], 
          metrics={
              "count": pl.len(),
              "avg_price": pl.col("close").mean(),
              "total_volume": pl.col("vol").sum()
          })
    .collect()
    .sort("alpha_bin")
)

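┌───────────┬───────┬───────────┬──────────────┐
│ alpha_bin ┆ count ┆ avg_price ┆ total_volume │
│ ---       ┆ ---   ┆ ---       ┆ ---          │
│ i64       ┆ u32   ┆ f64       ┆ i64          │
╞═══════════╪═══════╪═══════════╪══════════════╡
│ -15       ┆ 2     ┆ 10.5      ┆ 3000         │
│ -14       ┆ 2     ┆ 10.51     ┆ 3200         │
│ -13       ┆ 2     ┆ 10.52     ┆ 3400         │
│ -12       ┆ 2     ┆ 10.53     ┆ 3600         │
│ -11       ┆ 2     ┆ 10.54     ┆ 3800         │
│ …         ┆ …     ┆ …         ┆ …            │
│ 10        ┆ 2     ┆ 10.55     ┆ 4000         │
│ 11        ┆ 2     ┆ 10.56     ┆ 4200         │
│ 12        ┆ 2     ┆ 10.57     ┆ 4400         │
│ 13        ┆ 2     ┆ 10.58     ┆ 4600         │
└───────────┴───────┴───────────┴──────────────┘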
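
A note on generating synthetic ticks like the ones in this section: an HHMMSSMMM integer is not a linear count of milliseconds, so adding a fixed step to it is only valid until the seconds or minutes field passes 59. One way to stay safe is to do the arithmetic in milliseconds since midnight and encode at the end. The helpers below are hypothetical and not part of VizFlow.

```python
def encode_hhmmssmmm(ms_since_midnight: int) -> int:
    """Encode milliseconds since midnight as an HHMMSSMMM integer."""
    hours, rest = divmod(ms_since_midnight, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return hours * 10_000_000 + minutes * 100_000 + seconds * 1_000 + millis


def synthetic_ticks(start_ms: int, step_ms: int, count: int) -> list[int]:
    """Evenly spaced HHMMSSMMM timestamps starting at start_ms."""
    return [encode_hhmmssmmm(start_ms + i * step_ms) for i in range(count)]


MARKET_OPEN_MS = (9 * 60 + 30) * 60_000  # 09:30:00.000

# One tick per second for 100 seconds, starting at the open.
per_second = synthetic_ticks(MARKET_OPEN_MS, 1_000, 100)
print(per_second[59], per_second[60])  # 93059000 93100000
```

The tick after 09:30:59.000 is encoded as `93100000` (09:31:00.000), not `93060000`.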


## 7. Using Time for Plotting

Because `tod_*` columns hold `pl.Time` values, they work well as the x-axis of intraday time-series plots.

In [7]:
# Example: prepare data for plotting
plot_data = (
    pl.DataFrame({
        # One tick per minute for 2 hours (09:30 to 11:29), encoded as HHMMSSMMM
        "ticktime": [
            (9 + (30 + i) // 60) * 10_000_000 + ((30 + i) % 60) * 100_000
            for i in range(120)
        ],
        "price": [10.50 + 0.1 * (i % 20) for i in range(120)]
    }).lazy()
    .pipe(vf.parse_time, timestamp_col="ticktime")
    .collect()
)

print("\nData ready for plotting:")
print(plot_data.head())
print("\nUse tod_ticktime as x-axis in your plots!")


Data ready for plotting:
shape: (5, 4)
┌──────────┬───────┬──────────────┬──────────────────┐
│ ticktime ┆ price ┆ tod_ticktime ┆ elapsed_ticktime │
│ ---      ┆ ---   ┆ ---          ┆ ---              │
│ i64      ┆ f64   ┆ time         ┆ i64              │
╞══════════╪═══════╪══════════════╪══════════════════╡
│ 93000000 ┆ 10.5  ┆ 09:30:00     ┆ 0                │
│ 93100000 ┆ 10.6  ┆ 09:31:00     ┆ 60000            │
│ 93200000 ┆ 10.7  ┆ 09:32:00     ┆ 120000           │
│ 93300000 ┆ 10.8  ┆ 09:33:00     ┆ 180000           │
│ 93400000 ┆ 10.9  ┆ 09:34:00     ┆ 240000           │
└──────────┴───────┴──────────────┴──────────────────┘

Use tod_ticktime as x-axis in your plots!


## Summary

**VizFlow workflow:**

1. **Configure once**: `vf.set_config(config)` at the top of your notebook
2. **Load data**: Use Polars `pl.scan_parquet()` for lazy loading
3. **Parse time**: `vf.parse_time(df)` adds `tod_*` and `elapsed_*` columns
4. **Bin values**: `vf.bin(df, widths={...})` for discretization
5. **Aggregate**: `vf.aggregate(df, group_by=[...], metrics={...})`
6. **Chain operations**: Use `.pipe()` for clean functional workflows

**Key benefits:**
- No repetitive parameter passing (config is global)
- Works seamlessly with Polars LazyFrame (TB-scale data)
- `tod_*` columns ready for plotting
- `elapsed_*` columns are integer milliseconds, which avoids floating-point rounding errors
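
The last point is easy to see with a small plain-Python illustration, independent of VizFlow: accumulating ten 100 ms steps as float seconds does not land exactly on one second, while the integer millisecond total is exact.

```python
# Ten ticks spaced 100 ms apart, tracked two ways.
elapsed_seconds_float = sum([0.1] * 10)  # float seconds
elapsed_ms_int = sum([100] * 10)         # integer milliseconds

print(elapsed_seconds_float == 1.0)  # False: 0.1 has no exact binary representation
print(elapsed_ms_int == 1000)        # True: integer arithmetic is exact
```

Exact integer values matter for filters such as `elapsed_ticktime < 60000` and for using elapsed time as a join or group-by key.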

**Next steps:**
- Try with your own market data
- Explore Phase 3 features (forward returns, calendar handling)
- Scale to multi-day batch processing with `run_batch()`