<a href="https://github.com/timeseriesAI/tsai-rs" target="_parent"><img src="https://img.shields.io/badge/tsai--rs-Time%20Series%20AI%20in%20Rust-blue" alt="tsai-rs"/></a>

# Time Series Data Preparation

This notebook demonstrates how to prepare your time series data for use with **tsai-rs**.

## Required Input Shape

tsai-rs expects its input as a 3D NumPy array whose axes are, in order:

- **samples**: Number of time series samples
- **variables**: Number of features/channels/dimensions  
- **length**: Number of time steps

Shape: `(samples, variables, length)`
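
As a quick illustration of the axis order above, here is a minimal sketch of a shape check in plain NumPy. The helper name `describe_shape` is hypothetical and is not part of tsai-rs.

```python
import numpy as np

def describe_shape(X):
    """Return (samples, variables, length), or raise if X is not 3D."""
    X = np.asarray(X)
    if X.ndim != 3:
        raise ValueError(f"Expected a 3D array, got {X.ndim}D with shape {X.shape}")
    n_samples, n_vars, seq_len = X.shape
    return n_samples, n_vars, seq_len

# 8 samples, 3 variables, 24 time steps
X_demo = np.zeros((8, 3, 24), dtype=np.float32)
print(describe_shape(X_demo))  # (8, 3, 24)
```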

## Install tsai-rs

```bash
cd crates/tsai_python
maturin develop --release
```

## Import Libraries

In [None]:
import tsai_rs
import numpy as np
import pandas as pd

print(f"tsai-rs version: {tsai_rs.version()}")
tsai_rs.my_setup()

## UCR/UEA Data

The easiest case is a UCR/UEA dataset: the loader returns arrays that are already in the `(samples, variables, length)` layout.

In [None]:
# Load a multivariate dataset
dsid = 'NATOPS'
X, y, splits = tsai_rs.get_UCR_data(dsid, return_split=False)

print(f"Dataset: {dsid}")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Splits: train={len(splits[0])}, valid={len(splits[1])}")

In [None]:
# Or get pre-split data
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")

## Converting 2D Data to 3D

If your univariate series are stored as a 2D array of shape `(samples, length)`, add a variables axis of size 1 to get `(samples, 1, length)`.

In [None]:
# Example: 2D data (samples, length)
X_2d = np.random.randn(100, 50)  # 100 samples, 50 time steps
print(f"Original 2D shape: {X_2d.shape}")

# Convert to 3D (samples, 1, length)
X_3d = X_2d[:, np.newaxis, :]
print(f"Converted 3D shape: {X_3d.shape}")
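
The indexing trick above is one of several equivalent spellings. The sketch below compares three of them using NumPy only, and also covers a single 1D series, which needs both a samples axis and a variables axis.

```python
import numpy as np

X_2d = np.random.randn(100, 50)  # 100 samples, 50 time steps

# Three equivalent ways to insert the variables axis at position 1
a = X_2d[:, np.newaxis, :]
b = np.expand_dims(X_2d, axis=1)
c = X_2d.reshape(X_2d.shape[0], 1, X_2d.shape[1])
print(a.shape, b.shape, c.shape)  # (100, 1, 50) (100, 1, 50) (100, 1, 50)

# A single 1D series becomes one sample with one variable
series = np.random.randn(50)
X_single = series[np.newaxis, np.newaxis, :]
print(X_single.shape)  # (1, 1, 50)
```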

## Converting from a pandas DataFrame

In [None]:
# Example: Wide format DataFrame
# Each row is a sample, each column is a time step
n_samples = 50
n_steps = 100

df = pd.DataFrame(
    np.random.randn(n_samples, n_steps),
    columns=[f't_{i}' for i in range(n_steps)]
)
df['label'] = np.random.randint(0, 3, n_samples)

print(f"DataFrame shape: {df.shape}")
print(df.head())

In [None]:
# Convert to numpy array
y = df['label'].values
X = df.drop('label', axis=1).values

# Reshape to 3D
X = X[:, np.newaxis, :]  # Add channel dimension

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
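
Real DataFrames often carry more than one non-series column, in which case dropping only `label` would leave metadata inside `X`. The sketch below selects the time-step columns explicitly and casts to `float32` in the same step. The `subject` column and the `t_` prefix convention are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

n_samples, n_steps = 50, 100
df = pd.DataFrame(
    np.random.randn(n_samples, n_steps),
    columns=[f"t_{i}" for i in range(n_steps)],
)
df["label"] = np.random.randint(0, 3, n_samples)
df["subject"] = "s01"  # hypothetical metadata column, not part of the series

# Keep only the time-step columns, identified here by their 't_' prefix
time_cols = [c for c in df.columns if c.startswith("t_")]
X = df[time_cols].to_numpy(dtype=np.float32)[:, np.newaxis, :]
y = df["label"].to_numpy()

print(X.shape, X.dtype)  # (50, 1, 100) float32
```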

## Multivariate Data from Multiple DataFrames

In [None]:
# Example: Separate DataFrames for each variable
n_samples = 50
n_steps = 100
n_vars = 3

# Create sample DataFrames for each variable
dfs = [pd.DataFrame(np.random.randn(n_samples, n_steps)) for _ in range(n_vars)]

# Stack into 3D array
X = np.stack([df.values for df in dfs], axis=1)

print(f"Number of variables: {n_vars}")
print(f"X shape: {X.shape}  # (samples, variables, length)")
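
Stacking only gives a meaningful result when every DataFrame describes the same samples in the same row order. A minimal sketch of that sanity check follows. The variable names `temp`, `pressure`, and `humidity` are hypothetical.

```python
import numpy as np
import pandas as pd

n_samples, n_steps = 50, 100
# Hypothetical variable names, used only for illustration
frames = {
    name: pd.DataFrame(np.random.randn(n_samples, n_steps))
    for name in ["temp", "pressure", "humidity"]
}

# Verify that every DataFrame has the same shape and the same row index
reference = next(iter(frames.values()))
for name, frame in frames.items():
    if frame.shape != reference.shape:
        raise ValueError(f"{name}: shape {frame.shape} != {reference.shape}")
    if not frame.index.equals(reference.index):
        raise ValueError(f"{name}: row index does not match the first DataFrame")

# One DataFrame per variable, stacked along the variables axis
X = np.stack([frame.to_numpy() for frame in frames.values()], axis=1)
print(X.shape)  # (50, 3, 100)
```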

## Long Format to Wide Format

In [None]:
# Example: Long format DataFrame
n_samples = 20
n_steps = 50

long_df = pd.DataFrame({
    'sample_id': np.repeat(range(n_samples), n_steps),
    'time_step': np.tile(range(n_steps), n_samples),
    'value': np.random.randn(n_samples * n_steps),
    'label': np.repeat(np.random.randint(0, 3, n_samples), n_steps)
})

print(f"Long format shape: {long_df.shape}")
print(long_df.head(10))

In [None]:
# Pivot to wide format
wide_df = long_df.pivot(index='sample_id', columns='time_step', values='value')

# Convert to 3D array
X = wide_df.values[:, np.newaxis, :]
y = long_df.groupby('sample_id')['label'].first().values

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
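
The same pivot extends to multivariate long-format data by pivoting each value column separately and stacking the results. This is a sketch under the assumption that every sample has a row for every time step. The column names `var_a` and `var_b` are hypothetical.

```python
import numpy as np
import pandas as pd

n_samples, n_steps = 20, 50
long_df = pd.DataFrame({
    "sample_id": np.repeat(np.arange(n_samples), n_steps),
    "time_step": np.tile(np.arange(n_steps), n_samples),
    "var_a": np.random.randn(n_samples * n_steps),  # hypothetical variable
    "var_b": np.random.randn(n_samples * n_steps),  # hypothetical variable
    "label": np.repeat(np.random.randint(0, 3, n_samples), n_steps),
})

# Pivot each variable to (samples, length), then stack on the variables axis
value_cols = ["var_a", "var_b"]
X = np.stack(
    [
        long_df.pivot(index="sample_id", columns="time_step", values=col).to_numpy()
        for col in value_cols
    ],
    axis=1,
)
y = long_df.groupby("sample_id")["label"].first().to_numpy()

# Missing (sample_id, time_step) pairs would show up as NaN after the pivot
print(X.shape, y.shape, np.isnan(X).any())  # (20, 2, 50) (20,) False
```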

## Handling Variable Length Sequences

In [None]:
# Sequences of different lengths must be padded or truncated to a common length before stacking
sequences = [
    np.random.randn(50),
    np.random.randn(75),
    np.random.randn(60),
    np.random.randn(45)
]

print("Original lengths:", [len(s) for s in sequences])

In [None]:
# Option 1: Pad to max length
max_len = max(len(s) for s in sequences)

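def pad_sequence(seq, max_len, pad_value=0.0):
    # Allocate a float buffer explicitly: np.full with an integer fill value
    # creates an integer array, which would truncate the float samples.
    padded = np.full(max_len, pad_value, dtype=np.float32)
    padded[:len(seq)] = seq
    return padded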

X_padded = np.stack([pad_sequence(s, max_len) for s in sequences])
X_padded = X_padded[:, np.newaxis, :]  # Add channel dimension

print(f"Padded X shape: {X_padded.shape}")

In [None]:
# Option 2: Truncate to min length
min_len = min(len(s) for s in sequences)

X_truncated = np.stack([s[:min_len] for s in sequences])
X_truncated = X_truncated[:, np.newaxis, :]  # Add channel dimension

print(f"Truncated X shape: {X_truncated.shape}")
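
Both options lose information: truncation discards data, and zero padding makes padded steps indistinguishable from genuine zeros. One possible workaround, sketched below in plain NumPy, is to keep the original lengths and a boolean mask next to the padded array. Whether tsai-rs can consume such a mask is not covered here, so treat it as bookkeeping for your own pipeline.

```python
import numpy as np

sequences = [np.random.randn(n) for n in (50, 75, 60, 45)]
lengths = np.array([len(s) for s in sequences])
max_len = lengths.max()

# Zero-padded array of shape (samples, 1, max_len)
X_padded = np.zeros((len(sequences), 1, max_len), dtype=np.float32)
for i, seq in enumerate(sequences):
    X_padded[i, 0, :len(seq)] = seq

# mask[i, t] is True where X_padded[i, 0, t] holds real data
mask = np.arange(max_len)[np.newaxis, :] < lengths[:, np.newaxis]

print(X_padded.shape, mask.shape)  # (4, 1, 75) (4, 75)
print(mask.sum(axis=1))            # [50 75 60 45]
```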

## Creating Train/Validation Splits

In [None]:
# Create synthetic data
n_samples = 200
n_vars = 5
seq_len = 100
n_classes = 3

X = np.random.randn(n_samples, n_vars, seq_len).astype(np.float32)
y = np.random.randint(0, n_classes, n_samples)

print(f"Total samples: {n_samples}")
print(f"X shape: {X.shape}")

In [None]:
# Random split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train labels: {np.bincount(y_train)}")
print(f"Test labels: {np.bincount(y_test)}")
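
If you would rather keep a single array and pass index arrays around, like the `splits` value in the UCR example, a NumPy-only sketch looks like this. It is not stratified, and the assumption that a tuple of index arrays is the layout you need is mine, not something this notebook confirms.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200
X = rng.standard_normal((n_samples, 5, 100)).astype(np.float32)
y = rng.integers(0, 3, n_samples)

# Shuffle the sample indices once, then cut at 80%
indices = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)
train_idx, valid_idx = indices[:n_train], indices[n_train:]
splits = (train_idx, valid_idx)

X_train, y_train = X[splits[0]], y[splits[0]]
X_valid, y_valid = X[splits[1]], y[splits[1]]
print(X_train.shape, X_valid.shape)  # (160, 5, 100) (40, 5, 100)
```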

## Using with tsai-rs

In [None]:
# Standardize
X_train_std = tsai_rs.ts_standardize(X_train, by_sample=True)
X_test_std = tsai_rs.ts_standardize(X_test, by_sample=True)

# Create datasets
train_ds = tsai_rs.TSDataset(X_train_std, y_train)
test_ds = tsai_rs.TSDataset(X_test_std, y_test)

print(f"Train dataset: {train_ds}")
print(f"Test dataset: {test_ds}")
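
For intuition, per-sample standardization can be written in a few lines of NumPy. This is a sketch of the general technique, not the tsai-rs implementation: the choice to compute statistics over the variables and length axes, the `eps` term, and the helper name are all assumptions, and `tsai_rs.ts_standardize` may differ in detail.

```python
import numpy as np

def standardize_per_sample(X, eps=1e-8):
    """Zero-mean, unit-variance scaling computed independently for each sample."""
    # Assumption: statistics are taken over the variables and length axes
    mean = X.mean(axis=(1, 2), keepdims=True)
    std = X.std(axis=(1, 2), keepdims=True)
    # eps guards against division by zero for constant samples
    return (X - mean) / (std + eps)

X = (np.random.randn(10, 3, 50) * 5 + 2).astype(np.float32)
X_std = standardize_per_sample(X)
print(X_std.shape, X_std.dtype)  # (10, 3, 50) float32
```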

In [None]:
# Configure model
n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

config = tsai_rs.InceptionTimePlusConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes
)

print(f"Model config: {config}")

## Data Quality Checks

In [None]:
def check_data_quality(X, y, name="Data"):
    """Check data quality before training."""
    print(f"\n{name} Quality Check")
    print("=" * 40)
    
    # Shape
    print(f"Shape: {X.shape}")
    assert len(X.shape) == 3, f"Expected 3D array, got {len(X.shape)}D"
    
    # NaN/Inf
    nan_count = np.isnan(X).sum()
    inf_count = np.isinf(X).sum()
    print(f"NaN values: {nan_count}")
    print(f"Inf values: {inf_count}")
    
    # Statistics
    print(f"Mean: {X.mean():.4f}, Std: {X.std():.4f}")
    print(f"Min: {X.min():.4f}, Max: {X.max():.4f}")
    
    # Labels
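    # np.unique with return_counts also handles string and negative labels,
    # which np.bincount(y.astype(int)) would reject
    unique_labels, label_counts = np.unique(y, return_counts=True)
    print(f"Labels: {unique_labels}")
    print(f"Label distribution: {label_counts}")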
    
    return nan_count == 0 and inf_count == 0

# Check your data
is_valid = check_data_quality(X_train, y_train, "Train")
print(f"\nData is valid: {is_valid}")
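
If the check reports NaN values, they have to be removed or filled before training. One option among several is linear interpolation along the time axis, sketched below with NumPy only. The helper name `fill_nan_linear` is hypothetical, and whether interpolation suits your data is a modelling decision.

```python
import numpy as np

def fill_nan_linear(X):
    """Linearly interpolate NaNs along the time axis of a 3D array."""
    X = X.copy()
    steps = np.arange(X.shape[-1])
    for i in range(X.shape[0]):
        for v in range(X.shape[1]):
            series = X[i, v]  # view into X, so writes update the copy
            missing = np.isnan(series)
            # Series with no valid points are left untouched
            if missing.any() and not missing.all():
                series[missing] = np.interp(
                    steps[missing], steps[~missing], series[~missing]
                )
    return X

X = np.arange(12, dtype=np.float32).reshape(1, 2, 6)
X[0, 0, 2] = np.nan  # interior gap: filled from its two neighbours
X[0, 1, 0] = np.nan  # leading gap: np.interp repeats the first valid value
X_filled = fill_nan_linear(X)
print(np.isnan(X_filled).sum())  # 0
```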

## Summary

### Data Requirements
1. **Shape**: `(samples, variables, length)`
2. **Dtype**: `float32` for X, `int` or `float` for y
3. **No NaN/Inf values**
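
Labels that arrive as strings do not meet the dtype requirement for `y` and need to be encoded first. A minimal sketch with `np.unique` follows. The example label values are hypothetical.

```python
import numpy as np

y_raw = np.array(["walk", "run", "walk", "sit", "run"])  # hypothetical string labels

# classes holds the sorted unique labels; y_encoded indexes into it
classes, y_encoded = np.unique(y_raw, return_inverse=True)
print(classes)    # ['run' 'sit' 'walk']
print(y_encoded)  # [2 0 2 1 0]

# Map predictions back to the original labels with classes[y_encoded]
```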

### Common Conversions
| From | To | Code |
|------|-----|------|
| 2D (samples, length) | 3D | `X[:, np.newaxis, :]` |
| DataFrame (wide) | 3D | `df.values[:, np.newaxis, :]` |
| Multiple variables | 3D | `np.stack([v1, v2, v3], axis=1)` |
| Variable length | Fixed | Pad or truncate |

In [None]:
# Quick reference
print("Data Preparation Quick Reference")
print("=" * 50)
print("\n# Required shape")
print("X.shape = (samples, variables, length)")
print("\n# 2D to 3D")
print("X_3d = X_2d[:, np.newaxis, :]")
print("\n# Standardize")
print("X_std = tsai_rs.ts_standardize(X.astype(np.float32), by_sample=True)")
print("\n# Create dataset")
print("ds = tsai_rs.TSDataset(X_std, y)")