# tsam - Clustering Methods Showcase

This notebook demonstrates the main clustering methods and configuration options available in tsam.

## Available Methods

| Method | Description | Best For |
|--------|-------------|----------|
| `hierarchical` | Agglomerative hierarchical clustering | General purpose, recommended default |
| `kmeans` | K-means with centroids | Fast clustering, large datasets |
| `kmedoids` | K-medoids (MILP exact) | Optimal solution, smaller datasets (slow) |
| `kmaxoids` | Selects most dissimilar periods | Capturing extremes |
| `contiguous` | Hierarchical with temporal constraint | Storage modeling, seasonal patterns |
| `averaging` | Sequential period averaging | Simple baseline |

**Tip:** For medoid-based clustering on large datasets, use `hierarchical` with `representation="medoid"` instead of `kmedoids`.

## Key Configuration Options

| Option | Description |
|--------|-------------|
| `weights` | Per-column importance weights |
| `representation` | How to represent cluster centers (mean, medoid, maxoid, distribution, distribution_minmax) |
| `normalize_column_means` | Normalize columns to same mean before clustering |
| `use_duration_curves` | Match by value distribution rather than timing |

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import plotly.express as px
import plotly.io as pio

import tsam
from tsam import ClusterConfig

pio.renderers.default = "notebook"

## Input Data

The test dataset contains hourly time series for one year with four columns:
- **GHI**: Global Horizontal Irradiance (solar)
- **T**: Temperature
- **Wind**: Wind speed
- **Load**: Electrical load

In [None]:
raw = pd.read_csv("testdata.csv", index_col=0)
print(f"Shape: {raw.shape} ({raw.shape[0]} hours = {raw.shape[0] // 24} days)")
raw.head()
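
Before clustering, the hourly series is cut into candidate periods of `period_duration` hours each. The sketch below illustrates that reshaping step on synthetic data with NumPy. It is not tsam's internal code, and the names `hourly` and `periods` are made up for illustration.

```python
import numpy as np

period_duration = 24  # hours per period
n_days = 3

# Synthetic hourly series: one value per hour for three days
hourly = np.arange(n_days * period_duration, dtype=float)

# One row per day, one column per hour of the day
periods = hourly.reshape(-1, period_duration)

print(periods.shape)  # (3, 24)
```

Each row is then one candidate that the clustering step compares against the others.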

## 1. Hierarchical Clustering (Recommended Default)

Agglomerative hierarchical clustering builds a tree of clusters and cuts it at the desired number. It's the recommended default because it:
- Produces consistent results (deterministic)
- Works well with various representations
- Handles multivariate data effectively

In [None]:
result_hierarchical = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical"),
)
print(f"Accuracy: RMSE = {result_hierarchical.accuracy.rmse.mean():.4f}")
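
To make the "merge until the desired number of clusters remains" idea concrete, here is a deliberately naive agglomerative sketch in NumPy. It merges the two clusters with the closest mean profiles at each step, which is an assumption chosen for brevity and not necessarily the linkage tsam uses. The function `agglomerate` is hypothetical.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive agglomerative clustering on cluster means (illustrative only)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose mean profiles are closest
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(
                    points[clusters[a]].mean(axis=0) - points[clusters[b]].mean(axis=0)
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = np.array([[0.0], [0.1], [5.0], [5.2], [10.0]])
print(agglomerate(points, 3))  # [[0, 1], [2, 3], [4]]
```

There is no random initialization anywhere in the procedure, which is why repeated runs on the same data give the same clusters.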

## 2. K-Means Clustering

K-means is fast and widely used. It computes cluster centroids (averages), which may not correspond to actual periods in the data.

In [None]:
result_kmeans = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmeans"),
)
print(f"Accuracy: RMSE = {result_kmeans.accuracy.rmse.mean():.4f}")

## 3. K-Medoids-like Clustering

K-medoids selects actual periods as cluster centers (medoids) rather than computing averages. This preserves realistic patterns.

**Note:** The true `kmedoids` method uses an exact MILP solver which can be slow for large datasets. For most use cases, `hierarchical` with `representation="medoid"` gives similar results much faster.

In [None]:
# Use hierarchical with medoid representation (fast alternative to kmedoids)
result_kmedoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical", representation="medoid"),
)
print(f"Accuracy: RMSE = {result_kmedoids.accuracy.rmse.mean():.4f}")
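
The difference between a centroid and a medoid can be shown on a single hand-made cluster. The sketch below uses NumPy only, and the array `cluster` is hypothetical data, not taken from the test dataset.

```python
import numpy as np

# Three hypothetical periods (rows) assigned to one cluster, four time steps each
cluster = np.array([
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 2.0, 4.0, 2.0],
    [0.0, 6.0, 9.0, 6.0],
])

# Centroid: element-wise mean, usually not an observed period
centroid = cluster.mean(axis=0)

# Medoid: the member with the smallest total distance to all other members
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print(centroid)  # [0. 3. 5. 3.]
print(medoid)    # [0. 2. 4. 2.]
```

The centroid matches none of the three rows, while the medoid is the second row exactly as it was observed.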

## 4. K-Maxoids Clustering

K-maxoids selects the most dissimilar periods as cluster centers. This is useful for capturing extreme conditions.

**Note:** We set `preserve_column_means=False` below because mean preservation adjusts typical period values to match the original data's mean. For k-maxoids, where the goal is to preserve extreme values, this would diminish the very extremes we're trying to capture. Use `preserve_column_means=True` (default) when mean preservation is more important than extreme value preservation.

In [None]:
result_kmaxoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmaxoids"),
    preserve_column_means=False,  # Skip mean rescaling so extreme values are kept
)
print(f"Accuracy: RMSE = {result_kmaxoids.accuracy.rmse.mean():.4f}")
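
The effect described in the note can be reproduced with a toy calculation. The sketch below assumes mean preservation is a single scaling factor applied to all representatives. This is a simplification for illustration, and tsam's actual rescaling may differ in detail. All arrays are hypothetical.

```python
import numpy as np

# Four hypothetical periods of three time steps; the last one is extreme
original = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 2.0, 1.0],
    [2.0, 3.0, 2.0],
    [2.0, 8.0, 2.0],
])

# Two representatives picked from the data and the number of periods each stands for
representatives = np.array([[2.0, 3.0, 2.0], [2.0, 8.0, 2.0]])
counts = np.array([3, 1])

original_mean = original.mean()
weighted_mean = (representatives * counts[:, None]).sum() / (
    counts.sum() * representatives.shape[1]
)

# Assumed rescaling: one factor that restores the original mean
scale = original_mean / weighted_mean
rescaled = representatives * scale

print(f"{scale:.3f}")           # 0.818
print(f"{rescaled.max():.2f}")  # 6.55, down from the observed peak of 8.00
```

Restoring the mean pulls the peak from 8.00 down to about 6.55, which is the loss of extremes the note warns about.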

## 5. Contiguous Clustering

Contiguous clustering enforces temporal continuity: only neighboring periods can be merged, so each cluster covers a consecutive block of original periods. This is important for:
- **Storage modeling**: State-of-charge must be continuous
- **Seasonal patterns**: Preserving the natural progression of seasons

In [None]:
result_contiguous = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="contiguous"),
)
print(f"Accuracy: RMSE = {result_contiguous.accuracy.rmse.mean():.4f}")
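
A temporal constraint of this kind can be sketched by only ever merging neighboring segments. The NumPy example below is a minimal illustration under that assumption, not tsam's implementation, and `contiguous_segments` is a hypothetical helper.

```python
import numpy as np

def contiguous_segments(periods, n_clusters):
    """Merge only neighboring segments until n_clusters remain (illustrative)."""
    segments = [[i] for i in range(len(periods))]
    while len(segments) > n_clusters:
        # Distance between the mean profiles of each pair of neighboring segments
        gaps = [
            np.linalg.norm(
                periods[segments[i]].mean(axis=0) - periods[segments[i + 1]].mean(axis=0)
            )
            for i in range(len(segments) - 1)
        ]
        i = int(np.argmin(gaps))
        # Replace the two neighbors by their union
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments

# Six hypothetical daily values: low, low, high, high, low, low
periods = np.array([[1.0], [1.2], [9.0], [9.1], [1.1], [0.9]])
print(contiguous_segments(periods, 3))  # [[0, 1], [2, 3], [4, 5]]
```

The first two and the last two days are similar in value but end up in separate clusters because they are not adjacent in time.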

## 6. Comparison of Methods

In [None]:
# Collect all results for comparison
results = {
    "Original": raw,
    "Hierarchical": result_hierarchical.reconstructed,
    "K-Means": result_kmeans.reconstructed,
    "K-Medoids": result_kmedoids.reconstructed,
    "K-Maxoids": result_kmaxoids.reconstructed,
    "Contiguous": result_contiguous.reconstructed,
}

### Duration Curve Comparison

Duration curves show how well each method preserves the value distribution.

In [None]:
# Duration curve comparison - Load
frames = []
for name, df in results.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Duration Curve Comparison - Load",
)

In [None]:
# Duration curve comparison - GHI
frames = []
for name, df in results.items():
    sorted_vals = df["GHI"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "GHI": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df, x="Hour", y="GHI", color="Method", title="Duration Curve Comparison - GHI"
)

### Accuracy Comparison

In [None]:
# Compare RMSE across methods
accuracy_comparison = pd.DataFrame(
    {
        "Method": ["Hierarchical", "K-Means", "K-Medoids", "K-Maxoids", "Contiguous"],
        "Mean RMSE": [
            result_hierarchical.accuracy.rmse.mean(),
            result_kmeans.accuracy.rmse.mean(),
            result_kmedoids.accuracy.rmse.mean(),
            result_kmaxoids.accuracy.rmse.mean(),
            result_contiguous.accuracy.rmse.mean(),
        ],
    }
)
accuracy_comparison.sort_values("Mean RMSE")
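
As a rough illustration of what goes into such an error figure, the sketch below rebuilds a series from cluster representatives and computes a plain RMSE. It assumes the error is taken on raw values. tsam may normalize the data first, so the numbers reported above need not follow this exact formula. All arrays are hypothetical.

```python
import numpy as np

# Hypothetical original periods (rows) and their cluster assignment
original = np.array([[1.0, 2.0], [1.0, 4.0], [5.0, 6.0]])
assignment = np.array([0, 0, 1])

# One representative per cluster (here: the cluster mean)
representatives = np.array(
    [original[assignment == k].mean(axis=0) for k in range(2)]
)

# Reconstruction: replace every period by its cluster's representative
reconstructed = representatives[assignment]

rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
print(f"{rmse:.3f}")  # 0.577
```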

## 7. Configuration Options

### Using Weights

When clustering multivariate time series, you can assign different importance to each column. This is useful when one variable is more critical for your application.

In [None]:
# Prioritize Load over other columns (e.g., for demand-focused energy systems)
result_weighted = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        weights={"Load": 3.0, "GHI": 1.0, "T": 1.0, "Wind": 1.0},
    ),
)
print(f"Load RMSE (weighted): {result_weighted.accuracy.rmse['Load']:.4f}")
print(f"Load RMSE (unweighted): {result_hierarchical.accuracy.rmse['Load']:.4f}")
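
One way to picture the effect of weights is to scale each column before measuring distances. The sketch below assumes weights multiply normalized column values. Whether tsam applies them in exactly this way is an assumption, and the vectors and the helper `weighted_distance` are hypothetical.

```python
import numpy as np

# Hypothetical normalized values, columns = (Load, GHI)
reference = np.array([0.5, 0.5])
candidate_a = np.array([0.6, 0.9])   # close in Load, far in GHI
candidate_b = np.array([0.8, 0.55])  # far in Load, close in GHI

def weighted_distance(x, y, weights):
    """Euclidean distance after scaling each column by its weight."""
    return float(np.linalg.norm((x - y) * weights))

equal = np.array([1.0, 1.0])
load_heavy = np.array([3.0, 1.0])

# With equal weights, candidate_b is the closer match
print(weighted_distance(reference, candidate_b, equal)
      < weighted_distance(reference, candidate_a, equal))            # True

# With Load weighted 3x, candidate_a becomes the closer match
print(weighted_distance(reference, candidate_a, load_heavy)
      < weighted_distance(reference, candidate_b, load_heavy))       # True
```

Raising the Load weight changes which periods count as similar, and so which periods end up in the same cluster.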

### Using Duration Curves for Clustering

By default, clustering matches periods by their temporal patterns. Setting `use_duration_curves=True` matches periods by their value distributions instead, ignoring timing.

In [None]:
# Cluster by value distribution rather than temporal pattern
result_duration_curves = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        use_duration_curves=True,
    ),
)
print(f"RMSE with duration curves: {result_duration_curves.accuracy.rmse.mean():.4f}")
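
The difference between the two matching modes shows up clearly on two days that contain the same values at different hours. The sketch below is a NumPy illustration with hypothetical data, where a duration curve is simply a period's values sorted in descending order.

```python
import numpy as np

# Two hypothetical days with identical values but the peak at different hours
morning_peak = np.array([1.0, 5.0, 2.0, 1.0])
evening_peak = np.array([1.0, 1.0, 2.0, 5.0])

# Comparing hour by hour: the shifted peak makes the days look very different
temporal_distance = np.linalg.norm(morning_peak - evening_peak)

# Comparing duration curves: sort each day's values in descending order first
duration_distance = np.linalg.norm(
    np.sort(morning_peak)[::-1] - np.sort(evening_peak)[::-1]
)

print(f"{temporal_distance:.3f}")  # 5.657
print(f"{duration_distance:.3f}")  # 0.000
```

Under duration-curve matching these two days are identical and would be clustered together, even though their peaks occur at different times.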

### Distribution-Preserving Representation

The `distribution_minmax` representation preserves both the value distribution and the min/max values. This suits energy system optimization, where both the shape of the distribution and its extremes matter.

In [None]:
# Use distribution_minmax representation
result_dist_minmax = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        representation="distribution_minmax",
    ),
)

# Compare min/max preservation
print("Original data range:")
print(f"  Load: {raw['Load'].min():.2f} - {raw['Load'].max():.2f}")

reconstructed_standard = result_hierarchical.reconstructed
reconstructed_dist = result_dist_minmax.reconstructed

print("\nStandard medoid representation:")
print(
    f"  Load: {reconstructed_standard['Load'].min():.2f} - {reconstructed_standard['Load'].max():.2f}"
)

print("\nDistribution + MinMax representation:")
print(
    f"  Load: {reconstructed_dist['Load'].min():.2f} - {reconstructed_dist['Load'].max():.2f}"
)
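
The idea behind a distribution-preserving representative can be sketched as follows: pool all values of a cluster, sort them, average them in as many bins as there are time steps, and pin the lowest and highest bin to the observed extremes. This is a simplified illustration under those assumptions. tsam's actual algorithm may differ, for example in how it compensates the mean after pinning the extremes. All data is hypothetical.

```python
import numpy as np

# Hypothetical cluster: three member periods (rows) with four time steps each
members = np.array([
    [1.0, 4.0, 8.0, 3.0],
    [2.0, 5.0, 6.0, 2.0],
    [0.0, 6.0, 10.0, 4.0],
])
n_members, n_steps = members.shape

# Mean profile: keeps the timing but flattens the extremes
mean_profile = members.mean(axis=0)

# Pool and sort all values, then average them in n_steps equally sized bins
pooled = np.sort(members.ravel())
binned = pooled.reshape(n_steps, n_members).mean(axis=1)

# "minmax" step: pin the lowest and highest bin to the observed extremes
binned[0], binned[-1] = pooled[0], pooled[-1]

# Put the binned values back in the time order of the mean profile
representative = np.empty(n_steps)
representative[np.argsort(mean_profile)] = binned

print(mean_profile.tolist())    # [1.0, 5.0, 8.0, 3.0]
print(representative.tolist())  # [0.0, 5.0, 10.0, 3.0]
```

The mean profile spans 1 to 8, whereas the sketched representative spans the full observed range of 0 to 10 while keeping the peak at the same time step.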

### Comparison: Standard vs Distribution-Preserving

In [None]:
# Comparison: Standard vs Distribution-Preserving
comparison_dist = {
    "Original": raw,
    "Medoid (standard)": reconstructed_standard,
    "Distribution + MinMax": reconstructed_dist,
}

frames = []
for name, df in comparison_dist.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Effect of Distribution-Preserving Representation",
)

## Summary

| Use Case | Recommended Method | Key Options |
|----------|-------------------|-------------|
| General purpose | `hierarchical` | Default settings |
| Fast clustering | `kmeans` | - |
| Preserve realistic patterns | `hierarchical` | `representation="medoid"` |
| Capture extremes | `kmaxoids` | `preserve_column_means=False` |
| Storage modeling | `contiguous` | - |
| Demand-focused | `hierarchical` | `weights={"Load": 3.0, ...}` |
| Preserve distribution | `hierarchical` | `representation="distribution_minmax"` |

**Note:** The `kmedoids` method uses an exact MILP solver and can be slow for datasets with many periods (365+ days). Use `hierarchical` with `representation="medoid"` for similar results with much better performance.