# 02 -- Feature Analysis

Exploratory analysis of the extracted feature matrix to inform
modeling decisions (dimensionality reduction, feature selection,
model architecture).

**Sections:**
1. Data loading & preprocessing
2. PCA: scree plot & explained variance
3. PCA: component loadings & feature importance
4. UMAP: team separation
5. UMAP: temporal & contextual structure
6. Feature redundancy: hierarchical clustering
7. Correlation block structure
8. Key observations & modeling recommendations

In [None]:
from __future__ import annotations

import sys
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import polars as pl
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from umap import UMAP

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT / "src"))

from tactical.models.preprocessing import PreprocessingPipeline

sns.set_theme(style="whitegrid", font_scale=0.9)
plt.rcParams["figure.dpi"] = 120
warnings.filterwarnings("ignore", category=FutureWarning)

FEATURES_PATH = PROJECT_ROOT / "data" / "output" / "features.parquet"

METADATA_COLS: set[str] = {
    "match_id",
    "team_id",
    "segment_type",
    "start_time",
    "end_time",
    "period",
    "match_minute",
}

TIER_PALETTE: dict[str, str] = {
    "Tier 1": "#4c72b0",
    "Tier 2": "#dd8452",
    "Tier 3": "#55a868",
    "Unknown": "#999999",
}

---
## 1. Data Loading & Preprocessing

In [None]:
if not FEATURES_PATH.exists():
    # Raise a regular exception so the notebook stops with a clear traceback.
    raise FileNotFoundError(
        f"{FEATURES_PATH} not found. "
        "Run `python scripts/run_feature_extraction.py` first."
    )

df = pl.read_parquet(FEATURES_PATH)
print(f"Loaded {FEATURES_PATH}")

In [None]:
feature_cols: list[str] = sorted(
    c for c in df.columns if c not in METADATA_COLS
)

tier_map: dict[str, str] = {}
for col in feature_cols:
    if col.startswith("t1_"):
        tier_map[col] = "Tier 1"
    elif col.startswith("t2_"):
        tier_map[col] = "Tier 2"
    elif col.startswith("t3_"):
        tier_map[col] = "Tier 3"
    else:
        tier_map[col] = "Unknown"

# Focus on window segments for PCA/UMAP (consistent segment length)
df_win = df.filter(pl.col("segment_type") == "window")

print(f"Total rows: {df.height:,}  |  Window rows: {df_win.height:,}")
print(f"Feature columns: {len(feature_cols)}")

In [None]:
# Use PreprocessingPipeline for Tier 1+2 features (model-ready subset)
# Tier 3 features are excluded because they contain nulls for most matches
pipeline = PreprocessingPipeline(
    feature_prefix="t",
    null_strategy="drop_rows",
    pca_variance_threshold=None,  # no PCA yet -- we analyze raw scaled space first
)

# Fit on window segments only
X_scaled = pipeline.fit_transform(df_win)
retained_mask = pipeline.get_retained_row_mask(df_win)
df_clean = df_win.filter(pl.Series(retained_mask))

scaled_feature_names: list[str] = pipeline._feature_columns

print(f"Scaled matrix shape: {X_scaled.shape}")
print(f"Rows retained after null handling: {X_scaled.shape[0]:,} / {df_win.height:,}")
print(f"Features used: {X_scaled.shape[1]}")

---
## 2. PCA: Scree Plot & Explained Variance

Fit a full PCA to understand the intrinsic dimensionality of the
feature space and determine how many components are needed.
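
As a minimal illustration of the component-count rule applied in the next cell, the sketch below runs `np.searchsorted` on a made-up explained-variance vector (the numbers and the helper `components_for` are illustrative assumptions, not values from the feature matrix). `searchsorted` returns the zero-based index of the first cumulative value that reaches the threshold, so adding 1 turns it into a component count.

```python
import numpy as np

# Hypothetical explained-variance ratios for a 4-component PCA (sum to 1).
toy_explained = np.array([0.50, 0.30, 0.15, 0.05])
toy_cumulative = np.cumsum(toy_explained)  # approx. [0.50, 0.80, 0.95, 1.00]


def components_for(threshold: float, cumulative: np.ndarray) -> int:
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    # searchsorted gives a zero-based position; +1 converts it to a count.
    n = int(np.searchsorted(cumulative, threshold) + 1)
    # Guard against float round-off leaving cumulative[-1] slightly below 1.0.
    return min(n, len(cumulative))


toy_counts = {t: components_for(t, toy_cumulative) for t in (0.60, 0.90)}
print(toy_counts)
```

The `min(...)` guard is an addition of this sketch: it only matters for thresholds very close to 1.0, where round-off could otherwise yield a count one larger than the number of components.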

In [None]:
max_components = min(X_scaled.shape[0], X_scaled.shape[1])
pca_full = PCA(n_components=max_components)
pca_full.fit(X_scaled)

explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Key variance thresholds
for thresh in (0.80, 0.90, 0.95, 0.99):
    n_comp = int(np.searchsorted(cumulative, thresh) + 1)
    print(f"  {thresh:.0%} variance explained by {n_comp} / {max_components} components")

In [None]:
# Scree plot + cumulative explained variance
n_show = min(40, max_components)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left: individual explained variance
ax1.bar(
    range(1, n_show + 1),
    explained[:n_show] * 100,
    color="#4c72b0",
    edgecolor="white",
    linewidth=0.4,
)
ax1.set_xlabel("Principal Component")
ax1.set_ylabel("Variance Explained (%)")
ax1.set_title("Scree Plot")
ax1.set_xlim(0.5, n_show + 0.5)

# Right: cumulative
ax2.plot(
    range(1, max_components + 1),
    cumulative * 100,
    color="#4c72b0",
    linewidth=1.5,
)
for thresh, ls in [(80, ":"), (90, "--"), (95, "-.")]:
    ax2.axhline(thresh, color="#c44e52", linestyle=ls, linewidth=0.8, alpha=0.7)
    n_at = int(np.searchsorted(cumulative, thresh / 100) + 1)
    ax2.annotate(
        f"{thresh}% @ {n_at}",
        xy=(n_at, thresh),
        fontsize=7,
        color="#c44e52",
        ha="left",
        va="bottom",
    )
ax2.set_xlabel("Number of Components")
ax2.set_ylabel("Cumulative Variance (%)")
ax2.set_title("Cumulative Explained Variance")

fig.suptitle("PCA Dimensionality Analysis", fontsize=11, y=1.02)
plt.tight_layout()
plt.show()

---
## 3. PCA: Component Loadings & Feature Importance

Identify which original features contribute most to the top
principal components.

In [None]:
# Top N components to inspect
N_TOP_PC = min(6, max_components)
N_TOP_FEATURES = 10

loadings = pca_full.components_[:N_TOP_PC]  # (N_TOP_PC, n_features)

loading_rows: list[dict[str, object]] = []
for pc_idx in range(N_TOP_PC):
    abs_loadings = np.abs(loadings[pc_idx])
    top_indices = np.argsort(abs_loadings)[::-1][:N_TOP_FEATURES]
    for rank, feat_idx in enumerate(top_indices):
        loading_rows.append(
            {
                "PC": pc_idx + 1,
                "rank": rank + 1,
                "feature": scaled_feature_names[feat_idx],
                "loading": round(float(loadings[pc_idx, feat_idx]), 4),
                "abs_loading": round(float(abs_loadings[feat_idx]), 4),
                "tier": tier_map.get(scaled_feature_names[feat_idx], "Unknown"),
            }
        )

loading_df = pl.DataFrame(loading_rows)
for pc in range(1, N_TOP_PC + 1):
    subset = loading_df.filter(pl.col("PC") == pc)
    var_pct = explained[pc - 1] * 100
    print(f"\n--- PC{pc} ({var_pct:.1f}% variance) ---")
    print(subset.select("rank", "feature", "loading", "tier"))

In [None]:
# Heatmap of top feature loadings across first N components
# Collect all features that appear in any top-N list
top_features_set: set[str] = set()
for pc_idx in range(N_TOP_PC):
    abs_l = np.abs(loadings[pc_idx])
    top_idx = np.argsort(abs_l)[::-1][:N_TOP_FEATURES]
    for i in top_idx:
        top_features_set.add(scaled_feature_names[i])

top_features_sorted = sorted(top_features_set)
feat_indices = [scaled_feature_names.index(f) for f in top_features_sorted]

loading_matrix = loadings[:, feat_indices]  # (N_TOP_PC, len(top_features_sorted))

fig, ax = plt.subplots(figsize=(14, max(4, N_TOP_PC * 0.8)))
im = ax.imshow(loading_matrix, cmap="RdBu_r", aspect="auto", vmin=-0.5, vmax=0.5)

ax.set_yticks(range(N_TOP_PC))
ax.set_yticklabels([f"PC{i+1} ({explained[i]*100:.1f}%)" for i in range(N_TOP_PC)], fontsize=8)
ax.set_xticks(range(len(top_features_sorted)))
short_names = [
    f.replace("t1_", "").replace("t2_", "").replace("t3_", "")
    for f in top_features_sorted
]
ax.set_xticklabels(short_names, rotation=90, fontsize=6)
ax.set_title("PCA Loadings: Top Contributing Features", fontsize=11)
fig.colorbar(im, ax=ax, fraction=0.02, pad=0.04, label="Loading")
plt.tight_layout()
plt.show()

In [None]:
# Aggregate feature importance: sum of squared loadings across
# components weighted by explained variance
n_for_importance = int(np.searchsorted(cumulative, 0.95) + 1)
weights = explained[:n_for_importance]
importance = (pca_full.components_[:n_for_importance] ** 2) * weights[:, np.newaxis]
importance_scores = importance.sum(axis=0)  # (n_features,)

importance_order = np.argsort(importance_scores)[::-1]

n_show_imp = min(25, len(importance_order))
top_imp_names = [scaled_feature_names[i] for i in importance_order[:n_show_imp]]
top_imp_scores = importance_scores[importance_order[:n_show_imp]]
top_imp_tiers = [tier_map.get(n, "Unknown") for n in top_imp_names]

fig, ax = plt.subplots(figsize=(8, max(4, n_show_imp * 0.3)))
bar_colors = [TIER_PALETTE.get(t, "#999999") for t in top_imp_tiers]
y_pos = np.arange(n_show_imp)
ax.barh(y_pos, top_imp_scores, color=bar_colors, edgecolor="white", linewidth=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(
    [n.replace("t1_", "").replace("t2_", "").replace("t3_", "") for n in top_imp_names],
    fontsize=7,
)
ax.invert_yaxis()
ax.set_xlabel("Weighted Squared Loading (importance)")
ax.set_title(f"Top {n_show_imp} Features by PCA Importance (95% variance)", fontsize=10)

from matplotlib.patches import Patch

handles = [
    Patch(facecolor=c, label=t)
    for t, c in TIER_PALETTE.items()
    if t in set(top_imp_tiers)
]
ax.legend(handles=handles, loc="lower right", fontsize=8)
plt.tight_layout()
plt.show()

---
## 4. UMAP: Team Separation

Project the feature space into 2D with UMAP and colour by team
to assess whether different teams occupy distinct regions.
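
Visual inspection can be complemented by a rough number. The sketch below is an assumption of this write-up rather than part of the analysis: a hypothetical helper, `knn_team_purity`, computes for each point the share of its k nearest neighbours that belong to the same team. Values near 1 suggest distinct team regions; values near the chance level (about 1 / number of teams for balanced data) suggest intermixed teams. It is demonstrated on synthetic points, not on the real embedding.

```python
import numpy as np


def knn_team_purity(points: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each point's k nearest neighbours sharing its label.

    Brute-force O(n^2) distances: suitable for a few thousand rows only.
    """
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)  # a point is not its own neighbour
    # Indices of the k closest other points for every row.
    neighbours = np.argpartition(sq_dist, k, axis=1)[:, :k]
    same_team = labels[neighbours] == labels[:, None]
    return float(same_team.mean())


rng = np.random.default_rng(0)
# Two synthetic "teams": well-separated blobs vs. fully overlapping blobs.
toy_labels = np.repeat([0, 1], 200)
separated = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(10, 1, (200, 2))])
mixed = rng.normal(0, 1, (400, 2))

purity_separated = knn_team_purity(separated, toy_labels)
purity_mixed = knn_team_purity(mixed, toy_labels)
print(f"separated: {purity_separated:.2f}  mixed: {purity_mixed:.2f}")
```

If applied to the real data, computing the score in the PCA-reduced space rather than on the 2D embedding may be preferable, since UMAP does not preserve distances faithfully.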

In [None]:
# Reduce to PCA space first (faster UMAP, denoised)
n_pca_for_umap = int(np.searchsorted(cumulative, 0.95) + 1)
X_pca = pca_full.transform(X_scaled)[:, :n_pca_for_umap]

reducer = UMAP(n_components=2, n_neighbors=30, min_dist=0.3, random_state=42, n_jobs=1)
X_umap = reducer.fit_transform(X_pca)

print(f"UMAP input: {X_pca.shape} -> output: {X_umap.shape}")

In [None]:
# Colour by team_id
team_ids = df_clean["team_id"].to_list()
unique_teams = sorted(set(team_ids))
n_teams = len(unique_teams)

# Build a color map: use a qualitative palette
# plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap takes the same arguments
cmap_teams = plt.get_cmap("tab20", max(n_teams, 2))
team_to_idx = {t: i for i, t in enumerate(unique_teams)}
team_colors = np.array([team_to_idx[t] for t in team_ids])

fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(
    X_umap[:, 0],
    X_umap[:, 1],
    c=team_colors,
    cmap=cmap_teams,
    s=4,
    alpha=0.5,
    edgecolors="none",
)
ax.set_xlabel("UMAP-1")
ax.set_ylabel("UMAP-2")
ax.set_title(f"UMAP Projection Coloured by Team ({n_teams} teams)", fontsize=11)

# Legend: show up to 15 teams to keep it readable
max_legend = min(15, n_teams)
legend_handles = [
    plt.Line2D(
        [0], [0],
        marker="o",
        color="w",
        markerfacecolor=cmap_teams(team_to_idx[t]),
        markersize=5,
        label=str(t)[:20],
    )
    for t in unique_teams[:max_legend]
]
if n_teams > max_legend:
    legend_handles.append(
        plt.Line2D([0], [0], marker="", color="w", label=f"... +{n_teams - max_legend} more")
    )
ax.legend(
    handles=legend_handles,
    loc="upper left",
    bbox_to_anchor=(1.01, 1),
    fontsize=6,
    frameon=True,
    ncol=1,
)
plt.tight_layout()
plt.show()

In [None]:
# Highlight individual teams: overlay the teams with the most segments
team_counts = (
    df_clean.group_by("team_id")
    .agg(pl.len().alias("cnt"))
    .sort("cnt", descending=True)
)
top_n_highlight = min(6, team_counts.height)
highlight_teams = team_counts.head(top_n_highlight)["team_id"].to_list()

n_cols_grid = min(3, top_n_highlight)
n_rows_grid = (top_n_highlight + n_cols_grid - 1) // n_cols_grid

fig, axes = plt.subplots(
    n_rows_grid, n_cols_grid,
    figsize=(n_cols_grid * 4, n_rows_grid * 3.5),
)
axes_flat = np.array(axes).flatten()

team_ids_arr = np.array(team_ids)

for idx, tid in enumerate(highlight_teams):
    ax = axes_flat[idx]
    mask = team_ids_arr == tid
    ax.scatter(
        X_umap[~mask, 0], X_umap[~mask, 1],
        c="#dddddd", s=2, alpha=0.3, edgecolors="none",
    )
    ax.scatter(
        X_umap[mask, 0], X_umap[mask, 1],
        c="#c44e52", s=6, alpha=0.7, edgecolors="none",
    )
    ax.set_title(str(tid)[:25], fontsize=8)
    ax.tick_params(labelsize=5)
    ax.set_xlabel("UMAP-1", fontsize=6)
    ax.set_ylabel("UMAP-2", fontsize=6)

for idx in range(top_n_highlight, len(axes_flat)):
    axes_flat[idx].set_visible(False)

fig.suptitle("UMAP: Individual Team Overlays (top by segment count)", fontsize=10, y=1.02)
plt.tight_layout()
plt.show()

---
## 5. UMAP: Temporal & Contextual Structure

Colour the UMAP embedding by match minute, period, score differential,
possession share, and selected tactical features to check whether
temporal or game-state structure is present.

In [None]:
# Colour by match minute
match_minutes = df_clean["match_minute"].to_numpy().astype(float)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Match minute
sc1 = axes[0].scatter(
    X_umap[:, 0], X_umap[:, 1],
    c=match_minutes, cmap="viridis", s=4, alpha=0.5, edgecolors="none",
)
axes[0].set_title("Coloured by Match Minute", fontsize=10)
axes[0].set_xlabel("UMAP-1")
axes[0].set_ylabel("UMAP-2")
fig.colorbar(sc1, ax=axes[0], fraction=0.04, pad=0.04, label="Minute")

# Period
periods = df_clean["period"].to_numpy().astype(float)
sc2 = axes[1].scatter(
    X_umap[:, 0], X_umap[:, 1],
    c=periods, cmap="Set1", s=4, alpha=0.5, edgecolors="none",
)
axes[1].set_title("Coloured by Period", fontsize=10)
axes[1].set_xlabel("UMAP-1")
axes[1].set_ylabel("UMAP-2")
fig.colorbar(sc2, ax=axes[1], fraction=0.04, pad=0.04, label="Period")

fig.suptitle("UMAP: Temporal Structure", fontsize=11, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Colour by score differential and possession share (if available)
_CONTEXT_FEATURES = [
    ("t1_context_score_differential", "Score Differential"),
    ("t1_context_possession_share", "Possession Share"),
]
available_ctx = [
    (col, label) for col, label in _CONTEXT_FEATURES
    if col in df_clean.columns
]

if available_ctx:
    fig, axes = plt.subplots(
        1, len(available_ctx),
        figsize=(6 * len(available_ctx), 5),
    )
    if len(available_ctx) == 1:
        axes = [axes]

    for ax, (col, label) in zip(axes, available_ctx):
        vals = df_clean[col].fill_null(0).to_numpy().astype(float)
        sc = ax.scatter(
            X_umap[:, 0], X_umap[:, 1],
            c=vals, cmap="RdYlBu_r", s=4, alpha=0.5, edgecolors="none",
        )
        ax.set_title(f"Coloured by {label}", fontsize=10)
        ax.set_xlabel("UMAP-1")
        ax.set_ylabel("UMAP-2")
        fig.colorbar(sc, ax=ax, fraction=0.04, pad=0.04, label=label)

    fig.suptitle("UMAP: Game-State Context", fontsize=11, y=1.02)
    plt.tight_layout()
    plt.show()
else:
    print("No context features available for UMAP colouring.")

In [None]:
# Colour by representative tactical features (only those present in the data)
_TACTICAL_FEATURES = [
    ("t1_spatial_event_centroid_x", "Event Centroid X"),
    ("t2_press_intensity", "Pressing Intensity"),
    ("t2_shape_engagement_line", "Engagement Line"),
    ("t1_pass_completion_rate", "Pass Completion Rate"),
]
available_tac = [
    (col, label) for col, label in _TACTICAL_FEATURES
    if col in df_clean.columns
]

if available_tac:
    n_tac = len(available_tac)
    n_cols_g = min(2, n_tac)
    n_rows_g = (n_tac + n_cols_g - 1) // n_cols_g
    fig, axes = plt.subplots(n_rows_g, n_cols_g, figsize=(n_cols_g * 6, n_rows_g * 4.5))
    axes_flat = np.array(axes).flatten()

    for idx, (col, label) in enumerate(available_tac):
        ax = axes_flat[idx]
        vals = df_clean[col].fill_null(0).to_numpy().astype(float)
        sc = ax.scatter(
            X_umap[:, 0], X_umap[:, 1],
            c=vals, cmap="coolwarm", s=4, alpha=0.5, edgecolors="none",
        )
        ax.set_title(f"Coloured by {label}", fontsize=9)
        ax.set_xlabel("UMAP-1", fontsize=8)
        ax.set_ylabel("UMAP-2", fontsize=8)
        fig.colorbar(sc, ax=ax, fraction=0.04, pad=0.04)

    for idx in range(n_tac, len(axes_flat)):
        axes_flat[idx].set_visible(False)

    fig.suptitle("UMAP: Tactical Feature Gradients", fontsize=11, y=1.02)
    plt.tight_layout()
    plt.show()

---
## 6. Feature Redundancy: Hierarchical Clustering

Cluster features by their absolute correlation to identify
redundant groups that could be removed or merged.
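
The effect of clustering on `1 - |r|` can be seen on a toy example. The sketch below builds four synthetic features (all names and values are illustrative assumptions) forming two redundant pairs, one of them anti-correlated, and applies the same average-linkage and distance-cut steps used in this section.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Four synthetic features: two redundant pairs, one of them anti-correlated.
toy_X = np.column_stack([
    a,
    -a + 0.1 * rng.normal(size=500),  # r close to -1 with column 0
    b,
    b + 0.1 * rng.normal(size=500),   # r close to +1 with column 2
])

toy_corr = np.corrcoef(toy_X, rowvar=False)
toy_dist = 1.0 - np.abs(toy_corr)  # sign is discarded: r near -1 counts as redundant
np.fill_diagonal(toy_dist, 0.0)

toy_linkage = linkage(squareform(toy_dist, checks=False), method="average")
toy_clusters = fcluster(toy_linkage, t=0.15, criterion="distance")
print(toy_clusters)  # columns 0-1 share one label, columns 2-3 another
```

Taking the absolute value is what places the anti-correlated pair in the same group: a feature and its negation carry the same information for most downstream models.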

In [None]:
# Correlation matrix on scaled features
corr_matrix = np.corrcoef(X_scaled, rowvar=False)
corr_matrix = np.nan_to_num(corr_matrix, nan=0.0)

# Convert correlation to distance: d = 1 - |r|
dist_matrix = 1.0 - np.abs(corr_matrix)
np.fill_diagonal(dist_matrix, 0.0)
# Ensure symmetry and non-negativity
dist_matrix = np.clip((dist_matrix + dist_matrix.T) / 2, 0, None)

condensed = squareform(dist_matrix, checks=False)
linkage_matrix = linkage(condensed, method="average")

print(f"Feature correlation matrix: {corr_matrix.shape}")

In [None]:
# Dendrogram
short_labels = [
    f.replace("t1_", "").replace("t2_", "").replace("t3_", "")
    for f in scaled_feature_names
]

fig, ax = plt.subplots(figsize=(16, max(6, len(scaled_feature_names) * 0.18)))
dendro = dendrogram(
    linkage_matrix,
    labels=short_labels,
    orientation="left",
    leaf_font_size=5,
    color_threshold=0.3,
    ax=ax,
)
ax.axvline(0.15, color="#c44e52", linestyle="--", linewidth=0.8, label="|r| = 0.85 threshold")
ax.axvline(0.30, color="#dd8452", linestyle="--", linewidth=0.8, label="|r| = 0.70 threshold")
ax.set_xlabel("Distance (1 - |correlation|)")
ax.set_title("Hierarchical Clustering of Features by Correlation", fontsize=11)
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

In [None]:
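# Identify redundant feature groups by cutting the average-linkage tree at
# distance 0.15: groups merge while their mean inter-group |r| is >= 0.85,
# so an individual pair inside a group can fall slightly below 0.85.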
REDUNDANCY_THRESHOLD = 0.15
cluster_labels = fcluster(linkage_matrix, t=REDUNDANCY_THRESHOLD, criterion="distance")

# Find clusters with more than one member
cluster_to_features: dict[int, list[str]] = {}
for feat, cid in zip(scaled_feature_names, cluster_labels):
    cluster_to_features.setdefault(int(cid), []).append(feat)

redundant_groups = {
    cid: feats for cid, feats in cluster_to_features.items() if len(feats) > 1
}

print(f"Redundant groups (|r| > 0.85): {len(redundant_groups)}")
print(f"Total features in redundant groups: {sum(len(f) for f in redundant_groups.values())}")
print(f"Singleton features: {sum(1 for f in cluster_to_features.values() if len(f) == 1)}")
print()

for i, (cid, feats) in enumerate(sorted(redundant_groups.items())):
    print(f"  Group {i+1}: {feats}")

---
## 7. Correlation Block Structure

Reorder the correlation matrix using the hierarchical clustering
to reveal block-diagonal structure.
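
To show what the reordering does, the sketch below permutes a small hand-made correlation matrix (hypothetical values) with a given leaf order. Rows and columns are permuted together via `np.ix_`, which moves correlated features next to each other so that blocks appear along the diagonal.

```python
import numpy as np

# Hypothetical 4-feature correlation matrix: features 0 and 2 form one block,
# features 1 and 3 another, but the original column order interleaves them.
blk_corr = np.array([
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
])
blk_leaf_order = [0, 2, 1, 3]  # the kind of order dendrogram()["leaves"] returns

# np.ix_ builds an open mesh so rows and columns are permuted together.
blk_reordered = blk_corr[np.ix_(blk_leaf_order, blk_leaf_order)]
print(blk_reordered)
```

After reordering, the two 2x2 diagonal blocks hold the high correlations and the off-diagonal blocks hold the low ones.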

In [None]:
# Reorder correlation matrix by dendrogram leaf order
leaf_order = dendro["leaves"]
corr_reordered = corr_matrix[np.ix_(leaf_order, leaf_order)]
reordered_names = [scaled_feature_names[i] for i in leaf_order]

fig, ax = plt.subplots(figsize=(14, 12))
im = ax.imshow(corr_reordered, cmap="RdBu_r", vmin=-1, vmax=1, aspect="auto")

# Label every `step`-th feature on the reordered axes to keep ticks readable
n_feats = len(reordered_names)
step = max(1, n_feats // 30)
tick_positions = list(range(0, n_feats, step))
tick_labels = [
    reordered_names[i].replace("t1_", "").replace("t2_", "").replace("t3_", "")
    for i in tick_positions
]
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=90, fontsize=5)
ax.set_yticks(tick_positions)
ax.set_yticklabels(tick_labels, fontsize=5)
ax.set_title("Correlation Matrix (reordered by hierarchical clustering)", fontsize=11)
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()

In [None]:
# Distribution of absolute pairwise correlations (upper triangle)
upper_idx = np.triu_indices(corr_matrix.shape[0], k=1)
abs_corrs = np.abs(corr_matrix[upper_idx])

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.hist(abs_corrs, bins=50, color="#4c72b0", edgecolor="white", linewidth=0.4)
ax.axvline(0.85, color="#c44e52", linestyle="--", linewidth=1, label="|r|=0.85")
ax.axvline(0.70, color="#dd8452", linestyle="--", linewidth=1, label="|r|=0.70")
ax.set_xlabel("|Pairwise Correlation|")
ax.set_ylabel("Count")
ax.set_title("Distribution of Absolute Pairwise Feature Correlations")
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

n_pairs = len(abs_corrs)
n_high = int(np.sum(abs_corrs > 0.85))
n_moderate = int(np.sum(abs_corrs > 0.70))
print(f"Total feature pairs: {n_pairs:,}")
print(f"Pairs with |r| > 0.85: {n_high} ({100*n_high/n_pairs:.1f}%)")
print(f"Pairs with |r| > 0.70: {n_moderate} ({100*n_moderate/n_pairs:.1f}%)")

In [None]:
# Effective dimensionality estimates
eigenvalues = pca_full.explained_variance_

# Participation ratio: (sum(lambda))^2 / sum(lambda^2)
participation_ratio = float(eigenvalues.sum() ** 2 / (eigenvalues ** 2).sum())

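# Kaiser criterion: keep eigenvalues > 1 on a correlation matrix. Compare to
# the mean eigenvalue instead, which is about 1 for standardized features
# and stays valid if the scaling is not exactly unit variance.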
mean_eigenvalue = eigenvalues.mean()
n_kaiser = int(np.sum(eigenvalues > mean_eigenvalue))

# 95% cumulative variance
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("Effective dimensionality estimates:")
print(f"  Participation ratio       : {participation_ratio:.1f}")
print(f"  Kaiser criterion (> mean) : {n_kaiser}")
print(f"  95% cumulative variance   : {n_95}")
print(f"  Original feature count    : {X_scaled.shape[1]}")

---
## 8. Key Observations & Modeling Recommendations

### Dimensionality

- The PCA scree plot shows how quickly variance concentrates in the
  first few components. Use the component count printed for the 95%
  threshold in Section 2 as the practical dimensionality.
- The participation ratio and Kaiser criterion provide complementary
  estimates. If these converge (e.g., all around 10-15), the effective
  dimensionality is well-defined.
- **Recommendation:** Use PCA with 95% variance retention as the
  default for the `PreprocessingPipeline` (`pca_variance_threshold=0.95`).
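
To make the two estimates concrete, the sketch below evaluates them on three hypothetical eigenvalue spectra (the spectra and the helper names `participation_ratio` and `kaiser_count` are illustrative assumptions). It uses the same formulas as the effective-dimensionality cell in Section 7.

```python
import numpy as np


def participation_ratio(eigvals: np.ndarray) -> float:
    """(sum of eigenvalues)^2 / sum of squared eigenvalues."""
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())


def kaiser_count(eigvals: np.ndarray) -> int:
    """Number of eigenvalues strictly above the mean eigenvalue."""
    return int((eigvals > eigvals.mean()).sum())


# Hypothetical spectra, each with total variance 4 (four standardized features).
flat = np.array([1.0, 1.0, 1.0, 1.0])    # variance spread evenly
peaked = np.array([4.0, 0.0, 0.0, 0.0])  # one direction carries everything
skewed = np.array([2.5, 1.2, 0.2, 0.1])  # two dominant directions

for name, ev in [("flat", flat), ("peaked", peaked), ("skewed", skewed)]:
    print(f"{name}: PR={participation_ratio(ev):.2f}  Kaiser={kaiser_count(ev)}")
```

The flat spectrum illustrates an edge case: no eigenvalue is strictly above the mean, so the Kaiser count is 0 while the participation ratio equals the feature count. Disagreement of this kind is a sign that the spectrum has no clear elbow.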

### Team Separation in UMAP

- If teams cluster into distinct regions of the UMAP embedding, the
  feature space captures team-level tactical identity. This supports
  per-team or across-team modeling.
- If teams are intermixed, the feature space primarily captures
  *situational* variation (game state, phase of play) rather than
  *team identity*. This is favorable for discovering universal
  tactical states.
- Check the individual team overlays (Section 4) for specific cases.

### Temporal Structure

- Gradients in match minute or period on the UMAP plot indicate that
  the feature space encodes temporal information (early-match vs.
  late-match behavior).
- If score differential produces visible gradients, the features
  are sensitive to game state -- important for tactical discovery.
- **Implication:** HMM may capture temporal dynamics better than GMM
  if strong temporal structure is present.

### Feature Redundancy

- The dendrogram reveals groups of highly correlated features that
  carry essentially the same information.
- Each redundant group (|r| > 0.85) could be represented by a single
  feature or collapsed via PCA.
- **Recommendation:** Prefer PCA-based reduction over manual feature
  selection. PCA naturally handles correlated features.
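
If a single feature per redundant group is preferred for interpretability, one possible selection rule (an assumption of this sketch, not something the analysis prescribes) is to keep the member with the highest mean absolute correlation to the rest of its group. The helper `pick_representatives` and the toy names below are hypothetical.

```python
import numpy as np


def pick_representatives(
    corr: np.ndarray, names: list[str], groups: list[list[str]]
) -> list[str]:
    """For each group (2+ members), keep the member with the highest mean |r|."""
    index = {name: i for i, name in enumerate(names)}
    kept: list[str] = []
    for members in groups:
        idx = [index[m] for m in members]
        sub = np.abs(corr[np.ix_(idx, idx)])
        np.fill_diagonal(sub, 0.0)  # ignore self-correlation
        mean_abs_r = sub.sum(axis=1) / (len(idx) - 1)
        kept.append(members[int(np.argmax(mean_abs_r))])
    return kept


# Hypothetical feature names and correlations, for illustration only.
rep_names = ["f_a", "f_b", "f_c"]
rep_corr = np.array([
    [1.00, 0.95, 0.86],
    [0.95, 1.00, 0.90],
    [0.86, 0.90, 1.00],
])
representatives = pick_representatives(rep_corr, rep_names, [["f_a", "f_b", "f_c"]])
print(representatives)
```

In the notebook, the inputs would correspond to `corr_matrix`, `scaled_feature_names`, and the values of `redundant_groups`.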

### Modeling Implications

1. **PCA retention:** Use the 95% threshold identified above.
2. **Null strategy:** Tier 1+2 features should have negligible nulls;
   `drop_rows` is safe. Exclude Tier 3 from model input.
3. **Model input:** Feed PCA-reduced features to GMM/HMM.
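4. **Number of clusters (K):** The UMAP plots provide only a rough
   visual prior for the expected number of natural groupings, since
   UMAP can distort cluster sizes and separations; confirm K with a
   quantitative criterion such as BIC.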
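
The recommendations above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions: it calls scikit-learn's `PCA` and `GaussianMixture` directly rather than the project's `PreprocessingPipeline`, the data are three artificial blobs, and BIC is used as one possible criterion for choosing K.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 3 separated blobs in 10 dimensions.
centers = np.zeros((3, 10))
centers[1, 0] = 8.0
centers[2, 1] = 8.0
gmm_X = np.vstack([c + rng.normal(size=(200, 10)) for c in centers])

gmm_scaled = StandardScaler().fit_transform(gmm_X)
# A float n_components keeps the fewest components reaching that variance share.
gmm_reduced = PCA(n_components=0.95, svd_solver="full").fit_transform(gmm_scaled)

# Fit a GMM for each candidate K and record its BIC (lower is better).
bic_by_k = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    bic_by_k[k] = gmm.fit(gmm_reduced).bic(gmm_reduced)

best_k = min(bic_by_k, key=bic_by_k.get)
print(f"PCA kept {gmm_reduced.shape[1]} of 10 dims; lowest BIC at K={best_k}")
```

On real tactical features the BIC curve is often flatter than on clean blobs, so it is worth inspecting the whole curve rather than only its minimum.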