# 01 — NMF Basis Selection

This notebook walks through the analysis pipeline for choosing the number of
NMF components (*k*) used to cluster glomerular response data.

**Steps:**
1. Load raw odorant × glomerulus response matrices
2. Reconstruction analysis — PCA vs. NMF variance explained
3. NMF basis scores (KL-divergence across 100 random seeds)
4. Consensus clustering stability sweep (*k* = 2–12)
5. Justify the choice of *k* = 7

In [None]:
import numpy as np
import glom_explorer as gx

## 1. Load data

In [None]:
odorants, low, high = gx.load_processed_data()
print(f"Odorants: {odorants.shape[0]}")
print(f"Low-conc matrix:  {low.shape}  (odorants × glomeruli)")
print(f"High-conc matrix: {high.shape}")

## 2. Reconstruction analysis

Compare PCA and NMF reconstruction of the low-concentration response matrix
as a function of the number of components (1–15).  A shuffled-PCA baseline
shows the floor expected by chance.

In [None]:
results = gx.run_reconstruction_analysis(low, max_components=15, n_iterations=50)

In [None]:
fig = gx.plot_reconstruction(results)
fig.show()

NMF tracks PCA closely and both plateau around 7–10 components, well above
the shuffled baseline.  This suggests that ~7 NMF components capture the
majority of structured variance.

## 3. NMF basis scores (KL-divergence)

For a fixed *k* = 7, we fit NMF with 100 different random seeds and compute
the mean pairwise KL-divergence between normalised basis vectors.  Higher
divergence → more distinct, non-overlapping clusters.

In [None]:
# Add small offset for NMF non-negativity
X = low.values.copy()
X = X - X.min() if X.min() < 0 else X
epsilon = np.finfo(np.float32).eps
X = X + epsilon

basis_scores = gx.calc_nmf_basis_scores(X, n_components=7, n_runs=100)
print(f"Mean KL-div: {basis_scores['kl-div'].mean():.3f} ± {basis_scores['kl-div'].std():.3f}")

In [None]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(x=basis_scores['kl-div'], nbinsx=25,
                           marker_color='#B255E8', opacity=0.8))
fig.update_layout(
    title='NMF Basis Score Distribution (k=7, 100 seeds)',
    xaxis_title='Mean pairwise KL-divergence',
    yaxis_title='Count',
    width=500, height=350, plot_bgcolor='white',
)
fig.show()

## 4. Consensus clustering stability

Sweep *k* = 2–12 and measure two stability metrics:
- **ARI** — agreement between individual NMF runs and the consensus partition
- **AUC** — area under the consensus CDF (1 = perfectly crisp consensus)

In [None]:
k_range = range(2, 13)
ari_means, ari_stds, aucs = [], [], []

for k in k_range:
    print(f"k={k}...", end=" ")
    result = gx.consensus_clustering_stability(X, n_components=k, n_runs=50)
    ari_means.append(result['mean_ari_to_consensus'])
    ari_stds.append(result['std_ari_to_consensus'])
    aucs.append(result['consensus_auc'])
print("done.")

In [None]:
from plotly.subplots import make_subplots

ks = list(k_range)
fig = make_subplots(rows=1, cols=2, subplot_titles=['Mean ARI to Consensus', 'Consensus AUC'])

fig.add_trace(go.Scatter(
    x=ks, y=ari_means, mode='lines+markers',
    error_y=dict(type='data', array=ari_stds, visible=True),
    marker=dict(color='#B255E8'), line=dict(color='#B255E8'),
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=ks, y=aucs, mode='lines+markers',
    marker=dict(color='steelblue'), line=dict(color='steelblue'),
), row=1, col=2)

fig.update_xaxes(title_text='k (number of clusters)', row=1, col=1)
fig.update_xaxes(title_text='k (number of clusters)', row=1, col=2)
fig.update_yaxes(title_text='ARI', row=1, col=1)
fig.update_yaxes(title_text='AUC', row=1, col=2)
fig.update_layout(width=800, height=350, showlegend=False, plot_bgcolor='white')
fig.show()

## 5. Summary

- The reconstruction analysis shows that 7 components capture most structured variance.
- NMF basis scores are consistently high at *k* = 7, confirming well-separated bases.
- Consensus clustering ARI remains high and stable around *k* = 7, with AUC indicating
  a crisp consensus partition.

These convergent lines of evidence support **k = 7** as the number of glomerular
response clusters.