Goal: Identify the best k using reconstruction error and stability metrics.

First, identify the top 2-3 values of k using the same parameters as SigProfilerExtractor; then optimize the remaining parameters for those candidates.

SigProfilerExtractor:
"By default, the tool decomposes the input mutation matrix M by searching for an optimal number of mutational signatures, k, ranging from 1 to 25.

For each value of k, the tool performs 100 independent NMF factorizations. During each run, the matrix M is first Poisson resampled and normalized. The decomposition is then carried out using the multiplicative update NMF algorithm, minimizing an objective function based on the Kullback-Leibler divergence.

To assess stability, custom partition clustering is applied to the 100 repetitions, using the Hungarian algorithm to compare different solutions. Stable clusters are identified, and their centroids are selected as the optimal signature solutions, ensuring robustness to input noise and the non-uniqueness of NMF."
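
The procedure quoted above can be sketched in a few lines. This is only an illustration under stated assumptions, not SigProfilerExtractor's implementation: the helper names (`kl_nmf`, `poisson_resample`, `match_signatures`) are hypothetical, the input is synthetic, signatures are assumed to be the columns of W, the normalization step is omitted because the quote does not specify it, and each run is compared only to the first run instead of applying the tool's custom partition clustering.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def kl_nmf(V, k, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative-update NMF minimizing the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def poisson_resample(V, rng):
    """Draw a noisy replicate of the count matrix, entry by entry."""
    return rng.poisson(V).astype(float)


def match_signatures(W_ref, W_new, eps=1e-12):
    """Pair columns of W_new with columns of W_ref using the Hungarian algorithm."""
    A = W_ref / (np.linalg.norm(W_ref, axis=0, keepdims=True) + eps)
    B = W_new / (np.linalg.norm(W_new, axis=0, keepdims=True) + eps)
    sim = A.T @ B  # (k, k) cosine similarities between signatures
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize total cosine distance
    return cols, sim[rows, cols]


# Synthetic demo data (hypothetical: 96 channels, 40 samples, 3 signatures).
rng = np.random.default_rng(0)
W_true = rng.random((96, 3))
H_true = rng.random((3, 40)) * 200
V = W_true @ H_true

# A handful of resampled runs instead of 100, to keep the sketch fast.
runs = [kl_nmf(poisson_resample(V, rng), k=3, seed=s)[0] for s in range(5)]
scores = [match_signatures(runs[0], W)[1].mean() for W in runs[1:]]
mean_stability = float(np.mean(scores))
print(f"mean matched cosine similarity: {mean_stability:.3f}")
```

Values of the matched similarity close to 1 indicate that the same signatures are recovered across resampled runs; matching is needed because NMF returns components in arbitrary order.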

In [None]:
import os

# Change the working directory to the project-3 root (assumes the notebook sits three levels below it).
if os.path.basename(os.getcwd()) != 'project-3':
    os.chdir('../../../')

from src.models.nmf_runner import NMFDecomposer
import numpy as np
import joblib
import matplotlib.pyplot as plt

In [None]:
matrix = joblib.load("data/processed/mutation_matrix.pkl")
X = matrix['X']

In [None]:
ks = range(1, 26)
recon_errors = []
stabilities = []

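# Fit one model per k and record its reconstruction error and stability.
for k in ks:
    # NOTE: SigProfilerExtractor minimizes a KL-divergence objective;
    # this run uses the Frobenius objective instead.
    nmf = NMFDecomposer(n_components=k, objective_function="frobenius")
    W, H = nmf.fit(X)
    error = np.linalg.norm(X - W @ H, 'fro')  # Frobenius norm of the residual
    stab = np.mean(nmf.get_stability(W))  # mean stability; np.std would give its spread instead
    recon_errors.append(error)
    stabilities.append(stab)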

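# Elbow plot: reconstruction error (left axis) and stability (right axis) against k.
fig, ax1 = plt.subplots()
ax1.plot(ks, recon_errors, label="Reconstruction Error")
ax1.set_xlabel("Number of Components (k)")  # set on ax1: the twin axis hides its own x-axis
ax1.set_ylabel("Reconstruction Error")

ax2 = ax1.twinx()
ax2.plot(ks, stabilities, label="Stability", color="orange")
ax2.set_ylabel("Stability")

ax1.set_title("NMF Component Selection")
# Merge the handles from both axes into a single legend.
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc="best")
plt.show()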
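
To turn the two curves into a shortlist of the top 2-3 values of k, one possible heuristic is to keep only the k whose stability clears a threshold and then rank those by reconstruction error. The sketch below is an assumption for illustration: `top_k_candidates`, the 0.8 threshold, and the demo numbers are all hypothetical and are not taken from this analysis or from SigProfilerExtractor.

```python
import numpy as np


def top_k_candidates(ks, errors, stabilities, min_stability=0.8, n_top=3):
    """Return up to n_top values of k that are stable, ordered by lowest error."""
    ks = np.asarray(list(ks))
    errors = np.asarray(errors, dtype=float)
    stabilities = np.asarray(stabilities, dtype=float)
    stable = stabilities >= min_stability       # hypothetical stability cutoff
    order = np.argsort(errors[stable])[:n_top]  # lowest error first
    return ks[stable][order].tolist()


# Made-up illustrative values, not results from this notebook.
ks_demo = [1, 2, 3, 4, 5, 6]
err_demo = [10.0, 7.0, 5.0, 4.0, 3.5, 3.2]
stab_demo = [1.0, 0.95, 0.9, 0.85, 0.6, 0.4]

candidates = top_k_candidates(ks_demo, err_demo, stab_demo)
print(candidates)
```

With the real results, the call would be `top_k_candidates(ks, recon_errors, stabilities)`. Because reconstruction error usually keeps falling as k grows, this ranking tends to favor the largest k that is still stable, so the threshold deserves a sanity check against the plot.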
