# Module 2: Malignant Functional State Characterization via cNMF

This notebook describes the secondary analysis of the malignant cell compartment. We use consensus non-negative matrix factorization (cNMF) to dissect intratumoral transcriptional heterogeneity and to identify the cNMF7 (MES-like) state, which drives therapeutic resistance in IDH-wildtype glioblastoma (GBM).

## 1. Data Loading and Integration

We begin by loading the malignant cells isolated during the primary annotation phase. The data are then re-integrated so that dimensionality reduction is tuned to the malignant lineage alone.

In [None]:
import os
import pandas as pd
import scanpy as sc
import numpy as np
from multiprocessing import Pool
import time

# Load the sparse count matrix and associated metadata for the malignant cells
matrix = sc.read_mtx('/media/wangfei/GliomaSuppressiveTME/FirstRoundAnnotation/Malignant/sparse_matrix.mtx')
adata = matrix.T  # transpose so that rows are cells and columns are genes
var_names = pd.read_csv('/media/wangfei/GliomaSuppressiveTME/FirstRoundAnnotation/Malignant/geneinfo.csv', sep=',', header=0)
adata.var_names = var_names.iloc[:, 0].astype(str).values  # gene names from the first column

metadata = pd.read_csv('/media/wangfei/GliomaSuppressiveTME/FirstRoundAnnotation/Malignant/cellinfo.csv', sep=',', header=0)
metadata = metadata.set_index(metadata.columns[0])  # first column holds the cell identifiers
adata.obs = metadata

# Normalization and Regression of technical covariates
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.regress_out(adata, ["orig.ident", "nCount_RNA", "nFeature_RNA"])
sc.pp.scale(adata, max_value=10)

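# Feature selection: top 2000 highly variable genes (HVGs)
# Note: when n_top_genes is set, scanpy ignores the min_mean/max_mean/min_disp cutoffs
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()  # copy so later steps do not operate on a view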

# Dimensionality reduction and BBKNN batch correction [cite: 57, 303]
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata)
# BBKNN builds its own batch-balanced neighbor graph, replacing the one computed above
sc.external.pp.bbknn(adata, batch_key="batch", n_pcs=30)
sc.tl.umap(adata)

adata.write("GRIT_Atlas/FirstRoundAnnotation/Malignant/adata_concat_integrate_fullgenes.h5ad")
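
The normalization step above can be illustrated on a toy matrix. The sketch below reproduces, in plain NumPy, what `sc.pp.normalize_total(target_sum=1e4)` followed by `sc.pp.log1p` computes; the count values are made up for illustration and are not taken from the atlas.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are invented.
counts = np.array([
    [10, 0, 5, 5],
    [100, 50, 25, 25],
    [1, 1, 1, 1],
], dtype=float)

target_sum = 1e4  # same value passed to sc.pp.normalize_total above

# Scale every cell so that its total counts equal target_sum.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum

# Natural-log transform with a pseudocount of 1, as sc.pp.log1p does.
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
```

After this step every cell has the same library size, so differences between cells reflect composition rather than sequencing depth, and zero counts stay at zero after the log transform.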

## 2. Omicverse Preprocessing

To prepare for cNMF, we use the Omicverse framework to apply shift-log normalization, select 2000 highly variable genes by Pearson residuals, scale the data, and compute a PCA.

In [None]:
import omicverse as ov

adata = sc.read_h5ad("GRIT_Atlas/FirstRoundAnnotation/Malignant/adata_concat_integrate_fullgenes.h5ad")

# 'shiftlog|pearson': shift-log normalization, then Pearson residual-based HVG selection
adata = ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000)
ov.pp.scale(adata)
ov.pp.pca(adata)

adata.write("GRIT_Atlas/FirstRoundAnnotation/Malignant/OVprocess_adata.h5ad")
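
To make the Pearson residual-based gene selection concrete, here is a minimal NumPy sketch of analytic Pearson residuals under a negative binomial null model. The overdispersion `theta=100` and the clipping at the square root of the number of cells follow scanpy's documented defaults; that Omicverse's `pearson` mode computes exactly this is an assumption, and the helper name `pearson_residuals` and the toy counts are hypothetical.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals for a cells x genes count matrix (illustrative helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    cell_sums = counts.sum(axis=1, keepdims=True)
    gene_sums = counts.sum(axis=0, keepdims=True)
    # Expected count if each gene were spread across cells in proportion to depth.
    mu = cell_sums * gene_sums / total
    # Negative binomial variance: mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # Clip extreme residuals to +/- sqrt(number of cells).
    clip = np.sqrt(counts.shape[0])
    return np.clip(residuals, -clip, clip)

# Toy data: genes 0 and 1 are uniform, gene 2 is expressed in one cell only.
toy_counts = np.array([
    [5, 5, 0],
    [5, 5, 0],
    [5, 5, 0],
    [5, 5, 40],
])

residuals = pearson_residuals(toy_counts)
gene_variance = residuals.var(axis=0)  # genes are ranked by residual variance
top_gene = int(np.argmax(gene_variance))
print(top_gene)
```

Genes whose counts deviate most from the depth-proportional expectation receive the largest residual variance and are ranked first.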

## 3. Consensus Non-negative Matrix Factorization (cNMF)

We run the cNMF decomposition across a range of ranks (K = 2 to 19; `np.arange(2, 20)` excludes its upper bound) to identify stable transcriptional programs representing discrete malignant states.

In [None]:
# Initialize cNMF object
# Seed is set to 14 for reproducibility [cite: 58]
cnmf_obj = ov.single.cNMF(
    adata, 
    components=np.arange(2, 20), 
    n_iter=20, 
    seed=14, 
    num_highvar_genes=2000,
    output_dir='GRIT_Atlas/FirstRoundAnnotation/Malignant', 
    name='dg_cNMF'
)

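# Launch factorization and combine the iterative results.
# With total_workers=2, only the jobs assigned to worker 0 are run here;
# skip_missing_files=True lets combine() proceed without the remaining replicates.
cnmf_obj.factorize(worker_i=0, total_workers=2)
cnmf_obj.combine(skip_missing_files=True)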

# Generate stability and reconstruction error plots to determine optimal K [cite: 59, 332]
cnmf_obj.k_selection_plot(close_fig=False)
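
For readers unfamiliar with the underlying factorization, the sketch below shows a single NMF run on synthetic data: a non-negative matrix X (cells x genes) is approximated as W @ H, where W holds per-cell program usages and H holds per-program gene weights. It uses Lee-Seung multiplicative updates as a stand-in solver, which is not the solver used by the cNMF package; the helper name `nmf_multiplicative` and all data are hypothetical. cNMF repeats such runs with different seeds for every K and then builds a consensus.

```python
import numpy as np

def nmf_multiplicative(X, k, n_steps=200, seed=14, eps=1e-10):
    """Approximate X ~ W @ H with non-negative W, H (illustrative solver)."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    W = rng.random((n_cells, k)) + eps  # per-cell usage of each program
    H = rng.random((k, n_genes)) + eps  # gene weights of each program
    errors = []
    for _ in range(n_steps):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# Synthetic data generated from two known programs.
rng = np.random.default_rng(0)
true_usage = rng.random((30, 2))
true_programs = rng.random((2, 12))
X = true_usage @ true_programs

W, H, errors = nmf_multiplicative(X, k=2)

# The rank grid used above: np.arange excludes its upper bound.
ks = np.arange(2, 20)
print(ks[0], ks[-1])  # 2 19
```

The reconstruction error recorded in `errors` is the quantity that the K selection plot trades off against stability across replicate runs.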

## 4. Consensus Clustering and Optimal Rank Selection

Based on stability analysis, K=15 was selected as the optimal rank, providing the best trade-off between stability and reconstruction error.

In [None]:
# Perform consensus clustering at K=15
selected_K = 15
density_threshold = 0.2
cnmf_obj.consensus(
    k=selected_K, 
    density_threshold=density_threshold, 
    show_clustering=True, 
    close_clustergram_fig=False
)

# Save cNMF object for downstream analysis
import pickle
save_path = "GRIT_Atlas/FirstRoundAnnotation/Malignant/cnmf_obj.pkl"
with open(save_path, "wb") as f:
    pickle.dump(cnmf_obj, f)
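
The role of `density_threshold` can be sketched as follows. Before clustering, cNMF discards replicate components that lie far from their nearest neighbors, on the reasoning that a reproducible program should be recovered by several replicates. The code below is a simplified illustration of that idea, assuming L2-normalized spectra, Euclidean distances, and a fixed neighbor count; the function name `local_density_filter` and the toy spectra are hypothetical, and the package's exact neighbor count and normalization may differ.

```python
import numpy as np

def local_density_filter(spectra, n_neighbors, density_threshold):
    """Flag components whose mean distance to their nearest neighbors is small."""
    # L2-normalize each component so distances compare shape, not scale.
    unit = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    # Pairwise Euclidean distances between all components.
    diff = unit[:, None, :] - unit[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row; column 0 is the zero distance to itself, so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:n_neighbors + 1]
    local_density = nearest.mean(axis=1)
    keep = local_density < density_threshold
    return keep, local_density

# Two reproducible programs (4 replicates each) plus one isolated component.
rng = np.random.default_rng(14)
group_a = np.array([1.0, 0.0, 0.0]) + 0.02 * rng.random((4, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.02 * rng.random((4, 3))
outlier = np.array([[0.5, 0.5, 1.0]])
spectra = np.vstack([group_a, group_b, outlier])

keep, local_density = local_density_filter(spectra, n_neighbors=2, density_threshold=0.2)
print(keep)
```

In this toy case the eight replicated components pass the filter and the isolated one is removed. A lower threshold is stricter, and the clustergram shown by `show_clustering=True` is the usual way to check that the chosen value is sensible.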

## 5. Result Extraction and Visualization

We extract the identified programs and project the cNMF clusters onto the UMAP embedding. This allows us to define the 11 functional states, including the cNMF7 (MES-like) resistance state.

In [None]:
# Reload data and cNMF results
adata = sc.read_h5ad("GRIT_Atlas/FirstRoundAnnotation/Malignant/OVprocess_adata.h5ad")
with open("GRIT_Atlas/FirstRoundAnnotation/Malignant/cnmf_obj.pkl", "rb") as f:
    cnmf_obj = pickle.load(f)

# Assign programs to individual cells [cite: 123]
result_dict = cnmf_obj.load_results(K=selected_K, density_threshold=density_threshold)
cnmf_obj.get_results(adata, result_dict)

# Visualization of malignant states
ov.pl.embedding(
    adata,
    basis="X_umap",
    color=['cNMF_cluster'],
    frameon='small',
    legend_fontsize=14,
    legend_fontoutline=2,
    add_outline=False, 
    show=False,
)

adata.write("GRIT_Atlas/FirstRoundAnnotation/Malignant/adata_concat_integrate_fullgenes_cNMF.h5ad")
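
As an illustration of how a per-cell label such as `cNMF_cluster` can be derived from program usages, the sketch below normalizes each cell's usage vector to sum to one and assigns the cell to its dominant program. That Omicverse derives `cNMF_cluster` exactly this way is an assumption; the usage values and the 1-based numbering are chosen for illustration.

```python
import numpy as np

# Hypothetical usage matrix: 3 cells (rows) x 3 programs (columns).
usage = np.array([
    [0.2, 1.8, 0.0],
    [3.0, 0.5, 0.5],
    [0.1, 0.1, 0.8],
])

# Normalize so that each cell's usages sum to 1.
usage_norm = usage / usage.sum(axis=1, keepdims=True)

# Assign every cell to the program with the highest normalized usage.
# Programs are numbered from 1 here.
dominant_program = usage_norm.argmax(axis=1) + 1

print(dominant_program)
```

A hard assignment like this is convenient for plotting, but cells with several comparably used programs are better examined through the normalized usage values themselves.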