## 06_2. Consensus Non-negative Matrix Factorization (cNMF)

<div style="text-align: left;">
    <p style="text-align: left;">Updated Time: 2025-02-13</p>
</div>

cNMF is an analysis pipeline for inferring gene expression programs from single-cell RNA-Seq (scRNA-Seq) data.

It takes a count matrix (N cells x G genes) as input and produces a (K x G) matrix of gene expression programs (GEPs) and an (N x K) matrix specifying the usage of each program in each cell. You can read more about the method on its GitHub page and check out the examples on dentategyrus.
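To make these shapes concrete, here is a minimal sketch of a single NMF run on a toy count matrix. It assumes plain multiplicative updates rather than whatever solver cNMF uses internally, and the function name `nmf_multiplicative` and the toy sizes are hypothetical. cNMF repeats factorizations like this many times and combines them into a consensus.

```python
import numpy as np

def nmf_multiplicative(X, k, n_steps=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix X (N x G) into usage (N x k) and spectra (k x G)."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    usage = rng.random((n_cells, k)) + eps
    spectra = rng.random((k, n_genes)) + eps
    for _ in range(n_steps):
        # Multiplicative updates keep both factors non-negative.
        spectra *= (usage.T @ X) / (usage.T @ usage @ spectra + eps)
        usage *= (X @ spectra.T) / (usage @ spectra @ spectra.T + eps)
    return usage, spectra

# Toy data: 60 cells, 30 genes, generated from 3 hidden programs.
rng = np.random.default_rng(1)
true_usage = rng.random((60, 3))
true_spectra = rng.random((3, 30)) * 10
counts = rng.poisson(true_usage @ true_spectra).astype(float)

usage, spectra = nmf_multiplicative(counts, k=3)
print(usage.shape)    # (60, 3): N cells x K programs
print(spectra.shape)  # (3, 30): K programs x G genes

# Normalize each cell's usage to sum to 1, analogous to a normalized usage table.
usage_norm = usage / usage.sum(axis=1, keepdims=True)
```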

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import scanpy as sc
import omicverse as ov
from omicverse.externel import VIA

import matplotlib.pyplot as plt
ov.plot_set()

import warnings
warnings.simplefilter("ignore")

##### Set working directory for analysis

In [None]:
cwd = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(cwd)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

#### Reading in annotated AnnData object

In [None]:
adata_epi = sc.read_h5ad("Processed Data/scRNA_Epi_CNV_Traj.h5ad")
adata_epi

In [None]:
for i in adata_epi.obs['Epi_celltype'].cat.categories:
  number = len(adata_epi.obs[adata_epi.obs['Epi_celltype']==i])
  print('the number of category {} is {}'.format(i,number))

In [None]:
# Store a copy so that later in-place changes to .X do not alter the counts layer
adata_epi.layers['counts']=adata_epi.X.copy()
print(np.min(adata_epi.layers['counts']), np.max(adata_epi.layers['counts']))

In [None]:
ov.utils.embedding(adata_epi,basis='X_umap',
                   color=['Epi_celltype'],
                   frameon='small',cmap='Reds',wspace=0.55)

##### Preprocess data


In [None]:
adata_epi=ov.pp.preprocess(adata_epi,mode='shiftlog|pearson',n_HVGs=2000)
adata_epi = adata_epi[:, adata_epi.var.highly_variable_features]
ov.pp.scale(adata_epi)
ov.pp.pca(adata_epi,layer='scaled',n_pcs=50)

## Initialize and train the model

In [None]:
## Initialize the cnmf object that will be used to run analyses
cnmf_obj = ov.single.cNMF(adata_epi,components=np.arange(6,16), n_iter=100, seed=14, num_highvar_genes=2000, output_dir='Results/06.Epithelial/cNMF_Results', name='dg_cNMF')

In [None]:
## Specify that the jobs are being distributed over a single worker (total_workers=1) and then launch that worker
cnmf_obj.factorize(worker_i=0, total_workers=1)

In [None]:
cnmf_obj.combine(skip_missing_files=True)

## Compute the stability and error at each choice of K to see if a clear choice jumps out.

Please note that the maximum-stability solution is not always the best choice, depending on the application. However, it is often a good starting point, even if you have to investigate several choices of K.

In [None]:
cnmf_obj.k_selection_plot(close_fig=False)

In this range, K=10 gave the most stable solution so we will begin by looking at that.

The next step computes the consensus solution for a given choice of K. We first run it without any outlier filtering to see what that looks like. Setting the density threshold to anything >= 2.00 (the maximum possible distance between two unit vectors) ensures that nothing will be filtered.

Then we run the consensus again with an outlier filter, choosing the threshold by inspecting the histogram of distances between components and their nearest neighbors.
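The filtering idea can be sketched with NumPy alone. This is an illustration under stated assumptions, not the cNMF implementation: the helper `local_density`, the neighbor count, and the synthetic replicate spectra are hypothetical. It scores each L2-normalized spectrum by its mean distance to its nearest neighbors and compares a threshold of 2.00 with a stricter one.

```python
import numpy as np

def local_density(spectra, n_neighbors):
    """Mean Euclidean distance from each L2-normalized spectrum to its nearest neighbors."""
    unit = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)
    # Pairwise Euclidean distances between unit vectors lie in [0, 2].
    sq = np.clip(2.0 - 2.0 * unit @ unit.T, 0.0, None)
    dist = np.sqrt(sq)
    np.fill_diagonal(dist, np.inf)  # ignore each spectrum's distance to itself
    nearest = np.sort(dist, axis=1)[:, :n_neighbors]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
# Hypothetical replicate spectra: 20 tight copies of one program plus 2 outliers.
base = rng.random(50)
replicates = base + 0.01 * rng.random((20, 50))
outliers = rng.random((2, 50))
spectra = np.vstack([replicates, outliers])

density = local_density(spectra, n_neighbors=5)
keep_all = density < 2.00   # threshold at the maximum distance: nothing is removed
keep_some = density < 0.1   # stricter threshold: drops spectra far from their neighbors
print(keep_all.sum(), keep_some.sum())
```

Spectra that sit in a tight cluster of replicates have a small mean neighbor distance and survive the stricter threshold, while isolated spectra are dropped before clustering.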

In [None]:
selected_K = 10
density_threshold = 2.00

In [None]:
cnmf_obj.consensus(k=selected_K, 
                   density_threshold=density_threshold, 
                   show_clustering=True, 
                   close_clustergram_fig=False)

The consensus plot above shows substantial concordance between the replicates, with a few outliers. An outlier threshold of 0.1 seems appropriate.

In [None]:
density_threshold = 0.1

In [None]:
cnmf_obj.consensus(k=selected_K, 
                   density_threshold=density_threshold, 
                   show_clustering=True, 
                   close_clustergram_fig=False)

## Visualize the result

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import gridspec, patheffects

width_ratios = [0.2, 4, 0.5, 10, 1]
height_ratios = [0.2, 4]
fig = plt.figure(figsize=(sum(width_ratios), sum(height_ratios)))
gs = gridspec.GridSpec(len(height_ratios), len(width_ratios), fig,
                        0.01, 0.01, 0.98, 0.98,
                       height_ratios=height_ratios,
                       width_ratios=width_ratios,
                       wspace=0, hspace=0)
            
D = cnmf_obj.topic_dist[cnmf_obj.spectra_order, :][:, cnmf_obj.spectra_order]
dist_ax = fig.add_subplot(gs[1,1], xscale='linear', yscale='linear',
                                      xticks=[], yticks=[],xlabel='', ylabel='',
                                      frameon=True)
dist_im = dist_ax.imshow(D, interpolation='none', cmap='magma',
                         aspect='auto', rasterized=True)

left_ax = fig.add_subplot(gs[1,0], xscale='linear', yscale='linear', xticks=[], yticks=[],
                xlabel='', ylabel='', frameon=True)
left_ax.imshow(cnmf_obj.kmeans_cluster_labels.values[cnmf_obj.spectra_order].reshape(-1, 1),
                            interpolation='none', cmap='Spectral', aspect='auto',
                            rasterized=True)

top_ax = fig.add_subplot(gs[0,1], xscale='linear', yscale='linear', xticks=[], yticks=[],
                xlabel='', ylabel='', frameon=True)
top_ax.imshow(cnmf_obj.kmeans_cluster_labels.values[cnmf_obj.spectra_order].reshape(1, -1),
                  interpolation='none', cmap='Spectral', aspect='auto',
                    rasterized=True)

cbar_gs = gridspec.GridSpecFromSubplotSpec(3, 3, subplot_spec=gs[1, 2],
                                   wspace=0, hspace=0)
cbar_ax = fig.add_subplot(cbar_gs[1,2], xscale='linear', yscale='linear',
    xlabel='', ylabel='', frameon=True, title='Euclidean\nDistance')
cbar_ax.set_title('Euclidean\nDistance',fontsize=12)
vmin = D.min().min()
vmax = D.max().max()
fig.colorbar(dist_im, cax=cbar_ax,
        ticks=np.linspace(vmin, vmax, 3),
        )
cbar_ax.set_yticklabels(cbar_ax.get_yticklabels(),fontsize=12)

# Save to PDF
fig.savefig("Results/06.Epithelial/cNMF_Results/cnmf_clustergram.pdf", format='pdf', bbox_inches='tight')

In [None]:
density_filter = cnmf_obj.local_density.iloc[:, 0] < density_threshold
fig, hist_ax = plt.subplots(figsize=(5,5))

#hist_ax = fig.add_subplot(hist_gs[0,0], xscale='linear', yscale='linear',
 #   xlabel='', ylabel='', frameon=True, title='Local density histogram')
hist_ax.hist(cnmf_obj.local_density.values, bins=np.linspace(0, 1, 50))
hist_ax.yaxis.tick_right()

xlim = hist_ax.get_xlim()
ylim = hist_ax.get_ylim()
if density_threshold < xlim[1]:
    hist_ax.axvline(density_threshold, linestyle='--', color='k')
    hist_ax.text(density_threshold  + 0.02, ylim[1] * 0.95, 'filtering\nthreshold\n\n', va='top')
hist_ax.set_xlim(xlim)
hist_ax.set_xlabel('Mean distance to k nearest neighbors\n\n%d/%d (%.0f%%) spectra above threshold\nwere removed prior to clustering'%(sum(~density_filter), len(density_filter), 100*(~density_filter).mean()))
hist_ax.set_title('Local density histogram')

# Save to PDF
fig.savefig("Results/06.Epithelial/cNMF_Results/cnmf_local_density_histogram.pdf", format='pdf', bbox_inches='tight')

## Explore the cNMF result

We can load the results of a cNMF run for a given K and density filtering threshold as shown below.

In [None]:
result_dict = cnmf_obj.load_results(K=selected_K, density_threshold=density_threshold)


In [None]:
result_dict['usage_norm'].head()

In [None]:
result_dict['gep_scores'].head()

In [None]:
result_dict['gep_tpm'].head()

In [None]:
result_dict['top_genes'].head()

We can assign cell classes directly from the program with the highest cNMF usage in each cell, but this has the disadvantage of producing mixed cell classes when the heterogeneity in the data is weak.
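As a minimal sketch of this direct assignment, assuming a small hypothetical usage table in place of `result_dict['usage_norm']`, each cell takes the label of its highest-usage program. The maximum usage is kept alongside the label to show how clear-cut each assignment is.

```python
import pandas as pd

# Hypothetical normalized usage table: rows are cells, columns are programs.
usage_norm = pd.DataFrame(
    {
        "cNMF_1": [0.80, 0.40, 0.10],
        "cNMF_2": [0.15, 0.35, 0.20],
        "cNMF_3": [0.05, 0.25, 0.70],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

# Assign each cell to the program with the highest usage.
hard_label = usage_norm.idxmax(axis=1)
# The maximum usage indicates how dominant that program is in the cell.
max_usage = usage_norm.max(axis=1)

print(pd.DataFrame({"label": hard_label, "max_usage": max_usage}))
```

Here `cell_b` is labelled `cNMF_1` even though that program accounts for only 0.40 of its usage, which is the kind of mixed assignment the paragraph above warns about.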

In [None]:
cnmf_obj.get_results(adata_epi,result_dict)

In [None]:
sc.settings.figdir = "Results/06.Epithelial/cNMF"

ov.pl.embedding(adata_epi, basis='X_umap', color=result_dict['usage_norm'].columns,
                use_raw=False, ncols=5, vmin=0, vmax=1, frameon='small', save="_cnmf.pdf")


In [None]:
ov.pl.embedding(
    adata_epi,
    basis="X_umap",
    color=['cNMF_cluster'],
    frameon='small',
    #title="Celltypes",
    #legend_loc='on data',
    legend_fontsize=14,
    legend_fontoutline=2,
    #size=10,
    #legend_loc=True, 
    add_outline=False, 
    #add_outline=True,
    outline_color='black',
    outline_width=1,
    show=False,
)

Here we propose another approach to categorisation. We treat cells whose cNMF usage is greater than 0.5 as a primitive class, train a random forest classifier on them, and then use that classifier to label the cells whose usage is below 0.5, which should give a more accurate assignment.
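The idea can be sketched with scikit-learn on synthetic data. This is an illustration of the two-step scheme, not the code behind `get_results_rfc`: the feature matrix, the usage table, and every variable name below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cells, n_features, n_programs = 300, 10, 3

# Hypothetical inputs: a feature matrix (e.g. PCA coordinates) and normalized usages.
true_program = rng.integers(0, n_programs, size=n_cells)
features = rng.normal(size=(n_cells, n_features))
features += 3.0 * np.eye(n_programs, n_features)[true_program]
raw = rng.random((n_cells, n_programs))
raw[np.arange(n_cells), true_program] += 3.0 * rng.random(n_cells)
usage = raw / raw.sum(axis=1, keepdims=True)

# Step 1: cells whose top program exceeds the threshold form the training set.
labels = usage.argmax(axis=1)
confident = usage.max(axis=1) > 0.5

# Step 2: train on confident cells, then relabel the ambiguous ones.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features[confident], labels[confident])

refined = labels.copy()
if (~confident).any():
    refined[~confident] = clf.predict(features[~confident])

print(f"{confident.sum()} confident cells, {(~confident).sum()} relabelled by the classifier")
```

Confident cells keep their original label, and only the ambiguous cells are reassigned from their features, so the result depends on which representation is passed in.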

In [None]:
cnmf_obj.get_results_rfc(adata_epi,result_dict,
                         use_rep='scaled|original|X_pca',
                        cNMF_threshold=0.5)

In [None]:
ov.pl.embedding(
    adata_epi,
    basis="X_umap",
    color=['cNMF_cluster_rfc','cNMF_cluster_clf'],
    frameon='small',
    #title="Celltypes",
    #legend_loc='on data',
    legend_fontsize=14,
    legend_fontoutline=2,
    #size=10,
    #legend_loc=True, 
    add_outline=False, 
    #add_outline=True,
    outline_color='black',
    outline_width=1,
    show=False,
)

In [None]:
result_dict['top_genes'].head(10)

In [None]:
plot_genes=[]
for i in result_dict['top_genes'].columns:
    plot_genes+=result_dict['top_genes'][i][:5].values.reshape(-1).tolist()

In [None]:
sc.pl.dotplot(adata_epi,plot_genes, "cNMF_cluster_rfc", dendrogram=False,standard_scale='var',)

In [None]:
import matplotlib.pyplot as plt
import scanpy as sc

keys = ['cNMF_2', 'cNMF_3', 'cNMF_4', 'cNMF_5', 'cNMF_6', 'cNMF_7', 'cNMF_8', 'cNMF_9', 'cNMF_10']
ncols = 3
nrows = (len(keys) + ncols - 1) // ncols 

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols * 4, nrows * 4))
axes = axes.flatten() 

for i, key in enumerate(keys):
    sc.pl.violin(
        adata_epi,
        keys=key,
        groupby='Epi_celltype',
        stripplot=False,
        inner='box',
        rotation=45,
        ax=axes[i],
        show=False
    )
    
for j in range(len(keys), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.savefig("Results/06.Epithelial/cNMF_Results/cNMF2_to_10_violin_combined.pdf", dpi=300, bbox_inches='tight')
plt.show()


In [None]:
fig2, ax1 = plt.subplots(figsize=(4.5, 4))
ov.pl.bardotplot(adata_epi, groupby='Epi_celltype', color='cNMF_1', figsize=(4, 4), 
                 ax=ax1,
                 ylabel='Expression',
                 bar_kwargs={'alpha': 0.5, 'linewidth': 2, 'width': 0.6, 'capsize': 4},
                 scatter_kwargs={'alpha': 0.8, 's': 10, 'marker': '.'})

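# NOTE: the p-value text in the annotations below is hard-coded (0.001);
# replace it with values from an actual statistical test before reporting.
ov.pl.add_palue(ax1, line_x1=0, line_x2=4, line_y=0.7,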
                text_y=0.02,
                text='$p={}$'.format(round(0.001, 3)),
                fontsize=12, fontcolor='#000000', 
                horizontalalignment='center')
ov.pl.add_palue(ax1, line_x1=3, line_x2=4, line_y=0.6,
                text_y=0.02,
                text='$p={}$'.format(round(0.001, 3)),
                fontsize=12, fontcolor='#000000', 
                horizontalalignment='center')
ov.pl.add_palue(ax1, line_x1=1, line_x2=2, line_y=0.5,
                text_y=0.02,
                text='$p={}$'.format(round(0.001, 3)),
                fontsize=12, fontcolor='#000000',
                horizontalalignment='center')
ax1.tick_params(axis='x', labelrotation=45, labelsize=12)
ax1.set_ylabel('GEP1 Score', fontsize=12)
ax1.set_title('')
plt.tight_layout(pad=0.1)

violin_file = 'Results/06.Epithelial/cNMF_Results/cNMF_1_Score_Violin_Plot.pdf'
plt.savefig(violin_file, dpi=300, bbox_inches='tight')
print(f"Violin plot saved as: {violin_file}")
plt.show()
plt.close()

In [None]:
import os
import matplotlib.pyplot as plt

out_dir = "Results/06.Epithelial/cNMF_Results"
os.makedirs(out_dir, exist_ok=True)

geps = [f"cNMF_{k}" for k in range(2, 11)]

for gep in geps:
    fig, ax = plt.subplots(figsize=(4.5, 4))

    ov.pl.violin(
        adata_epi,
        keys=gep,
        groupby='Epi_celltype',
        ax=ax,
        rotation=45,     
        stripplot=True,  
        jitter=True
    )

    ax.set_ylabel(f'{gep.replace("cNMF_", "GEP")} Score', fontsize=12)
    ax.set_title('')
    plt.tight_layout(pad=0.1)

    out_file = os.path.join(out_dir, f'{gep}_Score_Violin_Plot.pdf')
    plt.savefig(out_file, dpi=300, bbox_inches='tight')
    print(f"Saved: {out_file}")
    plt.close(fig)

rows, cols = 3, 3
fig, axes = plt.subplots(rows, cols, figsize=(4.5*cols, 4*rows))
axes = axes.flatten()

for i, gep in enumerate(geps):
    ax = axes[i]

    ov.pl.violin(
        adata_epi,
        keys=gep,
        groupby='Epi_celltype',
        ax=ax,
        rotation=45,
        stripplot=True,
        jitter=True
    )

    ax.set_title(gep.replace("cNMF_", "GEP"), fontsize=12)
    ax.set_ylabel('')

for j in range(len(geps), rows*cols):
    fig.delaxes(axes[j])

plt.tight_layout(pad=0.6, w_pad=0.6, h_pad=0.8)
panel_file = os.path.join(out_dir, 'GEP2-10_Violin_Panel_3x3.pdf')
plt.savefig(panel_file, dpi=300, bbox_inches='tight')
print(f"Panel saved: {panel_file}")
plt.show()
plt.close(fig)


In [None]:
topgenes_df = pd.DataFrame(result_dict['top_genes'])

In [None]:
topgenes_df.head(10)

In [None]:
topgenes_df.to_csv('Results/06.Epithelial/cNMF_Results/cNMF_topgenes_df.csv', index=False)

In [None]:
gep1 = topgenes_df[1].squeeze().str.strip().to_list()[:100]
gep7 = topgenes_df[7].squeeze().str.strip().to_list()[:100]

In [None]:
# %matplotlib inline
# %config InlineBackend.figure_format='retina' # mac
%load_ext autoreload
%autoreload 2
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

In [None]:
# List the Enrichr gene-set libraries available for human
human_libraries = gp.get_library_name(organism='Human')
human_libraries

In [None]:
# Over-representation analysis via Enrichr web services
# This is an example of the Enrichr analysis
# NOTE: 1. Enrichr web services need gene symbols as input 2. Gene symbols are converted to upper case automatically. 3. (Optional) Provide a user-defined background gene list

# Enrichr web services (without a background input)
# if you are only interested in the dataframe that Enrichr returns, set outdir=None

enr1 = gp.enrichr(gene_list=gep1, # or "./tests/data/gene_list.txt",
                 gene_sets=['MSigDB_Hallmark_2020','KEGG_2021_Human'],
                 organism='human', # don't forget to set organism to the one you desired! e.g. Yeast
                 outdir=None, # don't write to disk
                )

enr7 = gp.enrichr(gene_list=gep7, # or "./tests/data/gene_list.txt",
                 gene_sets=['MSigDB_Hallmark_2020','KEGG_2021_Human'],
                 organism='human', # don't forget to set organism to the one you desired! e.g. Yeast
                 outdir=None, # don't write to disk
                )
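Over-representation of a gene list in a gene set is commonly scored with a hypergeometric tail probability, which is equivalent to a one-sided Fisher exact test. The sketch below shows that calculation with the standard library only. It is an illustration with hypothetical numbers and a hypothetical helper name, and it makes no claim about how the Enrichr service computes or adjusts the values it returns.

```python
from math import comb

def overrepresentation_pvalue(n_background, n_in_set, n_query, n_overlap):
    """P(overlap >= n_overlap) when drawing n_query genes without replacement."""
    total = comb(n_background, n_query)
    upper = min(n_in_set, n_query)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_query - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a 200-gene pathway,
# and a 100-gene query list of which 10 fall in the pathway.
p = overrepresentation_pvalue(20000, 200, 100, 10)
print(f"p = {p:.3g}")
```

With these numbers only about one overlapping gene is expected by chance, so an overlap of 10 gives a very small p-value. The choice of background size changes the result, which is why a user-defined background list can matter.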

In [None]:
# obj.results stores all results
enr1.results.head(10)

In [None]:
# simple plotting function
from gseapy import barplot, dotplot

In [None]:
# categorical scatterplot
ax1 = dotplot(enr1.results,
              column="P-value",
              x='Gene_set', # set x axis, so you could do a multi-sample/library comparison
              size=20,
              top_term=5,
              cmap = "magma",
              figsize=(3,5),
              title = "GEP1",
              xticklabels_rot=45, # rotate xtick labels
              show_ring=True, # set to False to remove outer ring
              marker='o',
             )

In [None]:
# bar plot of the top enriched terms
ax1 = barplot(enr1.results,
              column="P-value",
              group='Gene_set', # set group, so you could do a multi-sample/library comparison
              size=10,
              top_term=5,
              title = "GEP1",
              figsize=(3,5),
              color=['darkred', 'darkblue'] # set colors for group
              # color = {'KEGG_2021_Human': 'salmon', 'MSigDB_Hallmark_2020':'darkblue'}
             )

plt.tight_layout()
plt.savefig("Results/06.Epithelial/cNMF_Results/GEP1_barplot.pdf", format='pdf', bbox_inches='tight', dpi=300)

plt.show()
plt.close()

In [None]:
# bar plot of the top enriched terms
ax1 = barplot(enr7.results,
              column="P-value",
              group='Gene_set', # set group, so you could do a multi-sample/library comparison
              size=10,
              top_term=5,
              title = "GEP7",
              figsize=(3,5),
              color=['darkred', 'darkblue'] # set colors for group
              # color = {'KEGG_2021_Human': 'salmon', 'MSigDB_Hallmark_2020':'darkblue'}
             )

plt.tight_layout()
plt.savefig("Results/06.Epithelial/cNMF_Results/GEP7_barplot.pdf", format='pdf', bbox_inches='tight', dpi=300)

plt.show()
plt.close()