# MIRA Topic Analysis

With trained topic models and a Joint-KNN representation of the data, we can analyze the topics to understand the regulatory dynamics present within a sample. Expression topics may be analyzed with functional enrichments of the top genes activated in a given topic/module. Accessibility topics correspond to a set of coordinated cis-regulatory elements, and may be analyzed to find emergent transcription factor regulators of particular cell states.

This tutorial will cover predicting factor binding and analyzing topic modules in both modes. First, we import packages:

In [None]:
!hostnamectl

In [None]:
import mira
import anndata
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
import math
import matplotlib.pyplot as plt
matplotlib.rc('font',size=12)
import logging
import warnings
warnings.simplefilter("ignore")
mira.utils.pretty_sderr()

Next, we need to load our datasets and models from the joint representation step.

In [None]:
rna_adata = anndata.read_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_rna_data_joint_representation.h5ad")
atac_adata = anndata.read_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_data_joint_representation.h5ad")

rna_model = mira.topics.load_model("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_rna_model.pth")
atac_model = mira.topics.load_model("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_model.pth")

We pick up from the previous tutorial on making the joint representation, in which we constructed a UMAP view of the data. Plotting the topics on the UMAP shows how they track cellular heterogeneity or differentiation:

In [None]:
rna_adata

In [None]:
# Keep the expression topic columns and skip the ATAC topic columns
topics = [i for i in rna_adata.obs if "topic" in i and "ATAC" not in i]
sc.pl.umap(rna_adata, color  = topics, frameon=False, ncols=5,
          color_map = 'magma')

## Expression Topic Analysis

We can plot expression patterns of genes that are activated by these topics. To get the top genes associated with a topic:

In [None]:
rna_model.get_top_genes(6, top_n=2)
rna_model.get_top_genes(9, top_n=2)

And plotting:

In [None]:
sc.pl.umap(rna_adata, color = rna_model.get_top_genes(6, top_n=2), **mira.pref.raw_umap(ncols=3, size=24))

Above, the `mira.pref.raw_umap` function simply provides default values to the Scanpy plotting function to make easily readable plots for normalized expression values.

Let’s see what functional enrichments represent these topics. MIRA uses Enrichr to get functional enrichments for each topic by posting the `top_n` genes associated with a topic to their API. You can change the number of genes sent, or output genes sorted in order of activation by the topic for rank-based functional enrichments (like GSEApy).

To post a topic’s top genes to Enrichr, use `post_topic`, or use `post_topics` to post all topics’ lists at once.

**Note**: A good rule of thumb for setting `top_n` genes is to take the top 5% of genes modeled by the expression topic model.

In [None]:
# Genes are the columns (variables) of the AnnData matrix; shape[0] would count cells
num_genes = rna_adata.n_vars
top_n_genes = math.ceil(num_genes * 0.05)
print(top_n_genes)

rna_model.post_topic(6, top_n=top_n_genes)
rna_model.post_topic(9, top_n=top_n_genes)

To retrieve a list of genes sorted from least activated to most activated for GSEA, use:

In [None]:
rna_model.rank_genes(6)

To download the enrichment results, run `fetch_topic_enrichments`, or similarly run `fetch_enrichments` to download results for all topics. Here, you may provide a list of ontologies to compare against. The available ontologies are listed on the Enrichr website.

In [None]:
rna_model.fetch_topic_enrichments(6, ontologies= ['WikiPathways_2019_Mouse'])
rna_model.fetch_topic_enrichments(9, ontologies= ['WikiPathways_2019_Mouse'])

To analyze enrichments, you can use:

In [None]:
rna_model.plot_enrichments(6, show_top=5)

You can compare enrichments against a pre-compiled list of genes-of-interest, for example, a list of transcription factors, using the `label_genes` parameter. If genes in this list appear in the enrichment plot, they are labeled with a *.

In [None]:
rna_model.plot_enrichments(9, show_top=5, plots_per_row=1,
                           label_genes=['CDK1','PIGF'])

For a full list of parameters, see `plot_enrichments`. You can also access the enrichment data using `get_enrichments`:

In [None]:
pd.DataFrame(
    rna_model.get_enrichments(9)['WikiPathways_2019_Mouse']
).head(3)

In [None]:
rna_adata

## Accessibility Topic Analysis

Next, we will find transcription factor enrichments in accessibility topics. First, visualize the cell states represented by some topics:

In [None]:
topics = [i for i in rna_adata.obs if "ATAC_topic" in i]
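# Topic scores are continuous, so pass the colormap via color_map (palette is for categorical data)
sc.pl.umap(rna_adata, color = topics, frameon=False, color_map='viridis', ncols=6)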

Different ATAC topics can have very different associations. It would be interesting to compare and contrast the transcription factors influential in two such cell states; the cells below use topics 3 and 9.

First, we must annotate transcription factor binding sites in our peaks using motif scanning. For this, we need the FASTA sequence of the organism’s genome. Sequences may be downloaded from the UCSC repository.

In [None]:
!mkdir -p /gpfs/Home/esm5360/MIRA/data
!wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz -O /gpfs/Home/esm5360/MIRA/data/mm10.fa.gz
!gzip -d -f /gpfs/Home/esm5360/MIRA/data/mm10.fa.gz
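
Before scanning motifs, it can help to confirm that the decompressed genome file is where the later cells expect it. The sketch below is a minimal check, assuming only that a valid FASTA file begins with a `>` header line. The helper name `looks_like_fasta` is hypothetical and is not part of MIRA.

```python
from pathlib import Path


def looks_like_fasta(path):
    """Return True if `path` exists and its first non-empty line is a FASTA header."""
    fasta = Path(path)
    if not fasta.is_file():
        return False
    with fasta.open() as handle:
        for line in handle:
            if line.strip():
                # A FASTA record starts with a ">" header line
                return line.startswith(">")
    return False  # the file exists but is empty
```

For example, `looks_like_fasta("/gpfs/Home/esm5360/MIRA/data/mm10.fa")` should return `True` once the download and decompression have finished.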

We must also ensure that we indicate the correct columns in the ATAC AnnData object corresponding to the chromosome, start, and end locations of each peak.

In [None]:
atac_adata = anndata.read_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_data_joint_representation.h5ad")

`atac_adata.var` needs to have `peak_id`, `chr`, `start`, and `end` columns corresponding to the peak locations for motif scanning.

In [None]:
peak_locations = atac_adata.var.index

# Add the coordinate columns only if .var does not already have all of them
if not all(col in atac_adata.var.columns for col in ("chr", "start", "end")):
    peak_data = {
        "peak_id": [],
        "chr": [],
        "start": [],
        "end": []
    }
    for i, peak in enumerate(peak_locations):
        peak_id = i
        chr_num = peak.split(":")[0]
        peak_start = int(peak.split(":")[1].split("-")[0])
        peak_end = int(peak.split(":")[1].split("-")[1])
        
        peak_data["peak_id"].append(peak_id)
        peak_data["chr"].append(chr_num)
        peak_data["start"].append(peak_start)
        peak_data["end"].append(peak_end)
        
    peak_df = pd.DataFrame(peak_data, index=peak_locations)
    atac_adata.var = pd.concat([atac_adata.var, peak_df], axis=1)
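
The loop above assumes that every peak name has the layout `chr:start-end`. As a sketch of a stricter alternative, the hypothetical helper below (`parse_peak_name`, not part of MIRA) validates that layout with a regular expression and raises a clear error for names that do not fit, instead of failing with an opaque index or conversion error.

```python
import re

# Assumed peak-name layout: "<chrom>:<start>-<end>", e.g. "chr1:3000-3500"
PEAK_PATTERN = re.compile(r"^(?P<chr>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_name(peak):
    """Split a peak name into (chromosome, start, end) with integer coordinates."""
    match = PEAK_PATTERN.match(peak)
    if match is None:
        raise ValueError(f"Unrecognized peak name: {peak!r}")
    start, end = int(match["start"]), int(match["end"])
    if start >= end:
        raise ValueError(f"Peak start must be smaller than end: {peak!r}")
    return match["chr"], start, end
```

Calling `parse_peak_name("chr1:3000-3500")` returns `("chr1", 3000, 3500)`, which maps directly onto the `chr`, `start`, and `end` columns.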

In [None]:
atac_adata.var

Now, use the function `mira.tl.get_motif_hits_in_peaks`, which will scan the sequence of each peak against the JASPAR 2020 vertebrates collection of motifs. Facilities for scanning user-defined motifs and other motif databases will be added in the future.

I ran into an issue where `moods-dna.py` was installed in the environment but was not on the `PATH`. The code below fixed it by prepending the environment's `bin` directory:

In [None]:
import os

os.environ["PATH"] = os.pathsep.join([
    os.path.expanduser("~/miniconda3/envs/mira-env/bin"),
    os.environ["PATH"]
])
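
To confirm that the fix took effect, you can ask Python where it would find the executable. This is a small sketch using the standard library's `shutil.which`; the helper name `find_executable` and its `extra_dirs` argument are hypothetical conveniences, not part of MIRA.

```python
import os
import shutil


def find_executable(name, extra_dirs=()):
    """Return the full path of `name`, searching `extra_dirs` before PATH, or None."""
    search_dirs = [os.path.expanduser(d) for d in extra_dirs]
    search_dirs.append(os.environ.get("PATH", ""))
    return shutil.which(name, path=os.pathsep.join(search_dirs))
```

For example, `find_executable("moods-dna.py")` returns `None` when the script is still not discoverable, and its full path otherwise.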


In [None]:
atac_adata

In [None]:
atac_adata.write_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_data_joint_representation_peak_format.h5ad")

In [None]:
mira.tools.motif_scan.logger.setLevel(logging.INFO) # make sure progress messages are displayed
mira.tl.get_motif_hits_in_peaks(atac_adata,
                    genome_fasta='/gpfs/Home/esm5360/MIRA/data/mm10.fa',
                    chrom = 'chr', start = 'start', end = 'end',
                    pvalue_threshold=1e-4
                    ) # indicate chrom, start, end of peaks

The function above loads the motif hits into a (n_factors x n_peaks) sparse matrix in `.varm['motif_hits']`, where values are the MOODS3 “Match Score” given a motif PWM and the peak’s sequence. Matches that do not meet the p-value threshold are filtered out.

The metadata on the motifs scanned are stored in `.uns['motifs']`, and can be accessed by `mira.utils.fetch_factor_meta`.

In [None]:
mira.utils.fetch_factor_meta(atac_adata)

Motif calling often includes many factors that may be irrelevant to the current system. Usually, it is convenient to filter out TFs for which we do not have expression data. Below, we use `mira.utils.subset_factors` to filter out TFs that do not have any associated data in the `rna_adata` object (in addition to AP1, since these motifs clog up the plots we’re about to make).

**Important: Do not filter out TFs on the basis of mean expression or dispersion, as many TFs can influence cell state without being variably expressed.**

This function marks certain factors as not to be used, but does not remove them from the AnnData. This way, you can use a different filter or include different factors in your analysis without re-calling motifs.

In [None]:
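# Upper-case the gene names before testing for FOS/JUN so that
# mixed-case symbols such as "Fos" or "Junb" are also excluded
mira.utils.subset_factors(atac_adata,
                          use_factors=[factor.upper() for factor in rna_adata.var_names
                                       if not ('FOS' in factor.upper() or 'JUN' in factor.upper())])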
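
Mouse gene symbols are usually written in mixed case (`Fos`, `Junb`), so a substring test against `'FOS'` or `'JUN'` only matches after upper-casing. The standalone sketch below illustrates that selection logic on a toy gene list; `select_factors` is a hypothetical helper, not a MIRA function.

```python
def select_factors(gene_names, exclude_substrings=("FOS", "JUN")):
    """Upper-case gene symbols and drop any that contain an excluded substring."""
    selected = []
    for gene in gene_names:
        symbol = gene.upper()
        # Compare against the upper-cased symbol so "Fos" and "FOS" behave the same
        if any(sub in symbol for sub in exclude_substrings):
            continue
        selected.append(symbol)
    return selected


example = select_factors(["Gata3", "Fos", "Junb", "Lef1"])
print(example)  # ['GATA3', 'LEF1']
```

The resulting list has the same form as the `use_factors` argument in the cell above.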

With motifs called and a trained topic model, we find which motifs are enriched in each topic:

In [None]:
atac_model.get_enriched_TFs(atac_adata, topic_num=3, top_quantile=0.1)
atac_model.get_enriched_TFs(atac_adata, topic_num=9, top_quantile=0.1)

The `top_quantile` parameter of the function above controls what quantile of peaks is taken to represent the topic. Values between 0.1 and 0.2, that is, the top 10% to 20% of peaks, work best. If a certain topic is enriching for non-specific factors, decrease the quantile to select more topic-specific peaks.
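
To make the role of `top_quantile` concrete, the sketch below picks the highest-scoring fraction of peaks from a list of per-peak topic scores. It illustrates the idea only and is not MIRA's implementation; the function name `top_quantile_peaks` and the toy scores are assumptions made for this example.

```python
def top_quantile_peaks(scores, top_quantile=0.1):
    """Return the sorted indices of the peaks in the top `top_quantile` fraction by score."""
    if not 0 < top_quantile <= 1:
        raise ValueError("top_quantile must be in the interval (0, 1]")
    # Always keep at least one peak, even for very small inputs
    n_top = max(1, round(len(scores) * top_quantile))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_top])


# Hypothetical topic scores for ten peaks
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.7, 0.5]
print(top_quantile_peaks(toy_scores, top_quantile=0.2))  # [1, 3]
```

Lowering the quantile shrinks the selected set toward the peaks most specific to the topic, which is why it can reduce enrichment for non-specific factors.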

You can retrieve enrichment results using `get_enrichments`. Note, this list is not sorted:

In [None]:
pd.DataFrame(atac_model.get_enrichments(9)).head(3)
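
Because the results are not sorted, it is often useful to order them by significance before inspecting them. The sketch below works on a hypothetical list of records and assumes each record stores its p-value under a `pval` key; check the keys of the real output before relying on that name. The helper `rank_by_significance` and the factor names are illustrative only.

```python
import math

# Hypothetical enrichment records; the real output may use different keys
records = [
    {"name": "TF_A", "pval": 1e-3},
    {"name": "TF_B", "pval": 1e-12},
    {"name": "TF_C", "pval": 0.2},
]


def rank_by_significance(records, pval_key="pval"):
    """Sort records by ascending p-value and attach the -log10 p-value to each."""
    ranked = sorted(records, key=lambda record: record[pval_key])
    return [
        # Clamp at a tiny positive value so a p-value of 0 does not break log10
        {**record, "neg_log10_pval": -math.log10(max(record[pval_key], 1e-300))}
        for record in ranked
    ]


ranked = rank_by_significance(records)
print([record["name"] for record in ranked])  # ['TF_B', 'TF_A', 'TF_C']
```

With pandas, the same ordering can be obtained by calling `sort_values` on the DataFrame built above, again assuming the p-value column name.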

Comparing and contrasting TF enrichments between topics elucidates common and topic-specific regulators. For this, you can use `plot_compare_topic_enrichments`, which plots the -log10 p-value of TF enrichment for one topic vs. another.

In [None]:
atac_model.plot_compare_topic_enrichments(3, 9,
            fontsize=10, label_closeness=3, figsize=(6,6),
        )

You can color the TFs on the plot to help narrow down important TFs. We could color by expression levels in our cell types of interest:

In [None]:
total_expression_in_cells = np.log10(
    np.squeeze(np.array(rna_adata.X.sum(0))) + 1
)

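# Compare topics 3 and 9, the topics whose TF enrichments were computed above
atac_model.plot_compare_topic_enrichments(3, 9,
            # Upper-case the gene names so they match the factor names passed to subset_factors
            hue = {factor.upper() : expr for factor, expr in zip(rna_adata.var_names, total_expression_in_cells)},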
            palette = 'coolwarm', legend_label='Expression',
            fontsize=10, label_closeness=3, figsize=(6,6)
        )