# MIRA Joint Representation

We will use the pre-trained topic models to create a joint embedding of accessibility and expression across cells. From this embedding we can calculate a joint-KNN graph that captures cellular heterogeneity by ordering cells not only by expression or accessibility, but by both. The joint-KNN graph can then be used for clustering, pseudotime trajectory inference, and UMAP visualization.

In [None]:
!hostnamectl

In [None]:
import mira
import anndata
import scanpy as sc
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rc('font',size=12)

import logging
mira.logging.getLogger().setLevel(logging.INFO)
import warnings
warnings.simplefilter("ignore")
umap_kwargs = dict(
    add_outline=True, outline_width=(0.1,0), outline_color=('grey', 'white'),
    legend_fontweight=350, frameon = False, legend_fontsize=12
)
print(mira.__version__)
mira.utils.pretty_sderr()

First, we load the datasets and the topic models:

In [None]:
rna_adata = anndata.read_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_rna_data.h5ad")
atac_adata = anndata.read_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_data.h5ad")

rna_model = mira.topics.load_model("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_rna_model.pth")
atac_model = mira.topics.load_model("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_model.pth")

## Predicting Topics

Using the topic models, we can predict topic compositions for our cells. The topics are a distribution over expression of genes, so cell-topic compositions represent the degree to which different modules of gene expression are active in the cell. 

The `predict` method takes the requisite AnnData objects as input and saves topic compositions for cells and features.

In [None]:
atac_model.predict(atac_adata)
rna_model.predict(rna_adata)

Next, we wish to use those cell-topic compositions as features to find cells which are in similar states. Compositions lie on the simplex, which can distort inter-cell distances. Therefore, we convert the simplicial topic compositions to *Real* space using the *Isometric log ratio* (ILR) transformation.
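To make the transformation concrete, here is a minimal NumPy sketch of the standard ILR transform. It assumes a Helmert-style orthonormal basis; the names `helmert_basis` and `ilr` are illustrative and not part of MIRA's API, and MIRA's own implementation may use a different (equivalent) basis.

```python
import numpy as np

def helmert_basis(n_parts):
    """Return (n_parts - 1) orthonormal row vectors, each orthogonal to the all-ones vector."""
    basis = np.zeros((n_parts - 1, n_parts))
    for i in range(1, n_parts):
        basis[i - 1, :i] = 1.0 / i          # first i entries balance ...
        basis[i - 1, i] = -1.0              # ... against entry i
        basis[i - 1] *= np.sqrt(i / (i + 1.0))  # scale to unit length
    return basis

def ilr(compositions):
    """ILR-transform strictly positive compositions (rows sum to 1) into real space."""
    log_x = np.log(compositions)
    clr = log_x - log_x.mean(axis=1, keepdims=True)  # centered log-ratio
    return clr @ helmert_basis(compositions.shape[1]).T

# Toy example: two cells described by three topics (hypothetical values)
toy_compositions = np.array([[0.2, 0.3, 0.5],
                             [0.6, 0.3, 0.1]])
toy_ilr = ilr(toy_compositions)
print(toy_ilr.shape)
```

A composition over `k` topics maps to `k - 1` real-valued coordinates, and distances between cells in the transformed space equal their distances in centered log-ratio space.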

The parameter `box_cox` controls the box-cox power transformation applied to the simplicial data. Passing zero or "log" gives the standard ILR transformation. Passing a float less than 1 gives a box-cox generalization of the ILR. Larger values generally produce more complex structures in the latent space. No value works perfectly for all datasets, so please see the section below for more details.

In [None]:
rna_model.get_umap_features(rna_adata, box_cox=0.25)
atac_model.get_umap_features(atac_adata, box_cox=0.25)
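As a rough illustration of what the `box_cox` parameter changes, the sketch below applies the textbook Box-Cox power transform to a single composition. The function `box_cox_transform` is hypothetical, and it is an assumption that MIRA substitutes a transform of this form for the logarithm inside the ILR; consult the MIRA source for the exact definition.

```python
import numpy as np

def box_cox_transform(x, box_cox):
    """Textbook Box-Cox transform of positive values; 'log' or 0 gives the natural log."""
    x = np.asarray(x, dtype=float)
    if box_cox == "log" or box_cox == 0:
        return np.log(x)
    return (x ** box_cox - 1.0) / box_cox

# Hypothetical topic composition for one cell
composition = np.array([0.7, 0.2, 0.1])
for value in ["log", 0.25, 0.5, 0.99]:
    print(value, box_cox_transform(composition, value))
```

As `box_cox` approaches zero the power transform converges to the logarithm, and as it approaches one it becomes nearly linear, so small topic proportions are stretched less aggressively.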

Let's visualize how the topics describe cell populations and variance in the dataset. We’ll start by creating separate visualizations for expression and accessibility. First, we need to use the embedding space to create a K-nearest neighbors graph using `sc.pp.neighbors`. To make sure the correct embeddings are used, specify `use_rep = 'X_umap_features'` for each modality (the joint representation built later in this tutorial uses `'X_joint_umap_features'` instead). Also, specify `metric = 'manhattan'` to leverage the orthonormality of ILR-transformed space to find cells in similar states.

One application of the KNN graph is to calculate a 2-D UMAP view of the data. When calculating UMAPs, setting `min_dist = 0.1` highlights lineage structures and reduces the “fuzziness” of the UMAP view.

We do this for both modalities below:

In [None]:
# Run K-NN and UMAP for RNA data
sc.pp.neighbors(rna_adata, use_rep = 'X_umap_features', metric = 'manhattan', n_neighbors = 21)
sc.tl.umap(rna_adata, min_dist = 0.1)
rna_adata.obsm['X_umap'] = rna_adata.obsm['X_umap']*np.array([-1,-1]) # flip for consistency

# Run K-NN and UMAP for ATAC data
sc.pp.neighbors(atac_adata, use_rep = 'X_umap_features', metric = 'manhattan', n_neighbors = 21)
sc.tl.umap(atac_adata, min_dist = 0.1)
atac_adata.obsm['X_umap'] = atac_adata.obsm['X_umap']*np.array([1,-1]) # flip for consistency
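For intuition about what a Manhattan-metric KNN graph contains, here is a brute-force NumPy sketch. It is not how `sc.pp.neighbors` is implemented (scanpy uses an approximate search), and `manhattan_knn` is a hypothetical helper that is only practical for small toy inputs.

```python
import numpy as np

def manhattan_knn(features, n_neighbors):
    """Return, for each row, the indices of its n_neighbors nearest rows under L1 distance."""
    features = np.asarray(features, dtype=float)
    # Pairwise L1 distances via broadcasting: result has shape (n_cells, n_cells)
    dists = np.abs(features[:, None, :] - features[None, :, :]).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each cell from its own neighbor list
    return np.argsort(dists, axis=1)[:, :n_neighbors]

# Four toy cells in a 2-D embedding (hypothetical values)
toy_features = np.array([[0.0, 0.0],
                         [1.0, 0.0],
                         [0.0, 3.0],
                         [10.0, 10.0]])
toy_neighbors = manhattan_knn(toy_features, n_neighbors=1)
print(toy_neighbors.ravel())
```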

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Note: this replaces the umap_kwargs defined in the setup cell; later plotting cells reuse it
umap_kwargs = dict(color='topic_0', na_color="lightgrey")
sc.pl.umap(
    rna_adata,
    ax=ax[0],
    size=20,
    title="Expression Only",
    show=False,
    **umap_kwargs
)

sc.pl.umap(
    atac_adata,
    ax=ax[1],
    size=20,
    title="Accessibility Only",
    show=False,
    **umap_kwargs
)

plt.tight_layout()
plt.show()


## Joining Modalities

Now, let’s combine the modalities. We can construct the joint embedding space using `mira.utils.make_joint_representation`. This function takes the two modalities’ AnnDatas as input, finds the cells common to both, joins the separate transformed topic spaces to make the joint embedding for each cell, and returns the updated AnnDatas.

In [None]:
rna_adata, atac_adata = mira.utils.make_joint_representation(rna_adata, atac_adata)
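Conceptually, joining the modalities amounts to aligning cells by barcode and placing the two feature sets side by side. The sketch below shows that idea with plain NumPy arrays. It is an assumption-laden illustration, not MIRA's implementation: `join_representations` is a hypothetical helper, and MIRA may order, filter, or rescale the features differently.

```python
import numpy as np

def join_representations(rna_barcodes, rna_features, atac_barcodes, atac_features):
    """Keep cells present in both modalities and concatenate their feature vectors."""
    common, rna_idx, atac_idx = np.intersect1d(
        rna_barcodes, atac_barcodes, return_indices=True
    )
    joint = np.hstack([rna_features[rna_idx], atac_features[atac_idx]])
    return common, joint

# Toy inputs: three cells per modality, two of them shared (hypothetical barcodes)
toy_rna_barcodes = np.array(["c1", "c2", "c3"])
toy_atac_barcodes = np.array(["c3", "c1", "c4"])
toy_rna_features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_atac_features = np.array([[7.0], [8.0], [9.0]])

toy_common, toy_joint = join_representations(
    toy_rna_barcodes, toy_rna_features, toy_atac_barcodes, toy_atac_features
)
print(toy_common)
print(toy_joint)
```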

Finally, we can use the joint embedding space to create the joint-KNN graph:

In [None]:
sc.pp.neighbors(rna_adata, use_rep = 'X_joint_umap_features', metric = 'manhattan',
               n_neighbors = 20)

We can then visualize the joint-KNN graph using UMAP. The UMAP view below, as analyzed thoroughly in the MIRA paper, reveals interesting aspects of skin differentiation biology.

In [None]:
sc.tl.umap(rna_adata, min_dist = 0.1)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(8,5))
sc.pl.umap(rna_adata, legend_loc = 'on data', ax = ax, size = 20,
          **umap_kwargs, title = '')

After joining the AnnDatas, it is useful to transfer some metadata from the ATAC AnnData to the RNA AnnData so that we have one main object for plotting and running other functions:

In [None]:
rna_adata.obs = rna_adata.obs.join(
    atac_adata.obs.add_prefix('ATAC_') # add a prefix so we know which AnnData the column came from
)

atac_adata.obsm['X_umap'] = rna_adata.obsm['X_umap']

## Analyzing Joint Topic Compositions

One question we can answer with topics is to what degree changes in one mode’s topics correspond to, or correlate with, topics in the other mode. For this, we can use the mutual information between RNA and ATAC topic compositions. Mutual information measures how much knowing one variable tells you about the distribution of another. In this case: does knowing the topic composition of one mode tell you about the composition of the other?

We can ask this question on a cell-by-cell basis with the `mira.tl.get_cell_pointwise_mutual_information` function, which calculates the pointwise mutual information between topics in each cell:
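As a reminder of the quantity involved, the following sketch computes the textbook mutual information of two discrete variables from their joint probability table. It is not MIRA's estimator, which works on topic compositions; the function name `mutual_information` and the choice of bits as the unit are assumptions made for this example.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of two discrete variables given their joint table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize to a probability table
    p_row = joint.sum(axis=1, keepdims=True)    # marginal of the row variable
    p_col = joint.sum(axis=0, keepdims=True)    # marginal of the column variable
    nonzero = joint > 0                         # skip zero cells, which contribute nothing
    ratio = joint[nonzero] / (p_row @ p_col)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

# Independent variables share no information; perfectly coupled ones share the most
independent = np.outer([0.5, 0.5], [0.3, 0.7])
coupled = np.array([[0.5, 0.0],
                    [0.0, 0.5]])
print(mutual_information(independent), mutual_information(coupled))
```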

In [None]:
mira.tl.get_cell_pointwise_mutual_information(rna_adata, atac_adata)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(8,5))
sc.pl.umap(rna_adata, color = 'pointwise_mutual_information', ax = ax, vmin = 0,
          color_map='magma', frameon=False, add_outline=True, vmax = 3, size = 25)

Usually, more stable cell states, such as terminal cell states, will have greater concordance between topic compositions.

To summarize mutual information across all cells, use `mira.tl.summarize_mutual_information`. Typically, this gives a value between 0 (low concordance) and 0.5 (high concordance).

In [None]:
mira.tl.summarize_mutual_information(rna_adata, atac_adata)

Finally, one can see which topics correlate across modes. Use:

In [None]:
cross_correlation = mira.tl.get_topic_cross_correlation(rna_adata, atac_adata)

In [None]:
sns.clustermap(cross_correlation, vmin = 0,
               cmap = 'magma', method='ward',
               dendrogram_ratio=0.05, cbar_pos=None, figsize=(7,7))
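To show the kind of matrix being clustered, here is a small sketch that correlates every column of one topic matrix with every column of another. It assumes Pearson correlation across cells, which may differ from what `mira.tl.get_topic_cross_correlation` computes internally; `topic_cross_correlation` and the simulated compositions are hypothetical.

```python
import numpy as np

def topic_cross_correlation(rna_topics, atac_topics):
    """Pearson correlation between each RNA topic (rows) and each ATAC topic (columns)."""
    n_rna = rna_topics.shape[1]
    # rowvar=False treats columns as variables; the result stacks both topic sets
    full = np.corrcoef(rna_topics, atac_topics, rowvar=False)
    return full[:n_rna, n_rna:]  # keep only the RNA-by-ATAC block

# Simulated compositions: 50 cells, 3 RNA topics, 2 ATAC topics
rng = np.random.default_rng(0)
toy_rna_topics = rng.dirichlet(np.ones(3), size=50)
toy_atac_topics = np.column_stack([toy_rna_topics[:, 0], 1.0 - toy_rna_topics[:, 0]])

toy_cross = topic_cross_correlation(toy_rna_topics, toy_atac_topics)
print(toy_cross.shape)
```

In this toy setup the first ATAC topic is a copy of the first RNA topic, so that pair is perfectly correlated, while the second ATAC topic is its complement and is perfectly anti-correlated.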

In [None]:
mira.adata_interface.core.logger.setLevel(logging.WARN)
mira.adata_interface.topic_model.logger.setLevel(logging.WARN)
mira.adata_interface.utils.logger.setLevel(logging.WARN)

In [None]:
def boxcox_test(ax, box_cox, rna, atac):

    atac_model.get_umap_features(atac, box_cox=box_cox)
    rna_model.get_umap_features(rna, box_cox=box_cox)

    rna, atac = mira.utils.make_joint_representation(rna, atac)

    sc.pp.neighbors(rna, use_rep = 'X_joint_umap_features', metric = 'manhattan', n_neighbors = 10)
    sc.tl.umap(rna, min_dist = 0.2, negative_sample_rate=2)
    sc.pl.umap(rna, ax = ax, show = False, title = 'Box-cox: ' + str(box_cox), legend_loc='on data',
              add_outline=True, outline_width=(0.1,0), outline_color=('grey', 'white'),
              legend_fontweight=150, frameon = False, legend_fontsize=12, **umap_kwargs)

fig, ax = plt.subplots(1,4, figsize=(20,4))
for ax_i, box_cox in zip(ax, ['log',0.25,0.5,0.99]):
    boxcox_test(ax_i, box_cox, rna_adata, atac_adata)

plt.show()

We see that as the `box_cox` parameter increases, some finer details in the manifold emerge, such as multiple paths between the Matrix and IRS cells. For the hair follicle, “log” and 0.99 hide meaningful structure in the data, which suggests that the best `box_cox` transformation for this dataset lies somewhere between 0.25 and 0.5. Notably, the underlying topic compositions have not changed, just our definition of the joint-KNN graph and our subsequent view of it in UMAP space.

Try multiple values for box_cox to find a view that sufficiently demonstrates the connectivity structure of the data.

Overall, in constructing the joint representation and the ensuing visual representation of the data (at least with UMAP), there are several hyperparameters to consider:


| Parameter | Source | What it does | Good value |
|:---------:|:------:|:------------:|:----------:|
|  box_cox  |  MIRA  |Controls box-cox power transformation of topic compositions. A value of zero/“log” performs ILR transformation. Larger values give a box-cox generalization of ILR and generally find more complex structure in the data. |“log”, 0.25, 0.5, 0.75|
|n_neighbors|  MIRA  | Number of neighbors in joint-KNN graph. Greater values increase “clumpiness” of joint KNN and remove finer structures and neighborhoods | 15 |
|  min_dist |  UMAP  | How close together can cells of similar state be placed in 2-D space. Lower values decrease “fuzziness” of UMAP. | 0.1 |
| negative_sample_rate | UMAP | Repulsive force of UMAP algorithm. Decreasing this parameter makes UMAP view more similar to force-directed layouts, where attractive forces are prioritized. | 1 - 5 |

With the joint representation made, we can investigate regulatory axes captured by the topics. Please view the next tutorial to see MIRA’s topic analysis facilities, including motif calling and regulator enrichment.

In [None]:
atac_adata.write_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_atac_data_joint_representation.h5ad")
rna_adata.write_h5ad("/gpfs/Home/esm5360/MIRA/mira-datasets/ds011_rna_data_joint_representation.h5ad")