# MuVIcell Tutorial: Multi-View Integration for Sample-Aggregated Single-Cell Data

This notebook shows how to use the MuVIcell package to integrate and analyze sample-aggregated single-cell data with MuVI (Multi-View Integration).

## Overview

MuVIcell provides a streamlined workflow for:
1. **Generating/Loading** multi-view data in muon format (samples x features)
2. **Preprocessing** data for MuVI analysis
3. **Running MuVI** to identify latent factors using `muvi.tl.from_mdata`
4. **Analyzing** and interpreting factors
5. **Visualizing** results

Note: Each row represents a **sample** (not an individual cell), and each view contains **data aggregated per cell type for that sample**.
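
To make the samples x features layout concrete, the sketch below collapses a toy cell-level matrix into one view per cell type, with one row per sample. It is a minimal illustration that assumes mean aggregation; the helper `aggregate_by_sample` and the arrays `expr`, `sample_ids`, and `cell_types` are hypothetical and not part of MuVIcell.

```python
import numpy as np

def aggregate_by_sample(expr, sample_ids, cell_types):
    """Average cell-level values into one (samples x features) matrix per cell type.

    Hypothetical helper for illustration; mean aggregation is an assumption.
    """
    samples = np.unique(sample_ids)
    views = {}
    for ct in np.unique(cell_types):
        rows = []
        for s in samples:
            mask = (sample_ids == s) & (cell_types == ct)
            if mask.any():
                rows.append(expr[mask].mean(axis=0))
            else:
                # A sample with no cells of this type becomes a row of NaN
                rows.append(np.full(expr.shape[1], np.nan))
        views[str(ct)] = np.vstack(rows)
    return samples, views

# Toy data: 6 cells, 2 features, 2 samples, 2 cell types
expr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
                 [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]])
sample_ids = np.array(["s1", "s1", "s1", "s2", "s2", "s2"])
cell_types = np.array(["T", "T", "B", "T", "B", "B"])

samples, views = aggregate_by_sample(expr, sample_ids, cell_types)
for name, mat in views.items():
    print(name, mat.shape)  # every view has one row per sample
```

Each entry of `views` then plays the role of one MuVI view: rows are samples, columns are features measured within that cell type.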

In [None]:
import muvicell
muvicell.__version__

In [None]:
import tensordict
tensordict.__version__

In [None]:
import numpy as np
import pandas as pd
import muon as mu
import warnings
warnings.filterwarnings('ignore')

from plotnine import *
import scanpy as sc

# Import muvicell modules
from muvicell import synthetic, preprocessing, analysis, visualization

# Import MuVI directly to show compatibility
import muvi

In [None]:
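# Default to the CPU; switch to a free GPU if MuVI can find one
device = "cpu"
try:
    device = f"cuda:{muvi.get_free_gpu_idx()}"
except Exception as e:
    # No usable GPU: report why and keep running on the CPU
    print(e)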

## 1. Generate Synthetic Multi-View Data

Generate synthetic data with 3 views (5, 10, 15 features) and 200 samples:

In [None]:
# Generate synthetic multi-view data (3 views matching 3 true factors)
mdata = synthetic.generate_synthetic_data(
    n_samples=200,
    view_configs={
        'Cell Type 1': {'n_vars': 5, 'sparsity': 0.15},
        'Cell Type 2': {'n_vars': 10, 'sparsity': 0.25},
        'Cell Type 3': {'n_vars': 15, 'sparsity': 0.35}
    }
)

print("Generated synthetic data:")
print(f"- Samples: {mdata.n_obs}")
print(f"- Views: {len(mdata.mod)} ({', '.join([f'{k}: {v.n_vars} features' for k, v in mdata.mod.items()])})")
print(f"- Total features: {sum(v.n_vars for v in mdata.mod.values())}")

## 2. Add Latent Factor Structure

Add realistic latent factor structure to the synthetic data:

In [None]:
# Add latent structure with 3 factors (matching n_true_factors)
mdata_structured = synthetic.add_latent_structure(
    mdata,
    n_latent_factors=3,
    factor_variance=[0.5, 0.4, 0.3],
    structure_strength=1.0,
    baseline_strength=0.6
)

print(f"Sample metadata columns: {list(mdata_structured.obs.columns)}")

In [None]:
for mod in mdata_structured.mod:
    # Highly variable features can be used if there are enough of them
    sc.pp.pca(mdata_structured[mod], 
              use_highly_variable=False)
    sc.pp.neighbors(mdata_structured[mod])

mu.pp.neighbors(mdata_structured)

In [None]:
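# Copy the simulated factor scores from the global sample metadata into a single view.
# The view is chosen explicitly here (the last one) rather than relying on the
# loop variable left over from the previous cell.
mod = list(mdata_structured.mod.keys())[-1]
for var in ['sim_factor_1', 'sim_factor_2', 'sim_factor_3']:
    mdata_structured[mod].obs[var] = mdata_structured.obs[var]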

sc.tl.umap(mdata_structured[mod])
sc.pl.umap(mdata_structured[mod], color=['batch', 
                                        'sim_factor_1',
                                        'sim_factor_2',
                                        'sim_factor_3'])

In [None]:
mu.tl.umap(mdata_structured)
mu.pl.umap(mdata_structured, wspace=0.3, color=['batch', 
                                                'sim_factor_1',
                                                'sim_factor_2',
                                                'sim_factor_3'])

With the parameters passed to `add_latent_structure`, we create 3 latent factors with variances 0.5, 0.4, and 0.3, a strong structured signal (`structure_strength=1.0`), and a moderate baseline signal across all features (`baseline_strength=0.6`).
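
The idea behind such a simulation can be illustrated with a generic linear factor model, in which each view is a product of factor scores and loadings plus noise. The sketch below is an assumption-laden illustration, not the implementation of `synthetic.add_latent_structure`; the names `scores`, `loadings`, and `noise_sd` are hypothetical, and only `factor_variance` mirrors a parameter from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
factor_variance = np.array([0.5, 0.4, 0.3])  # one variance per latent factor

# Factor scores: one column per factor, scaled to the requested variance
scores = rng.normal(size=(n_samples, 3)) * np.sqrt(factor_variance)

# Loadings: (factors x features) weights linking factors to features
loadings = rng.normal(size=(3, n_features))

# Observed view = structured signal + unstructured noise (noise_sd is an assumption)
noise_sd = 0.1
view = scores @ loadings + rng.normal(scale=noise_sd, size=(n_samples, n_features))

print(view.shape)
print(scores.var(axis=0).round(2))  # empirical variances, close to factor_variance
```

A factor model fitted to `view` would aim to recover `scores` and `loadings`, up to the usual ambiguities in sign and ordering.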

## 3. Preprocess Data for MuVI

Apply the preprocessing pipeline, with filtering and highly variable gene selection turned off because the data are synthetic:

In [None]:
# Preprocess for MuVI analysis
mdata_processed = preprocessing.preprocess_for_muvi(
    mdata_structured,
    filter_cells=False,  # Don't filter synthetic data
    filter_genes=False,  # Don't filter synthetic data
    normalize=True,
    find_hvg=False,      # Skip HVG for synthetic data
    subset_hvg=False
)

print(f"Preprocessed data shape: {mdata_processed.shape}")
print("Data ready for MuVI analysis")

## 4. Run MuVI Analysis

Run MuVI through its standard `muvi.tl.from_mdata` API, with 3 factors to match the synthetic data:

In [None]:
# Run MuVI using the standard API
model = muvi.tl.from_mdata(
    mdata_processed,
    n_factors=3,
    nmf=False,
    device=device
)

# Fit the model
model.fit()

print(f"MuVI model fitted with {model.n_factors} factors")

In [None]:
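# Summarise reconstruction quality per view as the squared Pearson correlation
# between observed values and the reconstruction (factor scores @ loadings)
r_pool = []
for vn in model.get_factor_loadings().keys():
    rec = model.get_factor_scores() @ model.get_factor_loadings()[vn]
    r = pd.DataFrame({'x': mdata_processed[vn].X.flatten(), 
                      'y': rec.flatten()}).corr()
    r_pool.append(r.iloc[0,1])
# Average the squared correlations across views
print(f"Macro R2: {np.mean(np.square(r_pool))}")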

# Check factor scores
factor_scores = model.get_factor_scores()
print(f"Factor scores shape: {factor_scores.shape}")
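
The cell above summarises fit with the squared Pearson correlation between observed and reconstructed values. A common alternative is the coefficient of determination, R² = 1 − SS_res / SS_tot, which also penalises errors in scale and offset. The sketch below contrasts the two on toy arrays; `r2_score_view` is a hypothetical helper, and `observed` and `reconstructed` stand in for one view and its reconstruction.

```python
import numpy as np

def r2_score_view(observed, reconstructed):
    """Coefficient of determination over all entries of one view."""
    ss_res = np.sum((observed - reconstructed) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
observed = rng.normal(size=(200, 5))
# A reconstruction that is perfectly correlated with the data but shrunk by half
reconstructed = 0.5 * observed

pearson_sq = np.corrcoef(observed.ravel(), reconstructed.ravel())[0, 1] ** 2
r2 = r2_score_view(observed, reconstructed)
print(f"squared correlation: {pearson_sq:.3f}")
print(f"R2: {r2:.3f}")
```

The squared correlation is insensitive to the shrinkage, while R² drops, so the two summaries can disagree when the reconstruction is on a different scale from the data.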

## (Bonus) Confirm the factors recovered match the simulation parameters
This check is possible only because we simulated the data ourselves; in real applications the true factors are unknown.

In [None]:
factors_df = pd.DataFrame(
    np.hstack([mdata_processed.obsm['true_factors'], factor_scores]),
    columns=[f"True_Factor_{i+1}" for i in range(3)] + [f"MuVI_Factor_{i+1}" for i in range(model.n_factors)]
)
corr_factors = factors_df.corr(method='spearman')
corr_factors

We see that many of the true factors are well recovered, with a high absolute correlation (|ρ| > 0.5) between true and inferred factor scores. The sign of an inferred factor is arbitrary, so a strong negative correlation indicates recovery just as a strong positive one does. Some effects are split across multiple inferred factors, because different combinations of independent factors can explain the same variance.
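
Since inferred factors can come back in a different order and with flipped signs, a fair comparison pairs each true factor with one inferred factor by absolute correlation. The sketch below does this by brute force over permutations, which is practical only for a handful of factors. `match_factors` is a hypothetical helper, not part of MuVIcell or MuVI, and it uses Pearson rather than Spearman correlation for simplicity.

```python
import itertools
import numpy as np

def match_factors(true_scores, inferred_scores):
    """Find the one-to-one pairing that maximises the total |correlation|."""
    k = true_scores.shape[1]
    # Cross-correlation block: rows are true factors, columns are inferred factors
    corr = np.corrcoef(true_scores.T, inferred_scores.T)[:k, k:]
    best = max(
        itertools.permutations(range(inferred_scores.shape[1]), k),
        key=lambda perm: sum(abs(corr[i, j]) for i, j in enumerate(perm)),
    )
    return [(i, j, float(corr[i, j])) for i, j in enumerate(best)]

rng = np.random.default_rng(0)
true_scores = rng.normal(size=(200, 3))
# Toy "inferred" factors: reordered, one sign-flipped, plus small noise
inferred_scores = true_scores[:, [2, 0, 1]] * np.array([1.0, -1.0, 1.0])
inferred_scores = inferred_scores + 0.1 * rng.normal(size=(200, 3))

matches = match_factors(true_scores, inferred_scores)
for i, j, r in matches:
    print(f"true factor {i} <-> inferred factor {j} (r = {r:+.2f})")
```

For larger numbers of factors, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the negated absolute correlations would avoid enumerating permutations.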

## 5. Characterize Factors
Identify top genes contributing to each factor:
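
As a generic illustration of this step, assume the loadings of one view are available as a (factors x features) array, for example `model.get_factor_loadings()[view_name]` as used in section 4. Features can then be ranked by absolute loading. The helper `top_features` and the toy feature names below are hypothetical and not part of MuVIcell.

```python
import numpy as np

def top_features(loadings, feature_names, factor, n_top=5):
    """Return the n_top (name, loading) pairs with the largest |loading| for one factor."""
    weights = loadings[factor]
    order = np.argsort(-np.abs(weights))[:n_top]
    return [(feature_names[i], float(weights[i])) for i in order]

# Toy loadings: 2 factors x 5 features
loadings = np.array([[0.1, -0.9, 0.3, 0.0, 0.7],
                     [0.8, 0.2, -0.4, 0.6, -0.1]])
feature_names = ["g1", "g2", "g3", "g4", "g5"]

for name, weight in top_features(loadings, feature_names, factor=0, n_top=3):
    print(f"{name}: {weight:+.2f}")
```

Ranking by absolute value keeps features with strong negative loadings, which contribute to a factor as much as strongly positive ones.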

## 6. Analyze Factor Associations

Test associations between factors and sample metadata:

In [None]:
# Identify factor associations with metadata
associations = analysis.identify_factor_associations(
    model,
    categorical_test='kruskal'
)

print("Factor-metadata associations:")
if len(associations) > 0:
    significant = associations[associations['p_value'] < 0.05]
    print(f"Found {len(significant)} significant associations (p < 0.05)")
    if len(significant) > 0:
        print(significant[['factor', 'metadata', 'test', 'p_value']].head())
else:
    print("No associations found")
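
The option `categorical_test='kruskal'` points to a Kruskal-Wallis test, a rank-based test of whether a factor's scores differ between groups. The sketch below computes the H statistic by hand for one factor and a three-level grouping, assuming no tied scores. With three groups, H is compared against a chi-square distribution with 2 degrees of freedom, whose tail probability is exp(−H/2). `kruskal_h` is a hypothetical helper and may differ from what `identify_factor_associations` does internally; `scipy.stats.kruskal` handles ties and any number of groups.

```python
import numpy as np

def kruskal_h(values, groups):
    """Kruskal-Wallis H statistic, assuming no ties among values."""
    n = len(values)
    ranks = np.argsort(np.argsort(values)) + 1  # ranks 1..n
    total = 0.0
    for g in np.unique(groups):
        r = ranks[groups == g]
        total += r.sum() ** 2 / len(r)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

rng = np.random.default_rng(0)
groups = np.repeat(["a", "b", "c"], 20)
# Toy factor scores whose mean shifts with the group
shift = np.repeat([0.0, 2.0, 4.0], 20)
scores = rng.normal(size=60) + shift

h_stat = kruskal_h(scores, groups)
p_value = np.exp(-h_stat / 2.0)  # chi-square tail for 3 groups (2 degrees of freedom)
print(f"H = {h_stat:.2f}, p = {p_value:.2e}")
```

When many factor-metadata pairs are tested, the resulting p-values would normally be corrected for multiple testing before applying a threshold such as 0.05.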

## 7. Cluster Samples by Factors

Cluster samples based on their factor scores:

In [None]:
# Cluster samples based on factor scores
clusters = analysis.cluster_cells_by_factors(
    model,
    factors_to_use=None,  # Use all factors
    n_clusters=3
)

print("Sample clustering results:")
print(f"Cluster distribution: {np.bincount(clusters)}")
print(f"Number of clusters: {len(np.unique(clusters))}")
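
To show what clustering on factor scores involves, the sketch below runs a plain k-means on a toy (samples x factors) matrix. This assumes k-means, which may not be the algorithm behind `cluster_cells_by_factors`; the function `kmeans` and the toy data are hypothetical.

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50):
    """Plain k-means (Lloyd's algorithm) with deterministic farthest-point initialisation."""
    # Start from the first sample, then repeatedly add the sample
    # that is farthest from all centres chosen so far
    centres = [points[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(points[np.argmax(d)])
    centres = np.array(centres, dtype=float)

    for _ in range(n_iter):
        # Assign each sample to its nearest centre
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned samples
        new_centres = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(0)
# Toy factor scores: 150 samples x 3 factors, drawn around three group means
group_means = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
factor_scores = rng.normal(scale=0.5, size=(150, 3)) + np.repeat(group_means, 50, axis=0)

labels, centres = kmeans(factor_scores, n_clusters=3)
print(f"Cluster distribution: {np.bincount(labels)}")
```

Cluster labels are arbitrary, so cluster 0 in one run need not correspond to cluster 0 in another; compare clusterings by membership rather than by label.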

## 8. Visualize Results

Create publication-ready visualizations:

In [None]:
# 1. Variance explained by factors
p1 = visualization.plot_variance_explained(model, max_factors=3)
print("Variance explained by factors:")
p1.show()

In [None]:
# 2. Factor scores colored by cell type
p2 = visualization.plot_factor_scores(model, factors=(0, 1), color_by='cell_type')
print("Factor scores (Factor 0 vs Factor 1):")
p2.show()

In [None]:
# 3. Factor loadings for the first view, named 'Cell Type 1' in section 1
p3 = visualization.plot_factor_loadings(model, 'Cell Type 1', factor=0, top_genes=5)
print("Top gene loadings for Factor 0 in 'Cell Type 1':")
p3.show()

In [None]:
# 4. Factor activity comparison across cell types
p4 = visualization.plot_factor_comparison(
    model,
    factors=[0, 1, 2],
    group_by='cell_type',
    plot_type='boxplot'
)
print("Factor activity by cell type:")
p4.show()