# MuVIcell Tutorial: Multi-View Integration for Single-Cell Data

This notebook shows how to use the MuVIcell package to integrate and analyze multi-view single-cell data with MuVI.

## Overview

MuVIcell provides a streamlined workflow for:
1. **Generating/Loading** multi-view data in muon format
2. **Preprocessing** data for MuVI analysis
3. **Running MuVI** to identify latent factors
4. **Analyzing** and interpreting factors
5. **Visualizing** results

Let's start by importing the necessary modules.

In [None]:
import numpy as np
import pandas as pd
import muon as mu
import warnings
warnings.filterwarnings('ignore')

# Import muvicell modules
from muvicell import synthetic, preprocessing, muvi_runner, analysis, visualization
# Or import MuVI functions directly
from muvicell import run_muvi, get_factor_scores

# Set random seed for reproducibility
np.random.seed(42)

## 1. Generate Synthetic Multi-View Data

First, let's generate synthetic multi-view data with the following layout:
- 3 views sharing the same 200 rows (samples)
- 5, 10, and 15 columns (features) in the three views, respectively
- Each row represents a sample (e.g., a spatial region or tissue section), not an individual cell
- Each view holds cell-type-aggregated data per sample

In [None]:
# Define view configurations
view_configs = {
    'view1': {'n_vars': 5, 'sparsity': 0.3},
    'view2': {'n_vars': 10, 'sparsity': 0.4},
    'view3': {'n_vars': 15, 'sparsity': 0.5}
}

# Generate synthetic data
mdata_raw = synthetic.generate_synthetic_data(
    n_cells=200,  # Note: these are samples, not individual cells
    view_configs=view_configs,
    random_state=42
)

print(f"Generated synthetic data with {mdata_raw.n_obs} samples")
print(f"Number of views: {len(mdata_raw.mod)}")

# Display view information
from muvicell.data import get_view_info
view_info = get_view_info(mdata_raw)
print("\nView information:")
print(view_info)
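
To make the shapes concrete, the sketch below builds a comparable three-view dataset with plain NumPy. It is not how `synthetic.generate_synthetic_data` works internally; in particular, it assumes that `sparsity` is the fraction of entries set to zero, which this tutorial does not state.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200
# Same view layout as above; "sparsity" is ASSUMED to be the fraction of zeros
view_configs = {
    "view1": {"n_vars": 5, "sparsity": 0.3},
    "view2": {"n_vars": 10, "sparsity": 0.4},
    "view3": {"n_vars": 15, "sparsity": 0.5},
}

toy_views = {}
for name, cfg in view_configs.items():
    # Non-negative values, loosely mimicking aggregated expression
    values = rng.gamma(shape=2.0, scale=1.0, size=(n_samples, cfg["n_vars"]))
    # Zero out each entry independently with probability `sparsity`
    keep = rng.random(values.shape) >= cfg["sparsity"]
    toy_views[name] = values * keep

for name, mat in toy_views.items():
    print(name, mat.shape, f"zeros: {np.mean(mat == 0):.2f}")
```

All views share the same 200 rows, which is what allows them to be integrated sample by sample.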

## 2. Add Realistic Structure to the Data

Let's add some realistic latent factor structure to make the analysis more interesting.

In [None]:
# Add realistic latent factor structure
mdata_structured = synthetic.add_latent_structure(
    mdata_raw,
    n_factors=5,
    factor_variance=[0.3, 0.2, 0.15, 0.1, 0.05]
)

print(f"Added {mdata_structured.obsm['true_factors'].shape[1]} latent factors")
print(f"True factor variance: {mdata_structured.uns['true_factor_variance']}")
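
One common way to plant latent structure is a linear factor model: shared factor scores are multiplied by view-specific loadings and noise is added. The sketch below illustrates that idea with NumPy. It is an assumption about what "latent structure" means here, not the implementation of `synthetic.add_latent_structure`, and it treats `factor_variance` as the variance of each factor's scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_factors = 200, 5
factor_variance = np.array([0.3, 0.2, 0.15, 0.1, 0.05])
n_vars_per_view = {"view1": 5, "view2": 10, "view3": 15}

# Shared factor scores: column k has variance close to factor_variance[k]
true_factors = rng.standard_normal((n_samples, n_factors)) * np.sqrt(factor_variance)

structured = {}
for name, n_vars in n_vars_per_view.items():
    loadings = rng.standard_normal((n_factors, n_vars))    # view-specific weights
    noise = 0.1 * rng.standard_normal((n_samples, n_vars))
    structured[name] = true_factors @ loadings + noise     # samples x features

print(true_factors.var(axis=0).round(2))
```

Because the same `true_factors` matrix drives every view, a multi-view method has coordinated signal to recover.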

## 3. Data Validation

Before preprocessing, let's validate that our data is suitable for MuVI analysis.

In [None]:
# Validate data for MuVI
from muvicell.data import validate_for_muvi
validation_results = validate_for_muvi(mdata_structured)

print("Data validation results:")
all_passed = True
for check, result in validation_results.items():
    status = "✓" if result else "✗"
    print(f"  {status} {check}: {result}")
    if not result:
        all_passed = False

if all_passed:
    print("\n✓ Data is valid for MuVI analysis!")
else:
    print("\n✗ Data validation failed. Please check your data.")
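
The tutorial does not list which checks `validate_for_muvi` runs. As a rough illustration only, the sketch below applies a few checks that are typical before factor analysis (several views, matching sample counts, no missing values, no constant features) to plain NumPy arrays. The function name `check_views` and the check names are hypothetical.

```python
import numpy as np

def check_views(views):
    """Return a dict of named boolean checks for a dict of 2-D arrays."""
    n_rows = {mat.shape[0] for mat in views.values()}
    return {
        "has_multiple_views": len(views) >= 2,
        "same_n_samples": len(n_rows) == 1,
        "no_missing_values": all(np.isfinite(mat).all() for mat in views.values()),
        # nanstd ignores missing entries, so this check is independent of the previous one
        "no_constant_features": all(
            (np.nanstd(mat, axis=0) > 0).all() for mat in views.values()
        ),
    }

rng = np.random.default_rng(1)
good = {"a": rng.normal(size=(20, 3)), "b": rng.normal(size=(20, 4))}

broken_b = good["b"].copy()
broken_b[0, 0] = np.nan                    # introduce a single missing value
bad = {"a": good["a"], "b": broken_b}

print(check_views(good))
print(check_views(bad))
```

Returning a dict of named booleans matches the way the validation cell above iterates over `validation_results.items()`.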

## 4. Preprocess Data for MuVI

Now let's preprocess the data using the MuVIcell preprocessing pipeline.

In [None]:
# Preprocess the data for MuVI - simplified approach
mdata_processed = preprocessing.preprocess_for_muvi(
    mdata_structured,
    filter_cells=True,
    filter_genes=True,
    normalize=True,
    find_hvg=False,  # Skip HVG detection for small synthetic data
    subset_hvg=False
)

print(f"Preprocessed data shape: {mdata_processed.shape}")
print("\nPreprocessing completed successfully!")
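
What `preprocess_for_muvi` does for each flag is not spelled out here. A minimal sketch of one plausible per-view step, assuming that filtering removes constant features and that normalization means z-scoring each remaining feature, could look like this. The helper `standardize_view` is hypothetical.

```python
import numpy as np

def standardize_view(mat, eps=1e-12):
    """Drop constant columns, then center and scale the rest to unit variance."""
    keep = mat.std(axis=0) > eps               # feature filter
    kept = mat[:, keep]
    scaled = (kept - kept.mean(axis=0)) / kept.std(axis=0)
    return scaled, keep

rng = np.random.default_rng(2)
view = rng.gamma(2.0, 1.0, size=(200, 6))
view[:, 3] = 0.0                               # a constant feature that should be dropped

scaled, kept_mask = standardize_view(view)
print(scaled.shape, kept_mask)
```

Scaling features to a common variance keeps a single high-magnitude feature from dominating the factors.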

## 5. Run MuVI Analysis

Now let's run the actual MuVI analysis to identify latent factors that capture coordinated programs across views.

In [None]:
# Run MuVI on the preprocessed data
mdata_muvi = run_muvi(
    mdata_processed,
    n_factors=10,
    nmf=False,  # Use standard factor analysis, not non-negative
    device="cpu"
)

print("MuVI analysis completed!")
print(f"Number of factors: {mdata_muvi.obsm['X_muvi'].shape[1]}")
print(f"Factor scores shape: {mdata_muvi.obsm['X_muvi'].shape}")
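
To build intuition for what "factor scores" and "loadings" are, the sketch below decomposes toy multi-view data with a plain truncated SVD on the concatenated views. This is a deliberately simple stand-in, not MuVI's model or fitting procedure; it only shows the shapes involved: one score matrix shared across views and one loading block per view.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_factors = 200, 3
scores_true = rng.standard_normal((n_samples, n_factors))
views = {
    name: scores_true @ rng.standard_normal((n_factors, n_vars))
    + 0.1 * rng.standard_normal((n_samples, n_vars))
    for name, n_vars in {"view1": 5, "view2": 10, "view3": 15}.items()
}

# Concatenate views feature-wise and center each feature
stacked = np.hstack(list(views.values()))
stacked = stacked - stacked.mean(axis=0)

# Truncated SVD: scores = U * S, loadings = leading rows of Vt
u, s, vt = np.linalg.svd(stacked, full_matrices=False)
scores = u[:, :n_factors] * s[:n_factors]      # (200, 3) sample-by-factor
loadings = vt[:n_factors]                      # (3, 30) factor-by-feature

# Split the loadings back into per-view blocks
splits = np.cumsum([v.shape[1] for v in views.values()])[:-1]
loadings_per_view = dict(zip(views, np.split(loadings, splits, axis=1)))
print(scores.shape, {k: w.shape for k, w in loadings_per_view.items()})
```

The score matrix plays the same role as `obsm['X_muvi']` above: one row per sample, one column per factor.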

## 6. Analyze MuVI Results

Let's analyze the identified factors and understand what they represent.

In [None]:
# Extract factor scores and variance explained
factor_scores = get_factor_scores(mdata_muvi)
var_explained = muvi_runner.get_variance_explained(mdata_muvi)

print(f"Factor scores shape: {factor_scores.shape}")
print("\nVariance explained by each factor:")
for view_name, var_exp in var_explained.items():
    print(f"  {view_name}: {var_exp}")
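
One standard definition of variance explained per factor is the R² of the rank-1 reconstruction from that factor alone. The sketch below computes it for a single toy view. Whether `get_variance_explained` uses exactly this definition is an assumption, and the helper name `variance_explained` is hypothetical.

```python
import numpy as np

def variance_explained(view, scores, loadings):
    """Fraction of a centered view's variance captured by each factor alone."""
    total = np.sum(view ** 2)
    r2 = []
    for k in range(scores.shape[1]):
        recon_k = np.outer(scores[:, k], loadings[k])   # rank-1 contribution of factor k
        r2.append(1.0 - np.sum((view - recon_k) ** 2) / total)
    return np.array(r2)

rng = np.random.default_rng(4)
scores = rng.standard_normal((200, 2)) * np.array([2.0, 0.5])   # factor 0 dominates
scores = scores - scores.mean(axis=0)
loadings = rng.standard_normal((2, 8))
view = scores @ loadings + 0.1 * rng.standard_normal((200, 8))
view = view - view.mean(axis=0)

r2 = variance_explained(view, scores, loadings)
print(r2.round(3))
```

Computing this per view is what makes it possible to tell shared factors (high R² in several views) from view-specific ones.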

In [None]:
# Characterize factors by identifying top contributing genes
factor_genes = analysis.characterize_factors(
    mdata_muvi,
    n_top_genes=3,
    loading_threshold=0.1
)

print("Top genes for each factor:")
for view_name, view_factors in factor_genes.items():
    print(f"\n{view_name.upper()} VIEW:")
    for factor_id, genes in view_factors.items():
        if genes:
            top_genes = [gene['gene'] for gene in genes[:3]]
            print(f"  Factor {factor_id}: {', '.join(top_genes)}")
    if not any(view_factors.values()):
        print("  No genes above threshold")
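
The selection logic behind "top genes" can be sketched as ranking features by absolute loading and discarding those below a threshold. The example below mirrors the `{'gene': ...}` entries consumed by the loop above, but the helper `top_features` and its exact rules are assumptions rather than the code of `characterize_factors`.

```python
import numpy as np

def top_features(loadings, feature_names, n_top=3, threshold=0.1):
    """For each factor (row), list the strongest features above the threshold."""
    result = {}
    for k, row in enumerate(loadings):
        order = np.argsort(-np.abs(row))[:n_top]        # largest |loading| first
        result[k] = [
            {"gene": feature_names[i], "loading": float(row[i])}
            for i in order
            if abs(row[i]) >= threshold
        ]
    return result

names = ["g0", "g1", "g2", "g3", "g4"]
loadings = np.array([
    [0.90, -0.05, 0.40, -0.70, 0.02],   # three features above the threshold
    [0.01, 0.03, -0.02, 0.04, 0.05],    # nothing above the threshold
])
top = top_features(loadings, names)
print(top)
```

Ranking by absolute value keeps strongly negative loadings, which are as informative as positive ones.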

In [None]:
# Identify associations between factors and metadata
associations = analysis.identify_factor_associations(
    mdata_muvi,
    metadata_columns=['cell_type', 'condition'],
    test_types=['anova', 'correlation']
)

print("Factor-metadata associations:")
if associations is not None and not associations.empty:
    # Filter significant associations
    significant = associations[associations['p_value'] < 0.05]
    if not significant.empty:
        print("\nSignificant associations (p < 0.05):")
        for _, row in significant.iterrows():
            print(f"  {row['factor']} ↔ {row['metadata']}: {row['test']} (p = {row['p_value']:.3f})")
    else:
        print("  No significant associations found")
else:
    print("  No associations computed")
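
The two test types named above can be illustrated on toy data: a one-way ANOVA F statistic for a categorical covariate and a Pearson correlation for a numeric one. The sketch below computes both statistics with NumPy only. The covariates `condition` and `age` are made up for the example, and converting the statistics to p-values would additionally need a distribution function (for example from SciPy).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between a factor score and a numeric covariate."""
    return float(np.corrcoef(x, y)[0, 1])

def anova_f(values, groups):
    """One-way ANOVA F statistic for a factor score across categorical groups."""
    labels = np.unique(groups)
    grand_mean = values.mean()
    between = 0.0
    within = 0.0
    for g in labels:
        member = values[groups == g]
        between += len(member) * (member.mean() - grand_mean) ** 2
        within += np.sum((member - member.mean()) ** 2)
    df_between, df_within = len(labels) - 1, len(values) - len(labels)
    return float((between / df_between) / (within / df_within))

rng = np.random.default_rng(5)
condition = np.repeat(np.array(["ctrl", "treated"]), 100)
factor0 = rng.standard_normal(200) + 2.0 * (condition == "treated")  # shifted by condition
factor1 = rng.standard_normal(200)                                   # unrelated
age = rng.uniform(20, 80, size=200)

print(f"F(factor0 ~ condition) = {anova_f(factor0, condition):.1f}")
print(f"F(factor1 ~ condition) = {anova_f(factor1, condition):.1f}")
print(f"r(factor1, age) = {pearson_r(factor1, age):.3f}")
```

When many factor-covariate pairs are tested, as in the cell above, a multiple-testing correction is worth considering before reading p < 0.05 as significant.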

In [None]:
# Cluster samples based on factor scores
cluster_labels = analysis.cluster_cells_by_factors(
    mdata_muvi,
    n_clusters=3,
    top_factors=5
)

# Add cluster labels to metadata
mdata_muvi.obs['cluster'] = cluster_labels

print(f"Clustered samples into {len(np.unique(cluster_labels))} clusters")
print("\nCluster sizes:")
cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    print(f"  Cluster {cluster_id}: {count} samples")
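
The clustering algorithm used by `cluster_cells_by_factors` is not named in this tutorial. As an illustration only, the sketch below runs a minimal k-means (Lloyd's algorithm) on toy factor scores, assuming that `top_factors=5` means "use the first five factor columns".

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm; returns one integer label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center (squared Euclidean distance)
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old one if a cluster ends up empty
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

rng = np.random.default_rng(6)
# Toy factor scores: three groups of samples, 10 factors, only the first 5 are used
offsets = np.repeat(np.array([[-4.0], [0.0], [4.0]]), [70, 60, 70], axis=0)
factor_scores = rng.standard_normal((200, 10)) + offsets
labels = kmeans(factor_scores[:, :5], n_clusters=3)

ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))
```

K-means depends on its random initialization, so in practice it is usually run from several starting points and the best solution kept.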

## 7. Visualize Results

Let's create visualizations to understand the identified factors and their relationships.

In [None]:
# Plot variance explained by factors
p1 = visualization.plot_variance_explained(mdata_muvi)
print("Variance explained by factors (by view):")
p1.show()

In [None]:
# Plot factor scores in 2D space
p2 = visualization.plot_factor_scores(
    mdata_muvi,
    factors=[0, 1],
    color_by='cell_type',
    size=3
)
print("Factor scores (Factor 0 vs Factor 1):")
p2.show()

In [None]:
# Plot factor loadings for the first view
view_name = list(mdata_muvi.mod.keys())[0]
p3 = visualization.plot_factor_loadings(
    mdata_muvi,
    view=view_name,
    factor=0,
    n_top_genes=5
)
print(f"Top gene loadings for Factor 0 in {view_name}:")
p3.show()

In [None]:
# Plot samples colored by cluster in factor space
p4 = visualization.plot_factor_scores(
    mdata_muvi,
    factors=[0, 1],
    color_by='cluster',
    size=3
)
print("Sample clusters in factor space:")
p4.show()

In [None]:
# Compare factor activity across cell types
p5 = visualization.plot_factor_comparison(
    mdata_muvi,
    factors=[0, 1, 2],
    group_by='cell_type',
    plot_type='boxplot'
)
print("Factor activity by cell type:")
p5.show()

## 8. Summary Analysis

Let's compute summary statistics for the fitted factors.

In [None]:
# Create summary of factor activity by cell type
factor_summary = analysis.summarize_factor_activity(
    mdata_muvi,
    group_by='cell_type',
    top_factors=5
)

print("Factor activity summary by cell type:")
print(factor_summary.head(15))
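
A per-group summary of factor activity is, at its simplest, the mean factor score within each group. The sketch below computes that table with NumPy on toy data. The statistics actually returned by `summarize_factor_activity` are not specified here, so the group mean is an assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
cell_type = np.array(["B", "T", "NK"])[rng.integers(0, 3, size=200)]
factor_scores = rng.standard_normal((200, 5))
factor_scores[cell_type == "T", 0] += 3.0          # factor 0 is high in T samples

# Mean factor score per group: rows are groups, columns are factors
groups, inverse = np.unique(cell_type, return_inverse=True)
group_means = np.vstack([
    factor_scores[inverse == g].mean(axis=0) for g in range(len(groups))
])

for name, row in zip(groups, group_means):
    print(f"{name:>3}: " + "  ".join(f"{v:+.2f}" for v in row))
```

A factor whose mean differs strongly between groups, like factor 0 here, is a natural candidate for the association tests shown earlier.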

In [None]:
# Calculate factor correlations
factor_correlations = analysis.calculate_factor_correlations(mdata_muvi)

print("Factor correlation matrix:")
print(factor_correlations.round(3))

# Find highly correlated factors
corr_threshold = 0.5
high_corr_pairs = []
for i in range(factor_correlations.shape[0]):
    for j in range(i+1, factor_correlations.shape[1]):
        if abs(factor_correlations.iloc[i, j]) > corr_threshold:
            high_corr_pairs.append((i, j, factor_correlations.iloc[i, j]))

if high_corr_pairs:
    print(f"\nHighly correlated factor pairs (|r| > {corr_threshold}):")
    for i, j, corr in high_corr_pairs:
        print(f"  Factor {i} ↔ Factor {j}: r = {corr:.3f}")
else:
    print(f"\nNo highly correlated factor pairs found (|r| > {corr_threshold})")
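
The nested loop above visits each factor pair once by scanning the upper triangle of the correlation matrix. The same search can be written without explicit index loops using `np.triu_indices_from`, sketched here on toy scores in which factor 3 is constructed to track factor 0.

```python
import numpy as np

rng = np.random.default_rng(8)
scores = rng.standard_normal((200, 4))
scores[:, 3] = scores[:, 0] + 0.3 * rng.standard_normal(200)   # factor 3 tracks factor 0

corr = np.corrcoef(scores, rowvar=False)        # (4, 4) factor-by-factor
corr_threshold = 0.5

# Upper triangle without the diagonal, so each pair appears once
rows, cols = np.triu_indices_from(corr, k=1)
mask = np.abs(corr[rows, cols]) > corr_threshold
high_corr_pairs = [
    (int(i), int(j), float(corr[i, j])) for i, j in zip(rows[mask], cols[mask])
]
print(high_corr_pairs)
```

Strongly correlated factors may describe the same underlying program, which can be a hint that fewer factors would suffice.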

In [None]:
# Select top factors based on variance explained
top_factors = muvi_runner.select_top_factors(
    mdata_muvi,
    n_top_factors=3
)

print(f"Top 3 factors by variance explained: {top_factors}")

# Calculate total variance explained
total_var_per_factor = []
for i in range(mdata_muvi.obsm['X_muvi'].shape[1]):
    total_var = sum(var_explained[view][i] for view in var_explained)
    total_var_per_factor.append(total_var)

print("\nTotal variance explained by each factor:")
for i, var in enumerate(total_var_per_factor):
    status = "(top)" if i in top_factors else ""
    print(f"  Factor {i}: {var:.3f} {status}")
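
The ranking step can be sketched in a few lines: sum the variance explained across views, then take the indices of the largest totals. The example assumes the same layout the loop above relies on (one per-factor sequence per view) and uses made-up numbers. That `select_top_factors` ranks by this cross-view sum is an assumption.

```python
import numpy as np

# Toy variance-explained values: one array per view, one entry per factor
var_explained = {
    "view1": np.array([0.30, 0.05, 0.10, 0.01]),
    "view2": np.array([0.20, 0.02, 0.25, 0.01]),
    "view3": np.array([0.10, 0.03, 0.15, 0.02]),
}

# Sum across views, then take the indices of the largest totals
total_var = np.sum(list(var_explained.values()), axis=0)
n_top = 2
top = np.argsort(total_var)[::-1][:n_top]

print(total_var.round(2), top.tolist())
```

Summing across views favors shared factors. Ranking within a single view instead would surface factors specific to that view.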

## Conclusion

This tutorial walked through the complete MuVIcell workflow:

1. **Data Generation**: Created synthetic multi-view data and added latent factor structure
2. **Validation and Preprocessing**: Checked the data, then applied filtering and normalization
3. **MuVI Analysis**: Identified latent factors with MuVI
4. **Factor Analysis**: Characterized factors by top genes, metadata associations, and sample clusters
5. **Visualization**: Plotted variance explained, factor scores, and factor loadings
6. **Summary**: Computed factor activity by group, factor correlations, and the top factors by variance explained

MuVIcell wraps these steps in a single workflow, from multi-view integration through interpretation and visualization. The same workflow can be applied to your own multi-view single-cell datasets.