# MuVIcell Tutorial: Multi-View Integration for Sample-Aggregated Single-Cell Data

This notebook demonstrates how to use the MuVIcell package to integrate and analyze sample-aggregated single-cell data with MuVI, a multi-view latent variable model.

## Overview

MuVIcell provides a streamlined workflow for:
1. **Generating/Loading** multi-view data in muon format (samples x features)
2. **Preprocessing** data for MuVI analysis
3. **Running MuVI** to identify latent factors using `muvi.tl.from_mdata`
4. **Analyzing** and interpreting factors
5. **Visualizing** results

Note: Each row represents a **sample**, not an individual cell, and each view holds **cell-type-aggregated data per sample**.
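
To make the samples x features layout concrete, the sketch below aggregates a toy cell-level table into one row per sample for each cell type, using plain pandas. The column names (`sample_id`, `cell_type`, `geneA`, `geneB`), the use of the mean as the aggregate, and the choice of one view per cell type are illustrative assumptions, not MuVIcell's own aggregation code.

```python
import pandas as pd

# Toy cell-level table: one row per cell (all names are hypothetical).
cells = pd.DataFrame({
    "sample_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "cell_type": ["T", "T", "B", "T", "B", "B"],
    "geneA": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
    "geneB": [0.0, 2.0, 1.0, 1.0, 3.0, 5.0],
})

# Average expression per (sample, cell type), then split into one
# samples x features table per cell type (assumed here: one view each).
pseudobulk = cells.groupby(["sample_id", "cell_type"]).mean()
views = {
    cell_type: table.droplevel("cell_type")
    for cell_type, table in pseudobulk.groupby(level="cell_type")
}

for cell_type, table in views.items():
    print(cell_type, table.shape)
```

Each table in `views` is indexed by sample, which is the shape a muon container expects for each modality.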

In [None]:
import numpy as np
import pandas as pd
import muon as mu
import warnings
warnings.filterwarnings('ignore')

# Import muvicell modules
from muvicell import synthetic, preprocessing, analysis, visualization

# Import MuVI directly to show compatibility
try:
    import muvi
    import muvi.tl
    MUVI_AVAILABLE = True
    print("MuVI is available")
except ImportError:
    MUVI_AVAILABLE = False
    print("MuVI not available - using mock implementation")

# This reports whether MuVI could be imported, not a Python version check
print(f"MuVI importable in this environment: {MUVI_AVAILABLE}")

## 1. Generate Synthetic Multi-View Data

Generate synthetic data with 3 views (5, 10, 15 features) and 200 samples:

In [None]:
# Generate synthetic multi-view data (3 views matching 3 true factors)
mdata = synthetic.generate_synthetic_data(
    n_samples=200,
    n_true_factors=3,  # Match number of factors we'll infer
    view_configs={
        'view1': {'n_vars': 5, 'sparsity': 0.3},
        'view2': {'n_vars': 10, 'sparsity': 0.4},
        'view3': {'n_vars': 15, 'sparsity': 0.5}
    }
)

print("Generated synthetic data:")
print(f"- Samples: {mdata.n_obs}")
print(f"- Views: {len(mdata.mod)} ({', '.join([f'{k}: {v.n_vars} features' for k, v in mdata.mod.items()])})")
print(f"- Total features: {sum(v.n_vars for v in mdata.mod.values())}")

## 2. Add Latent Factor Structure

Add realistic latent factor structure to the synthetic data:

In [None]:
# Add latent structure with 3 factors (matching n_true_factors)
mdata_structured = synthetic.add_latent_structure(
    mdata, 
    n_latent_factors=3
)

print(f"Sample metadata columns: {list(mdata_structured.obs.columns)}")
print(f"Unique cell types: {mdata_structured.obs['cell_type'].unique()}")
print(f"Unique conditions: {mdata_structured.obs['condition'].unique()}")
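
What "latent factor structure" means can be sketched with NumPy: each view is a shared factor-score matrix multiplied by a view-specific sparse loading matrix, plus noise. This is a generic linear factor model written for illustration. The noise level, the seed, and the way sparsity is applied are assumptions and may differ from what `synthetic.add_latent_structure` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_factors = 200, 3
view_sizes = {"view1": 5, "view2": 10, "view3": 15}

# Factor scores are shared by all views: one row per sample.
Z = rng.normal(size=(n_samples, n_factors))

toy_views = {}
for name, n_vars in view_sizes.items():
    # View-specific loadings; zero out roughly 40% of them to mimic sparsity
    # (the 40% figure is an arbitrary choice for this sketch).
    W = rng.normal(size=(n_factors, n_vars))
    W[rng.random(W.shape) < 0.4] = 0.0
    noise = 0.1 * rng.normal(size=(n_samples, n_vars))
    toy_views[name] = Z @ W + noise

print({name: Y.shape for name, Y in toy_views.items()})
```

Because `Z` is shared, any factor with non-zero loadings in several views induces correlated variation across them, which is the signal a multi-view factor model tries to recover.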

## 3. Preprocess Data for MuVI

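Apply the preprocessing pipeline. Filtering and highly variable gene (HVG) selection are switched off because the synthetic data needs neither; only normalization is applied: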

In [None]:
# Preprocess for MuVI analysis
mdata_processed = preprocessing.preprocess_for_muvi(
    mdata_structured,
    filter_cells=False,  # Don't filter synthetic data
    filter_genes=False,  # Don't filter synthetic data
    normalize=True,
    find_hvg=False,      # Skip HVG for synthetic data
    subset_hvg=False
)

print(f"Preprocessed data shape: {mdata_processed.shape}")
print("Data ready for MuVI analysis")
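
Factor models are sensitive to feature scale, so normalization usually involves centering and scaling each feature within a view. The sketch below z-scores the columns of a NumPy matrix. It illustrates that idea only: the exact transformation applied by `normalize=True` is not specified in this tutorial, and the helper name `zscore_features` is hypothetical.

```python
import numpy as np

def zscore_features(X, eps=1e-8):
    """Center each column to mean 0 and scale it to (near) unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    # eps keeps constant features from causing a division by zero.
    return (X - mean) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
X_scaled = zscore_features(X)
print(X_scaled.mean(axis=0).round(3))
print(X_scaled.std(axis=0).round(3))
```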

## 4. Run MuVI Analysis

Run MuVI through its `muvi.tl.from_mdata` entry point, requesting 3 factors to match the synthetic data. If MuVI is not installed, a mock model is used instead:

In [None]:
# Build and fit the model; fall back to a mock model if MuVI is unavailable
if MUVI_AVAILABLE:
    # Real MuVI using the standard API
    model = muvi.tl.from_mdata(
        mdata_processed,
        n_factors=3,  # Match number of true factors
        nmf=False,
        device='cpu'
    )
    
    # Fit the model
    model.fit()
    
    print(f"MuVI model fitted with {model.n_factors} factors")
    
    # Display variance explained
    var_exp = muvi.tl.variance_explained(model)
    print(f"Average variance explained: {np.mean(list(var_exp[0].values())):.3f}")
    
else:
    # Use mock implementation for demonstration
    from muvicell.muvi_runner import _create_mock_muvi_model
    model = _create_mock_muvi_model(mdata_processed, n_factors=3)
    model.fit()
    print("Mock MuVI model fitted with 3 factors")

# Check factor scores
factor_scores = model.get_factor_scores()
print(f"Factor scores shape: {factor_scores.shape}")
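
The variance-explained figure reported in the cell above can be read as a coefficient of determination per view: one minus the ratio of the residual to the total sum of squares after reconstructing the view from scores and loadings. The sketch below computes this on toy arrays. It follows the textbook definition and makes no claim about how `muvi.tl.variance_explained` is implemented; `r2_score_view` is a hypothetical helper.

```python
import numpy as np

def r2_score_view(Y, Z, W):
    """R^2 of reconstructing a centered view Y (samples x features) as Z @ W."""
    residual = Y - Z @ W
    return 1.0 - np.sum(residual**2) / np.sum(Y**2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))          # toy factor scores
W = rng.normal(size=(3, 10))           # toy loadings for one view
Y = Z @ W + 0.5 * rng.normal(size=(200, 10))
Y = Y - Y.mean(axis=0)                 # center features first

r2_all = r2_score_view(Y, Z, W)
# Contribution of factor 0 alone: keep only its scores and loadings.
r2_factor0 = r2_score_view(Y, Z[:, [0]], W[[0], :])
print(f"all factors: {r2_all:.3f}, factor 0 only: {r2_factor0:.3f}")
```

Comparing the per-factor value with the all-factor value shows how much of a view a single factor accounts for.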

## 5. Characterize Factors

Identify top genes contributing to each factor:

In [None]:
# Characterize factors to identify top contributing genes
factor_genes = analysis.characterize_factors(
    model, 
    top_genes_per_factor=3  # Get top 3 genes per factor for demo
)

print("Top genes for each factor:")
for view_name, view_genes in factor_genes.items():
    print(f"\n{view_name}:")
    if len(view_genes) > 0:
        for factor in view_genes['factor'].unique():
            factor_data = view_genes[view_genes['factor'] == factor]
            genes = ', '.join(factor_data['gene'].head(3).values)
            print(f"  Factor {factor}: {genes}")
    else:
        print("  No significant genes found")
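
A common way to pick "top" features is to rank them by absolute loading within each factor. The sketch below does this for a toy loadings table and returns a frame with `factor` and `gene` columns, mirroring the column names used in the cell above. Whether `analysis.characterize_factors` ranks by absolute loading is an assumption, and `top_features` is a hypothetical helper.

```python
import pandas as pd

def top_features(loadings, n_top=3):
    """Return the n_top features with the largest |loading| for each factor."""
    rows = []
    for factor, weights in loadings.iterrows():
        ranked = weights.abs().sort_values(ascending=False).head(n_top)
        for gene in ranked.index:
            rows.append({"factor": factor, "gene": gene, "loading": weights[gene]})
    return pd.DataFrame(rows)

# Toy loadings: factors in rows, features in columns (names are made up).
loadings = pd.DataFrame(
    [[0.9, -0.1, 0.0, 0.4], [0.0, -0.8, 0.7, 0.1]],
    index=[0, 1],
    columns=["g1", "g2", "g3", "g4"],
)
top = top_features(loadings, n_top=2)
print(top)
```

Ranking by absolute value keeps strongly negative loadings, which are as informative as strongly positive ones; the signed value is retained in the `loading` column.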

## 6. Analyze Factor Associations

Test associations between factors and sample metadata:

In [None]:
# Identify factor associations with metadata
associations = analysis.identify_factor_associations(
    model,
    categorical_test='kruskal'
)

print("Factor-metadata associations:")
if len(associations) > 0:
    significant = associations[associations['p_value'] < 0.05]
    print(f"Found {len(significant)} significant associations (p < 0.05)")
    if len(significant) > 0:
        print(significant[['factor', 'metadata', 'test', 'p_value']].head())
else:
    print("No associations found")
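
When several factors are tested against several metadata columns, the raw `p < 0.05` threshold can overstate the number of real associations. The sketch below runs a Kruskal-Wallis test per factor on toy scores with `scipy.stats.kruskal` and then applies a Benjamini-Hochberg adjustment written in NumPy. The toy data and the `benjamini_hochberg` helper are illustrative assumptions; whether `identify_factor_associations` already adjusts its p-values is not stated here.

```python
import numpy as np
from scipy.stats import kruskal

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    result = np.empty(n)
    result[order] = np.clip(adjusted, 0.0, 1.0)
    return result

rng = np.random.default_rng(0)
condition = np.repeat(["ctrl", "treated"], 100)
scores = rng.normal(size=(200, 3))
scores[condition == "treated", 0] += 2.0   # only factor 0 differs by group

p_values = [
    kruskal(scores[condition == "ctrl", k], scores[condition == "treated", k]).pvalue
    for k in range(3)
]
q_values = benjamini_hochberg(p_values)
print(np.round(q_values, 4))
```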

## 7. Cluster Samples by Factors

Cluster samples based on their factor scores:

In [None]:
# Cluster samples based on factor scores
clusters = analysis.cluster_cells_by_factors(
    model,
    factors_to_use=None,  # Use all factors
    n_clusters=3
)

print("Sample clustering results:")
print(f"Cluster distribution: {np.bincount(clusters)}")
print(f"Number of clusters: {len(np.unique(clusters))}")

## 8. Visualize Results

Plot variance explained, factor scores, factor loadings, and factor activity by group:

In [None]:
# 1. Variance explained by factors
p1 = visualization.plot_variance_explained(model, max_factors=3)
print("Variance explained by factors:")
p1.show()

In [None]:
# 2. Factor scores colored by cell type
p2 = visualization.plot_factor_scores(model, factors=(0, 1), color_by='cell_type')
print("Factor scores (Factor 0 vs Factor 1):")
p2.show()

In [None]:
# 3. Factor loadings for view1
p3 = visualization.plot_factor_loadings(model, 'view1', factor=0, top_genes=5)
print("Top gene loadings for Factor 0 in view1:")
p3.show()

In [None]:
# 4. Factor activity comparison across cell types
p4 = visualization.plot_factor_comparison(
    model,
    factors=[0, 1, 2],
    group_by='cell_type',
    plot_type='boxplot'
)
print("Factor activity by cell type:")
p4.show()

## 9. Summary Analysis

Calculate factor correlations and summary statistics:

In [None]:
# Calculate factor correlations
factor_corr = analysis.calculate_factor_correlations(model)
print("Factor correlation matrix:")
print(factor_corr.round(3))

# Summary statistics
factor_scores = model.get_factor_scores()
print("\nFactor scores summary:")
print(f"- Mean factor scores: {np.mean(factor_scores, axis=0).round(3)}")
print(f"- Std factor scores: {np.std(factor_scores, axis=0).round(3)}")

print(f"\nAnalysis complete! Successfully identified {factor_scores.shape[1]} latent factors from {sum(v.n_vars for v in mdata_processed.mod.values())} total features across {len(mdata_processed.mod)} views.")
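
A factor correlation matrix is the Pearson correlation between the columns of the score matrix. The sketch below computes one with `np.corrcoef` on toy scores in which two factors are deliberately correlated. It is a generic illustration, not the implementation of `analysis.calculate_factor_correlations`, and the `factor_k` labels are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(200, 3))
# Make factors 0 and 2 correlated on purpose.
toy_scores[:, 2] = 0.8 * toy_scores[:, 0] + 0.2 * toy_scores[:, 2]

names = [f"factor_{k}" for k in range(toy_scores.shape[1])]
# rowvar=False treats columns (factors) as the variables to correlate.
corr = pd.DataFrame(np.corrcoef(toy_scores, rowvar=False), index=names, columns=names)
print(corr.round(3))
```

Large off-diagonal values, as between `factor_0` and `factor_2` here, can hint that two factors capture overlapping signal and are worth inspecting together.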