# MuVIcell Tutorial: Multi-View Integration for Single-Cell Data

This notebook shows how to use the MuVIcell package to integrate and analyze multi-view single-cell data with MuVI (Multi-View Integration).

## Overview

MuVIcell provides a streamlined workflow for:
1. **Generating/Loading** multi-view data in muon format
2. **Preprocessing** data for MuVI analysis
3. **Running MuVI** to identify latent factors
4. **Analyzing** and interpreting factors
5. **Visualizing** results

Let's start by importing the necessary modules.

In [None]:
import numpy as np
import pandas as pd
import muon as mu
import warnings
warnings.filterwarnings('ignore')

# Import muvicell modules
from muvicell import synthetic, preprocessing, muvi_runner, analysis, visualization
# Also import the MuVI wrapper functions used below directly
from muvicell import run_muvi, get_factor_scores

# Set random seed for reproducibility
np.random.seed(42)

## 1. Generate Synthetic Multi-View Data

First, let's generate synthetic multi-view data with the following layout:
- 3 views with 200 rows (cells)
- Views with 5, 10, and 15 columns (features) respectively

In [None]:
# Define view configurations
view_configs = {
    'view1': {'n_vars': 5, 'sparsity': 0.3},
    'view2': {'n_vars': 10, 'sparsity': 0.4},
    'view3': {'n_vars': 15, 'sparsity': 0.5}
}

# Generate synthetic data
mdata_raw = synthetic.generate_synthetic_data(
    n_cells=200,
    view_configs=view_configs,
    random_state=42
)

print(f"Generated synthetic data with {mdata_raw.n_obs} cells")
print(f"Number of views: {len(mdata_raw.mod)}")

# Display view information
from muvicell.data import get_view_info
view_info = get_view_info(mdata_raw)
print("\nView information:")
print(view_info)
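
To make the layout concrete, here is a minimal NumPy sketch of data with the same shape. It is not how `synthetic.generate_synthetic_data` is implemented (the tutorial does not show that); in particular, it assumes that `sparsity` means the fraction of entries set to zero, and the names `sketch_configs` and `sketch_views` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cells = 200

# Hypothetical stand-in for view_configs: feature count and assumed zero fraction per view
sketch_configs = {
    "view1": {"n_vars": 5, "sparsity": 0.3},
    "view2": {"n_vars": 10, "sparsity": 0.4},
    "view3": {"n_vars": 15, "sparsity": 0.5},
}

sketch_views = {}
for name, cfg in sketch_configs.items():
    values = rng.normal(size=(n_cells, cfg["n_vars"]))
    # Keep an entry with probability 1 - sparsity, zero it otherwise
    keep = rng.random(values.shape) >= cfg["sparsity"]
    sketch_views[name] = values * keep

for name, matrix in sketch_views.items():
    print(name, matrix.shape, f"fraction of zeros: {np.mean(matrix == 0):.2f}")
```

All views share the same 200 rows, which is what lets a multi-view model tie them together through per-cell factors.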

## 2. Add Realistic Structure to the Data

Let's add some realistic latent factor structure to make the analysis more interesting.

In [None]:
# Add realistic latent factor structure
mdata_structured = synthetic.add_realistic_structure(
    mdata_raw,
    n_latent_factors=4,
    factor_variance=[0.25, 0.20, 0.15, 0.10]
)

print(f"Added {mdata_structured.obsm['true_factors'].shape[1]} latent factors")
print(f"True factor variance: {mdata_structured.uns['true_factor_variance']}")
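
One common way to think about latent factor structure is the linear model `Y = Z Wᵀ + noise`, where `Z` holds per-cell factor scores and `W` holds per-feature loadings. The sketch below illustrates that idea with the same four factor variances; whether `add_realistic_structure` uses exactly this model is an assumption, and the noise scale of 0.1 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_features, n_factors = 200, 10, 4
factor_variance = np.array([0.25, 0.20, 0.15, 0.10])

# Factor scores: one column per factor, scaled so each column has the target variance
z = rng.normal(size=(n_cells, n_factors)) * np.sqrt(factor_variance)
# Loadings: how strongly each feature responds to each factor
w = rng.normal(size=(n_features, n_factors))
# Feature-level noise (scale chosen arbitrarily for the sketch)
noise = rng.normal(scale=0.1, size=(n_cells, n_features))

y = z @ w.T + noise
print(y.shape)
print("empirical factor variance:", z.var(axis=0).round(2))
```

Because the scores are shared across views while each view gets its own loadings, a factor can be strong in one view and nearly absent in another.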

## 3. Data Validation and Preprocessing

Before running MuVI, let's validate our data and preprocess it appropriately.

In [None]:
# Validate the data for MuVI analysis
from muvicell.data import validate_muon_data

validation_results = validate_muon_data(mdata_structured)
print("Data validation results:")
for check, result in validation_results.items():
    status = "✓" if result else "✗"
    print(f"  {status} {check}: {result}")

if all(validation_results.values()):
    print("\n✓ Data is valid for MuVI analysis!")
else:
    print("\n✗ Data validation failed. Please check your data.")
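
The tutorial does not list which checks `validate_muon_data` performs. As a rough illustration of the kind of checks that matter before factor analysis, the sketch below validates a plain dict of cells-by-features arrays; the function `validate_views` and its check names are hypothetical.

```python
import numpy as np

def validate_views(views):
    """Run basic sanity checks on a dict of cells-by-features arrays."""
    cell_counts = {v.shape[0] for v in views.values()}
    return {
        "has_multiple_views": len(views) >= 2,
        "same_cell_count": len(cell_counts) == 1,
        "no_missing_values": all(np.isfinite(v).all() for v in views.values()),
        "no_constant_features": all((v.std(axis=0) > 0).all() for v in views.values()),
    }

rng = np.random.default_rng(1)
toy_views = {
    "view1": rng.normal(size=(200, 5)),
    "view2": rng.normal(size=(200, 10)),
}
print(validate_views(toy_views))
```

Returning a dict of named booleans mirrors the loop above, where each check is printed with its own pass or fail marker.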

In [None]:
# Preprocess the data for MuVI; filtering and HVG selection are skipped
# because the synthetic dataset is small
mdata_processed = preprocessing.preprocess_for_muvi(
    mdata_structured,
    filter_cells=False,  # Skip cell filtering for small dataset
    filter_genes=False,  # Skip gene filtering for small dataset
    normalize=True,
    find_hvg=False,      # Skip HVG finding for small dataset
    subset_hvg=False
)

print(f"Preprocessed data shape: {mdata_processed.shape}")
print("\nPreprocessing completed successfully!")
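
What `normalize=True` does internally is not shown here. One common choice before factor analysis is to standardize each feature to zero mean and unit variance, so that no feature dominates just because of its scale. The sketch below shows that choice on a toy matrix; treating it as the package's behavior would be an assumption.

```python
import numpy as np

def standardize(matrix):
    """Center each feature (column) to mean 0 and scale it to unit variance."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # constant features become all zeros instead of dividing by zero
    return (matrix - mean) / std

rng = np.random.default_rng(2)
raw = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
scaled = standardize(raw)
print("means:", scaled.mean(axis=0).round(6))
print("stds: ", scaled.std(axis=0).round(6))
```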

## 4. Run MuVI Analysis

Now let's run MuVI to identify latent factors across views.

**Note about MuVI installation**: The MuVI package requires Python <3.11 and specific dependencies. If MuVI is not available in your environment, MuVIcell falls back to a mock implementation for demonstration purposes. To install MuVI:

```bash
# For Python <3.11
pip install muvi
# Or install muvicell with MuVI support
pip install "muvicell[muvi]"
```
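
Before running the analysis, it can help to check which situation you are in. The sketch below uses only the standard library; it assumes the import name of the MuVI package is `muvi`, and the helper `muvi_status` is made up for this example.

```python
import importlib.util
import sys

def muvi_status():
    """Report whether Python is below 3.11 and whether a `muvi` module can be found."""
    return {
        "python_below_3_11": sys.version_info < (3, 11),
        # find_spec returns None when no top-level module with this name is installed
        "muvi_importable": importlib.util.find_spec("muvi") is not None,
    }

print(muvi_status())
```

If `muvi_importable` is `False`, the results in the rest of this notebook come from the mock implementation mentioned above.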

In [None]:
# Run MuVI analysis
mdata_muvi = run_muvi(
    mdata_processed, 
    n_factors=6,
    n_iterations=100,  # Reduced for demo
    verbose=True
)

print("MuVI analysis completed!")
print(f"Number of factors: {mdata_muvi.obsm['X_muvi'].shape[1]}")
print(f"Factor scores shape: {mdata_muvi.obsm['X_muvi'].shape}")

## 5. Analyze MuVI Results

Let's analyze the factors to understand what biological processes they might represent.

In [None]:
# Get factor scores and per-view variance explained
factor_scores = get_factor_scores(mdata_muvi)
variance_explained = muvi_runner.get_variance_explained(mdata_muvi)

print(f"Factor scores shape: {factor_scores.shape}")
print("\nVariance explained by each factor:")
for view_name, var_exp in variance_explained.items():
    print(f"  {view_name}: {var_exp}")
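
To give a feel for what a per-factor, per-view variance value can mean, the sketch below computes an R² for each factor's rank-one reconstruction of a single view. This is one common definition for factor models; MuVI and `get_variance_explained` may compute it differently, and the function name here is hypothetical.

```python
import numpy as np

def variance_explained_per_factor(y, z, w):
    """R^2 of each factor's rank-one reconstruction of a view y (cells x features)."""
    total = np.sum(y ** 2)
    r2 = []
    for k in range(z.shape[1]):
        residual = y - np.outer(z[:, k], w[:, k])
        r2.append(1.0 - np.sum(residual ** 2) / total)
    return np.array(r2)

rng = np.random.default_rng(3)
# Three toy factors with decreasing scale, so they should explain decreasing variance
z = rng.normal(size=(200, 3)) * np.array([2.0, 1.0, 0.5])
w = rng.normal(size=(10, 3))
y = z @ w.T + rng.normal(scale=0.1, size=(200, 10))
y = y - y.mean(axis=0)  # center features before measuring variance

print(variance_explained_per_factor(y, z, w).round(3))
```

A factor whose reconstruction leaves the residual as large as the data itself scores near zero, which is how inactive factors show up in a given view.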

In [None]:
# Characterize factors by top contributing genes
factor_characterization = analysis.characterize_factors(
    mdata_muvi,
    top_genes_per_factor=5,
    loading_threshold=0.01  # Lower threshold for small dataset
)

print("Top genes for each factor:")
for view_name, char_df in factor_characterization.items():
    print(f"\n{view_name.upper()} VIEW:")
    if len(char_df) > 0:
        for factor_id in char_df['factor'].unique():
            factor_genes = char_df[char_df['factor'] == factor_id]
            top_genes = factor_genes.nlargest(3, 'abs_loading')['gene'].tolist()
            print(f"  Factor {factor_id}: {', '.join(top_genes)}")
    else:
        print("  No genes above threshold")
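
The ranking idea behind this step can be sketched with pandas alone: for each factor, sort features by absolute loading, drop those below the threshold, and keep the first few. The loadings table and the helper `top_features` below are made up for illustration and are not part of MuVIcell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Toy loadings table: rows are features, columns are factors (names are illustrative)
loadings = pd.DataFrame(
    rng.normal(size=(10, 3)),
    index=[f"gene_{i}" for i in range(10)],
    columns=[f"factor_{k}" for k in range(3)],
)

def top_features(loadings, n_top=3, threshold=0.01):
    """Return up to n_top features per factor whose |loading| exceeds threshold."""
    result = {}
    for factor in loadings.columns:
        ranked = loadings[factor].abs().sort_values(ascending=False)
        result[factor] = ranked[ranked > threshold].head(n_top).index.tolist()
    return result

for factor, genes in top_features(loadings).items():
    print(f"{factor}: {', '.join(genes)}")
```

Ranking by absolute value keeps features with strong negative loadings, which are as informative about a factor as strong positive ones.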

In [None]:
# Identify factor associations with metadata
factor_associations = analysis.identify_factor_associations(
    mdata_muvi,
    metadata_columns=['cell_type', 'condition', 'batch']
)

print("Factor-metadata associations:")
if len(factor_associations) > 0:
    # Show significant associations
    significant = factor_associations[factor_associations['p_value'] < 0.05]
    if len(significant) > 0:
        print("\nSignificant associations (p < 0.05):")
        for _, row in significant.head(10).iterrows():
            print(f"  {row['factor']} ↔ {row['metadata']}: {row['test']} (p = {row['p_value']:.3f})")
    else:
        print("  No significant associations found")
else:
    print("  No associations computed")
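
The tutorial does not say which statistical test produces these p-values. As a dependency-free illustration of testing whether a factor differs between groups, the sketch below runs a permutation test on the between-group sum of squares. Both helper functions are hypothetical, and the toy data are constructed so that group A is shifted upward.

```python
import numpy as np

def between_group_ss(scores, labels):
    """Sum over groups of group size times squared distance of the group mean from the overall mean."""
    overall = scores.mean()
    return sum(
        np.sum(labels == g) * (scores[labels == g].mean() - overall) ** 2
        for g in np.unique(labels)
    )

def permutation_p_value(scores, labels, n_permutations=999, seed=0):
    """Share of label shuffles that separate the scores at least as well as the real labels."""
    rng = np.random.default_rng(seed)
    observed = between_group_ss(scores, labels)
    hits = sum(
        between_group_ss(scores, rng.permutation(labels)) >= observed
        for _ in range(n_permutations)
    )
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

rng = np.random.default_rng(5)
labels = np.repeat(np.array(["A", "B"]), 100)
scores = rng.normal(size=200) + np.where(labels == "A", 1.5, 0.0)
print(permutation_p_value(scores, labels))
```

When many factors are tested against several metadata columns, as above, a multiple-testing correction is worth considering before reading individual p-values.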

In [None]:
# Cluster cells based on factor activity
cluster_labels = analysis.cluster_cells_by_factors(
    mdata_muvi,
    n_clusters=4,
    random_state=42
)

# Add cluster labels to metadata
mdata_muvi.obs['factor_clusters'] = [f'Cluster_{i}' for i in cluster_labels]

print(f"Clustered cells into {len(np.unique(cluster_labels))} clusters")
print("\nCluster sizes:")
cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    print(f"  Cluster {cluster_id}: {count} cells")
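
The `n_clusters` and `random_state` arguments suggest a k-means style algorithm, although the tutorial does not confirm which one `cluster_cells_by_factors` uses. The sketch below is a plain NumPy version of Lloyd's algorithm applied to toy two-dimensional factor scores, offered only to show what clustering in factor space involves.

```python
import numpy as np

def kmeans(points, n_clusters, n_iterations=50, seed=42):
    """Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen, distinct data points
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iterations):
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):  # keep the old center if a cluster empties out
                centers[k] = points[labels == k].mean(axis=0)
    return labels  # labels from the last assignment step

rng = np.random.default_rng(6)
toy_scores = np.vstack([
    rng.normal(-5.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])
toy_labels = kmeans(toy_scores, n_clusters=2)
print("cluster sizes:", np.bincount(toy_labels, minlength=2))
```

Like any k-means run, the result depends on the initial centers, which is why fixing the seed matters for reproducibility.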

## 6. Visualize Results

Let's create visualizations to better understand the factors and their biological meaning.

In [None]:
# Plot variance explained by factors
p1 = visualization.plot_variance_explained(mdata_muvi, max_factors=6, by_view=True)
print("Variance explained by factors (by view):")
print(p1)

In [None]:
# Plot factor scores colored by cell type
p2 = visualization.plot_factor_scores(
    mdata_muvi,
    factors=(0, 1),
    color_by='cell_type',
    size=2.0
)
print("Factor scores (Factor 0 vs Factor 1):")
print(p2)

In [None]:
# Plot factor loadings for the first view
view_name = list(mdata_muvi.mod.keys())[0]
p3 = visualization.plot_factor_loadings(
    mdata_muvi,
    view=view_name,
    factor=0,
    top_genes=5,
    loading_threshold=0.01
)
print(f"Top gene loadings for Factor 0 in {view_name}:")
print(p3)

In [None]:
# Plot cell clusters in factor space
p4 = visualization.plot_cell_clusters(
    mdata_muvi,
    cluster_labels,
    factors=(0, 1)
)
print("Cell clusters in factor space:")
print(p4)

In [None]:
# Compare factor activity across cell types
p5 = visualization.plot_factor_comparison(
    mdata_muvi,
    factors=[0, 1, 2],
    group_by='cell_type',
    plot_type='boxplot'
)
print("Factor activity by cell type:")
print(p5)

## 7. Summary and Interpretation

Let's summarize the factor activity and provide some interpretation.

In [None]:
# Summarize factor activity by cell type
factor_summary = analysis.summarize_factor_activity(
    mdata_muvi,
    group_by='cell_type'
)

print("Factor activity summary by cell type:")
print(factor_summary.head(15))

In [None]:
# Calculate factor correlations
factor_correlations = analysis.calculate_factor_correlations(mdata_muvi)

print("Factor correlation matrix:")
print(factor_correlations.round(3))

# Find highly correlated factor pairs
corr_threshold = 0.5
high_corr_pairs = []
for i in range(len(factor_correlations)):
    for j in range(i+1, len(factor_correlations)):
        corr_val = factor_correlations.iloc[i, j]
        if abs(corr_val) > corr_threshold:
            high_corr_pairs.append((i, j, corr_val))

if high_corr_pairs:
    print(f"\nHighly correlated factor pairs (|r| > {corr_threshold}):")
    for i, j, corr in high_corr_pairs:
        print(f"  Factor {i} ↔ Factor {j}: r = {corr:.3f}")
else:
    print(f"\nNo highly correlated factor pairs found (|r| > {corr_threshold})")
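
The nested loop above can also be written with NumPy index helpers. The sketch below assumes Pearson correlation between factor score columns (the method used by `calculate_factor_correlations` is not stated) and uses toy scores in which factor 3 is built to track factor 0.

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(size=(200, 4))
scores[:, 3] = scores[:, 0] + 0.1 * rng.normal(size=200)  # make factor 3 track factor 0

corr = np.corrcoef(scores, rowvar=False)  # factors are columns
rows, cols = np.triu_indices_from(corr, k=1)  # each unordered pair exactly once
mask = np.abs(corr[rows, cols]) > 0.5
pairs = [(int(i), int(j), float(corr[i, j])) for i, j in zip(rows[mask], cols[mask])]

for i, j, r in pairs:
    print(f"Factor {i} and Factor {j}: r = {r:.3f}")
```

Strongly correlated factors often indicate that the model was given more factors than the data support, so they are a useful signal when tuning `n_factors`.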

In [None]:
# Select top factors based on variance explained
top_factors = muvi_runner.select_top_factors(
    mdata_muvi,
    n_top_factors=3
)

print(f"Top 3 factors by variance explained: {top_factors}")

# Show variance for top factors
total_var_per_factor = np.sum(list(variance_explained.values()), axis=0)

print("\nTotal variance explained by each factor:")
for i, var in enumerate(total_var_per_factor):
    status = "(TOP)" if i in top_factors else ""
    print(f"  Factor {i}: {var:.3f} {status}")
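
A simple way to rank factors is to sum variance explained across views and take the largest totals. Whether `select_top_factors` ranks this way is an assumption; the numbers below are made up and are not output from this tutorial.

```python
import numpy as np

# Made-up per-view variance explained, one entry per factor
toy_variance = {
    "view1": np.array([0.10, 0.30, 0.05, 0.20]),
    "view2": np.array([0.15, 0.25, 0.05, 0.10]),
}

total = np.sum(list(toy_variance.values()), axis=0)
top = np.argsort(total)[::-1][:3]  # indices of the three largest totals, best first

print(total.round(2))
print(top)  # → [1 3 0]
```

Summing across views favors factors that are active in several views; ranking within a single view instead would highlight view-specific factors.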

## 8. Conclusions

This tutorial demonstrated the complete MuVIcell workflow:

1. **Data Generation**: Created synthetic multi-view data with 3 views (5, 10, and 15 features) and 200 cells
2. **Data Structure**: Added four latent factors with decreasing variance to simulate biological processes
3. **Preprocessing**: Normalized the data for MuVI analysis
4. **MuVI Analysis**: Ran multi-view integration to identify six latent factors
5. **Factor Analysis**: Characterized factors, tested their associations with metadata, and clustered cells
6. **Visualization**: Plotted variance explained, factor scores, factor loadings, cell clusters, and factor activity by cell type

### Key Features of MuVIcell:

- **Easy synthetic data generation** for testing and development
- **Comprehensive preprocessing** pipeline with sensible defaults
- **Direct MuVI integration** with proper wrapper functions
- **Rich analysis toolkit** for factor interpretation
- **Publication-ready visualizations** using plotnine
- **Extensive testing** to ensure reliability

### Next Steps:

1. **Load real data**: Replace synthetic data with your own multi-view datasets
2. **Customize preprocessing**: Adjust normalization and filtering parameters for your data type
3. **Optimize MuVI parameters**: Tune the number of factors and likelihood models
4. **Pathway analysis**: Use factor loadings for gene set enrichment analysis
5. **Biological interpretation**: Connect factors to known biological processes