# Clustrix Basic Usage Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/basic_usage.ipynb)

This notebook walks through the basics of Clustrix for distributed computing: configuring the library, decorating functions with `@cluster`, and timing a few example workloads.

## Installation

First, let's install Clustrix:

In [None]:
# Install Clustrix (uncomment if running in Colab)
# !pip install clustrix

## Basic Setup

Import Clustrix and configure it for local execution (since we don't have a cluster in this tutorial):

In [None]:
import clustrix
import numpy as np
import time

# Configure for local execution
clustrix.configure(
    cluster_host=None,  # Use local execution
    default_cores=4,
    auto_parallel=True
)

# Get current configuration
config = clustrix.get_config()

print("Current configuration:")
print(f"  Cluster type: {config.cluster_type}")
print(f"  Cluster host: {config.cluster_host}")
print(f"  Default cores: {config.default_cores}")
print(f"  Default memory: {config.default_memory}")
print(f"  Auto parallel: {config.auto_parallel}")
print(f"  Max parallel jobs: {config.max_parallel_jobs}")

## Simple Function Decoration

The simplest way to use Clustrix is with the `@cluster` decorator:

In [None]:
@clustrix.cluster(cores=2)
def simple_computation(x, y):
    """A simple function that adds two numbers."""
    result = x + y
    print(f"Computing {x} + {y} = {result}")
    return result

# Execute the function
result = simple_computation(10, 20)
print(f"Result: {result}")
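
If decorators that take arguments are new to you, the sketch below shows the general pattern behind a call like `@clustrix.cluster(cores=2)`. It is not Clustrix's implementation: `local_cluster` and `requested_resources` are hypothetical names, and the wrapper simply runs the function in the current process.

```python
import functools


def local_cluster(cores=1, memory=None):
    """Hypothetical stand-in for a resource-aware decorator (not Clustrix itself)."""
    def decorator(func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            # A real backend would hand the call to a scheduler here;
            # this sketch just calls the function directly.
            return func(*args, **kwargs)

        # Record the requested resources so they can be inspected later.
        wrapper.requested_resources = {"cores": cores, "memory": memory}
        return wrapper
    return decorator


@local_cluster(cores=2)
def add(x, y):
    """Add two numbers."""
    return x + y


print(add(10, 20))              # 30
print(add.requested_resources)  # {'cores': 2, 'memory': None}
```

The outer function receives the resource arguments, and the inner `decorator` receives the function being wrapped, which is why the decorator is written with parentheses.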

## CPU-Intensive Computation

Let's try a more computationally intensive task that benefits from parallelization:

In [None]:
@clustrix.cluster(cores=4, parallel=True)
def monte_carlo_pi(n_samples):
    """Estimate π using Monte Carlo method."""
    import random
    
    count_inside = 0
    
    # This loop could be parallelized automatically
    for i in range(n_samples):
        x = random.random()
        y = random.random()
        
        if x*x + y*y <= 1:
            count_inside += 1
    
    pi_estimate = 4.0 * count_inside / n_samples
    return pi_estimate

# Run with different sample sizes
for n in [1000, 10000, 100000]:
    start_time = time.time()
    pi_est = monte_carlo_pi(n)
    elapsed = time.time() - start_time
    
    print(f"n={n:6d}: π ≈ {pi_est:.6f} (error: {abs(pi_est - np.pi):.6f}, time: {elapsed:.3f}s)")
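
To see why this loop is a good candidate for parallelization, note that every sample is independent, so the work can be split into chunks whose counts are added at the end. The sketch below illustrates that idea with only NumPy and the standard library. It makes no claim about how Clustrix handles loops internally, and `count_inside` and `chunked_pi` are hypothetical helper names. A thread pool is used because it is simple to run inside a notebook, not because it is the fastest option.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def count_inside(args):
    """Count random points inside the unit quarter circle for one chunk."""
    n_samples, seed_seq = args
    rng = np.random.default_rng(seed_seq)
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    return int(np.count_nonzero(x * x + y * y <= 1.0))


def chunked_pi(n_samples, n_chunks=4, seed=0):
    """Estimate pi by summing the counts from independent chunks."""
    # Split the samples as evenly as possible across the chunks.
    base, extra = divmod(n_samples, n_chunks)
    sizes = [base + (1 if i < extra else 0) for i in range(n_chunks)]
    # Give every chunk its own independent random stream.
    seeds = np.random.SeedSequence(seed).spawn(n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        counts = list(pool.map(count_inside, zip(sizes, seeds)))
    return 4.0 * sum(counts) / n_samples


estimate = chunked_pi(100_000)
print(f"pi is approximately {estimate:.4f}")
```

Because each chunk gets its own seeded stream, the estimate is reproducible for a given `seed` and `n_chunks`, regardless of the order in which the chunks finish.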

## Array Processing

Clustrix also works with NumPy-based scientific code. The function below requests 4 cores and 2 GB of memory for a matrix multiplication:

In [None]:
@clustrix.cluster(cores=4, memory="2GB")
def matrix_computation(size):
    """Perform matrix operations."""
    import numpy as np
    
    # Create random matrices
    A = np.random.random((size, size))
    B = np.random.random((size, size))
    
    # Matrix multiplication
    C = np.dot(A, B)
    
    # Some statistics
    return {
        'shape': C.shape,
        'mean': np.mean(C),
        'std': np.std(C),
        'max': np.max(C),
        'min': np.min(C)
    }

# Test with different matrix sizes
sizes = [100, 200, 300]

for size in sizes:
    start_time = time.time()
    stats = matrix_computation(size)
    elapsed = time.time() - start_time
    
    print(f"Size {size}x{size}: mean={stats['mean']:.4f}, std={stats['std']:.4f}, time={elapsed:.3f}s")
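
A quick sanity check on the printed means: each entry of `C` is a sum of `size` products of two independent Uniform[0, 1) values, and each product has expected value 0.5 × 0.5 = 0.25, so the mean of `C` should be close to `size / 4`. The sketch below checks this with plain NumPy, without the decorator. `expected_product_mean` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np


def expected_product_mean(size):
    """Expected entry of A @ B when A and B hold independent Uniform[0, 1) values."""
    # Each entry sums `size` products, and each product has mean 0.5 * 0.5.
    return size * 0.25


rng = np.random.default_rng(0)
for size in [100, 200, 300]:
    A = rng.random((size, size))
    B = rng.random((size, size))
    observed = float(np.mean(A @ B))
    expected = expected_product_mean(size)
    print(f"size={size}: observed mean={observed:.3f}, expected={expected:.3f}")
```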

## Data Processing Pipeline

Let's create a more realistic data processing example:

In [None]:
@clustrix.cluster(cores=4, parallel=True)
def process_dataset(data, operations):
    """Process a dataset with multiple operations."""
    import numpy as np
    
    results = []
    
    # This loop could be parallelized
    for item in data:
        processed = item
        
        # Apply operations
        for op in operations:
            if op == 'square':
                processed = processed ** 2
            elif op == 'sqrt':
                processed = np.sqrt(abs(processed))
            elif op == 'log':
                processed = np.log(abs(processed) + 1)
            elif op == 'normalize':
                processed = processed / (1 + abs(processed))
        
        results.append(processed)
    
    return results

# Create test data
test_data = np.random.randn(1000) * 10
operations = ['square', 'sqrt', 'normalize']

# Process the data
start_time = time.time()
processed_data = process_dataset(test_data, operations)
elapsed = time.time() - start_time

print(f"Processed {len(test_data)} items in {elapsed:.3f} seconds")
print(f"Input range: [{np.min(test_data):.2f}, {np.max(test_data):.2f}]")
print(f"Output range: [{np.min(processed_data):.2f}, {np.max(processed_data):.2f}]")
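
For comparison, the same operations can be applied to the whole array at once with NumPy instead of looping over items. This is a sketch of an alternative and not part of Clustrix; `process_dataset_vectorized` is a hypothetical name. As in the loop version, unrecognized operation names are silently skipped.

```python
import numpy as np


def process_dataset_vectorized(data, operations):
    """Apply the named operations to every element of an array at once."""
    processed = np.asarray(data, dtype=float)
    for op in operations:
        if op == 'square':
            processed = processed ** 2
        elif op == 'sqrt':
            processed = np.sqrt(np.abs(processed))
        elif op == 'log':
            processed = np.log(np.abs(processed) + 1)
        elif op == 'normalize':
            processed = processed / (1 + np.abs(processed))
    return processed


sample = np.array([-3.0, 0.0, 4.0])
# square -> sqrt -> normalize maps -3 to 0.75, 0 to 0.0 and 4 to 0.8
print(process_dataset_vectorized(sample, ['square', 'sqrt', 'normalize']))
```

When the per-item work is this light, vectorizing is often worth trying before distributing the loop, since it removes the Python-level iteration entirely.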

## Performance Comparison

Let's compare parallel and sequential execution of the same workload:

In [None]:
def cpu_intensive_task(n):
    """A CPU-intensive task for benchmarking."""
    result = 0
    for i in range(n):
        result += i ** 0.5
    return result

# Sequential version
@clustrix.cluster(parallel=False)
def sequential_processing(data):
    results = []
    for item in data:
        results.append(cpu_intensive_task(item))
    return results

# Parallel version
@clustrix.cluster(cores=4, parallel=True)
def parallel_processing(data):
    results = []
    for item in data:
        results.append(cpu_intensive_task(item))
    return results

# Test data
test_sizes = [10000] * 8  # 8 tasks of 10k iterations each

# Time sequential execution
start_time = time.time()
seq_results = sequential_processing(test_sizes)
seq_time = time.time() - start_time

# Time parallel execution
start_time = time.time()
par_results = parallel_processing(test_sizes)
par_time = time.time() - start_time

print(f"Sequential execution: {seq_time:.3f} seconds")
print(f"Parallel execution: {par_time:.3f} seconds")
print(f"Speedup: {seq_time/par_time:.2f}x")
print(f"Results match: {seq_results == par_results}")
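
A single timing can be noisy, especially for runs this short. One common remedy is to repeat the call a few times and keep the best time, measured with `time.perf_counter()`. The helper below is a minimal sketch using only the standard library; `best_of` and `sum_of_roots` are hypothetical names, and `sum_of_roots` reproduces the `cpu_intensive_task` workload so the block runs on its own.

```python
import time


def best_of(func, *args, repeats=3, **kwargs):
    """Call func several times and return (last result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


def sum_of_roots(n):
    """Sum of the square roots of 0..n-1, the same workload as cpu_intensive_task."""
    return sum(i ** 0.5 for i in range(n))


value, seconds = best_of(sum_of_roots, 10_000, repeats=5)
print(f"sum_of_roots(10000) = {value:.2f}, best time {seconds:.6f}s")
```

The same helper could wrap `sequential_processing` and `parallel_processing`, for example `best_of(parallel_processing, test_sizes)`, to make the speedup figure less sensitive to one-off delays.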

## Configuration Options

Clustrix keeps its settings in a configuration object. `get_config()` returns the values currently in effect:

In [None]:
# Get current configuration
config = clustrix.get_config()

print("Current configuration:")
print(f"  Cluster type: {config.cluster_type}")
print(f"  Cluster host: {config.cluster_host}")
print(f"  Default cores: {config.default_cores}")
print(f"  Default memory: {config.default_memory}")
print(f"  Auto parallel: {config.auto_parallel}")
print(f"  Max parallel jobs: {config.max_parallel_jobs}")

## Next Steps

This tutorial covered the basics of Clustrix usage. For more advanced topics, check out:

- **Remote Cluster Configuration**: Setting up SLURM, PBS, or SSH clusters
- **Advanced Parallelization**: Custom loop detection and optimization
- **Machine Learning Workflows**: Using Clustrix with scikit-learn, TensorFlow, or PyTorch
- **Scientific Computing**: Integration with SciPy, pandas, and other scientific libraries

Visit the [Clustrix documentation](https://clustrix.readthedocs.io) for detailed guides and API reference.