# Google Cloud Platform (GCP) Tutorial

This tutorial demonstrates how to use Clustrix with Google Cloud Platform (GCP) infrastructure for scalable distributed computing.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/clustrix/blob/master/docs/notebooks/gcp_cloud_tutorial.ipynb)

## Overview

GCP provides several services that integrate well with Clustrix:

- **Compute Engine**: Virtual machines for compute clusters
- **Google Kubernetes Engine (GKE)**: Managed Kubernetes clusters
- **Batch**: Managed job scheduling service
- **Cloud Run**: Serverless container platform
- **Vertex AI**: Machine learning platform
- **Cloud Storage**: Object storage for data and results
- **VPC**: Network isolation and security
- **Preemptible VMs**: Cost-effective compute instances

## Prerequisites

1. Google Cloud account with billing enabled
2. Google Cloud SDK (gcloud) installed and configured
3. SSH key pair for VM access
4. Basic understanding of GCP services

## Installation and Setup

Install Clustrix with GCP dependencies:

In [None]:
# Install Clustrix with GCP support
!pip install clustrix google-cloud-compute google-cloud-storage google-auth google-auth-oauthlib

# Import required libraries
import clustrix
from clustrix import cluster, configure
from google.cloud import compute_v1
from google.cloud import storage
from google.auth import default
import os
import numpy as np
import time
import json
import pickle  # used by the Cloud Storage helper functions later in this tutorial

## GCP Authentication Setup

Configure your GCP credentials using one of the two options below:

### Option 1: gcloud CLI Authentication

In [None]:
# Login with gcloud CLI (run this in terminal)
# gcloud auth login
# gcloud auth application-default login

# Set your default project
# gcloud config set project YOUR_PROJECT_ID

# Verify authentication
!gcloud auth list
!gcloud config get-value project

### Option 2: Service Account Authentication

In [ ]:
# Set service account key path
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'

# Test GCP connection
try:
    credentials, project_id = default()
    print(f"Successfully authenticated with project: {project_id}")
    
    # Test compute API
    compute_client = compute_v1.InstancesClient()
    print("Compute Engine API access confirmed")
    
    # Test storage API
    storage_client = storage.Client()
    print("Cloud Storage API access confirmed")
    
except Exception as e:
    print(f"GCP authentication failed: {e}")

**Before continuing, confirm that authentication succeeds and that the Compute Engine and Cloud Storage APIs are enabled for your project.**
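
The APIs used in this tutorial can be enabled with a single `gcloud services enable` call. The sketch below only builds that command string so you can review it before running it. The helper name `build_enable_apis_command` and the `TUTORIAL_APIS` list are illustrative assumptions based on the services covered here, not part of Clustrix.

```python
import shlex

# Service names for the GCP products covered in this tutorial (assumed list;
# trim it to the methods you actually plan to use).
TUTORIAL_APIS = [
    "compute.googleapis.com",     # Compute Engine
    "storage.googleapis.com",     # Cloud Storage
    "container.googleapis.com",   # Google Kubernetes Engine
    "batch.googleapis.com",       # Batch
    "aiplatform.googleapis.com",  # Vertex AI
]

def build_enable_apis_command(project_id, apis=TUTORIAL_APIS):
    """Return a shell command that enables the given APIs for a project."""
    args = ["gcloud", "services", "enable", *apis, "--project", project_id]
    # Quote each argument so unusual project IDs cannot break the command
    return " ".join(shlex.quote(a) for a in args)

print(build_enable_apis_command("your-project-id"))
```

Run the printed command in a terminal, or prefix it with `!` in a notebook cell.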

## Method 1: Google Compute Engine Configuration

### Create Compute Engine Instance for Clustrix

In [ ]:
def create_clustrix_compute_instance(project_id, zone='us-central1-a', machine_type='e2-standard-4'):
    """
    Create a GCP Compute Engine instance configured for Clustrix.
    
    Args:
        project_id: GCP project ID
        zone: GCP zone for the instance
        machine_type: Machine type (CPU/memory configuration)
    
    Returns:
        Instance configuration and gcloud commands
    """
    
    # Startup script for instance initialization
    startup_script = '''#!/bin/bash

# Update system
apt-get update
apt-get install -y python3 python3-pip git htop curl

# Install clustrix and common packages
pip3 install clustrix numpy scipy pandas scikit-learn matplotlib

# Install uv for faster package management
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.cargo/env

# Create clustrix user
useradd -m -s /bin/bash clustrix
usermod -aG sudo clustrix
echo "clustrix ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

# Setup SSH for clustrix user
mkdir -p /home/clustrix/.ssh
# Copy SSH keys from default user
if [ -d "/home/$(logname)/.ssh" ]; then
    cp -r /home/$(logname)/.ssh/* /home/clustrix/.ssh/
    chown -R clustrix:clustrix /home/clustrix/.ssh
    chmod 700 /home/clustrix/.ssh
    chmod 600 /home/clustrix/.ssh/authorized_keys 2>/dev/null || true
fi

# Create working directory
mkdir -p /tmp/clustrix
chown clustrix:clustrix /tmp/clustrix

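# Install Google Cloud SDK non-interactively. Do not restart the shell here:
# "exec -l $SHELL" would replace this script and skip the steps below.
curl -sSL https://sdk.cloud.google.com | bash -s -- --disable-prompts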

# Log completion
echo "Clustrix setup completed at $(date)" >> /var/log/clustrix-setup.log
'''
    
    # gcloud commands for instance creation
    gcloud_commands = f"""
# Create firewall rule for SSH (if not exists)
gcloud compute firewall-rules create allow-ssh \
  --allow tcp:22 \
  --source-ranges 0.0.0.0/0 \
  --description "Allow SSH access" \
  --project {project_id} || echo "SSH rule already exists"

# Create the instance
gcloud compute instances create clustrix-instance \
  --project={project_id} \
  --zone={zone} \
  --machine-type={machine_type} \
  --network-interface=network-tier=PREMIUM,subnet=default \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account=default \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags=clustrix,http-server,https-server \
  --create-disk=auto-delete=yes,boot=yes,device-name=clustrix-instance,image=projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts,mode=rw,size=50,type=projects/{project_id}/zones/{zone}/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels=purpose=clustrix,environment=tutorial \
  --reservation-affinity=any \
  --metadata-from-file startup-script=startup-script.sh

# Get the external IP
gcloud compute instances describe clustrix-instance \
  --project={project_id} \
  --zone={zone} \
  --format='get(networkInterfaces[0].accessConfigs[0].natIP)'

# SSH to the instance (after startup script completes)
gcloud compute ssh clustrix-instance \
  --project={project_id} \
  --zone={zone}
"""
    
    return {
        'project_id': project_id,
        'zone': zone,
        'machine_type': machine_type,
        'instance_name': 'clustrix-instance',
        'gcloud_commands': gcloud_commands,
        'startup_script': startup_script
    }

# Example usage
instance_config = create_clustrix_compute_instance(
    project_id='your-project-id',  # Replace with your project ID
    zone='us-central1-a',
    machine_type='e2-standard-4'  # 4 vCPUs, 16 GB RAM
)

print("GCP Compute Engine Instance Creation:")
print(instance_config['gcloud_commands'])
print("\nStartup Script:")
print(instance_config['startup_script'])

**Save the startup script and execute the gcloud commands to create your instance.**
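
One way to save the script is sketched below. The helper `save_startup_script` is a hypothetical name; the default filename matches the `--metadata-from-file startup-script=startup-script.sh` flag in the printed commands. Leading blank lines are stripped so that `#!/bin/bash` ends up on the first line of the file.

```python
from pathlib import Path

def save_startup_script(script_text, path="startup-script.sh"):
    """Write the startup script to disk with the shebang on the first line."""
    # Drop leading whitespace so '#!/bin/bash' is the very first line
    cleaned = script_text.lstrip()
    target = Path(path)
    target.write_text(cleaned, encoding="utf-8")
    target.chmod(0o755)  # optional; convenient when testing the script locally
    return target

# Usage with the dictionary returned above (uncomment in the notebook):
# save_startup_script(instance_config['startup_script'])
```

Run the gcloud commands from the directory that contains the saved file.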

### Configure Clustrix for Compute Engine

In [ ]:
# Configure Clustrix to use your Compute Engine instance
configure(
    cluster_type="ssh",
    cluster_host="your-instance-external-ip",  # Replace with actual IP
    username="clustrix",  # or your default user
    key_file="~/.ssh/google_compute_engine",  # gcloud-generated key
    remote_work_dir="/tmp/clustrix",
    package_manager="auto",  # Will use uv if available
    default_cores=4,
    default_memory="8GB",
    default_time="01:00:00"
)

**Replace `your-instance-external-ip` with the actual external IP from your Compute Engine instance.**
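
The lookup can also be scripted. The sketch below builds the same `gcloud compute instances describe` call shown earlier and validates its output before it is passed to `configure`. The function names are illustrative, and the sample address comes from a documentation-only IP range.

```python
import ipaddress

def describe_ip_command(project_id, zone, instance_name="clustrix-instance"):
    """Build the argv list for looking up an instance's external IP."""
    return [
        "gcloud", "compute", "instances", "describe", instance_name,
        f"--project={project_id}",
        f"--zone={zone}",
        "--format=get(networkInterfaces[0].accessConfigs[0].natIP)",
    ]

def parse_external_ip(gcloud_output):
    """Validate the command output and return it as a clean IP string."""
    candidate = gcloud_output.strip()
    # ipaddress.ip_address raises ValueError for empty or malformed input
    return str(ipaddress.ip_address(candidate))

# In the notebook (requires gcloud and an existing instance):
# import subprocess
# out = subprocess.run(describe_ip_command("your-project-id", "us-central1-a"),
#                      capture_output=True, text=True, check=True).stdout
# external_ip = parse_external_ip(out)
print(parse_external_ip("203.0.113.10\n"))
```

An empty result (for example, an instance without an external IP) raises `ValueError` instead of silently configuring an invalid host.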

### Example: Remote Computation on Compute Engine

In [None]:
@cluster(cores=2, memory="4GB")
def gcp_data_analysis(dataset_size=10000, analysis_type='regression'):
    """Perform data analysis on GCP Compute Engine."""
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.metrics import mean_squared_error, accuracy_score
    from sklearn.datasets import make_regression, make_classification
    import time
    
    start_time = time.time()
    
    # Generate synthetic dataset
    if analysis_type == 'regression':
        X, y = make_regression(
            n_samples=dataset_size,
            n_features=20,
            noise=0.1,
            random_state=42
        )
        model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
        metric_name = 'rmse'
    else:
        X, y = make_classification(
            n_samples=dataset_size,
            n_features=20,
            n_informative=10,  # the default of 2 is too few for 3 classes and raises ValueError
            n_classes=3,
            random_state=42
        )
        model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        metric_name = 'accuracy'
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Train model
    training_start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - training_start
    
    # Evaluate
    y_pred = model.predict(X_test)
    
    if analysis_type == 'regression':
        metric_value = np.sqrt(mean_squared_error(y_test, y_pred))
    else:
        metric_value = accuracy_score(y_test, y_pred)
    
    total_time = time.time() - start_time
    
    return {
        'analysis_type': analysis_type,
        'dataset_size': dataset_size,
        'training_time': training_time,
        'total_time': total_time,
        metric_name: metric_value,
        'feature_importance': np.sort(model.feature_importances_)[::-1][:5].tolist(),  # 5 largest importances
        'training_samples': len(X_train),
        'test_samples': len(X_test)
    }

# Run computation on GCP Compute Engine
# result = gcp_data_analysis(dataset_size=50000, analysis_type='classification')
# print(f"Analysis completed: {result['accuracy']:.4f} accuracy in {result['total_time']:.2f} seconds")
print("Example function defined. Uncomment the lines above to run on your GCP instance.")
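
Note that the returned dictionary stores its metric under `'rmse'` for regression and `'accuracy'` for classification. A small helper (hypothetical, not part of Clustrix) can format either case. The example dictionary below is hand-written with made-up values, purely to show the shape.

```python
def summarize_analysis(result):
    """Format a one-line summary for either analysis type."""
    # The metric key depends on analysis_type: 'rmse' or 'accuracy'
    metric_name = 'rmse' if result['analysis_type'] == 'regression' else 'accuracy'
    return (
        f"{result['analysis_type']}: {metric_name}={result[metric_name]:.4f} "
        f"on {result['test_samples']} test samples "
        f"in {result['total_time']:.2f} s"
    )

# Illustrative values only; a real result comes from gcp_data_analysis(...)
example = {'analysis_type': 'regression', 'rmse': 12.5,
           'test_samples': 2000, 'total_time': 3.2}
print(summarize_analysis(example))
```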

## Method 2: Google Kubernetes Engine (GKE) Configuration

GKE provides managed Kubernetes clusters ideal for containerized Clustrix workloads:

In [None]:
def setup_gke_cluster_for_clustrix(project_id, cluster_name='clustrix-cluster', zone='us-central1-a'):
    """
    Setup GKE cluster optimized for Clustrix workloads.
    """
    
    gke_commands = f"""
# Enable required APIs
gcloud services enable container.googleapis.com \
  --project {project_id}

# Create GKE cluster with auto-scaling
gcloud container clusters create {cluster_name} \
  --project {project_id} \
  --zone {zone} \
  --machine-type e2-standard-4 \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade \
  --disk-size 50GB \
  --disk-type pd-ssd \
  --enable-network-policy \
  --enable-ip-alias \
  --labels purpose=clustrix,environment=tutorial

# Get cluster credentials
gcloud container clusters get-credentials {cluster_name} \
  --project {project_id} \
  --zone {zone}

# Verify cluster access
kubectl get nodes

# Create clustrix namespace
kubectl create namespace clustrix

# Set as default namespace
kubectl config set-context --current --namespace=clustrix
"""
    
    # Clustrix job template for Kubernetes
    k8s_job_template = """
apiVersion: batch/v1
kind: Job
metadata:
  name: clustrix-job-${JOB_ID}
  namespace: clustrix
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: clustrix-worker
        image: python:3.11-slim
        command: ["bash", "-c"]
        args:
        - |
          pip install clustrix numpy scipy pandas scikit-learn
          python -c "
          import pickle
          import sys
          
          # Load and execute function
          with open('/data/function_data.pkl', 'rb') as f:
              data = pickle.load(f)
          
          func = pickle.loads(data['function'])
          args = pickle.loads(data['args'])
          kwargs = pickle.loads(data['kwargs'])
          
          try:
              result = func(*args, **kwargs)
              with open('/data/result.pkl', 'wb') as f:
                  pickle.dump(result, f)
          except Exception as e:
              with open('/data/error.pkl', 'wb') as f:
                  pickle.dump({'error': str(e)}, f)
              raise
          "
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        volumeMounts:
        - name: job-data
          mountPath: /data
      volumes:
      - name: job-data
        persistentVolumeClaim:
          claimName: clustrix-pvc
  backoffLimit: 3
"""
    
    print("GKE Cluster Setup Commands:")
    print(gke_commands)
    print("\nKubernetes Job Template:")
    print(k8s_job_template)
    
    return {
        'cluster_name': cluster_name,
        'project_id': project_id,
        'zone': zone,
        'setup_commands': gke_commands,
        'job_template': k8s_job_template
    }

def configure_clustrix_for_gke(cluster_endpoint, cluster_name):
    """Configure Clustrix to use GKE cluster."""
    configure(
        cluster_type="kubernetes",
        cluster_host=cluster_endpoint,
        # For GKE, authentication is handled via kubectl config
        remote_work_dir="/tmp/clustrix",
        package_manager="pip",  # Container-based, pip is fine
        default_cores=2,
        default_memory="4GB",
        default_time="01:00:00"
    )
    print(f"Configured Clustrix for GKE cluster: {cluster_name}")

gke_config = setup_gke_cluster_for_clustrix(
    project_id='your-project-id',
    cluster_name='clustrix-cluster'
)

print("\nNote: GKE integration requires additional implementation in Clustrix.")
print("Current Clustrix supports basic Kubernetes, but GKE-specific features need custom setup.")
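
The job template contains a `${JOB_ID}` placeholder that must be filled in before the manifest is submitted. A minimal sketch using the standard library's `string.Template` follows; the function name and the short inline snippet are illustrative, and in practice you would pass `gke_config['job_template']`.

```python
import re
import uuid
from string import Template

def render_job_manifest(template_text, job_id=None):
    """Substitute ${JOB_ID} in a manifest template; return (job_id, yaml_text)."""
    if job_id is None:
        job_id = uuid.uuid4().hex[:8]  # short random lowercase hex suffix
    # Kubernetes object names allow lowercase alphanumerics and '-'
    if not re.fullmatch(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?", job_id):
        raise ValueError(f"invalid job id for a Kubernetes name: {job_id!r}")
    # safe_substitute leaves any other '$' sequences in the manifest untouched
    return job_id, Template(template_text).safe_substitute(JOB_ID=job_id)

snippet = "metadata:\n  name: clustrix-job-${JOB_ID}\n"
job_id, manifest = render_job_manifest(snippet, job_id="demo-1")
print(manifest)
```

The rendered text can then be written to a file or piped to `kubectl apply -f -`.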

## Method 3: Google Cloud Batch

Google Cloud Batch provides managed job scheduling for large-scale workloads:

In [None]:
def setup_gcp_batch_environment(project_id, region='us-central1'):
    """
    Setup Google Cloud Batch for Clustrix workloads.
    """
    
    batch_setup_commands = f"""
# Enable Batch API
gcloud services enable batch.googleapis.com \
  --project {project_id}

# Create a service account for Batch jobs
gcloud iam service-accounts create clustrix-batch-sa \
  --project {project_id} \
  --description="Service account for Clustrix Batch jobs" \
  --display-name="Clustrix Batch Service Account"

# Grant necessary permissions
gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com" \
  --role="roles/batch.jobsEditor"

gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Create Cloud Storage bucket for job data
gsutil mb -p {project_id} -l {region} gs://{project_id}-clustrix-batch
"""
    
    # Batch job configuration template
    batch_job_config = {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [
                        {
                            "script": {
                                # f-string: {project_id} is interpolated and
                                # doubled braces become literal braces
                                "text": f"""
#!/bin/bash
set -e

# Install required packages
pip3 install clustrix numpy scipy pandas scikit-learn

# Download job data from Cloud Storage
gsutil cp gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/function_data.pkl .

# Execute the function
python3 -c "
import pickle
import traceback

try:
    with open('function_data.pkl', 'rb') as f:
        data = pickle.load(f)
    
    func = pickle.loads(data['function'])
    args = pickle.loads(data['args'])
    kwargs = pickle.loads(data['kwargs'])
    
    result = func(*args, **kwargs)
    
    with open('result.pkl', 'wb') as f:
        pickle.dump(result, f)
        
except Exception as e:
    with open('error.pkl', 'wb') as f:
        pickle.dump({{
            'error': str(e),
            'traceback': traceback.format_exc()
        }}, f)
    raise
"

# Upload results to Cloud Storage
gsutil cp result.pkl gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/result.pkl || \
gsutil cp error.pkl gs://{project_id}-clustrix-batch/jobs/${{BATCH_JOB_ID}}/error.pkl
"""
                            }
                        }
                    ],
                    "computeResource": {
                        "cpuMilli": 2000,  # 2 CPUs
                        "memoryMib": 4096  # 4 GB RAM
                    },
                    "maxRetryCount": 2,
                    "maxRunDuration": "3600s"  # 1 hour
                },
                "taskCount": 1
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    # "policy" takes an inline instance policy;
                    # "instanceTemplate" expects the name of an existing template
                    "policy": {
                        "machineType": "e2-standard-2",
                        "provisioningModel": "STANDARD"
                    }
                }
            ]
        },
        "labels": {
            "purpose": "clustrix",
            "environment": "tutorial"
        },
        "logsPolicy": {
            "destination": "CLOUD_LOGGING"
        }
    }
    
    print("Google Cloud Batch Setup Commands:")
    print(batch_setup_commands)
    print("\nBatch Job Configuration:")
    print(json.dumps(batch_job_config, indent=2))
    
    return {
        'project_id': project_id,
        'region': region,
        'bucket_name': f'{project_id}-clustrix-batch',
        'service_account': f'clustrix-batch-sa@{project_id}.iam.gserviceaccount.com',
        'job_config': batch_job_config,
        'setup_commands': batch_setup_commands
    }

batch_config = setup_gcp_batch_environment('your-project-id')
print("\nGoogle Cloud Batch provides excellent integration for large-scale Clustrix workloads.")
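
The job script above expects `function_data.pkl` to hold a dictionary whose `'function'`, `'args'`, and `'kwargs'` entries are each pickled separately. The sketch below produces and consumes that layout with the standard `pickle` module. It is an assumption-laden illustration: the source does not confirm that Clustrix itself uses this exact format, and plain `pickle` stores functions by reference, so functions defined in a notebook would need a by-value serializer such as `cloudpickle` or `dill`.

```python
import math
import pickle

def build_job_payload(func, args=(), kwargs=None):
    """Serialize a call into the {'function', 'args', 'kwargs'} layout."""
    return pickle.dumps({
        'function': pickle.dumps(func),
        'args': pickle.dumps(tuple(args)),
        'kwargs': pickle.dumps(kwargs or {}),
    })

def run_job_payload(payload):
    """Mirror of the loader in the job script: unpack and call the function."""
    data = pickle.loads(payload)
    func = pickle.loads(data['function'])
    args = pickle.loads(data['args'])
    kwargs = pickle.loads(data['kwargs'])
    return func(*args, **kwargs)

# math.hypot is importable on the worker, so pickling by reference works
payload = build_job_payload(math.hypot, args=(3, 4))
print(run_job_payload(payload))  # 5.0
```

Only unpickle payloads you created yourself: loading a pickle executes arbitrary code.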

## Data Management with Google Cloud Storage

In [None]:
@cluster(cores=2, memory="4GB")
def process_gcs_data(bucket_name, input_blob, output_blob, project_id=None):
    """Process data from Google Cloud Storage and save results back."""
    from google.cloud import storage
    import numpy as np
    import pickle
    import io
    import time
    
    # Initialize Cloud Storage client
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    
    # Download data from Cloud Storage
    input_blob_obj = bucket.blob(input_blob)
    data_bytes = input_blob_obj.download_as_bytes()
    data = pickle.loads(data_bytes)
    
    # Process the data
    processed_data = {
        'original_shape': data.shape if hasattr(data, 'shape') else len(data) if hasattr(data, '__len__') else 'scalar',
        'mean': float(np.mean(data)) if hasattr(data, '__iter__') else float(data),
        'std': float(np.std(data)) if hasattr(data, '__iter__') else 0.0,
        'max': float(np.max(data)) if hasattr(data, '__iter__') else float(data),
        'min': float(np.min(data)) if hasattr(data, '__iter__') else float(data),
        'processing_timestamp': time.time(),
        'processed_on': 'gcp-compute-engine',
        'data_type': str(type(data).__name__)
    }
    
    # Matrix statistics apply to 2-D arrays only (the Frobenius norm fails on higher dimensions)
    if hasattr(data, 'shape') and len(data.shape) == 2:
        # Matrix operations
        processed_data.update({
            'matrix_rank': int(np.linalg.matrix_rank(data)) if data.shape[0] == data.shape[1] else 'non_square',
            'frobenius_norm': float(np.linalg.norm(data, 'fro')),
            'condition_number': float(np.linalg.cond(data)) if data.shape[0] == data.shape[1] else None
        })
    
    # Upload results to Cloud Storage
    output_bytes = pickle.dumps(processed_data)
    output_blob_obj = bucket.blob(output_blob)
    output_blob_obj.upload_from_string(output_bytes)
    
    return f"Processed data saved to gs://{bucket_name}/{output_blob}"

# Utility functions for Google Cloud Storage
def upload_to_gcs(data, bucket_name, blob_name, project_id=None):
    """Upload data to Google Cloud Storage."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    data_bytes = pickle.dumps(data)
    blob.upload_from_string(data_bytes)
    print(f"Data uploaded to gs://{bucket_name}/{blob_name}")

def download_from_gcs(bucket_name, blob_name, project_id=None):
    """Download data from Google Cloud Storage."""
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    data_bytes = blob.download_as_bytes()
    return pickle.loads(data_bytes)

def create_gcs_bucket_for_clustrix(project_id, bucket_name, location='us-central1'):
    """Create a Cloud Storage bucket for Clustrix data."""
    gcs_commands = f"""
# Create bucket with appropriate settings
gsutil mb -p {project_id} -l {location} gs://{bucket_name}

# Set lifecycle policy to delete temporary files after 7 days
echo '{{
  "lifecycle": {{
    "rule": [
      {{
        "action": {{"type": "Delete"}},
        "condition": {{
          "age": 7,
          "matchesPrefix": ["temp/"]
        }}
      }}
    ]
  }}
}}' > lifecycle.json

gsutil lifecycle set lifecycle.json gs://{bucket_name}

# Set up proper permissions (if using service account)
gsutil iam ch serviceAccount:clustrix-batch-sa@{project_id}.iam.gserviceaccount.com:objectAdmin gs://{bucket_name}
"""
    
    print(f"Commands to create bucket gs://{bucket_name}:")
    print(gcs_commands)
    return gcs_commands

# Example usage:
# Create bucket first
bucket_commands = create_gcs_bucket_for_clustrix('your-project-id', 'your-clustrix-data')

# Then use the functions
# sample_data = np.random.rand(1000, 100)
# upload_to_gcs(sample_data, 'your-clustrix-data', 'input/sample_data.pkl', 'your-project-id')
# result = process_gcs_data('your-clustrix-data', 'input/sample_data.pkl', 'output/results.pkl', 'your-project-id')
print("\nGoogle Cloud Storage integration functions defined.")
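
The lifecycle policy written with `echo` above can also be generated in Python, which avoids shell quoting problems. This sketch builds the same JSON structure; the function names are illustrative.

```python
import json

def build_lifecycle_config(age_days=7, prefixes=("temp/",)):
    """Return a policy dict: delete objects under the prefixes after age_days."""
    return {
        "lifecycle": {
            "rule": [
                {
                    "action": {"type": "Delete"},
                    "condition": {"age": age_days, "matchesPrefix": list(prefixes)},
                }
            ]
        }
    }

def write_lifecycle_file(path="lifecycle.json", **kwargs):
    """Write the policy to disk for use with `gsutil lifecycle set`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_lifecycle_config(**kwargs), f, indent=2)
    return path

print(json.dumps(build_lifecycle_config(), indent=2))
```

After calling `write_lifecycle_file()`, apply the policy with `gsutil lifecycle set lifecycle.json gs://your-clustrix-data`.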

## Vertex AI Integration

In [None]:
def setup_vertex_ai_for_clustrix(project_id, region='us-central1'):
    """
    Setup Vertex AI for ML workloads with Clustrix.
    """
    
    vertex_commands = f"""
# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com \
  --project {project_id}

# Create Vertex AI custom training job
gcloud ai custom-jobs create \
  --region={region} \
  --display-name=clustrix-training-job \
  --config=training_job_config.yaml

# Create Vertex AI endpoints for model serving
gcloud ai endpoints create \
  --region={region} \
  --display-name=clustrix-model-endpoint
"""
    
    # Vertex AI training job configuration
    training_config = f"""
# training_job_config.yaml
workerPoolSpecs:
- machineSpec:
    machineType: e2-standard-4
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/cloud-aiplatform/training/tf-cpu.2-8:latest
    command:
    - python3
    - -c
    args:
    - |
      import subprocess
      import sys
      
      # Install clustrix
      subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'clustrix', 'numpy', 'pandas', 'scikit-learn'])
      
      # Your training code here
      print("Clustrix training job completed on Vertex AI")
    env:
    - name: GOOGLE_CLOUD_PROJECT
      value: {project_id}
    - name: AIP_MODEL_DIR
      value: gs://{project_id}-vertex-models
"""
    
    print("Vertex AI Setup Commands:")
    print(vertex_commands)
    print("\nTraining Job Configuration:")
    print(training_config)
    
    return {
        'project_id': project_id,
        'region': region,
        'setup_commands': vertex_commands,
        'training_config': training_config
    }

@cluster(cores=4, memory="8GB")
def vertex_ai_ml_pipeline(dataset_config, model_config, project_id, bucket_name):
    """ML pipeline that could run on Vertex AI with Clustrix."""
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score, GridSearchCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from google.cloud import storage
    import pickle
    import time
    
    start_time = time.time()
    
    # Generate or load dataset
    X, y = make_classification(
        n_samples=dataset_config['n_samples'],
        n_features=dataset_config['n_features'],
        n_classes=dataset_config['n_classes'],
        n_informative=dataset_config.get('n_informative', dataset_config['n_features'] // 2),
        random_state=42
    )
    
    # Hyperparameter tuning
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    }
    
    # Grid search with cross-validation
    model = GradientBoostingClassifier(random_state=42)
    grid_search = GridSearchCV(
        model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
    )
    
    grid_search.fit(X, y)
    
    # Get best model
    best_model = grid_search.best_estimator_
    
    # Evaluate with cross-validation
    cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='accuracy')
    
    # Save model to Cloud Storage
    storage_client = storage.Client(project=project_id)
    bucket = storage_client.bucket(bucket_name)
    
    model_blob = bucket.blob('models/clustrix_model.pkl')
    model_bytes = pickle.dumps(best_model)
    model_blob.upload_from_string(model_bytes)
    
    total_time = time.time() - start_time
    
    return {
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_,
        'cv_mean_score': cv_scores.mean(),
        'cv_std_score': cv_scores.std(),
        'training_time': total_time,
        'model_location': f'gs://{bucket_name}/models/clustrix_model.pkl',
        'feature_importance': np.sort(best_model.feature_importances_)[::-1][:10].tolist(),  # 10 largest importances
        'dataset_size': len(X)
    }

vertex_config = setup_vertex_ai_for_clustrix('your-project-id')

# Example usage:
# dataset_params = {'n_samples': 10000, 'n_features': 20, 'n_classes': 3}
# model_params = {}
# result = vertex_ai_ml_pipeline(dataset_params, model_params, 'your-project-id', 'your-bucket')
# print(f"Best model score: {result['best_score']:.4f}")

print("Vertex AI integration examples defined.")
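
Before launching the pipeline, it helps to estimate how much work the grid search involves, since that drives the cores and time you request. The arithmetic below uses the same parameter grid and fold count as `vertex_ai_ml_pipeline`, and assumes `GridSearchCV`'s default `refit=True`, which adds one final fit on the full dataset.

```python
from itertools import product

# Same grid and fold count as in vertex_ai_ml_pipeline
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)       # 3 * 3 * 3 = 27
n_fits = n_candidates * cv_folds + 1   # 27 * 5 = 135, plus 1 refit

print(f"{n_candidates} candidates, {n_fits} model fits")
```

The separate `cross_val_score` call in the pipeline adds five more fits on top of this.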

## Cost Optimization with Preemptible VMs

In [None]:
def setup_preemptible_cluster(project_id, zone='us-central1-a'):
    """
    Setup cost-effective cluster using preemptible VMs.
    """
    
    preemptible_commands = f"""
# Create preemptible instance template
gcloud compute instance-templates create clustrix-preemptible-template \
  --project {project_id} \
  --machine-type e2-standard-4 \
  --preemptible \
  --boot-disk-size 50GB \
  --boot-disk-type pd-standard \
  --image-family ubuntu-2204-lts \
  --image-project ubuntu-os-cloud \
  --metadata-from-file startup-script=clustrix-startup.sh \
  --scopes cloud-platform \
  --tags clustrix,preemptible

# Create managed instance group
gcloud compute instance-groups managed create clustrix-preemptible-group \
  --project {project_id} \
  --zone {zone} \
  --template clustrix-preemptible-template \
  --size 0

# Set up auto-scaling
gcloud compute instance-groups managed set-autoscaling clustrix-preemptible-group \
  --project {project_id} \
  --zone {zone} \
  --min-num-replicas 0 \
  --max-num-replicas 10 \
  --target-cpu-utilization 0.6

# Scale up the group
gcloud compute instance-groups managed resize clustrix-preemptible-group \
  --project {project_id} \
  --zone {zone} \
  --size 2
"""
    
    cost_optimization_tips = """
GCP Cost Optimization for Clustrix:

1. Compute Optimization:
   - Use Preemptible VMs for fault-tolerant workloads (up to 80% savings)
   - Prefer Spot VMs (successor to Preemptible): same pricing model, but no 24-hour runtime limit
   - Choose appropriate machine types (E2, N2, C2 based on workload)
   - Sustained use discounts apply automatically to eligible machine types (E2 is not eligible)
   - Consider committed use discounts for predictable usage

2. Storage Optimization:
   - Use appropriate storage classes (Standard, Nearline, Coldline, Archive)
   - Enable object lifecycle management
   - Use regional storage for better performance/cost balance
   - Implement data compression and deduplication

3. Network Optimization:
   - Minimize inter-region data transfer
   - Use Cloud CDN for static content
   - Optimize data transfer patterns

4. Monitoring and Management:
   - Set up budget alerts and quotas
   - Use Cloud Billing reports
   - Implement proper resource labeling
   - Regular cost reviews and right-sizing

5. Service-Specific:
   - Use Cloud Functions for event-driven tasks
   - Consider Cloud Run for containerized applications
   - Use Google Cloud Batch for large batch processing
"""
    
    print("Preemptible VM Setup Commands:")
    print(preemptible_commands)
    print("\nCost Optimization Tips:")
    print(cost_optimization_tips)
    
    return {
        'project_id': project_id,
        'zone': zone,
        'template_name': 'clustrix-preemptible-template',
        'group_name': 'clustrix-preemptible-group',
        'setup_commands': preemptible_commands
    }

def configure_for_preemptible():
    """Configure Clustrix for preemptible VM usage."""
    configure(
        cluster_type="ssh",
        cluster_host="preemptible-instance-ip",
        username="clustrix",
        key_file="~/.ssh/google_compute_engine",
        remote_work_dir="/tmp/clustrix",
        package_manager="uv",
        # Preemptible VMs can be terminated, use shorter timeouts
        default_time="00:30:00",
        job_poll_interval=30,  # Check more frequently
        cleanup_on_success=True,  # Clean up quickly
        max_parallel_jobs=20  # Many small jobs limit the work lost to a single preemption
    )
    print("Configured Clustrix for preemptible VMs with fault-tolerant settings.")

preemptible_config = setup_preemptible_cluster('your-project-id')
print("\nPreemptible VMs can provide up to 80% cost savings for fault-tolerant workloads.")
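
Because a preemptible VM can disappear mid-job, callers usually wrap remote calls in a retry loop. The following is a generic sketch with exponential backoff. `run_with_retries` is a hypothetical helper, not a Clustrix API, and the exception types a preempted job actually raises are an assumption you should adjust for your setup.

```python
import time

def run_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                     retry_on=(ConnectionError, TimeoutError), **kwargs):
    """Call func, retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Example (uncomment once a cluster is configured):
# result = run_with_retries(gcp_data_analysis, dataset_size=50000, max_attempts=5)
```

Retrying is only safe when the job is idempotent, meaning a rerun from scratch produces the same result.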

## Security Best Practices

In [None]:
def setup_gcp_security_for_clustrix(project_id):
    """
    Security configuration for GCP + Clustrix deployment.
    """
    
    security_commands = f"""
# Create VPC with private subnets
gcloud compute networks create clustrix-vpc \
  --project {project_id} \
  --subnet-mode custom

gcloud compute networks subnets create clustrix-subnet \
  --project {project_id} \
  --network clustrix-vpc \
  --range 10.1.0.0/24 \
  --region us-central1 \
  --enable-private-ip-google-access

# Create firewall rules (restrictive)
gcloud compute firewall-rules create clustrix-allow-ssh \
  --project {project_id} \
  --network clustrix-vpc \
  --allow tcp:22 \
  --source-ranges YOUR_IP/32 \
  --target-tags clustrix

gcloud compute firewall-rules create clustrix-internal \
  --project {project_id} \
  --network clustrix-vpc \
  --allow tcp,udp,icmp \
  --source-ranges 10.1.0.0/24 \
  --target-tags clustrix

# Create service account with minimal permissions
gcloud iam service-accounts create clustrix-compute \
  --project {project_id} \
  --description="Service account for Clustrix compute instances" \
  --display-name="Clustrix Compute Service Account"

# Grant only necessary permissions
gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-compute@{project_id}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding {project_id} \
  --member="serviceAccount:clustrix-compute@{project_id}.iam.gserviceaccount.com" \
  --role="roles/logging.logWriter"

# Enable OS Login for better SSH key management
gcloud compute project-info add-metadata \
  --project {project_id} \
  --metadata enable-oslogin=TRUE

# Create Cloud KMS key for encryption
gcloud kms keyrings create clustrix-keyring \
  --project {project_id} \
  --location global

gcloud kms keys create clustrix-key \
  --project {project_id} \
  --keyring clustrix-keyring \
  --location global \
  --purpose encryption
"""
    
    security_checklist = """
GCP Security Checklist for Clustrix:

✓ Use IAM service accounts with minimal permissions
✓ Enable OS Login for centralized SSH key management
✓ Create custom VPC with private subnets
✓ Restrict firewall rules to specific IP ranges
✓ Enable private Google access for instances without external IPs
✓ Use Cloud KMS for encryption at rest
✓ Enable audit logging and Security Command Center
✓ Use Binary Authorization for container security
✓ Implement VPC Service Controls for data perimeter
✓ Enable DDoS protection and Cloud Armor
✓ Use Secret Manager for sensitive configuration
✓ Enable vulnerability scanning for container images
✓ Set up budget alerts and billing account security
✓ Use organization policies for governance
✓ Regular security reviews and access audits
"""
    
    print("GCP Security Setup Commands:")
    print(security_commands)
    print("\nSecurity Checklist:")
    print(security_checklist)
    
    return {
        'project_id': project_id,
        'vpc_name': 'clustrix-vpc',
        'subnet_name': 'clustrix-subnet',
        'service_account': f'clustrix-compute@{project_id}.iam.gserviceaccount.com',
        'security_commands': security_commands
    }

security_config = setup_gcp_security_for_clustrix('your-project-id')
print("Security configuration templates generated.")
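
The checklist item about restricting firewall rules to specific IP ranges can be checked in Python before any command is run. The following is a minimal sketch using only the standard-library `ipaddress` module. The helper names `is_restrictive_source_range` and `in_cluster_subnet` and the `max_hosts` threshold are hypothetical, chosen for illustration; they are not part of Clustrix or the gcloud tooling. The subnet range is the `10.1.0.0/24` value used in the commands above.

```python
import ipaddress

# Subnet range taken from the subnet creation command above
CLUSTER_SUBNET = ipaddress.ip_network("10.1.0.0/24")


def is_restrictive_source_range(cidr, max_hosts=256):
    """Return True if a CIDR range covers at most max_hosts addresses.

    max_hosts is an illustrative threshold, not a GCP requirement.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    return network.num_addresses <= max_hosts


def in_cluster_subnet(ip):
    """Return True if the address belongs to the cluster subnet."""
    return ipaddress.ip_address(ip) in CLUSTER_SUBNET


# A single-host range passes, an open range does not
print(is_restrictive_source_range("203.0.113.7/32"))
print(is_restrictive_source_range("0.0.0.0/0"))
print(in_cluster_subnet("10.1.0.25"))
```

A check like this could run on the value substituted for `YOUR_IP/32` so that an open range such as `0.0.0.0/0` is never passed to `--source-ranges` by accident.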

## Resource Cleanup

In [None]:
def cleanup_gcp_resources(project_id, zone='us-central1-a', region='us-central1'):
    """
    Clean up GCP resources to avoid ongoing charges.
    
    Args:
        project_id: GCP project ID
        zone: Zone where resources were created
        region: Region where resources were created
    """
    
    # Raw f-string keeps the trailing backslashes so the printed commands stay multi-line
    cleanup_commands = rf"""
# List all compute instances
gcloud compute instances list --project {project_id}

# Delete specific instances
gcloud compute instances delete clustrix-instance \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete managed instance groups
gcloud compute instance-groups managed delete clustrix-preemptible-group \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete instance templates
gcloud compute instance-templates delete clustrix-preemptible-template \
  --project {project_id} \
  --quiet

# Delete GKE clusters
gcloud container clusters delete clustrix-cluster \
  --project {project_id} \
  --zone {zone} \
  --quiet

# Delete Cloud Storage buckets (BE CAREFUL - THIS DELETES ALL DATA)
gsutil -m rm -r gs://{project_id}-clustrix-batch
gsutil -m rm -r gs://{project_id}-vertex-models

# Delete firewall rules
gcloud compute firewall-rules delete clustrix-allow-ssh clustrix-internal \
  --project {project_id} \
  --quiet

# Delete the subnet first, then the VPC network
gcloud compute networks subnets delete clustrix-subnet \
  --project {project_id} \
  --region {region} \
  --quiet

gcloud compute networks delete clustrix-vpc \
  --project {project_id} \
  --quiet

# Delete service accounts
gcloud iam service-accounts delete clustrix-compute@{project_id}.iam.gserviceaccount.com \
  --project {project_id} \
  --quiet

gcloud iam service-accounts delete clustrix-batch-sa@{project_id}.iam.gserviceaccount.com \
  --project {project_id} \
  --quiet

# List remaining billable resources
gcloud compute instances list --project {project_id}
gcloud compute disks list --project {project_id}
gcloud compute addresses list --project {project_id}
gcloud container clusters list --project {project_id}
"""
    
    print(f"GCP Resource Cleanup Commands for Project: {project_id}")
    print(cleanup_commands)
    print("\n⚠️  WARNING: Some commands will permanently delete resources and data!")
    print("Review each resource before deleting and ensure you have backups if needed.")
    print("\n💡 TIP: Use 'gcloud compute instances stop' instead of 'delete' to keep an instance while pausing its vCPU and memory charges; attached disks and reserved static IPs are still billed.")
    
    return {
        'project_id': project_id,
        'zone': zone,
        'region': region,
        'cleanup_commands': cleanup_commands
    }

cleanup_info = cleanup_gcp_resources('your-project-id')
print("\nCleanup commands generated. Always verify resources before deletion!")
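
The cell above only prints the commands, so they still need to be reviewed one by one. As a minimal sketch of that review step, the code below splits a block of shell text into individual commands and flags the destructive ones. Nothing is executed. The helpers `split_shell_commands` and `is_destructive` are hypothetical names, and the "destructive" test is a simple keyword heuristic, not a gcloud feature.

```python
import shlex


def split_shell_commands(text):
    """Split a block of shell text into commands, one token list per command.

    Comment lines and blank lines are skipped; lines ending in a backslash
    are joined with the line that follows.
    """
    commands, pending = [], ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not pending and (not line or line.startswith("#")):
            continue
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        pending += line
        if pending.strip():
            commands.append(shlex.split(pending))
        pending = ""
    if pending.strip():
        commands.append(shlex.split(pending))
    return commands


def is_destructive(tokens):
    """Heuristic: treat any command containing 'delete' or 'rm' as destructive."""
    return any(token in ("delete", "rm") for token in tokens)


# Small sample in the same style as the generated cleanup commands
sample = """
# List all compute instances
gcloud compute instances list --project demo-project

gcloud compute instances delete clustrix-instance \\
  --project demo-project \\
  --zone us-central1-a \\
  --quiet
"""

for tokens in split_shell_commands(sample):
    label = "DESTRUCTIVE" if is_destructive(tokens) else "read-only"
    print(f"[{label}] {' '.join(tokens)}")
```

The same helpers could be applied to `cleanup_info['cleanup_commands']` to list every destructive command before any of them is pasted into a terminal.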

## Advanced Example: Distributed Scientific Computing

In [None]:
@cluster(cores=4, memory="8GB", time="01:00:00")
def gcp_scientific_simulation(simulation_params, storage_config):
    """
    Distributed scientific simulation using GCP infrastructure.
    """
    import io
    import pickle
    import time

    import matplotlib
    matplotlib.use("Agg")  # Non-interactive backend: cluster nodes have no display
    import matplotlib.pyplot as plt
    import numpy as np
    from google.cloud import storage
    from scipy.integrate import odeint
    
    def lorenz_system(state, t, sigma, rho, beta):
        """Lorenz attractor differential equations."""
        x, y, z = state
        return [
            sigma * (y - x),
            x * (rho - z) - y,
            x * y - beta * z
        ]
    
    def simulate_lorenz(params, time_points):
        """Simulate Lorenz system with given parameters."""
        initial_state = [1.0, 1.0, 1.0]
        solution = odeint(
            lorenz_system, initial_state, time_points,
            args=(params['sigma'], params['rho'], params['beta'])
        )
        return solution
    
    start_time = time.time()
    
    # Parameter sweep
    parameter_sets = simulation_params['parameter_sets']
    time_points = np.linspace(0, simulation_params['max_time'], simulation_params['num_points'])
    
    results = []
    
    for i, params in enumerate(parameter_sets):
        # Run simulation
        solution = simulate_lorenz(params, time_points)
        
        # Analyze results
        x, y, z = solution[:, 0], solution[:, 1], solution[:, 2]
        
        analysis = {
            'params': params,
            'max_x': float(np.max(x)),
            'min_x': float(np.min(x)),
            'max_y': float(np.max(y)),
            'min_y': float(np.min(y)),
            'max_z': float(np.max(z)),
            'min_z': float(np.min(z)),
            'mean_energy': float(np.mean(x**2 + y**2 + z**2)),
            'final_state': [float(x[-1]), float(y[-1]), float(z[-1])]
        }
        
        results.append(analysis)
        
        # Create visualization for first few parameter sets
        if i < 3:
            fig = plt.figure(figsize=(12, 4))
            
            # Time series
            plt.subplot(1, 3, 1)
            plt.plot(time_points, x, label='X')
            plt.plot(time_points, y, label='Y')
            plt.plot(time_points, z, label='Z')
            plt.xlabel('Time')
            plt.ylabel('State')
            plt.title(f'Lorenz System (σ={params["sigma"]}, ρ={params["rho"]}, β={params["beta"]})')
            plt.legend()
            
            # Phase space (X-Y)
            plt.subplot(1, 3, 2)
            plt.plot(x, y)
            plt.xlabel('X')
            plt.ylabel('Y')
            plt.title('X-Y Phase Space')
            
            # Phase space (X-Z)
            plt.subplot(1, 3, 3)
            plt.plot(x, z)
            plt.xlabel('X')
            plt.ylabel('Z')
            plt.title('X-Z Phase Space')
            
            plt.tight_layout()
            
            # Save plot to Cloud Storage
            if storage_config:
                img_buffer = io.BytesIO()
                plt.savefig(img_buffer, format='png', dpi=150, bbox_inches='tight')
                img_buffer.seek(0)
                
                storage_client = storage.Client(project=storage_config['project_id'])
                bucket = storage_client.bucket(storage_config['bucket_name'])
                
                plot_blob = bucket.blob(f"plots/lorenz_simulation_{i}.png")
                plot_blob.upload_from_string(img_buffer.getvalue(), content_type='image/png')
            
            plt.close()
    
    computation_time = time.time() - start_time
    
    # Save detailed results to Cloud Storage
    if storage_config:
        storage_client = storage.Client(project=storage_config['project_id'])
        bucket = storage_client.bucket(storage_config['bucket_name'])
        
        results_blob = bucket.blob("results/simulation_results.pkl")
        results_bytes = pickle.dumps({
            'simulation_params': simulation_params,
            'results': results,
            'computation_time': computation_time,
            'timestamp': time.time()
        })
        results_blob.upload_from_string(results_bytes)
    
    return {
        'num_simulations': len(parameter_sets),
        'computation_time': computation_time,
        'average_energy': float(np.mean([r['mean_energy'] for r in results])),
        'max_energy': max([r['mean_energy'] for r in results]),
        'min_energy': min([r['mean_energy'] for r in results]),
        'results_summary': results[:3],  # First 3 for brevity
        'storage_location': f"gs://{storage_config['bucket_name']}/results/" if storage_config else None
    }

# Example usage:
# simulation_config = {
#     'parameter_sets': [
#         {'sigma': 10.0, 'rho': 28.0, 'beta': 8.0/3.0},
#         {'sigma': 10.0, 'rho': 24.0, 'beta': 8.0/3.0},
#         {'sigma': 10.0, 'rho': 32.0, 'beta': 8.0/3.0},
#     ],
#     'max_time': 25.0,
#     'num_points': 10000
# }
# 
# storage_config = {
#     'project_id': 'your-project-id',
#     'bucket_name': 'your-results-bucket'
# }
# 
# result = gcp_scientific_simulation(simulation_config, storage_config)
# print(f"Completed {result['num_simulations']} simulations in {result['computation_time']:.2f} seconds")
# print(f"Average energy: {result['average_energy']:.4f}")

print("Advanced scientific computing example defined.")
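
The function above loops over every parameter set inside a single job. To spread a larger sweep over several jobs, one option is to split the parameter sets into chunks and submit each chunk as a separate call. The sketch below shows only the chunking step; `chunk_parameter_sets` is a hypothetical helper, not a Clustrix function, and the submission loop is left commented out because it needs a configured cluster and bucket.

```python
def chunk_parameter_sets(parameter_sets, num_chunks):
    """Split parameter sets into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if not parameter_sets:
        return []
    num_chunks = min(num_chunks, len(parameter_sets))
    base, extra = divmod(len(parameter_sets), num_chunks)
    chunks, start = [], 0
    for index in range(num_chunks):
        # The first `extra` chunks take one additional item
        size = base + (1 if index < extra else 0)
        chunks.append(parameter_sets[start:start + size])
        start += size
    return chunks


# Sweep rho from 20 to 34 with the sigma and beta values used in the example above
sweep = [{'sigma': 10.0, 'rho': float(rho), 'beta': 8.0 / 3.0} for rho in range(20, 35, 2)]
chunks = chunk_parameter_sets(sweep, num_chunks=3)
print([len(chunk) for chunk in chunks])

# Each chunk could then be submitted as its own job, for example:
# for chunk in chunks:
#     config = {'parameter_sets': chunk, 'max_time': 25.0, 'num_points': 10000}
#     result = gcp_scientific_simulation(config, storage_config)
```

Note that `gcp_scientific_simulation` always writes to `results/simulation_results.pkl` and `plots/lorenz_simulation_{i}.png`, so jobs that share a bucket would overwrite each other's output. Giving each chunk its own bucket, or adding a chunk identifier to the blob names, would avoid that.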

## Summary

This tutorial covered:

1. **Setup**: GCP authentication and Clustrix installation
2. **Compute Engine**: Direct VM configuration and management
3. **GKE Integration**: Kubernetes clusters for containerized workloads
4. **Cloud Batch**: Managed job scheduling for large-scale processing
5. **Cloud Storage**: Data management and result storage
6. **Vertex AI**: Machine learning platform integration
7. **Cost Optimization**: Preemptible VMs and cost management strategies
8. **Security**: Best practices for secure deployment
9. **Resource Management**: Proper cleanup procedures

### Next Steps

- Set up your GCP credentials and test the basic configuration
- Start with a simple Compute Engine instance for initial testing
- Consider GKE for containerized workloads and auto-scaling
- Explore Cloud Batch for large-scale batch processing
- Implement proper monitoring and cost controls
- Use preemptible VMs for cost-effective fault-tolerant workloads

### GCP-Specific Advantages

- **Preemptible/Spot VMs**: Discounts of up to 80% over on-demand pricing for workloads that tolerate interruption
- **Google Kubernetes Engine**: Managed Kubernetes with cluster and node auto-scaling
- **Vertex AI**: ML platform with AutoML capabilities
- **Global Network**: Private backbone network with global reach
- **BigQuery Integration**: Access to GCP's data warehouse for analytics on results
- **Sustained Use Discounts**: Automatic discounts on eligible machine types that run for much of the billing month
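
As a rough illustration of what the "up to 80%" figure means for a job budget, the sketch below compares on-demand and preemptible cost for the same amount of work. The hourly rate, the discount, and the rework fraction (extra compute repeated after preemptions) are all placeholder assumptions, not GCP prices; substitute values from the pricing page for your machine type and region.

```python
def estimate_job_cost(vm_hours, hourly_rate, discount=0.0, rework_fraction=0.0):
    """Estimate cost as hours * (1 + rework) * rate * (1 - discount)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    billed_hours = vm_hours * (1.0 + rework_fraction)
    return billed_hours * hourly_rate * (1.0 - discount)


HOURLY_RATE = 0.20  # placeholder rate in USD per VM-hour, not a real GCP price

on_demand = estimate_job_cost(vm_hours=100, hourly_rate=HOURLY_RATE)
preemptible = estimate_job_cost(
    vm_hours=100, hourly_rate=HOURLY_RATE, discount=0.80, rework_fraction=0.10
)

print(f"On-demand:   ${on_demand:.2f}")
print(f"Preemptible: ${preemptible:.2f}")
print(f"Savings:     {100 * (1 - preemptible / on_demand):.0f}%")
```

The rework term is there because a preempted job repeats whatever it had not checkpointed, so the net saving is somewhat lower than the headline discount.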

### Resources

- [Google Cloud Compute Engine Documentation](https://cloud.google.com/compute/docs)
- [Google Kubernetes Engine Documentation](https://cloud.google.com/kubernetes-engine/docs)
- [Google Cloud Batch Documentation](https://cloud.google.com/batch/docs)
- [Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
- [Google Cloud Storage Documentation](https://cloud.google.com/storage/docs)
- [Clustrix Documentation](https://clustrix.readthedocs.io/)

**Remember**: Always monitor your GCP costs and clean up resources when not in use. Use budget alerts and billing export to track spending!