# SeaweedFS Active-Active Resilience Demonstration

This notebook demonstrates the fault-tolerance and resilience capabilities of SeaweedFS in a high-availability setup. We'll use the existing `docker-compose-ha.yml` configuration to:

1. Start a full HA cluster with 3 master nodes, 3 volume servers, and 2 filer servers
2. Upload test data to the system
3. Simulate various failure scenarios (master node failure, volume server failure)
4. Monitor the system's recovery and healing process
5. Verify data integrity after recovery

## Architecture Overview

The high-availability setup consists of:

- **3 Master Servers**: Running as a Raft cluster for consensus and leader election
- **3 Volume Servers**: Each on a different logical rack for data distribution
- **2 Filer Servers**: Providing redundant access to the file system 
- **NGINX Load Balancer**: Distributing requests between filer instances

```
                   ┌──────────────┐
                   │     NGINX    │
                   │ Load Balancer│
                   └───────┬──────┘
                           │
            ┌──────────────┴───────────────┐
            │                              │
     ┌──────┴────────┐            ┌────────┴──────┐
     │    Filer1     │            │     Filer2    │
     └──────┬────────┘            └────────┬──────┘
            │                              │
            └──────────────┬───────────────┘
                           │
         ┌────────────┬────┴────┬────────────┐
         │            │         │            │
  ┌──────┴─────┐ ┌────┴────┐ ┌──┴───────┐ ┌──┴────────┐
  │  Master1   │ │ Master2 │ │ Master3  │ │  Volume   │
  │  (Leader)  │ │         │ │          │ │  Servers  │
  └──────┬─────┘ └────┬────┘ └──┬───────┘ │           │
         │            │         │         │  Volume1  │
         └────────────┴────┬────┘         │  Volume2  │
                           │              │  Volume3  │
                           └──────────────┤           │
                                          └───────────┘
```

## SeaweedFS Resilience Features

1. **Master Server Resilience**: Multiple masters in a Raft cluster ensure leadership failover if the leader goes down
2. **Data Redundancy**: Volume data can be replicated across multiple volume servers
3. **Rack Awareness**: Data distribution across different racks to survive rack failures
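4. **Rebalancing**: When a volume server is added back, volumes can be redistributed with the `volume.balance` shell command, run manually or scheduled through the master's maintenance scripts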
5. **Filer Redundancy**: Multiple filer servers provide uninterrupted access to files
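
To see why the setup uses three masters, it helps to work out the Raft majority arithmetic. The following is a minimal sketch, not SeaweedFS code; `raft_quorum` and `tolerated_failures` are hypothetical helper names introduced here for illustration.

```python
def raft_quorum(cluster_size: int) -> int:
    """Smallest number of masters that still forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Number of masters that can fail while a majority remains."""
    return cluster_size - raft_quorum(cluster_size)


# Compare a few common cluster sizes
for n in (1, 3, 5):
    print(f"{n} masters: quorum={raft_quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With 3 masters, 2 must stay reachable, so the cluster can lose one master and still elect a leader. That is the case the master failure scenario below exercises.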

In [None]:
# Install required packages
import importlib.util
import sys
import subprocess
import os

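# pip distribution names that differ from their importable module names
IMPORT_NAMES = {'python-dotenv': 'dotenv'}

def install_package(package):
    """Install a Python package using the recommended python -m pip approach"""
    dist_name = package.split('==')[0]
    # find_spec needs the module name, which is not always the pip name
    if importlib.util.find_spec(IMPORT_NAMES.get(dist_name, dist_name)) is None: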
        print(f"Installing {package}...")
        try:
            # Use subprocess to run python -m pip install
            result = subprocess.run(
                [sys.executable, '-m', 'pip', 'install', package],
                check=True,
                capture_output=True,
                text=True
            )
            print(f"Successfully installed {package}")
            return True
        except subprocess.CalledProcessError as e:
            print(f"Failed to install {package}: {e}")
            print(f"Error output: {e.stderr}")
            return False
    else:
        print(f"{package.split('==')[0]} is already installed.")
        return True

# List of required packages
packages = ['requests', 'pandas', 'numpy', 'matplotlib', 'seaborn', 'python-dotenv', 'boto3', 'plotly', 'docker', 'networkx']

# Install each package
for package in packages:
    success = install_package(package)
    if not success:
        print(f"Warning: Failed to install {package}. Some notebook features may not work correctly.")

In [None]:
# Import required libraries
import os
import sys
import json
import time
import docker
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from IPython.display import HTML, display, clear_output
import datetime
import random
import subprocess
import threading
from pathlib import Path
import re
import boto3
import io

# Load environment variables
try:
    from dotenv import load_dotenv
    env_path = Path('/home/harry/projects/seaweedfs/demo/.env')
    if env_path.exists():
        print(f"Loading environment variables from: {env_path}")
        load_dotenv(dotenv_path=env_path)
    else:
        print("No .env file found. Using default environment variables.")
except ImportError:
    print("dotenv not installed. Using default environment variables.")

# Setup Docker client
client = docker.from_env()

# Configuration for SeaweedFS API endpoints - we'll use local ports from docker-compose-ha.yml
master1_url = "http://localhost:9333"  # Master1 port
master2_url = "http://localhost:9334"  # Master2 port
master3_url = "http://localhost:9335"  # Master3 port
volume1_url = "http://localhost:8080"  # Volume1 port
volume2_url = "http://localhost:8081"  # Volume2 port 
volume3_url = "http://localhost:8082"  # Volume3 port
filer_url = "http://localhost:9000"    # NGINX load balancer port
s3_url = "http://localhost:9000/s3"    # S3 API through NGINX

# Auth details
auth_user = os.getenv("SEAWEED_AUTH_USER", None)
auth_password = os.getenv("SEAWEED_AUTH_PASSWORD", None) 
auth = None if not auth_user else (auth_user, auth_password)

# S3 credentials
s3_access_key = os.getenv("AWS_ACCESS_KEY_ID", "seaweedfs")  # Default from docker-compose-ha.yml
s3_secret_key = os.getenv("AWS_SECRET_ACCESS_KEY", "seaweedfs")  # Default from docker-compose-ha.yml

print(f"Master1 URL: {master1_url}")
print(f"Master2 URL: {master2_url}")
print(f"Master3 URL: {master3_url}")
print(f"Volume1 URL: {volume1_url}")
print(f"Volume2 URL: {volume2_url}")
print(f"Volume3 URL: {volume3_url}")
print(f"Filer URL (NGINX): {filer_url}")
print(f"S3 URL: {s3_url}")
print(f"Authentication: {'Enabled' if auth else 'Disabled'}")
print(f"S3 Access Key: {s3_access_key}")
print(f"S3 Secret Key: {'*' * len(s3_secret_key)}")

## Helper Functions

Let's define some helper functions to interact with the SeaweedFS cluster, monitor its state, and perform experiments.

In [None]:
def api_request(method, url, params=None, data=None, headers=None, files=None, auth=auth, timeout=5):
    """Make an API request to the SeaweedFS servers with timeout and error handling"""
    try:
        response = requests.request(
            method=method,
            url=url,
            params=params,
            data=data,
            headers=headers,
            files=files,
            auth=auth,
            timeout=timeout
        )
        response.raise_for_status()
        
        # Try to parse as JSON; fall back to raw text for non-JSON responses
        try:
            return response.json()
        except ValueError:
            return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error making API request to {url}: {e}")
        if hasattr(e, 'response') and e.response is not None:
            print(f"Response status code: {e.response.status_code}")
            try:
                print(f"Response text: {e.response.text[:200]}...")
            except:
                pass
        return None

def format_table(data, title=None):
    """Format data as a styled HTML table"""
    from IPython.display import HTML
    
    if isinstance(data, list):
        if not data:
            return HTML("<p>No data available</p>")
        df = pd.DataFrame(data)
    elif isinstance(data, dict):
        df = pd.DataFrame([data])
    else:
        df = pd.DataFrame(data)
    
    try:
        # Try to use pandas styling
        styled = df.style.set_table_attributes('style="border-collapse:collapse"')
        styled = styled.set_properties(**{'border': '1px solid black', 'padding': '5px'})
        styled = styled.set_table_styles([
            {'selector': 'th', 'props': [('background-color', '#f2f2f2'), 
                                       ('border', '1px solid black'),
                                       ('padding', '5px'),
                                       ('text-align', 'left')]}
        ])
        
        if title:
            return HTML(f"<h3>{title}</h3>" + styled.to_html())
        return HTML(styled.to_html())
    except Exception as e:
        # Fallback if styling fails
        print(f"Warning: Using unstyled table because: {str(e)}")
        html = df.to_html(border=1, index=False)
        if title:
            html = f"<h3>{title}</h3>{html}"
        return HTML(html)

In [None]:
# Docker container management functions

def get_container_by_service(service_name):
    """Get container by service name (e.g. 'master1', 'volume2')"""
    try:
        # Include stopped containers so a failed service can be found and restarted
        containers = client.containers.list(all=True)
        for container in containers:
            if service_name in container.name:
                return container
        print(f"Container for service '{service_name}' not found")
        return None
    except docker.errors.APIError as e:
        print(f"Docker API error: {e}")
        return None

def stop_container(service_name):
    """Stop a container by service name"""
    container = get_container_by_service(service_name)
    if container:
        print(f"Stopping container {container.name}...")
        container.stop()
        return True
    return False

def start_container(service_name):
    """Start a container by service name"""
    container = get_container_by_service(service_name)
    if container:
        print(f"Starting container {container.name}...")
        container.start()
        return True
    return False

def get_container_status(service_name):
    """Get status of a container by service name"""
    container = get_container_by_service(service_name)
    if container:
        container.reload()  # Refresh container info
        return container.status
    return "not found"

def list_containers():
    """List all containers related to the SeaweedFS cluster"""
    try:
        containers = client.containers.list(all=True)
        seaweed_containers = []
        seaweed_services = ['master', 'volume', 'filer', 'nginx']
        
        for container in containers:
            if any(service in container.name for service in seaweed_services):
                seaweed_containers.append({
                    'id': container.short_id,
                    'name': container.name,
                    'status': container.status,
                    'image': container.image.tags[0] if container.image.tags else container.image.short_id
                })
        
        return seaweed_containers
    except docker.errors.APIError as e:
        print(f"Docker API error: {e}")
        return []

def display_container_status():
    """Display current status of all SeaweedFS containers"""
    containers = list_containers()
    if containers:
        return format_table(containers, "SeaweedFS Container Status")
    else:
        return HTML("<p>No SeaweedFS containers found</p>")

In [None]:
# SeaweedFS system status monitoring functions

def get_cluster_status(master_url=master1_url):
    """Get the status of the SeaweedFS cluster"""
    return api_request('GET', f"{master_url}/cluster/status?pretty=y")

def get_system_topology(master_url=master1_url):
    """Get the system topology information"""
    return api_request('GET', f"{master_url}/dir/status?pretty=y")

def get_volume_status(master_url=master1_url):
    """Get the status of all volumes in the system"""
    return api_request('GET', f"{master_url}/vol/status?pretty=y")

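def check_master_leader():
    """Determine which master is currently the leader"""
    masters = [master1_url, master2_url, master3_url]
    for i, master_url in enumerate(masters):
        # api_request already handles connection errors and returns None
        status = api_request('GET', f"{master_url}/cluster/status")
        # Every reachable master reports the leader's address, so the master that
        # answers is not necessarily the leader. Only the leader itself is expected
        # to report IsLeader as true in its /cluster/status response.
        if isinstance(status, dict) and status.get('IsLeader'):
            leader_name = f"master{i+1}"
            leader_url = status.get('Leader', master_url)
            print(f"✅ Leader master is {leader_name} at {leader_url}")
            return leader_name, leader_url
    
    print("❌ No leader master found")
    return None, None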

def check_endpoint_health():
    """Check health of all endpoints"""
    endpoints = {
        "master1": f"{master1_url}/cluster/status",
        "master2": f"{master2_url}/cluster/status",
        "master3": f"{master3_url}/cluster/status",
        "volume1": f"{volume1_url}/status",
        "volume2": f"{volume2_url}/status",
        "volume3": f"{volume3_url}/status",
        "filer": f"{filer_url}/"
    }
    
    results = []
    for name, url in endpoints.items():
        try:
            start_time = time.time()
            response = requests.get(url, timeout=2)
            elapsed = time.time() - start_time
            
            results.append({
                "service": name,
                "status": "✅ Online" if response.status_code < 400 else "⚠️ Error",
                "code": response.status_code,
                "response_time": f"{elapsed:.3f}s"
            })
        except requests.exceptions.RequestException:
            results.append({
                "service": name,
                "status": "❌ Offline",
                "code": "N/A",
                "response_time": "N/A"
            })
    
    return results

def monitor_system_health(interval=5, duration=60):
    """Monitor system health for a specified duration"""
    end_time = time.time() + duration
    health_data = []
    
    try:
        while time.time() < end_time:
            timestamp = datetime.datetime.now().strftime("%H:%M:%S")
            leader, _ = check_master_leader()
            health_results = check_endpoint_health()
            
            # Record data
            record = {"timestamp": timestamp, "leader": leader}
            for result in health_results:
                record[result["service"]] = "Online" if "✅" in result["status"] else "Offline"
            
            health_data.append(record)
            
            # Display current status
            clear_output(wait=True)
            display(HTML(f"<h2>System Health Monitoring</h2><p>Current time: {timestamp}</p>"))
            display(format_table(health_results, "Current Endpoint Status"))
            
            if len(health_data) > 1:
                # Create and display chart
                plot_health_data(health_data)
            
            # Wait for next interval
            remaining = min(interval, end_time - time.time())
            if remaining > 0:
                time.sleep(remaining)
                
    except KeyboardInterrupt:
        print("\nMonitoring stopped by user")
    
    return health_data

def plot_health_data(health_data):
    """Plot health data over time"""
    df = pd.DataFrame(health_data)
    
    # Get service columns (exclude timestamp and leader)
    service_cols = [col for col in df.columns if col not in ["timestamp", "leader"]]
    
    # Prepare data for plotting
    for col in service_cols:
        df[col] = df[col].apply(lambda x: 1 if x == "Online" else 0)
    
    # Plot
    plt.figure(figsize=(12, 6))
    
    # Plot service status
    for col in service_cols:
        plt.plot(df["timestamp"], df[col], marker="o", label=col)
    
    # Highlight leader changes
    leader_changes = df[df["leader"] != df["leader"].shift(1)].index
    for idx in leader_changes:
        if idx > 0:  # Skip the first point
            plt.axvline(x=idx, color="r", linestyle="--", alpha=0.3)
            plt.text(idx, 0.5, f"Leader: {df.iloc[idx]['leader']}", 
                    rotation=90, verticalalignment="center")
    
    plt.yticks([0, 1], ["Offline", "Online"])
    plt.ylim(-0.1, 1.1)
    plt.grid(True, axis="y")
    plt.xticks(rotation=45)
    plt.title("SeaweedFS System Health Over Time")
    plt.xlabel("Time")
    plt.ylabel("Status")
    plt.legend(loc="lower left", bbox_to_anchor=(0, 1.02, 1, 0.2), mode="expand", ncol=4)
    plt.tight_layout()
    plt.show()
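
The records returned by `monitor_system_health` can also be reduced to a per-service availability figure. Here is a minimal sketch, assuming the record layout built above (a `timestamp`, a `leader`, and one `Online`/`Offline` entry per service). `summarize_availability` is a hypothetical helper, and the sample records are made up for illustration, not measured results.

```python
from collections import Counter


def summarize_availability(health_data):
    """Return {service: fraction of samples in which it was 'Online'}."""
    if not health_data:
        return {}
    online = Counter()
    services = set()
    for record in health_data:
        for key, value in record.items():
            if key in ("timestamp", "leader"):
                continue  # metadata columns, not services
            services.add(key)
            if value == "Online":
                online[key] += 1
    total = len(health_data)
    return {name: online[name] / total for name in sorted(services)}


# Illustrative records in the same shape monitor_system_health produces
sample = [
    {"timestamp": "12:00:00", "leader": "master1", "master1": "Online", "volume1": "Online"},
    {"timestamp": "12:00:02", "leader": "master2", "master1": "Offline", "volume1": "Online"},
]
print(summarize_availability(sample))
```

A service missing from some records is counted as not online for those samples, which keeps the fraction conservative.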

In [None]:
# S3 API interaction functions

def create_s3_client():
    """Create an S3 client to interact with the SeaweedFS S3 API"""
    endpoint_url = s3_url
    if not endpoint_url.startswith("http"):
        endpoint_url = f"http://{endpoint_url}"
        
    try:
        s3_client = boto3.client(
            's3',
            endpoint_url=endpoint_url,
            aws_access_key_id=s3_access_key,
            aws_secret_access_key=s3_secret_key,
            # For self-signed certificates
            verify=False,
            # Required for non-AWS S3 implementations
            config=boto3.session.Config(
                signature_version='s3v4',
                s3={'addressing_style': 'path'}
            )
        )
        return s3_client
    except Exception as e:
        print(f"Error creating S3 client: {e}")
        return None

def create_test_bucket(bucket_name="resilience-test"):
    """Create a test bucket for experiments"""
    s3_client = create_s3_client()
    if not s3_client:
        return False
        
    try:
        # Check if bucket exists first
        try:
            s3_client.head_bucket(Bucket=bucket_name)
            print(f"Bucket '{bucket_name}' already exists")
            return True
        except s3_client.exceptions.ClientError:
            # head_bucket failed (typically a 404), so try to create the bucket
            s3_client.create_bucket(Bucket=bucket_name)
            print(f"Created bucket '{bucket_name}'")
            return True
    except Exception as e:
        print(f"Error creating bucket: {e}")
        return False

def generate_test_data(size_kb=100):
    """Generate random test data of specified size"""
    return os.urandom(size_kb * 1024)

import hashlib

def upload_test_files(bucket_name="resilience-test", file_count=10, size_kb=100):
    """Upload test files to the bucket"""
    s3_client = create_s3_client()
    if not s3_client:
        return []
        
    uploaded_files = []
    
    try:
        # Create the bucket if it doesn't exist
        create_test_bucket(bucket_name)
        
        # Upload files
        for i in range(file_count):
            key = f"test-file-{i}.dat"
            data = generate_test_data(size_kb)
            
            s3_client.put_object(
                Bucket=bucket_name,
                Key=key,
                Body=io.BytesIO(data)
            )
            
            uploaded_files.append({
                "bucket": bucket_name,
                "key": key,
                "size": size_kb,
                # MD5 digest of the content; unlike hash(), it is stable across kernel restarts
                "md5": hashlib.md5(data).hexdigest()
            })
            
            print(f"Uploaded {key} ({size_kb}KB)")
            
        return uploaded_files
    except Exception as e:
        print(f"Error uploading test files: {e}")
        return uploaded_files

def verify_files(file_list):
    """Verify that files are accessible and match expected content"""
    s3_client = create_s3_client()
    if not s3_client:
        return []
        
    results = []
    
    for file_info in file_list:
        try:
            response = s3_client.get_object(
                Bucket=file_info["bucket"],
                Key=file_info["key"]
            )
            
            data = response["Body"].read()
            # Recompute the digest the same way it was recorded at upload time
            current_hash = hashlib.md5(data).hexdigest()
            is_match = current_hash == file_info["md5"]
            
            results.append({
                "key": file_info["key"],
                "accessible": True,
                "size_match": len(data) == file_info["size"]*1024,
                "content_match": is_match,
                "status": "✅ Valid" if is_match else "❌ Corrupted"
            })
        except Exception as e:
            results.append({
                "key": file_info["key"],
                "accessible": False,
                "size_match": False,
                "content_match": False,
                "status": "❌ Not accessible",
                "error": str(e)
            })
    
    return results

In [None]:
# Visualization functions

def plot_topology():
    """Visualize the SeaweedFS cluster topology"""
    try:
        # networkx is required for the graph below (pip install networkx)
        import networkx as nx

        # Get topology information
        topology = get_system_topology()
        if not topology:
            print("Could not fetch topology information")
            return
        
        # Extract data centers, racks, and data nodes
        datacenters = topology.get("Topology", {}).get("DataCenters", [])
        
        # Create network graph
        G = nx.Graph()
        
        # Add master node
        G.add_node("Master", role="master", size=20)
        
        # Add data centers, racks, and volumes
        for dc_idx, dc in enumerate(datacenters):
            dc_name = dc.get("Id", f"DC-{dc_idx}")
            G.add_node(dc_name, role="datacenter", size=15)
            G.add_edge("Master", dc_name)
            
            for rack_idx, rack in enumerate(dc.get("Racks", [])):
                rack_name = rack.get("Id", f"{dc_name}-Rack-{rack_idx}")
                G.add_node(rack_name, role="rack", size=10)
                G.add_edge(dc_name, rack_name)
                
                for server_idx, server in enumerate(rack.get("DataNodes", [])):
                    server_name = f"Volume-{server_idx}-{rack_name}"
                    G.add_node(server_name, role="volume", size=5, volumes=server.get("Volumes", []))
                    G.add_edge(rack_name, server_name)
        
        # Draw the graph
        plt.figure(figsize=(12, 8))
        pos = nx.spring_layout(G, k=0.5, iterations=50)
        
        # Colormap based on role
        color_map = {
            'master': 'red',
            'datacenter': 'blue',
            'rack': 'green',
            'volume': 'orange'
        }
        node_colors = [color_map[G.nodes[node]['role']] for node in G.nodes()]
        node_sizes = [G.nodes[node]['size'] * 100 for node in G.nodes()]
        
        nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, alpha=0.8)
        nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5)
        nx.draw_networkx_labels(G, pos, font_size=8)
        
        plt.title("SeaweedFS Cluster Topology")
        plt.axis('off')
        plt.tight_layout()
        plt.show()
        
    except Exception as e:
        print(f"Error visualizing topology: {e}")
    
def animate_failure_recovery(duration=30, interval=0.5):
    """Create an animation showing system state during failure and recovery"""
    # This requires stored monitoring data from monitor_system_health
    pass

## 1. Check If Cluster Is Running

First, let's check if our SeaweedFS high-availability cluster is running by inspecting the Docker containers and API endpoints.

In [None]:
# Check if our cluster is already running
display_container_status()

In [None]:
# Start the cluster if it's not already running
def ensure_cluster_running():
    """Ensure the SeaweedFS HA cluster is running"""
    try:
        # Check if Docker Compose exists
        docker_compose_path = '/home/harry/projects/seaweedfs/demo/docker-compose-ha.yml'
        if not os.path.exists(docker_compose_path):
            print(f"Error: docker-compose-ha.yml not found at {docker_compose_path}")
            return False
            
        # Check if containers are already running
        containers = list_containers()
        running_containers = [c for c in containers if c['status'] == 'running']
        
        if running_containers and len(running_containers) >= 9:  # 3 masters, 3 volumes, 2 filers, 1 nginx
            print("SeaweedFS HA cluster is already running")
            return True
            
        # Start the cluster using docker-compose
        print("Starting SeaweedFS HA cluster...")
        
        result = subprocess.run(
            ['docker-compose', '-f', 'docker-compose-ha.yml', 'up', '-d'],
            # Run from the compose file's directory without changing the notebook's working directory
            cwd=os.path.dirname(docker_compose_path),
            capture_output=True,
            text=True
        )
        
        if result.returncode == 0:
            print("SeaweedFS HA cluster started successfully")
            return True
        else:
            print(f"Error starting cluster: {result.stderr}")
            return False
            
    except Exception as e:
        print(f"Error ensuring cluster is running: {e}")
        return False

# Run the function
ensure_cluster_running()

In [None]:
# Check which master is the leader
check_master_leader()

In [None]:
# Check health of all endpoints
health_results = check_endpoint_health()
format_table(health_results, "Initial System Health Check")

## 2. Examine System Topology

Let's examine the system topology to understand the structure of our SeaweedFS HA cluster.

In [None]:
# Get and display cluster status
cluster_status = get_cluster_status()
print(json.dumps(cluster_status, indent=2))

In [None]:
# Get and display system topology
topology = get_system_topology()
print(json.dumps(topology, indent=2))

In [None]:
# Get and display volume status
volume_status = get_volume_status()
print(json.dumps(volume_status, indent=2))

## 3. Prepare Test Data

To demonstrate resilience, we'll create some test data that we can use to verify system integrity after failure scenarios.

In [None]:
# Create a test bucket
create_test_bucket("resilience-test")

In [None]:
# Upload test files for resilience testing
uploaded_files = upload_test_files(bucket_name="resilience-test", file_count=20, size_kb=500)
format_table(uploaded_files, "Uploaded Test Files")

In [None]:
# Verify files are accessible
verification_results = verify_files(uploaded_files)
format_table(verification_results, "Initial File Verification")

## 4. Master Node Failure Scenario

In this scenario, we'll simulate a master node failure and observe how the system handles it.

### Step 1: Identify Current Leader

In [None]:
# Identify the current leader
leader_name, leader_url = check_master_leader()

In [None]:
# Start monitoring system health
print("Starting health monitoring. Will fail the leader master in 10 seconds...")
print("Press Ctrl+C to stop monitoring")

# Start monitoring in a separate thread
monitor_thread = threading.Thread(target=lambda: monitor_system_health(interval=2, duration=120))
monitor_thread.daemon = True
monitor_thread.start()

# Wait 10 seconds before failing the leader
time.sleep(10)

# Fail the leader master
if leader_name:
    print(f"⚠️ Stopping {leader_name} (current leader)...")
    stop_container(leader_name)
else:
    print("No leader found, cannot perform failure test")

# Wait for monitoring to complete
monitor_thread.join()

In [None]:
# Check who the new leader is
new_leader_name, new_leader_url = check_master_leader()

if not new_leader_name:
    print("❌ No leader found after failure")
elif new_leader_name != leader_name:
    print(f"✅ Leader election successful! New leader: {new_leader_name}")
else:
    print(f"❌ Leader did not change: still {leader_name}")
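
If the list returned by `monitor_system_health` is kept (the threaded call above discards it), the moments at which leadership changed can be read from it directly. This is a sketch under that assumption. `leader_transitions` is a hypothetical helper, and the sample records are illustrative, not measured results.

```python
def leader_transitions(health_data):
    """Return (timestamp, previous_leader, new_leader) for each leader change."""
    transitions = []
    previous = None
    for i, record in enumerate(health_data):
        current = record.get("leader")
        # The first record only establishes the starting leader
        if i > 0 and current != previous:
            transitions.append((record["timestamp"], previous, current))
        previous = current
    return transitions


# Illustrative records: master1 leads, no leader is visible for one sample,
# then master2 takes over
sample_records = [
    {"timestamp": "12:00:00", "leader": "master1"},
    {"timestamp": "12:00:02", "leader": None},
    {"timestamp": "12:00:04", "leader": "master2"},
    {"timestamp": "12:00:06", "leader": "master2"},
]
for when, old, new in leader_transitions(sample_records):
    print(f"{when}: {old} -> {new}")
```

The gap between losing the old leader and seeing a new one is only as precise as the monitoring interval, so a shorter `interval` gives a tighter estimate of election time.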

In [None]:
# Verify that files are still accessible after master failure
print("Verifying file accessibility after master failure...")
verification_results = verify_files(uploaded_files)
format_table(verification_results, "File Verification After Master Failure")

In [None]:
# Restart the failed master
if leader_name:
    print(f"Restarting {leader_name}...")
    start_container(leader_name)
    print(f"Waiting for {leader_name} to initialize...")
    time.sleep(5)
    
    # Check container status
    status = get_container_status(leader_name)
    print(f"{leader_name} status: {status}")
else:
    print("No leader name recorded, cannot restart")
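
A fixed `time.sleep(5)` may be too short on a slow machine and wastes time on a fast one. One alternative is to poll a condition until it holds or a timeout expires. The sketch below is generic and makes no SeaweedFS calls; `wait_until` is a hypothetical helper name.

```python
import time


def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it returns a truthy value or timeout seconds pass.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline
        time.sleep(min(interval, remaining))


# Example with a condition that becomes true on the third check
attempts = {"count": 0}

def ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(wait_until(ready, timeout=5.0, interval=0.01))
```

In this notebook the predicate could be something like `lambda: get_container_status(leader_name) == 'running'`, using the helper defined earlier.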

## 5. Volume Server Failure Scenario

In this scenario, we'll simulate a volume server failure and observe data redundancy and healing.

In [None]:
# Start monitoring system health
print("Starting health monitoring. Will fail volume1 in 10 seconds...")
print("Press Ctrl+C to stop monitoring")

# Start monitoring in a separate thread
monitor_thread = threading.Thread(target=lambda: monitor_system_health(interval=2, duration=120))
monitor_thread.daemon = True
monitor_thread.start()

# Wait 10 seconds before failing the volume
time.sleep(10)

# Fail volume1
print("⚠️ Stopping volume1...")
stop_container('volume1')

# Wait for monitoring to complete
monitor_thread.join()

In [None]:
# Check system topology after volume1 failure
print("Checking system topology after volume1 failure...")
topology = get_system_topology()
print(json.dumps(topology, indent=2))

In [None]:
# Verify that files are still accessible after volume server failure
print("Verifying file accessibility after volume server failure...")
verification_results = verify_files(uploaded_files)
format_table(verification_results, "File Verification After Volume Server Failure")

In [None]:
# Restart the failed volume
print("Restarting volume1...")
start_container('volume1')
print("Waiting for volume1 to initialize...")
time.sleep(5)
    
# Check container status
status = get_container_status('volume1')
print(f"volume1 status: {status}")

## 6. Multiple Server Failure Scenario

In this scenario, we'll simulate multiple simultaneous failures to test the system's limits.

In [None]:
# Start monitoring system health
print("Starting health monitoring. Will fail multiple servers in 10 seconds...")
print("Press Ctrl+C to stop monitoring")

# Start monitoring in a separate thread
monitor_thread = threading.Thread(target=lambda: monitor_system_health(interval=2, duration=180))
monitor_thread.daemon = True
monitor_thread.start()

# Wait 10 seconds before failing servers
time.sleep(10)

# Fail multiple servers
print("⚠️ Stopping master2, volume2, and filer2...")
stop_container('master2')
stop_container('volume2')
stop_container('filer2')

# Wait for monitoring to complete
monitor_thread.join()

In [None]:
# Verify that files are still accessible after multiple server failures
print("Verifying file accessibility after multiple server failures...")
verification_results = verify_files(uploaded_files)
format_table(verification_results, "File Verification After Multiple Server Failures")

In [None]:
# Restart the failed servers
print("Restarting master2, volume2, and filer2...")
start_container('master2')
start_container('volume2')
start_container('filer2')
print("Waiting for servers to initialize...")
time.sleep(10)

# Check container status
containers = list_containers()
format_table(containers, "Container Status After Restart")

## 7. Simulating System Healing

Let's observe how the system heals after failures by monitoring topology changes and data rebalancing.

In [None]:
# Get volume status after all failures and recoveries
volume_status = get_volume_status()
print(json.dumps(volume_status, indent=2))

In [None]:
# Check endpoint health after all recoveries
health_results = check_endpoint_health()
format_table(health_results, "Final System Health Check")
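
To check whether every rack has its volume server back, the topology can be reduced to a count of data nodes per rack. This is a minimal sketch, assuming the `/dir/status` response has the `Topology`, `DataCenters`, `Racks`, `DataNodes` nesting that `plot_topology` above also relies on. `nodes_per_rack` is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
def nodes_per_rack(topology):
    """Return {(datacenter_id, rack_id): number_of_data_nodes}."""
    counts = {}
    datacenters = (topology or {}).get("Topology", {}).get("DataCenters") or []
    for dc in datacenters:
        for rack in dc.get("Racks") or []:
            key = (dc.get("Id"), rack.get("Id"))
            counts[key] = len(rack.get("DataNodes") or [])
    return counts


# Illustrative topology in the assumed shape: three racks, one of them empty
sample_topology = {
    "Topology": {
        "DataCenters": [
            {
                "Id": "dc1",
                "Racks": [
                    {"Id": "rack1", "DataNodes": [{"Url": "volume1:8080"}]},
                    {"Id": "rack2", "DataNodes": [{"Url": "volume2:8080"}]},
                    {"Id": "rack3", "DataNodes": []},
                ],
            }
        ]
    }
}
for (dc_id, rack_id), count in nodes_per_rack(sample_topology).items():
    print(f"{dc_id}/{rack_id}: {count} data node(s)")
```

Running it on the result of `get_system_topology()` before a failure and again after recovery makes it easy to confirm that the counts match.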

## 8. Summary and Conclusions

Based on our experiments, we can draw the following conclusions about SeaweedFS resilience:

In [None]:
# Final file verification after all tests
final_verification = verify_files(uploaded_files)
# display() is needed because this is not the last expression in the cell
display(format_table(final_verification, "Final File Verification After All Tests"))

# Summary statistics
total_files = len(final_verification)
accessible_files = sum(1 for file in final_verification if file['accessible'])
content_match = sum(1 for file in final_verification if file['content_match'])

if total_files == 0:
    # Avoid dividing by zero when no files were uploaded or verified
    print("\n⚠️ WARNING: No files were verified, so no conclusions can be drawn.")
else:
    print("\nTest Summary:")
    print(f"- Total files: {total_files}")
    print(f"- Files still accessible: {accessible_files} ({accessible_files/total_files*100:.1f}%)")
    print(f"- Files with intact content: {content_match} ({content_match/total_files*100:.1f}%)")

    if accessible_files == total_files and content_match == total_files:
        print("\n✅ SUCCESS: All files remained accessible and intact through the failure scenarios.")
        print("This demonstrates SeaweedFS's excellent resilience against server failures.")
    else:
        # Inaccessible files are also recorded as content mismatches, so this covers both cases
        failed_files = total_files - content_match
        print(f"\n⚠️ WARNING: {failed_files} files were not accessible or had content issues.")
        print("This may indicate configuration issues with replication or healing processes.")

### SeaweedFS Resilience Features Demonstrated

1. **Leader Election**: When a master leader fails, another master automatically takes over leadership

2. **Data Redundancy**: With proper replication settings, data remains accessible even when a volume server fails

3. **Service Continuity**: The system continues to operate despite server failures

4. **Self-Healing**: When servers are restored, they rejoin the cluster and synchronize data

5. **Load Balancing**: NGINX provides continuous access to filer services even when one filer is down

### Recommendations for Production Systems

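1. **Replication Strategy**: Use at least `010` (one additional copy on a different rack in the same data center) for critical data; `001` only places the additional copy on another server in the same rack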

2. **Physical Separation**: Deploy masters and volume servers on physically separate hardware

3. **Monitoring**: Implement continuous monitoring of all SeaweedFS components

4. **Backup Strategy**: Even with high availability, maintain regular backups

5. **Volume Distribution**: Ensure volumes are well-distributed across available storage nodes

6. **Network Redundancy**: Implement redundant network paths between components

7. **Regular Testing**: Periodically test failure scenarios to ensure resilience mechanisms work as expected
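
SeaweedFS replication settings are three-digit strings. The first digit is the number of extra copies in other data centers, the second the number on other racks in the same data center, and the third the number on other servers in the same rack. The sketch below decodes such a string to make the choice easier to reason about; `describe_replication` is a hypothetical helper, not part of SeaweedFS.

```python
def describe_replication(code: str) -> dict:
    """Decode a three-digit SeaweedFS replication string such as '010'."""
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"Replication must be three digits, got {code!r}")
    other_dc, other_rack, same_rack = (int(ch) for ch in code)
    return {
        "other_data_centers": other_dc,
        "other_racks_same_dc": other_rack,
        "other_servers_same_rack": same_rack,
        # The original copy plus every extra copy
        "total_copies": 1 + other_dc + other_rack + same_rack,
    }


for code in ("000", "001", "010"):
    info = describe_replication(code)
    print(f"{code}: {info['total_copies']} total copies, "
          f"{info['other_racks_same_dc']} on another rack")
```

In the setup used here, each volume server sits on its own logical rack, so a setting with a non-zero middle digit is the one that lets data survive the loss of a single volume server.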