# **Chapter 22: High-Scalability Challenges**

When your system grows from thousands to millions to billions of users, the challenges change fundamentally. It's no longer about making a single server faster—it's about architectural patterns that can absorb massive scale, handle unpredictable viral traffic, and maintain performance across the globe. This chapter covers the advanced techniques used by companies like Google, Meta, and Netflix to operate at planetary scale.

---

## **22.1 Handling Flash Traffic and Viral Growth**

Imagine waking up to discover your app is #1 on the App Store, or a celebrity just tweeted about your service. Normal traffic patterns go out the window. Your system must handle 100x or 1000x normal load within minutes—not hours.

### **The Thundering Herd Problem**

When a popular cache entry expires, thousands of requests simultaneously hit your database, potentially crashing it.

**The Scenario:**
```
Time 08:00:00 - Cache hit for "trending_post_123"
  ↓ 10,000 requests/second served from cache (fast)

Time 08:00:01 - Cache entry expires (TTL reached)
  ↓ All 10,000 requests miss cache simultaneously
  ↓ 10,000 database queries execute at once
  ↓ Database crashes under load
```

**Solution 1: Cache Warming with Staggered TTL**
Instead of letting entries expire all at once, add randomness to expiration times.

```python
import random
import redis  # assumes the redis-py client; any client with setex() works

def set_with_jitter(cache, key, value, base_ttl=300, jitter_fraction=0.1):
    """
    Add randomness to prevent simultaneous expiration.
    base_ttl = 300 seconds (5 minutes)
    jitter = +/- 10% random variation, scaled to base_ttl
    """
    max_jitter = int(base_ttl * jitter_fraction)  # 30 seconds for the defaults
    actual_ttl = base_ttl + random.randint(-max_jitter, max_jitter)
    cache.setex(key, actual_ttl, value)

# Without jitter: 1000 keys written together all expire at exactly 08:00:00
# With jitter: the same keys expire between 07:59:30 and 08:00:30
# Database load is spread over 60 seconds instead of 1 second
```
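To see the effect of jitter, the following sketch simulates the expiry times of keys that were all written at the same moment and reports the worst-case number expiring in any single second. The function name `peak_expirations_per_second` and its parameters are illustrative assumptions, not part of any cache library.

```python
import random
from collections import Counter

def peak_expirations_per_second(num_keys, base_ttl=300, jitter_fraction=0.0, seed=42):
    """Return the largest number of keys that expire within the same second."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    max_jitter = int(base_ttl * jitter_fraction)
    expiry_seconds = Counter()
    for _ in range(num_keys):
        # All keys are written at t=0, so the TTL is also the expiry second
        ttl = base_ttl + rng.randint(-max_jitter, max_jitter)
        expiry_seconds[ttl] += 1
    return max(expiry_seconds.values())

no_jitter = peak_expirations_per_second(1000, jitter_fraction=0.0)
with_jitter = peak_expirations_per_second(1000, jitter_fraction=0.1)
print(f"peak without jitter: {no_jitter}, with jitter: {with_jitter}")
```

Without jitter, every key lands in the same second. With 10% jitter, the peak drops to a small fraction of the key count because expirations are spread over a 61-second window.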

**Solution 2: Lease-Based Cache (Thundering Herd Protection)**

Only allow one request to regenerate the cache value; others wait or serve stale data.

```python
import threading
import time

class LeaseBasedCache:
    def __init__(self):
        self.cache = {}               # key -> (value, expiry timestamp)
        self.locks = {}               # key -> per-key lease lock
        self.lock = threading.Lock()  # guards creation of per-key locks

    def _fresh_value(self, key):
        entry = self.cache.get(key)
        if entry is not None and time.time() < entry[1]:
            return True, entry[0]
        return False, None

    def get(self, key, compute_func, ttl=300, retry_interval=0.1):
        while True:
            # Check cache first
            hit, value = self._fresh_value(key)
            if hit:
                return value  # Cache hit

            # Find or create the lease lock for this key
            with self.lock:
                lease = self.locks.setdefault(key, threading.Lock())

            # Only one thread can acquire the lease
            if lease.acquire(blocking=False):
                try:
                    # Re-check: another thread may have refreshed the value
                    # between our cache miss and acquiring the lease
                    hit, value = self._fresh_value(key)
                    if not hit:
                        value = compute_func()
                        self.cache[key] = (value, time.time() + ttl)
                    return value
                finally:
                    lease.release()

            # Another thread is regenerating.
            # Option A: wait briefly, then loop and re-check the cache
            # (a loop avoids the unbounded recursion of calling get() again)
            time.sleep(retry_interval)

            # Option B (instead of sleeping): return stale data if available
            # return self.cache.get(key, (None, 0))[0]
```

**Solution 3: Circuit Breakers with Graceful Degradation**

When overload is detected, fail fast and serve fallback content.

```python
from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing fast
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, fallback_func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN  # Let one trial call through
            else:
                return fallback_func(*args, **kwargs)  # Fail fast

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()

            # A failed trial call in HALF_OPEN reopens the circuit immediately
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

            return fallback_func(*args, **kwargs)

        # Any success resets the count, so the threshold means
        # consecutive failures rather than failures since startup
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        return result

# Usage example (db and cache are assumed to be existing application clients)
breaker = CircuitBreaker(failure_threshold=10, timeout=30)

def get_user_profile(user_id):
    # Database query that might fail under load
    return db.query("SELECT * FROM users WHERE id = ?", user_id)

def get_cached_profile(user_id):
    # Fallback: Return cached version or default
    return cache.get(f"user:{user_id}") or {"error": "Service temporarily unavailable"}

# Under normal load: Returns fresh data
# Under flash traffic: Returns cached data, protecting database
profile = breaker.call(get_user_profile, get_cached_profile, user_id=123)
```

### **Autoscaling Strategies for Viral Events**

**Reactive Scaling (Traditional)**
```
Traffic increases → CloudWatch alarm triggers → Launch new instances → 5 minutes later → New instances handle load
```

**Problem**: 5-minute lag is too slow for viral traffic. By the time new servers are ready, the database is already down.
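A back-of-the-envelope calculation shows why the lag matters. The sketch below estimates how many requests arrive above capacity while new instances are still booting. The function name and all traffic figures are assumed for illustration.

```python
def excess_requests_during_lag(normal_rps, spike_multiplier, capacity_rps, lag_seconds):
    """Requests that arrive above capacity while new instances are still booting."""
    spike_rps = normal_rps * spike_multiplier
    overflow_rps = max(0, spike_rps - capacity_rps)
    return overflow_rps * lag_seconds

# Assumed numbers: 1,000 req/s normally, a 100x spike, capacity for 2,000 req/s,
# and a 5-minute (300 s) provisioning lag.
dropped = excess_requests_during_lag(1_000, 100, 2_000, 300)
print(f"{dropped:,} requests exceed capacity before scaling completes")
```

Under these assumptions, tens of millions of requests must be queued, shed, or served from fallbacks before the first new instance is ready.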

**Predictive Scaling**
Use forecasting (anything from simple heuristics to machine-learning models) to predict traffic spikes based on:
- Time of day patterns
- Social media trends (Twitter API monitoring)
- Marketing campaign schedules
- Historical viral events

```python
# Pseudo-code for predictive scaling
# get_historical_average, get_twitter_mentions_count, scale_up, and
# current_capacity stand in for your own metrics and provisioning code
from datetime import datetime

MENTION_THRESHOLD = 1000  # Assumed baseline of mentions per hour

def predict_traffic(current_time):
    base_load = get_historical_average(current_time)

    # Check for viral indicators
    twitter_mentions = get_twitter_mentions_count("our_app")
    if twitter_mentions > MENTION_THRESHOLD:
        viral_multiplier = min(twitter_mentions / MENTION_THRESHOLD, 10)  # Cap at 10x
        return base_load * viral_multiplier

    return base_load

# Pre-scale before traffic hits
predicted_load = predict_traffic(datetime.now())
if predicted_load > current_capacity * 0.8:
    scale_up(predicted_load * 1.5)  # 50% buffer
```

**Scheduled Scaling**
For known events (product launches, Black Friday):
```yaml
# Schematic AWS Auto Scaling scheduled action (the exact template layout depends on your tooling)
ScheduledActions:
  - ScheduledActionName: BlackFridayPrep
    StartTime: 2024-11-28T00:00:00Z
    EndTime: 2024-11-30T23:59:59Z
    MinSize: 100  # Normal: 10
    MaxSize: 1000
    DesiredCapacity: 500
```

---

## **22.2 Geographical Distribution**

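When users are global, latency is physics. Light in a vacuum needs about 19ms to cover the roughly 5,600 km from New York to London, and about 28ms one way through optical fiber. Real network round trips on that route typically take around 70-90ms. If your data is in Virginia but your user is in Tokyo, every request pays a round-trip penalty of roughly 150-200ms.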
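Propagation figures like these can be checked from first principles: compute the great-circle distance between two cities and divide by the speed of light in the medium. The sketch below does this with the haversine formula. The city coordinates, the spherical-Earth radius, and the fiber refractive index of 1.47 are assumed values for illustration.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for optical fiber

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def one_way_ms(distance_km, refractive_index=1.0):
    """One-way propagation delay in milliseconds for the given medium."""
    return distance_km / (SPEED_OF_LIGHT_KM_S / refractive_index) * 1000

# Approximate coordinates for New York and London
nyc_london_km = great_circle_km(40.7128, -74.0060, 51.5074, -0.1278)
vacuum_ms = one_way_ms(nyc_london_km)
fiber_ms = one_way_ms(nyc_london_km, FIBER_REFRACTIVE_INDEX)
print(f"{nyc_london_km:.0f} km, vacuum {vacuum_ms:.1f} ms, fiber {fiber_ms:.1f} ms one way")
```

This gives only a lower bound. Real cables do not follow great circles, and routers, queuing, and protocol handshakes add further delay.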

### **Content Delivery Networks (CDNs)**

**How CDNs Work**
```
User in Tokyo requests image.jpg
    ↓
Local DNS resolves to nearest CDN edge (Tokyo)
    ↓
Tokyo edge checks cache:
    ├─ Cache Hit: Serve immediately (5-10ms)
    └─ Cache Miss: Fetch from Origin (Virginia), cache locally, serve (200ms + 5ms)
```
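The hit and miss paths above combine into an average latency that depends heavily on the cache hit ratio. The sketch below uses the 5ms edge and 200ms origin figures from the diagram; the function name is an assumption.

```python
def expected_latency_ms(hit_ratio, edge_ms=5.0, origin_ms=200.0):
    """Average latency when misses pay the origin round trip plus the edge hop."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * edge_ms + miss_ratio * (origin_ms + edge_ms)

for ratio in (0.50, 0.90, 0.99):
    print(f"hit ratio {ratio:.0%}: {expected_latency_ms(ratio):.1f} ms average")
```

Moving from a 90% to a 99% hit ratio cuts the average from 25ms to 7ms in this model, which is why CDN tuning usually focuses on raising the hit ratio rather than speeding up the origin.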

**CDN Caching Strategies**

**1. Static Asset Caching (Images, CSS, JS)**
```http
# Response headers for long-term caching
Cache-Control: public, max-age=31536000, immutable
ETag: "33a64df5"
# immutable = file never changes (versioned filenames: app.v2.js)
```

**2. Dynamic Content Caching (API Responses)**
```http
# Short-term caching for semi-dynamic content
Cache-Control: public, max-age=60, stale-while-revalidate=300
# Serve from cache for 60 seconds
# For up to 300 seconds after that, serve the stale copy while fetching a fresh one in the background
```
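The two directives define three states for a cached response. The following sketch classifies a response by its age, assuming the `max-age=60, stale-while-revalidate=300` values from the header above; the function name and return labels are illustrative.

```python
def cache_decision(age_seconds, max_age=60, stale_while_revalidate=300):
    """Classify a cached response by its age in seconds."""
    if age_seconds < max_age:
        return "fresh"             # serve from cache, no origin contact
    if age_seconds < max_age + stale_while_revalidate:
        return "stale-revalidate"  # serve the stale copy, refresh in background
    return "expired"               # must fetch from origin before serving

for age in (30, 120, 400):
    print(age, cache_decision(age))
```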

**3. Cache Invalidation Strategies**

**Purge API** (Immediate but expensive):
```bash
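# Invalidate specific URL across all edge locations
# (api.cdn.com is a placeholder; each CDN provider has its own purge API)
curl -X POST "https://api.cdn.com/purge" \
  -d '{"url": "https://example.com/prices"}'
# Propagation time varies by provider, from seconds to several minutes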
```

**Versioned URLs** (Preferred for static assets):
```html
<!-- Instead of -->
<script src="/app.js"></script>

<!-- Use -->
<script src="/app.v2.js"></script>
<!-- Change HTML to v3 when deploying new version -->
<!-- Old version stays cached forever, new version is fresh -->
```
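A common variation on manual version numbers is to derive the version from the file's content, so the URL changes only when the file does. The sketch below is one way a build step might do this; the function name and the 8-character digest length are assumptions.

```python
import hashlib
from pathlib import PurePosixPath

def fingerprinted_name(filename, content, digest_chars=8):
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))

v1 = fingerprinted_name("static/app.js", b"console.log('v1');")
v2 = fingerprinted_name("static/app.js", b"console.log('v2');")
print(v1)
print(v2)
```

Identical content always maps to the same name, so unchanged assets stay cached across deployments while changed assets get a new URL automatically.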

**4. Edge Computing: Logic at the Edge**

Modern CDNs (Cloudflare Workers, AWS Lambda@Edge) allow you to run code at edge locations, not just serve static files.

```javascript
// Cloudflare Worker: A/B testing at the edge
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)

  // Check cookie for existing assignment
  let group = request.headers.get('Cookie')?.match(/ab_group=(\w)/)?.[1]
  const isNewVisitor = !group

  if (isNewVisitor) {
    // Assign to group A or B (50/50 split)
    group = Math.random() < 0.5 ? 'A' : 'B'
  }

  // Route group B to the alternate origin on every request,
  // not only on the visit where the group is first assigned
  if (group === 'B') {
    url.hostname = 'origin-b.example.com'
  }

  // Fetch from appropriate origin
  const response = await fetch(url.toString(), request)
  if (!isNewVisitor) {
    return response
  }

  // Add cookie to response so the assignment sticks
  const newResponse = new Response(response.body, response)
  newResponse.headers.append('Set-Cookie', `ab_group=${group}; Path=/`)
  return newResponse
}
```

**Benefits**:
- A/B testing without latency penalty (decision made at edge)
- Authentication at edge (block bad requests before they hit origin)
- Geolocation-based routing (serve different content to EU vs. US users for GDPR compliance)

### **Global Databases: Spanner and CockroachDB**

**The Problem with Traditional Replication**
```
Master in Virginia, Replica in Tokyo
├─ Read from Tokyo: 5ms (fast, but might be stale)
└─ Write to Tokyo: Must go to Virginia (200ms round-trip)
```

**Google Spanner: True Global Consistency**

Spanner uses **TrueTime API** (atomic clocks + GPS) to provide globally consistent reads without locking.

**How TrueTime Works**
```python
# Traditional database timestamp: 2024-01-15 10:30:00.000
# Problem: Clocks on different servers drift apart (often by milliseconds or more, even with NTP)

# Spanner TrueTime returns an interval: [earliest, latest]
# Example: [10:30:00.100, 10:30:00.107]
# Uncertainty interval: 7ms here (Google reports typical values of a few milliseconds)

# Spanner waits out the uncertainty interval before committing ("commit wait")
# Guarantees: If transaction A commits before B starts, A's timestamp < B's
```
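The commit-wait rule can be illustrated with a toy model. The sketch below is not Spanner's implementation: `SimulatedTrueTime` is a hypothetical stand-in that reports the local monotonic clock plus or minus a fixed, assumed uncertainty, and `commit_with_wait` shows only the timestamp-then-wait step.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float

class SimulatedTrueTime:
    """Toy stand-in for TrueTime: local clock +/- a fixed uncertainty (epsilon)."""
    def __init__(self, epsilon_seconds=0.004):
        self.epsilon = epsilon_seconds

    def now(self):
        t = time.monotonic()
        return TTInterval(t - self.epsilon, t + self.epsilon)

def commit_with_wait(tt):
    """Pick a commit timestamp, then wait until it is guaranteed to be in the past."""
    commit_ts = tt.now().latest
    while tt.now().earliest <= commit_ts:  # commit wait lasts roughly 2 * epsilon
        time.sleep(tt.epsilon / 4)
    return commit_ts

tt = SimulatedTrueTime()
first = commit_with_wait(tt)
second = commit_with_wait(tt)  # starts only after the first commit has returned
print(f"second - first = {(second - first) * 1000:.1f} ms")
```

Because the first commit does not return until its timestamp is definitely in the past, any transaction that starts afterward is assigned a strictly larger timestamp. The cost is that every write waits out the clock uncertainty, which is why keeping that uncertainty small matters.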

**Spanner Architecture**
```
Global Layer (Location Tracking):
  ├─ US-Central: Leader for Users 1-1000000
  ├─ Europe-West: Leader for Users 1000001-2000000
  └─ Asia-East: Leader for Users 2000001-3000000

Local Layer (Within each region):
  ├─ Paxos group (3-5 replicas for consensus)
  └─ Data split into chunks (splits), moved for load balancing
```

**Code Example** (CockroachDB, open-source Spanner alternative):
```sql
-- CockroachDB automatically distributes data geographically
-- Table partitioned by region for data locality

CREATE TABLE orders (
    id UUID DEFAULT gen_random_uuid(),
    region STRING,
    amount DECIMAL,
    PRIMARY KEY (region, id)
) PARTITION BY LIST (region) (
    PARTITION us_west VALUES IN ('us-west'),
    PARTITION eu_west VALUES IN ('eu-west'),
    PARTITION asia_east VALUES IN ('asia-east')
);

-- Pin partitions to specific regions for compliance/latency
ALTER PARTITION us_west CONFIGURE ZONE USING 
    constraints = '[+region=us-west1]';

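-- With a partition's replicas pinned to one region, its reads and writes stay in that region
-- Writes commit once a Raft quorum of replicas acknowledges them (synchronous consensus, not asynchronous replication)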
```

**Trade-offs**:
- **Spanner**: Strong consistency globally, but writes are slower (consensus required)
- **DynamoDB Global Tables**: Eventually consistent, but writes are fast (local)
- **Choice depends on**: Can your app tolerate temporary inconsistency for better performance?

---

## **22.3 Federated and Cell-Based Architecture**

When a single monolithic database can't scale further (even with sharding), you need **federation**—breaking the system into independent, self-contained units.

### **Cell-Based Architecture**

**Concept**: Instead of one giant system, create many small copies (cells), each handling a subset of users.

```
Traditional Monolith:
┌─────────────────────────────┐
│  Load Balancer              │
│     ↓                       │
│  App Servers (1000s)        │
│     ↓                       │
│  Database Cluster (Petabytes) │
└─────────────────────────────┘
     Single point of failure

Cell-Based:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Cell A  │ │ Cell B  │ │ Cell C  │
│ Users   │ │ Users   │ │ Users   │
│ 0-1M    │ │ 1M-2M   │ │ 2M-3M   │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │
│ │App  │ │ │ │App  │ │ │ │App  │ │
│ │DB   │ │ │ │DB   │ │ │ │DB   │ │
│ └─────┘ │ │ └─────┘ │ │ └─────┘ │
└─────────┘ └─────────┘ └─────────┘
     ↓           ↓           ↓
  Isolated    Isolated    Isolated
  Failures    Failures    Failures
```

**Benefits**:
1. **Fault Isolation**: If Cell A fails, only 1M users affected, not 100M
2. **Incremental Deployment**: Deploy to Cell A first, monitor, then roll out to others
3. **Geographic Distribution**: Cell A in US, Cell B in EU (data sovereignty)
4. **Scalability**: Add new cells indefinitely; no single database grows too large

**Implementation: User Assignment**
```python
class CellRouter:
    def __init__(self):
        self.cells = {
            'cell-us-1': {'range': (0, 1000000), 'endpoint': 'us1.example.com'},
            'cell-us-2': {'range': (1000000, 2000000), 'endpoint': 'us2.example.com'},
            'cell-eu-1': {'range': (2000000, 3000000), 'endpoint': 'eu1.example.com'},
        }
    
    def get_cell_for_user(self, user_id):
        for cell_id, config in self.cells.items():
            if config['range'][0] <= user_id < config['range'][1]:
                return config['endpoint']
        raise ValueError(f"No cell found for user {user_id}")
    
    def move_user(self, user_id, target_cell):
        # Migration logic for rebalancing cells
        pass

# Usage
router = CellRouter()
endpoint = router.get_cell_for_user(user_id=123456)
# Route request to us1.example.com
```

**Cross-Cell Communication**
When User A (Cell 1) needs to message User B (Cell 2):

```python
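import time

class CrossCellMessenger:
    def __init__(self, router, message_queue, local_db):
        # All three are assumed application components: a CellRouter,
        # a queue client with publish(), and the local cell's database
        self.router = router
        self.message_queue = message_queue
        self.local_db = local_db

    def send_message(self, from_user, to_user, message):
        # get_cell_for_user returns the endpoint of the recipient's cell
        target_cell = self.router.get_cell_for_user(to_user)

        # Store the "sent" copy in the local cell first, so the message
        # survives if the publish below fails and has to be retried
        self.local_db.store_outbox(from_user, to_user, message)

        # Async message to avoid blocking on the remote cell
        self.message_queue.publish(
            topic=f"cell-{target_cell}-messages",
            data={
                'from': from_user,
                'to': to_user,
                'content': message,
                'timestamp': time.time()
            }
        )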
```

### **Federated Architecture**

**Difference from Cells**: Cells are identical copies of the full stack, each serving a different subset of users. In a federated architecture, each service owns a different type of data and is operated autonomously by its own team.

**Example: Federated Social Media**
```
User Service (Team A): Owns profiles, authentication
Post Service (Team B): Owns content, feeds
Media Service (Team C): Owns images, videos

Each service has its own:
- Database
- Deployment pipeline
- Scaling policies
- Team ownership

Communication via:
- Async events (Kafka) for loose coupling
- gRPC for synchronous queries (with circuit breakers)
```

**Federation Gateway**
```python
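# GraphQL Federation: Single entry point, distributed backends
# analyze_query, call_service, and stitch_results are assumed to be implemented elsewhere
from concurrent.futures import ThreadPoolExecutor, as_completed

class FederationGateway: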
    def resolve_query(self, query):
        # Parse query to find required services
        services_needed = self.analyze_query(query)
        
        # Parallel fetch from multiple services
        with ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(self.call_service, svc): svc 
                for svc in services_needed
            }
            
            results = {}
            for future in as_completed(futures):
                service = futures[future]
                results[service] = future.result()
        
        # Stitch results together
        return self.stitch_results(query, results)
```

---

## **22.4 Petabyte-Scale Data Processing**

When your data no longer fits on one disk (or even one rack), you need distributed data processing frameworks.

### **The MapReduce Paradigm**

**Concept**: Split big task into small chunks (Map), process in parallel, combine results (Reduce).

**Word Count Example** (The "Hello World" of Big Data):
```
Input: 10TB of text files across 1000 servers

Map Phase (Parallel on each server):
  Server 1: "the quick brown fox" → (the,1), (quick,1), (brown,1), (fox,1)
  Server 2: "the lazy dog" → (the,1), (lazy,1), (dog,1)
  Server 3: "the quick dog" → (the,1), (quick,1), (dog,1)

Shuffle Phase (Sort by key):
  (the, [1,1,1])
  (quick, [1,1])
  (brown, [1])
  ...

Reduce Phase (Aggregate):
  (the, 3)
  (quick, 2)
  (brown, 1)
  ...
```
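The three phases can be sketched on a single machine in plain Python, using the same three input lines as above. This is a minimal illustration of the data flow, not a distributed implementation; the function names are assumptions.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(mapped_outputs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

splits = ["the quick brown fox", "the lazy dog", "the quick dog"]
word_counts = reduce_phase(shuffle_phase(map_phase(doc) for doc in splits))
print(word_counts)
```

In a real cluster, each `map_phase` call runs on the node holding that split, and the shuffle moves data over the network so that all values for one key reach the same reducer.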

**Apache Spark** (Modern MapReduce):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize cluster connection
spark = SparkSession.builder \
    .appName("LargeScaleAnalytics") \
    .config("spark.executor.memory", "64g") \
    .config("spark.executor.cores", "16") \
    .getOrCreate()

# Load petabyte-scale dataset from S3/HDFS
df = spark.read.parquet("s3://data-lake/events/")

# Transformations (lazy evaluation - nothing executed yet)
processed = df.filter(F.col("timestamp") > "2024-01-01") \
              .groupBy("user_id") \
              .agg(F.sum("amount").alias("total_amount")) \
              .filter(F.col("total_amount") > 10000)

# Action (triggers distributed computation)
# collect() pulls every result row to the driver, so use it only when the
# filtered result is small; otherwise write the output back to storage
high_value_users = processed.collect()

# Spark automatically:
# 1. Splits data into partitions across cluster
# 2. Schedules tasks to nodes with local data (data locality)
# 3. Handles node failures by recomputing lost partitions
# 4. Optimizes query plan (predicate pushdown, etc.)
```

**Data Locality Optimization**
```
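Without locality: Data on remote storage (e.g. S3) → Transfer over network → Process → Upload result
With locality: Data already on Node's local disk (e.g. HDFS) → Process immediately
Speed difference: depends on network bandwidth and data volume; can reach an order of magnitude or more for I/O-bound jobs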
```

### **Stream Processing at Scale**

**Lambda Architecture** (Batch + Speed layers):
```
             ┌────────────────┐
Raw Data ──→ │ Batch Layer    │ → Precompute views (hourly, daily)
    │        │ (Hadoop/Spark) │   Accurate but slow
    │        └────────────────┘
    │                ↓
    │        Serving Layer (Merge batch + real-time views)
    │                ↑
    │        ┌────────────────┐
    └──────→ │ Speed Layer    │ → Real-time approximations
             │ (Storm/Flink)  │   Fast but approximate
             └────────────────┘
```
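The serving layer's merge step can be as simple as adding the speed layer's recent counts to the batch layer's precomputed totals. The sketch below assumes both views are plain dictionaries of counts; the function name and the sample data are illustrative.

```python
def merge_views(batch_view, speed_view):
    """Serving-layer read: batch totals plus counts seen since the last batch run."""
    merged = dict(batch_view)
    for key, recent_count in speed_view.items():
        merged[key] = merged.get(key, 0) + recent_count
    return merged

# Assumed example data: page-view counts per page.
batch_view = {"/home": 10_000, "/pricing": 2_500}  # recomputed by the last batch job
speed_view = {"/home": 42, "/signup": 7}           # events since that batch run
merged = merge_views(batch_view, speed_view)
print(merged)
```

When the next batch run completes, its view already includes those recent events, so the speed layer's state for that period can be discarded.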

**Kappa Architecture** (Streaming only, simpler):
```
Raw Data → Kafka → Stream Processor (Flink) → Serving Database
                    ↓
              Real-time aggregations
              Exactly-once processing
```

**Apache Flink Example** (Real-time analytics):
```java
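// Windowed per-user counts with checkpointed state (Flink 1.x DataStream API sketch).
// Checkpointing gives exactly-once state updates; end-to-end exactly-once also
// requires an idempotent or transactional sink.
// Event, parseJson, CountAggregate, and redisSink are application-defined placeholders.
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(60000); // Snapshot state every 60s for fault tolerance

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("kafka:9092") // assumed broker address
    .setTopics("events-topic")
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

DataStream<Long> counts = env
    .fromSource(source, WatermarkStrategy.noWatermarks(), "events")
    .map(json -> parseJson(json))
    // Event-time windows need timestamps and watermarks
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((event, ts) -> event.timestamp))
    .keyBy(event -> event.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregate()); // AggregateFunction<Event, Long, Long>

// Results written to Redis for real-time dashboards
counts.addSink(redisSink);

env.execute();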
```
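The core of a tumbling window is simple: each event belongs to exactly one fixed-size, non-overlapping time bucket. The sketch below shows that bucketing on a single machine. It is a plain-Python illustration of the windowing logic only, with assumed names and sample events, and it ignores late data, watermarks, and fault tolerance.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=300):
    """Count events per (user_id, window_start); events are (user_id, epoch_seconds)."""
    counts = Counter()
    for user_id, timestamp in events:
        # Round the timestamp down to the start of its window
        window_start = timestamp - (timestamp % window_seconds)
        counts[(user_id, window_start)] += 1
    return dict(counts)

events = [("alice", 10), ("alice", 299), ("alice", 300), ("bob", 120), ("bob", 610)]
window_counts = tumbling_window_counts(events)
print(window_counts)
```

The event at second 299 falls in the first five-minute window and the event at second 300 starts the next one, which is the boundary behavior a tumbling window defines.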

---

## **22.5 Real-Time Systems and Low-Latency Optimization**

When milliseconds matter (high-frequency trading, gaming, ad bidding), architecture changes fundamentally.

### **The Latency Hierarchy**

```
Latency      System Type                    Techniques
─────────────────────────────────────────────────────────
< 10µs       In-memory, single machine      Lock-free queues, CPU pinning
10-100µs     Local network                  Kernel bypass (DPDK), RDMA
0.1-10ms     Same datacenter                Async I/O, connection pooling
10ms+        Cross-region                   Caching, CDNs, eventual consistency
```

### **Kernel Bypass: DPDK**

**Problem**: Traditional networking:
```
Network Card → Kernel TCP stack → User Application
     ↓              ↓                    ↓
  10µs          100µs                Your code
```

**DPDK (Data Plane Development Kit)**:
```
Network Card → DPDK (User space) → Application
     ↓              ↓                    ↓
  10µs           1µs                  Your code
```

**How it works**: Application takes direct control of network card, bypassing kernel entirely.

```c
// DPDK allows processing packets in userspace
// Used in software routers, telecom packet processing, and latency-sensitive trading systems

// Traditional: recv() system call (context switch to kernel)
// DPDK: Poll network card directly from userspace (no context switch)

// Trade-off:
// - Pros: much lower per-packet latency and higher throughput (gains are workload-dependent)
// - Cons: Application needs a userspace TCP stack, polling keeps CPU cores busy, harder to debug
```

### **Lock-Free Data Structures**

**Problem**: Locks cause contention and cache coherence traffic between CPU cores.

**Solution**: Atomic operations and memory ordering.

```java
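// Java: Lock-free queue (simplified Disruptor-style ring buffer)
// LMAX reported about 6 million orders/second with this pattern
// This sketch is safe for ONE producer thread and ONE consumer thread only
import java.util.concurrent.atomic.AtomicLong;

class RingBuffer<T> {
    private final Object[] entries;
    private final AtomicLong writeSequence = new AtomicLong(-1); // last published slot
    private final AtomicLong readSequence = new AtomicLong(-1);  // last consumed slot

    RingBuffer(int capacity) {
        entries = new Object[capacity];
    }

    // Returns false when the buffer is full
    public boolean publish(T event) {
        long next = writeSequence.get() + 1;
        if (next - readSequence.get() > entries.length) {
            return false; // would overwrite a slot the consumer has not read
        }
        entries[(int) (next % entries.length)] = event;
        // Ordered store: the entry write becomes visible before the new sequence
        writeSequence.lazySet(next);
        return true;
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        long next = readSequence.get() + 1;
        if (next > writeSequence.get()) {
            return null; // nothing published yet
        }
        T result = (T) entries[(int) (next % entries.length)];
        readSequence.lazySet(next);
        return result;
    }
}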
```

### **Pre-computation and Materialized Views**

For read-heavy workloads, compute answers before questions are asked.

```sql
-- PostgreSQL syntax
-- Traditional: Compute on read (slow for complex aggregations)
SELECT user_id, COUNT(*), SUM(amount)
FROM transactions
WHERE date > NOW() - INTERVAL '30 days'
GROUP BY user_id;

-- Optimized: Materialized view (pre-computed)
CREATE MATERIALIZED VIEW user_monthly_stats AS
SELECT user_id, COUNT(*) AS txn_count, SUM(amount) AS total_amount
FROM transactions
WHERE date > NOW() - INTERVAL '30 days'
GROUP BY user_id;

-- Index the view so per-user lookups avoid a full scan
CREATE UNIQUE INDEX ON user_monthly_stats (user_id);

-- Run this from a scheduler (e.g. cron) every 5 minutes;
-- the view is only as fresh as its last refresh
REFRESH MATERIALIZED VIEW user_monthly_stats;

-- Read is now a single indexed lookup instead of an aggregation over all rows
SELECT * FROM user_monthly_stats WHERE user_id = 123;
```
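Pre-computation can also happen on the write path instead of through periodic refreshes. The sketch below keeps per-user aggregates current as each transaction is recorded, so a read is a dictionary lookup. It is an in-memory illustration with assumed names, and unlike the view above it does not expire transactions older than 30 days.

```python
from collections import defaultdict

class UserStats:
    """Keeps per-user aggregates up to date on write so reads are a lookup."""
    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def record_transaction(self, user_id, amount):
        # Pay a small cost on every write to avoid aggregating on every read
        self._count[user_id] += 1
        self._total[user_id] += amount

    def get(self, user_id):
        return {
            "txn_count": self._count.get(user_id, 0),
            "total_amount": self._total.get(user_id, 0.0),
        }

stats = UserStats()
stats.record_transaction(123, 10.0)
stats.record_transaction(123, 5.5)
print(stats.get(123))
```

The trade-off is the reverse of a scheduled refresh: reads are never stale, but every write does extra work and the aggregate logic must handle corrections and deletions.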

---

## **22.6 Key Takeaways**

1. **Flash traffic requires defense in depth**: Cache warming, circuit breakers, and autoscaling work together to prevent cascades.

2. **Geography is physics**: You can't beat the speed of light. Use CDNs, edge computing, and global databases to put data close to users.

3. **Cells provide blast radius containment**: When scaling beyond millions of users, cellular architecture limits the impact of failures.

4. **Batch for throughput, stream for latency**: Use Spark/Hadoop for petabyte-scale batch processing, Flink/Kafka for real-time streams.

5. **Last-mile optimization**: When every microsecond counts, kernel bypass (DPDK) and lock-free structures can cut latency by an order of magnitude.

6. **Pre-computation trades space for time**: Materialized views and edge caching make reads instant at the cost of staleness.

---

## **Chapter Summary**

This chapter explored the challenges of planetary-scale systems. We learned to handle viral traffic through cache warming and circuit breakers, distribute data globally using CDNs and databases like Spanner, and architect cellular systems for fault isolation. We covered processing petabytes of data with MapReduce and Spark, and achieving microsecond latencies through kernel bypass and lock-free programming.

The key insight: At massive scale, architecture matters more than code optimization. A well-designed cellular system that tolerates eventual consistency will usually scale further and fail more gracefully than a tightly coupled, strongly consistent monolith.

**Coming up next**: In Chapter 23, we'll explore AI/ML System Design—how to serve machine learning models at scale, implement feature stores, and build RAG (Retrieval-Augmented Generation) architectures for LLMs.

---

## **Exercises**

1. **Thundering Herd Calculation**: Your cache expires 1000 keys simultaneously. Each cache miss triggers a database query taking 50ms. Your database can handle 100 concurrent connections. How long does it take to serve all 1000 requests? How would staggered TTL with 30-second jitter improve this?

2. **Cell-Based Routing**: Design a routing function that assigns users to cells based on user ID hash, ensuring even distribution across 8 cells. Write the code to determine which cell handles user 12345, and how to migrate that user to cell 3 with zero downtime.

3. **Global Latency Math**: A user in Sydney (Australia) requests data from a database in Dublin (Ireland). Calculate:
   - Minimum theoretical latency (speed of light, fiber optic refractive index 1.44)
   - Realistic latency with network hops (add 30% overhead)
   - How much faster would the request be if served from a Singapore edge node (distance: Sydney to Singapore)?

4. **Spark Optimization**: You have 1TB of data in S3 and a Spark cluster with 10 nodes (each 32 cores, 128GB RAM). Your job is taking 2 hours. Identify three optimizations to reduce this to 10 minutes (hint: consider data locality, partition size, and serialization).

5. **Circuit Breaker Design**: Implement a circuit breaker that transitions from CLOSED → OPEN after 5 failures, stays open for 60 seconds, then enters HALF_OPEN state where it allows 1 test request before deciding to close or reopen.

---
