# Exercise 5: Break/Fix Challenge 🔧

## Learning Objectives
- Diagnose cluster failures using monitoring tools
- Understand HDFS fault tolerance in action
- Observe Spark resilience when executors fail
- Practice real-world troubleshooting skills

---

## ⚠️ Important Note
This exercise involves intentionally breaking components. Make sure you understand how to restart services before proceeding!

## Challenge 1: DataNode Failure

### Scenario
A DataNode in your cluster has failed. Your task is to:
1. Observe the failure in the HDFS UI
2. Verify data is still accessible
3. Watch the replication recovery process

In [None]:
# First, check current cluster status
!hdfs dfsadmin -report | head -30

### Step 1: Record the current state
Open http://localhost:9870 and note:
- How many live DataNodes?
- What's the block replication status?
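If you would rather record these numbers in code than read them off the UI, the sketch below pulls a few counters out of the report text. It assumes the report contains lines shaped like `Live datanodes (N):`, `Dead datanodes (N):`, and `Under replicated blocks: N`; the wording can differ between Hadoop versions, and the helper name `summarize_report` and the sample text are made up for illustration.

```python
import re

def summarize_report(report_text):
    """Pull a few headline counters out of `hdfs dfsadmin -report` text.

    Returns None for any counter whose line is not found, so a wording
    change between Hadoop versions shows up as None rather than a crash.
    """
    patterns = {
        "live_datanodes": r"Live datanodes \((\d+)\)",
        "dead_datanodes": r"Dead datanodes \((\d+)\)",
        "under_replicated_blocks": r"Under replicated blocks:\s*(\d+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, report_text)
        summary[key] = int(match.group(1)) if match else None
    return summary

# Hand-written sample for illustration only, not captured from a real cluster.
sample = """Under replicated blocks: 12
Live datanodes (2):
Dead datanodes (1):
"""
print(summarize_report(sample))
```

In a notebook, one way to feed it real output is `report = !hdfs dfsadmin -report` followed by `summarize_report("\n".join(report))`.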

In [None]:
# Check block locations for our test file
!hdfs fsck /user/student/data/transactions.csv -files -blocks -locations 2>/dev/null | tail -20

### Step 2: Simulate DataNode Failure
Run this in a terminal (not in this notebook):
```bash
docker stop datanode1
```

In [None]:
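# Wait 30 seconds, then check status again.
# Note: with stock HDFS settings the NameNode only declares a DataNode dead
# after roughly 10.5 minutes without heartbeats. If the report still lists the
# node as live, wait longer and re-run the last line of this cell.
import time
print("Waiting for NameNode to detect failure...")
time.sleep(30)
!hdfs dfsadmin -report | head -20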
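How long detection takes depends on two HDFS settings. As a rough sketch, the NameNode treats a DataNode as dead once it has missed heartbeats for `2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval`. The defaults used below (300000 ms and 3 s) are the stock Hadoop values; your lab cluster may override them, so treat the result as an estimate. The function name is made up for illustration.

```python
def dead_node_timeout_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Estimate how long the NameNode waits before marking a DataNode dead.

    recheck_interval_ms  -> dfs.namenode.heartbeat.recheck-interval (milliseconds)
    heartbeat_interval_s -> dfs.heartbeat.interval (seconds)
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s

# Stock defaults: 2 * 300 s + 10 * 3 s
print(dead_node_timeout_seconds())  # 630.0

# A hypothetical lab override with a 5 s recheck and 1 s heartbeats
print(dead_node_timeout_seconds(recheck_interval_ms=5_000, heartbeat_interval_s=1))  # 20.0
```

If the report above still shows the stopped node as live, compare your wait time against this estimate before assuming something else is wrong.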

### 🔍 Checkpoint Question 5a
Look at the HDFS UI (http://localhost:9870):
- How many DataNodes are now shown as live?
- Are there any under-replicated blocks?
- Can you still read the transactions file?

In [None]:
# Try to read the file - does it work?
!hdfs dfs -head /user/student/data/transactions.csv

**Your Answer:**
- Live DataNodes: 
- Under-replicated blocks: 
- File readable: Yes/No
- Explanation: 

### Step 3: Recovery
```bash
docker start datanode1
```
Wait about a minute, then watch the HDFS UI as the DataNode rejoins and the under-replicated block count falls.

In [None]:
# Verify recovery
print("Waiting for DataNode to rejoin...")
time.sleep(60)
!hdfs dfsadmin -report | head -20
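A fixed `time.sleep(60)` either waits too long or not long enough. A polling loop is a common alternative; the sketch below is a generic helper, not part of any Hadoop tooling, and the name `wait_until` and its parameters are chosen for illustration. The `sleep` and `clock` arguments exist only so the helper can be exercised without real waiting.

```python
import time

def wait_until(predicate, timeout_s=120, interval_s=5,
               sleep=time.sleep, clock=time.monotonic):
    """Call predicate() repeatedly until it returns True or timeout_s elapses.

    Returns True if the condition was met, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Stand-in condition for illustration: becomes true on the third check.
checks = {"count": 0}

def node_is_back():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(node_is_back, timeout_s=10, interval_s=0))
```

For the real check, the predicate could run `hdfs dfsadmin -report` through `subprocess.run` and test whether the live DataNode count is back to its original value.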

---

## Challenge 2: Spark Job Resilience

### Scenario
An executor fails during a long-running Spark job. Observe how Spark handles the failure.

In [None]:
from pyspark.sql import SparkSession
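# Import only what this exercise uses; note that this `sum` shadows Python's built-in sum
from pyspark.sql.functions import avg, count, desc, sum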

spark = SparkSession.builder \
    .appName("Resilience Test") \
    .master("yarn") \
    .config("spark.executor.instances", "2") \
    .config("spark.task.maxFailures", "4") \
    .getOrCreate()

print(f"App ID: {spark.sparkContext.applicationId}")
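The `spark.task.maxFailures` setting above caps how many times a single task may fail before Spark gives up on the job. The toy model below illustrates that idea in plain Python. It is not Spark's scheduler, and `run_with_retries` and `flaky_task` are hypothetical names: a task is re-run after each failure until it succeeds or has failed `max_failures` times.

```python
def run_with_retries(task, max_failures=4):
    """Call task(attempt) until it succeeds or has failed max_failures times.

    Returns (result, number_of_failed_attempts). Re-raises the last error
    once the failure budget is used up, which stands in for a job abort.
    """
    attempt = 0
    while True:
        try:
            return task(attempt), attempt
        except RuntimeError:
            attempt += 1
            if attempt >= max_failures:
                raise

def flaky_task(attempt):
    # Hypothetical task: pretend the first two attempts ran on a lost executor.
    if attempt < 2:
        raise RuntimeError("executor lost")
    return "partition processed"

outcome, failed_attempts = run_with_retries(flaky_task)
print(outcome, "after", failed_attempts, "failed attempts")
```

With the budget of 4 used in this exercise, a task can fail three times and still succeed on its fourth attempt; a fourth failure ends the job.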

In [None]:
# Load data
df = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("hdfs:///user/student/data/transactions.csv")

df.cache()
df.count()  # Materialize

### Step 1: Start a Long-Running Job
Run the next cell, then quickly stop a NodeManager from a terminal while the job is still running:
```bash
docker stop nodemanager1
```

In [None]:
# Long-running aggregation (run this, then kill nodemanager1)
import time
start = time.time()

result = df.repartition(8) \
    .groupBy("store_region", "payment_method") \
    .agg(
        count("*").alias("count"),
        sum("total_amount").alias("total"),
        avg("quantity").alias("avg_qty")
    ) \
    .orderBy(desc("total"))

result.show(20)
print(f"Completed in {time.time() - start:.2f} seconds")

### 🔍 Checkpoint Question 5b
Look at the Spark UI (http://localhost:4040) or the YARN UI:
- Did the job complete successfully?
- Were any tasks retried?
- How did Spark recover from the lost executor?

**Your Answer:**

### Step 2: Recovery
```bash
docker start nodemanager1
```

In [None]:
spark.stop()

## Summary Questions

1. Why did HDFS continue to work with one DataNode down?
2. What would happen if the replication factor were 1 and that DataNode failed?
3. How does Spark's RDD lineage help with fault tolerance?
4. What's the difference between executor failure and driver failure?