# Module 02: Kafka Deep Dive

**Estimated Time:** 75 minutes

## Learning Objectives

By the end of this module, you will:
- Understand Kafka's internal log structure and storage
- Master replication and fault tolerance mechanisms
- Learn performance tuning techniques
- Configure producers and consumers for optimal performance
- Monitor Kafka clusters and troubleshoot issues
- Implement reliability patterns (idempotence, transactions)

---

## 1. Kafka Internal Architecture

### The Log: Kafka's Core Data Structure

Kafka stores events in an **append-only log**:

```
Partition Log (Append-Only)
┌────────────────────────────────────────────────────────┐
│ [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...           │
│  ↑                           ↑              ↑          │
│  Old                      Current       New (append)   │
└────────────────────────────────────────────────────────┘

Properties:
- Immutable: Events never change
- Ordered: Sequential offset numbers
- Durable: Written to disk
- Fast: Sequential I/O is efficient
```
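To make the offset idea concrete, here is a minimal in-memory sketch of one partition's log. The `PartitionLog` class and its method names are invented for illustration and are not part of any Kafka client; a real broker persists records to segment files instead of a Python list.

```python
class PartitionLog:
    """Toy in-memory model of one partition's append-only log (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, value: bytes) -> int:
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records (offset, value) pairs starting at offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        end = min(offset + max_records, len(self._records))
        return [(o, self._records[o]) for o in range(offset, end)]

    @property
    def log_end_offset(self) -> int:
        """Offset that the next appended record will receive."""
        return len(self._records)


log = PartitionLog()
offsets = [log.append(f"event-{i}".encode()) for i in range(5)]
print(offsets)              # [0, 1, 2, 3, 4]
print(log.read(3))          # [(3, b'event-3'), (4, b'event-4')]
print(log.log_end_offset)   # 5
```

Records are never updated in place: a consumer only needs to remember an offset to resume reading.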

### Log Segments

Partitions are divided into **segments** for efficient management:

```
Partition Directory: /var/lib/kafka/topic-0/

├── 00000000000000000000.log    (offsets 0-999)
├── 00000000000000000000.index  (index for fast lookup)
├── 00000000000000000000.timeindex
│
├── 00000000000000001000.log    (offsets 1000-1999)
├── 00000000000000001000.index
├── 00000000000000001000.timeindex
│
├── 00000000000000002000.log    (active segment)
├── 00000000000000002000.index
└── 00000000000000002000.timeindex

Benefits:
- Easy to delete old data (delete old segments)
- Fast seeking with indexes
- Parallel reads from different segments
```
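Segment file names are the segment's base offset, zero-padded to 20 digits, so locating the segment for a given offset is a search over sorted base offsets. The sketch below illustrates that lookup; `segment_file_for` is a hypothetical helper, not a Kafka API.

```python
from bisect import bisect_right


def segment_file_for(offset: int, base_offsets: list) -> str:
    """Return the .log file name of the segment that holds `offset`.

    base_offsets must be sorted ascending (one entry per segment).
    """
    idx = bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} is older than the first retained segment")
    return f"{base_offsets[idx]:020d}.log"


base_offsets = [0, 1000, 2000]  # matches the directory listing above
print(segment_file_for(1500, base_offsets))  # 00000000000000001000.log
print(segment_file_for(2000, base_offsets))  # 00000000000000002000.log
```

Within the chosen segment, the broker then uses the `.index` file to jump close to the record's byte position.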

### Storage Mechanics

**Write Path:**
```
Producer → Network → Socket Buffer → Page Cache → Disk
             ↓
         Batching
         Compression
         
- Writes to page cache (RAM)
- OS flushes to disk asynchronously
- No explicit fsync (configurable)
- Fast because: sequential I/O + page cache
```

**Read Path:**
```
Consumer → Request → Broker → Page Cache → Network
                                  ↓
                             Zero-copy transfer
                             
- Reads from page cache (fast!)
- sendfile() system call (zero-copy)
- No application-level copying
```

In [None]:
# Setup: Import libraries and connect to Kafka
from confluent_kafka import Producer, Consumer, KafkaException
from confluent_kafka.admin import AdminClient, NewTopic, ConfigResource
import json
import time
from datetime import datetime
from collections import defaultdict
import random

admin_client = AdminClient({"bootstrap.servers": "localhost:9092"})

print("[OK] Connected to Kafka cluster")

In [None]:
# Examine partition log configuration
TOPIC_NAME = "deep-dive-topic"

# Create topic with specific log settings
new_topic = NewTopic(
    topic=TOPIC_NAME,
    num_partitions=3,
    replication_factor=1,
    config={
        "segment.bytes": "10485760",  # 10 MB per segment
        "segment.ms": "3600000",  # 1 hour
        "retention.bytes": "104857600",  # 100 MB per partition (not per topic)
        "retention.ms": "604800000",  # 7 days
        "cleanup.policy": "delete",  # vs 'compact'
        "compression.type": "gzip",
        "min.insync.replicas": "1",
    },
)

try:
    futures = admin_client.create_topics([new_topic])
    for topic, future in futures.items():
        try:
            future.result()
            print(f"[OK] Created topic '{topic}' with custom log settings")
        except KafkaException as e:
            if "TOPIC_ALREADY_EXISTS" in str(e):
                print(f"[OK] Topic '{topic}' already exists")
            else:
                raise
except Exception as e:
    print(f"[FAIL] Error: {e}")

# Show configuration
print("\n[DATA] Log Configuration:")
print("  Segment size: 10 MB (creates new segment after 10 MB)")
print("  Segment time: 1 hour (creates new segment after 1 hour)")
print("  Retention: 100 MB per partition or 7 days (whichever limit is hit first)")
print("  Cleanup: Delete old segments (vs compaction)")
print("  Compression: gzip (saves disk and network)")

---

## 2. Replication and Fault Tolerance

### How Replication Works

**Replication Factor = 3:**
```
Topic: payments, Partition 0

┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│  Broker 1   │       │  Broker 2   │       │  Broker 3   │
│             │       │             │       │             │
│  LEADER     │──────→│  FOLLOWER   │──────→│  FOLLOWER   │
│  [0][1][2]  │       │  [0][1][2]  │       │  [0][1][2]  │
└─────────────┘       └─────────────┘       └─────────────┘
      ↑                     ↑                     ↑
   Producers            Replication           Replication
   Consumers

- Leader: Handles all reads and writes (by default)
- Followers: Replicate data by fetching from the leader
- ISR (In-Sync Replicas): Replicas (leader included) that are caught up
```

### Leader Election

**What happens when a leader fails?**
```
Before:                      After Leader Fails:
Broker 1: LEADER             Broker 1: [DOWN]
Broker 2: FOLLOWER (ISR)     Broker 2: NEW LEADER ←
Broker 3: FOLLOWER (ISR)     Broker 3: FOLLOWER (ISR)

Process:
1. Controller detects leader failure
2. Selects new leader from ISR
3. Updates metadata
4. Clients reconnect to new leader
5. Downtime: usually brief, dominated by how fast the failure is detected
```
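The selection step can be sketched in a few lines. This is a simplified model, assuming the controller picks the first replica in assignment order that is both alive and in the ISR; `elect_leader` is a hypothetical helper, and options such as unclean leader election are ignored.

```python
def elect_leader(replicas, isr, live_brokers):
    """Return the new leader's broker id, or None if the partition goes offline.

    replicas:     broker ids in assignment order
    isr:          set of broker ids currently in sync
    live_brokers: set of broker ids that are still up
    """
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # no eligible replica: partition is offline


# Broker 1 (the old leader) fails; brokers 2 and 3 are alive and in sync
print(elect_leader([1, 2, 3], isr={1, 2, 3}, live_brokers={2, 3}))  # 2

# Only the failed broker was in sync: nobody can safely take over
print(elect_leader([1, 2, 3], isr={1}, live_brokers={2, 3}))  # None
```

The second case is why `UnderReplicatedPartitions` deserves attention: a shrinking ISR reduces the pool of safe successors.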

### Acknowledgment Modes (acks)

**Producer acks configuration:**

| acks | Behavior | Durability | Latency | Use Case |
|------|----------|------------|---------|----------|
| 0 | Fire and forget | Lowest | Fastest | Metrics, logs |
| 1 | Leader confirms | Medium | Medium | Most apps |
| all | All ISR confirm | Highest | Slowest | Financial, critical |

**Visual:**
```
acks = 0:
Producer → Leader
           (no wait)

acks = 1:
Producer → Leader → [writes to log] → ACK to producer

acks = all:
Producer → Leader → [writes] → Follower 1 → [replicates]
                             → Follower 2 → [replicates]
                    ← ACK (after all ISR confirm)
```

In [None]:
# Demonstrate different acks configurations
import time


def test_acks_performance(acks_config, num_messages=100):
    """Test producer performance with different acks settings"""
    config = {
        "bootstrap.servers": "localhost:9092",
        "acks": acks_config,
        "linger.ms": 0,  # Send as soon as possible; identical for both runs
    }

    producer = Producer(config)

    start_time = time.time()

    for i in range(num_messages):
        event = {"id": i, "timestamp": datetime.now().isoformat()}
        producer.produce(topic=TOPIC_NAME, value=json.dumps(event))
        producer.poll(0)

    producer.flush()

    elapsed = time.time() - start_time
    throughput = num_messages / elapsed

    return elapsed, throughput


print("[DATA] Testing different acks configurations...\n")

# Test acks=1
elapsed_1, throughput_1 = test_acks_performance("1", 100)
print(f"acks=1:   {elapsed_1:.3f}s, {throughput_1:.0f} msg/s")

# Test acks=all
elapsed_all, throughput_all = test_acks_performance("all", 100)
print(f"acks=all: {elapsed_all:.3f}s, {throughput_all:.0f} msg/s")

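print("\n[OK] acks='all' waits for every in-sync replica, so it is usually slower but more durable")
print("     With replication_factor=1 (as on this topic) the leader is the only replica, so the gap may be small")
print("     Use acks='all' + idempotence for critical data")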

---

## 3. Producer Performance Tuning

### Key Producer Settings

**Batching:**
```
Without Batching:              With Batching:
[msg1] → Network               [msg1, msg2, msg3, msg4, msg5]
[msg2] → Network                         ↓
[msg3] → Network                      Network
[msg4] → Network               (one network call)
[msg5] → Network
(5 network calls)
```

**Important Settings:**

| Setting | Default (Java client) | Description | Tuning |
|---------|---------|-------------|--------|
| `batch.size` | 16384 | Max batch size (bytes) | Increase for throughput |
| `linger.ms` | 0 | Wait time before sending | Increase for batching |
| `compression.type` | none | Compression algorithm | Use gzip or lz4 |
| `buffer.memory` | 33554432 | Total buffer size (bytes) | Increase for high volume |
| `max.in.flight.requests.per.connection` | 5 | Parallel requests | Reduce for ordering |
| `enable.idempotence` | false (true since Kafka 3.0) | Prevent duplicates | Set true for reliability |

These are the Java client's defaults. The `confluent_kafka` client used in this notebook wraps librdkafka, which ships different defaults and has no `buffer.memory` property; its producer queue limit is `queue.buffering.max.kbytes`.

### Batching Strategy

```
Producer Buffer:
┌─────────────────────────────────────┐
│ Batch for Partition 0               │
│ [msg1][msg2][msg3]...               │
│                                     │
│ Batch for Partition 1               │
│ [msg10][msg11][msg12]...            │
└─────────────────────────────────────┘

Sends when:
1. Batch reaches batch.size, OR
2. linger.ms time expires

Trade-off:
- Higher linger.ms = Better batching, Higher latency
- Lower linger.ms = Lower latency, Less batching
```
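The two send triggers can be explored without a broker. The simulation below is a simplified model of one partition's batch: it closes the batch when the next message would exceed `batch_size` or when `linger_ms` has elapsed since the batch was opened. `simulate_batches` is a hypothetical helper; a real producer also batches whatever queues up while a previous request is in flight.

```python
def simulate_batches(arrivals, batch_size, linger_ms):
    """Group messages into batches.

    arrivals: list of (arrival_time_ms, size_bytes), sorted by time.
    Returns a list of batches, each a list of message indexes.
    """
    batches = []
    current, current_bytes, opened_at = [], 0, None
    for idx, (t, size) in enumerate(arrivals):
        linger_expired = current and t - opened_at >= linger_ms
        would_overflow = current and current_bytes + size > batch_size
        if linger_expired or would_overflow:
            batches.append(current)          # send the batch
            current, current_bytes = [], 0
        if not current:
            opened_at = t                    # first message opens a new batch
        current.append(idx)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


arrivals = [(0, 100), (1, 100), (2, 100), (20, 100), (21, 100)]
print(simulate_batches(arrivals, batch_size=16384, linger_ms=0))
# [[0], [1], [2], [3], [4]]
print(simulate_batches(arrivals, batch_size=16384, linger_ms=10))
# [[0, 1, 2], [3, 4]]
print(simulate_batches(arrivals, batch_size=250, linger_ms=10))
# [[0, 1], [2], [3, 4]]
```

With `linger_ms=10` the five messages travel in two requests instead of five, at the cost of the first message waiting for the batch to close.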

In [None]:
# Compare batching configurations
def test_batching(batch_size, linger_ms, num_messages=1000):
    """Test different batching configurations"""
    config = {
        "bootstrap.servers": "localhost:9092",
        "acks": "1",
        "batch.size": batch_size,
        "linger.ms": linger_ms,
        "compression.type": "gzip",
    }

    producer = Producer(config)

    start_time = time.time()

    for i in range(num_messages):
        event = {
            "id": i,
            "data": "x" * 100,  # Some payload
            "timestamp": datetime.now().isoformat(),
        }
        producer.produce(topic=TOPIC_NAME, value=json.dumps(event))
        producer.poll(0)

    producer.flush()
    elapsed = time.time() - start_time

    return elapsed, num_messages / elapsed


print("[DATA] Testing batching configurations (1000 messages)...\n")

# No batching
elapsed1, throughput1 = test_batching(batch_size=1, linger_ms=0)
print(f"No batching (batch.size=1, linger.ms=0):")
print(f"  Time: {elapsed1:.3f}s, Throughput: {throughput1:.0f} msg/s")

# Small batch, no wait
elapsed2, throughput2 = test_batching(batch_size=16384, linger_ms=0)
print(f"\nDefault batch (batch.size=16KB, linger.ms=0):")
print(f"  Time: {elapsed2:.3f}s, Throughput: {throughput2:.0f} msg/s")

# Large batch with wait
elapsed3, throughput3 = test_batching(batch_size=65536, linger_ms=10)
print(f"\nOptimized (batch.size=64KB, linger.ms=10):")
print(f"  Time: {elapsed3:.3f}s, Throughput: {throughput3:.0f} msg/s")

print(f"\n[OK] Optimized vs. no batching: {(throughput3/throughput1):.1f}x throughput")
print("     Trade-off: linger.ms=10 can add up to ~10 ms of latency per message")

### Compression

**Compression Algorithms:**

| Algorithm | Compression Ratio | CPU Usage | Speed | Use Case |
|-----------|------------------|-----------|-------|----------|
| none | 1x | Lowest | Fastest | Low-latency, small messages |
| gzip | 3-5x | High | Slow | Maximum compression |
| snappy | 2-3x | Medium | Fast | Balanced |
| lz4 | 2-3x | Low | Very Fast | High throughput |
| zstd | 3-4x | Medium | Fast | Modern choice |

**Benefits:**
- Reduces network bandwidth
- Reduces disk storage
- Can improve throughput (less network I/O)

**Trade-offs:**
- CPU overhead on producer and consumer
- Latency increase

In [None]:
# Compare compression types
def test_compression(compression_type, num_messages=500):
    """Test different compression algorithms"""
    config = {
        "bootstrap.servers": "localhost:9092",
        "acks": "1",
        "compression.type": compression_type,
        "batch.size": 65536,
        "linger.ms": 10,
    }

    producer = Producer(config)

    start_time = time.time()

    for i in range(num_messages):
        # Create compressible data
        event = {
            "id": i,
            "data": "This is some repetitive text. " * 20,  # Compressible
            "timestamp": datetime.now().isoformat(),
        }
        producer.produce(topic=TOPIC_NAME, value=json.dumps(event))
        producer.poll(0)

    producer.flush()
    elapsed = time.time() - start_time

    return elapsed


print("[DATA] Testing compression algorithms (500 messages)...\n")

results = {}
for compression in ["none", "gzip", "snappy", "lz4", "zstd"]:
    try:
        elapsed = test_compression(compression)
        results[compression] = elapsed
        print(f"{compression:10s}: {elapsed:.3f}s")
    except Exception as e:
        print(f"{compression:10s}: Not available - {e}")

if results:
    best = min(results, key=results.get)
    print(f"\n[OK] Best performance: {best}")
    print("     Recommendation: Use 'lz4' for best speed/compression balance")

---

## 4. Idempotence and Transactions

### The Duplicate Problem

**Without Idempotence:**
```
Producer sends message → Network timeout
Producer retries → Message written AGAIN

Result: Duplicate messages!
[msg1] [msg1] [msg2] [msg3] [msg3]
       ↑dup        ↑dup
```

**With Idempotence:**
```
Producer sends message (seq=0) → Network timeout
Producer retries (seq=0) → Broker detects duplicate, ignores

Result: No duplicates!
[msg1] [msg2] [msg3]
```

### Enabling Idempotence

**Configuration:**
```python
config = {
    'enable.idempotence': True,
    'acks': 'all',  # Required for idempotence
    'retries': 2147483647,  # Max retries
    'max.in.flight.requests.per.connection': 5
}
```

**How It Works:**
- Producer assigns sequence numbers to messages
- Broker tracks sequence numbers per producer
- Duplicates are detected and discarded
- Exactly-once delivery per partition for the lifetime of one producer session (retries cannot create duplicates)
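The broker-side check can be modelled as follows. This is a simplified sketch for a single partition: `PartitionDedup` is a hypothetical class, and a real broker tracks sequence numbers per batch and per producer epoch rather than per message.

```python
class PartitionDedup:
    """Toy model of broker-side duplicate detection for one partition."""

    def __init__(self):
        self.log = []
        self._last_seq = {}  # producer_id -> last appended sequence number

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self._last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # already written: acknowledge, do not append
        if seq != last + 1:
            return "out_of_order"   # a gap means an earlier message is missing
        self.log.append(value)
        self._last_seq[producer_id] = seq
        return "appended"


partition = PartitionDedup()
print(partition.append(producer_id=7, seq=0, value="msg1"))  # appended
print(partition.append(producer_id=7, seq=0, value="msg1"))  # duplicate (retry)
print(partition.append(producer_id=7, seq=1, value="msg2"))  # appended
print(partition.log)                                         # ['msg1', 'msg2']
```

The retry is acknowledged but not written again, which is how the duplicate in the earlier diagram is avoided.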

### Transactions

**Use Case: Exactly-once across multiple partitions**
```
Transaction:
  BEGIN
    Write to topic A, partition 0
    Write to topic B, partition 1
    Write to topic C, partition 2
  COMMIT

Result: All writes succeed or all fail (atomic)
```
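With `confluent_kafka`, the BEGIN/COMMIT flow maps onto the producer's `begin_transaction()`, `commit_transaction()`, and `abort_transaction()` methods. The helper below is a minimal sketch: `produce_atomically` and the `demo-txn-1` transactional id are made up for illustration, and aborting on every exception is a simplification (production code should inspect the error, since some failures are retriable or fatal).

```python
def produce_atomically(producer, records, timeout=10.0):
    """Write all records in one transaction: either all become visible or none.

    producer: a transactional producer on which init_transactions() was called
    records:  iterable of (topic, key, value) tuples
    """
    producer.begin_transaction()
    try:
        for topic, key, value in records:
            producer.produce(topic, key=key, value=value)
        # Flushes outstanding messages, then commits
        producer.commit_transaction(timeout)
    except Exception:
        # Simplification: abort on any failure, then let the caller decide
        producer.abort_transaction(timeout)
        raise


# Usage sketch (needs a running broker):
# from confluent_kafka import Producer
# producer = Producer({
#     "bootstrap.servers": "localhost:9092",
#     "transactional.id": "demo-txn-1",  # hypothetical id; must be stable per producer instance
# })
# producer.init_transactions()
# produce_atomically(producer, [("topic-a", "k", "v1"), ("topic-b", "k", "v2")])
```

Consumers only get the all-or-nothing view if they set `isolation.level` to `read_committed`.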

In [None]:
# Demonstrate idempotent producer
print("[DATA] Comparing non-idempotent vs idempotent producers\n")

# Non-idempotent producer
non_idempotent_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "1",
    "enable.idempotence": False,
    "retries": 3,
}

# Idempotent producer
idempotent_config = {
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,  # Automatically sets acks='all'
    "retries": 10,
    "max.in.flight.requests.per.connection": 5,
}

producer = Producer(idempotent_config)

print("[OK] Created idempotent producer")
print("     Guarantees: No duplicates, ordering preserved")
print("     Use case: Financial transactions, order processing\n")

# Send some events
for i in range(10):
    event = {
        "transaction_id": f"txn_{i}",
        "amount": random.randint(100, 1000),
        "timestamp": datetime.now().isoformat(),
    }
    producer.produce(topic=TOPIC_NAME, key=f"txn_{i}", value=json.dumps(event))

producer.flush()
print("[SUCCESS] Sent 10 transactions; retries cannot introduce duplicates")

---

## 5. Consumer Performance Tuning

### Key Consumer Settings

| Setting | Default | Description | Tuning |
|---------|---------|-------------|--------|
| `fetch.min.bytes` | 1 | Min data to fetch | Increase for batching |
| `fetch.max.wait.ms` | 500 | Max wait time (named `fetch.wait.max.ms` in `confluent_kafka`) | Balance latency/throughput |
| `max.partition.fetch.bytes` | 1048576 | Max per partition | Increase for large messages |
| `session.timeout.ms` | 10000 (45000 since Kafka 3.0) | Consumer heartbeat timeout | Increase for unstable networks |
| `max.poll.interval.ms` | 300000 | Max time between polls | Increase for long processing |
| `enable.auto.commit` | true | Auto commit offsets | Disable for manual control |

### Fetch Behavior

```
Consumer Fetch Request:
┌────────────────────────────────────┐
│ fetch.min.bytes = 1 KB             │
│ fetch.max.wait.ms = 500 ms         │
└────────────────────────────────────┘
         ↓
Returns when:
1. Has 1 KB of data, OR
2. 500 ms timeout expires

Optimization:
- Increase fetch.min.bytes for better batching
- Increase fetch.max.wait.ms for higher throughput
- Decrease for lower latency
```

In [None]:
# Test consumer fetch configurations
def test_consumer_fetch(fetch_min_bytes, fetch_max_wait_ms, num_messages=100):
    """Test different fetch configurations"""
    config = {
        "bootstrap.servers": "localhost:9092",
        "group.id": f"test-fetch-{fetch_min_bytes}-{fetch_max_wait_ms}",
        "auto.offset.reset": "earliest",
        "fetch.min.bytes": fetch_min_bytes,
        # librdkafka (confluent_kafka) names this property fetch.wait.max.ms
        "fetch.wait.max.ms": fetch_max_wait_ms,
    }

    consumer = Consumer(config)
    consumer.subscribe([TOPIC_NAME])

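    messages_read = 0
    empty_polls = 0
    start_time = time.time()

    try:
        # Tolerate a few empty polls: joining the group can take several seconds
        while messages_read < num_messages and empty_polls < 5:
            msg = consumer.poll(timeout=2.0)
            if msg is None:
                empty_polls += 1
                continue
            if msg.error():
                continue
            messages_read += 1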
    finally:
        consumer.close()

    elapsed = time.time() - start_time
    return elapsed, messages_read / elapsed if elapsed > 0 else 0


print("[DATA] Testing consumer fetch configurations...\n")

# Low latency
elapsed1, throughput1 = test_consumer_fetch(1, 100, 100)
print(f"Low latency (min=1B, wait=100ms):")
print(f"  Time: {elapsed1:.3f}s, Throughput: {throughput1:.0f} msg/s")

# High throughput
elapsed2, throughput2 = test_consumer_fetch(10240, 500, 100)
print(f"\nHigh throughput (min=10KB, wait=500ms):")
print(f"  Time: {elapsed2:.3f}s, Throughput: {throughput2:.0f} msg/s")

print("\n[OK] Higher fetch.min.bytes reduces number of fetch requests")
print("     Trade-off: Slightly higher latency for better throughput")

### Consumer Rebalancing

**What is Rebalancing?**
```
Before (2 consumers, 4 partitions):
Consumer 1: [P0, P1]
Consumer 2: [P2, P3]

Consumer 3 joins →  REBALANCE

After (3 consumers, 4 partitions):
Consumer 1: [P0]
Consumer 2: [P1, P2]
Consumer 3: [P3]
```

**Rebalancing Process:**
1. Consumer joins/leaves group
2. With the default (eager) protocol, all consumers **STOP** processing and give up their partitions
3. Partition assignment recalculated
4. Consumers resume with new assignments

**Minimizing Rebalance Impact:**
- Increase `session.timeout.ms` (slow networks)
- Increase `max.poll.interval.ms` (slow processing)
- Use incremental cooperative rebalancing (Kafka 2.4+)
- Keep consumer group stable
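To see why cooperative rebalancing helps, compare assignments before and after a consumer joins. The sketch below uses a simplified range-style split, so its output differs slightly from the diagram above; the actual result depends on the configured assignor, and `assign_range` is a hypothetical helper.

```python
def assign_range(partitions, consumers):
    """Split partitions into contiguous ranges; earlier consumers get the extras."""
    if not consumers:
        return {}
    consumers = sorted(consumers)
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


partitions = [0, 1, 2, 3]
before = assign_range(partitions, ["c1", "c2"])
after = assign_range(partitions, ["c1", "c2", "c3"])

# Partitions whose owner changed: only these must be revoked
moved = [p for c, owned in before.items() for p in owned if p not in after.get(c, [])]

print(before)  # {'c1': [0, 1], 'c2': [2, 3]}
print(after)   # {'c1': [0, 1], 'c2': [2], 'c3': [3]}
print(moved)   # [3]
```

An eager rebalance revokes all four partitions before reassigning them, whereas a cooperative one only needs to move partition 3, so `c1` keeps processing throughout.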

---

## 6. Monitoring Kafka

### Key Metrics to Monitor

**Producer Metrics:**
- `record-send-rate`: Messages produced per second
- `record-error-rate`: Failed sends
- `request-latency-avg`: Average request latency
- `batch-size-avg`: Average batch size
- `compression-rate-avg`: Compression efficiency

**Consumer Metrics:**
- `records-consumed-rate`: Messages consumed per second
- `records-lag`: How far behind (critical!)
- `fetch-latency-avg`: Average fetch latency
- `commit-latency-avg`: Offset commit latency

**Broker Metrics:**
- `UnderReplicatedPartitions`: Partitions not fully replicated
- `OfflinePartitionsCount`: Partitions without leader
- `RequestsPerSecond`: Total request rate
- `NetworkProcessorAvgIdlePercent`: Network thread idle %

### Consumer Lag

**Most Important Metric:**
```
Consumer Lag = Latest Offset - Consumer Offset

Partition: [0][1][2][3][4][5][6][7][8][9]
                         ↑              ↑
                    Consumer      Latest Offset
                    (offset 4)      (offset 9)
                    
Lag = 9 - 4 = 5 messages behind

Good: Lag = 0-1000 (keeping up)
Warning: Lag growing over time
Critical: Lag > millions (falling behind)
```
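The same arithmetic in code, as a broker-free sketch. It assumes you already have each partition's log end offset (the offset the next record will receive) and the group's committed offset (the next offset it will read); `partition_lag` is a hypothetical helper.

```python
def partition_lag(end_offsets, committed_offsets):
    """Return {partition: lag} from two {partition: offset} dicts.

    A partition with no committed offset is treated as fully unread,
    which assumes the log still starts at offset 0.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        lag[partition] = end if committed is None else max(end - committed, 0)
    return lag


# Records 0..9 exist (log end offset 10); the group has processed 0..4 (committed 5)
print(partition_lag({0: 10}, {0: 5}))          # {0: 5}
print(partition_lag({0: 10, 1: 7}, {0: 10}))   # {0: 0, 1: 7}
```

Summing the per-partition values gives the group's total lag; the trend over time matters more than any single reading.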

In [None]:
# Monitor consumer lag
from confluent_kafka import TopicPartition


def get_consumer_lag(group_id, topic):
    """Print per-partition lag (high watermark - committed offset) for a group"""
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})

    # A consumer created with the group's id can read that group's committed offsets
    consumer = Consumer({"bootstrap.servers": "localhost:9092", "group.id": group_id})

    try:
        # Get topic metadata
        metadata = admin.list_topics(timeout=5)

        if topic not in metadata.topics:
            print(f"[WARNING] Topic '{topic}' not found")
            return

        partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

        print(f"\n[DATA] Consumer Lag Analysis for group '{group_id}':\n")

        for tp in consumer.committed(partitions, timeout=5):
            # Low/high watermarks: oldest retained offset and next offset to be written
            low, high = consumer.get_watermark_offsets(tp, timeout=5)

            # A negative offset means the group has never committed for this partition
            committed = tp.offset if tp.offset >= 0 else low

            print(f"Partition {tp.partition}:")
            print(f"  Low offset: {low}")
            print(f"  High offset: {high}")
            print(f"  Committed offset: {committed}")
            print(f"  Lag: {max(high - committed, 0)}")
    finally:
        consumer.close()


# Check lag for our test consumers
get_consumer_lag("user-events-consumer-group", TOPIC_NAME)

### Using Kafka UI for Monitoring

**Access Kafka UI:**
- URL: http://localhost:8080
- View topics, partitions, messages
- Monitor consumer groups and lag
- Inspect message contents

**Key Screens:**
1. **Topics**: See all topics, partition count, size
2. **Consumers**: View consumer groups, lag, members
3. **Brokers**: Monitor broker health, disk usage
4. **Messages**: Browse message contents

---

## 7. Troubleshooting Common Issues

### Issue 1: High Consumer Lag

**Symptoms:**
- Consumer lag growing over time
- Processing falling behind production

**Causes & Solutions:**
```
Cause 1: Slow Processing
  → Add more consumers (up to partition count)
  → Optimize processing logic
  → Increase processing parallelism

Cause 2: Not Enough Partitions
  → Increase partition count
  → Add more consumers

Cause 3: Network Issues
  → Increase fetch.min.bytes
  → Increase max.partition.fetch.bytes
  → Check network bandwidth
```

### Issue 2: Rebalancing Too Frequently

**Symptoms:**
- Consumers constantly rebalancing
- Processing pauses

**Solutions:**
```python
config = {
    'session.timeout.ms': 30000,  # 30 s (raise it if your client's default is lower)
    'max.poll.interval.ms': 600000,  # Increase from 5m
    'heartbeat.interval.ms': 3000  # Keep at or below 1/3 of session timeout
}
```

### Issue 3: Message Loss

**Prevention:**
```python
# Producer settings
producer_config = {
    'acks': 'all',  # Wait for all replicas
    'enable.idempotence': True,  # Prevent duplicates
    'retries': 10,  # Retry on failure
    'max.in.flight.requests.per.connection': 5
}

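# Topic settings (replication factor is chosen at topic creation;
# min.insync.replicas is a topic-level config)
topic_config = {
    'replication.factor': 3,  # 3 copies
    'min.insync.replicas': 2  # acks=all writes need at least 2 in-sync replicas
}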
```
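How `acks` and `min.insync.replicas` interact can be captured in a small decision function. This is a simplified model of the broker's check, and `is_write_accepted` is a hypothetical helper: with `acks=all` the leader rejects the write when the ISR has shrunk below `min.insync.replicas`, while `acks=0` and `acks=1` are not subject to that check.

```python
def is_write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Return True if the leader would accept a produce request."""
    if acks == "all":
        return isr_size >= min_insync_replicas
    # acks=0 and acks=1 only involve the leader
    return isr_size >= 1


# replication.factor=3, min.insync.replicas=2
for isr_size in (3, 2, 1):
    accepted = is_write_accepted("all", isr_size, min_insync_replicas=2)
    print(f"ISR size {isr_size}: acks=all write accepted = {accepted}")
# ISR size 3: acks=all write accepted = True
# ISR size 2: acks=all write accepted = True
# ISR size 1: acks=all write accepted = False
```

With these settings the topic tolerates one broker failure without losing acknowledged writes and refuses writes rather than risking loss once a second broker is out of sync.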

### Issue 4: Slow Producers

**Optimizations:**
```python
config = {
    'batch.size': 65536,  # 64 KB batches
    'linger.ms': 10,  # Wait 10ms for batching
    'compression.type': 'lz4',  # Fast compression
    'queue.buffering.max.kbytes': 65536,  # 64 MB producer queue (Java client: buffer.memory)
    'acks': '1'  # Only leader ack (if acceptable)
}
```

In [None]:
# Demonstrate optimized producer configuration
print("[DATA] Production-Ready Configuration Examples\n")

# High-throughput producer
high_throughput_config = {
    "bootstrap.servers": "localhost:9092",
    # Batching
    "batch.size": 65536,  # 64 KB
    "linger.ms": 10,
    # Compression
    "compression.type": "lz4",
    # Memory (the Java client calls this buffer.memory, in bytes)
    "queue.buffering.max.kbytes": 65536,  # 64 MB
    # Reliability (medium)
    "acks": "1",
    "retries": 3,
}

print("High-Throughput Producer:")
for key, value in high_throughput_config.items():
    print(f"  {key}: {value}")

# High-reliability producer
high_reliability_config = {
    "bootstrap.servers": "localhost:9092",
    # Reliability (maximum)
    "acks": "all",
    "enable.idempotence": True,
    "retries": 10,
    "max.in.flight.requests.per.connection": 5,
    # Batching (moderate)
    "batch.size": 16384,
    "linger.ms": 5,
    # Compression
    "compression.type": "gzip",
}

print("\nHigh-Reliability Producer:")
for key, value in high_reliability_config.items():
    print(f"  {key}: {value}")

# Low-latency producer
low_latency_config = {
    "bootstrap.servers": "localhost:9092",
    # Latency (minimize)
    "linger.ms": 0,
    "batch.size": 1,
    "compression.type": "none",
    # Reliability (basic)
    "acks": "1",
    "retries": 0,
}

print("\nLow-Latency Producer:")
for key, value in low_latency_config.items():
    print(f"  {key}: {value}")

print("\n[OK] Choose configuration based on your requirements:")
print("     - High throughput: Batching + compression")
print("     - High reliability: acks=all + idempotence")
print("     - Low latency: No batching, no compression")

---

## 8. Mini-Project: Performance Benchmarking

Let's build a benchmarking tool to test different configurations.

In [None]:
# Comprehensive benchmarking tool
import statistics


def benchmark_producer(config_name, config, num_messages=1000):
    """Benchmark producer with given configuration"""
    producer = Producer(config)

    latencies = []

    def make_delivery_callback(sent_at):
        # Record produce-to-acknowledgment latency once the broker confirms the message
        def on_delivery(err, msg):
            if err is None:
                latencies.append((time.time() - sent_at) * 1000)  # ms

        return on_delivery

    start_time = time.time()

    for i in range(num_messages):
        event = {"id": i, "data": "x" * 100, "timestamp": datetime.now().isoformat()}

        producer.produce(
            topic=TOPIC_NAME,
            value=json.dumps(event),
            on_delivery=make_delivery_callback(time.time()),
        )
        producer.poll(0)  # Serve delivery callbacks

    producer.flush()
    total_time = time.time() - start_time

    if not latencies:
        raise RuntimeError(f"No messages were delivered for config '{config_name}'")

    ordered = sorted(latencies)
    return {
        "config": config_name,
        "total_time": total_time,
        "throughput": num_messages / total_time,
        "avg_latency": statistics.mean(ordered),
        "p50_latency": statistics.median(ordered),
        "p99_latency": ordered[min(int(len(ordered) * 0.99), len(ordered) - 1)],
    }


# Test configurations
configs = {
    "Default": {"bootstrap.servers": "localhost:9092", "acks": "1"},
    "Optimized": {
        "bootstrap.servers": "localhost:9092",
        "acks": "1",
        "batch.size": 65536,
        "linger.ms": 10,
        "compression.type": "lz4",
    },
    "Reliable": {
        "bootstrap.servers": "localhost:9092",
        "acks": "all",
        "enable.idempotence": True,
        "compression.type": "gzip",
    },
}

print("[DATA] Running performance benchmarks (1000 messages each)...\n")

results = []
for config_name, config in configs.items():
    print(f"Testing {config_name}...")
    result = benchmark_producer(config_name, config)
    results.append(result)

# Display results
print("\n[DATA] Benchmark Results:\n")
print(f"{'Config':<12} {'Time (s)':<10} {'Throughput':<12} {'Avg Lat':<10} {'P99 Lat':<10}")
print("-" * 60)

for r in results:
    print(
        f"{r['config']:<12} {r['total_time']:<10.2f} {r['throughput']:<12.0f} "
        f"{r['avg_latency']:<10.2f} {r['p99_latency']:<10.2f}"
    )

print("\n[SUCCESS] Benchmark complete!")
print("\n[OK] Key Insights:")
print("     - Batching + compression (Optimized) usually gives the highest throughput")
print("     - The Reliable config usually trades some throughput for stronger durability")
print("     - Choose based on your requirements (speed vs reliability)")

---

## 9. Key Takeaways

[OK] **Log Structure**: Kafka uses append-only logs with segments for efficient storage

[OK] **Replication**: Leader-follower model with ISR for fault tolerance

[OK] **Producer Tuning**: Batching, compression, and acks for optimal performance

[OK] **Idempotence**: Prevents duplicates with sequence numbers

[OK] **Consumer Tuning**: Fetch size, poll interval, and rebalancing

[OK] **Monitoring**: Consumer lag is the most critical metric

### Configuration Cheat Sheet

**For Maximum Throughput:**
```python
{
    'batch.size': 65536,
    'linger.ms': 10,  # try values between 10 and 100 ms
    'compression.type': 'lz4',
    'acks': '1'
}
```

**For Maximum Reliability:**
```python
{
    'acks': 'all',
    'enable.idempotence': True,
    'retries': 10,
    'min.insync.replicas': 2  # topic/broker setting, not a producer property
}
```

**For Minimum Latency:**
```python
{
    'linger.ms': 0,
    'batch.size': 1,
    'compression.type': 'none',
    'acks': '1'
}
```

### Production Best Practices

1. **Always use `enable.idempotence=True`** for reliability
2. **Monitor consumer lag** continuously
3. **Set replication factor ≥ 3** in production
4. **Use compression** (lz4 or gzip) to save bandwidth
5. **Tune batch.size and linger.ms** based on workload
6. **Plan partition count** for future growth
7. **Test configurations** with your actual workload

---

## 10. Practice Exercises

1. **Create a topic** with replication factor 1 and test producer with different acks settings
2. **Benchmark** producer performance with different batch sizes (1KB, 16KB, 64KB)
3. **Test compression** algorithms with your actual message payload
4. **Monitor consumer lag** and observe what happens when you slow down processing
5. **Implement** an idempotent producer with error handling

In [None]:
# Your practice code here

---

## 11. Next Steps

Congratulations on completing Module 02!

### What You've Learned

- [OK] Kafka's internal log structure and storage
- [OK] Replication and fault tolerance mechanisms
- [OK] Producer and consumer performance tuning
- [OK] Idempotence and exactly-once semantics
- [OK] Monitoring and troubleshooting

### Coming Up in Module 03: Stream Processing Fundamentals

You'll learn:
- What is stream processing?
- Stateless vs stateful operations
- Windowing concepts (tumbling, sliding, session)
- Time semantics (event time, processing time)
- Building your first stream processor

### Resources

- [Kafka Performance Tuning](https://kafka.apache.org/documentation/#producerconfigs)
- [Kafka Internals](https://kafka.apache.org/documentation/#design)
- [Monitoring Kafka](https://docs.confluent.io/platform/current/kafka/monitoring.html)
- [Kafka Operations Guide](https://kafka.apache.org/documentation/#operations)

---

**Ready for stream processing?** Open `03_stream_processing_fundamentals.ipynb` to continue!