# Module 1: The Disk is the Enemy
**Goal**: Shatter the illusion that data access is instant. We will prove that Disk I/O is the bottleneck and understand why "Full Table Scans" (Sequential) often beat "Index Lookups" (Random) on large datasets.

## 1.0 The Laboratory Setup
First, let's load our tools and define the location of our laboratory data.

In [None]:
import time
import os
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import duckdb
import psycopg2

# Configure visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("talk")

# The "God Script" Data Locations
DATA_DIR = "../data/"
USERS_FILE = os.path.join(DATA_DIR, "users.csv")
ORDERS_SORTED = os.path.join(DATA_DIR, "orders_sorted.csv")
ORDERS_SHUFFLED = os.path.join(DATA_DIR, "orders_shuffled.csv")

# Verify data exists
print("Checking environment...")
for f in [USERS_FILE, ORDERS_SORTED, ORDERS_SHUFFLED]:
    if os.path.exists(f):
        size_mb = os.path.getsize(f) / (1024 * 1024)
        print(f"Found {os.path.basename(f)}: {size_mb:.2f} MB")
    else:
        print(f"MISSING: {f}")

---

## 1.1 The Hierarchy of Speed: Light vs. The Snail
**The Hypothesis**: We often treat memory (RAM) and Disk (SSD/HDD) as simply "storage." But physically, they are different worlds. RAM sits on the motherboard next to the CPU. Disk sits across a bus, far away.

We will compare:
1. CPU/RAM: Summing 10 million numbers already in memory.
2. Disk: Reading those same numbers from a file on disk.

**The Experiment**:

In [None]:
import numpy as np

# 1. Create a dummy array in RAM (10 million integers)
# This simulates data already loaded in the Buffer Pool
data_ram = np.random.randint(0, 100, 10_000_000)

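# 2. Write this to disk so we can read it back
# This simulates data sitting in "Cold Storage"
disk_file_path = os.path.join(DATA_DIR, "temp_experiment_1.npy")
np.save(disk_file_path, data_ram)

def run_cpu_test():
    # perf_counter() is a monotonic, high-resolution clock meant for timing
    start = time.perf_counter()
    _ = data_ram.sum()
    return time.perf_counter() - start

def run_disk_test():
    start = time.perf_counter()
    # We must open the file, read bytes, and parse them,
    # then do the same sum so both tests perform the same work.
    # Caveat: the file was just written, so the OS page cache may serve it
    # from RAM; a truly cold read would be slower still.
    _ = np.load(disk_file_path).sum()
    return time.perf_counter() - start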

# Run 5 iterations to average
cpu_times = [run_cpu_test() for _ in range(5)]
disk_times = [run_disk_test() for _ in range(5)]

avg_cpu = sum(cpu_times) / len(cpu_times)
avg_disk = sum(disk_times) / len(disk_times)

print(f"RAM Latency:  {avg_cpu:.6f} seconds")
print(f"Disk Latency: {avg_disk:.6f} seconds")
print(f"Factor: Disk was {avg_disk / avg_cpu:.0f}x slower")

# Clean up
os.remove(disk_file_path)

**The Visualization**:

In [None]:
labels = ['RAM Access', 'Disk Access']
times = [avg_cpu, avg_disk]

plt.figure(figsize=(10, 6))
# We use log scale because the difference is often too massive to see on a linear scale
plt.bar(labels, times, color=['#4c72b0', '#c44e52'])
plt.ylabel('Time (Seconds)')
plt.title('The I/O Cliff: RAM vs Disk Latency')
plt.yscale('log') # Log scale is crucial here!
for i, v in enumerate(times):
    plt.text(i, v, f"{v:.4f}s", ha='center', va='bottom')
plt.show()

**The Physics**: Why the massive gap?
1. **Distance**: A DRAM access completes in roughly 100 nanoseconds. A read from an NVMe SSD takes tens of microseconds, and a seek on a spinning HDD takes several milliseconds. Each step away from the CPU costs orders of magnitude.
2. **Protocol Overhead**: Reading from disk involves a system call into the Operating System (Kernel), filesystem drivers, and hardware controllers. Reading from RAM is an ordinary CPU load instruction. This is why Databases fight so hard to keep the "Working Set" in the Buffer Pool (RAM).
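
To get a feel for these orders of magnitude, the sketch below rescales a few rough latencies so that one RAM access takes one "human" second. The figures in `APPROX_LATENCY_NS` are illustrative assumptions, not measurements from this notebook, and the helper names are hypothetical.

```python
# Rough, order-of-magnitude latencies in nanoseconds.
# ASSUMPTION: illustrative figures only; real hardware varies widely.
APPROX_LATENCY_NS = {
    "RAM access": 100,
    "NVMe SSD random read": 100_000,      # ~100 microseconds
    "HDD seek + rotation": 10_000_000,    # ~10 milliseconds
}

def humanize(seconds):
    """Format a duration in the largest unit that fits."""
    for unit, size in (("days", 86_400), ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

def scale_to_human(latencies_ns, baseline="RAM access"):
    """Rescale so the baseline operation takes exactly one 'human' second."""
    base = latencies_ns[baseline]
    return {name: ns / base for name, ns in latencies_ns.items()}

scaled = scale_to_human(APPROX_LATENCY_NS)
for name, secs in scaled.items():
    print(f"{name:<22} -> {humanize(secs)}")
```

On this scale, if a RAM access took one second, the assumed SSD read would take about 17 minutes and the assumed HDD seek more than a day.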

---

## 1.2 The "Minimum Order" Rule (Pages & Blocks)
**The Hypothesis**: If reading a file is slow, surely reading just 1 byte is much faster than reading 8 KB (8192 bytes), right? It's 8000 times less data!

**The Experiment**: We will use Python's raw file handler to read from `orders_sorted.csv`.
1. Read 1 Byte, 1,000 times.
2. Read 8 KB (Standard Page Size), 1,000 times.

In [None]:
def read_bytes(chunk_size, iterations=1000):
    # Look up the file size once, outside the timed loop, so the extra
    # stat() call does not distort the measurement
    file_size = os.path.getsize(ORDERS_SORTED)
    start = time.perf_counter()
    with open(ORDERS_SORTED, 'rb') as f:
        for _ in range(iterations):
            # Seek to a random position to prevent OS readahead caching from cheating too much
            # We want to simulate distinct fetch requests
            pos = random.randint(0, file_size - chunk_size)
            f.seek(pos)
            _ = f.read(chunk_size)
    return time.perf_counter() - start

# Run the test
time_1_byte = read_bytes(chunk_size=1)
time_8_kb   = read_bytes(chunk_size=8192)

print(f"Time to read 1 Byte  (x1000): {time_1_byte:.4f} s")
print(f"Time to read 8 KB    (x1000): {time_8_kb:.4f} s")
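
One caveat about this experiment: `open(path, 'rb')` returns a buffered reader, so Python itself pulls in a whole buffer even when asked for 1 byte. The sketch below is a hypothetical variant, `read_bytes_unbuffered`, that passes `buffering=0` so each `read()` maps to one system call. Taking the path as a parameter and returning the byte count are assumptions made for illustration.

```python
import os
import random
import time

def read_bytes_unbuffered(path, chunk_size, iterations=1000):
    """Random reads with Python's own buffering disabled (hypothetical helper)."""
    file_size = os.path.getsize(path)
    total = 0
    start = time.perf_counter()
    # buffering=0 gives a raw FileIO object: each read() is one system call
    with open(path, 'rb', buffering=0) as f:
        for _ in range(iterations):
            pos = random.randint(0, file_size - chunk_size)
            f.seek(pos)
            total += len(f.read(chunk_size))
    # Return elapsed seconds and the number of bytes actually read
    return time.perf_counter() - start, total
```

Calling `read_bytes_unbuffered(ORDERS_SORTED, 1)` and `read_bytes_unbuffered(ORDERS_SORTED, 8192)` gives a comparison with Python's buffer out of the picture. The OS page cache still works in whole pages underneath.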

**The Visualization**:

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(['1 Byte', '8 KB (Page)'], [time_1_byte, time_8_kb], color=['gray', 'green'])
plt.title('The Cost of I/O is the TRIP, not the LUGGAGE')
plt.ylabel('Total Time (Seconds)')
plt.show()

**The Physics**: The two times should be **close to each other**, despite one payload being about 8,000x larger.
- **The Page**: Disks and Operating Systems do not speak in "bytes." They speak in "Pages" or "Blocks." OS pages are usually 4 KB; database pages are typically 8 KB (the PostgreSQL default) or 16 KB (the InnoDB default).
- **The Overhead**: When you ask for 1 byte, the OS fetches the entire page into memory, then gives you the 1 byte you asked for. Python's buffered file object adds a similar layer on top, filling its own buffer on each read.
- **Lesson**: In Database design, fetching a single row is just as expensive as fetching the whole page of rows surrounding it. This drives the philosophy of "**Data Locality**": packing related data into the same page.
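
The page idea reduces to simple arithmetic. The sketch below assumes a 4 KB page (`PAGE_SIZE` is an illustrative constant, and `pages_touched` and `bytes_fetched` are hypothetical helpers) and computes which pages a read at a given byte offset overlaps.

```python
PAGE_SIZE = 4096  # ASSUMPTION: a 4 KB page; actual sizes vary by system

def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page numbers a read of `length` bytes at `offset` overlaps."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))

def bytes_fetched(offset, length, page_size=PAGE_SIZE):
    """Bytes the storage layer must move to satisfy the request."""
    return len(pages_touched(offset, length, page_size)) * page_size

# A 1-byte read still costs one full page
print(pages_touched(10_000, 1), bytes_fetched(10_000, 1))
# An aligned 8 KB read costs exactly two pages
print(pages_touched(8192, 8192), bytes_fetched(8192, 8192))
# An unaligned 8 KB read straddles three pages
print(pages_touched(10_000, 8192), bytes_fetched(10_000, 8192))
```

The last case is why databases align their pages to block boundaries: an unaligned request moves more data than it asked for.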

---

## 1.3 Sequential vs. Random Access (The Seek Tax)
**The Hypothesis**: Reading data in the order it sits on disk is far cheaper than jumping around. We have `orders_sorted.csv` (physically sorted by date) and `orders_shuffled.csv` (random order).
- **Sequential Read**: The disk head (or flash controller) reads contiguous blocks.
- **Random Seek**: The disk must "jump" to a new location for every read.

**The Experiment**: We will read 50 MB of data in 8 KB chunks. Both files must be at least 50 MB; on a smaller file the sequential loop hits end-of-file early and reads less than the random loop.
- **Sequential**: Read 50 MB continuously from one file.
- **Random**: Read 50 MB by jumping around (seeking) and reading small chunks.

In [None]:
READ_TOTAL_MB = 50
CHUNK_SIZE = 8192 # 8KB
ITERATIONS = int((READ_TOTAL_MB * 1024 * 1024) / CHUNK_SIZE)

def test_sequential():
    start = time.perf_counter()
    with open(ORDERS_SORTED, 'rb') as f:
        # Just read straight through
        for _ in range(ITERATIONS):
            _ = f.read(CHUNK_SIZE)
    return time.perf_counter() - start

def test_random():
    file_size = os.path.getsize(ORDERS_SHUFFLED)
    start = time.perf_counter()
    with open(ORDERS_SHUFFLED, 'rb') as f:
        # Jump around for every read
        for _ in range(ITERATIONS):
            pos = random.randint(0, file_size - CHUNK_SIZE)
            f.seek(pos)
            _ = f.read(CHUNK_SIZE)
    return time.perf_counter() - start

seq_time = test_sequential()
rand_time = test_random()

print(f"Sequential Read (50MB): {seq_time:.4f} s")
print(f"Random Seek     (50MB): {rand_time:.4f} s")

**The Visualization**:

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(['Sequential Scan', 'Random Seek'], [seq_time, rand_time], color=['#55a868', '#c44e52'])
plt.title('Throughput Killer: Scanning vs. Seeking')
plt.ylabel('Time to read 50MB (Seconds)')

# Calculate Throughput
seq_mb_s = READ_TOTAL_MB / seq_time
rand_mb_s = READ_TOTAL_MB / rand_time

# Annotate with speed
plt.text(0, seq_time, f"{seq_mb_s:.0f} MB/s", ha='center', va='bottom', fontsize=14, fontweight='bold')
plt.text(1, rand_time, f"{rand_mb_s:.0f} MB/s", ha='center', va='bottom', fontsize=14, fontweight='bold')

plt.show()

**The Physics**:
- **Mechanical (HDD)**: Random access requires moving the physical arm (Seek Time) and waiting for the platter to rotate (Rotational Latency). This physical movement costs several milliseconds per jump, which caps a single drive at roughly 100-200 random reads per second.
- **Solid State (SSD)**: While SSDs have no moving parts, random I/O still stresses the controller's "IOPS" (Input/Output Operations Per Second) limit and prevents the OS from doing "Prefetching" (predicting what you need next).
- **Database Implication**: This is why Full Table Scans (Sequential) are often preferred over Index Lookups once a query needs more than a small fraction of the table. A common rule of thumb puts the crossover around 5-10%, but the real threshold depends on the storage, how much of the table is cached, and the planner's cost settings. Each index hit forces a random jump from the index to the table heap, which destroys throughput.
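
The break-even point can be estimated with back-of-the-envelope arithmetic. The sketch below is a simplified cost model, not how any particular query planner works: it assumes a full scan streams at a fixed throughput, every index hit costs one random page read, and nothing is cached. The function names, table size, throughput and IOPS figures are all illustrative assumptions.

```python
def scan_seconds(table_mb, seq_mb_per_s):
    """Time to stream the whole table sequentially."""
    return table_mb / seq_mb_per_s

def index_seconds(rows_needed, random_iops):
    """Time if every matching row costs one random page read (worst case)."""
    return rows_needed / random_iops

def break_even_fraction(table_mb, total_rows, seq_mb_per_s, random_iops):
    """Fraction of rows above which the full scan wins under this model."""
    rows_at_break_even = scan_seconds(table_mb, seq_mb_per_s) * random_iops
    return min(1.0, rows_at_break_even / total_rows)

# ASSUMPTION: a 1,000 MB table with 10 million rows
# HDD-like device: 150 MB/s sequential, 150 random reads/s
hdd = break_even_fraction(1000, 10_000_000, seq_mb_per_s=150, random_iops=150)
# SSD-like device: 500 MB/s sequential, 50,000 random reads/s
ssd = break_even_fraction(1000, 10_000_000, seq_mb_per_s=500, random_iops=50_000)
print(f"HDD-like break-even: {hdd:.4%} of rows")
print(f"SSD-like break-even: {ssd:.2%} of rows")
```

Under these pessimistic assumptions the crossover lands at about 0.01% of rows for the HDD-like device and about 1% for the SSD-like one, well below the rule of thumb. Caching and matching rows that share a page push the real threshold higher.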

---

## Module Summary
1. **Disk is slow**: Avoid going to disk whenever possible (Cache/RAM is King).
2. **I/O is block-based**: A 1-byte read costs about as much as a full page, so make every page you fetch count.
3. **Seeking is expensive**: Sequential access is high throughput; Random access is low throughput.