# Module 10: Database Hygiene
**Goal**: Understand that deleting data does not immediately remove it from disk.
In the physical world, if you erase a sentence from a whiteboard, the ink is gone. In the database world, "deleting" usually just means turning off the visibility switch. The data remains on the disk, taking up space, until a garbage collector comes along to clean it up.

This chapter explores Entropy: the tendency of databases to get slower and larger over time unless actively maintained.

----

## 1. Setup and Tools
We will use Postgres to demonstrate "Dead Tuples" (the cost of MVCC) and DuckDB/Parquet to demonstrate the "Small Files Problem" (a common issue in Data Lakes).

In [None]:
import psycopg2
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
from decimal import Decimal
import seaborn as sns
import os
import shutil
import time

# Postgres Connection
DB_PARAMS = {
    "host": "db_int_opt",
    "port": 5432,
    "user": "admin",
    "password": "password",
    "dbname": "db_int_opt"
}

# Cleanup function for Postgres
def reset_postgres_table():
    conn = psycopg2.connect(**DB_PARAMS)
    try:
        # In psycopg2, `with conn` commits on success but does not close
        # the connection, so we close it explicitly in `finally`.
        with conn, conn.cursor() as cur:
            cur.execute("DROP TABLE IF EXISTS sensor_logs;")
            # Disable autovacuum so we can see the mess accumulate manually
            cur.execute("""
                CREATE TABLE sensor_logs (
                    id SERIAL PRIMARY KEY,
                    payload TEXT
                ) WITH (autovacuum_enabled = false);
            """)
    finally:
        conn.close()

# Helper to get table size in MB
def get_table_size(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_relation_size('sensor_logs') / 1024 / 1024.0;")
        return cur.fetchone()[0]

print("Setup Complete. Tools Ready.")

----

## 2. Experiment 10.1: The Zombie Data (Dead Tuples)
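**The Concept**: Postgres uses MVCC (Multi-Version Concurrency Control). When you `DELETE` a row, Postgres doesn't overwrite the data with zeros. It simply stamps the row header with the ID of the deleting transaction (the `xmax` field): "Valid until Transaction X." Transactions that start after the delete commits see that the row is "expired" and ignore it, while transactions that were already running can still read it. Either way, the row still occupies bytes on the hard drive. Rows that no transaction can see anymore are called Dead Tuples.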
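
The visibility rule can be sketched in a few lines of plain Python. This is a toy model, not Postgres' real tuple header or snapshot logic: it assumes every transaction has committed and that transaction IDs are simple increasing integers. The names `TupleVersion` and `is_visible` are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    """One stored row version (toy model, not the real on-disk layout)."""
    value: str
    xmin: int                    # ID of the transaction that inserted the row
    xmax: Optional[int] = None   # ID of the transaction that deleted it, if any


def is_visible(row: TupleVersion, snapshot_xid: int) -> bool:
    """A snapshot sees rows inserted at or before it and not deleted as of it."""
    created = row.xmin <= snapshot_xid
    deleted = row.xmax is not None and row.xmax <= snapshot_xid
    return created and not deleted


heap = [TupleVersion("reading-1", xmin=100)]

# Transaction 105 deletes the row: only xmax changes, the row stays stored.
heap[0].xmax = 105

old_reader_sees = [r.value for r in heap if is_visible(r, snapshot_xid=102)]
new_reader_sees = [r.value for r in heap if is_visible(r, snapshot_xid=110)]

print(old_reader_sees)  # ['reading-1']
print(new_reader_sees)  # []
```

The older snapshot (102) still needs the row, which is why the delete cannot erase it on the spot.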

#### Step 1: Hypothesis
If we insert 200,000 rows, measure the size, and then `DELETE` 100,000 rows, what happens to the table size on disk?
- A) It drops by 50%.
- B) It stays exactly the same.

#### Step 2: The Experiment

In [None]:
# --- Run the Experiment ---
reset_postgres_table()

sizes = []
phases = []

# Connect manually with autocommit=True for VACUUM compatibility
conn = psycopg2.connect(**DB_PARAMS)
conn.autocommit = True 

try:
    # 1. Insert 200k Rows
    print("[Step 1] Inserting 200k rows...")
    with conn.cursor() as cur:
        cur.execute("INSERT INTO sensor_logs (payload) SELECT md5(random()::text) FROM generate_series(1, 200000);")
    
    size_1 = get_table_size(conn)
    sizes.append(size_1)
    phases.append("1. Initial")
    print(f"Table Size: {size_1:.2f} MB")

    # 2. Delete 100k Rows (The "Zombie" Phase)
    print("[Step 2] Deleting 100k rows...")
    with conn.cursor() as cur:
        cur.execute("DELETE FROM sensor_logs WHERE id <= 100000;")
    
    # No sleep is needed here: pg_relation_size reports the current size
    # of the table's file as soon as the statement has finished.
    
    size_2 = get_table_size(conn)
    sizes.append(size_2)
    phases.append("2. After Delete")
    print(f"Table Size: {size_2:.2f} MB")

    # 3. VACUUM FULL
    print("[Step 3] Running VACUUM FULL...")
    with conn.cursor() as cur:
        cur.execute("VACUUM FULL sensor_logs;")
        
    size_3 = get_table_size(conn)
    sizes.append(size_3)
    phases.append("3. After Vacuum")
    print(f"Table Size: {size_3:.2f} MB")

finally:
    conn.close()

#### Step 3: Visualization

In [None]:
# --- Visualization ---
plt.figure(figsize=(8, 5))
# Use hue=phases to avoid future seaborn warnings, set legend=False
sns.barplot(x=phases, y=sizes, hue=phases, palette=['blue', 'red', 'green'], legend=False)
plt.title('The Myth of Deletion (Disk Usage)')
plt.ylabel('Size on Disk (MB)')

# Add labels. psycopg2 returns NUMERIC values as Decimal, so convert to
# float before adding the float offset.
for i, v in enumerate(sizes):
    plt.text(i, float(v) + 0.1, f"{v:.2f} MB", ha='center', fontweight='bold')

plt.show()

#### Step 4: The Physics
**Why didn't the size shrink?** In Phase 2, half of the rows in the table were "Live" and half were "Dead." The disk blocks still held every byte, but half of the rows were marked "invisible."
- **Standard `VACUUM`**: Scans the table and marks the dead space as "reusable" for future inserts and updates. It returns space to the OS only when the empty pages sit at the very end of the file, which is not the case here because the deleted rows were the first ones inserted.
- **`VACUUM FULL`**: Creates a brand new copy of the table file, packs the live rows tightly, and deletes the old file. This is why the size dropped in Phase 3. The trade-off is that it holds an exclusive lock on the table while it runs.
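
The difference between the two commands can be illustrated with a small simulation. This is a deliberately simplified sketch: the hypothetical `ToyHeap` class treats the table file as a flat list of slots, ignores pages and indexes, and does not model the trailing-page truncation that a real `VACUUM` can perform.

```python
class ToyHeap:
    """Toy table file: a list of slots that only grows unless rewritten."""

    def __init__(self):
        # Each slot is None (free) or {"value": ..., "dead": bool}.
        self.slots = []

    def insert(self, value):
        row = {"value": value, "dead": False}
        # Reuse a freed slot if one exists; otherwise extend the file.
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = row
                return
        self.slots.append(row)

    def delete(self, predicate):
        # DELETE only flags matching rows as dead; no slot is released.
        for slot in self.slots:
            if slot is not None and predicate(slot["value"]):
                slot["dead"] = True

    def vacuum(self):
        # Plain VACUUM: dead slots become reusable, file length is unchanged.
        self.slots = [None if (s is not None and s["dead"]) else s
                      for s in self.slots]

    def vacuum_full(self):
        # VACUUM FULL: copy only live rows into a new, shorter file.
        self.slots = [s for s in self.slots if s is not None and not s["dead"]]

    def file_size(self):
        return len(self.slots)


heap = ToyHeap()
for i in range(10):
    heap.insert(i)
size_initial = heap.file_size()

heap.delete(lambda v: v < 5)
size_after_delete = heap.file_size()

heap.vacuum()
size_after_vacuum = heap.file_size()

heap.insert(99)                      # lands in a freed slot
size_after_reuse = heap.file_size()

heap.vacuum_full()
size_after_full = heap.file_size()

print(size_initial, size_after_delete, size_after_vacuum,
      size_after_reuse, size_after_full)  # 10 10 10 10 6
```

Only the final rewrite shortens the "file"; the plain vacuum merely lets the next insert avoid growing it.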

----

## 3. Experiment 10.2: The Small Files Problem (Compaction)
**The Concept**: In Data Lakes (S3, Parquet, Delta Lake), a common mistake is streaming data in tiny batches. This creates thousands of small files (KB size). Reading 1,000 files of 1KB each can be orders of magnitude slower than reading 1 file of 1MB, because every file adds a fixed cost for opening it and parsing its metadata, and on object storage such as S3 each file costs at least one extra network request.
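
A back-of-the-envelope cost model makes the fixed per-file cost visible. The numbers below are illustrative assumptions, not measurements: 20 ms of overhead per file (roughly one object-store request) and 100 MB/s of sequential throughput. The function name `estimated_read_seconds` is hypothetical.

```python
def estimated_read_seconds(n_files, total_bytes,
                           per_file_overhead_s, throughput_bytes_per_s):
    """Fixed cost per file plus the time to stream the bytes themselves."""
    return n_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


TOTAL_BYTES = 1_000_000        # about 1 MB of data in both layouts
PER_FILE_OVERHEAD_S = 0.02     # ASSUMPTION: 20 ms to open a file and read metadata
THROUGHPUT = 100_000_000       # ASSUMPTION: 100 MB/s sequential read

fragmented = estimated_read_seconds(1000, TOTAL_BYTES,
                                    PER_FILE_OVERHEAD_S, THROUGHPUT)
compacted = estimated_read_seconds(1, TOTAL_BYTES,
                                   PER_FILE_OVERHEAD_S, THROUGHPUT)

print(f"1,000 small files: {fragmented:.2f} s")
print(f"1 large file:      {compacted:.2f} s")
```

Under these assumptions the transfer time is identical in both cases; the gap comes entirely from paying the per-file overhead 1,000 times instead of once. On a local SSD the overhead per file is much smaller, so the gap narrows.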

#### Step 1: Hypothesis
We will write the exact same data (100k rows) in two ways:
1. **Fragmented**: 100 files (1,000 rows each).
2. **Compacted**: 1 file (100,000 rows).

Which one will DuckDB query faster?

#### Step 2: The Experiment

In [None]:
# Setup directories
DATA_DIR = "./data_hygiene"
if os.path.exists(DATA_DIR):
    shutil.rmtree(DATA_DIR)
os.makedirs(f"{DATA_DIR}/fragmented")
os.makedirs(f"{DATA_DIR}/compacted")

# Create a dummy DataFrame
df = pd.DataFrame({'id': range(100000), 'data': 'x' * 100})

print("Generating Fragmented Files (Please wait)...")
# Write 100 small files
for i in range(100):
    subset = df.iloc[i*1000 : (i+1)*1000]
    subset.to_parquet(f"{DATA_DIR}/fragmented/part_{i}.parquet")

print("Generating Compacted File...")
# Write 1 big file
df.to_parquet(f"{DATA_DIR}/compacted/full.parquet")

# Measurement
con = duckdb.connect()
times = []

# 1. Read Fragmented
# perf_counter() is a monotonic, high-resolution clock meant for timing code
start = time.perf_counter()
con.execute(f"SELECT COUNT(*) FROM read_parquet('{DATA_DIR}/fragmented/*.parquet')").fetchall()
t_frag = time.perf_counter() - start
times.append(t_frag)
print(f"Read Fragmented: {t_frag:.4f}s")

# 2. Read Compacted
start = time.perf_counter()
con.execute(f"SELECT COUNT(*) FROM read_parquet('{DATA_DIR}/compacted/full.parquet')").fetchall()
t_compact = time.perf_counter() - start
times.append(t_compact)
print(f"Read Compacted:  {t_compact:.4f}s")

shutil.rmtree(DATA_DIR)

#### Step 3: Visualization

In [None]:
plt.figure(figsize=(6, 4))
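labels = ['100 Small Files', '1 Big File']
# Pass hue explicitly (as in Experiment 10.1) to avoid seaborn's warning about palette without hue
sns.barplot(x=labels, y=times, hue=labels, palette=['orange', 'green'], legend=False)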
plt.title('The Cost of Fragmentation (Read Latency)')
plt.ylabel('Execution Time (seconds)')
plt.show()

#### Step 4: The Physics
Every Parquet file carries its own metadata, stored mainly in a footer at the end of the file.
- **Fragmented**: The engine had to perform 100 open() syscalls, parse 100 footers, and issue 100 separate read requests.
- **Compacted**: The engine opened 1 file, parsed 1 footer, and streamed the data sequentially.

Compaction is the process of merging these small files into larger ones (often 128MB - 1GB) to optimize read performance.
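
Before merging anything, a compaction job has to decide which small files belong together. One simple way to do that is a greedy grouping by target size, sketched below. The function `plan_compaction`, the file names, and the 4 MB file size are hypothetical; real table formats ship their own compaction planners.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files so each group stays at or under target_bytes.

    A file that is already larger than the target ends up in a group by itself.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Start a new group when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# ASSUMPTION: 100 hypothetical files of 4 MB each
small_files = {f"part_{i:03d}.parquet": 4 * MB for i in range(100)}

plan = plan_compaction(small_files, target_bytes=128 * MB)
print(len(plan), [len(g) for g in plan])  # 4 [32, 32, 32, 4]
```

Each group would then be rewritten as a single output file, after which the original small files can be removed.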

---

## 4. Experiment 10.3: Statistics Drift (Flying Blind)
**The Concept**: The Query Optimizer (Chapter 6) doesn't count rows before every query (that would be slow). Instead, it looks at a "Cheat Sheet" called Statistics. If you add/delete data but don't update the cheat sheet (via `ANALYZE`), the database will make terrible decisions because it thinks the table is empty or full when it isn't.

#### Step 1: Hypothesis
We will deceive Postgres. We will insert data, check the statistics, then delete data, and see if Postgres notices.

#### Step 2: The Experiment

In [None]:
reset_postgres_table() # Starts with autovacuum OFF

stats_log = []

with psycopg2.connect(**DB_PARAMS) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        
        # 1. Insert Data
        print("Inserting 100k rows...")
        cur.execute("INSERT INTO sensor_logs (payload) SELECT 'data' FROM generate_series(1, 100000);")
        
        # Manually ANALYZE so it knows about the inserts
        cur.execute("ANALYZE sensor_logs;")
        
        cur.execute("SELECT reltuples FROM pg_class WHERE relname = 'sensor_logs';")
        est_rows = cur.fetchone()[0]
        stats_log.append({'Phase': '1. After Insert', 'Real Count': 100000, 'DB Estimate': est_rows})
        
        # 2. Delete Everything (But don't tell the stats collector!)
        print("Deleting 100k rows (Secretly)...")
        cur.execute("DELETE FROM sensor_logs;")
        
        # Check Stats WITHOUT Analyzing
        cur.execute("SELECT reltuples FROM pg_class WHERE relname = 'sensor_logs';")
        est_rows = cur.fetchone()[0]
        stats_log.append({'Phase': '2. After Delete (Stale)', 'Real Count': 0, 'DB Estimate': est_rows})
        
        # 3. Analyze
        print("Running ANALYZE...")
        cur.execute("ANALYZE sensor_logs;")
        
        cur.execute("SELECT reltuples FROM pg_class WHERE relname = 'sensor_logs';")
        est_rows = cur.fetchone()[0]
        stats_log.append({'Phase': '3. After Analyze', 'Real Count': 0, 'DB Estimate': est_rows})

print("\nStats Log:")
print(pd.DataFrame(stats_log))

#### Step 3: Visualization

In [None]:
df_stats = pd.DataFrame(stats_log)

x = range(len(df_stats))
width = 0.35

plt.figure(figsize=(8, 5))
plt.bar([i - width/2 for i in x], df_stats['Real Count'], width, label='Real Row Count', color='green')
plt.bar([i + width/2 for i in x], df_stats['DB Estimate'], width, label='DB Estimate (pg_class.reltuples)', color='gray')

plt.xticks(x, df_stats['Phase'])
plt.title('Statistics Drift: When the DB Hallucinates')
plt.ylabel('Row Count')
plt.legend()
plt.show()

#### Step 4: The Physics
In Phase 2, the table was actually empty (Real Count = 0). However, the DB Estimate was still 100,000. The Optimizer bases its row estimates on these stored numbers, so a query joining this table to another could be planned as if 100k rows were coming, for example by choosing a Hash Join sized for rows that do not exist. `ANALYZE` samples the table and refreshes these statistics. In production, the `autovacuum` daemon usually runs `ANALYZE` automatically, but under heavy write load it can lag behind.
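
As a rough guide to when autovacuum would have refreshed the statistics on its own, Postgres compares the number of rows changed since the last `ANALYZE` against a threshold of `autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples`. The sketch below applies that formula with the default settings (50 and 0.1); the function names are hypothetical, and a real deployment may override these settings per table.

```python
def autoanalyze_threshold(reltuples, base_threshold=50, scale_factor=0.1):
    """Changed-row count that must be exceeded before auto-ANALYZE runs.

    Defaults mirror autovacuum_analyze_threshold (50) and
    autovacuum_analyze_scale_factor (0.1).
    """
    return base_threshold + scale_factor * reltuples


def needs_autoanalyze(changed_rows, reltuples,
                      base_threshold=50, scale_factor=0.1):
    """True if the changes since the last ANALYZE exceed the threshold."""
    return changed_rows > autoanalyze_threshold(
        reltuples, base_threshold, scale_factor)


# Experiment 10.3: 100k rows at the last ANALYZE, then 100k rows deleted.
threshold = autoanalyze_threshold(reltuples=100_000)
would_trigger = needs_autoanalyze(changed_rows=100_000, reltuples=100_000)

print(f"threshold: {threshold:.0f} changed rows")
print(f"auto-ANALYZE due: {would_trigger}")
```

With autovacuum left enabled, the mass delete in this experiment would have crossed the threshold many times over. Here the statistics stayed stale only because the table was created with `autovacuum_enabled = false`.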

---

## Key Takeaways
1. **Deletes are fake**: They only mark rows as dead. The space becomes reusable after a `VACUUM`, and is typically returned to the OS only by a rewrite such as `VACUUM FULL`.
2. **Small Files kill performance**: Always buffer your writes or run a compaction job to merge small files.
3. **Stats must be fresh**: A database with old statistics is like a driver using a map from 1990.