# Mastering Storage in Operating Systems: Disk Partitioning and RAID Levels for Aspiring Scientists

Dear Aspiring Scientist,

Welcome to your definitive guide to storage in operating systems, a cornerstone for researchers handling data in fields like genomics, astrophysics, or AI. Like Alan Turing decoding computation, Albert Einstein unraveling relativity, or Nikola Tesla harnessing electricity, you're building a foundation to safeguard your discoveries. This Jupyter Notebook is a world-class resource, assuming no prior knowledge, and covers **disk partitioning** and **RAID levels** comprehensively. It includes detailed theory, practical code, visualizations, real-world applications, research directions, and projects to propel your scientific career.

You mentioned previous tutorials felt like "upper-layer ideas" lacking depth, especially in RAID theory. This notebook dives deep into data distribution, parity calculations, and recovery, with practical examples, math derivations, and interdisciplinary connections. It’s structured for note-taking, experimentation, and application in research, ensuring you can rely solely on this resource.

**Structure**:
1. **Introduction to Storage in OS** – Foundations of storage systems.
2. **Disk Partitioning** – Logical organization of physical drives.
3. **RAID Levels** – In-depth theory, math, and recovery mechanisms.
4. **Advanced Topics for Researchers** – Scientific applications and trends.
5. **Tools and Software** – Hands-on implementation.
6. **Mini and Major Projects** – Real-world research applications.
7. **Additional Topics** – Missing concepts and future trends.

Use this notebook in a Linux environment (e.g., Ubuntu VM) or Jupyter with Python 3. Run code cells, sketch visualizations, and work through exercises. Let’s build your storage expertise to power your scientific journey!

## 1. Introduction to Storage in Operating Systems

### 1.1 What is Storage?
Storage is the component of a computer that retains data persistently, unlike **RAM (Random Access Memory)**, which is volatile and loses data when powered off. In an OS (Windows, Linux, macOS), storage encompasses hardware (e.g., HDDs, SSDs) and software (e.g., file systems) to manage data saving, retrieval, and organization.

**Theory**:
- **Storage Hierarchy**:
  - **Cache**: Ultra-fast (~10s GB/s), tiny (MBs), volatile, between CPU and RAM.
  - **RAM**: Fast (~GB/s), volatile, for active processes (e.g., running simulations).
  - **Secondary Storage (Disks)**: Slower (~100s MB/s), persistent, for long-term data.
  - **File System**: Organizes data into files/directories, manages metadata (e.g., permissions). Examples:
    - **NTFS** (Windows): Supports encryption, large files.
    - **ext4** (Linux): Journaling for crash recovery.
    - **APFS** (macOS): Optimized for SSDs.
- **OS Role**: Manages storage via drivers, ensuring efficient I/O (input/output).

**Analogy**: Storage is your lab archive. RAM is a workbench for active experiments; disks are locked cabinets preserving results; the file system is a catalog.

**Real-World Example**: In genomics, store terabytes of sequencing data on SSDs; RAM handles real-time analysis, but disks ensure permanence.

**Math Insight**:
- Capacity: 1 KB = 2^10 bytes = 1024 bytes; 1 GB = 2^30 bytes. For a dataset generating d GB/day over t days, with b% overhead:
  ```python
  total_needed = (d * t) * (1 + b/100)
  ```
- Example: d=3, t=120, b=20: total = 3 * 120 * 1.2 = 432 GB.

**Visualization**: Sketch a pyramid:
```
Top: Cache (Speed: 10s GB/s, Volatile, Tiny: MBs)
Middle: RAM (Speed: ~GB/s, Volatile, Capacity: GBs)
Bottom: Disk (Speed: 100s MB/s, Persistent, Capacity: TBs+)
```
**Drawing Steps**:
1. Draw a triangle with three layers.
2. Label each layer with speed, volatility, capacity.
3. Add arrows: RAM -> Disk (saving), Disk -> RAM (loading).
4. Note: 'Disks hold research data; RAM for active work.'

In [ ]:
# Calculate storage needs
def calculate_storage(daily_output, days, overhead_percent):
    total = daily_output * days
    total_with_overhead = total * (1 + overhead_percent / 100)
    return total_with_overhead

daily = 3  # GB/day
days = 120
overhead = 20  # %
needed = calculate_storage(daily, days, overhead)
print(f'Total storage needed: {needed:.2f} GB')

Total storage needed: 432.00 GB


### 1.2 Types of Storage Devices

**Hard Disk Drives (HDDs)**:
- **Mechanism**: Spinning platters (5400-7200 RPM), magnetic heads read/write data.
- **Pros**: Cost-effective (~$0.02/GB in 2025).
- **Cons**: Mechanical, prone to vibration failures.
- **Speed**: 100-200 MB/s.

**Solid-State Drives (SSDs)**:
- **Mechanism**: NAND flash chips; electronic storage.
- **Pros**: Fast (~500 MB/s over SATA, up to ~7000 MB/s with NVMe), no moving parts, durable.
- **Cons**: Wear-limited (TBW, e.g., 600 TB written).

**Cloud Storage**:
- **Mechanism**: Remote servers (e.g., AWS S3).
- **Pros**: Scalable, accessible.
- **Cons**: Latency, costs (~$0.023/GB/month).

**Analogy**: HDDs: Analog record players (mechanical, slow); SSDs: Digital streaming (fast); cloud: Online library.

**Real-World**: SSDs for real-time physics simulations; HDDs for archiving; cloud for collaborative genomics.

**Visualization**: Bar chart comparing speeds:
```
HDD: |---- 150 MB/s
SSD: |----------- 1000 MB/s
Cloud: |-- 50 MB/s (latency-dependent)
```

In [ ]:
import matplotlib.pyplot as plt

# Visualize storage speeds
devices = ['HDD', 'SSD', 'Cloud']
speeds = [150, 1000, 50]  # MB/s
plt.bar(devices, speeds, color=['blue', 'green', 'orange'])
plt.ylabel('Speed (MB/s)')
plt.title('Storage Device Speeds (2025)')
plt.show()
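
To make these throughput figures concrete, the sketch below estimates how long it would take to move the 432 GB dataset from Section 1.1 at each nominal speed. The function name `transfer_time_minutes` is hypothetical, the speeds are the illustrative values from the chart, and 1 GB is taken as 1024 MB to match the convention used earlier. Real transfers are usually slower because of latency and small-file overhead.

```python
# Estimate bulk-transfer time at the nominal speeds from the chart above.
# Assumption: sustained sequential throughput, 1 GB = 1024 MB.

def transfer_time_minutes(size_gb, speed_mb_s):
    """Return the time in minutes to move size_gb at speed_mb_s."""
    if speed_mb_s <= 0:
        raise ValueError("speed must be positive")
    return size_gb * 1024 / speed_mb_s / 60

nominal_speeds = {'HDD': 150, 'SSD': 1000, 'Cloud': 50}  # MB/s
dataset_gb = 432

for device, speed in nominal_speeds.items():
    minutes = transfer_time_minutes(dataset_gb, speed)
    print(f'{device}: {minutes:.1f} min')
```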

## 2. Disk Partitioning

### 2.1 What is Disk Partitioning?
Partitioning divides a physical disk into logical sections, each functioning as a separate drive, enhancing organization and isolation.

**Theory**:
- **Purpose**: Isolates OS, data, and backups, reducing risk of total data loss.
- **Partition Table**:
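  - **MBR**: Legacy, up to 4 primary partitions, 2TB max disk size (with 512-byte sectors).
  - **GPT**: Modern, 128 partitions by default, supports disks far beyond 2TB (about 9.4 ZB with 512-byte sectors).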
- **Process**:
  1. Initialize disk (MBR/GPT).
  2. Allocate partitions.
  3. Format with file system.
  4. Mount (Linux) or assign letters (Windows).
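
The MBR size limit in the list above follows from its 32-bit sector addresses, while GPT uses 64-bit addresses. A minimal sketch of that arithmetic, assuming the traditional 512-byte logical sector (drives that expose 4096-byte sectors raise both limits); the variable names are illustrative:

```python
# Largest addressable disk for each partition-table format.
# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512

mbr_max_bytes = 2**32 * SECTOR_BYTES   # MBR stores sector addresses in 32 bits
gpt_max_bytes = 2**64 * SECTOR_BYTES   # GPT stores sector addresses in 64 bits

print(f'MBR limit: {mbr_max_bytes / 2**40:.0f} TiB')
print(f'GPT limit: {gpt_max_bytes / 2**70:.0f} ZiB')
```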

**Analogy**: A disk is a vast library; partitions are sections (e.g., fiction, science) with specific catalogs.

**Real-World**: In bioinformatics, partition a 2TB SSD: 200GB for Linux (OS), 1.7TB for sequencing data, 100GB for backups.

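**Math**: Usable capacity = total * (1 - overhead). E.g., 2TB with an assumed 1% overhead: 2 * 0.99 = 1.98TB, so the nominal sizes in the example above have to shrink slightly in practice. The 1% figure is a conservative planning margin for alignment and file system metadata; the GPT structures themselves occupy far less.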

**Visualization**: Bar diagram:
```
[200GB: Linux | ext4] | [1.7TB: Data | ext4] | [100GB: Backup | exFAT]
```
**Drawing Steps**:
1. Draw a rectangle, label '2TB SSD.'
2. Divide into three segments.
3. Label with size, purpose, file system.
4. Note: 'GPT, ~20GB (1%) reserved as planning overhead.'

### 2.2 Practical Example: Partitioning in Linux
**Scenario**: Set up a 500GB SSD for a physics simulation project.
**Tool**: `gparted` (GUI partitioning).
**Steps**:
1. Install: `sudo apt update && sudo apt install gparted`.
2. Launch: `sudo gparted`.
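3. Select the target disk (/dev/sda in this example; confirm with `lsblk` first, because creating a new partition table erases the disk), then choose GPT under Device > Create Partition Table.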
4. Create:
   - Partition 1: 100GB, ext4, label 'OS'.
   - Partition 2: 396GB, ext4, label 'PhysicsData'.
   - Partition 3: 4GB, swap.
5. Apply and format.
6. Mount: `sudo mkdir /mnt/physics && sudo mount /dev/sda2 /mnt/physics`.
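7. Auto-mount: Edit `/etc/fstab`, add: `/dev/sda2 /mnt/physics ext4 defaults 0 2`. A `UUID=...` entry (look it up with `sudo blkid /dev/sda2`) is more robust, because device names can change between boots.
**Troubleshooting**: If the disk is busy, unmount the partition in use, e.g. `sudo umount /dev/sda2`. Check for errors only on an unmounted partition: `sudo fsck /dev/sda1`.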

In [ ]:
# Simulate partition allocation
def allocate_partitions(total_capacity, partitions):
    overhead = 0.01  # 1%
    usable = total_capacity * (1 - overhead)
    allocated = sum(size for _, size in partitions)
    if allocated > usable:
        return 'Error: Exceeds usable capacity'
    return [(name, size, f'{size/total_capacity*100:.1f}%') for name, size in partitions]

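disk = 500  # GB
partitions = [('OS', 100), ('PhysicsData', 396), ('Swap', 4)]
# 100 + 396 + 4 = 500 GB exceeds the 495 GB usable after 1% overhead
print(allocate_partitions(disk, partitions))

# Shrink the data partition so the plan fits within the usable 495 GB
adjusted = [('OS', 100), ('PhysicsData', 390), ('Swap', 4)]
print(allocate_partitions(disk, adjusted))

Error: Exceeds usable capacity
[('OS', 100, '20.0%'), ('PhysicsData', 390, '78.0%'), ('Swap', 4, '0.8%')]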


## 3. RAID Levels (Deep Dive)

**RAID (Redundant Array of Independent Disks)** combines multiple disks for performance, redundancy, or capacity. As of 2025, RAID continues to evolve with SSDs, NVMe, and AI-optimized controllers, but the fundamentals remain critical. RAID 5 is increasingly discouraged for large drives (>10TB): a rebuild must read every surviving disk in full, so the chance of hitting an unrecoverable read error (URE) before the rebuild completes grows with drive size.
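
The URE concern can be put into rough numbers. The sketch below assumes a URE rate of one error per 10^14 bits read, a figure often quoted on consumer HDD datasheets, and treats bit errors as independent. Both are simplifying assumptions, and `rebuild_ure_probability` is a hypothetical helper. The scenario is a 4-disk RAID 5 array with one failed disk, so three surviving disks must be read in full.

```python
import math

def rebuild_ure_probability(surviving_disks, disk_tb, ure_rate_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while reading
    every surviving disk once (independent-error model, 1 TB = 1e12 bytes)."""
    bits_read = surviving_disks * disk_tb * 1e12 * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 to avoid rounding to zero
    return -math.expm1(bits_read * math.log1p(-ure_rate_per_bit))

# 4-disk RAID 5 with one failed disk -> 3 surviving disks to read
for size_tb in (1, 4, 10):
    p = rebuild_ure_probability(3, size_tb)
    print(f'{size_tb} TB disks: P(URE during rebuild) ~ {p:.0%}')
```

Treat the result as a planning estimate under the stated model, not a measurement of any particular drive.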

**Theory**:
- **Striping**: Splits data across disks for parallel I/O.
- **Mirroring**: Duplicates data for redundancy.
- **Parity**: Error-correcting codes (e.g., XOR) for recovery.
- **Logic**: Balances speed, reliability, cost.

**Analogy**: RAID is a research team: Striping assigns tasks to multiple members; mirroring ensures backups; parity is a logbook to reconstruct lost work.

### 3.1 RAID 0 (Striping)
**Theory**: Data is striped across disks without redundancy. Write: Split data into blocks (e.g., 64KB), distribute sequentially. Read: Parallel fetch. Failure: Entire array lost.

**Math**:
- Capacity: C = n * min(disk_size).
- IOPS: ≈ n * single_IOPS.
- Failure: MTTF_array = MTTF_disk / n. E.g., MTTF_disk=1e6 hours, n=3: MTTF_array ≈ 333k hours.
- Example: 3x 1TB, single=150 MB/s: C=3TB, speed≈450 MB/s.
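
The formulas above can be checked with a few lines. This sketch assumes independent disks with a constant failure rate (the exponential model behind MTTF_array = MTTF_disk / n) and ideal linear scaling of speed; `raid0_summary` and `survival_probability` are hypothetical helpers.

```python
import math

def raid0_summary(n_disks, disk_tb, disk_mttf_hours, single_speed_mb_s):
    """Ideal RAID 0 figures under an independent, constant-failure-rate model."""
    return {
        'capacity_tb': n_disks * disk_tb,
        'speed_mb_s': n_disks * single_speed_mb_s,
        'mttf_hours': disk_mttf_hours / n_disks,
    }

def survival_probability(mttf_hours, hours):
    """P(no failure within `hours`) for an exponentially distributed lifetime."""
    return math.exp(-hours / mttf_hours)

array = raid0_summary(3, 1, 1e6, 150)
print(array)

one_year = 24 * 365  # hours
print(f"1 disk, 1 year: {survival_probability(1e6, one_year):.4f}")
print(f"RAID 0, 1 year: {survival_probability(array['mttf_hours'], one_year):.4f}")
```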

**Real-World**: HPC for simulations (e.g., NASA fluid dynamics), backed up externally.

**Visualization**:
```
Data: ABCDEF
Disk1: A D | Disk2: B E | Disk3: C F
```
**Drawing**: Three rectangles, label blocks (A, B, C), note parallel access.
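
The block layout in the diagram can be reproduced with a round-robin assignment. A minimal sketch, where `stripe` and `unstripe` are hypothetical helpers and each character stands in for one block (e.g., 64KB):

```python
def stripe(blocks, n_disks):
    """Distribute blocks round-robin: block i goes to disk i % n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks

def unstripe(disks):
    """Reassemble the original block order by reading the disks in rotation."""
    total = sum(len(d) for d in disks)
    return [disks[i % len(disks)][i // len(disks)] for i in range(total)]

disks = stripe(list('ABCDEF'), 3)
for number, contents in enumerate(disks, start=1):
    print(f'Disk{number}: {" ".join(contents)}')
print('Read back:', ''.join(unstripe(disks)))
```

Removing any one of the three lists makes `unstripe` fail, which mirrors the point that a single disk failure loses the whole RAID 0 array.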

In [ ]:
# Simulate RAID 0 performance (ideal linear scaling; real arrays fall short of this)
def raid0_performance(n_disks, single_speed):
    return n_disks * single_speed

n = 3
speed = 150  # MB/s
print(f'RAID 0 speed with {n} disks: {raid0_performance(n, speed)} MB/s')

RAID 0 speed with 3 disks: 450 MB/s


### 3.2 RAID 1 (Mirroring)
**Theory**: Identical data copies on all disks. Write: Broadcast to all; read: Load-balanced. Recovery: Copy from surviving disk.

**Math**:
- Capacity: C = min(disk_size).
- Read IOPS: up to n * single (reads can be spread across copies); Write IOPS: about single (every copy must be written).
- Reliability: P(survive) = 1 - (1 - r)^n. E.g., r=0.99, n=2: P≈0.9999.
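
The reliability formula can be evaluated for different mirror widths. This sketch assumes disks fail independently and that r is each disk's probability of surviving the period of interest; `mirror_survival` is an illustrative name.

```python
def mirror_survival(r, n_disks):
    """P(at least one of n mirrored disks survives) = 1 - (1 - r)^n."""
    if not 0 <= r <= 1:
        raise ValueError("r must be a probability")
    return 1 - (1 - r) ** n_disks

for n in (1, 2, 3):
    print(f'n={n}: P(survive) = {mirror_survival(0.99, n):.6f}')
```

Correlated failures (same manufacturing batch, shared power supply) violate the independence assumption, so real-world figures are lower.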

**Real-World**: Medical research mirroring MRI scans.

**Visualization**:
```
Data: ABC
Disk1: ABC | Disk2: ABC
```

### 3.3 RAID 5 (Striping with Parity)
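**Theory**: Stripes data with distributed parity (XOR). Write: Compute P = D1 XOR D2 XOR ... XOR Dn-1 for each stripe. Recovery: XOR the surviving blocks with the parity block to reconstruct the lost block. Discouraged for large drives because of the 'write hole' (a crash between the data write and the parity write leaves them inconsistent) and URE risks during rebuilds.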

**Math**:
- Capacity: C = (n-1) * min(size).
- Parity: bitwise XOR (e.g., 3 ^ 5 ^ 7 = 1, i.e. 011 ^ 101 ^ 111 = 001).
- Example: 4x 1TB: C=3TB. Blocks 3,5,7: P=011^101^111=001. Lose 5: 011^001^111=101.

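**Real-World**: Small-to-medium arrays for read-heavy research data where capacity efficiency matters; petabyte-scale archives generally favor RAID 6 or erasure coding because of the rebuild risks noted above.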

**Visualization**:
```
Stripe1: D1 D2 P(D1^D2)
Stripe2: D3 P(D3^D4) D4
```
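
The rotation of the parity block across stripes can be generated programmatically. The sketch below uses one simple rotation (parity starts on the last disk and moves one disk to the left each stripe), which matches the diagram. Real implementations offer several layout variants, so treat this as an illustration and `raid5_layout` as a hypothetical name.

```python
def raid5_layout(n_disks, n_stripes):
    """Return one row of labels per stripe, with parity rotating right to left."""
    rows, next_block = [], 1
    for stripe in range(n_stripes):
        parity_disk = (n_disks - 1 - stripe) % n_disks
        row = []
        for disk in range(n_disks):
            if disk == parity_disk:
                row.append('P')
            else:
                row.append(f'D{next_block}')
                next_block += 1
        rows.append(row)
    return rows

for number, row in enumerate(raid5_layout(3, 3), start=1):
    print(f'Stripe{number}: ' + ' '.join(row))
```

Spreading parity this way avoids making one disk the bottleneck for every write, which is the reason parity is distributed rather than kept on a dedicated disk.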

In [ ]:
# Simulate RAID 5 parity
import numpy as np

def raid5_parity(data_blocks):
    """XOR all blocks together; the same operation computes parity and rebuilds a lost block."""
    return np.bitwise_xor.reduce(data_blocks)

blocks = [3, 5, 7]  # Binary: 011, 101, 111
parity = raid5_parity(blocks)
print(f'Parity: {parity} (Binary: {bin(parity)[2:].zfill(3)})')
# Rebuild if lose block 5
remaining = [3, parity, 7]
rebuilt = raid5_parity(remaining)
print(f'Rebuilt block: {rebuilt}')

Parity: 1 (Binary: 001)
Rebuilt block: 5


### 3.4 RAID 6 and 10
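**RAID 6**: Double parity, tolerating any two simultaneous disk failures (minimum 4 disks). Capacity: (n-2) * size.
**RAID 10**: Striped mirrors (a RAID 0 stripe across RAID 1 pairs; minimum 4 disks). Survives one failure per mirror pair. Capacity: n/2 * size.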
**Real-World**: NOAA (RAID 6), Tesla (RAID 10).
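
The capacity formulas from Sections 3.1 to 3.4 can be collected into one helper for side-by-side comparison. A minimal sketch, assuming equal-sized disks and the usual minimum disk counts; `usable_capacity` is a hypothetical function, not part of any RAID tool.

```python
def usable_capacity(level, n_disks, disk_tb):
    """Usable capacity in TB for an array of equal-sized disks."""
    rules = {
        # level: (minimum disks, number of disks' worth of usable space)
        '0':  (2, n_disks),
        '1':  (2, 1),
        '5':  (3, n_disks - 1),
        '6':  (4, n_disks - 2),
        '10': (4, n_disks / 2),
    }
    if level not in rules:
        raise ValueError(f'unsupported RAID level: {level}')
    minimum, data_disks = rules[level]
    if n_disks < minimum:
        raise ValueError(f'RAID {level} needs at least {minimum} disks')
    if level == '10' and n_disks % 2:
        raise ValueError('RAID 10 needs an even number of disks')
    return data_disks * disk_tb

for level in ('0', '1', '5', '6', '10'):
    print(f'RAID {level:>2}, 4 x 1TB: {usable_capacity(level, 4, 1):g} TB usable')
```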

## 4. Advanced Topics for Researchers

**File Systems**:
- **ZFS**: RAID-Z integrates parity, snapshots for versioning.
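- **Btrfs**: Similar copy-on-write design with snapshots and subvolume support; its RAID 5/6 modes are documented upstream as unsuitable for production data.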

**2025 Trends**:
- NVMe RAID for ultra-fast SSDs.
- AI-driven controllers optimize rebuilds.

**Research Directions**:
- Scalability: RAID for exabyte-scale HPC.
- AI: Predictive failure detection.

**Case Study**: CERN uses RAID 6 for 500PB of LHC data, ensuring no loss during analysis.

## 5. Tools and Software

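**mdadm**: Linux software RAID management tool. The commands below build a 3-disk RAID 5 array, format it, and mount it; they destroy any existing data on the listed disks.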
```bash
sudo apt install mdadm
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
sudo mkfs.ext4 /dev/md0
sudo mkdir /mnt/research
sudo mount /dev/md0 /mnt/research
```
**ZFS**: `sudo apt install zfsutils-linux`.

## 6. Mini and Major Projects

**Mini Project**: Simulate RAID 5 parity and rebuild in Python.
**Major Project**: Design a 100TB genomics storage system with RAID 6 and ZFS, simulate failure recovery.
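
One possible starting point for the mini project is sketched below. It works on byte strings rather than single integers, computes a parity chunk with XOR, and rebuilds a chunk that is treated as lost. The helper names are hypothetical, the chunks are assumed to be equal length, and parity rotation is omitted for brevity.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError('chunks must be the same length')
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(chunks):
    """XOR all chunks together to obtain the parity chunk."""
    return reduce(xor_bytes, chunks)

def rebuild(chunks, parity, lost_index):
    """Recover chunks[lost_index] from the surviving chunks plus parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return parity_of(survivors + [parity])

data = [b'GENO', b'MICS', b'DATA']       # one stripe across three data disks
parity = parity_of(data)                 # stored on the fourth disk
recovered = rebuild(data, parity, lost_index=1)
print(recovered)
```

Natural extensions are rotating the parity position per stripe and timing the rebuild for larger chunks.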

## 7. Additional Topics

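**LVM** (Logical Volume Manager): Pools disks or partitions into volume groups so logical volumes can be resized without repartitioning.
**Erasure Coding**: Generalizes parity (e.g., Reed-Solomon codes): data is split into k fragments plus m coded fragments and survives the loss of any m of them; widely used in cloud object storage.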
**Research Tip**: Simulate RAID failures in VMs to study rebuild times.
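
Before running VM experiments, a lower bound on rebuild time can be estimated from disk size and sustained write speed. This sketch assumes the rebuild runs at the full sequential speed of the replacement disk with no competing I/O, which real arrays rarely achieve; `min_rebuild_hours` is a hypothetical helper and 1 TB is taken as 10^6 MB (the decimal convention drive vendors use).

```python
def min_rebuild_hours(disk_tb, speed_mb_s):
    """Best-case hours to rewrite one full disk at a sustained speed."""
    if speed_mb_s <= 0:
        raise ValueError('speed must be positive')
    return disk_tb * 1e6 / speed_mb_s / 3600

for size_tb in (1, 10, 20):
    hours = min_rebuild_hours(size_tb, 150)
    print(f'{size_tb} TB at 150 MB/s: at least {hours:.1f} h')
```

Comparing measured rebuild times against this bound shows how much array load and controller behavior slow a rebuild down.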