# Partitions in Databases

## Learning Objectives
- Understand what partitions are and why they're useful
- Learn how the partitioning process works
- Learn how to create partitioned tables
- Understand how partitions improve query performance


## 1. Introduction to Partitions

### What is a Partition?

**Partitioning** divides a large table into smaller, more manageable pieces called **partitions**. Think of it like organizing a filing cabinet: instead of keeping all documents in one big drawer, you sort them into separate drawers by category (by year, by department, and so on).

### Why Use Partitions?

1. **Performance**: Queries run faster because the database scans only the relevant partitions and skips irrelevant ones entirely
2. **Manageability**: Smaller chunks of data are easier to maintain and manage
3. **Maintenance**: Operations such as backups and deletes can target specific partitions
4. **Storage**: Different partitions can be stored in different locations

### Real-World Analogy:
Imagine you have a library with millions of books. Instead of searching through all books every time, you organize them by:
- **Date**: Books from 2020, 2021, 2022 in separate sections
- **Category**: Fiction, Non-fiction, Science in separate sections
- **Location**: Books from different regions in separate sections

When someone asks for "books from 2022", you only need to check the 2022 section, not the entire library!

### Common Partitioning Strategies:

1. **Range Partitioning**: Partition by ranges of an ordered value (e.g., monthly or yearly date ranges)
2. **List Partitioning**: Partition by specific values (e.g., by country, by status)
3. **Hash Partitioning**: Partition by a hash of the key (spreads rows roughly evenly when key values are varied)
4. **Composite Partitioning**: Combination of multiple strategies
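
To make the first three strategies concrete, here is a minimal Python sketch of how each one could map a row to a partition. The function names, boundary dates, and region names are illustrative assumptions, not any database's actual implementation.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical range boundaries: each partition holds dates strictly below its bound.
RANGE_BOUNDS = [date(2023, 2, 1), date(2023, 3, 1), date(2023, 4, 1)]
RANGE_NAMES = ["p2023_01", "p2023_02", "p2023_03"]

# Hypothetical list partitions: each value belongs to exactly one partition.
LIST_MAP = {"North": "p_north", "South": "p_south", "West": "p_west", "East": "p_east"}


def range_partition(d: date) -> str:
    """Return the first partition whose upper bound is greater than d."""
    idx = bisect_right(RANGE_BOUNDS, d)
    if idx == len(RANGE_NAMES):
        raise ValueError(f"no partition accepts {d}")
    return RANGE_NAMES[idx]


def list_partition(region: str) -> str:
    """Look up the partition that lists this exact value."""
    return LIST_MAP[region]


def hash_partition(key: int, num_partitions: int = 4) -> int:
    """Spread integer keys across partitions using a modulo."""
    return key % num_partitions


print(range_partition(date(2023, 2, 20)))  # p2023_02
print(list_partition("West"))              # p_west
print(hash_partition(10))                  # 2
```

Composite partitioning would simply chain two of these functions, for example range first and hash second.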


## 2. The Process of Partitioning

### How Partitioning Works:

1. **Choose Partition Key**: Select the column(s) you want to partition by
   - Common choices: Date columns, region, status
   - Should be frequently used in WHERE clauses

2. **Define Partition Boundaries**: Decide how to split the data
   - For dates: Monthly, quarterly, or yearly
   - For regions: By country, state, or city
   - For status: By different status values

3. **Data Distribution**: The database automatically routes each row to the correct partition
   - When you INSERT data, the database evaluates the partition key to determine which partition the row belongs to
   - If no partition accepts the value, the insert is typically rejected unless a catch-all (default) partition exists

4. **Query Processing**: When a query filters on the partition key, the database works out which partitions can contain matching rows
   - Only those partitions are scanned; the rest are skipped
   - This is called "partition pruning" or "partition elimination"
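
Steps 3 and 4 can be sketched with a toy in-memory table. This is only an illustration under simplifying assumptions: the `PartitionedTable` class, its method names, and the sample rows are all hypothetical, and a real engine does far more work than a dictionary lookup.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy in-memory table partitioned by the 'year' column (illustrative only)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key -> list of rows

    def insert(self, row: dict) -> None:
        # Step 3: route the row to its partition based on the partition key.
        self.partitions[row["year"]].append(row)

    def select_by_year(self, year: int):
        # Step 4: pruning, only the matching partition is scanned.
        scanned = self.partitions.get(year, [])
        return list(scanned), len(scanned)

    def select_by_product(self, product: str):
        # No filter on the partition key: every partition must be scanned.
        rows, scanned = [], 0
        for part in self.partitions.values():
            scanned += len(part)
            rows.extend(r for r in part if r["product"] == product)
        return rows, scanned


table = PartitionedTable()
for i in range(6):
    table.insert({"id": i, "year": 2021 + i % 3, "product": "A" if i % 2 == 0 else "B"})

rows, scanned = table.select_by_year(2023)
print(len(rows), scanned)  # 2 2
rows, scanned = table.select_by_product("A")
print(len(rows), scanned)  # 3 6
```

The second query returns the right rows but has to read all six, which is why the partition key should match the columns your queries filter on.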

### Example Flow:

```
Large Table (1 million rows)
    ↓
Partition by Year
    ↓
┌─────────────┬─────────────┬─────────────┐
│ 2021 Data   │ 2022 Data   │ 2023 Data   │
│ (300K rows) │ (350K rows) │ (350K rows) │
└─────────────┴─────────────┴─────────────┘

Query: SELECT * FROM table WHERE year = 2023
    ↓
Database only scans 2023 partition (350K rows)
Instead of scanning entire table (1M rows)!
```

### Key Benefits of This Process:

- **Faster Queries**: Only relevant partitions are scanned
- **Parallel Processing**: Engines that support parallel execution can process different partitions simultaneously
- **Easier Maintenance**: Can drop/add partitions without affecting others
- **Better Statistics**: Database maintains statistics per partition


## 3. Creating Partitioned Tables

### Syntax Overview:

Syntax differs between database engines, but the concept is the same. The examples below use MySQL-like declarative syntax; check your engine's documentation for the exact form.

### Example 1: Range Partitioning by Date (Monthly)

```sql
-- Create a table partitioned by month (MySQL syntax).
-- RANGE COLUMNS is used because plain RANGE requires an integer expression, not a DATE.
CREATE TABLE sales_data (
    sale_id INT,
    sale_date DATE,
    product_name VARCHAR(100),
    amount DECIMAL(10, 2),
    region VARCHAR(50)
)
PARTITION BY RANGE COLUMNS (sale_date) (
    PARTITION p2023_01 VALUES LESS THAN ('2023-02-01'),  -- also holds any earlier dates
    PARTITION p2023_02 VALUES LESS THAN ('2023-03-01'),
    PARTITION p2023_03 VALUES LESS THAN ('2023-04-01'),
    PARTITION p2023_04 VALUES LESS THAN ('2023-05-01')
    -- Add more partitions as needed; rows dated 2023-05-01 or later are rejected until you do
);
```

### Example 2: List Partitioning by Region

```sql
-- Create a table partitioned by region (MySQL syntax).
-- LIST COLUMNS is used because plain LIST requires integer values, not strings.
CREATE TABLE customer_data (
    customer_id INT,
    customer_name VARCHAR(100),
    region VARCHAR(50),
    registration_date DATE
)
PARTITION BY LIST COLUMNS (region) (
    PARTITION p_north VALUES IN ('North', 'Northeast'),
    PARTITION p_south VALUES IN ('South', 'Southeast'),
    PARTITION p_west VALUES IN ('West', 'Northwest'),
    PARTITION p_east VALUES IN ('East')  -- each value may appear in only one partition
);
```

### Example 3: Hash Partitioning

```sql
-- Create a hash partitioned table
CREATE TABLE user_data (
    user_id INT,
    username VARCHAR(50),
    email VARCHAR(100),
    created_date DATE
)
PARTITION BY HASH (user_id)
PARTITIONS 4;  -- Creates 4 partitions
```
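
How evenly a hash scheme spreads rows depends on the key values. The sketch below assumes the simplest possible placement, key modulo partition count, which is how basic hash schemes such as MySQL's `HASH` are documented to behave; the `partition_for` helper and the ID ranges are made up for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 4


def partition_for(user_id: int) -> int:
    # Modulo-based placement; an assumption that mirrors simple hash schemes.
    return user_id % NUM_PARTITIONS


# Sequential IDs spread evenly across the four partitions.
sequential = Counter(partition_for(uid) for uid in range(1, 1001))
print(sorted(sequential.items()))  # [(0, 250), (1, 250), (2, 250), (3, 250)]

# IDs that are all multiples of 4 collapse into a single partition.
skewed = Counter(partition_for(uid) for uid in range(4, 4001, 4))
print(sorted(skewed.items()))  # [(0, 1000)]
```

The second case shows why hash partitioning is only "even" when the key values themselves are varied.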

### Snowflake-Specific Note:

In **Snowflake**, partitioning is handled automatically through **micro-partitions**. You don't create partitions manually, but on very large tables you can define a **clustering key** to influence how rows are grouped into micro-partitions:

```sql
-- Snowflake: Create table with clustering key
CREATE TABLE sales_data (
    sale_id INT,
    sale_date DATE,
    product_name VARCHAR(100),
    amount DECIMAL(10, 2),
    region VARCHAR(50)
)
CLUSTER BY (sale_date, region);  -- Groups similar values together so micro-partitions can be skipped
```

### Best Practices for Creating Partitions:

1. **Choose the Right Column**: Select columns frequently used in WHERE clauses
2. **Avoid Too Many Partitions**: Too many small partitions can hurt performance
3. **Avoid Too Few Partitions**: Too few large partitions don't provide benefits
4. **Consider Data Distribution**: Ensure data is evenly distributed across partitions
5. **Plan for Growth**: Design partitions to accommodate future data
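
One way to check practice 4 is to compare the largest partition with the average. The `skew_ratio` helper and the row counts below are hypothetical; in a real system you would read partition sizes from the engine's catalog or statistics views.

```python
def skew_ratio(partition_rows: dict) -> float:
    """Largest partition size divided by the mean partition size (1.0 = perfectly even)."""
    sizes = list(partition_rows.values())
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


even = {"p1": 250_000, "p2": 250_000, "p3": 250_000, "p4": 250_000}
lopsided = {"p1": 700_000, "p2": 100_000, "p3": 100_000, "p4": 100_000}

print(skew_ratio(even))      # 1.0
print(skew_ratio(lopsided))  # 2.8
```

A ratio well above 1.0 suggests that one partition is doing most of the work and the key or boundaries may need rethinking.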


## 4. Performance Benefits of Partitions

### How Partitions Improve Performance:

#### 1. **Partition Pruning (Partition Elimination)**

When you query a partitioned table with a WHERE clause on the partition key, the database automatically skips irrelevant partitions.

**Example:**
```sql
-- Without partitioning (and without an index on sale_date): scans the entire table (1 million rows)
SELECT * FROM sales_data WHERE sale_date = '2023-03-15';

-- With monthly partitioning: the same query scans only the March 2023 partition (50K rows)
-- Database automatically skips all other partitions!
```

**Performance Gain**: Only 50K of the 1M rows are scanned, about **20x less data**. The actual speedup depends on indexes, caching, and I/O.
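
Pruning for a date range can be sketched as a binary search over the partition upper bounds. This is a simplified illustration, assuming monthly `VALUES LESS THAN` boundaries; the `prune` function and the partition names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Upper bounds (exclusive) of monthly partitions, mirroring VALUES LESS THAN.
BOUNDS = [date(2023, 2, 1), date(2023, 3, 1), date(2023, 4, 1), date(2023, 5, 1)]
NAMES = ["p2023_01", "p2023_02", "p2023_03", "p2023_04"]


def prune(start: date, end: date) -> list:
    """Return the partitions that can contain rows with start <= sale_date <= end."""
    first = bisect_right(BOUNDS, start)
    last = bisect_right(BOUNDS, end)
    # Clamp in case the range runs past the last defined partition.
    return NAMES[first:min(last, len(NAMES) - 1) + 1]


print(prune(date(2023, 3, 15), date(2023, 3, 15)))  # ['p2023_03']
print(prune(date(2023, 1, 1), date(2023, 3, 31)))   # ['p2023_01', 'p2023_02', 'p2023_03']
```

Every partition not in the returned list is never read, which is where the saving comes from.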

#### 2. **Parallel Processing**

Engines that support parallel query execution can process different partitions simultaneously on different CPU cores or threads.

```
Query: SELECT SUM(amount) FROM sales_data WHERE sale_date BETWEEN '2023-01-01' AND '2023-03-31'

Without Partitioning:
└── Single thread scans entire table sequentially

With Partitioning:
├── Thread 1: Processes Jan 2023 partition
├── Thread 2: Processes Feb 2023 partition
└── Thread 3: Processes Mar 2023 partition
    └── Results combined at the end
```
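
The split-then-combine pattern in the diagram can be sketched in Python with a thread pool. The partition contents below are made-up sample amounts, and this only illustrates the structure: in CPython, threads will not actually speed up CPU-bound work like this, whereas a database engine runs its workers truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-partition data: partition name -> list of sale amounts.
partitions = {
    "p2023_01": [100.0, 250.0],
    "p2023_02": [150.0],
    "p2023_03": [200.0, 50.0, 25.0],
}


def partial_sum(amounts: list) -> float:
    """Aggregate one partition independently of the others."""
    return sum(amounts)


# Each partition is summed by its own worker; partial results are combined at the end.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions.values()))

total = sum(partials)
print(partials)  # [350.0, 150.0, 275.0]
print(total)     # 775.0
```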

#### 3. **Index Efficiency**

Indexes on partitioned tables can be smaller and more efficient:
- With local (per-partition) indexes, each partition has its own index
- Smaller indexes mean faster lookups within a partition
- Less memory is needed to keep the relevant index cached

#### 4. **Maintenance Operations**

Operations like DELETE, UPDATE, or DROP can target specific partitions:

```sql
-- Remove a whole month by dropping its partition (very fast, but irreversible).
-- Assumes sales_data has a partition named p2022_01.
ALTER TABLE sales_data DROP PARTITION p2022_01;

-- Row-by-row alternative, which is much slower on a large table:
-- DELETE FROM sales_data WHERE sale_date BETWEEN '2022-01-01' AND '2022-01-31';
```
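
The difference between the two approaches can be sketched with plain Python containers. The `drop_partition` and `delete_where` helpers and the row counts are hypothetical; the point is only that dropping removes a whole unit at once, while a row-level delete has to examine rows.

```python
# Toy table: partition name -> rows (illustrative data only).
partitions = {
    "p2022_01": [{"sale_id": i} for i in range(1000)],
    "p2023_01": [{"sale_id": i} for i in range(1000, 1500)],
}


def drop_partition(table: dict, name: str) -> int:
    """Remove a whole partition in one step and report how many rows went with it."""
    return len(table.pop(name))


def delete_where(rows: list, predicate) -> tuple:
    """Row-by-row delete: every row must be examined."""
    kept = [r for r in rows if not predicate(r)]
    return kept, len(rows)  # rows kept, rows examined


removed = drop_partition(partitions, "p2022_01")
print(removed, list(partitions))  # 1000 ['p2023_01']

flat_table = [{"sale_id": i} for i in range(1500)]
kept, examined = delete_where(flat_table, lambda r: r["sale_id"] < 1000)
print(len(kept), examined)  # 500 1500
```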

### Performance Comparison Example:

**Scenario**: Table with 10 million rows, partitioned by month (12 partitions of roughly equal size)

| Operation | Without Partitioning | With Partitioning | Improvement |
|-----------|---------------------|-------------------|-------------|
| Query single month | Scan 10M rows | Scan ~833K rows | **~12x fewer rows scanned** |
| Query 3 months | Scan 10M rows | Scan ~2.5M rows | **~4x fewer rows scanned** |
| Delete old data | Scan and delete matching rows | Drop partition | **Typically far faster** |
| Backup | Full table | Per partition | **More flexible** |
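
The row counts in the first two rows follow from simple arithmetic, assuming the 12 partitions are equal in size. The `rows_scanned` helper is just for illustration.

```python
TOTAL_ROWS = 10_000_000
NUM_PARTITIONS = 12  # one per month, assumed equal in size


def rows_scanned(partitions_touched: int) -> int:
    """Rows read when only the touched partitions are scanned."""
    return TOTAL_ROWS * partitions_touched // NUM_PARTITIONS


for months in (1, 3, 12):
    scanned = rows_scanned(months)
    print(months, scanned, round(TOTAL_ROWS / scanned, 1))
# 1 833333 12.0
# 3 2500000 4.0
# 12 10000000 1.0
```

The ratio is a reduction in rows read, not a guaranteed speedup: a query that touches all 12 partitions gains nothing from pruning.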

### When Partitions Help Most:

✅ **Highly Beneficial:**
- Large tables (millions of rows)
- Queries frequently filter by partition key
- Time-series data (dates, timestamps)
- Data with clear logical divisions

❌ **Less Beneficial:**
- Small tables (< 100K rows)
- Queries don't filter by partition key
- Random access patterns
- Partition key values that change frequently (updates may have to move rows between partitions)

### Important Considerations:

1. **Partition Key Selection**: Choose columns used in WHERE clauses
2. **Partition Size**: As a rough rule of thumb, aim for partitions with 100K - 10M rows each
3. **Query Patterns**: Design partitions based on how you query data
4. **Overhead**: Too many partitions can create overhead
5. **Maintenance**: Monitor and maintain partition statistics


```sql
-- Example: Creating a simple partitioned table
-- MySQL-style syntax is shown here; syntax varies by database

-- Step 1: Create a partitioned table by date (monthly)
CREATE TABLE monthly_sales (
    sale_id INT,
    sale_date DATE,
    product_id INT,
    quantity INT,
    price DECIMAL(10, 2)
)
PARTITION BY RANGE COLUMNS (sale_date) (
    PARTITION p2023_01 VALUES LESS THAN ('2023-02-01'),
    PARTITION p2023_02 VALUES LESS THAN ('2023-03-01'),
    PARTITION p2023_03 VALUES LESS THAN ('2023-04-01')
);

-- Step 2: Insert data (database automatically routes to correct partition)
INSERT INTO monthly_sales VALUES
(1, '2023-01-15', 101, 5, 100.00),  -- Goes to p2023_01
(2, '2023-02-20', 102, 3, 150.00),  -- Goes to p2023_02
(3, '2023-03-10', 103, 2, 200.00);  -- Goes to p2023_03

-- Step 3: Query with partition pruning
-- Database only scans p2023_02 partition, skips others!
SELECT * FROM monthly_sales
WHERE sale_date BETWEEN '2023-02-01' AND '2023-02-28';
```


```sql
-- Example: List partitioning by region
-- MySQL-style syntax is shown here; LIST COLUMNS allows string values
CREATE TABLE regional_customers (
    customer_id INT,
    customer_name VARCHAR(100),
    region VARCHAR(50),
    registration_date DATE
)
PARTITION BY LIST COLUMNS (region) (
    PARTITION p_north_america VALUES IN ('USA', 'Canada'),
    PARTITION p_europe VALUES IN ('UK', 'Germany', 'France'),
    PARTITION p_asia VALUES IN ('India', 'China', 'Japan')
);

-- Insert data
INSERT INTO regional_customers VALUES
(1, 'John Doe', 'USA', '2023-01-15'),      -- Goes to p_north_america
(2, 'Jane Smith', 'UK', '2023-02-20'),     -- Goes to p_europe
(3, 'Raj Kumar', 'India', '2023-03-10');   -- Goes to p_asia

-- Query - only scans p_europe partition
SELECT * FROM regional_customers WHERE region = 'UK';
```


## Practice Exercise

### Questions to Think About:

1. **When would you use range partitioning vs list partitioning?**
   - Range: When data has natural ordering (dates, numbers)
   - List: When data has distinct categories (regions, status values)

2. **What happens if you query without the partition key in WHERE clause?**
   - Database scans ALL partitions (still works, but slower)

3. **Can you partition by multiple columns?**
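   - Yes. Many engines accept a multi-column partition key, and composite partitioning (sub-partitioning) combines two strategies, such as range by date and then hash by customer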

4. **What's the downside of having too many partitions?**
   - Overhead in managing many small partitions
   - Query planning becomes more complex
   - Storage overhead for partition metadata
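
For question 3, composite partitioning can be pictured as applying two strategies in sequence. The sketch below assumes range-by-year followed by a modulo hash on a customer ID; the `composite_partition` function and its naming scheme are hypothetical.

```python
from datetime import date


def composite_partition(sale_date: date, customer_id: int, buckets: int = 4) -> str:
    """Range by year first, then hash the customer ID into a sub-partition."""
    return f"p{sale_date.year}_h{customer_id % buckets}"


print(composite_partition(date(2023, 3, 10), 42))  # p2023_h2
print(composite_partition(date(2022, 7, 1), 42))   # p2022_h2
print(composite_partition(date(2023, 3, 10), 7))   # p2023_h3
```

Note how quickly the partition count grows: three years with four buckets each already gives twelve partitions, which ties back to question 4.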

