# Indexes in Databases - Snowflake Perspective

## Learning Objectives
- Understand what indexes are and why they're important
- Learn about different types of indexes
- Understand heap structure and clustered indexes
- Compare clustered vs non-clustered indexes
- Learn about composite, unique, and filtered indexes
- Understand columnstore vs rowstore indexes
- Learn how to choose the right indexing strategy
- Understand index monitoring and fragmentation
- Learn about execution plans and indexing strategies
- Understand Snowflake's unique approach to indexing


## 1. Introduction to Indexes

An **index** is a database object that improves the speed of data retrieval operations on a database table. Think of it like an index in a book - instead of reading every page to find a topic, you can look it up in the index and go directly to the relevant page.

### Why Do We Need Indexes?

1. **Performance**: Dramatically speeds up SELECT queries
2. **Efficiency**: Reduces the amount of data that needs to be scanned
3. **Sorting**: Helps with ORDER BY operations
4. **Joins**: Improves join performance
5. **Uniqueness**: Enforces unique constraints

### Trade-offs:
- **Storage**: Indexes consume additional storage space
- **Write Performance**: INSERT, UPDATE, DELETE operations may be slower as indexes need to be maintained
- **Maintenance**: Indexes require maintenance and can become fragmented
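
To make the benefit and the trade-off concrete, here is a minimal Python sketch. It is not how any particular database implements indexes; the `rows` data and the helper names (`full_scan`, `index_lookup`, `insert`) are made up for illustration. It contrasts a full scan with a lookup through a sorted index, and shows that every write must also update the index.

```python
import bisect

# Hypothetical table: (emp_id, name) rows stored in arrival order.
rows = [(5, "John"), (2, "Jane"), (8, "Bob"), (1, "Alice"), (9, "Charlie")]

def full_scan(rows, emp_id):
    """Without an index: inspect rows one by one until a match is found."""
    for row in rows:
        if row[0] == emp_id:
            return row
    return None

# The index: sorted (key, row position) pairs kept separately from the table.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
keys = [k for k, _ in index]

def index_lookup(rows, emp_id):
    """With an index: binary-search the keys, then jump straight to the row."""
    i = bisect.bisect_left(keys, emp_id)
    if i < len(keys) and keys[i] == emp_id:
        return rows[index[i][1]]
    return None

def insert(rows, row):
    """Writes pay the maintenance cost: the index is updated along with the table."""
    rows.append(row)
    i = bisect.bisect_left(keys, row[0])
    keys.insert(i, row[0])
    index.insert(i, (row[0], len(rows) - 1))
```

The lookup needs a number of comparisons proportional to the logarithm of the row count instead of the row count itself, while `insert` does extra work to keep the index sorted.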

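### Important Note About Snowflake:
Snowflake's standard tables **do not use traditional indexes** like those in SQL Server or Oracle. Instead, Snowflake relies on:
- **Micro-partitions**: Data is automatically divided into small storage units, each with min/max metadata per column
- **Clustering Keys**: Similar in purpose to clustered indexes, but optional and maintained in the background
- **Automatic Optimization**: The optimizer uses micro-partition metadata to skip data that cannot match a query's filters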

However, understanding traditional index concepts helps you understand how databases work and how Snowflake's architecture achieves similar benefits.


## 2. Types of Indexes

### Common Index Types:

1. **Clustered Index**
   - Determines the physical order of data in a table
   - Only one per table
   - Data is stored in sorted order

2. **Non-Clustered Index**
   - Separate structure from the table data
   - Contains pointers to the actual data
   - Multiple can exist per table

3. **Composite Index**
   - Index on multiple columns
   - Order of columns matters

4. **Unique Index**
   - Ensures uniqueness of values
   - Can be clustered or non-clustered

5. **Filtered/Partial Index**
   - Index on a subset of rows
   - Only indexes rows that meet a condition

6. **Columnstore Index**
   - Stores data column-wise instead of row-wise
   - Excellent for analytical queries

### Snowflake's Approach:
- **Clustering Keys**: Similar to clustered indexes but optional
- **Automatic Micro-partitioning**: Data is automatically partitioned
- **Columnar Storage**: Snowflake uses columnar storage by default
- **No Manual Indexes**: You don't create indexes manually in Snowflake


## 3. Heap Structure

A **heap** is a table without a clustered index. In a heap:
- Data is stored in no particular order
- Rows are inserted wherever there's space
- No logical ordering of data
- Requires additional lookups to find data

### Characteristics:
- **Unordered Storage**: Data pages are not linked in any order
- **Forwarding Pointers**: When rows are updated and don't fit, pointers are used
- **Fragmentation**: Can become fragmented over time
- **Full Table Scans**: Queries often require scanning the entire table

### Example in Traditional Databases:
```sql
-- In SQL Server, a table without clustered index is a heap
CREATE TABLE heap_table (
    id INT,
    name VARCHAR(50)
);
-- This is a heap - no clustered index
```

### In Snowflake:
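Snowflake doesn't have traditional heaps. Every table is automatically divided into micro-partitions, and Snowflake keeps min/max metadata for each column in each micro-partition. Even without a clustering key, that metadata lets a query skip micro-partitions that cannot contain matching rows, whereas a traditional heap without indexes requires a full scan.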


## 4. Clustered Index

A **clustered index** determines the physical order of data in a table. The table data itself is sorted according to the clustered index key.

### Key Characteristics:
- **One per table**: Only one clustered index can exist per table
- **Physical ordering**: Data is stored in sorted order
- **Fast retrieval**: Very fast for range queries and sorting
- **Primary key**: In SQL Server, a clustered index is created on the primary key by default

### How It Works:
- Data pages are linked in sorted order
- Leaf nodes contain the actual data
- Searching is very efficient for the indexed column

### Example in Traditional Databases:
```sql
-- In SQL Server
CREATE TABLE employees (
    emp_id INT PRIMARY KEY,  -- Creates clustered index automatically
    name VARCHAR(50),
    department VARCHAR(50)
);
```

### In Snowflake - Clustering Keys:
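Snowflake uses **clustering keys**, which serve a similar purpose but are more flexible: rather than enforcing a strict physical sort order, they co-locate rows with similar key values in the same micro-partitions. Let's see how to use them: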


In [None]:
-- Create sample tables for our examples
-- First, let's create a table without clustering (similar to heap)
CREATE OR REPLACE TABLE employees_heap (
    emp_id INT,
    name VARCHAR(50),
    department VARCHAR(50),
    salary DECIMAL(10, 2),
    hire_date DATE
);

-- Insert dummy data
INSERT INTO employees_heap VALUES
(5, 'John Doe', 'IT', 75000, '2023-01-15'),
(2, 'Jane Smith', 'HR', 65000, '2023-02-20'),
(8, 'Bob Johnson', 'IT', 80000, '2023-03-10'),
(1, 'Alice Williams', 'Finance', 70000, '2023-04-05'),
(9, 'Charlie Brown', 'IT', 72000, '2023-05-12'),
(3, 'David Lee', 'HR', 68000, '2023-06-01'),
(7, 'Emma Davis', 'Finance', 75000, '2023-07-15'),
(4, 'Frank Miller', 'IT', 85000, '2023-08-20'),
(6, 'Grace Wilson', 'HR', 70000, '2023-09-10');

-- Now create a table with clustering key (similar to clustered index)
CREATE OR REPLACE TABLE employees_clustered (
    emp_id INT,
    name VARCHAR(50),
    department VARCHAR(50),
    salary DECIMAL(10, 2),
    hire_date DATE
) CLUSTER BY (emp_id);

-- Insert the same data
INSERT INTO employees_clustered VALUES
(5, 'John Doe', 'IT', 75000, '2023-01-15'),
(2, 'Jane Smith', 'HR', 65000, '2023-02-20'),
(8, 'Bob Johnson', 'IT', 80000, '2023-03-10'),
(1, 'Alice Williams', 'Finance', 70000, '2023-04-05'),
(9, 'Charlie Brown', 'IT', 72000, '2023-05-12'),
(3, 'David Lee', 'HR', 68000, '2023-06-01'),
(7, 'Emma Davis', 'Finance', 75000, '2023-07-15'),
(4, 'Frank Miller', 'IT', 85000, '2023-08-20'),
(6, 'Grace Wilson', 'HR', 70000, '2023-09-10');


### Query Performance Comparison:

Let's see how clustering affects query performance:


In [None]:
-- Query on heap table (no clustering)
-- This may require scanning more micro-partitions
SELECT * FROM employees_heap 
WHERE emp_id BETWEEN 3 AND 6
ORDER BY emp_id;

-- Query on clustered table
-- Snowflake can skip micro-partitions that don't contain the range
SELECT * FROM employees_clustered 
WHERE emp_id BETWEEN 3 AND 6
ORDER BY emp_id;

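-- Note: With only 9 rows, each table fits in a single micro-partition, so no pruning
-- difference is visible here. On large tables, compare "Partitions scanned" with
-- "Partitions total" in the Query Profile to see how much the clustering key helps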
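
The pruning idea behind the comparison above can be sketched in a few lines of Python. This is a toy model, not Snowflake's implementation: the partition size of 3 rows is an assumption chosen only so that 9 rows span several partitions, and `make_partitions` and `partitions_scanned` are hypothetical helpers.

```python
def make_partitions(values, size):
    """Split values into fixed-size chunks and record min/max per chunk."""
    parts = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        parts.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return parts

def partitions_scanned(parts, lo, hi):
    """Count partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for p in parts if p["max"] >= lo and p["min"] <= hi)

emp_ids = [5, 2, 8, 1, 9, 3, 7, 4, 6]            # insertion order from the example
unclustered = make_partitions(emp_ids, 3)          # stored as loaded
clustered = make_partitions(sorted(emp_ids), 3)    # stored in key order

print(partitions_scanned(unclustered, 3, 6))  # 3: every partition might hold a match
print(partitions_scanned(clustered, 3, 6))    # 2: the partition holding 7..9 is skipped
```

When similar key values are stored together, the min/max ranges of the partitions stop overlapping, and more partitions can be ruled out from metadata alone.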


## 5. Clustered vs Non-Clustered Index

### Clustered Index:
- **Physical Order**: Data is physically sorted
- **One per table**: Only one clustered index allowed
- **Contains data**: Leaf nodes contain actual row data
- **Fast for**: Range queries, sorting, primary key lookups
- **Slower for**: INSERT operations (data must be inserted in order)

### Non-Clustered Index:
- **Logical Order**: Separate structure from table data
- **Multiple allowed**: Can have many non-clustered indexes
- **Contains pointers**: Leaf nodes point to actual data
- **Fast for**: Specific lookups, covering queries
- **Slower for**: Large range queries (each matching row may need a key lookup unless the index covers the query)

### Visual Comparison:

**Clustered Index:**
```
Table Data (sorted by index):
[1, Alice] -> [2, Jane] -> [3, Bob] -> [4, Charlie]
```

**Non-Clustered Index:**
```
Index Structure:
[1] -> pointer -> [1, Alice] (in heap)
[2] -> pointer -> [2, Jane] (in heap)
[3] -> pointer -> [3, Bob] (in heap)
```
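
The same contrast can be expressed as a small Python sketch. It is a simplification under stated assumptions: plain lists stand in for pages and B-trees, and the `heap` data and function names are hypothetical. A clustered range query returns one contiguous slice, while the non-clustered version needs one extra lookup into the heap per matching key.

```python
import bisect

# Unordered base table (heap): (emp_id, name) in arrival order.
heap = [(5, "John"), (2, "Jane"), (8, "Bob"), (1, "Alice"), (9, "Charlie"), (3, "David")]

# Clustered: the table itself is kept in key order.
clustered = sorted(heap)
clustered_keys = [r[0] for r in clustered]

def clustered_range(lo, hi):
    """One seek to the start key, then a contiguous slice of rows."""
    start = bisect.bisect_left(clustered_keys, lo)
    end = bisect.bisect_right(clustered_keys, hi)
    return clustered[start:end]

# Non-clustered: sorted (key, heap position) entries beside the unordered heap.
nc_index = sorted((r[0], pos) for pos, r in enumerate(heap))
nc_keys = [k for k, _ in nc_index]

def nonclustered_range(lo, hi):
    """Seek in the index, then one lookup into the heap per matching key."""
    start = bisect.bisect_left(nc_keys, lo)
    end = bisect.bisect_right(nc_keys, hi)
    lookups = [pos for _, pos in nc_index[start:end]]
    return [heap[pos] for pos in lookups], lookups
```

For the range 2 to 5, both functions return the same rows, but the non-clustered version visits heap positions 1, 5 and 0, which is the scattered access pattern that makes large range queries slower.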

### In Snowflake:
- **Clustering Keys** = Similar to clustered index (but optional)
- **No non-clustered indexes** = Snowflake uses automatic micro-partition pruning instead
- **Automatic optimization** = Snowflake automatically optimizes without manual indexes


## 6. Composite Index

A **composite index** (also called multi-column index) is an index on multiple columns. The order of columns in a composite index is crucial for its effectiveness.

### Key Points:

1. **Column Order Matters**: Index on (A, B) is different from (B, A)
2. **Leftmost Prefix Rule**: Index on (A, B, C) can be used for queries on:
   - (A)
   - (A, B)
   - (A, B, C)
   - But NOT for queries on just (B) or (C)
3. **Cardinality**: Put high-cardinality columns first
4. **Query Patterns**: Order columns based on how they're used in queries

### Best Practices:
- As a rule of thumb, put the most selective (high-cardinality) columns first
- Put columns used in equality predicates before columns used in range predicates
- Consider query patterns when ordering columns
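
The leftmost prefix rule follows directly from how a composite index is sorted. The sketch below is illustrative only: a sorted list of tuples stands in for the index, and the data and function names are hypothetical. Filtering on the leading column finds one contiguous run, while filtering on the trailing column alone has to check every entry.

```python
import bisect

# Composite index on (customer_id, sale_date): entries sorted by the full tuple.
index = sorted([
    (101, "2023-01-15"), (101, "2023-02-20"), (102, "2023-01-10"),
    (101, "2023-03-05"), (103, "2023-02-15"), (102, "2023-03-20"),
])

def seek_by_customer(customer_id):
    """Leading column: binary search locates one contiguous run of entries."""
    start = bisect.bisect_left(index, (customer_id,))
    end = bisect.bisect_left(index, (customer_id + 1,))
    return index[start:end]

def scan_by_date(sale_date):
    """Trailing column only: matches are scattered, so every entry is checked."""
    return [entry for entry in index if entry[1] == sale_date]
```

`seek_by_customer(102)` touches only the two entries for that customer, whereas `scan_by_date("2023-01-10")` walks the whole index to find a single match.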

### In Snowflake:
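Use composite clustering keys for multiple columns. Note that Snowflake's guidance differs from the traditional rule: for clustering keys it generally recommends ordering columns from lowest to highest cardinality.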


In [None]:
-- Create a table with composite clustering key
CREATE OR REPLACE TABLE sales_composite (
    sale_id INT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2),
    region VARCHAR(50)
) CLUSTER BY (customer_id, sale_date);

-- Insert dummy data
INSERT INTO sales_composite VALUES
(1, 101, 201, '2023-01-15', 150.00, 'North'),
(2, 101, 202, '2023-02-20', 250.50, 'North'),
(3, 102, 201, '2023-01-10', 175.25, 'South'),
(4, 101, 203, '2023-03-05', 320.00, 'North'),
(5, 103, 201, '2023-02-15', 180.75, 'East'),
(6, 102, 202, '2023-03-20', 95.50, 'South'),
(7, 103, 203, '2023-04-01', 210.00, 'East'),
(8, 101, 201, '2023-04-10', 125.00, 'North'),
(9, 102, 203, '2023-05-05', 275.00, 'South');

-- Query that benefits from composite clustering
-- This query can efficiently use the clustering key
SELECT * FROM sales_composite 
WHERE customer_id = 101 
  AND sale_date BETWEEN '2023-01-01' AND '2023-03-31'
ORDER BY customer_id, sale_date;

-- Query that partially benefits
-- Can use customer_id part of the clustering key
SELECT * FROM sales_composite 
WHERE customer_id = 101
ORDER BY customer_id;

-- Query that may not benefit as much
-- Cannot efficiently use clustering key for just sale_date
SELECT * FROM sales_composite 
WHERE sale_date = '2023-03-05';


## 7. Columnstore Index

A **columnstore index** stores data in a column-wise format instead of row-wise format. This is excellent for analytical queries that scan many rows but few columns.

### Rowstore vs Columnstore:

**Rowstore (Traditional):**
```
Row 1: [ID: 1, Name: 'John', Age: 30, Salary: 50000]
Row 2: [ID: 2, Name: 'Jane', Age: 25, Salary: 60000]
Row 3: [ID: 3, Name: 'Bob', Age: 35, Salary: 55000]
```

**Columnstore:**
```
Column ID:    [1, 2, 3]
Column Name:  ['John', 'Jane', 'Bob']
Column Age:   [30, 25, 35]
Column Salary: [50000, 60000, 55000]
```

### Benefits of Columnstore:
- **Compression**: Better compression ratios (similar values stored together)
- **Analytical Queries**: Excellent for aggregations, GROUP BY, SUM, AVG
- **Scan Performance**: Can scan only needed columns
- **Batch Processing**: Processes data in batches

### Drawbacks:
- **Point Lookups**: Slower for single row lookups
- **OLTP Workloads**: Not ideal for transactional workloads
- **Write Performance**: Can be slower for INSERT/UPDATE operations
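
The benefits and drawbacks listed above can be seen in a toy Python model of the two layouts. This is a sketch, not a storage engine: dictionaries and lists stand in for pages and column segments, and the "fields read" counts are a rough proxy for I/O.

```python
# Rowstore: each record is stored together.
rows = [
    {"id": 1, "name": "John", "age": 30, "salary": 50000},
    {"id": 2, "name": "Jane", "age": 25, "salary": 60000},
    {"id": 3, "name": "Bob", "age": 35, "salary": 55000},
]

# Columnstore: each column is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

def avg_salary_rowstore(rows):
    """Loads every field of every row, even though only salary is needed."""
    fields_read = sum(len(row) for row in rows)
    return sum(row["salary"] for row in rows) / len(rows), fields_read

def avg_salary_columnstore(columns):
    """Reads only the salary column."""
    salaries = columns["salary"]
    return sum(salaries) / len(salaries), len(salaries)

def point_lookup_columnstore(columns, position):
    """Rebuilding one full row needs one read per column."""
    return {name: values[position] for name, values in columns.items()}
```

The aggregate reads 3 values from the column layout versus 12 fields from the row layout, while reconstructing a single row from columns touches every column, which is why point lookups favor rowstores.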

### In Snowflake:
**Snowflake uses columnar storage by default!** This is one of Snowflake's key architectural features. All tables in Snowflake are stored in a columnar format, which is why it's so efficient for analytical queries.


In [None]:
-- Create a large table to demonstrate columnstore benefits
CREATE OR REPLACE TABLE sales_large (
    sale_id INT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2),
    quantity INT,
    region VARCHAR(50),
    category VARCHAR(50),
    salesperson_id INT
);

-- Insert larger dataset (simulating analytical workload)
INSERT INTO sales_large
SELECT 
    ROW_NUMBER() OVER (ORDER BY UNIFORM(1, 1000000, RANDOM())) as sale_id,
    UNIFORM(1, 1000, RANDOM()) as customer_id,
    UNIFORM(1, 100, RANDOM()) as product_id,
    DATEADD(day, UNIFORM(0, 365, RANDOM()), '2023-01-01') as sale_date,
    UNIFORM(10, 1000, RANDOM()) as amount,
    UNIFORM(1, 10, RANDOM()) as quantity,
    CASE UNIFORM(1, 4, RANDOM())
        WHEN 1 THEN 'North'
        WHEN 2 THEN 'South'
        WHEN 3 THEN 'East'
        ELSE 'West'
    END as region,
    CASE UNIFORM(1, 5, RANDOM())
        WHEN 1 THEN 'Electronics'
        WHEN 2 THEN 'Clothing'
        WHEN 3 THEN 'Books'
        WHEN 4 THEN 'Food'
        ELSE 'Other'
    END as category,
    UNIFORM(1, 50, RANDOM()) as salesperson_id
FROM TABLE(GENERATOR(ROWCOUNT => 10000));

-- Analytical query that benefits from columnstore
-- Snowflake reads only the columns the query references (region, amount, sale_date)
SELECT 
    region,
    SUM(amount) as total_sales,
    AVG(amount) as avg_sales,
    COUNT(*) as transaction_count
FROM sales_large
WHERE sale_date >= '2023-06-01'
GROUP BY region
ORDER BY total_sales DESC;


## 8. Process of Building Columnstore Index

### Traditional Columnstore Index Building Process:

1. **Create Index Structure**: Allocate space for column segments
2. **Read Row Data**: Read data from the table row by row
3. **Transform to Columns**: Reorganize data from rows to columns
4. **Compress**: Apply compression algorithms
5. **Build Segments**: Organize into column segments
6. **Update Metadata**: Update system catalogs

### Steps in Detail:

**Step 1: Data Extraction**
- Read all rows from the source table
- Extract values for each column

**Step 2: Column Organization**
- Group values by column
- Create column segments

**Step 3: Compression**
- Apply compression (dictionary encoding, run-length encoding)
- Store compressed segments

**Step 4: Metadata Creation**
- Create segment metadata
- Build dictionaries for encoded values
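
The four steps above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of any specific engine: the segment layout and the function names (`dictionary_encode`, `run_length_encode`, `build_segment`) are assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]

def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs

def build_segment(rows, column):
    """Extract one column, encode it, compress it, and record metadata."""
    values = [row[column] for row in rows]
    dictionary, codes = dictionary_encode(values)
    return {
        "dictionary": dictionary,
        "runs": run_length_encode(codes),
        "min": min(values),
        "max": max(values),
        "row_count": len(values),
    }

rows = [{"region": r} for r in ["East", "East", "East", "North", "North", "East"]]
segment = build_segment(rows, "region")
```

Six string values shrink to a two-entry dictionary and three runs. The min/max metadata recorded per segment is what later allows a query to skip segments that cannot match a filter.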

### In Snowflake:
Snowflake automatically builds columnar storage when you insert data. The process is:
1. **Data Ingestion**: Data is loaded into Snowflake
2. **Automatic Partitioning**: Data is automatically partitioned into micro-partitions
3. **Columnar Storage**: Each micro-partition stores data in columnar format
4. **Automatic Compression**: Snowflake automatically compresses data
5. **Metadata**: Snowflake maintains metadata about micro-partitions

You don't need to manually build columnstore indexes - it happens automatically!


In [None]:
-- In Snowflake, columnar storage is automatic
-- Let's see how data is stored by checking table information

-- Check table details (row count, size in bytes, clustering key)
SHOW TABLES LIKE 'sales_large';

-- The same details are available in Snowflake's information schema
SELECT 
    TABLE_NAME,
    ROW_COUNT,
    BYTES,
    CLUSTERING_KEY
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'SALES_LARGE';

-- Note: Snowflake automatically uses columnar storage
-- No manual index creation needed!


## 9. Columnstore vs Rowstore Index

### Comparison Table:

| Feature | Rowstore Index | Columnstore Index |
|---------|---------------|-------------------|
| **Storage** | Row-wise | Column-wise |
| **Best For** | OLTP (point lookups) | OLAP (analytical queries) |
| **Compression** | Moderate | Excellent |
| **Scan Performance** | Slower for many rows | Faster for many rows |
| **Point Lookups** | Fast | Slower |
| **Aggregations** | Slower | Much faster |
| **GROUP BY** | Slower | Much faster |
| **INSERT/UPDATE** | Fast | Can be slower |
| **Storage Size** | Larger | Smaller (better compression) |

### When to Use Rowstore:
- High-frequency point lookups
- OLTP workloads
- Frequent INSERT/UPDATE operations
- Queries that return entire rows

### When to Use Columnstore:
- Analytical queries
- Aggregations and GROUP BY
- Queries scanning many rows but few columns
- Data warehousing scenarios
- Reporting and BI workloads

### In Snowflake:
**Snowflake uses columnstore by default** because it's designed for analytical workloads. This is why Snowflake excels at:
- Data warehousing
- Analytical queries
- Aggregations
- Large-scale data processing

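For OLTP workloads dominated by frequent single-row reads and writes, Snowflake's standard tables are usually not the best choice, although they cope well with workloads that mix batch loading and analytical queries.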


## 10. Unique Index

A **unique index** ensures that no two rows can have the same value(s) in the indexed column(s). It enforces uniqueness constraints.

### Key Characteristics:
- **Uniqueness**: Prevents duplicate values
- **NULL Handling**: Varies by database (SQL Server allows a single NULL, while PostgreSQL, MySQL, and Oracle allow multiple NULLs)
- **Performance**: Can improve lookup performance
- **Constraint**: Often used to enforce primary key or unique constraints

### Use Cases:
- Primary keys
- Unique business identifiers (email, SSN, etc.)
- Composite unique constraints
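
To see an enforced unique index in action, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in for a traditional database here; the table and index names are made up, and the NULL behavior shown is SQLite's, not a universal rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("CREATE UNIQUE INDEX ux_customers_email ON customers (email)")

conn.execute("INSERT INTO customers VALUES (1, 'john.doe@email.com')")

def try_insert(customer_id, email):
    """Return True if the row was accepted, False if the unique index rejected it."""
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?)", (customer_id, email))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert(2, "john.doe@email.com")  # same email again
first_null_accepted = try_insert(3, None)
second_null_accepted = try_insert(4, None)  # SQLite treats NULLs as distinct

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The duplicate email is rejected with an `IntegrityError`, while both NULL emails are accepted, which illustrates why NULL handling has to be checked per database.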

### In Snowflake:
Snowflake doesn't have traditional unique indexes. You can declare:
- **PRIMARY KEY constraints**
- **UNIQUE constraints**
- On standard tables these are informational only: Snowflake records them but does **not enforce** them (NOT NULL is the only constraint enforced there), and they don't create indexes


In [None]:
-- Create table with declared constraints (the closest analogue to a unique index)
CREATE OR REPLACE TABLE customers_unique (
    customer_id INT PRIMARY KEY,  -- Declared, but not enforced on standard tables
    email VARCHAR(100) UNIQUE,    -- Declared, but not enforced on standard tables
    username VARCHAR(50) UNIQUE,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    registration_date DATE
);

-- Insert valid data
INSERT INTO customers_unique VALUES
(1, 'john.doe@email.com', 'johndoe', 'John', 'Doe', '2023-01-15'),
(2, 'jane.smith@email.com', 'janesmith', 'Jane', 'Smith', '2023-02-20'),
(3, 'bob.johnson@email.com', 'bobjohnson', 'Bob', 'Johnson', '2023-03-10');

-- Insert a duplicate email
-- On a standard table this succeeds, because the UNIQUE constraint is not enforced
INSERT INTO customers_unique VALUES
(4, 'john.doe@email.com', 'anotheruser', 'Alice', 'Williams', '2023-04-05');

-- Insert a duplicate username (and a duplicate customer_id); this also succeeds
INSERT INTO customers_unique VALUES
(4, 'alice.williams@email.com', 'johndoe', 'Alice', 'Williams', '2023-04-05');

-- Because nothing is rejected, check for duplicates explicitly
SELECT email, COUNT(*) AS occurrences
FROM customers_unique
GROUP BY email
HAVING COUNT(*) > 1;

-- Composite unique constraint example
CREATE OR REPLACE TABLE order_items_unique (
    order_id INT,
    product_id INT,
    quantity INT,
    price DECIMAL(10, 2),
    CONSTRAINT uk_order_product UNIQUE (order_id, product_id)  -- Composite unique
);

-- Insert valid data
INSERT INTO order_items_unique VALUES
(101, 201, 2, 29.99),
(101, 202, 1, 49.99),
(102, 201, 3, 29.99);

-- Insert a duplicate combination (also accepted on a standard table)
INSERT INTO order_items_unique VALUES
(101, 201, 5, 29.99);


## 11. Filtered Index (Partial Index)

A **filtered index** (also called partial index) is an index that only includes rows that meet a specific condition. This makes the index smaller and more efficient for queries that match the filter.

### Benefits:
- **Smaller Size**: Only indexes relevant rows
- **Better Performance**: Faster to maintain and query
- **Lower Storage**: Uses less storage space
- **Targeted Optimization**: Optimizes specific query patterns

### Use Cases:
- Indexing active records only (WHERE status = 'Active')
- Indexing recent data only (WHERE date > '2023-01-01')
- Indexing non-null values only (WHERE column IS NOT NULL)
- Indexing specific categories
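
As a concrete illustration of the first use case, the sketch below creates a partial index in SQLite via Python's `sqlite3` module. SQLite is used only because it supports partial indexes and runs anywhere; the table and index names are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-15", "Active"),
        (2, "2023-02-20", "Active"),
        (3, "2023-03-10", "Completed"),
        (4, "2023-04-05", "Active"),
        (5, "2023-05-12", "Cancelled"),
    ],
)

# Partial index: only rows with status = 'Active' get an index entry.
conn.execute(
    "CREATE INDEX ix_active_orders ON orders (order_date) WHERE status = 'Active'"
)

# The query repeats the index's condition, so the planner can use the index.
query = (
    "SELECT order_id FROM orders "
    "WHERE status = 'Active' AND order_date >= '2023-02-01' "
    "ORDER BY order_date"
)
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
result = [row[0] for row in conn.execute(query)]
```

The index holds entries for three of the five rows, and the plan should mention `ix_active_orders`. A query without the `status = 'Active'` condition could not use this index at all.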

### In Snowflake:
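Snowflake doesn't support filtered indexes directly, but you can achieve similar benefits using:
- **Clustering keys** that lead with the filtered column, so matching rows share micro-partitions
- **Materialized views** with WHERE clauses (available in Enterprise Edition and higher)
- **Separate tables** for the subset of rows you query most, since Snowflake has no user-defined partitions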


In [None]:
-- Example: Table with status column
-- We want to optimize queries for active records
CREATE OR REPLACE TABLE orders_filtered (
    order_id INT,
    customer_id INT,
    order_date DATE,
    status VARCHAR(20),
    total_amount DECIMAL(10, 2),
    region VARCHAR(50)
) CLUSTER BY (status, order_date);  -- Clustering helps with filtered queries

-- Insert data with different statuses
INSERT INTO orders_filtered VALUES
(1, 101, '2023-01-15', 'Active', 150.00, 'North'),
(2, 102, '2023-02-20', 'Active', 250.50, 'South'),
(3, 103, '2023-03-10', 'Completed', 175.25, 'East'),
(4, 101, '2023-04-05', 'Active', 320.00, 'North'),
(5, 104, '2023-05-12', 'Cancelled', 180.75, 'West'),
(6, 102, '2023-06-01', 'Active', 95.50, 'South'),
(7, 103, '2023-07-15', 'Completed', 210.00, 'East'),
(8, 101, '2023-08-20', 'Active', 125.00, 'North'),
(9, 105, '2023-09-10', 'Pending', 275.00, 'West');

-- Query that benefits from clustering on status
-- Snowflake can prune micro-partitions that don't contain 'Active' status
SELECT * FROM orders_filtered 
WHERE status = 'Active'
ORDER BY order_date;

-- Materialized view as alternative to filtered index (requires Enterprise Edition or higher)
CREATE OR REPLACE MATERIALIZED VIEW active_orders_mv AS
SELECT 
    order_id,
    customer_id,
    order_date,
    total_amount,
    region
FROM orders_filtered
WHERE status = 'Active';

-- Query the materialized view (pre-filtered)
SELECT * FROM active_orders_mv
ORDER BY order_date;


## 12. Choosing the Right Index

Choosing the right indexing strategy is crucial for database performance. Here are key considerations:

### Factors to Consider:

1. **Query Patterns**
   - What columns are frequently used in WHERE clauses?
   - What columns are used in JOIN conditions?
   - What columns are used in ORDER BY?

2. **Data Characteristics**
   - High cardinality (many unique values) = Good for indexes
   - Low cardinality (few unique values) = May not benefit much
   - NULL values = Consider filtered indexes

3. **Workload Type**
   - OLTP (many writes) = Fewer indexes, focus on primary keys
   - OLAP (many reads) = More indexes, focus on query columns

4. **Table Size**
   - Small tables = Indexes may not help (full scan is fast)
   - Large tables = Indexes are essential

5. **Write Frequency**
   - High write frequency = Fewer indexes (maintenance overhead)
   - Low write frequency = More indexes acceptable

### Index Selection Guidelines:

**Create Indexes On:**
- Primary keys (automatic in most databases)
- Foreign keys (for join performance)
- Columns in WHERE clauses
- Columns in ORDER BY
- Columns in GROUP BY
- Columns used in JOIN conditions

**Avoid Indexes On:**
- Very small tables
- Columns with very low cardinality
- Columns that are frequently updated
- Too many indexes on the same table
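
One way to apply the cardinality guideline is to measure it. The Python sketch below is a rough illustration on made-up data: `cardinality_report` is a hypothetical helper that computes, per column, the number of distinct values and the ratio of distinct values to rows.

```python
def cardinality_report(rows):
    """For each column, return (distinct values, selectivity = distinct / rows)."""
    total = len(rows)
    report = {}
    for column in rows[0]:
        distinct = len({row[column] for row in rows})
        report[column] = (distinct, distinct / total)
    return report

orders = [
    {"order_id": 1, "customer_id": 101, "status": "Active"},
    {"order_id": 2, "customer_id": 102, "status": "Active"},
    {"order_id": 3, "customer_id": 101, "status": "Completed"},
    {"order_id": 4, "customer_id": 103, "status": "Active"},
]

report = cardinality_report(orders)
# Rank columns from most to least selective.
ranked = sorted(report, key=lambda c: report[c][1], reverse=True)
```

Here `order_id` is fully selective, `customer_id` less so, and `status` least. In SQL the same measurement is typically `COUNT(DISTINCT col)` divided by `COUNT(*)`.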

### In Snowflake:
- **Clustering Keys**: Use for large tables with range queries
- **Automatic Optimization**: Snowflake handles most optimization automatically
- **Micro-partition Pruning**: Snowflake automatically prunes partitions
- **Query Profile**: Use to identify when clustering would help


In [None]:
-- Example: Analyzing query patterns to choose clustering key

-- Create a table for analysis
CREATE OR REPLACE TABLE customer_orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    product_id INT,
    quantity INT,
    amount DECIMAL(10, 2),
    status VARCHAR(20),
    region VARCHAR(50)
);

-- Insert sample data
INSERT INTO customer_orders
SELECT 
    ROW_NUMBER() OVER (ORDER BY UNIFORM(1, 1000000, RANDOM())) as order_id,
    UNIFORM(1, 10000, RANDOM()) as customer_id,
    DATEADD(day, UNIFORM(0, 730, RANDOM()), '2022-01-01') as order_date,
    UNIFORM(1, 500, RANDOM()) as product_id,
    UNIFORM(1, 10, RANDOM()) as quantity,
    UNIFORM(10, 1000, RANDOM()) as amount,
    CASE UNIFORM(1, 4, RANDOM())
        WHEN 1 THEN 'Active'
        WHEN 2 THEN 'Completed'
        WHEN 3 THEN 'Pending'
        ELSE 'Cancelled'
    END as status,
    CASE UNIFORM(1, 4, RANDOM())
        WHEN 1 THEN 'North'
        WHEN 2 THEN 'South'
        WHEN 3 THEN 'East'
        ELSE 'West'
    END as region
FROM TABLE(GENERATOR(ROWCOUNT => 50000));

-- Analyze query patterns
-- Pattern 1: Queries by customer_id and date range
-- This suggests clustering by (customer_id, order_date)
SELECT * FROM customer_orders
WHERE customer_id = 1234
  AND order_date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY order_date;

-- Pattern 2: Queries by date range
-- This suggests clustering by order_date
SELECT 
    region,
    SUM(amount) as total
FROM customer_orders
WHERE order_date >= '2023-06-01'
GROUP BY region;

-- Based on analysis, choose the most common pattern
-- Let's recreate with appropriate clustering
CREATE OR REPLACE TABLE customer_orders_clustered (
    order_id INT,
    customer_id INT,
    order_date DATE,
    product_id INT,
    quantity INT,
    amount DECIMAL(10, 2),
    status VARCHAR(20),
    region VARCHAR(50)
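) CLUSTER BY (customer_id, order_date);  -- Based on most common query pattern

-- Populate the clustered table so later examples have data to inspect
INSERT INTO customer_orders_clustered
SELECT * FROM customer_orders;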


## 13. Monitoring Duplicate Indexes

Duplicate or redundant indexes waste storage space and slow down write operations. It's important to identify and remove them.

### Types of Duplicate Indexes:

1. **Exact Duplicates**: Same columns in the same order
2. **Redundant Indexes**: Index whose columns are a leading prefix of another index
   - Example: Index on (A, B, C) makes index on (A, B) redundant
3. **Overlapping Indexes**: Indexes with overlapping columns

### How to Identify:

1. **Review Index Definitions**: Check system catalogs
2. **Analyze Query Plans**: See which indexes are actually used
3. **Monitor Index Usage**: Track index usage statistics
4. **Use Tools**: Database-specific tools for index analysis
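
Reviewing index definitions for duplicates can be automated. The sketch below assumes you have already extracted index definitions from the system catalog into a dictionary of name to column list; the `indexes` data and `find_redundant` helper are hypothetical. It flags any index whose columns are a leading prefix of another index, including exact duplicates.

```python
def find_redundant(indexes):
    """Return (redundant, covering) name pairs.

    An index is flagged when its column list is a leading prefix of
    another index's column list (identical lists count as duplicates).
    """
    flagged = []
    names = sorted(indexes)
    for a in names:
        for b in names:
            if a == b:
                continue
            cols_a, cols_b = indexes[a], indexes[b]
            is_prefix = cols_b[:len(cols_a)] == cols_a
            # For exact duplicates, report only one direction.
            if is_prefix and (len(cols_a) < len(cols_b) or a > b):
                flagged.append((a, b))
    return flagged

indexes = {
    "ix_abc": ["a", "b", "c"],
    "ix_ab": ["a", "b"],
    "ix_ab_copy": ["a", "b"],
    "ix_ba": ["b", "a"],
}
redundant = find_redundant(indexes)
```

`ix_ba` is not flagged because column order matters. Treat the output as candidates for review rather than a drop list: a prefix index may still be needed, for example when it enforces uniqueness.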

### In Snowflake:
Since Snowflake doesn't use traditional indexes, you don't have duplicate index issues. However, you should monitor:
- **Unused Clustering Keys**: Clustering keys that don't improve pruning for your queries
- **Costly Clustering Keys**: Keys whose background reclustering consumes more credits than the queries save
- **Over-clustering**: Too many columns in one key (a table has a single clustering key; Snowflake recommends at most 3 or 4 columns or expressions)


In [None]:
-- In Snowflake, check clustering information
-- Show all tables and their clustering keys
SELECT 
    TABLE_SCHEMA,
    TABLE_NAME,
    CLUSTERING_KEY,
    ROW_COUNT,
    BYTES
FROM INFORMATION_SCHEMA.TABLES
WHERE CLUSTERING_KEY IS NOT NULL
ORDER BY TABLE_SCHEMA, TABLE_NAME;

-- Check if clustering is being used effectively
-- You can use SYSTEM$CLUSTERING_INFORMATION function
SELECT SYSTEM$CLUSTERING_INFORMATION('customer_orders_clustered', '(customer_id, order_date)');

-- Monitor table statistics
SELECT 
    TABLE_NAME,
    ROW_COUNT,
    BYTES,
    CLUSTERING_KEY,
    AUTO_CLUSTERING_ON
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = CURRENT_SCHEMA()
ORDER BY BYTES DESC;

-- Check for tables that might benefit from clustering
-- Large tables without clustering keys
SELECT 
    TABLE_NAME,
    ROW_COUNT,
    BYTES
FROM INFORMATION_SCHEMA.TABLES
WHERE CLUSTERING_KEY IS NULL
  AND ROW_COUNT > 100000
ORDER BY BYTES DESC;


## 14. Fragmentation

**Fragmentation** occurs when index pages become disorganized, with data scattered across many pages instead of being contiguous. This degrades performance.

### Types of Fragmentation:

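1. **Internal Fragmentation**: Empty space within pages, so the same rows occupy more pages than necessary
2. **External Fragmentation**: The physical order of pages on disk no longer matches the logical (key) order of the index
3. **Logical Fragmentation**: SQL Server's name for external fragmentation in an index, where following the leaf level in key order means jumping between non-adjacent pages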

### Causes:
- **INSERT operations**: New data inserted in random order
- **UPDATE operations**: Rows updated and moved
- **DELETE operations**: Empty space left in pages
- **Page splits**: Pages split when they become full

### Effects:
- **Slower Queries**: More I/O operations needed
- **Wasted Space**: Unused space in pages
- **Poor Cache Usage**: Data not contiguous in memory

### Solutions:
- **Rebuild Index**: Drop and recreate the index from scratch, removing all fragmentation
- **Reorganize Index**: Defragment the leaf level in place, a lighter operation than a rebuild
- **Fill Factor**: Control how full pages are
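
Page splits and their effect on page fullness can be simulated in a few lines of Python. This is a toy model under loud assumptions: pages are plain lists, `PAGE_CAPACITY` is an artificially small value, and only internal fragmentation is modeled (not the physical order of pages on disk).

```python
import bisect

PAGE_CAPACITY = 4  # rows per page; a hypothetical, tiny value for illustration

def insert_key(pages, key):
    """Insert into the page covering the key, splitting the page if it overflows."""
    idx = len(pages) - 1
    for i, page in enumerate(pages):
        if page and page[-1] >= key:
            idx = i
            break
    page = pages[idx]
    bisect.insort(page, key)
    if len(page) > PAGE_CAPACITY:
        half = len(page) // 2
        pages[idx:idx + 1] = [page[:half], page[half:]]  # page split

def average_fill(pages):
    """Fraction of page capacity in use (1.0 means no internal fragmentation)."""
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)

def rebuild(pages):
    """Repack all keys into full pages, as an index rebuild would."""
    keys = [k for p in pages for k in p]
    return [keys[i:i + PAGE_CAPACITY] for i in range(0, len(keys), PAGE_CAPACITY)]

pages = [[]]
for key in [50, 10, 40, 20, 30, 25, 45, 15, 35, 5, 12, 22]:
    insert_key(pages, key)
rebuilt = rebuild(pages)
```

After the random-order inserts, the 12 keys occupy 4 pages at 75% fill; the rebuild packs them into 3 full pages. A fill factor below 100% would deliberately leave headroom during the rebuild so that later inserts cause fewer splits.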

### In Snowflake:
Snowflake handles fragmentation automatically through:
- **Automatic Micro-partition Management**: Snowflake automatically manages partitions
- **Automatic Clustering Service**: Snowflake can automatically recluster data
- **No Manual Maintenance**: You don't need to manually defragment

However, you can monitor and control clustering:


In [None]:
-- Check clustering information for a table
-- This shows how well-clustered the data is
SELECT SYSTEM$CLUSTERING_INFORMATION('customer_orders_clustered', '(customer_id, order_date)');

-- The output shows:
-- - average_overlaps: Lower is better (indicates less fragmentation)
-- - average_depth: Lower is better
-- - partition_depth_histogram: Distribution of partition depths

-- Automatic Clustering is enabled by default once a clustering key is defined
-- Resume it if it has been suspended (background reclustering consumes credits)
ALTER TABLE customer_orders_clustered RESUME RECLUSTER;

-- Check if automatic clustering is enabled
SELECT 
    TABLE_NAME,
    AUTO_CLUSTERING_ON
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'CUSTOMER_ORDERS_CLUSTERED';

-- To pause background reclustering and its cost, suspend it:
-- ALTER TABLE customer_orders_clustered SUSPEND RECLUSTER;
-- Note: Manual reclustering (ALTER TABLE ... RECLUSTER) is deprecated

-- Monitor clustering effectiveness over time
-- Run this periodically to check if reclustering is needed
SELECT 
    SYSTEM$CLUSTERING_INFORMATION('customer_orders_clustered', '(customer_id, order_date)') as clustering_info;
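
To build intuition for why lower overlap is better, the sketch below computes a simplified overlap measure over hypothetical partition ranges. It is an assumption-laden illustration and not Snowflake's exact calculation: each partition is reduced to a `(min, max)` pair for the clustering column, and the function simply averages how many other partitions each one overlaps.

```python
def average_overlaps(partitions):
    """Average number of other partitions whose [min, max] range overlaps each one."""
    total = 0
    for i, (lo_a, hi_a) in enumerate(partitions):
        for j, (lo_b, hi_b) in enumerate(partitions):
            if i != j and lo_a <= hi_b and lo_b <= hi_a:
                total += 1
    return total / len(partitions)

# Hypothetical (min, max) ranges of customer_id per micro-partition.
well_clustered = [(1, 100), (101, 200), (201, 300), (301, 400)]
poorly_clustered = [(1, 390), (5, 400), (2, 380), (10, 395)]
```

With disjoint ranges the measure is 0, so a filter on one value touches a single partition. When every range spans nearly the whole key space, each partition overlaps all the others and almost nothing can be pruned.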


## 15. Execution Plan

An **execution plan** (query plan) shows how the database will execute a query. It's essential for understanding index usage and query performance.

### What Execution Plans Show:

1. **Index Usage**: Which indexes are used (or not used)
2. **Scan Type**: Table scan vs Index scan
3. **Join Methods**: How tables are joined
4. **Sort Operations**: Where sorting occurs
5. **Filter Operations**: Where filtering happens
6. **Cost Estimates**: Estimated cost of operations

### Key Operations:

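- **Index Scan**: Reading all or most of an index's leaf level
- **Index Seek**: Navigating the index tree directly to the matching rows (very fast)
- **Table Scan**: Scanning the entire table (slow on large tables)
- **Key Lookup**: Fetching columns that are not in a non-clustered index from the base table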
- **Sort**: Sorting operation
- **Hash Join**: Hash-based join
- **Nested Loop**: Nested loop join

### In Snowflake:
Snowflake provides query profiles that show:
- **Micro-partition Pruning**: Which partitions were scanned
- **Statistics**: Row counts, bytes scanned
- **Operations**: Join types, aggregation methods
- **Timing**: How long each operation took


In [None]:
-- In Snowflake, use EXPLAIN to see query plan
-- This shows how Snowflake will execute the query

-- Simple query plan
EXPLAIN 
SELECT * FROM customer_orders_clustered
WHERE customer_id = 1234
  AND order_date BETWEEN '2023-01-01' AND '2023-12-31';

-- Same kind of plan in tabular form (TABULAR is the default; JSON and TEXT are also available)
EXPLAIN USING TABULAR
SELECT 
    region,
    SUM(amount) as total_sales,
    COUNT(*) as order_count
FROM customer_orders_clustered
WHERE order_date >= '2023-06-01'
GROUP BY region
ORDER BY total_sales DESC;

-- Check if micro-partition pruning is happening
-- Compare the partitionsAssigned and partitionsTotal columns in the EXPLAIN output
EXPLAIN 
SELECT * FROM customer_orders_clustered
WHERE customer_id BETWEEN 1000 AND 2000;

-- Compare plans with and without clustering
-- Query on non-clustered table
EXPLAIN 
SELECT * FROM customer_orders
WHERE customer_id = 1234;

-- Query on clustered table
EXPLAIN 
SELECT * FROM customer_orders_clustered
WHERE customer_id = 1234;

-- Note: In Snowflake UI, you can see detailed query profiles
-- with visual representations of the execution plan


## 16. Indexing Strategy

A good **indexing strategy** is crucial for database performance. Here's a comprehensive approach:

### Step-by-Step Indexing Strategy:

#### 1. **Analyze Your Workload**
   - Identify frequently executed queries
   - Understand query patterns (point lookups vs range scans)
   - Determine read vs write ratio
   - Identify slow queries

#### 2. **Identify Candidate Columns**
   - Columns in WHERE clauses
   - Columns in JOIN conditions
   - Columns in ORDER BY
   - Columns in GROUP BY
   - Foreign keys

#### 3. **Consider Data Characteristics**
   - **Cardinality**: High cardinality columns are better candidates
   - **Selectivity**: How selective is the column?
   - **Data Distribution**: Is data evenly distributed?
   - **NULL Values**: How many NULLs?

#### 4. **Prioritize Indexes**
   - Start with most frequently used queries
   - Focus on queries that are slow
   - Consider impact on write operations
   - Balance between read and write performance

#### 5. **Design Composite Indexes**
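   - Put columns used in equality predicates first, then order by selectivity (most selective first)
   - Consider the leftmost prefix rule: an index on (a, b) can serve predicates on a alone or on a and b together, but not on b alone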
   - Match query patterns

#### 6. **Monitor and Tune**
   - Monitor index usage
   - Remove unused indexes
   - Identify missing indexes
   - Rebuild/reorganize as needed

### Indexing Strategy for Different Scenarios:

#### **OLTP (Online Transaction Processing)**
- Focus on primary keys
- Index foreign keys
- Keep indexes to a minimum (every index adds write overhead)
- Optimize for point lookups

#### **OLAP (Online Analytical Processing)**
- More indexes acceptable
- Focus on query columns
- Composite indexes for common patterns
- Columnstore indexes

#### **Mixed Workload**
- Balance read and write performance
- Monitor both query and write performance
- Use filtered indexes where appropriate

### In Snowflake - Clustering Strategy:

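1. **Identify Large Tables**: Clustering keys pay off mainly on very large tables (Snowflake's guidance is the multi-terabyte range); small tables rarely benefit
2. **Analyze Query Patterns**: Look for selective filters and range queries that scan far more micro-partitions than they need
3. **Choose Clustering Keys**: Base them on the columns most often used in WHERE clauses and joins; for multi-column keys, Snowflake generally recommends ordering columns from lowest to highest cardinality
4. **Monitor Clustering**: Use SYSTEM$CLUSTERING_INFORMATION
5. **Manage Automatic Clustering**: It is enabled by default once a clustering key is defined and consumes credits; frequently updated tables cost more to keep clustered
6. **Review Periodically**: Check if clustering is still effective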
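
The monitoring and maintenance steps above map onto a few statements. This is a sketch under assumptions: it reuses the `customer_orders_clustered` table from the earlier examples, the 7-day window is arbitrary, and reading `ACCOUNT_USAGE` requires access to the shared `SNOWFLAKE` database.

```sql
-- Average clustering depth for a candidate key (smaller values mean better clustering)
SELECT SYSTEM$CLUSTERING_DEPTH('customer_orders_clustered', '(customer_id, order_date)');

-- Define or change the clustering key on an existing table
ALTER TABLE customer_orders_clustered CLUSTER BY (customer_id, order_date);

-- Pause background reclustering, for example during a large bulk load
-- (ALTER TABLE ... RESUME RECLUSTER turns it back on)
ALTER TABLE customer_orders_clustered SUSPEND RECLUSTER;

-- Credits spent on Automatic Clustering per table over the last 7 days
SELECT
    table_name,
    SUM(credits_used)          AS total_credits,
    SUM(num_rows_reclustered)  AS rows_reclustered
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY total_credits DESC;
```

Comparing the credits spent with the improvement in pruning is a reasonable basis for the periodic review in step 6.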


In [None]:
-- Comprehensive example: Implementing an indexing strategy

-- Step 1: Create a table for a typical business scenario
CREATE OR REPLACE TABLE ecommerce_orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    order_status VARCHAR(20),
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10, 2),
    total_amount DECIMAL(10, 2),
    shipping_address VARCHAR(200),
    payment_method VARCHAR(50),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

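-- Step 2: Insert sample data
-- (100,000 rows are enough to run the statements, but likely fit in one or a few
-- micro-partitions, so do not expect a measurable clustering benefit at this size)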
INSERT INTO ecommerce_orders
SELECT 
    ROW_NUMBER() OVER (ORDER BY UNIFORM(1, 1000000, RANDOM())) as order_id,
    UNIFORM(1, 50000, RANDOM()) as customer_id,
    DATEADD(day, UNIFORM(0, 1095, RANDOM()), '2021-01-01'::DATE) as order_date,
    CASE UNIFORM(1, 5, RANDOM())
        WHEN 1 THEN 'Pending'
        WHEN 2 THEN 'Processing'
        WHEN 3 THEN 'Shipped'
        WHEN 4 THEN 'Delivered'
        ELSE 'Cancelled'
    END as order_status,
    UNIFORM(1, 1000, RANDOM()) as product_id,
    UNIFORM(1, 5, RANDOM()) as quantity,
    UNIFORM(10, 500, RANDOM()) as unit_price,
    -- Generated independently for simplicity, so it will not equal quantity * unit_price
    UNIFORM(10, 500, RANDOM()) * UNIFORM(1, 5, RANDOM()) as total_amount,
    'Address ' || UNIFORM(1, 1000, RANDOM()) as shipping_address,
    CASE UNIFORM(1, 4, RANDOM())
        WHEN 1 THEN 'Credit Card'
        WHEN 2 THEN 'Debit Card'
        WHEN 3 THEN 'PayPal'
        ELSE 'Bank Transfer'
    END as payment_method,
    CURRENT_TIMESTAMP() as created_at,
    CURRENT_TIMESTAMP() as updated_at
FROM TABLE(GENERATOR(ROWCOUNT => 100000));

-- Step 3: Analyze common query patterns
-- Pattern A: Find orders by customer and date range (most common)
-- Pattern B: Find orders by status
-- Pattern C: Find orders by date range for reporting

-- Step 4: Implement clustering strategy based on analysis
CREATE OR REPLACE TABLE ecommerce_orders_clustered (
    order_id INT,
    customer_id INT,
    order_date DATE,
    order_status VARCHAR(20),
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10, 2),
    total_amount DECIMAL(10, 2),
    shipping_address VARCHAR(200),
    payment_method VARCHAR(50),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
) CLUSTER BY (customer_id, order_date);  -- Based on most common query pattern

-- Insert data into clustered table
INSERT INTO ecommerce_orders_clustered
SELECT * FROM ecommerce_orders;

-- Step 5: Test query performance
-- Query that benefits from clustering
EXPLAIN
SELECT 
    order_id,
    order_date,
    total_amount,
    order_status
FROM ecommerce_orders_clustered
WHERE customer_id = 12345
  AND order_date >= '2023-01-01'
ORDER BY order_date DESC;

-- Step 6: Monitor clustering effectiveness
SELECT SYSTEM$CLUSTERING_INFORMATION('ecommerce_orders_clustered', '(customer_id, order_date)');

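-- Step 7: Manage Automatic Clustering for ongoing maintenance
-- It is enabled by default once a clustering key is defined; RESUME RECLUSTER
-- turns it back on after an earlier ALTER TABLE ... SUSPEND RECLUSTER
ALTER TABLE ecommerce_orders_clustered RESUME RECLUSTER;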


## Summary

### Key Takeaways:

1. **Indexes improve query performance** by reducing the amount of data scanned
2. **Different index types** serve different purposes (clustered, non-clustered, unique, composite, etc.)
3. **Columnstore indexes** are excellent for analytical workloads
4. **Index selection** should be based on query patterns and data characteristics
5. **Monitoring and maintenance** are essential for optimal performance
6. **Execution plans** help understand how queries are executed

### Snowflake-Specific Notes:

- **No traditional indexes**: Snowflake uses clustering keys and micro-partitions
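- **Automatic optimization**: Snowflake keeps metadata for every micro-partition (such as per-column value ranges) and uses it to prune partitions without manual tuning
- **Columnar storage**: All tables use columnar storage by default
- **Clustering keys**: Use for very large tables with selective filters or range queries
- **Automatic Clustering**: Maintains clustering in the background once a clustering key is defined, at a credit cost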
- **Query profiles**: Use to analyze and optimize queries

### Best Practices:

1. ✅ Analyze query patterns before creating indexes/clustering keys
2. ✅ Start with the most frequently used queries
3. ✅ Monitor index/clustering effectiveness
4. ✅ Remove unused indexes/clustering keys
5. ✅ Balance read and write performance
6. ✅ Use composite indexes/clustering keys wisely
7. ✅ Consider data cardinality and selectivity
8. ✅ Review and tune periodically

### Common Mistakes to Avoid:

1. ❌ Creating too many indexes
2. ❌ Indexing low-cardinality columns
3. ❌ Ignoring write performance impact
4. ❌ Not monitoring index usage
5. ❌ Wrong column order in composite indexes
6. ❌ Indexing small tables unnecessarily
7. ❌ Not considering query patterns
