Change log:

* 5/17/2025: Added notes up to "Why B-Trees Are The Default choice". Will continue adding hand-written notes from video later.

# Database Indexing

## How Database Indexes Work

* data in a database is written to disk as a collection of files
    - think of it like a notebook where you write line by line when a new thought comes in

### Physical Storage and Access Patterns

* when data lives on disk (SSDs nowadays), we can only process that data when we bring it into memory
    - every query has to load data from disk => RAM
* without an index, you have to scan through every page of data one by one
    - loading each page into memory and scanning for the item you're looking for
    - e.g. like looking through every page of a book to find a specific word
* but an index gives us a structured path to the data we need
    - it can tell us which pages contain the data we're looking for
    - this is like using the Table of Contents to jump to the relevant pages
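
A rough sketch of the difference, counting how many pages get loaded in each case. Everything here is a toy assumption for illustration: `PAGE_SIZE`, the `pages` layout, and the plain dict standing in for the index (a real index is itself stored on disk and costs a few page reads to traverse).

```python
# Toy model: a "table" stored as fixed-size pages of (id, name) rows.
PAGE_SIZE = 4  # rows per page (tiny on purpose; real pages hold many more)

rows = [(i, f"user{i}") for i in range(1, 21)]
pages = [rows[i:i + PAGE_SIZE] for i in range(0, len(rows), PAGE_SIZE)]


def full_scan(pages, target_id):
    """Load pages one by one until the row is found; return (row, pages_read)."""
    pages_read = 0
    for page in pages:
        pages_read += 1  # simulates loading one page from disk into RAM
        for row in page:
            if row[0] == target_id:
                return row, pages_read
    return None, pages_read


# Toy index: maps each id to the number of the page that holds it.
index = {row[0]: page_no for page_no, page in enumerate(pages) for row in page}


def index_lookup(pages, index, target_id):
    """Use the index to jump straight to the one page that holds the row."""
    page_no = index.get(target_id)
    if page_no is None:
        return None, 0
    for row in pages[page_no]:  # exactly one page load
        if row[0] == target_id:
            return row, 1
    return None, 1


print(full_scan(pages, 18))            # ((18, 'user18'), 5)
print(index_lookup(pages, index, 18))  # ((18, 'user18'), 1)
```

The scan cost grows with the number of pages in the table, while the lookup cost stays flat in this toy model.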

### The Cost of Indexing

* indexes are not free though
* they require additional disk space since they're a separate data structure
    - and they might even occupy as much space as the original data
* write performance takes a hit
    - inserting new rows/updating existing rows will not only update the main table but also every index on it
    - multiple indexes = multiple disk writes for a single write operation
* __when do indexes hurt more than help?__:
    - when a table has frequent writes but infrequent reads
        - e.g. logging table where we insert new records but rarely query old ones
    - or when the table is small with just a few hundred rows
        - cost of maintaining an index might be more than the cost of a simple sequential scan
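
The write cost can be sketched by counting simulated writes per insert. `ToyTable` and its column names are made up for this example, and it assumes one write per index per row, which ignores real-world details like page splits and write-ahead logging.

```python
class ToyTable:
    """Counts simulated writes; not a real storage engine."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: {} for col in indexed_columns}
        self.writes = 0

    def insert(self, row):
        self.rows.append(row)
        self.writes += 1  # one write to the main table
        row_position = len(self.rows) - 1
        for col, idx in self.indexes.items():
            idx.setdefault(row[col], []).append(row_position)
            self.writes += 1  # plus one write for every index on the table


no_index = ToyTable([])
three_indexes = ToyTable(["id", "email", "created_at"])

for i in range(100):
    row = {"id": i, "email": f"u{i}@example.com", "created_at": i}
    no_index.insert(row)
    three_indexes.insert(row)

print(no_index.writes, three_indexes.writes)  # 100 400
```

For a write-heavy table that is rarely queried (like the logging example), those extra writes buy very little.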

## Types of Indexes:

### B-Tree Indexes:

* most common type of index

#### The Structure of B-Trees:

* self-balancing tree
    - maintains __sorted data__
    - allows efficient insertions, deletions, and searches
    - can have multiple children (hundreds in practice)
    - each node contains an ordered array of keys and pointers, structured to minimize disk reads
* every node in a B-tree follows strict rules:
    - all leaf nodes must be at the same depth
    - each node except the root holds between ceil(m/2) - 1 and m - 1 keys (where m is the order of the tree)
        * order(m) = max # of children a node can have
    - an internal node with k keys must have exactly k+1 children
    - keys within a node are kept in sorted order
* each node fits in a single disk page
    - typically 8KB (PostgreSQL's default page size)
    - e.g. when PostgreSQL needs to find a record with id=350, it might only need to read 2-3 pages from disk: root node => internal node => leaf node
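
A minimal sketch of the root => internal node => leaf walk, treating each node as one page read. This is not PostgreSQL's implementation: the `Node` class, the hand-built tree, and the 16-byte entry size are assumptions for illustration, and row data is kept only in the leaves to keep the code short.

```python
from bisect import bisect_left, bisect_right


class Node:
    """One node = one simulated disk page."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys
        self.children = children  # list of child nodes, None for a leaf
        self.values = values      # row data, leaves only (a simplification)


def search(root, key):
    """Walk from the root down to a leaf; return (value, pages_read)."""
    node, pages_read = root, 1
    while node.children is not None:
        # Keys act as separators: child i holds keys in [keys[i-1], keys[i]).
        node = node.children[bisect_right(node.keys, key)]
        pages_read += 1
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], pages_read
    return None, pages_read


def leaf(*keys):
    return Node(list(keys), values=[f"row-{k}" for k in keys])


# Hand-built 3-level tree; every leaf sits at the same depth.
root = Node([300], children=[
    Node([200], children=[leaf(100, 150), leaf(200, 250)]),
    Node([400], children=[leaf(300, 350), leaf(400, 450)]),
])

print(search(root, 350))  # ('row-350', 3)

# Rough fanout estimate: how many (key, pointer) entries fit in one page.
PAGE_BYTES = 8 * 1024
ENTRY_BYTES = 16  # assumed: 8-byte key + 8-byte pointer, page headers ignored
FANOUT = PAGE_BYTES // ENTRY_BYTES  # 512


def levels_needed(n_keys, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers n_keys."""
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


print(levels_needed(1_000_000, FANOUT))  # 3
```

Under these assumptions a million keys fit in about 3 levels, which is why a lookup only touches a handful of pages.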

#### Real-World Examples

* PostgreSQL uses B-trees for almost everything: primary keys, unique constraints, and most regular indexes
    ```
    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE
    );
    ```
    - this automatically creates 2 B-tree indexes: one for the primary key and one for the unique email constraint
    - these B-trees maintain sorted order
* when you create an index in MongoDB: `db.users.createIndex({ "email": 1 });`
    - you create a B-tree that maps email values to document locations

#### Why B-trees Are the Default Choice:

* they handle most of the access patterns databases need: equality lookups, range scans, and sorted output
* B-trees are a safe bet to use for indexes in interviews
1. maintain sorted order, making range queries and ORDER BY operations efficient
2. self-balancing, ensuring predictable performance even as data grows
3. minimize disk I/O by matching their structure to how databases store data
4. handle both equality searches (email='X') and range searches (age > 25) equally well
5. remain balanced even with random inserts and deletes, avoiding the performance cliffs you might see with simpler tree structures
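
Points 1 and 4 both come from keeping keys sorted. A small sketch using a plain sorted list as a stand-in for the leaf level of a B-tree (the `ages` data and both helper functions are made up for this example):

```python
from bisect import bisect_left, bisect_right

# Stand-in for the sorted keys at the leaf level of an index on "age".
ages = sorted([31, 19, 25, 42, 25, 27, 60, 18])
# ages is now [18, 19, 25, 25, 27, 31, 42, 60]


def equal_to(sorted_keys, value):
    """Equality search: binary-search both edges of the matching run."""
    lo = bisect_left(sorted_keys, value)
    hi = bisect_right(sorted_keys, value)
    return sorted_keys[lo:hi]


def greater_than(sorted_keys, value):
    """Range search: binary-search the start point, then read sequentially."""
    return sorted_keys[bisect_right(sorted_keys, value):]


print(equal_to(ages, 25))      # [25, 25]
print(greater_than(ages, 25))  # [27, 31, 42, 60]
```

Both queries start with the same binary search; the range query just keeps reading from where the search lands, and the results come back already in ORDER BY order.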