# Understanding Physical Storage in PostgreSQL

This notebook explores the fundamental concepts of how PostgreSQL stores data on disk. Understanding the physical layer, including **pages (blocks)**, **tuples (rows)**, and the `ctid` identifier, is crucial for grasping how indexes work and why query performance varies.

We will also look at **TOAST** (The Oversized-Attribute Storage Technique), which is how PostgreSQL handles column values that are too large to store inline in a single page.

---
## Setup

As always, we start by loading the `ipython-sql` extension and connecting to our local `people` database.

In [1]:
%load_ext sql
%sql postgresql://fahad:secret@localhost:5432/people

---
## Pages, Tuples, and the CTID

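PostgreSQL stores table data in fixed-size blocks called **pages** (8KB by default). Each row version stored in a page is called a **tuple**. Every tuple has a system column called **`ctid`** that records its current physical location within the table. The `ctid` is not a stable identifier: it changes when a row is updated or when the table is rewritten, for example by `VACUUM FULL`.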

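The `ctid` consists of two numbers, `(page_number, tuple_index_within_page)`. Page numbers start at 0 and tuple indexes start at 1. Fetching a row by its `ctid` is the most direct access path available, because PostgreSQL can go straight to the page without scanning the table or consulting an index.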
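
To make the two components concrete, here is a minimal Python sketch that parses the text form of a `ctid` and computes where its page would begin in the table's data file. The names `Ctid`, `parse_ctid`, and `page_byte_offset` are hypothetical helpers, not part of PostgreSQL. The sketch assumes the default 8KB page size and ignores the 1 GB segment files that large tables are split into.

```python
import re
from typing import NamedTuple

BLOCK_SIZE = 8192  # assumed default PostgreSQL page size in bytes


class Ctid(NamedTuple):
    page: int   # 0-based page (block) number
    index: int  # 1-based tuple index within the page


def parse_ctid(text: str) -> Ctid:
    """Parse the text form of a ctid, e.g. '(0,1)'."""
    match = re.fullmatch(r"\((\d+),(\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"not a ctid: {text!r}")
    return Ctid(int(match.group(1)), int(match.group(2)))


def page_byte_offset(ctid: Ctid, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset at which the tuple's page starts in the data file."""
    return ctid.page * block_size


ctid = parse_ctid("(3,7)")  # hypothetical location: page 3, tuple 7
print(ctid, page_byte_offset(ctid))
```

Because pages have a fixed size, locating a page is simple multiplication, which is part of why access by `ctid` is so cheap.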

Let's create a simple table and inspect the `ctid` of its rows.

In [3]:
%%sql
DROP TABLE IF EXISTS pages_test;

CREATE TABLE pages_test (
    id SERIAL PRIMARY KEY,
    name TEXT
);

INSERT INTO pages_test (name) VALUES 
('Alice'), ('Bob'), ('Charlie'), ('David'), ('Eve');

 * postgresql://fahad:***@localhost:5432/people
Done.
Done.
5 rows affected.


[]

In [4]:
%%sql
-- Select all columns plus the ctid to see the physical location
SELECT ctid, id, name FROM pages_test;

 * postgresql://fahad:***@localhost:5432/people
5 rows affected.


ctid,id,name
"(0,1)",1,Alice
"(0,2)",2,Bob
"(0,3)",3,Charlie
"(0,4)",4,David
"(0,5)",5,Eve


All five rows have a `ctid` of the form `(0, n)`. This means they all fit into the very first page (page 0) of the table's data file.
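
How many rows fit on one page can be estimated with a rough model of the heap layout. The sketch below is a simplification under stated assumptions: a 24-byte page header, a 4-byte line pointer per tuple, a tuple header padded to 24 bytes, an `int4` id, a 1-byte header for short text values, and tuples rounded up to 8 bytes. The filler names `User 1`, `User 2`, ... and the functions `tuple_size` and `place_rows` are hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 8192     # assumed default page size
PAGE_HEADER = 24     # assumed page header size in bytes
LINE_POINTER = 4     # per-tuple item identifier
TUPLE_HEADER = 24    # heap tuple header, padded
MAXALIGN = 8         # tuples are rounded up to a multiple of this


def tuple_size(name: str) -> int:
    """Estimated on-page size of one (id, name) tuple."""
    # 4 bytes for the int4 id, 1 byte short-varlena header, then the text
    raw = TUPLE_HEADER + 4 + 1 + len(name.encode("utf-8"))
    return (raw + MAXALIGN - 1) // MAXALIGN * MAXALIGN


def place_rows(names):
    """Assign each row a (page, index) pair in insertion order."""
    ctids = []
    page, index, free = 0, 0, PAGE_SIZE - PAGE_HEADER
    for name in names:
        need = tuple_size(name) + LINE_POINTER
        if need > free:  # page is full: move on to a fresh page
            page, index, free = page + 1, 0, PAGE_SIZE - PAGE_HEADER
        index += 1
        free -= need
        ctids.append((page, index))
    return ctids


names = ["Alice", "Bob", "Charlie", "David", "Eve"]
names += [f"User {i}" for i in range(1, 501)]
ctids = place_rows(names)
print(sum(1 for page, _ in ctids if page == 0))  # rows on page 0
```

Under these assumptions the model places 186 rows on page 0, so a few hundred short rows are enough to spill onto further pages. A real table can differ, for example when rows contain NULLs or wider columns.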

Let's add more data to see if we can force PostgreSQL to allocate a new page.

In [5]:
%%sql
-- Let's insert a lot more rows. We use generate_series to create 500 rows.
INSERT INTO pages_test (name)
SELECT 'User ' || i FROM generate_series(1, 500) AS i;

 * postgresql://fahad:***@localhost:5432/people
500 rows affected.


[]

In [6]:
%%sql
-- Let's check the ctid of the last few rows
SELECT ctid, id, name FROM pages_test ORDER BY id DESC LIMIT 10;

 * postgresql://fahad:***@localhost:5432/people
10 rows affected.


ctid,id,name
"(2,134)",505,User 500
"(2,133)",504,User 499
"(2,132)",503,User 498
"(2,131)",502,User 497
"(2,130)",501,User 496
"(2,129)",500,User 495
"(2,128)",499,User 494
"(2,127)",498,User 493
"(2,126)",497,User 492
"(2,125)",496,User 491


The last rows now have `ctid` values of the form `(2, n)`, which shows that PostgreSQL allocated pages 1 and 2 to hold the additional rows. This is also the key to how indexes work: at its core, an index is a data structure that stores a data value (like a name or ID) together with the `ctid` of the row containing that value.
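
The idea of an index as a mapping from values to `ctid`s can be sketched in a few lines of Python. This is a toy model built on a sorted list and binary search, not PostgreSQL's B-tree implementation; the class `ToyIndex` and its methods are hypothetical names.

```python
import bisect


class ToyIndex:
    """Sorted (key, ctid) pairs with binary-search lookup."""

    def __init__(self):
        self._keys = []   # indexed values, kept in sorted order
        self._ctids = []  # ctid stored at the same position as its key

    def insert(self, key, ctid):
        # Find the sorted position, placing duplicates after existing keys
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._ctids.insert(pos, ctid)

    def lookup(self, key):
        """Return the ctids of all rows whose indexed value equals key."""
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return self._ctids[lo:hi]


index = ToyIndex()
index.insert("Alice", (0, 1))
index.insert("Bob", (0, 2))
index.insert("User 500", (2, 134))
print(index.lookup("User 500"))
print(index.lookup("Nobody"))
```

A lookup returns physical locations only. The database still has to visit the page named by each `ctid` to read the row itself.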

---
## Handling Large Data with TOAST

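What happens if a single row is too large to fit in a page (8KB)? A tuple cannot span multiple pages, so PostgreSQL automatically uses a mechanism called **TOAST** (The Oversized-Attribute Storage Technique). With default settings it is triggered once a row exceeds roughly 2KB, well before the row reaches the page size.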

TOAST first tries to compress large column values. If the row is still too large, it splits those values into smaller chunks, stores them in a separate TOAST table, and leaves a pointer in the main table. All of this happens transparently to queries.
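
The compress-then-chunk sequence can be sketched as follows. This is a simplified model, not PostgreSQL's code: `zlib` stands in for PostgreSQL's own compression methods, the decision is made per value rather than per row, and the constants `TOAST_THRESHOLD` and `CHUNK_SIZE` are assumed approximations of the defaults for 8KB pages. The function name `toast` is hypothetical.

```python
import random
import zlib

TOAST_THRESHOLD = 2000  # assumed approximate trigger size in bytes
CHUNK_SIZE = 1996       # assumed maximum chunk size for 8KB pages


def toast(value: bytes):
    """Return (storage_kind, pieces) for a single column value."""
    if len(value) <= TOAST_THRESHOLD:
        return "inline", [value]
    compressed = zlib.compress(value)  # stand-in for the real compressor
    if len(compressed) <= TOAST_THRESHOLD:
        return "inline_compressed", [compressed]
    # Keep the compressed form only if it is actually smaller
    payload = compressed if len(compressed) < len(value) else value
    chunks = [payload[i:i + CHUNK_SIZE]
              for i in range(0, len(payload), CHUNK_SIZE)]
    return "out_of_line", chunks


# Pseudo-random bytes barely compress, so they must be chunked
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(50_000))
kind, chunks = toast(noisy)
print(kind, len(chunks))
```

In the real TOAST table each chunk becomes its own row, identified by a value id and a sequence number, so the original value can be reassembled in order.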

Let's create a table and insert a very large text string to see TOAST in action.

In [7]:
%%sql
DROP TABLE IF EXISTS toast_test;

CREATE TABLE toast_test (
    id SERIAL PRIMARY KEY,
    big_text TEXT
);

-- Create a string that is 57,500 characters long (23 characters x 2,500 repetitions, about 56 KB)
INSERT INTO toast_test (big_text)
SELECT repeat('PostgreSQL is amazing. ', 2500);

 * postgresql://fahad:***@localhost:5432/people
Done.
Done.
1 rows affected.


[]

Even though the text is much larger than 8KB, the `INSERT` succeeds. Let's check how much space the value actually occupies on disk.

In [8]:
%%sql
-- pg_column_size shows the size of the data in a column
-- Note: This is the size after any compression by TOAST.
SELECT pg_column_size(big_text) FROM toast_test;

 * postgresql://fahad:***@localhost:5432/people
1 rows affected.


pg_column_size
694
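
The stored size is far smaller than the input because a repeated phrase compresses extremely well. The Python sketch below illustrates the effect with `zlib`, which is only a stand-in: PostgreSQL uses its own compression methods, so the exact byte counts will not match the 694 reported above.

```python
import zlib

# Same value as the INSERT: a 23-character phrase repeated 2,500 times
text = "PostgreSQL is amazing. " * 2500
raw = text.encode("utf-8")
packed = zlib.compress(raw)

print(len(raw))                        # uncompressed size in bytes
print(len(packed))                     # compressed size in bytes
print(zlib.decompress(packed) == raw)  # compression is lossless
```

A value that shrinks below the roughly 2KB threshold can likely stay in the main table in compressed form, so this particular example mostly demonstrates TOAST compression rather than out-of-line chunking.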


---
## Conclusion

In this notebook, we learned:
- Data is stored in **pages** (blocks) and rows are called **tuples**.
- The `ctid` is a tuple's physical address, composed of `(page_number, tuple_index)`.
- Indexes work by mapping data values to `ctid`s for fast lookups.
- **TOAST** is the automatic mechanism PostgreSQL uses to store values that are too large to keep inline in a page, using compression and out-of-line chunking.

With this foundation, we are now ready to explore how indexes are implemented.