# Test Suite: Database Creation and Configuration

**Purpose:** Validate the sql_retail_analytics_warehouse database creation and configuration

**Scope:**
- Database existence and naming
- Encoding configuration (UTF-8)
- Locale settings (en_GB.UTF-8)
- Template configuration
- Ownership and privileges
- Connection validation

**Testing Strategy:**
- Existence validation (database created successfully)
- Configuration validation (encoding, collation, ctype)
- Ownership validation (correct owner assigned)
- Connection testing (can establish connections)
- Isolation testing (clean template, no extra objects)

**Prerequisites:**
- PostgreSQL server running
- `setup/create_db.sql` has been executed
- Connection credentials available
- Required packages: psycopg2, pytest, ipytest, pandas
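
The package prerequisite can be checked before any test cell runs. The following is a minimal sketch, assuming each package's import name matches the name listed above; `missing_packages` is a hypothetical helper, not part of the original notebook.

```python
from importlib.util import find_spec

# Import names taken from the prerequisites list above.
REQUIRED_PACKAGES = ["psycopg2", "pytest", "ipytest", "pandas"]


def missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print(f"Missing packages: {missing}")
else:
    print("All required packages are importable")
```

Running this first turns a later `ImportError` into one readable list of what still needs installing.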

## Setup: Import Dependencies & Configure Connection

In [2]:
import os
import psycopg2
from psycopg2 import sql
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
import pytest
import ipytest
import pandas as pd

# Configure ipytest for notebook usage
ipytest.autoconfig()

# Database connection parameters
DB_CONFIG = {
    'host': 'localhost',
    'user': 'postgres',
    'password': os.getenv('POSTGRES_PASSWORD', 'your_password_here')
}

# Target database name
TARGET_DB = 'sql_retail_analytics_warehouse'

print("✅ Dependencies imported successfully")

✅ Dependencies imported successfully


## Fixtures: Database Connections

In [2]:
@pytest.fixture(scope='module')
def postgres_connection():
    """Connection to postgres database for catalog queries."""
    conn = psycopg2.connect(database='postgres', **DB_CONFIG)
    conn.autocommit = True
    yield conn
    conn.close()

@pytest.fixture(scope='module')
def postgres_cursor(postgres_connection):
    """Cursor for postgres database."""
    cursor = postgres_connection.cursor()
    yield cursor
    cursor.close()

@pytest.fixture(scope='module')
def target_connection():
    """Connection to target warehouse database."""
    conn = psycopg2.connect(database=TARGET_DB, **DB_CONFIG)
    conn.autocommit = True
    yield conn
    conn.close()

@pytest.fixture(scope='module')
def target_cursor(target_connection):
    """Cursor for target database."""
    cursor = target_connection.cursor()
    yield cursor
    cursor.close()

print("✅ Fixtures defined")

✅ Fixtures defined


## Test Suite 1: Database Existence

**Tests in this suite:**
1. `test_database_exists` - Queries pg_database catalog to verify the database exists
2. `test_database_name_exact_match` - Validates database name is exactly 'sql_retail_analytics_warehouse' (case-sensitive)
3. `test_database_is_accessible` - Connects to the database and verifies we can execute queries against it

**How these tests work:**
- **Test 1** executes `SELECT COUNT(*) FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: Count = 1 (database exists exactly once)
  - ❌ Failure: Count = 0 (database missing; `pg_database` enforces unique names, so the count cannot exceed 1)
  
- **Test 2** executes `SELECT datname FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datname = 'sql_retail_analytics_warehouse' (exact case match)
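  - ❌ Failure: No row returned, e.g. because the database is missing or was created under a name with different case
  - **Why case matters**: `datname` comparisons are case-sensitive. Unquoted identifiers in `CREATE DATABASE` are folded to lowercase, while quoted identifiers keep their case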
  
- **Test 3** connects directly to the target database and runs `SELECT current_database()`
  - ✅ Success: current_database = 'sql_retail_analytics_warehouse'
  - ❌ Failure: Connection fails or wrong database name returned
  - **Purpose**: Validates not just existence, but actual accessibility
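
The case-sensitivity point in Test 2 can be illustrated without a database. This is a simplified sketch of how PostgreSQL treats identifiers: unquoted names are folded to lowercase, quoted names keep their case. `fold_identifier` is a hypothetical helper and handles only the simple cases.

```python
def fold_identifier(ident: str) -> str:
    """Approximate PostgreSQL identifier folding (simplified)."""
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        # Quoted identifier: strip the quotes, un-double embedded quotes, keep case.
        return ident[1:-1].replace('""', '"')
    # Unquoted identifier: folded to lowercase.
    return ident.lower()


print(fold_identifier('SQL_Retail_Analytics_Warehouse'))
print(fold_identifier('"SQL_Retail_Analytics_Warehouse"'))
```

Under this model, only a database created with a quoted, mixed-case identifier ends up with a `datname` that differs from the lowercase `TARGET_DB`.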

In [3]:
%%ipytest -vv

def test_database_exists(postgres_cursor):
    """Verify sql_retail_analytics_warehouse database exists."""
    postgres_cursor.execute("""
        SELECT COUNT(*)
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    count = postgres_cursor.fetchone()[0]
    assert count == 1, f"Database '{TARGET_DB}' must exist"

def test_database_name_exact_match(postgres_cursor):
    """Verify database name matches exactly (case-sensitive)."""
    postgres_cursor.execute("""
        SELECT datname
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    result = postgres_cursor.fetchone()
    assert result is not None, f"Database '{TARGET_DB}' not found"
    assert result[0] == TARGET_DB, f"Database name mismatch: expected '{TARGET_DB}', got '{result[0]}'"

def test_database_is_accessible(target_cursor):
    """Verify we can connect to and query the database."""
    target_cursor.execute("SELECT current_database()")
    current_db = target_cursor.fetchone()[0]
    assert current_db == TARGET_DB, f"Connected to wrong database: {current_db}"

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 3 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_database_exists PASSED                           [ 33%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_name_exact_match PASSED                 [ 66%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_is_accessible PASSED                    [100%]



## Test Suite 2: Encoding Configuration

**Tests in this suite:**
1. `test_database_encoding_utf8` - Queries pg_database catalog to verify encoding is UTF8
2. `test_database_encoding_from_connection` - Uses SHOW command to verify server_encoding setting
3. `test_client_encoding_utf8` - Verifies client connection is using UTF8 encoding

**How these tests work:**
- **Test 1** queries pg_database catalog: `SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: encoding = 'UTF8'
  - ❌ Failure: encoding ≠ 'UTF8' (e.g., 'LATIN1', 'SQL_ASCII')
  - **Purpose**: Validates database was created with UTF8 encoding (critical for international characters)
  - **Note**: This is a database creation parameter - cannot be changed without recreating the database
  
- **Test 2** executes `SHOW server_encoding` within target database connection
  - ✅ Success: server_encoding = 'UTF8'
  - ❌ Failure: server_encoding ≠ 'UTF8'
  - **Purpose**: Cross-validates encoding from runtime perspective
  - **Validation approach**: Confirms catalog setting matches active session setting
  
- **Test 3** executes `SHOW client_encoding` within target database connection
  - ✅ Success: client_encoding = 'UTF8'
  - ❌ Failure: client_encoding ≠ 'UTF8'
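  - **Purpose**: Ensures the session exchanges text with the server as UTF8
  - **Impact**: PostgreSQL converts between client and server encodings automatically, so the risk is a client that declares one encoding but sends bytes in another; that can garble characters or raise conversion errors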
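
To show what an encoding mismatch looks like, here is a minimal pure-Python sketch with no database involved. It encodes a string as UTF-8 and decodes the bytes as Latin-1, standing in for a client whose declared encoding does not match the bytes it sends.

```python
original = "Caf\u00e9"  # 'Café'

# Bytes a UTF-8 client would put on the wire.
wire_bytes = original.encode("utf-8")

# The same bytes interpreted under the wrong encoding label.
misread = wire_bytes.decode("latin-1")

print(misread)                   # garbled text, no error raised
print(len(original), len(misread))
```

No exception is raised: the misread value is a valid string, which is why this kind of error can go unnoticed.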

In [4]:
%%ipytest -vv

def test_database_encoding_utf8(postgres_cursor):
    """Verify database uses UTF8 encoding."""
    postgres_cursor.execute("""
        SELECT pg_encoding_to_char(encoding) AS encoding
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    encoding = postgres_cursor.fetchone()[0]
    assert encoding == 'UTF8', f"Expected UTF8 encoding, got '{encoding}'"

def test_database_encoding_from_connection(target_cursor):
    """Verify encoding setting from within the database."""
    target_cursor.execute("SHOW server_encoding")
    encoding = target_cursor.fetchone()[0]
    assert encoding == 'UTF8', f"Server encoding should be UTF8, got '{encoding}'"

def test_client_encoding_utf8(target_cursor):
    """Verify client encoding is also UTF8."""
    target_cursor.execute("SHOW client_encoding")
    encoding = target_cursor.fetchone()[0]
    assert encoding == 'UTF8', f"Client encoding should be UTF8, got '{encoding}'"

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 3 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_database_encoding_utf8 PASSED                    [ 33%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_encoding_from_connection PASSED         [ 66%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_client_encoding_utf8 PASSED                      [100%]



## Test Suite 3: Locale Configuration

**Tests in this suite:**
1. `test_database_collation_en_gb` - Verifies datcollate is set to 'en_GB.UTF-8' in pg_database catalog
2. `test_database_ctype_en_gb` - Verifies datctype is set to 'en_GB.UTF-8' in pg_database catalog
3. `test_lc_collate_matches_database_setting` - Cross-validates LC_COLLATE between postgres and target connections via catalog
4. `test_lc_ctype_matches_database_setting` - Cross-validates LC_CTYPE between postgres and target connections via catalog

**Note:** LC_COLLATE and LC_CTYPE are database creation parameters stored in pg_database catalog, not runtime session parameters.

**How these tests work:**
- **Test 1** executes `SELECT datcollate FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datcollate = 'en_GB.UTF-8'
  - ❌ Failure: datcollate ≠ 'en_GB.UTF-8' (e.g., 'C', 'en_US.UTF-8')
  - **Purpose**: Validates string sorting/comparison rules (affects ORDER BY, indexes, unique constraints)
  - **Impact**: Wrong collation can break alphabetical sorting and cause unexpected query results
  
- **Test 2** executes `SELECT datctype FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datctype = 'en_GB.UTF-8'
  - ❌ Failure: datctype ≠ 'en_GB.UTF-8'
  - **Purpose**: Validates character classification (uppercase/lowercase conversions, regex, LIKE patterns)
  - **Example impact**: UPPER('café') behavior depends on ctype setting
  
- **Test 3** cross-validates LC_COLLATE consistency across different connection contexts
  - From postgres connection: `SELECT datcollate FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - From target connection: `SELECT datcollate FROM pg_database WHERE datname = current_database()`
  - ✅ Success: Both queries return identical 'en_GB.UTF-8'
  - ❌ Failure: Values don't match (configuration inconsistency)
  - **Purpose**: Sanity check that both connections see the same value. `pg_database` is a shared, cluster-wide catalog, so the two queries read the same row and should always agree
  
- **Test 4** cross-validates LC_CTYPE consistency (same approach as Test 3 for datctype)
  - ✅ Success: Both queries return identical 'en_GB.UTF-8'
  - ❌ Failure: Values don't match
  - **Purpose**: Validates ctype configuration is consistent across query contexts

**⚠️ CRITICAL: Why these tests matter**
- Locale settings are **IMMUTABLE after database creation**
- Incorrect settings require **dropping and recreating the entire database**
- These tests catch configuration errors **before** any data is loaded
- The locale cannot be changed in place, so correcting it after data import means dumping the data, recreating the database, and reloading it, which also rebuilds every index
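
The effect of collation on sort order can be sketched in plain Python. Code-point ordering below resembles what a `C` collation produces, while `str.casefold` is only a rough stand-in for a linguistic locale; it is not the actual rule set of `en_GB.UTF-8`.

```python
words = ["apple", "Banana", "cherry"]

# Code-point order: uppercase ASCII letters sort before lowercase ones.
codepoint_order = sorted(words)

# Case-insensitive order, a rough approximation of a linguistic collation.
linguistic_order = sorted(words, key=str.casefold)

print(codepoint_order)
print(linguistic_order)
```

The two lists differ in where `Banana` lands, which is the kind of `ORDER BY` discrepancy a wrong collation introduces.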

In [6]:
%%ipytest -vv

def test_database_collation_en_gb(postgres_cursor):
    """Verify database uses en_GB.UTF-8 collation."""
    postgres_cursor.execute("""
        SELECT datcollate
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    collation = postgres_cursor.fetchone()[0]
    assert collation == 'en_GB.UTF-8', f"Expected 'en_GB.UTF-8' collation, got '{collation}'"

def test_database_ctype_en_gb(postgres_cursor):
    """Verify database uses en_GB.UTF-8 character classification."""
    postgres_cursor.execute("""
        SELECT datctype
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    ctype = postgres_cursor.fetchone()[0]
    assert ctype == 'en_GB.UTF-8', f"Expected 'en_GB.UTF-8' ctype, got '{ctype}'"

def test_lc_collate_matches_database_setting(postgres_cursor, target_cursor):
    """Verify datcollate read from the target connection matches the postgres connection."""
    # Get database collation from catalog
    postgres_cursor.execute("""
        SELECT datcollate
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    expected_collate = postgres_cursor.fetchone()[0]
    
    # Read the same catalog row from inside the target database
    target_cursor.execute("""
        SELECT datcollate
        FROM pg_database
        WHERE datname = current_database()
    """)
    actual_collate = target_cursor.fetchone()[0]
    
    assert actual_collate == expected_collate, \
        f"LC_COLLATE mismatch: expected '{expected_collate}', got '{actual_collate}'"

def test_lc_ctype_matches_database_setting(postgres_cursor, target_cursor):
    """Verify datctype read from the target connection matches the postgres connection."""
    # Get database ctype from catalog
    postgres_cursor.execute("""
        SELECT datctype
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    expected_ctype = postgres_cursor.fetchone()[0]
    
    # Query via database properties
    target_cursor.execute("""
        SELECT datctype
        FROM pg_database
        WHERE datname = current_database()
    """)
    actual_ctype = target_cursor.fetchone()[0]
    
    assert actual_ctype == expected_ctype, \
        f"LC_CTYPE mismatch: expected '{expected_ctype}', got '{actual_ctype}'"

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 4 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_database_collation_en_gb PASSED                  [ 25%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_ctype_en_gb PASSED                      [ 50%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_lc_collate_m

## Test Suite 4: Template Configuration

**Tests in this suite:**
1. `test_database_allows_connections` - Verifies datallowconn=true (database accepts connections)
2. `test_database_not_a_template` - Verifies datistemplate=false (database is not a template)
3. `test_no_active_connections_limit` - Verifies datconnlimit=-1 (unlimited connections allowed)

**How these tests work:**
- **Test 1** executes `SELECT datallowconn FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datallowconn = True (boolean)
  - ❌ Failure: datallowconn = False
  - **Purpose**: Ensures database accepts connections (required for data warehouse operations)
  - **Context**: Of the built-in templates, only `template0` has datallowconn=False (to keep it pristine); `template1` accepts connections
  
- **Test 2** executes `SELECT datistemplate FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datistemplate = False (boolean)
  - ❌ Failure: datistemplate = True
  - **Purpose**: Confirms this is a working database, not a template
  - **Impact**: Template databases cannot be dropped with simple DROP DATABASE command
  - **Use case**: Templates are meant for CREATE DATABASE...TEMPLATE operations, not data storage
  
- **Test 3** executes `SELECT datconnlimit FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: datconnlimit = -1 (integer, unlimited)
  - ❌ Failure: datconnlimit ≥ 0 (any specific limit)
  - **Purpose**: Validates that no per-database connection cap is set (useful for a data warehouse with multiple concurrent ETL jobs)
  - **Value meanings**:
    - `-1` = no per-database limit (the server-wide `max_connections` setting still applies)
    - `0` = no connections allowed (database locked)
    - Positive number = maximum concurrent connections

**💼 Business Impact:**
Leaving the per-database limit at -1 suits data warehouses where:
- Multiple ETL processes run simultaneously
- BI tools maintain connection pools
- Analysts query data concurrently
- Monitoring tools maintain persistent connections
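
The `datconnlimit` value meanings listed above can be captured in a small helper. This is an illustrative sketch only; `describe_connlimit` is a hypothetical function, and values outside the three documented cases are treated as unexpected.

```python
def describe_connlimit(datconnlimit: int) -> str:
    """Translate a pg_database.datconnlimit value into a readable description."""
    if datconnlimit == -1:
        return "no per-database limit"
    if datconnlimit == 0:
        return "no connections allowed"
    if datconnlimit > 0:
        return f"at most {datconnlimit} concurrent connections"
    # Anything else is outside the cases described in this notebook.
    raise ValueError(f"unexpected datconnlimit: {datconnlimit}")


for value in (-1, 0, 50):
    print(value, "->", describe_connlimit(value))
```

A helper like this could make a failing assertion message in `test_no_active_connections_limit` easier to read.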

In [7]:
%%ipytest -vv

def test_database_allows_connections(postgres_cursor):
    """Verify database allows connections (datallowconn is true)."""
    postgres_cursor.execute("""
        SELECT datallowconn
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    allows_conn = postgres_cursor.fetchone()[0]
    assert allows_conn is True, "Database should allow connections"

def test_database_not_a_template(postgres_cursor):
    """Verify database is not marked as a template."""
    postgres_cursor.execute("""
        SELECT datistemplate
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    is_template = postgres_cursor.fetchone()[0]
    assert is_template is False, "Database should not be a template"

def test_no_active_connections_limit(postgres_cursor):
    """Verify database has no connection limit (-1 = unlimited)."""
    postgres_cursor.execute("""
        SELECT datconnlimit
        FROM pg_database
        WHERE datname = %s
    """, (TARGET_DB,))
    
    conn_limit = postgres_cursor.fetchone()[0]
    assert conn_limit == -1, f"Expected unlimited connections (-1), got {conn_limit}"

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 3 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_database_allows_connections PASSED               [ 33%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_not_a_template PASSED                   [ 66%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_no_active_connections_limit PASSED               [100%]

## Test Suite 5: Ownership and Privileges

**Tests in this suite:**
1. `test_database_owner` - Queries pg_database to verify database has a valid owner (postgres or creating user)
2. `test_current_user_can_create_schema` - Creates a test schema to verify CREATE SCHEMA privilege, then cleans up

**Note:** Tests verify sufficient privileges for data warehouse operations.

**How these tests work:**
- **Test 1** queries database ownership: `SELECT pg_catalog.pg_get_userbyid(d.datdba) AS owner FROM pg_catalog.pg_database d WHERE d.datname = 'sql_retail_analytics_warehouse'`
  - ✅ Success: owner = 'postgres' (or creating user name, non-empty string)
  - ❌ Failure: owner is NULL or empty string
  - **Purpose**: Validates database has a valid owner with full privileges
  - **Technical note**: `pg_get_userbyid()` converts internal OID (Object Identifier) to human-readable username
  - **Security**: The database owner can alter or drop the database and grant database-level privileges to others; objects created by other roles remain owned by those roles
  
- **Test 2** performs a CREATE/DROP cycle to test DDL privileges:
  1. Executes `CREATE SCHEMA IF NOT EXISTS test_privilege_check`
  2. Verifies creation: `SELECT COUNT(*) FROM information_schema.schemata WHERE schema_name = 'test_privilege_check'`
     - ✅ Success: count = 1 (schema created successfully)
     - ❌ Failure: count ≠ 1 or SQL error (insufficient privileges)
  3. Cleans up: `DROP SCHEMA IF EXISTS test_privilege_check CASCADE`
  - **Purpose**: Validates current user can perform essential DDL operations (required for Bronze/Silver/Gold schema creation)
  - **Side effects**: Creates and removes test_privilege_check schema (ephemeral, no lasting impact)
  - **Why functional testing**: Checking pg_catalog for privileges doesn't guarantee actual DDL operations will succeed

**⚠️ Why This Matters:**
- Without CREATE SCHEMA privilege, the entire medallion architecture setup (Bronze/Silver/Gold layers) would fail
- This test catches permission issues **before** running expensive ETL scripts
- Functional privilege testing is more reliable than catalog-based permission checks
- Prevents partial setup failures that are difficult to rollback
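
As a read-only complement to the functional CREATE/DROP test, PostgreSQL's `has_database_privilege` function can report whether the current user holds the `CREATE` privilege on a database, which is the privilege that permits creating schemas. The sketch below assumes a psycopg2-style cursor; `can_create_schema` is a hypothetical helper, not part of the original suite.

```python
def can_create_schema(cursor, db_name):
    """Return True if the current user has CREATE privilege on db_name.

    `cursor` is assumed to be a DB-API cursor such as the notebook's
    postgres_cursor fixture value.
    """
    cursor.execute(
        "SELECT has_database_privilege(%s, 'CREATE')",
        (db_name,),
    )
    return bool(cursor.fetchone()[0])
```

For example, `can_create_schema(postgres_cursor, TARGET_DB)` could be called inside a test to fail fast without creating any objects; the functional test still covers what the catalog check alone would not.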

In [8]:
%%ipytest -vv

def test_database_owner(postgres_cursor):
    """Verify database has a non-empty owner (postgres or the creating user)."""
    postgres_cursor.execute("""
        SELECT pg_catalog.pg_get_userbyid(d.datdba) AS owner
        FROM pg_catalog.pg_database d
        WHERE d.datname = %s
    """, (TARGET_DB,))
    
    owner = postgres_cursor.fetchone()[0]
    # Owner should be postgres or the creating user
    assert owner is not None, "Database must have an owner"
    assert len(owner) > 0, "Owner name should not be empty"

def test_current_user_can_create_schema(target_cursor):
    """Verify current user has privileges to create schemas."""
    # Create a test schema (autocommit is on, so it is dropped explicitly below)
    target_cursor.execute("""
        CREATE SCHEMA IF NOT EXISTS test_privilege_check
    """)
    
    try:
        # Verify it was created
        target_cursor.execute("""
            SELECT COUNT(*)
            FROM information_schema.schemata
            WHERE schema_name = 'test_privilege_check'
        """)
        
        count = target_cursor.fetchone()[0]
        assert count == 1, "User should be able to create schemas"
    finally:
        # Clean up even if the assertion fails
        target_cursor.execute("DROP SCHEMA IF EXISTS test_privilege_check CASCADE")

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 2 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_database_owner PASSED                            [ 50%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_current_user_can_create_schema PASSED            [100%]





## Test Suite 6: Clean State Validation

**Tests in this suite:**
1. `test_expected_default_schemas_only` - Validates only expected schemas exist (public, bronze, silver, gold, setup)
2. `test_database_size_reasonable` - Checks database size is under 20MB (reasonable for a clean warehouse)

**Note:** These tests ensure the database starts in a clean, predictable state for the data warehouse.

**How these tests work:**
- **Test 1** queries all user-created schemas: `SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast') ORDER BY schema_name`
  - ✅ Success: schemas in {'public', 'bronze', 'silver', 'gold', 'setup'} (after create_schemas.sql runs)
  - ⚠️ Warning: Prints list if unexpected schemas found (e.g., 'test', 'temp', 'legacy')
  - ❌ Failure: Only fails if 'public' schema is missing (always required)
  - **Purpose**: Detects schema pollution or incomplete cleanup from previous operations
  - **Behavior**: Lenient approach - warns but doesn't fail on extra schemas (allows for dev experimentation)
  - **Filtered schemas**: Excludes PostgreSQL system schemas (pg_catalog, information_schema, pg_toast)
  
- **Test 2** queries database size:
  ```sql
  SELECT pg_size_pretty(pg_database_size(current_database())) AS size,
         pg_database_size(current_database()) AS size_bytes
  ```
  - ✅ Success: size_bytes < 20,971,520 (20MB)
  - ❌ Failure: size_bytes ≥ 20MB (indicates data/objects already loaded)
  - **Typical clean database size**: 8-10MB (includes system catalogs and empty schemas)
  - **Purpose**: Ensures database is in initial clean state before ETL operations begin
  - **Diagnostic**: Always prints human-readable size (e.g., "9845 kB") for visual confirmation
  - **Size components**: Includes all tables, indexes, sequences, and system objects

**🔍 Troubleshooting Failed Tests:**
- **Unexpected schemas found**: 
  - May indicate incomplete cleanup from previous testing
  - Could be leftover from failed ETL runs
  - Action: Review schema list, DROP unwanted schemas if safe
  
- **Size > 20MB**: 
  - Database may already have data loaded (not a clean state)
  - Large objects (LOBs, BLOBs) may exist
  - Bloated system catalogs from many DROP/CREATE cycles
  - Action: Investigate with `SELECT pg_size_pretty(pg_total_relation_size('schema.table'))` on largest tables
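
When the size test fails, the largest tables are usually the first thing to look at. The following sketch lists them using `pg_total_relation_size` over `pg_class`; it assumes a psycopg2-style cursor connected to the target database, and `largest_relations` and `MAX_CLEAN_BYTES` are hypothetical names introduced for illustration.

```python
# Same threshold the size test uses: 20 MB expressed in bytes.
MAX_CLEAN_BYTES = 20 * 1024 * 1024

LARGEST_RELATIONS_SQL = """
    SELECT n.nspname AS schema_name,
           c.relname AS table_name,
           pg_total_relation_size(c.oid) AS total_bytes
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY total_bytes DESC
    LIMIT %s
"""


def largest_relations(cursor, limit=5):
    """Return (schema, table, total_bytes) rows for the largest user tables."""
    cursor.execute(LARGEST_RELATIONS_SQL, (limit,))
    return cursor.fetchall()
```

Calling `largest_relations(target_cursor)` from a scratch cell would show which schema is carrying the unexpected data.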

In [9]:
%%ipytest -vv

def test_expected_default_schemas_only(target_cursor):
    """Warn about unexpected schemas and verify that 'public' exists."""
    target_cursor.execute("""
        SELECT schema_name
        FROM information_schema.schemata
        WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
        ORDER BY schema_name
    """)
    
    schemas = [row[0] for row in target_cursor.fetchall()]
    
    # Should only have 'public' by default (if clean template0)
    # May have bronze/silver/gold if create_schemas.sql was run
    # This test verifies no unexpected schemas exist
    expected_schemas = {'public', 'bronze', 'silver', 'gold', 'setup'}
    unexpected = set(schemas) - expected_schemas
    
    if unexpected:
        print(f"⚠️  Unexpected schemas found: {unexpected}")
        print(f"   All schemas: {schemas}")
    
    # At minimum, 'public' should exist
    assert 'public' in schemas, "Default 'public' schema should exist"

def test_database_size_reasonable(target_cursor):
    """Verify database size is reasonable for a clean database."""
    target_cursor.execute("""
        SELECT pg_size_pretty(pg_database_size(current_database())) AS size,
               pg_database_size(current_database()) AS size_bytes
    """)
    
    size_pretty, size_bytes = target_cursor.fetchone()
    
    # Clean database should be under 20MB
    max_size_bytes = 20 * 1024 * 1024  # 20MB
    
    print(f"Database size: {size_pretty}")
    assert size_bytes < max_size_bytes, \
        f"Database seems too large for a clean database: {size_pretty}"

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 2 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_expected_default_schemas_only PASSED             [ 50%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_database_size_reasonable PASSED                  [100%]





## Test Suite 7: Connection and Session Settings

**Tests in this suite:**
1. `test_timezone_setting` - Verifies timezone is configured (uses SHOW timezone)
2. `test_datestyle_setting` - Verifies DateStyle includes 'ISO' for consistent date formatting
3. `test_can_create_table` - Creates and drops a test table to validate DDL operations work
4. `test_can_insert_and_query_utf8_data` - Inserts and retrieves UTF-8 data (emojis, Japanese, Greek) to verify encoding works correctly

**Note:** These tests validate the database is fully functional and ready for data warehouse operations.

**How these tests work:**
- **Test 1** executes `SHOW timezone`
  - ✅ Success: Non-empty string (e.g., 'UTC', 'Europe/London', 'America/New_York')
  - ❌ Failure: timezone is NULL or empty string
  - **Purpose**: Validates timezone configuration exists (critical for timestamp operations)
  - **Note**: Actual timezone value not tested (depends on server/client configuration)
  - **Impact**: Affects how TIMESTAMP WITH TIME ZONE values are interpreted on input and displayed; the stored value itself is always UTC
  
- **Test 2** executes `SHOW datestyle`
  - ✅ Success: String containing 'ISO' (e.g., 'ISO, DMY', 'ISO, MDY')
  - ❌ Failure: 'ISO' not in datestyle string (e.g., 'Postgres, DMY', 'SQL, MDY')
  - **Purpose**: Ensures consistent date formatting (ISO standard prevents ambiguity)
  - **Why ISO matters**: 
    - ISO format (YYYY-MM-DD) is unambiguous internationally
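    - 'Postgres' style (e.g., Wed Dec 17 07:37:16 1997 PST) is verbose, and its day/month order follows the DMY/MDY setting
    - 'SQL' style (e.g., 12/17/1997) is ambiguous between day/month and month/day order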
  - **Data warehouse impact**: Prevents date parsing errors in ETL pipelines
  
- **Test 3** performs full DDL cycle to validate table operations:
  1. Executes `CREATE TABLE IF NOT EXISTS test_table_creation (id SERIAL PRIMARY KEY, name TEXT NOT NULL)`
  2. Verifies: `SELECT COUNT(*) FROM information_schema.tables WHERE table_name = 'test_table_creation'`
     - ✅ Success: count = 1
     - ❌ Failure: count ≠ 1 or SQL error
  3. Cleans up: `DROP TABLE IF EXISTS test_table_creation`
  - **Purpose**: Validates DDL operations work (required for Bronze layer table creation)
  - **Why functional testing**: Tests actual CREATE TABLE capability, not just theoretical permissions
  - **Side effects**: Temporarily creates and removes test_table_creation (no lasting impact)
  - **Tests**: PRIMARY KEY constraint creation, SERIAL sequence generation, NOT NULL constraints
  
- **Test 4** performs comprehensive UTF-8 validation:
  1. Creates temp table: `CREATE TEMP TABLE test_utf8 (data TEXT)`
  2. Inserts 5 diverse test strings:
     - 'Hello World' (ASCII baseline - control case)
     - 'Café' (Latin-1 Supplement: é = U+00E9)
     - '日本語' (Japanese CJK Unified Ideographs)
     - '🎯📊' (Emoji: U+1F3AF, U+1F4CA - 4-byte UTF-8)
     - 'Ελληνικά' (Greek, from the Greek and Coptic block)
  3. Queries back: `SELECT data FROM test_utf8 ORDER BY data`
  4. Validates: len(results) == 5 (all strings retrievable without corruption)
  - ✅ Success: All 5 strings returned exactly as inserted
  - ❌ Failure: Count ≠ 5, or any string corrupted (e.g., '???' or '日???')
  - **Purpose**: End-to-end UTF-8 validation (client → network → storage → retrieval)
  - **Technical**: Uses TEMP table (automatically cleaned up at session end, no cleanup code needed)
  - **Coverage**: Tests 1-byte (ASCII), 2-byte (Latin), 3-byte (CJK), and 4-byte (Emoji) UTF-8 sequences
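
The byte-length coverage claimed for Test 4 can be confirmed in plain Python. The sketch below encodes one character from each class; the characters are written as escapes so the source file's own encoding cannot affect the result.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "ASCII 'H'": "H",
    "Latin 'e-acute'": "\u00e9",      # from 'Café'
    "Greek 'Epsilon'": "\u0395",      # from 'Ελληνικά'
    "CJK 'day/sun'": "\u65e5",        # from '日本語'
    "Emoji 'dart'": "\U0001F3AF",     # from '🎯📊'
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s)")
```

The Greek sample also falls in the 2-byte range, so the five test strings between them cover every UTF-8 sequence length.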

**🌍 Why UTF-8 Testing is Critical:**
- Retail data includes international customer names (e.g., 'François', 'José', '田中')
- Product descriptions in multiple languages
- Special characters in addresses, currencies (€, £, ¥)
- Modern data includes emojis in customer feedback, social media
- **Encoding corruption is often irreversible** - once characters have been replaced or dropped, the original values cannot be recovered
- Silent corruption is worse than errors - bad data looks valid but is wrong

In [10]:
%%ipytest -vv

def test_timezone_setting(target_cursor):
    """Verify timezone is set (should have default or configured value)."""
    target_cursor.execute("SHOW timezone")
    timezone = target_cursor.fetchone()[0]
    assert timezone is not None, "Timezone should be set"
    assert len(timezone) > 0, "Timezone value should not be empty"

def test_datestyle_setting(target_cursor):
    """Verify DateStyle is set to ISO standard."""
    target_cursor.execute("SHOW datestyle")
    datestyle = target_cursor.fetchone()[0]
    # Should contain 'ISO' for consistent date formatting
    assert 'ISO' in datestyle, f"DateStyle should include ISO, got '{datestyle}'"

def test_can_create_table(target_cursor):
    """Verify basic DDL operations work."""
    # Create a test table
    target_cursor.execute("""
        CREATE TABLE IF NOT EXISTS test_table_creation (
            id SERIAL PRIMARY KEY,
            name TEXT NOT NULL
        )
    """)
    
    try:
        # Verify it exists
        target_cursor.execute("""
            SELECT COUNT(*)
            FROM information_schema.tables
            WHERE table_name = 'test_table_creation'
        """)

        count = target_cursor.fetchone()[0]
        assert count == 1, "Should be able to create tables"
    finally:
        # Clean up even if the assertion above fails
        target_cursor.execute("DROP TABLE IF EXISTS test_table_creation")

def test_can_insert_and_query_utf8_data(target_cursor):
    """Verify UTF-8 data can be inserted and queried correctly."""
    # Create temp table
    target_cursor.execute("""
        CREATE TEMP TABLE test_utf8 (data TEXT)
    """)
    
    # Insert various UTF-8 characters
    test_strings = [
        'Hello World',
        'Café',
        '日本語',  # Japanese
        '🎯📊',    # Emojis
        'Ελληνικά'  # Greek
    ]
    
    for test_str in test_strings:
        target_cursor.execute(
            "INSERT INTO test_utf8 (data) VALUES (%s)",
            (test_str,)
        )
    
    # Query back and verify
    target_cursor.execute("SELECT data FROM test_utf8 ORDER BY data")
    results = [row[0] for row in target_cursor.fetchall()]
    
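    # All strings should be retrievable and unchanged (ORDER BY may reorder them)
    assert len(results) == len(test_strings), "All UTF-8 strings should be retrievable"
    assert sorted(results) == sorted(test_strings), "UTF-8 strings should round-trip unchanged"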

platform win32 -- Python 3.12.4, pytest-8.4.2, pluggy-1.6.0 -- c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\Laurent\Studies\sql-ultimate-course\Udemy-SQL-Data-Warehouse-Project\tests\tests_setup
plugins: anyio-4.11.0, nbmake-1.5.5
collecting ... collected 4 items

t_2cd27d8b688c4eb9be5163904688617e.py::test_timezone_setting PASSED                          [ 25%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_datestyle_setting PASSED                         [ 50%]
t_2cd27d8b688c4eb9be5163904688617e.py::test_can_create_table PASSED

## Summary: Run All Tests

**Executes all test suites (22 tests total):**
- Suite 1: Database Existence (3 tests)
- Suite 2: Encoding Configuration (3 tests)
- Suite 3: Locale Configuration (4 tests)
- Suite 4: Template Configuration (3 tests)
- Suite 5: Ownership and Privileges (2 tests)
- Suite 6: Clean State Validation (2 tests)
- Suite 7: Connection and Session Settings (4 tests)

**How this cell works:**
- Executes `ipytest.run('-vv')` which runs all pytest functions defined in this notebook
- `-vv` flag provides **very verbose** output showing:
  - Each test function name as it runs
  - PASSED/FAILED status for each test
  - Detailed assertion messages on failure
  - Full traceback on errors
  - Percentage completion progress
  
**What verbose output includes:**
- Test collection phase (shows how many tests found)
- Platform and Python version info
- Pytest version and loaded plugins
- Individual test results with pass/fail status
- Captured stdout/stderr for failed tests
- Final summary with counts and execution time

**✅ Success Criteria:**
- All 22 tests show `PASSED` status
- Final summary shows: `22 passed in X.XXs`
- No `FAILED`, `ERROR`, or `SKIPPED` statuses
- No warnings about connection issues or deprecations
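
The success criteria can also be checked mechanically against the final summary line. The sketch below is an assumption-laden illustration: `parse_summary` and `meets_success_criteria` are hypothetical helpers, and the sample summary lines (including their timings) are made up for the example.

```python
import re

def parse_summary(summary_line):
    """Extract outcome counts from a pytest-style summary line into a dict."""
    pattern = r'(\d+) (passed|failed|errors?|skipped|warnings?)\b'
    counts = {}
    for number, label in re.findall(pattern, summary_line):
        counts[label.rstrip('s')] = int(number)  # 'errors' -> 'error'
    return counts

def meets_success_criteria(summary_line, expected_passed=22):
    """True when every expected test passed and nothing else was reported."""
    counts = parse_summary(summary_line)
    clean = all(counts.get(key, 0) == 0
                for key in ('failed', 'error', 'skipped', 'warning'))
    return counts.get('passed', 0) == expected_passed and clean

print(meets_success_criteria('22 passed in 1.50s'))            # True
print(meets_success_criteria('1 failed, 21 passed in 1.50s'))  # False
```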

**🔧 Troubleshooting Test Failures:**

| Failure Type | Likely Cause | Solution |
|-------------|--------------|----------|
| Connection refused | PostgreSQL not running | Start PostgreSQL service |
| Password authentication failed | Wrong credentials | Verify the `POSTGRES_PASSWORD` environment variable (for example, as set in `.env`) |
| Database does not exist | `create_db.sql` not run | Execute `setup/create_db.sql` first |
| Encoding mismatch | Database created with wrong encoding | Drop and recreate with `ENCODING='UTF8'` |
| Locale mismatch | Database created with wrong locale | Drop and recreate with correct LC_COLLATE/LC_CTYPE |
| Permission denied | Insufficient user privileges | Grant CREATE privilege to current user |
| Size test fails | Data already loaded | Drop and recreate database for clean state |
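
As a rough sketch, part of the table above can be turned into a lookup that maps an error message to the suggested fix. `suggest_fix` and `HINTS` are hypothetical names, and the message fragments are assumptions about typical PostgreSQL error text rather than an exhaustive list.

```python
# Each entry: (fragments that must all appear in the message, suggested fix)
HINTS = [
    (('connection refused',), 'Start PostgreSQL service'),
    (('password authentication failed',), 'Verify POSTGRES_PASSWORD'),
    (('database', 'does not exist'), 'Execute setup/create_db.sql first'),
    (('permission denied',), 'Grant CREATE privilege to current user'),
]

def suggest_fix(error_message):
    """Return the first matching fix from the table, or None if nothing matches."""
    text = error_message.lower()
    for fragments, fix in HINTS:
        if all(fragment in text for fragment in fragments):
            return fix
    return None

print(suggest_fix('FATAL:  database "sql_retail_analytics_warehouse" does not exist'))
```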

**📊 Reading Test Output:**
- `[  4%]` indicates progress through test suite
- `PASSED` in green = test succeeded
- `FAILED` in red = assertion failed (see details below)
- `ERROR` in red = test couldn't run (setup issue)
- Final line shows total time - useful for performance tracking
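
The progress marker can be reproduced with integer arithmetic. This sketch assumes the percentage is simply the whole-number share of tests completed so far; `progress_marker` is a hypothetical helper, not a pytest API.

```python
def progress_marker(completed, total):
    """Format the share of completed tests the way it appears in -vv output."""
    return f"[{completed * 100 // total:3d}%]"

print(progress_marker(1, 22))   # [  4%]
print(progress_marker(2, 4))    # [ 50%]
print(progress_marker(22, 22))  # [100%]
```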

In [None]:
# Run all tests in this notebook
ipytest.run('-vv')

## Manual Inspection: Database Properties

**What this cell does:**
1. **Database Configuration** - Displays all database properties from pg_database catalog (owner, encoding, collation, ctype, size, etc.)
2. **Session Settings** - Shows runtime configuration parameters (server_encoding, client_encoding, timezone, datestyle)
3. **Locale Settings** - Displays database creation locale parameters (lc_collate, lc_ctype) from catalog
4. **Schemas** - Lists all user-created schemas with their owners
5. **Database Statistics** - Shows database size, table count, and schema count

Run this cell for a comprehensive visual overview of the database configuration.

**How this cell works:**

**Step 1: Database Configuration (pg_database catalog query)**
```sql
SELECT
    d.datname,                                  -- Database name
    pg_catalog.pg_get_userbyid(d.datdba),      -- Owner (converts OID to username)
    pg_catalog.pg_encoding_to_char(d.encoding), -- Encoding (e.g., UTF8)
    d.datcollate,                              -- Collation (e.g., en_GB.UTF-8)
    d.datctype,                                -- Character classification
    d.datallowconn,                            -- Allows connections (boolean)
    d.datconnlimit,                            -- Connection limit (-1 = unlimited)
    d.datistemplate,                           -- Is template (boolean)
    pg_size_pretty(pg_database_size(d.datname)) -- Human-readable size
FROM pg_catalog.pg_database d
WHERE d.datname = 'sql_retail_analytics_warehouse'
```
- **Output format**: Transposed DataFrame (properties as rows for readability)
- **Expected values**: See test suites above for expected configuration

**Step 2: Session Settings (runtime parameters)**
```sql
SELECT 'server_encoding', current_setting('server_encoding')
UNION ALL SELECT 'client_encoding', current_setting('client_encoding')
UNION ALL SELECT 'timezone', current_setting('timezone')
UNION ALL SELECT 'datestyle', current_setting('datestyle')
```
- **Output format**: 2-column DataFrame (setting | value)
- **Expected values**:
  - server_encoding: 'UTF8'
  - client_encoding: 'UTF8'
  - timezone: System dependent (e.g., 'UTC', 'Europe/London')
  - datestyle: Should include 'ISO'
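
A `datestyle` value combines an output format with a field order, so checking for 'ISO' is really a check on the first component. A minimal sketch, with `parse_datestyle` as a hypothetical helper:

```python
def parse_datestyle(value):
    """Split a DateStyle value such as 'ISO, DMY' into (format, field_order)."""
    parts = [part.strip() for part in value.split(',')]
    output_format = parts[0]
    field_order = parts[1] if len(parts) > 1 else None
    return output_format, field_order

output_format, field_order = parse_datestyle('ISO, DMY')
print(output_format)  # ISO
print(field_order)    # DMY
```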

**Step 3: Locale Settings (database catalog - not runtime)**
```sql
SELECT 'lc_collate', datcollate FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'
UNION ALL
SELECT 'lc_ctype', datctype FROM pg_database WHERE datname = 'sql_retail_analytics_warehouse'
```
- **Output format**: 2-column DataFrame (setting | value)
- **Expected values**:
  - lc_collate: 'en_GB.UTF-8'
  - lc_ctype: 'en_GB.UTF-8'
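- **Note**: These values are fixed at database creation and stored in `pg_database`. PostgreSQL 16 removed the read-only `lc_collate` and `lc_ctype` settings, so `current_setting()` cannot be relied on to read them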

**Step 4: Schemas (user-created schemas)**
```sql
SELECT
    schema_name,
    pg_catalog.pg_get_userbyid(schema_owner::regrole::oid) AS owner
FROM information_schema.schemata
WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY schema_name
```
- **Output format**: 2-column DataFrame (schema_name | owner)
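- **Expected values**: {public, bronze, silver, gold, setup} with their respective owners; per-session `pg_temp_N` and `pg_toast_temp_N` schemas may also be listed, because the filter does not exclude them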
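
The Step 4 filter names three system schemas explicitly, so temporary schemas created by a session (such as `pg_temp_28`) still pass through it. The sketch below shows one way to drop them on the Python side; `user_schemas` and the regular expression are illustrative assumptions, not part of the notebook.

```python
import re

SYSTEM_SCHEMAS = {'pg_catalog', 'information_schema', 'pg_toast'}
# Matches per-session temporary schemas: pg_temp_N and pg_toast_temp_N
TEMP_SCHEMA = re.compile(r'^pg_(toast_)?temp_\d+$')

def user_schemas(names):
    """Keep only schema names that are neither system nor temporary schemas."""
    return [name for name in names
            if name not in SYSTEM_SCHEMAS and not TEMP_SCHEMA.match(name)]

listed = ['bronze', 'gold', 'pg_temp_28', 'pg_toast_temp_28', 'public', 'silver']
print(user_schemas(listed))  # ['bronze', 'gold', 'public', 'silver']
```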

**Step 5: Database Statistics (size and object counts)**
```sql
SELECT
    current_database() AS database,
    pg_size_pretty(pg_database_size(current_database())) AS total_size,
    (SELECT COUNT(*) FROM information_schema.tables 
     WHERE table_schema NOT IN ('pg_catalog', 'information_schema')) AS user_tables,
    (SELECT COUNT(*) FROM information_schema.schemata 
     WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')) AS user_schemas
```
- **Output format**: Single-row DataFrame with 4 columns
- **Expected values**:
  - database: 'sql_retail_analytics_warehouse'
  - total_size: < 20MB (e.g., "9845 kB")
  - user_tables: Depends on setup stage (0 if clean, >0 if Bronze tables created)
  - user_schemas: 5 (public, bronze, silver, gold, setup), plus any `pg_temp_N` / `pg_toast_temp_N` schemas present in the session
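
Because `pg_size_pretty` returns text, comparing it with the 20 MB threshold requires converting it back to bytes. A minimal sketch, assuming the value has the form `<integer> <unit>`; `pretty_size_to_bytes` is a hypothetical helper.

```python
# Units as printed by pg_size_pretty, expressed in bytes (powers of 1024)
UNITS = {'bytes': 1, 'kB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}

def pretty_size_to_bytes(text):
    """Convert a string such as '8062 kB' into a byte count."""
    number, unit = text.split()
    return int(number) * UNITS[unit]

size = pretty_size_to_bytes('8062 kB')
print(size)                 # 8255488
print(size < 20 * 1024**2)  # True
```

Inside SQL, comparing `pg_database_size(...)` directly against a byte count avoids the round trip through text.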

**Expected console output:**
```
🗄️  Database Configuration:
[Transposed DataFrame showing all database properties]

⚙️  Session Settings:
[DataFrame with 4 runtime settings]

🌍 Locale Settings (from database catalog):
[DataFrame with 2 locale settings]

📁 Schemas:
[DataFrame listing all user schemas]

📊 Database Statistics:
[Single-row DataFrame with size and counts]

✅ Inspection complete
```

**Use this for:**
- Visual confirmation of all test assertions
- Debugging test failures
- Documentation/screenshots of database configuration
- Comparing expected vs. actual configuration
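
For the expected-versus-actual comparison, a plain dictionary diff is often enough. The sketch below uses hypothetical names (`config_mismatches`, `expected`, `actual`); the sample values are the ones this notebook expects for encoding and locale.

```python
def config_mismatches(expected, actual):
    """Return {property: (expected, actual)} for every property that differs."""
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }

expected = {'encoding': 'UTF8', 'collation': 'en_GB.UTF-8', 'ctype': 'en_GB.UTF-8'}
actual = {'owner': 'postgres', 'encoding': 'UTF8',
          'collation': 'en_GB.UTF-8', 'ctype': 'en_GB.UTF-8'}

print(config_mismatches(expected, actual))  # {}
```

With the DataFrame produced by the inspection cell, `df_db_info.iloc[0].to_dict()` would be one way to obtain the `actual` dictionary.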

In [12]:
# Connect to postgres database to query catalog
conn_postgres = psycopg2.connect(database='postgres', **DB_CONFIG)

# Get comprehensive database information
# The database name is passed as a bound parameter rather than interpolated into the SQL
df_db_info = pd.read_sql("""
    SELECT
        d.datname                                  AS database_name,
        pg_catalog.pg_get_userbyid(d.datdba)       AS owner,
        pg_catalog.pg_encoding_to_char(d.encoding) AS encoding,
        d.datcollate                               AS collation,
        d.datctype                                 AS ctype,
        d.datallowconn                             AS allows_connections,
        d.datconnlimit                             AS connection_limit,
        d.datistemplate                            AS is_template,
        pg_size_pretty(pg_database_size(d.datname)) AS size
    FROM pg_catalog.pg_database d
    WHERE d.datname = %(db)s
""", conn_postgres, params={'db': TARGET_DB})

print("\n🗄️  Database Configuration:")
display(df_db_info.T)  # Transpose for better readability

conn_postgres.close()

# Connect to target database for additional info
conn_target = psycopg2.connect(database=TARGET_DB, **DB_CONFIG)

# Get session settings (excluding lc_collate and lc_ctype which are not runtime parameters)
df_settings = pd.read_sql("""
    SELECT
        'server_encoding' AS setting,
        current_setting('server_encoding') AS value
    UNION ALL
    SELECT 'client_encoding', current_setting('client_encoding')
    UNION ALL
    SELECT 'timezone', current_setting('timezone')
    UNION ALL
    SELECT 'datestyle', current_setting('datestyle')
    ORDER BY setting
""", conn_target)

print("\n⚙️  Session Settings:")
display(df_settings)

# Get locale settings from database catalog (not runtime parameters)
# The database name is passed as a bound parameter rather than interpolated into the SQL
df_locale = pd.read_sql("""
    SELECT
        'lc_collate' AS setting,
        datcollate AS value
    FROM pg_database
    WHERE datname = %(db)s
    UNION ALL
    SELECT
        'lc_ctype' AS setting,
        datctype AS value
    FROM pg_database
    WHERE datname = %(db)s
    ORDER BY setting
""", conn_target, params={'db': TARGET_DB})

print("\n🌍 Locale Settings (from database catalog):")
display(df_locale)

# Get list of schemas
df_schemas = pd.read_sql("""
    SELECT
        schema_name,
        pg_catalog.pg_get_userbyid(schema_owner::regrole::oid) AS owner
    FROM information_schema.schemata
    WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
    ORDER BY schema_name
""", conn_target)

print("\n📁 Schemas:")
display(df_schemas)

# Get database statistics
df_stats = pd.read_sql("""
    SELECT
        current_database() AS database,
        pg_size_pretty(pg_database_size(current_database())) AS total_size,
        (
            SELECT COUNT(*)
            FROM information_schema.tables
            WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
        ) AS user_tables,
        (
            SELECT COUNT(*)
            FROM information_schema.schemata
            WHERE schema_name NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
        ) AS user_schemas
""", conn_target)

print("\n📊 Database Statistics:")
display(df_stats)

conn_target.close()
print("\n✅ Inspection complete")


🗄️  Database Configuration:


| | 0 |
|---|---|
| database_name | sql_retail_analytics_warehouse |
| owner | postgres |
| encoding | UTF8 |
| collation | en_GB.UTF-8 |
| ctype | en_GB.UTF-8 |
| allows_connections | True |
| connection_limit | -1 |
| is_template | False |
| size | 8062 kB |



⚙️  Session Settings:


| | setting | value |
|---|---|---|
| 0 | client_encoding | UTF8 |
| 1 | datestyle | ISO, DMY |
| 2 | server_encoding | UTF8 |
| 3 | timezone | Europe/London |



🌍 Locale Settings (from database catalog):


| | setting | value |
|---|---|---|
| 0 | lc_collate | en_GB.UTF-8 |
| 1 | lc_ctype | en_GB.UTF-8 |



📁 Schemas:


| | schema_name | owner |
|---|---|---|
| 0 | bronze | postgres |
| 1 | gold | postgres |
| 2 | pg_temp_28 | postgres |
| 3 | pg_toast_temp_28 | postgres |
| 4 | public | pg_database_owner |
| 5 | silver | postgres |



📊 Database Statistics:


| | database | total_size | user_tables | user_schemas |
|---|---|---|---|---|
| 0 | sql_retail_analytics_warehouse | 8062 kB | 0 | 6 |



✅ Inspection complete
