# Handling NULLs in SQL: COALESCE, NVL, and Fail-Safe Pipeline Design

## Introduction

**NULL values are one of the most common causes of data pipeline failures in production.**

As a data engineer, you'll encounter NULLs everywhere:
- 🔴 **Missing data** from source systems
- 🔴 **Failed transformations** that produce NULLs
- 🔴 **JOIN operations** that don't find matches
- 🔴 **Aggregations** that exclude NULLs unexpectedly
- 🔴 **Data type conversions** that fail silently

**The Problem:** NULLs behave differently than empty strings or zeros. They can:
- Break calculations (NULL + 5 = NULL)
- Cause type errors in downstream systems
- Create unexpected results in aggregations
- Fail data validation checks
- Cause application crashes

**What you'll learn:**
- Understanding NULL behavior in SQL
- Using COALESCE and NVL to handle NULLs
- Other NULL handling functions (ISNULL, NULLIF, CASE)
- Production scenarios where NULLs break pipelines
- Designing fail-safe pipelines with NULL handling
- Best practices for NULL management

**Database:** This course uses **Snowflake**. All examples are Snowflake-compatible. Additional notes for SQL Server and other databases are provided where relevant.

---

## Why NULLs Are Dangerous

### NULL vs Empty String vs Zero

```sql
-- NULL: Absence of value (unknown/missing)
-- Empty String: Known value that is empty
-- Zero: Known numeric value of zero

-- These are NOT the same:
NULL          -- Unknown/missing
''            -- Empty string (known to be empty)
0             -- Zero (known numeric value)
```
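
The distinction can be checked directly. The following is a small sketch that assumes Snowflake semantics, where an empty string is a real value distinct from NULL (Oracle, by contrast, treats `''` as NULL). The column aliases are illustrative.

```sql
-- Compare NULL, empty string, and zero (Snowflake semantics assumed)
SELECT
    NULL IS NULL   AS null_is_null,    -- TRUE
    '' IS NULL     AS empty_is_null,   -- FALSE: an empty string is a known value
    0 IS NULL      AS zero_is_null,    -- FALSE: zero is a known value
    LENGTH('')     AS empty_length,    -- 0
    LENGTH(NULL)   AS null_length;     -- NULL: the length of an unknown value is unknown
```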

### How NULLs Break Calculations

**Rule:** Any arithmetic operation with NULL returns NULL

```sql
-- All these return NULL:
SELECT NULL + 5;        -- NULL
SELECT NULL * 10;       -- NULL
SELECT NULL - 3;        -- NULL
SELECT NULL / 2;        -- NULL
SELECT 100 + NULL;      -- NULL
SELECT NULL + NULL;     -- NULL
```

**Impact:** If one NULL value exists in a calculation, the entire result becomes NULL, potentially breaking downstream processes.

---

## Real-World Scenario: E-Commerce Revenue Pipeline

**The Business Problem:**
Your company has a daily revenue pipeline that:
1. Extracts sales data from multiple sources
2. Transforms and aggregates revenue by product category
3. Loads into a data warehouse for reporting

**The Failure:**
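One morning, the revenue dashboard shows **NULL** where category names and revenue figures should appear. The pipeline ran successfully, but the data is unusable.

**Root Cause:** A new product was loaded with NULL in the `category` field. When the pipeline aggregated by category, `GROUP BY` put those rows into a group whose key is NULL, which the report could not label. NULLs in `quantity` and `unit_price` compounded the problem: every row-level revenue calculation that touched one returned NULL and silently dropped out of the totals.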

**Impact:**
- 📊 Reports show NULL instead of revenue
- 📈 Dashboards fail to render
- 💼 Business users can't make decisions
- 🔴 Data quality alerts trigger
- ⏰ Hours of debugging required

---

## Dataset Setup

Let's create realistic tables that demonstrate NULL issues:


In [None]:
-- Create sales table with potential NULL issues
CREATE OR REPLACE TABLE sales (
    sale_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(50),
    sale_date DATE,
    quantity INT,
    unit_price DECIMAL(10, 2),
    discount_percent DECIMAL(5, 2),
    sales_rep_id INT,
    sales_rep_name VARCHAR(100),
    region VARCHAR(50),
    notes VARCHAR(500)
);

-- Create customers table
CREATE OR REPLACE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20),
    registration_date DATE,
    status VARCHAR(20)
);

-- Create products table
CREATE OR REPLACE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    base_price DECIMAL(10, 2),
    cost DECIMAL(10, 2)
);


In [None]:
-- Insert sample data with various NULL scenarios
INSERT INTO customers VALUES
    (101, 'John Smith', 'john@email.com', '555-0101', '2023-01-15', 'active'),
    (102, 'Jane Doe', 'jane@email.com', NULL, '2023-02-20', 'active'),
    (103, 'Bob Johnson', NULL, '555-0103', '2023-03-10', 'active'),
    (104, 'Alice Brown', 'alice@email.com', '555-0104', NULL, 'active'),
    (105, NULL, 'unknown@email.com', '555-0105', '2023-05-01', 'active');

INSERT INTO products VALUES
    (201, 'Laptop Pro', 'Electronics', 999.99, 600.00),
    (202, 'Wireless Mouse', 'Electronics', 29.99, 10.00),
    (203, 'Office Chair', 'Furniture', 199.99, 100.00),
    (204, 'Desk Lamp', NULL, 49.99, 20.00),  -- NULL category
    (205, NULL, 'Electronics', 79.99, 30.00);  -- NULL product name

INSERT INTO sales VALUES
    -- Normal records
    (1, 101, 201, 'Laptop Pro', 'Electronics', '2024-01-15', 1, 999.99, 10.00, 1, 'Sarah Johnson', 'North', 'Regular sale'),
    (2, 102, 202, 'Wireless Mouse', 'Electronics', '2024-01-16', 2, 29.99, 0.00, 1, 'Sarah Johnson', 'North', NULL),
    
    -- NULL in critical fields
    (3, NULL, 203, 'Office Chair', 'Furniture', '2024-01-17', 1, 199.99, 5.00, 2, 'Mike Davis', 'South', 'Customer ID missing'),
    (4, 103, NULL, NULL, 'Electronics', '2024-01-18', 1, 999.99, 0.00, 2, 'Mike Davis', 'South', 'Product discontinued'),
    (5, 104, 204, 'Desk Lamp', NULL, '2024-01-19', 1, 49.99, 15.00, 3, 'Lisa Chen', 'East', 'New category pending'),
    
    -- NULL in calculations
    (6, 105, 201, 'Laptop Pro', 'Electronics', '2024-01-20', 1, 999.99, NULL, 3, 'Lisa Chen', 'East', 'Discount not applied'),
    (7, 101, 202, 'Wireless Mouse', 'Electronics', '2024-01-21', NULL, 29.99, 0.00, 1, 'Sarah Johnson', 'North', 'Quantity error'),
    (8, 102, 203, 'Office Chair', 'Furniture', '2024-01-22', 1, NULL, 10.00, 2, 'Mike Davis', 'South', 'Price lookup failed'),
    
    -- NULL in text fields
    (9, 103, 201, 'Laptop Pro', 'Electronics', '2024-01-23', 1, 999.99, 0.00, NULL, NULL, 'West', 'Sales rep not assigned'),
    (10, 104, 202, 'Wireless Mouse', 'Electronics', '2024-01-24', 2, 29.99, 0.00, 1, 'Sarah Johnson', NULL, 'Region data missing');


## Section 1: Understanding NULL Behavior

### How NULLs Affect Aggregations

**Key Rule:** Most aggregate functions **ignore NULLs**, but the result can still be NULL if all values are NULL.

```sql
-- COUNT(*) counts all rows (including NULLs)
-- COUNT(column) counts non-NULL values only
-- SUM, AVG, MIN, MAX ignore NULLs
-- If all values are NULL, these return NULL
```

Let's see this in action:


In [None]:
-- Example 1: COUNT behavior with NULLs
SELECT 
    COUNT(*) as total_rows,                    -- Counts all rows
    COUNT(quantity) as non_null_quantity,      -- Counts only non-NULL
    COUNT(sales_rep_id) as non_null_rep_id,    -- Counts only non-NULL
    SUM(quantity) as total_quantity,           -- Sum ignores NULLs
    AVG(unit_price) as avg_price,              -- Average ignores NULLs
    MIN(sale_date) as earliest_sale,           -- MIN ignores NULLs
    MAX(sale_date) as latest_sale              -- MAX ignores NULLs
FROM sales;
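
The "all values are NULL" case is easy to miss because it only shows up on small or filtered sets. As a sketch against the sample data: sale 6 is the only row with a NULL `discount_percent`, so restricting the query to that row leaves the aggregates with no non-NULL input.

```sql
SELECT
    COUNT(*)                            AS total_rows,          -- 1
    COUNT(discount_percent)             AS non_null_discounts,  -- 0: COUNT returns 0, never NULL
    SUM(discount_percent)               AS sum_discount,        -- NULL: no non-NULL input
    AVG(discount_percent)               AS avg_discount,        -- NULL
    COALESCE(SUM(discount_percent), 0)  AS sum_discount_safe    -- 0
FROM sales
WHERE sale_id = 6;
```

Wrapping the aggregate itself in COALESCE is what protects against this case; wrapping only the inputs changes what is being summed.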


In [None]:
-- Example 2: How NULLs break calculations
-- Calculate total revenue: quantity * unit_price * (1 - discount_percent/100)

SELECT 
    sale_id,
    quantity,
    unit_price,
    discount_percent,
    -- Returns NULL if ANY component (quantity, unit_price, discount_percent) is NULL
    quantity * unit_price * (1 - discount_percent / 100) as revenue_without_handling,
    -- Proper handling with COALESCE
    COALESCE(quantity, 0) * COALESCE(unit_price, 0) * (1 - COALESCE(discount_percent, 0) / 100) as revenue_with_handling
FROM sales
ORDER BY sale_id;


### How NULLs Break JOINs

When joining tables, NULL values in join keys can cause records to be excluded or create unexpected results.

```sql
-- NULL values in JOIN conditions don't match anything
-- Even NULL = NULL evaluates to NULL (unknown), not TRUE, so the rows never match (use IS NULL instead)
```
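
To see the comparison rule next to the NULL-safe alternatives, here is a small sketch. `IS NOT DISTINCT FROM` is standard SQL and `EQUAL_NULL` is a Snowflake function; both treat two NULLs as equal. Whether a NULL-safe join is appropriate is a business decision, since it pairs every NULL key with every other NULL key.

```sql
SELECT
    NULL = NULL                     AS plain_equals,       -- NULL (unknown), so a JOIN or WHERE treats it as no match
    NULL IS NOT DISTINCT FROM NULL  AS null_safe_equals,   -- TRUE
    EQUAL_NULL(NULL, NULL)          AS equal_null_result,  -- TRUE (Snowflake-specific)
    1 IS NOT DISTINCT FROM NULL     AS one_vs_null;        -- FALSE
```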


In [None]:
-- Example 3: JOIN with NULLs - a LEFT JOIN keeps sales with a NULL customer_id, but the customer columns come back NULL
SELECT 
    s.sale_id,
    s.customer_id,
    c.customer_name,
    s.product_name,
    s.quantity,
    s.unit_price
FROM sales s
LEFT JOIN customers c ON s.customer_id = c.customer_id
ORDER BY s.sale_id;

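-- Notice: Sale ID 3 has NULL customer_id, so no customer matches and customer_name is NULL
-- Sale ID 6 matches customer 105, whose customer_name is itself NULL in the customers table
-- This might be acceptable, but what if we need to ensure customer_id exists?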


---

## Section 2: NULL Handling Functions

### COALESCE: The Universal NULL Handler

**COALESCE** returns the first non-NULL value from a list of expressions.

**Syntax:**
```sql
COALESCE(value1, value2, value3, ..., default_value)
```

**Behavior:**
- Evaluates arguments from left to right
- Returns the first non-NULL value
- Returns NULL if all arguments are NULL
- Works with any data type, as long as all arguments have compatible types

**Use Cases:**
- Replace NULL with a default value
- Choose the first available value from multiple columns
- Provide fallback values for missing data
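
The second use case, picking the first available value from several columns, can be sketched with the sample `customers` table. The `preferred_contact` alias and the 'No contact info' default are illustrative choices, not part of the dataset.

```sql
-- Prefer email, fall back to phone, then to a fixed label
SELECT
    customer_id,
    email,
    phone,
    COALESCE(email, phone, 'No contact info') AS preferred_contact
FROM customers
ORDER BY customer_id;
```

Both columns are VARCHAR, so the arguments are type-compatible; mixing a number and a string would need an explicit cast.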


In [None]:
-- Example 4: Basic COALESCE usage
SELECT 
    sale_id,
    customer_id,
    COALESCE(customer_id, 0) as customer_id_safe,  -- Replace NULL with 0
    product_name,
    COALESCE(product_name, 'Unknown Product') as product_name_safe,  -- Replace NULL with text
    category,
    COALESCE(category, 'Uncategorized') as category_safe,
    discount_percent,
    COALESCE(discount_percent, 0) as discount_safe  -- Replace NULL with 0
FROM sales
ORDER BY sale_id;


In [None]:
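-- Example 5: COALESCE with multiple fallback values
-- Try the sales rep's name first, then a label built from the rep ID, then a default
SELECT 
    sale_id,
    sales_rep_id,
    sales_rep_name,
    COALESCE(sales_rep_name, 'Rep #' || TO_VARCHAR(sales_rep_id), 'Unassigned') as rep_name_safe
FROM sales
ORDER BY sale_id;

-- Note: COALESCE returns the first non-NULL argument, so a later fallback is used only when every earlier one is NULL.
-- A non-NULL literal such as 'Unassigned' must therefore come last; anything after it would never be reached.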


In [None]:
-- Example 6: COALESCE in calculations
-- Calculate revenue with proper NULL handling
SELECT 
    sale_id,
    quantity,
    unit_price,
    discount_percent,
    -- Safe calculation: handle NULLs at each step
    COALESCE(quantity, 0) * 
    COALESCE(unit_price, 0) * 
    (1 - COALESCE(discount_percent, 0) / 100) as revenue_calculated,
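    -- Alternative: wrap the entire expression. Not equivalent: a NULL in any one input
    -- (e.g. the missing discount on sale 6) zeroes out the whole row's revenue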
    COALESCE(
        quantity * unit_price * (1 - discount_percent / 100),
        0
    ) as revenue_alternative
FROM sales
ORDER BY sale_id;


### NVL: Oracle/Snowflake NULL Replacement

**NVL** is Oracle's original NULL handling function, also available in Snowflake.

**Syntax:**
```sql
NVL(expression, default_value)
```

**Behavior:**
- Returns `default_value` if `expression` is NULL
- Returns `expression` if it's not NULL
- Simpler than COALESCE but only handles two arguments

**Note:** 
- **Snowflake:** Supports both NVL and COALESCE (COALESCE is preferred)
- **SQL Server:** Use ISNULL() instead of NVL
- **MySQL:** Use IFNULL() or COALESCE()


In [None]:
-- Example 7: NVL usage (Snowflake/Oracle)
SELECT 
    sale_id,
    customer_id,
    NVL(customer_id, 0) as customer_id_safe,
    product_name,
    NVL(product_name, 'Unknown') as product_name_safe,
    discount_percent,
    NVL(discount_percent, 0) as discount_safe
FROM sales
ORDER BY sale_id;

-- NVL is equivalent to COALESCE(expression, default_value)
-- COALESCE is more flexible and portable across databases


### Other NULL Handling Functions

#### ISNULL (SQL Server)
```sql
ISNULL(expression, replacement_value)
```
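- SQL Server specific
- Similar to NVL in Oracle/Snowflake; the result takes the data type of the first argument, so a longer replacement string can be truncated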

#### IFNULL (MySQL)
```sql
IFNULL(expression, replacement_value)
```
- Native to MySQL; Snowflake supports it as well
- Equivalent to NVL

#### NULLIF: Convert Values to NULL
```sql
NULLIF(expression1, expression2)
```
- Returns NULL if expression1 equals expression2
- Otherwise returns expression1
- Useful for converting specific values to NULL
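
A common use of NULLIF is guarding a division: turning a zero divisor into NULL makes the result NULL instead of raising a division-by-zero error. A minimal sketch against the sample `products` table (the `margin_pct` alias is illustrative):

```sql
SELECT
    product_id,
    base_price,
    cost,
    -- NULLIF turns a zero base_price into NULL, so the division yields NULL instead of an error
    ROUND((base_price - cost) * 100 / NULLIF(base_price, 0), 2) AS margin_pct
FROM products
ORDER BY product_id;
```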

#### CASE Statement for Complex NULL Logic
```sql
CASE 
    WHEN column IS NULL THEN default_value
    WHEN condition THEN value1
    ELSE value2
END
```
- Most flexible option
- Can handle complex conditional logic


In [None]:
-- Example 8: NULLIF - Convert specific values to NULL
-- Sometimes you want to treat certain values as NULL
SELECT 
    sale_id,
    discount_percent,
    NULLIF(discount_percent, 0) as discount_or_null,  -- Convert 0 to NULL
    -- Useful when 0 has special meaning vs NULL
    CASE 
        WHEN NULLIF(discount_percent, 0) IS NULL THEN 'No discount applied'
        ELSE 'Discount: ' || discount_percent || '%'
    END as discount_status
FROM sales
ORDER BY sale_id;


In [None]:
-- Example 9: CASE statement for complex NULL handling
-- Different defaults based on column type or business rules
SELECT 
    sale_id,
    customer_id,
    CASE 
        WHEN customer_id IS NULL THEN -1  -- Use -1 for missing customer
        ELSE customer_id
    END as customer_id_handled,
    region,
    CASE 
        WHEN region IS NULL THEN 'Unknown'
        WHEN region = '' THEN 'Unknown'  -- Also handle empty strings
        ELSE region
    END as region_handled,
    discount_percent,
    CASE 
        WHEN discount_percent IS NULL THEN 0
        WHEN discount_percent < 0 THEN 0  -- Also handle invalid values
        WHEN discount_percent > 100 THEN 100  -- Cap at 100%
        ELSE discount_percent
    END as discount_handled
FROM sales
ORDER BY sale_id;


---

## Section 3: Production Scenarios - How NULLs Break Pipelines

### Scenario 1: Revenue Aggregation Failure

**The Problem:**
A daily ETL pipeline aggregates revenue by product category. One day, a new product is added with a NULL category. `GROUP BY` collects its sales into a group whose key is NULL, so the report shows a row with no category label. In addition, `SUM` silently skips any row where `quantity * unit_price` is NULL, and a group in which every row is NULL sums to NULL rather than 0.

**The Broken Query:**
```sql
-- Returns an unlabeled (NULL) category row; rows with NULL quantity or unit_price are silently skipped
SELECT 
    category,
    SUM(quantity * unit_price) as total_revenue
FROM sales
GROUP BY category;
```

**The Fix:**
```sql
-- Handle NULLs before grouping
SELECT 
    COALESCE(category, 'Uncategorized') as category,
    SUM(COALESCE(quantity, 0) * COALESCE(unit_price, 0)) as total_revenue
FROM sales
GROUP BY COALESCE(category, 'Uncategorized');
```
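
Because SUM skips NULL inputs without any warning, it can help to report how many rows were skipped alongside the total. The sketch below is one way to do that; the `rows_missing_inputs` alias is an illustrative name.

```sql
SELECT
    COALESCE(category, 'Uncategorized') AS category,
    COUNT(*) AS sale_count,
    -- COUNT(expr) counts only rows where the expression is non-NULL
    COUNT(*) - COUNT(quantity * unit_price) AS rows_missing_inputs,
    -- COALESCE outside SUM covers a group in which every row is NULL
    COALESCE(SUM(quantity * unit_price), 0) AS total_revenue
FROM sales
GROUP BY COALESCE(category, 'Uncategorized')
ORDER BY total_revenue DESC;
```

A non-zero `rows_missing_inputs` signals that the total is understated, which a bare 0 default would hide.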


In [None]:
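-- Demonstrate the problem: note the row whose category is NULL
SELECT 
    category,
    COUNT(*) as sale_count,
    SUM(quantity * unit_price) as total_revenue_broken,  -- SUM skips NULL products; NULL only if every row in the group is NULL
    SUM(COALESCE(quantity, 0) * COALESCE(unit_price, 0)) as total_revenue_fixed  -- Same totals on this data, but never NULL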
FROM sales
GROUP BY category
ORDER BY category;


In [None]:
-- The fixed version
SELECT 
    COALESCE(category, 'Uncategorized') as category,
    COUNT(*) as sale_count,
    SUM(COALESCE(quantity, 0) * COALESCE(unit_price, 0)) as total_revenue,
    AVG(COALESCE(unit_price, 0)) as avg_price  -- Missing prices count as 0 and lower the average; AVG(unit_price) would exclude them
FROM sales
GROUP BY COALESCE(category, 'Uncategorized')
ORDER BY category;


### Scenario 2: String Concatenation Failure

**The Problem:**
A pipeline creates customer full names by concatenating first and last names. When either field is NULL, the entire concatenation becomes NULL, breaking downstream processes that expect a string.

**The Broken Query:**
```sql
-- This returns NULL if first_name or last_name is NULL
-- (first_name and last_name are illustrative; the sample customers table only has customer_name)
SELECT first_name || ' ' || last_name as full_name
FROM customers;
```

**The Fix:**
```sql
-- Handle NULLs in concatenation; TRIM removes the stray space left when one part is missing
SELECT TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')) as full_name
FROM customers;
```


In [None]:
-- Demonstrate string concatenation with NULLs
SELECT 
    customer_id,
    customer_name,
    -- Broken: Returns NULL if customer_name is NULL
    customer_name || ' (ID: ' || customer_id || ')' as display_name_broken,
    -- Fixed: Handle NULLs properly
    COALESCE(customer_name, 'Unknown Customer') || ' (ID: ' || customer_id || ')' as display_name_fixed
FROM customers
ORDER BY customer_id;


### Scenario 3: JOIN Data Loss

**The Problem:**
A pipeline joins sales with customer data. Sales records with NULL customer_id are lost in an INNER JOIN, or customer_name becomes NULL in a LEFT JOIN. Downstream processes fail validation checks that require customer_name to be non-NULL.

**The Broken Query:**
```sql
-- INNER JOIN loses records with NULL customer_id
SELECT s.sale_id, c.customer_name, s.product_name
FROM sales s
INNER JOIN customers c ON s.customer_id = c.customer_id;
```

**The Fix:**
```sql
-- Use LEFT JOIN and handle NULLs
SELECT 
    s.sale_id, 
    COALESCE(c.customer_name, 'Unknown Customer') as customer_name,
    s.product_name
FROM sales s
LEFT JOIN customers c ON s.customer_id = c.customer_id;
```
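
A single 'Unknown Customer' default hides why the name is missing. As a sketch, a CASE expression can separate the three possible causes; the `customer_match_status` alias and the status labels are illustrative.

```sql
SELECT
    s.sale_id,
    s.customer_id,
    CASE
        WHEN s.customer_id IS NULL   THEN 'Missing customer_id on sale'
        WHEN c.customer_id IS NULL   THEN 'No matching customer row'
        WHEN c.customer_name IS NULL THEN 'Customer row has NULL name'
        ELSE 'OK'
    END AS customer_match_status
FROM sales s
LEFT JOIN customers c ON s.customer_id = c.customer_id
ORDER BY s.sale_id;
```

On the sample data, sale 3 falls into the first branch and sale 6 (customer 105) into the third.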


In [None]:
-- Compare INNER JOIN vs LEFT JOIN with NULL handling
-- INNER JOIN: Loses records with NULL customer_id
SELECT 
    COUNT(*) as record_count,
    'INNER JOIN' as join_type
FROM sales s
INNER JOIN customers c ON s.customer_id = c.customer_id

UNION ALL

-- LEFT JOIN: Keeps all sales records
SELECT 
    COUNT(*) as record_count,
    'LEFT JOIN' as join_type
FROM sales s
LEFT JOIN customers c ON s.customer_id = c.customer_id;


In [None]:
-- Fixed version with proper NULL handling
SELECT 
    s.sale_id,
    s.customer_id,
    COALESCE(c.customer_name, 'Unknown Customer') as customer_name,
    s.product_name,
    COALESCE(s.region, 'Unknown Region') as region,
    s.quantity,
    s.unit_price
FROM sales s
LEFT JOIN customers c ON s.customer_id = c.customer_id
ORDER BY s.sale_id;


### Scenario 4: Data Type Conversion Failure

**The Problem:**
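A pipeline converts fields from one type to another. `CAST` does not fail on NULL; it passes the NULL through, and downstream systems expecting a numeric value then fail. (A genuine conversion error occurs when a text value is not a valid number; that case needs a safe conversion function such as Snowflake's `TRY_CAST`.)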

**The Broken Query:**
```sql
-- CAST passes NULL through: a NULL input produces a NULL output
SELECT CAST(discount_percent AS INT) as discount_int
FROM sales;
```

**The Fix:**
```sql
-- Replace the NULL that CAST passes through with a default
SELECT COALESCE(CAST(discount_percent AS INT), 0) as discount_int
FROM sales;
```
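
NULL input is the benign case; text that is not a valid number is what makes CAST raise an error. The following is a sketch of a tolerant conversion, assuming Snowflake's `TRY_CAST` (which accepts string input and returns NULL when the conversion fails). The inline `raw_values` data is made up for illustration.

```sql
SELECT
    raw_value,
    TRY_CAST(raw_value AS INT)               AS parsed_value,    -- NULL for 'abc' and for NULL input
    COALESCE(TRY_CAST(raw_value AS INT), 0)  AS parsed_or_zero   -- default applied after the safe conversion
FROM (VALUES ('10'), ('abc'), (NULL)) AS raw_values (raw_value);
```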


In [None]:
-- Demonstrate type conversion with NULLs
SELECT 
    sale_id,
    discount_percent,
    -- CAST preserves NULL
    CAST(discount_percent AS INT) as discount_int,
    -- Safe conversion with default
    COALESCE(CAST(discount_percent AS INT), 0) as discount_int_safe,
    -- Round and handle NULLs
    COALESCE(ROUND(discount_percent), 0) as discount_rounded_safe
FROM sales
ORDER BY sale_id;


### Scenario 5: Window Function Failures

**The Problem:**
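Window functions like ROW_NUMBER(), RANK(), and running totals can produce unexpected results when NULLs are present. A running `SUM` skips NULL rows, so the total simply carries forward unchanged, but it is NULL for every row until the first non-NULL value appears. Functions such as LAG() and LEAD() return the NULL itself, and a running AVG divides only by the non-NULL rows.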

**The Broken Query:**
```sql
-- Running total is NULL until the first non-NULL quantity; later NULLs are skipped silently
SELECT 
    sale_id,
    quantity,
    SUM(quantity) OVER (ORDER BY sale_id) as running_total
FROM sales;
```

**The Fix:**
```sql
-- Handle NULLs before window function
SELECT 
    sale_id,
    quantity,
    SUM(COALESCE(quantity, 0)) OVER (ORDER BY sale_id) as running_total
FROM sales;
```
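
Offset functions are where NULLs are most visible in window queries. The sketch below assumes Snowflake's `IGNORE NULLS` option for `LAG`; the aliases are illustrative.

```sql
SELECT
    sale_id,
    quantity,
    -- Previous row's quantity: NULL on the first row and whenever the previous quantity is NULL
    LAG(quantity) OVER (ORDER BY sale_id) AS prev_quantity,
    -- Skip NULLs and look back to the most recent known quantity
    LAG(quantity) IGNORE NULLS OVER (ORDER BY sale_id) AS prev_known_quantity,
    -- Number of non-NULL quantities seen so far, i.e. the denominator a running AVG uses
    COUNT(quantity) OVER (ORDER BY sale_id) AS known_quantities_so_far
FROM sales
ORDER BY sale_id;
```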


In [None]:
-- Demonstrate window functions with NULLs
SELECT 
    sale_id,
    quantity,
    -- Unhandled: NULL quantities are skipped; the total is NULL only while no non-NULL value has been seen yet
    SUM(quantity) OVER (ORDER BY sale_id) as running_total_broken,
    -- Fixed: Handle NULLs before window function, so the total is never NULL
    SUM(COALESCE(quantity, 0)) OVER (ORDER BY sale_id) as running_total_fixed,
    -- Average with NULL handling (a missing price counts as 0 and lowers the average)
    AVG(COALESCE(unit_price, 0)) OVER (ORDER BY sale_id) as running_avg_price
FROM sales
ORDER BY sale_id;


---

## Section 4: Designing Fail-Safe Pipelines

### Principle 1: Validate and Handle NULLs Early

**Best Practice:** Handle NULLs as early as possible in your pipeline, ideally during the extraction or first transformation step.

**Why:**
- Prevents NULLs from propagating through multiple transformations
- Makes debugging easier (you know where NULLs are handled)
- Reduces the chance of NULLs causing failures in downstream steps

**Example:**
```sql
-- Create a staging table with NULL handling
CREATE OR REPLACE TABLE sales_staging AS
SELECT 
    sale_id,
    COALESCE(customer_id, -1) as customer_id,  -- Handle NULL early
    COALESCE(product_id, -1) as product_id,
    COALESCE(product_name, 'Unknown') as product_name,
    COALESCE(category, 'Uncategorized') as category,
    COALESCE(quantity, 0) as quantity,
    COALESCE(unit_price, 0) as unit_price,
    COALESCE(discount_percent, 0) as discount_percent,
    COALESCE(region, 'Unknown') as region
FROM sales_source;
```
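
One drawback of replacing NULLs in staging is that the default becomes indistinguishable from a real value. A possible refinement, sketched here with hypothetical names (`sales_staging_flagged` and the `*_was_null` columns), keeps a boolean flag next to each defaulted field so the original NULL can still be audited.

```sql
CREATE OR REPLACE TABLE sales_staging_flagged AS
SELECT
    sale_id,
    COALESCE(quantity, 0)                AS quantity,
    quantity IS NULL                     AS quantity_was_null,    -- TRUE where the 0 is a default, not a real value
    COALESCE(unit_price, 0)              AS unit_price,
    unit_price IS NULL                   AS unit_price_was_null,
    COALESCE(category, 'Uncategorized')  AS category,
    category IS NULL                     AS category_was_null
FROM sales;

-- Rows whose revenue inputs were defaulted
SELECT sale_id, quantity, unit_price
FROM sales_staging_flagged
WHERE quantity_was_null OR unit_price_was_null;
```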


In [None]:
-- Example: Create a fail-safe staging table
CREATE OR REPLACE TABLE sales_staging AS
SELECT 
    sale_id,
    -- Handle NULLs with appropriate defaults
    COALESCE(customer_id, -1) as customer_id,  -- Use -1 for missing customer
    COALESCE(product_id, -1) as product_id,     -- Use -1 for missing product
    COALESCE(product_name, 'Unknown Product') as product_name,
    COALESCE(category, 'Uncategorized') as category,
    sale_date,
    COALESCE(quantity, 0) as quantity,          -- Use 0 for missing quantity
    COALESCE(unit_price, 0) as unit_price,      -- Use 0 for missing price
    COALESCE(discount_percent, 0) as discount_percent,
    COALESCE(sales_rep_id, -1) as sales_rep_id,
    COALESCE(sales_rep_name, 'Unassigned') as sales_rep_name,
    COALESCE(region, 'Unknown') as region,
    -- Calculate derived fields safely
    COALESCE(quantity, 0) * COALESCE(unit_price, 0) * 
        (1 - COALESCE(discount_percent, 0) / 100) as revenue
FROM sales;

-- Verify the staging table
SELECT * FROM sales_staging ORDER BY sale_id;


### Principle 2: Use Appropriate Default Values

**Key Decision:** What default value should you use for NULLs?

**Considerations:**
- **Numeric fields:** 0, -1, or NULL? 
  - Use `0` if NULL means "no value" (e.g., quantity, price); keep in mind that a defaulted 0 is indistinguishable from a real 0 downstream
  - Use `-1` if NULL means "missing/unknown" and you need to distinguish from 0
  - Keep `NULL` if NULL has business meaning (e.g., optional discount)
  
- **Text fields:** Empty string, 'Unknown', or NULL?
  - Use `'Unknown'` or `'N/A'` for missing names/descriptions
  - Use empty string `''` only if it has no business meaning
  - Keep `NULL` if you need to distinguish between missing and empty
  
- **Date fields:** Current date, a sentinel date, or NULL?
  - Use `CURRENT_DATE` if NULL means "today"
  - Use a sentinel date (e.g., '1900-01-01') if NULL means "unknown"
  - Keep `NULL` if missing dates are valid business cases

**Example:**
```sql
-- Business-appropriate defaults
COALESCE(customer_id, -1)           -- -1 = missing customer (distinct from customer 0)
COALESCE(quantity, 0)               -- 0 = no quantity (makes sense)
COALESCE(product_name, 'Unknown')   -- 'Unknown' = missing name
COALESCE(discount_percent, 0)       -- 0 = no discount (common case)
COALESCE(region, 'Unknown')         -- 'Unknown' = missing region
```


### Principle 3: Add Data Quality Checks

**Best Practice:** Add validation queries to detect NULLs in critical fields before they cause failures.

**Create monitoring queries that:**
1. Count NULLs in critical fields
2. Alert when NULL percentage exceeds thresholds
3. Identify which records have NULLs
4. Track NULL trends over time

**Example Validation Query:**
```sql
-- Data quality check: Find NULLs in critical fields
SELECT 
    'customer_id' as field_name,
    COUNT(*) as total_rows,
    COUNT(customer_id) as non_null_count,
    COUNT(*) - COUNT(customer_id) as null_count,
    ROUND((COUNT(*) - COUNT(customer_id)) * 100.0 / COUNT(*), 2) as null_percentage
FROM sales
UNION ALL
SELECT 
    'product_id',
    COUNT(*),
    COUNT(product_id),
    COUNT(*) - COUNT(product_id),
    ROUND((COUNT(*) - COUNT(product_id)) * 100.0 / COUNT(*), 2)
FROM sales;
```
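
Items 2 and 4 of the list above (threshold alerts and trends over time) can be sketched as follows. `COUNT_IF` is a Snowflake aggregate; the 5 percent threshold and the grouping by `sale_date` are illustrative assumptions, not fixed rules.

```sql
SELECT
    sale_date,
    COUNT(*) AS total_rows,
    COUNT_IF(customer_id IS NULL) AS null_customer_ids,
    ROUND(COUNT_IF(customer_id IS NULL) * 100.0 / COUNT(*), 2) AS null_customer_pct
FROM sales
GROUP BY sale_date
-- Keep only the days that breach the threshold, ready to feed an alert
HAVING COUNT_IF(customer_id IS NULL) * 100.0 / COUNT(*) > 5
ORDER BY sale_date;
```

Removing the HAVING clause turns the same query into a per-day trend report.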


In [None]:
-- Comprehensive data quality check for NULLs
WITH null_checks AS (
    SELECT 
        'customer_id' as field_name,
        COUNT(*) as total_rows,
        COUNT(customer_id) as non_null_count,
        COUNT(*) - COUNT(customer_id) as null_count
    FROM sales
    UNION ALL
    SELECT 
        'product_id',
        COUNT(*),
        COUNT(product_id),
        COUNT(*) - COUNT(product_id)
    FROM sales
    UNION ALL
    SELECT 
        'product_name',
        COUNT(*),
        COUNT(product_name),
        COUNT(*) - COUNT(product_name)
    FROM sales
    UNION ALL
    SELECT 
        'category',
        COUNT(*),
        COUNT(category),
        COUNT(*) - COUNT(category)
    FROM sales
    UNION ALL
    SELECT 
        'quantity',
        COUNT(*),
        COUNT(quantity),
        COUNT(*) - COUNT(quantity)
    FROM sales
    UNION ALL
    SELECT 
        'unit_price',
        COUNT(*),
        COUNT(unit_price),
        COUNT(*) - COUNT(unit_price)
    FROM sales
)
SELECT 
    field_name,
    total_rows,
    non_null_count,
    null_count,
    ROUND(null_count * 100.0 / total_rows, 2) as null_percentage,
    CASE 
        WHEN null_count = 0 THEN '✅ PASS'
        WHEN null_count * 100.0 / total_rows < 5 THEN '⚠️ WARNING'
        ELSE '🔴 FAIL'
    END as status
FROM null_checks
ORDER BY null_percentage DESC;


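### Principle 4: Handle NULLs Explicitly in Aggregations

**Best Practice:** Decide explicitly how each aggregate function and each GROUP BY key should treat NULLs instead of relying on default behavior.


In [None]: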
-- Example: Aggregation with proper NULL handling
SELECT 
    COALESCE(category, 'Uncategorized') as category,
    COUNT(*) as total_sales,
    -- Sum: Handle NULLs
    SUM(COALESCE(quantity, 0)) as total_quantity,
    SUM(COALESCE(quantity, 0) * COALESCE(unit_price, 0)) as total_revenue,
    -- Average: Two approaches
    AVG(unit_price) as avg_price_excluding_nulls,  -- Excludes NULLs
    AVG(COALESCE(unit_price, 0)) as avg_price_including_nulls,  -- Treats NULLs as 0
    -- Min/Max: these already skip NULLs; a sentinel date such as '1900-01-01' would distort MIN
    MIN(sale_date) as earliest_sale,
    MAX(sale_date) as latest_sale
FROM sales
GROUP BY COALESCE(category, 'Uncategorized')
ORDER BY total_revenue DESC;


### Principle 5: Document NULL Handling Strategy

**Best Practice:** Document your NULL handling decisions in code comments and data dictionaries.

**What to Document:**
- Which fields can contain NULLs
- What default values are used and why
- Business rules for NULL handling
- Any fields where NULLs are not allowed (should be validated)

**Example:**
```sql
-- NULL Handling Strategy:
-- customer_id: NULL -> -1 (missing customer, distinct from customer 0)
-- quantity: NULL -> 0 (no quantity purchased)
-- discount_percent: NULL -> 0 (no discount applied)
-- category: NULL -> 'Uncategorized' (product not yet categorized)
-- region: NULL -> 'Unknown' (region data missing)

SELECT 
    COALESCE(customer_id, -1) as customer_id,
    COALESCE(quantity, 0) as quantity,
    COALESCE(discount_percent, 0) as discount_percent,
    COALESCE(category, 'Uncategorized') as category,
    COALESCE(region, 'Unknown') as region
FROM sales;
```
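
Comments inside a query are easy to lose, so the same decisions can also be stored in the table metadata. This is a sketch using Snowflake's `COMMENT ON COLUMN`, assuming the `sales_staging` table from Principle 1 exists; the comment wording is illustrative.

```sql
COMMENT ON COLUMN sales_staging.customer_id IS 'NULL in source is replaced with -1 (missing customer)';
COMMENT ON COLUMN sales_staging.quantity IS 'NULL in source is replaced with 0 (no quantity purchased)';
COMMENT ON COLUMN sales_staging.category IS 'NULL in source is replaced with ''Uncategorized''';

-- The stored comments are listed in the "comment" column of the output
DESCRIBE TABLE sales_staging;
```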


---

## Section 5: Complete Fail-Safe Pipeline Example

Let's build a complete example of a fail-safe revenue reporting pipeline that handles NULLs at every step.


In [None]:
-- Step 1: Create staging table with NULL handling
-- This is your first line of defense
CREATE OR REPLACE TABLE revenue_staging AS
SELECT 
    sale_id,
    -- Handle all potential NULLs with business-appropriate defaults
    COALESCE(customer_id, -1) as customer_id,
    COALESCE(product_id, -1) as product_id,
    COALESCE(product_name, 'Unknown Product') as product_name,
    COALESCE(category, 'Uncategorized') as category,
    COALESCE(sale_date, CURRENT_DATE) as sale_date,  -- Use current date if missing
    COALESCE(quantity, 0) as quantity,
    COALESCE(unit_price, 0) as unit_price,
    COALESCE(discount_percent, 0) as discount_percent,
    COALESCE(region, 'Unknown') as region,
    -- Calculate revenue safely
    COALESCE(quantity, 0) * 
    COALESCE(unit_price, 0) * 
    (1 - COALESCE(discount_percent, 0) / 100) as revenue
FROM sales;

-- Verify staging table
SELECT * FROM revenue_staging ORDER BY sale_id;


In [None]:
-- Step 2: Create final aggregated table
-- All NULLs are already handled in staging, so this is safe
CREATE OR REPLACE TABLE revenue_by_category AS
SELECT 
    category,
    region,
    COUNT(*) as sale_count,
    SUM(quantity) as total_quantity,
    SUM(revenue) as total_revenue,
    AVG(revenue) as avg_revenue_per_sale,
    MIN(sale_date) as first_sale_date,
    MAX(sale_date) as last_sale_date
FROM revenue_staging
GROUP BY category, region
ORDER BY total_revenue DESC;

-- Verify aggregated results
SELECT * FROM revenue_by_category;


In [None]:
-- Step 3: Data quality validation
-- Check that no critical fields are NULL in final table
SELECT 
    'revenue_by_category' as table_name,
    COUNT(*) as total_rows,
    SUM(CASE WHEN category IS NULL THEN 1 ELSE 0 END) as null_category_count,
    SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) as null_region_count,
    SUM(CASE WHEN total_revenue IS NULL THEN 1 ELSE 0 END) as null_revenue_count,
    CASE 
        WHEN SUM(CASE WHEN category IS NULL THEN 1 ELSE 0 END) = 0 
         AND SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) = 0
         AND SUM(CASE WHEN total_revenue IS NULL THEN 1 ELSE 0 END) = 0
        THEN '✅ PASS - No NULLs in critical fields'
        ELSE '🔴 FAIL - NULLs detected in critical fields'
    END as validation_status
FROM revenue_by_category;
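
Checking for NULLs confirms that defaults were applied, but not that every source row survived. A complementary sketch compares row counts across the three tables built in this section; the three numbers should match, and a shortfall points to rows lost in a join or filter.

```sql
SELECT
    (SELECT COUNT(*) FROM sales)                       AS source_rows,
    (SELECT COUNT(*) FROM revenue_staging)             AS staging_rows,
    (SELECT SUM(sale_count) FROM revenue_by_category)  AS aggregated_rows;
```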


---

## Section 6: Best Practices Summary

### ✅ DO's

1. **Handle NULLs early** - In staging/transformation layers, not in final queries
2. **Use COALESCE** - More portable than NVL/ISNULL across databases
3. **Choose appropriate defaults** - Match business logic (0 for quantities, 'Unknown' for names)
4. **Document your strategy** - Comment your NULL handling decisions
5. **Validate data quality** - Add checks to detect unexpected NULLs
6. **Test with NULL data** - Include NULL scenarios in your test cases
7. **Handle NULLs in JOINs** - Use LEFT JOIN and COALESCE when needed
8. **Handle NULLs in aggregations** - Explicitly handle NULLs in GROUP BY and aggregate functions

### ❌ DON'Ts

1. **Don't ignore NULLs** - They will cause failures downstream
2. **Don't use empty strings for numeric defaults** - Use 0 or -1
3. **Don't assume NULLs are excluded** - Some functions exclude them, others don't
4. **Don't compare to NULL with `=`** - In WHERE clauses use `IS NULL` or `IS NOT NULL`; `= NULL` never evaluates to TRUE
5. **Don't propagate NULLs** - Handle them as early as possible
6. **Don't use inconsistent defaults** - Use the same default value for the same field across your pipeline
7. **Don't forget window functions** - Handle NULLs before using window functions
8. **Don't skip validation** - Always validate that NULL handling worked correctly

### 🔍 Common Patterns

```sql
-- Pattern 1: Replace NULL with default
COALESCE(column, default_value)

-- Pattern 2: Multiple fallbacks
COALESCE(column1, column2, column3, 'Default')

-- Pattern 3: NULL-safe calculation
COALESCE(value1, 0) * COALESCE(value2, 0)

-- Pattern 4: NULL-safe string concatenation
COALESCE(name, '') || ' ' || COALESCE(surname, '')

-- Pattern 5: Group by with NULL handling
GROUP BY COALESCE(category, 'Uncategorized')

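-- Pattern 6: JOIN with NULL handling
-- Default the joined column in SELECT; filtering on it in WHERE would discard the unmatched rows again
SELECT table1.id, COALESCE(table2.name, 'Unknown') AS name
FROM table1
LEFT JOIN table2 ON table1.id = table2.id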
```


---

## Section 7: Database-Specific Notes

### Snowflake
- ✅ Supports `COALESCE` (preferred)
- ✅ Supports `NVL` (Oracle compatibility)
- ✅ Supports `NULLIF`
- ✅ Supports `CASE` statements
- **Recommendation:** Use `COALESCE` for portability

### SQL Server
- ✅ Supports `COALESCE` (preferred)
- ✅ Supports `ISNULL(expression, replacement)` (SQL Server specific)
- ✅ Supports `NULLIF`
- ✅ Supports `CASE` statements
- **Note:** `ISNULL` only handles 2 arguments, `COALESCE` handles multiple

### Oracle
- ✅ Supports `COALESCE`
- ✅ Supports `NVL(expression, replacement)` (original)
- ✅ Supports `NVL2(expression, value_if_not_null, value_if_null)`
- ✅ Supports `NULLIF`
- ✅ Supports `CASE` statements

### MySQL
- ✅ Supports `COALESCE`
- ✅ Supports `IFNULL(expression, replacement)` (MySQL's two-argument function; Snowflake also accepts it)
- ✅ Supports `NULLIF`
- ✅ Supports `CASE` statements

### PostgreSQL
- ✅ Supports `COALESCE` (preferred)
- ✅ Supports `NULLIF`
- ✅ Supports `CASE` statements
- **Note:** No `NVL` or `ISNULL`, use `COALESCE`

**Cross-Database Recommendation:** Use `COALESCE` for maximum portability across all databases.


---

## Section 8: Practice Exercises

### Exercise 1: Fix Broken Revenue Calculation

**Problem:** The following query returns NULL for some sales. Fix it to handle NULLs properly.

```sql
-- Broken query
SELECT 
    sale_id,
    quantity * unit_price * (1 - discount_percent / 100) as revenue
FROM sales;
```

**Your Task:** Modify the query to handle NULLs in quantity, unit_price, and discount_percent.

<details>
<summary>Click for Solution</summary>

```sql
SELECT 
    sale_id,
    COALESCE(quantity, 0) * 
    COALESCE(unit_price, 0) * 
    (1 - COALESCE(discount_percent, 0) / 100) as revenue
FROM sales;
```
</details>


### Exercise 2: Create Customer Display Names

**Problem:** Create a query that displays customer information. Handle NULLs in customer_name, email, and phone fields.

**Requirements:**
- Display format: "Customer Name (Email: email@example.com, Phone: 555-1234)"
- If customer_name is NULL, use "Unknown Customer"
- If email is NULL, use "No email"
- If phone is NULL, use "No phone"

<details>
<summary>Click for Solution</summary>

```sql
SELECT 
    customer_id,
    COALESCE(customer_name, 'Unknown Customer') || 
    ' (Email: ' || COALESCE(email, 'No email') || 
    ', Phone: ' || COALESCE(phone, 'No phone') || ')' as customer_display
FROM customers;
```
</details>


### Exercise 3: Safe Aggregation by Category

**Problem:** Create a query that aggregates sales by category, ensuring NULL categories are handled properly.

**Requirements:**
- Group by category (handle NULLs)
- Calculate total revenue, total quantity, and average price
- Ensure no NULLs appear in the results

<details>
<summary>Click for Solution</summary>

```sql
SELECT 
    COALESCE(category, 'Uncategorized') as category,
    COUNT(*) as sale_count,
    SUM(COALESCE(quantity, 0)) as total_quantity,
    SUM(COALESCE(quantity, 0) * COALESCE(unit_price, 0)) as total_revenue,
    AVG(COALESCE(unit_price, 0)) as avg_price
FROM sales
GROUP BY COALESCE(category, 'Uncategorized')
ORDER BY total_revenue DESC;
```
</details>


---

## Summary

**Key Takeaways:**

1. **NULLs are dangerous** - They can break calculations, aggregations, and downstream processes
2. **COALESCE is your friend** - Use it to handle NULLs consistently across databases
3. **Handle NULLs early** - Fix them in staging/transformation layers, not in final queries
4. **Choose appropriate defaults** - Match your business logic (0 for numbers, 'Unknown' for text)
5. **Validate your data** - Add checks to detect unexpected NULLs
6. **Document your strategy** - Make NULL handling decisions clear and consistent
7. **Test with NULLs** - Include NULL scenarios in your test cases

**Remember:** A fail-safe pipeline handles NULLs at every step, validates the results, and documents the strategy. This prevents production failures and makes debugging easier.

**Next Steps:**
- Practice using COALESCE in your queries
- Review existing pipelines for NULL handling
- Add data quality checks for NULLs
- Document your NULL handling strategy

