## WWI Data Pipeline and Dashboard

**Author:**  

Pavel Grigoryev

**Project Description:**   

Wide World Importers (WWI) is a global distributor of consumer goods, dealing with suppliers and clients worldwide.

WWI aims to make key business performance indicators more visible so that decisions can be made quickly. Currently, data is siloed in the operational database, and management spends significant time compiling reports manually, which delays insights and makes them less reliable.

**Project Goal:**  

To build an automated, end-to-end analytics pipeline and an interactive dashboard. This system will provide leadership and the sales, procurement, and logistics departments with a single source of truth for key metrics related to sales performance and delivery efficiency.

**Expected Outcome:**

The final dashboard will drastically reduce data analysis time, enable the identification of trends and anomalies, and support strategic and operational decision-making based on accurate, consolidated information.

**Data Sources:**  

The official Microsoft [Wide World Importers sample database](https://learn.microsoft.com/en-us/sql/samples/wide-world-importers-what-is?view=sql-server-ver17) (OLTP schema).

**Project Resources:**  

- [**WWI Business Performance Overview Dashboard**](https://datalens.yandex/42t45uco5jxup)

**Main Conclusions:**

- End-to-End Pipeline Delivered: Successfully built an analytics pipeline from the raw OLTP database to an optimized star-schema data mart.
- Process Automation Engineered: Designed and implemented an automated daily ETL process for incremental data updates.
- Interactive Dashboard Developed: Created a centralized, interactive dashboard for key sales and logistics performance metrics.
- Goal Achieved: The solution provides stakeholders with immediate, data-driven insights, eliminating the need for manual reporting and enabling faster, informed decision-making.

# Importing Libraries

In [1]:
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
import os
load_dotenv();

In [2]:
%load_ext sql
%config SqlMagic.displaycon = False
%config SqlMagic.feedback = False
%config SqlMagic.autopandas = True

# Database Connection

Let's create a connection string to the source (operational) database.

In [3]:
src_db_config = {
    'user': os.getenv('WWI_NEON_USER'),
    'pwd': os.getenv('WWI_NEON_PASSWORD'),
    'host': os.getenv('WWI_NEON_HOST'),
    'port': 5432, 
    'db': os.getenv('WWI_NEON_DB'),
    'sslmode': os.getenv('WWI_NEON_SSLMODE'),
    'channel_binding': os.getenv('WWI_NEON_CHANNEL_BINDING')
}

connection_string_wwi = (
    f"postgresql://{src_db_config['user']}:{src_db_config['pwd']}"
    f"@{src_db_config['host']}:{src_db_config['port']}/{src_db_config['db']}"
    f"?sslmode={src_db_config['sslmode']}"
    f"&channel_binding={src_db_config['channel_binding']}"
)

Let's create a connection string to the analytical database. It is currently empty; we will load data into it later.

In [4]:
analytics_db_config = {
    'user': os.getenv('WWI_ANALYTICS_NEON_USER'),
    'pwd': os.getenv('WWI_ANALYTICS_NEON_PASSWORD'),
    'host': os.getenv('WWI_ANALYTICS_NEON_HOST'),
    'port': 5432, 
    'db': os.getenv('WWI_ANALYTICS_NEON_DB'),
    'sslmode': os.getenv('WWI_ANALYTICS_NEON_SSLMODE'),
    'channel_binding': os.getenv('WWI_ANALYTICS_NEON_CHANNEL_BINDING')
}

connection_string_wwi_analytics = (
    f"postgresql://{analytics_db_config['user']}:{analytics_db_config['pwd']}"
    f"@{analytics_db_config['host']}:{analytics_db_config['port']}/{analytics_db_config['db']}"
    f"?sslmode={analytics_db_config['sslmode']}"
    f"&channel_binding={analytics_db_config['channel_binding']}"
)
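
Both connection strings above interpolate the credentials directly into the URL. If a password contains characters such as `@` or `/`, the URL may be parsed incorrectly. The sketch below shows one possible way to guard against that by percent-encoding the user name and password with the standard library. The helper name `build_pg_url` and the example values are hypothetical and are not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_pg_url(cfg):
    """Build a PostgreSQL URL from a config dict shaped like src_db_config."""
    # Percent-encode credentials so characters such as '@' or '/' cannot break the URL
    user = quote_plus(cfg['user'])
    pwd = quote_plus(cfg['pwd'])
    return (
        f"postgresql://{user}:{pwd}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['db']}"
        f"?sslmode={cfg['sslmode']}"
        f"&channel_binding={cfg['channel_binding']}"
    )


# Hypothetical values for illustration only (not real credentials or hosts)
example_cfg = {
    'user': 'analyst',
    'pwd': 'p@ss/word',
    'host': 'example-host',
    'port': 5432,
    'db': 'wwi',
    'sslmode': 'require',
    'channel_binding': 'require',
}
example_url = build_pg_url(example_cfg)
print(example_url)
```

With a helper like this, the same function could build both the source and the analytical connection strings from their respective config dictionaries.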

For the analytical database, we will need an engine; let's create it now.

In [5]:
wwi_analytics_engine = create_engine(
    f"postgresql://{analytics_db_config['user']}:{analytics_db_config['pwd']}"
    f"@{analytics_db_config['host']}:{analytics_db_config['port']}/{analytics_db_config['db']}"
    , connect_args={
        'sslmode': os.getenv('WWI_ANALYTICS_NEON_SSLMODE'),
        'channel_binding': os.getenv('WWI_ANALYTICS_NEON_CHANNEL_BINDING')
    }
)

Let's create a function to connect to the required database.

In [6]:
def con(db='src'):
    """
    Connects to the specified database using SQL magic.
    
    Parameters:
    -----------    
    db (str): Database to connect to - 'src' for source or 'dst' for destination.
                Defaults to 'src'.
    """
    if db == 'src':
        get_ipython().run_line_magic('sql', '$connection_string_wwi')
        print("Connected to src")
    elif db == 'dst':
        get_ipython().run_line_magic('sql', '$connection_string_wwi_analytics')
        print("Connected to dst")
    else:
        raise ValueError(f"Unknown database identifier: '{db}'. Use 'src' or 'dst'")

Let's create a function to close the connection.

In [7]:
def con_close(db='src'):
    """
    Closes the database connection for the specified database.
    
    Parameters:
    -----------
    db (str): Database to close - 'src' for source or 'dst' for destination.
                Defaults to 'src'.
    """
    # Get expected hosts from environment
    expected_hosts = {
        'src': os.getenv('WWI_NEON_HOST'),
        'dst': os.getenv('WWI_ANALYTICS_NEON_HOST')
    }
    
    if db not in expected_hosts:
        raise ValueError(f"Unknown db '{db}'. Use 'src' or 'dst'")
    
    # Get active connections
    connections = get_ipython().run_line_magic('sql', '--connections')
    
    # Find matching connection
    for url in connections.keys():
        if expected_hosts[db] in url:
            get_ipython().run_line_magic('sql', f'--close {url}')
            print(f"Connection closed for {db}")
            return
    
    print(f"No active connection found for {db}")

Let's open connections to both databases for work.

In [8]:
con('src')

Connected to src


In [9]:
con('dst')

Connected to dst


# Data Description and Exploration


## Data Description


Wide World Importers (WWI) is a wholesale importer and distributor operating in the San Francisco Bay Area.

WWI's customers are primarily companies that resell goods to individuals. WWI sells to retail customers across the United States, including specialty stores, supermarkets, computer stores, and some individuals. WWI also sells to other wholesalers through a network of agents who promote products on behalf of WWI.

WWI purchases goods from suppliers. They store the goods in their WWI warehouse and reorder from suppliers as needed to fulfill customer orders. They also purchase large volumes of packaging materials and sell them in smaller quantities for customer convenience.

The WWI database contains several schemas. Our analysis uses the ones described below.

### Sales Schema


Data on product sales to customers.

<img src="assets/er_sales.png" alt="">

We will need the following tables and fields in this schema:

**sales.orders**

Field | Description
-|-
order_id | Order ID.
customer_id | ID of the customer who placed the order.
order_date | Date the order was created.
expected_delivery_date | Expected delivery date of the order.
picking_completed_when | Time when order picking was completed.

**sales.order_lines**

Field | Description
-|-
order_line_id | Order line ID.
order_id | Order ID to which this line belongs.
stock_item_id | ID of the stock item (from the warehouse.stock_items table) specified in the order line.
package_type_id | ID of the package type (from the warehouse.package_types table) used for the item.
quantity | Quantity of the item to be supplied.
unit_price | Price per unit of the item.
tax_rate | Tax rate applied to the item.
picked_quantity | Quantity of the item that was picked from the warehouse.
picking_completed_when | Time when picking for this order line was completed.

**sales.customer_categories**

Field | Description
-|-
customer_category_id | Customer category ID.
customer_category_name | Full name of the category to which customers can be assigned.

**sales.customers**

Field | Description
-|-
customer_id | Customer ID.
customer_name | Full name of the customer (usually the trade name).
customer_category_id | ID of the customer's category.
delivery_method_id | ID of the standard delivery method for goods shipped to this customer.
delivery_city_id | ID of the delivery city for this address.

**sales.invoices**

Field | Description
-|-
invoice_id | Invoice ID.
customer_id | ID of the customer to whom the invoice is issued.
order_id | ID of the order associated with this invoice.
delivery_method_id | ID of the delivery method for the goods listed in the invoice.
invoice_date | Date when the invoice was issued.
confirmed_delivery_time | Confirmed delivery time.

**sales.invoice_lines**

Field | Description
-|-
invoice_line_id | Invoice line ID.
invoice_id | ID of the invoice to which this line belongs.
stock_item_id | ID of the stock item (from the warehouse.stock_items table) specified in the invoice line.
package_type_id | ID of the package type (from the warehouse.package_types table) used for the item.
quantity | Quantity of the item specified in the invoice line.
unit_price | Price per unit of the item.
tax_rate | Tax rate applied to the item.
tax_amount | Tax amount calculated for the invoice line.
line_profit | Profit earned from this invoice line, based on the current cost price.
extended_price | Total price of the invoice line, including tax ($\text{quantity} \times \text{unit\_price} + \text{tax\_amount}$).
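
As a worked example of the extended_price formula, the sketch below uses illustrative values (quantity 10, unit price 230.00, tax rate 15%). It assumes that tax_amount equals the pre-tax line total multiplied by the tax rate; that rule is an assumption made for this example and is not stated in the table above.

```python
from decimal import Decimal

# Illustrative values for one invoice line
quantity = Decimal("10")
unit_price = Decimal("230.00")
tax_rate = Decimal("15.0")  # percent

# Assumption: tax_amount is the pre-tax line total multiplied by the tax rate
tax_amount = (quantity * unit_price * tax_rate / Decimal("100")).quantize(Decimal("0.01"))

# Formula from the field description: quantity * unit_price + tax_amount
extended_price = quantity * unit_price + tax_amount

print(tax_amount, extended_price)
```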

**sales.customer_transactions**

Field | Description
-|-
customer_transaction_id | Transaction ID.
customer_id | ID of the customer associated with this transaction.
transaction_type_id | ID of the transaction type (e.g., invoice, payment, credit note).
invoice_id | ID of the invoice associated with this transaction (if applicable).
payment_method_id | ID of the payment method (e.g., cash, bank transfer).
transaction_date | Transaction date.
amount_excluding_tax | Transaction amount excluding tax.
tax_amount | Tax amount calculated for the transaction.
transaction_amount | Total transaction amount (including tax).
outstanding_balance | Amount still unpaid for this transaction. Indicates the outstanding debt for the transaction.
finalization_date | Date when the transaction was finalized (if finalized).


### Application Schema

Reference data and system settings.

<img src="assets/er_application.png" alt="">

We will need the following tables and fields in this schema:

**application.countries**

Field | Description
-|-
country_id | Country ID
country_name | Country name

**application.state_provinces**

Field | Description
-|-
state_province_id | State or province ID
state_province_name | Official name of the state or province
country_id | Country for this state or province

**application.cities**

Field | Description
-|-
city_id | City ID
city_name | Official name of the city
state_province_id | State or province for this city

**application.delivery_methods**

Field | Description
-|-
delivery_method_id | Delivery method ID
delivery_method_name | Delivery method name

**application.payment_methods**

Field | Description
-|-
payment_method_id | Payment method ID
payment_method_name | Payment method name

**application.transaction_types**

Field | Description
-|-
transaction_type_id | Transaction type ID in the database
transaction_type_name | Full name of the transaction type

### Warehouse Schema

Data on inventory and warehouse operations.

<img src="assets/er_warehouse.png" alt="">

We will need the following tables and fields in this schema:

**warehouse.stock_items**

Field | Description
-|-
stock_item_id | Stock item ID
stock_item_name | Full name of the stock item
color_id | Color ID of the item
size | Item size

**warehouse.stock_item_stock_groups**

Field | Description
-|-
stock_item_stock_group_id | Record ID in the table (this is a junction table)
stock_item_id | Stock item ID
stock_group_id | Stock group ID

**warehouse.stock_groups**

Field | Description
-|-
stock_group_id | Stock group ID
stock_group_name | Stock group name

**warehouse.package_types**

Field | Description
-|-
package_type_id | Package type ID
package_type_name | Full name of the package type

**warehouse.colors**

Field | Description
-|-
color_id | Color ID
color_name | Color name

## Creating Functions

Let's switch to the source database.

In [10]:
con('src')

Connected to src


Let's create a function that returns summary statistics for a single column: row counts, basic descriptive statistics, and the most frequent values.

In [11]:
%%sql
CREATE OR REPLACE FUNCTION get_column_summary(
    table_name TEXT,
    column_name TEXT,
    only_summary BOOLEAN DEFAULT FALSE
)
RETURNS TABLE(
    "Summary Type" TEXT,
    "Summary Count" TEXT,
    "-" TEXT,
    "Stats Type" TEXT,
    "Stats Value" TEXT,
    "--" TEXT,
    "Top Values" TEXT
) AS $$
DECLARE
    sql_query TEXT;
    schema_name TEXT;
    table_only_name TEXT;    
    column_type TEXT;
    is_numeric BOOLEAN;    
BEGIN
    IF strpos(table_name, '.') > 0 THEN
        schema_name := split_part(table_name, '.', 1);
        table_only_name := split_part(table_name, '.', 2);
    ELSE
        schema_name := 'public';
        table_only_name := table_name;
    END IF;
    -- Look up the data type of the column
    EXECUTE format('SELECT data_type FROM information_schema.columns 
                   WHERE table_schema = %L AND table_name = %L AND column_name = %L', 
                   schema_name, table_only_name, column_name)
    INTO column_type;    
    -- Check whether the type is numeric
    is_numeric := column_type IN ('smallint', 'integer', 'bigint', 'decimal', 'numeric', 'real', 'double precision');    

    sql_query := 
        'WITH 
        column_summary AS (
            SELECT 
                *
                , row_number() OVER () AS dummy_id
            FROM (        
                SELECT 
                    ''Total Count'' AS summary_1, COUNT(*) AS summary_2
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                UNION ALL
                SELECT
                    ''Unique Count'', COUNT(DISTINCT ' || quote_ident(column_name) || ')
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                UNION ALL
                SELECT
                    ''Missing'', COUNT(*) FILTER (WHERE ' || quote_ident(column_name) || ' IS NULL)
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                UNION ALL
                SELECT
                    ''Duplicated'', COUNT(' || quote_ident(column_name) || ') - COUNT(DISTINCT ' || quote_ident(column_name) || ')
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                ';
    IF is_numeric THEN
        sql_query := sql_query || '                
                UNION ALL
                SELECT
                    ''Zero'', COUNT(' || quote_ident(column_name) || ') FILTER (WHERE ' || quote_ident(column_name) || ' = 0)
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                UNION ALL
                SELECT
                    ''Negative'', COUNT(' || quote_ident(column_name) || ') FILTER (WHERE ' || quote_ident(column_name) || ' < 0)
                FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                ';
    ELSE
        sql_query := sql_query || '
                UNION ALL
                SELECT
                    ''Zero'', NULL 
                UNION ALL
                SELECT
                    ''Negative'', NULL
                ';
    END IF;      
    sql_query := sql_query || '              
            ) AS t
        )';

IF NOT only_summary THEN
    IF is_numeric THEN
        sql_query := sql_query || '
            , column_stats AS (
                SELECT 
                    *
                    , row_number() OVER () AS dummy_id
                FROM (           
                    SELECT 
                        ''Max'' AS stats_1, MAX(' || quote_ident(column_name) || ') AS stats_2
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '    
                    UNION ALL
                    SELECT
                        ''75%'', PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY ' || quote_ident(column_name) || ')
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                    UNION ALL    
                    SELECT
                        ''Mean'', ROUND(AVG(' || quote_ident(column_name) || '), 2)
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                    UNION ALL
                    SELECT
                        ''Median'', PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY ' || quote_ident(column_name) || ')
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                    UNION ALL
                    SELECT
                        ''25%'', PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY ' || quote_ident(column_name) || ')
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                    UNION ALL
                    SELECT
                        ''Min'', MIN(' || quote_ident(column_name) || ')
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '   
                ) AS t
            )';
    ELSE
        sql_query := sql_query || '
            , column_stats AS (
                SELECT 
                    *
                    , row_number() OVER () AS dummy_id
                FROM (           
                    SELECT 
                        ''Max'' AS stats_1, MAX(' || quote_ident(column_name) || ') AS stats_2
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '    
                    UNION ALL
                    SELECT
                        ''75%'', NULL AS stats_2
                    UNION ALL    
                    SELECT
                        ''Mean'', NULL AS stats_2
                    UNION ALL
                    SELECT
                        ''Median'', NULL AS stats_2
                    UNION ALL
                    SELECT
                        ''25%'', NULL AS stats_2
                    UNION ALL
                    SELECT
                        ''Min'', MIN(' || quote_ident(column_name) || ')
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '   
                ) AS t
            )';
    END IF;
    
    sql_query := sql_query || '
            , top_values AS (
                SELECT 
                    *
                    , row_number() OVER () AS dummy_id
                FROM (   
                    SELECT 
                        ' || quote_ident(column_name) || '::TEXT || '' ('' || COUNT(*)::TEXT || '')'' AS top_count
                    FROM ' || quote_ident(schema_name) || '.' || quote_ident(table_only_name) || '
                    WHERE ' || quote_ident(column_name) || ' IS NOT NULL
                    GROUP BY ' || quote_ident(column_name) || '
                    ORDER BY COUNT(*) DESC, ' || quote_ident(column_name) || '::TEXT
                    LIMIT 6
                ) AS t
            )';
END IF;
    IF only_summary THEN
        sql_query := sql_query || '
                SELECT
                    summary_1::TEXT AS "Type",
                    summary_2::TEXT AS "Count",
                    '' '' AS " ",
                    '' '' AS " ",
                    '' '' AS " ",
                    '' '' AS " ",
                    '' '' AS " "
                FROM
                    column_summary;'; 
    ELSE
        sql_query := sql_query || '
                SELECT
                    summary_1::TEXT AS "Type",
                    summary_2::TEXT AS "Count",
                    '' '' AS " ",
                    stats_1::TEXT AS "Type",
                    stats_2::TEXT AS "Value",
                    '' '' AS " ",
                    top_count::TEXT AS "Top Values"
                FROM
                    column_summary
                    LEFT JOIN column_stats USING(dummy_id)
                    LEFT JOIN top_values USING(dummy_id);';
    END IF;
    RETURN QUERY EXECUTE sql_query;
END;
$$ LANGUAGE plpgsql;
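
To make the meaning of the summary counts explicit, here is a minimal pure-Python sketch of the same definitions applied to a list of values, where `None` stands for NULL. The function name `column_summary` is hypothetical. Unlike the SQL function, which decides whether a column is numeric from its declared type, this sketch infers it from the values themselves.

```python
def column_summary(values):
    """Compute the summary counts for one column given as a list (None = NULL)."""
    non_null = [v for v in values if v is not None]
    # Treat the column as numeric only if every non-null value is an int or float
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    distinct = set(non_null)
    return {
        'Total Count': len(values),
        'Unique Count': len(distinct),
        'Missing': len(values) - len(non_null),
        # Duplicated = non-null values minus distinct values
        'Duplicated': len(non_null) - len(distinct),
        'Zero': sum(v == 0 for v in non_null) if is_numeric else None,
        'Negative': sum(v < 0 for v in non_null) if is_numeric else None,
    }


sample = [5, 5, 0, -2, None, 7]
summary = column_summary(sample)
print(summary)
```

Note that "Duplicated" counts surplus repetitions, not the number of distinct values that repeat: a value appearing three times contributes 2.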

Let's create a function that will show information about the relationship between tables.

In [12]:
%%sql
CREATE OR REPLACE FUNCTION analyze_relationship(
    left_table_name TEXT,
    right_table_name TEXT,
    left_key_name TEXT,
    right_key_name TEXT
)
RETURNS TABLE (
    relationship_type TEXT,
    left_only_keys BIGINT,
    right_only_keys BIGINT,
    left_size BIGINT,
    right_size BIGINT,
    common_keys BIGINT
) AS $$
DECLARE
    left_count BIGINT;
    right_count BIGINT;
    left_distinct_count BIGINT;
    right_distinct_count BIGINT;
    common_count BIGINT;
    left_only_count BIGINT;
    right_only_count BIGINT;
    max_right_per_left_val BIGINT;
    max_left_per_right_val BIGINT;
    rel_type TEXT;
BEGIN
    -- Basic counts
    EXECUTE format('SELECT COUNT(DISTINCT %I), COUNT(*) FROM %s', 
                  left_key_name, left_table_name)
    INTO left_distinct_count, left_count;
    
    EXECUTE format('SELECT COUNT(DISTINCT %I), COUNT(*) FROM %s', 
                  right_key_name, right_table_name)
    INTO right_distinct_count, right_count;
    
    -- Common keys 
    EXECUTE format('
        SELECT COUNT(DISTINCT l.%I) 
        FROM %s l
        WHERE EXISTS (SELECT 1 FROM %s r WHERE r.%I = l.%I)',
        left_key_name, left_table_name, right_table_name, right_key_name, left_key_name)
    INTO common_count;
    
    -- Keys only in left/right
    left_only_count := left_distinct_count - common_count;
    right_only_count := right_distinct_count - common_count;
    
    -- Max right per left key
    EXECUTE format('
        SELECT COALESCE(MAX(cnt), 0) FROM (
            SELECT COUNT(r.%I) as cnt
            FROM %s r
            WHERE r.%I IN (SELECT %I FROM %s)
            GROUP BY r.%I
        ) t',
        right_key_name, right_table_name, right_key_name, left_key_name, left_table_name, right_key_name)
    INTO max_right_per_left_val;
    
    -- Max left per right key 
    EXECUTE format('
        SELECT COALESCE(MAX(cnt), 0) FROM (
            SELECT COUNT(l.%I) as cnt
            FROM %s l
            WHERE l.%I IN (SELECT %I FROM %s)
            GROUP BY l.%I
        ) t',
        left_key_name, left_table_name, left_key_name, right_key_name, right_table_name, left_key_name)
    INTO max_left_per_right_val;
    
    -- Determine relationship type
    IF common_count = 0 THEN
        -- No shared keys: check this first, otherwise both maxima are 0 and the 1:1 branch would match
        rel_type := 'no_relation';
    ELSIF max_right_per_left_val <= 1 AND max_left_per_right_val <= 1 THEN
        rel_type := '1:1';
    ELSIF max_left_per_right_val > 1 AND max_right_per_left_val <= 1 THEN
        rel_type := 'N:1';
    ELSIF max_left_per_right_val <= 1 AND max_right_per_left_val > 1 THEN
        rel_type := '1:N';
    ELSE
        rel_type := 'N:M';
    END IF;
    
    RETURN QUERY SELECT 
        rel_type,
        left_only_count,
        right_only_count,
        left_count,
        right_count,
        common_count;
END;
$$ LANGUAGE plpgsql;
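
The classification rule can be illustrated with a small pure-Python sketch that works on two lists of key values instead of tables. The function name `classify_relationship` is hypothetical. The sketch tests for the absence of common keys before the cardinality checks, so disjoint key sets are reported as `no_relation`.

```python
from collections import Counter


def classify_relationship(left_keys, right_keys):
    """Classify the key relationship between two tables given their key columns."""
    # Count rows per key on each side, ignoring NULLs (None)
    left_counts = Counter(k for k in left_keys if k is not None)
    right_counts = Counter(k for k in right_keys if k is not None)
    common = left_counts.keys() & right_counts.keys()

    if not common:
        return 'no_relation'

    # Largest number of rows sharing one common key on each side
    max_right_per_left = max(right_counts[k] for k in common)
    max_left_per_right = max(left_counts[k] for k in common)

    if max_right_per_left <= 1 and max_left_per_right <= 1:
        return '1:1'
    if max_left_per_right > 1 and max_right_per_left <= 1:
        return 'N:1'
    if max_left_per_right <= 1 and max_right_per_left > 1:
        return '1:N'
    return 'N:M'


# Many left rows per key, one right row per key (e.g. orders -> customers)
print(classify_relationship([1, 1, 2], [1, 2, 3]))
```

Read the result from left to right: `N:1` means that several rows of the left table can point to a single row of the right table.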

## Data Exploration


Before developing the dashboard, let's explore the necessary tables and fields, as well as the relationships between them.

### Variable Exploration


#### Sales Schema

##### Table sales.orders

Let's look at the rows.

In [13]:
%%sql
SELECT
    *
FROM
    sales.orders
LIMIT 5

Unnamed: 0,order_id,customer_id,salesperson_person_id,picked_by_person_id,contact_person_id,backorder_order_id,order_date,expected_delivery_date,customer_purchase_order_number,is_undersupply_backordered,comments,delivery_instructions,internal_comments,picking_completed_when,last_edited_by,last_edited_when
0,2,803,8,,3003,46.0,2013-01-01,2013-01-02,15342,True,,,,2013-01-01 12:00:00,7,2013-01-01 12:00:00
1,3,105,7,,1209,47.0,2013-01-01,2013-01-02,12211,True,,,,2013-01-01 12:00:00,7,2013-01-01 12:00:00
2,4,57,16,3.0,1113,,2013-01-01,2013-01-02,17129,True,,,,2013-01-01 11:00:00,3,2013-01-01 11:00:00
3,5,905,3,,3105,48.0,2013-01-01,2013-01-02,10369,True,,,,2013-01-01 12:00:00,7,2013-01-01 12:00:00
4,6,976,13,3.0,3176,,2013-01-01,2013-01-02,13383,True,,,,2013-01-01 11:00:00,3,2013-01-01 11:00:00


Let's examine, one by one, each column we will use in the dashboard.

**order_id**

In [14]:
%%sql
SELECT * FROM get_column_summary('sales.orders', 'order_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Unique Count,73595,,,,,
2,Zero,0,,,,,
3,Total Count,73595,,,,,
4,Duplicated,0,,,,,
5,Negative,0,,,,,


**customer_id**

In [15]:
%%sql
SELECT * FROM get_column_summary('sales.orders', 'customer_id', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,Max,1061.0,,90 (150)
1,Unique Count,663,,75%,877.0,,831 (147)
2,Zero,0,,Mean,528.79,,968 (146)
3,Duplicated,72932,,Median,518.0,,405 (145)
4,Negative,0,,25%,160.0,,804 (145)
5,Total Count,73595,,Min,1.0,,143 (144)


**order_date**

In [16]:
%%sql
SELECT * FROM get_column_summary('sales.orders', 'order_date', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2016-01-06 (133)
1,Duplicated,72526.0,,Median,,,2015-10-19 (127)
2,Total Count,73595.0,,Mean,,,2015-07-06 (126)
3,Zero,,,75%,,,2015-02-03 (125)
4,Unique Count,1069.0,,Min,2013-01-01,,2016-04-28 (123)
5,Missing,0.0,,Max,2016-05-31,,2015-02-23 (122)


**expected_delivery_date**

In [17]:
%%sql
SELECT * FROM get_column_summary('sales.orders', 'expected_delivery_date', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2015-03-02 (173)
1,Unique Count,891.0,,Median,,,2015-04-06 (167)
2,Total Count,73595.0,,Mean,,,2015-03-30 (164)
3,Zero,,,75%,,,2015-09-14 (164)
4,Duplicated,72704.0,,Min,2013-01-02,,2014-04-28 (163)
5,Missing,0.0,,Max,2016-06-01,,2016-04-11 (163)


**picking_completed_when**

In [18]:
%%sql
SELECT * FROM get_column_summary('sales.orders', 'picking_completed_when', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2016-02-26 11:00:00 (107)
1,Duplicated,68386.0,,Median,,,2016-04-18 11:00:00 (107)
2,Total Count,73595.0,,Mean,,,2016-01-07 11:00:00 (106)
3,Zero,,,75%,,,2016-05-04 11:00:00 (106)
4,Unique Count,2124.0,,Min,2013-01-01 11:00:00,,2016-02-24 11:00:00 (104)
5,Missing,3085.0,,Max,2016-05-31 12:00:00,,2015-01-21 11:00:00 (103)


Let's look at rows with missing values in picking_completed_when.

In [19]:
%%sql
SELECT
    *
FROM
    sales.orders
WHERE
    picking_completed_when is NULL
LIMIT 5

Unnamed: 0,order_id,customer_id,salesperson_person_id,picked_by_person_id,contact_person_id,backorder_order_id,order_date,expected_delivery_date,customer_purchase_order_number,is_undersupply_backordered,comments,delivery_instructions,internal_comments,picking_completed_when,last_edited_by,last_edited_when
0,694,430,13,,2059,,2013-01-12,2013-01-14,18641,True,,,,,5,2013-01-12 12:00:00
1,858,197,13,,1393,,2013-01-15,2013-01-16,17999,True,,,,,9,2013-01-15 12:00:00
2,863,538,2,,2275,,2013-01-15,2013-01-16,13574,True,,,,,9,2013-01-15 12:00:00
3,865,926,2,,3126,,2013-01-15,2013-01-16,18066,True,,,,,9,2013-01-15 12:00:00
4,1065,70,20,,1139,,2013-01-19,2013-01-21,12157,True,,,,,19,2013-01-19 12:00:00


**Key Observations:**  

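- The picking_completed_when column has 3,085 missing values. In the sampled rows, picked_by_person_id is missing as well, so these orders were most likely never picked, possibly because the items were out of stock. Note that picked_by_person_id is also empty for some picked orders in the first preview, so it is not a reliable indicator on its own.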
- No critical anomalies were found.
- The sales.orders table contains order data from 2013-01-01 to 2016-05-31.
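
The link between the two columns was checked above only on a five-row sample. A possible way to check it across a whole table is sketched below. To keep the example self-contained, it loads the ten rows shown in the two previews above into an in-memory SQLite table named `orders`, which stands in for `sales.orders`; the same aggregate query could be adapted for the source database.

```python
import sqlite3

# (order_id, picked_by_person_id, picking_completed_when) from the two previews above
rows = [
    (2, None, '2013-01-01 12:00:00'),
    (3, None, '2013-01-01 12:00:00'),
    (4, 3, '2013-01-01 11:00:00'),
    (5, None, '2013-01-01 12:00:00'),
    (6, 3, '2013-01-01 11:00:00'),
    (694, None, None),
    (858, None, None),
    (863, None, None),
    (865, None, None),
    (1065, None, None),
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE orders ('
    'order_id INTEGER, picked_by_person_id INTEGER, picking_completed_when TEXT)'
)
conn.executemany('INSERT INTO orders VALUES (?, ?, ?)', rows)

check_sql = """
SELECT
    SUM(CASE WHEN picking_completed_when IS NULL THEN 1 ELSE 0 END)
    , SUM(CASE WHEN picking_completed_when IS NULL
               AND picked_by_person_id IS NULL THEN 1 ELSE 0 END)
    , SUM(CASE WHEN picking_completed_when IS NOT NULL
               AND picked_by_person_id IS NULL THEN 1 ELSE 0 END)
FROM orders
"""
unpicked, unpicked_without_picker, picked_without_picker = conn.execute(check_sql).fetchone()
conn.close()

print(unpicked, unpicked_without_picker, picked_without_picker)
```

If the first two counts are equal, every unpicked order also lacks a picker. A non-zero third count means that a missing picked_by_person_id alone does not identify unpicked orders.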

##### Table sales.order_lines

Let's look at the rows.

In [20]:
%%sql
SELECT
    *
FROM
    sales.order_lines
LIMIT 5

Unnamed: 0,order_line_id,order_id,stock_item_id,description,package_type_id,quantity,unit_price,tax_rate,picked_quantity,picking_completed_when,last_edited_by,last_edited_when
0,1,45,164,32 mm Double sided bubble wrap 50m,7,50,112.0,15.0,50,2013-01-02 11:00:00,4,2013-01-02 11:00:00
1,2,1,67,Ride on toy sedan car (Black) 1/12 scale,7,10,230.0,15.0,10,2013-01-01 11:00:00,3,2013-01-01 11:00:00
2,3,2,50,Developer joke mug - old C developers never di...,7,9,13.0,15.0,9,2013-01-01 11:00:00,3,2013-01-01 11:00:00
3,4,46,89,"""The Gu"" red shirt XML tag t-shirt (Black) 3XS",7,72,18.0,15.0,72,2013-01-02 11:00:00,4,2013-01-02 11:00:00
4,5,46,171,32 mm Anti static bubble wrap (Blue) 10m,7,90,32.0,15.0,90,2013-01-02 11:00:00,4,2013-01-02 11:00:00


Let's examine, one by one, each column we will use in the dashboard.

**order_line_id**

In [21]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'order_line_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Unique Count,231412,,,,,
2,Duplicated,0,,,,,
3,Total Count,231412,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**order_id**

In [22]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'order_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,157817,,,,,
5,Unique Count,73595,,,,,


**stock_item_id**

In [23]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'stock_item_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,231185,,,,,
5,Unique Count,227,,,,,


**description**

In [24]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'description', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,,,,
1,Zero,,,,,,
2,Total Count,231412.0,,,,,
3,Missing,0.0,,,,,
4,Duplicated,231185.0,,,,,
5,Unique Count,227.0,,,,,


**package_type_id**

In [25]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'package_type_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Duplicated,231408,,,,,
4,Unique Count,4,,,,,
5,Zero,0,,,,,


**quantity**

In [26]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'quantity', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,Max,360.0,,10 (15799)
1,Unique Count,61,,75%,60.0,,5 (12876)
2,Zero,0,,Mean,40.24,,1 (12716)
3,Missing,0,,Median,10.0,,8 (12701)
4,Duplicated,231351,,25%,5.0,,7 (12681)
5,Negative,0,,Min,1.0,,2 (12654)


**unit_price**

In [27]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'unit_price', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,Max,1899.0,,13.00 (44577)
1,Unique Count,62,,75%,32.0,,18.00 (36536)
2,Zero,0,,Mean,45.21,,32.00 (35575)
3,Missing,0,,Median,18.0,,25.00 (13553)
4,Duplicated,231350,,25%,13.0,,30.00 (7307)
5,Negative,0,,Min,0.66,,4.10 (7290)


**tax_rate**

In [28]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'tax_rate', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,Max,15.0,,15.000 (230376)
1,Duplicated,231410,,75%,15.0,,10.000 (1036)
2,Zero,0,,Mean,14.98,,
3,Missing,0,,Median,15.0,,
4,Unique Count,2,,25%,15.0,,
5,Negative,0,,Min,10.0,,


**picked_quantity**

In [29]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'picked_quantity', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,231412,,Max,360.0,,10 (15799)
1,Unique Count,62,,75%,60.0,,5 (12876)
2,Zero,3147,,Mean,38.68,,1 (12716)
3,Missing,0,,Median,9.0,,8 (12701)
4,Duplicated,231350,,25%,5.0,,7 (12681)
5,Negative,0,,Min,0.0,,2 (12654)


**picking_completed_when**

In [30]:
%%sql
SELECT * FROM get_column_summary('sales.order_lines', 'picking_completed_when', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2016-05-04 11:00:00 (395)
1,Zero,,,Median,,,2015-01-21 11:00:00 (390)
2,Total Count,231412.0,,Mean,,,2015-11-24 11:00:00 (388)
3,Duplicated,227196.0,,75%,,,2015-06-26 11:00:00 (386)
4,Unique Count,1069.0,,Min,2013-01-01 11:00:00,,2016-03-23 11:00:00 (386)
5,Missing,3147.0,,Max,2016-05-31 11:00:00,,2015-10-19 11:00:00 (385)


**Key Observations:**  

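- picking_completed_when is missing in 3,147 rows. The same number of rows have picked_quantity equal to 0, which suggests these order lines have not been picked yet.
- No critical anomalies were found.
- The date range (2013-01-01 to 2016-05-31) matches the orders table.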
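
The summaries above report 3,147 rows with a missing picking_completed_when and 3,147 rows with picked_quantity equal to 0. Equal counts do not prove that these are the same rows, so a row-level cross-check is useful. Below is a minimal sketch of such a check. It runs against a toy in-memory SQLite table named `order_lines` (an assumption for illustration, not the real sales.order_lines), and the rows are invented.

```python
import sqlite3

# Toy stand-in for sales.order_lines; table name and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines ("
    "order_line_id INTEGER, picked_quantity INTEGER, picking_completed_when TEXT)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [
        (1, 10, "2013-01-01 11:00:00"),
        (2, 0, None),  # nothing picked, no timestamp
        (3, 5, "2013-01-02 11:00:00"),
        (4, 0, None),  # nothing picked, no timestamp
    ],
)

# Count rows for each combination of "timestamp missing" and "nothing picked".
query = """
SELECT
    picking_completed_when IS NULL AS ts_missing,
    picked_quantity = 0 AS nothing_picked,
    COUNT(*) AS n
FROM order_lines
GROUP BY 1, 2
ORDER BY 1, 2
"""
pick_check = conn.execute(query).fetchall()
print(pick_check)
```

If the two conditions always coincide, only the (0, 0) and (1, 1) combinations appear in the result. Any mixed combination would point to rows that need a closer look.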

##### Table sales.customer_categories

Let's look at the rows.

In [31]:
%%sql
SELECT
    *
FROM
    sales.customer_categories
LIMIT 5

Unnamed: 0,customer_category_id,customer_category_name,last_edited_by
0,1,Agent,1
1,2,Wholesaler,1
2,3,Novelty Shop,1
3,4,Supermarket,1
4,5,Computer Store,1


Let's examine each column we will use for creating the dashboard individually.

**customer_category_id**

In [32]:
%%sql
SELECT * FROM get_column_summary('sales.customer_categories', 'customer_category_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,8,,,,,
1,Unique Count,8,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**customer_category_name**

In [33]:
%%sql
SELECT * FROM get_column_summary('sales.customer_categories', 'customer_category_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,8.0,,,,,
1,Unique Count,8.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- There are no missing values in the columns we need.
- No critical anomalies were found.

##### Table sales.customers

Let's look at the rows.

In [34]:
%%sql
SELECT
    *
FROM
    sales.customers
LIMIT 5

Unnamed: 0,customer_id,customer_name,bill_to_customer_id,customer_category_id,buying_group_id,primary_contact_person_id,alternate_contact_person_id,delivery_method_id,delivery_city_id,postal_city_id,...,delivery_run,run_position,website_url,delivery_address_line_1,delivery_address_line_2,delivery_postal_code,postal_address_line_1,postal_address_line_2,postal_postal_code,last_edited_by
0,1,Tailspin Toys (Head Office),1,3,1,1001,1002,3,19586,19586,...,,,http://www.tailspintoys.com,Shop 38,1877 Mittal Road,90410,PO Box 8975,Ribeiroville,90410,1
1,2,"Tailspin Toys (Sylvanite, MT)",1,3,1,1003,1004,3,33475,33475,...,,,http://www.tailspintoys.com/Sylvanite,Shop 245,705 Dita Lane,90216,PO Box 259,Jogiville,90216,1
2,3,"Tailspin Toys (Peeples Valley, AZ)",1,3,1,1005,1006,3,26483,26483,...,,,http://www.tailspintoys.com/PeeplesValley,Unit 217,1970 Khandke Road,90205,PO Box 3648,Lucescuville,90205,1
3,4,"Tailspin Toys (Medicine Lodge, KS)",1,3,1,1007,1008,3,21692,21692,...,,,http://www.tailspintoys.com/MedicineLodge,Suite 164,967 Riutta Boulevard,90152,PO Box 5065,Maciasville,90152,1
4,5,"Tailspin Toys (Gasport, NY)",1,3,1,1009,1010,3,12748,12748,...,,,http://www.tailspintoys.com/Gasport,Unit 176,1674 Skujins Boulevard,90261,PO Box 6294,Kellnerovaville,90261,1


Let's examine each column we will use for creating the dashboard individually.

**customer_id**

In [35]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'customer_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663,,,,,
1,Unique Count,663,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**customer_name**

In [36]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'customer_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663.0,,,,,
1,Unique Count,663.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**customer_category_id**

In [37]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'customer_category_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663,,,,,
1,Unique Count,5,,,,,
2,Missing,0,,,,,
3,Duplicated,658,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**buying_group_id**

In [38]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'buying_group_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663,,,,,
1,Unique Count,2,,,,,
2,Missing,261,,,,,
3,Duplicated,400,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


Let's look at rows with missing values in buying_group_id.

In [39]:
%%sql
SELECT
    *
FROM
    sales.customers
WHERE
    buying_group_id IS NULL
LIMIT 5

Unnamed: 0,customer_id,customer_name,bill_to_customer_id,customer_category_id,buying_group_id,primary_contact_person_id,alternate_contact_person_id,delivery_method_id,delivery_city_id,postal_city_id,...,delivery_run,run_position,website_url,delivery_address_line_1,delivery_address_line_2,delivery_postal_code,postal_address_line_1,postal_address_line_2,postal_postal_code,last_edited_by
0,801,Eric Torres,801,7,,3001,,3,31321,31321,...,,,http://www.microsoft.com/EricTorres/,Unit 26,1772 Allu Street,90218,PO Box 4858,Sandhuville,90218,1
1,802,Cosmina Vlad,802,7,,3002,,3,5192,5192,...,,,http://www.microsoft.com/CosminaVlad/,Suite 9,908 Nadar Lane,90602,PO Box 1954,Gonzalesville,90602,15
2,803,Bala Dixit,803,3,,3003,,3,33799,33799,...,,,http://www.microsoft.com/BalaDixit/,Unit 7,844 Magnusson Lane,90676,PO Box 8565,Blahoville,90676,1
3,804,Aleksandrs Riekstins,804,5,,3004,,3,18069,18069,...,,,http://www.microsoft.com/AleksandrsRiekstins/,Shop 20,498 Bagheri Lane,90797,PO Box 6490,Linnaville,90797,1
4,805,Ratan Poddar,805,3,,3005,,3,10194,10194,...,,,http://www.microsoft.com/RatanPoddar/,Shop 16,1071 Goransson Crescent,90457,PO Box 6237,Shakibaville,90457,1


**delivery_method_id**

In [40]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'delivery_method_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663,,,,,
1,Unique Count,1,,,,,
2,Missing,0,,,,,
3,Duplicated,662,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**delivery_city_id**

In [41]:
%%sql
SELECT * FROM get_column_summary('sales.customers', 'delivery_city_id', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,663,,Max,38184.0,,16702 (2)
1,Unique Count,655,,75%,28468.0,,242 (2)
2,Missing,0,,Mean,19033.07,,26010 (2)
3,Duplicated,8,,Median,19232.0,,29320 (2)
4,Zero,0,,25%,9369.5,,31685 (2)
5,Negative,0,,Min,15.0,,33832 (2)


**Key Observations:**  

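- buying_group_id is missing for 261 of 663 customers. These appear to be customers that do not belong to a buying group.
- delivery_method_id has a single unique value, so all customers use the same delivery method.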
- No critical anomalies were found.
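
One way to keep customers without a buying group visible in the dashboard is to map the missing value to an explicit label instead of dropping those rows. The sketch below shows the idea on a toy in-memory SQLite table. The table name `customers`, the label `'No group'`, and the reduced column set are assumptions for illustration. The five rows reuse the customer_id and buying_group_id values from the samples above.

```python
import sqlite3

# Toy stand-in for sales.customers with only the two columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, buying_group_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 1), (2, 1), (801, None), (802, None), (803, None)],
)

# Replace NULL with a hypothetical 'No group' label before aggregating.
group_counts = conn.execute(
    """
    SELECT
        COALESCE(CAST(buying_group_id AS TEXT), 'No group') AS buying_group,
        COUNT(*) AS customers
    FROM customers
    GROUP BY buying_group
    ORDER BY buying_group
    """
).fetchall()
print(group_counts)
```

With this approach the customers without a group form their own category in any breakdown by buying group.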

##### Table sales.invoices

Let's look at the rows.

In [42]:
%%sql
SELECT
    *
FROM
    sales.invoices
LIMIT 5

Unnamed: 0,invoice_id,customer_id,bill_to_customer_id,order_id,delivery_method_id,contact_person_id,accounts_person_id,salesperson_person_id,packed_by_person_id,invoice_date,...,internal_comments,total_dry_items,total_chiller_items,delivery_run,run_position,returned_delivery_data,confirmed_delivery_time,confirmed_received_by,last_edited_by,last_edited_when
0,1,832,832,1,3,3032,3032,2,14,2013-01-01,...,,1,0,,,"{""Events"": [{ ""Event"":""Ready for collection"",""...",2013-01-02 07:05:00,Aakriti Byrraju,15,2013-01-02 07:00:00
1,2,803,803,2,3,3003,3003,8,14,2013-01-01,...,,2,0,,,"{""Events"": [{ ""Event"":""Ready for collection"",""...",2013-01-02 07:10:00,Bala Dixit,15,2013-01-02 07:00:00
2,3,105,1,3,3,1209,1001,7,14,2013-01-01,...,,1,0,,,"{""Events"": [{ ""Event"":""Ready for collection"",""...",2013-01-02 07:15:00,Sung-Hwan Hwang,15,2013-01-02 07:00:00
3,4,57,1,4,3,1113,1001,16,14,2013-01-01,...,,3,0,,,"{""Events"": [{ ""Event"":""Ready for collection"",""...",2013-01-02 07:20:00,Aile Mae,15,2013-01-02 07:00:00
4,5,905,905,5,3,3105,3105,3,14,2013-01-01,...,,3,0,,,"{""Events"": [{ ""Event"":""Ready for collection"",""...",2013-01-02 07:25:00,Sara Huiting,15,2013-01-02 07:00:00


Let's examine each column we will use for creating the dashboard individually.

**invoice_id**

In [43]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'invoice_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Unique Count,70510,,,,,
2,Zero,0,,,,,
3,Negative,0,,,,,
4,Duplicated,0,,,,,
5,Total Count,70510,,,,,


**customer_id**

In [44]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'customer_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,70510,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Unique Count,663,,,,,
4,Duplicated,69847,,,,,
5,Zero,0,,,,,


**order_id**

In [45]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'order_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,70510,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,0,,,,,
5,Unique Count,70510,,,,,


**delivery_method_id**

In [46]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'delivery_method_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,70510,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Duplicated,70509,,,,,
4,Unique Count,1,,,,,
5,Zero,0,,,,,


**invoice_date**

In [47]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'invoice_date', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2016-01-06 (117)
1,Duplicated,69441.0,,Median,,,2016-04-18 (117)
2,Total Count,70510.0,,Mean,,,2015-07-06 (116)
3,Zero,,,75%,,,2016-02-24 (116)
4,Unique Count,1069.0,,Min,2013-01-01,,2016-02-26 (116)
5,Missing,0.0,,Max,2016-05-31,,2016-05-04 (116)


**total_dry_items**

In [48]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'total_dry_items', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,70510,,Max,5.0,,3 (17179)
1,Unique Count,6,,75%,4.0,,2 (17024)
2,Zero,16,,Mean,3.22,,4 (16883)
3,Missing,0,,Median,3.0,,5 (13676)
4,Duplicated,70504,,25%,2.0,,1 (5732)
5,Negative,0,,Min,0.0,,0 (16)


**total_chiller_items**

In [49]:
%%sql
SELECT * FROM get_column_summary('sales.invoices', 'total_chiller_items', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,70510,,Max,3.0,,0 (69519)
1,Unique Count,4,,75%,0.0,,1 (948)
2,Zero,69519,,Mean,0.01,,2 (41)
3,Duplicated,70506,,Median,0.0,,3 (2)
4,Negative,0,,25%,0.0,,
5,Missing,0,,Min,0.0,,


**Key Observations:**  

- There are no missing values in the columns we need.
- No critical anomalies were found.
- The date range matches the orders table.

##### Table sales.invoice_lines

Let's look at the rows.

In [50]:
%%sql
SELECT
    *
FROM
    sales.invoice_lines
LIMIT 5

Unnamed: 0,invoice_line_id,invoice_id,stock_item_id,description,package_type_id,quantity,unit_price,tax_rate,tax_amount,line_profit,extended_price,last_edited_by,last_edited_when
0,1,1,67,Ride on toy sedan car (Black) 1/12 scale,7,10,230.0,15.0,345.0,850.0,2645.0,7,2013-01-01 12:00:00
1,2,2,50,Developer joke mug - old C developers never di...,7,9,13.0,15.0,17.55,76.5,134.55,7,2013-01-01 12:00:00
2,3,2,10,USB food flash drive - chocolate bar,7,9,32.0,15.0,43.2,180.0,331.2,7,2013-01-01 12:00:00
3,4,3,114,Superhero action jacket (Blue) XXL,7,3,30.0,15.0,13.5,24.0,103.5,7,2013-01-01 12:00:00
4,5,4,206,Permanent marker black 5mm nib (Black) 5mm,7,96,2.7,15.0,38.88,96.0,298.08,7,2013-01-01 12:00:00


Let's examine each column we will use for creating the dashboard individually.

**invoice_line_id**

In [51]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'invoice_line_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Unique Count,228265,,,,,
2,Duplicated,0,,,,,
3,Total Count,228265,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**invoice_id**

In [52]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'invoice_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,157755,,,,,
5,Unique Count,70510,,,,,


**stock_item_id**

In [53]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'stock_item_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,228038,,,,,
5,Unique Count,227,,,,,


**description**

In [54]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'description', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,,,,
1,Zero,,,,,,
2,Total Count,228265.0,,,,,
3,Missing,0.0,,,,,
4,Unique Count,227.0,,,,,
5,Duplicated,228038.0,,,,,


**package_type_id**

In [55]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'package_type_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,,,,
1,Missing,0,,,,,
2,Negative,0,,,,,
3,Duplicated,228261,,,,,
4,Unique Count,4,,,,,
5,Zero,0,,,,,


**quantity**

In [56]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'quantity', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,360.0,,10 (15799)
1,Unique Count,61,,75%,60.0,,5 (12876)
2,Zero,0,,Mean,39.21,,1 (12716)
3,Missing,0,,Median,10.0,,8 (12701)
4,Duplicated,228204,,25%,5.0,,7 (12681)
5,Negative,0,,Min,1.0,,2 (12654)


**unit_price**

In [57]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'unit_price', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,1899.0,,13.00 (44577)
1,Duplicated,228203,,75%,32.0,,32.00 (35177)
2,Zero,0,,Mean,45.59,,18.00 (34326)
3,Missing,0,,Median,18.0,,25.00 (13553)
4,Unique Count,62,,25%,13.0,,30.00 (7307)
5,Negative,0,,Min,0.66,,4.10 (7290)


**tax_rate**

In [58]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'tax_rate', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,15.0,,15.000 (227229)
1,Duplicated,228263,,75%,15.0,,10.000 (1036)
2,Zero,0,,Mean,14.98,,
3,Missing,0,,Median,15.0,,
4,Unique Count,2,,25%,15.0,,
5,Negative,0,,Min,10.0,,


**tax_amount**

In [59]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'tax_amount', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,2848.5,,13.65 (4613)
1,Duplicated,227772,,75%,129.6,,1.95 (4580)
2,Negative,0,,Mean,112.95,,5.85 (4514)
3,Missing,0,,Median,34.5,,3.90 (4505)
4,Unique Count,493,,25%,14.4,,17.55 (4469)
5,Zero,0,,Min,0.38,,15.60 (4451)


**line_profit**

In [60]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'line_profit', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,9200.0,,85.00 (5056)
1,Unique Count,570,,75%,390.0,,68.00 (4662)
2,Negative,4626,,Mean,375.57,,59.50 (4613)
3,Duplicated,227695,,Median,120.0,,25.50 (4610)
4,Zero,0,,25%,51.0,,17.00 (4598)
5,Missing,0,,Min,-645.0,,8.50 (4580)


**extended_price**

In [61]:
%%sql
SELECT * FROM get_column_summary('sales.invoice_lines', 'extended_price', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,228265,,Max,21838.5,,104.65 (4613)
1,Duplicated,227768,,75%,993.6,,14.95 (4580)
2,Zero,0,,Mean,867.6,,44.85 (4514)
3,Unique Count,497,,Median,264.5,,29.90 (4505)
4,Negative,0,,25%,110.4,,134.55 (4469)
5,Missing,0,,Min,2.88,,119.60 (4451)


**Key Observations:**  

- There are no missing values in the columns we need.
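- No critical anomalies were found. Note that line_profit is negative in 4,626 rows (minimum -645.0), which indicates lines sold at a loss.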
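
The five sample rows shown for this table suggest that tax_amount equals quantity × unit_price × tax_rate / 100 and that extended_price equals quantity × unit_price + tax_amount. This relation is inferred from the sample rows only, not from documentation. The sketch below checks it for those five rows. The helper name `line_is_consistent` is hypothetical.

```python
from decimal import Decimal

# (quantity, unit_price, tax_rate, tax_amount, extended_price)
# copied from the five sample rows of sales.invoice_lines shown above.
sample_lines = [
    (10, "230.0", "15.0", "345.0", "2645.0"),
    (9, "13.0", "15.0", "17.55", "134.55"),
    (9, "32.0", "15.0", "43.2", "331.2"),
    (3, "30.0", "15.0", "13.5", "103.5"),
    (96, "2.7", "15.0", "38.88", "298.08"),
]


def line_is_consistent(quantity, unit_price, tax_rate, tax_amount, extended_price):
    """Return True if the tax and extended price match the assumed formulas."""
    net = quantity * Decimal(unit_price)
    expected_tax = net * Decimal(tax_rate) / 100
    return (
        expected_tax == Decimal(tax_amount)
        and net + expected_tax == Decimal(extended_price)
    )


consistency = [line_is_consistent(*row) for row in sample_lines]
print(consistency)
```

Decimal is used instead of float so that values such as 17.55 compare exactly. On the full table, any rounding applied by the source system would have to be taken into account.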

##### Table sales.customer_transactions

Let's look at the rows.

In [62]:
%%sql
SELECT
    *
FROM
    sales.customer_transactions
LIMIT 5

Unnamed: 0,customer_transaction_id,customer_id,transaction_type_id,invoice_id,payment_method_id,transaction_date,amount_excluding_tax,tax_amount,transaction_amount,outstanding_balance,finalization_date,is_finalized,last_edited_by,last_edited_when
0,5,803,1,2,,2013-01-01,405.0,60.75,465.75,0.0,2013-01-02,True,10,2013-01-02 11:30:00
1,7,1,1,3,,2013-01-01,90.0,13.5,103.5,0.0,2013-01-02,True,10,2013-01-02 11:30:00
2,11,1,1,4,,2013-01-01,445.2,66.78,511.98,0.0,2013-01-02,True,10,2013-01-02 11:30:00
3,15,905,1,5,,2013-01-01,704.0,105.6,809.6,0.0,2013-01-02,True,10,2013-01-02 11:30:00
4,19,976,1,6,,2013-01-01,430.0,64.5,494.5,0.0,2013-01-02,True,10,2013-01-02 11:30:00


Let's examine each column we will use for creating the dashboard individually.

**customer_transaction_id**

In [63]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'customer_transaction_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Total Count,97147,,,,,
2,Duplicated,0,,,,,
3,Zero,0,,,,,
4,Unique Count,97147,,,,,
5,Negative,0,,,,,


**customer_id**

In [64]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'customer_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Total Count,97147,,,,,
2,Negative,0,,,,,
3,Unique Count,263,,,,,
4,Duplicated,96884,,,,,
5,Zero,0,,,,,


**transaction_type_id**

In [65]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'transaction_type_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Total Count,97147,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Unique Count,2,,,,,
5,Duplicated,97145,,,,,


**invoice_id**

In [66]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'invoice_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,26637,,,,,
1,Total Count,97147,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Duplicated,0,,,,,
5,Unique Count,70510,,,,,


**payment_method_id**

In [67]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'payment_method_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,70510,,,,,
1,Total Count,97147,,,,,
2,Negative,0,,,,,
3,Duplicated,26636,,,,,
4,Unique Count,1,,,,,
5,Zero,0,,,,,


**transaction_date**

In [68]:
%%sql
SELECT * FROM get_column_summary('sales.customer_transactions', 'transaction_date', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,25%,,,2016-01-07 (164)
1,Duplicated,95900.0,,Median,,,2015-11-24 (159)
2,Total Count,97147.0,,Mean,,,2015-07-07 (157)
3,Zero,,,75%,,,2016-03-22 (155)
4,Unique Count,1247.0,,Min,2013-01-01,,2016-01-06 (154)
5,Missing,0.0,,Max,2016-05-31,,2015-07-23 (151)


**Key Observations:**

- invoice_id is missing in 26,637 of 97,147 rows. This is expected: not every transaction refers to an invoice.
- payment_method_id is missing in 70,510 rows and has only one unique value. This is also expected: not every transaction is a payment.
- The date range matches the orders table.
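
The counts above are consistent with each transaction carrying either an invoice_id or a payment_method_id, but not both: 97,147 - 70,510 = 26,637. Counts alone do not prove this, so a row-level check is sketched below. It uses a toy in-memory SQLite table named `customer_transactions` with a reduced column set. The first three rows follow the sample above, and the two payment rows are hypothetical.

```python
import sqlite3

# Toy stand-in for sales.customer_transactions; only three columns are kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_transactions ("
    "customer_transaction_id INTEGER, invoice_id INTEGER, payment_method_id INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?, ?)",
    [
        (5, 2, None),    # invoice rows, as in the sample output
        (7, 3, None),
        (11, 4, None),
        (12, None, 4),   # hypothetical payment rows
        (13, None, 4),
    ],
)

# Classify every row by which of the two columns is populated.
query = """
SELECT
    SUM(CASE WHEN invoice_id IS NOT NULL AND payment_method_id IS NULL
             THEN 1 ELSE 0 END) AS invoice_only,
    SUM(CASE WHEN invoice_id IS NULL AND payment_method_id IS NOT NULL
             THEN 1 ELSE 0 END) AS payment_only,
    SUM(CASE WHEN (invoice_id IS NULL) = (payment_method_id IS NULL)
             THEN 1 ELSE 0 END) AS both_or_neither
FROM customer_transactions
"""
invoice_only, payment_only, both_or_neither = conn.execute(query).fetchone()
print(invoice_only, payment_only, both_or_neither)
```

A `both_or_neither` count of zero on the real table would confirm that the two columns are mutually exclusive.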

#### Application Schema

##### Table application.countries

Let's look at the rows.

In [69]:
%%sql
SELECT
    *
FROM
    application.countries
LIMIT 5

Unnamed: 0,country_id,country_name,formal_name,iso_alpha_3_code,iso_numeric_code,country_type,latest_recorded_population,continent,region,subregion,last_edited_by
0,1,Afghanistan,Islamic State of Afghanistan,AFG,4,UN Member State,28400000,Asia,Asia,Southern Asia,1
1,3,Albania,Republic of Albania,ALB,8,UN Member State,3785031,Europe,Europe,Southern Europe,20
2,4,Algeria,People's Democratic Republic of Algeria,DZA,12,UN Member State,34178188,Africa,Africa,Northern Africa,1
3,6,Andorra,Principality of Andorra,AND,20,UN Member State,87243,Europe,Europe,Southern Europe,15
4,7,Angola,People's Republic of Angola,AGO,24,UN Member State,12799293,Africa,Africa,Middle Africa,1


Let's examine each column we will use for creating the dashboard individually.

**country_id**

In [70]:
%%sql
SELECT * FROM get_column_summary('application.countries', 'country_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,190,,,,,
1,Unique Count,190,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**country_name**

In [71]:
%%sql
SELECT * FROM get_column_summary('application.countries', 'country_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,190.0,,,,,
1,Unique Count,190.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**iso_alpha_3_code**

In [72]:
%%sql
SELECT * FROM get_column_summary('application.countries', 'iso_alpha_3_code', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,190.0,,,,,
1,Unique Count,190.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table application.state_provinces

Let's look at the rows.

In [73]:
%%sql
SELECT
    *
FROM
    application.state_provinces
LIMIT 5

Unnamed: 0,state_province_id,state_province_code,state_province_name,country_id,sales_territory,latest_recorded_population,last_edited_by
0,1,AL,Alabama,230,Southeast,5437278,15
1,2,AK,Alaska,230,Far West,735132,1
2,3,AZ,Arizona,230,Southwest,6891688,8
3,4,AR,Arkansas,230,Southeast,3077747,8
4,5,CA,California,230,Far West,41460453,15


Let's examine each column we will use for creating the dashboard individually.

**state_province_id**

In [74]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'state_province_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53,,,,,
1,Unique Count,53,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**state_province_code**

In [75]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'state_province_code', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53.0,,,,,
1,Unique Count,53.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**state_province_name**

In [76]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'state_province_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53.0,,,,,
1,Unique Count,53.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**country_id**

In [77]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'country_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53,,,,,
1,Unique Count,1,,,,,
2,Missing,0,,,,,
3,Duplicated,52,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**sales_territory**

In [78]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'sales_territory', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53.0,,,,,
1,Unique Count,9.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,44.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**latest_recorded_population**

In [79]:
%%sql
SELECT * FROM get_column_summary('application.state_provinces', 'latest_recorded_population', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,53,,,,,
1,Unique Count,53,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**Key Observations:**

- No missing values in the columns we need.
- No critical anomalies were found.
- All states in the application.state_provinces table are from the USA.

##### Table application.cities

Let's look at the rows.

In [80]:
%%sql
SELECT
    *
FROM
    application.cities
LIMIT 5

Unnamed: 0,city_id,city_name,state_province_id,latest_recorded_population,last_edited_by
0,1,Aaronsburg,39,613,1
1,3,Abanda,1,192,1
2,4,Abbeville,42,5237,1
3,5,Abbeville,11,2908,1
4,6,Abbeville,1,2688,1


Let's examine each column we will use for creating the dashboard individually.

**city_id**

In [81]:
%%sql
SELECT * FROM get_column_summary('application.cities', 'city_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Total Count,37940,,,,,
2,Duplicated,0,,,,,
3,Negative,0,,,,,
4,Unique Count,37940,,,,,
5,Zero,0,,,,,


**city_name**

In [82]:
%%sql
SELECT * FROM get_column_summary('application.cities', 'city_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Negative,,,,,,
1,Zero,,,,,,
2,Missing,0.0,,,,,
3,Total Count,37940.0,,,,,
4,Unique Count,23279.0,,,,,
5,Duplicated,14661.0,,,,,


**state_province_id**

In [83]:
%%sql
SELECT * FROM get_column_summary('application.cities', 'state_province_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,0,,,,,
1,Total Count,37940,,,,,
2,Negative,0,,,,,
3,Zero,0,,,,,
4,Unique Count,53,,,,,
5,Duplicated,37887,,,,,


**latest_recorded_population**

In [84]:
%%sql
SELECT * FROM get_column_summary('application.cities', 'latest_recorded_population', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Missing,11048,,,,,
1,Total Count,37940,,,,,
2,Negative,0,,,,,
3,Zero,14,,,,,
4,Duplicated,17568,,,,,
5,Unique Count,9324,,,,,


**Key Observations:**  

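- Not all cities have population values: latest_recorded_population is missing for 11,048 of 37,940 cities.
- City names are not unique (23,279 distinct names for 37,940 cities), so a city must be identified by city_id rather than by name.
- No critical anomalies were found.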
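
The city_name summary reports 23,279 unique names for 37,940 cities, and the sample rows already show "Abbeville" in three different states. The sketch below finds names that occur in more than one state, using only the five sample rows above. The variable names are illustrative.

```python
from collections import defaultdict

# (city_id, city_name, state_province_id) from the sample rows of application.cities.
cities = [
    (1, "Aaronsburg", 39),
    (3, "Abanda", 1),
    (4, "Abbeville", 42),
    (5, "Abbeville", 11),
    (6, "Abbeville", 1),
]

# Collect the set of states in which each city name occurs.
states_by_name = defaultdict(set)
for city_id, city_name, state_province_id in cities:
    states_by_name[city_name].add(state_province_id)

# Keep only the names that are shared by several states.
ambiguous = {
    name: sorted(states)
    for name, states in states_by_name.items()
    if len(states) > 1
}
print(ambiguous)
```

Because of such repeats, joins and dashboard filters are safer on city_id, or on the pair of city_name and state_province_id, than on city_name alone.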

##### Table application.delivery_methods

Let's look at the rows.

In [85]:
%%sql
SELECT
    *
FROM
    application.delivery_methods
LIMIT 5

Unnamed: 0,delivery_method_id,delivery_method_name,last_edited_by
0,1,Post,1
1,2,Courier,1
2,3,Delivery Van,1
3,4,Customer Collect,1
4,5,Chilled Van,16


Let's examine each column we will use for creating the dashboard individually.

**delivery_method_id**

In [86]:
%%sql
SELECT * FROM get_column_summary('application.delivery_methods', 'delivery_method_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,10,,,,,
1,Unique Count,10,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**delivery_method_name**

In [87]:
%%sql
SELECT * FROM get_column_summary('application.delivery_methods', 'delivery_method_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,10.0,,,,,
1,Unique Count,10.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table application.payment_methods

Let's look at the rows.

In [88]:
%%sql
SELECT
    *
FROM
    application.payment_methods
LIMIT 5

Unnamed: 0,payment_method_id,payment_method_name,last_edited_by
0,1,Cash,1
1,2,Check,1
2,3,Credit-Card,9
3,4,EFT,1


Let's examine each column we will use for creating the dashboard individually.

**payment_method_id**

In [89]:
%%sql
SELECT * FROM get_column_summary('application.payment_methods', 'payment_method_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,4,,,,,
1,Unique Count,4,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**payment_method_name**

In [90]:
%%sql
SELECT * FROM get_column_summary('application.payment_methods', 'payment_method_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,4.0,,,,,
1,Unique Count,4.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table application.transaction_types

Let's look at the rows.

In [91]:
%%sql
SELECT
    *
FROM
    application.transaction_types
LIMIT 5

Unnamed: 0,transaction_type_id,transaction_type_name,last_edited_by
0,1,Customer Invoice,1
1,2,Customer Credit Note,1
2,3,Customer Payment Received,1
3,4,Customer Refund,1
4,5,Supplier Invoice,1


Let's examine each column we will use for creating the dashboard individually.

**transaction_type_id**

In [92]:
%%sql
SELECT * FROM get_column_summary('application.transaction_types', 'transaction_type_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,13,,,,,
1,Unique Count,13,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**transaction_type_name**

In [93]:
%%sql
SELECT * FROM get_column_summary('application.transaction_types', 'transaction_type_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,13.0,,,,,
1,Unique Count,13.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

#### Warehouse Schema

##### Table warehouse.stock_items

Let's look at the rows.

In [94]:
%%sql
SELECT
    *
FROM
    warehouse.stock_items
LIMIT 5

Unnamed: 0,stock_item_id,stock_item_name,supplier_id,color_id,unit_package_id,outer_package_id,brand,size,lead_time_days,quantity_per_outer,...,unit_price,recommended_retail_price,typical_weight_per_unit,marketing_comments,internal_comments,photo,custom_fields,tags,search_details,last_edited_by
0,1,USB missile launcher (Green),12,,7,7,,,14,1,...,25.0,37.38,0.3,Complete with 12 projectiles,,,"{ ""CountryOfManufacture"": ""China"", ""Tags"": [""U...","[""USB Powered""]",USB missile launcher (Green) Complete with 12 ...,1
1,2,USB rocket launcher (Gray),12,12.0,7,7,,,14,1,...,25.0,37.38,0.3,Complete with 12 projectiles,,,"{ ""CountryOfManufacture"": ""China"", ""Tags"": [""U...","[""USB Powered""]",USB rocket launcher (Gray) Complete with 12 pr...,1
2,3,Office cube periscope (Black),12,3.0,7,6,,,14,10,...,18.5,27.66,0.25,Need to see over your cubicle wall? This is ju...,,,"{ ""CountryOfManufacture"": ""China"", ""Tags"": [] }",[],Office cube periscope (Black) Need to see over...,1
3,4,USB food flash drive - sushi roll,12,,7,7,,,14,1,...,32.0,47.84,0.05,,,,"{ ""CountryOfManufacture"": ""Japan"", ""Tags"": [""3...","[""32GB"",""USB Powered""]",USB food flash drive - sushi roll,1
4,5,USB food flash drive - hamburger,12,,7,7,,,14,1,...,32.0,47.84,0.05,,,,"{ ""CountryOfManufacture"": ""Japan"", ""Tags"": [""1...","[""16GB"",""USB Powered""]",USB food flash drive - hamburger,1


Let's examine each column we will use for creating the dashboard individually.

**stock_item_id**

In [95]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'stock_item_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,,,,
1,Unique Count,227,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**stock_item_name**

In [96]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'stock_item_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227.0,,,,,
1,Unique Count,227.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**color_id**

In [97]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'color_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,,,,
1,Unique Count,7,,,,,
2,Missing,99,,,,,
3,Duplicated,121,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**unit_package_id**

In [98]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'unit_package_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,,,,
1,Unique Count,4,,,,,
2,Missing,0,,,,,
3,Duplicated,223,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**outer_package_id**

In [99]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'outer_package_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,,,,
1,Unique Count,3,,,,,
2,Missing,0,,,,,
3,Duplicated,224,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**brand**

In [100]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'brand', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227.0,,Max,Northwind,,Northwind (18)
1,Unique Count,1.0,,75%,,,
2,Missing,209.0,,Mean,,,
3,Duplicated,17.0,,Median,,,
4,Zero,,,25%,,,
5,Negative,,,Min,Northwind,,


**size**

In [101]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'size', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227.0,,Max,XXS,,XL (12)
1,Unique Count,43.0,,75%,,,L (11)
2,Missing,64.0,,Mean,,,M (11)
3,Duplicated,120.0,,Median,,,S (11)
4,Zero,,,25%,,,1/12 scale (9)
5,Negative,,,Min,1.5m,,1/50 scale (9)


**tax_rate**

In [102]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'tax_rate', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,Max,15.0,,15.000 (219)
1,Unique Count,2,,75%,15.0,,10.000 (8)
2,Missing,0,,Mean,14.82,,
3,Duplicated,225,,Median,15.0,,
4,Zero,0,,25%,15.0,,
5,Negative,0,,Min,10.0,,


**unit_price**

In [103]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'unit_price', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,Max,1899.0,,13.00 (42)
1,Unique Count,57,,75%,32.0,,18.00 (35)
2,Missing,0,,Mean,44.16,,32.00 (34)
3,Duplicated,170,,Median,18.0,,25.00 (13)
4,Zero,0,,25%,13.0,,30.00 (7)
5,Negative,0,,Min,0.66,,4.10 (7)


**typical_weight_per_unit**

In [104]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_items', 'typical_weight_per_unit', False);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,227,,Max,21.0,,0.150 (42)
1,Unique Count,23,,75%,0.7,,0.400 (28)
2,Missing,0,,Mean,1.83,,0.350 (25)
3,Duplicated,204,,Median,0.35,,0.300 (21)
4,Zero,0,,25%,0.15,,0.250 (18)
5,Negative,0,,Min,0.05,,0.500 (13)


**Key Observations:**

- Not all items have a brand, color, or size: brand is missing for 209 of 227 items, color_id for 99, and size for 64.
- No critical anomalies were found.
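
The summary counts above can be cross-checked outside the database. Below is a minimal sketch in plain Python; `column_summary` is a hypothetical helper, not the `get_column_summary` function used in this notebook. It assumes that "Duplicated" means the number of non-missing values minus the number of unique values, which is consistent with the outputs above (for example, color_id: 227 - 99 - 7 = 121).

```python
def column_summary(values):
    """Return the six summary counts for one column (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    # Zero/Negative only make sense for numeric columns
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    unique = len(set(non_null))
    return {
        "Total Count": len(values),
        "Unique Count": unique,
        "Missing": len(values) - len(non_null),
        # Assumption: duplicates are counted among non-missing values only
        "Duplicated": len(non_null) - unique,
        "Zero": sum(v == 0 for v in non_null) if numeric else None,
        "Negative": sum(v < 0 for v in non_null) if numeric else None,
    }


# Toy columns: a nullable numeric key and a nullable text attribute
numeric_summary = column_summary([3, 3, None, 12, None, 3])
text_summary = column_summary(["Northwind", None, "Northwind"])
print(numeric_summary)
print(text_summary)
```

For text columns the sketch leaves "Zero" and "Negative" empty, as in the outputs for stock_item_name and brand.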

##### Table warehouse.stock_item_stock_groups

Let's look at the rows.

In [105]:
%%sql
SELECT
    *
FROM
    warehouse.stock_item_stock_groups
LIMIT 5

Unnamed: 0,stock_item_stock_group_id,stock_item_id,stock_group_id,last_edited_by,last_edited_when
0,1,1,6,1,2013-01-01
1,2,1,1,1,2013-01-01
2,3,1,7,1,2013-01-01
3,4,2,6,1,2013-01-01
4,5,2,1,1,2013-01-01


Let's examine each column we will use for creating the dashboard individually.

**stock_item_stock_group_id**

In [106]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_item_stock_groups', 'stock_item_stock_group_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,442,,,,,
1,Unique Count,442,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**stock_item_id**

In [107]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_item_stock_groups', 'stock_item_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,442,,,,,
1,Unique Count,227,,,,,
2,Missing,0,,,,,
3,Duplicated,215,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**stock_group_id**

In [108]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_item_stock_groups', 'stock_group_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,442,,,,,
1,Unique Count,9,,,,,
2,Missing,0,,,,,
3,Duplicated,433,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table warehouse.stock_groups

Let's look at the rows.

In [109]:
%%sql
SELECT
    *
FROM
    warehouse.stock_groups
LIMIT 5

Unnamed: 0,stock_group_id,stock_group_name,last_edited_by
0,1,Novelty Items,1
1,2,Clothing,1
2,3,Mugs,1
3,4,T-Shirts,1
4,5,Airline Novelties,1


Let's examine each column we will use for creating the dashboard individually.

**stock_group_id**

In [110]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_groups', 'stock_group_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,10,,,,,
1,Unique Count,10,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**stock_group_name**

In [111]:
%%sql
SELECT * FROM get_column_summary('warehouse.stock_groups', 'stock_group_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,10.0,,,,,
1,Unique Count,10.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table warehouse.package_types

Let's look at the rows.

In [112]:
%%sql
SELECT
    *
FROM
    warehouse.package_types
LIMIT 5

Unnamed: 0,package_type_id,package_type_name,last_edited_by
0,1,Bag,1
1,2,Block,1
2,3,Bottle,1
3,4,Box,1
4,5,Can,1


Let's examine each column we will use for creating the dashboard individually.

**package_type_id**

In [113]:
%%sql
SELECT * FROM get_column_summary('warehouse.package_types', 'package_type_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,14,,,,,
1,Unique Count,14,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**package_type_name**

In [114]:
%%sql
SELECT * FROM get_column_summary('warehouse.package_types', 'package_type_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,14.0,,,,,
1,Unique Count,14.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

##### Table warehouse.colors

Let's look at the rows.

In [115]:
%%sql
SELECT
    *
FROM
    warehouse.colors
LIMIT 5

Unnamed: 0,color_id,color_name,last_edited_by
0,1,Azure,1
1,2,Beige,1
2,3,Black,1
3,4,Blue,1
4,5,Charcoal,1


Let's examine each column we will use for creating the dashboard individually.

**color_id**

In [116]:
%%sql
SELECT * FROM get_column_summary('warehouse.colors', 'color_id', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,36,,,,,
1,Unique Count,36,,,,,
2,Missing,0,,,,,
3,Duplicated,0,,,,,
4,Zero,0,,,,,
5,Negative,0,,,,,


**color_name**

In [117]:
%%sql
SELECT * FROM get_column_summary('warehouse.colors', 'color_name', True);

Unnamed: 0,Summary Type,Summary Count,-,Stats Type,Stats Value,--,Top Values
0,Total Count,36.0,,,,,
1,Unique Count,36.0,,,,,
2,Missing,0.0,,,,,
3,Duplicated,0.0,,,,,
4,Zero,,,,,,
5,Negative,,,,,,


**Key Observations:**  

- No missing values in the columns we need.
- No critical anomalies were found.

### Exploring Relationships Between Tables

Let's examine the relationships between tables for further joins.

Let's check if there are any key mismatches.
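
Such a check boils down to comparing the sets of key values on both sides. Below is a minimal sketch in plain Python; `relationship_profile` is a hypothetical helper, and its logic is an assumption about what `analyze_relationship` reports: sizes are row counts, key counts are distinct non-missing keys, and a side is "1" when its non-missing keys are unique.

```python
def relationship_profile(left_keys, right_keys):
    """Compare two key columns and describe their relationship (hypothetical helper)."""
    left = [k for k in left_keys if k is not None]
    right = [k for k in right_keys if k is not None]
    left_set, right_set = set(left), set(right)
    # A side is "1" when every non-missing key occurs at most once
    left_side = "1" if len(left) == len(left_set) else "N"
    right_side = "1" if len(right) == len(right_set) else "N"
    return {
        "relationship_type": f"{left_side}:{right_side}",
        "left_only_keys": len(left_set - right_set),
        "right_only_keys": len(right_set - left_set),
        "left_size": len(left_keys),
        "right_size": len(right_keys),
        "common_keys": len(left_set & right_set),
    }


# Toy data: four parent rows, six child rows, parent 4 has no children
profile = relationship_profile([1, 2, 3, 4], [1, 1, 2, 3, 3, 3])
print(profile)
```

A non-zero `right_only_keys` on the child side of a join would point to orphaned rows, which is the case that needs attention before building the data mart.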

#### Sales Schema

**sales.orders and sales.order_lines**

In [118]:
%%sql
Select * from analyze_relationship('sales.orders', 'sales.order_lines', 'order_id', 'order_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,0,0,73595,231412,73595


**sales.orders and sales.customers**

In [119]:
%%sql
Select * from analyze_relationship('sales.orders', 'sales.customers', 'customer_id', 'customer_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,N:1,0,0,73595,663,663


**sales.orders and sales.invoices**

In [120]:
%%sql
Select * from analyze_relationship('sales.orders', 'sales.invoices', 'order_id', 'order_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:1,3085,0,73595,70510,70510


**sales.customers and sales.customer_categories**

In [121]:
%%sql
Select * from analyze_relationship('sales.customers', 'sales.customer_categories', 'customer_category_id', 'customer_category_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,N:1,0,3,663,8,5


**sales.invoices and sales.invoice_lines**

In [122]:
%%sql
Select * from analyze_relationship('sales.invoices', 'sales.invoice_lines', 'invoice_id', 'invoice_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,0,0,70510,228265,70510


**sales.customers and sales.customer_transactions**

In [123]:
%%sql
Select * from analyze_relationship('sales.customers', 'sales.customer_transactions', 'customer_id', 'customer_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,400,0,663,97147,263


**sales.customers and sales.invoices**

In [124]:
%%sql
Select * from analyze_relationship('sales.customers', 'sales.invoices', 'customer_id', 'customer_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,0,0,663,70510,663


**sales.invoices and sales.customer_transactions**

In [125]:
%%sql
Select * from analyze_relationship('sales.invoices', 'sales.customer_transactions', 'invoice_id', 'invoice_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:1,0,0,70510,97147,70510


**Key Observations:**

- The sales.orders table contains order_id values that are not present in sales.invoices. This is normal, as not all orders have invoices.
- The sales.customer_categories table contains customer_category_id values that are not present in the sales.customers table. This is also normal, as not all categories have customers assigned.
- The sales.customers table contains customer_id values that are not present in the sales.customer_transactions table. This is also normal, as not all customers have transactions.
- No critical anomalies were found.

#### Application Schema

**application.countries and application.state_provinces**

In [126]:
%%sql
Select * from analyze_relationship('application.countries', 'application.state_provinces', 'country_id', 'country_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,189,0,190,53,1


**application.state_provinces and application.cities**

In [127]:
%%sql
Select * from analyze_relationship('application.state_provinces', 'application.cities', 'state_province_id', 'state_province_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,0,0,53,37940,53


**application.delivery_methods and sales.customers**

In [128]:
%%sql
Select * from analyze_relationship('application.delivery_methods', 'sales.customers', 'delivery_method_id', 'delivery_method_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,9,0,10,663,1


**application.payment_methods and sales.customer_transactions**

In [129]:
%%sql
Select * from analyze_relationship('application.payment_methods', 'sales.customer_transactions', 'payment_method_id', 'payment_method_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,3,0,4,97147,1


**application.transaction_types and sales.customer_transactions**

In [130]:
%%sql
Select * from analyze_relationship('application.transaction_types', 'sales.customer_transactions', 'transaction_type_id', 'transaction_type_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,11,0,13,97147,2


**Key Observations:**

- The application.countries table contains country_id values that are not present in application.state_provinces. This is normal, as not all countries are represented in state_provinces.
- The application.delivery_methods table contains delivery_method_id values that are not present in the sales.customers table. This is also normal, as not all delivery methods may have been used.
- The application.payment_methods table contains payment_method_id values that are not present in the sales.customer_transactions table. This is also normal, as not all payment methods may have been used.
- The application.transaction_types table contains transaction_type_id values that are not present in the sales.customer_transactions table. This is also normal, as not all transaction types may have been used.
- No critical anomalies were found.

#### Warehouse Schema

**warehouse.stock_items and warehouse.stock_item_stock_groups**

In [131]:
%%sql
Select * from analyze_relationship('warehouse.stock_items', 'warehouse.stock_item_stock_groups', 'stock_item_id', 'stock_item_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,0,0,227,442,227


**warehouse.stock_groups and warehouse.stock_item_stock_groups**

In [132]:
%%sql
Select * from analyze_relationship('warehouse.stock_groups', 'warehouse.stock_item_stock_groups', 'stock_group_id', 'stock_group_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,1:N,1,0,10,442,9


**warehouse.stock_items and warehouse.colors**

In [133]:
%%sql
Select * from analyze_relationship('warehouse.stock_items', 'warehouse.colors', 'color_id', 'color_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,N:1,0,29,227,36,7


**warehouse.stock_items and warehouse.package_types (unit_package_id)**

In [134]:
%%sql
Select * from analyze_relationship('warehouse.stock_items', 'warehouse.package_types', 'unit_package_id', 'package_type_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,N:1,0,10,227,14,4


**warehouse.stock_items and warehouse.package_types (outer_package_id)**

In [135]:
%%sql
Select * from analyze_relationship('warehouse.stock_items', 'warehouse.package_types', 'outer_package_id', 'package_type_id')

Unnamed: 0,relationship_type,left_only_keys,right_only_keys,left_size,right_size,common_keys
0,N:1,0,11,227,14,3


**Key Observations:**

- The warehouse.stock_groups table contains stock_group_id values that are not present in warehouse.stock_item_stock_groups. This is normal, as not all stock groups may be represented.
- The warehouse.colors table contains color_id values that are not present in the warehouse.stock_items table. This is also normal, as not all colors may be used.
- The warehouse.package_types table contains package_type_id values that are not present in the warehouse.stock_items table. This is also normal, as not all package types may be used.
- No critical anomalies were found.

# Designing the Analytical Database

### Database Schema Selection

Let's create an analytical database wwi_analytics for the dashboard.

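To do this, we will transform the current OLTP structure into an OLAP structure.

We will use a star schema, as it is well suited to analytical queries: measures live in fact tables, and descriptive attributes live in a small number of denormalized dimension tables.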

Let's switch to the analytical database.

In [136]:
con('dst')

Connected to dst


Let's create a new schema where we will create the tables.

In [None]:
%%sql
CREATE SCHEMA analytics;

Install the dblink extension for communication between databases.

In [None]:
%%sql
CREATE EXTENSION IF NOT EXISTS dblink;

Form the connection string for dblink.

In [11]:
dblink_conn_str = f"""
    host={src_db_config['host']} 
    dbname={src_db_config['db']} 
    user={src_db_config['user']} 
    password={src_db_config['pwd']}
    port={src_db_config['port']}
"""
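
The connection string above interpolates the configuration values as they are. In the libpq keyword/value format, a value that contains spaces must be wrapped in single quotes, with `\` and `'` escaped by a backslash. Because the string is then embedded in a SQL literal (`'{dblink_conn_str}'`), its single quotes also have to be doubled. Below is a minimal sketch of both steps; the helper names and the example configuration are hypothetical, and the second step assumes `standard_conforming_strings` is on, so that backslashes inside the SQL literal are taken literally.

```python
def libpq_quote(value) -> str:
    """Quote one value for a libpq keyword/value connection string."""
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_conn_str(cfg: dict) -> str:
    """Build the connection string from a config dict with the keys used above."""
    parts = {
        "host": cfg["host"],
        "dbname": cfg["db"],
        "user": cfg["user"],
        "password": cfg["pwd"],
        "port": cfg["port"],
    }
    return " ".join(f"{key}={libpq_quote(val)}" for key, val in parts.items())


def sql_literal_body(value: str) -> str:
    """Double single quotes so the value can be placed inside a SQL '...' literal."""
    return value.replace("'", "''")


# Hypothetical configuration with a password that needs quoting
example_cfg = {"host": "localhost", "db": "wwi", "user": "etl", "pwd": "p@ss word's", "port": 5432}
conn_str = build_conn_str(example_cfg)
print(conn_str)
print(sql_literal_body(conn_str))
```

With simple values (no spaces, quotes, or backslashes) the unquoted form used above works as well; the quoting only matters once a password or host name contains such characters.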

## Dimension Tables

### Customers

Create a dimension table for customers.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.dim_customers AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'SELECT 
                c.customer_id
                , c.customer_name
                , cc.customer_category_name
                , ci.city_name
                , sp.state_province_name
                , CURRENT_DATE AS valid_from
                , NULL::DATE AS valid_to
                , TRUE AS current_record 
            FROM
                sales.customers c
                LEFT JOIN sales.customer_categories cc ON c.customer_category_id = cc.customer_category_id
                LEFT JOIN application.cities ci ON c.delivery_city_id = ci.city_id
                LEFT JOIN application.state_provinces sp ON ci.state_province_id = sp.state_province_id'
        ) AS t(
            customer_id INT
            , customer_name TEXT
            , customer_category_name TEXT
            , city_name TEXT
            , state_province_name TEXT
            , valid_from DATE
            , valid_to DATE
            , current_record BOOLEAN            
        );

        ALTER TABLE analytics.dim_customers ADD PRIMARY KEY (customer_id);
    """)
    conn.execute(stmt)
    conn.commit()

### Products

Create a dimension table for products.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.dim_products AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'WITH stock_groups_agg AS (
                SELECT 
                    sisg.stock_item_id
                    , STRING_AGG(sg.stock_group_name, '', '') AS stock_group_names
                FROM
                    warehouse.stock_item_stock_groups sisg 
                    LEFT JOIN warehouse.stock_groups sg ON sisg.stock_group_id  = sg.stock_group_id
                GROUP BY
                    sisg.stock_item_id
            )
            SELECT 
                si.stock_item_id
                , si.stock_item_name
                , sg.stock_group_names
                , c.color_name
                , ptu.package_type_name AS unit_package_type_name
                , pto.package_type_name AS outer_package_type_name
                , si.brand
                , si.size
                , CURRENT_DATE AS valid_from
                , NULL::DATE AS valid_to
                , TRUE AS current_record                
            FROM
                warehouse.stock_items si
                LEFT JOIN stock_groups_agg sg ON si.stock_item_id = sg.stock_item_id
                LEFT JOIN warehouse.package_types ptu ON si.unit_package_id = ptu.package_type_id
                LEFT JOIN warehouse.package_types pto ON si.outer_package_id = pto.package_type_id
                LEFT JOIN warehouse.colors c ON c.color_id = si.color_id'
        ) AS t(
            stock_item_id INT
            , stock_item_name TEXT
            , stock_group_names TEXT
            , color_name TEXT
            , unit_package_type_name TEXT
            , outer_package_type_name TEXT
            , brand TEXT
            , size TEXT
            , valid_from DATE
            , valid_to DATE
            , current_record BOOLEAN
        );

        ALTER TABLE analytics.dim_products ADD PRIMARY KEY (stock_item_id)
    """)
    conn.execute(stmt)
    conn.commit()
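
The `stock_groups_agg` CTE above collapses the many-to-many link between items and groups into one comma-separated string per item. The same idea can be sketched in plain Python on toy data; the link rows below are made up for illustration. Note that `STRING_AGG` without `ORDER BY` does not guarantee the order of the names, whereas this sketch keeps input order.

```python
from collections import defaultdict

# Toy lookup and link rows: (stock_item_id, stock_group_id)
stock_groups = {1: "Novelty Items", 2: "Clothing", 3: "Mugs"}
item_group_links = [(1, 1), (1, 3), (2, 1), (3, 2)]

# Collect group names per item, mirroring GROUP BY stock_item_id
names_by_item = defaultdict(list)
for stock_item_id, stock_group_id in item_group_links:
    # .get() returns None for an unknown group, like the LEFT JOIN in the CTE
    names_by_item[stock_item_id].append(stock_groups.get(stock_group_id))

# Join the names with ', ' and skip missing ones, as STRING_AGG ignores NULLs
stock_group_names = {
    item_id: ", ".join(name for name in names if name is not None)
    for item_id, names in names_by_item.items()
}
print(stock_group_names)
```

Storing the groups as one string keeps `dim_products` at one row per item, at the cost of needing a substring match when filtering by a single group.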

### Dates

Create a dimension table for dates.

Select a range from one year before the earliest order date to one year after the latest confirmed delivery date.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.dim_dates AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'WITH date_range AS (
                SELECT 
                    MIN(order_date) AS min_date
                    , MAX(confirmed_delivery_time::DATE) AS max_date
                FROM 
                    sales.orders o
                JOIN 
                    sales.invoices i ON o.order_id = i.order_id
            )
            SELECT
                date_series::DATE AS date_id
                , EXTRACT(DAY FROM date_series)::INT AS day
                , EXTRACT(MONTH FROM date_series)::INT AS month
                , EXTRACT(YEAR FROM date_series)::INT AS year
                , EXTRACT(QUARTER FROM date_series)::INT AS quarter
                , EXTRACT(DOW FROM date_series)::INT + 1 AS day_of_week
                , TO_CHAR(date_series, ''FMDay'') AS day_name
                , EXTRACT(DOW FROM date_series) IN (0, 6) AS is_weekend
                , TO_CHAR(date_series, ''FMMonth'') AS month_name
            FROM 
                GENERATE_SERIES(
                    (SELECT min_date - INTERVAL ''1 year'' FROM date_range)
                    , (SELECT max_date + INTERVAL ''1 year'' FROM date_range)
                    , ''1 day''
                ) AS date_series;'
        ) AS t(
            date_id DATE
            , day INT
            , month INT
            , year INT
            , quarter INT
            , day_of_week INT
            , day_name TEXT
            , is_weekend BOOLEAN
            , month_name TEXT
        );

        ALTER TABLE analytics.dim_dates ADD PRIMARY KEY (date_id)
    """)
    conn.execute(stmt)
    conn.commit()
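
To make the calendar attributes easier to verify, here is a minimal sketch in plain Python that derives the same columns for a short range. The function names are hypothetical. It assumes the PostgreSQL `DOW` convention (Sunday = 0, Saturday = 6), so `day_of_week` runs from 1 (Sunday) to 7 (Saturday).

```python
from datetime import date, timedelta

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]


def date_row(d: date) -> dict:
    """Build one date-dimension row for the given day."""
    dow = d.isoweekday() % 7  # Monday=1..Sunday=7 becomes Sunday=0..Saturday=6
    return {
        "date_id": d,
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "day_of_week": dow + 1,
        "day_name": DAY_NAMES[dow],
        "is_weekend": dow in (0, 6),
        "month_name": MONTH_NAMES[d.month - 1],
    }


def date_dimension(start: date, end: date) -> list:
    """Generate one row per day from start to end, inclusive."""
    days = (end - start).days
    return [date_row(start + timedelta(days=offset)) for offset in range(days + 1)]


rows = date_dimension(date(2013, 1, 1), date(2013, 1, 7))
for row in rows:
    print(row["date_id"], row["day_name"], row["day_of_week"], row["is_weekend"])
```

The sketch takes the range boundaries as parameters and leaves out the one-year padding applied in the SQL.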

### Delivery Methods

Create a dimension table for delivery methods.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.dim_delivery_methods AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'SELECT 
                delivery_method_id
                , delivery_method_name
            FROM
                application.delivery_methods'
        ) AS t(
            delivery_method_id INT
            , delivery_method_name TEXT
        );

        ALTER TABLE analytics.dim_delivery_methods ADD PRIMARY KEY (delivery_method_id)
    """)
    conn.execute(stmt)
    conn.commit()

## Fact Tables

### Orders

Create a fact table for orders.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.fact_orders AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'SELECT 
                order_id
                , customer_id
                , order_date
                , expected_delivery_date
                , picking_completed_when
            FROM
                sales.orders'
        ) AS t(
            order_id INT
            , customer_id INT
            , order_date DATE
            , expected_delivery_date DATE
            , picking_completed_when TIMESTAMP
        );

        ALTER TABLE analytics.fact_orders ADD PRIMARY KEY (order_id);
        ALTER TABLE analytics.fact_orders ADD CONSTRAINT fk_orders_customer 
            FOREIGN KEY (customer_id) REFERENCES analytics.dim_customers(customer_id);
        ALTER TABLE analytics.fact_orders ADD CONSTRAINT fk_orders_date
            FOREIGN KEY (order_date) REFERENCES analytics.dim_dates(date_id);
        CREATE INDEX idx_fact_orders_customer ON analytics.fact_orders(customer_id);
        CREATE INDEX idx_fact_orders_date ON analytics.fact_orders(order_date);        
    """)
    conn.execute(stmt)
    conn.commit()

### Invoices

Create a fact table for invoices.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.fact_invoices AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'WITH invoice_totals AS (
                SELECT
                    i.invoice_id
                    , SUM(il.line_profit) AS invoice_profit
                    , SUM(il.extended_price) AS invoice_amount
                    , COUNT(il.invoice_line_id) AS invoice_lines_count
                FROM
                    sales.invoices i
                    LEFT JOIN sales.invoice_lines il ON i.invoice_id = il.invoice_id
                GROUP BY
                    i.invoice_id
            ),
            payment_totals AS (
                SELECT
                    invoice_id
                    , SUM(transaction_amount) AS paid_amount
                FROM
                    sales.customer_transactions
                WHERE
                    is_finalized = TRUE
                GROUP BY
                    invoice_id
            )
            SELECT 
                i.invoice_id
                , i.order_id
                , pt.paid_amount
                , i.invoice_date
                , i.delivery_method_id
                , i.confirmed_delivery_time
                , i.returned_delivery_data    
                , CASE WHEN i.confirmed_delivery_time IS NOT NULL THEN TRUE ELSE FALSE END AS is_delivered
                , it.invoice_profit
                , it.invoice_amount
                , it.invoice_lines_count
            FROM 
                sales.invoices i 
                LEFT JOIN invoice_totals it ON i.invoice_id = it.invoice_id
                LEFT JOIN payment_totals pt ON it.invoice_id = pt.invoice_id;'
        ) AS t(
            invoice_id INT
            , order_id INT
            , paid_amount NUMERIC
            , invoice_date DATE
            , delivery_method_id INT
            , confirmed_delivery_time TIMESTAMP
            , returned_delivery_data TEXT
            , is_delivered BOOLEAN
            , invoice_profit NUMERIC
            , invoice_amount NUMERIC
            , invoice_lines_count INT
        );

        ALTER TABLE analytics.fact_invoices ADD PRIMARY KEY (invoice_id);
        ALTER TABLE analytics.fact_invoices ADD CONSTRAINT fk_invoices_order 
            FOREIGN KEY (order_id) REFERENCES analytics.fact_orders(order_id);
        ALTER TABLE analytics.fact_invoices ADD CONSTRAINT fk_invoice_delivery 
            FOREIGN KEY (delivery_method_id) REFERENCES analytics.dim_delivery_methods(delivery_method_id);
        ALTER TABLE analytics.fact_invoices ADD CONSTRAINT fk_invoices_date
            FOREIGN KEY (invoice_date) REFERENCES analytics.dim_dates(date_id);
        CREATE INDEX idx_fact_invoices_order ON analytics.fact_invoices(order_id);
        CREATE INDEX idx_fact_invoices_date ON analytics.fact_invoices(invoice_date);
        CREATE INDEX idx_fact_invoices_delivery_status ON analytics.fact_invoices(is_delivered);
        CREATE INDEX idx_fact_invoices_delivery_method ON analytics.fact_invoices(delivery_method_id);            
    """)
    conn.execute(stmt)
    conn.commit()
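
The query above aggregates invoice lines and payments in separate CTEs before joining them to the invoice. A likely benefit of this pattern is that it avoids fan-out: if both child tables were joined directly, every line would be paired with every payment and the sums would be inflated. Below is a minimal sketch that uses an in-memory SQLite database and toy data only to illustrate the effect; whether an invoice in WWI can actually have several payment rows is not established here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoice_lines (invoice_id INT, extended_price REAL);
    CREATE TABLE customer_transactions (invoice_id INT, transaction_amount REAL);
    -- Toy data: one invoice with two lines and two payments
    INSERT INTO invoice_lines VALUES (1, 100.0), (1, 50.0);
    INSERT INTO customer_transactions VALUES (1, 90.0), (1, 60.0);
""")

# Naive approach: join both child tables first, then aggregate
naive = con.execute("""
    SELECT SUM(il.extended_price), SUM(ct.transaction_amount)
    FROM invoice_lines il
    JOIN customer_transactions ct ON il.invoice_id = ct.invoice_id
""").fetchone()

# Pre-aggregation: one row per invoice on each side before the join
safe = con.execute("""
    WITH invoice_totals AS (
        SELECT invoice_id, SUM(extended_price) AS invoice_amount
        FROM invoice_lines
        GROUP BY invoice_id
    ),
    payment_totals AS (
        SELECT invoice_id, SUM(transaction_amount) AS paid_amount
        FROM customer_transactions
        GROUP BY invoice_id
    )
    SELECT it.invoice_amount, pt.paid_amount
    FROM invoice_totals it
    LEFT JOIN payment_totals pt ON it.invoice_id = pt.invoice_id
""").fetchone()

print("naive:", naive)
print("pre-aggregated:", safe)
```

In the toy data the naive join produces four rows (2 lines x 2 payments), so both sums come out doubled, while the pre-aggregated version returns the true totals.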

### Order Lines

Create a fact table for order lines.

In [None]:
with wwi_analytics_engine.connect() as conn:
    stmt = text(f"""
        CREATE TABLE analytics.fact_order_lines AS
        SELECT * FROM dblink(
            '{dblink_conn_str}'
            , 'SELECT 
                ol.order_line_id
                , o.order_id
                , ol.stock_item_id
                , ol.quantity
                , ol.unit_price
                , il.quantity AS il_quantity
                , il.line_profit AS il_line_profit
                , il.extended_price AS il_extended_price
            FROM
                sales.orders o
                LEFT JOIN sales.order_lines ol ON o.order_id = ol.order_id
                LEFT JOIN sales.invoices i ON i.order_id = o.order_id
                LEFT JOIN sales.invoice_lines il ON i.invoice_id = il.invoice_id AND ol.stock_item_id = il.stock_item_id'
        ) AS t(
            order_line_id INT
            , order_id INT
            , stock_item_id INT
            , quantity INT
            , unit_price NUMERIC
            , il_quantity INT
            , il_line_profit NUMERIC
            , il_extended_price NUMERIC
        );

        ALTER TABLE analytics.fact_order_lines ADD PRIMARY KEY (order_line_id);
        ALTER TABLE analytics.fact_order_lines ADD CONSTRAINT fk_orderlines_order 
            FOREIGN KEY (order_id) REFERENCES analytics.fact_orders(order_id);
        ALTER TABLE analytics.fact_order_lines ADD CONSTRAINT fk_orderlines_product 
            FOREIGN KEY (stock_item_id) REFERENCES analytics.dim_products(stock_item_id);      
        CREATE INDEX idx_fact_order_lines_product ON analytics.fact_order_lines(stock_item_id);
        CREATE INDEX idx_fact_order_lines_order ON analytics.fact_order_lines(order_id);  
        CREATE INDEX idx_fact_order_lines_profit ON analytics.fact_order_lines(il_line_profit);             
        CREATE INDEX idx_fact_order_extended_price ON analytics.fact_order_lines(il_extended_price);             
    """)
    conn.execute(stmt)
    conn.commit()
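
The query above matches invoice lines to order lines by `invoice_id` and `stock_item_id`. If an invoice carried the same stock item on more than one line, the order line would be returned more than once and `ADD PRIMARY KEY (order_line_id)` would fail. Below is a minimal sketch of a pre-check, run against an in-memory SQLite database with toy data; it illustrates the check only and says nothing about whether the WWI data contains such cases.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE order_lines (order_line_id INT, order_id INT, stock_item_id INT);
    CREATE TABLE invoices (invoice_id INT, order_id INT);
    CREATE TABLE invoice_lines (invoice_line_id INT, invoice_id INT, stock_item_id INT);
    INSERT INTO order_lines VALUES (1, 10, 100), (2, 10, 200);
    INSERT INTO invoices VALUES (500, 10);
    -- Toy case: stock item 200 is invoiced on two separate lines
    INSERT INTO invoice_lines VALUES (9001, 500, 100), (9002, 500, 200), (9003, 500, 200);
""")

# Order lines that the join would return more than once
duplicates = con.execute("""
    SELECT ol.order_line_id, COUNT(*) AS row_count
    FROM order_lines ol
    LEFT JOIN invoices i ON i.order_id = ol.order_id
    LEFT JOIN invoice_lines il
        ON i.invoice_id = il.invoice_id
        AND ol.stock_item_id = il.stock_item_id
    GROUP BY ol.order_line_id
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

An empty result means the join keeps one row per order line, so the primary key can be added safely.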

Update database statistics to ensure optimal query performance.

In [None]:
%%sql
ANALYZE analytics.dim_customers;
ANALYZE analytics.dim_products;
ANALYZE analytics.dim_dates;
ANALYZE analytics.dim_delivery_methods;
ANALYZE analytics.fact_orders;
ANALYZE analytics.fact_invoices;
ANALYZE analytics.fact_order_lines;

As a result, we obtained the following schema.

<img src="assets/er_analytics.png" alt="">

# Creating Materialized Views

To optimize data loading into the dashboard, let's create materialized views.

## Orders Materialized View

Create a materialized view for all metrics at the order level.

In [None]:
%%sql
CREATE MATERIALIZED VIEW analytics.mv_orders AS
WITH invoices_agg AS (
    SELECT
        fi.order_id
        , COUNT(fi.invoice_id) AS invoices_count
        , SUM(fi.invoice_amount) AS amount
        , SUM(fi.paid_amount) AS paid_amount
        , SUM(fi.invoice_profit) AS profit
        , STRING_AGG(DISTINCT ddm.delivery_method_name, ' | ' ORDER BY ddm.delivery_method_name) AS delivery_methods
        , MIN(fi.invoice_date) AS first_invoice_date
        , MAX(fi.invoice_date) AS last_invoice_date
        , MIN(fi.confirmed_delivery_time) AS first_delivery_time
        , MAX(fi.confirmed_delivery_time) AS last_delivery_time
        , BOOL_OR(fi.is_delivered) AS is_any_delivered
        , BOOL_AND(fi.is_delivered) AS is_all_delivered
        , SUM(CASE WHEN fi.is_delivered THEN fi.invoice_amount ELSE 0 END) AS delivered_amount
        , SUM(CASE WHEN fi.is_delivered THEN fi.paid_amount ELSE 0 END) AS paid_delivered_amount
    FROM
        analytics.fact_invoices fi
        LEFT JOIN analytics.dim_delivery_methods ddm ON fi.delivery_method_id = ddm.delivery_method_id
    GROUP BY
        fi.order_id
),
order_lines_agg AS (
    SELECT 
        fol.order_id
        , COUNT(fol.order_line_id) AS order_lines_count
        , SUM(fol.quantity) AS total_quantity
        , COUNT(DISTINCT fol.stock_item_id) AS unique_products_count
        , STRING_AGG(DISTINCT dp.stock_item_name, ' | ' ORDER BY dp.stock_item_name) AS products_list
    FROM
        analytics.fact_order_lines fol
        LEFT JOIN analytics.dim_products dp 
            ON fol.stock_item_id = dp.stock_item_id
            AND dp.current_record = TRUE
    GROUP BY
        fol.order_id
),
last_delivery_attempts_per_invoice AS (
    SELECT DISTINCT ON (fi.invoice_id)
        fi.invoice_id,
        fi.order_id,
        fi.invoice_date,
        (event->>'Latitude')::numeric AS delivery_latitude,
        (event->>'Longitude')::numeric AS delivery_longitude
    FROM
        analytics.fact_invoices fi
        CROSS JOIN LATERAL jsonb_array_elements(fi.returned_delivery_data::jsonb->'Events') AS event
    WHERE 
        event->>'Event' = 'DeliveryAttempt'
        AND event->>'Latitude' IS NOT NULL
        AND event->>'Longitude' IS NOT NULL
    ORDER BY 
        fi.invoice_id, 
        (event->>'EventTime')::timestamp DESC
),
last_delivery_coordinates_per_order AS (
    SELECT DISTINCT ON (ldai.order_id)
        ldai.order_id,
        ldai.delivery_latitude AS last_delivery_latitude,
        ldai.delivery_longitude AS last_delivery_longitude
    FROM
        last_delivery_attempts_per_invoice ldai
    ORDER BY
        ldai.order_id,
        ldai.invoice_date DESC
)
SELECT
    fo.order_id
    , fo.order_date
    , fo.expected_delivery_date
    , fo.picking_completed_when
    , fo.customer_id
    , dc.customer_name
    , dc.customer_category_name
    , dc.state_province_name AS customer_state
    , dc.city_name AS customer_city
    -- Invoice
    , ia.invoices_count
    , ia.amount AS total_amount
    , ia.paid_amount
    , ia.profit
    , ia.delivery_methods
    , ia.first_invoice_date
    , ia.last_invoice_date
    , ia.first_delivery_time
    , ia.last_delivery_time
    , ia.is_any_delivered
    , ia.is_all_delivered
    , ia.delivered_amount
    , ia.paid_delivered_amount
    , ldco.last_delivery_latitude
    , ldco.last_delivery_longitude    
    -- Order_line
    , ola.order_lines_count
    , ola.total_quantity
    , ola.unique_products_count
    , ola.products_list
    -- Calculated metrics
    , ROUND((ia.paid_amount / NULLIF(ia.amount, 0)), 2) AS paid_share
    , (ia.last_delivery_time::DATE - fo.order_date::DATE) AS days_to_delivery
    , (ia.last_delivery_time::DATE - fo.expected_delivery_date::DATE) AS delivery_delay_days
    , (fo.picking_completed_when::DATE - fo.order_date::DATE) AS days_to_picking
    , (ia.last_delivery_time::DATE - fo.picking_completed_when::DATE) AS days_to_deliver_after_picking
    , CASE
        WHEN ia.last_delivery_time IS NULL OR fo.expected_delivery_date IS NULL THEN NULL  
        WHEN ia.last_delivery_time::DATE > fo.expected_delivery_date::DATE THEN TRUE 
        ELSE FALSE 
    END AS is_late_delivery
FROM
    analytics.fact_orders fo
    LEFT JOIN analytics.dim_customers dc 
    	ON fo.customer_id = dc.customer_id
    	AND dc.current_record = TRUE
    LEFT JOIN invoices_agg ia ON fo.order_id = ia.order_id
    LEFT JOIN order_lines_agg ola ON fo.order_id = ola.order_id
    LEFT JOIN last_delivery_coordinates_per_order ldco ON fo.order_id = ldco.order_id;
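
The `last_delivery_attempts_per_invoice` CTE above unpacks the `Events` array of `returned_delivery_data` and keeps the most recent `DeliveryAttempt` event that has coordinates. The same selection logic can be sketched in plain Python, which may help when checking the view's output for a single invoice. The sample payload is hypothetical and contains only the fields the SQL reads; `last_delivery_attempt_coords` is an illustrative helper, not part of the pipeline.

```python
import json
from datetime import datetime
from typing import Optional, Tuple


def last_delivery_attempt_coords(returned_delivery_data: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) of the latest DeliveryAttempt event, or None."""
    if not returned_delivery_data:
        return None
    events = json.loads(returned_delivery_data).get("Events") or []
    # Same filter as the WHERE clause: delivery attempts that carry coordinates
    attempts = [
        e for e in events
        if e.get("Event") == "DeliveryAttempt"
        and e.get("Latitude") is not None
        and e.get("Longitude") is not None
    ]
    if not attempts:
        return None
    # Same idea as ORDER BY EventTime DESC with DISTINCT ON: keep the newest attempt
    latest = max(attempts, key=lambda e: datetime.fromisoformat(e["EventTime"]))
    return float(latest["Latitude"]), float(latest["Longitude"])


# Hypothetical payload (assumption: timestamps are plain ISO 8601 strings)
sample = json.dumps({
    "Events": [
        {"Event": "Ready for collection", "EventTime": "2016-01-01T12:00:00"},
        {"Event": "DeliveryAttempt", "EventTime": "2016-01-02T07:05:00",
         "Latitude": 41.36, "Longitude": -81.05},
        {"Event": "DeliveryAttempt", "EventTime": "2016-01-03T09:30:00",
         "Latitude": 41.50, "Longitude": -81.69},
    ]
})
print(last_delivery_attempt_coords(sample))
```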

Create indexes on commonly filtered columns to speed up dashboard queries.

In [None]:
%%sql
CREATE INDEX idx_mv_orders_order_id ON analytics.mv_orders(order_id);
CREATE INDEX idx_mv_orders_customer_id ON analytics.mv_orders(customer_id);
CREATE INDEX idx_mv_orders_order_date ON analytics.mv_orders(order_date);
CREATE INDEX idx_mv_orders_customer_state ON analytics.mv_orders(customer_state);
CREATE INDEX idx_mv_orders_is_late_delivery ON analytics.mv_orders(is_late_delivery);
CREATE INDEX idx_mv_orders_is_all_delivered ON analytics.mv_orders(is_all_delivered);

## Products Materialized View

Create a materialized view for all metrics at the product level.

In [None]:
%%sql
CREATE MATERIALIZED VIEW analytics.mv_products AS
WITH invoices_agg AS (
    SELECT
        fi.order_id
        , BOOL_OR(fi.is_delivered) AS is_any_delivered
    FROM
        analytics.fact_invoices fi
    GROUP BY
        fi.order_id
)
SELECT
	fo.order_id
	, ia.is_any_delivered
	, fol.order_line_id
	, fol.stock_item_id
	, dp.stock_item_name
	, dp.stock_group_names
	, dp.color_name
	, dp.unit_package_type_name
	, dp.outer_package_type_name
	, dp.brand
	, dp.size
	, fo.order_date
	, fol.quantity
	, fol.unit_price
	, fol.il_quantity
	, fol.il_extended_price
	, fol.il_line_profit
	, dp.brand IS NOT NULL AS has_brand
	, dp.size IS NOT NULL AS has_size
FROM
	analytics.fact_orders fo 
	LEFT JOIN invoices_agg ia ON fo.order_id = ia.order_id
	LEFT JOIN analytics.fact_order_lines fol ON fo.order_id = fol.order_id
	LEFT JOIN analytics.dim_products dp 
		ON fol.stock_item_id = dp.stock_item_id
		AND dp.current_record = TRUE;

Create indexes on commonly filtered columns to speed up dashboard queries.

In [None]:
%%sql
CREATE INDEX idx_mv_products_id ON analytics.mv_products(stock_item_id);
CREATE INDEX idx_mv_products_name ON analytics.mv_products(stock_item_name);
CREATE INDEX idx_mv_products_brand ON analytics.mv_products(brand);
CREATE INDEX idx_mv_products_group ON analytics.mv_products(stock_group_names);

# ETL Process (Airflow DAG)

Create an Airflow DAG that runs daily and loads the previous day's data into the analytics database.

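To track changes in dimensions, we will use the Slowly Changing Dimension (SCD) Type 2 approach: when a tracked attribute changes, the current row is closed (`valid_to` is set and `current_record` becomes `FALSE`) and a new current row is inserted, so the history is preserved.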
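
Before the full DAG, the SCD Type 2 idea can be illustrated on in-memory rows. This is a minimal sketch, not the pipeline's implementation: `apply_scd2` and the sample rows are hypothetical, and only the column names `valid_from`, `valid_to`, and `current_record` follow the dimension tables used in this project.

```python
from datetime import date, timedelta


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new list of dimension rows after applying SCD Type 2 to `incoming`."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["current_record"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing changed: keep the current version
        if old is not None:
            # Close the previous version instead of overwriting it
            old["valid_to"] = today - timedelta(days=1)
            old["current_record"] = False
        # Insert the new current version (also covers brand-new keys)
        result.append({**{c: new[c] for c in [key, *tracked]},
                       "valid_from": today, "valid_to": None, "current_record": True})
    return result


dim = [{"customer_id": 1, "city_name": "Akron", "valid_from": date(2025, 1, 1),
        "valid_to": None, "current_record": True}]
src = [{"customer_id": 1, "city_name": "Dayton"},   # changed attribute
       {"customer_id": 2, "city_name": "Boise"}]    # brand-new key

rows = apply_scd2(dim, src, "customer_id", ["city_name"], date(2025, 9, 26))
for row in rows:
    print(row)
```

Running the function a second time with the same source rows adds nothing, because every incoming row then matches its current version.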

In [None]:
from datetime import datetime, timedelta
from airflow.sdk import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
import pandas as pd
import numpy as np
import logging
from typing import Union
from sqlalchemy import text
    
# Constants
SCHEMA_NAME = 'analytics'
DIM_CUSTOMERS_TABLE = 'dim_customers'
DIM_PRODUCTS_TABLE = 'dim_products'

# SCD Configuration for Type 2 dimensions
SCD_CONFIG = {
    'customers': {
        'id_column': 'customer_id',
        'change_columns': ['customer_name', 'customer_category_name', 'city_name', 'state_province_name'],
        'target_table': DIM_CUSTOMERS_TABLE
    },
    'products': {
        'id_column': 'stock_item_id',
        'change_columns': ['stock_item_name', 'stock_group_names', 'color_name',
                          'unit_package_type_name', 'outer_package_type_name', 'brand', 'size'],
        'target_table': DIM_PRODUCTS_TABLE
    }
}

# Database connections
SRC_DB_ID = 'neon_wwi'
DST_DB_ID = 'neon_analytics'
src_hook = PostgresHook(postgres_conn_id=SRC_DB_ID)
dst_hook = PostgresHook(postgres_conn_id=DST_DB_ID)

logger = logging.getLogger(__name__)


default_args = {
    'owner': 'analytics_team',
    'depends_on_past': False,
    'start_date': datetime(2025, 9, 25),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email_on_retry': False,
}

dag_config = {
    'default_args': default_args,
    'description': 'Daily ETL pipeline for loading data from operational DB to analytics warehouse',
    'schedule': '0 3 * * *', # Runs at 3 AM daily
    'catchup': False,
    'tags': ['analytics', 'etl'],
    'max_active_runs': 1,
    'doc_md': """\
    # Analytics Daily ETL Pipeline

    This DAG performs daily incremental load from operational database (WWI) to analytics data warehouse.

    ## Main Tasks:
    - Extract daily orders, invoices, and order lines
    - Process SCD Type 2 for customer and product dimensions
    - Load fact and dimension tables

    ## Dependencies:
    - Requires Postgres connections: `neon_wwi` and `neon_analytics`
    """
}

QUERY_EXTRACT_ORDERS = text("""
    SELECT
        order_id
        , customer_id
        , order_date
        , expected_delivery_date
        , picking_completed_when
    FROM
        sales.orders
    WHERE
        order_date = :yesterday_ds
""")

QUERY_EXTRACT_INVOICES = text("""
    WITH invoice_totals AS (
        SELECT
            i.invoice_id
            , SUM(il.line_profit) AS invoice_profit
            , SUM(il.extended_price) AS invoice_amount
            , COUNT(il.invoice_line_id) AS invoice_lines_count
        FROM
            sales.invoices i
        LEFT JOIN sales.invoice_lines il ON i.invoice_id = il.invoice_id
        WHERE
            i.invoice_date = :yesterday_ds
        GROUP BY
            i.invoice_id
    ),
    payment_totals AS (
        SELECT
            invoice_id
            , SUM(transaction_amount) AS paid_amount
        FROM
            sales.customer_transactions
        WHERE
            is_finalized = TRUE
            AND invoice_id IN (
                SELECT invoice_id
                FROM sales.invoices
                WHERE invoice_date = :yesterday_ds
            )
        GROUP BY
            invoice_id
    )
    SELECT
        i.invoice_id
        , i.order_id
        , COALESCE(pt.paid_amount, 0) AS paid_amount
        , i.invoice_date
        , i.delivery_method_id
        , i.confirmed_delivery_time
        , i.returned_delivery_data
        , CASE WHEN i.confirmed_delivery_time IS NOT NULL THEN TRUE ELSE FALSE END AS is_delivered
        , COALESCE(it.invoice_profit, 0) AS invoice_profit
        , COALESCE(it.invoice_amount, 0) AS invoice_amount
        , COALESCE(it.invoice_lines_count, 0) AS invoice_lines_count
    FROM
        sales.invoices i
    LEFT JOIN invoice_totals it ON i.invoice_id = it.invoice_id
    LEFT JOIN payment_totals pt ON i.invoice_id = pt.invoice_id
    WHERE
        i.invoice_date = :yesterday_ds
""")

QUERY_EXTRACT_ORDER_LINES = text("""
    SELECT
        ol.order_line_id
        , o.order_id
        , ol.stock_item_id
        , ol.quantity
        , ol.unit_price
        , il.quantity AS il_quantity
        , il.line_profit AS il_line_profit
        , il.extended_price AS il_extended_price
    FROM
        sales.orders o
        LEFT JOIN sales.order_lines ol ON o.order_id = ol.order_id
        LEFT JOIN sales.invoices i ON i.order_id = o.order_id
        LEFT JOIN sales.invoice_lines il ON i.invoice_id = il.invoice_id AND ol.stock_item_id = il.stock_item_id
    WHERE
        o.order_date = :yesterday_ds
""")

QUERY_EXTRACT_CUSTOMERS_SRC = """
    SELECT
        c.customer_id
        , c.customer_name
        , cc.customer_category_name
        , ci.city_name
        , sp.state_province_name
    FROM
        sales.customers c
        LEFT JOIN sales.customer_categories cc
            ON c.customer_category_id = cc.customer_category_id
        LEFT JOIN application.cities ci
            ON c.delivery_city_id = ci.city_id
        LEFT JOIN application.state_provinces sp
            ON ci.state_province_id = sp.state_province_id         
"""

QUERY_EXTRACT_CUSTOMERS_DST = """
    SELECT
        *
    FROM
        analytics.dim_customers
    WHERE
        current_record = TRUE
"""

QUERY_EXTRACT_PRODUCTS_SRC = """
    WITH stock_groups_agg AS (
        SELECT
            sisg.stock_item_id
            , STRING_AGG(sg.stock_group_name, ', ') AS stock_group_names
        FROM
            warehouse.stock_item_stock_groups sisg
            LEFT JOIN warehouse.stock_groups sg
                ON sisg.stock_group_id = sg.stock_group_id
        GROUP BY
            sisg.stock_item_id
    )
    SELECT
        si.stock_item_id
        , si.stock_item_name
        , sg.stock_group_names
        , c.color_name
        , ptu.package_type_name AS unit_package_type_name
        , pto.package_type_name AS outer_package_type_name
        , si.brand
        , si.size
    FROM
        warehouse.stock_items si
        LEFT JOIN stock_groups_agg sg
            ON si.stock_item_id = sg.stock_item_id
        LEFT JOIN warehouse.package_types ptu
            ON si.unit_package_id = ptu.package_type_id
        LEFT JOIN warehouse.package_types pto
            ON si.outer_package_id = pto.package_type_id
        LEFT JOIN warehouse.colors c
            ON c.color_id = si.color_id
"""

QUERY_EXTRACT_PRODUCTS_DST = """
    SELECT
        *
    FROM
        analytics.dim_products
    WHERE
        current_record = TRUE
"""

def handle_etl_failure(context):
    """Log details of a failed ETL task (used as on_failure_callback)."""
    task_instance = context['task_instance']
    exception = context.get('exception')
    # Airflow 3 no longer provides `execution_date` in the context; use `logical_date`
    logical_date = context.get('logical_date')

    logger.error(f"ETL Task {task_instance.task_id} failed on {logical_date}")
    logger.error(f"Exception: {exception}")
    logger.error(f"Task try number: {task_instance.try_number}")

    # Email notification is governed by `email_on_failure` in default_args.
    # TaskInstance has no `email_on_failure()` method, so no email is sent from this callback.

def safe_column_comparison(col1: pd.Series, col2: pd.Series) -> pd.Series:
    """
    Safe column comparison handling NULL values appropriately
    Returns boolean series indicating where values differ
    """
    return (col1 != col2) & ~(col1.isna() & col2.isna())

def detect_scd_type2_changes(source_df: pd.DataFrame, target_df: pd.DataFrame, entity_type: str) -> Union[pd.DataFrame, None]:
    """Detect changes for SCD Type 2 dimension processing"""
    if source_df.empty:
        logger.info("Source DataFrame is empty - no changes to process")
        return None

    config = SCD_CONFIG[entity_type]
    id_col = config['id_column']
    change_cols = config['change_columns']

    # Check duplicates
    if source_df.duplicated(subset=[id_col]).any():
        logger.warning(f"Duplicate IDs found in source data for {entity_type}")

    # Rename on copies so the caller's DataFrames are not mutated
    source_df = source_df.rename(columns={id_col: f'{id_col}_src'})
    target_df = target_df.rename(columns={id_col: f'{id_col}_dst'})

    # Merge data
    merged = source_df.merge(
        target_df,
        left_on=f'{id_col}_src',
        right_on=f'{id_col}_dst',
        suffixes=['_src', '_dst'],
        how='left'
    )

    # Build conditions for changes
    change_conditions = [merged[f'{id_col}_dst'].isna()]
    for col in change_cols:
        change_conditions.append(
            safe_column_comparison(merged[f'{col}_src'], merged[f'{col}_dst'])
        )

    # Combine conditions
    changes_mask = np.any(change_conditions, axis=0)
    changes = merged[changes_mask]

    if changes.empty:
        logger.info(f"No dimension changes detected for {id_col}")
        return None

    # Prepare result
    result_columns = {f'{id_col}_src': id_col}
    for col in change_cols:
        result_columns[f'{col}_src'] = col

    result = changes[list(result_columns.keys())].rename(columns=result_columns)

    # Add SCD metadata
    result['valid_from'] = datetime.now().date()
    result['valid_to'] = None
    result['current_record'] = True
    logger.info(f"Detected {len(result)} changes for {id_col}")
    return result

def validate_scd_data(df: pd.DataFrame, entity_type: str) -> bool:
    """Validate dimension data before loading"""
    if df is None:
        return True

    config = SCD_CONFIG[entity_type]
    id_col = config['id_column']

    # Check required columns
    required_columns = [id_col, 'valid_from', 'valid_to', 'current_record']
    missing_columns = set(required_columns) - set(df.columns)
    if missing_columns:
        logger.error(f"Missing required columns: {missing_columns}")
        return False

    # Validate primary key
    if df[id_col].isna().any():
        logger.error(f"NULL values found in primary key column: {id_col}")
        return False

    # Check for duplicates
    duplicate_ids = df[df.duplicated(subset=[id_col])]
    if not duplicate_ids.empty:
        logger.warning(f"Found {len(duplicate_ids)} duplicate IDs in source data")

    return True

def load_scd_type_2(changed_data: pd.DataFrame, entity_type: str) -> None:
    """Load dimension changes using SCD Type 2 approach"""
    if changed_data is None or changed_data.empty:
        logger.info(f"No changes to load for {entity_type}")
        return

    config = SCD_CONFIG[entity_type]
    id_col = config['id_column']
    table_name = config['target_table']

    engine = dst_hook.get_sqlalchemy_engine()
    try:
        with engine.begin() as connection:
            # Update existing records
            ids = changed_data[id_col].tolist()
            if ids:
                update_sql = text(f"""
                    UPDATE {SCHEMA_NAME}.{table_name}
                    SET
                        valid_to = CURRENT_DATE - INTERVAL '1 day'
                        , current_record = FALSE
                    WHERE
                        {id_col} = ANY(:ids)
                        AND current_record = TRUE
                """)
                
                connection.execute(update_sql, {'ids': ids})
                # Insert new records
                changed_data.to_sql(
                    table_name,
                    connection,
                    schema=SCHEMA_NAME,
                    if_exists='append',
                    index=False,
                    method='multi'
                )

                logger.info(f"Successfully loaded {len(ids)} new records for {entity_type}")

    except Exception as e:
        logger.error(f"Failed to load {entity_type} dimension: {str(e)}")
        raise

@dag(**dag_config)
def wwi_to_analytics_daily():

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_orders(**context) -> pd.DataFrame:
        """Extracts daily orders data from source system for incremental processing."""
        logical_date = context['logical_date']
        yesterday_ds = (logical_date - timedelta(days=1)).strftime('%Y-%m-%d')
        result = src_hook.get_pandas_df(QUERY_EXTRACT_ORDERS, parameters={'yesterday_ds': yesterday_ds})
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_invoices(**context) -> pd.DataFrame:
        """Extracts invoice information with calculated totals and payment status for the processing date."""
        logical_date = context['logical_date']
        yesterday_ds = (logical_date - timedelta(days=1)).strftime('%Y-%m-%d')        
        result = src_hook.get_pandas_df(QUERY_EXTRACT_INVOICES, parameters={'yesterday_ds': yesterday_ds})
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_order_lines(**context) -> pd.DataFrame:
        """Extracts order line details with associated invoice information for daily sales analysis."""
        logical_date = context['logical_date']
        yesterday_ds = (logical_date - timedelta(days=1)).strftime('%Y-%m-%d')           
        result = src_hook.get_pandas_df(QUERY_EXTRACT_ORDER_LINES, parameters={'yesterday_ds': yesterday_ds})
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_customers_src() -> pd.DataFrame:
        """Extracts current customer master data including categories and geographical information from source."""       
        result = src_hook.get_pandas_df(QUERY_EXTRACT_CUSTOMERS_SRC)
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_customers_dst() -> pd.DataFrame:
        """Extracts existing customer dimension records from analytics warehouse for change detection."""
        result = dst_hook.get_pandas_df(QUERY_EXTRACT_CUSTOMERS_DST)
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_products_src() -> pd.DataFrame:
        """Extracts product master data with attributes and classifications from operational database."""
        result = src_hook.get_pandas_df(QUERY_EXTRACT_PRODUCTS_SRC)
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def extract_products_dst() -> pd.DataFrame:
        """Extracts current product dimension records from data warehouse for SCD comparison."""
        result = dst_hook.get_pandas_df(QUERY_EXTRACT_PRODUCTS_DST)
        return result

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def validate_source_data(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
        """Basic data validation"""
        if df.empty:
            logger.error(f"Empty DataFrame for {table_name}")
            raise ValueError(f"Empty data for {table_name}")

        if df.isnull().any().any():
            null_cols = df.columns[df.isnull().any()].tolist()
            logger.warning(f"NULL values found in columns: {null_cols}")

        if df.duplicated().any():
            logger.warning(f"Found {df.duplicated().sum()} duplicates in {table_name}")

        return df

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def transform_check_changed_customers(customers_src: pd.DataFrame, customers_dst: pd.DataFrame) -> Union[pd.DataFrame, None]:
        """Identifies customer dimension changes using SCD Type 2 logic for incremental updates."""
        return detect_scd_type2_changes(customers_src, customers_dst, 'customers')

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def transform_check_changed_products(products_src: pd.DataFrame, products_dst: pd.DataFrame) -> Union[pd.DataFrame, None]:
        """Identifies product dimension modifications requiring new version creation in the warehouse."""
        return detect_scd_type2_changes(products_src, products_dst, 'products')

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def load_fact_orders(df_orders: pd.DataFrame) -> None:
        """Load orders fact table"""
        if df_orders.empty:
            logger.info("No orders data to load")
            return
        try:
            engine = dst_hook.get_sqlalchemy_engine()
            df_orders.to_sql(
                'fact_orders',
                engine,
                schema=SCHEMA_NAME,
                if_exists='append',
                index=False,
                method='multi'
            )
            logger.info(f"Loaded {len(df_orders)} orders")
        except Exception as e:
            logger.error(f"Failed to load orders: {str(e)}")
            raise
    
    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def load_fact_invoices(df_invoices: pd.DataFrame) -> None:
        """Load invoices fact table"""
        if df_invoices.empty:
            logger.info("No invoices data to load")
            return
        try:
            engine = dst_hook.get_sqlalchemy_engine()
            df_invoices.to_sql(
                'fact_invoices',
                engine,
                schema=SCHEMA_NAME,
                if_exists='append',
                index=False,
                method='multi'
            )
            logger.info(f"Loaded {len(df_invoices)} invoices")
        except Exception as e:
            logger.error(f"Failed to load invoices: {str(e)}")
            raise
        
    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def load_fact_order_lines(df_order_lines: pd.DataFrame) -> None:
        """Load order lines fact table"""
        if df_order_lines.empty:
            logger.info("No order lines data to load")
            return
        try:
            engine = dst_hook.get_sqlalchemy_engine()
            df_order_lines.to_sql(
                'fact_order_lines',
                engine,
                schema=SCHEMA_NAME,
                if_exists='append',
                index=False,
                method='multi'
            )
            logger.info(f"Loaded {len(df_order_lines)} order lines")
        except Exception as e:
            logger.error(f"Failed to load order lines: {str(e)}")
            raise
        
    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def load_scd_type_2_customers(changed_customers: pd.DataFrame) -> None:
        """Loads customer dimension changes by closing old records and inserting new versions with validation."""
        if validate_scd_data(changed_customers, 'customers'):
            load_scd_type_2(changed_customers, 'customers')

    @task(
        retries=3,
        retry_delay=timedelta(minutes=5),
        on_failure_callback=handle_etl_failure
    )
    def load_scd_type_2_products(changed_products: pd.DataFrame) -> None:
        """Applies product dimension updates using SCD Type 2 pattern with data quality checks."""
        if validate_scd_data(changed_products, 'products'):
            load_scd_type_2(changed_products, 'products')

    # ==========================================================================
    # WORKFLOW
    # ==========================================================================

    # Extract
    df_orders = extract_orders()
    df_invoices = extract_invoices()
    df_order_lines = extract_order_lines()

    df_customers_src = extract_customers_src()
    df_customers_dst = extract_customers_dst()
    df_products_src = extract_products_src()
    df_products_dst = extract_products_dst()

    # Validate
    df_orders_valid = validate_source_data(df_orders, "orders")
    df_invoices_valid = validate_source_data(df_invoices, "invoices")
    df_order_lines_valid = validate_source_data(df_order_lines, "order_lines")

    # Transform
    df_customers_changes = transform_check_changed_customers(df_customers_src, df_customers_dst)
    df_products_changes = transform_check_changed_products(df_products_src, df_products_dst)

    # Load facts
    load_fact_orders(df_orders_valid)
    load_fact_invoices(df_invoices_valid)
    load_fact_order_lines(df_order_lines_valid)
    
    # Load dimensions
    load_scd_type_2_customers(df_customers_changes)
    load_scd_type_2_products(df_products_changes)

# ==========================================================================
# Instantiate the DAG
# ==========================================================================

wwi_to_analytics_daily()
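
The DAG above appends new rows to the fact and dimension tables, but PostgreSQL materialized views keep their stored result until they are explicitly refreshed. One possible follow-up step, sketched under the assumption that both views should be refreshed after every load, is shown below. `refresh_materialized_views` and the `execute` callable are hypothetical names; only the view names come from this notebook.

```python
from typing import Callable, Iterable, List

# View names come from the sections above; refreshing after each load is an assumption
MATERIALIZED_VIEWS = ("analytics.mv_orders", "analytics.mv_products")


def refresh_materialized_views(
    execute: Callable[[str], None],
    views: Iterable[str] = MATERIALIZED_VIEWS,
) -> List[str]:
    """Run REFRESH MATERIALIZED VIEW for each view and return the statements issued."""
    statements = [f"REFRESH MATERIALIZED VIEW {view};" for view in views]
    for statement in statements:
        execute(statement)
    return statements


# Dry run: collect the SQL instead of sending it to a database
issued: List[str] = []
refresh_materialized_views(issued.append)
print("\n".join(issued))
```

In the DAG this could be wired in as a final task that runs after all load tasks and passes something like `dst_hook.run` as `execute`. That wiring is an assumption and is not part of the DAG above.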

# Dashboard Canvas

## Users and Context

**Who will use it?**  
- **Top Management**  
- **Sales Managers**  
- **Procurement and Logistics Specialists**  
- **Financial Analysts**  

**Roles, Tasks, and Interaction Context**  

| Role                     | Tasks                                                                 | Usage Context                                                                 | Frequency       | Devices          | Interaction Time | Special Requirements                     |
|--------------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------|---------------|---------------------|------------------|----------------------------------------|
| **Top Management**        | Strategic decisions, company KPI evaluation                          | Meetings, weekly reports, urgent requests                                       | 1-2 times/week | Laptop, projector   | 10-15 min            | High-level metrics, mobile access |
| **Sales Managers** | Sales dynamics analysis, customer work                           | Daily monitoring, negotiation preparation                                       | Daily     | PC/laptop, tablet | 5-10 min             | Customer/region breakdown       |
| **Procurement Specialists** | Inventory control, procurement planning                              | Operational work, pre-order checks                                    | Daily     | PC                  | 5-7 min              | Supplier data integration       |
| **Financial Analysts**  | Profitability deep analysis, forecasting                        | Monthly report preparation, ad-hoc analysis                                            | 3-5 times/week   | PC (large screens) | 15-30 min            | Data export capability            |

**Technical Context:**  
- **Mobile version** is critical for CEO (viewing on the go).  
- **Printable report versions** are needed for meetings (PDF export).  
- **Email distribution** is not required (the dashboard is always available online, and its data is refreshed daily by the ETL pipeline).  
 

## Task Understanding

**Main Dashboard Goal:**

Create a single source of truth for monitoring WWI's key business metrics that:
- Eliminates manual data collection from different systems
- Reduces decision-making time
- Enables timely identification of problems and opportunities

**Why is this task important now?**
- Business growth has made data analysis more complex (Excel reports used to be sufficient)
- An increased number of regional branches requires a unified control system
- Competition demands a faster response to market changes

**User Expectations:**

| Role | Main Expectations | Business Value |
|------|------------------|-----------------|
| **Top Management** | "I want to see 3 key metrics in 30 seconds" | Quick business status assessment |
| **Sales Managers** | "Need to quickly identify problem regions" | Sales increase through targeted interventions |
| **Logistics Specialists** | "Automatic alerts for critical stock levels required" | Preventing losses from stockouts |

**How the dashboard will change company operations:**
- Reduces weekly report preparation time
- Enables identification of loss-making items
- Increases response speed to market changes

**Success Criteria:**
- 90% of top management uses the dashboard as the primary data source
- Decision-making time decreases
- The number of manual requests to the IT department decreases

**When do we need to develop it?**
- Development timeline: 2 weeks

**How will we measure effectiveness?**
- Weekly unique users
- Average dashboard usage time
- Reduction in manual report requests to IT
- Feedback from key stakeholders

## Metrics, Dimensions, and Data Sources

**Order-level metrics**
- Number of orders
- Order amount
- Line profit
- Average order value
- Average number of line items per order
- Average number of products per order
- Average product price in order
- Order weight
- Number of categories per order

**Order-level dimensions**
- Time (year, month, day of week, time of day, hour)
- Top 3 packaging types/payment types/delivery methods

**Customer-level metrics**
- Number of customers
- Number of new customers
- New customer percentage
- Purchase amount
- Average order value
- Average number of line items per order
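
The customer counts above depend on how a "new" customer is defined. A minimal sketch, assuming a customer is new in a period when their first-ever order falls inside it, could look like this. The function name and the sample rows are hypothetical.

```python
from datetime import date


def customer_metrics(orders, period_start, period_end):
    """Customer counts for orders with period_start <= order_date < period_end."""
    # Earliest order date per customer across the whole history
    first_order = {}
    for o in orders:
        cid, d = o["customer_id"], o["order_date"]
        first_order[cid] = min(d, first_order.get(cid, d))

    active = {o["customer_id"] for o in orders
              if period_start <= o["order_date"] < period_end}
    # Assumption: "new" means the first-ever order is inside the period
    new = {cid for cid in active if first_order[cid] >= period_start}
    share = round(100 * len(new) / len(active), 1) if active else None
    return {"customers": len(active), "new_customers": len(new),
            "new_customer_pct": share}


orders = [
    {"customer_id": 1, "order_date": date(2016, 4, 20)},
    {"customer_id": 1, "order_date": date(2016, 5, 3)},
    {"customer_id": 2, "order_date": date(2016, 5, 10)},
    {"customer_id": 3, "order_date": date(2016, 5, 12)},
    {"customer_id": 3, "order_date": date(2016, 5, 30)},
]
print(customer_metrics(orders, date(2016, 5, 1), date(2016, 6, 1)))
```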

**Customer-level dimensions**
- Customer category
- Customer state
- Customer city

**Product-level metrics**
- Number of units sold
- Total product sales amount
- Average sold product price
- Average number of products per order

**Product-level dimensions**
- Product category
- Product internal packaging type
- Product external packaging type
- Product colors

**Delivery-level metrics**
- Delivery time
- Delivery delay/advance
- Order picking time
- Percentage of delayed orders
- Picking time share of total delivery time
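
These delivery metrics mirror the date arithmetic in the Orders Materialized View (`days_to_delivery`, `delivery_delay_days`, `days_to_picking`, `is_late_delivery`). The sketch below reproduces that logic in plain Python on hypothetical rows. The `picking_share` formula (picking days divided by total delivery days) is an assumption about how "picking time share" would be calculated.

```python
from datetime import date


def delivery_metrics(order):
    """Per-order delivery metrics in whole days; None when a date is missing."""
    def days(later, earlier):
        return (later - earlier).days if later and earlier else None

    to_delivery = days(order["delivered"], order["order_date"])
    to_picking = days(order["picked"], order["order_date"])
    delay = days(order["delivered"], order["expected"])
    return {
        "days_to_delivery": to_delivery,
        "delivery_delay_days": delay,  # negative means delivered early
        "days_to_picking": to_picking,
        "is_late": None if delay is None else delay > 0,
        "picking_share": (round(to_picking / to_delivery, 2)
                          if to_picking is not None and to_delivery else None),
    }


orders = [
    {"order_date": date(2016, 5, 2), "picked": date(2016, 5, 3),
     "expected": date(2016, 5, 4), "delivered": date(2016, 5, 6)},
    {"order_date": date(2016, 5, 2), "picked": date(2016, 5, 2),
     "expected": date(2016, 5, 5), "delivered": date(2016, 5, 4)},
    {"order_date": date(2016, 5, 3), "picked": None,
     "expected": date(2016, 5, 6), "delivered": None},
]
metrics = [delivery_metrics(o) for o in orders]

# Percentage of delayed orders, ignoring orders that are not delivered yet
known = [m["is_late"] for m in metrics if m["is_late"] is not None]
late_pct = round(100 * sum(known) / len(known), 1)
print(metrics[0], late_pct)
```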

**Delivery-level dimensions**
- Whether order was delayed

**Data Sources**

- Orders Materialized View
- Products Materialized View

## Questions and Business Decisions

- Sales Page
    - What are the values for the last month, what is the trend over the last year, and what is the YoY change for the following metrics:
        - Revenue
        - Profit
        - Number of sales
        - Average order value
    - What are the metric values by states and cities?
    - What are the metric values by customer category?
    - What are the metric values by order delay status?
- Customers Page
    - What are the values for the last month, what is the trend over the last year, and what is the YoY change for the following metrics:
        - Number of customers
        - ARPPU (Average Revenue Per Paying User)
        - APC (Average Payment Count)
        - Average number of products per order
    - What are the metric values by states and cities?
    - What are the metric values by customer category?
- Products Page
    - What are the values for the last month, what is the trend over the last year, and what is the YoY change for the following metrics:
        - Average revenue per product
        - Average profit per product
        - Average number of units sold per product
        - Average product price
    - What are the metric values by product categories?
    - What are the metric values by external and internal packaging types?
- Delivery Page
    - What are the values for the last month, what is the trend over the last year, and what is the YoY change for the following metrics:
        - Average delivery time
        - Percentage of delayed orders
        - Average delivery delay
        - Average order picking time
    - What are the metric values by states and cities?
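
Each page asks for the last month's value and its year-over-year (YoY) change. A minimal sketch of that calculation for a summed metric such as revenue follows, assuming YoY compares a month with the same calendar month one year earlier. The helper names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(rows):
    """Sum `amount` per (year, month) of `order_date`."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["order_date"].year, row["order_date"].month)] += row["amount"]
    return dict(totals)


def yoy_change_pct(totals, year, month):
    """Percent change versus the same month one year earlier; None if no baseline."""
    current, previous = totals.get((year, month)), totals.get((year - 1, month))
    if current is None or not previous:
        return None
    return round(100 * (current - previous) / previous, 1)


rows = [
    {"order_date": date(2015, 5, 11), "amount": 800.0},
    {"order_date": date(2015, 5, 25), "amount": 200.0},
    {"order_date": date(2016, 4, 7), "amount": 900.0},
    {"order_date": date(2016, 5, 9), "amount": 1150.0},
]
totals = monthly_totals(rows)
print(totals[(2016, 5)], yoy_change_pct(totals, 2016, 5))
```

Returning `None` when the baseline month is missing or zero avoids showing a misleading percentage on the KPI card.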

## Visualizations for Questions

- Sales Page
    - KPI cards with sparklines for:
        - Revenue
        - Profit
        - Number of sales
        - Average order value
    - Table with all metrics by states and cities with bar indicators for values
    - Table with all metrics by delivery delay status with bar indicators for values
    - Table with all metrics by customer category with bar indicators for values

- Customers Page
    - KPI cards with sparklines for:
        - Number of customers
        - ARPPU
        - APC
        - Average number of products per order
    - Table with all metrics by states and cities with bar indicators for values
    - Table with all metrics by customer category with bar indicators for values

- Products Page
    - KPI cards with sparklines for:
        - Average revenue per product
        - Average profit per product
        - Average number of units sold per product
        - Average product price
    - Table with all metrics by product categories with bar indicators for values
    - Table with all metrics by external and internal packaging types with bar indicators for values

- Delivery Page
    - KPI cards with sparklines for:
        - Average delivery time
        - Percentage of delayed orders
        - Average delivery delay
        - Average order picking time
    - Table with all metrics by states and cities with bar indicators for values

# Dashboard Mockup

Based on the compiled Dashboard Canvas, we will create a dashboard mockup.

**Tab Sales**

<img src="assets/dashboard_mockup_tab_sales.png" alt="">

**Tab Customers**

<img src="assets/dashboard_mockup_tab_customers.png" alt="">

**Tab Products**

<img src="assets/dashboard_mockup_tab_products.png" alt="">

**Tab Delivery & Logistics**

<img src="assets/dashboard_mockup_tab_delivery.png" alt="">

# Compiling Brief Documentation for the Dashboard

We will write brief documentation for the dashboard, which will be placed on a separate tab of the dashboard.

## General Information

The WWI Business Performance Overview Dashboard provides comprehensive insights into sales performance, customer behavior, product analytics, and delivery operations within the Wide World Importers company. This tool enables data-driven decision making across multiple business dimensions.

**Dashboard Owner:** Pavel Grigoryev

**Data Coverage:** January 2013 - May 2016

**Abbreviations:**

| Abbreviation | Full Name                   |
| :----------- | :-------------------------- |
| **LM**       | Last Month                  |
| **YoY**      | Year-over-Year              |
| **AOV**      | Average Order Value         |
| **ARPPU**    | Average Revenue Per Paying User |
| **APC**      | Average Payment Count      |
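
To show how these abbreviations relate to order-level data, here is a minimal sketch that computes AOV, ARPPU, and APC for a single month. The record fields `customer_id` and `revenue` are hypothetical and do not reflect the actual columns of the materialized views. The sketch also assumes that APC counts orders per unique customer.

```python
def customer_kpis(orders):
    """Compute AOV, ARPPU and APC for the orders of one month.

    Each order is a dict with the hypothetical keys `customer_id` and `revenue`.
    """
    if not orders:
        # No orders in the period: the ratios are undefined.
        return {"aov": None, "arppu": None, "apc": None}

    total_revenue = sum(o["revenue"] for o in orders)
    n_orders = len(orders)
    n_customers = len({o["customer_id"] for o in orders})  # unique paying customers

    return {
        "aov": total_revenue / n_orders,       # revenue per order
        "arppu": total_revenue / n_customers,  # revenue per paying customer
        "apc": n_orders / n_customers,         # orders per paying customer (assumed definition)
    }


sample_orders = [
    {"customer_id": 1, "revenue": 100.0},
    {"customer_id": 1, "revenue": 300.0},
    {"customer_id": 2, "revenue": 200.0},
]
kpis = customer_kpis(sample_orders)
```

With three orders from two customers and 600.0 in total revenue, the sketch yields an AOV of 200.0, an ARPPU of 300.0, and an APC of 1.5.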

---

## Tab Overview

### Sales Tab

**Key Metrics:**

- **Revenue** - Total sales amount for the last month
- **Profit** - Total profit generated for the last month
- **Sales** - Total number of orders for the last month
- **AOV** - Average Order Value for the last month

*Each metric includes a monthly trend visualization (sparkline) for the last 12 months and the previous year, allowing for direct historical comparison.*

**Detailed Breakdowns:**

- By customer category
- By delivery status (on-time vs delayed orders)
- By state and city

### Customers Tab

**Key Metrics:**

- **Customers** - Total active customers for the last month
- **ARPPU** - Average Revenue Per Paying User for the last month
- **APC** - Average Payment Count per customer for the last month
- **Avg Items/Order** - Average Products per Order for the last month

*Each metric includes a monthly trend visualization (sparkline) for the last 12 months and the previous year, allowing for direct historical comparison.*

**Detailed Breakdowns:**

- Customer metrics by state and city

### Products Tab

**Key Metrics:**

- **Avg Product Revenue** - Average Revenue per Product for the last month
- **Avg Product Profit** - Average Profit per Product for the last month
- **Avg Items Sales** - Average Units Sold per Product for the last month
- **Avg Unit Price** - Average Product Price for the last month

*Each metric includes a monthly trend visualization (sparkline) for the last 12 months and the previous year, allowing for direct historical comparison.*

**Detailed Breakdowns:**

- By product category
- By packaging type (internal/external packaging)

### Delivery & Logistics Tab

**Key Metrics:**

- **Avg Delivery Days** - Mean time from order to delivery in days for the last month
- **Delay Rate** - Percentage of orders delivered late for the last month
- **Average Delay Days** - Mean lateness for delayed orders for the last month
- **Avg Picking Days** - Mean order picking time in days for the last month

*Each metric includes a monthly trend visualization (sparkline) for the last 12 months and the previous year, allowing for direct historical comparison.*

**Detailed Breakdowns:**

- Performance by state and city

---

## Access Information

**Access Level:** Available to all

---

## Data Sources & Technical Details

**Data Sources:**

- Data is sourced from the analytical database.
- The dashboard connects to materialized views: analytics.mv\_orders and analytics.mv\_products.

**Calculation Methodology:**

- YoY (Year-over-Year) Growth: Calculated as (Metric Value for Last Month / Metric Value for Same Month Previous Year) - 1.
- Sparkline Trends: Each sparkline plots the monthly metric values for the most recent 12 months, overlaid with the values for the same months of the previous year, so the year-over-year comparison is visible for every month in the period.
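
The YoY formula above can be sketched as a small helper. The function name `yoy_growth` and the choice to return `None` when the prior-year value is missing or zero are assumptions made for illustration; the dashboard itself may handle these cases differently.

```python
def yoy_growth(current, previous):
    """Return YoY growth as a fraction: current / previous - 1.

    Returns None when the prior-year value is missing or zero,
    because the ratio is undefined in that case (assumed handling).
    """
    if previous is None or previous == 0:
        return None
    return current / previous - 1


# Example: 150 this month against 100 in the same month a year earlier.
growth = yoy_growth(150.0, 100.0)  # 0.5, i.e. +50%
```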

---

## Frequently Asked Questions

#### What does YoY mean and how is it calculated?

YoY stands for Year-over-Year.

- It is a key performance metric that compares a specific period (e.g., a month) to the same period in the previous year.
- Because both periods fall in the same season, the comparison is largely free of seasonal effects and gives a clearer view of growth trends.

It is calculated as:

(Metric Value for Current Period / Metric Value for Same Period Previous Year) - 1

#### How are delayed orders defined?

An order is classified as delayed if the actual delivery date is later than the promised/scheduled delivery date.
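
A minimal sketch of this rule and of the four Delivery & Logistics metrics is shown below. The field names `ordered`, `picked`, `expected`, and `delivered` are hypothetical and stand in for whatever date columns the materialized view actually exposes. In line with the metric descriptions, the average delay is taken over delayed orders only.

```python
from datetime import date


def is_delayed(order):
    """An order is delayed when it is delivered after the expected date."""
    return order["delivered"] > order["expected"]


def delivery_kpis(orders):
    """Compute delivery metrics for a non-empty list of delivered orders."""
    if not orders:
        raise ValueError("at least one delivered order is required")

    n = len(orders)
    delivery_days = [(o["delivered"] - o["ordered"]).days for o in orders]
    picking_days = [(o["picked"] - o["ordered"]).days for o in orders]
    # Lateness in days, collected for delayed orders only.
    delays = [(o["delivered"] - o["expected"]).days for o in orders if is_delayed(o)]

    return {
        "avg_delivery_days": sum(delivery_days) / n,
        "delay_rate": len(delays) / n,
        "avg_delay_days": sum(delays) / len(delays) if delays else 0.0,
        "avg_picking_days": sum(picking_days) / n,
    }


sample_deliveries = [
    {"ordered": date(2016, 5, 1), "picked": date(2016, 5, 2),
     "expected": date(2016, 5, 3), "delivered": date(2016, 5, 3)},  # on time
    {"ordered": date(2016, 5, 1), "picked": date(2016, 5, 3),
     "expected": date(2016, 5, 4), "delivered": date(2016, 5, 6)},  # 2 days late
]
delivery = delivery_kpis(sample_deliveries)
```

For the two sample orders, the delay rate is 0.5 and the average delay is 2.0 days, because only the second order is late.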

#### What is shown on the sparkline charts?

The sparklines provide a visual history of a metric's performance:

- The solid line represents the monthly values for the most recent 12-month period.
- The shaded dashed line behind it represents the values for the same months from the previous year (a 12-month period offset by one year).

This overlay allows for an instant visual comparison of current performance against the historical benchmark.
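
The two lines of a sparkline can be thought of as two aligned 12-month series. The sketch below builds them from a mapping of monthly values. The `(year, month)` key layout and the function name `sparkline_series` are assumptions made for illustration, not the way DataLens stores or computes the series.

```python
def sparkline_series(monthly_values, last_year, last_month):
    """Return (current, previous) lists of 12 monthly values.

    `monthly_values` maps (year, month) to a metric value. `current` covers
    the 12 months ending at (last_year, last_month); `previous` covers the
    same months one year earlier. Missing months are returned as None.
    """
    current, previous = [], []
    for offset in range(11, -1, -1):
        # Step back `offset` months from the last month using a month index.
        index = last_year * 12 + (last_month - 1) - offset
        year, month = divmod(index, 12)
        month += 1
        current.append(monthly_values.get((year, month)))
        previous.append(monthly_values.get((year - 1, month)))
    return current, previous


# Synthetic values encoded as YYYYMM so the alignment is easy to inspect.
values = {(y, m): y * 100 + m for y in (2014, 2015, 2016) for m in range(1, 13)}
current, previous = sparkline_series(values, 2016, 5)
```

Here `current` runs from June 2015 to May 2016 and `previous` from June 2014 to May 2015, so points at the same position in the two lists refer to the same calendar month one year apart.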

# Final Dashboard

Based on the dashboard mockup, a dashboard was created using the BI tool Yandex DataLens.

The dashboard is available at the following link: [**WWI Business Performance Overview Dashboard**](https://datalens.yandex/42t45uco5jxup)

Screenshots of the dashboard are provided below.

### 🛍️ Sales

Main page with key sales metrics, YoY growth, and geographical distribution.

<img src="assets/dashboard_tab_sales_1.png" width="800">

<img src="assets/dashboard_tab_sales_2.png" width="800">

### 👥 Customers

Customer count, ARPPU, APC, and average products per order, with a breakdown by state and city.

<img src="assets/dashboard_tab_customers_1.png" width="800">

<img src="assets/dashboard_tab_customers_2.png" width="800">

### 📦 Products

Product performance by product category and packaging type.

<img src="assets/dashboard_tab_products_1.png" width="800">

<img src="assets/dashboard_tab_products_2.png" width="800">

### 🚚 Delivery & Logistics

Delivery time analysis, delay statistics, and operational efficiency.

<img src="assets/dashboard_tab_delivery_1.png" width="800">

<img src="assets/dashboard_tab_delivery_2.png" width="800">


# General Conclusion

## Achievements and Summary

This project encompassed the full lifecycle of a business intelligence solution, from initial data exploration to deployment. Key achievements include:

- End-to-End Pipeline Construction: Successfully deployed the WWI database in the cloud, engineered a star schema analytical data model, and developed automated SQL scripts for data transformation and loading.
- Automation & Engineering: Designed and implemented an automated daily ETL process for incremental data updates.
- Actionable Visualization: Developed a functional dashboard, connecting it to materialized views to provide intuitive access to critical KPIs for sales and delivery analysis.

## Business Value Delivered

The implemented solution delivers direct value to WWI by enabling:

- For Leadership: A high-level overview of business health and trends to guide strategic planning.
- For Sales & Procurement Teams: The ability to analyze sales performance to optimize inventory and purchasing strategies.
- For the Logistics Department: Tools to monitor delivery timelines, identifying bottlenecks and improving operational efficiency.