Create Pivot Table

This project provides an easy way to create a pivot table. It features pivot customization options similar to Excel and includes an Airflow operator.

The generate_pivot_query.py and create_pivot_table.py scripts can be used directly, or you can use create_pivot_table_operator.CreatePivotTableOperator with Airflow.

For convenience, this project includes create_table_from_select.py directly. create_table_from_select.py is also available separately; see its README for more information.

The only databases currently supported are [Snowflake][snowflakke] and Amazon Redshift.

Requirements

Snowflake Connector for Python is required for Snowflake support
psycopg2 is required for Redshift support
airflow is required to use the (optional) CreatePivotTableOperator or load database connection information from airflow

Usage

Suppose you have a database table, myschema.orders, that looks something like this:

order_id	customer_id	category	amount
100	1	gizmos	100.00
101	1	gadgets	200.00
102	2	gadgets	220.00
103	2	gizmos	85.00

And you want to pivot it by category with customer_id as the base column so that you'll end up with a result like:

customer_id	amount_category_gizmos	amount_category_gadgets	amount
1	100.00	200.00	300.00
2	85.00	220.00	305.00

You can use the scripts in this repository in three main ways to achieve this.

Generate pivot query text

You can use generate_pivot_query.py to generate the pivot query and write it to standard output:

python generate_pivot_query.py                  \
    --dbtype snowflake --database mydb          \
    --host myhost.url --port 5432               \
    --user me --password myp4ssw0rd             \
    --base-columns customer_id                  \
    --pivot_columns category                    \
    --exclude-columns order_id                  \
    --aggfunction-mappings amount=sum           \
    myschema orders

The resulting query will resemble:

SELECT
    customer_id,
    SUM(CASE
            WHEN category = 'gizmos' THEN amount
            ELSE 0
    END) AS amount_category_gizmos,
    SUM(CASE
            WHEN category = 'gadgets' THEN amount
            ELSE 0
    END) AS amount_category_gadgets,
    SUM(amount) AS amount
FROM
    myschema.orders

Create the pivot table

Or, you can use create_pivot_table.py to go a step further and create the pivot table under e.g. orders_summary:

python create_pivot_table.py                    \
    --dbtype snowflake --database mydb          \
    --host myhost.url --port 5432               \
    --user me --password myp4ssw0rd             \
    --base-columns customer_id                  \
    --pivot_columns category                    \
    --exclude-columns order_id                  \
    --aggfunction-mappings amount=sum           \
    myschema orders orders_summary

Airflow

To do the same thing using an Airflow operator, you could define your DAG such as:

from airflow import DAG

from create_pivot_table_operator import CreatePivotTableOperator

default_args = {
    # ...
    'snowflake_conn_id': 'my-snowflake-conn-id'
}

dag = DAG(
    # ...
    default_args=default_args
)

my_orders_summary_op = CreatePivotTableOperator(
    source_schema='myschema',
    source_table='orders',
    table_name='orders_summary',
    base_columns=['customer_id'],
    pivot_columns=['category'],
    aggfunction_mappings={'amount': 'sum'},
    dag=dag
)

Supported options

You can specify an arbitrary number of columns for both base_columns and pivot_columns.

Columns listed in exclude_columns won't be included in the resulting query or table at all. If a column is instead listed in exclude_aggregates, the pivoted versions will still be created, but not the overall aggregate. In the example above, this means that amount_category_gizmos and amount_category_gadgets would be created but not amount.

The aggfunction_mappings option specifies how the aggregation should work. In the example above, we specified sum, which meant that the query used SUM with a default of 0. The currently supported options are:

Name	Aggregate	Default
max	`MAX`	`NULL`
min	`MIN`	`NULL`
sum	`SUM`	`0`
list	`LISTAGG`	`NULL`
and	`MIN`	`true`
or	`MAX`	`NULL`

If a column isn't explicitly specified in aggfunction_mapping, the script will attempt to guess a reasonable choice based on the column's name. For example, column's beginning with first_ will be aggregated based on "min", and column's beginning with latest_ will be aggregate based on "max".

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
create_pivot_table.py		create_pivot_table.py
create_pivot_table_operator.py		create_pivot_table_operator.py
create_table_from_select.py		create_table_from_select.py
generate_pivot_query.py		generate_pivot_query.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Create Pivot Table

Requirements

Usage

Generate pivot query text

Create the pivot table

Airflow

Supported options

About

Releases

Packages

Languages

License

Ro-Data/Ro-Create-Pivot-Table

Folders and files

Latest commit

History

Repository files navigation

Create Pivot Table

Requirements

Usage

Generate pivot query text

Create the pivot table

Airflow

Supported options

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages