The Activity Schema data model is a standard designed for modern data warehouses. It is designed to make data modeling and analysis substantially simpler, faster, and more reliable than existing methodologies (e.g. a star schema)
The core approach is to separate modeling from querying. Models should be fully independent: asking new data questions should not require creating new models. Instead, queries can combine any number of existing models to build anything they need.
This is done by modeling all data in the warehouse as a single time series table.
At its core an activity schema consists of transforming raw tables into a single, time series table called an activity stream. All downstream plots, tables, materialized views, etc used for BI are built directly from that single table, with no other dependencies.
Traditional Data Modeling - many custom-structured tables
Activity Schema - one standard table
Business concepts are represented as entity doing an activity ('a customer completed an order') instead of facts or nouns (orders, products). Activities are built directly from source tables, store only their own data, and are the single source of truth for each concept.
All queries run against an activity stream table to assemble data for analysis, BI, and reporting. Instead of traditional foreign key joins, queries combine activities using relationships in time (e.g. all customers who completed an order and submitted a support ticket before their next completed order).
Maintainable data models
- fewer models — one (and only one) business concept per activity means fewer models to manage, understand, and maintain
- easier to build - no joins between models means no need to tie disparate source systems together.
- easily accommodate changes to source data — only need to update a single activity
- simple data lineage — a single data layer makes tracing data provenance and debugging far easier
- faster updates — time-series modeling means incremental updates (rather than full rebuilds) by default
- lower data latency - no dependencies means data is available upon insertion, rather than waiting for cascading rebuilds
- no data dictionaries — fewer models, with one concept each, makes them vastly easier to document
Faster analysis and querying
- single source of truth — because activity represents a single concept (like a 'page view' or 'completed order'), it's always clear which activity to use
- query across many source systems — time-based joins means any activity can be queried and combined with another without defining foreign keys
- reusable analyses — a standard data model means that any analysis can be reused across companies. This means a customer acquisition cost calculation for one company can be shared with another
- autogenerated queries — a standard data model means that queries don't have to be written by hand
- true ad-hoc querying — because all activities are related in time, swapping one activity for another requires no structural changes to queries.
- high performance — queries run substantially faster against an activity stream table, which has fewer columns, requires fewer joins, and can be easily partitioned / indexed by time.
Learn more by reading the full activity schema specification.
The activity schema home page has more info on the approach and its benefits
The implementation page covers how to build an activity schema in production.
A new activity schema dbt package has some helpful macros for building activity stream tables.
Narrator provides a full implementation of the activity schema as a service. The Narrator team directly supports the growth of the activity schema, and is always happy to provide suggestions on how to build one yourself.
The dbt Slack has a
#modeling-activity-schema channel for discussion about the activity schema.