From b1aeffa64c480ee1bb5412295e62d63214f7e970 Mon Sep 17 00:00:00 2001 From: Iaroslav Zeigerman Date: Wed, 15 Feb 2023 15:59:22 -0800 Subject: [PATCH 1/3] Update the Model Kinds doc page --- docs/concepts/environments.md | 6 +- docs/concepts/glossary.md | 4 +- docs/concepts/models/model_kinds.md | 208 +++++++++++++++++++++++- docs/concepts/plans.md | 4 +- sqlmesh/core/engine_adapter/redshift.py | 2 +- 5 files changed, 209 insertions(+), 15 deletions(-) diff --git a/docs/concepts/environments.md b/docs/concepts/environments.md index 2e3315605c..d966945cc1 100644 --- a/docs/concepts/environments.md +++ b/docs/concepts/environments.md @@ -5,8 +5,6 @@ SQLMesh differentiates between production and development environments. Currentl [Models](models/overview.md) in development environments get a special suffix appended to the schema portion of their names. For example, to access data for a model with name `db.model_a` in the target environment `my_dev`, the `db__my_dev.model_a` table name should be used in a query. Models in the production environment are referred to by their original names. -By default, the [`sqlmesh plan`](plans.md) command targets the production (`prod`) environment. - ## Why use environments Data pipelines and their dependencies tend to grow in complexity over time and so assessing the impact of local changes can become quite challenging. Pipeline owners may not be aware of all downstream consumers of their pipelines, or may drastically underestimate the impact a change would have. That's why it is so important to be able to iterate and test model changes using production dependencies and data, while simultaneously avoiding any impact to existing datasets and/or pipelines that are currently used in production. Recreating the entire data warehouse with given changes would be an ideal solution to fully understand their impact, but this process is usually excessively expensive and time consuming. 
@@ -15,6 +13,8 @@ SQLMesh environments allow you to easily spin up shallow 'clones' of the data wa ## How to use environments When running the [plan](plans.md) command, the environment name can be supplied in the first argument. An arbitrary string can be used as an environment name. The only special environment name by default is `prod`, which refers to the production environment. Environments with names other than `prod` are considered development environments. +By default, the [`sqlmesh plan`](plans.md) command targets the production (`prod`) environment. + ### Example A custom name can be provided as an argument to create/update a development environment. For example, to target an environment with name `my_dev`, run: @@ -26,7 +26,7 @@ A new environment is created automatically the first time a plan is applied to i ## How do environments work Whenever a model definition changes, a new model snapshot is created with a unique [fingerprint](architecture/snapshots.md#fingerprints). This fingerprint allows SQLMesh to detect if a given model variant exists in other environments or if it's a brand-new variant. Because models may depend on other models, the fingerprint of a target model variant also includes fingerprints of its upstream dependencies. If a fingerprint already exists in SQLMesh, it is safe to reuse the existing physical table associated with that model variant, since we're confident that the logic that populates that table is exactly the same. This makes an environment a collection of references to model [snapshots](architecture/snapshots.md). -Please refer to the [Plans](plans.md) page for additional details. +Please refer to the [Plans](plans.md#plan-application) page for additional details. ## Date range A development environment includes a start date and end date. When creating a development environment, the intent is usually to test changes on a subset of data. 
The size of such a subset is determined by a time range defined through the start and end date of the environment. Both start and end date are provided during the [plan](plans.md) creation. diff --git a/docs/concepts/glossary.md b/docs/concepts/glossary.md index ba4c7b91ad..b8559cf93f 100644 --- a/docs/concepts/glossary.md +++ b/docs/concepts/glossary.md @@ -1,10 +1,10 @@ # Glossary ## CI/CD -An engineering process that combines both Continuous Integration (automated code creation and testing) and Continuous Delivery (deployment of code and tests) in a manner that is scalable, reliable, and secure. SQLMesh accomplishes this with [tests](concepts/tests.md) and [audits](concepts/audits.md). +An engineering process that combines both Continuous Integration (automated code creation and testing) and Continuous Delivery (deployment of code and tests) in a manner that is scalable, reliable, and secure. SQLMesh accomplishes this with [tests](tests.md) and [audits](audits.md). ## CTE -A Common Table Expression is a temporary named result set created from a SELECT statement, which can then be used in a subsequent SELECT statement. For more information, refer to [tests](concepts/tests.md). +A Common Table Expression is a temporary named result set created from a SELECT statement, which can then be used in a subsequent SELECT statement. For more information, refer to [tests](tests.md). ## DAG Directed Acyclic Graph. In this type of graph, objects are represented as nodes with relationships that show the dependencies between them; as such, the relationships are directed, meaning there is no way for data to travel through the graph in a loop that can circle back to the starting point. SQLMesh uses a DAG to keep track of a project's models. This allows SQLMesh to easily determine a model's lineage and to identify upstream and downstream dependencies. 
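The DAG entry in the glossary above can be made concrete with a small sketch. This is a toy illustration in Python, not SQLMesh's internal representation; the model names and the `downstream` helper are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical project: each model maps to the set of models it selects from.
dag = {
    "raw_events": set(),
    "db.events": {"raw_events"},
    "db.daily_agg": {"db.events"},
}

# A topological order guarantees every model is evaluated only after
# all of its upstream dependencies.
order = list(TopologicalSorter(dag).static_order())

def downstream(graph, model):
    """All models that directly or transitively depend on `model`."""
    direct = {m for m, ups in graph.items() if model in ups}
    result = set(direct)
    for d in direct:
        result |= downstream(graph, d)
    return result
```

With this graph, `downstream(dag, "raw_events")` yields both derived models, which is the lineage question a tool like SQLMesh answers when assessing the impact of a change.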
diff --git a/docs/concepts/models/model_kinds.md b/docs/concepts/models/model_kinds.md index 52dad2bee8..64fa4522f6 100644 --- a/docs/concepts/models/model_kinds.md +++ b/docs/concepts/models/model_kinds.md @@ -1,21 +1,215 @@ # Model kinds +This page describes the supported kinds of [models](overview.md), which ultimately determine how the data for a model gets loaded. + ## INCREMENTAL_BY_TIME_RANGE -Incremental by time range load is the default model kind. It specifies that the data is incrementally computed. For example, many models representing 'facts' or 'logs' should be incremental because new data is continuously added. This strategy requires a time column. +Specifies that the model should be computed incrementally based on a time range. This is a good choice for datasets in which records are of a temporal nature and represent immutable facts like events, logs, or transactions. Using this kind for datasets that fit the described traits usually results in significant cost and time savings. + +As the name suggests, a model of this kind is computed incrementally, meaning only missing data intervals are processed during each evaluation. This is in contrast to the [FULL](#full) model kind, which causes the recomputation of the entire dataset every time the model is evaluated. + +In order to take advantage of the incremental evaluation, the model query must contain an expression in its `WHERE` clause which filters the upstream records by time range. SQLMesh provides special macros which represent the start and the end of the time range that is being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`. Please refer to [Macros](overview.md#macros) to find more information on these. 
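The interval-based evaluation described above can be sketched in Python. This is a simplified illustration of the idea, not SQLMesh's implementation; the interval bookkeeping and the query template are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical model query with @start_ds / @end_ds stand-ins.
QUERY_TEMPLATE = (
    "SELECT event_date, payload FROM raw_events "
    "WHERE event_date BETWEEN '{start_ds}' AND '{end_ds}'"
)

def missing_intervals(start, end, processed):
    """Daily intervals in [start, end] that haven't been processed yet."""
    days = (end - start).days + 1
    return [start + timedelta(d) for d in range(days)
            if start + timedelta(d) not in processed]

def render(day):
    """Render the query for a single daily interval."""
    ds = day.isoformat()
    return QUERY_TEMPLATE.format(start_ds=ds, end_ds=ds)

# Only the intervals that are still missing get (re)computed.
processed = {date(2023, 1, 1)}
todo = missing_intervals(date(2023, 1, 1), date(2023, 1, 3), processed)
queries = [render(d) for d in todo]
```

The point of the sketch: because each rendered query is scoped to one interval, re-running the model never touches data outside the intervals being processed.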
+ +Below is an example of a definition which takes full advantage of the model's incremental nature: +```sql +MODEL ( + name db.events, + kind INCREMENTAL_BY_TIME_RANGE +); +SELECT + event_date::TEXT as ds, + event_payload::TEXT as payload +FROM raw_events +WHERE + event_date BETWEEN @start_ds AND @end_ds; +``` + +### Time column +SQLMesh needs to know which column in the model's output represents a timestamp or a date associated with each record. This column is used to determine which records will be overridden during data [restatement](../plans.md#restatement-plans), and it also serves as a partition key for engines that support partitioning (e.g. Apache Spark). By default, the `ds` column name is used, but it can be overridden in the model definition: +```sql +MODEL ( + name db.events, + kind INCREMENTAL_BY_TIME_RANGE ( + time_column date_column + ) +); +``` + +Additionally, the format in which the timestamp/date is stored is required. By default, SQLMesh uses the `%Y-%m-%d` format, but it can be overridden as follows: +```sql +MODEL ( + name db.events, + kind INCREMENTAL_BY_TIME_RANGE ( + time_column (date_column, '%Y-%m-%d') + ) +); +``` + +SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not a part of the target interval from being stored. This is a safety mechanism which prevents the unintended overriding of unrelated records when handling late-arriving data. 
+ +Consider the following model definition: +```sql +MODEL ( + name db.events, + kind INCREMENTAL_BY_TIME_RANGE ( + time_column event_date + ) +); +SELECT + event_date::TEXT as event_date, + event_payload::TEXT as payload +FROM raw_events +WHERE + receipt_date BETWEEN @start_ds AND @end_ds; +``` + +At runtime, SQLMesh will automatically modify the model's query to look like the following: +```sql +SELECT + event_date::TEXT as event_date, + event_payload::TEXT as payload +FROM raw_events +WHERE + receipt_date BETWEEN @start_ds AND @end_ds + AND event_date BETWEEN @start_ds AND @end_ds; +``` + +### Materialization strategy +Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies: + +| Engine | Strategy | +|------------|-------------------------------------------| +| Spark | INSERT OVERWRITE by time column partition | +| Databricks | INSERT OVERWRITE by time column partition | +| Snowflake | DELETE by time range, then INSERT | +| BigQuery | DELETE by time range, then INSERT | +| Redshift | DELETE by time range, then INSERT | +| Postgres | DELETE by time range, then INSERT | +| DuckDB | DELETE by time range, then INSERT | ## INCREMENTAL_BY_UNIQUE_KEY -Incremental by unique key will update or insert new records since the last load was run. This strategy requires a unique key. +This kind signifies that a model should be computed incrementally based on a unique key. If a key is missing in the model's table, the new row is inserted; otherwise, the existing row associated with this key is updated with the new one. This kind is a good fit for datasets which have the following traits: + +* Each record has a key associated with it. +* There should be at most one record associated with each unique key. +* It's appropriate to upsert records, meaning existing records can be overridden by newly arrived ones when their keys match. 
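The upsert behavior listed in the traits above can be sketched in Python. This is a toy in-memory merge; actual engines implement it with a `MERGE` statement, and the rows shown are hypothetical:

```python
def upsert(table, new_rows, key):
    """Merge new_rows into table by `key`: update on key match, insert otherwise."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row  # a matching key overrides the existing record
    return list(by_key.values())

employees = [{"name": "ada", "title": "engineer"}]
arrived = [
    {"name": "ada", "title": "staff engineer"},  # existing key -> update
    {"name": "grace", "title": "engineer"},      # new key -> insert
]
merged = upsert(employees, arrived, "name")
```

After the merge there is still exactly one row per key, with the newly arrived values winning on conflicts.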
+ +[SCD](https://en.wikipedia.org/wiki/Slowly_changing_dimension) (Slowly Changing Dimensions) is one example that fits this description well. + +The name of the unique key column must be provided as part of the model definition, as in the following example: +```sql +MODEL ( + name db.employees, + kind INCREMENTAL_BY_UNIQUE_KEY ( + unique_key name + ) +); +SELECT + name::TEXT as name, + title::TEXT as title, + salary::INT as salary +FROM raw_employees; +``` + +Composite keys are also supported: +```sql +MODEL ( + name db.employees, + kind INCREMENTAL_BY_UNIQUE_KEY ( + unique_key (first_name, last_name) + ) +); +``` + +Similarly to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind, the upstream records can be filtered by time range using the `@start_date`, `@end_date`, etc. [macros](overview.md#macros) in order to process the input data incrementally: +```sql +SELECT + name::TEXT as name, + title::TEXT as title, + salary::INT as salary +FROM raw_employee_events +WHERE + event_date BETWEEN @start_date AND @end_date; +``` + +Note, however, that models of this kind don't support data [restatement](../plans.md#restatement-plans). + +### Materialization strategy +Depending on the target engine, models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are materialized using the following strategies: + +| Engine | Strategy | +|------------|---------------------| +| Spark | not supported | +| Databricks | MERGE ON unique key | +| Snowflake | MERGE ON unique key | +| BigQuery | MERGE ON unique key | +| Redshift | MERGE ON unique key | +| Postgres | MERGE ON unique key | +| DuckDB | not supported | ## FULL -Full refresh is used when the entire table needs to be recomputed from scratch every batch. +As the name suggests, this kind causes the dataset associated with a model to be fully refreshed (rewritten) on each model evaluation. It's somewhat easier to use than incremental kinds due to the lack of any special settings or additional query considerations. 
This makes it suitable for smaller datasets, for which recomputing data from scratch is relatively cheap and which don't require preservation of processing history. However, using this kind with datasets which have a high volume of records will result in significant runtime and compute costs. + +This kind can be a good fit for aggregate tables that lack a temporal dimension. For aggregate tables with a temporal dimension, consider the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind instead. + +Example: +```sql +MODEL ( + name db.salary_by_title_agg, + kind FULL +); +SELECT + title, + AVG(salary) +FROM db.employees +GROUP BY title; +``` -## SNAPSHOT -Snapshot means recomputing the entire history of a table as of the compute date and storing that in a partition. Snapshots are expensive to compute and store, but allow you to look at the frozen snapshot at a certain point in time. An example of a snapshot model would be computing and storing lifetime revenue of a user daily. +### Materialization strategy +Depending on the target engine, models of the `FULL` kind are materialized using the following strategies: + +| Engine | Strategy | +|------------|----------------------------------| +| Spark | INSERT OVERWRITE | +| Databricks | INSERT OVERWRITE | +| Snowflake | CREATE OR REPLACE TABLE | +| BigQuery | CREATE OR REPLACE TABLE | +| Redshift | DROP TABLE, CREATE TABLE, INSERT | +| Postgres | DROP TABLE, CREATE TABLE, INSERT | +| DuckDB | CREATE OR REPLACE TABLE | ## VIEW -View models rely on datebase engine views and don't require any direct backfilling. Using a view will create a view in the same location as you may expect a physical table, but no table is computed. Other models that reference view models will incur compute cost because only the query is stored. +Up until now, each model kind has caused the output of a model query to be materialized and stored in a physical table. 
The `VIEW` kind is different because no data actually gets written during model evaluation. Instead, a non-materialized view (aka "virtual table") is created or replaced based on the model's query. + +Please note that with this kind, the model's query is evaluated every time the model gets referenced in downstream queries. This may incur undesirable compute costs in cases where the model's query is compute-intensive or when the model is referenced in many downstream queries. + +Example: +```sql +MODEL ( + name db.highest_salary, + kind VIEW +); +SELECT + MAX(salary) +FROM db.employees; +``` ## EMBEDDED -Embedded models are like views, except they don't interact with the data warehouse at all. They are embedded directly in models that reference them as expanded queries. They are an easy way to share logic across models. +This kind is similar to [VIEW](#view), except models of this kind are never evaluated, and therefore, there are no data assets (tables or views) associated with them in the data warehouse. Instead, the embedded model's query gets injected directly into the query of each downstream model that references this model in its own query. + +Embedded models are a way to share common logic between different models of other kinds. + +Example: +```sql +MODEL ( + name db.unique_employees, + kind EMBEDDED +); +SELECT DISTINCT + name +FROM db.employees; +``` + +## SEED +This is a special kind reserved for [seed models](seed_models.md). diff --git a/docs/concepts/plans.md b/docs/concepts/plans.md index 997657dcd1..aa08994697 100644 --- a/docs/concepts/plans.md +++ b/docs/concepts/plans.md @@ -37,9 +37,9 @@ Every time a model is changed as part of a plan, a new variant of this model get When a plan is applied to an environment, that environment gets associated with a collection of model variants that are part of that plan. In other words, each environment is a collection of references to model variants and the physical tables associated with them. 
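The variant bookkeeping described above can be sketched as a content hash over a model's definition and the fingerprints of its upstream dependencies. This is a toy illustration of the idea only; SQLMesh's actual fingerprint algorithm differs, and the definitions below are hypothetical:

```python
import hashlib

def fingerprint(definition, upstream_fingerprints):
    """Hash a model's definition together with its upstream fingerprints."""
    h = hashlib.sha256(definition.encode())
    for up in sorted(upstream_fingerprints):
        h.update(up.encode())
    return h.hexdigest()[:12]

a_v1 = fingerprint("SELECT * FROM raw", [])
b_v1 = fingerprint("SELECT x FROM a", [a_v1])

# Changing the upstream model changes the downstream fingerprint too,
# even though b's own definition is untouched -- so the old physical
# table for b cannot be reused.
a_v2 = fingerprint("SELECT *, 1 AS flag FROM raw", [])
b_v2 = fingerprint("SELECT x FROM a", [a_v2])
```

Because the hash is deterministic, re-encountering a known fingerprint means the exact same logic produced the existing physical table, which is what makes it safe to share that table across environments.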
-![Each model version gets its own physical table while environments only contain references to these tables](plans/model_versioning.png) +![Each model variant gets its own physical table while environments only contain references to these tables](plans/model_versioning.png) -*Each model version gets its own physical table while environments only contain references to these tables.* +*Each model variant gets its own physical table while environments only contain references to these tables.* This approach allows SQLMesh to ensure complete isolation between environments, while allowing it to share physical data assets between environments when appropriate and safe to do so. Additionally, since each model change is captured in a separate physical table, reverting to a previous version becomes a simple and quick operation (refer to [logical updates](#logical-updates)) as long as its physical table hasn't been garbage collected by the janitor process. SQLMesh makes it really hard to accidentally and irreversibly break things. diff --git a/sqlmesh/core/engine_adapter/redshift.py b/sqlmesh/core/engine_adapter/redshift.py index 11b1114a07..7bdb91c2e4 100644 --- a/sqlmesh/core/engine_adapter/redshift.py +++ b/sqlmesh/core/engine_adapter/redshift.py @@ -110,7 +110,7 @@ def replace_query( If the table doesn't exist then we just create it and load it with insert statements If it does exist then we need to do the: - `CREATE TABLE...`, `INSERT INTO...`, `RENAME TABLE...`, `RENAE TABLE...`, DROP TABLE...` dance. + `CREATE TABLE...`, `INSERT INTO...`, `RENAME TABLE...`, `RENAME TABLE...`, `DROP TABLE...` dance. 
""" if not isinstance(query_or_df, pd.DataFrame): return super().replace_query(table_name, query_or_df, columns_to_types) From 3ee9c0c054e84ec1898e888ca790709a2f833b6a Mon Sep 17 00:00:00 2001 From: Iaroslav Zeigerman Date: Thu, 16 Feb 2023 15:25:38 -0800 Subject: [PATCH 2/3] Address comments --- docs/concepts/models/model_kinds.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/concepts/models/model_kinds.md b/docs/concepts/models/model_kinds.md index 64fa4522f6..7765eddf3c 100644 --- a/docs/concepts/models/model_kinds.md +++ b/docs/concepts/models/model_kinds.md @@ -44,6 +44,7 @@ MODEL ( ) ); ``` +Please note that the time format should be defined using the same dialect as the one used to define the model's query. SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not a part of the target interval from being stored. This is a safety mechanism which prevents the unintended overriding of unrelated records when handling late-arriving data. @@ -74,6 +75,9 @@ WHERE AND event_date BETWEEN @start_ds AND @end_ds; ``` +### Idempotency +It's recommended to ensure that queries of models of this kind are [idempotent](../glossary.md#idempotency) to prevent unexpected results during data [restatement](../plans.md#restatement-plans). Please note, however, that upstream models and tables can impact the extent to which the idempotency property can be guaranteed. For example, referencing an upstream model of kind [FULL](#full) in the model query automatically renders such a model non-idempotent. 
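The idempotency concern above can be illustrated with a small Python sketch. The interval filter stands in for the model query, and the rows are hypothetical:

```python
def load_interval(upstream_rows, start, end):
    """Stand-in for the model query: filter upstream rows by the interval."""
    return [r for r in upstream_rows if start <= r["ds"] <= end]

# First run against the upstream dataset.
upstream = [{"ds": "2023-01-01", "value": 1}]
run1 = load_interval(upstream, "2023-01-01", "2023-01-01")

# If the upstream is a FULL model, its contents may have changed by the
# time the interval is restated, so re-running the same interval is no
# longer guaranteed to reproduce the original rows.
upstream_after_refresh = [{"ds": "2023-01-01", "value": 2}]
run2 = load_interval(upstream_after_refresh, "2023-01-01", "2023-01-01")
```

An idempotent model would produce identical rows on both runs; here the mutable upstream breaks that guarantee, which is exactly why restating such a model can yield unexpected results.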
+ +### Materialization strategy Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies: From b638943cdcdef26f1723f869b0dd3b4c44cc32ae Mon Sep 17 00:00:00 2001 From: Iaroslav Zeigerman Date: Thu, 16 Feb 2023 15:30:02 -0800 Subject: [PATCH 3/3] Update references to macros --- docs/concepts/models/model_kinds.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/concepts/models/model_kinds.md b/docs/concepts/models/model_kinds.md index 7765eddf3c..2ed69225ec 100644 --- a/docs/concepts/models/model_kinds.md +++ b/docs/concepts/models/model_kinds.md @@ -8,7 +8,7 @@ Specifies that the model should be computed incrementally based on a time range. As the name suggests, a model of this kind is computed incrementally, meaning only missing data intervals are processed during each evaluation. This is in contrast to the [FULL](#full) model kind, which causes the recomputation of the entire dataset every time the model is evaluated. -In order to take advantage of the incremental evaluation, the model query must contain an expression in its `WHERE` clause which filters the upstream records by time range. SQLMesh provides special macros which represent the start and the end of the time range that is being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`. Please refer to [Macros](overview.md#macros) to find more information on these. +In order to take advantage of the incremental evaluation, the model query must contain an expression in its `WHERE` clause which filters the upstream records by time range. SQLMesh provides special macros which represent the start and the end of the time range that is being processed: `@start_date` / `@end_date` and `@start_ds` / `@end_ds`. Please refer to [Macros](../macros.md#predefined-variables) to find more information on these. 
Below is an example of a definition which takes full advantage of the model's incremental nature: ```sql @@ -126,7 +126,7 @@ MODEL ( ); ``` -Similarly to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind, the upstream records can be filtered by time range using the `@start_date`, `@end_date`, etc. [macros](overview.md#macros) in order to process the input data incrementally: +Similarly to the [INCREMENTAL_BY_TIME_RANGE](#incremental_by_time_range) kind, the upstream records can be filtered by time range using the `@start_date`, `@end_date`, etc. [macros](../macros.md#predefined-variables) in order to process the input data incrementally: ```sql SELECT name::TEXT as name,