28 changes: 16 additions & 12 deletions docs/concepts/environments.md
@@ -1,17 +1,19 @@
## Environments
Environments are isolated namespaces that allow you to develop and deploy SQLMesh projects. If an environment isn't specified, the `prod` environment is used, which does not append a prefix to model names. Given a [model](/concepts/models) `db.table`, the `prod` environment would create this model in `db.table`. The `dev` environment would be located at `dev__db.table`. All environments other than `prod` are considered to be development environments.
Environments are isolated namespaces that allow you to test and preview your changes.

Models in `dev` environments also get a special suffix appended to the schema portion of their names. For example, if the model's name is `db.model_a`, it will be available under the name `db__my_dev.model_a` in the `my_dev` environment.
SQLMesh differentiates between production and development environments. Currently only the environment with the name `prod` is treated by SQLMesh as the production one. Environments with other names are considered to be development ones.

By default, the [`sqlmesh plan`](/concepts/plans) command targets the `prod` environment.
[Models](/concepts/models) in development environments get a special suffix appended to the schema portion of their names. For example, to access data for a model with name `db.model_a` in the target environment `my_dev`, the `db__my_dev.model_a` table name should be used in a query. Models in the production environment are referred to by their original names.
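The naming rule above can be illustrated with a small sketch. This is not SQLMesh's actual implementation: the helper name `environment_table_name` is hypothetical, and it assumes a two-part `schema.table` model name.

```python
def environment_table_name(model_name: str, environment: str = "prod") -> str:
    """Return the name under which a model is queried in an environment.

    Assumption: model names have the form `schema.table`.
    """
    if environment == "prod":
        # Production models keep their original names.
        return model_name
    schema, _, table = model_name.rpartition(".")
    # Development environments append `__<environment>` to the schema portion.
    return f"{schema}__{environment}.{table}"


print(environment_table_name("db.model_a", "my_dev"))  # db__my_dev.model_a
print(environment_table_name("db.model_a"))            # db.model_a
```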

## Why use environments?
It is important to be able to iterate and test changes to models with production data. Data pipelines can be very complex and can consist of many chained jobs. Being able to recreate your entire warehouse with these changes is powerful in order to understand the full impact of your changes, but usually expensive or time consuming.
By default, the [`sqlmesh plan`](/concepts/plans) command targets the production (`prod`) environment.

SQLMesh environments allow you to easily spin up 'clones' of your warehouse quickly and efficiently. SQLMesh understands which models have changed compared to the base environment, and only recomputes/backfills what doesn't already exist. Any changes or backfills within this environment **will not impact** other environments. However, any work that was done in this environment **can be reused safely** from other environments.
## Why use environments
Data pipelines and their dependencies tend to grow in complexity over time and so assessing the impact of local changes can become quite challenging. Pipeline owners may not be aware of all downstream consumers of their pipelines, or may drastically underestimate the impact a change would have. That's why it is so important to be able to iterate and test model changes using production dependencies and data, while simultaneously avoiding any impact to existing datasets and/or pipelines that are currently used in production. Recreating the entire data warehouse with given changes would be an ideal solution to fully understand their impact, but this process is usually excessively expensive and time consuming.

## How do you use an environment?
When running the [plan](/concepts/plans) command, the environment is the first variable. You can specify any string as your environment name. The only special environment by default is `prod`. All other environments will prefix the environment name to all models.
SQLMesh environments allow you to easily spin up shallow 'clones' of the data warehouse quickly and efficiently. SQLMesh understands which models have changed compared to the target environment, and only computes data gaps that have been directly caused by the changes. Any changes or backfills within the target environment **do not impact** other environments. At the same time, any computation that was done in this environment **can be safely reused** in other environments.
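To make the idea of a shallow clone more concrete, here is a simplified sketch in which environments only hold references to already computed model variants. The dictionaries, the `apply_plan` helper, and the `v1`/`v2` variant labels are all hypothetical stand-ins and do not reflect SQLMesh's real data model.

```python
# Hypothetical in-memory stand-ins for SQLMesh state; not its real data model.
physical_tables: dict[tuple[str, str], str] = {}  # (model, variant) -> table
environments: dict[str, dict[str, str]] = {}      # environment -> {model: variant}


def apply_plan(environment: str, variants: dict[str, str]) -> list[str]:
    """Point an environment at model variants; return the tables that had to be built."""
    built = []
    for model, variant in variants.items():
        key = (model, variant)
        if key not in physical_tables:
            # Only variants that have never been computed need new data.
            physical_tables[key] = f"{model}__{variant}"
            built.append(physical_tables[key])
    # The environment only stores references, so other environments are untouched.
    environments[environment] = {**environments.get(environment, {}), **variants}
    return built


apply_plan("prod", {"db.model_a": "v1", "db.model_c": "v1"})
dev_built = apply_plan("my_dev", {"db.model_a": "v2", "db.model_c": "v1"})
prod_built = apply_plan("prod", {"db.model_a": "v2"})

print(dev_built)             # ['db.model_a__v2']
print(prod_built)            # []
print(environments["prod"])  # {'db.model_a': 'v2', 'db.model_c': 'v1'}
```

In this sketch, creating `my_dev` builds only the changed variant of `db.model_a`, and promoting that variant to `prod` afterwards builds nothing because the work is reused.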

## How to use environments
When running the [plan](/concepts/plans) command, the environment name can be supplied in the first argument. An arbitrary string can be used as an environment name. The only special environment name by default is `prod`, which refers to the production environment. Environments with names other than `prod` are considered to be development environments.

### Example
A custom name can be provided as an argument to create/update a development environment. For example, to target an environment with name `my_dev`, run:
@@ -21,8 +23,10 @@ $ sqlmesh plan my_dev
```
A new environment is created automatically the first time a plan is applied to it.

## How do environments work?
Every model definition has a unique [fingerprint](/concepts/architecture/snapshots/#fingerprints). This fingerprint allows SQLMesh to detect if it exists in another environment or if it brand new. Because models depend on other models, the fingerprint also takes into account its upstream fingerprints. If a fingerpint already exists in SQLMesh, it is safe to reuse the existing table because the logic is exactly the same. An environment is essentially a collection of [snapshots](/concepts/architecture/snapshots) of models.
## How do environments work
Whenever a model definition changes, a new model snapshot is created with a unique [fingerprint](/concepts/architecture/snapshots/#fingerprints). This fingerprint allows SQLMesh to detect if a given model variant exists in other environments or if it's a brand new variant. Because models may depend on other models, the fingerprint of a target model variant also includes fingerprints of its upstream dependencies. If a fingerprint already exists in SQLMesh, it is safe to reuse the existing physical table associated with that model variant, since we're confident that the logic that populates that table is exactly the same. This makes an environment a collection of references to model [snapshots](/concepts/architecture/snapshots).
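The way a fingerprint can incorporate upstream fingerprints may be sketched as follows. This is an illustration only: the `fingerprint` function, the use of SHA-256, and the dictionary inputs are assumptions made for the example and are not SQLMesh's actual fingerprinting logic.

```python
import hashlib


def fingerprint(model: str, definitions: dict[str, str], upstream: dict[str, list[str]]) -> str:
    """Hash a model's definition together with the fingerprints of its parents."""
    digest = hashlib.sha256(definitions[model].encode("utf-8"))
    # Sorting keeps the result independent of the order parents are listed in.
    for parent in sorted(upstream.get(model, [])):
        digest.update(fingerprint(parent, definitions, upstream).encode("utf-8"))
    return digest.hexdigest()


definitions = {
    "db.model_a": "SELECT id FROM raw.events",
    "db.model_b": "SELECT id FROM db.model_a",
}
upstream = {"db.model_b": ["db.model_a"]}

before = fingerprint("db.model_b", definitions, upstream)

# Changing only the upstream model also changes the downstream fingerprint.
changed = {**definitions, "db.model_a": "SELECT id, ts FROM raw.events"}
after = fingerprint("db.model_b", changed, upstream)

print(before != after)  # True
```

Because an unchanged definition with unchanged parents always hashes to the same value, an identical fingerprint signals that an existing table can be reused.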

Please refer to the [Plans](/concepts/plans) page for additional details.

## Date ranges ##
A non-production environment consists of a start date and end date. When creating development environments, you usally want to test your data on a subset of dates, such as the last week or last month of data. Non-production environments do not automatically schedule recurring jobs.
## Date range
A development environment includes a start date and end date. When creating a development environment, the intent is usually to test changes on a subset of data. The size of such a subset is determined by a time range defined through the start and end date of the environment. Both the start and end dates are provided during the [plan](/concepts/plans) creation.
12 changes: 7 additions & 5 deletions docs/concepts/plans.md
@@ -4,7 +4,7 @@ A plan is a set of changes that summarizes the difference between the local stat

During plan creation:

* the local state of the SQLMesh project is compared against the state of a target environment. The difference computed is what constitutes a plan.
* the local state of the SQLMesh project is compared against the state of a target environment. The difference computed is what constitutes a plan.
* users are prompted to categorize changes (refer to [change categories](#change-categories)) to existing models in order for SQLMesh to devise a backfill strategy for models that have been affected indirectly (by being downstream dependencies of updated models).
* each plan requires a date range to which it will be applied. If not specified, the date range is derived automatically based on model definitions and the target environment.

@@ -31,9 +31,11 @@ If a directly modified model change is categorized as breaking, then it will be
A directly-modified model that is classified as non-breaking will be backfilled, but its downstream dependencies will not. This is a common choice in scenarios such as an addition of a new column, an action which doesn't affect downstream models as new columns can't be used by downstream models without modifying them directly.
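The effect of a change category on the backfill scope can be sketched as below. The `models_to_backfill` helper and the `downstream` mapping are hypothetical, and the sketch assumes that a breaking change cascades to every downstream model while a non-breaking change affects only the modified model.

```python
def models_to_backfill(changed: str, category: str, downstream: dict[str, list[str]]) -> set[str]:
    """Return the models to backfill for one directly modified model."""
    selected = {changed}
    if category == "breaking":
        # Assumption: a breaking change cascades to all downstream models.
        stack = [changed]
        while stack:
            for child in downstream.get(stack.pop(), []):
                if child not in selected:
                    selected.add(child)
                    stack.append(child)
    return selected


# model_a -> model_b -> model_c
downstream = {"db.model_a": ["db.model_b"], "db.model_b": ["db.model_c"]}

print(sorted(models_to_backfill("db.model_a", "breaking", downstream)))
# ['db.model_a', 'db.model_b', 'db.model_c']
print(sorted(models_to_backfill("db.model_a", "non-breaking", downstream)))
# ['db.model_a']
```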

## Plan application
Once a plan has been created and reviewed, it should then be applied in order for the changes that are part of it to take effect.
Once a plan has been created and reviewed, it should then be applied to a target [environment](/concepts/environments) in order for the changes that are part of it to take effect.

Typically, each model changed in a plan gets assigned with a new version. In turn, each model version gets a separate physical location for data. Data between different model versions is never shared, therefore an environment is simply a collection of references to physical tables of model versions which that environment has been created/updated with.
Every time a model is changed as part of a plan, a new variant of this model gets created behind the scenes (see [snapshots](/concepts/architecture/snapshots)). In turn, each model variant gets a separate physical location for data (i.e. table). Data between different variants of the same model is never shared (except for the [forward-only](#forward-only-plans) case).

When a plan is applied to an environment, that environment gets associated with a collection of model variants that are part of that plan. In other words each environment is a collection of references to model variants and the physical tables associated with them.

![Each model version gets its own physical table while environments only contain references to these tables](plans/model_versioning.png)

@@ -52,7 +54,7 @@ Another benefit of the aforementioned approach is that data for a new model vers
## Forward-only plans
Sometimes the runtime cost associated with rebuilding an entire physical table is too high, and outweighs the benefits a separate table provides. This is when a forward-only plan comes in handy.

When a forward-only plan is applied, all of the contained model changes will not get separate physical tables assigned to them. Instead, physical tables of previous model versions are reused. The benefit of such a plan is that no backfilling is required, so there is no runtime overhead and hence no cost. The drawback is that reverting to a previous version is no longer as straightforward, and requires a combination of additional forward-only changes and restatements (refer to [restatement plans](#restatement-plans)).
When a forward-only plan is applied, all of the contained model changes will not get separate physical tables assigned to them. Instead, physical tables of previous model versions are reused. The benefit of such a plan is that no backfilling is required, so there is no runtime overhead and hence no cost. The drawback is that reverting to a previous version is no longer as straightforward, and requires a combination of additional forward-only changes and restatements (refer to [restatement plans](#restatement-plans)).

Also note that once a forward-only change is applied to production, all development environments that referred to the previous versions of the updated models will be impacted.

@@ -69,7 +71,7 @@ $ sqlmesh plan --forward-only
There are cases when models need to be re-evaluated for a given time range, even though changes may not have been made to those model definitions. This could be due to an upstream issue with a dataset defined outside the SQLMesh platform, or when a [forward-only plan](#forward-only-plans) change needs to be applied retroactively to a bounded interval of historical data.

For this reason, the `plan` command supports the `--restate-model` option, which allows users to specify one or more names of a model to be reprocessed. Each name can also refer to an external table defined outside SQLMesh.

Application of such a plan will trigger a cascading backfill for all specified models (excluding external tables), as well as all models downstream from them. The plan's date range in this case determines data intervals that will be affected. For example:

```bash
Binary file modified docs/concepts/plans/model_versioning.png