TL;DR: This is a guide for migrating the output tables of External DAGs (e.g., Weather, Trends) from Cortex Data Foundation v4.2 to their new locations in v5.0.
If you have not used External DAGs before, or if you have not deployed SAP, this guide does not apply to you and you can stop reading now.
Before Cortex Data Foundation version 4.2, the `_GEN_EXT` flag controlled the deployment of external data sources, some of which are workload-dependent (e.g., currency conversion for SAP). As of version 5.0, the `_GEN_EXT` flag is retired, and DAGs that can serve multiple workloads are now controlled by a new execution module. This guide provides high-level recommendations for migrating your existing DAG output tables so your workflows continue to function.
K9 is introduced in v5.0 for the ingestion, processing, and modelling of components that are reusable across different data sources.
The following external data sources are now deployed as part of K9, into the `K9_PROCESSING` dataset:

- `date_dimension`
- `holiday_calendar`
- `trends`
- `weather`
Reporting views refer to the `K9_PROCESSING` dataset for these reusable components.
- The following SAP-dependent DAGs are still executed by `generate_external_dags.sh`, but now run during the reporting build step and write into the SAP Reporting dataset instead of the CDC dataset:
  - `currency_conversion`
  - `inventory_snapshots`
  - `prod_hierarchy_texts`
  - `hier_reader`
First, deploy the newest version (v5.0) of Cortex Data Foundation to your desired projects, with the following guidelines:
- Use the existing RAW and CDC datasets from your prior development or staging deployment as the RAW and CDC datasets of this deployment; they will not be modified during deployment.
- In `config/config.json`, set both `testData` and `SAP.deployCDC` to `false`.
- We recommend first testing with a new SAP Reporting dataset, adjacent to your current development or staging dataset and different from the one you currently use for v4.2, and following the process below. This way, you can safely validate the deployment and migration process.
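For reference, those two settings correspond to the following minimal `config/config.json` excerpt (all other keys present in your real file are omitted here):

```json
{
  "testData": false,
  "SAP": {
    "deployCDC": false
  }
}
```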
If you are currently running the DAGs, you need to first pause them in your Composer environment.
To migrate your existing tables to their new locations, use `jinja-cli` to render the provided migration script template.
- Install jinja-cli:

```shell
pip install jinja-cli
```
- Identify the following parameters from your existing v4.2 and new v5.0 deployments:
| Name | Description |
|---|---|
| `project_id_src` | Source Google Cloud project: project where your existing SAP CDC dataset from the v4.2 deployment is located. The `K9_PROCESSING` dataset will also be created in this project. |
| `project_id_tgt` | Target Google Cloud project: project where your newly deployed SAP Reporting dataset from the v5.0 deployment is located. This may or may not be different from the source project. |
| `dataset_cdc_processed` | CDC BigQuery dataset: BigQuery dataset where the CDC processed data lands the latest available records. This may or may not be the same as the source dataset. |
| `dataset_reporting_tgt` | Target BigQuery reporting dataset: BigQuery dataset where the Data Foundation for SAP predefined data models will be deployed. |
| `k9_datasets_processing` | K9 BigQuery dataset: BigQuery dataset where the K9 (augmented data sources) is deployed. |
Then, create a JSON file with the required input data. Make sure to remove any DAGs you do not want to migrate from the `migrate_list` section:
```shell
cat <<EOF > data.json
{
  "project_id_src": "your-source-project",
  "project_id_tgt": "your-target-project",
  "dataset_cdc_processed": "your-cdc-processed-dataset",
  "dataset_reporting_tgt": "your-reporting-target-dataset-OR-SAP_REPORTING",
  "k9_datasets_processing": "your-k9-processing-dataset-OR-K9_PROCESSING",
  "migrate_list": [
    "holiday_calendar",
    "trends",
    "weather",
    "currency_conversion",
    "inventory_snapshots",
    "prod_hierarchy_texts",
    "hier_reader"
  ]
}
EOF
```
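Before rendering the template, it can be worth confirming that the file you just created parses as JSON. A quick check using only the Python standard library (`jq . data.json` works just as well if you have `jq` installed):

```shell
# Sanity check: confirm data.json is well-formed JSON before rendering.
python3 -m json.tool data.json > /dev/null 2>&1 && echo "data.json is valid JSON" || echo "data.json is missing or malformed"
```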
Here is what an example looks like, with `weather` and `trends` removed:
```json
{
  "project_id_src": "kittycorn-demo",
  "project_id_tgt": "kittycorn-demo",
  "dataset_cdc_processed": "CDC_PROCESSED",
  "dataset_reporting_tgt": "SAP_REPORTING",
  "k9_datasets_processing": "K9_PROCESSING",
  "migrate_list": [
    "holiday_calendar",
    "currency_conversion",
    "inventory_snapshots",
    "prod_hierarchy_texts",
    "hier_reader"
  ]
}
```
- Create an output folder:

```shell
mkdir output
```
- Now generate the parsed migration script (assuming you are at the root of the repository):

```shell
jinja -d data.json -o output/migrate_external_dags.sql docs/external_dag_migration/scripts/migrate_external_dags.sql
```
- Examine the output SQL file and execute it in BigQuery to migrate your tables to their new locations.
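For orientation before executing it, the generated SQL conceptually contains one copy statement per table in `migrate_list`. The loop below is a hypothetical sketch that prints what such statements might look like for the K9-bound tables, using the project and dataset names from the example `data.json` above; the authoritative statements are whatever the repository template renders:

```shell
# Hypothetical sketch only: the real migration SQL comes from the jinja template.
# Prints one BigQuery copy statement per K9-bound table.
for t in holiday_calendar trends weather; do
  echo "CREATE TABLE \`kittycorn-demo.K9_PROCESSING.${t}\` COPY \`kittycorn-demo.CDC_PROCESSED.${t}\`;"
done
```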
Back up the current DAG files in your Airflow bucket. Then, replace them with the newly generated files from your Cortex Data Foundation v5.0 deployment.
The migration is now complete. You can now validate that all reporting views in the new v5.0 reporting deployment are working correctly.
If everything works properly, go through the above process again, this time targeting the v5.0 deployment at your production Reporting dataset.
Afterwards, you can remove the old tables using the following script.

**WARNING: THIS STEP PERMANENTLY REMOVES YOUR OLD DAG TABLES AND CANNOT BE UNDONE!** Only execute this step after all validation is complete, and consider taking backups of these tables first.

```shell
jinja -d data.json -o output/delete_old_dag_tables.sql docs/external_dag_migration/scripts/delete_old_dag_tables.sql
```
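As with the migration script, the exact deletion SQL comes from the repository template; conceptually it should amount to one `DROP TABLE` per old table. A hypothetical sketch (names taken from the example `data.json` above):

```shell
# Hypothetical sketch only: the real deletion SQL comes from the jinja template.
for t in holiday_calendar trends weather; do
  echo "DROP TABLE IF EXISTS \`kittycorn-demo.CDC_PROCESSED.${t}\`;"
done
```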