# Snowflake Notebook Data Engineering

* Author: Jeremiah Hansen
* Last Updated: 6/11/2024

Welcome to the beginning of the Quickstart! Please refer to [the official Snowflake Notebook Data Engineering Quickstart](https://quickstarts.snowflake.com/guide/data_engineering_with_notebooks/index.html?index=..%2F..index#0) for all the details, including the setup steps.

## Step 03 Setup Snowflake

During this step we will create our demo environment. Update the SQL variables below with your GitHub username and Personal Access Token (PAT) as well as with your forked GitHub repository information.

**Important**: Please make sure you have created the `dev` branch in your forked repository before continuing here. For instructions please see [Step 2 in the Quickstart](https://quickstarts.snowflake.com/guide/data_engineering_with_notebooks/index.html?index=..%2F..index#1).
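
The GitHub values set in the cell below follow a fixed pattern, so they can be derived from a username. Here is a minimal sketch in plain Python, assuming your fork keeps the default repository name `sfguide-data-engineering-with-notebooks`. The `github_settings` helper is hypothetical and not part of the Quickstart; it only prints `SET` statements you could paste into the SQL cell.

```python
# Hypothetical helper (not part of the Quickstart): derive the GitHub values
# used by the SET statements from a GitHub username.
REPO_NAME = "sfguide-data-engineering-with-notebooks"  # assumes the fork keeps this name


def github_settings(username: str, repo_name: str = REPO_NAME) -> dict:
    """Return the URL prefix and repository origin for a GitHub fork."""
    username = username.strip()
    if not username or "/" in username:
        raise ValueError("username must be a non-empty GitHub user or organization name")
    prefix = f"https://github.com/{username}"
    return {
        "GITHUB_URL_PREFIX": prefix,
        "GITHUB_REPO_ORIGIN": f"{prefix}/{repo_name}.git",
    }


settings = github_settings("username")
for name, value in settings.items():
    # Print statements that can be pasted into the SQL cell.
    print(f"SET {name} = '{value}';")
```

Note that `GITHUB_REPO_ORIGIN` must start with `GITHUB_URL_PREFIX`, because the API integration created later only allows URLs under that prefix.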

In [None]:
SET MY_USER = CURRENT_USER();

SET GITHUB_SECRET_USERNAME = 'username';
SET GITHUB_SECRET_PASSWORD = 'personal access token';
SET GITHUB_URL_PREFIX = 'https://github.com/username';
SET GITHUB_REPO_ORIGIN = 'https://github.com/username/sfguide-data-engineering-with-notebooks.git';

In [None]:
-- ----------------------------------------------------------------------------
-- Create the account level objects (ACCOUNTADMIN part)
-- ----------------------------------------------------------------------------

USE ROLE ACCOUNTADMIN;

-- Roles
CREATE OR REPLACE ROLE DEMO_ROLE;
GRANT ROLE DEMO_ROLE TO ROLE SYSADMIN;
GRANT ROLE DEMO_ROLE TO USER IDENTIFIER($MY_USER);

GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE DEMO_ROLE;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE DEMO_ROLE;
GRANT EXECUTE MANAGED TASK ON ACCOUNT TO ROLE DEMO_ROLE;
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DEMO_ROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DEMO_ROLE;

-- Databases
CREATE OR REPLACE DATABASE DEMO_DB;
GRANT OWNERSHIP ON DATABASE DEMO_DB TO ROLE DEMO_ROLE;

-- Warehouses
CREATE OR REPLACE WAREHOUSE DEMO_WH WAREHOUSE_SIZE = XSMALL, AUTO_SUSPEND = 300, AUTO_RESUME = TRUE;
GRANT OWNERSHIP ON WAREHOUSE DEMO_WH TO ROLE DEMO_ROLE;

In [None]:
-- ----------------------------------------------------------------------------
-- Create the database level objects
-- ----------------------------------------------------------------------------
USE ROLE DEMO_ROLE;
USE WAREHOUSE DEMO_WH;
USE DATABASE DEMO_DB;

-- Schemas
CREATE OR REPLACE SCHEMA INTEGRATIONS;
CREATE OR REPLACE SCHEMA DEV_SCHEMA;
CREATE OR REPLACE SCHEMA PROD_SCHEMA;

USE SCHEMA INTEGRATIONS;

-- External Frostbyte objects
CREATE OR REPLACE STAGE FROSTBYTE_RAW_STAGE
    URL = 's3://sfquickstarts/data-engineering-with-snowpark-python/'
;

-- Secrets (schema level)
CREATE OR REPLACE SECRET DEMO_GITHUB_SECRET
  TYPE = password
  USERNAME = $GITHUB_SECRET_USERNAME
  PASSWORD = $GITHUB_SECRET_PASSWORD;

-- API Integration (account level)
-- This depends on the schema level secret!
CREATE OR REPLACE API INTEGRATION DEMO_GITHUB_API_INTEGRATION
  API_PROVIDER = GIT_HTTPS_API
  API_ALLOWED_PREFIXES = ($GITHUB_URL_PREFIX)
  ALLOWED_AUTHENTICATION_SECRETS = (DEMO_GITHUB_SECRET)
  ENABLED = TRUE;

-- Git Repository
CREATE OR REPLACE GIT REPOSITORY DEMO_GIT_REPO
  API_INTEGRATION = DEMO_GITHUB_API_INTEGRATION
  GIT_CREDENTIALS = DEMO_GITHUB_SECRET
  ORIGIN = $GITHUB_REPO_ORIGIN;

In [None]:
-- ----------------------------------------------------------------------------
-- Create the event table
-- ----------------------------------------------------------------------------
USE ROLE ACCOUNTADMIN;

CREATE EVENT TABLE DEMO_DB.INTEGRATIONS.DEMO_EVENTS;
GRANT SELECT ON EVENT TABLE DEMO_DB.INTEGRATIONS.DEMO_EVENTS TO ROLE DEMO_ROLE;
GRANT INSERT ON EVENT TABLE DEMO_DB.INTEGRATIONS.DEMO_EVENTS TO ROLE DEMO_ROLE;

ALTER ACCOUNT SET EVENT_TABLE = DEMO_DB.INTEGRATIONS.DEMO_EVENTS;
ALTER DATABASE DEMO_DB SET LOG_LEVEL = INFO;

## Step 04 Deploy to Dev

Finally we will use `EXECUTE IMMEDIATE FROM <file>` along with Jinja templating to deploy the Dev version of our Notebooks. We will directly execute the SQL script `scripts/deploy_notebooks.sql` from our Git repository which has the SQL commands to deploy a Notebook from a Git repo.

See [EXECUTE IMMEDIATE FROM](https://docs.snowflake.com/en/sql-reference/sql/execute-immediate-from) for more details.
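
To make the templating concrete, here is a rough sketch of what happens to the values passed with `USING`: each `{{ name }}` placeholder in the script is replaced before the SQL runs. The template text below is hypothetical (it is not the contents of `scripts/deploy_notebooks.sql`), and the `render` function is a simplified regex stand-in for Jinja that only handles plain variable placeholders.

```python
import re

# Hypothetical one-line template, for illustration only.
TEMPLATE = (
    "SELECT '{{ env }}_SCHEMA' AS TARGET_SCHEMA, "
    "'branches/{{ branch }}' AS SOURCE_BRANCH;"
)

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, **variables: str) -> str:
    """Replace each {{ name }} placeholder with the matching variable."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for template variable '{name}'")
        return variables[name]
    return PLACEHOLDER.sub(substitute, template)


rendered = render(TEMPLATE, env="DEV", branch="dev")
print(rendered)
```

Passing `env="PROD"` and `branch="main"` instead would render the same template for the production deployment, which is why a single script can serve both environments.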

In [None]:
USE ROLE DEMO_ROLE;
USE WAREHOUSE DEMO_WH;
USE SCHEMA DEMO_DB.INTEGRATIONS;

EXECUTE IMMEDIATE FROM @DEMO_GIT_REPO/branches/main/scripts/deploy_notebooks.sql
    USING (env => 'DEV', branch => 'dev');

## Step 05 Load Weather

But what about data that needs constant updating - like the WEATHER data? We would need to build a pipeline process to constantly update that data to keep it fresh.

Perhaps a better way to get this external data would be to source it from a trusted data supplier. Let them manage the data, keeping it accurate and up to date.

Enter the Snowflake Data Cloud...

Weather Source is a leading provider of global weather and climate data. Their OnPoint Product Suite provides businesses with the weather and climate data needed to quickly generate meaningful, actionable insights for a wide range of use cases across industries. Let's connect to the "Weather Source LLC: frostbyte" feed from Weather Source in the Snowflake Data Marketplace by following these steps in Snowsight:

* In the left navigation bar click on "Data Products" and then "Marketplace"
* Search: "Weather Source LLC: frostbyte" (and click on the tile in the results)
* Click the blue "Get" button
* Under "Options", adjust the Database name to read "FROSTBYTE_WEATHERSOURCE" (all capital letters)
* Grant to "DEMO_ROLE"

That's it... we don't have to do anything from here to keep this data updated. The provider will do that for us, and data sharing means we are always seeing whatever they have published.

In [None]:
/*---
-- You can also do it via code if you know the account/share details...
SET WEATHERSOURCE_ACCT_NAME = '*** PUT ACCOUNT NAME HERE AS PART OF DEMO SETUP ***';
SET WEATHERSOURCE_SHARE_NAME = '*** PUT ACCOUNT SHARE HERE AS PART OF DEMO SETUP ***';
SET WEATHERSOURCE_SHARE = $WEATHERSOURCE_ACCT_NAME || '.' || $WEATHERSOURCE_SHARE_NAME;

CREATE OR REPLACE DATABASE FROSTBYTE_WEATHERSOURCE
  FROM SHARE IDENTIFIER($WEATHERSOURCE_SHARE);

GRANT IMPORTED PRIVILEGES ON DATABASE FROSTBYTE_WEATHERSOURCE TO ROLE DEMO_ROLE;
---*/

In [None]:
-- Let's look at the data - same 3-part naming convention as any other table
SELECT * FROM FROSTBYTE_WEATHERSOURCE.ONPOINT_ID.POSTAL_CODES LIMIT 100;

## Step 06 Load Excel Files

Please follow the instructions in [Step 6 of the Quickstart](https://quickstarts.snowflake.com/guide/data_engineering_with_notebooks/index.html?index=..%2F..index#5) to open and run the `DEV_06_load_excel_files` Notebook. That Notebook will define the pipeline used to load data into the `LOCATION` and `ORDER_DETAIL` tables from the staged Excel files.

## Step 07 Load Daily City Metrics

Please follow the instructions in [Step 7 of the Quickstart](https://quickstarts.snowflake.com/guide/data_engineering_with_notebooks/index.html?index=..%2F..index#6) to open and run the `DEV_07_load_daily_city_metrics` Notebook. That Notebook will define the pipeline used to create the `DAILY_CITY_METRICS` table.

In [None]:
USE ROLE DEMO_ROLE;
USE WAREHOUSE DEMO_WH;
USE SCHEMA DEMO_DB.INTEGRATIONS;

SELECT TOP 100
  RECORD['severity_text'] AS SEVERITY,
  VALUE AS MESSAGE
FROM
  DEMO_DB.INTEGRATIONS.DEMO_EVENTS
WHERE 1 = 1
  AND SCOPE['name'] = 'demo_logger'
  AND RECORD_TYPE = 'LOG';
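
The query above filters on `SCOPE['name'] = 'demo_logger'`, which corresponds to the name of a Python logger. As a minimal sketch, assuming the pipeline Notebooks log through Python's standard `logging` module, the code below shows which messages would pass an `INFO` threshold. The message text and the in-memory `ListHandler` are illustrative and not taken from the Notebooks; the handler only stands in for the event table so the sketch can run locally.

```python
import logging

# Logger name matches the SCOPE['name'] filter in the query above.
logger = logging.getLogger("demo_logger")
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collect records in memory so the sketch can be inspected locally."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelname, record.getMessage()))


handler = ListHandler()
logger.addHandler(handler)

logger.debug("Dropped: below the INFO level")  # illustrative message
logger.info("Loading daily city metrics")      # illustrative message
logger.error("Something went wrong")           # illustrative message

for severity, message in handler.records:
    print(severity, message)
```

Only the `INFO` and `ERROR` messages are collected, which mirrors the `LOG_LEVEL = INFO` setting applied to `DEMO_DB` in Step 03.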

## Step 08 Orchestrate Pipelines

In this step we will create a DAG (or Directed Acyclic Graph) of Tasks using the new [Snowflake Python Management API](https://docs.snowflake.com/en/developer-guide/snowflake-python-api/snowflake-python-overview). The Task DAG API builds upon the Python Management API to provide advanced Task management capabilities. For more details see [Managing Snowflake tasks and task graphs with Python](https://docs.snowflake.com/en/developer-guide/snowflake-python-api/snowflake-python-managing-tasks).

This code is also available in the `scripts/deploy_task_dag.py` script which could be used to automate the Task DAG deployment.
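
A task graph must be acyclic so that every task can run after all of its predecessors. As a rough illustration in plain Python (using the standard-library `graphlib` module, not the Snowflake API), the two tasks defined in this step resolve to a single valid run order, while a circular dependency is rejected. The task names `A` and `B` in the second part are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks that must finish before it starts.
predecessors = {
    "LOAD_EXCEL_FILES_TASK": set(),
    "LOAD_DAILY_CITY_METRICS": {"LOAD_EXCEL_FILES_TASK"},
}

run_order = list(TopologicalSorter(predecessors).static_order())
print(run_order)

# A circular dependency has no valid run order, so it is not a DAG.
cycle_detected = False
try:
    list(TopologicalSorter({"A": {"B"}, "B": {"A"}}).static_order())
except CycleError:
    cycle_detected = True
print("cycle detected:", cycle_detected)
```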

In [None]:
# Import python packages
from snowflake.core import Root

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()

session.use_role("DEMO_ROLE")
session.use_warehouse("DEMO_WH")

In [None]:
database_name = "DEMO_DB"
schema_name = "DEV_SCHEMA"
#schema_name = "PROD_SCHEMA"
env = 'PROD' if schema_name == 'PROD_SCHEMA' else 'DEV'

session.use_schema(f"{database_name}.{schema_name}")

In [None]:
from snowflake.core.task.dagv1 import DAGOperation, DAG, DAGTask
from datetime import timedelta

# Create the tasks using the DAG API
warehouse_name = "DEMO_WH"
dag_name = "DEMO_DAG"

api_root = Root(session)
schema = api_root.databases[database_name].schemas[schema_name]
dag_op = DAGOperation(schema)

# Define the DAG
with DAG(dag_name, schedule=timedelta(days=1), warehouse=warehouse_name) as dag:
    dag_task1 = DAGTask("LOAD_EXCEL_FILES_TASK", definition=f'''EXECUTE NOTEBOOK "{database_name}"."{schema_name}"."{env}_06_load_excel_files"()''', warehouse=warehouse_name)
    dag_task2 = DAGTask("LOAD_DAILY_CITY_METRICS", definition=f'''EXECUTE NOTEBOOK "{database_name}"."{schema_name}"."{env}_07_load_daily_city_metrics"()''', warehouse=warehouse_name)

    # Define the dependencies between the tasks
    dag_task1 >> dag_task2 # dag_task1 is a predecessor of dag_task2

# Create the DAG in Snowflake
dag_op.deploy(dag, mode="orreplace")

In [None]:
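# List the deployed DAGs whose names match the pattern
# (the loop variable is named so it does not overwrite dag_name from the previous cell)
for deployed_dag_name in dag_op.iter_dags(like='demo_dag%'):
    print(deployed_dag_name)

# Uncomment to trigger a run of the DAG right away
#dag_op.run(dag)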

In [None]:
print('committing this to dev branch')

## Step 09 Deploy to Production

Steps:

1. Make a small change to a notebook and commit it to the `dev` branch
1. Go into GitHub, create a PR, and merge it into the `main` branch
1. Review the GitHub Actions workflow definition and run results
1. See the new "PROD_" versions of the Notebooks
1. Deploy the production version of the task DAG
1. Run the production version of the task DAG and see the new tables created!
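
The last two steps in the list reuse the Task DAG code from Step 08 with `schema_name` switched to `PROD_SCHEMA`. Here is a small sketch of how that one setting changes the Notebook each task executes. The `notebook_statement` helper is hypothetical; it mirrors the f-strings and the `env` rule used in Step 08.

```python
def notebook_statement(database_name: str, schema_name: str, notebook: str) -> str:
    """Build the EXECUTE NOTEBOOK statement for a DEV or PROD schema."""
    # Same rule as in Step 08: PROD_SCHEMA maps to PROD, anything else to DEV.
    env = "PROD" if schema_name == "PROD_SCHEMA" else "DEV"
    return f'EXECUTE NOTEBOOK "{database_name}"."{schema_name}"."{env}_{notebook}"()'


for schema in ("DEV_SCHEMA", "PROD_SCHEMA"):
    print(notebook_statement("DEMO_DB", schema, "06_load_excel_files"))
```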

## Step 10 Teardown

Finally, we will tear down our demo environment.

In [None]:
USE ROLE ACCOUNTADMIN;

DROP API INTEGRATION DEMO_GITHUB_API_INTEGRATION;
DROP DATABASE DEMO_DB;
DROP WAREHOUSE DEMO_WH;
DROP ROLE DEMO_ROLE;

-- Drop the weather share
DROP DATABASE FROSTBYTE_WEATHERSOURCE;

-- Remove the "dev" branch in your repo