<div style="color:#fff; background:#0070c0; font-weight:bold; border:2px solid #0070c0; padding:10px; border-radius:6px;">
**Purpose:** This notebook demonstrates how to use <span style="color:yellow;">Spark Expectations</span> with <span style="color:yellow;">Slack notifications</span>.<br>
Its main focus is to show how data quality alerts and results can be sent to Slack channels.
</div>

### Spark - Expectations - User - Guide - Documentation

<div style="color:red; font-weight:bold; border:2px solid red; padding:8px;">
⚠️ ALERT: This notebook is meant to be run inside the local Docker Compose setup (containers/compose.yaml)!
</div>

* Please read through the [Spark Expectations documentation](https://engineering.nike.com/spark-expectations) before proceeding with this demo

#### Widgets
* `catalog`, `schema` - leave the default values
  * Table names include the value entered in the `user` widget text field (for example `se_<user>_rules`)
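
As a rough illustration of how the `user` value ends up in the table names, the sketch below mirrors the sanitizing regex and naming scheme used later in this notebook's `CONFIG` cell. The helper name `table_names` is hypothetical and exists only for this example; the default catalog and schema are the widget defaults.

```python
import re


def table_names(user_input: str, catalog: str = "development", schema: str = "team_name") -> dict:
    """Hypothetical helper: derive per-user table names the way the CONFIG cell does."""
    # Keep ASCII letters only, then lowercase (same regex as the notebook).
    user = re.sub(r"[^a-zA-Z]", "", user_input).lower()
    return {
        kind: f"{catalog}.{schema}.se_{user}_{kind}"
        for kind in ("rules", "stats", "target")
    }


names = table_names("Jane.Doe-42")
print(names["rules"])  # development.team_name.se_janedoe_rules
```

Digits and punctuation are stripped from the user value, so two users whose names differ only in those characters would share the same tables.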

<div style="color:orange; font-weight:regular; border:1px solid orange; padding:8px;">
⚠️ The container ships with Spark Expectations preinstalled. If the SE version is overridden, the kernel must be restarted!
</div>

* `Override SE version` - check box; tick it to install a different Spark Expectations library version
* `library_source` - combo box that selects where the library is pulled from (PyPI or a git branch)
  * `pypi` (installs the latest published version available on PyPI)
  * `git` (installs the library from the specified git branch or commit)
    * Set the `git_org` field to the GitHub organization or user that hosts the repository, and the `git_branch_or_commit` field to the branch or commit (example `main`)
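
The mapping from these widget values to a `pip` install target can be sketched as follows. This is a minimal illustration, assuming the same GitHub URL pattern that the install cell of this notebook uses; the function name `pip_target` and the organization `example-org` are hypothetical.

```python
def pip_target(library_source: str, git_org: str = "", git_ref: str = "main") -> str:
    """Hypothetical helper: return the argument that would be passed to `pip install`."""
    if library_source == "pypi":
        # Latest published release on PyPI.
        return "spark-expectations"
    if library_source == "git":
        if not git_org:
            raise ValueError("git_org is required when library_source is 'git'")
        # Same URL pattern as the install cell in this notebook.
        return f"git+https://github.com/{git_org}/spark-expectations.git@{git_ref}"
    raise ValueError(f"Unknown library_source: {library_source!r}")


print(pip_target("pypi"))
print(pip_target("git", git_org="example-org", git_ref="main"))
```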

# Method 1: Use Webhook URL for Notebook Testing

This is the default for the example notebook. The webhook URL is defined once and reused throughout the notebook, which keeps setup quick.
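
Because the webhook URL is typed in by hand, a quick sanity check before using it can catch a forgotten placeholder. The sketch below is only an illustration: the helper `check_webhook_url` is hypothetical, and the sample URL is a made-up value in the usual shape of a Slack incoming webhook.

```python
from urllib.parse import urlparse

PLACEHOLDER = "your_webhook_url"  # the default value used in this notebook


def check_webhook_url(url: str) -> str:
    """Hypothetical helper: reject the placeholder and anything that is not an https URL."""
    parsed = urlparse(url)
    if url == PLACEHOLDER:
        raise ValueError("webhook_url still holds the placeholder value")
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"webhook_url does not look like an https URL: {url!r}")
    return url


# Made-up URL for illustration only; it does not point at a real webhook.
print(check_webhook_url("https://hooks.slack.com/services/T000/B000/XXXX"))
```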

In [None]:
# Create a Databricks secret scope and store the webhook URL in it
# (afterwards, check that the secret can be read back)
from databricks.sdk import WorkspaceClient
from pyspark.errors import PySparkException

w = WorkspaceClient()
w.secrets.list_scopes()
scope = "slack"
# Enter your Slack webhook URL to use in the notebook (or retrieve it from the workspace)
webhook_url = "your_webhook_url"  # replace with your actual webhook URL

try:
    w.secrets.create_scope(scope=scope)
except PySparkException as pyex:
    print(f"PysparkException: {pyex}")
except Exception as e:
    print(f"Exception: {e}")

w.secrets.put_secret(scope=scope, key="webhook_url", string_value=webhook_url)

# Widget Setup

Widgets used in this notebook will be created and then set.

In [None]:
# GENERATE INPUT WIDGETS 

import ipywidgets as widgets
from IPython.display import display


widget_user = widgets.Text(
    value='user',
    placeholder='Type something',
    description='user: ',
    disabled=False,
    style={'description_width': '100px'}    
)

widget_git_org = widgets.Text(
    value='catalog_name',
    placeholder='Type something',
    description='git_org ',
    disabled=False,
    style={'description_width': '100px'}    
)

widget_catalog = widgets.Text(
    value='development',
    placeholder='Type something',
    description='catalog:',
    disabled=False,
    style={'description_width': '100px'}    
)

widget_schema = widgets.Text(
    value='team_name',
    placeholder='Type something',
    description='schema:',
    disabled=False,
    style={'description_width': '100px'}
)

widget_library_source = widgets.Combobox(
    placeholder='Choose source',
    options=['pypi', 'git'],
    description='library_source:',
    ensure_option=True,
    value='git',
    disabled=False,
    style={'description_width': '100px'}
)

widget_git_branch_or_commit = widgets.Text(
    value='main',
    placeholder='Type branch name or commit hash',
    description='git_branch_or_commit:',
    disabled=False,
    style={'description_width': '150px'}
)

widget_override_version = widgets.Checkbox(
    value=False,
    description='Override SE version',
    disabled=False,
    style={'description_width': '30px'}
    
)

hbox = widgets.HBox([
    widget_user,
    widget_catalog, 
    widget_schema,
    widget_override_version, 
    widget_library_source, 
    widget_git_org,
    widget_git_branch_or_commit
])

# Display widgets
display(hbox)

In [None]:
import re
import pandas as pd

user = re.sub(r'[^a-zA-Z]', '', widget_user.value).lower()
catalog = widget_catalog.value
schema = widget_schema.value
override_se_version = widget_override_version.value
library = widget_library_source.value
org = widget_git_org.value
branch_or_commit = widget_git_branch_or_commit.value

CONFIG = {
    "owner": user,
    "catalog": catalog,
    "schema": schema,
    "user": user,
    "product_id": f"se_{user}_product",
    "in_memory_source": f"se_{user}_source",
    "rules_table": f"{catalog}.{schema}.se_{user}_rules",
    "stats_table": f"{catalog}.{schema}.se_{user}_stats",
    "target_table": f"{catalog}.{schema}.se_{user}_target",
    "override_se_version" : override_se_version,
    "library": library,
    "org": org,
    "branch_or_commit": branch_or_commit
}

config_df = pd.DataFrame(list(CONFIG.items()), columns=['Key', 'Value'])

# Install Spark Expectations

When running from the local container, the latest spark-expectations library is already installed.
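
To see which version is currently available in the kernel before deciding whether to override it, something like the following can be used. It relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `spark-expectations` is the one used by the `pip install` command in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "spark-expectations") -> Optional[str]:
    """Hypothetical helper: return the installed version string, or None if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version() or "spark-expectations is not installed in this environment")
```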

In [None]:
# Override Spark Expectations based on user input
if override_se_version:
    print("-----OVERRIDING SPARK-EXPECTATIONS VERSION")
    if CONFIG["library"] == "pypi":
        print("-----INSTALLING SPARK-EXPECTATIONS from PyPI")
        %pip install spark-expectations
    elif CONFIG["library"] == "git":
        print(f"-----INSTALLING SPARK-EXPECTATIONS from Git Org/User {CONFIG['org']}, Branch/Commit {CONFIG['branch_or_commit']}")
        giturl = f"git+https://github.com/{CONFIG['org']}/spark-expectations.git@{CONFIG['branch_or_commit']}"
        %pip install --force-reinstall {giturl}
else:
    print("---- Using SparkExpectation from local codebase")

In [None]:
# CREATE SPARK SESSION AND DATABASE
from pyspark.sql import SparkSession

# Create or get a Spark session
spark = SparkSession.builder \
    .appName("Spark SQL Example") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0") \
    .getOrCreate()

In [None]:
databases_df = spark.sql("SHOW DATABASES")
databases_df.show(truncate=False)

tables_df = spark.sql("SHOW TABLES")
tables_df.show(truncate=False)

In [None]:
db_name = f"{CONFIG['catalog']}.{CONFIG['schema']}"
pattern = f"se_{CONFIG['user']}*"

# Set the current catalog and schema
spark.sql(f"USE {CONFIG['catalog']}.{CONFIG['schema']}")

# Drop tables matching pattern
tables_df = spark.sql(f"SHOW TABLES IN {db_name} LIKE '{pattern}'")
tables_to_drop = [row for row in tables_df.collect() if not row["isTemporary"] ]

if tables_to_drop:
    print(f"Found {len(tables_to_drop)} tables to drop.")
    for row in tables_to_drop:
        table_name = row["tableName"]
        spark.sql(f"DROP TABLE IF EXISTS {db_name}.{table_name}")
        print(f"Dropped table: {db_name}.{table_name}")
else:
    print("----- No tables to drop")

# Drop global and local temp views matching pattern

views_df = spark.sql(f"SHOW VIEWS in {db_name} LIKE '{pattern}'")
views_to_drop = views_df.collect()

if views_to_drop:
    print(f"Found {len(views_to_drop)} views to drop.")
    for row in views_to_drop:
        view_name = row["viewName"]
        spark.sql(f"DROP VIEW IF EXISTS {view_name}")
        print(f"Dropped view: {view_name}")
else:
    print("----- No views to drop")

In [None]:
# Getting Started with Spark Expectations: Simple Example

## 1. Sample Source Dataset
# The sample input is built as a Pandas DataFrame and converted to a Spark DataFrame
# in the "Run Spark Expectations" cell further down; this cell only sets up the imports.

import pandas as pd
from pyspark.sql import SparkSession

## 2. Define Simple `row_dq` Rules
# Create a rules DataFrame with a few simple data quality rules

rules_data = [
    {
        "product_id": CONFIG["product_id"],
        "table_name": CONFIG["target_table"],
        "rule_type": "row_dq",
        "rule": "age_not_null",
        "column_name": "age",
        "expectation": "age IS NOT NULL",
        "action_if_failed": "warn",
        "tag": "completeness",
        "description": "Age must not be null",
        "enable_for_source_dq_validation": True,
        "enable_for_target_dq_validation": True,
        "is_active": True,
        "enable_error_drop_alert": False,
        "error_drop_threshold": 0,
        "priority": "medium",
    },
    {
        "product_id": CONFIG["product_id"],
        "table_name": CONFIG["target_table"],
        "rule_type": "row_dq",
        "rule": "age_adult",
        "column_name": "age",
        "expectation": "age < 20",
        "action_if_failed": "ignore",
        "tag": "validity",
        "description": "Age must be less than 20",
        "enable_for_source_dq_validation": True,
        "enable_for_target_dq_validation": True,
        "is_active": True,
        "enable_error_drop_alert": False,
        "error_drop_threshold": 0,
        "priority": "medium",
    },
    {
        "product_id": CONFIG["product_id"],
        "table_name": CONFIG["target_table"],
        "rule_type": "row_dq",
        "rule": "email_not_null",
        "column_name": "email",
        "expectation": "email IS NOT NULL",
        "action_if_failed": "drop",
        "tag": "completeness",
        "description": "Email must not be null",
        "enable_for_source_dq_validation": True,
        "enable_for_target_dq_validation": True,
        "is_active": True,
        "enable_error_drop_alert": False,
        "error_drop_threshold": 0,
        "priority": "medium",
    },
]
rules_df = spark.createDataFrame(pd.DataFrame(rules_data))
rules_df.write.format("delta").mode("overwrite").saveAsTable(CONFIG['rules_table'])

rules_df.show(truncate=False)
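
Since the rules are written as a list of plain dictionaries, a misspelled key silently becomes an extra column with missing values once the list is passed to `pd.DataFrame`. The sketch below shows one way to catch that before creating the rules table. The helper `inconsistent_rules` is hypothetical, and the sample data is a shortened stand-in for `rules_data` with a deliberate typo.

```python
def inconsistent_rules(rules: list) -> list:
    """Hypothetical helper: list rules whose keys differ from those of the first rule.

    Returns tuples of (index, missing_keys, unexpected_keys).
    """
    if not rules:
        return []
    expected = set(rules[0])
    problems = []
    for index, rule in enumerate(rules[1:], start=1):
        keys = set(rule)
        if keys != expected:
            problems.append((index, sorted(expected - keys), sorted(keys - expected)))
    return problems


# Shortened stand-in for rules_data; the second rule contains a deliberate typo.
sample_rules = [
    {"rule": "age_not_null", "action_if_failed": "warn"},
    {"rule": "email_not_null", "action_if_faild": "drop"},
]
print(inconsistent_rules(sample_rules))  # [(1, ['action_if_failed'], ['action_if_faild'])]
```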

# Notification Configurations for Slack

You can supply the webhook URL in one of the following ways:
- setting the webhook URL manually
- using Databricks secret storage (assuming the secret is already stored in your workspace)
- using Cerberus secret storage
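
Whichever option supplies the URL, it can help to confirm that the webhook itself works before involving Spark Expectations. Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field; the sketch below only builds such a request with the standard library and leaves the actual send commented out. The helper name `build_slack_request` and the sample URL are made up for this example.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a Slack incoming-webhook POST request."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up URL for illustration only; replace it with your real webhook URL before sending.
request = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Spark Expectations webhook test",
)
print(request.get_method(), request.full_url)

# To actually send the message, uncomment the lines below:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```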

In [None]:
# Configure streaming and notification configuration
from spark_expectations.config.user_config import Constants as user_config
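# `dbruntime` is only available on Databricks clusters. Uncomment this import together with
# the get_context() lines below when running there.
# from dbruntime.databricks_repl_context import get_context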

# This is a dictionary that can be used to configure Spark Expectations behavior and override default settings.
stats_streaming_config_dict = {
    user_config.se_enable_streaming: False,
}

# For local Spark environment, we don't need Databricks workspace info
# dbx_workspace_id = get_context().workspaceId
# dbx_workspace_url = get_context().browserHostName

user_conf_dict = {
    user_config.se_notifications_enable_slack: True,
    # Slack Configuration - where you supply the webhook URL for your Slack channel
    user_config.se_notifications_slack_webhook_url: webhook_url,  # Replace with your actual webhook URL

    # To read the webhook URL from Cerberus instead, fill in cbs_url and cbs_sdb_path for the location of your secret.
    # user_config.secret_type: "cerberus",
    # user_config.cbs_url: "https://cerberus.com",
    # user_config.cbs_sdb_path: "app/your/sdb/path",

    # Optionally configure additional Slack settings
    # user_config.se_notifications_slack_channel: "#data-quality-alerts",
    # user_config.se_notifications_slack_username: "Spark Expectations Bot",
    # user_config.se_notifications_slack_icon_emoji: ":warning:",

    # Enable detailed results in notifications
    user_config.se_enable_query_dq_detailed_result: True,

    # Notification triggers
    user_config.se_notifications_on_start: True,
    user_config.se_notifications_on_completion: True,
    user_config.se_notifications_on_fail: True,
    user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
    user_config.se_notifications_on_rules_action_if_failed_set_ignore: True,
    user_config.se_notifications_on_error_drop_threshold: 1,
}

In [None]:
## 3. Run Spark Expectations

from pyspark.sql import DataFrame

from spark_expectations.core import load_configurations

from spark_expectations.core.expectations import (
    SparkExpectations,
    WrappedDataFrameWriter,
)


writer = WrappedDataFrameWriter().mode("overwrite").format("delta")


# Initialize Default Config 
load_configurations(spark) 

"""
This class supports running data quality rules on a DataFrame returned by a function

Args:
    product_id: Name of the product
    rules_df: DataFrame which contains the rules. The user is responsible for reading
        the rules table from whichever system it is stored in
    stats_table: Name of the table where the stats/audit info needs to be written
    debugger: Set to True to enable debugger mode; defaults to False
    stats_streaming_options: Options to override the defaults when writing into the stats streaming table
"""
se = SparkExpectations(
    product_id=CONFIG["product_id"],
    rules_df=rules_df,
    stats_table=CONFIG["stats_table"],
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    stats_streaming_options=stats_streaming_config_dict,
)

#  Initialize input data
data = [
    {"id": 1, "age": 19,   "email": "alice@example.com"},
    {"id": 2, "age": 17,   "email": "bob@example.com"},
    {"id": 3, "age": None, "email": "charlie@example.com"},
    {"id": 4, "age": 40,   "email": "mike@example.com"},
    {"id": 5, "age": None, "email": "ron@example.com"},
    {"id": 6, "age": 35,   "email": None},
]
input_df = spark.createDataFrame(pd.DataFrame(data))
input_df.show(truncate=False)

"""
This decorator wraps a function which returns a DataFrame and applies the data quality rules to it

Args:
    target_table: Name of the table where the final DataFrame needs to be written
    write_to_table: Set to True if the DataFrame needs to be written as a table
    write_to_temp_table: Set to True if the input DataFrame needs to be written to a temp table to break
                        the Spark plan
    user_conf: Options to override the defaults (here, the Slack notification settings in user_conf_dict)
    target_table_view: This view is created after the _row_dq process to run the target agg_dq and query_dq.
        If no value is provided, it defaults to {target_table}_view
    target_and_error_table_writer: Writer used for the target and error tables;
        this takes precedence over the class-level writer

Returns:
    Any: Returns a function which applies the expectations to the dataset
"""


@se.with_expectations(
    target_table=CONFIG["target_table"],
    write_to_table=True,
    write_to_temp_table=True,
    user_conf=user_conf_dict,
)
def get_dataset():
    _df_source: DataFrame = input_df
    _df_source.createOrReplaceTempView(CONFIG["in_memory_source"])
    return _df_source


# This will run the DQ checks and raise if any "fail" rules are violated
get_dataset()