![Top <](./images/watsonxdata.png "watsonxdata")

# Spark and watsonx.data Integration
This notebook demonstrates how Spark can connect to watsonx.data and manipulate the data. This system has a local Spark engine that will be used to access watsonx.data. The Spark engine is minimally configured, but it is sufficient to demonstrate the steps needed to connect to watsonx.data and access the data that resides in the catalogs. Special thanks to Daniel Hancock, whose work this notebook was derived from.

## Watsonx.data Development Systems Updates
A number of configuration changes were made to the watsonx.data development system in order for these examples to run. 
* The MinIO server must have the 9000 port exposed in order to communicate with it. The default configuration for the watsonx.data server on TechZone runs in `diag` mode which automatically exposes this port. If you have not started the development server in `diag` mode, you can use the `ibm-lh-dev/bin/expose-minio` command to open up this port.
* The Hive Metastore (lh-ibm-hive-metastore) container uses port 9083 to communicate with other programs. Unfortunately the 9083 port is not exposed in the container, so the container was modified to expose port 9083 and was restarted.
* Ports 9000 and 9083 are exposed ports on the TechZone server so in theory you should be able to use external tools to access these ports.

## Copy Spark Libraries
The Spark libraries that are used by this notebook need to be loaded into the local file system in order for the spark calls to work properly.

In [None]:
%system tar -xf /spark/spark.tgz -C /usr/local

## Environment Variables 
We need to make sure that a number of environment variables are set so that the Spark code can be accessed.

In [None]:
%env SPARK_HOME=/usr/local/spark
%env PYSPARK_DRIVER_PYTHON=jupyter
%env PYSPARK_DRIVER_PYTHON_OPTS=notebook
%env PATH=/usr/local/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin:/root/bin

## System Variables
In addition to the environment variables, we need to set some Python variables that will be used throughout the scripts. These settings are:
* minio_host - The URL of the MinIO server
* minio_port - The port that the MinIO server is using
* hive_host  - The URL of the Hive server
* hive_port  - The port that the Hive server is using

Note that these URLs and ports are for an internal connection in the watsonx.data development server. They will be different if you are connecting externally.

In [None]:
minio_host    = "watsonxdata"
minio_port    = "9000"
hive_host     = "watsonxdata"
hive_port     = "9083"
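As a minimal sketch, the host and port values above can be combined into the endpoint strings used later in the notebook: an `http://` URL for MinIO and a `thrift://` URI for the Hive Metastore. The helper name `endpoint` is hypothetical and is used here only for illustration.

```python
# Hypothetical helper: join a scheme, host, and port into an endpoint string.
def endpoint(scheme, host, port):
    return f"{scheme}://{host}:{port}"

# Values copied from the cell above (internal development-server addresses).
minio_host, minio_port = "watsonxdata", "9000"
hive_host, hive_port = "watsonxdata", "9083"

minio_endpoint = endpoint("http", minio_host, minio_port)
hive_uri = endpoint("thrift", hive_host, hive_port)

print(minio_endpoint)
print(hive_uri)
```

Building the strings in one place makes it easier to swap in external host names and ports later.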

# Load Demonstration Data
The `staging-bucket` directory contains three files that will be used in the Spark examples:
* customer.csv
* orders.csv
* products.csv

Rather than using the MinIO UI to create a new bucket and upload these files, we will be using the MinIO CLI which provides direct access to the MinIO system.

## MinIO CLI
In order to use the MinIO CLI, we must first register the MinIO server that we need to connect to. Before we do that, we need to extract the password for the MinIO service, along with some other credentials. The passwords for all of the services can be found in the `/certs/passwords` file on this server.

In [None]:
%cat /certs/passwords

The following code will extract all of the passwords and userids that are required for the MinIO and Spark connections.

In [None]:
hive_id           = None
hive_password     = None
minio_access_key  = None
minio_secret_key  = None
keystore_password = None 
cert_file         = "/certs/lh-ssl-ts.jks"

try:
    with open('/certs/passwords') as fd:
        certs = fd.readlines()
    for line in certs:
        args = line.split()
        if (len(args) >= 3):
            system   = args[0].strip()
            user     = args[1].strip()
            password = args[2].strip()
            if (system == "Minio"):
                minio_access_key = user
                minio_secret_key = password
            elif (system == "Thrift"):
                hive_id = user
                hive_password = password
            elif (system == "Keystore"):
                keystore_password = password
            else:
                pass
except OSError as e:
    print("Certificate file with passwords could not be found")
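The parsing logic can also be sketched as a small function that works on any list of lines, which makes it easy to try without the real file. This assumes, as the code above implies, that each useful line holds at least three whitespace-separated fields (system, user, password). The function name `parse_passwords` and the sample lines are made up for illustration and are not real credentials.

```python
# Sketch: the same parsing idea as a function that returns a dictionary.
def parse_passwords(lines):
    creds = {}
    for line in lines:
        args = line.split()
        if len(args) >= 3:
            # First three whitespace-separated fields: system, user, password
            system, user, password = args[0], args[1], args[2]
            creds[system] = (user, password)
    return creds

# Made-up sample content for illustration only (not real credentials).
sample = [
    "Minio     demo_access_key   demo_secret_key",
    "Thrift    demo_user         demo_password",
    "Keystore  unused            demo_keystore_password",
    "Incomplete line",   # fewer than three fields, so it is skipped
]

creds = parse_passwords(sample)
minio_access_key, minio_secret_key = creds["Minio"]
print(sorted(creds))
```

Returning a dictionary keeps the lookup by system name explicit. A missing entry raises a `KeyError` instead of silently leaving a variable set to `None`.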

### MinIO System Alias
Before running any commands against the MinIO server, an alias needs to be created that includes the access and secret key.

In [None]:
%system mc alias set watsonxdata http://{minio_host}:{minio_port} {minio_access_key} {minio_secret_key}

### List Buckets
The `mc` command provides a number of subcommands that allow us to manage buckets and the files within them. The following command lists the buckets so that we can check whether `staging-bucket` exists. This bucket is used for all of the Spark examples.

In [None]:
%system mc ls watsonxdata

If the staging bucket exists, we will delete the bucket and the contents.

In [None]:
%system mc rb --force watsonxdata/staging-bucket 

### Create a Bucket
At this point we will create the staging bucket that we are going to use to hold our data.

In [None]:
%system mc mb watsonxdata/staging-bucket

### Load Data
Next we will load the data from the `/staging-bucket` directory. Note that we need to use the full name of the bucket. The `mc` command allows you to select which files to place into a bucket, or to copy an entire directory with recursion. In this case we are only going to select the CSV files.

In [None]:
%system mc cp /notebooks/staging-bucket/*.csv watsonxdata/staging-bucket/

We can double-check that our files are there with the `mc tree` command and the `--files` option.

In [None]:
%system mc tree --files watsonxdata/staging-bucket/

# Spark Initialization

The next set of Python instructions will initialize the Spark connection. Once the connection is established to the engine, we need to update a number of values to provide credentials and a URL to the Hive and MinIO services.

### Initialize the Spark Connection
Initialize the settings for the Spark service.

In [None]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
import warnings
warnings.filterwarnings('ignore')

spark = SparkSession.builder.appName('sparky').getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
conf = sc.getConf()

### Watsonx.data Configuration Information
Once we have the configuration established, we need to update the values corresponding to our MinIO and Hive settings.

In [None]:
_ = conf.set("spark.sql.debug.maxToStringFields",                    "100")
_ = conf.set("fs.s3a.path.style.access",                             "true")
_ = conf.set("fs.s3a.impl",                                          "org.apache.hadoop.fs.s3a.S3AFileSystem")
_ = conf.set("fs.s3a.connection.ssl.enabled",                        "true")
_ = conf.set("spark.driver.extraJavaOptions",                        "-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true")

_ = conf.set("spark.sql.catalogImplementation",                      "hive")
_ = conf.set("spark.sql.extensions",                                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
_ = conf.set("spark.sql.iceberg.vectorization.enabled",              "false")

_ = conf.set("spark.sql.defaultCatalog",                             "iceberg_data")
_ = conf.set("spark.sql.catalog.iceberg_data",                       "org.apache.iceberg.spark.SparkCatalog")
_ = conf.set("spark.sql.catalog.iceberg_data.type",                  "hive")
_ = conf.set("spark.sql.catalog.iceberg_data.uri",                   f"thrift://{hive_host}:{hive_port}")

_ = conf.set("spark.hive.metastore.client.auth.mode",                "PLAIN")
_ = conf.set("spark.hive.metastore.client.plain.username",           hive_id)
_ = conf.set("spark.hive.metastore.client.plain.password",           hive_password)

_ = conf.set("spark.hive.metastore.use.SSL",                         "true")
_ = conf.set("spark.hive.metastore.truststore.type",                 "jks")
_ = conf.set("spark.hive.metastore.truststore.path",                 cert_file)
_ = conf.set("spark.hive.metastore.truststore.password",             keystore_password)
_ = conf.set("spark.hive.metastore.uris",                            f"thrift://{hive_host}:{hive_port}")

_ = conf.set("spark.hadoop.fs.s3a.endpoint",                         f"http://{minio_host}:{minio_port}")
_ = conf.set("spark.hadoop.fs.s3a.access.key",                       minio_access_key)
_ = conf.set("spark.hadoop.fs.s3a.secret.key",                       minio_secret_key)
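Several of the values above come from the password file, so they will still be `None` if the parsing step failed. A minimal sketch of a pre-flight check is shown below. The helper name `missing_settings` and the sample values are hypothetical and only illustrate the idea.

```python
# Hypothetical helper: return the names of settings whose value is None or empty.
def missing_settings(settings):
    return [key for key, value in settings.items() if value in (None, "")]

# Example values; in the notebook these would come from the parsed password file.
settings = {
    "spark.hive.metastore.client.plain.username": "demo_user",
    "spark.hive.metastore.client.plain.password": None,   # simulate a failed lookup
    "spark.hadoop.fs.s3a.access.key": "demo_access_key",
    "spark.hadoop.fs.s3a.secret.key": "",                  # simulate an empty value
}

missing = missing_settings(settings)
if missing:
    print("Missing values for:", ", ".join(missing))
```

Running a check like this before restarting Spark surfaces a credential problem early, rather than as a connection error later on.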

### Restart Spark with new Configuration
To make the configuration changes take effect, we need to stop the Spark service and recreate it with the new configuration information.

In [None]:
sc.stop()

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
conf = sc.getConf()

If you want to review the settings, run the following code to get the details.

In [None]:
c = sc.getConf().getAll()
for i in c:
    print(i)

### Spark SQL Helper Code
The following Python code will execute Spark SQL and report the success or error status of the call.

In [None]:
# sparksql - Run a SQL statement and display the results unless it is a DDL/DML statement
# sqltext     -> valid SQL statement

def sparksql(sqltext):

    if (sqltext is None or sqltext.strip() == ""):
        print("Invalid SQL command")
        return
        
    print(f"SQL: {sqltext}")
    keywords = sqltext.split()
    sqltype  = keywords[0].upper()

    try:
        if (sqltype in ["DROP","CREATE","INSERT","DELETE","UPDATE","ALTER"]):
            spark.sql(sqltext)  # use for sql statements that don't have results (create, drop, use, etc.) or to omit results output
            print("The SQL command completed successfully.\n") 
        else:
            spark.sql(sqltext).show() # show the results in table format
           
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        print("\n")
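The helper decides whether to display a result set by looking at the first keyword of the statement. That decision can be sketched on its own, without a Spark session, as follows. The function name `returns_rows` is hypothetical.

```python
# Statement types that the helper runs without displaying a result set.
NO_RESULT_KEYWORDS = {"DROP", "CREATE", "INSERT", "DELETE", "UPDATE", "ALTER"}

def returns_rows(sqltext):
    """Return True if the statement's first keyword suggests a result set."""
    keywords = sqltext.split()
    if not keywords:
        raise ValueError("Empty SQL statement")
    return keywords[0].upper() not in NO_RESULT_KEYWORDS

print(returns_rows("SELECT count(*) FROM bronze.customers"))
print(returns_rows("  drop table if exists bronze.customers"))
```

Because `split()` ignores leading whitespace and the keyword is upper-cased, indented or lower-case statements are classified the same way as upper-case ones.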

### Set variables used in notebook
The following variables will be used throughout the scripts. 

In [None]:
catalog        = "iceberg_data"
schemas        = ['bronze', 'silver', 'gold'] # don't change order of schemas in list
tables         = ['customers', 'products', 'orders']   # don't change order of tables in list, customer must be first
summ_tables    = ['customer_activity_summary', 'order_summary', 'product_category_summary']
iceberg_bucket = "iceberg-bucket"
staging_bucket = "staging-bucket"
files = {
    'customers': 'customer.csv',
    'products': 'products.csv',
    'orders': 'orders.csv'
}
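As a quick illustration of how these variables are combined in later cells, the sketch below builds a fully qualified table name and the object-store path of its source file. The variable names `bronze_customers` and `customer_path` are introduced here only for the example.

```python
catalog        = "iceberg_data"
schemas        = ['bronze', 'silver', 'gold']
tables         = ['customers', 'products', 'orders']
staging_bucket = "staging-bucket"
files = {
    'customers': 'customer.csv',
    'products': 'products.csv',
    'orders': 'orders.csv'
}

# Fully qualified Iceberg table name: catalog.schema.table
bronze_customers = f"{catalog}.{schemas[0]}.{tables[0]}"

# Object-store path of the CSV file that feeds that table
customer_path = f"s3a://{staging_bucket}/{files[tables[0]]}"

print(bronze_customers)
print(customer_path)
```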

### Example Reset - Drop Schemas and Tables
If you are running this notebook again, schemas and tables from the previous run will still exist and will cause subsequent cells to fail. Run this cell to make sure that those objects are deleted from the system. Note that you might get an error if an object does not exist.

In [None]:
# show tables in each schema
for schema in schemas:
    sparksql(
        f"SHOW TABLES from {schema}"
    )
  
# drop tables
for schema in schemas:
    for table in tables:
        sparksql(
            f"drop table if exists {schema}.{table}"
        )

# summary tables only exist in the gold schema (the last entry in schemas)
for table in summ_tables:
    sparksql(
        f"drop table if exists {schemas[2]}.{table}"
    )


# drop schemas
sparksql(
    "SHOW SCHEMAS IN iceberg_data"
)

for schema in schemas:
    sparksql(
        f"DROP SCHEMA IF EXISTS {schema}"
    )

### Check Iceberg Catalog
We can use Spark to examine the contents of a catalog by connecting to the Hive service. The following statement will show the schemas in the iceberg catalog.

In [None]:
sparksql(
    f"SHOW SCHEMAS FROM {catalog}"
)

# Data Organization in a Lakehouse
A typical data lakehouse can classify data into different tiers based on the state of the data - raw, filtered, and optimized. Literature sometimes refers to these levels as Bronze, Silver, and Gold based on how well the data is refined.

In this notebook, we are going to use the Bronze, Silver and Gold classification:
* Bronze - This is the raw data that is kept in its original form. Examples of this type of data include CSVs (Comma Separated Values), PDFs, images, JSON, and text documents. There has been no filtering or refinement done on these data sets. Bronze data typically sits in an object store to minimize storage costs.
* Silver - Silver data refers to Bronze objects which have been refined to produce a queryable object (table) that has had some filtering and cleansing applied to it. Silver data provides good query performance and the tables that are created can be stored in object store or in higher performance file systems.
* Gold - Gold data further refines the Silver data. Data may be combined from various sources to create a single object to remove the need for joins. The data will be filtered to remove any inconsistent data. Gold data can reside on object stores but usually for performance reasons it will reside on high-performance storage using proprietary database engines to optimize query performance.
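The three tiers can be illustrated with a toy, pure-Python sketch that uses made-up rows: raw CSV text plays the role of Bronze, parsed and filtered rows play the role of Silver, and an aggregate plays the role of Gold. This only shows the flow of refinement and does not use Spark or watsonx.data.

```python
import csv
import io

# Bronze: raw CSV text kept as-is (made-up sample rows).
bronze_csv = """order_id,product_id,quantity,unit_price
1,A,2,10.0
2,B,,5.0
3,A,1,10.0
"""

# Silver: parse the raw text, drop rows with missing fields, and apply types.
rows = list(csv.DictReader(io.StringIO(bronze_csv)))
silver = [
    {
        "order_id": int(r["order_id"]),
        "product_id": r["product_id"],
        "quantity": int(r["quantity"]),
        "unit_price": float(r["unit_price"]),
    }
    for r in rows
    if all(r.values())
]

# Gold: aggregate the cleansed rows into revenue per product.
gold = {}
for r in silver:
    revenue = r["quantity"] * r["unit_price"]
    gold[r["product_id"]] = gold.get(r["product_id"], 0.0) + revenue

print(len(rows), len(silver), gold)
```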


## Bronze Data Layer
The Bronze data layer will contain the raw data (CSV) files that will be refined in subsequent steps. This first command will create the `bronze` schema for loading the data.

In [None]:
sparksql(
    f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schemas[0]} LOCATION 's3a://{iceberg_bucket}/{schemas[0]}'"
)

Check that the schema was created.

In [None]:
sparksql(
    f"SHOW SCHEMAS FROM {catalog}"
)

### Ingest and validate tables
This code will read each CSV file from Object Storage and do the following:
1. Infer the schema from the CSV file
2. Show the schema from the CSV file
3. Display the first 3 rows

In [None]:
for file in files.values():
    print(f"File: {staging_bucket}/{file}")
    try:
        df_customer = spark.read \
        .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
        .option('header', True)\
        .option("inferSchema", True)\
        .option("samplingRatio", 0.25)\
        .csv(f"s3a://{staging_bucket}/{file}")

        df_customer.printSchema()
        df_customer.show(3)
    except Exception as e:
        print(f"An error occurred: {str(e)}")
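To give a feel for what schema inference does, here is a toy, pure-Python sketch that picks the narrowest of three types for each column of a small, made-up CSV sample. It is only an illustration of the idea behind the `inferSchema` option. It is not how Spark implements inference, and Spark supports many more types. The function name `infer_type` is hypothetical.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest of int, double, string that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

# Made-up sample data for illustration.
sample_csv = """product_id,product_name,price
1,Widget,9.99
2,Gadget,12
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
```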

### Show iceberg tables
Double-check that there are no tables in the `bronze` schema.

In [None]:
sparksql(
    f"SHOW TABLES from {catalog}.{schemas[0]}"
)

### Ingest customer table
This code will read the data (see the above example) and then immediately write it out to the `bronze` schema.

In [None]:
file = files['customers']
print(f"Ingesting customer table using spark")
print(f"File: {staging_bucket}/{file}")
   
try:
    df_customer = spark.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .option('header', True) \
    .option("inferSchema", True) \
    .option("samplingRatio", 0.25) \
    .csv(f"s3a://{staging_bucket}/{file}")   
        
    df_customer.printSchema()
    df_customer.show(3)
except Exception as e:
    print(f"An error occurred: {str(e)}")

try:
    df_customer.writeTo(f"{catalog}.{schemas[0]}.{tables[0]}") \
        .tableProperty("write.format.default", "parquet") \
        .createOrReplace()
    print("The INGEST command completed successfully.\n\n")  
except Exception as e:
    print(f"An error occurred: {str(e)}")

We check to make sure the customer file has been loaded into the system.

In [None]:
# customer table expected
sparksql(
    f"SHOW TABLES from {schemas[0]}"
)

### Describe the Customer table
We can use the DESCRIBE function to print the column names and types for the customer table.

In [None]:
sparksql(
    f"DESCRIBE {schemas[0]}.{tables[0]}"
)

### Display Data
We can check to see whether or not the data is in the customer table.

In [None]:
sparksql( 
    f"SELECT * FROM {schemas[0]}.{tables[0]}"
)

Just to double-check, we make sure that the row count for the customer table is correct.

In [None]:
sparksql(
    f"SELECT count(*) FROM {schemas[0]}.{tables[0]}"
)
print("Table customers expected row count: 102")

## Ingest Product and Orders Tables
We now ingest the other two tables, products and orders, into the `bronze` schema.

In [None]:
for table in tables[1:]: # skip customers table (ingested in previous step)
    file = files[table]  # look up the file name before printing it
    print(f"Ingesting {table} table using spark")
    print(f"File: {staging_bucket}/{file}")
    df_table = spark.read \
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
    .option('header', True) \
    .option("inferSchema", True) \
    .option("samplingRatio", 0.25) \
    .csv(f"s3a://{staging_bucket}/{file}")
    
    try:
        df_table.writeTo(f"{catalog}.{schemas[0]}.{table}") \
            .tableProperty("write.format.default", "parquet") \
            .createOrReplace()
        print("The INGEST command completed successfully.\n\n")  
    except Exception as e:
        print(f"An error occurred: {str(e)}")

### Bronze schema tables
All three tables should now reside in the `bronze` schema.

In [None]:
sparksql(
    f"SHOW TABLES from {catalog}.bronze"
)

Check the row count for the tables.

In [None]:
for table in tables:
    print(f"Table: {table}")
    sparksql(
        f"SELECT count(*) from bronze.{table}"
    )
    
print("customers expected rows: 102")
print("products  expected rows: 22")
print("orders    expected rows: 501")
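Instead of comparing the counts by eye, the comparison can be sketched as a small helper. The expected values are taken from the print statements above. The "actual" counts here are made up for illustration, and the function name `count_mismatches` is hypothetical. In the notebook the actual values could be read from `spark.sql(...).collect()` for each `count(*)` query.

```python
# Expected Bronze row counts, taken from the print statements above.
expected = {"customers": 102, "products": 22, "orders": 501}

def count_mismatches(actual, expected):
    """Return {table: (actual, expected)} for every table whose count differs."""
    return {
        table: (actual.get(table), want)
        for table, want in expected.items()
        if actual.get(table) != want
    }

# Made-up "actual" counts for illustration only.
actual = {"customers": 102, "products": 22, "orders": 500}
print(count_mismatches(actual, expected))
```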

# Silver Data Layer
Silver data contains objects which have been refined to produce a queryable object (table) that has had some filtering and cleansing applied to it. Silver data provides good query performance and the tables that are created can be stored in object store or in higher performance file systems.

## Add Silver Schema
We start by creating the `silver` schema that will be used for the refined tables.

In [None]:
sparksql(
    f"CREATE SCHEMA IF NOT EXISTS {catalog}.silver LOCATION 's3a://{iceberg_bucket}/{schemas[1]}'"
)

# show the schemas in the iceberg catalog
sparksql(
    f"SHOW SCHEMAS FROM {catalog}"
)

## Create and Cleanse the Customer table
The first step to cleansing the customer data is to check for fields that are invalid or empty. In the case of the customer table, we are going to remove customers who have an invalid `customer_id` or `customer_name`.

In [None]:
sparksql(
    f"SELECT count(*) from {catalog}.{schemas[0]}.{tables[0]}"
)

sparksql(
    f'''
    CREATE OR REPLACE TABLE {catalog}.silver.{tables[0]} AS
    SELECT
      customer_id,
      customer_name,
      email,
      phone_number
    FROM
      {catalog}.bronze.{tables[0]}
    WHERE
      customer_id IS NOT NULL
      AND customer_name IS NOT NULL
      ;
    '''
)

sparksql(
    f"SELECT count(*) from {catalog}.silver.{tables[0]}"
)
    
print(f"{catalog}.silver.{tables[0]} expected rows inserted: 100")
print(f"{catalog}.silver.{tables[0]} expected rows cleansed:   2")

## Create and Cleanse the Product table
The product table needs to be checked for an invalid product_id, product_name, category, or price.

In [None]:
sparksql(
    f"SELECT count(*) from {catalog}.bronze.{tables[1]}"
)

sparksql(
    f"""
    CREATE OR REPLACE TABLE {catalog}.silver.{tables[1]}  AS
    SELECT
      product_id,
      product_name,
      category,
      price
    FROM
      {catalog}.bronze.{tables[1]} 
    WHERE
      product_id IS NOT NULL
      AND product_name IS NOT NULL
      AND category IS NOT NULL
      AND price >= 0
      ;
    """
)

sparksql(
    f"SELECT count(*) from {catalog}.silver.{tables[1]}"
)
    
print(f"{catalog}.{schemas[1]}.{tables[1]} expected rows inserted:  20")
print(f"{catalog}.{schemas[1]}.{tables[1]} expected rows cleansed:   2")

## Create and Cleanse the Orders table
The final table, orders, needs six columns checked to make sure they are valid:
* order_id
* order_date
* customer_id
* product_id
* quantity
* unit_price

In [None]:
sparksql(
    f"SELECT count(*) from {catalog}.bronze.{tables[2]}"
)

sparksql(
    f"""
    CREATE OR REPLACE TABLE {catalog}.silver.{tables[2]}  AS
    SELECT
      order_id,
      order_date,
      customer_id,
      product_id,
      quantity,
      unit_price,
      unit_price * quantity as total_price -- add a total price column
    FROM
      {catalog}.bronze.{tables[2]}
    WHERE
      order_id IS NOT NULL
      AND order_date IS NOT NULL
      AND customer_id IS NOT NULL
      AND product_id IS NOT NULL
      AND quantity > 0
      AND unit_price >= 0
      ;
    """
)

sparksql(
    f"SELECT count(*) from {catalog}.silver.{tables[2]}"
)

print(f"{catalog}.silver.{tables[2]} expected rows inserted: 500")
print(f"{catalog}.silver.{tables[2]} expected rows cleansed:   1")
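When several validity checks are combined in a `WHERE` clause, `AND` keeps only the rows that pass every check, while `OR` keeps any row that passes at least one. The difference can be seen in a small pure-Python sketch with made-up rows, where `None` stands in for SQL `NULL`.

```python
# Made-up order rows; None stands in for SQL NULL.
orders = [
    {"order_id": 1, "customer_id": 10, "quantity": 2},
    {"order_id": 2, "customer_id": None, "quantity": 1},   # missing customer_id
]

def checks(row):
    """Return one boolean per validity check for the row."""
    return [
        row["order_id"] is not None,
        row["customer_id"] is not None,
        row["quantity"] is not None and row["quantity"] > 0,
    ]

# AND semantics: every check must pass.
kept_with_and = [r["order_id"] for r in orders if all(checks(r))]
# OR semantics: a single passing check is enough.
kept_with_or = [r["order_id"] for r in orders if any(checks(r))]

print(kept_with_and)
print(kept_with_or)
```

For cleansing, the `AND` form is the one that removes a row as soon as any single column is invalid.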

# Gold Data Layer
The Gold data further refines the Silver data and can be considered "business-level" because it is treated as a trusted source of information. Data may be combined from various sources to create a single object to remove the need for joins. Gold data can reside on object stores, but for performance reasons it will usually reside on high-performance storage using proprietary database engines to optimize query performance.

## Add Gold Schema
We start by creating the `gold` schema that will be used for the refined Silver tables.

In [None]:
sparksql(
    f"CREATE SCHEMA IF NOT EXISTS {catalog}.gold LOCATION 's3a://{iceberg_bucket}/{schemas[2]}'"
)

# show the schemas in the iceberg catalog
sparksql(
    f"SHOW SCHEMAS FROM {catalog}"
)

## Summary Tables
In order to improve query performance, a series of summary tables are created on each of the base tables found in the `silver` schema. These tables provide summary information on:
* Customer Activity
* Product Orders
* Product Categories

### Create Customer Activity Summary

In [None]:
sparksql(
    f"""
    CREATE OR REPLACE TABLE {catalog}.gold.customer_activity_summary  AS
    SELECT
        o.customer_id,
        COUNT(DISTINCT o.order_id) AS total_orders_placed,
        SUM(o.total_price) AS total_revenue
    FROM
        {catalog}.silver.{tables[2]} o
    JOIN
         {catalog}.silver.{tables[0]} c ON o.customer_id = c.customer_id
    GROUP BY
        o.customer_id;
    """
)

### Create Order Summary

In [None]:
sparksql(
    f"""
    CREATE OR REPLACE TABLE {catalog}.gold.order_summary AS
    SELECT
        product_id,
        SUM(quantity) AS total_quantity,
        SUM(total_price) AS total_revenue
    FROM
        {catalog}.silver.{tables[2]}
    GROUP BY
        product_id;
    """
)
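The `GROUP BY` aggregation behind a summary table such as `order_summary` can be sketched in plain Python with a dictionary keyed by `product_id`. The rows below are made up and only mimic the shape of the `silver` orders table.

```python
from collections import defaultdict

# Made-up rows shaped like the silver orders table.
orders = [
    {"product_id": 1, "quantity": 2, "total_price": 20.0},
    {"product_id": 2, "quantity": 1, "total_price": 5.0},
    {"product_id": 1, "quantity": 3, "total_price": 30.0},
]

# One accumulator per product, equivalent to GROUP BY product_id.
summary = defaultdict(lambda: {"total_quantity": 0, "total_revenue": 0.0})
for row in orders:
    entry = summary[row["product_id"]]
    entry["total_quantity"] += row["quantity"]    # SUM(quantity)
    entry["total_revenue"] += row["total_price"]  # SUM(total_price)

for product_id, totals in sorted(summary.items()):
    print(product_id, totals["total_quantity"], totals["total_revenue"])
```

Storing the aggregated result once means that reports can read one small table instead of rescanning every order.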

### Create Product Category Summary

In [None]:
sparksql(
    f"""
    CREATE OR REPLACE TABLE {catalog}.gold.product_category_summary AS
    SELECT
        category,
        COUNT(*) AS product_count
    FROM
        {catalog}.silver.{tables[1]}
    GROUP BY
        category;
    """
)

## Sample Reports
The following examples show the use of `silver` and `gold` data being joined in queries.

### Report on Customer Activity: SILVER

In [None]:
sparksql(
    f"""
    SELECT
        c.customer_id,
        c.customer_name,
        COUNT(DISTINCT o.order_id) AS total_orders_placed,
        SUM(o.total_price) AS total_revenue
    FROM
         {catalog}.silver.customers c
    LEFT JOIN
         {catalog}.silver.orders o ON c.customer_id = o.customer_id
    GROUP BY
        c.customer_id, c.customer_name
    ORDER BY
        total_revenue DESC;
    """
)

### Report Total Revenue by Product Category: SILVER

In [None]:
sparksql(
    f"""
    SELECT
        p.category,
        SUM(o.total_price) AS total_revenue
    FROM
        {catalog}.silver.orders o
    JOIN
        {catalog}.silver.products p ON o.product_id = p.product_id
    GROUP BY
        p.category
    ORDER BY
        total_revenue DESC;
    """
)

### Report on Product Sales Summary: SILVER and GOLD

In [None]:
sparksql(
    f"""
    SELECT
        p_silver.product_id,
        p_silver.product_name,
        p_gold.total_quantity,
        p_gold.total_revenue
    FROM
        {catalog}.silver.products p_silver
    JOIN
        {catalog}.gold.order_summary p_gold ON p_silver.product_id = p_gold.product_id
    ORDER BY
        p_gold.total_revenue DESC;
    """
)

## Spark SQL with Insert/Update/Delete
We can use Spark SQL to insert, update, and delete records in the database. This first SQL statement will find all of the customers whose CUSTOMER_ID or CUSTOMER_NAME is NULL.

In [None]:
sparksql(
    f"""
    SELECT
        *
    FROM
        {catalog}.bronze.customers
    WHERE
        customer_id IS NULL OR
        customer_name IS NULL
    """
)

### Delete a Record
The following SQL will delete the customers that do not have a CUSTOMER_ID or a CUSTOMER_NAME.

In [None]:
sparksql(
    f"""
    DELETE 
    FROM
        {catalog}.bronze.customers
    WHERE
        customer_id IS NULL OR customer_name IS NULL
    """
)

We double-check to make sure the records have been removed.

In [None]:
sparksql(
    f"""
    SELECT
        *
    FROM
        {catalog}.bronze.customers
    WHERE
        customer_id IS NULL OR customer_name IS NULL
    """
)

### Insert a New Record
We can insert a new record into a table. The next SQL statement will add a new customer to the table.

In [None]:
sparksql(
    f"""
    INSERT INTO {catalog}.bronze.customers
    VALUES (
       99999,
       'A New Customer',
       '209 Somewhere Street',
       'newcustomer@ymail.com',
       '800-555-1212'
    )
    """
)

Check to see if our new customer exists.

In [None]:
sparksql(
    f"""
    SELECT
        *
    FROM
        {catalog}.bronze.customers
    WHERE
        customer_id = 99999
    """
)

### Update a Record
We can also update a record. The next statement will update the customer phone_number to a new value.

In [None]:
sparksql(
    f"""
    UPDATE {catalog}.bronze.customers
    SET 
       phone_number = '888-333-5555'
    WHERE
       customer_id = 99999
    """
)

Check to see if our customer has a new phone number.

In [None]:
sparksql(
    f"""
    SELECT
        *
    FROM
        {catalog}.bronze.customers
    WHERE
        customer_id = 99999
    """
)

## Reset Examples
Remove all tables and schemas.

In [None]:
# show tables in each schema
for schema in schemas:
    sparksql(
        f"SHOW TABLES from {schema}"
    )
  
# drop tables
for schema in schemas:
    for table in tables:
        sparksql(
            f"drop table if exists {schema}.{table}"
        )

# summary tables only exist in the gold schema (the last entry in schemas)
for table in summ_tables:
    sparksql(
        f"drop table if exists {schemas[2]}.{table}"
    )


# drop schemas
sparksql(
    "SHOW SCHEMAS IN iceberg_data"
)

for schema in schemas:
    sparksql(
        f"DROP SCHEMA IF EXISTS {schema}"
    )

Remove the bucket from MinIO.

In [None]:
%system mc rb --force watsonxdata/staging-bucket 

#### Credits: IBM 2024, George Baklarz [baklarz@ca.ibm.com], Daniel Hancock [daniel.hancock@us.ibm.com]