# Hive Catalog

This notebook shows how to register external tables in the Hive Metastore and how to interact with them as Delta and Iceberg tables.

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
from pyspark.sql import SparkSession

load_dotenv(find_dotenv("../.env", raise_error_if_not_found=True))

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages "
    "org.apache.hadoop:hadoop-aws:3.3.4,"
    "io.delta:delta-spark_2.12:3.3.0,"
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 "
    "pyspark-shell"
)

print("Initializing spark...")
print(os.getenv("AWS_ACCESS_KEY_ID"))
print(os.getenv("AWS_SECRET_ACCESS_KEY"))
spark = (
    SparkSession.builder.appName("Test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.spark_catalog.uri", "thrift://localhost:9083")
    .config("spark.sql.catalog.spark_catalog.warehouse", "s3a://lakehouse/")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .config("spark.sql.catalog.iceberg.uri", "thrift://localhost:9083")
    .config("spark.sql.catalog.iceberg.warehouse", "s3a://lakehouse/")
    .config("spark.hive.metastore.uris", "thrift://localhost:9083")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.warehouse.dir", "s3a://lakehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", os.getenv("AWS_ACCESS_KEY_ID"))
    .config("spark.hadoop.fs.s3a.secret.key", os.getenv("AWS_SECRET_ACCESS_KEY"))
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("WARN")

Initializing spark...
admin
password


25/08/12 10:27:13 WARN Utils: Your hostname, CPC-12806 resolves to a loopback address: 127.0.1.1; using 172.26.242.248 instead (on interface eth0)
25/08/12 10:27:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/arthur/dev/dbt-test/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/arthur/.ivy2/cache
The jars for the packages stored in: /home/arthur/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-spark_2.12 added as a dependency
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c0bf6ecc-e09d-44e2-b9fa-6538065c9ce7;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found io.delta#delta-spark_2.12;3.3.0 in central
	found io.delta#delta-storage;3.3.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
	found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.8.1 in central
:: resolution report :: resolve 265ms :: artifacts dl 16ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
	io.delta#delta-spark_2.12;3.3.0 from centr
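
The setup cell prints both credentials in plain text, so they end up in the saved notebook output. A minimal sketch of one alternative, assuming it is enough to confirm that the variables were loaded. The `mask` helper is hypothetical and not part of the original notebook.

```python
import os


def mask(value, visible=2):
    """Return value with all but the first `visible` characters replaced by '*'."""
    if not value:
        return "<missing>"
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Confirm the variables are set without echoing their full values.
for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    print(f"{name}: {mask(os.getenv(name))}")
```

The values passed to `.config(...)` are unchanged; only what is printed differs.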

In [2]:
spark.sql("SHOW CATALOGS;").show()

25/08/12 10:27:18 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


+-------------+
|      catalog|
+-------------+
|spark_catalog|
+-------------+



The Iceberg catalog appears in `SHOW CATALOGS` only after it has been used for the first time.

To select a catalog, pass the fully qualified table name, including the catalog and schema: `catalog.schema.table`.
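
As a small illustration of the naming scheme, the sketch below builds fully qualified, backtick-quoted names for the two tables used in this notebook. The `qualified_name` helper is hypothetical and exists only for this example.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a backtick-quoted `catalog`.`schema`.`table` identifier."""
    parts = (catalog, schema, table)
    # A literal backtick inside a quoted identifier is escaped by doubling it.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


delta_table = qualified_name("spark_catalog", "default", "sample_delta")
iceberg_table = qualified_name("iceberg", "default", "sample_iceberg")

print(delta_table)    # `spark_catalog`.`default`.`sample_delta`
print(iceberg_table)  # `iceberg`.`default`.`sample_iceberg`
```

The resulting strings can be interpolated into `spark.sql(...)` statements in place of the hand-written names.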

In [3]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS spark_catalog.default.sample_delta (
        id INT,
        name STRING
    )
    USING delta
    LOCATION 's3a://lakehouse/delta/sample_delta';
"""
)

25/08/12 10:27:23 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`default`.`sample_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
25/08/12 10:27:23 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


DataFrame[]

In [4]:
spark.sql("""
INSERT INTO spark_catalog.default.sample_delta VALUES
    (1, 'Alice'),
    (2, 'Bob'),
    (3, 'Charlie');
""")

25/08/12 10:27:27 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

DataFrame[]

In [5]:
spark.sql("SELECT * FROM spark_catalog.default.sample_delta;").show()

                                                                                

+---+-------+
| id|   name|
+---+-------+
|  3|Charlie|
|  3|Charlie|
|  1|  Alice|
|  1|  Alice|
|  2|    Bob|
|  2|    Bob|
+---+-------+



In [6]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.default.sample_iceberg (
        id INT,
        name STRING
    )
    USING iceberg
    LOCATION 's3a://lakehouse/iceberg/sample_iceberg';
"""
)

DataFrame[]

In [7]:
spark.sql("""
INSERT INTO iceberg.default.sample_iceberg VALUES
    (1, 'Alice'),
    (2, 'Bob'),
    (3, 'Charlie');
""")

25/08/12 10:27:36 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to iceberg/sample_iceberg/data/00001-167-5355873d-9e2c-4212-9c5a-0348b2a7566b-0-00001.parquet. This is unsupported
                                                                                

DataFrame[]

In [8]:
spark.sql("SELECT * FROM iceberg.default.sample_iceberg;").show()

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+



The `iceberg` catalog should now be listed.

In [9]:
spark.sql("SHOW CATALOGS;").show()

+-------------+
|      catalog|
+-------------+
|      iceberg|
|spark_catalog|
+-------------+
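
This is consistent with Spark loading catalog plugins lazily: a catalog configured under `spark.sql.catalog.<name>` is instantiated the first time a statement references it. A minimal sketch for checking this programmatically, assuming an active `spark` session configured as in the first cell. The helper name `loaded_catalogs` is hypothetical.

```python
def loaded_catalogs(spark):
    """Return the set of catalog names currently reported by SHOW CATALOGS."""
    return {row["catalog"] for row in spark.sql("SHOW CATALOGS").collect()}


# Usage inside the notebook (requires the `spark` session from the first cell):
# if "iceberg" not in loaded_catalogs(spark):
#     # Any statement that references the catalog should trigger loading it.
#     spark.sql("SHOW NAMESPACES IN iceberg").collect()
# print(sorted(loaded_catalogs(spark)))
```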



In [10]:
spark.sql("DESCRIBE EXTENDED spark_catalog.default.sample_delta;").show(truncate=False)

+----------------------------+---------------------------------------------------+-------+
|col_name                    |data_type                                          |comment|
+----------------------------+---------------------------------------------------+-------+
|id                          |int                                                |NULL   |
|name                        |string                                             |NULL   |
|                            |                                                   |       |
|# Detailed Table Information|                                                   |       |
|Name                        |spark_catalog.default.sample_delta                 |       |
|Type                        |EXTERNAL                                           |       |
|Location                    |s3a://lakehouse/delta/sample_delta                 |       |
|Provider                    |delta                                              |       |

Although `VACUUM` followed by `DROP TABLE` is the recommended way to fully clear a Delta Lake table, it does not appear to work here: the data persists on MinIO after both commands.

In [11]:
spark.sql("""
SET `spark.databricks.delta.retentionDurationCheck.enabled`=false
""")

spark.sql("""
VACUUM spark_catalog.default.sample_delta RETAIN 0 HOURS
""")

                                                                                

Deleted 0 files and directories in a total of 1 directories.


DataFrame[path: string]

In [12]:
spark.sql("VACUUM delta.`s3a://lakehouse/delta/sample_delta` RETAIN 0 HOURS")

                                                                                

Deleted 0 files and directories in a total of 1 directories.


DataFrame[path: string]
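
Both `VACUUM` calls report 0 deleted files. This is consistent with how `VACUUM` works: it removes only files that the current table version no longer references, and the inserted rows still belong to the latest version. A minimal sketch of one way to empty the table first, assuming an active `spark` session. The helper name `clear_delta_table` is hypothetical, and the `_delta_log` directory is expected to remain in object storage afterwards.

```python
def clear_delta_table(spark, table: str) -> None:
    """Logically delete all rows, then vacuum the files that became unreferenced."""
    # Allow a retention period shorter than the default safety threshold.
    spark.sql("SET `spark.databricks.delta.retentionDurationCheck.enabled`=false")
    # Writes a new table version in which no data file is referenced.
    spark.sql(f"DELETE FROM {table}")
    # Physically removes data files that the current version no longer references.
    spark.sql(f"VACUUM {table} RETAIN 0 HOURS")


# Usage inside the notebook (requires the `spark` session from the first cell):
# clear_delta_table(spark, "spark_catalog.default.sample_delta")
```

A retention of 0 hours is only safe when no other reader or writer is using the table, which is assumed for this local test setup.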

In [13]:
spark.sql("DESCRIBE EXTENDED iceberg.default.sample_iceberg;").show(truncate=False)

+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                             |comment|
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|id                          |int                                                                                                                   |NULL   |
|name                        |string                                                                                                                |NULL   |
|                            |                                                                                                                      |       |
|# Metadata Columns          |                      

Iceberg removes everything (data, Iceberg metadata, and Hive metadata) when the following procedures are run before `DROP TABLE`.

In [14]:
spark.sql("""
CALL iceberg.system.expire_snapshots(table => 'default.sample_iceberg', older_than => TIMESTAMP '2025-08-01 00:00:00')
""")

spark.sql("""
CALL iceberg.system.remove_orphan_files(table => 'default.sample_iceberg', older_than => TIMESTAMP '2025-08-01 00:00:00')
""")

DataFrame[orphan_file_location: string]
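
The cutoff in both calls is hard-coded, and `2025-08-01` predates the run shown in the logs (2025-08-12), so it may not cover snapshots or files created in this session. A minimal sketch that builds the same statements from a `datetime`, so the cutoff can be computed instead of typed by hand. The helper name `maintenance_sql` is hypothetical.

```python
from datetime import datetime


def maintenance_sql(procedure, table, older_than, catalog="iceberg"):
    """Build a CALL statement for an Iceberg maintenance procedure with a cutoff."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.{procedure}("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')"
    )


# Reproduces the two statements used in the cell above.
cutoff = datetime(2025, 8, 1)
for procedure in ("expire_snapshots", "remove_orphan_files"):
    print(maintenance_sql(procedure, "default.sample_iceberg", cutoff))
```

Each generated string can be passed to `spark.sql(...)` as in the cell above.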

---

## Clear tables

`PURGE` is used to ensure that the data is fully dropped.

**NOTE:** Data belonging to external tables is not deleted from object storage.

When a table is created with an explicit location such as `LOCATION 's3a://lakehouse/iceberg/sample_iceberg'`, we are responsible for managing its data ourselves.

In [15]:
spark.sql("DROP TABLE IF EXISTS iceberg.default.sample_iceberg PURGE;")
spark.sql("DROP TABLE IF EXISTS spark_catalog.default.sample_delta PURGE;")

DataFrame[]