<a href="https://colab.research.google.com/github/Mbaroudi/DELTA_LAKE_TIPS/blob/main/delta_iceberg_integration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Integrating Delta Lake and Apache Iceberg with Hive

In this notebook, we explore how to use Delta Lake and Apache Iceberg alongside Hive to build a data lakehouse. Hive external tables, Delta Lake tables, and Iceberg tables all live on the same shared storage layer, which minimizes the impact on an existing big data environment.

We will demonstrate how to configure Spark to work with both Delta Lake and Apache Iceberg, and how to run the write and read operations typical of real-world data workflows.

### Initial Setup

First, we install the required libraries and configure a Spark session with Delta Lake and Iceberg support.

In [None]:
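!pip install pyspark==3.1.2 delta-spark==1.0.0

from pyspark.sql import SparkSession

# The Iceberg Spark runtime is loaded as a Maven jar below rather than through pip.
# Each catalog name accepts a single implementation class, so Delta keeps `spark_catalog`
# and Iceberg gets its own catalog. The name "iceberg" and the warehouse path are illustrative.
spark = SparkSession.builder \
    .appName("Delta Lake and Iceberg Integration") \
    .config("spark.jars.packages", 'io.delta:delta-core_2.12:1.0.0,org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1') \
    .config("spark.sql.extensions", 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension') \
    .config("spark.sql.catalog.spark_catalog", 'org.apache.spark.sql.delta.catalog.DeltaCatalog') \
    .config("spark.sql.catalog.iceberg", 'org.apache.iceberg.spark.SparkCatalog') \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "/tmp/delta_iceberg_tables/iceberg_warehouse") \
    .master("local[*]") \
    .getOrCreate()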
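
Because the Delta and Iceberg versions must match the Spark and Scala versions, it can help to keep the session settings in one place. The following is a minimal sketch, not part of the original notebook: `lakehouse_conf` and `apply_conf` are hypothetical helpers, the catalog name `iceberg` is an arbitrary choice, and the Iceberg artifact name assumes the `iceberg-spark-runtime-<spark>_<scala>` naming used from Iceberg 0.13 onward.

```python
def lakehouse_conf(warehouse, scala="2.12", spark_minor="3.1",
                   delta="1.0.0", iceberg="0.13.1"):
    """Return Spark settings for a session that loads both Delta Lake and Iceberg.

    Hypothetical helper: the defaults mirror the versions used in this notebook.
    """
    packages = [
        f"io.delta:delta-core_{scala}:{delta}",
        f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala}:{iceberg}",
    ]
    extensions = [
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "io.delta.sql.DeltaSparkSessionExtension",
    ]
    return {
        "spark.jars.packages": ",".join(packages),
        "spark.sql.extensions": ",".join(extensions),
        # One implementation class per catalog name: Delta handles the session catalog...
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # ...and Iceberg is registered under a separate, arbitrarily named catalog.
        "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.iceberg.type": "hadoop",
        "spark.sql.catalog.iceberg.warehouse": warehouse,
    }


def apply_conf(builder, conf):
    """Apply every key/value pair to a SparkSession.Builder-like object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


conf = lakehouse_conf("/tmp/delta_iceberg_tables/iceberg_warehouse")
for key, value in conf.items():
    print(f"{key} = {value}")
```

With PySpark installed, `apply_conf(SparkSession.builder.master("local[*]"), conf).getOrCreate()` would create the session from these settings.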


### Creating a Unified Storage Layer

Delta Lake and Apache Iceberg can share the same underlying storage, which can be an HDFS directory, an AWS S3 bucket, or any other Hadoop-compatible storage system. Each table keeps its data files and its own metadata (Delta's `_delta_log` directory, Iceberg's `metadata` directory) under its own path, so the same data can also be exposed as external Hive tables without duplicating it or creating data silos.
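
To make the shared layout concrete, the sketch below tells the two formats apart by the metadata directory each one keeps next to its data files. It is an illustration only: `detect_table_format` is a hypothetical helper, it works on a local filesystem (HDFS or S3 would need the corresponding client), and the demo directories are empty stand-ins for real tables.

```python
import os
import tempfile


def detect_table_format(table_path):
    """Guess a table's format from the metadata directory found under table_path."""
    if os.path.isdir(os.path.join(table_path, "_delta_log")):
        return "delta"      # Delta Lake keeps its transaction log in _delta_log/
    if os.path.isdir(os.path.join(table_path, "metadata")):
        return "iceberg"    # Iceberg keeps its table metadata files in metadata/
    return "unknown"        # e.g. a plain Parquet directory


# Build a throwaway directory tree that mimics a shared storage root.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "delta_table", "_delta_log"))
os.makedirs(os.path.join(root, "iceberg_table", "metadata"))
os.makedirs(os.path.join(root, "plain_parquet"))

for name in sorted(os.listdir(root)):
    print(name, "->", detect_table_format(os.path.join(root, name)))
# delta_table -> delta
# iceberg_table -> iceberg
# plain_parquet -> unknown
```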

### Example: Writing and Reading Data

We will create tables using both Delta Lake and Apache Iceberg, perform write and read operations, and show how these tables can be accessed via Hive.

In [None]:
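# Shared root directory for the Delta and Iceberg tables
path = '/tmp/delta_iceberg_tables'

# Sample data: a single `id` column holding the values 0 through 4
df = spark.range(0, 5)

# Create the Delta table (overwrite mode lets the cell be re-run)
df.write.format('delta').mode('overwrite').save(path + '/delta_table')

# Create the Iceberg table as a path-based table at the given location
df.write.format('iceberg').save(path + '/iceberg_table')

# Read both tables back from the shared root
spark.read.format('delta').load(path + '/delta_table').show()
spark.read.format('iceberg').load(path + '/iceberg_table').show()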


### Accessing Data via Hive

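With the tables stored under a common directory, we can define external Hive tables that point to these locations. A plain Parquet external table does not read the Delta transaction log or the Iceberg metadata, so each table should be registered through its format's Hive storage handler (`io.delta.hive.DeltaStorageHandler` from the Delta connectors project, `org.apache.iceberg.mr.hive.HiveIcebergStorageHandler` for Iceberg). Tools that query Hive can then access up-to-date data without interacting with Delta Lake or Iceberg directly.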
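
One way to define such external tables is through the Hive storage handler that each format provides. The sketch below only builds the DDL strings; it does not run them. The helper names and the Hive table names `delta_ids` and `iceberg_ids` are hypothetical, and it assumes the Delta Hive connector and the Iceberg Hive runtime jars are already on the Hive classpath in versions compatible with your Hive installation.

```python
def delta_hive_ddl(table, columns, location):
    """Hive DDL for an existing Delta table; the schema must be declared explicitly."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        f"STORED BY 'io.delta.hive.DeltaStorageHandler'\n"
        f"LOCATION '{location}'"
    )


def iceberg_hive_ddl(table, location):
    """Hive DDL for an existing path-based Iceberg table; the schema comes from Iceberg metadata."""
    return (
        f"CREATE EXTERNAL TABLE {table}\n"
        f"STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('iceberg.catalog'='location_based_table')"
    )


path = '/tmp/delta_iceberg_tables'

# spark.range() produces a single `id` column of type long, which maps to BIGINT in Hive.
print(delta_hive_ddl("delta_ids", [("id", "BIGINT")], path + "/delta_table"))
print()
print(iceberg_hive_ddl("iceberg_ids", path + "/iceberg_table"))
```

The printed statements are meant to be executed in Hive itself (for example through Beeline), not through the Spark session configured above.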

### Conclusion

This integration shows that Delta Lake and Apache Iceberg can coexist in a single architecture. With a common storage layer and a Spark session configured for both formats, existing Hive workloads can keep reading the same data, which simplifies the management of large-scale data environments and helps keep data consistent across the different table formats.