# Versioned Data Lakehouse with Nessie, Iceberg, and Spark
This notebook demonstrates how to use Project Nessie as a transactional catalog for Apache Iceberg tables in a data lakehouse. Key features include:

* Versioning: Track changes to your data over time.

* Branching and Merging: Create branches for experimental changes and merge them back into the main branch.

* Tags: Name a specific commit so that state of the data can be reproduced and audited later.

# Project Overview
#### We will build an ETL pipeline for IMDb movie data using Nessie's branching and versioning capabilities.

* Raw Data Ingestion: Load raw IMDb data into a raw branch.

* Data Transformation: Clean and transform the data in a dev branch.

* Data Validation: Perform quality checks before promoting data to production.

* Promotion to Main: Merge the validated data into the main branch.

* Versioning and Time Travel: Use tags and commit hashes to track changes and time travel.
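
The steps above map onto a short sequence of Nessie SQL statements. The following is a minimal sketch of that sequence, not the exact statements used later in this notebook: the helper name `pipeline_statements` and the tag name `v1` are hypothetical, and the catalog is assumed to be registered as `nessie`.

```python
def pipeline_statements(catalog="nessie", tag="v1"):
    """Return the Nessie SQL statements for the raw -> dev -> main flow, in order.

    `tag` is a hypothetical tag name chosen for illustration.
    """
    return [
        # Raw data ingestion happens on its own branch
        f"CREATE BRANCH IF NOT EXISTS raw IN {catalog} FROM main",
        f"USE REFERENCE raw IN {catalog}",
        # Transformation and validation happen on a branch cut from raw
        f"CREATE BRANCH IF NOT EXISTS dev IN {catalog} FROM raw",
        f"USE REFERENCE dev IN {catalog}",
        # Promotion: merge the validated branch, then name the resulting commit
        f"MERGE BRANCH dev INTO main IN {catalog}",
        f"CREATE TAG IF NOT EXISTS {tag} IN {catalog} FROM main",
    ]


for stmt in pipeline_statements():
    print(stmt)
```

Each string can be passed to `spark.sql(...)` once the session has the Nessie SQL extensions loaded; the ingestion and transformation work happens between the statements.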

In [9]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Stop existing Spark session if running
if 'spark' in globals():
    spark.stop()

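# Initialize Spark. No Iceberg or Nessie settings are passed here, so the catalog
# configuration is assumed to come from the environment (e.g. spark-defaults.conf).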
spark = SparkSession.builder \
    .appName("NessieIMDbDemo") \
    .config("spark.ui.port", "4041") \
    .getOrCreate()

print("Spark session started with Nessie and Iceberg.")

Spark session started with Nessie and Iceberg.
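
The cell above relies on catalog settings that are defined outside the notebook. If they are not, something like the following could supply them. This is a sketch under assumptions: the Nessie URI and warehouse path are placeholders, the helper names are hypothetical, and the Iceberg and Nessie runtime jars still have to be on the Spark classpath.

```python
def nessie_iceberg_conf(uri, warehouse, catalog="nessie", ref="main"):
    """Build the Spark settings for an Iceberg catalog backed by Nessie.

    `uri` and `warehouse` are environment-specific and must be supplied by the caller.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg and Nessie SQL extensions (the latter enables CREATE BRANCH, USE REFERENCE, ...)
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
    }


def apply_conf(builder, conf):
    """Apply each setting to a SparkSession builder and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Placeholder values; replace with the ones for your deployment.
conf = nessie_iceberg_conf(
    uri="http://localhost:19120/api/v1",
    warehouse="/tmp/warehouse",
)
for key, value in conf.items():
    print(key, "=", value)
```

With these helpers, the session could be created as `apply_conf(SparkSession.builder.appName("NessieIMDbDemo"), conf).getOrCreate()`.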


In [11]:
# Create the namespace (catalog-qualified to match the table references below)
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.imdb")

# Create a raw branch
spark.sql("CREATE BRANCH IF NOT EXISTS raw FROM main")
spark.sql("USE REFERENCE raw")
print("Created and switched to 'raw' branch for raw data ingestion.")

# List references to verify the branch creation
print("List of references:")
spark.sql("LIST REFERENCES").toPandas()

Created and switched to 'raw' branch for raw data ingestion.
List of references:


| | refType | name | hash |
|---|---|---|---|
| 0 | Branch | main | 41f9d302c4b7e228a15cb94e39198050759def757e8621... |
| 1 | Branch | raw | 41f9d302c4b7e228a15cb94e39198050759def757e8621... |
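
Both branches show the same hash because a new branch starts at the commit its source branch points to. A small check along these lines can confirm that programmatically. It is a sketch: the helper names are hypothetical, and the sample hashes are made-up placeholders rather than values from this catalog.

```python
def refs_by_name(rows):
    """Map reference name -> commit hash from LIST REFERENCES rows."""
    return {row["name"]: row["hash"] for row in rows}


def same_commit(rows, ref_a, ref_b):
    """Return True if both references point at the same commit."""
    refs = refs_by_name(rows)
    return refs[ref_a] == refs[ref_b]


# Placeholder rows shaped like the LIST REFERENCES output (hashes are invented).
sample = [
    {"refType": "Branch", "name": "main", "hash": "aaa111"},
    {"refType": "Branch", "name": "raw", "hash": "aaa111"},
]
print(same_commit(sample, "main", "raw"))  # True
```

In the notebook, `rows` could come from `spark.sql("LIST REFERENCES").collect()`, since Spark `Row` objects support lookup by column name.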


In [16]:
raw_df = spark.read.option("header", "true").csv("/home/iceberg/data/imdb-movies.csv")
print("First 5 rows of raw IMDb data:")
raw_df.show(5)
print("Schema of raw IMDb data:")
raw_df.printSchema()

First 5 rows of raw IMDb data:
+-------------+--------------------+--------------------+----+--------------+--------------------+--------+--------------------+----------+------------+----------+------------------+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+--------+------+--------+----------------+---------------------+------------------+
|imdb_title_id|               title|      original_title|year|date_published|               genre|duration|             country|language_1|  language_2|language_3|          director|              writer|              actors|       actors_1|           actors_f2|         description|              desc35|avg_vote| votes|  budget|usa_gross_income|worlwide_gross_income|reviews_from_users|
+-------------+--------------------+--------------------+----+--------------+--------------------+--------+--------------------+----------+------------+----------+------------------+------------

In [22]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS nessie.imdb.movies (
        imdb_title_id STRING,
        title STRING,
        year INT,
        genre STRING,
        director STRING,
        avg_vote DOUBLE,
        votes INT
    )
    USING iceberg
""")

print("Iceberg table 'movies' created in the 'raw' branch.")

Iceberg table 'movies' created in the 'raw' branch.


In [23]:
raw_df.select(
    col("imdb_title_id"),
    col("title"),
    col("year").cast("int"),
    col("genre"),
    col("director"),
    col("avg_vote").cast("double"),
    col("votes").cast("int")
).writeTo("nessie.imdb.movies").append()
print("Raw data ingested into 'movies' in the 'raw' branch.")

Raw data ingested into 'movies' in the 'raw' branch.
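
The table definition and the casts in the ingestion step list the same seven columns twice. One way to keep them in sync is to derive both from a single mapping, as sketched below. The names `MOVIE_COLUMNS`, `create_table_sql`, and `cast_exprs` are hypothetical and not part of the original notebook.

```python
# Column name -> Spark SQL type, matching the 'movies' table defined above.
MOVIE_COLUMNS = {
    "imdb_title_id": "STRING",
    "title": "STRING",
    "year": "INT",
    "genre": "STRING",
    "director": "STRING",
    "avg_vote": "DOUBLE",
    "votes": "INT",
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement for an Iceberg table from the mapping."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING iceberg"


def cast_exprs(columns):
    """Build one CAST expression per column, suitable for DataFrame.selectExpr."""
    return [f"CAST({name} AS {dtype}) AS {name}" for name, dtype in columns.items()]


print(create_table_sql("nessie.imdb.movies", MOVIE_COLUMNS))
for expr in cast_exprs(MOVIE_COLUMNS):
    print(expr)
```

The ingestion could then be written as `raw_df.selectExpr(*cast_exprs(MOVIE_COLUMNS)).writeTo("nessie.imdb.movies").append()`. Note that with ANSI mode off, a value that cannot be cast (for example a non-numeric `year`) becomes NULL instead of raising an error, so a null count after ingestion is a useful sanity check.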


In [25]:
print("Tables in the 'raw' branch:")
spark.sql("SHOW TABLES IN nessie.imdb").show(truncate=False)

print("Raw data in the 'raw' branch:")
spark.sql("SELECT * FROM nessie.imdb.movies").show(5)

Tables in the 'raw' branch:
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|imdb     |movies   |false      |
+---------+---------+-----------+

Raw data in the 'raw' branch:
+-------------+--------------------+----+--------------------+------------------+--------+------+
|imdb_title_id|               title|year|               genre|          director|avg_vote| votes|
+-------------+--------------------+----+--------------------+------------------+--------+------+
|    tt0035423|      Kate & Leopold|2001|Comedy, Fantasy, ...|     James Mangold|     6.4| 77852|
|    tt0118589|             Glitter|2001|Drama, Music, Rom...|Vondie Curtis-Hall|     2.2| 21298|
|    tt0118694|In the Mood for Love|2000|      Drama, Romance|      Kar-Wai Wong|     8.1|119171|
|    tt0120202|  Hollywood, Vermont|2000|       Comedy, Drama|       David Mamet|     6.7| 20220|
|    tt0120263|Canzoni del secon...|2000|       Comedy, Drama|     Roy Andersson|    

Describe the Iceberg table to view its properties and configurations.

In [26]:
print("Table description for 'movies':")
spark.sql("DESCRIBE TABLE EXTENDED nessie.imdb.movies").show(truncate=False)

Table description for 'movies':
+----------------------------+---------------------------------------------------------------+-------+
|col_name                    |data_type                                                      |comment|
+----------------------------+---------------------------------------------------------------+-------+
|imdb_title_id               |string                                                         |NULL   |
|title                       |string                                                         |NULL   |
|year                        |int                                                            |NULL   |
|genre                       |string                                                         |NULL   |
|director                    |string                                                         |NULL   |
|avg_vote                    |double                                                         |NULL   |
|votes                       |int        

Create a dev branch from the raw branch to perform transformations.

In [27]:
# Create a dev branch from raw (IF NOT EXISTS keeps the cell re-runnable)
spark.sql("CREATE BRANCH IF NOT EXISTS dev FROM raw")
spark.sql("USE REFERENCE dev")
print("Created and switched to 'dev' branch for transformations.")

# Only the last expression in a cell is rendered, so this result is not shown;
# wrap it in display(...) to see the reference list.
print("List of references:")
spark.sql("LIST REFERENCES").toPandas()

print("Tables in the 'dev' branch:")
spark.sql("SHOW TABLES IN nessie.imdb").toPandas()

Created and switched to 'dev' branch for transformations.
List of references:
Tables in the 'dev' branch:


| | namespace | tableName | isTemporary |
|---|---|---|---|
| 0 | imdb | movies | False |
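
Because `dev` was cut from `raw`, the `movies` table is visible on both branches. To compare a table across branches without switching the session reference, the Nessie integration accepts a `table@reference` identifier wrapped in backticks. The helpers below are a sketch that builds such identifiers; their names are hypothetical.

```python
def table_at_ref(namespace, table, ref, catalog="nessie"):
    """Return a table identifier pinned to a Nessie branch or tag."""
    # Backticks are needed because '@' is not valid in an unquoted identifier.
    return f"{catalog}.{namespace}.`{table}@{ref}`"


def row_count_sql(namespace, table, ref, catalog="nessie"):
    """Return a query that counts the rows of a table as seen from `ref`."""
    return f"SELECT COUNT(*) AS n FROM {table_at_ref(namespace, table, ref, catalog)}"


for branch in ("raw", "dev"):
    print(row_count_sql("imdb", "movies", branch))
```

Running each query through `spark.sql(...)` right after branching should return the same count for `raw` and `dev`; the counts diverge once transformations are committed to `dev`.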
