
# Advanced BI Techniques with Apache Iceberg and Spark SQL

This notebook shows how Apache Iceberg supports BI workloads in Spark SQL: snapshot-based versioning and time travel, schema evolution, and period-to-date (MTD, QTD, YTD) aggregations.

In [None]:
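# PySpark comes from PyPI; the Iceberg runtime is a JVM artifact that Spark resolves from Maven via spark.jars.packages
!pip install pyspark==3.1.2

from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg support.
# Assumption: runtime 0.13.2 for Spark 3.1 / Scala 2.12; adjust the coordinates to match your Spark build.
spark = SparkSession.builder \
    .appName("Advanced BI with Iceberg") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse") \
    .getOrCreate()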

# Create an Iceberg table for customers, partitioned by the day of `updated`; schema evolution and snapshot history are built into Iceberg tables
spark.sql("""
CREATE TABLE IF NOT EXISTS local.db.customers (
    id int,
    name string,
    email string,
    revenue double,
    updated timestamp,
    is_current boolean
) USING iceberg PARTITIONED BY (days(updated))
""")

## Data Loading and Versioning

Load initial rows into the Iceberg table, then apply a batch of updates. Each write commits a new snapshot, which the time travel section queries later.

In [None]:
from datetime import datetime, timedelta

# Load initial data
initial_data = spark.createDataFrame([
    (1, 'Alice', 'alice@example.com', 100.0, datetime.now(), True),
    (2, 'Bob', 'bob@example.com', 150.0, datetime.now() - timedelta(days=1), True)
], ['id', 'name', 'email', 'revenue', 'updated', 'is_current'])

initial_data.write.format('iceberg').mode('append').save('local.db.customers')
print('Initial data loaded.')

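# Simulate updates. With overwrite-mode=dynamic, only the day partitions present in `updates` are replaced;
# rows in other partitions (such as Bob's row from the previous day) are left in place.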
updates = spark.createDataFrame([
    (1, 'Alice', 'alice_new@example.com', 200.0, datetime.now(), True),
    (2, 'Bob', 'bob_new@example.com', 300.0, datetime.now(), True)
], ['id', 'name', 'email', 'revenue', 'updated', 'is_current'])

updates.write.format('iceberg').mode('overwrite').option('overwrite-mode', 'dynamic').save('local.db.customers')
print('Updates applied with overwrite.')

Initial data loaded.
Updates applied with overwrite.
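
Because the table is partitioned by `days(updated)`, a dynamic overwrite is expected to replace only the day partitions that appear in the incoming batch. The pure-Python sketch below models that behaviour on plain tuples. It does not call Iceberg: `day_partition` and `dynamic_overwrite` are hypothetical helpers introduced for illustration, and fixed dates stand in for `datetime.now()` so the result is repeatable.

```python
from collections import defaultdict
from datetime import datetime

def day_partition(ts):
    """Model of a day-granularity partition key: the calendar date of the timestamp."""
    return ts.date()

def dynamic_overwrite(existing_rows, incoming_rows, ts_index=4):
    """Replace only the partitions present in incoming_rows; keep all others."""
    partitions = defaultdict(list)
    for row in existing_rows:
        partitions[day_partition(row[ts_index])].append(row)
    # Clear every partition the incoming batch touches, then write the new rows
    for key in {day_partition(row[ts_index]) for row in incoming_rows}:
        partitions[key] = []
    for row in incoming_rows:
        partitions[day_partition(row[ts_index])].append(row)
    return [row for key in sorted(partitions) for row in partitions[key]]

today = datetime(2021, 6, 2, 9, 0)      # assumed "now"
yesterday = datetime(2021, 6, 1, 9, 0)  # assumed "now - 1 day"

existing = [
    (1, 'Alice', 'alice@example.com', 100.0, today, True),
    (2, 'Bob', 'bob@example.com', 150.0, yesterday, True),
]
incoming = [
    (1, 'Alice', 'alice_new@example.com', 200.0, today, True),
    (2, 'Bob', 'bob_new@example.com', 300.0, today, True),
]

result = dynamic_overwrite(existing, incoming)
for row in result:
    print(row[0], row[2], row[3], row[4].date())
```

In this model Bob's original row survives, because its partition (the previous day) is not part of the batch. If the goal is a row-level update keyed on `id`, a `MERGE INTO` statement is likely a better fit than a partition overwrite.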


## Periodical Calculations (MTD, QTD, YTD)

Calculate month-to-date (MTD), quarter-to-date (QTD), and year-to-date (YTD) revenue totals from the current table snapshot.

In [None]:
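import pyspark.sql.functions as F

# Period boundaries, truncated to midnight so rows from earlier on the first day of a period are included
current_date = datetime.now()
start_of_day = current_date.replace(hour=0, minute=0, second=0, microsecond=0)
start_of_month = start_of_day.replace(day=1)
start_of_quarter = start_of_day.replace(month=(current_date.month - 1) // 3 * 3 + 1, day=1)
start_of_year = start_of_day.replace(month=1, day=1)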

revenue_stats = spark.sql("""
SELECT
    SUM(CASE WHEN updated >= '{0}' THEN revenue ELSE 0 END) as MTD_Revenue,
    SUM(CASE WHEN updated >= '{1}' THEN revenue ELSE 0 END) as QTD_Revenue,
    SUM(CASE WHEN updated >= '{2}' THEN revenue ELSE 0 END) as YTD_Revenue
FROM local.db.customers
""".format(start_of_month, start_of_quarter, start_of_year))

revenue_stats.show()

+-----------+-----------+-----------+
|MTD_Revenue|QTD_Revenue|YTD_Revenue|
+-----------+-----------+-----------+
|      300.0|      300.0|      600.0|
+-----------+-----------+-----------+
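
The period boundaries used in the query can be checked without Spark. The following sketch computes the start of the month, quarter, and year for a given timestamp and sums revenue over plain `(updated, revenue)` pairs. The helper names `period_starts` and `to_date_totals` and the sample rows are assumptions made for illustration; they are not part of the notebook's table.

```python
from datetime import datetime

def period_starts(now):
    """Return midnight at the start of the month, quarter, and year containing `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    first_quarter_month = (now.month - 1) // 3 * 3 + 1  # 1, 4, 7, or 10
    return {
        'MTD': midnight.replace(day=1),
        'QTD': midnight.replace(month=first_quarter_month, day=1),
        'YTD': midnight.replace(month=1, day=1),
    }

def to_date_totals(rows, now):
    """Sum revenue per period over (updated, revenue) pairs with updated >= period start."""
    starts = period_starts(now)
    return {
        label: sum(revenue for updated, revenue in rows if updated >= start)
        for label, start in starts.items()
    }

# Made-up sample rows with fixed timestamps
rows = [
    (datetime(2021, 8, 17, 10, 0), 200.0),
    (datetime(2021, 7, 3, 8, 30), 300.0),
    (datetime(2021, 2, 11, 16, 45), 150.0),
]
now = datetime(2021, 8, 20, 12, 0)

starts = period_starts(now)
totals = to_date_totals(rows, now)
print(totals)  # {'MTD': 200.0, 'QTD': 500.0, 'YTD': 650.0}
```

By construction, MTD can never exceed QTD and QTD can never exceed YTD when revenue is non-negative, which is a quick sanity check for the SQL result.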


## Versioning and Time Travel

Use Iceberg's time travel to query the table as it existed at an earlier snapshot, then evolve the schema by adding a column without rewriting existing data.
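
A timestamp-based time travel read resolves to the latest snapshot committed at or before the requested time, with the time given in epoch milliseconds. The sketch below models that lookup in plain Python. `resolve_as_of` is a hypothetical helper, and the commit times and snapshot IDs are made up; real values come from the table's snapshot history.

```python
from bisect import bisect_right

def resolve_as_of(snapshots, as_of_ms):
    """Return the ID of the latest snapshot committed at or before as_of_ms.

    snapshots: list of (committed_at_ms, snapshot_id) tuples sorted by commit time.
    """
    commit_times = [committed_at for committed_at, _ in snapshots]
    position = bisect_right(commit_times, as_of_ms)
    if position == 0:
        raise ValueError(f"no snapshot at or before {as_of_ms}")
    return snapshots[position - 1][1]

# Hypothetical history: (commit time in epoch ms, made-up snapshot ID)
history = [
    (1_622_505_600_000, 101),  # initial load
    (1_622_592_000_000, 102),  # overwrite with updates
]

print(resolve_as_of(history, 1_622_550_000_000))  # between the two commits -> 101
print(resolve_as_of(history, 1_622_592_000_000))  # exactly at the second commit -> 102
```

A timestamp earlier than the first commit has no snapshot to resolve to, which is why the model raises an error in that case.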

In [None]:
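# Retrieve historical data as of the first snapshot's commit time (as-of-timestamp takes epoch milliseconds)
snapshots = spark.sql("SELECT unix_millis(committed_at) AS ts FROM local.db.customers.snapshots ORDER BY committed_at")
snapshot_timestamp = snapshots.first()['ts']
historical_data = spark.read.format('iceberg').option('as-of-timestamp', snapshot_timestamp).load('local.db.customers')
historical_data.show()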

# Evolving schema by adding a new column for tracking customer login
spark.sql("ALTER TABLE local.db.customers ADD COLUMNS (last_login timestamp)")
print('Schema evolution applied.')

+----+-----+---------------------+-------+-------------------+----------+
| id | name| email               |revenue| updated           |is_current|
+----+-----+---------------------+-------+-------------------+----------+
|  1 |Alice|alice@example.com    |  100.0|2021-06-01 00:00:00|      true|
|  2 |Bob  |bob@example.com      |  150.0|2021-06-02 00:00:00|      true|
+----+-----+---------------------+-------+-------------------+----------+
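
Adding a column in Iceberg changes table metadata only; rows written before the change read the new column as NULL. The sketch below models that read-time behaviour with plain dictionaries. It is a simplification: Iceberg resolves columns by field ID rather than by name, and `read_with_schema` is a hypothetical helper, not an Iceberg API.

```python
def read_with_schema(stored_rows, current_columns):
    """Project each stored row onto the current columns; absent columns become None."""
    return [{col: row.get(col) for col in current_columns} for row in stored_rows]

columns_before = ['id', 'name', 'email', 'revenue', 'updated', 'is_current']
columns_after = columns_before + ['last_login']  # after ALTER TABLE ... ADD COLUMNS

# A row written before the schema change has no last_login value
old_rows = [{
    'id': 1, 'name': 'Alice', 'email': 'alice_new@example.com',
    'revenue': 200.0, 'updated': '2021-06-02 09:00:00', 'is_current': True,
}]

evolved = read_with_schema(old_rows, columns_after)
print(evolved[0]['last_login'])  # None
```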
