# Life Without Apache Iceberg ❄️ (Windows + Local FS Demo)

This notebook demonstrates **what goes wrong when we manage a data lake using only Parquet files**, without a table format like **Apache Iceberg**.

Environment:
- Spark **3.5.3**
- Windows local filesystem
- PySpark

This notebook is structured for:
- 📌 LinkedIn technical posts
- 📁 Local experimentation
- 🧠 GitHub documentation


## 1️⃣ Spark Setup (Windows Local)

In [1]:
import os
import sys

# Spark 3.5.3 paths
spark_home = r"C:\spark\spark-3.5.3-bin-hadoop3"
sys.path.insert(0, spark_home + r"\python")
sys.path.insert(0, spark_home + r"\python\lib\py4j-0.10.9.7-src.zip")

# Python executables
os.environ["PYSPARK_PYTHON"] = r"C:\Users\Raghava\AppData\Local\Programs\Python\Python310\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\Raghava\AppData\Local\Programs\Python\Python310\python.exe"
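The setup above depends on hard-coded local paths, so a quick existence check can save a confusing startup error. The following is a minimal sketch; `missing_paths` is a hypothetical helper name, not part of Spark or PySpark.

```python
import os

def missing_paths(paths):
    """Return the subset of the given paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical usage with the locations configured in the cell above:
# missing_paths([spark_home, os.environ["PYSPARK_PYTHON"]])
# An empty list means every configured path was found.
```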

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("withoutIceberg").getOrCreate()

## 2️⃣ Day 1 – Initial Parquet Write

In [3]:
data_day1 = [(1, 'Ramesh'), (2, 'Pavan')]
df_day1 = spark.createDataFrame(data_day1, ['id', 'name'])
df_day1.show()

+---+------+
| id|  name|
+---+------+
|  1|Ramesh|
|  2| Pavan|
+---+------+



In [4]:
df_day1.write.mode("overwrite").parquet(
    "file:///H:/spark_practice/iceberg_poc/without_iceberg"
)

In [5]:
# Relative path: assumes the notebook's working directory is H:/spark_practice/iceberg_poc
df = spark.read.parquet('without_iceberg/part-00001-30db33eb-eeb6-4e95-89ad-3e48b4e96778-c000.snappy.parquet')

In [6]:
df.show()

+---+------+
| id|  name|
+---+------+
|  1|Ramesh|
+---+------+



In [7]:
df = spark.read.parquet('without_iceberg/part-00003-30db33eb-eeb6-4e95-89ad-3e48b4e96778-c000.snappy.parquet')

In [8]:
df.show()

+---+-----+
| id| name|
+---+-----+
|  2|Pavan|
+---+-----+



**What happens here?**

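- Spark writes raw Parquet files straight into the target directory
- Each partition that holds rows becomes its own physical part file, which is why the two rows landed in `part-00001` and `part-00003`
- There is no table-level metadata and no schema enforcement across files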
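To see the physical files for yourself, you can list the output directory from Python. This is a minimal sketch using only the standard library; `list_part_files` is a hypothetical helper, and the `part-*.parquet` pattern is an assumption based on the file names shown above.

```python
from pathlib import Path

def list_part_files(directory):
    """Return (file name, size in bytes) for each Parquet part file, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).glob("part-*.parquet")
    )

# Hypothetical usage against the directory written above:
# for name, size in list_part_files(r"H:\spark_practice\iceberg_poc\without_iceberg"):
#     print(name, size)
```

Files that do not match the pattern, such as Spark's `_SUCCESS` marker, are skipped by the glob.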


In [9]:
df_day1.rdd.getNumPartitions()

4

In [10]:
df_day1.explain(True)

== Parsed Logical Plan ==
LogicalRDD [id#0L, name#1], false

== Analyzed Logical Plan ==
id: bigint, name: string
LogicalRDD [id#0L, name#1], false

== Optimized Logical Plan ==
LogicalRDD [id#0L, name#1], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#0L,name#1]



## 📂 Day 1 – Initial Parquet Write (Physical Files)

The image below shows the multiple Parquet files created by the first write:

![Day 1 Parquet Files](images/day1_initial_write.png)


## 3️⃣ Day 2 – Append With Schema Drift

In [11]:
from pyspark.sql.functions import current_timestamp

data_day2 = [(3,'Akhil'),(4,'Nikhil')]
df_day2 = spark.createDataFrame(data_day2,['id','name']) \
    .withColumn("updated_at", current_timestamp())


In [19]:
df_day2.write.mode('append').parquet(
    "file:///H:/spark_practice/iceberg_poc/without_iceberg"
)

In [20]:
df = spark.read.parquet("without_iceberg/part-00003-519d897e-4ff0-4f2a-a7b3-8ca5ed55bf8d-c000.snappy.parquet")

In [21]:
df.show()

+---+------+--------------------+
| id|  name|          updated_at|
+---+------+--------------------+
|  4|Nikhil|2025-12-29 10:52:...|
+---+------+--------------------+



## ⚠️ Day 2 – Schema Drift in Action

Some Parquet files now contain `updated_at`, others don't:

![Day 2 Schema Drift](images/day2_schema_drift.png)


### 🚨 Schema Drift Problem

- Old Parquet files **do not have `updated_at`**
- New Parquet files **do have it**
- Dataset now has **multiple schemas**
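One way to make the drift visible is to group the part files by their column list. This is a minimal sketch in plain Python; `group_files_by_schema` and the file names in `file_columns` are hypothetical. In the notebook, each entry could be filled in from `spark.read.parquet(<part file>).columns`.

```python
from collections import defaultdict

def group_files_by_schema(file_columns):
    """Group file names by their column list.

    file_columns maps a file name to its list of column names.
    Returns a dict from column tuple to the sorted files that share it.
    """
    groups = defaultdict(list)
    for name, columns in file_columns.items():
        groups[tuple(columns)].append(name)
    return {schema: sorted(files) for schema, files in groups.items()}

# Hypothetical input mirroring the Day 1 and Day 2 writes
file_columns = {
    "day1-part-a": ["id", "name"],
    "day1-part-b": ["id", "name"],
    "day2-part-a": ["id", "name", "updated_at"],
}
schemas = group_files_by_schema(file_columns)
for schema, files in schemas.items():
    print(schema, files)
```

More than one key in the result means the directory holds files with different schemas.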


In [18]:
df = spark.read \
    .option("mergeSchema", "false") \
    .parquet("without_iceberg/")

df.show()
df.printSchema()


+---+------+
| id|  name|
+---+------+
|  4|Nikhil|
|  3| Akhil|
|  1|Ramesh|
|  2| Pavan|
+---+------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)



**Result:**
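- With `mergeSchema` disabled, Spark takes the schema from a summary file or a single data file, so `updated_at` is silently dropped here
- The data appears incomplete even though the column still exists in the Day 2 files
- Reconciling the schemas is left to whoever reads the data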
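The manual handling mentioned above can be as simple as asking Spark to merge the file schemas at read time. This is a minimal sketch; `read_with_merged_schema` is a hypothetical wrapper, and it assumes the `spark` session created earlier in the notebook. With merging enabled, rows from files that lack `updated_at` should come back with nulls in that column.

```python
def read_with_merged_schema(spark, path):
    """Read a Parquet directory, asking Spark to union the schemas of all files."""
    return spark.read.option("mergeSchema", "true").parquet(path)

# Usage in this notebook (assumes the `spark` session from the setup cells):
# df_merged = read_with_merged_schema(spark, "without_iceberg/")
# df_merged.printSchema()
# df_merged.show()
```

This only helps readers who remember to set the option. Nothing in the directory itself records which schema is the current one.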


## 4️⃣ Day 3 – Accidental Overwrite (Data Loss)

In [22]:
df_bad = spark.createDataFrame([(999,'Malicious')],['id','name'])
df_bad.write.mode('overwrite').parquet(
    "file:///H:/spark_practice/iceberg_poc/without_iceberg"
)
df_bad.show()

+---+---------+
| id|     name|
+---+---------+
|999|Malicious|
+---+---------+



## ❌ Day 3 – Accidental Overwrite

One overwrite wiped the entire dataset:

![Day 3 Overwrite](images/day3_overwrite.png)


### ❌ Final Outcome

- The entire previous dataset is deleted and replaced by the single new row
- No rollback
- No time travel
- No ACID guarantees
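Without a table format, the only rollback is one you build yourself, for example by copying the directory before each risky write. This is a minimal sketch using the standard library; `backup_directory` and the backup location are hypothetical. It is a manual workaround, not an atomic snapshot.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_directory(src, backup_root):
    """Copy src into backup_root under a timestamped name and return the copy's path."""
    src = Path(src)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    dest = Path(backup_root) / f"{src.name}_{stamp}"
    shutil.copytree(src, dest)  # creates backup_root if it does not exist yet
    return dest

# Hypothetical usage before an overwrite:
# backup_directory(r"H:\spark_practice\iceberg_poc\without_iceberg",
#                  r"H:\spark_practice\iceberg_poc\backups")
```

A full copy on every write doubles storage and still leaves a window in which readers can see a half-written directory.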


## 🔑 Key Takeaways

| Feature | Parquet Only |
|-------|-------------|
| Schema evolution | ❌ Manual |
| Append safety | ❌ Risky |
| Time travel | ❌ No |
| Rollback | ❌ No |
| ACID guarantees | ❌ No |

**This is exactly why Apache Iceberg exists.** ❄️