# **Data storage**

### Storage Solution: Databricks Delta Lake

**Justification**

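Plain CSV or Parquet files provide no transactional guarantees or schema checks, which makes them insufficient for a production pipeline. We chose Delta Lake, which stores data as Parquet files plus a transaction log, for the following reasons: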

ACID Transactions: Protects data integrity. If the pipeline fails midway through a write, the commit is never recorded in the transaction log, so readers never see partial or corrupt data.

Schema Enforcement: Rejects writes whose columns do not match the table schema, so bad data types cannot pollute the table (e.g., text in the popularity column).

Time Travel: Delta creates a new table version on every write, allowing us to query or restore a previous snapshot of the table if we accidentally delete rows.

Unified Batch & Streaming: The same table can be used for the batch load (current task) and future real-time streaming requirements without architecture changes.
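
As an illustration of the time-travel point, the sketch below builds the Delta SQL statements for listing a table's history, reading an earlier version, and restoring it. The helper names (`history_sql`, `snapshot_sql`, `restore_sql`, `read_snapshot`) are hypothetical and not part of this pipeline, `spark` is assumed to be the session that a Databricks notebook provides, and the version numbers are examples only.

```python
def _check_version(version: int) -> int:
    """Delta table versions are non-negative integers, starting at 0."""
    if not isinstance(version, int) or version < 0:
        raise ValueError(f"version must be a non-negative integer, got {version!r}")
    return version


def history_sql(table: str) -> str:
    """Statement that lists the commits (versions) recorded for a Delta table."""
    return f"DESCRIBE HISTORY {table}"


def snapshot_sql(table: str, version: int) -> str:
    """Statement that reads the table as it was at the given version."""
    return f"SELECT * FROM {table} VERSION AS OF {_check_version(version)}"


def restore_sql(table: str, version: int) -> str:
    """Statement that rolls the table back to the given version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {_check_version(version)}"


def read_snapshot(spark, table: str, version: int):
    """Run the snapshot query on an existing Spark session (assumed, not created here)."""
    return spark.sql(snapshot_sql(table, version))
```

In a notebook, `display(spark.sql(history_sql("default.spotify_bronze")))` would show the available versions, and `read_snapshot(spark, "default.spotify_bronze", 0)` would read the first one. How far back a table can be read depends on its retention settings and on whether old files have been vacuumed.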

**Partitioning Strategy**

Partition Column: track_genre

Reasoning: Downstream analytical queries frequently filter by genre (e.g., "Compare the energy of Pop vs. Rock tracks"). Because each genre is written to its own partition directory, a query that filters on track_genre reads only the matching partitions and skips the rest (partition pruning), which reduces the amount of data scanned for these queries.
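
To make the pruning argument concrete, here is a minimal sketch of the kind of genre comparison described above. It assumes the table has a numeric `energy` column and that genre values are stored as lower-case strings such as `'pop'` and `'rock'`. `genre_energy_sql` and `compare_genres` are hypothetical helper names, and `spark` is assumed to be the notebook's session.

```python
from typing import Iterable


def genre_energy_sql(table: str, genres: Iterable[str]) -> str:
    """Build a query that averages energy for the given genres only."""
    # Escape single quotes so genre names are safe inside SQL string literals.
    quoted = ", ".join("'" + g.replace("'", "''") + "'" for g in genres)
    return (
        f"SELECT track_genre, AVG(energy) AS avg_energy "
        f"FROM {table} "
        f"WHERE track_genre IN ({quoted}) "
        f"GROUP BY track_genre"
    )


def compare_genres(spark, table: str, genres: Iterable[str]):
    """Run the comparison; the filter on the partition column allows pruning."""
    result = spark.sql(genre_energy_sql(table, genres))
    # Print the physical plan so the partition filter on the scan can be inspected.
    result.explain()
    return result
```

Calling `compare_genres(spark, "default.spotify_bronze_partitioned", ["pop", "rock"])` on a table partitioned by `track_genre` should show the genre predicate as a partition filter on the scan. The exact plan text varies by runtime version.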

```python
# Configuration
# `spark` and `display` are provided by the Databricks notebook environment.
source_table = "default.spotify_bronze"
target_table_partitioned = "default.spotify_bronze_partitioned"

# Implement data partitioning
df = spark.read.table(source_table)

print(f"Writing partitioned data to {target_table_partitioned}...")

# Overwrite the target as a Delta table with one partition per track_genre value
(
    df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("track_genre")
    .saveAsTable(target_table_partitioned)
)

print("Partitioning complete.")

# Verify storage layout (look for track_genre in the partition information)
print("Verifying partition structure...")
display(spark.sql(f"DESCRIBE EXTENDED {target_table_partitioned}"))

# Storage optimization: compact small files within each partition
spark.sql(f"OPTIMIZE {target_table_partitioned}")

print(f"Success: Table {target_table_partitioned} is optimized and partitioned.")
```
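
As a complementary check on the result, `DESCRIBE DETAIL` reports a Delta table's partition columns and file count in a single row. The sketch below wraps that in two hypothetical helpers, `table_layout` and `assert_partitioned_by`, which are not part of the original notebook; `spark` is again assumed to be the notebook session.

```python
def table_layout(spark, table: str) -> dict:
    """Return the partition columns and file count reported by DESCRIBE DETAIL."""
    row = (
        spark.sql(f"DESCRIBE DETAIL {table}")
        .select("partitionColumns", "numFiles")
        .first()
    )
    return {
        "partition_columns": list(row["partitionColumns"]),
        "num_files": row["numFiles"],
    }


def assert_partitioned_by(spark, table: str, column: str) -> dict:
    """Raise ValueError unless the table lists `column` as a partition column."""
    layout = table_layout(spark, table)
    if column not in layout["partition_columns"]:
        raise ValueError(
            f"{table} is not partitioned by {column!r}: {layout['partition_columns']}"
        )
    return layout
```

For example, `assert_partitioned_by(spark, target_table_partitioned, "track_genre")` would fail fast if the write had silently produced an unpartitioned table. Comparing `num_files` before and after `OPTIMIZE` gives a rough view of how much compaction took place.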