# ✅ **Delta Lake Table Verification**

This notebook verifies that the Delta Lake table was created successfully and contains the expected data from the raw `.h5` files. We start a new Spark session, read the Delta table, and run basic summary checks: the schema and a few sample rows, the total number of unique files, and the distribution of labeled examples (`good` vs `bad`). If these match what was ingested, the ingestion phase completed correctly and the data is ready for analysis.

In [23]:
from pyspark.sql import SparkSession

In [24]:
# 🔁 Start a new Spark session (same Delta config as before)
spark = SparkSession.builder \
    .appName("Delta Lake Summary") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

In [25]:
# Path to the Delta Lake table
DELTA_PATH = "../data/delta"

In [26]:

# Load the Delta table
df = spark.read.format("delta").load(DELTA_PATH)

In [27]:

# Preview the schema and sample rows
df.printSchema()
df.show(5)


root
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
 |-- machine_id: string (nullable = true)
 |-- month: string (nullable = true)
 |-- year: string (nullable = true)
 |-- operation: string (nullable = true)
 |-- example_no: string (nullable = true)
 |-- label: string (nullable = true)

+------+------+-------+----------+-----+----+---------+----------+-----+
|     x|     y|      z|machine_id|month|year|operation|example_no|label|
+------+------+-------+----------+-----+----+---------+----------+-----+
|1227.0|-407.0| -894.0|       M01|  Feb|2019|     OP00|       004| good|
|1079.0| -19.0| -853.0|       M01|  Feb|2019|     OP00|       004| good|
|1100.0| 314.0| -860.0|       M01|  Feb|2019|     OP00|       004| good|
|1323.0| -68.0|-1056.0|       M01|  Feb|2019|     OP00|       004| good|
|1389.0|-491.0|-1139.0|       M01|  Feb|2019|     OP00|       004| good|
+------+------+-------+----------+-----+----+---------+----------+-----+
only showing top 5 rows
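
The schema printed above can also be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: `EXPECTED_SCHEMA` and `schema_mismatches` are hypothetical names introduced here, and the expected types are simply copied from the `printSchema()` output. It assumes the input has the shape of PySpark's `DataFrame.dtypes`, a list of `(column, type)` pairs.

```python
# Expected column types, copied from the printSchema() output above.
# EXPECTED_SCHEMA and schema_mismatches are illustrative names, not notebook code.
EXPECTED_SCHEMA = {
    "x": "double",
    "y": "double",
    "z": "double",
    "machine_id": "string",
    "month": "string",
    "year": "string",
    "operation": "string",
    "example_no": "string",
    "label": "string",
}


def schema_mismatches(actual_dtypes, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable differences between two schemas.

    actual_dtypes is a list of (column, type) pairs, the shape of the
    DataFrame.dtypes attribute in PySpark.
    """
    actual = dict(actual_dtypes)
    problems = []
    # Columns that are missing or have a different type than expected.
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual[name]}")
    # Columns present in the table but not in the expected schema.
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# In the notebook this would be called as schema_mismatches(df.dtypes);
# here a hand-written list stands in for df.dtypes.
sample_dtypes = list(EXPECTED_SCHEMA.items())
print(schema_mismatches(sample_dtypes))  # an empty list means the schemas agree
```

Returning a list of differences, rather than raising on the first one, makes it easier to see every schema drift at once when a re-ingestion changes several columns.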

In [28]:
# Total number of files (counted as distinct metadata combinations)
df.select("machine_id", "month", "year", "operation", "example_no", "label") \
  .distinct() \
  .count()

1702

In [29]:
# Number of "good" and "bad" examples (one per distinct metadata combination)
df.select("machine_id", "month", "year", "operation", "example_no", "label") \
  .distinct() \
  .groupBy("label") \
  .count() \
  .show()

+-----+-----+
|label|count|
+-----+-----+
|  bad|   70|
| good| 1632|
+-----+-----+
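
The two summaries can be cross-checked against each other: the label counts (70 + 1632) should add up to the file count (1702), and `bad` examples make up roughly 4% of the files. The sketch below, which is an illustration and not part of the original notebook, does that arithmetic in plain Python. `summarize_labels` is a hypothetical helper, and the numbers are copied by hand from the outputs above.

```python
def summarize_labels(label_counts, expected_total):
    """Check that per-label counts add up, then return each label's share."""
    total = sum(label_counts.values())
    if total != expected_total:
        raise ValueError(
            f"label counts sum to {total}, expected {expected_total}"
        )
    return {label: n / total for label, n in label_counts.items()}


# Values copied from the cell outputs above. In the notebook they could be
# collected with {row["label"]: row["count"] for row in label_df.collect()},
# where label_df is a hypothetical name for the grouped DataFrame.
shares = summarize_labels({"bad": 70, "good": 1632}, expected_total=1702)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The strong imbalance between the two labels is worth keeping in mind for any later modeling step, since plain accuracy would be dominated by the `good` class.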



In [30]:
# Stop the Spark session
spark.stop()