In [0]:
# Cell 2 loads two tables from the bronze layer: device_message_raw and rapid_step_test_raw.
# It assigns them to df_device and df_steps, then displays both DataFrames.

In [0]:
df_device = spark.table("workspace.bronze.device_message_raw")
df_steps = spark.table("workspace.bronze.rapid_step_test_raw")
display(df_device)
display(df_steps)

In [0]:
# Cell 4 extracts the numeric value from the 'distance' column in df_device using a regular expression,
# converts it to integer type, and stores it in a new column 'distance_cm'.

In [0]:
from pyspark.sql.functions import regexp_extract, col
df_device = df_device.withColumn(
    "distance_cm", regexp_extract(col("distance"), r"(\d+)", 1).cast("int")
)
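
To make the extraction rule concrete, here is a minimal plain-Python sketch of what the pattern `(\d+)` keeps and drops. It is not Spark code: `extract_distance_cm` and the sample strings are hypothetical, and `None` stands in for the NULL that Spark is assumed to produce when the input is NULL or contains no digits (with ANSI mode off; with ANSI mode on, the cast may raise an error instead).

```python
import re

def extract_distance_cm(raw):
    # Mirrors the notebook's pattern: first run of ASCII digits, then int.
    # None stands in for Spark NULL (NULL input, or no digits found).
    if raw is None:
        return None
    match = re.search(r"(\d+)", raw, flags=re.ASCII)
    return int(match.group(1)) if match else None

# Hypothetical sample values, not taken from device_message_raw.
samples = ["12cm", "distance: 7 cm", "3.5cm", "n/a", None]
results = [extract_distance_cm(s) for s in samples]
print(results)
```

Note that only the first run of digits is kept, so a value such as "3.5cm" would become 3 rather than 3.5. If the raw column can hold decimals, the pattern would need to change.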

In [0]:
# Cell 6 adds a new column 'source' to both df_device and df_steps DataFrames,
# labeling the origin of each record as either 'device' or 'step'.

In [0]:
from pyspark.sql.functions import lit
df_device = df_device.withColumn("source", lit("device"))
df_steps = df_steps.withColumn("source", lit("step"))

In [0]:
# Cell 8 selects the columns 'device_id', 'start_time', and 'stop_time' from the df_steps DataFrame,
# creating a new DataFrame df_steps_window for use in subsequent join operations.

In [0]:
df_steps_window = df_steps.select(
    "device_id", "start_time", "stop_time"
)

In [0]:
# Cell 10 left-joins df_device to df_steps_window on device_id and on timestamp falling between
# start_time and stop_time (both bounds inclusive). Each device record is labeled 'step' if it
# matched a step test window, otherwise 'no_step'.
# It also recomputes 'distance_cm' from 'distance', repeating the extraction already done in Cell 4.

In [0]:
from pyspark.sql.functions import when, regexp_extract, col

df_labeled = (
    df_device.alias("d")
    .join(
        df_steps_window.alias("s"),
        (col("d.device_id") == col("s.device_id")) &
        (col("d.timestamp").between(col("s.start_time"), col("s.stop_time"))),
        "left"
    )
    .withColumn(
        "step_label",
        when(col("s.start_time").isNotNull(), "step").otherwise("no_step")
    )
    .withColumn(
        "distance_cm",
        regexp_extract(col("d.distance"), r"(\d+)", 1).cast("int")
    )
)
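
The join semantics can be sketched in plain Python to show two properties that are easy to miss: the window bounds are inclusive, and a device record that falls inside two overlapping windows is emitted twice. This is an illustration only, not Spark code. `label_rows` is a hypothetical helper, the rows are invented, and timestamps are shown as plain integers because the notebook does not state the actual column types.

```python
def label_rows(device_rows, step_windows):
    # Emulates the left join: one output row per matching window,
    # or a single 'no_step' row when nothing matches.
    labeled = []
    for row in device_rows:
        matches = [
            w for w in step_windows
            if w["device_id"] == row["device_id"]
            and w["start_time"] <= row["timestamp"] <= w["stop_time"]
        ]
        if matches:
            labeled.extend({**row, "step_label": "step"} for _ in matches)
        else:
            labeled.append({**row, "step_label": "no_step"})
    return labeled

# Hypothetical data: two overlapping windows for device "A".
windows = [
    {"device_id": "A", "start_time": 100, "stop_time": 200},
    {"device_id": "A", "start_time": 150, "stop_time": 250},
]
device_rows = [
    {"device_id": "A", "timestamp": 100},  # on the boundary of the first window
    {"device_id": "A", "timestamp": 175},  # inside both windows
    {"device_id": "A", "timestamp": 300},  # outside every window
    {"device_id": "B", "timestamp": 120},  # device with no step test
]
labeled = label_rows(device_rows, windows)
print(len(labeled), [r["step_label"] for r in labeled])
```

If step test windows for one device can overlap in the real data, the row counts in the labeled table would exceed the device row count, so it may be worth checking for duplicates before relying on the label distribution.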

In [0]:
# Cell 12 creates the final DataFrame df_final by selecting and renaming columns from df_labeled.
# It includes timestamp, sensorType, distance_cm, deviceId, step_label, and source for downstream processing.

In [0]:
df_final = df_labeled.selectExpr(
    "timestamp",
    "sensor_type as sensorType",
    "distance_cm",
    "d.device_id as deviceId",
    "step_label",
    "source"
)

In [0]:
# Cell 14 uses the silver database and writes the df_final DataFrame to a table named labeled_step_test,
# overwriting any existing data in the table.

In [0]:
spark.sql("USE silver")
df_final.write.mode("overwrite").saveAsTable("labeled_step_test")


In [0]:
# Cell 16 counts the rows in labeled_step_test for each step_label value,
# showing how the records are distributed between 'step' and 'no_step'.

In [0]:
%sql
SELECT
  step_label,
  COUNT(*) AS row_count
FROM labeled_step_test
GROUP BY step_label

In [0]:
# Cell 18 validates the step_label column in labeled_step_test.
# It returns up to 50 rows whose step_label is NULL or is neither 'step' nor 'no_step';
# an empty result means every row carries a valid label.

In [0]:
%sql
SELECT *
FROM labeled_step_test
WHERE step_label NOT IN ('step', 'no_step')
OR step_label IS NULL
LIMIT 50;
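
The explicit `OR step_label IS NULL` branch matters because, in SQL, `NULL NOT IN (...)` evaluates to NULL rather than TRUE, and `WHERE` keeps only rows whose predicate is TRUE. The following plain-Python sketch models that three-valued logic with `None`. The function names and sample labels are hypothetical and serve only to illustrate the behavior.

```python
ALLOWED = ("step", "no_step")

def sql_not_in(value, allowed):
    # SQL three-valued logic: NULL NOT IN (...) is NULL, modeled here as None.
    if value is None:
        return None
    return value not in allowed

def flagged_without_null_check(values):
    # WHERE keeps a row only when the predicate is exactly TRUE.
    return [v for v in values if sql_not_in(v, ALLOWED) is True]

def flagged_with_null_check(values):
    # Same predicate plus the explicit "OR step_label IS NULL" branch.
    return [v for v in values if sql_not_in(v, ALLOWED) is True or v is None]

# Hypothetical labels, including one bad value and one NULL.
labels = ["step", "no_step", "STEP", None]
print(flagged_without_null_check(labels))
print(flagged_with_null_check(labels))
```

Without the NULL branch, only the misspelled label would be reported and the NULL row would pass the check silently.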


In [0]:
# Cell 20 validates the source column in labeled_step_test.
# It returns up to 50 rows whose source is NULL or is neither 'device' nor 'step'.
# Because df_final is built from device rows only, every row is expected to have source 'device',
# and an empty result means the check passed.

In [0]:
%sql
SELECT *
FROM labeled_step_test
WHERE source NOT IN ('device', 'step')
OR source IS NULL
LIMIT 50;


**Are we labeling data fairly?**  
Yes. Labels follow a single objective rule: a device record is labeled 'step' when its device_id matches a step test and its timestamp falls within that test's start_time to stop_time window, and 'no_step' otherwise. The same rule is applied to every record, which keeps the labeling consistent and unbiased.

**Are we protecting identity?**  
Yes. The pipeline uses only device identifiers; no personally identifiable information is processed or exposed. All data is handled within the secure Databricks environment, which minimizes the risk of identity disclosure.

**Are we avoiding medical claims?**  
Yes. Labeling and processing cover only device activity and step test participation. The data is not used to interpret or infer medical outcomes, and no medical advice or claims are made from it.