# Curated STEDI Step Test Dataset (Silver Layer)
Geovanny Peña Rueda

## Purpose
This notebook joins the Bronze DeviceMessage and RapidStepTest tables to create
a clean Silver dataset for machine learning, with each sensor reading labeled `step` or `no_step`.

## Part 1 – Load Raw Bronze Tables


In [0]:
df_device = spark.table("workspace.bronze.device_messages_raw")
df_steps = spark.table("workspace.bronze.rapid_step_tests_raw")

display(df_device)
display(df_steps)


## Part 2 – Prepare Each Table


In [0]:
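from pyspark.sql.functions import regexp_extract, col, lit

# Pull the first run of digits out of the raw `distance` string
# and store it as an integer number of centimeters.
df_device = df_device.withColumn(
    "distance_cm",
    regexp_extract(col("distance"), r"(\d+)", 1).cast("int")
)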
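
The extraction rule above can be checked outside Spark. This is a minimal plain-Python sketch of the same regular expression. The sample strings and the helper name `extract_distance_cm` are hypothetical, since the actual format of the `distance` column is not shown in this notebook.

```python
import re

# Hypothetical raw values; the real format of `distance` is an assumption.
samples = ["45cm", "7 cm", "cm", ""]

def extract_distance_cm(raw):
    """Return the first run of digits as an int, or None when there is none."""
    match = re.search(r"(\d+)", raw)
    return int(match.group(1)) if match else None

parsed = [extract_distance_cm(s) for s in samples]
print(parsed)  # [45, 7, None, None]
```

In Spark, `regexp_extract` returns an empty string when nothing matches, and casting that to `int` should yield null when ANSI mode is off. It is worth confirming how the cluster in use handles that case.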


In [0]:

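# Add source labels.
# Note: only device rows reach the final dataset (df_steps contributes just its
# time windows in Part 3), so `source` is always "device" in the Silver table.
df_device = df_device.withColumn("source", lit("device"))
df_steps = df_steps.withColumn("source", lit("step"))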


## Part 3 – Label Each Sensor Reading (step / no_step)


In [0]:
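from pyspark.sql.functions import when

# Keep only the columns that define each step-test window.
df_steps_window = df_steps.select(
    "device_id", "start_time", "stop_time"
)

# Left join keeps every device reading. A reading matches a window when it comes
# from the same device and its timestamp lies within the window (bounds inclusive).
df_labeled = (
    df_device.alias("d")
    .join(
        df_steps_window.alias("s"),
        (col("d.device_id") == col("s.device_id")) &
        (col("d.timestamp").between(col("s.start_time"), col("s.stop_time"))),
        "left"
    )
    # A non-null start_time means at least one window matched this reading.
    .withColumn(
        "step_label",
        when(col("s.start_time").isNotNull(), "step").otherwise("no_step")
    )
)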
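
The labeling rule in the join above can be illustrated on a few rows in plain Python. This is a sketch with hypothetical device IDs, timestamps, and a hypothetical helper `label_reading`. It is not the notebook's data and it does not replace the Spark join.

```python
from datetime import datetime

# Hypothetical step-test windows: (device_id, start_time, stop_time).
windows = [
    ("dev-1", datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 0, 30)),
]

# Hypothetical sensor readings: (device_id, timestamp).
readings = [
    ("dev-1", datetime(2024, 1, 1, 9, 59, 59)),  # before the window
    ("dev-1", datetime(2024, 1, 1, 10, 0, 0)),   # exactly at start_time
    ("dev-1", datetime(2024, 1, 1, 10, 0, 30)),  # exactly at stop_time
    ("dev-2", datetime(2024, 1, 1, 10, 0, 10)),  # different device
]

def label_reading(device_id, timestamp):
    """Return 'step' if the reading falls inside any window for its device."""
    for win_device, start, stop in windows:
        # Both bounds are inclusive, like Column.between in PySpark.
        if win_device == device_id and start <= timestamp <= stop:
            return "step"
    return "no_step"

labels = [label_reading(d, t) for d, t in readings]
print(labels)  # ['no_step', 'step', 'step', 'no_step']
```

One difference to keep in mind: this helper returns exactly one label per reading, whereas the left join emits one output row per matching window. If two step-test windows for the same device overlap, the join would repeat the reading.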


## Part 4 – Select Final Curated Columns


In [0]:
from pyspark.sql.functions import col

df_final = df_labeled.select(
    col("d.timestamp").alias("timestamp"),
    col("d.sensor_type").alias("sensor_type"),
    col("distance_cm"),
    col("d.device_id").alias("device_id"),
    col("step_label"),
    col("source")
)

display(df_final)



## Part 5 – Save Curated Silver Dataset

In [0]:
# Write with the fully qualified name so the target matches the verification queries.
df_final.write.mode("overwrite").saveAsTable("workspace.silver.labeled_step_test")


## Part 6 – Verification Queries


In [0]:
%sql
SELECT step_label, COUNT(*) AS row_count
FROM workspace.silver.labeled_step_test
GROUP BY step_label;


In [0]:
%sql
SELECT *
FROM workspace.silver.labeled_step_test
WHERE step_label NOT IN ('step', 'no_step')
   OR step_label IS NULL
LIMIT 50;


In [0]:
%sql
SELECT source, COUNT(*) AS row_count
FROM workspace.silver.labeled_step_test
GROUP BY source;

In [0]:
%sql
SELECT *
FROM workspace.silver.labeled_step_test
WHERE source NOT IN ('device', 'step')
   OR source IS NULL
LIMIT 50;

## Ethics Check

- Labels are derived from objective timestamp ranges rather than manual judgment, which reduces labeling bias.
- No personally identifiable information is exposed.
- The dataset does not make medical claims; it only records step activity signals.
