# Automated STEDI ETL Pipeline

This notebook automates the ETL process that curates raw STEDI sensor data into a clean Silver dataset for machine learning. It is designed to run as a Databricks Job.


## Step 1: Load and Clean Device Messages

This step extracts raw device sensor messages from the Bronze layer and prepares them for alignment with the step test windows. The `distance` field is cleaned by removing the `cm` unit string and casting the result to an integer (`distance_cm`). A constant `source_label` of `device` is added to mark these records as device sensor data.


In [0]:
%sql
CREATE OR REPLACE TEMP VIEW device_messages_clean AS
SELECT
  date AS event_time,
  device_id,
  sensor_type,
  CAST(REGEXP_REPLACE(distance, 'cm', '') AS INT) AS distance_cm,
  'device' AS source_label
FROM workspace.bronze.device_messages_raw;
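
The cleaning rule can be illustrated outside Spark. The following is a minimal plain-Python sketch, not part of the pipeline, that mirrors the `REGEXP_REPLACE` plus `CAST` expression on a few made-up strings. The function name `parse_distance_cm` and the sample values are hypothetical, and returning `None` for unparseable input is an assumption of this sketch; how Spark's `CAST` treats malformed strings depends on the cluster's SQL settings.

```python
import re
from typing import Optional


def parse_distance_cm(raw: Optional[str]) -> Optional[int]:
    """Remove the 'cm' unit string and convert the remainder to an int.

    Returns None for missing or unparseable input (an assumption of this
    sketch, not a statement about Spark's behavior).
    """
    if raw is None:
        return None
    without_unit = re.sub("cm", "", raw)  # same pattern as the SQL cell
    try:
        return int(without_unit)
    except ValueError:
        return None


# Hypothetical sample values, chosen only to exercise each branch.
samples = ["12cm", "0cm", "7", "n/a", None]
cleaned = [parse_distance_cm(s) for s in samples]
```

A sketch like this is mainly useful for deciding up front what should happen to malformed readings before that decision is encoded in the SQL.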


## Step 2: Load Step Test Time Windows

This step loads rapid step test session data from the Bronze layer. Each record defines a start and stop timestamp during which real steps occurred. These time windows will be used to label sensor readings accurately.


In [0]:
%sql
CREATE OR REPLACE TEMP VIEW step_tests_clean AS
SELECT
  device_id,
  start_time,
  stop_time
FROM workspace.bronze.rapid_step_tests_raw;



## Step 3: Align Sensor Readings Using Timestamps

In this step, sensor readings are aligned with step test windows on both device ID and timestamp range. A reading is labeled `step` if it falls within a known step test window for the same device, and `no_step` otherwise. Because this is a `LEFT JOIN`, every device message is kept. Each message appears exactly once only if the step test windows for a device do not overlap: a reading that falls inside two windows matches both and is emitted twice.


In [0]:
%sql
CREATE OR REPLACE TEMP VIEW final_df AS
SELECT
  d.event_time,
  d.device_id,
  d.sensor_type,
  d.distance_cm,
  d.source_label,
  s.start_time,
  s.stop_time,
  CASE
    WHEN s.device_id IS NOT NULL THEN 'step'
    ELSE 'no_step'
  END AS step_label
FROM device_messages_clean d
LEFT JOIN step_tests_clean s
  ON d.device_id = s.device_id
 AND d.event_time BETWEEN s.start_time AND s.stop_time;
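
The effect of overlapping windows on this join can be reproduced locally. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module with made-up rows (device `dev-1`, two overlapping windows) rather than Spark, and the tables are trimmed to the columns the join needs. It runs the same `LEFT JOIN ... BETWEEN` condition and, for comparison, an `EXISTS`-based variant that yields one row per message. Whether the `EXISTS` form is accepted unchanged by Databricks SQL is not verified here and should be checked before use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device_messages_clean (event_time TEXT, device_id TEXT);
CREATE TABLE step_tests_clean (device_id TEXT, start_time TEXT, stop_time TEXT);

-- Hypothetical data: one reading inside two overlapping windows, one outside.
INSERT INTO device_messages_clean VALUES
  ('2024-01-01 10:00:05', 'dev-1'),
  ('2024-01-01 11:00:00', 'dev-1');
INSERT INTO step_tests_clean VALUES
  ('dev-1', '2024-01-01 10:00:00', '2024-01-01 10:00:10'),
  ('dev-1', '2024-01-01 10:00:03', '2024-01-01 10:00:20');
""")

# Same join condition as the notebook cell above.
join_rows = conn.execute("""
SELECT d.event_time,
       CASE WHEN s.device_id IS NOT NULL THEN 'step' ELSE 'no_step' END
FROM device_messages_clean d
LEFT JOIN step_tests_clean s
  ON d.device_id = s.device_id
 AND d.event_time BETWEEN s.start_time AND s.stop_time
""").fetchall()

# Variant: ask only whether any matching window exists.
exists_rows = conn.execute("""
SELECT d.event_time,
       CASE WHEN EXISTS (
              SELECT 1 FROM step_tests_clean s
              WHERE s.device_id = d.device_id
                AND d.event_time BETWEEN s.start_time AND s.stop_time
            ) THEN 'step' ELSE 'no_step' END
FROM device_messages_clean d
""").fetchall()

print(len(join_rows), len(exists_rows))  # join yields 3 rows, EXISTS yields 2
conn.close()
```

The trade-off is that the `EXISTS` variant cannot return `start_time` and `stop_time`, since a message may belong to more than one window. If those columns are needed, the windows themselves would have to be made non-overlapping first.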



## Step 4: Save Curated Silver Dataset

This step rebuilds the curated Silver table, `labeled_step_test`, from the labeled dataset. `CREATE OR REPLACE TABLE` overwrites the table on every run, so rerunning the pipeline does not accumulate duplicate rows and the table always reflects the current Bronze data.


In [0]:
%sql
CREATE OR REPLACE TABLE labeled_step_test AS
SELECT * FROM final_df;


## Step 5: Validate Automated ETL Results

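The following queries check the output of the automated pipeline. The first shows the row count per `step_label`. The second and third list rows whose `step_label` or `source_label` is null or outside the expected set, and both should return no rows. These queries only display results: a run that returns offending rows still completes, so the output needs to be reviewed after the pipeline runs as a Databricks Job.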


In [0]:
%sql
SELECT step_label, COUNT(*) AS row_count
FROM labeled_step_test
GROUP BY step_label;


In [0]:
%sql
SELECT *
FROM labeled_step_test
WHERE step_label NOT IN ('step', 'no_step')
   OR step_label IS NULL
LIMIT 50;


In [0]:
%sql
SELECT *
FROM labeled_step_test
WHERE source_label NOT IN ('device', 'step')
   OR source_label IS NULL
LIMIT 50;
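
To make a bad result stop the Job rather than only appear in cell output, the same conditions can be wrapped in a check that raises an exception. The following is a minimal sketch, assuming the notebook's default language is Python. The names `CHECKS` and `run_checks` are hypothetical, and the query runner is passed in as a function so the logic can be exercised without a Spark session.

```python
from typing import Callable, Dict

# Count-based versions of the two validation queries above.
CHECKS: Dict[str, str] = {
    "invalid step_label": """
        SELECT COUNT(*) FROM labeled_step_test
        WHERE step_label NOT IN ('step', 'no_step') OR step_label IS NULL
    """,
    "invalid source_label": """
        SELECT COUNT(*) FROM labeled_step_test
        WHERE source_label NOT IN ('device', 'step') OR source_label IS NULL
    """,
}


def run_checks(count_rows: Callable[[str], int]) -> None:
    """Run every check and raise ValueError if any reports offending rows.

    count_rows is any function that executes a COUNT(*) query and
    returns the resulting integer.
    """
    failures = {}
    for name, query in CHECKS.items():
        offending = count_rows(query)
        if offending > 0:
            failures[name] = offending
    if failures:
        raise ValueError(f"Validation failed: {failures}")


# In a Databricks notebook, where a `spark` session is predefined, this
# could be called as (assumption, not executed here):
#   run_checks(lambda q: spark.sql(q).first()[0])
```

An uncaught exception in a notebook cell causes the notebook task to fail, which is what surfaces the problem in the Job run instead of letting invalid labels pass silently.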


## Ethics Reflection

When automating health-related data pipelines, engineers must prioritize privacy and security by protecting identifiers and limiting access to sensitive data. Automated pipelines should include validation steps to prevent incorrect labels or missing data from silently propagating into downstream systems. Care must be taken to avoid bias introduced by labeling logic or inconsistent sensor behavior. Finally, curated step data should not be treated as medical conclusions; automation should support analysis while clearly communicating limitations and intended use.
