# STEDI Data Exploration (Bronze Layer)

This notebook explores the STEDI step test data using SQL and PySpark.
The goal is to understand the raw data structure, perform basic cleaning,
and prepare features that will be reused later in the ML pipeline.

#### 1.1 Peek at the data (row counts)

In [0]:
-- How many rows do we have?
SELECT 'device_messages_raw' AS table_name, COUNT(*) AS rows
FROM workspace.bronze.device_messages_raw
UNION ALL
SELECT 'rapid_step_tests_raw', COUNT(*)
FROM workspace.bronze.rapid_step_tests_raw;

#### 1.2 Basic summary of device messages

This query groups raw device messages by device_id to identify which devices are sending the highest number of messages. This helps validate ingestion and detect potential outliers or noisy devices.

In [0]:
-- Count messages per device (top 10)
SELECT
  device_id,
  COUNT(*) AS messages
FROM workspace.bronze.device_messages_raw
GROUP BY device_id
ORDER BY messages DESC
LIMIT 10;

#### 1.3 Clean and cast distance
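Create a cleaned view for convenience during the lab. Distance values may arrive as strings with a unit suffix (for example, "1cm"). This step strips every character other than digits and the decimal point, then uses TRY_CAST so that any value that still cannot be parsed becomes NULL instead of failing the query. The result is exposed as distance_cm, in centimeters.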

In [0]:
CREATE OR REPLACE TEMP VIEW device_messages_clean AS
SELECT
  device_id,
  TRY_CAST(
    REGEXP_REPLACE(CAST(distance AS STRING), '[^0-9.]', '')
    AS DOUBLE
  ) AS distance_cm,
  CAST(timestamp AS BIGINT) AS ts_ms,
  CAST(date AS BIGINT) AS date_ms,
  message_origin,
  sensor_type,
  message
FROM workspace.bronze.device_messages_raw;

In [0]:
-- Check the cleaning

SELECT * FROM device_messages_clean LIMIT 20;
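
To make the cleaning rule easier to reason about, here is a minimal plain-Python sketch of the same idea: strip everything except digits and the decimal point, then attempt a cast and fall back to a null. The function name `clean_distance` and the sample values are hypothetical and only illustrate the logic; they are not taken from the STEDI data.

```python
import re

def clean_distance(raw):
    """Strip every character except digits and '.', then try to parse a float.

    Returns None when nothing parseable is left, which plays the role of
    TRY_CAST returning NULL in the SQL view.
    """
    if raw is None:
        return None
    stripped = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(stripped)
    except ValueError:
        # Covers leftovers such as "" or "1.2.3"
        return None

# Hypothetical inputs, chosen to exercise the edge cases
samples = ["1cm", "12.5 cm", "cm", None, "-3cm"]
cleaned = [clean_distance(s) for s in samples]
print(cleaned)
```

One thing the sketch makes visible: because the pattern keeps only digits and the decimal point, a leading minus sign is removed as well, so "-3cm" comes out as 3.0. If negative distances can occur in the raw data, the pattern would need to allow for them.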

#### 1.4 Per-device distance stats
Average, minimum, and maximum distance per device, limited to the 10 devices with the highest average distance.

In [0]:
SELECT device_id,
       AVG(distance_cm) AS avg_cm,
       MIN(distance_cm) AS min_cm,
       MAX(distance_cm) AS max_cm
FROM device_messages_clean
GROUP BY device_id
ORDER BY avg_cm DESC
LIMIT 10;

#### 1.6 Explode step_points (array → rows)
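step_points is an array of inter-step times. Exploding it into one row per step, with the step's position in the array, lets us compute per-customer, per-device statistics such as the step count, mean, and standard deviation.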

In [0]:
WITH exploded AS (
  SELECT
    customer,
    device_id,
    start_time,
    posexplode(step_points) AS (step_index, step_ms)
  FROM workspace.bronze.rapid_step_tests_raw
)
SELECT customer, device_id,
       COUNT(*) AS steps,
       AVG(step_ms) AS avg_step_ms,
       STDDEV(step_ms) AS sd_step_ms
FROM exploded
GROUP BY customer, device_id
ORDER BY steps DESC
LIMIT 10;
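
The explode-then-aggregate pattern can be sketched in plain Python as follows. This is only an illustration of the logic: the sample rows are hypothetical, and `enumerate` stands in for posexplode by yielding each element together with its position.

```python
from statistics import mean, stdev

# Hypothetical sample rows; real values would come from rapid_step_tests_raw
tests = [
    {"customer": "a@example.com", "device_id": "d1", "step_points": [100, 120, 110]},
    {"customer": "b@example.com", "device_id": "d2", "step_points": [200, 220]},
]

# Explode: one output row per array element, carrying its index
exploded = [
    (t["customer"], t["device_id"], step_index, step_ms)
    for t in tests
    for step_index, step_ms in enumerate(t["step_points"])
]

# Group the exploded rows by (customer, device_id)
groups = {}
for customer, device_id, _step_index, step_ms in exploded:
    groups.setdefault((customer, device_id), []).append(step_ms)

# Aggregate each group; stdev is the sample standard deviation and needs
# at least two values, so single-step groups get None here
stats = {
    key: {
        "steps": len(values),
        "avg_step_ms": mean(values),
        "sd_step_ms": stdev(values) if len(values) > 1 else None,
    }
    for key, values in groups.items()
}

for key, row in stats.items():
    print(key, row)
```

In Spark SQL, STDDEV is likewise the sample standard deviation, so the two should agree on groups with at least two steps.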

#### 1.7 Time-window join: messages during each test

In [0]:
WITH tests AS (
  SELECT customer, device_id, start_time, stop_time, test_time, total_steps
  FROM workspace.bronze.rapid_step_tests_raw
),
msgs AS (
  SELECT device_id, ts_ms, distance_cm
  FROM device_messages_clean
  WHERE distance_cm IS NOT NULL
)
SELECT
  t.customer,
  t.device_id,
  t.start_time,
  t.stop_time,
  COUNT(m.ts_ms) AS readings_in_window,
  AVG(m.distance_cm) AS avg_cm_in_window,
  MIN(m.distance_cm) AS min_cm_in_window,
  MAX(m.distance_cm) AS max_cm_in_window
FROM tests t
JOIN msgs m
  ON m.device_id = t.device_id
 AND m.ts_ms BETWEEN t.start_time AND t.stop_time
GROUP BY t.customer, t.device_id, t.start_time, t.stop_time
ORDER BY readings_in_window DESC
LIMIT 20;

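This query shows how many sensor readings occurred during each test
and the average, minimum, and maximum distance for that test window. Because it uses an inner join, tests with no readings inside their window do not appear in the result. We will reuse these aggregates as features for machine learning later.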
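
The window logic can be sketched in plain Python to show which readings count toward a test. The rows below are hypothetical, and the sketch assumes start_time and stop_time are epoch milliseconds directly comparable to ts_ms, which the join condition implies but the raw schema shown here does not confirm.

```python
from statistics import mean

# Hypothetical tests and cleaned messages (times assumed to be epoch ms)
tests = [
    {"customer": "a@example.com", "device_id": "d1", "start_time": 1000, "stop_time": 2000},
    {"customer": "b@example.com", "device_id": "d2", "start_time": 1000, "stop_time": 2000},
]
msgs = [
    {"device_id": "d1", "ts_ms": 1000, "distance_cm": 10.0},  # on the boundary: included
    {"device_id": "d1", "ts_ms": 1500, "distance_cm": 20.0},  # inside the window
    {"device_id": "d1", "ts_ms": 2500, "distance_cm": 99.0},  # after stop_time: excluded
    {"device_id": "d2", "ts_ms": 500, "distance_cm": 5.0},    # before start_time: excluded
]

features = []
for t in tests:
    # Same device, and timestamp within [start_time, stop_time] inclusive (BETWEEN)
    in_window = [
        m["distance_cm"]
        for m in msgs
        if m["device_id"] == t["device_id"]
        and t["start_time"] <= m["ts_ms"] <= t["stop_time"]
    ]
    if not in_window:
        continue  # inner-join behavior: tests without readings yield no row
    features.append({
        "customer": t["customer"],
        "device_id": t["device_id"],
        "readings_in_window": len(in_window),
        "avg_cm_in_window": mean(in_window),
        "min_cm_in_window": min(in_window),
        "max_cm_in_window": max(in_window),
    })

print(features)
```

Here the second test has no readings in its window and is dropped. If such tests should be kept as rows with zero readings, a LEFT JOIN in the SQL would be the usual alternative.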
