### Notebook Overview

This notebook demonstrates a typical ETL and feature engineering workflow using PySpark and Pandas. It begins by exploring two Bronze tables containing raw device messages and rapid step test results. It then cleans and transforms these datasets, computes summary statistics, and engineers features by joining sensor readings with test windows, producing a compact features table suitable for machine learning tasks. Finally, it plots sensor data for a sample device as a visual sanity check.

### Bronze Table Exploration

This cell loads the raw Bronze tables (`device_messages_raw` and `rapid_step_tests_raw`) and prints their schemas. The purpose is to understand the available fields and data types before any cleaning or transformation.

In [0]:
# Spark DataFrames from the catalog
dm = spark.table("workspace.bronze.device_messages_raw")
rt = spark.table("workspace.bronze.rapid_step_tests_raw")

# Inspect the available fields and their data types
dm.printSchema()
rt.printSchema()

### Device Message Cleaning

This cell loads raw device messages and cleans them. It extracts the numeric distance values, converts them to centimeters, and casts timestamps to integers. The result is `dm_clean`, a cleaned DataFrame with the `device_id`, `ts_ms`, and `distance_cm` columns that later cells use for analysis and feature engineering.
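
The parsing rules described above can be sketched in plain Python, independent of Spark. This is only an illustration under stated assumptions: the raw field names (`distance`, `timestamp`), the string format of the distance value, and the `unit_to_cm` scale factor are hypothetical, because the schema and units of `device_messages_raw` are not shown here. Only the output names (`device_id`, `ts_ms`, `distance_cm`) come from the later cells.

```python
import re
from typing import Optional

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_distance_cm(raw, unit_to_cm: float = 1.0) -> Optional[float]:
    """Extract the first number in `raw` and scale it to centimeters.

    `unit_to_cm` is an assumed conversion factor (for example 0.1 if the
    raw values were in millimeters); the real source unit is not known here.
    """
    if raw is None:
        return None
    match = _NUMBER.search(str(raw))
    if match is None:
        return None
    return float(match.group()) * unit_to_cm


def parse_ts_ms(raw) -> Optional[int]:
    """Coerce a timestamp to an integer, or return None if it is not numeric."""
    try:
        # Going through float also accepts values such as "1000.0".
        return int(float(raw))
    except (TypeError, ValueError, OverflowError):
        return None


def clean_message(msg: dict, unit_to_cm: float = 1.0) -> Optional[dict]:
    """Return a cleaned record, or None when distance or timestamp is unusable.

    The input keys "distance" and "timestamp" are hypothetical names.
    """
    distance_cm = parse_distance_cm(msg.get("distance"), unit_to_cm)
    ts_ms = parse_ts_ms(msg.get("timestamp"))
    if distance_cm is None or ts_ms is None:
        return None
    return {"device_id": msg.get("device_id"), "ts_ms": ts_ms, "distance_cm": distance_cm}
```

In a Spark version of this step, the same rules would typically map to a regular-expression extraction followed by casts on the corresponding columns, with unparseable rows filtered out.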

### Device Message Statistics

This cell computes summary statistics for each device, including the number of readings, average, minimum, and maximum distance. The result helps identify device activity patterns and potential outliers.

In [0]:
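from pyspark.sql import functions as F  # harmless if already imported in an earlier cell

# dm_clean is the cleaned DataFrame produced by the Device Message Cleaning step
dm_stats = (dm_clean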
    .groupBy("device_id")
    .agg(F.count("*").alias("n"),
         F.avg("distance_cm").alias("avg_cm"),
         F.min("distance_cm").alias("min_cm"),
         F.max("distance_cm").alias("max_cm"))
    .orderBy(F.desc("n"))
)
display(dm_stats.limit(20))

### Rapid Step Test Feature Engineering

This cell explodes the step points array in the rapid step tests, calculates step timing statistics per test, and displays the results. The purpose is to extract granular step timing features for each test, which are useful for ML modeling and analysis.

In [0]:
rt = spark.table("workspace.bronze.rapid_step_tests_raw")

rt_exploded = (rt
    .select("customer", "device_id", "start_time", "stop_time", "test_time", "total_steps",
            F.posexplode("step_points").alias("step_index", "step_ms"))
)

display(rt_exploded.limit(20))

# Step timing stats per test
step_stats = (rt_exploded
    .groupBy("customer", "device_id", "start_time", "stop_time")
    .agg(F.count("*").alias("steps"),
         F.avg("step_ms").alias("avg_step_ms"),
         F.stddev("step_ms").alias("sd_step_ms"))
    .orderBy(F.desc("steps"))
)

display(step_stats.limit(20))
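
To make the per-test aggregation concrete, here is a small plain-Python sketch of what the explode-then-aggregate step computes for a single test. The helper names (`posexplode`, `step_timing_stats`) and the sample values in the usage note are illustrative only. The sketch assumes the sample standard deviation (n - 1 denominator), which is what Spark's `stddev` computes.

```python
from statistics import mean, stdev


def posexplode(values):
    """Return (position, value) pairs, one per array element."""
    return list(enumerate(values))


def step_timing_stats(step_points):
    """Compute the per-test step timing statistics for one step_points array."""
    rows = posexplode(step_points or [])
    if not rows:
        # A null or empty array yields no exploded rows, so the test would
        # not appear in the grouped result at all.
        return None
    step_ms = [value for _, value in rows]
    return {
        "steps": len(step_ms),
        "avg_step_ms": mean(step_ms),
        # The sample standard deviation is undefined for a single step.
        "sd_step_ms": stdev(step_ms) if len(step_ms) >= 2 else None,
    }
```

For example, `step_timing_stats([400, 420, 410])` gives 3 steps with an average of 410 and a sample standard deviation of 10. Tests with only one step point have no defined standard deviation, which is worth remembering before feeding `sd_step_ms` into a model.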

### Feature Table Construction

This cell joins each rapid step test window with the device sensor readings that fall inside it, then aggregates summary statistics of distance (count, average, min, max, variance) per test. The resulting `features` table is suitable for machine learning training and analysis.
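
The window join described above can be illustrated with a minimal plain-Python sketch. Several details are assumptions rather than facts from this notebook: that `start_time` and `stop_time` are in the same unit as `ts_ms`, that both window bounds are inclusive, that tests with no readings are dropped (inner-join behavior), that the variance is the sample variance, and that the output column names (`n`, `avg_cm`, `min_cm`, `max_cm`, `var_cm`) follow the naming used in the device statistics cell.

```python
from statistics import mean, variance


def build_features(tests, readings):
    """Aggregate the distance readings that fall inside each test window."""
    features = []
    for test in tests:
        # Keep readings from the same device whose timestamp lies in the window.
        in_window = [
            r["distance_cm"]
            for r in readings
            if r["device_id"] == test["device_id"]
            and test["start_time"] <= r["ts_ms"] <= test["stop_time"]
        ]
        if not in_window:
            continue  # assumed inner-join behavior: no readings, no feature row
        features.append({
            "device_id": test["device_id"],
            "start_time": test["start_time"],
            "stop_time": test["stop_time"],
            "n": len(in_window),
            "avg_cm": mean(in_window),
            "min_cm": min(in_window),
            "max_cm": max(in_window),
            # Sample variance needs at least two readings.
            "var_cm": variance(in_window) if len(in_window) >= 2 else None,
        })
    return features
```

In Spark, the same logic is usually expressed as a join on `device_id` combined with a range condition on the timestamp, followed by a `groupBy` over the test's identifying columns.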

### Device Distance Line Plot

This cell samples distance readings for one device and plots them over time. The purpose is to visually inspect sensor behavior and identify trends or anomalies in the data.

In [0]:
# Small sample for a simple line plot of distances over time for one device

import pandas as pd
import matplotlib.pyplot as plt

sample_device = features.select("t.device_id").first()["device_id"]

pdf = (dm_clean
       .filter(F.col("device_id") == sample_device)
       .orderBy("ts_ms")
       .limit(1000)
       .select("ts_ms", "distance_cm")
       .toPandas())

plt.figure()
plt.plot(pdf["ts_ms"], pdf["distance_cm"])
plt.title(f"Distance over time (device {sample_device})")
plt.xlabel("timestamp (ms)")
plt.ylabel("distance (cm)")
plt.show()

This Python section mirrors the SQL steps, but also prepares a compact features table (average, minimum, maximum, and variance of distance within each test window). We will reuse this features table in the upcoming machine learning weeks.