# 🤖 01 - Data Exploration: Manipulator Health Monitoring (EDA)

This notebook performs the **Exploratory Data Analysis (EDA)** phase on the raw UR5 manipulator sensor data, sourced from NIST.

The primary goals are to:
1.  Consolidate and inspect the raw, multi-file sensor readings.
2.  Assess data quality, memory usage, and column types.
3.  Perform initial data parsing to convert list-like sensor readings (e.g., joint positions, forces) into separate, usable numeric columns.
4.  Save the cleaned, structured dataset for subsequent preprocessing and feature engineering.

---

## 📖 Table of Contents

1.  [Setup and Configuration](#1.-⚙️-Setup-and-Configuration)
2.  [Data Loading and Concatenation](#2.-💾-Data-Loading-and-Concatenation)
3.  [Basic Data Inspection](#3.-📊-Basic-Data-Inspection)
4.  [Data Parsing and List Expansion](#4.-📝-Data-Parsing-and-List-Expansion)
5.  [Save Processed Data](#5.-💾-Save-Processed-Data)
6.  [Summary](#6.-🧾-Summary)


## 1. ⚙️ Setup and Configuration

Import the libraries used for data handling, file management, visualization, and parsing. pandas is configured to display every column, Seaborn uses the `whitegrid` style, and Python warnings are suppressed for cleaner output.

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import ast # For safe evaluation of list-like strings
from tqdm.auto import tqdm # For tracking progress during large operations

# File handling
import os
import glob

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

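# Configuration
import warnings

pd.set_option('display.max_columns', None)  # show every column in DataFrame previews
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')  # note: this hides all warnings, including deprecation notices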

## 2. 💾 Data Loading and Concatenation

The raw sensor data is stored across multiple CSV files (`*.csv`), where each file represents a single test run. The files do not follow standard CSV conventions: each line holds a parenthesized sequence of list-valued fields. We therefore parse every line with a custom function and merge the parsed records into a single DataFrame.

In [3]:
# Function to safely parse the list-like strings in each CSV line
def parse_sensor_line(line):
    """Parse one line of the non-standard sensor CSV into a list of fields."""
    # Drop surrounding whitespace and the enclosing parentheses
    line = line.strip().lstrip('(').rstrip(')')
    try:
        # Wrap the content in brackets so literal_eval sees a valid Python list
        return ast.literal_eval(f"[{line}]")
    except (ValueError, SyntaxError, TypeError):
        # Malformed line: return None so the caller can skip it
        return None

# Glob all sensor CSV files (sorted so the row order is reproducible)
csv_files = sorted(glob.glob("../data/raw/sensor_data/*.csv"))
print(f"📁 Found {len(csv_files)} raw sensor data files.")

# Define columns (required to load the data without a header)
SENSOR_COLUMNS = [
    "ROBOT_TIME", "ROBOT_TARGET_JOINT_POSITIONS", "ROBOT_ACTUAL_JOINT_POSITIONS",
    "ROBOT_TARGET_JOINT_VELOCITIES", "ROBOT_ACTUAL_JOINT_VELOCITIES",
    "ROBOT_TARGET_JOINT_CURRENTS", "ROBOT_ACTUAL_JOINT_CURRENTS",
    "ROBOT_TARGET_JOINT_ACCELERATIONS", "ROBOT_TARGET_JOINT_TORQUES",
    "ROBOT_JOINT_CONTROL_CURRENT", "ROBOT_CARTESIAN_COORD_TOOL",
    "ROBOT_TCP_FORCE", "ROBOT_JOINT_TEMP"
]

all_data = []
NUM_COLUMNS = len(SENSOR_COLUMNS)

# Read each file line by line, keeping only rows that parse into the expected number of fields
for fpath in tqdm(csv_files, desc="Loading and parsing files"):
    try:
        with open(fpath, "r") as f:
            for line in f:
                parsed = parse_sensor_line(line)
                # Only append if parsing was successful (i.e., not None)
                if parsed is not None and len(parsed) == NUM_COLUMNS:
                    all_data.append(parsed)
    except Exception as e:
        print(f"⚠️ Error reading file {fpath}: {e}")
        continue

# Build the combined DataFrame from the successfully parsed rows
sensor_data = pd.DataFrame(all_data, columns=SENSOR_COLUMNS)

# Load the optional metadata workbooks (header and summary)
header_path = "../data/raw/header/ur5testresult_header.xlsx"
summary_path = "../data/raw/summary/calculated_deviation_of_actual_position_to_nominal_position.xlsx"
# Wrapped in try/except because the metadata is not strictly necessary for the EDA flow
try:
    header_df = pd.read_excel(header_path)
    summary_df = pd.read_excel(summary_path)
except Exception as e:
    print(f"Metadata loading skipped (File not found or I/O error): {e}")


print(f"✅ Combined sensor data shape: {sensor_data.shape}")

📁 Found 18 raw sensor data files.

✅ Combined sensor data shape: (153658, 13)
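
To illustrate what the parser does with individual lines, here is a minimal sketch on made-up input. The sample lines and the three-field layout are hypothetical, and the function is restated so the block runs on its own. It also counts skipped lines, which the loading loop above drops silently.

```python
import ast

def parse_sensor_line(line):
    """Same idea as the notebook's parser: strip parentheses, then literal_eval."""
    line = line.strip().lstrip('(').rstrip(')')
    try:
        return ast.literal_eval(f"[{line}]")
    except (ValueError, SyntaxError):
        return None

# Hypothetical lines: two well-formed, one truncated mid-list.
sample_lines = [
    "([1.0], [0.1, 0.2], [3.5, 4.5])\n",
    "([2.0], [0.3, 0.4], [5.5, 6.5])\n",
    "([3.0], [0.5, 0.6\n",
]

parsed_rows, skipped = [], 0
for raw in sample_lines:
    row = parse_sensor_line(raw)
    # Keep a row only if it parsed and has the expected number of fields.
    if row is not None and len(row) == 3:
        parsed_rows.append(row)
    else:
        skipped += 1

print(parsed_rows)
print(f"skipped {skipped} of {len(sample_lines)} lines")
```

Tracking the number of skipped lines this way makes it easier to judge whether the combined row count reflects the full raw data.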


## 3. 📊 Basic Data Inspection

Initial inspection shows 153,658 rows and 13 columns, all of `object` dtype. Each cell holds a Python list of sensor readings (including `ROBOT_TIME`, which is a one-element list), so these columns must be expanded into individual numeric columns before analysis.

In [4]:
print("✅ Dataset shape:", sensor_data.shape)

# Memory usage (deep=True also counts the Python objects stored in object columns)
mem_usage = sensor_data.memory_usage(deep=True).sum() / 1024**2
print(f"💾 Estimated memory usage: {mem_usage:.2f} MB")

# Dtypes overview
print("\n📊 Column type distribution:")
print(sensor_data.dtypes.value_counts())

# Preview
print("First 3 rows (raw structure):")
display(sensor_data.head(3))

✅ Dataset shape: (153658, 13)
💾 Estimated memory usage: 239.15 MB

📊 Column type distribution:
object    13
Name: count, dtype: int64
First 3 rows (raw structure):

|   | ROBOT_TIME | ROBOT_TARGET_JOINT_POSITIONS | ROBOT_ACTUAL_JOINT_POSITIONS | ROBOT_TARGET_JOINT_VELOCITIES | ROBOT_ACTUAL_JOINT_VELOCITIES | ROBOT_TARGET_JOINT_CURRENTS | ROBOT_ACTUAL_JOINT_CURRENTS | ROBOT_TARGET_JOINT_ACCELERATIONS | ROBOT_TARGET_JOINT_TORQUES | ROBOT_JOINT_CONTROL_CURRENT | ROBOT_CARTESIAN_COORD_TOOL | ROBOT_TCP_FORCE | ROBOT_JOINT_TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [13867.472] | [-26.880068716264294, -79.91160892471794, 57.0... | [-26.881428894723115, -79.91090832767539, 57.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [0.0, 0.0, -0.0, 0.0, 0.0, 0.0] | [2.433830495651467e-17, -2.2138144127730115, -... | [-0.2914353609085083, -2.640852928161621, -2.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [2.850164985011539e-16, -25.67417061620775, -1... | [-0.2914353609085083, -2.64982008934021, -2.08... | [-0.6376844833673514, 0.27758034088384226, 0.7... | [1.7623931851118197, -6.732057848264457, -14.2... | [27.37999153137207, 28.78999137878418, 28.9298... |
| 1 | [13867.48] | [-26.880068716264294, -79.91160892471794, 57.0... | [-26.87937983797211, -79.9095422898414, 57.091... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [0.0, 0.0, -0.0, 0.0, 0.0, 0.0] | [2.433830495651467e-17, -2.2138144127730115, -... | [-0.2847099304199219, -2.64982008934021, -2.06... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [2.850164985011539e-16, -25.67417061620775, -1... | [-0.2914353609085083, -2.64982008934021, -2.07... | [-0.6376930015017775, 0.27756734465474087, 0.7... | [0.9792112067350713, -6.221806455327369, -13.3... | [27.37999153137207, 28.78999137878418, 28.9298... |
| 2 | [13867.488] | [-26.880068716264294, -79.91160892471794, 57.0... | [-26.88006285688911, -79.9088592709244, 57.091... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [0.0, 0.0, -0.0, 0.0, 0.0, 0.0] | [2.433830495651467e-17, -2.2138144127730115, -... | [-0.2959190011024475, -2.663270950317383, -2.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | [2.850164985011539e-16, -25.67417061620775, -1... | [-0.2914353609085083, -2.64982008934021, -2.07... | [-0.6376927034316546, 0.27757289684249814, 0.7... | [0.448984913981216, -6.268759803739754, -13.17... | [27.37999153137207, 28.78999137878418, 28.9298... |
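
The `object` dtype and the difference between shallow and deep memory accounting can be reproduced on a small example. This is a sketch with made-up column names and values, not the sensor data itself.

```python
import pandas as pd

# Toy frame: one list-valued column and one plain float column (made-up values).
toy = pd.DataFrame({
    "joint_positions": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "time": [13867.472, 13867.48],
})

# The list-valued column is stored as generic Python objects.
print(toy.dtypes)

# deep=False counts only the object pointers; deep=True also adds the
# size of the list objects those pointers refer to.
shallow = toy.memory_usage(deep=False)
deep = toy.memory_usage(deep=True)
print(shallow["joint_positions"], deep["joint_positions"])
```

For the float column both measurements agree, which is why expanding lists into numeric columns makes memory usage easier to reason about.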


### 3.2. Data Quality: Missing Values and Uniques

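We look for numeric columns to summarize and compute the missing-value ratio per column. Because every column still holds lists at this point, no numeric columns are detected and the statistical summary is skipped. The missing-value check confirms that no cell is empty.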

In [5]:

# Statistical summary of numeric columns.
# At this stage every column (including ROBOT_TIME) still holds lists,
# so no numeric columns are expected until the data is expanded.
print("\n🔍 Statistical summary for ROBOT_TIME:")
numeric_cols = sensor_data.select_dtypes(include=np.number).columns
if len(numeric_cols) > 0:
    display(sensor_data[numeric_cols].describe().T)
else:
    print("No numeric columns detected yet (Data must be expanded first).")

# Missing values
print("\n⚠️ Missing values ratio (top 20 columns):")
missing = sensor_data.isna().mean().sort_values(ascending=False)
display(missing.head(20))


🔍 Statistical summary for ROBOT_TIME:
No numeric columns detected yet (Data must be expanded first).

⚠️ Missing values ratio (top 20 columns):


ROBOT_TIME                          0.0
ROBOT_TARGET_JOINT_POSITIONS        0.0
ROBOT_ACTUAL_JOINT_POSITIONS        0.0
ROBOT_TARGET_JOINT_VELOCITIES       0.0
ROBOT_ACTUAL_JOINT_VELOCITIES       0.0
ROBOT_TARGET_JOINT_CURRENTS         0.0
ROBOT_ACTUAL_JOINT_CURRENTS         0.0
ROBOT_TARGET_JOINT_ACCELERATIONS    0.0
ROBOT_TARGET_JOINT_TORQUES          0.0
ROBOT_JOINT_CONTROL_CURRENT         0.0
ROBOT_CARTESIAN_COORD_TOOL          0.0
ROBOT_TCP_FORCE                     0.0
ROBOT_JOINT_TEMP                    0.0
dtype: float64
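
A ratio of 0.0 means only that no cell is missing as a whole. `isna()` does not look inside the lists, and unique counts cannot be taken on lists directly because lists are unhashable. The sketch below, on made-up values, shows one possible way to check both before expansion.

```python
import pandas as pd

# Toy list column (made-up values); the second row hides a NaN inside its list.
readings = pd.Series([[1.0, 2.0], [1.0, float("nan")], [1.0, 2.0]])

# isna() looks at whole cells, so a list containing NaN is not flagged.
print(readings.isna().tolist())

# Checking inside each list reveals the hidden missing value.
has_nested_nan = readings.map(lambda values: any(pd.isna(v) for v in values))
print(has_nested_nan.tolist())

# Converting each list to a tuple makes it hashable, so unique counts work.
n_unique = readings.map(tuple).nunique()
print(n_unique)
```

After the expansion in Section 4 each reading has its own numeric column, so the standard `isna()` and `nunique()` calls apply without these workarounds.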

## 4. 📝 Data Parsing and List Expansion

To make the data usable for modeling, we must expand all list-form sensor readings (e.g., 6 joint positions, 6 joint velocities) into separate, distinct numeric columns (e.g., `..._POSITIONS_1` to `..._POSITIONS_6`).

In [6]:
def expand_list_columns(df, columns_to_expand):
    """Expands list-like columns into multiple numeric columns."""
    expanded_dfs = []
    
    for col in tqdm(columns_to_expand, desc="Expanding list columns"):
        # Use the first non-missing value to detect the type and list length
        first_valid = df[col].dropna().iloc[0]
        
        # Check if column values are list-type 
        if isinstance(first_valid, list):
            # Convert list-column to a DataFrame of new columns (col_1 ... col_n).
            # Passing index=df.index keeps rows aligned during the final concat.
            expanded_df = pd.DataFrame(df[col].tolist(), 
                                       columns=[f"{col}_{i+1}" for i in range(len(first_valid))],
                                       index=df.index)
            expanded_dfs.append(expanded_df)
        else:
            # If not a list, keep the original scalar column
            expanded_dfs.append(df[[col]])
            
    return pd.concat(expanded_dfs, axis=1)
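
To see the effect of the expansion on a small input, the following sketch applies a compact restatement of the helper (without the progress bar) to a toy frame. The column names `POS` and `TEMP` and all values are hypothetical.

```python
import pandas as pd

def expand_list_columns(df, columns_to_expand):
    """Compact restatement of the notebook's helper, without the progress bar."""
    parts = []
    for col in columns_to_expand:
        first_valid = df[col].dropna().iloc[0]
        if isinstance(first_valid, list):
            names = [f"{col}_{i+1}" for i in range(len(first_valid))]
            parts.append(pd.DataFrame(df[col].tolist(), columns=names, index=df.index))
        else:
            parts.append(df[[col]])
    return pd.concat(parts, axis=1)

# Toy input with hypothetical column names and values.
toy = pd.DataFrame({
    "POS": [[0.1, 0.2], [0.3, 0.4]],
    "TEMP": [[27.0, 28.0], [27.5, 28.5]],
})

expanded = expand_list_columns(toy, ["POS", "TEMP"])
print(expanded)
print(expanded.dtypes)
```

Each two-element list becomes two float columns named with a 1-based suffix, mirroring the `..._1` to `..._6` naming used for the real data.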

### 4.2. Applying Expansion and Final Data Cleanup

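We list all 12 list-valued columns and apply the expansion. Each column expands into six values (12 × 6 = 72 columns), and adding `ROBOT_TIME` gives the 73 columns reported below. `ROBOT_TIME` is carried over unchanged, so it is still an `object` column of one-element lists; its dtype is fixed in the second step.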

#### 1. Data Expansion and Initial Concatenation

In [7]:
# --- Data Expansion ---

# All list-valued fields that need expansion: joint data (6 values each),
# Cartesian tool coordinates, TCP force, and joint temperatures
list_data_cols = [
    "ROBOT_TARGET_JOINT_POSITIONS", "ROBOT_ACTUAL_JOINT_POSITIONS",
    "ROBOT_TARGET_JOINT_VELOCITIES", "ROBOT_ACTUAL_JOINT_VELOCITIES",
    "ROBOT_TARGET_JOINT_CURRENTS", "ROBOT_ACTUAL_JOINT_CURRENTS",
    "ROBOT_TARGET_JOINT_ACCELERATIONS", "ROBOT_TARGET_JOINT_TORQUES",
    "ROBOT_JOINT_CONTROL_CURRENT", 
    "ROBOT_CARTESIAN_COORD_TOOL",
    "ROBOT_TCP_FORCE",
    "ROBOT_JOINT_TEMP"
]

clean_sensor_data = expand_list_columns(sensor_data, list_data_cols)

# Add ROBOT_TIME, which was not in list_data_cols.
# It is still a one-element list per row here and is converted to float in the next step.
scalar_cols = ["ROBOT_TIME"]
clean_sensor_data = pd.concat([clean_sensor_data, sensor_data[scalar_cols]], axis=1)

print(f"✅ Final shape after expansion: {clean_sensor_data.shape}")
print("\n📊 Final Column Types and Memory Usage:")
clean_sensor_data.info(memory_usage="deep")


✅ Final shape after expansion: (153658, 73)

📊 Final Column Types and Memory Usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153658 entries, 0 to 153657
Data columns (total 73 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   ROBOT_TARGET_JOINT_POSITIONS_1      153658 non-null  float64
 1   ROBOT_TARGET_JOINT_POSITIONS_2      153658 non-null  float64
 2   ROBOT_TARGET_JOINT_POSITIONS_3      153658 non-null  float64
 3   ROBOT_TARGET_JOINT_POSITIONS_4      153658 non-null  float64
 4   ROBOT_TARGET_JOINT_POSITIONS_5      153658 non-null  float64
 5   ROBOT_TARGET_JOINT_POSITIONS_6      153658 non-null  float64
 6   ROBOT_ACTUAL_JOINT_POSITIONS_1      153658 non-null  float64
 7   ROBOT_ACTUAL_JOINT_POSITIONS_2      153658 non-null  float64
 8   ROBOT_ACTUAL_JOINT_POSITIONS_3      153658 non-null  float64
 9   ROBOT_ACTUAL_JOINT_POSITIONS_4      153658 non-null  float64
 ...  (output truncated)
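
The expansion takes the number of new columns from the first valid row, so it assumes every list in a column has the same length. A quick check of that assumption could look like the sketch below. The column name and values are made up, and this check is not part of the original notebook.

```python
import pandas as pd

# Toy list column (made-up values); the last row is one element short.
col = pd.Series([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]], name="POS")

# Count how many rows have each list length.
length_counts = col.map(len).value_counts()
print(length_counts)

# Rows whose length differs from the most common one deserve a closer look.
expected_len = length_counts.idxmax()
odd_rows = col[col.map(len) != expected_len]
print(odd_rows)
```

Running the same two lines on each entry of `list_data_cols` before calling `expand_list_columns` would surface irregular rows early.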

#### 2. Final Time Column Cleanup

In [8]:
time_col_name = 'ROBOT_TIME'

# 1. Start from the column after expansion/concatenation, converted to strings
#    (each value looks like "[13867.472]").
clean_time_strings = clean_sensor_data[time_col_name].astype(str)

# 2. Strip everything except digits, the decimal point, and the minus sign.
#    This removes brackets, quotes, and spaces around the number.
#    Caveat: it would also drop the "e" of scientific notation (e.g. "1e-05").
clean_time_strings = (
    clean_time_strings
    .str.strip()
    .str.replace(r'[^\d.\-]', '', regex=True)
)

# 3. Convert to numeric; values that still fail to parse become NaN.
fixed_time_data = pd.to_numeric(clean_time_strings, errors='coerce')

# 4. Overwrite the column in the final expanded DataFrame.
clean_sensor_data[time_col_name] = fixed_time_data

# --- Final check ---
time_non_null = clean_sensor_data[time_col_name].notna().sum()
print(f"Non-null count for ROBOT_TIME: {time_non_null} out of {len(clean_sensor_data)}")
print("\nFinal Data Types after Time Fix:")
clean_sensor_data.info(memory_usage='deep')

Non-null count for ROBOT_TIME: 153658 out of 153658

Final Data Types after Time Fix:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153658 entries, 0 to 153657
Data columns (total 73 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   ROBOT_TARGET_JOINT_POSITIONS_1      153658 non-null  float64
 1   ROBOT_TARGET_JOINT_POSITIONS_2      153658 non-null  float64
 2   ROBOT_TARGET_JOINT_POSITIONS_3      153658 non-null  float64
 3   ROBOT_TARGET_JOINT_POSITIONS_4      153658 non-null  float64
 4   ROBOT_TARGET_JOINT_POSITIONS_5      153658 non-null  float64
 5   ROBOT_TARGET_JOINT_POSITIONS_6      153658 non-null  float64
 6   ROBOT_ACTUAL_JOINT_POSITIONS_1      153658 non-null  float64
 7   ROBOT_ACTUAL_JOINT_POSITIONS_2      153658 non-null  float64
 8   ROBOT_ACTUAL_JOINT_POSITIONS_3      153658 non-null  float64
 9   ROBOT_ACTUAL_JOINT_POSITIONS_4      153658 non-null  float64
 ...  (output truncated)
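
Since each `ROBOT_TIME` cell is a one-element list, an alternative to string cleaning is to take that element directly. The sketch below compares both approaches on made-up values. It is an illustration, not the method used in this notebook, and the third value is chosen only to show how the regex treats scientific notation.

```python
import pandas as pd

# Toy stand-in for ROBOT_TIME: each cell is a one-element list (values are made up).
time_lists = pd.Series([[13867.472], [13867.48], [1e-05]])

# Direct extraction: take the single element of each list, then convert.
direct = pd.to_numeric(time_lists.map(lambda v: v[0]), errors="coerce")

# String-based cleaning, mirroring the regex used in the notebook.
via_regex = pd.to_numeric(
    time_lists.astype(str).str.replace(r"[^\d.\-]", "", regex=True),
    errors="coerce",
)

print(direct.tolist())
print(via_regex.tolist())
```

Both approaches agree on ordinary decimal values such as the timestamps in this dataset. They differ only when a value is printed in scientific notation, where the regex removes the exponent marker and the conversion fails.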

## 5. 💾 Save Processed Data

The expanded and structured sensor data is saved to the `../data/processed` directory in two formats for use in downstream machine learning notebooks:

* **CSV:** For standard interoperability.
* **Parquet:** For columnar storage that preserves column dtypes and is typically smaller and faster to load than CSV.

In [9]:
# Create the directory if it doesn't exist
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

# 1. Save as Parquet (recommended for speed and size in later notebooks).
# Requires a Parquet engine such as pyarrow or fastparquet.
# Parquet loads efficiently in the following notebooks (02_data_preprocessing.ipynb, etc.).
clean_sensor_data.to_parquet(f"{processed_dir}/sensor_data_processed.parquet", index=False)

# 2. Save as CSV (for interoperability)
clean_sensor_data.to_csv(f"{processed_dir}/sensor_data_processed.csv", index=False)

print("✅ Saved processed dataset in Parquet and CSV formats.")

✅ Saved processed dataset in Parquet and CSV formats.
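
A simple sanity check after saving is to read the data back and compare it with the original. The sketch below does this for the CSV path with a toy frame and an in-memory buffer, so nothing is written to disk. The column names and values are hypothetical. The same pattern should apply to the Parquet file via `pd.read_parquet`, assuming a Parquet engine is installed.

```python
import io
import pandas as pd

# Toy stand-in for the processed frame (hypothetical names and values).
original = pd.DataFrame({
    "POS_1": [0.1, 0.2, 0.3],
    "ROBOT_TIME": [13867.472, 13867.48, 13867.488],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, float_precision="round_trip")

# Raises AssertionError if shape, column names, dtypes, or values differ.
pd.testing.assert_frame_equal(original, restored)
print(restored.dtypes)
```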


## 6. 🧾 Summary

This Exploratory Data Analysis (EDA) notebook transformed the raw, non-standard UR5 sensor data into a clean, structured format.

* **Data Merged:** All 18 raw UR5 sensor CSV files were merged into a single DataFrame.
* **Data Volume:** The resulting dataset contains **153,658 rows** of sensor readings.
* **Feature Expansion:** All list-form sensor readings (e.g., joint positions, currents) were safely parsed and **expanded into 72 individual numeric columns** (73 including `ROBOT_TIME`).
* **Dataset Ready:** The cleaned sensor data is now ready for deeper analysis, feature creation, and merging with the summary/header data in the next phase.

***

**Next Notebook → `02_feature_engineering.ipynb`**