# UR5 Manipulator Sensor Data — Notebook 05: Target Definition & Baseline Modeling

Objective: This final notebook in the EDA repository concludes the data preparation phase by:

1. Defining the Target Variable (Y): Using insights from the EDA (Notebook 04) to create a definitive, labeled ANOMALY_FLAG.

2. Exporting the combined X and Y matrix for immediate use in another repository.

3. Final Data Split: Separating the feature matrix (X) from the target vector (Y) and splitting them into standard Train/Test sets.

4. Establishing a Baseline: Training a simple model (Dummy and Logistic Regression) to set a performance benchmark for the future ML repository.

5. Final Save: Exporting the prepared, split datasets, completing the EDA project.

**Input Data:**

- Feature Set: `../data/features/feature_set.parquet` (≈153k rows, 127 columns as loaded in Step 1).

**Output:**

- Full ML-Ready Dataset: `../data/ml_ready/full_ml_ready_data.parquet` (X+Y combined).
    
- Prepared ML Files: `X_train`, `X_test`, `Y_train`, `Y_test` saved to `../data/ml_ready`.

## Step 1. Setup and Data Loading

We load the necessary libraries and the fully engineered feature set from the previous step. We also import the ML-specific libraries for splitting and modeling.


In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, f1_score, confusion_matrix, recall_score, precision_score

# Display settings
pd.set_option('display.max_columns', 20)

# Define paths
feature_set_path = "../data/features/feature_set.parquet"
output_dir = "../data/ml_ready"
os.makedirs(output_dir, exist_ok=True)

In [2]:
# Load the feature set
try:
    df = pd.read_parquet(feature_set_path)
    print(f"✔ Feature set loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
    print(f"Error: File not found at {feature_set_path}. Please run Notebook 03 first.")
    # Re-raise so the notebook stops here; every later cell depends on df.
    raise

# Drop the plotting index if it exists
df = df.drop(columns=['TIME_INDEX'], errors='ignore')
display(df.head(2))

✔ Feature set loaded successfully. Shape: (153658, 127)


Unnamed: 0,ROBOT_TIME,ROBOT_TARGET_JOINT_POSITIONS (J1),ROBOT_TARGET_JOINT_POSITIONS (J2),ROBOT_TARGET_JOINT_POSITIONS (J3),ROBOT_TARGET_JOINT_POSITIONS (J4),ROBOT_TARGET_JOINT_POSITIONS (J5),ROBOT_TARGET_JOINT_POSITIONS (J6),ROBOT_ACTUAL_JOINT_POSITIONS (J1),ROBOT_ACTUAL_JOINT_POSITIONS (J2),ROBOT_ACTUAL_JOINT_POSITIONS (J3),...,ROBOT_JOINT_CONTROL_CURRENT_J1_ROLL_MEAN_50,ROBOT_JOINT_CONTROL_CURRENT_J1_ROLL_STD_50,ROBOT_ACTUAL_JOINT_VELOCITIES_J1_ROLL_MEAN_50,ROBOT_ACTUAL_JOINT_VELOCITIES_J1_ROLL_STD_50,ROBOT_TCP_FORCE_x_ROLL_MEAN_50,ROBOT_TCP_FORCE_x_ROLL_STD_50,ROBOT_TCP_FORCE_z_ROLL_MEAN_50,ROBOT_TCP_FORCE_z_ROLL_STD_50,ERROR_JOINT_POSITIONS_J1_ROLL_MEAN_50,ERROR_JOINT_POSITIONS_J1_ROLL_STD_50
0,747.248,-26.880069,-79.911609,57.095392,-157.771764,-105.009613,-44.724779,-26.87662,-79.910908,57.096775,...,0.237228,0.008602,0.0,0.0,-26.387519,0.60378,7.837572,0.65046,0.002732,0.001374
1,747.256,-26.880069,-79.911609,57.095392,-157.771764,-105.009613,-44.724779,-26.87662,-79.910225,57.096092,...,0.237228,0.008602,0.0,0.0,-26.387519,0.60378,7.837572,0.65046,0.002732,0.001374


## Step 2. Target Definition: Creating the Anomaly Flag (Y)

Based on the temporal and distributional analysis in Notebook 04, the positional error for Joint 1 (ERROR_JOINT_POSITIONS_(J1)) provides the clearest signal of a mechanical anomaly.

We define an ANOMALY_FLAG (Y) using a conservative, data-driven threshold: the 99.99th percentile of this error (≈0.081 for this dataset), which isolates only the most severe deviations.

In [3]:
# Select the positional error feature for Joint 1
error_col = 'ERROR_JOINT_POSITIONS_(J1)'

# 1. CALCULATE DATA-DRIVEN THRESHOLD
# Use the 99.99th percentile of the error column to ensure we capture the most extreme cases.
# This makes the definition of "anomaly" relative to the dataset's own performance.
percentile_threshold = df[error_col].quantile(0.9999)

# We will use this calculated threshold
ANOMALY_THRESHOLD = percentile_threshold
print(f"Calculated 99.99th Percentile Threshold: {ANOMALY_THRESHOLD:.6f}")


# 2. Create the binary target variable (Y)
# Use the calculated threshold for labeling
df['ANOMALY_FLAG'] = np.where(df[error_col] >= ANOMALY_THRESHOLD, 1, 0)

# Drop all positional error features to prevent data leakage.
# NOTE: Dropping the column used to define the target is CRITICAL.
error_cols_to_drop = [col for col in df.columns if 'ERROR_JOINT_POSITIONS' in col]
df = df.drop(columns=error_cols_to_drop)

# Calculate class imbalance
anomaly_count = df['ANOMALY_FLAG'].sum()
total_count = len(df)
imbalance = (anomaly_count / total_count) * 100

print(f"\n--- Recalculated Target Summary ---")
print(f"Total Anomalous Records (Y=1): {anomaly_count}")
print(f"Total Records: {total_count}")
print(f"Class Imbalance: {imbalance:.4f}%")

print("\nInterpretation: The target class is now correctly defined with a minimal number of positive examples.")
print("The primary challenge for the ML phase remains achieving high Recall on this rare class.")

Calculated 99.99th Percentile Threshold: 0.081011

--- Recalculated Target Summary ---
Total Anomalous Records (Y=1): 16
Total Records: 153658
Class Imbalance: 0.0104%

Interpretation: The target class is now correctly defined with a minimal number of positive examples.
The primary challenge for the ML phase remains achieving high Recall on this rare class.
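
As a sanity check on the count above: with 153,658 rows, a 99.99th-percentile cut-off leaves about 0.01% of the rows at or above the threshold, which is roughly 15 to 16 records. The sketch below reproduces that arithmetic on synthetic data. The exponential distribution and the names `synthetic_error` and `n_rows` are illustrative assumptions, not the real sensor data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the J1 positional error column (NOT the real data).
rng = np.random.default_rng(0)
n_rows = 153_658  # row count reported when the feature set was loaded
synthetic_error = pd.Series(rng.exponential(scale=0.01, size=n_rows))

quantile = 0.9999
threshold = synthetic_error.quantile(quantile)

# Same labelling rule as the notebook: flag values at or above the threshold.
flags = (synthetic_error >= threshold).astype(int)

expected = n_rows * (1 - quantile)  # about 15.4 rows
print(f"Expected positives (n * (1 - q)): {expected:.1f}")
print(f"Flagged rows: {flags.sum()}")
print(f"Positive rate: {flags.mean() * 100:.4f}%")
```

Because the label is defined as a fixed quantile, the number of positives follows from the row count rather than from a physical fault criterion, which is worth keeping in mind when interpreting the downstream metrics.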


## Step 3. Export Combined X+Y Matrix

We save the complete, labeled dataset (X and Y combined) for convenience in the future ML repository.

In [4]:
full_path = os.path.join(output_dir, "full_ml_ready_data.parquet")
df.to_parquet(full_path, index=False)
print(f"✔ Full ML-ready dataset saved at: {full_path}")

✔ Full ML-ready dataset saved at: ../data/ml_ready/full_ml_ready_data.parquet


## Step 4. Final Data Split: Time-Aware Train/Test Sets

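Since the data is a time series, the split must be chronological (time-aware): the model is trained on earlier records and evaluated on later ones, mimicking how a model trained on past data would perform on unseen, future data. Rather than cutting at a fixed fraction, the split point is anchored on the 4th-from-last anomaly so that the test set is guaranteed to contain positive examples. In this run that places the boundary at index 77096, giving a roughly 50/50 split with 13 anomalies in the training set and 3 in the test set.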

In [5]:
# Separate features (X) and target (Y)
X = df.drop(columns=['ANOMALY_FLAG'])
Y = df['ANOMALY_FLAG']

# --- Step 1: Find the Index for a Guaranteed Split ---
# We anchor the split at the 4th-from-last anomaly so the test set contains positives.
# Because the .loc slice used for training below includes split_index, the anomaly at
# split_index itself stays in the training set; the test set receives the anomalies after it.

# 1. Get the index labels where an anomaly (Y=1) occurred.
anomaly_indices = Y[Y == 1].index

# 2. Convert the Index object to a list for reliable positional indexing
anomaly_indices_list = anomaly_indices.to_list()

# 3. Determine the split point: the index of the 4th-from-last anomaly.
if len(anomaly_indices_list) >= 4:
    # Use standard list indexing to get the element at the 4th-from-last position
    split_index = anomaly_indices_list[-4]
    print(f"Split point set at index: {split_index} to reserve 4 anomalies for testing.")
else:
    # Fallback (not expected here, since 16 anomalies were found): split at the 90% row.
    split_index = df.index[int(len(df) * 0.9)]
    print("Warning: Fewer than 4 anomalies found. Splitting at 90% index.")


# --- Step 2: Perform the Index-Based Split ---
# Use the found index to slice the DataFrames.
# X_train and Y_train include the row at the split_index.
X_train = X.loc[:split_index]
Y_train = Y.loc[:split_index]

# X_test and Y_test start immediately after the split index.
X_test = X.loc[split_index + 1:]
Y_test = Y.loc[split_index + 1:]


print("\n--- Data Split Summary (INDEX-BASED) ---")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Y_train Anomaly Count: {Y_train.sum()}")
print(f"Y_test Anomaly Count: {Y_test.sum()}")
print(f"Y_train Anomaly Rate: {Y_train.mean():.4f}")
print(f"Y_test Anomaly Rate: {Y_test.mean():.4f}")

Split point set at index: 77096 to reserve 4 anomalies for testing.

--- Data Split Summary (INDEX-BASED) ---
X_train shape: (77097, 119)
X_test shape: (76561, 119)
Y_train Anomaly Count: 13
Y_test Anomaly Count: 3
Y_train Anomaly Rate: 0.0002
Y_test Anomaly Rate: 0.0000
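
Note that the slice `X.loc[:split_index]` is inclusive, so the anomaly located at `split_index` stays in the training set; this is why the summary above shows 13 training and 3 test anomalies. If the intent is to hold out exactly k anomalies, one option is to cut just before the k-th-from-last positive. The helper below is a hypothetical sketch (the name `split_before_kth_last_positive` and the toy data are not part of this notebook) and assumes the rows are already in chronological order.

```python
import numpy as np
import pandas as pd


def split_before_kth_last_positive(X, y, k):
    """Chronological split that leaves exactly the last k positives in the test set.

    Assumes X and y share the same row order and are sorted by time.
    Hypothetical helper for illustration only.
    """
    positive_positions = np.flatnonzero(y.to_numpy() == 1)
    if k < 1 or len(positive_positions) < k:
        raise ValueError(f"Need k >= 1 and at least k positives, found {len(positive_positions)}")
    cut = int(positive_positions[-k])  # first row of the test set
    # iloc slicing is positional and end-exclusive, so row `cut` goes to the test set.
    return X.iloc[:cut], X.iloc[cut:], y.iloc[:cut], y.iloc[cut:]


# Toy data: 10 time-ordered rows with positives at positions 1, 4, 6 and 8.
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

X_tr, X_te, y_tr, y_te = split_before_kth_last_positive(X_toy, y_toy, k=2)
print(f"train rows: {len(X_tr)}, train positives: {int(y_tr.sum())}")
print(f"test rows: {len(X_te)}, test positives: {int(y_te.sum())}")
```

Using positional `iloc` slicing avoids relying on the index labels being consecutive integers, which the label-based `split_index + 1` arithmetic implicitly assumes.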


## Step 5. Baseline Modeling and Evaluation

We establish a concrete baseline to set the minimum performance bar for the future ML models. Given the extreme imbalance, we focus on Recall (catching anomalies) and F1-Score (balancing precision and recall).
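
For reference, with TP, FP and FN denoting the true positives, false positives and false negatives for the anomaly class (Y=1), the standard definitions of these metrics are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

None of these depends on the number of true negatives, which is what makes them more informative than accuracy when almost every record is normal.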

### 5.1 Dummy Classifier Baseline (The Minimum Bar)

The dummy classifier shows what happens when we simply guess the majority class every time.

In [6]:
# Dummy Classifier (Predicts the majority class: 0)
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(X_train, Y_train)
dummy_preds = dummy_model.predict(X_test)

print("--- Dummy Model Baseline (Absolute Minimum Bar) ---")
print(classification_report(Y_test, dummy_preds, zero_division=0))

--- Dummy Model Baseline (Absolute Minimum Bar) ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     76558
           1       0.00      0.00      0.00         3

    accuracy                           1.00     76561
   macro avg       0.50      0.50      0.50     76561
weighted avg       1.00      1.00      1.00     76561



**Conclusion**: The Dummy Model, which simply predicts "Normal" (Y=0) every time, fails entirely at the task of anomaly detection, as expected. Your future machine learning models must achieve an F1-Score greater than 0.00 to be considered useful.
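
The accuracy of 1.00 in the report above is a rounding of 76558/76561, which illustrates why accuracy is uninformative at this level of imbalance. The minimal sketch below uses synthetic label arrays that only reproduce the support counts of the test set; they are not the real predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels with the same support as the test set: 76558 normal, 3 anomalous.
y_true = np.concatenate([np.zeros(76_558, dtype=int), np.ones(3, dtype=int)])
y_pred = np.zeros_like(y_true)  # always predict "Normal", like the dummy model

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.5f}")  # equals 76558 / 76561
print(f"Recall (Y=1): {recall:.2f}")
print(f"F1 (Y=1): {f1:.2f}")
```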

### 5.2 Logistic Regression Baseline

Logistic Regression, a simple linear model, provides a benchmark that incorporates the actual features. We use class_weight='balanced' so that errors on the rare anomaly class are weighted more heavily during training.

In [7]:
# Logistic Regression (Simple Learning Bar)
# class_weight='balanced' is essential for imbalanced data
log_model = LogisticRegression(solver='liblinear', random_state=42, max_iter=200, class_weight='balanced')
log_model.fit(X_train, Y_train)
log_preds = log_model.predict(X_test)

print("--- Logistic Regression Baseline ---")
print(classification_report(Y_test, log_preds, zero_division=0))

# Report the key metric
baseline_f1 = f1_score(Y_test, log_preds)
baseline_recall = recall_score(Y_test, log_preds)

print(f"\nTarget Class (Y=1) Metrics:")
print(f"Baseline F1-Score: {baseline_f1:.4f}")
print(f"Baseline Recall: {baseline_recall:.4f} (Ability to catch true faults)")

--- Logistic Regression Baseline ---
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     76558
           1       0.00      0.67      0.00         3

    accuracy                           0.98     76561
   macro avg       0.50      0.82      0.50     76561
weighted avg       1.00      0.98      0.99     76561


Target Class (Y=1) Metrics:
Baseline F1-Score: 0.0029
Baseline Recall: 0.6667 (Ability to catch true faults)
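
To make the effect of `class_weight='balanced'` concrete: scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes those weights with `compute_class_weight` for a synthetic label vector that matches the training split summary from Step 4 (77,097 rows, 13 anomalies). The variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels matching the training split summary: 77097 rows, 13 anomalies.
n_train, n_anomalies = 77_097, 13
y_train_like = np.zeros(n_train, dtype=int)
y_train_like[:n_anomalies] = 1

classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train_like)
weight_normal, weight_anomaly = weights

print(f"Weight for class 0: {weight_normal:.4f}")
print(f"Weight for class 1: {weight_anomaly:.1f}")
print(f"Ratio (anomaly / normal): {weight_anomaly / weight_normal:.0f}")
```

Each anomaly therefore counts several thousand times as much as a normal record in the loss, which helps explain both the relatively high recall and the large number of false alarms.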


**Conclusion**:

The Logistic Regression model, by using class_weight='balanced', prioritized Recall (catching faults) but paid a heavy price in Precision (generating false alarms).

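The goal for your next, dedicated ML repository is now clearly defined: build a model that at least matches the baseline Recall of 0.67 while dramatically improving the Precision (and thus the F1-Score) beyond 0.0029. Keep in mind that the test set contains only 3 anomalies, so Recall can only take the values 0, 0.33, 0.67, or 1.00, and these estimates are coarse.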
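
The classification report rounds the anomaly precision to 0.00, but it can be recovered approximately from the reported F1 and recall: F1 = 2PR / (P + R) rearranges to P = F1·R / (2R − F1). The sketch below does this arithmetic. The inputs are the rounded values from the output above, and the true-positive count of 2 is inferred from a recall of 0.6667 on 3 anomalies, so the resulting false-alarm count is only an estimate.

```python
# Values taken from the Logistic Regression output (rounded in the report).
f1 = 0.0029
support = 3            # anomalies in the test set
true_positives = 2     # inferred: recall 0.6667 * support 3
recall = true_positives / support

# Rearranging F1 = 2PR / (P + R) for precision P.
precision = f1 * recall / (2 * recall - f1)

# Precision = TP / (TP + FP), so predicted positives = TP / precision.
predicted_positives = true_positives / precision
false_positives = predicted_positives - true_positives

print(f"Implied precision: {precision:.5f}")
print(f"Implied false alarms: about {false_positives:.0f}")
```

On the order of 1,400 false alarms for 2 detected faults is the precision gap that the next modeling stage needs to close.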

## Step 6. Final Save: Exporting Split Datasets

The final step is to save the four split files, completing the Manipulator Health Monitoring EDA project.

In [8]:
# --- Save Split Datasets ---
ml_ready_dir = output_dir  # same "../data/ml_ready" directory created in Step 1

# Save X (Features)
X_train.to_parquet(os.path.join(ml_ready_dir, "X_train.parquet"), index=False)
X_test.to_parquet(os.path.join(ml_ready_dir, "X_test.parquet"), index=False)

# Save Y (Target)
Y_train.to_frame(name='ANOMALY_FLAG').to_parquet(os.path.join(ml_ready_dir, "Y_train.parquet"), index=False)
Y_test.to_frame(name='ANOMALY_FLAG').to_parquet(os.path.join(ml_ready_dir, "Y_test.parquet"), index=False)


print("\n✔ Final ML-Ready datasets (X_train, X_test, Y_train, Y_test) saved successfully.")
print(f"All files exported to: {ml_ready_dir}")


✔ Final ML-Ready datasets (X_train, X_test, Y_train, Y_test) saved successfully.
All files exported to: ../data/ml_ready
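
Because the files are written with `index=False`, the original row index is discarded and the pairing of each `X` file with its `Y` file relies on row order alone. A small pre-export check along the following lines could guard against misalignment. The function name `check_chronological_split` is hypothetical, and the time check assumes `ROBOT_TIME` increases monotonically over the recording, which this notebook does not verify.

```python
import pandas as pd


def check_chronological_split(X_train, X_test, Y_train, Y_test, time_col="ROBOT_TIME"):
    """Hypothetical sanity checks to run before exporting a chronological split."""
    if len(X_train) != len(Y_train) or len(X_test) != len(Y_test):
        raise ValueError("Features and labels differ in length")
    if time_col in X_train.columns and len(X_train) and len(X_test):
        # Assumes the time column increases over the whole recording.
        if X_train[time_col].max() > X_test[time_col].min():
            raise ValueError("Training rows overlap the test period")
    return {
        "train_rows": len(X_train),
        "test_rows": len(X_test),
        "train_positives": int(Y_train.sum()),
        "test_positives": int(Y_test.sum()),
    }


# Toy example: a hypothetical 6-row recording split after the fourth row.
X_toy = pd.DataFrame({"ROBOT_TIME": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], "feature": range(6)})
Y_toy = pd.Series([0, 1, 0, 0, 0, 1], name="ANOMALY_FLAG")

summary = check_chronological_split(
    X_toy.iloc[:4], X_toy.iloc[4:], Y_toy.iloc[:4], Y_toy.iloc[4:]
)
print(summary)
```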


# Conclusion of the EDA Project Repository

The Manipulator Health Monitoring EDA repository is now complete. We successfully executed the end-to-end data pipeline:

1. **01_data_exploration**: Defined structure and parsed non-standard files.

2. **02_data_cleaning**: Cleaned strings, imputed NaN values, and converted all data to float64.

3. **03_feature_engineering**: Created advanced time-series features (Lagging, Rolling Stats) and domain-specific error metrics.

4. **04_deep_eda**: Validated features, identified severe class imbalance, and isolated fault events.

5. **05_target_definition_and_baseline_modeling**: Defined the ANOMALY_FLAG target and established a measurable baseline F1-Score for future models.

The project is now ready to transition to the dedicated ML Modeling Repository, where more advanced techniques (e.g., XGBoost, LSTM, sampling methods, and hyperparameter tuning) will be employed to surpass the established baseline performance.

---