# ErgoPose Risk Classifier — Data Preparation

This notebook is the **first stage** of the *ErgoPose Risk Classifier* project.  
Its goal is to **load, inspect, clean, and preprocess** the dataset used to train the neural network that classifies ergonomic risk levels based on body pose angles.

### Objectives
- Load the raw dataset downloaded from [Zenodo](https://zenodo.org/records/14230872).
- Inspect the structure and main statistics of the data.
- Handle missing values and inconsistent entries.
- Normalize or standardize numerical features.
- Save the cleaned dataset to the `data/processed/` directory for the next steps.

### Input and Output
- **Input:** `data/raw/postural_risk_dataset.csv`  
- **Output:** `data/processed/clean_postural_risk_dataset.csv`

In [61]:
"""
Imports required libraries for data handling, analysis, and preprocessing.
"""

# [1] Imports
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import StandardScaler

In [62]:
"""
Defines paths for raw and processed data directories.
"""

# [2] Paths configuration
RAW_DATA_PATH = Path("../data/raw")
PROCESSED_DATA_PATH = Path("../data/processed")

# parents=True also creates "../data" when it does not exist yet
RAW_DATA_PATH.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)

RAW_DATA_FILE = RAW_DATA_PATH / "postural_risk_dataset.csv"
print(f"✅ Raw dataset path set to: {RAW_DATA_FILE}")


✅ Raw dataset path set to: ../data/raw/postural_risk_dataset.csv


In [63]:
"""
Loads the raw dataset from the specified path.
"""

# [3] Load dataset
try:
    df = pd.read_csv(RAW_DATA_FILE)
    print("✅ Dataset loaded successfully!")
except FileNotFoundError:
    # Every later cell depends on `df`, so report the problem and stop here
    print("Dataset not found! Please make sure the file exists in 'data/raw/'.")
    raise

df.head()

✅ Dataset loaded successfully!


| | subject | upperbody_label | lowerbody_label | nose_x | nose_y | nose_z | left_eye_inner_x | left_eye_inner_y | left_eye_inner_z | left_eye_x | ... | left_heel_z | right_heel_x | right_heel_y | right_heel_z | left_foot_index_x | left_foot_index_y | left_foot_index_z | right_foot_index_x | right_foot_index_y | right_foot_index_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | TLB | LCL | 0.013146 | -0.534424 | -0.176213 | 0.019678 | -0.557297 | -0.160404 | 0.021779 | ... | -0.313284 | -0.074499 | 0.623198 | -0.041134 | -0.362046 | 0.284611 | -0.416112 | -0.083041 | 0.690973 | -0.153205 |
| 1 | 1 | TLB | LCL | -0.027462 | -0.499347 | -0.235089 | -0.013835 | -0.522232 | -0.217069 | -0.012026 | ... | -0.339221 | 0.018673 | 0.683186 | -0.037659 | -0.247262 | 0.34029 | -0.44457 | 0.034586 | 0.751673 | -0.148952 |
| 2 | 1 | TLB | LCL | -0.017639 | -0.542063 | -0.223344 | -0.00043 | -0.562522 | -0.20206 | 0.001764 | ... | -0.243208 | 0.049054 | 0.677385 | -0.03676 | -0.249142 | 0.455043 | -0.314104 | 0.051902 | 0.745649 | -0.125802 |
| 3 | 1 | TLB | LCL | -0.02763 | -0.556502 | -0.149826 | -0.007174 | -0.575659 | -0.129025 | -0.005589 | ... | -0.306242 | 0.06511 | 0.672955 | 0.004685 | -0.261612 | 0.440069 | -0.395166 | 0.057489 | 0.747611 | -0.082085 |
| 4 | 1 | TLB | LCL | -0.033802 | -0.556527 | -0.174968 | -0.012644 | -0.57593 | -0.154702 | -0.011082 | ... | -0.310197 | 0.063744 | 0.668368 | 0.012122 | -0.26535 | 0.453247 | -0.398636 | 0.061104 | 0.747678 | -0.075673 |
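
Before any columns are dropped, it can be worth confirming that the coordinate columns form complete x/y/z triplets (99 float columns would correspond to 33 landmarks). The sketch below shows one way to do that; `landmark_names` is a hypothetical helper, not part of the notebook, and the demo frame uses placeholder coordinates rather than real data.

```python
import pandas as pd


def landmark_names(columns):
    """Return landmark base names that have all of the _x, _y and _z columns."""
    cols = set(columns)
    bases = {c[:-2] for c in cols if c.endswith(("_x", "_y", "_z"))}
    return sorted(b for b in bases if {f"{b}_x", f"{b}_y", f"{b}_z"} <= cols)


# Tiny illustrative frame; coordinate values are placeholders.
demo = pd.DataFrame({
    "subject": [1], "upperbody_label": ["TLB"], "lowerbody_label": ["LCL"],
    "nose_x": [0.0], "nose_y": [0.0], "nose_z": [0.0],
    "left_heel_x": [0.0], "left_heel_y": [0.0],  # _z deliberately missing
})
complete = landmark_names(demo.columns)
print(complete)
```

On the real frame, `len(landmark_names(df.columns))` could be compared against the expected landmark count as a quick sanity check.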


In [64]:
"""
Displays basic information and statistics about the dataset.
"""

# [4] Data overview
print("Dataset Info:")
df.info()  # prints the report itself and returns None, so it is not wrapped in print()

print("\nSummary Statistics:")
display(df.describe().T)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4794 entries, 0 to 4793
Columns: 102 entries, subject to right_foot_index_z
dtypes: float64(99), int64(1), object(2)
memory usage: 3.7+ MB

Summary Statistics:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| subject | 4794.0 | 6.051314 | 3.688287 | 1.000000 | 3.000000 | 6.000000 | 9.000000 | 13.000000 |
| nose_x | 4794.0 | 0.005079 | 0.119513 | -0.391735 | -0.016859 | 0.003359 | 0.029534 | 0.449911 |
| nose_y | 4794.0 | -0.452200 | 0.119104 | -0.620617 | -0.519387 | -0.486973 | -0.441859 | 0.223784 |
| nose_z | 4794.0 | -0.343462 | 0.150283 | -0.810915 | -0.427662 | -0.340562 | -0.247549 | 0.084132 |
| left_eye_inner_x | 4794.0 | 0.019214 | 0.120859 | -0.389286 | -0.002378 | 0.018068 | 0.043223 | 0.469878 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| left_foot_index_y | 4794.0 | 0.627462 | 0.162553 | -0.125248 | 0.606920 | 0.669801 | 0.719547 | 0.865604 |
| left_foot_index_z | 4794.0 | -0.148158 | 0.148310 | -0.481448 | -0.257432 | -0.135818 | -0.038823 | 0.412060 |
| right_foot_index_x | 4794.0 | -0.137013 | 0.237763 | -0.736322 | -0.323197 | -0.166831 | 0.001375 | 0.393840 |
| right_foot_index_y | 4794.0 | 0.586791 | 0.139372 | -0.101390 | 0.557937 | 0.618563 | 0.667271 | 0.816386 |


In [65]:
"""
Checks for missing or null values in the dataset.
"""

# [5] Missing values analysis
missing_values = df.isnull().sum()
print("Missing Values per Column:")
display(missing_values[missing_values > 0])

# Optional: handle missing values if present (none were found here, so this is a no-op)
df = df.dropna()  # or df.fillna(df.mean(numeric_only=True)) to impute numeric columns instead
print(f"✅ Dataset shape after handling missing values: {df.shape}")


Missing Values per Column:


Series([], dtype: int64)

✅ Dataset shape after handling missing values: (4794, 102)
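
This dataset has no missing values, so dropping rows changes nothing. If a later version of the data did contain gaps, imputing numeric columns while leaving the label columns untouched is one alternative to dropping rows. A minimal sketch, assuming median imputation is acceptable for the coordinates; `impute_numeric_median` is a hypothetical helper and the demo values are made up.

```python
import numpy as np
import pandas as pd


def impute_numeric_median(frame):
    """Return a copy with NaNs in numeric columns replaced by the column median."""
    out = frame.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out


demo = pd.DataFrame({
    "nose_x": [0.1, np.nan, 0.3],
    "upperbody_label": ["TLB", "TLB", None],  # labels are not imputed
})
filled = impute_numeric_median(demo)
print(filled)
```

Rows with a missing label would still need a separate decision, since a guessed class label is usually worse than a dropped row.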


In [66]:
"""
Removes subject identifiers and unnecessary metadata to prevent model bias.
"""

# [6] Remove participant identifiers
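# Match whole names (or an "_id" suffix); a bare substring test on "id" could also hit feature names that merely contain those letters
columns_to_drop = [col for col in df.columns if col.lower() in {"subject", "id"} or col.lower().endswith("_id")]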
df.drop(columns=columns_to_drop, inplace=True, errors="ignore")
print(f"✅ Removed columns: {columns_to_drop if columns_to_drop else 'None found'}")

✅ Removed columns: ['subject']
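
Selecting identifier columns by exact name is safer than selecting by substring, because a two-letter fragment such as `id` can appear inside unrelated feature names. The sketch below contrasts the two approaches; `identifier_columns` is a hypothetical helper, and `session_id` and `left_middle_x` are invented column names used only to show the pitfall.

```python
def identifier_columns(columns, names=("subject", "id")):
    """Select columns whose full name is an identifier or ends with '_id'."""
    wanted = {n.lower() for n in names}
    return [c for c in columns if c.lower() in wanted or c.lower().endswith("_id")]


# Invented column names; only "subject" and "nose_x" exist in the real dataset.
cols = ["subject", "session_id", "left_middle_x", "nose_x"]

substring_hits = [c for c in cols if "subject" in c.lower() or "id" in c.lower()]
exact_hits = identifier_columns(cols)

print(substring_hits)  # also catches "left_middle_x" because "middle" contains "id"
print(exact_hits)
```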


In [67]:
"""
Removes all 'Z' coordinate columns to simulate 2D-only data (X, Y only).
"""

# [7] Remove Z-coordinates
z_columns = [col for col in df.columns if col.endswith("_z")]
df.drop(columns=z_columns, inplace=True, errors="ignore")
print(f"✅ Removed {len(z_columns)} Z-coordinate columns.")

✅ Removed 33 Z-coordinate columns.
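
The counts are consistent: 102 original columns, minus 1 identifier and 33 Z columns, leaves 68, which is 2 label columns plus 66 X/Y coordinates. A non-mutating variant of this step makes it easy to compare the 2D and 3D versions side by side. This is only a sketch; `drop_depth` is a hypothetical helper and the demo values are placeholders.

```python
import pandas as pd


def drop_depth(frame, suffix="_z"):
    """Return a new frame without columns ending in `suffix`, plus the dropped names."""
    dropped = [c for c in frame.columns if c.endswith(suffix)]
    return frame.drop(columns=dropped), dropped


demo = pd.DataFrame({
    "nose_x": [0.1], "nose_y": [0.2], "nose_z": [0.3],
    "upperbody_label": ["TLB"],
})
flat, dropped = drop_depth(demo)
print(list(flat.columns), dropped)
```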


In [68]:
"""
Computes a per-frame Quality Index: the row-wise variance of the pose features, min-max scaled to [0, 1].
"""

# [8] Quality Index Calculation
angle_columns = [col for col in df.columns if "angle" in col.lower()]

# Fallback: use X and Y coordinates when no angle columns are found
if not angle_columns:
    candidate_cols = [col for col in df.columns if col.endswith("_x") or col.endswith("_y")]
else:
    candidate_cols = angle_columns

# Compute variance across coordinates per frame (row-wise)
df["quality_index"] = df[candidate_cols].var(axis=1, skipna=True)

# Normalize the Quality Index between 0–1 for uniform scale
df["quality_index"] = (
    df["quality_index"] - df["quality_index"].min()
) / (df["quality_index"].max() - df["quality_index"].min())

print(f"✅ Quality Index computed using {len(candidate_cols)} columns.")
display(df[["quality_index"]].head())


✅ Quality Index computed using 66 columns.


| | quality_index |
|---|---|
| 0 | 0.34692 |
| 1 | 0.428067 |
| 2 | 0.501069 |
| 3 | 0.516977 |
| 4 | 0.519463 |
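
For each row, the index is the sample variance of the selected columns (pandas uses `ddof=1` by default), rescaled as `(q - q.min()) / (q.max() - q.min())`. That rescaling divides by zero when every row has the same variance. The sketch below wraps the same computation with a guard for that case; `quality_index` is a hypothetical function name and the demo values are chosen so the result is easy to check by hand.

```python
import pandas as pd


def quality_index(frame, columns):
    """Row-wise sample variance of `columns`, min-max scaled to [0, 1]."""
    raw = frame[columns].var(axis=1, skipna=True)  # pandas default: ddof=1
    span = raw.max() - raw.min()
    if span == 0:
        # All rows share the same variance; scaling would divide by zero.
        return pd.Series(0.0, index=frame.index)
    return (raw - raw.min()) / span


demo = pd.DataFrame({
    "a_x": [0.0, 0.0, 0.0],
    "a_y": [0.0, 1.0, 2.0],
})
q = quality_index(demo, ["a_x", "a_y"])
print(q.tolist())  # row variances 0, 0.5 and 2 scale to 0, 0.25 and 1
```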


In [69]:
"""
Standardizes numerical features (zero mean, unit variance), skipping columns with invalid values.
"""

# [9] Safe standardization
scaler = StandardScaler()

numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns

# Scale only valid columns: all values finite (no NaN or inf) and more than one distinct value
valid_cols = [c for c in numeric_cols if np.isfinite(df[c]).all() and df[c].nunique() > 1]

df[valid_cols] = scaler.fit_transform(df[valid_cols])

print(f"✅ Numeric columns normalized successfully ({len(valid_cols)} columns).")


✅ Numeric columns normalized successfully (67 columns).
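
`StandardScaler` replaces each value with `(x - mean) / std`, using the population standard deviation (`ddof=0`). Here the statistics come from the whole dataset. If the data are later split into training and test sets, one common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below does this with plain pandas under that assumption; the function names and demo values are hypothetical.

```python
import pandas as pd


def fit_standardizer(train, columns):
    """Compute per-column mean and population std (ddof=0) on the training rows."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0).replace(0, 1.0)  # leave constant columns unscaled
    return mean, std


def apply_standardizer(frame, columns, mean, std):
    """Return a copy of `frame` with `columns` shifted and scaled by the fitted stats."""
    out = frame.copy()
    out[columns] = (out[columns] - mean) / std
    return out


demo = pd.DataFrame({"nose_x": [1.0, 2.0, 3.0, 10.0]})
train, test = demo.iloc[:3], demo.iloc[3:]

mean, std = fit_standardizer(train, ["nose_x"])
train_z = apply_standardizer(train, ["nose_x"], mean, std)
test_z = apply_standardizer(test, ["nose_x"], mean, std)
print(train_z["nose_x"].tolist(), test_z["nose_x"].tolist())
```

The test value is scaled with the training mean and deviation, so it does not have to fall inside the range seen during fitting.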


In [70]:
"""
Saves the cleaned and processed dataset to the 'data/processed' directory.
"""

# [10] Save processed dataset
output_file = PROCESSED_DATA_PATH / "clean_postural_risk_dataset.csv"
df.to_csv(output_file, index=False)
print(f"✅ Clean dataset saved to: {output_file}")

✅ Clean dataset saved to: ../data/processed/clean_postural_risk_dataset.csv
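
Since the next notebook depends on this file, a quick round-trip check after saving can catch a truncated or mis-written CSV early. A minimal sketch, using a temporary directory so it does not touch the project folders; `save_and_verify` is a hypothetical helper and the demo frame is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(frame, path):
    """Write `frame` to CSV and confirm it reads back with the same shape and columns."""
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)
    return restored.shape == frame.shape and list(restored.columns) == list(frame.columns)


demo = pd.DataFrame({"nose_x": [0.1, 0.2], "upperbody_label": ["TLB", "TLB"]})
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(demo, Path(tmp) / "processed" / "demo.csv")
print(ok)
```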


## Summary

This notebook successfully:
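- Loaded and inspected the raw dataset (4,794 rows, 102 columns, no missing values).
- Removed unnecessary columns (the `subject` identifier and the 33 Z-coordinate columns).
- Calculated a per-frame *Quality Index*, the row-wise variance of the 66 X/Y coordinates scaled to [0, 1], intended as a proxy for posture stability.
- Standardized the 67 numeric columns (the coordinates plus the Quality Index) with `StandardScaler`.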
- Saved the cleaned dataset to `data/processed/`.

➡️ Next Notebook: **02_exploratory_analysis.ipynb**