# ErgoPose Risk Classifier: Data Preparation

This notebook is the **first stage** of the *ErgoPose Risk Classifier* project.  
Its goal is to **load, inspect, clean, and preprocess** the dataset used to train the neural network that classifies ergonomic risk levels based on body pose angles.

### Objectives
- Load the raw dataset downloaded from [Zenodo](https://zenodo.org/records/14230872).
- Inspect the structure and main statistics of the data.
- Handle missing values and inconsistent entries.
- Normalize or standardize numerical features.
- Save the cleaned dataset to the `data/processed/` directory for the next steps.
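
The standardization objective can be met with a z-score transform. The following is a minimal sketch on a toy frame, not the project's actual scaling step: the `toy` data and the rule of picking feature columns by their `_x`/`_y`/`_z` suffix are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the dataset: one id column, one label, two coordinates
toy = pd.DataFrame({
    "subject": [1, 1, 2, 2],
    "upperbody_label": ["TUP", "TLF", "TUP", "TLB"],
    "nose_x": [0.01, -0.03, 0.02, 0.04],
    "nose_y": [-0.53, -0.50, -0.54, -0.47],
})

# Identifier and label columns are not features, so they are left untouched
feature_cols = [c for c in toy.columns if c.endswith(("_x", "_y", "_z"))]

# In a real pipeline these statistics should come from the training split only
mean = toy[feature_cols].mean()
std = toy[feature_cols].std(ddof=0)

scaled = toy.copy()
scaled[feature_cols] = (toy[feature_cols] - mean) / std
```

After the transform each feature column has mean 0 and unit variance, while `subject` and `upperbody_label` are unchanged. A constant column would have a standard deviation of 0 and needs separate handling.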

### Input and Output
- **Input:** `data/raw/postural_risk_dataset.csv`
- **Output:** `data/processed/clean_postural_risk_dataset.csv`

In [22]:
"""
Import the libraries required for data handling, analysis, and preprocessing.
"""

# [1] Imports
import pandas as pd
import numpy as np
from pathlib import Path

In [23]:
"""
Define paths for raw and processed data directories.
"""

# [2] Paths configuration
RAW_DATA_PATH = Path("../data/raw")
PROCESSED_DATA_PATH = Path("../data/processed")

# parents=True also creates ../data if it does not exist yet
RAW_DATA_PATH.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)

RAW_DATA_FILE = RAW_DATA_PATH / "postural_risk_dataset.csv"
print(f"✅ Raw dataset path set to: {RAW_DATA_FILE}")


✅ Raw dataset path set to: ../data/raw/postural_risk_dataset.csv
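
Creating the folders does not guarantee that the raw CSV is present. A minimal sketch of a pre-load check, run here in a temporary directory: the helper names `prepare_data_dirs` and `require_file` are hypothetical and not part of the project.

```python
import tempfile
from pathlib import Path


def prepare_data_dirs(base: Path) -> tuple[Path, Path]:
    """Create data/raw and data/processed under base, including parents."""
    raw = base / "data" / "raw"
    processed = base / "data" / "processed"
    for folder in (raw, processed):
        # parents=True creates intermediate folders such as data/ as well
        folder.mkdir(parents=True, exist_ok=True)
    return raw, processed


def require_file(path: Path) -> Path:
    """Return path unchanged, or raise a readable error if it is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    raw_dir, processed_dir = prepare_data_dirs(Path(tmp))
    dirs_created = raw_dir.is_dir() and processed_dir.is_dir()
    try:
        require_file(raw_dir / "postural_risk_dataset.csv")
        missing_file_detected = False
    except FileNotFoundError:
        # The temporary folder is empty, so the check is expected to fail here
        missing_file_detected = True
```

Failing early with the expected path in the message is easier to debug than a traceback from inside `pd.read_csv`.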


In [24]:
"""
Load the raw dataset from the 'data/raw' folder.
"""

# [3] Load dataset
df = pd.read_csv(RAW_DATA_FILE)

# [INFO] Display basic info
print(f"✅ Dataset loaded successfully: {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()


✅ Dataset loaded successfully: 4794 rows and 102 columns.


Unnamed: 0,subject,upperbody_label,lowerbody_label,nose_x,nose_y,nose_z,left_eye_inner_x,left_eye_inner_y,left_eye_inner_z,left_eye_x,...,left_heel_z,right_heel_x,right_heel_y,right_heel_z,left_foot_index_x,left_foot_index_y,left_foot_index_z,right_foot_index_x,right_foot_index_y,right_foot_index_z
0,1,TLB,LCL,0.013146,-0.534424,-0.176213,0.019678,-0.557297,-0.160404,0.021779,...,-0.313284,-0.074499,0.623198,-0.041134,-0.362046,0.284611,-0.416112,-0.083041,0.690973,-0.153205
1,1,TLB,LCL,-0.027462,-0.499347,-0.235089,-0.013835,-0.522232,-0.217069,-0.012026,...,-0.339221,0.018673,0.683186,-0.037659,-0.247262,0.34029,-0.44457,0.034586,0.751673,-0.148952
2,1,TLB,LCL,-0.017639,-0.542063,-0.223344,-0.00043,-0.562522,-0.20206,0.001764,...,-0.243208,0.049054,0.677385,-0.03676,-0.249142,0.455043,-0.314104,0.051902,0.745649,-0.125802
3,1,TLB,LCL,-0.02763,-0.556502,-0.149826,-0.007174,-0.575659,-0.129025,-0.005589,...,-0.306242,0.06511,0.672955,0.004685,-0.261612,0.440069,-0.395166,0.057489,0.747611,-0.082085
4,1,TLB,LCL,-0.033802,-0.556527,-0.174968,-0.012644,-0.57593,-0.154702,-0.011082,...,-0.310197,0.063744,0.668368,0.012122,-0.26535,0.453247,-0.398636,0.061104,0.747678,-0.075673


In [25]:
"""
Inspect data types, column names, and general structure of the dataset.
"""

# [4] General structure
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4794 entries, 0 to 4793
Columns: 102 entries, subject to right_foot_index_z
dtypes: float64(99), int64(1), object(2)
memory usage: 3.7+ MB
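
The summary reports 99 float columns. If all of them are coordinates, that is 33 landmarks with three axes each. A minimal sketch of how that structure could be checked, using a toy frame with a handful of column names copied from the dataset; treating `subject` and the two label columns as the only non-coordinate columns is an assumption.

```python
import pandas as pd

# Empty toy frame: only the column names matter for this check
toy = pd.DataFrame(columns=[
    "subject", "upperbody_label", "lowerbody_label",
    "nose_x", "nose_y", "nose_z",
    "left_heel_x", "left_heel_y", "left_heel_z",
])

meta_cols = ["subject", "upperbody_label", "lowerbody_label"]
coord_cols = [c for c in toy.columns if c not in meta_cols]

# Group the axis suffixes by landmark name, e.g. "left_heel" -> {"x", "y", "z"}
landmarks = {}
for col in coord_cols:
    name, axis = col.rsplit("_", 1)
    landmarks.setdefault(name, set()).add(axis)

# Any landmark without a complete x/y/z triple points to a malformed file
incomplete = {name: axes for name, axes in landmarks.items() if axes != {"x", "y", "z"}}
```

On the full dataset the same loop would run over `df.columns`, and `len(landmarks)` could then be compared with the expected landmark count.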


In [26]:
"""
Display basic statistics for numerical columns to understand data distribution.
"""

# [5] Descriptive statistics
df.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subject,4794.0,6.051314,3.688287,1.000000,3.000000,6.000000,9.000000,13.000000
nose_x,4794.0,0.005079,0.119513,-0.391735,-0.016859,0.003359,0.029534,0.449911
nose_y,4794.0,-0.452200,0.119104,-0.620617,-0.519387,-0.486973,-0.441859,0.223784
nose_z,4794.0,-0.343462,0.150283,-0.810915,-0.427662,-0.340562,-0.247549,0.084132
left_eye_inner_x,4794.0,0.019214,0.120859,-0.389286,-0.002378,0.018068,0.043223,0.469878
...,...,...,...,...,...,...,...,...
left_foot_index_y,4794.0,0.627462,0.162553,-0.125248,0.606920,0.669801,0.719547,0.865604
left_foot_index_z,4794.0,-0.148158,0.148310,-0.481448,-0.257432,-0.135818,-0.038823,0.412060
right_foot_index_x,4794.0,-0.137013,0.237763,-0.736322,-0.323197,-0.166831,0.001375,0.393840
right_foot_index_y,4794.0,0.586791,0.139372,-0.101390,0.557937,0.618563,0.667271,0.816386


In [27]:
"""
Check for missing values across all columns.
"""

# [6] Missing value inspection
missing = df.isnull().sum()
missing = missing[missing > 0]

if not missing.empty:
    print("⚠️ Columns with missing values:")
    print(missing)
else:
    print("✅ No missing values found.")


✅ No missing values found.
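
No handling is needed for this file, but a future version of the dataset could contain gaps. A minimal sketch of one possible policy on a toy frame: drop rows without a label and fill numeric gaps with the column median. Both the policy and the `toy` data are assumptions, not choices documented by the project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "upperbody_label": ["TUP", "TLF", None, "TLB"],
    "nose_x": [0.01, np.nan, 0.03, -0.02],
    "nose_y": [-0.50, -0.48, -0.52, np.nan],
})

# Rows without a label cannot be used for supervised training, so drop them
cleaned = toy.dropna(subset=["upperbody_label"]).copy()

# Fill the remaining numeric gaps with each column's median
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

remaining_missing = int(cleaned.isnull().sum().sum())
```

Median imputation is only one option. For pose coordinates recorded as a sequence, interpolating between neighbouring frames may be more appropriate.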


In [28]:
"""
Check for duplicated rows that may need to be removed.
"""

# [7] Duplicate detection
duplicates = df.duplicated().sum()
print(f"🔁 Found {duplicates} duplicated rows.")

if duplicates > 0:
    df = df.drop_duplicates().reset_index(drop=True)
    print("✅ Duplicated rows removed.")


🔁 Found 0 duplicated rows.


In [29]:
"""
Convert all column names to lowercase and replace spaces with underscores.
This improves consistency and prevents issues in later processing.
"""

# [8] Normalize column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("✅ Column names standardized.")
df.columns.tolist()


✅ Column names standardized.


['subject',
 'upperbody_label',
 'lowerbody_label',
 'nose_x',
 'nose_y',
 'nose_z',
 'left_eye_inner_x',
 'left_eye_inner_y',
 'left_eye_inner_z',
 'left_eye_x',
 'left_eye_y',
 'left_eye_z',
 'left_eye_outer_x',
 'left_eye_outer_y',
 'left_eye_outer_z',
 'right_eye_inner_x',
 'right_eye_inner_y',
 'right_eye_inner_z',
 'right_eye_x',
 'right_eye_y',
 'right_eye_z',
 'right_eye_outer_x',
 'right_eye_outer_y',
 'right_eye_outer_z',
 'left_ear_x',
 'left_ear_y',
 'left_ear_z',
 'right_ear_x',
 'right_ear_y',
 'right_ear_z',
 'mouth_left_x',
 'mouth_left_y',
 'mouth_left_z',
 'mouth_right_x',
 'mouth_right_y',
 'mouth_right_z',
 'left_shoulder_x',
 'left_shoulder_y',
 'left_shoulder_z',
 'right_shoulder_x',
 'right_shoulder_y',
 'right_shoulder_z',
 'left_elbow_x',
 'left_elbow_y',
 'left_elbow_z',
 'right_elbow_x',
 'right_elbow_y',
 'right_elbow_z',
 'left_wrist_x',
 'left_wrist_y',
 'left_wrist_z',
 'right_wrist_x',
 'right_wrist_y',
 'right_wrist_z',
 'left_pinky_x',
 'left_pinky_y',

In [30]:
"""
Quick numeric check for extreme values, comparing the 1st and 99th percentiles with the min and max.
For early understanding only; nothing is removed yet.
"""

# [9] Quick check
df.describe(percentiles=[0.01, 0.99]).T


Unnamed: 0,count,mean,std,min,1%,50%,99%,max
subject,4794.0,6.051314,3.688287,1.000000,1.000000,6.000000,13.000000,13.000000
nose_x,4794.0,0.005079,0.119513,-0.391735,-0.367745,0.003359,0.401608,0.449911
nose_y,4794.0,-0.452200,0.119104,-0.620617,-0.579567,-0.486973,0.018973,0.223784
nose_z,4794.0,-0.343462,0.150283,-0.810915,-0.703309,-0.340562,-0.007860,0.084132
left_eye_inner_x,4794.0,0.019214,0.120859,-0.389286,-0.371983,0.018068,0.421796,0.469878
...,...,...,...,...,...,...,...,...
left_foot_index_y,4794.0,0.627462,0.162553,-0.125248,0.086763,0.669801,0.839000,0.865604
left_foot_index_z,4794.0,-0.148158,0.148310,-0.481448,-0.450570,-0.135818,0.188150,0.412060
right_foot_index_x,4794.0,-0.137013,0.237763,-0.736322,-0.600739,-0.166831,0.366210,0.393840
right_foot_index_y,4794.0,0.586791,0.139372,-0.101390,0.104940,0.618563,0.780302,0.816386
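
The percentile table can be turned into per-row flags without dropping anything. A minimal sketch on synthetic data: the random `toy` frame and the planted value of `5.0` are assumptions for illustration, and the 1%/99% band mirrors the percentiles used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "nose_x": rng.normal(0.0, 0.1, size=500),
    "nose_y": rng.normal(-0.45, 0.1, size=500),
})
toy.loc[0, "nose_x"] = 5.0  # planted extreme value

# Per-column bounds, indexed by column name
lower = toy.quantile(0.01)
upper = toy.quantile(0.99)

# Boolean frame: True where a value lies outside its column's [1%, 99%] band
outside = toy.lt(lower, axis="columns") | toy.gt(upper, axis="columns")

flag_counts = outside.sum()          # number of flagged values per column
flagged_rows = outside.any(axis=1)   # rows with at least one flagged value
```

By construction roughly 2% of the values in each column fall outside the band, so the flags mark candidates for inspection rather than confirmed errors.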


In [31]:
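"""
Select the target variable (upperbody_label) and encode its class labels as integers.
"""

# [10] Target selection
target_col = "upperbody_label"

# Count the classes before encoding so the summary keeps the readable label names
label_counts = df[target_col].value_counts()

# [Encode categorical labels]
# The raw labels are class codes (TLF, TUP, TLB, TLL, TLR), so a low/medium/high
# mapping would match nothing; build the mapping from the labels actually present.
mapping = {label: idx for idx, label in enumerate(sorted(df[target_col].unique()))}
df[target_col] = df[target_col].map(mapping)

print(f"🎯 Target column selected: '{target_col}'")
print(f"✅ Encoded labels: {mapping}")
label_counts


🎯 Target column selected: 'upperbody_label'
✅ Encoded labels: {'TLB': 0, 'TLF': 1, 'TLL': 2, 'TLR': 3, 'TUP': 4}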


upperbody_label
TLF    1897
TUP    1615
TLB     442
TLL     420
TLR     420
Name: count, dtype: int64
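
`Series.replace` leaves values untouched when none of the mapping keys occur in the column, so a mismatched mapping can pass silently. `Series.map` returns NaN for unmapped values instead, which makes the mismatch easy to detect. A minimal sketch, assuming a hypothetical helper `encode_labels` and a toy series containing the five label codes listed above:

```python
import pandas as pd

labels = pd.Series(["TLF", "TUP", "TLB", "TLL", "TLR", "TLF"], name="upperbody_label")


def encode_labels(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers and fail loudly if any label is unmapped."""
    encoded = series.map(mapping)
    if encoded.isnull().any():
        unmapped = sorted(series[encoded.isnull()].unique())
        raise ValueError(f"Labels missing from mapping: {unmapped}")
    return encoded.astype("int64")


# Build the mapping from the observed labels, in alphabetical order
label_to_id = {label: idx for idx, label in enumerate(sorted(labels.unique()))}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = encode_labels(labels, label_to_id)
decoded = encoded.map(id_to_label)  # round trip back to the label codes

# A mapping with keys that never occur in the data is rejected
try:
    encode_labels(labels, {"low": 0, "medium": 1, "high": 2})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Keeping `id_to_label` alongside the encoded column lets later notebooks translate model predictions back into the original codes.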

In [32]:
"""
Save the cleaned and standardized dataset to 'data/processed'.
"""

# [11] Save cleaned data
output_file = PROCESSED_DATA_PATH / "clean_postural_risk_dataset.csv"
df.to_csv(output_file, index=False)

print(f"Cleaned dataset saved successfully → {output_file}")
print(f"Final shape: {df.shape}")


Cleaned dataset saved successfully → ../data/processed/clean_postural_risk_dataset.csv
Final shape: (4794, 102)
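
A quick way to confirm that the export is usable is to read the file back and compare it with the in-memory frame. A minimal sketch using a temporary directory and a three-row toy frame (both assumptions) with values taken from the preview above:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "subject": [1, 1, 2],
    "upperbody_label": ["TLB", "TUP", "TLF"],
    "nose_x": [0.013146, -0.027462, -0.017639],
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "clean_postural_risk_dataset.csv"
    toy.to_csv(out_file, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_labels = reloaded["upperbody_label"].tolist() == toy["upperbody_label"].tolist()
same_values = bool(np.allclose(reloaded["nose_x"], toy["nose_x"]))
```

Floats are compared with `np.allclose` rather than strict equality because CSV parsing is not guaranteed to reproduce every last bit.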


## Summary

At this stage, we have:

- Loaded the dataset (`4794 rows`, `102 columns`).
- Verified structure and data types, and confirmed that no values are missing.
- Checked for duplicate rows (none found) and standardized column names.
- Encoded categorical labels if needed.
- Saved a **clean dataset** in `data/processed/clean_postural_risk_dataset.csv`.

This file will be used in the next notebook:
➡️ `02_exploratory_analysis.ipynb`, where we will perform feature correlation, visualization, and preliminary interpretation of posture–risk relationships.