# Dataset preprocessing

The goal of this notebook is to create a preprocessed Kaggle dataset from the competition dataset.  
For now, the preprocessing is based on [this notebook](https://www.kaggle.com/code/vonmainstein/imu-tof).  
It consists of the following steps:
-   Setting the appropriate dtypes (helps with RAM usage).
-   Forward and backward filling the feature columns.
-   Converting the CSV files to Parquet for faster loading and to get the correct data types upon loading.
-   Writing statistics of the dataset to a CSV file (for readability) for standardization.  

> Note:  
> - The padding of the sequences will be performed in the model, since we don't have access to the "leaderboard dataset" inputs.  
> - The demographics dataset is ignored for now.  
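
The dtype and statistics steps listed above could look like the following minimal sketch. The toy CSV, its column names, and `toy_dtypes` (standing in for `DATASET_DF_DTYPES` from `config`) are assumptions made for illustration, not the competition schema.

```python
import io

import pandas as pd

# Toy CSV standing in for train.csv; the column names are illustrative only.
csv_text = "sequence_id,acc_x,tof_1_v0\nA,0.5,10\nA,1.5,-1\nB,2.5,30\n"

# Hypothetical dtype mapping playing the role of DATASET_DF_DTYPES.
toy_dtypes = {"sequence_id": "category", "acc_x": "float32", "tof_1_v0": "float32"}

# Passing dtypes at load time avoids the default float64/object columns.
toy_df = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

# Per-column mean and standard deviation, one row per feature column.
stats = toy_df[["acc_x", "tof_1_v0"]].agg(["mean", "std"]).T

# CSV text that could be written to disk and reused for standardization.
stats_csv = stats.to_csv()
```

Keeping the statistics in a small CSV makes it possible to apply the same scaling at inference time without reloading the training data.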

## Imports

In [1]:
import json
from os.path import join

import numpy as np
import pandas as pd
from pandas import DataFrame as DF
from kagglehub import whoami, competition_download, dataset_upload

from config import *

## Data preprocessing

Load the dataset. If this notebook is not running on Kaggle, you need to be logged in: go to [your settings](https://www.kaggle.com/settings) to create an access token and put it in `~/.kaggle/`.

In [3]:
competition_dataset_path = competition_download(COMPETITION_HANDLE)
df = pd.read_csv(join(competition_dataset_path, "train.csv"), dtype=DATASET_DF_DTYPES)

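Impute the Time of Flight sensor values equal to -1 (treated as missing): forward fill, then backward fill within each sequence, and fall back to 0 when a column is entirely missing in a sequence.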

In [25]:
feature_cols = list(set(df.columns) - set(META_DATA_COLUMNS))
to_replace = {col: -1.0 for col in df.columns if col.startswith("tof")}

df[feature_cols] = (
    df
    .loc[:, feature_cols]
    # df.replace with np.nan sets dtype to float64 so we set it back to float32
    .replace(to_replace, value=np.nan)
    .astype("float32")
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .ffill()
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .bfill()
    # In case there are only nan in the column in the sequence
    .fillna(0)
)
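
To make the behaviour of the grouped fill easier to see, here is a small sketch on a toy frame. The column name `tof_1_v0` and the two sequences are made up for illustration; only the chain of pandas calls mirrors the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sequence_id": ["A", "A", "A", "B", "B"],
    # -1 marks a missing reading; sequence B has no valid reading at all.
    "tof_1_v0": [-1.0, 5.0, -1.0, -1.0, -1.0],
})

filled = (
    toy[["tof_1_v0"]]
    .replace(-1.0, np.nan)
    .astype("float32")
    # Fill forward, then backward, without crossing sequence boundaries.
    .groupby(toy["sequence_id"]).ffill()
    .groupby(toy["sequence_id"]).bfill()
    # Sequences with no valid value are still NaN here, so fall back to 0.
    .fillna(0)
)
```

Sequence `A` ends up filled with its only valid reading, while sequence `B` falls back to 0 because the fills never borrow values from another sequence.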

Standard-scale the feature columns (the IMU columns should probably be handled differently).

In [29]:
df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
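
One thing to keep in mind with this formula: if a column were constant, its standard deviation would be 0 and the division would produce NaN. The sketch below shows one possible guard on a toy frame; the column names are made up, and whether the competition data contains such a column is not checked here.

```python
import pandas as pd

toy = pd.DataFrame({"acc_x": [1.0, 2.0, 3.0], "constant": [4.0, 4.0, 4.0]})

mean = toy.mean()
# Replace a zero standard deviation by 1 so constant columns map to 0 instead of NaN.
std = toy.std().replace(0, 1.0)

scaled = (toy - mean) / std
```

Keeping `mean` and `std` around also allows the same scaling to be reapplied to unseen data.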

The Time of Flight columns make up most of the data, so let's reduce their size by averaging the values of each Time of Flight sensor.

In [30]:
def agg_tof_cols_per_sensor(df: DF) -> DF:
    # Each of the 5 ToF sensors has 64 value columns (v0..v63); replace them by their mean.
    for tof_idx in range(1, 6):
        tof_name = f"tof_{tof_idx}"
        tof_cols = [f"{tof_name}_v{v_idx}" for v_idx in range(64)]
        df = (
            df
            # Unpack a dict so the new column is named after the value of tof_name, not literally "tof_name"
            .assign(**{tof_name: df[tof_cols].mean(axis="columns")})
            .drop(columns=tof_cols)
        )
    return df

tof_meaned_df = agg_tof_cols_per_sensor(df)
tof_meaned_df

| | row_id | sequence_type | sequence_id | sequence_counter | subject | orientation | behavior | phase | gesture | acc_x | ... | thm_1 | thm_2 | thm_3 | thm_4 | thm_5 | tof_1 | tof_2 | tof_3 | tof_4 | tof_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SEQ_000007_000000 | Target | SEQ_000007 | 0 | SUBJ_059520 | Seated Lean Non Dom - FACE DOWN | Relaxes and moves hand to target location | Transition | Cheek - pinch skin | 0.872407 | ... | 0.502318 | 1.184774 | 0.633442 | 0.367246 | 0.477938 | 0.217644 | 0.095887 | -0.311243 | -0.019701 | 0.280289 |
| 1 | SEQ_000007_000001 | Target | SEQ_000007 | 1 | SUBJ_059520 | Seated Lean Non Dom - FACE DOWN | Relaxes and moves hand to target location | Transition | Cheek - pinch skin | 0.918353 | ... | 0.593105 | 1.197089 | 0.681437 | 0.386687 | 0.492298 | 0.224865 | 0.129878 | -0.232165 | -0.005667 | 0.300673 |
| 2 | SEQ_000007_000002 | Target | SEQ_000007 | 2 | SUBJ_059520 | Seated Lean Non Dom - FACE DOWN | Relaxes and moves hand to target location | Transition | Cheek - pinch skin | 0.706192 | ... | 0.821468 | 0.976524 | 0.741405 | 0.423273 | 0.510858 | 0.258531 | 0.247712 | 0.009454 | 0.011723 | 0.339124 |
| 3 | SEQ_000007_000003 | Target | SEQ_000007 | 3 | SUBJ_059520 | Seated Lean Non Dom - FACE DOWN | Relaxes and moves hand to target location | Transition | Cheek - pinch skin | 0.858218 | ... | 0.868207 | 0.063125 | 0.584725 | 0.484648 | 0.341522 | 0.410849 | 0.428684 | 0.363701 | 0.315419 | 0.369181 |
| 4 | SEQ_000007_000004 | Target | SEQ_000007 | 4 | SUBJ_059520 | Seated Lean Non Dom - FACE DOWN | Relaxes and moves hand to target location | Transition | Cheek - pinch skin | 0.679165 | ... | 0.587718 | -0.353110 | 0.081696 | 0.592139 | 0.334854 | 0.720183 | 0.523887 | 0.588191 | 0.504310 | 0.474861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 574940 | SEQ_065531_000048 | Non-Target | SEQ_065531 | 48 | SUBJ_039498 | Seated Lean Non Dom - FACE DOWN | Performs gesture | Gesture | Write name on leg | 0.322408 | ... | 0.590561 | 0.279611 | 0.292463 | 0.854270 | 0.656701 | -0.675473 | -0.549469 | 0.356566 | -0.652218 | -0.728909 |
| 574941 | SEQ_065531_000049 | Non-Target | SEQ_065531 | 49 | SUBJ_039498 | Seated Lean Non Dom - FACE DOWN | Performs gesture | Gesture | Write name on leg | 0.369030 | ... | 0.599804 | 0.311259 | 0.314339 | 0.870208 | 0.618153 | -0.654034 | -0.563573 | 0.350243 | -0.699661 | -0.708544 |
| 574942 | SEQ_065531_000050 | Non-Target | SEQ_065531 | 50 | SUBJ_039498 | Seated Lean Non Dom - FACE DOWN | Performs gesture | Gesture | Write name on leg | 0.249436 | ... | 0.620099 | 0.301660 | 0.299698 | 0.848956 | 0.618153 | -0.698966 | -0.612436 | 0.260407 | -0.703464 | -0.672630 |
| 574943 | SEQ_065531_000051 | Non-Target | SEQ_065531 | 51 | SUBJ_039498 | Seated Lean Non Dom - FACE DOWN | Performs gesture | Gesture | Write name on leg | 0.402138 | ... | 0.695652 | 0.389432 | 0.330572 | 0.864896 | 0.561656 | -0.692074 | -0.591563 | 0.306319 | -0.631848 | -0.703397 |


In [31]:
def normed_feature_sequence_len(sequence: DF, normed_sequence_len: int) -> np.ndarray:
    features = (
        sequence
        .drop(columns=META_DATA_COLUMNS)
        .values
    )
    len_diff = abs(normed_sequence_len - len(features))
    if len(features) < normed_sequence_len:
        padded_features = np.pad(
            features,
            ((len_diff // 2 + len_diff % 2, len_diff // 2), (0, 0)),
        )
        return padded_features
    elif len(features) > normed_sequence_len:
        # -len_diff // 2 floors toward -inf, so an odd len_diff drops the extra row at the end
        return features[len_diff // 2:-len_diff // 2]
    else:
        return features

def df_to_ndarray_dataset(df:DF) -> np.ndarray:
    normed_sequence_len = int(df["sequence_counter"].quantile(SEQUENCE_NORMED_LEN_QUANTILE))
    sequence_it = df.groupby("sequence_id", observed=True, as_index=False)
    x = np.empty(
        shape=(len(sequence_it), normed_sequence_len, df.shape[1] - len(META_DATA_COLUMNS)),
        dtype="float32"
    )
    for sequence_idx, (sequence_id, sequence) in enumerate(sequence_it):
        x[sequence_idx] = normed_feature_sequence_len(sequence, normed_sequence_len)

    return x

x = df_to_ndarray_dataset(df)
tof_meaned_x = df_to_ndarray_dataset(tof_meaned_df)
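
The centered padding and cropping used above can be checked on tiny arrays. The sketch below is a standalone restatement of the same logic; the helper name `center_pad_or_crop` is hypothetical and the explicit slice bounds are a rewrite for readability, not the notebook's code.

```python
import numpy as np

def center_pad_or_crop(features: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad or crop the first axis of a 2D array to target_len rows."""
    len_diff = abs(target_len - len(features))
    small, large = len_diff // 2, len_diff - len_diff // 2
    if len(features) < target_len:
        # The extra row of an odd difference goes before the sequence.
        return np.pad(features, ((large, small), (0, 0)))
    if len(features) > target_len:
        # The extra row of an odd difference is dropped from the end.
        return features[small:len(features) - large]
    return features

short = np.ones((3, 2), dtype="float32")
long = np.arange(7, dtype="float32").reshape(7, 1)

padded = center_pad_or_crop(short, 6)
cropped = center_pad_or_crop(long, 4)
```

Here `padded` has two zero rows before the data and one after, and `cropped` keeps rows 1 to 4 of the original seven.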

One-hot encode the target values.

In [None]:
one_hot_encoded_targets = pd.get_dummies(df["gesture"], dtype="float32")
one_hot_encoded_targets

['Above ear - pull hair',
 'Cheek - pinch skin',
 'Eyebrow - pull hair',
 'Eyelash - pull hair',
 'Feel around in tray and pull out an object',
 'Forehead - pull hairline',
 'Forehead - scratch',
 'Neck - pinch skin',
 'Neck - scratch',
 'Text on phone',
 'Wave hello',
 'Write name in air',
 'Write name on leg',
 'Drink from bottle/cup',
 'Pinch knee/leg skin',
 'Pull air toward your face',
 'Scratch knee/leg skin',
 'Glasses on/off']
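
Note that `pd.get_dummies(df["gesture"])` returns one row per sample of `df`, whereas the first axis of `x` indexes sequences. If one label per sequence is wanted, a possible sketch is shown below on a toy frame; it assumes the gesture is constant within a sequence, and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "sequence_id": ["A", "A", "B", "B", "B"],
    "gesture": [
        "Wave hello", "Wave hello",
        "Neck - scratch", "Neck - scratch", "Neck - scratch",
    ],
})

# One label per sequence, assuming the gesture does not change within a sequence.
# groupby sorts the groups by sequence_id by default.
per_sequence_gesture = toy.groupby("sequence_id")["gesture"].first()

# One row per sequence, one float32 column per gesture.
per_sequence_targets = pd.get_dummies(per_sequence_gesture, dtype="float32")
```

With this layout, row `i` of the targets would correspond to sequence `i` of the feature array, provided both are built from the same sorted grouping.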

Create the dataset directory and save the arrays.

In [36]:
# Create dataset directory
! mkdir -p preprocessed_dataset
# -f: do not fail when the directory is already empty
! rm -f preprocessed_dataset/*
# Save full dataset
np.save("preprocessed_dataset/X.npy", x, allow_pickle=False)
# Save ToF meaned dataset
np.save("preprocessed_dataset/tof_meaned_X.npy", tof_meaned_x, allow_pickle=False)
# Save targets (Y)
np.save("preprocessed_dataset/Y.npy", one_hot_encoded_targets.values, allow_pickle=False)
# Save their name for inference
with open("preprocessed_dataset/target_names_list.json", "w") as fp:
    json.dump(one_hot_encoded_targets.columns.to_list(), fp, indent=1)

Optionally upload the dataset to Kaggle.

In [37]:
if input("Do you want to upload the dataset to Kaggle? [yes/no] ") == "yes":
    # Upload the dataset
    dataset_upload(
        # Dataset handles use "/" on every OS, so build the handle explicitly rather than with os.path.join
        f"{whoami()['username']}/prepocessed-cmi-2025",
        "preprocessed_dataset",
        version_notes="Preprocessed Child Mind Institute 2025 competition dataset."
    )

Kaggle credentials successfully validated.
Uploading Dataset https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025 ...
Starting upload for file preprocessed_dataset/tof_meaned_X.npy
Uploading: 100%|██████████| 63.2M/63.2M [00:06<00:00, 9.81MB/s]
Upload successful: preprocessed_dataset/tof_meaned_X.npy (60MB)
Starting upload for file preprocessed_dataset/target_names_list.json
Uploading: 100%|██████████| 441/441 [00:01<00:00, 310B/s]
Upload successful: preprocessed_dataset/target_names_list.json (441B)
Starting upload for file preprocessed_dataset/Y.npy
Uploading: 100%|██████████| 41.4M/41.4M [00:04<00:00, 9.56MB/s]
Upload successful: preprocessed_dataset/Y.npy (39MB)
Starting upload for file preprocessed_dataset/X.npy
Uploading: 100%|██████████| 1.23G/1.23G [01:46<00:00, 11.6MB/s]
Upload successful: preprocessed_dataset/X.npy (1GB)
Your dataset has been created.
Files are being processed...
See at: https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025
