# Dataset preprocessing

The goal of this notebook is to create a preprocessed Kaggle dataset from the competition dataset.  
For now, the preprocessing is based on [this notebook](https://www.kaggle.com/code/vonmainstein/imu-tof).  
It consists of the following steps:
-   Setting the appropriate dtypes (reduces RAM usage).
-   Forward and backward filling the feature columns.
-   Converting the CSV files to Parquet for faster loading and to keep the correct data types on load.
-   Writing dataset statistics to a CSV file (for readability), to be used for standardization.  
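
As a rough illustration of the first step, the sketch below reads a tiny in-memory CSV with and without an explicit dtype mapping. The column names and the `toy_dtypes` mapping are hypothetical stand-ins; the actual mapping used by this notebook lives in `DATASET_DF_DTYPES` in `config`, whose contents are not shown here.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv.
csv_text = "sequence_id,acc_x,tof_1_v0\nSEQ_0,0.5,-1\nSEQ_0,0.7,12\nSEQ_1,1.1,30\n"

# Hypothetical dtype mapping (an assumption, not the real DATASET_DF_DTYPES).
toy_dtypes = {"sequence_id": "category", "acc_x": "float32", "tof_1_v0": "float32"}

# Without dtype=, pandas infers 64-bit numeric types.
default_df = pd.read_csv(io.StringIO(csv_text))
# With dtype=, the columns are parsed directly into the requested types.
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

print(default_df.dtypes)
print(typed_df.dtypes)
```

Passing the mapping at read time avoids materializing the 64-bit columns first and casting them afterwards.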

> Note:  
> - The padding of the sequences will be performed in the model since we don't have access to the "leaderboard dataset" inputs.  
> - The demographics dataset is ignored for now.  

## Imports

In [None]:
from os.path import join

import numpy as np
import pandas as pd
from pandas import DataFrame as DF
from kagglehub import whoami, competition_download, dataset_upload

from config import *

## Obtain raw dataset
If this notebook is not running on Kaggle, you need to be logged in: go to [your settings](https://www.kaggle.com/settings) to create an access token and put it in `~/.kaggle/`.

In [2]:
competition_dataset_path = competition_download(COMPETITION_HANDLE)

In [3]:
df = pd.read_csv(join(competition_dataset_path, "train.csv"), dtype=DATASET_DF_DTYPES)

In [4]:
features_describe = (
    df
    .drop(META_DATA_COLUMNS, axis="columns")
    .describe()
)
features_describe

statistic,acc_x,acc_y,acc_z,rot_w,rot_x,rot_y,rot_z,thm_1,thm_2,thm_3,...,tof_5_v54,tof_5_v55,tof_5_v56,tof_5_v57,tof_5_v58,tof_5_v59,tof_5_v60,tof_5_v61,tof_5_v62,tof_5_v63
count,574945.0,574945.0,574945.0,571253.0,571253.0,571253.0,571253.0,567958.0,567307.0,568473.0,...,544803.0,544803.0,544803.0,544803.0,544803.0,544803.0,544803.0,544803.0,544803.0,544803.0
mean,1.63998,1.790704,-0.459811,0.360375,-0.119916,-0.059953,-0.188298,27.076448,27.133482,26.702993,...,29.395651,26.030826,45.342583,43.074842,40.045908,37.631707,34.977928,31.93433,29.024752,27.320358
std,5.781259,5.003945,6.09649,0.225739,0.46552,0.543028,0.504137,3.231948,2.941437,4.122353,...,58.093844,54.215523,68.466064,68.017631,66.941587,65.28871,63.201604,60.440645,57.218513,55.407192
min,-34.585938,-24.402344,-42.855469,0.0,-0.999146,-0.999695,-0.998169,-0.370413,21.95882,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,-2.964844,-2.121094,-5.417969,0.180237,-0.456299,-0.511536,-0.627686,24.753527,24.543737,24.64035,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,2.972656,0.695312,-1.5625,0.340332,-0.18689,-0.11261,-0.263916,26.982323,26.354338,26.956276,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
75%,6.34375,6.816406,5.164062,0.503479,0.20459,0.440063,0.251099,29.425037,29.620148,29.231794,...,34.0,24.0,81.0,76.0,67.0,59.0,51.0,42.0,35.0,31.0
max,46.328125,27.183594,30.078125,0.99939,0.999817,0.999451,0.999878,38.457664,37.578339,37.294994,...,249.0,249.0,249.0,249.0,249.0,249.0,249.0,249.0,249.0,249.0
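
The `mean` and `std` rows of such a table are what a later standardization step would need. The following is a minimal sketch on a toy frame, assuming plain z-score standardization; the toy values are made up, and the notebook itself does not prescribe this exact formula.

```python
import numpy as np
import pandas as pd

# Toy feature frame (hypothetical values, not the competition data).
toy = pd.DataFrame({"acc_x": [1.0, 2.0, 3.0, 4.0], "thm_1": [24.0, 26.0, 28.0, 30.0]})

# describe() returns one row per statistic, one column per feature.
stats = toy.describe()

# Z-score each column using the stored statistics.
# Note that describe() reports the sample standard deviation (ddof=1).
standardized = (toy - stats.loc["mean"]) / stats.loc["std"]

print(standardized)
```

When the statistics are read back from a CSV file, passing `index_col=0` to `pd.read_csv` restores the statistic names as the index, so `stats.loc["mean"]` keeps working.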


In [28]:
tof_cols = [col for col in df.columns if col.startswith("tof")]

df[tof_cols] = (
    df
    .loc[:, tof_cols]
    # -1.0 is treated as a missing reading.
    # Replacing it with np.nan sets the dtype to float64, so we cast back to float32.
    .replace(-1.0, np.nan)
    .astype("float32")
    # Fill forward, then backward, within each sequence only
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .ffill()
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .bfill()
    # In case a column contains only NaN within a sequence
    .fillna(0)
)
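
To make the effect of the group-wise filling easier to see, here is a small sketch on a toy frame. The column name `tof_1_v0` and the sequence ids are illustrative assumptions; the point is only that values never leak from one sequence into another, and that a sequence with no valid reading ends up as zeros.

```python
import numpy as np
import pandas as pd

# Three toy sequences; -1.0 stands for a missing reading.
toy = pd.DataFrame({
    "sequence_id": ["a", "a", "a", "b", "b", "c", "c"],
    "tof_1_v0": [-1.0, 5.0, -1.0, 7.0, -1.0, -1.0, -1.0],
})

values = toy[["tof_1_v0"]].replace(-1.0, np.nan)
filled = (
    values
    # Forward fill within each sequence
    .groupby(toy["sequence_id"]).ffill()
    # Backward fill what is still missing at the start of a sequence
    .groupby(toy["sequence_id"]).bfill()
    # Sequence "c" has no valid reading at all, so it falls back to 0
    .fillna(0)
)

print(filled["tof_1_v0"].tolist())
```

Without the `groupby`, the last valid reading of sequence `b` would be carried into sequence `c`.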

In [31]:
df.memory_usage().div(1024 ** 2).sum()

np.float64(766.5317230224609)

In [45]:
normed_sequence_len = int(df["sequence_counter"].quantile(SEQUENCE_NORMED_LEN_QUANTILE))
sequence_it = df.groupby("sequence_id", observed=True, as_index=False)
x: list[np.ndarray] = []
y: list[str] = []

def normed_feature_sequence_len(sequence: DF, normed_sequence_len: int) -> np.ndarray:
    """Center-pad with zeros or center-crop a sequence to `normed_sequence_len` rows."""
    features = (
        sequence
        .drop(columns=META_DATA_COLUMNS)
        .values
    )
    len_diff = abs(normed_sequence_len - len(features))
    # Split the difference between both ends; the start gets the extra row when it is odd
    start, end = len_diff // 2 + len_diff % 2, len_diff // 2
    if len(features) < normed_sequence_len:
        return np.pad(features, ((start, end), (0, 0)))
    # Crop both ends; this also returns the sequence unchanged when len_diff == 0
    return features[start:len(features) - end]

for sequence_id, sequence in sequence_it:
    x.append(normed_feature_sequence_len(sequence, normed_sequence_len))
    y.append(sequence["gesture"].iloc[0])


print(y)

['Cheek - pinch skin', 'Forehead - pull hairline', 'Cheek - pinch skin', 'Write name on leg', 'Forehead - pull hairline', 'Feel around in tray and pull out an object', 'Neck - scratch', 'Neck - pinch skin', 'Forehead - pull hairline', 'Eyelash - pull hair', 'Eyebrow - pull hair', 'Eyelash - pull hair', 'Forehead - scratch', 'Above ear - pull hair', 'Above ear - pull hair', 'Wave hello', 'Wave hello', 'Forehead - scratch', 'Forehead - pull hairline', 'Write name in air', 'Neck - pinch skin', 'Above ear - pull hair', 'Neck - pinch skin', 'Eyebrow - pull hair', 'Neck - scratch', 'Text on phone', 'Forehead - pull hairline', 'Feel around in tray and pull out an object', 'Pull air toward your face', 'Wave hello', 'Eyelash - pull hair', 'Text on phone', 'Pinch knee/leg skin', 'Scratch knee/leg skin', 'Above ear - pull hair', 'Neck - pinch skin', 'Write name in air', 'Eyelash - pull hair', 'Above ear - pull hair', 'Forehead - scratch', 'Above ear - pull hair', 'Eyebrow - pull hair', 'Pull air 
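
The centering arithmetic is easy to get wrong, so it can help to check it on small arrays. This is a minimal sketch, assuming zero padding and a symmetric crop where the start absorbs the extra row when the difference is odd; `center_pad_or_crop` is a hypothetical standalone helper that works on plain NumPy arrays rather than on the DataFrame groups used in the cell.

```python
import numpy as np

def center_pad_or_crop(features: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad or crop along axis 0 so that the result has `target_len` rows."""
    diff = abs(target_len - len(features))
    # The start gets the extra row when the difference is odd.
    start, end = diff // 2 + diff % 2, diff // 2
    if len(features) < target_len:
        return np.pad(features, ((start, end), (0, 0)))
    # The end index is computed from the length, so end == 0 keeps the tail intact.
    return features[start:len(features) - end]

short = np.ones((3, 2))
long = np.arange(14).reshape(7, 2)

padded = center_pad_or_crop(short, 6)   # 2 zero rows before, 1 after
cropped = center_pad_or_crop(long, 4)   # drops 2 rows at the start, 1 at the end

print(padded.shape, cropped.shape)
```

Because every result has the same number of rows, the sequences can then be combined with `np.stack` into a single `(n_sequences, target_len, n_features)` array.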

In [46]:
print(x)

[array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(114, 332)), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(114, 332)), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(114, 332)), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0

In [47]:
np.stack(y)

array(['Cheek - pinch skin', 'Forehead - pull hairline',
       'Cheek - pinch skin', ..., 'Above ear - pull hair',
       'Cheek - pinch skin', 'Write name on leg'],
      shape=(8151,), dtype='<U42')

In [None]:
# Create dataset directory
! mkdir -p preprocessed_dataset

# Only upload on an explicit "y"; any other answer (including an empty one) skips the upload
if input("Do you want to upload the dataset to Kaggle? [y/N] ").strip().lower() == "y":
    # Save dataframes
    df.to_parquet("preprocessed_dataset/train.parquet", index=False)
    features_describe.to_csv("preprocessed_dataset/features_describe.csv")
    # Upload the dataset
    dataset_upload(
        join(whoami()["username"], "prepocessed-cmi-2025"),
        "preprocessed_dataset",
        version_notes="Preprocessed Child Mind Institute 2025 competition dataset."
    )

Kaggle credentials successfully validated.
Uploading Dataset https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025 ...
Starting upload for file preprocessed_dataset/train.parquet


Uploading: 100%|██████████| 124M/124M [01:41<00:00, 1.22MB/s] 

Upload successful: preprocessed_dataset/train.parquet (118MB)
Starting upload for file preprocessed_dataset/features_describe.csv



Uploading: 100%|██████████| 27.6k/27.6k [00:00<00:00, 56.7kB/s]

Upload successful: preprocessed_dataset/features_describe.csv (27KB)





Your dataset has been created.
Files are being processed...
See at: https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025
