# Dataset preprocessing

The goal of this notebook is to create a preprocessed kaggle dataset out of the competition dataset.  
For now, the preprocessing will be based on [this notebook](https://www.kaggle.com/code/vonmainstein/imu-tof).  
It consists of the following steps:
-   Set the appropriate dtypes (helps with RAM usage).
-   Impute missing feature values with forward, backward and then 0 filling.
-   Split the dataset into multiple cross validation folds.
-   Standardize feature values.
-   Pad/Truncate the sequences to the same length.  

> Note:  
> - Demographics data set will be ignored for now.  

## Imports

In [1]:
import os
import json
from os.path import join
from itertools import repeat, starmap

import numpy as np
import pandas as pd
from numpy import ndarray
import plotly.express as px
from pandas import DataFrame as DF
from scipy.spatial.transform import Rotation
from kagglehub import whoami, competition_download, dataset_upload

from config import *

## Data preprocessing

### Load dataset
Requires to be logged in if this notebook is not running on kaggle, go to [your settings](https://www.kaggle.com/settings) to create an access token and put it in `~/.kaggle/`.

In [2]:
competition_dataset_path = competition_download(COMPETITION_HANDLE)
df = pd.read_csv(join(competition_dataset_path, "train.csv"), dtype=DATASET_DF_DTYPES)

### Impute missing data
Perform forward, backward and then 0 filling of NaN values.

In [3]:
feature_cols = list(set(df.columns) - set(META_DATA_COLUMNS))
# Missing ToF values are already imputed by -1 which is inconvinient since we want all missing values to be NaN.    
# So we replace them by NaN and then perform imputing.  
tof_vals_to_nan = {col: -1.0 for col in df.columns if col.startswith("tof")}
fillna_val_per_col = {col: 1.0 if col == 'rot_w' else 0 for col in df.columns}

df[feature_cols] = (
    df
    .loc[:, feature_cols]
    # df.replace with np.nan sets dtype to floar64 so we set it back to float32
    .replace(tof_vals_to_nan, value=np.nan)
    .astype("float32")
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .ffill()
    .groupby(df["sequence_id"], observed=True, as_index=False)
    .bfill()
    # In case there are only nan in the column in the sequence
    .fillna(fillna_val_per_col)
)

### Euler angles from quaternions

In [4]:
EULER_ANGLES_COLS = ["euler_x", "euler_y", "euler_z"]
QUATERNION_COLS = ['rot_w', 'rot_x', 'rot_y', 'rot_z']
def rot_euler_angles(seq:DF) -> ndarray:
    try:
        quat_cols = seq[QUATERNION_COLS]
        quat_cols /= np.linalg.norm(quat_cols, axis=1, keepdims=True)
        rotation = Rotation.from_quat(quat_cols)
        euler_data = rotation.as_euler("xyz").squeeze()
        angles_df = DF(
            data=euler_data,
            columns=EULER_ANGLES_COLS
        )
        return angles_df
    except ValueError as e:
        print(quat_cols)
        raise e

rot_euler_angles_df = (
    df
    .groupby("sequence_id", as_index=False, observed=True)
    .apply(rot_euler_angles, include_groups=False)
    .loc[:, EULER_ANGLES_COLS]
    .values
)
display(rot_euler_angles_df)
df[EULER_ANGLES_COLS] = rot_euler_angles_df

array([[ 0.13973521,  0.76901321,  1.06602417],
       [ 0.07560633,  0.75346626,  0.9887669 ],
       [-0.23884318,  0.68307921,  0.69691551],
       ...,
       [ 2.17062915,  0.40229489, -3.09554491],
       [ 2.19296721,  0.42823972, -3.09448917],
       [ 2.19121165,  0.41152166, -3.09799803]], shape=(574945, 3))

  df[EULER_ANGLES_COLS] = rot_euler_angles_df
  df[EULER_ANGLES_COLS] = rot_euler_angles_df
  df[EULER_ANGLES_COLS] = rot_euler_angles_df


### One hot encode target values.

In [5]:
one_hot_target = pd.get_dummies(df["gesture"])
df[one_hot_target.columns] = one_hot_target
df

  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target
  df[one_hot_target.columns] = one_hot_target


Unnamed: 0,row_id,sequence_type,sequence_id,sequence_counter,subject,orientation,behavior,phase,gesture,acc_x,...,Neck - scratch,Text on phone,Wave hello,Write name in air,Write name on leg,Drink from bottle/cup,Pinch knee/leg skin,Pull air toward your face,Scratch knee/leg skin,Glasses on/off
0,SEQ_000007_000000,Target,SEQ_000007,0,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.683594,...,False,False,False,False,False,False,False,False,False,False
1,SEQ_000007_000001,Target,SEQ_000007,1,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.949219,...,False,False,False,False,False,False,False,False,False,False
2,SEQ_000007_000002,Target,SEQ_000007,2,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,5.722656,...,False,False,False,False,False,False,False,False,False,False
3,SEQ_000007_000003,Target,SEQ_000007,3,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.601562,...,False,False,False,False,False,False,False,False,False,False
4,SEQ_000007_000004,Target,SEQ_000007,4,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,5.566406,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
574940,SEQ_065531_000048,Non-Target,SEQ_065531,48,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.503906,...,False,False,False,False,True,False,False,False,False,False
574941,SEQ_065531_000049,Non-Target,SEQ_065531,49,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.773438,...,False,False,False,False,True,False,False,False,False,False
574942,SEQ_065531_000050,Non-Target,SEQ_065531,50,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.082031,...,False,False,False,False,True,False,False,False,False,False
574943,SEQ_065531_000051,Non-Target,SEQ_065531,51,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.964844,...,False,False,False,False,True,False,False,False,False,False


### ToF data aggregation.
Time of Flight columns take most of the data, let's reduce their size by aggregating by mean for each Time of Flight sensor.

In [6]:
def agg_tof_cols_per_sensor(df:DF) -> DF:
    for tof_idx in range(1, 6):
        tof_name = f"tof_{tof_idx}"
        tof_cols = [f"{tof_name}_v{v_idx}" for v_idx in range(64)]
        if any(map(lambda col: col not in df.columns, tof_cols)):
            print(f"Some (or) all ToF {tof_idx} columns are not in the df. Maybe you already ran this cell?")
            continue
        df = (
            df
            # Need to use a dict otherwise the name of the col will be "tof_preffix" instead of the value it contains
            .assign(**{tof_name:df[tof_cols].mean(axis="columns")})
            .drop(columns=tof_cols)
        )
    return df

df = agg_tof_cols_per_sensor(df)
# Redifine feature_cols now that there are less of them.
feature_cols = list(set(df.columns) - set(META_DATA_COLUMNS) - set(df["gesture"].unique().tolist()))

df

Unnamed: 0,row_id,sequence_type,sequence_id,sequence_counter,subject,orientation,behavior,phase,gesture,acc_x,...,Drink from bottle/cup,Pinch knee/leg skin,Pull air toward your face,Scratch knee/leg skin,Glasses on/off,tof_1,tof_2,tof_3,tof_4,tof_5
0,SEQ_000007_000000,Target,SEQ_000007,0,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.683594,...,False,False,False,False,False,139.250000,117.109375,91.687500,123.359375,135.343750
1,SEQ_000007_000001,Target,SEQ_000007,1,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.949219,...,False,False,False,False,False,139.796875,119.671875,97.921875,124.406250,137.000000
2,SEQ_000007_000002,Target,SEQ_000007,2,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,5.722656,...,False,False,False,False,False,142.375000,128.359375,116.953125,125.687500,140.234375
3,SEQ_000007_000003,Target,SEQ_000007,3,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,6.601562,...,False,False,False,False,False,154.109375,142.093750,144.515625,149.078125,142.609375
4,SEQ_000007_000004,Target,SEQ_000007,4,SUBJ_059520,Seated Lean Non Dom - FACE DOWN,Relaxes and moves hand to target location,Transition,Cheek - pinch skin,5.566406,...,False,False,False,False,False,177.953125,149.453125,161.828125,163.765625,151.265625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
574940,SEQ_065531_000048,Non-Target,SEQ_065531,48,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.503906,...,False,False,False,False,False,68.562500,67.750000,144.437500,74.062500,52.843750
574941,SEQ_065531_000049,Non-Target,SEQ_065531,49,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.773438,...,False,False,False,False,False,70.234375,66.656250,144.000000,70.406250,54.531250
574942,SEQ_065531_000050,Non-Target,SEQ_065531,50,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.082031,...,False,False,False,False,False,66.671875,62.906250,136.906250,70.109375,57.468750
574943,SEQ_065531_000051,Non-Target,SEQ_065531,51,SUBJ_039498,Seated Lean Non Dom - FACE DOWN,Performs gesture,Gesture,Write name on leg,3.964844,...,False,False,False,False,False,67.218750,64.500000,140.531250,75.609375,54.937500


### Split into folds

In [7]:
def split_dataset(df:DF, by="subject") -> tuple[DF, DF]:
    unique_sequences = df[by].unique()
    validation_sequences = pd.Series(unique_sequences).sample(
        frac=VALIDATION_FRACTION, replace=False
    )

    validation_set = df[df[by].isin(validation_sequences)]
    train_set = df[~df[by].isin(validation_sequences)]

    return train_set, validation_set

folds = list(map(split_dataset, repeat(df, N_FOLDS)))

In [8]:
display(df.shape)
display(folds[0][0].shape)
display(folds[0][1].shape)

(574945, 47)

(459073, 47)

(115872, 47)

In [9]:
for train_df, validation_df in folds:
    print(train_df.shape, validation_df.shape)

(459073, 47) (115872, 47)
(463036, 47) (111909, 47)
(458225, 47) (116720, 47)
(461464, 47) (113481, 47)
(457812, 47) (117133, 47)


### Std norm
Standard scale the feature cols (should probably do something different for IMU cols).  
<!-- *Deprecated, std norm is now performed at dataset creation to avoid target leakage.*   -->

In [10]:
def std_norm_dataset(train:DF, val:DF) -> tuple[DF, DF]:
    means = train[feature_cols].mean()
    std = train[feature_cols].std()
    train.loc[:, feature_cols] = (train[feature_cols] - means) / std
    val.loc[:, feature_cols] = (val[feature_cols] - means) / std
    return train, val

normed_folds = list(starmap(std_norm_dataset, folds))

  0.02547337]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
 -0.72783913]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
 -1.07306363]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
 -0.88041218]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
 -1.60412708]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
  train.loc[:, feature_cols] = (train[feature_cols] - means) / std
 -1.0676735

Normalize full dataset.

In [11]:
# Retain full dataset meta data for inference
full_dataset_meta_data = {
    "mean": df[feature_cols].mean().to_dict(),
    "std": df[feature_cols].std().to_dict(),
}
df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']

  0.03026214]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']
 -0.72446184]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']
  df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']
 -1.00162401]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']
 -0.88129619]' has dtype incompatible with float32, please explicitly cast to a compatible dtype first.
  df.loc[:, feature_cols] = (df[feature_cols] - full_dataset_meta_data["mean"]) / full_dataset_meta_data['std']
 -1.61396961]' has dtype

In [12]:
df[feature_cols].agg(["mean", "std"])

Unnamed: 0,rot_z,rot_y,thm_2,tof_5,tof_2,rot_x,thm_4,rot_w,tof_4,acc_y,acc_x,thm_5,acc_z,tof_1,thm_3,euler_y,euler_z,thm_1,euler_x,tof_3
mean,1.855411e-08,-8.728876e-10,2.822679e-07,-5.993022e-08,1.687596e-08,-2.663767e-08,-1.427808e-07,-6.315844e-08,1.404808e-08,1.186949e-08,-1.749467e-08,-3.565034e-09,-2.971537e-09,1.908273e-07,1.823665e-07,1.186411e-18,-1.3050520000000001e-17,3.772558e-08,-8.700347e-18,2.663846e-09
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Let's compare the train to validation mean/std skews.

In [13]:
pd.concat({
    "train": folds[0][1][feature_cols].agg(["mean", "std"]),
    "validation": folds[0][1][feature_cols].agg(["mean", "std"]),
})

Unnamed: 0,Unnamed: 1,rot_z,rot_y,thm_2,tof_5,tof_2,rot_x,thm_4,rot_w,tof_4,acc_y,acc_x,thm_5,acc_z,tof_1,thm_3,euler_y,euler_z,thm_1,euler_x,tof_3
train,mean,-0.022788,0.004298,0.121627,-0.237616,-0.00566,-0.013953,0.056279,0.086225,-0.074049,-0.087935,-0.034762,-0.335986,0.013114,-0.039551,0.176921,-0.036992,-0.029604,0.140331,-0.067752,-0.089391
train,std,0.967187,1.028734,0.641238,1.091336,0.988538,0.96039,0.546626,0.963192,0.930029,1.028772,1.00052,1.501069,1.002488,0.948109,0.465942,1.008167,1.008635,0.55748,0.981734,0.946357
validation,mean,-0.022788,0.004298,0.121627,-0.237616,-0.00566,-0.013953,0.056279,0.086225,-0.074049,-0.087935,-0.034762,-0.335986,0.013114,-0.039551,0.176921,-0.036992,-0.029604,0.140331,-0.067752,-0.089391
validation,std,0.967187,1.028734,0.641238,1.091336,0.988538,0.96039,0.546626,0.963192,0.930029,1.028772,1.00052,1.501069,1.002488,0.948109,0.465942,1.008167,1.008635,0.55748,0.981734,0.946357


### Normalize sequences lengths.  
And turn the Dataframes into ndarrays.

#### Visualize histogram of sequences lengths.

Entire dataset sequences lengths.

In [14]:
px.histogram(
    (
        df
        .groupby("sequence_id", observed=True)
        .size()
    ),
    title="Sequence length frequency",
)

Second(to avoid always look at the first one) Train/validation split sequences lengths comparaison.

In [15]:
def get_set_sequences_lengths(set:DF, name:str) -> DF:
    return (
        set
        .groupby("sequence_id", observed=True)
        .size()
        .reset_index(name="length")
        .assign(set=name)
    )

full_se_lengths = pd.concat((
    get_set_sequences_lengths(folds[2][0], "Train"),
    get_set_sequences_lengths(folds[2][1], "Validation"),
))

fig = px.histogram(
    full_se_lengths,
    x="length",
    color="set",
    barmode="overlay",  # or 'group' if you want side-by-side bars
    nbins=50,           # adjust bin size if needed
    title="Sequence Length Distribution: Train vs Validation"
)

fig.update_traces(opacity=0.8)  # better visibility with overlay
fig.show()


In [16]:
for train, val in folds:
    print("train normed sequence len:", int(train.groupby("sequence_id", observed=True).size().quantile(SEQUENCE_NORMED_LEN_QUANTILE)))
    print("validation normed sequence len:", int(val.groupby("sequence_id", observed=True).size().quantile(SEQUENCE_NORMED_LEN_QUANTILE)))
    print()

train normed sequence len: 124
validation normed sequence len: 134

train normed sequence len: 130
validation normed sequence len: 118

train normed sequence len: 125
validation normed sequence len: 131

train normed sequence len: 129
validation normed sequence len: 118

train normed sequence len: 125
validation normed sequence len: 132



#### Sequence length norm implementation

In [17]:
from tqdm.notebook import tqdm

gesture_cols = df["gesture"].unique()

def length_normed_sequence_feat_arr(sequence: DF, normed_sequence_len: int) -> ndarray:
    features = (
        sequence
        .loc[:, feature_cols]
        .values
    )
    len_diff = abs(normed_sequence_len - len(features))
    if len(features) < normed_sequence_len:
        padded_features = np.pad(
            features,
            ((len_diff // 2 + len_diff % 2, len_diff // 2), (0, 0)),
        )
        return padded_features
    elif len(features) > normed_sequence_len:
        return features[len_diff // 2:-len_diff // 2]
    else:
        return features

def df_to_ndarrays(df:DF, normed_sequence_len:int) -> tuple[np.ndarray, np.ndarray]:
    sequence_it = df.groupby("sequence_id", observed=True, as_index=False)
    x = np.empty(
        shape=(len(sequence_it), normed_sequence_len, len(feature_cols)),
        dtype="float32"
    )
    y = np.empty(
        shape=(len(sequence_it), df["gesture"].nunique()),
        dtype="float32"
    )
    for sequence_idx, (_, sequence) in tqdm(enumerate(sequence_it), total=len(sequence_it)):
        normed_seq_feat_arr = length_normed_sequence_feat_arr(sequence, normed_sequence_len)
        x[sequence_idx] = normed_seq_feat_arr
        # Take the first value as they are(or at least should be) all the same in a single sequence
        y[sequence_idx] = sequence[gesture_cols].iloc[0].values

    return x, y

def get_normed_seq_len(dataset:DF) -> int:
    return int(
        dataset
        .groupby("sequence_id", observed=True)
        .size()
        .quantile(SEQUENCE_NORMED_LEN_QUANTILE)
    )

def fold_dfs_to_ndarrays(train:DF, validation:DF) -> tuple[ndarray, ndarray, ndarray, ndarray]:
    """
    Returns:
        (train X, train Y, validation X, validation Y)
    """
    normed_sequence_len = get_normed_seq_len(train)
    return (
        *df_to_ndarrays(train, normed_sequence_len),
        *df_to_ndarrays(validation, normed_sequence_len),
    )

folds_arrs = list(starmap(fold_dfs_to_ndarrays, folds))

  0%|          | 0/6519 [00:00<?, ?it/s]

  0%|          | 0/1632 [00:00<?, ?it/s]

  0%|          | 0/6519 [00:00<?, ?it/s]

  0%|          | 0/1632 [00:00<?, ?it/s]

  0%|          | 0/6519 [00:00<?, ?it/s]

  0%|          | 0/1632 [00:00<?, ?it/s]

  0%|          | 0/6519 [00:00<?, ?it/s]

  0%|          | 0/1632 [00:00<?, ?it/s]

  0%|          | 0/6519 [00:00<?, ?it/s]

  0%|          | 0/1632 [00:00<?, ?it/s]

In [18]:
full_dataset_sequence_length_norm = get_normed_seq_len(df)
full_x, full_y = df_to_ndarrays(df, full_dataset_sequence_length_norm)

  0%|          | 0/8151 [00:00<?, ?it/s]

## Create dataset

In [19]:
# Clean dataset directory if it already exists
! rm -rf preprocessed_dataset
# Create dataset direcory
! mkdir preprocessed_dataset
# Save folds
for fold_i, (train_x, train_y, val_x, val_y) in enumerate(folds_arrs):
    fold_dir_path = join("preprocessed_dataset", f"fold_{fold_i}")
    os.makedirs(fold_dir_path)
    # save features (X)
    np.save(join(fold_dir_path, "train_X.npy"), train_x, allow_pickle=False)
    np.save(join(fold_dir_path, "validation_X.npy"), val_x, allow_pickle=False)
    # Save targets (Y)
    np.save(join(fold_dir_path, "train_Y.npy"), train_y, allow_pickle=False)
    np.save(join(fold_dir_path, "validation_Y.npy"), val_y, allow_pickle=False)
# Save full dataset
full_dataset_dir_path = "preprocessed_dataset/full_dataset"
os.makedirs(full_dataset_dir_path)
np.save(join(full_dataset_dir_path, "X.npy"), full_x, allow_pickle=False)
np.save(join(full_dataset_dir_path, "Y.npy"), full_y, allow_pickle=False)
# Save dataset meta data
full_dataset_meta_data["target_names"] = one_hot_target.columns.to_list()
full_dataset_meta_data["pad_seq_len"] = full_dataset_sequence_length_norm
full_dataset_meta_data["feature_cols"] = feature_cols

with open("preprocessed_dataset/full_dataset_meta_data.json", "w") as fp:
    json.dump(full_dataset_meta_data, fp, indent=4)

In [23]:
df[feature_cols]

Unnamed: 0,rot_z,rot_y,thm_2,tof_5,tof_2,rot_x,thm_4,rot_w,tof_4,acc_y,acc_x,thm_5,acc_z,tof_1,thm_3,euler_y,euler_z,thm_1,euler_x,tof_3
0,-1.238540,-0.716343,1.184774,0.404678,0.148727,-0.508525,0.367246,-0.997151,-0.016778,0.884130,0.872408,0.477938,0.625816,0.339571,0.633442,0.883372,0.339185,0.502318,0.191240,-0.454938
1,-1.267920,-0.681840,1.197089,0.433771,0.199531,-0.476437,0.386687,-0.957737,0.002596,0.884130,0.918353,0.492298,0.588012,0.349948,0.681437,0.860138,0.297587,0.593105,0.150843,-0.337148
2,-1.349747,-0.549352,0.976524,0.490583,0.371769,-0.334147,0.423273,-0.630264,0.026308,0.723320,0.706192,0.510858,0.964766,0.398864,0.741405,0.754947,0.140441,0.821468,-0.047236,0.022423
3,-1.390175,-0.329928,0.063125,0.532300,0.644067,-0.312448,0.484648,-0.290093,0.459197,0.347835,0.858218,0.341522,1.134561,0.621508,0.584725,0.714206,-0.051374,0.868207,-0.231340,0.543179
4,-1.447599,-0.007333,-0.353110,0.684347,0.789974,-0.214344,0.592139,-0.134027,0.731018,-0.302433,0.679165,0.334854,1.655481,1.073910,0.081696,0.418963,-0.252579,0.587718,-0.342103,0.870276
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
574940,0.070326,-0.759754,0.279611,-1.044439,-0.829872,-1.601608,0.854270,-1.117506,-0.929112,-0.444509,0.322408,0.656701,-1.309212,-1.001627,0.292463,0.287935,-1.902655,0.590561,1.457859,0.541703
574941,0.045559,-0.740360,0.311259,-1.014798,-0.851557,-1.607920,0.870208,-1.100313,-0.996778,-0.490566,0.369030,0.618153,-1.434796,-0.969906,0.314339,0.326326,-1.902869,0.599804,1.471826,0.533437
574942,0.041674,-0.741939,0.301660,-0.963200,-0.925905,-1.605422,0.848956,-1.089203,-1.002272,-0.314143,0.249436,0.618153,-1.138775,-1.037499,0.299698,0.335326,-1.901584,0.620099,1.470549,0.399410
574943,0.017150,-0.722432,0.389432,-1.007662,-0.894307,-1.610813,0.864896,-1.070422,-0.900484,-0.429677,0.402138,0.561656,-1.414933,-1.027123,0.330572,0.374099,-1.901015,0.695652,1.484620,0.467900


## Dataset upload
Optionally upload the dataset to kaggle.

In [24]:
if input("Do you want to upload the  dataset to kaggle?[yes/no]").lower() == "yes":
    # Updaload the dataset
    dataset_upload(
        join(whoami()["username"], "prepocessed-cmi-2025"),
        "preprocessed_dataset",
        version_notes="Preprocessed Child Mind Institue 2025 competition dataset."
    )
else:
    print("Dataset has not been uploaded.")

Kaggle credentials successfully validated.
Uploading Dataset https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025 ...
Starting upload for file preprocessed_dataset/full_dataset_meta_data.json


Uploading: 100%|██████████| 2.51k/2.51k [00:00<00:00, 6.74kB/s]

Upload successful: preprocessed_dataset/full_dataset_meta_data.json (2KB)
Starting upload for file preprocessed_dataset/full_dataset/X.npy



Uploading: 100%|██████████| 82.8M/82.8M [00:36<00:00, 2.26MB/s]

Upload successful: preprocessed_dataset/full_dataset/X.npy (79MB)
Starting upload for file preprocessed_dataset/full_dataset/Y.npy



Uploading: 100%|██████████| 587k/587k [00:00<00:00, 629kB/s] 

Upload successful: preprocessed_dataset/full_dataset/Y.npy (573KB)
Starting upload for file preprocessed_dataset/fold_4/train_X.npy



Uploading: 100%|██████████| 65.2M/65.2M [00:32<00:00, 2.01MB/s]

Upload successful: preprocessed_dataset/fold_4/train_X.npy (62MB)
Starting upload for file preprocessed_dataset/fold_4/validation_Y.npy



Uploading: 100%|██████████| 118k/118k [00:00<00:00, 209kB/s]

Upload successful: preprocessed_dataset/fold_4/validation_Y.npy (115KB)
Starting upload for file preprocessed_dataset/fold_4/validation_X.npy



Uploading: 100%|██████████| 16.3M/16.3M [00:07<00:00, 2.20MB/s]

Upload successful: preprocessed_dataset/fold_4/validation_X.npy (16MB)
Starting upload for file preprocessed_dataset/fold_4/train_Y.npy



Uploading: 100%|██████████| 469k/469k [00:00<00:00, 538kB/s] 

Upload successful: preprocessed_dataset/fold_4/train_Y.npy (458KB)
Starting upload for file preprocessed_dataset/fold_2/train_X.npy



Uploading: 100%|██████████| 65.2M/65.2M [00:27<00:00, 2.34MB/s]

Upload successful: preprocessed_dataset/fold_2/train_X.npy (62MB)
Starting upload for file preprocessed_dataset/fold_2/validation_Y.npy



Uploading: 100%|██████████| 118k/118k [00:00<00:00, 206kB/s]

Upload successful: preprocessed_dataset/fold_2/validation_Y.npy (115KB)
Starting upload for file preprocessed_dataset/fold_2/validation_X.npy



Uploading: 100%|██████████| 16.3M/16.3M [00:09<00:00, 1.80MB/s]

Upload successful: preprocessed_dataset/fold_2/validation_X.npy (16MB)
Starting upload for file preprocessed_dataset/fold_2/train_Y.npy



Uploading: 100%|██████████| 469k/469k [00:01<00:00, 452kB/s] 

Upload successful: preprocessed_dataset/fold_2/train_Y.npy (458KB)
Starting upload for file preprocessed_dataset/fold_1/train_X.npy



Uploading: 100%|██████████| 67.8M/67.8M [00:29<00:00, 2.29MB/s]

Upload successful: preprocessed_dataset/fold_1/train_X.npy (65MB)
Starting upload for file preprocessed_dataset/fold_1/validation_Y.npy



Uploading: 100%|██████████| 118k/118k [00:00<00:00, 205kB/s]

Upload successful: preprocessed_dataset/fold_1/validation_Y.npy (115KB)
Starting upload for file preprocessed_dataset/fold_1/validation_X.npy



Uploading: 100%|██████████| 17.0M/17.0M [00:07<00:00, 2.21MB/s]

Upload successful: preprocessed_dataset/fold_1/validation_X.npy (16MB)
Starting upload for file preprocessed_dataset/fold_1/train_Y.npy



Uploading: 100%|██████████| 469k/469k [00:00<00:00, 581kB/s] 

Upload successful: preprocessed_dataset/fold_1/train_Y.npy (458KB)
Starting upload for file preprocessed_dataset/fold_3/train_X.npy



Uploading: 100%|██████████| 67.3M/67.3M [00:27<00:00, 2.46MB/s]

Upload successful: preprocessed_dataset/fold_3/train_X.npy (64MB)
Starting upload for file preprocessed_dataset/fold_3/validation_Y.npy



Uploading: 100%|██████████| 118k/118k [00:00<00:00, 199kB/s]

Upload successful: preprocessed_dataset/fold_3/validation_Y.npy (115KB)
Starting upload for file preprocessed_dataset/fold_3/validation_X.npy



Uploading: 100%|██████████| 16.8M/16.8M [00:09<00:00, 1.86MB/s]

Upload successful: preprocessed_dataset/fold_3/validation_X.npy (16MB)
Starting upload for file preprocessed_dataset/fold_3/train_Y.npy



Uploading: 100%|██████████| 469k/469k [00:00<00:00, 563kB/s] 

Upload successful: preprocessed_dataset/fold_3/train_Y.npy (458KB)
Starting upload for file preprocessed_dataset/fold_0/train_X.npy



Uploading: 100%|██████████| 64.7M/64.7M [00:32<00:00, 2.00MB/s]

Upload successful: preprocessed_dataset/fold_0/train_X.npy (62MB)
Starting upload for file preprocessed_dataset/fold_0/validation_Y.npy



Uploading: 100%|██████████| 118k/118k [00:00<00:00, 158kB/s] 

Upload successful: preprocessed_dataset/fold_0/validation_Y.npy (115KB)
Starting upload for file preprocessed_dataset/fold_0/validation_X.npy



Uploading: 100%|██████████| 16.2M/16.2M [00:10<00:00, 1.62MB/s]

Upload successful: preprocessed_dataset/fold_0/validation_X.npy (15MB)
Starting upload for file preprocessed_dataset/fold_0/train_Y.npy



Uploading: 100%|██████████| 469k/469k [00:00<00:00, 647kB/s] 

Upload successful: preprocessed_dataset/fold_0/train_Y.npy (458KB)





Your dataset has been created.
Files are being processed...
See at: https://www.kaggle.com/datasets/mauroabidalcarrer/prepocessed-cmi-2025
