# IDS_balance project: integration of `Boari21` dataset

This notebook integrates the metadata file and all data files of a dataset into a single dataframe and a single parquet file.

`Boari21` dataset: pandas DataFrame with 2_297_532 rows × 54 columns, 272 MB in memory, saved as a 174 MB parquet file.

> Boari D (2021) A dataset with ground reaction forces of human balance in Parkinson's disease (Version 6). figshare. https://doi.org/10.6084/m9.figshare.13530587.v6  
> de Oliveira CEN, Ribeiro de Souza C, Treza RC, Hondo SM, Los Angeles E, Bernardo C, Shida TKF, Dos Santos de Oliveira L, Novaes TM, de Campos DDSF, Gisoldi E, Carvalho MJ, Coelho DB (2022) A Public Data Set With Ground Reaction Forces of Human Balance in Individuals With Parkinson's Disease. Frontiers in neuroscience, 16, 865882. https://doi.org/10.3389/fnins.2022.865882

## Setup

In [1]:
import sys, os, datetime, glob
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm

print(f'Python {sys.version} on {sys.platform}',
      f' numpy {np.__version__}', f' pandas {pd.__version__}',
      datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), sep='\n')

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0] on linux
 numpy 2.2.1
 pandas 2.2.3
27/12/2024 18:25:33


## Dataset location

In [2]:
dataset_name = 'Boari21'
metadata_fname = 'PDPDSinfo.txt'
# Dataset folder: <parent of the working directory>/datasets/<dataset_name>/data
path2 = Path().resolve().parents[0] / 'datasets' / dataset_name / 'data'
if (path2 / metadata_fname).is_file():
    print(f'Dataset location: {path2}')
else:
    print('Dataset not found.')

Dataset location: /home/marcos/adrive/Python/projects/IDS_balance/datasets/Boari21/data


## Metadata

In [3]:
metadata = pd.read_csv(path2 / metadata_fname, sep='\t', header=0,
                       engine='c', encoding='latin_1')
metadata = metadata.iloc[:32, :39]
display(metadata)
print(f'Information from {len(metadata)} files successfully loaded (total of {len(pd.unique(metadata.ID))} subjects).')

Unnamed: 0,ID,Gender,Age,Height (cm),Weight (kg),BMI (kg/m2),Ortho-Prosthesis,Years of formal study,Disease duration (years),L-Dopa equivalent units (mgday-1),...,OFF - Hoehn & Yahr,OFF - MoCA,OFF - UPDRS-II,OFF - UPDRS-III,OFF - UPDRS-III - Rigidity,OFF - UPDRS-III - Gait,OFF - UPDRS-III - Bradykinesia,OFF - UPDRS-III - Dyskinesia,OFF - miniBESTest,OFF - FES-I
0,PDPDS01,F,53.0,170.0,62.55,21.64,No,16.0,4.0,275.0,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
1,PDPDS02,M,69.0,165.0,76.5,28.1,Corrective lens,24.0,1.0,770.0,...,2.0,24.0,1.0,17.0,4.0,0.0,1.0,0.0,20.0,19.0
2,PDPDS03,M,68.0,169.0,68.9,24.12,No,11.0,19.0,1664.0,...,3.0,26.0,10.0,48.0,13.0,2.0,4.0,0.0,19.0,47.0
3,PDPDS04,F,77.0,151.5,60.2,26.23,Corrective lens,4.0,15.0,100.0,...,2.0,15.0,6.0,30.0,0.0,0.0,1.0,0.0,26.0,19.0
4,PDPDS05,M,65.0,168.0,89.0,31.53,No,4.0,15.0,766.0,...,3.0,17.0,9.0,36.0,6.0,1.0,3.0,0.0,18.0,47.0
5,PDPDS06,F,44.0,157.0,53.3,21.62,No,16.0,14.0,665.0,...,2.0,27.0,11.0,16.0,4.0,2.0,1.0,0.0,29.0,31.0
6,PDPDS07,M,60.0,179.0,92.45,28.85,Corrective lens,12.0,5.0,750.0,...,3.0,28.0,6.0,30.0,7.0,2.0,2.0,0.0,21.0,47.0
7,PDPDS08,M,81.0,154.5,65.65,27.5,Corrective lens,16.0,4.0,866.0,...,3.0,20.0,7.0,47.0,8.0,2.0,1.0,0.0,20.0,38.0
8,PDPDS09,M,76.0,167.0,65.35,23.43,No,12.0,11.0,500.0,...,2.0,24.0,3.0,20.0,9.0,0.0,0.0,0.0,28.0,28.0
9,PDPDS10,M,73.0,168.0,79.0,27.99,Corrective lens,9.0,3.0,400.0,...,2.0,18.0,1.0,22.0,5.0,1.0,2.0,0.0,29.0,21.0


Information from 32 files successfully loaded (total of 32 subjects).


In [4]:
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 39 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   ID                                                                  32 non-null     object 
 1   Gender                                                              32 non-null     object 
 2   Age                                                                 32 non-null     float64
 3   Height (cm)                                                         32 non-null     float64
 4   Weight (kg)                                                         32 non-null     float64
 5   BMI (kg/m2)                                                         32 non-null     float64
 6   Ortho-Prosthesis                                                    32 non-null     object 
 7   Years of formal stu

### Set variable types to categorical

See https://pandas.pydata.org/docs/user_guide/categorical.html

In [5]:
metadata = metadata.astype('category', copy=True)
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 39 columns):
 #   Column                                                              Non-Null Count  Dtype   
---  ------                                                              --------------  -----   
 0   ID                                                                  32 non-null     category
 1   Gender                                                              32 non-null     category
 2   Age                                                                 32 non-null     category
 3   Height (cm)                                                         32 non-null     category
 4   Weight (kg)                                                         32 non-null     category
 5   BMI (kg/m2)                                                         32 non-null     category
 6   Ortho-Prosthesis                                                    32 non-null     category
 7   Years of f
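
The effect of the categorical conversion can be illustrated on a small, made-up frame (the `toy` data below are hypothetical and not taken from the dataset). This is only a sketch: it compares the memory used by a repeated string column before and after conversion, and shows that a numeric column stored as `category` has to be cast back to a numeric type before computing statistics such as the mean.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Boari21 dataset
toy = pd.DataFrame({'Medication': ['on', 'off'] * 5000,
                    'Age': [53.0, 69.0] * 5000})

# Memory used by the same column stored as strings and as category
as_object = toy['Medication'].memory_usage(deep=True)
as_category = toy['Medication'].astype('category').memory_usage(deep=True)
print(f'strings: {as_object} bytes, category: {as_category} bytes')

# A numeric column stored as category is cast back to float before arithmetic
age_cat = toy['Age'].astype('category')
mean_age = age_cat.astype(float).mean()
print(f'Mean age: {mean_age}')
```

The saving comes from storing each distinct value once and keeping only small integer codes per row, which matters here because every metadata value is repeated for all samples of a trial.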

## Integration

### Expand metadata to contain conditions of all trials

In [6]:
# Trial names are the data file names without the '.txt' extension
fnames = sorted(fname.stem for fname in path2.glob('PDPDS[0-9]*.txt'))
# e.g. 'PDPDS01_off_rs ec_1' -> Subject 1, ID 'PDPDS01', Medication 'off', Surface 'rs', Vision 'ec'
data = [[fname, int(fname[5:7]), *fname.replace(' ', '_').split('_')[:-1]] for fname in fnames]
df = pd.DataFrame(data=data, columns=['Trial', 'Subject', 'ID', 'Medication', 'Surface', 'Vision'])
# Repeat each subject's metadata for every trial of that subject
metadata = pd.merge(df, metadata, on='ID', how='inner')
metadata = metadata.astype('category', copy=True)
metadata

Unnamed: 0,Trial,Subject,ID,Medication,Surface,Vision,Gender,Age,Height (cm),Weight (kg),...,OFF - Hoehn & Yahr,OFF - MoCA,OFF - UPDRS-II,OFF - UPDRS-III,OFF - UPDRS-III - Rigidity,OFF - UPDRS-III - Gait,OFF - UPDRS-III - Bradykinesia,OFF - UPDRS-III - Dyskinesia,OFF - miniBESTest,OFF - FES-I
0,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
1,PDPDS01_off_rs ec_2,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
2,PDPDS01_off_rs ec_3,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
3,PDPDS01_off_rs eo_1,1,PDPDS01,off,rs,eo,F,53.0,170.0,62.55,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
4,PDPDS01_off_rs eo_2,1,PDPDS01,off,rs,eo,F,53.0,170.0,62.55,...,1.0,29.0,2.0,10.0,3.0,0.0,0.0,1.0,32.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
762,PDPDS32_on_us ec_2,32,PDPDS32,on,us,ec,M,66.0,168.0,60.00,...,3.0,27.0,8.0,45.0,6.0,1.0,2.0,0.0,24.0,24.0
763,PDPDS32_on_us ec_3,32,PDPDS32,on,us,ec,M,66.0,168.0,60.00,...,3.0,27.0,8.0,45.0,6.0,1.0,2.0,0.0,24.0,24.0
764,PDPDS32_on_us eo_1,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,3.0,27.0,8.0,45.0,6.0,1.0,2.0,0.0,24.0,24.0
765,PDPDS32_on_us eo_2,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,3.0,27.0,8.0,45.0,6.0,1.0,2.0,0.0,24.0,24.0
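
The file-name parsing used above can be followed step by step on a single trial name. The helper `parse_trial_name` is a hypothetical name introduced only for this illustration; it applies the same slicing and splitting as the list comprehension in the cell above.

```python
def parse_trial_name(fname):
    """Split a trial name such as 'PDPDS01_off_rs ec_1' into its parts."""
    # Characters 5 and 6 hold the two-digit subject number
    subject = int(fname[5:7])
    # Surface and vision are separated by a space, so it is replaced by '_'
    # before splitting; the last token (repetition number) is dropped
    subject_id, medication, surface, vision = fname.replace(' ', '_').split('_')[:-1]
    return [fname, subject, subject_id, medication, surface, vision]

parsed = parse_trial_name('PDPDS01_off_rs ec_1')
print(parsed)
# ['PDPDS01_off_rs ec_1', 1, 'PDPDS01', 'off', 'rs', 'ec']
```

This assumes every data file follows the `ID_medication_surface vision_repetition` pattern seen in the table above; a name with a different layout would make the unpacking fail.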


### Merge metadata and data files individually and then concatenate all

In [7]:
def merge_meta_data(metadata, trial):
    """Read the data file of one trial and attach that trial's metadata to every sample."""
    data = pd.read_csv(path2 / f'{trial}.txt', delimiter='\t',
                       engine='c', encoding='utf-8', float_precision='round_trip')
    # Constant key column used to join the samples with the single metadata row
    data['Trial'] = trial
    return pd.merge(metadata.query('Trial == @trial'), data, how='inner', on='Trial')

In [8]:
df_all = [merge_meta_data(metadata, trial) for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
df_all

100%|███████████████████████████████████████████████████████| 767/767 [00:12<00:00, 61.73it/s]


Unnamed: 0,Trial,Subject,ID,Medication,Surface,Vision,Gender,Age,Height (cm),Weight (kg),...,Time [s],GRFml [N],GRFap [N],GRFv [N],Mml [N.m],Map [N.m],Mv [N.m],Mfree [N.m],COPml [m],COPap [m]
0,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.01,-2.848395,1.196850,615.378774,4.717177,43.639269,0.077541,-0.184237,-0.001239,0.008021
1,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.02,-2.803073,1.157796,615.145598,4.702110,43.566825,0.119589,-0.223013,-0.001261,0.007929
2,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.03,-2.763248,1.121453,614.932638,4.694917,43.492688,0.156901,-0.257316,-0.001270,0.007831
3,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.04,-2.733660,1.090064,614.756335,4.700805,43.417227,0.185969,-0.283859,-0.001258,0.007728
4,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.05,-2.717139,1.064791,614.627814,4.721363,43.341939,0.205347,-0.301308,-0.001223,0.007619
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2297527,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.96,18.802437,-5.822033,592.582657,-21.152566,2.567898,1.704385,2.400851,0.000127,0.002844
2297528,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.97,18.770226,-5.852068,592.542057,-21.291976,2.472014,1.689730,2.388736,-0.000034,0.003090
2297529,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.98,18.726942,-5.884151,592.414532,-21.435022,2.374755,1.673765,2.375104,-0.000197,0.003348
2297530,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.99,18.674008,-5.915569,592.207677,-21.582002,2.276968,1.656281,2.359782,-0.000360,0.003617
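
How the merge repeats one metadata row across all samples of a trial can be seen on a minimal, made-up example. The frames `meta` and `data` and their values are hypothetical; only the column names and the `pd.merge` call mirror the function above.

```python
import pandas as pd

# Hypothetical metadata: one row per trial
meta = pd.DataFrame({'Trial': ['t1', 't2'], 'Medication': ['off', 'on']})

# Hypothetical data file of trial 't1': one row per time sample
data = pd.DataFrame({'Time [s]': [0.01, 0.02, 0.03],
                     'GRFv [N]': [615.4, 615.1, 614.9]})
data['Trial'] = 't1'  # constant key shared by all samples of the trial

# The single matching metadata row is repeated for each sample
merged = pd.merge(meta.query('Trial == "t1"'), data, how='inner', on='Trial')
print(merged)
```

The result has one row per time sample, with the metadata columns first and the signal columns after them, which is the layout of `df_all`.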


In [9]:
df_all.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2297532 entries, 0 to 2297531
Data columns (total 54 columns):
 #   Column                                                              Dtype   
---  ------                                                              -----   
 0   Trial                                                               category
 1   Subject                                                             category
 2   ID                                                                  category
 3   Medication                                                          category
 4   Surface                                                             category
 5   Vision                                                              category
 6   Gender                                                              category
 7   Age                                                                 category
 8   Height (cm)                                                   

## Save data to file and test it

Use the `fastparquet` engine to preserve the categorical data types; see https://www.practicaldatascience.org/html/parquet.html

In [10]:
df_all.to_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet', index=False)  # 820 MB
#df_all.to_csv(path2 / f'{dataset_name.lower()}.txt', sep='\t', float_format=None, index=False)  # interrupted after > 4 GB

In [11]:
df_all2 = pd.read_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet')
df_all2

Unnamed: 0,Trial,Subject,ID,Medication,Surface,Vision,Gender,Age,Height (cm),Weight (kg),...,Time [s],GRFml [N],GRFap [N],GRFv [N],Mml [N.m],Map [N.m],Mv [N.m],Mfree [N.m],COPml [m],COPap [m]
0,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.01,-2.848395,1.196850,615.378774,4.717177,43.639269,0.077541,-0.184237,-0.001239,0.008021
1,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.02,-2.803073,1.157796,615.145598,4.702110,43.566825,0.119589,-0.223013,-0.001261,0.007929
2,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.03,-2.763248,1.121453,614.932638,4.694917,43.492688,0.156901,-0.257316,-0.001270,0.007831
3,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.04,-2.733660,1.090064,614.756335,4.700805,43.417227,0.185969,-0.283859,-0.001258,0.007728
4,PDPDS01_off_rs ec_1,1,PDPDS01,off,rs,ec,F,53.0,170.0,62.55,...,0.05,-2.717139,1.064791,614.627814,4.721363,43.341939,0.205347,-0.301308,-0.001223,0.007619
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2297527,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.96,18.802437,-5.822033,592.582657,-21.152566,2.567898,1.704385,2.400851,0.000127,0.002844
2297528,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.97,18.770226,-5.852068,592.542057,-21.291976,2.472014,1.689730,2.388736,-0.000034,0.003090
2297529,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.98,18.726942,-5.884151,592.414532,-21.435022,2.374755,1.673765,2.375104,-0.000197,0.003348
2297530,PDPDS32_on_us eo_3,32,PDPDS32,on,us,eo,M,66.0,168.0,60.00,...,29.99,18.674008,-5.915569,592.207677,-21.582002,2.276968,1.656281,2.359782,-0.000360,0.003617


In [12]:
df_all2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2297532 entries, 0 to 2297531
Data columns (total 54 columns):
 #   Column                                                              Dtype   
---  ------                                                              -----   
 0   Trial                                                               category
 1   Subject                                                             category
 2   ID                                                                  category
 3   Medication                                                          category
 4   Surface                                                             category
 5   Vision                                                              category
 6   Gender                                                              category
 7   Age                                                                 category
 8   Height (cm)                                                   
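
Beyond inspecting the output by eye, the saved file could be tested programmatically by comparing the reloaded frame with the original. The sketch below is one possible way to do this and is not part of the original notebook: `frames_match` is a hypothetical helper built on `pd.testing.assert_frame_equal`, and a copy of a toy frame stands in for the frame read back from the parquet file.

```python
import pandas as pd

def frames_match(original, restored):
    """Return True if both frames have the same values and dtypes."""
    try:
        pd.testing.assert_frame_equal(original, restored)
    except AssertionError:
        return False
    return True

# Hypothetical toy frame; the copy stands in for a frame reloaded from disk
toy = pd.DataFrame({'Trial': pd.Categorical(['a', 'a', 'b']),
                    'GRFv [N]': [1.0, 2.0, 3.0]})
same = frames_match(toy, toy.copy())
# Losing the category dtype on reload would be detected as a mismatch
lost_dtype = frames_match(toy, toy.astype({'Trial': 'object'}))
print(same, lost_dtype)
```

In the notebook, a call such as `frames_match(df_all, df_all2)` would play the same role, under the assumption that both frames fit in memory at the same time.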