# IDS_balance project: integration of `Santos16` dataset

This notebook integrates the metadata file and all data files of a dataset into a single dataframe and a single parquet file.

The integrated `Santos16` dataset is a pandas DataFrame with 11,580,000 rows × 73 columns that occupies about 1.6 GB in memory and about 800 MB as a parquet file.

> Santos DA, Duarte M. (2016) A public data set of human balance evaluations (Version 2). figshare. https://doi.org/10.6084/m9.figshare.3394432.v2
>
> Santos DA, Duarte M. (2016) A public data set of human balance evaluations. PeerJ 4:e2648. https://doi.org/10.7717/peerj.2648
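
As a quick plausibility check of the dimensions quoted above, the row and column counts can be reproduced from numbers that appear in the outputs of this notebook. The figure of 6000 samples per trial is an assumption inferred from the `Time[s]` column (steps of 0.01 s up to about 60 s); it is not stated in the metadata.

```python
# Values read from the outputs shown in this notebook.
n_trials = 1930      # rows in the metadata file (one per data file)
n_meta_cols = 64     # metadata columns, including 'Trial'
n_data_cols = 9      # Time, Fx, Fy, Fz, Mx, My, Mz, COPx, COPy

# Assumption: every trial has 6000 samples (inferred from Time[s], see above).
samples_per_trial = 6000

n_rows = n_trials * samples_per_trial
# 'Trial' is the merge key, so it is counted only once (on the metadata side).
n_cols = n_meta_cols + n_data_cols
print(f'{n_rows:_} rows x {n_cols} columns')
```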

## Setup

In [1]:
import sys, os, datetime
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm

print(f'Python {sys.version} on {sys.platform}',
      f' numpy {np.__version__}', f' pandas {pd.__version__}',
      datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), sep='\n')

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0] on linux
 numpy 2.2.1
 pandas 2.2.3
27/12/2024 00:44:29


## Dataset location

In [2]:
dataset_name = 'Santos16'
metadata_fname = 'BDSinfo.txt'
path2 = Path().resolve().parents[0] / 'datasets' / dataset_name / 'data'
if os.path.isfile(path2 / metadata_fname):
    print(f'Dataset location: {path2}')
else:
    print('Dataset not found.')

Dataset location: /home/marcos/adrive/Python/projects/IDS_balance/datasets/Santos16/data


## Metadata

In [12]:
metadata = pd.read_csv(path2 / metadata_fname, sep='\t', header=0,
                       engine='c', encoding='utf-8')  # , float_precision='round_trip'
display(metadata)
n_subjects = metadata["Subject"].nunique()
print(f'Information from {len(metadata)} files successfully loaded (total of {n_subjects} subjects).')

|  | Trial | Subject | Vision | Surface | Age | AgeGroup | Gender | Height | Weight | BMI | ... | Best_7 | Best_8 | Best_9 | Best_10 | Best_11 | Best_12 | Best_13 | Best_14 | Best_T | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 25 | 2015-10-08 08:30:00.000 |
| 1 | BDS00002 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 25 | 2015-10-08 08:30:00.000 |
| 2 | BDS00003 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 25 | 2015-10-08 08:30:00.000 |
| 3 | BDS00004 | 1 | Closed | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 25 | 2015-10-08 08:30:00.000 |
| 4 | BDS00005 | 1 | Closed | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 25 | 2015-10-08 08:30:00.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1925 | BDS01952 | 163 | Open | Firm | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 26 | 2016-03-11 10:49:57.538 |
| 1926 | BDS01953 | 163 | Open | Firm | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 26 | 2016-03-11 10:49:57.538 |
| 1927 | BDS01954 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 26 | 2016-03-11 10:49:57.538 |
| 1928 | BDS01955 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 26 | 2016-03-11 10:49:57.538 |


Information from 1930 files successfully loaded (total of 163 subjects).
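
The counts above (1930 files for 163 subjects, with trial names running up to `BDS01956`) suggest that not every subject has the same number of trials. A minimal sketch of how such subjects could be listed is shown below. It runs on a toy table with hypothetical values, and `expected_trials` is an illustrative parameter, not a number taken from `BDSinfo.txt`.

```python
import pandas as pd

# Toy stand-in for the metadata table (hypothetical values).
toy_meta = pd.DataFrame({
    'Trial': ['T01', 'T02', 'T03', 'T04', 'T05'],
    'Subject': [1, 1, 1, 2, 2],
})

expected_trials = 3  # hypothetical target number of trials per subject

# Count the trials of each subject and keep those below the target.
trials_per_subject = toy_meta.groupby('Subject')['Trial'].count()
incomplete = trials_per_subject[trials_per_subject < expected_trials]
print(incomplete)
```

On the real table, the same two lines applied to `metadata` (before the conversion to categorical types) would list the subjects with fewer trials than expected.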


In [4]:
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 64 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Trial              1930 non-null   object 
 1   Subject            1930 non-null   int64  
 2   Vision             1930 non-null   object 
 3   Surface            1930 non-null   object 
 4   Age                1930 non-null   float64
 5   AgeGroup           1930 non-null   object 
 6   Gender             1930 non-null   object 
 7   Height             1930 non-null   float64
 8   Weight             1930 non-null   float64
 9   BMI                1930 non-null   float64
 10  FootLen            1906 non-null   float64
 11  Nationality        1930 non-null   object 
 12  SkinColor          1930 non-null   object 
 13  Ystudy             1930 non-null   int64  
 14  Footwear           1930 non-null   object 
 15  Illness            1930 non-null   object 
 16  Illness2           1930 

### Set variable types to categorical

See https://pandas.pydata.org/docs/user_guide/categorical.html

In [5]:
metadata = metadata.astype('category', copy=True)
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 64 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              1930 non-null   category
 1   Subject            1930 non-null   category
 2   Vision             1930 non-null   category
 3   Surface            1930 non-null   category
 4   Age                1930 non-null   category
 5   AgeGroup           1930 non-null   category
 6   Gender             1930 non-null   category
 7   Height             1930 non-null   category
 8   Weight             1930 non-null   category
 9   BMI                1930 non-null   category
 10  FootLen            1906 non-null   category
 11  Nationality        1930 non-null   category
 12  SkinColor          1930 non-null   category
 13  Ystudy             1930 non-null   category
 14  Footwear           1930 non-null   category
 15  Illness            1930 non-null   category
 16  Illnes
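
The motivation for the categorical type can be illustrated on a toy column: when a few distinct values repeat many times, as the metadata does once it is replicated over every time sample, storing integer codes plus a small lookup table usually needs far less memory than storing every value. The sketch below uses hypothetical data. It also shows one trade-off of converting numeric columns: a categorical column does not support numeric reductions directly, so it has to be cast back first.

```python
import pandas as pd

# Toy column with two distinct labels repeated many times (hypothetical data).
surface = pd.Series(['Firm', 'Foam'] * 50_000)

mem_plain = surface.memory_usage(deep=True)
mem_category = surface.astype('category').memory_usage(deep=True)
print(f'plain: {mem_plain} bytes, category: {mem_category} bytes')

# A numeric column stored as category must be cast back before arithmetic.
age = pd.Series([30.0, 30.0, 60.0]).astype('category')
mean_age = age.astype(float).mean()
print(mean_age)
```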

## Integration

### Merge metadata and data files individually and then concatenate all

In [6]:
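def merge_meta_data(metadata, trial):
    """Return the data of one trial merged with its row of metadata."""
    # Read the tab-separated data file of this trial (path2 is the dataset folder set above)
    data = pd.read_csv(path2 / f'{trial}.txt', delimiter='\t',
                       engine='c', encoding='utf-8', float_precision='round_trip')
    # Add the merge key so every sample is matched to the trial's metadata row
    data['Trial'] = trial
    return pd.merge(metadata.query('Trial == @trial'), data, how='inner', on='Trial')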
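
To make the effect of this per-trial merge concrete, here is a small self-contained sketch on toy tables. All values and the helper name `merge_one` are hypothetical, and the data tables are held in a dictionary instead of being read from files. The single metadata row of each trial is repeated for every time sample of that trial.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): one metadata row per trial...
toy_meta = pd.DataFrame({'Trial': ['A', 'B'], 'Vision': ['Open', 'Closed']})
# ...and one data table per trial, with a few time samples each.
toy_data = {
    'A': pd.DataFrame({'Time[s]': [0.01, 0.02, 0.03], 'Fz[N]': [539.1, 538.6, 538.2]}),
    'B': pd.DataFrame({'Time[s]': [0.01, 0.02], 'Fz[N]': [736.1, 736.3]}),
}

def merge_one(meta, trial):
    """Merge the toy data of one trial with its metadata row."""
    data = toy_data[trial].copy()
    data['Trial'] = trial  # merge key
    return pd.merge(meta[meta['Trial'] == trial], data, how='inner', on='Trial')

# Merge trial by trial, then stack everything into one long table.
merged = pd.concat([merge_one(toy_meta, t) for t in toy_meta['Trial']],
                   ignore_index=True)
print(merged)
```

The result has one row per time sample (five here) and the metadata columns come first, followed by the data columns.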

In [7]:
df_all = [merge_meta_data(metadata, trial) for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
df_all

100%|█████████████████████████████████████████████████████| 1930/1930 [00:41<00:00, 46.53it/s]


|  | Trial | Subject | Vision | Surface | Age | AgeGroup | Gender | Height | Weight | BMI | ... | Date | Time[s] | Fx[N] | Fy[N] | Fz[N] | Mx[Nm] | My[Nm] | Mz[Nm] | COPx[cm] | COPy[cm] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.01 | -1.633567 | -3.739135 | 539.066061 | 5.383505 | 43.064851 | -0.570876 | -7.988789 | 0.998673 |
| 1 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.02 | -1.631645 | -3.755459 | 538.591198 | 5.373275 | 43.020015 | -0.575349 | -7.987508 | 0.997654 |
| 2 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.03 | -1.628593 | -3.762146 | 538.207318 | 5.367719 | 42.974805 | -0.578670 | -7.984805 | 0.997333 |
| 3 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.04 | -1.623200 | -3.753010 | 537.977214 | 5.369522 | 42.929275 | -0.580162 | -7.979757 | 0.998095 |
| 4 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.05 | -1.613998 | -3.727238 | 537.915946 | 5.378311 | 42.883767 | -0.579954 | -7.972206 | 0.999842 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11579995 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.96 | 7.559660 | -1.264583 | 736.066905 | -5.914619 | -17.292192 | -0.201874 | 2.349269 | -0.803544 |
| 11579996 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.97 | 7.408137 | -1.169381 | 736.342291 | -5.750486 | -17.088347 | -0.225570 | 2.320707 | -0.780953 |
| 11579997 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.98 | 6.770478 | -0.808862 | 738.143537 | -5.428087 | -16.605555 | -0.279714 | 2.249638 | -0.735370 |
| 11579998 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.99 | 5.685936 | -0.229957 | 741.306461 | -4.964806 | -15.876418 | -0.360510 | 2.141681 | -0.669737 |


In [8]:
df_all.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11580000 entries, 0 to 11579999
Data columns (total 73 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Age                category
 5   AgeGroup           category
 6   Gender             category
 7   Height             category
 8   Weight             category
 9   BMI                category
 10  FootLen            category
 11  Nationality        category
 12  SkinColor          category
 13  Ystudy             category
 14  Footwear           category
 15  Illness            category
 16  Illness2           category
 17  Nmedication        category
 18  Medication         category
 19  Ortho-Prosthesis   category
 20  Ortho-Prosthesis2  category
 21  Disability         category
 22  Disability2        category
 23  Falls12m           category
 24  FES_1              cat

## Save data to file and test it

Use the 'fastparquet' engine to preserve the categorical data types; see https://www.practicaldatascience.org/html/parquet.html

In [9]:
df_all.to_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet', index=False)  # 820 MB
#df_all.to_csv(path2 / f'{dataset_name.lower()}.txt', sep='\t', float_format=None, index=False)  # interrupted after > 4 GB
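
Since the point of the chosen engine is to keep the categorical types, the round trip can also be checked programmatically instead of by eye. The helper below is a hypothetical sketch (the name `dtype_mismatches` is not part of pandas or of this project). It is demonstrated on a toy frame and on a deliberately degraded copy, so no parquet file is written here.

```python
import pandas as pd

def dtype_mismatches(original, restored):
    """Return {column: (dtype_before, dtype_after)} for columns whose dtype changed."""
    return {col: (original[col].dtype, restored[col].dtype)
            for col in original.columns
            if original[col].dtype != restored[col].dtype}

# Toy frame with one categorical and one float column (hypothetical values).
toy = pd.DataFrame({'Vision': pd.Categorical(['Open', 'Closed', 'Open']),
                    'Fz[N]': [539.1, 538.6, 538.2]})

# Simulate a reader that returns plain Python objects instead of categories.
degraded = toy.astype({'Vision': object})

print(dtype_mismatches(toy, toy.copy()))  # identical copy
print(dtype_mismatches(toy, degraded))    # 'Vision' lost its category dtype
```

Applied to this notebook, the call would be `dtype_mismatches(df_all, df_all2)` once the parquet file has been read back; an empty dictionary would indicate that every column kept its type.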

In [10]:
df_all2 = pd.read_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet')
df_all2

|  | Trial | Subject | Vision | Surface | Age | AgeGroup | Gender | Height | Weight | BMI | ... | Date | Time[s] | Fx[N] | Fy[N] | Fz[N] | Mx[Nm] | My[Nm] | Mz[Nm] | COPx[cm] | COPy[cm] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.01 | -1.633567 | -3.739135 | 539.066061 | 5.383505 | 43.064851 | -0.570876 | -7.988789 | 0.998673 |
| 1 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.02 | -1.631645 | -3.755459 | 538.591198 | 5.373275 | 43.020015 | -0.575349 | -7.987508 | 0.997654 |
| 2 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.03 | -1.628593 | -3.762146 | 538.207318 | 5.367719 | 42.974805 | -0.578670 | -7.984805 | 0.997333 |
| 3 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.04 | -1.623200 | -3.753010 | 537.977214 | 5.369522 | 42.929275 | -0.580162 | -7.979757 | 0.998095 |
| 4 | BDS00001 | 1 | Open | Firm | 33.000000 | Young | F | 157.5 | 54.2 | 21.849332 | ... | 2015-10-08 08:30:00.000 | 0.05 | -1.613998 | -3.727238 | 537.915946 | 5.378311 | 42.883767 | -0.579954 | -7.972206 | 0.999842 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11579995 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.96 | 7.559660 | -1.264583 | 736.066905 | -5.914619 | -17.292192 | -0.201874 | 2.349269 | -0.803544 |
| 11579996 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.97 | 7.408137 | -1.169381 | 736.342291 | -5.750486 | -17.088347 | -0.225570 | 2.320707 | -0.780953 |
| 11579997 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.98 | 6.770478 | -0.808862 | 738.143537 | -5.428087 | -16.605555 | -0.279714 | 2.249638 | -0.735370 |
| 11579998 | BDS01956 | 163 | Closed | Foam | 25.416667 | Young | M | 172.0 | 74.6 | 25.216333 | ... | 2016-03-11 10:49:57.538 | 59.99 | 5.685936 | -0.229957 | 741.306461 | -4.964806 | -15.876418 | -0.360510 | 2.141681 | -0.669737 |


In [11]:
df_all2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11580000 entries, 0 to 11579999
Data columns (total 73 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Age                category
 5   AgeGroup           category
 6   Gender             category
 7   Height             category
 8   Weight             category
 9   BMI                category
 10  FootLen            category
 11  Nationality        category
 12  SkinColor          category
 13  Ystudy             category
 14  Footwear           category
 15  Illness            category
 16  Illness2           category
 17  Nmedication        category
 18  Medication         category
 19  Ortho-Prosthesis   category
 20  Ortho-Prosthesis2  category
 21  Disability         category
 22  Disability2        category
 23  Falls12m           category
 24  FES_1              cat