# IDS_balance project: integration of `Santos17` dataset

This notebook merges the metadata file and all data files of the dataset into a single pandas DataFrame and saves it as a single parquet file.

`Santos17` dataset: the resulting pandas DataFrame has 3_542_700 rows × 270 columns, takes 6.5 GB in memory, and is saved as a 5.4 GB parquet file.

> Santos DA, Fukuchi C, Fukuchi R, Duarte M (2017) A data set with kinematic and ground reaction forces of human balance (Version 1). figshare. https://doi.org/10.6084/m9.figshare.4525082.v1  
> dos Santos DA, Fukuchi CA, Fukuchi RK, Duarte M (2017) A data set with kinematic and ground reaction forces of human balance. PeerJ 5:e3626 https://doi.org/10.7717/peerj.3626.

## Setup

In [1]:
import sys, os, datetime
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm

print(f'Python {sys.version} on {sys.platform}',
      f' numpy {np.__version__}', f' pandas {pd.__version__}',
      datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), sep='\n')

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0] on linux
 numpy 2.2.1
 pandas 2.2.3
27/12/2024 00:46:01


## Dataset location

In [2]:
dataset_name = 'Santos17'
metadata_fname = 'PDSinfo.txt'
path2 = Path().resolve().parents[0] / 'datasets' / dataset_name / 'data'
if os.path.isfile(path2 / metadata_fname):
    print(f'Dataset location: {path2}')
else:
    print('Dataset not found.')

Dataset location: /home/marcos/adrive/Python/projects/IDS_balance/datasets/dosSantos17/data


## Metadata

In [3]:
metadata = pd.read_csv(path2 / metadata_fname, sep='\t', header=0,
                       engine='c', encoding='utf-8')  # , float_precision='round_trip'
display(metadata)
print(f'Information from {len(metadata)} files successfully loaded (total of {len(pd.unique(metadata.Subject))} subjects).')

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583,PDS49CR2,49,Closed,Rigid,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
584,PDS49CR3,49,Closed,Rigid,3,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
585,PDS49CF1,49,Closed,Foam,1,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
586,PDS49CF2,49,Closed,Foam,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


Information from 588 files successfully loaded (total of 49 subjects).


In [4]:
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 588 entries, 0 to 587
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Trial              588 non-null    object 
 1   Subject            588 non-null    int64  
 2   Vision             588 non-null    object 
 3   Surface            588 non-null    object 
 4   Rep                588 non-null    int64  
 5   Age                588 non-null    float64
 6   AgeGroup           588 non-null    object 
 7   Gender             588 non-null    object 
 8   Height             588 non-null    float64
 9   Mass               588 non-null    float64
 10  BMI                588 non-null    float64
 11  FootLen            588 non-null    float64
 12  DominantLeg        588 non-null    object 
 13  Nationality        588 non-null    object 
 14  SkinColor          588 non-null    object 
 15  Ystudy             588 non-null    int64  
 16  Footwear           588 non

### Set variable types to categorical

See https://pandas.pydata.org/docs/user_guide/categorical.html

In [5]:
metadata = metadata.astype('category', copy=True)
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 588 entries, 0 to 587
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              588 non-null    category
 1   Subject            588 non-null    category
 2   Vision             588 non-null    category
 3   Surface            588 non-null    category
 4   Rep                588 non-null    category
 5   Age                588 non-null    category
 6   AgeGroup           588 non-null    category
 7   Gender             588 non-null    category
 8   Height             588 non-null    category
 9   Mass               588 non-null    category
 10  BMI                588 non-null    category
 11  FootLen            588 non-null    category
 12  DominantLeg        588 non-null    category
 13  Nationality        588 non-null    category
 14  SkinColor          588 non-null    category
 15  Ystudy             588 non-null    category
 16  Footwear
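
The effect of the categorical dtype on memory can be checked on a toy frame. The sketch below is illustrative only: the frame `toy` and its values are hypothetical, and, unlike the cell above, it converts only the string column and leaves the numeric column as `float64` so that arithmetic on it remains possible.

```python
import pandas as pd

# Toy frame (hypothetical values): a repeated string column and a numeric column
toy = pd.DataFrame({'Vision': ['Open', 'Closed'] * 1000,
                    'Age': [25.67, 64.92] * 1000})

# Memory used by the string column before conversion
mem_object = toy['Vision'].memory_usage(deep=True)

# Convert only the string column to category
toy_cat = toy.astype({'Vision': 'category'})
mem_category = toy_cat['Vision'].memory_usage(deep=True)

print(f'before: {mem_object} bytes, as category: {mem_category} bytes')

# The numeric column was left untouched, so arithmetic still works
mean_age = toy_cat['Age'].mean()
```

A categorical column stores each distinct value once plus one small integer code per row, which is why columns with few distinct values repeated over many rows benefit the most.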

### Add static trial names to the metadata

In [6]:
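subjects = pd.unique(metadata.Subject).tolist()
# For each subject, copy the row of the last trial to a new row placed right
# after it (fractional index label) and relabel it as the subject's static trial
for s in subjects[::-1]:
    idx = metadata.query('Subject == @s', engine='python').index[-1]
    metadata.loc[idx + .5] = metadata.loc[idx].values
    metadata.loc[idx + .5, ['Trial', 'Vision', 'Surface', 'Rep']] = ['PDS' + str(s).zfill(2) + 'static', None, None, None]
# Sort so that each static trial follows the subject's other trials, then renumber
metadata.sort_index(inplace=True)
metadata.reset_index(drop=True, inplace=True)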

metadata = metadata.astype('category', copy=True)
display(metadata)
display(metadata.info(memory_usage='deep'))

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
632,PDS49CR3,49,Closed,Rigid,3.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
633,PDS49CF1,49,Closed,Foam,1.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
634,PDS49CF2,49,Closed,Foam,2.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
635,PDS49CF3,49,Closed,Foam,3.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              637 non-null    category
 1   Subject            637 non-null    category
 2   Vision             588 non-null    category
 3   Surface            588 non-null    category
 4   Rep                588 non-null    category
 5   Age                637 non-null    category
 6   AgeGroup           637 non-null    category
 7   Gender             637 non-null    category
 8   Height             637 non-null    category
 9   Mass               637 non-null    category
 10  BMI                637 non-null    category
 11  FootLen            637 non-null    category
 12  DominantLeg        637 non-null    category
 13  Nationality        637 non-null    category
 14  SkinColor          637 non-null    category
 15  Ystudy             637 non-null    category
 16  Footwear

None
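
The fractional-index idea used above (give the new row a label halfway between two existing labels, then sort) can be tried on a toy frame. This is a sketch under assumptions, not the notebook's exact code: the frame `toy` is hypothetical, and the rows are built with `drop_duplicates` and `pd.concat` instead of row-by-row assignment with `.loc`.

```python
import pandas as pd

# Toy metadata (hypothetical): two subjects with two trials each
toy = pd.DataFrame({'Trial': ['PDS01OR1', 'PDS01OR2', 'PDS02OR1', 'PDS02OR2'],
                    'Subject': [1, 1, 2, 2],
                    'Rep': [1, 2, 1, 2]})

# One row per subject, copied from that subject's last trial
static = toy.drop_duplicates('Subject', keep='last').copy()
static['Trial'] = 'PDS' + static['Subject'].astype(str).str.zfill(2) + 'static'
static['Rep'] = float('nan')        # trial-specific field is undefined for a static trial
static.index = static.index + 0.5   # fractional label sorts right after the last trial

# Interleave the new rows and restore a sequential integer index
toy_static = pd.concat([toy, static]).sort_index().reset_index(drop=True)
print(toy_static)
```

Because the `Rep` column now contains missing values, it is stored as float, which matches the `1.0`, `2.0`, `3.0` values shown in the table above.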

## Integration

### Merge metadata and data files individually and then concatenate all

In [7]:
def merge_meta_and_data(trial, metadata=metadata, kinds=('',)):
    """Merge the metadata row of `trial` with the samples of its data file(s)."""
    if trial[5:] != 'static':
        # Balance trial: one data file per kind (e.g., 'grf', 'ang', 'mkr')
        for k, kind in enumerate(kinds):
            df = pd.read_csv(path2 / f'{trial}{kind}.txt', delimiter='\t', engine='c')  #.iloc[:2, :]
            if k == 0:
                # Repeat the trial metadata on every sample of the first file
                df['Trial'] = trial
                meta_and_data = pd.merge(metadata.query('Trial == @trial'), df, how='inner', on='Trial')
            else:
                # Other files: drop their 'Time' column and append side by side
                df.drop(columns='Time', inplace=True)
                meta_and_data = pd.concat([meta_and_data, df], axis=1)
    else:
        # Static trial: read the file named after the trial
        df = pd.read_csv(path2 / f'{trial}.txt', delimiter='\t', engine='c')  # .iloc[:2, :]
        df['Trial'] = trial
        meta_and_data = pd.merge(metadata.query('Trial == @trial'), df, how='inner', on='Trial')

    return meta_and_data
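
The two steps inside the function (an inner merge on `Trial` that repeats the metadata on every sample, then a column-wise concatenation of the other data kinds) can be seen on in-memory data. This is a minimal sketch with hypothetical names and values (`meta`, `grf_txt`, `ang_txt`, columns `Fz` and `Ankle`); it does not read the dataset files.

```python
import io
import pandas as pd

# Hypothetical metadata: one row per trial
meta = pd.DataFrame({'Trial': ['T1', 'T2'],
                     'Subject': [1, 1],
                     'Vision': ['Open', 'Closed']})

# Hypothetical tab-separated data files for trial 'T1' (three samples each)
grf_txt = 'Time\tFz\n0.00\t700.1\n0.01\t700.3\n0.02\t699.8\n'
ang_txt = 'Time\tAnkle\n0.00\t1.5\n0.01\t1.6\n0.02\t1.4\n'

# First file: tag every sample with the trial name and merge with the metadata
grf = pd.read_csv(io.StringIO(grf_txt), delimiter='\t')
grf['Trial'] = 'T1'
merged = pd.merge(meta.query('Trial == "T1"'), grf, how='inner', on='Trial')

# Second file: drop its 'Time' column and append the rest side by side
ang = pd.read_csv(io.StringIO(ang_txt), delimiter='\t').drop(columns='Time')
merged = pd.concat([merged, ang], axis=1)

print(merged)
```

The single metadata row of `T1` is repeated on all three samples, so the result has three rows with the columns `Trial`, `Subject`, `Vision`, `Time`, `Fz`, and `Ankle`.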

In [8]:
df_all = [merge_meta_and_data(trial, metadata, kinds=('grf', 'ang', 'mkr'))
          for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, axis=0, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
display(df_all)
display(df_all.info(verbose=True, memory_usage='deep'))

100%|███████████████████████████████████████████████████████| 637/637 [01:23<00:00,  7.66it/s]


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,R.Knee.Medial_Z,L.Knee.Medial_X,L.Knee.Medial_Y,L.Knee.Medial_Z,R.Ankle.Medial_X,R.Ankle.Medial_Y,R.Ankle.Medial_Z,L.Ankle.Medial_X,L.Ankle.Medial_Y,L.Ankle.Medial_Z
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
1,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
2,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
3,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
4,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3542695,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194277,-0.386336,0.424990,-1.275723,-0.385735,0.066252,-1.139551,-0.385341,0.068598,-1.317972
3542696,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194275,-0.386327,0.424990,-1.275722,-0.385737,0.066251,-1.139551,-0.385339,0.068597,-1.317970
3542697,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194271,-0.386312,0.424988,-1.275720,-0.385738,0.066249,-1.139549,-0.385336,0.068594,-1.317969
3542698,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194266,-0.386292,0.424984,-1.275716,-0.385739,0.066245,-1.139545,-0.385333,0.068592,-1.317969


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3542700 entries, 0 to 3542699
Data columns (total 270 columns):
 #    Column             Dtype   
---   ------             -----   
 0    Trial              category
 1    Subject            category
 2    Vision             category
 3    Surface            category
 4    Rep                category
 5    Age                category
 6    AgeGroup           category
 7    Gender             category
 8    Height             category
 9    Mass               category
 10   BMI                category
 11   FootLen            category
 12   DominantLeg        category
 13   Nationality        category
 14   SkinColor          category
 15   Ystudy             category
 16   Footwear           category
 17   Illness            category
 18   Illness2           category
 19   Nmedication        category
 20   Medication         category
 21   Ortho-Prosthesis   category
 22   Ortho-Prosthesis2  category
 23   Disability         category
 2

None
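
The missing values in the last columns of the first rows above are what `pd.concat` produces when the frames being stacked do not share all columns: the result has the union of the columns, and each frame gets `NaN` in the columns it lacks. The sketch below shows this with hypothetical frames; that the static trial files contain columns absent from the balance trial files is an assumption based on the displayed output.

```python
import pandas as pd

# Hypothetical balance trial (force data only) and static trial (marker data only)
balance = pd.DataFrame({'Trial': ['T1', 'T1'], 'Fz': [700.1, 700.3]})
static = pd.DataFrame({'Trial': ['T1static'], 'Knee_Z': [-1.19]})

# Row-wise concatenation keeps the union of the columns
both = pd.concat([balance, static], axis=0, ignore_index=True)
print(both)

# Each frame has NaN in the columns that only the other frame provides
missing = both.isna()
```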

## Save data to file and test it

Use the `fastparquet` engine to preserve the categorical data types; see https://www.practicaldatascience.org/html/parquet.html

In [9]:
df_all.to_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet', index=False)  # 5.4 GB

In [10]:
df_all2 = pd.read_parquet(path2 / f'{dataset_name.lower()}.parquet', engine='fastparquet')
df_all2

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,R.Knee.Medial_Z,L.Knee.Medial_X,L.Knee.Medial_Y,L.Knee.Medial_Z,R.Ankle.Medial_X,R.Ankle.Medial_Y,R.Ankle.Medial_Z,L.Ankle.Medial_X,L.Ankle.Medial_Y,L.Ankle.Medial_Z
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
1,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
2,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
3,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
4,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3542695,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194277,-0.386336,0.424990,-1.275723,-0.385735,0.066252,-1.139551,-0.385341,0.068598,-1.317972
3542696,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194275,-0.386327,0.424990,-1.275722,-0.385737,0.066251,-1.139551,-0.385339,0.068597,-1.317970
3542697,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194271,-0.386312,0.424988,-1.275720,-0.385738,0.066249,-1.139549,-0.385336,0.068594,-1.317969
3542698,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194266,-0.386292,0.424984,-1.275716,-0.385739,0.066245,-1.139545,-0.385333,0.068592,-1.317969


In [11]:
df_all2.info(verbose=True, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3542700 entries, 0 to 3542699
Data columns (total 270 columns):
 #    Column             Dtype   
---   ------             -----   
 0    Trial              category
 1    Subject            category
 2    Vision             category
 3    Surface            category
 4    Rep                category
 5    Age                category
 6    AgeGroup           category
 7    Gender             category
 8    Height             category
 9    Mass               category
 10   BMI                category
 11   FootLen            category
 12   DominantLeg        category
 13   Nationality        category
 14   SkinColor          category
 15   Ystudy             category
 16   Footwear           category
 17   Illness            category
 18   Illness2           category
 19   Nmedication        category
 20   Medication         category
 21   Ortho-Prosthesis   category
 22   Ortho-Prosthesis2  category
 23   Disability         category
 2
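
Comparing the two `info()` listings by eye is tedious with 270 columns. A programmatic check could look like the sketch below; `dtype_mismatches` is a hypothetical helper, not part of pandas, and the toy frames only simulate a reload in which a categorical column came back as `object`.

```python
import pandas as pd

def dtype_mismatches(original, reloaded):
    """Return the columns of `original` that are missing or have another dtype in `reloaded`."""
    return [col for col in original.columns
            if col not in reloaded.columns
            or original[col].dtype != reloaded[col].dtype]

# Toy frame (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({'Vision': pd.Categorical(['Open', 'Closed', 'Open']),
                    'Fz': [700.1, 700.3, 699.8]})

# Simulate a reload that lost the categorical dtype
lossy = toy.astype({'Vision': object})

print(dtype_mismatches(toy, toy.copy()))  # []
print(dtype_mismatches(toy, lossy))       # ['Vision']
```

In this notebook the call would be `dtype_mismatches(df_all, df_all2)`; for a stricter test that also compares the values, `pd.testing.assert_frame_equal(df_all, df_all2)` raises an error on the first difference.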