# IDS_balance project

> **file conversion: metadata and individual files to a single pandas dataframe**

This notebook demonstrates how to convert the metadata file and all data files of a dataset to a single dataframe.  
The following datasets have been converted in this way:  
 - Santos16: 11_580_000 rows × 73 columns; 1.6 GB in memory; 800 MB parquet file
 - dosSantos17: 3_542_700 rows × 270 columns; 6.5 GB in memory; 5.4 GB parquet file

## Setup

In [1]:
import sys, os, glob, datetime
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm import tqdm

print(f'Python {sys.version} on {sys.platform}',
      f' numpy {np.__version__}', f' pandas {pd.__version__}',
      datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S"), sep='\n')

Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)] on win32
 numpy 1.26.2
 pandas 2.1.3
13/04/2024 10:44:09


## Santos16 dataset

### Dataset location

In [2]:
# LOCAL
path2 = Path(f'.{os.sep}..{os.sep}Santos16{os.sep}data')
print(f'Dataset location: {path2}')

Dataset location: ../Santos16/data
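
The paths in this notebook are built with f-strings and `os.sep`. A minimal sketch of the same layout with `pathlib`'s `/` operator follows; the names `data_dir`, `metadata_file`, and `trial_file` are illustrative and not part of the notebook. This form also avoids reusing the same quote character inside an f-string, which is a `SyntaxError` before Python 3.12.

```python
from pathlib import Path

# Illustrative names; the directory layout mirrors the location printed above
data_dir = Path('..') / 'Santos16' / 'data'
metadata_file = data_dir / 'BDSinfo.txt'

trial = 'BDS00001'
trial_file = data_dir / f'{trial}.txt'  # no nested quotes needed

print(metadata_file)
print(trial_file)
```

`pd.read_csv` accepts `Path` objects directly, so no conversion to a string is needed.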


### Load metadata

In [3]:
metadata = pd.read_csv(path2 / 'BDSinfo.txt', sep='\t', header=0,
                       engine='c', encoding='utf-8')
print('Information from %s files successfully loaded (total of %s subjects).'
      %(len(metadata), len(pd.unique(metadata.Subject))))
display(metadata)

Information from 1930 files successfully loaded (total of 163 subjects).


Unnamed: 0,Trial,Subject,Vision,Surface,Age,AgeGroup,Gender,Height,Weight,BMI,...,Best_7,Best_8,Best_9,Best_10,Best_11,Best_12,Best_13,Best_14,Best_T,Date
0,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2,2,2,2,2,2,2,1,25,2015-10-08 08:30:00.000
1,BDS00002,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2,2,2,2,2,2,2,1,25,2015-10-08 08:30:00.000
2,BDS00003,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2,2,2,2,2,2,2,1,25,2015-10-08 08:30:00.000
3,BDS00004,1,Closed,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2,2,2,2,2,2,2,1,25,2015-10-08 08:30:00.000
4,BDS00005,1,Closed,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2,2,2,2,2,2,2,1,25,2015-10-08 08:30:00.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1925,BDS01952,163,Open,Firm,25.416667,Young,M,172.0,74.6,25.216333,...,2,2,2,2,2,2,2,2,26,2016-03-11 10:49:57.538
1926,BDS01953,163,Open,Firm,25.416667,Young,M,172.0,74.6,25.216333,...,2,2,2,2,2,2,2,2,26,2016-03-11 10:49:57.538
1927,BDS01954,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2,2,2,2,2,2,2,2,26,2016-03-11 10:49:57.538
1928,BDS01955,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2,2,2,2,2,2,2,2,26,2016-03-11 10:49:57.538


In [4]:
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 64 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Trial              1930 non-null   object 
 1   Subject            1930 non-null   int64  
 2   Vision             1930 non-null   object 
 3   Surface            1930 non-null   object 
 4   Age                1930 non-null   float64
 5   AgeGroup           1930 non-null   object 
 6   Gender             1930 non-null   object 
 7   Height             1930 non-null   float64
 8   Weight             1930 non-null   float64
 9   BMI                1930 non-null   float64
 10  FootLen            1906 non-null   float64
 11  Nationality        1930 non-null   object 
 12  SkinColor          1930 non-null   object 
 13  Ystudy             1930 non-null   int64  
 14  Footwear           1930 non-null   object 
 15  Illness            1930 non-null   object 
 16  Illness2           1930 

### Set variable types to categorical

See https://pandas.pydata.org/docs/user_guide/categorical.html
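
A small sketch of why the categorical type helps here, using a made-up column with few distinct values (the series `vision` is a toy stand-in, not the real metadata). It compares the deep memory usage before and after conversion and shows the integer codes that replace the repeated strings.

```python
import pandas as pd

# Toy column standing in for a metadata field with few distinct values
vision = pd.Series(['Open', 'Closed'] * 5000, name='Vision')
as_category = vision.astype('category')

bytes_plain = vision.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f'plain: {bytes_plain} bytes, category: {bytes_category} bytes')

# Category values are stored as small integer codes plus a lookup table
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head(4).tolist())
```

The saving matters most after the merge, when each metadata value is repeated for every sample of a trial. Note that converting every column also turns numeric fields such as `Age` into categories, so they would need to be cast back before arithmetic.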

In [5]:
metadata = metadata.astype('category', copy=True)
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 64 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              1930 non-null   category
 1   Subject            1930 non-null   category
 2   Vision             1930 non-null   category
 3   Surface            1930 non-null   category
 4   Age                1930 non-null   category
 5   AgeGroup           1930 non-null   category
 6   Gender             1930 non-null   category
 7   Height             1930 non-null   category
 8   Weight             1930 non-null   category
 9   BMI                1930 non-null   category
 10  FootLen            1906 non-null   category
 11  Nationality        1930 non-null   category
 12  SkinColor          1930 non-null   category
 13  Ystudy             1930 non-null   category
 14  Footwear           1930 non-null   category
 15  Illness            1930 non-null   category
 16  Illnes

### Merge metadata and data files individually and then concatenate all

In [6]:
def merge_meta_data(metadata, trial):
    # Read one trial file and attach that trial's metadata row to every sample
    data = pd.read_csv(path2 / f'{trial}.txt', delimiter='\t', engine='c')
    data['Trial'] = trial
    return pd.merge(metadata.query('Trial == @trial'), data, how='inner', on='Trial')
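
To make the effect of the merge concrete, here is a sketch on made-up data held in memory. The helper `merge_one`, the trial names `T1` and `T2`, and the sample values are assumptions for illustration only. The single metadata row for the trial is repeated on every row of the trial's time series.

```python
import io
import pandas as pd

# Toy stand-ins for the metadata table and one trial file (values are made up)
metadata = pd.DataFrame({'Trial': ['T1', 'T2'], 'Subject': [1, 1],
                         'Vision': ['Open', 'Closed']})
trial_file = io.StringIO('Time[s]\tFz[N]\n0.01\t539.1\n0.02\t538.6\n0.03\t538.2\n')

def merge_one(metadata, trial, source):
    data = pd.read_csv(source, sep='\t')
    data['Trial'] = trial  # key column shared with the metadata row
    return pd.merge(metadata.query('Trial == @trial'), data, how='inner', on='Trial')

merged = merge_one(metadata, 'T1', trial_file)
print(merged)
```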

In [7]:
df_all = [merge_meta_data(metadata, trial) for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
df_all

100%|████████████████████████████████████████████████████████████████| 1930/1930 [00:42<00:00, 44.92it/s]


Unnamed: 0,Trial,Subject,Vision,Surface,Age,AgeGroup,Gender,Height,Weight,BMI,...,Date,Time[s],Fx[N],Fy[N],Fz[N],Mx[Nm],My[Nm],Mz[Nm],COPx[cm],COPy[cm]
0,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.01,-1.633567,-3.739135,539.066061,5.383505,43.064851,-0.570876,-7.988789,0.998673
1,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.02,-1.631645,-3.755459,538.591198,5.373275,43.020015,-0.575349,-7.987508,0.997654
2,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.03,-1.628593,-3.762146,538.207318,5.367719,42.974805,-0.578670,-7.984805,0.997333
3,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.04,-1.623200,-3.753010,537.977214,5.369522,42.929275,-0.580162,-7.979757,0.998095
4,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.05,-1.613998,-3.727238,537.915946,5.378311,42.883767,-0.579954,-7.972206,0.999842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11579995,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.96,7.559660,-1.264583,736.066905,-5.914619,-17.292192,-0.201874,2.349269,-0.803544
11579996,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.97,7.408137,-1.169381,736.342291,-5.750486,-17.088347,-0.225570,2.320707,-0.780953
11579997,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.98,6.770478,-0.808862,738.143537,-5.428087,-16.605555,-0.279714,2.249638,-0.735370
11579998,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.99,5.685936,-0.229957,741.306461,-4.964806,-15.876418,-0.360510,2.141681,-0.669737


In [8]:
df_all.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11580000 entries, 0 to 11579999
Data columns (total 73 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Age                category
 5   AgeGroup           category
 6   Gender             category
 7   Height             category
 8   Weight             category
 9   BMI                category
 10  FootLen            category
 11  Nationality        category
 12  SkinColor          category
 13  Ystudy             category
 14  Footwear           category
 15  Illness            category
 16  Illness2           category
 17  Nmedication        category
 18  Medication         category
 19  Ortho-Prosthesis   category
 20  Ortho-Prosthesis2  category
 21  Disability         category
 22  Disability2        category
 23  Falls12m           category
 24  FES_1              cat

### Save data to file

Use engine 'fastparquet' to preserve category data types; see: https://www.practicaldatascience.org/html/parquet.html

In [9]:
df_all.to_parquet(f'{path2}{os.sep}{"santos16.parquet"}', engine='fastparquet', index=False)  # 820 MB

In [10]:
#df_all.to_csv(f'{path2}{os.sep}{"santos16.txt"}', sep='\t', float_format=None, index=False)  # interrupted after > 4 GB

### Test: load parquet file
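
Besides inspecting the reloaded dataframe by eye, the comparison can be done programmatically. The sketch below uses small toy frames in place of `df_all` and `df_all2`; the helper `same_frame` is hypothetical and simply wraps `pd.testing.assert_frame_equal`, which checks values, column order, and dtypes.

```python
import pandas as pd

def same_frame(original, reloaded):
    """Return True when values, column order, and dtypes all match."""
    try:
        pd.testing.assert_frame_equal(original, reloaded)
    except AssertionError:
        return False
    return True

# Toy frame standing in for df_all
original = pd.DataFrame({'Trial': pd.Categorical(['T1', 'T1', 'T2']),
                         'Fz[N]': [539.1, 538.6, 736.1]})

unchanged = original.copy()
lost_category = original.astype({'Trial': object})  # category dtype dropped

print(same_frame(original, unchanged))
print(same_frame(original, lost_category))
```

On the full dataset such a comparison holds both frames in memory at once, so it may be preferable to compare only `dtypes` and `shape`.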

In [11]:
df_all2 = pd.read_parquet(f'{path2}{os.sep}{"santos16.parquet"}', engine='fastparquet')
df_all2

Unnamed: 0,Trial,Subject,Vision,Surface,Age,AgeGroup,Gender,Height,Weight,BMI,...,Date,Time[s],Fx[N],Fy[N],Fz[N],Mx[Nm],My[Nm],Mz[Nm],COPx[cm],COPy[cm]
0,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.01,-1.633567,-3.739135,539.066061,5.383505,43.064851,-0.570876,-7.988789,0.998673
1,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.02,-1.631645,-3.755459,538.591198,5.373275,43.020015,-0.575349,-7.987508,0.997654
2,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.03,-1.628593,-3.762146,538.207318,5.367719,42.974805,-0.578670,-7.984805,0.997333
3,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.04,-1.623200,-3.753010,537.977214,5.369522,42.929275,-0.580162,-7.979757,0.998095
4,BDS00001,1,Open,Firm,33.000000,Young,F,157.5,54.2,21.849332,...,2015-10-08 08:30:00.000,0.05,-1.613998,-3.727238,537.915946,5.378311,42.883767,-0.579954,-7.972206,0.999842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11579995,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.96,7.559660,-1.264583,736.066905,-5.914619,-17.292192,-0.201874,2.349269,-0.803544
11579996,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.97,7.408137,-1.169381,736.342291,-5.750486,-17.088347,-0.225570,2.320707,-0.780953
11579997,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.98,6.770478,-0.808862,738.143537,-5.428087,-16.605555,-0.279714,2.249638,-0.735370
11579998,BDS01956,163,Closed,Foam,25.416667,Young,M,172.0,74.6,25.216333,...,2016-03-11 10:49:57.538,59.99,5.685936,-0.229957,741.306461,-4.964806,-15.876418,-0.360510,2.141681,-0.669737


In [12]:
df_all2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11580000 entries, 0 to 11579999
Data columns (total 73 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Age                category
 5   AgeGroup           category
 6   Gender             category
 7   Height             category
 8   Weight             category
 9   BMI                category
 10  FootLen            category
 11  Nationality        category
 12  SkinColor          category
 13  Ystudy             category
 14  Footwear           category
 15  Illness            category
 16  Illness2           category
 17  Nmedication        category
 18  Medication         category
 19  Ortho-Prosthesis   category
 20  Ortho-Prosthesis2  category
 21  Disability         category
 22  Disability2        category
 23  Falls12m           category
 24  FES_1              cat

## dosSantos17 dataset

In [2]:
# LOCAL marcos (disabled)
# path2 = Path(f'.{os.sep}..{os.sep}dosSantos17{os.sep}data')
# print(f'Dataset location: {path2}')
# metadata = pd.read_csv(f"{path2}{os.sep}PDSinfo.txt", sep='\t', header=0,
#                        engine='c', encoding='utf-8')
# print('Information from %s files successfully loaded (total of %s subjects).'
#       %(len(metadata), len(pd.unique(metadata.Subject))))
# display(metadata)

In [8]:
# LOCAL Jonatas
path2 = Path(f'D:{os.sep}Datasets{os.sep}dosSantos17{os.sep}data')
print(f'Dataset location: {path2}')
metadata = pd.read_csv(f"{path2}{os.sep}PDSinfo.txt", sep='\t', header=0,
                       engine='c', encoding='utf-8')
print('Information from %s files successfully loaded (total of %s subjects).'
      %(len(metadata), len(pd.unique(metadata.Subject))))
display(metadata)

Dataset location: D:\Datasets\dosSantos17\data
Information from 588 files successfully loaded (total of 49 subjects).


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583,PDS49CR2,49,Closed,Rigid,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
584,PDS49CR3,49,Closed,Rigid,3,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
585,PDS49CF1,49,Closed,Foam,1,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
586,PDS49CF2,49,Closed,Foam,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


In [4]:
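# Add one static-trial row per subject to the metadata, so the static files are
# included when we loop over the trials. Each new row copies the subject's last
# row and gets a fractional index label (idx + .5), so sort_index places it right after.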
subjects = pd.unique(metadata.Subject).tolist()
for s in subjects[::-1]:
    idx = metadata.query('Subject == @s', engine='python').index[-1]
    metadata.loc[idx + .5] = metadata.loc[idx].values
    metadata.loc[idx + .5, ['Trial', 'Vision', 'Surface', 'Rep']] = ['PDS' + str(s).zfill(2) + 'static', None, None, None]
metadata.sort_index(inplace=True)
metadata.reset_index(drop=True, inplace=True)
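
The same insertion can be sketched without the fractional index labels. This is an alternative under stated assumptions, not the approach used above: the toy `metadata` holds made-up rows, and only the `Vision` column is blanked for brevity. Each subject's last row is copied, renamed, and put back with a stable sort on the index.

```python
import pandas as pd

# Toy metadata (made-up rows) with the same key columns as PDSinfo.txt
metadata = pd.DataFrame({'Trial': ['PDS01OR1', 'PDS01OR2', 'PDS02OR1'],
                         'Subject': [1, 1, 2],
                         'Vision': ['Open', 'Open', 'Open']})

# Copy each subject's last row and rename it as that subject's static trial
static = metadata.groupby('Subject', sort=False).tail(1).copy()
static['Trial'] = 'PDS' + static['Subject'].astype(str).str.zfill(2) + 'static'

# The copies keep the index label of the row they came from, so a stable
# sort on the index places each one directly after its source row
extended = (pd.concat([metadata, static])
            .sort_index(kind='mergesort')
            .reset_index(drop=True))

# Condition columns do not apply to the static trials
is_static = extended['Trial'].str.endswith('static')
extended.loc[is_static, 'Vision'] = None
print(extended)
```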

In [5]:
metadata = metadata.astype('category', copy=True)
display(metadata)
display(metadata.info(memory_usage='deep'))

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2.0,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
632,PDS49CR3,49,Closed,Rigid,3.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
633,PDS49CF1,49,Closed,Foam,1.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
634,PDS49CF2,49,Closed,Foam,2.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
635,PDS49CF3,49,Closed,Foam,3.0,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              637 non-null    category
 1   Subject            637 non-null    category
 2   Vision             588 non-null    category
 3   Surface            588 non-null    category
 4   Rep                588 non-null    category
 5   Age                637 non-null    category
 6   AgeGroup           637 non-null    category
 7   Gender             637 non-null    category
 8   Height             637 non-null    category
 9   Mass               637 non-null    category
 10  BMI                637 non-null    category
 11  FootLen            637 non-null    category
 12  DominantLeg        637 non-null    category
 13  Nationality        637 non-null    category
 14  SkinColor          637 non-null    category
 15  Ystudy             637 non-null    category
 16  Footwear

None

In [6]:
def merge_meta_and_data(trial, metadata=metadata, kinds=('',)):
    # Merge metadata and data files
    if trial[5:] != 'static':
        for k, kind in enumerate(kinds):
            # one file per kind of data, e.g. <trial>grf.txt
            df = pd.read_csv(f"{path2}{os.sep}{trial + kind + '.txt'}", delimiter='\t', engine='c')
            if k == 0:
                df['Trial'] = trial
                meta_and_data = pd.merge(metadata.query('Trial == @trial'), df, how='inner', on='Trial')
            else:
                # Time is already present from the first file
                df.drop(columns='Time', inplace=True)
                meta_and_data = pd.concat([meta_and_data, df], axis=1)
    else:
        # static trials have a single file, <trial>.txt
        df = pd.read_csv(f"{path2}{os.sep}{trial + '.txt'}", delimiter='\t', engine='c')
        df['Trial'] = trial
        meta_and_data = pd.merge(metadata.query('Trial == @trial'), df, how='inner', on='Trial')

    return meta_and_data
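
The column-wise step in this function can be seen in isolation on made-up data. In the sketch below the column names `Fx` and `Knee` and all values are invented; only the shared `Time` column is taken from the code above. `pd.concat(..., axis=1)` matches rows by index label, so a block with fewer rows is padded with `NaN`.

```python
import io
import pandas as pd

# Made-up file contents: tab-separated, each with its own Time column
grf = pd.read_csv(io.StringIO('Time\tFx\n0.01\t1.0\n0.02\t1.1\n0.03\t1.2\n'), sep='\t')
ang = pd.read_csv(io.StringIO('Time\tKnee\n0.01\t5.0\n0.02\t5.5\n'), sep='\t')

# Drop the repeated Time column, then place the blocks side by side;
# rows are matched by index label, so the shorter block is padded with NaN
combined = pd.concat([grf, ang.drop(columns='Time')], axis=1)
print(combined)
```

Because the alignment is by row position rather than by time value, it assumes that all files of a trial share the same sampling and start time.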

In [17]:
df_all = [merge_meta_and_data(trial, metadata, kinds=('grf', 'ang', 'mkr'))
          for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, axis=0, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
display(df_all)
display(df_all.info(verbose=True, memory_usage='deep'))

100%|██████████████████████████████████████████████████████████████████| 637/637 [01:26<00:00,  7.32it/s]


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,R.Knee.Medial_Z,L.Knee.Medial_X,L.Knee.Medial_Y,L.Knee.Medial_Z,R.Ankle.Medial_X,R.Ankle.Medial_Y,R.Ankle.Medial_Z,L.Ankle.Medial_X,L.Ankle.Medial_Y,L.Ankle.Medial_Z
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
1,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
2,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
3,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
4,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3542695,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194277,-0.386336,0.424990,-1.275723,-0.385735,0.066252,-1.139551,-0.385341,0.068598,-1.317972
3542696,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194275,-0.386327,0.424990,-1.275722,-0.385737,0.066251,-1.139551,-0.385339,0.068597,-1.317970
3542697,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194271,-0.386312,0.424988,-1.275720,-0.385738,0.066249,-1.139549,-0.385336,0.068594,-1.317969
3542698,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194266,-0.386292,0.424984,-1.275716,-0.385739,0.066245,-1.139545,-0.385333,0.068592,-1.317969


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3542700 entries, 0 to 3542699
Data columns (total 270 columns):
 #    Column             Dtype   
---   ------             -----   
 0    Trial              category
 1    Subject            category
 2    Vision             category
 3    Surface            category
 4    Rep                category
 5    Age                category
 6    AgeGroup           category
 7    Gender             category
 8    Height             category
 9    Mass               category
 10   BMI                category
 11   FootLen            category
 12   DominantLeg        category
 13   Nationality        category
 14   SkinColor          category
 15   Ystudy             category
 16   Footwear           category
 17   Illness            category
 18   Illness2           category
 19   Nmedication        category
 20   Medication         category
 21   Ortho-Prosthesis   category
 22   Ortho-Prosthesis2  category
 23   Disability         category
 2

None

### Save data to file

In [18]:
df_all.to_parquet(f'{path2}{os.sep}{"dossantos17.parquet"}', engine='fastparquet', index=False)  # 5.4 GB

In [21]:
# test it
df_all2 = pd.read_parquet(f'{path2}{os.sep}{"dossantos17.parquet"}', engine='fastparquet')
df_all2

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,R.Knee.Medial_Z,L.Knee.Medial_X,L.Knee.Medial_Y,L.Knee.Medial_Z,R.Ankle.Medial_X,R.Ankle.Medial_Y,R.Ankle.Medial_Z,L.Ankle.Medial_X,L.Ankle.Medial_Y,L.Ankle.Medial_Z
0,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
1,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
2,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
3,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
4,PDS01OR1,1,Open,Rigid,1.0,25.67,Young,M,1.72,74.30,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3542695,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194277,-0.386336,0.424990,-1.275723,-0.385735,0.066252,-1.139551,-0.385341,0.068598,-1.317972
3542696,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194275,-0.386327,0.424990,-1.275722,-0.385737,0.066251,-1.139551,-0.385339,0.068597,-1.317970
3542697,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194271,-0.386312,0.424988,-1.275720,-0.385738,0.066249,-1.139549,-0.385336,0.068594,-1.317969
3542698,PDS49static,49,,,,64.92,Old,F,1.58,60.75,...,-1.194266,-0.386292,0.424984,-1.275716,-0.385739,0.066245,-1.139545,-0.385333,0.068592,-1.317969


In [22]:
df_all2.info(verbose=True, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3542700 entries, 0 to 3542699
Data columns (total 270 columns):
 #    Column             Dtype   
---   ------             -----   
 0    Trial              category
 1    Subject            category
 2    Vision             category
 3    Surface            category
 4    Rep                category
 5    Age                category
 6    AgeGroup           category
 7    Gender             category
 8    Height             category
 9    Mass               category
 10   BMI                category
 11   FootLen            category
 12   DominantLeg        category
 13   Nationality        category
 14   SkinColor          category
 15   Ystudy             category
 16   Footwear           category
 17   Illness            category
 18   Illness2           category
 19   Nmedication        category
 20   Medication         category
 21   Ortho-Prosthesis   category
 22   Ortho-Prosthesis2  category
 23   Disability         category
 2

### dosSantos17
We will not read the marker, angle, or static files; only the GRF files are loaded.

In [2]:
# LOCAL Jonatas
path2 = Path(f'D:{os.sep}Datasets{os.sep}dosSantos17{os.sep}data')
os_sep = os.sep
metadata_fname = 'PDSinfo.txt'
print(f'Dataset location: {path2}')
metadata = pd.read_csv(f"{path2}{os.sep}PDSinfo.txt", sep='\t', header=0,
                       engine='c', encoding='utf-8')
print('Information from %s files successfully loaded (total of %s subjects).'
      %(len(metadata), len(pd.unique(metadata.Subject))))
display(metadata)

Dataset location: D:\Datasets\dosSantos17\data
Information from 588 files successfully loaded (total of 49 subjects).


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583,PDS49CR2,49,Closed,Rigid,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
584,PDS49CR3,49,Closed,Rigid,3,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
585,PDS49CF1,49,Closed,Foam,1,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
586,PDS49CF2,49,Closed,Foam,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


In [3]:
metadata = pd.read_csv(f'{path2}{os_sep}{metadata_fname}', sep='\t', header=0,
                       engine='c', encoding='utf-8')
print('Information from %s files successfully loaded (total of %s subjects).'
      %(len(metadata), len(pd.unique(metadata.Subject))))
metadata_copy = metadata.copy(deep=True)
display(metadata)

Information from 588 files successfully loaded (total of 49 subjects).


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,Nmedication,Medication,Ortho-Prosthesis,Ortho-Prosthesis2,Disability,Disability2,Falls12m,PhysicalActivity,Sequence,Date
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
1,PDS01OR2,1,Open,Rigid,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
2,PDS01OR3,1,Open,Rigid,3,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
3,PDS01OF1,1,Open,Foam,1,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
4,PDS01OF2,1,Open,Foam,2,25.67,Young,M,1.72,74.30,...,0,No,Yes,Corrective lens,No,No,0,1,"OR, OF, CF, CR",2016-08-01 11:00:17.753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583,PDS49CR2,49,Closed,Rigid,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
584,PDS49CR3,49,Closed,Rigid,3,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
585,PDS49CF1,49,Closed,Foam,1,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819
586,PDS49CF2,49,Closed,Foam,2,64.92,Old,F,1.58,60.75,...,3,"HMG-CoA reductase inhibitor, Synthetic thyroid...",Yes,"Corrective lens, Dental implant",Yes,Hearing (Left ear),0,0,"CF, OF, CR, OR",2016-12-06 09:33:45.819


In [16]:
metadata = metadata.astype('category', copy=True)
metadata.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 588 entries, 0 to 587
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Trial              588 non-null    category
 1   Subject            588 non-null    category
 2   Vision             588 non-null    category
 3   Surface            588 non-null    category
 4   Rep                588 non-null    category
 5   Age                588 non-null    category
 6   AgeGroup           588 non-null    category
 7   Gender             588 non-null    category
 8   Height             588 non-null    category
 9   Mass               588 non-null    category
 10  BMI                588 non-null    category
 11  FootLen            588 non-null    category
 12  DominantLeg        588 non-null    category
 13  Nationality        588 non-null    category
 14  SkinColor          588 non-null    category
 15  Ystudy             588 non-null    category
 16  Footwear

In [4]:
def merge_meta_grf(trial):
    # Candidate file endings, tried in order; the first existing file is used
    file_endings = ['grf', 'mkr', 'ang', 'static']

    for ending in file_endings:
        arquivo = f'{path2}{os_sep}{trial}{ending}.txt'
        if os.path.isfile(arquivo):
            grf = pd.read_csv(arquivo, delimiter='\t', header=0)
            grf['Trial'] = trial
            # attach the trial's metadata row to every sample
            return pd.merge(metadata.query('Trial == @trial'), grf, how='inner', on='Trial')
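
`merge_meta_grf` returns `None` when no candidate file exists, and `pd.concat` silently drops `None` entries, so a missing trial would go unnoticed. A possible safeguard is sketched below; the helper `find_trial_file` and its `endings` default are hypothetical, and the demonstration runs in a temporary directory with an empty file rather than on the real data.

```python
from pathlib import Path
import tempfile

def find_trial_file(data_dir, trial, endings=('grf',)):
    """Return the first existing '<trial><ending>.txt' file, or raise."""
    for ending in endings:
        candidate = Path(data_dir) / f'{trial}{ending}.txt'
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f'no data file for trial {trial!r} in {data_dir}')

# Demonstration in a throwaway directory with one empty file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'PDS01OR1grf.txt').touch()
    found = find_trial_file(tmp, 'PDS01OR1').name
    try:
        find_trial_file(tmp, 'PDS01OR2')
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(found, missing_raises)
```

Restricting `endings` to `'grf'` also makes the intent explicit: a trial without a GRF file raises an error instead of falling back to another kind of file.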

In [5]:
from tqdm import tqdm
df_all = [merge_meta_grf(trial) for trial in tqdm(metadata['Trial'])]
df_all = pd.concat(df_all, ignore_index=True)
df_all = df_all.astype({'Trial': 'category'})
df_all

  0%|          | 0/588 [00:00<?, ?it/s]

100%|██████████| 588/588 [00:55<00:00, 10.64it/s]


Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,RCOP_Y,RCOP_Z,LCOP_X,LCOP_Y,LCOP_Z,COPNET_X,COPNET_Y,COPNET_Z,RFREEMOMENT_Y,LFREEMOMENT_Y
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093610,0.227590,0.0,-0.085100,0.231089,0.0,-0.002490,3.067748,-0.889998
1,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093502,0.227516,0.0,-0.085075,0.230965,0.0,-0.002571,3.058200,-0.873012
2,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093391,0.227458,0.0,-0.085064,0.230859,0.0,-0.002660,3.050973,-0.858668
3,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093278,0.227424,0.0,-0.085075,0.230782,0.0,-0.002755,3.047538,-0.849078
4,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093171,0.227413,0.0,-0.085108,0.230732,0.0,-0.002848,3.047990,-0.845515
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3527995,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111741,0.295525,0.0,-0.116602,0.277993,0.0,0.011928,-0.829290,-0.148266
3527996,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111785,0.295505,0.0,-0.116570,0.277949,0.0,0.012094,-0.825005,-0.140185
3527997,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111849,0.295502,0.0,-0.116534,0.277911,0.0,0.012296,-0.819960,-0.137504
3527998,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111940,0.295511,0.0,-0.116497,0.277880,0.0,0.012536,-0.814717,-0.140403


In [23]:
df_all.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3528000 entries, 0 to 3527999
Data columns (total 50 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Rep                category
 5   Age                category
 6   AgeGroup           category
 7   Gender             category
 8   Height             category
 9   Mass               category
 10  BMI                category
 11  FootLen            category
 12  DominantLeg        category
 13  Nationality        category
 14  SkinColor          category
 15  Ystudy             category
 16  Footwear           category
 17  Illness            category
 18  Illness2           category
 19  Nmedication        category
 20  Medication         category
 21  Ortho-Prosthesis   category
 22  Ortho-Prosthesis2  category
 23  Disability         category
 24  Disability2        categ
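
The listing above shows the metadata columns stored as `category`. The following is a minimal sketch, on hypothetical toy data, of why that dtype suits values that repeat for every sample of a trial: the column is stored as small integer codes plus one copy of each distinct label. The series names and the size `n` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

n = 100_000  # hypothetical number of samples per condition

# A metadata-like column: two distinct labels, each repeated n times
s_obj = pd.Series(np.repeat(['Open', 'Closed'], n))
s_cat = s_obj.astype('category')

# deep=True also counts the memory held by the string values themselves
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(f'string column: {mem_obj} bytes, category column: {mem_cat} bytes')
```

The exact byte counts depend on the pandas version and string backend, so none are quoted here.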

In [6]:
df_all.to_parquet(f'{path2}{os.sep}santos17grf.parquet', engine='fastparquet', index=False)

In [26]:
df_all2 = pd.read_parquet(f'{path2}{os.sep}santos17grf.parquet', engine='fastparquet')
df_all2

Unnamed: 0,Trial,Subject,Vision,Surface,Rep,Age,AgeGroup,Gender,Height,Mass,...,RCOP_Y,RCOP_Z,LCOP_X,LCOP_Y,LCOP_Z,COPNET_X,COPNET_Y,COPNET_Z,RFREEMOMENT_Y,LFREEMOMENT_Y
0,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093610,0.227590,0.0,-0.085100,0.231089,0.0,-0.002490,3.067748,-0.889998
1,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093502,0.227516,0.0,-0.085075,0.230965,0.0,-0.002571,3.058200,-0.873012
2,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093391,0.227458,0.0,-0.085064,0.230859,0.0,-0.002660,3.050973,-0.858668
3,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093278,0.227424,0.0,-0.085075,0.230782,0.0,-0.002755,3.047538,-0.849078
4,PDS01OR1,1,Open,Rigid,1,25.67,Young,M,1.72,74.30,...,0.0,0.093171,0.227413,0.0,-0.085108,0.230732,0.0,-0.002848,3.047990,-0.845515
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3527995,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111741,0.295525,0.0,-0.116602,0.277993,0.0,0.011928,-0.829290,-0.148266
3527996,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111785,0.295505,0.0,-0.116570,0.277949,0.0,0.012094,-0.825005,-0.140185
3527997,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111849,0.295502,0.0,-0.116534,0.277911,0.0,0.012296,-0.819960,-0.137504
3527998,PDS49CF3,49,Closed,Foam,3,64.92,Old,F,1.58,60.75,...,0.0,0.111940,0.295511,0.0,-0.116497,0.277880,0.0,0.012536,-0.814717,-0.140403


In [27]:
df_all2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3528000 entries, 0 to 3527999
Data columns (total 50 columns):
 #   Column             Dtype   
---  ------             -----   
 0   Trial              category
 1   Subject            category
 2   Vision             category
 3   Surface            category
 4   Rep                category
 5   Age                category
 6   AgeGroup           category
 7   Gender             category
 8   Height             category
 9   Mass               category
 10  BMI                category
 11  FootLen            category
 12  DominantLeg        category
 13  Nationality        category
 14  SkinColor          category
 15  Ystudy             category
 16  Footwear           category
 17  Illness            category
 18  Illness2           category
 19  Nmedication        category
 20  Medication         category
 21  Ortho-Prosthesis   category
 22  Ortho-Prosthesis2  category
 23  Disability         category
 24  Disability2        categ