# Create hdf5 file from gt3x files

This notebook presents code snippets that collect the data from multiple .gt3x files and store it in a single hdf5 file. This makes further processing much quicker because the gt3x files do not need to be decoded every time.

In [1]:
import h5py
import os
from multiprocessing import Pool
import random
import time
import logging

from tqdm.notebook import tqdm
import glob2
import paat

logging.basicConfig(filename='/tmp/process-gt3x.log')

# Set file path to relevant files
GT3X_BASE_PATH = os.path.join(os.sep, 'run', 'media', 'msw', 'LaCie', 'Actigraph_raw')
HDF5_FILE_PATH = os.path.join(os.sep, 'run', 'media', 'msw', 'LaCie', 'ACTIGRAPH_TU7.hdf5')

# Number of worker processes to use for parallel processing
N_JOBS = 16

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


## Define the processing pipeline

In [2]:
def process_file(file_path):
    """Decode one gt3x file and append its data to the shared hdf5 file."""
    try:
        # Load the data
        times, acceleration, meta = paat.io.read_gt3x(file_path)

        # Save data to file, retrying until the write succeeds
        while True:
            try:
                with h5py.File(HDF5_FILE_PATH, 'a') as hdf5_file:
                    # One group per subject; raises if the group already exists
                    grp = hdf5_file.create_group(meta["Subject_Name"])
                    paat.io.save_dset(grp, "acceleration", times, acceleration, meta)
            # Retry after a random pause when the file is in use by another process
            except OSError:
                time.sleep(random.uniform(0, 3))
                continue
            break
    except Exception as msg:
        # Report the failure and move on so one bad file does not stop the run
        print('Could not process file {}: "{}"'.format(file_path, msg))

<div class="alert alert-warning">

**Note:**

The `while True` loop in `process_file()` might look obscure at first, but it is necessary to enable parallel processing. An `OSError` is raised when a process tries to save its data while the file is locked by a different process. Sadly, there is no built-in functionality yet for more sophisticated parallel writing to an hdf5 file, so the quick fix is to try again after a short random pause. This should be improved as soon as more sophisticated ways have been developed!

</div>
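
One caveat of the loop above is that it retries forever, so an `OSError` with a permanent cause (for example a missing target directory) would never surface. The following is a minimal sketch of a bounded variant, not part of the original pipeline: the helper name `retry_on_oserror` and its parameters `max_attempts`, `max_wait` and `sleep` are hypothetical and only serve as illustration.

```python
import random
import time


def retry_on_oserror(action, max_attempts=20, max_wait=3.0, sleep=time.sleep):
    """Call action() until it succeeds, retrying on OSError.

    Hypothetical helper: after max_attempts failed calls the last
    OSError is re-raised instead of looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == max_attempts:
                raise
            # Random pause so that competing processes do not retry in lockstep
            sleep(random.uniform(0, max_wait))


# Small demonstration with a stand-in for the save step that
# fails twice (as if the file were locked) before succeeding.
attempts = []


def flaky_save():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("file is locked")
    return "saved"


result = retry_on_oserror(flaky_save, sleep=lambda seconds: None)
print(result, "after", len(attempts), "attempts")
```

In `process_file()`, the body of the `with h5py.File(...)` block could be wrapped in a small function and passed as `action`. Whether a limit of 20 attempts is appropriate depends on the number of workers and on how long each write takes, so the default here is only a placeholder.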

## Process all gt3x files

In [3]:
# Get all gt3x files
gt3x_files = glob2.glob(os.path.join(GT3X_BASE_PATH, '**', '*.gt3x'))
    
# Create a new, empty hdf5 file (mode 'w' overwrites any existing file)
h5py.File(HDF5_FILE_PATH, 'w').close()

# Process all files
with Pool(N_JOBS) as p:
    list(tqdm(p.imap(process_file, gt3x_files), total=len(gt3x_files)))

  0%|          | 0/6157 [00:00<?, ?it/s]

Could not process file /run/media/msw/LaCie/Actigraph_raw/2015 05/90449329 (2015-05-09).gt3x: "Unable to create group (name already exists)"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2015 09/92568434 (2015-09-24).gt3x: "'NoneType' object has no attribute 'shape'"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2015 10/90606223 (2015-10-28).gt3x: "'NoneType' object has no attribute 'shape'"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2015 10/91500621 (2015-10-03).gt3x: "'Serial_Number'"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2015 11/92678133 (2015-11-28).gt3x: "Unable to create group (name already exists)"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2016 03/90516829 (2016-03-12).gt3x: "'NoneType' object has no attribute 'shape'"
Could not process file /run/media/msw/LaCie/Actigraph_raw/2016 03/92538330 (2016-03-16).gt3x: "'NoneType' object has no attribute 'shape'"
Could not process file /run/media/msw/LaCie/Actigr