# Validation of the HDF5 file

In this section, the processing from the previous section is validated against the earlier approach developed by Shaheen. This ensures that the code snippets in the previous section collect the same data and produce an output file comparable to the earlier one. Note, however, that the files differ significantly: the new version, which uses `paat`, saves only the information from the ActiGraph log file and creates the time vector from the metadata stored previously. This saves the space of the timestamp vector, which would otherwise have a length of `n_samples / hz`, with `n_samples` being the number of observations of the acceleration data and `hz` being the sampling rate.
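How a time vector can be rebuilt from metadata is illustrated by the minimal sketch below. It assumes that the stored metadata contains a start time and the sampling rate, and that one timestamp per sample is wanted. The function name `reconstruct_timestamps` and the example values are hypothetical and are not taken from `paat`.

```python
from datetime import datetime, timedelta


def reconstruct_timestamps(start, n_samples, hz):
    """Rebuild one timestamp per sample from a start time and a sampling rate."""
    # Sample i is recorded i / hz seconds after the start of the recording.
    return [start + timedelta(seconds=i / hz) for i in range(n_samples)]


# Hypothetical metadata values, for illustration only.
start = datetime(2015, 1, 1, 12, 0, 0)
timestamps = reconstruct_timestamps(start, n_samples=5, hz=100)

print(timestamps[0], timestamps[-1])
```

For real recordings a vectorised NumPy version would probably be preferable, since a Python list of `datetime` objects becomes expensive for long recordings.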

Therefore, the main objective of this section is to check that every subject stored in the old file is also stored in the new file.

In [1]:
import h5py
import os

# Set the file paths of the old and the new HDF5 file
OLD_HDF5_FILEPATH = os.path.join(os.sep, 'run', 'media', 'msw', 'LaCie', 'ACTIGRAPH_TU7.hdf5')
NEW_HDF5_FILEPATH = os.path.join(os.sep, 'run', 'media', 'msw', 'LaCie1', 'ACTIGRAPH_TU7.hdf5')

## Load subject information

### Load the subjects from the file generated with Shaheen's code

In [2]:
with h5py.File(OLD_HDF5_FILEPATH, 'r') as old_hdf5_file:
    old_subjects = set(old_hdf5_file.keys())

### Load the subjects from the file generated using PAAT

In [3]:
with h5py.File(NEW_HDF5_FILEPATH, 'r') as new_hdf5_file:
    new_subjects = set(new_hdf5_file.keys())

## Results

### Number of subjects in each file

In [4]:
print("{} subjects in Shaheen's HDF5 file ({})".format(len(old_subjects), OLD_HDF5_FILEPATH))
print("{} subjects in the new HDF5 file ({})".format(len(new_subjects), NEW_HDF5_FILEPATH))

6114 subjects in Shaheen's HDF5 file (/run/media/msw/LaCie/ACTIGRAPH_TU7.hdf5)
6138 subjects in the new HDF5 file (/run/media/msw/LaCie1/ACTIGRAPH_TU7.hdf5)


### Comparison between the two files

In [5]:
print("{} subjects are in both datasets".format(len(new_subjects & old_subjects)))
print("{} subjects are just in one of the datasets".format(len(old_subjects ^ new_subjects)))
print("{} subjects are in the old, but not in the new dataset".format(len(old_subjects - new_subjects)))
print("{} subjects are in the new, but not in the old dataset".format(len(new_subjects - old_subjects)))

6114 subjects are in both datasets
24 subjects are just in one of the datasets
0 subjects are in the old, but not in the new dataset
24 subjects are in the new, but not in the old dataset
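The outcome above (no subject is missing from the new file) could also be turned into an explicit check. The following is a minimal sketch on toy data. The helper `compare_subjects` and the IDs in `toy_old` and `toy_new` are hypothetical and only stand in for `old_subjects` and `new_subjects`.

```python
def compare_subjects(old_subjects, new_subjects):
    """Summarise how two sets of subject IDs overlap."""
    return {
        "shared": old_subjects & new_subjects,
        "only_old": old_subjects - new_subjects,
        "only_new": new_subjects - old_subjects,
    }


# Toy IDs for illustration only; they are not taken from the real files.
toy_old = {"A", "B", "C"}
toy_new = {"A", "B", "C", "D"}

summary = compare_subjects(toy_old, toy_new)

# The validation passes when nothing from the old file is missing in the new one.
assert not summary["only_old"], "subjects missing from the new file"
print(sorted(summary["only_new"]))
```

Failing loudly on a non-empty `only_old` set makes the check usable in an automated run, whereas the printed counts have to be read by a person.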


### Analysis of the differences between the two files

In [6]:
print("The following subjects are just in one of the datasets:")
print('\n'.join(map(str, old_subjects ^ new_subjects)))

The following subjects are just in one of the datasets:
90156930
90222217
90251623
90124622
91299131
90126927
90198027
90025925
90132015
90233017
90070622
90179935
90165728
90107724
90200314
MOS2C02150396
90046928
91520421
90198128
90103720
90268934
90086023
90043824
92615730
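Because sets have no defined iteration order, the listing above can come out in a different order on each run. One entry also does not follow the eight-digit pattern of the others. The sketch below sorts the IDs and separates those that deviate from that pattern. That regular subject IDs consist of exactly eight digits is an assumption based on the listing only, and `split_by_format` is a hypothetical helper.

```python
import re

# Assumption: a regular subject ID consists of exactly eight digits.
EIGHT_DIGITS = re.compile(r"\d{8}")


def split_by_format(subject_ids):
    """Return (regular, irregular) ID lists, each sorted for reproducible output."""
    ids = sorted(s.strip() for s in subject_ids)
    regular = [s for s in ids if EIGHT_DIGITS.fullmatch(s)]
    irregular = [s for s in ids if not EIGHT_DIGITS.fullmatch(s)]
    return regular, irregular


# Three IDs copied from the listing above.
example_ids = {"90156930", "91520421", "MOS2C02150396"}
regular, irregular = split_by_format(example_ids)

print(irregular)
```

Applied to `old_subjects ^ new_subjects`, this would single out entries such as `MOS2C02150396` for manual inspection.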
