# Initial Data Exploration
This is going to be a little difficult for a few reasons:

1. The data is poorly described: there is no accompanying README or documentation on what each of the fields means. We are going to have to rely on the [original paper](https://www.researchgate.net/profile/Masako-Tamaki/publication/236113471_Neural_Decoding_of_Visual_Imagery_During_Sleep/links/02e7e53a5e1eba1005000000/Neural-Decoding-of-Visual-Imagery-During-Sleep.pdf) and its [supplemental materials](https://www.science.org/doi/suppl/10.1126/science.1234330/suppl_file/horikawa.sm.pdf) to figure out how to use it.
2. The data is in HDF5 (`.h5`) format, which is something we haven't worked with before.
3. The data is spread across several files, so we will have to aggregate it ourselves.

In [34]:
!ls preproc

PreprocessedPerceptionDataSubject1.h5  PreprocessedSleepDataSubject3.h5
PreprocessedPerceptionDataSubject2.h5  propsSubject1.h5
PreprocessedPerceptionDataSubject3.h5  propsSubject2.h5
PreprocessedSleepDataSubject1.h5       propsSubject3.h5
PreprocessedSleepDataSubject2.h5


The files appear to be in "h5" format, which we can read using the h5py library ([documentation](https://docs.h5py.org/en/stable/))

It seems like there are three different classes of files: Perception data, Sleep data, and "props"?

The paper trains its decoders on brain activity recorded while subjects viewed particular images. That is most likely what the "Perception" files contain, so we can probably ignore them for now.

The sleep data probably has what we are looking for.

I'm not really sure what props is.

In [46]:
import h5py
import os

In [47]:
root = "preproc/"
perc = root + "PreprocessedPerceptionDataSubject1.h5"
sleep = root + "PreprocessedSleepDataSubject1.h5"
props = root + "propsSubject1.h5"

Start with sleep data

In [76]:
dfile = h5py.File(sleep, "r")
dfile.keys()

<KeysViewHDF5 ['data', 'metaData', 'metaDefinition']>

Looks like there are 3 keys in the file: `data`, `metaData`, and `metaDefinition`.

In [77]:
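for key in dfile.keys():
    dset = dfile[key]
    try:
        # Datasets expose shape and dtype and can be indexed
        print(key, dset.shape, dset.dtype, dset[0], end="\n\n")
    except AttributeError:
        # Groups (such as metaData) have no .shape, so print the object itself
        print(key, dset, type(key), end="\n\n")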

data (235, 4039) float64 [-1.23439404 -0.48702744  1.19918902 ...  1.          1.
  1.        ]

metaData <HDF5 group "/metaData" (43 members)> <class 'str'>

metaDefinition (43,) |S33 b'0 = not voxelData, 1 = voxelData'



In [92]:
tfile = h5py.File(root + "PreprocessedSleepDataSubject2.h5", "r")
tfile["data"].shape

(198, 3981)

Looks like `data` is a 2D array whose first dimension is the number of "awakenings", which is confirmed by the paper...
> (235, 198, and 186 awakenings for participants 1 to 3, respectively, used for decoding analyses)

`metaData` is a group and `metaDefinition` contains some kind of string data. The two "meta" objects seem to be related, since they have the same number of elements (43).
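Since a file can mix groups and datasets, one way to see the whole layout at once is to walk it recursively. The sketch below is only an illustration: it builds a tiny in-memory file that imitates the layout seen so far (the member name `FFA` and all values are placeholders, and `describe` is a helper made up for this example) and then summarises every object with `visititems`.

```python
import h5py
import numpy as np


def describe(name, obj):
    """Return a one-line summary of a group or dataset."""
    if isinstance(obj, h5py.Dataset):
        return f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}"
    return f"{name}: group, {len(obj)} members"


# Toy in-memory file (never written to disk); contents are placeholders.
with h5py.File("toy.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("data", data=np.zeros((2, 5)))
    f.create_dataset("metaDefinition", data=np.array([b"a", b"b"], dtype="S33"))
    f.create_group("metaData").create_dataset("FFA", data=np.zeros(5))

    lines = []
    # visititems calls the function for every object below the root;
    # list.append returns None, so the walk never stops early.
    f.visititems(lambda name, obj: lines.append(describe(name, obj)))

print("\n".join(lines))
```

The same `f.visititems(...)` call should work on the real files opened with `h5py.File(path, "r")`.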

In [105]:
data = dfile["data"]
mdata = dfile["metaData"]
mdef = dfile["metaDefinition"]
for i, item in enumerate(mdata):
    print(i, item, " -- ",mdef[i])

0 EEG_sleep_score  --  b'0 = not voxelData, 1 = voxelData'
1 FFA  --  b'Value = X coordinate'
2 HVC  --  b'Value = Y coordinate'
3 LOC  --  b'Value = Z coordinate'
4 LVC  --  b'0 = not FFA voxel, 1 = FFA voxel'
5 PPA  --  b'0 = not HVC voxel, 1 = HVC voxel'
6 Synset_building_ID_02913152  --  b'0 = not LOC voxel, 1 = LOC voxel'
7 Synset_chair_ID_03001627  --  b'0 = not LVC voxel, 1 = LVC voxel'
8 Synset_character_ID_06818970  --  b'0 = not PPA voxel, 1 = PPA voxel'
9 Synset_clothing_ID_03051540  --  b'0 = not V1 voxel, 1 = V1 voxel'
10 Synset_code_ID_06355894  --  b'0 = not V2 voxel, 1 = V2 voxel'
11 Synset_cognition_ID_00023271  --  b'0 = not V3 voxel, 1 = V3 voxel'
12 Synset_external_body_part_ID_05225090  --  b'0 = not label, 1 = label'
13 Synset_geographical_area_ID_08574314  --  b'0 = absent, 1 = present'
14 Synset_girl_ID_10129825  --  b'0 = absent, 1 = present'
15 Synset_group_ID_00031264  --  b'0 = absent, 1 = present'
16 Synset_illustration_ID_06999233  --  b'0 = absent, 1 = pr

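These seem to be feature-value pairs, but I'm not really sure how they map to the data vector.

Also, the names and definitions are clearly out of step with each other. The most likely cause is that h5py iterates over the members of a group in alphabetical order by name (uppercase before lowercase), while `metaDefinition` is a plain array that keeps its stored order, so the i-th name does not belong to the i-th definition. I'm not sure how to recover the original pairing from these files alone.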
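A toy illustration of how such a mismatch can arise, assuming the cause is iteration order: by default h5py yields group members sorted by name, and plain `sorted` on strings gives the same uppercase-before-lowercase order. The names in `stored_order` are hypothetical and not read from the dataset; only the definition strings are taken from the output above.

```python
# Hypothetical names, in the order a file's author might have stored them.
stored_order = ["voxelData", "X", "FFA", "V1"]
definitions = {
    "voxelData": "0 = not voxelData, 1 = voxelData",
    "X": "Value = X coordinate",
    "FFA": "0 = not FFA voxel, 1 = FFA voxel",
    "V1": "0 = not V1 voxel, 1 = V1 voxel",
}
# Definitions kept in stored order, like a plain array would.
stored_defs = [definitions[name] for name in stored_order]

# Alphabetical order by code point puts uppercase before lowercase.
iterated = sorted(stored_order)

mismatches = []
for i, name in enumerate(iterated):
    ok = definitions[name] == stored_defs[i]
    if not ok:
        mismatches.append(name)
    print(i, name, " -- ", stored_defs[i], "" if ok else "(wrong pairing)")
```

Pairing by index only works if both sequences share one order, so any fix needs the stored order of the names from somewhere.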

From the paper we know...
- LVC, HVC, V1, V2, V3, LOC, FFA, and PPA are all visual areas. These will be our input.
- Synsets are the different "classes" of things the subjects saw in their dreams. These will be our output.

I'm still not sure where the data is split, so for now I'm going to look at the "props" data and hope it's in there.

In [79]:
pfile = h5py.File(props, "r")
pfile.keys()

<KeysViewHDF5 ['roiMask', 'roiNames', 'synsetNames', 'synsetPairs', 'xyz']>

In [80]:
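for key in pfile.keys():
    dset = pfile[key]
    try:
        # Datasets expose shape and dtype and can be indexed
        print(key, dset.shape, dset.dtype, dset[0], end="\n\n")
    except AttributeError:
        # Fallback for groups, which have no .shape
        print(key, dset, type(key), end="\n\n")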

roiMask (8, 4010) float64 [0. 0. 0. ... 0. 0. 0.]

roiNames (8,) |S4 b'FFA'

synsetNames (26,) |S38 b'Synset_male_ID_09624168'

synsetPairs (201, 2) |S38 [b'Synset_character_ID_06818970' b'Synset_male_ID_09624168']

xyz (3, 4010) float64 [-67.5 -67.5 -67.5 ...  64.5  64.5  64.5]



Luckily, this seems to describe how the columns of the "sleep" data are divided among the features. According to the supplemental materials, ROI stands for "region of interest", which here is a region of the brain.

In [81]:
rmask = pfile["roiMask"]
rnames = pfile["roiNames"]
xyz = pfile["xyz"]

print(rmask[0, :].sum(), rnames[0])
print(xyz[:, 2])

for region in range(len(rnames)):
    numel = rmask[region].sum()
    print(rnames[region], " -- ", numel)

537.0 b'FFA'
[-67.5 -43.5  -4.5]
b'FFA'  --  537.0
b'HVC'  --  1956.0
b'LOC'  --  523.0
b'LVC'  --  2054.0
b'PPA'  --  353.0
b'V1'  --  885.0
b'V2'  --  901.0
b'V3'  --  728.0


From the supplemental materials...
> For the analysis of individual subareas, the following numbers of voxels were identified
for V1, V2, V3, LOC, FFA, and PPA, respectively: 885, 901, 728, 523, 537, and 353 voxels for Subject 1;

This matches the sums of our region of interest masks, meaning that each row of `roiMask` does in fact mark the voxels of one region of interest. The `xyz` dataset appears to hold the coordinates of the voxel (3D pixel) behind each entry in the masks, which we don't really need.
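The comparison with the paper can also be automated. The sketch below assumes the mask is a 2D array of 0/1 values with one row per ROI and that the names are byte strings, as in the output above. `roi_count_mismatches` is a helper made up for this example, and the toy arrays are stand-ins for `rmask` and `rnames`.

```python
import numpy as np


def roi_count_mismatches(mask, names, expected):
    """Return {name: (found, expected)} for every ROI whose count differs."""
    counts = np.asarray(mask).sum(axis=1).astype(int)  # one sum per ROI row
    found = {name.decode(): int(c) for name, c in zip(names, counts)}
    return {
        name: (found.get(name), want)
        for name, want in expected.items()
        if found.get(name) != want
    }


# Toy stand-ins: two ROIs over six voxels.
toy_mask = np.array([[1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 1, 0]], dtype=float)
toy_names = [b"FFA", b"LOC"]

agree = roi_count_mismatches(toy_mask, toy_names, {"FFA": 2, "LOC": 3})
differ = roi_count_mismatches(toy_mask, toy_names, {"FFA": 2, "LOC": 4})
print(agree)   # {}
print(differ)  # {'LOC': (3, 4)}
```

With the real file, the arguments would be `rmask[...]`, `rnames[...]`, and a dictionary of the counts quoted from the supplemental materials.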

In [100]:
print(data.shape)
ffa_mask = rmask[0].astype(bool)
print(ffa_mask.shape)

(235, 4039)
(4010,)


The mask is shorter by 29 elements (4039 - 4010), which I would assume is the space for the meta elements?
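If that assumption holds, the split is just a pair of slices. A minimal sketch on synthetic arrays with the same shapes as Subject 1, assuming the voxel columns come first and line up with the mask columns (the toy values and the ROI layout are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_voxels, n_meta = 235, 4010, 29

# Synthetic stand-ins for data and roiMask.
toy_data = rng.normal(size=(n_samples, n_voxels + n_meta))
toy_roi_mask = np.zeros((8, n_voxels))
toy_roi_mask[0, :537] = 1  # pretend ROI 0 covers the first 537 voxels

voxels = toy_data[:, :n_voxels]   # assumed voxel block
meta = toy_data[:, n_voxels:]     # assumed trailing meta block

# Boolean mask over columns selects the voxels of a single ROI.
roi0 = voxels[:, toy_roi_mask[0].astype(bool)]

print(voxels.shape, meta.shape, roi0.shape)
```

For the real files it is simplest to read the datasets into memory first, e.g. `dfile["data"][...]`, so that ordinary NumPy indexing applies.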

In [102]:
data[-1, 4009:]

array([  1.97166996,   1.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   1.        ,
        26.        , 235.        ])

Yep! The last two values are numbers rather than 0/1 flags, which matches what we would expect if `mdef` is properly ordered. Now we just need to extract all of the info into a more user-friendly form, probably pandas.
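A first pass at that could look like the sketch below. It assumes the voxel columns come first and the meta columns trail them. The column names (`voxel_0`, `meta_0`, ...) and the helper `to_frames` are placeholders invented here, since the real meta names still need to be matched to columns.

```python
import numpy as np
import pandas as pd


def to_frames(data, n_voxels):
    """Split a samples-by-columns array into voxel and meta DataFrames."""
    data = np.asarray(data)
    voxel_cols = [f"voxel_{i}" for i in range(n_voxels)]
    meta_cols = [f"meta_{j}" for j in range(data.shape[1] - n_voxels)]
    voxels = pd.DataFrame(data[:, :n_voxels], columns=voxel_cols)
    meta = pd.DataFrame(data[:, n_voxels:], columns=meta_cols)
    return voxels, meta


# Toy array: 3 samples, 3 voxel columns and 1 meta column.
toy = np.arange(12, dtype=float).reshape(3, 4)
voxels_df, meta_df = to_frames(toy, n_voxels=3)
print(voxels_df.shape, meta_df.shape)
```

For Subject 1 the call would be something like `to_frames(dfile["data"][...], n_voxels=4010)`.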