Data
Data for training, validation, and testing should be stored in separate HDF5 files, using the following hierarchical format:

1. First level: A unique identifier, e.g. image ID.
2. Second level: Each entry always contains the following:
   - A group named ROI, which contains a dataset named vol_with_bg: the cropped ROI around the left hippocampus, of size 64x64x64.
   - A dataset named tabular of size 14: the tabular non-image data.
   - A scalar attribute RID with the patient ID.
   - A string attribute VISCODE with ADNI's visit code.
   - Additional attributes, depending on the task:
     - For classification, a string attribute DX containing the diagnosis: CN, MCI, or Dementia.
     - For time-to-dementia analysis, a string attribute event indicating whether conversion to dementia was observed (yes or no), and a scalar attribute time with the time to dementia onset or the time of censoring.
One entry in the resulting HDF5 file should have the following structure:

/1010012                 Group
    Attribute: RID scalar
        Type:      native long
        Data:  1234
    Attribute: VISCODE scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "bl"
    Attribute: DX scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "CN"
    Attribute: event scalar
        Type:      variable-length null-terminated UTF-8 string
        Data:  "no"
    Attribute: time scalar
        Type:      native double
        Data:  123
/1010012/ROI Group
/1010012/ROI/vol_with_bg Dataset {64, 64, 64}
/1010012/tabular         Dataset {14}
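
As a rough illustration of the layout above, the following sketch writes one such entry with h5py and reads it back. It assumes h5py is installed; the helper name write_entry, the file name example_train.h5, and the random placeholder arrays are hypothetical and only stand in for real image and tabular data.

```python
import os
import tempfile

import h5py
import numpy as np


def write_entry(h5file, image_id, volume, tabular, rid, viscode,
                dx=None, event=None, time_value=None):
    """Write one first-level entry (hypothetical helper, not part of any library)."""
    group = h5file.create_group(image_id)
    # ROI group with the cropped volume as its child dataset
    group.create_group("ROI").create_dataset("vol_with_bg", data=volume)
    group.create_dataset("tabular", data=tabular)
    group.attrs["RID"] = rid          # scalar integer attribute
    group.attrs["VISCODE"] = viscode  # Python str is stored as a variable-length UTF-8 string
    if dx is not None:                # classification task
        group.attrs["DX"] = dx
    if event is not None:             # time-to-dementia task
        group.attrs["event"] = event
        group.attrs["time"] = float(time_value)
    return group


# Placeholder data with the documented shapes (not real measurements)
rng = np.random.default_rng(0)
volume = rng.random((64, 64, 64), dtype=np.float32)
tabular = rng.random(14, dtype=np.float32)

example_path = os.path.join(tempfile.mkdtemp(), "example_train.h5")
with h5py.File(example_path, "w") as f:
    write_entry(f, "1010012", volume, tabular, rid=1234, viscode="bl",
                dx="CN", event="no", time_value=123)

# Read the entry back to check the structure
with h5py.File(example_path, "r") as f:
    entry = f["1010012"]
    roi_shape = entry["ROI/vol_with_bg"].shape
    tabular_shape = entry["tabular"].shape
    rid_value = int(entry.attrs["RID"])
    dx_value = entry.attrs["DX"]
```

Both task-specific attribute sets are written here only to mirror the example listing; a file used for a single task would need only the attributes for that task.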
Finally, the HDF5 file should also contain the following meta-information in a separate group named stats:

/stats/tabular           Group
/stats/tabular/columns   Dataset {14}
/stats/tabular/mean      Dataset {14}
/stats/tabular/stddev    Dataset {14}
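
The stats group could be written along the following lines. This is a minimal sketch with several assumptions that the description above does not confirm: the statistics are computed per column over a matrix of tabular features, the standard deviation is the population one (NumPy's default, ddof=0), and the column names feature_0 to feature_13 are placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical column names and placeholder feature matrix (100 samples x 14 features)
columns = [f"feature_{i}" for i in range(14)]
rng = np.random.default_rng(0)
all_tabular = rng.normal(size=(100, 14))

stats_path = os.path.join(tempfile.mkdtemp(), "example_stats.h5")
with h5py.File(stats_path, "w") as f:
    # Intermediate groups ("stats") are created automatically
    stats = f.create_group("stats/tabular")
    # Store the names as variable-length UTF-8 strings
    stats.create_dataset(
        "columns",
        data=np.array(columns, dtype=object),
        dtype=h5py.string_dtype(),
    )
    # Per-column statistics; ddof=0 is an assumption, see the note above
    stats.create_dataset("mean", data=all_tabular.mean(axis=0))
    stats.create_dataset("stddev", data=all_tabular.std(axis=0))

# Read the statistics back to check shapes and values
with h5py.File(stats_path, "r") as f:
    columns_shape = f["stats/tabular/columns"].shape
    stored_mean = f["stats/tabular/mean"][:]
    stored_stddev = f["stats/tabular/stddev"][:]
```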

In [6]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install openpyxl


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os



In [8]:
dataInDir = "./RawDataset/"
mriDataDir = os.path.join(dataInDir, "MRIs")
dataOutDir = "./Data/"

# The clinical features are stored in an Excel workbook, so read_excel is used
# (read_csv cannot parse .xlsx files; read_excel needs openpyxl installed).
tabularDf = pd.read_excel(os.path.join(dataInDir, "Clinical_and_Other_Features.xlsx"))
print(tabularDf.head())

# Iterate through the folders in the MRI data directory.
# Each folder name is used as the patient ID, and the folder holds that patient's MRI files.
for patientDir in os.listdir(mriDataDir):
    print(patientDir)
    patientId = patientDir
    print(f"Processing patient {patientId}")

    # Collect the .dcm files for this patient
    mriImages = []
    for fileName in os.listdir(os.path.join(mriDataDir, patientDir)):
        print(fileName)
        # Loading is still disabled; enabling it requires `import pydicom`
        # mriImages.append(pydicom.dcmread(os.path.join(mriDataDir, patientDir, fileName)))


  Patient Information                 MRI Technical Information  \
0          Patient ID  Days to MRI (From the Date of Diagnosis)   
1                 NaN                                       NaN   
2      Breast_MRI_001                                         6   
3      Breast_MRI_002                                        12   
4      Breast_MRI_003                                        10   

                                          Unnamed: 2  \
0                                       Manufacturer   
1  GE MEDICAL SYSTEMS=0, MPTronic software=1, SIE...   
2                                                  2   
3                                                  0   
4                                                  0   

                                          Unnamed: 3  \
0                            Manufacturer Model Name   
1  Avanto=0, Optima MR450w=1, SIGNA EXCITE=2, SIG...   
2                                                  0   
3                                   