# 1. Preprocessing of XRF hdf files
## Summary
This notebook reduces the raw .hdf XRF files collected at the ID15A beamline (European Synchrotron Radiation Facility) to pre-processed hdf files. These pre-processed files use consistent dataset names and contain only the datasets required to produce deconvoluted 2D XRF data. Each pre-processed file contains:
1. stage coordinates 
    + hrz (z position)
    + hry (y position)
2. spectra per pixel 
    + falconx_det0 (detector 1)
    + fluodet_det0 (detector 2, used in some scans)
3. beam flux measurements local to the sample environment collected at the time interval trigger (for normalisation of XRF spectra intensities)
    + fpico3

### Stitching
As well as reducing hdf datasets to a consistent format, this notebook can stitch together fragmented scans (acquisitions split across multiple scans). Scans are handled as follows:
+ Complete scans are reduced to the key datasets needed for 2D XRF mapping
+ Incomplete scans that need to be stitched together are stitched then reduced to the key datasets needed for 2D XRF mapping.
    This requires that:
    1. Scans to be stitched have at least 1 row of scan overlap (the notebook will detect this via row coordinates stored in the `hrz` dataset of raw hdf files)
    2. Scans to be stitched together are identified in `xrf_scan_metadata.csv` in the raw data directory. Within `xrf_scan_metadata.csv`, a `scanset` column indicates scans that need to be stitched together by a common name. 

All reduced .hdf files are output to `data/processed/xrf/1_reduced_reshaped_hdfs`. 

Output reduced .hdf files are suitable for further processing, including flux normalisation (see notebook 2) or straight deconvolution (see notebook 3) to generate per channel images. 

In [1]:
import pathlib
import numpy as np
import pandas as pd
import h5py

### Set up folder structure
The following cell sets up the folder structure required. Only the base directory may need modifying. 

In the base directory, if a raw subdirectory does not exist it will be made. This raw subdirectory should contain the raw .hdf files of interest and `xrf_scan_metadata.csv`. Within `xrf_scan_metadata.csv`, a column listing the filepath to each raw .hdf file of interest is required. 

A processed subdirectory will be made for the stitched .hdf files. 

In [2]:
# Define base directory to work from 
base_dir = "C:/Users/MerrickS/OneDrive/Work/2_UZH/Papers/1_MEZ_XRF"
base_dir = pathlib.Path(base_dir)

# Make raw directory for raw hdf files if it does not exist
input_dir = base_dir / 'data' / 'raw' / 'xrf'
input_dir.mkdir(parents=True, exist_ok=True)

# Make output directory for reshaped hdf files if it does not exist
out_dir = base_dir / 'data' / 'processed' / 'xrf' / '1_reduced_reshaped_hdfs'
out_dir.mkdir(parents=True, exist_ok=True)
print('Reshaped hdf files output to: \t', out_dir)   

Reshaped hdf files output to: 	 C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\1_reduced_reshaped_hdfs


### Select raw datasets to include in pre-processed hdf files
Specify which datasets to copy from raw data to pre-processed hdf datasets. Datasets will be stored as dictionary keys to which values are ascribed prior to saving into the preprocessed hdf file.
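
To decide which datasets to list, it can help to print every dataset path in one raw file. The sketch below is a minimal example using h5py's `visititems`. The helper name `list_datasets` is hypothetical, and the file it inspects is a small stand-in with made-up shapes written to a temporary directory, not real ID15A data.

```python
import pathlib
import tempfile

import h5py
import numpy as np


def list_datasets(hdf_filepath):
    """Return {dataset path: shape} for every dataset in an hdf file."""
    found = {}

    def collect(name, obj):
        # visititems calls this for every group and dataset; keep datasets only
        if isinstance(obj, h5py.Dataset):
            found[name] = obj.shape

    with h5py.File(hdf_filepath, 'r') as hdf:
        hdf.visititems(collect)
    return found


# Stand-in file (hypothetical shapes) so the sketch runs without beamline data
demo_path = pathlib.Path(tempfile.mkdtemp()) / 'demo_scan.h5'
with h5py.File(demo_path, 'w') as hdf:
    hdf.create_dataset('1.1/measurement/hrz', data=np.zeros(6))
    hdf.create_dataset('1.1/measurement/falconx_det0', data=np.zeros((6, 4096)))

for name, shape in list_datasets(demo_path).items():
    print(name, shape)
```

Any path printed this way can be added to the `hdf_datasets` list in the next cell.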

In [3]:
# Name the hdf datasets to extract and incorporate into the new pre-processed hdf file
# (extras can be added here after inspecting the keys in a raw hdf)
hdf_datasets = ['1.1/measurement/hrz',
                '1.1/measurement/hry',
                '1.1/measurement/fpico3',
                '1.1/measurement/falconx_det0',
                '1.1/measurement/fluodet_det0',
                ]

# Convert datasets list to a dictionary (no values at this point)
hdf_datasets = dict.fromkeys(hdf_datasets)

# Make a dataframe to track preprocessed scans and scansets with deconvolution configs
preprocessed_files_cols = ['hdf_file', 'config_file', 'step_um', 'dual_detector']
df_preprocessed_files = pd.DataFrame(columns = preprocessed_files_cols)

### Identify scans to process
Import a .csv spreadsheet named `xrf_scan_metadata.csv` from the /data/raw/xrf/ subdirectory. This .csv lists all scans of interest and identifies which scans need to be stitched together. 

`xrf_scan_metadata.csv` should include
1. A `scan_name` column which will be used to name the pre-processed dataset
2. A `hdf_filepath` column that points to the raw data location relative to the base directory.
3. A `scanset` column that identifies scans to be stitched together: scans sharing a scanset name are stitched, and that name is used to name the stitched file. Leave it empty for complete scans.
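
As an illustration of this layout, the sketch below parses a small metadata table and separates complete scans from scansets. The three columns above come from this notebook, and `config_file`, `step_um`, `dual_detector` and `detector` are read from the same table by later cells. Every file name and value in the rows is hypothetical.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical metadata rows: file names and values are made up for illustration
example_csv = """scan_name,hdf_filepath,scanset,config_file,step_um,dual_detector,detector
sampleA_overview,data/raw/xrf/sampleA_0001.h5,,sampleA.cfg,5,False,falconx
sampleB_part1,data/raw/xrf/sampleB_0001.h5,sampleB_stitch,sampleB.cfg,2,False,falconx
sampleB_part2,data/raw/xrf/sampleB_0002.h5,sampleB_stitch,sampleB.cfg,2,False,falconx
"""

complete = []                          # scans with an empty scanset cell
scansets_example = defaultdict(list)   # scanset name -> scans to stitch

for row in csv.DictReader(io.StringIO(example_csv)):
    # Scans are identified by the stem of their hdf file name
    stem = PurePosixPath(row['hdf_filepath']).stem
    if row['scanset']:
        scansets_example[row['scanset']].append(stem)
    else:
        complete.append(stem)

print(complete)
print(dict(scansets_example))
```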

The next cell collects filepaths, files, filenames to be processed.


In [10]:
#   Get .hdf scan names and paths from hdf_filepath supplied in raw metadata 
#   csv
df_hdf_metadata = pd.read_csv(input_dir / 'xrf_scan_metadata2.csv')

# Make a dataframe of paths, files and fnames from hdf_filepath to include in 'xrf_scan_metadata_full.csv'
full_fpaths = []
paths = []
files = []
fnames = []
for hdf_filepath in list(df_hdf_metadata['hdf_filepath']):
    hdf_filepath = pathlib.Path(hdf_filepath)
    
    # Resolve relative to the base directory (an absolute path is left unchanged)
    fullpath = base_dir / hdf_filepath
        
    full_fpaths.append(fullpath)
    paths.append(hdf_filepath.parent)
    files.append(hdf_filepath.name)
    fnames.append(hdf_filepath.stem)

df_hdf_metadata['hdf_full_fpaths'] = full_fpaths
df_hdf_metadata['hdf_path'] = paths
df_hdf_metadata['hdf_file'] = files
df_hdf_metadata['hdf_filename'] = fnames

df_hdf_metadata.to_csv((out_dir / 'xrf_scan_metadata_full2.csv'), index=False)

# Check hdf files exist
for filepath in full_fpaths:
    if not filepath.exists():
        print("\t !!!File does not exist!!!:", filepath)


The next cell identifies complete scans and scansets from the imported csv. 

In [12]:
# Get complete scans
df_complete_scans = df_hdf_metadata[df_hdf_metadata['scanset'].isnull()]
complete_scans = df_complete_scans['hdf_filename'].unique().tolist()
print('\n', len(complete_scans), 'complete scans identified for reduction:')    
for scan in complete_scans: print('\t', scan)

# Get incomplete scans that are part of a scanset to stitch
df_scansets = df_hdf_metadata[df_hdf_metadata['scanset'].notnull()]
scansets = df_scansets['scanset'].unique().tolist() # scansets unique names

print('\n', len(scansets), 'scansets identified for stitching and reduction from metadata:')    
for scanset in scansets: print('\t',scanset)

hdfs_to_stitch = [] # list of scans for each scanset
for scanset in scansets:
    scanset_scans = df_scansets.hdf_filename[df_scansets['scanset'] == scanset].to_list()
    hdfs_to_stitch.append(scanset_scans)
    
# Turn scansets to a dictionary for each scanset (keys) of .hdfs (values) 
# that need stitching
scansets = dict(zip(scansets, hdfs_to_stitch)) 

print('\n Scan names for scansets listed in \'scansets\' dictionary')


 8 complete scans identified for reduction:
	 appendix_a1_overview_solid_0001
	 appendix_a1_overview_solid_0002
	 appendix_a1_ROI_solid_0001
	 tonsil_t1_overview_solid_0001
	 tonsil_t1_ROI_solid_0001
	 breast_cancer_b2_solid_overview_0corrected_0003
	 breast_cancer_b2_ROI_solid_ROI_0001
	 breast_cancer_b2_ROI_solid_0003

 0 scansets identified for stitching and reduction from metadata:

 Scan names for scansets listed in 'scansets' dictionary


### Hdf dataset crop functions
Load functions needed to define crop dimensions for scansets. 

In [13]:
#%% This function returns the hrz (number of rows and their coordinates) 
#   for a specified hdf file
def hdf_hrz_list(hdf_filepath): 
    with h5py.File(hdf_filepath, 'r') as hdf:
        hrz_list = list(hdf['1.1/measurement/hrz'])
    return hrz_list

#%% This function reads the named dataset from an hdf file into memory and
#   returns it as a numpy array (or an empty list if the dataset is missing)
def hdf_dataset(hdf_filepath, dataset): 
    with h5py.File(hdf_filepath, 'r') as hdf:
        try:
            dset = hdf[dataset][:]
        except KeyError:
            print(f'\t !!! {dataset} does not exist, could not retrieve dataset')
            dset = []
    return dset

#%% This function identifies the  overlapping rows in a list of hdf5 
#   consecutive scans. It returns a list of lengths by which to crop data 
#   within overlapping scans. This crop_list can be used to slice overlapping
#   scans so they are suitable to stitch together.
def hdf_list_crop(hdf_filepaths):  
    scan_number = len(hdf_filepaths)
    
    # Make a list of hrz position lists
    z_pos = []
    y_cols = []
    for scan in hdf_filepaths:
        scan_filepath = base_dir / scan
        
        with h5py.File(scan_filepath, 'r') as hdf_initialise:
            hdf_z_pos = hdf_hrz_list(scan_filepath)

            # in some .hdfs, the fscan parameters are not recorded, so column number must be calculated
            node = '1.1/instrument/fscan_parameters/fast_npoints'              
            try:
                ycols_in_scan = hdf_initialise[node][()]
            except KeyError:
                ycols_in_scan = hdf_z_pos.count(hdf_z_pos[0]) # get number of columns in a row, by counting number of z_pos in first row       

        z_pos.append(hdf_z_pos)
        y_cols.append(ycols_in_scan)
        print(f'\t Scan: {scan} \n\t has {len(set(hdf_z_pos))} rows & {y_cols[-1]} columns')
        
    # Check scans have the same number of columns that can be stitched
    if y_cols.count(y_cols[0]) == len(y_cols):
        y_cols = y_cols[0]
        print('\t Same number of columns in each scan, so the scans are suitable for stitching',
              '\n\t Proceeding to determine crop dimensions')
    else:
        # Stop here: y_cols is still a list, so the crop arithmetic below would be wrong
        raise ValueError(f'Scans have different numbers of columns ({y_cols}) and cannot be stitched')
    
    # Get hrz rows overlap for successive scans
    z_overlap_rows = []
    for scan in range(scan_number-1): #check overlap between scan pairs, except last scan
        z_overlap = list((set(z_pos[scan])).intersection(set(z_pos[scan+1])))
        z_overlap_rows.append(len(z_overlap))
        print(f'\t Scan {scan+1} overlaps with scan {scan+2} by {z_overlap_rows[-1]} rows.')
        print(f'\t Last {z_overlap_rows[-1]} rows will be dropped from scan {scan+1} in crop list')
    z_overlap_rows.append(0)
        
    # Length of datapoints to keep from each hdf file
    crop_list = []
    for scan, value in enumerate(z_pos):
        rows = len(set(value))
        crop_list.append((rows-z_overlap_rows[scan])*y_cols)
                
    return crop_list

def scannames_to_filepaths(scans):
    filepaths = []
    for scan in scans:
        filepath = df_hdf_metadata.loc[df_hdf_metadata['hdf_filename'] == scan, 'hdf_full_fpaths'].iloc[0]
        filepaths.append(filepath)
    return filepaths

### Get scanset crop dimensions
For identified scansets, define how to crop scans within a scanset to stitch together. 

Scanset crop dimensions will be stored as a `crop_dict` dictionary, with scanset name as key and crop dimensions as value.
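
The crop arithmetic can be followed on a toy case. The sketch below is not the notebook's function: it repeats the same set-intersection idea on two hypothetical scans of 3 columns that share one row. It relies on shared rows being recorded with exactly equal `hrz` values in both scans.

```python
# Hypothetical row (hrz) coordinates, one value per pixel, 3 columns per row
cols = 3
scan_1_hrz = [0.0] * cols + [0.1] * cols + [0.2] * cols + [0.3] * cols  # 4 rows
scan_2_hrz = [0.3] * cols + [0.4] * cols + [0.5] * cols                 # 3 rows

# Rows present in both scans (set intersection of row coordinates)
overlap_rows = len(set(scan_1_hrz) & set(scan_2_hrz))

# Drop the overlapping rows from the end of scan 1; keep every row of the last scan
crop_scan_1 = (len(set(scan_1_hrz)) - overlap_rows) * cols
crop_scan_2 = len(set(scan_2_hrz)) * cols

print(overlap_rows, [crop_scan_1, crop_scan_2])
```

Each crop value is a number of pixels (rows kept × columns), which is why it can be used directly to slice the flat per-pixel datasets.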

In [14]:
crop_dict = {}
for scanset in scansets:
    print(f'Scanset {scanset}:')
    hdf_filepaths = scannames_to_filepaths(scansets[scanset])
    crop_dict.update({scanset:hdf_list_crop(hdf_filepaths)})
           
print('\n\n Crop dimensions for scansets')
for scanset in scansets:
    print('\t', scanset, 'crop dimensions', crop_dict[scanset])

print('\n\n All scanset crop dimensions identified')



 Crop dimensions for scansets


 All scanset crop dimensions identified


### Stitch scansets to pre-processed hdf files
Using the crop dimensions for scans in scansets stored in `crop_dict`, we can now stitch scansets together to the pre-processed datasets.

The following cell first loads the necessary functions for stitching list or array hdf datasets, and the metafunction `stitch_hdf_array_or_list_dset` used by `scanset_reduced_hdf` to produce the new reshaped and reduced datasets. 

Reduced stitched scans are output to the `data/processed/xrf/1_reduced_reshaped_hdfs` directory and identified with the `_stitch` suffix.
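
The stitching step itself is a slice followed by a concatenation along the pixel axis. The sketch below shows this on small hypothetical arrays (4 channels instead of a full spectrum), with NumPy alone and no hdf files.

```python
import numpy as np

# Hypothetical spectra: 12 and 9 pixels, 4 channels each
scan_1 = np.arange(12 * 4).reshape(12, 4)
scan_2 = np.arange(9 * 4).reshape(9, 4) + 1000
crop_dims = [9, 9]  # pixels to keep from each scan, e.g. taken from a crop list

# Slice each scan to its crop length, then join along the pixel axis
pieces = [scan[:n] for scan, n in zip((scan_1, scan_2), crop_dims)]
stitched = np.concatenate(pieces, axis=0)

print(stitched.shape)
```

Collecting the cropped pieces first and concatenating once avoids reallocating the growing array for every scan, which may matter for large scansets.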

In [15]:
#%% This function returns a stitched list dataset for the specified list dataset in 
#   an hdf5. It requires 1) a list of hdf files containing the specified 
#   dataset, 2) a list of lengths to crop each dataset and 3) the dataset name
def stitch_hdf_list_dset(hdf_list, crop_dims, dset): # (list, list, 'string')
    data = []
    for number, scan in enumerate(hdf_list):
        with h5py.File(scan, 'r') as hdf:
            data_add = hdf[f'{dset}'][:]
            data_add = data_add[:crop_dims[number]]
            data.extend(data_add)
    return data

def stitch_hdf_array_dset(hdf_list, crop_dims, dset): 
    for number, scan in enumerate(hdf_list):
        if number == 0:
            with h5py.File(scan, 'r') as hdf:
                data = hdf[f'{dset}'][:]
                data = data[:crop_dims[number]]
        else:    
            with h5py.File(scan, 'r') as hdf:
                data_add = hdf[f'{dset}'][:]
                data_add = data_add[:crop_dims[number]]
                data = np.concatenate((data, data_add), axis=0)
    return data


#%% This function identifies if the dataset to stitch is a list or an array.
#   It then selects the appropriate function to stitch the dataset type
def stitch_hdf_array_or_list_dset(hdf_list, crop_dims, dset):
    print(hdf_list[0])
    with h5py.File(hdf_list[0], 'r') as hdf_initialise:
        # Keep the true shape: 1D datasets are lists, 2D datasets are arrays
        shape = hdf_initialise[dset].shape
    if len(shape) == 1:
        print('list data')
        data = stitch_hdf_list_dset(hdf_list, crop_dims, dset)
    else:
        print('array data')
        data = stitch_hdf_array_dset(hdf_list, crop_dims, dset)        
    return data
   
def scanset_reduced_hdf(scanset, datasets):
    """
    This function creates a new hdf file for acquisitions comprising 
    multiple scans that need stitching.
    """
    hdf_list = scannames_to_filepaths(scansets[scanset])
        
    print('Scans to stitch:')
    for scan in hdf_list:
        print('\t', scan)
    crop_dims = crop_dict[scanset]
    print('Crop dimensions:', crop_dims)
    
    hdf_new_name = base_dir / out_dir / f'{scanset}.h5'
    print(hdf_new_name)
    
    with h5py.File(hdf_new_name, 'w') as hdf_new:     
        for dset in datasets:
            print(f'\t Adding {scanset} {dset} dataset to new hdf file')
            hdf_datasets[dset] = stitch_hdf_array_or_list_dset(hdf_list = hdf_list,
                                                               crop_dims = crop_dims, 
                                                               dset = dset)
            dset_name = dset.rpartition('/')[-1]
                
            hdf_new.create_dataset(dset_name, data=hdf_datasets[dset], compression='gzip')
        
    print(f'\t Preprocessed hdf file for scanset {scanset} output to: \n\t {hdf_new_name}')
    
    
# Stitch scansets together and output the new hdf files
for scanset in scansets:
    print(f'\n Stitching scanset:', scanset)
    scanset_reduced_hdf(scanset = scanset, datasets = hdf_datasets)
    
    config_file = df_scansets.loc[df_scansets['scanset'] == scanset, 'config_file'].iloc[0]
    step_um = df_scansets.loc[df_scansets['scanset'] == scanset, 'step_um'].iloc[0]
    dual_detector = df_scansets.loc[df_scansets['scanset'] == scanset, 'dual_detector'].iloc[0]
    det_type = df_scansets.loc[df_scansets['scanset'] == scanset, 'detector'].iloc[0]
   
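    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    new_row = pd.DataFrame([{'config_file':config_file, 
                             'hdf_file':scanset, 
                             'step_um':step_um,
                             'dual_detector':dual_detector,
                             'detector':det_type
                            }])
    df_preprocessed_files = pd.concat([df_preprocessed_files, new_row], ignore_index=True)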
    
print('\n All scansets finished stitching')



 All scansets finished stitching


### Reduce complete scans to pre-processed hdf files
For complete scans that do not need to be stitched, the following cell first loads the metafunction `scan_reduced_hdf` to produce the reduced pre-processed dataset.

Pre-processed datasets are output to the `data/processed/xrf/1_reduced_reshaped_hdfs` directory


In [16]:
def scan_reduced_hdf(scan, datasets):
    """
    This function creates a new hdf file for acquisitions comprising a single scan.
    """
    hdf_filepath = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'hdf_full_fpaths'].iloc[0]
       
    hdf_new_name = base_dir / out_dir / f'{scan}.h5'
    with h5py.File(hdf_new_name, 'w') as hdf_new:
        for dset in datasets:
            print(f'\t Adding {scan} {dset} dataset to new hdf file')
            # Use a local variable so the shared hdf_datasets dictionary is not modified
            data = hdf_dataset(hdf_filepath = hdf_filepath, dataset = dset)
            dset_name = dset.rpartition('/')[-1]
            hdf_new.create_dataset(dset_name, data=data, compression='gzip')
    print(f'\t Preprocessed hdf file for scan {scan} output to: \n\t {hdf_new_name}')
    
for scan in complete_scans:
    print('\n Preprocessing scan:', scan)
    scan_reduced_hdf(scan = scan, datasets=hdf_datasets)
                
    config_file = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'config_file'].iloc[0]
    step_um = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'step_um'].iloc[0]
    dual_detector = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'dual_detector'].iloc[0]
    det_type = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'detector'].iloc[0]

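    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    new_row = pd.DataFrame([{'config_file':config_file, 
                             'hdf_file':scan, 
                             'step_um':step_um,
                             'dual_detector':dual_detector,
                             'detector':det_type
                            }])
    df_preprocessed_files = pd.concat([df_preprocessed_files, new_row], ignore_index=True)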
       
print('\n All scans finished reducing')


 Preprocessing scan: appendix_a1_overview_solid_0001
	 Adding appendix_a1_overview_solid_0001 1.1/measurement/hrz dataset to new hdf file
	 Adding appendix_a1_overview_solid_0001 1.1/measurement/hry dataset to new hdf file
	 Adding appendix_a1_overview_solid_0001 1.1/measurement/fpico3 dataset to new hdf file
	 Adding appendix_a1_overview_solid_0001 1.1/measurement/falconx_det0 dataset to new hdf file
	 Adding appendix_a1_overview_solid_0001 1.1/measurement/fluodet_det0 dataset to new hdf file
	 !!! 1.1/measurement/fluodet_det0 does not exist, could not retrieve dataset
	 Preprocessed hdf file for scan appendix_a1_overview_solid_0001 output to: 
	 C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\1_reduced_reshaped_hdfs\appendix_a1_overview_solid_0001.h5

 Preprocessing scan: appendix_a1_overview_solid_0002
	 Adding appendix_a1_overview_solid_0002 1.1/measurement/hrz dataset to new hdf file
	 Adding appendix_a1_overview_solid_0002 1.1/measurement/hry dataset to

MemoryError: Unable to allocate 95.4 GiB for an array with shape (6250000, 4096) and data type uint32
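
The size in this error follows directly from the array shape and data type, because `hdf_dataset` reads the whole dataset into memory with `[:]`. The short check below reproduces the figure from the values in the error message.

```python
# Values taken from the error message above
n_pixels, n_channels = 6_250_000, 4096
bytes_per_value = 4  # uint32 is 4 bytes

n_bytes = n_pixels * n_channels * bytes_per_value
print(f'{n_bytes / 2**30:.1f} GiB')
```

The next cell avoids this allocation by copying datasets from file to file with h5py's `copy` rather than reading them into a NumPy array first.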

In [73]:
def scan_reduced_hdf_large(scan, datasets):
    """
    This function creates a new hdf file for acquisitions comprising a single scan.
    Datasets are copied from file to file, so large scans are not loaded into memory.
    """
    hdf_filepath = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'hdf_full_fpaths'].iloc[0]
    print(hdf_filepath)
       
    hdf_new_name = base_dir / out_dir / f'{scan}.h5'
    print(hdf_new_name)
    
    datasets = list(datasets)  # dataset paths to copy (the keys, if a dictionary is passed)
        
    with h5py.File(hdf_filepath,'r') as f_src:
        with h5py.File(hdf_new_name,'w') as f_dest:

            for dset in datasets:
                print("dataset", dset)

                # Copy the dataset to the new hdf file if it exists in the source
                # (missing datasets are skipped silently)
                if dset in f_src:
                    print(dset, "exists")
                    new_node = dset.split('/')[-1]
                    f_src.copy(f_src[dset], f_dest, new_node)

    print(f'\t Preprocessed hdf file for scan {scan} output to: \n\t {hdf_new_name}')

for scan in complete_scans:
    print('\n Preprocessing scan:', scan)
    scan_reduced_hdf_large(scan = scan, datasets=hdf_datasets)
                
    config_file = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'config_file'].iloc[0]
    step_um = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'step_um'].iloc[0]
    dual_detector = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'dual_detector'].iloc[0]
    det_type = df_complete_scans.loc[df_complete_scans['hdf_filename'] == scan, 'detector'].iloc[0]

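    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    new_row = pd.DataFrame([{'config_file':config_file, 
                             'hdf_file':scan, 
                             'step_um':step_um,
                             'dual_detector':dual_detector,
                             'detector':det_type
                            }])
    df_preprocessed_files = pd.concat([df_preprocessed_files, new_row], ignore_index=True)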
       
print('\n All scans finished reducing')




 Preprocessing scan: appendix_a1_overview_solid_0001
C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\raw\xrf\scans\appendix_a1_overview_solid\appendix_a1_overview_solid_0001\appendix_a1_overview_solid_0001.h5
C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\1_reduced_reshaped_hdfs\appendix_a1_overview_solid_0001.h5
dataset 1.1/measurement/hrz
1.1/measurement/hrz exists
dataset 1.1/measurement/hry
1.1/measurement/hry exists
dataset 1.1/measurement/fpico3
1.1/measurement/fpico3 exists
dataset 1.1/measurement/falconx_det0
1.1/measurement/falconx_det0 exists
dataset 1.1/measurement/fluodet_det0
	 Preprocessed hdf file for scan appendix_a1_overview_solid_0001 output to: 
	 C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\1_reduced_reshaped_hdfs\appendix_a1_overview_solid_0001.h5

 Preprocessing scan: appendix_a1_overview_solid_0002
C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\raw\xrf\scans\appendix_a1_overview_sol

### Scans with different dataset structure
During acquisition, some raw .hdf scans store their datasets under the `2.1/measurement` hdf node rather than `1.1/measurement`. For these scans (which can be identified by the small size of their reduced files from the prior step), hdf reduction needs to be pointed to this different dataset structure. 

For scans listed in `scans_in_2pt1`, reduced .hdfs are rederived in the following cell by defining this structure. 
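
Retargeting the dataset paths amounts to swapping the first path component. The sketch below uses a hypothetical helper, `retarget_entry`, that changes only the leading entry name. A plain string replacement of `'1.1'` would also alter any later part of a path that happens to contain `1.1`.

```python
def retarget_entry(dataset_path, new_entry='2.1'):
    """Swap the leading entry name (e.g. '1.1') of an hdf dataset path."""
    _old_entry, _sep, rest = dataset_path.partition('/')
    return f'{new_entry}/{rest}'


paths_1pt1 = ['1.1/measurement/hrz', '1.1/measurement/falconx_det0']
paths_2pt1 = [retarget_entry(p) for p in paths_1pt1]
print(paths_2pt1)
```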

In [10]:
scans_in_2pt1 = ['sample107_t_0001', 'sample304_0017', 'sample304_b_0004']

# Switch dataset directory from 1.1 to 2.1
hdf_datasets_2 = list(hdf_datasets)
hdf_datasets_2 = [dset.replace('1.1', '2.1') for dset in hdf_datasets_2]
hdf_datasets_2 = dict.fromkeys(hdf_datasets_2)

for scan in scans_in_2pt1:
    print('\n Preprocessing scan:', scan)
    scan_reduced_hdf(scan = scan, datasets=hdf_datasets_2)
    
print('\n Irregular complete scans processed to pre-processed directory')


 Preprocessing scan: sample107_t_0001
	 Adding sample107_t_0001 2.1/measurement/hrz dataset to new hdf file
	 Adding sample107_t_0001 2.1/measurement/hry dataset to new hdf file
	 Adding sample107_t_0001 2.1/measurement/fpico3 dataset to new hdf file
	 !!! 2.1/measurement/fpico3 does not exist, could not retrieve dataset
	 Adding sample107_t_0001 2.1/measurement/falconx_det0 dataset to new hdf file
	 Adding sample107_t_0001 2.1/measurement/fluodet_det0 dataset to new hdf file
	 Preprocessed hdf file for scan sample107_t_0001 output to: 
	 C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\1_reduced_reshaped_hdfs\sample107_t_0001.h5

 Preprocessing scan: sample304_0017
	 Adding sample304_0017 2.1/measurement/hrz dataset to new hdf file
	 Adding sample304_0017 2.1/measurement/hry dataset to new hdf file
	 Adding sample304_0017 2.1/measurement/fpico3 dataset to new hdf file
	 Adding sample304_0017 2.1/measurement/falconx_det0 dataset to new hdf file
	 Adding sample

Export a csv that tracks the preprocessed files alongside the information needed for deconvolution and later processing.

In [74]:
df_preprocessed_files.to_csv((out_dir / 'preprocessed_hdf_config_files2.csv'), index=False)

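Generate an fpico mask for each scan: the `fpico3` flux readings reshaped to the scan's rows and columns, so that later plots can be normalised by X-ray flux. Each mask is added to its reduced, reshaped .h5 file as the `fpico_mask` dataset. 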
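
The reshape behind the mask can be seen on a toy grid. The sketch below uses hypothetical coordinates and flux values and assumes the flat data are ordered row by row, with one flux reading per pixel.

```python
import numpy as np

rows, cols = 3, 4
# Hypothetical flat per-pixel data, assumed to be ordered row by row
hrz = np.repeat([0.0, 0.1, 0.2], cols)        # row coordinate of each pixel
hry = np.tile([0.0, 0.5, 1.0, 1.5], rows)     # column coordinate of each pixel
fpico3 = np.linspace(0.9, 1.1, rows * cols)   # one flux reading per pixel

# Rows from the unique row coordinates, columns from the pixels per row
n_rows = len(np.unique(hrz))
n_cols = len(hry) // n_rows
fpico_mask = fpico3.reshape((n_rows, n_cols))

print(fpico_mask.shape)
```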

In [87]:
# hdf_img_fpaths = [i for i in out_dir.glob('*.h5')]
hdf_img_fpaths = [list(out_dir.glob(f'{i}.h5'))[0] for i in df_preprocessed_files['hdf_file']]

def hdf_fpico_mask(hdf_img_fpath):
    with h5py.File(hdf_img_fpath, 'r+') as hdf:    
        fpico3 = hdf['fpico3'][:]
        z = list(hdf['hrz'][:])
        y = list(hdf['hry'][:])
        
        rows = len(np.unique(z))
        cols = int(len(y)/rows)
        
        if len(fpico3) > 10:
            fpico_mask = fpico3.reshape((rows, cols))
            
            if 'fpico_mask' in hdf:
                hdf['fpico_mask'][...] = fpico_mask
            else:
                hdf.create_dataset(name = 'fpico_mask', data = fpico_mask)

            return fpico_mask
    
for hdf_img_fpath in hdf_img_fpaths:
    fpico_mask = hdf_fpico_mask(hdf_img_fpath)


ValueError: cannot reshape array of size 219352 into shape (176,1246)
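
The reshape fails because 176 × 1246 = 219296, which is 56 fewer than the 219352 values in the array, so the data do not fill a whole number of rows. One possible guard is sketched below. It is an assumption, not the notebook's fix: it trims the trailing values that do not fill a complete grid and reports how many were dropped. The helper name `reshape_complete_rows` is hypothetical, and whether trimming is appropriate depends on why the scan holds the extra points.

```python
import numpy as np


def reshape_complete_rows(values, rows):
    """Reshape flat per-pixel values to (rows, cols), dropping any ragged tail.

    Returns the 2D array and the number of trailing values that were dropped.
    """
    values = np.asarray(values)
    cols = len(values) // rows
    n_keep = rows * cols
    return values[:n_keep].reshape((rows, cols)), len(values) - n_keep


# Same sizes as in the error message above
mask, n_dropped = reshape_complete_rows(np.ones(219352), rows=176)
print(mask.shape, n_dropped)
```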

Add the scan step size to each preprocessed hdf file.

In [69]:
for hdf_img_fpath in hdf_img_fpaths:
    step = df_preprocessed_files.loc[df_preprocessed_files['hdf_file'] == hdf_img_fpath.stem, 'step_um'].iloc[0]
    
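    with h5py.File(hdf_img_fpath, 'r+') as hdf:
        # Remove an existing step_um so the cell can be run more than once
        if 'step_um' in hdf:
            del hdf['step_um']
        hdf.create_dataset(name = 'step_um', data = step)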


NameError: name 'hdf_img_fpaths' is not defined