# 3. XRF normalisation and hdf repacking
## Summary
This notebook normalises the deconvoluted XRF channel images produced by `2_2D_XRF_deconvolution.ipynb`. Two normalisations are applied: beam intensity normalisation of pixels within an image, and scatter intensity normalisation across a set of images.

Raw and normalised datasets are also output as an `hdf` file that contains:
+ **Sample/scan metadata**
+ **2D channel plots as 3D datasets** `[x, y, channels]` under `hdf[images]`
    + Raw in `hdf[images/raw]` and normalised in `hdf[images/normalised]`
+ **Channel metadata** as a `pandas.DataFrame` exported by `pandas.DataFrame.to_hdf` to `hdf[images/df_channel_metadata]`
    + Each row index of the DataFrame corresponds to a channel in the 3D image stack (`[x, y, index]`), so channel-specific information can be stored column-wise (e.g. experiment design information such as XRF edge names, or flags indicating which channels should be used for segmentation in downstream analysis).
+ **2D masks** as 2D array datasets `[x, y]` under `hdf[masks]`

This repacked hdf structure is useful for storing linked image and mask data. It simplifies sharing images together with their associated masks, and makes it straightforward to run downstream measurements on a selected image stack (raw or normalised) with a chosen mask (e.g. a cell segmentation mask).
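
The link between DataFrame rows, stack channels and masks can be illustrated with toy arrays. This is a minimal sketch and not the notebook's own code: the column names (`channel_name`, `use_for_segmentation`) and the toy data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy image stack with the [x, y, channels] layout described above
rng = np.random.default_rng(0)
image_stack = rng.random((4, 5, 3))

# One metadata row per channel: row i describes image_stack[:, :, i]
# (column names are hypothetical, chosen for this example)
df_channel_metadata = pd.DataFrame({
    'channel_name': ['Fe_K', 'Zn_K', 'Scatter_Compton'],
    'use_for_segmentation': [True, True, False],
})

# Toy boolean mask with the [x, y] layout
mask = np.zeros((4, 5), dtype=bool)
mask[1:3, 1:4] = True

# The row index doubles as the channel index along the last axis
seg_idx = df_channel_metadata.index[
    df_channel_metadata['use_for_segmentation']
].to_numpy()
seg_stack = image_stack[:, :, seg_idx]

# Mean intensity of every channel inside the masked region,
# stored column-wise next to the other channel information
masked_means = image_stack[mask].mean(axis=0)
df_channel_metadata['masked_mean'] = masked_means
```

Because the row index and the channel axis share the same ordering, any per-channel measurement can be written back as a new column without extra bookkeeping.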

In [1]:
import pathlib
import numpy as np
import pandas as pd

import h5py
import utilities
from pymca_repack import unpack_pymca_h5, XrfImageMaskHDF

Comparison datasets are defined by the `scatter_set` column of the input scan metadata CSV.

In [2]:
# Set data directory to work from 
base_dir = "C:/Users/MerrickS/OneDrive/Work/2_UZH/Papers/1_MEZ_XRF"
base_dir = pathlib.Path(base_dir)
base_sub_dir = base_dir / 'data' / 'processed' / 'xrf'

# Specify the input directory where hdf files to process are located
hdf_dir = base_sub_dir / '2_deconvoluted_hdfs' / 'summary_hdfs'

# Gather filepaths of the deconvoluted hdfs to process
hdf_filepaths = list(hdf_dir.glob('*.h5'))

# Make output directory for the normalised, repacked hdfs
out_dir = base_sub_dir / '3_norm_repacked_XRF_hdfs'
out_dir.mkdir(parents=True, exist_ok=True)
print('Repacked hdf files will be output to: \n\t', out_dir) 

# Read in scans and original scan metadata
df_hdf_files = pd.read_csv(base_sub_dir / '1_reduced_reshaped_hdfs' / 'preprocessed_hdf_config_files.csv')
df_hdf_metadata = pd.read_csv(base_sub_dir / '1_reduced_reshaped_hdfs' / 'xrf_scan_metadata_full.csv')

Repacked hdf files will be output to: 
	 C:\Users\MerrickS\OneDrive\Work\2_UZH\Papers\1_MEZ_XRF\data\processed\xrf\3_norm_repacked_XRF_hdfs


The following cell computes, for each image set defined in the `scatter_set` column of the scan metadata, the maximum of the per-image mean Compton scatter intensity. This maximum is used to normalise scatter intensities across the images in a set.

In [3]:
# One entry per scatter set; None until the first image of that set is seen
scatter_set_max = dict.fromkeys(df_hdf_metadata['scatter_set'].unique())

for hdf_fpath in hdf_filepaths:
    # Stitched scans are matched on 'scanset', single scans on 'hdf_filename'
    key_col = 'scanset' if 'stitch' in hdf_fpath.stem else 'hdf_filename'
    scatter_set = df_hdf_metadata.loc[df_hdf_metadata[key_col] == hdf_fpath.stem, 'scatter_set'].iloc[0]

    with h5py.File(hdf_fpath, 'r') as hdf:
        dset = hdf[f'{hdf_fpath.stem}/plotselect/Scatter_Compton000'][()]
        #dset = hdf[f'{hdf_fpath.stem}/plotselect/Scatter_Peak000'][()]

        scatter_mean = np.mean(dset)

    # Keep the largest per-image mean scatter seen so far for this set
    current_max = scatter_set_max[scatter_set]
    if current_max is None or scatter_mean > current_max:
        scatter_set_max[scatter_set] = scatter_mean
        
print(scatter_set_max)

{1: 166.755, 2: 99.71418, 3: 162.23982, 4: 152.12837, 5: 99.73146, 6: 776.9347, 7: 2194.7812, 8: 286.3401, 9: 414.77716, 10: 289.1419, 11: 273.66733, 12: 286.18423, 13: 241.93733, 14: 227.2852, 15: 234.31715, 16: 656.75146, 17: 246.39746, 18: 106.02408, 19: 105.89993}
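
How the set-wide maximum is applied is internal to `XrfImageMaskHDF.scatter_normalise` and is not shown in this notebook. As a minimal sketch of one plausible convention, the function below scales an image stack so that its mean scatter matches the maximum of its set. The function name, its arguments and the direction of the scaling are assumptions for illustration, not the library's confirmed behaviour.

```python
import numpy as np

def scatter_normalise_sketch(image_stack, scatter_image, set_max):
    """Scale a [x, y, channels] stack so its mean scatter equals set_max.

    Hypothetical helper: the real implementation may differ.
    """
    scatter_mean = float(np.mean(scatter_image))
    if scatter_mean == 0:
        # A zero factor would make the division below undefined
        raise ValueError('Mean scatter is zero; cannot normalise.')
    # Factor < 1 for images with less scatter than the set maximum
    scatter_factor = scatter_mean / set_max
    return image_stack / scatter_factor

# Toy example: an image with half the scatter of the set maximum
image_stack = np.full((2, 2, 2), 10.0)
scatter_image = np.full((2, 2), 50.0)
normalised = scatter_normalise_sketch(image_stack, scatter_image, set_max=100.0)
```

Under this convention the image with the highest mean scatter in a set is left unchanged (factor of 1), and all other images in the set are scaled up towards it.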


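Repack each deconvoluted hdf: unpack the image stack, channel metadata and mask, apply scatter and beam intensity normalisation, and export the result as `<filename>_normalised.h5`.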
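
Beam intensity normalisation corrects for pixel-to-pixel variation of the incident beam within a single image. The notebook delegates this to `XrfImageMaskHDF.beam_intensity_normalise`, whose internals are not shown here. The sketch below illustrates the general idea under stated assumptions: a hypothetical per-pixel flux map `flux_map` with the same `[x, y]` shape as the image is available, and each pixel is divided by its flux relative to the mean flux of the image.

```python
import numpy as np

def beam_intensity_normalise_sketch(image_stack, flux_map):
    """Divide each pixel by its incident flux relative to the image mean.

    Hypothetical helper: `flux_map` is an assumed [x, y] array of
    per-pixel beam intensity; pixels without a positive flux reading
    are left unscaled.
    """
    flux_map = np.asarray(flux_map, dtype=float)
    valid = flux_map > 0
    relative_flux = np.ones_like(flux_map)
    if valid.any():
        relative_flux[valid] = flux_map[valid] / flux_map[valid].mean()
    # Broadcast the [x, y] correction over the channel axis
    return image_stack / relative_flux[:, :, np.newaxis]

# Toy example: one channel, one pixel with a missing flux reading
image_stack = np.full((2, 2, 1), 8.0)
flux_map = np.array([[1.0, 1.0], [2.0, 0.0]])
beam_normalised = beam_intensity_normalise_sketch(image_stack, flux_map)
```

Dividing by the relative rather than the absolute flux keeps the normalised intensities on the same scale as the raw data.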

In [4]:
for hdf_fpath in hdf_filepaths:
    # Extract key hdf XRF properties from deconvoluted hdf file
    image_stack, df_plots, fpico_mask = unpack_pymca_h5(hdf_fpath)

    images = {'raw':image_stack}
    masks = {'fpico_mask':fpico_mask}      
    
    # Collect experiment channel info where available
    panel_dir = base_dir / 'data' / 'raw' / 'antibody_panels'
    sample_metadata, df_panel = utilities.get_hdf_metadata(
        hdf_fpath, 
        df_hdf_metadata, 
        panel_dir
    )
    
    df_plots_panel = utilities.get_hdf_full_plot_df(df_plots, df_panel)

    # Repack key hdf XRF properties and generate normalised XRF image stacks
    hdf_repack = XrfImageMaskHDF(
        images=images, 
        channel_metadata=df_plots_panel, 
        masks=masks,
        sample_metadata=sample_metadata
    )
    
    hdf_repack.scatter_normalise(scatter_set_max)
    hdf_repack.beam_intensity_normalise()

    output_fpath = out_dir / f'{hdf_fpath.stem}_normalised.h5'
    hdf_repack.export_hdf(output_fpath=output_fpath)


001_002_stitch.h5 is a stitch file
epithelial_cell_pellet_panel is matched to sample 001_002_stitch.h5
001_004_stitch.h5 is a stitch file
epithelial_cell_pellet_panel is matched to sample 001_004_stitch.h5
304_007_stitch.h5 is a stitch file
No panel for 304_007_stitch.h5
304_009_stitch.h5 is a stitch file
No panel for 304_009_stitch.h5
AXO_thin_film_3_0001.h5 is a non stitch file
No panel for AXO_thin_film_3_0001.h5
AXO_thin_film_3_0002.h5 is a non stitch file
No panel for AXO_thin_film_3_0002.h5
AXO_thin_film_3_0003.h5 is a non stitch file
No panel for AXO_thin_film_3_0003.h5
axo_thin_film_c00_0005.h5 is a non stitch file
No panel for axo_thin_film_c00_0005.h5


  image_norm[:, :, i] = image_norm[:, :, i]/scatter_factor
  image_norm[:, :, i] = image_norm[:, :, i]/scatter_factor


AXO_thin_film_C00_006_0006.h5 is a non stitch file
No panel for AXO_thin_film_C00_006_0006.h5
blank_env_1000Hz_0001.h5 is a non stitch file
No panel for blank_env_1000Hz_0001.h5
blank_env_10Hz_0001.h5 is a non stitch file
No panel for blank_env_10Hz_0001.h5
blank_env_250Hz_0001.h5 is a non stitch file
No panel for blank_env_250Hz_0001.h5
blank_env_500Hz_0001.h5 is a non stitch file
No panel for blank_env_500Hz_0001.h5
blank_env_50Hz_0001.h5 is a non stitch file
No panel for blank_env_50Hz_0001.h5
blank_env_800Hz_0001.h5 is a non stitch file
No panel for blank_env_800Hz_0001.h5
blank_tower_1000Hz_0001.h5 is a non stitch file
No panel for blank_tower_1000Hz_0001.h5
blank_tower_10Hz_0001.h5 is a non stitch file
No panel for blank_tower_10Hz_0001.h5
blank_tower_250Hz_0001.h5 is a non stitch file
No panel for blank_tower_250Hz_0001.h5
blank_tower_500Hz_0001.h5 is a non stitch file
No panel for blank_tower_500Hz_0001.h5
blank_tower_50Hz_0001.h5 is a non stitch file
No panel for blank_tower_5