Skip to content
/ CS3 Public

Component versus Sample Spectral Similarity (CS3) for EEM-PARAFAC model verification

Notifications You must be signed in to change notification settings

MRPHarris/CS3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Component versus Sample Spectral Similarity (CS3) in R

Current main branch version: 0.1.2

Current dev branch verison: 0.1.2

The CS3 package is an implementation of the Component-Versus-Sample Spectral Similarity (CS3) approach for three-dimensional fluorescence (Excitation-Emission Matrix; EEM) model verification in the R statistical programming langauge. The goal of the package is to facilitate intuitive Parallel Factor Analysis (PARAFAC) model validation, building upon the R fluorescence framework provided by the EEM, eemR, and staRdom packages.

All code relates to the manuscript currently under review in Analytical Methods titled CS3: a new diagnostic tool for EEM-PARAFAC component superposition in environmental fluorescence spectroscopy.

The current github repo is https://github.com/MRPHarris/CS3.

Package framework

This package operates almost entirely around a single function, per_eem_ssc. This function compares one or more PARAFAC components with constituent EEM spectra at the target component peak wavelength position. This provides a way to meaningfully evaluate the way in which a PARAFAC model minimises residuals on a per-component basis. It is particularly useful at diagnosing under-fitting, as it is sensitive to the presence of residuals sharing the EEM space of a component.

Four corrections are applied during this operation:

  1. Spectral correction. During spectral correction, the contribution from non-target components is subtracted from the sample spectra being compared. In this manner, the target component is compared only against sample + residual fluorescence, and similarity differences are not contributed to by other components.

  2. Interpolation. Spectra are interpolated to an evenly spaced 1nm bandwidth.

  3. Savitzky-Golay smoothing. Raw sample spectra are noisy - part of the strength of PARAFAC is that this noise can be removed. Thus in directly comparing PARAFAC and sample spectra, it is often necessary to smooth the data to some degree. SG smoothing preserves the shape of the spectra whilst dispensing with a lot of noisiness.

  4. Incomplete excitation peak removal. The shift and shape sensitive congruence metric was developed by Wunsch et al., 2019, in order to better compare excitation spectra. However, this metric is also useful in picking up small excitation spectral differences. To apply it, it is necessary to remove incomplete peaks so as not to bias the peak position penalty term (incorrect assignment of peaks). A gradient-based peak detection and trimming method based upon Murphy (2011) is applied for this purpose. SSC comparisons are made only to trimmed spectra.

After passing through some combination of these corrections, spectra are evaluated for TCC and/or SSC, and the results returned. High metric values indicate that a component strongly matches the spectra in the EEMs it claims it is explaining. Low values indicate that this component is poorly fitted. However, the values should not be averaged and taken at face value, as this would essentially be a sub-par split-half test. Instead, the values should be plotted against other model performance metrics along with the component loadings, and used to infer where the model is falling down. Are there specific samples where the model and sample similarity falls apart? Why might this be? Is spectral averaging occurring; are more components required?

Method limit handling

Three functions are included that can be used to create method limit (background, detection, and quantification limits) EEM objects. Pure water blank EEM measurements associated with a given instrument configuration can be supplied to these functions to create an object of the desired type. These objects can also be passed to the per_eem_ssc function to optionally exclude samples returning a below-threshold (LoQ, MDL, AB) intensity for a target component. This can reduce the influence of noise on similarity scores, creating a more valid measure of component superposition.

Per-component and per-model scores

The per_eem_ssc function will produce a data frame of similarity scores for a given component across all samples. When a multi-component PARAFAC model is supplied, each component will be analysed and the results nested within a list, each list element corresponding to a component. The function combined_pessc_scores can be supplied with one of these data frames to produce a whole-component score, with an option to exclude samples exhibiting an intensity beneath that component’s LoQ (to use this functionality, an LoQ object must be supplied to per_eem_ssc for score generation). The investigator can perform this operation for all component results for a given model (e.g., iterate along list elements returned by per_eem_ssc), then take the mean of the resulting scores to derive a whole-model similarity score. Whole-model scores will typically reach a plateau once optimal decomposition is achieved in test-case datasets - however, this behaviour may not translate to more dynamic and complex EEM datasets.

Using the package

The per_eem_ssc only requires two things: a PARAFAC model object, returned from staRdom::eem_parafac(), and the eemlist used to generate the model. An optional (and recommended) addition is to create a quantification limit EEM object from the blanks accompanying the EEM dataset, supplied to create_MQL_EEM. The eemlist/s must be compliant with the staRdom/EEM/eemR fluorescence framework for R (see Pucher et al., 2019). Supply these parameters to the main function, and you’re good to go.

A note on code from Murphy (2011) for gradient peak detection

RamanIntegrationRange originally mentioned in this paper: Murphy, K. R.: A Note on Determining the Extent of the Water Raman Peak in Fluorescence Spectroscopy, Appl. Spectrosc., 65(2), 233-236, doi:10.1366/10-06136, 2011.

Subsequently, the function was integrated into the drEEM toolbox: Murphy, K. R., Stedmon, C. A., Graeber, D. and Bro, R.: Fluorescence spectroscopy and multi-way techniques. PARAFAC, Anal. Methods, 5, 6557-6566, 2013.

The function was then extracted from drEEM in MATLAB using <open RamanIntegrationRange.m> Code was then ported to R, using equivalent commands. e.g. forecast::ma() instead of smooth(). Where it was possible, functions that preserved the ‘style’ of the code were used. E.g. discrete numerical gradient with pracma::gradient(). The internal (non-exported) function get_gradient_vals can be found in R -> gradient_helper_functions, and is responsible for discrete gradient data generation.

References

Murphy, K. R. (2011). A Note on Determining the Extent of the Water Raman Peak in Fluorescence Spectroscopy. Applied Spectroscopy, 65(2), 233–236. https://doi.org/10.1366/10-06136

Pucher, M., Wünsch, U., Weigelhofer, G., Murphy, K., Hein, T., & Graeber, D. (2019). staRdom: Versatile Software for Analyzing Spectroscopic Data of Dissolved Organic Matter in R. Water, 11, 2366. https://doi.org/10.3390/w11112366

Wünsch, U. J., Bro, R., Stedmon, C. A., Wenig, P., & Murphy, K. R. (2019). Emerging patterns in the global distribution of dissolved organic matter fluorescence. Analytical Methods, 11(7), 888–893. https://doi.org/10.1039/C8AY02422G

Agostino A., Rao N. R. H., Paul S., Zhang, Z., Leslie, G., Le-Clech, P., Henderson, R. (2021). Polymer leachates emulate naturally derived fluorescent dissolved organic matter: Understanding and managing sample container interferences. Water Research, 204. https://doi.org/10.1016/j.watres.2021.117614

About

Component versus Sample Spectral Similarity (CS3) for EEM-PARAFAC model verification

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages