Current main branch version: 0.1.2
Current dev branch verison: 0.1.2
The CS3 package is an implementation of the Component-Versus-Sample Spectral Similarity (CS3) approach for three-dimensional fluorescence (Excitation-Emission Matrix; EEM) model verification in the R statistical programming langauge. The goal of the package is to facilitate intuitive Parallel Factor Analysis (PARAFAC) model validation, building upon the R fluorescence framework provided by the EEM, eemR, and staRdom packages.
All code relates to the manuscript currently under review in Analytical Methods titled CS3: a new diagnostic tool for EEM-PARAFAC component superposition in environmental fluorescence spectroscopy.
The current github repo is https://github.com/MRPHarris/CS3.
This package operates almost entirely around a single function,
per_eem_ssc
. This function compares one or more PARAFAC components
with constituent EEM spectra at the target component peak wavelength
position. This provides a way to meaningfully evaluate the way in which
a PARAFAC model minimises residuals on a per-component basis. It is
particularly useful at diagnosing under-fitting, as it is sensitive to
the presence of residuals sharing the EEM space of a component.
Four corrections are applied during this operation:
-
Spectral correction. During spectral correction, the contribution from non-target components is subtracted from the sample spectra being compared. In this manner, the target component is compared only against sample + residual fluorescence, and similarity differences are not contributed to by other components.
-
Interpolation. Spectra are interpolated to an evenly spaced 1nm bandwidth.
-
Savitzky-Golay smoothing. Raw sample spectra are noisy - part of the strength of PARAFAC is that this noise can be removed. Thus in directly comparing PARAFAC and sample spectra, it is often necessary to smooth the data to some degree. SG smoothing preserves the shape of the spectra whilst dispensing with a lot of noisiness.
-
Incomplete excitation peak removal. The shift and shape sensitive congruence metric was developed by Wunsch et al., 2019, in order to better compare excitation spectra. However, this metric is also useful in picking up small excitation spectral differences. To apply it, it is necessary to remove incomplete peaks so as not to bias the peak position penalty term (incorrect assignment of peaks). A gradient-based peak detection and trimming method based upon Murphy (2011) is applied for this purpose. SSC comparisons are made only to trimmed spectra.
After passing through some combination of these corrections, spectra are evaluated for TCC and/or SSC, and the results returned. High metric values indicate that a component strongly matches the spectra in the EEMs it claims it is explaining. Low values indicate that this component is poorly fitted. However, the values should not be averaged and taken at face value, as this would essentially be a sub-par split-half test. Instead, the values should be plotted against other model performance metrics along with the component loadings, and used to infer where the model is falling down. Are there specific samples where the model and sample similarity falls apart? Why might this be? Is spectral averaging occurring; are more components required?
Three functions are included that can be used to create method limit
(background, detection, and quantification limits) EEM objects. Pure
water blank EEM measurements associated with a given instrument
configuration can be supplied to these functions to create an object of
the desired type. These objects can also be passed to the per_eem_ssc
function to optionally exclude samples returning a below-threshold (LoQ,
MDL, AB) intensity for a target component. This can reduce the influence
of noise on similarity scores, creating a more valid measure of
component superposition.
The per_eem_ssc
function will produce a data frame of similarity
scores for a given component across all samples. When a multi-component
PARAFAC model is supplied, each component will be analysed and the
results nested within a list, each list element corresponding to a
component. The function combined_pessc_scores
can be supplied with one
of these data frames to produce a whole-component score, with an option
to exclude samples exhibiting an intensity beneath that component’s LoQ
(to use this functionality, an LoQ object must be supplied to
per_eem_ssc
for score generation). The investigator can perform this
operation for all component results for a given model (e.g., iterate
along list elements returned by per_eem_ssc
), then take the mean of
the resulting scores to derive a whole-model similarity score.
Whole-model scores will typically reach a plateau once optimal
decomposition is achieved in test-case datasets - however, this
behaviour may not translate to more dynamic and complex EEM datasets.
The per_eem_ssc
only requires two things: a PARAFAC model object,
returned from staRdom::eem_parafac()
, and the eemlist used to generate
the model. An optional (and recommended) addition is to create a
quantification limit EEM object from the blanks accompanying the EEM
dataset, supplied to create_MQL_EEM
. The eemlist/s must be compliant
with the staRdom/EEM/eemR fluorescence framework for R (see Pucher et
al., 2019). Supply these parameters to the main function, and you’re
good to go.
RamanIntegrationRange originally mentioned in this paper: Murphy, K. R.: A Note on Determining the Extent of the Water Raman Peak in Fluorescence Spectroscopy, Appl. Spectrosc., 65(2), 233-236, doi:10.1366/10-06136, 2011.
Subsequently, the function was integrated into the drEEM toolbox: Murphy, K. R., Stedmon, C. A., Graeber, D. and Bro, R.: Fluorescence spectroscopy and multi-way techniques. PARAFAC, Anal. Methods, 5, 6557-6566, 2013.
The function was then extracted from drEEM in MATLAB using <open
RamanIntegrationRange.m> Code was then ported to R, using equivalent
commands. e.g. forecast::ma() instead of smooth(). Where it was
possible, functions that preserved the ‘style’ of the code were used.
E.g. discrete numerical gradient with pracma::gradient()
. The internal
(non-exported) function get_gradient_vals
can be found in R ->
gradient_helper_functions, and is responsible for discrete gradient data
generation.
Murphy, K. R. (2011). A Note on Determining the Extent of the Water Raman Peak in Fluorescence Spectroscopy. Applied Spectroscopy, 65(2), 233–236. https://doi.org/10.1366/10-06136
Pucher, M., Wünsch, U., Weigelhofer, G., Murphy, K., Hein, T., & Graeber, D. (2019). staRdom: Versatile Software for Analyzing Spectroscopic Data of Dissolved Organic Matter in R. Water, 11, 2366. https://doi.org/10.3390/w11112366
Wünsch, U. J., Bro, R., Stedmon, C. A., Wenig, P., & Murphy, K. R. (2019). Emerging patterns in the global distribution of dissolved organic matter fluorescence. Analytical Methods, 11(7), 888–893. https://doi.org/10.1039/C8AY02422G
Agostino A., Rao N. R. H., Paul S., Zhang, Z., Leslie, G., Le-Clech, P., Henderson, R. (2021). Polymer leachates emulate naturally derived fluorescent dissolved organic matter: Understanding and managing sample container interferences. Water Research, 204. https://doi.org/10.1016/j.watres.2021.117614