-
Notifications
You must be signed in to change notification settings - Fork 24
B. Preliminary: 'Total counts'
During the early stages of data processing, SQUID 2.50 derives ‘total-count’ values for each scan of each species, for ions and SBM. These ‘total-count’ values are derived from the sets of 10 ion-count (or SBM) ‘integrations’ measured at each species during each scan, as follows:
- ‘Total-count’ SBM values are universally calculated using the Tukey’s Biweight algorithm with tuning factor = 6.
- ‘Total-count’ ion values are calculated using the Tukey’s Biweight algorithm with tuning factor = 9 when the median of the 10 integrations exceeds 100. If the median value is 100 or less, the mean is calculated by a custom process involving Poisson outlier identification and rejection.
These SQUID 2.50 calculations were elucidated and emulated using two CSVs (Check_01… and Check_02…) developed in Calamari:
- Check_01_IonIntegrations_PerScan.CSV is a simple and direct extract from the Prawn XML file, with one row per scan, and 11 columns per species (comprising one for ‘count_time_sec’, and one for each of the 10 ion-count ‘integrations’ measured at each species during each scan). The output (which is documented in detail here) is used to verify the function of the Tukey’s Biweight algorithm with tuning 9 (in species with higher count-rates), as well as replicating SQUID’s Poisson outlier identification and rejection process, and subsequent arithmetic mean calculation (in species with lower count-rates).
- Check_02_SBMIntegrations_PerScan.CSV is a simple and direct extract from the Prawn XML file, with one row per scan, and 11 columns per species (comprising one for ‘count_time_sec’, and one for each of the 10 SBM ‘integrations’ measured at each species during each scan). The output (which is documented in detail here) is used to verify the function of the Tukey’s Biweight algorithm with tuning 6.
The Tukey’s Biweight algorithm (as implemented in Isoplot) is iterative, and operates either until a test for convergence is passed, or 101 iterations have been completed (whichever comes first). The iterated calculations are hand-worked in this Excel workbook (Handworked_Biweight_10dataPoints.xls) which, for convenience, has been customised to operate on input sets of exactly 10 data-points. The XLS already contains worked examples of the application to a set of ion-count integrations and a set of SBM integration, but can be applied to other inputs as follows:
- Paste (as Values) the 10 integrations of interest into the column of 10 yellow cells in column B (B9:B18).
- Specify the desired tuning factor (6 or 9 only) in the isolated yellow cell B7.
- Populate the grey cells B2:B6 with enough information to remind you which set of integrations you’re looking at.
Press Calculate Now (F9) and look at the output data lower in columns A–B. Note that:
- Output from Isoplot ‘biweight’ (pink cells) requires Isoplot to be available to this spreadsheet; if not, this 2 x 2 cell-block will show error-messages of some kind.
- Values of Isoplot ‘biweight’ (blue cells) are simply Copy-Paste as Values performed manually from the ‘Output’ block above it, provided mainly so people looking at the Example worksheets but without access to Isoplot can still verify that the spreadsheet works.
- Output of this worksheet (red cells) are accompanied by a count of how many iterations it took for the calculation to converge, just for interest. Biweight Mean and Sigma ought to match the results obtained from the Isoplot function, to 15 significant figures.
Replication (in XLS) of Poisson outlier identification and rejection for low ion-count species, using Calamari input
This Excel workbook (100142_FullCheck_Sub100Medians.xls) shows detailed working of the Poisson assessment routine in SQUID 2.50, for ion counts where the median of the 10 integrations <= 100. The relevant XML file is ‘100142…’, and these calculations are upstream (and independent) of any other user-defined parameters.
Sheet ‘Poisson’ simply contains the hard-wired upper (ArrU) and lower (ArrL) c95% limits for all the integers between 0 and 100, as hard-wired in SQUID 2.50. Note the anomalies at the lower end of the scale: the upper c95% limit for a value of 0 (= 6) was arbitrarily chosen. It means the outlier-threshold when the median is 0 is much higher than when the median is 1, or 2.
The rest of the sheets relate to specific mass-stations in which the median integration value for some scans/analyses drop below the 100 threshold. In XML file ‘100142…’ the only mass-stations affected are 204Pb, background (Bgd), 207Pb and 208Pb. For each mass-station, there are three worksheets, each exploring a different rounding behaviour of non-integer medians (for the purpose of potential outlier detection at SQUID replication). ‘…_up’ sheets investigate ‘ceiling’ rounding of medians of the form ‘x.5’ to ‘x+1’, whereas ‘…_down’ sheets investigate ‘floor’ rounding of medians of the form ‘x.5’ to ‘x’. Note that neither of these rounding methods permits full-file-scale replication of SQUID behaviour (this is better illustrated by masses 207 and 208 than 204 and background). SQUID behaviour can only be fully replicated on the ‘…_halfeven’ sheets, which implement ‘banker’s rounding’ (i.e. medians of the form ‘x.5’ are rounded to ‘x’ when x is even, and rounded to ‘x+1’ when x is odd).
In each of these sheets, rows 1–3 contain peak-specific reference data, and also summary data derived from the rows below. There are 684 rows of data in each case (rows 5–688), comprising one row per scan per analysis (114 analyses @ 6 scans each) for the entire XML file.
- Columns A–C contain identifying data for each scan.
- Columns D–M contain the values of the 10 integrations as measured, derived directly from Calamari Report ‘Check_01…’.
- Column N calculates the median of the 10 integrations (with row 3 containing a count of the values <= 100), and column O is a Boolean identifying whether the median is <= 100 and non-integer, because this combination of properties is where the unusual rounding behaviour arises (row 3 contains a count of the values where both conditions are true).
- Column P contains the median value rounded to an integer, following the rounding mode defined in row 2 (‘up’ = ‘ceiling’, ‘down’ = ‘floor’, and ‘half-even’ = ‘banker’s rounding’). If the rounded median <= 100, columns Q and R contain the values ‘looked up’ from columns ArrL and ArrU respectively, within sheet ‘Poisson’, corresponding to the rounded median.
- Columns S–AB contain the residuals of each of the 10 integrations relative to the unrounded (true) median in column N (not the rounded version in column P!). For the irrelevant cases (i.e. median > 100), columns Q and R are null, and zeroes are returned everywhere. Relevant cases (i.e. median <= 100) are identified by non-null entries in columns Q and R, but residual values are only calculated if the relevant integration lies beyond the 95% confidence intervals defined in columns Q and R (otherwise residuals are set to zero).
- Column AC simply identifies the largest outlier (maximum residual value) for the 10 integrations.
- Columns AD–AM are populated according to whether any given integration qualifies as an excludable outlier or not. Any such outlier is populated with a 1; all other integrations are labelled 0. The objective is to ensure that only one integration (that with the largest residual) is excluded, and that if two or more integrations share the maximum residual value, the integration acquired first is the only one rejected.
- Columns AN–AW are simply a re-expression of the original 10 integrations, with the exception that any integration identified as an outlier by the foregoing logic is assigned a value of 0 instead. The re-expressed columns can be considered to contain ‘surviving’ integrations.
- Columns AX–BG are the squares of the surviving integrations, which are eventually used for calculating SigmaPkCounts.
- Column BH quantifies the number of surviving integrations (9 where a Poisson-related rejection has been permitted, 10 otherwise), column BI is the sum of the surviving integrations (from columns AN–AW), and column BJ is the sum of their squares (from columns AX–BG).
- Column BK contains peak mean counts (sum of surviving integrations / number of surviving integrations), column BL converts the value to CPS using the sheet-specific count_time value (cell B2), and column BM performs the collector deadtime correction using the sheet-specific deadtime value (cell B3).
- Column BN contains the calculated sigma of peak counts (using the sum of surviving integrations, the sum of their squares, and the number of surviving integrations). Column BO contains an alternative estimate of the sigma of peak counts, named ‘Poisson sigma’, which is simply the square root of ‘peak mean counts’. Column BP adopts the larger of these two values.
- Columns BQ and BR contain the calculated ‘total counts at peak’ and ‘1sigma at peak’ (as calculated by SQUID and presented in the ‘condensed XML’ worksheet, and as calculated by Calamari in SQUID_01…), but only for scans where the unrounded median of the 10 original integrations (column N) <= 100. For higher-count scans, these values are simply equated to the corresponding values Copy-Pasted from SQUID (columns BS and BT) so that differences unrelated to the Poisson calculations do not contaminate subsequent comparisons.
- Columns BS and BT are the ‘total counts at peak’ and ‘±1sigma at peak’ Copy-Pasted from the ‘condensed XML’ sheet of the corresponding SQUID-workbook, for the purpose of comparison. Cell BT1 counts the total number of scans with non-integer medians <= 100, because this combination is where unusual rounding behaviour arises.
- Columns BU and BV compare the ‘total counts at peak’ and ‘±1sigma at peak’ values calculated in this worksheet with those presented by SQUID, via the simple subtraction [worksheet value minus SQUID value]. Note that this operation inevitably yields zeroes for all scans where the median of integrations exceeds 100: this is purely a mechanism to remove ‘non-Poisson’ calculations from being considered in the comparison (in reality, these calculations are performed using SQUID/Isoplot’s TukeysBiweight algorithm, which is deliberately excluded from this workbook). Row 3 contains the sum of all differences calculated in these columns; these are best assessed on the ‘207_...’ and ‘208_...” worksheets, where substantial non-zero values are obtained in the ‘…_up’ and ‘…_down’ sheets, because neither rounding method matches SQUID for all non-integer values <= 100.
- Column BW looks at every instance of non-integer medians <= 100, in terms of whether the ‘total counts at peak’ and ‘±1sigma at peak’ values calculated in the worksheet both match those presented by SQUID, or not. Counts of matched and mismatched values are presented in cells BW1 and BW2, and should sum to the value in cell BT1. In general, mismatches arise when the non-integer sub-100 median is rounded to the ‘wrong’ value (more on this below), ‘wrong’ values are selected from ArrL and ArrU as a result, leading to the erroneous identification (or non-identification) of an outlier integration.
- Column BX looks in more detail at the matched and mismatched values, in terms of whether the non-integer sub-100 median is ‘odd-floored’ (i.e. x.5 where x is odd) or ‘even-floored’ (i.e. x.5 where x is even). This check illustrates the nature of the rounding problem: on all ‘…_up’ sheets, worksheet–SQUID mismatches arise only when even-floored numbers are rounded up to the next odd number, and analogously, on ‘…_down’ sheets, worksheet–SQUID mismatches arise only when odd-floored numbers are rounded down to the previous odd number (see cells BX3 and BX4, one of which is always zero, the other containing the full number of observed worksheet–SQUID mismatches). The ‘…_halfeven’ sheets illustrate the solution to this problem: to mimic SQUID’s behaviour by rounding all non-integer medians to the nearest even number.
- Columns BY–CG contain, for selected analyses (indicated by yellow highlight in columns A–C of the sheets ‘207_halfeven’ and ‘208_halfeven’), hand-working of the ‘total [mass] cts/sec’ value. Column BY contains the scan-by-scan ‘peak counts/sec’ (= total counts at peak / count_time), and column BZ contains the corresponding scan-by-scan data for background, together with a ‘calculated background counts/sec’, which is the arithmetic mean of the scan-by-scan values.
- Column CA contains the scan-by-scan ‘absolute net counts/sec’, consisting of the ‘peak count/sec’ corrected by subtraction of the average ‘calculated background counts/sec’ in column BZ.
- Column CB contains the CalcVariance, which augments the ‘absolute net counts/sec by a ‘calculated background counts/sec’ factor proportional to the ratio of the peak count_time relative to background count_time. Column CC contains the ‘peak fractional error’ derived from the CalcVariance.
- Column CD contains a single ‘mean net counts/second’ (as the arithmetic average of the scan-by-scan ‘absolute net counts/sec’ in column CA), and column CE converts this to a single ‘total counts/second’ via addition of the average ‘calculated background counts/sec’.
- Column CF contains the corresponding value Copy-Pasted from the relevant SQUID-workbook, and column CG compares the ‘total counts/second’ values calculated in this worksheet with those presented by SQUID, via the simple subtraction [worksheet value minus SQUID value].
Next: C. Step 1: Transform XML into condensed and reformatted 'total counts at peak' worksheet