In [None]:
# Import functions and settings used in this notebook
%run ./modules/data_reduction_modules.ipynb

# pCO₂ Data Reduction part 3: sensor data

Now we will have a look at a sensor data set.
## Data set
Here is an example dataset that was prepared for this exercise. You can download it to your computer and have a look at it: https://fileshare.icos-cp.eu/s/ixoDPdgBrwFTwdm.  
This is not needed for further calculations, but you might want to see what your dataset looks like. The columns are described below in the metadata section. Please note that no reference gases were used. The data are from a Contros HydroC CO₂ sensor that was run in 2018 during a cruise in the Mediterranean Sea. The original data set contains some additional columns, which we removed for ease of use. The sensor was calibrated before and after the deployment.  
It has been found that the sensor's runtime is more suitable for calibration purposes than the actual time, because the instrument does not drift while it is turned off.


### Here are the metadata:
**sensor specific coefficients**  

| Variable | Value | Notes |
| :- | :- | :- |
| F | 60625 | NDIR unit specific scale factor |
| $T_{sensor}$ | 35.8 | Temperature of the initial CO₂ sensor at pre-calibration |
| $f(T_{sensor})$ | 10019.24 | NDIR unit specific temperature compensation factor |

**calibration coefficients**  

| coefficient | pre calibration | post calibration |
| :- | :- | :-|
| k1 | 68.93709E-3 | 69.99492E-3 |
| k2 | 6.124565E-6 | 5.529816E-6 |
| k3 | 3.484077E-10 | 3.857533E-10 |

**column header:**

| Column | Notes |
| :- | :- |
| Timestamp | | 
| Date | date and time recorded by the sensor | 
| p_NDIR | Pressure in the gas stream behind NDIR unit |
| p_in | Pressure in the gas stream behind membrane |
| Zero | 1: sensor measures zero gas; 0: sensor doesn't measure zero gas |
| Flush | 1: sensor is flushing after the zero, 0: no flushing |
| external_pump | Status of the external pump: 1: on, 0: off |
| Runtime | Seconds of operation since last calibration |
| Signal_raw ($S_{raw}$) | Raw detector signal (digits) |
| Signal ref ($S_{ref}$) | Raw reference signal |
| T_sensor | Temperature of the NDIR unit |
| Signal_proc | Signal corrected for T-influences and drift (this is based on the data from the pre-calibration and estimated drift) |
| Conc_estimate | Fraction of CO₂ (estimate) in headspace, valid up to 1000 ppm |
| pCO2_corr | Partial pressure of CO₂, p_in and T_sensor corrected |
| xCO2_corr | Fraction of CO₂ in headspace at 100% rH, p and T corrected |
| T_control | Probe temperature for the temperature control (for keeping the whole sensor at a stable temperature) |
| T_gas | Temperature of the gas stream behind the NDIR unit |
| %rH_gas | Relative humidity of the gas stream behind the NDIR unit |
| water temp membrane | Water temperature next to the sensor's membrane, measured with an oxygen optode |
| SST | Sea surface temperature measured by the ship's thermosalinograph |
| SSS | Sea surface salinity measured by the ship's thermosalinograph |



### Cell (1)
Load the data file and display the first 100 rows. It might take a while to load the dataset.
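
The row-wise `apply` that builds the timestamp in the next cell can be slow on long records. As a sketch of a vectorized alternative, the date and time strings can be concatenated column-wise and parsed in a single call. The two-row frame below is invented for illustration; only the column names `Date` and `Time` and the format string are taken from this notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the sensor file (values are invented)
demo = pd.DataFrame({
    "Date": ["01.05.2018", "01.05.2018"],
    "Time": ["12:00:00", "12:00:10"],
})

# Concatenate the strings column-wise and parse them in one call
demo["Timestamp"] = pd.to_datetime(
    demo["Date"] + " " + demo["Time"], format="%d.%m.%Y %H:%M:%S"
)
print(demo["Timestamp"])
```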


In [None]:
# load data
input_data = pandas.read_csv("data/part3_SD_datafile_Contros.txt", sep="\t")

# We also create a real timestamp field from the GPS date and time
input_data['Timestamp'] = input_data[['Date', 'Time']].apply(lambda x: pandas.to_datetime(' '.join(x), format="%d.%m.%Y %H:%M:%S"), axis=1)

# display the first 100 rows
input_data[0:100]

### Cell (2) - calibration data
Here we enter the data from the [calibration sheet]( https://fileshare.icos-cp.eu/s/bMA8Wts3FJRxQgj). It is the data processing sheet from the pre-calibration, but the one for the post calibration looks similar.

In [None]:
T0 = 273.15 # standard temperature (K)
p0 = 1013.25 # standard pressure (hPa)

# sensor specific coefficients
F = 60625 # NDIR unit specific scale factor
T_sensor = 35.8 # temperature of the initial CO2 sensor at pre-calibration
fT_sensor = 10019.24 # NDIR unit specific temperature compensation factor

# calibration coefficients from calibration sheets
k1_pre = 0.06893709
k2_pre = 0.000006124565
k3_pre = 0.0000000003484077

k1_post = 0.06999492
k2_post = 0.000005529816
k3_post = 0.0000000003857533

### Cell (3) - signal conversion
The dual-beam NDIR detector provides a raw and a reference signal, $S_{raw}$ and $S_{ref}$, that are combined into a continuously referenced two-beam signal:
$$S_{2beam}=\frac{S_{raw}}{S_{ref}}.$$
The thermal response characteristics of every NDIR detector are determined once with a zero gas during an initial, post-production gas calibration. Since every NDIR sensor features a temperature probe next to its detector, these characteristics can be included at this stage by means of a factor, $f(T_{sensor})$, that depends on the probe temperature $T_{sensor}$. This step serves as a backup, since the NDIR detector within the underwater sensor is typically operated at a stabilized temperature. Including the temperature compensation, the two-beam signal becomes
$$S'_{2beam} = \frac{S_{raw}}{S_{ref}}\times f(T_{sensor}).$$
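
As a minimal sketch, both two-beam signals can also be computed column-wise, without a row loop. The detector readings below are invented; `fT_sensor` is the value from the metadata table, and the column names follow the notebook.

```python
import pandas as pd

fT_sensor = 10019.24  # temperature compensation factor from the metadata table

# Invented detector readings, for illustration only
demo = pd.DataFrame({
    "Signal_raw": [30000.0, 29500.0],
    "Signal_ref": [40000.0, 40000.0],
})

# Two-beam signal and its temperature-compensated version
demo["S_2beam"] = demo["Signal_raw"] / demo["Signal_ref"]
demo["S2_2beam"] = demo["S_2beam"] * fT_sensor
print(demo)
```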

In [None]:
# signal conversion
# add new columns
input_data['S_2beam'] = np.nan
input_data['S2_2beam'] = np.nan

# calculate S_2beam and S2_2beam
for rowindex, datarow in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    input_data.at[rowindex, 'S_2beam'] = datarow['Signal_raw'] / datarow['Signal_ref'] 
    input_data.at[rowindex, 'S2_2beam'] = (datarow['Signal_raw'] / datarow['Signal_ref']) * fT_sensor

# display data
input_data[['Timestamp', 'S2_2beam', 'S_2beam']][0:99]


### Cell (4) - extract zero data

Accordingly, the detector also provides a two-beam signal during the regular zeroings (Z) at discrete points in time, $$S'_{2beam,Z} = \frac{S_{raw,Z}}{S_{ref,Z}}\times f(T_{sensor})$$
where $S_{raw,Z}$ is the raw signal during zeroing and $S_{ref,Z}$ the reference signal during zeroing.  
In the next cell we extract the zero readings (all lines with a "1" in the column Zero). Since the first readings of each zeroing are not yet at zero (due to the flushing time), the first three readings of each zeroing are discarded. The number of data points to discard depends on the sensor settings and needs to be determined for every deployment and whenever the settings are changed.
Here we need to cheat a bit: coding the extraction of the zero data is quite cumbersome. Therefore we did this offline and now just load the extracted zero data from a separate file. The Matlab code for the extraction of the zero data can be found [here](https://fileshare.icos-cp.eu/s/bFSNrGEzbdwNaKm).
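
For orientation, the extraction could look roughly like the following pandas sketch. It is not a port of the linked Matlab code. It assumes that the readings of one zeroing are consecutive rows with `Zero == 1`, and it uses an invented flag sequence; `n_discard` is a hypothetical name for the number of readings to drop.

```python
import pandas as pd

# Invented flag sequence: two zeroings of five readings each
demo = pd.DataFrame({
    "Zero":       [0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    "Signal_raw": range(14),
})

n_discard = 3  # readings to drop at the start of each zeroing (deployment specific)

is_zero = demo["Zero"] == 1
# A new zeroing starts wherever the flag switches from 0 to 1
block_id = (is_zero & ~is_zero.shift(fill_value=False)).cumsum()
# Position of each row within its block, counted from the start of the zeroing
pos_in_block = demo.groupby(block_id).cumcount()

# Keep zero readings only, skipping the first n_discard of each zeroing
zero_demo = demo[is_zero & (pos_in_block >= n_discard)]
print(zero_demo)
```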

In [None]:
# load the extracted zero data 
zero_data = pandas.read_csv("data/part3_zero.txt", sep="\t")

# add time stamp
zero_data['Timestamp'] = zero_data[['date', 'time']].apply(lambda x: pandas.to_datetime(' '.join(x), format="%d.%m.%Y %H:%M:%S"), axis=1)

zero_data

### Cell (5) - same as cell (3) but only for zero data
In the next cell we do the same as in cell (3), but ONLY for the zero data.

In [None]:
# signal conversion
# add new columns
zero_data['S_2beam_Z'] = np.nan
zero_data['S2_2beam_Z'] = np.nan

for rowindex, datarow in zero_data.iterrows():
    zero_data.at[rowindex, 'S_2beam_Z'] = datarow['Signal_raw_Z'] / datarow['Signal_ref_Z'] 
    zero_data.at[rowindex, 'S2_2beam_Z'] = datarow['Signal_raw_Z'] / datarow['Signal_ref_Z'] * fT_sensor

# display data
zero_data[['Timestamp','Runtime_Z', 'S_2beam_Z', 'S2_2beam_Z']]

### Cell (6) - plot zero data
Plot the processed signal from the zero data over the runtime. Note: we see here only the data from the sensor's zero runs. We now need to find a zero representation for the whole data set (analogous to the calibration of the underway data).

In [None]:
pos_plot = scatter_plot(zero_data['Runtime_Z'], zero_data['S2_2beam_Z'],
  'Runtime', 'S2_2beam_Z')

show(pos_plot)

### Regressing zero data
In order to have a zero value for each data point, we need to interpolate over the data set. One option would be a point-by-point interpolation. As can be seen in the plot above, the data are noisy but show a long-term trend. Thus, we need an interpolation that spans longer time periods.
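
The difference between the two options can be sketched on synthetic data. The numbers below are invented (a slow linear drift plus random noise) and only serve to show that point-by-point interpolation follows the noise of neighbouring zero readings, whereas a regression averages it out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic zero readings: slow linear drift plus noise (invented numbers)
runtime_z = np.linspace(0.0, 1.0e6, 50)
signal_z = 7500.0 - 1.0e-4 * runtime_z + rng.normal(0.0, 2.0, runtime_z.size)

# Runtimes at which a zero value is needed
runtime = np.array([2.5e5, 7.5e5])

# Option 1: point-by-point interpolation between neighbouring zero readings
zero_interp = np.interp(runtime, runtime_z, signal_z)

# Option 2: regression over the whole period
fit = stats.linregress(runtime_z, signal_z)
zero_reg = fit.slope * runtime + fit.intercept

print(zero_interp, zero_reg)
```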

### Cell (7) - regressing zero data (method 1)
The simplest approach is a linear regression of $S'_{2beam,Z}$ against the Runtime:

In [None]:
# plot zero data
pos_plot = scatter_plot(zero_data['Runtime_Z'], zero_data['S2_2beam_Z'],
  'Runtime', 'S2_2beam_Z')

# make regression for all zero data against Runtime
slope, intercept, r, p, std_err = stats.linregress(zero_data['Runtime_Z'], zero_data['S2_2beam_Z'])

# plot regression line
pos_plot.line([160000, 2400000], [160000 * slope + intercept, 2400000 * slope + intercept], line_dash='dashed')

# print regression function
print('The resulting regression is: S2_2beam_Z_reg = {: 2.6e} * Runtime + {: 2.2f}'.format(slope, intercept))


show(pos_plot)


### Cell (8) -  regressing zero data (method 2)
The regression over the whole data set clearly represents the data poorly. The data appear to fall into two parts: before and after a Runtime of 540000 s. In the next cell we compute a separate regression for each part and plot both.
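
As a sketch of the idea, the two parts can also be selected by runtime with a boolean mask. This is an alternative to the fixed row positions used in the notebook cell, not the same code. The data below are synthetic, and `break_runtime` is a hypothetical name for the breakpoint read off the plot.

```python
import numpy as np
from scipy import stats

# Synthetic zero data with a change of slope at 540000 s (invented numbers)
runtime_z = np.arange(160000.0, 2400000.0, 20000.0)
signal_z = np.where(runtime_z < 540000.0,
                    7500.0 - 2.0e-4 * (runtime_z - 540000.0),
                    7500.0 - 1.0e-5 * (runtime_z - 540000.0))

break_runtime = 540000.0  # breakpoint in seconds

# One regression per part, selected by runtime
first = runtime_z < break_runtime
fit1 = stats.linregress(runtime_z[first], signal_z[first])
fit2 = stats.linregress(runtime_z[~first], signal_z[~first])

print(fit1.slope, fit2.slope)
```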

In [None]:
# plot zero data
pos_plot = scatter_plot(zero_data['Runtime_Z'], zero_data['S2_2beam_Z'],
  'Runtime', 'S2_2beam_Z')

# make regression for first part
slope1, intercept1, r1, p1, std_err1 = stats.linregress(zero_data['Runtime_Z'][0:125], zero_data['S2_2beam_Z'][0:125])
# plot regression 1
pos_plot.line([165000, 540000.0], [165000 * slope1 + intercept1, 540000 * slope1 + intercept1], line_dash='dashed', 
              line_color='green')
# make regression for second part
slope2, intercept2, r2, p2, std_err2 = stats.linregress(zero_data['Runtime_Z'][126:665], zero_data['S2_2beam_Z'][126:665])
# plot regression 2
pos_plot.line([540000, 2500000], [540000 * slope2 + intercept2, 2500000 * slope2 + intercept2], line_dash='dashed', 
              line_color='red')

# print regression functions
print('The resulting regressions are:')
print('1st part: S2_2beam_Z_reg = {: 2.6e} * Runtime + {: 2.2f}'.format(slope1, intercept1))
print('2nd part: S2_2beam_Z_reg = {: 2.6e} * Runtime + {: 2.2f}'.format(slope2, intercept2))

show(pos_plot)

### Cell (9) - apply regressions to data set
Use the two regression functions from cell (8) to calculate the regressed, temperature-corrected two-beam zero signal, $S'_{2beam,Z,reg}$, for every row of the data set.
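
A compact way to apply one of two regressions per row is `numpy.where`. The sketch below uses made-up regression coefficients and runtimes. In the notebook the coefficients come from cell (8) and the breakpoint from the zero data.

```python
import numpy as np
import pandas as pd

# Hypothetical regression results, for illustration only
slope1, intercept1 = -2.0e-4, 7600.0
slope2, intercept2 = -1.0e-5, 7500.0
break_runtime = 540000.0

demo = pd.DataFrame({"Runtime": [200000.0, 540000.0, 1000000.0]})

# Regression 1 before the breakpoint, regression 2 from the breakpoint onwards
demo["S2_2beamZ_reg"] = np.where(
    demo["Runtime"] < break_runtime,
    slope1 * demo["Runtime"] + intercept1,
    slope2 * demo["Runtime"] + intercept2,
)
print(demo)
```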

In [None]:
# find first and last data point of the whole data set
startpoint = input_data['Runtime'][1]
endpoint = input_data['Runtime'][45708]

# set the breakpoint between the two regressions at RuntimeZ = 540000
breakpoint = zero_data['Runtime_Z'][125]

# add new variable
input_data['S2_2beamZ_reg'] = np.nan

# Loop through each row and calculate S2_2beamZ_reg with one of the regressions
for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    rt = row_data['Runtime']

    # regression 1, first part
    if rt < breakpoint:
        # .at avoids chained assignment, which may silently write to a copy
        input_data.at[row_index, 'S2_2beamZ_reg'] = slope1 * rt + intercept1

    # regression 2, second part
    else:
        input_data.at[row_index, 'S2_2beamZ_reg'] = slope2 * rt + intercept2

# display first 100 rows
input_data[0:100]

### Cell (10) - calc drift corrected signal
Now we can calculate a drift corrected signal for every data point:
$$S_{proc}(t)=F\times\left(1-\frac{S'_{2beam}(t)}{S'_{2beam,Z,reg}(t)}\right)$$
where t is the time.
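
A small worked example of this formula, with invented signal values: if the measured two-beam signal is 10 % below the regressed zero signal, $S_{proc}$ is 10 % of $F$. The function name `s_proc` is illustrative.

```python
F = 60625  # NDIR unit specific scale factor from the metadata table

def s_proc(s2_2beam, s2_2beam_z_reg, scale=F):
    """Drift-corrected signal from the measured and the regressed zero two-beam signal."""
    return scale * (1.0 - s2_2beam / s2_2beam_z_reg)

# Invented signals: the measurement reads 10 % below the zero signal
print(s_proc(6750.0, 7500.0))

# A measurement equal to the zero signal gives no CO2 signal at all
print(s_proc(7500.0, 7500.0))
```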

In [None]:
# create new column
input_data['S_proc'] = np.nan

for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    S2 = row_data['S2_2beam']
    S2z = row_data['S2_2beamZ_reg']
    S2_S2z = S2 / S2z
    S_proc = F * (1 - S2_S2z)

    input_data.at[row_index, 'S_proc'] = S_proc

# display first 100 rows of the data set
input_data[0:100]

### Cell (11) - interpolate calibration coefficients k1, k2, k3
We want to use the drift corrected signal to calculate $\text{x}CO_{2,wet}$:
$$\text{x}CO_{2,wet} = \left( k_3\times S_{proc}^3+ k_2\times S_{proc}^2+ k_1\times S_{proc} \right)\times \frac{p_0\times T}{T_0\times p}$$
where $T$ is the gas temperature, T_gas, $p$ is the pressure in the IR cell, p_NDIR, and $p_0$ and $T_0$ are standard pressure and temperature, respectively.  

***But*** before we can do that we need to interpolate the coefficients $k_i$. Changes in the sensor's concentration-dependent response are accounted for by transforming the polynomial of the pre-deployment calibration into the polynomial of the post-deployment calibration. Concretely, the calibration factors $k_i$ of the polynomial become dependent on $S'_{2beam,Z}$.
This is done by regressing $k_i$ against $S'_{2beam,Z}$.  
We take the coefficients from the pre- and post-calibration sheets (see above).
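
With only two calibration points, a linear regression is simply the straight line through them. The sketch below writes that line out directly for $k_1$. The start and end zero signals are hypothetical values, and `interpolate_k` is an illustrative helper, not part of the notebook modules.

```python
import numpy as np

k1_pre, k1_post = 0.06893709, 0.06999492  # from the calibration sheets

# Hypothetical regressed zero signals at the start and end of the deployment
s_start, s_end = 7560.0, 7480.0

def interpolate_k(s, k_pre, k_post):
    """Line through (s_start, k_pre) and (s_end, k_post), evaluated at s."""
    slope = (k_post - k_pre) / (s_end - s_start)
    return k_pre + slope * (np.asarray(s, dtype=float) - s_start)

# At the start, halfway and at the end of the zero-signal range
k1 = interpolate_k([7560.0, 7520.0, 7480.0], k1_pre, k1_post)
print(k1)
```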


In [None]:
# set first and last data point of the data set
startpoint = input_data['S2_2beamZ_reg'].iloc[0]
endpoint = input_data['S2_2beamZ_reg'].iloc[-1]

# x: regressed zero signal at start and end; y: coefficients from pre and post calibration
x = [startpoint, endpoint]
y_k1 = [k1_pre, k1_post]
y_k2 = [k2_pre, k2_post]
y_k3 = [k3_pre, k3_post]

# find regression coefficients from ki vs. S2_2beamZ_reg
slope_k1, intercept_k1, rk1, p_k1, std_err_k1 = stats.linregress(x, y_k1)
slope_k2, intercept_k2, rk2, p_k2, std_err_k2 = stats.linregress(x, y_k2)
slope_k3, intercept_k3, rk3, p_k3, std_err_k3 = stats.linregress(x, y_k3)

# make new columns
input_data['k1'] = np.nan
input_data['k2'] = np.nan
input_data['k3'] = np.nan



# Loop through each row in the input dataset and calculate ki
for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):

    s2 = row_data['S2_2beamZ_reg']
    # .at avoids chained assignment, which may silently write to a copy
    input_data.at[row_index, 'k1'] = slope_k1 * s2 + intercept_k1
    input_data.at[row_index, 'k2'] = slope_k2 * s2 + intercept_k2
    input_data.at[row_index, 'k3'] = slope_k3 * s2 + intercept_k3

# display first 100 rows
input_data[0:100]

### Cell (12) - calculate xCO₂ and pCO₂
Now we can use the calculated $k_i$ from cell (11) to calculate $\text{x}CO_{2,wet}$:
$$\text{x}CO_{2,wet} = \left( k_3\times S_{proc}^3+ k_2\times S_{proc}^2+ k_1\times S_{proc} \right)\times \frac{p_0\times T}{T_0\times p}$$
and then finally $pCO_{2,wet}$:  
$$ pCO_{2,wet} = \text{x}CO_{2,wet} \times \frac{p_{in}}{p_0},$$
where $p_{in}$ is the pressure measured behind the membrane.
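
The two equations above can be sketched as plain functions. This assumes that T_gas is given in °C (the notebook adds `T0` to it) and that p_NDIR and p_in are in the same unit as `p0`. The function names and the sample values are illustrative only.

```python
T0 = 273.15   # standard temperature (K)
p0 = 1013.25  # standard pressure (hPa)

def xco2_wet(s_proc, k1, k2, k3, t_gas_degc, p_ndir):
    """Calibration polynomial scaled to the gas temperature and cell pressure."""
    poly = k3 * s_proc**3 + k2 * s_proc**2 + k1 * s_proc
    return poly * (p0 * (t_gas_degc + T0)) / (T0 * p_ndir)

def pco2_wet(xco2, p_in):
    """Partial pressure from the wet mole fraction and the pressure behind the membrane."""
    return xco2 * p_in / p0

# Pre-calibration coefficients with invented signal, temperature and pressures
x = xco2_wet(5000.0, 0.06893709, 6.124565e-6, 3.484077e-10, 36.0, 1000.0)
print(x, pco2_wet(x, 1005.0))
```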


In [None]:

input_data['xco2_wet'] = np.nan
input_data['pco2_wet'] = np.nan


# Loop through each row in the input dataset
for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    
    Tgas = row_data['T_gas'] + T0
    S2_DC = row_data['S_proc']
    pndir = row_data['p_NDIR']
    pin = row_data['p_in']
    k1_row = row_data['k1']
    k2_row = row_data['k2']
    k3_row = row_data['k3']
    
    
    xco2 = (k3_row * S2_DC * S2_DC * S2_DC + k2_row * S2_DC * S2_DC + k1_row * S2_DC) * ((p0 * Tgas)/(T0*pndir))
    pco2 = xco2 * (pin/p0)
    
    # .at avoids chained assignment, which may silently write to a copy
    input_data.at[row_index, 'xco2_wet'] = xco2
    input_data.at[row_index, 'pco2_wet'] = pco2

input_data[0:99]

### Cell (13, 14) clean data set
All zero and flush data are still in the data set. Here we add a new variable "measure" that is the sum of the Zero and Flush flags: it is 0 for real measurements and non-zero during zero and flush intervals.
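
As a sketch, the flag and the masked column can also be built without a loop using `Series.where`, which keeps a value where the condition holds and inserts NaN elsewhere. The four rows below are invented.

```python
import pandas as pd

# Invented rows: measurement, zeroing, flushing, measurement
demo = pd.DataFrame({
    "Zero":     [0, 1, 0, 0],
    "Flush":    [0, 0, 1, 0],
    "pco2_wet": [400.0, 5.0, 150.0, 410.0],
})

# 0 for real measurements, non-zero during zero and flush intervals
demo["measure"] = demo["Zero"] + demo["Flush"]

# Keep the value for real measurements, NaN otherwise
demo["pco2_wet_m"] = demo["pco2_wet"].where(demo["measure"] == 0)
print(demo)
```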

In [None]:
# add new variable
input_data['measure'] = np.nan


for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    
    z = row_data['Zero']
    f = row_data['Flush']
        
    m = z+f
    input_data.at[row_index, 'measure'] = m


In [None]:
# add new variables that contain NaN during zero and flush
input_data['xco2_wet_m'] = np.nan
input_data['pco2_wet_m'] = np.nan
input_data['p_NDIR_m'] = np.nan
input_data['xCO2_corr_m'] = np.nan
input_data['pCO2_corr_m'] = np.nan

for row_index, row_data in log_progress(input_data.iterrows(), every=1000, size=input_data.shape[0], name='Calculating'):
    
   
    if row_data['measure'] == 0:
        input_data.at[row_index, 'xco2_wet_m'] = row_data['xco2_wet']
        input_data.at[row_index, 'pco2_wet_m'] = row_data['pco2_wet']
        input_data.at[row_index, 'p_NDIR_m'] = row_data['p_NDIR']
        input_data.at[row_index, 'xCO2_corr_m'] = row_data['xCO2_corr']
        input_data.at[row_index, 'pCO2_corr_m'] = row_data['pCO2_corr']
        
        
input_data[0:99]

### Cell (15) plot all
Plot all CO₂ data:
1. pCO₂ - Sensor: pCO₂ data as reported directly from the sensor (pCO2_corr)
2. xCO₂ - Sensor: xCO₂ data as reported directly from the sensor (xCO2_corr)
3. xCO₂ Wet: wet xCO₂ from our calculation
4. pCO₂ Wet: wet pCO₂ from our calculation

In [None]:
equ = input_data.query('measure == 0')

p = figure(plot_width=600, plot_height=600, x_axis_type='datetime', x_axis_label='Time', y_axis_label='CO₂')
p.circle(equ['Timestamp'], equ['pCO2_corr'], size=5, legend_label='pCO₂ - Sensor')
p.circle(equ['Timestamp'], equ['xCO2_corr'], color='red', legend_label='xCO₂ - Sensor')
p.circle(equ['Timestamp'], equ['xco2_wet'], color='pink', legend_label='xCO₂ Wet')
p.circle(equ['Timestamp'], equ['pco2_wet'], color='brown', legend_label='pCO₂ Wet ')

show(p)

# Please be patient...

### Cell (16) plot all
Play around and think about how to do the following exercises. You can also copy/paste code from the other notebooks here.
- calculate fCO₂_wet
- calculate pCO₂@SST and fCO₂@SST
- calculate the pCO₂ data using regression method 1 (cell 7) and compare the results
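
As a starting point for the second exercise, here is a minimal sketch of a temperature correction from the equilibration temperature to SST. It assumes the widely used empirical coefficient of 0.0423 per °C from Takahashi et al. (1993) and takes the water temperature at the membrane as the equilibration temperature. The function name and the sample values are hypothetical, and the other notebooks may use a different formulation.

```python
import math

def pco2_at_sst(pco2_equ, t_equ_degc, sst_degc, coeff=0.0423):
    """Correct pCO2 from the equilibration temperature to SST (assumed exponential form)."""
    return pco2_equ * math.exp(coeff * (sst_degc - t_equ_degc))

# Invented example: water at the membrane is 0.5 °C warmer than at the intake
print(pco2_at_sst(400.0, 20.5, 20.0))
```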

In [None]:
plotsource = ColumnDataSource(input_data)
selectable_vars = ['pCO2_corr_m', 'xCO2_corr_m', 'pco2_wet_m']
column_select = Select(title='Variable', value=selectable_vars[0], options=selectable_vars)

p = figure(plot_width=600, plot_height=600, x_axis_type='datetime', min_border_left=50)
circle = p.circle('Timestamp', column_select.value, size=5, source=plotsource)

update_plot = CustomJS(args=dict(circle=circle), code="""
    circle.glyph.y = {field: cb_obj.value};
  """)

column_select.js_on_change('value', update_plot)

controls = column(column_select)
show(row(p, controls))