# overview

This example shows how to use `flux-data-qaqc` with a custom climate input file that is not from FLUXNET. The only real differences lie in the config file declarations, so the entire workflow from the FLUXNET example notebook works the same way here. That notebook is recommended for general use, whereas this one focuses on the formatting rules for the input data itself.

---
The data used here is provided with the software package and can be downloaded [here](https://github.com/Open-ET/flux-data-qaqc/blob/master/examples/). It comes from a USGS eddy covariance flux tower at the Dixie Valley Dense Vegetation site. Details on the data can be found in this [report](https://pubs.usgs.gov/pp/1805/pdf/pp1805.pdf).

In [1]:
%load_ext autoreload
%autoreload 2
from fluxdataqaqc import Data, QaQc, Plot
from bokeh.plotting import figure, show
from bokeh.models.formatters import DatetimeTickFormatter
from bokeh.models import LinearAxis, Range1d
from bokeh.io import output_notebook
output_notebook()

# setting up a config file
---
The config file needed for using `flux-data-qaqc` has two major sections:
1. METADATA
2. DATA

Currently in **METADATA**, the "station_elevation" (expected in meters) and latitude (decimal degrees) fields are used to calculate clear sky potential solar radiation. The item "missing_data_value" is used to correctly parse missing data in the climate time series. Other metadata is not used currently but may be useful for custom workflows, more on this later.
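
As a rough illustration of the **METADATA** fields named above, the sketch below parses a small in-memory config with Python's standard `configparser`. The key name `station_latitude` and all of the values are assumptions made for illustration; check the config files provided with the package for the exact names it expects.

```python
import configparser

# Hypothetical METADATA section: the values and the latitude key name are assumptions
example_config = """
[METADATA]
station_elevation = 1000
station_latitude = 39.5
missing_data_value = -9999
"""

parser = configparser.ConfigParser()
parser.read_string(example_config)
meta = parser['METADATA']

elevation_m = meta.getfloat('station_elevation')   # meters
latitude_deg = meta.getfloat('station_latitude')   # decimal degrees
missing_value = meta.get('missing_data_value')     # kept as text, as read from the file
print(elevation_m, latitude_deg, missing_value)
```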

The **DATA** section of the config file is where you specify climate variables and their units. `flux-data-qaqc` has two major functions. First, it corrects surface energy balance by adjusting latent energy and sensible heat fluxes. Second, it serves as a robust way to read in different time series data and plot their daily and monthly time series. The latter is under development, but generally speaking the module can generate useful interactive plots of arbitrary time series data.

Here is a list of all the "expected" climate variable names in the **DATA** section:

In [2]:
config_path = 'USGS_config.ini'
d = Data(config_path)
for each in d.config.items('DATA'):
    print(each[0])

datestring_col
year_col
month_col
day_col
net_radiation_col
net_radiation_units
ground_flux_col
ground_flux_units
latent_heat_flux_col
latent_heat_flux_units
latent_heat_flux_corrected_col
latent_heat_flux_corrected_units
sensible_heat_flux_col
sensible_heat_flux_units
sensible_heat_flux_corrected_col
sensible_heat_flux_corrected_units
shortwave_in_col
shortwave_in_units
shortwave_out_col
shortwave_out_units
shortwave_pot_col
shortwave_pot_units
longwave_in_col
longwave_in_units
longwave_out_col
longwave_out_units
vap_press_col
vap_press_units
vap_press_def_col
vap_press_def_units
avg_temp_col
avg_temp_units
precip_col
precip_units
wind_spd_col
wind_spd_units


**Note:** You may not have any of the expected climate variables in your data and may specify them all as missing ('na'); however, the result will be an output dataset of null values, and no plots will be produced!

---
# creating your own quality control values with your input data

Currently `flux-data-qaqc` supports filtering out poor quality data based on quality control (QC) values that exist within the input data. 

The QC values should be decimal fractions (although there are no strict constraints), with 0 meaning the poorest and 1 the best quality data. The column containing the QC flag in your input climate file should be named for the variable it corresponds to, with the suffix **'_QC'**. For example, if your sensible heat column is named **sens_h**, then your QC column should be named **sens_h_QC**. Below is an example using the provided FLUXNET file, which includes its own QC flags for sensible heat and other variables.
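
The naming rule above can be sketched as a simple suffix match. The snippet below is only an illustration of the convention, not the package's own lookup code; the helper `find_qc_pairs` and the column names are hypothetical.

```python
# Hypothetical column names from an input climate file
columns = ['date', 'sens_h', 'sens_h_QC', 'latent_h', 'net_rad', 'net_rad_QC']

def find_qc_pairs(names, suffix='_QC'):
    """Map each variable name to its QC column when both are present."""
    present = set(names)
    return {
        name: name + suffix
        for name in names
        if not name.endswith(suffix) and name + suffix in present
    }

qc_pairs = find_qc_pairs(columns)
print(qc_pairs)
```

Here `latent_h` has no matching `latent_h_QC` column, so it is left out of the resulting pairs.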

In [3]:
config_path = 'fluxnet_config.ini'
d = Data(config_path)
# this attribute shows you which variables were found in your input file that have
# quality control values assigned, it uses the names as found in the input file
d.qc_var_pairs

{'NETRAD': 'NETRAD_QC',
 'G_F_MDS': 'G_F_MDS_QC',
 'LE_F_MDS': 'LE_F_MDS_QC',
 'H_F_MDS': 'H_F_MDS_QC',
 'SW_IN_F': 'SW_IN_F_QC',
 'SW_OUT': 'SW_OUT_QC',
 'LW_IN_F': 'LW_IN_F_QC',
 'LW_OUT': 'LW_OUT_QC',
 'VPD_F': 'VPD_F_QC',
 'TA_F': 'TA_F_QC',
 'P_F': 'P_F_QC',
 'WS_F': 'WS_F_QC'}

In [4]:
# values of sensible heat and its QC value
d.df[['H_F_MDS', 'H_F_MDS_QC']].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f3d3da618d0>

# visualize the QC values alongside corresponding data

If you create your own QC values, be sure to validate them to make sure everything seems correct. Below we see that the lowest QC values correspond with poor quality gap-filled data near the beginning of the dataset.

In [5]:
p = figure(x_axis_label='date', y_axis_label='sensible heat flux')
p.extra_y_ranges = {"sec": Range1d(start=-0.1, end=1.1)}
p.line(d.df.index, d.df['H_F_MDS'], color='red', line_width=1)
p.add_layout(LinearAxis(y_range_name="sec", axis_label='QC value'), 'right')
p.circle(d.df.index, d.df['H_F_MDS_QC'], line_width=2, y_range_name="sec")
p.x_range=Range1d(d.df.index[0], d.df.index[365])
p.xaxis.formatter = DatetimeTickFormatter(days="%d-%b-%Y")
show(p)

By default, when filtering data based on QC values, the routine removes all data with a QC value below 0.5. This threshold can be modified; see the [provided FLUXNET Jupyter notebook](https://github.com/Open-ET/flux-data-qaqc/blob/master/examples/FLUXNET_2015_example.ipynb) for examples.

# after applying QC filter removing values below 0.5

Values of sensible heat with QC values < 0.5 are now removed (null). 

Note that the `Data.apply_qc_flags()` method currently applies the filter to all variables in the climate file that have a QC column. Other options, such as assigning different threshold values to select variables, may be added in the future if requested.
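
To make the filtering rule concrete, here is a minimal sketch of threshold masking in plain Python. It is not the implementation behind `Data.apply_qc_flags()`; the helper `mask_low_quality` and the sample values are hypothetical, and the sketch assumes a value is kept when its QC value equals the threshold.

```python
import math

def mask_low_quality(values, qc_values, threshold=0.5):
    """Return a copy of values with NaN wherever the QC value is below threshold."""
    return [
        math.nan if qc < threshold else value
        for value, qc in zip(values, qc_values)
    ]

# Hypothetical sensible heat values and their QC flags
sens_h = [10.0, 12.5, 11.0, 9.0]
sens_h_qc = [1.0, 0.4, 0.5, 0.0]

filtered = mask_low_quality(sens_h, sens_h_qc)
print(filtered)
```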

In [6]:
# apply QC filters
d.apply_qc_flags()
# same figure
p = figure(x_axis_label='date', y_axis_label='sensible heat flux')
p.extra_y_ranges = {"sec": Range1d(start=-0.1, end=1.1)}
p.line(d.df.index, d.df['H_F_MDS'], color='red', line_width=1)
p.add_layout(LinearAxis(y_range_name="sec", axis_label='QC value'), 'right')
p.circle(d.df.index, d.df['H_F_MDS_QC'], line_width=2, y_range_name="sec")
p.x_range=Range1d(d.df.index[0], d.df.index[365])
p.xaxis.formatter = DatetimeTickFormatter(days="%d-%b-%Y")
show(p)

---

# Specifying multiple soil heat flux and soil moisture variables for your input

`flux-data-qaqc` provides the ability to read in multiple soil heat flux/moisture variables for a given station location, calculate their weighted or unweighted average, and write/plot their daily and monthly time series. This may be useful for comparing/validating multiple soil heat/moisture probes at varying locations or depths, or with varying instrumentation.

Here is what you need to do to use this functionality:

1. List the multiple soil variable names in your config file following the convention:
 * For multiple soil heat flux variables, config names should begin with "G_" or "g_" followed by an integer (1, 2, 3, ...), i.e. g_[number]. For example:

```bash
g_1 = name_of_my_soil_heat_flux_variable
```
 * For soil moisture variables, the config name should follow the pattern "theta_[number]", for example:

```bash
theta_1 = name_of_my_soil_moisture_variable
```

2. List the units and (optionally) weights of the multiple variables 
 * To specify the units of your soil flux/moisture variables add "_units" to the config name you assigned:

```bash
g_1_units = w/m2
theta_1_units = cm
```

 * To set weights for multiple variables to compute weighted averages assign the "_weight" suffix to their names in the config file. For example, to set weights for multiple soil heat flux variables:
```bash
g_1_weight = 0.25
g_2_weight = 0.25
g_3_weight = 0.5
```
 * Note: if weights are not given, the arithmetic mean will be calculated. If the weights do not sum to 1, they will be automatically normalized so that they do.
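
The weighting rule described above can be sketched in plain Python. This only illustrates normalizing weights and taking a weighted mean; it is not the package's own implementation. The function names and flux values are hypothetical, while the raw weights 6, 2, and 0.5 match the example config used later in this notebook.

```python
def normalize_weights(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def weighted_mean(values, weights=None):
    """Weighted mean for one time step; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in values}
    weights = normalize_weights(weights)
    return sum(values[name] * weights[name] for name in values)

raw_weights = {'g_1': 6.0, 'g_2': 2.0, 'g_3': 0.5}
normalized = normalize_weights(raw_weights)
print({name: round(w, 2) for name, w in normalized.items()})

# Hypothetical soil heat flux values (w/m2) for a single day
fluxes = {'g_1': 10.0, 'g_2': 20.0, 'g_3': 30.0}
g_mean = weighted_mean(fluxes, raw_weights)
g_mean_unweighted = weighted_mean(fluxes)
print(g_mean, g_mean_unweighted)
```

Because `g_1` carries most of the weight, the weighted mean lands much closer to its value than the plain arithmetic mean does.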

# multiple soil variable weighted average examples

The provided multiple soil variable config and input data are used for these examples.

Here is the section of the config file that defines the multiple soil variables in the input climate file used for the example below:

```bash
g_1 = added_G_col
g_1_weight = 6
g_1_units = w/m2
g_2 = another_G_var
g_2_weight = 2
g_2_units = w/m2
# note the next variable is the same that was assigned as the main soil heat flux variable
# i.e. ground_flux_col = G
g_3 = G
g_3_weight = 0.5
g_3_units = w/m2

theta_1 = soil_moisture_z1
theta_1_weight = 0.25
theta_1_units = cm
theta_2 = soil_moisture_z10
theta_2_weight = 0.75
theta_2_units = cm
```
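
For readers who want to inspect such entries outside the package, the sketch below groups soil variable keys with their weights and units using the standard `configparser` and `re` modules. It is not how `flux-data-qaqc` reads its config internally; placing the keys under a `[DATA]` section and the shortened config text are assumptions for illustration.

```python
import configparser
import re

# Shortened, hypothetical config text; the [DATA] section header is an assumption
config_text = """
[DATA]
ground_flux_col = G
g_1 = added_G_col
g_1_weight = 6
g_1_units = w/m2
g_2 = another_G_var
g_2_weight = 2
g_2_units = w/m2
theta_1 = soil_moisture_z1
theta_1_weight = 0.25
theta_1_units = cm
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)
data = parser['DATA']

# Match keys such as g_1 or theta_2, but not g_1_weight or g_1_units
# (configparser lowercases option names by default, so G_1 becomes g_1)
soil_key = re.compile(r'^(g|theta)_\d+$')

soil_vars = {}
for key, column in data.items():
    if soil_key.match(key):
        soil_vars[key] = {
            'name': column,
            'weight': data.get(key + '_weight'),  # None when no weight is listed
            'units': data.get(key + '_units'),
        }

for key, info in soil_vars.items():
    print(key, info)
```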

In [7]:
# read in the data
config_path = 'multiple_soilflux_config.ini'
d = Data(config_path)
# note the newly added multiple g and theta variables
d.variables

{'date': 'date',
 'year': 'na',
 'month': 'na',
 'day': 'na',
 'Rn': 'Rn',
 'G': 'G',
 'LE': 'LE',
 'LE_corr': 'LE_corr',
 'H': 'H',
 'H_corr': 'H_corr',
 'sw_in': 'sw_in',
 'sw_out': 'sw_out',
 'sw_pot': 'sw_pot',
 'lw_in': 'lw_in',
 'lw_out': 'lw_out',
 'vp': 'na',
 'vpd': 'vpd',
 't_avg': 't_avg',
 'ppt': 'ppt',
 'ws': 'ws',
 'g_1': 'added_G_col',
 'g_2': 'another_G_var',
 'g_3': 'G',
 'theta_1': 'soil_moisture_z1',
 'theta_2': 'soil_moisture_z10'}

In [8]:
# and their units
d.units

{'Rn': 'w/m2',
 'G': 'w/m2',
 'LE': 'w/m2',
 'LE_corr': 'w/m2',
 'H': 'w/m2',
 'H_corr': 'w/m2',
 'sw_in': 'w/m2',
 'sw_out': 'w/m2',
 'sw_pot': 'w/m2',
 'lw_in': 'w/m2',
 'lw_out': 'w/m2',
 'vp': 'na',
 'vpd': 'hPa',
 't_avg': 'C',
 'ppt': 'mm',
 'ws': 'm/s',
 'g_1': 'w/m2',
 'g_2': 'w/m2',
 'theta_1': 'cm',
 'theta_2': 'cm'}

In [9]:
# or to view these variables and their weights only
d.soil_var_weight_pairs

{'g_1': {'name': 'added_G_col', 'weight': '6'},
 'g_2': {'name': 'another_G_var', 'weight': '2'},
 'g_3': {'name': 'G', 'weight': '0.5'},
 'theta_1': {'name': 'soil_moisture_z1', 'weight': '0.25'},
 'theta_2': {'name': 'soil_moisture_z10', 'weight': '0.75'}}

# when the data is first loaded into memory the weighted averages are calculated

At this stage, weights are automatically normalized so that they sum to one, and the new weights are printed if this occurs.

In [10]:
# call daily or monthly dataframe to calculate the weighted averages if they exist
d.df.head()

g weights do not sum to one, normalizing
Here are the new weights:
 added_G_col:0.71, another_G_var:0.24, G:0.06


| date | t_avg | sw_pot | sw_in | lw_in | vpd | ppt | ws | Rn | sw_out | lw_out | ... | LE | LE_corr | H | H_corr | added_G_col | another_G_var | soil_moisture_z1 | soil_moisture_z10 | g_mean | theta_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-01-01 | 2.803 | 186.71 | 123.108 | 261.302 | 1.919 | 0.0 | 3.143 | NaN | NaN | NaN | ... | 67.1459 | 43.8414 | 20.3876 | 13.3116 | NaN | NaN | 20.57327 | 26.94286 | 0.0 | 25.350463 |
| 2009-01-02 | 2.518 | 187.329 | 121.842 | 268.946 | 0.992 | 0.0 | 2.093 | NaN | NaN | NaN | ... | 92.8616 | 60.9673 | 32.6505 | 21.4364 | NaN | NaN | 20.25087 | 26.601709 | 0.0 | 25.013999 |
| 2009-01-03 | 5.518 | 188.008 | 124.241 | 268.004 | 2.795 | 0.0 | 4.403 | NaN | NaN | NaN | ... | 75.8029 | 50.3151 | 20.0569 | 13.313 | NaN | NaN | 20.827236 | 26.644598 | 0.0 | 25.190258 |
| 2009-01-04 | -3.753 | 188.742 | 113.793 | 246.675 | 0.892 | 0.0 | 4.336 | NaN | NaN | NaN | ... | 67.1459 | 45.0539 | 20.3876 | 13.6798 | NaN | NaN | 20.988757 | 26.843588 | 0.0 | 25.37988 |
| 2009-01-05 | -2.214 | 189.534 | 124.332 | 244.478 | 1.304 | 0.0 | 2.417 | NaN | NaN | NaN | ... | 92.8616 | 62.6443 | 32.6505 | 22.026 | NaN | NaN | 20.756527 | 26.262146 | 0.0 | 24.885741 |


In [11]:
# note the weights have been changed and updated 
d.soil_var_weight_pairs

{'g_1': {'name': 'added_G_col', 'weight': 0.7058823529411765},
 'g_2': {'name': 'another_G_var', 'weight': 0.23529411764705882},
 'g_3': {'name': 'G', 'weight': 0.058823529411764705},
 'theta_1': {'name': 'soil_moisture_z1', 'weight': '0.25'},
 'theta_2': {'name': 'soil_moisture_z10', 'weight': '0.75'}}

In [12]:
# now the dataframe also has the weighted means that will be named g_mean and theta_mean
d.df.columns

Index(['t_avg', 'sw_pot', 'sw_in', 'lw_in', 'vpd', 'ppt', 'ws', 'Rn', 'sw_out',
       'lw_out', 'G', 'LE', 'LE_corr', 'H', 'H_corr', 'added_G_col',
       'another_G_var', 'soil_moisture_z1', 'soil_moisture_z10', 'g_mean',
       'theta_mean'],
      dtype='object')

### the weighted mean is closest to the variable assigned to "g_1" which had the highest weight

In [13]:
p = figure(x_axis_label='date', y_axis_label='Soil heat flux')
p.line(d.df.index, d.df['g_mean'], color='black', legend="weighted mean", line_width=2)
p.line(d.df.index, d.df['added_G_col'], color='orange', legend="g_1: 0.71", line_width=1)
p.line(d.df.index, d.df['another_G_var'], color='green', legend="g_2: 0.24", line_width=1)
p.line(d.df.index, d.df['G'], color='red', legend="g_3: 0.06", line_width=1)

p.x_range=Range1d(d.df.index[150], d.df.index[160])
p.xaxis.formatter = DatetimeTickFormatter(days="%d-%b-%Y")
show(p)