# Dana Fork at Bug Camp 2005 - 2021: Combining New and Old Datasets
Joe Ammatelli | 08-22-22

This notebook joins the old dataset with the new dataset (accounting for potentially dissimilar labels). For backwards compatibility, if the old and new datasets overlap in time, we join the new dataset to the end of the old dataset and trim the overlapping data from the new dataset.


Once all steps have been completed, a single .csv file with the following quantities will be generated (and will span the entire period for which data is available across both the new and old file).
* date and time (UTC)
* vented pressure, cm
* raw pressure, cm
* barocorrected pressure, cm
* adjusted stage, cm
* estimated discharge, cms
* water temperature, degrees C
* discharge flag

Author of Template and Underlying Code: Joe Ammatelli | (jamma@uw.edu) | August 2022

In [3]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta

sys.path.insert(0, os.path.abspath(os.path.join('..', '..', 'src')))

import config
import level_baro_utils

sys.path.remove(os.path.abspath(os.path.join('..', '..', 'src')))

## Configure Plotting Preferences
**Analyst TODO:**
* Choose plotting backend:
    - Interactive (recommended): uncomment `%matplotlib notebook` and `FIGSIZE=None`; comment out `FIGSIZE = config.FIGSIZE`
    - Inline: comment out `%matplotlib notebook` and `FIGSIZE=None`; uncomment `FIGSIZE = config.FIGSIZE`

In [4]:
%matplotlib notebook
FIGSIZE=None

#FIGSIZE = config.FIGSIZE

sns.set_theme()

## Specify site code and define start/end years of new series
**Analyst TODO**:
* assign an integer representing the site to the variable `sitecode`. Mappings are as follows (listed from upstream to downstream):
    * 0 : Lyell Below Maclure
    * 1 : Lyell Above Twin Bridges
    * 2 : Dana Fork at Bug Camp
    * 3 : Tuolumne River at Highway 120
    * 4 : Budd Creek
    * 5 : Delaney Above PCT
* assign an integer (format 'YYYY') representing the first year of data collection to `start_year`
* assign an integer (format 'YYYY') representing the last year of data collection to `end_year`

These input parameters are used to automatically retrieve the postprocessed data.

In [5]:
sitecode = 2

start_year = 2019
end_year = 2021
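
As a rough illustration of how these parameters could resolve to a file name, the sketch below fills a filename template from a site code and year range. The dictionary and template are hypothetical stand-ins for `config.SITE_SHORTNAME` and `config.FINAL_OUTPUT_FN`, guessed from the output file name printed at the end of this notebook; the real values in `config` may differ.

```python
# Hypothetical stand-ins for config.SITE_SHORTNAME and config.FINAL_OUTPUT_FN.
# The real definitions live in the project's config module and may differ.
SITE_SHORTNAME = {2: 'DanaFk@BugCamp'}
FINAL_OUTPUT_FN = '{site}_timeseries_stage_Q_T_{start}_{end}.csv'


def build_filename(sitecode, start_year, end_year):
    """Fill the filename template for one site and year range."""
    return FINAL_OUTPUT_FN.format(site=SITE_SHORTNAME[sitecode],
                                  start=start_year,
                                  end=end_year)


example_fn = build_filename(2, 2019, 2021)
print(example_fn)
```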

## Read in both datasets
**Analyst TODO:**
* Ensure each column has the appropriate datatype. Correct as necessary by mapping column number to datatype in a dictionary called `dtypes` and calling the `choose_column_dtype` function (otherwise leave `dtypes` as an empty dictionary).
* Ensure the old table has a column for each column of the new table; add new columns as necessary.
* Inspect the resulting tables.

**Old Dataset**

In [6]:
old_fn = 'Dana_Bug_Camp_timeseries_stage_Q_T_2005_2018.csv'
old_path = os.path.join('..', '..', 'compiled_data', 'published', old_fn)
old_df = pd.read_csv(old_path, index_col=0, parse_dates=[0], infer_datetime_format=True, na_values=[' NaN'])


In [7]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 224248 entries, 2005-09-01 22:00:00 to 2018-06-18 19:30:00
Data columns (total 8 columns):
 #   Column                                       Non-Null Count   Dtype 
---  ------                                       --------------   ----- 
 0    raw_pressure(cm)                            224248 non-null  object
 1    barocorrected_pressure(cm)                  224248 non-null  object
 2    adjusted_stage(cm)                          224248 non-null  object
 3    estimated_discharge(cms)                    224248 non-null  object
 4   lower_confidence_discharge_cms_bestestimate  224248 non-null  object
 5   upper_confidence_discharge_cms_bestestimate  224248 non-null  object
 6    water_temperature(deg_C)                    224248 non-null  object
 7    discharge flag                              224248 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 15.4+ MB


Explicitly choose the datatype for the incorrectly typed columns of interest (`read_csv` sometimes chooses the wrong datatype for some columns and won't allow conversion from certain types to others). This step should not be necessary for future processing.

In [8]:
dtypes = {0:np.float64,
          1:np.float64,
          2:np.float64,
          3:np.float64,
          4:np.float64,
          5:np.float64,
          6:np.float64}

In [9]:
old_df = level_baro_utils.choose_column_dtype(old_df, dtypes)

In [10]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 224248 entries, 2005-09-01 22:00:00 to 2018-06-18 19:30:00
Data columns (total 8 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0    raw_pressure(cm)                            221758 non-null  float64
 1    barocorrected_pressure(cm)                  214921 non-null  float64
 2    adjusted_stage(cm)                          214921 non-null  float64
 3    estimated_discharge(cms)                    214921 non-null  float64
 4   lower_confidence_discharge_cms_bestestimate  221647 non-null  float64
 5   upper_confidence_discharge_cms_bestestimate  221647 non-null  float64
 6    water_temperature(deg_C)                    224247 non-null  float64
 7    discharge flag                              224248 non-null  int64  
dtypes: float64(7), int64(1)
memory usage: 15.4 MB
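
The drop in non-null counts above is consistent with non-numeric strings being turned into `NaN` during the cast. A minimal sketch of that kind of position-based coercion is shown below. The helper `coerce_columns_by_position` is hypothetical and is not the actual `level_baro_utils.choose_column_dtype`, whose implementation is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def coerce_columns_by_position(df, dtypes):
    """Return a copy of df with the columns at the given positions cast.

    Hypothetical helper for illustration only; it assumes every target
    dtype is numeric.
    """
    out = df.copy()
    for position, dtype in dtypes.items():
        name = out.columns[position]
        # Entries that cannot be parsed as numbers become NaN
        numeric = pd.to_numeric(out[name], errors='coerce')
        out[name] = numeric.astype(dtype)
    return out


# Small demonstration table with a string column that contains a bad entry
demo = pd.DataFrame({'a': ['1.5', ' NaN', '3'], 'flag': [0, 0, 1]})
demo = coerce_columns_by_position(demo, {0: np.float64})
print(demo.dtypes)
```

Columns not listed in the dictionary (here `flag`) are left untouched.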


Additionally, since the old dataset has no column for stage (its authors chose to include only barocorrected pressure), we need to add an additional column to the table to represent stage.

In [15]:
old_df['stage(cm)'] = old_df[' barocorrected_pressure(cm)']

Inspect final results

In [16]:
old_df.head()

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),lower_confidence_discharge_cms_bestestimate,upper_confidence_discharge_cms_bestestimate,water_temperature(deg_C),discharge flag,stage(cm)
2005-09-01 22:00:00,86.3,22.0,28.04,0.16,-0.08,0.42,13.53,0,22.0
2005-09-01 22:30:00,86.4,21.9,28.04,0.16,-0.08,0.42,13.82,0,21.9
2005-09-01 23:00:00,86.3,22.0,28.04,0.16,-0.08,0.42,14.08,0,22.0
2005-09-01 23:30:00,86.5,22.0,28.04,0.16,-0.08,0.42,14.22,0,22.0
2005-09-02 00:00:00,86.2,22.5,28.65,0.18,-0.07,0.44,14.29,0,22.5


**New Dataset**

In [17]:
new_fn = config.FINAL_OUTPUT_FN.format(site=config.SITE_SHORTNAME[sitecode],
                                       start=start_year,
                                       end=end_year)

new_path = os.path.join('..', '..', 'stitch_discharge', 'data', 'processed', new_fn)

new_df = pd.read_csv(new_path, index_col=0, parse_dates=[0], infer_datetime_format=True)

Inspect resultant tables

In [18]:
new_df.head()

date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-06-01 00:00:00,53.27904,,,63.538053,6.504656,8.92,0
2018-06-01 00:30:00,53.09616,,,63.355173,6.450432,8.72,0
2018-06-01 01:00:00,53.7972,,,64.056213,6.659228,8.46,0
2018-06-01 01:30:00,54.102,,,64.361013,6.750797,8.14,0
2018-06-01 02:00:00,54.49824,,,64.757253,6.870547,7.754,0


## If the old dataset has different column names than the new dataset, map the old names to the new ones
This step exists to compensate for different header labels used in prior datasets. As of the release of the first version of this processing suite, attempts are made to keep a consistent labelling scheme so that this step will not be necessary in the future.

### Specify the old label names
**Analyst TODO**:
If the old dataset uses different header names, define each old header name in the appropriate variable below (leave as an empty string otherwise).

e.g.
* if the old dataset uses the label `stage (cm)` to describe the offset stage value, set the variable `adjusted_stage_label` equal to `stage (cm)`

In [19]:
raw_pressure_label = ' raw_pressure(cm)'
barocorrected_pressure_label = ' barocorrected_pressure(cm)'
adjusted_stage_label = 'stage(cm)'
estimated_discharge_label = ' estimated_discharge(cms)'
water_temperature_label = ' water_temperature(deg_C)'
discharge_flag_label = ' discharge flag'

old_labels = [raw_pressure_label, 
              barocorrected_pressure_label, 
              adjusted_stage_label, 
              estimated_discharge_label,
              water_temperature_label,
              discharge_flag_label]

### Update the header of the old dataset dataframe to match the header of the new dataset dataframe
**Analyst TODO:** Run cells

In [20]:
old_df = level_baro_utils.map_old_labels_2_new(old_df, old_labels)
old_df.head()

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2005-09-01 22:00:00,86.3,22.0,22.0,0.16,13.53,0
2005-09-01 22:30:00,86.4,21.9,21.9,0.16,13.82,0
2005-09-01 23:00:00,86.3,22.0,22.0,0.16,14.08,0
2005-09-01 23:30:00,86.5,22.0,22.0,0.16,14.22,0
2005-09-02 00:00:00,86.2,22.5,22.5,0.18,14.29,0
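
One way such a relabelling could work is to select the listed old columns in order and assign the new names, treating an empty string as "the old table already uses the new label". The function `rename_to_new_labels` below is a hypothetical sketch under that assumption and is not the actual `level_baro_utils.map_old_labels_2_new`.

```python
import pandas as pd


def rename_to_new_labels(df, old_labels, new_labels):
    """Select the old columns in order and give them the new labels.

    Hypothetical helper for illustration only. An empty string in
    old_labels means the old table already uses the new label.
    """
    source = [old or new for old, new in zip(old_labels, new_labels)]
    # Columns not named in `source` are dropped from the result
    selected = df[source].copy()
    selected.columns = list(new_labels)
    return selected


# Demonstration table: one mislabelled column, one substitute column,
# and one column that is not carried over
demo = pd.DataFrame({' raw_pressure(cm)': [86.3],
                     'stage(cm)': [22.0],
                     'unused': [1]})
renamed = rename_to_new_labels(
    demo,
    old_labels=[' raw_pressure(cm)', 'stage(cm)'],
    new_labels=['raw_pressure(cm)', 'adjusted_stage(cm)'])
print(list(renamed.columns))
```

Selecting before renaming mirrors what the output above shows: columns that are not listed, such as the confidence bounds, do not appear in the relabelled table.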


## Join the old dataframe with the new dataframe (only the columns they have in common: namely the labels listed in the previous step)
**Analyst TODO:** Run cells

In [21]:
resultant_df, boundary = level_baro_utils.join_dataframes(old_df, new_df)

# may need to override datatype of some columns
resultant_df['estimated_discharge(cms)'] = resultant_df['estimated_discharge(cms)'].astype(np.float64)

resultant_df.head()

Joining old and new series
End of old series: 2018-06-18 19:30:00
Start of new series: 2018-06-18 20:00:00


date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2005-09-01 22:00:00,,86.3,22.0,22.0,0.16,13.53,0
2005-09-01 22:30:00,,86.4,21.9,21.9,0.16,13.82,0
2005-09-01 23:00:00,,86.3,22.0,22.0,0.16,14.08,0
2005-09-01 23:30:00,,86.5,22.0,22.0,0.16,14.22,0
2005-09-02 00:00:00,,86.2,22.5,22.5,0.18,14.29,0
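
The join described at the top of this notebook (keep all of the old series, then append only the part of the new series that starts after it) can be sketched with plain pandas as follows. This is an illustration under stated assumptions, not the actual `level_baro_utils.join_dataframes`: the function name is hypothetical, and `boundary` is assumed here to be the last timestamp of the old series.

```python
import pandas as pd


def join_trimming_overlap(old_df, new_df):
    """Append new_df to old_df, dropping new rows that overlap the old series.

    Hypothetical helper for illustration only.
    """
    boundary = old_df.index[-1]
    # Keep only new rows strictly after the last old timestamp
    trimmed_new = new_df.loc[new_df.index > boundary]
    joined = pd.concat([old_df, trimmed_new])
    # Use the column layout of the new table; columns missing from the
    # old table are filled with NaN for the old rows
    joined = joined.reindex(columns=new_df.columns)
    return joined, boundary


old = pd.DataFrame(
    {'estimated_discharge(cms)': [1.63, 1.60]},
    index=pd.to_datetime(['2018-06-18 19:00', '2018-06-18 19:30']))
new = pd.DataFrame(
    {'vented_pressure(cm)': [32.7, 32.6, 32.5],
     'estimated_discharge(cms)': [1.59, 1.58, 1.57]},
    index=pd.to_datetime(['2018-06-18 19:30', '2018-06-18 20:00',
                          '2018-06-18 20:30']))

joined, boundary = join_trimming_overlap(old, new)
print(joined)
```

In the overlapping half hour the old value is kept, which is the backwards-compatible choice described in the introduction.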


## Inspect result around boundary of old/new dataset
**Analyst TODO:** Inspect the results. Verify that the boundary between the old and new series does not contain duplicated values.

In [22]:
resultant_df.plot()

<AxesSubplot:xlabel='date_time(UTC:PDT+7)'>

In [23]:
level_baro_utils.plot_boundary(resultant_df, boundary)

date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-06-18 18:30:00,,32.83,32.83,32.83,1.61,7.29,0
2018-06-18 19:00:00,,32.95,32.95,32.95,1.63,7.98,0
2018-06-18 19:30:00,,32.74,32.74,32.74,1.6,8.72,0
2018-06-18 20:00:00,32.67456,,,42.933573,1.587312,9.53,0
2018-06-18 20:00:00,32.67456,,,42.933573,1.587312,9.53,0
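
The rows printed above list 2018-06-18 20:00:00 twice. That may only reflect how the rows around the boundary were selected for display, so a programmatic check on the index can complement the visual inspection. The sketch below uses a small stand-in table; `duplicated_timestamps` is a hypothetical helper, and in the notebook the same check could be applied to `resultant_df`.

```python
import pandas as pd


def duplicated_timestamps(df):
    """Return the index values that occur more than once in df."""
    mask = df.index.duplicated(keep=False)
    return df.index[mask].unique()


# Stand-in table with one repeated timestamp
demo = pd.DataFrame(
    {'estimated_discharge(cms)': [1.60, 1.587312, 1.587312]},
    index=pd.to_datetime(['2018-06-18 19:30', '2018-06-18 20:00',
                          '2018-06-18 20:00']))

repeats = duplicated_timestamps(demo)
print(repeats)

# If repeats are found, one option is to keep the first occurrence of each
deduplicated = demo[~demo.index.duplicated(keep='first')]
```

An empty result from `duplicated_timestamps` would indicate that every timestamp in the table is unique.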


## Save Series

In [24]:
first_year = old_df.index[0].year
last_year = new_df.index[-1].year

level_baro_utils.save_final_data(resultant_df, sitecode, first_year, last_year)

Wrote data to ../data/processed/DanaFk@BugCamp_timeseries_stage_Q_T_2005_2021.csv
