# Lyell Above Twin Bridges 2005 - 2021: Combining New and Old Datasets
Joe Ammatelli | 8-22-22

This notebook joins the old dataset with the new dataset, accounting for potentially dissimilar labels. For backwards compatibility, if the old and new datasets overlap in time, we join the new dataset to the end of the old dataset and trim the overlapping data from the new dataset.
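
The trimming rule described above can be illustrated with a minimal sketch. The helper `join_trimming_overlap` and the toy tables are hypothetical and are not the `level_baro_utils.join_dataframes` implementation; the sketch assumes both tables have a sorted `DatetimeIndex`.

```python
import pandas as pd


def join_trimming_overlap(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append `new` to `old`, dropping rows of `new` at or before the end of `old`."""
    # Keep only the columns the two tables share, in the order used by `new`.
    common = [col for col in new.columns if col in old.columns]
    # The old series takes precedence wherever the two overlap in time.
    old_end = old.index[-1]
    trimmed_new = new.loc[new.index > old_end, common]
    return pd.concat([old[common], trimmed_new])


# Toy half-hourly series that overlap at two timestamps (hypothetical values).
old = pd.DataFrame({"stage": [1.0, 2.0, 3.0]},
                   index=pd.date_range("2018-06-30 05:00", periods=3, freq="30min"))
new = pd.DataFrame({"stage": [30.0, 40.0, 50.0]},
                   index=pd.date_range("2018-06-30 05:30", periods=3, freq="30min"))

joined = join_trimming_overlap(old, new)
print(joined)
```

Only the last row of the toy `new` table survives the trim, so the joined index contains no repeated timestamps.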


Once all steps have been completed, a single .csv file with the following quantities will be generated (and will span the entire period for which data is available across both the new and old file).
* date and time (UTC)
* vented pressure, cm
* raw pressure, cm
* barocorrected pressure, cm
* adjusted stage, cm
* estimated discharge, cms
* water temperature, degrees C
* discharge flag

Author of Template and Underlying Code: Joe Ammatelli | (jamma@uw.edu) | August 2022

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta

sys.path.insert(0, os.path.abspath(os.path.join('..', '..', 'src')))

import config
import level_baro_utils

sys.path.remove(os.path.abspath(os.path.join('..', '..', 'src')))

## Configure Plotting Preferences
**Analyst TODO:**
* Choose plotting backend:
    - Interactive (recommended): uncomment `%matplotlib notebook` and `FIGSIZE=None`; comment out `FIGSIZE = config.FIGSIZE`
    - Inline: comment out `%matplotlib notebook` and `FIGSIZE=None`; uncomment `FIGSIZE = config.FIGSIZE`

In [2]:
%matplotlib notebook
FIGSIZE=None

#FIGSIZE = config.FIGSIZE

sns.set_theme()

## Specify site code and define start/end years of new series
**Analyst TODO**:
* assign an integer representing the site to the variable `sitecode`. The mappings are as follows (listed from upstream to downstream):
    * 0 : Lyell Below Maclure
    * 1 : Lyell Above Twin Bridges
    * 2 : Dana Fork at Bug Camp
    * 3 : Tuolumne River at Highway 120
    * 4 : Budd Creek
    * 5 : Delaney Above PCT
* assign an integer (format 'YYYY') representing the first year of data collection to `start_year`
* assign an integer (format 'YYYY') representing the last year of data collection to `end_year`

These input parameters are used to automatically retrieve the postprocessed data.
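
As a rough illustration of how these parameters could resolve to a file name, here is a sketch using a format-string template. The values of `SITE_SHORTNAME` and `FINAL_OUTPUT_FN` below are assumptions inferred from the file name written at the end of this notebook; the real definitions live in the `config` module and may differ.

```python
# Hypothetical stand-ins for config.SITE_SHORTNAME and config.FINAL_OUTPUT_FN.
SITE_SHORTNAME = {1: "LyellAbvTB"}
FINAL_OUTPUT_FN = "{site}_timeseries_stage_Q_T_{start}_{end}.csv"

sitecode = 1
start_year = 2019
end_year = 2021

# Fill the template with the site short name and the year range.
new_fn = FINAL_OUTPUT_FN.format(site=SITE_SHORTNAME[sitecode],
                                start=start_year,
                                end=end_year)
print(new_fn)
```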

In [3]:
sitecode = 1

start_year = 2019
end_year = 2021

## Read in both datasets
**Analyst TODO:** Check that each column has an appropriate datatype. To correct a column, map its column number to the desired datatype in the `dtypes` dictionary and call the `choose_column_dtype` function; otherwise leave `dtypes` as an empty dictionary. Inspect the output tables.

**Old Dataset**

In [4]:
old_fn = 'Lyell_abv_Twin_timeseries_stage_Q_T_2002_2018.csv'
old_path = os.path.join('..', '..', 'compiled_data', 'published', old_fn)
old_df = pd.read_csv(old_path, index_col=0, parse_dates=[0], infer_datetime_format=True, na_values=[' NaN'])


In [5]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 295597 entries, 2001-08-17 19:00:00 to 2018-06-30 06:00:00
Data columns (total 18 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0    upstream_raw_pressure(cm)                               295597 non-null  object 
 1    upstream_barocorrected_pressure(cm)                     295597 non-null  object 
 2    upstream_estimated_discharge(cms)                       295597 non-null  object 
 3    upstream_lower_confidence_discharge_cms_bestestimate    295597 non-null  object 
 4    upstream_upper_confidence_discharge_cms_bestestimate    295597 non-null  object 
 5    upstream_temperature(degC)                              295597 non-null  object 
 6    upstream_instrument_ID                                  295597 non-null  int64  
 7    downstream_raw_pressure(cm)                             295597 non

Explicitly choose the datatype for the incorrectly typed columns of interest (`read_csv` sometimes infers the wrong datatype for a column and will not allow conversion between certain types afterwards). This step should not be necessary for future processing.
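
For readers without access to `level_baro_utils`, a conversion by column position might look roughly like the sketch below. The helper `coerce_columns_by_position` is hypothetical and is not the `choose_column_dtype` implementation; it assumes that entries which cannot be parsed as numbers should become `NaN`.

```python
import numpy as np
import pandas as pd


def coerce_columns_by_position(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Return a copy of `df` with the columns at the given positions converted."""
    out = df.copy()
    for position, dtype in dtypes.items():
        name = out.columns[position]
        # Entries that are not valid numbers (e.g. padded 'NaN' strings) become NaN.
        numeric = pd.to_numeric(out[name], errors="coerce")
        out[name] = numeric.astype(dtype)
    return out


# Toy table with a numeric column that was read in as strings (hypothetical values).
raw = pd.DataFrame({"pressure": ["96.6", " NaN", "96.5"],
                    "instrument_ID": [1, 1, 1]})
clean = coerce_columns_by_position(raw, {0: np.float64})
print(clean.dtypes)
```

Columns that are not listed in the dictionary, such as `instrument_ID` here, are left unchanged.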

In [6]:
dtypes = {0:np.float64,
          1:np.float64,
          2:np.float64,
          3:np.float64,
          4:np.float64,
          5:np.float64,
          7:np.float64,
          8:np.float64,
          9:np.float64,
          10:np.float64,
          11:np.float64,
          12:np.float64,
          14:np.float64,
          15:np.float64}

In [7]:
old_df = level_baro_utils.choose_column_dtype(old_df, dtypes)

In [8]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 295597 entries, 2001-08-17 19:00:00 to 2018-06-30 06:00:00
Data columns (total 18 columns):
 #   Column                                                   Non-Null Count   Dtype  
---  ------                                                   --------------   -----  
 0    upstream_raw_pressure(cm)                               224740 non-null  float64
 1    upstream_barocorrected_pressure(cm)                     224586 non-null  float64
 2    upstream_estimated_discharge(cms)                       224586 non-null  float64
 3    upstream_lower_confidence_discharge_cms_bestestimate    212109 non-null  float64
 4    upstream_upper_confidence_discharge_cms_bestestimate    212109 non-null  float64
 5    upstream_temperature(degC)                              224740 non-null  float64
 6    upstream_instrument_ID                                  295597 non-null  int64  
 7    downstream_raw_pressure(cm)                             158724 non

Here we have to choose which station to use, since there were two locations over this time span. For the new dataset, we choose to report data from the upstream location only.

In [9]:
fig, ax = plt.subplots(2, 1, sharex=True)
old_df[' downstream_estimated_discharge(cms)'].plot(ax=ax[0], label='downstream')
old_df[' upstream_estimated_discharge(cms)'].plot(ax=ax[0], label='upstream')
ax[0].set_ylabel('cms')
ax[0].set_title('Discharge')
ax[0].legend()
old_df[' discharge_source(1=upstream_2=downstream)'].plot(ax=ax[1])
ax[1].set_title('Discharge Source (1=upstream, 2=downstream)')

Text(0.5, 1.0, 'Discharge Source (1=upstream, 2=downstream)')

In [10]:
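# keep the upstream columns (0-6) and the discharge flag (17)
old_df = old_df.iloc[:, [0,1,2,3,4,5,6,17]]
# drop rows that have no upstream discharge estimate
old_df = old_df[old_df[' upstream_estimated_discharge(cms)'].notna()]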

Additionally, since there is no column for stage (the authors of the previous dataset chose to include only barocorrected pressure), we need to add an additional column to the table to represent stage.

In [11]:
old_df['stage(cm)'] = old_df[' upstream_barocorrected_pressure(cm)']

Inspect final table

In [12]:
old_df.head()

date_time(UTC:PDT+7),upstream_raw_pressure(cm),upstream_barocorrected_pressure(cm),upstream_estimated_discharge(cms),upstream_lower_confidence_discharge_cms_bestestimate,upstream_upper_confidence_discharge_cms_bestestimate,upstream_temperature(degC),upstream_instrument_ID,discharge_flag,stage(cm)
2005-09-03 01:00:00,96.6,31.9,0.56,0.06,1.16,16.82,1,0,31.9
2005-09-03 01:30:00,96.7,32.0,0.57,0.07,1.16,16.75,1,0,32.0
2005-09-03 02:00:00,96.5,31.5,0.54,0.04,1.13,16.58,1,0,31.5
2005-09-03 02:30:00,96.5,31.3,0.52,0.03,1.11,16.35,1,0,31.3
2005-09-03 03:00:00,96.5,31.0,0.5,0.01,1.09,16.06,1,0,31.0


**New Dataset**

In [13]:
new_fn = config.FINAL_OUTPUT_FN.format(site=config.SITE_SHORTNAME[sitecode],
                                       start=start_year,
                                       end=end_year)

new_path = os.path.join('..', '..', 'stitch_discharge', 'data', 'processed', new_fn)

new_df = pd.read_csv(new_path, index_col=0, parse_dates=[0], infer_datetime_format=True)

Inspect resultant tables

In [14]:
new_df.head()

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-07-25 19:00:00,815.33,47.811598,47.314401,1.882593,13.475,0
2018-07-25 19:30:00,815.22,47.662551,47.165354,1.867403,13.235,0
2018-07-25 20:00:00,814.96,47.363503,46.866307,1.837055,13.209,0
2018-07-25 20:30:00,814.65,47.014456,46.51726,1.801854,13.301,0
2018-07-25 21:00:00,814.43,46.755409,46.258213,1.775883,13.492,0


## If the old dataset has different column names than the new dataset, map the old names to the new ones
This step compensates for the different header labels used in prior datasets. As of the release of the first version of this processing suite, attempts are made to use a consistent labelling scheme so that this step will not be necessary in the future.

### Specify the old label names
**Analyst TODO**:
If the old dataset uses different header names, define the old header names in the appropriate variables below (leave as an empty string otherwise).

e.g.
* if the old dataset uses the label `stage (cm)` to describe the offset stage value, set the variable `adjusted_stage_label` equal to `stage (cm)`
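
One way to make such a mapping explicit is to pair each old label with its new name in a dictionary and pass it to `DataFrame.rename`. The sketch below is only an illustration: `rename_old_labels` and the toy table are hypothetical and are not the `level_baro_utils.map_old_labels_2_new` implementation, and `NEW_LABELS` is copied from the header of the new dataset.

```python
import pandas as pd

# Column names used by the new dataset (taken from the new table's header).
NEW_LABELS = ["raw_pressure(cm)",
              "barocorrected_pressure(cm)",
              "adjusted_stage(cm)",
              "estimated_discharge(cms)",
              "water_temperature(deg_C)",
              "discharge_flag"]


def rename_old_labels(df: pd.DataFrame, old_labels: list) -> pd.DataFrame:
    """Rename `old_labels` to NEW_LABELS pairwise and keep only those columns."""
    if len(old_labels) != len(NEW_LABELS):
        raise ValueError("expected one old label per new label")
    # An empty string means the old table already uses the new name.
    mapping = {old: new for old, new in zip(old_labels, NEW_LABELS) if old}
    renamed = df.rename(columns=mapping)
    return renamed[NEW_LABELS]


# One-row toy table using the old header names (hypothetical values).
old = pd.DataFrame({" upstream_raw_pressure(cm)": [96.6],
                    " upstream_barocorrected_pressure(cm)": [31.9],
                    "stage(cm)": [31.9],
                    " upstream_estimated_discharge(cms)": [0.56],
                    " upstream_temperature(degC)": [16.82],
                    " discharge_flag": [0]})
renamed = rename_old_labels(old, list(old.columns))
print(renamed.columns.tolist())
```

Pairing the labels by name rather than by position makes it easy to confirm after the rename that each value still sits under the intended header.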

In [15]:
old_df[[' upstream_raw_pressure(cm)', ' upstream_barocorrected_pressure(cm)']].plot()

<AxesSubplot:xlabel='date_time(UTC:PDT+7)'>

In [16]:
raw_pressure_label = ' upstream_raw_pressure(cm)'
barocorrected_pressure_label = ' upstream_barocorrected_pressure(cm)'
adjusted_stage_label = 'stage(cm)'
estimated_discharge_label = ' upstream_estimated_discharge(cms)'
water_temperature_label = ' upstream_temperature(degC)'
discharge_flag_label = ' discharge_flag'

old_labels = [raw_pressure_label, 
              barocorrected_pressure_label, 
              adjusted_stage_label, 
              estimated_discharge_label,
              water_temperature_label,
              discharge_flag_label]

### Update the header of the old dataset dataframe to match the header of the new dataset dataframe
**Analyst TODO:** Run cells

In [17]:
old_df = level_baro_utils.map_old_labels_2_new(old_df, old_labels)
old_df.head()

date_time(UTC:PDT+7),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag,discharge_flag
2005-09-03 01:00:00,96.6,31.9,31.9,0.56,16.82,0
2005-09-03 01:30:00,96.7,32.0,32.0,0.57,16.75,0
2005-09-03 02:00:00,96.5,31.5,31.5,0.54,16.58,0
2005-09-03 02:30:00,96.5,31.3,31.3,0.52,16.35,0
2005-09-03 03:00:00,96.5,31.0,31.0,0.5,16.06,0


## Join the old data frame with the new dataframe (only the columns they have in common: namely the labels listed in the previous step)
**Analyst TODO:** Run cells

In [18]:
resultant_df, boundary = level_baro_utils.join_dataframes(old_df, new_df)

# may need to override datatype of some columns
resultant_df['estimated_discharge(cms)'] = resultant_df['estimated_discharge(cms)'].astype(np.float64)

resultant_df.head()

Joining old and new series
End of old series: 2018-06-30 06:00:00
Start of new series: 2018-07-25 19:00:00


date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2005-09-03 01:00:00,,96.6,31.9,31.9,0.56,16.82
2005-09-03 01:30:00,,96.7,32.0,32.0,0.57,16.75
2005-09-03 02:00:00,,96.5,31.5,31.5,0.54,16.58
2005-09-03 02:30:00,,96.5,31.3,31.3,0.52,16.35
2005-09-03 03:00:00,,96.5,31.0,31.0,0.5,16.06


## Inspect result around boundary of old/new dataset
**Analyst TODO:** Inspect the results. Verify that the boundary between the old and new series does not contain duplicated values.
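
In addition to the visual inspection, the index can be checked programmatically. The following is a minimal sketch, assuming the joined table has a `DatetimeIndex`; `check_time_index` and the toy table are hypothetical and are not part of `level_baro_utils`.

```python
import pandas as pd


def check_time_index(df: pd.DataFrame) -> dict:
    """Summarize duplicate timestamps, ordering, and the largest gap in the index."""
    duplicated = df.index[df.index.duplicated(keep="first")]
    return {"n_duplicates": len(duplicated),
            "is_sorted": df.index.is_monotonic_increasing,
            "largest_gap": df.index.to_series().diff().max()}


# Toy series with a gap between the old and new records (hypothetical values).
index = pd.to_datetime(["2018-06-30 05:30", "2018-06-30 06:00", "2018-07-25 19:00"])
toy = pd.DataFrame({"adjusted_stage(cm)": [1.0, 2.0, 3.0]}, index=index)

report = check_time_index(toy)
print(report)
```

A nonzero `n_duplicates` or a `False` value for `is_sorted` would indicate that overlapping data was not trimmed at the boundary.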

In [19]:
resultant_df.plot()

<AxesSubplot:xlabel='date_time(UTC:PDT+7)'>

In [20]:
level_baro_utils.plot_boundary(resultant_df, boundary)

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-06-30 05:00:00,,810.55,52.78,52.78,2.47,11.08
2018-06-30 05:30:00,,810.85,53.22,53.22,2.52,11.1
2018-06-30 06:00:00,,811.04,53.33,53.33,2.53,11.11


## Save Series

In [21]:
first_year = old_df.index[0].year
last_year = new_df.index[-1].year

level_baro_utils.save_final_data(resultant_df, sitecode, first_year, last_year)

Wrote data to ../data/processed/LyellAbvTB_timeseries_stage_Q_T_2005_2021.csv
