# [SITE] [STARTYEAR - ENDYEAR] - Combining New and Old Datasets
ANALYST NAME | DATE

This notebook joins the old dataset with the new dataset, accounting for potentially dissimilar column labels. For backwards compatibility, if the old and new datasets overlap in time, the new dataset is appended to the end of the old dataset and the overlapping data are trimmed from the new dataset.


Once all steps have been completed, a single .csv file is generated. It spans the entire period for which data are available across both the old and new files and contains the following quantities:
* date and time (UTC)
* vented pressure, cm
* raw pressure, cm
* barocorrected pressure, cm
* adjusted stage, cm
* estimated discharge, cms
* water temperature, degrees C
* discharge flag

Author of Template and Underlying Code: Joe Ammatelli | (jamma@uw.edu) | August 2022

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta

sys.path.insert(0, os.path.abspath(os.path.join('..', '..', 'src')))

import config
import level_baro_utils

sys.path.remove(os.path.abspath(os.path.join('..', '..', 'src')))

## Configure Plotting Preferences
**Analyst TODO:**
* Choose plotting backend:
    - Interactive (recommended): uncomment `%matplotlib notebook` and `FIGSIZE=None`; comment out `FIGSIZE = config.FIGSIZE`
    - Inline: comment out `%matplotlib notebook` and `FIGSIZE=None`; uncomment `FIGSIZE = config.FIGSIZE`

In [2]:
%matplotlib notebook
FIGSIZE=None

#FIGSIZE = config.FIGSIZE

sns.set_theme()

## Specify site code and define start/end years of new series
**Analyst TODO**:
* assign an integer representing the site to the variable `sitecode`. Mappings are as follows (listed from upstream to downstream):
    * 0 : Lyell Below Maclure
    * 1 : Lyell Above Twin Bridges
    * 2 : Dana Fork at Bug Camp
    * 3 : Tuolumne River at Highway 120
    * 4 : Budd Creek
    * 5 : Delaney Above PCT
* assign an integer (format 'YYYY') representing the first year of data collection to `start_year`
* assign an integer (format 'YYYY') representing the last year of data collection to `end_year`

These input parameters are used to automatically retrieve the postprocessed data.
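
As a rough illustration of how these parameters could resolve to a file name, the sketch below uses hypothetical stand-ins for `config.SITE_SHORTNAME` and `config.FINAL_OUTPUT_FN`. The template string and the short name are assumptions inferred from the output file name printed at the end of this notebook; the actual values live in `src/config.py` and may differ.

```python
# Hypothetical stand-ins for values defined in src/config.py (assumptions)
SITE_SHORTNAME = {0: 'LyellBlwMaclure'}
FINAL_OUTPUT_FN = '{site}_timeseries_stage_Q_T_{start}_{end}.csv'

sitecode = 0
start_year = 2019
end_year = 2021

# Fill the template with the site short name and the year range
fn = FINAL_OUTPUT_FN.format(site=SITE_SHORTNAME[sitecode],
                            start=start_year,
                            end=end_year)
print(fn)
```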

In [3]:
sitecode = 0

start_year = 2019
end_year = 2021

## Read in both datasets
**Analyst TODO:** Inspect the output tables and ensure each column has an appropriate datatype. To correct a column, map its column number to the desired datatype in the dictionary `dtypes` and call the `choose_column_dtype` function; otherwise, leave `dtypes` as an empty dictionary.

**Old Dataset**

In [4]:
old_fn = 'Lyell_blw_Maclure_timeseries_stage_Q_T_2005_2018.csv'
old_path = os.path.join('..', '..', 'compiled_data', 'published', old_fn)
old_df = pd.read_csv(old_path, index_col=0, parse_dates=[0], infer_datetime_format=True, na_values=[' NaN'])


In [5]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 225479 entries, 2004-07-16 00:00:00 to 2018-06-26 23:00:00
Data columns (total 10 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0    raw_pressure(cm)                            225479 non-null  float64
 1    barocorrected_pressure(cm)                  225479 non-null  float64
 2    offset(cm)                                  225479 non-null  float64
 3    stage(cm)                                   225479 non-null  object 
 4    estimated_discharge(cms)                    225479 non-null  object 
 5   lower_confidence_discharge_cms_bestestimate  225479 non-null  float64
 6   upper_confidence_discharge_cms_bestestimate  225479 non-null  float64
 7    instrument_ID                               225479 non-null  int64  
 8    water_temperature(deg_C)                    225479 non-null  float64
 9    discharge flag          

Explicitly choose the datatype for incorrectly typed columns (`read_csv` sometimes infers the wrong datatype for a column and will not allow conversion from certain types to others). This step should not be necessary for future processing.
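
The internals of `choose_column_dtype` are not shown in this notebook. The following is a minimal sketch of one way such a conversion could work, assuming unparseable strings should become NaN; `coerce_columns_by_position` is a hypothetical name, not part of `level_baro_utils`.

```python
import numpy as np
import pandas as pd


def coerce_columns_by_position(df, dtypes):
    """Cast columns, keyed by integer position, to the requested dtypes.

    Hypothetical stand-in for level_baro_utils.choose_column_dtype.
    """
    out = df.copy()
    for position, dtype in dtypes.items():
        name = out.columns[position]
        if np.issubdtype(dtype, np.number):
            # Strings that cannot be parsed as numbers become NaN
            out[name] = pd.to_numeric(out[name], errors='coerce').astype(dtype)
        else:
            out[name] = out[name].astype(dtype)
    return out


# Toy data: the second column was read as text because of one bad entry
demo = pd.DataFrame({'a': [1.0, 2.0], ' stage(cm)': ['227.69', ' bad ']})
fixed = coerce_columns_by_position(demo, {1: np.float64})
print(fixed.dtypes)
```

Coercing in this way would lower the non-null count of a column whenever some entries cannot be parsed, so it is worth comparing `info()` before and after the conversion.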

In [6]:
dtypes = {3:np.float64,
          4:np.float64}

In [7]:
old_df = level_baro_utils.choose_column_dtype(old_df, dtypes)

In [8]:
old_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 225479 entries, 2004-07-16 00:00:00 to 2018-06-26 23:00:00
Data columns (total 10 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0    raw_pressure(cm)                            225479 non-null  float64
 1    barocorrected_pressure(cm)                  225479 non-null  float64
 2    offset(cm)                                  225479 non-null  float64
 3    stage(cm)                                   223514 non-null  float64
 4    estimated_discharge(cms)                    223514 non-null  float64
 5   lower_confidence_discharge_cms_bestestimate  225479 non-null  float64
 6   upper_confidence_discharge_cms_bestestimate  225479 non-null  float64
 7    instrument_ID                               225479 non-null  int64  
 8    water_temperature(deg_C)                    225479 non-null  float64
 9    discharge flag          

In [9]:
old_df.head()

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),offset(cm),stage(cm),estimated_discharge(cms),lower_confidence_discharge_cms_bestestimate,upper_confidence_discharge_cms_bestestimate,instrument_ID,water_temperature(deg_C),discharge flag
2004-07-16 00:00:00,137.3,67.8,159.9,227.69,1.01,0.64,1.51,1,13.1,0
2004-07-16 00:30:00,137.5,68.7,172.1,240.79,2.47,1.24,3.05,1,12.94,0
2004-07-16 01:00:00,138.7,69.9,172.1,242.01,2.65,1.34,3.25,1,12.69,0
2004-07-16 01:30:00,139.5,70.8,172.1,242.93,2.78,1.42,3.41,1,12.35,0
2004-07-16 02:00:00,140.9,72.3,172.1,244.45,3.02,1.56,3.7,1,12.03,0


**New Dataset**

In [10]:
new_fn = config.FINAL_OUTPUT_FN.format(site=config.SITE_SHORTNAME[sitecode],
                                       start=start_year,
                                       end=end_year)

new_path = os.path.join('..', '..', 'stitch_discharge', 'data', 'processed', new_fn)

new_df = pd.read_csv(new_path, index_col=0, parse_dates=[0], infer_datetime_format=True)

Inspect resultant tables

In [11]:
new_df.head()

date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-06-01 00:00:00,65.80632,,,241.341155,2.548146,4.018,0
2018-06-01 00:30:00,66.20256,,,241.737395,2.606281,3.783,0
2018-06-01 01:00:00,67.48272,,,243.017555,2.797776,3.525,0
2018-06-01 01:30:00,67.11696,,,242.651795,2.742493,3.257,0
2018-06-01 02:00:00,66.32448,,,241.859315,2.624277,2.968,0


## If the old dataset has different column names than the new dataset, map the old names to the new ones
This step compensates for different header labels used in prior datasets. Moving forward (as of the release of the first version of this processing suite), attempts are made to use a consistent labelling scheme so that this step is not necessary in the future.

### Specify the old label names
**Analyst TODO**:
If the old dataset uses different header names, define each of those header names in the appropriate variable below (leave as an empty string otherwise).

e.g.
* if the old dataset uses the label `stage (cm)` to describe the offset stage value, set the variable `adjusted_stage_label` equal to `stage (cm)`
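
To make the mapping concrete, here is a minimal sketch of how old labels could be renamed to the new scheme. `rename_old_columns` is a hypothetical helper; it assumes that columns without a counterpart in the new dataset are dropped, and it is not the actual implementation of `map_old_labels_2_new`.

```python
import pandas as pd

# Column names used by the new dataset, in the same order as `old_labels`
NEW_LABELS = ['vented_pressure(cm)', 'raw_pressure(cm)',
              'barocorrected_pressure(cm)', 'adjusted_stage(cm)',
              'estimated_discharge(cms)', 'water_temperature(deg_C)',
              'discharge_flag']


def rename_old_columns(df, old_labels, new_labels=NEW_LABELS):
    """Rename old headers to the new ones; drop columns with no counterpart."""
    # Empty strings mean the old dataset has no such column
    mapping = {old: new for old, new in zip(old_labels, new_labels) if old}
    kept = [col for col in df.columns if col in mapping]
    return df[kept].rename(columns=mapping)


# Toy data: only the stage column has a counterpart in the new scheme
demo = pd.DataFrame({' stage(cm)': [227.69], ' offset(cm)': [159.9]})
labels = ['', '', '', ' stage(cm)', '', '', '']
renamed = rename_old_columns(demo, labels)
print(renamed.columns.tolist())
```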

In [12]:
vented_pressure_label = ''
raw_pressure_label = ' raw_pressure(cm)'
barocorrected_pressure_label = ' barocorrected_pressure(cm)'
adjusted_stage_label = ' stage(cm)'
estimated_discharge_label = ' estimated_discharge(cms)'
water_temperature_label = ' water_temperature(deg_C)'
discharge_flag_label = ' discharge flag'

old_labels = [vented_pressure_label, 
              raw_pressure_label, 
              barocorrected_pressure_label, 
              adjusted_stage_label, 
              estimated_discharge_label,
              water_temperature_label,
              discharge_flag_label]

### Update the header of the old dataset dataframe to match the header of the new dataset dataframe
**Analyst TODO:** Run cells

In [14]:
old_df = level_baro_utils.map_old_labels_2_new(old_df, old_labels)
old_df.head()

date_time(UTC:PDT+7),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2004-07-16 00:00:00,137.3,67.8,227.69,1.01,13.1,0
2004-07-16 00:30:00,137.5,68.7,240.79,2.47,12.94,0
2004-07-16 01:00:00,138.7,69.9,242.01,2.65,12.69,0
2004-07-16 01:30:00,139.5,70.8,242.93,2.78,12.35,0
2004-07-16 02:00:00,140.9,72.3,244.45,3.02,12.03,0


## Join the old dataframe with the new dataframe (only the columns they have in common: namely the labels listed in the previous step)
**Analyst TODO:** Run cells
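
The join logic itself lives in `level_baro_utils.join_dataframes`, which is not shown here. The sketch below illustrates the general idea under stated assumptions: the old series is kept whole, rows of the new series at or before the last old timestamp are trimmed, and columns missing from one frame are filled with NaN. `join_trimming_overlap` is a hypothetical name, and the values are toy data.

```python
import pandas as pd


def join_trimming_overlap(old_df, new_df):
    """Append new_df to old_df, keeping only new rows after the old series ends."""
    boundary = old_df.index[-1]
    # Trim the part of the new series that overlaps the old one
    new_trimmed = new_df.loc[new_df.index > boundary]
    # Columns present in only one frame are filled with NaN in the other
    joined = pd.concat([old_df, new_trimmed], sort=False)
    return joined, boundary


old = pd.DataFrame({'adjusted_stage(cm)': [1.0, 2.0]},
                   index=pd.to_datetime(['2018-06-26 22:30',
                                         '2018-06-26 23:00']))
new = pd.DataFrame({'vented_pressure(cm)': [10.0, 11.0],
                    'adjusted_stage(cm)': [3.0, 4.0]},
                   index=pd.to_datetime(['2018-06-26 23:00',
                                         '2018-06-26 23:30']))

joined, boundary = join_trimming_overlap(old, new)
print(joined)
```

In this toy case the overlapping 23:00 row of the new series is discarded, so the old value at that timestamp is the one that survives.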

In [15]:
resultant_df, boundary = level_baro_utils.join_dataframes(old_df, new_df)

# may need to override datatype of some columns
resultant_df['estimated_discharge(cms)'] = resultant_df['estimated_discharge(cms)'].astype(np.float64)

resultant_df.head()

Joining old and new series
End of old series: 2018-06-26 23:00:00
Start of new series: 2018-06-26 23:30:00


date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2004-07-16 00:00:00,,137.3,67.8,227.69,1.01,13.1,0
2004-07-16 00:30:00,,137.5,68.7,240.79,2.47,12.94,0
2004-07-16 01:00:00,,138.7,69.9,242.01,2.65,12.69,0
2004-07-16 01:30:00,,139.5,70.8,242.93,2.78,12.35,0
2004-07-16 02:00:00,,140.9,72.3,244.45,3.02,12.03,0


## Inspect result around boundary of old/new dataset
**Analyst TODO:** Inspect the results. Verify that the boundary between the old and new series does not contain duplicated values.
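
Alongside the visual inspection, a few programmatic checks can help. The sketch below is a hypothetical helper (`check_boundary` is not part of `level_baro_utils`) that assumes a sorted datetime index, a 30-minute sampling interval as seen in the tables above, and that `boundary` is the last timestamp of the old series.

```python
import pandas as pd


def check_boundary(df, boundary, expected_step=pd.Timedelta(minutes=30)):
    """Return simple diagnostics about the index around the join boundary."""
    idx = df.index
    # Position of the first row strictly after the boundary timestamp
    nxt = idx.searchsorted(boundary, side='right')
    step = idx[nxt] - boundary if nxt < len(idx) else None
    return {
        'unique': idx.is_unique,
        'sorted': idx.is_monotonic_increasing,
        'step_ok': step == expected_step,
    }


# Toy series with one timestamp on each side of the boundary
idx = pd.to_datetime(['2018-06-26 22:30', '2018-06-26 23:00',
                      '2018-06-26 23:30'])
demo = pd.DataFrame({'adjusted_stage(cm)': [1.0, 2.0, 3.0]}, index=idx)
report = check_boundary(demo, pd.Timestamp('2018-06-26 23:00'))
print(report)
```

If any entry in the report is false, the rows around the boundary deserve a closer look before the series is saved.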

In [16]:
resultant_df.plot()

<AxesSubplot:xlabel='date_time(UTC:PDT+7)'>

In [17]:
level_baro_utils.plot_boundary(resultant_df, boundary)

date_time(UTC:PDT+7),vented_pressure(cm),raw_pressure(cm),barocorrected_pressure(cm),adjusted_stage(cm),estimated_discharge(cms),water_temperature(deg_C),discharge_flag
2018-06-26 22:00:00,,61.48,61.48,237.02,1.95,10.9,0
2018-06-26 22:30:00,,62.09,62.09,237.63,2.03,11.08,0
2018-06-26 23:00:00,,62.48,62.48,238.02,2.08,11.17,0
2018-06-26 23:30:00,63.73368,,,239.268515,2.252926,11.23,0
2018-06-27 00:00:00,65.532,,,241.066835,2.508215,11.24,0


## Save Series

In [18]:
first_year = old_df.index[0].year
last_year = new_df.index[-1].year

level_baro_utils.save_final_data(resultant_df, sitecode, first_year, last_year)

Wrote data to ../data/processed/LyellBlwMaclure_timeseries_stage_Q_T_2004_2021.csv
