# Lyell Above Twin Bridges 2019-2021 Time Series Stitching 
Joe Ammatelli | 8-9-22

This notebook documents the steps taken to offset and combine postprocessed (trimmed, barocorrected, quality-controlled) level/stage time series from multiple years into a single continuous time series. In particular, this notebook facilitates the following steps:
1. Loading the previously published dataset
2. Loading manual stage measurements
3. Loading postprocessed vented time series
4. Loading postprocessed unvented time series
5. Offsetting segments such that they agree with the published data and/or manual stage measurements
6. Stitching of offset time series into a single time series

Once all steps have been completed, a single .csv file spanning the entire data collection period will be generated with the following quantities:
* date and time (UTC)
* vented pressure, cm
* raw pressure, cm
* barocorrected pressure, cm
* adjusted stage, cm
* estimated discharge, cms
* water temperature, degrees C
* discharge flag

Author of Template and Underlying Code: Joe Ammatelli | (jamma@uw.edu) | July 2022

## Import Relevant Libraries
**Analyst TODO**: Nothing

In [1]:
import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime, timedelta 

sys.path.insert(0, os.path.abspath(os.path.join('..', '..', 'src')))

import config
import level_baro_utils
import log_utils

sys.path.remove(os.path.abspath(os.path.join('..', '..', 'src')))

## Choose Plotting Backend
**Analyst TODO**:
* Choose plotting backend:
    - Interactive (recommended): uncomment `%matplotlib notebook` and `FIGSIZE=None`; comment out `FIGSIZE = config.FIGSIZE`
    - Inline: comment out `%matplotlib notebook` and `FIGSIZE=None`; uncomment `FIGSIZE = config.FIGSIZE`

In [2]:
%matplotlib notebook
FIGSIZE=None

#FIGSIZE = config.FIGSIZE

sns.set_theme()

## Specify Site Information
**Analyst TODO**:
* assign an integer representing the site to the variable `sitecode`. Mappings are as follows (ordered from upstream to downstream):
    * 0 : Lyell Below Maclure
    * 1 : Lyell Above Twin Bridges
    * 2 : Dana Fork at Bug Camp
    * 3 : Tuolumne River at Highway 120
    * 4 : Budd Creek
    * 5 : Delaney Above PCT
* assign an integer (format 'YYYY') representing the first year of data collection to `start_year`
* assign an integer (format 'YYYY') representing the last year of data collection to `end_year`
* assign a string (format 'YY-YY') representing the data collection span in years (i.e. `start_year` to `end_year`) to the variable `span`

These input parameters are used to automatically retrieve the postprocessed data, populate the correct log file, label any plots with relevant site descriptors, and automatically write output with descriptive names. 

In [3]:
# example 
# sitecode = 2
# start_year = 2019
# end_year = 2021
# span = '19-21'

sitecode = 1
start_year = 2019
end_year = 2021
span = '19-21'
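
The `span` string can also be derived from the two year values so that the three inputs cannot drift apart. The helper below is a minimal sketch; `make_span` is a hypothetical name and is not part of `level_baro_utils`.

```python
def make_span(start_year: int, end_year: int) -> str:
    """Return a 'YY-YY' label, e.g. (2019, 2021) -> '19-21'."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    # Keep only the last two digits of each year, zero-padded
    return f"{start_year % 100:02d}-{end_year % 100:02d}"

span = make_span(2019, 2021)
```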

## Load All Data
**Analyst TODO**:
The vented and unvented data segments are loaded automatically. However, because the previously published data sets and compiled stage data may have nonstandard names and/or table headers, the previously published data and compiled stage data will need to be loaded manually. 
* Read previously published data set into dataframe
* Read compiled stage measurements into dataframe

### Load Previously Published Data Set
**Analyst TODO**:
* Make changes as specified below

In [4]:
# TODO: specify file name (not full path, just name)
fn = 'Lyell_abv_Twin_timeseries_stage_Q_T_2002_2018.csv'

# correct relative path automatically configured
prev_path = os.path.join('..', '..', 'compiled_data', 'published', fn)

# TODO: change index_col and parse_dates entries as needed
prev_df = pd.read_csv(prev_path, index_col=0, parse_dates=[0], infer_datetime_format=True, na_values=[' NaN'])

# TODO: select the "adjusted stage" column from the dataframe 
# will need to check .csv file to see what the column label is
prev_stage_ds = prev_df[' upstream_barocorrected_pressure(cm)'].astype(np.float64).dropna()

# Preview selected series to make sure everything looks alright
prev_stage_ds.head()


date_time(UTC:PDT+7)
2005-09-03 01:00:00    31.9
2005-09-03 01:30:00    32.0
2005-09-03 02:00:00    31.5
2005-09-03 02:30:00    31.3
2005-09-03 03:00:00    31.0
Name:  upstream_barocorrected_pressure(cm), dtype: float64
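
The column label and NaN marker used above both carry a leading space, which suggests the published file pads each field after the comma. As an alternative sketch (not what this notebook runs), `pd.read_csv` can strip that padding with `skipinitialspace=True`, so the header and the `NaN` entries parse without special handling. The inline CSV text below is illustrative only and merely mimics the layout of the published file.

```python
import io
import pandas as pd

# Illustrative stand-in for the published file: fields padded with a space
raw = io.StringIO(
    "date_time(UTC:PDT+7), upstream_barocorrected_pressure(cm)\n"
    "2005-09-03 01:00:00, 31.9\n"
    "2005-09-03 01:30:00, NaN\n"
    "2005-09-03 02:00:00, 31.5\n"
)

# skipinitialspace drops the blank after each delimiter (header included)
demo_df = pd.read_csv(raw, index_col=0, parse_dates=True, skipinitialspace=True)
demo_stage = demo_df['upstream_barocorrected_pressure(cm)'].dropna()
```

With this option the column can be selected without the leading space in its label.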

### Load Manual Stage Measurements
**Analyst TODO**:
* Make changes as specified below (may change depending on how the file is formatted)

In [5]:
# TODO: specify file name (not full path, just name)
fn = 'stage19-21.csv'

# correct relative path automatically configured
stage_path = os.path.join('..', '..', 'compiled_data', 'stage', fn)

# TODO: change parse_dates param as needed
# numbers in the list should correspond to the columns in the table with date time data
manual_stage_df = pd.read_csv(stage_path, parse_dates=[[1,2]], infer_datetime_format=True)

# Perform timezone (PDT --> UTC) and unit (FT --> CM) corrections
# TODO: verify timezone of manual stage measurements, adjust offset as needed
utc_pdt_timedelta = timedelta(hours=7)
manual_stage_df['date_time (pdt)'] += utc_pdt_timedelta
manual_stage_df['stage (ft)'] *= level_baro_utils.FT_TO_CM

# Create dataframe for stage measurements
# TODO: for each entry in the columns dictionary, ensure the lefthand mapping matches the table labels
# e.g. if the compiled stage table has columns "date_time (pdt)" and "stage (ft)",
# the columns argument should be {'date_time (pdt)':'date_time(UTC)', 'stage (ft)':'stage(cm)'}
manual_stage_df.rename(columns={'date_time (pdt)':'date_time(UTC)', 'stage (ft)':'stage(cm)'}, inplace=True)
manual_stage_df.set_index('date_time(UTC)', inplace=True)

# Select only the manual stage measurements for the site of interest
# TODO: modify the indexer into manual_stage_df so that it matches the label of the column for the site label
site_manual_stage_df = manual_stage_df[manual_stage_df['site'] == 'LyellAbvTwinBridges']  # NOTE: had to hardcode name here
site_manual_stage_df.head()

| date_time(UTC) | site | stage(cm) | notes |
|---|---|---|---|
| 2015-07-17 16:15:00 | LyellAbvTwinBridges | 281.0256 | From Site profile |
| 2019-07-31 23:02:00 | LyellAbvTwinBridges | 295.9608 | |
| 2019-08-01 00:12:00 | LyellAbvTwinBridges | 295.9608 | |
| 2019-08-08 15:52:00 | LyellAbvTwinBridges | 300.228 | |
| 2019-08-08 16:46:00 | LyellAbvTwinBridges | 299.9232 | |
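
The two corrections applied in the cell above (shift local time to UTC, convert feet to centimeters) can be checked on a small standalone table. This is a sketch under stated assumptions: the value of `FT_TO_CM` is assumed to be 30.48 (the exact definition of a foot in centimeters; the constant in `level_baro_utils` is not shown here), and the field values are illustrative, chosen so that the converted rows match the first two rows printed above.

```python
from datetime import timedelta
import pandas as pd

FT_TO_CM = 30.48                      # assumption: 1 ft = 30.48 cm
UTC_MINUS_PDT = timedelta(hours=7)    # valid for PDT only; PST would need 8 hours

# Illustrative field records in local time and feet
field = pd.DataFrame({
    'date_time (pdt)': pd.to_datetime(['2015-07-17 09:15', '2019-07-31 16:02']),
    'site': ['LyellAbvTwinBridges', 'LyellAbvTwinBridges'],
    'stage (ft)': [9.22, 9.71],
})

converted = field.copy()
# Replace the local-time column with a UTC column
converted['date_time(UTC)'] = converted.pop('date_time (pdt)') + UTC_MINUS_PDT
# Replace the feet column with a centimeter column
converted['stage(cm)'] = converted.pop('stage (ft)') * FT_TO_CM
converted = converted.set_index('date_time(UTC)')
```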


### Load Vented Data
**Analyst TODO**: Nothing

In [6]:
# vented_segments = level_baro_utils.read_vented_segments(sitecode)
# vented_segments[-1].head()

vented_segments=None
print('No vented data available')

No vented data available


### Load Unvented Data
For each year that data is available, load unvented time series data

**Analyst TODO**: Nothing

In [7]:
unvented_segments = level_baro_utils.read_unvented_segments(sitecode, start_year, end_year)

loading ../../unvented_2019/data/barocorrected/LyellAbvTB_2019_barocorrected_0.csv
loading ../../unvented_2020/data/barocorrected/LyellAbvTB_2020_barocorrected_0.csv
loading ../../unvented_2021/data/barocorrected/LyellAbvTB_2021_barocorrected_0.csv


## Plot All Data Together (and develop plan for finding offsets)
**Analyst TODO**: Inspect plot, decide how you are going to compute the offsets for each segment.

In [8]:
level_baro_utils.plot_all(title=f'{config.SITE_LONGNAME[sitecode]}: All Raw Data', 
                          prev=prev_stage_ds,
                          prev_weeks=192,
                          stage=site_manual_stage_df, 
                          vented_segments=None, 
                          unvented_segments=unvented_segments,
                          figsize=FIGSIZE)


## From left to right, find offset of each segment
**Analyst TODO**:
For each segment, compute the offset needed so that the time series matches a portion of an overlapping time series (taken to be "ground truth") and/or fits the manual stage measurements. 
* Initialize offsets to zero; this way, we can incrementally observe how the offset time series look
* Create a new markdown cell and give it a descriptive name, e.g. "Find offset between published record and vented series"
* Create a new code cell; compute the difference between the time series and a reference (either vented series, previous overlapping record, or manual stage measurements); visualize the difference (to see if it is roughly constant)
    - Use `dif_btw_series` to generate a series representing the difference between two series at each step
    - Use `dif_btw_stage_series` to generate a series with the difference between a set of manual stage measurements and the corresponding sample from the time series 
* Create a new code cell; filter out members of the difference series as necessary so that the aggregate is not biased by outliers
* Create a new code cell; reduce the difference time series to a single offset value (mean and/or median are good choices) and save it in the appropriate entry of the offset data structure
* Create a new code cell; plot the time series with the offset applied (along with any other desired time series for comparison) and give it a descriptive title. Data plotting options (what to display):
    - `prev` : previously published record
    - `stage` : manual stage measurements
    - `vented_segments` : all vented segments
    - `unvented_segments` : all unvented segments
    - `vented_offsets` : offsets to apply to vented segments
    - `unvented_offsets` : offsets to apply to unvented segments

**Strategies**
* Start from the past and work toward the present (using previous records as references)
1. Try to find the offset between the previous record and the vented series; then find the difference between each unvented segment and the offset vented series
2. Fit vented/unvented data to stage measurements
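
The "difference, then reduce" procedure described above can be sketched on synthetic data. The helper `stage_minus_series` below is hypothetical and only stands in for `level_baro_utils.dif_btw_stage_series`, whose implementation is not shown in this notebook. It assumes the logger value at each manual measurement time is obtained by linear interpolation in time, and that measurements outside the logger record yield NaN. The library's actual matching rule may differ.

```python
import pandas as pd

def stage_minus_series(stage, series):
    """Hypothetical helper: manual stage minus the logger value at the same time."""
    # Put both sets of timestamps on one sorted index
    combined_index = series.index.union(stage.index)
    # Interpolate only between valid logger samples (no extrapolation)
    on_grid = series.reindex(combined_index).interpolate(method='time', limit_area='inside')
    return stage - on_grid.reindex(stage.index)

# Synthetic logger record sampled every 30 minutes
logger = pd.Series(
    [10.0, 12.0, 14.0],
    index=pd.to_datetime(['2019-07-15 17:00', '2019-07-15 17:30', '2019-07-15 18:00']))

# Synthetic manual measurements; the last one falls outside the logger record
manual = pd.Series(
    [20.0, 21.5, 30.0],
    index=pd.to_datetime(['2019-07-15 17:15', '2019-07-15 18:00', '2020-06-20 17:30']))

dif = stage_minus_series(manual, logger)   # NaN where there is no overlap
offset = dif.median()                      # NaN entries are skipped by default
```

Because the sign convention is "stage minus series", the resulting offset is the amount to add to the logger series so that it agrees with the manual measurements.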

### Initialize Offsets

In [9]:
vented_offsets, unvented_offsets = level_baro_utils.initialize_offsets(vented_segments, unvented_segments)

### 1. Find offset between previous record and stage measurement from 2015
We do not have a vented record to work with, and there is clearly some offset between the manual stage measurements and the pressure records. We therefore find the offset between the 2015 stage measurement and the previously published record. Assuming this offset is uniform across all the stage measurements, we can apply it to the other stage measurements and then proceed as usual. 

In [10]:
# The 2015 measurement is the first entry; select it by position with .iloc
offset_stage = level_baro_utils.dif_btw_stage_series(site_manual_stage_df['stage(cm)'], prev_stage_ds).iloc[0]
print(f'Stage Offset: {offset_stage}')

Stage Offset: 245.5756


In [11]:
# Build a new DataFrame with .assign() to avoid assigning into a filtered slice (SettingWithCopyWarning)
site_manual_stage_df = site_manual_stage_df.assign(**{'stage(cm)': site_manual_stage_df['stage(cm)'] - offset_stage})


In [12]:
level_baro_utils.plot_all(title=f'{config.SITE_LONGNAME[sitecode]}: offset stage measurements', 
                          prev=prev_stage_ds,
                          prev_weeks=192,
                          stage=site_manual_stage_df, 
                          vented_segments=None, 
                          unvented_segments=unvented_segments,
                          figsize=FIGSIZE)


### 1. Find difference between 2019 segment 0 and stage

In [13]:
series1 = site_manual_stage_df['stage(cm)']
series2 = unvented_segments[2019][0]['barocorrected_pressure(cm)']

dif = level_baro_utils.dif_btw_stage_series(series1, series2)

plt.figure()
dif.plot(marker='.', linestyle = 'None')
plt.title('series1 - series2')

dif


date_time(UTC)
2015-07-17 16:15:00          NaN
2019-07-15 17:45:00     8.226741
2019-07-15 19:36:00     7.844491
2019-07-18 23:30:00   -12.938965
2019-07-19 00:13:00   -14.634452
2019-07-31 23:02:00     1.391197
2019-08-01 00:12:00     1.857861
2019-08-08 15:52:00     4.772752
2019-08-08 16:46:00          NaN
2019-08-08 17:00:00          NaN
2020-06-20 17:30:00          NaN
2020-06-20 19:08:00          NaN
2020-07-22 23:40:00          NaN
2020-07-23 00:47:00          NaN
2020-08-06 18:31:00          NaN
2020-08-06 19:49:00          NaN
2020-08-13 17:38:00          NaN
2020-08-13 18:25:00          NaN
2021-06-15 17:40:00          NaN
2021-07-14 16:08:00          NaN
2021-07-14 16:53:00          NaN
2021-07-21 18:30:00          NaN
2021-07-21 19:40:00          NaN
2021-08-24 21:15:00          NaN
2021-08-24 22:15:00          NaN
dtype: float64

In [14]:
unvented_offsets[2019][0] = dif.mean()
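
The differences printed above include two large negative values (about -13 and -15 cm), so the choice of aggregate matters. As an illustration only, the snippet below recomputes the mean and the median from the seven non-NaN values copied from that output. It does not change the offset stored by the notebook.

```python
import pandas as pd

# Non-NaN differences (stage minus logger, cm) copied from the output above
dif_2019 = pd.Series([8.226741, 7.844491, -12.938965, -14.634452,
                      1.391197, 1.857861, 4.772752])

mean_offset = dif_2019.mean()      # what the notebook stores; pulled down by the two negatives
median_offset = dif_2019.median()  # less sensitive to those two values

print(f'mean:   {mean_offset:.4f} cm')
print(f'median: {median_offset:.4f} cm')
```

The mean works out to about -0.497 cm, which matches the gap between `barocorrected_pressure(cm)` and `adjusted_stage(cm)` in the first rows of the stitched table later in this notebook.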

In [15]:
level_baro_utils.plot_all(title=f'{config.SITE_LONGNAME[sitecode]}: offset unvented 2019 segment 0',
                          prev=prev_stage_ds,
                          prev_weeks=192,
                          stage=site_manual_stage_df,
                          vented_segments=None, 
                          unvented_segments=unvented_segments, 
                          unvented_offsets=unvented_offsets)


### 2. Leave 2020 segment as is
Using the mean offset for the 2019 segment, we get a plausible trend continuing from the end of the previous record. With no offset applied, the 2020 segment aligns with the end of the 2019 segment, as desired. 

### 3. Find difference between 2021 segment 0 and stage

In [16]:
series1 = site_manual_stage_df['stage(cm)']
series2 = unvented_segments[2021][0]['barocorrected_pressure(cm)']

dif = level_baro_utils.dif_btw_stage_series(series1, series2)

plt.figure()
dif.plot(marker='.', linestyle = 'None')
plt.title('series1 - series2')

dif


date_time(UTC)
2015-07-17 16:15:00          NaN
2019-07-15 17:45:00          NaN
2019-07-15 19:36:00          NaN
2019-07-18 23:30:00          NaN
2019-07-19 00:13:00          NaN
2019-07-31 23:02:00          NaN
2019-08-01 00:12:00          NaN
2019-08-08 15:52:00          NaN
2019-08-08 16:46:00          NaN
2019-08-08 17:00:00          NaN
2020-06-20 17:30:00          NaN
2020-06-20 19:08:00          NaN
2020-07-22 23:40:00          NaN
2020-07-23 00:47:00          NaN
2020-08-06 18:31:00          NaN
2020-08-06 19:49:00          NaN
2020-08-13 17:38:00          NaN
2020-08-13 18:25:00          NaN
2021-06-15 17:40:00    15.096767
2021-07-14 16:08:00    24.401097
2021-07-14 16:53:00    24.485061
2021-07-21 18:30:00    20.658357
2021-07-21 19:40:00    20.824501
2021-08-24 21:15:00          NaN
2021-08-24 22:15:00          NaN
dtype: float64

In [17]:
unvented_offsets[2021][0] = dif.max()

In [18]:
level_baro_utils.plot_all(title=f'{config.SITE_LONGNAME[sitecode]}: offset unvented segments',
                          prev=prev_stage_ds,
                          prev_weeks=192,
                          stage=site_manual_stage_df,
                          vented_segments=None, 
                          unvented_segments=unvented_segments, 
                          unvented_offsets=unvented_offsets)


## Manually Apply Corrections as Needed
**Analyst TODO**:
* Manually change offset values if computed values are clearly incorrect

## Create time series for entire period: select segments to use for time series, add offsets
**Analyst TODO**:
* Specify which segments to string together
    - create a list of lists; for each inner list, provide: segment, when to start using the segment in the stitched series, and the offset
    - e.g.: ```segments = [[vented_segments[0], vented_segments[0].index[0], vented_offsets[0]], [unvented_segments[2019][0], unvented_segments[2019][0].index[0], unvented_offsets[2019][0]], [unvented_segments[2021][0], unvented_segments[2021][0].index[0], unvented_offsets[2021][0]], [unvented_segments[2021][1], unvented_segments[2021][1].index[0], unvented_offsets[2021][1]]]``` corresponds to using the vented time series for the first segment of the stitched time series (from the beginning of the vented record to the beginning of the next segment), using the 0th segment of the unvented 2019 data for the second segment (from the start of the 2019 segment 0 record to the start of the next record), and so forth.

Once the segments are specified, they can be automatically joined by passing the segments list to `level_baro_utils.stitch_timeseries`.
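
To make the `[segment, start, offset]` convention concrete, here is a minimal sketch of one way such a list could be joined. The function `stitch_sketch` is hypothetical and is not the implementation of `level_baro_utils.stitch_timeseries`. It assumes each segment is used from its start time up to (but not including) the start time of the next segment, and that the adjusted stage is the barocorrected pressure plus the offset.

```python
import pandas as pd

def stitch_sketch(segments, column='barocorrected_pressure(cm)'):
    """Hypothetical stitcher: trim each segment at the next start, add its offset."""
    pieces = []
    for i, (df, start, offset) in enumerate(segments):
        # A segment ends where the next one begins; the last one runs to its end
        end = segments[i + 1][1] if i + 1 < len(segments) else None
        keep = df.index >= start
        if end is not None:
            keep &= df.index < end
        piece = df.loc[keep].copy()   # copy so the input frames are not modified
        piece['adjusted_stage(cm)'] = piece[column] + offset
        pieces.append(piece)
    return pd.concat(pieces).sort_index()

# Two synthetic, overlapping segments sampled every 30 minutes
seg_a = pd.DataFrame({'barocorrected_pressure(cm)': [10.0, 11.0, 12.0, 13.0]},
                     index=pd.date_range('2019-07-01 00:00', periods=4, freq='30min'))
seg_b = pd.DataFrame({'barocorrected_pressure(cm)': [20.0, 21.0, 22.0, 23.0]},
                     index=pd.date_range('2019-07-01 01:00', periods=4, freq='30min'))

demo_segments = [[seg_a, seg_a.index[0], 1.0],
                 [seg_b, seg_b.index[0], -2.0]]
demo_stitched = stitch_sketch(demo_segments)
```

In this example the overlap (01:00 to 01:30) is taken from the second segment, so every timestamp appears once in the result.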

In [19]:
# example
# segments = [[vented_segments[0], vented_segments[0].index[0], vented_offsets[0]],
#             [unvented_segments[2019][0], unvented_segments[2019][0].index[0], unvented_offsets[2019][0]],
#             [unvented_segments[2021][0], unvented_segments[2021][0].index[0], unvented_offsets[2021][0]],
#             [unvented_segments[2021][1], unvented_segments[2021][1].index[0], unvented_offsets[2021][1]]]

segments = [[unvented_segments[2019][0], unvented_segments[2019][0].index[0], unvented_offsets[2019][0]],
            [unvented_segments[2020][0], unvented_segments[2020][0].index[0], unvented_offsets[2020][0]],
            [unvented_segments[2021][0], unvented_segments[2021][0].index[0], unvented_offsets[2021][0]]]

stitched_df = level_baro_utils.stitch_timeseries(segments)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  segment['adjusted_stage(cm)'] = offset_series


### Inspect Stitched Time Series

In [20]:
plt.figure()
plt.plot(stitched_df['adjusted_stage(cm)'])
plt.xticks(rotation=30)
plt.ylabel('cm')
plt.title(f'{config.SITE_LONGNAME[sitecode]}: Stitched Time Series')

stitched_df.head()


| date_time(UTC:PDT+7) | vented_pressure(cm) | raw_pressure(cm) | barocorrected_pressure(cm) | adjusted_stage(cm) | estimated_discharge(cms) | water_temperature(deg_C) | discharge_flag |
|---|---|---|---|---|---|---|---|
| 2018-07-25 19:00:00 | NaN | 815.33 | 47.811598 | 47.314401 | NaN | 13.475 | 0 |
| 2018-07-25 19:30:00 | NaN | 815.22 | 47.662551 | 47.165354 | NaN | 13.235 | 0 |
| 2018-07-25 20:00:00 | NaN | 814.96 | 47.363503 | 46.866307 | NaN | 13.209 | 0 |
| 2018-07-25 20:30:00 | NaN | 814.65 | 47.014456 | 46.51726 | NaN | 13.301 | 0 |
| 2018-07-25 21:00:00 | NaN | 814.43 | 46.755409 | 46.258213 | NaN | 13.492 | 0 |


## Generate discharge time series using rating curve
**Analyst TODO**: Ensure the rating curve is up to date (update in config if necessary); inspect the result

In [21]:
stitched_df['estimated_discharge(cms)'] = level_baro_utils.compute_discharge(stitched_df['adjusted_stage(cm)'], sitecode)
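
For readers unfamiliar with rating curves, the sketch below shows a commonly used power-law form, Q = a (h - h0)^b, applied to a stage series. It is an assumption for illustration: the form and coefficients actually used by `level_baro_utils.compute_discharge` live in `config` and are not shown in this notebook, and the coefficient values here are made up.

```python
import pandas as pd

def power_law_discharge(stage_cm: pd.Series, a: float, h0_cm: float, b: float) -> pd.Series:
    """Hypothetical rating curve: Q = a * (h - h0)^b, zero at or below h0."""
    # Depth above the stage of zero flow; negative depths are clipped to zero
    head = (stage_cm - h0_cm).clip(lower=0.0)
    return a * head ** b

# Made-up coefficients and stages, for illustration only
demo_stage_cm = pd.Series([5.0, 30.0, 50.0])
demo_q = power_law_discharge(demo_stage_cm, a=0.01, h0_cm=10.0, b=2.0)
```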

In [22]:
level_baro_utils.plot_discharge(stitched_df, sitecode)

stitched_df.head()


| date_time(UTC:PDT+7) | vented_pressure(cm) | raw_pressure(cm) | barocorrected_pressure(cm) | adjusted_stage(cm) | estimated_discharge(cms) | water_temperature(deg_C) | discharge_flag |
|---|---|---|---|---|---|---|---|
| 2018-07-25 19:00:00 | NaN | 815.33 | 47.811598 | 47.314401 | 1.882593 | 13.475 | 0 |
| 2018-07-25 19:30:00 | NaN | 815.22 | 47.662551 | 47.165354 | 1.867403 | 13.235 | 0 |
| 2018-07-25 20:00:00 | NaN | 814.96 | 47.363503 | 46.866307 | 1.837055 | 13.209 | 0 |
| 2018-07-25 20:30:00 | NaN | 814.65 | 47.014456 | 46.51726 | 1.801854 | 13.301 | 0 |
| 2018-07-25 21:00:00 | NaN | 814.43 | 46.755409 | 46.258213 | 1.775883 | 13.492 | 0 |


## Write Output to File
**Analyst TODO**: Nothing to change

In [23]:
level_baro_utils.save_final_data(stitched_df, sitecode, start_year, end_year)

Wrote data to ../data/processed/LyellAbvTB_timeseries_stage_Q_T_2019_2021.csv
