# Merging observations

This notebook shows how observations and observation collections can be merged.

## <a id=top></a>Notebook contents

1. [Simple merge](#simplemerge)
2. [Merge options](#mergeoptions)
3. [Merging observation collections](#mergeoc)

In [1]:
import numpy as np
import pandas as pd
import hydropandas as hpd
from IPython.display import display

import logging
hpd.util.get_color_logger('INFO');

## Simple merge<a id=simplemerge></a>

In [2]:
# observation 1
df = pd.DataFrame({'measurements':np.random.randint(0,10,5)}, index=pd.date_range('2020-1-1', '2020-1-5'))
o1 = hpd.Obs(df, name='obs',x=0, y=0)
o1 

Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,3


In [3]:
# observation 2
df = pd.DataFrame({'measurements':np.random.randint(0,10,5)}, index=pd.date_range('2020-1-6', '2020-1-10'))
o2 = hpd.Obs(df, name='obs',x=0, y=0)
o2

Unnamed: 0,measurements
2020-01-06,0
2020-01-07,1
2020-01-08,9
2020-01-09,3
2020-01-10,0


In [4]:
o1.merge_observation(o2)


INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,3
2020-01-06,0
2020-01-07,1
2020-01-08,9
2020-01-09,3
2020-01-10,0


## Merge options<a id=mergeoptions></a>

#### overlapping timeseries

In [5]:
# create a partly overlapping dataframe
df = pd.DataFrame({'measurements':np.concatenate([o1['measurements'].values[-2:],np.random.randint(0,10,3)])}, index=pd.date_range('2020-1-4', '2020-1-8'))
o3 = hpd.Obs(df, name='obs', x=0, y=0)
o3

Unnamed: 0,measurements
2020-01-04,7
2020-01-05,3
2020-01-06,5
2020-01-07,0
2020-01-08,3


In [6]:
o1.merge_observation(o3)

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,3
2020-01-06,5
2020-01-07,0
2020-01-08,3


In [7]:
# create a partly overlapping dataframe with different values
df = pd.DataFrame({'measurements':np.random.randint(0,10,5)}, index=pd.date_range('2020-1-4', '2020-1-8'))
o4 = hpd.Obs(df, name='obs', x=0, y=0)
o4

Unnamed: 0,measurements
2020-01-04,7
2020-01-05,8
2020-01-06,5
2020-01-07,3
2020-01-08,8


By default, an error is raised if the overlapping time series have different values.

In [8]:
o1.merge_observation(o4)

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series


ValueError: observations have different values for same time steps

With the `overlap` argument you can specify whether the left (existing) or the right (new) observation is used for overlapping time steps when merging, as in the example below.

In [9]:
print('use left')
display(o1.merge_observation(o4, overlap='use_left')) # use the existing observation
print('use right')
display(o1.merge_observation(o4, overlap='use_right')) # use the new observation


use left
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,3
2020-01-06,5
2020-01-07,3
2020-01-08,8


use right
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,8
2020-01-06,5
2020-01-07,3
2020-01-08,8
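
The behaviour of the three `overlap` settings on a time series can be approximated with plain pandas. The sketch below is only an illustration: `merge_series` is a hypothetical helper, not the hydropandas implementation, and it assumes that "left" means the existing observation and "right" the new one, as in the outputs above.

```python
import pandas as pd


def merge_series(left, right, overlap="error"):
    """Merge two time series and resolve overlapping time steps.

    Hypothetical helper for illustration; not part of hydropandas.
    """
    shared = left.index.intersection(right.index)
    # a conflict exists when the shared time steps hold different values
    conflict = not left.loc[shared].equals(right.loc[shared])
    if conflict and overlap == "error":
        raise ValueError("observations have different values for same time steps")
    if overlap == "use_right":
        # values of `right` win; `left` only fills the remaining time steps
        return right.combine_first(left)
    # "use_left", or no conflict at all: values of `left` win
    return left.combine_first(right)


left = pd.Series([8, 3, 6, 7, 3], index=pd.date_range("2020-01-01", "2020-01-05"))
right = pd.Series([7, 8, 5, 3, 8], index=pd.date_range("2020-01-04", "2020-01-08"))

merged_left = merge_series(left, right, overlap="use_left")
merged_right = merge_series(left, right, overlap="use_right")

# the only conflicting time step is 2020-01-05
print(merged_left.loc[pd.Timestamp("2020-01-05")])
print(merged_right.loc[pd.Timestamp("2020-01-05")])
```

Both results cover 2020-01-01 through 2020-01-08 and differ only on 2020-01-05, the one time step where the two series disagree.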


#### metadata
The `merge_observation` method checks by default if the metadata of the two observations is the same.

In [10]:
# observation 5
df = pd.DataFrame({'measurements':np.random.randint(0,10,5)}, index=pd.date_range('2020-1-6', '2020-1-10'))
o5 = hpd.Obs(df, name='obs5',x=0, y=0)
o5

Unnamed: 0,measurements
2020-01-06,5
2020-01-07,5
2020-01-08,2
2020-01-09,1
2020-01-10,3


When the metadata differs a ValueError is raised.

In [11]:
o1.merge_observation(o5)

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series


ValueError: existing observation name differs from new observation

If you set the `merge_metadata` argument to `False`, the metadata is not merged and only the time series of the observations are merged.

In [12]:
o1.merge_observation(o5, merge_metadata=False)

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series


Unnamed: 0,measurements
2020-01-01,8
2020-01-02,3
2020-01-03,6
2020-01-04,7
2020-01-05,3
2020-01-06,5
2020-01-07,5
2020-01-08,2
2020-01-09,1
2020-01-10,3


Just as with overlapping time series, the `overlap` argument can also be used to resolve conflicting metadata values.

In [13]:
o_merged = o1.merge_observation(o5, overlap='use_left', merge_metadata=True)
print('observation name when overlap="use_left":', o_merged.name)
o_merged = o1.merge_observation(o5, overlap='use_right', merge_metadata=True)
print('observation name when overlap="use_right":', o_merged.name)

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from new observation, use existing
observation name when overlap="use_left": obs
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from new observation, use new
observation name when overlap="use_right": obs5
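
The same rule can be pictured as a key-by-key comparison of two metadata dictionaries. This is a minimal sketch under that assumption: `merge_metadata` here is a hypothetical standalone function (it only borrows its name from the `merge_metadata` argument) and says nothing about how hydropandas stores or compares metadata internally.

```python
def merge_metadata(left, right, overlap="error"):
    """Combine two metadata dictionaries key by key (illustrative only)."""
    merged = dict(left)
    for key, new_value in right.items():
        if key not in merged or merged[key] == new_value:
            # no conflict: the key is new or both sides agree
            merged[key] = new_value
        elif overlap == "use_right":
            # conflict: the new observation wins
            merged[key] = new_value
        elif overlap != "use_left":
            # conflict without a resolution rule
            raise ValueError(f"existing observation {key} differs from new observation")
        # with "use_left" the existing value is simply kept
    return merged


meta_obs = {"name": "obs", "x": 0, "y": 0}
meta_obs5 = {"name": "obs5", "x": 0, "y": 0}

print(merge_metadata(meta_obs, meta_obs5, overlap="use_left")["name"])   # obs
print(merge_metadata(meta_obs, meta_obs5, overlap="use_right")["name"])  # obs5
```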


#### all combinations

In [14]:
# observation 6
df = pd.DataFrame({'measurements':np.random.randint(0,10,5),
                   'filter':np.ones(5)}, index=pd.date_range('2020-1-1', '2020-1-5'))
o6 = hpd.Obs(df, name='obs6',x=100, y=0)
o6

Unnamed: 0,measurements,filter
2020-01-01,0,1.0
2020-01-02,4,1.0
2020-01-03,8,1.0
2020-01-04,3,1.0
2020-01-05,3,1.0


In [15]:
# observation 7
df = pd.DataFrame({'measurements':np.concatenate([o5['measurements'].values[-1:],np.random.randint(0,10,4)]),
                   'remarks':['', '', '', 'unreliable', '']}, index=pd.date_range('2020-1-4', '2020-1-8'))
o7 = hpd.Obs(df, name='obs7',x=0, y=100)
o7

Unnamed: 0,measurements,remarks
2020-01-04,3,
2020-01-05,2,
2020-01-06,2,
2020-01-07,5,unreliable
2020-01-08,0,


In [16]:
o6.merge_observation(o7, overlap='use_right')

INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from new observation, use new
INFO:hydropandas.observation:existing observation x differs from new observation, use new
INFO:hydropandas.observation:existing observation y differs from new observation, use new


Unnamed: 0,measurements,remarks,filter
2020-01-01,0,,1.0
2020-01-02,4,,1.0
2020-01-03,8,,1.0
2020-01-04,3,,1.0
2020-01-05,2,,1.0
2020-01-06,2,,
2020-01-07,5,unreliable,
2020-01-08,0,,
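
The column handling in this result (the union of the columns, with gaps where a column has no data for a time step) resembles what `DataFrame.combine_first` does in plain pandas. The sketch below rebuilds the two tables above by hand to show this. It is an analogy, not a claim about how `merge_observation` is implemented.

```python
import pandas as pd

idx6 = pd.date_range("2020-01-01", "2020-01-05")
idx7 = pd.date_range("2020-01-04", "2020-01-08")

# plain DataFrames with the same values as o6 and o7 above
df6 = pd.DataFrame({"measurements": [0, 4, 8, 3, 3],
                    "filter": [1.0] * 5}, index=idx6)
df7 = pd.DataFrame({"measurements": [3, 2, 2, 5, 0],
                    "remarks": ["", "", "", "unreliable", ""]}, index=idx7)

# values of df7 win on shared time steps (comparable to overlap='use_right');
# df6 fills the remaining time steps and columns
merged = df7.combine_first(df6)
print(merged)
```

Time steps that exist in only one table get missing values in the columns that come from the other table, for example `filter` from 2020-01-06 onwards.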


## Merging observation collections<a id=mergeoc></a>

In [17]:
# create an observation collection
oc1 = hpd.ObsCollection.from_list([o1])
oc1

name,x,y,filename,source,unit,obs
obs,0,0,,,,Obs obs -----metadata------ name : obs x : 0 ...


We can add a single observation to this collection using the `add_observation` method.

In [18]:
oc1.add_observation(o2)
oc1

INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


name,x,y,filename,source,unit,obs
obs,0,0,,,,Obs obs -----metadata------ name : obs x : 0 ...


We can also combine two observation collections.

In [19]:
# create another observation collection
oc2 = hpd.ObsCollection.from_list([o5, o6])
oc2

# add the collection to the previous one
oc1.add_obs_collection(oc2, inplace=True)
oc1

INFO:hydropandas.obs_collection:adding obs5 to collection
INFO:hydropandas.obs_collection:adding obs6 to collection


name,x,y,filename,source,unit,obs
obs,0,0,,,,Obs obs -----metadata------ name : obs x : 0 ...
obs5,0,0,,,,Obs obs5 -----metadata------ name : obs5 x : ...
obs6,100,0,,,,Obs obs6 -----metadata------ name : obs6 x : ...


When observations are added, the collection automatically checks for overlap based on the observation names. If an observation with the same name is already in the collection and the overlapping data are identical, the two observations are merged.

In [20]:
# add o2 to the observation collection 1
oc1.add_observation(o2)

INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


If the observation you want to add has the same name but different values for the same time steps, an error is raised.

In [21]:
o1_mod = o1.copy()
o1_mod.loc['2020-01-02', 'measurements'] = 100
oc1.add_observation(o1_mod)

INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series


ValueError: observations have different values for same time steps

To avoid errors we can use the `overlap` argument to specify which observation we want to use.

In [22]:
oc1.add_observation(o1_mod, overlap='use_left')
oc1

INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata


name,x,y,filename,source,unit,obs
obs,0,0,,,,Obs obs -----metadata------ name : obs x : 0 ...
obs5,0,0,,,,Obs obs5 -----metadata------ name : obs5 x : ...
obs6,100,0,,,,Obs obs6 -----metadata------ name : obs6 x : ...
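
The name-based logic of this section (add when the name is new, merge when it already exists, raise on conflicting values unless `overlap` says otherwise) can be sketched with a plain dictionary. Everything in the block is an assumption made for illustration: `add_to_collection` is a hypothetical function and the collection is modelled as a dict of name to `{time step: value}`, which is not how `ObsCollection` stores its data.

```python
def add_to_collection(collection, name, series, overlap="error"):
    """Add a {time step: value} dict to `collection` under `name`.

    Hypothetical helper for illustration; not part of hydropandas.
    """
    if name not in collection:
        # new name: simply add the observation
        collection[name] = dict(series)
        return collection

    # existing name: merge into a copy so a failed merge changes nothing
    merged = dict(collection[name])
    for step, value in series.items():
        if step in merged and merged[step] != value:
            if overlap == "use_left":
                continue  # keep the existing value
            if overlap != "use_right":
                raise ValueError("observations have different values for same time steps")
        merged[step] = value
    collection[name] = merged
    return collection


collection = {"obs": {"2020-01-01": 8, "2020-01-02": 3}}
add_to_collection(collection, "obs5", {"2020-01-06": 5})   # new name: added
add_to_collection(collection, "obs", {"2020-01-03": 6})    # same name: merged
add_to_collection(collection, "obs", {"2020-01-02": 100}, overlap="use_left")
print(collection["obs"])  # {'2020-01-01': 8, '2020-01-02': 3, '2020-01-03': 6}
```

Merging into a copy first means that a `ValueError` leaves the collection unchanged, which keeps the sketch easy to reason about.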
