
Merge pull request #97 from TUW-GEO/make-validation-framework-more-modular

Make validation framework more modular and fix a few smaller bugs that slipped into the last release.
cpaulik committed Apr 21, 2016
2 parents 66d18b9 + 6df9518 commit 90dbb1d
Showing 13 changed files with 958 additions and 481 deletions.
7 changes: 7 additions & 0 deletions CHANGES.md
@@ -1,3 +1,10 @@
# v0.5.1, 2016-04-21

* Fix bug in jobs argument passing to Validation class.
* Add support to use a pre initialized DataManager instance in the Validation class.
* Add support for per dataset reading method names in the DataManager. This
relaxes the assumption that every dataset has a `read_ts` method.

# v0.5.0, 2016-04-20

* Fix bug in temporal resampling if input was a pandas.Series
14 changes: 2 additions & 12 deletions docs/index.rst
@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to the documentation of pytesmo - a python Toolbox for the Evaluation of Soil Moisture Observations
Welcome to the documentation of pytesmo - a Python Toolbox for the Evaluation of Soil Moisture Observations
===========================================================================================================

Contents:
@@ -15,14 +15,4 @@ Contents:

introduction.rst
examples.rst

_rst/pytesmo


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

moduleindex.rst
5 changes: 5 additions & 0 deletions docs/moduleindex.rst
@@ -0,0 +1,5 @@
API Documentation
*****************

* :ref:`genindex`
* :ref:`modindex`
14 changes: 7 additions & 7 deletions docs/validation_framework.ipynb
@@ -8,28 +8,28 @@
"\n",
"The pytesmo validation framework takes care of iterating over datasets, spatial and temporal matching as well as scaling. It uses metric calculators to then calculate metrics that are returned to the user. There are several metrics calculators included in pytesmo but new ones can be added simply.\n",
"\n",
"### Overview\n",
"## Overview\n",
"\n",
"How does the validation framework work? It makes these assumptions about the used datasets:\n",
"\n",
"- The dataset readers that are used have a `read_ts` method that can be called either by a grid point index (gpi) which can be any indicator that identifies a certain grid point or by using longitude and latitude. This means that both call signatures `read_ts(gpi)` and `read_ts(lon, lat)` must be valid. Please check the [pygeobase](https://github.com/TUW-GEO/pygeobase) documentation for more details on how a fully compatible dataset class should look. But a simple `read_ts` method should do for the validation framework.\n",
"- The dataset readers that are used have a `read_ts` method that can be called either by a grid point index (gpi) which can be any indicator that identifies a certain grid point or by using longitude and latitude. This means that both call signatures `read_ts(gpi)` and `read_ts(lon, lat)` must be valid. Please check the [pygeobase](https://github.com/TUW-GEO/pygeobase) documentation for more details on how a fully compatible dataset class should look. But a simple `read_ts` method should do for the validation framework. This assumption can be relaxed by using the `read_ts_names` keyword in the pytesmo.validation_framework.data_manager.DataManager class.\n",
"- The `read_ts` method returns a pandas.DataFrame time series.\n",
"- Ideally the datasets classes also have a `grid` attribute that is a [pygeogrids](http://pygeogrids.readthedocs.org/en/latest/) grid. This makes the calculation of lookup tables easily possible and the nearest neighbor search faster.\n",
"\n",
"Fortunately these assumptions are true about the dataset readers included in pytesmo. \n",
"\n",
"It also makes a few assumptions about how to perform a validation. For a comparison study it is often necessary to choose a spatial reference grid, a temporal reference and a scaling or data space reference. \n",
"\n",
"#### Spatial reference\n",
"### Spatial reference\n",
"The spatial reference is the one to which all the other datasets are matched spatially. Often through nearest neighbor search. The validation framework uses grid points of the dataset specified as the spatial reference to spatially match all the other datasets with nearest neighbor search. Other, more sophisticated spatial matching algorithms are not implemented at the moment. If you need a more complex spatial matching then a preprocessing of the data is the only option at the moment.\n",
"\n",
"#### Temporal reference\n",
"### Temporal reference\n",
"The temporal reference is the dataset to which the other dataset are temporally matched. That means that the nearest observation to the reference timestamps in a certain time window is chosen for each comparison dataset. This is by default done by the temporal matching module included in pytesmo. How many datasets should be matched to the reference dataset at once can be configured, we will cover how to do this later.\n",
"\n",
"#### Data space reference\n",
"### Data space reference\n",
"It is often necessary to bring all the datasets into a common data space by using. Scaling is often used for that and pytesmo offers a choice of several scaling algorithms (e.g. CDF matching, min-max scaling, mean-std scaling, triple collocation based scaling). The data space reference can also be chosen independently from the other two references. \n",
"\n",
"### Data Flow\n",
"## Data Flow\n",
"\n",
"After it is initialized, the validation framework works through the following steps:\n",
"\n",
@@ -41,7 +41,7 @@
"6. Get the calculated metrics from the metric calculators\n",
"7. Put all the metrics into a dictionary by dataset combination and return them.\n",
"\n",
"### Masking datasets\n",
"## Masking datasets\n",
"Masking datasets can be used if the datasets that are compared do not contain the necessary information to mask them. For example we might want to use modelled soil temperature data to mask our soil moisture observations before comparing them. To be able to do that we just need a Dataset that returns a pandas.DataFrame with one column of boolean data type. Everywhere where the masking dataset is `True` the data will be masked.\n",
"\n",
"Let's look at a first example."
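Two hedged sketches follow for the notebook cells above. First, a minimal reader of the kind the assumptions describe (callable with a gpi or with lon/lat, returning a pandas.DataFrame, carrying a pygeogrids grid), plus a masking variant that returns a single boolean column. All class names and the toy grid are illustrative and not part of pytesmo:

```python
import numpy as np
import pandas as pd
from pygeogrids.grids import BasicGrid


class ToyReader(object):
    """Illustrative reader: read_ts works with a gpi or with lon/lat."""

    def __init__(self):
        # a tiny grid so that nearest neighbour lookups are possible
        lons = np.array([14.0, 15.0, 14.0, 15.0])
        lats = np.array([45.0, 45.0, 46.0, 46.0])
        self.grid = BasicGrid(lons, lats)

    def read_ts(self, *args):
        if len(args) == 1:
            gpi = args[0]            # read_ts(gpi)
        else:
            lon, lat = args          # read_ts(lon, lat)
            gpi, _ = self.grid.find_nearest_gpi(lon, lat)
        # a real reader would load the time series belonging to gpi
        index = pd.date_range("2007-01-01", periods=24, freq="H")
        return pd.DataFrame({"sm": np.random.rand(24)}, index=index)


class ToyMaskingReader(ToyReader):
    """Masking reader: one boolean column, True marks values to be masked."""

    def read_ts(self, *args):
        sm = super(ToyMaskingReader, self).read_ts(*args)
        return pd.DataFrame({"mask": sm["sm"] < 0.2}, index=sm.index)
```

Second, a generic illustration of the data space reference idea from the same cells: mean-std scaling rescales a comparison series so that its mean and standard deviation match the reference. This is plain pandas, not a call into pytesmo's scaling module:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ref = pd.Series(0.25 + 0.05 * rng.standard_normal(500))    # data space reference
other = pd.Series(0.40 + 0.10 * rng.standard_normal(500))  # differently scaled dataset

# bring `other` into the data space of `ref` by matching mean and standard deviation
scaled = (other - other.mean()) / other.std() * ref.std() + ref.mean()
print(round(float(scaled.mean()), 3), round(float(scaled.std()), 3))  # close to 0.25 / 0.05
```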
11 changes: 6 additions & 5 deletions docs/validation_framework.rst
@@ -22,7 +22,9 @@ the used datasets:
`pygeobase <https://github.com/TUW-GEO/pygeobase>`__ documentation
for more details on how a fully compatible dataset class should look.
But a simple ``read_ts`` method should do for the validation
framework.
framework. This assumption can be relaxed by using the
``read_ts_names`` keyword in the
pytesmo.validation\_framework.data\_manager.DataManager class.
- The ``read_ts`` method returns a pandas.DataFrame time series.
- Ideally the datasets classes also have a ``grid`` attribute that is a
`pygeogrids <http://pygeogrids.readthedocs.org/en/latest/>`__ grid.
@@ -37,7 +39,7 @@ comparison study it is often necessary to choose a spatial reference
grid, a temporal reference and a scaling or data space reference.

Spatial reference
^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~

The spatial reference is the one to which all the other datasets are
matched spatially. Often through nearest neighbor search. The validation
@@ -49,7 +51,7 @@ matching then a preprocessing of the data is the only option at the
moment.
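A toy illustration of that nearest neighbour matching with pygeogrids, which the reader assumptions above already reference; the grids here are made up and pytesmo builds such lookup tables internally:

```python
import numpy as np
from pygeogrids.grids import BasicGrid

# toy reference grid and a coarser "other" grid
ref_grid = BasicGrid(np.array([14.1, 14.9, 15.2]), np.array([45.1, 45.8, 46.2]))
other_grid = BasicGrid(np.array([14.0, 15.0]), np.array([45.0, 46.0]))

# nearest neighbour of a single location on the other grid
gpi, distance = other_grid.find_nearest_gpi(14.1, 45.1)
print(gpi, distance)

# lookup table mapping every reference gpi to its nearest other-grid gpi
print(ref_grid.calc_lut(other_grid))
```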

Temporal reference
^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~

The temporal reference is the dataset to which the other dataset are
temporally matched. That means that the nearest observation to the
@@ -60,7 +62,7 @@ reference dataset at once can be configured, we will cover how to do
this later.

Data space reference
^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~~

It is often necessary to bring all the datasets into a common data space.
Scaling is often used for that and pytesmo offers a choice of
@@ -190,7 +192,6 @@ framework can go through the jobs and read the correct time series.
2007-01-01 05:00:00 0.214 U M
Initialize the Validation class
-------------------------------

Empty file removed examples/__init__.py
33 changes: 22 additions & 11 deletions pytesmo/validation_framework/data_manager.py
@@ -31,6 +31,8 @@

import pandas as pd

from pygeobase.object_base import TS


class DataManager(object):

@@ -69,9 +71,10 @@ class DataManager(object):
period : list, optional
Of type [datetime start, datetime end]. If given then the two input
datasets will be truncated to start <= dates <= end.
read_ts_method_name: string, optional
read_ts_names: string or dict of strings, optional
if another method name than 'read_ts' should be used for reading the data
then it can be specified here.
then it can be specified here. If it is a dict then specify a
function name for each dataset.
Methods
-------
@@ -88,7 +91,7 @@ class DataManager(object):

def __init__(self, datasets, ref_name,
period=None,
read_ts_method_name='read_ts'):
read_ts_names='read_ts'):
"""
Initialize parameters.
"""
@@ -111,7 +114,13 @@ def __init__(self, datasets, ref_name,

self.period = period
self.luts = self.get_luts()
self.read_ts_method_name = read_ts_method_name
if type(read_ts_names) is dict:
self.read_ts_names = read_ts_names
else:
d = {}
for dataset in datasets:
d[dataset] = read_ts_names
self.read_ts_names = d

def _add_default_values(self):
"""
@@ -240,8 +249,10 @@ def read_ds(self, name, *args):
args.extend(ds['args'])

try:
func = getattr(ds['class'], self.read_ts_method_name)
func = getattr(ds['class'], self.read_ts_names[name])
data_df = func(*args, **ds['kwargs'])
if type(data_df) is TS:
data_df = data_df.data
except IOError:
warnings.warn(
"IOError while reading dataset {} with args {:}".format(name,
@@ -260,19 +271,19 @@ def read_ds(self, name, *args):
warnings.warn("No data for dataset {}".format(name))
return None

if len(data_df) == 0:
warnings.warn("No data for dataset {}".format(name))
return None

if not isinstance(data_df, pd.DataFrame):
warnings.warn("Data is not a DataFrame {:}".format(args))
return None

if self.period is not None:
data_df = data_df[self.period[0]:self.period[1]]
# here we use the isoformat since pandas slice behavior is
# different when using datetime objects.
data_df = data_df[
self.period[0].isoformat():self.period[1].isoformat()]

if len(data_df) == 0:
warnings.warn("No data for other dataset {:}".format(args))
warnings.warn("No data for dataset {} with arguments {:}".format(name,
args))
return None

else:
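A hedged usage sketch of the per-dataset reading method names handled above. The dummy readers and the layout of the `datasets` dictionary follow pytesmo's documented conventions and are not taken from this diff:

```python
import pandas as pd

from pytesmo.validation_framework.data_manager import DataManager


class ReadTsReader(object):
    """Dummy reader exposing the conventional read_ts method."""
    grid = None  # a real reader would ideally carry a pygeogrids grid

    def read_ts(self, *args):
        index = pd.date_range("2007-01-01", periods=48, freq="H")
        return pd.DataFrame({"sm": range(48)}, index=index)


class PlainReadReader(object):
    """Dummy reader whose reading method is simply called read."""
    grid = None

    def read(self, *args):
        index = pd.date_range("2007-01-01", periods=48, freq="H")
        return pd.DataFrame({"soil moisture": range(48)}, index=index)


datasets = {
    "ASCAT": {"class": ReadTsReader(), "columns": ["sm"]},
    "ISMN": {"class": PlainReadReader(), "columns": ["soil moisture"]},
}

# a single string keeps the old behaviour; a dict sets one method per dataset
dm = DataManager(datasets, "ISMN",
                 read_ts_names={"ASCAT": "read_ts", "ISMN": "read"})
print(dm.read_ds("ASCAT").head())
print(dm.read_ds("ISMN").head())

# the changelog notes that a pre-initialized DataManager can also be used with
# the Validation class; how it is passed there is not shown in this diff
```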
53 changes: 49 additions & 4 deletions pytesmo/validation_framework/metric_calculators.py
Expand Up @@ -43,14 +43,26 @@ class BasicMetrics(object):
This class just computes the basic metrics,
Pearson's R
Spearman's rho
optionally Kendall's tau
RMSD
BIAS
it also stores information about gpi, lat, lon
and number of observations
Parameters
----------
other_name: string, optional
Name of the column of the non-reference / other dataset in the
pandas DataFrame
calc_tau: boolean, optional
if True then also tau is calculated. This is set to False by default
since the calculation of Kendall's tau is rather slow and can significantly
impact performance of e.g. global validation studies
"""

def __init__(self, other_name='other'):
def __init__(self, other_name='k1',
calc_tau=False):

self.result_template = {'R': np.float32([np.nan]),
'p_R': np.float32([np.nan]),
@@ -66,6 +78,7 @@ def __init__(self, other_name='other'):
'lat': np.float64([np.nan])}

self.other_name = other_name
self.calc_tau = calc_tau

def calc_metrics(self, data, gpi_info):
"""
@@ -82,7 +95,7 @@ def calc_metrics(self, data, gpi_info):
Notes
-----
Kendall tau is not calculated at the moment
Kendall tau calculation is optional at the moment
because the scipy implementation is very slow which is problematic for
global comparisons
"""
@@ -99,16 +112,48 @@ def calc_metrics(self, data, gpi_info):
x, y = data['ref'].values, data[self.other_name].values
R, p_R = metrics.pearsonr(x, y)
rho, p_rho = metrics.spearmanr(x, y)
# tau, p_tau = metrics.kendalltau(x, y)
RMSD = metrics.rmsd(x, y)
BIAS = metrics.bias(x, y)

dataset['R'][0], dataset['p_R'][0] = R, p_R
dataset['rho'][0], dataset['p_rho'][0] = rho, p_rho
# dataset['tau'][0], dataset['p_tau'][0] = tau, p_tau
dataset['RMSD'][0] = RMSD
dataset['BIAS'][0] = BIAS

if self.calc_tau:
tau, p_tau = metrics.kendalltau(x, y)
dataset['tau'][0], dataset['p_tau'][0] = tau, p_tau

return dataset


class BasicMetricsPlusMSE(BasicMetrics):
"""
Basic Metrics plus Mean squared Error and the decomposition of the MSE
into correlation, bias and variance parts.
"""

def __init__(self, other_name='k1',
calc_tau=False):

super(BasicMetricsPlusMSE, self).__init__(other_name=other_name,
calc_tau=calc_tau)
self.result_template.update({'mse': np.float32([np.nan]),
'mse_corr': np.float32([np.nan]),
'mse_bias': np.float32([np.nan]),
'mse_var': np.float32([np.nan])})

def calc_metrics(self, data, gpi_info):
dataset = super(BasicMetricsPlusMSE, self).calc_metrics(data, gpi_info)
if len(data) < 10:
return dataset
x, y = data['ref'].values, data[self.other_name].values
mse, mse_corr, mse_bias, mse_var = metrics.mse(x, y)
dataset['mse'][0] = mse
dataset['mse_corr'][0] = mse_corr
dataset['mse_bias'][0] = mse_bias
dataset['mse_var'][0] = mse_var

return dataset


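A hedged usage sketch of the new `BasicMetricsPlusMSE` calculator on synthetic data. The `(gpi, lon, lat)` layout of `gpi_info` follows pytesmo's usual convention and is not visible in this diff:

```python
import numpy as np
import pandas as pd

from pytesmo.validation_framework.metric_calculators import BasicMetricsPlusMSE

# two correlated synthetic series; fewer than 10 samples would only return
# the NaN-filled result template (see the len(data) < 10 guard above)
np.random.seed(0)
ref = np.random.rand(100)
other = ref + 0.1 * np.random.rand(100)
data = pd.DataFrame({"ref": ref, "k1": other})

calc = BasicMetricsPlusMSE(other_name="k1")  # calc_tau defaults to False
result = calc.calc_metrics(data, gpi_info=(42, 16.37, 48.21))  # assumed (gpi, lon, lat)

print(result["R"], result["rho"], result["RMSD"], result["BIAS"])
print(result["mse"], result["mse_corr"], result["mse_bias"], result["mse_var"])
```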
