
Merge pull request #97 from TUW-GEO/make-validation-framework-more-modular

Make validation framework more modular and fix a few smaller bugs that slipped into the last release.
cpaulik committed Apr 21, 2016
2 parents 66d18b9 + 6df9518 commit 90dbb1d
Showing 13 changed files with 958 additions and 481 deletions.
7 changes: 7 additions & 0 deletions CHANGES.md
@@ -1,3 +1,10 @@
# v0.5.1, 2016-04-21

* Fix bug in jobs argument passing to Validation class.
* Add support to use a pre initialized DataManager instance in the Validation class.
* Add support for per dataset reading method names in the DataManager. This
relaxes the assumption that every dataset has a `read_ts` method.

# v0.5.0, 2016-04-20

* Fix bug in temporal resampling if input was a pandas.Series
14 changes: 2 additions & 12 deletions docs/index.rst
@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to the documentation of pytesmo - a python Toolbox for the Evaluation of Soil Moisture Observations
Welcome to the documentation of pytesmo - a Python Toolbox for the Evaluation of Soil Moisture Observations
===========================================================================================================

Contents:
@@ -15,14 +15,4 @@ Contents:

introduction.rst
examples.rst

_rst/pytesmo


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

moduleindex.rst
5 changes: 5 additions & 0 deletions docs/moduleindex.rst
@@ -0,0 +1,5 @@
API Documentation
*****************

* :ref:`genindex`
* :ref:`modindex`
14 changes: 7 additions & 7 deletions docs/validation_framework.ipynb
@@ -8,28 +8,28 @@
"\n",
"The pytesmo validation framework takes care of iterating over datasets, spatial and temporal matching as well as scaling. It uses metric calculators to then calculate metrics that are returned to the user. There are several metrics calculators included in pytesmo but new ones can be added simply.\n",
"\n",
"### Overview\n",
"## Overview\n",
"\n",
"How does the validation framework work? It makes these assumptions about the used datasets:\n",
"\n",
"- The dataset readers that are used have a `read_ts` method that can be called either by a grid point index (gpi) which can be any indicator that identifies a certain grid point or by using longitude and latitude. This means that both call signatures `read_ts(gpi)` and `read_ts(lon, lat)` must be valid. Please check the [pygeobase](https://github.com/TUW-GEO/pygeobase) documentation for more details on how a fully compatible dataset class should look. But a simple `read_ts` method should do for the validation framework.\n",
"- The dataset readers that are used have a `read_ts` method that can be called either by a grid point index (gpi) which can be any indicator that identifies a certain grid point or by using longitude and latitude. This means that both call signatures `read_ts(gpi)` and `read_ts(lon, lat)` must be valid. Please check the [pygeobase](https://github.com/TUW-GEO/pygeobase) documentation for more details on how a fully compatible dataset class should look. But a simple `read_ts` method should do for the validation framework. This assumption can be relaxed by using the `read_ts_names` keyword in the pytesmo.validation_framework.data_manager.DataManager class.\n",
"- The `read_ts` method returns a pandas.DataFrame time series.\n",
"- Ideally the datasets classes also have a `grid` attribute that is a [pygeogrids](http://pygeogrids.readthedocs.org/en/latest/) grid. This makes the calculation of lookup tables easily possible and the nearest neighbor search faster.\n",
"\n",
"Fortunately these assumptions are true about the dataset readers included in pytesmo. \n",
"\n",
"It also makes a few assumptions about how to perform a validation. For a comparison study it is often necessary to choose a spatial reference grid, a temporal reference and a scaling or data space reference. \n",
"\n",
"#### Spatial reference\n",
"### Spatial reference\n",
"The spatial reference is the one to which all the other datasets are matched spatially. Often through nearest neighbor search. The validation framework uses grid points of the dataset specified as the spatial reference to spatially match all the other datasets with nearest neighbor search. Other, more sophisticated spatial matching algorithms are not implemented at the moment. If you need a more complex spatial matching then a preprocessing of the data is the only option at the moment.\n",
"\n",
"#### Temporal reference\n",
"### Temporal reference\n",
"The temporal reference is the dataset to which the other dataset are temporally matched. That means that the nearest observation to the reference timestamps in a certain time window is chosen for each comparison dataset. This is by default done by the temporal matching module included in pytesmo. How many datasets should be matched to the reference dataset at once can be configured, we will cover how to do this later.\n",
"\n",
"#### Data space reference\n",
"### Data space reference\n",
"It is often necessary to bring all the datasets into a common data space by using. Scaling is often used for that and pytesmo offers a choice of several scaling algorithms (e.g. CDF matching, min-max scaling, mean-std scaling, triple collocation based scaling). The data space reference can also be chosen independently from the other two references. \n",
"\n",
"### Data Flow\n",
"## Data Flow\n",
"\n",
"After it is initialized, the validation framework works through the following steps:\n",
"\n",
@@ -41,7 +41,7 @@
"6. Get the calculated metrics from the metric calculators\n",
"7. Put all the metrics into a dictionary by dataset combination and return them.\n",
"\n",
"### Masking datasets\n",
"## Masking datasets\n",
"Masking datasets can be used if the datasets that are compared do not contain the necessary information to mask them. For example we might want to use modelled soil temperature data to mask our soil moisture observations before comparing them. To be able to do that we just need a Dataset that returns a pandas.DataFrame with one column of boolean data type. Everywhere where the masking dataset is `True` the data will be masked.\n",
"\n",
"Let's look at a first example."
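Two hedged sketches follow for the notebook cells above. First, a minimal reader of the kind the assumptions describe (callable with a gpi or with lon/lat, returning a pandas.DataFrame, carrying a pygeogrids grid), plus a masking variant that returns a single boolean column. All class names and the toy grid are illustrative and not part of pytesmo:

```python
import numpy as np
import pandas as pd
from pygeogrids.grids import BasicGrid


class ToyReader(object):
    """Illustrative reader: read_ts works with a gpi or with lon/lat."""

    def __init__(self):
        # a tiny grid so that nearest neighbour lookups are possible
        lons = np.array([14.0, 15.0, 14.0, 15.0])
        lats = np.array([45.0, 45.0, 46.0, 46.0])
        self.grid = BasicGrid(lons, lats)

    def read_ts(self, *args):
        if len(args) == 1:
            gpi = args[0]            # read_ts(gpi)
        else:
            lon, lat = args          # read_ts(lon, lat)
            gpi, _ = self.grid.find_nearest_gpi(lon, lat)
        # a real reader would load the time series belonging to gpi
        index = pd.date_range("2007-01-01", periods=24, freq="H")
        return pd.DataFrame({"sm": np.random.rand(24)}, index=index)


class ToyMaskingReader(ToyReader):
    """Masking reader: one boolean column, True marks values to be masked."""

    def read_ts(self, *args):
        sm = super(ToyMaskingReader, self).read_ts(*args)
        return pd.DataFrame({"mask": sm["sm"] < 0.2}, index=sm.index)
```

Second, a generic illustration of the data space reference idea from the same cells: mean-std scaling rescales a comparison series so that its mean and standard deviation match the reference. This is plain pandas, not a call into pytesmo's scaling module:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ref = pd.Series(0.25 + 0.05 * rng.standard_normal(500))    # data space reference
other = pd.Series(0.40 + 0.10 * rng.standard_normal(500))  # differently scaled dataset

# bring `other` into the data space of `ref` by matching mean and standard deviation
scaled = (other - other.mean()) / other.std() * ref.std() + ref.mean()
print(round(float(scaled.mean()), 3), round(float(scaled.std()), 3))  # close to 0.25 / 0.05
```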
11 changes: 6 additions & 5 deletions docs/validation_framework.rst
@@ -22,7 +22,9 @@ the used datasets:
`pygeobase <https://github.com/TUW-GEO/pygeobase>`__ documentation
for more details on how a fully compatible dataset class should look.
But a simple ``read_ts`` method should do for the validation
framework.
framework. This assumption can be relaxed by using the
``read_ts_names`` keyword in the
pytesmo.validation\_framework.data\_manager.DataManager class.
- The ``read_ts`` method returns a pandas.DataFrame time series.
- Ideally the datasets classes also have a ``grid`` attribute that is a
`pygeogrids <http://pygeogrids.readthedocs.org/en/latest/>`__ grid.
@@ -37,7 +39,7 @@ comparison study it is often necessary to choose a spatial reference
grid, a temporal reference and a scaling or data space reference.

Spatial reference
^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~

The spatial reference is the one to which all the other datasets are
matched spatially. Often through nearest neighbor search. The validation
@@ -49,7 +51,7 @@ matching then a preprocessing of the data is the only option at the
moment.
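A toy illustration of that nearest neighbour matching with pygeogrids, which the reader assumptions above already reference; the grids here are made up and pytesmo builds such lookup tables internally:

```python
import numpy as np
from pygeogrids.grids import BasicGrid

# toy reference grid and a coarser "other" grid
ref_grid = BasicGrid(np.array([14.1, 14.9, 15.2]), np.array([45.1, 45.8, 46.2]))
other_grid = BasicGrid(np.array([14.0, 15.0]), np.array([45.0, 46.0]))

# nearest neighbour of a single location on the other grid
gpi, distance = other_grid.find_nearest_gpi(14.1, 45.1)
print(gpi, distance)

# lookup table mapping every reference gpi to its nearest other-grid gpi
print(ref_grid.calc_lut(other_grid))
```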

Temporal reference
^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~

The temporal reference is the dataset to which the other dataset are
temporally matched. That means that the nearest observation to the
@@ -60,7 +62,7 @@ reference dataset at once can be configured, we will cover how to do
this later.

Data space reference
^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~~

It is often necessary to bring all the datasets into a common data space.
Scaling is often used for that and pytesmo offers a choice of
@@ -190,7 +192,6 @@ framework can go through the jobs and read the correct time series.
2007-01-01 05:00:00 0.214 U M
Initialize the Validation class
-------------------------------

Empty file removed examples/__init__.py
33 changes: 22 additions & 11 deletions pytesmo/validation_framework/data_manager.py
@@ -31,6 +31,8 @@

import pandas as pd

from pygeobase.object_base import TS


class DataManager(object):

@@ -69,9 +71,10 @@ class DataManager(object):
period : list, optional
Of type [datetime start, datetime end]. If given then the two input
datasets will be truncated to start <= dates <= end.
read_ts_method_name: string, optional
read_ts_names: string or dict of strings, optional
if another method name than 'read_ts' should be used for reading the data
then it can be specified here.
then it can be specified here. If it is a dict then specify a
function name for each dataset.
Methods
-------
@@ -88,7 +91,7 @@ class DataManager(object):

def __init__(self, datasets, ref_name,
period=None,
read_ts_method_name='read_ts'):
read_ts_names='read_ts'):
"""
Initialize parameters.
"""
@@ -111,7 +114,13 @@ def __init__(self, datasets, ref_name,

self.period = period
self.luts = self.get_luts()
self.read_ts_method_name = read_ts_method_name
if type(read_ts_names) is dict:
self.read_ts_names = read_ts_names
else:
d = {}
for dataset in datasets:
d[dataset] = read_ts_names
self.read_ts_names = d

def _add_default_values(self):
"""
@@ -240,8 +249,10 @@ def read_ds(self, name, *args):
args.extend(ds['args'])

try:
func = getattr(ds['class'], self.read_ts_method_name)
func = getattr(ds['class'], self.read_ts_names[name])
data_df = func(*args, **ds['kwargs'])
if type(data_df) is TS:
data_df = data_df.data
except IOError:
warnings.warn(
"IOError while reading dataset {} with args {:}".format(name,
@@ -260,19 +271,19 @@ def read_ds(self, name, *args):
warnings.warn("No data for dataset {}".format(name))
return None

if len(data_df) == 0:
warnings.warn("No data for dataset {}".format(name))
return None

if not isinstance(data_df, pd.DataFrame):
warnings.warn("Data is not a DataFrame {:}".format(args))
return None

if self.period is not None:
data_df = data_df[self.period[0]:self.period[1]]
# here we use the isoformat since pandas slice behavior is
# different when using datetime objects.
data_df = data_df[
self.period[0].isoformat():self.period[1].isoformat()]

if len(data_df) == 0:
warnings.warn("No data for other dataset {:}".format(args))
warnings.warn("No data for dataset {} with arguments {:}".format(name,
args))
return None

else:
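A hedged usage sketch of the per-dataset reading method names handled above. The dummy readers and the layout of the `datasets` dictionary follow pytesmo's documented conventions and are not taken from this diff:

```python
import pandas as pd

from pytesmo.validation_framework.data_manager import DataManager


class ReadTsReader(object):
    """Dummy reader exposing the conventional read_ts method."""
    grid = None  # a real reader would ideally carry a pygeogrids grid

    def read_ts(self, *args):
        index = pd.date_range("2007-01-01", periods=48, freq="H")
        return pd.DataFrame({"sm": range(48)}, index=index)


class PlainReadReader(object):
    """Dummy reader whose reading method is simply called read."""
    grid = None

    def read(self, *args):
        index = pd.date_range("2007-01-01", periods=48, freq="H")
        return pd.DataFrame({"soil moisture": range(48)}, index=index)


datasets = {
    "ASCAT": {"class": ReadTsReader(), "columns": ["sm"]},
    "ISMN": {"class": PlainReadReader(), "columns": ["soil moisture"]},
}

# a single string keeps the old behaviour; a dict sets one method per dataset
dm = DataManager(datasets, "ISMN",
                 read_ts_names={"ASCAT": "read_ts", "ISMN": "read"})
print(dm.read_ds("ASCAT").head())
print(dm.read_ds("ISMN").head())

# the changelog notes that a pre-initialized DataManager can also be used with
# the Validation class; how it is passed there is not shown in this diff
```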
53 changes: 49 additions & 4 deletions pytesmo/validation_framework/metric_calculators.py
Expand Up @@ -43,14 +43,26 @@ class BasicMetrics(object):
This class just computes the basic metrics,
Pearson's R
Spearman's rho
optionally Kendall's tau
RMSD
BIAS
it also stores information about gpi, lat, lon
and number of observations
Parameters
----------
other_name: string, optional
Name of the column of the non-reference / other dataset in the
pandas DataFrame
calc_tau: boolean, optional
if True then also tau is calculated. This is set to False by default
since the calculation of Kendall's tau is rather slow and can significantly
impact performance of e.g. global validation studies
"""

def __init__(self, other_name='other'):
def __init__(self, other_name='k1',
calc_tau=False):

self.result_template = {'R': np.float32([np.nan]),
'p_R': np.float32([np.nan]),
@@ -66,6 +78,7 @@ def __init__(self, other_name='other'):
'lat': np.float64([np.nan])}

self.other_name = other_name
self.calc_tau = calc_tau

def calc_metrics(self, data, gpi_info):
"""
@@ -82,7 +95,7 @@ def calc_metrics(self, data, gpi_info):
Notes
-----
Kendall tau is not calculated at the moment
Kendall tau calculation is optional at the moment
because the scipy implementation is very slow which is problematic for
global comparisons
"""
@@ -99,16 +112,48 @@ def calc_metrics(self, data, gpi_info):
x, y = data['ref'].values, data[self.other_name].values
R, p_R = metrics.pearsonr(x, y)
rho, p_rho = metrics.spearmanr(x, y)
# tau, p_tau = metrics.kendalltau(x, y)
RMSD = metrics.rmsd(x, y)
BIAS = metrics.bias(x, y)

dataset['R'][0], dataset['p_R'][0] = R, p_R
dataset['rho'][0], dataset['p_rho'][0] = rho, p_rho
# dataset['tau'][0], dataset['p_tau'][0] = tau, p_tau
dataset['RMSD'][0] = RMSD
dataset['BIAS'][0] = BIAS

if self.calc_tau:
tau, p_tau = metrics.kendalltau(x, y)
dataset['tau'][0], dataset['p_tau'][0] = tau, p_tau

return dataset


class BasicMetricsPlusMSE(BasicMetrics):
"""
Basic Metrics plus Mean squared Error and the decomposition of the MSE
into correlation, bias and variance parts.
"""

def __init__(self, other_name='k1',
calc_tau=False):

super(BasicMetricsPlusMSE, self).__init__(other_name=other_name,
calc_tau=calc_tau)
self.result_template.update({'mse': np.float32([np.nan]),
'mse_corr': np.float32([np.nan]),
'mse_bias': np.float32([np.nan]),
'mse_var': np.float32([np.nan])})

def calc_metrics(self, data, gpi_info):
dataset = super(BasicMetricsPlusMSE, self).calc_metrics(data, gpi_info)
if len(data) < 10:
return dataset
x, y = data['ref'].values, data[self.other_name].values
mse, mse_corr, mse_bias, mse_var = metrics.mse(x, y)
dataset['mse'][0] = mse
dataset['mse_corr'][0] = mse_corr
dataset['mse_bias'][0] = mse_bias
dataset['mse_var'][0] = mse_var

return dataset


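A hedged usage sketch of the new `BasicMetricsPlusMSE` calculator on synthetic data. The `(gpi, lon, lat)` layout of `gpi_info` follows pytesmo's usual convention and is not visible in this diff:

```python
import numpy as np
import pandas as pd

from pytesmo.validation_framework.metric_calculators import BasicMetricsPlusMSE

# two correlated synthetic series; fewer than 10 samples would only return
# the NaN-filled result template (see the len(data) < 10 guard above)
np.random.seed(0)
ref = np.random.rand(100)
other = ref + 0.1 * np.random.rand(100)
data = pd.DataFrame({"ref": ref, "k1": other})

calc = BasicMetricsPlusMSE(other_name="k1")  # calc_tau defaults to False
result = calc.calc_metrics(data, gpi_info=(42, 16.37, 48.21))  # assumed (gpi, lon, lat)

print(result["R"], result["rho"], result["RMSD"], result["BIAS"])
print(result["mse"], result["mse_corr"], result["mse_bias"], result["mse_var"])
```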
