Add a `require_data()` method #715

danielhuppmann · 2022-12-07T05:26:21Z

Please confirm that this PR has done the following:

Tests Added
Documentation Added
~~Name of contributors Added to AUTHORS.rst~~
Description in RELEASE_NOTES.md Added

Description of PR

This PR adds a require_data() method as a generalized approach to require_variable(). The new method returns none if all scenarios have all (combinations) of required values on the dimensions, or it returns a DataFrame of model-scenarios where not all required values are present.

The PR also adds a (data) "coordinates" attribute to quickly access the data columns that are not in the index.

Question

The existing validation methods like require_variable() are very verbose. Here, I have not added any logging output. Any preferences?

codecov · 2022-12-07T05:43:56Z

Codecov Report

Merging #715 (ff97338) into main (df1df1c) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head ff97338 differs from pull request most recent head c1e01b9. Consider uploading reports for the commit c1e01b9 to get more accurate results

@@          Coverage Diff          @@
##            main    #715   +/-   ##
=====================================
  Coverage   94.9%   94.9%           
=====================================
  Files         58      58           
  Lines       5914    5952   +38     
=====================================
+ Hits        5614    5654   +40     
+ Misses       300     298    -2

Impacted Files	Coverage Δ
pyam/core.py	`95.2% <100.0%> (+0.3%)`	⬆️
tests/test_feature_validation.py	`100.0% <100.0%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

pyam/core.py

phackstock · 2022-12-07T13:56:22Z

I think the return values should contain all identifiers to missing data points, not just model and scenario as it could otherwise lead to ambiguities.

As an example these data:

Model	Scenario	Region	Variable	Unit	2020	2025
model_a	scenario_a	region_a	variable_a	unit_a	1
model_a	scenario_a	region_b	variable_a	unit_a	1	2

with a required data specification:

region: [region_a, region_b]
variable: variable_a
year: [2020, 2025]

would currently return model_a and scenario_a. That does not tell you the whole story though as for the first line in the data a value for 2025 is missing while for the second line, which is also model_a and scenario_a, everything is in order.

phackstock

This is mostly the continuation of my previous comment.

phackstock · 2022-12-07T14:21:46Z

pyam/core.py

+
+        # identify scenarios that have some but not all required values
+        columns = [i for i in self.coordinates if i not in required]
+        rows = rows.droplevel(level=columns).drop_duplicates()


In keeping with the theme of my comment of returning all identifiers of the missing values I don't think we should drop the relevant columns.

This would make the validation later len(df) != n impossible, because there may be multiple entries along a not-required dimension.

phackstock · 2022-12-07T14:25:23Z

pyam/core.py

+        )
+
+        index_incomplete = pd.DataFrame(
+            [idx for idx, df in data if len(df) != n], columns=self.index.names


This is purely a clarifying question on my side, why is the if len(df) !=n needed? As I understand it n is the total number of combinations of required data fields, right? So for example two variables for two regions for three years would be 2x2x3=12?

If the number of required data (n, computed as the product of the number of values per dimension) does not equal to the number of existing data len(df), then some of the required data is missing.

Does that assumption not fall flat as soon as you provide multiple values for unit? Maybe we should restrict the unit attribute to a single one then.

That is why I remove the columns that are not required and then drop duplicates...

(And this issue is not restricted the units, you have the same problem when you only require variables and you have multiple regions).

pyam/core.py

danielhuppmann · 2022-12-07T14:51:51Z

I think the return values should contain all identifiers to missing data points, not just model and scenario as it could otherwise lead to ambiguities.

It's not quite straightforward to develop a useful interface (or object) for this "complete" return-object.

Riffing off the example above: Should the returned dataframe contain a column "unit"? If yes, the value should probably be None for all columns... If no, the columns of the returned object depend on the arguments, making automated processing difficult.

Second, I see a problem that computing the missing index will be quite a lot of processing, only to be thrown away later anyway.

Suggestion: we leave as is for now (or even simplify just return bool) and add an argument "verbose" later?

phackstock · 2022-12-07T15:01:49Z

Suggestion: we leave as is for now (or even simplify just return bool) and add an argument "verbose" later?

My preference would be the latter. Just keeping it as a kind of validator of, "is everything that we require there?". The question of "what is missing?" could almost be a separate one.

phackstock · 2022-12-07T18:34:20Z

I had an idea of a different, maybe more simple, approach that might be worth exploring.
It could be an option to simply construct a list of all the combinations of required data input and then loop over them one by one checking if there is data there. During that you would save what's missing.
Something like:

def require_data(self, region=None, variable=None, unit=None, year=None, exclude_on_fail=False):
    missing_datapoints = {}
    # combine "everything with everything"
    for datapoint in itertools.product([region, variable, unit, year]):
        # iterate over all model, scenario combinations
        for model, scenario in self.index:
            # obviously this won't work exactly like this but just to illustrate the idea
            if datapoint not in self[model, scenario]:
                missing_datapoints[(model, scenario)] = datapoint
    if missing_datapoints:
        return missing_datapoints

danielhuppmann · 2022-12-08T06:53:50Z

I had an idea of a different, maybe more simple, approach that might be worth exploring.

Yes, that's the (only) way to go if we want to have a complete list of all missing (combinations of) required items.

But if the use case (for now) is just to know whether all required items exist and (optionally) exclude those scenarios that do meet the requirement from further processing, checking the length of an existing index is much more efficient than creating a second index and comparing the two indices line-by-line.

phackstock · 2022-12-09T12:01:02Z

Yes, true. I was already thinking towards providing user feedback to point exactly what data points are missing but that’s probably better handled in a different function if we really need that level of precision.

phackstock

Looks good to me, ready to be merged.

danielhuppmann added 4 commits December 7, 2022 05:50

Add (data) coordinates attribute

319c67c

Add passing test for require_data

5ea6509

Add implementation for require_data

4f6dea7

Add test for tests when required data is not present

f030d57

danielhuppmann assigned danielhuppmann and phackstock Dec 7, 2022

danielhuppmann added 2 commits December 7, 2022 06:27

Add to release notes

7e8936e

Remove work-in-progress import

d5465dd

danielhuppmann unassigned phackstock Dec 7, 2022

danielhuppmann requested a review from phackstock December 7, 2022 05:34

phackstock reviewed Dec 7, 2022

View reviewed changes

pyam/core.py Show resolved Hide resolved

phackstock reviewed Dec 7, 2022

View reviewed changes

danielhuppmann added 2 commits December 12, 2022 11:33

Merge branch 'main' into feature/require

ff97338

Improve the docstring

c1e01b9

danielhuppmann marked this pull request as ready for review December 12, 2022 11:19

phackstock approved these changes Dec 12, 2022

View reviewed changes

danielhuppmann merged commit e0534d9 into IAMconsortium:main Dec 12, 2022

danielhuppmann deleted the feature/require branch December 12, 2022 17:23

danielhuppmann mentioned this pull request Jan 2, 2023

Filtering out outliers in data #629

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a `require_data()` method #715

Add a `require_data()` method #715

danielhuppmann commented Dec 7, 2022 •

edited

codecov bot commented Dec 7, 2022 •

edited

phackstock commented Dec 7, 2022

phackstock left a comment

phackstock Dec 7, 2022

danielhuppmann Dec 7, 2022

phackstock Dec 7, 2022

danielhuppmann Dec 7, 2022

phackstock Dec 7, 2022

danielhuppmann Dec 7, 2022

danielhuppmann commented Dec 7, 2022

phackstock commented Dec 7, 2022

phackstock commented Dec 7, 2022 •

edited

danielhuppmann commented Dec 8, 2022

phackstock commented Dec 9, 2022

phackstock left a comment

Add a require_data() method #715

Add a require_data() method #715

Conversation

danielhuppmann commented Dec 7, 2022 • edited

Please confirm that this PR has done the following:

Description of PR

Question

codecov bot commented Dec 7, 2022 • edited

Codecov Report

phackstock commented Dec 7, 2022

phackstock left a comment

Choose a reason for hiding this comment

phackstock Dec 7, 2022

Choose a reason for hiding this comment

danielhuppmann Dec 7, 2022

Choose a reason for hiding this comment

phackstock Dec 7, 2022

Choose a reason for hiding this comment

danielhuppmann Dec 7, 2022

Choose a reason for hiding this comment

phackstock Dec 7, 2022

Choose a reason for hiding this comment

danielhuppmann Dec 7, 2022

Choose a reason for hiding this comment

danielhuppmann commented Dec 7, 2022

phackstock commented Dec 7, 2022

phackstock commented Dec 7, 2022 • edited

danielhuppmann commented Dec 8, 2022

phackstock commented Dec 9, 2022

phackstock left a comment

Choose a reason for hiding this comment

Add a `require_data()` method #715

Add a `require_data()` method #715

danielhuppmann commented Dec 7, 2022 •

edited

codecov bot commented Dec 7, 2022 •

edited

phackstock commented Dec 7, 2022 •

edited