Serialize features as JSON #532

Merged
merged 14 commits into from
May 8, 2019

Conversation

CJStadler
Contributor

@CJStadler CJStadler commented May 6, 2019

This changes the implementation of save_features and load_features to
use JSON instead of pickling. This will give us greater control over
version changes, and also makes it easier to inspect the saved data.

At the top level the JSON object has the following keys:

  • schema_version: the version of the features schema. During
    deserialization, an error is raised if the saved version is greater
    than the current version.
  • ft_version: the version of featuretools.
  • entityset: the entityset metadata (all features must be associated
    with the same entityset).
  • feature_list: an array of the names of the features.
  • feature_definitions: an object where the keys are feature names and
    the values are objects with the data for the corresponding feature.
    This may include features which were not included in the given list
    but are dependencies of the features in the list.
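
As a rough illustration of the layout above, the sketch below builds a small document with the five top-level keys and applies the version rule. The key names come from the description; the feature names, the placeholder version strings, the `SCHEMA_VERSION` constant, and the `parse_version` and `check_schema_version` helpers are hypothetical, and comparing dotted versions as integer tuples is an assumption about how "greater than" might be evaluated.

```python
import json

# Hypothetical value for illustration; the real constant lives in featuretools.
SCHEMA_VERSION = "1.0.0"


def parse_version(version):
    """Turn a dotted version string such as "1.0.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def check_schema_version(saved_version, current_version=SCHEMA_VERSION):
    """Raise if the saved schema is newer than the one this code understands."""
    if parse_version(saved_version) > parse_version(current_version):
        raise RuntimeError(
            "Unable to load features: saved schema version %s is greater than "
            "the current schema version %s" % (saved_version, current_version)
        )


document = {
    "schema_version": SCHEMA_VERSION,
    "ft_version": "0.0.0",  # placeholder, not a real release number
    "entityset": {},  # entityset metadata elided in this sketch
    "feature_list": ["customers: COUNT(sessions)"],
    "feature_definitions": {
        "customers: COUNT(sessions)": {
            "type": "AggregationFeature",
            "dependencies": ["sessions: id"],
            "arguments": {},  # elided
        },
        # A dependency that was not in feature_list but is still stored.
        "sessions: id": {
            "type": "IdentityFeature",
            "dependencies": [],
            "arguments": {},  # elided
        },
    },
}

serialized = json.dumps(document, indent=2)
loaded = json.loads(serialized)
check_schema_version(loaded["schema_version"])
```

Because the output is plain JSON, the saved file can be opened in any editor or parsed with standard tools, which is the inspection benefit mentioned at the top of this description.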

Each feature object has the following keys:

  • type: the name of the class of this feature.
  • dependencies: a list of the names of features which this feature
    depends on.
  • arguments: an object storing the data necessary to construct this
    feature. This is generated by feature.get_arguments during
    serialization, and passed to feature_class.from_dictionary during
    deserialization to construct the new feature.
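
The round trip for a single feature could look roughly like the sketch below. `ToyFeature`, `serialize`, and `deserialize` are hypothetical stand-ins, and the two-argument `from_dictionary` signature is an assumption made for brevity; only the three keys and the `get_arguments` / `from_dictionary` pairing come from the description.

```python
class ToyFeature:
    """Stand-in for a feature class; not the featuretools implementation."""

    def __init__(self, name, base_features=()):
        self.name = name
        self.base_features = list(base_features)

    def get_arguments(self):
        # Only JSON-friendly data: dependencies are referred to by name.
        return {
            "name": self.name,
            "base_features": [f.name for f in self.base_features],
        }

    @classmethod
    def from_dictionary(cls, arguments, dependencies):
        # `dependencies` maps names to features that were already rebuilt.
        base = [dependencies[n] for n in arguments["base_features"]]
        return cls(arguments["name"], base)


def serialize(feature):
    """Build the per-feature object with type, dependencies, and arguments."""
    return {
        "type": type(feature).__name__,
        "dependencies": [f.name for f in feature.base_features],
        "arguments": feature.get_arguments(),
    }


def deserialize(name, definitions, classes, built=None):
    """Rebuild a feature by name, rebuilding its dependencies first."""
    built = {} if built is None else built
    if name not in built:
        definition = definitions[name]
        for dependency in definition["dependencies"]:
            deserialize(dependency, definitions, classes, built)
        feature_class = classes[definition["type"]]
        built[name] = feature_class.from_dictionary(definition["arguments"], built)
    return built[name]


base = ToyFeature("a")
derived = ToyFeature("b", [base])
definitions = {f.name: serialize(f) for f in (base, derived)}
restored = deserialize("b", definitions, {"ToyFeature": ToyFeature})
```

Storing dependencies by name keeps each feature object flat, so a feature shared by several others is written once in `feature_definitions`.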

Primitives may be included as arguments of features, and have the
following keys:

  • type: the name of the primitive class.
  • module: the name of the module the primitive class is defined in.
  • arguments: an object storing the data necessary to construct this
    primitive. This is built by reflecting on the signature of the
    primitive's constructor and for every parameter getting the attribute
    with the same name.
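
A minimal sketch of the reflection step, assuming every constructor parameter is stored on the instance under the same name. `NMostCommonToy` and `serialize_primitive` are hypothetical names; `inspect.signature` is the standard-library call, and the PR itself may have used a different mechanism.

```python
import inspect


class NMostCommonToy:
    """Toy primitive whose constructor parameter is kept as an attribute."""

    def __init__(self, n=3):
        self.n = n


def serialize_primitive(primitive):
    """Describe a primitive by class name, module, and constructor arguments."""
    cls = type(primitive)
    # inspect.signature on a class reports the __init__ parameters without self.
    parameters = inspect.signature(cls).parameters
    arguments = {name: getattr(primitive, name) for name in parameters}
    return {
        "type": cls.__name__,
        "module": cls.__module__,
        "arguments": arguments,
    }


serialized = serialize_primitive(NMostCommonToy(n=5))
```

This approach relies on a naming convention: a primitive that renames or discards a constructor argument would fail at the `getattr` call in this sketch.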

Since primitive classes may come from different modules or be defined
dynamically, they are found by searching the descendants of
PrimitiveBase.
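
One way to sketch the descendant search is a recursive walk over `__subclasses__()`. The `PrimitiveBase` below is a local stand-in for the featuretools class, and `descendants` and `find_primitive_class` are hypothetical helper names.

```python
class PrimitiveBase:
    """Local stand-in for the featuretools base class."""


def descendants(cls):
    """Yield every direct and indirect subclass of cls."""
    for subclass in cls.__subclasses__():
        yield subclass
        yield from descendants(subclass)


def find_primitive_class(class_name, module, base=PrimitiveBase):
    """Return the descendant of base matching both class name and module."""
    for cls in descendants(base):
        if cls.__name__ == class_name and cls.__module__ == module:
            return cls
    raise RuntimeError('Primitive "%s" in module "%s" not found' %
                       (class_name, module))
```

Matching on both name and module avoids picking the wrong class when two modules define primitives with the same name. Dynamically defined primitives are found as well, since any class that inherits from the base appears in `__subclasses__()` once it has been created.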

Other changes:

  • Move featuretools.__version__ to its own module. This allows it to be
    imported into other featuretools modules without creating a circular
    dependency.
  • Add EntitySet.to_dictionary() to get the entity set metadata as a
    dictionary.
  • Add feature.unique_name(), which includes the entity id. get_name()
    may not be unique in a list of features because features on different
    entities could have the same name.
  • Add Timedelta.get_arguments() and Timedelta.from_dictionary().
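
To illustrate why `unique_name()` is needed, the sketch below keys a dictionary by feature name. `ToyFeature` is a hypothetical stand-in, and the exact string format (entity id, a colon, then the name) is an assumption rather than something stated in this description.

```python
class ToyFeature:
    """Stand-in feature with just an entity id and a name."""

    def __init__(self, entity_id, name):
        self.entity_id = entity_id
        self.name = name

    def get_name(self):
        return self.name

    def unique_name(self):
        # Assumed format: prefix the name with the entity id.
        return "%s: %s" % (self.entity_id, self.get_name())


features = [ToyFeature("customers", "id"), ToyFeature("sessions", "id")]

# Keyed by get_name(), the second feature overwrites the first.
by_name = {f.get_name(): f for f in features}
# Keyed by unique_name(), both features are kept.
by_unique_name = {f.unique_name(): f for f in features}
```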

Resolves #471

@codecov

codecov bot commented May 6, 2019

Codecov Report

Merging #532 into master will increase coverage by 0.12%.
The diff coverage is 99.73%.


@@            Coverage Diff             @@
##           master     #532      +/-   ##
==========================================
+ Coverage    96.1%   96.23%   +0.12%     
==========================================
  Files         108      114       +6     
  Lines        8915     9245     +330     
==========================================
+ Hits         8568     8897     +329     
- Misses        347      348       +1
Impacted Files Coverage Δ
featuretools/utils/api.py 100% <ø> (ø) ⬆️
featuretools/feature_base/api.py 100% <100%> (ø) ⬆️
featuretools/entityset/entityset.py 95.07% <100%> (+0.02%) ⬆️
.../tests/primitive_tests/test_features_serializer.py 100% <100%> (ø)
featuretools/feature_base/feature_base.py 96.95% <100%> (+0.48%) ⬆️
featuretools/primitives/base/primitive_base.py 100% <100%> (ø) ⬆️
...s/tests/primitive_tests/test_transform_features.py 98.12% <100%> (+0.03%) ⬆️
...ools/tests/primitive_tests/test_direct_features.py 100% <100%> (ø) ⬆️
featuretools/primitives/utils.py 97.36% <100%> (+0.93%) ⬆️
...ests/primitive_tests/test_feature_serialization.py 100% <100%> (ø) ⬆️
... and 19 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0739b88...107213a.

In older Python versions, some lists in the dictionaries do not always
have the same order. Instead of comparing the dictionaries directly,
convert them back to entitysets and compare those.
@CJStadler
Contributor Author

@rwedge this is ready for review. Thanks!

@rwedge rwedge self-requested a review May 6, 2019 20:42
featuretools/feature_base/feature_base.py Outdated
featuretools/feature_base/feature_base.py Outdated
featuretools/feature_base/features_deserializer.py Outdated
featuretools/feature_base/features_deserializer.py Outdated
featuretools/feature_base/features_deserializer.py Outdated
featuretools/__init__.py
@CJStadler CJStadler requested a review from rwedge May 7, 2019 20:53
    raise RuntimeError('Primitive "%s" in module "%s" not found' %
                       (class_name, module))
if class_cache:
    class_cache[cache_key] = cls
Contributor

@rwedge rwedge May 8, 2019

This only caches the primitive class found by _find_class_in_descendants, but I think we can cache any primitive class examined by _find_class_in_descendants once it has looked at all of its subclasses.

@rwedge
Contributor

rwedge commented May 8, 2019

Current untested scenarios:

  • Trying to deserialize an unknown feature class
  • Trying to deserialize an unknown primitive class

Added PrimitivesDeserializer, wrapping a cache and a generator which
iterates over all primitive classes. When deserializing a primitive, if
it is not in the cache then we iterate until it is found, adding every
seen class to the cache. When deserializing the next primitive, the
iteration resumes where it left off. This means that we never visit a
class more than once.

A PrimitivesDeserializer is initialized in FeaturesDeserializer and then
passed to every `Feature.from_dictionary` call.
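
A minimal sketch of the cache-plus-generator idea described in this commit message. `PrimitivesDeserializerSketch`, its `find_class` method, and the `visited` counter are hypothetical, and `PrimitiveBase` is a local stand-in; the real class may differ in interface and detail.

```python
class PrimitiveBase:
    """Local stand-in for the featuretools base class."""


def _descendants(cls):
    """Lazily yield every direct and indirect subclass of cls."""
    for subclass in cls.__subclasses__():
        yield subclass
        yield from _descendants(subclass)


class PrimitivesDeserializerSketch:
    def __init__(self, base=PrimitiveBase):
        self.class_cache = {}  # (class name, module) -> class
        self.primitive_classes = _descendants(base)  # shared, resumable
        self.visited = 0  # counts classes pulled from the generator

    def find_class(self, class_name, module):
        key = (class_name, module)
        if key in self.class_cache:
            return self.class_cache[key]
        # Looping over the stored generator continues from where the
        # previous lookup stopped instead of starting over.
        for cls in self.primitive_classes:
            self.visited += 1
            cls_key = (cls.__name__, cls.__module__)
            self.class_cache[cls_key] = cls  # cache every class seen
            if cls_key == key:
                return cls
        raise RuntimeError('Primitive "%s" in module "%s" not found' %
                           (class_name, module))
```

Because every class seen during a search is cached and the generator is never restarted, the total number of classes visited across all lookups is bounded by the number of descendants. One consequence in this sketch is that a class defined after the generator has been exhausted would not be found by an existing instance.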
@CJStadler CJStadler merged commit 7c1c1a9 into master May 8, 2019
@CJStadler CJStadler deleted the features-json branch May 8, 2019 19:21
@rwedge rwedge mentioned this pull request May 17, 2019
Development

Successfully merging this pull request may close these issues.

Remove save_obj_pickle and load_pickle functions
3 participants