Clean up entity creation logic #336

Merged

rwedge merged 32 commits into master from entity-from-df-cleanup-1

Dec 7, 2018

Contributor

rwedge commented Dec 3, 2018

The idea behind this PR is to make it easier to find where in the code we handle data type and variable type conversion when creating a new entity. Currently it's handled in two places: EntitySet._import_from_dataframe and Entity.__init__.

This PR moves the relevant logic from EntitySet._import_from_dataframe into Entity.__init__ and removes a couple of statements that try to infer variable types, since Entity.infer_variable_types can handle those cases instead.

rwedge added 6 commits

December 3, 2018 14:38


          remove unnecessary categorical infer

3a66a7f


          move column name check into Entity.__init__

49bb132


          move time index check into Entity.__init__

6420ed7


          run convert_all_variable_data once

547bfe7


          move make index logic to Entity.__init__

4b73cac


          don't set index vtype in _import_from_dataframe

5767d9f

Codecov bot commented Dec 3, 2018 (edited)

Codecov Report

Merging #336 into master will increase coverage by 0.05%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #336      +/-   ##
==========================================
+ Coverage   95.28%   95.34%   +0.05%     
==========================================
  Files          74       74              
  Lines        7783     7792       +9     
==========================================
+ Hits         7416     7429      +13     
+ Misses        367      363       -4

| Impacted Files | Coverage Δ | |
|---|---|---|
| featuretools/tests/testing_utils/mock_ds.py | `87.4% <ø> (ø)` | ⬆️ |
| featuretools/entityset/entity.py | `95.65% <100%> (+1.71%)` | ⬆️ |
| featuretools/entityset/entityset.py | `95.73% <100%> (-0.23%)` | ⬇️ |
| featuretools/tests/entityset_tests/test_es.py | `99.22% <100%> (+0.01%)` | ⬆️ |
| featuretools/tests/entityset_tests/test_entity.py | `100% <100%> (ø)` | ⬆️ |
| featuretools/utils/gen_utils.py | `85.71% <0%> (-2.39%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9fae88b...e356a72.

kmax12 reviewed

featuretools/entityset/entity.py Outdated

@@ -93,9 +116,8 @@ def __init__(self, id, df, entityset, variable_types=None,
                       inferred_variable_types = self.infer_variable_types(ignore=list(variable_types.keys()),
                                                                           link_vars=link_vars)
                       for var_id, desired_type in variable_types.items():

Contributor

kmax12 Dec 4, 2018

can simplify loop to inferred_variable_types.update(variable_types)
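A minimal sketch of the suggested simplification, assuming the loop body only assigns each declared type into the inferred mapping and that both mappings are plain dicts keyed by variable id. The type names below are placeholder strings, not the real featuretools variable classes.

```python
# Placeholder values; the real code maps variable ids to variable type classes.
inferred_variable_types = {"id": "Index", "amount": "Numeric"}
variable_types = {"amount": "Categorical", "zip": "ZIPCode"}

# Explicit loop, in the shape shown in the diff under review.
merged_loop = dict(inferred_variable_types)
for var_id, desired_type in variable_types.items():
    merged_loop[var_id] = desired_type

# Equivalent single call: declared types override inferred ones.
merged_update = dict(inferred_variable_types)
merged_update.update(variable_types)

assert merged_loop == merged_update
```

Both forms give the declared types precedence, so the single `update` call should be a drop-in replacement under that assumption.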

rwedge added 3 commits

December 4, 2018 11:01


          simplify combining inferred and declared vtypes

591c18e


          add create_index function

473f0ea


          move variable creation logic from Entity.__init__ to separate method

dd5ec81

kmax12 suggested changes

featuretools/entityset/entity.py Outdated

+                          make_index (bool, optional) : If True, assume index does not exist as a column in
+                              dataframe, and create a new column of that name using integers the (0, len(dataframe)).
+                              Otherwise, assume index exists in dataframe.
                       """
                       assert is_string(id), "Entity id must be a string"

Contributor

kmax12 Dec 5, 2018

maybe add a function from line 69 - 76 called _validate_entity_params(id, time_index, columns)

featuretools/entityset/entity.py Outdated

@@ -669,3 +681,21 @@ def col_is_datetime(col):
                       else:
                           return True
                   return False
+              def create_index(index, make_index, df):

Contributor

kmax12 Dec 5, 2018

add short comment describing function

something like "handles index creation logic"

Contributor Author

rwedge Dec 5, 2018

Maybe this should also be private

Contributor

kmax12 Dec 5, 2018

yep

featuretools/entityset/entity.py Outdated

+                      index = df.columns[0]
+                  elif make_index and index in df.columns:
+                      raise RuntimeError("Cannot make index: index variable already present")
+                  elif make_index or index not in df.columns:

Contributor

kmax12 Dec 5, 2018

i think this last elif block can be simplified to

elif index not in df.columns:
    if not make_index:
        logger.warning("index %s not found in dataframe, creating new "
                       "integer column", index)
    df.insert(0, index, range(0, len(df)))
    created_index = index

at this point if index is in df.columns then make_index must be false and we should skip this block. this is because we handle the case where make_index and index in df.columns is True above

Contributor

kmax12 Dec 5, 2018

if my logic is correct might be helpful to leave a comment about this entire if/elif block so people can quickly understand

featuretools/entityset/entityset.py (resolved conversation)

featuretools/entityset/entity.py Outdated

+                      created_index, index, df = create_index(index, make_index, df)
+                      if index not in variable_types:
+                          variable_types[index] = vtypes.Index
                       self.data = {"df": df,

Contributor

kmax12 Dec 5, 2018

can we avoid setting the data like this? perhaps by just making df an argument to some of the other functions?

it feels wrong to set the data here and then call self.update_data(...) a second time at the end

Contributor Author

rwedge Dec 5, 2018

should we just call the relevant parts of self.update_data separately and skip calling update_data? It already expects the indexes to be set, and it's a little strange to be "updating" the dataframe on initialization

Contributor

kmax12 Dec 5, 2018

ya, i agree with that approach

featuretools/entityset/entity.py Outdated

-                      self.convert_all_variable_data(inferred_variable_types)
+                      variable_types = variable_types or {}
+                      secondary_time_index = secondary_time_index or {}
+                      self.create_variables(variable_types, index, time_index, secondary_time_index)

Contributor

kmax12 Dec 5, 2018

i think this should be self._create_variables since a user should never need call it, right?

featuretools/entityset/entity.py Outdated

-                      self.convert_all_variable_data(inferred_variable_types)
+                      variable_types = variable_types or {}
+                      secondary_time_index = secondary_time_index or {}
+                      self.create_variables(variable_types, index, time_index, secondary_time_index)

Contributor

kmax12 Dec 5, 2018

i think secondary_time_index is getting modified by reference within this function (actually when it gets passed to infer_variable_types). if that's true, i'd say let's avoid that or at least leave a comment.

Contributor Author

rwedge Dec 5, 2018

This code here, right?

secondary_time_index = secondary_time_index or {}
for ti, cols in secondary_time_index.items():
    if ti not in cols:
        cols.append(ti)

The first line should be thrown out and maybe the rest should go in _validate_entity_params with either a warning or an error if the secondary time index column isn't included.
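To illustrate the concern about modification by reference, here is a small sketch contrasting the quoted loop with a non-mutating alternative. Both function names and the sample column names are hypothetical, chosen only for this illustration.

```python
def add_time_index_in_place(secondary_time_index):
    """Mirrors the quoted loop: appends to the caller's own lists."""
    for ti, cols in secondary_time_index.items():
        if ti not in cols:
            cols.append(ti)
    return secondary_time_index


def add_time_index_copy(secondary_time_index):
    """Builds new lists so the caller's argument is left untouched."""
    return {
        ti: list(cols) + ([] if ti in cols else [ti])
        for ti, cols in secondary_time_index.items()
    }


user_arg = {"cancel_date": ["cancel_reason"]}

# The copying variant leaves the caller's dict as it was.
copied = add_time_index_copy(user_arg)
unchanged_after_copy = user_arg == {"cancel_date": ["cancel_reason"]}

# The in-place variant silently changes what the caller passed in.
add_time_index_in_place(user_arg)
changed_after_in_place = user_arg == {"cancel_date": ["cancel_reason", "cancel_date"]}
```

Raising an error or warning when the time index column is missing, as proposed above, would avoid the silent mutation entirely.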

rwedge added 10 commits

December 5, 2018 11:45


          make create_variables a private method

dd4b553


          create _valide_entity_params function

2c8bd43


          make create_index private method

969fba5


          move index variable_type setting into _create_variables

398f9ae


          replace update_data call in in Entity.__init__

8e9b9d4


          move second_time_index column checks into set_secondary_time_index

920abc2


          simplify loop in _handle_time

89da038


          used is_categorical_dtype check to infer categorical

3309d9c


          simplify _create_index logic and add comments

2a81eb4


          reorder assignments in entity init

d2d236a

kmax12 reviewed

featuretools/entityset/entity.py Outdated

+                      self.variables = [index_variable] + [v for v in variables
+                                                           if v.id != index]
+                  def infer_variable_types(self, variable_types, time_index, secondary_time_index):
                       """Extracts the variables from a dataframe

Contributor

kmax12 Dec 5, 2018

update doc string

rwedge added 3 commits

December 5, 2018 16:48


          set_time_index and set_seconary_time_index won't handle None

14794a6


          remove unused defaults


          remove unused attribute Entity.encoding

6c3047c

kmax12 reviewed

featuretools/entityset/entity.py (resolved conversation)

rwedge added 3 commits

December 5, 2018 19:25


          remove references to empty default variables

401e56c


          simplify time index init logic

d109043


          entity docstring updates

a40436e

rwedge added 6 commits

December 6, 2018 11:37


          add tests for setting bad time indexes

deb0890


          remove Entity(Set) attributes from api_ref

a0a2af3


          move variable_types is None handling into _create_variables

e2ceb05


          add test for to check for correct sorting during update_data if entity has no time index

463599f


          redo already_sorted tests to include more cases

96ff570


          linting

a0860da

kmax12 changed the title ~~[WIP] Clean up entity creation logic~~ Clean up entity creation logic

kmax12 approved these changes

Contributor

kmax12 commented Dec 7, 2018

Looks good to merge


          Merge branch 'master' into entity-from-df-cleanup-1

e356a72

rwedge merged commit d367818 into master

rwedge deleted the entity-from-df-cleanup-1 branch

December 7, 2018 16:48

kmax12 mentioned this pull request

Provide better error message when no index is provided and index cannot be inferred #274

Closed

georgewambold mentioned this pull request

Duplicate names in normalize_entity's additional/copy_variables throws non-obvious error #347

Closed

rwedge mentioned this pull request

Merged

Labels

None yet