Groupby dfs #472

Merged
merged 57 commits into from Mar 28, 2019
Changes from 53 commits
Commits
87194f2
add GroupByTransformPrimitive
rwedge Feb 25, 2019
862590c
update cumulative primitives to inherit from GBTP
rwedge Feb 25, 2019
6948485
add GroupByTransformFeature
rwedge Feb 25, 2019
9f0a2ff
add GBTF to uses_full_entity logic
rwedge Feb 25, 2019
55b6b16
add groupby handler to pandas backend
rwedge Feb 25, 2019
6398a78
move cumulative tests to new test_groupby_primitives file
rwedge Feb 25, 2019
3748402
Merge branch 'master' into grouby-transform-feature
rwedge Feb 26, 2019
09b84fc
remove GroupByTransformPrimitive
rwedge Feb 26, 2019
4a2d6b0
add groupby_primitives argument to DFS
rwedge Feb 26, 2019
9254060
rework groupby logic
rwedge Feb 26, 2019
e3a77d3
update test cases for cumulative primitives
rwedge Feb 26, 2019
bde94de
handled 'nan' group of groupby
rwedge Feb 27, 2019
2837c45
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Feb 27, 2019
f68e9d5
fix input_types expansion bug
rwedge Feb 27, 2019
290e56e
add groupby_primitives to dfs
rwedge Feb 27, 2019
0bf87a0
add deep feature synthesis tests for groupby_primitives
rwedge Feb 27, 2019
90a8b2e
document how we search for columns to group by
rwedge Feb 28, 2019
efba737
simplify check_trans_primitive
rwedge Feb 28, 2019
8dc9b62
redo groupby_feature.get_name
rwedge Mar 6, 2019
f4f7743
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 6, 2019
c7e808b
update groupby feature name tests
rwedge Mar 6, 2019
f89167d
linting
rwedge Mar 6, 2019
540c301
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 6, 2019
f17e6e6
make GBTF a subclass of TransformFeature
rwedge Mar 6, 2019
607ccb2
change groupby restriction from Id to Discrete
rwedge Mar 6, 2019
927e62d
test categorical direct feature as groupby in GBTF
rwedge Mar 6, 2019
67f1fcf
test GBTF.copy
rwedge Mar 6, 2019
fe5400f
test groupby with empty data
rwedge Mar 6, 2019
d8db338
test uses_calc_time with GBTF
rwedge Mar 7, 2019
e55fc51
linting
rwedge Mar 7, 2019
a866d73
Merge branch 'master' into grouby-transform-feature
rwedge Mar 15, 2019
9dd526c
rename test file
rwedge Mar 15, 2019
06dac4c
have feature tree separate features by groupby
rwedge Mar 15, 2019
6a9b8d5
check exact class instead of allowing subclasses in feature handlers
rwedge Mar 15, 2019
d70dc6d
add groupby to base_features earlier
rwedge Mar 15, 2019
0d11650
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 18, 2019
e8c9492
convert base features of GBTF to list in _build_transform_features
rwedge Mar 18, 2019
84bc534
Merge branch 'master' into grouby-transform-feature
rwedge Mar 18, 2019
95d7f3e
change implementation of cumulative count
rwedge Mar 22, 2019
03f1a41
add comments about time where we exclude the groupby feature when usi…
rwedge Mar 22, 2019
a3ee88f
reassign index when primitive function returns series
rwedge Mar 22, 2019
d9af1bf
let pandas fill in null values for instances without a group
rwedge Mar 22, 2019
403ad54
Merge branch 'master' into grouby-transform-feature
rwedge Mar 22, 2019
2a41be1
linting
rwedge Mar 22, 2019
21625ac
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 22, 2019
e03f1cf
fix unicode error in GBTF generate_name function
rwedge Mar 22, 2019
069f0aa
test passing dfs bad groupby primitive
rwedge Mar 25, 2019
839ffdb
Merge branch 'master' into groupby-dfs
rwedge Mar 25, 2019
3a3f79d
linting
rwedge Mar 25, 2019
640d5f4
document changes to cumulative features in the changelog
rwedge Mar 26, 2019
12845d7
Merge branch 'master' into groupby-dfs
rwedge Mar 26, 2019
3138502
rename groupby_primitives to groupby_transform_primitives
rwedge Mar 27, 2019
122551a
Merge branch 'groupby-dfs' of github.com:Featuretools/featuretools in…
rwedge Mar 27, 2019
a9a5217
fix skipped groupby_primitives in docs
rwedge Mar 27, 2019
1af90fe
actually test that dfs will stack agg on gbtf
rwedge Mar 27, 2019
4a42821
Update changelog.rst
kmax12 Mar 28, 2019
a0ec863
Update changelog.rst
kmax12 Mar 28, 2019
+129 −15
@@ -2,6 +2,38 @@

Changelog
---------
**v0.7.0** Mar XX, 2019

Breaking Changes:

* Cumulative transform primitives are now calculated using a new feature class, ``GroupByTransformFeature``. ``ft.dfs`` now has a ``groupby_transform_primitives`` parameter, which DFS will use to automatically construct groupby features. This change applies to ``CumSum``, ``CumCount``, ``CumMean``, ``CumMin``, and ``CumMax``.

    .. code-block:: python
        :caption: Previous behavior

        ft.dfs(entityset=es, target_entity='customers', trans_primitives=["cum_mean"])

    .. code-block:: python
        :caption: New behavior

        ft.dfs(entityset=es, target_entity='customers', groupby_transform_primitives=["cum_mean"])

This conversation was marked as resolved by kmax12

kmax12 Mar 27, 2019

Member

update groupby_transform_primitives here

Changes to writing individual features:

    .. code-block:: python
        :caption: Previous behavior

        ft.Feature([base_feature, groupby_feature], primitive=CumulativePrimitive)

    .. code-block:: python
        :caption: New behavior

        ft.Feature(base_feature, groupby=groupby_feature, primitive=CumulativePrimitive)

**v0.6.1** Feb 15, 2019
* Cumulative primitives (:pr:`410`)
* Entity.query_by_values now preserves row order of underlying data (:pr:`428`)
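
Note on the v0.7.0 entry above: a feature such as `CUM_SUM(value) by session_id` is a running total that restarts for each group. The following is a minimal pure-Python sketch of that idea, not the featuretools implementation (which goes through the pandas backend). The function name `cum_sum_by_group` and the sample data are hypothetical, and returning `None` for rows without a group is an assumption based on the commit "let pandas fill in null values for instances without a group".

```python
from collections import defaultdict


def cum_sum_by_group(values, groups):
    """Running sum of `values`, tracked separately for each key in `groups`."""
    running = defaultdict(float)
    out = []
    for value, group in zip(values, groups):
        if group is None:
            # Assumption: a row with no group gets a null result.
            out.append(None)
            continue
        running[group] += value
        out.append(running[group])
    return out


# Hypothetical data: five log rows spread across two sessions.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
session_id = ["a", "b", "a", None, "b"]
result = cum_sum_by_group(values, session_id)
print(result)
```

Each output row depends only on earlier rows in the same group, which is why the groupby column has to be passed separately from the base feature.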
@@ -490,7 +490,7 @@ def generate_name(self):
# place in the feature name
base_names = [bf.get_name() for bf in self.base_features[:-1]]
_name = self.primitive.generate_name(base_names)
- return "{} by {}".format(_name, self.groupby.get_name())
+ return u"{} by {}".format(_name, self.groupby.get_name())


class Feature(object):
@@ -7,6 +7,7 @@
from featuretools.feature_base import (
AggregationFeature,
DirectFeature,
GroupByTransformFeature,
IdentityFeature,
TransformFeature
)
@@ -17,7 +18,7 @@
TransformPrimitive
)
from featuretools.utils import is_string
- from featuretools.variable_types import Boolean, Numeric
+ from featuretools.variable_types import Boolean, Id, Numeric

logger = logging.getLogger('featuretools')

@@ -47,6 +48,9 @@ class DeepFeatureSynthesis(object):
["count"]
groupby_transform_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):
list of Transform primitives to make GroupByTransformFeatures with
max_depth (int, optional) : maximum allowed depth of features.
Default: 2. If -1, no limit.
@@ -85,6 +89,7 @@ def __init__(self,
agg_primitives=None,
trans_primitives=None,
where_primitives=None,
groupby_transform_primitives=None,
max_depth=2,
max_hlevel=2,
max_features=-1,
@@ -152,18 +157,8 @@ def __init__(self,
ftypes.Weekday, ftypes.Haversine,
ftypes.NumWords, ftypes.NumCharacters] # ftypes.TimeSince
self.trans_primitives = []
- trans_prim_dict = ftypes.get_transform_primitives()
for t in trans_primitives:
- if is_string(t):
- if t.lower() not in trans_prim_dict:
- raise ValueError("Unknown transform primitive {}. ".format(t),
- "Call ft.primitives.list_primitives() to get",
- " a list of available primitives")
- t = trans_prim_dict[t.lower()]
- t = handle_primitive(t)
- if not isinstance(t, TransformPrimitive):
- raise ValueError("Primitive {} in trans_primitives is not a "
- "transform primitive".format(type(t)))
+ t = check_trans_primitive(t)
self.trans_primitives.append(t)

if where_primitives is None:
@@ -180,6 +175,13 @@ def __init__(self,
p = handle_primitive(p)
self.where_primitives.append(p)

if groupby_transform_primitives is None:
groupby_transform_primitives = []
self.groupby_transform_primitives = []
for p in groupby_transform_primitives:
p = check_trans_primitive(p)
self.groupby_transform_primitives.append(p)

self.seed_features = seed_features or []
self.drop_exact = drop_exact or []
self.drop_contains = drop_contains or []
@@ -505,6 +507,31 @@ def _build_transform_features(self, all_features, entity, max_depth=0):
self._handle_new_feature(all_features=all_features,
new_feature=new_f)

for groupby_prim in self.groupby_transform_primitives:
# Normally input_types is a list of what inputs can be supplied to
# the primitive function. Here we temporarily add `Id` as an extra
# item in input_types so that the matching function will also look
# for feature columns to group by.
input_types = groupby_prim.input_types[:]
# if multiple input_types, only use first one for DFS
if type(input_types[0]) == list:
input_types = input_types[0]
input_types.append(Id)

features = self._features_by_type(all_features=all_features,
entity=entity,
max_depth=new_max_depth,
variable_type=set(input_types))
matching_inputs = match(input_types, features,
commutative=groupby_prim.commutative)
for matching_input in matching_inputs:
if all(bf.number_output_features == 1 for bf in matching_input):
new_f = GroupByTransformFeature(list(matching_input[:-1]),
groupby=matching_input[-1],
primitive=groupby_prim)
self._handle_new_feature(all_features=all_features,
new_feature=new_f)
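
Note on the loop above: the matcher is given the primitive's input types plus one extra trailing type, and the last matched feature is then split off as the groupby. One thing that may be worth checking: `groupby_prim.input_types[:]` is a shallow copy, so when `input_types[0]` is itself a list, `input_types = input_types[0]` points at the primitive's own inner list and `append(Id)` would modify it. The sketch below shows the same steps with a copy of the inner list. `FakePrimitive`, `groupby_input_types`, and `split_match` are hypothetical stand-ins, and the type names are plain strings rather than featuretools variable types.

```python
class FakePrimitive:
    """Hypothetical primitive that accepts two alternative signatures."""
    input_types = [["Numeric"], ["Boolean"]]


def groupby_input_types(primitive, groupby_type="Id"):
    """Return the input types to match, with the groupby type appended last."""
    input_types = primitive.input_types[:]
    if isinstance(input_types[0], list):
        # Only the first signature is used; copy it so the append below
        # does not alter the list stored on the primitive.
        input_types = list(input_types[0])
    input_types.append(groupby_type)
    return input_types


def split_match(matching_input):
    """Split a matched tuple into (base features, groupby feature)."""
    *base_features, groupby = matching_input
    return base_features, groupby


types = groupby_input_types(FakePrimitive)
base, by = split_match(("value", "session_id"))
print(types, base, by)
```

Calling `groupby_input_types` repeatedly gives the same result each time, because the primitive's declared signatures are left untouched.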

def _build_forward_features(self, all_features, parent_entity,
child_entity, relationship, max_depth=0):

@@ -755,3 +782,20 @@ def handle_primitive(primitive):
primitive = primitive()
assert isinstance(primitive, PrimitiveBase), "must be a primitive"
return primitive


def check_trans_primitive(primitive):
trans_prim_dict = ftypes.get_transform_primitives()

if is_string(primitive):
if primitive.lower() not in trans_prim_dict:
raise ValueError("Unknown transform primitive {}. ".format(primitive),
"Call ft.primitives.list_primitives() to get",
" a list of available primitives")
primitive = trans_prim_dict[primitive.lower()]
primitive = handle_primitive(primitive)
if not isinstance(primitive, TransformPrimitive):
raise ValueError("Primitive {} in trans_primitives or "
"groupby_transform_primitives is not a transform "
"primitive".format(type(primitive)))
return primitive
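
Note on `check_trans_primitive`: the first `raise ValueError(...)` passes three comma-separated strings, which Python stores as three separate exception arguments, so `str(error)` renders as a tuple rather than one sentence. The sketch below shows the same validation flow with a single concatenated message. The classes and the `REGISTRY` dict are hypothetical stand-ins for the featuretools primitive classes and `ftypes.get_transform_primitives()`, and the name `check_trans_primitive_sketch` is made up for this example.

```python
class PrimitiveBase:
    """Hypothetical stand-in for the featuretools primitive base class."""


class TransformPrimitive(PrimitiveBase):
    pass


class AggregationPrimitive(PrimitiveBase):
    pass


class CumSum(TransformPrimitive):
    name = "cum_sum"


class Max(AggregationPrimitive):
    name = "max"


# Hypothetical registry of transform primitives keyed by lowercase name.
REGISTRY = {"cum_sum": CumSum}


def check_trans_primitive_sketch(primitive):
    """Resolve a name or class to a transform primitive instance, or raise."""
    if isinstance(primitive, str):
        key = primitive.lower()
        if key not in REGISTRY:
            # Adjacent string literals are joined into one message.
            raise ValueError(
                "Unknown transform primitive {}. "
                "Call ft.primitives.list_primitives() to get "
                "a list of available primitives".format(primitive))
        primitive = REGISTRY[key]
    if isinstance(primitive, type):
        primitive = primitive()
    if not isinstance(primitive, TransformPrimitive):
        raise ValueError("Primitive {} in trans_primitives or "
                         "groupby_transform_primitives is not a transform "
                         "primitive".format(type(primitive)))
    return primitive


resolved = check_trans_primitive_sketch("CUM_SUM")
print(type(resolved).__name__)
```

With a single message string, tests such as `test_bad_groupby_feature` can still match on "Unknown transform primitive max", and the text shown to users reads as one sentence.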
@@ -15,6 +15,7 @@ def dfs(entities=None,
instance_ids=None,
agg_primitives=None,
trans_primitives=None,
groupby_transform_primitives=None,
allowed_paths=None,
max_depth=2,
ignore_entities=None,
@@ -74,6 +75,9 @@ def dfs(entities=None,
Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]
groupby_transform_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):
list of Transform primitives to make GroupByTransformFeatures with
allowed_paths (list[list[str]]): Allowed entity paths on which to make
features.
@@ -187,6 +191,7 @@ def dfs(entities=None,
dfs_object = DeepFeatureSynthesis(target_entity, entityset,
agg_primitives=agg_primitives,
trans_primitives=trans_primitives,
groupby_transform_primitives=groupby_transform_primitives,
max_depth=max_depth,
where_primitives=where_primitives,
allowed_paths=allowed_paths,
@@ -264,6 +264,38 @@ def test_makes_agg_features_with_where(es):
'COUNT(log WHERE products.department = food)'))


def test_make_groupby_features(es):
dfs_obj = DeepFeatureSynthesis(target_entity_id='log',
entityset=es,
agg_primitives=[],
trans_primitives=[],
groupby_transform_primitives=['cum_sum'])
features = dfs_obj.build_features()
assert (feature_with_name(features,
"CUM_SUM(value) by session_id"))


def test_make_groupby_features_with_agg(es):
This conversation was marked as resolved by kmax12

kmax12 Mar 27, 2019

Member

What does this test cover that isn't covered above? Is the idea to show that agg primitives get stacked on top of groupby transform features?

dfs_obj = DeepFeatureSynthesis(target_entity_id='customers',
entityset=es,
agg_primitives=['sum'],
trans_primitives=[],
groupby_transform_primitives=['cum_sum'])
features = dfs_obj.build_features()
assert (feature_with_name(features,
u"CUM_SUM(age) by région_id"))


def test_bad_groupby_feature(es):
msg = "Unknown transform primitive max"
with pytest.raises(ValueError, match=msg):
DeepFeatureSynthesis(target_entity_id='customers',
entityset=es,
agg_primitives=['sum'],
trans_primitives=[],
groupby_transform_primitives=['max'])


def test_abides_by_max_depth_param(es):
for i in [1, 2, 3]:
dfs_obj = DeepFeatureSynthesis(target_entity_id='sessions',
@@ -742,8 +774,8 @@ def test_checks_primitives_correct_type(es):
trans_primitives=[])

error_text = "Primitive <class \\'featuretools\\.primitives\\.standard\\."\
- "aggregation_primitives\\.Last\\'> in trans_primitives is "\
- "not a transform primitive"
+ "aggregation_primitives\\.Last\\'> in trans_primitives or "\
+ "groupby_transform_primitives is not a transform primitive"
with pytest.raises(ValueError, match=error_text):
DeepFeatureSynthesis(target_entity_id="sessions",
entityset=es,
@@ -109,6 +109,7 @@ def test_all_variables(entities, relationships):
instance_ids=instance_ids,
agg_primitives=[Max, Mean, Min, Sum],
trans_primitives=[],
groupby_transform_primitives=["cum_sum"],
max_depth=3,
allowed_paths=None,
ignore_entities=None,