Groupby dfs #472

Merged · 57 commits · Mar 28, 2019
The diff below shows changes from 4 of the 57 commits.

Commits
87194f2
add GroupByTransformPrimitive
rwedge Feb 25, 2019
862590c
update cumulative primitives to inherit from GBTP
rwedge Feb 25, 2019
6948485
add GroupByTransformFeature
rwedge Feb 25, 2019
9f0a2ff
add GBTF to uses_full_entity logic
rwedge Feb 25, 2019
55b6b16
add groupby handler to pandas backend
rwedge Feb 25, 2019
6398a78
move cumulative tests to new test_groupby_primitives file
rwedge Feb 25, 2019
3748402
Merge branch 'master' into grouby-transform-feature
rwedge Feb 26, 2019
09b84fc
remove GroupByTransformPrimitive
rwedge Feb 26, 2019
4a2d6b0
add groupby_primitives argument to DFS
rwedge Feb 26, 2019
9254060
rework groupby logic
rwedge Feb 26, 2019
e3a77d3
update test cases for cumulative primitives
rwedge Feb 26, 2019
bde94de
handled 'nan' group of groupby
rwedge Feb 27, 2019
2837c45
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Feb 27, 2019
f68e9d5
fix input_types expansion bug
rwedge Feb 27, 2019
290e56e
add groupby_primitives to dfs
rwedge Feb 27, 2019
0bf87a0
add deep feature synthesis tests for groupby_primitives
rwedge Feb 27, 2019
90a8b2e
document how we seach for columns to group by
rwedge Feb 28, 2019
efba737
simplify check_trans_primitive
rwedge Feb 28, 2019
8dc9b62
redo groupby_feature.get_name
rwedge Mar 6, 2019
f4f7743
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 6, 2019
c7e808b
update groupby feature name tests
rwedge Mar 6, 2019
f89167d
linting
rwedge Mar 6, 2019
540c301
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 6, 2019
f17e6e6
make GBTF a sublcass of TransformFeature
rwedge Mar 6, 2019
607ccb2
change groupby restriction from Id to Discrete
rwedge Mar 6, 2019
927e62d
test categorical direct feature as groupby in GBTF
rwedge Mar 6, 2019
67f1fcf
test GBTF.copy
rwedge Mar 6, 2019
fe5400f
test groupby with empty data
rwedge Mar 6, 2019
d8db338
test uses_calc_time with GBTF
rwedge Mar 7, 2019
e55fc51
linting
rwedge Mar 7, 2019
a866d73
Merge branch 'master' into grouby-transform-feature
rwedge Mar 15, 2019
9dd526c
rename test file
rwedge Mar 15, 2019
06dac4c
have feature tree separate features by groupby
rwedge Mar 15, 2019
6a9b8d5
check exact class instead of allowing subclasses in feature handlers
rwedge Mar 15, 2019
d70dc6d
add groupby to base_features earlier
rwedge Mar 15, 2019
0d11650
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 18, 2019
e8c9492
convert base features of GBTF to list in _build_transform_features
rwedge Mar 18, 2019
84bc534
Merge branch 'master' into grouby-transform-feature
rwedge Mar 18, 2019
95d7f3e
change implementation of cumulative count
rwedge Mar 22, 2019
03f1a41
add comments about time where we exclude the groupby feature when usi…
rwedge Mar 22, 2019
a3ee88f
reassign index when primitive function returns series
rwedge Mar 22, 2019
d9af1bf
let pandas fill in null values for instances without a group
rwedge Mar 22, 2019
403ad54
Merge branch 'master' into grouby-transform-feature
rwedge Mar 22, 2019
2a41be1
linting
rwedge Mar 22, 2019
21625ac
Merge branch 'grouby-transform-feature' into groupby-dfs
rwedge Mar 22, 2019
e03f1cf
fix unicode error in GBFT generate_name function
rwedge Mar 22, 2019
069f0aa
test passing dfs bad groupby primitive
rwedge Mar 25, 2019
839ffdb
Merge branch 'master' into groupby-dfs
rwedge Mar 25, 2019
3a3f79d
linting
rwedge Mar 25, 2019
640d5f4
document changes to cumulative features in the changelog
rwedge Mar 26, 2019
12845d7
Merge branch 'master' into groupby-dfs
rwedge Mar 26, 2019
3138502
rename groupby_primitives to groupby_transform_primitives
rwedge Mar 27, 2019
122551a
Merge branch 'groupby-dfs' of github.com:Featuretools/featuretools in…
rwedge Mar 27, 2019
a9a5217
fix skipped groupby_primitives in docs
rwedge Mar 27, 2019
1af90fe
actually test that dfs will stack agg on gbtf
rwedge Mar 27, 2019
4a42821
Update changelog.rst
kmax12 Mar 28, 2019
a0ec863
Update changelog.rst
kmax12 Mar 28, 2019
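
Taken together, the commits above add a groupby argument to Feature and a groupby transform primitives argument to DFS. A minimal usage sketch (the DFS keyword name follows the rename commit above and may differ in the released API; the entityset and its 'log'/'value'/'product_id' columns are assumed, mirroring the test fixture used in this PR):

    import featuretools as ft
    from featuretools.primitives import CumSum

    # es is assumed to be an EntitySet whose "log" entity has a numeric
    # "value" column and a discrete "product_id" column.

    # Manually define a groupby transform feature: the cumulative sum of
    # "value" computed within each "product_id" group.
    cum_sum = ft.Feature(es['log']['value'],
                         groupby=es['log']['product_id'],
                         primitive=CumSum)
    fm = ft.calculate_feature_matrix(entityset=es, features=[cum_sum])

    # Or let DFS generate groupby transform features automatically;
    # the keyword name here follows the rename commit in this PR.
    fm, feature_defs = ft.dfs(entityset=es,
                              target_entity='log',
                              groupby_transform_primitives=[CumSum])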
Files changed
@@ -324,11 +324,6 @@ def _calculate_transform_features(self, features, entity_frames):
values = feature_func(*variable_data)

# if we don't get just the values, the assignment breaks when indexes don't match
def strip_values_if_series(values):
if isinstance(values, pd.Series):
values = values.values
return values

if f.number_output_features > 1:
values = [strip_values_if_series(value) for value in values]
else:
@@ -345,48 +340,35 @@ def _calculate_groupby_features(self, features, entity_frames):

frame = entity_frames[entity_id]

# get list of groupby fatures
groupby_lists = defaultdict(list)
for f in features:
# handle when no data
if frame.shape[0] == 0:
set_default_column(frame, f)
continue

groupby_lists[f.groupby.get_name()].append(f)

for groupby, features in groupby_lists.items():
# collect only the variables we need for this transformation
variables = set([groupby])
for feature in features:
for base_feature in feature.base_features:
variables.add(base_feature.get_name())

grouped = frame[variables].groupby(groupby)

for feature in features:
variable_data = [grouped[bf.get_name()] for
bf in feature.base_features]
groupby = f.groupby.get_name()
column_names = [bf.get_name() for bf in f.base_features]
frame_data = frame[set(column_names + [groupby])]
feature_func = f.get_function()

feature_func = f.get_function()
group_values = []
for index, group in frame_data.groupby(groupby):
variable_data = [group[name] for name in column_names]
# apply the function to the relevant dataframe slice and add the
# feature row to the results dataframe.
if f.primitive.uses_calc_time:
values = feature_func(*variable_data, time=self.time_last)
else:
values = feature_func(*variable_data)

# if we don't get just the values, the assignment breaks when indexes don't match
def strip_values_if_series(values):
if isinstance(values, pd.Series):
values = values.values
return values
if not isinstance(values, pd.Series):
values = pd.Series(values, index=variable_data[0].index)
group_values.append(values)

if f.number_output_features > 1:
values = [strip_values_if_series(value) for value in values]
else:
values = [strip_values_if_series(values)]
update_feature_columns(f, frame, values)
null_group = frame[pd.isnull(frame[groupby])]
group_values.append(null_group[groupby])
group_values = pd.concat(group_values)
update_feature_columns(f, frame, [group_values.sort_index().values])

return frame
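
The reworked loop above boils down to a standard pandas pattern. A standalone sketch with made-up data (not the backend code itself): compute the primitive per group, reattach the group's index when the function returns a plain array, append the rows whose group is null so pandas fills them with missing values, then concat and sort by index to restore row order.

    import pandas as pd

    frame = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0],
                          'product_id': ['a', 'b', 'a', None, 'b']})

    def cum_sum(values):
        # stand-in for a primitive function; may return a Series or a plain array
        return values.cumsum()

    group_values = []
    for _, group in frame.groupby('product_id'):
        values = cum_sum(group['value'])
        if not isinstance(values, pd.Series):
            # reattach the group's index so results line up with the frame
            values = pd.Series(values, index=group['value'].index)
        group_values.append(values)

    # rows with a null group get no computed value and end up as missing
    null_group = frame[pd.isnull(frame['product_id'])]
    group_values.append(null_group['product_id'])

    frame['CUM_SUM(value by product_id)'] = pd.concat(group_values).sort_index().values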

@@ -616,3 +598,9 @@ def update_feature_columns(feature, data, values):
assert len(names) == len(values)
for name, value in zip(names, values):
data[name] = value


def strip_values_if_series(values):
if isinstance(values, pd.Series):
values = values.values
return values
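
As the moved helper's comment says, assigning a Series whose index does not match the target frame breaks the assignment, while assigning the bare array aligns by position. A minimal standalone illustration (hypothetical data) of why the .values strip is needed:

    import pandas as pd

    frame = pd.DataFrame({'value': [10, 20]}, index=[0, 1])
    result = pd.Series([1, 2], index=[5, 6])   # index does not match the frame

    frame['feat'] = result          # aligns on index -> column is all NaN
    frame['feat'] = result.values   # aligns by position -> column is [1, 2]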
@@ -490,11 +490,10 @@ def copy(self):
self.primitive,
self.groupby)

# TODO: add groupby param to generate_name
def generate_name(self):
base_names = [bf.get_name() for bf in self.base_features]
# groupby_name = self.groupby.get_name()
return self.primitive.generate_name(base_names)
return self.primitive.generate_name(base_names, groupby=self.groupby)


class Feature(object):
@@ -12,9 +12,11 @@ class TransformPrimitive(PrimitiveBase):
# (and will receive these values as input, regardless of specified instance ids)
uses_full_entity = False

def generate_name(self, base_feature_names):
def generate_name(self, base_feature_names, groupby=None):
name = u"{}(".format(self.name.upper())
name += u", ".join(base_feature_names)
if groupby is not None:
name += u" by {}".format(groupby)
name += u")"
return name
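
With the optional groupby argument, a generated feature name now records the grouping column. A quick sketch of the resulting name, using CumSum and the value/product_id columns that appear in the tests below:

    from featuretools.primitives import CumSum

    CumSum().generate_name(base_feature_names=['value'], groupby='product_id')
    # -> u'CUM_SUM(value by product_id)'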

@@ -1,3 +1,5 @@
import pandas as pd

from featuretools.primitives.base import TransformPrimitive
from featuretools.variable_types import Discrete, Id, Numeric

@@ -27,7 +29,7 @@ class CumCount(TransformPrimitive):

def get_function(self):
def cum_count(values):
return values.cumcount() + 1
return pd.Series(range(1, len(values) + 1), index=values.index)

return cum_count

@@ -42,7 +44,7 @@ class CumMean(TransformPrimitive):

def get_function(self):
def cum_mean(values):
return values.cumsum() / (values.cumcount() + 1)
return values.cumsum() / pd.Series(range(1, len(values) + 1), index=values.index)

return cum_mean
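
Note what the primitive functions now receive: instead of a pandas GroupBy object (hence the old .cumcount() calls), each call gets a single group's values as a plain Series, so the cumulative count is rebuilt from the group's length and index. A short sketch of the new CumCount behaviour:

    import pandas as pd
    from featuretools.primitives import CumCount

    group = pd.Series(['a', 'a', 'a'], index=[0, 2, 5])  # one group's values
    CumCount().get_function()(group)
    # -> pd.Series([1, 2, 3], index=[0, 2, 5])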

@@ -6,7 +6,8 @@

import featuretools as ft
from featuretools.computational_backends import PandasBackend
from featuretools.primitives import CumCount, CumMax, CumMean, CumMin, CumSum
from featuretools.primitives import CumCount, CumMax, CumMean, CumMin, CumSum, TransformPrimitive
from featuretools.variable_types import Numeric


@pytest.fixture
@@ -21,24 +22,27 @@ class TestCumCount:
def test_order(self):
g = pd.Series(["a", "b", "a"])

answer = [1, 1, 2]
answers = ([1, 2], [1])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(g.groupby(g)), answer)
for (_, group), answer in zip(g.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_regular(self):
g = pd.Series(["a", "b", "a", "c", "d", "b"])
answer = [1, 1, 2, 1, 1, 2]
answers = ([1, 2], [1, 2], [1], [1])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(g.groupby(g)), answer)
for (_, group), answer in zip(g.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_discrete(self):
g = pd.Series(["a", "b", "a", "c", "d", "b"])
answer = [1, 1, 2, 1, 1, 2]
answers = ([1, 2], [1, 2], [1], [1])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(g.groupby(g)), answer)
for (_, group), answer in zip(g.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)


class TestCumSum:
@@ -49,18 +53,20 @@ def test_order(self):
v = pd.Series([1, 2, 2])
g = pd.Series(["a", "b", "a"])

answer = [1, 2, 3]
answers = ([1, 3], [2])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_regular(self):
v = pd.Series([101, 102, 103, 104, 105, 106])
g = pd.Series(["a", "b", "a", "c", "d", "b"])
answer = [101, 102, 204, 104, 105, 208]
answers = ([101, 204], [102, 208], [104], [105])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)


class TestCumMean:
@@ -70,18 +76,20 @@ def test_order(self):
v = pd.Series([1, 2, 2])
g = pd.Series(["a", "b", "a"])

answer = [1, 2, 1.5]
answers = ([1, 1.5], [2])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_regular(self):
v = pd.Series([101, 102, 103, 104, 105, 106])
g = pd.Series(["a", "b", "a", "c", "d", "b"])
answer = [101, 102, 102, 104, 105, 104]
answers = ([101, 102], [102, 104], [104], [105])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)


class TestCumMax:
@@ -92,18 +100,20 @@ def test_order(self):
v = pd.Series([1, 2, 2])
g = pd.Series(["a", "b", "a"])

answer = [1, 2, 2]
answers = ([1, 2], [2])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_regular(self):
v = pd.Series([101, 102, 103, 104, 105, 106])
g = pd.Series(["a", "b", "a", "c", "d", "b"])
answer = [101, 102, 103, 104, 105, 106]
answers = ([101, 103], [102, 106], [104], [105])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)


class TestCumMin:
@@ -114,18 +124,20 @@ def test_order(self):
v = pd.Series([1, 2, 2])
g = pd.Series(["a", "b", "a"])

answer = [1, 2, 1]
answers = ([1, 1], [2])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)

def test_regular(self):
v = pd.Series([101, 102, 103, 104, 105, 106, 100])
g = pd.Series(["a", "b", "a", "c", "d", "b", "a"])
answer = [101, 102, 101, 104, 105, 102, 100]
answers = ([101, 101, 100], [102, 102], [104], [105])

function = self.primitive().get_function()
np.testing.assert_array_equal(function(v.groupby(g)), answer)
for (_, group), answer in zip(v.groupby(g), answers):
np.testing.assert_array_equal(function(group), answer)


def test_cum_sum(es):
@@ -172,17 +184,61 @@ def test_cum_sum_group_on_nan(es):
['shoes'] +
[np.nan] * 4 +
['coke_zero'] * 2)
es['log'].df['value'][16] = 10
cum_sum = ft.Feature(log_value_feat, groupby=es['log']['product_id'], primitive=CumSum)
features = [cum_sum]
df = ft.calculate_feature_matrix(entityset=es, features=features, instance_ids=range(15))
df = ft.calculate_feature_matrix(entityset=es, features=features, instance_ids=range(17))
cvalues = df[cum_sum.get_name()].values
assert len(cvalues) == 15
assert len(cvalues) == 17
cum_sum_values = [0, 5, 15,
15, 35,
0, 1, 3,
3, 3,
0,
np.nan, np.nan, np.nan, np.nan]
np.nan, np.nan, np.nan, np.nan, np.nan, 10]

assert len(cvalues) == len(cum_sum_values)
for i, v in enumerate(cum_sum_values):
if np.isnan(v):
assert (np.isnan(cvalues[i]))
else:
assert v == cvalues[i]


def test_cum_sum_numpy_group_on_nan(es):
class CumSumNumpy(TransformPrimitive):
"""Returns the cumulative sum after grouping"""

name = "cum_sum"
input_types = [Numeric]
return_type = Numeric
uses_full_entity = True

def get_function(self):
def cum_sum(values):
return values.cumsum().values
return cum_sum

log_value_feat = es['log']['value']
es['log'].df['product_id'] = (['coke zero'] * 3 + ['car'] * 2 +
['toothpaste'] * 3 + ['brown bag'] * 2 +
['shoes'] +
[np.nan] * 4 +
['coke_zero'] * 2)
es['log'].df['value'][16] = 10
cum_sum = ft.Feature(log_value_feat, groupby=es['log']['product_id'], primitive=CumSumNumpy)
features = [cum_sum]
df = ft.calculate_feature_matrix(entityset=es, features=features, instance_ids=range(17))
cvalues = df[cum_sum.get_name()].values
assert len(cvalues) == 17
cum_sum_values = [0, 5, 15,
15, 35,
0, 1, 3,
3, 3,
0,
np.nan, np.nan, np.nan, np.nan, np.nan, 10]

assert len(cvalues) == len(cum_sum_values)
for i, v in enumerate(cum_sum_values):
if np.isnan(v):
assert (np.isnan(cvalues[i]))
else:
assert v == cvalues[i]