Currently we cannot support entitysets where there are multiple paths between any two entities. The root cause is that aggregation and direct features do not specify a path to the entity of their base features, assuming that there can only be one. During feature matrix calculation when multiple paths are detected we therefore raise an error ("Diamond graph detected!").
By instead representing features in terms of relationships we can build multiple
features with different paths to the same entity.
Example structures and options for naming features
The features discussed are aggregations, but the same logic should apply to DirectFeatures, just in the opposite direction.
1. Two to One
Two relationships point to the same entity.
Example:
Entities:
games
teams
Relationships:
games.home_team_id -> teams.id
games.away_team_id -> teams.id
Current features generated by dfs (for the teams target entity):
MEAN(games.home_score)
MEAN(games.away_score)
We should be able to represent
MEAN(home_games.home_score)
MEAN(home_games.away_score)
MEAN(away_games.home_score)
MEAN(away_games.away_score)
Naming:
MEAN(games[home_team_id].home_score)
MEAN(home_games.home_score)
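To make the distinction between the two relationships concrete, here is a minimal sketch in plain Python. It is not how featuretools computes features; the rows are toy data and `mean_through` is a hypothetical helper. It only shows that the same base variable yields different values depending on which key the aggregation follows.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; column names follow the example above.
games = [
    {"id": 1, "home_team_id": "A", "away_team_id": "B", "home_score": 3, "away_score": 1},
    {"id": 2, "home_team_id": "B", "away_team_id": "A", "home_score": 2, "away_score": 2},
    {"id": 3, "home_team_id": "A", "away_team_id": "C", "home_score": 1, "away_score": 0},
]


def mean_through(rows, key, value):
    """Mean of `value` per parent instance, following the relationship keyed by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {parent: mean(values) for parent, values in groups.items()}


# MEAN(games[home_team_id].home_score), i.e. MEAN(home_games.home_score)
home_games_home_score = mean_through(games, "home_team_id", "home_score")
# MEAN(games[away_team_id].home_score), i.e. MEAN(away_games.home_score)
away_games_home_score = mean_through(games, "away_team_id", "home_score")
```

Team A gets one value from its home games and a different one from its away games, which is exactly what a single MEAN(games.home_score) cannot express.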
2. Diamond
There are two paths (through different entities) to the same entity.
Example: currently dfs generates MEAN(transactions.amount) (for regions).
We should be able to represent
MEAN(stores.transactions.amount)
MEAN(customers.transactions.amount)
Naming:
MEAN(stores[region_id].transactions[store_id].amount)
With inferred names:
MEAN(stores.transactions.amount)
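Both naming styles can be derived from the same relationship path. The sketch below assumes a path is a list of relationships ordered from the target entity down to the base entity; the `Relationship` tuple and `feature_name` function are hypothetical and are not the featuretools classes.

```python
from typing import List, NamedTuple


class Relationship(NamedTuple):
    # Hypothetical structure for illustration; not the featuretools Relationship.
    child_entity: str
    child_variable: str
    parent_entity: str


def feature_name(primitive: str, path: List[Relationship], variable: str,
                 with_keys: bool) -> str:
    """Build an aggregation name from a path listed from target down to base."""
    parts = []
    for rel in path:
        if with_keys:
            # Explicit form: entity plus the key that identifies the relationship.
            parts.append(f"{rel.child_entity}[{rel.child_variable}]")
        else:
            # Inferred form: the child entity name stands in for the relationship.
            parts.append(rel.child_entity)
    joined = ".".join(parts)
    return f"{primitive}({joined}.{variable})"


via_stores = [
    Relationship("stores", "region_id", "regions"),
    Relationship("transactions", "store_id", "stores"),
]
via_customers = [
    Relationship("customers", "region_id", "regions"),
    Relationship("transactions", "customer_id", "customers"),
]

explicit = feature_name("MEAN", via_stores, "amount", with_keys=True)
inferred = feature_name("MEAN", via_customers, "amount", with_keys=False)
```

The explicit form is unambiguous even when two relationships connect the same pair of entities; the inferred form is shorter but relies on entity names being enough to identify the path.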
3. Triangle
Target entity has a direct path to base entity, and another through a third
entity.
Example: currently dfs generates MEAN(queries.time) (for users).
We should be able to represent
MEAN(queries.time) (where queries is the relationship directly from queries to users)
MEAN(reports.queries.time)
Naming:
MEAN(queries[creator_id].time)
MEAN(reports[creator_id].queries[report_id].time)
MEAN(queries.time)
MEAN(reports.queries.time)
4. Self-loop
Example: currently we can't represent any features through a self-referencing relationship, and DFS recurses until it crashes.
We should be able to represent
MEAN(direct_reports.age)
Naming:
MEAN(employees[manager_id].age)
MEAN(direct_reports.age)
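One way to keep path enumeration from recursing forever on a self-loop is to bound the path length. The sketch below is an assumption about how that could look, not the current DFS implementation; `paths_to_children` and the tuple layout `(child_entity, child_variable, parent_entity)` are hypothetical.

```python
def paths_to_children(relationships, target, max_depth):
    """Enumerate relationship paths leading down from `target`, at most `max_depth` long."""
    paths = []

    def walk(entity, path):
        # Stop extending once the path has reached the depth bound.
        if len(path) == max_depth:
            return
        for rel in relationships:
            child, _key, parent = rel
            if parent == entity:
                new_path = path + [rel]
                paths.append(new_path)
                walk(child, new_path)

    walk(target, [])
    return paths


# Self-loop: employees.manager_id refers back to employees.
relationships = [("employees", "manager_id", "employees")]
paths = paths_to_children(relationships, "employees", max_depth=2)
names = [".".join(f"{child}[{key}]" for child, key, _ in path) for path in paths]
```

With `max_depth=2` this yields the direct reports path and the reports-of-reports path, and then stops.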
5. Linear Paths
Any of the above structures, but with a linear path inserted. E.g. in the
diamond example if there are items belonging to transactions.
Naming:
Option 1: MEAN(customers[region_id].transactions[customer_id].items[log_id].amount)
Option 2, with inferred names: MEAN(customers.transactions.items.amount)
Option 3: we could skip transactions because there is only one path from customers (or stores) to items: MEAN(customers.items.amount)
Option 3 would mean that if an entityset has no (undirected) cycles then the
names of features would not change from the current implementation. However,
this adds complexity, and you are giving up some clarity to reduce verbosity. Is
this worth it?
Our initial plan is to only skip relationships in names if there is only one possible path to the base entity. In the above case, because there are multiple paths, we would use MEAN(customers.transactions.items.amount).
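The initial plan can be sketched as a small rule: count the paths from the base entity up to the target, and only drop the intermediate relationships from the name when that count is exactly one. The helper names below are hypothetical, and the sketch assumes the relationship graph has no directed cycles.

```python
def count_paths(relationships, child, parent):
    """Count directed child -> parent paths; assumes no directed cycles."""
    if child == parent:
        return 1
    return sum(
        count_paths(relationships, p, parent)
        for c, p in relationships
        if c == child
    )


def relationship_part(relationships, target, path):
    """`path` lists entities from just below the target down to the base entity."""
    base = path[-1]
    if count_paths(relationships, base, target) == 1:
        # Only one way to reach the base entity: the short name is unambiguous.
        return base
    return ".".join(path)


# Diamond example with items added below transactions, as (child, parent) pairs.
relationships = [
    ("stores", "regions"),
    ("customers", "regions"),
    ("transactions", "stores"),
    ("transactions", "customers"),
    ("items", "transactions"),
]

from_regions = relationship_part(relationships, "regions", ["customers", "transactions", "items"])
from_stores = relationship_part(relationships, "stores", ["transactions", "items"])
```

For the regions target there are two paths to items, so the full path is kept; for the stores target there is only one, so the name collapses to the base entity as it does today.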
User defined relationship names
We will need to update all APIs which create relationships to accept two new parameters:
child_relationship_name. E.g. "home_games"
parent_relationship_name. E.g. "home team"
These will then replace the generated relationship names.
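A rough sketch of how the two parameters could fall back to generated names when they are not supplied. The `Relationship` class here is hypothetical and is not the existing featuretools class; the generated parent-side name format in particular is an assumption, since only the child-side format (e.g. games[home_team_id]) appears in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relationship:
    # Hypothetical class for illustration; not the featuretools API.
    child_entity: str
    child_variable: str
    parent_entity: str
    parent_variable: str
    child_relationship_name: Optional[str] = None
    parent_relationship_name: Optional[str] = None

    @property
    def child_name(self) -> str:
        """Name used when going from parent to child (aggregations)."""
        generated = f"{self.child_entity}[{self.child_variable}]"
        return self.child_relationship_name or generated

    @property
    def parent_name(self) -> str:
        """Name used when going from child to parent (direct features)."""
        # Assumed format: the proposal does not specify the generated parent name.
        generated = f"{self.parent_entity}[{self.child_variable}]"
        return self.parent_relationship_name or generated


home = Relationship("games", "home_team_id", "teams", "id",
                    child_relationship_name="home_games",
                    parent_relationship_name="home team")
away = Relationship("games", "away_team_id", "teams", "id")
```

Here `home.child_name` is the user defined "home_games", while `away.child_name` falls back to the generated games[away_team_id].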
Development plan
At any point: add an API to set relationship names.
In order:
1. Update AggregationFeature and DirectFeature to use relationship paths internally. Accept the existing params and use them to build the relationship path. Raise if there are multiple possible paths.
2. Update calculate_feature_matrix (and PandasBackend) to be "relationship aware".
3. Update DFS to be "relationship aware" and to construct features using relationship paths. Update features to use relationship based names.
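The first step could look roughly like the following: given only the base entity and the feature's entity (the existing params), look up the relationship path and raise when it is missing or ambiguous. `find_paths` and `resolve_path` are hypothetical names, and relationships are modelled as `(child_entity, child_variable, parent_entity)` tuples purely for illustration.

```python
def find_paths(relationships, start, end):
    """All paths from `start` up to `end`, each a list of relationship tuples."""
    paths = []

    def walk(entity, path):
        for rel in relationships:
            child, _key, parent = rel
            # Never reuse a relationship within one path, so self-loops terminate.
            if child != entity or rel in path:
                continue
            new_path = path + [rel]
            if parent == end:
                paths.append(new_path)
            else:
                walk(parent, new_path)

    walk(start, [])
    return paths


def resolve_path(relationships, base_entity, parent_entity):
    """Return the unique path, or raise if there is none or more than one."""
    paths = find_paths(relationships, base_entity, parent_entity)
    if not paths:
        raise ValueError(f"No path from {base_entity} to {parent_entity}")
    if len(paths) > 1:
        raise ValueError(
            f"Multiple paths from {base_entity} to {parent_entity}; "
            "a relationship path must be given explicitly"
        )
    return paths[0]


relationships = [
    ("games", "home_team_id", "teams"),
    ("games", "away_team_id", "teams"),
    ("transactions", "store_id", "stores"),
]
unique = resolve_path(relationships, "transactions", "stores")
```

Calling `resolve_path(relationships, "games", "teams")` raises, because both home_team_id and away_team_id lead to teams; existing entitysets with a single path keep working unchanged.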