Support entitysets with multiple paths between entities #543

CJStadler · 2019-05-15T16:26:35Z

Currently we cannot support entitysets in which there are multiple paths between two entities. The root cause is that aggregation and direct features do not specify a path to the entity of their base features; they assume there can be only one. When multiple paths are detected during feature matrix calculation, we therefore raise an error ("Diamond graph detected!").

By instead representing features in terms of relationships we can build multiple
features with different paths to the same entity.

Example structures and options for naming features

The features discussed are aggregations, but the same logic should apply to
DirectFeatures, just in the opposite direction.

1. Two to One

Two relationships point to the same entity.

Example:

Entities:
  games
  teams
Relationships:
  games.home_team_id -> teams.id
  games.away_team_id -> teams.id

Current features generated by dfs (for teams target entity):

MEAN(games.home_score)
MEAN(games.away_score)

We should be able to represent

MEAN(home_games.home_score)
MEAN(home_games.away_score)
MEAN(away_games.home_score)
MEAN(away_games.away_score)

Naming:

Fully qualified: MEAN(games[home_team_id].home_score)
With user provided relationship names: MEAN(home_games.home_score)
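
As a rough illustration of the fully qualified scheme, here is a minimal sketch in plain Python. The `Relationship` tuple and the `qualified_name` helper are hypothetical names made up for this example and are not part of the featuretools API.

```python
from typing import List, NamedTuple


class Relationship(NamedTuple):
    """Hypothetical stand-in for child_entity.child_variable -> parent_entity.id."""
    child_entity: str
    child_variable: str
    parent_entity: str


def qualified_name(primitive: str, path: List[Relationship], variable: str) -> str:
    """Name an aggregation by listing each step as child_entity[child_variable]."""
    steps = ".".join(f"{r.child_entity}[{r.child_variable}]" for r in path)
    return f"{primitive}({steps}.{variable})"


home = Relationship("games", "home_team_id", "teams")
away = Relationship("games", "away_team_id", "teams")

# One feature per (relationship, score column) pair: four distinct names.
names = [
    qualified_name("MEAN", [rel], score)
    for rel in (home, away)
    for score in ("home_score", "away_score")
]
for name in names:
    print(name)
```

Because the relationship is part of the name, the two features over `home_score` no longer collide.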

2. Diamond

There are two paths (through different entities) to the same entity.

Example:

Entities:
  regions
  customers
  stores
  transactions
Relationships:
  transactions.store_id -> stores.id
  transactions.customer_id -> customers.id
  stores.region_id -> regions.id
  customers.region_id -> regions.id

Currently dfs generates MEAN(transactions.amount) (for regions).

We should be able to represent

MEAN(stores.transactions.amount)
MEAN(customers.transactions.amount)

Naming:

Fully qualified: MEAN(stores[region_id].transactions[store_id].amount)
Because no relationship is ambiguous we can infer their names from the entity
names: MEAN(stores.transactions.amount)
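
To make the two paths concrete, the sketch below enumerates every path from regions down to transactions. Relationships are written as `(child_entity, child_variable, parent_entity)` tuples, and `backward_paths` is a hypothetical helper for this example only. It assumes there are no directed cycles, so it would not terminate on the self-loop case.

```python
# (child_entity, child_variable, parent_entity) for the diamond example.
RELATIONSHIPS = [
    ("transactions", "store_id", "stores"),
    ("transactions", "customer_id", "customers"),
    ("stores", "region_id", "regions"),
    ("customers", "region_id", "regions"),
]


def backward_paths(start, target, relationships):
    """Return every path from parent `start` down to child `target`."""
    if start == target:
        return [[]]
    paths = []
    for rel in relationships:
        child, _, parent = rel
        if parent == start:
            for rest in backward_paths(child, target, relationships):
                paths.append([rel] + rest)
    return paths


paths = backward_paths("regions", "transactions", RELATIONSHIPS)
names = [
    "MEAN(" + ".".join(f"{c}[{v}]" for c, v, _ in path) + ".amount)"
    for path in paths
]
for name in names:
    print(name)
# MEAN(stores[region_id].transactions[store_id].amount)
# MEAN(customers[region_id].transactions[customer_id].amount)
```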

3. Triangle

Target entity has a direct path to base entity, and another through a third
entity.

Example:

Entities:
  users
  reports
  queries
Relationships:
  queries.creator_id -> users.id
  queries.report_id -> reports.id
  reports.creator_id -> users.id

Currently dfs generates MEAN(queries.time) (for users).

We should be able to represent

MEAN(queries.time) (where queries is the relationship directly from queries
to users).
MEAN(reports.queries.time)

Naming:

Fully qualified:

MEAN(queries[creator_id].time)
MEAN(reports[creator_id].queries[report_id].time)

Since no relationship is ambiguous we can infer their names:

MEAN(queries.time)
MEAN(reports.queries.time)
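
One possible reading of "ambiguous" is that two relationships connect the same child entity to the same parent entity. Under that assumption, name inference could look like the sketch below. `step_name` and `feature_name` are hypothetical helpers, not featuretools functions.

```python
from collections import Counter


def step_name(rel, relationships):
    """Use the bare entity name unless another relationship links the same pair."""
    child, variable, parent = rel
    pair_counts = Counter((c, p) for c, _, p in relationships)
    if pair_counts[(child, parent)] > 1:
        return f"{child}[{variable}]"
    return child


def feature_name(primitive, path, variable, relationships):
    steps = ".".join(step_name(rel, relationships) for rel in path)
    return f"{primitive}({steps}.{variable})"


# Triangle example: every (child, parent) pair is unique, so names are inferred.
TRIANGLE = [
    ("queries", "creator_id", "users"),
    ("queries", "report_id", "reports"),
    ("reports", "creator_id", "users"),
]
direct = [TRIANGLE[0]]
via_reports = [TRIANGLE[2], TRIANGLE[1]]
print(feature_name("MEAN", direct, "time", TRIANGLE))       # MEAN(queries.time)
print(feature_name("MEAN", via_reports, "time", TRIANGLE))  # MEAN(reports.queries.time)

# Two to One example: games -> teams appears twice, so the variable is kept.
GAMES = [
    ("games", "home_team_id", "teams"),
    ("games", "away_team_id", "teams"),
]
print(feature_name("MEAN", [GAMES[0]], "home_score", GAMES))
# MEAN(games[home_team_id].home_score)
```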

4. Self-loop

Example:

Entities:
  employees
Relationships:
  employees.manager_id -> employees.id

Currently we can't represent any features through this relationship, and DFS recurses until it crashes.

We should be able to represent MEAN(direct_reports.age).

Naming:

Fully qualified: MEAN(employees[manager_id].age)
With provided names: MEAN(direct_reports.age).
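
A self-loop means path enumeration never runs out of relationships to follow, so some stopping rule is needed. The sketch below uses a depth cap, which is only one possible approach and is not necessarily how this will be solved in featuretools. `backward_paths` and `max_depth` are names assumed for this example.

```python
def backward_paths(start, relationships, max_depth):
    """Yield every parent-to-child path from `start` with at most `max_depth` steps."""
    if max_depth == 0:
        return
    for rel in relationships:
        child, _, parent = rel
        if parent == start:
            yield [rel]
            for rest in backward_paths(child, relationships, max_depth - 1):
                yield [rel] + rest


# (child_entity, child_variable, parent_entity)
SELF_LOOP = [("employees", "manager_id", "employees")]

paths = list(backward_paths("employees", SELF_LOOP, max_depth=2))
names = [
    "MEAN(" + ".".join(f"{c}[{v}]" for c, v, _ in path) + ".age)"
    for path in paths
]
for name in names:
    print(name)
# MEAN(employees[manager_id].age)
# MEAN(employees[manager_id].employees[manager_id].age)
```

The second name corresponds to an aggregation over the reports of direct reports. Without the cap the generator would keep extending the path indefinitely.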

5. Linear Paths

Any of the above structures, but with a linear path inserted. For example, in
the diamond example, items could belong to transactions.

Naming:

1. Fully qualified: MEAN(customers[region_id].transactions[customer_id].items[log_id].amount)
2. With inferred names: MEAN(customers.transactions.items.amount)
3. We could skip transactions because there is only one path from customers
   (or stores) to items: MEAN(customers.items.amount).

Option 3 would mean that if an entityset has no (undirected) cycles then the
names of features would not change from the current implementation. However,
this adds complexity, and you are giving up some clarity to reduce verbosity. Is
this worth it?

Our initial plan is to skip relationships in names only if there is exactly one possible path to the base entity. In the above case, because there are multiple paths, we would use MEAN(customers.transactions.items.amount).
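
The rule in the initial plan can be sketched as follows: count the paths from the target entity to the base entity, and shorten the name only when there is exactly one. Relationships are reduced to `(child_entity, parent_entity)` pairs here, and `backward_paths` and `display_name` are hypothetical helpers for this example.

```python
# (child_entity, parent_entity) for the diamond example plus items.
RELATIONSHIPS = [
    ("transactions", "stores"),
    ("transactions", "customers"),
    ("stores", "regions"),
    ("customers", "regions"),
    ("items", "transactions"),
]


def backward_paths(start, target, relationships):
    """Return every parent-to-child path from `start` to `target` (acyclic graphs only)."""
    if start == target:
        return [[]]
    return [
        [rel] + rest
        for rel in relationships
        if rel[1] == start
        for rest in backward_paths(rel[0], target, relationships)
    ]


def display_name(primitive, target, path, variable, relationships):
    """Skip intermediate relationships only when the path to the base entity is unique."""
    base = path[-1][0]
    if len(backward_paths(target, base, relationships)) == 1:
        return f"{primitive}({base}.{variable})"
    steps = ".".join(child for child, _ in path)
    return f"{primitive}({steps}.{variable})"


from_regions = [("customers", "regions"), ("transactions", "customers"), ("items", "transactions")]
from_customers = [("transactions", "customers"), ("items", "transactions")]

print(display_name("MEAN", "regions", from_regions, "amount", RELATIONSHIPS))
# MEAN(customers.transactions.items.amount)
print(display_name("MEAN", "customers", from_customers, "amount", RELATIONSHIPS))
# MEAN(items.amount)
```

With customers as the target there is a single path to items, so the name matches what the current implementation produces.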

User defined relationship names

We will need to update all APIs which create relationships to accept two new parameters:

child_relationship_name. E.g. "home_games"
parent_relationship_name. E.g. "home_team"

These will then replace the generated relationship names.
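
A minimal sketch of how the two parameters might override generated names is shown below. The `Relationship` class is a hypothetical stand-in and not the featuretools class. The generated form for the parent direction (`teams[home_team_id]`) is an assumption, since only the child direction is spelled out above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    """Hypothetical relationship with optional user-provided names."""
    child_entity: str
    child_variable: str
    parent_entity: str
    child_relationship_name: Optional[str] = None
    parent_relationship_name: Optional[str] = None

    def child_name(self) -> str:
        """Name used when walking parent -> child (aggregations)."""
        return self.child_relationship_name or f"{self.child_entity}[{self.child_variable}]"

    def parent_name(self) -> str:
        """Name used when walking child -> parent (direct features). Assumed format."""
        return self.parent_relationship_name or f"{self.parent_entity}[{self.child_variable}]"


named = Relationship(
    "games", "home_team_id", "teams",
    child_relationship_name="home_games",
    parent_relationship_name="home_team",
)
unnamed = Relationship("games", "away_team_id", "teams")

print(f"MEAN({named.child_name()}.home_score)")    # MEAN(home_games.home_score)
print(f"MEAN({unnamed.child_name()}.home_score)")  # MEAN(games[away_team_id].home_score)
```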

Development plan

At any point: add an API to set relationship names.

In order:

1. Update AggregationFeature and DirectFeature to use relationship paths
   internally. Accept the existing params and use them to build the relationship
   path. Raise if there are multiple possible paths.
2. Update calculate_feature_matrix (and PandasBackend) to be "relationship
   aware".
3. Update DFS to be "relationship aware" and to construct features using
   relationship paths. Update features to use relationship based names.
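
The first step (build the path from the existing params, raise if it is not unique) could look roughly like the sketch below. `resolve_path` and `AmbiguousPathError` are hypothetical names for this example. The real features take entity and feature objects rather than plain strings.

```python
class AmbiguousPathError(ValueError):
    """Raised when more than one relationship path connects two entities."""


# (child_entity, child_variable, parent_entity) for the diamond example.
RELATIONSHIPS = [
    ("transactions", "store_id", "stores"),
    ("transactions", "customer_id", "customers"),
    ("stores", "region_id", "regions"),
    ("customers", "region_id", "regions"),
]


def backward_paths(start, target, relationships):
    """Return every parent-to-child path from `start` to `target` (acyclic graphs only)."""
    if start == target:
        return [[]]
    return [
        [rel] + rest
        for rel in relationships
        if rel[2] == start
        for rest in backward_paths(rel[0], target, relationships)
    ]


def resolve_path(parent_entity, child_entity, relationships):
    """Infer the single path implied by the existing (parent, child) params."""
    paths = backward_paths(parent_entity, child_entity, relationships)
    if not paths:
        raise ValueError(f"No path from {parent_entity} to {child_entity}")
    if len(paths) > 1:
        raise AmbiguousPathError(
            f"{len(paths)} paths from {parent_entity} to {child_entity}; "
            "a relationship path must be given explicitly"
        )
    return paths[0]


print(resolve_path("customers", "transactions", RELATIONSHIPS))
# [('transactions', 'customer_id', 'customers')]
```

Calling `resolve_path("regions", "transactions", RELATIONSHIPS)` raises `AmbiguousPathError`, which keeps existing callers working while refusing to guess in the diamond case.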

kmax12 · 2019-07-01T17:45:40Z

@CJStadler are we good to close this? I think the only thing not implemented is 4. Self-loop, which has issue #601

CJStadler · 2019-07-01T18:01:20Z

Yeah I think we can close this.

We also haven't implemented custom relationship names, although I'm not sure that is very important. Should I create an issue for that?

CJStadler mentioned this issue May 16, 2019

Add relationship_path to agg and direct features #544

Merged

CJStadler self-assigned this May 30, 2019

CJStadler mentioned this issue May 31, 2019

Update calculating features to handle multiple paths #572

Merged

CJStadler closed this as completed Jul 1, 2019

CJStadler mentioned this issue Jul 1, 2019

Allow relationships to be given custom names #633

Open
