Support entitysets with multiple paths between entities #543

Closed
CJStadler opened this issue May 15, 2019 · 2 comments

@CJStadler

Currently we cannot support entitysets where there are multiple paths between any two entities. The root cause is that aggregation and direct features do not specify a path to the entity of their base features; they assume that there can only be one. During feature matrix calculation, when multiple paths are detected, we therefore raise an error ("Diamond graph detected!").

By instead representing features in terms of relationships we can build multiple
features with different paths to the same entity.

Example structures and options for naming features

The features discussed are aggregations, but the same logic should apply to
DirectFeatures, just in the opposite direction.

1. Two to One

Two relationships point to the same entity.

Example:

Entities:
  games
  teams
Relationships:
  games.home_team_id -> teams.id
  games.away_team_id -> teams.id

Current features generated by dfs (for teams target entity):

  • MEAN(games.home_score)
  • MEAN(games.away_score)

We should be able to represent

  • MEAN(home_games.home_score)
  • MEAN(home_games.away_score)
  • MEAN(away_games.home_score)
  • MEAN(away_games.away_score)

Naming:

  1. Fully qualified: MEAN(games[home_team_id].home_score)
  2. With user provided relationship names: MEAN(home_games.home_score)
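
To make the two naming options concrete, here is a minimal sketch in plain Python. It does not use the featuretools API: the `Relationship` class and the `aggregation_name` helper are hypothetical, and the `child_relationship_name` field stands in for a user-provided name such as `home_games`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Relationship:
    """Hypothetical stand-in for a child -> parent relationship."""
    child_entity: str
    child_variable: str
    parent_entity: str
    child_relationship_name: Optional[str] = None  # user-provided, optional

    def qualified_name(self) -> str:
        # Option 1: entity plus the variable that defines the relationship.
        return f"{self.child_entity}[{self.child_variable}]"

    def display_name(self) -> str:
        # Option 2: prefer the user-provided name when there is one.
        return self.child_relationship_name or self.qualified_name()


def aggregation_name(primitive: str, path: Sequence[Relationship], variable: str) -> str:
    """Build a feature name from a relationship path and a base variable."""
    steps = ".".join(rel.display_name() for rel in path)
    return f"{primitive}({steps}.{variable})"


home = Relationship("games", "home_team_id", "teams")
home_named = Relationship("games", "home_team_id", "teams", "home_games")

print(aggregation_name("MEAN", [home], "home_score"))
print(aggregation_name("MEAN", [home_named], "home_score"))
```

The first call yields the fully qualified form and the second the form with a provided name, so both relationships to `teams` stay distinguishable.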

2. Diamond

There are two paths (through different entities) to the same entity.

Example:

Entities:
  regions
  customers
  stores
  transactions
Relationships:
  transactions.store_id -> stores.id
  transactions.customer_id -> customers.id
  stores.region_id -> regions.id
  customers.region_id -> regions.id

Currently dfs generates MEAN(transactions.amount) (for regions).

We should be able to represent

  • MEAN(stores.transactions.amount)
  • MEAN(customers.transactions.amount)

Naming:

  1. Fully qualified: MEAN(stores[region_id].transactions[store_id].amount)
  2. Because no relationship is ambiguous we can infer their names from the entity
    names: MEAN(stores.transactions.amount)
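
As a rough illustration of how both paths in the diamond could be discovered, the sketch below enumerates every chain of relationships from the target entity down to the base entity and prints the fully qualified name for each. The tuple layout and the `backward_paths` and `qualified_name` helpers are assumptions made for this example, not existing featuretools functions.

```python
from typing import Iterator, List, Tuple

# (child_entity, child_variable, parent_entity)
Rel = Tuple[str, str, str]

DIAMOND: List[Rel] = [
    ("transactions", "store_id", "stores"),
    ("transactions", "customer_id", "customers"),
    ("stores", "region_id", "regions"),
    ("customers", "region_id", "regions"),
]


def backward_paths(rels: List[Rel], target: str, base: str) -> Iterator[List[Rel]]:
    """Yield each chain of relationships leading from target down to base.

    Assumes there are no directed cycles; a self-loop would recurse forever.
    """
    if target == base:
        yield []
        return
    for rel in rels:
        child, _variable, parent = rel
        if parent == target:
            for rest in backward_paths(rels, child, base):
                yield [rel] + rest


def qualified_name(primitive: str, path: List[Rel], variable: str) -> str:
    """Name a feature with every step written as entity[variable]."""
    steps = ".".join(f"{child}[{var}]" for child, var, _parent in path)
    return f"{primitive}({steps}.{variable})"


names = [
    qualified_name("MEAN", path, "amount")
    for path in backward_paths(DIAMOND, "regions", "transactions")
]
for name in names:
    print(name)
```

Each path becomes its own feature, which is what removes the need for the "Diamond graph detected!" error.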

3. Triangle

Target entity has a direct path to base entity, and another through a third
entity.

Example:

Entities:
  users
  reports
  queries
Relationships:
  queries.creator_id -> users.id
  queries.report_id -> reports.id
  reports.creator_id -> users.id

Currently dfs generates MEAN(queries.time) (for users).

We should be able to represent

  • MEAN(queries.time) (where queries is the relationship directly from queries
    to users).
  • MEAN(reports.queries.time)

Naming:

  1. Fully qualified:
     • MEAN(queries[creator_id].time)
     • MEAN(reports[creator_id].queries[report_id].time)
  2. Since no relationship is ambiguous we can infer their names:
     • MEAN(queries.time)
     • MEAN(reports.queries.time)
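
One possible way to decide when a name can be inferred is sketched below. It assumes that a relationship counts as ambiguous only when another relationship connects the same child and parent entities (as in the games/teams example); that definition and the `inferred_name` helper are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (child_entity, child_variable, parent_entity)
Rel = Tuple[str, str, str]

TRIANGLE: List[Rel] = [
    ("queries", "creator_id", "users"),
    ("queries", "report_id", "reports"),
    ("reports", "creator_id", "users"),
]


def inferred_name(primitive: str, path: List[Rel], all_rels: List[Rel], variable: str) -> str:
    """Use the bare entity name for a step unless that step is ambiguous."""
    # Count how many relationships link each (child, parent) pair.
    pair_counts = Counter((child, parent) for child, _var, parent in all_rels)
    steps = []
    for child, var, parent in path:
        if pair_counts[(child, parent)] > 1:
            steps.append(f"{child}[{var}]")  # fall back to the qualified form
        else:
            steps.append(child)
    joined = ".".join(steps)
    return f"{primitive}({joined}.{variable})"


direct = [TRIANGLE[0]]                 # users <- queries
via_reports = [TRIANGLE[2], TRIANGLE[1]]  # users <- reports <- queries

print(inferred_name("MEAN", direct, TRIANGLE, "time"))
print(inferred_name("MEAN", via_reports, TRIANGLE, "time"))
```

Under this rule the triangle gets the short names, while two relationships between the same pair of entities would still be written out with their variables.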

4. Self-loop

Example:

Entities:
  employees
Relationships:
  employees.manager_id -> employees.id

Currently we can't represent any features through this relationship, and DFS recurses until it crashes.

We should be able to represent MEAN(direct_reports.age).

Naming:

  1. Fully qualified: MEAN(employees[manager_id].age)
  2. With provided names: MEAN(direct_reports.age).
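
To show what MEAN(direct_reports.age) would compute, here is a small plain-Python sketch over made-up rows. The data and the `mean_direct_reports_age` helper are purely illustrative; this is not how featuretools would calculate the feature.

```python
from statistics import mean

# Illustrative rows only; manager_id points back at employees.id.
employees = [
    {"id": 1, "manager_id": None, "age": 50},
    {"id": 2, "manager_id": 1, "age": 30},
    {"id": 3, "manager_id": 1, "age": 40},
    {"id": 4, "manager_id": 2, "age": 25},
]


def mean_direct_reports_age(rows):
    """For each employee, average the ages of the employees they manage."""
    ages_by_manager = {}
    for row in rows:
        manager_id = row["manager_id"]
        if manager_id is not None:
            ages_by_manager.setdefault(manager_id, []).append(row["age"])
    # Employees with no direct reports get None rather than a mean.
    return {
        row["id"]: mean(ages_by_manager[row["id"]]) if row["id"] in ages_by_manager else None
        for row in rows
    }


result = mean_direct_reports_age(employees)
print(result)
```

The aggregation follows the relationship exactly once, from each manager to their direct reports, so no unbounded recursion is involved.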

5. Linear Paths

Any of the above structures, but with a linear path inserted. E.g. in the
diamond example if there are items belonging to transactions.

Naming:

  1. Fully qualified: MEAN(customers[region_id].transactions[customer_id].items[log_id].amount)
  2. With inferred names: MEAN(customers.transactions.items.amount)
  3. We could skip transactions because there is only one path from customers
    (or stores) to items: MEAN(customers.items.amount).

Option 3 would mean that if an entityset has no (undirected) cycles then the
names of features would not change from the current implementation. However,
this adds complexity and gives up some clarity to reduce verbosity. Is
this worth it?

Our initial plan is to skip relationships in names only if there is exactly one possible path to the base entity. In the above case, because there are multiple paths, we would use MEAN(customers.transactions.items.amount).
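
The rule in the initial plan could look roughly like this. The `feature_name` helper is hypothetical, and paths are given as lists of entity names running from just below the target entity down to the base entity.

```python
from typing import List


def feature_name(primitive: str, path: List[str], all_paths: List[List[str]], variable: str) -> str:
    """Shorten the name to the base entity only when the path is unique.

    path is the chosen path; all_paths holds every path from the target
    entity to the same base entity (including the chosen one).
    """
    if len(all_paths) == 1:
        steps = path[-1:]  # unique path: keep the current naming
    else:
        steps = path       # several paths: spell out every entity
    joined = ".".join(steps)
    return f"{primitive}({joined}.{variable})"


# Diamond with items below transactions, target entity regions.
diamond_paths = [
    ["stores", "transactions", "items"],
    ["customers", "transactions", "items"],
]
print(feature_name("MEAN", diamond_paths[1], diamond_paths, "amount"))

# Target entity stores: only one path reaches items.
linear_paths = [["transactions", "items"]]
print(feature_name("MEAN", linear_paths[0], linear_paths, "amount"))
```

With this choice, entitysets without multiple paths keep their existing feature names, and the longer names only appear where they are needed to disambiguate.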

User defined relationship names

We will need to update all APIs which create relationships to accept two new parameters:

  • child_relationship_name. E.g. "home_games"
  • parent_relationship_name. E.g. "home_team"

These will then replace the generated relationship names.

Development plan

At any point: add an API to set relationship names.

In order:

  1. Update AggregationFeature and DirectFeature to use relationship paths
    internally. Accept the existing params and use them to build the relationship
    path. Raise if there are multiple possible paths.
  2. Update calculate_feature_matrix (and PandasBackend) to be "relationship
    aware".
  3. Update DFS to be "relationship aware" and to construct features using
    relationship paths. Update features to use relationship based names.
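
Step 1 could be sketched as follows: given only a parent entity and a base entity, look for the relationship path and raise if it is missing or not unique. The function name, its parameters and the error messages are assumptions for this example; the real constructors may take different arguments.

```python
from typing import List, Tuple

# (child_entity, child_variable, parent_entity)
Rel = Tuple[str, str, str]


def resolve_backward_path(rels: List[Rel], parent_entity: str, base_entity: str) -> List[Rel]:
    """Return the single path from parent_entity down to base_entity, or raise."""
    complete = []
    stack = [(parent_entity, [])]
    while stack:
        entity, path = stack.pop()
        if entity == base_entity and path:
            complete.append(path)
            continue
        for rel in rels:
            child, _variable, parent = rel
            # Use each relationship at most once per path so the search ends.
            if parent == entity and rel not in path:
                stack.append((child, path + [rel]))
    if not complete:
        raise ValueError(f"No path from {parent_entity} to {base_entity}")
    if len(complete) > 1:
        raise ValueError(
            f"{len(complete)} possible paths from {parent_entity} to "
            f"{base_entity}; a relationship path must be given explicitly"
        )
    return complete[0]


LINEAR: List[Rel] = [("transactions", "store_id", "stores")]
print(resolve_backward_path(LINEAR, "stores", "transactions"))
```

Existing callers that never hit multiple paths would keep working unchanged, while ambiguous cases fail early with a clear message instead of during feature matrix calculation.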
@kmax12

kmax12 commented Jul 1, 2019

@CJStadler are we good to close this? I think the only thing not implemented is 4. Self-loop, which has issue #601.

@CJStadler

Yeah I think we can close this.

We also haven't implemented custom relationship names, although I'm not sure that is very important. Should I create an issue for that?
