Interesting phenomenon #28

Closed
dstaehler opened this issue Apr 1, 2021 · 4 comments

Comments

@dstaehler

Hello Cleora team, this is a very interesting and clever solution for creating embeddings. However, I have noticed a behavior that I cannot explain. When creating embeddings with one column (a single node category) that contains both the start and the end node of each edge (a simple edge list), nodes that are further away from each other in the graph get vectors that are closer to each other, e.g.: (a) -> (b) -> (c) -> (d)

as Edge List:
a b
b c
c d

The vectors a and d are closer together than the vectors a and b (by cosine similarity).
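For clarity, by cosine similarity I mean the standard definition; a minimal pure-Python sketch (the toy vectors are made up for illustration, not real Cleora embeddings):

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors, 45 degrees apart:
a = [1.0, 0.0]
b = [1.0, 1.0]
print(cosine_similarity(a, b))  # 0.7071... (i.e. 1/sqrt(2))
```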

Volume: approx. 5.5 million nodes and 41 million edges

I created the embeddings with the following call:

--columns='complex::reflexive::nodes' -d=128 -i='node.edgelist' -n=4

As I understand the pattern, a reflexive relationship in a column of a single type (complex) should cover an edge list over one node category. Am I doing something wrong in the configuration, or is this an issue?

A short tip would be much appreciated.

Best

@barbara3430

Hello @dstaehler! Your method of running Cleora seems fine. To investigate this phenomenon, let's do the following:

  • please double-check that you're using cosine similarity for the distance computation
  • if yes, this behavior seems strange. Are you sure there are no hidden cycles in your data? E.g. node A could be linked to D further along in the dataset by a direct or at least closer connection.
  • you could try a smaller -n. With a -n that is too large, all vectors eventually become very similar to each other. We usually use an -n of 1 to 4 and a larger -d, usually 1024.
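To rule out the second point quickly, a small BFS over the edge list can report the actual hop distance between a suspicious pair; a rough sketch, assuming the whitespace-separated, undirected edge list from above (node names are placeholders):

```python
from collections import defaultdict, deque

def shortest_hops(edges, source, target):
    """BFS hop distance between two nodes of an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return None  # not connected

edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(shortest_hops(edges, "a", "d"))  # 3 on the toy chain; a hidden edge would shrink it
```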

Let us know how it goes :)
Best, Barbara

@dstaehler

dstaehler commented Apr 10, 2021

Hello Barbara,

Thanks for your response. I use cosine similarity for the comparison of the vectors. There may be (very likely are) connections of the form A -> C and C -> D somewhere in the dataset. But for the examples where the strange effect occurs, I checked the near vector space with a shortest-path scan of the whole dataset. The result: e.g. two nodes whose shortest path is 3 edges yield a higher cosine similarity than a node pair whose shortest path is 1 edge. This occurs for almost all examples I checked.

What I haven't done so far is take the node degree into account, since that would require a full degree scan of the dataset upfront; I skipped it due to the scanning complexity. I also checked reducing n (n = 2): same effect. And I tried increasing the dimensions as well (d = 512). I didn't try 1024, but 512 produced the same results. So far I'm a bit lost as to what else I could do.
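For reference, the degree scan I mean would be a single pass over the edge-list file, something like this sketch (I have not run it on the full dataset):

```python
from collections import Counter

def degree_scan(lines):
    """One pass over a whitespace-separated edge list; counts every node's degree."""
    degrees = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = parts
        degrees[u] += 1
        degrees[v] += 1
    return degrees

# Toy chain a-b-c-d; in practice the lines would come from the edge-list file.
print(degree_scan(["a b", "b c", "c d"]))  # b and c have degree 2, a and d degree 1
```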

best

Doug

@barbara3430

Dear Doug,
Could you possibly provide us with a slice of your data where this happens (anonymized if privacy is a concern)? Even a very small one (dozens to hundreds of lines) would help.

Another possibility is that your graph is bipartite. We have a discussion about such graphs, with an explanation of what to do, here: #29
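In case it's useful, bipartiteness can be checked with a simple 2-coloring BFS over the edge list (a sketch under the same whitespace-separated, undirected edge-list assumption):

```python
from collections import defaultdict, deque

def is_bipartite(edges):
    """2-coloring BFS: True iff no edge joins two same-colored nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for start in adj:          # handle every connected component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in color:
                    color[nbr] = 1 - color[node]
                    queue.append(nbr)
                elif color[nbr] == color[node]:
                    return False  # odd cycle found
    return True

print(is_bipartite([("a", "b"), ("b", "c"), ("c", "d")]))  # True: a chain is bipartite
print(is_bipartite([("a", "b"), ("b", "c"), ("c", "a")]))  # False: triangle (odd cycle)
```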

Best!
Barbara

@dstaehler

Hello Barbara,

Thanks for your reply and for the offer to have a look at the dataset. The graph is not bipartite: all the nodes belong to the same entity type. Regarding the example dataset, I'll drop you an email.

best

Doug

@piobab piobab closed this as completed Mar 17, 2022