Interesting phenomenon #28

Closed
dstaehler opened this issue Apr 1, 2021 · 4 comments

Comments

@dstaehler

Hello Cleora team, this is a very interesting and clever solution for creating embeddings. However, I have noticed a behavior that I cannot explain. When creating embeddings with one column (a single node category) that contains both the start and the end node of each edge (a simple edge list), nodes that are further away from each other in the graph get vectors that are closer to each other, e.g.: (a) -> (b) -> (c) -> (d)

as Edge List:
a b
b c
c d

The vectors a and d are closer together than the vectors a and b (by cosine similarity).
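For clarity, by cosine similarity I mean the standard definition; a minimal pure-Python sketch (the toy vectors are made up for illustration, not real Cleora embeddings):

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors, 45 degrees apart:
a = [1.0, 0.0]
b = [1.0, 1.0]
print(cosine_similarity(a, b))  # 0.7071... (i.e. 1/sqrt(2))
```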

Volume: approx. 5.5 million nodes and 41 million edges

I created the embeddings with the following call:

--columns='complex::reflexive::nodes' -d=128 -i='node.edgelist' -n=4

As I understand the pattern, a reflexive relationship in a column of a single type (complex) should cover an edge list over one node category. Am I doing something wrong in the configuration, or is this an issue?

A short tip would be much appreciated.

Best

@barbara3430

Hello @dstaehler! Your method of running Cleora seems fine. To investigate this phenomenon, let's do the following:

  • please double-check that you're using cosine similarity for the distance computation
  • if yes, this behavior seems strange. Are you sure there are no hidden cycles in your data? E.g. node A could be linked to D further along in the dataset by a direct or at least closer connection.
  • you could try a smaller -n. With a -n that is too large, all vectors eventually become very similar to each other. We usually use an -n of 1 to 4 and a larger -d, usually 1024.
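To rule out the second point quickly, a small BFS over the edge list can report the actual hop distance between a suspicious pair; a rough sketch, assuming the whitespace-separated, undirected edge list from above (node names are placeholders):

```python
from collections import defaultdict, deque

def shortest_hops(edges, source, target):
    """BFS hop distance between two nodes of an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return None  # not connected

edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(shortest_hops(edges, "a", "d"))  # 3 on the toy chain; a hidden edge would shrink it
```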

Let us know how it goes :)
Best, Barbara

@dstaehler

dstaehler commented Apr 10, 2021

Hello Barbara,

Thanks for your response. I use cosine similarity for the comparison of the vectors. There may be (very likely are) connections of the form A -> C and C -> D somewhere in the dataset. But for the examples where the strange effect occurs, I checked the near vector space with a shortest-path scan of the whole dataset. The result: e.g. two nodes whose shortest path is 3 edges yield a higher cosine similarity than a node pair whose shortest path is 1 edge. This occurs for almost all examples I checked.

What I haven't done so far is take the node degree into account, since that would require a full degree scan of the dataset upfront; I skipped it due to the scanning complexity. I also checked reducing n (n = 2): same effect. And I tried increasing the dimensions as well (d = 512). I didn't try 1024, but 512 produced the same results. So far I'm a bit lost as to what else I could do.
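For reference, the degree scan I mean would be a single pass over the edge-list file, something like this sketch (I have not run it on the full dataset):

```python
from collections import Counter

def degree_scan(lines):
    """One pass over a whitespace-separated edge list; counts every node's degree."""
    degrees = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = parts
        degrees[u] += 1
        degrees[v] += 1
    return degrees

# Toy chain a-b-c-d; in practice the lines would come from the edge-list file.
print(degree_scan(["a b", "b c", "c d"]))  # b and c have degree 2, a and d degree 1
```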

best

Doug

@barbara3430

Dear Doug,
Could you possibly provide us with a slice of your data where this happens (anonymized if privacy is a concern)? Even a very small one (dozens to hundreds of lines) would help.

Another possibility is that your graph is bipartite. We have a discussion about such graphs, with an explanation of what to do, here: #29
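In case it's useful, bipartiteness can be checked with a simple 2-coloring BFS over the edge list (a sketch under the same whitespace-separated, undirected edge-list assumption):

```python
from collections import defaultdict, deque

def is_bipartite(edges):
    """2-coloring BFS: True iff no edge joins two same-colored nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for start in adj:          # handle every connected component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in color:
                    color[nbr] = 1 - color[node]
                    queue.append(nbr)
                elif color[nbr] == color[node]:
                    return False  # odd cycle found
    return True

print(is_bipartite([("a", "b"), ("b", "c"), ("c", "d")]))  # True: a chain is bipartite
print(is_bipartite([("a", "b"), ("b", "c"), ("c", "a")]))  # False: triangle (odd cycle)
```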

Best!
Barbara

@dstaehler

Hello Barbara,

Thanks for your reply and for the offer to have a look at the dataset. The graph is not bipartite: all the nodes belong to the same entity type. Regarding the example dataset, I'll drop you an email.

best

Doug

@piobab piobab closed this as completed Mar 17, 2022