Validate path connections #6041

EDsCODE · 2021-09-20T19:51:52Z

Bug description

Please describe.
When calculating paths, we aggregate a lot of node and link data, often much more than would be reasonable to display. We apply a simple limit in our queries to cap the number of links we account for. As a result, we sometimes cut off links that other links depend on. For example, there might be links that go $pageview -> insight viewed -> viewed dashboard, but our limit cuts off the data for $pageview -> insight viewed.
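A minimal sketch of how a plain limit can strand a link. The edge tuples, the step-number prefixes, and both helper names are assumptions made for illustration; they are not taken from the actual query code.

```python
# Hypothetical edges as (source, target, weight). The "1_", "2_" prefixes are an
# assumed way to mark the step at which an event occurs.
edges = [
    ("1_$pageview", "2_insight viewed", 40),
    ("1_$pageview", "2_$autocapture", 90),
    ("2_insight viewed", "3_viewed dashboard", 55),
    ("2_$autocapture", "3_$pageleave", 80),
]


def top_edges(edges, limit):
    """Keep the `limit` heaviest edges, mimicking ORDER BY weight DESC LIMIT n."""
    return sorted(edges, key=lambda edge: edge[2], reverse=True)[:limit]


def stranded_edges(edges, first_step_prefix="1_"):
    """Return edges whose source is not a first-step node and has no incoming edge.

    This only checks the immediate parent, so it is a detector for the symptom,
    not a full reachability check.
    """
    targets = {target for _, target, _ in edges}
    return [
        edge
        for edge in edges
        if not edge[0].startswith(first_step_prefix) and edge[0] not in targets
    ]


kept = top_edges(edges, limit=3)
print(stranded_edges(kept))
```

With a limit of 3, the weight-40 link from $pageview to insight viewed is cut, so the insight viewed -> viewed dashboard link is left without a first step.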

If this affects the front-end, screenshots would be of great help.

Expected behavior

The limited data we return should be complete so that paths aren't stranded.

How to reproduce

Internal graph link here

Notice how a start point is defined, but the visualization shows start points that are unrelated to it. The Sankey is rendering stranded links that start at the 2nd or 3rd step but have no 1st step.

Thank you for your bug report – we love squashing them!

neilkakkar · 2021-09-21T12:41:26Z

I was exploring if we can find smarter defaults, instead of trying to validate graphs.

https://metabase.posthog.net/question/143

So, I analyzed data for 1 month, and over all paths for PostHog, we have ~28,000 edges. It's very much a power law distribution. Average edge weight is ~7, but 95%ile is 11, 99.5%ile is 160, 99.9%ile is 1000.

So, we can't return all the data when there's no start or end point.

Out of these, there are ~7,500 starting points. Edit: ~4000 unique starting points

Next up, I'll look into the start points here^ to figure out the same data for them, and if it reduces the sample space enough, I think we could have a high enough default for start and end points.

neilkakkar · 2021-09-21T13:21:44Z

Hmm, things aren't looking too great. Chose the largest ones, and then a few random ones:

Event	Edges	Avg weight per edge	Quantiles: 10%, 20%, 30%, 50%, 80%, 95%, 99.5%, 99.9%
$autocapture	19447	5.1789993315164295	[1,1,1,1,2,9,132,685.349000000012]
$identify	17083	6.739038810513376	[1,1,1,1,3,13,149.3150000000005,670.439000000014]
https://posthog.com	1427	28.142256482130342	[1,1,1,1,5,48.700000000000045,1207.5799999999745,3464.036000000001]
user logged in	2179	4.343276732446077	[1,1,1,1,2,4,127.55000000000064,597.5500000000029]
$capture_failed_request	4723	3.8816430235020114	[1,1,1,1,2,7,106.78000000000065,360.6680000000015]
https://posthog.com/careers	374	11.366310160427808	[1,1,1,1,4,31.049999999999898,602.9449999999999,866.697000000004]
timezone component viewed	3097	2.2156926057474977	[1,1,1,1,1,4,32.51999999999998,78.80799999999999]
Palette shown	654	2.02	[1.0,1.0,1.0,1.0,1.0,4.0,19.20500000000004,106.03399999999556]
redacted	41	5.853658536585366	[1,1,2,3,6,15,51.79999999999991,56.760000000000026]
redacted	3	1	[1,1,1,1,1,1,1,1]
redacted	70	2.414285714285714	[1,1,1,1,2.200000000000003,8.549999999999997,25.44500000000002,30.68899999999995]
redacted	13	2.4615384615384617	[1,1,1,2,4,5.199999999999996,6.8199999999999985,6.963999999999999]
https://app.posthog.com'	264	31.348484848484848	[1,1,1,2,14.400000000000006,147.09999999999997,764.4100000000005,1261.515000000014]

This one fans out a LOT: link - there's supposed to be 374 edges starting at /careers/ but yeah, 50% of those have literally just one event.

Same for this one.

These are probably the worst of worst cases^^^. But assuming 10x bigger company, they start looking more reasonable.

neilkakkar · 2021-09-22T13:37:22Z

Another idea: We can leverage the shape of the distribution. Instead of having a limit on number of edges, how about we have a limit on [absolute|relative] edge weights?

So something like, if the number of people who did edge A->B is less than 1% / 25 people, discard the edge from final result. It doesn't directly solve the above problem, but gives us better initial data to work with: If we have N edges with say, the same weight, and half of those fall in the limit, it would be a shame to discard the other half, as they're as useful (or useless) as the first half.

... And then we can consider completing this graph^^.
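A rough sketch of the weight-threshold idea, assuming edges are (source, target, weight) tuples. The function name, its parameters, and the sample data are hypothetical; the 25-person and 1% figures are the ones floated above.

```python
def filter_by_weight(edges, min_weight=None, min_share=None, total_people=None):
    """Drop edges below an absolute floor, a relative floor, or both.

    min_weight is an absolute person count; min_share is a fraction of
    total_people. When both are given, the stricter threshold applies.
    """
    threshold = 0.0
    if min_weight is not None:
        threshold = max(threshold, min_weight)
    if min_share is not None:
        if total_people is None:
            raise ValueError("total_people is required when min_share is set")
        threshold = max(threshold, min_share * total_people)
    return [edge for edge in edges if edge[2] >= threshold]


# Four edges share the same weight; an edge-count limit of 3 would keep
# only two of them, chosen arbitrarily.
edges = [
    ("A", "B", 1000),
    ("A", "C", 30),
    ("A", "D", 30),
    ("A", "E", 30),
    ("A", "F", 30),
    ("A", "G", 2),
]

print(len(filter_by_weight(edges, min_weight=25)))
print(len(filter_by_weight(edges, min_share=0.01, total_people=5000)))
```

Unlike a cap on the number of edges, a weight floor treats equally weighted edges the same way: they are either all kept or all dropped.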

neilkakkar · 2021-09-23T10:56:58Z

Things turned a bit tricky. I'm exploring three different ways to solve this problem

Complete Dangling Edges

Ensures that, for whatever edges make it past the cut-off, the remaining edges needed to connect them are added.

Pro: Paths don't look wrong.

Cons: Some very low weight edges show up, which can be hard to visualise / fill graph with useless information.

Implementation notes: This makes things hard. Here's a failed attempt: https://metabase.posthog.net/question/150 that makes the wrong tradeoffs. We don't want to bound above the maximum edge weights. I suspect any solution that goes this way will significantly slow down our queries, since this requires some sort of graph traversal.

Still trying to figure out if there's a better way around to solve this.
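For illustration only, here is one way completion could look outside of SQL. It assumes the full, unlimited edge list is available in memory and that nodes carry step prefixes so the graph has no cycles; the function name and the "heaviest incoming edge" rule are assumptions, not the approach in the linked query.

```python
def complete_dangling_edges(kept, all_edges, start_nodes):
    """Walk upstream from every dangling source and add its heaviest incoming edge."""
    # Heaviest incoming edge for each node in the full data set.
    best_incoming = {}
    for edge in all_edges:
        target, weight = edge[1], edge[2]
        if target not in best_incoming or weight > best_incoming[target][2]:
            best_incoming[target] = edge

    result = list(kept)
    has_incoming = {target for _, target, _ in result} | set(start_nodes)
    pending = [source for source, _, _ in result]
    while pending:
        node = pending.pop()
        if node in has_incoming:
            continue
        edge = best_incoming.get(node)
        if edge is None:
            continue  # nothing upstream exists in the full data
        result.append(edge)
        has_incoming.add(node)
        pending.append(edge[0])  # the added edge's source may dangle too
    return result


all_edges = [
    ("1_$pageview", "2_insight viewed", 40),
    ("1_$pageview", "2_$autocapture", 90),
    ("2_insight viewed", "3_viewed dashboard", 55),
    ("2_$autocapture", "3_$pageleave", 80),
]
kept = sorted(all_edges, key=lambda edge: edge[2], reverse=True)[:3]
completed = complete_dangling_edges(kept, all_edges, start_nodes=["1_$pageview"])
print(completed)
```

The edge that gets pulled back in is the lightest one in the sample, which is the trade-off listed under Cons: completion can surface low-weight edges.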

Delete Dangling Edges

Ensures we validate edges before returning, so no dangling edges remain.

Pro: Paths don't look wrong.

Cons: Some high weight edges are removed, which might be carrying useful information.

Implementation notes: Relatively straightforward to do outside of SQL, and up to ~1000 edges it has a negligible effect on performance.
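A minimal sketch of deleting dangling edges outside of SQL, assuming a start point is defined and edges are (source, target, weight) tuples. The function name and sample data are hypothetical, and this is not the code from the eventual fix. The end-point case would need the same traversal run backwards and is not shown.

```python
from collections import defaultdict, deque


def remove_dangling_edges(edges, start_nodes):
    """Keep only edges whose source is reachable from one of the start nodes."""
    outgoing = defaultdict(list)
    for source, target, _ in edges:
        outgoing[source].append(target)

    # Breadth-first search from the start nodes over the kept edges.
    reachable = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for target in outgoing[node]:
            if target not in reachable:
                reachable.add(target)
                queue.append(target)

    # Preserve the original ordering of the surviving edges.
    return [edge for edge in edges if edge[0] in reachable]


limited = [
    ("1_$pageview", "2_$autocapture", 90),
    ("2_$autocapture", "3_$pageleave", 80),
    ("2_insight viewed", "3_viewed dashboard", 55),  # its first step was cut off
]
print(remove_dangling_edges(limited, start_nodes=["1_$pageview"]))
```

The traversal visits each node and edge once, so the cost grows linearly with the number of edges returned by the query.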

Defer control to users

The crux between Solution (1) and (2) is the amount of information we show. Depending on the case, it can be useful to see the extra information, going further in depth, and in other cases, better to get rid of all low weight nodes.

More importantly, it's hard to get the visualisation just right, a priori.

This solution gives users these advanced manipulation options. There are two controls:

(a) Control maximum number of edges. And
(b) Control minimum (& max) edge weight. "Make the graph display whatever you want it to display".

Note: The edge weight represents the number of people on that path.
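The two controls could combine roughly as follows. This is a sketch under assumptions: the function and parameter names are made up, edges are (source, target, weight) tuples, and the default of 100 edges is only the figure discussed in this thread.

```python
def apply_path_controls(edges, edge_limit=100, min_weight=None, max_weight=None):
    """Filter edges by weight bounds first, then cap how many are returned."""
    selected = [
        edge
        for edge in edges
        if (min_weight is None or edge[2] >= min_weight)
        and (max_weight is None or edge[2] <= max_weight)
    ]
    # Heaviest edges first, so the cap drops the least travelled ones.
    selected.sort(key=lambda edge: edge[2], reverse=True)
    return selected[:edge_limit]


edges = [
    ("A", "B", 900),
    ("B", "C", 250),
    ("B", "D", 150),
    ("C", "E", 120),
    ("D", "F", 8),
]

# Focus on moderately popular paths only.
print(apply_path_controls(edges, min_weight=100, max_weight=300))
# Same bounds, but a sparser graph.
print(apply_path_controls(edges, edge_limit=2, min_weight=100, max_weight=300))
```

Applying the weight bounds before the cap means the edge limit is spent only on edges inside the range the user asked for.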

Overall Solution

The overall solution I'm leaning towards right now is: based on the above calculations, provide more meaningful defaults. The 95%ile has edge weights ~10 for the more popular cases, which translates to ~100 edges. Make these the default. (that's 5x more edges than right now, and increase these further if steps go above 5)

This won't solve the graph looking weird in some cases, so delete dangling nodes (only when a start or end point is defined), and tell the user that the graph is incomplete. And encourage using the advanced options I mentioned above for getting more in-depth information.

I think I need to play with these advanced options myself to figure out if there are heuristics we can find for even better default values.

It does make things a bit more complicated for users, but hopefully most users are happy with the default.

I do think it's important to allow this customization so users can drill down and up the graph. It gives a new dimension: allows not just number of step manipulation, but things like, "oh, I notice this specific segment of slightly unpopular paths (~200 edge weight) seem weird. Let me set edge weights between 100 and 300 to explore these more in depth, then find the specific people doing this, and see if I can figure out why they're doing things like this" etc. etc.

Whether it's worth doing is an open question, I guess.

cc: @paolodamico @clarkus @marcushyett-ph @EDsCODE @liyiy for more input :)

marcushyett-ph · 2021-09-23T14:57:36Z

I'll let @paolodamico and @clarkus chime in as they have the most context. But generally providing the best defaults we can sounds like a good approach to me.

I have a question related specifically to the terminology used, edge weight etc. Do we have a more user friendly term in mind for how to describe this? As it feels pretty technical and might be hard for users to adopt.

neilkakkar · 2021-09-23T15:04:47Z

Definitely. It's the "count of users on a path". So, min edge weight is something like: "Minimum count of users on a Path"

paolodamico · 2021-09-27T04:14:24Z

Hey @neilkakkar! In general, 100% agree with the overall approach of sensible defaults and advanced customization. Some questions,

Can you clarify what you mean by deleting dangling edges?
Can you clarify how edge weight would be understood from a user's perspective? Would I let you know how many min/max users should be on a path? Would the number of users be counted from the root step?
Would users need/can control both (a) & (b)?
I'm having a hard time understanding a user's reason for wanting to control the maximum weight. Wouldn't you always want to see paths with larger weights?
Unless it's not technically complex, I think we could start just with sensible defaults and then get feedback from users to better understand what the advanced controls should be and how they should behave. I'm thinking we'll get to a significantly better state by getting a lot of real data points from users.

neilkakkar · 2021-09-27T09:29:00Z

Dangling edges: Link

Notice how "/events 38" goes further ahead than the next row, and same for "/events 30". In the first place, if these were the first event that happened, shouldn't they have been a part of "/events 138" ?

Leaving them alone makes the graph look wrong.

This is same as the issue in the original link in the issue: The start point is gone, these are intermediate edges, and thus dangling. Deleting dangling edges means getting rid of these in the final visualisation.

Can you clarify how edge weight would be understood from a user's perspective? Would I let you know how many min/max users should be on a path? Would the number of users be counted from the root step?

Edge weight is indeed the count of users on a single path item (I'm not yet sure of the right terminology to use, judging by the confusion exposed on the PR). It doesn't mean the number of people on the entire path, but the number between any two consecutive path items.

Would users need/can control both (a) & (b)?

Good question. We could remove some controls, but removing any of these feels incomplete to me. Since: (a) No. of edges controls how dense the graph gets. (b) Min-Max controls what kind of edges show up.

I'm having a hard time understanding a user's reason for wanting to control the maximum weight. Wouldn't you always want to see paths with larger weights?

Say you're interested in where people drop off, and say it's a very successful product: most people convert. (Or vice versa, the case is identical.) Most path items on the happy path then have a high weight, and these are the ones you don't want to see, since they are noise. Setting a max weight effectively removes all of them, and helps you visualise where the dropoffs really go.

Something similar can be achieved with excluding the popular events, but it's not the same, since you want to know if these "dropoffs" take some other route to the popular events. (Max weight would remove the popular paths, but not the small weight traversals to the popular items).

It made sense to me, but it's 100% an advanced use case - and not very obvious. But since these are advanced features anyway....

Unless it's not technically complex, I think we could start just with sensible defaults and then get feedback from users to better understand what the advanced controls should be and how they should behave. I'm thinking we'll get to a significantly better state by getting a lot of real data points from users.

As in, don't show any advanced options at all? Or just show them populated with defaults?

The latter makes sense to me. The former not so much, because then the users wouldn't know how to control these advanced options at all?

100% agreed on getting real data points.

clarkus · 2021-09-27T15:00:02Z

The weight concept is new to me, so I'm catching up a bit on this. It seems like this might be the core reason a user would want to adjust weight for a paths insight:

Say you're interested in where people drop off, and say it's a very successful product: most people convert. (Or vice versa, the case is identical.) Most path items on the happy path then have a high weight, and these are the ones you don't want to see, since they are noise. Setting a max weight effectively removes all of them, and helps you visualise where the dropoffs really go.

Setting a maximum weight can optimize for analyzing dropoffs. Is the converse true for minimum weights? If so, that might be a good way to communicate the value of the feature to users. I think defaults make a ton of sense for this, but maybe there's some easy mode where the user just selects an "optimize for dropoffs" control or something similar?

neilkakkar · 2021-09-30T11:45:22Z

If you want to play around with edge weights, they're behind the new-paths-ui-edge-weights Feature Flag. Very interesting to play around with

neilkakkar · 2021-10-01T15:44:53Z

Validation is done, so I'll close this.

posthog-contributions-bot · 2021-10-01T15:44:54Z

This issue has 1909 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

Write some code and submit a pull request! Code wins arguments
Have a sync meeting to reach a conclusion
Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

EDsCODE added the bug and team-core-analytics labels Sep 20, 2021

EDsCODE assigned neilkakkar and EDsCODE Sep 20, 2021

EDsCODE added the feature/paths label Sep 20, 2021

neilkakkar mentioned this issue Sep 24, 2021

Advanced User Controls and Proper Path Validation #6098

Closed

6 tasks

neilkakkar mentioned this issue Sep 28, 2021

Remove dangling edges from Paths #6142

Merged

neilkakkar closed this as completed Oct 1, 2021
