This repository has been archived by the owner on Feb 21, 2024. It is now read-only.

Research ability to support very high slices without allocating for intermediate slices #593

Closed
jaffee opened this issue May 31, 2017 · 5 comments

@jaffee
Member

jaffee commented May 31, 2017

Description

In #591 we see that Pilosa can't handle extremely high column IDs, even when most of the intermediate columns are unused.

Success criteria (What criteria will consider this ticket closeable?)

Either we've changed the way that Pilosa tracks slices to allow for this kind of behavior, or we've decided to explicitly not support it, documented it, and returned reasonable errors to clients who try to do it (which #591 should account for).

@benbjohnson benbjohnson self-assigned this Oct 31, 2017
@benbjohnson
Contributor

I tried digging into this and #591, but there are several issues I've run into.

First, we can get around the initial `inverseSlices` panic by changing `slices []uint64` to a generator (e.g. `slices func() uint64`) that can stream all slice numbers instead of building a slice of slice numbers.
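Roughly what I have in mind, as a minimal sketch (the names are made up, not the actual executor code):

```go
package main

import "fmt"

// sliceGenerator streams slice numbers in [0, maxSlice] one at a time,
// so we never materialize a huge []uint64 of slice numbers.
// Illustrative only; not the real inverseSlices code.
func sliceGenerator(maxSlice uint64) func() (uint64, bool) {
	next := uint64(0)
	return func() (uint64, bool) {
		if next > maxSlice {
			return 0, false
		}
		s := next
		next++
		return s, true
	}
}

func main() {
	gen := sliceGenerator(3)
	for s, ok := gen(); ok; s, ok = gen() {
		fmt.Println(s) // 0, 1, 2, 3
	}
}
```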

However, that still leaves us with 5,739,353,673,740 slices to process. We don't track which slices have data AFAIK so we'd have to check each one.

One option would be reducing the number of slices by increasing the slice width. We're currently at 1,048,576, so we'd have to increase it by quite a bit. But even if we increased the slice width to MaxUint32, we'd still have 1,401,209,393 slices.
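For reference, those counts are just integer division of the column ID from #591 by the slice width:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Column ID from #591.
	const columnID uint64 = 6018148517796380732

	fmt.Println(columnID / (1 << 20))      // current 2^20 width: ~5,739,353,673,740 slices
	fmt.Println(columnID / math.MaxUint32) // MaxUint32 width: ~1,401,209,393 slices
}
```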

The only option I can think of that would make any sense would be to provide a translation table for `columnID`, so that a column ID like 6018148517796380732 could be stored as 1 internally. Maybe the translation table could use strings so we could support hexadecimal UUIDs as well. That way we could accept the user's value as "6018148517796380732" and treat it as just an opaque string.
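To make the idea concrete, here's a bare-bones in-memory sketch (hypothetical names, ignoring persistence and distribution entirely):

```go
package main

import "fmt"

// keyTranslator maps opaque string keys (e.g. "6018148517796380732" or a
// hex UUID) to small, dense internal column IDs. Illustration only.
type keyTranslator struct {
	toID  map[string]uint64
	toKey []string
}

func newKeyTranslator() *keyTranslator {
	return &keyTranslator{toID: make(map[string]uint64)}
}

// TranslateKey returns the internal ID for key, assigning the next
// dense ID if the key hasn't been seen before.
func (t *keyTranslator) TranslateKey(key string) uint64 {
	if id, ok := t.toID[key]; ok {
		return id
	}
	id := uint64(len(t.toKey))
	t.toID[key] = id
	t.toKey = append(t.toKey, key)
	return id
}

// TranslateID returns the original key for an internal ID.
func (t *keyTranslator) TranslateID(id uint64) string {
	return t.toKey[id]
}

func main() {
	tr := newKeyTranslator()
	fmt.Println(tr.TranslateKey("6018148517796380732")) // 0
	fmt.Println(tr.TranslateKey("deadbeef-cafe-uuid"))  // 1
	fmt.Println(tr.TranslateID(1))                      // deadbeef-cafe-uuid
}
```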

@travisturner What do you think?

@jaffee
Member Author

jaffee commented Nov 10, 2017

@benbjohnson being able to do that translation (for rows as well as columns, actually) would be extremely valuable. This is definitely something we've discussed (at length), but we haven't been confident enough to do it. There are a lot of finicky bits (pun intended?) in the distributed case, and the translation needs to be fast in both directions to support ingestion and queries.

For this particular issue, I was wondering if we could do something similar to the way we track which containers are in use in roaring, with the slice of "keys" (see `Bitmap.insertAt` to jog your memory). I can tell, though, that we'll probably have to alter the query execution and fanout logic fairly significantly (or have strong consensus about which slices exist across the cluster). It may be that doing it this way isn't much easier than translating opaque strings.
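For what it's worth, here's roughly what I'm picturing, in the spirit of roaring's sorted container keys (illustrative only, not the real Bitmap code):

```go
package main

import (
	"fmt"
	"sort"
)

// sliceSet tracks which slices contain data using a sorted []uint64 of
// slice numbers, analogous to how roaring tracks container keys.
type sliceSet struct {
	keys []uint64
}

// Add marks a slice as containing data, keeping keys sorted.
func (s *sliceSet) Add(slice uint64) {
	i := sort.Search(len(s.keys), func(i int) bool { return s.keys[i] >= slice })
	if i < len(s.keys) && s.keys[i] == slice {
		return // already present
	}
	s.keys = append(s.keys, 0)
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = slice
}

// Slices returns only the slices that actually hold data, so query
// fanout can skip the (possibly trillions of) empty ones.
func (s *sliceSet) Slices() []uint64 { return s.keys }

func main() {
	var s sliceSet
	s.Add(5739353673740)
	s.Add(0)
	s.Add(5739353673740)    // duplicate, ignored
	fmt.Println(s.Slices()) // [0 5739353673740]
}
```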

TLDR; if you think doing that translation layer is feasible, I'd love to hear more.

@travisturner
Member

travisturner commented Nov 10, 2017

@benbjohnson A translation table like you suggested sounds great, and we've talked about something like that for a while, but we have yet to come up with a solution that isn't too complex. There are several concerns, a couple of which are:

  • ensuring that the lookup doesn't (too badly) affect performance
  • having the translation table available on all nodes (for example, what happens when two nodes receive a setbit to the same column key?)

If you have an idea for how we might approach this in a reasonable way, it would be great to hear it.

Alternatively, would it be easier for us to simply keep track of which slices contain data? Would that help with this particular problem?

@benbjohnson
Contributor

If we limited keys to 32-byte strings, then we're looking at 32GB per billion keys. A billion keys would also only span ~950 slices. Are we expecting higher cardinality than that?
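Back-of-the-envelope for those numbers, assuming 32-byte keys and the current 2^20 slice width:

```go
package main

import "fmt"

func main() {
	const (
		keyBytes   = 32         // assumed fixed-size key
		keys       = 1000000000 // one billion keys
		sliceWidth = 1 << 20    // current slice width
	)
	fmt.Println(keyBytes*keys/1000000000, "GB of raw key data") // 32 GB
	fmt.Println(keys/sliceWidth, "slices")                      // ~953 slices
}
```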

Keeping the whole translation table on each node would be fastest, since a node wouldn't have to do an extra hop to translate. However, syncing 32GB to every node wouldn't be fun.

Sharding the translation table would mean that each query would first have to go to the key's owner node before executing. That seems like it's mostly an issue on ingest, since read queries would be aggregations and generally wouldn't use specific keys, right?
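By "owner" I mean something like hashing the key to pick the node that stores its translation entry (hypothetical helper, not Pilosa's actual cluster partitioning):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerNode hashes a key to pick the node responsible for its
// translation entry. Hypothetical sketch only.
func ownerNode(key string, nodes []string) string {
	h := fnv.New64a()
	h.Write([]byte(key))
	return nodes[h.Sum64()%uint64(len(nodes))]
}

func main() {
	nodes := []string{"node0", "node1", "node2"}
	fmt.Println(ownerNode("6018148517796380732", nodes))
}
```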

Another option would be to have a translation service in front of the Pilosa cluster. It could be a small, strongly consistent cluster whose only job would be to add new translation keys, rewrite queries going to the cluster, and rewrite results coming back from the cluster (if there are keys in the results). It could live independently of the primary cluster and only be required if translation is used.

@jaffee
Member Author

jaffee commented Aug 10, 2018

Closing this since it seems to be covered by #1008.

@jaffee jaffee closed this as completed Aug 10, 2018