Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non deterministic filter results #64

Open
james-bowers opened this issue Jun 4, 2022 · 6 comments
Open

Non deterministic filter results #64

james-bowers opened this issue Jun 4, 2022 · 6 comments

Comments

@james-bowers
Copy link

james-bowers commented Jun 4, 2022

Hey, it's me again

I've stumbled across a case where my filtering logic sometimes returns the correct result; a count of 5, and sometimes returns a count of 0.

I've created a gist to show the logic and the test I'm running
https://gist.github.com/james-bowers/55722b1cf2cb60d88093a3051c7e0c20

I'm wondering if there's something I'm doing wrong in the checkFilters function 🤔 as in the readme examples, the "where" clause is achieved by creating various indexes. However I'm confused as to why my approach sometimes works, and sometimes doesn't.

Any light you can shed on the issue would be greatly appreciated.

If the recommended approach is to always create an index for each filter, what overhead is there in doing this?

Thanks so much! 🙏🏼

@Dreeseaw
Copy link
Collaborator

Dreeseaw commented Jun 4, 2022

Hello,

Funny enough, this problem doesn't originate from the filtering logic in WithValue(), but in the column creation in (referring to your gist link) engine.go line 41. When the returned error is checked, you'll see it panics every time, as a result of players.json containing an inner-struct field. For using this data, you can look at

func createCollection(out *column.Collection, amount int) *column.Collection {
for reference.

However, for loops ranging over maps in Go are indeterministic when it comes to order. Within CreateColumnsOf, the given columns of an object are processed in "random" order, so sometimes the three columns referred to in your test example are created, and sometimes they aren't, depending on if they get created before the error or not. In the random times they are not created, the transaction will not find the column to load and therefore clear the internal index (resulting in counts of 0).

@james-bowers
Copy link
Author

james-bowers commented Jun 4, 2022

Thanks for responding so fast @Dreeseaw, it's much appreciated!

If I may pick your brains about two more things; Is it bad practice to chain multiple WithValue calls like I'm doing in the checkFilters function? Or is it best practice to always create an index for each filter. e.g:

func checkFilters(txn *column.Txn, filters []Filter) *column.Txn {
	for _, filter := range filters {
		txn = txn.WithValue(filter.Field, func(cellValue interface{}) bool {

			return compare.IsTruthy(cellValue, filter.Comparitor, filter.Value)
		})
	}

	return txn
}

from the same gist above ^

Secondly, am I right in thinking this library doesn't support ordering (either at Query time or by defining an index before a query runs)? If not, would this be a possible thing to achieve in a library like this or is it a limitation of the bitmap indexing or columnar architecture? I've looked at a few other libraries and the closest to ordering in an in-memory columnar database is arcticdb which orders records during insertion only, and once it's inserted doesn't allow any other ordering indexes to be created, or order to be defined in a query as far as I can tell.

I really appreciate your time on this, I'm new to both GoLang and columnar databases!

@Dreeseaw
Copy link
Collaborator

Dreeseaw commented Jun 5, 2022

@james-bowers Your approach of loop filtering looks good, and no, you probably shouldn't create a new index on each value in a column. The exception to this is a column that you'd be fine trading a memory hit for a performance gain, such as one with a known distinct number of values (like "class type" in the rouge example) that is used in a lot of your transactions.

To your second point, no, this library does not support any sort of "ORDER BY" statements (speaking in SQL). I'd recommend ranging through after gathering your filtered results, similar to what you see here.

@kelindar I would be interested in what you think about adding sort functionality, as wouldn't it require holding a lock across the entire column that's being sorted on? Or using an MVCC linked-list so that the values remain intact throughout the range function?

@kelindar
Copy link
Owner

kelindar commented Jun 7, 2022

@Dreeseaw indeed, sorting would be good to add as it's a fairy common operation. I would want to avoid going the MVCC route if possible, so what we could do instead is implement a special sorted index that allows users to define a sorting column and whenever a new value is inserted/deleted, we auto update the index.

@Dreeseaw
Copy link
Collaborator

@kelindar Hello, I've been thinking about this sorted index & wanted to try a base implementation. By creating a new SortedIndex type that internally stores a b-tree (probably google/btree) that updates respective to a column of a comparable type, we could expose txn.SortedRange() that simply (for now) iterates through all keys in order, reads the uint32 values, checks if they're present in the index, and proceeds to return the row index to the user.

Let me know what you think of this approach!

@kelindar
Copy link
Owner

@Dreeseaw I think it could be a good start. Not sure about google/btree specifically, would need to compare (I like https://github.com/tidwall/btree as an implementation, zero dependencies).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants