Fix collStats to return correct count of documents for SQLite #3363

Merged
4 commits merged into FerretDB:main on Sep 15, 2023

chilagrow
Contributor

@chilagrow chilagrow commented Sep 15, 2023

Description

Closes #3355.

Readiness checklist

  • I added/updated unit tests (and they pass).
  • I added/updated integration/compatibility tests (and they pass).
  • I added/updated comments and checked rendering.
  • I made spot refactorings.
  • I updated user documentation.
  • I ran task all, and it passed.
  • I ensured that PR title is good enough for the changelog.
  • (for maintainers only) I set Reviewers (@FerretDB/core), Milestone (Next), Labels, Project and project's Sprint fields.
  • I marked all done items in this checklist.

@chilagrow chilagrow added the code/bug Some user-visible feature works incorrectly label Sep 15, 2023
@chilagrow chilagrow added this to the Next milestone Sep 15, 2023
@chilagrow chilagrow self-assigned this Sep 15, 2023
@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #3363 (1e6dc92) into main (19f0906) will increase coverage by 0.00%.
The diff coverage is 75.00%.

@@           Coverage Diff           @@
##             main    #3363   +/-   ##
=======================================
  Coverage   74.07%   74.07%           
=======================================
  Files         413      413           
  Lines       25306    25315    +9     
=======================================
+ Hits        18746    18753    +7     
- Misses       5472     5473    +1     
- Partials     1088     1089    +1     
| Files Changed | Coverage |
|---|---|
| internal/backends/sqlite/database.go | 75.00% |

| Flag | Coverage Δ |
|---|---|
| hana-1 | ? |
| hana-2 | ? |
| hana-3 | ? |
| integration | 71.11% <75.00%> (+<0.01%) ⬆️ |
| mongodb-1 | 4.72% <0.00%> (-0.01%) ⬇️ |
| pg-1 | 43.81% <0.00%> (+0.92%) ⬆️ |
| pg-2 | 43.54% <0.00%> (-0.44%) ⬇️ |
| pg-3 | 44.10% <0.00%> (-0.82%) ⬇️ |
| sqlite-1 | 42.72% <75.00%> (+1.24%) ⬆️ |
| sqlite-2 | 42.11% <75.00%> (-0.75%) ⬇️ |
| sqlite-3 | 42.91% <75.00%> (-0.83%) ⬇️ |
| unit | 23.09% <75.00%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

@chilagrow
Contributor Author

When I run the same benchmark as in #3355 (comment), the count is correct.
avgObjSize increased from 198 to 204 because storageSize (table size) is still the size of the entire table, while the row count is now a better approximation.

$  task bench-sqlite BENCH_NAME=BenchmarkQuerySmallDocuments BENCH_DOCS=100
task: [bench-sqlite] go test -tags= -timeout=0 -run=XXX -count=10 -bench=BenchmarkQuerySmallDocuments -benchtime=5s -benchmem -log-level=error -bench-docs=100 -target-backend=ferretdb-sqlite -sqlite-url=file:../tmp/sqlite-tests/ | tee new-sqlite.txt

2023-09-15T12:32:16.471+0900    INFO    setup/startup.go:90     Target system: ferretdb-sqlite (built-in).
2023-09-15T12:32:16.471+0900    INFO    setup/startup.go:103    Compat system: none, compatibility tests will be skipped.
2023-09-15T12:32:16.471+0900    INFO    debug   debug/debug.go:92       Starting debug server on http://127.0.0.1:39451/
collStats [
{ns BenchmarkQuerySmallDocuments-SmallDocuments-Docs100-3947.BenchmarkQuerySmallDocuments-SmallDocuments-Docs100-3947} 
{size 20480} 
{count 100} 
{avgObjSize 204} 
{storageSize 20480} 
{nindexes 0} 
{totalIndexSize 0} 
{totalSize 20480} 
{scaleFactor 1} 
{ok 1}
]

@chilagrow chilagrow marked this pull request as ready for review September 15, 2023 03:45
@chilagrow chilagrow requested review from AlekSi and a team as code owners September 15, 2023 03:45
@chilagrow chilagrow requested review from rumyantseva, a team and noisersup September 15, 2023 03:45
@chilagrow chilagrow enabled auto-merge (squash) September 15, 2023 03:46
AlekSi
AlekSi previously approved these changes Sep 15, 2023
Member

@AlekSi AlekSi left a comment

LGTM, but we should understand that collStats's count field is an approximation. It might not be correct in MongoDB either. If we need to know the exact count, we should run count, not collStats.

internal/backends/sqlite/database.go (outdated review thread, resolved)
@chilagrow
Contributor Author

Yes, indeed. For MongoDB, as far as I can tell, there is no indication of whether the count is exact or not. For us, both pg and sqlite return approximations.

@AlekSi
Member

AlekSi commented Sep 15, 2023

The mongosh method is called estimatedDocumentCount (emphasis on "estimated"): https://www.mongodb.com/docs/manual/reference/method/db.collection.estimatedDocumentCount/#mongodb-method-db.collection.estimatedDocumentCount

https://www.mongodb.com/docs/manual/reference/operator/aggregation/collStats/ says:

The count is based on the collection's metadata, which provides a fast but sometimes inaccurate count for sharded clusters.

Deprecated https://www.mongodb.com/docs/manual/reference/command/collStats/ says:

After an unclean shutdown: validate updates the count statistic in the collStats output with the latest value.

So it looks like count does not actually count documents but returns the current value from the planner's metadata, which is an estimate.

@chilagrow
Contributor Author

Thanks for looking into this in detail 🤗!

Member

@rumyantseva rumyantseva left a comment

That's a helpful improvement! And I also like the test that helps us ensure the number is correct.

@chilagrow chilagrow merged commit 8c9b681 into FerretDB:main Sep 15, 2023
28 checks passed
Labels
code/bug Some user-visible feature works incorrectly
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

SQLite collStats returns inaccurate count of documents
3 participants