SlothDB - a parody of a database (disqualified)

This project https://github.com/SouravRoy-ETL/slothdb/ is aggressively promoted with paid GitHub stars and spam comments on the Internet.

Here is the verdict:

The names (binder, catalog, data_table, row_group, column_data, string_t, LogicalType, Vector, DataChunk, BoundExpression, ExpressionExecutor, ParallelResultCollector, "non-reserved keyword set matches DuckDB's"…) are 1:1 with DuckDB's. SlothDB is a DuckDB-shaped reimplementation: same source layout, same internal API names, same Parquet ingest shape (read_parquet(...), SELECT * REPLACE, epoch_ms, make_date).

Where the speed actually comes from

1. A 14k-line physical_planner.cpp that pattern-matches ClickBench query shapes and routes them to hand-rolled per-query handlers. The execution directory literally has files named for ClickBench query numbers:

```
  src/execution/q4_dict_histogram.cpp   - Q4   SELECT AVG(UserID) FROM hits
  src/execution/q6_string_dedup.cpp     - Q6   COUNT(DISTINCT SearchPhrase)
  src/execution/q10_agg.cpp             - Q10  RegionID GROUP BY with mixed aggs
  src/execution/q11_helper.cpp          - Q11  MobilePhoneModel COUNT(DISTINCT)
  src/execution/q21_helper.cpp          - Q21  URL LIKE '%google%' COUNT(*)
  src/execution/q26_helper.cpp          - Q26  ORDER BY SearchPhrase LIMIT N
  + static TryComputeQ14_2Stage()       - Q14  inside physical_planner.cpp
```

Each handler bakes in assumptions about this dataset:

- Q4 (q4_dict_histogram.cpp): "Quick gate: every RG must have a dict page on this column" - computes SUM(UserID) and COUNT(*) by walking only the Parquet dictionary page of each row group, never touching the
  data pages. Works because hits.parquet is dictionary-encoded.
- Q10 (q10_agg.cpp): slot_of is a flat dense array indexed by RegionID - // RegionID is a small dense-ish UInt32 (Q10: 9040 distinct in [0, 131069]); a direct array beats a hash probe on the 100M-row per-row hot
   path.
- Q21/Q22/Q23 (q21_helper.cpp and physical_planner.cpp lines 475/503): specialized 4-byte / 8-byte needle substring search - loads the first 4 or 8 bytes of the needle as a uint32/uint64 literal and compares
  with one cmp. The comments explicitly say // Q22 "google" hot path and // Q23 ".google." 8-byte needle.
- Q26 (q26_helper.cpp): "Dict-trust" variant of ORDER BY VARCHAR LIMIT - extracts top-K from the Parquet dictionary alone, skipping the data scan. Comment: // For ClickBench hits.parquet SearchPhrase, dict has 
  no orphans - verified empirically. Toggled by SLOTH_Q26_TRUST_DICT.
- Q14: a dedicated TryComputeQ14_2Stage two-stage COUNT(DISTINCT) path.
- Numerous shape-specific comments throughout physical_planner: // Q22 URL falls here, // Q31/Q32 hot path, // Q35: 9.76 M groups, // Q27 ORDER BY EventTime, SearchPhrase LIMIT 10, etc.

2. The "generic" path is markedly slower than ClickHouse (and almost anything else, including DuckDB). Look at `src/execution/parallel_executor.cpp::ParallelScanFilter` - it builds rows with `chunk.GetValue(col, i)` per cell, then `SetValue(col, result_count, ...)`, into a std::vector<std::vector<Value>> under a mutex. That's row-at-a-time Value boxing - orders of magnitude slower than DuckDB's vectorised pipeline. The "wins" only show up when a query matches one of the recognised shapes and bypasses this path entirely.
  
3. Optimizer is three passes. optimizer.cpp only does ConstantFolding, FilterPushdown, TopNOptimization. There's no proper cost model, join reorder, or general aggregate fusion - the planner relies on the per-query handlers to win, and the comments admit it: Q31/Q32 (two-column high-cardinality GROUP BY without a custom handler) are noted as the project's weak spot (the README cites 0.29× and 0.12× of DuckDB).

Summary
  
The architecture is a DuckDB-clone shell (parser/binder/planner/catalog/storage layered identically) with a flat, three-pass optimiser and a generic execution path that's slower than DuckDB, plus a 14k-line physical planner whose job is to recognise ClickBench query shapes and dispatch to per-query handlers that hard-code dataset assumptions (column cardinalities, dict-page completeness, specific needle bytes, ClickBench RG layout). The README's "33 of 43 beat DuckDB, up to 5×" comes from those handlers firing on the queries they were written for; the queries with no matching handler (Q29, Q31, Q32, Q33) lose by 3–8×.

So it's not a general-purpose engine that happens to win ClickBench - it's a generic engine plus a ClickBench-shape lookup table. The benchmark is essentially part of the codebase's input.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SlothDB - a parody of a database (disqualified) #930

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

SlothDB - a parody of a database (disqualified) #930

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions