Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSD cache-dictionary #8624

Closed
wants to merge 113 commits into from
Closed

Conversation

nikvas0
Copy link
Contributor

@nikvas0 nikvas0 commented Jan 12, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Cache-dictionaries layout for storing data on SSD.

TODO:

  • getBlockInputStream
  • Counters
  • Strings support
  • one file fifo compaction
  • lru for keys in memory
  • Configurable write buffer size
  • Checksums for blocks
  • store keys and metadata on ssd?
  • Any keys support
  • Get rid of duplicate objects in update from different threads
  • Perfomance test
  • Change HashMap (now unordered_map) (store hashes instead of keys?)
  • documentation
  • some refactoring

achulkov2 added a commit to achulkov2/ClickHouse that referenced this pull request May 24, 2020
@nikitamikhaylov
Copy link
Member

I think it is necessary to add performance tests to benchmark it in comparison with DirectDictionary and CacheDictionary. Or just CacheDictionary in memory with SSDCacheDictionary.
I don't mean performance tests in CI, I want something like the one given here #10622


namespace
{
inline size_t nearestPowTwo(size_t x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use roundUpToPowerOfTwoOrZero from our codebase


size_t getPosition(const size_t bucket) const
{
const size_t idx = (bucket >> 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need it here? bucket it already a number from [0, buckets - 1]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To determine a position in the bucket we need to read 4 bits from the array with positions, but we can address only bytes, so we need to divide bucket number by 2.

};

if (!write_buffer)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces are redundant according to our codestyle


const size_t start_block = current_file_block_id % max_size;
const size_t finish_block = start_block + write_buffer_size;
for (const auto& key : keys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: for (const auto & key : keys)

{
// add partitions to queue
while (partitions.size() > max_partitions_count)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

AIOContext aio_context(1);

while (io_submit(aio_context.ctx, 1, &request_ptr) != 1)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

}

while (io_getevents(aio_context.ctx, 1, 1, &event, nullptr) != 1)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style


TemporalComplexKeysPool tmp_keys_pool;
storage.update(
source_ptr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true, that we can use many sources and not only one?

void SSDCachePartition::remove()
{
std::unique_lock lock(rw_lock);
std::filesystem::remove(std::filesystem::path(path + BIN_FILE_EXT));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling?

std::vector<Key> required_ids(not_found_ids.size());
std::transform(std::begin(not_found_ids), std::end(not_found_ids), std::begin(required_ids), [](const auto & pair) { return pair.first; });

storage.update(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to wrap it in a separate function, which adds source_ptr and lifetime and call this function?

@filimonov
Copy link
Contributor

BTW - it would be really interesting to see some perf number, and/or comparison with aerospike (#5629)

@alexey-milovidov
Copy link
Member

@filimonov They are published in https://presentations.clickhouse.tech/hse_2020/3rd/SSD_Dictionary_full.pdf

@filimonov
Copy link
Contributor

Just for the reference - aerospike gives up to ~1Mln req/sec

@alexey-milovidov
Copy link
Member

@filimonov This number does not mean anything.
It can be too slow or too fast depending on hardware.
Nikita has tested with laptop SSD.

nikitamikhaylov added a commit that referenced this pull request Jun 27, 2020
@nikitamikhaylov
Copy link
Member

Merged in #11947

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants