Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT CLOSE OR MERGE] Rti 12596 compress to file #32

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

tak30
Copy link
Collaborator

@tak30 tak30 commented Aug 5, 2022

No description provided.

@tak30 tak30 self-assigned this Aug 5, 2022
Comment on lines +74 to +88
static int save_file(const char* fileName, const void* buff, size_t buffSize)
{
FILE* const oFile = fopen(fileName, "a");
if (!oFile) {
return 0;
}
size_t const wSize = fwrite(buff, 1, buffSize, oFile);
if (wSize != (size_t)buffSize) {
return 0;
}
if (fclose(oFile)) {
return 0;
}
return 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this part is IO bound and it's not marked as such, then opening and closing are expensive operations on top of that doing it constantly is going to be slow. There is no fsync to ensure that the data is written to the disk.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I highly recommend reading the I/O Queues part of the erl_nif

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably it should be scheduled with ERL_NIF_DIRTY_JOB_IO_BOUND

Copy link
Collaborator Author

@tak30 tak30 Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding fsync: I tried to use fsync or any variation like fdatasync, but they are not compliant with C99 : error: implicit declaration of function 'fsync' is invalid in C99 [-Werror,-Wimplicit-function-declaration] fsync(); ^ 1 error generated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding I/O Queues, that is a good idea, I just found this very good example, we could replicate it: https://github.com/erlang/otp/blob/master/erts/emulator/nifs/common/zlib_nif.c

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably it should be scheduled with ERL_NIF_DIRTY_JOB_IO_BOUND

I'm not sure about that as compressing uses a lot of cpu

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because fsync is under unistd.h

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if we have a way with IO queues, then I would rather follow that approach rather than directly writing to disk.

static ERL_NIF_TERM zstd_nif_compress_to_file(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
ErlNifBinary bin;
if(!enif_inspect_iolist_as_binary(env, argv[0], &bin)) return enif_make_badarg(env);
if (bin.size > MAX_BYTES_TO_NIF) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data should be splitted and handle in shorter nif calls. I recommend reading Yielding NIF and Long-Running NIFs https://www.erlang.org/doc/man/erl_nif.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should do the same as zlib

#include <zstd.h>

#define MAX_BYTES_TO_NIF 20000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How and Why this value was chosen?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raz-adroll
Copy link
Collaborator

You can build some NIF functions that do the actual compression and then do the rest using erlang in the same module if you do not feel comfortable with C.

@raz-adroll
Copy link
Collaborator

You also forgot to have options for the compression level and the "long distance matching with given window log" or --long option.

@tak30
Copy link
Collaborator Author

tak30 commented Aug 8, 2022

You can build some NIF functions that do the actual compression and then do the rest using erlang in the same module if you do not feel comfortable with C.

I was trying to do the compression and writing in C in order to be able to release memory as soon as possible. It also should be more efficient.

@tak30
Copy link
Collaborator Author

tak30 commented Aug 8, 2022

You also forgot to have options for the compression level and the "long distance matching with given window log" or --long option.

The --long option is missing but the compression level is there.

@tak30
Copy link
Collaborator Author

tak30 commented Aug 9, 2022

@raz-adroll I suggest merging this change as it is now and work on the improvements whenever the next ticket is prioritized.

@raz-adroll
Copy link
Collaborator

@raz-adroll I suggest merging this change as it is now and work on the improvements whenever the next ticket is prioritized.

This shouldn't be merged as it is.

@tak30
Copy link
Collaborator Author

tak30 commented Aug 9, 2022

@raz-adroll I suggest merging this change as it is now and work on the improvements whenever the next ticket is prioritized.

This shouldn't be merged as it is.

What do you propose to do as the minimum for merging? I believe adding the fsync should be enough

@raz-adroll
Copy link
Collaborator

@raz-adroll I suggest merging this change as it is now and work on the improvements whenever the next ticket is prioritized.

This shouldn't be merged as it is.

What do you propose to do as the minimum for merging? I believe adding the fsync should be enough

The minimum would be not breaking trino querying

@tak30
Copy link
Collaborator Author

tak30 commented Aug 9, 2022

@raz-adroll I suggest merging this change as it is now and work on the improvements whenever the next ticket is prioritized.

This shouldn't be merged as it is.

What do you propose to do as the minimum for merging? I believe adding the fsync should be enough

The minimum would be not breaking trino querying

Sure, then this PR will be kept unmerged until the next ticket is prioritized.

@tak30 tak30 changed the title Rti 12596 compress to file [DO NOT CLOSE OR MERGE] Rti 12596 compress to file Aug 9, 2022
@CLAassistant
Copy link

CLAassistant commented Sep 20, 2023

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants