Data upload/transfers from endpoints calls create/open/write/close #84
I think I understand what you are seeing. The globus connector uses the
The default number of threads is 3, so each upload spawns two additional threads, making three in total. Each of these opens the file so that it can write. As far as I can tell, we do not have the file length, so we can't make a decision based on it. For file reads, there is another environment variable called
All that said, we don't do a create/close/open. We do perform an open/create if the initial data object does not exist. Please let me know if what you are seeing does not align with either this comment or the previous one. If the issue is the multiple threads, I'm not sure what can be done about it if you still want multiple threads for large files. If the issue is the open/create when the file does not exist, we might be able to change that to a single open with the O_CREAT flag.
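As a rough illustration of the difference between the two patterns discussed above, here is a minimal sketch that uses local POSIX files as a stand-in for iRODS data objects. It is not the connector's code: the function names are hypothetical, and `os.open` merely plays the role that the iRODS open/create API calls play in the plugin. Each function returns the list of calls it made so the difference is visible.

```python
import os


def open_then_create(path):
    """Two-step pattern: try a plain open, and create only if that fails."""
    calls = ["open"]
    try:
        fd = os.open(path, os.O_WRONLY)
    except FileNotFoundError:
        # The target does not exist yet, so a second call is needed.
        calls.append("create")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    return fd, calls


def open_with_o_creat(path):
    """Single-step pattern: one open that creates the target when missing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    return fd, ["open"]
```

For a path that does not exist yet, the first function issues an open followed by a create, while the second issues a single open. Any rule that reacts to create and open events would therefore see a different sequence depending on which pattern a client uses.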
I will try to explain my observations:
Maybe I didn't understand why open/create is used here, but I checked other clients; for example, the PRC only calls open (not open/create) if the initial data object does not exist. My impression is that it would be better if the connector behaved like other clients that use the same APIs, so that the results are always the same. I have no idea about the performance implications, though.
I am sorry if I misunderstood; I am not an expert. However, I checked with the following PEP, and it contains the file size. I would guess that the first thread already knows the file size.
instead of calling rcDataObjOpen followed by rcDataObjCreate if the open fails
@JustinKyleJames - Close if complete. Thanks.
We have changed the create/open to a single open with the O_CREAT flag. As for the multiple threads and file size, I am not sure how you are getting the file size. I do not see it. Here is my dataObjInp for the pep_api_data_obj_open_post operation:
Here is my rule:
@mstfdkmn, are you sure the write where you got the file size was via Globus? As for the create: yes, we've always done only one create. Now we will do one open with the O_CREAT flag set, and then additional opens for the other threads. I'll keep this issue open until we determine whether or not Globus provides the file size. I don't see any way it is provided, and the plugin was not setting it. (It was setting it to zero.)
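To make the "one open with O_CREAT, then other opens for the other threads" pattern concrete, here is a small sketch that again uses a local file in place of an iRODS data object. Everything in it is an assumption for illustration: the function name is hypothetical, and splitting the data into equal slices is only one possible scheme, since this thread does not describe how the connector divides the data among its threads.

```python
import os
import threading


def parallel_upload(path, data, num_threads=3):
    """Write data to path using num_threads writers, each with its own open."""
    # Only the first open may create the target.
    fds = [os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)]
    # The remaining writers open the now-existing target without O_CREAT.
    fds += [os.open(path, os.O_WRONLY) for _ in range(num_threads - 1)]

    chunk = -(-len(data) // num_threads)  # ceiling division

    def write_slice(fd, index):
        offset = index * chunk
        piece = data[offset:offset + chunk]
        written = 0
        # os.pwrite writes at an explicit offset, so writers do not interfere.
        while written < len(piece):
            written += os.pwrite(fd, piece[written:], offset + written)

    workers = [
        threading.Thread(target=write_slice, args=(fd, i))
        for i, fd in enumerate(fds)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    for fd in fds:
        os.close(fd)
    return len(fds)  # number of opens (and closes) performed
```

With the default of three threads, a single upload produces three opens and three closes, which is consistent with the multiple open/close events reported earlier in this issue.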
Hmm, a bit strange! Although the PRC and Globus trigger the same PEP, the set of serialized parameters (keys) differs between the two calls. Globus seems to behave differently from the PRC: for example, the Globus call doesn't contain dataSize and data_size, whereas the PRC call does. Please have a look at my logs below for more. Can we expect this to change with your O_CREAT flag fix? My rule:
PRC call result:
globus call result:
That is because the PRC knows the file size and therefore sets it. Unfortunately, as far as I can tell, Globus does not provide the file size, so we don't set it.
Can we ... ask Globus about this?
Yes, I will do that.
I did find out that if the client sets it up properly, the transfer_info->alloc_size is set to the uploaded file size. I verified that this is the case with globus-url-transfer. It is not the case if using generic ftp commands. I have made the following changes to the code. If the following conditions are met we will only use one thread for uploads:
In any other case we will use the numberOfIrodsReadWriteThreads setting since we either don't know the file size or have not set a threshold. @trel @korydraughn - Does that sound acceptable or should we have a default threshold? In either case, if alloc_size is not set I think we need to keep the number of transfer threads as the file could be large - we just don't know. So here is an example configuration in /etc/gridftp.conf:
In this case we use 3 threads for any transfer above 32 MiB or if we do not know the file size.
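The decision described above can be sketched roughly as follows. This is not the plugin's code: the function and parameter names are hypothetical, and treating a missing or zero alloc_size as "size unknown" is an assumption. The sketch only encodes what the comments state: one thread when the size is known and does not exceed a configured threshold, and the configured thread count in every other case.

```python
MIB = 1024 * 1024


def upload_thread_count(alloc_size, threshold_bytes, configured_threads=3):
    """Return how many threads a hypothetical upload would use.

    alloc_size: upload size in bytes, or None/0 if the client did not send it.
    threshold_bytes: configured size threshold in bytes, or None if unset.
    configured_threads: stand-in for the numberOfIrodsReadWriteThreads setting.
    """
    if not alloc_size or alloc_size < 0:
        # Size unknown: the file could be large, so keep multiple threads.
        return configured_threads
    if threshold_bytes is None:
        # No threshold configured: keep the configured thread count.
        return configured_threads
    if alloc_size <= threshold_bytes:
        # Known size that does not exceed the threshold: one thread.
        return 1
    return configured_threads
```

With a 32 MiB threshold and 3 configured threads, a 1 MiB upload of known size would use one thread, while a 64 MiB upload, or any upload of unknown size, would use three.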
@trel @korydraughn - One additional note to the question above. The reason we don't currently have a default threshold is that this threshold was originally only for downloads. In the download case we have to query iRODS to determine the file size. An administrator may choose not to set $irodsParallelFileSizeThresholdBytes because they feel that using multiple threads on small files is better than always querying the iRODS database. This isn't really a consideration for uploads, as we don't query for the file size. It is either provided or not. We could have a default threshold that only applies to uploads.
It sounds like As for the conditions you proposed in #84 (comment), those sound fine. Have you received a response from Globus about whether the file size is provided (and when it's provided)?
Yes, the info in the first paragraph of #84 (comment) was from Globus.
The reason I didn't consider making a new setting is that the existing one has a generic name, and 1) I'm not sure if we can/should rename it at this point and 2) if it isn't renamed, it might be confusing. One thought is to use the same parameter but give it a default value (say, 32 MiB) in the case of uploads.
That might be nice. And just a documentation exercise for those who might go looking.
Agreed.
I made this change to the code and README.
As far as I can see, the Globus connector first creates a data object/file in iRODS (create/close), later writes to that data object (open/write), and lastly closes it. This differs from the behavior of other clients (PRC, HTTP API) that call the same APIs.
Also, if the upload happens through personal connect/HTTPS, then the open/create order seems to change: the open is called first, and the create is called somewhere later.
There is also no way to create an empty data object in the Globus interfaces. I had always assumed that the create operation exists to create an empty object.
Custom rules that have logic with
pep_api_data_obj_open_*, pep_api_data_obj_create_*, pep_api_data_obj_close_*
don't easily fit how Globus works, but this does not seem to be a big deal for now. I am wondering why the Globus connector first creates and later opens a data object each time, instead of opening and writing while transferring data to iRODS. Also, would this not have a negative impact on transfer speed?