Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs.ftpfs.FTPFS.upload opens new connections #455

Closed
atollk opened this issue Mar 17, 2021 · 8 comments · Fixed by #457
Closed

fs.ftpfs.FTPFS.upload opens new connections #455

atollk opened this issue Mar 17, 2021 · 8 comments · Fixed by #457
Milestone

Comments

@atollk
Copy link
Member

atollk commented Mar 17, 2021

Hi,
I'm looking at this function:

def upload(self, path, file, chunk_size=None, **options):

_manage_ftp always opens a new FTP connection, together with user login, "FEAT" command and so on. This is done everytime one wants to copy a file to an FTPFS. Is there a good reason why that code doesn't just use self._ftp, the connection that should be open for the lifetime of the FTPFS object? Otherwise, I'd like to create a PR to fix that.

@althonos
Copy link
Member

Hmmm, true, I'm actually wondering too. I know it makes sense to open a new connection when opening a new file, to guarantee the file and the FTPFS can be used in parallel, but i don't think it makes sense for the shortcut methods (like upload or readbytes) to create a new connection.

@willmcgugan , any opinion? Maybe there was a reason for doing so when you wrote the code that I am unaware of.

@willmcgugan
Copy link
Member

I know it makes sense to open a new connection when opening a new file, to guarantee the file and the FTPFS can be used in parallel, but i don't think it makes sense for the shortcut methods (like upload or readbytes) to create a new connection.

Yeah, FTP offers no way of multiplexing file downloads that I know of, so multiple instances are the only way to achieve that. Re shortcut methods, I guess we could re-use the main connection used for listing directories, but there would still need to be one per thread to make it thread-safe.

I wouldn't be surprised if there were wins that could be made for copying files sequentially. Especially for many small files. But in general I would expect downloading in parallel to eclipse those wins...

@althonos
Copy link
Member

althonos commented Mar 18, 2021

but there would still need to be one per thread to make it thread-safe.

Wouldn't locking the FTPFS when accessing the main connection be sufficient though?

@willmcgugan
Copy link
Member

Sure. That would limit performance though. Do we do that already? I don’t recall.

@althonos
Copy link
Member

In the snippet linked by @atollk the lock is acquired before _manage_ftp is called, yes.

@willmcgugan
Copy link
Member

In that case re-using the connection may be beneficial. @atollk you’re welcome to tackle that, assuming @althonos approves.

@althonos
Copy link
Member

Fine by me! There are probably other places in fs.ftpfs where this could be beneficial, but we can figure that out if you want to open a PR.

@atollk
Copy link
Member Author

atollk commented Mar 19, 2021

I created the PR. As @althonos said, in upload it's an easy one-liner.

I don't see any other places in fs.ftpfs where that change could also be implemented, except for maybe readbytes as you mentioned yourself. Correct me if I'm wrong though; if it's on a similar size, I can put those in the same PR.

Regarding readbytes, I suppose the change is not as simple, since self._lock is not obtained in the current code, so as @willmcgugan said, the same change might cause behavioral change.

In my experience, multiple FTP connections to the same server almost never actually give you anything; you'll just have two connections with half their data rate each. (Although I'll leave that statement up for discussion.) In that case, I feel that _manage_ftp as a whole should be removed. Users who still want to open multiple connections at once always have the option of creating multiple FTPFS instances.

As an alternative, maybe some kind of "connection pool" instead of a single ftp connection would be appropriate; so _manage_ftp would not have to open a new connection on each call but would reuse existing connections if they are unused at the moment. Although that might be something more suitable for pyfilesystem3, if that's coming in the near future :)

@althonos althonos added this to the v2.4.13 milestone Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants