Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] FileSplitter with FTP upload support #17549

Closed
wants to merge 3 commits into from

Conversation

NickLaMuro
Copy link
Member

@NickLaMuro NickLaMuro commented Jun 7, 2018

"So basically... you re-wrote split..."

- Probably everyone code reviewing

Why?

Well, this is first off related to this BZ and eventually this BZ. Since split drastically differs on BSD/UNIX systems, trying to work with the limited set of flags they do share was a bit of a pain and not terribly flexible.

More over, the latter BZ would eventually require that we take the generated files and upload them to FTP, and ideally it would be best if we could avoid putting the split up files on the local server first before uploading. This is possible in "GNU split land" with the --filter flag via a nested curl command, but not so in BSD. Since Ruby has all of the facilities to handle this in it's standard lib, and it is a common denominator on all appliances, this seemed like the most straight forward and easily testable route (despite being an exercise in "re-inventing the wheel").

Why here?

Well, I figured the rake tasks that would be triggered by the appliance_console would be the ones in charge of determining if file splitting was necessary, so instead of moving this code into manageiq-gems-pending, I figured it made more sense to keep it close to where it was being used. Some support code will still be necessary to allow pipes to work with PostgresAdmin and such (see ManageIQ/awesome_spawn#41 for a start...), but those will come later.

Why FTP (right now)?

Without it, it made less sense that I spent the time re-writing split, and including provides a better bit of context to why this is necessary, and how the pieces will eventually work together.

Links

Testing/QA

I have been testing this by exercising the code in #17652, so see that PR for details.

def initialize(options = {})
@input_file = options[:input_file] || ARGF
@input_filename = options[:input_filename]
@byte_count = options[:byte_count] || 10_485_760
Copy link
Member Author

@NickLaMuro NickLaMuro Jun 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default is 10.megabytes, but just expressed as the integer directly (avoids needing the ActiveSupport dependency when run as a script).

I am open to a better default (specifically for pg_dump outputs and backups), but I just went with something here that could be reasonable to deal with in the test suite, since I am testing this on the filesystem directly with 10MB temp files.

@carbonin
Copy link
Member

Is there a reason this all wasn't done as an enhancement to FileDepotFtp? It feels like we could use some of the existing connection logic from that class.

If you need a separate script I would think that script should use an enhanced FileDepotFtp which maybe called out to a new utility class for splitting files (which wouldn't have to know about FTP at all).

@NickLaMuro
Copy link
Member Author

Is there a reason this all wasn't done as an enhancement to FileDepotFtp?

I guess I didn't mention this in the PR description, but the first reason I made this a script was because pg_dump can't be called in any other way (so via a c-ext or a SQL command), which is something I did look into in the first place to allow for this being done within a rake task without shelling out.

That is worth noting because I want to avoid having to have a tmp copy of the dump on the host machine for this script incase the DB is massive and there is limited disk space on the host running the dump. This is a big factor in every one of the other decisions made (regardless of how smart or dumb going down this path may be)

This lead to a bunch of smaller decisions:

  • To support FTP, we needed the FTP code to be "stream-able", something that I had to hand write and wasn't available in net/ftp by default. The solution is a bit of a hack since we are calling out to some private methods though. I decided to make it as part of the same process since it would be less of a bottleneck then passing streams between two ruby processes.
  • The split functionality needed to be "pipe-able" to support FTP, and needed to happen prior to sending something to the FTP endpoint. split by default only has the --filter option, and was only available on GNU split, not BSD split. NFS and SMB are just mounts, so the resulting files can just be dumped into their target location, but FTP needs to be sent to another "function" (process, code path, etc.) since it can't act like a mount by default (to my knowledge at least).
  • I could have just piped the resulting output and buffered it off to something like curl for FTP. But wasn't sure how pipe in to and out of in a single ruby process and we already have net/ftp in stdlib, so that is a known constant and seemed easier just to squash them together. curl could also be weirdly setup on the $PATH in some machines, or even non-existent by default in some distros (though, I think it is a requirement on our appliances).
  • I didn't want this script to be bogged down by the entire Rails ENV, which would have been required if I made use of the existing code in FileDepotFTP directly. Since this process is definitely going to be the bottleneck compared to pg_dump, I wanted it as light weight as possible.

That said, I think not sharing code between this and FileDepotFTP makes more sense, and my thought to do this (while accounting for the above) might be to extract the shared code into a module that can be loaded by FileDepotFTP and this script, but then this script wouldn't be tied to using the entire Rails ENV.


For reference, the usage when called out to via a shell out is going to be something like:

$ pg_dump -h example.com -u root ... vmdb_production | file_splitter.rb -b 512M --ftp-host foo.com ...

@NickLaMuro NickLaMuro force-pushed the file_split_script branch 2 times, most recently from e5ca2b4 to 53a1bf0 Compare June 28, 2018 00:05
@NickLaMuro NickLaMuro force-pushed the file_split_script branch 2 times, most recently from 6e3b8e9 to a6a46ec Compare June 28, 2018 23:05
Copy link
Member Author

@NickLaMuro NickLaMuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moving history of @Fryguy 's comment on #17652 that where intended for this PR, and showing the update and rational for them.

"k" => KILOBYTE,
"m" => MEGABYTE,
"g" => GIGABYTE
}.freeze
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses @Fryguy 's comment in #17652

Not using activesupport because I might go as far as disabling rubygems as well to decrease the boottime of this script, since this script will definitely be the bottleneck when run with piped output from either pg_dump or pg_basebackup.


KILOBYTE = 1024
MEGABYTE = KILOBYTE * 1024
GIGABYTE = MEGABYTE * 1024
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses @Fryguy 's comment in #17652

Splits files based on a given size, similar to `split`.  Uses number
based filename increments instead of letter based (which is the default
in `split`)
A library for shared Net::FTP functions that can be used throughout the
ManageIQ project.  Extracted from FileDepotFtp.
Allows for sending both anonymous and authorized FTP uploads directly
from the script.  Splits are done in memory, so the file never is
created on the host machine, which is ideal for large backups and such
that are being split up.

Spec for testing this functionality are done via a live FTP (vsftpd)
server setup with the following config:

    listen=NO
    listen_ipv6=YES

    local_enable=YES
    local_umask=022
    write_enable=YES
    connect_from_port_20=YES

    anonymous_enable=YES
    anon_root=/var/ftp/pub
    anon_umask=022
    anon_upload_enable=YES
    anon_mkdir_write_enable=YES
    anon_other_write_enable=YES

    pam_service_name=vsftpd
    userlist_enable=YES
    userlist_deny=NO
    tcp_wrappers=YES

Running on the 192.168.50.3 IP address.  Full vagrant setup for the
above can be found here:

    https://gist.github.com/NickLaMuro/c883a997c7ae943dd684bccd469cea43

Put the `Vagrantfile` into a directory and run:

    $ vagrant up

(with `vagrant` installed on that machine, of course)

The specs will only run if the user specifically targets the tests using
a tag flag:

    $ bundle exec rspec --tag with_real_ftp
@miq-bot
Copy link
Member

miq-bot commented Aug 2, 2018

Checked commits NickLaMuro/manageiq@65d9ca0~...fabf763 with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 2 offenses detected

lib/manageiq/util/file_splitter.rb

@NickLaMuro NickLaMuro changed the title FileSplitter with FTP upload support [WIP] FileSplitter with FTP upload support Aug 3, 2018
@NickLaMuro
Copy link
Member Author

Might change how we are handling file splitting, so setting this as [WIP] for now.

See #17798 (comment) for some further details.

@NickLaMuro
Copy link
Member Author

Closing this for now. I think I have ripped out the parts that I do need for this, so no need to leave it open.

Main rational can be found here: #17798 (comment)

@NickLaMuro NickLaMuro closed this Aug 14, 2018
NickLaMuro added a commit to NickLaMuro/manageiq-gems-pending that referenced this pull request Sep 18, 2018
Built off of this reference code:

    buf_left     = byte_count
    while buf_left.positive?
      cur_readsize = buf_left - FTP_CHUNKSIZE >= 0 ? FTP_CHUNKSIZE : buf_left
      buf = input_file.read(cur_readsize)
      break if buf == nil # rubocop:disable Style/NilComparison (from original)
      conn.write(buf)
      buf_left -= FTP_CHUNKSIZE
    end

From:

    ManageIQ/manageiq#17549

This basically allows a shared form for uploading the source_input to a
file IO (whether that IO is a socket or or some other form of IO).

The method has been split up from the reference code since there are
some places where the loop itself might not be needed, but the code to
deliver a chunk at a time might be (see next commit for an example of
this using `MiqSwiftStorage`).

Also, since `read_single_chunk` is designed to be called on it's own, we
want to be able to eject with an empty string from that method itself.
This basically re-arranges some code (from the reference code) to allow
that, but maintain the same amount of conditional logic per chunk read
that was there previously.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants