
Feature Request: Switch/Option/Job to automatically execute 'tarsnap -t -f' for each archive not yet indexed by tarsnap-gui #149

Open
james-mcgoodwin opened this issue May 31, 2017 · 6 comments

james-mcgoodwin commented May 31, 2017

It would be wonderful if there were a command-line argument, an application option, or a way to define a Job such that Tarsnap-gui would fetch the archive contents for all 'un-indexed' archives and store them in its SQLite database for future use.

The problem this feature would help solve: when you need to do a restore, and another program is creating the archives (e.g. danrue/Feather), it's difficult to use Tarsnap-gui to find the archive that contains the data you want to recover. In this scenario, Tarsnap-gui has no index of the archive's contents because the archive was generated by other means.

Currently, accomplishing this means starting Tarsnap-gui, going to the "Archives" tab, and clicking the "Display details for this Archive" button for each archive that is 'un-indexed'. This prompts Tarsnap-gui to run 'tarsnap -t -f <archive_name>', which fetches and indexes all of the archive's contents into its SQLite database.

This process is relatively quick for archives weighing in at hundreds of megabytes. But once an archive reaches tens of gigabytes or larger, the time to finish this command can be on the order of hours instead of minutes.

The super-power of Tarsnap-gui is that when you click the "Display details for this Archive" button a second time, or click a different archive that has already been indexed, the application simply recalls that data from the database. This is profoundly faster than running a full 'tarsnap -t -f <archive_name>' from scratch, and it helps panicked users find the most recent archives with current contents of the /home/some_user/my_valuable_data_directory folder.

In an ideal world, this could be some sort of special job type that can be invoked from the command line. In this case, I could simply set a bi-weekly cron job to kick off at <some_deepest_of_darkest_of_hours> and when the day comes that I am desperate to recover data, I already have an up-to-date dataset of the contents of all Tarsnap archives waiting for me to paw through frantically.
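As a minimal sketch of what such a cron-driven sweep might look like outside the GUI, the script below assumes only the documented `tarsnap --list-archives` and `tarsnap -t -f <archive>` invocations; how the GUI tracks which archives it has already indexed is not public, so the `indexed` set here is a hypothetical input, not Tarsnap-gui's actual mechanism:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: pre-fetch listings for archives not yet indexed,
suitable for running from cron at <some_deepest_of_darkest_of_hours>."""
import subprocess

def unindexed(all_archives, indexed):
    """Return the archive names that have no cached listing yet."""
    return sorted(set(all_archives) - set(indexed))

def index_command(archive):
    """Build the tarsnap listing command for one archive."""
    return ["tarsnap", "-t", "-f", archive]

def sweep(indexed):
    """List all remote archives, then fetch contents for un-indexed ones."""
    names = subprocess.run(["tarsnap", "--list-archives"],
                           capture_output=True, text=True,
                           check=True).stdout.splitlines()
    for name in unindexed(names, indexed):
        # The listing output would then need to be stored wherever the
        # GUI's index lives -- that storage step is the missing piece
        # this feature request asks for.
        subprocess.run(index_command(name), check=True)
```

A bi-weekly crontab entry pointing at a script like this (e.g. `30 3 */14 * * /usr/local/bin/prefetch-listings.py`) would match the scenario described above.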

Is that a feature that seems reasonable to request?

NOTE: I'm making a few assumptions here:

  1. the user values the time saved by having an up-to-date dataset of archive contents over the cost of the additional 'tarsnap -t -f' traffic.

  2. the SQLite database used by Tarsnap-gui is magical and can scale/grow to support the indexes of Joe Schmoe's impossibly, unimaginably numerous archives that Joe has never bothered to prune.

  3. the SQLite database doesn't suffer some sort of crippling performance penalty because of (2) because, again, it's magical.

  4. the Tarsnap service itself at v1-0-0-server.tarsnap.com doesn't mind what is effectively a wholesale audit of <select_appropriately_horrifyingly_large_number_of_subscribers> accounts and the un-indexed archives therein.

Thanks very much.

shinnok commented May 31, 2017

Hi James,

Thanks for the detailed and comprehensive feature request. I think you've touched on all the intricacies of this feature, which I was planning to implement as part of a bigger one: the ability to search & restore specific files across all archives (or a specific Job).

I would be tempted to implement it as a global setting ('Always fetch archive contents'), with a verbose warning that this will incur traffic charges over time. For archives created within the GUI, the best time to fetch contents would be right after an archive is created, or when a remote listing refresh returns new archives (on app startup) that were created either via Job scheduling or manually outside the GUI. Regarding the latter case, this would be a startup performance problem only if one persistently creates archives outside the GUI using the same key, which I don't recommend, especially after 1.0, which will make Job scheduling even easier and more granular (per-job daily, weekly, monthly).

Regarding the performance of the actual -t -f, the GUI runs them concurrently, and even for archives of hundreds of thousands of files the operation usually completes in minutes, given a reasonable network connection and archive complexity. Regarding the SQLite persistent store, I store the archive contents zlib-compressed as binary blobs and only uncompress & parse them when the Archive details widget is shown. As soon as that widget is hidden/destroyed, the contents memory is also released.
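The compressed-blob pattern described above can be illustrated with a minimal sketch; the table and column names here are made up for illustration and are not Tarsnap-gui's actual schema:

```python
import sqlite3
import zlib

# Illustrative schema: one row per archive, contents stored as a
# zlib-compressed blob rather than as parsed rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE archives (name TEXT PRIMARY KEY, contents BLOB)")

def store_contents(name, listing):
    """Compress the raw contents listing and persist it as a blob."""
    blob = zlib.compress(listing.encode("utf-8"))
    db.execute("INSERT OR REPLACE INTO archives VALUES (?, ?)", (name, blob))

def fetch_contents(name):
    """Inflate the listing only when the details view actually needs it."""
    (blob,) = db.execute(
        "SELECT contents FROM archives WHERE name = ?", (name,)).fetchone()
    return zlib.decompress(blob).decode("utf-8")
```

Keeping the blobs compressed at rest and inflating on demand is what keeps both the database size and the steady-state memory footprint small: nothing is held uncompressed once the details widget goes away.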

@shinnok shinnok self-assigned this May 31, 2017
@james-mcgoodwin (Author)

Hi Shinnok,

My feelings about 'Always fetch archive contents (at application boot)' are mixed. Tarsnap-GUI launches several simultaneous calls to tarsnap in parallel (I think I counted up to 10 last night, or somewhere around there).

Which is excellent, but I would suggest that the hit to the machine's CPU, or to the app's responsiveness at boot time, might not make for a very good user experience. Consider Tarsnap-GUI kicking off an at-launch, multi-process go-index-everything job just as the user is poking and clicking inside the application for the first time that day/week/catastrophic-meltdown.

Which is where my brain was when I noodled on a cron-driven special Job 'task' that would run around and do all that indexing before the user needs it.

You're absolutely right, though: the optimal time is to capture the index right after an archive is initially committed. Step 1: make it. Step 2: record it. Totally reasonable.

Tarsnap-gui is crucial in my little collection of tools exactly because it gives me the index database. It lets me know which archive is the correct one to select before I invest the restore time needed for the process.

For archive creation, I currently use Feather because it handles the whole grandfather-father-son retention scheme for me. For a cron-based backup job that runs every 2 hours, Feather's archive management and pruning functionality becomes incredibly attractive. I'm backing up my home dir, so as-current-as-reasonable backups are nice ;)

I'm keen to move my archive generation back to tarsnap-gui when v1.0 brings in archive auto-rotation.

But be it Feather or Tarsnap-GUI, my personal ideal is less about the tool and more about letting cron drive both archive creation and archive indexing, independent of the presence of a user.

That would push both time-intensive tasks into background clockwork and free Tarsnap-GUI of any responsibilities during start-up and early process lifetime (i.e. allowing Tarsnap-gui to be both up-to-date and idle/responsive for users right from GUI launch).

@james-mcgoodwin (Author)

I should add here that I think we're experiencing fundamental differences in the time it takes to run tarsnap in either -t mode or -x mode. I think I gravitate toward moving the indexing bit into a background task specifically because I experience it as a very time-intensive process.

My runtime for that command on my stupid, lazily crafted archives (~30 GB / 500,000 files) runs from about 30 minutes to close to 45 minutes on an SSD-based Core i7 MacBook Pro with 16 GB of RAM and an internet connection of 100 Mb up / 10 Mb down.

If you're getting much better performance with a 'tarsnap -t -f' execution on a several-hundred-thousand-file archive, then I must be doing something fundamentally wrong.

I did a 3.5 GB recovery with Tarsnap-gui v0.9 / Tarsnap v1.0.37 last night, and the process took around 4 hours to complete. I'd always just assumed that was average and expected.

shinnok commented Jun 2, 2017

Unfortunately, rolling backups won't be part of the 1.0 release. But I do understand the need for a --maintenance procedure once rolling is implemented. It should be a tarsnap-gui parameter, just like --jobs, and it should handle archive rotation, archive syncing with the remote, and archive contents fetching. I would be tempted to run it after every --jobs invocation and on a cron schedule. This way the app would have less to do on startup and be ready to use from the get-go.
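As a sketch under stated assumptions, the proposed `--maintenance` procedure amounts to chaining the three steps named above after each `--jobs` run; the function names below are hypothetical placeholders, not real Tarsnap-gui internals:

```python
# Hypothetical sketch of the proposed --maintenance sequence: rotate
# archives, sync the archive list with the remote, then fetch contents
# for any archives still un-indexed. Each step is passed in as a
# callable so the driver stays decoupled from the actual implementations.
def run_maintenance(rotate_archives, sync_with_remote, fetch_contents):
    """Run the three maintenance steps in order; return the names of
    the steps that completed, for logging purposes."""
    performed = []
    for step in (rotate_archives, sync_with_remote, fetch_contents):
        step()
        performed.append(step.__name__)
    return performed
```

Running this after every `--jobs` invocation (and on its own cron schedule) is what would let the GUI start up with nothing left to do, matching the "ready to use from the get-go" goal above.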

I'm glad to hear that the GUI is of good use to you. I think 1.0 will bring you lots of utility and performance as well, since there have been many improvements regarding archive contents (long ls listing format with sortable columns, search & filter, restore/download of only selected files, better detection/support for truncated/empty archives, etc.). Thank you for your feedback!

shinnok commented Jun 2, 2017

Regarding the -t performance on my side, here's how long it took to fetch contents for an archive with a ~300k entry count, from a Job with a trail of 148 archives going back 2 years:

[6/2/17 10:08 AM] Fetching contents for archive Job_Work_2017-06-02_10-00-03...
[6/2/17 10:18 AM] Fetching contents for archive Job_Work_2017-06-02_10-00-03... done.

So I might have exaggerated with "minutes", but that's still a good runtime if you ask me. I'm using tarsnap CLI 1.0.37-head and GUI master.

@james-mcgoodwin (Author)

Regarding the proposed '--maintenance' flag, I agree that it should fire after one/all jobs complete.

That's the essence of how I'm using Feather now.

I'm looking forward to v1.0:)

Thanks for making this project. Tarsnap-GUI is an important part of my tarsnap tool kit:)
