Preallocate files #2169

Open
reallyuniquename opened this issue May 3, 2022 · 19 comments

@reallyuniquename

Right now missing articles are simply skipped: the remaining data ends up misaligned and the resulting file size is wrong.

This behaviour rules out a lot of simple ways of restoring missing chunks, even if you can get them from other sources. You can't easily fill in just the missing data from P2P networks. You can't use RAR's internal recovery record. You can't unpack the existing data from multipart archives, even when the total number of lost articles is relatively small.

I know, I know. NZB files do not provide the length of the actual data, only the number of raw bytes. But you can still guess, and I believe some clients already do that quite well.

I've just had to pull 50 GB of data once again while I was missing less than 1 MB. This is really frustrating. I also see an increasing number of uploaders that intentionally do not provide any PARs, so one missing article trashes the whole batch.

Anyway, this has been brought up a few times on the forums with a lot of valid points, but I don't think it got any traction: /viewtopic.php?t=9851 & /viewtopic.php?t=15373.

@Safihre
Member

Safihre commented May 3, 2022

It's easily possible; the information is inside the yEnc header. That's how nzbget does it. I already made a proof of concept a year or so ago.
It's just quite a big investment and there was no really good reason to do it. Par2 repair works just as well in my tests with filled or unfilled data.
I understand your use case, but it's also a bit exotic.

A much better use case is that this would let us implement a proper Retry, where only the missing articles are tried again and filled in at the right spot in the file. That's the reason nzbget implemented it.

@thezoggy
Contributor

thezoggy commented May 3, 2022

There are tons of obfuscation methods: purposely not including PARs, leaving stuff incomplete on purpose (to force you to get the actual NZB from the source), to prevent site leeching, avoid DMCA and so on. How often are you getting these NZBs, and do you find it is only from one site?

I'd wonder how much actual benefit there would be when the majority of the time you just have the data to write anyway. Writing placeholder files is time-consuming and wears on the HDD.

Doing the zero prefill would also require additional accounting, since you'd have to know what has been filled and what hasn't, right?

@Safihre
Member

Safihre commented May 4, 2022

@thezoggy you're not actually writing data, you just call fp.seek(X). All the space between 0 and X simply won't be filled with anything, so there are no extra writes.
How that space actually gets filled depends on the OS, as far as I understand.
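
A rough sketch of what that looks like (made-up file name, article size and data; not the actual SABnzbd assembler code):

    # Hypothetical example: article 2 of 3 is missing, each article 384000 bytes.
    # Seeking past the current end of file and then writing leaves a hole; the
    # application never writes the missing range, the OS decides whether it is
    # stored sparsely or zero-filled on disk.
    article_1 = b"A" * 384_000   # stand-in for a decoded article
    article_3 = b"C" * 384_000   # the article in between is "missing"

    with open("part06.rar.tmp", "wb") as fp:
        fp.write(article_1)       # occupies bytes 0 .. 383999
        fp.seek(2 * 384_000)      # skip the slot of the missing article
        fp.write(article_3)       # lands at its correct offset in the file

    # The file ends up 3 * 384000 bytes long; bytes 384000..767999 read back as zeroes.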

@reallyuniquename
Author

True, implementing this would make proper Retry possible too.

the information is inside the yEnc header

I'm not familiar with NNTP. If an article is missing, where would the header come from?

@thezoggy
It's not obfuscation; the uploaders explicitly state they don't use PAR files. Those mostly come from one place, but let's not focus too much on that. It's more about extracting existing data and its structure. I've had enough failed downloads with proper PAR sets that couldn't be salvaged despite having most of the articles.

As for the zero prefill, it doesn't write anything if implemented properly; the OS handles that. I mean, full preallocation is not required to fill in missing articles, but it felt like a very related issue.

@Safihre
Member

Safihre commented May 5, 2022

If there's no article, indeed we can't parse the header.

Does it really need to preallocate the whole file? Why not add as we go, as long as the data is in the right location inside the file?
For example, if the file is 100 kB but the last article of 10 kB is missing, isn't it fine to just leave it at 90 kB?

@reallyuniquename
Author

No, full instant preallocation is not required. I guess I shouldn't have mixed these two features in one issue. But they both require accounting for the size of missing articles.

I'm guessing most upload tools use a fixed data length for every post in a single NZB. Can we use that to calculate the correct size when an article is missing? Would that work?

@Safihre
Member

Safihre commented May 5, 2022

I'm guessing most upload tools use a fixed data length for every post in a single NZB. Can we use that to calculate the correct size when an article is missing? Would that work?

All articles that are present say in their yEnc header exactly where their data belongs in the file. So if an article is missing, the next article will still tell us where its data should go, and we can start writing there:

      =ybegin part=41 line=128 size=49152000 name=90E2Sdvsmds0801dvsmds90E.part06.rar 
      =ypart begin=15360001 end=15744000 
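
Pulling the offsets out of those headers is straightforward. A rough sketch (simplified regexes and field handling, not SABnzbd's actual decoder; a real one also checks =yend and the CRC):

    import re

    YBEGIN = re.compile(rb"=ybegin .*\bsize=(\d+)\b.*\bname=(.+)$")
    YPART = re.compile(rb"=ypart begin=(\d+) end=(\d+)")

    def part_offsets(ybegin_line: bytes, ypart_line: bytes):
        total_size = int(YBEGIN.search(ybegin_line).group(1))
        begin, end = (int(g) for g in YPART.search(ypart_line).groups())
        # yEnc offsets are 1-based and inclusive, so this part's data starts
        # at begin-1 in the output file and is end-begin+1 bytes long.
        return begin - 1, end - begin + 1, total_size

    print(part_offsets(
        b"=ybegin part=41 line=128 size=49152000 name=90E2Sdvsmds0801dvsmds90E.part06.rar",
        b"=ypart begin=15360001 end=15744000",
    ))
    # -> (15360000, 384000, 49152000)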

@animetosho

Does it really need to preallocate the whole file? Why not add as we go?

There are benefits to pre-allocation. The exact behaviour can vary across filesystems, but they include:

  • space is reserved up front, so you won't get 'out of space' errors during download. It also helps other programs be aware of the intended disk usage, so they won't get unexpected 'out of space' errors either
  • reduced fragmentation
  • I generally find pre-allocation to be faster than the filesystem constantly growing a file (and calls like this tend to be fast)
  • PAR2 repair is slightly more effective when data is in the right place. PAR2 can deal with misaligned data, but there's some efficiency loss, and the mechanism it uses for that (CRC rolling) is slow enough that many clients limit its use
  • the file can be used by other tools, e.g. imported into a torrent client to fetch the missing parts

IMO, pre-allocating files is the most sensible approach for a downloader, if possible.
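
For reference, "fast" preallocation is usually just one call up front. A minimal sketch with a hypothetical preallocate() helper (posix_fallocate() reserves the blocks on Linux and most Unixes; elsewhere falling back to truncate() at least sets the final size without writing any zeroes):

    import os

    def preallocate(path: str, size: int) -> None:
        with open(path, "wb") as f:
            if hasattr(os, "posix_fallocate"):    # Linux and most Unixes
                os.posix_fallocate(f.fileno(), 0, size)
            else:                                  # e.g. Windows: just set end-of-file
                f.truncate(size)

    preallocate("example.part01.rar.tmp", 49_152_000)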

@reallyuniquename
Author

If preallocation gets a go, please make sure to use instant file initialization via SeManageVolumePrivilege on Windows. Otherwise you'd literally be writing zeros.

IMO, this and filling missing articles should be separate options.

@Safihre
Member

Safihre commented May 6, 2022

It's interesting, but as I wrote before, I don't see many convincing points (e.g. something that would benefit 50%+ of users) to implement it right now.
Maybe in the future.

@animetosho

If preallocation gets a go please make sure to use instant file initialization

Across other applications, I often see it offered as a choice between no preallocation, fast preallocation (using OS-supplied calls, or maybe seeking to the last byte and writing a 0) and "full" preallocation (explicitly zero-filling the file), with fast preallocation as the default.
Unfortunately, filesystems aren't consistent in how they behave. In particular, if the filesystem doesn't support sparse files, the OS may want to zero-fill the file for security reasons.

Of course, whether it's worth implementing is another judgement altogether. If I were building the system from scratch I'd definitely take this approach, but changing an existing system is a different cost.

@reallyuniquename
Author

Sparse files are usually what you want to stay away from, at least on conventional spinning rust: insane fragmentation, and extremely slow to read from.

I often see it offered as a choice between no preallocation, fast preallocation (using OS-supplied calls, or maybe seeking to the last byte and writing a 0) and "full" preallocation (explicitly zero-filling the file), with fast preallocation as the default.

I've yet to see an app that transparently explains this to the end user. qBittorrent still takes the heat for stalling for hours while it writes gigabytes of zeros on Windows with preallocation enabled. The reason is that libtorrent's call to SetFileValidData(), which allocates space without zeroing it, requires additional permissions that it never asks the user for.
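
For the record, this is roughly what that Windows path looks like (a hedged sketch, not SABnzbd or libtorrent code; it only works if SeManageVolumePrivilege is already enabled on the process token, otherwise SetFileValidData() fails and you're back to normal allocation):

    import ctypes
    import msvcrt

    def instant_allocate(path: str, size: int) -> bool:
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        with open(path, "wb") as f:
            f.truncate(size)  # sets end-of-file; clusters are not zeroed yet
            handle = msvcrt.get_osfhandle(f.fileno())
            # Raise the valid-data-length to `size` so later writes beyond the
            # old VDL don't trigger the lazy zero-fill that causes the stalls.
            return bool(kernel32.SetFileValidData(ctypes.c_void_p(handle),
                                                  ctypes.c_longlong(size)))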

@animetosho

animetosho commented May 9, 2022

Actually, the way you put it, sparse files are probably what you do want most of the time. Keep in mind that torrent clients tend to download in random order, whilst Usenet is largely (if not always) sequential, so a sparse file is no worse than having no preallocation at all.
(As for random-order downloaders, the actual implementation of sparse files matters: if the filesystem reserves the space but reads from unwritten sectors return zeroes, you don't get any additional fragmentation compared to zero-filling the file.)

Thanks for sharing the info though!

@Safihre Safihre changed the title [Feature Request] Fill missing articles with zeroes AKA preallocate files Preallocate files May 24, 2022
@puzzledsab
Contributor

This sparse file thing is much harder than I hoped. I modified newswrapper to extract ybegin and yend and tried to use them with fseek in assemble, as some examples seemed to imply I could. Unfortunately it creates broken files when I dump all available parts in every loop. Apparently you can't update data inside a file.

I don't think the files generated this way are sparse at all, or that it would work like this if they were. If they were, what happens to data that crosses sectors? I think maybe we would have to write data in 4 KB blocks and join sections of parts to make sure they fit properly. I'm using Windows and NTFS.

I have verified that the ybegin value I extract for each part is correct (after adjusting for yEnc quirks) by comparing it to fout.tell() when using the current append mode, and that the size calculated from ybegin and yend equals len(data).

Related to #2459

@mnightingale
Contributor

Hmm, Python has strange open modes; I can't tell which one means "open in binary for writing without truncating".
Looking at https://docs.python.org/3.8/library/functions.html#open, maybe "r+b".

Did you try that?
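
A quick stand-alone illustration of the difference ("wb" truncates on open and "ab" forces every write to the end, so neither can patch data at an arbitrary offset; "r+b" updates an existing file in place):

    with open("demo.bin", "wb") as f:       # create a 12-byte file
        f.write(b"aaaabbbbcccc")

    with open("demo.bin", "r+b") as f:      # reopen for in-place update, no truncation
        f.seek(4)
        f.write(b"XXXX")                    # overwrite bytes 4..7 only

    with open("demo.bin", "rb") as f:
        assert f.read() == b"aaaaXXXXcccc"  # the rest of the file is untouched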

@mnightingale
Contributor

mnightingale commented Feb 18, 2023

Also, regarding the 4 KB writing thing: it might actually be worth investigating turning the buffering=0 option back to the default, which handles writing in optimal sizes, or at least understanding whether it's ever advantageous to disable buffering.

@Safihre
Member

Safihre commented Feb 18, 2023

@mnightingale please take a look at the actual implementation of buffering in CPython; I did, and it's much more basic than you might think. There is barely anything smart about it, and the buffering limit is just a constant. Setting buffering=0 gives us a direct file pointer instead of all the useless overhead of buffering logic, which we don't need because we always write blocks much bigger (750 kB) than the buffering limit.

@animetosho

@puzzledsab Do you have some code we can look at?

If they were, what happens to the data that crosses sectors?

The filesystem APIs are supposed to hide the notion of sectors, so you shouldn't have to worry about that. Essentially, the data will get placed accordingly and split across sectors if necessary.

@puzzledsab
Contributor

@mnightingale: I did try it, but apparently not hard enough. It seems to work now.
