Working on major Par2 changes. Name? #130
Interesting! You seem to be interested in retaining most of PAR2 - I presume only changing the erasure coding used? What's the aim of your change? Just to improve encoding/decoding performance? Have you looked at MultiPar's efforts and ideas with a PAR2 successor? https://www.livebusinesschat.com/smf/index.php?board=417.0 As for your question, I feel a little uncomfortable with keeping the same extension. PAR2 is fairly old and well established, and as you've found, at least one client (possibly more) wouldn't handle such a change particularly well. Thanks for your work nonetheless; I'm interested to see how it turns out.
I have two goals. The first is to fix the bug in the Reed-Solomon code that has existed since Par1. I tried to fix this in Par2 by using only certain base values. That lessened the problem, but didn't fix it. My other goal is to start using low-density parity check (LDPC) codes, which are no longer covered by patents in the USA. LDPC is much faster than Reed-Solomon, with a slightly increased chance of non-recovery. There are many forms of LDPC; Tornado Codes are simple ones. If I do the spec right, repair will work for many LDPC codes, not just Tornado Codes. Then, people who implement the spec can play with multiple different encoding schemes. One of my restrictions is that I want to do this change in about a week's worth of work. I have some free time, but not much. Having looked at the code, I really want to spend time cleaning up the code and making it work like a library, but I don't have enough time to find all the bugs I would create by doing that. I think I can fix Reed-Solomon and add support for Tornado Codes in 40 hours. I've tried to understand the MultiPar author's efforts, but I haven't seen a spec for their "Par3". I don't think they are devoted to open source code. I designed the Par2 file spec to be forward-compatible - it was designed to add features. I'm sure the Par2 spec could be improved by a redesign, but I don't see a huge win for it right now. Certainly not without a large number of people devoted to implementing it and checking the code.
Thanks for the response and explanation! So it seems like this is more of just an experiment for encoding schemes rather than some sort of end-user oriented, production-ready PAR2 successor? Totally understand the lack of time. It's just that I don't think a PAR2 successor should really be rushed, particularly if the aim is to continue to support it for the next ~10 years or more. Parchive is often used for archival purposes (hence the name), so it's a format you generally don't want to be changing often, which is, unfortunately, often at odds with an experimental scheme. Not trying to discourage change or contributions in any way, but I thought it should be pointed out nonetheless.
"PAR3" was never really a finalised spec, only a proposal at most. The MultiPar author was experimenting with some new ideas, and decided to call it "PAR3" and included it in MultiPar. Unfortunately, people actually started using it for non-experimental purposes, so he had to pull it. Regardless, I mention this as they seem to be the only people still looking into a PAR2 successor (as well as one of the few actively developed PAR2 implementations), so thought you/they may be interested in such a proposal that you're making. I'm not too sure if you'll get much of a response here, but if you put up your proposed specification, I can share a link to it over there to get some of their thoughts. Again, I personally do appreciate the effort you've gone (and will go) to, and am interested to see what you come up with! |
Thanks! I tried to keep tabs on the Par3 discussions back then, but I never saw a specification. So their old proposed Par3 changed every packet. The changes include:
I'm not a fan of the last 4 changes - they increase code without specific benefits. In fact, using packet numbers rather than strings means it is difficult to add custom extensions and to debug things. Does anyone have an analysis of how much time par2cmdline (or another client) spends doing each task? E.g., how much time in CRCs, MD5s, inverting the matrix, etc.? If MD5 is taking up too much time, maybe we should replace it. Otherwise, it seems foolish to drop backwards compatibility for non-recovery packets. Does anyone have the "noe" program or "libnoe" library for computing Reed-Solomon with FFT? Or perhaps another implementation? I'm not tied to Plank's corrected version. Fixing that problem is certainly something I'm in favor of. As far as "an experiment" vs. "end-user oriented, production-ready PAR2 successor", I do not want to stall all development of a successor just because people use PAR2. We need to experiment. We should work on fixing Reed-Solomon and we should work on adding support for LDPC. We do need to be careful communicating to users what is untrusted. We also don't want to break "par2cmdline" for Par2, which is why I'm (1) not changing much, (2) adding unit tests and (3) having another developer review my code before it is merged.
Going out on a limb here (I haven't read the proposal), I would imagine that the typical approach here is to allocate ranges for custom extensions. I also imagine that ease of debugging was not a goal of the proposal either.
That's an interesting opinion. Personally, I don't see much benefit in retaining backwards compatibility just for non-recovery packets, seeing as the whole point of PAR2 is the recovery packets. I suppose there's some use with using it to perform hash verification, but I don't feel that this is the primary aim of PAR2, and imagine that other methods (e.g. just storing a hash) are more frequently used for cases where only verification is necessary. The ability to verify is also limited, because most PAR2 tools also check the recoverability if bad blocks are found, which obviously won't work if a new type of recovery block is introduced; I don't recall the PAR2 specification even allowing for new types of recovery blocks, so there's no reason or expectation for clients to handle such a change in an ideal manner either. You do mention that you have limited time to work on this, and hence want to minimize changes, which is totally understandable, but I see this more as a constraint rather than a benefit for minimizing changes.
This doesn't really address your concerns, but here's a list of issues the MultiPar author has with PAR2, written 5 years after the PAR3 proposal, that he'd like to resolve with a PAR2 successor. May give some ideas on the reasoning behind the proposed PAR3 (e.g. using numbers to maybe reduce the size of the PAR archive, further enhanced by compressing filenames?).
I only have experience with implementing a PAR2 encoder, so haven't looked much into decoding (so not much knowledge/experience implementing matrix inversion), and what I'm about to say relates to encoding. As RS coding is relatively slow, generally the time spent hashing is negligible. However, if you replace RS with something faster, hashing will become a larger part of the overall time to create/repair PAR2. As far as I'm aware, all PAR2 clients (other than ParPar) perform an initial hash pass to compute the MD5/CRC32 of each file/block, during which you're probably limited by I/O rather than CPU. NVMe SSDs, which are gaining mainstream traction, likely tip things the other way, though running MD5 in multiple threads is probably sufficient to keep I/O as the bottleneck. Another possible trick is to use a multi-buffer (SIMD) MD5 implementation, but I think ParPar is the only PAR2 client which implements that. CRC32 is generally very fast, and modern x86/ARM CPUs have acceleration for it. I think rolling the CRC and checking for matches, for finding misaligned blocks during recovery, can be somewhat slow though. Rolling can also be problematic if you have to re-compute the MD5 for every potential match, which can lead to a kind of DoS attack against a PAR2 client attempting recovery. Here's a discussion around hashing in PAR2 and successors. I/O can be a limitation when processing large amounts of data which cannot fit in memory, due to the need for certain seek patterns. This may become less of an issue with the dying use of rotating media for storage though. Note that par2cmdline isn't a particularly performance oriented implementation of PAR2. For most of its history, it's been a single threaded, non-SIMD, no-ASM implementation. There have been various forks which have added these performance features (e.g. par2-tbb, phpar2). It's only more recently that par2cmdline-mt was merged in, adding multi-threading to par2cmdline (but still no SIMD). MultiPar is more performance oriented, so obviously the author is more concerned about such issues with a proposed successor. For some rough performance figures on the fastest implementations I know of, off the top of my head, on one thread of a 4.3GHz Skylake-X CPU (supports 512-bit SIMD via AVX512):
Oh definitely! It's just that that wasn't clear from your first post, considering that this repository is more a user-facing tool rather than for experimentation, and the questions asked around the name (which sounds more like an issue if you're proposing an actual standard rather than experimentation) initially gave me the impression otherwise. Hence my asking and confirming. Thanks again for the reply!
I don't worry about intentional MD5 hash collisions or denial of service. We use MD5 for uniqueness, but I don't think our users rely on that. (Or, if they do, they probably know what MD5 is and its security issues!) I do worry about buffer overflows and writing system files. If we want people to implement Par3, it's easiest if they can just reuse the code from Par2. Why recode the wheel? I agree that par2cmdline should not be tweaked for performance. It should be a clear roadmap for how to implement the standard. I actually don't think it does a good job of that - par2repairer.cpp is 2750 lines. I guess it needs to be multi-threaded (or use fork?) because it is compute intensive and everyone has multiple cores nowadays. Is GF16 multiply 45GBps or 45MBps? If it's 45MBps, then it and I/O are our bottlenecks. Changing to LDPC might improve things. I'm not sure what I'll work on now. Maybe I'll just spend the week cleaning up par2cmdline. I hate modifying a working program without enough tests, because I may break an untested feature. (There are a lot of commandline options to test!) But a cleaner version would make it easier for people to copy the code and then improve performance of pieces. If I skip additions and just improve par2cmdline's code, there are things like:
As I said, these are huge changes and I'm loath to make them, because there's a good chance that I'll introduce bugs into a long-lived tool.
Multiplying a buffer by a constant can be done at 45GB/s using 512-bit SIMD. In other words, a single multiplication is 45GB/s. The implementation is based on Plank's "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions" paper (you can find his implementation here). My (much less readable) implementation extends his 128-bit SIMD code to use 512-bit SIMD (available on the latest Intel server processors), and adds a few small performance tweaks on top, effectively achieving around 4x the speed of Plank's implementation on CPUs with AVX512. Here's a benchmark I performed with some GF16 multiply techniques. The "lh_lookup" value refers to the technique of splitting the 16-bit multiply into low/high 8-bit multiplies in a 512-entry lookup table (basically the same method used in par2cmdline) - this is the fastest, portable, pure-C technique I know of, but as it doesn't use SIMD, it falls behind the other techniques on performance. "shuffle" refers to my implementation of Plank's method above, at 128/256/512-bit SIMD widths. "xor" is the technique of using JIT to generate XOR sequences linked to above, at 128/256/512-bit SIMD widths. "memcpy" is just the standard glibc implementation of memcpy. (Looks like I accidentally chopped off the X-axis label, which is the buffer size used; you'll find a similar graph in Plank's paper, but his implementation only uses 128-bit SIMD.) However, in reality, it's unlikely that users create PAR2 with only a single recovery block, which means that more multiplications need to be performed, proportional to the number of blocks to be generated. The example I gave was with 1000 recovery blocks, which means 1000x region multiplies for each input block, hence giving the (very rough) 1000x overall speed decrease (=45MB/s per core). Note that this speed was achieved on a very recent, high power Intel CPU with AVX512 running at a high clock rate. Most users do not have access to such powerful hardware, so the figures I'm giving are somewhat at the upper end of the scale of what's possible on a CPU today (a more typical x86 processor today would probably get half that speed). At the other end of the scale, a low power ARM CPU (with NEON (SIMD) support), for example, may only achieve around 300MB/s using the above algorithm for a single multiply. CPUs without SIMD units may perform even worse. So despite the various tricks you can pull for a highly optimised/tuned implementation of GF16 multiplication, they come at a complexity cost, and there are still fundamental limitations, hence a faster erasure code is definitely welcome. In particular, something that can grow at O(n log n) complexity, instead of O(n*n), would be useful, especially if we want to scale beyond 64k recovery blocks. Just to make sure you're aware, I don't work on or maintain the par2cmdline codebase, so can't really comment much on your suggestions (hence I won't be able to do much regarding the comment you made in your latest pull request). I'm mostly just an interested 3rd party who works on my own PAR2 implementation (which has somewhat different aims to existing PAR2 implementations - e.g. one thing I wanted was to be able to process data from streams rather than be tied to on-disk files). There isn't much development going on with par2cmdline these days though, so I would suspect that any contribution you're willing to make is valued. Your suggestions do sound like good changes nonetheless, and I appreciate you putting your time and effort into this.
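For concreteness, here is a scalar sketch of that 512-entry low/high lookup approach ("lh_lookup"), using PAR2's GF(2^16) generator polynomial 0x1100B. This is illustrative C++ only, not code from ParPar or par2cmdline:

```cpp
#include <cstdint>
#include <cstddef>

// Basic GF(2^16) multiply, reduced by PAR2's generator polynomial 0x1100B.
static uint16_t gf16_mul(uint16_t a, uint16_t b)
{
    uint32_t r = 0, aa = a;
    while (b) {
        if (b & 1) r ^= aa;
        aa <<= 1;
        if (aa & 0x10000) aa ^= 0x1100B;   // reduce when bit 16 appears
        b >>= 1;
    }
    return static_cast<uint16_t>(r);
}

// "lh_lookup" region multiply-accumulate: build factor*low-byte and
// factor*high-byte tables (512 entries total), then each 16-bit word
// costs two lookups and two XORs.
void gf16_region_mul(uint16_t factor, const uint16_t* in, uint16_t* out, size_t words)
{
    uint16_t lo[256], hi[256];
    for (unsigned i = 0; i < 256; ++i) {
        lo[i] = gf16_mul(factor, static_cast<uint16_t>(i));
        hi[i] = gf16_mul(factor, static_cast<uint16_t>(i << 8));
    }
    for (size_t i = 0; i < words; ++i)
        out[i] ^= static_cast<uint16_t>(lo[in[i] & 0xFF] ^ hi[in[i] >> 8]);  // XOR-accumulate, as recovery buffers do
}
```

A real implementation would cache the tables per factor and vectorise the inner loop; this only shows the shape of the technique.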
All changes have such risk, but without changes there is no progress. You sound like you're fairly cautious already, so I wouldn't worry too much about it. All software has bugs - if one crops up, someone just needs to fix it. Yutaka Sawada (MultiPar author) made a comment regarding your initial proposal. I've copy/pasted it below:
On Tue 30 Apr 2019 at 07:35:08 -0700, Michael Nahas wrote:
- create a par2 library and have par2cmdline process commandline args
and then call the library.
That would be quite useful!
Currently there is a tool that I sometimes use (nget, an nntp downloader
with automatic use of par2 files) which includes a copy of par2cmdline,
but modified a bit. The tool is also abandoned, but I cloned it to
gitlab: https://gitlab.com/Rhialto/nget .
It would be great if I could modify it to use a proper library
(eliminating the copy and the annoying modifications).
Thanks @animetosho for the GBps / MBps explanation. memcpy is a good reference for speed. And thanks for forwarding the comment on "matrix inversion" from Yutaka Sawada. As part of a job, I've actually tried to invert large matrices and I found that anything bigger than about 400x400 was impossibly slow. I'm not sure we should use the term "Reed-Solomon" if we're not actually doing Reed-Solomon. I've been thinking of renaming the class to "reedsolomon_broken". I find it interesting that people are changing filenames after creating Par2 files. That isn't something I anticipated when writing the spec! Currently, the spec says that clients shouldn't try to verify how the "recovery set ID" is calculated. We could amend it to say that they shouldn't try to verify how "file IDs" are calculated. We did an informal amendment to a version "2.1" which allowed clients to use UTF-8 in filenames (instead of pure ASCII). @animetosho The streaming idea is interesting. Are you using the optional packets that allow you to include slices of the input file?
Yeah, there are all sorts of things people out there want to be able to do. Another thing is being able to add files to, or update, a PAR2 archive. Currently this requires recomputing the whole thing due to the strict file ordering determining the coefficients used for computation. When you get down to it, there's a lot of things that Parchive could be doing but currently can't in PAR2, which is why I personally think a major overhaul of the format is warranted.
I am not including parts of the source file in the PAR2 archive. Streaming is more in the sense of being able to pipe data in from memory, as opposed to using PAR2 as some on-the-fly FEC for lossy network transmission or the like (though you could do that too). For example, @Rhialto gives the example of a usenet downloader. Currently these write all downloaded files to disk, including PAR2 files, then perform verification/repair. If those could be performed somewhat on-the-fly, whilst the download is occurring, you can save time over doing separate processes, whilst also reducing disk I/O (and even avoid writing PAR2 files to disk entirely). Unfortunately, existing PAR2 implementations are too closely tied to the notion of on-disk files to be able to do this.
If you can collect the use cases that Par2 doesn't work well in, I can take a look at how the design might change. I think we can handle renaming files by just not requiring clients to confirm the File ID. If we want to handle single-byte changes in files, like the proposed Par3 spec you forwarded, that does require big changes. Do people really want >32,000 blocks? Each block adds overhead (CRC32 and MD5 checksum). I mean, it's very doable, but I'd want to shift to something like LDPC, which has less computation per block. Streaming is possible with minor changes. You need some initial information to initialize the state in the reader. That would be (1) input file descriptions, checksums, etc. and (2) the number of recovery buffers to allocate and their coefficients. Then, as data is downloaded, you need to identify which input slice it is, write it to the correct place in the file, and subtract its effect from each of the N recovery buffers. Lastly, when all the data has been downloaded, you need to identify the K slices that need recovery. If K < N, download the recovery data for K slices, add it to the appropriate recovery buffer, solve the matrix, and write out the recovered data. If you assume the initial information is the no-recovery-slice Par2 file, then the only thing we are missing is the number of recovery slices and their coefficients. An adventurous client could guess those from the names of Par2 files. Or we could put that data in the Par2 files. That is, a new packet that is a hint to the repair client about the number of buffers to allocate. Streaming would use a lot of CPU and storage unnecessarily, but it would be cool to try. I like that the downloading program would know exactly where to put the data blocks and the client wouldn't need the logic to rewrite the data files.
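A rough sketch of that accumulator flow in C++, with hypothetical names (the coeff layout and the gf16_region_mul helper stand in for whatever scheme the spec ends up defining; this is not par2cmdline code):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Assumed helper: GF(2^16) multiply of a region by a constant, XORed into out
// (e.g. the lookup-table sketch earlier in this thread).
void gf16_region_mul(uint16_t factor, const uint16_t* in, uint16_t* out, size_t words);

struct StreamingRepair {
    size_t slice_words;                        // slice length in 16-bit words
    std::vector<std::vector<uint16_t>> coeff;  // coeff[r][i]: matrix entry for buffer r, input slice i
    std::vector<std::vector<uint16_t>> acc;    // N recovery accumulators, zero-initialised

    // Called once per input slice as it arrives from the download stream.
    // In GF(2^16), "subtracting" a slice's effect is the same XOR-accumulate
    // used when encoding.
    void on_input_slice(uint32_t i, const uint16_t* data) {
        for (size_t r = 0; r < acc.size(); ++r)
            gf16_region_mul(coeff[r][i], data, acc[r].data(), slice_words);
    }

    // After the download: each accumulator equals its recovery slice minus the
    // contributions of every slice that did arrive.  XOR in K downloaded
    // recovery slices, then solve a KxK system for the K missing input slices.
};
```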
I'm not sure about collecting use cases, but some ideas are listed here.
The overhead is relatively small for larger blocks, e.g. 1MB blocks. However, even at 1MB blocks, 32k means you can't have more than 32GB in a Parchive without increasing block size. Though not common, it's not too unusual to have >32GB files today. And if we're thinking about a format which is to last the next 10-20 years, it can feel a little limiting.
As you alluded to, it's already possible (most of the time) without any change, though the decoder may need to be a little clever about it. Knowing how many recovery blocks are available may actually not be necessary, and in fact, may be largely irrelevant (at least for the downloader use case). The remaining issue is the recovery coefficients. These could be guessed from the filenames, as you pointed out, though you could also just guess the first few coefficients typically used as there's generally little reason that an encoder wouldn't use those. Encoding in a streaming fashion is a little easier than verify/repair, since the create tool has full control over the process. The primary limitation here is being able to fit all recovery blocks in memory. (note that the ParPar project wasn't aimed at changing PAR2, rather, at working within its bounds) Yutaka Sawada added another note:
My recollection is that UTF-8 support was discussed by Peter B. Clements and a bunch of us and approved. I may add that to the "spec" section of the new webpage as a 2.0.1 spec. The list of use cases includes:
I'll have to think about "repair Par2 files" as a use case. I understand the desire for it. If you're transmitting files, you want to restore them to the original state before re-transmitting them. If you've done a backup, you'd like to restore every damaged disk image to the original state before re-burning it. I've usually assumed clients have the freedom to determine how many times to repeat the un-recoverable packets (and their placement in the files). This can probably be accomplished by a "suggested file layout" without a strict requirement. I don't know about fixing small errors, like adding/dropping/changing a byte. I'm just not sure how to do that - especially adds and drops. It feels like a separate file format should be used for that. I've already discussed how to stream with Par2. It takes some guesses, but I think it would work in practice. The only change I could even suggest was a packet that hinted at the exponents of recovery packets and, even then, it would only be a hint. The big stumbling block is the changing filenames and incremental changes. The easiest fix here is just to open up more freedom in the Par2 standard. Do not require (always) that the FileIDs be listed in sorted order in the main packet. We can consider if we also want to allow FileIDs or the RecoverySetID to be something other than their current definition, but changing the main packet order is necessary and sufficient. Those changes can be done by a "Par version 2.0.2" amendment. And we could include the suggested file layout with it too. There are reasons to go to a "Par3": more blocks, fixing Reed-Solomon, and replacing MD5 with a faster/better-suited hash. I think adding LDPC is the easy way to get to more blocks. And it would keep the Reed-Solomon matrix small, if we used Plank's correction, which requires sloooow Gaussian Elimination. (I don't understand the FFT version yet.) But simple changes to Par2 might get us a few more use cases for people.
Yeah, those small changes to PAR2 could definitely open things up a little, and there's a good chance that existing PAR2 clients would already be fully compatible with them anyway. I found a page where someone else was investigating a PAR2 successor, which may have some ideas+thoughts (haven't really read through it). Some ideas I just thought up for a "PAR3" (may not be a good idea, just mentioning them):
User maosredbrook gives the interesting idea of a RAR-esque usage of storing input files in Parchive with optional encryption. PAR2 already supports embedding input slices, though I'm not sure how useful that feature is. Another option is that, since decoders have to scan for packet headers, one can embed (concatenate onto the end) a PAR2 archive into another file. This can be used to, say, tack on recovery information to an archive which doesn't support repair (e.g. 7z), and the file still remains usable (as long as the application doesn't make assumptions on what's at the end of the file). "Extracting" the file would just be like stripping off the PAR2 packets off the end. Yutaka Sawada writes:
My understanding of the packed slices feature is that it doesn't increase the number of available slices in any way; rather, the only thing it does is allow input files to be more tightly packed, since they don't have to be padded to a full slice (only to a subslice). If my understanding is correct, one could always just set the subslice to 4 bytes and gain maximum efficiency with packing files, with little downside (other than some additional complexity in decoders?). Adding subslice hashes could make the feature more useful as smaller erasures can be identified, and actually gives a penalty for using tiny (i.e. 4 byte) subslices. It's not quite the same as enabling more slices overall however (which can be more useful in cases when you're encoding input of unknown length).
The comment about removing files (and leaving some coefficients missing) is a good note. I hadn't thought of that. 0-length files are within the spec. Clients can put them either in the recovery set or the non-recovery set. Either way of doing them is fine. The non-recovery set was only added for backwards compatibility, so I'd prefer including them in the recovery set. The only bad way of handling them is the way that par2cmdline does - which is not at all. They get ignored! As for compression, encryption, permissions, many files, etc., I'm an old UNIX junkie, so I think there are many other programs that handle those better: gzip, gpg, tar, etc. I'd rather have someone run many tools that each do one thing well, rather than build it all into Par. I really wished some client had implemented the "input slices" optional feature. Many large files are already compressed. Many posters on Usenet were using RAR to split up a file and not for compression/permissions/etc. I was really hoping a Par client would eliminate the use of RAR. But, as before, I wouldn't include encryption. There are other programs that do that better. ... if we do a Par3, what do you think of making that a required packet type?
Which, in my opinion, is completely pointless. No-one has actually been able to give me a sound explanation of why files need to be split at all, as yEnc already splits up files into articles, and another layer of splitting seems completely unnecessary. Splitting can have uses if putting files on fixed size media (CDs/DVDs) and the like, but I don't see any requirement for using RAR to do that, particularly if you're not compressing. I think the Usenet case is more due to legacy and users not updating their knowledge or processes, rather than anything which still makes sense today on a more technical level.
I suppose the packet is useful if someone needs it, but I'm struggling to see an actual use case where it's particularly beneficial. Embedding source files turns Parchive into an archive format, at which point you're competing with TAR/ZIP and the like for functionality, without many of their features (timestamps, permissions etc). Various other archive formats do support embedding as well (e.g. for self-extracting executables), but because PAR2's input file slice packet requires the file to be stored non-contiguously (broken into slices), it won't be recognised without an application extracting the files from the PAR2 itself.
For anyone tracking this thread, I just pushed a HUGE change. It creates "libpar2", a library that anyone can include in their code. The commandline program, "par2", just calls the library. The pull request is here, with a description of all the changes. It fixes a number of bugs (including one where the behavior didn't match the spec!) and adds minor features. I only have two computers to test on and neither is a Windows machine. I'd love it if everyone could download the code and make sure it compiles on your machine.
Nice work!
I'm done with all changes to this pull request. It's ready to merge, if someone can run it with Microsoft Visual C++ and run the tests on Windows.
Compiling seems to fail on MSVC 2017 with linker errors. Adding src/libpar2.cpp/.h to the project (par2cmdline.vcxproj) fixes this and it compiles fine.
@animetosho Thank you for running it!! @animetosho Did the last test run? The && operator will only run the program on the right if the program on the left exits with code 0. Maybe you could do: I'll see what I can do to the unit tests to make it clearer which ran and whether they succeeded or failed. Can you attach your par2cmdline.vcxproj, so I can add it to the fork? I'm assuming it is just adding 2 lines in the obvious location, but that's just a guess. If you can't attach the file, commit it locally, run "git format-patch -1 HEAD" and then post the contents of the patch it generates.
All the tests did run through like that, it's just that I don't know whether a failing test still returns 0 or not. I've attached the project file. It just adds the two files, and doesn't attempt to, e.g. build a separate library and CLI executable. Compile log:
I merged your Microsoft project file. I fixed the code for the warnings - one was a bug I had introduced. The && operator in bash checks the exit code. It will only run the code on the right if the program on the left returns 0. (Which is what each test should do if it passes.) So, "echo SUCCESS" is only run if every program to its left had an exit code of 0. (If you want to run two commands sequentially and always have the second run, independent of the exit code of the first one, I think you use the semi-colon ; operator.)
I started looking into Tornado Codes. I think there is an error in the paper which affects the code's performance. That is, the codes still work, but a typo makes the codes look much better than they are in practice. The error is on page 7 of the paper, in the 4th paragraph of Section 7, which starts "The left degree sequence..." You start by choosing d. The higher the value of d, the more space-efficient the algorithm, but d+1 must be less than n, the number of message symbols (a.k.a. input slices). In the paragraph, he defines H(d) and says it approximates ln(d). Then he gives the formula for lambda, where lambda_i is the portion of message symbols (input slices) that are associated with i check symbols (redundancy slices). So, lambda_2 might be 20%, which means 20% of input slices are associated with exactly 2 redundancy slices. The formula for lambda_i is 1/(H(d)(i-1)). All that looks correct. The problem is the next sentence, "The average left degree a_l equals H(d)(d+1)/d." I think that is wrong. The formula to compute the average is the sum over i of i*lambda_i, which works out to d/H(d) + 1. When d=100, H(d) is about 5.18. Luby's value is 5.24. My calculation is 20.28. For larger values of d, the separation gets wider. For d=10000, Luby's value is 9.79 and mine is 1022. Why is this important? Because the average left degree has to equal the average right degree, and the average right degree is basically "how many input slices are used to create each recovery slice". If we want Par3 (or Par4) to be disk efficient, we want a large value for d. But if we want it to be fast, we also want each recovery slice to be computed from only a few input slices. And if my formula is correct and H(d) is approximately ln(d), then that value goes up with d/ln(d), which is a big value. I've included my code in this post. At the start is a big comment describing each variable name in the paper. That may help if you try to read the paper. The actual code to compute H, lambda, and the average is about 60 lines. tornadocode_standalone.cpp.txt I'll spend another day looking at Tornado Codes, but we might look into other LDPC codes.
I was wrong. The paper used a strange notation. lambda_i was not the portion of nodes with out-degree i, but the portion of EDGES that are connected to nodes with out-degree i. So I was multiplying by i when I should have been dividing. For 1,000,000 input slices, the average number of input slices used to compute each recovery slice is 14. Which is the nice low value that we want!
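To spell out the arithmetic under that reading (lambda_i = the fraction of edges attached to message nodes of degree i), the paper's average does come out right:

```latex
\lambda_i = \frac{1}{H(d)\,(i-1)}, \qquad \sum_{i=2}^{d+1}\lambda_i = 1,
\qquad
a_\ell = \frac{1}{\sum_{i=2}^{d+1} \lambda_i/i}
       = \frac{H(d)}{\sum_{i=2}^{d+1} \frac{1}{i(i-1)}}
       = \frac{H(d)}{1-\frac{1}{d+1}}
       = \frac{H(d)\,(d+1)}{d}.
```

The middle sum telescopes; for d=100 this gives about 5.24, matching Luby's number above.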
I've been looking into Tornado Codes and it doesn't look good. I'm generating codes with 100 to 32000 inputs and 20% to 5% recovery data and seeing how often they can recover the original file. For 100 input blocks and 20% recovery data (20 parity blocks), in 1% of cases we cannot repair even 3 or fewer errors. Some graphs encountered errors where they could not repair 1 block. On average, we can repair 13 errors, which is far fewer than 20. For larger numbers, I looked at 10,000 input blocks with 5% recovery data (500 parity blocks). In 1% of cases, we cannot repair 84 errors. (The worst I saw was 42 errors that it could not repair!) On average, we can repair 415 errors. I've seen a paper that tries to fix some of these problems with "small" graphs, by making sure every input has trivial coverage and that parity blocks don't overlap. But I also found a paper that says Tornado Codes have a high overhead until you get above 100,000 blocks. I had already been thinking that Par3 (or whatever the next version is called) should support all linear codes. That is, any matrix of inputs to recovery blocks using GF(16 bit). The nice thing about that approach is that it would support Tornado Codes, Reed-Solomon Codes, and some others. I'm now thinking that we may need to find our own style of linear code. Tornado Codes don't do well with "small" numbers of input blocks because they use XOR. We might be able to get some better performance using GF(16 bit) operations. I haven't seen any paper doing that. So, this may involve some research. I've attached my test program, if you want to see what I'm doing. "a.out 10000 5 100" creates a graph with 10000 inputs, 5% recovery data, and simulates recovering from damage 100 times.
This might be the paper we want. It says that an n-by-n random matrix where each row has log(n) non-zero elements usually has rank n minus a constant. And if we use GF(q), the probability that the rank is less than n-k is some constant times 1/q^k. In fact, we're probably using this theorem right now. Because of James S. Plank's mistake, we aren't using a Vandermonde matrix but some other matrix. That is, a random-ish matrix with at least log(n) non-zero values per row. My thinking is that we can build recovery out of random matrices that are n-by-n whose rows have log(n) non-zero elements, which have rank n-k, and add a few rows that have n elements, derived from the Vandermonde matrix. Then we'll only have wasted k parity blocks and gotten n/log(n) times faster. This doesn't avoid the fact that inverting a large matrix is expensive. Tornado Codes do nicely because their recovery mechanism only looks at one recovery block at a time. I'll see what I can do. Perhaps we can build a hierarchy of small random matrices.
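As a concrete picture of what one such row could look like, a throwaway sketch (the RNG and the exact row weight are placeholders, not a proposal):

```cpp
#include <cstdint>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Illustrative only: one recovery row as roughly log2(n) (column, coefficient)
// pairs over GF(2^16), for n >= 1 input slices.  A real spec would have to pin
// down the generator and the field exactly; std::mt19937_64 is just a stand-in.
std::vector<std::pair<uint32_t, uint16_t>> make_sparse_row(uint32_t n, std::mt19937_64& rng)
{
    const size_t weight = static_cast<size_t>(std::ceil(std::log2(static_cast<double>(n))));
    std::uniform_int_distribution<uint32_t> col(0, n - 1);
    std::uniform_int_distribution<uint32_t> coef(1, 65535);   // nonzero field element

    std::vector<std::pair<uint32_t, uint16_t>> row;
    while (row.size() < weight) {
        uint32_t c = col(rng);
        bool seen = false;
        for (const auto& e : row)
            if (e.first == c) { seen = true; break; }
        if (!seen)
            row.emplace_back(c, static_cast<uint16_t>(coef(rng)));
    }
    return row;
}
```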
For those following along, I noticed that Yutaka posted another comment a while ago:
I'm still looking at Parchive Version 3. I hit on a crazy off-the-wall idea. I decided to look at other file formats. Wikipedia has a list. Many are proprietary or single-platform, but a few are not. An interesting one is the Java Archive file with the ".jar" extension. That format, if you don't know, is just a ZIP file. It just has a different extension and the only difference is that it contains specific files (e.g., "META-INF/MANIFEST.MF"). I had the crazy idea that it might be useful if Parchive Version 3 also fit another file format. A likely candidate is "tar". The tar file format consists of blocks of 512 bytes. The original tar file had 3 kinds of blocks. Header blocks contain the file name, file size, file type, permissions, owner, etc. Data blocks hold the content of files. The last kind of block is the end-of-file indicator, which is just filled with zeros. The tar file format was upgraded in 1988. The new USTAR file format added a magic value ("ustar\000"). It also changed the file type ("normal", "hard link", or "soft link") to support 256 types of header blocks and reserved some for custom extensions. We could put Parchive data either into files with special names or into header blocks marked with a custom extension. I'm definitely saying this is a crazy idea. But I found out that "tar" doesn't contain checksums for its files, only a checksum for the data in the header block! You would think that a file archiver would do that, but it doesn't. "tar" is out of date. Reusing tar's format provides an easy way to include file permissions and file data into the Parchive format. Another benefit of reusing the tar format (especially if we did it using files with special names) is that we can debug using the tar program, and anyone who doesn't have access to install a Parchive 3 client could still get access to the files using the common utility, "tar". There are many arguments against this idea. I've always thought of Parchive as a unix-pipeable program. I always thought of users doing something like "tar files* | compress | par > backup.tar.Z.par". Reusing the tar format doesn't really make sense in that sequence, because par really should only be focused on one thing, redundancy, and other programs like "tar" should be focused on archiving (grouping files, storing permissions, etc.). The other argument is that we don't plan on implementing all of "tar"'s file format right away. It adds a lot of headaches for client writers. We might mess up some detail of "tar" and end up with a wacky file format that isn't actually compatible. We should specialize in redundancy, which we're good at, and leave "tar" people to fix their own problems. I've pretty much argued myself out of using the tar file format. It is far easier for Parchive version 3 to just reuse the message format that I designed for version 2. But the tar format is an intriguing option and I thought I should put the idea out there and see if it sparked any discussion.
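For reference, this is the classic 512-byte ustar header layout (sketched from the POSIX description, not from any Parchive document):

```cpp
// POSIX "ustar" header: exactly 512 bytes, numeric fields stored as octal ASCII.
struct UstarHeader {
    char name[100];     // file name
    char mode[8];       // permissions
    char uid[8], gid[8];
    char size[12];      // file size, octal
    char mtime[12];
    char chksum[8];     // checksum of this header block only (not of the file data)
    char typeflag;      // file type; a range of values is reserved for vendor extensions
    char linkname[100];
    char magic[6];      // "ustar" followed by NUL
    char version[2];
    char uname[32], gname[32];
    char devmajor[8], devminor[8];
    char prefix[155];   // path prefix for long names
    char pad[12];       // padding up to 512 bytes
};
static_assert(sizeof(UstarHeader) == 512, "tar blocks are 512 bytes");
```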
New draft of the specification. Big changes:
I reordered a few things. The rolling hash is still CRC-64-XZ. This is because it has a clear, available specification. The specification for CRC-64-ISO costs money, and it has been "withdrawn". (I'm guessing that means superseded by a new specification.) The closest things to a specification for it are this page and this page. If you can find a good spec for CRC-64-ISO (or another hash), I'll change it. If you have simple code to define the CRC, I would probably accept that. The 16kB hash is now the rolling hash, instead of the fingerprint hash. I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes. Most filesystems only support 255-byte names. I changed the length of a path to 2 bytes. In Linux, paths are limited to 4kB and, in Windows, to 260 bytes. I changed the length of the owner name and group name to 1 byte. The maximum allowed length on Linux is 32 bytes. I am very happy with the specification. I feel like most of the pieces are in place. After your comments, I am likely to take "UNFINISHED" off the front of the specification and proclaim this the "BEFORE-REFERENCE-IMPLEMENTATION" version and put the specification on the website. Obviously bugs will be found and worked out during the first implementation of it. But I think we're at the point that we go to code. BTW, I'm going to be starting a new business next week. I won't have time to write the reference implementation. I will be available to review code and write some unit tests. Does anyone want to take responsibility for writing the reference implementation?
Note that 255 bytes under one encoding may end up longer than 255 bytes when converted to UTF-8. Particularly likely in encodings which use 2 bytes/character that end up as 3 bytes/character in UTF-8 (such as some East Asian character sets).
Windows actually supports 32767 UCS-2 characters (65534 bytes) in paths via UNC naming. I think some UCS-2 characters can map to >2 byte encodings in UTF-8, but I doubt exceeding 64KB in a path name is likely.
Congrats on getting this far, and hope the business works out well!
As Mr. Anime Tosho wrote already, UTF-8 encoding may exceed the limit. I tested filename length on my PC (Windows 10). I can set a maximum of 239 Japanese Unicode characters for a filename with Windows Explorer. Each character is encoded as 3 bytes in UTF-8, so the UTF-8 encoded filename length becomes 239 * 3 = 717 bytes. Thus, the filename length field requires 2 bytes.
I tried to compare the speed of hash algorithms on my PC. While BLAKE3's official implementation works well, I could not compile KangarooTwelve's sample code on Microsoft Visual Studio 2019. Does someone know where to get KangarooTwelve's optimized source code for MSVC? Anyway, BLAKE3 is much faster than MD5 when SSE2/SSE4.1/AVX2/AVX512 are available. BLAKE3 is 5 times faster than MD5 on my PC with AVX2. As for the rolling hash, CRC-64-XZ is fast enough with CLMUL (the carry-less multiplication instruction set). Because most current Intel/AMD CPUs have this feature, there would be no problem. I put an example of the speed difference on my PC below. Note, speed varies by CPU extensions and SIMD support. This result is just a sample from a recent CPU. (It may differ on older or newer CPUs.)
Even old CPUs can get good speed. I have a 9-year-old CPU and get these numbers:
Of course, all of these are for the non-rolling variant. I don't know if the rolling hash can be sped up (it's 240 MB/s with table lookup).
btw, will Par3 support only the slow matrix-based Reed-Solomon, or also the faster FFT-based approach?
As a file format, it's designed to support any Galois field and any code matrix. But someone needs to write source code to use FFT-based Reed-Solomon codes for PAR3. Though there are some papers about the technology, I could not understand the math. Actually, I cannot even understand the current recovery codes in the PAR3 spec. (It's hard for me to read mathematical terms in English.) That is why I cannot write a reference implementation of a PAR3 client.
The "any code matrix" part is what I don't understand. I'm able to implement FFT based Reed-Solomon but I can't see any matrix there. (But it is possible it's hidden somewhere in the algorithm as I don't really understand it.) |
They give some info here. Basically you need to use the As they mention, assembly optimisations aren't available on MSVC because they're written for GCC. Perhaps it's possible to assemble these into object files, then disassemble into MASM format, but I haven't bothered checking that. It looks like BLAKE3 internally uses a tree, which means that it can scale across cores without issue. This gets around the issue of a single-threaded hashing bottleneck that currently exists in PAR2.
Hi folks, I've been gratefully watching your work for some time and am eagerly awaiting the PAR3 spec to solve some of the problems we're facing in the space where I work (scientific computing). Our use case is that we have millions of files (mostly video and audio from sensors - e.g. TIF, JPEG, MP4) in a dataset, and the dataset grows slowly over time - for example, our botanical scientists have over 1 million ultra-high resolution photos of plants (around 500MB for each photo), which are used to identify changes in plant genetics. We have multiple similar use cases, and adopting a normal cloud service for this (e.g. AWS S3) would be totally cost prohibitive. We currently store this data in a messy combination of onsite NAS and tape storage, and we want to try adopting Usenet as a place to store and share these huge datasets. But error correction with PAR2 on a 500TB file collection, with a 700K article size and a 32768 piece limit, is obviously very inefficient - each PAR2 piece would span approx. 21798 Usenet articles! So I was excited to read the spec as posted by @mdnahas (thank you very much for your wonderful work!) but it does seem, firstly, that there is an extension of scope here that makes PAR3 harder to implement. Commenting against your design goals from line 40, the below 'make sense' to me (as a non-expert!) as an improvement of PAR2 as it is used today for Usenet binaries:
However the below seem to be new features that don't fit in with the Usenet PAR2 workflow. To comment on each from the perspective of my use case (and indeed, anyone who wants to create redundancy on a large number of files in one PAR3 set, like the poster above who wants to back up 100K images):
I appreciate that there are going to be users of PAR3 who do have use for the above features! But are they the users that the developers wish to address? They shouldn't feel the pressure to address them if they don't want to. Instead, if one wants to create a new archive format that captures more of the "definitions" (e.g. archiver, checksummer, grouper, etc. as described prior by @mdnahas) it would be extremely welcome work, as it would remove the need for multiple file passes and could include support for things like on-the-fly error correction of video streams - all very welcome, but probably outside the user-perceived scope of PAR3 :) Please forgive me if I have misunderstood any of the above, and my apologies if this post is unwelcome. Thanks again!
Yes, this is the problem. They might forget to include the AVX2noAsm and AVX512noAsm version code in the package. The eXtended Keccak Code Package seems to contain them. Thanks, Mr. Anime Tosho, for the advice.
I thought so, too. Users' voices are welcome. Thanks, Mr. Dimitri Vasdekis. Because I'm a lazy programmer, I felt that a small improvement for current usage might be enough. I'm not sure that I will be able to implement some of the new features in the PAR3 specification. But I will try, when users request a feature in the future.
@dvasdekis writes:
I'm glad Parchive is being considered. I love hearing about users and new use cases.
An example of incremental backup would be: you have 100 files and protect them with File1.par3, and then add 10 more files and protect all of them with File2.par3. A client will be able to read both File1.par3 and File2.par3 and use them to recover the 110 files. Obviously, File1.par3 will not protect any of the 10 new files. The incremental backup case makes sure that you can use older Par3 files with newer ones. You can get by with a smaller File2.par3 because File1.par3 exists. I suppose a client could read File1.par3, compute the effect of the 10 new files, and write File3.par3, which would protect all 110 files. But that is not the "incremental backup" case. That, in fact, could be done with Par2 right now.
Yes, if you want to compress all your files first. If you want to protect files on a USB drive and have the files uncompressed, the feature is valuable. So, the feature is not useful for the Usenet use case. That is not the only use case.
That is a longer discussion. You can find it discussed above in this issue thread.
You're assuming someone compresses first.
Yes, it is messy. It may not add value to your use case, but it does to people protecting a USB drive.
Now, you're assuming the receiver will run a script after receiving the data?? Yeah, I would never do that.
If you want to send a file that works both as a Par3 file and a ZIP file. That is, one where if a user only has ZIP, they can unzip the file. But, if they have a Par client, they can fix the damaged file and unzip it.
Normally with Par2, you send the original files and a .par2 file. In this usage, you only sent a .par2 file, but the data for all the original files is stored inside the .par2 file. It makes .par3 behave like most other archive file formats. The feature was optional in Par2. I was hoping that clients would use it and that Usenet would stop using RAR to split files. With a compressed video file, RAR compression does almost nothing --- people were only using RAR to split files into pieces. I had hoped that it could be done with Par2. In Par3, I made the feature mandatory.
First, I don't think supporting any Galois field is very complicated. Second, picking a "sane default" is hard. Lastly, some algorithms (like FFT Reed Solomon) want to choose a particular Galois Field.
This offers the trade-off of more protection vs. speed. Again, I think picking a "sane default" is hard. Also, it allows future development of algorithms.
My goal has always been to expand Parchive's use cases. If you read the design for Par1, you'll see it was very, very tied to Usenet and to RAR. Par2 is used in other places besides Usenet. I'm actually disappointed by how limited the Par3 design is --- I wanted it to be used for many more things. But, because the data can be stored external to the Par3 file, the names of the files containing that data need to be readily accessible in the format. It actually took me a while to design Par3 because I struggled against that constraint.
I had hoped to do that. Unfortunately, if I wanted Par3 to still be useful to Usenet, it couldn't be done. It will have to be done by some other designer with another file format. It really should be done --- "tar" is old and crusty.
Thanks for caring enough to comment. As I said, it's always interesting to hear from users and hear new use cases.
I personally can't vouch for this being a great idea, but you're welcome to try I guess. Concerns are mostly around long term viability, but if it's just a backup copy (or just for distribution), it's probably not a big issue.
The feature largely acts like separate PAR archives, where one declares another to be its parent, so it's as possible as it is to create completely separate PAR sets. There's downsides to the approach, such as redundancy not being shared across the whole archive, and the structure could get unwieldy if it's extended a lot (since it's essentially a linked list). A client could always choose to sidestep the feature and just update the original PAR3 file instead, which wasn't really possible under PAR2 (PAR2's strict file ordering made it infeasible to do in most cases).
Remember that this is just the specification, which defines the maximum extent of what is allowed. You'd ultimately be interacting with a client, which will be responsible for catering for your use case (as long as the specification doesn't restrict its ability to do so). If you don't want symlinks/hardlinks to be represented, ideally you'd tell your client such (if its default behaviour isn't what you want) and it'd just ignore those attributes when creating the PAR3.
An example might be a program which creates an archive of files with redundancy (like the "new archive format" you mention later). Currently this requires two steps, and creates multiple files, whereas the change could allow for it to happen in one step, and bundled into a single file.
Again, it'll be up to the client to determine how this is handled/presented to the end user, the specification just enables the possibility.
It supports all the features that PAR2 has, so other than implementation complexity, it theoretically shouldn't be any worse :)
RAR is mostly used because people are stubborn and blindly follow 20 year old tutorials/advice which recommend doing it. Embedding files within PAR isn't helpful to the Usenet use-case (in fact, it'd be detrimental, because downloaders only download PAR files if they detect damage).
Wow! @animetosho I love your rant/explanation about RAR!!! I'd be happy if we convince people that Par3 means you don't need RAR. ... even if the truth is that they didn't need RAR in the first place!
Thanks @Yutaka-Sawada, I will change the standard back to 2 bytes for filename/dirname length and 4 bytes for path length.
I took another look at CRCs. I hated that the XZ one does the stupid bit twiddling at the start and end. It may make it mathematically meaningful, but it doesn't add anything to its error detection. It looks like CRC-64-ECMA is the same as XZ, but without the bit twiddling ("ECMA-182"). So we could use that... but it is the larger polynomial. CRC-64-ISO has the small polynomial, but does the bit twiddling. It took me a while to discover that the polynomial in the reference code for XZ is the bit-reflected form of the one in the specification: 0x42F0E1EBA9EA3693 in the spec appears as 0xC96C5795D7870F42 in the code. You can see this when you write the XZ code's polynomial out in binary. So, I was able to find the small ISO polynomial on the webpages: x^64 + x^4 + x^3 + x + 1, i.e. 0x000000000000001B (reflected: 0xD800000000000000). Should we go with the ISO polynomial and no bit twiddling? That would mean the XZ code, but with the ISO polynomial and without the initial and final inversions. How does that sound? @Yutaka-Sawada Do you think it will make the code faster (or easier to write)?
A practical reason for a non-zero initial CRC value is to detect zero bytes at the start. With 0 as the initial value, the CRC just ignores any zero bytes at the beginning, i.e. the CRC remains 0 until the first non-zero byte. But my understanding is that this feature isn't useful when using a fixed block size, like a rolling CRC usually does. (It is useful when you have two blocks of different length which have the same content, except for the number of zero bytes at the beginning.)
In addition to the bit complement at the beginning/end, the other difference between CRC-64/XZ and CRC-64/ECMA-182 is bit order (i.e. reflected vs. non-reflected input and output).
When CLMUL is available on the CPU, there is no speed difference between polynomials. Only when CLMUL isn't available is the CRC-64-ISO polynomial faster than the CRC-64-XZ one. But PAR3 isn't designed for obsolete CPUs, so there is no big reason to select a polynomial by speed. By the way, it's possible to calculate CRC-64-ISO without a table lookup. I put my hash testing application (TestHash_2022-01-11.zip) in the "tool" folder on OneDrive. There are some implementations of hash functions there; you may read the source code. The CRC-64-ISO sample code is like the one below.
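A minimal bitwise sketch of the reflected CRC-64-ISO update (polynomial x^64 + x^4 + x^3 + x + 1, reflected 0xD800000000000000), kept simple rather than optimized; it is not the sample from the TestHash archive, just enough to pin the definition down:

```cpp
#include <cstdint>
#include <cstddef>

// Reflected CRC-64-ISO update.  Initial value and final XOR are left to the
// caller on purpose, since dropping those inversions is exactly what is being
// discussed here.
inline uint64_t crc64iso_update(uint64_t crc, const void* data, size_t len)
{
    const uint8_t* p = static_cast<const uint8_t*>(data);
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xD800000000000000ULL & (0 - (crc & 1)));
    }
    return crc;
}
```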
About bit flipping, I don't like the useless factor. The initial and final bit flipping affects the CRC calculation when sliding the window. You may see the file "crc.h" of par2cmdline: there is a value, windowmask, which exists only to compensate for the bit flipping. Without the flipping, that mask goes away, so slide speed may become a little faster. But the difference would be very small and ignorable. While I removed the effect of bit flipping from CRC-32 before verification in my PAR2 client internally, I'm not sure it's worth the effort. If you want to slide CRC-64 over arbitrary window sizes (block size, 16 KB, or 40 bytes) in PAR3, three windowmasks will be required to search for complete blocks, misnamed files, or tail chunks. Though it adds more complexity, the difference is very small. Technically, there is no problem at all.
No, it won't be remarkable. If you think that CRC-64-XZ is good as a definition, that's OK. If you want a simple construction, using CRC-64-ISO's polynomial without bit flipping is the simplest. But I cannot say which it should be; there is no difference in speed nor noticeable complexity. You should consider factors other than speed. I'm sorry for confusing you with my old post.
BTW, did you see that there is a new version of BitTorrent? New hash function.
Though likely for different reasons. The hash function was changed from SHA1 to SHA256 due to security issues in the former. Security is actually more important in BT because you're taking data from untrustworthy peers. The directory structure change is to avoid duplication in long/commonly-used path names. Although likely beneficial to PAR3 (though less so due to the various overheads), I believe your key motivator was to support empty directories. BTv2 solves a rather annoying issue of BTv1, which didn't have stable per-file hashes, which I personally consider to be the biggest win.
About invalid filenames on Windows, there seem to be some more. I post some examples of them. Reserved filenames with an extension are invalid, too, such as "CON.txt". The check is case insensitive: "con" and "con.txt" are invalid, too. Windows Explorer cannot handle them. A filename with a period at the end is invalid, too, such as "something.". The last period is trimmed by Windows Explorer at renaming, so users cannot make such an invalid filename normally. But Windows Explorer cannot handle them. A filename with a space at the start or end is invalid, too, such as " something" or "something ". These spaces seem to be trimmed by Windows Explorer automatically, so users may not see such invalid filenames normally.
Okay, it sounds like we have a definition of CRC-64-ISO. And, while the bit inversion at start and end is annoying, it does serve a purpose and it is possible to optimize most of the cost away. I'll change the specification to CRC-64-ISO. Are there any other open issues that need discussion? If there are no open issues, I will:
We still need someone to code the reference implementation. I'm willing to help, with unit tests and other stuff, but I don't have time to code it. Are there any other open issues with the specification?
I'm not sure whether you answered it already, but Markus Laire asked the following a while ago:
I think that a PAR3 client developer can define new Matrix Packets for his special recovery codes. Though the name is "Matrix Packet", it is really a definition of how to make the recovery data.
If nobody else writes it, I can try. While it's difficult for me to implement PAR3's new recovery codes, I can copy most basic features from my PAR2 client. But, as I use Microsoft's C compiler and C runtime library for Windows, my source code may be incompatible with other environments.
Good point. I looked at the paper on FFT Reed Solomon by Lin, Chung and Han. It is implemented by the library Leopard on GitHub. This algorithm uses the normal Galois Fields and should be compatible with the current Par3 specification. I don't know what code matrix it is using. I've emailed the author of Leopard and one of the authors of the paper, to see if they can define the matrix. If I hear back, I'll replace the Cauchy matrix with it.
That's great! I'll be glad to help make the code run with GCC on Linux, as well as help with unit tests.
I'm done converting the specification to Markdown. It uses GitHub's own dialect of Markdown, because default Markdown does not support tables. It is attached. I've generated the HTML and made the changes for the website, but we'll have to wait for Ike to accept my pull request before you can see them. So excited! A great milestone! I can't wait to see the reference implementation!
Oh, also, how do you want to be credited in the specification?
We have a new repo for Par3! I'm moving discussion of the specification there. Post all future messages to: |
Hi everyone,
I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:
I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?
I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.
I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.
The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use a version of a Linear Congruential Generator.
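A minimal sketch of the kind of generator that would work, using Knuth's MMIX constants purely as an illustration (the actual constants would have to be fixed in the spec):

```cpp
#include <cstdint>

// 64-bit linear congruential generator: predictable, portable, and easy to
// specify exactly, which is what a file-format spec needs.  The multiplier
// and increment are Knuth's MMIX constants, chosen here only as an example.
struct Lcg64 {
    uint64_t state;
    explicit Lcg64(uint64_t seed) : state(seed) {}
    uint64_t next() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return state;
    }
};
```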
The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?
When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.
Mike