
Working on major Par2 changes. Name? #130

Open
mdnahas opened this issue Apr 27, 2019 · 222 comments

@mdnahas
Member

mdnahas commented Apr 27, 2019

Hi everyone,

I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:

  1. Reed-Solomon encoding with James S. Plank's correction
  2. Tornado Codes by Luby

I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?

I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.

The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use some version of a Linear Congruential Generator.
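For concreteness, here is a minimal sketch of the kind of deterministic generator I mean (the multiplier/increment are the well-known Numerical Recipes constants and are only illustrative; the spec would have to pin down the exact constants and seeding rule):

```cpp
#include <cstdint>

// Minimal sketch of a deterministic 32-bit LCG. Constants are illustrative
// (Numerical Recipes values); the spec would need to fix them and the seed rule.
struct Lcg32 {
  uint32_t state;
  explicit Lcg32(uint32_t seed) : state(seed) {}
  uint32_t next() {
    state = state * 1664525u + 1013904223u;  // modulo 2^32 via unsigned wraparound
    return state;
  }
};
```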

The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?

When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.

Mike

@animetosho
Contributor

Interesting!

You seem to be interested in retaining most of PAR2 - I presume, only changing the erasure coding used? What's the aim of your change? Just to improve encoding/decoding performance?

Have you looked at MultiPar's efforts and ideas with a PAR2 successor? https://www.livebusinesschat.com/smf/index.php?board=417.0
It seems like they're looking at a more substantial change to the format, rather than just an encoding change, but I wonder, wouldn't it be more worthwhile trying to do this instead of making an only-slightly-backwards-compatible PAR2 successor? Is there that much value in being able to verify data with an old PAR2 client?

As for your question, I feel a little uncomfortable with keeping the same extension. PAR2 is fairly old and well established, and as you've found, at least one client (possibly more) wouldn't handle such a change particularly well.

Thanks for your work nonetheless, I'm interested to see how it turns out.

@mdnahas
Member Author

mdnahas commented Apr 28, 2019

I have two goals. The first is to fix the bug in the Reed-Solomon code that has existed since Par1. I tried to fix this in Par2 by using only certain base values. That lessened the problem, but didn't fix it.

My other goal is to start using low-density parity check (LDPC) codes, which are no longer covered by patents in the USA. LDPC is much faster than Reed-Solomon, with a slightly increased chance of non-recovery. There are many forms of LDPC; Tornado Codes are simple ones. If I do the spec right, repair will work for many LDPC codes, not just Tornado Codes. Then, people who implement the spec can play with multiple different encoding schemes.

One of my restrictions is that I want to do this change in about a week's worth of work. I have some free time, but not much. Having looked at the code, I really want to spend time cleaning up the code and making it work like a library, but don't have enough time to find all the bugs I would create by doing that. I think I can fix Reed-Solomon and add support for Tornado Codes in 40 hours.

I've tried to understand the author of MultiPar's efforts. But I haven't seen a spec for their "Par3". I don't think they are devoted to open source code. I designed the Par2 file spec to be forward-compatible - it was designed to add features. I'm sure the Par2 spec could be improved by a redesign, but I don't see a huge win for it right now. Certainly not without a large number of people devoted to implementing it and checking the code.

@animetosho
Contributor

Thanks for the response and explanation!

So it seems like this is more of just an experiment for encoding schemes rather than some sort of end-user oriented, production-ready PAR2 successor?

Totally understand the lack of time. It's just that I don't think a PAR2 successor should really be rushed, particularly if the aim is to continue to support it for the next ~10 years or more. Parchive is often used for archival purposes (hence the name), so it's a format you generally don't want to be changing often, which is, unfortunately, often at odds with an experimental scheme.

Not trying to discourage change or contributions in any way, but I thought it should be pointed out nonetheless.
PAR2 does have quite a number of issues, some as you have noted, so addressing some of them is certainly a welcome move.

But I haven't seen a spec for their "Par3". I don't think they are devoted to open source code.

"PAR3" was never really a finalised spec, only a proposal at most. The MultiPar author was experimenting with some new ideas, and decided to call it "PAR3" and included it in MultiPar. Unfortunately, people actually started using it for non-experimental purposes, so he had to pull it.
If you're interested, here's the last version of the "PAR3" spec, last updated in 2010. I think the author did aim to have an open standard, but it didn't go anywhere. I think the new approach seems to be with more open discussion around it, hence the link I previously gave. At this stage, there hasn't been anything beyond ideas/discussion however.
In short, PAR3 was never more than some experimental implementation that didn't really go anywhere and had to be pulled. In some ways, it seems similar-ish to what you're proposing at the moment - e.g. if its aim is more experimental, and if it's labelled as some PAR2 successor, end users may end up inadvertently relying on it, when they shouldn't be.

Regardless, I mention this as they seem to be the only people still looking into a PAR2 successor (as well as one of the few actively developed PAR2 implementations), so thought you/they may be interested in such a proposal that you're making. I'm not too sure if you'll get much of a response here, but if you put up your proposed specification, I can share a link to it over there to get some of their thoughts.

Again, I personally do appreciate the effort you've gone (and will go) to, and am interested to see what you come up with!

@mdnahas
Member Author

mdnahas commented Apr 29, 2019

Thanks! I tried to keep tabs on the Par3 discussions back then, but I never saw a specification.

So their old proposed Par3 changed every packet. The changes include:

  • uses Reed Solomon FFT codes
  • drops MD5
  • adds two different 64-bit CRCs (to correct small errors ... but not to replace MD5 (?))
  • uses 1- and 2-byte integers (some integer fields are variable length)
  • packet types are numbers, not 16-char strings
  • files are identified by integers, not MD5-16k hashes
  • packs the ends of files (parts less than a full slice) together

I'm not a fan of the last 4 changes - it increases code without specific benefits. In fact, using packet numbers rather than strings means it is difficult to add custom extensions and to debug things.

Does anyone have an analysis of how much time par2cmdline (or another client) spends doing each task? E.g., how much time in CRCs, MD5s, inverting the matrix, etc. If MD5 is taking up too much time, maybe we should replace it. Otherwise, it seems foolish to drop backwards compatibility for non-recovery packets.

Does anyone have the "noe" program or "libnoe" library for computing Reed-Solomon with FFT? Or perhaps another implementation? I'm not tied to Plank's corrected version. Fixing that problem is certainly something I'm in favor of.

As far as "an experiment" vs "end-user oriented, production-ready PAR2 successor", I do not want to stall all development for a successor just because people use PAR2. We need to experiment. We should work on fixing Reed-Solomon and we should work on adding support for LDPC. We do need to be careful communicating to users what is untrusted. We also don't want to break "par2cmdline" for Par2, which is why I'm (1) not changing much, (2) adding unit tests and (3) asked another developer to review my code before it is merged.

@animetosho
Contributor

using packet numbers rather than strings means it is difficult to add custom extensions and to debug things

Going out on a limb here (I haven't read the proposal), I would imagine that the typical approach here is to allocate ranges for custom extensions. I also imagine that ease of debugging was not a goal of the proposal either.

it seems foolish to drop backwards compatibility for non-recovery packets

That's an interesting opinion. Personally, I don't see much benefit in retaining backwards compatibility just for non-recovery packets, seeing as the whole point of PAR2 is the recovery packets. I suppose there's some use with using it to perform hash verification, but I don't feel that this is the primary aim of PAR2, and imagine that other methods (e.g. just storing a hash) are more frequently used for cases where only verification is necessary. The ability to verify is also limited, because most PAR2 tools also check the recoverability if bad blocks are found, which obviously won't work if a new type of recovery block is introduced; I don't recall the PAR2 specification even allowing for new types of recovery blocks, so there's no reason or expectation for clients to handle such a change in an ideal manner either.

You do mention that you have limited time to work on this, and hence want to minimize changes, which is totally understandable, but I see this more as a constraint rather than a benefit for minimizing changes.

it increases code without specific benefits

This doesn't really address your concerns, but here's a list of issues the MultiPar author has with PAR2, written 5 years after the PAR3 proposal, that he'd like to resolve with a PAR2 successor. May give some ideas on the reasoning behind the proposed PAR3 (e.g. using numbers to maybe reduce the size of the PAR archive, further enhanced by compressing filenames?).

Does anyone have an analysis of how much time par2cmdline (or another client) spends doing each task?

I only have experience with implementing a PAR2 encoder, so haven't looked much into decoding (so not much knowledge/experience implementing matrix inversion), and what I'm about to say relates to encoding.

As RS coding is relatively slow, generally the time spent hashing is negligible. However, if you replace RS with something faster, hashing will become a larger part of the overall speed to create/repair PAR2.

As far as I'm aware, all PAR2 clients (other than ParPar) perform an initial hash pass to compute the MD5/CRC32 of each file/block, during which you're probably limited by I/O rather than CPU. NVMe SSDs, which are gaining mainstream traction, likely tip things the other way, though running MD5 in multiple threads is probably sufficient to keep I/O as the bottleneck. Another possible trick is to use a multi-buffer (SIMD) MD5 implementation, but I think ParPar is the only PAR2 client which implements that.
MD5 is reasonably fast for a cryptographic hash. For cryptographic purposes though, it's not recommended due to existing collision attacks against it. The use of a cryptographic hash function may be interesting if one can rely on it being secure - for example, you could use it to uniquely identify files and clients might be able to make certain assumptions around it, although I imagine the benefits may be limited.

CRC32 is generally very fast, and modern x86/ARM CPUs have acceleration for it. I think rolling the CRC and checking for matches, for finding misaligned blocks during recovery, can be somewhat slow though. Rolling can also be problematic if you have to re-compute the MD5 for every potential match, which can lead to a kind of DoS attack against a PAR2 client attempting recovery.

Here's a discussion around hashing in PAR2 and successors.

I/O can be a limitation when processing large amounts of data which cannot fit in memory, due to the need for certain seek patterns. This may become less of an issue with the dying use of rotating media for storage though.

Note that par2cmdline isn't particularly a performance oriented implementation of PAR2. For most of its history, it's been a single threaded, non-SIMD, no-ASM implementation. There have been various forks which have added these performance features (e.g. par2-tbb, phpar2). It's only more recently that par2cmdline-mt was merged in, adding multi-threading to par2cmdline (but still no SIMD). MultiPar is more performance oriented, so obviously the author is more concerned about such issues with a proposed successor.

Here are some rough performance figures, off the top of my head, for the fastest implementations I know of, on one thread of a 4.3GHz Skylake-X CPU (supports 512-bit SIMD via AVX512):

  • MD5: 800MB/s (OpenSSL ASM implementation)
  • MD5 multi-buffer: 3GB/s (I think? can't remember exactly now) using 128-bit AVX512VL. A full 512-bit AVX512 implementation should be a bit under 4x that, so potentially >10GB/s
  • CRC32: >25GB/s using the carry-less multiply PCLMULQDQ instruction (memory read bandwidth per core is around 17GB/s, so this will be limited by it for buffers which don't fit in CPU cache). The more common "Sarwate" algorithm won't achieve that speed, but I'd guess is still at least 1-2GB/s.
  • GF16 multiplication: 45GB/s using AVX512's 512-bit VPSHUFB instruction. I seem to be L2 cache bound at that speed, so there's potential further gains by processing multiple blocks at a time, rather than the simple "multiply" function I have in ParPar. Despite this speed though, RS coding obviously requires multiple multiplies, depending on how many recovery blocks are generated. Very roughly speaking, for 1000 recovery blocks, you're probably looking at 45MB/s per core, which is likely going to be your primary bottleneck if not reading from a harddisk
  • NVMe SSD: around 3GB/s sequential read. 4K random QD32 is probably around 1GB/s
  • harddisk: typically 100-200MB/s sequential read, 0.2-0.5MB/s 4K random

I do not want to stall all development for a successor just because people use PAR2. We need to experiment

Oh definitely! It's just that that wasn't clear from your first post, considering that this repository is more a user-facing tool rather than for experimentation, and the questions asked around the name (which sounds more like an issue if you're proposing an actual standard rather than experimentation) initially gave me the impression otherwise. Hence my asking and confirming.
I'd personally prefer to keep such changes out of the main par2cmdline branch until something more final, otherwise someone is bound to misuse it. Interested parties can grab the experimental fork to test changes :)

Thanks again for the reply!

@mdnahas
Member Author

mdnahas commented Apr 30, 2019

I don't worry about intentional MD5 hash collisions or denial of service. We use MD5 for uniqueness, but I don't think our users rely on that. (Or, if they do, they probably know what MD5 is and its security issues!) I do worry about buffer overflow and writing system files.

If we want people to implement Par3, it's easiest if they can just reuse the code from Par2. Why recode the wheel?

I agree that par2cmdline should not be tweaked for performance. It should be a clear roadmap for how to implement the standard. I actually don't think it does a good job of that - par2repairer.cpp is 2750 lines. I guess it needs to be multi-threaded (or use fork?) because it is compute intensive and everyone has multi-cores nowadays.

Is GF16 multiply 45GBps or 45MBps? If it's 45MBps, then it and I/O are our bottlenecks. Changing to LDPC might improve things.

I'm not sure what I'll work on now. Maybe I'll just spend the week cleaning up par2cmdline. I hate modifying a working program without enough tests, because I may break an untested feature. (There are a lot of commandline options to test!) But a cleaner version would make it easier for people to copy the code and then improve performance of pieces.

If I skip additions and just improve par2cmdline's code, there's things like:

  • create a par2 library and have par2cmdline process commandline args and then call the library.
    (This is hard because the commandline class currently does things like look up the size of files!!)
  • automatically handle the case when the bug in Plank's Reed-Solomon appears. (I didn't see code to handle this. Is it there?)
  • print message if a packet is detected but type is not handled
  • make CRC and MD5 have the same interface (they are both hashes, but the class interfaces are very different). I'd like to encapsulate the many ~0 that are currently in the code without any explanation.
  • get rid of class variables for par2repairer and par2creator. Most are currently acting like global variables in a 2700 line program, rather than class vars. Because of this, it is hard to see where data is created and consumed.

As I said, these are huge changes and I'm loath to make them, because there's a good chance that I'll introduce bugs to a long-running tool.

@animetosho
Contributor

animetosho commented May 1, 2019

Is GF16 multiply 45GBps or 45MBps?

Multiplying a buffer by a constant can be done at 45GB/s using 512-bit SIMD. In other words, a single multiplication is 45GB/s. The implementation is based off Plank's "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions" paper (you can find his implementation here). My (much less readable) implementation extends his 128-bit SIMD code to use 512-bit SIMD (available on the latest Intel server processors), and adds a few small performance tweaks on top, effectively achieving around 4x the speed of Plank's implementation on CPUs with AVX512.
I've also investigated an alternative GF16 multiply kernel which relies on JIT'ing XOR code for each multiply. My current implementation seems to generally outperform Plank's algorithm, at the expense of a significantly more complex implementation (perhaps ridiculously complex in fact). It currently doesn't scale as well to 512-bit SIMD though (around 40GB/s peak), as it is mostly stalled by L2 cache, though I believe it can be further tweaked, and has the potential to be much faster.

Here's a benchmark I performed with some GF16 multiply techniques. The "lh_lookup" value refers to the technique of splitting the 16-bit multiply into low/high 8-bit multiplies in a 512-entry lookup table (basically the same method used in par2cmdline) - this is the fastest, portable, pure-C technique I know of, but as it doesn't use SIMD, falls behind the other techniques on performance. "shuffle" refers to my implementation of Plank's method above, at 128/256/512 bit SIMD widths. "xor" is the technique of using JIT to generate XOR sequences linked to above, at 128/256/512-bit SIMD widths. "memcpy" is just the standard glibc implementation of memcpy - mostly useful as a reference to give an indication of cache/memory bandwidth.

[gfspeed: benchmark graph comparing the GF16 multiply techniques above across buffer sizes]

(looks like I accidentally chopped off the X-axis label, which is the buffer size used; you'll find a similar graph in Plank's paper, but his implementation only uses 128-bit SIMD)
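For anyone following along who hasn't seen the lh_lookup idea, a minimal portable sketch looks something like this (scalar C++, no SIMD; 0x1100B is PAR2's GF(2^16) polynomial, the function names are just illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Slow bit-by-bit GF(2^16) multiply (PAR2 polynomial 0x1100B); only used to
// build the lookup tables below.
static uint16_t gf16_mul(uint16_t a, uint16_t b) {
  uint32_t acc = 0, aa = a;
  for (; b; b >>= 1) {
    if (b & 1) acc ^= aa;
    aa <<= 1;
    if (aa & 0x10000) aa ^= 0x1100B;
  }
  return (uint16_t)acc;
}

// "lh_lookup"-style region multiply-accumulate: because GF multiplication is
// linear over GF(2), c*w = c*(low byte of w) ^ c*(high byte of w << 8), so two
// 256-entry tables cover every 16-bit word.
void gf16_region_muladd(uint16_t* dst, const uint16_t* src, uint16_t c, size_t words) {
  uint16_t lo[256], hi[256];
  for (unsigned b = 0; b < 256; b++) {
    lo[b] = gf16_mul(c, (uint16_t)b);
    hi[b] = gf16_mul(c, (uint16_t)(b << 8));
  }
  for (size_t i = 0; i < words; i++) {
    uint16_t w = src[i];
    dst[i] ^= lo[w & 0xff] ^ hi[w >> 8];
  }
}
```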

However, in reality, it's unlikely that users create PAR2 with only a single recovery block. Which means that more multiplications need to be performed, proportional to the number of blocks to be generated. The example I gave was with 1000 recovery blocks, which means 1000x region multiplies for each input block, hence giving the (very rough) 1000x overall speed decrease (=45MB/s per core).

Note that this speed was achieved on a very recent, high power Intel CPU with AVX512 running at a high clock rate. Most users do not have access to such powerful hardware, so the figures I'm giving are somewhat at the upper-end of the scale of what's possible on a CPU today (a more typical x86 processor today would probably get half that speed). At the other end of the scale, a low power ARM CPU (with NEON (SIMD) support), for example, may only achieve around 300MB/s using the above algorithm for a single multiply. CPUs without SIMD units may perform even worse.
(it's also worth pointing out that computers with powerful CPUs are likely paired with fast drives, in which case I/O is unlikely to be a bottleneck in practice)

So despite the various tricks you can pull for a highly optimised/tuned implementation of GF16 multiplication, they come at a complexity cost, and there's still fundamental limitations, hence a faster erasure code is definitely welcome. Particularly, something that can grow at O(n log n) complexity, instead of O(n*n) would be useful, especially if we want to scale beyond 64k recovery blocks.


Just to make sure you're aware, I don't work on or maintain the par2cmdline codebase, so can't really comment much on your suggestions (hence I won't be able to do much regarding the comment you made in your latest pull request). I'm mostly just an interested 3rd party who works on my own PAR2 implementation (which has somewhat different aims to existing PAR2 implementations (e.g. one thing I wanted was to be able to process data from streams rather than be tied to on-disk files)). There isn't much development going on with par2cmdline these days though, so I would suspect that any contribution you're willing to make is valued. Your suggestions do sound like good changes nonetheless, and I appreciate you putting your time and effort into this.

there's a good chance that I'll introduce bugs to a long-running tool

All changes have such risk, but without changes there is no progress. You sound like you're fairly cautious already, so I wouldn't worry too much about it. All software has bugs - if one crops up, someone just needs to fix it.


Yutaka Sawada (MultiPar author) made a comment regarding your initial proposal. I've copy/pasted it below:

Thank you for notice. It's interesting, and I will help as possible as I can. MultiPar will support it, too.

Because I don't know about Tornado Codes, I cannot say whether it's good or not. But "1. Reed-Solomon encoding with James S. Plank's correction" would be bad, because it requires inversion of a matrix. When there are many blocks, inversion will become almost impossible (from calculation speed and required memory size). I prefer to use a Cauchy Matrix for quick calculation, which Mr. persicum suggested long ago.

As I cannot explain the mathematics, I put some paper about the usage.

Filename: oninversionofcer00sche.pdf
"On the Inversion of Certain Matrices"
by Samuel Schechter, February 1, 1959
AEC Computing and Applied Mathematics Center
Institute of Mathematical Sciences
New York University

Filename: 0301097.pdf
"The Efficient Implementation of Reed-Solomon High Rate Erasure Resilient Codes"
by Jin Li
Microsoft Research, Communication and Collaboration

Another issue of PAR2 is order of input files. Because filename affects the order of input files in current PAR2, it's impossible to change filename later, even when contents of files aren't changed. Now, a user needs to recreate whole PAR2 files after renaming a source file. Some users requested a feature of renaming filenames over MultiPar. Though I include a tool to rename filename in MultiPar package, it's out of PAR2 spec. If new PAR spec (maybe PAR3) solves this problem, it will be convenient.

@Rhialto

Rhialto commented May 1, 2019 via email

@mdnahas
Member Author

mdnahas commented May 1, 2019

Thanks @animetosho for the GBps / MBps explanation. memcpy is a good reference for speed.

And thanks for forwarding the comment on "matrix inversion" from Yutaka Sawada. As part of a job, I've actually tried to invert large matrices and I found that anything much bigger than 400x400 was impossibly slow. I'm not sure we should use the term "Reed-Solomon" if we're not actually doing Reed-Solomon. I've been thinking of renaming the class to "reedsolomon_broken".

I find it interesting that people are changing filenames after creating Par2 files. That isn't something I anticipated when writing the spec! Currently, the spec says that clients shouldn't try to verify how the "recovery set ID" is calculated. We could amend it to say that they shouldn't try to verify how "file ids" are calculated. We did an informal amendment to a version "2.1" which allowed clients to use UTF-8 in filenames (instead of pure ASCII).

@animetosho The streaming idea is interesting. Are you using the optional packets that allow you to include slices of the input file?

@animetosho
Contributor

Yeah, there are all sorts of things people out there want to be able to do. Another thing is being able to add files to or update a PAR2 archive. Currently this requires recomputing the whole thing due to the strict file ordering determining the coefficients used for computation.
Not being limited to 32k input blocks/files is also a nice thing to have.

When you get down to it, there's a lot of things that Parchive could be doing but currently can't in PAR2, which is why I personally think a major overhaul of the format is warranted.

Are you using the optional packets that allow you to include slices of the input file?

I am not including parts of the source file in the PAR2 archive. Streaming is more in the sense of being able to pipe data in from memory, as opposed to using PAR2 as some on-the-fly FEC for lossy network transmission or the like (though you could do that too).

For example @Rhialto gives the example of a usenet downloader. Currently these write all downloaded files to disk, including PAR2 files, then perform verification/repair. If those could be performed somewhat on-the-fly, whilst the download is occurring, you can save time over doing separate processes, whilst also reducing disk I/O (and even avoid writing PAR2 files to disk entirely). Unfortunately, existing PAR2 implementations are too closely tied to the notion of on-disk files to be able to do this.
I also had a request to be able to have stdin directly piped in for processing, which is interesting because you don't know the size of it in advance, in addition to it being a stream.

@mdnahas
Member Author

mdnahas commented May 2, 2019

If you can collect the use cases that Par2 doesn't work well in, I can take a look at how the design might change.

I think we can handle renaming files by just not requiring clients to confirm the File ID.

If we want to handle single-byte changes in files, like the proposed Par3 spec you forwarded, that does require big changes.

Do people really want >32,000 blocks? Each block adds overhead (CRC32 and MD5 checksum). I mean, it's very doable, but I'd want to shift to something like LDPC, which has less computation per block.

Streaming is possible with minor changes. You need some initial information to initialize the state in the reader. That would be (1) input file descriptions, checksums, etc. and (2) the number of recovery buffers to allocate and their coefficients. Then, as data is downloaded, you need to identify which input slice it is, write it to the correct place in the file, and subtract its effect from each of the N recovery buffers. Lastly, when all the data has been downloaded, you need to identify the K slices that need recovery. If K < N, download the recovery data for K slices, add it to the appropriate recovery buffer, solve the matrix, and write out the recovered data.
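To make that concrete, here's a rough sketch of the state such a streaming verifier might keep (the names are made up, and the naive bit-by-bit GF(2^16) multiply is only for illustration - a real client would use a fast region multiply):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive GF(2^16) multiply (PAR2 polynomial 0x1100B), for illustration only.
static uint16_t gf16_mul(uint16_t a, uint16_t b) {
  uint32_t acc = 0, aa = a;
  for (; b; b >>= 1) { if (b & 1) acc ^= aa; aa <<= 1; if (aa & 0x10000) aa ^= 0x1100B; }
  return (uint16_t)acc;
}

struct StreamingRepairState {
  size_t slice_words;                        // slice length in 16-bit words
  std::vector<std::vector<uint16_t>> recov;  // N buffers, preloaded with the recovery
                                             // slices (or zero, with the slices XORed in later)
  std::vector<std::vector<uint16_t>> coeff;  // coeff[j][i]: coefficient of input slice i
                                             // in recovery buffer j

  // Called once per input slice as it arrives, in any order: write it to its
  // final location in the output file (not shown) and cancel its contribution
  // from every recovery buffer.
  void consume_input(size_t i, const uint16_t* data) {
    for (size_t j = 0; j < recov.size(); j++)
      for (size_t w = 0; w < slice_words; w++)
        recov[j][w] ^= gf16_mul(coeff[j][i], data[w]);
  }

  // When the transfer ends, the K slices that never arrived are the unknowns;
  // choosing K of the buffers and inverting the KxK coefficient submatrix
  // recovers them, exactly as in ordinary PAR2 repair (requires K <= N).
};
```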

If you assume the initial information is the no-recovery-slice Par2 file, then the only thing we are missing is the number of recovery slices and their coefficients. An adventurous client could guess those from the names of Par2 files. Or we could put that data in the Par2 files. That is, a new packet that is a hint to the repair client about the number of buffers to allocate.

Streaming would use a lot of CPU and storage unnecessarily, but it would be cool to try. I like that the downloading program would know exactly where to put the data blocks and the client wouldn't need the logic to rewrite the data files.

@animetosho
Contributor

animetosho commented May 3, 2019

I'm not sure about collecting use cases, but some ideas are listed here.
I think all text fields should be defined as UTF-8 (edit: looks like you've already done this), which makes the "unicode" (presumably fixed-length UCS-2, though encoders would likely use variable length UTF-16 regardless) packets redundant. Actually, I already use UTF-8 for the "ASCII packets" anyway - despite what the standard says, many clients just use whatever the default system codepage is for these. If you have non-ASCII characters, the filename will be barfed whatever you do, but UTF-8 is generally more "kind" in these situations, and if the system codepage is UTF-8, it could just work anyway.

Do people really want >32,000 blocks? Each block adds overhead (CRC32 and MD5 checksum).

The overhead is relatively small for larger blocks, e.g. 1MB blocks. However, even at 1MB blocks, 32k means you can't have more than 32GB in a Parchive without increasing block size. Though not common, it's not too unusual to have >32GB files today. And if we're thinking about a format which is to last the next 10-20 years, it can feel a little limiting.
I suppose how large blocks should be does become a bit of an open question, but I'd like to think that the block size should increase at a slower rate relative to the amount of data typically covered by Parchive.

Streaming is possible with minor changes.

As you alluded to, it's already possible (most of the time) without any change, though the decoder may need to be a little clever about it.

Knowing how many recovery blocks are available may actually not be necessary, and in fact, may be largely irrelevant (at least for the downloader use case).
To expand on that, in most cases, K (damaged blocks) is significantly smaller than N (available recovery blocks), so it's somewhat of a waste of resources to compute on N recovery blocks during download. Now in a pure streaming situation, you don't have much choice, but for the download situation, you can always go back and compute any additional recovery you need, if there were more damaged blocks than you initially guessed.
This greatly reduces CPU/RAM usage (vs computing all recovery), and seems to be a practical optimisation for the use case. Particularly since downloading is often network bandwidth limited, followed by I/O, whilst CPU is often fairly idle; it makes some sense to expend some idle CPU resources to potentially speed up the process. For example, a downloader may choose to always compute 32 recovery blocks during download, regardless of availability; as the download is network bound, this only increases CPU usage a little, but doesn't otherwise impact download speed. When the download is complete, if there are fewer than 32 damaged blocks, they can be repaired without re-reading the downloaded files off disk. If there are more than 32 bad blocks, you'll need to re-read the files to compute all the needed blocks, but in this case, the previously computed blocks can actually be re-used to reduce the amount of computation needed.

The remaining issue is the recovery coefficients. These could be guessed from the filenames, as you pointed out, though you could also just guess the first few coefficients typically used as there's generally little reason that an encoder wouldn't use those.
If the assumptions don't hold here (e.g. no index file present, or unexpected coefficients), the downloader can just fall back to existing less efficient techniques. This is for the downloader case though, where pure streaming isn't a requirement, rather, it's only an optimisation.

Encoding in a streaming fashion is a little easier than verify/repair, since the create tool has full control over the process. The primary limitation here is being able to fit all recovery blocks in memory.

(note that the ParPar project wasn't aimed at changing PAR2, rather, at working within its bounds)


Yutaka Sawada added another note:

Though this isn't a serious problem, the checksums in PAR2 (CRC-32 and MD5) would better be changed in a future version. For example, BLAKE2 is faster and safer than MD5. CRC-64 is faster and has fewer collisions than CRC-32, but it requires an additional 4 bytes. Because BLAKE2 has no disadvantage, there is no reason to avoid it. Using MD5 in a new specification may be obsolete. Because CRC-64 is double the size of CRC-32, it gives slightly larger recovery files. For the current max of 32768 blocks (or files), CRC-32 is enough. If the new spec supports more blocks, CRC-64 will be preferable. (It depends on how many blocks.)

I put some paper of BLAKE2, CRC-64, and MD5's collision.

Filename: blake2.pdf
"BLAKE2: simpler, smaller, fast as MD5"
by Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Hearn, Christian Winnerlein

Filename: nguyen_Fast_CRC_2012.02.20.pdf
"Fast CRCs"
by Gam D. Nguyen
Naval Research Laboratory

Filename: md5_crc32_hack.zip
"How to Break MD5 and Other Hash Functions"
by Xiaoyun Wang and Hongbo Yu
(I include sample data and PAR files to test.)

@mdnahas
Member Author

mdnahas commented May 4, 2019

My opinion is that UTF-8 support was discussed and approved by Peter B. Clements and a bunch of us. I may add that to the "spec" section of the new webpage as a 2.0.1 spec.

The list of use cases includes:

  1. streaming = generate recovery data while input is in transmission; use the recovery data to fix missing slices after transmission finishes.
  2. easier single-pass file recovery (which is very similar to streaming)
  3. incremental operation = ability to add source files without recalculating the entire recovery file
  4. changing names of source files
  5. support for more blocks
    And, of course, faster encoding and decoding. And a fix to Plank's Reed-Solomon encoding.

I'll have to think about "repair Par2 files" as a use case. I understand the desire for it. If you're transmitting files, you want to restore them to the original state before re-transmitting them. If you've done a backup, you'd like to restore every damaged disk image to the original state before re-burning it. I've usually assumed clients have the freedom to determine how many times to repeat the un-recoverable packets (and their placement in the files). This can probably be accomplished by a "suggested file layout" without a strict requirement.

I don't know about fixing small errors, like adding/dropping/changing a byte. I'm just not sure how to do that - especially adds and drops. It feels like a separate file format should be used for that.

I've already discussed how to stream with Par2. It takes some guesses, but I think it would work in practice. The only change I could even suggest was a packet that hinted about the exponents of recovery packets and, even then, it would only be a hint.

The big stumbling block is the changing filenames and incremental changes. The easiest fix here is just to open up more freedom in the Par2 standard: do not (always) require the FileIDs to be listed in sorted order in the main packet. We can consider if we also want to allow FileIDs or the RecoverySetID to be something other than its current definition, but changing the main packet order is necessary and sufficient. Those changes can be done by a "Par version 2.0.2" amendment. And we could include the suggested file layout with it too.

There are reasons to go to a "Par3": more blocks, fixing Reed-Solomon, and replacing MD5 with a faster/better-suited hash. I think adding LDPC is the easy way to get to more blocks. And it would keep the Reed-Solomon matrix small, if we used Plank's correction which requires sloooow Gaussian Elimination. (I don't understand the FFT version yet.)

But simple changes to Par2 might get us a few more use cases for people.

@animetosho
Contributor

Yeah, those small changes to PAR2 could definitely open things up a little, and there's a good chance that existing PAR2 clients would already be fully compatible with them anyway.

I found a page where someone else was investigating a PAR2 successor, which may have some ideas+thoughts (haven't really read through it).

Some ideas I just thought up for a "PAR3" (may not be a good idea, just mentioning them):

  • filename compression: this probably will only be useful for those dealing with lots of small files, and probably not worth the complexity. But maybe if we have some directory structure, we can get a fair portion of the benefits of size reduction without much additional complexity (I recall that the full path needs to be specified for each file in PAR2, so eliminating common path elements might be useful)
  • define how empty (0-byte) files are handled? Some clients place them in the recovery section, others place them in the non-recovery section. For the sake of consistency, it may be worth making a decision on that in the spec?
  • extensibility: I presume that this will be included anyway, but including algorithm indicators (like what erasure scheme is used, or what hash algorithm is used) can help extensibility. Another area may be allowing for optional fields, such as additional file properties or the like, should a use case ever come up, which can aid with new features and backwards compatibility
  • removing files from a Parchive? I imagine this to be similar to adding files, but this does introduce the notion of "holes" (missing input slices) in the input, which I don't think the current PAR2 would handle well

User maosredbrook gives the interesting idea of a RAR-esque usage of storing input files in Parchive with optional encryption. PAR2 already supports embedding input slices, though I'm not sure how useful that feature is. Another option is that, since decoders have to scan for packet headers, one can embed (concatenate onto the end) a PAR2 archive into another file. This can be used to, say, tack on recovery information to an archive which doesn't support repair (e.g. 7z), and the file still remains usable (as long as the application doesn't make assumptions on what's at the end of the file). "Extracting" the file would just be like stripping the PAR2 packets off the end.
This embedding also allows an application to use some existing encryption technique without Parchive needing to support it (I don't feel that encryption should be part of Parchive).
If it's not there already, do you think it'd be worthwhile mentioning something about embedding in the Parchive spec?


Yutaka Sawada writes:

By the way, PAR2 might already try to solve the max 32768 blocks (or files) problem with the optional "Packed Main packet" and "Packed Recovery Slice packet". If the new spec merges that idea into the standard packets, there will be no limit on blocks (or files). For example, the "Main packet" would always include "Subslice size", as if it were a packed one. It needs consideration of how slices and subslices relate and how they are used.

Small improvement is better than nothing. I could not solve a dilemma; fast speed vs recovery performance. (For example, LDPC is very fast, but requires more recovery data.) If someone knows a problem of current PAR2, or comes up with a good idea, he may post his thought on the par2cmdline page. I expect that Michael Nahas will realize our hope.

My understanding of the packed slices feature is that it doesn't increase the number of available slices in any way, rather, the only thing it does is allow input files to be more tightly packed, since they don't have to be padded to a full slice (only to a subslice). If my understanding is correct, one could always just set the subslice to 4 bytes and gain maximum efficiency with packing files, with little downside (other than some additional complexity in decoders?). Adding subslice hashes could make the feature more useful as smaller erasures can be identified, and actually gives a penalty for using tiny (i.e. 4 byte) subslices. It's not quite the same as enabling more slices overall however (which can be more useful in cases when you're encoding input of unknown length).

@mdnahas
Member Author

mdnahas commented May 6, 2019

The comment about removing files (and leaving some coefficients missing) is a good note. I hadn't thought of that.

0-length files are within the spec. Clients can put them either in the recovery set or the non-recovery set. Either way of doing them is fine. The non-recovery set was only added for backwards compatibility, so I'd prefer including them in the recovery set. The only bad way of handling them is the way that par2cmdline does - which is not at all. They get ignored!

As for compression, encryption, permissions, many files, etc., I'm an old UNIX junkie, so I think there are many other programs that handle that better: gzip, gpg, tar, etc. I'd rather have someone run many tools that each do one thing well, rather than build it all into Par.

I really wished some client had implemented the "input slices" optional feature. Many large files are already compressed. Many posters on Usenet were using RAR to split up a file and not for compression/permissions/etc. I was really hoping a Par client would eliminate the use of RAR. But, as before, I wouldn't include encryption. There are other programs that do that better. ... If we do a Par3, what do you think of making that a required packet type?

@animetosho
Contributor

animetosho commented May 6, 2019

Many posters on Usenet were using RAR to split up a file and not for compression/permissions/etc

Which in my opinion, is completely pointless. No-one has actually been able to give me a sound explanation of why files need to be split at all, as yEnc already splits up files into articles, and another layer of splitting seems completely unnecessary.

Splitting can have uses if putting files on fixed size media (CD/DVDs) and the like, but I don't see any requirement for using RAR to do that, particularly if you're not compressing.

I think the Usenet case is more due to legacy and users not updating their knowledge or processes, rather than anything which still makes sense today, on a more technical level.
RAR, however, is used for obfuscation/encryption purposes, in which case, embedding files in PAR2 may not address the needs of uploaders.

if we do a Par3, what do you think of making that a required packet type?

I suppose the packet is useful if someone needs it, but I'm struggling to see an actual use case where it's particularly beneficial.

Embedding source files turns Parchive into an archive format, at which point, you're competing with TAR/ZIP and the like for functionality, without many of its features (timestamps, permissions etc).
I think it would be more fruitful to promote embedding Parchive into other formats instead, in which case, Parchive just adds redundancy to any arbitrary file. For example, 7z archives don't support recovery, but if one appends a few PAR2 packets onto the end, they could protect against corruption in the file. If there's no corruption, the 7z archive is usable as is.
This is the benefit of embedding Parchive into other files, rather than embedding files into Parchive - the original files are usable without the need for extraction (assuming that the application isn't too strict about what's at the end of a file).

Various other archive formats do support embedding as well (e.g. for self-extracting executables), but because PAR2's input file slice packet requires the file to be stored non-contiguously (broken into slices), it won't be recognised without an application extracting the files from the PAR2 itself.
One could always just make up a new packet type and embed the whole file contiguously to enable this sort of embedding though.

@mdnahas
Member Author

mdnahas commented May 11, 2019

For anyone tracking this thread, I just pushed a HUGE change. It creates "libpar2", a library that anyone can include in their code. The commandline program, "par2", just calls the library.

Pull request is here, with a description of all the changes. It fixes a number of bugs (including one that didn't match the spec!) and adds minor features.

I only have two computers to test on and neither are a Windows machine. I'd love if everyone could download the code and make sure it compiles on your machine.

@animetosho
Contributor

Nice work!

@mdnahas
Member Author

mdnahas commented May 14, 2019

I'm done with all changes to this pull request. It's ready to merge, if someone can run it with Microsoft Visual C++ and run the tests on Windows.

Pull Request

@animetosho
Contributor

Compiling seems to fail on MSVC 2017 with linker errors. Adding src/libpar2.cpp /.h to the project (par2cmdline.vcxproj) fixes this and it compiles fine.
I'm not sure how to run tests on Windows, as they seem to be shell scripts. I tried running a chain of tests/test1 && tests/test2 && ... via an MSYS bash shell, and it seems to run them all, though I'm not sure if the tests ever return a non-zero return value on failure.

@mdnahas
Member Author

mdnahas commented May 15, 2019

@animetosho Thank you for running it!!

@animetosho Did the last test run? The && operator will only run the program on the right if the program on the left exits with code 0. Maybe you could do:
tests/test1 && tests/test2 && ... && tests/testN && echo SUCCESS
and check if it prints out "SUCCESS".

I'll see what I can do to the unit tests to make it clearer which ran and success/failure.

Can you attach your par2cmdline.vcxproj, so I can add it to the fork? I'm assuming it is just adding 2 lines in the obvious location, but that's just a guess. If you can't attach the file, commit it locally, run "git format-patch -1 HEAD" and then post the contents of the patch it generates.

@animetosho
Contributor

All the tests did run through like that, it's just that I don't know whether a failing test still returns 0 or not.

I've attached the project file. It just adds the two files, and doesn't attempt to, e.g. build a separate library and CLI executable.
par2cmdline.vcxproj.gz

Compile log:

1>------ Build started: Project: par2cmdline, Configuration: Release Win32 ------
1>C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\Platforms\Win32\PlatformToolsets\v141_xp\Toolset.targets(39,5): warning MSB8051: Support for targeting Windows XP is deprecated and will not be present in future releases of Visual Studio. Please see https://go.microsoft.com/fwlink/?linkid=2023588 for more information.
1>commandline.cpp
1>z:\par2cmdline-library_dev\src\commandline.cpp(173): warning C4244: 'argument': conversion from 'u64' to 'u32', possible loss of data
1>z:\par2cmdline-library_dev\src\commandline.cpp(1268): warning C4244: 'argument': conversion from 'u64' to 'u32', possible loss of data
1>crc.cpp
1>creatorpacket.cpp
1>criticalpacket.cpp
1>datablock.cpp
1>descriptionpacket.cpp
1>diskfile.cpp
1>filechecksummer.cpp
1>galois.cpp
1>libpar2.cpp
1>mainpacket.cpp
1>md5.cpp
1>par1fileformat.cpp
1>par1repairer.cpp
1>par1repairersourcefile.cpp
1>z:\par2cmdline-library_dev\src\par1repairersourcefile.cpp(50): warning C4244: 'argument': conversion from 'u16' to 'char', possible loss of data
1>par2cmdline.cpp
1>par2creator.cpp
1>z:\par2cmdline-library_dev\src\par2creator.cpp(331): warning C4244: 'argument': conversion from 'u64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2creator.cpp(345): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2creator.cpp(355): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>par2creatorsourcefile.cpp
1>par2fileformat.cpp
1>par2repairer.cpp
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1220): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1327): warning C4244: 'argument': conversion from 'u64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1333): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2649): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2650): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2658): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>Generating Code...
1>Compiling...
1>par2repairersourcefile.cpp
1>recoverypacket.cpp
1>reedsolomon.cpp
1>verificationhashtable.cpp
1>verificationpacket.cpp
1>Generating Code...
1>par2cmdline.vcxproj -> Z:\par2cmdline-library_dev\Win32\Release\par2.exe
1>Done building project "par2cmdline.vcxproj".
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

@mdnahas
Member Author

mdnahas commented May 16, 2019

I merged your Microsoft project file. I fixed the code for the warnings - one was a bug I had introduced.

The && operator in bash checks the exit code. It will only run the command on the right if the program on the left returns 0. (Which is what each test should do if it passes.) So, "echo SUCCESS" is only run if every program to its left had an exit code of 0. (If you want to run two commands sequentially and always have the second run, independent of the exit code of the first, I think you use the semi-colon ; operator.)

@mdnahas
Member Author

mdnahas commented May 23, 2019

I started looking into Tornado Codes. I think there is an error in the paper which affects the code's performance. That is, the codes still work, but a typo makes the codes look much better than they are in practice.

The error is on page 7 of the paper. It's in the 4th paragraph of Section 7, which starts "The left degree sequence..."

You start by choosing d. The higher the value for d, the more space-efficient the algorithm, but d+1 must be less than n, the number of message symbols (a.k.a. input slices). In the paragraph, he defines H(d) and says it approximates ln(d). Then he gives the formula for lambda where lambda_i is the portion of message symbols (input slices) that are associated with i check symbols (redundancy slices). So, lambda_2 might be 20%, which means 20% of input slices can be recovered by exactly 2 redundancy slices. The formula for lambda_i is 1/(H(d)(i-1)). All that looks correct.

The problem is the next sentence, "The average left degree a_l equals H(d)(d+1)/d." I think that is wrong. The formula to compute the average is:
sum from i = 2 to d+1 of i * lambda_i
= 2*1/(H(d)*(2-1)) + 3*1/(H(d)*(3-1)) + ... + (d+1)*1/(H(d)*(d+1-1))
= (2/1)/H(d) + (3/2)/H(d) + ... + ((d+1)/d)/H(d)
= (1+1 + 1+1/2 + 1+1/3 + ... + 1+1/d)/H(d)
= (1+1+1+...+1 + 1+1/2+1/3+...+1/d)/H(d)
= (d + H(d))/H(d)    [substituting the definition of H(d)]
= 1 + d/H(d)
which is definitely not equal to what Luby has, H(d)(d+1)/d.

When d=100, H(d) is about 5.18. Luby's value is 5.24. My calculation is 20.28. For larger values of d, the separation gets wider. For d=10000, Luby's value is 9.79 and mine is 1022.

Why is this important? Because the average left degree has to equal the average right degree and the average right degree is basically "how many input slices are used to create each recovery slice". If we want Par3 (or Par4) to be disk efficient, we want a large value for d. But if we want it to be fast, we also want each recovery slice to be computed from only a few input slices. And if my formula is correct and H(d) is approximately ln(d), then that value goes up with d/ln(d), which is a big value.

I've included my code in this post. At the start is a big comment describing each variable name in the paper. That may help if you try to read the paper. The actual code to compute H, lambda, and the average is about 60 lines.

tornadocode_standalone.cpp.txt

I'll spend another day looking at Tornado Codes, but we might look into other LDPC codes.

@mdnahas
Member Author

mdnahas commented May 24, 2019

I was wrong. The paper used a strange notation. lambda was not the portion of nodes with out-degree i, but the portion of EDGES that are connected to nodes with out-degree i. So I was multiplying by i when I should have been dividing.

For 1000000 input slices, the average number of input slices used to compute each recovery slice is 14. Which is the nice low value that we want!
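For anyone who wants to check the arithmetic, here's a quick standalone computation of both readings (this is not the attached file, just the formulas from the discussion above):

```cpp
#include <cstdio>
#include <cstdlib>

// Compare the two readings of lambda_i = 1/(H(d)*(i-1)), i = 2..d+1.
// Under the corrected reading (lambda_i = fraction of EDGES incident to
// degree-i left nodes), the average left degree comes out to H(d)(d+1)/d,
// matching Luby; the node-fraction reading gives the large 1 + d/H(d) value.
int main(int argc, char** argv) {
  int d = argc > 1 ? atoi(argv[1]) : 100;
  double H = 0;
  for (int i = 1; i <= d; i++) H += 1.0 / i;

  double node_reading = 0, edge_reading_inv = 0;
  for (int i = 2; i <= d + 1; i++) {
    double lambda = 1.0 / (H * (i - 1));
    node_reading += i * lambda;      // lambda as a fraction of nodes (my earlier mistake)
    edge_reading_inv += lambda / i;  // lambda as a fraction of edges (the paper's meaning)
  }
  printf("H(%d) = %.3f\n", d, H);
  printf("Luby's H(d)(d+1)/d    = %.3f\n", H * (d + 1) / d);
  printf("node-fraction reading = %.3f\n", node_reading);
  printf("edge-fraction reading = %.3f\n", 1.0 / edge_reading_inv);
  return 0;
}
```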

tornadocode_standalone.cpp.txt

@mdnahas
Member Author

mdnahas commented Jun 1, 2019

I've been looking into Tornado Codes and it doesn't look good. I'm generating ones with 100 to 32000 inputs and 20% to 5% of recovery data and seeing how often they can recover the original file.

For 100 input blocks and 20% recovery data (20 parity blocks), in 1% of cases we cannot repair 3 or fewer errors. Some graphs could not even repair 1 missing block. On average, we can repair 13 errors, which is far fewer than 20.

For larger numbers, I looked at 10,000 input blocks with 5% recovery data (500 parity blocks). In 1% of cases, we cannot repair 84 errors. (The worst I saw was 42 errors that it could not repair!) On average, we can repair 415 errors.

I've seen a paper that tries to fix some of these problems with "small" graphs, by making sure every input has trivial coverage and that parity blocks don't overlap. But I also found a paper that says Tornado Codes have a high overhead until you get to above 100,000 blocks.
James S. Plank paper on LDPC with small graphs

I had already been thinking that Par3 (or whatever the next version is called) should support all linear codes. That is, any matrix of inputs to recovery blocks using GF(16 bit). The nice thing about that approach is that it would support Tornado Codes, Reed-Solomon Codes, and some others. I'm now thinking that we may need to find our own style of linear code. Tornado Codes don't do well with "small" numbers of input blocks because they use XOR. We might be able to get some better performance using GF(16 bit) operations. I haven't seen any paper doing that. So, this may involve some research.

I've attached my test program, if you want to see what I'm doing. "a.out 10000 5 100" creates a graph with 10000 inputs, 5% recovery data, and simulates recovering damage 100 times.
tornadocode_standalone.cpp.txt

@mdnahas
Member Author

mdnahas commented Jun 2, 2019

This might be the paper we want. It says that an n-by-n random matrix where each row has log(n) non-zero elements usually has rank n minus a constant. And if we use GF(q), the probability that the rank is less than n-k is some constant times 1/q^k.
Paper

In fact, we're probably using this theorem right now. Because of James S. Plank's mistake, we aren't using a Vandermonde matrix but some other matrix. That is, a random-ish matrix with at least log(n) non-zero values per row.
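If anyone wants to spot-check that rank claim on small matrices, here's a rough Monte Carlo sketch (naive GF(2^16) arithmetic with PAR2's 0x1100B polynomial; the sizes, seed, and trial counts are just illustrative):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

static uint16_t gf_mul(uint16_t a, uint16_t b) {   // bit-by-bit GF(2^16) multiply, poly 0x1100B
  uint32_t acc = 0, aa = a;
  for (; b; b >>= 1) { if (b & 1) acc ^= aa; aa <<= 1; if (aa & 0x10000) aa ^= 0x1100B; }
  return (uint16_t)acc;
}
static uint16_t gf_inv(uint16_t a) {                // a^(2^16 - 2) by repeated squaring
  uint16_t r = 1, base = a;
  for (int e = 0; e < 15; e++) { base = gf_mul(base, base); r = gf_mul(r, base); }
  return r;
}

// Rank by Gaussian elimination over GF(2^16) (addition is XOR).
static int gf_rank(std::vector<std::vector<uint16_t>> m) {
  size_t rows = m.size(), cols = m[0].size(), r = 0;
  for (size_t c = 0; c < cols && r < rows; c++) {
    size_t p = r;
    while (p < rows && m[p][c] == 0) p++;
    if (p == rows) continue;
    std::swap(m[r], m[p]);
    uint16_t inv = gf_inv(m[r][c]);
    for (size_t j = c; j < cols; j++) m[r][j] = gf_mul(m[r][j], inv);
    for (size_t i = 0; i < rows; i++) {
      if (i == r || m[i][c] == 0) continue;
      uint16_t f = m[i][c];
      for (size_t j = c; j < cols; j++) m[i][j] ^= gf_mul(f, m[r][j]);
    }
    r++;
  }
  return (int)r;
}

int main(int argc, char** argv) {
  int n = argc > 1 ? atoi(argv[1]) : 200;
  int trials = argc > 2 ? atoi(argv[2]) : 20;
  int per_row = (int)std::ceil(std::log((double)n));  // ~log(n) non-zeros per row
  std::mt19937 rng(12345);
  for (int t = 0; t < trials; t++) {
    std::vector<std::vector<uint16_t>> m(n, std::vector<uint16_t>(n, 0));
    for (int i = 0; i < n; i++)
      for (int k = 0; k < per_row; k++)                // duplicate columns just overwrite
        m[i][rng() % n] = (uint16_t)(1 + rng() % 65535);
    printf("trial %d: rank deficiency = %d\n", t, n - gf_rank(m));
  }
  return 0;
}
```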

My thinking is that we can build recovery out of random matrices that are n-by-n, whose rows have log(n) non-zero elements and which have rank n-k, and add a few rows with n elements derived from the Vandermonde matrix. Then we'll only have wasted k parity blocks and gotten an n/log(n) speedup.

This doesn't avoid the fact that inverting a large matrix is expensive. Tornado Codes do nicely because their recovery mechanism only looks at one recovery block at a time. I'll see what I can do. Perhaps we can build a hierarchy of small random matrices.

@animetosho
Copy link
Contributor

For those following along, I noticed that Yutaka posted another comment a while ago:

I came up with a simple method to construct non-random generator matrix. It's based on Reed-Solomon Codes, but acts as non-binary LDPC. I feel that it may be good for our Parchive usage. But, I'm not sure that it's good or not. I put the example of my idea on HTML format here for future note.

Split_Reed-Solomon.htm

@mdnahas
Copy link
Member Author

mdnahas commented Jul 18, 2019

I'm still looking at Parchive Version 3. I hit on a crazy off-the-wall idea.

I decided to look at other file formats. Wikipedia has a list.

Many are proprietary or single-platform, but a few are not. An interesting one is Java Archive files with the ".jar" extension. That format, if you don't know, is just a ZIP file. It just has a different extension and the only difference is that it contains specific files (e.g., "META-INF/MANIFEST.MF"). I had the crazy idea that it might be useful if Parchive Version 3 also fit another file format. A likely candidate is "tar". Tar file format

The tar file format consists of blocks of 512 bytes. The original tar file had 3 kinds of blocks. Header blocks contain the file name, file size, file type, permissions, owner, etc.. Data blocks hold the content of files. The last kind of blocks are end-of-file indicators and are just filled with zeros.

The tar file format was upgraded in 1988. The new USTAR file format added a magic value ("ustar\000"). It also expanded the file-type field (previously "normal", "hard link", or "soft link") to support 256 types of header blocks and reserved some values for custom extensions. We could put Parchive data either into files with special names or into header blocks marked with a custom extension, as sketched below.
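
For reference, here is a minimal sketch of the 512-byte USTAR header layout (POSIX.1-1988). The 'P' typeflag at the end is purely hypothetical, just to illustrate the custom-extension idea; it is not part of any tar standard or Par specification.

    // Minimal sketch of the 512-byte USTAR header block (POSIX.1-1988 layout).
    struct ustar_header {
        char name[100];     // file name
        char mode[8];       // permissions, octal ASCII
        char uid[8];
        char gid[8];
        char size[12];      // file size in bytes, octal ASCII
        char mtime[12];
        char chksum[8];     // checksum of this header block only
        char typeflag;      // '0' regular, '1' hard link, '2' symlink, ...
        char linkname[100];
        char magic[6];      // "ustar" + NUL
        char version[2];
        char uname[32];
        char gname[32];
        char devmajor[8];
        char devminor[8];
        char prefix[155];   // path prefix
        char pad[12];       // pad to 512 bytes
    };
    // Hypothetical: mark a header block as holding Parchive recovery data.
    constexpr char kParchiveTypeflag = 'P';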

I'm definitely saying this is a crazy idea. But I found out that "tar" doesn't contain checksums for its files, only a checksum for the data in the header block! You would think that a file archiver would do that, but it doesn't. "tar" is out of date. Reusing tar's format provides an easy way to include file permissions and file data into the Parchive format. Another benefit of reusing the tar format (especially if we did it using files with special names) is that we can debug using the tar program and anyone who doesn't have access to install a Parchive 3 client could still get access to files using the common utility, "tar".

There are many arguments against this idea. I've always thought of Parchive as a unix-pipeable program. I always thought of users doing something like "tar files* | compress | par > backup.tar.Z.par". Reusing the tar format doesn't really make sense in that sequence, because par really should only be focused on one thing, redundancy, and other programs like "tar" should be focused on archiving (grouping files, storing permissions, etc.).

The other argument is that we don't plan on implementing all of "tar"'s file format right away. It adds a lot of headaches for client writers. We might mess up some detail of "tar" and end up with a wacky file format that isn't actually compatible. We should specialize in redundancy, which we're good at, and leave "tar" people to fix their own problems.

I've pretty much argued myself out of using the tar file format. It is far easier for Parchive version 3 to just reuse the message format that I designed for version 2. But the tar format is an intriguing option and I thought I should put the idea out there and see if it sparks any discussion.

@mdnahas
Copy link
Member Author

mdnahas commented Dec 30, 2021

New draft of the specification.

Big changes:

  • add External Recovery Data packet, for completeness.
  • chunk descriptions now have their own table describing their fields
  • chunk descriptions now use a 40-byte rolling hash and allow inline data
  • FAT permissions now support the "Content Not Indexed" bit from NTFS
  • Fingerprint hash is Blake3.

I reordered a few things.

The rolling hash is still CRC-64-XZ. This is because it has a clear, available specification. The specification for CRC-64-ISO costs money, and it has been "withdrawn". (I'm guessing that means superseded by a new specification.) The closest things to a specification for it were this page and this page. If you can find a good spec for CRC-64-ISO (or another hash), I'll change it. If you have simple code to define the CRC, I would probably accept that.

The 16kB hash is now the rolling hash, instead of the fingerprint hash.

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes. Most filesystems only support 255 byte names.

I changed the length of a path to 2 bytes. In Linux, paths are limited to 4kB and, in Windows, to 260 bytes.

I changed the length of the owner name and group names to 1 byte. The maximum allowed length on Linux is 32 bytes.

I am very happy with the specification. I feel like most of the pieces are in place. After your comments, I am likely to take "UNFINISHED" off the front of the specification and proclaim this the "BEFORE-REFERENCE-IMPLEMENTATION" version and put the specification on the website. Obviously bugs will be found and worked out during the first implementation of it. But I think we're at the point that we go to code.

BTW, I'm going to be starting a new business next week. I won't have time to write the reference implementation. I will be available to review code and write some unit tests.

Does anyone want to take responsibility for writing the reference implementation?

Par3_spec.txt

@animetosho
Copy link
Contributor

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes. Most filesystems only support 255 byte names.

Note that 255 bytes under one encoding may end up longer than 255 bytes when converted to UTF-8. Particularly likely in encodings which use 2 bytes/character that end up as 3 bytes/character in UTF-8 (such as some East Asian character sets).

in Windows, to 260 bytes.

Windows actually supports 32767 UCS-2 characters (65534 bytes) in paths via UNC naming.

I think some UCS-2 characters can map to >2 byte encodings in UTF-8, but I doubt exceeding 64KB in a path name is likely.

BTW, I'm going to be starting a new business next week

Congrats on getting this far, and hope the business works out well!

@Yutaka-Sawada
Copy link

Yutaka-Sawada commented Jan 3, 2022

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes.

As Mr. Anime Tosho wrote already, UTF-8 encoding may exceed the limit. I tested filename length on my PC (Windows 10). I can set at most 239 Japanese Unicode characters for a filename with Windows Explorer. Each character is encoded as 3 bytes in UTF-8, so the UTF-8 encoded filename length becomes 239 * 3 = 717 bytes. Thus, the filename length field requires 2 bytes.

Fingerprint hash is Blake3.

I tried to compare the speed of hash algorithms on my PC. While BLAKE3's official implementation works well, I could not compile KangarooTwelve's sample code on Microsoft Visual Studio 2019. Does someone know where to get KangarooTwelve's optimized source code for MSVC? Anyway, BLAKE3 is much faster than MD5 when SSE2/SSE4.1/AVX2/AVX512 is available. BLAKE3 is 5 times faster than MD5 on my PC with AVX2.

About the rolling hash, CRC-64-XZ is fast enough with CLMUL (the carry-less multiplication instruction set). Because most current Intel/AMD CPUs have this feature, there would be no problem. I put an example of the speed difference on my PC below. Note that speed varies with the CPU's extensions and SIMD support; this result is just a sample from a recent CPU and may differ on older or newer CPUs.

Tested with Intel Core i5-10400 2.9 GHz (single thread)

CRC-32 (table lookup) : 636 MB/s
CRC-32 (with CLMUL) : 5571 MB/s

CRC-32-C (SSE4.2, 32-bit) : 4739 MB/s
CRC-32-C (SSE4.2, 64-bit) : 9132 MB/s

CRC-64-XZ (table lookup) : 656 MB/s
CRC-64-XZ (with CLMUL) : 6097 MB/s

CRC-64-ISO (table lookup) : 653 MB/s
CRC-64-ISO (64-bit) : 3875 MB/s
CRC-64-ISO (with CLMUL) : 6097 MB/s

MD5 : 692 MB/s
BLAKE3 (with AVX2) : 3875 MB/s

@malaire
Copy link

malaire commented Jan 3, 2022

Even old CPUs can get good speed. I have 9 year old CPU and get these numbers:

Tested with Intel Core i5-3570K 3.4 GHz (single thread)

CRC-32-C (8 kiB slicing-by-8 table lookup): 2400 MB/s
CRC-32-C (hardware): 15000 MB/s

CRC-64-XZ (16 kiB slicing-by-8 table lookup): 1700 MB/s
CRC-64-XZ (hardware): 3200 MB/s (*)

*) Library I'm using claims up to 28200 MB/s but that probably requires newer CPU.

Of course all of these are for the non-rolling variant. I don't know if the rolling hash can be sped up (it's 240 MB/s with table lookup).

@malaire
Copy link

malaire commented Jan 3, 2022

btw, will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?

@Yutaka-Sawada
Copy link

will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?

As a file format, it's designed to support any Galois field and any code matrix. But someone needs to write source code to use FFT-based Reed-Solomon Codes for PAR3. Though there are some papers about the technology, I could not understand the math. Actually, I cannot understand even the current recovery codes in the PAR3 spec. (It's hard for me to read mathematical terms in English.) That is why I cannot write a reference implementation of a PAR3 client.

@malaire
Copy link

malaire commented Jan 3, 2022

As a file format, it's designed to support any Galois field and any code matrix.

The "any code matrix" part is what I don't understand. I'm able to implement FFT based Reed-Solomon but I can't see any matrix there. (But it is possible it's hidden somewhere in the algorithm as I don't really understand it.)

@animetosho
Copy link
Contributor

I could not compile KangarooTwelve's sample code on Microsoft Visual Studio 2019. Does someone know where to get KangarooTwelve's optimized source code for MSVC ?

They give some info here. Basically you need to use the make tool under Cygwin/MSYS2 to generate the .vcxproj file. I've attached the output of such below, in case you are unable to do it:

K12-master.zip

As they mention, assembly optimisations aren't available on MSVC because they're written for GCC. Perhaps it's possible to assemble these into object files, then disassemble into MASM format, but I haven't bothered checking that.
I do note that the assembly optimised version does run noticeably faster than the non-assembly version.

It looks like BLAKE3 internally uses a tree, which means that it can scale across cores without issue. This gets around the issue of a single-threaded hashing bottleneck that currently exists in PAR2.
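
For anyone experimenting, here is a minimal sketch of hashing a buffer with the C API from the official BLAKE3 repository (blake3.h); whether and how a PAR3 client uses this API is of course up to each implementer.

    // Minimal sketch: hash a buffer with the official BLAKE3 C API.
    #include <stddef.h>
    #include <stdint.h>
    #include "blake3.h"

    void hash_block(const uint8_t *data, size_t len, uint8_t out[BLAKE3_OUT_LEN]) {
        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        blake3_hasher_update(&hasher, data, len);      // can be called repeatedly for streaming
        blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);  // 32-byte digest
    }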

@dvasdekis
Copy link

dvasdekis commented Jan 4, 2022

Hi folks, I've been gratefully watching your work for some time and am eagerly awaiting the PAR3 spec to solve some of the problems we're facing in the space where I work (scientific computing). Our use case is that we have millions of files (mostly video and audio from sensors - e.g. TIF, JPEG, MP4) in a dataset, and the dataset grows slowly over time - for example, our botanical scientists have over 1 million ultra-high resolution photos of plants (around 500MB for each photo), which are used to identify changes in plant genetics. We have multiple similar use cases, and adopting a normal cloud service for this (e.g. AWS S3) would be totally cost prohibitive. We currently store this data in a messy combination of onsite NAS and tape storage, and we want to try adopting Usenet as a place to store and share these huge datasets. But error correction with PAR2 on a 500TB file collection, with a 700K article size and a 32768-block limit, is obviously very inefficient - each PAR2 block would span approx. 21798 Usenet articles!

So I was excited to read the spec as posted by @mdnahas (thank you very much for your wonderful work!), but it does seem, firstly, that there is an extension of scope here that makes PAR3 harder to implement. Commenting against your design goals from line 40, the ones below 'make sense' to me (as a non-expert!) as an improvement of PAR2 as it is used today for Usenet binaries:

  • support more than 2^16 files
  • support UTF-8 filenames (Par2 added this after version 2.0.)
  • support more than 2^16 blocks
  • replace MD5 hash (It is both slow and less secure.)
  • support "tail packing", where a block holds data from multiple files

However the below seem to be new features that don't fit in with the Usenet PAR2 workflow. To comment on each from the perspective of my use case (and indeed, anyone who wants to create redundancy on a large number of files in one PAR3 set, like the poster above who wants to back up 100K images):

* support "incremental backups" - this is obviously not supported today, but would be fantastic for us. We would love to be able to add a file to a PAR set, and have the program produce some extra files that, alongside the original files, would allow for the restoration of this new file. But I don't believe this is mathematically possible without recreating the entire PAR set?

* support deduplication, where the same block appears in multiple files - shouldn't add much overhead, but not really valuable, as a modern archive format (e.g. 7zip) already does this via its 'compression dictionary'.
* dropped requirement for 4-byte alignment of data - this would only save 3 bytes per block, best case?
* support empty directories - not sure why someone would need this when the archive format can handle this?
* support file permissions - security practices I've seen would only ever set permissions at the parent folder/'container' level, making file based permissions unnecessary? Additionally, as permissions work differently between different OS's, it seems like a messy requirement that doesn't add value
* support hard links and symbolic links - once again, not sure why someone would need this? I would just ship my archive with a little script that creates these if necessary?
* support files that work both as a Par3 file and another type. For example, putting recovery data inside a ZIP, ISO 9660 disk image, or other file. - I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?
* support for storing data inside a Par3 file (This was optional in Par2.) - I don't exactly understand this feature?
* support any Galois field that is a power of 2^8 - The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?
* support any linear code (Reed-Solomon, LDPC, random sparse matrix, etc.) - Same as above.

I appreciate that there are going to be users of PAR3 who do have use for the above features! But are they the users that the developers wish to address? They shouldn't feel the pressure to address them if they don't want to.
I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.

Instead, if one wants to create a new archive format that captures more of the "definitions" (e.g. archiver, checksummer, grouper, etc. as described prior by @mdnahas) it would be extremely welcome work, as it would remove the need for multiple file passes and could include support for things like on-the-fly error correction of video streams - all very welcome, but probably outside the user-perceived scope of PAR3 :)

Please forgive me if I have misunderstood any of the above, and my apologies if this post is unwelcome. Thanks again!

@Yutaka-Sawada
Copy link

As they mention, assembly optimisations aren't available on MSVC because they're written for GCC.

Yes, this is the problem. They might have forgotten to include the AVX2noAsm and AVX512noAsm code in the package. The eXtended Keccak Code Package seems to contain them. Thanks, Mr. Anime Tosho, for the advice.

I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.

I thought so, too. Users' voices are welcome. Thanks, Mr. Dimitri Vasdekis. Because I'm a lazy programmer, I felt that a small improvement for current usage might be enough. I'm not sure that I will be able to implement some of the new features in the PAR3 specification. But I will try, when users request a feature in the future.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 4, 2022

@dvasdekis writes:

Hi folks, I've been gratefully watching your work for some time and am eagerly awaiting the PAR3 spec to solve some of the problems we're facing in the space where I work (scientific computing). Our use case is...

I'm glad Parchive is being considered. I love hearing about users and new use cases.

the below 'make sense' to me (as a non-expert!) as an improvement of PAR2 as it is used today for Usenet binaries :
...

However the below seem to be new features that don't fit in with the Usenet PAR2 workflow. To comment on each from the perspective of my use case (and indeed, anyone who wants to create redundancy on a large number of files in one PAR3 set, like the poster above who wants to back up 100K images):

* support "incremental backups" - this is obviously not supported today, but would be fantastic for us. We would love to be able to add a file to a PAR set, and have the program produce some extra files that, alongside the original files, would allow for the restoration of this new file. But I don't believe this is mathematically possible without recreating the entire PAR set?

An example of incremental backup would be: you have 100 files and protect them with File1.par3, and then add 10 more files and protect all of them with File2.par3. A client will be able to read both File1.par3 and File2.par3 and use them to recover the 110 files. Obviously, File1.par3 will not protect any of the 10 new files.

The incremental backup case makes sure that you can use older Par3 files with newer ones. You can get by with a smaller File2.par3 because File1.par3 exists.

I suppose a client could read File1.par3, compute the effect of the 10 new files, and write File3.par3, which would protect all 110 files. But that is not the "incremental backup" case. That, in fact, could be done with Par2 right now.

* support deduplication, where the same block appears in multiple files - shouldn't add much overhead, but not really valuable, as a modern archive format (e.g. 7zip) already does this via its 'compression dictionary'.

Yes, if you want to compress all your files first. If you want to protect files on a USB drive and have the files uncompressed, the feature is valuable. So, the feature is not useful for the Usenet use case. That is not the only use case.

* dropped requirement for 4-byte alignment of data - this would only save 3 bytes per block, best case?

That is a longer discussion. You can find it discussed above in this issue thread.

* support empty directories - not sure why someone would need this when the archive format can handle this?

You're assuming someone compresses first.

* support file permissions - security practices I've seen would only ever set permissions at the parent folder/'container' level, making file based permissions unnecessary? Additionally, as permissions work differently between different OS's, it seems like a messy requirement that doesn't add value

Yes, it is messy. It may not add value to your use case, but it does to people protecting a USB drive.

* support hard links and symbolic links - once again, not sure why someone would need this? I would just ship my archive with a little script that creates these if necessary?

Now, you're assuming the receiver will run a script after receiving the data?? Yeah, I would never do that.

* support files that work both as a Par3 file and another type. For example, putting recovery data inside a ZIP, ISO 9660 disk image, or other file. - I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?

If you want to send a file that works both as a Par3 file and a ZIP file. That is, one where if a user only has ZIP, they can unzip the file. But, if they have a Par client, they can fix the damaged file and unzip it.

* support for storing data inside a Par3 file (This was optional in Par2.) - I don't exactly understand this feature?

Normally with Par2, you send the original files and a .par2 file. In this usage, you only sent a .par2 file, but the data for all the original files is stored inside the .par2 file. It makes .par3 behave like most other archive file formats.

The feature was optional in Par2. I was hoping that clients would use it and that Usenet would stop using RAR to split files. With a compressed video file, RAR compression does almost nothing --- people were only using RAR to split files into pieces. I had hoped that it could be done with Par2. In Par3, I made the feature mandatory.

* support any Galois field that is a power of 2^8 - The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?

First, I don't think supporting any Galois field is very complicated. Second, picking a "sane default" is hard. Lastly, some algorithms (like FFT Reed Solomon) want to choose a particular Galois Field.

* support any linear code (Reed-Solomon, LDPC, random sparse matrix, etc.) - Same as above.

This offers the trade-off of more protection vs. speed. Again, I think picking a "sane default" is hard. Also, it allows future development of algorithms.

I appreciate that there are going to be users of PAR3 who do have use for the above features! But are they the users that the developers wish to address? They shouldn't feel the pressure to address them if they don't want to. I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.

My goal has always been to expand Parchive's use cases. If you read the design for Par1, you'll see it was very, very tied to Usenet and to RAR. Par2 is used in other places besides Usenet. I'm actually disappointed at how limited the Par3 design is --- I wanted it to be used for many more things. But, because the data can be stored external to the Par3 file, the names of the files containing that data need to be readily accessible in the format. It actually took me a while to design Par3 because I struggled against that constraint.

Instead, if one wants to create a new archive format that captures more of the "definitions" (e.g. archiver, checksummer, grouper, etc. as described prior by @mdnahas) it would be extremely welcome work, as it would remove the need for multiple file passes and could include support for things like on-the-fly error correction of video streams - all very welcome, but probably outside the user-perceived scope of PAR3 :)

I had hoped to do that. Unfortunately, if I wanted Par3 to still be useful to Usenet, it couldn't be done. It will have to be done by some other designer with another file format. It really should be done --- "tar" is old and crusty.

Please forgive me if I have misunderstood any of the above, and my apologies if this post is unwelcome. Thanks again!

Thanks for caring enough to comment. As I said, it's always interesting to hear from users and hear new use cases.

@animetosho
Copy link
Contributor

we want to try adopting Usenet as a place to store and share these huge datasets

I personally can't vouch for this being a great idea, but you're welcome to try I guess. Concerns are mostly around long term viability, but if it's just a backup copy (or just for distribution), it's probably not a big issue.
If you do go down the route, then note that a PAR2 doesn't need to cover the entire data set - you can choose to have multiple PAR2s cover subsets.

But I don't believe this is mathematically possible without recreating the entire PAR set?

The feature largely acts like separate PAR archives, where one declares another to be its parent, so it's as possible as it is to create completely separate PAR sets.

There's downsides to the approach, such as redundancy not being shared across the whole archive, and the structure could get unwieldy if it's extended a lot (since it's essentially a linked list). A client could always choose to sidestep the feature and just update the original PAR3 file instead, which wasn't really possible under PAR2 (PAR2's strict file ordering made it infeasible to do in most cases).

I would just ship my archive with a little script that creates these if necessary?

Remember that this is just the specification, which defines the maximum extent of what is allowed. You'd ultimately be interacting with a client, which will be responsible for catering for your use case (as long as the specification doesn't restrict its ability to do so).

If you don't want symlinks/hardlinks to be represented, ideally you'd tell your client such (if its default behaviour isn't what you want) and it'd just ignore those attributes when creating the PAR3.

I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?

An example might be a program which creates an archive of files with redundancy (like the "new archive format" you mention later). Currently this requires two steps, and creates multiple files, whereas the change could allow for it to happen in one step, and bundled into a single file.

The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?

Again, it'll be up to the client to determine how this is handled/presented to the end user, the specification just enables the possibility.

is PAR3 just 'better than' PAR2 for the same use cases?

It supports all the features that PAR2 has, so other than implementation complexity, it theoretically shouldn't be any worse :)


I was hoping that clients would use it and that Usenet would stop using RAR to split files.

RAR is mostly used because people are stubborn and blindly follow 20 year old tutorials/advice which recommend doing it.

Embedding files within PAR isn't helpful to the Usenet use-case (in fact, it'd be detrimental, because downloaders only download PAR files if they detect damage).

@mdnahas
Copy link
Member Author

mdnahas commented Jan 11, 2022

Wow! @animetosho I love your rant/explanation about RAR!!! I'd be happy if we convince people that Par3 means you don't need RAR. ... even if the truth is that they didn't need RAR in the first place!

@mdnahas
Copy link
Member Author

mdnahas commented Jan 11, 2022

Thanks @Yutaka-Sawada , I will change the standard back to 2-bytes for filename/dirname length and 4-bytes for path length.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 11, 2022

I took another look at CRCs. I hated that the XZ one does the stupid bit twiddling at the start and end. It may make it mathematically meaningful, but it doesn't add anything to its error detection.

It looks like the CRC-64-ECMA is the same as XZ, but without the bit twiddling. ("ECMA-182"). So we could use that... but it is the larger polynomial.

The CRC-64-ISO has the small polynomial, but does the bit twiddling.

It took me a while to discover that the polynomial in the reference code for XZ, which is:
0xC96C5795D7870F42
is actually the bit-reversal of the polynomial that is on this page and this page:
0x42f0e1eba9ea3693

You can see that when we write the XZ code's polynomial in binary:
1100100101101100010101111001010111010111100001110000111101000010
alongside the website's polynomial in binary:
0100001011110000111000011110101110101001111010100011011010010011
It takes some time, but you can check that they are bit-reversals of each other.

So, I was able to find the small ISO polynomial on the webpages:
0x000000000000001b
and the bit-reversed value is
0xD800000000000000
(You can also find this constant in this source code. )
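
A minimal sketch to double-check the constants above by bit-reversing the 64-bit values; nothing here is specific to any CRC library.

    // Minimal sketch: confirm that each reference-code constant is the
    // bit-reversal of the polynomial listed on the websites above.
    #include <cstdint>
    #include <cstdio>

    uint64_t bit_reverse64(uint64_t x) {
        uint64_t r = 0;
        for (int i = 0; i < 64; ++i) {
            r = (r << 1) | (x & 1);
            x >>= 1;
        }
        return r;
    }

    int main() {
        std::printf("%d\n", bit_reverse64(0x42f0e1eba9ea3693ULL) == 0xC96C5795D7870F42ULL);  // XZ/ECMA
        std::printf("%d\n", bit_reverse64(0x000000000000001bULL) == 0xD800000000000000ULL);  // ISO
        return 0;
    }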

Should we go with the ISO polynomial and no bit twiddling?

That would mean the XZ code with

    static const uint64_t poly64 = UINT64_C(0xD800000000000000);

and the lines

    uint64_t
    crc64(const uint8_t *buf, size_t size, uint64_t crc)
    {
        crc = ~crc;
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
        return ~crc;
    }

replaced by

    uint64_t
    crc64(const uint8_t *buf, size_t size, uint64_t crc)
    {
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)] ^ (crc >> 8);
        return crc;
    }

How does that sound?

@Yutaka-Sawada Do you think it will make the code faster (or easier to write)?

@malaire
Copy link

malaire commented Jan 11, 2022

I hated that the XZ one does the stupid bit twiddling at the start and end. It may make it mathematically meaningful, but it doesn't add anything to its error detection.

A practical reason for non-zero initial value for CRC is to detect zero-bytes at start. With 0 as initial value CRC just ignores any zero-bytes at beginning, i.e. CRC remains 0 until first non-zero byte.

But my understanding is that that feature isn't useful when using fixed block size, like rolling CRC usually does. (It is useful when you have two blocks of different length but which have same content, except for the amount of zero-bytes at beginning.)

@malaire
Copy link

malaire commented Jan 11, 2022

That would mean the XZ Code with ...

In addition to bit-complement at beginning/end, other difference between CRC-64/XZ and CRC-64/ECMA-182 is bit order (i.e. refin and refout parameters in CRC catalogue).

@Yutaka-Sawada
Copy link

Should we go with the ISO polynomial and no bit twiddling?

When CLMUL is available on the CPU, there is no speed difference between the polynomials. Only when CLMUL isn't available is the CRC-64-ISO polynomial faster than the CRC-64-XZ one. But PAR3 isn't designed for obsolete CPUs, so there is no big reason to select a polynomial by speed.

By the way, it's possible to calculate CRC-64-ISO without table lookup. I put my hash testing application (TestHash_2022-01-11.zip) in "tool" folder on OneDrive. There are some implementations of hash functions. You may read the source code. The CRC-64-ISO's sample code is like below.

uint64_t
crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    uint64_t A;
    crc = ~crc;
    for (size_t i = 0; i < size; ++i){
        A = crc ^ buf[i];
        A = A << 56;
        crc = (crc >> 8) ^ A ^ (A >> 1) ^ (A >> 3) ^ (A >> 4);
    }
    return ~crc;
}

About bit flipping, I don't like the useless factor. The initial and final bit flipping affects the CRC calculation when sliding a window. You may see the file "crc.h" of par2cmdline; there is a value windowmask for sliding the CRC window.

u32 windowmask = ComputeWindowMask(window);
crc = windowmask ^ CRCSlideChar(windowmask ^ crc, buffer[window], buffer[0], windowtable);

This windowmask is the result of the bit flipping in CRC-32. If you omit the initial and final bit flipping from the CRC-32 value, sliding won't require the windowmask, like below.

crc = CRCSlideChar(crc, buffer[window], buffer[0], windowtable);

So, slide speed may become a little faster, but the difference would be very small and ignorable. While I removed the effect of bit flipping from the CRC-32 before verification in my PAR2 client internally, I'm not sure it's worth the effort.

If you want to slide CRC-64 over arbitrary window sizes (block size, 16 KB, or 40 bytes) in PAR3, three windowmasks will be required to search for complete blocks, misnamed files, or tail chunks. Though it adds more complexity, the difference is very small. Technically, there is no problem at all.

Do you think it will make the code faster (or easier to write)?

No, it won't be remarkable. If you think that CRC-64-XZ is good as a definition, it's OK. If you want a simple construction, using CRC-64-ISO's polynomial without bit flipping is the simplest. But I cannot say which it should be. There is no difference in speed nor noticeable complexity. You should consider factors other than speed. I'm sorry for confusing you with my old post.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 11, 2022

BTW, did you see that there is a new version of BitTorrent?

New hash function.
New directory structure.
Sound familiar? :)

@animetosho
Copy link
Contributor

New hash function.
New directory structure.
Sound familiar? :)

Though likely for different reasons.

Hash function was changed from SHA1 to SHA256 due to security issues in the former. Security is actually more important in BT because you're taking data from untrustworthy peers.
I believe libtorrent's implementation is using the latest draft spec (from 2018), which was probably written before BLAKE3/K12 and similar, so uses the rather slow SHA256, though BT isn't otherwise CPU intensive and would likely be limited by network bandwidth.

The directory structure change is to avoid duplication in long/commonly-used path names. Although likely beneficial to PAR3 (though less so due to the various overheads), I believe your key motivator was to support empty directories.

BTv2 solves a rather annoying issue of BTv1, which didn't have stable per-file hashes, which I personally consider to be the biggest win.

@Yutaka-Sawada
Copy link

this is probably a good place to recommend that users be warned when they create Par3 files containing names that are incompatible with Windows, Mac, or Linux systems.

About invalid filenames on Windows OS, there seem to be some more. I'll post some examples of them.

Reserved filenames with an extension are invalid, too, such as "CON.txt". The check is case insensitive, so "con" and "con.txt" are also invalid. Windows Explorer cannot handle them.

A filename ending with a period is invalid, too, such as "something.". The last period is trimmed by Windows Explorer when renaming, so users normally cannot create such an invalid filename. But Windows Explorer cannot handle them.

A filename with a space at the start or end is invalid, too, such as " something" or "something ". These spaces seem to be trimmed by Windows Explorer automatically, so users may not normally see such invalid filenames.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 25, 2022

Okay, it sounds like we have a definition of CRC-64-ISO. And, while the bit inversion at start and end is annoying, it does serve a purpose and it is possible to optimize most of the cost away. I'll change the specification to CRC-64-ISO.

Are there any other open issues that need discussion?

If there are no open issues, I will:

  1. make CRC-64-ISO the rolling hash in Par3_spec.txt
  2. check in old versions of Par3_spec.txt into parchive.github.io repository, so we have a history of the specification.
  3. convert the new Par3_spec.txt into an HTML document (Or maybe MarkDown and generate HTML)
  4. declare the HTML document to be the official pre-reference-implementation version of the Parchive 3.0 specification

We still need someone to code the reference implementation. I'm willing to help, with unit tests and other stuff, but I don't have time to code it.

Are there any other open issues with the specification?
Is anyone interested in helping code the reference implementation?

@Yutaka-Sawada
Copy link

Are there any other open issues with the specification?

I'm not sure whether you already answered this, but Markus Laire asked a while ago:

btw, will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?

I think a PAR3 client developer can define new Matrix Packets for his own special Recovery Codes. Though the name is Matrix Packet, it is really a definition of how to make recovery data.

Is anyone interested in helping code the reference implementation?

If nobody else writes it, I can try. While it is difficult for me to implement PAR3's new Recovery Codes, I can copy most basic features from my PAR2 client. But, as I use Microsoft's C language and C runtime library for Windows OS, my source code may be incompatible with other environments.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 28, 2022

btw, will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?

Good point.

I looked at the paper on FFT Reed Solomon by Lin, Chung and Han. It is implemented by the library Leopard on GitHub.

This algorithm uses the normal Galois Fields and should be compatible with the current Par3 specification.

I don't know what code matrix it is using. I've emailed the author of Leopard and one of the authors of the paper, to see if they can define the matrix.

If I hear back, I'll replace the Cauchy matrix with it.

Is anyone interested in helping code the reference implementation?

If nobody else writes it, I can try. While it is difficult for me to implement PAR3's new Recovery Codes, I can copy most basic features from my PAR2 client. But, as I use Microsoft's C language and C runtime library for Windows OS, my source code may be incompatible with other environments.

That's great! I'll be glad to help make the code run with GCC on Linux. As well as help with unit tests.

@mdnahas
Copy link
Member Author

mdnahas commented Jan 28, 2022

I'm done converting the specification to Markdown. It uses Github's own dialect of Markdown, because the default Markdown does not support tables. It is attached.

I've generated the HTML and made the changes for the website, but we'll have to wait for Ike to accept my pull request before you can see them.

So excited! A great milestone! I can't wait to see the reference implementation!

Parity_Volume_Set_Specification_v3.0.md

@mdnahas
Copy link
Member Author

mdnahas commented Jan 28, 2022

Oh, also, how do you want to be credited in the specification?
@Yutaka-Sawada @animetosho @malaire

@mdnahas
Copy link
Member Author

mdnahas commented Jan 29, 2022

We have a new repo for Par3!

I'm moving discussion of the specification there. Post all future messages to:

Parchive/par3cmdline#1
