Skip to content

add std.compression.lz77 #1335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

WalterBright
Copy link
Member

No description provided.

@korobochka
Copy link

For me circular buffer seems like a general data structure, not specific to lz77. So my question is -- why not move it to std.array or std.container? I think this is not the only data structure that is implemented as private in non array/container related Phobos module.
Sorry if I am missing something.

* InputRange
*/

auto compress(R)(R src)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... if (isInputRange!R && is(ElementType!R : ubyte))

@JakobOvrum
Copy link
Contributor

It's much easier to get the documentation right if docs are generated and linked. Pull requests for new modules generally do this.

@andralex
Copy link
Member

andralex commented Jun 7, 2013

I think this is good work but some additional design is required. The entire signature/factory aspect must be handled upfront. Fundamentally we should have a function that given a range compressed with any supported algorithm is able to instantiate the proper decompressor and use it.

printf("Decompression done, dilen = %d decompressed = %d\n",
di.length, si2.length);

if (si != si2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(si == si2, "Buffers don't match");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert's go away with -release, I wanted this one to stay in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you expect this to be executed, then arguably assert(0) shouldn't be in there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(0) is replaced with a HLT instruction for -release.

@WalterBright
Copy link
Member Author

Hmm, I wrote a note on why there isn't a signature, it seems to have vanished.

@andralex
Copy link
Member

andralex commented Jun 8, 2013

@WalterBright did we agree that it's worth looking on compress/expand algos that traffic one ubyte[] at a time? Operating at ubyte level is inefficient for this algorithm and it will be inefficient for virtually all others.

@WalterBright
Copy link
Member Author

I considered and rejected having a factory & signature for the following reasons:

  1. the component is supposed to be independent. Having a factory to detect a signature and select the corresponding component requires that it knows about all the others. This is the job of an outer layer, not this.
  2. there are endless popular compression algorithms. Some have signatures, some don't. There is no central repository of signatures - so having one is unreliable at being unique
  3. having a signature will make front() slightly slower, as it'll have to check every time if it's seen the signature or not
  4. some archive formats, like zip, do not store the signature with the compressed data. It's done as a flag elsewhere.
  5. self-extracting executables, installers, etc., have no need of a factory/signature
  6. I have used compressors to compress short strings - they'd be rendered much less useful if a signature was added

Hence I believe the factory/signature belongs in an outer layer, not in this component.

@WalterBright
Copy link
Member Author

@andralex yes we did agree on ubyte[], but it's non-trivial to do it, and I don't think it's at the top of my endless stack, and this module does work without it.

}
else if (offset < 0x8000) // 11111XXXXXX
{
putBit(1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might as well implement putBits to use via putBits(1, 1, 1, 1) because there's so many wasteful one-liners here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to do it that way because the putBit()'s are all inlined and then optimized. Doing a loop would be slower.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can the compiler not optimize out the call putBits(1, 1, 1, 1, 0)? The arguments are known at compile-time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler currently does not unroll loops.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much for the talk about having discovered all major optimizations back in the 80s.. How is this optimization not implemented? It looks trivial.

@andralex
Copy link
Member

andralex commented Jun 9, 2013

(Edit: just got private message from Walter he was about to submit changes to most points. Apologies for being unfairly general in my criticisms.)

@andralex yes we did agree on ubyte[], but it's non-trivial to do it, and I don't think it's at the top of my endless stack, and this module does work without it.

We can't add a new module just because whoever wrote it just didn't have the time to get it right. We did that with std.json and std.xml and that didn't turn well. We could, however, add it to etc so people can experiment with it.

Before this gets in std, we must make sure whether a range of ubyte[] is more appropriate than a range of ubyte and how to use multiple compression algorithms in the same application without errors.

Also this is great as an experiment in minimalism but I can't accept in std a module that reinvents parts of std. It must not define its own notion of range, copy, circular buffer, and whatnot. Whatever issues are there with std's modularization, going lone cowboy on this is emphatically not the answer.

@WalterBright
Copy link
Member Author

New version posted incorporating most of your suggestions, thanks!

@kyllingstad
Copy link
Contributor

@WalterBright:

I considered and rejected having a factory & signature for the following reasons: [...]

I completely agree with this. I may want to use my own signature for my own compressed data; this is not the compression algorithm's business.

@WalterBright
Copy link
Member Author

Also check out line 464 of https://github.com/D-Programming-Language/phobos/blob/master/std/zip.d. If we always put out a signature, doing this code would get pretty hackish with ranges.

@andralex
Copy link
Member

andralex commented Jun 9, 2013

I considered and rejected having a factory & signature for the following reasons: [...]

I completely agree with this. I may want to use my own signature for my own compressed data; this is not the compression algorithm's business.

I completely agree too. All I'm saying is this algorithm must be surrounded by the appropriate design, in addition to being able to work independently.

If we make a mistake here (and just how often primitives just work without a design?), future implementers of compression algorithms will forever curse us for making it hard to get things done.

  1. The most obvious aspect is handling buffers vs. single bytes. If we do byte-level, /everyone/ will have to.
  2. If we make it unduly difficult to integrate with factories and signatures, again we'll only have trouble down the road trying to decompress data compressed with a different algorithm. As the simplest design, every algorithm must have a unique signature (e.g. 2-4 bytes) that its user may or may not choose to insert at the beginning of the stream. This algorithm would prolly have a signature like 1 (whereas plain copy may be 0). Then future algorithms will each reclaim new signatures.

@andralex
Copy link
Member

andralex commented Jun 9, 2013

Also check out line 464 of https://github.com/D-Programming-Language/phobos/blob/master/std/zip.d. If we always put out a signature, doing this code would get pretty hackish with ranges.

I don't think we should force inserting a signature. Our design must, however, facilitate it.

I'm saying, compress() and expand() are a great start but we need a little framework to encompass them and other algorithms.

@andralex
Copy link
Member

andralex commented Jun 9, 2013

Before I forget - this should inline the loop:

void putBits(T)(T bits...)
{
    foreach (b; bits) putBit(bit);
}

That would shorten code considerably and work as well as writing the calls by hand.


//import core.stdc.stdio : printf;

private struct CircularBuffer(U: T[dim], T, size_t dim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should migrate to std.range

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not put this in std.range or even std.container. Very rarely does a module need a circular buffer. The module situation is bad enough as it is, let's not make it worse.

Put it in std.circularbuffer. That way, this module can import just CircularBuffer and not the whole of Phobos. Hopefully at some point we can revert the mistake of having a std.container module, which was destined from the get-go to be a monster containing every conceivable data structure. Phobos really needs to be more organised and we might as well start now.

@WalterBright
Copy link
Member Author

The putBits(T)(T bits...) does not compile. The only thing that does work is:

void putBits(bits...)() {
    foreach(b; bits) putBit(b);
}

putBits!(1,0,1)();

@yebblies
Copy link
Contributor

void putBits(T...)(T bits) {
    foreach(b; bits) putBit(b);
}

putBits(1,0,1);

{
// Reached the end of the source
static if (arrayLike)
{ if (si != src.length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforceEx!CompressException(…)? ;)

@WalterBright
Copy link
Member Author

Thanks, yebblies! That's the right form, but the code generated is worse (compiler doesn't do a good job inlining it).

@yebblies
Copy link
Contributor

Thanks, yebblies! That's the right form, but the code generated is worse (compiler doesn't do a good job inlining it).

Ah, I'll bet this is a dmd-only problem.

Please, PLEASE reconsider naming the functions something other than 'compress' and 'expand'. We have great tools to resolve conflicting functions, but generic names increase the effort required to understand code, and conflicts ruin ufcs.

@dnadlinger
Copy link
Contributor

@yebblies: Seems like you are rather alone with that opinion.

@yebblies
Copy link
Contributor

@klickverbot Only until I convince somebody else!

@ghost
Copy link

ghost commented Jun 13, 2013

He's not alone in this, this should be inside of some struct. expand is common, for example it's used in std.typetuple, and in the DRanges library.

But more importantly, if someone adds another compression engine (say lz78), how are you going to use multiple compression engines at once? You'd have to fully qualify everything. expand for example doesn't even verify at compile-time that the input range is actually LZ77 compressed to begin with.

It's better to be type-safe than sorry (you could use std.compression.lz77.compress and then accidentally use std.compression.lz78.expand).

@dnadlinger
Copy link
Contributor

@AndrejMitrovic: Type safety has nothing to do with this, and it's as easy to typo an import statement as it is to misspell a function call.

@ghost
Copy link

ghost commented Jun 13, 2013

@klickverbot: It's not about typing code, it's about reading code.

With array.lz77.compress it's easier to spot the bug.

@dnadlinger
Copy link
Contributor

It's equally easy to spot a module importing lz78 when it should import lz77.

@ghost
Copy link

ghost commented Jun 13, 2013

Okie.

@mihails-strasuns
Copy link

@WalterBright are you interested in proceeding with this proposal in Phobos queue after this release is out? I will mark it as "On hold" otherwise until you can spend time on it or find another champion.

@ghost
Copy link

ghost commented Feb 15, 2014

If he's gonna work on it he can reopen it, otherwise this is just adding to the black statistic of open pulls.

@ghost ghost closed this Feb 15, 2014
@WalterBright
Copy link
Member Author

How about putting this in std.experimental?

@yebblies
Copy link
Contributor

How about putting this in std.experimental?

How about putting this on code.dlang.org? (instead or as well)

@mihails-strasuns
Copy link

There is no point in adding to std.experimental module you are not ready to actively support. And in that case there is no reason it can't go through review process either.

@complexmath
Copy link
Contributor

I'd personally like to see std.experimental replaced by distribution via Dub. Maintaining experimental code in the main repo doesn't seem terribly appealing.

@wilzbach
Copy link
Contributor

@WalterBright please put it on dub, so your work doesn't get lost!

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.