Initial commit for feedback to support sdl dub packages #392

Closed
wants to merge 2 commits into from

Conversation

marler8997
Contributor

Don't look into merging this pull request yet; I just wanted to get some feedback. Since the Json type is used throughout the code, I thought I would continue using it for now, so the parseSdl function actually returns the Json struct. I think the Json struct should eventually be renamed, though. This seems like a fairly easy way to modify the existing architecture to support multiple formats. What do people think?

@marler8997
Contributor Author

I also need to know what we want the final SDL format to look like. I can either implement the format found here: http://forum.rejectedsoftware.com/groups/rejectedsoftware.dub/thread/2/ or I can implement a "translation" to json format.

@s-ludwig
Member

s-ludwig commented Aug 8, 2014

I think the parser should hook in at the same spot as the current JSON parser, i.e. it should directly populate a PackageInfo struct instead of returning Json, which would then need to be parsed again. Currently this is PackageInfo.parseJson, ConfigurationInfo.parseJson and BuildSettingsTemplate.parseJson, and the simplest way would be to define separate dub.parsers.sdl.parseSDL(ref PackageInfo, SDLNode) style free functions (the JSON ones would then also be converted into free functions and moved into their own dub.parsers.json module).

The only exception AFAIK is the PackageInfo.subPackages field, which has been left as Json for simplicity up to now. I'd say that this needs to be properly replaced by PackageInfo[], but to keep things focused, an SDL parser that ignores sub modules in the first iteration would also be ok.

@marler8997
Contributor Author

Awesome thanks for the guidance, will try to get to it tomorrow.

@s-ludwig
Member

s-ludwig commented Aug 8, 2014

Oh, forgot to mention, the format should be implemented roughly according to https://github.com/D-Programming-Language/dub/wiki/DEP1 - however, that page is a little out of date, so there may be some changes required compared to the current specification.

@marler8997
Contributor Author

Got another question. Should dub support unicode SDL or the ascii variant?

@s-ludwig
Member

s-ludwig commented Aug 8, 2014

Definitely Unicode. All input files should be assumed to be UTF-8 (using std.utf.validate), not sure if sdlang-d does that already.

@marler8997
Contributor Author

I chose to write a new SDL parser because the Abscissa parser uses a lot of memory that isn't necessary. I've got a prototype working, but the SDL language guide doesn't seem to cover everything. I can't figure out if starting curly braces need to appear on the same line as the tag. For example, this is definitely valid SDL:

parent {
    child
}

But what should happen in this case?

parent
{
    child
}

I could either

  1. accept it as valid and treat child as a child tag of the parent tag (not sure if this is valid according to sdl though)
  2. post an error ("in sdl, curly braces must be on the same line as their tag")
  3. handle it as a nameless tag. So this would become 2 tags, one with the name "parent" and one with the name "content".

Does anyone know if SDL specifies what to do in this case? If it doesn't specify, does anyone have an opinion on what to do here?

@marler8997
Contributor Author

I was able to get in contact with the creator of SDL and he has informed me that putting braces on their own line is invalid SDL. I have updated my parser to throw an error message when this occurs. Another surprising thing I found out is that all attributes must occur after all values. I wanted to bring this up to check if everyone is ok with this.

tag "this-is-ok" option="value" 
tag option="value" "this-is-invalid-sdl"

The second line is invalid SDL because the attribute 'option' cannot appear before any SDL values, such as the string literal that follows it. My parser originally supported mixing values and attributes, so I was wondering whether we want to follow what the creator of SDL has told me or allow the mixing of values and attributes. This restriction, by the way, is not stated in the language guide. I had to ask the creator specifically, and he is going to amend the guide with this restriction.
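
For illustration only, the ordering rule can be sketched in D as a check over an already-tokenized tag body. The helpers `isAttribute` and `valuesPrecedeAttributes` are hypothetical names, not taken from any parser discussed here, and the '=' test is a simplification that assumes any value containing '=' is a quoted string.

```d
import std.algorithm.searching : canFind, startsWith;

/// Simplified assumption: an unquoted token containing '=' is an attribute.
bool isAttribute(string token)
{
    return !token.startsWith("\"") && token.canFind('=');
}

/// Returns true when no value appears after the first attribute.
bool valuesPrecedeAttributes(string[] tokens)
{
    bool seenAttribute = false;
    foreach (token; tokens)
    {
        if (isAttribute(token))
            seenAttribute = true;
        else if (seenAttribute)
            return false; // a value after an attribute breaks the rule
    }
    return true;
}

unittest
{
    // The two tag bodies from the example above (tag name already removed).
    assert(valuesPrecedeAttributes([`"this-is-ok"`, `option="value"`]));
    assert(!valuesPrecedeAttributes([`option="value"`, `"this-is-invalid-sdl"`]));
}
```

A parser that wants to allow mixing behind an option flag would simply skip this check when the flag is set.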

@s-ludwig
Member

I'd say let's stay compatible with his implementation, so that we don't hamper interoperability with other implementations.

BTW, regarding the choice to write a new SDL parser, I think the memory use of the parser in the case of DUB is probably not a practical concern, but it was something that I stumbled over, too. However, if we go for a different implementation, IMO it should still be a proper separately maintained library, instead of something that is only used internally by DUB. Do you think it would be possible to maybe just enhance sdlang-d instead, or are the API changes too fundamental? (ping @Abscissa)

@Abscissa
Contributor

I chose to write a new SDL parser because the Abscissa parser uses a lot of memory that isn't necessary.

Please file an issue for any such problems you find: https://github.com/Abscissa/SDLang-D/issues Any specifics you have in mind would be especially helpful.

@marler8997
Contributor Author

There aren't really any issues with Abscissa's parser, it's just a different design than I would choose. It parses the sdl into a tree of objects that can be used to access the data. This makes it easy to use but it's using memory it doesn't need to.

If we used it in Dub, the SDL tree data structures would only be used temporarily to populate the PackageInfo, ConfigurationInfo and BuildSettingsTemplate structs. This is a waste of memory because the parser could just populate the Dub structs directly and never need to create its own tree. It is true that this waste of memory is not a huge issue; however, I plan on using SDL in other projects of mine and I will be using a parser that does not use memory in this way. So I figured I might as well write it now and see if it would be good for Dub.

After I finish my parser I will point you to it so you can decide which one you would rather use. I anticipate this parser will use less memory and perform faster (plus you don't have to perform the extra step of converting the tree structure to the DUB structs). My parser is designed around one very long and complicated function:

bool parseSdlTag(Tag* tag, ref const(char)* next, const char* limit) {
    // around 400 lines of code
}   

This function iterates over the given string pointed to by the "next" variable (using UTF-8) and will populate the supplied tag structure. It also sets the "next" variable to point at the next character after the tag so it can just be called over and over until the sdl has fully been parsed. It will throw an SdlParseException on any errors and returns false when the sdl is finished. There's also a wrapper function that accepts a character array instead of 2 pointers. I'm also experimenting with different wrappers to walk the sdl.

Normally I would never write such a monstrous function, but since this code could potentially be re-used in many places, it's worth taking the extra time to write it with performance and low memory consumption in mind. That is why I'm spending a lot of time writing comprehensive unit tests for it.

I had a couple ideas on what kinds of structures/functions to create to iterate through the tags. I had one idea that was similar to the D getopt function:

getsdl(sdl,
      "name", packageInfo.name,
      "authors", packageInfo.authors,
      "dependencies", packageInfo.dependencies);

I'm not sure if this is the best idea though, so I'm putting it on the back burner for now to look into later. Another idea I had is to create a structure that allows the user to walk the SDL, which I call SdlWalker. If I used this approach, here is an example of what the code to parse a Dub package would look like:

void parsePackageInfo(string sdlText) {
    Tag tag;
    auto sdl = SdlWalker(&tag, sdlText);
    for(;!sdl.empty; sdl.popFront()) { // I don't use foreach here because SdlWalker just populates
                                       // the local tag structure so the front() function doesn't make sense
        if(tag.name == "name") {

            if(this.name != null) tag.throwIsDuplicate();
            tag.enforceNoAttributes();
            tag.enforceNoChildren();
            tag.enforceOneValue(this.name);

        } else if(tag.name == "authors") {

            if(this.authors !is null) tag.throwIsDuplicate();
            tag.enforceNoAttributes();
            tag.enforceNoChildren();
            tag.enforceValues(this.authors);

        } else if(tag.name == "dependency") {

            tag.enforceNoValues();
            tag.enforceNoAttributes();
            auto children = tag.children;
            for(;!children.empty; children.popFront) {

                if(tag.name == "content") {
                    //...
                }

            }           

        } else {      
            tag.throwIsUnknown();     
        }
    }

}

@Abscissa
Contributor

There aren't really any issues with Abscissa's parser, it's just a different design than I would choose. It parses the sdl into a tree of objects that can be used to access the data. This makes it easy to use but it's using memory it doesn't need to.

That is true. I have intended to make sax- or pull-style parsing a more officially supported option, just haven't done so yet.

FWIW though, dub package files are going to be vastly too small for an in-memory tree to really be an actual issue. (Unless I screwed something up and the tree is using much more memory than it should be using. If anyone does find anything like that, then please file a ticket and I'll take a look.)

@s-ludwig
Member

Just some random thoughts ahead ;)

One thing that would be worth incorporating is support for generic input ranges, with a special case implementation for string for maximum performance. That would enable direct allocation-less parsing of SDL from various sources, such as files or network connections (or from an ANSI->UTF-8 decoder range, an unzip range etc.). One thing that I would avoid in D is the use of pointers for iteration. The safe alternative is to use a slice (string in this case) for the next variable, which spans the input from the current position to the end (especially if limit is a pointer to the end, using a single slice next[0 .. limit-next] instead would be a perfect fit here).

Regarding the parser API, I'd probably offer two or three alternatives: a StAX style API (basically SdlWalker), possibly a SAX style API, and a DOM style API, layered on top of the StAX one, for the cases where memory requirements are less important than convenient access. At least those are by far the most popular API styles for XML parsers, so people will definitely be looking for them.

For the StAX style API, it should definitely provide a proper front property, so that it can be used not only with foreach, but also with std.algorithm and other functions that work on input ranges. front could just return a reference to the internal Tag structure and specify that the contents get invalidated by the next call to popFront. There are already a number of ranges that work like this for efficiency reasons.
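
A minimal sketch of such a range, assuming a toy line-based format rather than real SDL: `LineTag` and `LineWalker` are hypothetical names, and `front` hands out a reference to a single reused struct whose contents are only valid until the next `popFront`.

```d
import std.algorithm.searching : findSplit;
import std.range.primitives : isInputRange;
import std.string : strip;

/// Hypothetical stand-in for a parsed tag: first word plus the rest of the line.
struct LineTag
{
    string name;
    string rest;
}

/// Input range that reuses one internal LineTag for every element.
struct LineWalker
{
    private string remaining;
    private LineTag current;
    private bool done;

    this(string text)
    {
        remaining = text;
        popFront(); // load the first tag, or mark the range as empty
    }

    @property bool empty() const { return done; }

    /// Reference to the internal tag; invalidated by the next popFront.
    @property ref const(LineTag) front() const { return current; }

    void popFront()
    {
        while (remaining.length)
        {
            auto parts = remaining.findSplit("\n");
            remaining = parts[2];
            auto line = parts[0].strip();
            if (line.length == 0)
                continue; // skip blank lines
            auto fields = line.findSplit(" ");
            current.name = fields[0];
            current.rest = fields[2].strip();
            return;
        }
        done = true;
    }
}

static assert(isInputRange!LineWalker);

unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    auto names = LineWalker("name \"dub\"\n\nauthors \"a\" \"b\"\n").map!(t => t.name);
    assert(names.equal(["name", "authors"]));
}
```

Because the element is reused, callers that need to keep a tag beyond the current iteration have to copy the fields they care about, which is what the `map` call in the unit test does.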

The part about having to use the right order for parsing values, attributes and children should ideally be encoded in the API, so that it is impossible to accidentally do something wrong by using the wrong order or by forgetting to handle a certain kind of data. One obvious and not particularly beautiful solution would be to provide a single function Tag.parseContents(scope void delegate(ValueRange) handle_values, scope void delegate(AttributeRange) handle_attributes, scope void delegate(RangeTag) handle_children). But maybe there is something nicer?

The getopt API seems to be a bit limiting in what it can potentially do, but something similar to a generic serialization API using reflection could also be interesting:

struct DependencyTag {
   string value; // a single value is required
   bool optional = false; // optional "optional" attribute
   string path = null; // optional "path" attribute
   string version_ = null; // optional "version" attribute
   // alternative declaration:
   //Nullable!string version_;
}
struct DubPackageSDL {
    string name; // treated as a single "name" tag with exactly one value
    string[] authors; // "authors" tag with any number of values
    DependencyTag[] dependency; // complex "dependency" tag
}
auto pack = input.parseTags!DubPackageSDL();

But I'm not sure if it's worth it just for this use case.

@marler8997
Contributor Author

I like the feedback. There were a lot of thoughts so I'll try going through them one by one.

  1. Support for generic input ranges

    I plan on writing a wrapper struct to support this instead of allowing the parseSdlTag function itself to accept generic input ranges. My initial design was to use generic input ranges in parseSdlTag directly but to understand why I changed it let me say exactly what the function does.

    The parseSdlTag function iterates through a string containing at least 1 SDL tag and saves "slices" of its name/values/attributes to the Tag struct.

    With this assumption the parseSdlTag doesn't need to allocate any memory (except for the lists of values/attributes in the tag struct, which uses the std.array Appender struct right now)! If it supported generic input ranges then it would need to allocate memory for every name/value/attribute because it can't guarantee that the input range would preserve the memory. However, with this design the caller could just read the entire sdl file into memory and keep it there, or copy the sdl values to a more compact data structure and throw away the sdl text afterwards.

    The other reason is that I didn't want to make the parseSdlTag function itself a template, because it's over 400 lines of code. Making a template wrapper function/struct makes a lot more sense to prevent accidental code explosion.

    That being said I still plan on writing a struct to support generic input buffers by buffering the input tag by tag (in many cases this is line by line) and then have it call parseSdlTag.

  2. Slices instead of pointers

    You are correct that using slices is safer than pointers since the compiler can insert runtime checks whenever the slice is accessed. For this function though I would be performing a lot of these operations:

    next = next[1..$];
    

    This results in two operations, one on the pointer and one on the length. Instead I just save a pointer to the end of the array up front and make sure to check that I'm not past the end before every access to the array. In most cases I wouldn't care about the extra operation, but I'm paying extra attention to performance in this library. I did realize, however, that I wasn't gaining anything by having the caller pass the pointers in themselves. Instead I've changed the function to accept a slice; I save the pointers to local variables and make sure to update the slice before I return. Here's the new function signature:

    bool parseSdlTag(Tag* tag, char[]* sdlText)    
    /*
    note: I'm using char[] instead of const(char)[] because
          it may choose to modify the sdl text like when it needs
          to escape strings.  However you can tell the sdl parser to
          not modify the sdl text by setting a flag in the Tag struct.
    */
    
  3. StAX, SAX, DOM

    It would be trivial to add a SAX wrapper and I was also already thinking about adding a DOM API as well like Abscissa's parser. I'm developing StAX first because I know I can write SAX and DOM around StAX after the fact.

  4. A proper front property

    I was debating what to do about this. The front property is necessary for use with component programming, but I don't like using a foreach loop that returns a pointer to a structure that you already have direct access to. This looks like one of those cases where a language deficiency is forcing a non-optimal solution for the sake of "pretty" code. It would be nice if D foreach could support iterating through input ranges without calling the front property:

    Tag tag;
    SdlWalker sdl = SdlWalker(tag);
    foreach(sdl) {
      // you can now use tag
    }
    

    I've posted something to dforums here to see what people think.

  5. API

    You are having the same thoughts that I've been having about the API. I also thought that the getopt API might be limiting and I was looking into using reflection. If D supported custom attributes I would be more keen to use reflection. I'll let you know what I come up with.
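
The pattern from point 2 (accept a slice, iterate with local pointers, update the slice before returning) can be sketched as follows. This is a toy whitespace tokenizer, not the actual parseSdlTag; `nextToken` is a hypothetical name used only for illustration.

```d
/// Reads one space/tab-delimited token and advances `text` past it.
/// Returns an empty string when no token is left.
string nextToken(ref string text)
{
    immutable(char)* next = text.ptr;
    immutable(char)* limit = next + text.length; // one past the last char

    // Skip leading whitespace, checking the limit before every access.
    while (next < limit && (*next == ' ' || *next == '\t'))
        next++;
    immutable(char)* start = next;

    // Consume the token itself.
    while (next < limit && *next != ' ' && *next != '\t')
        next++;

    // Convert the pointers back to indices so the results are plain slices.
    immutable size_t begin = start - text.ptr;
    immutable size_t end = next - text.ptr;
    string token = text[begin .. end];
    text = text[end .. $]; // update the caller's slice before returning
    return token;
}

unittest
{
    string text = "  tag  value";
    assert(nextToken(text) == "tag");
    assert(nextToken(text) == "value");
    assert(text.length == 0);
}
```

The caller only ever sees slices, so the pointer arithmetic stays confined to one function, at the cost of that function not being @safe.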

@marler8997
Contributor Author

My SDL parser is doing very well. My unit tests are nearing completion and it performs lightning fast (15 times faster than Abscissa's right now; no offense to Abscissa, his parser looks pretty and has its own tree structure). I have been corresponding with the creator of SDL, Daniel, to make sure my parser handles all the corner cases and I've come up with a list of concerns:

1. You can't put a tag's open brace on the next line...one format to rule them all!

nice-tag {
    // children....
}
rebellious-tag-who-should-be-locked-up-for-its-crimes
{
    // children...probably delinquents
}
// Keep in mind that this is not valid SDL and will result in an
// error saying something like:
//     "SyntaxError on line 5: braces don't go on their own line you dummy!"

Why it's a concern: Results in the parser rejecting SDL that it could otherwise accept

I can only think of two reasons why someone would make this restriction

  1. They think it makes the parser simpler
  2. They want all SDL files to have the same brace look to them

I'm assuming it's not reason number 1 because it would be a shame to restrict SDL just because the developer couldn't write a good parser. As for reason 2, this is just my opinion but I don't think a parser should reject SDL just because it doesn't look quite right. Allowing this would add no ambiguity to the grammar.

Note: My parser allows this via an option flag (it took 3 lines of code to support this as a dynamic option).

2. Strings must always have "quotes"

color "red" // Oh ok, you want "red"
color red   // I don't understand what you want! What is red?!!!

Why it's a concern: Results in the parser rejecting SDL that it could otherwise accept

I really don't want this to be JSON where everything that isn't a number has to have quotes around it. Allowing strings to omit their surrounding quotes (when they don't contain whitespace) could make SDL look cleaner and the grammar would still be unambiguous.

3. All tag values must appear before all tag attributes...but why?

tag "a-value" attr="an-attribute" # Valid SDL
tag attr="an-attribute" "a-value" # Invalid SDL...no soup for you!

Why it's a concern: Results in the parser rejecting SDL that it could otherwise accept

This restriction seems very odd to me. I'm sure there are cases where putting one or more attributes before a value or list of values makes sense. Like the odd open brace restriction, it seems funny to me to throw an error message telling the user to change the SDL when the parser can already parse the data with no issues.

Note: My parser allows this via an option flag (again, added 3 lines of code)

4. Numerical Literals with type information...but why?

SDL supports adding various postfix characters to number literals that specify the number's underlying type.

a-float       23.4f
another-float 3.14F
a-long 88499382L
--- you get the picture

Why it's a concern: Would result in SDL files being littered with numbers that have extra characters adding no useful information. Let me explain why I think in almost all cases this type information is useless.

In most cases an SDL number is eventually going to be handled by code that's going to store it in a numerical type that can't be changed at runtime.

draw {
    point   1L   3L
    point 4.5f   8f
    point  -6    10
}
// even though some of these points have integral types and some 
// have floating point types, the application code is probably
// going to convert all of them to some common floating point type
// whether or not SDL says they are floats.

Now in some cases (I can't think of any off the top of my head) it may make sense for the application code to support storing the number in some dynamic type that could be changed at runtime. Even then this information is still useless because that type information could have been determined from either the number's value or, more importantly, the context in which it appears.

So why does SDL support this type information? Any SDL parsing library will have no knowledge of a number's context, so the parser must be able to store the number before it knows what type it should be stored in. The SDL parser could store the literal string and postpone converting it to a numerical type, however, SDL was originally written in Java. In Java (and C#) you can't simply save a pointer to a string, you need to allocate memory for every substring you want to save inside that string. My guess is Daniel wanted the Java/C# parser to store the SDL number in an integer type instead of allocating a new string for every number literal. Then he added the ability for the user to specify the number type so the parser could select the right type if the user was so inclined. However, for a native language it's easy to save a pointer to an integer literal (or a "slice" in D) and postpone the conversion to when the user code queries for it. That way the code that knows the context is also the code that determines what integer type it requires.

Also keep in mind it makes sense for a compiler to allow a user to specify the storage type of a literal because that number gets handled directly by the processor via registers/physical memory/etc. However, in almost all cases this information is useless in an SDL file.

Note: There may be very specific cases where specifying the type of a number would be helpful. Like maybe if you were using SDL to auto-generate source code. However, for these project specific cases you could just use a string instead of a number. Allowing these postfix characters would litter SDL files with unnecessary characters that in almost all cases would add no useful information.
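
A minimal sketch of the "store the slice, convert on query" idea, assuming suffix-free literals: `NumberSlice` is a hypothetical name, and the conversion relies on `std.conv.to`, which throws a `ConvException` if the text does not fit the requested type.

```d
import std.conv : to;

/// Holds a slice of the original SDL text instead of a parsed number.
struct NumberSlice
{
    string text; // e.g. "10" or "4.5", pointing into the SDL input

    /// The caller, which knows the context, picks the numeric type.
    T get(T)()
    {
        return text.to!T;
    }
}

unittest
{
    auto x = NumberSlice("10");
    assert(x.get!int == 10);       // integer context
    assert(x.get!double == 10.0);  // floating point context, same literal
    assert(NumberSlice("-6").get!long == -6);
}
```

With this approach the same literal `10` can serve as an int in one tag and as a double in another without any suffix in the SDL file.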

5. Empty tags default to "content"? ...BUT...WHY?

Ok this one's really not a big deal, but I figured I might as well list everything. When a user omits the tag name the parser is supposed to fill in the name with the string "content". This is no problem for the parser but seems odd. Why not just leave the name as an empty/null string?

6. Keywords can't be used as IDs.

switch on=true # No! Didn't you know "on" is an SDL keyword!!!

Why it's a concern: Results in the parser rejecting SDL that it could otherwise accept

Allowing tag names and attribute names to be keywords (on/off/true/false/null) would not make the grammar ambiguous, so why not just support it?

Note: My parser just supports this. Maybe I'll make it optional as well but I think we should just change SDL to support this.

Ok done ranting

My parser already supports most of these features either inherently or via option flags. I would like to get people's opinions on these things, especially the numerical literals issue. I'm going to send Daniel a link to this and maybe he can address and correct any misunderstandings I have. Thanks.

@etcimon
Contributor

etcimon commented Aug 13, 2014

Looks great, I was curious to see the code behind this but couldn't find it. When will it be available?

@marler8997
Contributor Author

I'll have it done in a week or so.

@etcimon
Contributor

etcimon commented Aug 13, 2014

I'll have it done in a week or so.

Always appreciated, I was hoping for a comment-able open source configuration format that supports nesting and reading it with a deserialization type feature like this: config.get!MyStruct("some/path"). What license did you choose for it?

@marler8997
Contributor Author

Ah yes...using compile time reflection to parse the sdl. I am looking into the right way to support that. It would be easy and straightforward if D supported custom attributes:

struct PackageInfo {
    string name;

    @sdl(ElementName="author")
    string[] authors;

    struct Dependency {
        // dependency members...
    }
    @sdl(ElementName="dependency")
    Dependency[] dependencies;

    @sdl(ElementName="subPackage")
    PackageInfo[] subPackages;
}

But last time I checked D didn't support this. As for the license I just used "public domain". I don't know much about licenses. Is "public domain" ok you think? It wouldn't bother me if anyone else used the code or modified their own version of it.

@etcimon
Contributor

etcimon commented Aug 13, 2014

It would be easy and straightforward if D supported custom attributes

That's supported, although there's no support for assignment operators in the attribute, but you should look at the deserialization library in vibe.d to see how this can be done.

https://github.com/rejectedsoftware/vibe.d/blob/master/examples/serialization/source/app.d

@s-ludwig
Member

(will comment on the other things later, just wanted to chime in regarding serialization)

I've just written a little serialization module for the vibe.d serialization framework that builds upon sdlang-d, but it would of course be easy to change to any other parser. I'll share that once some remaining corner cases are ironed out.

@marler8997
Contributor Author

Oh perfect I've been waiting for the custom attributes feature!!! I've been googling it every once in a while but just kept finding the forum posts asking for it to be supported. Thanks for posting this I'll incorporate it into my parser today.

@Abscissa
Contributor

@marler8997, regarding your list of concerns with SDL: I agree there are some weird things in SDL, and you've identified most of them pretty well (lexing dates/times is also pretty hairy). But an important thing to keep in mind is that SDL is what it is, and if you're talking about lots of big changes to it then that really needs to be treated as what it is: a separate language derived from SDL. Otherwise, you're inviting the same horrible mess that was 90's HTML.

It's been a while, so my memory of most of the details and corner cases has faded, but when writing my lexer/parser, I was surprised just how many of the weird things turned out to have some reason (or so I thought at the time?). So any loosening of restrictions, even if done as a separate language, needs to come with a very close look for any corner cases it may affect or introduce.

  5. Empty tags default to "content"? ...BUT...WHY?

Yea, this one, I went ahead and considered to be an aspect of the SDL library's API, rather than part of the language itself. So mine defaults to empty string, too.

@dleuck

dleuck commented Aug 13, 2014

@marler8997 Thank you for all this feedback and for working on a D implementation of SDL. Per our email discussion, I wish I had published a formal grammar when I first open sourced the project to assist those working on other implementations. The language guide, tests and first parsers (Java, C# and Ruby) ended up serving as the spec.
SDL's design goals include supporting a rich set of data types and structures while being terse, simple, easy to read at a glance and comfortable for people familiar with common programming languages. With those goals in mind, I'll try to provide satisfying responses to your questions.

1. You can't put a tag's open brace on the next line...one format to rule them all!

As you correctly assumed, the reason is consistency and readability. If you are used to looking at one curly brace treatment, encountering the other can be jarring. That being said, I've had enough requests for this that it may end up in SDL2.

2. Strings must always have "quotes"

I agree. This is planned for SDL2. Among other reasons, it will help make command line input of SDL tags easier.

3. All tag values must appear before all tag attributes...but why?

This was a design decision made for legibility. Mixing values and attributes makes long tag definitions harder to read. This restriction is consistent with method argument list rules for most languages that support both ordered and named arguments to methods.

4. Numerical Literals with type information...but why?

For several reasons. A common use of SDL is to model method invocations in both dynamic and source generation scenarios. In languages like C++, Java and C# the type information is necessary to determine which version of an overloaded method is selected. Additionally, and more fundamentally, the semantics of an integer, floating point and decimal type differ.

5. Empty tags default to "content"? ...BUT...WHY?

This was an early decision that, as you said, could be handled a few different ways. We wanted to avoid the ambiguities of empty versus null strings between languages, so we went with "content" as a default tag name. I wouldn't suggest changing this behavior if you want to ensure interop with other SDL systems.

6. Keywords can't be used as IDs.

There are cases where this creates ambiguity in the grammar. For example, would a single line true be a tag with the name true or an anonymous tag ("content") with the value true? We could potentially support some cases where it's not ambiguous and allow keywords to be used as IDs anywhere with a character prefix. I'll incorporate something along these lines for SDL2.

@marler8997
Contributor Author

Braces after newline...That being said, I've had enough requests for this that it may end up in SDL2.

Cool I'll just keep it as an option for my parser right now. Then if you decide it will be in SDL2 then I'll take it out as an option and just support it by default.

Non-quoted strings...This is planned for SDL2.

Awesome, I will support non-quoted strings then! Also note that this will cause the same ambiguity you mention later with keywords and ids. But there's only one place that the ambiguity would occur and that's in the tag name (You won't have this ambiguity with attributes because they are always followed with the '=' character). You would have two choices on how to handle it. Assume <first-token> to be the first token of a tag after the optional namespace.

if <first-token> is a valid id, then it's the tag name
else <first-token> is a literal

true # tag-name = "true"

Note that with this priority, in order to have an empty
tag-name with a single value of true (or any keyword) you would need to do this

content true

Or you could go the other way:

if <first-token> is a keyword then tag-name="content" and <first-token> is the first value
else if <first-token> is a valid id, then it's the tag name
else <first-token> is a literal

true # tag-name = "content", first-value = true

I probably wouldn't have a strong preference but I think if someone wrote sdl like this

matrix {
    true true false false
    false true false true
}

They obviously don't mean to make the tag-name true/false so I would probably prefer the second case.
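The two priorities described above can be compared with a rough sketch. This is not code from any parser discussed in this thread; the keyword set, the identifier pattern, and the function names are simplified assumptions made only for illustration.

```python
import re

# Assumption: a simplified keyword set and identifier pattern,
# not the full SDL grammar.
KEYWORDS = {"true", "false", "on", "off", "null"}
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.$-]*")


def is_valid_id(token):
    """Return True if the whole token looks like an identifier."""
    return ID_PATTERN.fullmatch(token) is not None


def resolve_id_first(tokens):
    """First rule: any valid id in first position is the tag name."""
    if tokens and is_valid_id(tokens[0]):
        return tokens[0], tokens[1:]
    return "content", tokens


def resolve_keyword_first(tokens):
    """Second rule: a leading keyword is a value of an anonymous tag."""
    if tokens and tokens[0] in KEYWORDS:
        return "content", tokens
    return resolve_id_first(tokens)


row = ["true", "true", "false", "false"]
print(resolve_id_first(row))       # tag name "true", three values
print(resolve_keyword_first(row))  # tag name "content", four values
```

With the matrix rows, only the second rule keeps all four booleans as values, which matches the preference stated above.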

Mixed Values and Attributes...This restriction is consistent with method argument list rules for most languages that support both ordered and named arguments to methods.

Yes this restriction did remind me of named function arguments. I'm not sure if I agree that making a restriction to help programmers make their code look nice is very helpful. If having the values after the attributes doesn't look good I would think that the developer would move their values to the front anyway. Why make a restriction for this when the opposite may look better in some cases? What if I had a set of list tags with one optional attribute? I might want to put the attribute at the front, like this:

fruits category=food   apple banana pear kiwi
dogs   category=animal pug poodle chihuahua "golden retriever"

I have a few more comments but will wait for a response from these ones first.

@dleuck

dleuck commented Aug 13, 2014

Cool I'll just keep it as an option for my parser right now.
Then if you decide it will be in SDL2 then I'll take it out as
an option and just support it by default.

That makes sense. It's likely to be supported in SDL2.

re: non-quoted strings
For the first token anything that is not a literal will be interpreted as the tag name. I believe that is the best way to follow the PLA (principle of least astonishment). If people really want to use a keyword as a tag name, which we can assume will be relatively rare based on the fact it hasn't previously been requested, we will allow it with some character as a prefix (e.g. ~true for a tag name "true"). This is a convention used by other languages. I'll propose something in the next language spec. We have to be careful not to do something that conflicts with other literal types that might be introduced.

Yes this restriction did remind me of named function arguments. I'm not sure if I agree
that making a restriction to help programmers make their code look nice is very helpful.

In language design you make calls for these purposes (i.e. consistency and readability) frequently. Because of the precedent with argument lists in many languages, and because it would make the APIs confusing (what does tag.getValue(index) mean if attributes are interspersed?) we aren't likely to change this one.
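The API concern can be made concrete with a small sketch. It assumes a hypothetical Tag class and parse_tag helper (neither is taken from an existing SDL library) and pre-split tokens. Values and attributes live in separate containers, so get_value(index) only ever counts values, and a value that follows an attribute is rejected.

```python
import re
from dataclasses import dataclass, field

# Assumption: an attribute token looks like name=value, with a simplified
# identifier pattern for the name.
ATTRIBUTE = re.compile(r"([A-Za-z_][A-Za-z0-9_.$-]*)=(.*)", re.S)


@dataclass
class Tag:
    name: str
    values: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

    def get_value(self, index):
        """Index into the ordered values only; attributes never count."""
        return self.values[index]


def parse_tag(name, tokens):
    """Build a Tag, enforcing 'values first, then attributes'."""
    tag = Tag(name)
    for token in tokens:
        match = ATTRIBUTE.fullmatch(token)
        if match:
            tag.attributes[match.group(1)] = match.group(2)
        elif tag.attributes:
            raise ValueError(f"value {token!r} appears after an attribute")
        else:
            tag.values.append(token)
    return tag


fruits = parse_tag("fruits", ["apple", "banana", "category=food"])
print(fruits.get_value(1), fruits.attributes)
```

If interspersed ordering were allowed, a caller would have to decide whether index 1 refers to the second token or the second value, which is the confusion mentioned above.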

@s-ludwig
Member

Wow this thread escalated quickly ;)

Sorry in advance for repeating some of the points, I just want to stress what is important in my eyes.

SDL syntax comments

I have to admit that I don't like some of the suggestions for format changes. To me, simplicity was one of the selling points and most of these additions either introduce ambiguity (harder to understand for the casual reader and harder to implement for the library writer), or just increase the size of the language for a very small gain.

  • putting braces on the next line: This is ambiguous with defining an unnamed tag with child nodes. Introducing such a syntax makes it more difficult to understand and blocks the way for ever allowing such pure "branch nodes".
  • strings without quotes: This one has some merit, but again introduces an ambiguity which makes understanding the format harder (first identifier tag name or value?), at least with the current nomenclature. Of course, the whole tag name concept could be dropped, so that a tag only ever contains values and attributes and interpreting the first value as the tag name is up to the user of the format.
  • keywords as identifiers: Breaking this rule complicates the parser design, at least with the typical lexer/parser separation and again makes it harder to understand as a reader what actually happens. An escape syntax like ~true would be fine though - or dropping the tag name concept and requiring "true" instead.

All in all I think that a simple and unambiguous language has a much greater value than little syntax gimmicks. It was one of the main reasons why I prefer SDL over YAML.

TL;DR Please avoid introducing ambiguities to the language! ...and feature creep, too ;)

Serialization

I've put my vibe.data.serialization based code in a gist. It automatically supports all of the customization attributes of the serialization framework. What I'd still like to do is to implement some more modes for how the data is laid out. Currently, for example, it doesn't use attributes, but for simple types they would be a natural fit.

Library implementation

  • input range support: Using a tag based buffer probably is okay for most uses, but there are also strange things like the XMPP protocol, which use nested XML documents for streaming data. There it would be important to be able to parse at least sub tags as data comes in instead of waiting for the full top level tag. With some clever use of templates it should be possible to push the code that needs to decide between slicing and making a heap allocated copy to the lowest level, so that the big parser body (please refactor into multiple functions ;)) can be reused for both cases.
  • using slices instead of pointers: I would perform a benchmark of both approaches to see if the compiler doesn't actually optimize away the additional operations. An additional important issue with pointers is that you wouldn't be able to mark the library @safe. But especially for such critical user facing code it would be nice to get some assistance from the compiler to avoid unsafe memory accesses, and it would allow the library to be used conveniently from other @safe code without nasty @trusted wrappers (of course the SDL parser could just be marked @trusted, but just trusting a 400 line pointer juggling function to not contain any memory access bugs may not be the best idea).
  • front property: I have to admit that I found the example code that you posted quite confusing in this regard. I had to read the code a few times until I noticed that the &tag pointer was passed to the SDL parser. While it is also not ideal, returning a ref const(Tag) from front which points to an internally buffered Tag provides a much more "logical" (intuitive) API. Of course the user might try to extract the pointer from this reference and assume that the referenced value stays the same after calling popFront, but it's quite unlikely that someone would actually do that (much more likely that someone just copies the whole value) and also the same is true for the current approach.
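The buffered front idea can be sketched in Python as a stand-in for a D input range. The TagRange and Tag names are hypothetical, and the "parsing" is just whitespace splitting. The sketch only shows the trade-off described in the last bullet: front hands out the same internal buffer every time, so a reference kept across pop_front sees the next tag, while a copy does not.

```python
import copy


class Tag:
    def __init__(self):
        self.name = ""
        self.values = []


class TagRange:
    """Range-like reader that reuses one internal Tag as its buffer."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._buffer = Tag()
        self.empty = False
        self.pop_front()  # prime the buffer with the first tag

    @property
    def front(self):
        # Always the same object; its contents change on pop_front.
        return self._buffer

    def pop_front(self):
        for line in self._lines:
            parts = line.split()
            if parts:  # skip blank lines
                self._buffer.name = parts[0]
                self._buffer.values = parts[1:]
                return
        self.empty = True


tags = TagRange(["name demo", "license MIT"])
kept = tags.front                     # reference to the shared buffer
snapshot = copy.deepcopy(tags.front)  # independent copy
tags.pop_front()
print(kept.name, snapshot.name)
```

A caller who wants to keep a tag past the next pop_front has to copy it, which is the usage the bullet above expects to be the common one.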

@dleuck

dleuck commented Aug 14, 2014

@s-ludwig All good points. Simplicity was and is a key design goal.

re: curly braces on the next line - This is ambiguous with defining an unnamed tag with child nodes.

This is why I need more sleep :-) You are right. This suggestion would create an ambiguity in the grammar exactly as you describe.

re: strings without quotes

The non-quoted strings feature could be made simpler by requiring they only be used with named tags (i.e. tags that are not anonymous).

Of course, the whole tag name concept could be dropped

We couldn't do this because it wouldn't be backwards compatible, and would break just about everything using SDL.

re: keywords as identifiers

I agree that it's probably best to keep it simple and require the escape syntax in all cases where a keyword is being used as an identifier.

@marler8997
Contributor Author

You've almost convinced me but I have a few more questions first.

In response to the first part I'll say that it appears there are many variations of JSON that handle the nesting issue.

For the second part, I think I understand what you are saying. It appears that the reason I like the JSON variations is the reason you dislike them. JSON (and its variants) have 2 data structures, objects and lists. There is a one-to-one mapping of JSON objects to any of its variations. That appeals to me because "thinking" of the build spec as an object itself is nice because that's what it ends up becoming after it's parsed anyways. If you look at SDL in the same way, it also supports objects and lists but adds values/attributes to objects. In the end an SDL attribute can be seen as just another field that's written before the curly-brace section, but it "looks different" to the developer. And an object value is equivalent to an ASON nameless field.

JSON: my-object { "name" : "my-name", "key" : "value", "key2" : "value" }
SDL:  my-object "my-name" key="value" {
         key2 "value"
      }
ASON: my-object "my-name" {key:"value"
         key2 "value"
      }

Note: I'm just using ASON as an example, but I think that there may be other established JSON variations that support a feature like this.

So if I am understanding correctly, the biggest reason you prefer SDL over JSON variations is because of attributes...is that correct?

P.S. The platform specification suffixes would have been handled nicely by nameless fields in ASON (just saying):

libs linux ["dl", "glfw"]
libs osx ["glfw3"]

@s-ludwig
Member

It appears that the reason I like the JSON variations is the reason you dislike them. JSON (and its variants) have 2 data structures, objects and lists. There is a one-to-one mapping of JSON objects to any of its variations.

I actually like that property of JSON, too, but only when used as a pure data interchange format (e.g. as a serialization format). For humans, the additional nesting that this causes is just additional cognitive load, and attributes are an intuitive concept to give the data more structure (BTW, I'm not just talking about syntactical nesting, although the syntax can help somewhat).

So if I am understanding correctly, the biggest reason you prefer SDL over JSON variations is because of attributes...is that correct?

Attributes, but maybe more importantly, tags! Because you can repeat the same tag multiple times instead of defining a nested object (or array of objects). For a JSON type format that would just mean that the field gets overwritten. A tag list based format just seems like a more natural fit for what needs to be expressed in package description files. They perfectly fit things like "dependencies", "configurations", "subPackages", "buildTypes" and so on.
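The overwriting behaviour is easy to observe with Python's standard json module, used here only as one example of a typical JSON parser. With the default settings the last duplicate key wins, while asking for the raw key/value pairs shows what a tag list would have preserved.

```python
import json

text = '{"dependency": "vibe-d", "dependency": "sdlang"}'

# Default decoding builds a dict, so the earlier entry is overwritten.
as_object = json.loads(text)

# object_pairs_hook receives the pairs in document order, which is
# closer to how a list of repeated tags behaves.
as_tag_list = json.loads(text, object_pairs_hook=list)

print(as_object)
print(as_tag_list)
```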

The platform specification suffixes would have been handled nicely by nameless fields in ASON

Yes, I don't want to say that I "hate" the idea of nameless fields (because it depends on the context and this part of ASON does have its merits), but I really dislike them here because they break the generic interoperability (without an additional schema specification, which complicates things) and because they are optional. They do nicely solve the problem of JSON, where you have to check at runtime if a certain value is a say string or an object with more detailed fields inside, but this is also something that I already dislike about JSON in the first place that you have to resort to such things for the sake of convenience.

Or let's put it this way, if you would have asked me some years earlier about ASON (without all the optional punctuation and more limited quote-less strings, that is ;)), I would have liked it a lot as a generic basis for defining application specific data DSLs. However, it was a real pleasure for me to discover SDL back then (being slightly skeptical about date/time formats), because it had such a similar syntax to the "optimal" DSLs that I was using all the time, but as an added bonus also was fully generic, so this is something that I don't want to miss. (Note: my DSLs never had attributes, but they did have tags/commands)

It's unfortunate that SDL made some overly liberal decisions with date/time formats (which IMO is not a fatal flaw, though). But there is one idea that could also be interesting. What about defining a simpler SDL-lite instead of SDL 2? It would be a strict subset (since we are always talking about supersets, which are a lot less useful for interoperability) of SDL, so it couldn't solve all syntax issues, but it could remove certain features to become a very clean format with a simple parser logic, while still being valid SDL.

@marler8997
Contributor Author

Alright, thanks for "bearing" with me to explain these things. I'm still not understanding a couple of your points, but I think if you can help me understand these last few points and we can be on the same page it will help me with the code design and also allow both of us to be on the same page when explaining to others why we chose SDL.

To summarize, my issues are:

  1. Aren't ASON singular names the same as duplicated SDL tag names?
  2. Aren't nameless ASON fields the same as SDL tag values?

I actually like that property of JSON, too, but only when used as a pure data interchange format (e.g. as a serialization format). For humans, the additional nesting that this causes is just additional cognitive load, and attributes are an intuitive concept to give the data more structure (BTW, I'm not just talking about syntactical nesting, although the syntax can help somewhat).

Doesn't the SingularName feature of ASON (and other JSON variations) amount to the exact same thing as duplicated SDL tag names?

SDL : dependency "vibe-d" version=">=1.0.0"
      dependency "sdlang" version=">=1.0.0"

ASON: dependency "vibe-d" version=">=1.0.0"
      dependency "sdlang" version=">=1.0.0" // exactly the same right?

So if I am understanding correctly, the biggest reason you prefer SDL over JSON variations is because of attributes...is that correct?

Attributes, but maybe more importantly, tags! Because you can repeat the same tag multiple times instead of defining a nested object (or array of objects). For a JSON type format that would just mean that the field gets overwritten. A tag list based format just seems like a more natural fit for what needs to be expressed in package description files. They perfectly fit things like "dependencies", "configurations", "subPackages", "buildTypes" and so on.

I refer back to my original example. I'm not understanding the difference between a list of tags and a list of SingularName elements. Don't they represent the same thing? Maybe you could provide a quick example of a situation in which they would be different?

The platform specification suffixes would have been handled nicely by nameless fields in ASON

Yes, I don't want to say that I "hate" the idea of nameless fields (because it depends on the context and this part of ASON does have its merits), but I really dislike them here because they break the generic interoperability (without an additional schema specification, which complicates things) and because they are optional.

I wouldn't call this "breaking generic interoperability" because what kind of generic functionality are you breaking? Any generic program will still be able to parse the nameless fields, they just won't know the names associated with those fields, and any generic program reading the ASON wouldn't care what the names were anyway since they don't understand what the data structures mean in the first place.

Any non-generic program that understands the semantics of a DUB data structure will also know what fields can be nameless...just like if we used SDL, any generic program wouldn't know what the nameless tag values mean but any dub specific program would.

Unless I am missing something, an ASON nameless field is the same thing as an SDL "object-value" (a "tag value" on a tag that also has attributes or child tags). Am I wrong on this? I suppose you could look at SDL tags a different way but I believe that if you look at SDL tags as field names (including SingularName fields) that it covers every case. Or you could look at tag names as keys to a string hash table, with duplicated names being added to a list in the hash table.
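The hash table view mentioned at the end of the paragraph above can be sketched as follows. The (name, values) pair representation and the group_tags name are assumptions for illustration, not the data model of any SDL library.

```python
from collections import defaultdict


def group_tags(tags):
    """Map each tag name to the list of value lists seen under that name."""
    table = defaultdict(list)
    for name, values in tags:
        table[name].append(values)
    return dict(table)


tags = [
    ("dependency", ["vibe-d"]),
    ("dependency", ["sdlang"]),
    ("name", ["demo"]),
]
print(group_tags(tags))
```

Note that this grouping keeps the order within one tag name but drops the relative order between different names, so it is a lossy view of the original tag list.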

Or let's put it this way, if you would have asked me some years earlier about ASON (without all the optional punctuation and more limited quote-less strings, that is ;)), I would have liked it a lot as a generic basis for defining application specific data DSLs. However, it was a real pleasure for me to discover SDL back then (being slightly skeptical about date/time formats), because it had such a similar syntax to the "optimal" DSLs that I was using all the time, but as an added bonus also was fully generic, so this is something that I don't want to miss. (Note: my DSLs never had attributes, but they did have tags/commands)

I'm curious, what's an example of an "optimal" DSL?

It's unfortunate that SDL made some overly liberal decisions with date/time formats (which IMO is not a fatal flaw, though). But there is one idea that could also be interesting. What about defining a simpler SDL-lite instead of SDL 2? It would be a strict subset (since we are always talking about supersets, which are a lot less useful for interoperability) of SDL, so it couldn't solve all syntax issues, but it could remove certain features to become a very clean format with a simple parser logic, while still being valid SDL.

Well, the date-time/timestamp literals are the last thing that I have chosen to implement in my parser. I don't think this is an unreasonable restriction.

Thanks again for taking the time to help me understand.

@s-ludwig
Member

I'm not understanding the difference between a list of tags and a list of SingularName elements. Don't they represent the same thing?

They achieve the same goal, but:

  • They are again something application specific
  • They are optional syntax sugar (multiple ways to achieve the same thing)

Tags on the other hand are an integral part of the data structure itself.

Any generic program will still be able to parse the nameless fields, they just won't know the names associated with those fields, and any generic program reading the ASON wouldn't care what the names were anyway since they don't understand what the data structures mean in the first place.

So can you tell me how the following is parsed: foo bar x y?

Any non-generic program that understands the semantics of a DUB data structure will also know what fields can be nameless

Yes, but then we can as well make a completely custom DUB language and fully exploit all possibilities for the optimal syntax instead of being held back by the generic part of the format.

Unless I am missing something, an ASON nameless field is the same thing as an SDL "object-value" (a "tag value" on a tag that also has attributes or child tags). Am I wrong on this?

They can look the same syntactically (but don't have to), but they are structurally very different.

I'm curious, what's an example of an "optimal" DSL?

Well, you could have something very application specific that could highly benefit from a certain syntax construct. Someone mentioned heredoc strings, but it could be anything. Many of those things are not generally very useful, though. I can't give you a concrete code example, but for example I have a custom language for defining themes for a UI framework. It is laid out in a way that makes those files very compact to write, while still being very flexible in terms of extensibility and being friendly to read. Any generic approach would either hamper readability or conciseness.

I think much of the misunderstanding comes from differing points of view. I'm looking mostly from the view of a potential user, who doesn't care at all how elegant the underlying data structure may be. To me, what matters is also not only how nice and concise you can format the data, but also how much hidden potential for confusion there is. And among other things, any alternative or optional syntaxes are prone to such confusion.

This is of course only for judging the language itself. Interoperability and familiarity go on top of that.

@marler8997
Contributor Author

OK I understand your concern now about ASON's "application specific" nameless fields. I will say that this is a weakness of generic ASON, but not one for applications that do not have any objects that can write all their fields as nameless. Any generic ASON tools (ASON to SDL for example) would assume that the ASON had no objects that only have nameless fields. Then if the application did have those kinds of objects, it would need to provide the tool with that information (what objects can be written with nameless fields only).

The last concern I have with SDL is users being confused between the JSON format and SDL format.

JSON: configurations : [
          {
              "name": "my-name",...
          }
      ]
JSON VARIANT:
      configuration {
          name "my-name" // Only one way to do this
      }
SDL:  configuration "my-name" {
          ...
      }
      // OR
      configuration name="my-name" {
          ...
      }
      // OR
      configuration {
          name "my-name"
      }
      // OR
      configuration {
          "my-name"
      }

Since there are multiple ways to do it in SDL, it seems like it's going to be harder on the user. I can picture a user going through their JSON file converting it to SDL and thinking to themselves "Now is that field supposed to be a value, an attribute or a child tag?". I say this because it is a common problem in many XML languages. What do you think, will this be confusing for users?

@s-ludwig
Member

It's usually a pretty simple decision. Mandatory arguments are passed as values and optional arguments are passed as attributes. Or, as a more general rule, if there is no particular reason, the shortest variant is chosen.
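As a rough sketch of that rule, the following hypothetical helper writes the mandatory fields of a record as positional values and everything else as attributes. The to_tag and render names, and the quoting rules, are assumptions for illustration rather than part of DUB or any SDL library.

```python
def render(value):
    """Render a Python value as a simplified SDL-style literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_tag(name, record, mandatory):
    """Mandatory fields become values; the remaining fields become attributes."""
    parts = [name]
    parts += [render(record[key]) for key in mandatory]
    parts += [f"{key}={render(value)}"
              for key, value in record.items() if key not in mandatory]
    return " ".join(parts)


print(to_tag("configuration", {"name": "foo"}, ["name"]))
print(to_tag("dependency", {"name": "vibe-d", "version": ">=0.7.11"}, ["name"]))
```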

There is also not more to learn for configuration "foo" { ... } than for "configurations": [{ "name": "foo", ... }] and it's also intuitively clear what it means, and this is what is important.

BTW, there are definitely more ways to express the configuration list in JSON:

order independent configurations:

"configurations": {
    "foo": {
        ...
    }
}

order dependent configurations:

"configurations": [
    {
        "name": "foo",
        ...
    }
]

simulate tags using familiar platform-suffix-style:

"configuration-foo": {
    ....
}

simulate tags using a more structured approach:

[
    {"configuration": "foo", "settings": {...}}
]

However, this is not the "one way to do one thing" that I mean. Of course there are endless possibilities to choose from, but in the end there will just be one valid way how to actually specify a configuration in the spec. Important is that you don't have to wonder what a certain piece of code actually does when it's written in an unfamiliar syntactical variant. That would be just unnecessary additional cognitive load.

@marler8997
Contributor Author

Ok, just to be clear, you could use ASON and have the exact same syntax as SDL, only the ASON file would have a language-defined mapping to JSON whereas the SDL is an application-defined mapping to JSON. Given that knowledge, you still think SDL is a better option? If so I will submit :)

@s-ludwig
Member

My pet peeve with ASON is the number of alternative syntaxes to the same thing that it provides. And of course that it isn't possible to parse without some kind of application specific schema. Any SDL file could be parsed, edited or translated to other formats in a generic way, which can be(come) a big plus, even if that isn't too relevant for our concrete case. But I'm also thinking in terms of larger adoption there.

Of course the package description specification itself is application specific and there has to be some kind of non-identity mapping to JSON, if we want the improvements to be more than just cosmetics, be it ASON or SDL. Even if cosmetics can already do a lot in this case, I think that alone they are barely enough of an argument for introducing another language, at least not better alternatives in sight.

@marler8997
Contributor Author

This discussion has made me realize something about ASON. Let me go rework the ASON documentation a little and get back to you. Even though I don't think we will be using ASON, I think this will be helpful.

@marler8997
Contributor Author

Ok I'm done discussing this. Let's finish SDL. I got two quick questions.

Mandatory arguments are passed as values and optional arguments are passed as attributes. Or, as a more general rule, if there is no particular reason, the shortest variant is chosen.

I notice that the dependency tag didn't follow this rule.

dependency "vibe-d" version=">=0.7.11"

Isn't version a mandatory argument? Is this just an exception to the rule? Maybe the rule could be "The first most obvious mandatory argument for a tag is passed as a value, and other mandatory single-value arguments are passed as attributes."

Also why does the configuration have a string argument but subPackage does not? Is it because subPackage references use a path and have no name?

Thanks.

@s-ludwig
Member

I notice that the dependency tag didn't follow this rule.

There are also path based references or references to sub packages, which don't require a version.

Also why does the configuration have a string argument but subPackage does not? Is it because subPackage references use a path and have no name?

Sub packages contain exactly the same contents as a regular package description. Pulling out the name there I think would just create a strange special case.

@marler8997
Contributor Author

Ok makes sense. I just tested some corner cases with @Abscissa's parser and opened an issue here. Take a look, you may or may not be concerned about them. Let me know if you think any of these need to be fixed if we were to use his parser.

@DmitryOlshansky
Member

IMHO HOCON is both closer to JSON and has precedent of being used in configs in major projects.
https://github.com/typesafehub/config/blob/master/HOCON.md

@marler8997
Contributor Author

Initially I thought we wanted a data format close to JSON. Under this assumption I proposed a new language called ASON. However, it turns out that @s-ludwig was more interested in the "expressiveness" of the language rather than how closely it fit the existing JSON model. I think the current SDL proposal is fine. HOCON looks interesting though. It has some similarities and some differences with ASON and other JSON variants.

@s-ludwig
Member

HOCON does have some interesting traits (e.g. the path syntax, a { b: 1, c: 2} <-> a.b: 1, a.c: 2, and variable substitution/inheritance) and some slightly worrying ones, such as the file import feature and maybe the very liberal treatment of separators/whitespace and value auto concatenation. But in general, since we have to treat it as a separate language anyway, rather than trying to get a superset or something as close as possible to JSON, this should be a good opportunity to take a fresh look at the problem and find a language without any artificial constraints like that.
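The path syntax equivalence mentioned above (a { b: 1, c: 2 } versus a.b: 1, a.c: 2) amounts to flattening and unflattening a nested mapping. The sketch below is not a HOCON parser; it works on plain Python dicts and assumes keys never contain a dot and that no nested dict is empty.

```python
def flatten(tree, prefix=""):
    """Turn nested dicts into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild nested dicts from dotted paths."""
    tree = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


nested = {"a": {"b": 1, "c": 2}}
print(flatten(nested))
print(unflatten(flatten(nested)) == nested)
```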

One thing that is particularly nice about SDL in the context of many package descriptions is that it is tag based, or rather that it allows to write

dependency x version=1.0.0 optional=true;
dependency y version=2.0.0;

instead of always using nesting (gets worse for configurations, build types and sub packages)

dependencies: {
    x: {
        version: 1.0.0,
        optional: true
     },
     y: 2.0.0
}

It's also syntactically quite close to D ("block statements" with semicolon terminated child "statements") and would lend itself well for potential script-like additions in the future (if cond { ... }).
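To make the comparison concrete, here is a small sketch that turns the nested form into one tag line per dependency. The dependency_lines name is hypothetical, the shorthand where a bare version string stands for the whole entry is handled explicitly, and the output quoting is simplified (no escaping).

```python
def dependency_lines(dependencies):
    """Emit one tag-style line per entry of a nested dependencies mapping."""
    lines = []
    for name, spec in dependencies.items():
        if not isinstance(spec, dict):
            spec = {"version": spec}  # shorthand: bare version string
        parts = [f'dependency "{name}"']
        for key, value in spec.items():
            if isinstance(value, bool):
                rendered = "true" if value else "false"
            else:
                rendered = f'"{value}"'
            parts.append(f"{key}={rendered}")
        lines.append(" ".join(parts))
    return lines


nested = {"x": {"version": "1.0.0", "optional": True}, "y": "2.0.0"}
for line in dependency_lines(nested):
    print(line)
```

The nested form needs two shapes for one concept (an object or a bare string), while the tag form uses the same shape for every entry.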

@DmitryOlshansky
Member

@marler8997 I thought so as well, i.e. a superset of JSON. But anyway if SDL is the ticket, let's go with it.

@marler8997
Contributor Author

SDL isn't perfect but I think it's good enough. I would prefer if it allowed curly braces on their own line, had a grammar and allowed unquoted strings. These features may be added in SDL version 2. I also don't like how it tries to define so many types. If the unquoted string feature was implemented well, the need for some of the types would almost go away, but Daniel (the creator of SDL), doesn't think the types are going to go away in SDL version 2.

@MartinNowak
Member

How does this go?

@marler8997
Contributor Author

I've just been busy. I may get back to this but if someone wants to continue this or just use what I've done as a guide I'm fine with it. Let me know if someone wants to take this over for now.

@MartinNowak added this to the 1.0.0 milestone Dec 7, 2014
@mihails-strasuns

@marler8997 can you quickly outline remaining TODO list for this PR?

@MartinNowak
Member

You wanna do it @Dicebot? 👏

@mihails-strasuns

I want 1.0.0 out :) Thus yes, considering to take this over.

@marler8997
Contributor Author

It looks like this isn't the latest version I was working on. Let me get in to work tomorrow and I'll look over what I've done and make a list of what needs to be done. I'll get back to you then.

@mihails-strasuns

Thanks!

@marler8997
Contributor Author

@Dicebot I realized that this particular PR was just the first attempt at exploring what it would take to support SDL. I've done the real work on another branch and I've just submitted a new PR with the "real" work #473

Feel free to use whatever you like from that commit. Let me know if you have any questions. Most of the work will be in sdl.d which translates the SDL into the PackageRecipe structure.

@mihails-strasuns

FYI: I am still on it but familiarizing myself with package recipe handling inside dub takes quite some time.

@s-ludwig closed this in 2b513f7 Jun 16, 2015