Finish shaving the yak for <xs:all> elements #1

duck2 · 2019-06-27T22:30:23Z

This has really turned into a snake story.

What we are seeking is a solution to the state explosion problem when we want to accept N independent inputs, in any order.

The solution is running N parallel state machines:

The problem with the solution is knowing how to run parallel state machines. When generating a single state machine, it is OK to generate something like this:

while(1){
    switch(state){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state = 1;
        } else {
            error("expected sizing, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            goto accept;
        } else {
            error("expected end of input, found %s" % inp);
        }
        break;
    }
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}
accept:
    return;

When generating several state machines, we can:

Make state machines which silently accept inputs unrelated to them:

Add dispatching code which first checks the input and passes it only to the relevant state machine(s).

We also need to know when to accept the input. For that, we would need to check if each of the parallel state machines ended up in an accepting state.

The code for both are very inelegant. If we make silently accepting state machines, it's something like this:

while(1){
    switch(state1){
    case 0:
        if(inp == "sizing"){
            load(current_element);
            state1 = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected sizing or connection_block or default_fc, found %s" % inp);
        }
        break;
    case 1:
        if(inp == "end of input"){
            accept[0] = 1;
        } else if(inp == "connection_block" || inp == "default_fc") {
            /* do nothing */
        } else {
            error("expected end of input or connection_block or default_fc, found %s" % inp);
        }
        break;
    }
    switch(state2){
    case 0:
        if(inp == "connection_block"){
            load(current_element);
            state1 = 1;
        } else if(inp == "sizing" || inp == "default_fc") {
            /* do nothing */
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

The code for the second would look like this:

while(1){
    if(inp == "sizing" || inp == "end of input"){
        switch(state1){
        case 0:
            if(inp == "sizing"){
                load(current_element);
                state1 = 1;
        } else {
            error("expected sizing, found %s" % inp);
        }
        break;
        case 1:
            if(inp == "end of input"){
                accept[0] = 1;
            } else {
                error("expected end of input, found %s" % inp);
            }
        break;
        }
    } else if(inp == "connection_block" || inp == "end of input") {
        switch(state2){
        case 0:
            if(inp == "connection_block"){
                load(current_element);
                state1 = 1;
            } else {
    [...]
    if(all_accepted(accept)) break;
    if(next_element) inp = next_element.name;
    else inp = "end of input";
}

which are definitely not OK to generate, since the code is too long and it's not easy to see what it's doing. On top of that, error messages and checking for accepted states are not complete. I see no way to elegantly implement them in a while-switch loop either.

duck2 · 2019-06-27T22:51:04Z

A good step to take here is bringing in state tables for transitions. If we had a magic state table where we could look up next_state = states[current_state][current_input], a single state machine would reduce to this:

states = {-1};
states[0][SIZING] = 1;
states[1][END] = ACCEPT;
[...]
while(1){
    state = states[state][inp];
    if(state == -1) error(state, inp); /* can read the table and see what was expected */
    if(state == ACCEPT) break;
    if(next_element) inp = tokenize(next_element.name);
    else inp = END;
}

And, multiple state machines could be run like this:

states0 = {-1};
states0[0][SIZING] = 1;
states0[1][END] = ACCEPT;
states1 = {-1};
states1[0][CONNECTION_BLOCK] = 1;
states1[1][END] = ACCEPT;
[...]
while(1){
    if(inp == SIZING || inp == END){
        state0 = states0[state0][inp];
    }else if(inp == CONNECTION_BLOCK || inp == END){
        state1 = states1[state1][inp];
    }
    if(state0 == -1 || state1 == -1) error(state0, state1, inp); /* I'm sure this would work out. */
    if(state0 == ACCEPT && state1 == ACCEPT) break;
    if(next_element) inp = tokenize(next_element.name);
    else inp = END;
}

This makes much more sense, there aren't countless string comparisons and indentation levels. But this too, misses an important point. The state transitions have actions associated with them. How do we load the elements?

The tokenize function can look like magic, but I'm sure we could have a perfect hash function to reduce every tokenize to a hash+strcmp. I think even if we don't implement it, ~6-7 strcmps for every element won't hurt anyone.

duck2 · 2019-06-27T23:01:37Z

About the actions, I think I can bail out to using if statements, since typically only a few state transitions in xs:alls have actions associated with them:

states0 = {-1};
states0[0][SIZING] = 1;
states0[1][END] = ACCEPT;
states1 = {-1};
states1[0][CONNECTION_BLOCK] = 1;
states1[1][END] = ACCEPT;
[...]
while(1){
    if(inp == SIZING || inp == END){
        state0 = states0[state0][inp];
        if(state0 == 0 && inp == SIZING){
            load_element(current_node); /* This isn't the same code for every field. If it was so, we could make something more elegant. */
        }
    }else if(inp == CONNECTION_BLOCK || inp == END){
        state1 = states1[state1][inp];
        if(state1 == 1 && inp == CONNECTION_BLOCK){
            load_element(current_node);
        }
    }
    if(state0 == -1 || state1 == -1) error(state0, state1, inp); /* I'm sure this would work out. */
    if(state0 == ACCEPT && state1 == ACCEPT) break;
    if(next_element) inp = tokenize(next_element.name);
    else inp = END;
}

At least we have separated the element loading logic(side effect) from the validation&error logic(pure), which is a win.

Phew. Now, this is a lot of work to do in Python, but at least I poured my floating thoughts about this here.

mithro · 2019-06-28T01:57:08Z

@duck2 - So the only <xs:all> I could find in the xsd files we care about is the following;

  <xs:complexType name="device">
    <xs:all>
      <xs:element name="sizing">
        <xs:complexType>
          <xs:attribute name="R_minW_nmos" type="xs:double" use="required"/>
          <xs:attribute name="R_minW_pmos" type="xs:double" use="required"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="connection_block">
        <xs:complexType>
          <xs:attribute name="input_switch_name" type="xs:string" use="required"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="area">
        <xs:complexType>
          <xs:attribute name="grid_logic_tile_area" type="xs:double" use="required"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="switch_block">
        <xs:complexType>
          <xs:attribute name="type" type="switch_block_type" use="required"/>
          <xs:attribute name="fs" type="xs:int"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="chan_width_distr" minOccurs="0" type="chan_width_distr"/>
      <xs:element name="default_fc" minOccurs="0">
        <xs:complexType>
          <xs:attribute name="in_type" type="fc_type_enum"/>
          <xs:attribute name="in_val" type="xs:double"/>
          <xs:attribute name="out_type" type="fc_type_enum"/>
          <xs:attribute name="out_val" type="xs:double"/>
        </xs:complexType>
      </xs:element>
    </xs:all>
  </xs:complexType>

mithro · 2019-06-28T01:58:57Z

Ignoring the inside of each of the xs:element, you get the following;

  <xs:complexType name="device">
    <xs:all>
      <xs:element name="sizing"/>
      <xs:element name="connection_block"/>
      <xs:element name="area"/>
      <xs:element name="switch_block"/>
      <xs:element name="chan_width_distr" minOccurs="0" type="chan_width_distr"/>
      <xs:element name="default_fc" minOccurs="0"/>
    </xs:all>
  </xs:complexType>

mithro · 2019-06-28T02:00:03Z

This says, you want;

One sizing element.
One connection_block element.
One area element.
One switch_block element.
Zero or One chan_width_distr element.
Zero or One default_fc element.

mithro · 2019-06-28T02:09:46Z

Then you could end up with something like this?

def validate_device(device):
  if 'sizing' not in device:
    raise TypeError('sizing tag missing from device tag')
  if 'connection_block' not in device:
    raise TypeError('connection_block tag missing from device tag')
  ...

def parse_sizing(tag):
  ....

def parse_connection_block(tag):
  connection_block = {}
  ....
  validate_connection_block(connection_block)
  return connection_block

def parse_device(tag):
  device = {}
  sub_tag_parsers = {
    'sizing': parse_sizing,
    'connection_block': parse_connection_block,
    'area': parse_area,
    ...,
  } 
  for child in children:
    if child.tag not in sub_tag_parsers:
       raise TypeError("Unknown tag {} while parsing device".format(child.tag))
    if child.tag in device:
       raise TypeError("Duplicate tag {} while parsing device".format(child.tag))
    device[child.tag] = sub_tag_parsers(child)
  validate_device(device)
  return device

duck2 · 2019-06-28T07:20:27Z

Putting parse functions in a map is a good idea! I can make another table and stuff function pointers in it.

Unfortunately, counting to validate only works well when we only have <xs:element>s in the model group. If we had something like this:

  <xs:complexType name="device">
    <xs:all>
      <xs:element name="sizing"/>
      <xs:element name="connection_block"/>
      <xs:sequence>
          <xs:choice>
              <xs:element name="area"/>
              <xs:element name="area_legacy"/>
          </xs:choice>
          <xs:element name="switch_block" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:element name="chan_width_distr" minOccurs="0" type="chan_width_distr"/>
      <xs:element name="another_chan_width_distr" type="chan_width_distr" />
      <xs:element name="default_fc" minOccurs="0"/>
    </xs:all>
  </xs:complexType>

then this would get hairy, since we need to count the occurrences of (<area>|<area_legacy>)<switch_block>+ in the input, which requires state machine-like code. If we do that for every member group, we end up with parallel state machines.

duck2 · 2019-06-28T07:27:27Z

We have only one big xs:all currently, but some types are written as sequences or choices while they really should be alls. For instance:

  <xs:element name="architecture">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="models" type="models"/>
        <xs:element name="layout" type="layout"/>
        <xs:element name="device" type="device"/>
        <xs:element name="switchlist" type="switchlist"/>
        <xs:element name="switchblocklist" type="switchblocklist"/>
        <xs:element name="segmentlist" type="segmentlist"/>
        <xs:element name="power" type="global_power"/>
        <xs:element name="clocks" type="clocks"/>
        <xs:element name="directlist" type="directlist"/>
        <xs:element name="tiles" type="tiles"/>
        <xs:element name="complexblocklist" type="complexblocklist"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

or

  <xs:complexType name="segment">
    <xs:choice maxOccurs="unbounded">
      <xs:element name="sb" type="segment_block"/>
      <xs:element name="cb" type="segment_block"/>
      <xs:element name="mux" type="segment_mux"/>
      <xs:element name="wire_switch" type="segment_wire_switch"/>
      <xs:element name="opin_switch" type="segment_wire_switch"/>
    </xs:choice>
[...]

duck2 · 2019-06-28T07:28:20Z

This issue also applies to <xs:attribute> tags, since they also occur in any order.

mithro · 2019-06-28T15:10:37Z

Then you could end up with something like this?

def parse_arch(tag):
  arch = {
    'models': [],
    'layouts': [],
    ....
  }
  sub_tag_parsers = {
    'model': parse_model,
    'layout': parse_layout,
    ...,
  } 
  for child in children:
    if child.tag not in sub_tag_parsers:
       raise TypeError("Unknown tag {} while parsing device".format(child.tag))
    subparser = sub_tag_parsers[child.tag]
    current = arch.get(child.tag, None)
    if type(arch[child.tag]) == list:
        arch[child.tag].append(subparser(child))
    elif arch[child.tag] is None:
        arch[child.tag] = subparser(child)
   else:
        raise TypeError("Duplicate tag {} while parsing device".format(child.tag))
  validate_arch(arch)
  return arch

mithro · 2019-06-28T15:14:25Z

(FYI that parse doesn't quite match what is described in the xsd... The xsd says you can have at most one models tag which can have multiple model tags under it.)

duck2 · 2019-06-30T19:44:43Z

Turns out xmlschema itself doesn't support sequences in alls, which is a XSD 1.1 feature. This makes everything about the parallel state machines and all void. I'm currently implementing xs:alls with bitsets.

mithro · 2019-06-30T21:41:10Z

Can you add support for sequences in alls to xmlschema?

duck2 · 2019-06-30T22:23:39Z

I thought about that, but my impression is that the maintainer of xmlschema would like to release support for XSD 1.1 when it's ready and it's not complete yet.

It does have an XMLSchema11 class, the definition of which starts like

# ++++ UNDER DEVELOPMENT, DO NOT USE!!! ++++
class XMLSchema11(XMLSchemaBase):
[...]

Currently, that class can parse this XSD:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
	<xsd:group name="hey">
		<xsd:sequence>
			<xsd:element name="greeting" type="xsd:string"></xsd:element>
			<xsd:element name="name" type="xsd:string"></xsd:element>
		</xsd:sequence>
	</xsd:group>
    
	<xsd:complexType name="hello">
		<xsd:all>
			<xsd:group ref="hey"/>
			<xsd:element name="whatsup" type="xsd:int"></xsd:element>
		</xsd:all>
	</xsd:complexType>

	<xsd:element name="hello" type="hello"></xsd:element>
</xsd:schema>

and produce something like this from it:

Xsd11Group(model='all', occurs=[1, 1])
{'_group': [XsdGroup(ref='hey', model='sequence', occurs=[1, 1]),
            Xsd11Element(name='whatsup', occurs=[1, 1])],
 'annotation': None,
 'elem': <Element '{http://www.w3.org/2001/XMLSchema}all' at 0x7f8c14390ae8>,
 'errors': [],
 'model': 'all',
 'name': None,
 'parent': Xsd11ComplexType(name='hello'),
 'schema': XMLSchema11(basename='hello.xsd', namespace=''),
 'validation': 'strict'}

XsdGroup(ref='hey', model='sequence', occurs=[1, 1])
{'_group': [Xsd11Group(name='hey', model='sequence', occurs=[1, 1])],
 'annotation': None,
 'elem': <Element '{http://www.w3.org/2001/XMLSchema}group' at 0x7f8c14390b38>,
 'errors': [],
 'model': 'sequence',
 'name': 'hey',
 'parent': Xsd11Group(model='all', occurs=[1, 1]),
 'schema': XMLSchema11(basename='hello.xsd', namespace=''),
 'validation': 'strict'}

Xsd11Group(name='hey', model='sequence', occurs=[1, 1])
{'_group': [Xsd11Element(name='greeting', occurs=[1, 1]),
            Xsd11Element(name='name', occurs=[1, 1])],
 'annotation': None,
 'elem': <Element '{http://www.w3.org/2001/XMLSchema}group' at 0x7f8c143907c8>,
 'errors': [],
 'model': 'sequence',
 'name': 'hey',
 'schema': XMLSchema11(basename='hello.xsd', namespace=''),
 'validation': 'strict'}

duck2 · 2019-06-30T23:35:00Z

After reading the W3C standard more carefully, I see group support in xs:all is weird(even in XSD 1.1):

You can only put referenced groups(<group ref=...>) in xs:all. It's my fault not reading the standard very carefully before- I saw group in all's possible children and assumed it would be easy to put sequences inside alls.
You can't assign min_occurs or max_occurs to anything inside an xs:all.

Now, in this case, we know elements inside alls are only bits(found or not found). Sequences or choices can be run aside as slightly more complicated state machines. Then it's fine to emit elements as bitsets and sequences/choices as state machines.

duck2 · 2019-07-21T14:52:06Z

I ended up just implementing XSD 1.0 alls. They are good enough for most purposes.

mithro mentioned this issue Jul 2, 2019

Look at if a generated regex parser can do better job than pugixml (performance wise) #5

Open

duck2 closed this as completed Jul 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finish shaving the yak for <xs:all> elements #1

Finish shaving the yak for <xs:all> elements #1

duck2 commented Jun 27, 2019

duck2 commented Jun 27, 2019

duck2 commented Jun 27, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019 •

edited

Loading

duck2 commented Jun 28, 2019

duck2 commented Jun 28, 2019

duck2 commented Jun 28, 2019

mithro commented Jun 28, 2019 •

edited

Loading

mithro commented Jun 28, 2019

duck2 commented Jun 30, 2019

mithro commented Jun 30, 2019

duck2 commented Jun 30, 2019

duck2 commented Jun 30, 2019

duck2 commented Jul 21, 2019

Finish shaving the yak for <xs:all> elements #1

Finish shaving the yak for <xs:all> elements #1

Comments

duck2 commented Jun 27, 2019

duck2 commented Jun 27, 2019

duck2 commented Jun 27, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019

mithro commented Jun 28, 2019 • edited Loading

duck2 commented Jun 28, 2019

duck2 commented Jun 28, 2019

duck2 commented Jun 28, 2019

mithro commented Jun 28, 2019 • edited Loading

mithro commented Jun 28, 2019

duck2 commented Jun 30, 2019

mithro commented Jun 30, 2019

duck2 commented Jun 30, 2019

duck2 commented Jun 30, 2019

duck2 commented Jul 21, 2019

mithro commented Jun 28, 2019 •

edited

Loading

mithro commented Jun 28, 2019 •

edited

Loading