Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete functionality #3

Open
snake-biscuits opened this issue Jul 3, 2021 · 2 comments
Open

Incomplete functionality #3

snake-biscuits opened this issue Jul 3, 2021 · 2 comments

Comments

@snake-biscuits
Copy link

I was looking at using this library for converting some python objects that use the struct module into C++ definitions

Unfortunately I found that this repo only handles a small subset of valid struct format strings

struct allows spaces in format strings, as well as integers preceding an identifier.
"I 128s 2h" is a valid format for the struct module, but not for this repo

Instead it fails on the first space, since it isn't listed as a valid character, rather than skipping it
The same for any integer.

I've made my own implementation to handle this, rather than a fork, since I also have to handle my custom objects
(The binary data is represented in nested structures, so my definitions also have to take that into account)

One common special case to look out for is strings:
With struct both "5c" & "5s" represent the equivalent of char[5] in C
However, the objects returned are different:

>>> import struct
>>> struct.unpack("5c", b"\x00\x01\x02\x03\x04")
(0, 1, 2, 3, 4)
>>> struct.unpack("5s", b"words")
b"words"

This is an important restriction, in writing my own function to standardise the format strings,
I had to include a case for strings, to ensure I recognised it as the single object struct would expect

Unfortunately my solution is a little clunky, but I'll share it here if you think you'd find it useful:

def simplify(_format):
	"""Converts a format string for the struct module into a list of type chars"""
    _format = _format.replace(" ", "")
    if _format[-1].isnumeric():
        raise RuntimeError("_format is invalid (ends in a number)")
    out, i = list(), 0
    while i < len(_format):
        char = _format[i]
        if char.isnumeric(): # find the whole number
            j = 1
            while _format[i:i+j].isnumeric():
                j += 1
            j -= 1
            count = int(_format[i:i + j])
            i += j
            f = _format[i]
            # NOTE: f is only assumed to be valid, validity is not checked here
            if f == "s":  # string
                out.append(f"{count}s")
            else:
                out.extend([f] * count)  # "2I" -> ["I", "I"]
        elif char.isalpha():  # lazily collect type
            out.append(_format[i])
        elif char == "?":  # bool is not an alphabetical character, but is still valid
            out.append("?")
        else:
            raise RuntimeError(f"Invalid character '{char}' in _format") 
        i += 1
    return out

This could probably be tidier but in my testing it seems to work with the format strings I'm using
(no pointer types, padding, size_t or 0 length types)

Using this with the dictionary lookups in struct_parse should handle both spaces and types preceeded by counts
Strings are passed through as "128s" etc., so a special case would need to be added
Presumably noting the type as string and also the size

I hope this is useful in expanding this repo to support all valid format strings the struct module accepts

@snake-biscuits
Copy link
Author

I've cleaned up my solution considerably using regex, via the built in re module

def split_format(_format):
    _format = re.findall(r"[0-9]*[xcbB\?hHiIlLqQnNefdspP]", _format.replace(" ", ""))
    out = list()
    for f in _format:
        match_numbered = re.match(r"([0-9]+)([xcbB\?hHiIlLqQnNefdpP])", f)
        # NOTE: does not decompress strings
        if match_numbered is not None:
            count, f = match_numbered.groups()
            out.extend(f * int(count))
        else:
            out.append(f)
    return out

this returns a tuple of strings, with all types but strings being multiplied out
should be far easier to work with than my old solution
one major point to note is that the second regex pattern does not match for s, leaving strings unchanged

@zmarvel
Copy link
Owner

zmarvel commented Jul 9, 2021

Thanks for taking the time to report this bug and for sharing your code! I'll leave this issue open and take a look at it implementing it when I get some time 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants