Add symbols API #505

Frostbyte0x70 · 2024-05-28T17:53:19Z

Adds an API that combines binary editing and the pmdsky-debug symbols. Requires SkyTemple/pmdsky-debug-py#16.

The following components have been added:

RWValue

New class under common/ that can be used to read/write a value of a given type at a given offset on a certain binary. Has one subclass for each supported data type. It mostly encapsulates the read_<type> and write_<type> methods that are currently implemented in utils.py, while adding support for new data types (such as fixed-point numbers and instruction immediates).

RWSymbol

New class under hardcoded/symbols. Used to link symbol information (normally pulled from Symbol instances) to RWValue instances. Combined, this allows reading/writing the value of a symbol with a single method call, while abstracting everything related to its type or offset.

CType

Small helper class used to parse some C Types.

SymbolPath

Allows accessing the fields and elements of complex symbol types (such as arrays and structs) using a path string.

BinaryDataGetter

API class that allows retrieving some information about the game's binaries and the symbols contained in it. It's based on an existing implementation in one of SkyTemple's modules, which directly uses reflection to get the list of symbols for the current ROM. I thought it would be nice to abstract all that.
I've only implemented the methods I'm going to need, but more could be added in the future if required.

Manual structs

I wanted to have support for structs, but that would have required parsing every C type in pmdsky-debug in order to figure out the size and fields of each possible type. That's too much work, so I instead decided to manually implement an API for the most relevant structs I could find. They are described in symbols/manual/structs.py.
The main downside is that the info could get outdated if any of those structs change in pmdsky-debug, but it's better than parsing the repo's entire C headers.

Manual enums

Similar idea, but with enums. I've included several types I considered relevant, listing each of their possible values.

Tests

I wrote tests for all the relevant classes implemented and included them in the tests/ subfolder. I didn't look too much into how tests are normally set up because it seemed quite complicated. I didn't need a real ROM to run them either.
So I don't know if the implementation needs changes. If so, let me know.
The tests will fail until the pmdsky-debug-py PR mentioned above is merged. They all pass in my dev environment.

Current TODOs

Update pmdsky-debug to include the new types (many of them are simply represented as integers of various sizes)
Wait for next pmdsky-debug-py version, then bump the version requirement
Make sure the tests are correctly structured, that they are also run during CI and that they all pass

I'll mark the PR as a draft for now. Suggestions are welcome.

github-actions · 2024-05-28T18:05:18Z

Test Results

12 files ± 0 12 suites ±0 43m 47s ⏱️ + 2m 8s
1 490 tests + 135 1 489 ✅ + 135 1 💤 ±0 0 ❌ ±0
11 892 runs +1 080 10 324 ✅ +1 080 1 568 💤 ±0 0 ❌ ±0

Results for commit bee3be8. ± Comparison against base commit 49d8f8e.

♻️ This comment has been updated with latest results.

UsernameFodder · 2024-05-30T04:22:15Z

I haven't looked at this closely enough to properly understand the techniques being used here to translate from the symbol info to the high-level API, but I'm concerned that using regexes to interpret the C types could be pretty fragile. The pmdsky-debug tooling does this to some extent, but at least in that case those assumptions are all tightly coupled to the headers themselves since they live together in the same repo. Making the same sort of tacit assumptions outside of the original repo makes me a bit worried.

C is pretty flexible in the way things can be declared. Do these regexes break down if there's some slight variation on the "normal" syntax/style used in the pmdsky-debug headers currently? For example:

// typedef'd struct
typedef struct {
    int x;
} typedef_struct;

struct bare_struct {
    typedef_struct y;
};

// keywords in weird places
char const *const const_string;
const char *volatile *volatile volatile_string;

// parens in weird places
int *(array_of_ptrs[8]);
int (*ptr_to_array)[8];

// un-typedef'd function pointer types
void (*function_ptr)(int);
void (*signal(int, void (*fp)(int)))(int);

// macro magic, like the existing ENUM_8_BIT and ENUM_16_BIT macros
#define MAKE_PAIR(a, b)      \
    struct pair_##a##_##b {  \
        a first;             \
        b second;            \
    };
MAKE_PAIR(int, char);
struct pair_int_char pair;

Not that all of these constructs are particularly likely, but they are still possible.

Have you considered using some proper AST-based approach? Something like pycparser or Clang AST? Clang AST in particular is something I've thought about experimenting with before, but I've never gotten around to it.

Frostbyte0x70 · 2024-05-30T15:47:37Z

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax?
If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated.
And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

UsernameFodder · 2024-05-31T03:12:33Z

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax? If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated. And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

Well if we're only best-effort parsing headers/data declarations, and we're failing gracefully, I suppose that's okay.

I don't expect the extremely weird constructs, but you never know. The more tame ones could conceivably slip through? I remember somebody at some point tried to use the typedef struct style. And the un-typedef'd function pointer is certainly something I can imagine happening from sloppiness. One other thing that I could actually see slipping through is stuff related to whitespace and comments, for example:

// Declaration that's commented out
// extern int foo;

extern int bar /* inline comment */ ;

extern struct baz
    var; // long comment that causes the formatter to wrap the line halfway through the declaration

So maybe it's worth making sure the regexes are resilient to that sort of thing, at least.

pyproject.toml

Frostbyte0x70 · 2024-05-31T10:40:39Z

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax? If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated. And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

Well if we're only best-effort parsing headers/data declarations, and we're failing gracefully, I suppose that's okay.

I don't expect the extremely weird constructs, but you never know. The more tame ones could conceivably slip through? I remember somebody at some point tried to use the typedef struct style. And the un-typedef'd function pointer is certainly something I can imagine happening from sloppiness. One other thing that I could actually see slipping through is stuff related to whitespace and comments, for example:
// Declaration that's commented out
// extern int foo;

extern int bar /* inline comment */ ;

extern struct baz
    var; // long comment that causes the formatter to wrap the line halfway through the declaration
So maybe it's worth making sure the regexes are resilient to that sort of thing, at least.

Would the typedef struct style affect files under data/? If the only thing that changes is the declaration, then it shouldn't matter.
Comments are a reasonable concern. I'll see what I can do about those. Although I don't see why someone would add an inline comment before the semicolon.

UsernameFodder · 2024-06-01T02:30:23Z

Would the typedef struct style affect files under data/? If the only thing that changes is the declaration, then it shouldn't matter.

It shouldn't be a big deal, but I'd probably just make sure the parsing code here doesn't rely on the struct keyword always being in front of a struct tag. I suppose with the fixed-point typedefs you introduced, it probably already doesn't.

Although I don't see why someone would add an inline comment before the semicolon.

Yeah it is rather weird, I probably wouldn't willingly allow it in code reviews. It's probably fine not to worry about it. I brought it up because there are C projects that put comments in the middle of code like this for various reasons. Some people like to use it to comment out stuff. Though I think probably it's more common in function prototypes for commenting individual parameters within the same line.

Frostbyte0x70 · 2024-06-01T11:48:52Z

It shouldn't be a big deal, but I'd probably just make sure the parsing code here doesn't rely on the struct keyword always being in front of a struct tag. I suppose with the fixed-point typedefs you introduced, it probably already doesn't.

Isn't the tag required when declaring a variable of that type? I'm not that familiar with C.

UsernameFodder · 2024-06-01T18:24:56Z

Isn't the tag required when declaring a variable of that type? I'm not that familiar with C.

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

Frostbyte0x70 · 2024-06-01T22:28:27Z

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

So C supports the same syntax as more recent languages where struct and enum types can be referred to using only the tag? Why is struct or enum always added every time it's referenced in pmdsky-debug then?

UsernameFodder · 2024-06-01T22:57:00Z

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

So C supports the same syntax as more recent languages where struct and enum types can be referred to using only the tag? Why is struct or enum always added every time it's referenced in pmdsky-debug then?

I wouldn't say it "supports" it. You can only do it if you define the struct with typedefs, i.e.

typedef struct {...} MyStruct;

It's a stylistic preference. Some projects, like the Linux kernel, prefer to write the struct and enum keywords explicitly for various reasons. In pmdsky-debug, I chose the style because I think the explicitness adds clarity around the low-level thing being represented.

Frostbyte0x70 · 2024-06-02T09:34:38Z

Oh, I see.
Well, without the keyword there's no way to know if the type is a struct or an enum without parsing the entire headers, which I'm not going to do, as stated above. So I think it's reasonable to support only structs and enums that use the prefix.

Frostbyte0x70 · 2024-06-12T20:47:48Z

After over two hours of pain, I have realized that you can't just keep a copy of the bytearray object that represents a binary, since it might go out of sync with the internal version kept by SkyTemple. I stumbled upon what I believe is the proper way of modifying binaries (RomProject.modify_binary()).
This means the RWValue class will have to be rewritten to take the bytes or bytearray object to read/write from/to on its read()/write() method.

I really wish SkyTemple had better documentation. It is quite frustrating to realize you've been doing everything wrong because there were no hints as to what the proper way of implementing things is.

theCapypara · 2024-06-15T13:31:41Z

@End45 I merged #506 so tests should now run through again when you update your branch.

I'll do a review once all the checks run through, but in general I think this is great so I don't expect to have many change requests!

Frostbyte0x70 · 2024-06-15T14:34:15Z

@End45 I merged #506 so tests should now run through again when you update your branch.

I'll do a review once all the checks run through, but in general I think this is great so I don't expect to have many change requests!

Alright! I still expect to add a few more commits here though. Likely just minor fixes.

Frostbyte0x70 · 2024-06-19T23:50:19Z

Alright, now that the UI part is implemented, I think the backend is ready!

The only thing that is missing would be formatting. However, I have some issues with the suggested changes there, especially with all the unnecessary line breaks the formatter wants to add. Some sections become borderline unreadable afterwards. I think the formatter settings would need to be reviewed.

theCapypara · 2024-06-20T12:55:23Z

@End45 The formatter has next to no settings by design. You can disable the formatter for some lines:
https://docs.astral.sh/ruff/formatter/#format-suppression

Frostbyte0x70 · 2024-06-20T18:13:38Z

@End45 The formatter has next to no settings by design. You can disable the formatter for some lines: https://docs.astral.sh/ruff/formatter/#format-suppression

Looking at the docs, I think one of the reasons why so many line breaks are introduced is because the default max line length is 88 characters. Maybe it could be bumped to 120. I just tested it and that gets rid of most of the problematic lines. The rest are either not too bad after reformatting or warrant using the suppression tag.

theCapypara · 2024-06-20T19:07:44Z

No, we will keep the defaults.

Frostbyte0x70 · 2024-06-20T21:22:23Z

Why though? 120 is the default I see the most nowadays, and the current setting really makes a lot of stuff almost unreadable. I could just litter the code with suppression tags, but I think it would be an even worse solution.

theCapypara · 2024-06-20T22:46:33Z

79 is the maximum required by PEP 8.
https://peps.python.org/pep-0008/#maximum-line-length

Frostbyte0x70 · 2024-06-21T10:36:55Z

That makes no sense to me, tbh. Code is viewed on a single window 99% percent of the time, so that's what it should be optimized for. You can't tell me that it makes sense to hinder readability and inflate the line count just so we don't have to use the horizontal scrollbar in the 1% of cases where code is viewed on a split window.

Again, if you think it's a better solution, I'll just add suppression statements anywhere I consider them to be necessary, but I really don't get why we would make everything harder for ourselves just becuase an arbitrary standard says so.

theCapypara · 2024-06-21T10:47:56Z

Personally I work with multiple panes and zoomed in code all the time, I use 80 columns for that reason everywhere.
Can you give an example where the formatter would mess up the readability a lot?

Frostbyte0x70 · 2024-06-21T16:08:56Z

I'll move this conversation to a thread on the Discord server, since I don't want to flood the PR with pictures.

theCapypara · 2024-06-21T19:01:04Z

I raised the character line limit to 120.

skytemple_files/common/rw_value.py

theCapypara

Great! Really nothing to add, just overall really cool!

theCapypara reviewed May 31, 2024

View reviewed changes

pyproject.toml Outdated Show resolved Hide resolved

This was referenced May 31, 2024

Update C type regex + Refactor symbol loader SkyTemple/pmdsky-debug-py#18

Merged

Update Symbol instantiations #507

Closed

Frostbyte0x70 force-pushed the symbols branch from e96549a to e540448 Compare June 19, 2024 22:26

Frostbyte0x70 mentioned this pull request Jun 19, 2024

Add symbols editor SkyTemple/skytemple#761

Merged

Frostbyte0x70 marked this pull request as ready for review June 19, 2024 23:50

Frostbyte0x70 changed the title ~~Add symbols API (draft)~~ Add symbols API Jun 19, 2024

theCapypara self-requested a review June 20, 2024 12:58

Frostbyte0x70 added 9 commits June 22, 2024 13:01

Add symbols API (draft)

b5a0bed

Fix get_simplified_type

62c65e6

Update pmdsky-debug-py version and fx types

bf7f7f9

Minor updates and fixes

24b31f6

Refactor RWValue to take binaries as parameters

555fa16

Bug fixes

da4ab60

Support pmdsky-debug-py 10.0.16

bd55f24

More implementation changes and fixes

137a20e

Fixes

d4e1304

Frostbyte0x70 force-pushed the symbols branch from aac5fc1 to d48f75d Compare June 22, 2024 11:17

Format

b6362d2

Frostbyte0x70 force-pushed the symbols branch from d48f75d to b6362d2 Compare June 22, 2024 11:37

theCapypara requested changes Jun 22, 2024

View reviewed changes

skytemple_files/common/rw_value.py Outdated Show resolved Hide resolved

Make requested changes

bee3be8

theCapypara self-requested a review June 23, 2024 10:21

theCapypara approved these changes Jun 27, 2024

View reviewed changes

theCapypara merged commit f69bc5b into SkyTemple:master Jun 27, 2024
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add symbols API #505

Add symbols API #505

Frostbyte0x70 commented May 28, 2024 •

edited

Loading

github-actions bot commented May 28, 2024 •

edited

Loading

UsernameFodder commented May 30, 2024 •

edited

Loading

Frostbyte0x70 commented May 30, 2024

UsernameFodder commented May 31, 2024

Frostbyte0x70 commented May 31, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 1, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 1, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 2, 2024

Frostbyte0x70 commented Jun 12, 2024

theCapypara commented Jun 15, 2024 •

edited

Loading

Frostbyte0x70 commented Jun 15, 2024

Frostbyte0x70 commented Jun 19, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 20, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 20, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 21, 2024

theCapypara commented Jun 21, 2024

Frostbyte0x70 commented Jun 21, 2024

theCapypara commented Jun 21, 2024

theCapypara left a comment

Add symbols API #505

Add symbols API #505

Conversation

Frostbyte0x70 commented May 28, 2024 • edited Loading

RWValue

RWSymbol

CType

SymbolPath

BinaryDataGetter

Manual structs

Manual enums

Tests

Current TODOs

github-actions bot commented May 28, 2024 • edited Loading

Test Results

UsernameFodder commented May 30, 2024 • edited Loading

Frostbyte0x70 commented May 30, 2024

UsernameFodder commented May 31, 2024

Frostbyte0x70 commented May 31, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 1, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 1, 2024

UsernameFodder commented Jun 1, 2024

Frostbyte0x70 commented Jun 2, 2024

Frostbyte0x70 commented Jun 12, 2024

theCapypara commented Jun 15, 2024 • edited Loading

Frostbyte0x70 commented Jun 15, 2024

Frostbyte0x70 commented Jun 19, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 20, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 20, 2024

theCapypara commented Jun 20, 2024

Frostbyte0x70 commented Jun 21, 2024

theCapypara commented Jun 21, 2024

Frostbyte0x70 commented Jun 21, 2024

theCapypara commented Jun 21, 2024

theCapypara left a comment

Choose a reason for hiding this comment

Frostbyte0x70 commented May 28, 2024 •

edited

Loading

github-actions bot commented May 28, 2024 •

edited

Loading

UsernameFodder commented May 30, 2024 •

edited

Loading

theCapypara commented Jun 15, 2024 •

edited

Loading