Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add symbols API #505

Merged
merged 11 commits into from
Jun 27, 2024
Merged

Add symbols API #505

merged 11 commits into from
Jun 27, 2024

Conversation

Frostbyte0x70
Copy link
Member

@Frostbyte0x70 Frostbyte0x70 commented May 28, 2024

Adds an API that combines binary editing and the pmdsky-debug symbols. Requires SkyTemple/pmdsky-debug-py#16.

The following components have been added:

RWValue

New class under common/ that can be used to read/write a value of a given type at a given offset on a certain binary. Has one subclass for each supported data type. It mostly encapsulates the read_<type> and write_<type> methods that are currently implemented in utils.py, while adding support for new data types (such as fixed-point numbers and instruction immediates).

RWSymbol

New class under hardcoded/symbols. Used to link symbol information (normally pulled from Symbol instances) to RWValue instances. Combined, this allows reading/writing the value of a symbol with a single method call, while abstracting everything related to its type or offset.

CType

Small helper class used to parse some C Types.

SymbolPath

Allows accessing the fields and elements of complex symbol types (such as arrays and structs) using a path string.

BinaryDataGetter

API class that allows retrieving some information about the game's binaries and the symbols contained in it. It's based on an existing implementation in one of SkyTemple's modules, which directly uses reflection to get the list of symbols for the current ROM. I thought it would be nice to abstract all that.
I've only implemented the methods I'm going to need, but more could be added in the future if required.

Manual structs

I wanted to have support for structs, but that would have required parsing every C type in pmdsky-debug in order to figure out the size and fields of each possible type. That's too much work, so I instead decided to manually implement an API for the most relevant structs I could find. They are described in symbols/manual/structs.py.
The main downside is that the info could get outdated if any of those structs change in pmdsky-debug, but it's better than parsing the repo's entire C headers.

Manual enums

Similar idea, but with enums. I've included several types I considered relevant, listing each of their possible values.

Tests

I wrote tests for all the relevant classes implemented and included them in the tests/ subfolder. I didn't look too much into how tests are normally set up because it seemed quite complicated. I didn't need a real ROM to run them either.
So I don't know if the implementation needs changes. If so, let me know.
The tests will fail until the pmdsky-debug-py PR mentioned above is merged. They all pass in my dev environment.

Current TODOs

  • Update pmdsky-debug to include the new types (many of them are simply represented as integers of various sizes)
  • Wait for next pmdsky-debug-py version, then bump the version requirement
  • Make sure the tests are correctly structured, that they are also run during CI and that they all pass

I'll mark the PR as a draft for now. Suggestions are welcome.

Copy link

github-actions bot commented May 28, 2024

Test Results

    12 files  ±    0      12 suites  ±0   43m 47s ⏱️ + 2m 8s
 1 490 tests +  135   1 489 ✅ +  135      1 💤 ±0  0 ❌ ±0 
11 892 runs  +1 080  10 324 ✅ +1 080  1 568 💤 ±0  0 ❌ ±0 

Results for commit bee3be8. ± Comparison against base commit 49d8f8e.

♻️ This comment has been updated with latest results.

@UsernameFodder
Copy link
Contributor

UsernameFodder commented May 30, 2024

I haven't looked at this closely enough to properly understand the techniques being used here to translate from the symbol info to the high-level API, but I'm concerned that using regexes to interpret the C types could be pretty fragile. The pmdsky-debug tooling does this to some extent, but at least in that case those assumptions are all tightly coupled to the headers themselves since they live together in the same repo. Making the same sort of tacit assumptions outside of the original repo makes me a bit worried.

C is pretty flexible in the way things can be declared. Do these regexes break down if there's some slight variation on the "normal" syntax/style used in the pmdsky-debug headers currently? For example:

// typedef'd struct
typedef struct {
    int x;
} typedef_struct;

struct bare_struct {
    typedef_struct y;
};

// keywords in weird places
char const *const const_string;
const char *volatile *volatile volatile_string;

// parens in weird places
int *(array_of_ptrs[8]);
int (*ptr_to_array)[8];

// un-typedef'd function pointer types
void (*function_ptr)(int);
void (*signal(int, void (*fp)(int)))(int);

// macro magic, like the existing ENUM_8_BIT and ENUM_16_BIT macros
#define MAKE_PAIR(a, b)      \
    struct pair_##a##_##b {  \
        a first;             \
        b second;            \
    };
MAKE_PAIR(int, char);
struct pair_int_char pair;

Not that all of these constructs are particularly likely, but they are still possible.

Have you considered using some proper AST-based approach? Something like pycparser or Clang AST? Clang AST in particular is something I've thought about experimenting with before, but I've never gotten around to it.

@Frostbyte0x70
Copy link
Member Author

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax?
If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated.
And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

@UsernameFodder
Copy link
Contributor

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax? If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated. And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

Well if we're only best-effort parsing headers/data declarations, and we're failing gracefully, I suppose that's okay.

I don't expect the extremely weird constructs, but you never know. The more tame ones could conceivably slip through? I remember somebody at some point tried to use the typedef struct style. And the un-typedef'd function pointer is certainly something I can imagine happening from sloppiness. One other thing that I could actually see slipping through is stuff related to whitespace and comments, for example:

// Declaration that's commented out
// extern int foo;

extern int bar /* inline comment */ ;

extern struct baz
    var; // long comment that causes the formatter to wrap the line halfway through the declaration

So maybe it's worth making sure the regexes are resilient to that sort of thing, at least.

pyproject.toml Outdated Show resolved Hide resolved
@Frostbyte0x70
Copy link
Member Author

You're right, it won't work with weirder declarations. However, right now the only headers that are parsed are those under the headers/data folder, it doesn't parse types. Can you forsee a situation where those symbols could have different syntax? If not, I think the current approach should be good enugh. Like I said, I'd like to avoid getting into the territory of parsing the entire C headers, which would probably be a lot more complicated. And even if there's some particular cases where data symbols have their type declared with strange syntax, all that would happen is that those symbols would be unsupported, all the other ones would still work. Which I think it's good enough, I'd like this to support as many symbols as possible, but there's no need to cover them all, especially if it requires an overcomplicated setup.

Well if we're only best-effort parsing headers/data declarations, and we're failing gracefully, I suppose that's okay.

I don't expect the extremely weird constructs, but you never know. The more tame ones could conceivably slip through? I remember somebody at some point tried to use the typedef struct style. And the un-typedef'd function pointer is certainly something I can imagine happening from sloppiness. One other thing that I could actually see slipping through is stuff related to whitespace and comments, for example:

// Declaration that's commented out
// extern int foo;

extern int bar /* inline comment */ ;

extern struct baz
    var; // long comment that causes the formatter to wrap the line halfway through the declaration

So maybe it's worth making sure the regexes are resilient to that sort of thing, at least.

Would the typedef struct style affect files under data/? If the only thing that changes is the declaration, then it shouldn't matter.
Comments are a reasonable concern. I'll see what I can do about those. Although I don't see why someone would add an inline comment before the semicolon.

@UsernameFodder
Copy link
Contributor

Would the typedef struct style affect files under data/? If the only thing that changes is the declaration, then it shouldn't matter.

It shouldn't be a big deal, but I'd probably just make sure the parsing code here doesn't rely on the struct keyword always being in front of a struct tag. I suppose with the fixed-point typedefs you introduced, it probably already doesn't.

Although I don't see why someone would add an inline comment before the semicolon.

Yeah it is rather weird, I probably wouldn't willingly allow it in code reviews. It's probably fine not to worry about it. I brought it up because there are C projects that put comments in the middle of code like this for various reasons. Some people like to use it to comment out stuff. Though I think probably it's more common in function prototypes for commenting individual parameters within the same line.

@Frostbyte0x70
Copy link
Member Author

It shouldn't be a big deal, but I'd probably just make sure the parsing code here doesn't rely on the struct keyword always being in front of a struct tag. I suppose with the fixed-point typedefs you introduced, it probably already doesn't.

Isn't the tag required when declaring a variable of that type? I'm not that familiar with C.

@UsernameFodder
Copy link
Contributor

Isn't the tag required when declaring a variable of that type? I'm not that familiar with C.

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

@Frostbyte0x70
Copy link
Member Author

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

So C supports the same syntax as more recent languages where struct and enum types can be referred to using only the tag? Why is struct or enum always added every time it's referenced in pmdsky-debug then?

@UsernameFodder
Copy link
Contributor

"tag" refers to the "name" part of the struct. For struct foo, "foo" is the "tag". What I meant was that nothing should break horribly if there was something like MyStruct foo; as opposed to struct MyStruct foo;.

So C supports the same syntax as more recent languages where struct and enum types can be referred to using only the tag? Why is struct or enum always added every time it's referenced in pmdsky-debug then?

I wouldn't say it "supports" it. You can only do it if you define the struct with typedefs, i.e.

typedef struct {...} MyStruct;

It's a stylistic preference. Some projects, like the Linux kernel, prefer to write the struct and enum keywords explicitly for various reasons. In pmdsky-debug, I chose the style because I think the explicitness adds clarity around the low-level thing being represented.

@Frostbyte0x70
Copy link
Member Author

Oh, I see.
Well, without the keyword there's no way to know if the type is a struct or an enum without parsing the entire headers, which I'm not going to do, as stated above. So I think it's reasonable to support only structs and enums that use the prefix.

@Frostbyte0x70
Copy link
Member Author

After over two hours of pain, I have realized that you can't just keep a copy of the bytearray object that represents a binary, since it might go out of sync with the internal version kept by SkyTemple. I stumbled upon what I believe is the proper way of modifying binaries (RomProject.modify_binary()).
This means the RWValue class will have to be rewritten to take the bytes or bytearray object to read/write from/to on its read()/write() method.

I really wish SkyTemple had better documentation. It is quite frustrating to realize you've been doing everything wrong because there were no hints as to what the proper way of implementing things is.

@theCapypara
Copy link
Member

theCapypara commented Jun 15, 2024

@End45 I merged #506 so tests should now run through again when you update your branch.

I'll do a review once all the checks run through, but in general I think this is great so I don't expect to have many change requests!

@Frostbyte0x70
Copy link
Member Author

@End45 I merged #506 so tests should now run through again when you update your branch.

I'll do a review once all the checks run through, but in general I think this is great so I don't expect to have many change requests!

Alright! I still expect to add a few more commits here though. Likely just minor fixes.

@Frostbyte0x70
Copy link
Member Author

Alright, now that the UI part is implemented, I think the backend is ready!

The only thing that is missing would be formatting. However, I have some issues with the suggested changes there, especially with all the unnecessary line breaks the formatter wants to add. Some sections become borderline unreadable afterwards. I think the formatter settings would need to be reviewed.

@Frostbyte0x70 Frostbyte0x70 marked this pull request as ready for review June 19, 2024 23:50
@Frostbyte0x70 Frostbyte0x70 changed the title Add symbols API (draft) Add symbols API Jun 19, 2024
@theCapypara
Copy link
Member

@End45 The formatter has next to no settings by design. You can disable the formatter for some lines:
https://docs.astral.sh/ruff/formatter/#format-suppression

@theCapypara theCapypara self-requested a review June 20, 2024 12:58
@Frostbyte0x70
Copy link
Member Author

@End45 The formatter has next to no settings by design. You can disable the formatter for some lines: https://docs.astral.sh/ruff/formatter/#format-suppression

Looking at the docs, I think one of the reasons why so many line breaks are introduced is because the default max line length is 88 characters. Maybe it could be bumped to 120. I just tested it and that gets rid of most of the problematic lines. The rest are either not too bad after reformatting or warrant using the suppression tag.

@theCapypara
Copy link
Member

No, we will keep the defaults.

@Frostbyte0x70
Copy link
Member Author

Why though? 120 is the default I see the most nowadays, and the current setting really makes a lot of stuff almost unreadable. I could just litter the code with suppression tags, but I think it would be an even worse solution.

@theCapypara
Copy link
Member

79 is the maximum required by PEP 8.
https://peps.python.org/pep-0008/#maximum-line-length

@Frostbyte0x70
Copy link
Member Author

That makes no sense to me, tbh. Code is viewed on a single window 99% percent of the time, so that's what it should be optimized for. You can't tell me that it makes sense to hinder readability and inflate the line count just so we don't have to use the horizontal scrollbar in the 1% of cases where code is viewed on a split window.

Again, if you think it's a better solution, I'll just add suppression statements anywhere I consider them to be necessary, but I really don't get why we would make everything harder for ourselves just becuase an arbitrary standard says so.

@theCapypara
Copy link
Member

Personally I work with multiple panes and zoomed in code all the time, I use 80 columns for that reason everywhere.
Can you give an example where the formatter would mess up the readability a lot?

@Frostbyte0x70
Copy link
Member Author

I'll move this conversation to a thread on the Discord server, since I don't want to flood the PR with pictures.

@theCapypara
Copy link
Member

I raised the character line limit to 120.

skytemple_files/common/rw_value.py Outdated Show resolved Hide resolved
@theCapypara theCapypara self-requested a review June 23, 2024 10:21
Copy link
Member

@theCapypara theCapypara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Really nothing to add, just overall really cool!

@theCapypara theCapypara merged commit f69bc5b into SkyTemple:master Jun 27, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants