Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguate obfuscated/non-obfuscated names #718

Open
justhecuke opened this issue Aug 30, 2022 · 12 comments
Open

Disambiguate obfuscated/non-obfuscated names #718

justhecuke opened this issue Aug 30, 2022 · 12 comments

Comments

@justhecuke
Copy link

My team and I have been developing passes for use internally but we commonly hit an issue where we interact with deobfuscated names (perhaps from a hardcoded list, usage.txt from proguard, or other sources) but then use them with names retrieved from DexClass or DexType reachable from scope which are generally obfuscated. This has caused many problems.

Is this also an issue for others? And does anyone have ideas for a fix?

My idea is to create two different types for obfuscated and deobfuscated strings and have all names derived from the APK be instances of obfuscated strings and where APIs that read deobfuscated names (ones that developers write for input names) use the deobfuscated type.

I'm willing to do the work to refactor, but this would be a huge change and I wanted to get feedback before starting work on what may be a doomed project.

@NTillmann
Copy link
Contributor

I don't fully understand the problem... I'd rather want to avoid another parallel obfuscated name infrastructure.

Are you passing a rename map into Redex, e.g. from a previous Proguard run?

A few thoughts / suggestions:

  • "deobfuscated names" in Redex should always reflect the original name of an entity.
  • Run obfuscation / renaming passes towards the very end of your pass list.
  • Every pass can override an "eval_pass" method. All "eval_pass" methods run before any "run_pass". This would be a good place to look up some named entities (DexClass, DexMethod, etc.), and store them in some pass state. Note that this is only a good idea if those entities have keep rules, so that the pointers to them won't get released during optimizations such as RemoveUnreachablePass.

Does that help?

@justhecuke
Copy link
Author

We run proguard obfuscation before redex since some passes require a proguard mapping file in order to run properly. I'm not sure why this is the case but it is (or was when we first adopted redex). This means we have a bunch of obfuscated classes in the mix as well as the rename pass to further shrink the renaming although this runs second to last right before RegAllocPass.

This leads to an error-prone situation where some pass developers forget that they are working with obfuscated/deobfuscated strings and try to do comparisons between them which always fails leading to problems. We've hit this several times already and were trying to come up with a fix to make the situation less error-prone.

While using eval_pass may fix the immediate issue of mixing obfuscated and deobfuscated strings, it requires that the developer realize they are doing the wrong thing which is the actual problem we are having. Once we realize we've made this mistake again, the fix is usually relatively simple -- DexClass allows you to get its deobfuscated name so it's mostly a matter of creating a deobfuscated name -> obfuscated class/type mapping to normalize on using obfuscated names.

Every developer we have working on redex passes has made this mistake at least once.

@thezhangwei
Copy link
Contributor

It sounds like in your situation the deobfuscated_name is already "obfuscated" by Proguard, is that right?

At one point, we were in the same spot. Maybe @ssj933 has more context on the Proguard deprecation history.

But I wonder how often does your optimization logic rely on the names of a type/class. That's not very common for us. In most cases, we rely on the name of some key classes that are the root of a hierarchy. Most of them are not renamed either because they are external (part of JDK), or marked as "donotrename" in the dependency (Kotlin runtime).

My idea is to create two different types for obfuscated and deobfuscated strings and have all names derived from the APK be instances of obfuscated strings

Do you plan to add an option to plugin the Proguard rename map into the deobfuscated_names in Redex?

@justhecuke
Copy link
Author

justhecuke commented Aug 30, 2022 via email

@thezhangwei
Copy link
Contributor

It's just the ease with which new
developers can forget to translate between de obfuscated and obfuscated.

Yeah we have to live with that. It hasn't been a major concern tho. Can I ask why is get_deobfuscated_name not enough for you? It should always give you the deobfuscated_name, and that's what you want right?

@NTillmann
Copy link
Contributor

NTillmann commented Aug 30, 2022

We run proguard obfuscation before redex since some passes require a proguard mapping file in order to run properly. I'm not sure why this is the case but it is (or was when we first adopted redex).

Many passes require that proguard keep rules have been communicated to Redex via some ProGuard config files (and even that is just a superficial sanity check), but I am not aware of any relevant passes that would require an actual pro guard (class-name) mapping.

If at all possible, I suggest that you stop running Proguard to obfuscate classes first. That's the direction we've been taking.

@justhecuke
Copy link
Author

It's just the ease with which new
developers can forget to translate between de obfuscated and obfuscated.

Yeah we have to live with that. It hasn't been a major concern tho. Can I ask why is get_deobfuscated_name not enough for you? It should always give you the deobfuscated_name, and that's what you want right?

It's what we want, but we also want it to be obvious that we need to use get_deobfuscated_name. The issue is that our devs sometimes forget to translate to/from deobfuscated strings and things end up not working right and it takes a while to track down what went wrong.

If we had this typing difference, we'd know at compile-time what we made a mistake.

@thezhangwei
Copy link
Contributor

thezhangwei commented Aug 31, 2022

If we had this typing difference, we'd know at compile-time what we made a mistake.

IC. Thanks for explaining that. It's more clear to me what this is about now.

Sounds like you are proposing adding subtypes to DexString to encode obfuscation status into types? Or something similar? Personally I'm open to a change like that as long as it does not require a sweeping update for all the existing name handling logic in the code base.

@justhecuke
Copy link
Author

justhecuke commented Aug 31, 2022 via email

@thezhangwei
Copy link
Contributor

SG. I will be open to see what the change looks like.

But I guess there's an underlying question about the differences in how we set things up. Like in your case, there's more interaction between the preceding Proguard and Redex. But we have deprecated Proguard a few years back. So the vast majority of the symbol obfuscation is done by Redex (towards the end of the optimization pipeline). That's maybe a worthwhile discussion by itself.

@justhecuke
Copy link
Author

justhecuke commented Sep 1, 2022 via email

@NTillmann
Copy link
Contributor

Obfuscated entities in dex files may include types, field names, method names, method protos with their contained result types and (argument) type lists. Do you want subclasses for DexType, DexField, etc.? For DexField and DexMethod we already have a special thing going where we play games with the -Ref subtypes. I wonder how the different subtypes would affect existing passes.

For example, strings, there's a rule that there is only one DexString* for each unique string payload. Would your subtype get a different pointer address for identical strings? This would create problems, semantic and optimization-quality-wise.

In general, the Dex* concepts mirror quite closely concepts that are found in dex files, including uniqueness properties.

Regarding DexStrings in particular, we've tried to make sure they are very lightweight. On an 64-bit platform, each DexString takes up 16 bytes, and doesn't have a vtable. I'd like to keep that.

Anyway, I am not sure what you exactly you have in mind. Some examples for potential API changes would help :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants