SLEIGH: assembly template not following the usual mnemonic + operands #315

Closed
NeatMonster opened this issue Apr 2, 2019 · 10 comments
Labels
Type: Question Further information is requested

Comments

@NeatMonster

Hello,

I'm trying to add support for Qualcomm's Hexagon V5x architecture to Ghidra.

However, I am facing one non-blocking and one blocking issue:

  1. Hexagon is a VLIW architecture. It uses instruction packets that contain up to 4 instructions. This is usually represented in assembly using the prefix { and the suffix }, like so:
Disassembly of section .text:
main:
       0:	80 c1 01 b0 b001c180 { 	r0=add(r1,#12) } 
       4:	80 41 01 b0 b0014180 { 	r0=add(r1,#12)
       8:	82 c1 03 b0 b003c182   	r2=add(r3,#12) } 
       c:	80 41 01 b0 b0014180 { 	r0=add(r1,#12)
      10:	82 41 03 b0 b0034182   	r2=add(r3,#12)
      14:	84 c1 05 b0 b005c184   	r4=add(r5,#12) }

Because Ghidra strips whitespace from the beginning of the display section, I end up with:

        00010000 80 c1 01 b0     { r0=add(r1,#0xc) }
        00010004 80 41 01 b0     { r0=add(r1,#0xc)
        00010008 82 c1 03 b0     r2=add(r3,#0xc) }
        0001000c 80 41 01 b0     { r0=add(r1,#0xc)
        00010010 82 41 03 b0     r2=add(r3,#0xc)
        00010014 84 c1 05 b0     r4=add(r5,#0xc) }

I've been using . as a second space character, but while it works it isn't pretty.

  2. Because "the first string of characters in the display section [...] is treated as the literal mnemonic of the instruction", I had to prefix the entries of the instruction table with ^. As a result, the whole line is now treated as part of the mnemonic, and the operands are not recognised.

Screenshot 2019-04-02 at 21 32 29

Screenshot 2019-04-02 at 21 34 38

Is there any solution to these issues? If not, I'm guessing the Sleigh language has to be modified.


For reference, here is my current code:

define register offset=0x1000 size=4 contextreg;
define context contextreg
  phase=(0,0) noflow
  slot=(1,2)
  next_slot=(3,4) 
;

define token instr(32)
    iclass=(28,31)
    si27_21=(21,27) signed
    s5=(16,20)
    parse=(14,15)
    si13_5=(5,13)
    d5=(0,4)
;

attach variables [ s5 d5 ] [
    r0  r1  r2  r3  r4  r5  r6  r7
    r8  r9  r10 r11 r12 r13 r14 r15
    r16 r17 r18 r19 r20 r21 r22 r23
    r24 r25 r26 r27 r28 sp  fp  lr
];

calc_slot: is (parse=0b01 | parse=0b10) [ slot=next_slot; next_slot=next_slot+1; globalset(inst_next,next_slot); ] {}
calc_slot: is (parse=0b00 | parse=0b11) [ slot=next_slot; next_slot=0; globalset(inst_next,next_slot); ] {}

prefix:"{ " is slot=0 {}
prefix:". " is slot!=0 {}

suffix:" }" is next_slot=0 {}
suffix:" ." is next_slot!=0 {}

:^prefix^instruction^suffix is phase=0 & calc_slot & prefix & suffix & instruction [ phase=1; ] {}


with: phase=1 {
    #
    # Instructions
    #
    
    # si13_5 is 9 bits wide (bits 5..13), so the upper field is shifted by 9.
    :^d5^"=add("^s5^",#"^s16^")" is d5 & s5 & si27_21 & si13_5 [ s16 = (si27_21 << 9) | si13_5; ] {}
}
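
To make the token fields and the slot logic above easier to follow, here is a minimal standalone Java sketch of the same decoding. It is not Ghidra code: the class and method names are hypothetical, it assumes every word is an Rd=add(Rs,#s16) instruction, and it assumes (as the calc_slot constructors do) that parse values 0b01 and 0b10 mean the packet continues while 0b00 and 0b11 end it. It indents continuation lines with two spaces, which is the alignment the listing above loses.

```java
import java.util.ArrayList;
import java.util.List;

public class HexagonPacketSketch {
    // Register names in the same order as the `attach variables` list.
    static String reg(int n) {
        switch (n) {
            case 29: return "sp";
            case 30: return "fp";
            case 31: return "lr";
            default: return "r" + n;
        }
    }

    // Extract bits lo..hi (inclusive) of a 32-bit word, like a token field.
    static int bits(int word, int lo, int hi) {
        return (word >>> lo) & ((1 << (hi - lo + 1)) - 1);
    }

    // Decode Rd=add(Rs,#s16): the immediate is si27_21 (7 bits) followed by si13_5 (9 bits).
    static String decodeAdd(int word) {
        int d5 = bits(word, 0, 4);
        int s5 = bits(word, 16, 20);
        int raw = (bits(word, 21, 27) << 9) | bits(word, 5, 13);
        int s16 = (short) raw; // sign-extend the 16-bit value
        String imm = (s16 < 0 ? "-0x" : "0x") + Integer.toHexString(Math.abs(s16));
        return reg(d5) + "=add(" + reg(s5) + ",#" + imm + ")";
    }

    // Assumption taken from calc_slot: parse 0b00 or 0b11 marks the last word of a packet.
    static boolean endsPacket(int word) {
        int parse = bits(word, 14, 15);
        return parse == 0b00 || parse == 0b11;
    }

    // Render one line per word, opening a brace on the first slot and closing it on the last.
    static List<String> render(int[] words) {
        List<String> lines = new ArrayList<>();
        boolean first = true;
        for (int w : words) {
            boolean last = endsPacket(w);
            lines.add((first ? "{ " : "  ") + decodeAdd(w) + (last ? " }" : ""));
            first = last; // the word after a packet end starts a new packet
        }
        return lines;
    }

    public static void main(String[] args) {
        // The six little-endian words from the objdump output above.
        int[] words = {0xb001c180, 0xb0014180, 0xb003c182, 0xb0014180, 0xb0034182, 0xb005c184};
        render(words).forEach(System.out::println);
    }
}
```
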
@NeatMonster NeatMonster added the Type: Question Further information is requested label Apr 2, 2019
@nihilus

nihilus commented Apr 3, 2019

Neat, I would love to use that module when it is finished for baseband REing.

@NeatMonster
Author

I dug a little bit into Ghidra's source code and I am now certain that fixing the second issue requires major changes to the current architecture. Let me present my findings if anyone is interested:

  • I started by looking at the .sla file generated by the Sleigh compiler and searched for the add constructor. I quickly noticed that the first attribute was set to 6:
<constructor parent="0x0" first="6" length="4" line="41">
    <oper id="0x5f"/>
    <oper id="0x60"/>
    <oper id="0x62"/>
    <oper id="0x63"/>
    <oper id="0x61"/>
    <opprint id="0"/>
    <print piece="=add("/>
    <opprint id="1"/>
    <print piece=",#"/>
    <opprint id="4"/>
    <print piece=")"/>
    <construct_tpl>
        <null/>
    </construct_tpl>
</constructor>
  • By looking at the source code of the compiler, I found out how this attribute was calculated:
package ghidra.pcodeCPort.slghsymbol;

public class Constructor {
    private int firstwhitespace; // Index of first whitespace piece in -printpiece-

    public void addSyntax(String syn) {
        // [...]
        if (firstwhitespace == -1 && " ".equals(syn)) {
            firstwhitespace = printpiece.size();
        }
        // [...]
        printpiece.push_back(syn);
    }

    public void saveXml(PrintStream s) {
        s.append("<constructor");
        // [...]
        s.append(" first=\"");
        s.print(firstwhitespace);
        // [...]
        s.append("</constructor>\n");
    }
}
  • So far, so good. addSyntax() is called for each of the literals. We could replace firstwhitespace by firstmnemonic / nextmnemonic to store the range containing the mnemonic. This range could be defined as the first sequence of literals matching some kind of regular expression.

  • I looked for functions using the firstwhitespace attribute, and found two of them:

package ghidra.pcodeCPort.slghsymbol;

public class Constructor {
    public void printMnemonic(PrintStream s, ParserWalker pos) {
        // [...]
        int endind = (firstwhitespace == -1) ? printpiece.size() : firstwhitespace;
        for (int i = 0; i < endind; ++i) {
            if (printpiece.get(i).charAt(0) == '\n') {
                int index = printpiece.get(i).charAt(1) - 'A';
                operands.get(index).print(s, pos);
            }
            else {
                s.append(printpiece.get(i));
            }
        }
    }

    public void printBody(PrintStream s, ParserWalker pos) {
        // [...]
        for (int i = firstwhitespace + 1; i < printpiece.size(); ++i) {
            if (printpiece.get(i).charAt(0) == '\n') {
                int index = printpiece.get(i).charAt(1) - 'A';
                operands.get(index).print(s, pos);
            }
            else {
                s.append(printpiece.get(i));
            }
        }
    }
}
  • It is evident that the mnemonic is made of the pieces in the range [0, firstwhitespace) and the body of the pieces in [firstwhitespace + 1, printpiece.size()). This is problematic because, in our special assembly, most of the time the mnemonic is not at the beginning of the line.
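
The split described above can be modelled in a few lines of standalone Java. This is a simplified sketch, not the Ghidra class: operands are already substituted as plain strings, the class name is hypothetical, and returning an empty body when there is no whitespace piece is an assumption about the elided code.

```java
import java.util.ArrayList;
import java.util.List;

public class PrintPieceSketch {
    private final List<String> printpiece = new ArrayList<>();
    private int firstwhitespace = -1; // index of the first " " piece, or -1 if none

    // Mirrors addSyntax(): remember where the first whitespace piece lands.
    public PrintPieceSketch add(String piece) {
        if (firstwhitespace == -1 && " ".equals(piece)) {
            firstwhitespace = printpiece.size();
        }
        printpiece.add(piece);
        return this;
    }

    private String join(int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            sb.append(printpiece.get(i));
        }
        return sb.toString();
    }

    // Everything before the first whitespace piece (or everything, if there is none).
    public String mnemonic() {
        int end = (firstwhitespace == -1) ? printpiece.size() : firstwhitespace;
        return join(0, end);
    }

    // Everything after the first whitespace piece.
    public String body() {
        if (firstwhitespace == -1) {
            return ""; // assumption: no whitespace piece means no body
        }
        return join(firstwhitespace + 1, printpiece.size());
    }

    public static void main(String[] args) {
        PrintPieceSketch usual = new PrintPieceSketch()
            .add("add").add(" ").add("r0").add(",").add("r1");
        System.out.println("[" + usual.mnemonic() + "] [" + usual.body() + "]");

        // The Hexagon template has no whitespace piece, so it is all mnemonic.
        PrintPieceSketch hexagon = new PrintPieceSketch()
            .add("r0").add("=add(").add("r1").add(",#").add("0xc").add(")");
        System.out.println("[" + hexagon.mnemonic() + "] [" + hexagon.body() + "]");
    }
}
```

With a conventional template the mnemonic is "add" and the body is "r0,r1"; with the Hexagon template the whole line ends up in the mnemonic and the body is empty.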

  • printMnemonic() could easily be modified, and it is only called by the following function:

package ghidra.app.plugin.processors.sleigh;

public class SleighInstructionPrototype implements InstructionPrototype {
    @Override
    public String getMnemonic(InstructionContext context) {
        SleighParserContext protoContext = (SleighParserContext) context.getParserContext();
        ParserWalker walker = new ParserWalker(protoContext);
        walker.baseState();
        return walker.getConstructor().printMnemonic(walker);
    }
}
  • printBody() could be split into two functions, printBodyStart() and printBodyEnd(). However, printBody() in this class is never called. Looking into it a little more, I discovered another class also named Constructor that loads the XML element saved by the previous class.

  • I'm guessing the first class is used during the parsing of the slaspec file and writing of the sla file, and that the second class is used to read the sla file and make use of it (parse a binary). In any case, this class also implements similar methods, plus another one:

package ghidra.app.plugin.processors.sleigh;

public class Constructor implements Comparable<Constructor> {
    public String printSeparator(int separatorIndex) {
        // Separator is all chars to the left of the corresponding operand
        // The mnemonic (first sequence of contiguous non-space print-pieces)
        // is ignored when identifying the first separator (index 0) and the 
        // operand which immediately follows.
        // NOTE: sleigh "operands" may appear as part of mnemonic so the 
        // separator cache may be slightly over-allocated.
        // [...]
    }

    public String printMnemonic(ParserWalker walker) throws MemoryAccessException {
        // [...]
    }

    public String printBody(ParserWalker walker) throws MemoryAccessException {
        // [...]
    }
}
  • Looking for usages of the printSeparator() function, I found the same class yet again:
package ghidra.app.plugin.processors.sleigh;

public class SleighInstructionPrototype implements InstructionPrototype {
    @Override
    public String getSeparator(int opIndex, InstructionContext context) {
        Constructor ct = mnemonicState.getConstructor();
        return ct.printSeparator(opIndex);
    }
}
  • And looking for usages of the getSeparator() function yielded many results:
package ghidra.program.model.listing;

public class CodeUnitFormat {
    public String getRepresentationString(CodeUnit cu, boolean includeEOLcomment) {
        // [...]
        StringBuffer stringBuffer = new StringBuffer(getMnemonicRepresentation(cu));
        Instruction instr = (Instruction) cu;
        int n = instr.getNumOperands();
        for (int i = 0; i < n; i++) {
            if (i == 0) {
                stringBuffer.append(" ");
            }
            else {
                String separator = instr.getSeparator(i);
                if (separator != null && separator.length() != 0) {
                    stringBuffer.append(separator);
                }
            }
            stringBuffer.append(getOperandRepresentationString(cu, i));
        }
        return stringBuffer.toString();
    }
}
  • ghidra/program/database/code/InstructionDB.java
  • ghidra/app/util/viewer/field/OperandFieldHelper.java
  • ghidra/app/plugin/core/searchtext/databasesearcher/InstructionMOFS.java
  • and others...
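
To show why this layout is so rigid, here is a standalone Java sketch of the loop quoted from getRepresentationString(). It is a model under simplifying assumptions, not the Ghidra implementation: the class name is hypothetical, and the mnemonic, operands and separators are passed in as plain strings.

```java
import java.util.Arrays;
import java.util.List;

public class RepresentationSketch {
    // Builds "<mnemonic> <op0><sep1><op1>...", the only shape the quoted loop can produce.
    static String represent(String mnemonic, List<String> operands, List<String> separators) {
        StringBuilder sb = new StringBuilder(mnemonic);
        for (int i = 0; i < operands.size(); i++) {
            if (i == 0) {
                sb.append(" "); // the first operand always follows a single space
            } else {
                String sep = separators.get(i);
                if (sep != null && !sep.isEmpty()) {
                    sb.append(sep);
                }
            }
            sb.append(operands.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(represent(
            "add",
            Arrays.asList("r0", "r1", "0xc"),
            Arrays.asList(null, ",", ",#")));
    }
}
```

In this model nothing can be emitted before the mnemonic or after the last operand, which is why a template such as r0=add(r1,#0xc) has no natural mnemonic/operand split.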

So now I am left wondering if I should make any modifications at all. I definitely could use some advice from the main developers on whether or not it is a good idea to implement those changes, and whether they would be willing to merge these hypothetical changes in the next release, or to make them themselves.

@nihilus

nihilus commented Apr 3, 2019

Nice work there!

I think it would be nice to make the changes necessary or at least provide a patch for them.

How used you are to a syntax matters a lot for how fast you can get a grip on what's going on in assembler. Not in all cases, but in the common case; I've encountered this all over with mediocre REers (you should know who you are).

@NeatMonster
Author

I've been giving it some more thought. Here's what I'm currently thinking:

Rd=add(Rs,Rt):sat
Rx.H=#u16
if (!Pu.new) Rd=add(Rs,#s8)
loop0(#r7:2,#U10)
if (Pu.new) jumpr:t Rs
p0=cmp.eq(Rs,#-1); if (!p0.new) jump:nt #r9:2
Rd=Rs; jump #r9:2
Rdd=memd(Rx++#s4:3:circ(Mu))
Rd=add(Rt.H,Rs.L):sat:<<16
Rx&=and(Rs,~Rt)
Rxx,Pe=vacsh(Rss,Rtt)
Rxx+=vmpyweh(Rss,Rtt):<<1:rnd:sat

Can you say, for each of these instructions, which part is the mnemonic? I know I can't.

Because Sleigh only supports templates in a simple shape (<mnemonic> <op>[<sep><op>]), it is not going to work for us. Modifying it to support our custom templates would require heavy changes. The only solution that I can think of is to use the intrinsic-like representation of the instructions.

For example, Rd=add(Rs,Rt):sat would become Q6_R_add_RR_sat(Rd, Rs, Rt). You'll notice that the operands are not the same as those of the real C intrinsic function, but we don't have return values.

This would make the Instruction Info... window and other internals of Ghidra work properly. But as users, we don't want to see this representation in the Listing window. What can we do then?

My current idea is to programmatically override the class representing a disassembled instruction so that it returns different mnemonic / operands values when used for the listing display.
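
As a rough illustration of that idea, the following standalone Java sketch maps an intrinsic-like mnemonic plus its operands back to the native syntax for display. Everything here is hypothetical: the class name, the %N placeholder convention and the template table are assumptions made for the example, and nothing in it is Ghidra API.

```java
import java.util.HashMap;
import java.util.Map;

public class TemplateFieldSketch {
    // Hypothetical table: intrinsic-like mnemonic -> display template with %N placeholders.
    static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        TEMPLATES.put("Q6_R_add_RR_sat", "%0=add(%1,%2):sat");
    }

    static String render(String mnemonic, String... operands) {
        String template = TEMPLATES.get(mnemonic);
        if (template == null) {
            // No template known: fall back to the usual "<mnemonic> <operands>" layout.
            return operands.length == 0 ? mnemonic : mnemonic + " " + String.join(",", operands);
        }
        String out = template;
        // Substitute from the highest index down so "%1" never clobbers part of "%10".
        for (int i = operands.length - 1; i >= 0; i--) {
            out = out.replace("%" + i, operands[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(render("Q6_R_add_RR_sat", "r0", "r1", "r2"));
        System.out.println(render("db", "00h"));
    }
}
```

The internal form keeps the regular mnemonic + operands shape, and only the listing display goes through the template.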

I don't know yet at which level this should be performed:

  • InstructionDB?
  • SleighInstructionPrototype?
  • Constructor?

I'm also unsure how to do it from a processor module: using Java reflection maybe?

Then we could either have the mnemonic field display an empty string and move everything into the operands field, or we could create a new special field Template to that effect.

This still leaves the issue of the { prefix and } / }:endloopN suffixes, but maybe while we're at it, we could define two more fields Prefix and Suffix. Everything would be properly aligned this way.


@nsadeveloper789 @emteere @ryanmkurtz @ghidra1 @d-millar @saruman9 @dragonmacher @dev747368 and others, could you offer some pointers on how to proceed? Many thanks!

@NeatMonster
Author

NeatMonster commented Apr 5, 2019

Looking at the file that defines the properties of a processor specification, I've found:

package ghidra.program.model.lang;

public final class GhidraLanguagePropertyKeys {
    /**
     * CUSTOM_DISASSEMBLER_CLASS is a full class name for a language-specific
     * disassembler implementation.  The specified class must extend the generic 
     * disassembler {@link Disassembler} implementation and must implement the same
     * set of constructors.
     */
    public static final String CUSTOM_DISASSEMBLER_CLASS = "customDisassemblerClass";

    /**
     * PARALLEL_INSTRUCTION_HELPER_CLASS is a full class name for an implementation
     * of the ParallelInstructionLanguageHelper.  Those languages which support parallel
     * instruction execution may implement this helper class to facilitate display of
     * a || indicator within a listing view.
     */
    public static final String PARALLEL_INSTRUCTION_HELPER_CLASS = "parallelInstructionHelperClass";
}

It looks like the first one can be useful to override the rendering of the instructions.
And the second one to add the { prefix (still not perfect because there is } too).


Update: I've added a parallelInstructionHelperClass property to the processor specification and created a new class in ghidra.app.util.viewer.field to allow for parallel suffixes, and voilà:

Screenshot 2019-04-05 at 18 49 56

Now I still need to override the instruction template, but I'm more confident than I was this morning. 😄

@NeatMonster
Author

I have kinda been able to do what I wanted by defining 3 custom fields: Hexagon Prefix / Hexagon Suffix to display the { / } characters, and Hexagon Template to display the correct assembly syntax.

Screenshot 2019-04-10 at 13 31 55

I'm not a big fan of my solution because the user is forced to add these 3 custom fields and, worse, to remove the existing Mnemonic / Operands fields. These fields are also used to display data, e.g. Mnemonic might be db and Operands might be 00h. So as a workaround, I have made the Hexagon Template field act as if it were the Mnemonic and Operands separated by one space character.


@nsadeveloper789 @emteere @ryanmkurtz @ghidra1 @d-millar @saruman9 @dragonmacher @dev747368 and others, if anyone is reading this issue, could you offer some feedback?

@NeatMonster NeatMonster changed the title SLEIGH: display issues with unusual architecture SLEIGH: assembly template not following the usual mnemonic + operands Apr 10, 2019
@emteere
Contributor

emteere commented Apr 16, 2019

I took a look at the instruction manual for this processor. The processor is quite a beast, and the format of the instructions is somewhat unique.
As for inserting spaces into the mnemonic, it would at times be useful for this processor's instructions, though I'm not sure it would be useful to insert them at the beginning of the instruction. In theory they could be protected from stripping with some sort of escape sequence (%20), although that might complicate code using the mnemonic field.
Extending the listing with custom fields is interesting, and I actually like this solution better. The code browser was meant to be extended with special fields such as the braces you've added. You can configure and save multiple tool configurations with particular plugins or code browser fields set up to work best for a particular use.
You mention that overriding the mnemonic/operand fields affects the data display fields (mnemonic and operand), which is true. Currently the code browser field formats have a single format for instructions and data (code unit). These could potentially be split so that instructions have their own format and data has another. Having these two split could be a detriment for other processors, but might actually be more useful, as data can display much differently than code, with the operand needing more space, such as for strings.
I'm not sure of the ramifications of making this change. We'd need to look into it more closely, as the code browser listing is a complicated set of code.
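
The escape-sequence idea mentioned above could look roughly like the following standalone Java sketch. It is only an illustration under stated assumptions: the class and method names are hypothetical, stripLeading() stands in for whatever stripping Ghidra performs, and the decode step at display time is an assumption, since the comment does not say where unescaping would happen.

```java
public class LeadingSpaceEscapeSketch {
    // Replace each leading space with "%20" so that stripping cannot remove it.
    static String escapeLeading(String display) {
        int n = 0;
        while (n < display.length() && display.charAt(n) == ' ') {
            n++;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("%20");
        }
        return sb.append(display.substring(n)).toString();
    }

    // Stand-in for the whitespace stripping described in the issue (not Ghidra code).
    static String stripLeading(String s) {
        return s.replaceAll("^\\s+", "");
    }

    // Decode at display time. Note that this turns every "%20" back into a space.
    static String unescape(String s) {
        return s.replace("%20", " ");
    }

    public static void main(String[] args) {
        String line = "  r2=add(r3,#0xc) }";
        String stored = stripLeading(escapeLeading(line));
        System.out.println("[" + stored + "]");
        System.out.println("[" + unescape(stored) + "]");
    }
}
```

Any code that reads the mnemonic field directly would see the escaped form, which is the complication noted above.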

@NeatMonster
Author

Thank you for your answer @emteere, it is very much appreciated!

> I'm not sure of the ramifications of making this change. We'd need to look into it more closely, as the code browser listing is a complicated set of code.

If you ever get time to work on this, please keep me updated. In the meantime, I might continue to work on this, but I should definitely open-source what I have already done (even though it is not pretty).

@nsadeveloper789
Contributor

One ramification I can think of is for the "Patch Instruction" command. It will still expect the syntax that appears in the usual Mnemonic/Operands fields. Maybe that's not a problem, but it's something to consider.

@aguerriero1998

@NeatMonster Have you released what you have as open source? I looked on your GitHub and did not see anything. If you could release what you have as open source, that'd be awesome. Thanks.


6 participants