Creates a text listing file (extension .lst), similar to the listing file produced by MASM or TASM, for programs written in assembly language. Because writing a full compiler is time-consuming, the supported machine instructions, data and instruction addressing modes, and directives are heavily restricted to a subset of the standard Intel processor assembly language.
The project is built with:
- C++
- gcc: version 6.3.0
Clone the repository and build the program with the provided makefile:
$ make
*Identifiers*
Identifiers consist of capital letters of the Latin alphabet and digits, and they start with a letter. An identifier is at most 7 characters long
*Constants*
Hexadecimal and text constants
*Directives*
END, SEGMENT (without operands), ENDS, PROC, ENDP, ASSUME
DB, DW, DD with one operand, a constant (string constants only for DB)
The DW and DD directives can take a name or a label as an operand
*Size of data and addresses*
16-bit data and 16-bit offsets within the segment; 32-bit data and offsets are not used
*Addressing of memory operands*
Base-indexed addressing (VAL[BP+SI], VAL[BX+DI], etc.)
*Segment overrides*
Segment override prefixes are generated automatically by the translator when necessary
*Machine commands*
Ret
Push **reg**
Pop **mem**
Mov **reg**, **mem**
Mov **reg**, **imm**
Jne
Call (with direct and indirect addressing, intra-segment and inter-segment)
Where **reg** is an 8- or 16-bit general-purpose register,
**mem** is the address of an operand in memory,
**imm** is a constant or an offset within the segment (OFFSET val1, OFFSET label1, etc.)
- The compiler's input is a text file containing an arbitrary assembly language program, which is translated subject to the restrictions of the individual coursework variant.
- The compiler's output is a text listing file (extension .lst).
- The names of the source assembly file and of the listing file to create are given on the command line when the compiler is started.
- All diagnostic messages written to the listing file are also printed to the screen, followed by the total number of errors found in the source program.
- A syntax error diagnostic is issued for every construct (identifier, constant, directive, machine instruction, addressing mode, etc.) that TASM or MASM would accept but that falls outside the restrictions above.
Once test programs in assembly language have been written that meet the requirements of the individual task and make it possible to check the compiler, the following logical units are to be implemented:
- Lexical analyzer of the Assembly language.
- Setting up the table of identifiers.
- Determination of the number of bytes that will be formed according to each instruction.
- Generation of commands and data.
- Formation of the listing file.
Instructions may have:
- 0 operands
- 1 operand
- 2 operands
- more than 2 operands (not supported)

Operand types:
- Constant (8-bit or 16/32-bit); a value <= 255 is treated as 8-bit
- Register (8-bit, 16/32-bit, segment)
- Memory (8-bit, 16/32-bit)
- Label (16/32-bit offset)
modrm
mod = 00
rm = 100 (sib)
sib
scale = 10
index = esi
base = ebx
reg = 0..7
mem/reg
```
push [bx]
mov [dx + si], ax - mem32 | reg32
reg32 | mem32
mov ebx, ecx - reg32 | reg32
mov [ebx], [ebx] - ERROR
mov 14, 51 - ERROR
mov [ebx], 52 - mem32 | const8
mov ebx, 3452 - reg32 | const32
```
You can't have two constants in the same command
You can't have two memories in one command
You cannot have two labels (DISP) in one command
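The three rules above can be checked with a few counters. The sketch below is only an illustration: the `OperandKind` enum and the `operands_compatible` function are hypothetical names, not part of the project, and the check assumes the operands have already been classified.

```cpp
#include <cassert>
#include <vector>

// Hypothetical operand categories, named after the operand types listed above.
enum class OperandKind { Register, Memory, Constant, Label };

// Returns true when the operand list has at most one constant,
// at most one memory operand, and at most one label (DISP).
bool operands_compatible(const std::vector<OperandKind>& operands) {
    assert(operands.size() <= 2 && "more than two operands are not supported");
    int constants = 0, memories = 0, labels = 0;
    for (OperandKind kind : operands) {
        if (kind == OperandKind::Constant) ++constants;
        if (kind == OperandKind::Memory) ++memories;
        if (kind == OperandKind::Label) ++labels;
    }
    return constants <= 1 && memories <= 1 && labels <= 1;
}
```

With this check, `mov [ebx], [ebx]` (two memory operands) and `mov 14, 51` (two constants) are rejected, while `mov [ebx], 52` passes. Rejecting a constant as the destination would need a separate, position-aware rule.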
# Marking rules
```
/r – indicates that the instruction has a ModR/M byte whose reg field
contains the data register number;
/digit (from 0 to 7) – indicates that the instruction has a ModR/M byte
whose reg field contains part of the operation code;
r/m8, r/m16, r/m32 – indicates that the instruction has a ModR/M byte whose
r/m field either contains the number of a data register of the corresponding
size or makes the instruction form an effective address (offset in the
segment) of a memory operand of that size;
r8, r16, r32 – an operand in a register of byte, word, or doubleword size;
ib, iw, id (or imm8, imm16, imm32) – an immediate operand of byte, word,
or doubleword size;
+rb, +rw, +rd – the low three bits of the opcode byte contain the number
of a data register of the corresponding size.
```
FB sti
==| FB STI
F9 stc
==| F9 STC
8B C3 mov eax, ebx
==| 8B /r MOV r32, r/m32
====| 0xC3 = 0b11 000 011
====| mod reg rm
====| mod = 11 (register only)
====| reg = 000 (eax)
====| rm = 011 (ebx)
8A C3 mov al, bl
==| 8A /r MOV r8, r/m8
8B CE mov ecx, esi
==| 8B /r MOV r32, r/m32
====| 0xCE = 0b11 001 110
====| mod reg rm
====| mod = 11 (register only)
====| reg = 001 (ecx)
====| rm = 110 (esi)
89 30 mov [eax], esi
==| 89 /r MOV r/m32, r32
====| 0x30 = 0b00 110 000
====| mod = 00 ( [ reg ] )
====| reg = 110 (esi)
====| rm = 000 (eax)
89 7B 03 mov dword ptr [ebx + 3], edi
==| 89 /r MOV r/m32, r32
====| 0x7B = 0b01 111 011
====| mod = 01 ( [ reg + disp8 ] )
====| reg = 111 (edi)
====| rm = 011 (ebx)
89 35 00000000r mov var1, esi
==| 89 /r MOV r/m32, r32
====| 0x35 = 0b00 110 101
====| mod = 00 (memory operand)
====| reg = 110 (esi)
====| rm = 101 (since mod == 00, this selects disp32-only mode)
int a[4]
4 * sizeof(int) + a
89 B3 00000000 mov [ebx + var1], esi
==| 89 /r MOV r/m32, r32
====| 0xB3 = 0b10 110 011
====| mod = 10 ( [ reg + disp32 ] )
====| reg = 110 (esi)
====| rm = 011 (ebx)
89 04 7E mov dword ptr [esi + edi * 2], eax
==| 89 /r MOV r/m32, r32
====| 0x04 = 0b00 000 100
====| mod = 00 ( [ reg ] )
====| reg = 000 (eax)
====| rm = 100 (with mod == 00, this selects SIB mode)
====| 0x7E = 0b01 111 110
====| scale indx base
====| EXP = base + index * scale
====| scale = 01 ( *2 )
====| index = 111 (edi)
====| base = 110 (esi)
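The bit splits shown in the examples above can be reproduced with two small packing helpers. This is a minimal sketch; `make_modrm` and `make_sib` are illustrative names, not functions taken from the project.

```cpp
#include <cassert>
#include <cstdint>

// Packs the ModR/M fields: mod (2 bits), reg (3 bits), rm (3 bits).
std::uint8_t make_modrm(unsigned mod, unsigned reg, unsigned rm) {
    assert(mod < 4 && reg < 8 && rm < 8);
    return static_cast<std::uint8_t>((mod << 6) | (reg << 3) | rm);
}

// Packs the SIB fields, which use the same 2-3-3 layout:
// scale (2 bits), index (3 bits), base (3 bits).
std::uint8_t make_sib(unsigned scale, unsigned index, unsigned base) {
    assert(scale < 4 && index < 8 && base < 8);
    return static_cast<std::uint8_t>((scale << 6) | (index << 3) | base);
}
```

For example, `make_modrm(3, 0, 3)` gives 0xC3 (`mov eax, ebx`), and `make_modrm(0, 0, 4)` followed by `make_sib(1, 7, 6)` gives the bytes 04 7E of `mov dword ptr [esi + edi * 2], eax`.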
66| 8B CF mov cx, di
==| 8B /r MOV r32, r/m32
====| DATA CHANGE PREFIX (66)
====| 0xCF = 0b11 001 111
====| mod = 11 (register only)
====| reg = 001 (cx)
====| rm = 111 (di)
66| 67| 89 34 mov word ptr [si], si
==| 89 /r MOV r/m32, r32
====| DATA CHANGE PREFIX (0x66)
====| INDEX CHANGE PREFIX (0x67)
====| 0x34 = 0b00 110 100
====| mod = 00, rm = 100 ( [ si ] in 16-bit addressing, no SIB byte)
====| reg = 110 (si)
66| 89 B0 00000004r mov var2[eax], si
==| 89 /r MOV r/m32, r32
====| DATA CHANGE PREFIX (0x66)
====| 0xB0 = 0b10 110 000
====| mod = 10 ( [reg + disp32] )
====| reg = 110 (esi/si)
====| rm = 000 (eax)
50 push eax
==| 50+rd PUSH r32
====| 0x50 = 0x50 | 0x00 (eax)
66| 50 push ax
==| 50+rw PUSH r16
====| DATA CHANGE PREFIX (0x66)
====| 0x50 = 0x50 | 0x00 (eax / ax)
66| 53 push bx
==| 50+rw PUSH r16
====| DATA CHANGE PREFIX (0x66)
====| 0x53 = 0x50 | 0x03 (ebx / bx)
E4 20 in al, 32
==| E4 ib IN AL, imm8
05 000010AA add eax, 4266
==| 05 id ADD EAX, imm32
83 C0 2A add eax, 42
==| 83 /0 ib ADD r/m32, imm8
====| 0xC0 = 0b11 000 000
====| mod = 11 (register only)
====| reg = 000 (/0, part of the opcode)
====| rm = 000 (eax)
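The two `add eax, ...` listings above differ only in how the constant is stored: the `83 /0 ib` form holds a sign-extended byte, while `05 id` holds a full doubleword in little-endian order. A possible way to choose between them is sketched below; `append_le` and `encode_add_eax` are hypothetical helper names, and the choice of the shorter form whenever the value fits in a signed byte is an assumption based on those two listings.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the low `count` bytes of `value`, least significant byte first.
void append_le(std::vector<std::uint8_t>& out, std::uint32_t value, int count) {
    assert(count >= 1 && count <= 4);
    for (int i = 0; i < count; ++i) {
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    }
}

// Encodes `add eax, value`, picking the imm8 form when the value fits.
std::vector<std::uint8_t> encode_add_eax(std::int32_t value) {
    std::vector<std::uint8_t> bytes;
    if (value >= -128 && value <= 127) {
        bytes.push_back(0x83);  // 83 /0 ib
        bytes.push_back(0xC0);  // ModR/M: mod = 11, reg = /0, rm = 000 (eax)
        append_le(bytes, static_cast<std::uint32_t>(value), 1);
    } else {
        bytes.push_back(0x05);  // 05 id
        append_le(bytes, static_cast<std::uint32_t>(value), 4);
    }
    return bytes;
}
```

Note that the listing prints the doubleword as `000010AA`, while the bytes in memory are `AA 10 00 00`.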
near (8-bit displacement): up to 127 bytes
far (32-bit displacement): up to 2^31 bytes
76 00
0F 86 00000000
bool far
bool forward
if !far and !forward then size = 2
else size = 6
offset = destination - (current + size + unused bytes of the current instruction)
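The sizing rule in these notes can be written down as follows. This is a sketch under stated assumptions: the function names are hypothetical, a forward reference is assumed to get the long form because its target is not yet known, and the displacement is taken relative to the end of the jump instruction, ignoring the "unused bytes" term.

```cpp
#include <cassert>
#include <cstdint>

// True when the target cannot be reached with an 8-bit displacement.
// `current` is the offset of the jump itself; 2 is the size of the short form.
bool is_far_jump(std::int32_t destination, std::int32_t current) {
    std::int32_t rel = destination - (current + 2);
    return rel < -128 || rel > 127;
}

// 2 bytes (opcode + rel8) only for a backward jump within short range,
// otherwise 6 bytes (0F xx + rel32).
int jump_size(bool far_jump, bool forward_ref) {
    return (!far_jump && !forward_ref) ? 2 : 6;
}

// Displacement stored in the instruction: target minus the offset
// of the byte that follows the jump.
std::int32_t jump_displacement(std::int32_t destination,
                               std::int32_t current, int size) {
    assert(size == 2 || size == 6);
    return destination - (current + size);
}
```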
[prefixes] [opcode] [modrm] [sib] [disp] [imm]
- [prefixes]: data size, address size, and segment override prefixes.
- [opcode]: 1 or 2 bytes. Always present.
- [modrm]: basically stores two things, REG and MOD + R/M. REG is a constant (part of the opcode) or a register; MOD + R/M select a register or a memory operand.
- [disp]: required or not depending on mod. For a given instruction variant it is either always present or always absent.
- [imm]: if the instruction has a constant, it is put in this field.
MODRM
General algorithm for calculating bytes:
- Opcode
- ModRM / SIB. The REG field is filled in separately (it may be a register or part of the opcode). The MOD / R/M fields are also filled in separately (they describe a register or a memory operand; a memory operand may need a SIB byte). Depending on the instruction variant, the SIB byte is either always present or always absent. ModRM may also require a DISP8/DISP32 field.
- Go through all operands, and:
  - if the operand is a constant, add the constant;
  - if the operand has a segment override to a segment other than the current one, add the segment override prefix;
  - if the operand is not of the native size, add the data size prefix.
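Once each part of the layout is known, the instruction length is just the sum of the parts. The sketch below assumes a hypothetical `EncodingPlan` structure filled in by earlier analysis; neither the structure nor its field names come from the project.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one instruction's encoding.
struct EncodingPlan {
    std::size_t prefixes;  // number of prefix bytes (data size, address size, segment)
    std::size_t opcode;    // 1 or 2 opcode bytes
    bool has_modrm;        // ModR/M byte present
    bool has_sib;          // SIB byte present (only together with ModR/M)
    std::size_t disp;      // displacement size in bytes: 0, 1, 2 or 4
    std::size_t imm;       // immediate size in bytes: 0, 1, 2 or 4
};

// Total number of bytes: [prefixes] [opcode] [modrm] [sib] [disp] [imm].
std::size_t instruction_length(const EncodingPlan& plan) {
    assert(plan.opcode == 1 || plan.opcode == 2);
    assert(!plan.has_sib || plan.has_modrm);
    return plan.prefixes + plan.opcode + (plan.has_modrm ? 1 : 0) +
           (plan.has_sib ? 1 : 0) + plan.disp + plan.imm;
}
```

For instance, `66| 89 B0 00000004r` is one prefix, one opcode byte, a ModR/M byte and a 4-byte displacement, which gives 7 bytes.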
Lexical analysis produces a table of lexemes (tokens) for the current line of the program. A lexeme is one or more characters of the line that the assembler treats as a single object. Lexemes are the "indivisible atoms" of the programming language, which is why the term "terminal symbol" is also often used. The following lexemes are distinguished in assembly language:
- Identifiers – sequences of letters and digits that begin with a letter. Letters include the lowercase and uppercase letters of the supported alphabets. In MASM and TASM this is the Latin alphabet plus the symbols _, @, ?, $. Identifiers are not case sensitive, but lowercase and uppercase letters have different ASCII codes, so the compiler converts all lowercase letters to uppercase when processing identifiers. If the compiler is developed in, for example, Ukraine, it may be appropriate to allow letters of the Ukrainian alphabet in identifiers. The Latin and Ukrainian alphabets share many letters that look alike but have different character codes (e.g. a, e, p, x). To avoid errors caused by typing a source program with the wrong keyboard layout, the compiler should map such look-alike Latin and Ukrainian letters to the same code when processing identifiers. The length of an identifier is practically unlimited, but only the first 32 characters are significant.
- Numeric constants – sequences of digits and some letters that begin with a digit. There are binary, octal, decimal and hexadecimal constants. The letters a, b, c, d, e, f and the radix suffix letters b, o, q, d, h are the letters allowed in a numeric constant (as with identifiers, case is not significant).
  - Binary constants consist of the digits 0 and 1 and must end with the letter b.
  - Octal constants consist of the digits 0 to 7 and must end with the letter o or q.
  - Decimal constants consist of the digits 0 to 9 and may end with the letter d. The suffix d is optional; without it, the constant ends at the first character that is not a digit from 0 to 9.
  - Hexadecimal constants consist of the digits 0 to 9 and the letters a, b, c, d, e, f and must end with the letter h. If the first character of a hexadecimal constant is a letter, it must be preceded by 0 so that the compiler can distinguish the constant from an identifier.
- Text constants – sequences of arbitrary characters that can be typed on the keyboard, beginning and ending with the symbol ' or ". Characters inside text constants are not transformed in any way.
- Token separators – the space, tab, and semicolon characters.
- Single-character tokens – all other characters in the alphabet of the assembly language. A separator between two lexemes is mandatory if their concatenation would itself be a lexeme.
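The lexeme classes listed above map naturally onto a small hand-written scanner. The following is a minimal sketch, not the project's lexer: `TokenKind`, `Token` and `tokenize` are hypothetical names, the semicolon is assumed to start a comment that runs to the end of the line, and the mapping of look-alike Cyrillic letters is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, Text, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits one source line into tokens. Identifiers and numeric constants are
// converted to uppercase; text constants are stored without their quotes.
std::vector<Token> tokenize(const std::string& line) {
    auto is_ident_start = [](unsigned char c) {
        return std::isalpha(c) || c == '_' || c == '@' || c == '?' || c == '$';
    };
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < line.size()) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (c == ' ' || c == '\t') { ++i; continue; }  // separators
        if (c == ';') break;                           // assumed: comment start
        if (c == '\'' || c == '"') {                   // text constant
            std::size_t end = line.find(static_cast<char>(c), i + 1);
            if (end == std::string::npos) end = line.size();  // unterminated
            tokens.push_back({TokenKind::Text, line.substr(i + 1, end - i - 1)});
            i = (end < line.size()) ? end + 1 : end;
            continue;
        }
        if (is_ident_start(c) || std::isdigit(c)) {    // identifier or number
            TokenKind kind = std::isdigit(c) ? TokenKind::Number
                                             : TokenKind::Identifier;
            std::string text;
            while (i < line.size()) {
                unsigned char d = static_cast<unsigned char>(line[i]);
                if (!(is_ident_start(d) || std::isdigit(d))) break;
                text += static_cast<char>(std::toupper(d));
                ++i;
            }
            tokens.push_back({kind, text});
            continue;
        }
        tokens.push_back({TokenKind::Symbol, std::string(1, line[i])});
        ++i;
    }
    return tokens;
}
```

Checking the radix suffix of a `Number` token (b, o, q, d, h) and the 7-character identifier limit would be separate steps on top of this scanner.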
# Definition of program sentence structure
After lexical analysis, the permissible token sequences, that is, the possible sentence structures in the token table, reduce to the following:
<empty>
<label>:
<mnem>
<mnem> {<op>}
<label>: <mnem>
<name> <mnem>
<label>: <mnem> {<op>}
<name> <mnem> {<op>}
where
<empty> – a table that contains no tokens (empty sentences and comment-only sentences);
<label> – label identifier in a machine instruction;
<mnem> – the identifier of a directive, machine instruction, or macro, or a sequence of two identifiers, one of which is a repetition prefix and the other an instruction mnemonic;
<name> – name identifier in a directive;
{<op>} – one or more operands separated by commas; each operand is in most cases a sequence of tokens.
Leaving aside the empty token table, the following conclusions can be drawn about these sequences. First, the first token of a line is always an identifier (any other kind of token there is a syntax error, and further analysis can stop). Second, this identifier must be looked up in the tables of machine instruction mnemonics, directive mnemonics, and macro mnemonics. The result of this analysis is a sentence structure table that records the indices, within the token table, of the tokens that belong to the label field, the mnemonic field, and each operand of the operand field.
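One way to build such a sentence structure table from a token list is sketched below. It is an illustration only: `Sentence` and `analyze` are hypothetical names, tokens are assumed to be uppercase strings, a single set stands in for the three mnemonic tables, and only letters are accepted as the first character of an identifier.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Sentence {
    bool ok = true;     // false on a syntax error
    int label = -1;     // index of the label or name token, -1 if absent
    int mnemonic = -1;  // index of the mnemonic token, -1 if absent
    std::vector<std::pair<int, int>> operands;  // [first, last] token indices
};

Sentence analyze(const std::vector<std::string>& tokens,
                 const std::set<std::string>& mnemonics) {
    Sentence s;
    if (tokens.empty()) return s;  // <empty>
    assert(!tokens[0].empty());
    if (!std::isalpha(static_cast<unsigned char>(tokens[0][0]))) {
        s.ok = false;  // the first token must be an identifier
        return s;
    }
    std::size_t pos = 0;
    bool colon = false;
    if (mnemonics.count(tokens[0]) == 0) {  // label or name comes first
        s.label = 0;
        pos = 1;
        if (pos < tokens.size() && tokens[pos] == ":") { colon = true; ++pos; }
    }
    if (pos == tokens.size()) {  // only "<label>:" may stand alone
        s.ok = colon;
        return s;
    }
    if (mnemonics.count(tokens[pos]) == 0) { s.ok = false; return s; }
    s.mnemonic = static_cast<int>(pos);
    const std::size_t start = pos + 1;
    std::size_t first = start;
    for (std::size_t i = start; i <= tokens.size(); ++i) {
        if (i < tokens.size() && tokens[i] != ",") continue;
        if (i > first) {
            s.operands.push_back({static_cast<int>(first), static_cast<int>(i) - 1});
        } else if (start < tokens.size()) {
            s.ok = false;  // empty operand, e.g. "MOV AX,"
        }
        first = i + 1;
    }
    return s;
}
```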
# Analysis of operands of machine instructions
Machine instructions can have the following operands:
- constants (absolute terms) or expressions over constants (absolute expressions), which are used to specify immediate data;
- data registers, which in most cases are used to form the reg field of the ModR/M byte;
- relocatable expressions, which specify an offset (operator offset) or the name of a logical segment (operator seg) and are used to specify immediate data in instructions. Unlike absolute expressions, relocatable expressions produce immediate data that can change when executable files are created and when programs are loaded into memory;
- address expressions, which are used to specify the address parts of instructions: the addressing mode bytes and the displacement in the instruction (or, as an exception, the logical address in inter-segment control transfer instructions). An operand is recognized as an address expression if it contains address terms (labels or names, each of which defines three components: the logical segment name, the offset in the logical segment, and the type) and/or address registers. Absolute expressions may appear inside address expressions.
# Determining the number of bytes generated per program line
After the name (label) field has been processed, the number of bytes that the assembler must generate for the current sentence is determined. The detailed algorithm for determining the number of bytes is considered below. Note here that this number of bytes, k, is added to the "Current offset" field of the active segment in the segment table (i.e. $ = $ + k). This process of building the table of user-defined identifiers and determining the number of bytes per sentence is called automatic program memory allocation, or simply memory allocation.
# Determining the number of data bytes in the directives db, dw, dd, dp, dq, dt
On both passes, the number of bytes k to generate for the directive is given by the formula k = type * count, where type is determined by the directive and count (in the absence of operands with repetitions) is the number of operands of the directive according to the sentence structure table. If an operand uses the repetition operator dup, count is the number of repetitions. If the directive uses operands with nested repetitions, or a mix of operands with and without repetitions, a recursive procedure is recommended for determining count. If a text constant is used in the db directive, count is the number of characters in the corresponding token from the token table.
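The formula k = type * count can be sketched as follows for the simple case without dup. The function names are hypothetical, operands are assumed to arrive as strings with text constants still in their quotes, and the element sizes are the usual ones for these directives (1, 2, 4, 6, 8 and 10 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Size in bytes of one element for each data directive; 0 if unknown.
std::size_t element_size(const std::string& directive) {
    if (directive == "DB") return 1;
    if (directive == "DW") return 2;
    if (directive == "DD") return 4;
    if (directive == "DP") return 6;
    if (directive == "DQ") return 8;
    if (directive == "DT") return 10;
    return 0;
}

// k = type * count. A text constant in DB contributes one element per
// character; every other operand contributes one element. dup is not handled.
std::size_t data_bytes(const std::string& directive,
                       const std::vector<std::string>& operands) {
    const std::size_t type = element_size(directive);
    assert(type != 0 && "unknown data directive");
    std::size_t count = 0;
    for (const std::string& op : operands) {
        const bool is_text = op.size() >= 2 &&
                             (op.front() == '\'' || op.front() == '"') &&
                             op.back() == op.front();
        count += (is_text && directive == "DB") ? op.size() - 2 : 1;
    }
    return type * count;
}
```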
On the second pass, the data bytes must additionally be generated, after the values of constants or absolute expressions have been computed. When generating bytes, remember that Intel processors store lower-order bytes at lower addresses. Generation has some special cases when a label or a symbolic name is the operand of a dw, dd, or dp directive. The result depends on the bitness of the current segment (where the dw, dd, or dp directive is located) and the bitness of the segment where the name (label) is defined, and hence on the bitness of the offset of the name (label). The rules follow from common sense. The dw directive can only take a 16-bit offset in a 16-bit segment; the 16-bit offset is read from the user-defined identifier table and the program bytes are generated from it. For the dd directive in the same situation, two additional high-order zero bytes are generated for the segment component, whose value will be determined by the loader. For the dd directive with a 32-bit offset in a 32-bit current segment, the 4 bytes of the offset are generated. For the dp directive in the same situation, two additional high-order zero bytes are generated for the segment component. All other cases are errors (although MASM and TASM do not always report one).
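For the 16-bit cases described above, byte generation for a label operand could look like the sketch below. The helper names are illustrative, and the offset is assumed to have been read from the identifier table already.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the low `count` bytes of `value`, lower-order bytes first.
void put_le(std::vector<std::uint8_t>& out, std::uint32_t value, int count) {
    assert(count >= 1 && count <= 4);
    for (int i = 0; i < count; ++i) {
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    }
}

// Bytes generated for `dw label` or `dd label` in a 16-bit segment.
// For dd, the two high-order bytes are zero placeholders for the segment part.
std::vector<std::uint8_t> emit_label16(std::uint16_t offset, bool is_dd) {
    std::vector<std::uint8_t> bytes;
    put_le(bytes, offset, 2);
    if (is_dd) put_le(bytes, 0, 2);
    return bytes;
}
```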
# Structure of the ModR/M byte with 16-bit addressing
Field mod: 2 bits | Field reg: 3 bits | Field r/m: 3 bits
The mod field specifies the displacement (the offset in the instruction), and the r/m field specifies the address registers whose contents are used to form the effective address. The reg field specifies a data register used by the instruction, or holds part of the opcode. Consider the mod field:
- With mod=00, there is no displacement in the instruction, and the effective address is formed from the contents of the address register(s) specified by the r/m field.
- With mod=01, the displacement in the instruction is one byte, which is sign-extended to two bytes and then added to the contents of the address register(s) specified by the r/m field.
- With mod=10, the displacement in the instruction is two bytes, and it is added to the contents of the address register(s) specified by the r/m field to form the effective address.
- With mod=11, no memory address is specified, and the r/m field holds a data register code.
Code in the r/m field | Address registers | Default segment register
--- | --- | ---
000 | BX+SI | DS
001 | BX+DI | DS
010 | BP+SI | SS
011 | BP+DI | SS
100 | SI | DS
101 | DI | DS
110 | BP | SS
111 | BX | DS
The following conclusions can be drawn from the table; they also hold for modern microprocessors of the family when 16-bit addressing is used:
- With 16-bit addressing, only four registers (BX, BP, SI and DI) can be used as address registers.
- To form a multi-component address, only a limited set of register pairs is available: (BX, SI), (BX, DI), (BP, SI), (BP, DI).
- One widely used mode is missing from this scheme: direct addressing, in which the displacement in the instruction is itself the offset in the segment.

Regarding the last point, Intel engineers had to make the following decision: direct addressing is selected by mod=00 and r/m=110. That is, even though mod=00, the instruction carries a two-byte displacement when r/m=110, and the BP register is not used. This looks very much like a "patch" in a program, but the patch is quite well thought out. Register indirect addressing through BP alone drops out of the hardware implementation, but that mode is unlikely to be needed given the intended role of BP as the base register of stack data structures. If it is needed, the mode with mod=01 and a zero one-byte displacement can be used instead, which is what assemblers do.
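The r/m table and the direct-addressing exception can be combined into one selection routine. This is a sketch with hypothetical names (`rm_code`, `Mem16`, `encode_mem16`): the register combination is assumed to arrive as an uppercase string such as "BP+SI", an empty string stands for direct addressing, and the shortest displacement that fits is chosen.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// r/m code for a 16-bit address register combination, or -1 if not allowed.
int rm_code(const std::string& regs) {
    static const char* const table[8] = {"BX+SI", "BX+DI", "BP+SI", "BP+DI",
                                         "SI",    "DI",    "BP",    "BX"};
    for (int i = 0; i < 8; ++i) {
        if (regs == table[i]) return i;
    }
    return -1;
}

struct Mem16 {
    unsigned mod;
    unsigned rm;
    std::vector<std::uint8_t> disp;  // displacement bytes, low byte first
};

// Chooses mod, r/m and displacement bytes for a 16-bit memory operand.
Mem16 encode_mem16(const std::string& regs, int disp) {
    assert(disp >= -32768 && disp <= 65535);
    const auto u = static_cast<std::uint16_t>(disp);
    const auto lo = static_cast<std::uint8_t>(u & 0xFF);
    const auto hi = static_cast<std::uint8_t>(u >> 8);
    if (regs.empty()) return {0, 6, {lo, hi}};  // direct: mod=00, r/m=110
    const int rm = rm_code(regs);
    assert(rm >= 0 && "unsupported register combination");
    // [BP] alone cannot use mod=00, so it falls through to mod=01 with disp8 = 0.
    if (disp == 0 && rm != 6) return {0, static_cast<unsigned>(rm), {}};
    if (disp >= -128 && disp <= 127) return {1, static_cast<unsigned>(rm), {lo}};
    return {2, static_cast<unsigned>(rm), {lo, hi}};
}
```

An operand such as VAL[BP+SI] from the restrictions at the top would normally take the mod=10 path, since the offset of VAL is a 16-bit value resolved from the identifier table.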