zld documentation

Hiroshi edited this page Mar 22, 2022 · 4 revisions

Welcome to the documentation of zld.

https://docs.google.com/presentation/d/1NUzFXcgWgpMCSca03_JXnHuRzY2XJdFujgrEFn9_YFY/edit#slide=id.p

zld is a linker that aims to link coff/elf/(mach-o) object files. You might ask why I need to create a new linker. The motivation is as follows.

  1. Combining linking and packing.
    By combining linking and packing, the border between static and dynamic linking is blurred. Most packers target an executable format whose static linking has already been done. But if the linker has not yet processed the input, you can put the object files into the packed binary before static linking and do work that is more or less equivalent to static linking on the code after it is unpacked. This option might make the computational cost of linking lighter. By putting object files without static linking

  2. Adding functionality beyond compression, such as encryption, anti-debugging, and anti-sandboxing.

Inputs to a linker

The inputs to a linker are relocatable object files. Being relocatable means that a function or address in one object file may refer to an address in another object file, which may or may not be supplied together with it.

Test

Testing is extremely important when developing a linker. Basically, I need to test from two sides: input and output. The output in particular needs to be tested very heavily, since a linker is a sibling of a loader (dynamic linker) and refers to its source code.

  1. Input
    Input is assumed to be produced by a compiler or an assembler. For the time being I target nasm and gcc (mingw), but msvc and other compilers should be tested as well if possible.

  2. Output
    Output should be tested with both a custom loader and the default loader of the OS. A custom loader is required for intermediate validation of the output, since the default loader tells you very little when the output is not well formed, which makes debugging hard. On Linux, musl-libc is used. On Windows, the assembly of emotet-loader and the default loader are used. A loader has many details to understand. They are investigated separately here().

Diagram of data structure

Linker development becomes a mess unless you set up all of the data structures properly in advance.

The following diagram declares the internal structures.

heap allocation list

| Name of object | Role |
| --- | --- |
| Section Container | Manages an output section, which may pack multiple input sections. |
| Section Chain | Manages a single input section, which is referred to from one Section Container or another Section Chain. |
| Object Chain | Used for external object resolution. |
| Symbol Chain | Used for relocation over multiple object files, together with the hash table. |
| Hash Table | Manages the hashes of symbols. Each slot of the array contains a symbol chain. |
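
As a rough illustration of how the structures in the table might relate to each other, here is a minimal Python sketch. Every class and field name is hypothetical and not taken from zld's source, and the Object Chain and Hash Table are left out to keep it short. It only shows the idea that a Section Container packs the Section Chains that share a section name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SectionChain:
    """One input section taken from one object file (hypothetical layout)."""
    object_name: str
    name: str
    data: bytes
    offset: int = 0  # offset inside the output section, set when appended


@dataclass
class SectionContainer:
    """One output section that packs the input sections sharing its name."""
    name: str
    chains: List[SectionChain] = field(default_factory=list)
    size: int = 0

    def append(self, chain: SectionChain) -> None:
        chain.offset = self.size
        self.chains.append(chain)
        self.size += len(chain.data)


@dataclass
class SymbolChain:
    """One defined symbol; `next` links symbols that share a hash bucket."""
    name: str
    section: SectionChain
    value: int  # offset inside the input section
    next: Optional["SymbolChain"] = None


def pack(sections: List[SectionChain]) -> Dict[str, SectionContainer]:
    """Group input sections into output containers by section name."""
    containers: Dict[str, SectionContainer] = {}
    for sec in sections:
        containers.setdefault(sec.name, SectionContainer(sec.name)).append(sec)
    return containers
```

With this layout, the final position of a symbol inside its output section would be `symbol.section.offset + symbol.value`.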

Computation cost analysis

| operation | shared object | coff | elf |
| --- | --- | --- | --- |
| input file allocation | | | |
| iteration of section header (section chain allocation) | | | |
| object chain allocation | | | |
| iteration of symbol table (symbol chain allocation) | | | |
| internal relocation | | | |
| external relocation | | | |
| relocation by dynamic entry | | | |

External relocation is currently implemented by chains held in a hash table. When the hash exists, you must iterate over the chain under the same hash key. If you find an entry which matches the input, you return the virtual address of that entry. The costs are:

  1. Cost of chain iteration.
  2. Cost of string comparison.
  3. Cost of virtual address retrieval.

3 can be heavy, as it is done only once a matching string has been found. Ideally, each string should be stored in a relatively simple way on the chain. A symbol table often holds an index into the string table rather than the string itself. You need to know the string table offset (or you can assign the string to each single entry). Probably, a chain is not a good way to store them. As
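
The chained lookup described for external relocation can be sketched as follows. This is a minimal illustration, not zld's implementation: the bucket count, the hash function (a djb2 variant) and all names are assumptions, and each chain is a plain Python list instead of a linked structure.

```python
from typing import List, Optional, Tuple

NUM_BUCKETS = 8  # assumed size, kept small for illustration


def sym_hash(name: str) -> int:
    """Toy string hash (djb2 variant); not necessarily the one zld uses."""
    h = 5381
    for ch in name.encode():
        h = (h * 33 + ch) & 0xFFFFFFFF
    return h % NUM_BUCKETS


class SymbolTable:
    def __init__(self) -> None:
        # Each bucket holds a chain of (name, virtual address) entries.
        self.buckets: List[List[Tuple[str, int]]] = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, name: str, vaddr: int) -> None:
        self.buckets[sym_hash(name)].append((name, vaddr))

    def lookup(self, name: str) -> Optional[int]:
        # Cost 1: walk the chain. Cost 2: compare strings. Cost 3: fetch the address.
        for entry_name, vaddr in self.buckets[sym_hash(name)]:
            if entry_name == name:
                return vaddr
        return None
```

Storing the name directly in each entry, as done here, avoids the extra indirection through a string table during comparison, at the price of more memory per entry.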

Difference between coff and elf relocation

| | coff | elf |
| --- | --- | --- |
| Original input | Each SectionHeader may contain a relocation table along with raw data. | The relocation table is gathered in one section per corresponding resolved section. |
| Iteration for relocation table | Iterate sections with annotated virtual address. | Iterate only one or multiple relocation section(s). |
| Internal resolution | SymbolTable entry (StorageClass == 3) | SymbolTable entry (sym_type == SECTION) |
| External resolution | SymbolTable entry (StorageClass == 2) | SymbolTable entry (sym_type == UNDEF) |
| Dynamic resolution | | |
| Computation cost (internal) | Iterate every entry on the symbol table. | Iterate every entry on the symbol table. |
| Computation cost (external) | 1. Compute the value of the hash function. 2. Get the entry of the symbol table. 3. Compute its virtual address. | |
| Computation cost (dynamic) | | |

Relocation for coff object format file

The address resolver works in the following 3 stages.

  1. Resolved by symbols in the same file. If the symbol table entry for the relocation entry has StorageClass == 3, the symbol is resolved by another symbol in the same object file.

  2. Resolved by symbols in the other files which were provided. If the symbol table entry has StorageClass == 2, the address used to compute the relocation is determined by a symbol in an external object file.

  3. Resolved by symbols in a dynamic load library. If neither 1 nor 2 finds a proper symbol, the symbol is checked against the list of export symbols of the dynamic load libraries. In fact, you do not need to provide a .dll or .lib file containing all export symbols, as long as you have lists of the externally provided APIs tagged with the dll name. You can provide a sqlite database which contains one table per dll, named after the dll, with one record per symbol name.
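
The three stages above can be sketched as a single lookup function. This is only an illustration under assumptions: the function names, the two dictionaries standing in for the per-file and cross-file symbol tables, and the database schema (one table per dll with a single `symbol` column) are all hypothetical. The storage class values 2 and 3 are the ones named in the text.

```python
import sqlite3
from typing import Optional

IMAGE_SYM_CLASS_EXTERNAL = 2  # COFF storage class: external symbol
IMAGE_SYM_CLASS_STATIC = 3    # COFF storage class: static (file-local) symbol


def find_dll(db: sqlite3.Connection, symbol: str) -> Optional[str]:
    """Return the name of the dll table that lists `symbol`, or None."""
    tables = [row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        hit = db.execute(
            f'SELECT 1 FROM "{table}" WHERE symbol = ?', (symbol,)).fetchone()
        if hit:
            return table
    return None


def resolve(name, storage_class, local_syms, global_syms, db):
    """Return (stage, target), trying the three stages in order."""
    if storage_class == IMAGE_SYM_CLASS_STATIC and name in local_syms:
        return ("same-file", local_syms[name])
    if storage_class == IMAGE_SYM_CLASS_EXTERNAL and name in global_syms:
        return ("other-file", global_syms[name])
    dll = find_dll(db, name)
    if dll is not None:
        return ("dynamic", dll)
    raise KeyError(f"unresolved symbol: {name}")
```

For example, with an in-memory database holding a table `"kernel32.dll"` that contains the row `ExitProcess`, resolving `ExitProcess` with empty symbol dictionaries falls through to the third stage and returns the dll name.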

Quite often, a compiler generates a relative call (such as 0xe8 .. on x86) and leaves the address to be called blank. On the other hand, the Import Address Table (IAT) is provided just as a table of addresses. This means you need to prepare a buffer area that the call jumps to, containing an instruction which reads the IAT value filled in by the loader and jumps to it (a call is not needed, as you do not need to come back to that point). The section which contains this buffer code is called the .plt section, and it is placed just before the .idata section where the IAT resides. Specifically, the .plt section contains a list of absolute jump instructions (0xff 0x25 followed by 4 bytes). Every instruction which calls an exported function of an external dll passes through it.
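
To make the byte layout concrete, the following Python sketch encodes one .plt stub and fills in the blank operand of a relative call. It assumes 32-bit x86, where the 4 bytes after 0xff 0x25 are the absolute address of the IAT slot. The function names and the addresses in the usage note are made up for illustration.

```python
import struct


def make_plt_stub(iat_slot_va: int) -> bytes:
    """Encode `jmp dword ptr [iat_slot_va]`: ff 25 plus a 4-byte address (x86-32)."""
    return b"\xff\x25" + struct.pack("<I", iat_slot_va)


def patch_rel32_call(code: bytearray, call_offset: int, call_va: int, target_va: int) -> None:
    """Fill in the blank rel32 operand of the e8 call located at `call_offset`."""
    assert code[call_offset] == 0xE8, "expected a relative call opcode"
    next_insn_va = call_va + 5  # rel32 is relative to the end of the 5-byte call
    rel = (target_va - next_insn_va) & 0xFFFFFFFF
    code[call_offset + 1:call_offset + 5] = struct.pack("<I", rel)
```

A call at virtual address 0x00401000 whose stub sits at 0x00402000 would be patched with `patch_rel32_call(code, 0, 0x00401000, 0x00402000)`, and the stub itself would be `make_plt_stub(0x00403000)` if the IAT slot lives at 0x00403000.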

Approach to incremental linking

LDD will provide functionality of incremental linking.

Incremental linking is a way of linking in which you link some object files without waiting for all of the object files that are supposed to be linked to arrive. The downside is that once you link two entities that are in different sections, you can no longer insert additional sections or increase the size of a section unless the change leaves its virtual address size as it is. In other words, the only entities that can be linked together at this stage are the ones in the same section. This excludes references to symbols of a dynamic load library and static data references such as lea in the .text section. This restriction makes linking slower. To allow links between two sections, you need to make sure the following 3 conditions hold.

  1. The section which is referred to comes first (its virtual address needs to be smaller than that of the section which refers to it).
  2. The referred section will not grow in any subsequent linking step.
  3. There is no section between the referred section and the one which refers to it.

For instance, suppose you have a data section which is referred to from many functions in the .text section.
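
The three conditions can be written down as a small check. The sketch below is an assumption about how such a check might look, not zld's code: the `Section` type, its `frozen` flag (meaning no more data will be appended in later link steps) and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    name: str
    vaddr: int
    frozen: bool  # True if the section will not grow in later link steps


def can_link_across(layout: List[Section], referrer: str, referred: str) -> bool:
    """Check the three conditions for resolving a cross-section reference early."""
    ordered = sorted(layout, key=lambda s: s.vaddr)
    names = [s.name for s in ordered]
    i_ref, i_from = names.index(referred), names.index(referrer)
    comes_first = ordered[i_ref].vaddr < ordered[i_from].vaddr  # condition 1
    is_final = ordered[i_ref].frozen                            # condition 2
    adjacent = i_from - i_ref == 1                              # condition 3
    return comes_first and is_final and adjacent
```

For a layout with a frozen .data at 0x1000 directly followed by .text at 0x2000, a reference from .text into .data passes the check, while the reverse direction does not.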

optional header

Export Directory

Import Directory

Exception Directory

.xdata (Exception Information)

Base Relocation Directory

.pdata

object file generation

An intermediate object file can be generated by merging two or more object files, in order to realize incremental linking.

This process does not assign a virtual address to any section. Nevertheless, one type of relocation can still be done: relocation within the same section. During relocation, each relocation block is checked for whether its type is RELATIVE, which does not require a virtual address, and whether the import and export addresses reside in sections with the same section name.
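
A relocation of this kind can be applied with section offsets alone, because both ends move together when the section is later placed. The following Python sketch illustrates the check and the patch. The `Reloc` record and its fields are hypothetical, and the sketch assumes a 4-byte PC-relative field with no addend stored in it.

```python
import struct
from typing import List, NamedTuple


class Reloc(NamedTuple):
    offset: int          # where the 4-byte field sits inside the section
    target_section: str  # section that holds the referred symbol
    target_offset: int   # offset of the referred symbol in that section
    relative: bool       # True for a PC-relative (rel32) relocation


def resolve_in_section(section_name: str, data: bytearray, relocs: List[Reloc]) -> List[Reloc]:
    """Apply relocations that need no virtual address; return the ones left over."""
    remaining = []
    for r in relocs:
        if r.relative and r.target_section == section_name:
            # Both ends live in the same section, so the distance is already final.
            rel = (r.target_offset - (r.offset + 4)) & 0xFFFFFFFF
            data[r.offset:r.offset + 4] = struct.pack("<I", rel)
        else:
            remaining.append(r)
    return remaining
```

The returned list is what would remain in the relocation table of the intermediate object file.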

Other object files are either merged (if the section name is identical to any previously recorded section) or added individually.

Specifically, the following values are updated.

  • size of raw data :: updated if the section is merged.
  • pointer to raw data :: if a new section is inserted, then the rest of the sections need to be shifted backward.
  • pointer to relocation table :: same as pointer to raw data.
  • number of relocations :: if an entry has been resolved, it is pulled out.

When generating an .exe or .dll (with virtual addresses), the pointer to raw data is updated before relocation.

But, as that step is omitted when generating an .obj, it needs to be done at the last stage, after relocation.

You do not know how many entries of a relocation table will disappear until you actually perform the relocation.
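
The ordering constraint can be illustrated with a short Python sketch: sections are merged or added first, resolved relocation entries are removed, and only then are the file pointers assigned. The field names follow the list above, but the types, the function names and the file layout (all raw data first, then all relocation tables) are assumptions. The 10-byte entry size is that of a COFF relocation record.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SectionHeader:
    name: str
    size_of_raw_data: int
    pointer_to_raw_data: int = 0
    pointer_to_relocations: int = 0
    number_of_relocations: int = 0


def merge_header(headers: List[SectionHeader], new: SectionHeader) -> None:
    """Merge `new` into a same-named section, or add it as a separate section."""
    for h in headers:
        if h.name == new.name:
            h.size_of_raw_data += new.size_of_raw_data
            h.number_of_relocations += new.number_of_relocations
            return
    headers.append(new)


def assign_file_offsets(headers: List[SectionHeader], start: int, reloc_entry_size: int = 10) -> int:
    """Lay out raw data, then relocation tables; run this after relocation."""
    pos = start
    for h in headers:
        h.pointer_to_raw_data = pos
        pos += h.size_of_raw_data
    for h in headers:
        h.pointer_to_relocations = pos if h.number_of_relocations else 0
        pos += h.number_of_relocations * reloc_entry_size
    return pos
```

Calling `assign_file_offsets` before the in-section relocations are resolved would reserve space for entries that later disappear, which is why it runs last in this sketch.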