DWARF input format
al13n321 committed Oct 12, 2023
1 parent 363171e commit e405ad1
Showing 19 changed files with 1,047 additions and 24 deletions.
4 changes: 3 additions & 1 deletion contrib/llvm-project-cmake/CMakeLists.txt
@@ -6,7 +6,9 @@ endif()

 option (ENABLE_EMBEDDED_COMPILER "Enable support for JIT compilation during query execution" ${ENABLE_EMBEDDED_COMPILER_DEFAULT})

-if (NOT ENABLE_EMBEDDED_COMPILER)
+option (ENABLE_DWARF_PARSER "Enable support for DWARF input format" ON)
+
+if (NOT ENABLE_EMBEDDED_COMPILER AND NOT ENABLE_DWARF_PARSER)
     message(STATUS "Not using LLVM")
     return()
 endif()
48 changes: 48 additions & 0 deletions docs/en/interfaces/formats.md
@@ -87,6 +87,7 @@ The supported formats are:
 | [RawBLOB](#rawblob) |||
 | [MsgPack](#msgpack) |||
 | [MySQLDump](#mysqldump) |||
+| [DWARF](#dwarf) |||
 | [Markdown](#markdown) |||


@@ -2711,6 +2712,53 @@ FROM file(dump.sql, MySQLDump)
 └───┘
 ```

+## DWARF {#dwarf}
+
+Parses DWARF debug symbols from an ELF file (executable, library, or object file). Similar to `dwarfdump`, but much faster (hundreds of MB/s) and queryable with SQL. Produces one row for each Debug Information Entry (DIE) in the `.debug_info` section. The output includes the "null" entries that the DWARF encoding uses to terminate lists of children in the tree.
+
+Quick background: `.debug_info` consists of *units*, corresponding to compilation units. Each unit is a tree of *DIE*s, with a `compile_unit` DIE as its root. Each DIE has a *tag* and a list of *attributes*. Each attribute has a *name* and a *value* (and also a *form*, which specifies how the value is encoded). The DIEs represent things from the source code, and the *tag* of a DIE tells what kind of thing it is. For example, there are functions (tag = `subprogram`), classes/structs/enums (`class_type`/`structure_type`/`enumeration_type`), variables (`variable`), and function arguments (`formal_parameter`). The tree structure mirrors the corresponding source code. For example, a `class_type` DIE can contain `subprogram` DIEs representing methods of the class.
+
+Outputs the following columns:
+- `offset` - position of the DIE in the `.debug_info` section
+- `size` - number of bytes in the encoded DIE (including attributes)
+- `tag` - type of the DIE; the conventional "DW_TAG_" prefix is omitted
+- `unit_name` - name of the compilation unit containing this DIE
+- `unit_offset` - position of the compilation unit containing this DIE in the `.debug_info` section
+- `ancestor_tags` - array of tags of the ancestors of the current DIE in the tree, in order from innermost to outermost
+- `ancestor_offsets` - offsets of ancestors, parallel to `ancestor_tags`
+- a few common attributes duplicated from the attributes array for convenience:
+    - `name`
+    - `linkage_name` - mangled fully-qualified name; typically only functions have it (but not all functions)
+    - `decl_file` - name of the source code file where this entity was declared
+    - `decl_line` - line number in the source code where this entity was declared
+- parallel arrays describing attributes:
+    - `attr_name` - name of the attribute; the conventional "DW_AT_" prefix is omitted
+    - `attr_form` - how the attribute is encoded and interpreted; the conventional "DW_FORM_" prefix is omitted
+    - `attr_int` - integer value of the attribute; 0 if the attribute doesn't have a numeric value
+    - `attr_str` - string value of the attribute; empty if the attribute doesn't have a string value
+
+Example: find compilation units that have the most function definitions (including template instantiations and functions from included header files):
+```sql
+SELECT
+    unit_name,
+    count() AS c
+FROM file('programs/clickhouse', DWARF)
+WHERE tag = 'subprogram' AND NOT has(attr_name, 'declaration')
+GROUP BY unit_name
+ORDER BY c DESC
+LIMIT 3
+```
+```text
+┌─unit_name──────────────────────────────────────────────────┬─────c─┐
+│ ./src/Core/Settings.cpp                                    │ 28939 │
+│ ./src/AggregateFunctions/AggregateFunctionSumMap.cpp       │ 23327 │
+│ ./src/AggregateFunctions/AggregateFunctionUniqCombined.cpp │ 22649 │
+└────────────────────────────────────────────────────────────┴───────┘
+3 rows in set. Elapsed: 1.487 sec. Processed 139.76 million rows, 1.12 GB (93.97 million rows/s., 752.77 MB/s.)
+Peak memory usage: 271.92 MiB.
+```
+
 ## Markdown {#markdown}

 You can export results using [Markdown](https://en.wikipedia.org/wiki/Markdown) format to generate output ready to be pasted into your `.md` files:
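The `ancestor_tags` column and the "null" entries described in the DWARF section can be illustrated with a small stack walk. This is a minimal sketch and not the code from this commit: `Die` and `ancestorTags` are hypothetical names, the `has_children` flag stands in for what real DWARF stores in the DIE's abbreviation, and attributing a null entry to the parent whose child list it closes is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified DIE record (not the format's internal type).
struct Die
{
    std::string tag;   // an empty tag stands for a "null" entry
    bool has_children; // in real DWARF this comes from the abbreviation
};

// For each DIE, in .debug_info order, return the tags of its ancestors,
// innermost first, mirroring the description of `ancestor_tags`.
std::vector<std::vector<std::string>> ancestorTags(const std::vector<Die> & dies)
{
    std::vector<std::vector<std::string>> result;
    std::vector<std::string> stack; // outermost ancestor first

    for (const Die & die : dies)
    {
        // Reverse the stack so that the innermost ancestor comes first.
        result.emplace_back(stack.rbegin(), stack.rend());

        if (die.tag.empty())
        {
            // A null entry terminates the current list of children.
            if (!stack.empty())
                stack.pop_back();
        }
        else if (die.has_children)
        {
            // The following DIEs are children of this one.
            stack.push_back(die.tag);
        }
    }
    return result;
}
```

With the sequence `compile_unit`, `class_type`, `subprogram`, null, `variable`, null, the `subprogram` row gets `['class_type', 'compile_unit']` and the `variable` row gets `['compile_unit']`.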
2 changes: 2 additions & 0 deletions src/Common/CurrentMetrics.cpp
@@ -153,6 +153,8 @@
     M(ParquetDecoderThreadsActive, "Number of threads in the ParquetBlockInputFormat thread pool running a task.") \
     M(ParquetEncoderThreads, "Number of threads in ParquetBlockOutputFormat thread pool.") \
     M(ParquetEncoderThreadsActive, "Number of threads in ParquetBlockOutputFormat thread pool running a task.") \
+    M(DWARFReaderThreads, "Number of threads in the DWARFBlockInputFormat thread pool.") \
+    M(DWARFReaderThreadsActive, "Number of threads in the DWARFBlockInputFormat thread pool running a task.") \
     M(OutdatedPartsLoadingThreads, "Number of threads in the threadpool for loading Outdated data parts.") \
     M(OutdatedPartsLoadingThreadsActive, "Number of active threads in the threadpool for loading Outdated data parts.") \
     M(DistributedBytesToInsert, "Number of pending bytes to process for asynchronous insertion into Distributed tables. Number of bytes for every shard is summed.") \
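The `M(name, description)` lines follow the X-macro pattern: one list is expanded several times with different definitions of `M`. The sketch below shows the general technique only. `APPLY_FOR_SKETCH_METRICS`, `SketchMetrics` and `getDocumentation` are hypothetical names, and how `CurrentMetrics.cpp` actually expands its list is not shown in this diff.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical two-entry list in the same M(name, description) shape as the hunk above.
#define APPLY_FOR_SKETCH_METRICS(M) \
    M(DWARFReaderThreads, "Number of threads in the DWARFBlockInputFormat thread pool.") \
    M(DWARFReaderThreadsActive, "Number of threads in the DWARFBlockInputFormat thread pool running a task.")

namespace SketchMetrics
{
    // First expansion: one enumerator per list entry, plus END as the count.
    enum Metric : std::size_t
    {
    #define M(NAME, DOCUMENTATION) NAME,
        APPLY_FOR_SKETCH_METRICS(M)
    #undef M
        END
    };

    // Second expansion: the description strings, in the same order as the enumerators.
    inline std::string_view getDocumentation(Metric metric)
    {
        static constexpr std::string_view docs[] =
        {
        #define M(NAME, DOCUMENTATION) DOCUMENTATION,
            APPLY_FOR_SKETCH_METRICS(M)
        #undef M
        };
        return docs[metric];
    }
}
```

Adding a metric then takes a single new `M(...)` line, which is what this hunk does for the two DWARF reader thread counters.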
20 changes: 16 additions & 4 deletions src/Common/Elf.cpp
@@ -16,15 +16,27 @@ namespace ErrorCodes
 }


-Elf::Elf(const std::string & path)
-    : in(path, 0)
+Elf::Elf(const std::string & path_)
 {
+    in.emplace(path_, 0);
+    init(in->buffer().begin(), in->buffer().size(), path_);
+}
+
+Elf::Elf(const char * data, size_t size, const std::string & path_)
+{
+    init(data, size, path_);
+}
+
+void Elf::init(const char * data, size_t size, const std::string & path_)
+{
+    path = path_;
+    mapped = data;
+    elf_size = size;
+
     /// Check if it's an elf.
-    elf_size = in.buffer().size();
     if (elf_size < sizeof(ElfEhdr))
         throw Exception(ErrorCodes::CANNOT_PARSE_ELF, "The size of supposedly ELF file '{}' is too small", path);

-    mapped = in.buffer().begin();
     header = reinterpret_cast<const ElfEhdr *>(mapped);

     if (memcmp(header->e_ident, "\x7F""ELF", 4) != 0)
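The hunk above splits construction into an owning constructor (the object keeps the file mapping alive) and a borrowing constructor (the caller provides the bytes), with both funneling into `init`. A minimal standalone sketch of that pattern follows. `MiniElf` is a hypothetical stand-in, it reads the file into a `std::vector` instead of using `MMapReadBufferFromFile`, and it checks only the 4-byte magic rather than `sizeof(ElfEhdr)`.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <iterator>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified stand-in for the pattern in this hunk; not the real DB::Elf.
class MiniElf
{
public:
    // Owning variant: the object keeps the bytes alive itself.
    explicit MiniElf(const std::string & path_)
    {
        std::ifstream file(path_, std::ios::binary);
        owned.emplace(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
        init(owned->data(), owned->size(), path_);
    }

    // Borrowing variant: the caller guarantees that `data` outlives the object.
    MiniElf(const char * data, size_t size, const std::string & path_)
    {
        init(data, size, path_);
    }

    // `mapped` may point into `owned`, so copying would leave a dangling pointer.
    MiniElf(const MiniElf &) = delete;
    MiniElf & operator=(const MiniElf &) = delete;

    size_t size() const { return elf_size; }

private:
    std::string path; // only used in error messages
    std::optional<std::vector<char>> owned;
    const char * mapped = nullptr;
    size_t elf_size = 0;

    // Shared validation, so both constructors apply the same checks.
    void init(const char * data, size_t size, const std::string & path_)
    {
        path = path_;
        mapped = data;
        elf_size = size;

        if (elf_size < 4 || std::memcmp(mapped, "\x7F""ELF", 4) != 0)
            throw std::runtime_error("'" + path + "' does not look like an ELF file");
    }
};
```

Making the owned buffer optional is what lets one class serve both cases: it is populated only when the object was constructed from a path.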
22 changes: 12 additions & 10 deletions src/Common/Elf.h
@@ -9,16 +9,14 @@
 #include <functional>

 #include <elf.h>
-#include <link.h>


-using ElfAddr = ElfW(Addr);
-using ElfEhdr = ElfW(Ehdr);
-using ElfOff = ElfW(Off);
-using ElfPhdr = ElfW(Phdr);
-using ElfShdr = ElfW(Shdr);
-using ElfNhdr = ElfW(Nhdr);
-using ElfSym = ElfW(Sym);
+using ElfEhdr = Elf64_Ehdr;
+using ElfOff = Elf64_Off;
+using ElfPhdr = Elf64_Phdr;
+using ElfShdr = Elf64_Shdr;
+using ElfNhdr = Elf64_Nhdr;
+using ElfSym = Elf64_Sym;


 namespace DB
@@ -44,7 +42,8 @@ class Elf final
         const Elf & elf;
     };

-    explicit Elf(const std::string & path);
+    explicit Elf(const std::string & path_);
+    Elf(const char * data, size_t size, const std::string & path_);

     bool iterateSections(std::function<bool(const Section & section, size_t idx)> && pred) const;
     std::optional<Section> findSection(std::function<bool(const Section & section, size_t idx)> && pred) const;
@@ -64,13 +63,16 @@ class Elf final
     String getStoredBinaryHash() const;

 private:
-    MMapReadBufferFromFile in;
+    std::string path; // just for error messages
+    std::optional<MMapReadBufferFromFile> in;
     size_t elf_size;
     const char * mapped;
     const ElfEhdr * header;
     const ElfShdr * section_headers;
     const ElfPhdr * program_headers;
     const char * section_names = nullptr;
+
+    void init(const char * data, size_t size, const std::string & path_);
 };

 }
1 change: 1 addition & 0 deletions src/Common/config.h.in
@@ -43,6 +43,7 @@
 #cmakedefine01 USE_AMQPCPP
 #cmakedefine01 USE_NATSIO
 #cmakedefine01 USE_EMBEDDED_COMPILER
+#cmakedefine01 USE_DWARF_PARSER
 #cmakedefine01 USE_LDAP
 #cmakedefine01 USE_ROCKSDB
 #cmakedefine01 USE_LIBPQXX
7 changes: 5 additions & 2 deletions src/Formats/FormatFactory.cpp
@@ -256,7 +256,8 @@ InputFormatPtr FormatFactory::getInput(
     std::optional<size_t> _max_parsing_threads,
     std::optional<size_t> _max_download_threads,
     bool is_remote_fs,
-    CompressionMethod compression) const
+    CompressionMethod compression,
+    bool need_only_count) const
 {
     const auto& creators = getCreators(name);
     if (!creators.input_creator && !creators.random_access_input_creator)
@@ -284,7 +285,9 @@ InputFormatPtr FormatFactory::getInput(

     // Decide whether to use ParallelParsingInputFormat.

-    bool parallel_parsing = max_parsing_threads > 1 && settings.input_format_parallel_parsing && creators.file_segmentation_engine && !creators.random_access_input_creator;
+    bool parallel_parsing =
+        max_parsing_threads > 1 && settings.input_format_parallel_parsing && creators.file_segmentation_engine &&
+        !creators.random_access_input_creator && !need_only_count;

     if (settings.max_memory_usage && settings.min_chunk_bytes_for_parallel_parsing * max_parsing_threads * 2 > settings.max_memory_usage)
         parallel_parsing = false;
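The decision in this hunk can be read as a pure function of a few inputs, which makes the effect of the new `need_only_count` flag easy to check. The sketch below covers only the conditions visible in the hunk; the real function continues past the shown lines. `ParsingContext` and `shouldUseParallelParsing` are hypothetical names, and the fields flatten what the real code reads from `settings` and `creators`.

```cpp
#include <cstddef>

// Hypothetical flattened inputs for the decision shown in the hunk above.
struct ParsingContext
{
    size_t max_parsing_threads = 1;
    bool input_format_parallel_parsing = true;
    bool has_file_segmentation_engine = false;
    bool has_random_access_input_creator = false;
    bool need_only_count = false;
    size_t max_memory_usage = 0; // 0 disables the memory guard below
    size_t min_chunk_bytes_for_parallel_parsing = 0;
};

bool shouldUseParallelParsing(const ParsingContext & ctx)
{
    // Same conjunction as in the hunk, including the new need_only_count term.
    bool parallel_parsing =
        ctx.max_parsing_threads > 1 && ctx.input_format_parallel_parsing && ctx.has_file_segmentation_engine &&
        !ctx.has_random_access_input_creator && !ctx.need_only_count;

    // Fall back to sequential parsing if the chunk buffers alone could exceed the memory limit.
    if (ctx.max_memory_usage && ctx.min_chunk_bytes_for_parallel_parsing * ctx.max_parsing_threads * 2 > ctx.max_memory_usage)
        parallel_parsing = false;

    return parallel_parsing;
}
```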
3 changes: 2 additions & 1 deletion src/Formats/FormatFactory.h
@@ -167,7 +167,8 @@ class FormatFactory final : private boost::noncopyable
         bool is_remote_fs = false,
         // allows to do: buf -> parallel read -> decompression,
         // because parallel read after decompression is not possible
-        CompressionMethod compression = CompressionMethod::None) const;
+        CompressionMethod compression = CompressionMethod::None,
+        bool need_only_count = false) const;

     /// Checks all preconditions. Returns ordinary format if parallel formatting cannot be done.
     OutputFormatPtr getOutputFormatParallelIfPossible(
4 changes: 4 additions & 0 deletions src/Formats/registerFormats.cpp
@@ -101,6 +101,7 @@ void registerInputFormatJSONAsObject(FormatFactory & factory);
 void registerInputFormatLineAsString(FormatFactory & factory);
 void registerInputFormatMySQLDump(FormatFactory & factory);
 void registerInputFormatParquetMetadata(FormatFactory & factory);
+void registerInputFormatDWARF(FormatFactory & factory);
 void registerInputFormatOne(FormatFactory & factory);

 #if USE_HIVE
@@ -143,6 +144,7 @@ void registerTemplateSchemaReader(FormatFactory & factory);
 void registerMySQLSchemaReader(FormatFactory & factory);
 void registerBSONEachRowSchemaReader(FormatFactory & factory);
 void registerParquetMetadataSchemaReader(FormatFactory & factory);
+void registerDWARFSchemaReader(FormatFactory & factory);
 void registerOneSchemaReader(FormatFactory & factory);

 void registerFileExtensions(FormatFactory & factory);
@@ -245,6 +247,7 @@ void registerFormats()
     registerInputFormatMySQLDump(factory);

     registerInputFormatParquetMetadata(factory);
+    registerInputFormatDWARF(factory);
     registerInputFormatOne(factory);

     registerNonTrivialPrefixAndSuffixCheckerJSONEachRow(factory);
@@ -282,6 +285,7 @@ void registerFormats()
     registerMySQLSchemaReader(factory);
     registerBSONEachRowSchemaReader(factory);
     registerParquetMetadataSchemaReader(factory);
+    registerDWARFSchemaReader(factory);
     registerOneSchemaReader(factory);
 }
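The registration hunks follow a simple pattern: each format has a free function that adds its creator to the factory, and `registerFormats()` calls them all once. A toy version is sketched below, assuming a much smaller interface than the real `FormatFactory`. `MiniFormatFactory`, its `InputCreator` signature and `has` are hypothetical.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy registry; the real FormatFactory has a much richer interface.
class MiniFormatFactory
{
public:
    using InputCreator = std::function<std::string(const std::string & input)>;

    void registerInputFormat(const std::string & name, InputCreator creator)
    {
        // emplace() reports through .second whether the name was new.
        if (!creators.emplace(name, std::move(creator)).second)
            throw std::logic_error("Input format " + name + " is already registered");
    }

    bool has(const std::string & name) const { return creators.count(name) != 0; }

private:
    std::map<std::string, InputCreator> creators;
};

// One free function per format, as in the declarations above.
void registerInputFormatDWARF(MiniFormatFactory & factory)
{
    factory.registerInputFormat("DWARF", [](const std::string & input) { return "parsed " + input; });
}

// Central place that wires all formats into the factory.
void registerFormats(MiniFormatFactory & factory)
{
    registerInputFormatDWARF(factory);
}
```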
