
Common database operations are slow #985

Closed
dvdkon opened this issue Sep 8, 2019 · 6 comments
Labels
Type: Enhancement New feature or request

Comments

dvdkon commented Sep 8, 2019

Is your feature request related to a problem? Please describe.
Currently, Ghidra uses its own custom database, which supports only rudimentary indexing and filtering. As a result, almost all filtering and sorting has to be done at a higher level by iterating over every candidate record (see TreeTextFilter). This makes common actions, such as searching for a symbol, very slow for the user (related issue: #500).

I measured the time to filter symbols by a given string (in the "Symbol Tree", using the "Contains" mode) in a large project (roughly a million symbols) at around 15 s (computer specs: Ryzen 3700X, 32 GB RAM, mid-range NVMe SSD). I then exported all symbols to a CSV file using Ghidra's Jython:

# Run in Ghidra's Jython console; writes address,name pairs as CSV.
outf = open("CSV_PATH", "w")
for s in currentProgram.getSymbolTable().getAllSymbols(False):
    # Double any embedded quotes so names containing '"' stay valid CSV
    name = s.getName().replace('"', '""')
    outf.write(str(s.getAddress().getOffset()) + ',"' + name + '"\n')
outf.close()

and imported the CSV into three SQL databases (H2, PostgreSQL, and SQLite):

-- For H2:
CREATE TABLE symbols (address bigint, name text)
AS SELECT * FROM csvread('CSV_PATH');
-- For PostgreSQL:
CREATE TABLE symbols (address bigint, name text);
COPY symbols FROM 'CSV_PATH' WITH (FORMAT CSV);
-- For SQLite:
CREATE TABLE symbols (address bigint, name text);
.mode csv
.import CSV_PATH symbols

A subsequent SELECT * FROM symbols WHERE name ILIKE '%SEARCH_TERM%' (or its equivalent) executed in ~1000 ms on H2, ~250 ms on PostgreSQL, and ~100 ms on SQLite.

This isn't meant as a comparison between those databases, but as evidence that, even without fulltext indexes, common SQL databases might be at least 15x faster than Ghidra's current code for this kind of query. I'm not very familiar with Ghidra's internals, so please point out any problems that might invalidate these results.

Describe the solution you'd like
I'd like to propose moving the filtering and sorting functionality into the database layer by adopting an existing database (a relational one seems most suitable), abstracting it under the same API used today, and adding "DB-aware" code to the hotspots that are currently slow. This would have the added benefit of relying on a well-maintained open-source project instead of custom code, making maintenance easier. A rough sketch of what such an abstraction could look like follows.
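For concreteness, here is a purely hypothetical Java sketch of such a "DB-aware" entry point. None of these types exist in Ghidra; SymbolStore and SymbolRow are invented names used only to illustrate the idea of pushing substring matching down into the storage layer:

import java.util.List;

// Hypothetical illustration only: SymbolStore and SymbolRow are invented
// names, not Ghidra API. The idea is that a hotspot like the Symbol Tree
// filter would delegate matching to the backing database instead of
// iterating over every record itself.
interface SymbolStore {

	// A relational backend could implement this with a single query, e.g.
	// SELECT address, name FROM symbols WHERE name ILIKE '%' || ? || '%'
	List<SymbolRow> findByNameContaining(String substring);
}

// Minimal value type for one result row.
class SymbolRow {
	final long address;
	final String name;

	SymbolRow(long address, String name) {
		this.address = address;
		this.name = name;
	}
}

The existing higher-level API could keep its current shape and simply call into an interface like this wherever filtering is hot.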

Describe alternatives you've considered
It is certainly possible to optimise Ghidra's current database until its performance is comparable to existing, faster databases. I feel, however, that this would take more work than integrating an existing database, both up front and especially in maintenance.

It's also possible that the current poor performance is caused by bugs that can be fixed without such architectural changes, which would definitely be preferable.

@dvdkon dvdkon added the Type: Enhancement New feature or request label Sep 8, 2019

ghost commented Sep 9, 2019

I'm pretty sure the custom database is not the issue here. I just ran a test that retrieved and filtered 1M records in 533 ms, which is in the ballpark of the numbers you mentioned. I suspect the problem has more to do with how we map the records to Symbol objects. We may be doing it inefficiently: we keep only the symbol record keys in memory, so every time we access a symbol field (such as its name) while filtering and sorting, we may have to retrieve the symbol record again. There is a soft cache for the Symbol objects, but if memory is low and the garbage collector runs, those objects are reclaimed, forcing us to go back to the database again and again for the same information. Anyway, this is something we can look into to see what is really causing the slowness.
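To illustrate the soft-cache behavior described above, here is a minimal, self-contained sketch of a soft-reference cache in plain Java. This is not Ghidra's actual implementation, just the general pattern: when memory is low, the GC clears the SoftReferences, and every field access turns back into a database read.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of a soft-reference cache, assuming a loader function
// that fetches a record from the database. Not Ghidra's actual code.
public class SoftCache<K, V> {
	private final Map<K, SoftReference<V>> cache = new HashMap<>();

	public V get(K key, Function<K, V> loadFromDb) {
		SoftReference<V> ref = cache.get(key);
		V value = (ref == null) ? null : ref.get();
		if (value == null) {
			// Under memory pressure the GC clears SoftReferences, so
			// every cache miss becomes another database round trip.
			value = loadFromDb.apply(key);
			cache.put(key, new SoftReference<>(value));
		}
		return value;
	}
}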


dvdkon commented Sep 9, 2019

@ghidragon Thanks for your insight. My guess was that spreading the filtering functionality across multiple classes and layers was part of the problem; it would be nice to confirm or rule that out. Could you post the code you used to test DB performance? I don't know how to access a project's database directly and would like to experiment myself.


ghost commented Sep 9, 2019

Here is the code I used to measure basic database speed. It creates a million random strings and stores them in the database, then iterates over all records and collects the ones that contain "abcd".

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import db.*;
import ghidra.util.UniversalIdGenerator;

public class DatabaseTest {

	public static void main(String[] args) throws IOException {
		UniversalIdGenerator.initialize();
		DBHandle dbHandle = new DBHandle();

		Schema schema = new Schema(1, "Key", new Class[] { LongField.class, StringField.class },
			new String[] { "Address", "Name" });

		// Store one million records, each with a random name
		long txId = dbHandle.startTransaction();
		Table table = dbHandle.createTable("Test", schema);

		for (int i = 0; i < 1000000; i++) {
			Record record = schema.createRecord(i);
			record.setLongValue(0, i);
			record.setString(1, createRandomString());
			table.putRecord(record);
		}
		dbHandle.endTransaction(txId, true);

		// Scan every record, keeping those whose name contains "abcd"
		RecordIterator iterator = table.iterator();
		System.out.println("\n\nGO");
		List<Record> records = new ArrayList<>();
		long startTime = System.currentTimeMillis();
		while (iterator.hasNext()) {
			Record record = iterator.next();
			String value = record.getString(1);
			if (value.contains("abcd")) {
				records.add(record);
			}
		}
		long endTime = System.currentTimeMillis();

		for (Record record : records) {
			System.out.println("" + record.getLongValue(0) + ", " + record.getString(1));
		}
		System.out.println(
			"\n\n found " + records.size() + " records in " + (endTime - startTime) + " ms");
	}

	// Generates a random lowercase string of length 0-24
	private static String createRandomString() {
		int stringLength = (int) (Math.random() * 25);
		StringBuffer buf = new StringBuffer();
		for (int i = 0; i < stringLength; i++) {
			char c = (char) ('a' + (int) (Math.random() * 26));
			buf.append(c);
		}
		return buf.toString();
	}
}

dragonmacher (Collaborator) commented

Also, here is a Ghidra script that shows iterating over all symbols in the database via the API, which is pretty fast.
SearchSymbolScript.java.txt

dragonmacher (Collaborator) commented Sep 9, 2019

It is worth mentioning that while investigating this, we have seen a few issues with the tables and trees that severely affect performance as the symbol count increases. These are things we intend to work on.

ryanmkurtz (Collaborator) commented

Since this doesn't appear to be a DB issue, I am going to close the ticket. Speedups for the Symbol Tree/Table can be tracked in #500.
