
Common database operations are slow #985

Closed
dvdkon opened this issue Sep 8, 2019 · 6 comments
Labels
Type: Enhancement New feature or request

Comments

dvdkon commented Sep 8, 2019

Is your feature request related to a problem? Please describe.
Currently, Ghidra uses its own custom database, which supports only rudimentary indexing and filtering. As a result, almost all filtering and sorting has to be done at a higher level by iterating over every candidate record (see TreeTextFilter). This makes common actions, such as searching for a symbol, very slow for the user (related issue: #500).

I measured the time to filter symbols by a given string (in the "Symbol Tree", using the "Contains" mode) in a large project (roughly a million symbols) at around 15 s (computer specs: Ryzen 3700X, 32 GB RAM, mid-range NVMe SSD). I then exported all symbols to a CSV file using Ghidra's Jython:

# Run in Ghidra's Jython console; writes address,name pairs as CSV.
outf = open("CSV_PATH", "w")
for s in currentProgram.getSymbolTable().getAllSymbols(False):
    # Double any embedded quotes so names containing '"' stay valid CSV
    name = s.getName().replace('"', '""')
    outf.write(str(s.getAddress().getOffset()) + ',"' + name + '"\n')
outf.close()

and imported the CSV into three SQL databases (H2, PostgreSQL, and SQLite):

-- For H2:
CREATE TABLE symbols (address bigint, name text)
AS SELECT * FROM csvread('CSV_PATH');
-- For PostgreSQL:
CREATE TABLE symbols (address bigint, name text);
COPY symbols FROM 'CSV_PATH' WITH (FORMAT CSV);
-- For SQLite:
CREATE TABLE symbols (address bigint, name text);
.mode csv
.import CSV_PATH symbols

A subsequent SELECT * FROM symbols WHERE name ILIKE '%SEARCH_TERM%' (or its equivalent) executed in ~1000 ms on H2, ~250 ms on PostgreSQL, and ~100 ms on SQLite.

This isn't meant as a comparison between those databases, but as evidence that, even without fulltext indexes, common SQL databases might be at least 15x faster than Ghidra's current code for this kind of query. I'm not very familiar with Ghidra's internals, so please point out any problems that might invalidate these results.

Describe the solution you'd like
I'd like to propose moving the filtering and sorting functionality into the database layer by adopting an existing database (a relational one seems most suitable), abstracting it under the same API used today, and adding "DB-aware" code to the hotspots that are currently slow. This would have the added benefit of relying on a well-maintained open-source project instead of custom code, making maintenance easier. A rough sketch of what such an abstraction could look like follows.
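For concreteness, here is a purely hypothetical Java sketch of such a "DB-aware" entry point. None of these types exist in Ghidra; SymbolStore and SymbolRow are invented names used only to illustrate the idea of pushing substring matching down into the storage layer:

import java.util.List;

// Hypothetical illustration only: SymbolStore and SymbolRow are invented
// names, not Ghidra API. The idea is that a hotspot like the Symbol Tree
// filter would delegate matching to the backing database instead of
// iterating over every record itself.
interface SymbolStore {

	// A relational backend could implement this with a single query, e.g.
	// SELECT address, name FROM symbols WHERE name ILIKE '%' || ? || '%'
	List<SymbolRow> findByNameContaining(String substring);
}

// Minimal value type for one result row.
class SymbolRow {
	final long address;
	final String name;

	SymbolRow(long address, String name) {
		this.address = address;
		this.name = name;
	}
}

The existing higher-level API could keep its current shape and simply call into an interface like this wherever filtering is hot.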

Describe alternatives you've considered
It is certainly possible to optimise Ghidra's current database until its performance is comparable to existing, faster databases. I feel, however, that this would take more work than integrating an existing database, both up front and especially in maintenance.

It's also possible that the current poor performance is caused by bugs that can be fixed without such architectural changes, which would definitely be preferable.

@dvdkon dvdkon added the Type: Enhancement New feature or request label Sep 8, 2019

ghost commented Sep 9, 2019

I'm pretty sure the custom database is not the issue here. I just ran a test that retrieved and filtered 1M records in 533 ms, which is in the ballpark of the numbers you mentioned. I suspect the problem has more to do with how we map the records to Symbol objects. We may be doing it inefficiently: we keep only the symbol record keys in memory, so every time we access a symbol field (such as its name) while filtering and sorting, we may have to retrieve the symbol record again. There is a soft cache for the Symbol objects, but if memory is low and the garbage collector runs, those objects are reclaimed, forcing us to go back to the database again and again for the same information. Anyway, this is something we can look into to see what is really causing the slowness.
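To illustrate the soft-cache behavior described above, here is a minimal, self-contained sketch of a soft-reference cache in plain Java. This is not Ghidra's actual implementation, just the general pattern: when memory is low, the GC clears the SoftReferences, and every field access turns back into a database read.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of a soft-reference cache, assuming a loader function
// that fetches a record from the database. Not Ghidra's actual code.
public class SoftCache<K, V> {
	private final Map<K, SoftReference<V>> cache = new HashMap<>();

	public V get(K key, Function<K, V> loadFromDb) {
		SoftReference<V> ref = cache.get(key);
		V value = (ref == null) ? null : ref.get();
		if (value == null) {
			// Under memory pressure the GC clears SoftReferences, so
			// every cache miss becomes another database round trip.
			value = loadFromDb.apply(key);
			cache.put(key, new SoftReference<>(value));
		}
		return value;
	}
}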


dvdkon commented Sep 9, 2019

@ghidragon Thanks for your insight. My guess was that spreading the filtering functionality across multiple classes and layers was part of the problem; it would be nice to confirm or rule that out. Could you post the code you used to test DB performance? I don't know how to access a project's database directly and would like to experiment myself.


ghost commented Sep 9, 2019

Here is the code I used to measure basic database speed. It creates a million random strings and stores them in the database, then iterates over all records and collects the ones that contain "abcd".

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import db.*;
import ghidra.util.UniversalIdGenerator;

public class DatabaseTest {

	public static void main(String[] args) throws IOException {
		UniversalIdGenerator.initialize();
		DBHandle dbHandle = new DBHandle();

		Schema schema = new Schema(1, "Key", new Class[] { LongField.class, StringField.class },
			new String[] { "Address", "Name" });

		// Store one million records, each with a random name
		long txId = dbHandle.startTransaction();
		Table table = dbHandle.createTable("Test", schema);

		for (int i = 0; i < 1000000; i++) {
			Record record = schema.createRecord(i);
			record.setLongValue(0, i);
			record.setString(1, createRandomString());
			table.putRecord(record);
		}
		dbHandle.endTransaction(txId, true);

		// Scan every record, keeping those whose name contains "abcd"
		RecordIterator iterator = table.iterator();
		System.out.println("\n\nGO");
		List<Record> records = new ArrayList<>();
		long startTime = System.currentTimeMillis();
		while (iterator.hasNext()) {
			Record record = iterator.next();
			String value = record.getString(1);
			if (value.contains("abcd")) {
				records.add(record);
			}
		}
		long endTime = System.currentTimeMillis();

		for (Record record : records) {
			System.out.println("" + record.getLongValue(0) + ", " + record.getString(1));
		}
		System.out.println(
			"\n\n found " + records.size() + " records in " + (endTime - startTime) + " ms");
	}

	// Generates a random lowercase string of length 0-24
	private static String createRandomString() {
		int stringLength = (int) (Math.random() * 25);
		StringBuffer buf = new StringBuffer();
		for (int i = 0; i < stringLength; i++) {
			char c = (char) ('a' + (int) (Math.random() * 26));
			buf.append(c);
		}
		return buf.toString();
	}
}

dragonmacher (Collaborator) commented

Also, here is a Ghidra script that shows iterating over all symbols in the database via the API, which is pretty fast.
SearchSymbolScript.java.txt

dragonmacher (Collaborator) commented Sep 9, 2019

It is worth mentioning that while investigating this, we have seen a few issues with the tables and trees that severely affect performance as the symbol count increases. These are things we intend to work on.

ryanmkurtz (Collaborator) commented

Since this doesn't appear to be a DB issue, I am going to close the ticket. Speedups for the Symbol Tree/Table can be tracked in #500.
