
Run load tests #275

Open
RobertCraigie opened this issue Feb 6, 2022 · 7 comments
Labels
kind/improvement An improvement to existing feature and code. level/intermediate priority/high topic: internal Improvement or addition to non user facing code

Comments

@RobertCraigie
Owner

RobertCraigie commented Feb 6, 2022

Problem

We have not yet tested how well Prisma Client Python scales. Scalability is a core concern with ORMs / database clients.

Suggested solution

Use Locust to write tests.

We should also include these tests in CI so that we can fail checks if there are any regressions and upload the results so they can be viewed on GitHub pages.
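Locust is the tool suggested above; as a rough stdlib-only sketch of the kind of numbers such a load test would report (throughput and latency percentiles), with the actual query stubbed out — the `run_query` stub and all names here are illustrative, not part of any real test suite:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_query() -> None:
    # Stand-in for a real Prisma Client Python call; a Locust test
    # would issue real queries against a running application instead.
    time.sleep(0.001)

def timed_call(_: int) -> float:
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start

def load_test(total_requests: int = 200, concurrency: int = 10) -> dict:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - start
    return {
        "requests_per_sec": total_requests / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[int(len(latencies) * 0.95)] * 1000,
    }

if __name__ == "__main__":
    print(load_test())
```

A CI job could run this kind of harness against a fixed workload and fail the check when a percentile regresses past a threshold.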

@RobertCraigie RobertCraigie added topic: internal Improvement or addition to non user facing code kind/improvement An improvement to existing feature and code. level/intermediate priority/high labels Feb 6, 2022
@tday

tday commented Feb 9, 2022

I have some basic performance data available. I can give a more specific timing on sections if you tell me what you're interested in profiling.


Last week, I built a Flask API to ingest a file with 256 thousand rows and 21 columns into Postgres. It's about 100MB of plaintext CSV data. The data is inserted into three tables:

  • Table 1 - one row with metadata about the file (1 row, 6 columns)
  • Table 2 - all rows within the file (256k rows, 21 columns)
  • Table 3 - metadata table tracking all possible items in the database (256k rows, 9 columns)

Here's an anecdotal version of performance:

  • Processing the file is very fast, so there are no significant performance hits before reaching prisma
  • prisma on the Python side is slow to convert data into objects ready to send to the engine. A lot of time is spent in prisma.builder.create, especially creating children (which are the 256k rows)
  • httpx is slow sending/receiving data
  • prisma is fast returning data

I had two versions implemented:

  1. No transactions, using the connection pool:
  • Batched rows to create_many. Upsert for Table 3, no upsert for Tables 1 and 2
  • ~3 mins request to response
  2. Transaction around Tables 1 + 2, no transaction for Table 3 but with upserts:
  • Batched rows to Table 3, use create with relationships for Tables 1 + 2
  • Needed to increase the httpx timeout to support this
  • ~6 mins request to response

We're okay with 2 for now, but ultimately we might move to a raw query transaction using INSERT + COPY instead, for the performance wins plus a transaction.
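The data-preparation half of the INSERT + COPY route can be sketched with the stdlib alone; the actual COPY would go through the driver (psycopg2's copy_expert is shown in a comment only), and the table/column names below are hypothetical:

```python
import csv
import io

def rows_to_copy_buffer(rows: list, columns: list) -> io.StringIO:
    """Serialise rows (dicts) into an in-memory CSV suitable for
    COPY ... FROM STDIN, avoiding per-row INSERT round-trips."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[col] for col in columns])
    buf.seek(0)
    return buf

# Usage with a real driver would look roughly like this
# (psycopg2 shown, column names hypothetical):
#   cur.copy_expert(
#       "COPY probe_build (locus_name, snp) FROM STDIN WITH (FORMAT csv)",
#       rows_to_copy_buffer(rows, ["locus_name", "snp"]),
#   )
```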

@RobertCraigie
Owner Author

RobertCraigie commented Feb 10, 2022

Wow! Thank you for the very detailed report. It will certainly be helpful.

I can give a more specific timing on sections if you tell me what you're interested in profiling.

I'm most interested in profiling the parts of the code base that we can improve; things like building the internal Prisma query, sending the query to the internal engine, deserialising records, etc.

Batched rows

Just to double check, by batched rows are you referring to the query batcher? e.g.

async with client.batch_() as batcher:
	batcher.user.create({'name': 'Robert'})

If not, I wonder if it would be possible to ingest all the data in a single batched query? This would give you the benefits of running in a transaction and hopefully performance benefits too. You would obviously still be bottlenecked by httpx.


This has given me an idea for a potential performance improvement - Prisma Client Python will still convert the data that Prisma gives us into pydantic models even when the returned models are redundant. Pydantic validation (while fast) can incur significant performance costs, especially when deserialising hundreds of thousands of records.

We should add a parameter to actions to disable returning records.
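A stdlib sketch of what such a toggle could look like internally - the `skip_model` parameter name is hypothetical, and a plain dataclass stands in for a generated pydantic model here:

```python
from dataclasses import dataclass

@dataclass
class User:
    # Stand-in for a generated pydantic model.
    id: int
    name: str

def deserialise(records: list, skip_model: bool = False):
    """Return the raw dicts untouched when the caller does not need
    model instances, skipping per-record construction/validation cost."""
    if skip_model:
        return records
    return [User(**record) for record in records]

raw = [{"id": 1, "name": "Robert"}]
assert deserialise(raw, skip_model=True) is raw    # no per-record work
assert deserialise(raw)[0].name == "Robert"        # full model instances
```

For bulk inserts where the caller discards the result, skipping this step removes one of the per-record costs entirely.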

@RobertCraigie
Owner Author

I've been looking into this a bit myself with a dataset of 500,000 records with 14 columns.

On my local machine:

  • Loading from JSON took just over 1 second
  • Serialising raw data to internal queries took 37 seconds
  • Executing queries (500,000 batched create calls) took 285 seconds
  • Executing queries (500,000 records in a single create_many) took 91 seconds

@tday So if I'm understanding your use case correctly, it looks like your best bet would be to wrap multiple inserts within a batched query:

async with client.batch_() as batcher:
    batcher.table1.create_many(...)
    batcher.table2.create_many(...)
    batcher.table3.create_many(...)

Bear in mind that create_many does not support nested relational fields, so if there are any relations between these tables you will either have to use a pre-known ID for each relation or split the writes up and forgo the batcher.
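The pre-known ID approach can be generated client-side before the batch, so create_many rows can reference each other without needing a RETURNING round-trip. A minimal sketch using uuid4 (all field and file names here are hypothetical):

```python
import uuid

# Generate the parent row's primary key up front...
file_id = str(uuid.uuid4())
file_row = {"id": file_id, "name": "chip_manifest.csv"}

# ...so child rows can embed the foreign key before anything is inserted.
child_rows = [
    {"id": str(uuid.uuid4()), "file_id": file_id, "locus_name": f"locus_{i}"}
    for i in range(3)
]

assert all(row["file_id"] == file_row["id"] for row in child_rows)
```

Both the parent and the children can then go through plain create_many calls inside the batcher, since no row depends on a database-generated ID.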

@tday

tday commented Feb 10, 2022

I actually did something a bit different: chunking the create_many into multiple create_many calls and then using asyncio.gather to maximise the IO-bound time. This gave about the same time as your batch version, but without a transaction. Unfortunately, I do need the relationship, which the batcher doesn't support.
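The chunk-and-gather approach can be sketched like this, with the create_many call stubbed out (the chunk size and names are illustrative only):

```python
import asyncio

async def create_many(rows: list) -> int:
    # Stand-in for client.table2.create_many(data=rows).
    await asyncio.sleep(0)
    return len(rows)

def chunked(rows: list, size: int):
    """Yield successive fixed-size slices of rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

async def ingest(rows: list, chunk_size: int = 1000) -> int:
    # Fire all chunks concurrently so the IO-bound round-trips overlap.
    counts = await asyncio.gather(
        *(create_many(chunk) for chunk in chunked(rows, chunk_size))
    )
    return sum(counts)

rows = [{"n": i} for i in range(2500)]
assert asyncio.run(ingest(rows)) == 2500
```

The trade-off versus the batcher is exactly as described above: the chunks run in separate queries, so there is no single transaction wrapping them.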

It seems like a nested write with a one-to-many relation executes many create calls rather than a create_many. Is that a potential improvement?

Alternatively, it'd be nice if you could do something like:

async with client.batch_() as batcher:
    result = batcher.table1.create(...)
    batcher.table2.create_many([{'table1_id': result.id, ...}])
    batcher.table3.create_many(...)

prisma supports this with interactive transactions

@RobertCraigie
Owner Author

I actually did something a bit different: chunking the create_many into multiple create_many calls and then using asyncio.gather to maximise the IO-bound time.

That is a smart solution 👍

It seems like a nested write with a one to many relations executes many creates rather than a create_many. Is that a potential improvement?

Could you elaborate more on what you mean please? Are you talking about the actual generated SQL queries?

prisma supports this with interactive transactions

I do have provisional support for this in a WIP branch, you could try it out by installing like so: pip install -U git+https://github.com/RobertCraigie/prisma-client-py@wip/transactions.

Unfortunately, when I started working on this there were a couple of kinks that still needed to be ironed out on the Prisma side before I could officially add support. This has been improved in the latest releases; I just haven't gotten around to finishing up the implementation yet: #53

@tday

tday commented Feb 10, 2022

It seems like a nested write with a one to many relations executes many creates rather than a create_many. Is that a potential improvement?

Could you elaborate more on what you mean please? Are you talking about the actual generated SQL queries?

Yea, the actual generated queries are different, which seems to have a performance impact at a glance. I need to do more profiling to prove out this theory. Here is what it looks like at the database:

create_many

2022-02-10 13:53:46.801 UTC [4486] LOG:  AUDIT: SESSION,8,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""genome_build_version"",""source_sequence"",""source_strand"",""snp"",""updated_at"",""address_a_id"",""source"",""address_b_id"",""allele_b_probe_sequence"",""created_at"",""chromosome"",""allele_a_probe_sequence"",""illumina_strand"",""locus_name"",""top_genomic_sequence"",""source_version"",""created_by"",""updated_by"",""illumina_id"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20), ($21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40), ($41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60), ($61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72,$73,$74,$75,$76,$77,$78,$79,$80), ($81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100)

create with relation
2022-02-10 13:54:06.795 UTC [4490] LOG:  AUDIT: SESSION,8,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.798 UTC [4490] LOG:  AUDIT: SESSION,9,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.799 UTC [4490] LOG:  AUDIT: SESSION,10,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.801 UTC [4490] LOG:  AUDIT: SESSION,11,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.802 UTC [4490] LOG:  AUDIT: SESSION,12,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>


@RobertCraigie
Owner Author

Yea, the actual generated queries are different

I see, I wonder if it would be feasible for Prisma to be smart enough to create a single query from those batched queries instead of creating multiple queries.

I do not know how the SQL queries are generated internally; that's all handled by the Prisma engines team, so it might just not be possible. I'll have a look.
