
Run load tests #275

Open
RobertCraigie opened this issue Feb 6, 2022 · 7 comments
Labels
kind/improvement An improvement to existing feature and code. level/intermediate priority/high topic: internal Improvement or addition to non user facing code

Comments

@RobertCraigie
Owner

RobertCraigie commented Feb 6, 2022

Problem

We have not yet tested how well Prisma Client Python scales. Scalability is a core concern with ORMs / database clients.

Suggested solution

Use Locust to write tests.

We should also include these tests in CI so that we can fail checks if there are any regressions and upload the results so they can be viewed on GitHub pages.
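Locust is the tool suggested above; as a rough stdlib-only sketch of the kind of numbers such a load test would report (throughput and latency percentiles), with the actual query stubbed out — the `run_query` stub and all names here are illustrative, not part of any real test suite:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_query() -> None:
    # Stand-in for a real Prisma Client Python call; a Locust test
    # would issue real queries against a running application instead.
    time.sleep(0.001)

def timed_call(_: int) -> float:
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start

def load_test(total_requests: int = 200, concurrency: int = 10) -> dict:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - start
    return {
        "requests_per_sec": total_requests / elapsed,
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[int(len(latencies) * 0.95)] * 1000,
    }

if __name__ == "__main__":
    print(load_test())
```

A CI job could run this kind of harness against a fixed workload and fail the check when a percentile regresses past a threshold.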

@RobertCraigie RobertCraigie added topic: internal Improvement or addition to non user facing code kind/improvement An improvement to existing feature and code. level/intermediate priority/high labels Feb 6, 2022
@tday

tday commented Feb 9, 2022

I have some basic performance data available. I can give a more specific timing on sections if you tell me what you're interested in profiling.


Last week, I built a Flask API to ingest a file with 256 thousand rows and 21 columns into Postgres. It's about 100MB of plaintext CSV data. The data is inserted into three tables:

  • Table 1 - one row with metadata about the file (1 row, 6 columns)
  • Table 2 - all rows within the file (256k rows, 21 columns)
  • Table 3 - metadata table tracking all possible items in the database (256k rows, 9 columns)

Here's an anecdotal version of performance:

  • Processing the file is very fast, so there are no significant performance hits before reaching prisma
  • prisma on the Python side is slow to convert data into objects ready to send to the engine. A lot of time is spent in prisma.builder.create, especially creating children (which are the 256k rows)
  • httpx is slow sending/receiving data
  • prisma is fast returning data

I had two versions implemented:

  1. No transactions, using the connection pool:
  • Batched rows to create_many. Upsert for Table 3, no upsert for Tables 1 and 2
  • ~3 mins request to response
  2. Transaction around Tables 1 + 2, no transaction for Table 3 but with upserts:
  • Batched rows to Table 3, use create with relationships for Tables 1 + 2
  • Needed to increase the httpx timeout to support this
  • ~6 mins request to response

We're okay with 2 for now, but ultimately we might move to a raw query transaction using INSERT + COPY instead, for the performance wins plus a transaction.
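The data-preparation half of the INSERT + COPY route can be sketched with the stdlib alone; the actual COPY would go through the driver (psycopg2's copy_expert is shown in a comment only), and the table/column names below are hypothetical:

```python
import csv
import io

def rows_to_copy_buffer(rows: list, columns: list) -> io.StringIO:
    """Serialise rows (dicts) into an in-memory CSV suitable for
    COPY ... FROM STDIN, avoiding per-row INSERT round-trips."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[col] for col in columns])
    buf.seek(0)
    return buf

# Usage with a real driver would look roughly like this
# (psycopg2 shown, column names hypothetical):
#   cur.copy_expert(
#       "COPY probe_build (locus_name, snp) FROM STDIN WITH (FORMAT csv)",
#       rows_to_copy_buffer(rows, ["locus_name", "snp"]),
#   )
```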

@RobertCraigie
Owner Author

RobertCraigie commented Feb 10, 2022

Wow! Thank you for the very detailed report. It will certainly be helpful.

I can give a more specific timing on sections if you tell me what you're interested in profiling.

I'm most interested in profiling the parts of the code base that we can improve; things like building the internal Prisma query, sending the query to the internal engine, deserialising records, etc.

Batched rows

Just to double check, by batched rows are you referring to the query batcher? e.g.

async with client.batch_() as batcher:
	batcher.user.create({'name': 'Robert'})

If not, I wonder if it would be possible to ingest all the data in a single batched query? This would give you the benefits of running in a transaction and hopefully performance benefits too. You would obviously still be bottlenecked by httpx.


This has given me an idea for a potential performance improvement - Prisma Client Python will still convert the data that Prisma gives us into pydantic models even when the returned models are redundant. Pydantic validation (while fast) can incur significant performance costs, especially when deserialising hundreds of thousands of records.

We should add a parameter to actions to disable returning records.
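A stdlib sketch of what such a toggle could look like internally - the `skip_model` parameter name is hypothetical, and a plain dataclass stands in for a generated pydantic model here:

```python
from dataclasses import dataclass

@dataclass
class User:
    # Stand-in for a generated pydantic model.
    id: int
    name: str

def deserialise(records: list, skip_model: bool = False):
    """Return the raw dicts untouched when the caller does not need
    model instances, skipping per-record construction/validation cost."""
    if skip_model:
        return records
    return [User(**record) for record in records]

raw = [{"id": 1, "name": "Robert"}]
assert deserialise(raw, skip_model=True) is raw    # no per-record work
assert deserialise(raw)[0].name == "Robert"        # full model instances
```

For bulk inserts where the caller discards the result, skipping this step removes one of the per-record costs entirely.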

@RobertCraigie
Owner Author

I've been looking into this a bit myself with a dataset of 500,000 records with 14 columns.

On my local machine:

  • Loading from JSON took just over 1 second
  • Serialising raw data to internal queries took 37 seconds
  • Executing queries (500,000 batched create calls) took 285 seconds
  • Executing queries (500,000 records in a single create_many) took 91 seconds

@tday So if I'm understanding your use case correctly, it looks like your best bet would be to wrap multiple inserts within a batched query:

async with client.batch_() as batcher:
    batcher.table1.create_many(...)
    batcher.table2.create_many(...)
    batcher.table3.create_many(...)

Bear in mind that create_many does not support nested relational fields, so if there are any relations between these tables you will either have to use a pre-known ID for each relation or split the writes up and forgo the batcher.
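The pre-known ID approach can be generated client-side before the batch, so create_many rows can reference each other without needing a RETURNING round-trip. A minimal sketch using uuid4 (all field and file names here are hypothetical):

```python
import uuid

# Generate the parent row's primary key up front...
file_id = str(uuid.uuid4())
file_row = {"id": file_id, "name": "chip_manifest.csv"}

# ...so child rows can embed the foreign key before anything is inserted.
child_rows = [
    {"id": str(uuid.uuid4()), "file_id": file_id, "locus_name": f"locus_{i}"}
    for i in range(3)
]

assert all(row["file_id"] == file_row["id"] for row in child_rows)
```

Both the parent and the children can then go through plain create_many calls inside the batcher, since no row depends on a database-generated ID.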

@tday

tday commented Feb 10, 2022

I actually did something a bit different: chunking the create_many into multiple create_many calls and then using asyncio.gather to maximise the IO-bound time. This gave about the same time as your batch version, but without a transaction. Unfortunately, I do need the relationship, which the batcher doesn't support.
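The chunk-and-gather approach can be sketched like this, with the create_many call stubbed out (the chunk size and names are illustrative only):

```python
import asyncio

async def create_many(rows: list) -> int:
    # Stand-in for client.table2.create_many(data=rows).
    await asyncio.sleep(0)
    return len(rows)

def chunked(rows: list, size: int):
    """Yield successive fixed-size slices of rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

async def ingest(rows: list, chunk_size: int = 1000) -> int:
    # Fire all chunks concurrently so the IO-bound round-trips overlap.
    counts = await asyncio.gather(
        *(create_many(chunk) for chunk in chunked(rows, chunk_size))
    )
    return sum(counts)

rows = [{"n": i} for i in range(2500)]
assert asyncio.run(ingest(rows)) == 2500
```

The trade-off versus the batcher is exactly as described above: the chunks run in separate queries, so there is no single transaction wrapping them.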

It seems like a nested write with a one-to-many relation executes many create calls rather than a create_many. Is that a potential improvement?

Alternatively, it'd be nice if you could do something like:

async with client.batch_() as batcher:
    result = batcher.table1.create(...)
    batcher.table2.create_many([{'table1_id': result.id, ...}])
    batcher.table3.create_many(...)

prisma supports this with interactive transactions

@RobertCraigie
Owner Author

I actually did something a bit different: chunking the create_many into multiple create_many calls and then using asyncio.gather to maximise the IO-bound time.

That is a smart solution 👍

It seems like a nested write with a one to many relations executes many creates rather than a create_many. Is that a potential improvement?

Could you elaborate more on what you mean please? Are you talking about the actual generated SQL queries?

prisma supports this with interactive transactions

I do have provisional support for this in a WIP branch, you could try it out by installing like so: pip install -U git+https://github.com/RobertCraigie/prisma-client-py@wip/transactions.

Unfortunately, when I started working on this there were a couple of kinks that still needed to be ironed out on the Prisma side before I could officially add support. This has been improved in the latest releases; I just haven't gotten around to finishing up the implementation yet: #53

@tday

tday commented Feb 10, 2022

It seems like a nested write with a one to many relations executes many creates rather than a create_many. Is that a potential improvement?

Could you elaborate more on what you mean please? Are you talking about the actual generated SQL queries?

Yea, the actual generated queries are different, which seems to have a performance impact at a glance. I need to do more profiling to prove out this theory. Here is what it looks like at the database:

create_many

2022-02-10 13:53:46.801 UTC [4486] LOG:  AUDIT: SESSION,8,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""genome_build_version"",""source_sequence"",""source_strand"",""snp"",""updated_at"",""address_a_id"",""source"",""address_b_id"",""allele_b_probe_sequence"",""created_at"",""chromosome"",""allele_a_probe_sequence"",""illumina_strand"",""locus_name"",""top_genomic_sequence"",""source_version"",""created_by"",""updated_by"",""illumina_id"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20), ($21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40), ($41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60), ($61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72,$73,$74,$75,$76,$77,$78,$79,$80), ($81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100)

create with relation
2022-02-10 13:54:06.795 UTC [4490] LOG:  AUDIT: SESSION,8,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.798 UTC [4490] LOG:  AUDIT: SESSION,9,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.799 UTC [4490] LOG:  AUDIT: SESSION,10,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.801 UTC [4490] LOG:  AUDIT: SESSION,11,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>
2022-02-10 13:54:06.802 UTC [4490] LOG:  AUDIT: SESSION,12,1,WRITE,INSERT,,,"INSERT INTO ""beadchip_schema"".""probe_build"" (""build_id"",""locus_name"",""illumina_id"",""illumina_strand"",""snp"",""address_a_id"",""allele_a_probe_sequence"",""address_b_id"",""allele_b_probe_sequence"",""genome_build_version"",""chromosome"",""source"",""source_version"",""source_strand"",""source_sequence"",""top_genomic_sequence"",""created_at"",""updated_at"",""created_by"",""updated_by"") VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20) RETURNING ""beadchip_schema"".""probe_build"".""probe_build_id""",<not logged>


@RobertCraigie
Owner Author

Yea, the actual generated queries are different

I see, I wonder if it would be feasible for Prisma to be smart enough to create a single query from those batched queries instead of creating multiple queries.

I do not know how the SQL queries are generated internally; that's all handled by the Prisma engines team, so it might just not be possible. I'll have a look.
