Test extreme data loads #325

Closed
kenpugsley opened this issue Apr 3, 2020 · 20 comments

@kenpugsley
Collaborator

kenpugsley commented Apr 3, 2020

We need to be able to support very high levels of points of concern, with a "full" set of location data (over the last 14 days).

As a rough estimate, a user's own location data should be on the order of 28 x 24 x 12 = 8,064 points
As a rough estimate, based on NY, the current 14-day growth is ~64,000 cases, so points of concern could be on the order of 64,000 x 14 x 24 x 12 x 0.5 ≈ 129,000,000

The point of this test would be to know processing time and memory usage. Also, determine limits so the app can protect itself in case of extreme data loads.
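
As a quick cross-check of those estimates, a minimal sketch in Python (the 12 points/hour sampling rate and the 0.5 activity factor are the assumptions stated above; everything else is arithmetic):

POINTS_PER_HOUR = 12   # assumed sampling rate (one point every 5 minutes)
ACTIVITY_FACTOR = 0.5  # assumed fraction of the day a case's phone is logging

# A single user's own history: 28 days of stored data.
user_points = 28 * 24 * POINTS_PER_HOUR
print(user_points)  # 8064

# Points of concern: ~64,000 cases, each contributing 14 days of data.
concern_points = 64_000 * 14 * 24 * POINTS_PER_HOUR * ACTIVITY_FACTOR
print(f"{concern_points:,.0f}")  # ~129,024,000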

@E3V3A

E3V3A commented Apr 8, 2020

@kenpugsley
Can you specify the variables you use for that calculation?

  • Where's the 28 coming from?
  • Where's the 64,000 coming from?

Assuming user is active 12 hrs/day, every day, for 14 days, we have:
active_time = 14 [days] x 12 [hr/day] x 3600 [sec/hr] = 604,800 [sec]

But we only take a measurement every k seconds, so:

N = number of measurements
= (active_time [sec]) / (k [sec/measurement])

But we said* we only take a measurement every 5 min, so k = 300, and thus:

N = 604,800/300 = 2016 [measurements]


* = we should take more often, probably every 1-2 minutes when device is moving!
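
For comparison, a quick sketch of the same sum at a few sampling intervals (same assumption of 12 hr/day of active time over 14 days):

active_time = 14 * 12 * 3600          # [sec] of active time
for k in (60, 120, 300):              # sampling interval [sec]: 1, 2, 5 min
    print(k, active_time // k)        # 60 -> 10080, 120 -> 5040, 300 -> 2016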

@diarmidmackenzie

64,000 is presumably the current number of infected people in NY state.
However, that figure is from 7 days ago.

Latest number is 160,000.
https://www.worldometers.info/coronavirus/country/us/

I assume the 28 is the number of days of data tracked (the white paper suggests a range of 14d to 37d for data storage; I don't know what value the current code uses).

@kenpugsley
Collaborator Author

Sorry ... I did not notice the discussion on this. I should have been clearer, but you have the sense of it. The internal storage does indeed currently go back 28 days. The current location algorithm should not log any faster than one location point every 3 minutes.

There is some uncertainty about how much data the health provider will want to send. I think the real objective here is to figure out what the limits of the app are, so that the app can be proactive about protecting itself, as well as letting health authorities know (through the Safe Places work) about best practice.

The obvious concerns that we should have and know the limits on:

  1. download size
  2. size in memory
  3. time to process the intersection

@diarmidmackenzie

Ken - I was thinking about testing this, and there are some decisions to be made in how to structure the JSON data served, which could impact performance.

Is there any defined ordering for the JSON data? E.g. ordered by patient first, then timestamp, or by timestamp across all patients? (Or by location... which seems odd, but might make for more efficient matching?)

Depending on what data structures it uses, the ordering of a large data set might have a significant impact on how the app will perform.

If I get to testing this, it should be straightforward to try out multiple orderings - but it would be useful to know if there is an expected ordering for the data.

@penrods
Contributor

penrods commented Apr 10, 2020

Order doesn't (shouldn't) matter. We normalize and sort the data inside the app.
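
For illustration only (not the app's actual code), the normalization step might amount to sorting by timestamp and dropping duplicates before the intersection runs, so the matching logic can assume ordered, unique input regardless of how the health authority ordered its JSON:

def normalize(points):
    # Sort by timestamp and drop exact duplicate points.
    seen = set()
    out = []
    for p in sorted(points, key=lambda p: p["time"]):
        key = (p["time"], p["latitude"], p["longitude"])
        if key not in seen:
            seen.add(key)
            out.append(p)
    return out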

@diarmidmackenzie

...and it sounds like you aren't worried about the performance of the sort.

I infer from that also: There is no standard ordering that we expect from HAs. They can send the data in whatever order they like.

@kenpugsley
Collaborator Author

Part of what we need from this testing is some real data on how the app behaves with large datasets. The app will order the data, but you are correct that if we need to help performance, we could start enforcing a requirement on the endpoints that they order the data in some form (date order probably makes the most sense).

If you do get a provider up with a large amount of data, I would like myself or someone on the core dev team to do some detailed running of the apps with performance telemetry attached to see how things work out.

@diarmidmackenzie

OK, I now have a Python script that generates a sample data set for n cases, for m days. I generate 288 data points per day (12 per hour x 24 hours). For now I am randomly distributing each data point within +/-0.5 degrees of a nominated central longitude & latitude.

I'll look at getting the code into GitHub tomorrow.

Files generated are huge.

50 cases for 28 days is 34.5 MB.

So with 5000 infected cases, the file would be 3.45 GB.

I don't think the issue here is processing time, but simply download bandwidth.

Files compress OK. I'm getting 12%, so the 34.5MB file compresses to 4.3MB. Files with real data should compress a lot better than my random data.

For 5000 cases, 12% compression gets us to 430MB, which is about twice the size of a large app download. But it's not clear how we would decompress it to feed it into the app.

Also not clear what amount of RAM such a volume of data would take up in the app.

Once the app is able to load these files, I can try loading the 50 cases / 34.5 MB data set, and see what this looks like in terms of occupancy.
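
For reference, a minimal sketch of what such a generator might look like (the file name, the flat output structure, and the default coordinates are illustrative assumptions, not the actual script; the field names match the app's current JSON format):

import json, random, time

def generate_dataset(n_cases, m_days, center_lat=40.7, center_lon=-74.0,
                     points_per_day=288):
    # 288 points/day = one point every 5 minutes.
    now = int(time.time() * 1000)  # msec, matching the current format
    points = []
    for case in range(n_cases):
        for i in range(m_days * points_per_day):
            points.append({
                "time": now - i * 300_000,  # step back 5 minutes per point
                "latitude": round(center_lat + random.uniform(-0.5, 0.5), 8),
                "longitude": round(center_lon + random.uniform(-0.5, 0.5), 8),
            })
    return {"concern_points": points}

with open("sample_data.json", "w") as f:
    json.dump(generate_dataset(50, 28), f)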

@kenpugsley
Collaborator Author

Thanks for getting this research started. It's not clear to me just yet how we get those numbers down beyond just reducing the dataset. With that amount of data, we should also be concerned about app memory footprint and how we're loading the dataset into memory in the app - I suspect many phones will have issues with extremely large datasets of the size you're talking about.

@diarmidmackenzie

diarmidmackenzie commented Apr 11, 2020

I made some simple changes to the JSON format:

  • "time" -> "t"
  • "longitude" -> "x"
  • "latitude" -> "y"
  • time in seconds, not msecs
  • GPS data to 5 decimal places, rather than 8

(0.00001 degrees is roughly 1m, on the basis that 1 degree ≈ 111 km. I think anything below 10m is just spurious accuracy anyway.)
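
For illustration, a single (made-up) point before and after those changes:

Before: {"time": 1586541600000, "latitude": 40.71278300, "longitude": -74.00594100}
After:  {"t": 1586541600, "y": 40.71278, "x": -74.00594}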

With these format changes, my 50 cases / 28d file reduces from 34.5MB to 23.7MB - a 31% reduction.

Surprisingly, it still compresses at the same ratio: 12% (this is using 7zip on Windows; zip on Windows compresses at 14%. I haven't looked into what the best lossless compression algorithm would be).

Those changes, for a ~31% gain, seem like good low-hanging fruit (but they would need to be co-ordinated with Safe Places, as per #350).

Beyond that it's hard to do better....
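
As a rough sketch of the decompression question raised above (assuming gzip rather than zip/7zip, purely because gzip support is near-universal), the app-side load might look like:

import gzip, json

def load_compressed_dataset(path):
    # Open the gzip-compressed JSON payload as text and parse it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)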

@diarmidmackenzie

Mulling over possible solutions here...

There's a lot of duplicate data being downloaded to the app multiple times. That's unnecessary.

But the app can't just download the latest 1d of data, because there may be "old" data (i.e. data from up to 28d ago), that's new (because the case just got diagnosed).

So we can't do a simple cut-off based on timestamp.

But if the server knows when each data point was added to its database (which is different from when the datapoint was recorded on the original phone), then it will know which datapoints are new for this phone (if the phone can tell it when it last downloaded data).

This will avoid duplicate data transmission & dramatically reduce the volume of data. If we pull data every 12 hours, we'll save a factor of 56 on data transfer (28 days of data ÷ 12-hour increments = 56).

An additional benefit of this approach is that we can shorten the 12-hour polling interval considerably (to 1 hour, or less) without paying any penalty in increased data transmission.
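
A minimal sketch of the idea (names are illustrative, assuming the server records an added_at time for each point and the phone reports when it last synced):

def points_added_since(all_points, last_sync):
    # all_points: dicts carrying the time the server ingested them
    # ("added_at"), which is independent of when the point was recorded.
    # last_sync: epoch seconds reported by the phone on its last download.
    return [p for p in all_points if p["added_at"] > last_sync]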

@kenpugsley - Thoughts?

@diarmidmackenzie

This creates a lot more complexity on the server side. Rather than just hosting a blob of JSON, we'd need a web server that dynamically generates content for each client based on the timestamp the client says it last pulled data.

Definitely some extra work server-side, but feels plausible to me (and probably necessary to scale up case numbers we can support).

@diarmidmackenzie

diarmidmackenzie commented Apr 15, 2020

(following some discussion with @penrods about exactly how large the datasets will be - see full details here:
https://covidsafepaths.slack.com/archives/C0105AC9ZBM/p1586891663010700)

I don't much like binary data solutions for APIs. Yes, they save data, but they harm testability and diagnosability a lot.
I'd prefer zipping the files over a binary data format. That seems to give a 10x reduction in my testing.

Time-slicing the data based on publication time can in fact be done with static JSON. Just have a named JSON file for each day of published data (or even each hour of published data).

Then the app can just pull the new information it needs, rather than downloading the same data time and again.
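
A sketch of the client side under that scheme, assuming one static file per day at a predictable URL (the URL pattern and date format here are purely illustrative):

from datetime import date, timedelta
import urllib.request

BASE_URL = "https://example-ha.org/concern-points"  # illustrative placeholder

def fetch_new_days(last_fetched: date, today: date):
    # Pull only the daily files published since the last successful fetch,
    # instead of re-downloading the full dataset every time.
    day = last_fetched + timedelta(days=1)
    payloads = []
    while day <= today:
        url = f"{BASE_URL}/{day.isoformat()}.json"  # e.g. .../2020-04-15.json
        with urllib.request.urlopen(url) as resp:
            payloads.append(resp.read())
        day += timedelta(days=1)
    return payloads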

I am a big fan of the YAGNI principle and of fixing things only when needed, but in this instance there is an evident looming issue, and fixing it now will cost far less than fixing it later, when client and server are both widely deployed.

A further consideration on data usage: we must not assume everyone has the privilege of free WiFi at home. If people are racking up significant data usage bills to use this app, that will hamper adoption.

Even the 6MB for 100 cases you quote (which I think is optimistic) is 60MB for 1000 cases, which works out to roughly 2GB/month at one download per day.

That costs a lot in some parts of the world.
https://a4ai.org/mobile-broadband-pricing-data/

PROPOSAL

Proposal: partition the static JSON data into 12h blocks, so the app can serve itself the fresh data it needs, and avoid the need to pull the same data down multiple times.

Privacy leakage impact: effectively zero.

Cost: modest implementation effort on client and server now. Offsetting much higher cost of making the same change later.

Benefit: an instant 28x reduction in bandwidth usage (assuming 14 days of data served and a 12-hour pull interval: 14 x 2 = 28). Smaller individual files to be processed at the application level probably also means we avoid the difficulties associated with digesting very large individual files.

@diarmidmackenzie

Other key points:

  • @penrods says OK to the proposal.
  • but priority is not yet clear - for very first deployments, we will probably be OK on scale without this.
  • changing client + server after deployment adds complexity, but will be manageable for a small number of deployments.
  • we're having discussions with Service Providers about free data for this service.

@E3V3A

E3V3A commented Apr 15, 2020

Related issue #285 with other calculations.

Data needs to be separated both by date and into smaller map pieces, as is done when loading any Google map on a slow connection.

@E3V3A

E3V3A commented Apr 16, 2020

Thanks to @mundanelunacy's covidmaps project, I became aware of a Firebase project that can be used to download only selected GIS data, so maybe this could help?

Realtime Geolocation with Firestore & RxJS. Query geographic points within a radius on the web or Node.js.

@diarmidmackenzie

Order doesn't (shouldn't) matter.

This is probably true for performance (the context in which it was said).

For privacy I think it does matter a lot.

E.g. if all of user 1's data comes first, then all of user 2's, etc., then it's trivial to correlate a set of movements together as one group, and that might make it a lot easier to de-anonymize.

I guess this is really input for Safe Places - I will raise something over there. I wanted to add a comment here so we have a record in the same place as the "order doesn't matter" comment.

@penrods
Contributor

penrods commented Apr 20, 2020

Correct, I just meant that the intersection code can consume points sent to it in any order and it will continue to work correctly.

@diarmidmackenzie

diarmidmackenzie commented Apr 28, 2020

Some test results here:

https://pathcheck.atlassian.net/wiki/spaces/TEST/pages/31621196/28+April+-+Some+basic+scale+testing

Not very "extreme". Managed to crash at 50 cases / 14d on Android. Reliably crashed at 100 cases / 14d.

Some good evidence that breaking things up into multiple smaller files would help a lot. I was able to get up to 300+ cases with smaller individual files; it crashed at 500 cases. Again, all at 14d/case.

@tstirrat
Contributor

Scaling work is in progress and tracked by https://pathcheck.atlassian.net/browse/SAF-166
