High latency #5

Closed
sudhirj opened this issue Jun 5, 2013 · 56 comments

@sudhirj (Contributor) commented Jun 5, 2013

b/8334662

I'm consistently seeing high latency numbers when using GCD - about 1.5 seconds on average for queries (10 items or less) and single-item writes (< 1 KB).

My ping time to the API endpoint is about 50ms, so discounting a 100ms round trip, that still leaves more than a second of GCD latency. This simply won't work for a server environment, certainly not one that scales. This is very surprising because I was expecting latencies much closer to GAE's: https://code.google.com/status/appengine/detail/hr-datastore/2013/06/05#ae-trust-detail-hr-datastore-query-latency

Can we get a dashboard like the GAE HR Datastore status board? Possibly measuring latencies from Google Compute Engine instances and a few different AWS Regions?

@proppy (Member) commented Jun 5, 2013

We are actively working on reducing the network latency between Compute Engine and Cloud Datastore, in particular by better colocating the two services.

But the numbers you are getting are way higher than the ones I'm seeing on average from Compute Engine (<140ms median for a write of one 1 KB random entity, and <100ms median for a runQuery of 10 entities).

Can you share more details about how you measure the latency and from which location you are making the API call?

Thanks in advance.

PS: feel free to open a separate issue for the dashboard feature request.

@dgay42 commented Jun 5, 2013

It's worth noting: if there's not much GCD traffic, there's a significant "cold start" penalty on latency. To get realistic latency numbers, I would recommend keeping a background traffic process running at a few requests per second (it doesn't matter much which requests; we've used a beginTransaction+commit pair for our own testing).
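For anyone wanting to reproduce that keep-warm setup, here is a minimal sketch in Node using the gcloud client that appears later in this thread; the runInTransaction method name and callback signature are assumptions about the client version, and the interval is arbitrary:

var gcloud = require('gcloud');

var dataset = gcloud.datastore.dataset({
    projectId: 'myProject',   // placeholder project
    keyFilename: 'key.json'   // placeholder credentials
});

// Issue an empty beginTransaction + commit pair a few times per second so the
// serving path stays warm while latency measurements are taken.
setInterval(function () {
    dataset.runInTransaction(function (transaction, done) {
        done();   // commit immediately, with no mutations
    }, function (err) {
        if (err) {
            console.log('keep-warm transaction failed:', err);
        }
    });
}, 250);   // roughly 4 requests per second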

@sudhirj (Contributor, Author) commented Jun 14, 2013

I think I've accounted for the cold start, but I only made about 100 requests in sequence. I might have hit it on a particularly 'cold' day, though, so I'll try again. Also, Google's definition of an active store probably has a lot more zeros on the requests/second count.

I measured by turning on the HTTP logger that's included in the Ruby example and looking at the output for sequential runs - definitely not a well-thought-out benchmark, but enough to show what's going on.

@vierjp commented Jun 20, 2013

I tried the Cloud Datastore API from Compute Engine.

Instance type: f1-micro, zone: us-central1-a

I tried with this code.
https://github.com/vierjp/vier-gcd-test-client/blob/master/vier-gcd-test-client/src/main/java/ClientTest6.java

This code puts a small entity 1000 times and queries the latest 10 entities.

In my experiment it takes about 150-200 ms on average to put an entity, and about 500-700 ms to query 10 entities.

By comparison, it takes about 40-50 ms on average to put an entity from App Engine with the low-level API.

I hope the Cloud Datastore API's execution speed from Compute Engine will be improved.

@sudhirj (Contributor, Author) commented Jun 21, 2013

I'm still hoping for better speed outside of Compute Engine - namely from AWS. The biggest draw for me is that I could then deploy servers all around the world with a multitude of stacks and frameworks and have them share a common high-availability, high-speed distributed datastore. Seems like a pipe dream, but definitely possible if Google can improve the latencies on GCD.

@proppy (Member) commented Jul 1, 2013

Can you try running your benchmarks again?

@vierjp commented Jul 3, 2013

Great.
I ran the same benchmark on Compute Engine (instance type: f1-micro, zone: us-central1-a).
It now takes about 98-120 ms on average to put an entity.
(Last time it took about 150-200 ms.)

7 03, 2013 4:01:16 AM put entities 117610 milliseconds.
7 03, 2013 4:03:51 AM put entities 108515 milliseconds.
7 03, 2013 4:08:33 AM put entities 97788 milliseconds.
(This code puts a small entity 1000 times)
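Those per-put averages follow directly from dividing each run's total time by the 1000 sequential puts; a quick check of the arithmetic:

// Totals reported above, divided by the 1000 sequential puts per run.
[117610, 108515, 97788].forEach(function (totalMs) {
    console.log((totalMs / 1000).toFixed(1) + ' ms per put');
});
// prints 117.6, 108.5 and 97.8 ms per put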

@briandorsey (Member) commented Jul 3, 2013

Also, it may be worth running your benchmark again on an n1-standard-4 or n1-highcpu-4 instance. Overall network throughput is higher on higher CPU instances. I wouldn't expect latency to be largely affected, but it's worth verifying with your workload.

@vierjp commented Jul 3, 2013

I ran the same benchmark again on a GCE n1-highcpu-4 instance.

GCE n1-highcpu-4, us-central1-a

-put
7 03, 2013 6:29:19 PM put entities 128545 milliseconds.
7 03, 2013 6:32:08 PM put entities 97222 milliseconds.
7 03, 2013 6:34:33 PM put entities 100826 milliseconds.
(This code puts a small entity 1000 times.)

-query
7 03, 2013 6:32:08 PM query entities 125 milliseconds.
7 03, 2013 6:34:33 PM query entities 740 milliseconds.

I noticed that network throughput seemed higher while I was downloading files with wget and yum.

The 'put' benchmark results didn't change much.
However, query speed may have increased.
I tried 3 times, but by then my GAE app's free quota was gone. :-(

I'll rerun the 'query' benchmark tomorrow.

@vierjp commented Jul 4, 2013

I tested again on a GCE n1-highcpu-4 instance.

I queried the latest 10 entities after putting entities (14 times).
To avoid caching, I changed the kind name every time.

The query took 326 ms on average.
The execution times ranged from 283 to 416 ms.

I think query speed improved by about 20-30%.

@obeleh commented Sep 12, 2013

I've inserted 100 blank entities in Python. It took 23 seconds :(
PS: Europe-west-a, small instance

@Alfus commented Sep 15, 2013

Did you insert them in a single batch request or serially, in individual requests?

@obeleh commented Sep 16, 2013

Serially, on purpose, to see how long a single insert takes on average. 200-250 ms sounds very long to me. I understand that replication is probably why it takes so long, but I expected it to be faster. Most results come in at 5-minute intervals from multiple data sources, so I guess I'll have to queue that up.
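For comparison, here is a rough sketch of batching those 100 inserts into a single commit using the gcloud Node client that appears elsewhere in this thread (not the Python client used above); that dataset.save accepts an array of entities, and the 'Result' kind, are assumptions for illustration:

var gcloud = require('gcloud');

var dataset = gcloud.datastore.dataset({
    projectId: 'myProject',   // placeholder project
    keyFilename: 'key.json'   // placeholder credentials
});

// Build 100 entities with incomplete keys so the server assigns the IDs.
var entities = [];
for (var i = 0; i < 100; i++) {
    entities.push({ key: dataset.key('Result'), data: { value: i } });
}

// A single batched save pays the request round trip once instead of 100 times.
console.time('batch save');
dataset.save(entities, function (err) {
    if (err) {
        console.log(err);
    }
    console.timeEnd('batch save');
});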

@obeleh commented Jun 6, 2014

Are there any updates here?

@Alfus commented Jun 9, 2014

Not yet. This is still a top priority.

@sorin7486 commented Nov 7, 2014

Any news on this?

@ehrencrona commented Nov 23, 2014

I'm having the same problem. I've created a small script in Node to test get performance, and no matter whether I run it on my local machine or on Google Compute Engine (on the cheapest instance), I get 400 ms average response times for a single get request against a tiny datastore with just a handful of entries.

The response times vary from 200 ms up to several seconds (!). I've tried letting a get run every ten seconds or so for a longer period; the times do not improve.

Is this really normal? Would latency improve by running in App Engine (though that seems extremely complicated with Node.js)?

Even response times of 100 ms, mentioned earlier in this thread, would seem to make it impossible to use Datastore for anything remotely time-critical. But there are people actually using Datastore, right? How are others using it? Or am I doing something wrong?

For reference, my tiny timing script:

var gcloud = require('gcloud');

var dataset = gcloud.datastore.dataset({
    projectId: 'myProject',
    keyFilename: 'key.json'
});

var calls = 0;

// Every 2 seconds, kick off 10 concurrent get() calls and time each one.
setInterval(function() {
    for (var i = 0; i < 10; i++) {
        var call = 'get' + calls++;
        console.time(call);

        dataset.get(dataset.key(['Language', 'EN']),
            // Wrap the callback so each async response closes over its own
            // timer label rather than the loop's final value of `call`.
            (function(call) {
                return function(err, entities, nextQuery) {
                    if (err) {
                        console.log(err);
                    }

                    console.timeEnd(call);
                }
            })(call)
        )
    }
}, 2000);

This yields output like:

get232: 286ms
get235: 342ms
get237: 362ms
get238: 419ms
get239: 425ms
get236: 3734ms
get241: 203ms

Thankful for any help. With these response times I will need to rethink my entire architecture.

@obeleh commented Nov 24, 2014

This is the same feeling I've had for a long time now. "Why am I the only one with this problem? Aren't there hundreds or even thousands of others building on the Google platform? How did they solve this problem? Why don't I read anything about these questions on the rest of the internet?"

I feel as if the solutions available on the Google Cloud Platform are built for stateless agents/machines that run slowly (200 ms - 10 s) but in large numbers and with many operations.

I had expected BigQuery to eventually get faster so that we could store chart data in it, but it responds in between 2 and 10 seconds. With GCD I expected the service to behave more or less at the speed of other databases; I would have been quite OK with 100 ms.

You will probably have to adjust your design. I hope I'm wrong and I too have missed something. But so far no enlightenment has come.

@Alfus commented Nov 24, 2014

We are working hard to solve this problem. We are implementing a new serving stack that we expect will get the latency very close to what you see from GAE. We launched the API into Beta without these improvements because there are a lot of use cases that are not sensitive to this latency issue (e.g. offline data processing).

As a stopgap, there are things you can do to mitigate the latency. Specifically, tweaking the settings documented here:
https://cloud.google.com/appengine/docs/adminconsole/performancesettings
For example, increasing the frontend instance class and reducing the pending latency options might reduce the variability you are seeing.
You can also look at https://appengine.google.com/instances?&app_id=<your_app_id>&version_id=ah-builtin-datastoreservice to see more information about what is happening.

Sorry for the inconvenience,

Alfred

@obeleh commented Nov 24, 2014

AE Frontend instance performance?

@Alfus commented Nov 24, 2014

Yes - the ah-builtin-datastoreservice version is what serves the HTTP requests.

@peterrham commented Feb 3, 2015

I'm measuring the latency between Google Compute Engine and Google Cloud Datastore.

I'm performing a simple lookup() using the Python client library.

The Google performance dashboard says that my requests take around 18 milliseconds. I assume that this is a server-side metric and not a round-trip metric.

Can someone point me to the service level agreements covering the minimum round-trip response times I should currently expect between Google Compute Engine and Google Cloud Datastore?

For a trivial lookup, I'm seeing around 60 milliseconds; I would expect around 20 milliseconds.

Here's the code - note that some of these files are not strictly correlated; each is intended to be indicative on its own. If someone can show me some code with better latencies, that would be great.

https://github.com/peterrham/projects/blob/master/google_cloud/read.py

Here's an example output file:

https://github.com/peterrham/projects/blob/master/google_cloud/read.out

In this example, I'm getting 77 milliseconds. I'm not trying to be statistically significant here; I'm just looking for some indicative guidance.

Datastore latency here is under 15 milliseconds:
http://code.google.com/status/appengine/detail/datastore/2015/02/02#ae-trust-detail-datastore-get-latency

I also have a sample tcpdump ASCII text output file:

https://github.com/peterrham/projects/blob/master/google_cloud/40ms.txt

generated with this command line (the timestamps are the deltas between packet events):

/usr/sbin/tcpdump -r tcpdump.out -nnq -ttt > 40ms.txt

For example, the TCP initial SYN is acked in 1 millisecond, so network latency does not seem to be a problem.

However, the ACK for the lookup() request comes more than 40 milliseconds after the request.

Server-side latency from the App Engine logs is 9 milliseconds:

2015-02-02 16:15:55.753 /datastore/v1beta2/Lookup 200 9ms 0kb module=default version=ah-builtin-datastoreservice
10.64.21.5 - - [02/Feb/2015:16:15:55 -0800] "POST /datastore/v1beta2/Lookup HTTP/1.1" 200 136 - - "ah-builtin-datastoreservice-dot-glowing-thunder-842.appspot.com" ms=10 cpu_ms=18 cpm_usd=0.000015 app_engine_release=1.9.17 instance=00c61b117c76349d57bd7ae2e3c635edd5c994da

Any ideas? Are there any buffering configurations to set to get the minimum latency?

@peterrham commented Feb 3, 2015

Looks like the 40ms is related to Nagle's algorithm, although I don't think it fully accounts for the delayed response, which is over 40ms - but I'm not sure.

http://neophob.com/2013/09/rpc-calls-and-mysterious-40ms-delay/
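If anyone wants to rule Nagle's algorithm out from Node (matching the Node samples elsewhere in this thread), a minimal sketch of disabling it on the request socket follows; the hostname and path are illustrative placeholders rather than the exact Datastore endpoint, and the Python equivalent would be setting TCP_NODELAY via setsockopt:

var https = require('https');

var req = https.request({
    hostname: 'www.googleapis.com',                          // illustrative host
    method: 'POST',
    path: '/datastore/v1beta2/datasets/myProject/lookup'     // hypothetical path
});

// Disable Nagle's algorithm on the underlying TCP socket so small request
// bodies are flushed immediately instead of waiting to coalesce with later
// writes (the classic Nagle + delayed-ACK interaction behind ~40ms stalls).
req.on('socket', function (socket) {
    socket.setNoDelay(true);
});

req.end('{}');   // auth headers and a real request body omitted for brevity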

@gcjc commented Mar 27, 2015

Hi - we are doing some app and backend testing (prior to launch) and are seeing this exact problem: looking at ah-builtin-datastoreservice, we see times from 10-20ms up to 2000ms. Do you have any timescale on a fix (or on when any such fix already made will be applied to the current Beta channel)? Otherwise we'll just migrate to DynamoDB. Thanks.

@cerdmann commented Mar 29, 2015

I second the motion to move to Dynamo.

@andrewferk commented Apr 12, 2015

I am also disappointed with the performance I'm seeing from GCD. A couple of patterns I've noticed: 1) the first request from a new datastore connection is always slow, and 2) batch mutations (even a batch of 100 entities) are painfully slow.

@obeleh commented Apr 13, 2015

In my experience it is better than it was when this issue was opened. I used to see a lot more response times between 500ms and 1000ms; most of my requests are now between 200ms and 500ms. Perhaps I've segmented my data better?

@hbizira commented Apr 13, 2015

@cerdmann @gcjc I would advise against moving to DynamoDB if you're concerned about keeping costs as low as possible. I did an evaluation after hitting this issue, and in my opinion DynamoDB has one major drawback: you have to provision your read/write capacity ahead of time. This makes it difficult to adjust to a sudden spike in traffic while keeping costs down. There are some auto-scaling libraries that attempt to help with this, but their adjustment time is not very fast, and DynamoDB currently limits you to scaling back down only a few times a day.

I'm really hoping this issue gets fixed soon as there's a huge node.js community that could really take advantage of GCD.

@gcjc commented Apr 13, 2015

@obeleh it is hard to predict when there will be problems (and it seems to affect any operation).

We will likely knock up some longer-running tests internally to see if we can spot any patterns. However, we just ran a very short test and things seem to have improved, with most requests between 150-220ms. Has anyone else seen any improvement? Just wondering whether any changes have been made to the App Engine front end.

@hbizira we really do want to use GCD, moving to dynamo will be the last resort.

@cerdmann commented Apr 13, 2015

@hbizira, thanks for the advice. I'm in the same boat as @gcjc in that we really want to use GCD, but we're looking into other options as the latency is killing us.

@jonface commented Jul 8, 2015

Just started using gcloud via Node.js and it's a bit disappointing: I'm getting roughly 200-500ms latency. This is running on RH OpenShift (really AWS) in the US and EU. I tried my home connection too; it's all roughly the same.

I hope I'm doing something wrong :/

@peterrham commented Jul 9, 2015

I got better results than that, but still not great. What do you mean by "AWS"? My timings were from a Google Compute Engine host to Google Cloud Datastore.

By the way, I'm trying out Google Bigtable, which is in Beta. It promises a great SLA - sub-10ms at the 99th percentile, I think. I have tried it out but have not measured the latency; still, I believe it!

@jonface commented Jul 9, 2015

I was referring to Amazon Web Services EC2, which is the underlying infrastructure of OpenShift. Isn't Bigtable more expensive and overkill?

I'll check my timings with Wireshark. Also, my datastore is empty, so it's not like it's got millions of items in it.

@jonface commented Jul 9, 2015

OK, it's hard to tell exactly how long it's taking via Wireshark because of the TLS, but you can make a rough guess.

TLS connection SYN to FIN - total 460ms
- TLS Application data at 126ms
- TLS Application data at 201ms
- TLS Application data at 399ms
- TLS Application data at 400ms
- TLS Application data at 455ms

What is interesting is that for every query, a new connection is set up and torn down. Is this correct? Am I doing something wrong? Wouldn't it be better to keep a connection pool and reuse connections?

Thanks
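For what it's worth, the Node-level mechanism for reusing connections is a keep-alive agent; whether the gcloud library of that era lets you plug a custom agent in is an assumption I haven't verified, so treat this as a sketch of the idea rather than a documented fix:

var https = require('https');

// A keep-alive agent reuses TCP/TLS connections across requests instead of
// paying the full SYN + TLS handshake on every query.
var keepAliveAgent = new https.Agent({
    keepAlive: true,
    maxSockets: 10
});

https.request({
    hostname: 'www.googleapis.com',   // illustrative request only
    path: '/discovery/v1/apis',
    agent: keepAliveAgent
}, function (res) {
    console.log('status:', res.statusCode);
}).end();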

@jonface commented Jul 10, 2015

I realise I'm being impatient, but what's the plan for this? Is this just a Node.js library problem?

@dhermes (Member) commented Aug 17, 2015

@pcostell Will this be addressed with v1beta3?

@pcostell (Contributor) commented Aug 17, 2015

That is the goal, but we are still working on benchmarking v1beta3.

@InfiniteRandomVariable commented Jan 1, 2016

@pcostell I am seriously considering this service. Please post an update about this issue since it has been a while. Thanks for the great work.

@pcostell (Contributor) commented Jan 8, 2016

We are seeing much better latency numbers with v1beta3. v1beta3 is a complete rewrite of our infrastructure, and as such we are being very cautious about rolling it out. We hope to have it ready early this year.

@peterrham commented Jan 8, 2016

Great!

@alexfernandez commented Feb 8, 2016

@pcostell What numbers are you seeing? Anything below 20 ms? That is the performance goal that we have, and it is easily achievable in DynamoDB. This issue is a dealbreaker for us for moving to Google Cloud right now.

@leonardaustin commented Feb 15, 2016

I thought I would share some instrumentation, as I found this issue and have been following it with interest. Below are the 95th percentile and mean (in milliseconds) for PUT requests moving from v1beta2 to v1beta3. Both show roughly a 10x improvement (~25ms and ~10ms respectively) - good work guys! It would also be nice to get it out of the beta endpoints so we can use it in production. One thing worth mentioning is that v1beta3 seems to be missing the transaction endpoint.

95th percentile: [screenshot: screen shot 2016-02-15 at 18 22 57]

Mean: [screenshot: screen shot 2016-02-15 at 18 25 17]

@eddavisson (Contributor) commented Feb 16, 2016

Hi @leonardaustin, can you share details about what service(s) you're using? We haven't actually launched v1beta3 yet, so I'm wondering if we're talking about two different things.

@leonardaustin commented Feb 16, 2016

@eddavisson Sure, I forked https://github.com/GoogleCloudPlatform/gcloud-golang and changed the URL from v1beta2 to v1beta3.

@dhermes (Member) commented Feb 16, 2016

gcloud-python has "already" made the switch (in a branch)

All features from v1beta3 are present, though the URIs are different.

@obeleh commented Feb 17, 2016

Is there a list of changes I should look at, or can I just change the URL and I'm set?

@eddavisson (Contributor) commented Feb 17, 2016

The v1beta3 API is not yet available to customers. We will be sure to post an announcement here when it is.

@obeleh commented Feb 18, 2016

If possible, can I sign up for the public beta of the beta?

@faizalkassamalisc commented Mar 14, 2016

@pcostell, is v1beta3 still on track to "be released this quarter" as per #34? :)

[screenshot of the referenced comment]

@dmcgrath (Contributor) commented Mar 15, 2016

@faizalkassamalisc, we're busy reticulating splines and hopefully will have an update for you soon. Thanks for your patience!

@obeleh, the beta will be public and will just require using the appropriate API clients. There will be no other sign-up beyond the normal project creation you do currently.

@pcostell closed this Apr 4, 2016
@alexfernandez commented Apr 4, 2016

Is there a more technical page with latencies expressed as milliseconds? Thanks!

@dmcgrath (Contributor) commented Apr 4, 2016

We don't have a page with latency numbers: it's a moving target as we continually improve the platform, and it also depends on both the location your Cloud Datastore was set up in and where you are accessing it from.

Keep in mind that you can't compare DynamoDB with Cloud Datastore directly, as it's a functionally different service and in most cases we're serving customers from a multi-regional instance rather than a merely regional one.

@rajeshshetty commented Nov 10, 2016

Why is the first request to the datastore always slow?
Is there a way to fix this?

@josecaodaglio commented Oct 2, 2017

Hello guys, this is probably the wrong place to ask this question, but I didn't find a better one.

Could Datastore play the same role as Redis on local disks for a session cache? Everything would be much easier and cheaper for our company.

@manwithsteelnerves commented Aug 10, 2020

Was this ever solved? Can anyone share what latencies we can expect now on Datastore (Firestore in Datastore mode vs. native mode)?
