Contest Management System load test report.
- CMS (Contest Management System)
- AWS (Amazon Web Services)
- Amazon EC2 (Elastic Cloud)
- Amazon RDS (Relational Database Service)
- Amazon ELB (Elastic Load Balancer)
Purpose: we are organizing a contest with ~750 participants and want to know if CMS will hold.
How quickly can the workers run, and how scalable can the setup be? The following setup was tested:
- c3.4xlarge (16 vCPU, 30 GB RAM), c3.8xlarge (32 vCPU, 60 GB RAM): cmsWorker.
- db.t2.medium (2 vCPU, 4 GB RAM): hosted PostgreSQL database (AWS RDS).
- t2.medium (2 vCPU, 4GiB RAM). All management: LogService, EvaluationService, Checker, ScoringService, AdminWebServer, etc.
With this setup all submissions of all exercises of first-day BOI2014 were executed in about 50 minutes. Workers were fully loaded (CPU load ~16 and ~32 respectively).
Some observations. Like seen in the graphs below, management machine was mostly idle.
Database machine moderately busy (submission testing started around 10:50):
It is important to let workers pre-fetch the contest data. If testing starts before that, weird things happen. Also, the database shouldn not be too slow (db.t2.medium is good).
Conclusion: this is adequate.
Screenshot of random point in time running the benchmarks:
Making web servers (ContestWebService) scale was more problematic. Test
methodology: set up the web service,
ab on a url with a cookie that
understands user is logged in (so we do not hit the redirect all the time).
I understand this does not fully exercise the system because it does not load submissions, but sounds like good enough to test basic web service performance.
- Running more than one worker under the same ResourceService did not work. All but the first servers kept restarting. Since I didn't believe it's going to work anyway, I did not investigate this too much.
- Database should be able connection-number-wise. ContestWebService opens a few
connections on every request, and looks like does not shut them down
immediately. That raised significant concerns with the
db.t2.mediumdatabase (2 vCPUs, 4 GiB RAM).
TIME_WAITsockets is a big deal. More on that later.
- 2 * c3.8xlarge (32 vCPU, 60 GiB RAM): ContestWebService.
- db.r3.8xlarge (32 vCPU, 244 GiB RAM): hosted PostgreSQL database (AWS RDS).
- t2.medium (2 vCPU, 4GiB RAM). All management: LogService, EvaluationService, etc.
ELB (Elastic Load Balancer) was in front and had the public IP.
Two machines (c3.8xlarge) had nginx serving port 80 and forwarding the requests to ContestWebServices, below.
Both c3.8xlarge machines had 32 Docker containers running ContestWebService each (ports 8890-8921). Each container exposes one port for terminating HTTP traffic, and one port for management (so "management" can connect to it). All these services had to be configured in
cms.conf. After that they were launched using a simple shell script.
This is important:
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_reuse echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_recycle
Normally this is a dangerous/unsafe operation, but we can do this, because we are not talking to the outside world and all traffic is contained within a single network.
Without these settings after 30-40 seconds servers got tens of thousands
TIME_WAITnetwork sockets, and the benchmarks stalled without an indication what is going wrong. Remember, always check
netstatwhen something is off.
For the cautious,
tcp_tw_reusealone might be enough. But during testing I did not yet know
tcp_tw_recyclecan be dangerous.
In the beginning, it was hard to get DB right. In the beginning of the
benchmarks requests started failing due to problems connecting to the database
connection refused). Tried workaround: pgbouncer. However, it does not
solve the error, just postpones it (the same problems happen, just after more
requests). Solution: get a more powerful instance with a higher limit on DB
db.t2.medium allows 291 connections. The (unofficial) formula is
DBInstanceClassMemory/12582880, which, for db.r3.8xlarge, is around 19k.
This is enough for our purposes:
With all that changed, the setup was able to do about 980 requests per second (there were more runs with higher rates, but I logged only this):
[ec2-user@ip-172-31-19-49 ~]$ ab -n100000 -c60 -H 'Cookie: login="...";' \ http://cmsdbproxy-xyz.elb.amazonaws.com/tasks/coprobber/description This is ApacheBench, Version 2.3 <$Revision: 655654 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking cmsdbproxy-xyz.elb.amazonaws.com (be patient) Completed 10000 requests Completed 20000 requests Completed 30000 requests Completed 40000 requests Completed 50000 requests Completed 60000 requests Completed 70000 requests Completed 80000 requests Completed 90000 requests Completed 100000 requests Finished 100000 requests Server Software: nginx/1.6.2 Server Hostname: cmsdbproxy-xyz.elb.amazonaws.com Server Port: 80 Document Path: //tasks/coprobber/description Document Length: 6844 bytes Concurrency Level: 60 Time taken for tests: 101.861 seconds Complete requests: 100000 Failed requests: 10084 (Connect: 0, Receive: 0, Length: 10084, Exceptions: 0) Write errors: 0 Total transferred: 729883619 bytes HTML transferred: 684389916 bytes Requests per second: 981.73 [#/sec] (mean) Time per request: 61.117 [ms] (mean) Time per request: 1.019 [ms] (mean, across all concurrent requests) Transfer rate: 6997.54 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 1 0.7 1 20 Processing: 39 58 206.6 49 5110 Waiting: 38 58 206.6 48 5110 Total: 39 59 206.6 49 5110 Percentage of the requests served within a certain time (ms) 50% 49 66% 51 75% 53 80% 54 90% 56 95% 60 98% 66 99% 74 100% 5110 (longest request)
A million requests were also tried, it worked reliably. Just connection count to the database kept slowly steadily increasing. But since the connection increase rate is so small, we can ignore the problem.
cms-boi2014 fork was tested (pre-1.1.0).
- Free graphs of mostly everything. It is so damn useful.
- ELB. No need to worry about slow clients. Also, web service usage graphs are free. Great start to troubleshoot (e.g. number of 500s).
- Even the most wimpy virtual machines feel really fast. I have never used AWS seriously before (due to the fact that my previous employer had an in-house OpenStack cluster).
yum installon AWS is a remarkably good experience.
- Docker was a great help here. For example, starting the cmsWorker takes around 1 minute on a newly installed machine (
yum install docker,
docker run <args>).
- Amazon Linux is a nice Linux distribution.