Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rendering time metric #18

Closed
rafatower opened this issue Apr 25, 2017 · 17 comments
Closed

Rendering time metric #18

rafatower opened this issue Apr 25, 2017 · 17 comments

Comments

@rafatower
Copy link

Add metric to measure the time spent on rendering vs postgres

@rafatower rafatower self-assigned this Apr 25, 2017
@rafatower
Copy link
Author

A bit of history

Current implementation

See mapnik::progress_timer and MAPNIK_STATS:

It spits results to an std::ostream, which is not the best option.

This is the typical usage:

#ifdef MAPNIK_STATS
    mapnik::progress_timer __stats__(std::clog, "postgis_datasource::init");
#endif

__stats__ (and in some places __stats2__) is just the name of the variable. When the destructor is called (after getting out of scope), the timer is stopped and the stats dumped to the stream.

rafatower pushed a commit that referenced this issue May 25, 2017
rafatower pushed a commit that referenced this issue May 25, 2017
rafatower pushed a commit that referenced this issue May 25, 2017
rafatower pushed a commit that referenced this issue May 25, 2017
rafatower pushed a commit that referenced this issue May 25, 2017
rafatower pushed a commit that referenced this issue May 25, 2017
@rafatower
Copy link
Author

Requirements

Functional requirements

  • Original requirement from Santana: we want to know how much time mapnik spends querying DB versus rendering.
  • The same question, more fine-grained (thanks Raul): "How much time mapnik spends in DB vs rendering, given a map and a zoom level [and possibly a bounding box]"

Other requirements:

Non-functional requirements

  • The API should be generic enough to allow other mapnik contributors to add more metrics (e.g: to other input plugins, grid and vector tiles, etc.)
  • It should be possible to enable/disable the feature through configuration, ideally at runtime (rather than at compilation time)
  • It should be lightweight in terms of CPU.
  • It should be lightweight in terms of memory footprint, that is, total memory used shall be bounded.

@rafatower
Copy link
Author

rafatower commented Jun 7, 2017

Proposed roadmap

Milestone 1: Coarse-grain DB vs rendering times (not per map yet)

At this point I'm pretty confident we can achieve this, useful both for benchmarking and getting overall metrics in production.

  • Mapnik: API & implementation for cumulative timing stats
  • Mapnik: Inject code to measure DB and rendering time
  • Node-mapnik: bindings for timer_stats class
  • Windshaft: wrapper around node-mapnik timer_stats (IMHO this is not really needed)
  • Windshaft-cartodb: expose metrics functionality through endpoints

Milestone 2: Finer-grain metrics (per map)

  • Mapnik: Research how the information flows from Windshaft-cartodb to mapnik. Modify function interfaces when needed to pass layergroup_id, map_id or equivalent concepts to store finer-grain stats (as done and then reverted in mapnik@6bbbe15 )
  • Decide on storage and memory utilization strategy: ideally the stats should be memory-bounded by design. Storing finer-grained stats implies that more memory is to be used, but we cannot let mapnik run out of memory because of stats. There are several possible solutions to this:
    • flush them out often enough
    • use a LRU-like storage strategy, or a circular buffer
    • keep just the top-N
    • use sampling

Milestone 3: Extended API and configurable storage strategy

Some ideas for a future milestone:

  • For the first iteration at least I'm focusing the efforts on timing stats. Other possible stats may be interesting, namely counters.
  • The aggregation, if any, would be delegated on the client at least until this iteration.
  • Ideally, and to cover different use cases, the storage strategy could be chosen through configuration at server start and run times.

cc/ @jorgesancha

@rafatower
Copy link
Author

rafatower commented Jun 8, 2017

Why this is important

This is needed to better know our platform and its usage. That will allow us to:

  • set up sensible limits
  • debug performance issues experienced by particular users
  • do some capacity planning
  • optimize what is needed to be optimized

@rafatower
Copy link
Author

rafatower commented Jun 8, 2017

I'm working right now on different branches with the hope of getting the first iteration complete soon:

This is yet WIP and probably lacks a lock to avoid concurrency issues... but the hardest parts to integrate are pretty much under control.

I'll probably add some notes about the internals of mapnik and node-mapnik somewhere, which can be helpful to newcomers.

Edit: added Windshaft-cartodb branch.

@rafatower
Copy link
Author

A small test I use to check it actually gets data from node-mapnik (to be added to the corresponding branch):

// file my-test.js
var mapnik = require('.');
var fs = require('fs');

// register fonts and datasource plugins
mapnik.register_default_fonts();
mapnik.register_default_input_plugins();

var map = new mapnik.Map(256, 256);
var out = '';
map.load('./test/stylesheet.xml', function(err,map) {
    if (err) throw err;
    map.zoomAll();
    var im = new mapnik.Image(256, 256);
    map.render(im, function(err,im) {
        if (err) throw err;
        console.log(mapnik.TimerStats.flush());
    });
    
});

and the output:

$ time node my-test.js
total_map_rendering	cpu_time = 5.582 ms	wall_time = 5.58496 ms


real	0m0.184s
user	0m0.168s
sys	0m0.020s

@rafatower
Copy link
Author

rafatower commented Jun 13, 2017

A bit ugly but it starts to work in my development environment.

This is just an example after tinkering a little with one simple map (based on world borders):

$ curl localhost:8181/stats
{"stats":"total_map_rendering\tcpu_time = 1220.97 ms\twall_time = 827.765 ms\npostgis_datasource::features_with_context::get_resultset\tcpu_time = 367.364 ms\twall_time = 402.992 ms\n","ok":true}

The same output beautified:

total_map_rendering                                         cpu_time = 1220.97 ms   wall_time = 827.765 ms
postgis_datasource::features_with_context::get_resultset    cpu_time = 367.364 ms   wall_time = 402.992 ms

Some things I'd need to do:

  • check why wall_time is bigger than cpu_time when rendering. It may turn out that multiple threads and cores lead to those results when using a pool of threads. I believe it can be checked by instructing node to use just one thread.
  • add a lock to grant exclusive access to the unordered map used as internal storage of stats, in order to avoid race conditions.
  • modify the mapnik/node-mapnik interface and wrappers to return a json object instead of a plain string, depending on what we decide with regards to interface and granularity.

@rafatower
Copy link
Author

Using UV_THREADPOOL_SIZE=1 I indeed got different results, with cpu < wall times (mind that this is not a formal benchmark):

total_map_rendering                                         cpu_time = 473.751 ms   wall_time = 533.123 ms
postgis_datasource::features_with_context::get_resultset    cpu_time = 136.841 ms   wall_time = 313.367 ms

@rafatower
Copy link
Author

Something worth noting: if we wanted to deploy this changes before they were accepted upstream (or backported to mapnik v3.0.x series for what it matters), we'd need to set up our own building and packaging infrastructure as we're currently using vanilla mapnik and node-mapnik.

@rafatower
Copy link
Author

Thanks @dgaubert I could run a few tests and validate that the metrics are consistent. I disabled cache and metatiling to get them.

# reset the logs
$ truncate -s0 /tmp/postgres.log
$ curl -s localhost:8181/stats | jsonlint -p
{
  "stats": {},
  "ok": true
}

# request a tile
$ curl -v -s -o /dev/null "http://cdb.localhost.lan:8181/api/v1/map/cdb@99ca6525@ee41f1f57f9852eea1910c4f84f4e5ac:1486641153877/1/11/524/761.png" 2>&1 | grep X-Tiler-Profiler
< X-Tiler-Profiler: {"setDBAuth":7,"res":159,"getTileOrGrid":159,"getRenderer":30,"render-png":129,"render":67,"encode":62,"total":166}

# get the stats
$ curl -s localhost:8181/stats | jsonlint -p
{
  "stats": {
    "total_map_rendering": {
      "cpu_time": 44.790000000000006,
      "wall_time": 66.33305549621582
    },
    "postgis_datasource::features_with_context::get_resultset": {
      "cpu_time": 0.522,
      "wall_time": 19.935131072998047
    }
  },
  "ok": true
}

# get the postgres query times from postgres itself, quick'n'dirty way
$ cat /tmp/postgres.log | grep duration: | grep -v parse | sed 's/.*duration: \([0-9.]*\) ms.*/\1/g' | awk '{SUM+=$1}END{print SUM}'
19.647

(Note: I'm using a debug build, so no optimizations and a lot of traces. The important point here is understanding the process and check that the measurements make sense)

@rafatower
Copy link
Author

I added a new metric there, save_to_string, that corresponds to tilelive-mapnik "encode" phase.

Taking into account that postgis queries are part of the total_rendering_time, we have all the information we wanted and the metrics almost fit to the millisecond.

Here another example:

$ curl -v -s -o /dev/null "http://cdb.localhost.lan:8181/api/v1/map/cdb@99ca6525@ee41f1f57f9852eea1910c4f84f4e5ac:1486641153877/1/11/524/761.png" 2>&1 | grep X-Tiler-Profiler
< X-Tiler-Profiler: {"req2params_setup":1,"setDBAuth":3,"res":38,"getTileOrGrid":38,"render-png":38,"render":26,"encode":11,"total":42}

$ curl -s localhost:8181/stats | jsonlint -p
{
  "stats": {
    "total_map_rendering": {
      "cpu_time": 9.753,
      "wall_time": 25.885820388793945
    },
    "save_to_string": {
      "cpu_time": 11.009,
      "wall_time": 11.010885238647461
    },
    "postgis_datasource::features_with_context::get_resultset": {
      "cpu_time": 1.195,
      "wall_time": 17.41814613342285
    }
  },
  "ok": true
}

I'll stop there until further evaluation.

@rafatower rafatower removed their assignment Jun 16, 2017
@rafatower
Copy link
Author

cc/ @javisantana

@rafatower rafatower self-assigned this Jun 20, 2017
@rafatower
Copy link
Author

Under review by upstream project maintainers: mapnik#3705

@rafatower
Copy link
Author

The review above is stalled. I'll try a different approach: collect and pass the stats info back to the caller, when the caller wants to do so.

It seems like node-mapnik passes a "closure" to the mapnik renderer, which turns out to be a pretty open structure of attributes. I think that the stats collection could be triggered by the presence of an attribute, say named stats.

@rafatower
Copy link
Author

rafatower commented Jul 7, 2017

Systemtap

I just wanted to add a quick note here about a possible different approach: use systemtap.

Basically with it we can instrument both kernel and user-space code and get metrics from there. I have very little experience with it, but I could easily make it run in my local setup:

I got the identifiers by getting symbols from the binaries (not stripped):

$ readelf -Ws libmapnik.so | grep feature_style_processor.*apply 
   467: 00000000001c7c00   306 FUNC    WEAK   DEFAULT   11 _ZN6mapnik23feature_style_processorINS_12agg_rendererINS_5imageINS_7rgba8_tEEENS_25label_collision_detector4EEEE5applyERKNS_5layerERSt3setISsSt4lessISsESaISsEEd
  2207: 00000000001c4a80  1087 FUNC    WEAK   DEFAULT   11 _ZN6mapnik23feature_style_processorINS_12agg_rendererINS_5imageINS_7rgba8_tEEENS_25label_collision_detector4EEEE5applyEd
  3128: 00000000001c40d0   411 FUNC    WEAK   DEFAULT   11 _ZN6mapnik23feature_style_processorINS_13grid_rendererINS_8hit_gridINS_9gray64s_tEEEEEE14apply_to_layerERKNS_5layerERS5_RKNS_10projectionEddjjRKNS_5box2dIdEEiRSt3setISsSt4lessISsESaISsEE
...

the script:

# file /tmp/test.stp
probe process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("_ZNK18postgis_datasource21features_with_contextERKN6mapnik5queryESt10shared_ptrINS0_17IProcessorContextEE")
{
}

probe process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("_ZNK18postgis_datasource17features_at_pointERKN6mapnik5coordIdLi2EEEd")
{
}


probe process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processor*apply*")
{
}

(I think it supports c++ symbol demangling so the script could be simplified)

then running it:

$ sudo stap -t /tmp/test.stp 
^C----- probe hit report: 
process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("_ZNK18postgis_datasource21features_with_contextERKN6mapnik5queryESt10shared_ptrINS0_17IProcessorContextEE"), (/tmp/test.stp:1:1), hits: 54, cycles: 520min/1552avg/4405max, from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("_ZNK18postgis_datasource21features_with_contextERKN6mapnik5queryESt10shared_ptrINS0_17IProcessorContextEE") from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("_ZNK18postgis_datasource21features_with_contextERKN6mapnik5queryESt10shared_ptrINS0_17IProcessorContextEE"), index: 0
process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_13grid_rendererINS_8hit_gridINS_9gray64s_tEEEEEE5applyERKNS_5layerERSt3setISsSt4lessISsESaISsEEd"), (/tmp/test.stp:10:1), hits: 17, cycles: 1924min/5325avg/9615max, from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_13grid_rendererINS_8hit_gridINS_9gray64s_tEEEEEE5applyERKNS_5layerERSt3setISsSt4lessISsESaISsEEd") from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processor*apply*"), index: 9
process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_13grid_rendererINS_8hit_gridINS_9gray64s_tEEEEEE14apply_to_layerERKNS_5layerERS5_RKNS_10projectionEddjjRKNS_5box2dIdEEiRSt3setISsSt4lessISsESaISsEE"), (/tmp/test.stp:10:1), hits: 17, cycles: 438min/1295avg/4971max, from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_13grid_rendererINS_8hit_gridINS_9gray64s_tEEEEEE14apply_to_layerERKNS_5layerERS5_RKNS_10projectionEddjjRKNS_5box2dIdEEiRSt3setISsSt4lessISsESaISsEE") from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processor*apply*"), index: 10
process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_12agg_rendererINS_5imageINS_7rgba8_tEEENS_25label_collision_detector4EEEE5applyEd"), (/tmp/test.stp:10:1), hits: 37, cycles: 2189min/4295avg/10093max, from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processorINS_12agg_rendererINS_5imageINS_7rgba8_tEEENS_25label_collision_detector4EEEE5applyEd") from: process("/home/rtorre_unencrypted/src/Windshaft-cartodb/node_modules/mapnik/lib/binding/node-v48-linux-x64/libmapnik.so").function("_ZN6mapnik23feature_style_processor*apply*"), index: 11
----- refresh report:

As you can see in the examples it is a very powerful tool and would not require changes in the source code, as long as the binaries are not stripped (contain the function symbols), which is the case with our current mapnik version.

For this to work a kernel with the matching headers and debugging symbols is required AFAIK.

@pllopis if you have any experience with this tool, I'd appreciate a "crash course" :)

@rafatower
Copy link
Author

Here's another example that I wrote while checking metrics for the MVT blog:

# file mapnik.stp
# based on https://sourceware.org/systemtap/examples/profiling/fntimes.stp

function time() { return gettimeofday_us() }

function print_time(t)
{
    printf("function %s took %d us\n", ppfunc(), t)
}

probe process("node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik/input/postgis.input").function("*get_resultset*").return
{
    print_time(time() - @entry(time()));
}

probe process("node_modules/mapnik/lib/binding/node-v48-linux-x64/mapnik.node").function("*update_tile*").return
{
    print_time(time() - @entry(time()));
}

and here's the output:

$ sudo stap mapnik.stp
function _ZNK18postgis_datasource13get_resultsetERSt10shared_ptrI10ConnectionERKSsRKS0_IN6mapnik4PoolIS1_17ConnectionCreatorEEES0_INS6_17IProcessorContextEE took 73232 us
function _ZN6mapnik16vector_tile_impl9processor11update_tileERNS0_4tileEdii took 152431 us

@rochoa rochoa assigned Algunenano and unassigned rafatower Aug 21, 2017
@rafatower
Copy link
Author

Superseeded by mapnik#3767

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants