initial commit of gcviz
Brian Moore committed May 16, 2013
1 parent 908f5af commit 5f561ad
Showing 25 changed files with 1,471 additions and 4 deletions.
17 changes: 17 additions & 0 deletions LICENSE
@@ -0,0 +1,17 @@
/*
*
* Copyright 2013 Netflix, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
31 changes: 31 additions & 0 deletions README
@@ -0,0 +1,31 @@
This is gcviz, a set of programs that help generate visualizations
from gc.log, a log file that HotSpot, a Java Virtual Machine,
writes when configured with the following flags:
-verbose:gc
-verbose:sizes
-Xloggc:/apps/tomcat/logs/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
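
As a rough illustration, a script could check a JVM argument list for
the flags above before expecting a usable gc.log. This is only a
sketch: missing_gc_flags is a hypothetical helper, not part of gcviz,
and matching -Xloggc by prefix is an assumption made because the log
path may differ per host.

```python
# Flags copied from the list above; the log path is the README's example.
REQUIRED_FLAGS = [
    "-verbose:gc",
    "-verbose:sizes",
    "-Xloggc:/apps/tomcat/logs/gc.log",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCDateStamps",
    "-XX:+PrintTenuringDistribution",
]


def missing_gc_flags(jvm_args):
    """Return the flags from REQUIRED_FLAGS that are absent from jvm_args.

    Hypothetical helper. -Xloggc is matched by prefix only (assumption),
    so any log path counts as present.
    """
    missing = []
    for flag in REQUIRED_FLAGS:
        if flag.startswith("-Xloggc:"):
            present = any(arg.startswith("-Xloggc:") for arg in jvm_args)
        else:
            present = flag in jvm_args
        if not present:
            missing.append(flag)
    return missing


if __name__ == "__main__":
    example_args = ["-Xmx4g", "-verbose:gc", "-Xloggc:/tmp/gc.log", "-XX:+PrintGCDetails"]
    for flag in missing_gc_flags(example_args):
        print("missing:", flag)
```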

gcviz is intended to be used as a webapp when installed on the same
host as tomcat, or any other Java web container. The gcviz program
itself is served by apache httpd inside netflix, but could be served
by any webserver that supports CGI. gcviz can also be used in 'remote'
mode where the visualizer runs on local hardware but the log files are
scraped from a remote tomcat.

By default gcviz is available at:
http://127.0.0.1:8080/AdminGCViz/index

Internally, gcviz is a bundle of four sorts of things:
* python programs that require matplotlib, numpy, pylab, etc.
* cgi scripts that invoke these python programs
* some minor assistive perl scripts
* very minor rpm infrastructure to package the previous things.

I wrote gcviz to address challenges we face inside Netflix. If you
feel that any changes you might propose could be helpful for Netflix
or for the community at large, please write.

Brian Moore
4 changes: 0 additions & 4 deletions README.md

This file was deleted.

29 changes: 29 additions & 0 deletions conf/rpm.spec
@@ -0,0 +1,29 @@
Name: @name@
Summary: Netflix GC Visualization
Version: @version@
Release: @release@
License: NFLX
Packager: Engineering Tools
Vendor: Netflix, Inc.
Group: Netflix Base
AutoReqProv: no
Requires: nflx-python-matplotlib, nflx-python-numpy, nflx-python-scipy

%description
GC Visualization
----------
@build.metadata@

%install
cat $0 > %{_topdir}/install.txt
mkdir -p $RPM_BUILD_ROOT
mv ${RPM_BUILD_DIR}/* $RPM_BUILD_ROOT

%clean
rm -rf $RPM_BUILD_ROOT/*

%files
%defattr(-,root,root)
/

%post
17 changes: 17 additions & 0 deletions root/apps/apache/conf.d/admin_gc_viz.conf
@@ -0,0 +1,17 @@

# configuration for GC visualization admin app
# mounted under /AdminGCViz and /AdminGCVizImages

ScriptAlias /AdminGCViz/ "/apps/apache/htdocs/AdminGCViz/"
Alias /AdminGCVizImages/ "/mnt/logs/gc-reports/"

<LocationMatch "^/AdminGCViz">
Order deny,allow
Deny from all
Allow from 127.0.0.1/32
</LocationMatch>

# send requests to /AdminGCViz [with and without trailing /] to /index
RewriteRule ^/AdminGCViz/?$ /AdminGCViz/index [R]
# Force apache to handle the rest [this catches AdminGCVizImages too]
RewriteRule ^/AdminGCViz - [L]
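
To see which request paths each RewriteRule pattern above matches, the
two regular expressions can be tried out in Python. This is a sketch
that models only the pattern matching, not the [R] and [L] flag
handling or the LocationMatch access rules; route is a hypothetical
helper and the return strings are illustrative labels.

```python
import re

# Patterns copied from the two RewriteRule lines above.
REDIRECT = re.compile(r"^/AdminGCViz/?$")
PASSTHROUGH = re.compile(r"^/AdminGCViz")


def route(path):
    """Report which of the two rules would match first (hypothetical helper)."""
    if REDIRECT.search(path):
        return "redirect to /AdminGCViz/index"
    if PASSTHROUGH.search(path):
        return "left to apache"
    return "no rule matched"


if __name__ == "__main__":
    for path in ["/AdminGCViz", "/AdminGCViz/", "/AdminGCViz/index",
                 "/AdminGCVizImages/report.png", "/other"]:
        print(path, "->", route(path))
```

The second pattern is an unanchored prefix match, which is why paths
under /AdminGCVizImages are caught by it as well.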
32 changes: 32 additions & 0 deletions root/apps/apache/htdocs/AdminGCViz/BUGS
@@ -0,0 +1,32 @@
This is more of a todo list than things that are deeply wrong, but I want to make the fringes that I know about public.

* all of the BUGs and Potential BUGs in all of the sources
* visualize-cluster does not work
* compute throughput, allocation rate from gc data (do this with PrintGCStats?)
* sar data (cpu, etc) needs to be visualized



* netflix internal: need to visualize facet data
* netflix internal: need to ensure that clients output vms cache refresh event overall and facet-level timing
* gps
* api
* merchweb
* ecweb (currently no facet-level timing for demand-fill)
* accountweb (currently no facet-level timing for demand-fill)
* ...
* netflix-internal: delta fail needs to be added to catalina parsing (right now all lines are plotted as begin/end overall)
* netflix-internal, maybe outside too: truncated gc logs (when ec2rotate logs purges stuff older than 7 days) create "unknown" gc events.
* netflix-internal: have if -z checks for status-properties-1, -2. delete empty files. don't do both if we get the output for one. Consider getting URL from discovery/entrypoints.
* netflix-internal: the squirreled away vms-gc-reports location is <appname>/<date>/<data> and doesn't have instance id in the path... This could be a problem for visualize-cluster.
* netflix-internal: Consider grabbing some number of recent ttime and threaddump files from /apps/tomcat/logs/cores
drwxrwsr-x 2 root nac 94208 Apr 17 23:44 .
-rw-r--r-- 1 merchwebprod nac 168050 Apr 17 23:44 ttime.20120417.234401.1586.txt
lrwxrwxrwx 1 merchwebprod nac 59 Apr 17 23:44 latest -> /apps/tomcat/logs/cores/threaddump.20120417.234401.1586.txt
-rw-r--r-- 1 merchwebprod nac 1117242 Apr 17 23:44 threaddump.20120417.234401.1586.txt
-rw-r--r-- 1 merchwebprod nac 168251 Apr 17 23:34 ttime.20120417.233401.1586.txt
-rw-r--r-- 1 merchwebprod nac 1119200 Apr 17 23:34 threaddump.20120417.233401.1586.txt
-rw-r--r-- 1 merchwebprod nac 168371 Apr 17 23:24 ttime.20120417.232401.1586.txt
-rw-r--r-- 1 merchwebprod nac 1120525 Apr 17 23:24 threaddump.20120417.232401.1586.txt
-rw-r--r-- 1 merchwebprod nac 168490 Apr 17 23:15 ttime.20120417.231401.1586.txt
* netflix-internal: visualize objectCache lines in ${OUTPUTDIR}/vms-object-cache-stats
40 changes: 40 additions & 0 deletions root/apps/apache/htdocs/AdminGCViz/README
@@ -0,0 +1,40 @@
* This program is split into three conceptual parts:
* a top-level "driver" (visualize-instance.sh and visualize-cluster.py)
* a remote data collection component (remote-data-collection/collect_remote_data.sh)
* a visualization component (visualize-gc.py)

It's difficult to get a cross-platform visualization component, so I
opted for python's matplotlib, which is cross platform and has a
single-click installer available for windows, macintosh and linux.

* To visualize the data you'll need python's matplotlib. One
simple-to-install (and free) distribution that contains this is EPD:
http://www.enthought.com/repo/free/
This will attempt to patch your .profile equivalent to place itself first on your PATH.

* A note about how GC events are parsed. This software does not, at
the time of this writing, use PrintGCFixup. Instead it uses the
-XX:+PrintGCDateStamps datetime stamps (if available, secs since vm
boot if not) as an anchor, and treats each of those things as an
event.

In this context:
2012-04-04T19:07:40.395+0000: 510958.888: [GC [1 CMS-initial-mark: 18431999K(18432000K)] 18939679K(29491200K), 0.5050890 secs] [Times: user=0.50 sys=0.00, real=0.50 secs]
2012-04-04T19:07:40.903+0000: 510959.397: [CMS-concurrent-mark-start]
2012-04-04T19:07:56.564+0000: 510975.058: [CMS-concurrent-mark: 15.410/15.661 secs] [Times: user=49.94 sys=1.89, real=15.66 secs]
2012-04-04T19:07:56.565+0000: 510975.058: [CMS-concurrent-preclean-start]
2012-04-04T19:08:23.054+0000: 511001.548: [Full GC 511001.549: [CMS2012-04-04T19:08:48.906+0000: 511027.400: [CMS-concurrent-preclean: 51.957/52.341 secs] [Times: user=76.72 sys=0.15, real=52.34 secs]
(concurrent mode failure): 18431999K->16174249K(18432000K), 106.0788490 secs] 29491199K->16174249K(29491200K), [CMS Perm : 69005K->69005K(115372K)], 106.0801410 secs] [Times: user=106.01 sys=0.00, real=106.06 secs]
2012-04-04T19:10:09.150+0000: 511107.644: [GC [1 CMS-initial-mark: 16174249K(18432000K)] 16363184K(29491200K), 0.0263250 secs] [Times: user=0.02 sys=0.00, real=0.03 secs]
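
The datestamp anchoring described above can be sketched roughly as
follows. This is not the code from visualize-gc.py: split_events and
ANCHOR are hypothetical names, and treating a line that does not start
with a datestamp as a continuation of the previous event is an
assumption made for this sketch.

```python
import re

# Datestamp written by -XX:+PrintGCDateStamps, then seconds since VM boot,
# e.g. "2012-04-04T19:07:40.395+0000: 510958.888: "
ANCHOR = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): (\d+\.\d+): "
)


def split_events(lines):
    """Group log lines into (datestamp, uptime_secs, text) tuples.

    A line starting with a datestamp opens a new event. Any other line is
    appended to the previous event (assumption); lines seen before the
    first datestamp are dropped.
    """
    events = []
    for line in lines:
        line = line.rstrip("\n")
        match = ANCHOR.match(line)
        if match:
            events.append([match.group(1), float(match.group(2)), line[match.end():]])
        elif events:
            events[-1][2] += " " + line
    return [tuple(event) for event in events]


if __name__ == "__main__":
    sample = [
        "2012-04-04T19:07:40.903+0000: 510959.397: [CMS-concurrent-mark-start]",
        "2012-04-04T19:07:56.565+0000: 510975.058: [CMS-concurrent-preclean-start]",
    ]
    for stamp, uptime, text in split_events(sample):
        print(stamp, uptime, text)
```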

GC events of this form:
2012-04-04T19:08:23.054+0000: 511001.548: [Full GC 511001.549: [CMS2012-04-04T19:08:48.906+0000: 511027.400: [CMS-concurrent-preclean: 51.957/52.341 secs] [Times: user=76.72 sys=0.15, real=52.34 secs]
(concurrent mode failure): 18431999K->16174249K(18432000K), 106.0788490 secs] 29491199K->16174249K(29491200K), [CMS Perm : 69005K->69005K(115372K)], 106.0801410 secs] [Times: user=106.01 sys=0.00, real=106.06 secs]

are represented as a Full GC requiring 106 seconds rather than the
CMS-preclean part of ~52 seconds. I'm not sure if ~106 or 106-52 is
the stop-the-world part, but at that long of a pause, I'm not entirely
convinced that it matters. Bill Jackson votes that 106-52 is the stop
the world part. He's probably right. I'm surprised that the preclean
wasn't aborted.
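
The two candidate pause figures discussed above can be pulled out of
the event text like this. It is a sketch only: full_gc_secs and
preclean_secs are hypothetical helpers, the regular expressions are
fitted to this one sample, and the code does not settle which figure
is the stop-the-world time.

```python
import re

# The Full GC event quoted above, joined into one string.
EVENT = (
    "[Full GC 511001.549: [CMS2012-04-04T19:08:48.906+0000: 511027.400: "
    "[CMS-concurrent-preclean: 51.957/52.341 secs] "
    "[Times: user=76.72 sys=0.15, real=52.34 secs] "
    "(concurrent mode failure): 18431999K->16174249K(18432000K), 106.0788490 secs] "
    "29491199K->16174249K(29491200K), [CMS Perm : 69005K->69005K(115372K)], "
    "106.0801410 secs] [Times: user=106.01 sys=0.00, real=106.06 secs]"
)


def full_gc_secs(text):
    """Return the last ', <n> secs]' figure, taken here as the whole-event time."""
    durations = re.findall(r", (\d+\.\d+) secs\]", text)
    return float(durations[-1])


def preclean_secs(text):
    """Return the two figures reported for the embedded preclean phase."""
    match = re.search(r"CMS-concurrent-preclean: (\d+\.\d+)/(\d+\.\d+) secs", text)
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    total = full_gc_secs(EVENT)
    _, preclean_wall = preclean_secs(EVENT)
    print("whole Full GC event: %.3f secs" % total)
    print("minus preclean:      %.3f secs" % (total - preclean_wall))
```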

18 changes: 18 additions & 0 deletions root/apps/apache/htdocs/AdminGCViz/gc_event_types
@@ -0,0 +1,18 @@
ParNew (stop-the-world)
CMS-initial-mark (stop-the-world)
CMS-concurrent-mark (concurrent; includes yields to other threads)
CMS-concurrent-abortable-preclean (concurrent)
CMS-concurrent-preclean (concurrent)
CMS-remark (stop-the-world)
CMS-concurrent-sweep (concurrent)
CMS-concurrent-reset (concurrent?)
concurrent mode failure (stop-the-world)
promotion failed (stop-the-world)
Full GC (stop-the-world)

markers
CMS-concurrent-mark-start
CMS-concurrent-preclean-start
CMS-concurrent-sweep-start
CMS-concurrent-reset-start
CMS-concurrent-abortable-preclean-start
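
One way to use the list above from Python is a small lookup table.
This is a sketch, not something taken from the gcviz sources:
EVENT_KINDS, MARKERS and classify are hypothetical names, and
CMS-concurrent-reset is filed under "concurrent" even though the list
marks it with a question mark.

```python
STOP_THE_WORLD = "stop-the-world"
CONCURRENT = "concurrent"

# Event names and kinds transcribed from the list above.
EVENT_KINDS = {
    "ParNew": STOP_THE_WORLD,
    "CMS-initial-mark": STOP_THE_WORLD,
    "CMS-concurrent-mark": CONCURRENT,
    "CMS-concurrent-abortable-preclean": CONCURRENT,
    "CMS-concurrent-preclean": CONCURRENT,
    "CMS-remark": STOP_THE_WORLD,
    "CMS-concurrent-sweep": CONCURRENT,
    "CMS-concurrent-reset": CONCURRENT,  # listed as "concurrent?" above
    "concurrent mode failure": STOP_THE_WORLD,
    "promotion failed": STOP_THE_WORLD,
    "Full GC": STOP_THE_WORLD,
}

# Start-of-phase markers.
MARKERS = {
    "CMS-concurrent-mark-start",
    "CMS-concurrent-preclean-start",
    "CMS-concurrent-sweep-start",
    "CMS-concurrent-reset-start",
    "CMS-concurrent-abortable-preclean-start",
}


def classify(name):
    """Return 'marker', 'stop-the-world', 'concurrent', or 'unknown'."""
    if name in MARKERS:
        return "marker"
    return EVENT_KINDS.get(name, "unknown")


if __name__ == "__main__":
    for name in ["ParNew", "CMS-concurrent-sweep", "CMS-concurrent-mark-start", "something else"]:
        print(name, "->", classify(name))
```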
