Bump up Scanner.DEFAULT_MAX_NUM_ROWS from 16 to 128.
16 turned out to be too conservative.  Document the fact that this
setting has a high performance impact on the scanning performance.
A higher default value will provide better performance out of the
box.  OpenTSDB sees a 40-50% speedup with this value.  People with
very large rows will probably be aware of these considerations and
are more likely to adjust the value than those who're using typical
HBase rows, which tend to be fairly limited in size.

Change-Id: I778e28c9211e2c0d462d66f7dd35f0c9ddd5afa6
tsuna committed Feb 27, 2011
1 parent 6bebf78 commit d1aff70
Showing 1 changed file with 13 additions and 6 deletions.
19 changes: 13 additions & 6 deletions src/Scanner.java
@@ -99,7 +99,7 @@ public final class Scanner {
    * is not part of the API and is subject to change without notice.
    * @see #setMaxNumRows
    */
-  public static final int DEFAULT_MAX_NUM_ROWS = 16;
+  public static final int DEFAULT_MAX_NUM_ROWS = 128;
 
   /** Special reference we use to indicate we're done scanning. */
   private static final RegionInfo DONE =
@@ -348,18 +348,25 @@ public void setServerBlockCache(final boolean populate_blockcache) {
   }
 
   /**
-   * Sets the maximum number of rows to scan per RPC.
+   * Sets the maximum number of rows to scan per RPC (for better performance).
    * <p>
    * Every time {@link #nextRows()} is invoked, up to this number of rows may
-   * be returned. The default value is {@link #DEFAULT_MAX_NUM_ROWS}. Using
-   * a smaller value (such as 1) is not recommended as it will make your client
-   * send an RPC to the RegionServer frequently in order to get new data.
+   * be returned. The default value is {@link #DEFAULT_MAX_NUM_ROWS}.
    * <p>
-   * If you know you're gonna be scanning lots of small rows (few cells, and
+   * <b>This knob has a high performance impact.</b> If it's too low, you'll
+   * do too many network round-trips, if it's too high, you'll spend too much
+   * time and memory handling large amounts of data. The right value depends
+   * on the size of the rows you're retrieving.
+   * <p>
+   * If you know you're going to be scanning lots of small rows (few cells, and
    * each cell doesn't store a lot of data), you can get better performance by
    * scanning more rows by RPC. You probably always want to retrieve at least
    * a few dozen kilobytes per call.
    * <p>
+   * If you want to err on the safe side, it's better to use a value that's a
+   * bit too high rather than a bit too low. Avoid extreme values (such as 1
+   * or 1024) unless you know what you're doing.
+   * <p>
    * Note that unlike many other methods, it's fine to change this value while
    * scanning. Changing it will take affect all the subsequent RPCs issued.
    * This can be useful you want to dynamically adjust how much data you want
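
The sizing advice in the new Javadoc (fetch at least a few dozen kilobytes per RPC, avoid extreme values) can be turned into a small calculation. The following is only a sketch and not part of this commit or of the library: the class `RowsPerRpc`, the method `estimate`, the 64 KiB target, and the bounds 16 and 512 are all hypothetical choices made for illustration.

```java
/**
 * Hypothetical helper (not part of this commit) that picks a rows-per-RPC
 * value from an estimated average row size.
 */
final class RowsPerRpc {

  private RowsPerRpc() {
    // Static helper, not meant to be instantiated.
  }

  /**
   * Returns how many rows are needed to reach roughly
   * {@code targetBytesPerRpc}, clamped to {@code [minRows, maxRows]}.
   *
   * @param targetBytesPerRpc payload to aim for per RPC (assumed by caller).
   * @param avgRowBytes estimated average size of one row, in bytes.
   * @param minRows lower bound, to avoid too many network round-trips.
   * @param maxRows upper bound, to avoid handling too much data at once.
   */
  public static int estimate(final long targetBytesPerRpc,
                             final long avgRowBytes,
                             final int minRows,
                             final int maxRows) {
    if (targetBytesPerRpc <= 0 || avgRowBytes <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    if (minRows < 1 || maxRows < minRows) {
      throw new IllegalArgumentException("need 1 <= minRows <= maxRows");
    }
    // Ceiling division without risking overflow on the addition.
    long rows = targetBytesPerRpc / avgRowBytes;
    if (targetBytesPerRpc % avgRowBytes != 0) {
      rows++;
    }
    // Clamp so that very large or very small rows don't yield extreme values.
    return (int) Math.max(minRows, Math.min(maxRows, rows));
  }

  public static void main(final String[] args) {
    // Assumed numbers: 64 KiB per RPC, rows of about 512 bytes each.
    System.out.println(estimate(64 * 1024, 512, 16, 512));  // prints 128
  }
}
```

The result of such a calculation would be passed to `setMaxNumRows` on the scanner. Since the Javadoc notes that the value may be changed while scanning, an application could recompute it as its row-size estimate improves.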