OSCON 2008, Session 3: Hypertable

GFS

  • Run on 1000 machines, not 1

Filesystem

  • 64MB chunk
  • Replicates each chunk across machines
  • By doing so, system is impervious to a whole class of hardware failures
    • Power supply
    • Power to the rack
    • Network failure
  • Map/Reduce
  • Bigtable

Hypertable

  • Not relational
  • Modeled after Google’s Bigtable
  • One big massive primary keyed table
  • No transactions, maybe in the future
  • Scalable
  • High random insert, update, and delete rates
  • Loaded 1TB into a 9-node HT cluster, sustained random insert @ 1M inserts per second (Quad core Intel, 16GB RAM, SATA 3Gb/s).

Data Model

  • Sparse, 2D table with cell versions
  • 1 table with 2 columns, next one has 1M, that’s OK
  • 4-part key
    • Row
    • Column Family
    • Column Qualifier
    • Timestamp
  • Tim O’Reilly walks in and looks around for a seat, they’re all taken

Anatomy of a key

  • Row key is \0 Terminated
  • Col family is a single-byte (256 possible)
  • Col qualifier is \0 terminated
  • Timestamp is big-endian one’s complement (so memcmp ordering puts more recent versions ahead of older ones)
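
The key layout above can be sketched as follows, in Python for brevity (Hypertable itself is C++). The function name `encode_key`, the 8-byte timestamp width, and the sample row and qualifier values are assumptions for illustration; this is not Hypertable's actual serialization code.

```python
import struct

def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key so that plain byte comparison sorts it."""
    if b"\x00" in row or b"\x00" in qualifier:
        raise ValueError("row and qualifier must not contain NUL bytes")
    if not 0 <= family <= 0xFF:
        raise ValueError("column family must fit in a single byte")
    # One's complement of the timestamp: newer (larger) values encode as
    # smaller bytes, so they sort first. 64-bit width is an assumption.
    inverted = ~timestamp & 0xFFFFFFFFFFFFFFFF
    return (row + b"\x00" + bytes([family]) + qualifier + b"\x00"
            + struct.pack(">Q", inverted))

older = encode_key(b"com.example", 1, b"title", 1000)
newer = encode_key(b"com.example", 1, b"title", 2000)
versions = sorted([older, newer])  # byte-wise ordering, like memcmp
```

After sorting, the entry with the more recent timestamp comes first, which is what lets a scanner return the newest version of a cell without looking further.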

Concurrency

  • Bigtable uses copy-on-write
  • Hypertable uses a form of MVCC (like CouchDB). Deletes are inserted as delete records. Multiple versions are kept around.
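
A minimal sketch of the "deletes are inserted as delete records" idea, assuming a toy in-memory table. The class `VersionedTable` and the `DELETE` marker are hypothetical names, not part of Hypertable.

```python
DELETE = object()  # hypothetical tombstone marker

class VersionedTable:
    def __init__(self):
        self.cells = {}  # key -> list of (timestamp, value), newest first

    def insert(self, key, timestamp, value):
        versions = self.cells.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def delete(self, key, timestamp):
        # A delete is just another insert; nothing is removed in place.
        self.insert(key, timestamp, DELETE)

    def get(self, key, as_of=None):
        # Return the newest version visible at `as_of` (or the newest overall).
        for ts, value in self.cells.get(key, []):
            if as_of is None or ts <= as_of:
                return None if value is DELETE else value
        return None

table = VersionedTable()
table.insert("row1", 1, "a")
table.insert("row1", 2, "b")
table.delete("row1", 3)
```

Here `table.get("row1")` returns `None` because the newest record is the delete, while `table.get("row1", as_of=2)` still sees `"b"`: older versions stay readable until something physically discards them.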

Cellstore

  • 65K blocks of compressed KV pairs
  • Bloom Filter - booya!

System Overview

  • Hyperspace has a distributed lock manager, and a small metadata fs (built on BDB)
  • Chubby is Google’s equivalent of Hyperspace
  • Function of the master is to perform metadata operations (ALTER, CREATE, etc.)
  • Clients can communicate with Range servers
  • Master can be down for a while with no one even noticing
  • Hot standby design for availability
  • Range Servers: Responsible for UPDATING and SCANNING
  • All sits on top of a distributed FS
  • e.g. Hadoop HDFS or KFS (a GFS clone)

Range server

  • Manages ranges of table data
  • Caches updates in memory (CellCache)
  • Spills (compacts) periodically to update the disk (CellStore)

Write ahead commit log

  • When updates come into a rangeserver, they’re written to a commit log first, then the in-memory data structures are updated, so the log can be replayed to rebuild them after a failure.
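
The log-then-update order can be sketched like this. `LoggedStore`, the JSON line format, and the local file standing in for the distributed FS are all assumptions for illustration, not how Hypertable actually stores its commit log.

```python
import json
import os
import tempfile

class LoggedStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.cache = {}  # in-memory state, rebuilt from the log on startup
        self._replay()
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                key, value = json.loads(line)
                self.cache[key] = value

    def update(self, key, value):
        # 1. Make the update durable in the log...
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. ...and only then apply it to the in-memory structure.
        self.cache[key] = value

    def close(self):
        self.log.close()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commit.log")
    store = LoggedStore(path)
    store.update("row1", "v1")
    store.update("row1", "v2")
    store.close()  # stand-in for a crash: the in-memory cache is lost
    recovered = LoggedStore(path)  # replays the log
    recovered_cache = dict(recovered.cache)
    recovered.close()
```

Because every update reaches the log before the cache, replaying the log from the start reproduces the cache contents exactly.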

Range meta-operation log

  • When a rangeserver does anything (moves, stops), it’s written into the log

Client API

  • C++ client is the only one supported at the moment
  • You modify a table by creating a mutator
  • You scan a table by creating a scanner
  • Thrift Broker in the works
  • Someone contrib’d a Hadoop Map/Reduce connector

Compression

  • CellStore: compressed KV pairs
  • Commit log: Compressed blocks (optionally)
  • Supported types
    • zlib (fastest/best)
    • lzo (high decomp speed)
    • quicklz (fast decomp, high ratio)
    • bmz (longest common substring, lots of repetition)
    • none
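
As a rough sketch of "blocks of compressed KV pairs", the snippet below packs sorted key/value pairs into blocks of about 64 KiB and compresses each block with zlib. The length-prefixed record format and the names `build_blocks` and `read_block` are assumptions made for this example; the real CellStore format is not described in these notes.

```python
import zlib

BLOCK_SIZE = 65536  # assumed target size of an uncompressed block

def build_blocks(pairs, block_size=BLOCK_SIZE):
    """Pack (key, value) byte pairs into blocks and zlib-compress each one."""
    blocks, current = [], bytearray()
    for key, value in pairs:
        record = (len(key).to_bytes(4, "big") + key
                  + len(value).to_bytes(4, "big") + value)
        if current and len(current) + len(record) > block_size:
            blocks.append(zlib.compress(bytes(current)))
            current = bytearray()
        current += record
    if current:
        blocks.append(zlib.compress(bytes(current)))
    return blocks

def read_block(block):
    """Decompress one block and decode its (key, value) pairs."""
    data, pos, out = zlib.decompress(block), 0, []
    while pos < len(data):
        klen = int.from_bytes(data[pos:pos + 4], "big")
        key = data[pos + 4:pos + 4 + klen]
        pos += 4 + klen
        vlen = int.from_bytes(data[pos:pos + 4], "big")
        value = data[pos + 4:pos + 4 + vlen]
        pos += 4 + vlen
        out.append((key, value))
    return out

pairs = [(b"row%05d" % i, b"v" * 15) for i in range(10000)]
blocks = build_blocks(pairs)
```

Compressing per block rather than per file means a reader only has to decompress the one block that might hold the key it wants.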

Caching

Block Cache

  • Caches CellStore blocks of KV pairs; configurable

Query cache

  • Not finished implementing
  • Caches results

Bloom Filter (!!)

  • Negative Cache
  • Configurable K
  • Allows you to find out if you definitely *don’t* have the data
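
A minimal Bloom filter sketch with a configurable number of hash functions `k`. Deriving the `k` positions from salted SHA-256 digests is a choice made for this example; the notes do not say which hash functions Hypertable uses.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, k: int = 4):
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter(num_bits=256, k=3)
empty_answer = bf.might_contain(b"row1")
bf.add(b"row1")
added_answer = bf.might_contain(b"row1")
```

A `False` answer is always correct, so a store can skip the disk read entirely; a `True` answer may be a false positive and still requires checking the actual data.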

Scaling

  • Session table and crawl table
  • Splits them all up into ranges, go to rangeservers
  • Just add more machines, and the system migrates data equally
  • Balancing is questionable…
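
One way to picture "split into ranges, go to rangeservers" is a sorted list of range end rows with one server per range. The class `RangeMap`, the inclusive end-row convention, the server names, and the rule that a new server takes the lower half of a split are all assumptions for illustration.

```python
import bisect

class RangeMap:
    """Maps contiguous row-key ranges to range servers (hypothetical sketch)."""

    def __init__(self, end_rows, servers):
        # end_rows[i] is the last row (inclusive) of range i; the final
        # server covers every row after the last end row.
        assert len(servers) == len(end_rows) + 1
        self.end_rows = list(end_rows)
        self.servers = list(servers)

    def locate(self, row):
        return self.servers[bisect.bisect_left(self.end_rows, row)]

    def split(self, end_row, new_server):
        # Cut the range containing end_row; the new server takes the lower part.
        idx = bisect.bisect_left(self.end_rows, end_row)
        self.end_rows.insert(idx, end_row)
        self.servers.insert(idx, new_server)

ranges = RangeMap(["g", "p"], ["rs1", "rs2", "rs3"])
before = ranges.locate("apple")
ranges.split("c", "rs4")
after = ranges.locate("apple")
```

Adding a machine then amounts to splitting a range and handing one part to the new server; clients only need the updated map to find the right server for a row.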

Access Groups

  • Control of physical layout: hybrid row/column oriented
  • Improves perf. by minimizing IO
  • Grouping columns allows physical storage control
  • Makes faster updates possible
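
The idea of grouping columns to control physical storage can be sketched with one store per access group. `AccessGroupTable`, the group names, and the column family names are hypothetical, and plain dicts stand in for on-disk files.

```python
class AccessGroupTable:
    def __init__(self, groups):
        # groups: {"group name": ["column family", ...]}
        self.family_to_group = {
            fam: name for name, fams in groups.items() for fam in fams
        }
        # One physical store per access group (a dict stands in for a file).
        self.stores = {name: {} for name in groups}

    def put(self, row, family, value):
        group = self.family_to_group[family]
        self.stores[group][(row, family)] = value

    def scan(self, family):
        # Only the store of this family's access group is read.
        store = self.stores[self.family_to_group[family]]
        return sorted((row, value)
                      for (row, fam), value in store.items() if fam == family)

table = AccessGroupTable({"meta": ["title", "lang"], "content": ["body"]})
table.put("r1", "title", "Hypertable")
table.put("r1", "body", "a very long page body")
titles = table.scan("title")
```

A scan over `title` never touches the `content` store, which is where the IO saving comes from when small, frequently read columns are kept apart from large ones.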

FS Broker

  • Can run on any distributed FS
  • FUSE hooks

More

  • Comparison to HBase (Java, yuck), C++ much better
  • System is designed for async communication
  • Hypertable is CPU intensive
  • Java uses 2-3 times the memory for large memmap
  • Poor processor cache perf.

Performance

  • AOL Query logs
  • 75,275,825 inserted cells
  • 8-node cluster (1x 1.8 GHz dual-core Opteron)
    • 4GB RAM
    • 3x 7200 RPM SATA
  • Row Key 7B
  • Avg value 15B
  • Crap. Slide change
  • Another test yielded over 1M sustained inserts/s

Weaknesses

  • Range data managed by a single rangeserver
    • No data loss, but if it goes down, its ranges are unavailable until recovered
    • Can be mitigated with client-side cache or memcached

Status

  • Alpha, 0.9.0.7 released
  • Beta at the end of August
  • Waiting on Hadoop JIRA issue HADOOP-1700
    • Hadoop doesn’t allow appending to existing files
  • GPL 2
  • Delete records get flushed in a "major operation" (presumably a major compaction)