= OSCON 2008, Session 3: Hypertable
== GFS
- Run on 1000 machines, not 1
=== Filesystem
- 64MB chunk
- Replicates each chunk across machines
- By doing so, system is impervious to a whole class of hardware failures
- Power supply
- Power to the rack
- Network failure
- Map/Reduce
- Bigtable
=== Hypertable
- Not relational
- Modeled after Google's bigtable
- One big massive primary keyed table
- No transactions, maybe in the future
- *Scalable*
- High Random insert, update, and delete rate
- Loaded 1TB into a 9-node HT cluster, sustained random insert @ 1M inserts per second (Quad core Intel, 16GB RAM, SATA 3Gb/s).
=== Data Model
- Sparse, 2D table with cell versions
- 1 table with 2 columns, next one has 1M, that's OK
- 4-part key
- Row
- Column Family
- Column Qualifier
- Timestamp
- Tim O'Reilly walks in and looks around for a seat, they're all taken
=== Anatomy of a key
- Row key is \0 Terminated
- Col family is a single-byte (256 possible)
- Col qualifier is \0 terminated
- Timestamp is big-endian 1's Comp. (memcmp, ordering has more recent ahead of older versions)
=== Concurrency
- Bigtable uses Copy on write
- MVCC (like Couch) form in Hypertable. Deletes are inserted as delete records. Multiple versions are kept around.
=== Cellstore
- 65K blocks of compressed KV pairs
- Bloom Filter - booya!
=== System Overview
- Hyperspace has a distributed lock manager, and a small metadata fs (built on BDB)
- Chubby (sp?) is google's hyperspace
- Function of the master is to perform metadata operations (ALTER, CREATE, etc.)
- Clients can communicate with Range servers
- Master can be down for a while with no one even noticing
- Hot standby design for availability
- Range Servers: Responsible for UPDATING and SCANNING
- All sits on top of HDFS distributed FS
- Hadoop, KFS (GFS Clone)
=== Range server
- Manages ranges of table data
- Caches updates in memory (CellCache)
- Spills (compacts) periodically to update the disk (CellStore)
==== Write ahead commit log
- When updates come into a rangeserver, they're written to a commit log, then the data structures are updated so you can replay the log.
==== Range meta-operation log
- When a rangeserver does anything (moves, stops), it's written into the log
== Client API
- C++ client is the only one supported ATM:
- You modify a table by creating a mutator
- You scan a table by creating a scanner
- Thrift Broker in the works
- Someone contrib'd a Hadoop Map/Reduce connector
== Compression
- CellStore: compressed KV pairs
- Commit log: Compressed blocks (optionally)
- Supported types
- zlib (fastest/best)
- lzo (high decomp speed)
- quicklz (fast decomp, high ratio)
- bmz (longest commons substring, lost of replication)
- none
== Caching
=== Block Cache
- CellStore blocks of KV pairs configurable
=== Query cache
- Not finished implementing
- Caches results
=== Bloom Filter (!!)
- Negative Cache
- Configurable K
- Allows you to find out if you definitely *don't* have the data
=== Scaling
- Session table and crawl table
- Splits them all up into ranges, go to rangeservers
- Just add more machines, and the system migrates data equally
- Balancing is questionable...
=== Access Groups
- Control of physical layout hybrid row/col oriented
- Improves perf. by minimizing IO
- Grouping columns allows physical storage control
- Makes faster updates possible
=== FS Broker
- Can run on any distributed FS
- FUSE hooks
== More
- Comparison to Hbase (Java, yuck), C++ much better
- System is designed for async communication
- Hypertable is CPU intensive
- Java uses 2-3 times the memory for large memmap
- Poor processor cache perf.
== Performance
- AOL Query logs
- 75,275,825 inserted cells
- 8-node cluster (1 1.8 Ghz Dual Core Opteron)
- 4GB RAM
- 3x 7200 SATA
- Row Key 7B
- Avg value 15B
- Crap. Slide change
- Another test yielded over 1M sustained inserts/s
== Weaknesses
- Range data managed by a single rangeserver
- No data loss, but if it goes down, bad bad
- Can be mitigated with client-side cache or memcached
== Status
- Alpha, 0.9.0.7 released
- Beta at the end of August
- Waiting on Hadoop JIRA 1700
- Bug in Hadoop, don't allow appending to existing files
- GPL 2
- Delete records get flushed in a "major operation"