= OSCON 2008, Session 3: Hypertable

== GFS
- Run on 1000 machines, not 1

=== Filesystem
- 64MB chunks
- Replicates each chunk across machines
- By doing so, the system is impervious to a whole class of hardware failures
  - Power supply
  - Power to the rack
  - Network failure
- Map/Reduce
- Bigtable

=== Hypertable
- Not relational
- Modeled after Google's Bigtable
- One big massive primary-keyed table
- No transactions, maybe in the future
- *Scalable*
- High random insert, update, and delete rate
- Loaded 1TB into a 9-node HT cluster, sustained random inserts @ 1M inserts per second (quad-core Intel, 16GB RAM, SATA 3Gb/s).

=== Data Model
- Sparse, 2D table with cell versions
- One row can have 2 columns while the next has 1M; that's OK
- 4-part key
  - Row
  - Column Family
  - Column Qualifier
  - Timestamp
- Tim O'Reilly walks in and looks around for a seat, they're all taken

=== Anatomy of a key
- Row key is \0 terminated
- Column family is a single byte (256 possible)
- Column qualifier is \0 terminated
- Timestamp is big-endian one's complement (so memcmp ordering puts more recent versions ahead of older ones)

=== Concurrency
- Bigtable uses copy-on-write
- Hypertable uses a form of MVCC (like Couch). Deletes are inserted as delete records. Multiple versions are kept around.

=== Cellstore
- 65K blocks of compressed KV pairs
- Bloom Filter - booya!

=== System Overview
- Hyperspace has a distributed lock manager and a small metadata FS (built on BDB)
- Chubby is Google's equivalent of Hyperspace
- Function of the master is to perform metadata operations (ALTER, CREATE, etc.)
- Clients can communicate with range servers
- Master can be down for a while with no one even noticing
- Hot standby design for availability
- Range servers: responsible for UPDATING and SCANNING
- All sits on top of a distributed FS
  - Hadoop (HDFS), KFS (GFS clone)

=== Range server
- Manages ranges of table data
- Caches updates in memory (CellCache)
- Periodically spills (compacts) the cache to disk (CellStore)

==== Write ahead commit log
- When updates come into a range server, they're written to a commit log first, then the data structures are updated, so the log can be replayed.

==== Range meta-operation log
- When a range server does anything (moves, stops), it's written into the log

== Client API
- C++ client is the only one supported at the moment:
  - You modify a table by creating a mutator
  - You scan a table by creating a scanner
- Thrift Broker in the works
- Someone contrib'd a Hadoop Map/Reduce connector

== Compression
- CellStore: compressed KV pairs
- Commit log: compressed blocks (optionally)
- Supported types
  - zlib (fastest/best)
  - lzo (high decomp speed)
  - quicklz (fast decomp, high ratio)
  - bmz (longest common substring, lots of replication)
  - none

== Caching

=== Block Cache
- CellStore blocks of KV pairs
- Configurable

=== Query cache
- Not finished implementing
- Caches results

=== Bloom Filter (!!)
- Negative cache
- Configurable K
- Allows you to find out if you definitely *don't* have the data

=== Scaling
- Session table and crawl table
- Splits them all up into ranges, which go to range servers
- Just add more machines, and the system migrates data equally
- Balancing is questionable...

=== Access Groups
- Control of physical layout: hybrid row/column oriented
- Improves perf. by minimizing IO
- Grouping columns allows physical storage control
- Makes faster updates possible

=== FS Broker
- Can run on any distributed FS
- FUSE hooks

== More
- Comparison to HBase (Java, yuck), C++ much better
- System is designed for async communication
- Hypertable is CPU intensive
- Java uses 2-3 times the memory for large memmap
- Poor processor cache perf.

== Performance
- AOL query logs
- 75,275,825 inserted cells
- 8-node cluster (1x 1.8 GHz dual-core Opteron)
  - 4GB RAM
  - 3x 7200 RPM SATA
- Row key: 7B
- Avg value: 15B
- Crap. Slide change
- Another test yielded over 1M sustained inserts/s

== Weaknesses
- Each range's data is managed by a single range server
- No data loss, but if it goes down, bad, bad
- Can be mitigated with client-side cache or memcached

== Status
- Alpha, 0.9.0.7 released
- Beta at the end of August
- Waiting on Hadoop JIRA 1700
  - Bug in Hadoop: it doesn't allow appending to existing files
- GPL 2
- Delete records get flushed in a "major operation"
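
Looking back at the key anatomy notes (NUL-terminated row, single-byte column family, NUL-terminated qualifier, complemented big-endian timestamp), here is a minimal Python sketch of why that layout sorts correctly under a plain byte comparison. This is not Hypertable's actual serialization code: the name `encode_key`, the 8-byte timestamp width, and the exact field order are assumptions made for illustration only.

```python
import struct

def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key so that byte-wise comparison orders it.

    Assumed layout (illustrative, not Hypertable's real format):
    row + NUL, one family byte, qualifier + NUL, then the 8-byte
    big-endian one's complement of the timestamp.
    """
    if b"\0" in row or b"\0" in qualifier:
        raise ValueError("row and qualifier must not contain NUL bytes")
    if not 0 <= family <= 255:
        raise ValueError("column family must fit in a single byte")
    if not 0 <= timestamp < 2**64:
        raise ValueError("timestamp must fit in 64 bits (assumed width)")
    # Flipping every bit turns larger (newer) timestamps into smaller values.
    inverted = timestamp ^ 0xFFFFFFFFFFFFFFFF
    return (row + b"\0" + bytes([family]) + qualifier + b"\0"
            + struct.pack(">Q", inverted))

older = encode_key(b"com.example", 1, b"title", 1000)
newer = encode_key(b"com.example", 1, b"title", 2000)

# Python compares bytes lexicographically, the same way memcmp does,
# so sorting the raw keys puts the newer version of the cell first.
keys = sorted([older, newer])
```

Because the ordering falls out of the byte layout itself, a scanner that wants the latest version of a cell can stop at the first key it finds for that row and column.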