= OSCON 2008, Session 3: Hypertable

== GFS
- Run on 1000 machines, not 1
=== Filesystem
- 64MB chunks
- Replicates each chunk across machines
- By doing so, the system is impervious to a whole class of hardware failures
  - Power supply
  - Power to the rack
  - Network failure
- Map/Reduce
- Bigtable

=== Hypertable
- Not relational
- Modeled after Google's Bigtable
- One big massive primary keyed table
- No transactions, maybe in the future
- *Scalable*
- <b>High random insert, update, and delete rate</b>
- Loaded 1TB into a 9-node HT cluster, sustained random insert @ 1M inserts per second (Quad core Intel, 16GB RAM, SATA 3Gb/s).

=== Data Model
- Sparse, 2D table with cell versions
- 1 table with 2 columns, next one has 1M, that's OK
- 4-part key
  - Row
  - Column Family
  - Column Qualifier
  - Timestamp

- <i>Tim O'Reilly walks in and looks around for a seat, they're all taken</i>

=== Anatomy of a key
- Row key is \0 terminated
- Column family is a single byte (256 possible)
- Column qualifier is \0 terminated
- Timestamp is big-endian ones' complement, so memcmp ordering puts more recent versions ahead of older ones
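
The key layout above can be sketched in a few lines of Python. This is only an illustration of the idea from the talk, not Hypertable's actual serialization code: the function name +encode_key+ and the 64-bit timestamp width are assumptions. Python compares +bytes+ objects byte by byte, which matches memcmp ordering.

```python
import struct

MAX_U64 = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit timestamp width (not stated in the talk)


def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key so that plain byte comparison sorts it correctly."""
    if b"\x00" in row or b"\x00" in qualifier:
        raise ValueError("row and qualifier are \\0 terminated, so they cannot contain \\0")
    if not 0 <= family <= 255:
        raise ValueError("column family must fit in a single byte")
    if not 0 <= timestamp <= MAX_U64:
        raise ValueError("timestamp must fit in 64 bits")
    # Ones' complement flips every bit, so a larger (newer) timestamp
    # becomes a smaller big-endian byte string and sorts first.
    inverted = timestamp ^ MAX_U64
    return (
        row + b"\x00"
        + bytes([family])
        + qualifier + b"\x00"
        + struct.pack(">Q", inverted)
    )
```

Sorting the encoded keys for one cell then yields the newest version first, so a scanner that wants only the latest value can stop at the first match.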

=== Concurrency
- Bigtable uses copy-on-write
- Hypertable uses a form of MVCC (like Couch). Deletes are inserted as delete records. Multiple versions are kept around.
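
A minimal sketch of the delete-record idea, assuming a single cell that keeps every version it has seen. The class name +VersionedCell+ and the +DELETE+ sentinel are hypothetical and are not taken from Hypertable.

```python
DELETE = object()  # hypothetical sentinel standing in for a delete record


class VersionedCell:
    """Keeps every version of one cell; a delete is just another version."""

    def __init__(self):
        self._versions = []  # list of (timestamp, value) in arrival order

    def put(self, timestamp, value):
        self._versions.append((timestamp, value))

    def delete(self, timestamp):
        # Nothing is removed; a delete record is inserted instead.
        self._versions.append((timestamp, DELETE))

    def get(self, as_of=None):
        """Return the newest value visible at time as_of, or None."""
        visible = [
            (ts, value)
            for ts, value in self._versions
            if as_of is None or ts <= as_of
        ]
        if not visible:
            return None
        _, value = max(visible, key=lambda version: version[0])
        return None if value is DELETE else value
```

Because older versions stay in place, a reader asking for an earlier timestamp still sees the value that was current at that time, even after a later delete.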

=== CellStore
- 65K blocks of compressed KV pairs
- Bloom Filter - booya!

=== System Overview
- Hyperspace has a distributed lock manager, and a small metadata fs (built on BDB)
  - Chubby is Google's Hyperspace
- Function of the master is to perform metadata operations (ALTER, CREATE, etc.)
- Clients can communicate with Range servers
- Master can be down for a while with no one even noticing
- Hot standby design for availability
- Range Servers: Responsible for UPDATING and SCANNING
- All sits on top of HDFS distributed FS
- Hadoop, KFS (GFS Clone)

=== Range server
- Manages ranges of table data
- Caches updates in memory (CellCache)
- Spills (compacts) periodically to update the disk (CellStore)

==== Write ahead commit log
- When updates come into a rangeserver, they're written to a commit log first, then the in-memory data structures are updated, so the log can be replayed after a failure.
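
The log-then-update order, and the spill from CellCache to CellStore, might look roughly like the toy below. Everything here is an assumption made for illustration: the class +RangeServerSketch+, the JSON log records, the spill threshold, and truncating the log after a spill are not details given in the talk, and plain Python lists stand in for files on the distributed FS.

```python
import json


class RangeServerSketch:
    """Toy range server: commit log first, then CellCache, spilling to CellStores."""

    def __init__(self, spill_threshold=3):
        self.commit_log = []   # stands in for an append-only log file
        self.cell_cache = {}   # in-memory updates (CellCache)
        self.cell_stores = []  # immutable sorted spills (CellStore)
        self.spill_threshold = spill_threshold

    def update(self, key, value):
        self.commit_log.append(json.dumps([key, value]))  # 1. write ahead
        self.cell_cache[key] = value                      # 2. update memory
        if len(self.cell_cache) >= self.spill_threshold:
            self.spill()

    def spill(self):
        """Write the cache out as a sorted store and start a fresh log."""
        self.cell_stores.append(sorted(self.cell_cache.items()))
        self.cell_cache = {}
        self.commit_log = []  # assumption: spilled updates no longer need the log

    def get(self, key):
        if key in self.cell_cache:
            return self.cell_cache[key]
        for store in reversed(self.cell_stores):  # newest spill wins
            for stored_key, value in store:
                if stored_key == key:
                    return value
        return None

    @classmethod
    def recover(cls, commit_log, cell_stores, spill_threshold=3):
        """Rebuild the in-memory state by replaying the surviving log."""
        server = cls(spill_threshold)
        server.cell_stores = list(cell_stores)
        server.commit_log = list(commit_log)
        for record in commit_log:
            key, value = json.loads(record)
            server.cell_cache[key] = value
        return server
```

If the process dies after two updates, calling +recover+ with the surviving log and stores brings back both values, which is the point of writing the log before touching memory.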

==== Range meta-operation log
- When a rangeserver does anything (moves, stops), it's written into the log

== Client API
- C++ client is the only one supported ATM:
  - You modify a table by creating a mutator
  - You scan a table by creating a scanner
- Thrift Broker in the works
- Someone contrib'd a Hadoop Map/Reduce connector

== Compression
- CellStore: compressed KV pairs
- Commit log: Compressed blocks (optionally)
- Supported types
  - zlib (fastest/best)
  - lzo (high decomp speed)
  - quicklz (fast decomp, high ratio)
  - bmz (longest common substring, lots of replication)
  - none
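
As a rough illustration of "blocks of compressed KV pairs", the sketch below packs sorted pairs into blocks and compresses each block with zlib from the Python standard library. The record framing (two 4-byte lengths), the helper names, and reading "65K" as 65536 bytes of uncompressed data are all assumptions, not the real CellStore format.

```python
import struct
import zlib

BLOCK_SIZE = 65536  # assumed reading of the "65K" block size, before compression


def encode_pair(key: bytes, value: bytes) -> bytes:
    """Length-prefix a key/value pair so it can be parsed back out of a block."""
    return struct.pack(">II", len(key), len(value)) + key + value


def build_blocks(pairs, block_size=BLOCK_SIZE):
    """Pack sorted KV pairs into blocks and compress each block separately."""
    blocks, current = [], bytearray()
    for key, value in sorted(pairs):
        record = encode_pair(key, value)
        if current and len(current) + len(record) > block_size:
            blocks.append(zlib.compress(bytes(current)))
            current = bytearray()
        current += record
    if current:
        blocks.append(zlib.compress(bytes(current)))
    return blocks


def read_block(block: bytes):
    """Decompress one block and return its KV pairs in stored order."""
    data = zlib.decompress(block)
    pairs, offset = [], 0
    while offset < len(data):
        key_len, value_len = struct.unpack_from(">II", data, offset)
        offset += 8
        key = data[offset:offset + key_len]
        offset += key_len
        value = data[offset:offset + value_len]
        offset += value_len
        pairs.append((key, value))
    return pairs
```

Compressing per block rather than per file means a reader only has to decompress the one block that may hold the key it is looking for.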

== Caching
=== Block Cache
- CellStore blocks of KV pairs, configurable

=== Query cache
- Not finished implementing
- Caches results

=== Bloom Filter (!!)
- Negative Cache
- Configurable K
- Allows you to find out if you definitely *don't* have the data
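
A small Bloom filter sketch shows why it works as a negative cache. It assumes K in the notes means the number of hash functions, which is the usual meaning. The class name, the default sizes, and the use of salted SHA-256 as the hash family are illustrative choices, not Hypertable's implementation.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions; answers 'definitely not' or 'maybe'."""

    def __init__(self, num_bits=1024, k=3):
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting one hash with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        """False means the item was never added; True means it may have been."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A +False+ answer is always correct, so a read for a missing row can skip that store without touching disk. A +True+ answer can be a false positive, and the chance of one depends on the number of bits, k, and how many items were added.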

=== Scaling
- Session table and crawl table
- Tables are split up into ranges, which go to rangeservers
- Just add more machines, and the system migrates data evenly
- Balancing is questionable...
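
One way to picture ranges going to rangeservers is a sorted list of split keys with a binary search on the row key. The +RangeMap+ class, the inclusive end-row convention, and the server names below are assumptions made for the example; the talk did not describe how Hypertable locates a range.

```python
import bisect


class RangeMap:
    """Maps a row key to the server holding the range that contains it."""

    def __init__(self, split_keys, servers):
        # split_keys[i] is the last row of range i; the final range is open-ended.
        if list(split_keys) != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per range")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def server_for(self, row):
        # bisect_left finds the first range whose end row is >= row.
        return self.servers[bisect.bisect_left(self.split_keys, row)]
```

With split keys +["g", "p"]+ and three servers, rows up to and including "g" go to the first server, rows after "g" up to "p" go to the second, and everything else goes to the third.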

=== Access Groups
- Control of physical layout, hybrid row/column oriented
- Improves perf. by minimizing IO
- Grouping columns allows physical storage control
- Makes faster updates possible

=== FS Broker
- Can run on any distributed FS
- FUSE hooks

== More
- Comparison to HBase (Java, yuck), C++ much better
- System is designed for async communication
- Hypertable is CPU intensive
- Java uses 2-3 times the memory for large memmap
- Poor processor cache perf.

== Performance
- AOL Query logs
- 75,275,825 inserted cells
- 8-node cluster (1 1.8 GHz Dual Core Opteron)
- 4GB RAM
- 3x 7200 RPM SATA
- Row Key 7B
- Avg value 15B
- Crap. Slide change

- Another test yielded over 1M sustained inserts/s

== Weaknesses
- Range data managed by a single rangeserver
- No data loss, but if it goes down, bad bad
- Can be mitigated with client-side cache or memcached

== Status
- Alpha, 0.9.0.7 released
- Beta at the end of August
- Waiting on Hadoop JIRA 1700
  - Bug in Hadoop, doesn't allow appending to existing files
- GPL 2

- Delete records get flushed in a "major operation"