|
dee801b6
»
|
Erik |
2008-07-24 |
Mods to some notes |
1 |
= OSCON 2008, Session 3: Hypertable

== GFS
- Runs on 1,000 machines, not 1

=== Filesystem
- 64MB chunks
- Replicates each chunk across machines
- By doing so, the system is impervious to a whole class of hardware failures
  - Power supply
  - Power to the rack
  - Network failure
- Map/Reduce
- Bigtable

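A minimal sketch of the chunk-and-replicate idea above, not GFS's actual placement policy. The replication factor of 3, the round-robin placement, and the names `plan_chunks` and `surviving_chunks` are assumptions made for illustration; only the 64MB chunk size comes from the talk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as stated in the talk
REPLICAS = 3                   # assumed replication factor, not from the talk


def plan_chunks(file_size, machines, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Return a list of (chunk_index, [machines holding a replica])."""
    if replicas > len(machines):
        raise ValueError("need at least as many machines as replicas")
    n_chunks = max(1, -(-file_size // chunk_size))  # ceiling division
    placements = []
    for i in range(n_chunks):
        # Hypothetical round-robin policy: the replicas of chunk i land on
        # `replicas` distinct, consecutive machines.
        holders = [machines[(i + r) % len(machines)] for r in range(replicas)]
        placements.append((i, holders))
    return placements


def surviving_chunks(placements, failed):
    """Indexes of chunks that still have at least one live replica."""
    return [i for i, holders in placements
            if any(m not in failed for m in holders)]
```

With five machines and three replicas per chunk, losing any two machines (a power supply, a rack) still leaves every chunk readable, which is the point of the failure list above.
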
=== Hypertable
- Not relational
- Modeled after Google's Bigtable
- One big, massive primary-keyed table
- No transactions, maybe in the future
- *Scalable*
- <b>High random insert, update, and delete rate</b>
- Loaded 1TB into a 9-node Hypertable cluster and sustained random inserts at 1M inserts per second (quad-core Intel, 16GB RAM, SATA 3Gb/s)

=== Data Model
- Sparse, 2D table with cell versions
- One table can have 2 columns and the next 1M; that's OK
- 4-part key
  - Row
  - Column Family
  - Column Qualifier
  - Timestamp

- <i>Tim O'Reilly walks in and looks around for a seat; they're all taken</i>

=== Anatomy of a key
- Row key is \0-terminated
- Column family is a single byte (256 possible)
- Column qualifier is \0-terminated
- Timestamp is stored big-endian in one's complement, so a plain memcmp orders more recent versions ahead of older ones

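The key layout above can be sketched as follows. This is an illustration of the described byte order, not Hypertable's actual serialization code: the 64-bit timestamp width and the function names `encode_key` and `decode_key` are assumptions. Python compares `bytes` lexicographically as unsigned bytes, which matches what memcmp does on equal-length prefixes.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit timestamp width


def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize the 4-part key so that byte-wise comparison gives the
    desired ordering: row, then family, then qualifier, then newest first."""
    if b"\0" in row or b"\0" in qualifier:
        raise ValueError("row and qualifier are \\0-terminated, so no \\0 inside")
    if not 0 <= family <= 255:
        raise ValueError("column family must fit in a single byte")
    if not 0 <= timestamp <= MASK64:
        raise ValueError("timestamp must fit in 64 bits")
    inverted = ~timestamp & MASK64  # one's complement: larger time -> smaller bytes
    return (row + b"\0" + bytes([family]) + qualifier + b"\0"
            + struct.pack(">Q", inverted))  # big-endian


def decode_key(key: bytes):
    """Inverse of encode_key; returns (row, family, qualifier, timestamp)."""
    row_end = key.index(b"\0")
    family = key[row_end + 1]
    qual_end = key.index(b"\0", row_end + 2)
    (inverted,) = struct.unpack(">Q", key[qual_end + 1:qual_end + 9])
    return key[:row_end], family, key[row_end + 2:qual_end], ~inverted & MASK64
```

Sorting a list of encoded keys with plain `sorted()` then yields all versions of a cell next to each other, most recent first, with no custom comparator.
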
=== Concurrency
- Bigtable uses copy-on-write
- Hypertable uses a form of MVCC (like CouchDB). Deletes are inserted as delete records. Multiple versions are kept around.

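A toy model of deletes-as-records, assuming nothing about Hypertable's internals beyond what the bullets say. The class name `VersionedCells`, the `DELETE` marker, and the compaction rule (drop everything at or before the newest delete record) are all hypothetical choices for illustration.

```python
DELETE = object()  # marker standing in for a delete record


class VersionedCells:
    """Append-only versions per key; a delete is just another version."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value or DELETE)

    def put(self, key, value, ts):
        self._versions.setdefault(key, []).append((ts, value))

    def delete(self, key, ts):
        # Nothing is removed here: the delete is inserted as a record.
        self.put(key, DELETE, ts)

    def get(self, key, as_of=None):
        """Newest visible value, optionally as of an older timestamp."""
        visible = [(ts, v) for ts, v in self._versions.get(key, [])
                   if as_of is None or ts <= as_of]
        if not visible:
            return None
        _, value = max(visible, key=lambda tv: tv[0])
        return None if value is DELETE else value

    def major_compaction(self):
        """Assumed cleanup pass: physically drop delete records and every
        version they shadow."""
        for key in list(self._versions):
            versions = self._versions[key]
            newest_delete = max((ts for ts, v in versions if v is DELETE),
                                default=None)
            if newest_delete is None:
                continue
            kept = [(ts, v) for ts, v in versions if ts > newest_delete]
            if kept:
                self._versions[key] = kept
            else:
                del self._versions[key]
```

Readers never block writers in this scheme because existing versions are never modified in place; the cost is that old versions and delete records occupy space until a cleanup pass runs.
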
=== CellStore
- 65K blocks of compressed KV pairs
- Bloom Filter - booya!

=== System Overview
- Hyperspace has a distributed lock manager and a small metadata FS (built on BDB)
- Chubby is Google's equivalent of Hyperspace
- The master's function is to perform metadata operations (ALTER, CREATE, etc.)
- Clients can communicate with range servers directly
- Master can be down for a while with no one even noticing
- Hot standby design for availability
- Range servers: responsible for UPDATING and SCANNING
- All of this sits on top of the HDFS distributed FS
- Hadoop, KFS (GFS clone)

=== Range server
- Manages ranges of table data
- Caches updates in memory (CellCache)
- Periodically spills (compacts) the cache to disk (CellStore)

==== Write ahead commit log
- When updates come into a range server, they're written to a commit log first and then the in-memory data structures are updated, so the log can be replayed to rebuild them.

==== Range meta-operation log
- When a range server does anything to a range (moves, stops), it's written into this log

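The log-then-update order described above, as a small sketch. It is not Hypertable's code: `ToyRangeServer`, the JSON-lines log format, the spill threshold, and the `disk` list (a stand-in for CellStore files on the distributed FS that survive a crash) are all assumptions for illustration.

```python
import json
import os


class ToyRangeServer:
    def __init__(self, log_path, disk, spill_threshold=3):
        self.log_path = log_path
        self.disk = disk                # durable stand-in: list of sorted CellStores
        self.spill_threshold = spill_threshold
        self.cell_cache = {}            # in-memory CellCache, lost on a crash
        self._replay()

    def _replay(self):
        """Rebuild the CellCache from the commit log after a restart."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                key, value = json.loads(line)
                self.cell_cache[key] = value

    def update(self, key, value):
        # 1. Append to the commit log and force it to disk first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ...and only then touch the in-memory structure.
        self.cell_cache[key] = value
        if len(self.cell_cache) >= self.spill_threshold:
            self._spill()

    def _spill(self):
        """Write the cache out as an immutable sorted CellStore."""
        self.disk.append(sorted(self.cell_cache.items()))
        self.cell_cache = {}
        # The spilled updates are durable now, so the log can be truncated.
        open(self.log_path, "w").close()

    def get(self, key):
        if key in self.cell_cache:
            return self.cell_cache[key]
        for store in reversed(self.disk):   # newest CellStore first
            for k, v in store:
                if k == key:
                    return v
        return None
```

Constructing a second `ToyRangeServer` on the same log path and `disk` list simulates a restart: anything acknowledged by `update` is either replayed from the log or already in a spilled store.
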
== Client API
- C++ client is the only one supported ATM:
  - You modify a table by creating a mutator
  - You scan a table by creating a scanner
- Thrift Broker in the works
- Someone contributed a Hadoop Map/Reduce connector

== Compression
- CellStore: compressed KV pairs
- Commit log: compressed blocks (optionally)
- Supported types
  - zlib (fastest/best)
  - lzo (high decomp speed)
  - quicklz (fast decomp, high ratio)
  - bmz (longest common substring, lots of replication)
  - none

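To make "blocks of compressed KV pairs" concrete, here is a sketch using Python's standard `zlib` module. The length-prefixed layout and the names `pack_block` and `unpack_block` are assumptions for illustration, not the CellStore's actual on-disk format.

```python
import zlib


def pack_block(pairs, level=6):
    """Serialize (key, value) byte pairs with 4-byte length prefixes,
    then compress the whole block with zlib."""
    raw = bytearray()
    for key, value in pairs:
        raw += len(key).to_bytes(4, "big") + key
        raw += len(value).to_bytes(4, "big") + value
    return zlib.compress(bytes(raw), level)


def unpack_block(block):
    """Decompress a block and split it back into (key, value) pairs."""
    raw = zlib.decompress(block)
    pairs, pos = [], 0
    while pos < len(raw):
        klen = int.from_bytes(raw[pos:pos + 4], "big")
        pos += 4
        key = raw[pos:pos + klen]
        pos += klen
        vlen = int.from_bytes(raw[pos:pos + 4], "big")
        pos += 4
        value = raw[pos:pos + vlen]
        pos += vlen
        pairs.append((key, value))
    return pairs
```

Sorted keys with shared prefixes and repeated values compress well, which is why compressing per block rather than per cell pays off.
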
== Caching
=== Block Cache
- Caches CellStore blocks of KV pairs; configurable

=== Query cache
- Not finished implementing
- Caches results

=== Bloom Filter (!!)
- Negative cache
- Configurable k (number of hash functions)
- Allows you to find out if you definitely *don't* have the data

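A minimal Bloom filter showing the "definitely don't have it" property. The bit-array size, the default `k`, and the use of salted SHA-256 as the hash family are assumptions for illustration; the talk did not say which hash functions Hypertable uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, k=3):
        self.num_bits = num_bits
        self.k = k                                  # number of hash functions
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: bytes):
        # Derive k bit positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means the key was never added. True means 'maybe':
        false positives are possible, false negatives are not."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A reader would consult the filter before touching a CellStore and skip the disk read entirely whenever `might_contain` returns `False`.
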
=== Scaling
- Example: a session table and a crawl table
- Tables are all split up into ranges, which go to range servers
- Just add more machines, and the system migrates data to spread it evenly
- Balancing is questionable...

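One way to picture ranges going to range servers is a sorted list of split keys with a binary search on the row key. This is a sketch under assumptions: `RangeMap`, its methods, and the rule that a split hands the upper half to the new server are hypothetical, not Hypertable's actual metadata layout or balancing policy.

```python
import bisect


class RangeMap:
    """Maps row keys to servers. With split keys [s0, s1, ...], range 0
    holds keys < s0, range 1 holds s0 <= key < s1, and so on."""

    def __init__(self, split_keys, servers):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per range")
        if list(split_keys) != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def server_for(self, row_key):
        # Binary search for the range that contains row_key.
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]

    def split(self, split_key, new_server):
        """Split the range containing split_key; keys >= split_key in that
        range move to new_server (assumed policy)."""
        if split_key in self.split_keys:
            raise ValueError("already split at this key")
        i = bisect.bisect_right(self.split_keys, split_key)
        self.split_keys.insert(i, split_key)
        self.servers.insert(i + 1, new_server)
```

Adding a machine then amounts to splitting a busy range and pointing the new half at the new server; only the rows in that half have to move.
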
=== Access Groups
- Control over physical layout: hybrid row/column-oriented
- Improves performance by minimizing I/O
- Grouping columns allows physical storage control
- Makes faster updates possible

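A toy illustration of why grouping columns minimizes I/O: each access group gets its own physical store, so reading one column never touches the others. The class `AccessGroupTable`, the group and family names, and the read counters are hypothetical and only meant to show the idea.

```python
class AccessGroupTable:
    def __init__(self, groups):
        # groups: {"group name": ["column family", ...]}
        self.family_to_group = {family: group
                                for group, families in groups.items()
                                for family in families}
        self.stores = {group: {} for group in groups}  # one store per group
        self.reads = {group: 0 for group in groups}    # I/O counter per store

    def put(self, row, family, value):
        group = self.family_to_group[family]
        self.stores[group][(row, family)] = value

    def get(self, row, family):
        group = self.family_to_group[family]
        self.reads[group] += 1  # only this group's store is touched
        return self.stores[group].get((row, family))
```

Putting small, frequently scanned families (say `title` and `lang`) in one group and a large family (`html`) in another means a scan over titles never pages in the HTML.
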
=== FS Broker
- Can run on any distributed FS
- FUSE hooks

== More
- Comparison to HBase (Java, yuck), C++ much better
- System is designed for async communication
- Hypertable is CPU intensive
- Java uses 2-3 times the memory for large memmap
- Poor processor cache performance

== Performance
- AOL query logs
- 75,275,825 inserted cells
- 8-node cluster (1x 1.8 GHz dual-core Opteron)
- 4GB RAM
- 3x 7200 RPM SATA
- Row key 7B
- Avg value 15B
- Crap. Slide change

- Another test yielded over 1M sustained inserts/s

== Weaknesses
- Each range's data is managed by a single range server
- No data loss if it goes down, but it's bad, bad while it's out
- Can be mitigated with a client-side cache or memcached

== Status
- Alpha, 0.9.0.7 released
- Beta at the end of August
- Waiting on Hadoop JIRA 1700
- Bug in Hadoop: it doesn't allow appending to existing files
- GPL 2

- Delete records get flushed in a "major operation"