Permalink
Cannot retrieve contributors at this time
Fetching contributors…
| <!DOCTYPE html> | |
| <html> | |
| <head> | |
| <link rel="stylesheet" type="text/css" href="doc.css" /> | |
| <title>Leveldb file layout and compactions</title> | |
| </head> | |
| <body> | |
| <h1>Files</h1> | |
| The implementation of leveldb is similar in spirit to the | |
| representation of a single | |
| <a href="http://research.google.com/archive/bigtable.html"> | |
| Bigtable tablet (section 5.3)</a>. | |
| However the organization of the files that make up the representation | |
| is somewhat different and is explained below. | |
| <p> | |
| Each database is represented by a set of files stored in a directory. | |
| There are several different types of files as documented below: | |
| <p> | |
| <h2>Log files</h2> | |
| <p> | |
| A log file (*.log) stores a sequence of recent updates. Each update | |
| is appended to the current log file. When the log file reaches a | |
| pre-determined size (approximately 4MB by default), it is converted | |
| to a sorted table (see below) and a new log file is created for future | |
| updates. | |
| <p> | |
| A copy of the current log file is kept in an in-memory structure (the | |
| <code>memtable</code>). This copy is consulted on every read so that read | |
| operations reflect all logged updates. | |
| <p> | |
| <h2>Sorted tables</h2> | |
| <p> | |
| A sorted table (*.sst) stores a sequence of entries sorted by key. | |
| Each entry is either a value for the key, or a deletion marker for the | |
| key. (Deletion markers are kept around to hide obsolete values | |
| present in older sorted tables). | |
| <p> | |
| The set of sorted tables are organized into a sequence of levels. The | |
| sorted table generated from a log file is placed in a special <code>young</code> | |
| level (also called level-0). When the number of young files exceeds a | |
| certain threshold (currently four), all of the young files are merged | |
| together with all of the overlapping level-1 files to produce a | |
| sequence of new level-1 files (we create a new level-1 file for every | |
| 2MB of data.) | |
| <p> | |
| Files in the young level may contain overlapping keys. However files | |
| in other levels have distinct non-overlapping key ranges. Consider | |
| level number L where L >= 1. When the combined size of files in | |
| level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, | |
| ...), one file in level-L, and all of the overlapping files in | |
| level-(L+1) are merged to form a set of new files for level-(L+1). | |
| These merges have the effect of gradually migrating new updates from | |
| the young level to the largest level using only bulk reads and writes | |
| (i.e., minimizing expensive seeks). | |
| <h2>Manifest</h2> | |
| <p> | |
| A MANIFEST file lists the set of sorted tables that make up each | |
| level, the corresponding key ranges, and other important metadata. | |
| A new MANIFEST file (with a new number embedded in the file name) | |
| is created whenever the database is reopened. The MANIFEST file is | |
| formatted as a log, and changes made to the serving state (as files | |
| are added or removed) are appended to this log. | |
| <p> | |
| <h2>Current</h2> | |
| <p> | |
| CURRENT is a simple text file that contains the name of the latest | |
| MANIFEST file. | |
| <p> | |
| <h2>Info logs</h2> | |
| <p> | |
| Informational messages are printed to files named LOG and LOG.old. | |
| <p> | |
| <h2>Others</h2> | |
| <p> | |
| Other files used for miscellaneous purposes may also be present | |
| (LOCK, *.dbtmp). | |
| <h1>Level 0</h1> | |
| When the log file grows above a certain size (1MB by default): | |
| <ul> | |
| <li>Create a brand new memtable and log file and direct future updates here | |
| <li>In the background: | |
| <ul> | |
| <li>Write the contents of the previous memtable to an sstable | |
| <li>Discard the memtable | |
| <li>Delete the old log file and the old memtable | |
| <li>Add the new sstable to the young (level-0) level. | |
| </ul> | |
| </ul> | |
| <h1>Compactions</h1> | |
| <p> | |
| When the size of level L exceeds its limit, we compact it in a | |
| background thread. The compaction picks a file from level L and all | |
| overlapping files from the next level L+1. Note that if a level-L | |
| file overlaps only part of a level-(L+1) file, the entire file at | |
| level-(L+1) is used as an input to the compaction and will be | |
| discarded after the compaction. Aside: because level-0 is special | |
| (files in it may overlap each other), we treat compactions from | |
| level-0 to level-1 specially: a level-0 compaction may pick more than | |
| one level-0 file in case some of these files overlap each other. | |
| <p> | |
| A compaction merges the contents of the picked files to produce a | |
| sequence of level-(L+1) files. We switch to producing a new | |
| level-(L+1) file after the current output file has reached the target | |
| file size (2MB). We also switch to a new output file when the key | |
| range of the current output file has grown enough to overlap more then | |
| ten level-(L+2) files. This last rule ensures that a later compaction | |
| of a level-(L+1) file will not pick up too much data from level-(L+2). | |
| <p> | |
| The old files are discarded and the new files are added to the serving | |
| state. | |
| <p> | |
| Compactions for a particular level rotate through the key space. In | |
| more detail, for each level L, we remember the ending key of the last | |
| compaction at level L. The next compaction for level L will pick the | |
| first file that starts after this key (wrapping around to the | |
| beginning of the key space if there is no such file). | |
| <p> | |
| Compactions drop overwritten values. They also drop deletion markers | |
| if there are no higher numbered levels that contain a file whose range | |
| overlaps the current key. | |
| <h2>Timing</h2> | |
| Level-0 compactions will read up to four 1MB files from level-0, and | |
| at worst all the level-1 files (10MB). I.e., we will read 14MB and | |
| write 14MB. | |
| <p> | |
| Other than the special level-0 compactions, we will pick one 2MB file | |
| from level L. In the worst case, this will overlap ~ 12 files from | |
| level L+1 (10 because level-(L+1) is ten times the size of level-L, | |
| and another two at the boundaries since the file ranges at level-L | |
| will usually not be aligned with the file ranges at level-L+1). The | |
| compaction will therefore read 26MB and write 26MB. Assuming a disk | |
| IO rate of 100MB/s (ballpark range for modern drives), the worst | |
| compaction cost will be approximately 0.5 second. | |
| <p> | |
| If we throttle the background writing to something small, say 10% of | |
| the full 100MB/s speed, a compaction may take up to 5 seconds. If the | |
| user is writing at 10MB/s, we might build up lots of level-0 files | |
| (~50 to hold the 5*10MB). This may signficantly increase the cost of | |
| reads due to the overhead of merging more files together on every | |
| read. | |
| <p> | |
| Solution 1: To reduce this problem, we might want to increase the log | |
| switching threshold when the number of level-0 files is large. Though | |
| the downside is that the larger this threshold, the more memory we will | |
| need to hold the corresponding memtable. | |
| <p> | |
| Solution 2: We might want to decrease write rate artificially when the | |
| number of level-0 files goes up. | |
| <p> | |
| Solution 3: We work on reducing the cost of very wide merges. | |
| Perhaps most of the level-0 files will have their blocks sitting | |
| uncompressed in the cache and we will only need to worry about the | |
| O(N) complexity in the merging iterator. | |
| <h2>Number of files</h2> | |
| Instead of always making 2MB files, we could make larger files for | |
| larger levels to reduce the total file count, though at the expense of | |
| more bursty compactions. Alternatively, we could shard the set of | |
| files into multiple directories. | |
| <p> | |
| An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows | |
| the following timings to do 100K file opens in directories with | |
| varying number of files: | |
| <table class="datatable"> | |
| <tr><th>Files in directory</th><th>Microseconds to open a file</th></tr> | |
| <tr><td>1000</td><td>9</td> | |
| <tr><td>10000</td><td>10</td> | |
| <tr><td>100000</td><td>16</td> | |
| </table> | |
| So maybe even the sharding is not necessary on modern filesystems? | |
| <h1>Recovery</h1> | |
| <ul> | |
| <li> Read CURRENT to find name of the latest committed MANIFEST | |
| <li> Read the named MANIFEST file | |
| <li> Clean up stale files | |
| <li> We could open all sstables here, but it is probably better to be lazy... | |
| <li> Convert log chunk to a new level-0 sstable | |
| <li> Start directing new writes to a new log file with recovered sequence# | |
| </ul> | |
| <h1>Garbage collection of files</h1> | |
| <code>DeleteObsoleteFiles()</code> is called at the end of every | |
| compaction and at the end of recovery. It finds the names of all | |
| files in the database. It deletes all log files that are not the | |
| current log file. It deletes all table files that are not referenced | |
| from some level and are not the output of an active compaction. | |
| </body> | |
| </html> |