List tweak
DerrickWood committed Feb 18, 2015
1 parent 39f7e93 commit 6de8f35
Showing 2 changed files with 13 additions and 5 deletions.
17 changes: 12 additions & 5 deletions docs/MANUAL.html
@@ -11,11 +11,11 @@
</head>
<body>
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>
<p class="title">Kraken taxonomic sequence classification system</p>

<p class="version">Version 0.10.5-beta</p>
<p class="version">Version 0.10.5-beta</p>

<p>Operating Manual</p>
<p>Operating Manual</p>
</div>

<h1>Table of Contents</h1>
@@ -46,7 +46,7 @@ <h1 id="system-requirements"><a href="#system-requirements">System Requirements<
<li><p><strong>Disk space</strong>: Construction of Kraken's standard database will require at least 160 GB of disk space. Customized databases may require more or less space. Disk space used is linearly proportional to the number of distinct <span class="math"><em>k</em></span>-mers; as of Feb. 2015, Kraken's default database contains just under 6 billion (6e9) distinct <span class="math"><em>k</em></span>-mers.</p>
<p>In addition, the disk used to store the database should be locally-attached storage. Storing the database on a network filesystem (NFS) partition can cause Kraken's operation to be very slow, or to be stopped completely. As NFS accesses are much slower than local disk accesses, both preloading and database building will be slowed by use of NFS.</p></li>
<li><p><strong>Memory</strong>: To run efficiently, Kraken requires enough free memory to hold the database in RAM. While this can be accomplished using a ramdisk, Kraken supplies a utility for loading the database into RAM via the OS cache. The default database size is 75 GB (as of Feb. 2015), and so you will need at least that much RAM if you want to build or run with the default database.</p></li>
<li><p><strong>Dependencies</strong>: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Downloads of NCBI data are performed by wget and in some cases, by rsync.</p>
<li><p><strong>Dependencies</strong>: Kraken currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++, and need to be compiled using g++. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and in some cases, by rsync. Most Linux systems that have any sort of development package installed will have all of the above listed programs and libraries available.</p>
<p>Finally, if you want to build your own database, you will need to install the <a href="http://www.cbcb.umd.edu/software/jellyfish/">Jellyfish</a> <span class="math"><em>k</em></span>-mer counter. Note that Kraken only supports use of Jellyfish version 1. Jellyfish version 2 is not yet compatible with Kraken.</p></li>
<li><p><strong>Network connectivity</strong>: Kraken's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as <code>ftp_proxy</code> or <code>RSYNC_PROXY</code>) in order to get these commands to work properly.</p></li>
<li><p><strong>MiniKraken</strong>: To allow users with low-memory computing environments to use Kraken, we supply a reduced standard database that can be downloaded from the Kraken web site. When Kraken is run with a reduced database, we call it MiniKraken.</p>
@@ -222,7 +222,14 @@ <h1 id="sample-reports"><a href="#sample-reports">Sample Reports</a></h1>
<h1 id="confidence-scoring"><a href="#confidence-scoring">Confidence Scoring</a></h1>
<p>At present, we have not yet developed a confidence score with a solid probabilistic interpretation for Kraken. However, we have developed a simple scoring scheme that has yielded good results for us, and we've made that available in the <code>kraken-filter</code> script. The approach we use allows a user to specify a threshold score in the [0,1] interval; the <code>kraken-filter</code> script then will adjust labels up the tree until the label's score (described below) meets or exceeds that threshold. If a label at the root of the taxonomic tree would not have a score exceeding the threshold, the sequence is called unclassified by kraken-filter.</p>
<p>A sequence label's score is a fraction <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span>, where <span class="math"><em>C</em></span> is the number of <span class="math"><em>k</em></span>-mers mapped to LCA values in the clade rooted at the label, and <span class="math"><em>Q</em></span> is the number of <span class="math"><em>k</em></span>-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried against the database). Consider the example of the LCA mappings in Kraken's output given earlier:</p>
<p>&quot;562:13 561:4 A:31 0:1 562:3&quot; would indicate that: * the first 13 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562 * the next 4 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #561 * the next 31 <span class="math"><em>k</em></span>-mers contained an ambiguous nucleotide * the next <span class="math"><em>k</em></span>-mer was not in the database * the last 3 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</p>
<p>&quot;562:13 561:4 A:31 0:1 562:3&quot; would indicate that:</p>
<ul>
<li>the first 13 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
<li>the next 4 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #561</li>
<li>the next 31 <span class="math"><em>k</em></span>-mers contained an ambiguous nucleotide</li>
<li>the next <span class="math"><em>k</em></span>-mer was not in the database</li>
<li>the last 3 <span class="math"><em>k</em></span>-mers mapped to taxonomy ID #562</li>
</ul>
<p>In this case, ID #561 is the parent node of #562. Here, a label of #562 for this sequence would have a score of <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span> = (13+3)/(13+4+1+3) = 16/21. A label of #561 would have a score of <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span> = (13+4+3)/(13+4+1+3) = 20/21. If a user specified a threshold over 16/21, kraken-filter would adjust the original label from #562 to #561; if the threshold was greater than 20/21, the sequence would become unclassified.</p>
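<p>The arithmetic above can be reproduced with a short Python sketch. This is not the <code>kraken-filter</code> implementation; the function names and the toy <code>parent</code> map (child taxonomy ID to parent taxonomy ID) are assumptions made for illustration, and only the <span class="math"><em>C</em></span>/<span class="math"><em>Q</em></span> definition and the example LCA string come from the text.</p>

```python
from fractions import Fraction

def parse_lca_string(lca):
    """Split a string such as '562:13 561:4 A:31' into (taxon, count) pairs."""
    pairs = []
    for token in lca.split():
        taxon, count = token.split(":")
        pairs.append((taxon, int(count)))
    return pairs

def in_clade(taxon, label, parent):
    """Return True if taxon is label or a descendant of label in the parent map."""
    while taxon is not None:
        if taxon == label:
            return True
        taxon = parent.get(taxon)  # None once we walk past the top of the toy tree
    return False

def label_score(lca, label, parent):
    """Compute C/Q for a candidate label (hypothetical helper, not Kraken's API)."""
    pairs = parse_lca_string(lca)
    # Q: k-mers that were queried, i.e. everything except ambiguous ("A") k-mers.
    q = sum(n for taxon, n in pairs if taxon != "A")
    # C: k-mers whose LCA lies in the clade rooted at the label.
    # "0" marks k-mers that were not in the database, so they never count toward C.
    c = sum(n for taxon, n in pairs
            if taxon not in ("A", "0") and in_clade(taxon, label, parent))
    return Fraction(c, q) if q else Fraction(0)

def adjust_label(lca, label, parent, threshold):
    """Move the label up the tree until its score meets the threshold.

    Returns None (unclassified) if no ancestor in the map qualifies.
    """
    while label is not None:
        if label_score(lca, label, parent) >= threshold:
            return label
        label = parent.get(label)
    return None

# Toy taxonomy for the example: #561 is the parent of #562.
parent = {"562": "561"}
lca = "562:13 561:4 A:31 0:1 562:3"

print(label_score(lca, "562", parent))         # 16/21
print(label_score(lca, "561", parent))         # 20/21
print(adjust_label(lca, "562", parent, 0.80))  # 561
print(adjust_label(lca, "562", parent, 0.99))  # None
```

<p>In this sketch #561 is the top of the toy tree, so a threshold above 20/21 leaves the sequence unclassified, matching the example in the text. A real taxonomy would continue up to the root.</p>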
<p><code>kraken-filter</code> is used like this:</p>
<pre><code>kraken-filter --db $DBNAME [--threshold NUM] kraken.output</code></pre>
1 change: 1 addition & 0 deletions docs/MANUAL.markdown
@@ -587,6 +587,7 @@ they were queried against the database). Consider the example of the
LCA mappings in Kraken's output given earlier:

"562:13 561:4 A:31 0:1 562:3" would indicate that:

* the first 13 $k$-mers mapped to taxonomy ID #562
* the next 4 $k$-mers mapped to taxonomy ID #561
* the next 31 $k$-mers contained an ambiguous nucleotide