## Plan
- Create a notebook that downloads and preprocesses the data and summarises the steps
 - preprocessing includes defining folds
- Create a notebook to generate weights and train classifiers
- Create a command-line script for generating expected/frequency/sequence triples from a biom and a sv_map
- Create a command-line script for generating observeds for
 - uniform weights
 - global weights
 - bespoke weights
 - wrong weights

## Download
download stool SVs with abundances

do not run - takes hours

In [None]:
%%bash
redbiom search metadata 'where sample_type == "stool"' > stool_samples
redbiom search metadata 'where sample_type == "Stool"' >> stool_samples
export CTX=Deblur-illumina-16S-v4-150nt-10d7e0
redbiom fetch samples --from stool_samples --context $CTX --output stool_sv.biom

## Preprocess
extract V4 for 99% greengenes

blast the stool SVs against the greengenes amplicons

do not run - takes overnight

In [3]:
%%bash
qiime tools import --input-path 99_otus.fasta --output-path 99_otus.qza --type FeatureData[Sequence]
qiime feature-classifier extract-reads --i-sequences 99_otus.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --o-reads 99_otus_v4.qza
qiime tools export 99_otus_v4.qza --output-dir .
mv dna-sequences.fasta 99_otus_v4.fasta
biom table-ids --observations -i stool_sv.biom | awk '{print ">"$1"blast_rocks\n"$1}' > stool_sv.fasta
makeblastdb -in 99_otus_v4.fasta -dbtype nucl -out 99_otus_v4.db
blastn -num_threads 4 -query stool_sv.fasta -outfmt "6 qacc sacc" -db 99_otus_v4.db -max_target_seqs 1 -out stool_sv_map.blast

## Noise

In [78]:
import csv
from collections import defaultdict, Counter
import hashlib

import biom
from numpy.random import choice
import numpy
import skbio.io
from pandas import DataFrame, Series
from qiime2 import Artifact

In [105]:
stool_sv = biom.load_table('/Users/benkaehler/Data/paycheck/raw/soil/sv.biom')

In [107]:
sample_ids = numpy.array(stool_sv.ids())
sample_ids is stool_sv.ids()
numpy.random.shuffle(sample_ids)

construct a mapping from each SV to a sequence label from 99% greengenes

In [108]:
stool_sv_map = {}
with open('/Users/benkaehler/Data/paycheck/ref/soil/sv_map.blast') as blast_results:
    blast_reader = csv.reader(blast_results, csv.excel_tab)
    for row in blast_reader:
        sv = row[0]
        if sv in stool_sv_map:
            assert stool_sv_map[sv] == row[1],\
                ' '.join([sv, stool_sv_map[sv], row[1]])
            continue
        stool_sv_map[sv] = row[1]
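
A toy run of the same first-hit logic may make the behaviour clearer. This is only a sketch: it assumes a two-column, tab-separated BLAST table held in memory, and the function name `first_hit_map` and the rows are made up for illustration.

```python
import csv
import io


def first_hit_map(blast_fh):
    """Map each query ID to its subject ID, insisting that repeated rows agree."""
    mapping = {}
    for query, subject in csv.reader(blast_fh, csv.excel_tab):
        if query in mapping:
            # a repeated query must point at the same subject
            assert mapping[query] == subject, ' '.join(
                [query, mapping[query], subject])
            continue
        mapping[query] = subject
    return mapping


# hypothetical BLAST output: sv1 is reported twice with the same subject
toy_blast = io.StringIO('sv1\t4479944\nsv1\t4479944\nsv2\t1234\n')
toy_map = first_hit_map(toy_blast)
print(toy_map)
```

Repeated rows for a query are tolerated as long as they name the same subject; a conflicting row trips the assertion rather than silently overwriting the earlier hit.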

construct a mapping from each sequence label to its amplicon

In [109]:
with open('99_otus_v4.fasta') as ref_fh:
    fasta_reader = skbio.io.read(ref_fh, 'fasta')
    ref_seqs = {s.metadata['id']: str(s) for s in fasta_reader}

construct a mapping from each sequence label to its taxonomy

In [110]:
with open('99_otu_taxonomy.txt') as tax_fh:
    tax_reader = csv.reader(tax_fh, csv.excel_tab)
    tax_map = {r[0]: r[1] for r in tax_reader}

In [112]:
set(tax_map[t] for t in tax_map) - set(tax_map[t] for t in ref_seqs)

{'k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Chlorophyta; f__Ulvophyceae; g__Chlorodesmis; s__fastigiata',
 'k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Pseudobutyrivibrio; s__xylanivorans'}

choose a random stool sample, extract it, and filter out SVs with zero abundance

In [113]:
random_sample = stool_sv.ids()[choice(stool_sv.length())]
# override the random choice with a fixed sample ID
random_sample = '990.KA2F.E.11.30728'
random_sample = stool_sv.filter([random_sample], inplace=False)
random_sample.filter(lambda v, _, __: v[0] > 1e-9, axis='observation', inplace=True)

10615 x 1 <class 'biom.table.Table'> with 10615 nonzero entries (100% dense)

output the amplicon sequences to fasta, labelled by greengenes sequence label, with abundance that `vsearch --rereplicate` will understand

In [114]:
with open('abundance.fasta', 'w') as a_fh:
    for row in random_sample.iter(axis='observation'):
        abundance, sv, _ = row
        abundance = int(abundance[0])
        if sv in stool_sv_map:
            label = stool_sv_map[sv]
            a_fh.write('>' + label + ';size=' + str(abundance) + '\n')
            a_fh.write(ref_seqs[stool_sv_map[sv]] + '\n')
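
The record format written above can be checked on toy data. The sketch below assumes plain dicts in place of the biom table; `write_size_annotated` and all of the toy values are hypothetical. Note that the sequence written is the reference amplicon, not the SV itself, and that SVs without a BLAST hit are skipped.

```python
import io


def write_size_annotated(fh, abundances, sv_map, ref_seqs):
    """Write one FASTA record per mapped SV, labelled `label;size=N`."""
    for sv, abundance in abundances.items():
        if sv not in sv_map:
            continue  # SVs without a BLAST hit are dropped
        label = sv_map[sv]
        fh.write('>%s;size=%d\n' % (label, int(abundance)))
        fh.write(ref_seqs[label] + '\n')


# hypothetical data: GGGG has no entry in the SV map
toy_abundances = {'ACGT': 3.0, 'TTTT': 2.0, 'GGGG': 5.0}
toy_sv_map = {'ACGT': 'ref1', 'TTTT': 'ref2'}
toy_ref_seqs = {'ref1': 'ACGA', 'ref2': 'TTTA'}
buffer = io.StringIO()
write_size_annotated(buffer, toy_abundances, toy_sv_map, toy_ref_seqs)
print(buffer.getvalue())
```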

rereplicate according to abundance and run ART to simulate amplicons

In [115]:
%%bash
vsearch --rereplicate abundance.fasta --output prior_art.fasta
export PATH=$PATH:art_bin_MountRainier
art_illumina -ss MSv1 -amp -i prior_art.fasta -l 250 -o post_art -c 1 -na -p
if [ -d dada_in1 ]; then
    rm -r dada_in1 dada_in2 \
        dada_tmp1 dada_tmp2 \
        dada_out1 dada_out2
fi
mkdir dada_in1 dada_in2 dada_tmp1 dada_tmp2 dada_out1 dada_out2
gzip post_art1.fq
gzip post_art2.fq
mv post_art1.fq.gz dada_in1/post_art1.fastq.gz
mv post_art2.fq.gz dada_in2/post_art2.fastq.gz


             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

             Amplicon paired-end sequencing simulation

Total CPU time used: 48.8027

The random seed for the run: 1524717474

Parameters used during run
	Read Length:	250
	Genome masking 'N' cutoff frequency: 	1 in 250
	# Read Pairs per Amplion:  0
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   MiSeq v1 Length 250 R1 (built-in profile) 
	First Read:   MiSeq v1 Length 250 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
	 the 1st reads: post_art1.fq
	 the 2nd reads: post_art2.fq



vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

Rereplicating 100%
Rereplicated 471386 reads from 10604 amplicons
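
Conceptually, rereplication expands each size-annotated record into that many copies, which is how 10604 amplicons become 471386 reads above. The following is a rough Python sketch of that idea only; it does not reproduce how vsearch rewrites labels, and `rereplicate` and the toy records are assumptions for illustration.

```python
import re

SIZE_RE = re.compile(r';size=(\d+)')


def rereplicate(records):
    """Yield each (label, sequence) pair once per unit of its size annotation."""
    for label, seq in records:
        match = SIZE_RE.search(label)
        size = int(match.group(1)) if match else 1  # no annotation: one copy
        for _ in range(size):
            yield label, seq


# hypothetical size-annotated records
toy_records = [('ref1;size=3', 'ACGA'), ('ref2;size=2', 'TTTA')]
expanded = list(rereplicate(toy_records))
print(len(expanded))
```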


## Denoise
do not run - contains R code
```
inp_dir = "dada_in"
out_path = "post_dada2.tsv"
filtered_dir = "dada_tmp"
truncLen = 150
trimLeft = 0
maxEE = 2.0
truncQ = 2
chimeraMethod = "none"
minParentFold = 1.0
nthreads = 4
nreads_learn = 1000000
trace_dir = "dada_out"
```

In [116]:
%%file run_traceable_dada_paired.R
#!/usr/bin/env Rscript

###################################################
# This R script takes as input two directories of
# .fastq.gz files, corresponding to matched forward
# and reverse sequence files,
# and outputs a tsv file of the dada2 processed sequence
# table. It is intended for use with the QIIME2 plugin
# for DADA2.
#
# Rscript run_traceable_dada_paired.R input_dirF input_dirR output.tsv filtered_dirF filtered_dirR 240 160 0 0 2.0 2 pooled 1.0 0 100000 trace_dirF trace_dirR
####################################################

####################################################
#             DESCRIPTION OF ARGUMENTS             #
####################################################
# NOTE: All numeric arguments should be zero or positive.
# NOTE: All numeric arguments save maxEE are expected to be integers.
# NOTE: Currently the filtered_dirF/R must already exist.
# NOTE: ALL ARGUMENTS ARE POSITIONAL!
#
### FILE SYSTEM ARGUMENTS ###
#
# 1) File path to directory with the FORWARD .fastq.gz files to be processed.
#    Ex: path/to/dir/with/FWD_fastqgzs
#
# 2) File path to directory with the REVERSE .fastq.gz files to be processed.
#    Ex: path/to/dir/with/REV_fastqgzs
#
# 3) File path to output tsv file. If already exists, will be overwritten.
#    Ex: path/to/output_file.tsv
#
# 4) File path to directory to write the filtered FORWARD .fastq.gz files. These files are intermediate
#               for the full workflow. Currently they remain after the script finishes. Directory must
#               already exist.
#    Ex: path/to/dir/with/FWD_fastqgzs/filtered
#
# 5) File path to directory to write the filtered REVERSE .fastq.gz files. These files are intermediate
#               for the full workflow. Currently they remain after the script finishes. Directory must
#               already exist.
#    Ex: path/to/dir/with/REV_fastqgzs/filtered
#
### FILTERING ARGUMENTS ###
#
# 6) truncLenF - The position at which to truncate forward reads. Forward reads shorter
#               than truncLenF will be discarded.
#               Special values: 0 - no truncation or length filtering.
#    Ex: 240
#
# 7) truncLenR - The position at which to truncate reverse reads. Reverse reads shorter
#               than truncLenR will be discarded.
#               Special values: 0 - no truncation or length filtering.
#    Ex: 160
#
# 8) trimLeftF - The number of nucleotides to remove from the start of
#               each forward read. Should be less than truncLenF.
#    Ex: 0
#
# 9) trimLeftR - The number of nucleotides to remove from the start of
#               each reverse read. Should be less than truncLenR.
#    Ex: 0
#
# 10) maxEE - Reads with expected errors higher than maxEE are discarded.
#               Both forward and reverse reads are independently tested.
#    Ex: 2.0
#
# 11) truncQ - Reads are truncated at the first instance of quality score truncQ.
#                If the read is then shorter than truncLen, it is discarded.
#    Ex: 2
#
### CHIMERA ARGUMENTS ###
#
# 12) chimeraMethod - The method used to remove chimeras. Valid options are:
#               none: No chimera removal is performed.
#               pooled: All reads are pooled prior to chimera detection.
#               consensus: Chimeras are detected in samples individually, and a consensus decision
#                           is made for each sequence variant.
#    Ex: consensus
#
# 13) minParentFold - The minimum abundance of potential "parents" of a sequence being
#               tested as chimeric, expressed as a fold-change versus the abundance of the sequence being
#               tested. Values should be greater than or equal to 1 (i.e. parents should be more
#               abundant than the sequence being tested).
#    Ex: 1.0
#
### SPEED ARGUMENTS ###
#
# 14) nthreads - The number of threads to use.
#                 Special values: 0 - detect available and use all.
#    Ex: 1
#
# 15) nreads_learn - The minimum number of reads to learn the error model from.
#                 Special values: 0 - Use all input reads.
#    Ex: 1000000
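#
### TRACE ARGUMENTS ###
#
# 16) trace_dirF - File path to directory to write the forward dada.map and merge.map
#               files. Directory must already exist.
#
# 17) trace_dirR - File path to directory to write the reverse dada.map and merge.map
#               files. Directory must already exist.
#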

cat(R.version$version.string, "\n")
args <- commandArgs(TRUE)

inp.dirF <- args[[1]]
inp.dirR <- args[[2]]
out.path <- args[[3]]
filtered.dirF <- args[[4]]
filtered.dirR <- args[[5]]
truncLenF <- as.integer(args[[6]])
truncLenR <- as.integer(args[[7]])
trimLeftF <- as.integer(args[[8]])
trimLeftR <- as.integer(args[[9]])
maxEE <- as.numeric(args[[10]])
truncQ <- as.integer(args[[11]])
chimeraMethod <- args[[12]]
minParentFold <- as.numeric(args[[13]])
nthreads <- as.integer(args[[14]])
nreads.learn <- as.integer(args[[15]])
trace.dirF <- args[[16]]
trace.dirR <- args[[17]]
errQuit <- function(mesg, status=1) {
  message("Error: ", mesg)
  q(status=status)
}

### VALIDATE ARGUMENTS ###

# Input directory is expected to contain .fastq.gz file(s)
# that have not yet been filtered and globally trimmed
# to the same length.
if(!(dir.exists(inp.dirF) && dir.exists(inp.dirR))) {
  errQuit("Input directory does not exist.")
} else {
  unfiltsF <- list.files(inp.dirF, pattern=".fastq.gz$", full.names=TRUE)
  unfiltsR <- list.files(inp.dirR, pattern=".fastq.gz$", full.names=TRUE)
  if(length(unfiltsF) == 0) {
    errQuit("No input forward files with the expected filename format found.")
  }
  if(length(unfiltsR) == 0) {
    errQuit("No input reverse files with the expected filename format found.")
  }
  if(length(unfiltsF) != length(unfiltsR)) {
    errQuit("Different numbers of forward and reverse .fastq.gz files.")
  }
}

# Output path is to be a filename (not a directory) and is to be
# removed and replaced if already present.
if(dir.exists(out.path)) {
  errQuit("Output filename is a directory.")
} else if(file.exists(out.path)) {
  invisible(file.remove(out.path))
}

# Convert nthreads to the logical/numeric expected by dada2
if(nthreads < 0) {
  errQuit("nthreads must be non-negative.")
} else if(nthreads == 0) {
  multithread <- TRUE # detect and use all
} else if(nthreads == 1) {
  multithread <- FALSE
} else {
  multithread <- nthreads
}

if(!dir.exists(trace.dirF) || !dir.exists(trace.dirR)) {
  errQuit("Trace directory does not exist.")
}

### LOAD LIBRARIES ###
suppressWarnings(library(methods))
suppressWarnings(library(dada2))
cat("DADA2 R package version:", as.character(packageVersion("dada2")), "\n")

### TRIM AND FILTER ###
cat("1) Filtering ")
filtsF <- file.path(filtered.dirF, basename(unfiltsF))
filtsR <- file.path(filtered.dirR, basename(unfiltsR))
out <- suppressWarnings(filterAndTrim(unfiltsF, filtsF, unfiltsR, filtsR,
                                      truncLen=c(truncLenF, truncLenR), trimLeft=c(trimLeftF, trimLeftR),
                                      maxEE=maxEE, truncQ=truncQ, rm.phix=TRUE, 
                                      multithread=multithread))
cat(ifelse(file.exists(filtsF), ".", "x"), sep="")
filtsF <- list.files(filtered.dirF, pattern=".fastq.gz$", full.names=TRUE)
filtsR <- list.files(filtered.dirR, pattern=".fastq.gz$", full.names=TRUE)
cat("\n")
if(length(filtsF) == 0) { # All reads were filtered out
  errQuit("No reads passed the filter (were truncLenF/R longer than the read lengths?)", status=2)
}

### LEARN ERROR RATES ###
# Dereplicate enough samples to get nreads.learn total reads
cat("2) Learning Error Rates\n")
NREADS <- 0
drpsF <- vector("list", length(filtsF))
drpsR <- vector("list", length(filtsR))
denoisedF <- rep(0, length(filtsF))
getN <- function(x) sum(getUniques(x))
for(i in seq_along(filtsF)) {
  drpsF[[i]] <- derepFastq(filtsF[[i]])
  drpsR[[i]] <- derepFastq(filtsR[[i]])
  NREADS <- NREADS + sum(drpsF[[i]]$uniques)
  if(NREADS > nreads.learn) { break }
}
# Run dada in self-consist mode on those samples
ddsF <- vector("list", length(filtsF))
ddsR <- vector("list", length(filtsR))
if(i==1){
  cat("2a) Forward Reads\n")
  ddsF[[1]] <- dada(drpsF[[1]], err=NULL, selfConsist=TRUE, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1)
  cat("2b) Reverse Reads\n")
  ddsR[[1]] <- dada(drpsR[[1]], err=NULL, selfConsist=TRUE, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1)
} else {
  cat("2a) Forward Reads\n")
  ddsF[1:i] <- dada(drpsF[1:i], err=NULL, selfConsist=TRUE, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1)
  cat("2b) Reverse Reads\n")
  ddsR[1:i] <- dada(drpsR[1:i], err=NULL, selfConsist=TRUE, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1)
}
errF <- ddsF[[1]]$err_out
errR <- ddsR[[1]]$err_out
cat("\n")

### PROCESS ALL SAMPLES ###
# Process samples used to learn error rates
mergers <- vector("list", length(filtsF))
if(i==1) { # breaks list assignment
  mergers[[1]] <- mergePairs(ddsF[[1]], drpsF[[1]], ddsR[[1]], drpsR[[1]])
  denoisedF[[1]] <- getN(ddsF[[1]])
} else {
  mergers[1:i] <- mergePairs(ddsF[1:i], drpsF[1:i], ddsR[1:i], drpsR[1:i])
  denoisedF[1:i] <- sapply(ddsF[1:i], getN)
}

# Loop over rest in streaming fashion with learned error rates
cat("3) Denoise remaining samples ")
if(i < length(filtsF)) {
  for(j in seq(i+1,length(filtsF))) {
    drpsF[[j]] <- derepFastq(filtsF[[j]])
    { sink("/dev/null"); ddsF[[j]] <- dada(drpsF[[j]], err=errF, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1); sink(); }
    drpsR[[j]] <- derepFastq(filtsR[[j]])
    { sink("/dev/null"); ddsR[[j]] <- dada(drpsR[[j]], err=errR, multithread=multithread, VECTORIZED_ALIGNMENT=FALSE, SSE=1); sink(); }
    mergers[[j]] <- mergePairs(ddsF[[j]], drpsF[[j]], ddsR[[j]], drpsR[[j]], maxMismatch=10)
    denoisedF[[j]] <- getN(ddsF[[j]])
    cat(".")
  }
}
cat("\n")

for (j in seq(1, length(filtsF))) {
  map_path <- file.path(trace.dirF, gsub('fastq.gz', 'dada.map', basename(filtsF[[j]])))
  uniques <- getSequences(drpsF[[j]])
  svs <- names(ddsF[[j]]$denoised[unname(ddsF[[j]]$map)])
  write.table(t(rbind(uniques, svs)),
              map_path, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

  map_path <- file.path(trace.dirF, gsub('fastq.gz', 'merge.map', basename(filtsF[[j]])))
  merged <- getSequences(mergers[[j]])
  svs <- names(ddsF[[j]]$denoised[unname(mergers[[j]]$forward)])
  write.table(t(rbind(svs, merged)),
              map_path, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

  map_path <- file.path(trace.dirR, gsub('fastq.gz', 'dada.map', basename(filtsR[[j]])))
  uniques <- getSequences(drpsR[[j]])
  svs <- names(ddsR[[j]]$denoised[unname(ddsR[[j]]$map)])
  write.table(t(rbind(uniques, svs)),
              map_path, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

  map_path <- file.path(trace.dirR, gsub('fastq.gz', 'merge.map', basename(filtsR[[j]])))
  svs <- names(ddsR[[j]]$denoised[unname(mergers[[j]]$reverse)])
  write.table(t(rbind(svs, merged)),
              map_path, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

  rm(uniques); rm(merged); rm(svs)
}
rm(drpsF); rm(drpsR); rm(ddsF); rm(ddsR)

# Make sequence table
seqtab <- makeSequenceTable(mergers)

# Remove chimeras
cat("4) Remove chimeras (method = ", chimeraMethod, ")\n", sep="")
if(chimeraMethod %in% c("pooled", "consensus")) {
  seqtab.nochim <- removeBimeraDenovo(seqtab, method=chimeraMethod, minFoldParentOverAbundance=minParentFold, multithread=multithread)
} else { # No chimera removal, copy seqtab to seqtab.nochim
  seqtab.nochim <- seqtab
}

### REPORT READ COUNTS AT EACH PROCESSING STEP ###
# Handle edge cases: Samples lost in filtering; One sample
track <- cbind(out, matrix(0, nrow=nrow(out), ncol=3))
colnames(track) <- c("input", "filtered", "denoised", "merged", "non-chimeric")
passed.filtering <- track[,"filtered"] > 0
track[passed.filtering,"denoised"] <- denoisedF
track[passed.filtering,"merged"] <- rowSums(seqtab)
track[passed.filtering,"non-chimeric"] <- rowSums(seqtab.nochim)
head(track)
#write.table(track, out.track, sep="\t",
#            row.names=TRUE, col.names=col.names, quote=FALSE)

### WRITE OUTPUT AND QUIT ###
# Formatting as tsv plain-text sequence table
cat("6) Write output\n")
seqtab.nochim <- t(seqtab.nochim) # QIIME has OTUs as rows
col.names <- basename(filtsF)
col.names[[1]] <- paste0("#OTU ID\t", col.names[[1]])
write.table(seqtab.nochim, out.path, sep="\t",
            row.names=TRUE, col.names=col.names, quote=FALSE)
saveRDS(seqtab.nochim, gsub("tsv", "rds", out.path)) ### TESTING

q(status=0)

Overwriting run_traceable_dada_paired.R


```
tmp_forward, tmp_reverse, biom_fp, filt_forward, filt_reverse,
               str(trunc_len_f), str(trunc_len_r),
               str(trim_left_f), str(trim_left_r),
               str(max_ee), str(trunc_q),
               str(chimera_method), str(min_fold_parent_over_abundance),
               str(n_threads), str(n_reads_learn)
               
                   trunc_len_f: int, trunc_len_r: int,
                   trim_left_f: int=0, trim_left_r: int=0,
                   max_ee: float=2.0, trunc_q: int=2,
                   chimera_method: str='consensus',
                   min_fold_parent_over_abundance: float=1.0, n_threads: int=1,
                   n_reads_learn: int=1000000, hashed_feature_ids: bool=True
                   
```

In [None]:
!Rscript run_traceable_dada_paired.R dada_in1 dada_in2 post_dada.tsv dada_tmp1 dada_tmp2 250 250 0 0 Inf 0 none 1 1 1000000 dada_out1 dada_out2

R version 3.4.1 (2017-06-30) 
Loading required package: Rcpp
DADA2 R package version: 1.6.0 
1) Filtering .
2) Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 465723 reads in 238755 unique sequences.
   selfConsist step 2 


### Reconstruct
need
- `FeatureData[Taxonomy]` for expected taxonomy
- `FeatureTable[Frequency]` for abundance
- `FeatureData[Sequence]` for classification

In [55]:
unique_maps = []
for i in (1, 2):
    with open('dada_out%d/post_art%d.merge.map' % (i, i)) as merg_fh:
        reader = csv.reader(merg_fh, csv.excel_tab)
        merg_map = {}
        for dada_sv, merge_sv in reader:
            assert dada_sv not in merg_map
            merg_map[dada_sv] = merge_sv
    with open('dada_out%d/post_art%d.dada.map' % (i, i)) as dada_fh:
        reader = csv.reader(dada_fh, csv.excel_tab)
        unique_map = defaultdict(list)
        for unique, dada_sv in reader:
            if dada_sv in merg_map:
                unique_map[unique].append(merg_map[dada_sv])
    unique_maps.append(unique_map)
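
The cell above composes two maps per read direction: unique read sequence to denoised SV (`dada.map`), and denoised SV to merged SV (`merge.map`). A minimal sketch of that composition on in-memory rows follows; `compose_maps` and the toy sequences are hypothetical, and the row orders are assumed to match the files written by the R script.

```python
from collections import defaultdict


def compose_maps(dada_rows, merge_rows):
    """Map each unique read sequence to the merged SVs it contributes to."""
    merge_map = {}
    for dada_sv, merged_sv in merge_rows:
        # each denoised SV is expected to feed at most one merged SV
        assert dada_sv not in merge_map
        merge_map[dada_sv] = merged_sv
    unique_map = defaultdict(list)
    for unique, dada_sv in dada_rows:
        if dada_sv in merge_map:  # denoised SVs that never merged drop out
            unique_map[unique].append(merge_map[dada_sv])
    return unique_map


# hypothetical rows: AAAT is corrected to AAAA, and CCCC never merges
toy_dada_rows = [('AAAT', 'AAAA'), ('AAAA', 'AAAA'), ('CCCC', 'CCCC')]
toy_merge_rows = [('AAAA', 'AAAAGGGG')]
toy_unique_map = compose_maps(toy_dada_rows, toy_merge_rows)
print(dict(toy_unique_map))
```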

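this is where the magic happens: trace each filtered read through the DADA2 maps to the merged SVs it contributed to, keep only the reads that the forward and reverse directions assign to the same SV, and count the expected taxa of those reads for each SV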

In [56]:
filtered_taxa = []
single_maps = []
for i in (1, 2):
    single_map = defaultdict(set)
    taxa = []
    with skbio.io.open('dada_tmp%d/post_art%d.fastq.gz' % (i, i)) as pa_fh:
        fastq_reader = skbio.io.read(pa_fh, 'fastq', phred_offset=33)
        for j, seq in enumerate(fastq_reader):
            taxa.append(tax_map[seq.metadata['id'][:-4]])
            for sv in unique_maps[i-1][str(seq)]:
                single_map[sv].add(j)
    single_maps.append(single_map)
    filtered_taxa.append(taxa)
assert filtered_taxa[0] == filtered_taxa[1]
filtered_taxa = filtered_taxa[0]
merged_map = {sv: single_maps[0][sv].intersection(single_maps[1][sv]) 
              for sv in single_maps[0]}
merged_map = {sv: Counter(filtered_taxa[i] for i in tlist)
              for sv, tlist in merged_map.items()}
result = {(s, t): c for s in merged_map for t, c in merged_map[s].items()}
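
The core of this step is a set intersection followed by a count. The sketch below shows it on toy data, assuming reads are identified by their position in the filtered FASTQ files; `taxa_per_sv` and the toy values are made up, and `dict.get` stands in for the `defaultdict` lookup used above.

```python
from collections import Counter


def taxa_per_sv(forward_reads, reverse_reads, read_taxa):
    """Count taxa per SV over the reads that both directions assign to it."""
    result = {}
    for sv, fwd in forward_reads.items():
        # a read supports an SV only if forward and reverse mates agree
        shared = fwd & reverse_reads.get(sv, set())
        for taxon, count in Counter(read_taxa[i] for i in shared).items():
            result[(sv, taxon)] = count
    return result


# hypothetical read indices per SV, and the taxon each read was simulated from
toy_forward = {'svA': {0, 1, 2}, 'svB': {3}}
toy_reverse = {'svA': {1, 2, 3}, 'svB': {3}}
toy_taxa = ['g__X', 'g__X', 'g__Y', 'g__Y']
toy_result = taxa_per_sv(toy_forward, toy_reverse, toy_taxa)
print(toy_result)
```

Read 0 reaches `svA` only in the forward direction and read 3 reaches it only in the reverse direction, so neither counts towards `svA`.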

In [57]:
abundance = {}
with open('post_dada.tsv') as pd_fh:
    dada_reader = csv.reader(pd_fh, csv.excel_tab)
    next(dada_reader)  # skip the header row
    for sv, count in dada_reader:
        abundance[sv] = int(count)

In [58]:
check = Counter()
for (sv, taxon), count in result.items():
    check[sv] += count
for sv, count in check.items():
    assert abundance[sv] == count, '%s %d %d' % (sv, count, abundance[sv])

In [77]:
flattened = [(s, t, c) for (s, t), c in result.items()]
svs, taxa, abundances = zip(*flattened)
hashes = [hashlib.md5((s+t).encode('utf-8')).hexdigest() for s, t in result]
expected = DataFrame({'Taxon': taxa}, index=hashes, columns=['Taxon'])
expected.index.name = 'Feature ID'
expected = Artifact.import_data('FeatureData[Taxonomy]', expected)
expected.save('expected.qza')
abundanced = DataFrame({h: a for h, a in zip(hashes, abundances)}, index=['sample-name'],
                       columns=hashes)
abundanced = Artifact.import_data('FeatureTable[Frequency]', abundanced)
abundanced.save('frequencies.qza')
sequences = Series(svs, index=hashes)
sequences = Artifact.import_data('FeatureData[Sequence]', sequences)
sequences.save('sequences.qza')

'sequences.qza'
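
Feature IDs here are MD5 digests of the SV sequence concatenated with the taxon, so the same sequence observed under two different taxa yields two distinct features. A small illustration, with a hypothetical helper name and toy inputs:

```python
import hashlib


def feature_id(sequence, taxon):
    """Hash a (sequence, taxon) pair into a hex feature ID."""
    return hashlib.md5((sequence + taxon).encode('utf-8')).hexdigest()


# hypothetical pair of features sharing one sequence
toy_feature_ids = [feature_id('ACGT', 'g__X'), feature_id('ACGT', 'g__Y')]
print(toy_feature_ids)
```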

### To Do
Ok, so we can generate a single sample. Now we have to write a script that we can run on a cluster that will do it k-foldwise. For each fold we will need
- `FeatureData[Taxonomy]` to train the classifier
- `FeatureTable[RelativeFrequency]` for the weights used to train the classifier
- `FeatureData[Sequence]` to train the classifier
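
How the folds will actually be defined is not settled in this notebook. One possible sketch is a shuffled split of sample IDs into k folds using only the standard library; `k_fold_split`, the seed, and the toy IDs are assumptions.

```python
import random


def k_fold_split(sample_ids, k, seed=0):
    """Yield (train, test) lists of sample IDs for each of k folds."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the folds are reproducible
    folds = [ids[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# hypothetical sample IDs
toy_samples = ['s%d' % i for i in range(10)]
toy_folds = list(k_fold_split(toy_samples, k=5))
for train, test in toy_folds:
    print(len(train), len(test))
```

Each sample lands in exactly one test fold, so the per-fold weights and training data could be built from `train` and evaluated on `test`.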