# Mock community dataset generation
This notebook describes how mock community datasets were retrieved and how files were generated for tax-credit comparisons. Only the feature tables, metadata maps, representative sequences, and expected taxonomies are included in tax-credit, but this notebook can regenerate intermediate files, generate these files for new mock communities, or be tweaked to benchmark other steps, e.g., quality control or OTU picking methods.

All mock communities are hosted on [mockrobiota](http://caporasolab.us/mockrobiota/), though raw reads are deposited elsewhere. To use these mock communities, clone the ``mockrobiota`` repository into the ``repo_dir`` that contains the tax-credit repository.

In [1]:
from tax_credit.process_mocks import *

from os import path, makedirs, remove, rename
from os.path import expandvars, exists, basename, splitext, dirname, join, isfile
from shutil import copyfile
import biom

import qiime
from qiime.plugins import feature_table, demux, dada2, alignment, phylogeny


Set source/destination filepaths

In [2]:
# base directory containing tax-credit and mockrobiota repositories
repo_dir = expandvars("$HOME/Desktop/projects/")
# tax-credit directory
project_dir = join(repo_dir, "short-read-tax-assignment")
# mockrobiota directory
mockrobiota_dir = join(repo_dir, "mockrobiota")
# temp destination for mock community files
mock_data_dir = join(repo_dir, "mock-community")
# destination for expected taxonomy assignments
expected_data_dir = join(project_dir, "data", "precomputed-results", "mock-community")
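
Because every later cell depends on these paths, it can help to confirm the layout before downloading anything. The following is a minimal sketch, not part of tax-credit: the helper name `prepare_dirs` is hypothetical, and it assumes that both repositories must already be cloned into `repo_dir` while the temporary `mock-community` directory may be created on demand.

```python
import tempfile
from os import makedirs
from os.path import exists, join


def prepare_dirs(repo_dir,
                 required=("short-read-tax-assignment", "mockrobiota"),
                 create=("mock-community",)):
    """Check that required repositories exist and create output directories.

    Hypothetical helper: raises IOError if a required repository is missing.
    """
    paths = {}
    for name in required:
        fp = join(repo_dir, name)
        if not exists(fp):
            raise IOError("Expected repository not found: {0}".format(fp))
        paths[name] = fp
    for name in create:
        fp = join(repo_dir, name)
        if not exists(fp):
            makedirs(fp)
        paths[name] = fp
    return paths


# Demonstration in a throwaway directory standing in for repo_dir
demo_repo_dir = tempfile.mkdtemp()
for name in ("short-read-tax-assignment", "mockrobiota"):
    makedirs(join(demo_repo_dir, name))
demo_paths = prepare_dirs(demo_repo_dir)
```

In the notebook itself, the call would be `prepare_dirs(repo_dir)` with the `repo_dir` defined above.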


First, we define which mock communities we plan to use, along with the necessary parameters.

In [3]:
# We will just use a sequential set of mockrobiota datasets, otherwise list community names manually
communities = ['mock-{0}'.format(n) for n in range(1,11)]

# Create dictionary of mock community dataset metadata
community_metadata = extract_mockrobiota_dataset_metadata(mockrobiota_dir, communities)

# Map marker-gene to reference database names in tax-credit and in mockrobiota
#           marker-gene  tax-credit-dir  mockrobiota-dir version
reference_dbs = {'16S' : ('gg_13_8_otus', 'greengenes', '13_8'),
                 'ITS' : ('unite-97-rep-set', 'unite', '97')
                }
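
As a quick check on the structures defined above, the sketch below rebuilds the community list and unpacks one `reference_dbs` entry. The helper `describe_reference` and the dictionary keys it returns are hypothetical names chosen for illustration; only the tuple order (tax-credit directory, mockrobiota directory, version) comes from the comment in the cell above.

```python
# Same construction as in the notebook: mock-1 through mock-10
communities = ['mock-{0}'.format(n) for n in range(1, 11)]

reference_dbs = {'16S': ('gg_13_8_otus', 'greengenes', '13_8'),
                 'ITS': ('unite-97-rep-set', 'unite', '97')}


def describe_reference(marker_gene, reference_dbs):
    """Unpack the (tax-credit dir, mockrobiota dir, version) tuple.

    Hypothetical helper: raises ValueError for an unconfigured marker gene.
    """
    try:
        tax_credit_dir, mockrobiota_name, version = reference_dbs[marker_gene]
    except KeyError:
        raise ValueError(
            "No reference database configured for {0!r}".format(marker_gene))
    return {'tax_credit_dir': tax_credit_dir,
            'mockrobiota_dir': mockrobiota_name,
            'version': version}


its_reference = describe_reference('ITS', reference_dbs)
```

Failing early on an unknown marker gene is a design choice here; a community whose marker gene is missing from `reference_dbs` would otherwise only surface as a `KeyError` deep inside the extraction step.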

Now we will generate data directories in ``tax-credit`` for each community and begin populating these with files from ``mockrobiota``. This may take some time, as it involves downloading raw fastq files.

In [4]:
extract_mockrobiota_data(communities, community_metadata, reference_dbs, 
                         mockrobiota_dir, mock_data_dir, 
                         expected_data_dir)

## Process data in QIIME2
Finally, we can get to processing our data. We begin by importing our data, demultiplexing, and viewing a few fastq quality summaries to decide how to trim our raw reads prior to processing.

In [45]:
for community in communities[2:3]:
    # extract dataset metadata/params
    community_dir = join(mock_data_dir, community)
    marker_gene = community_metadata[community][2]
    forward_read_fp = join(community_dir,'mock-forward-read.fastq.gz')
    index_read_fp = join(community_dir,'mock-index-read.fastq.gz')
    sample_metadata = join(community_dir, 'sample-metadata.tsv')
    
    # import fastq to qiime artifact
    #qiime tools import --type  --input-path $raw_dir --output-path $projectdir/raw-sequences.qza
    forward_read = qiime.Artifact.import_data("RawSequences", forward_read_fp)
    index_read = qiime.Artifact.import_data("RawSequences", index_read_fp)

    # demultiplex / QC
    #qiime demux emp --i-seqs $projectdir/raw-sequences.qza --m-barcodes-file $demuxmap --m-barcodes-category BarcodeSequence --o-per-sample-sequences $projectdir/demux --p-rev-comp-barcodes
    demux_seqs = demux.methods.emp(seqs = forward_read,
                                   barcodes_file = index_read,
                                   barcodes_category = 'BarcodeSequence',
                                   per_sample_sequences = join(community_dir, 'demux-seqs.qza'),
                                   rev_comp_barcodes = True)
    
    # demultiplexing summary
    #qiime demux summarize --i-data $projectdir/demux.qza --o-visualization $projectdir/demux-summary
    demux_summary = demux.methods.summarize(data = demux_seqs,
                                            visualization = join(community_dir, 
                                                                 'demux_summary.qzv')
                                           )

    # view fastq quality plots
    #qiime dada2 plot-qualities --i-demultiplexed-seqs $projectdir/demux.qza --o-visualization $projectdir/demux-qual-plots --p-n 5
    demux_plot_qual = dada2.methods.plot_qualities(demultiplexed_seqs = demux_seqs,
                                                   visualization = join(community_dir,
                                                                        'demux_plot_qual.qzv'),
                                                   n = 1)

ValueError: Importing 'EMPMultiplexedDirFmt' requires a directory, not /Users/nbokulich/Desktop/projects/mock-community/mock-3/mock-forward-read.fastq.gz
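
The error above indicates that the import expects a directory rather than a single fastq file. One possible workaround is to stage the forward and index reads in their own directory before importing. The sketch below uses only the standard library; the helper `stage_emp_dir` is hypothetical, and the target filenames `sequences.fastq.gz` and `barcodes.fastq.gz` are an assumption based on later QIIME 2 EMP formats, so they should be checked against the installed release.

```python
import gzip
import tempfile
from os import listdir, makedirs
from os.path import exists, join
from shutil import copyfile


def stage_emp_dir(forward_read_fp, index_read_fp, dest_dir,
                  seqs_name='sequences.fastq.gz',
                  barcodes_name='barcodes.fastq.gz'):
    """Copy forward and index reads into one directory for import.

    Hypothetical helper; the default filenames are assumptions.
    """
    if not exists(dest_dir):
        makedirs(dest_dir)
    copyfile(forward_read_fp, join(dest_dir, seqs_name))
    copyfile(index_read_fp, join(dest_dir, barcodes_name))
    return dest_dir


# Demonstration with two tiny placeholder fastq files
demo_dir = tempfile.mkdtemp()
for name in ('mock-forward-read.fastq.gz', 'mock-index-read.fastq.gz'):
    with gzip.open(join(demo_dir, name), 'wb') as f:
        f.write(b'@read1\nACGT\n+\nIIII\n')

emp_dir = stage_emp_dir(join(demo_dir, 'mock-forward-read.fastq.gz'),
                        join(demo_dir, 'mock-index-read.fastq.gz'),
                        join(demo_dir, 'emp-reads'))
staged = sorted(listdir(emp_dir))
```

The returned directory path, rather than the fastq file path, would then be the value passed to the import call.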

In [None]:
for community in communities:
    tools.methods.view(join(mock_data_dir, community, 'demux_summary.qzv'))

In [None]:
for community in communities:
    tools.methods.view(join(mock_data_dir, community, 'demux_plot_qual.qzv'))   

Use the quality data above to decide how to proceed. As each dataset will have different quality profiles and read lengths, we will enter trimming parameters as a dictionary.

In [None]:
# {community : (trim_left, trunc_len)}
# Fill in each tuple from the quality plots above; `left` and `right` are placeholders.
trim_params = {'mock-1' : (left, right),
               'mock-2' : (),
               'mock-3' : (),
               'mock-4' : (),
               'mock-5' : (),
               'mock-6' : (),
               'mock-7' : (),
               'mock-8' : (),
               'mock-9' : (),
               'mock-10' : ()
              }
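
The notebook chooses these values by eye from the quality plots. As a rough illustration of the reasoning, the sketch below converts fastq quality strings to Phred scores and suggests a `trunc_len` at the first position where the mean score drops below a threshold. This is not the method used by ``dada2`` or by this notebook: the function names, the Phred+33 offset, the threshold of 25, and the example quality strings are all assumptions for demonstration.

```python
def phred_scores(quality_line, offset=33):
    """Convert one fastq quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def suggest_trunc_len(quality_lines, min_mean_quality=25, offset=33):
    """Return the count of leading positions with mean quality >= threshold.

    Hypothetical helper: reads shorter than a position are ignored there.
    """
    if not quality_lines:
        return 0
    max_len = max(len(q) for q in quality_lines)
    for pos in range(max_len):
        scores = [ord(q[pos]) - offset for q in quality_lines if len(q) > pos]
        if sum(scores) / float(len(scores)) < min_mean_quality:
            return pos
    return max_len


# Made-up quality strings: 'I' is Q40, '5' is Q20, '#' is Q2 (Phred+33)
example_quality_lines = ['IIIII5##', 'IIIII5#', 'IIII55##']
suggested = suggest_trunc_len(example_quality_lines)

# One way the result could feed an entry of the form (trim_left, trunc_len)
example_trim_params = {'mock-1': (0, suggested)}
```

A fixed threshold is a blunt rule; inspecting the plots for each dataset remains the approach described in this notebook.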

Now we will quality filter with ``dada2``.

In [None]:
for community in communities:
    community_dir = join(mock_data_dir, community)
    
    # dada2
    #qiime dada2 denoise --i-demultiplexed-seqs $projectdir/demux.qza --p-trim-left 10 --p-trunc-len 90 --o-representative-sequences $projectdir/rep-seqs --o-table $projectdir/table
    dada2.methods.denoise(demultiplexed_seqs = join(community_dir, 'demux-seqs.qza'),
                          trim_left = trim_params[community][0],
                          trunc_len = trim_params[community][1],
                          representative_sequences = join(community_dir, 'rep_seqs.qza'),
                          table = join(community_dir, 'feature_table.qza')
                         )

    # summarize feature table
    #qiime feature-table summarize --i-table $projectdir/table.qza --o-visualization $projectdir/table
    feature_table.methods.summarize(table = join(community_dir, 'feature_table.qza'),
                                    visualization = join(community_dir, 'feature_table_summary.qzv')
                                   )

In [None]:
for community in communities:
    tools.methods.view(join(mock_data_dir, community, 'feature_table_summary.qzv'))   

Finally, build a phylogeny from the representative sequences.

In [None]:
for community in communities:
    community_dir = join(mock_data_dir, community)
    
    # Build phylogeny
    #qiime alignment mafft --i-sequences $projectdir/rep-seqs.qza --o-alignment $projectdir/aligned-rep-seqs
    aligned_seqs = alignment.methods.mafft(join(community_dir, 'rep_seqs.qza'))
    
    #qiime alignment mask --i-alignment $projectdir/aligned-rep-seqs.qza --o-masked-alignment $projectdir/masked-aligned-rep-seqs
    masked_alignment = alignment.methods.mask(aligned_seqs)

    #qiime phylogeny fasttree --i-alignment $projectdir/masked-aligned-rep-seqs.qza --o-tree $projectdir/unrooted-tree
    unrooted_tree = phylogeny.methods.fasttree(masked_alignment)
    
    #qiime phylogeny midpoint-root --i-tree $projectdir/unrooted-tree.qza --o-rooted-tree $projectdir/rooted-tree
    tree = phylogeny.methods.midpoint_root(tree = unrooted_tree,
                                          rooted_tree = join(community_dir, 'phylogeny.qza'))

## Extract results and move to repo

Mock community data: feature_table, sample_metadata, rep_seqs, tree

In [None]:
for community in communities:
    community_dir = join(mock_data_dir, community)

    # Define destination for this community's files in the repo
    repo_destination = join(project_dir, "data", "mock-community")
    community_destination = join(repo_destination, community)
    if not exists(community_destination):
        makedirs(community_destination)

    # Files to move
    rep_seqs_fp = join(community_dir, 'rep_seqs.qza')
    feature_table_fp = join(community_dir, 'feature_table.qza')
    tree_fp = join(community_dir, 'phylogeny.qza')
    sample_md_fp = join(community_dir, 'sample-metadata.tsv')
    biom_table_fp = join(community_dir, 'feature_table.biom')

    # Extract feature table to biom
    # (assumes qiime.Artifact.load is available in the installed release;
    # the *_fp names avoid shadowing the imported feature_table plugin)
    biom_table = qiime.Artifact.load(feature_table_fp).view(biom.Table)
    write_biom_table(biom_table, 'hdf5', biom_table_fp)

    # TODO: extract rep_seqs to fasta
    # Move to repo:
    for fp in [rep_seqs_fp, feature_table_fp, tree_fp, sample_md_fp, biom_table_fp]:
        copyfile(fp, join(community_destination, basename(fp)))
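
The copy step can fail if a destination directory does not exist yet or if an upstream step did not produce one of the files. The sketch below isolates that step using only the standard library. The helper `copy_community_files` is hypothetical, and reporting missing files instead of raising is a design choice made for illustration.

```python
import tempfile
from os import makedirs
from os.path import basename, exists, isfile, join
from shutil import copyfile


def copy_community_files(filepaths, repo_destination, community):
    """Copy existing files into repo_destination/community.

    Hypothetical helper: returns (copied basenames, missing filepaths).
    """
    dest_dir = join(repo_destination, community)
    if not exists(dest_dir):
        makedirs(dest_dir)
    copied, missing = [], []
    for fp in filepaths:
        if isfile(fp):
            copyfile(fp, join(dest_dir, basename(fp)))
            copied.append(basename(fp))
        else:
            missing.append(fp)
    return copied, missing


# Demonstration: one file exists, one was never generated
demo_src = tempfile.mkdtemp()
demo_dest = tempfile.mkdtemp()
with open(join(demo_src, 'sample-metadata.tsv'), 'w') as f:
    f.write('#SampleID\tBarcodeSequence\n')

copied, missing = copy_community_files(
    [join(demo_src, 'sample-metadata.tsv'), join(demo_src, 'rep_seqs.qza')],
    demo_dest, 'mock-1')
```

Collecting the missing paths makes it easy to see which communities still need upstream processing before their results are moved.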
        


In [None]:
# List databases as fasta/taxonomy file pairs
databases = {'B1-REF': [expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/rep_set/97_otus.fasta"), 
             expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt"),
             "gg_13_8_otus", "GTGCCAGCMGCCGCGGTAA", "ATTAGAWACCCBDGTAGTCC", "515F", "806R"],
             'F1-REF': [expandvars("$HOME/Desktop/ref_dbs/unite-97-rep-set/97_otus.txt"), 
             expandvars("$HOME/Desktop/ref_dbs/unite-97-rep-set/97_otu_taxonomy.txt"), 
             "unite-97-rep-set", "ACCTGCGGARGGATCA", "AACTTTYARCAAYGGAT", "BITSf", "B58S3r"]
            }
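
Each entry above is a positional list, which is easy to misindex. A minimal sketch of one alternative wraps each list in a `namedtuple`. The field names are assumptions inferred from the values (two filepaths, a database name, two primer sequences, two primer names) and are not defined anywhere in tax-credit.

```python
from collections import namedtuple
from os.path import expandvars

# Hypothetical field names, inferred from the order of values in each list
ReferenceDB = namedtuple('ReferenceDB', ['fasta_fp', 'taxonomy_fp', 'name',
                                         'fwd_primer', 'rev_primer',
                                         'fwd_primer_name', 'rev_primer_name'])

# One entry copied from the dictionary above
databases = {'B1-REF': [expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/rep_set/97_otus.fasta"),
                        expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt"),
                        "gg_13_8_otus", "GTGCCAGCMGCCGCGGTAA", "ATTAGAWACCCBDGTAGTCC",
                        "515F", "806R"]}

# Convert each positional list to a record with named fields
named_databases = {key: ReferenceDB(*values) for key, values in databases.items()}

b1 = named_databases['B1-REF']
primer_summary = '{0} ({1}) / {2} ({3})'.format(b1.fwd_primer_name, b1.fwd_primer,
                                                b1.rev_primer_name, b1.rev_primer)
```

Since a `namedtuple` still supports positional indexing, code that reads `databases[key][0]` keeps working after the conversion.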