# Novel-taxa and simulated community generation

This notebook describes the generation of reference databases for both novel-taxa and simulated community analyses. Novel-taxa analysis is a form of cross-validated taxonomic classification, wherein random unique sequences are sampled from the reference database as a test set; all sequences sharing taxonomic affiliation at a given taxonomic level are removed from the reference database (training set); and taxonomy is assigned to the query sequences at the given taxonomic level. Thus, this test interrogates the behavior of a taxonomy classifier when challenged with "novel" sequences that are not represented by close matches within the reference sequence database. Such an analysis is performed to assess the degree to which "overassignment" occurs for sequences that are not represented in a reference database.

Simulated community analysis represents more conventional cross-validated classification, wherein unique sequences are randomly sampled from a reference dataset and used as a test set for taxonomic classification, using a training set that has those sequences removed, but not other sequences that share taxonomic affiliation. Instead, the training set must contain identical taxonomies to those represented by the test sequences.
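The difference between the two splits can be sketched on a toy taxonomy map. This is a minimal illustration, not the `tax_credit` implementation: the function names, the `taxonomy_map` structure (sequence ID mapped to a `;`-delimited lineage), and the toy data are all assumptions, and the sketch ignores the branching requirement defined later in this notebook.

```python
import random


def taxon_at_level(taxonomy, level):
    """Truncate a ';'-delimited lineage at `level` (0 = kingdom ... 6 = species)."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    return ";".join(ranks[: level + 1])


def novel_taxa_split(taxonomy_map, level, n_queries, seed=0):
    """Sample queries, then drop every reference sequence that shares a
    query's taxon at `level` (hypothetical helper)."""
    rng = random.Random(seed)
    queries = rng.sample(sorted(taxonomy_map), n_queries)
    excluded = {taxon_at_level(taxonomy_map[q], level) for q in queries}
    ref = {sid: tax for sid, tax in taxonomy_map.items()
           if taxon_at_level(tax, level) not in excluded}
    return queries, ref


def simulated_community_split(taxonomy_map, n_queries, seed=0):
    """Sample queries and remove only those sequences, so each query's
    taxonomy stays represented in the reference (hypothetical helper)."""
    rng = random.Random(seed)
    by_taxonomy = {}
    for seq_id in sorted(taxonomy_map):
        by_taxonomy.setdefault(taxonomy_map[seq_id], []).append(seq_id)
    # Only taxonomies with at least two sequences can lose one and remain.
    eligible = [ids for ids in by_taxonomy.values() if len(ids) > 1]
    queries = [rng.choice(ids) for ids in rng.sample(eligible, n_queries)]
    ref = {sid: tax for sid, tax in taxonomy_map.items() if sid not in queries}
    return queries, ref


# Toy data (invented for illustration only).
toy = {
    "s1": "k__A;p__B;c__C",
    "s2": "k__A;p__B;c__C",
    "s3": "k__A;p__B;c__D",
    "s4": "k__A;p__E;c__F",
}
novel_queries, novel_ref = novel_taxa_split(toy, level=2, n_queries=1)
sim_queries, sim_ref = simulated_community_split(toy, n_queries=1)
```

In the novel-taxa split, the whole class of the sampled query disappears from the reference; in the simulated community split, only the sampled sequence does.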

The general framework for generating a modified reference database for this analysis consists of:

1) Novel-taxa reference database generation: Remove empty taxa from ref dbs, split into query/reference subsets.

2) Assign taxonomy to the "novel" query sequences, using the trimmed reference db with which each query set is paired.

3) Measure rates of classification accuracy for each assignment method.

## Environment
The first step is to create a conda environment with the necessary dependencies. This requires installing [miniconda 3](http://conda.pydata.org/miniconda.html) to manage parallel Python environments. After miniconda (or another conda version) is installed, proceed with [installing QIIME 2](https://docs.qiime2.org/2.0.6/install/).


## Definitions
* ``source`` = original reference database.
* ``REF`` = ``source`` - ``novel`` seqs, used for taxonomy assignment.
* ``QUERY`` = 'novel' query sequences randomly drawn from ``source``. 
* ``L`` = taxonomic level being tested
    * 0 = kingdom, 1 = phylum, 2 = class, 3 = order, 4 = family, 5 = genus, 6 = species
* ``branching`` = describes a taxon at level ``L`` that "branches" into two or more lineages at ``L + 1``. 
    * A "branched" taxon, then, is one of these lineages. For example, in the tree below, Lactobacillaceae, Lactobacillus, and Pediococcus are branching, while Paralactobacillus is unbranching. The Lactobacillus and Pediococcus species are "branched"; Paralactobacillus selangorensis is "unbranched".

```
Lactobacillaceae
           ├── Lactobacillus
           │         ├── Lactobacillus brevis
           │         └── Lactobacillus sanfranciscensis
           ├── Pediococcus
           │         ├── Pediococcus damnosus
           │         └── Pediococcus claussenii
           └── Paralactobacillus
                     └── Paralactobacillus selangorensis
```
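The branching definition can be checked against the tree above with a short sketch. The function name and the lineage strings are assumptions made for illustration; levels here are relative to the toy tree (0 = family, 1 = genus, 2 = species) rather than the kingdom-to-species numbering used elsewhere in this notebook.

```python
from collections import defaultdict


def branching_taxa(lineages, level):
    """Return names of taxa at `level` that split into two or more
    distinct lineages at `level + 1` (hypothetical helper)."""
    children = defaultdict(set)
    for lineage in lineages:
        ranks = [rank.strip() for rank in lineage.split(";")]
        if len(ranks) > level + 1:
            # Key on the full parent lineage so equal names under
            # different parents are counted separately.
            children[tuple(ranks[: level + 1])].add(ranks[level + 1])
    return {parent[-1] for parent, kids in children.items() if len(kids) >= 2}


# The example tree, written as family;genus;species lineages.
lineages = [
    "Lactobacillaceae;Lactobacillus;Lactobacillus brevis",
    "Lactobacillaceae;Lactobacillus;Lactobacillus sanfranciscensis",
    "Lactobacillaceae;Pediococcus;Pediococcus damnosus",
    "Lactobacillaceae;Pediococcus;Pediococcus claussenii",
    "Lactobacillaceae;Paralactobacillus;Paralactobacillus selangorensis",
]

branching_families = branching_taxa(lineages, 0)
branching_genera = branching_taxa(lineages, 1)
```

Paralactobacillus is absent from `branching_genera` because it has a single species, which is why Paralactobacillus selangorensis counts as "unbranched".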

# Novel-taxa reference data set generation

This section describes the preparation of the data sets necessary for "novel taxa" analysis. The goals of this step are:
1. Create a "clean" reference database that can be used for evaluation of "novel taxa" from phylum to species level.
2. Generate simulated amplicons and randomly subsample query sequences to use as "novel taxa"
3. Create modified sequence reference databases for taxonomic classification of "novel taxa" sequences
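Goal 2 can be illustrated with a small sketch of amplicon simulation. It is not the method used by `generate_simulated_datasets`: it assumes exact matching of degenerate primers via regular expressions (no mismatches allowed), assumes both primers are written as they appear on the reference strand, and uses hypothetical function names and an invented template sequence.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Translate a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the primers (primers excluded),
    or None if either primer site is missing."""
    fwd = re.search(primer_to_regex(fwd_primer), seq)
    if fwd is None:
        return None
    rev = re.search(primer_to_regex(rev_primer), seq[fwd.end():])
    if rev is None:
        return None
    return seq[fwd.end(): fwd.end() + rev.start()]


def simulate_read(amplicon, read_length):
    """Trim an amplicon to at most `read_length` bases."""
    return amplicon[:read_length]


# Invented template: flank + 515f site + insert + 806r site + flank.
template = ("TTT" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGT"
            + "ATTAGATACCCTGGTAGTCC" + "GGG")
amplicon = extract_amplicon(template, "GTGCCAGCMGCCGCGGTAA",
                            "ATTAGAWACCCBDGTAGTCC")
read = simulate_read(amplicon, 4)
```

Sequences for which `extract_amplicon` returns `None` would be dropped in this sketch, which is one plausible reason for an amplicon count lower than the clean sequence count.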

In this first cell, we describe the characteristics of each data set/database as a dictionary. The dataset name is the key, and the value is a list containing, in order: reference sequence fasta, taxonomy, database name, forward primer sequence, reverse primer sequence, forward primer name, and reverse primer name.

MODIFY these values to generate novel-taxa files for a new reference database.

In [2]:
from tax_credit.taxa_manipulator import *
from tax_credit.framework_functions import *

from os import path, makedirs, remove, rename
from os.path import expandvars, exists, basename, splitext, dirname, join, isfile
from collections import OrderedDict
import pandas as pd
from skbio.util import create_dir
from skbio.alignment import global_pairwise_align_nucleotide, make_identity_substitution_matrix, local_pairwise_align_ssw
from skbio import io, DNA
from shutil import copyfile
from itertools import product


In [3]:
project_dir = expandvars("$HOME/Desktop/projects/short-read-tax-assignment")
data_dir = join(project_dir, "data")

# List databases as fasta/taxonomy file pairs
databases = {'B1-REF': [expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/rep_set/99_otus.fasta"), 
             expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt"),
             "gg_13_8_otus", "GTGCCAGCMGCCGCGGTAA", "ATTAGAWACCCBDGTAGTCC", "515f", "806r"],
             'F1-REF': [expandvars("$HOME/Desktop/ref_dbs/sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev.fasta"), 
             expandvars("$HOME/Desktop/ref_dbs/sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev.txt"), 
             "unite_20.11.2016", "ACCTGCGGARGGATCA", "GAGATCCRTTGYTRAAAGTT", "BITSf", "B58S3r"]
            }

Now we will import these to a dataframe and view it. You should not need to modify the following cell.

In [4]:
# Arrange data set / database info in data frame
simulated_community_definitions = pd.DataFrame.from_dict(databases, orient="index")
simulated_community_definitions.columns = ["Reference file path", "Reference tax path", "Reference id", 
                                           "Fwd primer", "Rev primer", "Fwd primer id", "Rev primer id"]
simulated_community_definitions

| | Reference file path | Reference tax path | Reference id | Fwd primer | Rev primer | Fwd primer id | Rev primer id |
|---|---|---|---|---|---|---|---|
| B1-REF | /Users/nbokulich/Desktop/ref_dbs/gg_13_8_otus/... | /Users/nbokulich/Desktop/ref_dbs/gg_13_8_otus/... | gg_13_8_otus | GTGCCAGCMGCCGCGGTAA | ATTAGAWACCCBDGTAGTCC | 515f | 806r |
| F1-REF | /Users/nbokulich/Desktop/ref_dbs/sh_qiime_rele... | /Users/nbokulich/Desktop/ref_dbs/sh_qiime_rele... | unite_20.11.2016 | ACCTGCGGARGGATCA | GAGATCCRTTGYTRAAAGTT | BITSf | B58S3r |


Next, generate a "clean" reference taxonomy and sequence database by removing taxonomy strings with empty or ambiguous levels.
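As a rough sketch of what such a filter might look like (this is not the `tax_credit` code), the check below assumes `;`-delimited lineages with `k__`-style rank prefixes and seven levels. The function name and the list of "ambiguous" terms are assumptions chosen for illustration.

```python
# Hypothetical list of labels treated as ambiguous in this sketch.
AMBIGUOUS_TERMS = ("unidentified", "uncultured", "unculturable",
                   "incertae sedis")


def is_clean(taxonomy, n_levels=7, ambiguous=AMBIGUOUS_TERMS):
    """Return True if every level from kingdom to species carries a
    non-empty, unambiguous name."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    if len(ranks) < n_levels:
        return False
    for rank in ranks[:n_levels]:
        # Drop the 'k__' / 'p__' style prefix, if present.
        name = rank.split("__", 1)[-1].strip()
        if not name:
            return False
        normalized = name.lower().replace("_", " ")
        if any(term in normalized for term in ambiguous):
            return False
    return True


complete = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
            "f__Lactobacillaceae; g__Lactobacillus; s__brevis")
empty_species = complete.replace("s__brevis", "s__")
ambiguous_class = ("k__Fungi;p__Ascomycota;c__unidentified;o__unidentified;"
                   "f__unidentified;g__unidentified;s__unidentified")

results = [is_clean(complete), is_clean(empty_species),
           is_clean(ambiguous_class)]
```

Only the first lineage passes; the second has an empty species level and the third has ambiguous labels below phylum.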

Set the simulated community parameters, including amplicon length (`read_length`) and the number of iterations to perform (`iterations`). The number of iterations determines how many chunks (N) the query sequence files are split into.

This will take a few minutes to run. Get some coffee.

In [5]:
read_length = 250
iterations = 3
generate_simulated_datasets(simulated_community_definitions, data_dir, read_length, iterations)


B1-REF Sequence Counts
Raw Fasta:            203452.0
Clean Fasta:          20745.0
Simulated Amplicons:  20694.0
Simulated Reads:      20692.0
B1-REF level 6 contains 3144 unique and 2120 branched taxa              
B1-REF level 5 contains 1456 unique and 1358 branched taxa              
B1-REF level 4 contains 277 unique and 202 branched taxa              
B1-REF level 3 contains 123 unique and 78 branched taxa              
B1-REF level 2 contains 65 unique and 49 branched taxa              
B1-REF level 1 contains 28 unique and 28 branched taxa              
F1-REF Sequence Counts
Raw Fasta:            28856.0
Clean Fasta:          18245.0
Simulated Amplicons:  11736.0
Simulated Reads:      8685.0
F1-REF level 6 contains 5472 unique and 4853 branched taxa              
F1-REF level 5 contains 1310 unique and 1184 branched taxa              
F1-REF level 4 contains 323 unique and 257 branched taxa              
F1-REF level 3 contains 117 unique and 98 branched taxa              
F1

For peace of mind, we can test our novel taxa and simulated community datasets to confirm that:

1) For simulated communities, test (query) taxa IDs are not in the training (ref) set, but all of their taxonomy strings are.

2) For novel taxa, test taxa IDs and taxonomies are not in the training (ref) set, but sister branch taxa are.

If no errors print, all tests pass.

In [6]:
test_simulated_communities(simulated_community_definitions, data_dir, iterations)

As a sanity check, confirm that novel taxa were generated successfully.

In [7]:
test_novel_taxa_datasets(simulated_community_definitions, data_dir, iterations)
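
The two conditions checked by these tests can be expressed as set operations. The sketch below is an illustration under stated assumptions, not the body of `test_simulated_communities` or `test_novel_taxa_datasets`: the function names are hypothetical, and each data set is assumed to be a dict mapping sequence ID to a `;`-delimited taxonomy string.

```python
def lineage_at(taxonomy, level):
    """Truncate a ';'-delimited lineage at `level` (0 = kingdom)."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    return ";".join(ranks[: level + 1])


def check_simulated_community(query_tax, ref_tax):
    """Query IDs must be absent from ref; query taxonomies must be present."""
    shared_ids = set(query_tax) & set(ref_tax)
    missing_taxa = set(query_tax.values()) - set(ref_tax.values())
    assert not shared_ids, "query IDs found in ref: %s" % sorted(shared_ids)
    assert not missing_taxa, "taxa missing from ref: %s" % sorted(missing_taxa)


def check_novel_taxa(query_tax, ref_tax, level):
    """Query IDs and taxa at `level` must be absent from ref, while the
    parent taxon (level - 1) must remain, i.e. a sister branch exists.
    Assumes level >= 1."""
    ref_taxa = {lineage_at(tax, level) for tax in ref_tax.values()}
    ref_parents = {lineage_at(tax, level - 1) for tax in ref_tax.values()}
    for seq_id, tax in query_tax.items():
        assert seq_id not in ref_tax, "%s found in ref" % seq_id
        assert lineage_at(tax, level) not in ref_taxa, "%s not novel" % seq_id
        assert lineage_at(tax, level - 1) in ref_parents, \
            "%s has no sister branch in ref" % seq_id


# Toy data (invented): both checks pass silently.
check_simulated_community(
    {"q1": "k__A;p__B;c__C"},
    {"r1": "k__A;p__B;c__C", "r2": "k__A;p__E;c__F"},
)
check_novel_taxa(
    {"q1": "k__A;p__B;c__C"},
    {"r1": "k__A;p__B;c__D", "r2": "k__A;p__E;c__F"},
    level=2,
)
```

As with the notebook's own tests, silence means success; a violated condition raises an `AssertionError` naming the offending sequences.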