<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Vagimock-and-gutmock:-obtain-desired-abundance-table" data-toc-modified-id="Vagimock-and-gutmock:-obtain-desired-abundance-table-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Vagimock and gutmock: obtain desired abundance table</a></span><ul class="toc-item"><li><span><a href="#Set-up" data-toc-modified-id="Set-up-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Set up</a></span></li><li><span><a href="#Get-classifier" data-toc-modified-id="Get-classifier-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Get classifier</a></span></li><li><span><a href="#Get-taxonomy" data-toc-modified-id="Get-taxonomy-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Get taxonomy</a></span></li><li><span><a href="#Get-Taxa-Barplots" data-toc-modified-id="Get-Taxa-Barplots-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Get Taxa Barplots</a></span></li><li><span><a href="#Read-Taxa-Barplots" data-toc-modified-id="Read-Taxa-Barplots-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Read Taxa Barplots</a></span></li><li><span><a href="#Obtain-Abundance-table" data-toc-modified-id="Obtain-Abundance-table-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Obtain Abundance table</a></span></li></ul></li><li><span><a href="#Mockrobiota" data-toc-modified-id="Mockrobiota-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Mockrobiota</a></span><ul class="toc-item"><li><span><a href="#Download-data" data-toc-modified-id="Download-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Download data</a></span></li><li><span><a href="#Filter-mockrobiota-records" data-toc-modified-id="Filter-mockrobiota-records-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Filter mockrobiota records</a></span></li><li><span><a href="#Format-taxonomy-file" data-toc-modified-id="Format-taxonomy-file-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Format taxonomy file</a></span></li><li><span><a 
href="#Create-abundance_env.txt" data-toc-modified-id="Create-abundance_env.txt-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Create abundance_env.txt</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Approach-2" data-toc-modified-id="Approach-2-2.4.0.1"><span class="toc-item-num">2.4.0.1&nbsp;&nbsp;</span>Approach 2</a></span></li><li><span><a href="#Approach-1" data-toc-modified-id="Approach-1-2.4.0.2"><span class="toc-item-num">2.4.0.2&nbsp;&nbsp;</span>Approach 1</a></span></li></ul></li></ul></li></ul></li></ul></div>

This notebook is intended to produce the essential files for the vagimock, gutmock and mockrobiota mock communities.

For **vagimock** and **gutmock**, we will obtain the species that are expected to be present in a specific sample environment, such as the vagina or the gut.
- Input: abundance table from a previous analysis, containing taxon and frequency.
- Output: expected-taxonomy.tsv (the expected abundance table)

For **mockrobiota**, we will download and process data to unify taxonomies.
- Input: itgdbmock
- Output: fasta and taxa mockrobiota files

# Vagimock and gutmock: obtain desired abundance table

## Set up

In [None]:
from qiime2 import Artifact


import pandas as pd
import numpy as np
from IPython.display import display
import os
from os.path import join, basename, dirname, exists, splitext

from joblib import Parallel, delayed
import random

%matplotlib inline
%load_ext autoreload

# Custom functions
import mock_community
from utils import check_dir
import utils
import db_comparison

import logging

In [2]:
# Mock-communities
mock_names = ['vagimock', 'gutmock']

# Working directory
pwd = os.getcwd()

# Database files
created_dir = "new_approach/created_db"
original_dir = "new_approach/original_db"

db = {'gsrv_V4':
      {'seq':join(created_dir, 'gsrv_V4_seqs.qza'),
      'taxa':join(created_dir, 'gsrv_V4_taxa.qza')}
}

## Database parameters
db_dir = created_dir
db_name = 'gsrv_V4'
db_version = "gsrv_V4"
db_similarity = "99-otus"

## Database output files
db_out_files = {
    mocknames: join(mocknames, 'abundance_table'
                ) for mocknames in mock_names}


# Data input directory: Qiime2 files (taxonomy and table)
in_dirs = ["/media/VMP",
          "/media/LT"]
data_dir = { mockname: d for mockname, d in zip(mock_names, in_dirs)}

# Inputs
## representative sequences of sample specific environment (Rep-seqs)
rep_seqs_path = [
    "rep-seqs_merged_VMP_run29_40_46.qza", 
    "rep-seqs-dada2_mergedRuns_LT.qza" ]
rep_seqs = {
    mockname: join(
        data_dir[mockname], rep_seq_path
    ) for mockname, rep_seq_path in zip(mock_names, rep_seqs_path)}

## Table abundancies of Rep-Seqs
tab_paths = [
    "table_merged_VMP_run29_40_46.qza", 
    "table-dada2_mergedRuns_LT.qza"]
tabs = {
    mockname: join(
        data_dir[mockname], tab_path
    ) for mockname, tab_path in zip(mock_names, tab_paths)}

In [3]:
# Generate directory
for key, outdir in db_out_files.items():
    check_dir(outdir)

We will obtain the abundance table from previously analyzed data. To do so, we will build the classifier, perform the taxonomy assignment and generate the taxa barplots.

## Get classifier

Build the classifier with the default n-gram range ([7,7]).
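
As a rough illustration of what the `[7,7]` setting controls: as far as we understand, the QIIME 2 naive Bayes workflow breaks each reference sequence into overlapping character n-grams of a fixed length (here 7) before training. The sketch below is not part of `db_comparison`; `kmers` is a hypothetical helper and the sequence is a toy example, used only to show what such 7-mers look like.

```python
def kmers(sequence, k=7):
    """Return all overlapping substrings of length k (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


example = "ACGTACGTAC"  # toy sequence, not taken from the database
print(kmers(example))
```

A sequence shorter than `k` yields no n-grams at all, which is one reason very short reference sequences carry little information for this kind of classifier.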

In [None]:
classifier_dir = 'db_evaluation/classifiers'

%autoreload
cmds, classifier_path = db_comparison.get_classifier(
    db, classifier_dir, bespoke=False, p_alpha=0.001, p_feat='[7,7]', 
    force=True)
cmds

In [None]:
%%time
Parallel(n_jobs=38)(delayed(os.system)(cmd) for cmd in cmds);

## Get taxonomy

Perform the taxonomy assignment with default confidence 0.7

In [11]:
%%time
all_tax = []

for mockname, path in rep_seqs.items():
    rs_artifact = Artifact.load(path)
    tax = db_comparison.get_taxonomy(
            rep_seqs_artifact = rs_artifact, 
            classifier_paths = classifier_path, 
            out_dir = pwd, label = mockname, method="naive-bayes", 
            threads=35, mock_community=False, p_confidence=0.7, 
            force=True )
    all_tax.append(tax)

CPU times: user 1min 1s, sys: 11 s, total: 1min 12s
Wall time: 5min 47s


In [None]:
all_tax

## Get Taxa Barplots
We will collapse the frequency table at the species level (level 7) to obtain the species-level abundance for each sample.
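
To make the collapsing step concrete, here is a minimal pandas sketch under stated assumptions: the feature IDs, sample names and lineages are invented, and the real work in this notebook is delegated to `mock_community.generate_taxa_barplots`, whose internals are not shown here. The idea is only that counts of all features sharing a lineage are summed per sample.

```python
import pandas as pd

# Toy feature table: rows are features (ASVs), columns are samples
table = pd.DataFrame(
    {"sampleA": [10, 5, 3], "sampleB": [0, 7, 2]},
    index=["asv1", "asv2", "asv3"],
)

# Toy taxonomy: asv1 and asv2 share the same species-level lineage
taxonomy = pd.Series(
    {
        "asv1": "k__Bacteria;g__Lactobacillus;s__Lactobacillus_crispatus",
        "asv2": "k__Bacteria;g__Lactobacillus;s__Lactobacillus_crispatus",
        "asv3": "k__Bacteria;g__Gardnerella;s__Gardnerella_vaginalis",
    },
    name="Taxon",
)

# Sum the counts of all features that map to the same lineage
collapsed = table.groupby(taxonomy).sum()
print(collapsed)
```

The result has one row per lineage instead of one row per feature, which is the shape of the taxa-barplot tables read in the next section.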

In [13]:
%autoreload
tax_barplots = []

for i , tax_dict in enumerate(all_tax):
    tax = list(tax_dict.values())[0][0]
    mockname = mock_names[i]
    tab = Artifact.load(tabs[mockname])
    path = join(
        db_out_files[mockname], 
        f"taxa_barplot-L7_{db_name}_0.001::[7,7]:0.7_{mockname}")
    
    tsv_tab = mock_community.generate_taxa_barplots(
        tax_artifact = tax, tab_artifact = tab,
        level = 7, output = f"{path}.tsv" , force = True
    
    )
    
    tax_barplots.append(tsv_tab)

## Read Taxa Barplots

In [14]:
# Read Taxa-Barplots
tb_df = {}
for tb in tax_barplots:
    key = tb.split('/')[0]
    df = pd.read_csv(tb, header = 1, index_col = 0, sep='\t')
    
    tb_df[key] = df

In [15]:
tb_df['vagimock'].reset_index()

Unnamed: 0,#OTU ID,INT.CTL_Run40,V.SWAB.13CE.1,V.SWAB.13CE.2,V.SWAB.15CE.1,V.SWAB.15CE.2,V.SWAB.16C.1.RUN29,V.SWAB.20CE.1,V.SWAB.20CE.2,V.SWAB.21CE.1,...,PBV.70.1,PBV.72.1,PBV.72.2,PBV.73.1,PBV.75.1,PBV.77.1,PBV.77.2,V.SWAB.21PE.2_R,V.SWAB.3.CE.2_R,V.SWAB.REUS2.2_R
0,k__Bacteria;__;__;__;__;__;__,313.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,0.0,0.0,0.0,0.0,13.0,0.0,0.0,10.0
1,k__Bacteria;p__Firmicutes;c__Clostridia;o__Eub...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,334.0,1262.0,961.0,105772.0,86887.0,221.0,333.0,1021.0,531.0,...,59495.0,12205.0,2841.0,359.0,116.0,150.0,171.0,9651.0,531.0,326.0
3,k__Bacteria;p__Firmicutes;c__Tissierellia;o__;...,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,31.0,80.0,0.0,0.0,0.0,0.0,0.0
4,k__Bacteria;p__Firmicutes;c__Tissierellia;o__T...,0.0,0.0,169.0,0.0,0.0,0.0,13.0,0.0,0.0,...,174.0,0.0,0.0,34.0,390.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
585,k__Bacteria;p__Firmicutes;c__Clostridia;o__Eub...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
586,k__Bacteria;p__Bacteroidetes;c__Saprospiria;o_...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
587,k__Bacteria;p__Firmicutes;c__Clostridia;o__Eub...,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
588,k__Bacteria;p__Actinobacteria;c__Coriobacterii...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, we modify this QIIME 2 table so that its taxonomy syntax matches that of the database. Specifically, we:

- add a space after the separator (';') between clades
- add the rank prefix (e.g. s__, g__) to each unspecified clade

Finally, we keep only the rows with an identified species.
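
The formatting rules above can be sketched in plain Python. This is an assumption about what `utils.format_taxonomy_of_taxabarplots` does, not its actual code: `format_lineage` and `has_species` are hypothetical helpers, and the second lineage is a toy example.

```python
RANK_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]


def format_lineage(lineage):
    """Hypothetical re-implementation of the two formatting rules."""
    fixed = []
    for prefix, clade in zip(RANK_PREFIXES, lineage.split(";")):
        clade = clade.strip()
        # An unspecified clade is exported as a bare "__"; give it its rank prefix
        fixed.append(prefix if clade == "__" else clade)
    # Separate clades with "; " as in the database taxonomy
    return "; ".join(fixed)


def has_species(lineage):
    """True when the species-level clade is not empty."""
    return not lineage.endswith("s__")


raw = "k__Bacteria;__;__;__;__;__;__"
formatted = format_lineage(raw)
print(formatted)
print(has_species(formatted))
```

With this convention, a lineage that is only resolved to kingdom level ends in `s__` and is dropped by the species filter.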

In [16]:
for key, df in tb_df.items():
    # modify taxonomy to match the databases
    df = utils.format_taxonomy_of_taxabarplots(df)
    # Filter out unidentified species
    species = [
        item for item in list(df['Taxon']) if not item.endswith('s__')]
    df = df[df['Taxon'].isin(species)].reset_index(drop=True)
    display(df)
    tb_df[key] = df

Unnamed: 0,Taxon,INT.CTL_Run40,V.SWAB.13CE.1,V.SWAB.13CE.2,V.SWAB.15CE.1,V.SWAB.15CE.2,V.SWAB.16C.1.RUN29,V.SWAB.20CE.1,V.SWAB.20CE.2,V.SWAB.21CE.1,...,PBV.70.1,PBV.72.1,PBV.72.2,PBV.73.1,PBV.75.1,PBV.77.1,PBV.77.2,V.SWAB.21PE.2_R,V.SWAB.3.CE.2_R,V.SWAB.REUS2.2_R
0,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,k__Bacteria; p__Actinobacteria; c__Coriobacter...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,k__Bacteria; p__Firmicutes; c__Negativicutes; ...,0.0,0.0,0.0,0.0,7.0,0.0,0.0,142.0,0.0,...,0.0,0.0,0.0,0.0,103.0,0.0,0.0,0.0,0.0,0.0
3,k__Bacteria; p__Firmicutes; c__Bacilli; o__Lac...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,441.0,0.0,...,0.0,0.0,0.0,0.0,558.0,0.0,0.0,0.0,0.0,0.0
4,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,740.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
399,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
400,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
401,k__Bacteria; p__Bacteroidetes; c__Saprospiria;...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
402,k__Bacteria; p__Actinobacteria; c__Coriobacter...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,Taxon,D001_LT,D002_LT,D003_LT,D004_LT,D005_LT,HR002_LT,HR004_LT,HR005_LT,HR007_LT,...,P023.6_LT,P023.7.m2_LT,P023.8_LT,P023.9_LT,P024.3_LT,P024.5_LT,P024.6_LT,P024.7_LT,P024.8_LT,P024.9_LT
0,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,261.0,0.0,287.0,138.0,0.0,0.0,0.0,271.0,182.0,...,0.0,0.0,416.0,784.0,135.0,0.0,1914.0,206.0,527.0,598.0
1,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,66.0,9.0,0.0,0.0,0.0,129.0,110.0,0.0,...,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,k__Bacteria; p__Firmicutes; c__Erysipelotrichi...,0.0,0.0,9.0,0.0,0.0,103.0,0.0,0.0,0.0,...,0.0,91.0,172.0,110.0,107.0,148.0,92.0,85.0,133.0,180.0
3,k__Bacteria; p__Proteobacteria; c__Betaproteob...,0.0,159.0,0.0,0.0,0.0,0.0,0.0,23.0,17.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,k__Bacteria; p__Actinobacteria; c__Actinomycet...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,116.0,5.0,13.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
432,k__Bacteria; p__Proteobacteria; c__Alphaproteo...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
433,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
434,k__Archaea; p__Euryarchaeota; c__Methanomicrob...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
435,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Obtain Abundance table

Split each dataframe into five groups of samples. To make sure *Lactobacillus crispatus* is included in the vagimock community, we increase its abundance.
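
The grouping relies on shuffling the sample names and then taking every fifth name, starting at a different offset for each group. A small stand-alone sketch with invented sample names:

```python
import random

random.seed(123)  # fixed seed so the shuffle is reproducible

n_groups = 5
samples = [f"sample{n:02d}" for n in range(12)]  # toy sample names
random.shuffle(samples)

# Group x takes elements x, x + n_groups, x + 2 * n_groups, ...
groups = [samples[x::n_groups] for x in range(n_groups)]

for idx, group in enumerate(groups):
    print(idx, group)
```

Every sample ends up in exactly one group, and group sizes differ by at most one when the number of samples is not a multiple of five.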

In [17]:
# Number of groups
i = 5 
random.seed(123)

for key, df in tb_df.items():
    df = df.set_index('Taxon')
    colnames = list(df.columns)
    random.shuffle(colnames)
    grouped_colnames = [colnames[x::i] for x in range(i)]
    sub_dfs = [df[group].reset_index() for group in grouped_colnames]
    
    tb_df[key] = sub_dfs
    
# increase the abundancy of crispatus differently in each group:

max_abun = [1, 1000, 10000, 50000, 100000]

for i, (abun, df) in enumerate(zip(max_abun, tb_df["vagimock"])):
    df = df.set_index('Taxon')
#     crispatus_row = df[df['Taxon'].str.contains('crispatus')]
    crispatus_row = df.filter(like='crispatus', axis=0)
    ncols = len(crispatus_row.columns)
    crispatus_mod_row  = crispatus_row+random.choices(range(0,abun), k=ncols)
    df.loc[crispatus_row.index, :] = crispatus_mod_row
    df = df.reset_index()
    tb_df['vagimock'][i] = df

In [18]:
display(tb_df['vagimock'][0].head())

Unnamed: 0,Taxon,V.SWAB.SPAU2.2,PBV.4.1,V.SWAB.REUS1.1,V.SWAB.SPAU5.1,PBV.58.2,V.SWAB.21C.1,PBV.16.2,V.SWAB.14C.1,PBV.66.2,...,V.SWAB.15C.2,V.SWAB.53PE.2,V.SWAB.16C.2,V.SWAB.10PE.2,V.SWAB.15PE.1,PBV.64.2,V.SWAB.24CE.2,V.SWAB.3.CE.2_R,PBV.44.2,V.SWAB.HUMIC11.1
0,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,k__Bacteria; p__Actinobacteria; c__Coriobacter...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,71.0,0.0,0.0,0.0
2,k__Bacteria; p__Firmicutes; c__Negativicutes; ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,k__Bacteria; p__Firmicutes; c__Bacilli; o__Lac...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In each dataframe, keep only the species that together account for 99% of the abundance.

Then compute the relative abundances. The fractions are multiplied by 100 (i.e. expressed as percentages) to avoid very small numbers.
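
A worked toy example of the cumulative filter, assuming counts that have already been summed over the samples of one group (taxon names and counts are invented):

```python
import pandas as pd

# Toy counts already summed over the samples of one group
df = pd.DataFrame({"Taxon": ["A", "B", "C", "D"],
                   "Abundance": [900, 80, 15, 5]})

# Relative abundance as a percentage of the total
df["Rel_Abundance"] = df["Abundance"] / df["Abundance"].sum() * 100
df = df.sort_values("Rel_Abundance", ascending=False)
df["Rel_Abundance_cumulative"] = df["Rel_Abundance"].cumsum()

# Cumulative values are roughly 90, 98, 99.5 and 100,
# so only A and B pass the strict "< 99" filter
kept = df[df["Rel_Abundance_cumulative"] < 99]
print(kept)
```

Note that with a strict `< 99` comparison the taxon that crosses the threshold (C in this toy case) is dropped as well, so the retained taxa account for somewhat less than 99% of the total.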

In [None]:
for key, items in tb_df.items():
    for i, df in enumerate(items):
        
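        # Calculate total abundance, relative abundance (%) and cumulative relative abundance
        # numeric_only=True skips the string 'Taxon' column when summing across samples
        df['Abundance'] = df.sum(axis=1, numeric_only=True)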
        df['Rel_Abundance'] = (df['Abundance']/df['Abundance'].sum())*100
        df = df.sort_values(by=['Rel_Abundance'], ascending=False)
        df['Rel_Abundance_cumulative'] = df['Rel_Abundance'].cumsum()
        display(df)
        # Only keep the species that represents the 99% of the abundance 
        df = df[df['Rel_Abundance_cumulative'] < 99]
        
        tb_df[key][i] = df 
        display(df.head())

In [20]:
abundance_tabs = {}
for key, items in tb_df.items():
    d = {}
    
    for i, item in enumerate(items):
        name = f"{key}0{i}"
        d[name] = {}
        for j in item.index:
            tax = str(item.loc[j, 'Taxon'])
            d[name][tax] = item.loc[j, 'Rel_Abundance']
        
    
    df = pd.DataFrame.from_dict(d)
    df = df.fillna(0)
    df.reset_index(inplace=True)
    df = df.rename(columns = {'index': 'Taxon'})
    abundance_tabs[key] = df
    
    display(df.head())

Unnamed: 0,Taxon,vagimock00,vagimock01,vagimock02,vagimock03,vagimock04
0,k__Bacteria; p__Firmicutes; c__Bacilli; o__Lac...,92.858135,81.513449,68.090323,40.652198,22.501148
1,k__Bacteria; p__Firmicutes; c__Bacilli; o__Lac...,2.685409,5.038885,3.810807,3.009378,0.937033
2,k__Bacteria; p__Actinobacteria; c__Actinomycet...,1.385344,0.415924,0.85226,0.78307,0.957093
3,k__Bacteria; p__Actinobacteria; c__Actinomycet...,1.300811,4.166379,9.575885,6.888249,6.508824
4,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,0.604868,0.122824,0.146558,0.087941,0.295946


Unnamed: 0,Taxon,gutmock00,gutmock01,gutmock02,gutmock03,gutmock04
0,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,11.777871,11.084972,16.278256,9.095578,16.14184
1,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,5.551624,0.57797,0.890686,0.690637,0.412421
2,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,4.428748,6.483817,5.711091,5.769315,4.541067
3,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,3.812079,5.267262,4.116979,2.088251,8.776298
4,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,3.662234,4.626383,2.786266,3.276952,1.875877


The abundance tables above contain the expected abundance of each taxon. We will save each table as a TSV file.

In [21]:
# Save the table
force=True

for key, tab in abundance_tabs.items():
    exp_tax_out = join(db_out_files[key],"expected-taxonomy.tsv")
    
    if not exists(exp_tax_out) or force:
        tab.to_csv(exp_tax_out, sep="\t", index=False)

# Mockrobiota

## Download data

The data were obtained from the 16S-ITGDB paper ([link](https://github.com/yphsieh/16S-ITGDB/tree/master/test%20cases)). We downloaded the combined mock validation dataset, which contains the PacBio HMP, Zymo and Mockrobiota mock communities:
- [final_mock_seq.fasta](https://drive.google.com/file/d/1Y4PeUcZAuXkB2uU0OBv7uygqQlX2tXXp/view?usp=share_link)
- [final_mock_taxa.txt](https://drive.google.com/file/d/1s1PuYacaWC3XRvly-TrqybCXMUtseZ2q/view?usp=share_link)

We have renamed those files using the prefix 'itgdbmock'.

In [1]:
# ete3 env
from ete3 import NCBITaxa
import pickle
from os.path import join, basename, exists
import update_taxonomy
from re import search
import utils
import pandas as pd
import random
%load_ext autoreload

In [2]:
mockrobiota_src = 'mockrobiota/src'
itgdbmock_files = {
    'seq':join(mockrobiota_src, 'itgdbmock_seqs.fasta'),
    'taxa':join(mockrobiota_src, 'itgdbmock_taxa.txt')}

# Output files
mockrobiota_files = {
    'seq':join(mockrobiota_src, 'mockrobiota_seqs.fasta'),
    'taxa':join(mockrobiota_src, 'mockrobiota_taxa.txt')}

## Filter mockrobiota records

From itgdbmock we keep only the records that belong to mockrobiota. We do not trust the taxonomy assignment of the PacBio HMP and Zymo mock communities, which was performed with BLAST and which we think could be biased.
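
For readers without `seqkit`, the sequence filter can be approximated in plain Python. This is a sketch, not the command used in this notebook: `filter_fasta` is a hypothetical helper, the record IDs and sequences are invented, and it assumes that the record ID is the first whitespace-delimited token of the header.

```python
import re


def filter_fasta(lines, pattern):
    """Yield the lines of FASTA records whose header ID matches `pattern`."""
    regex = re.compile(pattern)
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            keep = bool(regex.search(record_id))
        if keep:
            yield line


toy_fasta = [
    ">Mockrobiota_1 toy record",
    "ACGT",
    ">Zymo_1 toy record",
    "GGCC",
    ">Mockrobiota_2",
    "TTAA",
    "CCGG",
]
kept = list(filter_fasta(toy_fasta, r"Mockrobiota"))
print("\n".join(kept))
```

Multi-line sequences are handled because the `keep` flag only changes when a new header line is read.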

In [11]:
# Taxa
!grep 'Mockrobiota*\|Feature ID' {itgdbmock_files['taxa']} > {mockrobiota_files['taxa']}
# Seq
!seqkit grep -p 'Mockrobiota*' -r {itgdbmock_files['seq']} > {mockrobiota_files['seq']}

## Format taxonomy file

In [12]:
# Load
db_mockro = utils.load_db_from_files(taxa_path = mockrobiota_files['taxa'])

Change the species format to match the SILVA species format (genus + species).

In [13]:
for record in db_mockro.values():
    # change 'Chlorobium' genus for 'Pelodictyon' according 
    #+ to NCBI taxonomy
    record['taxa'][5] = record['taxa'][5].replace(
        'Chlorobium', 'Pelodictyon')
    # change species format
    record['taxa'][6] = f"{record['taxa'][5]}_{record['taxa'][6]}"

Update the taxonomy according to NCBI using the ete3 module.

In [14]:
%autoreload

not_found_taxa = []

key = 'Mockrobiota'

# Information
found_taxa = 0
change_taxa = 0
len_original_db = len(db_mockro)
print(f"Original size of {key}: {len_original_db}")

# Copy database
db_mockro_copy = db_mockro.copy()


for ID, items in db_mockro.items():
    taxa = items['taxa']
    new_taxa = update_taxonomy.update_taxonomy_ncbi(taxa, join_taxa = False)


    if new_taxa:
        # check if taxa was changed
        if new_taxa != taxa:
            change_taxa +=1


        # check if there is still species level
        if new_taxa[6]:
            # replace taxa and count it
            db_mockro[ID]['taxa'] = new_taxa
            found_taxa = found_taxa + 1 
        else:
            db_mockro_copy.pop(ID)

    else:
        not_found_taxa.append(taxa)

db_mockro = db_mockro_copy

# Information
len_db = len(db_mockro_copy)
print(f"Final size of {key}: {len_db}")
print(f"Total found taxa: {found_taxa}")
print(f"{(change_taxa/len_db)*100} % of the taxa changed and was updated\n")


Original size of Mockrobiota: 274
Final size of Mockrobiota: 274
Total found taxa: 274
60.94890510948905 % of the taxa changed and was updated



Save the updated taxonomy file.

In [15]:
utils.download_db_from_dict(db_mockro, mockrobiota_files['taxa'])

## Create abundance_env.txt

In [3]:
# Load
db_mockro = utils.load_db_from_files(taxa_path = mockrobiota_files['taxa'])
len(db_mockro)

274

#### Approach 2

- subset 200 entries for each sample (no repetition)
- create feature table and abundance table. 
- create environment taxa and fasta files
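
The sampling idea behind this approach can be sketched with the standard library alone. The database, IDs and lineages below are invented, and the fixed seed is an assumption added so that the draw can be repeated.

```python
import random
from collections import Counter

random.seed(123)  # assumption: a fixed seed, so the draw can be repeated

# Toy database: record ID -> lineage (several records share a lineage)
toy_db = {f"id{n}": f"taxon{n % 4}" for n in range(20)}

# Draw 10 distinct record IDs (sampling without replacement)
chosen_ids = random.sample(list(toy_db), k=10)

# Count how many of the chosen records belong to each lineage
counts = Counter(toy_db[record_id] for record_id in chosen_ids)

# Relative abundance: each count divided by the number of chosen records
rel_abundance = {taxon: n / len(chosen_ids) for taxon, n in counts.items()}
print(rel_abundance)
```

Because `random.sample` never returns the same ID twice, each chosen record contributes exactly one count, so the feature table for a sample contains only ones (and zeros after joining with other samples).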

In [None]:
mockro_ids = list(db_mockro.keys())

for i in range(0, 5):
    colname = f"mockrobiota0{i}"
    # Draw 200 record IDs without replacement for this sample
    chosen_ids = random.sample(mockro_ids, k=200)
    chosen_set = set(chosen_ids)
    all_taxonomies = [
        utils.join_taxa_lineages(record['taxa'])
        for ID, record in db_mockro.items() if ID in chosen_set
    ]

    # Count records per taxon and per ID; naming the Series before to_frame()
    # keeps the column name independent of the pandas version
    abun = pd.Series(all_taxonomies).value_counts().rename(colname).to_frame()
    feat = pd.Series(chosen_ids).value_counts().rename(colname).to_frame()

    if i == 0:
        abun_tab, feature_tab = abun, feat
    else:
        abun_tab = abun_tab.join(abun, how='outer').fillna(0)
        feature_tab = feature_tab.join(feat, how='outer').fillna(0)

    # Convert counts to relative abundance (each column sums to 1)
    abun_tab[colname] = abun_tab[colname] / abun_tab[colname].sum()

abun_tab = abun_tab.reset_index().rename(columns={'index': 'Taxon'})
feature_tab = feature_tab.reset_index().rename(columns={'index': 'ID'})
display(abun_tab.sort_values(by="mockrobiota00", ascending=False))
display(feature_tab)

In [7]:
# Save the tables
force=True

exp_tax_mockro = join(mockrobiota_src,"expected-taxonomy.tsv")
feature_tab_mockro = join(mockrobiota_src, "expected-featuretab.tsv")

if not exists(exp_tax_mockro) or force:
    abun_tab.to_csv(exp_tax_mockro, sep="\t", index=False)
    
if not exists(feature_tab_mockro) or force:
    feature_tab.to_csv(feature_tab_mockro, sep="\t", index=False)

#### Approach 1

Count how many records in the full mockrobiota dataset belong to each taxon.

In [7]:
# add prefixes and join taxa into string and store in a list
all_taxonomies = [
    utils.join_taxa_lineages(record['taxa'])
    for ID, record in db_mockro.items()
]

# Count records per taxon; naming the Series and its index before
# reset_index() avoids depending on pandas' default column names
abun_tab = (
    pd.Series(all_taxonomies)
    .value_counts()
    .rename("mockrobiota00")
    .rename_axis("Taxon")
    .reset_index()
)


In [8]:
abun_tab.tail()

Unnamed: 0,Taxon,mockrobiota00
86,k__Bacteria; p__Chloroflexi; c__Anaerolineae; ...,1
87,k__Bacteria; p__Actinobacteria; c__Actinomycet...,1
88,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,1
89,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,1
90,k__Bacteria; p__Actinobacteria; c__Actinomycet...,1


Now, create 5 different samples with random abundances.

In [19]:
ncols = 5 

random.seed(123)

rows = [random.choices(range(0, abun), k=ncols) for abun in abun_tab['mockrobiota00']]

# Create columns
for col in range(1,5):
    abun_tab[f"mockrobiota0{col}"] = 0
# Assign the random values to the five sample columns
for i in abun_tab.index:
    abun_tab.loc[i, 'mockrobiota00':'mockrobiota04'] = rows[i]

for col in range(0,5):
    colname= f"mockrobiota0{col}"
    # Normalise by the column total; the factor 2 means each column sums to 2
    abun_tab[colname] = (abun_tab[colname]/abun_tab[colname].sum())*2

display(abun_tab.head())

Unnamed: 0,Taxon,mockrobiota00,mockrobiota01,mockrobiota02,mockrobiota03,mockrobiota04
0,k__Bacteria; p__Bacteroidetes; c__Bacteroidia;...,0.0,0.027027,0.202899,0.022222,0.359551
1,k__Bacteria; p__Proteobacteria; c__Gammaproteo...,0.0,0.216216,0.115942,0.266667,0.044944
2,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.098765,0.108108,0.086957,0.0,0.134831
3,k__Bacteria; p__Firmicutes; c__Clostridia; o__...,0.024691,0.216216,0.0,0.088889,0.134831
4,k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...,0.222222,0.0,0.028986,0.155556,0.0


In [20]:
# Save the table
force=True

exp_tax_mockro = join(mockrobiota_src,"expected-taxonomy.tsv")

if not exists(exp_tax_mockro) or force:
    abun_tab.to_csv(exp_tax_mockro, sep="\t", index=False)