<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Generation-of-mock-microbial-community-for-16S-analysis" data-toc-modified-id="Generation-of-mock-microbial-community-for-16S-analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Generation of mock microbial community for 16S analysis</a></span></li><li><span><a href="#Set-up" data-toc-modified-id="Set-up-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Set up</a></span><ul class="toc-item"><li><span><a href="#Vagimock-and-gutmock" data-toc-modified-id="Vagimock-and-gutmock-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Vagimock and gutmock</a></span></li><li><span><a href="#Mockrobiota" data-toc-modified-id="Mockrobiota-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Mockrobiota</a></span><ul class="toc-item"><li><span><a href="#cut-the-regions" data-toc-modified-id="cut-the-regions-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>cut the regions</a></span></li></ul></li><li><span><a href="#Obtain-abundance-table" data-toc-modified-id="Obtain-abundance-table-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Obtain abundance table</a></span></li><li><span><a href="#Mockrobiota-environment-based-on-feature-tab" data-toc-modified-id="Mockrobiota-environment-based-on-feature-tab-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Mockrobiota environment based on feature tab</a></span></li><li><span><a href="#Simulate-environments-/-repseqs-for-custom-mock-communities" data-toc-modified-id="Simulate-environments-/-repseqs-for-custom-mock-communities-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Simulate environments / repseqs for custom mock-communities</a></span></li><li><span><a href="#Create-mockrobiota-16S-files" data-toc-modified-id="Create-mockrobiota-16S-files-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span><del>Create mockrobiota 16S files</del></a></span><ul class="toc-item"><li><span><a href="#Cut-V-region" data-toc-modified-id="Cut-V-region-2.6.1"><span 
class="toc-item-num">2.6.1&nbsp;&nbsp;</span><del>Cut V region</del></a></span></li><li><span><a href="#Create-files" data-toc-modified-id="Create-files-2.6.2"><span class="toc-item-num">2.6.2&nbsp;&nbsp;</span><del>Create files</del></a></span></li></ul></li></ul></li></ul></div>

# Generation of mock microbial community for 16S analysis

This notebook is designed to generate in-silico mock communities for 16S analysis.   
It is able to generate mock communities for the following regions: full-16S, V1-V3 and V3-V5. 


It consists of several steps (NOT TRUE FOR THIS APPROACH, CHECK):

- Simulate a specific sample environment (e.g. vagina, gut, mouth) from an abundance table that contains each taxon and its frequency
- Generate in-silico reads from the environment
- Process the in-silico reads with QIIME 2 to obtain rep-seqs

The mock community directory structure will look like the following:
```
out_dir/
├── db_name/ # database name
│   └── db_version/ # database version
│       └── db_similarity/ # otu % similarity if applicable. If using a database that has not been clustered, use "100-otus"
│           ├── env-identifiers.tsv # environment identifiers associated with each mock community member
│           ├── env-seqs.fasta # environment sequences
│           ├── expected-taxonomy.tsv # per-sample taxonomic abundances (species level)
│           ├── table.L6-taxa.biom # per-sample taxonomic abundances (species level) in biom format
│           └── source/
│               ├── abundancy_table # generation of abundance table
│               ├── art # fastq simulation
│               │   └── single or paired/ # depending on your choice
│               │       ├── read1.fq # simulated reads
│               │       └── fastq/ # simulated fastq reads compressed and with formatted name
│               └── qiime2
│                   └── single or paired/ # depending on your choice
│                       ├── demux.qza
│                       ├── table.qza
│                       └── rep_seqs.qza
```
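
The first three levels of the layout above can be sketched as a small path helper. This is only an illustration under stated assumptions: `mock_output_dir` is a hypothetical name that does not exist in the notebook's modules, and the notebook itself relies on `utils.check_dir`, whose implementation is not shown here.

```python
import os
import tempfile
from os.path import join


def mock_output_dir(out_dir, db_name, db_version, db_similarity="100-otus"):
    """Build out_dir/db_name/db_version/db_similarity (hypothetical helper)."""
    return join(out_dir, db_name, db_version, db_similarity)


# Demonstrate inside a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    path = mock_output_dir(join(tmp, "vagimock"),
                           "gsrv_V1-V3", "gsrv_V1-V3", "NS-otus")
    os.makedirs(path, exist_ok=True)  # safe to call repeatedly
    created = os.path.isdir(path)
```

The default `"100-otus"` follows the note in the tree about databases that have not been clustered.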

# Set up

In [1]:
import pandas as pd
from IPython.display import display
from os.path import join, basename, dirname, exists, splitext
import shutil

%matplotlib inline
%load_ext autoreload

# Custom functions
import mock_community
from utils import check_dir
import utils
import new_approach

## Vagimock and gutmock

Change the cell below according to your needs.

In [3]:
v_regions = ["full-16S", 'V1-V3', 'V3-V5']

db_dir = "/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db"

databases = {f"gsrv_{v_region}": {'taxa': join(db_dir, f'gsrv_{v_region}_taxa.txt'), 
                                  'seq': join(db_dir, f'gsrv_{v_region}_seqs.fasta')} for v_region in v_regions}

In [3]:
databases

{'gsrv_full-16S': {'taxa': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_full-16S_taxa.txt',
  'seq': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_full-16S_seqs.fasta'},
 'gsrv_V1-V3': {'taxa': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_V1-V3_taxa.txt',
  'seq': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_V1-V3_seqs.fasta'},
 'gsrv_V3-V5': {'taxa': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_V3-V5_taxa.txt',
  'seq': '/mnt/synology/DATABASES/QIIME2/SSU/gsp_99/new_approach/created_db/gsrv_V3-V5_seqs.fasta'}}
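
Because every later step reads from these paths, it may be worth checking that the files exist before going further. The following is a minimal sketch; `missing_db_files` is a hypothetical helper, and the example paths are placeholders rather than real database locations.

```python
from os.path import exists


def missing_db_files(databases):
    """Return (db_name, kind, path) for every referenced file that is absent."""
    return [
        (db_name, kind, path)
        for db_name, files in databases.items()
        for kind, path in files.items()
        if not exists(path)
    ]


# Placeholder paths for illustration; pass the real `databases` dict instead.
example = {"gsrv_demo": {"taxa": "/nonexistent/demo_taxa.txt",
                         "seq": "/nonexistent/demo_seqs.fasta"}}
for db_name, kind, path in missing_db_files(example):
    print(f"{db_name}: missing {kind} file {path}")
```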

In [4]:
# Whether to generate paired-end (True) or single-end (False) sequence files
paired = False

# Mock-communities
mock_names = ['vagimock', 'gutmock']

mock_dirs = {mock: {v_region: join(
    mock, f'gsrv_{v_region}', f'gsrv_{v_region}', f'NS-otus')
    for v_region in v_regions} for mock in mock_names}

mock_dirs

{'vagimock': {'full-16S': 'vagimock/gsrv_full-16S/gsrv_full-16S/NS-otus',
  'V1-V3': 'vagimock/gsrv_V1-V3/gsrv_V1-V3/NS-otus',
  'V3-V5': 'vagimock/gsrv_V3-V5/gsrv_V3-V5/NS-otus'},
 'gutmock': {'full-16S': 'gutmock/gsrv_full-16S/gsrv_full-16S/NS-otus',
  'V1-V3': 'gutmock/gsrv_V1-V3/gsrv_V1-V3/NS-otus',
  'V3-V5': 'gutmock/gsrv_V3-V5/gsrv_V3-V5/NS-otus'}}

In [5]:
# Create the output directories
for regions in mock_dirs.values():
    for out_dir in regions.values():
        check_dir(out_dir)

## Mockrobiota

In [5]:
mockrobiota_src = "mockrobiota/src"
mockrobiota_files = {
    'seq': join(mockrobiota_src, "mockrobiota_seqs.fasta"),
    'taxa': join(mockrobiota_src, "mockrobiota_taxa.txt")
}

mock_dirs['mockrobiota'] = {}
for v_region in v_regions:
    # Output directory
    mock_dirs['mockrobiota'][v_region] = join(
        'mockrobiota', f'mockrobiota_{v_region}', 
        f'mockrobiota_{v_region}', 'NS-otus')
    check_dir(mock_dirs['mockrobiota'][v_region])

### cut the regions

In [6]:
v_primers = utils.get_vregion_primers()
v_length = utils.get_vregion_lengths()

In [7]:
%%time
%autoreload

out_seqs_file = {}

src_dirout = join(mockrobiota_src, 'extracted_regions')
check_dir(src_dirout)

for v_region in v_regions:
    if v_region == 'full-16S':
        continue
    
    out_fasta_pref = join(
        src_dirout, f'mockrobiota_{v_region}') 

    out_seqs_file[v_region] = new_approach.extract_region(
        seqs = mockrobiota_files['seq'], prefix_db = out_fasta_pref,
        f_primer = v_primers[v_region]['f'],
        r_primer = v_primers[v_region]['r'],
        min_length = v_length[v_region]['min'], 
        max_length=v_length[v_region]['max'], 
        threads=15, 
        force=False, output_formats = ['fasta'])

CPU times: user 1.06 s, sys: 250 ms, total: 1.31 s
Wall time: 2.85 s
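
To make the idea of cutting a region concrete, here is a minimal, pure-Python sketch of primer-based extraction. It is not the implementation of `new_approach.extract_region`, which is not shown in this notebook. The sketch assumes exact primer matches (no mismatch tolerance), expands IUPAC degenerate bases into regular expressions, trims the primers from the result, and uses made-up primer sequences; `extract_between_primers` is a hypothetical name.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    return "".join(IUPAC[base] for base in primer.upper())


def extract_between_primers(seq, f_primer, r_primer, min_length, max_length):
    """Return the region between both primers, or None if it is not found
    or its length falls outside [min_length, max_length]."""
    seq = seq.upper()
    fwd = re.search(primer_to_regex(f_primer), seq)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    downstream = seq[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    region = downstream[:rev.start()]
    return region if min_length <= len(region) <= max_length else None


# Toy sequence with made-up primers (not real 16S primers).
toy = "TTTACGTACGGGGCCCCAAGATTCCCC"
region = extract_between_primers(toy, "ACGTAY", "GGAATC", 5, 20)
```

The length filter mirrors the `min_length` and `max_length` arguments passed in the cell above: amplicons outside the expected range are discarded.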


Filter and save: keep only entries with both sequence and taxonomy information.

In [8]:
db_mockro = {}


for v_region in v_regions:
    
    if v_region != 'full-16S':
        fasta_in = out_seqs_file[v_region]['fasta']
    else:
        fasta_in = mockrobiota_files['seq']
    
    db_mockro[v_region] = utils.load_db_from_files(
        taxa_path = mockrobiota_files['taxa'], 
        seqs_path = fasta_in)

    taxa_out = join(
            src_dirout, f'mockrobiota_{v_region}_taxa.txt')
    seqs_out = join(
            src_dirout, f'mockrobiota_{v_region}_seqs.fasta')

    utils.download_db_from_dict(
        db_mockro[v_region],
        taxa_out_file = taxa_out,
        seqs_out_file = seqs_out 
    )
        
    db_name = f'mockrobiota_{v_region}'
    
    databases[db_name]= {
        'seq': seqs_out, 'taxa':taxa_out }

96 entries without sequence or taxa information were deleted
2 entries without sequence or taxa information were deleted
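
The filtering reported above can be pictured as an intersection of identifiers. This is a minimal sketch, not the code of `utils.load_db_from_files`: the function name `keep_complete_entries` and the record layout `{'taxa': ..., 'seq': ...}` are assumptions for illustration.

```python
def keep_complete_entries(taxa, seqs):
    """Keep IDs present in both mappings; report how many were dropped."""
    shared = taxa.keys() & seqs.keys()
    db = {ID: {"taxa": taxa[ID], "seq": seqs[ID]} for ID in sorted(shared)}
    n_dropped = len(taxa.keys() | seqs.keys()) - len(shared)
    return db, n_dropped


# Toy inputs: id2 has no sequence and id3 has no taxonomy.
taxa = {"id1": "k__Bacteria; p__Firmicutes",
        "id2": "k__Bacteria; p__Proteobacteria"}
seqs = {"id1": "ACGT", "id3": "GGCC"}
db, n_dropped = keep_complete_entries(taxa, seqs)
print(f"{n_dropped} entries without sequence or taxa information were deleted")
```

Entries can go missing on the sequence side when a region cannot be extracted, which may be why the regions differ in the number of deleted entries.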


## Obtain abundance table 
We will obtain the abundance table from previously analyzed data. Therefore, we will generate the classifier, perform the taxonomy assignment and obtain the taxa barplots.

In [9]:
# mock_names = ['vagimock', 'gutmock', 'mockrobiota']
mock_names = ['mockrobiota']

In [10]:
abundance_tabs = {}
force=True

# Copy files from vagimock and gutmock V4
for mockname in mock_names:
    
    abundance_tabs[mockname] = {}
    if mockname != 'mockrobiota':
        src_dir_abun = join(mockname, 'abundance_table')
    else:
        src_dir_abun = mockrobiota_src
    
    src_abun_file = join(src_dir_abun,'expected-taxonomy.tsv')
    
    for v_region in v_regions:
        dst_abun_dir = mock_dirs[mockname][v_region]
        dst_abun_file = join(dst_abun_dir,'expected-taxonomy.tsv')

        if not exists(dst_abun_file) or force:
            shutil.copyfile(src_abun_file, dst_abun_file )

        abundance_tabs[mockname][v_region] = pd.read_csv(dst_abun_file, sep='\t', header=0)     

The above abundance tables contain the expected abundances for each taxon.
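
One quick sanity check on such a table is to confirm that each sample's abundances add up to 1. The sketch below assumes the table stores relative abundances, has a `Taxon` column, and that every other column is a numeric sample column; `sample_totals` is a hypothetical helper and the TSV content is made up.

```python
import csv
import io
import math


def sample_totals(tsv_text, taxon_col="Taxon"):
    """Sum every sample column of a tab-separated abundance table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    totals = {}
    for row in reader:
        for col, value in row.items():
            if col == taxon_col:
                continue
            totals[col] = totals.get(col, 0.0) + float(value)
    return totals


# Made-up table: sampleA sums to 1, sampleB does not.
example_tsv = (
    "Taxon\tsampleA\tsampleB\n"
    "taxon_1\t0.25\t0.5\n"
    "taxon_2\t0.75\t0.25\n"
)
totals = sample_totals(example_tsv)
off = [s for s, t in totals.items() if not math.isclose(t, 1.0, abs_tol=1e-6)]
print("Samples not summing to 1:", off)
```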

## Mockrobiota environment based on feature tab

In [33]:
%autoreload
env_tab_dfs = {}
env_abun_dfs = {}
force=True

for mock, items in abundance_tabs.items():
    print(mock)
    env_tab_dfs[mock] = {}
    env_abun_dfs[mock] = {}
    
    for v_region, table in items.items():
        out_dir = mock_dirs[mock][v_region]
        dbname = f"mockrobiota_{v_region}"
        
        # output files

        env_tab_file = join(out_dir, 'env_tab.tsv')
        env_abun_file = join(out_dir, 'env_abundance.tsv')
        env_taxa_file = join(out_dir, 'env_taxa.txt')
        env_seqs_file = join(out_dir, 'env_seqs.fasta') 
        
        # inputs
        
        feat_tab =  pd.read_csv(
        join(mockrobiota_src, "expected-featuretab.tsv"), sep = "\t", 
        header = 0, index_col = 'ID')
        
        
        abun_tab = table
        
        if any(not exists(file) for file in [env_tab_file, env_abun_file, env_taxa_file, env_seqs_file]) or force:
            
            # ids to include:
            included_ids = list(feat_tab.index)
            
            # load env
            env = utils.load_db_from_files(seqs_path = databases[dbname]['seq'], 
                                   taxa_path = databases[dbname]['taxa'])
            
            # subset env
            env = {ID: record for ID, record in env.items() if ID in included_ids}
            
            # subset included ids to remove the ones not in env
            included_ids = [i for i in included_ids if i in env]
            
            # subset feature table
            env_feat_tab = feat_tab.loc[included_ids]
            
            # create new abundance tab
            env_abun_tab = pd.DataFrame(index=abun_tab['Taxon'])
            
            for sample in env_feat_tab.columns:
                # IDs present in this sample (presence/absence feature table)
                ids = set(env_feat_tab[env_feat_tab[sample] == 1].index)
                samp_taxonomies = [utils.join_taxa_lineages(record['taxa'])
                                   for ID, record in env.items() if ID in ids]
                # Count sequences per taxonomy; naming the column explicitly
                # avoids relying on the default column name of value_counts()
                abun = pd.Series(samp_taxonomies).value_counts().to_frame(name=sample)
                env_abun_tab = env_abun_tab.join(abun, how='outer').fillna(0)
                
            env_abun_tab = env_abun_tab.reset_index().rename(columns = {'index': 'Taxon'})
        
            env_feat_tab = env_feat_tab.reset_index().rename(columns = {'index': 'ID'})
            
            # save files
            utils.download_db_from_dict(env, taxa_out_file=env_taxa_file, seqs_out_file=env_seqs_file)
            print(f"Saved {env_taxa_file}")
            print(f"Saved {env_seqs_file}")


            # save feature table
            env_feat_tab.to_csv(env_tab_file, sep="\t", index=False)
            print(f"Saved {env_tab_file}")
    #         display(env_feat_tab)

            # save abundance table
            env_abun_tab.to_csv(env_abun_file, sep="\t", index=False)
            print(f"Saved {env_abun_file}")
    #         display(env_abun_tab)

        else:
            env_abun_tab = pd.read_csv(env_abun_file, sep="\t")
            env_feat_tab = pd.read_csv(env_tab_file, sep="\t")  
            
        env_tab_dfs[mock][v_region] = env_feat_tab
        env_abun_dfs[mock][v_region] = env_abun_tab


mockrobiota
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_tab.tsv
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_abundance.tsv
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_tab.tsv
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_abundance.tsv
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_tab.tsv
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_abundance.tsv
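
The core of the cell above is counting, for each sample, how many of the sequences marked as present share the same lineage. A dependency-free sketch of that counting step follows. The dictionary layouts and the name `taxon_counts_per_sample` are assumptions for illustration, and the lineages are toy values.

```python
from collections import Counter


def taxon_counts_per_sample(feature_table, taxonomy):
    """feature_table: {sample: {seq_id: 0 or 1}}; taxonomy: {seq_id: lineage}.
    Returns {sample: Counter mapping lineage to number of sequences}."""
    counts = {}
    for sample, presence in feature_table.items():
        counts[sample] = Counter(
            taxonomy[seq_id]
            for seq_id, present in presence.items()
            # skip absent sequences and IDs without taxonomy
            if present == 1 and seq_id in taxonomy
        )
    return counts


# Toy data: two sequences share one lineage.
taxonomy = {
    "s1": "g__Lactobacillus; s__crispatus",
    "s2": "g__Lactobacillus; s__crispatus",
    "s3": "g__Gardnerella; s__vaginalis",
}
feature_table = {
    "mock00": {"s1": 1, "s2": 1, "s3": 0},
    "mock01": {"s1": 0, "s2": 1, "s3": 1},
}
counts = taxon_counts_per_sample(feature_table, taxonomy)
```

A `Counter` returns 0 for lineages that are absent from a sample, which plays the role of `fillna(0)` in the pandas version.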


## Simulate environments / repseqs for custom mock-communities


In [12]:
abundance_tabs['mockrobiota']['full-16S'].head()

|   | Taxon | mockrobiota00 | mockrobiota01 | mockrobiota02 | mockrobiota03 | mockrobiota04 |
|---|---|---|---|---|---|---|
| 0 | k__Bacteria; p__Bacteroidetes; c__Bacteroidia;... | 0.0 | 0.027027 | 0.202899 | 0.022222 | 0.359551 |
| 1 | k__Bacteria; p__Proteobacteria; c__Gammaproteo... | 0.0 | 0.216216 | 0.115942 | 0.266667 | 0.044944 |
| 2 | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | 0.098765 | 0.108108 | 0.086957 | 0.0 | 0.134831 |
| 3 | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | 0.024691 | 0.216216 | 0.0 | 0.088889 | 0.134831 |
| 4 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac... | 0.222222 | 0.0 | 0.028986 | 0.155556 | 0.0 |


In [13]:
%autoreload
env_tab_dfs = {}
env_abun_dfs = {}

for mock, items in abundance_tabs.items():
    print(mock)
    env_tab_dfs[mock] = {}
    env_abun_dfs[mock] = {}
    
    for v_region, table in items.items():
        if mock != 'mockrobiota':
            db_name = f"gsrv_{v_region}"
        else:
            db_name = f"mockrobiota_{v_region}"
        
        print(mock_dirs[mock][v_region])
        env_abun, env_tab = mock_community.simulate_repseqs(
            abun_table = table,
            db = databases,
            db_name = db_name,
            out_dir = mock_dirs[mock][v_region],
            force = True
        )
        
        env_tab_dfs[mock][v_region] = env_tab
        env_abun_dfs[mock][v_region] = env_abun


mockrobiota
mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_tab.tsv
Saved mockrobiota/mockrobiota_full-16S/mockrobiota_full-16S/NS-otus/env_abundance.tsv
mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_tab.tsv
Saved mockrobiota/mockrobiota_V1-V3/mockrobiota_V1-V3/NS-otus/env_abundance.tsv
mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_taxa.txt
Saved mockrobiota/mockrobiota_V3-V5/mockrobiota_V3-V5/NS-otus/env_seqs.fasta
Saved mockrobiota/mockrobi

Take a look at the abundances obtained after creating the environment.

In [36]:
env_abun_dfs['vagimock']['full-16S'][env_abun_dfs['vagimock']['full-16S']['Taxon'].str.contains("crispatus")]

|   | Taxon | vagimock00 | vagimock01 | vagimock02 | vagimock03 | vagimock04 |
|---|---|---|---|---|---|---|
| 6 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Lac... | 0.0 | 126.0 | 1246.0 | 4470.0 | 6121.0 |
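
The environment table above holds counts, whereas the expected table holds fractions, so comparing the two requires normalising each sample. A minimal sketch of that normalisation, with made-up counts and a hypothetical helper name `to_relative`:

```python
def to_relative(counts_by_sample):
    """{sample: {taxon: count}} -> {sample: {taxon: fraction of sample total}}."""
    relative = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        relative[sample] = {
            taxon: (count / total if total else 0.0)  # guard empty samples
            for taxon, count in counts.items()
        }
    return relative


# Made-up counts for illustration only.
example_counts = {
    "sample00": {"taxon_a": 0, "taxon_b": 0},
    "sample01": {"taxon_a": 30, "taxon_b": 10},
}
fractions = to_relative(example_counts)
```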
