# ZymoBIOMICS Microbial Community Standards Sample Composition

The eight bacterial genomes and four plasmids of the ZymoBIOMICS Microbial Community Standards were used as the reference. The reference contains tripled complete sequences for the following species:

- Bacillus subtilis
- Enterococcus faecalis
- Escherichia coli
  - Escherichia coli plasmid
- Lactobacillus fermentum
- Listeria monocytogenes
- Pseudomonas aeruginosa
- Salmonella enterica
- Staphylococcus aureus
  - Staphylococcus aureus plasmid 1
  - Staphylococcus aureus plasmid 2
  - Staphylococcus aureus plasmid 3

The raw sequence data of the mock communities, one with an even and one with a logarithmic distribution of species, were also downloaded:

- ERR2984773
- ERR2935805 

A set of simulated samples was generated from the genomes in the ZymoBIOMICS standard with the InSilicoSeq sequence simulator (version 1.5.2), covering both even and logarithmic distributions, with and without an Illumina error model. For each distribution, the number of read pairs generated matches the number of read pairs in the real data. The following samples are available on Zenodo:

- ENN - Evenly distributed sample with no error model
- EHS - Evenly distributed sample with Illumina HiSeq error model
- LNN - Log distributed sample with no error model
- LHS - Log distributed sample with Illumina HiSeq error model
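
The naming scheme in the list above can be decoded programmatically. The following is a minimal sketch, assuming the first letter encodes the distribution and the last two letters encode the error model, as the four listed codes suggest. The helper `describe_sample` and its field names are hypothetical and not part of the original notebook.

```python
# Hypothetical lookup tables derived from the four sample codes listed above.
DISTRIBUTIONS = {"E": "even", "L": "log"}
ERROR_MODELS = {"NN": "none", "HS": "Illumina HiSeq"}


def describe_sample(code):
    """Split a three-letter sample code into its distribution and error model."""
    if len(code) != 3 or code[0] not in DISTRIBUTIONS or code[1:] not in ERROR_MODELS:
        raise ValueError(f"unrecognised sample code: {code!r}")
    return {
        "distribution": DISTRIBUTIONS[code[0]],
        "error_model": ERROR_MODELS[code[1:]],
    }


for code in ("ENN", "EHS", "LNN", "LHS"):
    print(code, describe_sample(code))
```

Accession-style names such as the ERR identifiers do not follow this scheme, so the helper rejects them rather than guessing.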


## Imports

In [7]:
import sys
from plotly.offline import plot
import glob
import fnmatch
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import json
import pandas as pd
from itertools import groupby
import csv
import numpy as np
import os

In [112]:
SPECIES = ['Bacillus subtilis', 'Enterococcus faecalis', 'Escherichia coli', 'Lactobacillus fermentum', 'Listeria monocytogenes',
           'Pseudomonas aeruginosa', 'Salmonella enterica', 'Staphylococcus aureus']
COLOURS = ['#009392', '#39B185', '#9CCB86', '#E9E29C', '#EEB479', '#E88471', '#CF597E', 'lightgray', 'darkgray', '#004B93']

## Load Data

In [113]:
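kraken_report_files = glob.glob('../Kraken/*.kraken_report')

# Percentage of reads per sample for each expected species, plus
# 'unclassified' and 'other' (everything not assigned to a listed species).
kraken_data = {}
for report in kraken_report_files:
    sample = os.path.splitext(os.path.basename(report))[0]
    kraken_data[sample] = {}
    with open(report) as fh:
        for line in fh:
            # Standard Kraken report columns:
            # percentage, clade reads, taxon reads, rank code, taxid, name
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            name = ' '.join(fields[5:])
            if 'unclassified' in fields[5]:
                kraken_data[sample]['unclassified'] = float(fields[0])
            elif fields[3] == 'S' and name in SPECIES:
                kraken_data[sample][name] = float(fields[0])

    # Whatever remains of the 100% was assigned to taxa outside SPECIES.
    kraken_data[sample]['other'] = 100 - sum(kraken_data[sample].values())

print(kraken_data)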

{'ENN': {'unclassified': 0.0, 'Staphylococcus aureus': 9.31, 'Bacillus subtilis': 0.05, 'Listeria monocytogenes': 7.71, 'Enterococcus faecalis': 8.09, 'Escherichia coli': 6.95, 'Salmonella enterica': 3.14, 'Pseudomonas aeruginosa': 2.81, 'other': 61.94}, 'EHS': {'unclassified': 0.0, 'Staphylococcus aureus': 9.52, 'Bacillus subtilis': 0.06, 'Listeria monocytogenes': 7.71, 'Enterococcus faecalis': 8.09, 'Escherichia coli': 7.02, 'Salmonella enterica': 3.16, 'Pseudomonas aeruginosa': 2.82, 'other': 61.62}, 'LHS': {'unclassified': 0.0, 'Escherichia coli': 23.49, 'Salmonella enterica': 1.49, 'Pseudomonas aeruginosa': 3.17, 'Staphylococcus aureus': 10.24, 'Bacillus subtilis': 0.04, 'Listeria monocytogenes': 1.8, 'Enterococcus faecalis': 8.23, 'other': 51.540000000000006}, 'ERR2984773': {'unclassified': 0.0, 'Bacillus subtilis': 0.09, 'Listeria monocytogenes': 9.9, 'Staphylococcus aureus': 2.77, 'Enterococcus faecalis': 11.03, 'Salmonella enterica': 4.23, 'Escherichia coli': 1.87, 'Pseudomona
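
To make the parsing step above easier to follow, here is a minimal, self-contained sketch of the same logic applied to a handful of report lines. It assumes the standard six-column, tab-separated Kraken report layout (percentage, clade reads, taxon reads, rank code, taxid, indented name). The functions `parse_report_line` and `summarise`, the rounding of `other`, and the example lines are illustrative assumptions, not taken from the notebook or the real reports.

```python
def parse_report_line(line):
    """Parse one tab-separated report line into the fields used here."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    return {
        "percent": float(fields[0]),  # float() tolerates the leading padding
        "rank": fields[3].strip(),
        "name": fields[5].strip(),    # names are indented by taxonomic depth
    }


def summarise(lines, species):
    """Collect species-level percentages, unclassified reads and the remainder."""
    result = {}
    for line in lines:
        if not line.strip():
            continue
        row = parse_report_line(line)
        if row["name"] == "unclassified":
            result["unclassified"] = row["percent"]
        elif row["rank"] == "S" and row["name"] in species:
            result[row["name"]] = row["percent"]
    # Rounding is an addition in this sketch to hide floating-point noise.
    result["other"] = round(100 - sum(result.values()), 2)
    return result


# Made-up report lines for illustration only.
example = [
    "  5.00\t50\t50\tU\t0\tunclassified",
    " 95.00\t950\t0\tR\t1\troot",
    " 40.00\t400\t400\tS\t562\t        Escherichia coli",
    " 30.00\t300\t300\tS\t1280\t        Staphylococcus aureus",
]
print(summarise(example, {"Escherichia coli", "Staphylococcus aureus"}))
```

Splitting on tabs rather than on arbitrary whitespace keeps multi-word species names intact, so the name does not need to be rejoined afterwards.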

## Plot Data

In [114]:
fig_kraken = make_subplots(rows=3, cols=2, shared_xaxes=True, x_title="Species composition", 
                            shared_yaxes=True, y_title='PLS',
                            subplot_titles=('LNN', 'ENN', 'LHS', 'EHS', 'ERR2935805', 'ERR2984773'))

In [115]:
fig_kraken = go.Figure()
x = ['LNN', 'ENN', 'LHS', 'EHS', 'ERR2935805', 'ERR2984773']

# Collect every category seen in any sample, then build one value per sample
# (0 when a category is absent) so each trace stays aligned with x.
references = sorted({reference for sample in x for reference in kraken_data[sample]})
ydict = {reference: [kraken_data[sample].get(reference, 0) for sample in x]
         for reference in references}

# One stacked bar trace per category, coloured in sorted order.
for i, reference in enumerate(references):
    fig_kraken.add_trace(go.Bar(x=x, y=ydict[reference], name=reference, marker_color=COLOURS[i]))

fig_kraken.update_layout(plot_bgcolor='rgb(255,255,255)', title_text="Taxonomic composition of the ZymoBIOMICS microbial community standards samples")
# Group the bars by distribution: even samples first, then log samples.
fig_kraken.update_layout(barmode='stack', xaxis={'categoryorder':'array', 'categoryarray':['ENN', 'EHS', 'ERR2984773','LNN', 'LHS','ERR2935805']})
fig_kraken.show()
plot(fig_kraken, filename='Plots/Taxonomic composition.html', auto_open=False)

'Plots/Taxonomic composition.html'

In [108]:
# Expected species that were not reported in any sample
set(SPECIES) - ydict.keys()

{'Lactobacillus fermentum'}
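
The check above reports species that are missing across all samples combined. A per-sample variant can be sketched as follows. The helper `missing_species` and the toy data are hypothetical and serve only to illustrate the set difference. The real `kraken_data` dictionary has the same nested shape.

```python
def missing_species(data, species):
    """Return, for each sample, the expected species absent from its results."""
    expected = set(species)
    return {
        sample: sorted(expected - set(found))
        for sample, found in data.items()
    }


# Made-up percentages for illustration only.
toy_data = {
    "sample_a": {"Escherichia coli": 60.0, "other": 40.0},
    "sample_b": {"Escherichia coli": 50.0, "Bacillus subtilis": 10.0, "other": 40.0},
}
print(missing_species(toy_data, ["Bacillus subtilis", "Escherichia coli"]))
```

Reporting the result per sample helps distinguish a species that is absent everywhere, as in the output above, from one that drops out of a single sample.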