### Jellyfish microbiome analysis

#### Sample background

In [1]:
# Essential objects
import os
import tempfile

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import qiime2
import seaborn as sns
from Bio import SeqIO
from qiime2 import Artifact, Metadata, sdk
from qiime2.plugins.dada2.methods import denoise_pyro # The samples were obtained through pyrosequencing
import qiime2.plugins.metadata.actions as metadata_actions
import qiime2.plugins.feature_classifier.actions as feature_classifier_actions
from qiime2.plugins.feature_table.visualizers import tabulate_seqs
import qiime2.plugins.feature_table.actions as fta

matplotlib.use('module://ipykernel.pylab.backend_inline')
pm = sdk.PluginManager()

def see(artifact):
    '''Return the set of formats and types the artifact's format can be transformed into'''
    from_format = artifact.format
    if issubclass(from_format, sdk.plugin_manager.SingleFileDirectoryFormatBase):
        from_format = artifact.format.file.format
    return set(pm.transformers[from_format].keys())

def v2frame(viz_fp: str) -> list:
    '''Export the visualization at viz_fp and return its tables as a list of DataFrames'''
    viz = qiime2.Visualization.load(viz_fp)
    with tempfile.TemporaryDirectory() as tmpdir:
        viz.export_data(tmpdir)
        fp = os.path.join(tmpdir, 'quality-plot.html')
        ov = os.path.join(tmpdir, 'overview.html')
        # Parse every HTML table on both pages before the temp directory is removed
        dfs = pd.read_html(fp, index_col=0)
        df2s = pd.read_html(ov, index_col=0)
    return dfs + df2s

#### Importing the data
The FASTQ files contain single-end reads generated by pyrosequencing on a 454 GS FLX+ instrument.

```{bash}
qiime tools import \
    --type 'SampleData[SequencesWithQuality]' \
    --input-path data.tsv \
    --output-path jelly.qza \
    --input-format SingleEndFastqManifestPhred33V2
```
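
The `data.tsv` file passed to `--input-path` is a manifest. For `SingleEndFastqManifestPhred33V2`, QIIME 2 expects a tab-separated file with a `sample-id` column followed by an `absolute-filepath` column. Below is a minimal sketch of generating such a file, assuming (hypothetically) that every sample sits in its own `<sample-id>.fastq` file in a single directory. The `write_manifest` helper and the directory layout are illustrative and not part of the original workflow.

```python
import csv
import tempfile
from pathlib import Path


def write_manifest(fastq_dir: str, manifest_fp: str) -> int:
    """Write a single-end manifest; return the number of samples listed."""
    fastqs = sorted(Path(fastq_dir).glob("*.fastq"))
    with open(manifest_fp, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id", "absolute-filepath"])
        for fq in fastqs:
            # Assumption: the file stem doubles as the sample ID.
            writer.writerow([fq.stem, str(fq.resolve())])
    return len(fastqs)


# Demo on throwaway files standing in for the real reads
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample-1", "sample-2"):
        (Path(tmp) / f"{name}.fastq").write_text("@r1\nACGT\n+\nIIII\n")
    manifest_fp = Path(tmp) / "data.tsv"
    n_samples = write_manifest(tmp, str(manifest_fp))
    manifest_lines = manifest_fp.read_text().splitlines()
```

The paths are resolved to absolute paths because the manifest format requires them.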

In [2]:
raw_data = Artifact.load('artifacts/jelly.qza')

### Raw data exploration
The initial data exploration was performed with `fastqc`. Read length varies widely, though the majority of reads fall in the 450-500 bp range. As expected for short-read sequencing, read quality drops toward the end of longer reads, so trimming is needed. Surprisingly, the samples have little adapter content.
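
The read-length spread described above can also be checked without `fastqc`. The following is a rough sketch that bins read lengths from FASTQ text, assuming plain four-line records (no wrapped sequences). The helper names and the 50 bp bin width are illustrative choices, not part of the original analysis.

```python
from collections import Counter
from typing import Iterable, Iterator


def fastq_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each read in four-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.strip())


def length_histogram(lengths: Iterable[int], bin_width: int = 50) -> Counter:
    """Count reads per length bin, keyed by the lower edge of the bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Toy records standing in for a real file
toy_fastq = [
    "@r1", "A" * 470, "+", "I" * 470,
    "@r2", "C" * 495, "+", "I" * 495,
    "@r3", "G" * 120, "+", "I" * 120,
]
hist = length_histogram(fastq_lengths(toy_fastq))
```

For a real file, an open file handle can be passed in place of `toy_fastq`, since the function only iterates over lines.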

#### Blasting overrepresented sequences
To get a rough idea of which microorganisms were most abundant in the samples, I used the output of `fastqc`'s overrepresented sequences module.
The sequences listed by the module were pooled, clustered with `cd-hit` to remove redundant reads, then searched with BLAST against the custom 16S rRNA database.
- `cd-hit` grouped the 238 overrepresented sequences from the 7 samples into only 17 clusters.
- The best 3 hits for each of the 3 most overrepresented sequences were:
    1. Entoplasma, Mesoplasma, Lebetimonas
    2. *Ferruginivarius sediminum*, Azospirillum, Desulfosporosinus
    3. Flavobacteria, Dokdonia, Joostella
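
The pooling step can be sketched as follows. This assumes the overrepresented sequences are read from each sample's `fastqc_data.txt`, where modules open with a `>>` line and close with `>>END_MODULE`. The function names and FASTA header scheme are hypothetical, and only exact duplicates are dropped here; grouping near-identical sequences is left to `cd-hit`.

```python
from typing import Dict, Iterable, List


def overrepresented(lines: Iterable[str]) -> List[str]:
    """Return sequences listed in the 'Overrepresented sequences' module."""
    seqs, inside = [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>Overrepresented sequences"):
            inside = True
        elif line.startswith(">>END_MODULE"):
            inside = False
        elif inside and line and not line.startswith("#"):
            # The first tab-separated field is the sequence itself
            seqs.append(line.split("\t")[0])
    return seqs


def to_fasta(seqs_by_sample: Dict[str, List[str]]) -> str:
    """Pool sequences across samples, dropping exact duplicates."""
    seen, records = set(), []
    for sample, seqs in seqs_by_sample.items():
        for i, seq in enumerate(seqs, start=1):
            if seq not in seen:
                seen.add(seq)
                records.append(f">{sample}_overrep_{i}\n{seq}")
    return "\n".join(records) + "\n"


# Toy report standing in for a real fastqc_data.txt
toy_report = [
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    ">>END_MODULE",
    ">>Overrepresented sequences\tfail",
    "#Sequence\tCount\tPercentage\tPossible Source",
    "ACGTACGT\t120\t1.5\tNo Hit",
    "TTGACCGA\t80\t1.0\tNo Hit",
    ">>END_MODULE",
]
pooled_fasta = to_fasta({
    "sample-1": overrepresented(toy_report),
    "sample-2": ["ACGTACGT"],  # duplicate of a sample-1 sequence
})
```

The resulting FASTA text could then be written to disk and handed to `cd-hit` and BLAST.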



### Quality control and clustering
The `dada2` plugin denoises the reads and infers representative sequences (amplicon sequence variants), together with a table of their per-sample frequencies.

In [3]:
denoised_Ftable, denoised_Seqs, denoised_stats = denoise_pyro(raw_data, trunc_len=600) # Truncate reads at 600 bp; shorter reads are discarded
denoised_Ftable.save('artifacts/denoised.qza')
denoised_Seqs.save('artifacts/denoised_seqs.qza')
stats_viz, = metadata_actions.tabulate(input=denoised_stats.view(Metadata))
stats_viz.save('vis/denoised.qzv')

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/qiime2/sc31/data/54baf4e3-b7ca-4f9c-a76c-8897cbea851c/data --output_path /tmp/tmplrl4aal9/output.tsv.biom --output_track /tmp/tmplrl4aal9/track.tsv --filtered_directory /tmp/tmplrl4aal9 --truncation_length 600 --trim_left 0 --max_expected_errors 2.0 --truncation_quality_score 2 --max_length Inf --pooling_method independent --chimera_method consensus --min_parental_fold 1.0 --allow_one_off False --num_threads 1 --learn_min_reads 250000 --homopolymer_gap_penalty 1 --band_size 32

R version 4.2.2 (2022-10-31) 


Loading required package: Rcpp


DADA2: 1.26.0 / Rcpp: 1.0.10 / RcppParallel: 5.1.6 
2) Filtering .......
3) Learning Error Rates
11095200 total bases in 18492 reads from 7 samples will be used for learning the error rates.
4) Denoise samples 
.......
5) Remove chimeras (method = consensus)
6) Report read numbers through the pipeline
7) Write output


'vis/denoised.qzv'
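
With `trunc_len=600`, reads are cut at position 600 and any read shorter than 600 bp is discarded. The log above is consistent with this: 11,095,200 bases over 18,492 reads is exactly 600 bases per read. Reads in the 450-500 bp range would therefore not pass this length filter. The sketch below only illustrates that length rule on toy strings; `apply_trunc_len` is a hypothetical helper, not DADA2's implementation, and it ignores the quality-based filtering DADA2 also applies.

```python
from typing import List


def apply_trunc_len(reads: List[str], trunc_len: int) -> List[str]:
    """Mimic the length rule: drop short reads, cut the rest to trunc_len."""
    if trunc_len == 0:  # 0 disables truncation and length filtering
        return list(reads)
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]


toy_reads = ["A" * 480, "C" * 600, "G" * 650]
kept = apply_trunc_len(toy_reads, trunc_len=600)
kept_lengths = [len(read) for read in kept]

# Sanity check against the numbers reported in the DADA2 log
bases_per_read = 11095200 / 18492
```

The number of reads surviving each step can be checked in the denoising statistics saved to `vis/denoised.qzv`.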

In [4]:
# Reload the previously saved artifacts
denoised_Seqs = Artifact.load('artifacts/denoised_seqs.qza')
print(denoised_Seqs.type)
denoised_Ftable = Artifact.load('artifacts/denoised.qza')
print(denoised_Ftable.type)
denoised_FtableDF = denoised_Ftable.view(pd.DataFrame).T
denoised_FtableDF.index.names = ['Feature ID']
denoised_FtableMD = denoised_Ftable.view(Metadata)
seqs_viz, = fta.tabulate_seqs(data=denoised_Seqs) # Tabulate the representative sequences
denoised_SeqsDF = denoised_Seqs.view(Metadata).to_dataframe()
seqs_viz.save('vis/sequences.qzv')

FeatureData[Sequence]
FeatureTable[Frequency]


'vis/sequences.qzv'

#### Pairing sequences with their frequencies in Pandas

In [5]:
# display(denoised_FtableDF)
freq_seq = denoised_FtableDF.merge(denoised_SeqsDF, on='Feature ID')
freq_seq['Feature ID'] = freq_seq.index
freq_seq = freq_seq.reset_index(drop=True)
display(freq_seq)

| | sample-1 | sample-2 | sample-3 | sample-4 | sample-5 | sample-6 | sample-7 | Sequence | Feature ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 92.0 | 0.0 | 0.0 | 5247.0 | 1983.0 | CACTCTTGCGAGCATACTACTCAGGCGGAGTACTTAACGCGTTAGC... | 8331adf13c184313c464be9c7b21097c |
| 1 | 4268.0 | 2209.0 | 325.0 | 392.0 | 0.0 | 0.0 | 0.0 | CACTCTTGCGAGCATACTACTCAGGCGGAGTACTTAACGCGTTAGC... | b37b36ef2cd58aa19929a8ae4f8f4c21 |
| 2 | 671.0 | 277.0 | 97.0 | 30.0 | 0.0 | 387.0 | 42.0 | CATTCTTGCGAACGTACTCCCCAGGTGGGATACTTATCACTTTCGC... | b5f4256d637420120c2e71bc85666bfd |
| 3 | 94.0 | 0.0 | 298.0 | 284.0 | 0.0 | 0.0 | 6.0 | TAACCTTGCGGCCGTACTCCCCAGGCGGTGTGCTTAATGCGTTAGC... | 00aabc54895675bbc65b0c1fdaf1fd07 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 681.0 | 0.0 | 0.0 | CACTCTTGCGAGCATACTACTCAGGCGGAGTACTTAACGCGTTAGC... | 3c5ee7863e02dd5604419d803a014014 |
| 5 | 235.0 | 0.0 | 0.0 | 0.0 | 0.0 | 121.0 | 0.0 | CACACTTGCGTGCGTACTCCCCAGGCGGAACACTTAACGCGTTGGC... | fe008f8f1f5a1776177e14bae97e10e2 |
| 6 | 188.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | TAATCTTGCGACCGTACTCCCCAGGCGGAATGCTTAATCCGTTAGG... | 3a4c4c7d2b8e5a489eb2d1fddc95e025 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 163.0 | 0.0 | TAACCTTGCGGCCGTACTCCCCAGGCGGTGTGCTTAATGCGTTAGC... | 0c2db32fc52f027a04e433a93c0669df |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 76.0 | 0.0 | 0.0 | TAACCTTGCGGCCGTACTCCCCAGGCGGTGTGCTTAATGCGTTAGC... | 5f460fa8105db752feb6166a52b135d2 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 45.0 | 0.0 | TAGTCTTGCGACCGTAGTCCCCAGGCGGAGTGCTTAACGCGTTAGC... | 6d4121b0c9e7eda284b4ab5bf571dc86 |
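
The pairing relies on both frames sharing an index named `Feature ID`. Below is a toy sketch of the same idea written as an explicit index-on-index inner join. The feature IDs, counts and sequences are made up for illustration, and `join` is swapped in here for the `merge` call used in the notebook.

```python
import pandas as pd

# Toy frequency table: features as rows, samples as columns
counts = pd.DataFrame(
    {"sample-1": [10.0, 0.0, 3.0], "sample-2": [0.0, 5.0, 1.0]},
    index=pd.Index(["f1", "f2", "f3"], name="Feature ID"),
)
# Toy sequence table covering only two of the three features
seqs = pd.DataFrame(
    {"Sequence": ["ACGT", "TTGA"]},
    index=pd.Index(["f1", "f3"], name="Feature ID"),
)

# Inner join on the shared index keeps only features present in both frames
paired = (
    counts.join(seqs, how="inner")
    .rename_axis("Feature ID")
    .reset_index()
)
```

Here `Feature ID` ends up as the first column rather than the last, so any positional column slicing would need adjusting.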


In [6]:
sns.heatmap(freq_seq.iloc[:, :-2])
plt.show()

<Figure size 640x480 with 2 Axes>

### Taxonomic analyses

In [9]:
# Import combined databases 
ids = Artifact.load('../data/Ids.qza')
seqs = Artifact.load('../data/all_fasta.qza')

#### Alignment-based classification

In [None]:
from qiime2.plugins.feature_classifier.pipelines import classify_consensus_blast



#### Machine learning

## Data sources
- https://www.ebi.ac.uk/ena/browser/view/PRJEB8518
- https://docs.qiime2.org/2023.5/data-resources/ (high-quality reference OTUs)
- The `see` and `v2frame` functions were obtained from [this QIIME 2 forum post](https://forum.qiime2.org/t/how-to-capture-a-value-from-a-summary-and-pipe-it/19783/5) by user thermokarst