<a id='setup'></a>

## 0. Setup

Import the required packages and set the data and database directories.

In [1]:
# Import packages
import numpy as np
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import qiime2 as q2

data_dir = "data/"
database_dir = "database"

<a id='Input'></a>

### 0.1 Input

Inspection of the input dataset and loading of metadata.

In [2]:
! qiime tools peek $data_dir/sequences_demux_paired.qza

UUID:        b5fec962-ca06-4df5-b043-3aa289e4d753
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [3]:
# Visualize the input data first
! qiime demux summarize \
    --i-data $data_dir/sequences_demux_paired.qza \
    --o-visualization $data_dir/sequences_demux_paired.qzv

^C

Aborted!

In [4]:
Visualization.load(f'{data_dir}/sequences_demux_paired.qzv')

**Summary of input data:** \
Read quality drops at different positions in the forward and reverse reads, so I would use the `denoise-paired` command and truncate each read direction separately at the position where its quality falls below a Phred score of 20.
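
The truncation rule described above can be sketched in a few lines. This is only an illustration: the helper `pick_trunc_len` and the per-position median scores are hypothetical and are not taken from the dataset or from the QIIME 2 API; in practice the medians would be read off the interactive quality plot.

```python
from typing import Sequence


def pick_trunc_len(median_quality: Sequence[float], min_q: float = 20.0) -> int:
    """Return how many leading positions have a median quality >= min_q."""
    for pos, q in enumerate(median_quality):
        if q < min_q:
            # Truncate just before the first position that falls below min_q.
            return pos
    return len(median_quality)


# Hypothetical per-position median Phred scores (illustrative values only).
forward_medians = [38, 38, 37, 36, 30, 25, 19, 18]
reverse_medians = [37, 35, 28, 19, 17, 15, 12, 10]

trunc_len_f = pick_trunc_len(forward_medians)
trunc_len_r = pick_trunc_len(reverse_medians)
print(trunc_len_f, trunc_len_r)
```

Handling the two directions separately matters because reverse reads usually degrade earlier than forward reads, so a single shared cutoff would either waste good forward bases or keep poor reverse bases.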

In [5]:
# Parse the TSV file into a DataFrame, using the first column as the index
metadata_df = pd.read_csv(f'{data_dir}/metadata.tsv', sep='\t', index_col=0)
# Grab 5 random samples
metadata_df.sample(n=5)

| id | Library Layout | Instrument | collection_date | geo_location_name | geo_latitude | geo_longitude | host_id | age_days | weight_kg | length_cm | ... | birth_length_cm | sex | delivery_mode | zygosity | race | ethnicity | delivery_preterm | diet_milk | diet_weaning | age_months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERR1309865 | PAIRED | Illumina MiSeq | 2012-01-18 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 23.2 | 625.0 |  |  | ... | 50.0 | male | Cesarean_emergency | Dizygotic | Caucasian | Not Hispanic | True |  |  | 21.0 |
| ERR1309905 | PAIRED | Illumina MiSeq | 2010-07-29 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 12.2 | 132.0 |  |  | ... | 48.0 | female | Vaginal | Monozygotic | Caucasian | Not Hispanic | False | fd | False | 4.0 |
| ERR1311639 | PAIRED | Illumina MiSeq | 2011-10-05 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 25.2 | 511.0 | 12.247 |  | ... | 48.0 | female | Cesarean_emergency | Monozygotic | Caucasian | Hispanic | False |  |  | 17.0 |
| ERR1314204 | PAIRED | Illumina MiSeq | 2011-06-12 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 30.2 | 326.0 |  |  | ... | 41.0 | female | Cesarean | Monozygotic | Caucasian | Not Hispanic | True | fd | True | 11.0 |
| ERR1309719 | PAIRED | Illumina MiSeq | 2012-02-14 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 10.1 | 715.0 |  |  | ... | 47.0 | female | Cesarean | Dizygotic | Caucasian | Not Hispanic | True |  |  | 23.0 |


<a id='denoising'></a>

## 1. Denoising and generation of ASVs

1. Truncation and denoising of the data.
2. Generation of the feature table.
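
Before denoising paired-end reads, it can help to check that the truncated forward and reverse reads still overlap enough to be merged. The sketch below uses the truncation lengths passed to `denoise-paired` in this notebook (223 and 165); the amplicon length of 253 nt is a hypothetical value chosen only for illustration, and the 12 nt threshold reflects DADA2's default minimum overlap for merging.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 12    # DADA2's default minimum overlap when merging read pairs
AMPLICON_LEN = 253  # hypothetical amplicon length, NOT taken from this dataset

overlap = expected_overlap(223, 165, AMPLICON_LEN)
print(f"expected overlap: {overlap} nt, sufficient: {overlap >= MIN_OVERLAP}")
```

If the expected overlap came out below the minimum, most read pairs would fail to merge and the truncation lengths would need to be relaxed.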

In [6]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/sequences_demux_paired.qza \
    --p-trunc-len-f 223 \
    --p-trunc-len-r 165 \
    --p-n-threads 3 \
    --o-table $data_dir/PJNB_dada2_table_.qza \
    --o-representative-sequences $data_dir/PJNB_dada2_rep_set.qza \
    --o-denoising-stats $data_dir/PJNB_dada2_stats.qza

Saved FeatureTable[Frequency] to: data//PJNB_dada2_table_.qza
Saved FeatureData[Sequence] to: data//PJNB_dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: data//PJNB_dada2_stats.qza

In [7]:
# Denoising statistics
! qiime metadata tabulate \
    --m-input-file $data_dir/PJNB_dada2_stats.qza \
    --o-visualization $data_dir/PJNB_dada2_stats.qzv

Saved Visualization to: data//PJNB_dada2_stats.qzv

In [8]:
Visualization.load(f'{data_dir}/PJNB_dada2_stats.qzv')
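
To judge the denoising run, one useful summary is the percentage of input reads that survive each step. The following is a minimal sketch on a toy table: the sample names and counts are invented, and the column names (`input`, `filtered`, `denoised`, `merged`, `non-chimeric`) are assumed to match those of the DADA2 stats table.

```python
import pandas as pd

# Toy stand-in for the DADA2 stats table (invented counts, assumed column names).
stats = pd.DataFrame(
    {
        "input": [10000, 8000],
        "filtered": [9000, 6000],
        "denoised": [8800, 5900],
        "merged": [8000, 5000],
        "non-chimeric": [7500, 4000],
    },
    index=["sample_a", "sample_b"],
)

# Divide every column by the per-sample input count to get percent retained.
retained = stats.div(stats["input"], axis=0) * 100
print(retained.round(1))
```

A large drop at the `merged` step would point to truncation lengths that leave too little overlap, while a large drop at `filtered` would suggest overly strict quality filtering.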

In [9]:
# Feature table visualization
! qiime feature-table summarize \
    --i-table $data_dir/PJNB_dada2_table_.qza \
    --m-sample-metadata-file $data_dir/metadata.tsv \
    --o-visualization $data_dir/PJNB_dada2_table.qzv

Saved Visualization to: data//PJNB_dada2_table.qzv

In [10]:
Visualization.load(f'{data_dir}/PJNB_dada2_table.qzv')

<a id='taxonomy'></a>

## 2. Taxonomy assignment

<a id='setup'></a>

### 2.1 Database loading and preparation

In [11]:
! qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences $database_dir/silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy $database_dir/silva-138-ssu-nr99-tax.qza

Usage: qiime rescript get-silva-data [OPTIONS]

  Download, parse, and import SILVA database files, given a version number
  and reference target. Downloads data directly from SILVA, parses the
  taxonomy files, and outputs ready-to-use sequence and taxonomy artifacts.
  REQUIRES STABLE INTERNET CONNECTION. NOTE: THIS ACTION ACQUIRES DATA FROM
  THE SILVA DATABASE. SEE https://www.arb-silva.de/silva-license-
  information/ FOR MORE INFORMATION and be aware that earlier versions may
  be released under a different license.

[1mParameters[0m:
  [94m--p-version[0m VALUE [32mStr % Choices('128', '132')¹ | Str % Choices('138')² |[0m
    [32mStr % Choices('138.1')³[0m
                       SILVA database version to download.  [35m[default: '138.1'][0m
  [94m--p-target[0m VALUE [32mStr % Choices('SSURef_NR99', 'SSURef', 'LSURef')¹ | Str[0m
    [32m% Choices('SSURef_NR99', 'SSURef')² | Str % Choices('SSURef_NR99',[0m
    [32m'SSURef', 'LSURef_NR99', 'LSURef')³[0m
  

In [12]:
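# NOTE: the "-cleaned" sequence artifact used as input here is not produced by any cell shown in this notebook.
! qiime rescript filter-seqs-length-by-taxon \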
    --i-sequences $database_dir/silva-138-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy $database_dir/silva-138-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs $database_dir/silva-138-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs $database_dir/silva-138-ssu-nr99-seqs-discard.qza

Usage: qiime rescript filter-seqs-length-by-taxon [OPTIONS]

  Filter sequences by length. Can filter both globally by minimum and/or
  maximum length, and set individual threshold for individual taxonomic
  groups (using the "labels" option). Note that filtering can be performed
  for multiple taxonomic groups simultaneously, and nested taxonomic filters
  can be applied (e.g., to apply a more stringent filter for a particular
  genus, but a less stringent filter for other members of the kingdom). For
  global length-based filtering without conditional taxonomic filtering, see
  filter_seqs_length.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          Sequences to be filtered by length.       [required]
  --i-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Taxonomic classifications of sequences to be
                          filtered.
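
The per-taxon length filter described in the help text above can be illustrated with a simplified sketch. It is not RESCRIPt's implementation: the helper `filter_by_taxon_length`, the toy sequences, and the taxonomy strings are hypothetical, and nested labels are not handled (the first matching label wins). The labels and minimum lengths mirror the values passed in the command above.

```python
from typing import Dict, Tuple


def filter_by_taxon_length(
    seqs: Dict[str, str],
    taxonomy: Dict[str, str],
    min_lens: Dict[str, int],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split sequences into kept/discarded using a per-label minimum length."""
    kept: Dict[str, str] = {}
    discarded: Dict[str, str] = {}
    for seq_id, seq in seqs.items():
        # Find the first label that occurs in this sequence's taxonomy string.
        label = next((lab for lab in min_lens if lab in taxonomy[seq_id]), None)
        # Sequences matching no label are kept regardless of length.
        threshold = min_lens[label] if label is not None else 0
        if len(seq) >= threshold:
            kept[seq_id] = seq
        else:
            discarded[seq_id] = seq
    return kept, discarded


# Toy data (invented): lengths chosen to fall on either side of the thresholds.
seqs = {"a": "A" * 950, "b": "A" * 1000, "c": "A" * 1500}
taxonomy = {"a": "d__Archaea", "b": "d__Bacteria", "c": "d__Eukaryota"}
min_lens = {"Archaea": 900, "Bacteria": 1200, "Eukaryota": 1400}

kept, discarded = filter_by_taxon_length(seqs, taxonomy, min_lens)
print(sorted(kept), sorted(discarded))
```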

In [None]:
! qiime rescript dereplicate \
    --i-sequences $database_dir/silva-138-ssu-nr99-seqs-filt.qza  \
    --i-taxa $database_dir/silva-138-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $database_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa $database_dir/silva-138-ssu-nr99-tax-derep-uniq.qza
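
As a rough illustration of what dereplication in `uniq` mode is intended to do, the sketch below collapses records that share both an identical sequence and an identical taxonomy, while keeping identical sequences that carry different taxonomies. This reflects an assumed reading of the mode rather than RESCRIPt's actual code; the helper `dereplicate_uniq` and the toy records are hypothetical.

```python
from typing import Dict, Tuple


def dereplicate_uniq(
    seqs: Dict[str, str], taxonomy: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Keep one record per unique (sequence, taxonomy) pair."""
    first_seen: Dict[Tuple[str, str], str] = {}
    for seq_id, seq in seqs.items():
        # The first ID encountered for each pair becomes its representative.
        first_seen.setdefault((seq, taxonomy[seq_id]), seq_id)
    kept_ids = list(first_seen.values())
    return (
        {i: seqs[i] for i in kept_ids},
        {i: taxonomy[i] for i in kept_ids},
    )


# Toy records (invented): s1 and s2 are exact duplicates in sequence and taxonomy.
seqs = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGT", "s4": "GGCC"}
taxonomy = {"s1": "g__X", "s2": "g__X", "s3": "g__Y", "s4": "g__X"}

derep_seqs, derep_taxa = dereplicate_uniq(seqs, taxonomy)
print(sorted(derep_seqs))
```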