<a id='setup'></a>

## 0. Setup

Import the required packages and set the data directories.

In [1]:
# Import the required packages
import numpy as np
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import qiime2 as q2

data_dir = "data"
data_or = "../data"
database_dir = "database"

<a id='Input'></a>

### 0.1 Input

Inspection of the input dataset and loading of metadata.

In [2]:
! qiime tools peek $data_or/sequences_demux_paired.qza

UUID:        b5fec962-ca06-4df5-b043-3aa289e4d753
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [3]:
# Visualize the input data first
! qiime demux summarize \
    --i-data $data_or/sequences_demux_paired.qza \
    --o-visualization $data_dir/sequences_demux_paired.qzv

^C

Aborted!


In [4]:
Visualization.load(f'{data_or}/sequences_demux_paired.qzv')

**Brief summary of the paired-end sequences with quality scores:**
READS
* Lowest sequencing depth: 8165 reads
* Mean of 30012 reads per sample; the median is about the same
* Total number of reads: 50090402
* Total of 1669 samples, each with forward and reverse reads
* Median read length is about 230 nt for both forward and reverse reads, with most samples (96%) within +/- 10 nt of this median

QUALITY

* The quality score drops below 20 at different positions for forward and reverse reads, so we use the `denoise-paired` command and truncate forward and reverse reads separately at the position where their quality falls below a Phred score of 20
* Median quality score of 38 (Phred)
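The truncation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this notebook: the `truncation_length` helper and the toy quality profiles are hypothetical, and in practice the per-position median scores would be read off the interactive quality plot in the `.qzv` file.

```python
def truncation_length(median_quality, threshold=20):
    """Count leading positions whose median Phred score is at least `threshold`."""
    for position, score in enumerate(median_quality):
        if score < threshold:
            return position  # truncate just before the first low-quality position
    return len(median_quality)  # quality never drops: keep the full read


# Hypothetical toy profiles (one median score per read position), not real data.
forward_profile = [38, 38, 37, 35, 30, 24, 19, 15]
reverse_profile = [38, 36, 30, 22, 18, 12, 10, 8]

trunc_len_f = truncation_length(forward_profile)
trunc_len_r = truncation_length(reverse_profile)
print(trunc_len_f, trunc_len_r)
```

The two resulting lengths play the role of the `--p-trunc-len-f` and `--p-trunc-len-r` options of `qiime dada2 denoise-paired`.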


In [8]:
# Parse the TSV file into a DataFrame, using the sample ID column as the index
metadata_df = pd.read_csv(f'{data_or}/metadata.tsv', sep='\t', index_col=0)
# Show 5 random samples
metadata_df.sample(n=5)

| id | Library Layout | Instrument | collection_date | geo_location_name | geo_latitude | geo_longitude | host_id | age_days | weight_kg | length_cm | ... | birth_length_cm | sex | delivery_mode | zygosity | race | ethnicity | delivery_preterm | diet_milk | diet_weaning | age_months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERR1313908 | PAIRED | Illumina MiSeq | 2011-07-08 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 8.2 | 535.0 | | | ... | 53.0 | female | Vaginal | Dizygotic | Caucasian | Not Hispanic | False | | | 18.0 |
| ERR1310513 | PAIRED | Illumina MiSeq | 2012-05-05 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 18.2 | 765.0 | | | ... | 49.0 | female | Cesarean | Dizygotic | African-American | Not Hispanic | False | | | 25.0 |
| ERR1309957 | PAIRED | Illumina MiSeq | 2010-07-17 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 17.1 | 128.0 | 5.443 | 61.0 | ... | 45.0 | female | Vaginal | Dizygotic | Caucasian | Not Hispanic | True | bd | False | 4.0 |
| ERR1315090 | PAIRED | Illumina MiSeq | 2011-12-21 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 6.2 | 716.0 | | | ... | 52.0 | female | Vaginal | Unknown | Caucasian | Not Hispanic | False | | | 24.0 |
| ERR1315617 | PAIRED | Illumina MiSeq | 2011-12-28 00:00:00 | USA, Missouri, St. Louis | 38.63699 | -90.263794 | 47.1 | 199.0 | | | ... | 46.0 | male | Vaginal | Monozygotic | African-American | Not Hispanic | True | fd | True | 7.0 |


<a id='denoising'></a>

## 1. Denoising and generation of ASVs

1. Truncation and denoising of the data.
2. Generation of the feature table.

In [2]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_or/sequences_demux_paired.qza \
    --p-trunc-len-f 223 \
    --p-trunc-len-r 165 \
    --p-n-threads 3 \
    --o-table $data_dir/PJNB_dada2_table_.qza \
    --o-representative-sequences $data_dir/PJNB_dada2_rep_set.qza \
    --o-denoising-stats $data_dir/PJNB_dada2_stats.qza

Saved FeatureTable[Frequency] to: data/PJNB_dada2_table_.qza
Saved FeatureData[Sequence] to: data/PJNB_dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: data/PJNB_dada2_stats.qza

In [3]:
# Tabulate the denoising statistics
! qiime metadata tabulate \
    --m-input-file $data_dir/PJNB_dada2_stats.qza \
    --o-visualization $data_dir/PJNB_dada2_stats.qzv

Saved Visualization to: data/PJNB_dada2_stats.qzv

In [5]:
Visualization.load(f'{data_dir}/PJNB_dada2_stats.qzv')

**Feature table**

In [5]:
# Summarize the feature table
! qiime feature-table summarize \
    --i-table $data_dir/PJNB_dada2_table_.qza \
    --m-sample-metadata-file $data_or/metadata.tsv \
    --o-visualization $data_dir/PJNB_dada2_table.qzv

Saved Visualization to: data/PJNB_dada2_table.qzv

In [6]:
Visualization.load(f'{data_dir}/PJNB_dada2_table.qzv')

**Brief summary of the ASV sequences from DADA2**

_STATISTICS_
* The lowest percentage of input passed filter was 64.52%
* The lowest percentage of input merged was 63.64%
* The lowest percentage of input non-chimeric was 63.45%

_TABLE_
* Number of features: 5055
* Median frequency per sample: 26,767.0
* Mean frequency per sample: 26,973
* Median frequency per feature: 125.0
* Mean frequency per feature: 8,905
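As a rough plausibility check, the reported means can be related to each other, since the total frequency equals both (number of features × mean per feature) and (number of samples × mean per sample). The sketch below uses only the rounded values listed in this notebook, so its results are approximate; the variable names are illustrative, and the retained fraction assumes the demux total counts read pairs.

```python
# Values copied from the summaries in this notebook (means are rounded).
n_features = 5055
mean_per_feature = 8905
mean_per_sample = 26973
total_input_reads = 50090402  # total reported by the demux summary

# Total frequency in the feature table, estimated from the per-feature mean.
total_frequency = n_features * mean_per_feature

# Number of samples implied by the per-sample mean.
implied_samples = total_frequency / mean_per_sample

# Approximate share of input reads that ended up in the table
# (assumption: the demux total counts read pairs, as the table counts merged pairs).
retained_fraction = total_frequency / total_input_reads

print(f"estimated total frequency: {total_frequency}")
print(f"implied number of samples: {implied_samples:.1f}")
print(f"approximate fraction retained: {retained_fraction:.3f}")
```

The implied sample count rounds to 1669, which is consistent with all samples from the demux summary still being present in the feature table.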