In [None]:
# Before starting this notebook, activate the QIIME 2 virtual environment:
# "source activate qiime2-2017.2"

In [1]:
!ls
# List all the files in the folder.
# You want to see all your sequence XXX.fastq.gz files, R1 and R2 for each sample.

Assigning_taxonomy.ipynb        Relative_Abundances.ipynb
Differential_Abundances.ipynb   Sequence_QC.ipynb
First_glances_at_data.ipynb     Soil_Properties.ipynb
OTU_binning.ipynb               Testing_for_Sig_Diffs.ipynb
Ordination_Plots.ipynb          Tree_for_Unifrac.ipynb
Plant_Distancs_with_Ellen.ipynb qiime_QC.ipynb


In [2]:
!qiime tools import --help

Usage: qiime tools import [OPTIONS]

  Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/
  for usage examples and details on the file types and associated semantic
  types that can be imported.

Options:
  --type TEXT           The semantic type of the new artifact.  [required]
  --input-path PATH     Path to file or directory that should be imported.
                        [required]
  --output-path PATH    Path where output artifact should be written.
                        [required]
  --source-format TEXT  The format of the data to be imported. If not
                        provided, data must be in the format expected by the
                        semantic type provided via --type.
  --help                Show this message and exit.


In [5]:
!qiime tools import  --type 'SampleData[PairedEndSequencesWithQuality]' --input-path ../../../data/Seq_data/Seqs/ --source-format CasavaOneEightSingleLanePerSampleDirFmt --output-path demux2.qza
# Here we are importing our data
# It's in a different format than the data from the tutorial
# We received our files from the sequencing centre already demultiplexed -
# that is, there is a separate pair of .fastq.gz files (forward and reverse read) for each of our samples.
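Before importing, it can help to confirm that every sample really does have both a forward and a reverse file. The sketch below is one way to check, assuming the files follow the usual Casava 1.8 naming pattern (sample ID, barcode, lane, read number, set number). The helper name `find_unpaired_samples` and the example filenames are made up for illustration; they are not part of QIIME 2.

```python
import re
from collections import defaultdict

# Assumed Casava 1.8 naming: <sample>_<barcode>_L<lane>_R<1|2>_<set>.fastq.gz
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_[^_]+_L\d{3}_R(?P<read>[12])_\d{3}\.fastq\.gz$"
)

def find_unpaired_samples(filenames):
    """Return sample IDs that lack either a forward (R1) or reverse (R2) file."""
    reads_seen = defaultdict(set)
    for name in filenames:
        match = CASAVA_NAME.match(name)
        if match:  # files that do not fit the pattern are ignored
            reads_seen[match.group("sample")].add(match.group("read"))
    return sorted(s for s, reads in reads_seen.items() if reads != {"1", "2"})

# Hypothetical filenames, for illustration only
example_files = [
    "523-1-1_S1_L001_R1_001.fastq.gz",
    "523-1-1_S1_L001_R2_001.fastq.gz",
    "523-1-2_S2_L001_R1_001.fastq.gz",  # reverse read missing
]
unpaired = find_unpaired_samples(example_files)
print(unpaired)  # ['523-1-2']
```

To run it on real data, pass `os.listdir(...)` for your sequence folder instead of `example_files`.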

In [None]:
#!qiime dada2 plot-qualities --verbose --i-demultiplexed-seqs demux.qza --p-n 4 --o-visualization demux-qualities.qzv
# This command creates plots of the quality scores for our sequences.
# The output is written to demux-qualities.qzv

In [None]:
#!qiime tools view demux-qualities.qzv
# Let's take a look at the read quality plots.
# How does it compare to the tutorial reads?
# How long are the reads?
# Where along the read do sequences get bad?
# Do the forward or reverse reads tend to be better quality?

# Remember you have to press the square ("STOP") button to stop this cell running the visualization.
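If you want a number to go with "where do the sequences get bad", one rough approach is to slide a small window along the per-position median quality scores and report the first place the window average falls below a cutoff. This is only a sketch: the function name, the threshold of 25, the window of 5 and the toy scores are all assumptions for illustration, and you would need to read the real median scores off the quality plots yourself.

```python
def first_low_quality_position(median_quality, threshold=25, window=5):
    """Return the first 0-based position where the mean of `window`
    consecutive median scores drops below `threshold`, or None."""
    for start in range(len(median_quality) - window + 1):
        window_scores = median_quality[start:start + window]
        if sum(window_scores) / window < threshold:
            return start
    return None

# Toy profile: 200 good positions, 30 middling, 20 poor (not real data)
toy_profile = [36] * 200 + [30] * 30 + [20] * 20
cutoff = first_low_quality_position(toy_profile)
print(cutoff)  # 228
```

A position found this way is only a starting point for choosing a truncation length; the reads still need to be long enough after truncation for the forward and reverse reads to overlap.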

In [6]:
# Okay, you're ready to create your OTUs.
# But, you have some decisions to make, right?
# The command is below, but you need to decide which values to use for 
# --p-trim-left-f, --p-trim-left-r, --p-trunc-len-f, and --p-trunc-len-r
# For the trimming of the reads - remember, look at the quality plots,
# and see where the quality drops off. In the tutorial, we used 10
# For the truncating of the reads, look at how long they are.
# In the tutorial, the reads were 150bp, so we cut them off at 150.
# In this data, we produced longer reads. What should that length be?

!qiime dada2 denoise-paired --i-demultiplexed-seqs demux2.qza --p-trim-left-f 12 --p-trim-left-r 12 --p-trunc-len-f 250 --p-trunc-len-r 250 --p-max-ee 2 --p-n-threads 0 --verbose --o-table table2 --o-representative-sequences rep-seqs2

# On my computer, for the full run (86 samples), the timing was:
#   1:19 AM   started
#   2:40 AM   forward reads at step 3 (converged after 7)
#   8:12 AM   reverse reads at step 6 (converged)
#   9:53 AM   finished denoising
#   11:05 AM  finished chimeras and output
# I've set --p-n-threads 0 so it uses all available cores
# I've also set --verbose so it reports what it's doing below.
# That way, you can see the progress it makes as it works its way through
# filtering, error learning, and each sample
# You might want to delegate this task to someone (or a cluster) with a more powerful computer

# Need to try different maxee values (and check the downstream analyses)

# Jamie ran sets of 10 randomly chosen samples, a mix of M and O
# From maxee=1 to maxee=2: 3% more OTUs, 10% more sequences
# From maxee=2 to maxee=3: 0% more OTUs, 1% more sequences
# Going to use 2 (the default; 1 may be discarding more reads than necessary)

# For second run of data, with maxee=2, started at 5:54PM on Friday - seemed to start off correctly. Leaving overnight.

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/folders/d2/qqsv2qxd5fjf4k455pzytwgh0000gn/T/tmp2l2jwok2/forward /var/folders/d2/qqsv2qxd5fjf4k455pzytwgh0000gn/T/tmp2l2jwok2/reverse /var/folders/d2/qqsv2qxd5fjf4k455pzytwgh0000gn/T/tmp2l2jwok2/output.tsv.biom /var/folders/d2/qqsv2qxd5fjf4k455pzytwgh0000gn/T/tmp2l2jwok2/filt_f /var/folders/d2/qqsv2qxd5fjf4k455pzytwgh0000gn/T/tmp2l2jwok2/filt_r 250 250 12 12 2.0 2 0 1000000

R version 3.3.2 (2016-10-31) 
Loading required package: Rcpp
DADA2 R package version: 1.2.2 
1) Filtering ................................................................................................................................................................................................................................................
2) Learnin
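For intuition about what the maxee setting does: a read's "expected errors" is the sum of the per-base error probabilities implied by its Phred quality scores, where a score Q corresponds to a probability of 10^(-Q/10). The sketch below computes that sum and applies a cutoff. It illustrates the definition only and is not the DADA2 code; the function names and the toy reads are assumptions.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)

def passes_max_ee(phred_scores, max_ee=2.0):
    """Keep a read only if its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee

# Toy reads of 250 bases with uniform quality (not real data)
good_read = [30] * 250   # each base has a 0.001 chance of being wrong
poor_read = [20] * 250   # each base has a 0.01 chance of being wrong

print(round(expected_errors(good_read), 3))  # 0.25
print(passes_max_ee(good_read, max_ee=2.0))  # True
print(passes_max_ee(poor_read, max_ee=2.0))  # False (about 2.5 expected errors)
print(passes_max_ee(poor_read, max_ee=3.0))  # True
```

This shows why raising maxee lets more sequences through: reads with more low-quality bases start to pass the filter.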

In [7]:
# Let's see what our OTU table ended up like...
# You'll need to give it a sample-metadata file, formatted as a .tsv
# You can make one of these easily in Excel - save it as a "tab-separated" file (.tsv)
# 
!qiime feature-table summarize --i-table table2.qza --o-visualization table2.qzv #--m-sample-metadata-file sample-metadata.tsv
!qiime feature-table tabulate-seqs --i-data rep-seqs2.qza --o-visualization rep-seqs2.qzv

Saved Visualization to: table2.qzv
Saved Visualization to: rep-seqs2.qzv


In [18]:
# Looking at the table summary we just created:
!qiime tools view table2.qzv
# We got about 10% more sequences with maxee=2 than with maxee=1.

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

This came back with about 9k OTUs and 1.9M sequences: a similar number of sequences to our custom pipeline, but many more OTUs.

In [None]:
# Looking at the representative sequences for each OTU:
#!qiime tools view rep-seqs2.qzv

# Check out the possible taxonomy of a couple of OTU sequences

In [None]:
# Now we need to assign taxonomy to these sequences.
# First, extracting the relevant portion of the 16S gene
# from the greengenes database

# Importing the sequences (the type is quoted so the shell leaves the square brackets alone)
#!qiime tools import --type 'FeatureData[Sequence]' --input-path 99_otus.fasta --output-path 99_otus.qza
# Importing their associated taxonomy
#!qiime tools import --type 'FeatureData[Taxonomy]' --input-path 99_otu_taxonomy.txt --output-path ref-taxonomy.qza

In [None]:
# Trimming the reference sequences to the region amplified by our primers
#!qiime feature-classifier extract-reads --i-sequences 99_otus.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-length 500 --o-reads ref-seqs.qza --verbose
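The primers above contain degenerate bases (Y, M, N, V, W), each of which stands for more than one nucleotide. To see what that means in practice, the sketch below turns a primer into a regular expression using the standard IUPAC codes and searches a sequence for it. It does exact matching on one strand only and is not how QIIME 2 implements extract-reads; the function name and the test sequence are made up for illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_to_regex(primer):
    """Compile a degenerate primer into a regex that matches every variant."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

forward = primer_to_regex("GTGYCAGCMGCCGCGGTAA")

# Made-up sequence containing one variant of the forward primer (Y=C, M=A)
hit = forward.search("TTTGTGCCAGCAGCCGCGGTAATACG")
print(hit.start(), hit.group())  # 3 GTGCCAGCAGCCGCGGTAA
```

The same helper works for the reverse primer, although finding it in a forward-oriented sequence would also require taking the reverse complement.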

In [None]:
# And then actually training the classifier based on these sequences
#!qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier Classifier.qza --verbose

In [None]:
# You can also just download the classifier from Canvas posted in the
# announcement notifying you of the sequence availability

In [20]:
# Classify our sequences using the classifier

!qiime feature-classifier classify --i-classifier ../../classifier.qza --i-reads rep-seqs2.qza --o-classification taxonomy2.qza

Saved FeatureData[Taxonomy] to: taxonomy2.qza


In [21]:
# Save it all in a format we can use in R:
!mkdir OTU_table

!qiime tools export table2.qza --output-dir OTU_table
!qiime tools export rep-seqs2.qza --output-dir OTU_table
!qiime tools export taxonomy2.qza --output-dir OTU_table

In [25]:
!mv OTU_table/dna-sequences.fasta OTU_table/dna-sequences2.fasta
!mv OTU_table/feature-table.biom OTU_table/feature-table2.biom
!mv OTU_table/taxonomy.tsv OTU_table/taxonomy2.tsv
!ls OTU_table/

dna-sequences2.fasta feature-table2.biom  taxonomy2.tsv
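Once the taxonomy is exported, each OTU's taxonomy is a single semicolon-separated string. The sketch below splits such a string into named ranks, assuming Greengenes-style prefixes (k__, p__, c__, and so on) since the classifier was trained on Greengenes. The function name and the example string are assumptions for illustration; check a few lines of your own taxonomy2.tsv to confirm the format before relying on it.

```python
# Greengenes-style rank prefixes (assumed format)
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_taxonomy(taxon):
    """Split 'k__X; p__Y; ...' into a dict, skipping unassigned ranks."""
    parsed = {}
    for field in taxon.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANKS and name:
            parsed[RANKS[prefix]] = name
    return parsed

# Made-up example string; family, genus and species are unassigned
example = "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__; g__; s__"
lineage = parse_taxonomy(example)
print(lineage["phylum"])  # Proteobacteria
print(len(lineage))       # 4
```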


In [1]:
# You also need to make a metadata file for your own samples.
# The first column should be your sample names in this format: 523-X-Y
# where X is your group number and Y is the sample number.
# So, you should have 6 rows plus the title row
# Subsequent columns can add other data you may have
# E.g., you probably want a Treatment column
# If you took pH, make one for that, etc.
# To keep things simple, don't use spaces in your column names
# and don't start their name with a number

# You can just make it in Excel if you want, 
# and save it as a "tab-separated" file with a .tsv extension
# (In the example below, it's called sample-metadata.tsv)

!biom add-metadata -i ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table2.biom -o ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD2.biom --sample-metadata-fp ../../data/Soil_properties/WBNPNWT_Soils_2015_Metadata_File_QIIME.txt
!biom add-metadata -i ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD2.biom -o ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD-tax2.biom --observation-metadata-fp ../../data/Seq_data/QIIME_maxee2/OTU_table/taxonomy2.tsv --sc-separated taxonomy --observation-header OTUID,taxonomy

# You should end up with a feature-table-metaD-tax2.biom file
# It should have your samples, their metadata, and your taxonomy all there.
# Now you can work with this .biom file in R, like we did in the class tutorials

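If you would rather build the metadata file in Python than in Excel, the sketch below checks the column-name rules described above (no spaces, no leading digit) and writes tab-separated text. The function names, the `SampleID` header for the first column, and the example values are all placeholders for illustration; use whatever ID header your tools expect.

```python
import csv
import io
import re

# A column name may contain letters, digits and underscores,
# and must not start with a digit
VALID_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def invalid_columns(header):
    """Return the data column names (all but the first) that break the rules."""
    return [name for name in header[1:] if not VALID_COLUMN.match(name)]

def metadata_to_tsv(header, rows):
    """Render the header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder header and values, for illustration only
header = ["SampleID", "Treatment", "pH"]
rows = [["523-1-1", "Burned", "5.2"], ["523-1-2", "Unburned", "6.1"]]

print(invalid_columns(header))                           # []
print(invalid_columns(["SampleID", "Soil pH", "2nd_rep"]))  # ['Soil pH', '2nd_rep']
tsv_text = metadata_to_tsv(header, rows)
print(tsv_text)
```

Writing `tsv_text` to a file named sample-metadata.tsv gives the same kind of file as saving from Excel as tab-separated.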
In [4]:
!biom summarize-table -i ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD-tax2.biom -o ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD-tax-summary2.txt

In [5]:
!head -200 ../../data/Seq_data/QIIME_maxee2/OTU_table/feature-table-metaD-tax-summary2.txt

Num samples: 240
Num observations: 9140
Total count: 2135090
Table density (fraction of non-zero values): 0.019

Counts/sample summary:
 Min: 0.0
 Max: 54113.0
 Median: 6684.000
 Mean: 8896.208
 Std. dev.: 8055.503
 Sample Metadata Categories: Moisture; Project_ID; CEC_cmol_kg; Clay_pct; Land_Class_Unburned; Mn_mg_kg; Sample_ID; ffmc; Burn_Severity_Index; P_mg_kg; Al_mg_kg; Exch_Mg_mg_kg; Sand_pct; K_mg_kg; Land_Class; temp; pH; CFSI; Mean_Duff_Depth_cm; Exch_K_mg_kg; TIC_ash_pct; Mo_mg_kg; fwi; Ca_mg_kg; Forest; Burned_Unburned; Org_or_Min; Veg_Comm; Nutrient; Exch_Na_mg_kg; O_Depth_cm; ws; Replicate; Fwd_Primer_Barcode; Exch_Ca_mg_kg; Mg_mg_kg; Interval; Moisture_Regime; EC_mS_cm; isi; Fire_ID; Sample_Name; Ecosite; prec; Rev_Primer_Barcode; TOC_LOI_pct; Na_mg_kg; Overstory_CBI; Barcodes; rh; TC_pct; Cu_mg_kg; Fe_mg_kg; dc; S_mg_kg; Zn_mg_kg; Silt_pct; Total_N_pct; Dead_Trees; Pct_Exposed_Mineral; Understory_CBI; Community; TOC_HCL_cruc_pct; CBI; RBR; Live_Trees; Total_S_p
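The numbers at the top of this summary can be cross-checked against each other. The sketch below parses the first few lines of the summary text and recomputes the mean count per sample and the approximate number of non-zero cells in the table. The function name is an assumption, and the non-zero cell count is only approximate because the density is reported to three decimal places.

```python
def parse_summary_header(text):
    """Collect 'label: number' lines from a biom summary into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            try:
                values[key.strip()] = float(value)
            except ValueError:
                continue  # skip lines whose value is not a single number
    return values

# First lines of the summary shown above
summary_text = """Num samples: 240
Num observations: 9140
Total count: 2135090
Table density (fraction of non-zero values): 0.019
"""

stats = parse_summary_header(summary_text)
mean_per_sample = stats["Total count"] / stats["Num samples"]
nonzero_cells = (
    stats["Table density (fraction of non-zero values)"]
    * stats["Num samples"]
    * stats["Num observations"]
)
print(round(mean_per_sample, 3))  # 8896.208, matching the reported mean
print(round(nonzero_cells))       # roughly 41678 non-zero sample/OTU combinations
```

The minimum count of 0.0 in the summary also suggests at least one sample has no sequences, which is worth checking before any downstream analysis.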