## Quality control tutorial, based on QIIME2 "Atacama soil microbiome" tutorial
https://docs.qiime2.org/2017.2/tutorials/atacama-soils/

When we get our sequence data back, we will want to do a few things.
1. Merge paired-end reads
2. Screen sequences to include only high-quality reads
3. Match our pool of sequences with their respective samples

Watching this video from Illumina will give you an overview of how the MiSeq sequencing process runs. What we get back from the sequencing centre will be three files: forward reads, reverse reads, and index ("barcode") reads.

As you saw in the video, Illumina MiSeq gives us millions of reads back. Since we don't need that degree of "sequencing depth" in our samples, we can pool hundreds of samples on a single run. To do this, we attach a different unique sequence ("barcode") to all the reads from each sample. Then, when they are sequenced, we also get the barcode sequence, allowing us to trace which sample each read came from. The process of unpooling the reads is usually called "demultiplexing". This is one of the first things we will do with our sequences.
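
As a rough illustration of the idea (not how QIIME 2 implements it), demultiplexing amounts to a lookup from barcode to sample. The sketch below uses two barcode/sample pairs that appear in the sample metadata file used later in this tutorial; the pooled reads themselves are invented for the example, and the function name `demultiplex` is hypothetical.

```python
from collections import defaultdict

# Barcode -> sample ID. These two pairs appear in the tutorial's metadata file.
barcode_to_sample = {
    "GCCCAAGTTCAC": "BAQ1370.1.2",
    "GCGCCGAATCTT": "BAQ1370.3",
}

# Hypothetical (barcode read, biological read) pairs, invented for illustration.
pooled_reads = [
    ("GCCCAAGTTCAC", "TACGTAGGGAGC"),
    ("GCGCCGAATCTT", "TACGTAGGGTGC"),
    ("GCCCAAGTTCAC", "TACGGAGGGTGC"),
    ("AAAAAAAAAAAA", "TACGTAGGGGGC"),  # barcode matches no sample
]

def demultiplex(reads, barcode_map):
    """Group reads by sample; reads with unknown barcodes go under None."""
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        by_sample[barcode_map.get(barcode)].append(sequence)
    return dict(by_sample)

demuxed = demultiplex(pooled_reads, barcode_to_sample)
for sample, seqs in demuxed.items():
    print(sample, len(seqs))
```

Real demultiplexers also have to deal with sequencing errors in the barcode reads and with barcodes that need reverse-complementing, which this toy lookup ignores.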

While the files we get back from the sequencing centre will be very large, they are relatively simple: basically just stylized text files with information about our sequences and the sequences themselves. Sequence-only files usually end with ".fasta" or ".fa". If they include "quality scores", they end with ".fastq" or ".fq". They are still basically just text files, so you could technically open them with a text editor. However, they are usually very long, and you wouldn't really want to do that.

## Importing and examining sequencing files

To run a cell in the Jupyter notebook, just type shift+enter while you have the cell selected.

A prefix of ! indicates that the code should be run in the terminal shell rather than as Python code, which is the default.

Let's take a look at some .fastq files from the qiime2 Atacama Soil Microbiome tutorial.

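Start by creating a new directory for this work, and moving to it. We use %cd rather than !cd for the move: each ! command runs in its own temporary shell, so a directory change made with !cd would not carry over to the next command.

In [5]:
!mkdir qiime2-atacama-tutorial
%cd qiime2-atacama-tutorial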
!pwd

/home/soil-micro/523_Soil_Sci/qiime2-atacama-tutorial


Now we'll download the files using wget.
The -O flag sets the name of the file we save to, here "sample-metadata.tsv". You could name it something else if you wanted.
The URL is, of course, where the file is stored.

In [9]:
!wget -O "sample-metadata.tsv" "https://data.qiime2.org/2017.2/tutorials/atacama-soils/sample_metadata.tsv"

--2017-02-18 14:49:15--  https://data.qiime2.org/2017.2/tutorials/atacama-soils/sample_metadata.tsv
Resolving data.qiime2.org (data.qiime2.org)... 104.18.56.116, 104.18.57.116, 2400:cb00:2048:1::6812:3974, ...
Connecting to data.qiime2.org (data.qiime2.org)|104.18.56.116|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://docs.google.com/spreadsheets/d/1xMP1EjKZDrzdKLnQr7LGVAY35ongxrreT28k0EACtfg/export?gid=0&format=tsv [following]
--2017-02-18 14:49:21--  https://docs.google.com/spreadsheets/d/1xMP1EjKZDrzdKLnQr7LGVAY35ongxrreT28k0EACtfg/export?gid=0&format=tsv
Resolving docs.google.com (docs.google.com)... 216.58.192.206, 2607:f8b0:4009:80e::200e
Connecting to docs.google.com (docs.google.com)|216.58.192.206|:443... connected.
HTTP request sent, await

(If you want to hide the output from a given command, you can click or double-click the space below the In [#] for that command.)

We can take a look at this file we just downloaded - we might not want to see the whole thing, but we can look at the top portion using the command head:

In [8]:
!head sample-metadata.tsv

#SampleID	BarcodeSequence	LinkerPrimerSequence	Elevation	ExtractConcen	AmpliconConcentration	ExtractGroupNo	TransectName	SiteName	Depth	pH	TOC	EC	AverageSoilRelativeHumidity	RelativeHumiditySoilHigh	RelativeHumiditySoilLow	PercentRelativeHumiditySoil_100	AverageSoilTemperature	TemperatureSoilHigh	TemperatureSoilLow	Vegetation	PercentCover	Description
BAQ1370.1.2	GCCCAAGTTCAC	CCGGACTACHVGGGTWTCTAAT	1370	0.019	0.950	B	Baquedano	BAQ1370	2	7.98	525	6.08	16.17	23.97	11.42	0	22.61	35.21	12.46	no	0	BAQ1370.1.2
BAQ1370.3	GCGCCGAATCTT	CCGGACTACHVGGGTWTCTAAT	1370	0.124	17.460	E	Baquedano	BAQ1370	2	NA	771	6.08	16.17	23.97	11.42	0	22.61	35.21	12.46	no	0	BAQ1370.3
BAQ1370.1.3	ATAAAGAGGAGG	CCGGACTACHVGGGTWTCTAAT	1370	1.200	0.960	J	Baquedano	BAQ1370	3	8.13	NA	NA	16.17	23.97	11.42	0	22.61	35.21	12.46	no	0	BAQ1370.1.3
BAQ1552.1.1	ATCCCAGCATGC	CCGGACTACHVGGGTWTCTAAT	1552	0.722	18.830	J	Baquedano	BAQ1552	1	7.87	NA	NA	15.75	35.36	11.1	0	22.63	30.65	10.96	no	0	BAQ1552.1.1
BAQ1552.2	GCTTCCAGACAA	C

As you can see, this has all the data associated with our samples for this tutorial, including the barcodes.
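
If you would rather pull the barcodes out in Python than read them by eye, the standard csv module can parse the tab-separated file. This is a minimal sketch: the helper name `read_barcodes` is hypothetical, and the example text is just two rows of the output above cut down to three columns so that the code runs without the downloaded file.

```python
import csv
import io

# Two rows trimmed to the first three columns of the metadata shown above.
# With the real file you would pass open("sample-metadata.tsv", newline="") instead.
example_tsv = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\n"
    "BAQ1370.1.2\tGCCCAAGTTCAC\tCCGGACTACHVGGGTWTCTAAT\n"
    "BAQ1370.3\tGCGCCGAATCTT\tCCGGACTACHVGGGTWTCTAAT\n"
)

def read_barcodes(handle):
    """Return {sample ID: barcode} from a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["#SampleID"]: row["BarcodeSequence"] for row in reader}

barcodes = read_barcodes(io.StringIO(example_tsv))
print(barcodes)
```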

Let's make a new folder within this tutorial, where we will house our sequence data.

In [11]:
!mkdir emp-paired-end-sequences

mkdir: cannot create directory ‘emp-paired-end-sequences’: File exists


Then we are going to download the forward and reverse reads to that folder.
Note that we are still in the qiime2-atacama-tutorial folder - we didn't move into the emp-paired-end-sequences folder. However, by specifying the file path, we can direct the computer to download the sequences to that new folder we just created.

In [52]:
!wget -O "emp-paired-end-sequences/forward.fastq.gz" "https://data.qiime2.org/2017.2/tutorials/atacama-soils/1p/forward.fastq.gz"
!wget -O "emp-paired-end-sequences/reverse.fastq.gz" "https://data.qiime2.org/2017.2/tutorials/atacama-soils/1p/reverse.fastq.gz"
!wget -O "emp-paired-end-sequences/barcodes.fastq.gz" "https://data.qiime2.org/2017.2/tutorials/atacama-soils/1p/barcodes.fastq.gz"

--2017-02-18 16:17:08--  https://data.qiime2.org/2017.2/tutorials/atacama-soils/1p/forward.fastq.gz
Resolving data.qiime2.org (data.qiime2.org)... 104.18.56.116, 104.18.57.116, 2400:cb00:2048:1::6812:3974, ...
Connecting to data.qiime2.org (data.qiime2.org)|104.18.56.116|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://dl.dropboxusercontent.com/u/2868868/data/qiime2/tutorials/importing-sequence-data/2017.2/emp-paired-end-sequences/atacama-1p/forward.fastq.gz [following]
--2017-02-18 16:17:08--  https://dl.dropboxusercontent.com/u/2868868/data/qiime2/tutorials/importing-sequence-data/2017.2/emp-paired-end-sequences/atacama-1p/forward.fastq.gz
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 108.160.172.69
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|108.160.172.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14579008 (14M) [application/octet-stream]
Saving to: ‘emp-paired-end-sequ

Although the programs we are going to use can work with these files in a zipped format, we want to take a look at them. So, let's unzip them and use commands to see what's in them. The -k flag indicates we want to "keep" the zipped file too.
The -f flag is for "force", which means it will overwrite existing files if they exist.
We then make a new directory (mkdir) and move the unzipped files into it (mv).

In [2]:
!gunzip -k -f emp-paired-end-sequences/*.gz
!mkdir unzipped-emp-paired-end-sequences/
!mv emp-paired-end-sequences/*.fastq unzipped-emp-paired-end-sequences/

mkdir: cannot create directory ‘unzipped-emp-paired-end-sequences/’: File exists
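
As an aside, unzipping to disk is not the only way to peek inside. Python's gzip module can read a compressed text file directly. The sketch below assumes nothing about the tutorial data: it writes a tiny made-up FASTQ record to a temporary file so that it runs anywhere, and the helper name `head_gz` is hypothetical.

```python
import gzip
import os
import tempfile

def head_gz(path, n=4):
    """Return the first n lines of a gzip-compressed text file."""
    lines = []
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines

# Demo on a tiny made-up FASTQ record so the example runs anywhere.
# In the tutorial folder you could call
# head_gz("emp-paired-end-sequences/forward.fastq.gz") instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

print(head_gz(demo_path))
```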


Let's take a look at the first 100 lines of our forward reads .fastq file.

In [5]:
!head -100 unzipped-emp-paired-end-sequences/forward.fastq

@M00176:65:000000000-A41FR:1:1101:14282:1412 1:N:0:0
NACGTAGGGTGCAAGCGTTAATCGGAATTACNGGNNNTAAAGCGTGCNNAGGCNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#>>>AAF@ACAA4BGCEEECGGHGGEFCFBG#BA###BABAEFGEEE##BBAA###########B######################################################################################
@M00176:65:000000000-A41FR:1:1101:16939:1420 1:N:0:0
NACGTAGGGGGCAAGCGTTGTCCGGAATCATTGGNNGTAAAGAGCGTGNAGGCNNNNNGNNANNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#>>ABAF@ABBBGGGGEEFGGGHGGFFGFHHHHH##BBEFDHHCGGFG#ABFF#####B##B##B######################################################################################
@M00176:65:000000000-A41FR:1:1101:14746:1560 1:N:0:0
TACGTAGGGAGCTAGCGTTGTCCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCAGATAAGTCCGGTGTAAAAGCCACAGGCTNNNNNNNNNNNNNNNNCNGGANNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNANNNNNGGN
+
AA1AA1CA11>AAGGCEEEE0FGACAEFFDDGFFHFGGCEGHBE?A?EGCEGFGGEE/ABFFGD

Okay! So that's what a fastq file looks like. At first, it might seem like mumbo-jumbo, but looking at it more closely, we can see a few things:
1. Each record takes up four lines, and the first of them starts with an @ sign. That indicates the beginning of a new record. Each record has a string of identifiers that tie it to the sequencer and the specific run.
2. Within each record, the parts that we are the most interested in are the sequences, which are, of course, the parts that are all A's, T's, C's, and G's. You might see a lot of N's too. That's not a new nucleotide; it's the indication of an ambiguous base. The sequencer couldn't figure out what should go there, so it left an N.
3. There's a + sign, and then a series of characters. These characters represent the Phred quality, or Q, scores. You can read more about Q scores here <http://www.drive5.com/usearch/manual/quality_score.html>. Basically, they give us the level of confidence the sequencer assigned to its "base call" - the likelihood that the nucleotide at that location is correct. It's a logarithmic scale, so a Q score of 20 means there's a 1 in 100 chance that the base call was wrong (99% accuracy), while a Q score of 30 means there's a 1 in 1000 chance that the base call was wrong (99.9% accuracy).
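
To make the Q score arithmetic concrete, here is a small sketch that decodes quality characters and converts a Q score to an error probability using P = 10^(-Q/10). It assumes the common Phred+33 encoding (each character's ASCII code minus 33), which is usual for recent Illumina output but is not stated in the file itself; the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Q scores.

    offset=33 (Phred+33) is an assumption; it is the usual encoding for
    recent Illumina data, but older files may use a different offset.
    """
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability that a base call with quality q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

# First few characters of a quality line from the forward reads above.
scores = phred_scores("#>>>AAF")
print(scores)
print(error_probability(20), error_probability(30))
```

Under that assumption, the long runs of # in the quality lines above decode to Q2, the lowest score in those reads, which lines up with the runs of N in the sequences.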

You can probably see that these first few reads don't look that good. They aren't very long, and there are a lot of N's. That's okay. We have millions of reads.
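
If you want to put a number on "a lot of N's", a few lines of Python can walk through the records and report the fraction of ambiguous bases in each read. This is only a sketch: it assumes the simple four-lines-per-record layout seen above, the helper names are hypothetical, and the two example records are made up so that the code runs without the tutorial files.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def fraction_n(sequence):
    """Fraction of positions in a read that are ambiguous (N)."""
    return sequence.count("N") / len(sequence) if sequence else 0.0

# Two short made-up records; with the real data you would pass
# open("unzipped-emp-paired-end-sequences/forward.fastq") instead.
example = io.StringIO("@read1\nNACGTNNN\n+\n#>>>A###\n@read2\nTACGTAGG\n+\nAA1AA1CA\n")
records = list(read_fastq(example))
for header, sequence, quality in records:
    print(header, fraction_n(sequence))
```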

## Demultiplexing
Let's demultiplex our reads. First, we need to import the sequence files we just downloaded into a form that QIIME works with (a .qza file).

(These files are basically zipped groups of files. If you want the raw data out of them, you can unzip them and access them there.)
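
Because of that, Python's zipfile module should be able to list what is inside one. The sketch below builds a stand-in archive so that it runs without any QIIME 2 output on disk; the file name inside the stand-in is invented and says nothing about the real layout of a .qza, and the helper name `list_archive` is hypothetical.

```python
import os
import tempfile
import zipfile

def list_archive(path):
    """Return the names of the files stored inside a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# Stand-in archive so the example runs without QIIME 2 output on disk.
# The file names inside are invented; a real .qza has its own layout.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.qza")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("demo/data/example.txt", "hello")

print(list_archive(demo_path))
```

Once the import step below has run, calling `list_archive("emp-paired-end-sequences.qza")` would show the actual contents.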

In [12]:
!qiime tools import --type EMPPairedEndSequences --input-path emp-paired-end-sequences/ --output-path emp-paired-end-sequences.qza

You can see there were a few options that went into that command.
We told it the type of file we wanted to create: EMPPairedEndSequences. We told it the input path was the folder with just the three .gz files we downloaded. And we told it to save the output in our current directory, calling it emp-paired-end-sequences.qza.

If we wanted to know what options were possible, we could enter:

In [50]:
!qiime tools import --help

Usage: qiime tools import [OPTIONS]

  Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/
  for usage examples and details on the file types and associated semantic
  types that can be imported.

Options:
  --type TEXT           The semantic type of the new artifact.  [required]
  --input-path PATH     Path to file or directory that should be imported.
                        [required]
  --output-path PATH    Path where output artifact should be written.
                        [required]
  --source-format TEXT  The format of the data to be imported. If not
                        provided, data must be in the format expected by the
                        semantic type provided via --type.
  --help                Show this message and exit.


So, now that we've got the file that QIIME can read, we can ask it to demultiplex our sequences.  But what do we need to provide to tell it to do that?

In [6]:
!qiime demux emp-paired --help

Usage: qiime demux emp-paired [OPTIONS]

  Demultiplex paired-end sequence data (i.e., map barcode reads to sample
  ids) for data generated with the Earth Microbiome Project (EMP) amplicon
  sequencing protocol. Details about this protocol can be found at
  http://www.earthmicrobiome.org/emp-standard-protocols/

Options:
  --i-seqs PATH                   Artifact: EMPPairedEndSequences  [required]
                                  The paired-end sequences to be
                                  demultiplexed.
  --m-barcodes-file PATH          Metadata mapping file  [required]
  --m-barcodes-category TEXT      Category from metadata mapping file
                                  [required]
  --p-rev-comp-barcodes / --p-no-rev-comp-barcodes
                                  [default: False]
                                  If provided, the barcode
                                  sequence reads will be reverse complemented
                                  prior to d

Ok, it looks like it needs an input file (that's the .qza we just created) and a barcodes file path and the barcode category. We can also tell it whether we need to reverse-complement the barcodes, what the output files should be called, and some other options as well.
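
In case "reverse-complement" is unfamiliar: it means reading the sequence backwards and swapping each base for its partner (A with T, C with G). A minimal sketch, using a barcode from the metadata file as input; the function name is hypothetical and only the four unambiguous bases are handled.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Reverse a DNA sequence and swap each base for its complement."""
    return sequence.translate(_COMPLEMENT)[::-1]

# Barcode of sample BAQ1370.1.2 from the metadata file.
print(reverse_complement("GCCCAAGTTCAC"))
```

Whether the barcodes need this treatment depends on how the run was set up, which is why the demultiplexing command exposes it as an option.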

In [13]:
!qiime demux emp-paired --i-seqs emp-paired-end-sequences.qza --m-barcodes-file sample-metadata.tsv --m-barcodes-category BarcodeSequence --o-per-sample-sequences demux --p-rev-comp-mapping-barcodes

Saved SampleData[PairedEndSequencesWithQuality] to: demux.qza


In [14]:
!ls

demux.qza		      sample-metadata.tsv
emp-paired-end-sequences      Untitled.ipynb
emp-paired-end-sequences.qza  unzipped-emp-paired-end-sequences
QC_Tutorial.ipynb


As you can see, now we have a file called demux.qza. QIIME can put this data into an easy-to-visualize format for us.

In [16]:
!qiime demux summarize --i-data demux.qza --o-visualization demux.qzv

Saved Visualization to: demux.qzv


In [18]:
!qiime tools view demux.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

We can see lots of information about our data here, including how many sequences we got per sample and what the lengths and number of reads were. We can also grab the data in a raw format here ("Download as .csv") if we wanted.

Press the square stop button to be able to run code again after visualizing.

## Quality scores

After demultiplexing reads, we’ll look at the sequence quality based on five randomly selected samples, and then denoise the data. When you view the quality plots, note that there are two plots per sample. The plot on the left presents the quality scores for the forward reads, and the plot on the right presents the quality scores for the reverse reads. We’ll use these plots to determine what trimming parameters we want to use for denoising.

(This command might take a while.)

In [20]:
!qiime dada2 plot-qualities --i-demultiplexed-seqs demux.qza   --o-visualization demux-qualities.qzv --p-n 5

^C

Aborted!


In [22]:
!qiime tools view demux-qualities.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

Take a look at the results! What are the trends? Are the quality scores for the forward and reverse reads similar?
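
To get a feel for what such plots summarize, the sketch below averages Q scores at each read position across a handful of reads and reports the first position where the average drops below a cutoff. It is not what the QIIME 2 plugin does internally. The quality strings are made up, Phred+33 encoding is assumed, the cutoff of 25 is arbitrary and not a recommendation, and both function names are hypothetical.

```python
def mean_quality_by_position(quality_lines, offset=33):
    """Average Q score at each read position across several reads.

    offset=33 (Phred+33) is an assumption about the file's encoding.
    """
    longest = max(len(line) for line in quality_lines)
    means = []
    for pos in range(longest):
        scores = [ord(line[pos]) - offset for line in quality_lines if pos < len(line)]
        means.append(sum(scores) / len(scores))
    return means

def first_position_below(means, threshold):
    """Index of the first position whose mean Q falls below threshold, else None."""
    for pos, value in enumerate(means):
        if value < threshold:
            return pos
    return None

# Made-up quality strings: 'I' is Q40, '5' is Q20, '#' is Q2.
toy = ["IIII5#", "IIII##", "III55#"]
means = mean_quality_by_position(toy)
print(means)
print(first_position_below(means, 25))
```

In practice the trimming positions are chosen by looking at the plots for both forward and reverse reads, so treat a single cutoff like this as a starting point for discussion rather than a rule.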

Next week, we will take this information and generate our "OTU table" - the table of taxa (and their associated representative sequences) and their abundances in each of our samples.