## Quality control tutorial, based on QIIME2 "Atacama soil microbiome" tutorial
https://docs.qiime2.org/2017.12/tutorials/atacama-soils/

Watching this video from Illumina <https://www.youtube.com/watch?v=fCd6B5HRaZ8> will give you an overview of how the MiSeq sequencing process runs.

 What we get back from the sequencing centre will be three files: Forward reads, Reverse reads, and Index ("barcode") reads. As you saw in the video, Illumina MiSeq gives us millions of reads back. Since we don't need that degree of "sequencing depth" in our samples, we can pool hundreds of samples on a single run. To do this, we attach a different unique sequence ("barcode") to all the reads from each sample. Then, when they are sequenced, we also get the barcode sequence, allowing us to trace which sample each read came from. The process of unpooling the reads is usually called "demultiplexing". This is one of the first things we will do with our sequences.
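
To make the idea of demultiplexing concrete, here is a minimal Python sketch of the bookkeeping involved. It is not how QIIME implements the step; the barcodes, sample names, and reads are made up for illustration, and it assumes the i-th barcode read belongs to the i-th sequence read.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; real barcodes come from the metadata file.
BARCODE_TO_SAMPLE = {
    "ACGTACGT": "sample-A",
    "TTGGCCAA": "sample-B",
}


def demultiplex(reads, barcodes, barcode_to_sample):
    """Group reads by sample, pairing the i-th read with the i-th barcode."""
    per_sample = defaultdict(list)
    for read, barcode in zip(reads, barcodes):
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode not in the map (for example, a sequencing error): set it aside.
            per_sample["unassigned"].append(read)
        else:
            per_sample[sample].append(read)
    return dict(per_sample)


# Made-up reads and their matching barcode reads.
reads = ["GATTACA", "CCCGGGA", "TTTTAAA", "ACACACA"]
barcodes = ["ACGTACGT", "TTGGCCAA", "ACGTACGT", "NNNNNNNN"]
demuxed = demultiplex(reads, barcodes, BARCODE_TO_SAMPLE)
for sample, sample_reads in sorted(demuxed.items()):
    print(sample, sample_reads)
```

Reads whose barcode is not in the map are collected under "unassigned" rather than silently dropped, which makes it easier to see how many reads could not be traced back to a sample.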
 
When we get our sequence data back, we will want to do a few things.
1. Merge paired-end forward and reverse reads
2. Screen sequences to include only high-quality reads
3. Match our pool of sequences with their respective samples (demultiplexing)

While the files we get back from the sequencing centre will be very large, they are relatively simple - basically just stylized text files with information about our sequences and the sequences themselves. They usually end with ".fasta" or ".fa". If they include "quality scores", they end with ".fastq" or ".fq". They are still basically just text files - you could technically open them with a text editor. However, they are usually very long, and you wouldn't really want to do that.

## Importing and examining sequencing files

#### To run a cell in the Jupyter notebook, just type shift+enter while you have the cell selected.
#### While it is running, an asterisk will appear: In [\*]. Once the command has finished, it will update with a number that represents the order in which the commands were run. 

(A prefix of ! indicates that the code should be run in the terminal shell (not python code, which is the default))

Let's take a look at some .fastq files from the qiime2 Atacama Soil Microbiome tutorial.

Make sure you are in the qiime2-atacama-tutorial directory for this work, or if you aren't, move to it.

In [None]:
!ls

Now we'll download the files using wget.
"sample-metadata.tsv" is what we are going to call the resulting file. You could name it something else if you wanted.
The URL is, of course, where the file is located.

In [None]:
!wget -O "sample-metadata.tsv" "https://data.qiime2.org/2017.12/tutorials/atacama-soils/sample_metadata.tsv"

(If you want to hide the output from a given command, you can click or double-click the space below the In [#] for that command.)

We can take a look at this file we just downloaded - we might not want to see the whole thing, but we can look at the top portion using the command head:

In [None]:
!head sample-metadata.tsv

As you can see, this has all the data associated with our samples for this tutorial, including the barcodes.
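
If you would rather look up barcodes in Python than read them off the screen, a sketch like the following could work. It assumes the file is tab-separated with a header row, that the sample ID column is named "#SampleID" (an assumption; check the output of head), and that the barcode column is named "BarcodeSequence". The two rows in the example are made up so that the block runs on its own.

```python
import csv
import io

# Hypothetical two-sample excerpt in a tab-separated layout;
# the IDs and barcodes below are made up for illustration.
EXAMPLE_TSV = (
    "#SampleID\tBarcodeSequence\tDescription\n"
    "soil-1\tACGTACGTACGT\tfirst example sample\n"
    "soil-2\tTTGGCCAATTGG\tsecond example sample\n"
)


def barcodes_by_sample(handle, id_column="#SampleID", barcode_column="BarcodeSequence"):
    """Map each sample ID to its barcode from a tab-separated metadata table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row[barcode_column] for row in reader}


mapping = barcodes_by_sample(io.StringIO(EXAMPLE_TSV))
print(mapping)
```

To try it on the downloaded file, you could pass an open file instead of the in-memory example: `with open("sample-metadata.tsv", newline="") as handle: mapping = barcodes_by_sample(handle)`.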

Let's make a new folder within this tutorial, where we will house our sequence data.

In [None]:
!mkdir emp-paired-end-sequences

Then we are going to download the forward, reverse, and barcode reads to that folder.
Note that we are still in the qiime2-atacama-tutorial folder - we didn't move into the emp-paired-end-sequences folder. However, by specifying the file path, we can direct the computer to download the sequences to that new folder we just created.

In [None]:
!wget -O "emp-paired-end-sequences/forward.fastq.gz" "https://data.qiime2.org/2017.12/tutorials/atacama-soils/1p/forward.fastq.gz"
!wget -O "emp-paired-end-sequences/reverse.fastq.gz" "https://data.qiime2.org/2017.12/tutorials/atacama-soils/1p/reverse.fastq.gz"
!wget -O "emp-paired-end-sequences/barcodes.fastq.gz" "https://data.qiime2.org/2017.12/tutorials/atacama-soils/1p/barcodes.fastq.gz"

Although the programs we are going to use can work with these files in a zipped format, we want to take a look at them. So, let's unzip them and use commands to see what's in them. The -k flag indicates we want to "keep" the zipped file too.
The -f flag is for "force", which means it will overwrite the unzipped files if they already exist.
We then make a new directory (mkdir) and move (mv) the unzipped files into it.

In [None]:
!gunzip -k -f emp-paired-end-sequences/*.gz
!mkdir -p unzipped-emp-paired-end-sequences/
!mv emp-paired-end-sequences/*.fastq unzipped-emp-paired-end-sequences/
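
As an aside, Python can read a gzipped text file directly, so peeking at the first few lines does not strictly require unzipping. The sketch below uses the standard gzip module; the helper name first_lines is made up, and the block writes a tiny made-up record to a temporary file so that it runs on its own.

```python
import gzip
import itertools
import os
import tempfile


def first_lines(path, n=8):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Build a tiny stand-in file so the sketch runs on its own; the record is made up.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGTN\n+\nIIII#\n")

print(first_lines(demo_path, 4))
```

In the tutorial directory, `first_lines("emp-paired-end-sequences/forward.fastq.gz")` should return the first eight lines of the forward reads.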

Let's take a look at the first 100 lines of our forward reads .fastq file.

In [None]:
!head -100 unzipped-emp-paired-end-sequences/forward.fastq

Okay! So that's what a fastq file looks like. At first, it might seem like mumbo-jumbo, but looking at it more closely, we can see a few things:
1. Every fourth line starts with an @ sign. That indicates the beginning of a new record. Each record has a string of identifiers that tie it to the sequencer and the specific run.
2. Within each record, the parts that we are the most interested in are the sequences, which are, of course, the parts that are all A's, T's, C's, and G's. You might see a lot of N's too. That's not a new nucleotide - it's the indication of an ambiguous base. The sequencer couldn't figure out what should go there, so it left an N.
3. There's a + sign, and then a series of characters. These characters represent the Phred quality, or Q, scores. You can read more about Q scores here <http://www.drive5.com/usearch/manual/quality_score.html>. Basically, they give us the level of confidence the sequencer assigned to its "base call" - the likelihood that the nucleotide at that location is correct. It's a logarithmic scale, so a Q score of 20 means there's a 1 in 100 chance that the base call was wrong (99% accuracy), while a Q score of 30 means there's a 1 in 1000 chance that the base call was wrong (99.9% accuracy).
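
The relationship between a Q score and the error probability is P = 10^(-Q/10). A small Python sketch of the decoding follows; it assumes the common "Phred+33" encoding, in which each quality character's ASCII code minus 33 gives the Q score. That offset is an assumption about these files, not something stated in this tutorial.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Q scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)


# "I", "5", and "#" stand for Q40, Q20, and Q2 under the Phred+33 assumption.
print(phred_scores("I5#"))
print(error_probability(20), error_probability(30))
```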

You can probably see that these first few reads don't look that good. They aren't very long, and there are a lot of N's. That's okay. We have millions of reads.
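
One rough way to put numbers on "not very long" and "a lot of N's" is to compute the mean read length and the fraction of bases called as N. The sketch below does this for a handful of made-up sequences; the function name read_stats is hypothetical.

```python
def read_stats(sequences):
    """Return (mean read length, fraction of bases called as N)."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        return 0.0, 0.0
    n_bases = sum(seq.upper().count("N") for seq in sequences)
    return total_bases / len(sequences), n_bases / total_bases


# Made-up sequences: 20 bases in total, 6 of them N.
example = ["ACGTNN", "NNNN", "ACGTACGTAC"]
mean_length, n_fraction = read_stats(example)
print(mean_length, n_fraction)
```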

## Demultiplexing
Let's demultiplex our reads. First, we need to import the sequence files we just downloaded into a form that QIIME works with (a .qza file).

(These files are basically zipped groups of files. If you want the raw data out of them, you can unzip them and access them there.)

In [None]:
!qiime tools import --type EMPPairedEndSequences --input-path emp-paired-end-sequences/ --output-path emp-paired-end-sequences.qza

You can see there were a few variables that went into that command.
We told it the type of file we wanted to create: EMPPairedEndSequences. We told it the input path was that folder with just the three .gz files we downloaded. And we told it we wanted it to save the output in our current directory, calling it emp-paired-end-sequences.qza.
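
Since a .qza file is a zipped group of files, Python's standard zipfile module can list what is inside one. The sketch below builds a small stand-in archive so that it runs on its own; the file names inside the stand-in are made up and are not meant to reflect the real layout of a QIIME artifact.

```python
import os
import tempfile
import zipfile


def list_archive(path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Stand-in archive with made-up contents, so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.qza")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("example-id/metadata.yaml", "placeholder\n")
    archive.writestr("example-id/data/forward.fastq.gz", "placeholder\n")

print(list_archive(demo_path))
```

Running `list_archive("emp-paired-end-sequences.qza")` in the tutorial directory should show the files that the import step packaged.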

If we wanted to know what commands were possible, we could enter:

In [None]:
!qiime tools import --help

So, now that we've got the file that QIIME can read, we can ask it to demultiplex our sequences.  But what do we need to provide to tell it to do that?

In [None]:
!qiime demux emp-paired --help

Ok, it looks like it needs an input file (that's the .qza we just created) and a barcodes file path and the barcode category. We can also tell it whether we need to reverse-complement the barcodes, what the output files should be called, and some other options as well.
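
If "reverse-complement" is unfamiliar: it means reading the sequence backwards and swapping each base for its partner (A with T, C with G). A minimal Python sketch, with N left as N:

```python
# Translation table for complementing bases; N (ambiguous) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))  # prints ACGTT
```

Whether barcodes need this treatment depends on how the barcode reads were recorded relative to the barcodes listed in the metadata file.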

In [None]:
!qiime demux emp-paired --i-seqs emp-paired-end-sequences.qza --m-barcodes-file sample-metadata.tsv --m-barcodes-category BarcodeSequence --o-per-sample-sequences demux --p-rev-comp-mapping-barcodes

In [None]:
!ls

As you can see, now we have a file called demux.qza. QIIME can put this data into an easy-to-visualize format for us.

In [None]:
!qiime demux summarize --i-data demux.qza --o-visualization demux.qzv

In [None]:
!qiime tools view demux.qzv

We can see lots of information about our data here, including how many sequences we got per sample and what the lengths and number of reads were. We can also grab the data in a raw format here ("Download as .csv") if we wanted.

Press the square stop button to be able to run code again after visualizing.

## Quality scores

After demultiplexing reads, we’ll look at the sequence quality based on a few randomly selected samples (the command below uses --p-n 3), and then denoise the data. When you view the quality plots, note that there are two plots per sample. The plot on the left presents the quality scores for the forward reads, and the plot on the right presents the quality scores for the reverse reads. We’ll use these plots to determine what trimming parameters we want to use for denoising.
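
To illustrate how quality scores can inform a trimming position, here is a rough Python sketch that averages Q scores at each read position and reports the first position where the mean drops below a threshold. This is only an illustration of the reasoning, not what QIIME or DADA2 does internally. The Phred+33 offset, the threshold of 20, and the quality strings are all assumptions chosen for the example.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Mean Q score at each read position, over reads long enough to cover it."""
    longest = max((len(q) for q in quality_strings), default=0)
    means = []
    for position in range(longest):
        scores = [ord(q[position]) - offset for q in quality_strings if len(q) > position]
        means.append(sum(scores) / len(scores))
    return means


def first_position_below(means, threshold=20):
    """Index of the first position whose mean Q is below threshold, else len(means)."""
    for position, mean_q in enumerate(means):
        if mean_q < threshold:
            return position
    return len(means)


# Made-up quality strings: quality is high at the start and drops toward the end.
example = ["IIII5#", "III5##", "IIII##"]
means = mean_quality_by_position(example)
print(means)
print(first_position_below(means))
```

In this made-up example the mean quality first falls below 20 at index 4 (counting from 0), which would suggest keeping only the first four bases.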

(This command might take a few minutes. If you know your computer's a bit slower, you might change the value for --p-n to 1 or 2.)

In [None]:
!qiime dada2 plot-qualities --i-demultiplexed-seqs demux.qza   --o-visualization demux-qualities.qzv --p-n 3

In [None]:
!qiime tools view demux-qualities.qzv

Take a look at the results! What are the trends? Are the quality scores for the forward and reverse reads similar?

Next week, we will take this information and generate our "OTU table" - the table of taxa (and their associated representative sequences) and their abundances in each of our samples.