<hr style="height:0px; visibility:hidden;" />

<h1><center>4. Setup and QC</center></h1>

<div class="alert alert-block alert-success">
Here we are going to set up the directory structure we will use to process our data, and then use a popular tool to assess the quality of our reads. Both are most easily done at a Unix-like command line, so this notebook uses a "Bash" kernel (Bash is the most commonly used shell language in Unix-like environments).
</div>

---

<center>This is notebook 4 of 6 of <a href="00-overview.ipynb">GL4U's Amplicon Bootcamp</a>. It is expected that the previous notebooks have been completed already.</center>

---

[**Previous:** 3. R intro](03-R-intro.ipynb)
<br>

<div style="text-align: right"><a href="05-amplicon-processing.ipynb"><b>Next:</b> 5. Amplicon processing</a></div>

---
---

## Setting up
First we are going to create a new location for us to work in, and then change into it:

In [None]:
mkdir -p ~/GL4U-amplicon-tutorial
cd ~/GL4U-amplicon-tutorial

Then let's check where we are and if there are any files/directories present:

In [None]:
pwd
ls

Next we are going to download the raw data files we will be starting with, unpack them, and then remove the downloaded archive since we no longer need it:

In [None]:
curl -L -o raw-reads.tar.gz https://figshare.com/ndownloader/files/39537235
tar -xf raw-reads.tar.gz
rm raw-reads.tar.gz

Now we can see we have them:

In [None]:
ls

In [None]:
ls raw-reads

The download also included a file with some sample information. Let's move that to our current directory and take a peek at it with the `head` command:

In [None]:
mv raw-reads/sample-info.tsv .

In [None]:
head sample-info.tsv

And lastly for setup, we're going to create all the directories we want to use while processing:

In [None]:
mkdir -p trimmed-and-filtered-reads fastqc-outputs final-outputs

In [None]:
ls

---

## Quality assessment of raw reads

We are going to use the popular tool [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess some basic quality metrics about our reads. Here is how we can run it on our raw reads:

In [None]:
fastqc -t 6 -q -o fastqc-outputs raw-reads/*.gz 

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `fastqc`   - the primary command we're using
    - `-t`   - where we can specify how many files we want to process at a time in parallel
    - `-q`   - telling the program not to print out everything it's doing (just because it's a little messy when running in parallel)
    - `-o`   - the directory where we want to put the output files
    - `raw-reads/*.gz` - we didn't get to cover this in our Unix intro, but this provides all of the read files as *positional* arguments; the `*` is a wildcard that here means "get anything in that directory that ends with `.gz`"

</div>
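
To make the wildcard in the breakdown above a little more concrete, here is a minimal sketch of how the shell expands `raw-reads/*.gz` before the command ever sees it. It works in a temporary directory with empty stand-in files; the file names are made up for illustration and are not the tutorial's actual read files.

```bash
# Work in a throwaway directory so nothing real is touched
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/raw-reads"

# Hypothetical file names, made up for this illustration
touch "${demo_dir}/raw-reads/A_R1_raw.fastq.gz" \
      "${demo_dir}/raw-reads/A_R2_raw.fastq.gz" \
      "${demo_dir}/raw-reads/notes.txt"

# The shell replaces the pattern with the list of matching paths
# before the command (here, echo) runs; notes.txt is not matched
matched=$(cd "${demo_dir}" && echo raw-reads/*.gz)
echo "${matched}"

# Count the matches by looping over the expanded pattern
n_matched=0
for f in "${demo_dir}"/raw-reads/*.gz; do
    n_matched=$((n_matched + 1))
done
echo "${n_matched} files matched"

# Clean up the throwaway directory
rm -r "${demo_dir}"
```

Swapping `echo` in for the real command like this is a handy way to check what a wildcard will match before running something slower such as `fastqc`.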

In [None]:
ls fastqc-outputs

We can look at one of these by navigating to it on the left and double-clicking on the html file, or just by [**clicking here for sample F10_R1**](../GL4U-amplicon-tutorial/fastqc-outputs/F10_R1_raw_fastqc.html) (so long as we haven't deleted it yet).
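
If we wanted to confirm that every read file got a report without clicking through them, a loop like the following sketch could help. It assumes the read files end in `.fastq.gz` and that each report is named `<sample>_fastqc.html`, matching the `F10_R1_raw_fastqc.html` example linked above. It uses empty mock files in a temporary directory rather than real FastQC output.

```bash
# Throwaway directory with mock inputs and outputs
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/raw-reads" "${demo_dir}/fastqc-outputs"

# Mock read files (assumed naming) and only ONE mock report,
# so the check below has something to find
touch "${demo_dir}/raw-reads/F10_R1_raw.fastq.gz" \
      "${demo_dir}/raw-reads/F10_R2_raw.fastq.gz"
touch "${demo_dir}/fastqc-outputs/F10_R1_raw_fastqc.html"

missing=""
for reads in "${demo_dir}"/raw-reads/*.gz; do
    # Strip the directory and the assumed .fastq.gz suffix
    sample=$(basename "${reads}" .fastq.gz)
    report="${demo_dir}/fastqc-outputs/${sample}_fastqc.html"
    # Record any sample whose report file is not present
    if [ ! -f "${report}" ]; then
        missing="${missing} ${sample}"
    fi
done

echo "Missing reports:${missing}"

# Clean up the throwaway directory
rm -r "${demo_dir}"
```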

Rather than look through all of them individually, we can take advantage of [MultiQC](https://multiqc.info/) to combine them for us:

In [None]:
multiqc -o fastqc-outputs -n raw_multiqc fastqc-outputs

And now let's remove all the individual files, using our `*` wildcard again:

In [None]:
rm fastqc-outputs/*fastqc*
ls fastqc-outputs

And we can open and look at the multiqc summary with the file browser on the left, or by [**clicking here**](../GL4U-amplicon-tutorial/fastqc-outputs/raw_multiqc.html). Be sure to click "Trust HTML" at the top-left after opening.
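
The clean-up step above relies on `*fastqc*` matching only the per-sample FastQC files and not the MultiQC summary. Here is a minimal sketch of previewing a wildcard with `ls` before handing it to `rm`, using empty mock files in a temporary directory (the file names are assumptions for illustration):

```bash
# Throwaway directory with mock output files
demo_dir=$(mktemp -d)
touch "${demo_dir}/A_R1_raw_fastqc.html" \
      "${demo_dir}/A_R1_raw_fastqc.zip" \
      "${demo_dir}/raw_multiqc.html"

# Preview: ls shows what the pattern expands to without deleting anything
ls "${demo_dir}"/*fastqc*

# Same pattern, now given to rm
rm "${demo_dir}"/*fastqc*

# Only the file that did not match the pattern is left
remaining=$(ls "${demo_dir}")
echo "${remaining}"

# Clean up the throwaway directory
rm -r "${demo_dir}"
```

Because `raw_multiqc.html` contains "multiqc" but not "fastqc", the pattern leaves it alone.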

**For now, we are going to move on to our [amplicon processing notebook](05-amplicon-processing.ipynb), but we will return here after we filter our reads in order to use FastQC/MultiQC again.**

---

## Quality assessment of filtered reads

After we've generated our trimmed and filtered reads in the processing notebook, we can move forward here with fastqc and multiqc on them.
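
Since this section depends on files produced in the other notebook, a small guard like the following sketch can confirm that a directory actually holds `.gz` files before we run anything on it. The `check_reads_dir` function is a hypothetical helper written for this illustration, and it is exercised here on a mock directory rather than our real one.

```bash
# check_reads_dir is a hypothetical helper for this sketch:
# prints "ok" if the directory holds at least one .gz file, else "empty"
check_reads_dir() {
    dir=$1
    # ls exits non-zero when the pattern matches nothing
    if ls "${dir}"/*.gz > /dev/null 2>&1; then
        echo "ok"
    else
        echo "empty"
    fi
}

# Throwaway directory standing in for our working directory
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/trimmed-and-filtered-reads"

# Check once before any files exist, and once after adding a mock file
before=$(check_reads_dir "${demo_dir}/trimmed-and-filtered-reads")
touch "${demo_dir}/trimmed-and-filtered-reads/A_R1_filtered.fastq.gz"
after=$(check_reads_dir "${demo_dir}/trimmed-and-filtered-reads")

echo "before: ${before}, after: ${after}"

# Clean up the throwaway directory
rm -r "${demo_dir}"
```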

First let's check that our filtered read files are present where we expect:

In [None]:
ls trimmed-and-filtered-reads

Now we run fastqc and multiqc the same way we did above, except this time pointing fastqc at the filtered-reads directory and naming the summary `filtered_multiqc`:

In [None]:
fastqc -t 6 -q -o fastqc-outputs trimmed-and-filtered-reads/*.gz 

In [None]:
multiqc -o fastqc-outputs -n filtered_multiqc fastqc-outputs

And again removing all intermediate files:

In [None]:
rm fastqc-outputs/*fastqc*

In [None]:
ls fastqc-outputs

Then like before, we can open and look at the multiqc summary with the file browser on the left, or by [**clicking here**](../GL4U-amplicon-tutorial/fastqc-outputs/filtered_multiqc.html). Be sure to click "Trust HTML" at the top-left after opening.

**Now let's head back to the [amplicon processing notebook](05-amplicon-processing.ipynb#Generate-error-model-of-data), where we are ready to proceed with generating an error profile of our data.**


---
---

[**Previous:** 3. R intro](03-R-intro.ipynb)
<br>

<div style="text-align: right"><a href="05-amplicon-processing.ipynb"><b>Next:</b> 5. Amplicon processing</a></div>

