<hr style="height:0px; visibility:hidden;" />

<h1><center>4. Setup and QC</center></h1>

---

[**Previous:** 3. R intro](03-R-intro.ipynb)
<br>

<div style="text-align: right"><a href="05-amplicon-processing.ipynb"><b>Next:</b> 5. Amplicon processing</a></div>

---
---

Here we are going to set up the directory structure we will use to process our data, and then use a popular tool to assess the quality of our reads. This notebook uses a "Bash" kernel; Bash is one of the most common shell languages in Unix-like environments.

---

## Setting up
First we are going to create a new location for us to work in, and then change into it:

In [4]:
mkdir -p ~/GL4U-amplicon-tutorial
cd ~/GL4U-amplicon-tutorial

Then let's check where we are and if there are any files/directories present:

In [14]:
pwd
ls

/Users/mdlee4/GL4U-amplicon-tutorial
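
As a quick aside, the `-p` flag is what lets us re-run that `mkdir` cell without an error if the directory already exists (it also creates any missing parent directories). Here is a minimal sketch of that behavior, run in a throwaway temporary directory; the `demo_dir` name and the `project/sub-dir` path are made up just for this illustration:

```bash
# Work in a throwaway location so nothing in the tutorial directory is touched
demo_dir=$(mktemp -d)

# First call creates the directory, including the missing parent "project"
mkdir -p "${demo_dir}/project/sub-dir"

# Second call does nothing and still succeeds, because of -p
mkdir -p "${demo_dir}/project/sub-dir"
echo "second mkdir -p exit status: $?"

# Without -p, mkdir fails when the directory already exists
if mkdir "${demo_dir}/project/sub-dir" 2> /dev/null; then
    plain_mkdir_result="succeeded"
else
    plain_mkdir_result="failed"
fi
echo "mkdir without -p on an existing directory: ${plain_mkdir_result}"
```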


Next we are going to download the raw data files we'll be starting with, and then unpack them:

In [15]:
curl -L -o raw-reads.tar.gz https://figshare.com/ndownloader/files/39537235
tar -xf raw-reads.tar.gz
rm raw-reads.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 18.0M  100 18.0M    0     0  1855k      0  0:00:09  0:00:09 --:--:-- 2653k
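
A few notes on those flags: for `curl`, `-L` follows redirects (which download links like this one often need) and `-o` sets the name of the file to write; for `tar`, `-x` extracts and `-f` names the archive. If we ever want to see what is inside an archive before unpacking it, `-t` lists the contents instead. Below is a small sketch using a tiny archive we build ourselves, so it runs without downloading anything; the file and directory names are placeholders, not part of the tutorial data:

```bash
demo_dir=$(mktemp -d)

# Make a couple of placeholder files to archive
mkdir -p "${demo_dir}/example-reads"
printf 'placeholder\n' > "${demo_dir}/example-reads/A_R1.fastq"
printf 'placeholder\n' > "${demo_dir}/example-reads/A_R2.fastq"

# Build a gzip-compressed tarball (-c create, -z gzip, -f archive name, -C directory to work from)
tar -czf "${demo_dir}/example.tar.gz" -C "${demo_dir}" example-reads

# List the contents without unpacking anything (-t lists instead of extracting)
tar -tzf "${demo_dir}/example.tar.gz"

# Unpack into a chosen directory rather than the current one
mkdir -p "${demo_dir}/unpacked"
tar -xzf "${demo_dir}/example.tar.gz" -C "${demo_dir}/unpacked"
ls "${demo_dir}/unpacked/example-reads"
```

In our case, the archive unpacks into a `raw-reads` directory holding the read files.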


Now we can see we have them:

In [16]:
ls

raw-reads


In [17]:
ls raw-reads

F10_R1_raw.fastq.gz  F8_R2_raw.fastq.gz   G5_R1_raw.fastq.gz
F10_R2_raw.fastq.gz  F9_R1_raw.fastq.gz   G5_R2_raw.fastq.gz
F3_R1_raw.fastq.gz   F9_R2_raw.fastq.gz   G8_R1_raw.fastq.gz
F3_R2_raw.fastq.gz   G10_R1_raw.fastq.gz  G8_R2_raw.fastq.gz
F5_R1_raw.fastq.gz   G10_R2_raw.fastq.gz  G9_R1_raw.fastq.gz
F5_R2_raw.fastq.gz   G3_R1_raw.fastq.gz   G9_R2_raw.fastq.gz
F8_R1_raw.fastq.gz   G3_R2_raw.fastq.gz   sample-info.tsv


And we also grabbed a file with some sample information. Let's move that to our current directory, and then we can look at it with the `column` command, which formats it a bit for us:

In [25]:
mv raw-reads/sample-info.tsv .
column -t sample-info.tsv

sample_ID  treatment  color
F10        flight     blue
F3         flight     blue
F5         flight     blue
F8         flight     blue
F9         flight     blue
G10        ground     chocolate4
G3         ground     chocolate4
G5         ground     chocolate4
G8         ground     chocolate4
G9         ground     chocolate4
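
Because this is a plain tab-delimited file, standard Unix tools can summarize it. The sketch below rebuilds the same table in a temporary file (so it runs anywhere), counts the samples per treatment, and prints the raw read file names we would expect for each sample based on the naming seen in `raw-reads`. The variable names are just for illustration:

```bash
# Recreate the sample table as a tab-delimited temporary file
demo_file=$(mktemp)
printf '%s\t%s\t%s\n' \
    sample_ID treatment color \
    F10 flight blue \
    F3 flight blue \
    F5 flight blue \
    F8 flight blue \
    F9 flight blue \
    G10 ground chocolate4 \
    G3 ground chocolate4 \
    G5 ground chocolate4 \
    G8 ground chocolate4 \
    G9 ground chocolate4 > "${demo_file}"

# Skip the header (tail -n +2), keep the treatment column (cut -f 2), then tally
tail -n +2 "${demo_file}" | cut -f 2 | sort | uniq -c

# Count the "flight" rows with awk, skipping the header line
n_flight=$(awk -F '\t' 'NR > 1 && $2 == "flight" { n++ } END { print n + 0 }' "${demo_file}")
echo "flight samples: ${n_flight}"

# Print the forward/reverse file names expected for each sample ID
for sample in $(tail -n +2 "${demo_file}" | cut -f 1); do
    echo "${sample}_R1_raw.fastq.gz  ${sample}_R2_raw.fastq.gz"
done
```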


And lastly for setup, we're going to create all the directories we want to use while processing:

In [19]:
mkdir -p trimmed-and-filtered-reads fastqc-outputs final-outputs

In [20]:
ls

fastqc-outputs	final-outputs  raw-reads  trimmed-and-filtered-reads


## Quality assessment of raw reads

We are going to use the popular tool [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess some basic quality metrics about our reads. Here is how we can run it on our raw reads:

In [21]:
fastqc -t 4 -q -o fastqc-outputs raw-reads/*.gz 

<div class="alert alert-block alert-info">
<b>Code Breakdown</b>
<br>

- `fastqc`   - the primary command we're using
    - `-t`   - where we can specify how many files we want to process at a time, in parallel
    - `-q`   - telling the program not to print out everything it's doing (just because it's a little messy when running in parallel)
    - `-o`   - the directory where we want to put the output files
    - `raw-reads/*.gz` - we didn't get to cover this in our Unix intro, but this provides all of the read files as *positional* arguments; the `*` is a wildcard that here means "anything in that directory that ends with `.gz`"

</div>
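
To make the wildcard part concrete: the shell expands `raw-reads/*.gz` into the list of matching file names *before* `fastqc` starts, so the program just sees a list of files. A minimal sketch, using empty placeholder files in a temporary directory (the names are invented for the demo):

```bash
demo_dir=$(mktemp -d)
touch "${demo_dir}/A_R1_raw.fastq.gz" "${demo_dir}/A_R2_raw.fastq.gz" "${demo_dir}/notes.txt"

# The shell swaps the pattern for the matching file names,
# so echo (or fastqc) receives a plain list of paths; notes.txt is not included
echo "${demo_dir}"/*.gz

# Narrower patterns work the same way, e.g. forward reads only
echo "${demo_dir}"/*_R1_*.gz

# Count how many files the pattern matches by looping over the expansion
n_gz=0
for f in "${demo_dir}"/*.gz; do
    n_gz=$((n_gz + 1))
done
echo "files matching *.gz: ${n_gz}"
```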

In [22]:
ls fastqc-outputs

F10_R1_raw_fastqc.html	F8_R2_raw_fastqc.html	G5_R1_raw_fastqc.html
F10_R1_raw_fastqc.zip	F8_R2_raw_fastqc.zip	G5_R1_raw_fastqc.zip
F10_R2_raw_fastqc.html	F9_R1_raw_fastqc.html	G5_R2_raw_fastqc.html
F10_R2_raw_fastqc.zip	F9_R1_raw_fastqc.zip	G5_R2_raw_fastqc.zip
F3_R1_raw_fastqc.html	F9_R2_raw_fastqc.html	G8_R1_raw_fastqc.html
F3_R1_raw_fastqc.zip	F9_R2_raw_fastqc.zip	G8_R1_raw_fastqc.zip
F3_R2_raw_fastqc.html	G10_R1_raw_fastqc.html	G8_R2_raw_fastqc.html
F3_R2_raw_fastqc.zip	G10_R1_raw_fastqc.zip	G8_R2_raw_fastqc.zip
F5_R1_raw_fastqc.html	G10_R2_raw_fastqc.html	G9_R1_raw_fastqc.html
F5_R1_raw_fastqc.zip	G10_R2_raw_fastqc.zip	G9_R1_raw_fastqc.zip
F5_R2_raw_fastqc.html	G3_R1_raw_fastqc.html	G9_R2_raw_fastqc.html
F5_R2_raw_fastqc.zip	G3_R1_raw_fastqc.zip	G9_R2_raw_fastqc.zip
F8_R1_raw_fastqc.html	G3_R2_raw_fastqc.html
F8_R1_raw_fastqc.zip	G3_R2_raw_fastqc.zip


We can look at one of these by navigating to it on the left and double-clicking on the html file, or just by **clicking here for sample F10_R1 (fix when doing on actual system)** (so long as we didn't delete the file already).

Rather than look through all of them individually, we can take advantage of [MultiQC](https://multiqc.info/) to combine them for us:

In [23]:
multiqc -o fastqc-outputs -n raw_multiqc fastqc-outputs


  /// MultiQC 🔍 | v1.12

|           multiqc | MultiQC Version v1.14 now available!
|           multiqc | Search path : /Users/mdlee4/GL4U-amplicon-tutorial/fastqc-outputs
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 40/40
|            fastqc | Found 20 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : fastqc-outputs/raw_multiqc.html
|           multiqc | Data        : fastqc-outputs/raw_multiqc_data
|           multiqc | MultiQC complete


And now let's remove all the individual files, using our `*` wildcard again:

In [24]:
rm fastqc-outputs/*fastqc*
ls fastqc-outputs

raw_multiqc.html  raw_multiqc_data
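
Since `rm` is permanent, one habit worth having is to preview what a wildcard matches with `ls` before handing the same pattern to `rm`. Here is a small sketch with placeholder files in a temporary directory (not the real outputs), which also shows why `*fastqc*` leaves the MultiQC summary alone:

```bash
demo_dir=$(mktemp -d)
touch "${demo_dir}/F3_R1_raw_fastqc.html" \
      "${demo_dir}/F3_R1_raw_fastqc.zip" \
      "${demo_dir}/raw_multiqc.html"

# Preview: list what the pattern matches before deleting anything
ls "${demo_dir}"/*fastqc*

# Delete only once the preview shows the files we expect
rm "${demo_dir}"/*fastqc*

# "raw_multiqc.html" does not contain "fastqc", so it is still here
ls "${demo_dir}"
```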


And we can open and look at the multiqc summary with the file browser on the left, or by clicking **here (fix link when doing on actual system)**. Be sure to click "Trust HTML" at the top-left after opening.

For now, we are going to move on to our [Amplicon processing notebook](05-amplicon-processing.ipynb), but we will return here after we filter our reads in order to use FastQC/MultiQC again. 

---

## Quality assessment of filtered reads

After we've generated our trimmed and filtered reads in the processing notebook, we can move forward here with fastqc and multiqc on them.

First let's check that our filtered read files are present where we expect:

In [26]:
ls trimmed-and-filtered-reads

F10_R1_filtered.fastq.gz  F8_R2_filtered.fastq.gz   G5_R1_filtered.fastq.gz
F10_R2_filtered.fastq.gz  F9_R1_filtered.fastq.gz   G5_R2_filtered.fastq.gz
F3_R1_filtered.fastq.gz   F9_R2_filtered.fastq.gz   G8_R1_filtered.fastq.gz
F3_R2_filtered.fastq.gz   G10_R1_filtered.fastq.gz  G8_R2_filtered.fastq.gz
F5_R1_filtered.fastq.gz   G10_R2_filtered.fastq.gz  G9_R1_filtered.fastq.gz
F5_R2_filtered.fastq.gz   G3_R1_filtered.fastq.gz   G9_R2_filtered.fastq.gz
F8_R1_filtered.fastq.gz   G3_R2_filtered.fastq.gz
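
One simple number to compare before and after filtering is how many reads each file holds. Assuming standard four-line FASTQ records (typical for sequencer output, but an assumption here), dividing the line count by 4 gives the read count. The sketch below uses a tiny made-up file so it runs on its own; pointing the loop at `raw-reads/*.gz` or `trimmed-and-filtered-reads/*.gz` would be the real-data equivalent:

```bash
demo_dir=$(mktemp -d)

# Two made-up reads; each FASTQ record is four lines (header, sequence, "+", qualities)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' \
    | gzip > "${demo_dir}/example_R1_filtered.fastq.gz"

for fq in "${demo_dir}"/*.fastq.gz; do
    # Decompress to standard output and count lines, then convert lines to reads
    n_lines=$(gunzip -c "${fq}" | wc -l)
    n_reads=$((n_lines / 4))
    echo "$(basename "${fq}"): ${n_reads} reads"
done
```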


Here we run fastqc the same way we did above, except pointing it at the filtered reads, and then run multiqc on the new reports:

In [27]:
fastqc -t 4 -q -o fastqc-outputs trimmed-and-filtered-reads/*.gz 

In [28]:
multiqc -o fastqc-outputs -n filtered_multiqc fastqc-outputs


  /// MultiQC 🔍 | v1.12

|           multiqc | MultiQC Version v1.14 now available!
|           multiqc | Search path : /Users/mdlee4/GL4U-amplicon-tutorial/fastqc-outputs
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 47/47
|            snippy | Found 1 reports
|          bargraph | Tried to make bar plot, but had no data: snippy_variants
|            fastqc | Found 20 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : fastqc-outputs/filtered_multiqc.html
|           multiqc | Data        : fastqc-outputs/filtered_multiqc_data
|           multiqc | MultiQC complete


And again removing all intermediate files:

In [31]:
rm fastqc-outputs/*fastqc*

rm: cannot remove 'fastqc-outputs/*fastqc*': No such file or directory



In [32]:
ls fastqc-outputs

filtered_multiqc.html  raw_multiqc.html
filtered_multiqc_data  raw_multiqc_data
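
A note on the `No such file or directory` message above: that is what `rm` reports when a wildcard matches nothing, for example if the cell is run again after the files are already gone. By default, Bash passes an unmatched pattern along as literal text. A minimal sketch in an empty temporary directory (the variable names are just for the demo):

```bash
demo_dir=$(mktemp -d)   # empty directory, so the pattern below matches nothing

# With no match, the pattern is left as literal text...
echo "${demo_dir}"/*fastqc*

# ...so rm is asked to delete a file literally named '*fastqc*' and fails
if rm "${demo_dir}"/*fastqc* 2> /dev/null; then
    rm_result="succeeded"
else
    rm_result="failed"
fi
echo "plain rm with no matches: ${rm_result}"

# -f tells rm to stay quiet and report success when the file does not exist
rm -f "${demo_dir}"/*fastqc*
rm_f_status=$?
echo "rm -f exit status: ${rm_f_status}"
```

Adding `-f` is one option if we want a cleanup cell to be safely re-runnable, though it also hides typos in the pattern, so it is a trade-off.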


Then, like before, we can open and look at that html report either by navigating to it in the file browser on the left of JupyterLab, or by clicking **here (fix when on system)**. Be sure to click "Trust HTML" at the top-left after opening it.

**Now let's head back to the [amplicon processing notebook](05-amplicon-processing.ipynb#Generate-error-model-of-data), where we are ready to proceed with generating an error profile of our data.**


---
---

[**Previous:** 3. R intro](03-R-intro.ipynb)
<br>

<div style="text-align: right"><a href="05-amplicon-processing.ipynb"><b>Next:</b> 5. Amplicon processing</a></div>

