
## Initial QC for Nanopore and Illumina SARS-CoV-2 genome sequencing
In this notebook we will analyze the results of sequencing runs that use two approaches to sequence the SARS-CoV-2 genome. Both are based on the **ARTIC protocol**, developed by the [ARTIC Network](https://artic.network/ncov-2019). For Illumina, we use the classic ARTIC protocol, which amplifies the SARS-CoV-2 genome in 98 fragments of about 400 bp each. For Nanopore, we use the **"Midnight" protocol**, which amplifies 29 overlapping fragments of about 1200 bp each that together cover the entire SARS-CoV-2 genome.
This notebook covers the following steps:

*   Download data
*   Install software and prepare environment
*   Run quality control on the sequencing runs


# Download data

In [None]:
!gdown 1JkUU3wcexm9Y532l6IbdIsdY4saMehJO ; unzip Illumina_READS.zip
!gdown 1rRhK7H7R9aiPooqtKtT8kugnLNsqmkkR ; unzip Nanopore_READS.zip

Downloading...
From: https://drive.google.com/uc?id=1JkUU3wcexm9Y532l6IbdIsdY4saMehJO
To: /content/Illumina_READS.zip
100% 704M/704M [00:04<00:00, 146MB/s]
Archive:  Illumina_READS.zip
  inflating: Illumina_fastq/samplesheet.csv  
  inflating: Illumina_fastq/fastq/ERR5761182_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5761182_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5914874_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5914874_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5921612_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5926784_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5926784_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5921129_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR6129126_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5932418_1.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR6129126_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5921129_2.fastq.gz  
  inflating: Illumina_fastq/fastq/ERR5921612_1.fastq.gz  
  inflating: Il

Installing condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:25
🔁 Restarting kernel...


# Install software
Install FastQC and NanoPlot

In [None]:
!conda install -y -c bioconda fastqc
!pip install nanoplot

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/lo

In [None]:
!pip install multiqc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting multiqc
  Downloading multiqc-1.13-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 18.5 MB/s eta 0:00:00
Collecting matplotlib>=2.1.1
  Downloading matplotlib-3.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.2/11.2 MB 65.5 MB/s eta 0:00:00
Collecting rich>=10
  Downloading rich-12.6.0-py3-none-any.whl (237 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 237.5/237.5 kB 26.5 MB/s eta 0:00:00
Collecting networkx>=2.5.1
  Downloading networkx-2.6.3-py3-none-any.whl (1.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 74.6 MB/s eta 0:00:00
Collecting coloredlogs
  Downloading coloredlogs-15.0

# Fastq format

Sequencers deliver their reads in a format called **fastq**. The structure is shown below. Each sequence in a fastq file is represented by 4 lines:

```
@SEQ_ID                   <---- SEQUENCE NAME
AGCGTGTACTGTGCATGTCGATG   <---- SEQUENCE AS BASES
+                         <---- SEPARATOR LINE
%%).1***-+*''))**55CCFF   <---- ASCII QUALITY SCORES

```

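The quality of each base is encoded as a single ASCII character. See [here](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) for an explanation.
The numerical value of each character (its ASCII code minus 33 in the standard Sanger / Illumina 1.8+ encoding) is the Phred quality score Q, where the estimated probability of a base-calling error is 10^(-Q/10).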
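
The decoding of a quality line can be sketched in a few lines of Python. This is a minimal illustration, assuming the Phred+33 encoding; the function names `phred_scores` and `error_probability` are made up for this example and do not come from any library. It uses the quality string from the record shown above.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33, the usual modern encoding).
    """
    return [ord(character) - offset for character in quality_line]


def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Quality line taken from the example record above
quality = "%%).1***-+*''))**55CCFF"
scores = phred_scores(quality)

print(scores)
print(f"lowest Q = {min(scores)}, highest Q = {max(scores)}")
print(f"error probability at Q20 = {error_probability(20)}")
```

A score of Q20 corresponds to a 1 in 100 chance of a wrong base call, and Q30 to 1 in 1000.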

# Illumina QC

We will use [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to analyze the results of an Illumina run. FastQC runs a series of analyses on fastq files and reports the results as an HTML file that you can open in a browser. For help on any of the report sections, please check the following links.

*   [Basic statistics](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/1%20Basic%20Statistics.html)
*   [Per base sequence quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)
*   [Per base sequence content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html)
*   [Per sequence GC content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html)
*   [Per base N content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/6%20Per%20Base%20N%20Content.html)
*   [Sequence length distribution](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html)
*   [Duplicate Sequences](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html)
*   [Overrepresented Sequences](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html)
*   [Adapter content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html)
*   [Kmer content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html)
*   [Per tile sequence quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html)


Run FastQC from the command line

In [None]:
#Create a directory to store all FastQC results and run FastQC
!mkdir -p Illumina_fastqc_results
!fastqc -o Illumina_fastqc_results /content/Illumina_fastq/fastq/*

Started analysis of ERR5761182_1.fastq.gz
Approx 5% complete for ERR5761182_1.fastq.gz
Approx 10% complete for ERR5761182_1.fastq.gz
Approx 15% complete for ERR5761182_1.fastq.gz
Approx 20% complete for ERR5761182_1.fastq.gz
Approx 25% complete for ERR5761182_1.fastq.gz
Approx 30% complete for ERR5761182_1.fastq.gz
Approx 35% complete for ERR5761182_1.fastq.gz
Approx 40% complete for ERR5761182_1.fastq.gz
Approx 45% complete for ERR5761182_1.fastq.gz
Approx 50% complete for ERR5761182_1.fastq.gz
Approx 55% complete for ERR5761182_1.fastq.gz
Approx 60% complete for ERR5761182_1.fastq.gz
Approx 65% complete for ERR5761182_1.fastq.gz
Approx 70% complete for ERR5761182_1.fastq.gz
Approx 75% complete for ERR5761182_1.fastq.gz
Approx 80% complete for ERR5761182_1.fastq.gz
Approx 85% complete for ERR5761182_1.fastq.gz
Approx 90% complete for ERR5761182_1.fastq.gz
Approx 95% complete for ERR5761182_1.fastq.gz
Analysis complete for ERR5761182_1.fastq.gz
Started analysis of ERR5761182_2.fastq.gz

As we did in the previous module, we can summarize the FastQC results using MultiQC.

In [None]:
!multiqc -o /content/Illumina_fastqc_results/ /content/Illumina_fastqc_results/


  /// MultiQC 🔍 | v1.13

|           multiqc | Search path : /content/Illumina_fastqc_results
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 32/32
|            fastqc | Found 16 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : Illumina_fastqc_results/multiqc_report.html
|           multiqc | Data        : Illumina_fastqc_results/multiqc_data
|           multiqc | MultiQC complete


This will create an HTML result file (`multiqc_report.html`) with a summary of the FastQC reports.

Browse the results for each file and answer:

> **Which sample has the most reads?**

Sample ERR5932418 (read files `ERR5932418_1` and `ERR5932418_2`).

> **Is there a distribution of sequence sizes?**

Yes, the reads show a distribution of sequence lengths, with two peaks at different sizes.
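
The read counts and length distributions that FastQC reports can also be checked by hand. The following is a minimal sketch, not how FastQC itself works: it assumes strict 4-line fastq records (sequences never wrapped over several lines), the helper names `read_lengths` and `summarize_fastq` are made up for this example, and the three reads in `demo` are invented so that the block runs without the course data.

```python
import gzip
import io
from collections import Counter


def read_lengths(handle):
    """Yield the length of each read from an open FASTQ text handle.

    Assumes strict 4-line records (one line per sequence).
    """
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence is the 2nd line of each record
            yield len(line.strip())


def summarize_fastq(path):
    """Return (number of reads, Counter mapping read length -> count)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lengths = Counter(read_lengths(handle))
    return sum(lengths.values()), lengths


# Tiny invented example so the sketch runs without the downloaded files
demo = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n@read3\nACGT\n+\nIIII\n"
lengths = Counter(read_lengths(io.StringIO(demo)))

print(sum(lengths.values()), "reads")
for length, count in sorted(lengths.items()):
    print(f"length {length}: {count} reads")
```

To apply it to the real data, call `summarize_fastq` on one of the downloaded files, for example `/content/Illumina_fastq/fastq/ERR5932418_1.fastq.gz`.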


# Nanopore QC

Run FastQC from the command line. Note that FastQC was designed for short reads and is not a good choice for Nanopore data; NanoPlot, which we run further down, is better suited.

In [None]:
!mkdir -p Nanopore_FastQC_report
!fastqc -o Nanopore_FastQC_report /content/Nanopore_READS/nanopore_fastq/barcode*/*

Started analysis of barcode41.fastq.gz
Approx 20% complete for barcode41.fastq.gz
Approx 40% complete for barcode41.fastq.gz
Approx 65% complete for barcode41.fastq.gz
Approx 85% complete for barcode41.fastq.gz
Analysis complete for barcode41.fastq.gz
Started analysis of barcode53.fastq.gz
Approx 5% complete for barcode53.fastq.gz
Approx 10% complete for barcode53.fastq.gz
Approx 15% complete for barcode53.fastq.gz
Approx 20% complete for barcode53.fastq.gz
Approx 25% complete for barcode53.fastq.gz
Approx 30% complete for barcode53.fastq.gz
Approx 35% complete for barcode53.fastq.gz
Approx 40% complete for barcode53.fastq.gz
Approx 45% complete for barcode53.fastq.gz
Approx 50% complete for barcode53.fastq.gz
Approx 55% complete for barcode53.fastq.gz
Approx 60% complete for barcode53.fastq.gz
Approx 65% complete for barcode53.fastq.gz
Approx 70% complete for barcode53.fastq.gz
Approx 75% complete for barcode53.fastq.gz
Approx 80% complete for barcode53.fastq.gz
Approx 85% complete fo

In [None]:
!multiqc -o /content/Nanopore_FastQC_report/ /content/Nanopore_FastQC_report/


  /// MultiQC 🔍 | v1.13

|           multiqc | Search path : /content/Nanopore_FastQC_report
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 14/14
|            fastqc | Found 7 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : Nanopore_FastQC_report/multiqc_report.html
|           multiqc | Data        : Nanopore_FastQC_report/multiqc_data
|           multiqc | MultiQC complete


Running NanoPlot for Nanopore data 

In [None]:
!NanoPlot -o nanoplot_output --fastq_rich /content/Nanopore_READS/nanopore_fastq/barcode*/*.fastq.gz 


Likely this indicates you are combining multiple runs.
Plots based on time are invalid and therefore truncated to first 5 days.





The output will be in the folder `nanoplot_output`. Download the file `NanoPlot-report.html` and browse the results.

---


> **How many reads are there in total?**

191,605

> **What is the average read length?**

619.2 bp

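> **How does this compare with the Illumina results?**

Illumina reads generally have higher per-base quality (a lower error rate) than Nanopore reads, so the Illumina data give a more accurate analysis, while Nanopore reads are typically longer.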
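
Summary values such as the number of reads, the mean read length and the read length N50 are simple functions of the list of read lengths. The sketch below shows one way to compute them; it is an illustration under stated assumptions, not NanoPlot's own code. The function names are made up, and the five read lengths are invented example values, not taken from the course data.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    if not lengths:
        raise ValueError("no reads")
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("no reads")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):  # longest reads first
        running_total += length
        if running_total >= half_of_bases:
            return length


# Invented read lengths, for illustration only
example_lengths = [400, 600, 1200, 1200, 300]

print("reads:", len(example_lengths))
print("mean length:", mean_length(example_lengths))
print("N50:", n50(example_lengths))
```

The N50 is less sensitive than the mean to a large number of very short reads, which is why long-read QC reports often quote both.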




We don't do any trimming because the pipelines we'll use do this for us. See you in the next notebook...