<a href="https://colab.research.google.com/github/GoekeLab/sg-nex-data/blob/master/docs/colab/Introduction_Genomics_1_GoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Genomics Workshop 1: Reference genome files, reads, and read alignment

In this workshop we will learn about common file types and formats, how to download genomics data, and how to process such data.

Here we will download and process long-read Nanopore RNA-Seq data from the SG-NEx project. We will use the AWS command line interface to access the data and then align the reads to the human reference genome using Minimap2. The resulting alignments (SAM) are then converted to compressed, sorted, and indexed BAM files using Samtools. Finally, we will visualise the data using the UCSC Genome Browser or IGV.

### Using Google Colab

This tutorial requires access to a shell (i.e. Linux, macOS, or the Windows Subsystem for Linux/WSL). If you do not have access to a shell, you can run this tutorial on Google Colab by clicking the badge at the top of this page.

If you use Google Colab, you have to add `!` before any shell command to execute it in a subshell. To change the working directory, use `%` instead (for example `%cd`), so that the change applies to the whole notebook session rather than only to the temporary subshell.
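
The difference between `!` and `%` can be reproduced in plain Python. The following is a minimal sketch, not a description of how Colab is implemented: it assumes that a `!` command behaves like a child process started with `subprocess.run`, and that `%cd` behaves roughly like `os.chdir`. The temporary directory is only there for illustration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # throwaway directory, for illustration only

# Similar to `! cd <dir>`: the `cd` runs in a child shell that exits immediately,
# so the working directory of the notebook process is unchanged.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_subshell = os.getcwd()
print("unchanged after subshell cd:", after_subshell == start_dir)

# Similar to `%cd <dir>`: the notebook process itself changes directory,
# so all following cells start from the new location.
os.chdir(target_dir)
after_chdir = os.getcwd()
print("changed after os.chdir:", os.path.samefile(after_chdir, target_dir))

os.chdir(start_dir)  # restore the original directory
```

This is why the cells below use `%cd` to move between the `reference`, `fastq`, and `bam` folders, while all other commands use `!`.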

## Installation

In [None]:
! python -m pip install awscli

In [None]:
! aws --version  

In [None]:
! curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
! ./minimap2-2.26_x64-linux/minimap2

In [None]:
! sudo ln -s /content/minimap2-2.26_x64-linux/minimap2 /usr/bin/minimap2

In [None]:
! minimap2 


In [None]:
! sudo apt install -y samtools

In [None]:
! samtools --version

## Data download

The Singapore Nanopore Expression Project (SG-NEx) has generated a comprehensive resource of long-read RNA-Sequencing data using the Oxford Nanopore third-generation sequencing platform. The data is hosted on the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/) and described in detail here: <https://github.com/GoekeLab/sg-nex-data>

For this workshop we will use a reduced data set that only includes data from human chromosome 22. The data can be accessed using the AWS command line interface (or using direct links, which you can find in the online documentation).

In [None]:
! aws s3 ls --no-sign-request s3://sg-nex-data/data/data_tutorial/

In [None]:
! mkdir -p workshop/reference
! mkdir -p workshop/fastq
! mkdir -p workshop/bam




### The reference genome and annotations

In [None]:
%cd workshop/reference/

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa.fai .

In [None]:
! ls -lh
! wc -l hg38_chr22.fa


In [None]:
! fold -w 80 hg38_chr22.fa | head


In [None]:

! fold -w 80 hg38_chr22.fa | head -n 300000 | tail
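
The `head` and `tail` commands above only show raw text. As a rough sketch of how the FASTA format is structured, the Python snippet below reads records from a FASTA stream: a header line starting with `>` gives the sequence name, and all following lines up to the next header are joined into one sequence. In UCSC-style genome files, `N` typically marks unknown bases and lowercase letters mark soft-masked repeats. The function name `read_fasta` and the toy sequences are invented for illustration and are not real chr22 data.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Toy input (hypothetical), two short records
toy_fasta = io.StringIO(">chrToy example\nNNNNACGT\nacgtACGT\n>chrToy2\nGGCC\n")
records = dict(read_fasta(toy_fasta))
for name, seq in records.items():
    # Report the length and the number of unknown (N) bases per sequence
    print(name, len(seq), seq.upper().count("N"))
```

To try this on the workshop data, the same function could be called on `open("hg38_chr22.fa")`; note that this loads the whole chromosome sequence into memory.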

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf .


In [None]:
! head hg38_chr22.gtf

In [None]:
! grep '"BCR"' hg38_chr22.gtf | head
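
Each GTF line shown by `head` and `grep` above has nine tab-separated columns (sequence name, source, feature type, start, end, score, strand, frame, attributes), where the last column holds `key "value";` pairs. The sketch below parses such lines and collects the distinct transcript IDs of one gene. It assumes the attribute names `gene_name` and `transcript_id` (the Ensembl/GENCODE convention), which you should check against the `head` output; the helper names and the toy lines are invented for illustration.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Split one GTF line into a dict; quoted attributes become a nested dict."""
    fields = dict(zip(GTF_COLUMNS, line.rstrip("\n").split("\t")))
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    # Attributes look like: key "value"; key "value"; (unquoted values are skipped)
    fields["attributes"] = dict(re.findall(r'(\w+) "([^"]*)"', fields["attributes"]))
    return fields

def transcript_ids(lines, gene_name):
    """Collect the distinct transcript IDs annotated for one gene name."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        attributes = parse_gtf_line(line)["attributes"]
        if attributes.get("gene_name") == gene_name and "transcript_id" in attributes:
            ids.add(attributes["transcript_id"])
    return ids

def toy_line(feature, start, end, attributes):
    """Build one invented GTF line (hypothetical data, not from hg38_chr22.gtf)."""
    return "\t".join(["chrToy", "toy", feature, str(start), str(end),
                      ".", "+", ".", attributes])

toy_gtf = [
    toy_line("gene", 100, 900, 'gene_id "G1"; gene_name "TOY";'),
    toy_line("transcript", 100, 900, 'gene_id "G1"; transcript_id "T1"; gene_name "TOY";'),
    toy_line("exon", 100, 300, 'gene_id "G1"; transcript_id "T1"; gene_name "TOY";'),
    toy_line("transcript", 100, 700, 'gene_id "G1"; transcript_id "T2"; gene_name "TOY";'),
    toy_line("transcript", 50, 80, 'gene_id "G2"; transcript_id "T3"; gene_name "OTHER";'),
]
toy_ids = transcript_ids(toy_gtf, "TOY")
print(sorted(toy_ids))
```

On the workshop file, the same idea would be `transcript_ids(open("hg38_chr22.gtf"), "BCR")`, provided the annotation uses the attribute names assumed here.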


> **Exercise:** What information can you find online about the BCR gene? 
> 
> Each gene can generate different RNA versions by using different exons. These different versions are called gene isoforms, or transcripts. How many transcripts can you find for the BCR gene? 
> 
> You can use a genome browser such as IGV or the UCSC genome browser (<https://genome.ucsc.edu/cgi-bin/hgGateway>) to explore the human genome, search for genes, or show specific genome coordinates.



### Fastq files (reads)

In [None]:
%cd ../fastq/

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/HepG2_directRNA_sample1.fastq.gz .

In [None]:
! ls -hl

In [None]:
! zcat HepG2_directRNA_sample1.fastq.gz | head
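
The output of `zcat ... | head` follows the usual FASTQ layout of four lines per read: a header starting with `@`, the sequence, a separator line starting with `+`, and one quality character per base. The sketch below reads such records from a gzip-compressed stream and converts the quality characters to Phred scores, assuming the common Phred+33 encoding. The function name `read_fastq` and the toy reads are invented for illustration.

```python
import gzip
import io

def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop "@" and any trailing description
        phred = [ord(char) - 33 for char in quality]  # assumes Phred+33 encoding
        yield read_id, sequence, phred

# Toy gzip-compressed FASTQ content (hypothetical reads), built in memory
toy_bytes = gzip.compress(b"@read1 toy\nACGT\n+\n!+5I\n@read2\nGG\n+\n##\n")
with gzip.open(io.BytesIO(toy_bytes), "rt") as handle:
    reads = list(read_fastq(handle))

for read_id, sequence, phred in reads:
    print(read_id, len(sequence), phred)
```

For the workshop data, the file could be opened with `gzip.open("HepG2_directRNA_sample1.fastq.gz", "rt")` and passed to the same function.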

>**Exercise:** Search for the read with the following id: "@b82f28d6-6ff1-4cf7-b41a-4888bfdb2641" (`zcat HepG2_directRNA_sample1.fastq.gz | grep -A 3 '@b82f28d6-6ff1-4cf7-b41a-4888bfdb2641'`; a FASTQ record spans four lines, so `-A 3` prints the header plus the three lines that follow it). Then align the sequence to the human genome using BLAT (<http://genome.ucsc.edu/cgi-bin/hgBlat>). Where in the human genome does this read align? Does the read match any annotated transcript?


In [None]:
! zcat HepG2_directRNA_sample1.fastq.gz | grep -A 3 '@b82f28d6-6ff1-4cf7-b41a-4888bfdb2641'

### Read Alignment

In [None]:
%cd ..
%ll

In [None]:
! minimap2 -ax splice -uf -k14 reference/hg38_chr22.fa fastq/HepG2_directRNA_sample1.fastq.gz > bam/HepG2_directRNA_sample1.sam

In [None]:
! ls -l bam/

In [None]:
! head bam/HepG2_directRNA_sample1.sam
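
In the SAM output above, lines starting with `@` are header lines, and every alignment line starts with eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). As a minimal sketch of how to read them, the snippet below parses one alignment line, decodes a few FLAG bits, and counts `N` operations in the CIGAR string, which represent skipped reference regions (introns in spliced alignments). The function name `parse_sam_line` and the toy alignment are invented for illustration.

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line into a dict."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))  # optional tags are ignored
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    flag = record["flag"]
    record["is_unmapped"] = bool(flag & 0x4)         # read has no alignment
    record["is_reverse"] = bool(flag & 0x10)         # aligned to the reverse strand
    record["is_secondary"] = bool(flag & 0x100)      # secondary alignment
    record["is_supplementary"] = bool(flag & 0x800)  # supplementary alignment
    # Each "<length>N" in the CIGAR string is one skipped region (intron)
    record["n_introns"] = len(re.findall(r"\d+N", record["cigar"]))
    return record

# Toy alignment line (hypothetical): reverse strand, one 200 bp intron
toy_sam = "read1\t16\tchrToy\t101\t60\t5M200N5M\t*\t0\t0\tACGTACGTAC\t*"
alignment = parse_sam_line(toy_sam)
print(alignment["rname"], alignment["pos"], alignment["is_reverse"], alignment["n_introns"])
```

When applying this to `bam/HepG2_directRNA_sample1.sam`, skip the header lines that start with `@` before calling the function.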

### SAM to BAM conversion

In [None]:
! samtools view -b bam/HepG2_directRNA_sample1.sam > bam/HepG2_directRNA_sample1.bam
! samtools sort bam/HepG2_directRNA_sample1.bam -o bam/HepG2_directRNA_sample1_sorted.bam
! samtools index bam/HepG2_directRNA_sample1_sorted.bam


In [None]:
! ls -l bam/

## Visualisation of aligned reads

In [None]:
! samtools view bam/HepG2_directRNA_sample1_sorted.bam | head

>**Exercise:** Visualise the BAM file using IGV or the UCSC genome browser. If you use the UCSC genome browser, you can use the processed data from the SG-NEx project by copying these lines into the custom track field:

```
track type=bigWig name="SGNex_HepG2_directRNA_replicate1_run3.bigwig" description="SGNex_HepG2_directRNA_replicate1_run3.bigwig" bigDataUrl=http://sg-nex-data.s3.amazonaws.com/data/sequencing_data_ont/genome_browser_data/bigwig/SGNex_HepG2_directRNA_replicate1_run3.bigwig

track type=bigBed name="SGNex_HepG2_directRNA_replicate1_run3.bigbed" description="SGNex_HepG2_directRNA_replicate1_run3.bigbed" bigDataUrl=http://sg-nex-data.s3.amazonaws.com/data/sequencing_data_ont/genome_browser_data/bigbed/SGNex_HepG2_directRNA_replicate1_run3.bigbed
```