# NGS pipeline for Alzheimer's Disease RNA-seq data

For this project I used only two samples: one control and one from a patient with advanced AD.

In [9]:
import os

Set up some global variables:

In [38]:
# Working directory for the sequencing data (on an external disk, because the PC is short on space)
os.environ['WORKDIR'] = '/media/mees/Elements/ngs_data/'
# Path to the reference genome FASTA file
os.environ['GENOME'] = '/media/mees/Elements/ngs_data/genomes/GRCh38.13.fa'
# Directory in which to save the reference genome index
os.environ['REF_GENOME'] = '/media/mees/Elements/ngs_data/genomes/ref_genome/'
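
Since every later step depends on these three variables, a quick check before running anything can save a failed run. The helper below is a minimal sketch; `missing_inputs` is a hypothetical name, and the rule that only `GENOME` must already exist on disk is an assumption (the working directory is created in the next cell and the index directory is an output).

```python
import os
from pathlib import Path


def missing_inputs(env=os.environ):
    """Return the names of pipeline variables that are unset or unusable."""
    problems = []
    # WORKDIR is created later with mkdir -p, so it only needs to be set.
    if not env.get("WORKDIR"):
        problems.append("WORKDIR")
    # GENOME is an input: the FASTA file has to exist already.
    genome = env.get("GENOME")
    if not genome or not Path(genome).is_file():
        problems.append("GENOME")
    # REF_GENOME is an output location, so it only needs to be set.
    if not env.get("REF_GENOME"):
        problems.append("REF_GENOME")
    return problems
```

Calling `missing_inputs()` in a cell should return an empty list when the external disk is mounted and the genome file is in place.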

In [7]:
!mkdir -p $WORKDIR
!bash download_data.sh -d $WORKDIR


2020-06-22T17:53:08 prefetch.2.10.7: 1) Downloading 'SRR2422918'...
2020-06-22T17:53:08 prefetch.2.10.7:  Downloading via HTTPS...
^C

2020-06-22T17:53:23 fasterq-dump.2.10.7 int: buffer insufficient while reading uri within cloud module - cannot Get Cloud Location
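
The contents of `download_data.sh` are not shown here. Judging from the log, it calls `prefetch` and `fasterq-dump` from the SRA Toolkit. The sketch below only builds the two command lines for one run; the function name `download_commands` and the idea of running both commands from inside the working directory are assumptions, not the script's actual logic.

```python
import shlex
import subprocess


def download_commands(accession, workdir):
    """Build the prefetch and fasterq-dump command lines for one SRA run."""
    # prefetch downloads the .sra archive for the accession.
    prefetch = ["prefetch", accession]
    # fasterq-dump converts the run to FASTQ and writes it to workdir.
    dump = ["fasterq-dump", accession, "--outdir", workdir]
    return [prefetch, dump]


def run_download(accession, workdir):
    """Run both steps from inside workdir; raises if either tool fails."""
    for cmd in download_commands(accession, workdir):
        subprocess.run(cmd, cwd=workdir, check=True)


# Show the commands for the two runs that appear in the logs.
for accession in ("SRR2422918", "SRR2422926"):
    for cmd in download_commands(accession, "/media/mees/Elements/ngs_data/"):
        print(shlex.join(cmd))
```

Using `check=True` makes an interrupted or failed download stop the pipeline instead of passing an incomplete file to the next step.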


After downloading the data and converting the .sra files to .fastq files, use FastQC to assess read quality.

In [27]:
!./fastqc.sh -d $WORKDIR 

Started analysis of SRR2422918.fastq
Approx 5% complete for SRR2422918.fastq
Approx 10% complete for SRR2422918.fastq
Approx 15% complete for SRR2422918.fastq
Approx 20% complete for SRR2422918.fastq
Approx 25% complete for SRR2422918.fastq
Approx 30% complete for SRR2422918.fastq
Approx 35% complete for SRR2422918.fastq
Approx 40% complete for SRR2422918.fastq
Approx 45% complete for SRR2422918.fastq
Approx 50% complete for SRR2422918.fastq
Approx 55% complete for SRR2422918.fastq
Approx 60% complete for SRR2422918.fastq
Approx 65% complete for SRR2422918.fastq
Approx 70% complete for SRR2422918.fastq
Approx 75% complete for SRR2422918.fastq
Approx 80% complete for SRR2422918.fastq
Approx 85% complete for SRR2422918.fastq
Approx 90% complete for SRR2422918.fastq
Approx 95% complete for SRR2422918.fastq
Analysis complete for SRR2422918.fastq
Started analysis of SRR2422926.fastq
Approx 5% complete for SRR2422926.fastq
Approx 10% complete for SRR2422926.fastq
Approx 15% complete for SRR2

I like to open the FastQC reports to check for abnormalities.

In [18]:
!firefox $WORKDIR/*.html
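
Besides the HTML report, FastQC normally writes a `*_fastqc.zip` archive containing a tab-separated `summary.txt` with one PASS/WARN/FAIL line per module. As a rough alternative to opening every report in a browser, the sketch below reads that file. The function names are hypothetical, and it assumes the zip archives sit next to the HTML reports.

```python
import zipfile


def parse_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        # Each line is: status <TAB> module name <TAB> file name
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            statuses[module] = status
    return statuses


def read_summary(zip_path):
    """Read summary.txt from a FastQC *_fastqc.zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # The archive holds one top-level folder; find summary.txt inside it.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_summary(archive.read(name).decode())


def flagged_modules(statuses):
    """Return the modules that did not pass, in their original order."""
    return [module for module, status in statuses.items() if status != "PASS"]
```

`flagged_modules(read_summary(path))` then lists only the modules worth a closer look in the full report.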

The reports show that T is overrepresented in the first 10 bp of the reads, so we trim the first 10 bases from every read using the seqtk package.

In [15]:
!./seqtk.sh -d $WORKDIR
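
The `seqtk.sh` wrapper is not shown; it presumably calls something like `seqtk trimfq -b 10`, where `-b` trims bases from the left end of each read. To make the operation concrete, here is a pure-Python sketch of the same idea. The function names and the toy record in the usage note are made up for illustration.

```python
def trim_fastq_record(record, n=10):
    """Drop the first n bases and their quality scores from one FASTQ record."""
    header, sequence, separator, quality = record
    # Sequence and quality strings must stay the same length after trimming.
    return (header, sequence[n:], separator, quality[n:])


def trim_fastq(lines, n=10):
    """Yield trimmed 4-line records from an iterable of FASTQ lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # Pulling from the same iterator four times groups lines into records.
    for record in zip(stripped, stripped, stripped, stripped):
        yield trim_fastq_record(record, n)
```

For a toy record `("@r1", "TTTTTTTTTTACGT", "+", "IIIIIIIIIIFFFF")`, trimming 10 bases leaves the sequence `ACGT` with quality string `FFFF`. For real data seqtk is much faster; the sketch is only meant to show what the trimming step does.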

With the reads trimmed, we can align them to the human reference genome. For this alignment, we use HISAT2.

First, build an index of the reference genome with hisat2-build.
Then perform the alignment with hisat2.
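
The two steps can be sketched as the command lines below. This is an illustration under assumptions, not the content of `build.sh`: the index base name `ref_genome` is taken from the build log, the reads are assumed to be single-end (hence `-U`), and the trimmed FASTQ and SAM file names are hypothetical.

```python
import os
import shlex


def hisat2_commands(genome_fasta, index_dir, reads_fastq, sam_out,
                    index_name="ref_genome"):
    """Build the hisat2-build and hisat2 command lines for one sample."""
    # HISAT2 refers to an index by its base name, not by a single file.
    index_base = os.path.join(index_dir, index_name)
    # Step 1: index the reference genome (writes index_base.*.ht2).
    build = ["hisat2-build", genome_fasta, index_base]
    # Step 2: align unpaired reads against the index and write SAM output.
    align = ["hisat2", "-x", index_base, "-U", reads_fastq, "-S", sam_out]
    return build, align


build, align = hisat2_commands(
    os.environ.get("GENOME", "genome.fa"),
    os.environ.get("REF_GENOME", "ref_genome/"),
    "SRR2422918.trimmed.fastq",  # hypothetical trimmed reads file
    "SRR2422918.sam",            # hypothetical output file
)
print(shlex.join(build))
print(shlex.join(align))
```

The index only needs to be built once; the alignment command is then repeated for each sample with the same `-x` value.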

In [40]:
!./build.sh -g $GENOME -o $REF_GENOME

Settings:
  Output files: "/media/mees/Elements/ngs_data/genomes/ref_genome//ref_genome.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /media/mees/Elements/ngs_data/genomes/GRCh38.13.fa
Reading reference sizes
  Time reading reference sizes: 00:01:34
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:01:34
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 145815024 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing

  bucket 14: 20%
  bucket 14: 30%
  bucket 14: 40%
  Sorting block time: 00:01:30
Returning block of 120496617 for bucket 11
  Sorting block time: 00:01:40
Returning block of 128556393 for bucket 10
  bucket 14: 50%
  bucket 14: 60%
  bucket 14: 70%
  bucket 14: 80%
  bucket 14: 90%
  Sorting block time: 00:01:39
Returning block of 134492141 for bucket 13
  bucket 14: 100%
  Sorting block of length 94059793 for bucket 14
  (Using difference cover)
Getting block 15 of 29
  Reserving size (145815024) for bucket 15
  Calculating Z arrays for bucket 15
  Entering block accumulator loop for bucket 15:
Getting block 16 of 29
  Reserving size (145815024) for bucket 16
  Calculating Z arrays for bucket 16
  Entering block accumulator loop for bucket 16:
  bucket 15: 10%
  bucket 16: 10%
  bucket 16: 20%
  bucket 15: 20%
Getting block 17 of 29
  Reserving size (145815024) for bucket 17
  Calculating Z arrays for bucket 17
  Entering block accumulator loop for bucket 17:
  bucket 16: 30%
  bucke

  Sanity-checking and returning
Building samples
Reserving space for 58 sample suffixes
Generating random suffixes
QSorting 58 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 58 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Split 8, merged 27; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 4, merged 5; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Split 1, merged 1; iterating...
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 8.40735e+07 (target: 109361267)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop
Getting block 1 of 37
  Reserving size (109361268) for bucket 1
Getting block 2 of 37
  Reserving size (109361268) for bucket 2
  Calculating Z arrays for bucket 1

  bucket 17: 10%
  bucket 17: 20%
  bucket 17: 30%
  Sorting block time: 00:01:17
Returning block of 105181795 for bucket 16
  bucket 17: 40%
  bucket 17: 50%
  bucket 17: 60%
  bucket 17: 70%
  bucket 17: 80%
Getting block 18 of 37
  Reserving size (109361268) for bucket 18
  Calculating Z arrays for bucket 18
  Entering block accumulator loop for bucket 18:
  bucket 17: 90%
Getting block 19 of 37
  Reserving size (109361268) for bucket 19
  Calculating Z arrays for bucket 19
  Entering block accumulator loop for bucket 19:
  bucket 18: 10%
  bucket 17: 100%
  Sorting block of length 86316295 for bucket 17
  (Using difference cover)
  bucket 19: 10%
  bucket 18: 20%
  bucket 19: 20%
  bucket 18: 30%
Getting block 20 of 37
  Reserving size (109361268) for bucket 20
  Calculating Z arrays for bucket 20
  Entering block accumulator loop for bucket 20:
  bucket 18: 40%
  bucket 19: 30%
  bucket 20: 10%
  bucket 18: 50%
  bucket 20: 20%
  bucket 19: 40%
  bucket 18: 60%
  bucket 20: 30%
  

  bucket 33: 90%
  bucket 35: 60%
  bucket 34: 80%
  bucket 33: 100%
  Sorting block of length 73322670 for bucket 33
  (Using difference cover)
  bucket 35: 70%
  bucket 34: 90%
  bucket 35: 80%
  bucket 34: 100%
  Sorting block of length 72491950 for bucket 34
  (Using difference cover)
  bucket 35: 90%
  bucket 35: 100%
  Sorting block of length 108688505 for bucket 35
  (Using difference cover)
  Sorting block time: 00:01:20
Returning block of 108431909 for bucket 32
  Sorting block time: 00:00:58
Returning block of 73322671 for bucket 33
  Sorting block time: 00:00:55
Returning block of 72491951 for bucket 34
  Sorting block time: 00:01:12
Returning block of 108688506 for bucket 35
Getting block 36 of 37
  Reserving size (109361268) for bucket 36
  Calculating Z arrays for bucket 36
  Entering block accumulator loop for bucket 36:
  bucket 36: 10%
  bucket 36: 20%
  bucket 36: 30%
  bucket 36: 40%
Getting block 37 of 37
  Reserving size (109361268) for bucket 37
  Calculating Z ar