# NGS pipeline for Alzheimer's Disease RNA-seq data

For this project I used only two samples: one control and one from a patient with advanced AD.

In [1]:
import os

Set up some global variables:

In [2]:
# Directory with the bash scripts
os.environ['DIR'] = '/home/mees/NGS_Alzheimer/'
# Working directory for the sequencing data (I use an external disk, since I don't have much space left on the PC)
os.environ['WORKDIR'] = '/media/mees/Elements/ngs_data/'
# Directory containing the downloaded genome
os.environ['GENOMEDIR'] = '/media/mees/Elements/ngs_data/genomes/'
# Location where the indexed reference genome is saved
os.environ['REF_GENOME'] = '/media/mees/Elements/ngs_data/genomes/ref_genome/'
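
The build log further down shows paths such as `genomes//genome.fa`, because the variables already end in `/` and the commands add another one. The doubled slash is harmless on Linux, but it can be worth normalising the paths and confirming that the directories exist before starting a long-running step. The following is only a sketch: `check_dirs` is a hypothetical helper and not part of the original notebook.

```python
import os

def check_dirs(names):
    """Return {variable name: normalised path} for directories that exist.

    Raises KeyError if a variable is unset and NotADirectoryError if the
    path it holds is not an existing directory.
    """
    resolved = {}
    for name in names:
        # normpath drops trailing and doubled slashes
        path = os.path.normpath(os.environ[name])
        if not os.path.isdir(path):
            raise NotADirectoryError(f"{name} points to a missing directory: {path}")
        resolved[name] = path
    return resolved
```

It could be called as `check_dirs(['DIR', 'WORKDIR', 'GENOMEDIR', 'REF_GENOME'])` once the directories have been created.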

First, download the human reference genome and save it as genome.fa in $GENOMEDIR.

In [None]:
!wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!mv Homo_sapiens.GRCh38.dna.primary_assembly.fa $GENOMEDIR/genome.fa

In [7]:
!mkdir -p $WORKDIR
!bash download_data.sh -d $WORKDIR


2020-06-22T17:53:08 prefetch.2.10.7: 1) Downloading 'SRR2422918'...
2020-06-22T17:53:08 prefetch.2.10.7:  Downloading via HTTPS...
^C

2020-06-22T17:53:23 fasterq-dump.2.10.7 int: buffer insufficient while reading uri within cloud module - cannot Get Cloud Location


After downloading the data and converting the .sra files to .fastq files, use FastQC to assess read quality.

In [27]:
!./fastqc.sh -d $WORKDIR 

Started analysis of SRR2422918.fastq
Approx 5% complete for SRR2422918.fastq
Approx 10% complete for SRR2422918.fastq
Approx 15% complete for SRR2422918.fastq
Approx 20% complete for SRR2422918.fastq
Approx 25% complete for SRR2422918.fastq
Approx 30% complete for SRR2422918.fastq
Approx 35% complete for SRR2422918.fastq
Approx 40% complete for SRR2422918.fastq
Approx 45% complete for SRR2422918.fastq
Approx 50% complete for SRR2422918.fastq
Approx 55% complete for SRR2422918.fastq
Approx 60% complete for SRR2422918.fastq
Approx 65% complete for SRR2422918.fastq
Approx 70% complete for SRR2422918.fastq
Approx 75% complete for SRR2422918.fastq
Approx 80% complete for SRR2422918.fastq
Approx 85% complete for SRR2422918.fastq
Approx 90% complete for SRR2422918.fastq
Approx 95% complete for SRR2422918.fastq
Analysis complete for SRR2422918.fastq
Started analysis of SRR2422926.fastq
Approx 5% complete for SRR2422926.fastq
Approx 10% complete for SRR2422926.fastq
Approx 15% complete for SRR2

I like to open the reports created by FastQC to check for abnormalities.

In [18]:
!firefox $WORKDIR/*.html

These reports suggest that T is overrepresented in the first 10 bp of the reads, so we trim the first 10 bases from every read using the seqtk package.
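
To cross-check what the FastQC "Per base sequence content" plot shows, the base fractions at the start of the reads can be computed directly from a FASTQ file. This is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record. `base_composition` and `read_sequences` are hypothetical helper names, not part of the pipeline scripts.

```python
from collections import Counter

def base_composition(sequences, n_positions=15):
    """Fraction of each base at the first n_positions of the reads."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        # Positions that no read covers yield an empty dict
        fractions.append({b: c[b] / total for b in "ACGTN"} if total else {})
    return fractions

def read_sequences(fastq_path, limit=100_000):
    """Yield up to `limit` read sequences from an uncompressed FASTQ file."""
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number // 4 >= limit:
                break
            if line_number % 4 == 1:  # the sequence is line 2 of each 4-line record
                yield line.strip()
```

If the fraction of `T` is clearly elevated in the first ten positions and settles afterwards, that would agree with the FastQC report.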

In [15]:
!./seqtk.sh -d $WORKDIR
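
The contents of `seqtk.sh` are not shown here. Presumably it wraps something like `seqtk trimfq -b 10`, which trims a fixed number of bases from the left end of each read. As an illustration of what that trimming does to a single record, here is a small sketch; `trim_fastq_record` is a hypothetical function and not what the script actually runs.

```python
def trim_fastq_record(header, sequence, plus, quality, n=10):
    """Drop the first n bases and their quality scores from one FASTQ record."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Sequence and quality must be trimmed together to stay aligned
    return header, sequence[n:], plus, quality[n:]
```

The sequence and the quality string are cut by the same amount, so each remaining base keeps its own quality score.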

Now that the reads have been trimmed, we can align them to the human reference genome. For this alignment we use HISAT2.

The first step is to build an index of the reference genome using hisat2-build.
Then we perform the alignment using hisat2.
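
The two steps map onto two command lines. The sketch below only assembles them as argument lists. It assumes that `build.sh` and `align.sh` wrap `hisat2-build` and `hisat2` in roughly this way, which is not confirmed by the notebook. The function names, the thread count and the output file names are placeholders. The index name `ref_genome` is taken from the build log.

```python
import os

def hisat2_build_cmd(genome_fasta, index_dir, index_name="ref_genome"):
    """Argument list for `hisat2-build <reference_in> <index_base>`."""
    index_base = os.path.join(index_dir, index_name)
    return ["hisat2-build", genome_fasta, index_base]

def hisat2_align_cmd(index_base, fastq, sam_out, threads=4):
    """Argument list for aligning single-end reads (-U) and writing SAM (-S)."""
    return ["hisat2", "-p", str(threads), "-x", index_base, "-U", fastq, "-S", sam_out]
```

Either list can be passed to `subprocess.run(cmd, check=True)` to execute it. Keeping the commands as lists avoids shell quoting problems with paths that contain spaces.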

In [5]:
!./build.sh -g $GENOMEDIR/genome.fa -o $REF_GENOME

Settings:
  Output files: "/media/mees/Elements/ngs_data/genomes/ref_genome//ref_genome.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /media/mees/Elements/ngs_data/genomes//genome.fa
Reading reference sizes
  Time reading reference sizes: 00:01:27
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:01:29
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 138086675 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing w

  bucket 15: 20%
  bucket 14: 30%
  bucket 13: 40%
  bucket 12: 50%
  bucket 15: 30%
  bucket 14: 40%
  bucket 12: 60%
  bucket 13: 50%
  bucket 15: 40%
  bucket 13: 60%
  bucket 12: 70%
  bucket 14: 50%
  bucket 15: 50%
  bucket 13: 70%
  bucket 14: 60%
  bucket 15: 60%
  bucket 12: 80%
  bucket 13: 80%
  bucket 14: 70%
  bucket 15: 70%
  bucket 12: 90%
  bucket 13: 90%
  bucket 14: 80%
  bucket 15: 80%
  bucket 13: 100%
  Sorting block of length 62107305 for bucket 13
  (Using difference cover)
  bucket 12: 100%
  Sorting block of length 127639538 for bucket 12
  (Using difference cover)
  bucket 15: 90%
  bucket 14: 90%
  bucket 15: 100%
  Sorting block of length 5940386 for bucket 15
  (Using difference cover)
  bucket 14: 100%
  Sorting block of length 139760453 for bucket 14
  (Using difference cover)
  Sorting block time: 00:00:03
Returning block of 5940387 for bucket 15
Getting block 16 of 32
  Reserving size (138086675) for bucket 16
  Calculating Z arrays for bucket 16
  Ente

  bucket 31: 50%
  bucket 31: 60%
  Sorting block time: 00:01:06
Returning block of 113546999 for bucket 28
  Sorting block time: 00:00:28
Returning block of 47458124 for bucket 30
  bucket 31: 70%
  bucket 31: 80%
  Sorting block time: 00:01:02
Returning block of 114596170 for bucket 29
  bucket 31: 90%
  bucket 31: 100%
  Sorting block of length 116855486 for bucket 31
  (Using difference cover)
Getting block 32 of 32
  Reserving size (138086675) for bucket 32
  Calculating Z arrays for bucket 32
  Entering block accumulator loop for bucket 32:
  bucket 32: 10%
  bucket 32: 20%
  bucket 32: 30%
  bucket 32: 40%
  bucket 32: 50%
  bucket 32: 60%
  bucket 32: 70%
  bucket 32: 80%
  bucket 32: 90%
  bucket 32: 100%
  Sorting block of length 45807214 for bucket 32
  (Using difference cover)
  Sorting block time: 00:00:55
Returning block of 116855487 for bucket 31
  Sorting block time: 00:00:22
Returning block of 45807215 for bucket 32
Exited GFM loop
fchr[A]: 0
fchr[C]: 869653843
fchr[G]

Now we can align our data to the indexed human genome.

In [6]:
!./align.sh -g $REF_GENOME -f $WORKDIR/data/sra/trimmed/

Hisat2 is aligning SRR2422918.fastq
56422194 reads; of these:
  56422194 (100.00%) were unpaired; of these:
    55216699 (97.86%) aligned 0 times
    832465 (1.48%) aligned exactly 1 time
    373030 (0.66%) aligned >1 times
2.14% overall alignment rate
Hisat2 is aligning SRR2422926.fastq
^C
(ERR): hisat2-align died with signal 2 (INT) 
