## Preparation

This pipeline is based on the one in [Counting Transcripts](https://github.com/hypercubestart/Counting-Transcripts).
I use it to clean the glioblastoma RNA-seq data and align it to the transcriptome.

#### Set Working Directory

In [1]:
# Rerun this cell after restarting the notebook to set the WORKDIR variable
import os
os.environ['WORKDIR'] = './data'
# To work elsewhere, move the data directory to your desired location and update the path above

In [None]:
# Create the directory if it doesn't already exist
!mkdir -p $WORKDIR

Inside the data directory, organize files like so:
<html>
    <code><i>project_name</i>/
    data/
        cleanFASTQ/
        FASTQ/
        kallisto/
        gdcReference/
        seqData/</code>
</html>
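The layout above can also be created from the notebook itself. This is a minimal sketch, not part of the original pipeline: the helper name `make_layout` is hypothetical, and the subdirectory names are copied from the listing above.

```python
from pathlib import Path

# Subdirectory names copied from the layout listed above.
SUBDIRS = ["cleanFASTQ", "FASTQ", "kallisto", "gdcReference", "seqData"]

def make_layout(workdir):
    """Create each pipeline subdirectory under workdir and return the paths.

    make_layout is a hypothetical helper name used only for this sketch.
    """
    root = Path(workdir)
    created = []
    for name in SUBDIRS:
        path = root / name
        # parents=True also creates workdir itself; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_layout(os.environ['WORKDIR'])` after the cell that sets `WORKDIR` should produce the same tree as creating the folders by hand.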

We download RNA-Seq reads from a published work: https://www.ncbi.nlm.nih.gov/bioproject/635587. We use only a subset of the data, since analyzing every sample would take considerable time and disk space.

To download these files and convert them into the FASTQ format, you will need the [SRA Toolkit](https://www.ncbi.nlm.nih.gov/books/NBK158900/). A FASTQ file contains the read sequences together with a quality score for each base.

Save the SRA Toolkit to a directory called software and add its `bin` directory to your [PATH](https://en.wikipedia.org/wiki/PATH_(variable)) so that its commands are recognized.

In [2]:
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra58/SRR/011584/SRR11862840
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra24/SRR/011584/SRR11862839
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra74/SRR/011584/SRR11862838
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra38/SRR/011584/SRR11862837
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra68/SRR/011584/SRR11862836
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra63/SRR/011584/SRR11862835

--2020-08-10 14:06:48--  https://sra-download.ncbi.nlm.nih.gov/traces/sra58/SRR/011584/SRR11862840
Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 165.112.9.231, 130.14.250.28, 130.14.250.25
Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|165.112.9.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 551105353 (526M) [application/octet-stream]
Saving to: ‘SRR11862840’


2020-08-10 14:07:49 (8.75 MB/s) - ‘SRR11862840’ saved [551105353/551105353]

--2020-08-10 14:07:49--  https://sra-download.ncbi.nlm.nih.gov/traces/sra24/SRR/011584/SRR11862839
Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 130.14.250.28, 130.14.250.25, 165.112.9.235
Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|130.14.250.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 570522735 (544M) [application/octet-stream]
Saving to: ‘SRR11862839’


2020-08-10 14:08:

If the above command returned `wget: command not found`, `wget` is not installed on your machine. To install it, install Homebrew, run `brew install wget` in your terminal, and then try the above command again.
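Before running the download and conversion cells, it can help to confirm that the required commands are on your PATH. The sketch below uses Python's standard `shutil.which`; the helper name `missing_tools` is hypothetical and not part of the original notebook.

```python
import shutil

def missing_tools(tools):
    """Return the tool names that shutil.which cannot find on PATH.

    missing_tools is a hypothetical helper name used only for this sketch.
    """
    return [tool for tool in tools if shutil.which(tool) is None]

# An empty list means both commands were found; the result depends on your machine.
print(missing_tools(["wget", "fastq-dump"]))
```

If `fastq-dump` appears in the printed list, the SRA Toolkit `bin` directory is probably not on your PATH yet.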

Move these files into the `data/seqData/` directory.
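The move can also be done from Python. This sketch assumes the six files were downloaded into the notebook's current directory under the run accessions used in the `wget` cell; `move_runs` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

# Run accessions taken from the wget cell above.
RUNS = [
    "SRR11862840", "SRR11862839", "SRR11862838",
    "SRR11862837", "SRR11862836", "SRR11862835",
]

def move_runs(src_dir, dest_dir, runs=RUNS):
    """Move each downloaded run file from src_dir into dest_dir.

    Files that are not present in src_dir are skipped.
    Returns the list of accessions that were actually moved.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for run in runs:
        src = Path(src_dir) / run
        if src.exists():
            shutil.move(str(src), str(dest / run))
            moved.append(run)
    return moved
```

For example, `move_runs('.', os.path.join(os.environ['WORKDIR'], 'seqData'))` would move whichever of the six files are in the current directory.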

The command `fastq-dump` converts the downloaded RNA-Seq data into the FASTQ format, which the pipeline takes as input.

In [3]:
!fastq-dump -O $WORKDIR/FASTQ/1c $WORKDIR/seqData/SRR11862840

Read 7601798 spots for ./data/seqData/SRR11862840
Written 7601798 spots for ./data/seqData/SRR11862840


In [4]:
!fastq-dump -O $WORKDIR/FASTQ/1h $WORKDIR/seqData/SRR11862839

Read 7556819 spots for ./data/seqData/SRR11862839
Written 7556819 spots for ./data/seqData/SRR11862839


In [19]:
!fastq-dump -O $WORKDIR/FASTQ/2c $WORKDIR/seqData/SRR11862838

Read 3115255 spots for ./data/seqData/SRR11862838
Written 3115255 spots for ./data/seqData/SRR11862838


In [5]:
!fastq-dump -O $WORKDIR/FASTQ/2h $WORKDIR/seqData/SRR11862837

Read 3630969 spots for ./data/seqData/SRR11862837
Written 3630969 spots for ./data/seqData/SRR11862837


In [16]:
!fastq-dump -O $WORKDIR/FASTQ/3c $WORKDIR/seqData/SRR11862836

Read 5931474 spots for ./data/seqData/SRR11862836
Written 5931474 spots for ./data/seqData/SRR11862836


In [None]:
!fastq-dump -O $WORKDIR/FASTQ/3h $WORKDIR/seqData/SRR11862835
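The six `fastq-dump` cells above differ only in the sample label and run accession, so they could be generated from a single mapping. This is a sketch under that assumption: the label-to-accession pairs are copied from the cells above, `fastq_dump_commands` is a hypothetical helper name, and the code only builds and prints the commands rather than running them.

```python
import os

# Sample label -> run accession, copied from the fastq-dump cells above.
SAMPLES = {
    "1c": "SRR11862840",
    "1h": "SRR11862839",
    "2c": "SRR11862838",
    "2h": "SRR11862837",
    "3c": "SRR11862836",
    "3h": "SRR11862835",
}

def fastq_dump_commands(workdir, samples=SAMPLES):
    """Build one fastq-dump argument list per sample, mirroring the cells above."""
    return [
        ["fastq-dump", "-O", f"{workdir}/FASTQ/{label}", f"{workdir}/seqData/{run}"]
        for label, run in samples.items()
    ]

# Print the commands for inspection; nothing is executed here.
for cmd in fastq_dump_commands(os.environ.get("WORKDIR", "./data")):
    print(" ".join(cmd))
```

To actually run the conversions, each list could be passed to `subprocess.run(cmd, check=True)`, provided `fastq-dump` is on your PATH.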