# Pipeline
#### This is the Trimmomatic-Kallisto pipeline. The input is RNA-Seq data in FASTQ format, and the output is a TSV file of transcripts with their expression levels.

#### Set Working Directory

In [1]:
# Rerun this cell after restarting the notebook to set the WORKDIR variable
import os
os.environ['WORKDIR'] = './data'
# Move the directory named data into your desired working directory; './data' is resolved relative to where the notebook runs

Organize your files so that each sample has its own subdirectory:
<html>
    <code><i>project_name</i>/
        data/
            cleanFASTQ/
                1c/
                1h/
                2c/
                2h/
                3c/
                3h/
            FASTQ/
                1c/
                1h/
                2c/
                2h/
                3c/
                3h/
            kallisto/
                1c/
                1h/
                2c/
                2h/
                3c/
                3h/
            gdcReference/
            seqData/</code>
</html>
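
The layout above can also be created from inside the notebook. The following is a minimal sketch, assuming the six sample names shown in the tree; `layout_paths` and `create_layout` are hypothetical helper names introduced here for illustration.

```python
import os

# Sample names and directory names copied from the tree above.
SAMPLES = ["1c", "1h", "2c", "2h", "3c", "3h"]
PER_SAMPLE_DIRS = ["cleanFASTQ", "FASTQ", "kallisto"]
SHARED_DIRS = ["gdcReference", "seqData"]


def layout_paths(workdir, samples=SAMPLES):
    """Return every directory of the layout without touching the disk."""
    paths = [os.path.join(workdir, top, sample)
             for top in PER_SAMPLE_DIRS
             for sample in samples]
    paths += [os.path.join(workdir, shared) for shared in SHARED_DIRS]
    return paths


def create_layout(workdir, samples=SAMPLES):
    """Create the directories; exist_ok=True makes the call safe to repeat."""
    for path in layout_paths(workdir, samples):
        os.makedirs(path, exist_ok=True)
```

Calling `create_layout(os.environ['WORKDIR'])` after the cell that sets WORKDIR would build the tree under `./data`.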

Download the Java software [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) to clean the reads. Trimmomatic uses the quality scores stored in the FASTQ file to trim low-quality bases from each read and to drop reads that end up too short. Save Trimmomatic to your software directory; the cells below call the .jar file by its full path, so adjust that path to match your installation. Java is required to run this software.
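
To make the trimming idea concrete, here is a simplified sketch of a sliding-window quality cut on a Phred+33 quality string. It only illustrates the concept: the function names are hypothetical, and Trimmomatic's own implementation may treat read ends and window edges differently.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string: Phred+33 stores score + 33 as an ASCII code."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_cut(scores, window=4, min_mean=15):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean quality
    falls below min_mean; if no window fails, the whole read is kept.
    """
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_mean:
            return start
    return len(scores)
```

With the defaults, the sketch mirrors the idea behind a `SLIDINGWINDOW:4:15` setting: a 4-base window and a mean-quality threshold of 15.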

In [5]:
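# SE = single-end mode; -phred33 = quality-score encoding of the input file.
# LEADING:3 / TRAILING:3 trim bases below quality 3 from the start / end of each read.
# SLIDINGWINDOW:4:15 cuts once the mean quality of a 4-base window drops below 15; MINLEN:36 drops reads shorter than 36 bases.
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/1c/SRR11862840.fastq \
    $WORKDIR/cleanFASTQ/1c/SRR11862840_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36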

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/1c/SRR11862840.fastq ./data/cleanFASTQ/1c/SRR11862840_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 7601798 Surviving: 5598966 (73.65%) Dropped: 2002832 (26.35%)
TrimmomaticSE: Completed successfully
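
The summary line in the output above can be parsed to compare samples. This is a small sketch, assuming the log keeps the `Input Reads: ... Surviving: ... Dropped: ...` wording shown above; `parse_summary` is a hypothetical helper name.

```python
import re

# Matches the read counts in a Trimmomatic single-end summary line.
SUMMARY = re.compile(r"Input Reads: (\d+) Surviving: (\d+) \([\d.]+%\) Dropped: (\d+)")


def parse_summary(line):
    """Return read counts and the surviving percentage, or None if the line does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    total, surviving, dropped = (int(group) for group in match.groups())
    return {
        "input": total,
        "surviving": surviving,
        "dropped": dropped,
        "surviving_pct": round(100 * surviving / total, 2),
    }
```

Applied to the line above, the recomputed percentage agrees with the reported 73.65%, and the surviving and dropped counts add up to the input count.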


In [6]:
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/1h/SRR11862839.fastq \
    $WORKDIR/cleanFASTQ/1h/SRR11862839_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/1h/SRR11862839.fastq ./data/cleanFASTQ/1h/SRR11862839_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 7556819 Surviving: 5482773 (72.55%) Dropped: 2074046 (27.45%)
TrimmomaticSE: Completed successfully


In [7]:
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/2c/SRR11862838.fastq \
    $WORKDIR/cleanFASTQ/2c/SRR11862838_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/2c/SRR11862838.fastq ./data/cleanFASTQ/2c/SRR11862838_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 3115255 Surviving: 2074688 (66.60%) Dropped: 1040567 (33.40%)
TrimmomaticSE: Completed successfully


In [8]:
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/2h/SRR11862837.fastq \
    $WORKDIR/cleanFASTQ/2h/SRR11862837_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/2h/SRR11862837.fastq ./data/cleanFASTQ/2h/SRR11862837_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 3630969 Surviving: 2564738 (70.64%) Dropped: 1066231 (29.36%)
TrimmomaticSE: Completed successfully


In [9]:
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/3c/SRR11862836.fastq \
    $WORKDIR/cleanFASTQ/3c/SRR11862836_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/3c/SRR11862836.fastq ./data/cleanFASTQ/3c/SRR11862836_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 5931474 Surviving: 4838901 (81.58%) Dropped: 1092573 (18.42%)
TrimmomaticSE: Completed successfully


In [10]:
!java -jar /mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar SE -phred33 \
    $WORKDIR/FASTQ/3h/SRR11862835.fastq \
    $WORKDIR/cleanFASTQ/3h/SRR11862835_1.fastq \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticSE: Started with arguments:
 -phred33 ./data/FASTQ/3h/SRR11862835.fastq ./data/cleanFASTQ/3h/SRR11862835_1.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Automatically using 4 threads
Input Reads: 13070277 Surviving: 10689297 (81.78%) Dropped: 2380980 (18.22%)
TrimmomaticSE: Completed successfully
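
The six cells above differ only in the sample directory and run accession, so they could be driven by a loop. The sketch below assumes the same jar path and trimming parameters as those cells; `trimmomatic_cmd` and `trim_all` are hypothetical helper names.

```python
import os
import subprocess

# Sample directory -> run accession, copied from the cells above.
RUNS = {
    "1c": "SRR11862840", "1h": "SRR11862839",
    "2c": "SRR11862838", "2h": "SRR11862837",
    "3c": "SRR11862836", "3h": "SRR11862835",
}

# Jar location used in the cells above; adjust to your installation.
TRIMMOMATIC_JAR = "/mnt/d/software/Trimmomatic-0.39/trimmomatic-0.39.jar"


def trimmomatic_cmd(workdir, sample, run, jar=TRIMMOMATIC_JAR):
    """Build the single-end Trimmomatic command for one sample."""
    return [
        "java", "-jar", jar, "SE", "-phred33",
        os.path.join(workdir, "FASTQ", sample, f"{run}.fastq"),
        os.path.join(workdir, "cleanFASTQ", sample, f"{run}_1.fastq"),
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]


def trim_all(workdir, runs=RUNS):
    """Run Trimmomatic for every sample; check=True stops at the first failure."""
    for sample, run in runs.items():
        subprocess.run(trimmomatic_cmd(workdir, sample, run), check=True)
```

Passing the command as a list avoids shell quoting issues, and `trim_all(os.environ['WORKDIR'])` would process all six samples in order.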


Download the [Reference Sequence](https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files) and save it to the gdcReference directory.

In [None]:
!gunzip $WORKDIR/gdcReference/GRCh38.d1.vd1.fa.tar.gz

In [24]:
!tar -xf $WORKDIR/gdcReference/GRCh38.d1.vd1.fa.tar -C $WORKDIR/gdcReference

Make sure the FASTA file (ending in .fa) is in the gdcReference directory.

Download [Kallisto](https://pachterlab.github.io/kallisto/), a program that pseudoaligns the reads from our samples against an index built from the reference sequences and estimates the abundance of each transcript.
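
The matching step can be pictured with a toy k-mer lookup. The sketch below is a conceptual illustration only, not Kallisto's actual data structure or algorithm; the function names, the toy sequences, and the short k are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # some k-mer matches no common transcript
    return compatible if compatible is not None else set()
```

A read whose k-mers all occur in a single transcript is assigned to that transcript alone, while a read from a shared region stays compatible with several; abundance estimation then has to resolve that ambiguity.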

In [34]:
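# NOTE: the recorded run below aborted with std::bad_alloc (a failed memory allocation) while counting k-mers, so transcripts.idx was not written.
!kallisto index -i $WORKDIR/kallisto/transcripts.idx $WORKDIR/gdcReference/GRCh38.d1.vd1.fa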


[build] loading fasta file ./data/gdcReference/GRCh38.d1.vd1.fa
[build] k-mer length: 31
        from 30 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc


In [27]:
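# NOTE: the recorded run below failed: transcripts.idx was never built, single-end input needs --single,
# and the output directory could not be created (the tree above has no kallisto/output/ directory).
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/1c \
    $WORKDIR/cleanFASTQ/1c/SRR11862840_1.fastq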


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/1c

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
        

In [28]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/1h \
    $WORKDIR/cleanFASTQ/1h/SRR11862839_1.fastq


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/1h

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
        

In [29]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/2c \
    $WORKDIR/cleanFASTQ/2c/SRR11862838_1.fastq


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/2c

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
        

In [30]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/2h \
    $WORKDIR/cleanFASTQ/2h/SRR11862837_1.fastq


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/2h

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
        

In [31]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/3c \
    $WORKDIR/cleanFASTQ/3c/SRR11862836_1.fastq


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/3c

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
        

In [32]:
!kallisto quant -i $WORKDIR/kallisto/transcripts.idx -o $WORKDIR/kallisto/output/3h \
    $WORKDIR/cleanFASTQ/3h/SRR11862835_1.fastq


Error: kallisto index file not found ./data/kallisto/transcripts.idx
Error: paired-end mode requires an even number of input files
       (use --single for processing single-end reads)
Error: could not create directory ./data/kallisto/output/3h

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
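
Each quantification run above reports the same three errors: a missing index, paired-end mode applied to a single FASTQ file, and an output directory that could not be created. The sketch below shows one way to address them, assuming the index has first been built successfully. `quant_cmd` and `quantify` are hypothetical helper names, and the fragment length (200) and standard deviation (20) are placeholder assumptions that must be replaced with values for your library; `kallisto quant` requires `-l` and `-s` when `--single` is used.

```python
import os
import subprocess


def quant_cmd(workdir, sample, run, fragment_length=200, fragment_sd=20):
    """Build a single-end kallisto quant command for one sample.

    fragment_length and fragment_sd are placeholder assumptions, not measured values.
    """
    return [
        "kallisto", "quant",
        "-i", os.path.join(workdir, "kallisto", "transcripts.idx"),
        "-o", os.path.join(workdir, "kallisto", "output", sample),
        "--single", "-l", str(fragment_length), "-s", str(fragment_sd),
        os.path.join(workdir, "cleanFASTQ", sample, f"{run}_1.fastq"),
    ]


def quantify(workdir, sample, run):
    """Check the index, create the output directory, then run kallisto."""
    cmd = quant_cmd(workdir, sample, run)
    index_path = cmd[cmd.index("-i") + 1]
    if not os.path.isfile(index_path):
        raise FileNotFoundError(f"Build the kallisto index first: {index_path}")
    # Create the full output path, including the intermediate output/ directory.
    os.makedirs(cmd[cmd.index("-o") + 1], exist_ok=True)
    subprocess.run(cmd, check=True)
```

For example, `quantify(os.environ['WORKDIR'], "1c", "SRR11862840")` would process the first sample. Kallisto is designed to index transcript sequences, so if GRCh38.d1.vd1.fa is a whole-genome FASTA, that may explain why the index step ran out of memory.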
        