#$\text{L-RAPiT: Long Read Analysis Pipeline for Transcriptomics}$


$\text{A pipeline to analyze Oxford Nanopore and PacBio third-generation long transcriptomic sequencing reads}$

$\text{Theodore Nelson}$

$\text{Columbia University Irving Medical Center}$


##$\color{#e74b4b}{\text{Parameter Input and User Instructions}}$

Please define where the file structure is within your Google Drive:

<ul type=disc>
<li><b>PIPELINE_FILE_PATH</b>: file path to location of long-read RNA sequencing analysis pipeline within your Google Colab/Google Drive/local file system - required for most applications. This must be defined and initialized first.</li>
</ul>

Please note: when using the default ```/content``` location on a Colab machine, click the folder icon (fourth icon from the top in the left menu) to browse the files in a graphical user interface.  

In [None]:
%env PIPELINE_FILE_PATH=/content

An error from the following command most likely means that the directory/folder where the pipeline will be installed already exists. In that case the error can be ignored. 

In [None]:
! mkdir $PIPELINE_FILE_PATH
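
For readers who prefer to avoid the "already exists" error entirely, the same step can be sketched in Python. This is a minimal illustration rather than part of the pipeline; the helper name `ensure_dir` is hypothetical.

```python
import os


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists.

    `ensure_dir` is a hypothetical helper name used only for this sketch.
    """
    os.makedirs(path, exist_ok=True)
    return path
```

In a notebook cell this could be called as `ensure_dir(os.environ["PIPELINE_FILE_PATH"])` once the `%env` cell above has been run.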

In [None]:
! cd $PIPELINE_FILE_PATH ; git clone https://github.com/Theo-Nelson/long-read-sequencing-pipeline

Please modify the following parameters within the code box below to fit your own study requirements:  

<ul type=disc>
<li><b>ACC</b>: Run accession number for reads within the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/) (SRR...) or file path to location of long-read RNA sequencing data within your Google Drive/general file system - required for most applications</li>
<li><b>INDEX_FILE_PATH</b>: file path to location of reference genome (e.g. .FASTA) within your Google Drive/general file system - required for most applications</li>
<li><b>ANNOTATION_FILE_PATH</b>: file path to location of reference annotation (e.g. .GTF) within your Google Drive/general file system - required for most applications</li>
<li><b>CHROMOSOME</b>: Name of Chromosome of Interest matching the name of the Chromosome within your Reference Annotation - required for svist4get</li>
<li><b>CHROMOSOME_START</b>: Starting Location of Interest on the Chromosome - required for svist4get</li>
<li><b>CHROMOSOME_FINISH</b>: Ending Location of Interest on the Chromosome - required for svist4get</li>
<li><b>REGION_NAME</b>: Gene Name (does not need to match annotation file) - required for FLAME and svist4get</li>
<li><b>HUB_KEYWORD</b>: Short Keyword for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_NAME</b>: Longer Title for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_EMAIL</b>: Email for your UCSC Track Hub (if you publish your track hub then this email will be public) - required for MakeHub</li>
</ul>


In [None]:
%env ACC=SRR12389274
%env INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/hg38.fa
%env ANNOTATION_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/hg38.ensGene.gtf
%env CHROMOSOME=chr12
%env CHROMOSOME_START=116533422
%env CHROMOSOME_FINISH=116536513
%env REGION_NAME=LINC00173
%env HUB_KEYWORD=LINC00173
%env HUB_NAME="Human LINC00173"
%env HUB_EMAIL=your@email.address
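
Because the region parameters are entered as free text, a quick sanity check before running the tools that use them can catch swapped or non-numeric coordinates. The following is only a sketch; the function name `parse_region` and its error messages are assumptions made for illustration.

```python
def parse_region(chromosome, start, finish):
    """Return (chromosome, start, finish) with integer coordinates.

    Raises ValueError if the chromosome is empty, a coordinate is not an
    integer, or the start does not lie before the finish.
    `parse_region` is a hypothetical helper, not part of the pipeline.
    """
    if not chromosome:
        raise ValueError("CHROMOSOME must not be empty")
    start, finish = int(start), int(finish)  # int() raises ValueError on bad text
    if start < 0 or finish <= start:
        raise ValueError("expected 0 <= CHROMOSOME_START < CHROMOSOME_FINISH")
    return chromosome, start, finish
```

With the values above, `parse_region("chr12", "116533422", "116536513")` returns the chromosome name together with both coordinates as integers.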

##$\color{#e74b4b}{\text{Mounting your Google Drive / Exporting to Your Local Hard Drive}}$

This step allows for permanent storage of your bioinformatics analysis in Google Drive

<ul type=disc>
<li><b>STORAGE_FILE_PATH</b>: file path to location where you would wish to store output of long-read RNA sequencing analysis pipeline - required to export data from Google Colab. </li>
</ul>

In [None]:
%env STORAGE_FILE_PATH=/content/drive/MyDrive/long-read-sequencing-pipeline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

An error from the following command most likely means that the directory/folder where the output will be stored already exists. In that case the error can be ignored. 

In [None]:
! mkdir $STORAGE_FILE_PATH

Additionally, the following commands export the parameters needed to download files to a local machine. Please ignore errors such as ```/bin/bash: line 0: export: `/content': not a valid identifier```. 

In [None]:
! export $ACC
! export $PIPELINE_FILE_PATH
! export $REGION_NAME

##$\color{#e74b4b}{\text{BioConda: Package Installations}}$

BioConda is a software environment and package manager, providing access to over 8,000 different software packages related to bioinformatics (documentation: [BioConda](https://bioconda.github.io/user/install.html) and [Managing Environments via Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)). 

In [None]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local

In [None]:
! export PYTHONHOME='/usr/local/lib/python3.7/site-packages/'

###$\color{#ff00d5}{\text{Kingfisher: procurement of sequence files - installation}}$

The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [None]:
! git clone https://github.com/MakeTheBrainHappy/kingfisher-download

In [None]:
! conda env update -n base --file kingfisher-download/kingfisher.yml

In [None]:
! conda install -c rpetit3 aspera-connect -y

In [None]:
! wget -qO- https://download.asperasoft.com/download/sw/connect/3.9.8/ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.tar.gz | tar xvz

This command will display an error message; please disregard it.

In [None]:
! ./ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh

###$\color{#ff00d5}{\text{FastQC: quality control tool for high-throughput sequence data - installation}}$

FastQC is a program designed to spot potential problems in high-throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [None]:
! conda install -c bioconda fastqc -y

###$\color{#ff00d5}{\text{Shark: fishing relevant reads in an RNA-Seq sample - installation}}$

Shark is a tool to extract gene-specific reads from an RNA-seq sample (documentation: https://github.com/AlgoLab/shark).

In [None]:
! conda install -c bioconda shark -y

###$\color{#e74b4b}{\text{minimap2: A versatile pairwise aligner for spliced nucleotide sequences - installation}}$


minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

In [None]:
! conda install -c bioconda minimap2 -y

###$\color{#ff00d5}{\text{samtools: Write/Index SAM to BAM - installation}}$

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/) 


In [None]:
! conda install -c bioconda samtools -y

###$\color{#ff00d5}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - installation}}$

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

In [None]:
! conda install -c bioconda pyfasta pyranges samtools -y

In [None]:
! wget https://github.com/mortazavilab/TranscriptClean/archive/refs/tags/v2.0.3.tar.gz
! tar xvf v2.0.3.tar.gz 

###$\color{#ff00d5}{\text{FLAME: long-read splice variant annotation - installation}}$

Full-Length Adjacency Matrix Enumeration (FLAME) is a program which can detect and quantify novel splice junctions on an annotated gene locus (documentation: https://github.com/marabouboy/FLAME). 

In [None]:
! sudo apt-get install bedtools

In [None]:
! pip install pysam

In [None]:
! git clone https://github.com/marabouboy/FLAME

In [None]:
! chmod 755 /content/FLAME/FLAME/FLAME.py

In [None]:
! chmod 755 /content/FLAME/setup.py

In [None]:
! cd /content/FLAME/ ; python3 setup.py install

###$\color{#ff00d5}{\text{featureCounts: assign sequence reads to genomic features - installation}}$

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! conda install -c bioconda subread -y

###$\color{#ff00d5}{\text{LIQA: transcript quantification}}$

LIQA is a program which quantifies isoform/transcript expression based on long read RNA sequencing data (documentation: https://github.com/WGLab/LIQA)

In [None]:
! pip install liqa

###$\color{#ff00d5}{\text{FusionSeeker: detect gene fusions - installation}}$

FusionSeeker is a gene fusion caller for long-read single-molecule sequencing data (documentation: https://github.com/Maggi-Chen/FusionSeeker). 

In [None]:
! git clone https://github.com/Theo-Nelson/FusionSeeker.git

In [None]:
! git clone https://github.com/ruanjue/bsalign.git

In [None]:
! cd /content/bsalign && make

###$\color{#ff00d5}{\text{StringTie: transcript assembly - installation}}$

StringTie is a program which can produce transcriptomes specific to the sample input (documentation: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#run).

In [None]:
! conda install -c bioconda stringtie -y

###$\color{#ff00d5}{\text{GffCompare: transcript assembly statistics - installation}}$

GffCompare is a program which can evaluate the specificity and novelty of transcripts within a sample-specific transcriptome to a reference transcriptome (documentation: http://ccb.jhu.edu/software/stringtie/gffcompare.shtml).

In [None]:
! conda install -c bioconda gffcompare -y

###$\color{#ff00d5}{\text{svist4get: visualize genomic tracks from sequencing experiments - installation}}$

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [None]:
! apt-get update

In [None]:
! apt-get install libmagickwand-dev

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/policy_revised.xml /etc/ImageMagick-6/policy.xml

In [None]:
! python3 -m pip install svist4get

###$\color{#ff00d5}{\text{Pistis: quality control plotting for long reads - installation}}$

Pistis generates long-read specific quality control graphs, including a plot demonstrating read alignment percentage to the reference genome (documentation: https://github.com/mbhall88/pistis)

In [None]:
! pip3 install pistis

###$\color{#0072ff}{\text{MakeHub: generate UCSC assembly hubs - installation}}$

MakeHub is a command line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC genome browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! python3.7 -m pip install biopython

In [None]:
! sudo apt install augustus augustus-data augustus-doc

###$\color{#ff00d5}{\text{MultiQC: aggregate bioinformatics analysis - installation}}$



MultiQC is a program which allows you to combine reports for as many samples as you wish ([documentation](https://multiqc.info/docs/))

In [None]:
! pip install multiqc

##$\color{#d42bb4}{\text{Kingfisher: procurement of sequence files - usage}}$


The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [None]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; /content/kingfisher-download/bin/kingfisher get -r $ACC -m ena-ascp aws-http prefetch

The next command will unzip a fastq file, if necessary. Please do not be concerned if this command throws an error. 

In [None]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; gunzip *.gz

The next two commands rename uncommon file-name variants provided by depositors in the European Nucleotide Archive to the standard `$ACC.fastq`. In rare instances, pipeline users may need to edit these commands directly to standardize the file name. Please do not be concerned if either command throws an error, since only one of the variants (or neither) will normally be present. 

In [None]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; mv ${ACC}_1.fastq $ACC.fastq

In [None]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; mv ${ACC}_subreads.fastq $ACC.fastq
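
The renaming logic of the two `mv` commands can also be sketched in Python, which avoids the expected "no such file" errors by checking which variant exists. The helper name `standardize_fastq_name` and the default suffix list are assumptions taken from the two commands above; other depositors may use other suffixes.

```python
import os


def standardize_fastq_name(directory, acc, suffixes=("_1", "_subreads")):
    """Rename the first existing `<acc><suffix>.fastq` to `<acc>.fastq`.

    Returns the standardized path, or None if no candidate file was found.
    Hypothetical helper; the suffixes mirror the two `mv` commands above.
    """
    target = os.path.join(directory, acc + ".fastq")
    if os.path.exists(target):
        return target  # already standardized, nothing to do
    for suffix in suffixes:
        candidate = os.path.join(directory, acc + suffix + ".fastq")
        if os.path.exists(candidate):
            os.rename(candidate, target)
            return target
    return None
```

A return value of `None` signals the rare case mentioned above in which the file name has to be standardized by hand.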

##$\color{#a7588f}{\text{Kingfisher: procurement of sequence files - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq $STORAGE_FILE_PATH/$ACC.fastq

To Store Resulting Files in your Local Hard Drive: 

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fastq/",os.environ["ACC"],".fastq"]))

##$\color{#d42bb4}{\text{FastQC: quality control tool for high-throughput sequence data - usage}}$

FastQC is a program designed to spot potential problems in high-throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [None]:
! fastqc -t 2 $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq --outdir $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc

For the best viewing experience, please download the HTML output to your Hard Drive and open in your local browser, such as Google Chrome, Firefox or Edge. 

##$\color{#a7588f}{\text{FastQC: quality control tool for high-throughput sequence data - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fastqc.zip"]))
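
The zip-and-download pattern used here recurs in every export section below. As a sketch of the zipping half, the standard-library `shutil.make_archive` can be used instead of the `zip` command; the helper name `archive_results` is hypothetical. One difference worth noting: this version stores paths relative to the results folder's parent (e.g. `fastqc/...`) rather than the full path passed to `zip -r`.

```python
import os
import shutil


def archive_results(results_dir):
    """Zip `results_dir` into `<results_dir>.zip` and return the archive path.

    Entries inside the archive are stored relative to the parent folder.
    `archive_results` is a hypothetical helper for illustration only.
    """
    results_dir = os.path.abspath(results_dir)  # also strips a trailing slash
    parent, name = os.path.split(results_dir)
    return shutil.make_archive(results_dir, "zip", root_dir=parent, base_dir=name)
```

The returned path could then be passed to `files.download(...)` exactly as in the cell above.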

##$\color{#e74b4b}{\text{Reference Genome - installation}}$

The Reference Genome provides a scaffold to align long-read data. These commands install the hg38 genome and Ensembl annotation available from UCSC. You can download more current genomes by utilizing the appropriate links from NCBI RefSeq, Ensembl, or other reference genome providers. If you are unsure of how to find other species, we recommend checking out the list of species available in the ```current_fasta``` and ```current_gtf``` folders: http://ftp.ensembl.org/pub/

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ensGene.gtf.gz

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.fa.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.fa

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.ensGene.gtf.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.ensGene.gtf

Another example is provided below, which installs the UCSC murine mm39 genome. Note that the reference annotation is from RefSeq. 

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices ftp://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf
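
The `gunzip -c ... > ...` commands above decompress each download while keeping the original `.gz` file. A rough Python equivalent, streaming the data so that a whole genome never has to fit in memory, might look as follows; the function name `decompress_keep` is an assumption for this sketch.

```python
import gzip
import shutil


def decompress_keep(src, dst):
    """Decompress gzip file `src` into `dst`, leaving `src` untouched.

    Streams in chunks via shutil.copyfileobj, similar in spirit to
    `gunzip -c src > dst`. Hypothetical helper for illustration.
    """
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst
```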

##$\color{#d42bb4}{\text{Shark: fishing relevant reads in an RNA-Seq sample - usage}}$

Shark is a tool to extract gene-specific reads from an RNA-seq sample (documentation: https://github.com/AlgoLab/shark).

In [None]:
! echo ${CHROMOSOME}$'\t'${CHROMOSOME_START}$'\t'${CHROMOSOME_FINISH} > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$REGION_NAME.bed
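
The `echo` command above writes a single tab-separated BED line for the region of interest. The same line can be built in Python, as sketched below with a hypothetical helper `bed_line`. Like the `echo` command, it writes the coordinates unchanged; BED intervals are 0-based and half-open, so whether a 1-based start should be shifted by one is left as a decision for the user.

```python
def bed_line(chromosome, start, finish):
    """Return one three-column BED record ending in a newline.

    Coordinates are passed through unchanged, mirroring the `echo` command.
    `bed_line` is a hypothetical helper for illustration.
    """
    return "\t".join([chromosome, str(int(start)), str(int(finish))]) + "\n"
```

For the default parameters, `bed_line("chr12", 116533422, 116536513)` yields the three fields separated by tabs.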

In [None]:
! eval bedtools getfasta -fi $INDEX_FILE_PATH -bed $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$REGION_NAME.bed -fo $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$REGION_NAME.fasta

In [None]:
! eval shark -c .40 -k 10 -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$REGION_NAME.fasta -1 $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$REGION_NAME$ACC.fastq

In order to continue analysis with just the filtered reads, please utilize the following command:

In [None]:
%env ACC=${REGION_NAME}${ACC}

##$\color{#a7588f}{\text{Shark: fishing relevant reads in an RNA-Seq sample - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$REGION_NAME$ACC.fastq $STORAGE_FILE_PATH/$REGION_NAME$ACC.fastq

To Store Resulting Files in your Local Hard Drive: 

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fastq/",os.environ["REGION_NAME"],os.environ["ACC"],".fastq"]))

##$\color{#e74b4b}{\text{minimap2: A versatile pairwise aligner for spliced nucleotide sequences - index minimization}}$

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

In [None]:
! eval minimap2 -k15 -w5 -d $INDEX_FILE_PATH.mmi $INDEX_FILE_PATH

##$\color{#e74b4b}{\text{minimap2: A versatile pairwise aligner for spliced nucleotide sequences - usage}}$

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

In [None]:
! eval minimap2 -ax splice $INDEX_FILE_PATH.mmi $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam
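
After alignment it can be useful to check how many reads actually mapped. The sketch below tallies SAM records by their FLAG field (column 2), using the standard bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary). The function name `summarize_sam` and the category labels are assumptions for this example; for real data `samtools flagstat` reports similar numbers.

```python
def summarize_sam(lines):
    """Count primary mapped, unmapped, and secondary/supplementary records.

    `lines` is any iterable of SAM text lines; header lines (starting with
    '@') and blank lines are skipped. Hypothetical helper for illustration.
    """
    counts = {"primary_mapped": 0, "unmapped": 0, "secondary_or_supplementary": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x4:
            counts["unmapped"] += 1
        elif flag & 0x100 or flag & 0x800:
            counts["secondary_or_supplementary"] += 1
        else:
            counts["primary_mapped"] += 1
    return counts
```

Since a file object iterates over its lines, the function can be applied to an opened `.sam` file directly.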

##$\color{#e74b4b}{\text{minimap2: A versatile pairwise aligner for spliced nucleotide sequences - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam $STORAGE_FILE_PATH/$ACC.sam

To Store Resulting Files in your Local Hard Drive: 

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/sam/",os.environ["ACC"],".sam"]))

##$\color{#e74b4b}{\text{samtools: Write/Index SAM to BAM - usage}}$

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/) 


In [None]:
! samtools view -S -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.bam 

In [None]:
! samtools sort $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.bam -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam  

In [None]:
! samtools index $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam 
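
The three samtools steps above (convert, sort, index) always run in the same order, so they can be chained from Python with `subprocess`. This is a sketch under the assumption that `samtools` is on the `PATH`; the helper names `samtools_commands` and `run_all` are hypothetical. It uses `-o` to name the output file instead of shell redirection.

```python
import subprocess


def samtools_commands(sam_path, bam_prefix):
    """Build the view/sort/index command lines for one sample.

    Produces `<bam_prefix>.bam`, `<bam_prefix>.sorted.bam` and its index.
    Hypothetical helper; only builds argument lists, runs nothing.
    """
    unsorted_bam = bam_prefix + ".bam"
    sorted_bam = bam_prefix + ".sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", unsorted_bam, sam_path],
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        ["samtools", "index", sorted_bam],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Passing `check=True` means that a failed conversion raises an exception instead of silently producing an empty file for the next step.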

##$\color{#e74b4b}{\text{samtools: Write/Index SAM to BAM - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam $STORAGE_FILE_PATH/$ACC.sorted.bam

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam.bai $STORAGE_FILE_PATH/$ACC.sorted.bam.bai

To Store Resulting Files in your Local Hard Drive: 

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/bam/",os.environ["ACC"],".sorted.bam"]))

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/bam/",os.environ["ACC"],".sorted.bam.bai"]))

##$\color{#d42bb4}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - usage}}$

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

Please note that the corrected reads should not be used for downstream base-level analyses such as variant calling. 

In [None]:
! eval python /content/TranscriptClean-2.0.3/TranscriptClean.py --sam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam --genome $INDEX_FILE_PATH --outprefix $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/$ACC

Convert the resulting sam file to a bam file:

In [None]:
! samtools view -S -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/${ACC}_clean.sam > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/${ACC}_clean.bam

In [None]:
! samtools sort $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/${ACC}_clean.bam -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/${ACC}_clean.sorted.bam  

In [None]:
! samtools index $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/${ACC}_clean.sorted.bam 

##$\color{#a7588f}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/transcriptclean.zip"]))

##$\color{#d42bb4}{\text{StringTie: transcript assembly - usage}}$


StringTie is a program which can collapse sample reads into transcriptomes specific to the sample input (documentation: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#run).

In [None]:
! eval stringtie -L -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/stringtie/${ACC}.gtf -G $ANNOTATION_FILE_PATH $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam 

##$\color{#a7588f}{\text{StringTie: transcript assembly - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/stringtie $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/stringtie.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/stringtie/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/stringtie.zip"]))

##$\color{#d42bb4}{\text{GffCompare: transcript assembly statistics - usage}}$

GffCompare is a program which can evaluate the specificity and novelty of transcripts within a sample-specific transcriptome to a reference transcriptome (documentation: http://ccb.jhu.edu/software/stringtie/gffcompare.shtml).

In [None]:
! eval gffcompare -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/gffcompare/$ACC -r $ANNOTATION_FILE_PATH -R -Q $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/stringtie/${ACC}.gtf

##$\color{#a7588f}{\text{GffCompare: transcript assembly statistics - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/gffcompare $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/gffcompare.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/gffcompare/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/gffcompare.zip"]))

##$\color{#d42bb4}{\text{FLAME: gene-specific long-read splice variant annotation - usage}}$

Full-Length Adjacency Matrix Enumeration (FLAME) is a program which can detect and quantify novel splice junctions on an annotated gene locus (documentation: https://github.com/marabouboy/FLAME). 

In [None]:
! bedtools bamtobed -bed12 -i $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bed12

In [None]:
! eval python3 /content/FLAME/FLAME/FLAME.py -I $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bed12 -GTF $ANNOTATION_FILE_PATH -G $REGION_NAME -O $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/flame/$ACC

##$\color{#a7588f}{\text{FLAME: gene-specific long-read splice variant annotation - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/flame $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/flame.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/flame/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/flame.zip"]))

##$\color{#d42bb4}{\text{featureCounts: assign sequence reads to genomic features - usage}}$

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! eval featureCounts -O -L -a $ANNOTATION_FILE_PATH -t exon -g gene_id -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/featureCounts/$ACC.txt $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam 
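
The featureCounts output is a tab-separated table: a comment line starting with `#`, a header row beginning with `Geneid`, and one row per gene. A minimal sketch for loading it into a dictionary is given below. The function name `read_feature_counts` is hypothetical, and the code assumes a single BAM file was counted, so that the last column holds the counts.

```python
import csv


def read_feature_counts(text):
    """Map each Geneid to its read count from featureCounts output text.

    Assumes one input BAM, i.e. the counts are in the last column.
    Hypothetical helper for illustration.
    """
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    count_column = reader.fieldnames[-1]  # column is named after the BAM path
    return {row["Geneid"]: int(row[count_column]) for row in reader}
```

The function takes the file contents as a string, for example the result of `open(path).read()` on the `$ACC.txt` file written above.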

##$\color{#a7588f}{\text{featureCounts: assign sequence reads to genomic features - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/featureCounts/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/featureCounts.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/featureCounts/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/featureCounts.zip"]))

##$\color{#d42bb4}{\text{LIQA: transcript quantification - usage}}$

LIQA is a program which quantifies isoform/transcript expression based on long read RNA sequencing data (documentation: https://github.com/WGLab/LIQA)

In [None]:
! eval liqa -task refgene -ref $ANNOTATION_FILE_PATH -format gtf -out $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$ACC.refgene

In [None]:
! eval liqa -task quantify -refgene $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/$ACC.refgene -bam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam -out $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/liqa/${ACC}_quantification.txt -max_distance 10 -f_weight 1

##$\color{#a7588f}{\text{LIQA: transcript quantification - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/liqa/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/liqa.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/liqa/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/liqa.zip"]))

##$\color{#d42bb4}{\text{FusionSeeker: detect gene fusions - usage}}$

FusionSeeker is a gene fusion caller for long-read single-molecule sequencing data (documentation: https://github.com/Maggi-Chen/FusionSeeker). 

In [None]:
! eval /content/FusionSeeker/fusionseeker --bam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam --outpath $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fusionseeker/ --ref $INDEX_FILE_PATH --gtf $ANNOTATION_FILE_PATH 

##$\color{#a7588f}{\text{FusionSeeker: detect gene fusions - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fusionseeker $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fusionseeker.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fusionseeker/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fusionseeker.zip"]))

##$\color{#d42bb4}{\text{svist4get: visualize genomic tracks from sequencing experiments - usage}}$

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [None]:
! bedtools genomecov -split -ibam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam -bg > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bedgraph

In [None]:
! eval svist4get -bg $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bedgraph -gtf $ANNOTATION_FILE_PATH -fa $INDEX_FILE_PATH -bl Long-Read Coverage -w $CHROMOSOME $CHROMOSOME_START $CHROMOSOME_FINISH -it "$REGION_NAME" -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/$ACC

##$\color{#a7588f}{\text{svist4get: visualize genomic tracks from sequencing experiments - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/svist4get.zip"]))

##$\color{#d42bb4}{\text{Pistis: quality control plotting for long reads - usage}}$

Pistis generates long-read specific quality control graphs, including a plot demonstrating read alignment percentage to the reference genome (documentation: https://github.com/mbhall88/pistis)

Please note that Pistis assumes more than 50,000 aligned reads when generating the report. If this is not the case, please add the flag `--downsample INTEGER`, replacing `INTEGER` with a number less than the number of aligned reads. Additionally, please ignore deprecation warnings. 

In [None]:
! pistis -f $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam  -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/pistis/$ACC.pdf
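
To decide whether the `--downsample` flag is needed, it helps to know roughly how many reads the sample contains. The sketch below counts records in an uncompressed FASTQ file, assuming the common four-lines-per-read layout (no wrapped sequence lines); the function name `count_fastq_reads` is hypothetical. The result is an upper bound on the number of aligned reads, since not every read necessarily aligns.

```python
def count_fastq_reads(path):
    """Return the number of reads in an uncompressed four-line FASTQ file.

    Assumes each record spans exactly four lines. Hypothetical helper.
    """
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```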

##$\color{#a7588f}{\text{Pistis: quality control plotting for long reads - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/pistis/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/pistis.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/pistis/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/pistis.zip"]))

##$\color{#2977d6}{\text{MakeHub: generate UCSC assembly hubs - usage}}$

MakeHub is a command line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC genome browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! chmod 755 $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/make_hub.py

In [None]:
! eval $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/make_hub.py -l $HUB_KEYWORD -L $HUB_NAME -g $INDEX_FILE_PATH -e \
  $HUB_EMAIL -a $ANNOTATION_FILE_PATH -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/

##$\color{#4f7bb0}{\text{MakeHub: generate UCSC assembly hubs - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/makehub.zip"]))

##$\color{#d42bb4}{\text{MultiQC: aggregate bioinformatics analysis - usage}}$

MultiQC is a program which allows you to combine reports for as many samples as you wish ([documentation](https://multiqc.info/docs/))

In [None]:
! multiqc $PIPELINE_FILE_PATH/long-read-sequencing-pipeline -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/multiqc/$ACC.pdf

##$\color{#a7588f}{\text{MultiQC: aggregate bioinformatics analysis - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/multiqc/ $STORAGE_FILE_PATH

To Store Resulting Files in your Local Hard Drive: 

In [None]:
! zip -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/multiqc.zip $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/multiqc/

In [None]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/multiqc.zip"]))