#$\text{L-RAP: Long Read RNA Sequencing Analysis Pipeline}$


$\text{A pipeline to analyze Oxford Nanopore and PacBio third-generation long transcriptomic sequencing reads}$

$\text{Theodore Nelson}$

$\text{Columbia University Irving Medical Center}$


##$\color{#e74b4b}{\text{Parameter Input and User Instructions}}$

Please define where the file structure is within your Google Drive:

<ul type=disc>
<li><b>PIPELINE_FILE_PATH</b>: file path to location of long-read RNA sequencing analysis pipeline within your Google Colab/Google Drive/local file system - required for most applications. This must be defined and initialized first.</li>
</ul>

In [97]:
%env PIPELINE_FILE_PATH=/content/long-read-sequencing-pipeline

env: PIPELINE_FILE_PATH=/content/long-read-sequencing-pipeline


In [2]:
! mkdir $PIPELINE_FILE_PATH

In [3]:
! cd $PIPELINE_FILE_PATH ; git clone https://github.com/Theo-Nelson/long-read-sequencing-pipeline

Cloning into 'long-read-sequencing-pipeline'...
remote: Enumerating objects: 117, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 117 (delta 60), reused 32 (delta 10), pack-reused 0
Receiving objects: 100% (117/117), 102.70 KiB | 7.90 MiB/s, done.
Resolving deltas: 100% (60/60), done.


Please modify the following parameters within the code box below to fit your own study requirements:  

<li><b>ACC</b>: Run accession number for reads within the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/) (SRR...) or file path to location of long-read RNA sequencing data within your Google Drive/general file system - required for most applications</li>
<li><b>PLATFORM_FREE</b>: choose either `map-ont` (Nanopore reads) or `map-pb` (PacBio reads) - required for minimap2 featherweight version</li>
<li><b>INDEX_FILE_PATH</b>: file path to location of reference genome (e.g. .FASTA) within your Google Drive/general file system - required for most applications</li>
<li><b>ANNOTATION_FILE_PATH</b>: file path to location of reference annotation (e.g. .GTF) within your Google Drive/general file system - required for most applications</li>
<li><b>PARTITIONED_INDEX_FILE_PATH</b>: choose a filename and a location for the partitioned reference index (e.g. .idx) that the pipeline generates within your Google Drive/general file system - required for minimap2 featherweight version</li>
<li><b>CHROMOSOME</b>: Name of Chromosome of Interest matching the name of the Chromosome within your Reference Annotation - required for svist4get</li>
<li><b>CHROMOSOME_START</b>: Starting Location of Interest on the Chromosome - required for svist4get</li>
<li><b>CHROMOSOME_FINISH</b>: Ending Location of Interest on the Chromosome - required for svist4get</li>
<li><b>REGION_NAME</b>: Gene Name (does not need to match annotation file) - required for svist4get</li>
<li><b>HUB_KEYWORD</b>: Short Keyword for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_NAME</b>: Longer Title for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_EMAIL</b>: Email for your UCSC Track Hub (if you publish your track hub then this email will be public) - required for MakeHub</li>


In [98]:
%env ACC=ERR1951293  
%env PLATFORM_FREE=map-pb
%env INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa
%env ANNOTATION_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf
%env PARTITIONED_INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.idx
%env CHROMOSOME=NC_000071.7
%env CHROMOSOME_START=118511219
%env CHROMOSOME_FINISH=118519017
%env REGION_NAME=Gm30604
%env HUB_KEYWORD=Gm30604
%env HUB_NAME="Murine Gm30604"
%env HUB_EMAIL=tmn2126@columbia.edu

env: ACC=ERR1951293
env: PLATFORM_FREE=map-pb
env: INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa
env: ANNOTATION_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf
env: PARTITIONED_INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/long-read-sequencing-pipeline/prebuilt_indices/mm39.idx
env: CHROMOSOME=NC_000071.7
env: CHROMOSOME_START=118511219
env: CHROMOSOME_FINISH=118519017
env: REGION_NAME=Gm30604
env: HUB_KEYWORD=Gm30604
env: HUB_NAME="Murine Gm30604"
env: HUB_EMAIL=tmn2126@columbia.edu
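
Because `%env` stores the text after `=` verbatim, the three path parameters above still contain the literal `${PIPELINE_FILE_PATH}` placeholder (as the cell output shows), which is why later cells pass their commands through `eval`. The following is a minimal sketch, assuming you want to inspect the resolved values from Python before starting a long run. `resolve_params`, `REQUIRED` and `PRESETS` are hypothetical names and not part of the pipeline.

```python
from string import Template

# Parameter names come from the list above; the helper itself is hypothetical.
REQUIRED = ("ACC", "PLATFORM_FREE", "INDEX_FILE_PATH",
            "ANNOTATION_FILE_PATH", "PARTITIONED_INDEX_FILE_PATH")
PRESETS = ("map-ont", "map-pb")

def resolve_params(env):
    """Return the required parameters with ${NAME} placeholders expanded."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("undefined parameters: " + ", ".join(missing))
    if env["PLATFORM_FREE"] not in PRESETS:
        raise ValueError("PLATFORM_FREE must be one of " + ", ".join(PRESETS))
    # safe_substitute leaves unknown placeholders untouched instead of raising
    return {name: Template(env[name]).safe_substitute(env) for name in REQUIRED}
```

In the notebook this could be called as `resolve_params(os.environ)` after `import os`; any mapping of names to strings works.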


##$\color{#e74b4b}{\text{Mounting your Google Drive}}$

This step allows for permanent storage of your bioinformatics analysis in Google Drive.

<ul type=disc>
<li><b>STORAGE_FILE_PATH</b>: file path to the location where you wish to store the output of the long-read RNA sequencing analysis pipeline - required to export data from Google Colab.</li>
</ul>

In [5]:
%env STORAGE_FILE_PATH=/content/drive/MyDrive/long-read-sequencing-pipeline

env: STORAGE_FILE_PATH=/content/drive/MyDrive/long-read-sequencing-pipeline


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##$\color{#e74b4b}{\text{Managing Software via BioConda}}$

BioConda is a software environment and package manager, providing access to over 8,000 different software packages related to bioinformatics (documentation: [BioConda](https://bioconda.github.io/user/install.html) and [Managing Environments via Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)).

In [7]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

--2022-08-22 13:29:40--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2022-08-22 13:29:41 (94.9 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0

##$\color{#ff00d5}{\text{Kingfisher: procurement of sequence files - installation}}$

The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [8]:
! git clone https://github.com/MakeTheBrainHappy/kingfisher-download

Cloning into 'kingfisher-download'...
remote: Enumerating objects: 696, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 696 (delta 60), reused 50 (delta 50), pack-reused 627
Receiving objects: 100% (696/696), 8.36 MiB | 5.48 MiB/s, done.
Resolving deltas: 100% (396/396), done.


In [9]:
! conda env update -n base --file kingfisher-download/kingfisher.yml

Collecting package metadata (repodata.json):

In [10]:
! conda install -c rpetit3 aspera-connect -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - aspera-connect


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aspera-connect-3.9.6       |                0        34.3 MB  rpetit3
    ca-certificates-2022.07.19 |       h06a4308_0         124 KB
    certifi-2022.6.15          |   py39h06a4308_0         153 KB
    openssl-1.1.1q             |       h7f8727e_0         2.5 MB
    ------------------------------------------------------------
                                           Total:        37.1 MB

The following NEW packages will be INSTALLED:

  aspera-connect

In [11]:
! wget -qO- https://download.asperasoft.com/download/sw/connect/3.9.8/ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.tar.gz | tar xvz

ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh


The following command prints an error message (`This script cannot be run as root`); please disregard it.

In [12]:
! ./ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh


Installing IBM Aspera Connect

This script cannot be run as root, IBM Aspera Connect must be installed per user.


##$\color{#d42bb4}{\text{Kingfisher: procurement of sequence files - usage}}$


The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [21]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; /content/kingfisher-download/bin/kingfisher get -r $ACC -m ena-ascp aws-http prefetch

08/22/2022 01:37:49 PM INFO: Kingfisher v0.0.1-dev
08/22/2022 01:37:49 PM INFO: Attempting download method ena-ascp for run ERR1951293 ..
08/22/2022 01:37:49 PM INFO: Using aspera ssh key file: $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh
08/22/2022 01:37:49 PM INFO: Querying ENA for FTP paths for ERR1951293..
08/22/2022 01:37:52 PM INFO: Downloading 1 FTP read set(s): ftp.sra.ebi.ac.uk/vol1/fastq/ERR195/003/ERR1951293/ERR1951293.fastq.gz
08/22/2022 01:37:52 PM INFO: Running command: ascp -T -l 300m -P33001 -k 2 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR195/003/ERR1951293/ERR1951293.fastq.gz .
08/22/2022 01:38:08 PM INFO: Method ena-ascp worked.
08/22/2022 01:38:08 PM INFO: Output files: ./ERR1951293.fastq.gz
08/22/2022 01:38:08 PM INFO: Kingfisher done.
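
The log above shows the ENA location that Kingfisher resolved for this run. The following is a minimal sketch of how such a directory can be derived from a run accession. It assumes ENA's directory layout: a six-character prefix directory, plus a zero-padded three-digit subdirectory built from any characters after the ninth. Only the ten-character case is confirmed by the log above, and `ena_fastq_dir` is a hypothetical helper that is not part of Kingfisher.

```python
def ena_fastq_dir(acc):
    """Build the ENA FASTQ directory for a run accession (layout is an assumption)."""
    prefix = acc[:6]
    extra = acc[9:]  # characters beyond the ninth, empty for short accessions
    parts = ["ftp.sra.ebi.ac.uk/vol1/fastq", prefix]
    if extra:
        parts.append(extra.zfill(3))  # e.g. "3" becomes "003"
    parts.append(acc)
    return "/".join(parts)

print(ena_fastq_dir("ERR1951293"))
# prints ftp.sra.ebi.ac.uk/vol1/fastq/ERR195/003/ERR1951293
```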


In [22]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; gunzip *.gz

The next two commands simply standardize the file name away from uncommon variants provided by depositors in the European Nucleotide Archive. In rare instances pipeline users may need to directly manipulate these commands to standardize the filename. Please do not be concerned if either command throws an error. 

In [23]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; mv ${ACC}_1.fastq $ACC.fastq

mv: cannot stat 'ERR1951293_1.fastq': No such file or directory


In [24]:
! cd $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq ; mv ${ACC}_subreads.fastq $ACC.fastq

mv: cannot stat 'ERR1951293_subreads.fastq': No such file or directory
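
The two `mv` commands above can be expressed as a single step that renames a file only when a known variant exists, which avoids the `cannot stat` messages. This is a minimal sketch: `standardize_fastq` is a hypothetical helper, and the two suffixes are the ones handled by the commands above.

```python
from pathlib import Path

def standardize_fastq(fastq_dir, acc):
    """Rename a known filename variant to <acc>.fastq; return the path or None."""
    fastq_dir = Path(fastq_dir)
    target = fastq_dir / f"{acc}.fastq"
    if target.exists():
        return target  # already standardized, nothing to do
    for suffix in ("_1", "_subreads"):
        candidate = fastq_dir / f"{acc}{suffix}.fastq"
        if candidate.exists():
            candidate.rename(target)
            return target
    return None  # unknown variant: the filename needs manual attention
```

A return value of `None` signals one of the rare cases in which the filename must be standardized by hand.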


##$\color{#a7588f}{\text{Kingfisher: procurement of sequence files - export}}$


To Store Resulting Files in your Google Drive: 

In [25]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq $STORAGE_FILE_PATH/$ACC.fastq
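
The `cp` command above fails if the storage folder does not exist yet. The following is a minimal sketch of an export step that creates the folder first and refuses to copy a missing file. `export_file` is a hypothetical helper, not part of the pipeline.

```python
import shutil
from pathlib import Path

def export_file(source, storage_dir):
    """Copy source into storage_dir, creating the directory if needed."""
    source = Path(source)
    if not source.is_file():
        raise FileNotFoundError(f"nothing to export: {source}")
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps where the file system allows it
    return Path(shutil.copy2(source, storage_dir / source.name))
```

The same helper could serve the later export steps by passing the corresponding output file and the value of `STORAGE_FILE_PATH`.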

To Store Resulting Files in your Local Hard Drive: 

In [26]:
# Optional: variables set with %env are already visible to os.environ, and each "!" line runs in its own shell.
! export ACC
! export PIPELINE_FILE_PATH


In [27]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fastq/",os.environ["ACC"],".fastq"]))


##$\color{#ff00d5}{\text{FastQC: quality control tool for high throughput sequence data - installation}}$

FastQC is a program designed to spot potential problems in high throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [28]:
! conda install -c bioconda fastqc -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - fastqc


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fastqc-0.11.9              |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37|       hd3eb1b0_0         335 KB
    fontconfig-2.13.1          |       h6c09931_0         250 KB
    freetype-2.11.0            |       h70c0345_0         618 KB
    libpng-1.6.37              |       hbc83047_0         278 KB
    libuuid-1.0.3              |       h7f8727e_2          17 KB
    open

##$\color{#d42bb4}{\text{FastQC: quality control tool for high throughput sequence data - usage}}$

FastQC is a program designed to spot potential problems in high throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [29]:
! fastqc $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq --outdir $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc

Started analysis of ERR1951293.fastq
Approx 5% complete for ERR1951293.fastq
Approx 10% complete for ERR1951293.fastq
Approx 15% complete for ERR1951293.fastq
Approx 20% complete for ERR1951293.fastq
Approx 25% complete for ERR1951293.fastq
Approx 30% complete for ERR1951293.fastq
Approx 35% complete for ERR1951293.fastq
Approx 40% complete for ERR1951293.fastq
Approx 45% complete for ERR1951293.fastq
Approx 50% complete for ERR1951293.fastq
Approx 55% complete for ERR1951293.fastq
Approx 60% complete for ERR1951293.fastq
Approx 65% complete for ERR1951293.fastq
Approx 70% complete for ERR1951293.fastq
Approx 75% complete for ERR1951293.fastq
Approx 80% complete for ERR1951293.fastq
Approx 85% complete for ERR1951293.fastq
Approx 90% complete for ERR1951293.fastq
Approx 95% complete for ERR1951293.fastq
Analysis complete for ERR1951293.fastq


For the best viewing experience, please download the HTML output to your hard drive and open it in a local browser such as Google Chrome, Firefox or Edge.
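
For a quick look without opening the report, the per-module PASS/WARN/FAIL flags can be read from the zip archive that FastQC writes next to the HTML file. This is a minimal sketch, assuming the archive contains a tab-separated `summary.txt` with status, module name and file name on each line. `fastqc_flags` is a hypothetical helper.

```python
import zipfile

def fastqc_flags(zip_file):
    """Return {module: status} read from summary.txt inside a FastQC archive."""
    with zipfile.ZipFile(zip_file) as archive:
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(name).decode("utf-8")
    flags = {}
    for line in text.splitlines():
        if line.strip():
            status, module, _filename = line.split("\t")
            flags[module] = status
    return flags
```

The argument can be a path such as the `_fastqc.zip` file in the `fastqc` output folder, or any open binary file object.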

##$\color{#a7588f}{\text{FastQC: quality control tool for high throughput sequence data - export}}$


To Store Resulting Files in your Google Drive: 

In [34]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastqc/${ACC}_fastqc.html $STORAGE_FILE_PATH/${ACC}_fastqc.html

To Store Resulting Files in your Local Hard Drive: 

In [35]:
! export ACC
! export PIPELINE_FILE_PATH


In [36]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/fastqc/",os.environ["ACC"],"_fastqc.html"]))


##$\color{#e74b4b}{\text{Reference Genome - installation}}$

The reference genome is the sequence to which the long-read sequencing reads are aligned. These commands download the hg38 genome and Ensembl annotation available from UCSC. You can download more current genomes by using the appropriate links from NCBI RefSeq, Ensembl, or other reference genome providers. If you are unsure how to find other species, we recommend checking the list of species available in the `current_fasta` and `current_gtf` folders: http://ftp.ensembl.org/pub/

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

In [None]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ensGene.gtf.gz

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.fa.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.fa

In [None]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.ensGene.gtf.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/hg38.ensGene.gtf

A second example follows, which downloads the UCSC murine mm39 genome. Note that the reference annotation is from RefSeq.

In [40]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices ftp://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz

--2022-08-22 14:04:12--  ftp://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
           => ‘/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa.gz’
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /goldenPath/mm39/bigZips ... done.
==> SIZE mm39.fa.gz ... 870543764
==> PASV ... done.    ==> RETR mm39.fa.gz ... done.
Length: 870543764 (830M) (unauthoritative)


2022-08-22 14:05:09 (15.2 MB/s) - ‘/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa.gz’ saved [870543764]



In [41]:
! wget -P $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz

--2022-08-22 14:05:10--  https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29617837 (28M) [application/x-gzip]
Saving to: ‘/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf.gz’


2022-08-22 14:05:13 (10.4 MB/s) - ‘/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf.gz’ saved [29617837/29617837]



In [42]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.fa

In [43]:
! gunzip -c $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf.gz > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf
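
When the genome and the annotation come from different providers, their sequence names may differ, and tools that match records by name can then silently skip features. The following is a minimal sketch of a consistency check that compares FASTA header names with the first column of the GTF. The function names are hypothetical and the check is not part of the pipeline.

```python
def fasta_names(lines):
    """Sequence names from FASTA headers (text up to the first whitespace)."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].strip()}

def gtf_names(lines):
    """Sequence names from the first GTF column, skipping comments and blanks."""
    return {line.split("\t", 1)[0] for line in lines
            if line.strip() and not line.startswith("#")}

def unmatched_names(fasta_lines, gtf_lines):
    """Names used by the annotation that are absent from the genome."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))
```

Both arguments can be open text files, for example the unzipped `.fa` and `.gtf` files from the cells above. An empty list means every annotated sequence name is present in the genome.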

##$\color{#e74b4b}{\text{minimap2 featherweight alignment - installation}}$

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8).

In [44]:
! wget https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz
! tar xvf v0.1.tar.gz && cd minimap2-arm-0.1 && make

--2022-08-22 14:09:29--  https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/hasindu2008/minimap2-arm/tar.gz/refs/tags/v0.1 [following]
--2022-08-22 14:09:29--  https://codeload.github.com/hasindu2008/minimap2-arm/tar.gz/refs/tags/v0.1
Resolving codeload.github.com (codeload.github.com)... 140.82.121.10
Connecting to codeload.github.com (codeload.github.com)|140.82.121.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘v0.1.tar.gz.1’

v0.1.tar.gz.1           [ <=>                ] 228.88K  --.-KB/s    in 0.03s   

2022-08-22 14:09:29 (6.71 MB/s) - ‘v0.1.tar.gz.1’ saved [234373]

minimap2-arm-0.1/
minimap2-arm-0.1/.gitignore
minimap2-arm-0.1/.travis.yml
minimap2-arm-0.1/LICENSE.txt
minimap2-arm-0.1/MA

In [45]:
! chmod u+x /content/minimap2-arm-0.1/misc/idxtools/divide_and_index.sh

In [46]:
! apt-get install bc

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  bc
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 86.2 kB of archives.
After this operation, 223 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 bc amd64 1.07.1-2 [86.2 kB]
Fetched 86.2 kB in 0s (208 kB/s)
Selecting previously unselected package bc.
(Reading database ... 155676 files and directories currently installed.)
Preparing to unpack .../archives/bc_1.07.1-2_amd64.deb ...
Unpacking bc (1.07.1-2) ...
Setting up bc (1.07.1-2) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


##$\color{#e74b4b}{\text{minimap2 featherweight alignment - partitioning}}$

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8). This command only needs to be run once, as long as the partitioned reference index is not removed from the directory.

In [48]:
! eval /content/minimap2-arm-0.1/misc/idxtools/divide_and_index.sh $INDEX_FILE_PATH 4 $PARTITIONED_INDEX_FILE_PATH /content/minimap2-arm-0.1/minimap2 $PLATFORM_FREE

Compiling divide.c
Running divider
INFO : Collecting chromosome stats.
INFO : 61 chromosomes parsed.
INFO : Determining partitions.
INFO : Partition 0 - 6 chromosomes (702432803 non-ambiguous bases). See stat_part0.csv.
INFO : Partition 1 - 13 chromosomes (650730000 non-ambiguous bases). See stat_part1.csv.
INFO : Partition 2 - 16 chromosomes (650734136 non-ambiguous bases). See stat_part2.csv.
INFO : Partition 3 - 26 chromosomes (650724844 non-ambiguous bases). See stat_part3.csv.
INFO : Writing partitions.
INFO : Done.
Minimap2 Indexing
Creating partition 0
[M::mm_idx_gen::17.129*1.58] collected minimizers
[M::mm_idx_gen::20.511*1.81] sorted minimizers
[M::main::25.097*1.58] loaded/built the index for 6 target sequence(s)
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 6
[M::mm_idx_stat::25.662*1.57] distinct minimizers: 38239488 (62.53% are singletons); average occurrences: 2.413; average spacing: 7.828
[M::main] Version: 2.11-r797
[M::main] CMD: /content/minimap2-arm-0.1
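
The partition log above shows four groups of chromosomes with roughly equal numbers of non-ambiguous bases. The following is a minimal sketch of one way to reach such a balance, using a greedy largest-first assignment. It only illustrates the idea and is an assumption, not the algorithm implemented in `divide.c`. `partition_by_size` is a hypothetical name.

```python
import heapq

def partition_by_size(lengths, parts):
    """Greedily assign sequences (name -> length) to `parts` bins, largest first."""
    heap = [(0, i) for i in range(parts)]  # (total bases so far, bin index)
    bins = [[] for _ in range(parts)]
    for name, length in sorted(lengths.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)     # currently lightest bin
        bins[i].append(name)
        heapq.heappush(heap, (total + length, i))
    return bins
```

Each sequence goes to whichever bin currently holds the fewest bases, so a few large chromosomes end up sharing a bin with few others while many small ones are grouped together.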

##$\color{#e74b4b}{\text{minimap2 featherweight alignment - usage}}$

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8).

In [49]:
! eval /content/minimap2-arm-0.1/minimap2 -a -x $PLATFORM_FREE $PARTITIONED_INDEX_FILE_PATH $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/fastq/$ACC.fastq --multi-prefix tmp > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam

[M::main::2.327*1.00] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::3.136*1.00] mid_occ = 293
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 6
[M::mm_idx_stat::3.672*1.00] distinct minimizers: 38239488 (62.53% are singletons); average occurrences: 2.413; average spacing: 7.828
[M::worker_pipeline::21.783*2.51] mapped 301135 sequences
[M::main::23.638*2.39] loaded/built the index for 13 target sequence(s)
[M::mm_mapopt_update::23.638*2.39] mid_occ = 293
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 13
[M::mm_idx_stat::24.172*2.36] distinct minimizers: 40118326 (62.16% are singletons); average occurrences: 2.136; average spacing: 7.782
[M::worker_pipeline::49.748*2.63] mapped 301135 sequences
[M::main::51.560*2.58] loaded/built the index for 16 target sequence(s)
[M::mm_mapopt_update::51.560*2.58] mid_occ = 293
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 16
[M::mm_idx_stat::52.117*2.56] distinct minimizers: 38477470 (62.33% 
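
Once the alignment has finished, the resulting SAM file can be summarized by decoding the FLAG column of each record. This is a minimal sketch based on the SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). `mapping_summary` is a hypothetical helper and not part of minimap2.

```python
def mapping_summary(sam_lines):
    """Count primary mapped, unmapped, and secondary/supplementary SAM records."""
    counts = {"mapped": 0, "unmapped": 0, "secondary_or_supplementary": 0}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        flag = int(line.split("\t")[1])
        if flag & 0x900:          # 0x100 secondary, 0x800 supplementary
            counts["secondary_or_supplementary"] += 1
        elif flag & 0x4:          # read is unmapped
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts
```

Passing an open handle to the `.sam` file written by the cell above returns the three counts as a dictionary.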

##$\color{#e74b4b}{\text{minimap2 featherweight alignment - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam $STORAGE_FILE_PATH/$ACC.sam

To Store Resulting Files in your Local Hard Drive: 

In [64]:
! export ACC
! export PIPELINE_FILE_PATH


In [53]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/sam/",os.environ["ACC"],".sam"]))


##$\color{#ff00d5}{\text{samtools: Write/Index SAM to BAM - installation}}$

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/) 


In [54]:
! conda install -c bioconda samtools -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - samtools


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bzip2-1.0.8                |       h7b6447c_0          78 KB
    libgcc-7.2.0               |       h69d50b8_2         269 KB
    samtools-1.7               |                1         1.0 MB  bioconda
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTAL

##$\color{#d42bb4}{\text{samtools: Write/Index SAM to BAM - usage}}$

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/) 


In [55]:
! samtools view -S -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.bam 

In [58]:
! samtools sort $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.bam -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam  

In [59]:
! samtools index $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam 
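
The three samtools steps above (convert, sort, index) depend on each other, so a failure in one should stop the rest. The following is a minimal sketch that builds the same commands as argument lists and runs them in order. The helper names are hypothetical, and `-o` is used for the output file in place of shell redirection.

```python
import subprocess

def samtools_commands(sam_path, bam_dir, acc):
    """Return the view, sort and index commands as argument lists."""
    bam = f"{bam_dir}/{acc}.bam"
    sorted_bam = f"{bam_dir}/{acc}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
    ]

def run_all(commands):
    """Run each command in order; check=True raises on the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(samtools_commands(...))` requires samtools to be installed as in the installation step; building the list alone has no side effects.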

##$\color{#a7588f}{\text{samtools: Write/Index SAM to BAM - export}}$


To Store Resulting Files in your Google Drive: 

In [60]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam $STORAGE_FILE_PATH/$ACC.sorted.bam

In [None]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam.bai $STORAGE_FILE_PATH/$ACC.sorted.bam.bai

To Store Resulting Files in your Local Hard Drive: 

In [65]:
! export ACC
! export PIPELINE_FILE_PATH


In [66]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/bam/",os.environ["ACC"],".sorted.bam"]))


In [101]:
from google.colab import files
import os

files.download("".join([os.environ["PIPELINE_FILE_PATH"],"/long-read-sequencing-pipeline/bam/",os.environ["ACC"],".sorted.bam.bai"]))


##$\color{#ff00d5}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - installation}}$

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

In [67]:
! conda install -c bioconda pyfasta pyranges samtools -y

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment:

In [68]:
! wget https://github.com/mortazavilab/TranscriptClean/archive/refs/tags/v2.0.3.tar.gz
! tar xvf v2.0.3.tar.gz 

--2022-08-22 14:40:55--  https://github.com/mortazavilab/TranscriptClean/archive/refs/tags/v2.0.3.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/mortazavilab/TranscriptClean/tar.gz/refs/tags/v2.0.3 [following]
--2022-08-22 14:40:56--  https://codeload.github.com/mortazavilab/TranscriptClean/tar.gz/refs/tags/v2.0.3
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘v2.0.3.tar.gz’

v2.0.3.tar.gz           [            <=>     ] 206.75M  2.03MB/s    in 24s     

2022-08-22 14:41:20 (8.50 MB/s) - ‘v2.0.3.tar.gz’ saved [216793874]

TranscriptClean-2.0.3/
TranscriptClean-2.0.3/.gitignore
TranscriptClean-2.0.3/LICENSE.md
Trans

##$\color{#d42bb4}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - usage}}$

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

Please note that the corrected reads should not be used for downstream base-level analyses such as variant calling.

In [69]:
! eval python /content/TranscriptClean-2.0.3/TranscriptClean.py --sam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/sam/$ACC.sam --genome $INDEX_FILE_PATH --outprefix $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/$ACC

Reading genome ..............................
No splice annotation provided. Will skip splice junction correction.
No variant file provided. Transcript correction will not be variant-aware.
Reference file processing took 0:00:03
Correcting transcripts...
Took 0:44:05 to process transcript batch.
Took 0:00:05 to combine all outputs.


##$\color{#a7588f}{\text{TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - export}}$


To Store Resulting Files in your Google Drive: 

In [70]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/${ACC}_clean.fa $STORAGE_FILE_PATH/${ACC}_clean.fa

In [71]:
! cp $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/transcriptclean/${ACC}_clean.sam $STORAGE_FILE_PATH/${ACC}_clean.sam

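To Store Resulting Files on your Local Hard Drive: 

In [None]:
# Confirm that both variables are set. Each `!` command runs in its own
# subshell, so `export` there would not carry over to later cells.
%env ACC
%env PIPELINE_FILE_PATH

In [None]:
from google.colab import files
import os

# Download the corrected reads written by TranscriptClean.
out_dir = os.path.join(os.environ["PIPELINE_FILE_PATH"], "long-read-sequencing-pipeline", "transcriptclean")
for suffix in ("_clean.fa", "_clean.sam"):
    files.download(os.path.join(out_dir, os.environ["ACC"] + suffix))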


##$\color{#ff00d5}{\text{FLAME: long-read splice variant annotation - installation}}$

In [76]:
! git clone https://github.com/marabouboy/FLAME

Cloning into 'FLAME'...
remote: Enumerating objects: 298, done.
remote: Counting objects: 100% (296/296), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 298 (delta 127), reused 267 (delta 104), pack-reused 2
Receiving objects: 100% (298/298), 11.94 MiB | 21.68 MiB/s, done.
Resolving deltas: 100% (127/127), done.


In [88]:
! chmod 755 /content/FLAME/FLAME/FLAME.py

In [90]:
! chmod 755 /content/FLAME/setup.py

In [95]:
! cd /content/FLAME/ ; python3 setup.py install

running install
running bdist_egg
running egg_info
creating FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info
writing FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/PKG-INFO
writing dependency_links to FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/dependency_links.txt
writing top-level names to FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/top_level.txt
writing manifest file 'FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/SOURCES.txt'
reading manifest file 'FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib

creating build
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying FLAME_Full_Length_Adjecency_Matrix_Enumeration.egg-info/PKG-INFO -> build/bdist.linu

##$\color{#d42bb4}{\text{FLAME: long-read splice variant annotation - usage}}$

In [103]:
! bedtools bamtobed -bed12 -i $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bed12
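
The BED12 file produced above stores each aligned read as one line whose last three columns (`blockCount`, `blockSizes`, `blockStarts`) describe its aligned blocks. As a rough sketch of how those columns translate into exon coordinates, the helper below parses a single line; `bed12_exons` and the example record are hypothetical and are not part of FLAME or bedtools.

```python
def bed12_exons(line):
    """Return (chrom, strand, [(exon_start, exon_end), ...]) for one BED12 line.

    Coordinates are 0-based and half-open, as in the BED format.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match the block lists")
    # blockStarts are offsets relative to chromStart.
    exons = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return chrom, strand, exons

# Made-up record: one read with two aligned blocks separated by an intron.
example = "chr1\t1000\t1500\tread1\t60\t+\t1000\t1500\t0,0,0\t2\t100,200,\t0,300,"
print(bed12_exons(example))
```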

In [107]:
! eval python3 /content/FLAME/FLAME/FLAME.py -I $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bed12 -GTF $ANNOTATION_FILE_PATH -G $REGION_NAME -O $ACC


-------------------------------------------------------------------------------------------

-----------	FLAME: Full Length Adjacency Matrix Enumeration			-----------

-------------------------------------------------------------------------------------------

-----------	Initiating FLAME						-----------
Input:		/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/bed/ERR1951293.sorted.bed12
GTF:		/content/long-read-sequencing-pipeline/long-read-sequencing-pipeline/prebuilt_indices/mm39.ncbiRefSeq.gtf
Gene:		Gm30604
Range:		20
Output:		Flame-[Suffix]
-------------------------------------------------------------------------------------------



-----------	Initiate Filter Function					-----------
Status: [##################################################] 100% Done...
-----------	Initiate Translate Function, Incongruent			-----------
Status: [##################################################] 100% Done...
-----------	Initiate Creation of Empty Adjacency Matrix			----

##$\color{#ff00d5}{\text{featureCounts: an efficient general purpose program for assigning sequence reads to genomic features - installation}}$

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! conda install -c bioconda subread -y

##$\color{#d42bb4}{\text{featureCounts: an efficient general purpose program for assigning sequence reads to genomic features - usage}}$

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! eval featureCounts -M -O -T 24 -L -a $ANNOTATION_FILE_PATH -t exon -g gene_id -o $PIPELINE_FILE_PATH/featureCounts/$ACC.txt $PIPELINE_FILE_PATH/bam/$ACC.bam 
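
The count table written by featureCounts is tab-delimited: a comment line starting with `#`, a header row, and then one row per gene with the read count in the last column. A minimal sketch for loading such a single-sample table into a dictionary is shown below; `read_counts` and the example rows are made up for illustration, and the integer conversion assumes fractional counting was not requested.

```python
import csv
import io

def read_counts(handle):
    """Map gene ID -> read count from a single-sample featureCounts table."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    count_col = len(header) - 1  # counts for the only BAM file are in the last column
    return {row[0]: int(row[count_col]) for row in rows if row}

# Made-up table in the featureCounts layout (not real output).
example = io.StringIO(
    "# Program:featureCounts; Command:...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
    "geneA\tchr1\t100\t500\t+\t401\t12\n"
    "geneB\tchr2\t50\t90\t-\t41\t0\n"
)
counts = read_counts(example)
print(counts)
```

To use it on real output, the same function could be called on an open file handle for the `.txt` file written by the command above.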

##$\color{#a7588f}{\text{featureCounts: an efficient general purpose program for assigning sequence reads to genomic features - export}}$


To Store Resulting Files in your Google Drive: 

In [None]:
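! cp $PIPELINE_FILE_PATH/featureCounts/$ACC.txt $STORAGE_FILE_PATH/$ACC.txt

To Store Resulting Files on your Local Hard Drive: 

In [None]:
# Confirm that both variables are set. Each `!` command runs in its own
# subshell, so `export` there would not carry over to later cells.
%env ACC
%env PIPELINE_FILE_PATH

In [None]:
from google.colab import files
import os

# Download the count table written by featureCounts.
files.download(os.path.join(os.environ["PIPELINE_FILE_PATH"], "featureCounts", os.environ["ACC"] + ".txt"))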


##svist4get: a simple visualization tool for genomic tracks from sequencing experiments - installation

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [77]:
! sudo apt-get install bedtools

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  bedtools
0 upgraded, 1 newly installed, 0 to remove and 20 not upgraded.
Need to get 577 kB of archives.
After this operation, 2,040 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 bedtools amd64 2.26.0+dfsg-5 [577 kB]
Fetched 577 kB in 0s (5,207 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg

In [78]:
! apt-get update

Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:4 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease

In [79]:
! apt-get install libmagickwand-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  fonts-droid-fallback fonts-noto-mono ghostscript gir1.2-freedesktop
  gir1.2-gdkpixbuf-2.0 gir1.2-rsvg-2.0 gsfonts imagemagick-6-common
  libcairo-script-interpreter2 libcairo2-dev libcupsfilters1 libcupsimage2
  libdjvulibre-dev libdjvulibre-text libdjvulibre21 libgdk-pixbuf2.0-dev
  libgs9 libgs9-common libijs-0.35 libjbig2dec0 liblcms2-dev liblqr-1-0
  liblqr-1-0-dev libmagickcore-6-arch-config libmagickcore-6-headers
  libmagickcore-6.q16-3 libmagickcore-6.q16-3-extra libmagickcore-6.q16-dev
  libmagickwand-6-headers libmagickwand-6.q16-3 libmagickwand-6.q16-dev
  libpixman-1-dev librsvg2-dev libwmf-dev libwmf0.2-7 libxcb-shm0-dev
  poppler-data
Suggested packages:
  fonts-noto ghostscript-x libcairo2

In [80]:
! cp -r $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/policy_revised.xml /etc/ImageMagick-6/policy.xml


In [81]:
! python3 -m pip install svist4get

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting svist4get
  Downloading svist4get-1.3-py3-none-any.whl (5.8 MB)
     |████████████████████████████████| 5.8 MB 13.8 MB/s 
Collecting reportlab
  Downloading reportlab-3.6.11-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
     |████████████████████████████████| 2.8 MB 65.1 MB/s 
Collecting wand
  Downloading Wand-0.6.10-py2.py3-none-any.whl (142 kB)
     |████████████████████████████████| 142 kB 47.2 MB/s 
Collecting statistics
  Downloading statistics-1.0.3.5.tar.gz (8.3 kB)
Collecting argparse
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Collecting biopython
  Downloading biopython-1.79-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 47.4 MB/s 
Collecting configs
  Downloading configs-3.0.3-py3-none-any.whl (7.1 kB)
Collecting pillow>=9.0.0
  Downloading P

##svist4get: a simple visualization tool for genomic tracks from sequencing experiments - usage

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [83]:
! bedtools genomecov -ibam $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam -bg > $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bedgraph

tcmalloc: large alloc 1561239552 bytes == 0x5604f94ba000 @  0x7f68545d8887 0x5604f7fd653a 0x5604f7fd5659 0x5604f7fd613c 0x5604f7fda960 0x5604f7f10ab3 0x7f6853c81c87 0x5604f7f1548a
tcmalloc: large alloc 1454047232 bytes == 0x5604f94ba000 @  0x7f68545d8887 0x5604f7fd653a 0x5604f7fd5659 0x5604f7fd613c 0x5604f7fda960 0x5604f7f10ab3 0x7f6853c81c87 0x5604f7f1548a
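
Each line of the bedGraph file written above holds a chromosome, a 0-based start, an end and a depth value, and regions without coverage are simply left out. As a quick sanity check before plotting, the sketch below estimates the mean depth over a window such as the one later passed to svist4get. The function `mean_coverage` and the example intervals are hypothetical and only illustrate the file layout.

```python
def mean_coverage(lines, chrom, start, end):
    """Mean per-base depth over chrom:[start, end) from bedGraph lines.

    Positions absent from the bedGraph are treated as zero coverage.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    covered = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != chrom:
            continue  # skips other chromosomes and any track/header lines
        seg_start, seg_end, depth = int(fields[1]), int(fields[2]), float(fields[3])
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            covered += overlap * depth
    return covered / (end - start)

# Made-up intervals: depth 2 on chr1:0-10 and depth 4 on chr1:10-20.
example = ["chr1\t0\t10\t2", "chr1\t10\t20\t4", "chr2\t0\t50\t9"]
print(mean_coverage(example, "chr1", 5, 20))
```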


In [None]:
! eval svist4get -bg $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bed/$ACC.sorted.bedgraph -gtf $ANNOTATION_FILE_PATH -fa $INDEX_FILE_PATH -bl '"Long-Read Coverage"' -w $CHROMOSOME $CHROMOSOME_START $CHROMOSOME_FINISH -it "$REGION_NAME" -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/svist4get/$ACC

##Pistis: Quality control plotting for long reads - installation

Pistis generates long-read specific quality control graphs, including a plot demonstrating read alignment percentage to the reference genome (documentation: https://github.com/mbhall88/pistis)

In [None]:
! pip3 install pistis

##Pistis: Quality control plotting for long reads - usage

Pistis generates long-read specific quality control graphs, including a plot demonstrating read alignment percentage to the reference genome (documentation: https://github.com/mbhall88/pistis)

Please note that Pistis assumes that more than 50,000 reads are aligned when it generates the report. If fewer reads are aligned, please add the flag `--downsample INTEGER`, replacing `INTEGER` with a number smaller than the number of aligned reads. 
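
To decide whether `--downsample` is needed, it helps to know how many reads are aligned. The sketch below counts primary alignments of mapped reads in SAM-formatted text using the standard SAM flag bits; `count_aligned_reads` and the example records are assumptions made for illustration and are not part of Pistis.

```python
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800

def count_aligned_reads(sam_lines):
    """Count mapped reads, one per primary alignment, in SAM-formatted lines."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        # Skip unmapped reads and extra (secondary/supplementary) records.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
    return total

# Made-up records: only the flag column matters for this count.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100",
    "read2\t4\t*\t0",
    "read1\t2048\tchr1\t900",
    "read3\t16\tchr2\t50",
    "read3\t256\tchr5\t70",
]
aligned = count_aligned_reads(example)
print(aligned, "aligned reads; downsample needed:", aligned < 50000)
```

One way to apply this to real data would be to pass it the text output of `samtools view` for the sorted BAM file.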

In [None]:
! pistis -f $PIPELINE_FILE_PATH/fastq/$ACC.fastq -b $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam  -o $PIPELINE_FILE_PATH/pistis/$ACC.pdf

##MakeHub: Fully automated generation of UCSC assembly hubs - installation

MakeHub is a command-line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC genome browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! python3 -m pip install biopython

In [None]:
! sudo apt install -y samtools

In [None]:
! sudo apt install -y augustus augustus-data augustus-doc

##MakeHub: Fully automated generation of UCSC assembly hubs - usage

MakeHub is a command-line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC genome browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! chmod 755 $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/make_hub.py

In [None]:
! eval $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/make_hub.py -l $HUB_KEYWORD -L $HUB_NAME -g $INDEX_FILE_PATH -e \
  $HUB_EMAIL -a $ANNOTATION_FILE_PATH -b $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/bam/$ACC.sorted.bam -o $PIPELINE_FILE_PATH/long-read-sequencing-pipeline/makehub/

##MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report - installation



MultiQC is a program which allows you to combine reports for as many samples as you wish ([documentation](https://multiqc.info/docs/))

In [None]:
! pip install multiqc

##MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report - usage

MultiQC is a program which allows you to combine reports for as many samples as you wish ([documentation](https://multiqc.info/docs/))

In [None]:
! multiqc $PIPELINE_FILE_PATH -o $PIPELINE_FILE_PATH/multiqc

#$\text{L-RAP: Long Read RNA Sequencing Analysis Pipeline - Pro Features}$


##minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (Google Colab Pro required) - installation

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

In [None]:
! conda install -c bioconda minimap2 -y

##minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (Google Colab Pro required) - usage

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

The free tier of Google Colab typically provides virtual machines with about 12 GB of RAM. Aligning reads to the human genome with minimap2 usually requires more than that, so this command is likely to fail on the free tier. For free-tier users, this pipeline also provides a less memory-intensive (featherweight) version of minimap2. 
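
Before starting a long alignment, it can be useful to check how much memory the current runtime actually has. The sketch below reads the total physical memory on a Linux machine (which Colab virtual machines are) and compares it with a required amount. The helper names are hypothetical, and the 12 GB figure is simply the threshold mentioned above, not a measured requirement of minimap2.

```python
import os

def total_ram_gb():
    """Total physical memory in GiB, read through POSIX sysconf values."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

def enough_ram(required_gb, available_gb=None):
    """True if the available memory (default: this machine) covers required_gb."""
    if available_gb is None:
        available_gb = total_ram_gb()
    return available_gb >= required_gb

print(f"{total_ram_gb():.1f} GiB total; more than 12 GB available: {enough_ram(12)}")
```

If the check returns `False`, the featherweight minimap2 cells are probably the safer choice.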

In [None]:
! eval minimap2 -ax splice $INDEX_FILE_PATH $PIPELINE_FILE_PATH/fastq/$ACC.fastq > $PIPELINE_FILE_PATH/sam/$ACC.sam

##TAMA: Transcriptome Annotation by Modular Algorithms (Google Colab Pro required) - installation



TAMA allows for the construction of a polished transcriptome by collapsing aligned reads into individual transcripts (documentation: https://github.com/GenomeRIK/tama/wiki). 

In [None]:
! git clone https://github.com/MakeTheBrainHappy/tama

In [None]:
! python -m pip install biopython

In [None]:
! pip install pysam

##TAMA: Transcriptome Annotation by Modular Algorithms (Google Colab Pro required) - usage

TAMA allows for the construction of a polished transcriptome by collapsing aligned reads into individual transcripts (documentation: https://github.com/GenomeRIK/tama/wiki). 

In [None]:
! eval python tama/tama_collapse.py -s $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam -f $INDEX_FILE_PATH -p $PIPELINE_FILE_PATH/tama/$ACC.polished -b BAM -rm low_mem