#Long Read RNA Sequencing Analysis Pipeline (Requires Pre-Built Indices)


A pipeline for analyzing third-generation long-read transcriptomic sequencing data from Oxford Nanopore and PacBio platforms

*Theodore Nelson (tmn2126)* - Columbia University Irving Medical Center


##Parameter Input and User Instructions

Please define where the pipeline directory is located within your Google Drive:

<ul type=disc>
<li><b>PIPELINE_FILE_PATH</b>: file path to location of long-read RNA sequencing analysis pipeline within your Google Drive/general file system - required for most applications. This must be defined and initialized first.</li>
</ul>

In [None]:
%env PIPELINE_FILE_PATH=/content/drive/MyDrive/long-read-sequencing-pipeline


env: PIPELINE_FILE_PATH=/content/drive/MyDrive/long_read_transcriptomic_sequencing_pipeline


Please modify the following parameters within the code box below to fit your own study requirements:  

<ul type=disc>
<li><b>ACC</b>: Run accession number for reads within the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/) (SRR...) or file path to location of long-read RNA sequencing data within your Google Drive/general file system - required for most applications</li>
<li><b>PLATFORM_FREE</b>: choose either `map-ont` (Nanopore reads) or `map-pac` (PacBio reads) - required for minimap2 featherweight version</li>
<li><b>INDEX_FILE_PATH</b>: file path to location of reference genome (e.g. .FASTA) within your Google Drive/general file system - required for most applications</li>
<li><b>ANNOTATION_FILE_PATH</b>: file path to location of reference annotation (e.g. .GTF) within your Google Drive/general file system - required for most applications</li>
<li><b>PARTITIONED_INDEX_FILE_PATH</b>: file path to location of partitioned reference index (e.g. .idx) within your Google Drive/general file system - required for minimap2 featherweight version</li>
<li><b>CHROMOSOME</b>: Name of Chromosome of Interest matching the name of the Chromosome within your Reference Annotation - required for svist4get</li>
<li><b>CHROMOSOME_START</b>: Starting Location of Interest on the Chromosome - required for svist4get</li>
<li><b>CHROMOSOME_FINISH</b>: Ending Location of Interest on the Chromosome - required for svist4get</li>
<li><b>REGION_NAME</b>: Gene Name (does not need to match annotation file) - required for svist4get</li>
<li><b>HUB_KEYWORD</b>: Short Keyword for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_NAME</b>: Longer Title for your UCSC Track Hub - required for MakeHub</li>
<li><b>HUB_EMAIL</b>: Email for your UCSC Track Hub (if you publish your track hub then this email will be public) - required for MakeHub</li>
</ul>


In [None]:
%env ACC=SRR12389274  
%env PLATFORM_FREE=map-ont
%env INDEX_FILE_PATH=$PIPELINE_FILE_PATH/prebuilt_indices/hg38.fa
%env ANNOTATION_FILE_PATH=${PIPELINE_FILE_PATH}/prebuilt_indices/hg38.ensGene.gtf
%env PARTITIONED_INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/prebuilt_indices/hg38.idx
%env CHROMOSOME=chr12 
%env CHROMOSOME_START=116533435 
%env CHROMOSOME_FINISH=116536513
%env REGION_NAME=LINC00173
%env HUB_KEYWORD=lnc173
%env HUB_NAME="293T LINC00173"
%env HUB_EMAIL=tmn2126@columbia.edu

env: ACC=SRR12389274
env: PLATFORM_FREE=map-ont
env: INDEX_FILE_PATH=$PIPELINE_FILE_PATH/prebuilt_indices/hg38.fa
env: ANNOTATION_FILE_PATH=${PIPELINE_FILE_PATH}/prebuilt_indices/hg38.ensGene.gtf
env: PARTITIONED_INDEX_FILE_PATH=${PIPELINE_FILE_PATH}/prebuilt_indices/hg38.idx
env: CHROMOSOME=chr12
env: CHROMOSOME_START=116533435
env: CHROMOSOME_FINISH=116536513
env: REGION_NAME=LINC00173
env: HUB_KEYWORD=lnc173
env: HUB_NAME="293T LINC00173"
env: HUB_EMAIL=tmn2126@columbia.edu
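Before running the rest of the notebook, it can help to confirm that the parameters above are set and mutually consistent. The following is a minimal sketch, not part of the original pipeline: the helper name `check_parameters` and the choice of which variables count as required are assumptions made for illustration.

```python
import os

# Variables treated as required here (an assumption based on the list above).
REQUIRED = ["PIPELINE_FILE_PATH", "ACC", "PLATFORM_FREE", "INDEX_FILE_PATH",
            "ANNOTATION_FILE_PATH", "CHROMOSOME", "CHROMOSOME_START",
            "CHROMOSOME_FINISH"]
VALID_PLATFORMS = {"map-ont", "map-pac"}


def check_parameters(env):
    """Return a list of human-readable problems found in a parameter mapping."""
    problems = [f"{name} is not set" for name in REQUIRED
                if not env.get(name, "").strip()]

    platform = env.get("PLATFORM_FREE", "").strip()
    if platform and platform not in VALID_PLATFORMS:
        problems.append(f"PLATFORM_FREE must be one of {sorted(VALID_PLATFORMS)}, "
                        f"got {platform!r}")

    start = env.get("CHROMOSOME_START", "").strip()
    finish = env.get("CHROMOSOME_FINISH", "").strip()
    if start and finish:
        if not (start.isdigit() and finish.isdigit()):
            problems.append("CHROMOSOME_START and CHROMOSOME_FINISH must be integers")
        elif int(start) >= int(finish):
            problems.append("CHROMOSOME_START must be smaller than CHROMOSOME_FINISH")
    return problems


# In the notebook, check the variables that %env placed in the environment.
for problem in check_parameters(os.environ):
    print("WARNING:", problem)
```

An empty result means no problem was detected by these particular checks; it does not verify that the files themselves exist.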


##Mounting your Google Drive

This step allows for permanent storage of your bioinformatics analysis in Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
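The commands later in this notebook write into subdirectories of `PIPELINE_FILE_PATH` such as `fastq`, `fastqc`, `sam`, `bam`, `transcriptclean`, `featureCounts` and `tama`. If those folders do not exist yet, a sketch like the following could create them once the drive is mounted. The helper name `ensure_dirs` is hypothetical, and the directory list is taken from the paths used in this notebook.

```python
import os

# Output folders referenced by the commands in this notebook.
PIPELINE_DIRS = ["fastq", "fastqc", "sam", "bam",
                 "transcriptclean", "featureCounts", "tama"]


def ensure_dirs(root, names=PIPELINE_DIRS):
    """Create any missing subdirectories of `root`; return the paths created."""
    created = []
    for name in names:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# Only act when the pipeline root is defined and already exists,
# so nothing is created if the drive is not mounted.
root = os.environ.get("PIPELINE_FILE_PATH")
if root and os.path.isdir(root):
    print("Created:", ensure_dirs(root))
```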


##Managing Software via BioConda

BioConda is a channel for the conda package and environment manager, providing access to over 8,000 different software packages related to bioinformatics (documentation: [BioConda](https://bioconda.github.io/user/install.html) and [Managing Environments via Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)).

In [None]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

--2022-02-06 21:18:51--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2022-02-06 21:18:51 (256 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | done
Solving environment: - \ done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0


##Kingfisher: fast and flexible program for procurement of sequence files - installation 

The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [None]:
! git clone https://github.com/MakeTheBrainHappy/kingfisher-download

fatal: destination path 'kingfisher-download' already exists and is not an empty directory.


In [None]:
! conda env update -n base --file kingfisher-download/kingfisher.yml

Collecting package metadata (repodata.json): - \ |

In [None]:
! conda install -c rpetit3 aspera-connect -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - done
Solving environment: | / - \ | / - \ | / - \ done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - aspera-connect


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aspera-connect-3.9.6       |                0        34.3 MB  rpetit3
    ca-certificates-2021.10.26 |       h06a4308_2         115 KB
    certifi-2021.10.8          |   py39h06a4308_2         151 KB
    conda-4.11.0               |   py39h06a4308_0        14.4 MB
    openssl-1.1.1m             |       h7f8727e_0         2.5 MB
    ------------------------------------------------------------
                                           Total:        51.5 MB

The following NEW packages will be INSTALLED:

  aspera-connect     rpetit3/linux-64::

In [None]:
! wget -qO- https://download.asperasoft.com/download/sw/connect/3.9.8/ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.tar.gz | tar xvz

ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh


The following command will print an error message stating that the script cannot be run as root; please disregard it.

In [None]:
! ./ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh


Installing IBM Aspera Connect

This script cannot be run as root, IBM Aspera Connect must be installed per user.


##Kingfisher: fast and flexible program for procurement of sequence files - usage 


The Kingfisher program allows for sequence files to be downloaded from the European Nucleotide Archive (documentation: https://github.com/wwood/kingfisher-download).

In [None]:
! cd $PIPELINE_FILE_PATH/fastq ; /content/kingfisher-download/bin/kingfisher get -r $ACC -m ena-ascp aws-http prefetch

02/06/2022 09:25:07 PM INFO: Kingfisher v0.0.1-dev
02/06/2022 09:25:07 PM INFO: Attempting download method ena-ascp for run SRR12389274 ..
02/06/2022 09:25:07 PM INFO: Using aspera ssh key file: $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh
02/06/2022 09:25:07 PM INFO: Querying ENA for FTP paths for SRR12389274..
02/06/2022 09:25:08 PM INFO: Downloading 1 FTP read set(s): ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/074/SRR12389274/SRR12389274_1.fastq.gz
02/06/2022 09:25:08 PM INFO: Running command: ascp -T -l 300m -P33001 -k 2 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR123/074/SRR12389274/SRR12389274_1.fastq.gz .
02/06/2022 09:36:31 PM INFO: Method ena-ascp worked.
02/06/2022 09:36:31 PM INFO: Output files: ./SRR12389274_1.fastq.gz
02/06/2022 09:36:31 PM INFO: Kingfisher done.


In [None]:
! cd $PIPELINE_FILE_PATH/fastq ; gunzip *.gz

In [None]:
! cd $PIPELINE_FILE_PATH/fastq ; mv ${ACC}_1.fastq $ACC.fastq
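After downloading, decompressing and renaming the reads, a quick sanity check of the FASTQ file can catch a truncated download before the longer alignment steps. The following is a minimal sketch that assumes the common four-line FASTQ layout (header, sequence, `+`, quality); the helper name `fastq_stats` is hypothetical.

```python
import gzip
import os


def fastq_stats(path):
    """Count reads and total bases in a four-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    n_reads = total_bases = 0
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:            # end of file
                break
            if not header.strip():    # tolerate blank lines between records
                continue
            sequence = handle.readline().rstrip("\n")
            plus = handle.readline()
            quality = handle.readline().rstrip("\n")
            if (not header.startswith("@") or not plus.startswith("+")
                    or len(sequence) != len(quality)):
                raise ValueError(f"Malformed FASTQ record near read {n_reads + 1}")
            n_reads += 1
            total_bases += len(sequence)
    return n_reads, total_bases


# Runs only if the renamed FASTQ file from the cells above is present.
fastq = os.path.expandvars("$PIPELINE_FILE_PATH/fastq/$ACC.fastq")
if os.path.isfile(fastq):
    reads, bases = fastq_stats(fastq)
    print(f"{reads} reads, mean length {bases / max(reads, 1):.0f} bases")
```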

##FastQC: A quality control tool for high throughput sequence data - installation

FastQC is a program designed to spot potential problems in high-throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [None]:
! conda install -c bioconda fastqc -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done
Solving environment: \ | / - \ | / - \ | / - \ | / - \ | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - fastqc


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fastqc-0.11.9              |       hdfd78af_1         9.7 MB  bioconda
    font-ttf-dejavu-sans-mono-2.37|       hd3eb1b0_0         335 KB
    fontconfig-2.13.1          |       h6c09931_0         250 KB
    freetype-2.11.0            |       h70c0345_0         618 KB
    libpng-1.6.37              |       hbc83047_0         278 KB
    libuuid-1.0.3              |       h7f8727e_2          17 KB
    openjdk-8.0.152            |       h7b6447c_3        57.4 MB
    per

##FastQC: A quality control tool for high throughput sequence data - usage

FastQC is a program designed to spot potential problems in high-throughput sequencing datasets ([documentation](https://github.com/s-andrews/FastQC)).

In [None]:
! fastqc $PIPELINE_FILE_PATH/fastq/$ACC.fastq --outdir $PIPELINE_FILE_PATH/fastqc

Started analysis of SRR12389274.fastq
Approx 5% complete for SRR12389274.fastq
Approx 10% complete for SRR12389274.fastq
Approx 15% complete for SRR12389274.fastq
Approx 20% complete for SRR12389274.fastq
Approx 25% complete for SRR12389274.fastq
Approx 30% complete for SRR12389274.fastq
Approx 35% complete for SRR12389274.fastq
Approx 40% complete for SRR12389274.fastq
Approx 45% complete for SRR12389274.fastq
Approx 50% complete for SRR12389274.fastq
Approx 55% complete for SRR12389274.fastq
Approx 60% complete for SRR12389274.fastq
Approx 65% complete for SRR12389274.fastq
Approx 70% complete for SRR12389274.fastq
Approx 75% complete for SRR12389274.fastq
Approx 80% complete for SRR12389274.fastq
Approx 85% complete for SRR12389274.fastq
Approx 90% complete for SRR12389274.fastq
Approx 95% complete for SRR12389274.fastq
Analysis complete for SRR12389274.fastq
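FastQC writes an HTML report and a `.zip` archive into the output directory. The archive contains a tab-separated `summary.txt` with one PASS/WARN/FAIL status per analysis module, which can be read without opening the HTML report. The sketch below shows one way to do that; the helper name `read_fastqc_summary` is hypothetical, and the archive path assumes FastQC's usual `<sample>_fastqc.zip` naming.

```python
import os
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        for line in archive.read(name).decode().splitlines():
            fields = line.split("\t")   # status, module, input file name
            if len(fields) >= 2:
                results[fields[1]] = fields[0]
    return results


# Runs only if the FastQC archive for this accession exists.
archive_path = os.path.expandvars("$PIPELINE_FILE_PATH/fastqc/${ACC}_fastqc.zip")
if os.path.isfile(archive_path):
    for module, status in read_fastqc_summary(archive_path).items():
        print(f"{status:5s} {module}")
```

Note that FastQC's thresholds were designed around short-read data, so some WARN or FAIL flags may be expected for long reads.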


##minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (Google Colab Pro required) - installation

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

In [None]:
! conda install -c bioconda minimap2 -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
Solving environment: / - \ | / - \ | / - \ done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - minimap2


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    k8-0.2.5                   |       h9a82719_1         1.7 MB  bioconda
    minimap2-2.17              |       h5bf99c6_4         448 KB  bioconda
    ------------------------------------------------------------
                                           Total:         2.1 MB

The following NEW packages will be INSTALLED:

  k8                 bioconda/linux-64::k8-0.2.5-h9a82719_1
  minimap2           bioconda/linux-64::minimap2-2.17-h5bf99c6_4



Downloading and Extracting Packages
k8-0.2.5         

##minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (Google Colab Pro required) - usage

minimap2 is a long-read sequencing aligner (documentation: https://github.com/lh3/minimap2). 

The free version of Google Colab typically provides virtual machines with about 12 GB of RAM, while aligning against the human genome with minimap2 usually requires more than that. The command below is therefore expected to fail on the free tier. For free-tier users, this pipeline also implements a less memory-intensive version of minimap2 (see the featherweight sections that follow).
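To decide which of the two alignment routes to use, the amount of RAM on the current virtual machine can be checked from Python. This is a minimal sketch: the helper names are hypothetical, and the 12 GB threshold is only the rough figure quoted above, not a measured requirement.

```python
import os


def total_ram_gb():
    """Total physical memory in GiB (relies on os.sysconf, available on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3


def choose_aligner(ram_gb, required_gb=12.0):
    """Suggest standard minimap2 only when RAM exceeds the assumed requirement."""
    return "minimap2" if ram_gb > required_gb else "minimap2-featherweight"


if hasattr(os, "sysconf"):
    ram = total_ram_gb()
    print(f"{ram:.1f} GiB RAM available; suggested route: {choose_aligner(ram)}")
```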

In [None]:
! eval minimap2 -ax splice $INDEX_FILE_PATH $PIPELINE_FILE_PATH/fastq/$ACC.fastq > $PIPELINE_FILE_PATH/sam/$ACC.sam

tcmalloc: large alloc 1073741824 bytes == 0x5565594fa000 @  0x7fcbc6b60162 0x55652f9a55fe 0x55652f996db1 0x55652f9a306d 0x7fcbc65766db 0x7fcbc629f71f
tcmalloc: large alloc 2147483648 bytes == 0x5565b956a000 @  0x7fcbc6b602a4 0x55652f9a557c 0x55652f996db1 0x55652f9a306d 0x7fcbc65766db 0x7fcbc629f71f
^C


##minimap2 featherweight alignment - installation

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8).

In [None]:
! wget https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz
! tar xvf v0.1.tar.gz && cd minimap2-arm-0.1 && make

--2022-02-07 05:43:37--  https://github.com/hasindu2008/minimap2-arm/archive/v0.1.tar.gz
Resolving github.com (github.com)... 13.114.40.48
Connecting to github.com (github.com)|13.114.40.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/hasindu2008/minimap2-arm/tar.gz/v0.1 [following]
--2022-02-07 05:43:38--  https://codeload.github.com/hasindu2008/minimap2-arm/tar.gz/v0.1
Resolving codeload.github.com (codeload.github.com)... 52.193.111.178
Connecting to codeload.github.com (codeload.github.com)|52.193.111.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘v0.1.tar.gz’

v0.1.tar.gz             [ <=>                ] 228.88K  1.42MB/s    in 0.2s    

2022-02-07 05:43:38 (1.42 MB/s) - ‘v0.1.tar.gz’ saved [234373]

minimap2-arm-0.1/
minimap2-arm-0.1/.gitignore
minimap2-arm-0.1/.travis.yml
minimap2-arm-0.1/LICENSE.txt
minimap2-arm-0.1/MANIFEST.in
minimap2-arm

In [None]:
! chmod u+x /content/minimap2-arm-0.1/misc/idxtools/divide_and_index.sh

In [None]:
! apt-get install bc

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  cuda-command-line-tools-10-0 cuda-command-line-tools-10-1
  cuda-command-line-tools-11-0 cuda-compiler-10-0 cuda-compiler-10-1
  cuda-compiler-11-0 cuda-cuobjdump-10-0 cuda-cuobjdump-10-1
  cuda-cuobjdump-11-0 cuda-cupti-10-0 cuda-cupti-10-1 cuda-cupti-11-0
  cuda-cupti-dev-11-0 cuda-documentation-10-0 cuda-documentation-10-1
  cuda-documentation-11-0 cuda-documentation-11-1 cuda-gdb-10-0 cuda-gdb-10-1
  cuda-gdb-11-0 cuda-gpu-library-advisor-10-0 cuda-gpu-library-advisor-10-1
  cuda-libraries-10-0 cuda-libraries-10-1 cuda-libraries-11-0
  cuda-memcheck-10-0 cuda-memcheck-10-1 cuda-memcheck-11-0 cuda-nsight-10-0
  cuda-nsight-10-1 cuda-nsight-11-0 cuda-nsight-11-1 cuda-nsight-compute-10-0
  cuda-nsight-compute-10-1 cuda-nsight-compute-11-0 cuda-nsight-compute-11-1
  cuda-nsight-systems-10-1 cuda-nsight-systems-

##minimap2 featherweight alignment - partitioning 

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8). This command only needs to be run once, as long as the partitioned reference index is not removed from the directory. 

In [None]:
! eval /content/minimap2-arm-0.1/misc/idxtools/divide_and_index.sh $INDEX_FILE_PATH 4 $PARTITIONED_INDEX_FILE_PATH /content/minimap2-arm-0.1/minimap2 $PLATFORM_FREE
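Since partitioning only has to happen once, a small guard can tell you whether the cell above needs to be run at all. The sketch below simply checks whether a non-empty file already exists at the partitioned index path; the helper name `needs_partitioning` is hypothetical, and the check does not validate the index contents.

```python
import os


def needs_partitioning(index_path):
    """Return True when no non-empty partitioned index exists at `index_path`."""
    expanded = os.path.expandvars(index_path)   # resolve $VAR / ${VAR} references
    return not (os.path.isfile(expanded) and os.path.getsize(expanded) > 0)


index_path = os.environ.get("PARTITIONED_INDEX_FILE_PATH")
if index_path:
    if needs_partitioning(index_path):
        print("No partitioned index found; run the partitioning cell.")
    else:
        print("Partitioned index already present; the partitioning cell can be skipped.")
```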

##minimap2 featherweight alignment - usage

The minimap2 featherweight implementation works by splitting the reference genome into four parts with minimal loss in alignment accuracy (documentation: https://doi.org/10.1038/s41598-019-40739-8).

In [None]:
! eval /content/minimap2-arm-0.1/minimap2 -a -x $PLATFORM_FREE $PARTITIONED_INDEX_FILE_PATH $PIPELINE_FILE_PATH/fastq/$ACC.fastq --multi-prefix tmp > $PIPELINE_FILE_PATH/sam/$ACC.sam

[M::main::32.975*0.20] loaded/built the index for 118 target sequence(s)
[M::mm_mapopt_update::34.644*0.23] mid_occ = 348
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 118
[M::mm_idx_stat::35.610*0.25] distinct minimizers: 51411598 (54.83% are singletons); average occurrences: 2.793; average spacing: 5.744
[M::worker_pipeline::620.289*1.74] mapped 436843 sequences
[M::worker_pipeline::1172.145*1.82] mapped 452733 sequences
[M::worker_pipeline::1723.638*1.85] mapped 457006 sequences
[M::worker_pipeline::2272.118*1.87] mapped 458854 sequences
[M::worker_pipeline::2812.104*1.88] mapped 459612 sequences
[M::worker_pipeline::3357.198*1.88] mapped 456625 sequences
[M::worker_pipeline::3898.731*1.89] mapped 456060 sequences
[M::worker_pipeline::4413.726*1.89] mapped 479682 sequences
[M::worker_pipeline::4930.406*1.89] mapped 475736 sequences
[M::worker_pipeline::5456.679*1.90] mapped 476901 sequences
[M::worker_pipeline::5981.720*1.90] mapped 508625 sequences
[M::worker_pipeline:

##samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format - installation

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/).


In [None]:
! conda install -c bioconda samtools -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | done
Solving environment: - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - samtools


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bzip2-1.0.8                |       h7b6447c_0          78 KB
    libgcc-7.2.0               |       h69d50b8_2         269 KB
    samtools-1.7               |                1         1.0 MB  bioconda
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTALLED:

  bzip2              pkgs/main/linux-64::bzip2-1.0.8-h7b6447c_0
  libgcc             pkgs/main/linux-64::libgcc-7.2.0-h69d50b8_2
  sam

##samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format - usage

samtools allows for manipulation of high-throughput sequencing data (documentation: http://www.htslib.org/).


In [None]:
! samtools view -S -b $PIPELINE_FILE_PATH/sam/$ACC.sam > $PIPELINE_FILE_PATH/bam/$ACC.bam 

In [None]:
! samtools sort $PIPELINE_FILE_PATH/bam/$ACC.bam -o $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam  

In [None]:
! samtools index $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam 
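Before moving on to the downstream tools, it is useful to know how many reads actually aligned. The sketch below reads the SAM file produced by minimap2 and classifies each record by its FLAG field (bit `0x4` unmapped, `0x100` secondary, `0x800` supplementary, as defined in the SAM specification). The helper name `sam_alignment_counts` is hypothetical; `samtools flagstat` reports similar figures directly from the BAM file.

```python
import os


def sam_alignment_counts(sam_path):
    """Count SAM records by category using the bitwise FLAG in column 2."""
    counts = {"primary_mapped": 0, "unmapped": 0, "secondary": 0, "supplementary": 0}
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@") or not line.strip():   # skip header and blanks
                continue
            flag = int(line.split("\t", 2)[1])
            if flag & 0x4:
                counts["unmapped"] += 1
            elif flag & 0x100:
                counts["secondary"] += 1
            elif flag & 0x800:
                counts["supplementary"] += 1
            else:
                counts["primary_mapped"] += 1
    return counts


# Runs only if the SAM file for this accession exists.
sam = os.path.expandvars("$PIPELINE_FILE_PATH/sam/$ACC.sam")
if os.path.isfile(sam):
    print(sam_alignment_counts(sam))
```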

##TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - installation

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

In [None]:
! conda install -c bioconda pyfasta pyranges samtools -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | done
Solving environment: - \ | / - \ | / - \ | / - \ | / - \ | / - \ | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - pyfasta
    - pyranges
    - samtools


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    natsort-7.1.1              |     pyhd3eb1b0_0          33 KB
    ncls-0.0.63                |   py39h38f01e4_0         676 KB  bioconda
    pyfasta-0.5.2              |             py_1          17 KB  bioconda
    pyranges-0.0.115           |     pyh5e36f6f_0         668 KB  bioconda
    pyrle-0.0.34               |   py39h38f01e4_0         545 KB  bioconda
    sorted_nearest-0.0.33      |   py39h38f01e4_1         5.5 MB  bioconda
    tabulate-0.8.9             |   py39h06a4308_0          40 KB
   

In [None]:
! wget https://github.com/mortazavilab/TranscriptClean/archive/refs/tags/v2.0.3.tar.gz
! tar xvf v2.0.3.tar.gz 

--2022-02-06 22:15:12--  https://github.com/mortazavilab/TranscriptClean/archive/refs/tags/v2.0.3.tar.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/mortazavilab/TranscriptClean/tar.gz/refs/tags/v2.0.3 [following]
--2022-02-06 22:15:13--  https://codeload.github.com/mortazavilab/TranscriptClean/tar.gz/refs/tags/v2.0.3
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘v2.0.3.tar.gz’

v2.0.3.tar.gz           [              <=>   ] 206.75M  2.04MB/s    in 24s     

2022-02-06 22:15:37 (8.50 MB/s) - ‘v2.0.3.tar.gz’ saved [216793874]

TranscriptClean-2.0.3/
TranscriptClean-2.0.3/.gitignore
TranscriptClean-2.0.3/LICENSE.md
Tra

##TranscriptClean: correct mismatches, microindels, and noncanonical splice junctions - usage

TranscriptClean is a command-line program which corrects long-read mismatches, microindels and noncanonical splice junctions (documentation: https://github.com/mortazavilab/TranscriptClean). 

In [None]:
! eval python /content/TranscriptClean-2.0.3/TranscriptClean.py --sam $PIPELINE_FILE_PATH/sam/$ACC.sam --genome $INDEX_FILE_PATH --outprefix $PIPELINE_FILE_PATH/transcriptclean/$ACC

Traceback (most recent call last):
  File "/content/TranscriptClean-2.0.3/TranscriptClean.py", line 1675, in <module>
    main()
  File "/content/TranscriptClean-2.0.3/TranscriptClean.py", line 38, in main
    validate_chroms(options.refGenome, options.variantFile, sam_chroms)
  File "/content/TranscriptClean-2.0.3/TranscriptClean.py", line 466, in validate_chroms
    genome = Fasta(genome_file)
  File "/usr/local/lib/python3.9/site-packages/pyfasta/fasta.py", line 67, in __init__
    raise FastaNotFound('"' + fasta_name + '"')
pyfasta.fasta.FastaNotFound: "$PIPELINE_FILE_PATH/prebuilt_indices/hg38.fa"
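The traceback above shows the genome path reaching pyfasta with `$PIPELINE_FILE_PATH` still unexpanded, because `%env` stores the text after `=` literally. One possible workaround, sketched below under that assumption, is to expand the nested references once in Python and write the resolved values back into the environment before running the tool. The helper name `resolve_env_path` is hypothetical.

```python
import os


def resolve_env_path(name, max_rounds=10):
    """Return environment variable `name` with nested $VAR / ${VAR} expanded.

    Expansion is repeated a bounded number of times so that a value such as
    ${PIPELINE_FILE_PATH}/prebuilt_indices/hg38.fa becomes an absolute path.
    Unknown variables are left unchanged by os.path.expandvars.
    """
    value = os.environ.get(name, "")
    for _ in range(max_rounds):
        expanded = os.path.expandvars(value)
        if expanded == value:
            break
        value = expanded
    return value


# Rewrite the path variables defined at the top of the notebook, if present.
for name in ("INDEX_FILE_PATH", "ANNOTATION_FILE_PATH", "PARTITIONED_INDEX_FILE_PATH"):
    if name in os.environ:
        os.environ[name] = resolve_env_path(name)
        print(name, "=", os.environ[name],
              "(found)" if os.path.exists(os.environ[name]) else "(not found)")
```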


##featureCounts: an efficient general purpose program for assigning sequence reads to genomic features - installation

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! conda install -c bioconda subread -y

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | done
Solving environment: - \ | / - \ | / - \ | / - \ | / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - subread


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    subread-2.0.1              |       h5bf99c6_1        22.8 MB  bioconda
    ------------------------------------------------------------
                                           Total:        22.8 MB

The following NEW packages will be INSTALLED:

  subread            bioconda/linux-64::subread-2.0.1-h5bf99c6_1



Downloading and Extracting Packages
subread-2.0.1        | 22.8 MB   | : 100% 1.0/1 [00:04<00:00,  4.50s/it]               
Preparing transaction: | done
Verifying transaction: - \ done
Executing transaction: / done


##featureCounts: an efficient general purpose program for assigning sequence reads to genomic features - usage

featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments (documentation: http://subread.sourceforge.net/). 

In [None]:
! eval featureCounts -M -O -T 24 -a $ANNOTATION_FILE_PATH -t exon -g gene_id -o $PIPELINE_FILE_PATH/featureCounts/$ACC.txt $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v2.0.1

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o SRR12389274.sorted.bam                         ||
||                                                                            ||
||             Output file : SRR12389274.txt                                  ||
||                 Summary : SRR12389274.txt.summary                          ||
||              Annotation : hg38.ensGene.gtf (GTF)
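The featureCounts output is a tab-separated table: a comment line starting with `#`, a header row (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, then one column per BAM file), and one row per gene. The sketch below lists the genes with the most assigned reads; the helper name `top_genes` is hypothetical, and it assumes a single BAM file so that the counts sit in the last column.

```python
import csv
import os


def top_genes(counts_path, n=10):
    """Return the `n` genes with the highest read counts as (gene_id, count) pairs."""
    with open(counts_path, newline="") as handle:
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.reader(rows, delimiter="\t")
        header = next(reader)
        count_col = len(header) - 1   # single-sample run: counts are the last column
        genes = [(row[0], int(row[count_col])) for row in reader if row]
    return sorted(genes, key=lambda gene: gene[1], reverse=True)[:n]


# Runs only if the featureCounts table for this accession exists.
counts_file = os.path.expandvars("$PIPELINE_FILE_PATH/featureCounts/$ACC.txt")
if os.path.isfile(counts_file):
    for gene_id, count in top_genes(counts_file):
        print(gene_id, count)
```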

##TAMA: Transcriptome Annotation by Modular Algorithms (Google Colab Pro required) - installation



TAMA allows for the construction of a polished transcriptome by collapsing redundant mapped reads into individual transcript models (documentation: https://github.com/GenomeRIK/tama/wiki). 

In [None]:
! git clone https://github.com/MakeTheBrainHappy/tama

Cloning into 'tama'...
remote: Enumerating objects: 438, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 438 (delta 14), reused 16 (delta 5), pack-reused 409
Receiving objects: 100% (438/438), 673.46 KiB | 16.84 MiB/s, done.
Resolving deltas: 100% (252/252), done.


##TAMA: Transcriptome Annotation by Modular Algorithms (Google Colab Pro required) - usage

TAMA allows for the construction of a polished transcriptome by collapsing redundant mapped reads into individual transcript models (documentation: https://github.com/GenomeRIK/tama/wiki). 

In [None]:
! eval python tama/tama_collapse.py -s $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam -f $INDEX_FILE_PATH -p $PIPELINE_FILE_PATH/tama/$ACC.polished -b BAM -rm low_mem

tc_version_date_2020_12_14
Default capped flag will be used: Capped
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default 5 prime threshold: 10
Default exon/splice junction threshold: 10
Default 3 prime threshold: 10
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using BAM format for reading in.
Default log output on
Default 5 read threshold
time taken since last check:	0.0:0.0:0.0
time taken since beginning:	0.0:0.0:0.0
going through fasta
tcmalloc: large alloc 1991655424 bytes == 0x55db8685e000 @  0x7f9f8a76e1e7 0x55db42f3f00b 0x55db42f8a746 0x55db42fe5417 0x55db42f95d4c 0x55db4304745c 0x55db42f96ddb 0x55db430474bb 0x55db43080de5 0x55db42f2a59e 0x55db430854e9 0x55db43085a76 0x55db4

##svist4get: a simple visualization tool for genomic tracks from sequencing experiments - installation

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [None]:
! sudo apt-get install bedtools

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  cuda-command-line-tools-10-0 cuda-command-line-tools-10-1
  cuda-command-line-tools-11-0 cuda-compiler-10-0 cuda-compiler-10-1
  cuda-compiler-11-0 cuda-cuobjdump-10-0 cuda-cuobjdump-10-1
  cuda-cuobjdump-11-0 cuda-cupti-10-0 cuda-cupti-10-1 cuda-cupti-11-0
  cuda-cupti-dev-11-0 cuda-documentation-10-0 cuda-documentation-10-1
  cuda-documentation-11-0 cuda-documentation-11-1 cuda-gdb-10-0 cuda-gdb-10-1
  cuda-gdb-11-0 cuda-gpu-library-advisor-10-0 cuda-gpu-library-advisor-10-1
  cuda-libraries-10-0 cuda-libraries-10-1 cuda-libraries-11-0
  cuda-memcheck-10-0 cuda-memcheck-10-1 cuda-memcheck-11-0 cuda-nsight-10-0
  cuda-nsight-10-1 cuda-nsight-11-0 cuda-nsight-11-1 cuda-nsight-compute-10-0
  cuda-nsight-compute-10-1 cuda-nsight-compute-11-0 cuda-nsight-compute-11-1
  cuda-nsight-systems-10-1 cuda-nsight-systems-

In [None]:
! apt-get update

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease

In [None]:
! apt-get install libmagickwand-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done

In [None]:
! cp -r $PIPELINE_FILE_PATH/svist4get/policy_revised.xml /etc/ImageMagick-6/policy.xml

cp: cannot create regular file '/etc/ImageMagick-6/policy.xml': No such file or directory
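
The `cp` error above says the destination directory `/etc/ImageMagick-6` does not exist. That may mean ImageMagick 6 is not installed, or that its configuration lives elsewhere on this machine. A minimal sketch of a copy step that checks for the directory first is shown below. `install_policy` is a hypothetical helper, not part of the pipeline, and the demonstration runs in a temporary directory rather than `/etc`.

```python
import shutil
import tempfile
from pathlib import Path


def install_policy(src, dest_dir, create=False):
    """Copy src to dest_dir/policy.xml and return the destination path.

    Raises FileNotFoundError with a specific message when the source file
    or the destination directory is missing (unless create=True).
    """
    src, dest_dir = Path(src), Path(dest_dir)
    if not src.is_file():
        raise FileNotFoundError(f"policy source not found: {src}")
    if not dest_dir.is_dir():
        if not create:
            raise FileNotFoundError(f"destination directory missing: {dest_dir}")
        dest_dir.mkdir(parents=True)
    return Path(shutil.copyfile(src, dest_dir / "policy.xml"))


# Demonstration in a throwaway directory (stand-in for /etc/ImageMagick-6).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "policy_revised.xml"
    src.write_text("<policymap></policymap>\n")
    try:
        install_policy(src, Path(tmp) / "ImageMagick-6")
    except FileNotFoundError as err:
        print("refused:", err)
    dest = install_policy(src, Path(tmp) / "ImageMagick-6", create=True)
    print("copied to", dest.name)
```

Creating the directory only makes the copy succeed. Whether ImageMagick actually reads the policy from that location depends on how it was installed.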


In [None]:
! python3 -m pip install svist4get

Collecting svist4get
  Downloading svist4get-1.2.24-py3-none-any.whl (5.8 MB)
Collecting wand
  Downloading Wand-0.6.7-py2.py3-none-any.whl (139 kB)
Collecting configs
  Downloading configs-3.0.3-py3-none-any.whl (7.1 kB)
Collecting Pybedtools
  Downloading pybedtools-0.9.0.tar.gz (12.5 MB)
Collecting reportlab
  Downloading reportlab-3.6.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Collecting statistics
  Downloading statistics-1.0.3.5.tar.gz (8.3 kB)
Collecting argparse
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Collecting biopython
  Downloading biopython-1.79-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)


##svist4get: a simple visualization tool for genomic tracks from sequencing experiments - usage

svist4get allows you to view read coverage at a defined region on a chromosome (documentation: https://bitbucket.org/artegorov/svist4get/src/master/)

In [None]:
! bedtools genomecov -ibam $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam -bg > $PIPELINE_FILE_PATH/bed/$ACC.sorted.bedgraph
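
The bedGraph written by `bedtools genomecov -bg` has four whitespace-separated columns (chromosome, 0-based start, end, depth), and `-bg` omits runs with zero coverage. As a rough sanity check before plotting, the sketch below computes the mean depth over a window. `mean_coverage` and the in-memory example data are hypothetical and not part of the pipeline.

```python
import io


def mean_coverage(bedgraph_lines, chrom, start, end):
    """Mean per-base depth over [start, end) on chrom from bedGraph lines.

    bedGraph intervals are 0-based and half-open. Bases not covered by any
    interval are counted as depth 0.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    total = 0.0
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != chrom:
            continue  # skip blank/header lines and other chromosomes
        iv_start, iv_end, depth = int(fields[1]), int(fields[2]), float(fields[3])
        overlap = min(iv_end, end) - max(iv_start, start)
        if overlap > 0:
            total += overlap * depth
    return total / (end - start)


# Tiny in-memory example standing in for $ACC.sorted.bedgraph.
example = io.StringIO(
    "chr1\t0\t10\t2\n"
    "chr1\t10\t20\t4\n"
    "chr2\t0\t10\t100\n"
)
print(mean_coverage(example, "chr1", 5, 15))  # 3.0
```

For a real file, pass an open file handle, for example `open(path)`, in place of the `StringIO` object. A mean of 0.0 over the window of interest suggests the chromosome name or coordinates do not match the alignment.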

In [None]:
! eval svist4get -bg $PIPELINE_FILE_PATH/bed/$ACC.sorted.bedgraph -gtf $ANNOTATION_FILE_PATH -fa $INDEX_FILE_PATH -bl '"Long-Read Coverage"' -w $CHROMOSOME $CHROMOSOME_START $CHROMOSOME_FINISH -it "$REGION_NAME" -o $PIPELINE_FILE_PATH/svist4get/$ACC

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/wand/api.py", line 151, in <module>
    libraries = load_library()
  File "/usr/local/lib/python3.9/site-packages/wand/api.py", line 140, in load_library
    raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
OSError: cannot find library; tried paths: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/svist4get", line 5, in <module>
    import svist4get.data_processing as data_processing
  File "/usr/local/lib/python3.9/site-packages/svist4get/__init__.py", line 1, in <module>
    from .data_processing import *
  File "/usr/local/lib/python3.9/site-packages/svist4get/data_processing.py", line 7, in <module>
    import svist4get.methods as methods
  File "/usr/local/lib/python3.9/site-packages/svist4get/methods.py", line 4, in <module>
    from wand.image import Image, Color
  File "/usr/local/lib/python3.9

##AlignQC: Long read alignment analysis - installation

AlignQC generates reports on sequence alignments, covering mappability versus read size, error patterns, annotations, and rarefaction curve analysis (documentation: https://github.com/jason-weirather/AlignQC/wiki).

In [None]:
! conda create -n alignqc -c vacation alignqc

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

CondaError: KeyboardInterrupt



##AlignQC: Long read alignment analysis - usage

AlignQC generates reports on sequence alignments, covering mappability versus read size, error patterns, annotations, and rarefaction curve analysis (documentation: https://github.com/jason-weirather/AlignQC/wiki).

In [None]:
! source activate alignqc && eval alignqc analyze $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam --no_genome -t $ANNOTATION_FILE_PATH.gz -o $PIPELINE_FILE_PATH/alignqc/$ACC.xhtml

Could not find conda environment: alignqc
You can list all discoverable environments with `conda info --envs`.



##MakeHub: Fully automated generation of UCSC assembly hubs - installation

MakeHub is a command-line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC Genome Browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! python3.7 -m pip install biopython

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/pip/__main__.py", line 29, in <module>
    from pip._internal.cli.main import main as _main
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/main.py", line 9, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/autocompletion.py", line 10, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/main_parser.py", line 8, in <module>
    from pip._internal.cli import cmdoptions
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/cmdoptions.py", line 21, in <module>
    from pip._vendor.packaging.utils import canonicalize_name
  

In [None]:
! sudo apt install samtools

Reading package lists... Done
Building dependency tree       
Reading state information... Done

In [None]:
! sudo apt install augustus augustus-data augustus-doc

Reading package lists... Done
Building dependency tree       
Reading state information... Done

##MakeHub: Fully automated generation of UCSC assembly hubs - usage

MakeHub is a command-line tool for the fully automatic generation of track data hubs for visualizing genomes with the UCSC Genome Browser (documentation: https://github.com/Gaius-Augustus/MakeHub).

In [None]:
! chmod 755 $PIPELINE_FILE_PATH/makehub/make_hub.py

In [None]:
! eval $PIPELINE_FILE_PATH/makehub/make_hub.py -l $HUB_KEYWORD -L $HUB_NAME -g $INDEX_FILE_PATH -e \
  $HUB_EMAIL -a $ANNOTATION_FILE_PATH -b $PIPELINE_FILE_PATH/bam/$ACC.sorted.bam -o $PIPELINE_FILE_PATH/makehub/

usage: make_hub.py
       [-h]
       [-p]
       [-e EMAIL]
       [-g GENOME]
       [-L LONG_LABEL]
       [-l SHORT_LABEL]
       [-b BAM [BAM ...]]
       [-c CORES]
       [-d]
       [-E GEMOMA_FILTERED_PREDICTIONS]
       [-X BRAKER_OUT_DIR]
       [-M MAKER_GFF]
       [-I GLIMMER_GFF]
       [-S SNAP_GFF]
       [-a ANNOT]
       [-G GENE_TRACK [GENE_TRACK ...]]
       [-A]
       [-o OUTDIR]
       [-n]
       [-s SAMTOOLS_PATH]
       [-B BAM2WIG_PATH]
       [-i HINTS]
       [-t TRAINGENES]
       [-m GENEMARK]
       [-w AUG_AB_INITIO]
       [-x AUG_HINTS]
       [-y AUG_AB_INITIO_UTR]
       [-z AUG_HINTS_UTR]
       [-N LATIN_NAME]
       [-V ASSEMBLY_VERSION]
       [-r]
       [-P]
       [-u VERBOSITY]
       [-v]
make_hub.py: error: unrecognized arguments: LINC00173"
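
The stray `LINC00173"` in the error above is consistent with a quoted, multi-word value being split on whitespace before it reached `make_hub.py`. The actual value of `HUB_NAME` is not shown here, so the sketch below uses a hypothetical one to illustrate the difference between naive splitting and shell-style parsing. It also shows one way to build an argument list that avoids shell quoting altogether, using only flags listed in the usage message. `make_hub_argv` is a hypothetical helper.

```python
import shlex

# Hypothetical value: the real HUB_NAME is not shown in this notebook.
hub_name = '"lncRNA LINC00173"'

# Plain whitespace splitting treats the quote characters as ordinary text.
naive = f"-L {hub_name}".split()
print(naive)  # ['-L', '"lncRNA', 'LINC00173"']

# Shell-style parsing keeps the quoted label together and drops the quotes.
shell_like = shlex.split(f"-L {hub_name}")
print(shell_like)  # ['-L', 'lncRNA LINC00173']


def make_hub_argv(script, short_label, long_label, genome, email, annot, bam, outdir):
    """Build an argument list so each value stays a single argument."""
    return [
        script,
        "-l", short_label,
        "-L", long_label.strip('"'),  # drop literal quotes, keep inner spaces
        "-g", genome,
        "-e", email,
        "-a", annot,
        "-b", bam,
        "-o", outdir,
    ]
```

A list built this way can be passed to `subprocess.run(argv, check=True)`, which hands each element to the program unchanged, so labels containing spaces need no extra quoting.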


##MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report - installation



MultiQC aggregates the reports produced for any number of samples into a single report ([documentation](https://multiqc.info/docs/)).

In [None]:
! pip install multiqc

Collecting multiqc
  Downloading multiqc-1.11-py3-none-any.whl (1.2 MB)
[?25l[K     |▎                               | 10 kB 20.4 MB/s eta 0:00:01[K     |▌                               | 20 kB 22.3 MB/s eta 0:00:01[K     |▉                               | 30 kB 26.6 MB/s eta 0:00:01[K     |█                               | 40 kB 28.7 MB/s eta 0:00:01[K     |█▍                              | 51 kB 29.0 MB/s eta 0:00:01[K     |█▋                              | 61 kB 26.2 MB/s eta 0:00:01[K     |██                              | 71 kB 21.9 MB/s eta 0:00:01[K     |██▏                             | 81 kB 23.5 MB/s eta 0:00:01[K     |██▌                             | 92 kB 24.7 MB/s eta 0:00:01[K     |██▊                             | 102 kB 23.3 MB/s eta 0:00:01[K     |███                             | 112 kB 23.3 MB/s eta 0:00:01[K     |███▎                            | 122 kB 23.3 MB/s eta 0:00:01[K     |███▋                            | 133 kB 23.3 MB/s eta 0:

##MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report - usage

MultiQC aggregates the reports produced for any number of samples into a single report ([documentation](https://multiqc.info/docs/)).

In [None]:
! multiqc $PIPELINE_FILE_PATH -o $PIPELINE_FILE_PATH/multiqc

/bin/bash: multiqc: command not found
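
Several cells above fail because a tool is not installed or not on `PATH` (here, `multiqc: command not found`). A small pre-flight check run before the usage cells can surface this early. The sketch below assumes the executable names used in this notebook, and `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(names):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Executable names as invoked in the cells above.
required = ["bedtools", "svist4get", "alignqc", "samtools", "multiqc"]
missing = missing_tools(required)
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```

Tools installed into a conda environment, as attempted for `alignqc`, will only be reported as present if that environment is active in the process running the check.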
