ShalekLab/tcrgo

Identification and quantification of Seq-Well TCR transcripts and their recovered CDR3 sequences.

Description

The computational Seq-Well TCR alignment pipeline identifies and quantifies the TCR regions associated with each cell barcode in sequencing data produced by the TCR enrichment protocol described in Tu et al. 2019. We have reengineered the Seq-Well TCR recovery pipeline originally presented by Tu et al. to significantly enhance the user and developer experience while delivering comparable results. Chiefly, we have improved the performance, reporting, tunability, portability, and maintainability of the computational pipeline in a new Python program, TCRGO.

The pipeline consists of four main steps: preprocessing and alignment, filtering using alignment information, CDR3 recovery, and summarization of the data.

The preprocessing and alignment step consists of conversion to single-end BAM format, barcode tagging, read trimming, multimap alignment to TRA and TRB gene segments, and barcode error correction. Multimap alignment to the V, J, and C TRA and TRB gene segments is performed with Bowtie2, while all other preprocessing tasks are handled by trusted tools such as Samtools, Picard, and the Picard extension Drop-Seq Tools. The filtering step iterates over the alignment data and distributes reads that multimap to both a V and a J segment into lists that can be processed in parallel by the CDR3 recovery step. The CDR3 recovery step quantifies the alignment identity data for each unique transcript and determines that transcript's most likely CDR3 sequence using a greedy, optimized, heuristic-driven algorithm. The alignment identity and CDR3 information is gathered and tabulated during the summary step.

The filtering, recovery, and analysis steps, originally scripted in Bourne Again Shell and MATLAB, have been refactored in Python to afford better maintainability of the codebase, easier troubleshooting through improved reporting, and boosted performance by harnessing highly efficient and powerful packages such as pysam, Biopython, NumPy, and pandas. Furthermore, the pipeline has been fitted for cloud computing by wrapping it in Docker and the Workflow Description Language. Cloud deployability increases accessibility for scientists on cloud research platforms such as Terra and enables parallelization of the pipeline's data processing jobs.
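For orientation, here is a hedged end-to-end sketch of those four steps as local commands; the sample name, reference path, and intermediate file locations are illustrative and assume the default output paths listed under List of inputs (see the per-step sections below for details):

# End-to-end sketch (illustrative paths; "mysample" is a placeholder sample name)
python -m alignment -d path/to/dropseq.jar -p path/to/picard.jar -f path/to/VJ_reference.fasta -b mysample path/to/raw.bam
python -m filter_queries -lt 5 -gt 1000 out/alignment/mysample_repairedsorted.bam
python -m recover_cdr3s -f path/to/VJ_reference.fasta -mf 0.3 -mc 5 -q out/queries/queries1.tsv out/alignment/mysample_repairedsorted.bam
python -m summary -i out/cdr3/ --collapse ALL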

Requirements

Operating system: macOS (tested), Linux (untested), or Windows (untested).
Hardware: 16 GB of RAM and at least 64 GB of free disk space are recommended.
Software: Python ≈ 3.8.5, pysam ≈ 0.16.0.1, pandas ≈ 1.1.2, bowtie2 >= 2.4.1, samtools ≈ 1.3.1, python-igraph ≈ 0.8.3, biopython = 1.77, Java 12.0.1 (recommended), Drop-Seq Tools >= 2.4.0

Instructions for running on the cloud

This program has been wrapped as a Workflow Description Language script, TCRGO.wdl. The latest version of the workflow can be found on Dockstore, where it can be exported to an analysis platform of your choice. The Docker image itself is hosted on Docker Hub.

On Terra, samples can be scattered (processed in parallel) using the Terra Data Table.

Instructions for a local installation

Please see the Hardware requirements before considering a local installation.

Installing

It's strongly recommended that you create a new virtual environment. Example using conda on a Unix-like system:

conda config --add channels conda-forge --add channels bioconda \
&& conda create -y -n tcrgo python=3.8.5 samtools=1.3.1 bowtie2=2.4.1 pysam=0.16.0.1 pandas=1.1.2 python-igraph=0.8.3 biopython=1.77 \
&& conda activate tcrgo

Then git clone this repository to your machine.
For alignment, you must install Java and git clone Drop-Seq Tools by the Broad Institute. The Drop-Seq Tools and Picard jar files must be passed as arguments to the alignment script.
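For reference, a minimal setup might look like the following sketch; the jar paths are placeholders that depend on where you installed Drop-Seq Tools and Picard.

# Setup sketch (jar locations are placeholders; adjust to your installation)
git clone https://github.com/ShalekLab/tcrgo.git
cd tcrgo
DROPSEQ_JAR=path/to/dropseq.jar   # from your Drop-Seq Tools installation
PICARD_JAR=path/to/picard.jar     # the Picard jar bundled with Drop-Seq Tools
# These jar paths are then passed to alignment.py via the -d and -p arguments (see Running below).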

Running

Change the present working directory (cd on Unix-like systems) to the repository root folder.
For help with any of the scripts, run it with just the -h flag. Ex: python -m alignment -h
On a Unix-like system, if there are spaces in any of your pathnames, please escape them with a backslash (\). Example: /path/to/my data/my sample.bam becomes /path/to/my\ data/my\ sample.bam.

See the bottom section for a full list of inputs.

Preprocess and align

Call alignment.py, a wrapper for Drop-Seq Tools and bowtie2, to tag, trim, align, and barcode-repair a raw BAM or single-end FASTQ:
python -m alignment -d path/to/dropseq.jar -p path/to/picard.jar -f path/to/VJ_reference.fasta -b mysamplename path/to/raw.bam
Or first convert paired-end FASTQs to single-end BAM then preprocess and align:
python -m alignment -d path/to/dropseq.jar -p path/to/picard.jar -f path/to/VJ_reference.fasta -b mysamplename path/to/barcode.fastq.gz,path/to/biological.fastq.gz

Here is an overview of the preprocessing/alignment pipeline:

Note: `<basename>` is supplied via the `--basename` argument.
  0. Input Data -> Unmapped BAM, starting from any of these substeps depending on the input:
    a. Raw BAM (`<basename>.bam`)
    b. Single-end FASTQ (`<basename>.fastq`)
    c. Paired-end FASTQs (`<basename>_R1.fastq`,`<basename>_TCR.fastq`)
  1. Unmapped BAM (`<basename>_unmapped.bam`) -> aligned and tagged BAM
    a. Tag cell barcodes (`<basename>_celltagged.bam`)
    b. Tag molecular barcodes (`<basename>_idtagged.bam`)
    c. Filter out reads with low-quality cell/molecular barcodes (`<basename>_filtered.bam`)
    d. Trim 5’ primer sequence (`<basename>_adaptertrimmed.bam`)
    e. Trim 3’ polyA sequence (`<basename>_trimmed.bam`)
    f. SAM -> FASTQ (`<basename>_trimmed.fastq`)
    g. Bowtie2 alignment (`<basename>_aligned.sam`)
    h. Sort the alignment in queryname order (`<basename>_alignedsorted.bam`)
    i. Merge the alignment with the tagged unmapped BAM to recover cell/molecular barcodes (`<basename>_merged.bam`)
    j. Add gene/exon and other annotation tags (`<basename>_exontagged.bam`)
    k. Barcode repair
      i. Repair substitution errors (`<basename>_synthrepaired.bam`)
      ii. Repair indel errors (`<basename>_repaired.bam`)
      iii. Sort by coordinates (`<basename>_repairedsorted.bam`)
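As an optional sanity check (not part of the pipeline), you can confirm that the final BAM carries cell barcode and UMI tags before filtering; this sketch assumes the default tag names (CR and RX) listed for filter_queries under List of inputs.

# Sanity-check sketch: look for CR (cell barcode) and RX (UMI) tags in the repaired, sorted BAM
samtools view path/to/mysample_repairedsorted.bam | head -n 5 | grep -oE '(CR|RX):Z:[A-Z]+'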

Find and distribute VJ-mapped queries

Call filter_queries.py to write a list of BAM queries (reads) which contain alignments to V and J subregions.
python -m filter_queries -lt 5 -gt 1000 /path/to/alignment.bam
where alignment.bam is a BAM produced by the alignment workflow.
The -w flag can be used to divide the list of queries into that many files for processing by workers on an HPC or in the cloud.
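For example, to split the query list into four files for four downstream workers (the flag values here are illustrative):

# Example: write the filtered queries into 4 files (producing numbered query lists such as queries1.tsv)
python -m filter_queries -lt 5 -gt 1000 -w 4 /path/to/alignment.bam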

Determine consensus VJ pairs and CDR3 sequences per UMI

Call recover_cdr3s.py to find the top-scoring V and J alignments per read, the VJ pair alignment consensus per UMI, and the consensus CDR3 sequence per UMI.
python -m recover_cdr3s -f path/to/VJ_reference.fasta -mf 0.3 -mc 5 -q out/queries1.tsv /path/to/alignment.bam
where alignment.bam is a BAM produced by the alignment workflow.

The --cdr3-positions-file argument can be used to input a 2-column TSV file containing the names of FASTA TR(A|B)(V|J) segments and that entry's corresponding CDR3 start or ending codon position in the nucleotide sequence. Each entry name in the CDR3 positions file must exactly match its respective entry in the FASTA. If the argument is not specified, the program will attempt to identify the most likely CDR3 start and end positions.
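For instance, a run that supplies a zero-indexed CDR3 positions file might look like this (the positions file path is a placeholder):

# Sketch: recover CDR3s using a user-supplied, zero-indexed CDR3 positions TSV
python -m recover_cdr3s -f path/to/VJ_reference.fasta --cdr3-positions-file path/to/cdr3_positions.tsv --zero-indexed -mf 0.3 -mc 5 -q out/queries1.tsv /path/to/alignment.bam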

Aggregate statistics from recover_cdr3s

Call summary.py to combine all cdr3_info<w>.tsv files (one per worker) produced by recover_cdr3s.py into one large TSV file. The --collapse mode will attempt to collapse UMIs by their cell and molecular barcodes; this is an intensive process that can take a considerable amount of time.
python -m summary -i path/to/cdr3_info_folder/ --collapse ALL
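As a variation (flag values here are illustrative), the collapsing threshold can be raised and the BC/UMI strings kept as row indices:

# Sketch: stricter collapsing threshold, with BC/UMI strings as indices in the output
python -m summary -i path/to/cdr3_info_folder/ --collapse ALL --threshold 20 --string-index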

List of inputs

| Task | Workflow Description Language | Python command line | Description |
| --- | --- | --- | --- |
| Global | Boolean run_alignment = true | N/A | Whether or not to run the alignment task. If false, sequence_data must be the address of the repairedsorted BAM from alignment, written as an array of length 1: ["gs://path/to/repairedsorted.bam"]. |
| Global | String sample_name | [See --basename] | Becomes the prefix for many output files so as to differentiate them from the outputs of other samples. |
| Global | Array[File] sequence_data | [See data, bam] | An array of length 1 or 2 containing the addresses of the sequence data. Options include ["gs://path/to/repairedsorted.bam"], ["gs://path/to/singleend.fastq"], or ["gs://path/to/barcode.fastq","gs://path/to/biological.fastq"]. |
| Global | File fasta | [See --fasta] | The reference FASTA file containing entries for the TRA/TRB V, J, and C gene segments. |
| Global | String docker = "shaleklab/tcrgo:latest" | N/A | The address of the container image used to run the program. |
| Global | String zones = "us-central1-a" ... | N/A | The Google Cloud zone preferences for hosting and running the program. |
| Global | Int preemptible = 2 | N/A | The number of attempts to run the task on a preemptible VM, which is ~5 times cheaper but can be restarted at any time. If exceeded, the task falls back to a non-preemptible VM. |
| All | N/A | --verbosity, -v = "INFO" | The scope of verbosity to which the log will output. See the Python logging module for more details. Possible options include: "INFO", "VERBOSE", "WARNING", "SUCCESS", "ERROR". "VERBOSE" may be useful in debugging. |
| alignment | Array[File] data | data (positional argument) | Path to a single-end BAM/FASTQ, or comma-delimited paths to the paired barcode and biological FASTQs. |
| alignment | N/A (/software/dropseq/jar/dropseq.jar) | --dropseq, -d | Path to the Drop-Seq Tools jar file. Version 2.3.0 or 2.4.0 is recommended. |
| alignment | N/A (/software/dropseq/3rdParty/picard/picard.jar) | --picard, -p | Path to the Picard jar file. Use a version of Picard that is bundled with, or compatible with, your version of Drop-Seq Tools. |
| alignment | File fasta | --fasta, -f | The reference FASTA file containing entries for the TRA/TRB V, J, and C gene segments. |
| alignment | String sample_name | --basename, -b = None | The prefix affixed to all output files. If not specified, it is derived from the input filename using Python's os module. |
| alignment | N/A (/cromwell_root/out/) | --output-path, -o = "./out/alignment/" | The path to which output files will be written. |
| alignment | Int number_cpu_threads = 4 | N/A | The number of CPUs to request for the task. |
| alignment | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
| alignment | String disks = "local-disk 256 HDD" | N/A | The amount of disk space in GB to request for the task. |
| alignment | Int boot_disk_size_GB = 15 | N/A | The size of the boot disk in GB to request for the task. |
| alignment | Boolean filter_barcodes = true | N/A | Whether to run the step that filters barcodes based on quality score. |
| filter_queries | File bam | bam (positional argument) | Path to the single-end tagged, trimmed, aligned, repaired BAM. IMPORTANT: it is strongly recommended you use the alignment script to create this BAM. |
| filter_queries | Int minimum_reads = 5 | --minimum-reads, -lt = 5 | UMIs with fewer than the minimum number of reads will be filtered out. |
| filter_queries | Int maximum_reads = 1000 | --maximum-reads, -gt = 1000 | UMIs with more than the maximum number of reads will have their reads randomly subsampled down to the maximum. |
| filter_queries | Int seed = 2020 | --seed, -s = 2020 | The seed for pseudo-random subsampling of reads for UMIs that exceed the maximum number of reads. |
| filter_queries | Int workers | --workers, -w = 1 | Divide the list of filtered reads into this many files for processing by multiple HPC tasks or VMs. |
| filter_queries | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
| filter_queries | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
| filter_queries | String disks = "local-disk 128 HDD" | N/A | The amount of disk space in GB to request for the task. |
| filter_queries | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk in GB to request for the task. |
| filter_queries | N/A | --umi-tag, -u = "RX" | The BAM tag that holds the unique molecular barcode (UMI) sequence. |
| filter_queries | N/A | --barcode-tag, -b = "CR" | The BAM tag that holds the cell barcode sequence. |
| filter_queries | N/A (/cromwell_root/out/) | --output-path, -o = "./out/queries/" | The path to which output files will be written. |
| recover_cdr3s | File bam | bam (positional argument) | Path to the sorted, single-end BAM. |
| recover_cdr3s | File fasta | --fasta, -f | Path to the reference FASTA containing the TRA/TRB V, J, and C gene segments. |
| recover_cdr3s | File? cdr3_positions | --cdr3-positions-file = None | Path to a two-column TSV containing reference names and the start V or end J base position. NOTE: if these positions are zero-indexed rather than one-indexed, also specify the --zero-indexed flag! If not provided, the program attempts to identify the best ORF and positions. |
| recover_cdr3s | Boolean is_zero_indexed = false | --zero-indexed = False | If this and the CDR3 positions file are specified, the positions entered are treated as indices from 0 to length(seq)-1. |
| recover_cdr3s | Int workers = 8 | --workers = 1 | The ID number of the worker, which will read the corresponding queries file. |
| recover_cdr3s | File query_list | --query-list, -q = None | The path to the query list file, output by filter_queries, to read. |
| recover_cdr3s | Float minimum_frequency = 0.3 | --minimum-frequency, -mf = 0.0 | UMIs whose top VJ combination frequency is below the minimum frequency will not undergo CDR3 sequence reconstruction. |
| recover_cdr3s | Int minimum_cdr3s = 5 | --minimum-cdr3s = 1 | UMIs with fewer than the minimum number of reads with recovered CDR3s will not have their CDR3 statistics reported. |
| recover_cdr3s | Boolean exclude_relatives = false | --exclude-relatives, -r = False | When identifying the VJ combination for a UMI, use for CDR3 recovery only the reads whose top VJ exactly matches the highest-counted segment variants. If not enabled, all reads matching the highest-counted segment's root are used (ex: TRAV24-1 is the top V for the UMI, but TRAV24-2 reads are also gathered for CDR3 recovery). Regardless, the top segment's specific variant is reported in the output information. |
| recover_cdr3s | N/A (/cromwell_root/out/) | --output-path, -o = "./out/cdr3/" | The path to which output files will be written. |
| recover_cdr3s | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
| recover_cdr3s | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
| recover_cdr3s | String disks = "local-disk 128 HDD" | N/A | The amount of disk space in GB to request for the task. |
| recover_cdr3s | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk in GB to request for the task. |
| summary | Array[File] cdr3_infos | [See --workers] | An array of CDR3 info files output by recover_cdr3s. |
| summary | Array[File] tiebreaks_alignments | N/A | An array of alignment tiebreak files output by recover_cdr3s. |
| summary | String sample_name | N/A | The prefix which distinguishes the output files of this pipeline run from the outputs of other samples. |
| summary | Boolean collapse = true | --collapse, -c = False | Perform BCUMI collapsing. BCUMIs are first binned by top VJ combination, then subgrouped into those at or above the threshold (collapsers) and those below it (collapsees). For each collapser BCUMI, Hamming distance (HD) comparisons are made against each collapsee BCUMI. If HD == 1, CDR3 sequences of equal length are then compared; if their HD <= 1, the collapsee is collapsed into the collapser with the highest number of reads (or number of edges if there is a tie). aggrcollapsed_cdr3_info.tsv is created. |
| summary | Int threshold = 10 | --threshold, -t = 10 | When BCUMI collapsing, BCUMIs with reads greater than or equal to this threshold are considered collapsers and the rest collapsees. Collapsees are merged into collapsers if the collapsing criteria are met. A lower number may result in more collapsing, while a higher number is stricter and results in less collapsing. |
| summary | Boolean plot = false | --plot, -p = False | If BCUMI collapsing, plot collapser-collapsee graphs for each reference combination. If enabled, expect hundreds of graphs in --output-path/dgraphs/ with a cumulative size exceeding 100 MB; the program will take longer to finish. |
| summary | Boolean string_index = false | --string-index, -s = False | Output the aggregated CDR3 info TSV with the BC and UMI strings as the indices, instead of numbering them, for easier reading. |
| summary | N/A ("/cromwell_root/out/") | --output-path, -o = "./out/cdr3/" | The path to which output files will be written. |
| summary | N/A ("/cromwell_root/out/") | --input-path, -i = "./out/cdr3/" | Path to the folder containing files output by recover_cdr3s. |
| summary | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
| summary | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
| summary | String disks = "local-disk 64 HDD" | N/A | The amount of disk space in GB to request for the task. |
| summary | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk in GB to request for the task. |