generates human UTRome from Human Cell Landscape data

Mayrlab/hcl-utrome

Overview

This repository contains a pipeline to reprocess the Human Cell Landscape data for cleavage site identification as reported in Fansler et al., bioRxiv, 2023.

🤔 Important Note: This pipeline is provided for the scientific record, not necessarily with reuse in mind. However, we made some engineering improvements when writing the analogous pipeline for mouse data (https://github.com/Mayrlab/mca-utrome), making it more geared toward reuse. In particular, we moved pipeline parameters out of the Snakefile and into the config.yaml, where they belong. If you are considering rerunning this pipeline or applying it to other Microwell-seq data, you may want to start from that version instead, or at least incorporate those improvements here. Also, be mindful that both pipelines are resource-heavy. We may be able to provide intermediate files that expedite generating output variants without rerunning alignments (open an Issue).

Organization

The folders in the repository have the following purposes:

  • data - (created at runtime) output data files
  • envs - Conda environment YAML files for recreating the execution environment
  • metadata - metadata files that annotate input data files
  • scripts - scripts used by the Snakefile
  • qc - (created at runtime) output quality checks

All code is expected to be executed with this repository as the present working directory.

Source Code

The primary source code is found in the Snakefile and the scripts folder.

Input Data

Files in the metadata folder describe most of the information necessary to download the raw input sequencing files, as well as annotate the cells.

This pipeline also requires a HISAT2 index, which is not automatically retrieved. The location of this should be specified with the hisatIndex key in the config.yaml.
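Since the index is not retrieved automatically, one way to obtain it is to build it yourself. The following is a minimal sketch using HISAT2's own hisat2-build tool. The function name build_hisat_index, the FASTA path, the index prefix, and the thread count are all hypothetical, and the genome build the pipeline expects is not stated in this README. We also assume that hisatIndex should point to the resulting index prefix; check the Snakefile before relying on this.

```shell
# Sketch: build a HISAT2 index from a genome FASTA.
# Paths are placeholders; which genome build to use is an assumption you must verify.
build_hisat_index() {
    genome_fasta="$1"    # hypothetical: a human genome FASTA you downloaded
    index_prefix="$2"    # hypothetical: HISAT2 writes <prefix>.1.ht2, <prefix>.2.ht2, ...
    threads="${3:-4}"    # number of build threads (default 4)

    # Make sure the directory that will hold the index files exists.
    mkdir -p "$(dirname "$index_prefix")"

    # hisat2-build <reference_in> <index_basename>, with -p for threads.
    hisat2-build -p "$threads" "$genome_fasta" "$index_prefix"
}

# Usage (not run here):
#   build_hisat_index /path/to/genome.fa /path/to/index/genome 8
```

Building a human index takes substantial memory and time, so reusing an existing index on a shared system may be preferable where one is available.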

Reproducing the Pipeline

Cloning

This repository can be cloned with:

git clone https://github.com/Mayrlab/hcl-utrome.git

Prerequisite Software

This requires Conda/Mamba and Snakemake. If you do not already have a Conda installation, we strongly recommend Miniforge.

Configuration

Two configuration options in config.yaml should be adjusted by the user prior to running:

  • tmpdir: temporary directory for scratch
  • hisatIndex: human HISAT2 index
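Because the pipeline is resource-heavy, it can help to confirm that both required keys are set before launching it. The sketch below is not part of the repository. The helper check_config is hypothetical, and it assumes the keys appear as flat top-level "key: value" lines in config.yaml, which may not match the actual file layout.

```shell
# Sketch: verify that the two required keys have non-empty values.
# Assumes flat top-level "key: value" lines; nested YAML would need a real parser.
check_config() {
    config_file="$1"
    missing=0
    for key in tmpdir hisatIndex; do
        # Match "key:" followed by a value that is neither blank nor only a comment.
        if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$config_file"; then
            echo "missing or empty: ${key}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the repository root (not run here):
#   check_config config.yaml && snakemake --use-conda
```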

Optional parameters in the config.yaml that could be adjusted are:

  • minReadLength: the minimum read length required to include the resulting merged read
  • radiusGENCODE: radius for merging GENCODE transcripts
  • radiusPAS: radius for merging PolyASite entries
  • extUTR3: downstream distance from annotated gene locus to classify as "extended 3'UTR"
  • extUTR5: upstream distance from annotated gene locus to classify as "extended 5'UTR"

Additional parameters of interest in the Snakefile are:

  • epsilon: the initial radius within which read ends are merged to the mode
  • threshold: minimum TPM per cell type cutoff for filtering low-frequency cleavage sites
  • version: the human GENCODE version to be built upon
  • tpm: the minimum TPM threshold for PolyASite entries to be used as "supporting" evidence
  • likelihood: the minimum CleanUpdTSeq score that a cleavage site is not from internal priming to be considered a "likely" cleavage site
  • width: the width for truncating the UTRome
  • merge: the distance within which to merge 3'ends during scUTRquant quantification

Running

The full pipeline can be executed with:

snakemake --use-conda

We encourage HPC users to configure a Snakemake profile and use this via a --profile argument.
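As a sketch of how the two invocations relate, the wrapper below runs the pipeline with or without a profile. The function run_pipeline and the profile name in the usage line are hypothetical. The --profile option accepts either the name of a profile directory that Snakemake can find or a path to one, and what the profile should contain depends on your cluster.

```shell
# Sketch: run the pipeline, optionally through a Snakemake profile.
# The profile itself is something you must create for your own system.
run_pipeline() {
    profile="${1:-}"    # optional: profile name or path
    if [ -n "$profile" ]; then
        snakemake --use-conda --profile "$profile"
    else
        snakemake --use-conda
    fi
}

# Usage, from the repository root (not run here):
#   run_pipeline                # local execution
#   run_pipeline my-cluster     # hypothetical profile name
```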