This repository contains a pipeline to reprocess the Human Cell Landscape data for cleavage site identification as reported in Fansler et al., bioRxiv, 2023.
🤔 Important Note: This pipeline is provided for the scientific record, not necessarily with reuse in mind. However, we made some engineering improvements when writing the analogous pipeline for mouse data (https://github.com/Mayrlab/mca-utrome), making it more geared toward reuse. In particular, we moved what are really pipeline parameters out of the Snakefile and into the config.yaml, where they belong. If you are considering rerunning this pipeline or applying it to other Microwell-seq data, you may want to start from that version instead, or at least incorporate those pipeline improvements here. Also, be mindful that both of these are resource-heavy pipelines. We may be able to provide useful intermediate files to expedite generating output variants that do not require rerunning alignments (open an Issue).
The folders in the repository have the following purposes:
- data - (created at runtime) output data files
- envs - Conda environment YAML files for recreating the execution environment
- metadata - metadata files that annotate input data files
- scripts - scripts used by the Snakefile
- qc - (created at runtime) output quality checks
All code is expected to be executed with this repository as the present working directory.
The primary source code is found in the Snakefile and the scripts folder.
Files in the metadata folder describe most of the information necessary to download the raw input sequencing files, as well as to annotate the cells.
This pipeline also requires a HISAT2 index, which is not automatically retrieved. Its location should be specified with the hisatIndex key in the config.yaml.
This repository can be cloned with:
git clone https://github.com/Mayrlab/hcl-utrome.git
This requires Conda/Mamba and Snakemake. If you do not already have a Conda installation, we strongly recommend Miniforge.
Two configuration options in config.yaml should be adjusted by the user prior to running:
- tmpdir: temporary directory for scratch
- hisatIndex: human HISAT2 index
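Before launching a long run, it can help to confirm that both required entries resolve to something usable. The following is a minimal shell sketch, not part of the pipeline. The function names get_config_value and check_config are hypothetical. It assumes that config.yaml stores these two entries as flat, top-level "key: value" lines, and that hisatIndex holds an index prefix (HISAT2 index files are named <prefix>.1.ht2 and so on, or .ht2l for large indexes). If the pipeline expects a directory instead, adjust the second check accordingly.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of a top-level "key: value" line.
# Assumes a flat YAML layout; this is not a general YAML parser.
get_config_value() {
    local key="$1" file="${2:-config.yaml}"
    sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1 \
        | sed -e 's/[[:space:]]*$//' \
              -e 's/^"\(.*\)"$/\1/' \
              -e "s/^'\(.*\)'$/\1/"
}

# Hypothetical helper: return 0 only if both required entries look usable.
check_config() {
    local file="${1:-config.yaml}" status=0
    local tmpdir index
    tmpdir="$(get_config_value tmpdir "$file")"
    index="$(get_config_value hisatIndex "$file")"

    # tmpdir must be an existing, writable directory.
    if [ -z "$tmpdir" ] || [ ! -d "$tmpdir" ] || [ ! -w "$tmpdir" ]; then
        echo "tmpdir is unset or not a writable directory: '$tmpdir'" >&2
        status=1
    fi

    # Assumption: hisatIndex is a prefix, so <prefix>.1.ht2 (or .ht2l) exists.
    if [ -z "$index" ] || ! ls "${index}".1.ht2* >/dev/null 2>&1; then
        echo "no HISAT2 index files found for prefix: '$index'" >&2
        status=1
    fi

    return "$status"
}

# Usage (from the repository root): check_config config.yaml
```

Both checks run even if the first one fails, so a single invocation reports every problem it finds.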
Optional parameters in the config.yaml that could be adjusted are:
- minReadLength: the minimum length required for a merged read to be included
- radiusGENCODE: radius for merging GENCODE transcripts
- radiusPAS: radius for merging PolyASite entries
- extUTR3: downstream distance from annotated gene locus to classify as "extended 3'UTR"
- extUTR5: upstream distance from annotated gene locus to classify as "extended 5'UTR"
Additional parameters of interest in the Snakefile are:
- epsilon: the initial radius within which read ends are merged to the mode
- threshold: minimum TPM per cell type cutoff for filtering low-frequency cleavage sites
- version: the human GENCODE version to be built upon
- tpm: the minimum TPM threshold for PolyASite entries to be used as "supporting" evidence
- likelihood: the minimum CleanUpdTSeq score that a cleavage site is not from internal priming to be considered a "likely" cleavage site
- width: the width for truncating the UTRome
- merge: the distance within which to merge 3' ends during scUTRquant quantification
The full pipeline can be executed simply with:
snakemake --use-conda
We encourage HPC users to configure a Snakemake profile and use this via a --profile argument.
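Given how resource-heavy the pipeline is, previewing the job plan before a real run may be worthwhile. The wrappers below are a hedged sketch rather than part of the repository. The function names preview_pipeline and run_pipeline are hypothetical, and the profile name "slurm" in the usage comment is only a placeholder for whatever profile you have configured. The --dry-run, --cores, and --profile flags are standard Snakemake options. Many recent Snakemake releases expect a core count for real runs, so the sketch passes --cores explicitly.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: list the jobs Snakemake would run, without running them.
preview_pipeline() {
    snakemake --use-conda --dry-run "$@"
}

# Hypothetical wrapper: run the pipeline. The first argument is the core
# count (default 1); any remaining arguments are passed through to Snakemake.
run_pipeline() {
    local cores="${1:-1}"
    if [ "$#" -gt 0 ]; then
        shift
    fi
    snakemake --use-conda --cores "$cores" "$@"
}

# Usage (from the repository root):
#   preview_pipeline
#   run_pipeline 8
#   run_pipeline 8 --profile slurm   # "slurm" is a placeholder profile name
```

When a profile is used, settings such as the core or job count are often defined in the profile itself, in which case the explicit --cores value may be redundant.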