mka -- easy bootstrap of ATAC-seq and RNA-seq pipelines

mka (short for "make analysis") is a tool we use to create ATAC-seq and RNA-seq pipelines in the Parker Lab at the University of Michigan.

Features

Its output is a ready-to-go analysis: just run make run to start it. The Makefile also includes targets for cleaning your work directories and creating a GitHub repository for the analysis.
Generates Python scripts that in turn generate your pipeline. These scripts tend to be more robust and easier to customize than shell scripts written by hand, particularly if you have a lot of input data.
You can supply your own pipeline templates, if you don't like our choice of tools or the settings we use.
Works with any FASTQ input files; while there's specific support for data from the UM DNA Sequencing Core, we routinely use it for public data or when collaborating on data sequenced at other institutions.
Can handle input from more than one organism.
Creates read group information from the metadata you supply, making it easier to track libraries throughout the pipeline.

Shortcomings

It's not very general -- it does what we want and not much else. If you want to use bowtie instead of bwa, or cutadapt instead of our adapter trimmer, you'll need to provide your own templates.
You can't currently start from BAM files.

Requirements

Python requirements

These should automatically be installed by pip when you pip install https://github.com/ParkerLab/mka/:

Jinja2
bs4
python-dateutil
requests

Pipeline requirements

FastQC
cta, our C++ version of Jason Buenrostro's adapter trimmer
bwa
MACS2
bedtools
samtools
kent, Jim Kent's utilities
drmr, our tool for working with resource managers like Slurm or PBS

Other requirements

hub, if you want to create a GitHub repository for your analysis

Installation

pip install git+https://github.com/ParkerLab/mka

Examples

ATAC-seq

For this example, I'm using data from ATAC-seq assays we performed on human skeletal muscle for [Scott2016]. I downloaded that data into a local directory, Run_1398. The files relevant to mka were:

Run_1398_parker.csv
parker/
- Sample_53252/
  - 53252_CTCTCTAC_S1_L002_R1_001.fastq.gz
  - 53252_CTCTCTAC_S1_L002_R2_001.fastq.gz
- Sample_53253/
  - 53253_CAGAGAGG_S2_L002_R1_001.fastq.gz
  - 53253_CAGAGAGG_S2_L002_R2_001.fastq.gz

Working in that directory, I used the screname script to create specially-named symlinks to those FASTQ files, using the run information provided by the UM DNA Sequencing Core in Run_1398_parker.csv:

$ screname -v Run_1398_parker.csv parker Scott2016
Looking for FASTQ files in parker/Sample_53252
Linking ../parker/Sample_53252/53252_CTCTCTAC_S1_L002_R1_001.fastq.gz -> Scott2016/53252___53252___L002___13-human-atac-k5-10mg.1.fq.gz
Linking ../parker/Sample_53252/53252_CTCTCTAC_S1_L002_R2_001.fastq.gz -> Scott2016/53252___53252___L002___13-human-atac-k5-10mg.2.fq.gz
Looking for FASTQ files in parker/Sample_53253
Linking ../parker/Sample_53253/53253_CAGAGAGG_S2_L002_R1_001.fastq.gz -> Scott2016/53253___53253___L002___14-human-atac-k5-2mg.1.fq.gz
Linking ../parker/Sample_53253/53253_CAGAGAGG_S2_L002_R2_001.fastq.gz -> Scott2016/53253___53253___L002___14-human-atac-k5-2mg.2.fq.gz
...

The mka file naming scheme is sample___library___readgroup___description.pair_index.fq.gz. The triple underscores make for cumbersome names, but allow easy parsing and more freeform descriptions. Since the sequencing core doesn't provide more than one library per sample name, the above files have the same values for each.

To use the file name parsing with third-party data, you'll have to rename the files without the aid of screname. The files from [Buenrostro2013] could look like this:

GSM1155957___SRR891268___GM12878_ATACseq_50k_Rep1___1.fastq.gz
GSM1155957___SRR891268___GM12878_ATACseq_50k_Rep1___2.fastq.gz
GSM1155958___SRR891269___GM12878_ATACseq_50k_Rep2___1.fastq.gz
GSM1155958___SRR891269___GM12878_ATACseq_50k_Rep2___2.fastq.gz
...

You don't have to use the file name parsing at all; running mka with the --interactive flag lets you provide or override all the library metadata for arbitrary input files. The special file names are just a way to save time.

With the input files set up, I ran mka. Because there were actually additional files in Run 1398 containing data not used in the paper, I'm selecting a subset here, in samples 53252 and 53253:

$ mka -v --run-info Run_1398_parker.csv -t atac-seq -d "ATAC-seq of human skeletal muscle" -a ~/analyses/scott2016 ~/control/scott2016 Scott2016/5325[23]*
Reading sequencing run information from Run_1398_parker.csv
  Please specify the reference genome: hg19

Libraries: {
    "53252": {
        "analysis_specific_options": {},
        "description": "13-human-atac-k5-10mg",
        "library": "53252",
        "readgroups": {
            "L002": [
                "/nfs/turbo/parkerlab1/lab/data/seqcore/Run_1398/Scott2016/53252___53252___L002___13-human-atac-k5-10mg.1.fq.gz",
                "/nfs/turbo/parkerlab1/lab/data/seqcore/Run_1398/Scott2016/53252___53252___L002___13-human-atac-k5-10mg.2.fq.gz"
            ]
        },
        "reference_genome": "hg19",
        "sample": "53252",
        "sequencing_center": "UM DNA Sequencing Core",
        "sequencing_date": "2015-10-23",
        "sequencing_platform": "ILLUMINA",
        "sequencing_platform_model": "",
        "url": ""
    },
    "53253": {
        "analysis_specific_options": {},
        "description": "14-human-atac-k5-2mg",
        "library": "53253",
        "readgroups": {
            "L002": [
                "/nfs/turbo/parkerlab1/lab/data/seqcore/Run_1398/Scott2016/53253___53253___L002___14-human-atac-k5-2mg.1.fq.gz",
                "/nfs/turbo/parkerlab1/lab/data/seqcore/Run_1398/Scott2016/53253___53253___L002___14-human-atac-k5-2mg.2.fq.gz"
            ]
        },
        "reference_genome": "hg19",
        "sample": "53253",
        "sequencing_center": "UM DNA Sequencing Core",
        "sequencing_date": "2015-10-23",
        "sequencing_platform": "ILLUMINA",
        "sequencing_platform_model": "",
        "url": ""
    }
}

Your analysis is ready in /home/hensley/control/scott2016
$

At this point, I can change directory to ~/control/scott2016 and type make run to submit the pipeline with drmr. I'll be mailed when it finishes, or if any job encounters an error.

[Scott2016]

The genetic regulatory signature of type 2 diabetes in human skeletal muscle, Scott et al., Nature Communications 2016

[Buenrostro2013]

Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position, Buenrostro et al., Nature Methods 2013

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
mka		mka
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mka -- easy bootstrap of ATAC-seq and RNA-seq pipelines

Features

Shortcomings

Requirements

Python requirements

Pipeline requirements

Other requirements

Installation

Examples

ATAC-seq

About

Releases

Packages

Languages

License

ParkerLab/mka

Folders and files

Latest commit

History

Repository files navigation

mka -- easy bootstrap of ATAC-seq and RNA-seq pipelines

Features

Shortcomings

Requirements

Python requirements

Pipeline requirements

Other requirements

Installation

Examples

ATAC-seq

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages