A Snakemake-powered pipeline developed to perform variant-effect prediction and frequency analysis given multiple Variant Call Format datasets. This has been developed in partial fulfilment of an MSc in Bioinformatics at the University of Pretoria by Graeme Ford.

Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline

Snakemake Pharmacogenetics pipeline


Content Guide:

  1. About the Project
  2. Datasets
  3. Built with
  4. Software
  5. Binaries
  6. File Structure
  7. Files
  8. Folders
  9. Usage
  10. Roadmap
  11. Versioning
  12. Authors
  13. Acknowledgments

About the project:

This is the development repository for a pipeline created to perform frequency analysis on African genetic datasets. This work is licensed under a Creative Commons Attribution 4.0 International License.

Datasets:

This pipeline is designed to accept variant call format data in the form of .vcf files. Because of some of the bioinformatics software used internally, these files must be compressed using BG-zip compression and accompanied by a Tabix index file. Both tools (bgzip and tabix) are distributed with HTSlib, part of the Samtools project, a standard bioinformatics toolkit. Together, the two files give you a block-level compressed copy of your data and a block index, allowing software to decompress portions of the file and access specific entries without having to decompress the entire file.
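As a quick pre-flight check, the requirement above can be sketched in Python. This is only an illustration, not part of the pipeline: the function name `check_vcf_inputs` is made up for this example, and it assumes the Tabix index sits next to the data file under Tabix's default name (the data file name plus `.tbi`).

```python
from pathlib import Path


def check_vcf_inputs(vcf_path):
    """Return a list of problems found with a VCF input (empty list = OK).

    Assumption: the Tabix index lives next to the data file and uses the
    default name, i.e. <name>.vcf.gz.tbi.
    """
    vcf = Path(vcf_path)
    problems = []
    if not vcf.name.endswith(".vcf.gz"):
        problems.append(f"{vcf.name}: expected a BG-zipped file ending in .vcf.gz")
    if not vcf.is_file():
        problems.append(f"{vcf.name}: file not found")
    index = vcf.with_name(vcf.name + ".tbi")
    if not index.is_file():
        problems.append(f"{index.name}: Tabix index not found")
    return problems
```

An empty list means the file pair looks complete. The check only inspects names and existence; it does not confirm that the compression is really BG-zip.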

This is also simply good practice and should be a bioinformatics software standard.

Please be advised that BG-zip compression is not the same as gzip compression, such as that provided by the Linux gzip command. A BG-zip file is still a valid gzip file and can be decompressed by both programs, but a file compressed with plain gzip lacks the block structure that Tabix relies on, so you will need BG-zip compression in order to create a Tabix index.
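Because both kinds of file usually share the .gz extension, it can help to look at the header itself. The sketch below is an illustration rather than part of the pipeline. It relies on the documented BGZF layout: a gzip member whose extra field contains a `BC` subfield of length 2. The function name `is_bgzf` is an assumption made for this example.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (BG-zip) block header.

    A BGZF block is a gzip member whose header carries an extra field
    containing a 'BC' subfield of length 2; plain gzip output normally
    has no such subfield.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # 0x1f 0x8b = gzip magic, 8 = deflate, flag bit 0x04 = extra field present
        if header[0:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        # Bytes 10-11 hold the little-endian length of the extra field.
        (extra_len,) = struct.unpack("<H", header[10:12])
        extra = handle.read(extra_len)
    # Walk the extra subfields: 2-byte id, 2-byte length, then payload.
    pos = 0
    while pos + 4 <= len(extra):
        sub_id = extra[pos:pos + 2]
        (sub_len,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if sub_id == b"BC" and sub_len == 2:
            return True
        pos += 4 + sub_len
    return False
```

Only the first block is inspected, which is enough to tell a BG-zipped file from one written by plain gzip.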

Built with:

This has been made using a Python-based domain-specific language (DSL) called Snakemake and coded to run on a PBS/Torque environment using the qsub command (this is set by the profile folder). As such, it needs to be run on a server with the appropriate binaries and batch scheduling software.

Software:

Below is a list of software used by this pipeline:

Binaries:

Below is a list of binary dependencies used in this pipeline.

File Structure:

This pipeline uses the standardised folder structure, where the workflow itself is located under the workflow folder.

.
├── config # All config data (PBS Profile, genes, etc)
├── resources # Commonly used resources (WARNING: TO BE DEPRECATED SOON)
├── results # The output of the pipeline
├── workflow # The entrypoint to the code of the pipeline
└── README.md

This project uses the following naming conventions:

Files:

All user-generated files should be named using underscore (snake_case) naming conventions: spaces are replaced with underscores and no capital letters are used.

E.g. this_is_a_test_example.txt

All Snakemake-generated files are labelled using the <sample_name>.<file-extension> format and stored in a folder named after the process that produced them.

E.g. intermediates/liftover/1000g.vcf

Folders:

All user-generated folders should use camelCase naming conventions: the first letter of the name is lower-case, spaces are removed, and the initial letter of each following word is capitalised.

E.g. thisIsATestExample
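The two conventions for user-generated names can be checked mechanically. The sketch below is one possible reading of the rules, not an official validator: the regular expressions and the function names are assumptions made for this example, and file extensions are assumed to be lower-case as well.

```python
import re

# snake_case file name: lower-case letters/digits separated by single
# underscores, followed by one or more extensions (e.g. ".txt", ".vcf.gz").
SNAKE_CASE_FILE = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*(\.[a-z0-9]+)+")

# camelCase folder name: starts lower-case, later words start upper-case,
# and there are no spaces or underscores.
CAMEL_CASE_FOLDER = re.compile(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)*")


def is_valid_file_name(name):
    """True if a user-generated file name follows the underscore convention."""
    return SNAKE_CASE_FILE.fullmatch(name) is not None


def is_valid_folder_name(name):
    """True if a user-generated folder name follows the camelCase convention."""
    return CAMEL_CASE_FOLDER.fullmatch(name) is not None
```

With these patterns, `this_is_a_test_example.txt` and `thisIsATestExample` pass, while names containing spaces or a leading capital letter are rejected.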

All Snakemake-generated folders use the following folder structure:

.
└── intermediates
    └── <ruleName>
        ├── <file_name>.<extension>
        ├── <file_name>.<extension>
        └── <file_name>.<extension>
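To make the layout concrete, here is a minimal sketch of how an output path following this convention could be assembled. The helper name `intermediate_path` is hypothetical and is not taken from the pipeline's code.

```python
from pathlib import PurePosixPath


def intermediate_path(rule_name, sample_name, extension):
    """Build intermediates/<ruleName>/<sample_name>.<extension>."""
    file_name = f"{sample_name}.{extension.lstrip('.')}"
    return PurePosixPath("intermediates") / rule_name / file_name


print(intermediate_path("liftover", "1000g", "vcf"))
# prints: intermediates/liftover/1000g.vcf
```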

Usage:

  1. Use the cd command to navigate to the root repository directory containing the Snakefile.
  2. To start the pipeline and produce the default list of files, call snakemake on the command line with the appropriate arguments (e.g. the --profile and --cluster-config flags).
  3. To generate a runtime report detailing the figures produced and performance-related numbers, use the --report Snakemake flag (this requires the Jinja2 Python package to be installed). The HTML file produced is completely self-contained and can be shared as needed. You can view it in any web browser, such as Firefox or Google Chrome.
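The steps above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the function name and the example paths (`config/PBS-Torque-Profile`, `config/cluster.json`, `report.html`) are placeholders rather than names confirmed by this repository, and it assumes a Snakemake release that still accepts the --cluster-config flag.

```python
import shlex


def build_snakemake_command(profile_dir, cluster_config=None, report_file=None):
    """Assemble a snakemake invocation as an argument list.

    All paths are supplied by the caller; nothing here is read from the
    repository itself.
    """
    command = ["snakemake", "--profile", profile_dir]
    if cluster_config is not None:
        command += ["--cluster-config", cluster_config]
    if report_file is not None:
        command += ["--report", report_file]
    return command


# Placeholder paths, shown only to illustrate the shape of the command.
run_cmd = build_snakemake_command("config/PBS-Torque-Profile", "config/cluster.json")
print(shlex.join(run_cmd))
```

The returned list can be passed to `subprocess.run` from the repository root. The report is typically generated in a separate call once the workflow has finished, for example by passing `report_file="report.html"`.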

Roadmap:

See our Projects tab and Issues tracker for a list of proposed features (and known issues).

Versioning:

We use the SemVer syntax to manage and maintain version numbers. For the versions available, see the releases on this repository.

Acknowledgements:

Many thanks to the following individuals who have been instrumental to the success of this project:

Graeme Ford (Author)

Prof. Michael S. Pepper (Supervisor)

Prof. Fourie Joubert (Co-Supervisor)

Prof. Fourie Joubert (Tester)

Megan Ryder (Tester)

