CardiffMRCPathfinder/GenotypeQCtoHRC

       ____ __
      { --.\  |            _   _   _   _  
       '-._\\ | (\___   %)/ \ / \ / \ / \
           `\\|{/ ^ _)-%(( D | A | T | A )
       .'^^^^^^^  /`    %%\_/ \_/ \_/ \_/ 
      //\   ) ,  /       
   _.'/  `\<-- \<
 `^^^`     ^^   ^^

DRAGON-Data "GenotypeQCtoHRC" pipeline v.2.0

Description

This is an updated version of the "GenotypeQCtoHRC" pipeline introduced in the DRAGON-Data publication and supported by an MRC "Mental Health Data Pathfinder" grant to Cardiff University. The pipeline was originally designed to perform stringent quality control of genotype data, and prepare it for imputation using the Michigan Imputation Server platform.

The purpose of this update is to simplify the use of the pipeline by streamlining most of its functions via literate and reproducible programming techniques. While the previous version relied on a single RMarkdown file with many dependencies and hard-coded values, this version makes heavy use of the {targets} framework for creating R workflows.

Roadmap

This is currently a BETA version. All genotype quality control functions described in the paper are working, with some additions, as listed below:

  • Genotype analysis of missingness and Hardy-Weinberg equilibrium ✅
  • Population structure and relatedness checks ✅
  • Heterozygosity analysis for sex chromosome markers ✅
  • Biogeographical ancestry inference (see footnote 1)
  • Use of 1000 Genomes Phase 3 as ancestral reference panel ✅
  • Liftover (NCBI36/GRCh37/GRCh38) ✅
  • Data preparation for Michigan Imputation Server ✅
  • Data preparation for TopMED Imputation Server ✅
  • Heterozygosity analysis for autosomal markers ⬜
  • Containerisation (Docker/Apptainer) ⬜

Making it work

The expected input is a genotype dataset in binary PLINK1 .bed/.bim/.fam format, which has to be in the same folder as the GenotypeQCtoHRC.R file. The expected output is a QC report in HTML format and a folder with original, curated and intermediate files. Download GenotypeQCtoHRC_targets_example.html and open it in your web browser to see what this looks like.
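As a quick pre-flight check of the layout described above, a small shell helper along these lines can confirm that the three PLINK files and the pipeline script sit in the same folder. This is only a sketch: the function name `check_plink_fileset` and the stem `mydata` are placeholders for illustration and are not part of the pipeline.

```shell
# check_plink_fileset DIR STEM (hypothetical helper, not shipped with the pipeline):
# verify that STEM.bed/.bim/.fam and GenotypeQCtoHRC.R are all present in DIR.
# Prints one line per missing file and returns non-zero if anything is missing.
check_plink_fileset() {
  dir=$1
  stem=$2
  status=0
  for f in "$stem.bed" "$stem.bim" "$stem.fam" "GenotypeQCtoHRC.R"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      status=1
    fi
  done
  return "$status"
}

# Example call; "mydata" is a placeholder stem, replace it with your own.
check_plink_fileset . mydata || echo "Fix the items above before running the pipeline."
```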

The main command to run the pipeline is: Rscript GenotypeQCtoHRC.R

You can add the -h or --help flag to display the command line arguments and their potential options. Required command line arguments for proper execution are:

  • --file to indicate the stem (i.e. name without extension) of the PLINK file set to analyse.
  • --name to indicate the name of the dataset that will be used in folder names and in the report.
  • --shortname to indicate a shorter (i.e. 10 characters) name for the dataset to be used in plots.
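Putting the three required arguments together, a call could be assembled as in the sketch below. The dataset values (`mydata`, `MyExampleCohort`, `MyCohort`) and the wrapper name `run_qc` are placeholders, and the 10-character warning is an assumption based on the --shortname description rather than a documented limit. The wrapper only prints the command unless `DRY_RUN=0` is set.

```shell
# run_qc STEM NAME SHORTNAME (hypothetical wrapper): build the pipeline call
# from the three required arguments. Prints the command by default; set
# DRY_RUN=0 in the environment to actually execute it.
run_qc() {
  # Assumption: the short name is meant to stay within roughly 10 characters.
  if [ "${#3}" -gt 10 ]; then
    echo "warning: --shortname is longer than 10 characters" >&2
  fi
  # Re-use the positional parameters so each value stays a single argument.
  set -- Rscript GenotypeQCtoHRC.R --file "$1" --name "$2" --shortname "$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Placeholder values; expects mydata.bed, mydata.bim and mydata.fam.
run_qc mydata MyExampleCohort MyCohort
```

Running with the default dry run first makes it easy to inspect the exact command before committing to a full analysis.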

You can also request some additional procedures by setting the following flags to TRUE:

  • --chip to run the "Chipendium" method to identify genotyping platforms (only valid with raw array data).
  • --lo to run LiftOver from build --lo-in (set to 0 if unknown) to build --lo-out. See help for more details.
  • --gh to run Genotype Harmonizer and align the markers to the --gh-ref imputation reference. Options for the latter are HRC (GRCh37) or TopMED (GRCh38).
  • --rename to run Genotype Harmonizer and assign dbSNP RS IDs to the markers in the input dataset.
  • --clean to remove all data from previous pipeline runs and start from scratch (useful for troubleshooting).
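As an illustration of the optional procedures, the sketch below builds the extra arguments for a Genotype Harmonizer run. It assumes the flags take an explicit value (`--gh TRUE`), as the wording "setting the following flags to TRUE" suggests; check the --help output for the exact syntax. The helper name `harmonise_args` and the dataset values are placeholders.

```shell
# harmonise_args REF (hypothetical helper): print the extra arguments that
# request Genotype Harmonizer alignment to REF. Only the two references named
# in this README (HRC, TopMED) are accepted.
harmonise_args() {
  case "$1" in
    HRC|TopMED)
      # Assumption: logical flags are passed as "--flag TRUE".
      echo "--gh TRUE --gh-ref $1"
      ;;
    *)
      echo "unknown reference: $1 (expected HRC or TopMED)" >&2
      return 1
      ;;
  esac
}

# Placeholder dataset values; the command is only printed, not executed.
BASE="Rscript GenotypeQCtoHRC.R --file mydata --name MyExampleCohort --shortname MyCohort"
EXTRA=$(harmonise_args HRC)
echo "$BASE $EXTRA"
```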

Everything else can be left at a default state to run the QC script with the same parameters used for the DRAGON-Data publication.

On a potato-grade computer, a dataset with ~2K individuals and ~500K markers should be analysed in about an hour. The lion's share of that runtime is currently taken by Chipendium, which uses multiple processors when available but won't run unless requested. Harmonising the dataset to the TopMED reference (also optional) can take a long time too.

For a quick test, the Behar et al. 2010 dataset used in the example report can be downloaded from the Estonian Biocentre website.

Dependencies

The pipeline was originally developed on a Windows system using R v4.3 and RStudio v2023.12. It has since been tested on Windows, macOS and Unix using R v4.4. The following packages are required to run it from source:

  • CRAN: rmarkdown (see footnote 2), optparse, targets, tarchetypes, here, pbapply, tidyverse, parallelly, VGAM, AssocTests, caret, probably, tint, glue, colorspace, scales, scattermore, ggpubr.
  • GitHub: slugify.
  • Bioconductor: BiocParallel, GWASTools, GENESIS, SNPRelate, rtracklayer.
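To see which of the packages listed above are still missing from an R installation, something like the following can be run from a terminal. It is a sketch only: it reports packages that `requireNamespace()` cannot load and does not install anything, since the installation route differs between CRAN, GitHub and Bioconductor. The variable and function names are illustrative.

```shell
# Package names copied from the dependency list above.
CRAN_PKGS="rmarkdown optparse targets tarchetypes here pbapply tidyverse parallelly VGAM AssocTests caret probably tint glue colorspace scales scattermore ggpubr"
GITHUB_PKGS="slugify"
BIOC_PKGS="BiocParallel GWASTools GENESIS SNPRelate rtracklayer"

# R expression: print every package given on the command line that cannot be loaded.
R_EXPR='pkgs <- commandArgs(trailingOnly = TRUE); ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE); cat(pkgs[!ok], sep = "\n")'

# missing_r_packages PKG... (hypothetical helper): list packages R cannot load.
missing_r_packages() {
  Rscript -e "$R_EXPR" "$@" 2>/dev/null
}

if command -v Rscript >/dev/null 2>&1; then
  echo "Missing packages (empty if none):"
  # Word splitting of the lists is intentional here.
  missing_r_packages $CRAN_PKGS $GITHUB_PKGS $BIOC_PKGS || echo "R check failed"
else
  echo "Rscript not found on PATH; install R first."
fi
```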

As some pipeline requirements (software, see footnote 3, and reference files) cannot be shared in this repository, the dragondata_extra folder is supplied partially filled. Please read the README files within for full instructions on how to populate it.

This is a provisional solution and a better method for file and dependency sharing will be implemented soon.
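The Footnotes section mentions two environment checks: a Pandoc version of 2.5 or later, and execute permissions on the external binaries. A rough way to script both is sketched below. The helper names are illustrative, the check covers the `pandoc` found on PATH (which may differ from the one R is linked to), and no paths inside dragondata_extra are assumed.

```shell
# version_ge A B (hypothetical helper): succeed when dotted version A >= B.
# Assumes purely numeric fields such as 2.19.2.
version_ge() {
  va=$1
  vb=$2
  while [ -n "$va" ] || [ -n "$vb" ]; do
    vx=${va%%.*}
    vy=${vb%%.*}
    if [ "$va" = "$vx" ]; then va=""; else va=${va#*.}; fi
    if [ "$vb" = "$vy" ]; then vb=""; else vb=${vb#*.}; fi
    vx=${vx:-0}
    vy=${vy:-0}
    if [ "$vx" -gt "$vy" ]; then return 0; fi
    if [ "$vx" -lt "$vy" ]; then return 1; fi
  done
  return 0
}

# ensure_executable FILE... (hypothetical helper): add the user execute bit
# to each existing file that lacks it, and report files that are missing.
ensure_executable() {
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
    elif [ ! -x "$f" ]; then
      chmod u+x "$f" && echo "made executable: $f"
    else
      echo "ok: $f"
    fi
  done
}

# Check the pandoc on PATH against the 2.5 minimum from footnote 2.
if command -v pandoc >/dev/null 2>&1; then
  pandoc_version=$(pandoc --version | head -n 1 | awk '{print $2}')
  if version_ge "$pandoc_version" 2.5; then
    echo "pandoc $pandoc_version: OK"
  else
    echo "pandoc $pandoc_version: too old, need 2.5+"
  fi
else
  echo "pandoc not found on PATH"
fi

# Usage (paths are placeholders; use the locations given in the bundled READMEs):
# ensure_executable path/to/plink path/to/plink2
```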

Credits

  • Original GenotypeQCtoHRC code by Leon Hubbard.
  • Refactoring to Targets and further improvements by Antonio Pardiñas.
  • Computational support by Lynsey Hall.
  • Testing by Jack Underwood, Djenifer Kappel and Jessica Yang.
  • Targets package by Will Landau.
  • "Tint is not Tufte" HTML theme by Dirk Eddelbuettel.
  • Chipendium procedure by Will Rayner.
  • Title ASCII art by jgs.

Citation

If this script is helpful for your work, just reference the main DRAGON-Data paper.

If it ends up being very helpful, please let us know so we can keep fighting impostor syndrome one day at a time... 😌

Contact

Please submit suggestions and bug reports at https://github.com/CardiffMRCPathfinder/GenotypeQCtoHRC/issues.

Footnotes

  1. Originally described in Legge et al. 2019 and updated to use all available SNPs with high FST values as AIMs.

  2. The RMarkdown setup can be problematic in some HPC servers. Please ensure the linked Pandoc installation is v2.5+.

  3. This includes PLINK (v1.9 and v2) and Genotype Harmonizer v1.4.20+, which requires Java. Please check that these files have the appropriate permissions to be executed once you copy them to your local or HPC system.
