Review docs #336

Merged
merged 15 commits into from
Jul 30, 2024

Changes from all commits
3 changes: 0 additions & 3 deletions CITATIONS.md
@@ -33,9 +33,6 @@
* [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)
> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

* [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)
> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

* [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

* [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
2 changes: 1 addition & 1 deletion docs/_templates/globaltoc.html
@@ -35,7 +35,7 @@ <h3>Useful links</h3>
<li><a href="https://github.com/pgscatalog/pgsc_calc/issues">Issue tracker</a></li>
<li><a href="https://github.com/PGScatalog/pgsc_calc/discussions">Discussion board</a></li>
</ul>
<li><a href="https://github.com/PGScatalog/pgscatalog_utils">pgscatalog_utils Github</a></li>
<li><a href="https://github.com/PGScatalog/pygscatalog">pgscatalog-utils GitHub</a></li>
</ul>

<hr>
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -22,7 +22,7 @@

project = 'Polygenic Score (PGS) Catalog Calculator'
copyright = 'Polygenic Score (PGS) Catalog team (licensed under Apache License V2)'
# author = 'Polygenic Score (PGS) Catalog team'
author = 'Polygenic Score (PGS) Catalog team'


# -- General configuration ---------------------------------------------------
10 changes: 9 additions & 1 deletion docs/explanation/geneticancestry.rst
@@ -130,7 +130,15 @@ how-to guide), and has the following steps:
for variant-level QC (SNPs in Hardy–Weinberg equilibrium [p > 1e-04] that are bi-allelic and non-ambiguous,
with low missingness [<10%], and minor allele frequency [MAF > 5%]) and sample-quality (missingness <10%).
LD-pruning is then applied to the variants and samples passing these checks (r\ :sup:`2` threshold = 0.05), excluding
complex regions with high LD (e.g. MHC). These methods are implemented in the ``FILTER_VARIANTS`` module.
complex regions with high LD (e.g. MHC). These methods are implemented in the ``FILTER_VARIANTS`` module, and
the default settings can be changed (see :doc:`schema (Reference options) <params>`).

1. **Additional variant filters on TARGET samples**: in ``v2.0.0-beta`` we introduced the ability to filter
target sample variants using minimum MAF [default 10%] and maximum genotype missingness [default 10%] to
improve PCA robustness when using imputed genotype data (see :doc:`schema (Ancestry options) <params>`).
*Note: these parameters may need to be adjusted depending on your input data (currently optimized for large
cohorts like UKB); for individual samples we recommend lowering the MAF filter (``--pca_maf_target 0``)
to ensure homozygous reference calls are included (a sketch of such a command is shown below).*

2. **PCA**: the LD-pruned variants of the unrelated samples passing QC are then used to define the PCA space of the
reference panel (default: 10 PCs) using `FRAPOSA`_ (Fast and Robust Ancestry Prediction by using Online singular
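For example, a run on a small number of samples could disable the target MAF filter. This is a minimal sketch only: the samplesheet, reference panel archive, and score accession are placeholders for your own inputs.

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS001229 \
--run_ancestry pgsc_HGDP+1kGP_v1.tar.zst \
--pca_maf_target 0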
4 changes: 3 additions & 1 deletion docs/explanation/match.rst
@@ -37,6 +37,8 @@ When you evaluate the predictive performance of a score with low match rates it

If you reduce ``--min_overlap`` then the calculator will output scores calculated with the remaining variants, **but these scores may not be representative of the original data submitted to the PGS Catalog.**

.. _wgs:

Are your target genomes imputed? Are they WGS?
----------------------------------------------

@@ -49,7 +51,7 @@ In the future we plan to improve support for WGS.
Did you set the correct genome build?
-------------------------------------

The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the `--liftover` command may have been omitted.
The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the ``--liftover`` command may have been omitted.
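If you're using a custom scoring file in a different build to your target genomes, a liftover run might look like the sketch below. The file names are placeholders, and depending on your setup you may also need to point the pipeline at liftover chain files (check the parameter documentation).

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv \
--scorefile my_custom_score.txt \
--target_build GRCh38 \
--liftover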

I'm still getting match rate errors. How do I figure out what's wrong?
----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions docs/explanation/output.rst
@@ -23,6 +23,7 @@ Calculated scores are stored in a gzipped, space-delimited text file called
separate row (``length = n_samples*n_pgs``), and there will be at least four columns with the following headers (an illustrative example appears after this list):

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``PGS``: the accession ID of the PGS being reported.
- ``SUM``: reports the weighted sum of *effect_allele* dosages multiplied by their *effect_weight*
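As a hypothetical illustration of this long format (the values are invented, and the real file can contain extra columns not described here):

.. code-block:: text

sampleset FID IID PGS SUM
cineca HG001 HG001 PGS001229 0.54
cineca HG002 HG002 PGS001229 -0.13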
@@ -56,6 +57,7 @@ describing the analysis of the target samples in relation to the reference panel
following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``[PC1 ... PCN]``: The projection of the sample within the PCA space defined by the reference panel. There will be as
many PC columns as there are PCs calculated (default: 10).
162 changes: 127 additions & 35 deletions docs/how-to/bigjob.rst
@@ -74,43 +74,132 @@ limits.
.. warning:: You'll probably want to use ``-profile singularity`` on an HPC. The
pipeline requires Singularity v3.7 at minimum.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:
Here's an example configuration running about 100 scores in parallel
on UK Biobank with a SLURM cluster:

.. code-block:: text

process {
queue = 'short'
clusterOptions = ''
scratch = true
errorStrategy = 'retry'
maxRetries = 3
maxErrors = '-1'
executor = 'slurm'

withName: 'DOWNLOAD_SCOREFILES' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withLabel:process_low {
cpus = 2
memory = 8.GB
time = 1.h
withName: 'COMBINE_SCOREFILES' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withLabel:process_medium {
cpus = 8
memory = 64.GB
time = 4.h

withName: 'PLINK2_MAKEBED' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
}

executor {
name = 'lsf'
jobName = { "$task.hash" }
}
withName: 'RELABEL_IDS' {
cpus = 1
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_ORIENT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'DUMPSOFTWAREVERSIONS' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'ANCESTRY_ANALYSIS' {
cpus = { 1 * task.attempt }
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'SCORE_REPORT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

In SLURM, queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
configuration file you set up is saved as ``my_custom.config`` in your current
working directory, you're ready to run pgsc_calc. Instead of running nextflow
directly on the shell, save a bash script (``run_pgscalc.sh``) to a file
instead:
withName: 'EXTRACT_DATABASE' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_RELABELPVAR' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 2.hour * task.attempt }
}

withName: 'INTERSECT_VARIANTS' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'MATCH_VARIANTS' {
cpus = 2
memory = { 32.GB * task.attempt }
time = { 6.hour * task.attempt }
}

withName: 'FILTER_VARIANTS' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'MATCH_COMBINE' {
cpus = 4
memory = { 64.GB * task.attempt }
time = { 6.hour * task.attempt }
}

withName: 'FRAPOSA_PCA' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_SCORE' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 12.hour * task.attempt }
}

withName: 'SCORE_AGGREGATE' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 4.hour * task.attempt }
}
}

Assuming the configuration file you set up is saved as
``my_custom.config`` in your current working directory, you're ready
to run pgsc_calc. Instead of running nextflow directly in the shell,
save the commands to a bash script (``run_pgscalc.sh``):

.. code-block:: bash


#!/bin/bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G

export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

@@ -126,20 +215,23 @@
.. note:: The name of the nextflow and singularity modules will be different in
your local environment.

.. warning:: Make sure to copy input data to fast storage, and run the pipeline
on the same fast storage area. You might include these steps in your
bash script. Ask your sysadmin for help if you're not sure what this
means.
.. warning:: Make sure to copy input data to fast storage, and run the
pipeline on the same fast storage area. You might include
these steps in your bash script. Ask your sysadmin for
help if you're not sure what this means.

.. code-block:: console

$ bsub -M 2GB -q short -o output.txt < run_pgscalc.sh

$ sbatch run_pgscalc.sh

This will submit a nextflow driver job, which will submit additional jobs for
each process in the workflow. The nextflow driver requires up to 4GB of RAM
(bsub's ``-M`` parameter) and 2 CPUs to use (see a guide for `HPC users`_ here).
each process in the workflow. The nextflow driver requires up to 4GB of RAM and 2 CPUs (see this guide for `HPC users`_).
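The collapsed part of ``run_pgscalc.sh`` loads the nextflow and singularity modules and launches the workflow. A minimal sketch of that launch command, assuming the custom configuration file described above and placeholder inputs, might be:

.. code-block:: bash

nextflow run pgscatalog/pgsc_calc \
-profile singularity \
-c my_custom.config \
--input samplesheet.csv \
--target_build GRCh38 \
--pgs_id PGS001229

A comma-separated list of accessions can be supplied to ``--pgs_id`` to calculate many scores in one run.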

.. _`LSF and PBS`: https://nextflow.io/docs/latest/executor.html#slurm
.. _`HPC users`: https://www.nextflow.io/blog/2021/5_tips_for_hpc_users.html
.. _`a nextflow profile`: https://github.com/nf-core/configs


Cloud deployments
-----------------

We've deployed the calculator to Google Cloud Batch but some :doc:`special configuration is required<cloud>`.
25 changes: 14 additions & 11 deletions docs/how-to/cache.rst
@@ -1,23 +1,26 @@
.. _cache:

How do I speed up `pgsc_calc` computation times and avoid re-running code?
==========================================================================
How do I speed up computation times and avoid re-running code?
==============================================================

If you intend to run `pgsc_calc` multiple times on the same target samples (e.g.
If you intend to run ``pgsc_calc`` multiple times on the same target samples (e.g.
on different sets of PGS, with different variant matching flags) it is worth caching
information on invariant steps of the pipeline:

- Genotype harmonization (variant relabeling steps)
- Steps of `--run_ancestry` that: match variants between the target and reference panel and
- Steps of ``--run_ancestry`` that match variants between the target and reference panel and
generate PCA loadings that can be used to adjust the PGS for ancestry.

To do this you must specify a directory that can store these information across runs using the
`--genotypes_cache` flag to the nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these steps and proceed to run only the
steps needed to calculate new PGS. This is slightly different than using the `-resume command in
nextflow <https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_ which mainly checks the
`work` directory and is more often used for restarting the pipeline when a specific step has failed
(e.g. for exceeding memory limits).
To do this you must specify a directory that can store this
information across runs using the ``--genotypes_cache`` flag to the
nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these
steps and proceed to run only the steps needed to calculate new PGS.
This is slightly different from using the `-resume command in nextflow
<https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_
which mainly checks the ``work`` directory and is more often used for
restarting the pipeline when a specific step has failed (e.g. for
exceeding memory limits).
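As a sketch, two consecutive runs sharing a cache directory might only differ in the scores requested (paths and accessions below are placeholders):

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS001229 \
--genotypes_cache /path/to/cache

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS000018 \
--genotypes_cache /path/to/cache

The second run should reuse the relabelled genotypes from the cache and only execute the steps needed to calculate the new PGS.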

.. warning:: Always use a new cache directory for different samplesets, as redundant names may clash across runs.

2 changes: 1 addition & 1 deletion docs/how-to/calculate_custom.rst
@@ -26,7 +26,7 @@ minimal header in the following format:
Header::

#pgs_name=metaGRS_CAD
#pgs_name=metaGRS_CAD
#pgs_id=metaGRS_CAD
#trait_reported=Coronary artery disease
#genome_build=GRCh37

2 changes: 1 addition & 1 deletion docs/how-to/multiple.rst
@@ -133,7 +133,7 @@ Congratulations, you've now calculated multiple scores in parallel!
combine scores in the PGS Catalog with your own custom scores

After the workflow executes successfully, the calculated scores and a summary
report should be available in the ``results/make/`` directory by default. If
report should be available in the ``results/`` directory by default. If
you're interested in more information, see :ref:`interpret`.

If the workflow didn't execute successfully, have a look at the
8 changes: 6 additions & 2 deletions docs/how-to/offline.rst
@@ -127,8 +127,12 @@ panel too. See :ref:`norm`.
Download scoring files
----------------------

It's best to manually download scoring files from the PGS Catalog in the correct
genome build. Using PGS001229 as an example:
.. tip:: Use our CLI application ``pgscatalog-download`` to `download multiple scoring`_ files in parallel and in the correct genome build.

.. _download multiple scoring: https://pygscatalog.readthedocs.io/en/latest/how-to/guides/download.html

You'll need to preload scoring files in the correct genome build.
Using PGS001229 as an example:

https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/
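For example, the harmonized GRCh38 scoring file could be fetched directly. The exact file name below is an assumption based on the FTP layout, so browse the directory above to confirm it first:

.. code-block:: console

$ wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/Harmonized/PGS001229_hmPOS_GRCh38.txt.gz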

2 changes: 2 additions & 0 deletions docs/how-to/prepare.rst
@@ -52,6 +52,8 @@ VCF from WGS
See https://github.com/PGScatalog/pgsc_calc/discussions/123 for discussion about tools
to convert the VCF files into ones suitable for calculating PGS.

If you input WGS data to the calculator without following the steps above, you will probably encounter match rate errors. For more information, see :ref:`wgs`.


``plink`` binary fileset (bfile)
--------------------------------