Review docs #336

Merged
merged 15 commits into from
Jul 30, 2024

Changes from all commits
3 changes: 0 additions & 3 deletions CITATIONS.md
@@ -33,9 +33,6 @@
* [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)
> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

* [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)
> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

* [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

* [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
2 changes: 1 addition & 1 deletion docs/_templates/globaltoc.html
@@ -35,7 +35,7 @@ <h3>Useful links</h3>
<li><a href="https://github.com/pgscatalog/pgsc_calc/issues">Issue tracker</a></li>
<li><a href="https://github.com/PGScatalog/pgsc_calc/discussions">Discussion board</a></li>
</ul>
<li><a href="https://github.com/PGScatalog/pgscatalog_utils">pgscatalog_utils Github</a></li>
<li><a href="https://github.com/PGScatalog/pygscatalog">pgscatalog-utils GitHub</a></li>
</ul>

<hr>
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -22,7 +22,7 @@

project = 'Polygenic Score (PGS) Catalog Calculator'
copyright = 'Polygenic Score (PGS) Catalog team (licensed under Apache License V2)'
# author = 'Polygenic Score (PGS) Catalog team'
author = 'Polygenic Score (PGS) Catalog team'


# -- General configuration ---------------------------------------------------
10 changes: 9 additions & 1 deletion docs/explanation/geneticancestry.rst
@@ -130,7 +130,15 @@ how-to guide), and has the following steps:
for variant-level QC (SNPs in Hardy–Weinberg equilibrium [p > 1e-04] that are bi-allelic and non-ambiguous,
with low missingness [<10%], and minor allele frequency [MAF > 5%]) and sample-quality (missingness <10%).
LD-pruning is then applied to the variants and samples passing these checks (r\ :sup:`2` threshold = 0.05), excluding
complex regions with high LD (e.g. MHC). These methods are implemented in the ``FILTER_VARIANTS`` module.
complex regions with high LD (e.g. MHC). These methods are implemented in the ``FILTER_VARIANTS`` module, and
the default settings can be changed (see :doc:`schema (Reference options) <params>`).

1. **Additional variant filters on TARGET samples**: in ``v2.0.0-beta`` we introduced the ability to filter
target sample variants using minimum MAF [default 10%] and maximum genotype missingness [default 10%] to
improve PCA robustness when using imputed genotype data (see :doc:`schema (Ancestry options) <params>`).
*Note: these parameters may need to be adjusted depending on your input data (currently optimized for large
cohorts like UKB); for individual samples we recommend lowering the MAF filter (``--pca_maf_target 0``)
to ensure homozygous reference calls are included (a sketch of such a command is shown below).*

2. **PCA**: the LD-pruned variants of the unrelated samples passing QC are then used to define the PCA space of the
reference panel (default: 10 PCs) using `FRAPOSA`_ (Fast and Robust Ancestry Prediction by using Online singular
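For example, a run on a small number of samples could disable the target MAF filter. This is a minimal sketch only: the samplesheet, reference panel archive, and score accession are placeholders for your own inputs.

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS001229 \
--run_ancestry pgsc_HGDP+1kGP_v1.tar.zst \
--pca_maf_target 0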
4 changes: 3 additions & 1 deletion docs/explanation/match.rst
@@ -37,6 +37,8 @@ When you evaluate the predictive performance of a score with low match rates it

If you reduce ``--min_overlap`` then the calculator will output scores calculated with the remaining variants, **but these scores may not be representative of the original data submitted to the PGS Catalog.**

.. _wgs:

Are your target genomes imputed? Are they WGS?
----------------------------------------------

@@ -49,7 +51,7 @@ In the future we plan to improve support for WGS.
Did you set the correct genome build?
-------------------------------------

The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the `--liftover` command may have been omitted.
The calculator will automatically grab scoring files in the correct genome build from the PGS Catalog. If match rates are low it may be because you have specified the wrong genome build. If you're using custom scoring files and the match rate is low it is possible that the ``--liftover`` command may have been omitted.
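If you're using a custom scoring file in a different build to your target genomes, a liftover run might look like the sketch below. The file names are placeholders, and depending on your setup you may also need to point the pipeline at liftover chain files (check the parameter documentation).

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv \
--scorefile my_custom_score.txt \
--target_build GRCh38 \
--liftover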

I'm still getting match rate errors. How do I figure out what's wrong?
----------------------------------------------------------------------
2 changes: 2 additions & 0 deletions docs/explanation/output.rst
@@ -23,6 +23,7 @@ Calculated scores are stored in a gzipped, space-delimited text file called
separate row (``length = n_samples*n_pgs``), and there will be at least four columns with the following headers (an illustrative example appears after this list):

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``PGS``: the accession ID of the PGS being reported.
- ``SUM``: reports the weighted sum of *effect_allele* dosages multiplied by their *effect_weight*
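As a hypothetical illustration of this long format (the values are invented, and the real file can contain extra columns not described here):

.. code-block:: text

sampleset FID IID PGS SUM
cineca HG001 HG001 PGS001229 0.54
cineca HG002 HG002 PGS001229 -0.13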
@@ -56,6 +57,7 @@ describing the analysis of the target samples in relation to the reference panel
following headers:

- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
- ``FID``: the family identifier of each sample within the dataset (may be the same as IID).
- ``IID``: the identifier of each sample within the dataset.
- ``[PC1 ... PCN]``: The projection of the sample within the PCA space defined by the reference panel. There will be as
many PC columns as there are PCs calculated (default: 10).
162 changes: 127 additions & 35 deletions docs/how-to/bigjob.rst
@@ -74,43 +74,132 @@ limits.
.. warning:: You'll probably want to use ``-profile singularity`` on an HPC. The
pipeline requires Singularity v3.7 at minimum.

However, in general you will have to adjust the ``executor`` options and job resource
allocations (e.g. ``process_low``). Here's an example for an LSF cluster:
Here's an example configuration running about 100 scores in parallel
on UK Biobank with a SLURM cluster:

.. code-block:: text

process {
queue = 'short'
clusterOptions = ''
scratch = true
errorStrategy = 'retry'
maxRetries = 3
maxErrors = '-1'
executor = 'slurm'

withName: 'DOWNLOAD_SCOREFILES' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withLabel:process_low {
cpus = 2
memory = 8.GB
time = 1.h
withName: 'COMBINE_SCOREFILES' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 2.hour * task.attempt }
}
withLabel:process_medium {
cpus = 8
memory = 64.GB
time = 4.h

withName: 'PLINK2_MAKEBED' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}
}

executor {
name = 'lsf'
jobName = { "$task.hash" }
}
withName: 'RELABEL_IDS' {
cpus = 1
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_ORIENT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'DUMPSOFTWAREVERSIONS' {
cpus = 1
memory = { 1.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'ANCESTRY_ANALYSIS' {
cpus = { 1 * task.attempt }
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'SCORE_REPORT' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

In SLURM, queue is equivalent to a partition. Specific cluster parameters can be
provided by modifying ``clusterOptions``. You should change ``cpus``,
``memory``, and ``time`` to match the amount of resources used. Assuming the
configuration file you set up is saved as ``my_custom.config`` in your current
working directory, you're ready to run pgsc_calc. Instead of running nextflow
directly on the shell, save a bash script (``run_pgscalc.sh``) to a file
instead:
withName: 'EXTRACT_DATABASE' {
cpus = 1
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_RELABELPVAR' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 2.hour * task.attempt }
}

withName: 'INTERSECT_VARIANTS' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'MATCH_VARIANTS' {
cpus = 2
memory = { 32.GB * task.attempt }
time = { 6.hour * task.attempt }
}

withName: 'FILTER_VARIANTS' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'MATCH_COMBINE' {
cpus = 4
memory = { 64.GB * task.attempt }
time = { 6.hour * task.attempt }
}

withName: 'FRAPOSA_PCA' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 1.hour * task.attempt }
}

withName: 'PLINK2_SCORE' {
cpus = 2
memory = { 8.GB * task.attempt }
time = { 12.hour * task.attempt }
}

withName: 'SCORE_AGGREGATE' {
cpus = 2
memory = { 16.GB * task.attempt }
time = { 4.hour * task.attempt }
}
}

Assuming the configuration file you set up is saved as
``my_custom.config`` in your current working directory, you're ready
to run pgsc_calc. Instead of running nextflow directly in the shell,
save the commands to a bash script (``run_pgscalc.sh``):

.. code-block:: bash


#!/bin/bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G

export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

@@ -126,20 +215,23 @@
.. note:: The name of the nextflow and singularity modules will be different in
your local environment.

.. warning:: Make sure to copy input data to fast storage, and run the pipeline
on the same fast storage area. You might include these steps in your
bash script. Ask your sysadmin for help if you're not sure what this
means.
.. warning:: Make sure to copy input data to fast storage, and run the
pipeline on the same fast storage area. You might include
these steps in your bash script. Ask your sysadmin for
help if you're not sure what this means.

.. code-block:: console

$ bsub -M 2GB -q short -o output.txt < run_pgscalc.sh

$ sbatch run_pgscalc.sh

This will submit a nextflow driver job, which will submit additional jobs for
each process in the workflow. The nextflow driver requires up to 4GB of RAM
(bsub's ``-M`` parameter) and 2 CPUs to use (see a guide for `HPC users`_ here).
each process in the workflow. The nextflow driver requires up to 4GB of RAM and 2 CPUs (see this guide for `HPC users`_).
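The collapsed part of ``run_pgscalc.sh`` loads the nextflow and singularity modules and launches the workflow. A minimal sketch of that launch command, assuming the custom configuration file described above and placeholder inputs, might be:

.. code-block:: bash

nextflow run pgscatalog/pgsc_calc \
-profile singularity \
-c my_custom.config \
--input samplesheet.csv \
--target_build GRCh38 \
--pgs_id PGS001229

A comma-separated list of accessions can be supplied to ``--pgs_id`` to calculate many scores in one run.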

.. _`LSF and PBS`: https://nextflow.io/docs/latest/executor.html#slurm
.. _`HPC users`: https://www.nextflow.io/blog/2021/5_tips_for_hpc_users.html
.. _`a nextflow profile`: https://github.com/nf-core/configs


Cloud deployments
-----------------

We've deployed the calculator to Google Cloud Batch but some :doc:`special configuration is required<cloud>`.
25 changes: 14 additions & 11 deletions docs/how-to/cache.rst
@@ -1,23 +1,26 @@
.. _cache:

How do I speed up `pgsc_calc` computation times and avoid re-running code?
==========================================================================
How do I speed up computation times and avoid re-running code?
==============================================================

If you intend to run `pgsc_calc` multiple times on the same target samples (e.g.
If you intend to run ``pgsc_calc`` multiple times on the same target samples (e.g.
on different sets of PGS, with different variant matching flags) it is worth caching
information on invariant steps of the pipeline:

- Genotype harmonization (variant relabeling steps)
- Steps of `--run_ancestry` that: match variants between the target and reference panel and
- Steps of ``--run_ancestry`` that match variants between the target and reference panel and
generate PCA loadings that can be used to adjust the PGS for ancestry.

To do this you must specify a directory that can store these information across runs using the
`--genotypes_cache` flag to the nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these steps and proceed to run only the
steps needed to calculate new PGS. This is slightly different than using the `-resume command in
nextflow <https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_ which mainly checks the
`work` directory and is more often used for restarting the pipeline when a specific step has failed
(e.g. for exceeding memory limits).
To do this you must specify a directory that can store this
information across runs using the ``--genotypes_cache`` flag to the
nextflow command (also see :ref:`param ref`). Future runs of the
pipeline that use the same cache directory should then skip these
steps and proceed to run only the steps needed to calculate new PGS.
This is slightly different from using the `-resume command in nextflow
<https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html>`_
which mainly checks the ``work`` directory and is more often used for
restarting the pipeline when a specific step has failed (e.g. for
exceeding memory limits).
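As a sketch, two consecutive runs sharing a cache directory might only differ in the scores requested (paths and accessions below are placeholders):

.. code-block:: console

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS001229 \
--genotypes_cache /path/to/cache

$ nextflow run pgscatalog/pgsc_calc -profile singularity \
--input samplesheet.csv --target_build GRCh38 \
--pgs_id PGS000018 \
--genotypes_cache /path/to/cache

The second run should reuse the relabelled genotypes from the cache and only execute the steps needed to calculate the new PGS.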

.. warning:: Always use a new cache directory for different samplesets, as redundant names may clash across runs.

2 changes: 1 addition & 1 deletion docs/how-to/calculate_custom.rst
@@ -26,7 +26,7 @@ minimal header in the following format:
Header::

#pgs_name=metaGRS_CAD
#pgs_name=metaGRS_CAD
#pgs_id=metaGRS_CAD
#trait_reported=Coronary artery disease
#genome_build=GRCh37

2 changes: 1 addition & 1 deletion docs/how-to/multiple.rst
@@ -133,7 +133,7 @@ Congratulations, you've now calculated multiple scores in parallel!
combine scores in the PGS Catalog with your own custom scores

After the workflow executes successfully, the calculated scores and a summary
report should be available in the ``results/make/`` directory by default. If
report should be available in the ``results/`` directory by default. If
you're interested in more information, see :ref:`interpret`.

If the workflow didn't execute successfully, have a look at the
8 changes: 6 additions & 2 deletions docs/how-to/offline.rst
@@ -127,8 +127,12 @@ panel too. See :ref:`norm`.
Download scoring files
----------------------

It's best to manually download scoring files from the PGS Catalog in the correct
genome build. Using PGS001229 as an example:
.. tip:: Use our CLI application ``pgscatalog-download`` to `download multiple scoring`_ files in parallel and in the correct genome build.

.. _download multiple scoring: https://pygscatalog.readthedocs.io/en/latest/how-to/guides/download.html

You'll need to preload scoring files in the correct genome build.
Using PGS001229 as an example:

https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/
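For example, the harmonized GRCh38 scoring file could be fetched directly. The exact file name below is an assumption based on the FTP layout, so browse the directory above to confirm it first:

.. code-block:: console

$ wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS001229/ScoringFiles/Harmonized/PGS001229_hmPOS_GRCh38.txt.gz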

2 changes: 2 additions & 0 deletions docs/how-to/prepare.rst
@@ -52,6 +52,8 @@ VCF from WGS
See https://github.com/PGScatalog/pgsc_calc/discussions/123 for discussion about tools
to convert the VCF files into ones suitable for calculating PGS.

If you input WGS data to the calculator without following the steps above, you will probably encounter match rate errors. For more information, see :ref:`wgs`.


``plink`` binary fileset (bfile)
--------------------------------