
FINNGEN/pheweb-users-input-validator


PHEWEB USERS INPUTS VALIDATOR

Overview

The PheWeb users input validation tool checks that user-prepared input files are correctly formatted, so that custom GWAS summary statistics can be viewed in a PheWeb-style browser. You need to provide two files: (1) a metadata file in JSON format and (2) a statistics (stats) file. Prepare your files according to the instructions in the FinnGen Analyst Handbook. To find them, open "Working in the Sandbox", then navigate to "Which tools are available" -> "Custom GWAS tools" -> "How to set up a pheweb browser for summary statistics".

There are two modes for scanning your stats file: deep and shallow (set with the parameter "--deep true/false", see Usage below). In deep mode the whole stats file is scanned; in shallow mode ~80k lines are subsampled from the stats file and scanned. Both modes check for the same issues. Recommendation: first run the validator in shallow mode to check that your metadata is correctly formatted and that the basic requirements for the stats file are met. Once that passes, proceed to the deep check of the files, either with fixing enabled straight away (by adding --fix true) or without it. Note that fix mode might take a long time (more than 20 minutes) if your file is large and requires sorting.
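The README does not say how the shallow mode picks its ~80k lines. As an illustration only, the sketch below uses reservoir sampling, which draws a uniform sample in one pass without loading the whole file. The function name `subsample_lines`, the `seed` argument, and the assumption that the first line is a header are all hypothetical and not taken from validator.py.

```python
import random

def subsample_lines(lines, k, seed=0):
    """Return (header, sample): the first line plus up to k data lines.

    Hypothetical helper, not the validator's own code. Uses reservoir
    sampling (Algorithm R) so every data line is equally likely to be kept.
    """
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, None)      # assumption: the first line is a header
    reservoir = []               # holds (original_index, line) pairs
    for n, line in enumerate(it):
        if n < k:
            reservoir.append((n, line))
        else:
            j = rng.randint(0, n)    # uniform over 0..n inclusive
            if j < k:
                reservoir[j] = (n, line)
    # Restore file order so order-dependent checks still make sense.
    reservoir.sort()
    return header, [line for _, line in reservoir]
```

Keeping the sampled lines in their original order matters here, because a check for unsorted positions would otherwise report false problems on a shuffled sample.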

The following scans are performed by the PheWeb users inputs validator tool.

Metadata file:

  1. Check for special characters in metadata.
  2. Check that metadata contains all required fields.
  3. Check that metadata fields have correct format.
  4. Check that metadata field "name" matches stats filename.

Stats file:

  1. Check that file is compressed.
  2. Check that file is tab delimited.
  3. Check that columns order is correct.
  4. Check that file doesn't contain special characters.
  5. Check that chromosome column has correct formatting, i.e. only contains values 1-24, X, Y, M, MT.
  6. Check that columns 7-11 (beta, sebeta, af_alt, af_alt_cases, af_alt_controls) don't contain missing values.
  7. Check that columns 2-11 have correct formatting, i.e. according to the instructions in the FinnGen Analyst Handbook.
  8. Check that stats file doesn't contain unsorted positions.
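Three of the stats file checks above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's implementation: the function names are hypothetical, the compression check assumes gzip-compatible files (identified by the two-byte gzip magic number), and "unsorted" is assumed to mean a position that decreases within a run of rows on the same chromosome.

```python
# Allowed chromosome labels as listed above: 1-24, X, Y, M, MT.
VALID_CHROMS = {str(i) for i in range(1, 25)} | {"X", "Y", "M", "MT"}

def looks_gzipped(path):
    """True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"

def invalid_chromosomes(chroms):
    """Return distinct disallowed chromosome labels, in first-seen order."""
    bad = []
    for chrom in chroms:
        if chrom not in VALID_CHROMS and chrom not in bad:
            bad.append(chrom)
    return bad

def first_unsorted_row(rows):
    """rows: iterable of (chrom, pos) pairs with integer positions.

    Return the index of the first row whose position is smaller than the
    previous row's position on the same chromosome, or None if there is none.
    """
    prev_chrom, prev_pos = None, None
    for i, (chrom, pos) in enumerate(rows):
        if chrom == prev_chrom and pos < prev_pos:
            return i
        prev_chrom, prev_pos = chrom, pos
    return None
```

For example, `invalid_chromosomes(["1", "chr2", "X"])` flags `"chr2"`, which is the kind of value the fix mode described below can repair by removing the prefix.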

In addition, a user can enable automated fixing of issues detected by the validator in the user-specified files by setting the parameter "--fix true" (see the user manual below). The following fixes can be done by the validator when possible:

  • Remove special characters from the metadata file.
  • Remove special characters from the stats file.
  • If stats file is space/comma delimited, it will be fixed to be tab-delimited.
  • If stats file contains missing values in the columns 7-11, they will be substituted with value 0.5.
  • Remove chromosome prefix if the chromosome column contains one, e.g. change "chr1" to "1".
  • Sort stats file if unsorted positions are found.
  • Fix column order/number to contain 11 columns as described in the instructions.
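Two of the fixes listed above are easy to picture on a single row. The sketch below is an assumption-laden illustration rather than the validator's code: `fix_row` and `MISSING` are hypothetical names, the chromosome is assumed to be in column 1, and the set of strings treated as "missing" is a guess. Only the 0.5 substitution in columns 7-11 and the "chr" prefix removal come from the list above.

```python
# Assumption: these strings count as missing; the real tool may differ.
MISSING = {"", "NA", "NaN", "nan", "."}

def fix_row(fields):
    """Return a fixed copy of one stats row given as a list of 11 strings."""
    fixed = list(fields)
    # Strip a leading "chr" from the chromosome (assumed to be column 1).
    if fixed[0].lower().startswith("chr"):
        fixed[0] = fixed[0][3:]
    # Columns 7-11 (list indices 6-10): replace missing values with 0.5.
    for i in range(6, 11):
        if fixed[i] in MISSING:
            fixed[i] = "0.5"
    return fixed

# Placeholder values; columns 2-6 are not interpreted by this sketch.
row = ["chr1", "12345", "A", "G", "0.01", "2.0", "0.1", "", "0.3", "NA", "0.2"]
print("\t".join(fix_row(row)))  # tab-delimited, as the stats file requires
```

Note that substituting 0.5 makes the file loadable but does not recover the true value, so rows that needed this fix are worth reviewing in the lines-with-errors output.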

Examples of what cannot be fixed by the validator tool:

  • Incorrect values in the columns, for instance negative p-values.
  • Name of the stats file specified in the "name" field of the metadata json file.
  • Some special characters that cannot be recognised by the validator.

Resources and expected runtimes

The validator uses the Python multiprocessing package and performs fastest when it is run on a machine with multiple CPUs and ample memory, e.g. 4 CPUs and 16 GB of memory. The recommended version of the tool is Version 1: validator.py. There is also a second version, Version 2: validator_req25GBmem.py, which is faster than the first one but requires more memory.

Expected runtimes:

  • Shallow scan: less than a minute
  • Deep scan: 700MB file is scanned, fixed and written in ~10-20 mins depending on the available resources.

Requirements

  • Python 3
  • Python packages: pandas, pysam, xopen, mgzip

To install python packages, run:

pip install -r requirements.txt

Usage

USAGE: python3 validator.py -m <metadata> -s <stats> -o <outdir> -d <deep_check> -f <fix_issues>

Input arguments:
-m   <metadata>    : path to metadata json file
-s   <stats>       : path to stats file
-o   <outdir>      : path to output directory
-d   <deep_check>  : enable deep scan. Possible values {0, 1, true, false, T, F, True, False}, Default: True.
-f   <fix_issues>  : enable fix of the issues. Possible values {0, 1, true, false, T, F, True, False}, Default: True.

Outputs of the validator.py

Output files generated by the PheWeb users inputs validator tool:

  1. Full report on the results of validator scanning will be saved in file <DIR_OUT>/scan<SCAN_TIMESTAMP>/report.txt.
  2. (Optional) If the fixing mode is activated and some issues were fixed by the validator, a new stats file is written to <DIR_OUT>/scan<SCAN_TIMESTAMP>/<STATS_FILENAME>.
  3. (Optional) Lines from the stats file in which validator was able to detect issues are saved in the file <DIR_OUT>/scan<SCAN_TIMESTAMP>/<STATS_FILENAME>_lines_with_errors.

Example

Report example of the deep scan using example data specified in the Handbook:

bash validator.sh -m metadata.json -s C3_COLORECTAL.gz -o ${DIROUT}/ -d true -f true


================= SCAN STARTED AT: 2022-12-14 20:54:59.218729 =================

[PASS]  Metadata file doesn't contain special characters.
[PASS]  Metadata file contains all required fields.
[PASS]  Metadata fields have correct format.
[PASS]  Metadata field "name" matches with summary stats file name.
[FIXED] Columns order fixed.
[PASS]  Chromosome column is formatted correctly.
[PASS]  File is compressed.
[PASS]  File is tab delimited.
[PASS]  No invalid entries in columns of the stats file were found below.
[PASS]  No missing values found in columns 7-11 of stats file.
[PASS]  No special characters were found in the stats file.
[PASS]  No unsorted positions were found in the data.

================================================================================

FILENAME                    PASSED    FAILED    FIXED     MD5SUM
metadata.json               4         0         0         946f7747b57e8cf31578a3a46cf66abf
C3_COLORECTAL.gz            7         0         1         97d1ec64a91cbc4b5c277f2a2034fa5d
-----------------------------------
Total successful scans:     12 / 12

================================================================================

OUTPUT FILES:
	${DIROUT}/scan05092022T1545/report.log
	${DIROUT}/scan05092022T1545/C3_COLORECTAL.gz

SCAN ENDED AT: 2022-12-14 21:03:00.059127

Read stats file execution time: 2.96 mins
Write stats file execution time: 4.88 mins
Sort stats file 0.0 sec
Total execution time: 8.01 mins
