Merge d4904bb into 43e4fda

Clinical-Genomics · Jan 16, 2020 · 3f37dfe
2 parents 43e4fda + d4904bb
commit 3f37dfe

Showing 20 changed files with 1,237 additions and 686 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -9,6 +9,22 @@ services:
 script:
   - py.test -rxs --cov mutacc/ tests/
 
+jobs:
+    include:
+      - name: unit tests
+        script:
+          - py.test -rxs --cov mutacc/ tests/
+        after_success:
+          - coveralls
+      - name: 'integration test'
+        script:
+          - mutacc --demo extract --padding 100
+          - mutacc --demo db import ./mutacc_demo_root/imports/demo_trio_import_mutacc.json
+          - mutacc --demo db export --all-variants --member child --proband --sample-name child
+          - mutacc --demo synthesize --query ./mutacc_demo_root/queries/child_query_mutacc.json
+
+cache: pip
+
 before_install:
   - wget -q http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
   - chmod +x miniconda.sh

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,11 @@ All notable changes to this project will be documented in this file.
 
 ## [Unreleased]
 
+### Added
+- User can specify what meta data to import/export from the vcf in a yaml file
+
+## [1.2.1]
+
 ### Changed
 
 - mutacc dumps files as json files for later import

diff --git a/docs/config.md b/docs/config.md
@@ -0,0 +1,128 @@
+# mutacc configuration
+
+## Vcf parsing
+
+In general, mutacc should allow any amount of meta data to be inserted into the database.
+To extract the relevant meta data from the INFO column in the passed vcf, the user
+must specify which keys should be added to the variant document. This can be specified in
+the config file passed with the `--config-file` option, by adding a `vcf_parser` key to the yaml file.
+Alternatively, this information can be passed in a separate yaml file, using the `--vcf-parser` option.
+
+### Import
+
+To specify what information should be extracted from the INFO column upon importing to the database,
+each relevant key should be given as an element in a list. Below is an example of how
+to extract a single valued field from INFO:
+
+```yaml
+import:
+  - id: <ID> # the ID of the field in the vcf
+    multi_value: <true|false> # Specify if the field contains multiple values
+    out_type: int # What datatype the value should be cast to
+    out_name: <name> # This will be the name for the value in the variant document in the database
+    ...
+```
+
+If there are multiple values under one key in INFO, the key `multi_value` needs to
+be set to **true**, and the user must specify how the values are separated in the vcf
+by adding a `separator`, e.g. `separator: ','`.
+
+If each data value in a multi valued key is given in a specific format, e.g.
+`...,value_1|value_2|value_3,...`, the user can specify the format under the `format` key,
+e.g. `format: 'value_1|value_2|value_3'`, and specify how the data values are separated with
+the `format_separator` key, e.g. `format_separator: '|'`. Furthermore, the user can specify which data
+values should be extracted by using the `target` key. For example,
+```yaml
+...
+target:
+  - value_1
+  - value_2
+...
+```
+would only extract the first and second data values. If `target` is not given,
+all values are extracted. To convert the INFO entry `ANN=a|b|c,d|e|f,g|h|i` in a vcf,
+one could specify this in the yaml with
+
+```yaml
+import:
+  - id: ANN
+    multi_value: true
+    separator: ','
+    format_separator: '|'
+    format: 'value_1|value_2|value_3'
+    out_type: list
+    out_name: annotation
+```
+
+This would give an `annotation` field in the mongodb variant document
+
+```json
+  "annotation": [
+    {"value_1": "a", "value_2": "b", "value_3": "c"},
+    {"value_1": "d", "value_2": "e", "value_3": "f"},
+    {"value_1": "g", "value_2": "h", "value_3": "i"}
+  ]
+```
+
+### Export
+
+When exporting with mutacc, a vcf of all queried variants will be created. Just as
+when importing from a vcf, the user specifies what meta data should be included in the exported vcf.
+This is done using the same principle. However, here the user needs to add the `vcf_type` and
+`description` keys. These will be used in constructing the vcf header, e.g.
+```yaml
+export:
+  - id: value_id # The key name in the variant or case mongodb document
+    vcf_type: Integer # Data type the value should have in vcf, e.g. 'Integer'
+    out_name: VCF_ID # This will be the ID name in the vcf
+    description: "This is a description for VCF_ID" # Description of that ID, added to the vcf header
+```
+
+would add an INFO entry `VCF_ID=<value>`. Here both the variant document and the
+related case document will be searched for the key 'value_id', and its value is added to the INFO
+column if the key is found in either document.
+
+If for example we have a variant document as below
+
+```json
+  "chrom": "1"
+  "start": 12345
+  ...
+  "case": "case_id"
+  ...
+```
+
+and the corresponding case document
+```json
+  "case_id": "case_id"
+  "genes": [
+    {"hgnc_id": "ID1", "gene_name": "GENE1"},
+    {"hgnc_id": "ID2", "gene_name": "GENE2"}
+  ]
+```
+
+This can be exported into the vcf file with the yaml entry
+
+```yaml
+export:
+  - id: genes
+    vcf_type: String
+    out_name: ANN
+    description: "Gene annotation, format: hgnc_id|gene_name"
+    format_separator: "|"
+```
+
+This will give an INFO entry like the one below
+```
+ANN=ID1|GENE1,ID2|GENE2
+```
+
+And a header
+```
+##INFO=<ID=ANN,Number=.,Type=String,Description="Gene annotation, format: hgnc_id|gene_name">
+```
+
+In this way, it is up to the user what meta data is imported and exported in the vcf.
+An example yaml file is found in `mutacc/resources/vcf-info-def.yaml` [here](../mutacc/resources/vcf-info-def.yaml). If nothing
+is given in the configuration file or with the `--vcf-parser` option, this file will also
+be used as the default parser.
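
Note on the behaviour documented in docs/config.md: the import and export rules described there can be illustrated with a small standalone Python sketch. This is not mutacc's implementation. The function names `parse_info_field`, `format_info_entry` and `format_info_header` are hypothetical, the spec dictionaries simply mirror the yaml keys from the docs, and the use of `,` between exported entries and `Number=.` in the header are assumptions taken from the example output shown in the docs.

```python
# Hypothetical helpers illustrating the documented vcf_parser semantics.
# None of these names come from mutacc itself.


def parse_info_field(raw, spec):
    """Convert one raw INFO value according to an 'import' spec entry."""
    if not spec.get("multi_value"):
        # Single valued field: cast to the requested type (assumed set of types).
        casts = {"int": int, "float": float, "str": str}
        return casts[spec.get("out_type", "str")](raw)

    # Multi valued field: split the entries on 'separator'.
    entries = raw.split(spec["separator"])
    if "format" not in spec:
        return entries

    # Each entry follows 'format'; split both on 'format_separator'.
    sep = spec["format_separator"]
    names = spec["format"].split(sep)
    # Without 'target', every named value is kept.
    targets = spec.get("target", names)

    parsed = []
    for entry in entries:
        record = dict(zip(names, entry.split(sep)))
        parsed.append({name: record[name] for name in targets})
    return parsed


def format_info_entry(value, spec):
    """Render a document value as an INFO entry according to an 'export' spec entry."""
    if isinstance(value, list):
        sep = spec.get("format_separator", "|")
        parts = []
        for item in value:
            if isinstance(item, dict):
                # Join the values of each sub-document in insertion order.
                parts.append(sep.join(str(v) for v in item.values()))
            else:
                parts.append(str(item))
        rendered = ",".join(parts)  # assumption: entries are comma separated
    else:
        rendered = str(value)
    return f"{spec['out_name']}={rendered}"


def format_info_header(spec):
    """Build the matching ##INFO header line (Number=. assumed, as in the docs)."""
    return (
        f"##INFO=<ID={spec['out_name']},Number=.,Type={spec['vcf_type']},"
        f'Description="{spec["description"]}">'
    )


if __name__ == "__main__":
    ann_import = {
        "id": "ANN",
        "multi_value": True,
        "separator": ",",
        "format_separator": "|",
        "format": "value_1|value_2|value_3",
        "out_type": "list",
        "out_name": "annotation",
    }
    print(parse_info_field("a|b|c,d|e|f,g|h|i", ann_import))

    ann_export = {
        "id": "genes",
        "vcf_type": "String",
        "out_name": "ANN",
        "description": "Gene annotation, format: hgnc_id|gene_name",
        "format_separator": "|",
    }
    genes = [
        {"hgnc_id": "ID1", "gene_name": "GENE1"},
        {"hgnc_id": "ID2", "gene_name": "GENE2"},
    ]
    print(format_info_entry(genes, ann_export))
    print(format_info_header(ann_export))
```

With the `ANN` import spec from the docs, the sketch yields the same list of three dictionaries shown in the `annotation` example, and with the `genes` export spec it yields the `ANN=ID1|GENE1,ID2|GENE2` entry and the matching header line. Adding a `target` list to the import spec restricts each dictionary to the named values.
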
diff --git a/mutacc/builds/build_case.py b/mutacc/builds/build_case.py
@@ -12,11 +12,15 @@
 
 LOG = logging.getLogger(__name__)
 
+
 class Case(dict):
     """
         Class with methods for handling case objects
     """
-    def __init__(self, input_case, read_dir, padding=None, picard_exe=None, vcf_parse=None):
+
+    def __init__(
+        self, input_case, read_dir, padding=None, picard_exe=None, vcf_parse=None
+    ):
 
         """
             Object is instantiated with a case, a dictionary giving all relevant information about
@@ -35,30 +39,26 @@ def __init__(self, input_case, read_dir, padding=None, picard_exe=None, vcf_pars
         super(Case, self).__init__()
 
         self.input_case = input_case
-        self.case_id = input_case['case']['case_id']
+        self.case_id = input_case["case"]["case_id"]
 
         # Build variants
-        rank_model_version = self.input_case['case'].get('rank_model_version')
-        self['variants'] = self._build_variants(padding=padding,
-                                                rank_model_version=rank_model_version,
-                                                vcf_parse=vcf_parse)
+        self["variants"] = self._build_variants(padding=padding, vcf_parse=vcf_parse)
 
         # Build samples
-        self['samples'] = self._build_samples(read_dir=read_dir,
-                                              padding=padding,
-                                              picard_exe=picard_exe)
+        self["samples"] = self._build_samples(
+            read_dir=read_dir, padding=padding, picard_exe=picard_exe
+        )
         # Build case
-        self['case'] = self.input_case['case']
+        self["case"] = self.input_case["case"]
 
-    def _build_variants(self, padding=None, rank_model_version=None, vcf_parse=None):
+    def _build_variants(self, padding=None, vcf_parse=None):
         """
             Method that parses the vcf in the case dictionary.
 
             Args:
 
                 padding(int): given in bp, extends the region for where to look for reads in the
                               alignment file.
-                rank_model_version(str): The rank_model varsion that has been used
                 vcf_parse(str): path to yaml file with vcf parsing information
 
         """
@@ -68,10 +68,9 @@ def _build_variants(self, padding=None, rank_model_version=None, vcf_parse=None)
 
         variant_objects = []
 
-        for variant_object in get_variants(self.input_case["variants"],
-                                           padding=padding,
-                                           rank_model_version=rank_model_version,
-                                           vcf_parse=vcf_parse):
+        for variant_object in get_variants(
+            self.input_case["variants"], padding=padding, vcf_parse=vcf_parse
+        ):
 
             # Append the variant object to the list
             variant_objects.append(variant_object)
@@ -90,16 +89,18 @@ def _build_samples(self, read_dir, padding=None, picard_exe=None):
                 stored.
         """
 
-        date_str = time.strftime('%Y-%m-%d')
+        date_str = time.strftime("%Y-%m-%d")
         sub_dir = f"{self.input_case['case']['case_id']}/{date_str}"
 
         case_dir = make_dir(read_dir.joinpath(sub_dir))
         sample_objects = []
-        for sample in get_samples(samples=self.input_case["samples"],
-                                  variants=self['variants'],
-                                  padding=padding,
-                                  picard_exe=picard_exe,
-                                  case_dir=case_dir):
+        for sample in get_samples(
+            samples=self.input_case["samples"],
+            variants=self["variants"],
+            padding=padding,
+            picard_exe=picard_exe,
+            case_dir=case_dir,
+        ):
 
             sample_objects.append(sample)