Commit

Merge d4904bb into 43e4fda
adrosenbaum committed Jan 16, 2020
2 parents 43e4fda + d4904bb commit 3f37dfe
Showing 20 changed files with 1,237 additions and 686 deletions.
16 changes: 16 additions & 0 deletions .travis.yml
@@ -9,6 +9,22 @@ services:
 script:
   - py.test -rxs --cov mutacc/ tests/

+jobs:
+  include:
+    - name: unit tests
+      script:
+        - py.test -rxs --cov mutacc/ tests/
+      after_success:
+        - coveralls
+    - name: 'integration test'
+      script:
+        - mutacc --demo extract --padding 100
+        - mutacc --demo db import ./mutacc_demo_root/imports/demo_trio_import_mutacc.json
+        - mutacc --demo db export --all-variants --member child --proband --sample-name child
+        - mutacc --demo synthesize --query ./mutacc_demo_root/queries/child_query_mutacc.json
+
+cache: pip
+
 before_install:
   - wget -q http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
   - chmod +x miniconda.sh
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,11 @@ All notable changes to this project will be documented in this file.

 ## [Unreleased]

+### Added
+- User can specify what meta data to import/export from the vcf in a yaml file
+
+## [1.2.1]
+
 ### Changed

 - mutacc dumps files as json files for later import
128 changes: 128 additions & 0 deletions docs/config.md
@@ -0,0 +1,128 @@
# mutacc configuration

## Vcf parsing

In general, mutacc allows any amount of metadata to be inserted into the database.
To extract the relevant metadata from the INFO column of the given vcf, the user
must specify which keys should be added to the variant document. This can be specified in
the config file passed with the `--config-file` option, by adding a `vcf_parser` key to the yaml file.
Alternatively, this information can be passed in a separate yaml file, using the `--vcf-parser` option.

### Import

To specify what information should be extracted from the INFO column when importing to the database,
each relevant key should be given as an element in a list. Below is an example of how
to extract a single-valued field from INFO:

```yaml
import:
  - id: <ID> # the ID of the field in the vcf
    multi_value: <true|false> # Specify if the field contains multiple values
    out_type: int # What datatype the value should be cast to
    out_name: <name> # This will be the name for the value in the variant document in the database
  ...
```

If there are multiple values under one key in INFO, the key `multi_value` needs to
be set to **true**, and the user must specify how the values are separated in the vcf
by adding a `separator`, e.g. `separator: ','`.

If each data value in a multi-valued key follows a specific format, e.g.
`...,value_1|value_2|value_3,...`, the user can specify the format under the `format` key,
e.g. `format: 'value_1|value_2|value_3'`, and specify how the data values are separated with
the `format_separator` key, e.g. `format_separator: '|'`. Furthermore, the user can specify which data
values should be extracted by using the `target` key:
```yaml
...
target:
  - value_1
  - value_2
...
```
This would extract only the first and second data values. If `target` is not given,
all values are extracted. For example, to convert the INFO entry `ANN=a|b|c,d|e|f,g|h|i` in a vcf,
one could specify this in the yaml with

```yaml
import:
  - id: ANN
    multi_value: true
    separator: ','
    format_separator: '|'
    format: 'value_1|value_2|value_3'
    out_type: list
    out_name: annotation
```

This would give an `annotation` field in the MongoDB variant document:

```json
"annotation": [
    {"value_1": "a", "value_2": "b", "value_3": "c"},
    {"value_1": "d", "value_2": "e", "value_3": "f"},
    {"value_1": "g", "value_2": "h", "value_3": "i"}
]
```
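
The import rule above can be sketched in Python. This is only an illustration of how `separator`, `format`, `format_separator` and `target` interact, not mutacc's actual parser; the function name `parse_formatted_info` is hypothetical, and the entry is assumed to be available as an already-loaded dict.

```python
# Hypothetical sketch; mutacc's real parser may differ.
def parse_formatted_info(raw, entry):
    """Turn a raw multi-valued INFO string into a list of dicts."""
    names = entry["format"].split(entry["format_separator"])
    # Without a `target` key, every named value is kept.
    target = entry.get("target", names)
    records = []
    for item in raw.split(entry["separator"]):
        values = item.split(entry["format_separator"])
        record = dict(zip(names, values))
        records.append({name: record[name] for name in names if name in target})
    return records


entry = {
    "id": "ANN",
    "multi_value": True,
    "separator": ",",
    "format_separator": "|",
    "format": "value_1|value_2|value_3",
    "out_type": "list",
    "out_name": "annotation",
}
annotation = parse_formatted_info("a|b|c,d|e|f,g|h|i", entry)
print(annotation)
```

Adding a `target` list to the entry restricts each resulting dict to the named values.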

### Export

When exporting with mutacc, a vcf of all queried variants is created. Just as
when importing from a vcf, the user can specify which metadata should be included in the exported vcf.
This is done using the same principle. However, here the user needs to add the `vcf_type` and
`description` keys. These are used in constructing the vcf header, e.g.
```yaml
export:
  - id: value_id # The key name in the variant, or case mongodb document
    vcf_type: Integer # Data type the value should have in vcf, e.g. 'Integer'
    out_name: VCF_ID # This will be the ID name in the vcf
    description: "This is a description for vcf_id" # Description to that ID to be added in vcf header
```

This would add an INFO entry `VCF_ID=<value>`. Here both the variant document and the
related case document are searched for the key `value_id`, and the value is added to the INFO
column if it is found in either document.
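
A minimal Python sketch of this lookup follows. It is not taken from mutacc: the helper name `find_value` and the sample values are hypothetical, and checking the variant document before the case document is an assumption, since the text does not say which one takes precedence.

```python
# Hypothetical sketch; the lookup order is an assumption.
def find_value(key, variant, case):
    """Return the value stored under `key`, or None if neither document has it."""
    for document in (variant, case):  # assumed order: variant first, then case
        if key in document:
            return document[key]
    return None


variant = {"chrom": "1", "start": 12345, "case": "case_id"}
case = {"case_id": "case_id", "value_id": 42}

value = find_value("value_id", variant, case)
# Only emit an INFO entry when the key was found in one of the documents.
info_entry = f"VCF_ID={value}" if value is not None else None
print(info_entry)
```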

If for example we have a variant document as below

```json
"chrom": "1"
"start": 12345
...
"case": "case_id"
...
```

and the corresponding case document
```json
"case_id": "case_id"
"genes": [
    {"hgnc_id": "ID1", "gene_name": "GENE1"},
    {"hgnc_id": "ID2", "gene_name": "GENE2"}
]
```

The `genes` field can be exported to the vcf file with the yaml entry

```yaml
export:
  - id: genes
    vcf_type: String
    out_name: ANN
    description: "Gene annotation, format: hgnc_id|gene_name"
    format_separator: "|"
```

This will give an INFO entry like the one below
```
ANN=ID1|GENE1,ID2|GENE2
```

And a header
```
##INFO=<ID=ANN,Number=.,Type=String,Description="Gene annotation, format: hgnc_id|gene_name">
```
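
The export step above can be sketched in Python as follows. The helper names `format_info_entry` and `format_header_line` are hypothetical and this is not mutacc's own code; joining list elements with a comma and writing `Number=.` are assumptions taken from the example output shown above.

```python
# Hypothetical sketch; mutacc's real export code may differ.
def format_info_entry(value, entry):
    """Render a document value as an `ID=value` INFO entry."""
    sep = entry.get("format_separator", "|")
    if isinstance(value, list):
        # Each element is a dict: join its values with the format separator,
        # then join the elements with commas (assumed from the example output).
        rendered = ",".join(sep.join(str(v) for v in item.values()) for item in value)
    else:
        rendered = str(value)
    return f"{entry['out_name']}={rendered}"


def format_header_line(entry):
    """Render the matching ##INFO header line (Number is assumed to be '.')."""
    return (
        f'##INFO=<ID={entry["out_name"]},Number=.,'
        f'Type={entry["vcf_type"]},Description="{entry["description"]}">'
    )


export_entry = {
    "id": "genes",
    "vcf_type": "String",
    "out_name": "ANN",
    "description": "Gene annotation, format: hgnc_id|gene_name",
    "format_separator": "|",
}
genes = [
    {"hgnc_id": "ID1", "gene_name": "GENE1"},
    {"hgnc_id": "ID2", "gene_name": "GENE2"},
]
print(format_info_entry(genes, export_entry))
print(format_header_line(export_entry))
```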

In this way, the user decides which metadata is imported from and exported to the vcf.
An example yaml file can be found at `mutacc/resources/vcf-info-def.yaml` ([here](../mutacc/resources/vcf-info-def.yaml)). If nothing
is given in the configuration file or with the `--vcf-parse` option, this file is also
used as the default parser.
45 changes: 23 additions & 22 deletions mutacc/builds/build_case.py
@@ -12,11 +12,15 @@

 LOG = logging.getLogger(__name__)

+
 class Case(dict):
     """
         Class with methods for handling case objects
     """
-    def __init__(self, input_case, read_dir, padding=None, picard_exe=None, vcf_parse=None):
+
+    def __init__(
+        self, input_case, read_dir, padding=None, picard_exe=None, vcf_parse=None
+    ):

         """
             Object is instantiated with a case, a dictionary giving all relevant information about
@@ -35,30 +39,26 @@ def __init__(self, input_case, read_dir, padding=None, picard_exe=None, vcf_pars
         super(Case, self).__init__()

         self.input_case = input_case
-        self.case_id = input_case['case']['case_id']
+        self.case_id = input_case["case"]["case_id"]

         # Build variants
-        rank_model_version = self.input_case['case'].get('rank_model_version')
-        self['variants'] = self._build_variants(padding=padding,
-                                                rank_model_version=rank_model_version,
-                                                vcf_parse=vcf_parse)
+        self["variants"] = self._build_variants(padding=padding, vcf_parse=vcf_parse)

         # Build samples
-        self['samples'] = self._build_samples(read_dir=read_dir,
-                                              padding=padding,
-                                              picard_exe=picard_exe)
+        self["samples"] = self._build_samples(
+            read_dir=read_dir, padding=padding, picard_exe=picard_exe
+        )
         # Build case
-        self['case'] = self.input_case['case']
+        self["case"] = self.input_case["case"]

-    def _build_variants(self, padding=None, rank_model_version=None, vcf_parse=None):
+    def _build_variants(self, padding=None, vcf_parse=None):
         """
             Method that parses the vcf in the case dictionary.
             Args:
                 padding(int): given in bp, extends the region for where to look for reads in the
                 alignment file.
-                rank_model_version(str): The rank_model varsion that has been used
                 vcf_parse(str): path to yaml file with vcf parsing information
         """
@@ -68,10 +68,9 @@ def _build_variants(self, padding=None, rank_model_version=None, vcf_parse=None)

         variant_objects = []

-        for variant_object in get_variants(self.input_case["variants"],
-                                           padding=padding,
-                                           rank_model_version=rank_model_version,
-                                           vcf_parse=vcf_parse):
+        for variant_object in get_variants(
+            self.input_case["variants"], padding=padding, vcf_parse=vcf_parse
+        ):

             # Append the variant object to the list
             variant_objects.append(variant_object)
@@ -90,16 +89,18 @@ def _build_samples(self, read_dir, padding=None, picard_exe=None):
                 stored.
         """

-        date_str = time.strftime('%Y-%m-%d')
+        date_str = time.strftime("%Y-%m-%d")
         sub_dir = f"{self.input_case['case']['case_id']}/{date_str}"

         case_dir = make_dir(read_dir.joinpath(sub_dir))
         sample_objects = []
-        for sample in get_samples(samples=self.input_case["samples"],
-                                  variants=self['variants'],
-                                  padding=padding,
-                                  picard_exe=picard_exe,
-                                  case_dir=case_dir):
+        for sample in get_samples(
+            samples=self.input_case["samples"],
+            variants=self["variants"],
+            padding=padding,
+            picard_exe=picard_exe,
+            case_dir=case_dir,
+        ):

             sample_objects.append(sample)
