4. Developer Tutorial (Annotators)

kpagel edited this page Mar 12, 2020 · 34 revisions

Writing an Annotator

Method developers and data providers can make their results available by packaging them as an OpenCRAVAT annotator and then publishing them to the OpenCRAVAT Store. Annotators generally operate on either individual variants or genes impacted by variants. OpenCRAVAT annotators generally include a database of pre-computed genome wide annotations to enable fast high-throughput analysis of large variant files. The preferred storage mechanism for annotator reference data is sqlite databases, but other formats can be used.

This page provides a tutorial-style description of writing an annotator from start to finish. For detailed documentation on each component of an annotator, see the Annotator Reference.

Locating The Modules Directory

To begin writing a new annotator, first locate the path to the modules directory using the command oc config md.

$ oc config md
/PythonRoot/lib/site-packages/cravat/modules

The modules directory contains all OpenCRAVAT modules, split into sub-directories by type. Annotator modules will be found in the sub-directory annotators. This is where a developer will create their own annotator.

Developers can change the modules directory by passing the new directory as an argument to oc config md. The command changes the directory that OpenCRAVAT searches in for modules. The command does not move currently installed modules to the new directory, which must be performed manually if desired.

$ oc config md /my/custom/modules/location
/my/custom/modules/location
$ oc config md
/my/custom/modules/location

Starting from the Template

OpenCRAVAT offers a command-line utility to easily create an annotator from a template. For example, a developer writing a new annotator named example_annotator would begin with:

$ oc new annotator example_annotator
annotator example_annotator created at /Python36/Lib/site-packages/cravat/modules/annotators/example_annotator

This will create the following template file structure for an annotator:

example_annotator/
    |───example_annotator.md
    |───example_annotator.yml
    |───example_annotator.py
    └───data/
        └───annotator_template.sqlite

In this tutorial, a variant level annotator will be created from the new-annotator template. For each variant, this annotator will have two (trivial) outputs.

  1. The variant's genomic position along the chromosome, expressed as a fraction in the range [0,1], where 0 is the first position of the chromosome and 1 is the last.
  2. The number of bases in the reference.

The final files can be found at example_annotator.

Running and Debugging

To run only this annotator, use the -a flag of oc run. Assuming an input file called input.txt:

oc run input.txt -a example_annotator

The annotator will generate one output file called input.txt.example_annotator.var with the analysis results. Logging events will be written to input.txt.log if it is run as part of the full pipeline.

Adding the Data

To begin writing an annotator, a developer will need to provide the annotator's data files. These files will be placed in the data subdirectory. example_annotator will use a single sqlite file as its data source, available here.

The database includes one table chrom_length which holds the number of nucleotides in each chromosome. The schema for chrom_length is:

CREATE TABLE 'chrom_length' (
    'chrom' TEXT NOT NULL UNIQUE,
    'len' INTEGER NOT NULL,
    PRIMARY KEY('chrom')
)

The data in chrom_length is as follows, in TSV form.

#chrom  len
1   248956422
2   242193529
3   198295559
...
X   156040895
Y   57227415

The database needs to be placed within the directory example_annotator/data/ to be available for querying/referencing in example_annotator.py.
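For developers who want to build such a database themselves rather than download it, the following is a minimal sketch using Python's standard sqlite3 module. The function name build_chrom_length_db and the scratch directory are assumptions made for illustration, and only the rows listed above are inserted; the elided rows are not filled in.

```python
import os
import sqlite3
import tempfile

# Chromosome lengths copied from the rows listed above (only the rows shown).
CHROM_LENGTHS = [
    ('1', 248956422),
    ('2', 242193529),
    ('3', 198295559),
    ('X', 156040895),
    ('Y', 57227415),
]

def build_chrom_length_db(db_path):
    """Create the chrom_length table at db_path and fill it (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            'CREATE TABLE chrom_length ('
            'chrom TEXT NOT NULL UNIQUE, '
            'len INTEGER NOT NULL, '
            'PRIMARY KEY(chrom))'
        )
        conn.executemany('INSERT INTO chrom_length VALUES (?, ?)', CHROM_LENGTHS)
        conn.commit()
    finally:
        conn.close()

# Written to a scratch directory here; a real annotator would place the file
# in its own data/ directory.
scratch_dir = tempfile.mkdtemp()
db_path = os.path.join(scratch_dir, 'example_annotator.sqlite')
build_chrom_length_db(db_path)
```

Pointing db_path at example_annotator/data/example_annotator.sqlite instead of the scratch directory would put the file where this tutorial expects it.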

Adding the Code

Now that example_annotator has a data source, a developer can begin adding code to the annotator's python module example_annotator.py.

Every annotator's python module performs the same process: First, the annotator accepts input data from OpenCRAVAT describing a gene/variant. Next, the annotator uses this information to search its data source's files for new, annotator-specific information on that variant/gene. Finally, if new information is found, the annotator returns the information and it is included in the final report.

oc new annotator will have automatically created example_annotator.py, which will contain a skeleton structure for an annotator:

from cravat import BaseAnnotator

class CravatAnnotator(BaseAnnotator):

    def setup(self):
        # ... setup code will eventually go here ...
        pass

    def annotate(self, input_data, secondary_data=None):
        out = {}
        # ... annotate code will eventually go here ...
        return out

    def cleanup(self):
        # ... cleanup code will eventually go here ...
        pass

Every annotator's python module will be a sub-class of the base-class BaseAnnotator which implements the methods setup, annotate, and cleanup.

  • setup is used to initialize data sources, including loading information from files and opening database connections.
  • annotate is called by the BaseAnnotator as it loops over the input file, and is where the annotator looks up additional information from the data sources.
  • cleanup is used for any post-processing following all executions of annotate, and closes any file-handlers and/or database connections opened in setup.

More detailed descriptions of the uses of each of these methods can be found in the annotator.py.
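The order in which the three methods are called can be pictured with the toy sketch below. ToyBaseAnnotator is a stand-in written for this illustration only; it is not the real BaseAnnotator and makes no claim about how OpenCRAVAT implements its loop.

```python
class ToyBaseAnnotator:
    """Stand-in for BaseAnnotator; NOT the real OpenCRAVAT class."""

    def run(self, input_lines):
        results = []
        self.setup()                      # once, before any input is read
        try:
            for input_data in input_lines:
                results.append(self.annotate(input_data))  # once per input line
        finally:
            self.cleanup()                # once, after the last line
        return results


class CountingAnnotator(ToyBaseAnnotator):
    def setup(self):
        self.calls = []

    def annotate(self, input_data, secondary_data=None):
        self.calls.append('annotate')
        return {'ref_len': len(input_data['ref_base'].replace('-', ''))}

    def cleanup(self):
        self.calls.append('cleanup')


annotator = CountingAnnotator()
outputs = annotator.run([{'ref_base': 'A'}, {'ref_base': '-'}])
```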

The config file

oc new annotator will also have created a yaml file called example_annotator.yml. This is the config file for the annotator. It tells the OpenCRAVAT system what data to send to the annotator, and what data to expect back. The template yaml file is well commented, and those instructions won't be fully repeated here. The only necessary edits are to the output_columns key. It should be changed to describe the two outputs this annotator is creating. The end result should be as follows:

output_columns:
  - name: start_fraction
    title: Chrom start fraction
    type: float
  - name: ref_len
    title: Reference length
    type: int

The title, description, and developer keys may also be edited, but aren't essential to the annotator's function.

A full list of accepted and required YAML properties can be found at the annotator.yml reference documentation.

setup

OpenCRAVAT will make a sqlite connection to a sqlite file in the data directory that shares the annotator's name. So, because our database is located at example_annotator/data/example_annotator.sqlite, a sqlite3.Connection instance called self.dbconn and a sqlite3.Cursor instance called self.cursor will be automatically opened. This can be verified with the following two assert statements in setup.

def setup(self):
    # Verify the connection and cursor exist
    # (requires "import sqlite3" at the top of the module).
    assert isinstance(self.dbconn, sqlite3.Connection)
    assert isinstance(self.cursor, sqlite3.Cursor)

annotate

The annotate method is called once for each line of the input file. Input files for variant annotators like this one contain 5 data values per line. These values are passed into the annotate method as a positional argument called input_data, which is a dictionary. If the input_format field is omitted or set to crv in the module config file, this dictionary will have five key-value pairs, as shown in the below example of a 1-base deletion:

{
    'uid' : 1, #The internal id of this input line. Seldom used.
    'chrom' : 'chr10', # The chromosome name
    'pos' : 87864486, # The genomic position of the first affected nucleotide
    'ref_base' : 'A', # The reference bases
    'alt_base' : '-', # The alternate bases
}

pos is in the GRCh38 coordinate system. If the original input is in hg19, its positions will be internally converted to hg38 with liftOver before being fed to each annotator's annotate method.

If input_format in the module config file is set to crx, the input_data dictionary will have five additional key-value pairs. An example is shown below.

{
    'uid' : 1, #The internal id of this input line. Seldom used.
    'chrom' : 'chr10', # The chromosome name
    'pos' : 87864486, # The genomic position of the first affected nucleotide
    'ref_base' : 'A', # The reference bases
    'alt_base' : '-', # The alternate bases
    'coding': 'Y', # 'Y' for a coding variant
    'hugo': 'PTEN', # HUGO symbol of the gene where the variant is
    'transcript': 'ENST00000371953.7', # Transcript with the most severe impact based on variant sequence ontology
    'so': 'FD1', # Most severe sequence ontology of the impact by the variant
    'all_mappings': '{"KLLN":[["B2CW77",null,"2KU","ENST00000445946.3",null]],"PTEN":[["P60484","_6_","FD1","ENST00000371953.7","A17-"],["A0A087WT17",null,"UT5","ENST00000610634.1",null]]}' # Variant impact on all transcripts which encompass the variant's position
}
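The all_mappings value in the crx example is a JSON string, so it can be decoded with the standard json module. The sketch below is illustrative only: the variable names are made up, and reading index 3 of each entry as the transcript ID is an assumption inferred from the example (it matches the 'transcript' value), not a documented field layout.

```python
import json

# The all_mappings string from the crx example above.
all_mappings = (
    '{"KLLN":[["B2CW77",null,"2KU","ENST00000445946.3",null]],'
    '"PTEN":[["P60484","_6_","FD1","ENST00000371953.7","A17-"],'
    '["A0A087WT17",null,"UT5","ENST00000610634.1",null]]}'
)

# Decode into a dict keyed by gene symbol; JSON null becomes None.
mappings = json.loads(all_mappings)
genes = sorted(mappings)

# Assumption: index 3 of each entry is the transcript ID.
transcripts_by_gene = {
    gene: [entry[3] for entry in entries]
    for gene, entries in mappings.items()
}
```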

In a gene level annotator ('level' field is set to gene in the module config file), input_data will have four key-value pairs, as shown in the below example (for all config options, see annotator reference).

{
    'hugo': 'PTEN', # HUGO symbol of the gene where the variant is
    'num_variants': 2, # Number of input variants for the gene in `hugo` key
    'so': 'FD1', # Most severe sequence ontology of the impact by the variants in the gene
    'all_mappings': 'FD1(1),FD2(1)' # Variant impact sequence ontology with the number of input variants which produce the same variant impact sequence ontology on all transcripts which encompass the variant's position
}
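At the gene level, all_mappings is a plain comma-separated string rather than JSON. A minimal sketch of splitting it into per-ontology counts follows; parse_gene_mappings is a hypothetical helper, and the "NAME(count)" pattern is assumed from the single example above.

```python
import re

def parse_gene_mappings(all_mappings):
    """Turn 'FD1(1),FD2(1)' into {'FD1': 1, 'FD2': 1} (hypothetical helper)."""
    counts = {}
    # Each item is assumed to look like NAME(count).
    for so, n in re.findall(r'([^,()]+)\((\d+)\)', all_mappings):
        counts[so.strip()] = int(n)
    return counts

so_counts = parse_gene_mappings('FD1(1),FD2(1)')
```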

Initially, the annotate method simply creates an empty dictionary called out and returns it. We want to compute the two return values and add them to out using the right keys. The keys of the returned dictionary must exactly match the output column names defined in the yml file. Omitted names will be filled with blank values, and extra names will be ignored. The annotate method should return a dictionary of the following form (the values shown are illustrative and do not correspond to the example input above):

{
    'start_fraction' : 0.4508156250735319,
    'ref_len' : 3
}
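The rule about omitted and extra keys can be pictured with the small sketch below. It is not OpenCRAVAT's own code: align_to_columns is a hypothetical helper, and representing a blank value as None is an assumption.

```python
# Column names as declared under output_columns in the yml file.
OUTPUT_COLUMNS = ['start_fraction', 'ref_len']

def align_to_columns(out, columns=OUTPUT_COLUMNS):
    """Keep only declared columns; fill missing ones with None (assumed blank)."""
    return {name: out.get(name) for name in columns}

# 'start_fraction' is omitted and 'unexpected' is not a declared column.
aligned = align_to_columns({'ref_len': 3, 'unexpected': 'x'})
```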

To compute and add the return values, add the following code to the annotate method.

start_fraction

We need to fetch the chromosome length from the database, then use that and input_data['pos'] to calculate start_fraction.

###### Get start_fraction
# input_data['chrom'] is formatted as chr1, chr2, chrX etc. The chrom
# column of the database omits the 'chr'. Need to convert formats.
chrom = input_data['chrom'].replace('chr','')
# Construct the query with a placeholder so the value is passed as a parameter
length_query = 'select len from chrom_length where chrom=?'
# Execute the query and store the result, if it exists.
self.cursor.execute(length_query, (chrom,))
length_result = self.cursor.fetchone()
if length_result is not None:
    chrom_length = length_result[0]
    start_fraction = input_data['pos']/chrom_length
else:
    start_fraction = None
out['start_fraction'] = start_fraction

ref_len

Calculating ref_len is easy. Simply get the length of input_data['ref_base']. However, in an insertion, the ref_base will be '-', so replace those with an empty string before calculating length.

##### Get ref_len
cleaned_ref_base = input_data['ref_base'].replace('-','')
out['ref_len'] = len(cleaned_ref_base)

Now out will contain both values and can be returned.
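To try the logic outside of OpenCRAVAT, the two steps can be put together as a standalone function run against an in-memory database. This is a sketch under stated assumptions: the free function annotate(cursor, input_data) stands in for the method, the in-memory table holds only chromosome 1, and the positions are made up for the example.

```python
import sqlite3

def annotate(cursor, input_data):
    """Standalone version of the annotate logic (the cursor is passed in explicitly)."""
    out = {}
    # The database stores chromosome names without the 'chr' prefix.
    chrom = input_data['chrom'].replace('chr', '')
    cursor.execute('select len from chrom_length where chrom=?', (chrom,))
    row = cursor.fetchone()
    out['start_fraction'] = input_data['pos'] / row[0] if row is not None else None
    # '-' marks an empty reference allele, so drop it before measuring length.
    out['ref_len'] = len(input_data['ref_base'].replace('-', ''))
    return out

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table chrom_length (chrom text primary key, len integer not null)')
cursor.execute('insert into chrom_length values (?, ?)', ('1', 248956422))

# A made-up variant halfway along chromosome 1, and one on an unknown chromosome.
known = annotate(cursor, {'chrom': 'chr1', 'pos': 124478211, 'ref_base': 'ACG', 'alt_base': '-'})
unknown = annotate(cursor, {'chrom': 'chrUn', 'pos': 10, 'ref_base': '-', 'alt_base': 'T'})
conn.close()
```

A chromosome missing from the table yields None for start_fraction, which leaves that column blank instead of raising an error.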

cleanup

The last method in example_annotator.py is the cleanup method. This method performs post-processing steps, such as shutting down database connections or file-handlers. In this example there is no need to close the connection to example_annotator.sqlite. In the same way OpenCRAVAT automatically opens a connection to a module-named database, it will also automatically close it. The cleanup method for example_annotator.py is simply:

def cleanup(self):
    pass

With this last step, example_annotator is now a fully functioning annotator, ready to integrate into OpenCRAVAT.

Debugging and logging

OpenCRAVAT runs annotators by importing the CravatAnnotator class from the file, and calling the run method which was inherited from BaseAnnotator. You can use visual debuggers to debug annotators by calling runcravat.py in the open-cravat pip package, with the arguments you would normally use with command line oc run.

Generate a test input file in your current directory with oc new example-input .

Messages can be written to the log file by using self.logger within the annotator.

def annotate(self, input_data, secondary_data=None):
    #### code
    self.logger.info(str(input_data))
    #### code

More information about the logging utility is available here.
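The info call used above matches the interface of Python's standard logging.Logger. Assuming self.logger behaves like one, the sketch below shows level filtering and message formatting with a plain logger that writes to an in-memory buffer. The logger name and format string are made up for the example and are not OpenCRAVAT settings.

```python
import io
import logging

# Capture log output in memory so the example has no side effects.
log_stream = io.StringIO()
logger = logging.getLogger('example_annotator_sketch')
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
logger.addHandler(handler)

input_data = {'uid': 1, 'chrom': 'chr10', 'pos': 87864486}
# Lazy %-style arguments avoid building the string when the level is disabled.
logger.info('annotating uid=%s at %s:%s',
            input_data['uid'], input_data['chrom'], input_data['pos'])
logger.debug('not written because the level is INFO')
logged = log_stream.getvalue()
```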

Writing Tests

Tests must be included in every annotator to ensure that it is running correctly. Creating a test requires two files, an input and a key. OpenCRAVAT will test your annotator by running the OpenCRAVAT pipeline using input as the input, then generating a text report and comparing it to key.

To create a test, choose an input file that gives good coverage of your annotator. Run it through OpenCRAVAT with -a example_annotator -t text to run just your annotator and get a text report. Examine the resulting .tsv format text report and ensure that your annotator ran properly. Then create a subdirectory called test in your annotator's directory and copy the input file in as input and the text report in as key. Then run the OpenCRAVAT testing program to ensure that your test passes.

oc util test -a example_annotator
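Before saving a report as the key, it can help to compare two reports by hand. The sketch below shows one way to diff two tab-separated texts row by row; diff_reports is a hypothetical helper and does not reproduce how oc util test performs its comparison.

```python
import csv
import io

def diff_reports(result_text, key_text):
    """Return a list of (line number, expected row, actual row) mismatches."""
    result_rows = list(csv.reader(io.StringIO(result_text), delimiter='\t'))
    key_rows = list(csv.reader(io.StringIO(key_text), delimiter='\t'))
    mismatches = []
    for line_no, (got, expected) in enumerate(zip(result_rows, key_rows), start=1):
        if got != expected:
            mismatches.append((line_no, expected, got))
    # zip stops at the shorter input, so report a length difference separately.
    if len(result_rows) != len(key_rows):
        mismatches.append(('row count', len(key_rows), len(result_rows)))
    return mismatches

# Made-up report contents for illustration.
key = 'uid\tref_len\n1\t3\n2\t0\n'
same = diff_reports(key, key)
changed = diff_reports('uid\tref_len\n1\t3\n2\t1\n', key)
```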

Writing a Web Result Viewer Widget

To make a web result viewer widget, two kinds of files are required: .js and .yml.

The naming convention for web result viewer widgets is to add the prefix wg to the name of the annotator whose data the widget uses. Thus, the recommended name of a widget for the clinvar annotator is wgclinvar. If that name has already been claimed, you can use another name, but keep the wg prefix.

The .yml file is a module config file. See the below example (wgclinvar.yml).

title: ClinVar
version: 1.0.0
type: webviewerwidget
required_annotator: clinvar
description: Clinvar webviewer widget
developer:
  name: 'Rick Kim'
  organization: 'In Silico Solutions'
  email: 'rkim@insilico.us.com'
  website: 'http://www.insilico.us.com'
  citation: ''

All keys except required_annotator are the same as in the module config file for an annotator. required_annotator specifies which annotator's result is needed to feed the data to this widget.

When the web result viewer draws each widget, it passes three parameters to the widget:

  • div: div element which wraps the widget content
  • row: array of annotation values for the clicked/selected row in the variant- or gene-level result table
  • tabName: variant or gene, showing which level result is being passed as row

See the below example code from wgclinvar.js.

widgetGenerators['clinvar'] = {
	'variant': {
		'width': 280, 
		'height': 80, 
		'function': function (div, row, tabName) {
			addInfoLine(div, row, 'Significance', 'clinvar__sig', tabName);
			addInfoLine(div, row, 'Diseases', 'clinvar__diseases', tabName);
			addInfoLine(div, row, 'Ref. Nums', 'clinvar__refs', tabName);
		}
	}
}

widgetGenerators['widget_name'] should be used to wrap the widget construction code. If the name of a widget is sample_widget, the code should have widgetGenerators['sample_widget']. The javascript code of all widgets will be stored in the widgetGenerators object, with each widget's name as a key. widgetGenerators['sample_widget'] has annotation levels as its first-level keys. In the above example, it has only the variant key. This means that only the variant tab's detail panel will call the function entry under the variant key of widgetGenerators['clinvar'].

For each level, three key-value pairs are required: width, height, and function. For width, use 280 as the smallest and an increment of 300. For height, use 80 as the smallest and an increment of 100. These two will define the size of the widget.

function builds the content of the widget. addInfoLine is a convenience function you can use. It adds a div showing the value of a column from the given row, with a title of your choice. The syntax of this function is addInfoLine(div, row, title, db_column_name, tabName); pass div, row, and tabName through unchanged and set title and db_column_name as needed. db_column_name is the name of a column in the result sqlite3 file's variant or gene table, which is specified by the tabName variable. The convention for db_column_name is an annotator name + '__' + column name defined in the annotator's yml file.

If an external script/library is needed, use $.getScript jQuery function, as shown in the below example. Put this function at the top of the code.

$.getScript('/result/widgetfile/wgndex/cytoscape.js', function () {});
