#SRA XML

A library for building SRA XML.

##Scope

This library smooths the process of writing SRA XML, producing documents that comply with the XML schema. It makes no attempt to validate the content of the documents produced. The initial focus is on writing analysis XML.

##Example

use Bio::SRAXml qw(write_xml_file);

my $analysis_set = Bio::SRAXml::Analysis::AnalysisSet->new();

$analysis_set->add_analysis(
    {
        alias         => 'foo_alias',  # a unique name for this analysis
        analysis_date => { # when was this analysis performed?
            year   => 2011,
            month  => 12,
            day    => 13,
            hour   => 7,
            minute => 8,
            second => 9,
        },
        analysis_type => 'sample_phenotype', # what is the type of the analysis? see documentation below on more complex analysis types
        title         => 'A title',
        description   => 'The description',
        # references can be made to studies, samples, experiments, runs and analyses
        study_refs    => { refname => 'my_study_alias', }, 
        sample_refs   => {
            refname => 'my_sample_alias',
            label   => 'bob',
        },
        experiment_refs => { refname => 'my_experiment_alias' },
        run_refs => [ { refname => 'run1' }, { refname => 'run2' } ],
        #attributes are tags/values/units describing the analysis
        attributes => [
            { tag => 'attr1', value => 'val1' },
            { tag => 'attr2', value => 'val2' },
            { tag => 'attr3', value => '4', units => 'kg' },
        ],
        #links can be made as URLs, entrez queries or IDs, or XRef links
        links => [
            {
                url   => 'http://something.com',
                label => 'url link'
            },
            {
                db    => 'foo',
                id    => 7,
                label => 'entrez link',
            },
            {
                db    => 'canute',
                id    => 107,
                label => 'xref link',
            }
        ],
        #files have names, types and checksums
        files    => [
            {
                filename => 'afile.bam',
                filetype => 'bam',
                checksum => 'abcdefg'
            },
        ],
    }
);

write_xml_file(root_entity => $analysis_set, filename => '/path/to/file');
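
If the analysis date comes from a timestamp rather than being typed by hand, a small helper can build the hash shown above. This is only a sketch: `analysis_date_from_epoch` is a hypothetical function, not part of Bio::SRAXml, and it assumes that UTC is acceptable and that months are numbered 1 to 12, as the `month => 12` in the example suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::SRAXml): convert an epoch timestamp
# into the hash layout used for analysis_date in the example above.
sub analysis_date_from_epoch {
    my ($epoch) = @_;

    # gmtime returns (sec, min, hour, mday, mon, year, ...) in UTC.
    # mon is 0-based and year counts from 1900, so both are adjusted.
    my ( $sec, $min, $hour, $mday, $mon, $year ) = gmtime($epoch);

    return {
        year   => $year + 1900,
        month  => $mon + 1,
        day    => $mday,
        hour   => $hour,
        minute => $min,
        second => $sec,
    };
}

my $date = analysis_date_from_epoch(time);
printf "%04d-%02d-%02d %02d:%02d:%02d\n",
  @{$date}{qw(year month day hour minute second)};
```

The returned hash reference could then be supplied as the `analysis_date` value in the call to `add_analysis`.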

##Analysis types

Each analysis type has different requirements.

###Reference alignment

A reference alignment must specify which assembly was used, and which sequences it refers to. Assemblies can either be standard (refname and accession) or custom (description and url_link).

$analysis_set->add_analysis(
    {
        alias         => 'bar_alias',
        analysis_type => {
            'reference_alignment' => {
                assembly => {
                    description =>
'A custom assembly created with my beautiful assembly pipeline',
                    url_link => {
                        url   => 'ftp://url.for.assembly.fa.gz',
                        label => 'my custom assembly'
                    },
                },
                sequences => [
                    { accession => 'CM000663.1', label => '1' },
                    { accession => 'CM000672.1', label => '10' },
                    { accession => 'CM000673.1', label => '11' },
                    ...
                ]
            }
        },
        ...
    }
);
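
The example above uses a custom assembly. For comparison, here is a sketch of the same structure with a standard assembly (refname and accession). It assumes the keys take the same form inside `reference_alignment` as they do in the sequence variation example in this README, from which the values are borrowed. The alias `baz_alias` is made up for illustration, and only the data structure is built, so nothing here calls the library.

```perl
use strict;
use warnings;

# Plain data structure only; no Bio::SRAXml calls are made in this sketch.
my %analysis = (
    alias         => 'baz_alias',    # hypothetical alias
    analysis_type => {
        reference_alignment => {
            # Standard assembly: identified by refname and accession
            assembly => {
                refname   => 'GRCh37',
                accession => 'GCA_000001405.1',
            },
            # Sequences the alignment refers to
            sequences => [
                { accession => 'CM000663.1', label => '1' },
                { accession => 'CM000672.1', label => '10' },
            ],
        },
    },
);

# With the library loaded, the hash would be passed on as in the examples
# above, along with whatever other fields the analysis needs:
# $analysis_set->add_analysis( \%analysis );

print join( ', ', sort keys %{ $analysis{analysis_type}{reference_alignment} } ), "\n";
```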

###Sequence assembly

$analysis_set->add_analysis(
    {
        alias => 'bar_alias',
        analysis_type => {
            'sequence_assembly' => {
                name     => 'Pony assembly 102',
                partial  => 0,
                coverage => '180',
                program  => 'cortex',
                platform => 'linux'
            }
        },
        ...
    }
);

###Sequence variation

Similar to reference alignment, sequence variation requires an assembly and sequences, plus some additional attributes.

$analysis_set->add_analysis(
    {
        alias         => 'bar_alias',
        analysis_type => {
            'sequence_variation' => {
                experiment_type => 'Whole genome sequencing',
                program         => 'MyVariantCaller',
                platform        => 'Illumina SuperWhizzy Sequencing Machine',
                assembly => {
                    refname   => 'GRCh37',
                    accession => 'GCA_000001405.1',
                },
                sequences => [
                    { accession => 'CM000663.1', label => '1' },
                    { accession => 'CM000672.1', label => '10' },
                    { accession => 'CM000673.1', label => '11' },
                    ...
                ]
            }
        },
        ...
    }
);

###Other types

The following types can be specified with a simple string:

  • sequence_annotation
  • reference_sequence
  • sample_phenotype
  • processed_reads

e.g.

$analysis_set->add_analysis(
    {
        alias         => 'bar_alias',
        analysis_type => 'reference_sequence',
        ...
    }
);

##Dependencies

Dependencies are listed in the cpanfile, which cpanm can install from. From the checkout directory, run `cpanm --installdeps .`

##Contributing

We welcome contributions that extend the SRA objects covered by this library. Please organise code so that each xsd file is supported by one directory.

The roles in lib/Bio/SRAXml/Roles are used for two purposes:

  • to tag classes as supporting some functionality (e.g. Bio::SRAXml::Roles::Link labels the Entrez, URL and XRef link classes)
  • as mixins that provide common attributes (e.g. NameGroup)

The library makes heavy use of Moose type coercions, so that users can input data without knowing the entire class hierarchy. This should be continued as the library grows. See Bio::SRAXml::Types for examples.
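
For contributors new to Moose, the following sketch shows the general coercion technique: a hash, or an array of hashes, is turned into a list of objects when the attribute is set. All `My::Sketch::*` names are hypothetical and are not taken from Bio::SRAXml::Types, so the library's own definitions may well differ. The sketch needs Moose installed and Perl 5.14 or later for the package block syntax.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not the library's own.
package My::Sketch::Ref {
    use Moose;
    has refname => ( is => 'ro', isa => 'Str', required => 1 );
    has label   => ( is => 'ro', isa => 'Str' );
    __PACKAGE__->meta->make_immutable;
}

package My::Sketch::Types {
    use Moose::Util::TypeConstraints;

    # Register the class as a type, then name a list-of-refs type.
    class_type 'My::Sketch::Ref';
    subtype 'My::Sketch::RefList', as 'ArrayRef[My::Sketch::Ref]';

    # Accept one hash or an array of hashes and build objects from them.
    coerce 'My::Sketch::RefList',
      from 'HashRef',
      via { [ My::Sketch::Ref->new($_) ] },
      from 'ArrayRef[HashRef]',
      via { [ map { My::Sketch::Ref->new($_) } @$_ ] };

    no Moose::Util::TypeConstraints;
}

package My::Sketch::Analysis {
    use Moose;
    has run_refs => (
        is      => 'ro',
        isa     => 'My::Sketch::RefList',
        coerce  => 1,                # apply the coercions declared above
        default => sub { [] },
    );
    __PACKAGE__->meta->make_immutable;
}

package main;

# Callers pass plain data; the attribute ends up holding objects.
my $single = My::Sketch::Analysis->new( run_refs => { refname => 'run1' } );
my $many   = My::Sketch::Analysis->new(
    run_refs => [ { refname => 'run1' }, { refname => 'run2' } ] );

printf "single: %d ref(s) of class %s\n",
  scalar @{ $single->run_refs }, ref $single->run_refs->[0];
printf "many:   %d ref(s)\n", scalar @{ $many->run_refs };
```

Keeping the coercions in one types package, as sketched here, means a new class can accept plain hashes simply by declaring its attribute with `coerce => 1`.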

##Funding

The FAANG Data Coordination Centre has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement Nos. 815668, 817923 and 817998, and also from the Biotechnology and Biological Sciences Research Council under Grant Agreement No. BB/N019563/1.