Building Knowledge Networks

KeywanHP edited this page Sep 29, 2017 · 32 revisions

The KnetMiner web application requires as input a genome-scale knowledge network (GSKN). More background about GSKNs can be found in our paper Hassani-Pak et al. 2016. Here we provide an overview of the steps involved in building GSKNs for new species. GSKNs can be created using either the Ondex Desktop or Command Line Interface (CLI). This guide will use the Ondex CLI to build knowledge networks and Ondex Desktop to inspect them.

Software Requirements

  • Linux server with 16GB RAM or more
  • Java 8
  • Ondex-CLI download
  • Ondex-Desktop download

Data Requirements

  • Maize GFF3
  • Maize peptide FASTA
  • Maize protein domains
  • Maize-Arabidopsis orthologs
  • Maize-UniProt BLAST
  • Arabidopsis knowledge network v.102016 download

Folder structure

The datasets and workflows used here are located in a specific folder structure:

knetminer-data
|
|-- homology (parental folder for all homology related information)
|   |
|   |-- BioMart (homology information from BioMart)
|   |-- Decypher (BLAST or Decypher outputs)
|
|-- ontologies (GO, TO, ec2go etc.)
| 
|-- organisms (organism-specific information)
|   |
|   |-- Maize (subfolders contain data, workflow and network)
|   |   |-- Gene-Protein (maize gff3, fasta)
|   |   |-- Protein-Domain (maize protein-domains in tabular format)
|   |   |-- PubMed (maize publications from PubMed in XML format) 
|   |
|   |-- Arabidopsis ()
|
|-- references (reviewed proteomes from UniProtKB)
|
|-- knets (workflows and GSKNs for KnetMiner)
|   |
|   |-- Maize (Maize GSKN workflow and OXL file)
|   |-- Arabidopsis (Arabidopsis GSKN workflow and OXL file)

Your Organism Data

Ondex contains parsers (or importers) for a range of data formats including FASTA, GFF3, tabular, UniProt-XML, PubMed-XML, PSI-MI, OBO and OWL. The role of an Ondex parser is to transform the raw data into the graph model using the standardized Ondex metadata. Here we describe how to build a core network of genes, proteins and domains for a particular organism. As an example organism, we are going to use Zea mays.

Gene-Protein

Download the GFF3 and protein FASTA files, and unzip them if they are zipped. There is an Ondex parser plugin, called fastagff, that we can use to create a gene-protein network. The parser has the following parameters:

Fastagff parser parameters:

  • GFF3 File: Path to GFF3
  • Fasta File: Path to peptide FASTA
  • Mapping File: Path to a tabular gene-protein id mapping file. Required if the protein ids do not follow the gene_id.x naming pattern
  • TaxId [Int]: Taxonomy ID of your organism
  • Accession [String]: Cross-reference database (xref)
  • DataSource [String]: Data origin (provenance)

We are now going to create an Ondex workflow file (my_workflow.xml) that instructs Ondex-CLI to run the fastagff parser, export the graph to OXL (Ondex Exchange Language), and generate some basic statistics (XML).

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <Parser name="fastagff">
      <Arg name="GFF3 File">knetminer-data/organisms/Maize/Gene-Protein/genes.gff3</Arg>
      <Arg name="Fasta File">knetminer-data/organisms/Maize/Gene-Protein/pep.all.fa</Arg>
      <Arg name="Mapping File">knetminer-data/organisms/Maize/Gene-Protein/mart_export.txt</Arg>
      <Arg name="TaxId">4577</Arg>
      <Arg name="Accession">ENSEMBL</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">knetminer-data/organisms/Maize/Gene-Protein/Gene-Protein.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
    <Export name="graphinfo">
      <Arg name="ExportFile">knetminer-data/organisms/Maize/Gene-Protein/Gene-Protein-Stats.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

To run the workflow, go to the Ondex-CLI root folder and type:

bash runme.sh knetminer-data/organisms/Maize/Gene-Protein/my_workflow.xml

The above command requires the knetminer-data folder to be located within the Ondex-CLI root folder. Once the workflow has completed, it should have created a Gene-Protein.oxl file in the folder specified by the OXL Exporter. You can open this file in Ondex Desktop and use the Ondex Metagraph and Legend (see Figure below) for some useful information. Check: Are the gene and protein numbers the same as in the GFF3 and FASTA files? Are the gene and protein concepts connected via a relation? Search for a few gene names and check that the gene and protein names are correct.

Ondex Metagraph and Legend

You can also check the Gene-Protein-Stats.xml report that was generated by the graphinfo Exporter.
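A quick way to get the reference counts for that check is to count gene features and FASTA records directly from the shell. The sketch below uses a two-gene toy dataset for illustration; point it at the real genes.gff3 and pep.all.fa instead (note that the feature name in GFF3 column 3 may differ between annotation sources):

```shell
# Toy GFF3 (two gene features) and peptide FASTA (two records);
# replace these with the real Maize files to get the actual counts.
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\nchr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=t1\nchr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=g2\n' > genes.gff3
printf '>g1_P01\nMKV\n>g2_P01\nMTT\n' > pep.all.fa

# Genes: rows whose third tab-separated column is exactly "gene".
genes=$(awk -F'\t' '$3 == "gene" {n++} END {print n+0}' genes.gff3)
# Proteins: FASTA header lines.
proteins=$(grep -c '^>' pep.all.fa)
echo "genes: $genes, proteins: $proteins"
```

These two numbers should match the gene and protein concept counts shown in the Ondex Legend.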

Note: If you cannot see relations between genes and proteins, the fastagff parser was not able to establish the link automatically; provide the gene-protein id mapping via the additional "Mapping File" parameter and specify the gene and protein id columns.

If everything looks OK, congratulations, you have your beginner's network of genes connected to the proteins they encode.

Protein-Domains

Download the protein-domain information from BioMart and choose "Zea mays". Click on "Features", unselect everything under "Attributes" and select only "Protein stable ID". Open "Protein Domains" and select "Pfam ID", "InterPro ID", "InterPro short description" and "InterPro description". Under "Filters->Protein Domains" you can select "Limit to genes with Interpro ID(s)".

The downloaded tabular file should look like this:

Protein stable ID Pfam ID Interpro ID Interpro Short Description Interpro Description
GRMZM5G836994_P01 PF01348 IPR002866 Maturase_MatK Maturase MatK
GRMZM5G836994_P01 PF01348 IPR024937 Domain_X Domain X
Zm00001d024998_P093 PF02785 IPR000022 Carboxyl_trans Carboxyl transferase
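Before writing a parser configuration, it is worth checking that every row of the export really has five tab-separated columns; rows with missing descriptions would silently shift fields. A minimal sketch, using a one-row toy file in place of the real mart_export.txt:

```shell
# Toy one-row export; substitute the real mart_export.txt downloaded from BioMart.
printf 'GRMZM5G836994_P01\tPF01348\tIPR002866\tMaturase_MatK\tMaturase MatK\n' > mart_export.txt

# Count rows whose tab-separated field count is not 5.
bad=$(awk -F'\t' 'NF != 5 {n++} END {print n+0}' mart_export.txt)
echo "$bad malformed rows"
```
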

Ondex has a generic parser for tabular files, called tabParser2, which is configured via an XML file. The XML schema can be found here.

The tabParser2 configuration for the above protein-domain table could look like this:

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<start-line>1</start-line>
	
	<concept id = "prot">
		<class>Protein</class>
		<data-source>ENSEMBL</data-source>
		<accession data-source="ENSEMBL">
		       <column index='0' />
		</accession>
	</concept>

	<concept id = "protDomain">
		<class>ProtDomain</class>
		<data-source>ENSEMBL</data-source>
		<name preferred="true">
		        <column index='3' />
		</name>
		<name>
			<column index='1' />
		</name>
		<accession data-source="IPRO">
			<column index='2' />
		</accession>
		<attribute name="Description" type="TEXT"> 
			<column index='4' />
		</attribute>
	</concept>

	<relation source-ref="prot" target-ref="protDomain">
		<type>has_domain</type>
	</relation>

</parser>

We are now going to create a new workflow (my_workflow_2.xml) with instructions to parse the tabular file and export to OXL:

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    <Parser name="tabParser2">
      <Arg name="InputFile">knetminer-data/organisms/Maize/Protein-Domain/mart_export.txt</Arg>
      <Arg name="configFile">knetminer-data/organisms/Maize/Protein-Domain/config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">knetminer-data/organisms/Maize/Protein-Domain/Protein-Domain.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

All we need to do again is to run Ondex-CLI with the above workflow:

bash runme.sh knetminer-data/organisms/Maize/Protein-Domain/my_workflow_2.xml

Orthologs and BLAST

Our next goal is to create networks for public homology information and for our own private BLAST data.

Ensembl Homology

To download the public data, we use Ensembl BioMart and choose "Zea mays".

Click on "Attributes", then on "Homologs"; for now we will get the homologs for A. thaliana. Under "Gene", unselect everything under "Gene Attributes" and select only "Protein stable ID". Then open "Orthologs" and select "Arabidopsis thaliana protein stable ID", "Homology type", "%id. target" and "%id. query".

Then scroll back up and click "Results" in the top left corner. You will see a few example results; click the "Go" button next to "Export all results to" to download all results as a tab-delimited file. The header of that file should look something like this:

Protein stable ID	Arabidopsis thaliana protein or transcript stable ID	Arabidopsis thaliana homology type	%id. target Arabidopsis thaliana gene identical to query gene	%id. query gene identical to target Arabidopsis thaliana gene
GRMZM5G800780_P01	ATCG00720.1	ortholog_one2one	90.0862	97.2093
GRMZM5G862955_P01	ATMG00160.1	ortholog_one2one	47.7024	83.8462
GRMZM5G862955_P02	ATMG00160.1	ortholog_one2one	47.7024	83.8462
GRMZM5G855343_P01	ATCG00820.1	ortholog_one2one	61.9718	47.8261
GRMZM5G801074_P01	ATCG00640.1	ortholog_one2one	74.2424	74.2424
(etc.)

Lines with a Maize protein but no Arabidopsis ortholog should be deleted:

awk -F'\t' '$2 != ""' your_file > cleaned_biomart_export.txt
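A quick illustration of this filter on a two-line toy file (the second row has no Arabidopsis ortholog and is dropped):

```shell
# Toy BioMart export: one complete row, one row without an ortholog.
printf 'GRMZM5G800780_P01\tATCG00720.1\tortholog_one2one\t90.0862\t97.2093\n' > your_file
printf 'GRMZM5G999999_P01\t\t\t\t\n' >> your_file

# Keep only rows where the Arabidopsis protein id (column 2) is non-empty.
awk -F'\t' '$2 != ""' your_file > cleaned_biomart_export.txt
cat cleaned_biomart_export.txt
```
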

The 'tabParser2' configuration for the above table could look like this:

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<start-line>1</start-line>
	
	<concept id="protL">
		<class>Protein</class>
		<data-source>EnsemblCompara</data-source>
		<name preferred="true">
			<column index='0' />
		</name>
		<accession data-source="ENSEMBL">
			<column index='0' />
		</accession>
	</concept>
	
	<concept id="protR">
		<class>Protein</class>
		<data-source>EnsemblCompara</data-source>
		<name preferred="true">
			<column index='1' />
		</name>
		<accession data-source="TAIR">
			<column index='1' />
		</accession>
	</concept>
	
	<relation source-ref="protL" target-ref="protR">
		<type>ortho</type>
		<evidence>EnsemblCompara</evidence>
		<attribute name="ALGORITHM" type="TEXT">Compara-GeneTrees</attribute>
		<attribute name="Homology_type" type="TEXT">
			<column index='2' />
		</attribute>
		<attribute name="%Identity_Arabidopsis" type="NUMBER">
			<column index='3' />
		</attribute>
		<attribute name="%Identity_Maize" type="NUMBER">
			<column index='4' />
		</attribute>
	</relation>
</parser>

You can again construct a workflow similar to the protein-domain workflow and create a network with ortholog relations between Maize and Arabidopsis. We are going to skip this step at this stage and run a larger workflow at the end.

BLAST data

Here we show how to import Decypher BLAST results for Maize versus all reviewed plant proteins in UniProtKB. You can also use "standard" BLAST, as long as you adapt the tabParser2 configuration accordingly.

The Decypher-BLAST data looks like this:

QUERYLOCUS	TARGETLOCUS	SCORE	SIGNIFICANCE	PERCENTALIGNMENT	PERCENTQUERY	PERCENTTARGET	QUERYSTART	QUERYEND	TARGETSTART	TARGETEND	QUERYLENGTH	TARGETLENGTH
AT3G05780.1	P93648	2795.00	0.000000	64	62	60	92	923	69	958	924	964
AT3G05780.1	Q0J032	1106.00	2.2e-144	41	26	27	354	920	314	872	924	884
AT3G05780.1	O04979	1089.00	4.9e-142	41	25	26	354	913	315	869	924	887
AT3G05780.1	P93647	1089.00	4.9e-142	41	25	26	354	920	315	873	924	885
AT2G27490.4	Q94DR2	791.00	1.3e-101	63	62	63	1	227	1	227	232	230
(etc.)

The tabParser2 configuration for this table could look like this:

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<start-line>1</start-line>
	
	<concept id="protL">
		<class>Protein</class>
		<data-source>Decypher</data-source>
		<accession data-source="UNIPROTKB">
			<column index='1' />
		</accession>
	</concept>
	
	<concept id="protR">
		<class>Protein</class>
		<data-source>Decypher</data-source>
		<accession data-source="TAIR">
			<column index='0' />
		</accession>
	</concept>
	
	<relation source-ref="protR" target-ref="protL">
		<type>h_s_s</type>
		<evidence>Decypher-SW</evidence>
		<attribute name="ALGORITHM" type="TEXT">Smith-Waterman</attribute>
		<attribute name="SCORE" type="NUMBER">
			<column index='2' />
		</attribute>
		<attribute name="E-VALUE" type="NUMBER">
			<column index='3' />
		</attribute>
		<attribute name="PERCENTALIGNMENT" type="NUMBER">
			<column index='4' />
		</attribute>
		<attribute name="PERCENTQUERY" type="NUMBER">
			<column index='5' />
		</attribute>
		<attribute name="PERCENTTARGET" type="NUMBER">
			<column index='6' />
		</attribute>
		<attribute name="QUERYSTART" type="NUMBER">
			<column index='7' />
		</attribute>
		<attribute name="QUERYEND" type="NUMBER">
			<column index='8' />
		</attribute>
		<attribute name="TARGETSTART" type="NUMBER">
			<column index='9' />
		</attribute>
		<attribute name="TARGETEND" type="NUMBER">
			<column index='10' />
		</attribute>
		<attribute name="QUERYLENGTH" type="NUMBER">
			<column index='11' />
		</attribute>
		<attribute name="TARGETLENGTH" type="NUMBER">
			<column index='12' />
		</attribute>
	</relation>
</parser>

We have configured it so that all BLAST statistics like score, e-value etc. are added as attributes to the h_s_s (has_similar_sequence) relation in the knowledge graph.

OPTIONAL: You can of course run BLAST against many more databases and add the results here.
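If you use standard NCBI BLAST tabular output (-outfmt 6, whose default 12 columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) instead of Decypher, the configuration has to be adapted to those columns. The fragment below is an illustrative sketch only: the data-source and evidence identifiers (here "BLAST") must exist in your Ondex metadata, and -outfmt 6 files have no header line, hence start-line 0 (assuming the schema accepts it).

<?xml version = "1.0" encoding = "UTF-8" ?>
<parser 
	xmlns = "http://www.ondex.org/xml/schema/tab_parser"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<delimiter>\t</delimiter>
	<quote>"</quote>
	<encoding>UTF-8</encoding>
	<!-- BLAST -outfmt 6 has no header line -->
	<start-line>0</start-line>

	<concept id="protL">
		<class>Protein</class>
		<data-source>BLAST</data-source>
		<accession data-source="UNIPROTKB">
			<column index='1' />
		</accession>
	</concept>

	<concept id="protR">
		<class>Protein</class>
		<data-source>BLAST</data-source>
		<accession data-source="ENSEMBL">
			<column index='0' />
		</accession>
	</concept>

	<relation source-ref="protR" target-ref="protL">
		<type>h_s_s</type>
		<evidence>BLAST</evidence>
		<attribute name="PERCENTALIGNMENT" type="NUMBER">
			<column index='2' />
		</attribute>
		<attribute name="E-VALUE" type="NUMBER">
			<column index='10' />
		</attribute>
		<attribute name="SCORE" type="NUMBER">
			<column index='11' />
		</attribute>
	</relation>
</parser>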

Again we skip the workflow to create a network with BLAST relations at this point and will do this as part of the final integration step below.

Integrating all networks into one massive network

At this stage, we have described how to parse individual FASTA/GFF3 files and various tabular files into knowledge networks. The individual networks contain the following information types:

  • gene-[enc]-protein
  • protein-[has]-domain
  • protein-[ortho]-protein
  • protein-[blast]-protein

We are now going to merge these individual networks into one Maize network and link it to a pre-integrated reference network for Arabidopsis and UniProt Plants.

<?xml version="1.0" encoding="UTF-8"?>
<Ondex version="3.0">
  <Workflow>
    <Graph name="memorygraph">
      <Arg name="GraphName">default</Arg>
      <Arg name="graphId">default</Arg>
    </Graph>
    
    <!-- Gene-Protein -->
    <Parser name="fastagff">
      <Arg name="GFF3 File">knetminer-data/organisms/Maize/Gene-Protein/genes.gff3</Arg>
      <Arg name="Fasta File">knetminer-data/organisms/Maize/Gene-Protein/pep.all.fa</Arg>
      <Arg name="Mapping File">knetminer-data/organisms/Maize/Gene-Protein/mart_export.txt</Arg>
      <Arg name="TaxId">4577</Arg>
      <Arg name="Accession">ENSEMBL</Arg>
      <Arg name="DataSource">ENSEMBL</Arg>
      <Arg name="Column of the genes">0</Arg>
      <Arg name="Column of the proteins">1</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Protein Domain -->
    <Parser name="tabParser2">
      <Arg name="InputFile">knetminer-data/organisms/Maize/Protein-Domain/mart_export.txt</Arg>
      <Arg name="configFile">knetminer-data/organisms/Maize/Protein-Domain/config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>

    <!-- Homology -->
    <Parser name="tabParser2">
      <Arg name="InputFile">knetminer-data/homology/BioMart/Maize_Arabidopsis.txt</Arg>
      <Arg name="configFile">knetminer-data/homology/BioMart/Maize_Arabidopsis_config.xml</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Blast-->
    <Parser name="tabParser2">
	<Arg name="InputFile">knetminer-data/homology/Decypher/Maize_UniProtPlants_Decypher-SW.tab</Arg>
	<Arg name="configFile">knetminer-data/homology/Decypher/config.xml</Arg>
	<Arg name="graphId">default</Arg>   
    </Parser>

    <!-- Arabidopsis knowledge network from Rothamsted -->
    <Parser name="oxl">
      <Arg name="InputFile">knetminer-data/knets/Arabidopsis/ArabidopsisKNET_201610.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Parser>
	
    <!-- Mapping -->
    <Mapping name="lowmemoryaccessionbased">
      <Arg name="IgnoreAmbiguity">false</Arg>
      <Arg name="RelationType">collapse_me</Arg>
      <Arg name="WithinDataSourceMapping">true</Arg>
      <Arg name="graphId">default</Arg>
    </Mapping>
    
    <Transformer name="relationcollapser">
      <Arg name="CloneAttributes">true</Arg>
      <Arg name="CopyTagReferences">true</Arg>
      <Arg name="graphId">default</Arg>
      <Arg name="RelationType">collapse_me</Arg>
    </Transformer>

    <!-- Export knowledge network -->
    <Export name="oxl">
      <Arg name="pretty">true</Arg>
      <Arg name="ExportIsolatedConcepts">true</Arg>
      <Arg name="GZip">true</Arg>
      <Arg name="ExportFile">knetminer-data/knets/Maize/MaizeKNET.oxl</Arg>
      <Arg name="graphId">default</Arg>
    </Export>
  </Workflow>
</Ondex>

NOTE: Make sure the PC or server which runs the workflow has sufficient memory (15-20 GB RAM). The resulting network will be very large, but it can still be opened in Ondex. Ondex will not be able to visualise the entire network, but it can produce some useful information and provides simple search and filter tools for first-pass quality checks of the GSKN before deploying it in KnetMiner for further checks.
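One way to give the Java process enough heap, assuming a standard JVM behind runme.sh, is the JAVA_TOOL_OPTIONS environment variable, which any HotSpot JVM picks up automatically (the workflow path below is hypothetical; use wherever you saved the merged workflow):

```shell
# Request a 20 GB maximum heap for any JVM started from this shell;
# adjust -Xmx to the RAM available on your machine.
export JAVA_TOOL_OPTIONS="-Xmx20g"
echo "JAVA_TOOL_OPTIONS set to: $JAVA_TOOL_OPTIONS"
# then run the workflow as before, e.g.:
# bash runme.sh knetminer-data/knets/Maize/my_workflow.xml
```
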

The final OXL file (in this case knetminer-data/knets/Maize/MaizeKNET.oxl) will be used by the KnetMiner server.
