# Computing basic semantic similarities between GO terms

Adapted from a book chapter written by _Alex Warwick Vesztrocy and Christophe Dessimoz_

In this section we look at how to compute semantic similarity between GO terms. First we need to write a function that calculates the minimum number of branches connecting two GO terms.
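As a rough illustration of what such a function involves, the sketch below counts the fewest branches between two terms on a hypothetical toy ontology, assuming the connecting path must pass through a common ancestor. <code>TOY_PARENTS</code>, <code>ancestor_distances</code>, and <code>min_branch_distance</code> are made-up names for this example; they are not part of goatools, whose own implementation may differ.

```python
from collections import deque

# Hypothetical toy ontology (not real GO data): term -> direct parents.
TOY_PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],
}


def ancestor_distances(term, parents):
    """Map each ancestor of `term` (and `term` itself) to the fewest
    edges needed to reach it by walking upwards."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in dist:  # breadth-first: first visit is shortest
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def min_branch_distance(term1, term2, parents):
    """Fewest edges joining two terms via any common ancestor,
    or None if they share no ancestor."""
    up1 = ancestor_distances(term1, parents)
    up2 = ancestor_distances(term2, parents)
    common = set(up1) & set(up2)
    if not common:
        return None
    return min(up1[t] + up2[t] for t in common)


print(min_branch_distance("A1", "A2", TOY_PARENTS))  # → 2 (via A)
print(min_branch_distance("A1", "AB", TOY_PARENTS))  # → 2 (via A)
print(min_branch_distance("A1", "B", TOY_PARENTS))   # → 3 (via root)
print(min_branch_distance("A1", "A1", TOY_PARENTS))  # → 0
```

A distance of 0 for identical terms matters later: any similarity defined as the reciprocal of this distance is undefined in that case.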

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.insert(0, "..")
from goatools import obo_parser

go = obo_parser.GODag("../go-basic.obo")

../go-basic.obo: fmt(1.2) rel(2018-08-03) 47,289 GO Terms


In [23]:
go_id3 = 'GO:0044707'
go_id4 = 'GO:0044707'
print(go[go_id3])
print(go[go_id4])

GO:0032501	level-01	depth-01	multicellular organismal process [biological_process]
GO:0032501	level-01	depth-01	multicellular organismal process [biological_process]

Note that both variables hold the same ID here, and that <code>GO:0044707</code> is an alternate ID: the DAG returns the primary record, <code>GO:0032501</code>.


Let's get all the annotations for Arabidopsis.

In [2]:
from goatools.associations import read_gaf

associations = read_gaf("gene_association.tair")

  READ      234,052 associations: gene_association.tair


Now we can calculate the semantic similarity, which is based on the semantic distance between the two terms, as follows:

In [26]:
from goatools.semantic import semantic_similarity

sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))

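ZeroDivisionError: float division by zero

The call fails because <code>go_id3</code> and <code>go_id4</code> were both set to the same term above. The semantic distance between a term and itself is 0, and the error suggests that the similarity is computed as the reciprocal of that distance. When the two terms differ (in the run below, <code>go_id3</code> was <code>GO:0048364</code>), the same call succeeds: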

In [15]:
from goatools.semantic import semantic_similarity

sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))

The semantic similarity between terms GO:0048364 and GO:0044707 is 0.2.


Next we can calculate the information content of individual terms, here a series of increasingly general terms from <code>GO:0035185</code> to the root term <code>GO:0008150</code>.

In [11]:
from goatools.semantic import TermCounts, get_info_content

# First get the counts of each GO term.
termcounts = TermCounts(go, associations)

# Calculate the information content of each term, moving from the most
# specific term to the root of the biological_process namespace.
for go_id in ["GO:0035185", "GO:0000278", "GO:0007049", "GO:0009987", "GO:0008150"]:
    infocontent = get_info_content(go_id, termcounts)
    print('Information content ({}) = {}'.format(go_id, infocontent))
    print(go[go_id])


6 Assc. GO IDs not found in the GODag

Information content (GO:0035185) = 0
GO:0035185	level-05	depth-05	preblastoderm mitotic cell cycle [biological_process]
Information content (GO:0000278) = 8.86166725642
GO:0000278	level-03	depth-03	mitotic cell cycle [biological_process]
Information content (GO:0007049) = 7.65644523269
GO:0007049	level-02	depth-02	cell cycle [biological_process]
Information content (GO:0009987) = 3.6980247932
GO:0009987	level-01	depth-01	cellular process [biological_process]
Information content (GO:0008150) = 3.35460956887
GO:0008150	level-00	depth-00	biological_process [biological_process]
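
To make the numbers above easier to interpret, here is a minimal sketch of one common definition of information content, IC(t) = -log p(t), where p(t) is the fraction of annotated genes that carry term t or any of its descendants. The toy ontology, the annotations, and all function names are hypothetical, and the counting scheme is an assumption: goatools' <code>TermCounts</code> may count and normalise differently.

```python
import math

# Hypothetical toy ontology and annotations (not real GO data).
TOY_PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "B1": ["B"]}
TOY_ANNOTATIONS = {
    "gene1": {"A1"},
    "gene2": {"A"},
    "gene3": {"B"},
    "gene4": {"A1", "B"},
}


def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def count_terms(annotations, parents):
    """Count, for each term, the genes annotated to it or to a descendant."""
    counts = {term: 0 for term in parents}
    for terms in annotations.values():
        propagated = set()
        for term in terms:
            propagated |= ancestors_and_self(term, parents)
        for term in propagated:  # each gene counts once per term
            counts[term] += 1
    return counts


def info_content(term, counts, root="root"):
    """IC = -log(count(term) / count(root)); 0.0 for unannotated terms
    (an assumption made here to mirror the output shown above)."""
    if counts[term] == 0:
        return 0.0
    return math.log(counts[root] / counts[term])


counts = count_terms(TOY_ANNOTATIONS, TOY_PARENTS)
for term in ["A1", "A", "root", "B1"]:
    print(term, counts[term], info_content(term, counts))
```

Under this definition, rarely used terms receive a high information content and the root receives 0. A term with no annotations at all has an undefined probability, which this sketch reports as 0.0.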


Resnik's similarity measure is defined as the information content of the most informative common ancestor, that is, the most specific parent term that the two GO terms share. We can calculate it as follows:

In [29]:
from goatools.semantic import resnik_sim

sim_r = resnik_sim(go_id3, go_id4, go, termcounts)
print('Resnik similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_r))

Resnik similarity score (GO:0044707, GO:0044707) = 5.57186709321
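
The definition can be sketched directly on a toy ontology: collect the ancestors the two terms share and take the largest information content among them. The ontology, the IC values in <code>TOY_IC</code>, and the function names are invented for illustration; this is not how <code>resnik_sim</code> is necessarily implemented in goatools.

```python
# Hypothetical toy ontology and made-up information content values.
TOY_PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}
TOY_IC = {"root": 0.0, "A": 0.29, "B": 0.69, "A1": 0.69, "A2": 1.39}


def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(term1, term2, parents, ic):
    """Information content of the most informative common ancestor."""
    common = ancestors_and_self(term1, parents) & ancestors_and_self(term2, parents)
    return max(ic[t] for t in common)


print(resnik("A1", "A2", TOY_PARENTS, TOY_IC))  # → 0.29 (common ancestor A)
print(resnik("A1", "B", TOY_PARENTS, TOY_IC))   # → 0.0 (only the root is shared)
print(resnik("A1", "A1", TOY_PARENTS, TOY_IC))  # → 0.69 (the term itself)
```

Because a term counts as its own ancestor here, comparing a term with itself returns that term's information content.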


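Lin's similarity measure normalises the Resnik score by the information content of the two terms:
$$ \textrm{sim}_{\textrm{Lin}}(t_{1}, t_{2}) = \frac{2 \, \textrm{sim}_{\textrm{Resnik}}(t_1, t_2)}{IC(t_1) + IC(t_2)} $$

With non-negative information content, as in the values printed above, this score lies between 0 and 1 and equals 1 when a term is compared with itself. The output recorded below is -1.0 for identical terms, which suggests that the goatools version used for this run applied the opposite sign.

We can calculate this as follows: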

In [30]:
from goatools.semantic import lin_sim

sim_l = lin_sim(go_id3, go_id4, go, termcounts)
print('Lin similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_l))

Lin similarity score (GO:0044707, GO:0044707) = -1.0
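
As a worked example of the arithmetic, the sketch below evaluates Lin's measure from three numbers, assuming non-negative information content and a positive prefactor of 2, so that scores fall between 0 and 1. The function <code>lin</code> and the input values are hypothetical and independent of goatools.

```python
def lin(ic_term1, ic_term2, ic_common_ancestor):
    """Lin similarity from the information content of the two terms and of
    their most informative common ancestor (the Resnik score).

    Returns None when both terms have zero information content, because
    the ratio is undefined in that case (a choice made for this sketch).
    """
    denominator = ic_term1 + ic_term2
    if denominator == 0:
        return None
    return 2.0 * ic_common_ancestor / denominator


# Two different terms that share a moderately informative ancestor.
print(lin(0.69, 1.39, 0.29))

# A term compared with itself: the common ancestor is the term.
print(lin(0.69, 0.69, 0.69))  # → 1.0

# Terms that share only the root, whose information content is 0.
print(lin(0.69, 1.39, 0.0))   # → 0.0
```

Unlike the Resnik score, which grows with the specificity of the shared ancestor, this ratio is bounded, which makes scores comparable across different parts of the ontology.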
