# Semantic Similarity with Python
## SSMPY Library

Sources:

* code: https://github.com/lasigeBioTM/DiShIn
* doc: https://dishin.readthedocs.io/en/latest/other_examples.html
* paper: https://www.researchgate.net/publication/323219905_Semantic_Similarity_Definition

In [4]:
import ssmpy
import pandas as pd

In [5]:
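# ssmpy configuration
ssmpy.mica = True  # True: use the most informative common ancestor (MICA); False: use disjunctive common ancestors (DCA)
ssmpy.intrinsic = False  # False: extrinsic IC (from annotation frequencies); True: intrinsic IC (from the ontology structure)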

### Download GO and annotations

In [6]:
%%bash

#wget http://purl.obolibrary.org/obo/go.owl
#wget http://geneontology.org/gene-associations/goa_uniprot_all_noiea.gaf.gz
#gunzip goa_uniprot_all_noiea.gaf.gz

In [7]:
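# create the semantic base (go.db) from the ontology and the annotation file
ssmpy.create_semantic_base(
    "go.owl",                                           # ontology file
    "go.db",                                            # output database
    "http://purl.obolibrary.org/obo/",                  # prefix of the term IRIs
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",  # relation used to build the hierarchy
    "goa_uniprot_all_noiea.gaf",                        # annotations used for the extrinsic IC
)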

loading the ontology go.owl
calculating transitive closure at distance: 1
calculating transitive closure at distance: 2
calculating transitive closure at distance: 3
calculating transitive closure at distance: 4
calculating transitive closure at distance: 5
calculating transitive closure at distance: 6
calculating transitive closure at distance: 7
calculating transitive closure at distance: 8
calculating transitive closure at distance: 9
calculating transitive closure at distance: 10
calculating transitive closure at distance: 11
calculating transitive closure at distance: 12
calculating transitive closure at distance: 13
calculating transitive closure at distance: 14
calculating transitive closure at distance: 15
calculating transitive closure at distance: 16
calculating the frequency from file goa_uniprot_all_noiea.gaf
calculating the descendents
calculating the hierarchical frequency
the end
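
The log above shows the transitive closure being built one distance at a time. As a rough illustration of that idea (not ssmpy's actual implementation), the sketch below does the same on a toy hierarchy in pure Python. The function `transitive_closure` and the term names are hypothetical.

```python
def transitive_closure(parents):
    """Return {(term, ancestor): shortest distance} for a child -> parents mapping."""
    # Distance 1: the direct subclass edges.
    closure = {(c, p): 1 for c, ps in parents.items() for p in ps}
    frontier = dict(closure)
    distance = 1
    while frontier:
        distance += 1
        new_pairs = {}
        # Extend every pair found in the previous round by one more edge.
        for term, ancestor in frontier:
            for grand in parents.get(ancestor, ()):
                pair = (term, grand)
                if pair not in closure and pair not in new_pairs:
                    new_pairs[pair] = distance
        closure.update(new_pairs)
        frontier = new_pairs
    return closure


# Toy chain: leaf -> mid -> top -> root
toy = {"leaf": ["mid"], "mid": ["top"], "top": ["root"]}
closure = transitive_closure(toy)
print(closure[("leaf", "root")])  # 3
```

The loop stops when a round adds no new pairs, which is why the number of rounds is bounded by the depth of the hierarchy.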


### Playing with SSMPY

In [8]:
ssmpy.semantic_base("go.db")  # load the semantic base created above
e1 = ssmpy.get_id("GO_0000023")
e2 = ssmpy.get_id("GO_0000025")
ssmpy.ssm_resnik(e1, e2)  # Resnik similarity between the two terms

4.315813746201754

In [9]:
ssmpy.semantic_base("go.db")
e1 = ssmpy.get_id("GO_0000023")
e2 = ssmpy.get_id("GO_0000023")
ssmpy.ssm_resnik(e1,e2)

10.575802576015931
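
The two results above are consistent with the usual definition of Resnik similarity: the information content (IC) of the most informative common ancestor, so that comparing a term with itself gives the IC of that term. A minimal sketch on a toy hierarchy follows; the term names and IC values are made up, and `ancestors` and `resnik` are hypothetical helpers, not ssmpy functions.

```python
def ancestors(term, parents):
    """All ancestors of a term, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def resnik(t1, t2, parents, ic):
    """IC of the most informative ancestor shared by t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic[c] for c in common), default=0.0)


toy_parents = {"a": ["mid"], "b": ["mid"], "mid": ["root"], "other": ["root"]}
toy_ic = {"root": 0.0, "mid": 1.5, "a": 3.0, "b": 2.5, "other": 2.0}

print(resnik("a", "b", toy_parents, toy_ic))      # 1.5 (shared ancestor "mid")
print(resnik("a", "a", toy_parents, toy_ic))      # 3.0 (IC of "a" itself)
print(resnik("a", "other", toy_parents, toy_ic))  # 0.0 (only the root is shared)
```

The last case shows why a comparison in which the root is the only shared ancestor scores 0: the root has an IC of 0.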

In [10]:
# read input data
path = "/home/nur/workspace/duchenne-paper-analyses/semantic-similarity/"
in_f_name = "termSummary10-GOBP-MaxSize5000-Summary.csv"
in_f = path + in_f_name
data = pd.read_csv(in_f)
print(data.shape)
data.head()

(155, 11)


Unnamed: 0,Representing term id,Representing term name,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO:0006952,defense response,1823,1,184,1,7,1,9,2.0,5.0
1,GO:0007165,signal transduction,6290,1,1,100,1,91,578,,
2,GO:0006409,tRNA export from nucleus,34,1,9,355,211,2,1,,
3,GO:0009605,response to external stimulus,2995,1,117,6,19,354,114,1.0,1.0
4,GO:0006955,immune response,2297,2,300,2,128,158,126,3.0,9.0


In [11]:
# root term of each GO branch, used as a reference term for the comparisons
go_roots = {'BP': 'GO_0008150',  # biological_process
            'CC': 'GO_0005575',  # cellular_component
            'MF': 'GO_0003674'}  # molecular_function

In [12]:
def ss(go1, go2):
    """Resnik similarity between two GO terms given as 'GO_xxxxxxx' names."""
    e1 = ssmpy.get_id(go1)
    e2 = ssmpy.get_id(go2)
    return ssmpy.ssm_resnik(e1, e2)

ss("GO_0000023","GO_0000025")

4.315813746201754

In [13]:
prova = data.copy()
# the semantic base names terms as 'GO_xxxxxxx', so replace the colon in 'GO:xxxxxxx'
prova['Representing term id'] = prova['Representing term id'].str.replace(':', '_', regex=False)
prova.head(2)

Unnamed: 0,Representing term id,Representing term name,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO_0006952,defense response,1823,1,184,1,7,1,9,2.0,5.0
1,GO_0007165,signal transduction,6290,1,1,100,1,91,578,,
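
The conversion above assumes that every identifier is well formed. A slightly stricter variant could validate the identifier while converting it. This is a minimal sketch; `to_ssmpy_name` is a hypothetical helper, and the seven-digit pattern is the usual GO identifier format.

```python
import re

GO_ID = re.compile(r"^GO[:_](\d{7})$")


def to_ssmpy_name(go_id):
    """Convert 'GO:0006952' (or 'GO_0006952') to 'GO_0006952'; raise on anything else."""
    match = GO_ID.match(go_id.strip())
    if match is None:
        raise ValueError(f"not a GO identifier: {go_id!r}")
    return "GO_" + match.group(1)


print(to_ssmpy_name("GO:0006952"))  # GO_0006952
```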


In [14]:
# placeholder columns (one per GO root); the similarity scores are filled in later
prova['ss_wrt_bp'] = 'GO_0008150'
prova['ss_wrt_cc'] = 'GO_0005575'
prova['ss_wrt_mf'] = 'GO_0003674'
prova.head(2)

Unnamed: 0,Representing term id,Representing term name,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank,ss_wrt_bp,ss_wrt_cc,ss_wrt_mf
0,GO_0006952,defense response,1823,1,184,1,7,1,9,2.0,5.0,GO_0008150,GO_0005575,GO_0003674
1,GO_0007165,signal transduction,6290,1,1,100,1,91,578,,,GO_0008150,GO_0005575,GO_0003674


In [15]:
prova = prova[['Representing term id', 'Representing term name', 'ss_wrt_bp', 'ss_wrt_cc', 'ss_wrt_mf',
       'Representing term size', 'Representing term rank',
       'Represented term number', 'Eleni-GOBP.csv term rank',
       'Freddie-GOBP.txt term rank', 'Nazli-GOBP.txt term rank',
       'MOGAMUN-GOBP.csv term rank', 'pathfindR-GOBP.csv term rank',
       'EnrichNet-GOBP.csv term rank']]
prova.head(2)

Unnamed: 0,Representing term id,Representing term name,ss_wrt_bp,ss_wrt_cc,ss_wrt_mf,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO_0006952,defense response,GO_0008150,GO_0005575,GO_0003674,1823,1,184,1,7,1,9,2.0,5.0
1,GO_0007165,signal transduction,GO_0008150,GO_0005575,GO_0003674,6290,1,1,100,1,91,578,,


In [17]:
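# NOTE: assigning a scalar to prova[col] overwrites the whole column on every
# iteration, so each row ends up with the value computed for the last row.
# Per-row assignment needs prova.at[idx, col].
for idx, row in prova.iterrows():
    term = row['Representing term id']
    prova['ss_wrt_bp'] = ss(term, go_roots['BP'])
    prova['ss_wrt_cc'] = ss(term, go_roots['CC'])
    prova['ss_wrt_mf'] = ss(term, go_roots['MF'])
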
prova

Unnamed: 0,Representing term id,Representing term name,ss_wrt_bp,ss_wrt_cc,ss_wrt_mf,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO_0006952,defense response,0.0,0,0,1823,1,184,1,7,1,9,2,5
1,GO_0007165,signal transduction,0.0,0,0,6290,1,1,100,1,91,578,,
2,GO_0006409,tRNA export from nucleus,0.0,0,0,34,1,9,355,211,2,1,,
3,GO_0009605,response to external stimulus,0.0,0,0,2995,1,117,6,19,354,114,1,1
4,GO_0006955,immune response,0.0,0,0,2297,2,300,2,128,158,126,3,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150,GO_0030431,sleep,0.0,0,0,33,1664,1,,1664,,,,
151,GO_0035272,exocrine system development,0.0,0,0,49,1737,1,,1737,,,,
152,GO_0006836,neurotransmitter transport,0.0,0,0,217,1850,1,,1850,,,,
153,GO_0071695,anatomical structure maturation,0.0,0,0,240,1868,1,,1868,,,,


In [33]:
# write one similarity value per row with .at
for idx, row in prova.iterrows():
    term = row['Representing term id']
    prova.at[idx, 'ss_wrt_bp'] = ss(term, go_roots['BP'])
    prova.at[idx, 'ss_wrt_cc'] = ss(term, go_roots['CC'])
    prova.at[idx, 'ss_wrt_mf'] = ss(term, go_roots['MF'])

prova

Unnamed: 0,Representing term id,Representing term name,ss_wrt_bp,ss_wrt_cc,ss_wrt_mf,Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO_0006952,defense response,0.0,0,0,1823,1,184,1,7,1,9,2,5
1,GO_0007165,signal transduction,0.0,0,0,6290,1,1,100,1,91,578,,
2,GO_0006409,tRNA export from nucleus,0.0,0,0,34,1,9,355,211,2,1,,
3,GO_0009605,response to external stimulus,0.0,0,0,2995,1,117,6,19,354,114,1,1
4,GO_0006955,immune response,0.0,0,0,2297,2,300,2,128,158,126,3,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150,GO_0030431,sleep,0.0,0,0,33,1664,1,,1664,,,,
151,GO_0035272,exocrine system development,0.0,0,0,49,1737,1,,1737,,,,
152,GO_0006836,neurotransmitter transport,0.0,0,0,217,1850,1,,1850,,,,
153,GO_0071695,anatomical structure maturation,0.0,0,0,240,1868,1,,1868,,,,
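
Both loops above print the same table because every score happens to be 0.0, which hides the difference between assigning to a whole column and assigning to a single cell. The toy example below makes the difference visible. The frame and the scores are made up, and `Series.map` is shown as one possible loop-free alternative.

```python
import pandas as pd

df = pd.DataFrame({"term": ["a", "b", "c"]})
score = {"a": 1.0, "b": 2.0, "c": 3.0}  # made-up values

# Assigning a scalar to df[col] overwrites the whole column on each iteration,
# so only the value computed for the last row survives.
for idx, row in df.iterrows():
    df["broadcast"] = score[row["term"]]

# .at writes a single cell, giving one value per row.
df["per_row"] = 0.0
for idx, row in df.iterrows():
    df.at[idx, "per_row"] = score[row["term"]]

# A vectorised alternative that avoids the explicit loop.
df["mapped"] = df["term"].map(score)

print(df)
```

Here the `broadcast` column holds 3.0 in every row, while `per_row` and `mapped` hold 1.0, 2.0 and 3.0.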


In [18]:
ss("GO_0007628", "GO_0008152")

0.0

In [19]:
c = ssmpy.get_id("GO_0006952")
ssmpy.get_ancestors(c)

[3694, 166, 8546, 3220]

In [20]:
c1 = ssmpy.get_id("GO_0006952")
c2 = ssmpy.get_id("GO_0008152")
ssmpy.num_paths(c1,c2)

1

In [21]:
ssmpy.shared_ic(c1,c2)

0.0

In [23]:
ssmpy.information_content(c1)

4.0161873385226885

In [24]:
ssmpy.information_content(ssmpy.get_id("GO_0008152")) # metabolic process, a direct child of the BP root

1.8550156925301988

In [25]:
ssmpy.information_content(ssmpy.get_id("GO_0008150")) # biological_process, the root of the BP branch

-0.0

In [26]:
ssmpy.information_content(c1)

4.0161873385226885

In [27]:
ssmpy.information_content_intrinsic(c1)

5.702010071544036

In [28]:
ssmpy.information_content_extrinsic(c1)

4.0161873385226885
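
The values above can be read against the textbook definition of information content, IC = -log(p), where p is the probability of a term. The extrinsic variant estimates p from annotation frequencies and the intrinsic variant from the ontology structure (for example, the number of descendants). The sketch below uses made-up counts; ssmpy may smooth or normalise its counts differently, so this is an illustration of the definition rather than of the library's exact formula.

```python
import math


def information_content(count, root_count):
    """IC = -log(p), with p estimated as count / root_count."""
    return -math.log(count / root_count)


# Made-up counts for one term and for the root of its branch.
annotations = {"term": 180, "root": 10_000}  # extrinsic: annotation frequency
descendants = {"term": 35, "root": 10_000}   # intrinsic: descendants in the ontology

ic_extrinsic = information_content(annotations["term"], annotations["root"])
ic_intrinsic = information_content(descendants["term"], descendants["root"])
ic_root = information_content(annotations["root"], annotations["root"])

print(ic_extrinsic, ic_intrinsic, ic_root)
```

For the root, p is 1, so the result is `-math.log(1.0)`, which Python prints as `-0.0`. This matches the sign of the root IC shown earlier.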

### Computing the IC for the input data

In [111]:
# information content (IC) of a single term
def ic(term):
    iri = ssmpy.get_id(term)
    # some terms (e.g. "GO_0055114") are missing from go.owl, probably because they are obsolete
    try:
        res = ssmpy.information_content(iri)
    except TypeError:
        res = 'NA in OWL (OBSOLETE)'
        print("'{}' term not found in the ontology because may be OBSOLETE".format(term))
    return res

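data_ss = data.copy()
# the semantic base names terms as 'GO_xxxxxxx', so replace the colon in 'GO:xxxxxxx'
data_ss['Representing term id'] = data_ss['Representing term id'].apply(lambda x: x.replace(':','_'))
data_ss['ss_ic'] = 1  # placeholder so the column can be positioned before it is computed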
data_ss = data_ss[['Representing term id', 'Representing term name', 'ss_ic',
       'Representing term size', 'Representing term rank',
       'Represented term number', 'Eleni-GOBP.csv term rank',
       'Freddie-GOBP.txt term rank', 'Nazli-GOBP.txt term rank',
       'MOGAMUN-GOBP.csv term rank', 'pathfindR-GOBP.csv term rank',
       'EnrichNet-GOBP.csv term rank']] 
data_ss['ss_ic'] = data_ss['Representing term id'].apply(ic)
data_ss = data_ss.rename(columns={'ss_ic':'Semantic Similarity (IC)'})
data_ss.head(2)

'GO_0055114' term not found in the ontology because may be OBSOLETE
'GO_0042107' term not found in the ontology because may be OBSOLETE


Unnamed: 0,Representing term id,Representing term name,Semantic Similarity (IC),Representing term size,Representing term rank,Represented term number,Eleni-GOBP.csv term rank,Freddie-GOBP.txt term rank,Nazli-GOBP.txt term rank,MOGAMUN-GOBP.csv term rank,pathfindR-GOBP.csv term rank,EnrichNet-GOBP.csv term rank
0,GO_0006952,defense response,4.01619,1823,1,184,1,7,1,9,2.0,5.0
1,GO_0007165,signal transduction,3.50282,6290,1,1,100,1,91,578,,
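
The `ic` function above returns a string for terms that are missing from the ontology, which makes the resulting column hold a mix of numbers and text. One possible alternative, sketched below, is to return NaN so that the column stays numeric. The lookup table stands in for the ssmpy calls and holds the two IC values from the output above, rounded. `ic_or_nan` is a hypothetical helper.

```python
import math

# Hypothetical lookup standing in for ssmpy (values rounded from the table above).
toy_ic = {"GO_0006952": 4.016, "GO_0007165": 3.503}


def ic_or_nan(term, lookup=toy_ic):
    """Return the IC of a term, or NaN when the term is unknown."""
    try:
        return lookup[term]
    except KeyError:
        return math.nan


values = [ic_or_nan(t) for t in ["GO_0006952", "GO_0055114"]]
known = [v for v in values if not math.isnan(v)]
print(values, sum(known) / len(known))
```

With NaN as the missing-value marker, summary statistics can simply skip the missing entries instead of failing on a string.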


In [113]:
# save file to CSV
data_ss.to_csv('./termSummary10-GOBP-MaxSize5000-Summary-with-semantic-similarity.csv', index=False)