<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Generate-JSON-file-for-FAIDARE" data-toc-modified-id="Generate-JSON-file-for-FAIDARE-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Generate JSON file for FAIDARE</a></span></li><li><span><a href="#Library-import" data-toc-modified-id="Library-import-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Library import</a></span></li><li><span><a href="#Prepare-data" data-toc-modified-id="Prepare-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Prepare data</a></span><ul class="toc-item"><li><span><a href="#Load-node-annotation-file" data-toc-modified-id="Load-node-annotation-file-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Load node annotation file</a></span></li><li><span><a href="#Load-CKN-as-a-network" data-toc-modified-id="Load-CKN-as-a-network-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Load CKN as a network</a></span><ul class="toc-item"><li><span><a href="#Apply-rank-filter" data-toc-modified-id="Apply-rank-filter-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Apply rank filter</a></span></li><li><span><a href="#Apply-node-filter" data-toc-modified-id="Apply-node-filter-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Apply node filter</a></span></li><li><span><a href="#Clean-up-network" data-toc-modified-id="Clean-up-network-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Clean up network</a></span></li></ul></li></ul></li><li><span><a href="#Generate-the-attributes-for-export" data-toc-modified-id="Generate-the-attributes-for-export-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Generate the attributes for export</a></span><ul class="toc-item"><li><span><a href="#The-&quot;easy&quot;-attributes" data-toc-modified-id="The-&quot;easy&quot;-attributes-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>The "easy" attributes</a></span></li><li><span><a href="#The-GMM-(GoMapMan)-ontology-annotations" 
data-toc-modified-id="The-GMM-(GoMapMan)-ontology-annotations-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>The GMM (GoMapMan) ontology annotations</a></span></li><li><span><a href="#The-description..." data-toc-modified-id="The-description...-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>The description...</a></span></li></ul></li><li><span><a href="#END" data-toc-modified-id="END-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>END</a></span></li></ul></div>

# Generate JSON file for FAIDARE

This notebook generates the JSON file used to index CKN in [FAIDARE](https://urgi.versailles.inra.fr/faidare/), 
following the instructions at https://urgi.versailles.inra.fr/faidare/join.

# Library import

In [1]:
import pandas as pd

import json

import gzip
import networkx as nx

from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d'); today

'2023-12-13'

# Prepare data

The following files are available at [skm.nib.si/downloads](https://skm.nib.si/downloads/).

In [2]:
node_annotation_file = "ckn/AtCKN-v2-2023.06_node-annot.tsv.gz"
edge_list_file = "ckn/AtCKN-v2-2023.06.tsv.gz"

## Load node annotation file

In [3]:
df_nodes = pd.read_csv(node_annotation_file, sep="\t")
df_nodes.head()

Unnamed: 0,node_ID,node_type,species,TAIR,short_name,synonyms,full_name,GMM,note,tissue
0,"12,13-EOT",metabolite,,,"12,13-EOT",,"12,13(S)-epoxylinolenic acid",,,not assigned
1,12-OH-JA-Ile,metabolite,,,12-OH-JA-Ile,,12-hydroxyjasmonic acid 12-O-&beta;-D-glucoside,,,not assigned
2,13-HPOT,metabolite,,,13-HPOT,,13(S)-hydroperoxylinolenic acid,,,not assigned
3,3H3PP-CoA,metabolite,,,3H3PP-CoA,,3-hydroxy-3-phenylpropanoyl-CoA,,,not assigned
4,3O3PP-CoA,metabolite,,,3O3PP-CoA,,3-oxo-3-phenylpropanoyl-CoA,,,not assigned


## Load CKN as a network

Load CKN as a directed network and attach the node annotations as node attributes.
The network structure is used later to extract the upstream and downstream 
interactions of each node for the "description" field.


In [4]:
with gzip.open(edge_list_file, "rt") as handle:
    handle.readline()  # skip the first (header) line
    g = nx.read_edgelist(handle,
                delimiter="\t",
                create_using=nx.DiGraph,
                data=[
                    ('effect', str),
                    ('type', str),
                    ('rank', int),
                    ('species', str),
                    ('isDirected', int),
                    ('isTFregulation', int),
                    ('interactionSources', str)
                ])
nx.set_node_attributes(g, df_nodes.set_index("node_ID").to_dict('index'))

# Add reciprocal edges for undirected interactions:
edges_to_add = []
for u, v, data in g.edges(data=True):
    if data["isDirected"] == 0 and not g.has_edge(v, u):
        edges_to_add.append((v, u, data))
_ = g.add_edges_from(edges_to_add)
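
To illustrate the reciprocal-edge step in isolation, here is a minimal sketch on a toy graph. The node names (`A`, `B`, `C`) and the edge attributes are made up for illustration and are not taken from CKN.

```python
import networkx as nx

# Toy directed graph; node names and attributes are hypothetical.
toy = nx.DiGraph()
toy.add_edge("A", "B", type="binding", isDirected=0)
toy.add_edge("B", "C", type="transcription factor regulation", isDirected=1)

# Collect the reverse of every undirected edge that is not yet present.
# Collecting first avoids modifying the graph while iterating over it.
reverse_edges = [
    (v, u, dict(data))  # copy the attributes for the new edge
    for u, v, data in toy.edges(data=True)
    if data["isDirected"] == 0 and not toy.has_edge(v, u)
]
toy.add_edges_from(reverse_edges)

print(sorted(toy.edges()))
# [('A', 'B'), ('B', 'A'), ('B', 'C')]
```

Only the undirected `A`-`B` edge gains a reverse copy; the directed `B`-`C` edge is left as it is.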


### Apply rank filter
Keep only the interactions that are visible in [CKN Explorer](https://skm.nib.si/ckn/) with the default filters
(ranks 0, 1, 2 = highly reliable), and remove edges with the "lower" ranks 3 and 4.

In [5]:
to_remove = [(u, v) for u, v, d in g.edges(data=True) if d["rank"] in (3, 4)]
g.remove_edges_from(to_remove)
g.number_of_edges()

107029
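
A quick way to sanity-check a rank filter like this one is to count the remaining edges per rank. The sketch below does that on a toy graph; the nodes and ranks are invented, and the `max_rank` threshold is an assumption that mirrors the "ranks 0, 1, 2" rule described above.

```python
from collections import Counter

import networkx as nx

# Hypothetical edges with made-up ranks.
toy = nx.DiGraph()
toy.add_edge("A", "B", rank=0)
toy.add_edge("B", "C", rank=2)
toy.add_edge("C", "D", rank=3)
toy.add_edge("D", "A", rank=4)

max_rank = 2  # assumption: ranks 0-2 are kept, higher ranks are dropped
low_confidence = [(u, v) for u, v, r in toy.edges(data="rank") if r > max_rank]
toy.remove_edges_from(low_confidence)

# Count the surviving edges per rank.
rank_counts = Counter(r for _, _, r in toy.edges(data="rank"))
print(dict(rank_counts))
# {0: 1, 2: 1}
```

Removing edges does not remove their endpoints, which is why isolated nodes are cleaned up in a later step.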

### Apply node filter
Also remove "abstracted" nodes (e.g. complexes, processes, abiotic factors, etc.) by keeping only the node types listed below.

In [6]:
node_types = {
 'antisense_long_noncoding_rna',
 'metabolite',
 'mirna',
 'other_rna',
 'pre_trna',
 'protein_coding',
 'pseudogene',
 'small_nuclear_rna',
 'small_nucleolar_rna',
 'transposable_element_gene'
}

wrong_type = [n for n, data in g.nodes(data=True) if data['node_type'] not in node_types]
g.remove_nodes_from(wrong_type)

### Clean up network

In [7]:
isolates = list(nx.isolates(g))
g.remove_nodes_from(isolates)
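
The node filter and the isolate clean-up interact: removing a node also removes its edges, which can leave neighbours without any connection. A minimal sketch on a toy graph shows this. The identifiers are hypothetical, and the `complex` type label is an assumption standing in for any "abstracted" node type.

```python
import networkx as nx

# Hypothetical nodes; identifiers and the "complex" label are made up.
toy = nx.DiGraph()
toy.add_node("AT1G00001", node_type="protein_coding")
toy.add_node("AT1G00002", node_type="protein_coding")
toy.add_node("AT1G00003", node_type="protein_coding")
toy.add_node("complex_X", node_type="complex")
toy.add_edge("AT1G00001", "AT1G00002")
toy.add_edge("AT1G00003", "complex_X")

keep_types = {"protein_coding", "metabolite"}
wrong_type = [n for n, t in toy.nodes(data="node_type") if t not in keep_types]
toy.remove_nodes_from(wrong_type)  # also drops every edge touching these nodes

isolates = list(nx.isolates(toy))  # AT1G00003 has lost its only edge
toy.remove_nodes_from(isolates)

print(sorted(toy.nodes()))
# ['AT1G00001', 'AT1G00002']
```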

# Generate the attributes for export

We only annotate genes that have a TAIR identifier and are still present in the filtered network.

In [8]:
df_annots = df_nodes[~df_nodes["TAIR"].isna()].copy()
df_annots = df_annots[df_annots["TAIR"].isin(g.nodes())]
df_annots.shape

(13295, 10)

## The "easy" attributes

In [9]:
df_annots.loc[:, "url"] = df_annots["TAIR"].apply(lambda x: f"https://skm.nib.si/ckn/?identifier={x}")
df_annots.loc[:, "species"] = [["Arabidopsis thaliana"]]*df_annots.shape[0]
df_annots.loc[:, "node"] = "NIB"
df_annots.loc[:, "databaseName"] = "SKM"
df_annots.loc[:, "entryType"] = "Gene"
df_annots.loc[:, "name"] = df_annots["TAIR"]
df_annots.head()

Unnamed: 0,node_ID,node_type,species,TAIR,short_name,synonyms,full_name,GMM,note,tissue,url,node,databaseName,entryType,name
12,AT1G01010,protein_coding,Arabidopsis thaliana,AT1G01010,NAC001,NTL10|NAC domain containing protein 1|ANAC001|...,NAC domain containing protein 1,27.3.27_RNA.regulation of transcription.NAC do...,,"leaf,stem,root,flower",https://skm.nib.si/ckn/?identifier=AT1G01010,NIB,SKM,Gene,AT1G01010
15,AT1G01040,protein_coding,Arabidopsis thaliana,AT1G01040,DCL1,SIN1|SUS1|EMBRYO DEFECTIVE 60|ASU1|EMBRYO DEFE...,dicer-like 1,27.1.20_RNA.processing.degradation dicer,,"seed,root,leaf,stem,flower",https://skm.nib.si/ckn/?identifier=AT1G01040,NIB,SKM,Gene,AT1G01040
16,AT1G01046,mirna,Arabidopsis thaliana,AT1G01046,MIR838A,ath-miR838|MIR838A|MIR838,microRNA838A,35.2_not assigned.unknown,,not assigned,https://skm.nib.si/ckn/?identifier=AT1G01046,NIB,SKM,Gene,AT1G01046
17,AT1G01050,protein_coding,Arabidopsis thaliana,AT1G01050,PPA1,AtPPa1|PPa1|pyrophosphorylase 1,pyrophosphorylase 1,23.4.99_nucleotide metabolism.phosphotransfer ...,,"seed,root,leaf,stem,flower",https://skm.nib.si/ckn/?identifier=AT1G01050,NIB,SKM,Gene,AT1G01050
18,AT1G01060,protein_coding,Arabidopsis thaliana,AT1G01060,LHY,LATE ELONGATED HYPOCOTYL 1|LATE ELONGATED HYPO...,LATE ELONGATED HYPOCOTYL,27.3.26_RNA.regulation of transcription.MYB-re...,,"seed,root,leaf,stem,flower",https://skm.nib.si/ckn/?identifier=AT1G01060,NIB,SKM,Gene,AT1G01060


## The GMM (GoMapMan) ontology annotations

In [10]:
# Add GMM annotations
def get_gmm_annots(x):
    if not pd.isnull(x):
        annots = []
        for n in x.split("|"):
            annots.append(f'GMM:{n.split("_")[0]}')
        return annots
df_annots.loc[:, "annotationId"] = df_annots["GMM"].apply(get_gmm_annots)

def get_gmm_defs(x):
    if not pd.isnull(x):
        names = []
        for n in x.split("|"):
            annot, name = n.split("_", 1)  # split on the first underscore only
            names.append(f"{name} (GMM:{annot})")
        return names
df_annots.loc[:, "annotationName"] = df_annots["GMM"].apply(get_gmm_defs)
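
As a worked example of the parsing above, the sketch below splits a GMM string into bin codes and bin names without pandas. The two entries are taken from the table output earlier in this notebook, but joining them with `|` into a single string is an assumption made for illustration, and `split_gmm` is a hypothetical helper.

```python
def split_gmm(gmm):
    """Split a '|'-separated GMM string into (bin code, bin name) pairs."""
    pairs = []
    for entry in gmm.split("|"):
        code, name = entry.split("_", 1)  # split on the first underscore only
        pairs.append((code, name))
    return pairs


# Hypothetical combination of two GMM entries in one string.
example = "27.1.20_RNA.processing.degradation dicer|35.2_not assigned.unknown"
pairs = split_gmm(example)

annotation_ids = [f"GMM:{code}" for code, _ in pairs]
annotation_names = [f"{name} (GMM:{code})" for code, name in pairs]

print(annotation_ids)
# ['GMM:27.1.20', 'GMM:35.2']
print(annotation_names)
# ['RNA.processing.degradation dicer (GMM:27.1.20)', 'not assigned.unknown (GMM:35.2)']
```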

## The description... 

We add to the description (if available) the short name, the gene description, 
node type, and synonyms. 

More complicated, we also add the interacting nodes to the description, including 
the type of interaction. 

CKN edges have a "type" attribute that defines the type of molecular interaction.

Below, we create a dictionary to help provide descriptive strings for each 
interaction type, and for whether it is upstream or downstream of the node 
under consideration.

In [11]:
strings = {
    "downstream": {
        "binding": "has protein-protein binding with", 
        "transcription factor regulation": "transcriptionally regulates",
        "post-translational modification": "post-translationally modifies",
        "small RNA interactions": "regulates as a small RNA with",
        "other": "has other molecular interaction with"  
    },
    # No "binding" entry here: binding is undirected and is only
    # reported once, in the downstream direction.
    "upstream": {
        "transcription factor regulation": "is transcriptionally regulated by",
        "post-translational modification": "is post-translationally modified by",
        "small RNA interactions": "is regulated by small RNA",
        "other": "has other molecular interaction with"  
    }  
}

In [12]:
def generate_interactions_description(node):
    ''' Fetch upstream and downstream interactions for the node, and 
    generate descriptive phrases for each type of interaction.
    '''
    
    s_interactions = []

    downstream_edges = []
    for n in g.successors(node):
        e = dict(g.edges[(node, n)])  # copy, so the graph's edge data is not modified
        e["target"] = n
        downstream_edges.append(e)
    df_downstream = pd.DataFrame(downstream_edges)
    if df_downstream.shape[0] > 0:
        df_downstream = df_downstream.groupby("type").agg({
            "target": ", ".join
        })  
        for interaction_type, r in df_downstream.iterrows():
            s_interactions.append(f'{strings["downstream"][interaction_type]} {r["target"]}')

    upstream_edges = []
    for n in g.predecessors(node):
        e = dict(g.edges[(n, node)])  # copy, so the graph's edge data is not modified
        e["source"] = n
        upstream_edges.append(e)
    df_upstream = pd.DataFrame(upstream_edges)
    if df_upstream.shape[0] > 0:   
        df_upstream = df_upstream.groupby("type").agg({
            "source": ", ".join
        })
        # binding is undirected, only use the downstream one
        df_upstream = df_upstream[df_upstream.index != "binding"]
        for interaction_type, r in df_upstream.iterrows():
            s_interactions.append(f'{strings["upstream"][interaction_type]} {r["source"]}')

    if not s_interactions:
        # nothing to report (e.g. only upstream binding edges)
        return ""
    s = f'It {" and ".join(s_interactions)}. '
    return s
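
To show what kind of sentence this produces, here is a minimal, self-contained sketch of the same idea on a toy graph. It is not the function above: it groups neighbours with a `defaultdict` instead of a pandas `groupby`, uses only a subset of the phrase dictionary, and the node names (`TF1`, `GENE1`, `P1`, ...) and the helper `describe` are hypothetical.

```python
from collections import defaultdict

import networkx as nx

# A subset of the phrase templates defined above.
phrases = {
    "downstream": {
        "binding": "has protein-protein binding with",
        "transcription factor regulation": "transcriptionally regulates",
    },
    "upstream": {
        "transcription factor regulation": "is transcriptionally regulated by",
    },
}

# Toy network; all node names are made up.
toy = nx.DiGraph()
toy.add_edge("TF1", "GENE1", type="transcription factor regulation")
toy.add_edge("GENE1", "GENE2", type="transcription factor regulation")
toy.add_edge("GENE1", "GENE3", type="transcription factor regulation")
toy.add_edge("GENE1", "P1", type="binding")
toy.add_edge("P1", "GENE1", type="binding")  # reciprocal edge of the undirected binding


def describe(graph, node):
    """Build one sentence listing the node's neighbours per interaction type."""
    parts = []

    downstream = defaultdict(list)
    for target in graph.successors(node):
        downstream[graph.edges[node, target]["type"]].append(target)
    for edge_type in sorted(downstream):
        parts.append(f'{phrases["downstream"][edge_type]} {", ".join(downstream[edge_type])}')

    upstream = defaultdict(list)
    for source in graph.predecessors(node):
        edge_type = graph.edges[source, node]["type"]
        if edge_type != "binding":  # binding is already reported downstream
            upstream[edge_type].append(source)
    for edge_type in sorted(upstream):
        parts.append(f'{phrases["upstream"][edge_type]} {", ".join(upstream[edge_type])}')

    return f'It {" and ".join(parts)}. ' if parts else ""


description = describe(toy, "GENE1")
print(description)
```

For `GENE1` this prints "It has protein-protein binding with P1 and transcriptionally regulates GENE2, GENE3 and is transcriptionally regulated by TF1. ", with the binding partner listed once even though the edge exists in both directions.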

In [13]:
def generate_description(x):
    s = f'{x["TAIR"]}'
    if pd.notnull(x["short_name"]) and x["TAIR"] != x["short_name"]:
        s += f' ({x["short_name"]})'
    if not pd.isnull(x["node_type"]):
        s += f' is a {x["node_type"]} gene'
    if not pd.isnull(x["full_name"]):
        s += f""" and has description '{x["full_name"]}'"""
    s += ". "
    if not pd.isnull(x["synonyms"]):
        s += f'Synonyms are: {", ".join(x["synonyms"].split("|"))}. '
    s += generate_interactions_description(x["node_ID"])
    return s

In [14]:
# This takes a little while...
df_annots["description"] = df_annots.apply(generate_description, axis=1)

In [15]:
records = df_annots[['name', 'species', 'url', 'node', 'databaseName',
       'entryType', 'annotationId', 'annotationName', 'description']].to_dict(orient="records")

with open("skm-ckn-faidare.json", "w") as handle:
    json.dump(records, handle, indent=4)
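
Before submitting the file, it can be useful to reload it and check that every record carries all exported fields. The sketch below does this on a single placeholder record written to a temporary file. The field names come from the export cell above, the identifier and URL come from the table output earlier, and the annotation name and description values are placeholders.

```python
import json
import os
import tempfile

# Field names as selected in the export cell above.
required = ["name", "species", "url", "node", "databaseName",
            "entryType", "annotationId", "annotationName", "description"]

# One example record; annotationName and description are placeholders.
example_records = [{
    "name": "AT1G01010",
    "species": ["Arabidopsis thaliana"],
    "url": "https://skm.nib.si/ckn/?identifier=AT1G01010",
    "node": "NIB",
    "databaseName": "SKM",
    "entryType": "Gene",
    "annotationId": ["GMM:27.3.27"],
    "annotationName": ["placeholder name (GMM:27.3.27)"],
    "description": "placeholder description",
}]

# Round-trip through a temporary JSON file, then check the fields.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w") as handle:
        json.dump(example_records, handle, indent=4)
    with open(path) as handle:
        reloaded = json.load(handle)

missing = [(i, key) for i, rec in enumerate(reloaded)
           for key in required if key not in rec]
print(missing)
# []
```

Pointing `path` at the real output file instead of the temporary one would run the same check on the full export.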

# END