# How to set up ConceptNetKG?
In this tutorial, we will learn how to set up a `ConceptNetKG` graph.

## Initial knowledge graph setup
`zsl-kg` is a zero-shot learning framework that operates on a knowledge graph.
In this tutorial, we consider the SNIPS-NLU dataset, which contains 7 classes, namely:
`{weather, music, restaurant, search, movie, book, playlist}`, where `book` and `playlist` are the unseen classes. (Of course, in zero-shot learning we would not have access to the unseen classes from the dataset, but for the purpose of this tutorial we allow it.)

Next, we map each class to its node in the [ConceptNet graph](https://github.com/commonsense/conceptnet5/wiki/Downloads): `{/c/en/weather, /c/en/music, /c/en/restaurant, /c/en/search, /c/en/movie, /c/en/book, /c/en/playlist}` and query the 2-hop neighbourhood of these nodes. The nodes and the resulting subgraph are included in this directory.
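
The files in this directory already contain the result of the 2-hop query, so nothing below is needed to follow the tutorial. As a rough illustration of what such a query involves, here is a minimal sketch of a breadth-first search limited to two hops. The function name `k_hop_neighbourhood`, the toy triples, and the choice to treat every edge as undirected are assumptions made for this example, not the procedure used to build the provided files.

```python
from collections import deque


def k_hop_neighbourhood(edges, seeds, hops=2):
    """Return all nodes within `hops` undirected steps of any seed node."""
    # build an undirected adjacency map from (start, relation, end) triples
    adjacency = {}
    for start, _relation, end in edges:
        adjacency.setdefault(start, set()).add(end)
        adjacency.setdefault(end, set()).add(start)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# toy triples, invented for illustration only
toy_edges = [
    ("/c/en/book", "/r/RelatedTo", "/c/en/library"),
    ("/c/en/library", "/r/AtLocation", "/c/en/city"),
    ("/c/en/city", "/r/PartOf", "/c/en/country"),
]
subgraph_nodes = k_hop_neighbourhood(toy_edges, ["/c/en/book"], hops=2)
print(sorted(subgraph_nodes))
```

In this toy graph, `/c/en/country` is three hops away from `/c/en/book`, so it is left out of the subgraph.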


## Nodes, edges, and relations

In [1]:
# read the nodes and edges 
import pandas as pd

nodes = pd.read_csv('nodes.csv')
edges = pd.read_csv('edges.csv')
relations = pd.read_csv('relations.csv')
print(nodes)

            id                       uri
0     21561344           /c/en/salsero/n
1        32772               /c/en/march
2     22577157         /c/en/blognovel/n
3     22710467  /c/en/film_entertainment
4       131101            /c/en/songbook
...        ...                       ...
6323  21495769            /c/en/diesis/n
6324  21528540           /c/en/macca's/n
6325    688096            /c/en/pneuma/n
6326    589804             /c/en/ghana/n
6327  22839285      /c/en/well_thumbed/a

[6328 rows x 2 columns]


In [2]:
edges[:5]

   start_id  relation_id  end_id
0      1204            0    1205
1      1204            0    1206
2      1204            1    3916
3      1204            1    3673
4      1204            1     714


In [3]:
relations[:5]

   id              uri  directed
0   0       /r/Antonym         0
1   1    /r/AtLocation         1
2   2     /r/CapableOf         1
3   3        /r/Causes         1
4   4  /r/CausesDesire         1
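
To make the three tables easier to read together, the sketch below turns an edge row into a readable triple. It assumes that `start_id` and `end_id` are row positions in the node list and that `relation_id` matches the `id` column of the relations table. This reading is a guess based on the outputs shown here, and the toy rows are invented for illustration.

```python
# toy tables mirroring the column layout above; the rows are invented
node_uris = ["/c/en/hot", "/c/en/cold", "/c/en/restaurant", "/c/en/city"]
toy_relations = [("/r/Antonym", 0), ("/r/AtLocation", 1)]  # (uri, directed)
toy_edges = [(0, 0, 1), (2, 1, 3)]  # (start_id, relation_id, end_id)


def decode_edge(edge):
    """Map an (start_id, relation_id, end_id) row to URIs plus a direction flag."""
    start_id, relation_id, end_id = edge
    relation_uri, directed = toy_relations[relation_id]
    return node_uris[start_id], relation_uri, node_uris[end_id], bool(directed)


for edge in toy_edges:
    print(decode_edge(edge))
```

The last element of each decoded tuple reflects the `directed` column: `False` for a symmetric relation such as `/r/Antonym`, `True` for a directed one such as `/r/AtLocation`.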


## Features
In zero-shot learning, each node in the knowledge graph is mapped to a pretrained embedding. Here, we use the GloVe 840B embeddings from [https://nlp.stanford.edu/data/glove.840B.300d.zip](https://nlp.stanford.edu/data/glove.840B.300d.zip). 

In ConceptNet, node names consist of a leading `/c/en/` prefix, followed by the concept name (possibly several words joined by underscores) and an optional postfix such as `/n`. For simplicity, we strip these affixes from the node names, sum the embeddings of the individual words in the concept, and L2-normalize the result to get the concept embedding.

In [4]:
import re
import numpy as np
import torch
import torch.nn.functional as F


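def load_embeddings(file_path, dim=300):
    """Load GloVe vectors from a text file into a {token: vector} dict."""
    embeddings = {}
    with open(file_path, encoding="utf-8") as fp:
        for line in fp:
            # split from the right: a few tokens in the 840B file contain spaces
            fields = line.rstrip().rsplit(" ", dim)
            vector = np.asarray(fields[1:], dtype="float32")
            embeddings[fields[0]] = vector

    return embeddings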


def get_individual_words(concept):
    """extracts the individual words from a concept"""
    clean_concepts = re.sub(r"\/c\/[a-z]{2}\/|\/.*", "", concept)
    return clean_concepts.strip().split("_")
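
To see what the affix stripping does, the short check below applies the same function to a few URIs taken from the node table above. The function is repeated only so that the snippet runs on its own.

```python
import re


def get_individual_words(concept):
    """extracts the individual words from a concept"""
    clean_concepts = re.sub(r"\/c\/[a-z]{2}\/|\/.*", "", concept)
    return clean_concepts.strip().split("_")


examples = ["/c/en/film_entertainment", "/c/en/salsero/n", "/c/en/well_thumbed/a"]
for uri in examples:
    print(uri, "->", get_individual_words(uri))
```

The first alternative of the regular expression removes the language prefix, and the second removes everything from the next `/` onwards, which is where postfixes such as `/n` or `/a` start.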

In [5]:
# extract individual words from concepts
words = set()
all_concepts = []
for index, node in nodes.iterrows():
    concept_words = get_individual_words(node["uri"])
    all_concepts.append(concept_words)
    for w in concept_words:
        words.add(w)

word_to_idx = dict([(word, idx + 1) for idx, word in enumerate(words)])
word_to_idx["<PAD>"] = 0
idx_to_word = dict([(idx, word) for word, idx in word_to_idx.items()])

In [6]:
# load glove 840 (!! this may take some time)
glove = load_embeddings('glove.840B.300d.txt')

# get the word embedding
embedding_matrix = torch.zeros(len(word_to_idx), 300)
for idx, word in idx_to_word.items():
    if word in glove:
        embedding_matrix[idx] = torch.Tensor(glove[word])


In [7]:
# padding concepts
max_length = max([len(concept_words) for concept_words in all_concepts])
padded_concepts = []
for concept_words in all_concepts:
    concept_idx = [word_to_idx[word] for word in concept_words]
    concept_idx += [0] * (max_length - len(concept_idx))
    padded_concepts.append(concept_idx)

# sum the word embeddings of individual words and normalize
padded_concepts = torch.tensor(padded_concepts)
concept_words = embedding_matrix[padded_concepts]
concept_embs = torch.sum(concept_words, dim=1)
concept_embs = F.normalize(concept_embs)
concept_embs.size()

torch.Size([6328, 300])
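
The padding step works because index 0 (`<PAD>`) points to an all-zero row, so padded positions add nothing to the sum. The following sketch reproduces the same idea in plain Python on a toy 2-dimensional embedding table. All values and the helper name `concept_embedding` are made up for illustration, and it is not a replacement for the PyTorch code above.

```python
import math

PAD = 0
# toy 2-dimensional "embeddings"; row 0 is the all-zero <PAD> vector
toy_embedding_matrix = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
# the second concept has a single word followed by padding
toy_padded_concepts = [[1, 2], [1, PAD]]


def concept_embedding(word_ids):
    """Sum the word vectors of a concept and scale the result to unit length."""
    dim = len(toy_embedding_matrix[0])
    summed = [sum(toy_embedding_matrix[i][d] for i in word_ids) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in summed)) or 1.0  # leave all-zero vectors as they are
    return [x / norm for x in summed]


toy_concept_embs = [concept_embedding(ids) for ids in toy_padded_concepts]
print(toy_concept_embs)
```

Because the result is scaled to unit length, summing and averaging the word vectors give the same direction, so the normalized sum is equivalent to a normalized average.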

## ConceptNetKG
Now that we have the nodes, edges, and features, we create the ConceptNetKG object. 

In [8]:
# Params
# To automatically convert the knowledge graph to an undirected graph,
# set "bidirectional" to True. If you omit this option, ensure that
# the edges in the graph are undirected.
from allennlp.common.params import Params
from zsl_kg.knowledge_graph.conceptnet import ConceptNetKG
params = Params({"bidirectional": True})

kg = ConceptNetKG(nodes['uri'].tolist(), 
                  concept_embs, 
                  edges.values.tolist(), 
                  relations['uri'].tolist(), 
                  params)

### Move KG to Device
We can use either `.to(device)` or `.cuda()` to move the knowledge graph to a CUDA device.


In [17]:
device = "cuda" if torch.cuda.is_available() else "cpu"
kg.to(device)

True

### Random-walk

The `ConceptNetKG` object stores the knowledge graph information. However, to make the object useful in the `zsl-kg` framework, we simulate a random-walk over the knowledge graph. The random-walk assigns hitting probabilities based on the connectivity in the knowledge graph, i.e., neighbours with a higher node degree receive a higher hitting probability.

The random-walk takes three parameters, shown here with their default values:
1. k = 20: length of the random-walk
2. n = 10: number of restarts
3. seed = 0: seed value for determinism

These parameters can be changed by providing them in the `Params()` object during initialization. For example:
```python
new_params = Params({"bidirectional": True, "rw.k": 10, "rw.n": 15, "rw.seed": 42})
kg = ConceptNetKG(nodes['uri'].tolist(), 
                  concept_embs, 
                  edges.values.tolist(), 
                  relations['uri'].tolist(), 
                  new_params)
```

To simulate a random-walk over the knowledge graph, use `run_random_walk()`. 

In [9]:
kg.run_random_walk()

100%|███████████████████████████████████████████████| 6328/6328 [00:29<00:00, 215.07it/s]
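
To build some intuition for the roles of `k`, `n`, and the seed, here is a minimal sketch of a random walk with restarts on a toy graph. It is not the `zsl-kg` implementation: the function `hitting_probabilities`, the toy adjacency lists, and the way visits are counted are all assumptions made for illustration.

```python
import random
from collections import Counter


def hitting_probabilities(adjacency, start, k=20, n=10, seed=0):
    """Estimate how often each node is reached by n walks of length k from `start`."""
    rng = random.Random(seed)  # a fixed seed makes the walks reproducible
    visits = Counter()
    for _ in range(n):  # n restarts from the same start node
        current = start
        for _ in range(k):  # k steps per walk
            neighbours = adjacency[current]
            if not neighbours:
                break
            current = rng.choice(neighbours)
            if current != start:
                visits[current] += 1
    total = sum(visits.values())
    return {node: count / total for node, count in visits.items()} if total else {}


# toy undirected graph, invented for illustration only
toy_graph = {
    "/c/en/book": ["/c/en/library", "/c/en/page"],
    "/c/en/library": ["/c/en/book", "/c/en/page", "/c/en/shelf"],
    "/c/en/page": ["/c/en/book", "/c/en/library"],
    "/c/en/shelf": ["/c/en/library"],
}
probs = hitting_probabilities(toy_graph, "/c/en/book")
print(probs)
```

Visit counts are normalized so that the probabilities over the visited nodes sum to one, and running the function again with the same seed returns the same estimates.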


### Node IDs
In `zsl-kg`, we learn to map nodes in the knowledge graph to class representations. To query the learned class representations, we get the node ids of the classes with the `get_node_ids()` function.

In [11]:
node_names = ["/c/en/weather", 
              "/c/en/music", 
              "/c/en/restaurant", 
              "/c/en/search", 
              "/c/en/movie", 
              "/c/en/book",
              "/c/en/playlist"]
node_ids = kg.get_node_ids(node_names)
node_ids

[1937, 177, 1204, 1857, 153, 3732, 6123]
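
The returned ids are all smaller than the number of nodes, which suggests that they are positions in the node list passed to `ConceptNetKG`. Under that assumption, the lookup can be pictured as a simple dictionary from URI to position. This sketch is not the library's code, and `build_uri_index` is a hypothetical helper.

```python
def build_uri_index(uris):
    """Map each node URI to its position in the node list."""
    return {uri: position for position, uri in enumerate(uris)}


# toy node list, invented for illustration only
toy_uris = ["/c/en/weather", "/c/en/music", "/c/en/book"]
uri_to_id = build_uri_index(toy_uris)
toy_ids = [uri_to_id[name] for name in ["/c/en/book", "/c/en/weather"]]
print(toy_ids)
```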

### Save KG
To save the knowledge graph data along with the random-walk information, use the `save_to_disk()` method.

In [13]:
kg.save_to_disk('kg_tutorial_graph')

'kg_tutorial_graph'

## Load saved KG
To load the saved `ConceptNetKG` from disk, use `ConceptNetKG.load_from_disk(dir_path)`. 

In [15]:
saved_kg = ConceptNetKG.load_from_disk('kg_tutorial_graph')

We have successfully completed the tutorial on how to set up `ConceptNetKG`. Move on to the next tutorials on class encoders to learn more about graph neural networks in `zsl-kg`.