# How to set up ConceptNetKG
In this tutorial, we will learn how to set up a `ConceptNetKG` graph.

## Initial knowledge graph setup
`zsl-kg` is a zero-shot learning framework that operates on a knowledge graph.
The first step in learning zero-shot classifiers is to identify the classes and map them to
nodes in the knowledge graph of interest.
We will consider the SNIPS-NLU dataset, which contains 7 classes:
`{weather, music, restaurant, search, movie, book, playlist}`, where `book` and `playlist` are the unseen classes. (Of course, in the zero-shot learning setting we would not have access to the unseen classes from the dataset, but for the purpose of this tutorial we will cheat.)

Next, we map the classes to nodes in the ConceptNet graph: `{/c/en/weather, /c/en/music, /c/en/restaurant, /c/en/search, /c/en/movie, /c/en/book, /c/en/playlist}` and query their 2-hop neighbourhood. The nodes and edges from this 2-hop neighbourhood are included in this directory.
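
The mapping and the neighbourhood query can be sketched as follows. This is a minimal illustration on a hand-made toy edge list, not the procedure that produced the files in this directory; `class_to_uri`, `k_hop_neighbourhood`, and the toy edges are hypothetical names and data introduced only for this example.

```python
from collections import defaultdict


def class_to_uri(name, lang="en"):
    """Map a class name to a ConceptNet-style URI, e.g. 'book' -> '/c/en/book'."""
    return f"/c/{lang}/{name.strip().lower().replace(' ', '_')}"


def k_hop_neighbourhood(seeds, edges, hops=2):
    """Return every node reachable from `seeds` within `hops` undirected steps."""
    neighbours = defaultdict(set)
    for start, end in edges:
        neighbours[start].add(end)
        neighbours[end].add(start)

    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # expand one hop, keeping only nodes we have not seen yet
        frontier = {n for node in frontier for n in neighbours[node]} - visited
        visited |= frontier
    return visited


# toy edges (made up for illustration, not taken from ConceptNet)
toy_edges = [
    ("/c/en/book", "/c/en/songbook"),
    ("/c/en/songbook", "/c/en/music"),
    ("/c/en/music", "/c/en/salsero/n"),
    ("/c/en/salsero/n", "/c/en/dance"),
    ("/c/en/dance", "/c/en/ballroom"),
]

seeds = [class_to_uri(c) for c in ["book", "playlist"]]
subgraph_nodes = k_hop_neighbourhood(seeds, toy_edges, hops=2)
```

With two hops from `/c/en/book`, the toy result contains `/c/en/songbook` and `/c/en/music` but stops before `/c/en/salsero/n`; a seed without any edges (here `/c/en/playlist`) is still kept in the result.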


## Nodes and edges

In [1]:
# read the nodes and edges 
import pandas as pd

nodes = pd.read_csv('nodes.csv')
edges = pd.read_csv('edges.csv')
print(nodes)

            id                       uri
0     21561344           /c/en/salsero/n
1        32772               /c/en/march
2     22577157         /c/en/blognovel/n
3     22710467  /c/en/film_entertainment
4       131101            /c/en/songbook
...        ...                       ...
6323  21495769            /c/en/diesis/n
6324  21528540           /c/en/macca's/n
6325    688096            /c/en/pneuma/n
6326    589804             /c/en/ghana/n
6327  22839285      /c/en/well_thumbed/a

[6328 rows x 2 columns]


In [24]:
edges

      start_id  relation_id  end_id
0         1204            0    1205
1         1204            0    1206
2         1204            1    3916
3         1204            1    3673
4         1204            1     714
...        ...          ...     ...
6813      4957           35    3732
6814      5490           35    3732
6815      3604           38    3732
6816      3351           38    3732


## Features
Unlike traditional node classification problems, in zero-shot learning each node in the knowledge graph is mapped to a pretrained embedding. Here, we will use the 300-dimensional GloVe 840B embeddings from [https://nlp.stanford.edu/data/glove.840B.300d.zip](https://nlp.stanford.edu/data/glove.840B.300d.zip).

In ConceptNet, node names start with a `/c/en/` prefix, followed by the concept name (possibly several words joined by underscores) and an optional postfix such as `/n`. For simplicity, we strip these affixes from the node names and combine the embeddings of the individual words in the concept: we sum them and L2-normalize the result, which points in the same direction as their average.

In [19]:
import re
import numpy as np
import torch
import torch.nn.functional as F


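def load_embeddings(file_path, dim=300):
    """Load GloVe vectors from a text file into a {token: vector} dict."""
    embeddings = {}
    with open(file_path, encoding="utf-8") as fp:
        for line in fp:
            # split from the right so that tokens containing spaces
            # do not end up inside the vector fields
            fields = line.rstrip().rsplit(" ", dim)
            vector = np.asarray(fields[1:], dtype="float32")
            embeddings[fields[0]] = vector

    return embeddings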


def get_individual_words(concept):
    """extracts the individual words from a concept"""
    clean_concepts = re.sub(r"\/c\/[a-z]{2}\/|\/.*", "", concept)
    return clean_concepts.strip().split("_")
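
To make the affix stripping concrete, here is a small standalone sketch that applies the same regular expression to three URIs from `nodes.csv` shown earlier. The helper name `strip_affixes` is hypothetical; it repeats the logic of `get_individual_words` (without the redundant escapes) only so the snippet runs on its own.

```python
import re


def strip_affixes(uri):
    """Drop the '/c/<lang>/' prefix and any '/...' postfix, then split on '_'."""
    return re.sub(r"/c/[a-z]{2}/|/.*", "", uri).strip().split("_")


examples = ["/c/en/film_entertainment", "/c/en/salsero/n", "/c/en/well_thumbed/a"]
split_words = {uri: strip_affixes(uri) for uri in examples}
# /c/en/film_entertainment -> ['film', 'entertainment']
# /c/en/salsero/n          -> ['salsero']
# /c/en/well_thumbed/a     -> ['well', 'thumbed']
```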

In [16]:
# extract individual words from concepts
words = set()
all_concepts = []
for index, node in nodes.iterrows():
    concept_words = get_individual_words(node["uri"])
    all_concepts.append(concept_words)
    for w in concept_words:
        words.add(w)

# sort the words so the vocabulary is reproducible; index 0 is reserved for padding
word_to_idx = {word: idx + 1 for idx, word in enumerate(sorted(words))}
word_to_idx["<PAD>"] = 0
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

In [17]:
# load glove 840 (!! this may take some time)
glove = load_embeddings('glove.840B.300d.txt')

# build the embedding matrix; words missing from GloVe (and <PAD>) keep a zero vector
embedding_matrix = torch.zeros(len(word_to_idx), 300)
for idx, word in idx_to_word.items():
    if word in glove:
        embedding_matrix[idx] = torch.from_numpy(glove[word])
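
Because words that are absent from GloVe silently keep an all-zero row, it can be worth checking how much of the vocabulary is actually covered. The sketch below uses toy stand-ins for `word_to_idx` and `glove`; the helper `missing_words` and all values are hypothetical and say nothing about the real GloVe vocabulary.

```python
def missing_words(vocabulary, pretrained):
    """Return the vocabulary words that have no pretrained vector, sorted."""
    return sorted(w for w in vocabulary if w != "<PAD>" and w not in pretrained)


# toy stand-ins for `word_to_idx` and `glove` (values made up for illustration)
toy_vocab = {"<PAD>": 0, "film": 1, "blognovel": 2, "music": 3}
toy_glove = {"film": [0.1, 0.2], "music": [0.3, 0.4]}

oov = missing_words(toy_vocab, toy_glove)
# fraction of real (non-padding) words that have a pretrained vector
coverage = 1 - len(oov) / (len(toy_vocab) - 1)
```

In the notebook, the same check would be `missing_words(word_to_idx, glove)`.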


In [23]:
# pad every concept to the same length with the <PAD> index (0)
max_length = max([len(concept_words) for concept_words in all_concepts])
padded_concepts = []
for concept_words in all_concepts:
    concept_idx = [word_to_idx[word] for word in concept_words]
    concept_idx += [0] * (max_length - len(concept_idx))
    padded_concepts.append(concept_idx)

# sum the word embeddings of the individual words and L2-normalize
padded_concepts = torch.tensor(padded_concepts)
concept_words = embedding_matrix[padded_concepts]
concept_embs = torch.sum(concept_words, dim=1)
concept_embs = F.normalize(concept_embs)
concept_embs.size()

torch.Size([6328, 300])
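
The cell above sums the word vectors rather than averaging them. After L2 normalization the two give the same result, and the zero `<PAD>` rows do not change the sum. The plain-Python sketch below checks this on a toy 2-dimensional example; it is a stand-in for the tensor operations, and `l2_normalize`, `pooled`, and the toy matrix are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pooled(rows, reduce="sum"):
    """Sum (or average) equally sized vectors column by column."""
    total = [sum(col) for col in zip(*rows)]
    return total if reduce == "sum" else [t / len(rows) for t in total]


# toy 2-d embedding matrix; row 0 is the all-zero <PAD> vector
toy_matrix = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
concept = [1, 2, 0]  # two real words followed by one padding index

vectors = [toy_matrix[i] for i in concept]
from_sum = l2_normalize(pooled(vectors, "sum"))
from_mean = l2_normalize(pooled(vectors, "mean"))
# both are approximately [0.6, 0.8]
```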

## ConceptNetKG
Now that we have the nodes, edges, and features, we can create the ConceptNetKG object. 

In [None]:
# Params
# To automatically convert the knowledge graph to an undirected graph,
# set "bidirectional" to True. If you omit this option, ensure that
# the edges in the graph are already undirected.
from allennlp.common.params import Params
params = Params({"bidirectional": True})
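
If you leave out `"bidirectional"`, one way to confirm that an edge list is already undirected is to check that every edge has a reversed counterpart. The sketch below works on plain `(start_id, relation_id, end_id)` tuples. The helpers `is_undirected` and `add_reverse_edges` are hypothetical, and keeping the same relation id on the reversed edge is an assumption; `zsl-kg` may treat reverse edges differently internally.

```python
def is_undirected(edges):
    """True if every (start, relation, end) edge also appears as (end, relation, start)."""
    edge_set = set(edges)
    return all((end, rel, start) in edge_set for start, rel, end in edge_set)


def add_reverse_edges(edges):
    """Return the edges plus a reversed copy of each one, without duplicates."""
    edge_set = set(edges)
    # assumption: the reversed edge keeps the same relation id
    edge_set |= {(end, rel, start) for start, rel, end in edges}
    return sorted(edge_set)


# toy (start_id, relation_id, end_id) triples, made up for illustration
toy_edges = [(0, 0, 1), (0, 1, 2)]
symmetric_edges = add_reverse_edges(toy_edges)
```

For the `edges` DataFrame loaded earlier, the tuples could be obtained with `list(edges[["start_id", "relation_id", "end_id"]].itertuples(index=False, name=None))`.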
