# Using Deep Search Knowledge Graphs with PyTorch Geometric

Deep Search can construct Knowledge Graphs (KGs) by parsing large collections of documents. This tutorial shows how to download these knowledge graphs and load them locally into [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric), a popular graph neural network library.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io) if you are interested in exploring
these Deep Search capabilities.

### Set notebook parameters

In [None]:
from dsnotebooks.settings import KGProjectNotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = KGProjectNotebookSettings(kg_key="")

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJECT_KEY = notebook_settings.proj_key
KG_KEY = notebook_settings.kg_key
BASE_DIR = "./KG-data"

### Example dependencies

In [None]:
# Import standard dependencies
import os
import ssl
import json
import pandas as pd
from tqdm import tqdm
import tarfile
from urllib.request import urlopen

# Dependencies related to PyTorch Geometric
import torch
from torch_geometric.data import HeteroData
from torch_geometric.transforms import ToUndirected

# IPython utilities
from IPython.display import display

# Import deepsearch-toolkit
import deepsearch as ds
from deepsearch.core.client import DeepSearchConfig
from deepsearch.cps.client.api import CpsApi, CpsApiClient

In [None]:
# Create base directory if it does not exist
os.makedirs(BASE_DIR, exist_ok=True)


# Raise an error if the base directory is not empty
if len(os.listdir(BASE_DIR)) > 0:
    raise ValueError(
        f"BASE_DIR must be empty but found the following contents: {os.listdir(BASE_DIR)}"
    )

# Part 1: Downloading the Knowledge Graph

### Connect to Deep Search

In [None]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

### Download the knowledge graph

We use an example knowledge graph based on 753 documents from arXiv related to the search phrase "power conversion efficiency" + "organic". The `PROJECT_KEY` and `KG_KEY` parameters that were set above correspond to this knowledge graph. In general, these parameters can be obtained from the API section of the UI as described in the [documentation](https://ds4sd.github.io/deepsearch-toolkit/guide/kgs/). 

We begin by using the API to get a URL for downloading the knowledge graph. **This step takes about 6-8 minutes due to the size of our KG.**

In [None]:
# Get the download url using the API
kg = api.knowledge_graphs.get(PROJECT_KEY, KG_KEY)
download_url = kg.download()

The `download_url` can now be used to download a gzipped tar archive (`.tar.gz`) that contains the contents of our knowledge graph.

In [None]:
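# Download the knowledge graph using urlopen
zipped_file_path = os.path.join(BASE_DIR, "kg_data.tar.gz")

# NOTE: certificate verification is disabled here; only do this for hosts you trust
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

with open(zipped_file_path, "wb") as download_file, urlopen(
    download_url, context=context
) as response:
    # The content-length header may be missing; tqdm then runs without a total
    content_length = response.getheader("content-length")
    total = int(content_length) if content_length is not None else None
    with tqdm(total=total, unit="B", unit_scale=True, position=0) as pbar:
        # Read fixed-size chunks instead of splitting binary data on newline bytes
        while chunk := response.read(1024 * 1024):
            download_file.write(chunk)
            pbar.update(len(chunk))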

### Extract the contents of the downloaded knowledge graph
We use the `tarfile` module to extract the contents of the downloaded `.tar.gz` file into the directory `BASE_DIR/unzipped_data`.

In [None]:
# Save the unzipped KG
unzipped_dir = os.path.join(BASE_DIR, "unzipped_data")
os.mkdir(unzipped_dir)
with tarfile.open(zipped_file_path, "r") as f:
    f.extractall(path=unzipped_dir)
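
Before extracting an archive, it can be useful to check which members it contains. The following is a minimal, self-contained sketch rather than part of the Deep Search workflow: it builds a tiny `.tar.gz` archive in a temporary directory and then lists and extracts it with the standard `tarfile` module. The file names mirror the ones discussed in this notebook, but the file contents and the `example.tar.gz` name are made up for illustration.

```python
import io
import os
import tarfile
import tempfile

# Build a tiny archive with made-up contents (illustration only)
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "example.tar.gz")
example_files = {
    "material.jsonl": b'{"_name": "example material"}\n',
    "_edges.jsonl": b'{"symmetric": true}\n',
}
with tarfile.open(archive_path, "w:gz") as tar:
    for name, payload in example_files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Mode "r" lets tarfile detect the compression automatically
target_dir = os.path.join(tmp_dir, "unzipped")
os.mkdir(target_dir)
with tarfile.open(archive_path, "r") as tar:
    member_names = sorted(tar.getnames())  # inspect before extracting
    tar.extractall(path=target_dir)

extracted = sorted(os.listdir(target_dir))
print(member_names)
print(extracted)
```

Listing the members first makes it easy to spot unexpected paths in archives that come from sources you do not fully control.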

### Understanding the downloaded data
The extracted data in `BASE_DIR/unzipped_data` consists of several files in the [JSON Lines](https://jsonlines.org) format. As the downloaded KG is heterogeneous (meaning there are different types of nodes and edges), we get one `.jsonl` file for every node type. A special file named `_edges.jsonl` contains information about the edges in the KG.

In [None]:
# Get a list of all the files in the unzipped data
files = next(os.walk(os.path.join(BASE_DIR, "unzipped_data")))[2]
display(sorted(files))

Each line in a `.jsonl` file contains information about a node encoded in the JSON format. For example, we show the first record from the `material.jsonl` file below. This file contains information about different materials found in the documents.

In [None]:
# A function for reading the contents of a jsonl file into a pandas dataframe
def jsonl2df(filepath: str) -> pd.DataFrame:
    """
    Reads the contents of a jsonl file into a Pandas DataFrame
    :param filepath: Path to the jsonl file
    :return dataframe: A pandas DataFrame corresponding to the data stored in the file
    """
    with open(filepath, "r") as f:
        data = pd.DataFrame([json.loads(line) for line in f])
    return data
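
To make the JSON Lines layout concrete, here is a minimal sketch that parses two made-up records from an in-memory string using only the standard library. The `_hash` and `_name` keys match the attributes used later in this notebook, while the values are invented for illustration.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line
raw = io.StringIO(
    '{"_hash": "h1", "_name": "example material A"}\n'
    '{"_hash": "h2", "_name": "example material B"}\n'
)

# Parse each non-empty line independently
records = [json.loads(line) for line in raw if line.strip()]
names = [record["_name"] for record in records]
print(names)
```

As an alternative to a hand-written reader, pandas can also load this format directly with `pd.read_json(filepath, lines=True)`.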

In [None]:
# Show the first record in the materials file
materials = jsonl2df(os.path.join(BASE_DIR, "unzipped_data", "material.jsonl"))
display(materials.iloc[0])

We now take a look at the `_edges.jsonl` file. As before, each line contains information about an edge in the JSON format. Given a directed edge from `x` to `y`, we call `x` the _tail_ of this edge and `y` its _head_. A few interesting properties of these edges are listed below.
* `source_collection`: Type of the node at the tail of the edge
* `target_collection`: Type of the node at the head of the edge
* `source_hash`: Hash for the node at the tail of the edge
* `target_hash`: Hash for the node at the head of the edge
* `symmetric`: Whether the edge is directed (if `False`) or undirected (if `True`)
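
A quick way to get an overview of such edge records is to count how many edges exist for each (tail type, head type) pair. The sketch below does this with made-up records that only reuse the field names listed above; the records themselves are not taken from the actual KG.

```python
from collections import Counter

# Made-up edge records using the field names described above
example_edges = [
    {"source_collection": "material", "target_collection": "property", "symmetric": True},
    {"source_collection": "material", "target_collection": "property", "symmetric": True},
    {"source_collection": "property", "target_collection": "material", "symmetric": True},
]

# Count edges per (tail type, head type) pair
edge_type_counts = Counter(
    (edge["source_collection"], edge["target_collection"]) for edge in example_edges
)
for (source, target), count in edge_type_counts.items():
    print(f"{source} -> {target}: {count}")
```

With the edges loaded into a pandas DataFrame, `edges.groupby(["source_collection", "target_collection"]).size()` gives the same kind of summary.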

In [None]:
# Show the first few edges
edges = jsonl2df(os.path.join(BASE_DIR, "unzipped_data", "_edges.jsonl"))
edges = edges[
    [
        "source_collection",
        "target_collection",
        "source_hash",
        "target_hash",
        "symmetric",
    ]
]
display(edges.head())

# Part 2: Creating a PyTorch Geometric Knowledge Graph

Recall that the knowledge graph downloaded above is based on 753 documents from arXiv related to the search phrase "power conversion efficiency" + "organic". Deep Search has extracted various entities and relationships from these documents using a user-defined data flow. In this section, we will use a subset of the downloaded data to create a heterogeneous knowledge graph in PyTorch Geometric.

The subset we are interested in is concerned with materials and their properties. We will add the following two types of nodes from the downloaded KG to our PyTorch Geometric KG:
1. `material`: Materials discovered in the documents
2. `property`: Various material properties extracted from the documents

We will then extract edges relating these nodes from the `_edges.jsonl` file. Let us begin by initializing an empty knowledge graph.

### Initialize the KG

PyTorch Geometric uses the `HeteroData` class to represent a heterogeneous knowledge graph. Below, we create an empty `HeteroData` instance.

In [None]:
# Create an empty heterogeneous knowledge graph
hetero_kg = HeteroData()

### Add nodes to the KG
Next, we add the `material` and `property` nodes to our KG as mentioned above. In this simplified example, the nodes do not have explicit features associated with them. We therefore use one-hot encoding to set `hetero_kg[nodetype].x`, as is common practice. We also add two other attributes to each node type: `_hash`, a mapping from node hash to node index that is used for adding edges later on, and `_name`, which is used for visualization purposes. One can also set `hetero_kg[nodetype].y = ...` if node labels are available as attributes in the corresponding `.jsonl` file for the node type. We do not use any labels in this minimal example.

In [None]:
nodetypes = {
    "material": os.path.join(BASE_DIR, "unzipped_data", "material.jsonl"),
    "property": os.path.join(BASE_DIR, "unzipped_data", "property.jsonl"),
}

for nodetype in nodetypes:
    data = jsonl2df(nodetypes[nodetype])
    hetero_kg[nodetype].x = torch.eye(data.shape[0])
    hetero_kg[nodetype]["_hash"] = {
        _hash: _idx for _idx, _hash in enumerate(data["_hash"].to_list())
    }
    hetero_kg[nodetype]["_name"] = data["_name"].to_list()
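
The two ideas in the loop above, mapping each hash to a row index and using an identity matrix as one-hot features, can be illustrated without any libraries. The hashes below are made up, and plain Python lists stand in for the tensor that `torch.eye` would produce.

```python
# Made-up node hashes for illustration
node_hashes = ["hash_a", "hash_b", "hash_c"]

# Map each hash to its row index, as the loop above does for each node type
hash_to_idx = {node_hash: idx for idx, node_hash in enumerate(node_hashes)}

# One-hot features: row i has a single 1.0 in column i (an identity matrix)
num_nodes = len(node_hashes)
one_hot = [
    [1.0 if row == col else 0.0 for col in range(num_nodes)]
    for row in range(num_nodes)
]

# Look up the feature row that belongs to a given hash
idx_b = hash_to_idx["hash_b"]
print(idx_b)
print(one_hot[idx_b])
```

Note that a dense identity matrix has one entry per pair of nodes, so its memory footprint grows quadratically with the number of nodes of a given type.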

### Add edges to the KG

As we only have `material` and `property` nodes in `hetero_kg`, we search for edges between these two node types in the `_edges.jsonl` file and add them to `hetero_kg`. This adds an edge between `material A` and `property B` if `B` was mentioned in the context of `A` in at least one of the arXiv documents. We can go one step further and add the value that `property B` takes for `material A` as an edge attribute[$^{[1]}$](#footnote1), but we avoid it here to simplify the exposition.

In [None]:
# Find the relevant edges
edges = jsonl2df(os.path.join(BASE_DIR, "unzipped_data", "_edges.jsonl"))
edges = edges[
    (edges.source_collection == "material") & (edges.target_collection == "property")
]
edges = [edges["source_hash"].to_list(), edges["target_hash"].to_list()]

# Create the edge index
edge_index = []
for hash_mat, hash_prop in zip(*edges):
    edge_index.append(
        [
            hetero_kg["material"]["_hash"][hash_mat],
            hetero_kg["property"]["_hash"][hash_prop],
        ]
    )

# Add edge index to the KG
hetero_kg["material", "mat2prop", "property"].edge_index = (
    torch.tensor(edge_index).long().t()
)

# Make the graph undirected
hetero_kg = ToUndirected()(hetero_kg)
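
The `edge_index` built above has shape `2 x num_edges`: the first row holds the index of the tail node and the second row the index of the head node. The following library-free sketch shows this layout with made-up hashes. It also skips edges whose endpoints are missing from the node maps, which is an assumption about how one might want to handle such cases; the cell above would raise a `KeyError` instead.

```python
# Made-up hash-to-index maps and edges, for illustration only
material_idx = {"m_a": 0, "m_b": 1}
property_idx = {"p_x": 0, "p_y": 1, "p_z": 2}
hash_pairs = [("m_a", "p_x"), ("m_a", "p_z"), ("m_b", "p_y"), ("m_b", "p_unknown")]

# Keep only edges whose endpoints both exist in the node maps
pairs = [
    (material_idx[m], property_idx[p])
    for m, p in hash_pairs
    if m in material_idx and p in property_idx
]

# Transpose the list of pairs: row 0 holds tail indices, row 1 holds head indices
forward = [list(row) for row in zip(*pairs)]

# A reverse edge type simply swaps the two rows
backward = [forward[1], forward[0]]

print(forward)
print(backward)
```

Swapping the rows is, conceptually, what making a bipartite edge type undirected amounts to. For edges between two different node types, `ToUndirected` is documented to add a separate reverse edge type, typically named with a `rev_` prefix, rather than mixing both directions into one edge type.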

### Summarize the created KG

In [None]:
# Summarize the KG
print("Number of nodes")
for node_type in hetero_kg.node_types:
    print(f"\t{node_type} -> {hetero_kg[node_type].num_nodes}")
print(f"Total number of nodes: {hetero_kg.num_nodes}")

print("\nNumber of edges")
for edge_type in hetero_kg.edge_types:
    print(f"\t{edge_type} -> {hetero_kg[edge_type].num_edges}")
print(f"Total number of edges: {hetero_kg.num_edges}")

Note that the constructed KG is very sparse: the number of edges is almost equal to the number of nodes. This is because Deep Search extracted several properties and materials, but each material was linked to only a handful of properties in the document collection.
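
One way to quantify sparsity in a bipartite graph is its density, that is, the fraction of all possible tail-head pairs that are actually connected. The sketch below uses a hypothetical helper, `bipartite_density`, and hypothetical counts; neither is taken from the actual KG.

```python
def bipartite_density(num_source: int, num_target: int, num_edges: int) -> float:
    """Fraction of possible source-target pairs that are connected."""
    possible = num_source * num_target
    return num_edges / possible if possible else 0.0


# Hypothetical counts, not taken from the actual KG
density = bipartite_density(num_source=1000, num_target=500, num_edges=1500)
print(f"{density:.4f}")
```

When applying this to a graph that has been made undirected, count the edges of only one direction, since the reverse edges would otherwise be counted twice.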

### Visualizing the KG

Note that our KG is a bipartite graph as it only contains edges between `material` and `property` nodes. We visualize a small subset of this KG by selecting four interesting `material` nodes and listing up to four randomly chosen `property` nodes that they are connected to.

In [None]:
# Select materials to display
materials = ["perovskite/Si", "O(2) Ti(1)", "A(1) I(3) M(1) Pb(1)", "O(1) Zn(1)"]
mat_idx = [hetero_kg["material"]["_name"].index(mat) for mat in materials]

# Get properties corresponding to each material
properties = dict()
for m_idx, material in zip(mat_idx, materials):
    current_edges = (
        hetero_kg["material", "mat2prop", "property"].edge_index[0, :] == m_idx
    )
    prop_idx = hetero_kg["material", "mat2prop", "property"].edge_index[
        1, current_edges
    ]
    properties[material] = [
        hetero_kg["property"]["_name"][idx] for idx in prop_idx.tolist()
    ]

# Show up to four randomly chosen properties for each material
df = pd.DataFrame()
for mat, prop in properties.items():
    # Restrict to four properties
    if len(prop) > 4:
        prop = [prop[idx] for idx in torch.randperm(len(prop)).tolist()[:4]]

    # Add the row to the dataframe
    curr_dict = dict(
        [("material", [mat])]
        + [(f"Property{p_idx}", [p]) for p_idx, p in enumerate(prop)]
    )
    curr_df = pd.DataFrame(curr_dict)
    df = pd.concat([df, curr_df]).reset_index(drop=True)

display(df)
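
The lookup above scans the full edge index once per material. When many materials need to be queried, it may be more convenient to group the edges into an adjacency mapping in a single pass. The sketch below shows this on a made-up edge index stored as nested lists; for a tensor, `edge_index.tolist()` yields the same nested-list shape.

```python
from collections import defaultdict

# Made-up edge index: row 0 = material indices, row 1 = property indices
example_edge_index = [[0, 0, 1, 2, 2, 2], [3, 1, 0, 2, 3, 1]]

# Group property indices by material index in one pass over the edges
neighbours = defaultdict(list)
for mat, prop in zip(*example_edge_index):
    neighbours[mat].append(prop)

print(dict(neighbours))
```

Each value in `neighbours` can then be mapped to property names in the same way as `prop_idx` is in the cell above.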

Let's take another example. A `perovskite/Si` tandem is useful for making efficient solar cells. A few interesting properties for this material include its power conversion efficiency (PCE) and its band gap. Below we confirm that these properties are indeed linked to `perovskite/Si` in the KG.

In [None]:
# Find properties linked to perovskite/Si
m_idx = hetero_kg["material"]["_name"].index("perovskite/Si")
perovskite_edges = (
    hetero_kg["material", "mat2prop", "property"].edge_index[0, :] == m_idx
)
prop_idx = hetero_kg["material", "mat2prop", "property"].edge_index[1, perovskite_edges]
properties = [hetero_kg["property"]["_name"][idx] for idx in prop_idx.tolist()]

# Check if the desired properties are linked
print(
    f'Is perovskite/Si linked to power conversion efficiency: {"power conversion efficiency" in properties}'
)
print(f'Is perovskite/Si linked to band gap: {"band gap" in properties}')

This concludes the process of creating a simple PyTorch Geometric KG from the KG downloaded from Deep Search. One can now train powerful graph neural networks for various downstream tasks like node classification and link prediction using `hetero_kg`.

# Footnotes

<a name='footnote1'></a> 
### \[1\] Edges with attributes

Edges with attributes can be stored as additional nodes in Deep Search. For example, consider four nodes `{A, B, X, Y}` where `{A, B}` have the same node type (say `author`) and `{X, Y}` have another node type (say `paper`). The relationship between nodes of type `author` and `paper` can be encoded as nodes of type `author-to-paper`. If author `A` has written paper `X`, then there will be a node `A_X` of type `author-to-paper`, and both `A` and `X` will be connected to `A_X` in the KG.

Now suppose we want to add more information about the relationship between `author A` and `paper X`. For example, we may be interested in knowing the index at which `A` appears in the author list of `X`. Such information can be added as attributes of node `A_X` in Deep Search. One can then easily parse these nodes to add edge attributes in PyTorch Geometric.
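
As a rough sketch of that parsing step, the snippet below collapses made-up `author-to-paper` link nodes into an edge index plus a parallel list of edge attributes. The field names (`author`, `paper`, `author_position`) are hypothetical and chosen only for illustration. In a real export, the connection between a link node and its two endpoints would first have to be recovered from `_edges.jsonl`.

```python
# Made-up link nodes of type author-to-paper, each carrying one attribute
link_nodes = [
    {"author": "A", "paper": "X", "author_position": 1},
    {"author": "B", "paper": "X", "author_position": 2},
    {"author": "A", "paper": "Y", "author_position": 1},
]
author_idx = {"A": 0, "B": 1}
paper_idx = {"X": 0, "Y": 1}

# One column per edge: row 0 = author indices, row 1 = paper indices
edge_index = [
    [author_idx[node["author"]] for node in link_nodes],
    [paper_idx[node["paper"]] for node in link_nodes],
]

# One attribute row per edge, in the same order as the edge columns
edge_attr = [[node["author_position"]] for node in link_nodes]

print(edge_index)
print(edge_attr)
```

In PyTorch Geometric, such per-edge values are conventionally stored in an `edge_attr` tensor with one row per edge, next to the corresponding `edge_index`.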