# Analysis and Visualization of Sinopia Graphs

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import json

import rdflib
import requests
import kglab
import helpers
import widgets

## PCC BIBFRAME Work-Instance-Item Graph
We can use the [Sinopia API](https://github.com/ld4p/sinopia_api) to construct a small RDF graph made up of a [BIBFRAME](https://bibframe.org/) Work, Instance, and Item, and then create a knowledge graph for analyzing and visualizing these entities along with their properties.

### Step One: Collect Sinopia URLs
First we collect the URLs of the following resources in Sinopia's stage environment. In a later step we will download them and parse the RDF
contained in the resulting JSON.

- **BIBFRAME Work**: https://api.stage.sinopia.io/resource/489c7fef-b725-4765-a60e-e7fbfade7918
- **BIBFRAME Work AdminMetadata**: https://api.stage.sinopia.io/resource/18da6c93-b005-479d-a86d-ea8c23746828
- **BIBFRAME Instance**: https://api.stage.sinopia.io/resource/6eea39a3-aa07-43fa-8cc0-9c9192b484b9
- **BIBFRAME Instance AdminMetadata**: https://api.stage.sinopia.io/resource/7b17c000-56e9-43a0-b884-6b377ff4755c
- **BIBFRAME Item**: https://api.stage.sinopia.io/resource/ab56952c-db3c-4179-b913-d2014676d190

In [5]:
sinopia_stage_urls = [
    "https://api.stage.sinopia.io/resource/489c7fef-b725-4765-a60e-e7fbfade7918",
    "https://api.stage.sinopia.io/resource/18da6c93-b005-479d-a86d-ea8c23746828",
    "https://api.stage.sinopia.io/resource/6eea39a3-aa07-43fa-8cc0-9c9192b484b9",
    "https://api.stage.sinopia.io/resource/7b17c000-56e9-43a0-b884-6b377ff4755c",
    "https://api.stage.sinopia.io/resource/ab56952c-db3c-4179-b913-d2014676d190",
]

### Step Two: Create RDF Namespaces and Graph

Here we create two namespaces for the BIBFRAME and Sinopia vocabularies and bind them to the new Sinopia graph.

In [9]:
BIBFRAME = rdflib.Namespace("http://id.loc.gov/ontologies/bibframe/")
SINOPIA = rdflib.Namespace("http://sinopia.io/vocabulary/")

sinopia_pcc_graph = rdflib.Graph()
sinopia_pcc_graph.namespace_manager.bind("bf", BIBFRAME)
sinopia_pcc_graph.namespace_manager.bind("sinopia", SINOPIA)

### Step Three: Download and Parse RDF into the Graph

We now loop through the list of Sinopia URLs, query the Sinopia Stage API endpoint for each resource, and add the returned JSON-LD to the graph.

In [10]:
print(f"Starting triples in our RDF graph {len(sinopia_pcc_graph)}")
for sinopia_url in sinopia_stage_urls:
    api_result = requests.get(sinopia_url)
    if api_result.status_code < 300:
        resource_json = api_result.json()
        # Need to serialize the JSON to a string for parsing
        sinopia_pcc_graph.parse(
            data=json.dumps(resource_json["data"]), format="json-ld"
        )
    else:
        print(
            f"ERROR {api_result.status_code} for {sinopia_url}\nDetail {api_result.text}"
        )
print(f"Finished ingesting resources, size of graph {len(sinopia_pcc_graph)}")

Starting triples in our RDF graph 0
Finished ingesting resources, size of graph 136
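
The `json.dumps` call in the loop above is needed because the API response wraps the JSON-LD under a `data` key as a Python object, while rdflib's `parse(data=...)` expects serialized text. A minimal sketch of that unwrapping step, using only the standard library, might look like the following. The `extract_jsonld` helper and the `sample_response` payload are made up for illustration and are not real Sinopia output; only the `data` key comes from the notebook.

```python
import json


def extract_jsonld(resource_json: dict) -> str:
    """Return the JSON-LD held under the "data" key as a JSON string."""
    if "data" not in resource_json:
        raise KeyError('response has no "data" key')
    return json.dumps(resource_json["data"])


# Hypothetical response body: only the "data" key is taken from the notebook,
# and the node inside it is an invented example.
sample_response = {
    "data": [
        {
            "@id": "https://example.org/resource/1",
            "@type": ["http://id.loc.gov/ontologies/bibframe/Work"],
        }
    ]
}

jsonld_text = extract_jsonld(sample_response)
print(type(jsonld_text).__name__)
```

The resulting string is what would be handed to `sinopia_pcc_graph.parse(data=..., format="json-ld")`.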


### Step Four: Create a Knowledge Graph

Now that we have our BIBFRAME Work, Instance, and Item parsed into our RDF graph, we can create a knowledge graph from it.

In [11]:
sinopia_bf_kg = kglab.KnowledgeGraph(
    name="Sinopia Stage PCC Knowledge Graph", import_graph=sinopia_pcc_graph
)

### Step Five: Basic Graph Analysis

Our simple PCC BIBFRAME Knowledge Graph provides a number of methods that allow us to analyze the shape and structure of the RDF through a "graph" lens.

In graph terminology, a **Node** is the fundamental unit and is represented in RDF by a URI or blank node in the *subject* role, or by a URI, blank node, or Literal in the *object* role. An **Edge** is a link between two nodes; in RDF terms, it is the *predicate* connecting the *subject* with the *object*. Because the order within a triple matters, RDF is known as a directed graph.
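
To make these definitions concrete, here is a small sketch in plain Python that counts nodes and edges in a hand-written list of triples. The triples and their prefixed names are invented for illustration, and the counting follows the definitions in the paragraph above rather than any particular library's internals.

```python
# Each triple is (subject, predicate, object); the values are invented.
triples = [
    ("ex:work1", "rdf:type", "bf:Work"),
    ("ex:work1", "bf:hasInstance", "ex:instance1"),
    ("ex:instance1", "rdf:type", "bf:Instance"),
    ("ex:instance1", "bf:instanceOf", "ex:work1"),
]

# Every triple contributes exactly one directed edge (its predicate).
edge_count = len(triples)

# A node is anything that appears in the subject or the object role.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
node_count = len(nodes)

# Direction matters: swapping subject and object gives a different triple.
reversed_triple = ("ex:instance1", "bf:hasInstance", "ex:work1")
is_reversed_present = reversed_triple in triples

print(f"Edges: {edge_count}, Nodes: {node_count}")
print(f"Reversed triple present: {is_reversed_present}")
```

Note that `ex:work1` appears as both a subject and an object but is counted as a single node.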



In [14]:
measure = kglab.Measure()
measure.measure_graph(sinopia_bf_kg)
print("Edges: {:,}\n".format(measure.get_edge_count()))
print("Nodes: {:,}\n".format(measure.get_node_count()))

Edges: 136

Nodes: 72



We can confirm that the number of edges in our PCC BIBFRAME graph equals the number of predicates. Each triple contributes exactly one predicate, so we create a list of all of the predicates and count its members with Python's built-in `len` function.

In [25]:
predicates = list(sinopia_pcc_graph.predicates())
print(
    f"""Total number of predicates: {len(predicates)}, 
         Test for equality: {len(predicates) == measure.get_edge_count()}"""
)

Total number of predicates: 136, 
         Test for equality: True


## Exercise One: Create a Sinopia BIBFRAME Knowledge Graph
Using any of the three Sinopia environments (development, stage, or production), find a BIBFRAME Work with a corresponding BIBFRAME Instance and then replicate the steps above to create a Sinopia BIBFRAME Knowledge Graph.

# Visualization of the PCC BIBFRAME Graph

In [27]:
sinopia_pcc_subgraph = kglab.SubgraphTensor(sinopia_bf_kg)
pyvis_graph = sinopia_pcc_subgraph.build_pyvis_graph(notebook=True)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("pcc_bf.fig01.html")

<hr>

# Sinopia Stage Graph Analysis and Visualizations
First we will load the saved knowledge graph and then do an analysis similar to the one above.

In [3]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")
print(f"Total triples {len(stage_kg.rdf_graph()):,}")

http://desktop.loc.gov/search?view=document&id=Infobasedcrmg0Dash0Dash0Dash247&hl=true&fq=allresources|true# does not look like a valid URI, trying to serialize this will break.
ld4p:RT:bf2:2D graphic material:Item does not look like a valid URI, trying to serialize this will break.
urn:ld4p:qa:gettyaat:Objects__Object_Groupings and Systems does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test does not look like a valid URI, trying to serialize this will break.


<kglab.kglab.KnowledgeGraph at 0x7f8a9ffbed90>

First, let's see how many edges and nodes there are across all of the resources in Sinopia stage:

In [6]:
stage_measure = kglab.Measure()
stage_measure.measure_graph(stage_kg)

In [8]:
print("Edges: {:,}\n".format(stage_measure.get_edge_count()))
print("Nodes: {:,}\n".format(stage_measure.get_node_count()))

Edges: 506,033

Nodes: 130,486



## How Sparse is the Sinopia Stage Graph?
In graph theory, sparseness describes how few edges a graph has relative to its number of nodes. From the **edge** and **node** counts above, we would expect the number of connections per node to be around 4:1. Let us see if we can answer this question using some of the other features.
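
As a rough back-of-the-envelope check, the sketch below works out the edges-per-node ratio from the counts printed above, along with one common density measure: the edge count divided by the maximum number of edges a simple directed graph on the same nodes could have. Treating the RDF graph as a simple directed graph is an assumption made here for illustration. RDF allows several predicates between the same pair of nodes and Literals never act as subjects, so the density figure is only indicative.

```python
# Counts copied from the Measure output above.
edges = 506_033
nodes = 130_486

# Average number of edges per node.
edges_per_node = edges / nodes

# Maximum edges in a simple directed graph without self-loops: n * (n - 1).
max_edges = nodes * (nodes - 1)
density = edges / max_edges

print(f"Edges per node: {edges_per_node:.2f}")
print(f"Density: {density:.2e}")
```

A ratio just under four edges per node combined with a density far below 1 indicates a very sparse graph.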

## Visualization using Pandas DataFrames

With our stage knowledge graph, we can query the graph with SPARQL and return the results as a pandas DataFrame. DataFrames offer a number of built-in graphs that can be useful to 
