# Sinopia Knowledge Graph
This notebook creates a Sinopia [kglab](https://derwen.ai/docs/kgl/) Knowledge Graph.

In [2]:
! pip install kglab

Collecting kglab
  Using cached kglab-0.4.0-py3-none-any.whl (39 kB)
Collecting urlpath>=1.1.7
  Using cached urlpath-1.1.7.tar.gz (12 kB)
Collecting gcsfs>=0.7.1
  Downloading gcsfs-2021.5.0-py2.py3-none-any.whl (24 kB)
Collecting pyvis>=0.1.7.0
  Using cached pyvis-0.1.9-py3-none-any.whl (23 kB)
Collecting csvwlib>=0.3.2
  Using cached csvwlib-0.3.2-py3-none-any.whl (32 kB)
Collecting rdflib-jsonld>=0.5.0
  Using cached rdflib_jsonld-0.5.0-py2.py3-none-any.whl
Collecting owlrl>=5.2.1
  Using cached owlrl-5.2.1-py3-none-any.whl (56 kB)
Collecting pslpython>=2.2.2
  Using cached pslpython-2.2.2-py3-none-any.whl (20.4 MB)
Collecting rdflib>=5.0.0
  Using cached rdflib-5.0.0-py3-none-any.whl (231 kB)
Collecting pylev>=1.3
  Using cached pylev-1.3.0-py2.py3-none-any.whl (4.9 kB)
Collecting icecream>=2.0
  Using cached icecream-2.1.0-py2.py3-none-any.whl (8.0 kB)
Collecting pynvml>=8.0.4
  Using cached pynvml-8.0.4-py3-none-any.whl (36 kB)
Collecting fsspec[gs,s3]>=0.8.5
  Downloading fssp

In [1]:
import json
from datetime import datetime
from typing import Dict, List, Optional

import pandas as pd

import kglab
import rdflib
import requests

Defines Sinopia-specific namespaces in a dictionary and create a SINOPIA rdflib.Namespace used in each resource's graph. 

In [2]:
namespaces = {
    "bf": "http://id.loc.gov/ontologies/bibframe/",
    "bflc": "http://id.loc.gov/ontologies/bflc/",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sinopia": "http://sinopia.io/vocabulary/"
}

SINOPIA = rdflib.Namespace("http://sinopia.io/vocabulary/")

**TODO:** These functions need to be moved to a local Python module for use in other notebooks. 

In [3]:
def from_api(api_url: str) -> Dict:
    """Takes a Sinopia API endpoint URI, extracts each resource and
    template, and returns a dictionary with two lists, a resources and a
    templates, and the total number of resources harvested from the api.

    @param api_url -- URI to Sinopia API endpoint
    @param group -- optional Group name
    """

    def add_resource(resource):
        if not 'data' in resource:
            print(f"\n{resource.get('uri')} missing data")
            return
        output["total"] += 1
        graph = rdflib.Graph()
        for key, url in namespaces.items():
            graph.namespace_manager.bind(key, url)
        jsonld = json.dumps(resource.pop("data")).encode()
        try:
            graph.parse(data=jsonld, format="json-ld")
        except Exception as error:
            print(f"Failed to parse {resource}\n{error}")
            return
        payload = {"graph": graph, "meta": resource}
        if "sinopia:template:resource" in resource.get("templateId"):
            output["templates"].append(payload)
        else:
            output["resources"].append(payload)

    output = {"resources": [], "templates": [], "total": 0}
    start = datetime.utcnow()
    print(f"Started harvest of resources at {start} for {api_url}")
    initial = requests.get(f"{api_url}")
    print("0", end="")
    for row in initial.json().get("data"):
        add_resource(row)
    next_link = initial.json().get("links").get("next")
    while 1:
        result = requests.get(next_link)
        if result.status_code > 300:
            break
        payload = result.json()
        new_next = payload.get("links").get("next")
        if new_next is None:
            new_text = payload.get("links").get("first")
        if new_next == next_link or new_next is None:
            break
        for row in payload.get("data"):
            add_resource(row)
        next_link = new_next
        print(".", end="")
        if not output["total"] % 250:
            print(f"{output['total']}", end="")
    end = datetime.utcnow()
    print(f"\nFinished total time {(end-start).seconds / 60.}")
    return output

In [4]:
from_api?

[0;31mSignature:[0m [0mfrom_api[0m[0;34m([0m[0mapi_url[0m[0;34m:[0m [0mstr[0m[0;34m)[0m [0;34m->[0m [0mDict[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Takes a Sinopia API endpoint URI, extracts each resource and
template, and returns a dictionary with two lists, a resources and a
templates, and the total number of resources harvested from the api.

@param api_url -- URI to Sinopia API endpoint
@param group -- optional Group name
[0;31mFile:[0m      /var/folders/zd/nj5fbhqn72qgg80cstzbsjl80000gq/T/ipykernel_12060/594893128.py
[0;31mType:[0m      function


In [5]:
stage_rdf = from_api("https://api.stage.sinopia.io/resource")

Started harvest of resources at 2021-10-11 22:25:48.678744 for https://api.stage.sinopia.io/resource
0.........250.....

https://api.stage.sinopia.io/resource/this is a test does not look like a valid URI, trying to serialize this will break.


.....500.........

ld4p:RT:bf2:2D graphic material:Item does not look like a valid URI, trying to serialize this will break.


.750..........1000....

https://api.stage.sinopia.io/resource/this test does not look like a valid URI, trying to serialize this will break.


......1250........
None missing data
.

https://api.stage.sinopia.io/resource/my fav template does not look like a valid URI, trying to serialize this will break.


.
Finished total time 0.65


In [8]:
kglab.KnowledgeGraph?

[0;31mInit signature:[0m
[0mkglab[0m[0;34m.[0m[0mKnowledgeGraph[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0;34m*[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mname[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'generic'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mbase_uri[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlanguage[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'en'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0muse_gpus[0m[0;34m:[0m [0mbool[0m [0;34m=[0m [0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mimport_graph[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mConjunctiveGraph[0m[0;34m,[0m [0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mDataset[0m[0;34m,[0m [0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mGraph[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0m

## Creating a Knowledge Graph
Here we iterate through the Stage resources, retrieve the RDF graph, and then we create an instance of the kglab
Knowledge Graph.

In [10]:
stage = rdflib.ConjunctiveGraph()
for row in stage_rdf['resources']:
    stage += row.get('graph')

    
kg = kglab.KnowledgeGraph(
    name = "Sinopia Stage KG",
    base_uri = "https://api.stage.sinopia.io/resource/",
    namespaces = namespaces,
    import_graph=stage
)

### Graph Measuresments

In [19]:
measure = kglab.Measure()

In [13]:
measure.measure_graph(kg)
print("edges: {:,}\n".format(measure.get_edge_count()))
print("nodes: {:,}\n".format(measure.get_node_count()))

edges: 629,278

nodes: 67,794



## SPARQL Query as a Pandas Data
One of the exciting capabilities of our new Stage Knowledge graph is that we can query our Stage datastore with SPARQL and runturn the result as a [Pandas](https://pandas.pydata.org/) dataframe. The dataframe can now be fead into a [FastAI](), [Pytorch](), or [Tensorflow]() models for further analysis. 

In [22]:
all_triples_query = """
SELECT ?subject ?predicate ?object 
WHERE {
  ?subject ?predicate ?object .
}"""

In [23]:
df = kg.query_as_df(all_triples_query)

In [None]:
With each triple in our 

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314639 entries, 0 to 314638
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   subject    314639 non-null  object
 1   predicate  314639 non-null  object
 2   object     314639 non-null  object
dtypes: object(3)
memory usage: 7.2+ MB


In [25]:
df.head()

Unnamed: 0,subject,predicate,object
0,_:Nf369836ee2334c118a2adccfe5c6ba7b,rdf:type,bf:Title
1,_:Nf5b660a442b24fd88e534e4698b045d6,owl:sameAs,<https://id.loc.gov/vocabulary/languages/eng>
2,_:b1,bf:Agent,<http://id.loc.gov/authorities/names/n96089174>
3,_:N0dc076eacd4b4b7c99cea63fc1f7821c,bf:code,n
4,_:Nfec989756ae2489eb7f6862c247d2c43,<http://id.loc.gov/ontologies/bflc/encodingLevel>,<https://id.loc.gov/vocabulary/menclvl/f>


In [27]:
df.tail()

Unnamed: 0,subject,predicate,object
314634,_:N2d11c7095c9648158f8bef53a28ccf71,rdfs:label,Bayless Books
314635,<https://api.stage.sinopia.io/resource/ac2a379...,bf:contribution,_:b0
314636,_:N5ea15b82eec6415aa35e687c27da1d68,owl:sameAs,<http://viaf.org/viaf/174954981>
314637,_:Nbd85229aabec4e22b1276c1f3d33b3a0,rdf:type,bf:Status
314638,_:N58c81ba54d414f7e8b0130222c7eaf33,rdfs:label,rml.py SNAPSHOT: 2021-04-14T14:08:43+00:00


In [26]:
df.sample(10)

Unnamed: 0,subject,predicate,object
39321,<https://api.stage.sinopia.io/resource/549669f...,bf:identifiedBy,_:b0
13344,_:N9a77e3270d23482d8213a8604dbefc37,<http://www.loc.gov/mads/rdf/v1#authoritativeL...,New England--Juvenile fiction
304142,_:Nb82c376bca76435a81ae33fde34d6574,<http://id.loc.gov/ontologies/bflc/encodingLevel>,<https://id.loc.gov/vocabulary/menclvl/f>
6843,<https://api.stage.sinopia.io/resource/b30338e...,bf:carrier,_:N9c9f1818deff4b2cb0c40a3c032ac2a0
92804,_:b148,<http://id.loc.gov/ontologies/bflc/relation>,_:b149
54300,_:Nd01a6202d40e4154bafdc4c798a3f3f4,owl:sameAs,<https://id.loc.gov/vocabulary/languages/eng>
267979,_:b31,rdf:type,bf:ShelfMarkLcc
298605,_:Na4aa3d34cb8d4aa5b1ca9ac1a80d0711,owl:sameAs,<http://id.loc.gov/authorities/names/no2011121...
210377,<https://api.stage.sinopia.io/resource/b1e12ed...,bf:heldBy,<http://id.loc.gov/vocabulary/organizations/cst>
120149,_:Nbcf26e1ea76a4ed68d32d5100a2ab2ae,rdf:type,bf:GenerationProcess


## Visualizations


In [None]:
subgraph = kglab.SubgraphTensor(kg)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig03.html")

In [39]:
second_kg = kglab.KnowledgeGraph(
    name = "Sinopia Stage Second KG",
    base_uri = "https://api.stage.sinopia.io/resource/",
    namespaces = namespaces,
    import_graph=dev_rdf['resources'][1]['graph']
)

In [40]:
subgraph = kglab.SubgraphTensor(second_kg)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig03.html")

## Stanford Knowledge Base


In [5]:
stanford_url = "https://api.stage.sinopia.io/resource?group=stanford"

In [24]:
stanford_rdf = from_api(stanford_url)

Started harvest of resources at 2021-05-27 16:56:10.941766 for https://api.stage.sinopia.io/resource?group=stanford
0.........250..........500......
Finished total time 0.2833333333333333


In [25]:
len(stanford_rdf['resources'])

609

In [10]:
len(stanford_rdf['resources'])

585

In [13]:
stanford_graph = rdflib.ConjunctiveGraph()
for row in stanford_rdf['resources']:
    stanford_graph += row['graph']

In [15]:
stanford_kb = kglab.KnowledgeGraph(
    name = "Sinopia Stage Stanford KG",
    base_uri = "https://api.stage.sinopia.io/resource/",
    namespaces = namespaces,
    import_graph=stanford_graph
)

In [21]:
measure.measure_graph(stanford_kb)
print("edges: {:,}\n".format(measure.get_edge_count()))
print("nodes: {:,}\n".format(measure.get_node_count()))

edges: 16,289

nodes: 1,952



## Work-Instance-Item Visualization

In [27]:
from rdflib.plugins.stores.sparqlconnector import SPARQLConnector

In [28]:
qa_pcc_endpoint = SPARQLConnector('http://services.ld4l.org/fuseki/PCC/sparql',
                                  returnFormat='json')

In [30]:
query = qa_pcc_endpoint.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 10")

In [31]:
for row in query:
    print(row)

(rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//Provenance/PCCocm01536849'), rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://www.w3.org/ns/prov#Entity'))
(rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//Provenance/PCCocm01536849'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#label'), rdflib.term.Literal('PCCocm01536849'))
(rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//Provenance/PCCocm01536849'), rdflib.term.URIRef('http://www.w3.org/ns/prov#wasAttributedTo'), rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//Agent/PCC'))
(rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//Agent/PCC'), rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://www.w3.org/ns/prov#Agent'))
(rdflib.term.URIRef('https://svde.org/pcc/rdfBibframe//AdminMetadata/PCCocm01536849'), rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), rdflib.term.URIRef('http://id.lo

In [None]:
def bf_entities(uri: str) -> rdflib.Graph:
    entity_
    

In [5]:
mla_url = "https://api.stage.sinopia.io/resource?group=mla"

In [8]:
mla_rdf = from_api(mla_url)

Started harvest of resources at 2021-06-24 17:13:08.629930 for https://api.stage.sinopia.io/resource?group=mla
0
Finished total time 0.0


In [10]:
mla_rdf

{'resources': [{'graph': <Graph identifier=N75b9382dcf5544bd8d8c22adeab1a020 (<class 'rdflib.graph.Graph'>)>,
   'meta': {'user': 'metadatum',
    'group': 'mla',
    'templateId': 'PMO:bf2:kk:workWithContributionLDWG',
    'types': ['http://id.loc.gov/ontologies/bibframe/Work'],
    'bfAdminMetadataRefs': [],
    'bfItemRefs': [],
    'bfInstanceRefs': [],
    'bfWorkRefs': [],
    'id': '7f317d00-ceeb-41bd-b327-530a45c0a847',
    'uri': 'https://api.stage.sinopia.io/resource/7f317d00-ceeb-41bd-b327-530a45c0a847',
    'timestamp': '2021-06-24T03:16:59.891Z'}},
  {'graph': <Graph identifier=Nd60f840c8a004b648fc061903f268391 (<class 'rdflib.graph.Graph'>)>,
   'meta': {'user': 'metadatum',
    'group': 'mla',
    'templateId': 'PMO:bf2:kk:workWithContributionLDWG',
    'types': ['http://id.loc.gov/ontologies/bibframe/Work'],
    'bfAdminMetadataRefs': [],
    'bfItemRefs': [],
    'bfInstanceRefs': [],
    'bfWorkRefs': [],
    'id': '7475d919-0fb2-4482-a5c1-e226c7270903',
    'uri': 'h

## Extract MLA Resources and Save as Turtle

In [7]:
namespaces['pmo'] = "http://performedmusicontology.org/ontology/"
namespaces['rdaw'] = "http://rdaregistry.info/Elements/w/"
mla_result = from_api("https://api.stage.sinopia.io/resource?group=mla")

Started harvest of resources at 2021-08-09 21:08:11.066624 for https://api.stage.sinopia.io/resource?group=mla
0.
Finished total time 0.03333333333333333


In [8]:
len(mla_result['resources'])

38

In [13]:
print(mla_result['resources'][0]['graph'].serialize(format='turtle').decode())

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix mads: <http://www.loc.gov/mads/rdf/v1#> .
@prefix pmo: <http://performedmusicontology.org/ontology/> .
@prefix rdaw: <http://rdaregistry.info/Elements/w/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sinopia: <http://sinopia.io/vocabulary/> .

<https://api.stage.sinopia.io/resource/be4ff759-5ba9-4c77-8cfc-a249056f3b08> a bf:Audio ;
    bf:capture [ a bf:Capture ;
            bf:date "1994-05"@eng ;
            bf:place [ a bf:Place ;
                    bf:place <http://id.loc.gov/authorities/names/n82101833> ] ] ;
    bf:contribution [ a bf:Contribution ;
            bf:agent [ a bf:Agent,
                        bf:Person ;
                    mads:isIdentifiedByAuthority <http://id.loc.gov/authorities/names/n87818155> ] ;
            bf:role <http://id.loc.gov/vocabulary/relators/itr> ;
            pmo:usesMediumOfPerformance <http://id.loc

### Saves Serialized Resources as Turtle into Zip file

In [9]:
import zipfile
with zipfile.ZipFile("mla-resources.zip", "w") as zip_file:
    for resource in mla_result['resources']:
        zip_file.writestr(f"{resource['meta'].get('id')}.ttl", 
                          resource['graph'].serialize(format='turtle'))