# Sinopia Knowledge Graph
This notebook creates a Sinopia [kglab](https://derwen.ai/docs/kgl/) Knowledge Graph.

In [2]:
! pip install kglab

Collecting kglab
  Using cached kglab-0.4.0-py3-none-any.whl (39 kB)
Collecting urlpath>=1.1.7
  Using cached urlpath-1.1.7.tar.gz (12 kB)
Collecting gcsfs>=0.7.1
  Downloading gcsfs-2021.5.0-py2.py3-none-any.whl (24 kB)
Collecting pyvis>=0.1.7.0
  Using cached pyvis-0.1.9-py3-none-any.whl (23 kB)
Collecting csvwlib>=0.3.2
  Using cached csvwlib-0.3.2-py3-none-any.whl (32 kB)
Collecting rdflib-jsonld>=0.5.0
  Using cached rdflib_jsonld-0.5.0-py2.py3-none-any.whl
Collecting owlrl>=5.2.1
  Using cached owlrl-5.2.1-py3-none-any.whl (56 kB)
Collecting pslpython>=2.2.2
  Using cached pslpython-2.2.2-py3-none-any.whl (20.4 MB)
Collecting rdflib>=5.0.0
  Using cached rdflib-5.0.0-py3-none-any.whl (231 kB)
Collecting pylev>=1.3
  Using cached pylev-1.3.0-py2.py3-none-any.whl (4.9 kB)
Collecting icecream>=2.0
  Using cached icecream-2.1.0-py2.py3-none-any.whl (8.0 kB)
Collecting pynvml>=8.0.4
  Using cached pynvml-8.0.4-py3-none-any.whl (36 kB)
Collecting fsspec[gs,s3]>=0.8.5
  Downloading fssp

In [3]:
import json
from datetime import datetime
from typing import Dict, List, Optional

import pandas as pd

import kglab
import rdflib
import requests

Defines Sinopia-specific namespaces in a dictionary and create a SINOPIA rdflib.Namespace used in each resource's graph. 

In [4]:
namespaces = {
    "bf": "http://id.loc.gov/ontologies/bibframe/",
    "bflc":"",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sinopia": "http://sinopia.io/vocabulary/"
}

SINOPIA = rdflib.Namespace("http://sinopia.io/vocabulary/")

**TODO:** These functions need to be moved to a local Python module for use in other notebooks. 

In [None]:


def from_api(api_url: str, group=None: str) -> Dict:
    """Takes the new Sinopia API endpoint URI, extracts each resource and
    template, and returns a dictionary with two lists, a resources and a
    templates, and the total number of resources harvested from the api.

    @param api_url -- URI to Sinopia API endpoint
    @param group -- optional Group name
    """

    def add_resource(resource):
        if not 'data' in resource:
            print(f"\n{resource.get('uri')} missing data")
            return
        output["total"] += 1
        graph = rdflib.Graph()
        graph.namespace_manager.bind("sinopia", SINOPIA)
        jsonld = json.dumps(resource.pop("data")).encode()
        try:
            graph.parse(data=jsonld, format="json-ld")
        except Exception as error:
            print(f"Failed to parse {resource}\n{error}")
            return
        payload = {"graph": graph, "meta": resource}
        if "sinopia:template:resource" in resource.get("templateId"):
            output["templates"].append(payload)
        else:
            output["resources"].append(payload)

    output = {"resources": [], "templates": [], "total": 0}
    start = datetime.utcnow()
    print(f"Started harvest of resources at {start} for {api_url}")
    initial = requests.get(f"{api_url}/resource")
    print("0", end="")
    for row in initial.json().get("data"):
        add_resource(row)
    next_link = initial.json().get("links").get("next")
    while 1:
        result = requests.get(next_link)
        if result.status_code > 300:
            break
        payload = result.json()
        new_next = payload.get("links").get("next")
        if new_next is None:
            new_text = payload.get("links").get("first")
        if new_next == next_link or new_next is None:
            break
        for row in payload.get("data"):
            add_resource(row)
        next_link = new_next
        print(".", end="")
        if not output["total"] % 250:
            print(f"{output['total']}", end="")
    end = datetime.utcnow()
    print(f"\nFinished total time {(end-start).seconds / 60.}")
    return output

In [7]:
stage_rdf = from_api("https://api.stage.sinopia.io")

Started harvest of resources at 2021-05-21 15:47:05.228146 for https://api.stage.sinopia.io
0.........250..........500..........750..........1000..........1250..........1500..........1750..........2000..........2250..........2500..........2750..........3000..........3250..........3500..........3750..........4000..........4250..........4500..........4750

http://desktop.loc.gov/search?view=document&id=Infobasedcrmg0Dash0Dash0Dash247&hl=true&fq=allresources|true# does not look like a valid URI, trying to serialize this will break.


..........5000..........5250

ld4p:RT:bf2:2D graphic material:Item does not look like a valid URI, trying to serialize this will break.


..........5500....

urn:ld4p:qa:gettyaat:Objects__Object_Groupings and Systems does not look like a valid URI, trying to serialize this will break.


......5750..........6000
https://api.stage.sinopia.io/resource/e49c5f1d-5e62-4b45-b87f-5d0cf3e573e5 missing data

https://api.stage.sinopia.io/resource/3770137a-bed5-4a97-bd9a-fea4f3822dd7 missing data
..........
https://api.stage.sinopia.io/resource/28961949-72b2-4c94-b1f5-a7788f1ae1f0 missing data

https://api.stage.sinopia.io/resource/c3a1d5dd-a829-4ba7-8fbe-20490c018407 missing data

https://api.stage.sinopia.io/resource/4e80a183-4487-44fd-9bf8-8497c50d27f3 missing data
.
https://api.stage.sinopia.io/resource/16625687-0208-4ea5-b299-204d36180c45 missing data
..

https://api.stage.sinopia.io/resource/this is a test does not look like a valid URI, trying to serialize this will break.


............
https://api.stage.sinopia.io/resource/a6acbbea-1770-468b-904b-51cc4a3d7f27 missing data
....Failed to parse {'user': 'mcm104', 'group': 'washington', 'templateId': 'WAU:RT:BF2:Work', 'types': ['http://id.loc.gov/ontologies/bibframe/Work'], 'id': '0398ce54-ff15-4e9f-8948-c44bcc393798', 'uri': 'https://api.stage.sinopia.io/resource/0398ce54-ff15-4e9f-8948-c44bcc393798', 'timestamp': '2021-03-30T22:02:40.077Z'}
'@eng' is not a valid language tag!
.............................................................................................................................................................
Finished total time 4.316666666666666


In [8]:
kglab.KnowledgeGraph?

[0;31mInit signature:[0m
[0mkglab[0m[0;34m.[0m[0mKnowledgeGraph[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0;34m*[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mname[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'generic'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mbase_uri[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlanguage[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'en'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0muse_gpus[0m[0;34m:[0m [0mbool[0m [0;34m=[0m [0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mimport_graph[0m[0;34m:[0m [0mUnion[0m[0;34m[[0m[0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mConjunctiveGraph[0m[0;34m,[0m [0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mDataset[0m[0;34m,[0m [0mrdflib[0m[0;34m.[0m[0mgraph[0m[0;34m.[0m[0mGraph[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0m

## Creating a Knowledge Graph
Here we iterate through the Stage resources, retrieve the RDF graph, and then we create an instance of the kglab
Knowledge Graph.

In [10]:
stage = rdflib.ConjunctiveGraph()
for row in stage_rdf['resources']:
    stage += row.get('graph')

    
kg = kglab.KnowledgeGraph(
    name = "Sinopia Stage KG",
    base_uri = "https://api.stage.sinopia.io/resource/",
    namespaces = namespaces,
    import_graph=stage
)

### Graph Measuresments

In [11]:
measure = kglab.Measure()

In [13]:
measure.measure_graph(kg)
print("edges: {:,}\n".format(measure.get_edge_count()))
print("nodes: {:,}\n".format(measure.get_node_count()))

edges: 629,278

nodes: 67,794



## SPARQL Query as a Pandas Data
One of the exciting capabilities of our new Stage Knowledge graph is that we can query our Stage datastore with SPARQL and runturn the result as a [Pandas](https://pandas.pydata.org/) dataframe. The dataframe can now be fead into a [FastAI](), [Pytorch](), or [Tensorflow]() models for further analysis. 

In [22]:
all_triples_query = """
SELECT ?subject ?predicate ?object 
WHERE {
  ?subject ?predicate ?object .
}"""

In [23]:
df = kg.query_as_df(all_triples_query)

In [None]:
With each triple in our 

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 314639 entries, 0 to 314638
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   subject    314639 non-null  object
 1   predicate  314639 non-null  object
 2   object     314639 non-null  object
dtypes: object(3)
memory usage: 7.2+ MB


In [25]:
df.head()

Unnamed: 0,subject,predicate,object
0,_:Nf369836ee2334c118a2adccfe5c6ba7b,rdf:type,bf:Title
1,_:Nf5b660a442b24fd88e534e4698b045d6,owl:sameAs,<https://id.loc.gov/vocabulary/languages/eng>
2,_:b1,bf:Agent,<http://id.loc.gov/authorities/names/n96089174>
3,_:N0dc076eacd4b4b7c99cea63fc1f7821c,bf:code,n
4,_:Nfec989756ae2489eb7f6862c247d2c43,<http://id.loc.gov/ontologies/bflc/encodingLevel>,<https://id.loc.gov/vocabulary/menclvl/f>


In [27]:
df.tail()

Unnamed: 0,subject,predicate,object
314634,_:N2d11c7095c9648158f8bef53a28ccf71,rdfs:label,Bayless Books
314635,<https://api.stage.sinopia.io/resource/ac2a379...,bf:contribution,_:b0
314636,_:N5ea15b82eec6415aa35e687c27da1d68,owl:sameAs,<http://viaf.org/viaf/174954981>
314637,_:Nbd85229aabec4e22b1276c1f3d33b3a0,rdf:type,bf:Status
314638,_:N58c81ba54d414f7e8b0130222c7eaf33,rdfs:label,rml.py SNAPSHOT: 2021-04-14T14:08:43+00:00


In [26]:
df.sample(10)

Unnamed: 0,subject,predicate,object
39321,<https://api.stage.sinopia.io/resource/549669f...,bf:identifiedBy,_:b0
13344,_:N9a77e3270d23482d8213a8604dbefc37,<http://www.loc.gov/mads/rdf/v1#authoritativeL...,New England--Juvenile fiction
304142,_:Nb82c376bca76435a81ae33fde34d6574,<http://id.loc.gov/ontologies/bflc/encodingLevel>,<https://id.loc.gov/vocabulary/menclvl/f>
6843,<https://api.stage.sinopia.io/resource/b30338e...,bf:carrier,_:N9c9f1818deff4b2cb0c40a3c032ac2a0
92804,_:b148,<http://id.loc.gov/ontologies/bflc/relation>,_:b149
54300,_:Nd01a6202d40e4154bafdc4c798a3f3f4,owl:sameAs,<https://id.loc.gov/vocabulary/languages/eng>
267979,_:b31,rdf:type,bf:ShelfMarkLcc
298605,_:Na4aa3d34cb8d4aa5b1ca9ac1a80d0711,owl:sameAs,<http://id.loc.gov/authorities/names/no2011121...
210377,<https://api.stage.sinopia.io/resource/b1e12ed...,bf:heldBy,<http://id.loc.gov/vocabulary/organizations/cst>
120149,_:Nbcf26e1ea76a4ed68d32d5100a2ab2ae,rdf:type,bf:GenerationProcess


## Visualizations


In [None]:
subgraph = kglab.SubgraphTensor(kg)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig03.html")

In [39]:
second_kg = kglab.KnowledgeGraph(
    name = "Sinopia Stage Second KG",
    base_uri = "https://api.stage.sinopia.io/resource/",
    namespaces = namespaces,
    import_graph=dev_rdf['resources'][1]['graph']
)

In [40]:
subgraph = kglab.SubgraphTensor(second_kg)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig03.html")

## Stanford Knowledge Base


In [28]:
stanford_url = "https://api.stage.sinopia.io/resource?group=stanford"

In [30]:
stanford_result = requests.get(stanford_url)

In [31]:
stanford_result.json()

{'data': [{'data': [{'@id': '_:b20',
     '@type': ['http://id.loc.gov/ontologies/bibframe/Title'],
     'http://www.w3.org/2000/01/rdf-schema#label': [{'@value': 'Dido and Aeneas',
       '@language': 'eng'}],
     'http://id.loc.gov/ontologies/bibframe/mainTitle': [{'@value': 'Dido and Aeneas',
       '@language': 'eng'}],
     'http://www.loc.gov/mads/rdf/v1#isIdentifiedByAuthority': [{'@id': 'http://id.loc.gov/authorities/names/n83176115'}],
     'http://rdaregistry.info/Elements/x/P00016': [{'@value': 'Purcell, Henry, 1659-1695. Dido and Aeneas',
       '@language': 'eng'}]},
    {'@id': '_:b21',
     '@type': ['http://id.loc.gov/ontologies/bibframe/Contribution'],
     'http://id.loc.gov/ontologies/bibframe/agent': [{'@id': '_:b22'}],
     'http://id.loc.gov/ontologies/bibframe/role': [{'@id': 'http://id.loc.gov/vocabulary/relators/cmp'}]},
    {'@id': '_:b22',
     '@type': ['http://id.loc.gov/ontologies/bibframe/Agent',
      'http://id.loc.gov/ontologies/bibframe/Person'],
   

In [None]:
stanford_result2 = requests.get('https://api.stage.sinopia.io/resource?limit=0&start=25&group=stanford/resource')

In [35]:
stanford_rdf = from_api(stanford_url)

Started harvest of resources at 2021-05-21 19:17:00.450348 for https://api.stage.sinopia.io/resource?group=stanford
0

MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?

In [None]:
%debug

> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/models.py[0m(390)[0;36mprepare_url[0;34m()[0m
[0;32m    388 [0;31m            [0merror[0m [0;34m=[0m [0merror[0m[0;34m.[0m[0mformat[0m[0;34m([0m[0mto_native_string[0m[0;34m([0m[0murl[0m[0;34m,[0m [0;34m'utf8'[0m[0;34m)[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    389 [0;31m[0;34m[0m[0m
[0m[0;32m--> 390 [0;31m            [0;32mraise[0m [0mMissingSchema[0m[0;34m([0m[0merror[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    391 [0;31m[0;34m[0m[0m
[0m[0;32m    392 [0;31m        [0;32mif[0m [0;32mnot[0m [0mhost[0m[0;34m:[0m[0;34m[0m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/models.py[0m(316)[0;36mprepare[0;34m()[0m
[0;32m    314 [0;31m[0;34m[0m[0m
[0m[0;32m    315 [0;31m        [0mself[0m[0;34m.[0m[0mprepare_method[0m[0;34m([0m[0mmethod[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m--> 316 [0;31m        [0mself[0m[0;34m.[0m[0mprepare_url[0m[0;34m([0m[0murl[0m[0;34m,[0m [0mparams[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    317 [0;31m        [0mself[0m[0;34m.[0m[0mprepare_headers[0m[0;34m([0m[0mheaders[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    318 [0;31m        [0mself[0m[0;34m.[0m[0mprepare_cookies[0m[0;34m([0m[0mcookies[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/sessions.py[0m(456)[0;36mprepare_request[0;34m()[0m
[0;32m    454 [0;31m[0;34m[0m[0m
[0m[0;32m    455 [0;31m        [0mp[0m [0;34m=[0m [0mPreparedRequest[0m[0;34m([0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m--> 456 [0;31m        p.prepare(
[0m[0;32m    457 [0;31m            [0mmethod[0m[0;34m=[0m[0mrequest[0m[0;34m.[0m[0mmethod[0m[0;34m.[0m[0mupper[0m[0;34m([0m[0;34m)[0m[0;34m,[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    458 [0;31m            [0murl[0m[0;34m=[0m[0mrequest[0m[0;34m.[0m[0murl[0m[0;34m,[0m[0;34m[0m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/sessions.py[0m(528)[0;36mrequest[0;34m()[0m
[0;32m    526 [0;31m            [0mhooks[0m[0;34m=[0m[0mhooks[0m[0;34m,[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    527 [0;31m        )
[0m[0;32m--> 528 [0;31m        [0mprep[0m [0;34m=[0m [0mself[0m[0;34m.[0m[0mprepare_request[0m[0;34m([0m[0mreq[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m    529 [0;31m[0;34m[0m[0m
[0m[0;32m    530 [0;31m        [0mproxies[0m [0;34m=[0m [0mproxies[0m [0;32mor[0m [0;34m{[0m[0;34m}[0m[0;34m[0m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/api.py[0m(61)[0;36mrequest[0;34m()[0m
[0;32m     59 [0;31m    [0;31m# cases, and look like a memory leak in others.[0m[0;34m[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     60 [0;31m    [0;32mwith[0m [0msessions[0m[0;34m.[0m[0mSession[0m[0;34m([0m[0;34m)[0m [0;32mas[0m [0msession[0m[0;34m:[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m---> 61 [0;31m        [0;32mreturn[0m [0msession[0m[0;34m.[0m[0mrequest[0m[0;34m([0m[0mmethod[0m[0;34m=[0m[0mmethod[0m[0;34m,[0m [0murl[0m[0;34m=[0m[0murl[0m[0;34m,[0m [0;34m**[0m[0mkwargs[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     62 [0;31m[0;34m[0m[0m
[0m[0;32m     63 [0;31m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m/Users/jpnelson/02021/sul-dlss/labs/ml-env/lib/python3.8/site-packages/requests/api.py[0m(76)[0;36mget[0;34m()[0m
[0;32m     74 [0;31m[0;34m[0m[0m
[0m[0;32m     75 [0;31m    [0mkwargs[0m[0;34m.[0m[0msetdefault[0m[0;34m([0m[0;34m'allow_redirects'[0m[0;34m,[0m [0;32mTrue[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m---> 76 [0;31m    [0;32mreturn[0m [0mrequest[0m[0;34m([0m[0;34m'get'[0m[0;34m,[0m [0murl[0m[0;34m,[0m [0mparams[0m[0;34m=[0m[0mparams[0m[0;34m,[0m [0;34m**[0m[0mkwargs[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     77 [0;31m[0;34m[0m[0m
[0m[0;32m     78 [0;31m[0;34m[0m[0m
[0m


ipdb>  u


> [0;32m<ipython-input-6-ad3177731bed>[0m(40)[0;36mfrom_api[0;34m()[0m
[0;32m     38 [0;31m    [0mnext_link[0m [0;34m=[0m [0minitial[0m[0;34m.[0m[0mjson[0m[0;34m([0m[0;34m)[0m[0;34m.[0m[0mget[0m[0;34m([0m[0;34m"links"[0m[0;34m)[0m[0;34m.[0m[0mget[0m[0;34m([0m[0;34m"next"[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     39 [0;31m    [0;32mwhile[0m [0;36m1[0m[0;34m:[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m---> 40 [0;31m        [0mresult[0m [0;34m=[0m [0mrequests[0m[0;34m.[0m[0mget[0m[0;34m([0m[0mnext_link[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     41 [0;31m        [0;32mif[0m [0mresult[0m[0;34m.[0m[0mstatus_code[0m [0;34m>[0m [0;36m300[0m[0;34m:[0m[0;34m[0m[0;34m[0m[0m
[0m[0;32m     42 [0;31m            [0;32mbreak[0m[0;34m[0m[0;34m[0m[0m
[0m


ipdb>  result.json()['links']


*** NameError: name 'result' is not defined


ipdb>  initial.json().get("links").get("first")


'https://api.stage.sinopia.io/resource?limit=0&start=25&group=stanford/resource'
