# Introduction and Setup for Sinopia's Knowledge Graph

In [15]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import datetime

import kglab
import helpers
import widgets
from IPython.display import display

## Introduction
This workshop introduces downloading and exploring the RDF created in the Sinopia Linked Data Editing environment. We will then build on these Sinopia data artifacts by applying machine learning techniques to tasks such as FAST subject heading and template classification.

### Workshop Schedule
This workshop is divided into three parts of 55 minutes each, with a break between sessions.

#### 1. Introduction, Setup, Analysis, and Visualization of Sinopia RDF

#### 2. Using spaCy and HuggingFace Natural Language Processing (NLP)

#### 3. FastAI and PyTorch with Sinopia Data

## Set-up for running Locally or Remotely
There are multiple ways to run the [Jupyter notebooks](https://jupyter.org/) in this workshop. The easiest is to load each notebook with the [MyBinder][BINDER] service, which launches a Jupyter lab environment from which you can select and run the notebooks. The most complex is to download and install Python, along with the workshop dependencies, on your local laptop or workstation. In between the two is running the notebooks in Google's [Colab][COLAB] environment.

### Run with MyBinder Cloud Service (the easiest) 
To run this workshop's Jupyter notebooks on [MyBinder][BINDER]

1. Go to the following link https://mybinder.... 
1. Launch the container 
1. When the environment has finished launching, you should see a display similar to this:
   ![MyBinder Jupyter Lab Workshop](images/)
1. Click on the `01_IntroSetup.ipynb` to launch this notebook. 

### Run with Google's Colab Service
1.  Open 

### Local Installation Set-up
1. Download and install the latest [Python version](https://python.org/downloads) (currently **3.9.6**).
1. Once Python 3.9.x is installed, launch a terminal window and change to the directory where you want to install the workshop notebooks repository.
1. Create a Python virtual environment, e.g. `python3 -m venv ld4-env`
1. Activate the Python virtual environment:
   - `source ld4-env/bin/activate` for Macintosh or Linux
   - `ld4-env\Scripts\activate` for Windows
1. Clone or copy the workshop repository:
   -  If you have [git](https://git-scm.com) installed, run `git clone https://github.com/ld4p/{name-of-repo}`
   -  Otherwise, download and unzip the repository
1. Change directories into the workshop repository and run `pip install -r requirements.txt` to install all of the libraries we will be using for the workshop.
1. Launch Jupyter lab from the workshop repository with `jupyter lab`
1. Open the locally running Jupyter lab instance at http://localhost:8888 (or another port if 8888 is already in use).
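
After a local installation, a quick check like the following can confirm that the Python version and the key libraries are in place. This is only a sketch: `check_environment` and `REQUIRED_MODULES` are hypothetical names, and the module list is an assumption based on the imports at the top of this notebook. The authoritative dependency list is the repository's `requirements.txt`.

```python
import importlib.util
import sys

# Assumption: a small sample of the packages this notebook imports;
# the complete list lives in the repository's requirements.txt.
REQUIRED_MODULES = ["kglab", "IPython"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 9)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for name in modules:
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    return problems


# Print one line per problem; prints nothing when the environment looks fine.
for problem in check_environment():
    print(problem)
```

Run it from inside the activated virtual environment so that it inspects the same interpreter that `jupyter lab` will use.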

[BINDER]: https://mybinder.org/
[COLAB]: https://colab.research.google.com/

### Brief Introduction to Jupyter Notebooks
[Jupyter](https://jupyter.org/) notebooks are a popular computing environment in the big data and machine learning communities, and they run in your web browser. A notebook is made up of one or more cells that contain either documentation, written in [Markdown][MKDOWN], or Python code. You can move cells around, copy, delete, or change the type using the notebook toolbar:

![Jupyter Notebook Toolbar](images/jupyter-nb-toolbar.png)

Here are the important buttons:

#### Saves the notebook to disk
![Save Notebook](images/notebook-save.png) 
  
####  Adds a new cell to the notebook
![Add cell](images/notebook-add-cell.png)

#### Removes current cell (but can paste the cell in a new location)
![Cut cell](images/notebook-cut-cell.png) 

#### Copy current cell
  ![Copy cell](images/notebook-copy-cell.png)
  
#### Paste cell at cursor position
![Paste cell](images/notebook-paste-cell.png)

#### Runs current cell, either renders Markdown cell to HTML or executes Python code.
![Run cell](images/notebook-run-cell.png) 

#### Stops current running Cell
![Stop Running Cell](images/notebook-stop-running-cell.png)

#### Dropdown for changing the current cell type
![Change cell type dropdown](images/notebook-cell-type-select.png) 
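
Because a notebook file (`.ipynb`) is plain JSON holding a list of cells, the mix of Markdown and code cells can be inspected with a few lines of Python. The sketch below assumes the standard notebook layout, with a top-level `cells` list whose entries each carry a `cell_type`. The helper names `count_cell_types` and `count_cell_types_in_file` are made up for this example.

```python
import json
from collections import Counter


def count_cell_types(notebook):
    """Count cells by type in a notebook that is already loaded as a dict."""
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


def count_cell_types_in_file(path):
    """Load an .ipynb file (plain JSON) and count its cells by type."""
    with open(path, encoding="utf-8") as handle:
        return count_cell_types(json.load(handle))
```

For example, `count_cell_types_in_file("01_IntroSetup.ipynb")` would report how many `markdown` and `code` cells this notebook contains.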

[MKDOWN]: https://www.markdownguide.org/

## Sinopia Group Knowledge Graph
We can use the [Sinopia API](https://ld4p.github.io/sinopia_api/#tag/resources/paths/~1resource/get) to retrieve only the resources associated with a Sinopia group. The general URL pattern is 

`https://api.{env?}.sinopia.io/resource?group={name}`. 

Some examples:
- Retrieve PCC resources from the Sinopia stage environment: `https://api.stage.sinopia.io/resource?group=pcc`
- Retrieve Yale resources from Sinopia production: `https://api.sinopia.io/resource?group=yale`
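
The URL pattern can also be assembled in code. The following is a minimal sketch and not part of the workshop's `helpers` or `widgets` modules: `sinopia_group_url` is a hypothetical function, and the `/resource` path follows the `helpers.create_kg(...)` calls used in this notebook's code cells.

```python
from urllib.parse import urlencode


def sinopia_group_url(group=None, env=None):
    """Build a Sinopia API resource URL.

    env:   environment name such as "stage"; None, "" or "production"
           selects the production host, which has no environment prefix.
    group: optional group name; when omitted the URL covers all groups.
    """
    if env in (None, "", "production"):
        host = "api.sinopia.io"
    else:
        host = f"api.{env}.sinopia.io"
    url = f"https://{host}/resource"
    if group:
        # urlencode escapes any characters that are unsafe in a query string
        url += "?" + urlencode({"group": group})
    return url
```

Calling `sinopia_group_url("pcc", env="stage")` builds the stage URL for the PCC group, and `sinopia_group_url("yale")` builds the production URL for Yale.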

To assist in generating the group API URL, we will use the `sinopia_api` widget:

In [2]:
display(widgets.sinopia_api_group_widget)

VBox(children=(HBox(children=(RadioButtons(description='Environment:', options=(('Development', 'https://api.d…

In [None]:
pcc_kg = helpers.create_kg('https://api.stage.sinopia.io/resource?group=pcc')

## Retrieving all RDF from Sinopia Stage Environment
Using the `sinopia_api` widget to generate the Sinopia API URL for all groups, we can then call the helper function `create_kg`, which downloads each resource, extracts its RDF, and returns the knowledge graph once all of the RDF resources have been parsed.
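
The download step might look roughly like the sketch below. This is not the source of `helpers.create_kg`, which is not shown in this notebook. The names `fetch_json` and `iter_resources` are hypothetical, and the response shape (a `data` list of resources plus a `links.next` URL for the following page, with each resource holding its JSON-LD under its own `data` key) is an assumption that should be checked against the Sinopia API documentation.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download one page of API results and decode it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_resources(start_url, fetch=fetch_json):
    """Yield every resource record that carries RDF, following pagination.

    Assumed response shape (not confirmed by this notebook):
        {"data": [<resource>, ...], "links": {"next": <url of next page>}}
    """
    url = start_url
    seen = set()  # guards against a pagination loop
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        for resource in page.get("data", []):
            if "data" not in resource:
                # Report and skip records that have no RDF payload
                print(f"{resource.get('uri')} missing data")
                continue
            yield resource
        url = page.get("links", {}).get("next")
```

Each yielded resource would then have its JSON-LD parsed into the knowledge graph. Passing the `fetch` function as a parameter keeps the paging logic testable without network access.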

**NOTE**: Instead of taking 8+ minutes to run this function, you can just load the existing stage knowledge graph with the following commands:

```python
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")
```

In [16]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

<kglab.kglab.KnowledgeGraph at 0x7fce76b7c070>

In [9]:
start = datetime.datetime.utcnow()
print(f"Started creation of knowledge graph for Sinopia Stage at {start}")
stage_kg = helpers.create_kg("https://api.stage.sinopia.io/resource")
end = datetime.datetime.utcnow()
print(f"""Finished at {end}, total time {(end-start).seconds / 60.} minutes""")

Started creation of knowledge graph for Sinopia Stage at 2021-07-17 18:23:30.106150
0....100....200....300....400....500....600....700....800....900....1,000....1,100....1,200....1,300....1,400....1,500....1,600....1,700....1,800....1,900....2,000....2,100....2,200....2,300....2,400....2,500....2,600....2,700....2,800....2,900....3,000....3,100....3,200....3,300....3,400....3,500....3,600....3,700....3,800....3,900....4,000....4,100....4,200....4,300....4,400....4,500....4,600....4,700.

http://desktop.loc.gov/search?view=document&id=Infobasedcrmg0Dash0Dash0Dash247&hl=true&fq=allresources|true# does not look like a valid URI, trying to serialize this will break.


...4,800....4,900....5,000....5,100....5,200.

ld4p:RT:bf2:2D graphic material:Item does not look like a valid URI, trying to serialize this will break.


...5,300....5,400....5,500..

urn:ld4p:qa:gettyaat:Objects__Object_Groupings and Systems does not look like a valid URI, trying to serialize this will break.


..5,600....5,700....5,800...Failed to parse {'user': 'cdezelar', 'group': 'minnesota', 'templateId': 'ld4p:RT:bf2:Monograph:Work:Un-nested', 'types': ['http://id.loc.gov/ontologies/bibframe/Work'], 'bfAdminMetadataRefs': ['https://api.stage.sinopia.io/resource/f424bc37-ee8a-4608-9ec4-7ad293d98610'], 'bfItemRefs': [], 'bfInstanceRefs': ['https://api.stage.sinopia.io/resource/57c50ae7-a5e6-4116-8422-f8eb9de61cb1'], 'bfWorkRefs': [], 'id': '638264f1-4ff8-4e2f-bea6-4bb8629c488d', 'uri': 'https://api.stage.sinopia.io/resource/638264f1-4ff8-4e2f-bea6-4bb8629c488d', 'timestamp': '2020-09-29T16:03:43.979Z'}
Invalid line: '<file:///Users/jpnelson/02021/ld4p/ld4-2021/\thttps:/lccn.loc.gov/n99039887> <http://www.w3.org/2000/01/rdf-schema#label> "\thttps://lccn.loc.gov/n99039887" .'
.5,900...
https://api.stage.sinopia.io/resource/e49c5f1d-5e62-4b45-b87f-5d0cf3e573e5 missing data

https://api.stage.sinopia.io/resource/3770137a-bed5-4a97-bd9a-fea4f3822dd7 missing data
.6,000....6,100....6,200
https:

https://api.stage.sinopia.io/resource/this is a test does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#b2 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#N3a7298a12b5d4d258721d596be77a840 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#b2 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#b2 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#b2 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#N3a7298a12b5d4d258721d596be77a840 does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test#b2 does not l

.6,300Failed to parse {'user': 'michelle', 'group': 'cornell', 'templateId': 'sinopia:template:resource', 'types': ['http://sinopia.io/vocabulary/ResourceTemplate'], 'bfAdminMetadataRefs': [], 'bfItemRefs': [], 'bfInstanceRefs': [], 'bfWorkRefs': [], 'id': 'this is a test', 'uri': 'https://api.stage.sinopia.io/resource/this is a test', 'timestamp': '2021-02-16T20:12:05.927Z'}
"https://api.stage.sinopia.io/resource/this is a test#b2" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?
....6,400....6,500....6,600
https://api.stage.sinopia.io/resource/a6acbbea-1770-468b-904b-51cc4a3d7f27 missing data
...Failed to parse {'user': 'mcm104', 'group': 'washington', 'templateId': 'WAU:RT:BF2:Work', 'types': ['http://id.loc.gov/ontologies/bibframe/Work'], 'id': '0398ce54-ff15-4e9f-8948-c44bcc393798', 'uri': 'https://api.stage.sinopia.io/resource/0398ce54-ff15-4e9f-8948-c44bcc393798', 'timestamp': '2021-03-30T22:02:40.077Z'}
'@eng' is not a va

To save the resulting knowledge graph, we use the `save_jsonld` method, which serializes the Sinopia Stage graph to JSON-LD. We will load and use this file in subsequent Jupyter notebooks in this workshop.

In [10]:
stage_kg.save_jsonld("data/stage.json")
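
A quick way to confirm that the saved file is usable is to open it with the standard `json` module and count the node objects it holds. This is a sketch with hypothetical helper names (`jsonld_nodes`, `summarize_jsonld_file`). It assumes the file is either a JSON array of nodes, an object with an `@graph` entry, or a single node object. The node count is a sanity check only and is not the same as the number of triples.

```python
import json


def jsonld_nodes(document):
    """Return the list of node objects in a parsed JSON-LD document."""
    if isinstance(document, list):
        return document
    if isinstance(document, dict):
        if "@graph" in document:
            graph = document["@graph"]
            return graph if isinstance(graph, list) else [graph]
        return [document]  # a single node object
    raise TypeError("expected a JSON array or object")


def summarize_jsonld_file(path):
    """Report how many nodes the file holds and how many carry an @id."""
    with open(path, encoding="utf-8") as handle:
        nodes = jsonld_nodes(json.load(handle))
    with_id = sum(1 for node in nodes if isinstance(node, dict) and "@id" in node)
    return {"nodes": len(nodes), "with_id": with_id}
```

For example, `summarize_jsonld_file("data/stage.json")` should report a non-zero node count if the save succeeded.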

In [12]:
start = datetime.datetime.utcnow()
print(f"Started creation of knowledge graph for Sinopia Production at {start}")
prod_kg = helpers.create_kg("https://api.sinopia.io/resource")
end = datetime.datetime.utcnow()
print(f"""Finished at {end}, total time {(end-start).seconds / 60.} minutes""")

Started creation of knowledge graph for Sinopia Production at 2021-07-18 00:26:43.114806
0....100....200....300....400....500....600....700....800....900....1,000....1,100....1,200....1,300....1,400....1,500....1,600....1,700....1,800....1,900....2,000....2,100....2,200....2,300....2,400....2,500....2,600....2,700....2,800....2,900....3,000....3,100....3,200....3,300....3,400....3,500....3,600....3,700....3,800....3,900....4,000....4,100...Finished at 2021-07-18 00:29:40.408484, total time 2.95 minutes


In [17]:
prod_kg.save_jsonld("data/production.json")

## Exercise 1
Compare the total number of triples for the National Library of Medicine in each Sinopia environment: development, stage, and production.