# Testing out parsing JSON-LD from nomenclature.info

In [1]:
import requests
from bs4 import BeautifulSoup
import json
from unicodedata import normalize

## Parsing a single example

For this test I picked the entry for "chair", which can be found at https://www.nomenclature.info/parcourir-browse.app?id=1090&lang=en. The choice was only semi-random: I was looking for an entry with a "full" set of features.

In [2]:
test_url = 'https://www.nomenclature.info/parcourir-browse.app?id=1090'
r = requests.get(test_url)
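
The URL quoted in the text carries both an `id` and a `lang` query parameter, while the cell above passes only `id`. As a minimal sketch, the full address could be built from its parts as follows. `BASE_URL` and `build_browse_url` are hypothetical names introduced here, and I have not checked whether `lang` changes the embedded JSON-LD.

```python
from urllib.parse import urlencode

# Assumed base address, taken from the URL quoted in the text.
BASE_URL = 'https://www.nomenclature.info/parcourir-browse.app'


def build_browse_url(term_id, lang='en'):
    """Build the browse URL for one term (hypothetical helper)."""
    # urlencode keeps the dict's insertion order: id first, then lang.
    return BASE_URL + '?' + urlencode({'id': term_id, 'lang': lang})


print(build_browse_url(1090))
```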

After using the `requests` package to download the full page source, I use `BeautifulSoup` to parse it into HTML elements.

In [3]:
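# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(r.text, 'html.parser')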

From the `soup` that comes from `BeautifulSoup`, I can then drill down to the JSON-LD portion of the page source.

In [4]:
json_ld_text = soup.find('script', {'type':'application/ld+json'}).text
json_ld = json.loads(json_ld_text)
print(json.dumps(json_ld, indent=2, sort_keys=True, ensure_ascii=False))

{
  "@context": {
    "cs": "http://purl.org/vocab/changeset/schema",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "nomo": "http://nomenclature.info/nom/ontology/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "skos-xl": "http://www.w3.org/2008/05/skos-xl#"
  },
  "@id": "http://nomenclature.info/nom/1090",
  "@type": [
    "http://www.w3.org/2004/02/skos/core#Concept"
  ],
  "dc:identifier": [
    {
      "@type": "http://www.w3.org/2001/XMLSchema#string",
      "@value": "1090"
    }
  ],
  "dct:modified": "2019-06-13T15:08:58.823Z",
  "foaf:img": [
    {
      "@id": "https://app.pch.gc.ca/public_info/nomenclature/images/02-00147.jpg"
    }
  ],
  "nomo:Date-Added": [
    {
      "@language": "en",
      "@value": "1978-2010"
    }
  ],
  "nomo:Definition-Source": [
    {
      "@langu
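
The `@context` block in the output above maps short prefixes such as `skos` to full namespace IRIs, which is why keys like `skos:prefLabel` appear in compact form. The sketch below shows that simple prefix expansion only. `expand_key` is a hypothetical helper, and the full JSON-LD expansion algorithm handles many more cases than this.

```python
# Three of the prefixes shown in the "@context" block above.
CONTEXT = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand_key(key, context=CONTEXT):
    """Expand 'prefix:name' to a full IRI if the prefix is declared; otherwise return key unchanged."""
    prefix, sep, name = key.partition(":")
    if sep and prefix in context:
        return context[prefix] + name
    return key


print(expand_key("skos:prefLabel"))
print(expand_key("@id"))
```

Since the page serves compact keys, the rest of this notebook looks them up in compact form (`'skos:prefLabel'`) rather than by full IRI.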

The `extract_nom_jsonld` function below is a preliminary stab at pulling out selected pieces from the JSON-LD, for saving into tabular format.

In [5]:
def english_value(entries):
    """Return the cleaned English '@value' from a JSON-LD property, or None."""
    # A property can be a single dict or a list of dicts, one per language.
    if isinstance(entries, dict):
        entries = [entries]
    elif not isinstance(entries, list):
        return None
    value = None
    for entry in entries:
        if entry.get('@language') == 'en':
            value = entry['@value']
    if value is None:
        return None
    # Fold compatibility characters and flatten line breaks for tabular output.
    return normalize("NFKD", value).replace('\n', ' ')


def extract_nom_jsonld(nom_url):
    r = requests.get(nom_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    json_ld_text = soup.find('script', {'type': 'application/ld+json'}).text
    json_ld = json.loads(json_ld_text)

    subset = {'id': json_ld['@id']}
    # Set 'definition' and 'label' only when an English value exists,
    # so a term without one is skipped instead of raising a KeyError.
    for key, source in (('definition', 'skos:definition'), ('label', 'skos:prefLabel')):
        if source in json_ld:
            value = english_value(json_ld[source])
            if value is not None:
                subset[key] = value
    if 'skos:narrower' in json_ld:
        subset['children'] = [child['@id'] for child in json_ld['skos:narrower']]
    if 'skos:broader' in json_ld:
        # Keep only the first broader term as the parent.
        subset['parent'] = json_ld['skos:broader'][0]['@id']
    return subset
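
To make the `normalize("NFKD", ...)` call above more concrete, here is a small sketch on a made-up string (not taken from nomenclature.info). NFKD replaces compatibility characters such as the non-breaking space with their plain equivalents, but it also splits accented letters into a base letter plus a combining mark.

```python
from unicodedata import normalize

# Made-up sample: an accented letter, a non-breaking space, and a newline.
raw = "caf\u00e9\u00a0table\nset"
cleaned = normalize("NFKD", raw).replace("\n", " ")

print(repr(cleaned))

# The accent is now a separate combining character (U+0301);
# NFC would recompose it if composed text is needed later.
recomposed = normalize("NFC", cleaned)
print(repr(recomposed))
```

This is harmless for the English strings extracted here, but if French labels were added later, `NFKC` might be the safer choice because it keeps accented letters composed.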
        
        

The result of running the "chair" page through the `extract_nom_jsonld` function.

In [6]:
extract_nom_jsonld('http://nomenclature.info/nom/1090')

{'id': 'http://nomenclature.info/nom/1090',
 'definition': 'A movable seat with a back and with or without arms. It is generally made of wood, usually has four legs for support and sometimes has an upholstered seat and back. (In French, the term chaise represents a seat without arms.) Used to seat one person.',
 'label': 'Chair',
 'children': ['http://nomenclature.info/nom/1091',
  'http://nomenclature.info/nom/1101',
  'http://nomenclature.info/nom/1102',
  'http://nomenclature.info/nom/1103',
  'http://nomenclature.info/nom/1104',
  'http://nomenclature.info/nom/1105',
  'http://nomenclature.info/nom/1108',
  'http://nomenclature.info/nom/1109',
  'http://nomenclature.info/nom/1110',
  'http://nomenclature.info/nom/1111',
  'http://nomenclature.info/nom/1112',
  'http://nomenclature.info/nom/1116',
  'http://nomenclature.info/nom/1117',
  'http://nomenclature.info/nom/1118',
  'http://nomenclature.info/nom/1119',
  'http://nomenclature.info/nom/1121',
  'http://nomenclature.info/nom/

## Setting up a loop to grab all "children" nodes

Here I set up a while loop to visit every node below "chair". It starts with a list holding a single node. On each pass it pops one node off the end of the list, parses that node's page, and appends any child nodes to the list. The loop finishes when the list is empty, which means every node has been visited.

This example visits and parses 55 pages, but I could seed `ids_to_grab` with the root node(s) of nomenclature.info to grab every term and definition.

In [7]:
ids_to_grab = ['http://nomenclature.info/nom/1090']
nomenclature_results = []

while ids_to_grab:
    nom_id = ids_to_grab.pop()  # take the most recently added node
    ld_dict = extract_nom_jsonld(nom_id)
    # Queue this node's children, if it has any.
    ids_to_grab.extend(ld_dict.get('children', []))
    nom_result = {k: ld_dict.get(k) for k in ('id', 'definition', 'parent', 'label')}
    nomenclature_results.append(nom_result)
print(len(nomenclature_results))


55
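
The loop above trusts that no node is reachable twice. For a crawl of the whole site, a variant that tracks visited IDs and pauses between requests might be safer. The sketch below is one way to do that. `crawl`, `delay`, and `FAKE_PAGES` are hypothetical names, and the fetch function is passed in so the example runs offline against made-up data.

```python
import time
from collections import deque


def crawl(root_ids, fetch, delay=0.0):
    """Visit every node reachable from root_ids exactly once.

    fetch(node_id) should return a dict shaped like the output of
    extract_nom_jsonld, with an optional 'children' list.
    """
    queue = deque(root_ids)
    seen = set(root_ids)
    results = []
    while queue:
        node_id = queue.popleft()  # breadth-first order
        record = fetch(node_id)
        for child in record.get('children', []):
            if child not in seen:  # skip nodes already queued or visited
                seen.add(child)
                queue.append(child)
        results.append({k: record.get(k) for k in ('id', 'definition', 'parent', 'label')})
        if delay and queue:
            time.sleep(delay)  # pause between requests
    return results


# Made-up pages standing in for the real site; 'c' is listed under two parents.
FAKE_PAGES = {
    'a': {'id': 'a', 'label': 'Root', 'children': ['b', 'c']},
    'b': {'id': 'b', 'label': 'Left', 'parent': 'a', 'children': ['c']},
    'c': {'id': 'c', 'label': 'Right', 'parent': 'a'},
}

demo_results = crawl(['a'], FAKE_PAGES.__getitem__)
print([r['id'] for r in demo_results])
```

Against the real site, `fetch` would be `extract_nom_jsonld` and `delay` a second or so. The right value depends on the site's usage policy, which I have not checked.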


### Loading the subset results into a pandas DataFrame

In [8]:
import pandas as pd

In [9]:
nom_df = pd.DataFrame(nomenclature_results)
nom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 4 columns):
definition    9 non-null object
id            55 non-null object
label         55 non-null object
parent        55 non-null object
dtypes: object(4)
memory usage: 1.8+ KB


In [10]:
nom_df.head()

Unnamed: 0,definition,id,label,parent
0,A movable seat with a back and with or without...,http://nomenclature.info/nom/1090,Chair,http://nomenclature.info/nom/1071
1,,http://nomenclature.info/nom/1144,Theater Seat,http://nomenclature.info/nom/1090
2,,http://nomenclature.info/nom/1143,Kubbestol,http://nomenclature.info/nom/1090
3,,http://nomenclature.info/nom/1142,Klismos,http://nomenclature.info/nom/1090
4,,http://nomenclature.info/nom/1140,Windsor Chair,http://nomenclature.info/nom/1090


In [11]:
nom_df['parent'].value_counts()

http://nomenclature.info/nom/1090    35
http://nomenclature.info/nom/1091     9
http://nomenclature.info/nom/1112     3
http://nomenclature.info/nom/1105     2
http://nomenclature.info/nom/1123     2
http://nomenclature.info/nom/1071     1
http://nomenclature.info/nom/1128     1
http://nomenclature.info/nom/1140     1
http://nomenclature.info/nom/1119     1
Name: parent, dtype: int64

Only 9 of the 55 chair terms come with an English definition, which is kind of interesting.

In [12]:
nom_df[pd.notnull(nom_df['definition'])]

Unnamed: 0,definition,id,label,parent
0,A movable seat with a back and with or without...,http://nomenclature.info/nom/1090,Chair,http://nomenclature.info/nom/1071
16,A chair that is mounted on two curved strips o...,http://nomenclature.info/nom/1128,Rocking Chair,http://nomenclature.info/nom/1090
28,"A wooden chair, sometimes with arms, with a so...",http://nomenclature.info/nom/1117,Hall Chair,http://nomenclature.info/nom/1090
30,"A chair, sometimes with arms and a light, fold...",http://nomenclature.info/nom/1112,Folding Chair,http://nomenclature.info/nom/1090
37,"A chair, with or without arms, designed to hol...",http://nomenclature.info/nom/1108,Commode Chair,http://nomenclature.info/nom/1090
39,A chair with arms and very long legs. It somet...,http://nomenclature.info/nom/1107,Highchair,http://nomenclature.info/nom/1105
46,"A chair with arms to which a large round, oval...",http://nomenclature.info/nom/1100,Chair-Table,http://nomenclature.info/nom/1091
47,"A large, low-seated, high-backed, and usually ...",http://nomenclature.info/nom/1099,Wing Chair,http://nomenclature.info/nom/1091
53,A chair with arms and usually with an upholste...,http://nomenclature.info/nom/1093,Easy Chair,http://nomenclature.info/nom/1091
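
The stated goal of `extract_nom_jsonld` was to save the selected pieces in tabular format, so here is a minimal sketch of that last step using only the standard library's `csv` module. `results_to_csv` is a hypothetical helper, and the two sample rows are abbreviated from the results above.

```python
import csv
import io

FIELDS = ['id', 'label', 'parent', 'definition']


def results_to_csv(results, fields=FIELDS):
    """Serialise a list of result dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    # The csv module writes None as an empty cell, so missing definitions stay blank.
    writer.writerows(results)
    return buffer.getvalue()


# Two rows abbreviated from the results above.
sample = [
    {'id': 'http://nomenclature.info/nom/1090', 'label': 'Chair',
     'parent': 'http://nomenclature.info/nom/1071',
     'definition': 'A movable seat with a back and with or without arms.'},
    {'id': 'http://nomenclature.info/nom/1144', 'label': 'Theater Seat',
     'parent': 'http://nomenclature.info/nom/1090', 'definition': None},
]

csv_text = results_to_csv(sample)
print(csv_text)
```

With the DataFrame already in memory, `nom_df.to_csv(...)` would be the shorter route. The standard-library version is shown because it works directly on `nomenclature_results`.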
