CWPK \#28: Extracting Structure for Typologies
=======================================

We Extract a Typology Scaffolding from an Active KG
--------------------------

<div style="float: left; width: 305px; margin-right: 10px;">

<img src="http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png" title="Cooking with KBpedia" width="305" />

</div>

In this installment of the [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series, we work out in a [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) code block how to extract a single typology from the [KBpedia](https://kbpedia.org/) knowledge graph. To refresh your memory, KBpedia has an upper, 'core' [ontology](https://en.wikipedia.org/wiki/Ontology_(information_science)), the KBpedia Knowledge Ontology ([KKO](https://kbpedia.org/docs/kko-upper-structure/)) that has a bit fewer than 200 top-level concepts. About half of these concepts are connecting points we call 'SuperTypes', that also function as tie-in points to underlying tree structures of reference concepts (RCs). (Remember there are about 58,000 RCs across all of KBpedia.)

We call each tree structure a 'typology', which has a root concept that is one of the upper SuperType concepts. The tree structures in each typology are built from <code>rdfs:subClassOf</code> relations, also known as '<code>is-a</code>'. The typologies range in size from a few hundred RCs to multiple thousands in some cases. The combination of the upper KKO structure and its supporting 70 or so typologies provides the conceptual backbone to KBpedia. We discussed this general terminology in our earlier [**CWPK #18**](https://www.mkbergman.com/2348/cwpk-18-basic-terminology-and-load-kbpedia/) installment.
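
As a rough illustration of what a tree built from <code>rdfs:subClassOf</code> relations looks like, the toy sketch below represents a tiny typology as a plain Python dictionary mapping each concept to its direct parents. The concept names are borrowed from the <code>InquiryMethods</code> listing later in this installment, but the parent assignments and the <code>depth</code> helper are invented for illustration and are not KBpedia data.

```python
# Toy typology: each concept maps to its direct parents (rdfs:subClassOf).
# The hierarchy is invented for illustration; it is NOT KBpedia data.
toy_typology = {
    'Evaluating': ['InquiryMethods'],
    'Diagnosing': ['Evaluating'],
    'GramStainTest': ['Diagnosing'],
    'Reasoning': ['InquiryMethods'],
    'MethodsOfProof': ['Reasoning'],
}
root = 'InquiryMethods'

def depth(concept):
    """Count is-a steps from a concept up to the root (first parent only)."""
    steps = 0
    while concept != root:
        concept = toy_typology[concept][0]
        steps += 1
    return steps

# Print each concept indented by its distance from the root.
for concept in toy_typology:
    print('  ' * depth(concept) + concept)
```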

Each typology extracted from KBpedia can be inspected as a standalone ontology in something like the [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)) [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment). Typologies can be created or modified offline and then imported back into KBpedia, steps we will address in later installments. The individual typologies are modular in nature, and a bit easier to inspect and maintain when dealt with independently of the entire KBpedia structure.

### Starting and Load
We begin with our standard opening routine, though we are a bit more specific about identifying prefixes in our name spaces:

<div style="background-color:#eee; border:1px dotted #aaa; vertical-align:middle; margin:15px 60px; padding:8px;"><strong>Which environment?</strong> The specific load routine you should choose below depends on whether you are using the online MyBinder service (the 'raw' version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (<code>#</code>) out.</div>

In [2]:
main = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# main = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

As always, we execute each cell as we progress down this notebook page by pressing <code>shift+enter</code> for the highlighted cell or by choosing Run from the notebook menu.

We will start by picking one of our smaller typologies on <code>InquiryMethods</code> since its listing is a little easier to handle than one of the bigger typologies (such as <code>Products</code> or <code>Animals</code>). Unlike nearly all of the other RCs, which are labeled in the singular, note we use plural names for these SuperType RCs.

The SuperType is also the 'root' of the typology. What we are going to do is use the Owlready2 built-in <code>descendants()</code> method for extracting out a listing of all children, grandchildren, etc., starting with our root. (Another method, <code>ancestors()</code> navigates in the opposite direction to grab parents, grandparents, etc., all the way up to the ultimate root of any OWL ontology, <code>owl:Thing</code>.) Note in these commands that we are also removing the starting node from our listing as shown in the last statement:

In [3]:
root = kko.InquiryMethods
s_set=root.descendants()
s_set.remove(root)




Owlready2 has an alternate way to not include the starting class in its listing, using the <code>include_self = False</code> argument. You may want to restart the kernel to clear memory before testing this one:

In [4]:
root = kko.InquiryMethods
s_set=root.descendants(include_self = False)

We can then see the members of <code>s_set</code>:

In [5]:
list(s_set)

[rc.DriverVisionTest,
 rc.StemCellResearch,
 rc.AnalyticNumberTheory,
 rc.ComputationalGroupTheory,
 rc.HeuristicSearching,
 rc.MedicalResearch,
 rc.Comparing,
 rc.YachtDesign,
 rc.PGroups,
 rc.SolarSystemModel,
 rc.AirNavigation,
 rc.CriticismOfMarriage,
 rc.ScientificObservation,
 rc.PokerStrategy,
 rc.MesoscopicPhysics,
 rc.Reasoning,
 rc.SalesContractNegotiation,
 rc.SocraticDialogue,
 rc.ArgumentFromMorality,
 rc.GramStainTest,
 rc.Checking-Evaluating,
 rc.TwinStudies,
 rc.ComputationalNumberTheory,
 rc.Surveillance,
 rc.MethodsOfProof,
 rc.InfiniteGroupTheory,
 rc.Examination-Investigation,
 rc.MedicalEvaluationWithImaging,
 rc.Diagnosing,
 rc.TragedyOfTheCommons,
 rc.Survey,
 rc.RepresentationTheory,
 rc.SportsTraining,
 rc.CelestialNavigation,
 rc.Metatheorem,
 rc.ModelingAndSimulation,
 rc.CriticismOfMormonism,
 rc.QuantumPhase,
 rc.Evaluating,
 rc.LatticeModel,
 rc.BreastCancerScreening,
 rc.SolvingAProblem,
 rc.NetworkTheory,
 rc.AnalyzingSomething,
 rc.TransfiniteCardinal,
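
Conceptually, <code>descendants()</code> performs a transitive walk down the subclass links. The sketch below imitates that behavior on a plain dictionary. It is not Owlready2's implementation; the <code>children</code> mapping and the <code>descendants</code> helper are hypothetical stand-ins, and the toy hierarchy is invented rather than taken from KBpedia. It also mirrors the <code>include_self</code> option.

```python
# Toy hierarchy: parent -> direct children. Invented for illustration only.
children = {
    'InquiryMethods': ['Evaluating', 'Reasoning'],
    'Evaluating': ['Diagnosing'],
    'Diagnosing': ['GramStainTest'],
    'Reasoning': ['MethodsOfProof', 'Diagnosing'],  # Diagnosing has two parents
}

def descendants(node, include_self=True):
    """Return the set of all classes reachable below node."""
    found = {node} if include_self else set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:      # skip classes already reached another way
            found.add(current)
            stack.extend(children.get(current, []))
    return found

s_set = descendants('InquiryMethods', include_self=False)
print(len(s_set), sorted(s_set))
```

Because the result is a set, a class with more than one parent (such as the toy <code>Diagnosing</code> here) is still listed only once.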


After doing some counts (<code>len(s_set)</code> for example) and inspections of the list, we determine that the code block so far is providing the entire list of sub-classes under the <code>root</code>. Now we want to start formatting our output similar to the flat files we are using. We begin by prefixing our variable names with <code>s_</code>, <code>p_</code>, <code>o_</code> to correspond to our *subject - predicate - object* triples close to the native N3 format. We'll continue to see this pattern over multiple variables in multiple code blocks for multiple installments.

We also set up an iterator to loop over the <code>s_set</code>, generating an <code>s_item</code> for each element encountered in the list. We add a <code>print</code> to echo each line back to the screen:

In [None]:
o_frag = list()
s_frag = list()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_item = s_item.is_a
   print(s_item,p_item,o_item)

Hmm, we see many of the <code>o_item</code> entries are in fact sets with more than one member. This means, of course, that a given entry has multiple parents. For input specification purposes, each one of those parents needs to have its own triple assertion. Thus, we also need to iterate over the <code>o_set</code> entries so that each pass produces a single assignment. So, we need to insert another <code>for</code> iteration loop, and indent it as Python expects. Notice, too, that the <code>for</code> statements opening these loops each terminate with a ':'.

In [None]:
o_frag = list()
s_frag = list()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       print(s_item,p_item,o_item)
       o_frag.append(o_item)
       s_frag.append(s_item) 

We test with the length (<code>len()</code>) function to see if we have picked up items.

In [None]:
len(o_frag)

Hmmm, that's not good. The sizes of <code>o_frag</code> and <code>s_frag</code> are showing to be the same, but we already saw there were multiple objects for some of the subjects. Clearly, we're still not counting and processing this right.
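
To see why the two counts come out identical, consider the toy sketch below, which uses an invented child-to-parents dictionary in place of <code>is_a</code> (the data and the <code>s_list</code> and <code>o_list</code> names are hypothetical). Because both <code>append</code> calls sit in the inner loop, each list grows by one entry per triple, so the lengths match by construction. Collecting into sets, and adding the subject once per outer iteration, gives the distinct counts instead.

```python
# Invented child -> parents mapping standing in for s_item.is_a.
is_a = {
    'Diagnosing': ['Evaluating', 'Reasoning'],
    'GramStainTest': ['Diagnosing'],
    'MethodsOfProof': ['Reasoning'],
    'TwinStudies': ['Evaluating'],
}

# Lists appended inside the inner loop: one entry per triple in both.
s_list, o_list = [], []
for s_item, o_set in is_a.items():
    for o_item in o_set:
        s_list.append(s_item)
        o_list.append(o_item)
print(len(s_list), len(o_list))

# Sets, with the subject added once per outer iteration: distinct members.
s_frag, o_frag = set(), set()
for s_item, o_set in is_a.items():
    for o_item in o_set:
        o_frag.add(o_item)
    s_frag.add(s_item)
print(len(s_frag), len(o_frag))
```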

So, we need to make two final changes to this routine. First, we want to get the population of our sets correct. We can see in our prior example that we were counting <code>o_frag</code> and <code>s_frag</code> as part of the same loop, but that is not correct. The <code>s_frag</code> needs to be linked with processing the subject set. We change the indent to assign this correctly, and we switch both collections from lists to sets so that repeated entries are counted only once. (Testing this may require you to choose Kernel &rarr; Restart & Clear Output and then run all of the above cells.)

The second change we want is for our output to begin to conform to a CSV file with leading and trailing white spaces removed and entries separated by commas, moving us again toward an N3 format. Here are the resulting changes:

In [None]:
o_frag = set()
s_frag = set()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
       o_frag.add(o_item)
   s_frag.add(s_item) 

Getting rid of the leading and trailing white spaces is a little tricky. By default, <code>print</code> separates its arguments with a space; the <code>sep=''</code> keyword argument above removes it. Note that this keyword belongs to the Python 3 <code>print()</code> function; the Python 2 <code>print</code> statement does not support it and would fail. Since I have no legacy Python code I can afford to rely on the latest versions of the language. But little nuances such as this are something to be aware of as you research various methods, commands and arguments.
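
A quick way to see what <code>sep=''</code> and <code>end=''</code> do is to compare the default <code>print</code> behavior with the call used above. The sketch below uses placeholder strings rather than Owlready2 objects, and it captures the output in string buffers purely so the results can be inspected. The <code>join</code>-based line at the end is an alternative way to build the same text, not part of the routine above.

```python
import io

# Placeholder strings standing in for the Owlready2 objects.
s_item, p_item, o_item = 'rc.Diagnosing', 'rdfs:subClassOf', 'rc.Evaluating'

# Default separator: print puts a space between every argument.
default_buf = io.StringIO()
print(s_item, ',', p_item, ',', o_item, '.', file=default_buf)

# sep='' drops those spaces; end='' suppresses the automatic newline,
# so the explicit '\n' argument is the only line break written.
tight_buf = io.StringIO()
print(s_item, ',', p_item, ',', o_item, '.', '\n', sep='', end='', file=tight_buf)

# An equivalent line built with str.join.
joined = ','.join([s_item, p_item, o_item]) + '.\n'

print(repr(default_buf.getvalue()))
print(repr(tight_buf.getvalue()))
```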

We can also check counts again to ensure everything is now correct:

In [None]:
len(s_frag)

And we can start playing around with some of the set methods, in this case the <code>.intersection</code> between our two sets:

In [None]:
len(o_frag.intersection(s_frag))
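
For reference, a few of the standard set methods behave as follows on toy data. The member names below are placeholders, not the real contents of <code>s_frag</code> and <code>o_frag</code>, and the variable names <code>both</code>, <code>leaves</code> and <code>outside</code> are hypothetical. The intersection holds items that appear both as a subject and as an object, while the two differences isolate subjects that are never a parent and parents that are not themselves subjects.

```python
# Placeholder members; the real sets hold Owlready2 class objects.
s_frag = {'Diagnosing', 'GramStainTest', 'MethodsOfProof', 'Reasoning'}
o_frag = {'Diagnosing', 'Reasoning', 'InquiryMethods'}

both = o_frag.intersection(s_frag)     # appear as subject and as object
leaves = s_frag.difference(o_frag)     # subjects that are never a parent
outside = o_frag.difference(s_frag)    # parents that are not subjects

print(sorted(both))
print(sorted(leaves))
print(sorted(outside))
```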

This is all looking pretty good, though we have not yet dealt with putting the full URIs into the triples. That is straightforward so we can afford to put that off until we are ready to generate the actual typologies. But we realize we also have missed one final piece of the logic necessary to have our typologies readable as separate ontologies: declaring all of our classes as such under the standard <code>owl:Thing</code>. These new classes correspond to each of the entries in the <code>s_frag</code> set, so we add another line in a <code>print</code> statement to do so. We also add an <code>if</code> test so that a <code>subClassOf</code> triple is written only when the parent is itself a member of <code>s_set</code>, that is, within the typology.

In [None]:
o_frag = set()
s_frag = set()
p_item = 'rdfs:subClassOf'
new_class = 'owl:Thing'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       if o_item in s_set:
           print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
           o_frag.add(o_item)
   s_frag.add(s_item)
   print(s_item,',','a',',',new_class,'.','\n', sep='', end='')
len(s_frag)

Great, our logic appears correct and our counts do, too. So we can consider this code block as developed enough for assembly into a formal method and then module. Let's now move on to prototyping other components in the KBpedia structure.
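
As a preview of what such a method might look like, here is a minimal sketch that wraps the same logic in a generator function and writes the rows with the standard <code>csv</code> module. The function name <code>typology_rows</code>, the toy <code>toy_is_a</code> dictionary, and the in-memory buffer are all hypothetical; a real routine would iterate over Owlready2 classes instead, and this sketch omits the trailing '.' used in the cell above.

```python
import csv
import io

def typology_rows(is_a, p_item='rdfs:subClassOf', new_class='owl:Thing'):
    """Yield (subject, predicate, object) rows for a toy child -> parents map."""
    s_set = set(is_a)
    for s_item in sorted(s_set):
        for o_item in is_a[s_item]:
            if o_item in s_set:          # keep only parents inside the typology
                yield (s_item, p_item, o_item)
        yield (s_item, 'a', new_class)   # declare each subject as a class

# Invented data standing in for the descendants of a SuperType.
toy_is_a = {
    'Diagnosing': ['Evaluating', 'Cognition'],
    'Evaluating': ['InquiryMethods'],
}

# Write the rows as comma-separated lines into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(typology_rows(toy_is_a))
print(buffer.getvalue(), end='')
```

Using a generator keeps the triple logic separate from the output step, so the same rows could later go to a file rather than the screen.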

### Additional Documentation

Here are some other interactive resources related to today's **CWPK** installment:

- Nice [Stack Overflow](https://stackoverflow.com/questions/1388818/how-can-i-compare-two-lists-in-python-and-return-matches) discussion on comparing two lists and returning matches
- [2D lists](https://www.cs.cmu.edu/~112/notes/notes-2d-lists.html)
- [Arrays](https://snakify.org/en/lessons/two_dimensional_lists_arrays/)


 <div style="background-color:#efefff; border:1px dotted #ceceff; vertical-align:middle; margin:15px 60px; padding:8px;"> 
  <span style="font-weight: bold;">NOTE:</span> This article is part of the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/" style="font-style: italic;">Cooking with Python and KBpedia</a> series. See the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/"><strong>CWPK</strong> listing</a> for other articles in the series. <a href="http://kbpedia.org/">KBpedia</a> has its own Web site.
  </div>

<div style="background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;"> 

<span style="font-weight: bold;">NOTE:</span> This <strong>CWPK 
installment</strong> is available both as an online interactive
file <a href="https://mybinder.org/v2/gh/Cognonto/CWPK/master" ><img src="https://mybinder.org/badge_logo.svg" style="display:inline-block; vertical-align: middle;" /></a> or as a <a href="https://github.com/Cognonto/CWPK" title="CWPK notebook" alt="CWPK notebook">direct download</a> to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>

<div style="background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;"> 
<div style="float: left; margin-right: 5px;"><img src="http://kbpedia.org/cwpk-files/warning.png" title="Caution!" width="32" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href="mailto:mike@mkbergman.com">notify me</a> should you make improvements.    

</div>