# KG measures and clustering tutorial

First, make sure you have networkx installed. You can find its documentation [here](https://networkx.org/documentation/stable/reference/index.html).

In [54]:
## Uncomment if you do not have networkx installed (you should have it installed from the RDFS tutorial)
#import sys
#!{sys.executable} -m pip install networkx

import pandas as pd
import rdflib
# from rdflib import Graph, Literal, Namespace, RDF, URIRef, OWL
from rdflib.namespace import DC, FOAF

import networkx as nx
from owlready2 import *

In this tutorial, we will focus on how to characterize an ontology or knowledge graph.

Load the ontology that you created in the OWL tutorial. The cell below loads the inferred OWL file.

In [2]:
# Use rdflib's Graph explicitly; the bare name Graph is not imported above
ontology = rdflib.Graph()
ontology.parse("data/my_music_ontology_inferred.owl")

<Graph identifier=N7d2a2a8de3c14be78cecde8685b384ea (<class 'rdflib.graph.Graph'>)>

## 1. Basic (ontology) measures

Let's first focus on calculating basic measures:
* number of classes
* number of properties
* number of individuals
* number of triples
* number of entities (classes, individuals, etc.; anything that can be placed in the subject position of a triple/axiom)

We start by counting the number of classes in the ontology. This can be done using a SPARQL query. We want to get the unique classes used in the ontology. 

PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?s 
WHERE { ?s rdf:type owl:Class. FILTER isURI(?s) }

This query gives us all the classes that have a definition in the ontology. However, this number does not have to equal the number of classes actually used by individuals. Hence, you need to be very specific about what the number you are retrieving represents.

In [14]:
answer = list(ontology.query(
    'PREFIX owl: <http://www.w3.org/2002/07/owl#> SELECT DISTINCT ?s WHERE { ?s rdf:type owl:Class. FILTER isURI(?s) }'
))
print("Number of classes: {f}".format(f=len(answer)))
for r in answer:
    print(r)


Number of classes: 9
(rdflib.term.URIRef('http://test.org/myonto.owl#Person'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#Artist'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#Location'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#Song'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#Genre'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#Member'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#SoloArtist'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#SubGenre'),)
(rdflib.term.URIRef('http://test.org/myonto.owl#CollaboratingArtist'),)


Even though we used a query, a lot of this information can also be retrieved with owlready2. For example, the number of classes can be retrieved with the function onto.classes(). It returns all classes in the ontology. We try it below.

In [19]:
onto_file = "data/my_music_ontology_inferred.owl"
or_ontology = get_ontology(onto_file).load()
answer = list(or_ontology.classes())

print("Number of classes: {f}".format(f=len(answer)))
for r in answer:
    print(r)

Number of classes: 9
my_music_ontology_inferred.Artist
my_music_ontology_inferred.Song
my_music_ontology_inferred.Location
my_music_ontology_inferred.Genre
my_music_ontology_inferred.Person
my_music_ontology_inferred.Member
my_music_ontology_inferred.SoloArtist
my_music_ontology_inferred.SubGenre
my_music_ontology_inferred.CollaboratingArtist


In [40]:
ind = list(or_ontology.individuals())
print(len(ind))

13


### Exercise 1

Get the following metrics from your ontology using queries, and check your answers using owlready2 functions.

* number of properties
* number of individuals
* number of triples
* number of entities (classes, individuals, etc.; anything that can be placed in the subject position of a triple/axiom)

In [None]:
### your code here.

## 2. Converting KGs into Gs

To make use of graph measures, we need to convert our ontology into a mathematical graph in networkx.


We first need to remove all the logic before we can do the conversion.
We are interested in keeping the following things:
* individuals
* classes
* relationships between individuals and classes

What we need to remove is:
* restrictions
* domain/range
* property definitions

There are two ways for us to do it: we can either remove the information from the existing graph, or create a new graph using only the information we are interested in. Depending on the size and complexity of your knowledge graph, one way will be preferable to the other. You also need to consider whether you want to keep the inferred information in your graph after conversion. Here, we want to keep the inferred information, but that depends on the task you will then execute (for link prediction, you probably want the uninferred ontology and use the inferred information as a test set).

rdflib comes with a function that lets us convert an rdflib graph into a networkx graph.

In [38]:
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
nx_graph = rdflib_to_networkx_digraph(ontology)

list(nx_graph.nodes())

[rdflib.term.URIRef('http://test.org/myonto.owl#SubGenre'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#Class'),
 rdflib.term.URIRef('http://test.org/myonto.owl#birthDate'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#DatatypeProperty'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Edinburgh'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Artist'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Scotland'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#Thing'),
 rdflib.term.BNode('Nd01b0451ef7246229aaa9e0e358c8935'),
 rdflib.term.URIRef('http://test.org/myonto.owl#collaboratesWith'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Location'),
 rdflib.term.BNode('Na5f34ba4389e46719398450e0a094684'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#Restriction'),
 rdflib.term.URIRef('http://test.org/myonto.owl#massive_attack'),
 rdflib.term.URIRef('http://test.org/myonto.owl#CollaboratingArtist'),
 rdflib.term.BNode('Nafd73ef509c44f7e8336bab7c42afc1b'),
 rdflib

As we can see, some blank nodes were converted into the graph that are not very useful for us at this stage. To analyse the graph as a mathematical graph, we don't want the class restrictions or the property range and domain definitions in our graph, as we are no longer doing any reasoning.

Often, it is easier to create a new graph than to remove already modeled information from an existing one. Instead of continuing with the ontology, we will create a graph from the metadata provided in 'data/musicoset_metadata', while adhering to the ontology from before (using its property and class names, etc.).

In [67]:
csv_albums =  pd.read_csv('data/musicoset_metadata/albums.csv',sep='\t')
print(csv_albums.columns)
csv_artists =  pd.read_csv('data/musicoset_metadata/artists.csv',sep='\t')
print(csv_artists.columns)
csv_songs =  pd.read_csv('data/musicoset_metadata/songs.csv',sep='\t')
print(csv_songs.columns)
csv_tracks =  pd.read_csv('data/musicoset_metadata/tracks.csv',sep='\t')
print(csv_tracks.columns)

Index(['album_id', 'name', 'billboard', 'artists', 'popularity',
       'total_tracks', 'album_type', 'image_url'],
      dtype='object')
Index(['artist_id', 'name', 'followers', 'popularity', 'artist_type',
       'main_genre', 'genres', 'image_url'],
      dtype='object')
Index(['song_id', 'song_name', 'billboard', 'artists', 'popularity',
       'explicit', 'song_type'],
      dtype='object')
Index(['song_id', 'album_id', 'track_number', 'release_date',
       'release_date_precision'],
      dtype='object')


In [60]:
# We have prepared a simplified ontology to use in this tutorial.
# This ontology doesn't have any restrictions or domain/range definitions,
# which avoids blank nodes when converting to networkx.
music_onto = rdflib.Graph()
music_onto.parse("data/music_onto_simple.rdf")

nx_music = rdflib_to_networkx_digraph(music_onto)
list(nx_music.nodes())

[rdflib.term.URIRef('http://test.org/myonto.owl#SoloArtist'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#Class'),
 rdflib.term.URIRef('http://test.org/myonto.owl#SubGenre'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Member'),
 rdflib.term.URIRef('http://test.org/myonto.owl#followers'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#DatatypeProperty'),
 rdflib.term.URIRef('http://test.org/myonto.owl#bandMember'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#ObjectProperty'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Genre'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Person'),
 rdflib.term.URIRef('http://test.org/myonto.owl#CollaboratingArtist'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Location'),
 rdflib.term.URIRef('http://test.org/myonto.owl#releaseDate'),
 rdflib.term.URIRef('http://test.org/myonto.owl#Artist'),
 rdflib.term.URIRef('http://test.org/myonto.owl#collaboratesWith'),
 rdflib.term.URIRef('http://www.w3.org/2002/07/owl#Thing'),
 r

This ontology no longer produces any blank nodes. So we can now populate it with the metadata loaded from the CSV.

In [68]:
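music_onto.parse("data/music_onto_simple.rdf")

from rdflib import RDF, Literal
EX = rdflib.Namespace("http://test.org/myonto.owl#")

# Add only the first artist (note the break) to check the mapping.
# Assumption: the artist URI is built from artist_id; the Artist class and
# the followers property come from the simplified ontology loaded above.
for artist in csv_artists.iterrows():
    print(artist)
    _, row = artist
    artist_uri = EX[row["artist_id"]]
    music_onto.add((artist_uri, RDF.type, EX.Artist))
    music_onto.add((artist_uri, EX.followers, Literal(int(row["followers"]))))

    break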


(0, artist_id                                 66CXWjxzNUsdJxJ2JdwvnR
name                                               Ariana Grande
followers                                               34554242
popularity                                                    96
artist_type                                               singer
main_genre                                             dance pop
genres                     ['dance pop', 'pop', 'post-teen pop']
image_url      https://i.scdn.co/image/b1dfbe843b0b9f54ab2e58...
Name: 0, dtype: object)


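Once the graph is populated, it can be converted to networkx in the same way as before, using `rdflib_to_networkx_digraph`.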

### Exercise 2

* Convert your knowledge graph or ontology into a networkx graph. 
* Write a function that checks for blank nodes in your networkx graph.

## 3. Graph Measures

Now we can calculate some graph measures over the NetworkX graph. The library provides many different measures. Always check which assumptions a measure makes:
* directed or undirected graph?
* does the graph have to be connected?
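
These two questions can be checked directly on the graph before choosing a measure. The snippet below is a small sketch on an invented four-node directed graph, not on the music graph itself.

```python
import networkx as nx

# Toy directed graph, invented for illustration
toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

is_directed = toy.is_directed()
# For directed graphs there are two notions of connectivity
weakly_connected = nx.is_weakly_connected(toy)
strongly_connected = nx.is_strongly_connected(toy)

# Measures defined only for undirected graphs need a converted copy
undirected = toy.to_undirected()
connected = nx.is_connected(undirected)

print("directed:", is_directed)
print("weakly connected:", weakly_connected)
print("strongly connected:", strongly_connected)
print("connected (undirected copy):", connected)
```

A graph converted from RDF is directed, so it is common for it to be weakly but not strongly connected, as in this toy example.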

We will first calculate some basic graph measures: number of nodes, number of edges and the density of the graph.
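
As a rough sketch of these three measures, the example below uses an invented four-node directed graph in place of the music graph. For a directed graph with n nodes and m edges, `nx.density` computes m / (n (n - 1)).

```python
import networkx as nx

# Toy directed graph, invented for illustration
toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

n_nodes = toy.number_of_nodes()
n_edges = toy.number_of_edges()
density = nx.density(toy)  # m / (n * (n - 1)) for a directed graph

print("Number of nodes:", n_nodes)
print("Number of edges:", n_edges)
print("Density: {d:.3f}".format(d=density))
```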

We will now look at the degree distribution of our nodes by retrieving the degree of each node and plotting a histogram.
Additionally, we want to know the average degree centrality of our graph and put it in the title of the histogram.
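
A possible way to do this is sketched below on an invented toy graph. The helper `plot_degree_histogram` is a hypothetical name, and the plot assumes matplotlib is installed (it is imported only when the function is called).

```python
import networkx as nx

# Toy directed graph, invented for illustration
toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

# For a DiGraph, degree() counts incoming plus outgoing edges
degrees = [degree for _, degree in toy.degree()]

# Degree centrality is the degree divided by (n - 1); we average it over all nodes
centrality = nx.degree_centrality(toy)
avg_centrality = sum(centrality.values()) / len(centrality)


def plot_degree_histogram(degrees, avg_centrality):
    """Plot a degree histogram with the average degree centrality in the title."""
    import matplotlib.pyplot as plt  # assumed to be installed

    bins = range(min(degrees), max(degrees) + 2)  # one bin per integer degree
    plt.hist(degrees, bins=bins, align="left", rwidth=0.8)
    plt.xlabel("Degree")
    plt.ylabel("Number of nodes")
    plt.title("Degree distribution (average degree centrality: {c:.3f})".format(c=avg_centrality))
    plt.show()


print(sorted(degrees))
print(round(avg_centrality, 3))
```

In a notebook cell, calling `plot_degree_histogram(degrees, avg_centrality)` draws the figure.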

Another very useful measure is the clustering coefficient, which tells us how likely the nodes are to build clusters. This is a global measure.
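
The sketch below computes the clustering coefficient on an invented toy graph. NetworkX reports a local coefficient per node with `nx.clustering`; `nx.average_clustering` and `nx.transitivity` are two ways of summarising it into a single global number. Converting to an undirected graph first is a simplifying assumption made here, not a requirement stated in the tutorial.

```python
import networkx as nx

# Toy directed graph (a triangle a-b-c plus a pendant node d), invented for illustration
toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

# Simplifying assumption: ignore edge direction for the clustering coefficient
undirected = toy.to_undirected()

local_clustering = nx.clustering(undirected)        # one value per node
avg_clustering = nx.average_clustering(undirected)  # mean of the local values
transitivity = nx.transitivity(undirected)          # 3 * triangles / connected triples

print(local_clustering)
print("Average clustering: {c:.3f}".format(c=avg_clustering))
print("Transitivity: {t:.3f}".format(t=transitivity))
```

The two global values generally differ, because the average gives every node equal weight while transitivity gives more weight to high-degree nodes.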

### Exercise 3

Calculate and visualise the centrality of the music graph. Use a measure other than degree. For the available measures, you can refer to the online [documentation](https://networkx.org/documentation/stable/reference/algorithms/centrality.html). Choose wisely, though: some measures (such as betweenness or eigenvector centrality) take a long time to calculate.

As a second step, take some time to explore the documentation of networkx. Is there anything else you can calculate and learn about the graph?

In [None]:
### your code here

## 4. Clustering

NetworkX already comes with some clustering algorithms. We will try the one introduced in the theory part of the class, the Louvain clustering algorithm.
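
As a minimal sketch, Louvain clustering can be run with `louvain_communities`, which is available in recent NetworkX releases (2.8 or later, to the best of our knowledge). The example uses a generated barbell graph (two cliques joined by one edge) instead of the music graph, and the seed value is arbitrary.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two 5-node cliques joined by a single edge
toy = nx.barbell_graph(5, 0)

# The algorithm is randomised; fixing the seed makes the result reproducible
communities = louvain_communities(toy, seed=42)

# Modularity scores how well the partition separates dense groups
score = modularity(toy, communities)

for i, members in enumerate(communities):
    print("Community {i}: {m}".format(i=i, m=sorted(members)))
print("Modularity: {s:.3f}".format(s=score))
```

Each community is returned as a set of nodes, and together the sets partition the graph, so every node belongs to exactly one community.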