Introduction to KG Builder Notebook
==============

KG Builder is a library that we developed to build three pipelines for knowledge creation, extracting data mainly from Wikidata. The resulting knowledge graphs can then be used for purposes such as data analysis, machine learning, and knowledge management. For more information, please refer to the paper *Strategies for creating knowledge graphs to depict a multi-perspective Queer communities representation*, provided alongside this notebook.

The three pipelines that we have created using KG Builder are:

- **Pure SPARQL**

- **Star Merging**

- **Crawler**

In this notebook, we describe each of these pipelines in detail and provide examples of how they can be used. We also demonstrate how to use KG Builder to clean, handle, visualize and analyse the knowledge graphs. Overall, this notebook aims to provide a comprehensive guide to using KG Builder for knowledge creation from Wikidata.

# Installing the dependencies

The visualization tools used here require specific versions of networkx and scipy.

In [1]:
!pip install -U pip
!pip install rdflib
!pip install datashader

!pip install --upgrade networkx==2.6 scipy==1.8.0


In [3]:
"""from rdflib import Graph
from rdflib.extras.external_graph_libs import rdflib_to_networkx_multidigraph,rdflib_to_networkx_digraph
import networkx as nx
import matplotlib.pyplot as plt

import requests

import pandas as pd

import datashader as ds
import datashader.transfer_functions as tf
from datashader.layout import random_layout, circular_layout, forceatlas2_layout
from datashader.bundling import connect_edges, hammer_bundle


from itertools import chain

import scipy"""

import datashader.transfer_functions as tf

In [5]:
from queries import *
from visuals import create_plot_graph_force_directed

SyntaxError: invalid syntax (3838046122.py, line 1)

# Pure SPARQL pipeline

On the Wikidata Query Service at https://query.wikidata.org/, enter the following query:
```
CONSTRUCT {
  ?person ?pred ?obj .
}
WHERE {
  {
    ?person wdt:P31 wd:Q5 . # ?person is a human
    ?person ?pred ?obj .
    {
      ?person wdt:P21 ?sexorgender . # ?person has ?sexorgender
      # ?sexorgender is not male, female, cisgender male, cisgender female, or cisgender person
      FILTER(?sexorgender NOT IN (wd:Q6581097, wd:Q6581072, wd:Q15145778, wd:Q15145779, wd:Q1093205)).
    } UNION {
      ?person wdt:P91 ?sexualorientation . # ?person has ?sexualorientation
      FILTER(?sexualorientation != wd:Q1035954). # ?sexualorientation is not heterosexual
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
```
Then download the result as a .csv file.
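
Once downloaded, the file can be read with Python's standard library. The sketch below is only an illustration: it assumes the export has `subject`, `predicate` and `object` columns (check this against your own file), and the helper names `load_triples` and `group_by_subject` as well as the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def load_triples(csv_file):
    """Read (subject, predicate, object) rows from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [(row["subject"], row["predicate"], row["object"]) for row in reader]


def group_by_subject(triples):
    """Map each subject to the list of (predicate, object) pairs it carries."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        grouped[subject].append((predicate, obj))
    return dict(grouped)


# Made-up rows standing in for the downloaded file (assumed column names).
sample = io.StringIO(
    "subject,predicate,object\n"
    "http://example.org/a,http://example.org/knows,http://example.org/b\n"
    "http://example.org/a,http://example.org/name,Alice\n"
)
triples = load_triples(sample)
by_subject = group_by_subject(triples)
print(len(triples), "triples for", len(by_subject), "subject(s)")
```

With a real export, `sample` would be replaced by an open file handle, for example `open("query.csv", newline="", encoding="utf-8")`, where the filename is hypothetical.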

# Star Merging Pipeline

This pipeline starts with a SPARQL query to Wikidata that returns a list of Wikidata item IDs related to the queer community. For each item, the RDF data is fetched from its Wikidata entity URL in N-Triples format, and the results are combined into a single merged RDF graph. The merged graph is then converted to a NetworkX multidigraph, which is pruned and cleaned in various ways.
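
The merge-then-prune idea can be sketched without any RDF library. The sketch below treats the data of each item as a list of (subject, predicate, object) triples, merges these "stars" by set union, and then applies two pruning rules. The function names, the sample triples, and the meaning given here to `'deadend'` (a node with no outgoing edge) and `'isolated'` (a node that takes part in no edge) are assumptions made for illustration; `star_merging_pipeline` may define them differently.

```python
def merge_stars(stars):
    """Union the triples of several items into one edge set (duplicates collapse)."""
    merged = set()
    for triples in stars:
        merged |= set(triples)
    return merged


def prune(nodes, edges, policy):
    """Apply each pruning rule named in `policy` once, in the given order."""
    nodes, edges = set(nodes), set(edges)
    for rule in policy:
        if rule == "deadend":
            # Assumed meaning: drop nodes that have no outgoing edge.
            sources = {s for s, _, _ in edges}
            nodes = {n for n in nodes if n in sources}
        elif rule == "isolated":
            # Assumed meaning: drop nodes that take part in no edge.
            connected = {s for s, _, _ in edges} | {o for _, _, o in edges}
            nodes = {n for n in nodes if n in connected}
        else:
            raise ValueError(f"unknown pruning rule: {rule}")
        # Keep only the edges whose two endpoints both survived.
        edges = {(s, p, o) for s, p, o in edges if s in nodes and o in nodes}
    return nodes, edges


# Two made-up stars that share the node "C".
star_a = [("A", "p1", "C"), ("A", "p2", "X")]
star_b = [("B", "p1", "C"), ("C", "p3", "A")]

edges = merge_stars([star_a, star_b])
nodes = {s for s, _, _ in edges} | {o for _, _, o in edges}
kept_nodes, kept_edges = prune(nodes, edges, ["deadend", "isolated"])
print(sorted(kept_nodes), len(kept_edges))
```

Here the node "X" only receives an edge, so the assumed dead-end rule removes it together with the edge pointing to it.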

In [None]:
from star_merging import star_merging_pipeline

prune_policy=['deadend','isolated']
G,l=star_merging_pipeline(2, query_queer_world, prune_policy)

In [None]:
image=create_plot_graph_force_directed(G)
tf.Images(image).cols(1)

# Crawler

In contrast to the global approach of the previous pipelines, this pipeline starts with a small number of nodes and iteratively extracts important nodes that represent properties of interest (potential common points), then uses them to explore and discover more queer people and communities. The pipeline runs the PageRank algorithm on an initial set of nodes. PageRank is run multiple times with different parameters, such as the damping factor (alpha) and the number of iterations. The pipeline then selects a certain number of the most important nodes (k_prop) and uses them to explore further: it runs the same SPARQL query as before, this time with the selected nodes as the property of interest. The result is a new set of nodes connected to the previously selected nodes through the property of interest. These new nodes are added to the original graph, and the process repeats.
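
The ranking-and-selection step can be illustrated with a small, dependency-free PageRank. This is a textbook power iteration written for this example, not the code used by `crawler_process`; the function names, the default parameter values, and the toy graph are all assumptions.

```python
def pagerank(adjacency, alpha=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [successor, ...]}."""
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without successors is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not adjacency.get(node))
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        for node, targets in adjacency.items():
            if targets:
                share = alpha * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def top_k(rank, k):
    """Return the k highest-ranked nodes; ties are broken by node name."""
    return sorted(rank, key=lambda node: (-rank[node], node))[:k]


# Toy graph: three made-up people pointing at two made-up property values.
toy_graph = {
    "person1": ["propA", "propB"],
    "person2": ["propA"],
    "person3": ["propA"],
}
scores = pagerank(toy_graph, alpha=0.85, iterations=50)
selected = top_k(scores, k=1)
print(selected)
```

In this toy graph, `propA` is reached from all three people, so it is the node selected for the next exploration round.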

In [None]:
from crawler import crawler_process

prune_policy=['deadend','isolated']
G,l=star_merging_pipeline(2, query_queer_world, prune_policy)
new_G,l=crawler_process(G, 20, 1, 5, prune_policy, n_max=10, people_list=l)

In [None]:
image=create_plot_graph_force_directed(new_G)
tf.Images(image).cols(1)

# Exporting the graph

Exporting the graph to a JSON file was useful when the graph became too big.

In [None]:
from networkx.readwrite import json_graph
import json

# Serialise the graph to node-link JSON and close the file once written
g_json = json_graph.node_link_data(new_G)
with open("graph.json", "w") as f:
    json.dump(g_json, f)

# Visualisation

A readable visualisation requires showing less information. To achieve this, we can choose not to represent some edges, some nodes, or both. This choice is always arbitrary.

Our first representations were based on pruning methods that remove dead ends or isolated nodes. We now propose something more precise.

## Classification
The first step in selecting the data is classifying it. 

In [7]:
import pandas as pd
from classification import *

ModuleNotFoundError: No module named 'classification'

Each node is represented by an IRI, which we will use for

In [None]:
nodes = list(G.nodes)[:10]  # a NodeView cannot be sliced, so convert it to a list first
nodes_df = create_nodes_DataFrame(nodes, G)
nodes_df
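
Since nodes are identified by IRIs, one simple basis for classification is the IRI namespace. The sketch below separates Wikidata items, properties and external resources by prefix. The category names and the function `classify_iri` are hypothetical and do not reflect the contents of the `classification` module.

```python
# Wikidata namespaces for entities and for direct ("truthy") property statements.
ENTITY_NS = "http://www.wikidata.org/entity/"
DIRECT_PROP_NS = "http://www.wikidata.org/prop/direct/"


def classify_iri(iri):
    """Return a coarse category for a node IRI (categories are illustrative)."""
    if iri.startswith(ENTITY_NS):
        local = iri[len(ENTITY_NS):]
        if local[:1] == "Q" and local[1:].isdigit():
            return "item"
        if local[:1] == "P" and local[1:].isdigit():
            return "property"
        return "other entity"
    if iri.startswith(DIRECT_PROP_NS):
        return "direct property"
    return "external"


examples = [
    "http://www.wikidata.org/entity/Q5",
    "http://www.wikidata.org/prop/direct/P31",
    "http://example.org/some/page",
]
categories = {iri: classify_iri(iri) for iri in examples}
for iri, category in categories.items():
    print(category, iri)
```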

## Visualisation at smaller scale

In [None]:
nodes_df = pd.read_csv("collecting_data_with_SPARQL/cocteay_wiki.csv")
zoomed_in_graph(nodes_df, G)