# DBpedia Slicer
This notebook allows you to slice some entities from DBpedia.

## Importing Libraries
First, we will import the required libraries.
To install rdflib and SPARQLWrapper, execute the following command in a new cell:

```
!pip install rdflib SPARQLWrapper
```

In [1]:
import random
import rdflib
from SPARQLWrapper import SPARQLWrapper, JSON, POST, N3
import requests as rq
from requests import Request, Session
import codecs
import sys
import json

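In Python 2, the default system encoding had to be set to UTF-8 explicitly. Python 3 strings are Unicode by default, so the cell below is left commented out.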

In [2]:
#reload(sys)
#sys.setdefaultencoding("utf-8")

We define global variables with the endpoints.

In [3]:
endpoint = "http://dbpedia.org/sparql"
wikidata = "https://query.wikidata.org/sparql"
dydra = "https://dydra.com/mgalkin/dbpedia_hierarchy/sparql"

## Implementing methods
Method to count the subjects of a given type, e.g., Person.

In [24]:
def checkNumTriples(eType):
    query = """
            PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
            SELECT (COUNT(?s) AS ?num)
            WHERE {
                ?s a dbpedia-owl:%s
            }
    """ % eType
    gg = SPARQLWrapper(endpoint)
    gg.setQuery(query)
    gg.setReturnFormat(JSON)
    results = gg.query().convert()
    
    return results["results"]["bindings"][0]["num"]["value"]
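
To make the indexing in `checkNumTriples` easier to follow, here is a minimal offline sketch. The `sample_results` dictionary is a hand-written stand-in shaped like the SPARQL JSON results format, not a real response, and `extract_count` is a hypothetical helper that is not part of SPARQLWrapper.

```python
# Hand-written stand-in for what gg.query().convert() returns when the
# return format is JSON. The values are illustrative only.
sample_results = {
    "head": {"vars": ["num"]},
    "results": {
        "bindings": [
            {
                "num": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "352081",
                }
            }
        ]
    },
}


def extract_count(results):
    """Return the single COUNT value of a SPARQL JSON result as an int."""
    bindings = results["results"]["bindings"]
    if not bindings:
        # No row came back at all, so treat it as a count of zero.
        return 0
    # Values arrive as strings, so the caller has to convert them.
    return int(bindings[0]["num"]["value"])


print(extract_count(sample_results))
```

The value comes back as a string, which is why `createTrainingData` wraps the call to `checkNumTriples` in `int(...)`.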

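Method to fetch the subjects of a type in batches, together with their English labels and (optionally) their English `rdfs:comment` abstracts, and write them to a tab-separated file.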

In [29]:
def createTrainingData(eType, includeAbs):
    print("Process starting")
    gg = SPARQLWrapper(endpoint)
    gg.setReturnFormat(JSON)
    
    count = int(checkNumTriples(eType)) #900
    print("# of %s: %s"%(eType, count))
    
    file = codecs.open(eType+"_Training_data.txt","w","utf-8")
    limit = 500
    offset = 0
    
    while (count > offset): 
        query = """
            PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
            PREFIX dbp: <http://dbpedia.org/property/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
           
            SELECT ?s ?n ?abs
            WHERE {
                ?s a dbpedia-owl:%s .
                ?s rdfs:label ?n .
                ?s rdfs:comment ?abs .
                FILTER (lang(?n) = 'en' && lang(?abs) = 'en')
            } LIMIT %s OFFSET  %s
        """ %(eType, limit, offset)

        print ("Query next batch %s - %s" %(offset, offset + limit))
        
        gg.setQuery(query)
        results = gg.query().convert()
        i = 0

        for result in results["results"]["bindings"]:
            #if "DB" not in result["n"]["value"]:
            #    continue
            subject = result["s"]["value"]
            name = result["n"]["value"]

            # Log a sample row every 1000 rows. Note: i restarts at 0 for each
            # batch, so with limit = 500 this message never fires.
            i = i + 1
            if i % 1000 == 0:
                print ("%s - Example of data %s: %s" %(i, subject, name))

            if includeAbs:
                abst = result["abs"]["value"]
                file.write("<"+subject+">\t"+name+"\t"+abst+"\n")
            else:
                file.write("<"+subject+">\t"+name+"\n")
            
        offset = offset + limit
    
    file.close()
    print("Process finished")
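
The paging logic of `createTrainingData` can be sketched on its own, without contacting the endpoint. `batch_offsets` and `build_page_query` are hypothetical helpers introduced here for illustration. The `ORDER BY ?s` clause is an assumption added in this sketch and is not in the function above: SPARQL does not guarantee a stable row order without it, so consecutive `LIMIT`/`OFFSET` pages may overlap or skip rows on some endpoints.

```python
def batch_offsets(count, limit):
    """Yield the OFFSET of every page needed to cover `count` rows."""
    offset = 0
    while count > offset:
        yield offset
        offset += limit


def build_page_query(eType, limit, offset):
    """Build the query for one page (ORDER BY is an added assumption)."""
    return """
        PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        SELECT ?s ?n ?abs
        WHERE {
            ?s a dbpedia-owl:%s .
            ?s rdfs:label ?n .
            ?s rdfs:comment ?abs .
            FILTER (lang(?n) = 'en' && lang(?abs) = 'en')
        }
        ORDER BY ?s
        LIMIT %s OFFSET %s
    """ % (eType, limit, offset)


# Offsets for a made-up count of 1200 rows and pages of 500 rows.
offsets = list(batch_offsets(1200, 500))
print(offsets)
print(build_page_query("Person", 500, offsets[1]))
```

Note that the loop bound comes from counting all subjects of the type, while each page only returns rows that have an English label and comment, so the last pages of a run may come back empty.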

## Main method
Now we use the functions implemented above. Any DBpedia ontology class can be passed as the type, for example Person, Organisation, Drug, or Event.

In [31]:
createTrainingData("Organisation", True)

Process starting
# of Organisation: 352081
Query next batch 0 - 500
Query next batch 500 - 1000
Query next batch 1000 - 1500
Query next batch 1500 - 2000
Query next batch 2000 - 2500
Query next batch 2500 - 3000
Query next batch 3000 - 3500
Query next batch 3500 - 4000
Query next batch 4000 - 4500
Query next batch 4500 - 5000
Query next batch 5000 - 5500
Query next batch 5500 - 6000
Query next batch 6000 - 6500
Query next batch 6500 - 7000
Query next batch 7000 - 7500
Query next batch 7500 - 8000
Query next batch 8000 - 8500
Query next batch 8500 - 9000
Query next batch 9000 - 9500
Query next batch 9500 - 10000
Query next batch 10000 - 10500
Query next batch 10500 - 11000
Query next batch 11000 - 11500
Query next batch 11500 - 12000
Query next batch 12000 - 12500
Query next batch 12500 - 13000
Query next batch 13000 - 13500
Query next batch 13500 - 14000
Query next batch 14000 - 14500
Query next batch 14500 - 15000
Query next batch 15000 - 15500
Query next batch 15500 - 16000
...

Query next batch 130500 - 131000
Query next batch 131000 - 131500
Query next batch 131500 - 132000
Query next batch 132000 - 132500
Query next batch 132500 - 133000
Query next batch 133000 - 133500
Query next batch 133500 - 134000
Query next batch 134000 - 134500
Query next batch 134500 - 135000
Query next batch 135000 - 135500
Query next batch 135500 - 136000
Query next batch 136000 - 136500
Query next batch 136500 - 137000
Query next batch 137000 - 137500
Query next batch 137500 - 138000
Query next batch 138000 - 138500
Query next batch 138500 - 139000
Query next batch 139000 - 139500
Query next batch 139500 - 140000
Query next batch 140000 - 140500
Query next batch 140500 - 141000
Query next batch 141000 - 141500
Query next batch 141500 - 142000
Query next batch 142000 - 142500
Query next batch 142500 - 143000
Query next batch 143000 - 143500
Query next batch 143500 - 144000
Query next batch 144000 - 144500
Query next batch 144500 - 145000
Query next batch 145000 - 145500
...

Query next batch 255000 - 255500
Query next batch 255500 - 256000
Query next batch 256000 - 256500
Query next batch 256500 - 257000
Query next batch 257000 - 257500
Query next batch 257500 - 258000
Query next batch 258000 - 258500
Query next batch 258500 - 259000
Query next batch 259000 - 259500
Query next batch 259500 - 260000
Query next batch 260000 - 260500
Query next batch 260500 - 261000
Query next batch 261000 - 261500
Query next batch 261500 - 262000
Query next batch 262000 - 262500
Query next batch 262500 - 263000
Query next batch 263000 - 263500
Query next batch 263500 - 264000
Query next batch 264000 - 264500
Query next batch 264500 - 265000
Query next batch 265000 - 265500
Query next batch 265500 - 266000
Query next batch 266000 - 266500
Query next batch 266500 - 267000
Query next batch 267000 - 267500
Query next batch 267500 - 268000
Query next batch 268000 - 268500
Query next batch 268500 - 269000
Query next batch 269000 - 269500
Query next batch 269500 - 270000
...

The training data should now be ready in `Organisation_Training_data.txt`.
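
To check the result, the file can be read back line by line. The sketch below assumes the format written by `createTrainingData` (`<uri>`, label, and optional abstract, separated by tabs). `read_training_data` is a hypothetical helper, and the demo writes a made-up row to a temporary file instead of touching the real output. It also assumes that labels and abstracts contain no tabs or newlines, which the function above does not enforce.

```python
import codecs
import os
import tempfile


def read_training_data(path):
    """Parse tab-separated training lines into (subject, label, abstract)."""
    rows = []
    with codecs.open(path, "r", "utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            subject = parts[0].strip("<>")  # drop the angle brackets
            label = parts[1]
            abstract = parts[2] if len(parts) > 2 else None
            rows.append((subject, label, abstract))
    return rows


# Demo on a temporary file holding one made-up row.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Example_Training_data.txt")
    with codecs.open(path, "w", "utf-8") as out:
        out.write("<http://dbpedia.org/resource/Example>\tExample\tA made-up abstract.\n")
    rows = read_training_data(path)

print(rows)
```

For the run above, the same helper could be pointed at `Organisation_Training_data.txt`.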