# Data collection
This notebook collects the data used in my project by querying [DBpedia](https://wiki.dbpedia.org/) with SPARQL.

(I am new to SPARQL, so the queries might not be optimal.)

In [15]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

Given a SPARQL query and a label (e.g. *Person*, *Animal*, *City*), the function `getData` queries DBpedia, extracts the abstract from each result, and tags every abstract with the given label. It returns a pandas DataFrame.

In [16]:
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
prefix = """PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX : <http://dbpedia.org/resource/>
    PREFIX dbpedia2: <http://dbpedia.org/property/>
    PREFIX dbpedia: <http://dbpedia.org/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
"""
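def getData(query, label):
    """Run `query` against DBpedia and return a DataFrame of abstracts tagged with `label`."""
    sparql.setQuery(prefix+query)
    results = sparql.query().convert()
    # Each binding is a dict such as {'abstract': {'type': ..., 'value': ...}}
    df = pd.DataFrame.from_dict(results["results"]["bindings"])
    # Keep only the literal text of each abstract
    df['abstract'] = df['abstract'].apply(lambda text: text['value'])
    df['label'] = label
    return df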
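To make the shape of the response that `getData` unpacks more concrete, here is a minimal sketch that flattens a hand-written sample in the SPARQL JSON results format. The sample data and the helper name `bindings_to_rows` are made up for illustration and are not part of the notebook; the sketch uses plain Python so it runs without a network connection.

```python
# Hand-written sample mimicking the SPARQL JSON results format
# (illustrative only, not real DBpedia output).
sample_results = {
    "head": {"vars": ["abstract"]},
    "results": {
        "bindings": [
            {"abstract": {"type": "literal", "xml:lang": "en",
                          "value": "First example abstract."}},
            {"abstract": {"type": "literal", "xml:lang": "en",
                          "value": "Second example abstract."}},
        ]
    },
}


def bindings_to_rows(results, label):
    """Flatten SPARQL JSON bindings into dicts with 'abstract' and 'label' keys."""
    rows = []
    for binding in results["results"]["bindings"]:
        # The literal text sits under the 'value' key of each bound variable
        rows.append({"abstract": binding["abstract"]["value"], "label": label})
    return rows


rows = bindings_to_rows(sample_results, "Person")
print(rows)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`. Unlike the column-based approach in `getData`, this loop also returns an empty list when a query yields no bindings, instead of failing on a missing `abstract` column.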

The queries below compile a dataset of abstracts from three classes: persons born in Sweden, cities in the United States, and animals.

In [17]:
persons = getData("""SELECT ?abstract WHERE {
    ?person dbo:abstract ?abstract .
    ?person a dbo:Person .
    ?person dbo:birthPlace :Sweden .   
    FILTER (lang(?abstract) = 'en')
}
""", 'Person')
cities = getData("""SELECT ?abstract WHERE {
    ?city dbo:abstract ?abstract .
    ?city a dbo:City .
    ?city dbo:country :United_States .
    FILTER (lang(?abstract) = 'en')
}
""", 'City')
animals = getData("""SELECT ?abstract WHERE {
    ?animal dbo:abstract ?abstract .
    ?animal a dbo:Animal .
    FILTER (lang(?abstract) = 'en')
}
""", 'Animal')
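Public SPARQL endpoints usually cap the number of rows returned per query (I believe DBpedia's limit has commonly been 10,000, but treat that figure as an assumption), so the queries above may come back truncated. One possible workaround is to page through the results with `LIMIT` and `OFFSET`. The sketch below shows the idea; `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and the endpoint is replaced by an in-memory stand-in so the example runs offline.

```python
import re


def paginate(query, fetch_page, page_size=10000):
    """Re-run `query` with LIMIT/OFFSET until a page shorter than `page_size` comes back."""
    all_rows = []
    offset = 0
    while True:
        paged_query = f"{query}\nLIMIT {page_size} OFFSET {offset}"
        rows = fetch_page(paged_query)
        all_rows.extend(rows)
        if len(rows) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += page_size
    return all_rows


# Stand-in for the endpoint: 25 fake rows served according to LIMIT/OFFSET.
FAKE_ROWS = [f"abstract {i}" for i in range(25)]


def fake_fetch(paged_query):
    match = re.search(r"LIMIT (\d+) OFFSET (\d+)", paged_query)
    limit, offset = int(match.group(1)), int(match.group(2))
    return FAKE_ROWS[offset:offset + limit]


collected = paginate("SELECT ?abstract WHERE { ... }", fake_fetch, page_size=10)
print(len(collected))
```

In the notebook, `fetch_page` could be a small function that calls `sparql.setQuery(prefix + paged_query)` and returns `sparql.query().convert()["results"]["bindings"]`. SPARQL does not guarantee a stable row order without `ORDER BY`, so adding one to the query is advisable when paging.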

In [18]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([persons, cities, animals])
data.head()

| | abstract | label |
|---|---|---|
| 0 | Bojan Pandžić (born 13 March 1982) is a Swedis... | Person |
| 1 | Johan August Strindberg (/ˈstrɪndbɜːrɡ, ˈstrɪn... | Person |
| 2 | Axel Gustafsson Oxenstierna af Södermöre (Swed... | Person |
| 3 | Bo Hansson (10 April 1943 – 23 April 2010) was... | Person |
| 4 | Nils Daniel Carl Bildt (born 15 July 1949) is ... | Person |


In [19]:
len(data)

24307
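The total row count says nothing about how the rows are split between the three classes, which matters if the dataset is used for classification. A minimal sketch of a per-label count follows; the `labels` list is made-up sample data standing in for the `label` column, not the real distribution.

```python
from collections import Counter

# Made-up labels standing in for data['label'] (not the real distribution)
labels = ["Person", "Person", "City", "Animal", "Person", "City"]

counts = Counter(labels)
for label, n in counts.most_common():
    # Show each class with its absolute count and its share of the dataset
    print(f"{label}: {n} ({n / len(labels):.0%})")
```

On the actual DataFrame, `data['label'].value_counts()` gives the same per-class counts directly.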