# Exploring Wikidata Using SPARQL

The purpose of this notebook is to show how to explore the [Wikidata knowledge graph](https://www.wikidata.org/wiki/Wikidata:Main_Page) using an [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) query language called [SPARQL](https://en.wikipedia.org/wiki/SPARQL).

You can run SPARQL queries directly in the [Wikidata Query Service](https://query.wikidata.org/) (Try it!), but for documenting your journey or developing a case study, Jupyter notebooks may be a more familiar format for many users. This notebook shows you how to get started. **Note:** The Wikidata Query Service can generate code examples in Python and many other host languages, but the generated code is perhaps a little less intuitive than what's described below.

To run this Jupyter notebook, we recommend [PAWS](https://wikitech.wikimedia.org/wiki/PAWS). PAWS (short for "PAWS: A Web Shell") is a Jupyter notebook deployment hosted by Wikimedia.

Begin by installing the requirements:

In [None]:
!pip install -r requirements.txt

There is an excellent article __[Extracting Data from Wikidata Using SPARQL and Python](https://itnext.io/extracting-data-from-wikidata-using-sparql-and-python-59e0037996f)__ by Jelle van Kerkvoorde that explains the mechanics of running SPARQL queries from Python. To get started, we import the `data_extraction` module.

In [None]:
import python.data_extraction as DEX

That module defines the `WikiDataQueryResults` class, which we instantiate by providing a SPARQL query string. Calling `load_as_dataframe()` on the instance runs the query and returns the results as a `pandas` dataframe. And that's it. Let's try our first SPARQL query!
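As background, here is a rough idea of what such a class might do with the server's response. SPARQL endpoints can return results in the W3C SPARQL JSON results format, where each row is a "binding" from variable names to terms. The sketch below flattens a hand-written sample response into a list of plain dictionaries, which is a shape `pandas.DataFrame` accepts. The function name `bindings_to_rows` and the sample data are assumptions for illustration and are not taken from the `data_extraction` module.

```python
import json

# Hand-written sample in the SPARQL JSON results format (not real query output).
SAMPLE_RESPONSE = json.loads("""
{
  "head": {"vars": ["country", "countryLabel"]},
  "results": {"bindings": [
    {"country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q183"},
     "countryLabel": {"type": "literal", "xml:lang": "en", "value": "Germany"}}
  ]}
}
""")


def bindings_to_rows(response):
    """Flatten SPARQL JSON bindings into one {variable: value} dict per row.

    Hypothetical helper: variables that are unbound in a row map to None.
    """
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = bindings_to_rows(SAMPLE_RESPONSE)
print(rows)
```

A list like `rows` can be passed to `pandas.DataFrame(rows)` to obtain a dataframe with one column per query variable.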

We want to retrieve the name, location, and founding date of all cities in the United States.

The Wikidata properties and items used in the query are as follows:

- [`wdt:P31`](https://www.wikidata.org/wiki/Property:P31) - instance of
- [`wdt:P279`](https://www.wikidata.org/wiki/Property:P279) - subclass of
- [`wd:Q515`](https://www.wikidata.org/wiki/Q515) - city
- [`wdt:P17`](https://www.wikidata.org/wiki/Property:P17) - country
- [`wd:Q30`](https://www.wikidata.org/wiki/Q30) - United States of America
- [`wdt:P625`](https://www.wikidata.org/wiki/Property:P625) - coordinate location
- [`wdt:P571`](https://www.wikidata.org/wiki/Property:P571) - inception
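The `wd:` and `wdt:` prefixes in the list above are shorthand for full IRIs: `wd:` identifies entities (items such as `Q515`), while `wdt:` identifies direct ("truthy") property claims. The Wikidata Query Service predefines both, which is why the queries in this notebook need no `PREFIX` declarations. A minimal sketch of the expansion, where the helper name `expand` is made up for illustration:

```python
# Namespaces that the Wikidata Query Service predefines for these two prefixes.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",        # entities, e.g. wd:Q515
    "wdt": "http://www.wikidata.org/prop/direct/",  # direct property claims, e.g. wdt:P31
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'wdt:P31' into its full IRI (hypothetical helper)."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


print(expand("wdt:P31"))  # http://www.wikidata.org/prop/direct/P31
print(expand("wd:Q515"))  # http://www.wikidata.org/entity/Q515
```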

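Obviously, we can expect only cities that are stored in the knowledge graph. Because the location and inception patterns in the query are not optional, only cities that have both values will appear in the results. We will find out how many U.S. cities there are overall in a moment.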
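The first pattern of the query that follows uses the property path `wdt:P31/wdt:P279*`: one "instance of" step followed by zero or more "subclass of" steps. In other words, it matches items that are instances of city itself or of any class that is a (transitive) subclass of city. The sketch below mimics that matching on a tiny in-memory graph. All names and data in it are invented for illustration and are not real Wikidata items.

```python
# Toy graph, invented for illustration only.
INSTANCE_OF = {            # plays the role of wdt:P31
    "Springfield": {"big city"},
    "Lakeside": {"city"},
    "Rivertown": {"village"},
}
SUBCLASS_OF = {            # plays the role of wdt:P279
    "big city": {"city"},
    "city": {"human settlement"},
    "village": {"human settlement"},
}


def classes_reachable_from(cls):
    """Return cls plus every class reachable via zero or more subclass-of steps (P279*)."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def matches_path(item, target):
    """True if item --instance of--> some class --subclass of*--> target."""
    return any(target in classes_reachable_from(cls) for cls in INSTANCE_OF.get(item, ()))


for name in sorted(INSTANCE_OF):
    print(name, matches_path(name, "city"))
```

Here "Lakeside" matches with zero subclass steps, "Springfield" matches with one, and "Rivertown" does not match because a village is not a subclass of city in this toy graph.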

In [None]:
query = '''
SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date
WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515.  # Instances of city or of any of its subclasses
  ?city wdt:P17 wd:Q30.             # Located in the United States
  ?city wdt:P625 ?location.         # Retrieve the city's location
  ?city wdt:P571 ?founding_date.    # Retrieve the city's founding date (inception)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
'''
DEX.WikiDataQueryResults(query).load_as_dataframe().head()

Since there are potentially thousands of cities, we just peek at the dataframe head. We could achieve the same outcome in SPARQL with the `LIMIT` keyword, but there is a fundamental difference between the two approaches. Without `LIMIT`, the query returns all (thousands of!) cities, and we reduce the number only on the client side by peeking at the head of the dataframe. With `LIMIT 5`, the server returns at most five results, which are then loaded into the dataframe. The reduction happens on the server side, which is much more efficient, especially if we don't know how many results to expect.

In [None]:
query = '''
SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date
WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515.  # Instances of city or of any of its subclasses
  ?city wdt:P17 wd:Q30.             # Located in the United States
  ?city wdt:P625 ?location.         # Retrieve the city's location
  ?city wdt:P571 ?founding_date.    # Retrieve the city's founding date (inception)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
'''
DEX.WikiDataQueryResults(query).load_as_dataframe()
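The `?location` values returned for `wdt:P625` are coordinate literals in Well-Known Text (WKT) form, written as `Point(longitude latitude)`, with the longitude first. If you need numeric coordinates, for example for plotting, a small parser helps. The following is a minimal sketch: the helper name `parse_wkt_point` is an assumption, the sample value is illustrative, and the sketch handles only plain `Point(...)` literals.

```python
import re

# Matches plain WKT points such as "Point(13.4 52.5)".
_POINT_PATTERN = re.compile(r"^Point\(\s*(\S+)\s+(\S+)\s*\)$", re.IGNORECASE)


def parse_wkt_point(wkt):
    """Return (latitude, longitude) from a 'Point(longitude latitude)' literal.

    Hypothetical helper; raises ValueError for anything that is not a plain point.
    """
    match = _POINT_PATTERN.match(wkt.strip())
    if match is None:
        raise ValueError(f"not a plain WKT point: {wkt!r}")
    longitude, latitude = (float(part) for part in match.groups())
    return latitude, longitude


print(parse_wkt_point("Point(13.4 52.5)"))  # (52.5, 13.4)
```

Returning latitude first is a design choice made here because many mapping libraries expect that order; swap the return values if your tooling expects longitude first.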

What about cities in a different country, say, Germany? The triple pattern that restricted the results to cities in the United States is `?city wdt:P17 wd:Q30.`, with the Wikidata item [`Q30`](https://www.wikidata.org/wiki/Q30) referring to the United States. To determine the Wikidata item for the country 'Germany', we look for countries with a matching label.

In [None]:
query = '''
SELECT DISTINCT ?country ?countryLabel  # Don't return duplicate results
WHERE {
  ?country wdt:P31 wd:Q6256.            # We are looking for instances of country
  ?country rdfs:label ?label.           # We want to examine the country label
  FILTER(CONTAINS(?label, "Germany"))   # We expect the label to contain 'Germany'
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
'''
DEX.WikiDataQueryResults(query).load_as_dataframe()
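The query service identifies each item by its full IRI, for example `http://www.wikidata.org/entity/Q183`, so depending on how the results are loaded, the `?country` column may show the IRI rather than the bare identifier. A minimal sketch for extracting the identifier, where the helper name `entity_id` is made up for illustration:

```python
ENTITY_NAMESPACE = "http://www.wikidata.org/entity/"


def entity_id(iri):
    """Return the bare identifier (e.g. 'Q183') from a Wikidata entity IRI.

    Hypothetical helper; raises ValueError for IRIs outside the entity namespace.
    """
    if not iri.startswith(ENTITY_NAMESPACE):
        raise ValueError(f"not a Wikidata entity IRI: {iri!r}")
    return iri[len(ENTITY_NAMESPACE):]


print(entity_id("http://www.wikidata.org/entity/Q183"))  # Q183
```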

`Q183` is the answer, and we can reformulate our query for cities in Germany as follows:

In [None]:
query = '''
SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date
WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515.  # Instances of city or of any of its subclasses
  ?city wdt:P17 wd:Q183.            # Located in Germany
  ?city wdt:P625 ?location.         # Retrieve the city's location
  ?city wdt:P571 ?founding_date.    # Retrieve the city's founding date (inception)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
LIMIT 5
'''
DEX.WikiDataQueryResults(query).load_as_dataframe()
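The queries for the United States and Germany differ only in the country item and the label language, so the query text can be generated from a template. The sketch below is one way to do that. The function `build_city_query`, its parameters, and the validation rules are assumptions for illustration and are not part of the `data_extraction` module.

```python
import re

CITY_QUERY_TEMPLATE = '''
SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date
WHERE {{
  ?city wdt:P31/wdt:P279* wd:Q515.   # Instances of city or of any of its subclasses
  ?city wdt:P17 wd:{country}.        # Located in the given country
  ?city wdt:P625 ?location.          # Retrieve the city's location
  ?city wdt:P571 ?founding_date.     # Retrieve the city's founding date (inception)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
LIMIT {limit}
'''


def build_city_query(country, language="en", limit=5):
    """Build the city query for a country item such as 'Q183' (hypothetical helper)."""
    # Validate inputs so that arbitrary text cannot be spliced into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", country):
        raise ValueError(f"not a Wikidata item identifier: {country!r}")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", language):
        raise ValueError(f"unexpected language code: {language!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return CITY_QUERY_TEMPLATE.format(country=country, language=language, limit=limit)


print(build_city_query("Q183", language="de"))
```

The resulting string can be passed to `DEX.WikiDataQueryResults(...)` in the same way as the hand-written queries above.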

Returning to our earlier question: how many U.S. cities are stored in the Wikidata knowledge graph? Let's find out!

In [None]:
query = '''
SELECT (COUNT(DISTINCT ?city) AS ?cityCount)
WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515.  # Instances of city or of any of its subclasses
  ?city wdt:P17 wd:Q30.             # Located in the United States
}
'''
DEX.WikiDataQueryResults(query).load_as_dataframe()
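In the SPARQL JSON results format, an aggregate such as `COUNT` is a typed literal: the value is transmitted as a string together with a datatype IRI such as `xsd:integer`. Whether the dataframe column ends up numeric depends on how the results are loaded, so an explicit conversion may be needed. A minimal sketch, in which the helper name `to_python` is an assumption and the sample count is a placeholder rather than the real number of U.S. cities:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Placeholder binding in the SPARQL JSON results format; 42 is not a real result.
SAMPLE_BINDING = {
    "cityCount": {"type": "literal", "datatype": XSD + "integer", "value": "42"}
}


def to_python(term):
    """Convert an xsd:integer literal to int; return other values unchanged as strings."""
    if term.get("datatype") == XSD + "integer":
        return int(term["value"])
    return term["value"]


count = to_python(SAMPLE_BINDING["cityCount"])
print(count, type(count).__name__)  # 42 int
```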

While SPARQL is not a big language, it takes some time and practice to create complex queries with ease. There are plenty of tools to help you. For one, you can have a conversation with your favorite chatbot. If writing queries is not yet your thing, check out the [Wikidata Query Builder](https://query.wikidata.org/querybuilder/?uselang=en). To get a sense of what's possible in SPARQL, and as a great source of inspiration, don't miss the [Wikidata:SPARQL query service/queries/examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples/en) page. Good luck and happy exploring!