# Data-Driven Research Assignment 3: Linked Open Data
This notebook contains the third, collaborative, graded assignment of the 2023 Data-Driven Research course. In this assignment you'll use Linked Open Data tools in order to search for information on the Web in a more thorough way than with Google.

To complete the assignment, complete the highlighted **Part 1, Part 2, Part 3 and Part 4**.

This is a collaborative assignment. In the text cell below, please include all the names of your group members.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Authors of this answer:** Leonards Leimanis

# 1. Introduction

In this exercise, you'll experiment with a very explicit approach to semantics, and experience how powerful a little semantics can be when searching.

You'll use the DBpedia knowledge base, essentially the content of Wikipedia in machine-readable form, and explore what explicit semantics enable using [SPARQL](http://en.wikipedia.org/wiki/SPARQL) queries, which let you query and search through the Web of Linked Open Data. We will use the Python SPARQLWrapper library to access the DBpedia endpoint.

## 1.1. RDF
Linked Open Data consists of a huge number of small facts in the form of RDF triples, <*Subject*, *Predicate*, *Object*>, each of which consists of a pair of concepts or entities and a relationship between them, such as `Rembrandt birthPlace Leiden`. We work specifically with DBpedia's machine-readable information from Wikipedia, in this case http://dbpedia.org/page/Rembrandt:
`dbr:Rembrandt dbo:birthPlace dbr:Leiden`

It's hard to guess upfront how information is encoded in DBpedia, and linked data is all about having unique identifiers for every entity or concept. The best way is to look at examples, and to use Google to kickstart the search for a particular DBpedia entity.

For example, Google "dbpedia rembrandt", which will give you a neat page with DBpedia facts about him (https://dbpedia.org/page/Rembrandt). If you look at the link behind "About: Rembrandt", you find that the unique identifier of this entity is http://dbpedia.org/resource/Rembrandt, and entering this link/ID in your browser will generate the overview page.

Inside DBpedia, you can use the shorthand `dbr:Rembrandt` (which is defined to unfold to http://dbpedia.org/resource/Rembrandt) as the unique ID, but it also works if you use the long URL! Recall that DBpedia does not have pages, only countless RDF triples, and the overview page is just the output of all triples <`dbr:Rembrandt`, *Predicate*, *Object*>, which also lets you see what further relations and concepts to explore.
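
As a rough illustration of how a shorthand such as `dbr:Rembrandt` unfolds into a full URI, here is a minimal Python sketch. The function name `expand` and the `PREFIXES` table are made up for this example; the namespace URIs are the ones DBpedia and Dublin Core use for these prefixes.

```python
# Prefix table for the shorthands that appear in this notebook.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
    "dbc": "http://dbpedia.org/resource/Category:",
    "dct": "http://purl.org/dc/terms/",
}


def expand(name):
    """Expand a prefixed name such as 'dbr:Rembrandt' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Already a full URI (or an unknown prefix): leave it unchanged.
    return name


print(expand("dbr:Rembrandt"))  # http://dbpedia.org/resource/Rembrandt
```

The SPARQL endpoint does this expansion for you; the sketch only makes explicit that the short and the long form name the same entity.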

## 1.2. SPARQL
SPARQL is the designated query language for RDF, modelled on the SQL query language for relational databases. It uses natural-language words like SELECT, DISTINCT, WHERE, ORDER BY, and LIMIT in a very specific way. We introduce it by example, but feel free to backtrack to one of the many tutorials and introductions on the web.

# 2. Working with SPARQL queries

There is a large database of facts derived from Wikipedia, called [DBpedia](http://dbpedia.org/About), which contains information about everything and the rest. Let's look at American films in the first part of this assignment. You will access a so-called SPARQL endpoint for DBpedia through Python. Each film in the category American Films has a fact in DBpedia stating that it has a relationship with that category, namely the `dct:subject` property. That is, the film Pulp Fiction has an RDF triple
`dbr:Pulp_Fiction dct:subject category:American_films`

If you enter the name of the entity `Pulp_Fiction`, that is `dbr:Pulp_Fiction`, which is shorthand for http://dbpedia.org/resource/Pulp_Fiction, your browser will generate a page http://dbpedia.org/page/Pulp_Fiction with a selection of facts <`dbr:Pulp_Fiction`, ?, ?> in the database.
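
To make the pattern <`dbr:Pulp_Fiction`, ?, ?> concrete, the following sketch matches triple patterns against a tiny in-memory list. It is not how DBpedia is implemented. The `match` function and the `TRIPLES` list are hypothetical, and the triples are hand-written for illustration rather than retrieved from the endpoint.

```python
# A tiny hand-written triple store (illustrative data, not queried from DBpedia).
TRIPLES = [
    ("dbr:Rembrandt", "dbo:birthPlace", "dbr:Leiden"),
    ("dbr:Pulp_Fiction", "dbo:director", "dbr:Quentin_Tarantino"),
    ("dbr:Pulp_Fiction", "dbo:starring", "dbr:John_Travolta"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None plays the role of '?'."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything known about Pulp Fiction: the pattern <dbr:Pulp_Fiction, ?, ?>.
for triple in match(TRIPLES, s="dbr:Pulp_Fiction"):
    print(triple)
```

Fixing the subject and leaving predicate and object open is what the overview page does; fixing other positions gives other views of the same data.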

Now, let's access this database through Python. We will need the SPARQLWrapper, which does not come with Google Colab by default, so we should install it:

In [2]:
#!pip install sparqlwrapper

Then, we are able to import it and choose the DBpedia endpoint to access that database:

In [3]:
from SPARQLWrapper import SPARQLWrapper
import pandas as pd
from io import StringIO
from IPython.display import display, HTML

# Specify the DBPedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

# Specify that we want results in the CSV format
sparql.setReturnFormat('csv')

Now, we can try some SPARQL queries. First we set the query as a multi-line string, and then we use the query function to actually run the query.

Note that in the query below, ?film is a variable name, just like a Python variable - we could rename it to anything else, and the result would be the same.

In [4]:
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?film 
    WHERE {?film dct:subject dbc:2010s_American_films} 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
print(result[:200])

"film"
"http://dbpedia.org/resource/Cabin_Fever:_Patient_Zero"
"http://dbpedia.org/resource/Cabin_Fever_(2016_film)"
"http://dbpedia.org/resource/Caesar_and_Otto's_Deadly_Xmas"
"http://dbpedia.org/res


The result variable now contains a list in CSV format of 1000 links to 2010s American films on DBpedia (we limited the number of results to 1000, but it can be increased). To make it easier to visualize, let's turn it into a pandas dataframe (this will become more useful as we retrieve more properties):

In [5]:
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film
0,http://dbpedia.org/resource/Cabin_Fever:_Patie...
1,http://dbpedia.org/resource/Cabin_Fever_(2016_...
2,http://dbpedia.org/resource/Caesar_and_Otto's_...
3,http://dbpedia.org/resource/Café_(2010_film)
4,http://dbpedia.org/resource/Café_Society_(2016...
...,...
995,http://dbpedia.org/resource/Johnny_English_Str...
996,http://dbpedia.org/resource/Johnny_Frank_Garre...
997,http://dbpedia.org/resource/Joint_Body
998,http://dbpedia.org/resource/Jojo_Rabbit


We see that the entities, e.g. `dbc:2010s_American_films`, are URIs referring to a unique entity in the linked open data cloud. So the name is a unique ID, in this case shorthand for http://dbpedia.org/resource/Category:2010s_American_films.

Take a closer look at that SPARQL query. Can you figure out how it works? If you're familiar with structured query languages like SQL, you'll recognize many aspects. If this is completely new to you, there are still many recognizable words to help you interpret this query. Let's get back to this later.
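
The set-query, run, decode, parse sequence from the cells above recurs throughout this notebook, so it could be wrapped in a small helper. The sketch below is one possible shape. The names `parse_csv_result` and `run_query` are hypothetical, and it uses the standard-library `csv` module instead of pandas so that the parsing step can be tried without a network connection.

```python
import csv
from io import StringIO


def parse_csv_result(raw):
    """Turn the CSV payload of a SPARQL response into a list of row dicts."""
    # In this notebook convert() returns bytes, so decode when necessary.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return list(csv.DictReader(StringIO(text)))


def run_query(endpoint, query):
    """Hypothetical helper: run `query` on a SPARQLWrapper-style `endpoint`.

    Assumes `endpoint` was configured like `sparql` above (CSV return format).
    """
    endpoint.setQuery(query)
    return parse_csv_result(endpoint.query().convert())


# Offline demonstration on a hand-written payload shaped like the output above.
sample = b'"film"\n"http://dbpedia.org/resource/Jojo_Rabbit"\n'
print(parse_csv_result(sample))
```

With the `sparql` object defined earlier, `run_query(sparql, "...")` would return the rows, and `pd.DataFrame(rows)` would turn them into a dataframe. Note that every value stays a string in this version, whereas `pd.read_csv` infers numeric columns.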

To see the power of this form of searching, let's try a slightly more complex query, where you add a second RDF-like condition to the query, separated from the first by a dot `.` representing a `join` or `AND`:

In [6]:
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?film ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Brando_Eaton
1,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Ryan_Donowho
2,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Lydia_Hearst
3,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Jillian_Murray
4,http://dbpedia.org/resource/Cabin_Fever:_Patie...,http://dbpedia.org/resource/Sean_Astin
...,...,...
995,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/J._P._Manoux
996,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Terry_Crews
997,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Ashley_Tisdale
998,http://dbpedia.org/resource/Scary_Movie_5,http://dbpedia.org/resource/Jerry_O'Connell


You should now get a table with American films and the actors that have played in them. This list is not complete. The content of DBpedia is based on the Infoboxes of Wikipedia pages, which have a standard format. The knowledge in DBpedia is as good as the encyclopedic information on Wikipedia. Not all actors per film are listed and not all films and actors have their own Wikipedia article.
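
The effect of the dot can be mimicked in plain Python: both conditions share the variable `?film`, so only films that satisfy both survive. The sketch below uses placeholder identifiers (`film:A`, `actor:X`, and so on) that are invented for this example, and the function name `films_with_actors` is likewise hypothetical.

```python
# Placeholder triples, invented for illustration.
TRIPLES = [
    ("film:A", "dct:subject", "dbc:2010s_American_films"),
    ("film:A", "dbo:starring", "actor:X"),
    ("film:A", "dbo:starring", "actor:Y"),
    ("film:B", "dbo:starring", "actor:X"),                  # not in the category
    ("film:C", "dct:subject", "dbc:2010s_American_films"),  # no actors listed
]


def films_with_actors(triples):
    """Join the two patterns on ?film, like the dot in the WHERE clause."""
    in_category = {
        s for s, p, o in triples
        if p == "dct:subject" and o == "dbc:2010s_American_films"
    }
    return [
        (s, o) for s, p, o in triples
        if p == "dbo:starring" and s in in_category
    ]


print(films_with_actors(TRIPLES))
# [('film:A', 'actor:X'), ('film:A', 'actor:Y')]
```

Note that `film:C` disappears from the result because it has no `dbo:starring` triple. This is the same mechanism by which films with incomplete infoboxes drop out of the table above.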

Try a few more SPARQL queries, given below. See if you can figure out beforehand what results they will give:

In [7]:
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT DISTINCT ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,actor
0,http://dbpedia.org/resource/Brando_Eaton
1,http://dbpedia.org/resource/Ryan_Donowho
2,http://dbpedia.org/resource/Lydia_Hearst
3,http://dbpedia.org/resource/Jillian_Murray
4,http://dbpedia.org/resource/Sean_Astin
...,...
995,http://dbpedia.org/resource/Chris_Zylka
996,http://dbpedia.org/resource/Louisa_Krause
997,http://dbpedia.org/resource/Dianna_Agron
998,http://dbpedia.org/resource/Scott_Speedman


In [8]:
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT DISTINCT COUNT(?actor) ?actor
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    ORDER BY DESC(COUNT(?actor))
    LIMIT 1000
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0,actor
0,49,http://dbpedia.org/resource/James_Franco
1,44,http://dbpedia.org/resource/Danny_Trejo
2,34,http://dbpedia.org/resource/Nicolas_Cage
3,31,http://dbpedia.org/resource/Danny_Glover
4,31,http://dbpedia.org/resource/Eric_Roberts
...,...,...
995,7,http://dbpedia.org/resource/Cody_Horn
996,7,http://dbpedia.org/resource/Lawrence_Michael_L...
997,7,http://dbpedia.org/resource/Sean_Paul_Lockhart
998,7,http://dbpedia.org/resource/Joel_David_Moore


**Part 1**: Adapt the SPARQL query above to count per film how many actors it has. This requires a minimal change to the query.   

In [35]:
# Your adapted query here
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?film (COUNT(?actor) AS ?actorcount)
    WHERE {?film dct:subject dbc:2010s_American_films . ?film dbo:starring ?actor } 
    GROUP BY ?film
    ORDER BY DESC(?actorcount)
    LIMIT 1000
    
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actorcount
0,http://dbpedia.org/resource/Holidays_(2016_film),36
1,http://dbpedia.org/resource/Movie_43,22
2,http://dbpedia.org/resource/Isle_of_Dogs_(film),21
3,http://dbpedia.org/resource/I_Am_Comic,21
4,http://dbpedia.org/resource/Muscle_Shoals_(film),20
...,...,...
995,http://dbpedia.org/resource/Motherless_Brookly...,7
996,http://dbpedia.org/resource/The_Last_Rescue,7
997,http://dbpedia.org/resource/The_Pretenders_(20...,7
998,http://dbpedia.org/resource/The_Thinning,7
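
As a cross-check of what `GROUP BY ?film` with `COUNT(?actor)` and `ORDER BY DESC(...)` computes, the same aggregation can be sketched in Python over (film, actor) pairs. The rows below are placeholders invented for the example, and `actors_per_film` is a hypothetical name.

```python
from collections import Counter

# Placeholder (film, actor) rows, shaped like the ?film ?actor table above.
ROWS = [
    ("film:A", "actor:X"),
    ("film:A", "actor:Y"),
    ("film:B", "actor:X"),
]


def actors_per_film(rows):
    """Count rows per film and sort by descending count."""
    counts = Counter(film for film, _actor in rows)
    return counts.most_common()


print(actors_per_film(ROWS))  # [('film:A', 2), ('film:B', 1)]
```

Grouping on the first column instead of the second is the only difference from counting films per actor, which mirrors the "minimal change" the exercise asks for.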


# 3. Six Degrees of Kevin Bacon

There is a game related to the notion of [Six Degrees of Separation](http://en.wikipedia.org/wiki/Six_degrees_of_separation). This involves the network of actors who have played in a film together with Kevin Bacon, and actors who have played with those actors, etc. The goal is to figure out the shortest path, between actors who have co-starred in a film, between any Hollywood actor and the actor Kevin Bacon. One of the research aspects is whether the network of actors in Hollywood form a so-called [Small World](http://en.wikipedia.org/wiki/Small-world_experiment) network.
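
Finding the shortest chain of co-starring links is a breadth-first search over the co-star network. The sketch below shows the idea on a small made-up graph. The actor names other than Kevin Bacon are placeholders, and `bacon_number` is a hypothetical function, not part of any library.

```python
from collections import deque

# Made-up co-star graph: each actor maps to the set of actors they co-starred with.
CO_STARS = {
    "Kevin_Bacon": {"Actor_A", "Actor_B"},
    "Actor_A": {"Kevin_Bacon", "Actor_C"},
    "Actor_B": {"Kevin_Bacon"},
    "Actor_C": {"Actor_A", "Actor_D"},
    "Actor_D": {"Actor_C"},
    "Actor_E": set(),  # no recorded co-stars
}


def bacon_number(graph, start, target="Kevin_Bacon"):
    """Breadth-first search: number of co-star links from start to target, or None."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for other in graph.get(actor, ()):
            if other == target:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None  # not connected to the target


print(bacon_number(CO_STARS, "Actor_D"))  # 3
print(bacon_number(CO_STARS, "Actor_E"))  # None
```

The SPARQL queries in the steps below explore the same network one step at a time, by adding one more pair of `dbo:starring` conditions per step.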

**Part 2** (some questions in the steps below)

Steps:

1. Let's first query:

In [36]:
sparql.setQuery("""
    SELECT ?film
    WHERE {?film dbo:starring dbr:Kevin_Bacon}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film
0,http://dbpedia.org/resource/Queens_Logic
1,http://dbpedia.org/resource/Quicksilver_(film)
2,http://dbpedia.org/resource/Murder_in_the_Firs...
3,http://dbpedia.org/resource/Beauty_Shop
4,http://dbpedia.org/resource/Beverly_Hills_Cop:...
...,...
64,http://dbpedia.org/resource/R.I.P.D.
65,http://dbpedia.org/resource/Rails_&_Ties
66,http://dbpedia.org/resource/She's_Having_a_Baby
67,http://dbpedia.org/resource/X-Men:_First_Class


2. How many films are in the list? We can count using a COUNT statement:

In [37]:
sparql.setQuery("""
    SELECT COUNT(?film)
    WHERE {?film dbo:starring dbr:Kevin_Bacon}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,69


3. Is this list complete? Compare this number with, for instance, the number of films listed on Kevin Bacon's IMDB page.

This list is not complete: his IMDB page already states that he has starred in 79 movies, of which a few are still to be released.

4. To see the power of this form of searching, let's gradually make this into a more complex query. Let's ask for the list of actors co-starring with Kevin Bacon. You can add a second RDF-like condition to the query, separated from the first by a dot:

In [38]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor} 
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Chloe_Webb
...,...,...
392,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Kevin_Bacon
393,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Michael_Fassbender
394,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Oliver_Platt
395,http://dbpedia.org/resource/You_Should_Have_Left,http://dbpedia.org/resource/Amanda_Seyfried


You should now get a table with films that Kevin Bacon played in, and the actors who played with him in those films. This list also includes Kevin Bacon in each of those films.

5. To remove Kevin Bacon himself from the list of co-actors, you can use a Regular Expression (as discussed in [Coding the Humanities](https://github.com/bloemj/2023-coding-the-humanities/blob/main/notebooks/2_Text.ipynb)) to remove any actor containing the name `Kevin Bacon`, like this:

In [39]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER (!regex(?actor, "Kevin_Bacon"))} 
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Chloe_Webb
...,...,...
323,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/January_Jones
324,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Jennifer_Lawrence
325,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Michael_Fassbender
326,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Oliver_Platt


Or even better, using the semantic entity name directly:

In [40]:
sparql.setQuery("""
    SELECT ?film ?actor
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER ( ?actor != dbr:Kevin_Bacon )}
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film,actor
0,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Linda_Fiorentino
1,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tom_Waits
2,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Tony_Spiridakis
3,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Jamie_Lee_Curtis
4,http://dbpedia.org/resource/Queens_Logic,http://dbpedia.org/resource/Chloe_Webb
...,...,...
323,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/January_Jones
324,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Jennifer_Lawrence
325,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Michael_Fassbender
326,http://dbpedia.org/resource/X-Men:_First_Class,http://dbpedia.org/resource/Oliver_Platt
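
Both filters give tables ending at the same row index here, so the difference does not show up in this data. A small Python sketch illustrates why comparing identifiers is still the safer habit: a substring match also removes any other entity whose URI happens to contain the same text. The namesake URI below is a hypothetical example added for illustration.

```python
import re

KEVIN = "http://dbpedia.org/resource/Kevin_Bacon"

ACTORS = [
    "http://dbpedia.org/resource/Kevin_Bacon",
    "http://dbpedia.org/resource/Kevin_Bacon_(producer)",  # hypothetical namesake
    "http://dbpedia.org/resource/Tom_Waits",
]

# Like FILTER (!regex(?actor, "Kevin_Bacon")): drops every URI containing the text.
by_regex = [a for a in ACTORS if not re.search("Kevin_Bacon", a)]

# Like FILTER (?actor != dbr:Kevin_Bacon): drops exactly one entity.
by_identity = [a for a in ACTORS if a != KEVIN]

print(len(by_regex))     # 1
print(len(by_identity))  # 2
```

As a side note, stricter SPARQL engines may expect `regex(str(?actor), ...)`, because `?actor` is a URI rather than a string. That is a remark about the standard, not something this notebook tests.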


6. To get a list of only the actors, remove `?film` from the selection. Use `SELECT DISTINCT` if needed to avoid duplicates.  Now can you get a count of the number of actors in the list?

In [67]:
sparql.setQuery("""
    SELECT (COUNT(DISTINCT ?actor) AS ?actorcount)
    WHERE { ?film dbo:starring dbr:Kevin_Bacon . ?film dbo:starring ?actor .
            FILTER ( ?actor != dbr:Kevin_Bacon )}
    
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,actorcount
0,297


7. We'll expand the SPARQL query to get actors who co-starred with actors who co-starred with Kevin Bacon, i.e. actors who are two steps away from Kevin Bacon. First get all films that the co-stars of Kevin Bacon played in:

In [68]:
sparql.setQuery("""
    SELECT DISTINCT ?film1 ?actor1 ?film2 
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1 
               . ?film2 dbo:starring ?actor1 .
               FILTER ( ?actor1 != dbr:Kevin_Bacon )  }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film1,actor1,film2
0,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/The_Last_Shot
1,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Things_You_Can_Tel...
2,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Brothers_&_Sisters...
3,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/A_Midsummer_Night'...
4,http://dbpedia.org/resource/Telling_Lies_in_Am...,http://dbpedia.org/resource/Calista_Flockhart,http://dbpedia.org/resource/Ally_McBeal
...,...,...,...
9995,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/Afterburn_(film)
9996,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/DC_Showcase:_Jonah...
9997,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/F9_(film)
9998,http://dbpedia.org/resource/JFK_(film),http://dbpedia.org/resource/Michael_Rooker,http://dbpedia.org/resource/Fantasy_Island_(film)


8. Next we add the co-stars of the co-stars of Kevin Bacon as `?actor2`:

In [69]:
sparql.setQuery("""
    SELECT ?film1 ?actor1 ?film2 ?actor2 
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1 
           . ?film2 dbo:starring ?actor1 . ?film2 dbo:starring ?actor2 .
               FILTER (?actor1 != dbr:Kevin_Bacon && ?actor2 != dbr:Kevin_Bacon ) }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,film1,actor1,film2,actor2
0,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Rodney_Dangerfield
1,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Bill_Murray
2,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Ted_Knight
3,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Chevy_Chase
4,http://dbpedia.org/resource/Wild_Things_(film)...,http://dbpedia.org/resource/Bill_Murray,http://dbpedia.org/resource/Caddyshack,http://dbpedia.org/resource/Michael_O'Keefe
...,...,...,...,...
9995,http://dbpedia.org/resource/Black_Mass_(film),http://dbpedia.org/resource/Benedict_Cumberbatch,http://dbpedia.org/resource/12_Years_a_Slave_(...,http://dbpedia.org/resource/Brad_Pitt
9996,http://dbpedia.org/resource/Black_Mass_(film),http://dbpedia.org/resource/Benedict_Cumberbatch,http://dbpedia.org/resource/12_Years_a_Slave_(...,http://dbpedia.org/resource/Paul_Dano
9997,http://dbpedia.org/resource/Black_Mass_(film),http://dbpedia.org/resource/Benedict_Cumberbatch,http://dbpedia.org/resource/12_Years_a_Slave_(...,http://dbpedia.org/resource/Paul_Giamatti
9998,http://dbpedia.org/resource/Black_Mass_(film),http://dbpedia.org/resource/Benedict_Cumberbatch,http://dbpedia.org/resource/12_Years_a_Slave_(...,http://dbpedia.org/resource/Garret_Dillahunt


9. How many actors are within two steps of Kevin Bacon?  Are there no duplicates? Note the difference between `COUNT(DISTINCT ?film)` and `COUNT(?film)`.

In [73]:
sparql.setQuery("""
    SELECT COUNT (DISTINCT ?actor2) 
    WHERE { ?film1 dbo:starring dbr:Kevin_Bacon . ?film1 dbo:starring ?actor1 
           . ?film2 dbo:starring ?actor1 . ?film2 dbo:starring ?actor2 .
               FILTER (?actor1 != dbr:Kevin_Bacon && ?actor2 != dbr:Kevin_Bacon ) }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,11962


There are 11962 distinct actors within two steps of Kevin Bacon. The raw result contains many duplicates: without DISTINCT the count is around 70000.
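
The gap between the two counts comes from the same actor being reachable along several film paths. The sketch below reproduces the effect on a few placeholder rows shaped like the `?film1 ?actor1 ?film2 ?actor2` table from step 8. All identifiers and both function names are invented for the example.

```python
# Placeholder rows: (film1, actor1, film2, actor2).
ROWS = [
    ("film:1", "actor:A", "film:2", "actor:C"),
    ("film:1", "actor:B", "film:3", "actor:C"),  # actor:C reached a second way
    ("film:1", "actor:A", "film:2", "actor:D"),
]


def count_actor2(rows):
    """Like COUNT(?actor2): one count per result row."""
    return len([row[3] for row in rows])


def count_distinct_actor2(rows):
    """Like COUNT(DISTINCT ?actor2): one count per different actor."""
    return len({row[3] for row in rows})


print(count_actor2(ROWS))           # 3
print(count_distinct_actor2(ROWS))  # 2
```

The step 8 output also contains a row where `?actor2` equals `?actor1` (Bill Murray), so the distinct count covers direct co-stars as well as actors who are exactly two steps away.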

# 4. Wanderlust

DBpedia is great to wander around, just like browsing through Wikipedia, but with powerful aggregation tools at your fingertips. Follow this walk, and make your own walks.

## Example Walk

1. Starting is always hard, so let's start with Google "dbpedia rembrandt", as we did in the lecture.

This gives you quite some info, and reveals that Rembrandt is a DBpedia resource, hence `dbr:Rembrandt` (shorthand for http://dbpedia.org/resource/Rembrandt) is the unique ID inside DBpedia.

2. Let's see what information there is with `dbr:Rembrandt`

In [74]:
sparql.setQuery("""
    SELECT ?p ?o WHERE { dbr:Rembrandt ?p ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(20)  # Show the first 20 results

Unnamed: 0,p,o
0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2002/07/owl#Thing
1,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://xmlns.com/foaf/0.1/Person
2,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://dbpedia.org/ontology/Person
3,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.ontologydesignpatterns.org/ont/dul/...
4,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q19088
5,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q215627
6,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q483501
7,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q5
8,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.wikidata.org/entity/Q729
9,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://dbpedia.org/class/yago/WikicatArtistsFr...


3. The list is partial, but one relation to explore could be the "type" of entity.

In [75]:
sparql.setQuery("""
    SELECT ?o WHERE { dbr:Rembrandt rdf:type ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(30) #Show the first 30 results

Unnamed: 0,o
0,http://www.w3.org/2002/07/owl#Thing
1,http://xmlns.com/foaf/0.1/Person
2,http://dbpedia.org/ontology/Person
3,http://www.ontologydesignpatterns.org/ont/dul/...
4,http://www.wikidata.org/entity/Q19088
5,http://www.wikidata.org/entity/Q215627
6,http://www.wikidata.org/entity/Q483501
7,http://www.wikidata.org/entity/Q5
8,http://www.wikidata.org/entity/Q729
9,http://dbpedia.org/class/yago/WikicatArtistsFr...


4. That's a lot, but you probably already knew a lot about him, so let's move to another entity that is well represented in Wikipedia.

In [76]:
sparql.setQuery("""
    SELECT ?o WHERE { dbr:Darth_Vader rdf:type ?o }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df.head(30) #Show the first 30 results

Unnamed: 0,o
0,http://www.w3.org/2002/07/owl#Thing
1,http://www.ontologydesignpatterns.org/ont/dul/...
2,http://dbpedia.org/ontology/Agent
3,http://www.wikidata.org/entity/Q24229398
4,http://www.wikidata.org/entity/Q95074
5,http://dbpedia.org/class/yago/Amputee109789566
6,http://dbpedia.org/class/yago/Assassin109813696
7,http://dbpedia.org/class/yago/Aviator109826204
8,http://dbpedia.org/class/yago/BadPerson109831962
9,http://dbpedia.org/class/yago/CausalAgent10000...


5. With so many things, what is his actual occupation?

In [87]:
sparql.setQuery("""
    SELECT ?occupation WHERE { dbr:Darth_Vader dbo:occupation ?occupation }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,occupation
0,http://dbpedia.org/resource/Sith
1,http://dbpedia.org/resource/Slave
2,http://dbpedia.org/resource/Jedi


6. Wow, but who else then is a Jedi?

In [78]:
sparql.setQuery("""
    SELECT ?person WHERE { ?person dbo:occupation dbr:Jedi }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person
0,http://dbpedia.org/resource/Rey_(Star_Wars)
1,http://dbpedia.org/resource/Mara_Jade
2,http://dbpedia.org/resource/Qui-Gon_Jinn
3,http://dbpedia.org/resource/Count_Dooku
4,http://dbpedia.org/resource/Quinlan_Vos
5,http://dbpedia.org/resource/General_Grievous
6,http://dbpedia.org/resource/Luke_Skywalker
7,http://dbpedia.org/resource/Mace_Windu
8,http://dbpedia.org/resource/Starkiller
9,http://dbpedia.org/resource/Ahsoka_Tano


7. And who is actually a Sith (spoiler alert)?  

In [79]:
sparql.setQuery("""
    SELECT ?person WHERE { ?person dbo:occupation dbr:Sith }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person
0,http://dbpedia.org/resource/Count_Dooku
1,http://dbpedia.org/resource/Starkiller
2,http://dbpedia.org/resource/Darth_Plagueis
3,http://dbpedia.org/resource/Darth_Maul
4,http://dbpedia.org/resource/Darth_Vader
5,http://dbpedia.org/resource/Palpatine
6,http://dbpedia.org/resource/Asajj_Ventress


8. For those who don't want to know, how many are there?

In [80]:
sparql.setQuery("""
    SELECT count(?person) WHERE { ?person dbo:occupation dbr:Sith }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0
0,7


9. With Jedi becoming a career opportunity, what occupations are there anyway (incomplete list)?

In [81]:
sparql.setQuery("""
    SELECT ?person ?occupation WHERE { ?person dbo:occupation ?occupation } ORDER BY ?occupation
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person,occupation
0,http://dbpedia.org/resource/Everett_Alvarez_Jr.,http://alvarezassociates.com/
1,http://dbpedia.org/resource/Fred_Hassan,http://caretgroup.com/
2,http://dbpedia.org/resource/Lisa_Poppaw,http://childsafecolorado.org/Staff.html
3,http://dbpedia.org/resource/David_Brailer,http://cigna.com/
4,http://dbpedia.org/resource/Geoffrey_Cowan,http://communicationleadership.usc.edu/
...,...,...
9995,http://dbpedia.org/resource/Jack_Perrin,http://dbpedia.org/resource/Actor
9996,http://dbpedia.org/resource/Jack_Plotnick,http://dbpedia.org/resource/Actor
9997,http://dbpedia.org/resource/Jack_Raymond,http://dbpedia.org/resource/Actor
9998,http://dbpedia.org/resource/Jack_Ryder_(actor),http://dbpedia.org/resource/Actor


10. Impressive, but how many people are there in DBpedia anyway, when looking at nationality?

In [82]:
sparql.setQuery("""
    SELECT count(?nationality) ?nationality 
       WHERE { ?person dbo:nationality ?nationality } ORDER BY ?nationality
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,callret-0,nationality
0,1,http://dbpedia.org/resource/18th-century_histo...
1,1,http://dbpedia.org/resource/1931_French_Grand_...
2,1,http://dbpedia.org/resource/1951_French_Grand_...
3,1,http://dbpedia.org/resource/1952_Italian_Grand...
4,1,http://dbpedia.org/resource/1956_Argentine_Gra...
...,...,...
3611,1,http://dbpedia.org/resource/Åland
3612,7,http://dbpedia.org/resource/Úrvalsdeild_karla_...
3613,1,http://dbpedia.org/resource/Úrvalsdeild_kvenna...
3614,2,http://dbpedia.org/resource/Đại_Cồ_Việt


11. Hmm. Some interesting nationalities there... Also this may be a somewhat biased world view---let's sort that by "impact on Wikipedia"?

In [83]:
sparql.setQuery("""
    SELECT count(?nationality) AS ?count ?nationality 
       WHERE { ?person dbo:nationality ?nationality } ORDER BY DESC(?count)
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,count,nationality
0,20971,http://dbpedia.org/resource/United_States
1,7260,http://dbpedia.org/resource/United_Kingdom
2,4818,http://dbpedia.org/resource/India
3,4764,http://dbpedia.org/resource/Americans
4,3912,http://dbpedia.org/resource/Canadians
...,...,...
3611,1,http://dbpedia.org/resource/Fijian_people
3612,1,http://dbpedia.org/resource/Limbu_people
3613,1,http://dbpedia.org/resource/Dubrovnik
3614,1,http://dbpedia.org/resource/Teochew_people


12. Etcetera.

**Part 3**

Are there issues with completeness of encoding (does Rembrandt have an occupation or nationality?) or with selection and bias/representation that you observed in this example walk?

Rembrandt did not have an occupation listed, so there are issues with the completeness of the encoding. It is possible that the occupation is recorded under another property name. There are also issues with representation: anyone can create a Wikipedia page for a nationality, and some are not even real nationalities but subgroups of nationalities, such as being part of the Polish diaspora.

## Make your own walk!

In a similar way as the example walk, make your own walk through DBpedia/Wikipedia. Explore some of the amazing power SPARQL queries give you, while staying aware of the limitations and bias of the collection and the encoding. We reward creativity as much as technical skills: some of the most interesting queries are a very simple SPARQL statement! Just like in Assignment 1, please do pick a topic that you are interested in for this one, not one of our boring example ones.

**(Your answers with query code blocks here)** Finding the best places to live for working with a hedge fund


In [121]:
sparql.setQuery("""
    SELECT ?occupation WHERE { dbr:John_Overdeck dbo:occupation ?occupation }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,occupation
0,http://dbpedia.org/resource/John_Overdeck__Per...
1,http://dbpedia.org/resource/Hedge_fund


In [124]:
sparql.setQuery("""
    SELECT ?person WHERE { ?person dbo:occupation dbr:Hedge_fund }
""")

result = sparql.query().convert().decode("utf-8") 
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,person
0,http://dbpedia.org/resource/Rocco_Benetton
1,http://dbpedia.org/resource/Roy_Niederhoffer
2,http://dbpedia.org/resource/Bill_Ackman
3,http://dbpedia.org/resource/Bill_Perkins_(busi...
4,http://dbpedia.org/resource/Brad_Gerstner
5,http://dbpedia.org/resource/Brian_Kim_(hedge_f...
6,http://dbpedia.org/resource/David_E._Shaw
7,http://dbpedia.org/resource/David_Einhorn_(hed...
8,http://dbpedia.org/resource/David_Tepper
9,http://dbpedia.org/resource/Anthony_Chiasson


In [135]:
sparql.setQuery("""
    SELECT DISTINCT ?company WHERE {
        ?person dbo:occupation dbr:Hedge_fund .
        ?person dbo:knownFor ?company .
    }
""")

result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,company
0,http://dbpedia.org/resource/Altimeter_Capital
1,http://dbpedia.org/resource/Greenlight_Capital
2,http://dbpedia.org/resource/Carolina_Panthers
3,http://dbpedia.org/resource/Appaloosa_Management
4,http://dbpedia.org/resource/Charlotte_FC
5,http://dbpedia.org/resource/Two_Sigma_Investments
6,http://dbpedia.org/resource/Robin_Hood_Foundation
7,http://dbpedia.org/resource/Black_Monday_(1987)
8,http://dbpedia.org/resource/The_Children's_Inv...
9,http://dbpedia.org/resource/Castleton_Commodit...


In [156]:
sparql.setQuery("""
    SELECT ?company WHERE {
        ?company dbo:industry dbr:Hedge_fund
        
    }
""")

result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")

In [154]:
sparql.setQuery("""
    SELECT ?company ?hqLocation WHERE {
        ?company dbo:industry dbr:Hedge_fund .
        { ?company dbp:hqLocation ?hqLocation . } UNION
        { ?company dbo:locationCity ?hqLocation . } UNION
        { ?company dbo:headquarter ?hqLocation . } UNION
        { ?company dbp:hqLocationCity ?hqLocation . } UNION
        { ?company dbo:location ?hqLocation . }
        
    }
""")

result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
display(df)

Unnamed: 0,company,hqLocation
0,http://dbpedia.org/resource/Magnetar_Capital,"http://dbpedia.org/resource/Evanston,_Illinois"
1,http://dbpedia.org/resource/Capula_Investment_...,http://dbpedia.org/resource/London
2,http://dbpedia.org/resource/Renaissance_Techno...,"http://dbpedia.org/resource/East_Setauket,_New..."
3,http://dbpedia.org/resource/The_Children's_Inv...,http://dbpedia.org/resource/London
4,http://dbpedia.org/resource/Millennium_Managem...,http://dbpedia.org/resource/New_York_City
...,...,...
56,http://dbpedia.org/resource/Wynnefield_Capital,http://dbpedia.org/resource/New_York_City
57,http://dbpedia.org/resource/Centerbridge_Partn...,http://dbpedia.org/resource/United_States
58,http://dbpedia.org/resource/Centerbridge_Partn...,http://dbpedia.org/resource/New_York_(state)
59,http://dbpedia.org/resource/Centerbridge_Partn...,http://dbpedia.org/resource/New_York_City
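
To turn a table like the one above into a ranking of locations, the counting can be done by the endpoint itself with `GROUP BY`. The sketch below only builds such a query string. `build_location_ranking_query` is a hypothetical helper, and the property list is the same set of guesses used in the query above.

```python
def build_location_ranking_query(location_predicates):
    """Build a SPARQL query that counts hedge fund companies per location,
    trying several alternative location properties."""
    alternatives = " UNION\n        ".join(
        f"{{ ?company {predicate} ?hqLocation . }}"
        for predicate in location_predicates
    )
    return f"""
    SELECT ?hqLocation (COUNT(DISTINCT ?company) AS ?companies) WHERE {{
        ?company dbo:industry dbr:Hedge_fund .
        {alternatives}
    }}
    GROUP BY ?hqLocation
    ORDER BY DESC(?companies)
    """

predicates = ["dbp:hqLocation", "dbo:locationCity", "dbo:headquarter",
              "dbp:hqLocationCity", "dbo:location"]
ranking_query = build_location_ranking_query(predicates)
print(ranking_query)
```

`COUNT(DISTINCT ?company)` is used so that a company matching several of the properties with the same location is counted only once.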


The next query searches through the companies that people with a hedge fund occupation are known for, because other properties would yield irrelevant results in this context.

**The best places to live are New York, London, and Miami.**

In [155]:
sparql.setQuery("""
    SELECT ?company ?hqLocation WHERE {
        ?person dbo:occupation dbr:Hedge_fund .
        ?person dbo:knownFor ?company .
        { ?company dbp:hqLocation ?hqLocation . } UNION
        { ?company dbo:locationCity ?hqLocation . } UNION
        { ?company dbo:headquarter ?hqLocation . } UNION
        { ?company dbp:hqLocationCity ?hqLocation . } UNION
        { ?company dbo:location ?hqLocation . }
        
    }
""")

result = sparql.query().convert().decode("utf-8")
df = pd.read_csv(StringIO(result), sep=",")
df

Unnamed: 0,company,hqLocation
0,http://dbpedia.org/resource/Citadel_Securities,"Southeast Financial Center, Miami, Florida, U.S."
1,http://dbpedia.org/resource/Third_Point_Manage...,55
2,http://dbpedia.org/resource/Avenue_Capital_Group,399
3,http://dbpedia.org/resource/Avenue_Capital_Group,399
4,http://dbpedia.org/resource/Appaloosa_Management,"http://dbpedia.org/resource/Miami_Beach,_Florida"
5,http://dbpedia.org/resource/Avenue_Capital_Group,http://dbpedia.org/resource/New_York_City
6,http://dbpedia.org/resource/Avenue_Capital_Group,http://dbpedia.org/resource/New_York_City
7,http://dbpedia.org/resource/The_Children's_Inv...,http://dbpedia.org/resource/London
8,http://dbpedia.org/resource/Millennium_Managem...,http://dbpedia.org/resource/New_York_City
9,http://dbpedia.org/resource/Citadel_Securities,http://dbpedia.org/resource/Southeast_Financia...
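
The result table above mixes resource URIs with plain literals such as `399` and repeats some rows, so counting rows directly would be misleading. The sketch below shows one possible clean-up using only the standard library. The sample rows are copied from the outputs above and are not the full result set, and `count_companies_per_location` is a hypothetical helper name.

```python
import csv
from collections import Counter
from io import StringIO

# A few rows copied from the outputs above; not the full result set.
sample_csv = """company,hqLocation
http://dbpedia.org/resource/Avenue_Capital_Group,399
http://dbpedia.org/resource/Appaloosa_Management,"http://dbpedia.org/resource/Miami_Beach,_Florida"
http://dbpedia.org/resource/Avenue_Capital_Group,http://dbpedia.org/resource/New_York_City
http://dbpedia.org/resource/Avenue_Capital_Group,http://dbpedia.org/resource/New_York_City
http://dbpedia.org/resource/Wynnefield_Capital,http://dbpedia.org/resource/New_York_City
"""

RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def count_companies_per_location(csv_text):
    """Count distinct companies per location, ignoring literal values."""
    pairs = set()
    for row in csv.DictReader(StringIO(csv_text)):
        location = row["hqLocation"]
        # Keep only values that are DBpedia resources, not literals like "399".
        if location.startswith(RESOURCE_PREFIX):
            pairs.add((row["company"], location))  # the set removes duplicate rows
    return Counter(location for _company, location in pairs)

location_counts = count_companies_per_location(sample_csv)
for location, n in location_counts.most_common():
    print(n, location)
```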


# 5. Annotating a corpus

Up to now, we have only queried DBpedia/Wikipedia, but the true power of linked open data is the ability to connect any corpus to the entities in DBpedia.  Manually annotating a corpus is very laborious, but automatic tools for entity linking can potentially annotate any DBpedia entity found in any text.

In Python, spaCy has a DBpedia Spotlight-based entity recognizer. spaCy is a very useful Python tool that can handle a large variety of text processing tasks, including named entity recognition, sentiment analysis, part-of-speech tagging and text categorization, for various languages. It is definitely worth exploring more in your project or some other time.

As this assignment is already long enough, we will only briefly show how it works and ask you to experiment with it in a free format.

Let's install it:

In [29]:
#!pip install spacy-dbpedia-spotlight

Collecting spacy-dbpedia-spotlight
  Downloading spacy_dbpedia_spotlight-0.2.6.tar.gz (17 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting spacy<4.0.0,>=3.0.0 (from spacy-dbpedia-spotlight)
  Downloading spacy-3.5.2-cp311-cp311-win_amd64.whl (12.2 MB)

We will make a text processing pipeline that only includes the DBpedia entity linker, and nothing else:

In [157]:
import spacy_dbpedia_spotlight
# a new blank model will be created, with the language code provided in the parameter
nlp = spacy_dbpedia_spotlight.create('en')
# in this case, the pipeline will only contain the EntityLinker
print(nlp.pipe_names)
# ['dbpedia_spotlight']

['dbpedia_spotlight']


If you are able to experiment with a different language than English, we encourage you to try it! For that, you can change the 'en' language code above to a different language code.

Now, we simply need to call `nlp` on a string, and it will attempt to recognize DBpedia entities in it:

In [158]:
doc = nlp('The University Of Amsterdam is a Dutch higher education institution located in Amsterdam.')

print(doc.ents) #This just prints the entities that were found

# This prints some more details, including the DBpedia identifier and the similarity score:
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])

(University Of Amsterdam, Dutch, Amsterdam)
[('University Of Amsterdam', 'http://dbpedia.org/resource/University_of_Amsterdam', '1.0'), ('Dutch', 'http://dbpedia.org/resource/Netherlands', '0.7255664488349297'), ('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693')]


Let's just make a tiny change to the input and see if the output changes:

In [159]:
doc = nlp('The University of Amsterdam is a Dutch higher education institution located in Amsterdam.')
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])

[('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693'), ('Dutch', 'http://dbpedia.org/resource/Netherlands', '0.7255664488349297'), ('Amsterdam', 'http://dbpedia.org/resource/Amsterdam', '0.9999436227871693')]


Interesting. It no longer recognizes the university in my case.

**Part 4**

Now, it's your turn to experiment with entity linking. Find a paragraph of text anywhere in your target language -- text with concrete names of persons, organizations, locations, events, ... (like news) may have more entities than abstract philosophy -- and try it for yourself. How well does this work?  Did it find all or most entities?  Do you see "errors"? 

In [163]:
doc = nlp("""ArcelorMittal reports first quarter 2023 results
Luxembourg, May 4, 2023 - ArcelorMittal (referred to as “ArcelorMittal” or the “Company”) (MT (New York,
Amsterdam, Paris, Luxembourg), MTS (Madrid)), the world’s leading integrated steel and mining company, today
announced results1 for the three-month ended March 31, 2023.
1Q 2023 Key highlights:
• Health and safety focus: Protecting the health and wellbeing of the employees remains an overarching priority
of the Company. Moving to a ‘predict-and-prevent’ culture with a particular focus on proactively reporting – and
responding to – instances with the specific potential for serious injuries or fatalities (proactive PSIFs)2; LTIF2 rate
of 0.64x in 1Q 2023
• Operating performance improved: 1Q 2023 operating income of $1.2bn (vs. operating loss of $0.3bn3,4 in 4Q
2022); EBITDA of $1.8bn in 1Q 2023 (vs. $1.3bn in 4Q 2022) and EBITDA/t of $126/t in 1Q 2023 (vs. $100/t in
4Q 2022)
• Enhanced share value: 1Q 2023 basic EPS of $1.28/sh vs. $0.30/sh in 4Q 2022, rolling twelve month ROE5 of
14.2%; 1Q 2023 book value per share6 of $64/sh
• Net income: $1.1bn in 1Q 2023 (vs. $0.3bn7 in 4Q 2022) includes share of JV and associates net income of
$0.3bn (vs. $0.1bn in 4Q 2022)
• Financial strength: The Company ended March 2023 with net debt of $5.2bn (vs. $2.2bn at the end of
December 2022) primarily due to M&A outflow (mainly the $2.2bn acquisition of ArcelorMittal Pecém formerly
known as CSP18), share buyback ($0.5bn) and seasonal investment in working capital ($0.8bn). Gross debt of
$11.5bn and cash and cash equivalents of $6.3bn as of March 31, 2023 (compared to $11.7bn and $9.4bn,
respectively, as of December 31, 2022)
• A platform for investment and consistent capital returns:
◦ Recent acquisitions (ArcelorMittal Pecém (Brazil) and ArcelorMittal Texas HBI) and completed strategic capex
projects (Mexico hot strip mill) performing well relative to assumed normalized levels of profitability
◦ Capex in 1Q 2023 of $0.9bn is in line with our full year guidance of within the range of $4.5bn-$5.0bn
◦ The Company has repurchased 19.1m shares so far in 2023, completing its previously announced buyback
program and bringing the total reduction in diluted share count since September 30, 20208 to 31%
◦ Following the approval granted by shareholders at the 2023 AGM, the Company announces its intention to
repurchase up to 85 million shares through May 2025. The level of repurchases will reflect (and is subject to)
the level of post-dividend FCF generated over the period. The Company’s capital return policy defines that a
minimum 50% of post-dividend annual FCF is returned to shareholders through buybacks.
◦ $0.44/sh base dividend to be paid in 2 equal instalments of $0.22/share in June 2023 and December 2023
• Continued progress in climate action: In April 2023, the Company announced that ArcelorMittal Brasil will
form a renewable energy joint venture partnership with Casa dos Ventos to develop a 554MW wind power
project in northeast Brazil""")
print([(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents])

[('ArcelorMittal', 'http://dbpedia.org/resource/ArcelorMittal', '1.0'), ('Luxembourg', 'http://dbpedia.org/resource/Luxembourg', '0.9972687348778917'), ('ArcelorMittal', 'http://dbpedia.org/resource/ArcelorMittal', '1.0'), ('ArcelorMittal', 'http://dbpedia.org/resource/ArcelorMittal', '1.0'), ('New York', 'http://dbpedia.org/resource/New_York_City', '0.9999983027233463'), ('Amsterdam', 'http://dbpedia.org/resource/Euronext', '0.9999999999995453'), ('Paris', 'http://dbpedia.org/resource/Paris', '0.9999988499905403'), ('Luxembourg', 'http://dbpedia.org/resource/Luxembourg', '0.9972687348778917'), ('MTS', 'http://dbpedia.org/resource/Bell_MTS', '0.7695343220148642'), ('Madrid', 'http://dbpedia.org/resource/Madrid', '0.9999697612285651'), ('steel', 'http://dbpedia.org/resource/Steel', '0.9999888709613431'), ('operating income', 'http://dbpedia.org/resource/Earnings_before_interest_and_taxes', '1.0'), ('EBITDA', 'http://dbpedia.org/resource/Earnings_before_interest,_taxes,_depreciation,_and

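The plugin found many of the concepts, and most of them were linked accurately. Two links in the output look questionable: 'MTS' (the Madrid ticker symbol in the text) was linked to `Bell_MTS`, and 'Amsterdam' was linked to `Euronext` rather than to the city. The plugin also annotated only a small share of the words in the text.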
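
A simple way to review the links is to separate them by similarity score. The sketch below uses a few tuples copied from the output above. The threshold of 0.9 is an arbitrary assumption, and `split_by_score` is a hypothetical helper, not part of the plugin.

```python
# A few (text, DBpedia URI, similarity score) tuples copied from the output above.
entities = [
    ("ArcelorMittal", "http://dbpedia.org/resource/ArcelorMittal", "1.0"),
    ("Amsterdam", "http://dbpedia.org/resource/Euronext", "0.9999999999995453"),
    ("MTS", "http://dbpedia.org/resource/Bell_MTS", "0.7695343220148642"),
    ("Madrid", "http://dbpedia.org/resource/Madrid", "0.9999697612285651"),
]

def split_by_score(entities, threshold):
    """Split entity links into those at or above `threshold` and those below it."""
    confident, uncertain = [], []
    for text, uri, score in entities:
        # The scores arrive as strings, so convert before comparing.
        target = confident if float(score) >= threshold else uncertain
        target.append((text, uri))
    return confident, uncertain

confident, uncertain = split_by_score(entities, threshold=0.9)
print("Below threshold:", uncertain)
```

A threshold only reflects the linker's own confidence: 'Amsterdam' linked to `Euronext` keeps a near-perfect score, so a manual check of a sample is still needed.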

Now assume you have a large text corpus annotated in this way (think up your own corpus of interest, or otherwise think about the corpus of movie reviews you explored earlier).   

Can you think up something you could explore using these annotations? E.g. if we remember the movie reviews collection from Assignment 2 we could look at particular actors, but also about comparing male and female actors as two groups, and doing aggregated queries about male and female actor mentions in the whole corpus of negative and positive movie reviews (similar to the SPARQL queries earlier).

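In this case, I could see these annotations being used to interpret the text of quarterly financial reports, for example by comparing which companies, locations and financial metrics are mentioned across reports.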
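
As a rough illustration of such an aggregated analysis, the sketch below counts in how many reports each entity is mentioned. The corpus is hypothetical: the report names and the assignment of entities to reports are made up for illustration, and only the DBpedia URIs are taken from the output above.

```python
from collections import Counter

DBR = "http://dbpedia.org/resource/"

# Hypothetical annotated corpus: report name -> DBpedia URIs found in that report.
annotated_reports = {
    "report_a": [DBR + "ArcelorMittal", DBR + "Luxembourg", DBR + "ArcelorMittal"],
    "report_b": [DBR + "ArcelorMittal", DBR + "Steel"],
    "report_c": [DBR + "Steel", DBR + "Madrid"],
}

def document_frequency(annotated_docs):
    """Count the number of documents in which each entity appears."""
    counts = Counter()
    for uris in annotated_docs.values():
        counts.update(set(uris))  # count each entity once per document
    return counts

doc_freq = document_frequency(annotated_reports)
for uri, n in doc_freq.most_common():
    print(n, uri)
```

Because every annotation is a DBpedia URI, these counts could then be combined with SPARQL queries, for example to group the mentioned entities by type or country.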