# Abstract

Final project for Electronic Publishing and Digital Storytelling, in fulfillment of an LM in Digital Humanities and Digital Knowledge from the University of Bologna.

## Project Aims
Wikidata is one of the largest free and open knowledge databases in the world. 
Launched in 2012, it now contains over 97 million items, over six million of them people.

This project investigates how Wikidata describes art historians and how those descriptions differ across gender.
It serves as a case study in how our descriptions of history create history.

### Phase 1: Overview
We first wanted to get a wide view of Wikidata's data on art historians.
To do this, we counted art historians grouped by gender.

In [59]:
#insert Denise's initial query that breaks down those with art historian/sub groups into genders
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query: count art historians (including subclasses of the occupation) by gender
my_SPARQL_query = """
SELECT ?genderLabel (count(distinct ?human) as ?number)
WHERE
{SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
   }
  ?human wdt:P31 wd:Q5
  ; wdt:P21 ?gender
  ; wdt:P106/wdt:P279* wd:Q1792450 .
}
GROUP BY ?genderLabel
LIMIT 10

"""

# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# print the count for each gender
for result in results["results"]["bindings"]:
    print(result["genderLabel"]["value"], result["number"]["value"])

# look the counts up by label rather than by position: without ORDER BY,
# the row order of a GROUP BY result is not guaranteed
counts = {
    row["genderLabel"]["value"]: int(row["number"]["value"])
    for row in results["results"]["bindings"]
}
men = counts["male"]
women = counts["female"]
nb = counts["non-binary"]
print(men)
print(women)
print(nb)


male 11756
female 5880
non-binary 2
11756
5880
2
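
To put these counts in proportion, here is a minimal sketch that turns them into shares of the total. The `gender_shares` helper is hypothetical (it is not part of the original notebook), and the counts are copied by hand from the output above.

```python
def gender_shares(counts):
    """Return each gender's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {gender: n / total for gender, n in counts.items()}


# Counts copied from the query output above.
counts = {"male": 11756, "female": 5880, "non-binary": 2}

shares = gender_shares(counts)
for gender, share in shares.items():
    print(f"{gender}: {share:.2%}")
```

Roughly two out of three art historians in this result are recorded as male, which is worth keeping in mind when reading the raw totals in later queries.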


Then we wanted to look at the properties used to describe art historians across genders, so we ran a query to count the number of distinct properties used for each gender.

In [13]:
#insert Sarah's query getting property counts
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
SELECT ?genderLabel (count(distinct ?property) as ?number)
WHERE
{SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
   }

  ?human wdt:P31 wd:Q5
  ; wdt:P21 ?gender
  ; ?property ?object
  ; wdt:P106/wdt:P279* wd:Q1792450 .

}

GROUP BY ?genderLabel
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["genderLabel"]["value"], result["number"]["value"])
print("🦐")


male 2662
female 1797
non-binary 233
🦐


The next query drops DISTINCT, so it counts the total number of property statements for each gender rather than the number of different properties.

In [63]:
#insert Sarah's query getting property counts
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
SELECT ?genderLabel (count(?property) as ?number)
WHERE
{SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
   }

  ?human wdt:P31 wd:Q5
  ; wdt:P21 ?gender
  ; ?property ?object
  ; wdt:P106/wdt:P279* wd:Q1792450 .

}

GROUP BY ?genderLabel
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["genderLabel"]["value"], result["number"]["value"])
print("🌮")

male 1204094
female 422519
non-binary 493
🌮
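
Because there are about twice as many men as women in the data, the raw totals are hard to compare directly. A rough sketch, dividing the totals above by the head counts from the first query, is given below. `statements_per_person` is a hypothetical helper. The result is only indicative: `?property ?object` matches every kind of triple on an item (labels and descriptions as well as direct claims), and the two queries may not have been run at the same moment.

```python
def statements_per_person(total_statements, people):
    """Divide total statement counts by head counts, gender by gender."""
    return {
        gender: total_statements[gender] / people[gender]
        for gender in total_statements
        if people.get(gender)  # skip genders with no head count
    }


# Totals copied from the output above; head counts from the first query.
total_statements = {"male": 1204094, "female": 422519, "non-binary": 493}
people = {"male": 11756, "female": 5880, "non-binary": 2}

averages = statements_per_person(total_statements, people)
for gender, avg in averages.items():
    print(f"{gender}: {avg:.1f} triples per person")
```

With only two people in the non-binary group, that average should not be read as a trend.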


__NOTE: This cell copies everything from the first one and only changes the query. Is there a more efficient way to do this? Keeping each cell complete is convenient because any query can be run on its own at any time, but if we know she's going to run all the preceding cells first, there may be a way to make it more efficient (e.g. import only once, reuse variables, etc.).__
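
One possible answer to the note above is to define the shared setup once and let each cell pass in only its query. This is a sketch, not the project's actual code: `run_query` and `print_rows` are hypothetical names, and the SSL workaround used in the cells above is left out.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send a SPARQL query to the endpoint and return the result bindings."""
    # Imported here so the helpers can be defined even before the
    # library is installed; in the notebook a top-level import is fine.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]


def print_rows(bindings, *columns):
    """Print the chosen columns of each result row and return them as tuples."""
    rows = [tuple(row[col]["value"] for col in columns) for row in bindings]
    for row in rows:
        print(*row)
    return rows


# Example with a hand-written binding in the shape the endpoint returns.
sample = [{"genderLabel": {"value": "male"}, "number": {"value": "11756"}}]
rows = print_rows(sample, "genderLabel", "number")
```

With these two functions defined in an early cell, each later cell would reduce to its query string plus a call such as `print_rows(run_query(my_SPARQL_query), "genderLabel", "number")`.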

We also wanted to look at basic trends over time.

In [None]:
#queries for seeing if more art historians/women art historians from later periods are better or worse represented

Optional: trends over geographic space?

### Phase 2: Types of Properties
Then we wanted to break down those properties into types to see if certain properties/types of properties appear more often for some genders over others.
The first query is for how many art historians of each gender are also linked to a VIAF authority.

__NOTE: We had huge problems getting a more generic query to work. Ideally this would count ANY authority identifier, not just VIAF. See the project notes.__

In [22]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
SELECT ?genderLabel (count(distinct ?human) as ?number)
WHERE
{SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
   }

  ?human wdt:P31 wd:Q5
  ; wdt:P21 ?gender
  ; wdt:P106/wdt:P279* wd:Q1792450 
  ; wdt:P214 ?viafid .
}


GROUP BY ?genderLabel
LIMIT 10
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["genderLabel"]["value"], result["number"]["value"])
print("🧁")


male 10998
female 5197
non-binary 2
🧁
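
These counts are easier to read as coverage rates, that is, the share of art historians of each gender who have a VIAF ID. A minimal sketch follows; `coverage` is a hypothetical helper, and the head counts are taken from the first query in Phase 1, on the assumption that both queries saw the same data.

```python
def coverage(with_id, total):
    """Share of people of each gender who have the identifier."""
    return {gender: with_id[gender] / total[gender] for gender in with_id}


# Counts with a VIAF ID (output above) and head counts (first query).
with_viaf = {"male": 10998, "female": 5197, "non-binary": 2}
total = {"male": 11756, "female": 5880, "non-binary": 2}

rates = coverage(with_viaf, total)
gap = (rates["male"] - rates["female"]) * 100  # percentage points

for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%}")
print(f"male-female gap: {gap:.1f} percentage points")
```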


In [18]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
# List the authority control properties most used for art historians, by gender
SELECT ?propertyLabel ?genderLabel ?count WHERE {
  {
    select distinct ?gender ?propertyclaim (COUNT(*) AS ?count) where {
      ?item wdt:P106/wdt:P279* wd:Q1792450  .
      ?item wdt:P31 wd:Q5 .
      ?item wdt:P21 ?gender .
      ?item ?propertyclaim [] .
    } group by ?propertyclaim ?gender
  }
  ?property wikibase:propertyType wikibase:ExternalId .
  ?property wdt:P31 wd:Q19595382 .
  ?property wikibase:claim ?propertyclaim .
  SERVICE wikibase:label {            # ... include the labels
    bd:serviceParam wikibase:language "en" .
  }
} ORDER BY DESC (?count)
#LIMIT 100
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["propertyLabel"]["value"], result["genderLabel"]["value"], result["count"]["value"])
print("👻")

VIAF ID male 11532
ISNI male 9625
WorldCat Identities ID male 9505
Library of Congress authority ID male 8584
GND ID male 8462
NUKAT ID male 7817
Nationale Thesaurus voor Auteurs ID male 7178
Bibliothèque nationale de France ID male 5726
VIAF ID female 5419
NKCR AUT ID male 5073
Deutsche Biographie (GND) ID male 4950
PLWABN ID male 4273
ISNI female 4016
WorldCat Identities ID female 3651
Library of Congress authority ID female 3484
SHARE Catalogue author ID male 3421
GND ID female 3410
Vatican Library VcBA ID male 3123
Unione Romana Biblioteche Scientifiche ID male 3085
NUKAT ID female 2987
American Academy in Rome ID male 2929
NORAF ID male 2574
Open Library ID male 2413
Nationale Thesaurus voor Auteurs ID female 2361
Bibliothèque nationale de France ID female 2204
National Library of Israel J9U ID male 2132
abART person ID male 2088
Vatican Library ID (former scheme) male 2031
CONOR.SI ID male 1800
NKCR AUT ID female 1742
Kallías ID male 1571
Deutsche Biographie (GND) ID female 1476


Swedish Portrait Archive ID female 3
EPHE ID female 3
Premiers préfets ID male 3
France Culture person ID male 3
IUF member ID female 3
Whonamedit? doctor ID male 3
Encyclopaedia Herder person ID male 3
BD Gest' author ID female 3
Swedish Film Database person ID female 3
Elonet person ID female 3
warheroes.ru ID male 3
LBT person ID female 3
PhilPeople profile male 3
Memorial Book Bundesarchiv ID female 3
Memorial Book Bundesarchiv ID male 3
Frick Art Reference Library Artist File ID female 3
WeChangEd ID female 3
Curran Index contributor ID female 3
Provenio ID male 3
The Conversation author ID male 3
Norwegian prisoner register person ID male 3
Český hudební slovník osob a institucí ID female 3
Jewish Virtual Library ID male 3
Canadian Women Artists History Initiative ID female 3
Quirinale ID male 3
ZOBODAT person ID female 3
BVFE author ID male 3
Germanistenverzeichnis ID male 3
MovieMeter person ID female 3
Podchaser creator ID female 3
IRIS UNINA author ID female 3
IRIS UNIPV auth
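
A long flat list like this is hard to compare across genders. One way to read it is to pivot the rows by property and look at the female-to-male ratio for each identifier. The sketch below uses a few rows copied from the output above; `pivot_by_property` and `female_to_male_ratio` are hypothetical helpers. Note that the counts are result rows (statements), not people, so the ratios are only a rough signal.

```python
from collections import defaultdict


def pivot_by_property(rows):
    """Turn (property, gender, count) rows into {property: {gender: count}}."""
    table = defaultdict(dict)
    for prop, gender, count in rows:
        table[prop][gender] = int(count)
    return dict(table)


def female_to_male_ratio(table):
    """Return female count divided by male count for properties that have both."""
    return {
        prop: counts["female"] / counts["male"]
        for prop, counts in table.items()
        if counts.get("male") and counts.get("female")
    }


# A few rows copied from the output above.
rows = [
    ("VIAF ID", "male", "11532"),
    ("ISNI", "male", "9625"),
    ("GND ID", "male", "8462"),
    ("VIAF ID", "female", "5419"),
    ("ISNI", "female", "4016"),
    ("GND ID", "female", "3410"),
]

ratios = female_to_male_ratio(pivot_by_property(rows))
for prop, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{prop}: {ratio:.2f}")
```

For comparison, the head counts from Phase 1 give a baseline ratio of 5880 / 11756, about 0.50. A ratio below that suggests the identifier is attached to women less often than their share of the group would predict.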

### Phase 2a: Professions and Occupations

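Total number of occupation statements for art historians of each gender. The count includes the "art historian" statement itself.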

In [11]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
SELECT ?genderLabel (COUNT(?job) AS ?count)
WHERE 
{ 

  ?human wdt:P21 ?gender
  ; wdt:P106 wd:Q1792450
  ; wdt:P106 ?job
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

}
GROUP BY ?genderLabel
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()


# manipulate the result
for result in results["results"]["bindings"]:
    print(result["genderLabel"]["value"], result["count"]["value"])
print("🍩")
  



male 27555
female 11077
non-binary 7
🍩


Most common occupations held by art historians, broken down by gender

In [6]:
#various queries about these; how many other jobs do people have, what areas are they in, what are the most popular other jobs for each gender?
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """

SELECT ?jobLabel ?genderLabel (COUNT(?human) AS ?count)

WHERE 
{ 
  ?human wdt:P21 ?gender
  ; wdt:P106 wd:Q1792450
  ; wdt:P106 ?job
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?jobLabel ?genderLabel
HAVING (?count > 1)  # keep only jobs that occur more than once per gender, to avoid outliers (maybe we can put the limit at 10?)

ORDER BY DESC(?count)  # ?gender is not a grouping variable, so sorting on it had no effect
"""

# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["jobLabel"]["value"], result["genderLabel"]["value"], result["count"]["value"])
print("🍭")

art historian male 11138
art historian female 5697
university teacher male 1804
writer male 1056
archaeologist male 1033
historian male 1003
painter male 640
curator male 552
art critic male 497
university teacher female 487
historian female 418
exhibition curator male 415
exhibition curator female 391
journalist male 385
curator female 385
writer female 379
teacher male 377
architect male 355
museum director male 311
anthropologist male 258
poet male 237
architectural historian male 236
art collector male 208
translator male 208
opinion journalist male 206
author male 204
politician male 194
photographer male 193
archaeologist female 188
art critic female 164
librarian male 151
philosopher male 149
professor male 149
classical archaeologist male 143
museologist male 143
journalist female 139
pedagogue male 134
artist male 134
author female 117
translator female 114
teacher female 112
preservationist male 109
art theorist male 108
graphic artist male 105
opinion journalist female 104
m

consultant male 3
salonnière female 2
semiotician male 2
Near Eastern archaeologist male 2
historic preservation male 2
LGBTI rights activist male 2
film actor female 2
mathematician female 2
songwriter male 2
anarchist male 2
anatomist male 2
stockbroker male 2
television actor male 2
film actor male 2
impresario male 2
poster artist male 2
botanical illustrator male 2
business manager male 2
statistician male 2
paleontologist female 2
honorary professor male 2
zoologist male 2
military historian male 2
military physician male 2
runologist male 2
illuminator male 2
head teacher male 2
newspaper editor male 2
prehistorian female 2
historian of the modern age male 2
magician male 2
theatre manager male 2
resistance fighter male 2
choreographer female 2
dancer male 2
humanist male 2
medical historian female 2
performing artist female 2
aristocrat female 2
postage stamp designer male 2
ornithologist male 2
partisan male 2
documentalist female 2
engineer of the French Corps of Bridges and 
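
The result comes back as one list sorted by count, with genders interleaved. To compare the most common jobs for each gender side by side, the rows can be regrouped after the query. A minimal sketch, assuming the rows have already been read into `(job, gender, count)` tuples; `top_jobs` is a hypothetical helper, and the sample rows are copied from the output above.

```python
from collections import defaultdict


def top_jobs(rows, n=10, exclude=("art historian",)):
    """Return the n most frequent jobs for each gender, most frequent first."""
    by_gender = defaultdict(list)
    for job, gender, count in rows:
        if job not in exclude:  # everyone in the query is an art historian
            by_gender[gender].append((job, int(count)))
    return {
        gender: sorted(jobs, key=lambda item: item[1], reverse=True)[:n]
        for gender, jobs in by_gender.items()
    }


# A few rows copied from the output above.
rows = [
    ("art historian", "male", "11138"),
    ("art historian", "female", "5697"),
    ("university teacher", "male", "1804"),
    ("writer", "male", "1056"),
    ("archaeologist", "male", "1033"),
    ("university teacher", "female", "487"),
    ("historian", "female", "418"),
    ("exhibition curator", "female", "391"),
]

top = top_jobs(rows, n=2)
for gender, jobs in top.items():
    print(gender, jobs)
```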

Then we analyzed each of these areas more deeply.
### Phase 2b: Personal Relationships
Are men or women more likely to have personal relationships listed? What kinds of relationships appear?

The query below shows all personal relationship properties and how often each is used, by gender. In this output "relative" appears for both men (156) and women (40); only "stepparent" is limited to one gender (men, 2 uses).

In [20]:
from SPARQLWrapper import SPARQLWrapper, JSON
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# get the endpoint API
wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# prepare the query
my_SPARQL_query = """
# List the personal relationship properties most used for art historians, by gender
SELECT ?propertyLabel ?genderLabel ?count WHERE {
  {
    select distinct ?gender ?propertyclaim (COUNT(*) AS ?count) where {
      ?item wdt:P106/wdt:P279* wd:Q1792450  .
      ?item wdt:P31 wd:Q5 .
      ?item wdt:P21 ?gender .
      ?item ?propertyclaim [] .
    } group by ?propertyclaim ?gender
  }
  #?property wikibase:propertyType wikibase:ExternalId .
  ?property wdt:P31 wd:Q22964231 .
  ?property wikibase:claim ?propertyclaim .
  SERVICE wikibase:label {            # ... include the labels
    bd:serviceParam wikibase:language "en" .
  }
} ORDER BY DESC (?count)
#LIMIT 100
"""
# set the endpoint 
sparql_wd = SPARQLWrapper(wikidata_endpoint)
# set the query
sparql_wd.setQuery(my_SPARQL_query)
# set the returned format
sparql_wd.setReturnFormat(JSON)
# get the results
results = sparql_wd.query().convert()

# manipulate the result
for result in results["results"]["bindings"]:
    print(result["propertyLabel"]["value"], result["genderLabel"]["value"], result["count"]["value"])
print("👶")

child male 798
father male 736
sibling male 624
spouse male 591
spouse female 378
father female 241
mother male 213
child female 161
relative male 156
sibling female 132
mother female 102
relative female 40
number of children male 30
unmarried partner male 23
number of children female 16
unmarried partner female 4
stepparent male 2
godparent male 1
godparent female 1
number of children non-binary 1
👶
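
Whether a relationship property is used for only one gender can be checked mechanically from the rows above instead of by eye. A minimal sketch; `single_gender_properties` is a hypothetical helper, and the rows are copied from the output above.

```python
from collections import defaultdict


def single_gender_properties(rows):
    """Return {property: gender} for properties that appear under one gender only."""
    genders = defaultdict(set)
    for prop, gender, _count in rows:
        genders[prop].add(gender)
    return {prop: next(iter(g)) for prop, g in genders.items() if len(g) == 1}


# Rows copied from the output above.
rows = [
    ("child", "male", 798), ("father", "male", 736),
    ("sibling", "male", 624), ("spouse", "male", 591),
    ("spouse", "female", 378), ("father", "female", 241),
    ("mother", "male", 213), ("child", "female", 161),
    ("relative", "male", 156), ("sibling", "female", 132),
    ("mother", "female", 102), ("relative", "female", 40),
    ("number of children", "male", 30), ("unmarried partner", "male", 23),
    ("number of children", "female", 16), ("unmarried partner", "female", 4),
    ("stepparent", "male", 2), ("godparent", "male", 1),
    ("godparent", "female", 1), ("number of children", "non-binary", 1),
]

exclusive = single_gender_properties(rows)
print(exclusive)
```

With counts this small, a property that shows up for one gender only may simply reflect a handful of edits rather than a pattern.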


### Phase 2c: Professional Relationships
Are women more likely to engage professionally with other women? With men? With students? What about institutional relationships and awards?