## Wikidata entity linking

**wd_search.py** finds entities given a string and optional sets of types. It returns a ranked list of objects, one for each hit, with basic information from Wikidata and optionally DBpedia in one or more languages. An example of a call from the command line:

  *python wd_search.py "UMBC" --types ORG --oktypes LOC FAC --badtypes 'sports team' --top 5 --context "I studied computer science at UMBC"*

Types can be any Wikidata type (e.g., Q5 for human) or a type name in **entity_types.py**.  The search prefers hits with a type in **--types** but will accept ones with a type in **--oktypes**.  If an entity has a type in **--badtypes**, it is rejected. The **--limit** parameter sets how many initial candidates are checked (up to 50) and **--top** sets how many good hits are returned.
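The preference rules above can be sketched in a few lines. This is only an illustration of the described behavior, not the code in **wd_search.py**: the function `pick_hits`, the candidate dicts, and the toy labels are all made up, and treating a search with no types as unconstrained is an assumption.

```python
def pick_hits(candidates, types=(), oktypes=(), badtypes=(), limit=50, top=5):
    """Return up to `top` hits: preferred types first, then acceptable ones."""
    types, oktypes, badtypes = set(types), set(oktypes), set(badtypes)
    unconstrained = not types and not oktypes   # assumed: no types means anything goes
    preferred, acceptable = [], []
    for cand in candidates[:limit]:             # only the first `limit` candidates are checked
        cand_types = set(cand["types"])
        if cand_types & badtypes:               # a single bad type rejects the candidate
            continue
        if unconstrained or cand_types & types:
            preferred.append(cand)
        elif cand_types & oktypes:
            acceptable.append(cand)
    return (preferred + acceptable)[:top]


# Made-up candidates, in the order a search might return them.
candidates = [
    {"label": "toy sports team", "types": ["ORG", "sports team"]},
    {"label": "toy facility", "types": ["FAC"]},
    {"label": "toy university", "types": ["ORG"]},
    {"label": "toy song", "types": ["WORK_OF_ART"]},
]

hits = pick_hits(candidates, types=["ORG"], oktypes=["LOC", "FAC"],
                 badtypes=["sports team"])
print([h["label"] for h in hits])   # ['toy university', 'toy facility']
```

The sports team is dropped even though it is also an ORG, and the facility is kept but ranked below the preferred hit.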

Global parameters are set in the config file: **wd_search_config.yml**.

**search()** is the basic function to call if you want a set of hits.  If you just want the one best hit, use **link()**.

In [1]:
import wd_search as wds

Loading config file wd_search_config.yml
spacy_entity_linker


### The link function returns the single best link found

In [2]:
wds.link("UMBC")

{'types': ['Q35120:entity', 'Q43229:organization'],
 'id': 'Q735049',
 'qid': 'Q735049',
 'mention': 'UMBC',
 'description': 'public university in Maryland',
 'label': 'University of Maryland, Baltimore County',
 'search_rank': 1,
 'score': 0.0,
 'scores': [0.0],
 'score_rank': 1,
 'rank': 1.0,
 'wd_uri': 'https://www.wikidata.org/wiki/Q735049',
 'immediate_types': ['Q3918:university',
  'Q15936437:research university',
  'Q23002039:public educational institution of the United States'],
 'immediate_supertypes': [],
 'sitelinks': 12,
 'wikipedia': 'University_of_Maryland,_Baltimore_County',
 'is_instance': True,
 'is_concept': False,
 'en': {'label': 'University of Maryland, Baltimore County',
  'aliases': ['UMBC'],
  'description': 'public university in Maryland',
  'wikiname': 'University of Maryland, Baltimore County'}}
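The type fields in a hit are strings of the form `QID:label`. A small helper for pulling them apart might look like the sketch below; `split_type` is a hypothetical name, not a function in **wd_search.py**, and the dict is a trimmed copy of the output above.

```python
def split_type(tagged):
    """Split 'Q3918:university' into ('Q3918', 'university')."""
    qid, _, label = tagged.partition(":")
    return qid, label


# Trimmed copy of the hit shown above.
hit = {
    "id": "Q735049",
    "types": ["Q35120:entity", "Q43229:organization"],
    "immediate_types": [
        "Q3918:university",
        "Q15936437:research university",
        "Q23002039:public educational institution of the United States",
    ],
}

# Map every type QID on the hit to its label.
type_labels = dict(split_type(t) for t in hit["types"] + hit["immediate_types"])
print("Q43229" in type_labels)   # True
print(type_labels["Q3918"])      # university
```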

### summary() is good for experimenting since it just returns the link's ID, label, and description

In [6]:
wds.summary(wds.link("UMBC"))

('Q735049',
 'University of Maryland, Baltimore County',
 'public university in Maryland')

In [7]:
wds.summary(wds.link("GE research"))

('Q5513284', 'GE Global Research', 'organization')
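Judging from its output, **summary()** reduces a hit to three of its fields. A minimal stand-in, assuming the `id`, `label`, and `description` keys seen in the hit dict, could look like this. The name `summarize`, the list handling, and the `None` case for a failed lookup are assumptions, not the library's actual code.

```python
def summarize(result):
    """Reduce a hit dict, or a list of hit dicts, to (id, label, description)."""
    if isinstance(result, list):
        return [summarize(hit) for hit in result]
    if not result:   # assumed: a failed lookup yields None or an empty dict
        return None
    return (result["id"], result["label"], result["description"])


# Trimmed copy of the UMBC hit shown earlier.
hit = {
    "id": "Q735049",
    "label": "University of Maryland, Baltimore County",
    "description": "public university in Maryland",
    "sitelinks": 12,
}
print(summarize(hit))
```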

### If we don't specify a type, any entity can match, so we may get the wrong one. A search for Washington returns the state (a GPE)

In [19]:
wds.summary(wds.link("Washington"))

('Q1223', 'Washington', 'state of the United States of America')

In [20]:
wds.summary(wds.link("Washington", target_types=['PERSON']))

('Q23', 'George Washington', '1st president of the United States (1732-1799)')
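A type argument can be a Wikidata QID or a type name, and the examples in this document pass it either as a single string or as a list. One way to normalize both is sketched below. The table is hypothetical (the real mapping lives in **entity_types.py** and may differ), as are the function names and the choice to upper-case names before lookup.

```python
import re

# Hypothetical lookup table; only two QIDs that appear in this document are used.
TYPE_NAMES = {
    "PERSON": "Q5",     # human
    "PER": "Q5",
    "ORG": "Q43229",    # organization
}


def resolve_type(name):
    """Return a QID for a type name, passing QIDs through unchanged."""
    if re.fullmatch(r"Q\d+", name):
        return name
    try:
        return TYPE_NAMES[name.upper()]
    except KeyError:
        raise ValueError(f"unknown type name: {name!r}") from None


def resolve_types(target_types):
    """Accept a single string or a list of strings."""
    if isinstance(target_types, str):
        target_types = [target_types]
    return [resolve_type(t) for t in target_types]


print(resolve_types("PERSON"))        # ['Q5']
print(resolve_types(["Q5", "ORG"]))   # ['Q5', 'Q43229']
```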

### The type is a strong signal about what we want, though. NORP is the OntoNotes type for nationalities or religious or political groups.

In [37]:
wds.summary(wds.link("Chinese", target_types="NORP"))

('Q42740', 'Han Chinese people', 'ethnic group')

### We get a different match if we use the OntoNotes LANGUAGE type.

In [36]:
wds.summary(wds.link("Chinese", target_types="LANGUAGE"))

('Q37041',
 'Classical Chinese',
 'language of the Sino-Tibetan language family (ISO 639-3: lzh)')

### A simple, fast NLP pipeline like spaCy can type named entities with OntoNotes types.  Here's Donbas, a region in Ukraine.

In [35]:
wds.summary(wds.link("Donbas",target_types="GPE"))

('Q605714', 'Donbas', 'region in eastern Ukraine')

### If we set dbpedia to True, here is the information we get from a search for a GPE using 'Ukraine'

In [40]:
wds.link("Ukraine", target_types="GPE", dbpedia=1)

{'types': ['Q56061:administrative territorial entity',
  'Q35120:entity',
  'Q2221906:geographic location',
  'Q43229:organization'],
 'id': 'Q212',
 'qid': 'Q212',
 'mention': 'Ukraine',
 'description': 'sovereign state in eastern Europe',
 'label': 'Ukraine',
 'search_rank': 1,
 'score': 0.0,
 'scores': [0.0],
 'score_rank': 1,
 'rank': 1.0,
 'wd_uri': 'https://www.wikidata.org/wiki/Q212',
 'immediate_types': ['Q3624078:sovereign state'],
 'immediate_supertypes': [],
 'sitelinks': 344,
 'wikipedia': 'Ukraine',
 'is_instance': True,
 'is_concept': False,
 'en': {'label': 'Ukraine',
  'aliases': ['UKR', 'UA', 'Ukraina', 'ua', '🇺🇦', 'Ukr.', 'Ukrainia'],
  'description': 'sovereign state in eastern Europe',
  'wikiname': 'Ukraine',
  'abstract': "Ukraine (Ukrainian: Україна, romanized: Ukraina, pronounced [ʊkrɐˈjinɐ] ) is a country in Eastern Europe. It covers an area of 603,628 km2 (233,062 sq mi), and is the second-largest country in Europe after Russia, which it borders to the east an

### This is a very simple approach that can't do much if there are many good candidates with the right type.

Try looking for a PER named **Michael Jordan**

In [45]:
wds.summary(wds.search("Michael Jordan", target_types='PER', top=20))

[('Q41421', 'Michael Jordan', 'American basketball player and businessman'),
 ('Q3308285',
  'Michael I. Jordan',
  'American computer scientist, University of California, Berkeley'),
 ('Q1928047',
  'Michael Jordan',
  'German draughtsperson, artist and comics artist'),
 ('Q27069141', 'Michael Jordan', 'American football cornerback'),
 ('Q6831715', 'Michael Jordan', 'Irish politician'),
 ('Q65029442', 'Michael Jordan', 'American football offensive lineman'),
 ('Q6831719', 'Michael Jordan', 'British mycologist'),
 ('Q6831716', 'Michael Jordan', 'English footballer (born 1984)'),
 ('Q95244944', 'Michael Jordan', ''),
 ('Q100983908',
  'Michael Jordan',
  'college basketball player (1998–1999) Detroit Mercy'),
 ('Q100983906',
  'Michael Jordan',
  'college basketball player (2001–2001) New Mexico'),
 ('Q97521844', 'Michael Jordan', 'researcher'),
 ('Q6835972', 'Michaël Jordan Nkololo', 'Congolese footballer'),
 ('Q16267175', 'Michael Jordan in Flight', '1993 video game')]

### One way to help is to provide a context string, which might be the mention's sentence

In [46]:
wds.summary(wds.link("Michael Jordan", target_types='PER', context="He has made many contributions to machine learning"))

('Q3308285',
 'Michael I. Jordan',
 'American computer scientist, University of California, Berkeley')
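To give a feel for how a context string can break ties, here is a toy re-ranker based on word overlap between the context and each candidate's description. It is a stand-in for illustration only and not the scoring that **wd_search.py** uses; the function names are made up, and the context sentence is reworded so that it shares words with a description, which plain overlap needs in order to work.

```python
import re


def tokens(text):
    """Lower-case a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def context_score(context, description):
    """Fraction of the description's words that also occur in the context."""
    desc = tokens(description)
    return len(desc & tokens(context)) / len(desc) if desc else 0.0


def rerank(hits, context):
    """Sort (id, label, description) tuples by descending context score."""
    # sorted() is stable, so ties keep their original search order.
    return sorted(hits, key=lambda h: -context_score(context, h[2]))


# Three of the candidates from the Michael Jordan search above.
hits = [
    ("Q41421", "Michael Jordan", "American basketball player and businessman"),
    ("Q3308285", "Michael I. Jordan",
     "American computer scientist, University of California, Berkeley"),
    ("Q6831719", "Michael Jordan", "British mycologist"),
]

context = "The computer scientist has made many contributions to machine learning"
best = rerank(hits, context)[0]
print(best[0])   # Q3308285
```

Plain word overlap fails when the context and the description share no vocabulary, which is why a real linker is likely to need something stronger.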

### You can also search for a concept, not just named entities.

In [49]:
wds.summary(wds.link("blood pressure", category="concept"))

('Q82642',
 'blood pressure',
 'pressure exerted by circulating blood upon the walls of blood vessels')

In [2]:
wds.summary(wds.link("monoclonal antibodies", target_types="CHEMICAL COMPOUND"))

('Q422248',
 'monoclonal antibody',
 'monospecific antibody that is made by identical immune cells that are all clones of a unique parent cell')
