## Wikidata entity linking using Wikimedia Search

**wd_search.py** finds entities given a string and optional sets of types. It returns a ranked list of objects, one for each hit, with basic information from Wikidata and optionally DBpedia in one or more languages. An example of a call from the command line:

  *python wd_search.py "UMBC" --types ORG --oktypes LOC FAC --badtypes 'sports team' --lang en zh --dbpedia --limit 20 --top 5*

Types can be any Wikidata type (e.g., Q5 for human) or a type name defined in entity_types.py. The search prefers hits with a type in **--types** but will accept ones with a type in **--oktypes**. If an entity has a type in **--badtypes**, it is rejected. The **--limit** parameter sets how many initial candidates are checked (up to 50), and **--top** sets how many good hits are returned.
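The preference order among the three type options can be illustrated with a small sketch. This is not the code in wd_search.py: the function name `rank_by_type` and the candidate format are hypothetical, and it assumes preferred hits are simply listed ahead of acceptable ones, which the real ranking may not do.

```python
def rank_by_type(candidates, types, oktypes=(), badtypes=(), top=5):
    """Hypothetical sketch of the --types / --oktypes / --badtypes logic.

    candidates: list of (entity_id, iterable_of_type_names), already in
    search-rank order. Returns up to `top` entity ids.
    """
    types, oktypes, badtypes = set(types), set(oktypes), set(badtypes)
    preferred, acceptable = [], []
    for entity_id, entity_types in candidates:
        entity_types = set(entity_types)
        if entity_types & badtypes:
            continue  # any bad type rejects the candidate outright
        if entity_types & types:
            preferred.append(entity_id)   # has a type we asked for
        elif entity_types & oktypes:
            acceptable.append(entity_id)  # tolerated, ranked after preferred
    return (preferred + acceptable)[:top]


candidates = [
    ("A", {"LOC"}),
    ("B", {"ORG", "sports team"}),
    ("C", {"ORG"}),
    ("D", {"PER"}),
]
print(rank_by_type(candidates, types=["ORG"], oktypes=["LOC", "FAC"],
                   badtypes=["sports team"]))
# ['C', 'A']
```

Candidate B is dropped because of its bad type, and D is dropped because it matches neither list.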

**wd_search()** is the basic function to call and **wd_scale_search()** is a version with defaults set for the 2021 scale project.

In [1]:
import wd_search as wds

### For this demo, we define a link() function that gets the top-ranked entity found by wd_scale_search() and returns it. wds.summary() then shows its ID, canonical name, short description, and URL.

In [4]:
def link(string, type='Q35120', dbpedia=0):
    # default type is 'entity'
    return wds.wd_scale_search(string, target_types=[type], dbpedia=dbpedia, top=1)[0]

### The search matches the string against Wikidata items' names, aliases, descriptions, and property values, and also takes into account various measures of prominence and other factors to rank the results.

In [5]:
wds.summary(link("Diana", 'PER'))

('Q9685',
 'Diana, Princess of Wales',
 'first wife of Charles, Prince of Wales (1961-1997)',
 'https://wikidata.org/wiki/Q9685')

In [4]:
wds.summary(link("HLTCOE", 'ORG'))

('Q64780099',
 'Human Language Technology Center of Excellence',
 'research center at Johns Hopkins University',
 'https://wikidata.org/wiki/Q64780099')

In [5]:
wds.summary(link("Patapsco River"))

('Q2748733',
 'Patapsco River',
 'river in Maryland, United States',
 'https://wikidata.org/wiki/Q2748733')
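For readers curious about what a full-text query against Wikidata looks like, here is a minimal sketch that only builds the request URL for the MediaWiki Action API's `list=search` module. Whether wd_search.py uses this exact endpoint or these parameters is an assumption, and `build_search_url` is a hypothetical helper, not part of the module.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def build_search_url(string, limit=20):
    """Build a full-text search URL (hypothetical helper, no network call)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": string,
        "srlimit": min(limit, 50),  # cap at 50, matching the --limit ceiling
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(build_search_url("Patapsco River", limit=20))
```

Fetching that URL returns JSON with a list of matching pages; the type filtering and ranking described in this notebook would happen after that step.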

### If we don't specify a type, any entity can match. A search for "Patapsco" returns a town in Maryland as the first hit.

In [16]:
wds.summary(link("Patapsco"))

('Q7144250',
 'Patapsco',
 'town in Carroll County, Maryland',
 'https://wikidata.org/wiki/Q7144250')

### The type is a strong signal about what we want, though. NORP is the OntoNotes type for "nationalities or religious or political groups".

In [7]:
wds.summary(link("Chinese","NORP"))

('Q42740',
 'Han Chinese people',
 'ethnic group',
 'https://wikidata.org/wiki/Q42740')

### We get a different match if we use the OntoNotes LANGUAGE type.

In [17]:
wds.summary(link("Chinese","LANGUAGE"))

('Q37041',
 'Classical Chinese',
 'language of the Sino-Tibetan language family (ISO 639-3: lzh)',
 'https://wikidata.org/wiki/Q37041')

### A simple, fast NLP pipeline like spaCy can type named entities with OntoNotes types. Here's Donbas, a region in Ukraine.

In [9]:
wds.summary(link("Donbas","GPE"))

('Q605714',
 'Donbas',
 'region in eastern Ukraine',
 'https://wikidata.org/wiki/Q605714')
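As a rough sketch of how typed mentions from an NER pipeline could be fed to a linker: `link_entities` and `fake_linker` below are hypothetical names used only for illustration, and in this notebook the linker would be the link() function defined earlier. `spacy_entities` uses spaCy's standard `spacy.load`, `doc.ents`, `ent.text`, and `ent.label_`, but it needs spaCy and the `en_core_web_sm` model installed, so it is defined here and not called.

```python
def link_entities(ents, linker):
    """Link each (text, label) pair with the supplied linker callable."""
    return {text: linker(text, label) for text, label in ents}


def spacy_entities(text, model="en_core_web_sm"):
    """Return (text, label) pairs from spaCy NER (requires spaCy + model)."""
    import spacy  # imported lazily so the rest of the sketch runs without it
    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def fake_linker(text, label):
    """Stand-in for link(); just echoes its arguments."""
    return f"{label}:{text}"


ents = [("Donbas", "GPE"), ("HLTCOE", "ORG")]
print(link_entities(ents, fake_linker))
# {'Donbas': 'GPE:Donbas', 'HLTCOE': 'ORG:HLTCOE'}
```

Note that spaCy labels people as PERSON, whereas the calls in this notebook use PER, so a small label mapping may be needed depending on which names entity_types.py accepts.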

### If we set dbpedia to a true value, here is the information we get from a search for a GPE using 'Ukraine'.

In [15]:
print(wds.hits_string(link("Ukraine","GPE", dbpedia=1)))

{
  "DBpedia_types":[
    "Place",
    "Location",
    "Country",
    "PopulatedPlace"
  ],
  "en":{
    "abstract":"Ukraine (Ukrainian: Україна, romanized: Ukrayina, pronounced [ʊkrɐˈjinɐ] (); Russian: Украи́на, tr. Ukraína, IPA: [ʊkrɐˈinə]) is a country in Eastern Europe. It is bordered by Russia to the north-east; Belarus to the north; Poland, Slovakia and Hungary to the west; and Romania, Moldova, and the Black Sea to the south. Ukraine is currently in a territorial dispute with Russia over the Crimean Peninsula, which Russia annexed in 2014. Including the Crimean Peninsula, Ukraine has an area of 603,628 km2 (233,062 sq mi), making it the largest country located entirely in Europe, and the 46th-largest country in the world. Excluding Crimea, Ukraine has a population of about 42 million, making it the eighth or ninth-most populous country in Europe, and the 32nd-most populous country in the world. Its capital and largest city is Kiev. Ukrainian is the official language and its alph

### This is a very simple approach that can't do much if there are many good candidates with the right type.

Try looking for a PER named **Michael Jordan**; Wikidata has 13 entries! Our search finds even more, because some items mention *Michael Jordan* in their description or some other property.

In [11]:
mjs = wds.wd_scale_search("Michael Jordan", target_types=['PER'], dbpedia=0, top=20, limit=40)

In [12]:
wds.summary(mjs)

[('Q41421',
  'Michael Jordan',
  'American basketball player and businessman',
  'https://wikidata.org/wiki/Q41421'),
 ('Q3308285',
  'Michael I. Jordan',
  'American computer scientist, University of California, Berkeley',
  'https://wikidata.org/wiki/Q3308285'),
 ('Q27069141',
  'Michael Jordan',
  'American football cornerback',
  'https://wikidata.org/wiki/Q27069141'),
 ('Q1928047',
  'Michael Jordan',
  'German draughtsperson, artist and comics artist',
  'https://wikidata.org/wiki/Q1928047'),
 ('Q6831715',
  'Michael Jordan',
  'Irish politician',
  'https://wikidata.org/wiki/Q6831715'),
 ('Q65029442',
  'Michael Jordan',
  'American football offensive lineman',
  'https://wikidata.org/wiki/Q65029442'),
 ('Q6831719',
  'Michael Jordan',
  'British mycoloigst',
  'https://wikidata.org/wiki/Q6831719'),
 ('Q6831716',
  'Michael Jordan',
  'English footballer (born 1984)',
  'https://wikidata.org/wiki/Q6831716'),
 ('Q95244944', 'Michael Jordan', '', 'https://wikidata.org/wiki/Q95244
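One quick way to see how many hits share the exact name, as opposed to matching through an alias, description, or other property, is to split the summaries by label. This is only an illustrative sketch: `split_exact` is a hypothetical helper, it assumes the `(id, name, description, url)` tuple layout shown in the output above, and it does not resolve the ambiguity among same-named people.

```python
def split_exact(summaries, mention):
    """Separate hits whose name equals the mention from all other hits."""
    exact, other = [], []
    for hit in summaries:
        name = hit[1]  # second field of the summary tuple is the label
        if name.casefold() == mention.casefold():
            exact.append(hit)
        else:
            other.append(hit)
    return exact, other


# A few rows copied from the output above.
sample = [
    ('Q41421', 'Michael Jordan',
     'American basketball player and businessman',
     'https://wikidata.org/wiki/Q41421'),
    ('Q3308285', 'Michael I. Jordan',
     'American computer scientist, University of California, Berkeley',
     'https://wikidata.org/wiki/Q3308285'),
    ('Q6831715', 'Michael Jordan', 'Irish politician',
     'https://wikidata.org/wiki/Q6831715'),
]
exact, other = split_exact(sample, "Michael Jordan")
print([h[0] for h in exact])  # ['Q41421', 'Q6831715']
print([h[0] for h in other])  # ['Q3308285']
```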

### You can also search for a concept, not just named entities.

In [6]:
wds.summary(link("letter bomb"))

('Q384080',
 'letter bomb',
 'Explosive device',
 'https://wikidata.org/wiki/Q384080')

In [12]:
wds.summary(link("assault rifle"))

('Q177456',
 'assault rifle',
 'type of selective fire rifle',
 'https://wikidata.org/wiki/Q177456')

fin