# Quick start

spaCyOpenTapioca is a spaCy pipeline component for named entity linking using OpenTapioca.

First, install spaCyOpenTapioca:

In [1]:
!pip install spacyopentapioca

Then import spaCy and add the `opentapioca` component to the pipeline:

In [2]:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe('opentapioca')
doc = nlp("Christian Drosten works in Charité, Germany.")
for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))

('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 1.7265554428236625)
('Charité', 'Q162684', 'ORG', 'university hospital in Berlin, Germany', 1.266987145014507)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 1.0571540026892925)
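
The score printed last in each tuple can be used to keep only the more confident links. The following is a minimal sketch that works on rows copied from the output above, so it runs without spaCy or network access. The helper name `filter_by_score` and the threshold `1.2` are hypothetical choices for illustration, not part of spaCyOpenTapioca.

```python
# Rows copied from the output above: (text, QID, label, description, score).
rows = [
    ("Christian Drosten", "Q1079331", "PERSON",
     "German virologist and university teacher", 1.7265554428236625),
    ("Charité", "Q162684", "ORG",
     "university hospital in Berlin, Germany", 1.266987145014507),
    ("Germany", "Q183", "LOC",
     "sovereign state in Central Europe", 1.0571540026892925),
]


def filter_by_score(rows, min_score):
    """Keep rows whose score (the last field) is at least min_score."""
    return [row for row in rows if row[-1] >= min_score]


# The threshold is an arbitrary example value, not a recommended setting.
confident = filter_by_score(rows, min_score=1.2)
for text, qid, label, _description, score in confident:
    print(f"{text} ({label}) -> https://www.wikidata.org/entity/{qid} [{score:.2f}]")
```

With a real `doc`, the same filter could be written over `doc.ents` using `span._.score`.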


Let's check the raw types and aliases:

In [3]:
for span in doc.ents:
    print((span._.types, span._.aliases))

({'Q43229': False, 'Q618123': False, 'Q5': True, 'P2427': False, 'P1566': False, 'P496': True}, ['كريستيان دروستين', 'Крістіан Дростен', 'Christian Heinrich Maria Drosten', 'کریستین دروستن', '크리스티안 드로스텐', '德羅斯登', 'クリスチャン・ドロステン', 'Дростен, Кристиан', 'Кристиан Хайнрих Мария Дростен', 'Кристиан Дростен'])
({'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': True, 'P1566': False, 'P496': False}, ['Шаритэ', 'Charite', 'Hôpital de la Charité', 'Charité - Universitätsmedizin Berlin', 'hôpital universitaire de la Charité de Berlin', 'ospedale della Charité', 'Charite (Berlijn)', 'Charitésjukhuset', 'شاريتيه', 'シャリテ', 'シャリテ＝大学医療・ベルリン', 'Шаріте', 'ospedale universitario della Charité', 'Universitätsmedizin Berlin', 'Charité-ziekenhuis', 'שאריטה', 'ชารีเท', 'Šaritē - Berlīnes Universitātes slimnīca', 'Hôpital de la Charité de Berlin', '샤리테', 'Charité Universitätsmedizin Berlin', 'Charité – University Medicine Berlin', 'Շարիտե', 'Шарите', 'シャリティ', 'シャリテー', 'Caritatis nosocomium', '夏綠蒂教學醫院', '

The Wikidata QIDs are attached to tokens:

In [4]:
for token in doc:
    print((token.text, token.ent_kb_id_))

('Christian', 'Q1079331')
('Drosten', 'Q1079331')
('works', '')
('in', '')
('Charité', 'Q162684')
(',', '')
('Germany', 'Q183')
('.', '')
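
When only token-level output like the above is available, consecutive tokens that share a QID can be merged back into mentions. This is a rough sketch on the pairs printed above; the helper `group_mentions` is hypothetical, it joins tokens with a single space, and it would merge two adjacent mentions of the same entity into one. With a real `doc`, `doc.ents` already provides the spans directly.

```python
from itertools import groupby

# (token text, QID) pairs copied from the output above.
pairs = [
    ("Christian", "Q1079331"),
    ("Drosten", "Q1079331"),
    ("works", ""),
    ("in", ""),
    ("Charité", "Q162684"),
    (",", ""),
    ("Germany", "Q183"),
    (".", ""),
]


def group_mentions(pairs):
    """Merge runs of consecutive tokens with the same non-empty QID."""
    mentions = []
    for qid, run in groupby(pairs, key=lambda pair: pair[1]):
        if qid:  # skip tokens that are not linked to any entity
            mentions.append((" ".join(text for text, _ in run), qid))
    return mentions


mentions = group_mentions(pairs)
print(mentions)
```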


The raw annotations can be found in `doc._.annotations`. Let's check the first one:

In [5]:
doc._.annotations[0]

{'start': 0,
 'end': 17,
 'tags': [{'id': 'Q1079331',
   'label': ['Christian Drosten'],
   'aliases': ['كريستيان دروستين',
    'Крістіан Дростен',
    'Christian Heinrich Maria Drosten',
    'کریستین دروستن',
    '크리스티안 드로스텐',
    '德羅斯登',
    'クリスチャン・ドロステン',
    'Дростен, Кристиан',
    'Кристиан Хайнрих Мария Дростен',
    'Кристиан Дростен'],
   'extra_aliases': ['0000-0001-7923-0519'],
   'desc': 'German virologist and university teacher',
   'nb_statements': 60,
   'nb_sitelinks': 20,
   'edges': [6581097,
    10905380,
    1120501,
    1305740,
    1205214,
    1305740,
    10905334,
    1546865,
    15634281,
    1622272,
    183,
    152171,
    617048,
    2496385,
    162684,
    18001597,
    2018484,
    25413386,
    21441764,
    4185,
    34704877,
    64,
    586,
    188,
    1860,
    54439832,
    87748614,
    1713320,
    50662,
    80011696,
    83347119,
    5370768,
    88072607,
    188,
    188,
    188,
    188,
    100492007,
    913404,
    100492007,
    5

Partial metadata for the response returned by the OpenTapioca API is stored in `doc._.metadata`:

In [6]:
doc._.metadata

{'status_code': 200, 'reason': 'OK', 'ok': True, 'encoding': 'utf-8'}
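
Because the entities come from a remote API call, it can be useful to check this metadata before trusting the results. The sketch below assumes the dictionary always has the keys shown above; the helper `ensure_ok` is hypothetical and only illustrates the check.

```python
# Metadata copied from the output above.
metadata = {"status_code": 200, "reason": "OK", "ok": True, "encoding": "utf-8"}


def ensure_ok(metadata):
    """Return metadata unchanged, or raise if the request did not succeed."""
    if not metadata.get("ok"):
        raise RuntimeError(
            "OpenTapioca request failed: "
            f"{metadata.get('status_code')} {metadata.get('reason')}"
        )
    return metadata


checked = ensure_ok(metadata)
print(checked["status_code"], checked["reason"])
```

With a real `doc`, the call would be `ensure_ok(doc._.metadata)` before reading `doc.ents`.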

Spans also have `span._.annotations` and `span._.aliases` extensions, which usually hold a lot of data. Let's print the remaining span extensions for `span`, which still refers to the last entity from the loop above (`Germany`):

In [7]:
print(span._.description)
print(span._.rank)
print(span._.score)
print(span._.types)
print(span._.label)
print(span._.extra_aliases)
print(span._.nb_sitelinks)
print(span._.nb_statements)

sovereign state in Central Europe
16.722767013369445
1.0571540026892925
{'Q43229': True, 'Q618123': True, 'Q5': False, 'P2427': False, 'P1566': True, 'P496': False}
['Germany']
None
367
1028
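
The `types` dictionaries appear to map Wikidata item and property IDs to booleans that say whether the entity matches each one; this reading is an assumption based on the outputs in this notebook (for example, `Q5` is Wikidata's item for "human" and is `True` only for the person). A small sketch for listing the IDs that are set, using dictionaries copied from the outputs; the helper `active_types` is hypothetical.

```python
# Types dictionaries copied from the outputs shown in this notebook.
types_by_entity = {
    "Christian Drosten": {"Q43229": False, "Q618123": False, "Q5": True,
                          "P2427": False, "P1566": False, "P496": True},
    "Charité": {"Q43229": True, "Q618123": True, "Q5": False,
                "P2427": True, "P1566": False, "P496": False},
    "Germany": {"Q43229": True, "Q618123": True, "Q5": False,
                "P2427": False, "P1566": True, "P496": False},
}


def active_types(types):
    """Return the IDs whose flag is True, in their original order."""
    return [type_id for type_id, flag in types.items() if flag]


active = {name: active_types(types) for name, types in types_by_entity.items()}
for name, ids in active.items():
    print(name, ids)
```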


Now we can visualize the results:

In [8]:
params = {"text": doc.text,
          "ents": [{"start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "kb_id": ent.kb_id_,
                    "kb_url": "https://www.wikidata.org/entity/" + ent.kb_id_}
                   for ent in doc.ents],
          "title": None}
spacy.displacy.render(params, style="ent", manual=True)
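
If the visualization is needed in more than one place, the dictionary construction can be wrapped in a function. The sketch below is one possible way to do that; `build_displacy_params` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the four span attributes used in the cell above, so the example runs without spaCy.

```python
from types import SimpleNamespace

WIKIDATA_ENTITY_URL = "https://www.wikidata.org/entity/"


def build_displacy_params(text, ents, title=None):
    """Build the manual-mode input dict from objects with span-like attributes."""
    return {
        "text": text,
        "ents": [
            {
                "start": ent.start_char,
                "end": ent.end_char,
                "label": ent.label_,
                "kb_id": ent.kb_id_,
                "kb_url": WIKIDATA_ENTITY_URL + ent.kb_id_,
            }
            for ent in ents
        ],
        "title": title,
    }


# Stand-in entities with the character offsets of the example sentence.
text = "Christian Drosten works in Charité, Germany."
fake_ents = [
    SimpleNamespace(start_char=0, end_char=17, label_="PERSON", kb_id_="Q1079331"),
    SimpleNamespace(start_char=27, end_char=34, label_="ORG", kb_id_="Q162684"),
    SimpleNamespace(start_char=36, end_char=43, label_="LOC", kb_id_="Q183"),
]

params = build_displacy_params(text, fake_ents)
print(params["ents"][0])
```

With a real `doc`, the call would be `build_displacy_params(doc.text, doc.ents)`, and the result can be passed to `spacy.displacy.render` with `style="ent"` and `manual=True` as in the cell above.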